How to convert Python data frame to Sqlite3 database? It is quite easy in the Jupyter Notebook with sql extension. With this powerful extension, we can excecute SQL query in Jupyter Notebook, which is quite convenient.

In this article, I would share my experience in creating a Sqlite3 database to store Spotify songs data in the 3rd Normal Form using Python. It is quite easy, everyone who has some basic Python skills can follow my steps and try to do it. You can find my Github repository for this blog here. Data source: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21.

Step 0: load data

At the beginning, I use pandas.read_csv to load the data.

import numpy as np
import pandas as pd
import sqlite3
import warnings
warnings.simplefilter('ignore', FutureWarning)
%load_ext sql
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Step 1: check duplication

## check duplicates
df.duplicated().any()  ## False

There seem to be no duplicate rows in the data frame. Even though the number of unique track_id is obviously less than the number of rows in the data frame. Since they are not exactly the same rows, we should keep them (I found that the only difference between rows with the same track_id is on the playlist, so I think they contains information).

Step 2: Convert the data frame into 1st normal form

The 1st normal form requires that each element contains only a single entry, in other words, we should split composite entries. After checking with the data, it does not seem to have any compound entries, so the data frame already satisfies the 1st normal form.

Step 3: Convert the data frame into 2nd normal form

To achieve the 2nd normal form, we first need to identify candidate Primary Key. In the codebook, track_id represents the unique ID for each song, so it could be considered as a potential Primary Key. We then need to identify partial dependencies in the original big data frame and break it into several small data frame based on dependencies.

df.columns

## playlist
playlist = df[df.duplicated(subset=['playlist_id'])]
playlist.sort_values(by = ['playlist_id'], axis=0)[['playlist_id','playlist_name','playlist_genre','playlist_subgenre']]
df_playlist = df[['playlist_id', 'playlist_name', 'playlist_genre', 'playlist_subgenre']].drop_duplicates()

## album
album = df[df.duplicated(subset=['track_album_id'])]
album.sort_values(by = ['track_album_id'], axis=0)[['track_album_id', 'track_album_name', 'track_album_release_date']]
df_album = df[['track_album_id','track_album_name', 'track_album_release_date']].drop_duplicates()

## song
song = df[df.duplicated(subset=['track_id'])]
song.sort_values(by = ['track_id'], axis=0)[['track_id','track_name', 'track_artist','track_popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                                             'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo','duration_ms']]
df_song = df[['track_id','track_name','track_artist','track_popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']].drop_duplicate()

df_song_playlist = df[['track_id', 'playlist_id']].drop_duplicates()
df_song_album = df[['track_id', 'track_album_id']].drop_duplicates()

Step 4: Convert the data frame into 3rd normal form

To achieve the 3rd normal form, we should remove transitive dependencies, where a column totally depends on a column that is not Primary Key. In this case, since the subgenre of playlist determines the genre of playlist, we can create a new data frame to connect playlist_subgenre and playlist_genre.

df_playlist_ = df_playlist[['playlist_id', 'playlist_name', 'playlist_subgenre']]
df_playlist_genre = df_playlist[['playlist_subgenre', 'playlist_genre']].drop_duplicates()

Finally, we obtain the final tables df_song, df_song_album, df_song_playlist, df_playlist_, df_playlist_genre and df_album.

Step 5: Store it as a Sqlite3 database

# Create a database to connect to in memory
cr = sqlite3.connect('spotify.db')

# Store dataframes in the database
df_song.to_sql(name='df_song', con=cr, index = False)
df_song_album.to_sql(name='df_song_album', con=cr, index = False)
df_song_playlist.to_sql(name='df_song_playlist', con=cr, index = False)
df_playlist_.to_sql(name='df_playlist', con=cr, index = False)
df_playlist_genre.to_sql(name='df_playlist_genre', con=cr, index = False)
df_album.to_sql(name='df_album', con=cr, index = False)

We store all the final tables in spotify.db.

diagram

Connect to our Sqlite3 database

%load_ext sql ## load extension
%sql sqlite:///spotify.db ## connect to the database

%%sql
SELECT * FROM sqlite_master WHERE type='table';  ## get all tables in this database
type name tbl_name rootpage sql
table df_song df_song 2 CREATE TABLE "df_song" ( "track_id" TEXT, "track_name" TEXT, "track_artist" TEXT, "track_popularity" INTEGER, "danceability" REAL, "energy" REAL, "key" INTEGER, "loudness" REAL, "mode" INTEGER, "speechiness" REAL, "acousticness" REAL, "instrumentalness" REAL, "liveness" REAL, "valence" REAL, "tempo" REAL, "duration_ms" INTEGER )
table df_song_album df_song_album 1041 CREATE TABLE "df_song_album" ( "track_id" TEXT, "track_album_id" TEXT )
table df_song_playlist df_song_playlist 1408 CREATE TABLE "df_song_playlist" ( "track_id" TEXT, "playlist_id" TEXT )
table df_playlist df_playlist 1826 CREATE TABLE "df_playlist" ( "playlist_id" TEXT, "playlist_name" TEXT, "playlist_subgenre" TEXT )
table df_playlist_genre df_playlist_genre 1835 CREATE TABLE "df_playlist_genre" ( "playlist_subgenre" TEXT, "playlist_genre" TEXT )
table df_album df_album 1836 CREATE TABLE "df_album" ( "track_album_id" TEXT, "track_album_name" TEXT, "track_album_release_date" TEXT )

Find the names of all playlists that contain instrumentals.

From the codebook, instrumentalness represents whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

%%sql 
SELECT COUNT(instrumentalness) AS num_instrumentals, playlist_name
FROM df_song
INNER JOIN df_song_playlist
ON df_song_playlist.track_id = df_song.track_id
INNER JOIN df_playlist
ON df_playlist.playlist_id = df_song_playlist.playlist_id
WHERE instrumentalness > 0.5 
GROUP BY playlist_name
ORDER BY num_instrumentals DESC
LIMIT 10
num_instrumentals playlist_name
92 Fitness Workout Electro | House | Dance | Progressive House
91 Lush Lofi
88 Underground Party | Hypnotic | Minimal | Acid | Big Room | Tech | Liquid
81 Jazz Vibes
81 House Electro 2019
78 Lo-Fi Beats
75 Lofi Hip-Hop
60 Selected House
58 💊ELECTRO-HOUSE-TECH💊
54 Sunny Beats
%%sql 
SELECT COUNT(*)
FROM
(SELECT SUM(instrumentalness) AS num_instrumentals, playlist_name
FROM df_song
INNER JOIN df_song_playlist
ON df_song_playlist.track_id = df_song.track_id
INNER JOIN df_playlist
ON df_playlist.playlist_id = df_song_playlist.playlist_id
WHERE instrumentalness > 0
GROUP BY playlist_name
HAVING num_instrumentals > 0.5 
ORDER BY num_instrumentals DESC)
COUNT(*)
256

We can see that there are 256 playlists that contains at least one song that contains instrumentals. Fitness Workout Electro | House | Dance | Progressive House contains most songs with instrumentals.