Create a Sqlite3 database for Spotify songs data

How to convert Python data frame to Sqlite3 database? It is quite easy in the Jupyter Notebook with sql extension. With this powerful extension, we can excecute SQL query in Jupyter Notebook, which is quite convenient.

In this article, I would share my experience in creating a Sqlite3 database to store Spotify songs data in the 3rd Normal Form using Python. It is quite easy, everyone who has some basic Python skills can follow my steps and try to do it. You can find my Github repository for this blog here. Data source: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21.

Step 0: load data

At the beginning, I use pandas.read_csv to load the data.

import numpy as np
import pandas as pd
import sqlite3
import warnings
warnings.simplefilter('ignore', FutureWarning)
%load_ext sql
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Step 1: check duplication

## check duplicates
df.duplicated().any()  ## False

There seem to be no duplicate rows in the data frame. Even though the number of unique track_id is obviously less than the number of rows in the data frame. Since they are not exactly the same rows, we should keep them (I found that the only difference between rows with the same track_id is on the playlist, so I think they contains information).

Step 2: Convert the data frame into 1st normal form

The 1st normal form requires that each element contains only a single entry, in other words, we should split composite entries. After checking with the data, it does not seem to have any compound entries, so the data frame already satisfies the 1st normal form.

Step 3: Convert the data frame into 2nd normal form

To achieve the 2nd normal form, we first need to identify candidate Primary Key. In the codebook, track_id represents the unique ID for each song, so it could be considered as a potential Primary Key. We then need to identify partial dependencies in the original big data frame and break it into several small data frame based on dependencies.

df.columns

## playlist
playlist = df[df.duplicated(subset=['playlist_id'])]
playlist.sort_values(by = ['playlist_id'], axis=0)[['playlist_id','playlist_name','playlist_genre','playlist_subgenre']]
df_playlist = df[['playlist_id', 'playlist_name', 'playlist_genre', 'playlist_subgenre']].drop_duplicates()

## album
album = df[df.duplicated(subset=['track_album_id'])]
album.sort_values(by = ['track_album_id'], axis=0)[['track_album_id', 'track_album_name', 'track_album_release_date']]
df_album = df[['track_album_id','track_album_name', 'track_album_release_date']].drop_duplicates()

## song
song = df[df.duplicated(subset=['track_id'])]
song.sort_values(by = ['track_id'], axis=0)[['track_id','track_name', 'track_artist','track_popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                                             'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo','duration_ms']]
df_song = df[['track_id','track_name','track_artist','track_popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']].drop_duplicate()

df_song_playlist = df[['track_id', 'playlist_id']].drop_duplicates()
df_song_album = df[['track_id', 'track_album_id']].drop_duplicates()

Step 4: Convert the data frame into 3rd normal form

To achieve the 3rd normal form, we should remove transitive dependencies, where a column totally depends on a column that is not Primary Key. In this case, since the subgenre of playlist determines the genre of playlist, we can create a new data frame to connect playlist_subgenre and playlist_genre.

df_playlist_ = df_playlist[['playlist_id', 'playlist_name', 'playlist_subgenre']]
df_playlist_genre = df_playlist[['playlist_subgenre', 'playlist_genre']].drop_duplicates()

Finally, we obtain the final tables df_song, df_song_album, df_song_playlist, df_playlist_, df_playlist_genre and df_album.

Step 5: Store it as a Sqlite3 database

# Create a database to connect to in memory
cr = sqlite3.connect('spotify.db')

# Store dataframes in the database
df_song.to_sql(name='df_song', con=cr, index = False)
df_song_album.to_sql(name='df_song_album', con=cr, index = False)
df_song_playlist.to_sql(name='df_song_playlist', con=cr, index = False)
df_playlist_.to_sql(name='df_playlist', con=cr, index = False)
df_playlist_genre.to_sql(name='df_playlist_genre', con=cr, index = False)
df_album.to_sql(name='df_album', con=cr, index = False)

We store all the final tables in spotify.db.

Connect to our Sqlite3 database

%load_ext sql ## load extension
%sql sqlite:///spotify.db ## connect to the database

%%sql
SELECT * FROM sqlite_master WHERE type='table';  ## get all tables in this database

type	name	tbl_name	rootpage	sql
table	df_song	df_song	2	CREATE TABLE "df_song" ( "track_id" TEXT, "track_name" TEXT, "track_artist" TEXT, "track_popularity" INTEGER, "danceability" REAL, "energy" REAL, "key" INTEGER, "loudness" REAL, "mode" INTEGER, "speechiness" REAL, "acousticness" REAL, "instrumentalness" REAL, "liveness" REAL, "valence" REAL, "tempo" REAL, "duration_ms" INTEGER )
table	df_song_album	df_song_album	1041	CREATE TABLE "df_song_album" ( "track_id" TEXT, "track_album_id" TEXT )
table	df_song_playlist	df_song_playlist	1408	CREATE TABLE "df_song_playlist" ( "track_id" TEXT, "playlist_id" TEXT )
table	df_playlist	df_playlist	1826	CREATE TABLE "df_playlist" ( "playlist_id" TEXT, "playlist_name" TEXT, "playlist_subgenre" TEXT )
table	df_playlist_genre	df_playlist_genre	1835	CREATE TABLE "df_playlist_genre" ( "playlist_subgenre" TEXT, "playlist_genre" TEXT )
table	df_album	df_album	1836	CREATE TABLE "df_album" ( "track_album_id" TEXT, "track_album_name" TEXT, "track_album_release_date" TEXT )

Find the names of all playlists that contain instrumentals.

From the codebook, instrumentalness represents whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

%%sql 
SELECT COUNT(instrumentalness) AS num_instrumentals, playlist_name
FROM df_song
INNER JOIN df_song_playlist
ON df_song_playlist.track_id = df_song.track_id
INNER JOIN df_playlist
ON df_playlist.playlist_id = df_song_playlist.playlist_id
WHERE instrumentalness > 0.5 
GROUP BY playlist_name
ORDER BY num_instrumentals DESC
LIMIT 10

num_instrumentals	playlist_name
92	Fitness Workout Electro \| House \| Dance \| Progressive House
91	Lush Lofi
88	Underground Party \| Hypnotic \| Minimal \| Acid \| Big Room \| Tech \| Liquid
81	Jazz Vibes
81	House Electro 2019
78	Lo-Fi Beats
75	Lofi Hip-Hop
60	Selected House
58	💊ELECTRO-HOUSE-TECH💊
54	Sunny Beats

%%sql 
SELECT COUNT(*)
FROM
(SELECT SUM(instrumentalness) AS num_instrumentals, playlist_name
FROM df_song
INNER JOIN df_song_playlist
ON df_song_playlist.track_id = df_song.track_id
INNER JOIN df_playlist
ON df_playlist.playlist_id = df_song_playlist.playlist_id
WHERE instrumentalness > 0
GROUP BY playlist_name
HAVING num_instrumentals > 0.5 
ORDER BY num_instrumentals DESC)

COUNT(*)
256

We can see that there are 256 playlists that contains at least one song that contains instrumentals. Fitness Workout Electro | House | Dance | Progressive House contains most songs with instrumentals.