Create a Sqlite3 database for Spotify songs data
How to convert Python data frame to Sqlite3 database? It is quite easy in the Jupyter Notebook with sql
extension. With this powerful extension, we can excecute SQL query in Jupyter Notebook, which is quite convenient.
In this article, I would share my experience in creating a Sqlite3 database to store Spotify songs data in the 3rd Normal Form using Python. It is quite easy, everyone who has some basic Python skills can follow my steps and try to do it. You can find my Github repository for this blog here. Data source: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21.
Step 0: load data
At the beginning, I use pandas.read_csv
to load the data.
import numpy as np
import pandas as pd
import sqlite3
import warnings
warnings.simplefilter('ignore', FutureWarning)
%load_ext sql
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
Step 1: check duplication
## check duplicates
df.duplicated().any() ## False
There seem to be no duplicate rows in the data frame. Even though the number of unique track_id
is obviously less than the number of rows in the data frame. Since they are not exactly the same rows, we should keep them (I found that the only difference between rows with the same track_id
is on the playlist, so I think they contains information).
Step 2: Convert the data frame into 1st normal form
The 1st normal form requires that each element contains only a single entry, in other words, we should split composite entries. After checking with the data, it does not seem to have any compound entries, so the data frame already satisfies the 1st normal form.
Step 3: Convert the data frame into 2nd normal form
To achieve the 2nd normal form, we first need to identify candidate Primary Key. In the codebook, track_id
represents the unique ID for each song, so it could be considered as a potential Primary Key. We then need to identify partial dependencies in the original big data frame and break it into several small data frame based on dependencies.
df.columns
## playlist
playlist = df[df.duplicated(subset=['playlist_id'])]
playlist.sort_values(by = ['playlist_id'], axis=0)[['playlist_id','playlist_name','playlist_genre','playlist_subgenre']]
df_playlist = df[['playlist_id', 'playlist_name', 'playlist_genre', 'playlist_subgenre']].drop_duplicates()
## album
album = df[df.duplicated(subset=['track_album_id'])]
album.sort_values(by = ['track_album_id'], axis=0)[['track_album_id', 'track_album_name', 'track_album_release_date']]
df_album = df[['track_album_id','track_album_name', 'track_album_release_date']].drop_duplicates()
## song
song = df[df.duplicated(subset=['track_id'])]
song.sort_values(by = ['track_id'], axis=0)[['track_id','track_name', 'track_artist','track_popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo','duration_ms']]
df_song = df[['track_id','track_name','track_artist','track_popularity', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']].drop_duplicate()
df_song_playlist = df[['track_id', 'playlist_id']].drop_duplicates()
df_song_album = df[['track_id', 'track_album_id']].drop_duplicates()
Step 4: Convert the data frame into 3rd normal form
To achieve the 3rd normal form, we should remove transitive dependencies, where a column totally depends on a column that is not Primary Key. In this case, since the subgenre of playlist determines the genre of playlist, we can create a new data frame to connect playlist_subgenre
and playlist_genre
.
df_playlist_ = df_playlist[['playlist_id', 'playlist_name', 'playlist_subgenre']]
df_playlist_genre = df_playlist[['playlist_subgenre', 'playlist_genre']].drop_duplicates()
Finally, we obtain the final tables df_song
, df_song_album
, df_song_playlist
, df_playlist_
, df_playlist_genre
and df_album
.
Step 5: Store it as a Sqlite3 database
# Create a database to connect to in memory
cr = sqlite3.connect('spotify.db')
# Store dataframes in the database
df_song.to_sql(name='df_song', con=cr, index = False)
df_song_album.to_sql(name='df_song_album', con=cr, index = False)
df_song_playlist.to_sql(name='df_song_playlist', con=cr, index = False)
df_playlist_.to_sql(name='df_playlist', con=cr, index = False)
df_playlist_genre.to_sql(name='df_playlist_genre', con=cr, index = False)
df_album.to_sql(name='df_album', con=cr, index = False)
We store all the final tables in spotify.db
.
Connect to our Sqlite3 database
%load_ext sql ## load extension
%sql sqlite:///spotify.db ## connect to the database
%%sql
SELECT * FROM sqlite_master WHERE type='table'; ## get all tables in this database
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | df_song | df_song | 2 | CREATE TABLE "df_song" ( "track_id" TEXT, "track_name" TEXT, "track_artist" TEXT, "track_popularity" INTEGER, "danceability" REAL, "energy" REAL, "key" INTEGER, "loudness" REAL, "mode" INTEGER, "speechiness" REAL, "acousticness" REAL, "instrumentalness" REAL, "liveness" REAL, "valence" REAL, "tempo" REAL, "duration_ms" INTEGER ) |
table | df_song_album | df_song_album | 1041 | CREATE TABLE "df_song_album" ( "track_id" TEXT, "track_album_id" TEXT ) |
table | df_song_playlist | df_song_playlist | 1408 | CREATE TABLE "df_song_playlist" ( "track_id" TEXT, "playlist_id" TEXT ) |
table | df_playlist | df_playlist | 1826 | CREATE TABLE "df_playlist" ( "playlist_id" TEXT, "playlist_name" TEXT, "playlist_subgenre" TEXT ) |
table | df_playlist_genre | df_playlist_genre | 1835 | CREATE TABLE "df_playlist_genre" ( "playlist_subgenre" TEXT, "playlist_genre" TEXT ) |
table | df_album | df_album | 1836 | CREATE TABLE "df_album" ( "track_album_id" TEXT, "track_album_name" TEXT, "track_album_release_date" TEXT ) |
Find the names of all playlists that contain instrumentals.
From the codebook, instrumentalness
represents whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
%%sql
SELECT COUNT(instrumentalness) AS num_instrumentals, playlist_name
FROM df_song
INNER JOIN df_song_playlist
ON df_song_playlist.track_id = df_song.track_id
INNER JOIN df_playlist
ON df_playlist.playlist_id = df_song_playlist.playlist_id
WHERE instrumentalness > 0.5
GROUP BY playlist_name
ORDER BY num_instrumentals DESC
LIMIT 10
num_instrumentals | playlist_name |
---|---|
92 | Fitness Workout Electro | House | Dance | Progressive House |
91 | Lush Lofi |
88 | Underground Party | Hypnotic | Minimal | Acid | Big Room | Tech | Liquid |
81 | Jazz Vibes |
81 | House Electro 2019 |
78 | Lo-Fi Beats |
75 | Lofi Hip-Hop |
60 | Selected House |
58 | 💊ELECTRO-HOUSE-TECH💊 |
54 | Sunny Beats |
%%sql
SELECT COUNT(*)
FROM
(SELECT SUM(instrumentalness) AS num_instrumentals, playlist_name
FROM df_song
INNER JOIN df_song_playlist
ON df_song_playlist.track_id = df_song.track_id
INNER JOIN df_playlist
ON df_playlist.playlist_id = df_song_playlist.playlist_id
WHERE instrumentalness > 0
GROUP BY playlist_name
HAVING num_instrumentals > 0.5
ORDER BY num_instrumentals DESC)
COUNT(*) |
---|
256 |
We can see that there are 256 playlists that contains at least one song that contains instrumentals. Fitness Workout Electro | House | Dance | Progressive House
contains most songs with instrumentals.