
Business Analytics Major Project 2020/2021

Module Code: MS5115


Group Number: 16

Exploring Spotify® Data Trends & Modelling Track Popularity


with Machine Learning Algorithms

July 2021

Caitlin Hogan
Neeharika Divi
Sabbarish Khanna Subramanian
Niall Muldoon

Under the Guidance of:


Dr. Anatoli Nachev
and
Dr. Michael Lang

Executive Summary

The music industry is a global juggernaut undergoing radical change. With this comes great
risk for many involved, from aspiring new artists to established labels seeking to produce, release,
and promote their music. Driven by huge changes in listening habits and a rapid transition to on-demand, recommendation- and playlist-driven streaming services, the analysis of collected data and the use of algorithms are already established aspects of the business. This project explores a sample dataset from
Spotify, an online music platform with 365 million monthly listeners as of March 2021 (Spotify, 2021). After examining and exploring the available data, it focuses on the key popularity metric and
explores its relationships to other features, and how it might be predicted using a machine learning
model. This kind of analysis will be a valuable process for the music industry moving forward and
help manage risk and better understand very complex behaviour. Studying the industry by means of
data on songs, genres, artists, user listening habits, etc., can lead to insights, which could help create
predictive models or recommender systems, or better explain behaviour and changes.

With the use of an available Spotify dataset, a number of tools and languages are applied, and various predictive models are created: multiple linear regression, extreme gradient boosting, and histogram-based gradient boosting using Python/scikit-learn; generalised linear models using regression and classification in R; and linear regression, regression trees, neural networks, and bagging, boosting, and random tree ensemble methods in Excel. All of these demonstrate various ways to attempt to predict popularity based on other track attributes.

This report also describes the CRISP project methodology, analytical methodology, tools,
and techniques used, data preparation, and documentation and visualisation of the data and
findings. A brief overview of a MxTape playlist recommender application is included, demonstrating
the use of available Spotify APIs to access current data.

Table of Contents

1. Motivation ............................................................................................................................... 1
2. Project Objective Development ................................................................................................. 1
3. Target Objectives ..................................................................................................................... 2
4. Background.............................................................................................................................. 3
5. Methodology ........................................................................................................................... 4
5.1 Project Methodology .......................................................................................................... 4
5.2 Analytical Methodology ...................................................................................................... 6
5.2.1 Description of Data ................................................................................................................. 6
5.2.2. Data Cleansing..................................................................................................................... 18
5.2.2.1. Data Cleansing in Python ............................................................................................. 18
5.2.2.2. Data Cleansing in Excel ................................................................................................ 22
5.2.3. Data Modelling: Predicting popularity from Track Attributes .............................................. 23
5.2.3.1. Python Machine Learning Models ............................................................................... 23
5.2.3.2. Excel Machine Learning Models .................................................................................. 38
6. Findings & Analysis ................................................................................................................ 45
7. Discussion .............................................................................................................................. 48
8. Details of MxTape Application ................................................................................................ 49
9. Conclusion.............................................................................................................................. 50
10.1. Appendix A - Python Visualisations ........................................................................................ 52
10.2. Appendix B - R Visualisations.................................................................................................. 59
10.3. Appendix C - MxTape Playlist Recommendation Application................................................ 63
10.4. Appendix D - Gradient Boosting to show Feature Trends over the Years ............................. 66
10.5. Appendix E - Linear Regression to Determine popularity based on Artist’s # of Followers.. 70
10.6. Appendix F - Spotify Meetup................................................................................................... 72
10.7. Appendix G - Tableau Dashboard Visualisations ................................................................... 74
10.8. Appendix H – Minutes of Meetings ........................................................................................ 84
11. References ........................................................................................................................... 93

1. Motivation
The music industry is a multi-billion dollar global industry, which is, like many other media,
undergoing accelerated change due to social and technological shifts. It is in the midst of a difficult period, owing to issues such as the COVID-19 pandemic (which has shut down live gigging), illegal pirated downloads and the ease of copying and sharing digital files, the small royalties/fees paid to artists/record labels by streaming environments, and a transition to algorithmic models that can be opaque, biased, or open to the influence of money. Harris (2020) states that the huge streaming company Spotify is estimated to pay approximately £0.0028 (or 0.28p) per stream to the owners of a song, while the rate for YouTube streams is a tiny £0.0012.
Throughout the media industry, there is a large amount of risk and uncertainty, for example in the
case of releasing new material from various artists and producers. This risk is enhanced further for
up-and-coming artists without established fanbases. This can lead to large financial losses if the
record labels invest poorly, or the loss of talented artists that cannot generate enough income to
pursue a career. Exploring the copious data being produced by streaming services is a vital avenue of
research to better understand these problems, while also offering interesting insights into questions
such as what attributes of a song are most linked to its popularity: what makes a hit, a hit?

2. Project Objective Development


Our project objectives have shifted somewhat during the course of our research. Initially, we
intended to focus on creating a music recommendation system using publicly available Spotify Web
APIs (Application Programming Interfaces), to generate user-specific playlists based on a user's listening preferences and recent listening history - similar to the 'Discover Weekly' playlist algorithm
used by Spotify. By creating an application written in Python and using the Flask micro web
framework, we could retrieve useful data and potentially gain insights relating to track popularity,
genre classification, and why consumers tend to follow and listen to particular artists and songs. We
called this application MxTape - inspired by the recent-but-fast-receding past when music fans would create and share their very own physical mixtapes of favourite songs. Mixtapes are mostly a thing of the past, and music lovers now instead use streaming applications such as Spotify, Pandora, SoundCloud, Deezer, Apple Music, and Bandcamp. The same idea of generating playlists has moved to an
online algorithmic format, and the fascination with creating sequences of songs and searching for
new and novel combinations continues unabated.

As we worked on the MxTape application, our focus shifted to the large amount of music data available and how data modelling techniques and analysis might be applied using
different tools. Thus, the body of this thesis will be centred on investigating patterns in the data, and
then focus on a song’s ‘popularity’ Spotify score and its most closely related attributes. Various
predictive models based on machine learning algorithms are implemented to explore how this
popularity might be predicted, as well as a study of visualisations using different tools to present to
potential business stakeholders. This data-focused analytic approach demonstrates the new ways in which record labels, artists, journalists, and the music industry as a whole will reckon with these changes.

3. Target Objectives
In summary, the aim of this report consists of three primary objectives:

1. Exploratory data analysis, examination, and visualisation, and applying machine learning
models on Kaggle’s Spotify Dataset 1922-2021, ~600k Tracks (Ay, 2021).
2. The application of regression models to study how the popularity metric might be predicted
from other track attributes.
3. Development of a simple playlist-generating MxTape application, tailored for users based on
their individual data accessed using Spotify Web APIs and showing how current data can be
retrieved for future work.

Throughout the work, a variety of tools will be applied where possible to highlight the variety of
options available for data analysis work, including Python, R, Excel, and Tableau.

Fig. 1: Evolution of Spotify (Constine, 2014)

4. Background
As previously mentioned, the music industry is a large and lucrative worldwide industry. In
the past decade or so, music streaming services and platforms have grown to become one of the
most important forms of exposure for artists and can also supply these artists with a form of income.
Listening habits for anyone with a fast Internet connection have changed enormously. Spotify, for example, launched its service in 2008; it took over two years to gain one million subscribers, but the service has grown exponentially since then and currently (in 2021) has over 155 million paying subscribers (Dean, 2021). With Spotify being the largest dedicated music streaming service, the decision was made to base our work on data collected from the Spotify API.

What previous work has been done in this area?

Due to the size of the industry, there have been countless attempts to predict a song's
popularity prior to its mainstream release. Nasreldin (2018) explains that “previous studies that considered lyrics to predict a song’s popularity had limited success”. Much research and testing has been conducted on this particular topic, and researchers consistently find it difficult to predict a “hit”, while detecting a non-popular song proves easier. In this particular example, the research team went through specific steps:

● Acquire the Data
● Scraping Billboard Songs
● Cleaning Dataset
● Data Exploration
● Selection of the Models and Tuning
● Discussion of the Results

Interiano et al. (2018) also conducted more in-depth research on this topic. Here, UK charts, MusicBrainz data and AcousticBrainz data were collected and used to conduct the analysis. They used 18 different types of variables to classify the various types of songs, 12 of which were binary variables, and 6 of which were categorical variables. The data was extracted, and numerous data analysis techniques were applied. First, the research group needed to decide what defined “success” pertaining to the songs. They agreed that any song that reached the charts counted as a success. From here, they conducted various tests, which resulted in finding trends which were visualised with line plots. A predictive model of success was achieved, but the model could not exceed a prediction score of 0.74.

How does this project extend our knowledge in this area?

From the information stated by Nasreldin (2018), the group can attempt to predict the popularity of songs through non-lyrical classification metrics, such as Spotify song descriptors. Nasreldin (2018) also explained that predicting that a song will not be a hit proved much simpler than predicting an actual hit. This research can extend Nasreldin's work thanks to a much greater sample size, as the previous research described used a sample size of only 1,200 songs. The added data points will give greater insights into which song descriptors relate to popularity and therefore support better, more accurate models. Building on the prior research in this field, this project aims to move past their difficulties, resulting in a superior prediction framework for this industry.

5. Methodology

5.1 Project Methodology

The CRISP-DM and its Six Stages

The term CRISP-DM stands for “Cross-Industry Standard Process for Data Mining”. It is a popular data mining methodology and “is an open standard process framework model for data mining project planning” (Wijaya, 2021). The process involves six stages: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment.

1. Business Understanding

The stage entails creating a clear understanding of what the business is and what it wants to
accomplish from the project. Wijaya (2021) states that the business understanding phase uses
important tasks such as “determining the business question and objective, situation assessment,
determining the project goals” and “project plan” to increase the overall business understanding.
Our analyses will benefit any business, organisation or individual associated with music
creation/production, streaming, distribution, and/or sound engineering, design, history, academics,
and research. This study will create value by addressing various problems and questions associated
with music streaming, using Spotify as the business case example.

2. Data Understanding

In this phase, the focus switches to gaining a better understanding of the data in order to support the previous phase and help solve the problem statement. Collection of data, description of the data, verification of the data quality, and exploration of the data are all important aspects here.

3. Data Preparation

The third phase requires that the data be correctly prepared in order to obtain the greatest
degree of accuracy. Data preparation can involve data selection, data cleaning, data formatting, data
aggregating, and data integration. Once completed, the data will be ready for the modelling stage.

4. Modelling

The modelling stage includes selecting the appropriate modelling techniques with adequate
model testing to validate the quality and accuracy of the created models. Training, testing, and
evaluating are all used. When sufficient models are created and tested, they are compared and
contrasted with one another to examine which model is the most applicable to answer the problem
statement, with the most accuracy.

5. Evaluation

The evaluation phase is used to examine and evaluate the model against the original intentions of the business highlighted in the business understanding phase. Wijaya (2021) explains that this stage contains three main areas: evaluation of the results, a process review, and determination of the next steps to follow.

6. Deployment

The final product must be presented in a professional form and manner to the major stakeholders, in this case the project supervisor. Wijaya (2021) describes that this phase involves deployment of the model, monitoring and maintenance of these models, a written final report, and finally a project review. The project may finish prior to the deployment phase, but the methodology is a continuous cycle. The stakeholders for MxTape are its users. The results from the various machine learning predictive models could potentially influence decisions made by record label executives, artists, managers, talent acquisition teams, and the music industry overall.

Fig. 2: CRISP DM Flow Diagram (Maia et al., 2013)

5.2 Analytical Methodology

5.2.1 Description of Data


The first part of the analytical methodology is understanding the data. Many of the fields are shared across the complete, publicly accessible Spotify Web API, its spotipy Python wrapper (Lamere, 2014), and Ay’s Kaggle dataset. This uniformity and accessibility are why these sources were selected. The Spotify Web API is made up of smaller APIs, as seen in Fig. 3, each with their own additional parameters. This API provides a vast and robust set of data from which to explore and discover new insights and trends.

Fig. 3: Spotify Web API Reference Index

Fig. 4: Spotify Authorization Code Flow

Spotify’s Web API is an example of a RESTful (Representational State Transfer) API, that is, an
API interacted with over HTTP requests (GET, POST, PUT, DELETE) making specific calls to URLs,
receiving data in the responses. The REST API endpoints use OAuth 2.0 Authorization (see the
Authorization Code Flow in Fig. 4, above), and return JSON metadata about artists, albums, and tracks
directly from the Spotify Data Catalogue (Web API, 2021). This data is reliable, clean, and up-to-date.
The scale is enormous: Spotify offers over “70 million tracks, including more than 2.6 million podcast titles… Spotify is the world’s most popular audio streaming subscription service with 356 million users, including 158 million subscribers, across 178 markets”, as of March 31, 2021 (Spotify — Company Info, 2021), with “60,000 tracks [uploaded] per day”, that is, “a new track uploaded to its platform every 1.4 seconds” (Ingham, 2021). The markets are shown in Fig. 5 below.
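The RESTful call pattern described above can be sketched in a few lines. This is a minimal sketch only: the endpoint path and Bearer-token scheme follow Spotify's documented REST/OAuth 2.0 pattern, but the token and track ID below are placeholders, and the request is constructed rather than actually sent.

```python
# Sketch of how a Spotify Web API call is formed (the request is built but not
# sent here). ACCESS_TOKEN is a placeholder that a real application would
# obtain via the Authorization Code Flow shown in Fig. 4.
BASE_URL = "https://api.spotify.com/v1"
ACCESS_TOKEN = "BQD...placeholder"  # hypothetical token from the auth flow

def build_track_request(track_id):
    """Return the URL and headers for a GET /v1/tracks/{id} call."""
    url = f"{BASE_URL}/tracks/{track_id}"
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    return url, headers

url, headers = build_track_request("11dFghVXANMlKmJXsNCbNl")
# A live call would then be: requests.get(url, headers=headers).json()
```

In practice the spotipy library wraps this call construction (and the token refresh cycle) behind its own client object.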

Fig. 5: Spotify Markets (Spotify - Company Info, 2021)

Due to the data constantly being updated within the Web API, we decided to primarily use
the Kaggle dataset for our analyses (Ay, 2021), which pulls directly from the Spotify Web API, and is
comprised of 6 total .csv files, broken into 2 sections by date, as shown below.

Section 1 - Dated to December 2020:
● data_o.csv (audio features of tracks, 170k rows) (29,127 KB)
● data_by_artist_o.csv (audio features of artists, 30k rows) (5,131 KB)
● data_by_year_o.csv (audio features of years, 100 rows) (21 KB)
● data_by_genres_o.csv (audio features of genres, 3k rows) (566 KB)

Section 2 - Dated to April 2021:
● tracks.csv (audio features of tracks, 600k rows) (108,757 KB)
● artists.csv (popularity metrics of artists, 1.1M rows) (47,317 KB)

Dataset 1: data_o.csv consists of the audio features of tracks fetched directly from the Spotify API, and it contains 19 columns.

Fig. 6: Description of Audio Features Object fetched from Spotify Web API Reference

(Spotify for Developers, 2021)

Dataset 2: tracks.csv has 20 columns. tracks.csv has the same features as data_o.csv except for year,
and additionally has:

Dataset 3: artists.csv has 5 columns.

Dataset 4: data_by_artist_o.csv shows the audio features of artists derived from data_o.csv. It is the
result of groupby-mean (and groupby-mode for categorical variables) operation (Ay, 2021), with 16
columns as shown below. All have been previously described except for count.

Dataset 5: data_by_genres_o.csv contains audio features of artists, grouped by genres. Ay calculated


means for numerical variables and modes for categorical variables during the aggregation process
(2021). There are 14 columns:

Dataset 6: data_by_year_o.csv contains the audio features sorted by year (1921-2020). Release
years were extracted using the release_date feature from data_o.csv and grouped by years. Ay
calculated means for numerical variables and modes for categorical variables (2021). There are 14
columns:

Note on the ‘popularity’ metric

Spotify uses internal models to generate the values of key data fields. The metric popularity,
for instance, is calculated by Spotify from a collection of other data fields. We do not have access to
the exact formula used, but we do know it is a measure based on a track's total number of plays relative to other songs, and how recent those plays are. Over time, if a song is not listened to as much as before, its popularity score will go down (afront, 2009), as older listens are gradually discounted. The temporal aspect of this feature is important in our analysis, as we will see. It would
be valuable to create a new column age in data_o.csv and tracks.csv which takes the most recent
release date - Nov 24th, 2020 for data_o and April 16th, 2021 for tracks - and subtracts all release
dates of other tracks from these values, obtaining an age, in days, of each song. This would allow us
to dive deeper into analyses of yearly, seasonally, monthly, weekly, and daily trends. We were unable
to do this due to missing data, i.e., the format of some track release dates only gave the year.
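For tracks whose release_date does carry a full YYYY-MM-DD value, the proposed age column could be computed along these lines. This is a hypothetical sketch on a toy frame standing in for data_o; the column names follow the dataset, but the two example tracks are invented.

```python
import pandas as pd

# Toy stand-in for data_o.csv; only rows with a full date would qualify.
df = pd.DataFrame({
    "name": ["Track A", "Track B"],
    "release_date": ["2019-05-01", "2020-11-24"],
})

# Most recent release date in data_o (Nov 24th, 2020) is the reference point.
reference_date = pd.Timestamp("2020-11-24")
df["age"] = (reference_date - pd.to_datetime(df["release_date"])).dt.days
```

Rows whose release_date only gives a year would raise or mis-parse here, which is exactly the missing-data obstacle noted above.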

We can see the distribution of popularity in the histogram in Fig. 7. It is clear that a large
number of entries have a very low score of 0 or 1.

Fig. 7: Distribution of popularity
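A plot like Fig. 7 reduces to a single histogram call; the sketch below shows the underlying binning on toy popularity values (the real call on data_o would be along the lines of `df["popularity"].plot.hist()`).

```python
import numpy as np

# Toy popularity scores illustrating the heavy mass at 0-1 seen in Fig. 7.
popularity = np.array([0, 0, 1, 1, 1, 5, 40, 55, 70, 90])

# Bin the 0-100 popularity range into five equal-width buckets.
counts, edges = np.histogram(popularity, bins=5, range=(0, 100))
```

Here the first bucket (scores 0-19) dominates, mirroring the spike of near-zero scores in the real distribution.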

Feature Trends, Correlation and Selection with Python


We start by plotting line graphs of the average feature values per year, which reveal some interesting observations. First, (2) acousticness has plummeted over the years, going from around 90% in the 1950s to 25% in the last few years. (3) danceability has seen a steady increase over time, from slightly below 50% in the 1950s to roughly 60% in recent years. (10) loudness has climbed over the years from -16 dB to -8 dB. Additionally, music seems to have become far more (6) explicit in the 2000s, reaching a mean of about 0.17, whereas in the 1950s, for example, explicit content was effectively at 0.0. The (1) valence (emotional positivity) of songs has been on a rollercoaster, with the 1970s and 1980s producing the happiest songs and the 2010s showing a sudden decline. (4) duration_ms wavered between 200k ms and 270k ms (~3.33 min - 4.5 min) from the 1950s to the 2010s, after which there is a steep decrease back towards 200k ms. (13) speechiness traces a 'U' shape over the years, highest in the 1950s and again in the 21st century. While the (14) tempo of songs had a low in the 1950s, (11) mode had a high at that time, and vice versa in the present. The average (8) key declined on two occasions between the 1950s and 1960s but has otherwise remained fairly stable. Both (5) energy and (12) popularity were at their lowest during the 1950s and have increased steadily since, with the energy level of songs peaking between 2000 and 2010 and popularity growing rapidly up to the current year. (7) instrumentalness and (9) liveness both tended to be prominent characteristics of songs up to the 1990s, but have seen a drastic decrease in the last 10 years.
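The trend lines described above come from per-year feature means. A minimal sketch of that aggregation, on a toy frame standing in for data_o (the feature names match the dataset; the values are invented):

```python
import pandas as pd

# Toy frame with two years and two audio features; the real call would
# group all ~170k rows of data_o on 'year'.
df = pd.DataFrame({
    "year":         [1955, 1955, 2015, 2015],
    "acousticness": [0.95, 0.85, 0.30, 0.20],
    "danceability": [0.45, 0.49, 0.62, 0.58],
})

yearly = df.groupby("year").mean()
# yearly.plot(subplots=True) would then draw one trend line per feature.
```

Each row of `yearly` is one point on the corresponding line chart, which is all the visualisation step adds.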

Fig 8: Feature Trends through the Years

Fig 9: Code to visualise feature correlation

Fig 10: Feature Selection of numeric variables with respect to hit variable

Fig 11: Python code to generate feature correlation heatmap

Fig 12: Heatmap of feature correlation

The heatmap in Fig. 12 (generated from the Python code shown in Fig. 11) shows that the most correlated features are year and popularity (0.86), energy and loudness (0.78), valence and danceability (0.56), and year and energy (0.53). The scatterplots below convey the two most correlated pairs. For "energy and loudness", there is a curvilinear positive relationship, i.e., as track energy increases (from 0.0 to 1.0), the overall trend is that loudness also increases steadily (from approximately -30 dB to 0 dB). With "year and popularity", there is a linear positive relationship. There are more outliers and noise; however, the line of best fit confirms the high correlation between recency of track and popularity, which was previously hypothesised due to the way popularity is inherently measured.
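The correlation matrix behind the heatmap, and the ranking of the most correlated pairs, can be sketched as follows. This is a toy frame with invented values in place of data_o; `sns.heatmap(corr, annot=True)` would render the figure itself.

```python
import numpy as np
import pandas as pd

# Toy stand-in for three of data_o's numeric features.
df = pd.DataFrame({
    "energy":   [0.1, 0.4, 0.6, 0.9],
    "loudness": [-25.0, -14.0, -9.0, -4.0],
    "valence":  [0.9, 0.2, 0.8, 0.1],
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once, then rank by |r|.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.abs().where(mask).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]
```

On the real dataset this ranking surfaces (year, popularity) and (energy, loudness) at the top, matching Fig. 12.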

Fig 13: Two most correlated pairs of features

Feature Statistics in R
Fig. 14 below shows summaries of the descriptive statistics (min, max, median, mean, 1st and 3rd quartiles) for each attribute in data_o.csv, produced in R.

Fig. 14: Descriptive Statistics of features in data_o

5.2.2. Data Cleansing

5.2.2.1. Data Cleansing in Python

Data cleansing is required in the case of any missing data. Some measures we took in preparing our data are outlined below, using Python in Google's Colab. Most attributes did not contain missing data, as seen below. Duplicates were also checked, but Spotify pre-empts these by assigning unique IDs to artists, tracks, and albums.

Fig. 15: Descriptive Statistics for popularity in data_o.csv - no missing data (Ay, 2021)

Fig. 16: Descriptive Statistics for year in data_o.csv - no missing data (Ay, 2021)

Fig. 17: Check if data_o has missing data

Fig. 18: Count null values for each column in every dataset (data_o as example here)

We validated this for each dataset using the code above (with data_o as an example). data_o,
data_by_artist_o, data_by_genres_o, data_by_year_o, and tracks csv files all had 0 missing data. The
only dataset with missing data was artists.csv, as shown below.

Fig. 19: Missing data values shown for ‘followers’ in artists.csv

There were 27 missing values for ‘followers’, which were filled with the mean number of followers from the dataset. Another option would have been to simply drop these rows from df_train with the .dropna() function.
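The null check and mean-fill just described can be sketched on a toy frame standing in for artists.csv (the 'followers' column name matches the dataset; the rows are invented):

```python
import pandas as pd

# Toy stand-in for artists.csv with two missing 'followers' values.
artists_data = pd.DataFrame({
    "name":      ["A", "B", "C", "D"],
    "followers": [100.0, None, 300.0, None],
})

missing = artists_data.isnull().sum()              # nulls per column
mean_followers = artists_data["followers"].mean()  # mean ignores NaN
artists_data["followers"] = artists_data["followers"].fillna(mean_followers)
# Alternative: artists_data = artists_data.dropna()
```

After the fill, a repeat of `isnull().sum()` confirms no missing values remain.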

Fig. 20: Read artists.csv into artists_data

Fig. 21: Show Missing Data on artists.csv

Fig. 22: Fix ValueError

Fig. 23: How to drop ‘na’ values

Defining a ‘hit’ track for Classification

It should be noted that, for the purposes of this thesis, we defined a hit track as one with a popularity score ≥ 90. This definition could be used with a classification model, i.e., to classify tracks into “hit” or “not hit”. Fig. 24 shows the creation of a hit column in Python, while Fig. 25 shows the corresponding implementation in R. Note that 90 was an arbitrarily selected value and can be modified based on the results obtained.
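The hit flag of Fig. 24 amounts to a one-line threshold; a sketch on a toy frame in place of data_o (the 90 cut-off is the arbitrary value discussed above):

```python
import pandas as pd

# Toy popularity scores; the real call applies the same line to data_o.
df = pd.DataFrame({"popularity": [95, 42, 90, 10]})

# 1 = hit (popularity >= 90), 0 = not hit.
df["hit"] = (df["popularity"] >= 90).astype(int)
```

Changing the 90 to another threshold rebalances the two classes, which matters for any classifier trained on this label.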

Fig. 24: Partitioning the hit variable in Python

Fig. 25: Partitioning the hit variable in R

5.2.2.2. Data Cleansing in Excel

We also use Excel to clean the data for processing. This includes spell-checking, removing duplicate rows, finding and replacing text, removing spaces and nonprinting characters from text, merging and splitting columns, etc. In data_by_year_o.csv, there are 14 columns - mode, year, acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, popularity, and key. The first step is to find the mean, standard deviation, lower limit, and upper limit of each column over the ranges C2:C101, D2:D101, E2:E101, F2:F101, G2:G101, H2:H101, I2:I101, J2:J101, K2:K101, L2:L101, M2:M101 and N2:N101.

Fig. 26: Outlier calculations

The second step is to flag outliers in columns C to N using the formula =IF(AND(C2>C$105, C2<C$106), TRUE, FALSE) (with the letters changed for the other columns D-N accordingly using autofill); since C$105 and C$106 hold the lower and upper limits, a value returning FALSE lies outside the limits and is an outlier. It is crucial to note that the dollar symbol $ creates an absolute reference, so the limit rows do not increment during autofill. After applying the formula, we filter to find the outliers. The identified outliers fall only slightly outside the lower and upper limits, so we keep them to preserve the integrity of the dataset. The differences are shown below. All the values in the dataset are consistent, the values are valid, and there are no missing values.

Fig. 27: Identified outliers

5.2.3. Data Modelling: Predicting popularity from Track Attributes

5.2.3.1. Python Machine Learning Models


Multiple Linear Regression

Now that all numeric features of the available track data have been explored, we can
address the interesting question of whether or not it is possible to use them to train a model that
will predict popularity. This is a key metric for various music industry stakeholders, so a predictive
model would be valuable. A popular and relatively straightforward predictive algorithm is linear
regression, which models the relationship between a response variable (popularity) and an
explanatory variable (e.g., year or energy). In this case, given the number of features, we will use
multiple linear regression (MLR), which allows multiple explanatory variables (e.g., year and energy
and …).
To run MLR, we first pull the data from the data_o.csv file. After producing a heatmap of the 14 numerical variables (see below, Fig. 28), it is concluded that popularity has the highest correlation with year (0.86), energy (0.49), loudness (0.46), danceability (0.2), and explicit (0.19) content. Thus, for feature selection, we choose these top 5 variables as the training data for x. Cleaning, scaling, setting aside 20% for testing, and running the regression model were performed, obtaining an MLR formula with regression coefficients and an intercept. We will also run the process for all 14 numeric features.
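The scale/split/fit pipeline just described can be sketched as follows. This is a minimal sketch on synthetic data standing in for data_o's top-5 features (year, energy, loudness, danceability, explicit); the generating coefficients are invented for illustration, and Fig. 34 shows the original implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 tracks, 5 features, with a known linear signal.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = X @ np.array([40.0, 10.0, 8.0, 5.0, 3.0]) + rng.normal(0, 1.0, 200)

# 80/20 train/test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)   # scale using training data only
model = LinearRegression().fit(scaler.transform(X_train), y_train)

r2 = model.score(scaler.transform(X_test), y_test)  # R^2 on the held-out 20%
```

`model.coef_` and `model.intercept_` then supply the weightings of the MLR formula reported in Fig. 36.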

Fig. 28: Import libraries for Regression

Fig. 29: Load dataset as dataframe

Fig. 30: Descriptive Stats for popularity

Fig. 31: Scatterplot depicting relationship between energy and popularity

Fig. 32: Scatterplot depicting relationship between loudness and popularity

Fig. 33: Scatterplot depicting relationship between danceability and popularity

Fig. 34: Python code to generate the 5-feature MLR model

Fig. 35: (MLR) Coefficients and Intercept; Test Set Actual vs. Predicted Values Plot; Metrics - 5
features

Fig. 36: Final MLR Formula with top 5 correlated features weightings

Fig. 37: Changing MLR code to account for 14 features

When the same process is repeated for all 14 numerical variables in data_o, we obtain the following:

Fig. 38: (MLR) Coefficients and Intercept; Test Set Actual vs. Predicted Values Plot; Metrics - 14
features

Fig. 39: Final MLR Formula for 14 Features

PyCaret

PyCaret, a machine learning library, allows for a quick comparison across many models and their MAE, MSE, RMSE, R2, RMSLE, and MAPE (Przybyla, 2021). LazyPredict is another library that quickly runs 40 vanilla regression models with a few lines of code (Araujo, 2021). Applying these libraries to various datasets, confirming data types, and removing inaccurate models and those that take too long to train is an efficient way to eliminate certain ML algorithms. These results should not, however, be considered final models, as double-checking and optimisation by an expert analyst remain crucial. From Fig. 41 below, we see the PyCaret output in which the Gradient Boosting Regressor (gbr) appears to be the best model based on MSE and RMSE, followed closely by the Random Forest Regressor (rf) and Extra Trees Regressor (et), which had better MAE, R2, RMSLE (Root Mean Squared Logarithmic Error), and MAPE (Mean Absolute Percentage Error). ElasticNet was much faster than these three models. It is always important to clarify what is meant by ‘success’ in terms of these metrics.

Fig. 40: PyCaret Code

Fig. 41: PyCaret Output

XGBoost

Since PyCaret found that gradient boosting achieved a high score out of the many models
tested, we will adopt some gradient boosting models to extend the prior MLR model. Gradient
boosting refers to a class of ensemble machine learning algorithms that can be used for classification
or regression predictive modelling problems. These models are more complex but can be far more
effective. Ensemble methods consider and compare many different models, while boosting improves
accuracy by focusing on training instances that are misclassified. The first one we consider is
XGBoost, or “Extreme Gradient Boosting,” an implementation of gradient boosting that is regarded
as being very computationally efficient. XGBoost has proven very successful and has come to
“dominate many Kaggle competitions” (Becker, 2018).

Fig. 42: Feature selection for XGBoost; Train/Test split with Test as 20%

Gradient boosting can be applied to the required features by assigning those attributes to variables. Here, X holds the features listed in Fig. 42 above and y holds the popularity feature. These five features were chosen based on the correlations obtained from the MLR analysis above. The data is split into two parts so that the algorithm evaluation technique remains fast, which is necessary for a large dataset. For faster computing, XGBoost can make use of multiple cores on the CPU. This is possible because of a block structure in its system design: data is sorted and stored in in-memory units called blocks. Unlike other algorithms, this enables the data layout to be reused by subsequent iterations instead of being computed again, and it also serves steps like split-finding and column sub-sampling. XGBoost accepts sparse input for both the tree booster and the linear booster and is optimised for it. It supports a range of loss functions, training by minimising the loss of an objective function against a dataset. As such, the choice of loss function is a “critical hyperparameter and tied directly to the type of problem being solved, much like deep learning neural networks” (Guestrin, 2016).

XGBoost differs from other gradient boosting libraries such as LightGBM, which grows trees using histogram-based algorithms and does not store additional information for pre-sorting feature values.

Fig. 43: XGBoost Model and Cross Validation for 5 features

The cross_val_score() function from the scikit-learn library allows us to evaluate a model using
the cross-validation scheme and returns a list of the scores for each model trained on each fold.
After running cross-validation, we end up with k different performance scores that can be
summarised using mean and standard deviation. The choice of k must allow the size of each test
partition to be large enough to be a reasonable sample of the problem, whilst allowing enough
repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm’s
performance on unseen data. The k-fold here is 5.

Fig. 44: Python code to generate XGB Test Set Actual vs. Predicted Values

Fig. 45: Test Set Actual vs. Predicted Values with Metrics for 5 features (XGB)

Fig. 46: Test Set Actual vs. Predicted Values with Metrics for 14 features (XGB)

Histogram-Based Gradient Boosting

Gradient boosting methods that discretise, or bin, the continuous input variables, use the bins to construct feature histograms of the training dataset (instead of finding split points based on the sorted feature values), and evaluate tree-node split candidates based on these feature histograms are known as histogram-based gradient boosting (HBGB) ensembles (see Fig. 47 for the algorithm). Evaluating the split points in standard gradient boosting is extremely slow, with O(n log(n)) to sort each feature and O(n) at each split point (PyData, 2019). HBGB, inspired by the LightGBM library, can be “orders of magnitude faster... when the number of samples is larger than tens of thousands of samples…[and has] built-in support for missing values” (1.11. Ensemble methods, n.d.). “It costs O(#data * #features) for histogram-building and O(#bins * #features) for split-point finding” (Ke et al., n.d.). Since HBGB is more memory efficient and faster, we have decided to use it as one of our modelling techniques.

Fig. 47: Histogram-based Algorithm (Ke et al., n.d.)

Fig. 48: Python import statements for HBGB

Fig. 49: Python code for HBGB Test Set Actual vs. Predicted Values (5 features)

Fig. 50: Test Set Actual vs. Predicted Values (HBGB) for 5 features

Fig. 51: Test Set Actual vs. Predicted Values (HBGB) for 14 features

Improving Accuracy with Additional Preprocessing

There are many ways to change and preprocess data to try to improve a model's performance. With this Spotify dataset, we noticed that popularity is dependent on release_date, and that a significant percentage of tracks have very low values (e.g., 0, 1, …). Here, we perform two additional preprocessing steps and re-fit our HBGB model to examine whether predictions can be improved. The first step is to treat the (old) zero-popularity values as outliers and simply remove them, while the second is to create a more accurate time-based feature, age, from release_date, which first needs to be converted into a consistent date format.

Removing Low Popularity Values

As we saw in Fig. 7, a very large number of tracks (approx. 30,000) have a popularity of 0 or 1. Examining the dataset shows that these are almost all very old tracks, with the year property as early as 1921. If we treat these as outliers and remove them (see Fig. 52), we get slightly different predictions, as seen in Fig. 53; the model predictions are more clustered around a somewhat linear space, though the changes are not radical. The bottom 'bar' that showed a problem, where many predicted values on the test set were far off the actual values of 0 or 1, has disappeared.
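The removal step can be sketched as follows (a synthetic frame stands in for the dataset; the report's actual code is in Fig. 52):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: roughly 30% of tracks get popularity 0 or 1,
# mimicking the large low-popularity block in the real data.
rng = np.random.default_rng(0)
low = rng.random(1000) < 0.3
pop = np.where(low, rng.integers(0, 2, 1000), rng.integers(2, 100, 1000))
df = pd.DataFrame({"popularity": pop})

before = len(df)
df = df[df["popularity"] > 1].reset_index(drop=True)  # drop the 0/1 'outliers'
print(f"removed {before - len(df)} rows, {len(df)} remain")
```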

Fig. 52: Removing entries with very little popularity

Fig. 53: Predicted vs. Actual values for HBGB with low popularity entries removed

Creating A More Useful ‘age’ Feature

The dataset contains a release_date attribute, and this can be used to generate a more accurate age for a track. However, the entries are inconsistent, containing a number of date formats of varying specificity. Three main formats are included, shown by the examples '12/18/1921', '2021-07', '8/25/1984', and '1945'. To use this attribute, we have to make all values consistent, valid dates. We can, for example, append day 15 to those with only a year and month, and month 7 and day 1 (approximately the halfway point of the year) to those with only a year. We can also convert them all to a built-in datetime datatype, which allows the use of date functions. Having done this, we can subtract each release_date from the most recent date in the dataset, 2020-11-24, to generate a new age attribute. Unlike the year attribute, this captures the important distinction between different ages within the same year: we can see in the data that this matters, as tracks released later in 2020 are more likely to be popular than those from early 2020. The year attribute we had been using does not capture this.
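The normalisation steps above can be sketched as follows; the report's actual code is in Fig. 54, and the to_date helper here is our illustrative version. Note that a release_date after the 2020-11-24 reference point (such as the '2021-07' example) produces a negative age:

```python
import pandas as pd

# The three release_date formats found in the dataset.
df = pd.DataFrame({"release_date": ["12/18/1921", "2021-07", "8/25/1984", "1945"]})

def to_date(s: str) -> pd.Timestamp:
    """Normalise the mixed release_date formats to a full date."""
    if "/" in s:                                   # month/day/year
        return pd.to_datetime(s, format="%m/%d/%Y")
    if "-" in s:                                   # year-month: assume day 15
        return pd.to_datetime(s + "-15", format="%Y-%m-%d")
    return pd.to_datetime(s + "-07-01", format="%Y-%m-%d")  # year only: mid-year

latest = pd.Timestamp("2020-11-24")  # most recent date in the dataset
df["age"] = (latest - df["release_date"].map(to_date)).dt.days
print(df)
```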

A number of steps were required to create this new age attribute (see Python code in Fig.
54). Examples of the original release_date format and the corresponding new numeric age values
that are generated are shown in Fig. 55. When used with the HBGB model, differences in the
predictions can be seen. For example, some higher values are now predicted.

Fig. 54 : Python code to fix release_date content/format and create new numeric age attribute

Fig. 55: Examples of original release_date and new calculated age attributes

Fig. 56: Predicted vs. Actual values for HBGB with new ‘age’ feature

5.2.3.2. Excel Machine Learning Models

In Excel, using the Data Mining add-in, we performed a standard partition to analyse a small sample of the data. The partitioning type is random, with 70% for training, 20% for validation, and 10% for testing. We performed Linear Regression, Neural Networks, and the Ensemble Methods Bagging, Boosting, and Random Trees on the partitioned dataset. A sample of the dataset after the standard partition is shown below.

Fig. 57: Standard Partition

Linear Regression

We performed Linear Regression on the standard partition dataset with year as the input and popularity as the output. The testing prediction summary below shows an RMSE of 3.7525, which is relatively low. The Lift chart below has year on the x-axis and song popularity on the y-axis.

Fig. 58: Prediction Summary; Decile vs. Global Mean

Fig. 59: Linear Regression Lift Charts - year/popularity

This shows a decrease in the popularity of songs over the years. The lift chart comparing underestimated and overestimated values gives an AOC value of 702.229, so this curve lies above the baseline.

Linear regression with all the selected variables as inputs and popularity as the output gives an RMSE of 1.957, and the lift charts show an AOC value of 184.606.

Fig. 60: Prediction Summary; Decile vs. Global Mean

Fig. 61: Linear Regression Lift Charts - all variables vs. popularity

Regression Tree

The next model is a fully grown regression tree, with an RMSE of ~1.853 for year vs. popularity. The first lift chart shows a stepped reduction, the second shows the area over the curve (AOC value ~84.121), and the final chart shows that the curve lies above the cumulative actual using average, indicating a well-fitted model for the popularity of songs.

Fig. 62: Regression Tree - year/popularity

The fully grown regression tree with all the selected variables as inputs and popularity as the output gives an RMSE of ~3.163 and an AOC value of ~499.571.

Fig. 63: Regression Tree - all variables vs. popularity

Neural Networks

For this method we selected year as the input variable and popularity as the output. In the stopping rules dialog, we set the number of epochs to 100. We also tested other epoch values (e.g., 30 and 300) and found no change in the error value. The table below shows a training RMSE of ~19.95 for all epoch values, which is a very high error. This indicates that the model is not a well-fitted regression.

Fig. 64: Neural Network - Year/Popularity Epochs - 100

Similar steps performed with all the selected variables as inputs, popularity as the output, and 30 epochs give RMSE values ranging from ~19.95 to ~20.36.

Fig. 65: Neural Network - all variables/ popularity-epochs 30

Ensemble Method - Bagging

We first select the data partition sheet, with year as the input variable and popularity as the output. As part of the ensemble methods, we deployed linear regression, neural network, and decision tree models. The prediction summary of the methods is shown below. The RMSE values for the three methods are ~3.763 for linear regression, ~39.813 for the neural network, and ~1.325 for the decision tree. The AOC values are ~702.149 for linear regression, ~18694.249 for the neural network, and ~49.655 for the regression tree.

Fig. 66: Ensemble Method - Bagging: Linear Regression, Neural Network & Regression Tree-
year/popularity

We then select the data partition sheet with all the columns as input variables and popularity as the output. As part of the ensemble methods, we deployed linear regression, neural network, and decision tree models. The prediction summary of the methods is shown below. The RMSE values for the three methods are ~2.029 for linear regression, ~35.724 for the neural network, and ~2.347 for the decision tree. The AOC values are ~205.235 for linear regression, ~18694.249 for the neural network, and ~260.499 for the regression tree.

Fig. 67: Ensemble Method - Bagging: Linear Regression, Neural Network & Regression Tree - all
variables/popularity

Ensemble Method - Boosting

The same process is applied as in Bagging. The prediction summary of the methods is shown below. The RMSE values for the three methods are ~3.664 for linear regression, ~39.15674 for the neural network, and ~1.721 for the decision tree. The AOC values are ~662.622 for linear regression, ~18694.249 for the neural network, and ~92.843 for the regression tree.

Fig. 68: Ensemble Method-Boosting - Linear, Neural Network & Regression Tree - Year/Popularity

The prediction summary with all the selected variables as inputs and popularity as the output is shown below. The RMSE values for the three methods are ~2.148 for linear regression, ~35.060 for the neural network, and ~3.259 for the decision tree. The AOC values are ~229.923 for linear regression, ~18694.249 for the neural network, and ~530.875 for the regression tree.

Fig. 69: Ensemble Method- Boosting-Linear, Neural Network & Regression Tree (all variables vs.
popularity)

Ensemble Method - Random Tree

After running this model, we find an RMSE value of ~1.325 and an AOC value of ~49.655 for the input variable year and the output variable popularity.

Fig. 70: Ensemble Method - Random Tree- Year/ popularity

After running this model with all the selected variables as inputs and popularity as the output, we find an RMSE value of ~2.347 and an AOC value of ~260.499.

Fig. 71: Ensemble Method - Random Tree-all variables/popularity

6. Findings & Analysis

Multiple Linear Regression


The MLR model with 14 features proves slightly better than the MLR model with 5 features, but not by much. The RMSE for 14 features is lower by ~0.152 (10.812 vs. 10.964). The R2 score is also better for 14 features, being ~0.007 higher than that of 5 features (0.754 vs. 0.747). Both y-intercepts are very close (~31.428 and ~31.426). The MAE is ~8.027 for 14 features and ~8.073 for 5 features.

The models struggle to predict any popularity values over ~65, possibly because these are uncommon in the dataset. Meanwhile, the large contingent of very low popularity scores in the dataset also has an effect. Even with the linear model, popularity is clearly a difficult attribute to predict accurately, though it is evidently influenced by other features. The fact that the recency of listens influences popularity means that the newness of a track is linked to its chance of being popular, but while most popular songs are recent, most recent songs are nonetheless unpopular.

XGBoost
After running cross-validation we obtain performance scores that we can summarise using mean and standard deviation. The result is a more reliable estimate of the performance of the algorithm on new data, because the algorithm is trained and evaluated multiple times on different data. The cross_val_score() function from scikit-learn allows us to evaluate a model using the cross-validation scheme and returns a list of the scores for each model trained.

With 5 Features:

Fig. 72: XGBoost Model Score and Cross Validation Score for 5 features

Here, the model score is found to be ~0.843. cross_val_score estimates the expected accuracy of the model on out-of-training data, and the spread of the fold scores indicates how precisely that score is measured. Here, the cross-validation score is ~0.829 with a standard deviation of ~0.006.

With 14 features:

Fig. 73: XGBoost Model Score and Cross Validation Score for 14 features

Here, the model score is found to be ~0.851. Again, cross_val_score estimates the expected accuracy of the model on out-of-training data. Here, the cross-validation mean score is ~0.838 with a standard deviation of ~0.009.

The results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Running the models for 5, 14, or n features a few times and comparing the average outcomes gives a better estimate of the tests being conducted. Here, the cross-validation means for the two feature selection sets reached a maximum of almost 84% accuracy. The mean score of the XGBoost model is approximately 1% higher when predicting popularity from 14 attributes (~0.851) than from only 5 attributes (~0.843).

Histogram-Based Gradient Boosting


For HBGB, the test dataset with 14 features was the best model, slightly outperforming XGB
with 14 features (except for MAE metric), with an R2 score of ~0.808, MSE ~91.226, RMSE ~9.551,
and MAE ~6.751 (XGB had R2 score of ~0.807, MSE ~91.676, RMSE ~9.575, and MAE ~6.745).
However, the XGB with 14 features was slightly better than the HBGB model with 5 features. See Fig.
74 below for a summary comparison of all the tested model metrics.

Fig. 74: Comparison of metrics across all models

It is evident that the PyCaret gbr model was quite accurate and performed very well. HBGB with 14 features was the next best model, followed by XGB-14, HBGB-5, XGB-5, and finally MLR-14 and MLR-5. This suggests that more complex models with a higher number of features tend to be better predictive models.

Experimental Excel Results


After running all the models for both year/popularity and all the selected variables/popularity, we found that the regression tree is a good fit for year/popularity, with an RMSE value of 1.893 and a smaller AOC value of 84.121; the smaller the AOC value, the better the fit, and linear regression had a higher RMSE here. The RMSE and AOC values of the neural network and of the bagging, boosting, and random tree ensemble methods are high, so these models are not a good fit. With all the columns as inputs and popularity as the output, linear regression gives an RMSE of 1.957 and the lift charts show an AOC value of 184.606, lower than for the other models. After comparing the RMSE values and the lift charts, we find that linear regression is the better-fitted model for this selection of the dataset.

7. Discussion

Trying to solve the “Hit Song Science” problem (Nijkamp, 2018) is obviously an extraordinary task to take on; if it were already solved, it would be easy for any artist to become popular and generate a heap of revenue quickly. Companies such as Music Intelligence Solutions and
researchers at the University of San Francisco (USF) have tried to tackle this problem head on with
some success. CEO David Meredith of Music Intelligence Solutions “says his software found that hits
have certain common patterns of rhythm, harmony, chord progression, length and lyrics. A study
conducted by the Harvard Business School found that the software was accurate 8 out of 10 times”
(Sydell, 2009). The researchers of USF created four models, of which the SVM model “achieved the
highest precision rate (99.53 percent), while the random forest model attained the best accuracy rate
(88 percent) and recall rate (85.51 percent)” (Fadelli, 2019).

Spotify itself acquired a predictive analytics company called “Echo Nest” in 2014, deeming the
data platform to be essential in helping to add to the user personalization experience and ultimately
to position themselves for an improved business strategy. Since Echo Nest generates “Discover Weekly” and other professionally curated playlists on Spotify for paid subscribers, it is one of the main attractions for users choosing Spotify as their music streaming service over rival services.
While Echo Nest capabilities do not necessarily predict whether a song will be a ‘hit’ or not, it does
predict what songs users will enjoy listening to, based on the specific user’s listening habits, or
“diversity—the coherence of the set of songs a user listens to” (Anderson et al., 2020), which may vary
drastically from one person to the next. An internal tool used by Spotify and Echo Nest employees is
The Truffle Pig, which can curate playlists “from almost anything: an artist's name, a song, a vague
adjective or feeling” (Pierce, 2015), even occasion-specific playlists. Spotify’s goal is to be “the first
app you open every time you put on headphones, because you trust it to have what you want even when you don't know what you want. In the increasingly volatile, ever-changing music business, that
is the only powerful place to be” (Pierce, 2015).

The mission of this project draws from this mindset, understanding just how important a role data plays in the consumer-driven music industry. From the analysis conducted on track popularity predictions and the creation of a recommender system, we can consolidate findings from various machine learning models based on track, artist, genre, and user metadata, speculate on some of the internal functionality within Spotify, and deliver insights on how social networking and digital presence play a role in the success of artists in modern-day music markets. There are ample opportunities in music research that Spotify and MxTape could take advantage of, from which new value creation can arise. “Examples include selling hit prediction models to record companies and
artists. Spotify might be interested in also collecting other ‘external’ data for this reason and possibly
add social functionalities such as a chat box. Especially now, as streaming companies like Spotify are
struggling with their profitability, diversifying the revenue channels, and innovating on the value
proposition is vital” (Nijkamp, 2018).

8. Details of MxTape Application

Fig. 75: MxTape Logo

The analysis in this report relied on static Spotify data published on the Kaggle website (Ay, 2021). This dataset was pulled from Spotify's API. However, a small additional application, called MxTape, was created to demonstrate how to access up-to-date, user-specific information from Spotify. The goal of the application is to simulate a basic implementation of Spotify's “Discover Weekly” playlists. We used the spotipy library in Python to achieve this, which wraps API access in convenient function calls. MxTape uses the Flask web framework to create a simple HTML/CSS interface and simple UX/UI design, which links in nicely to the Spotify OAuth 2.0 authentication process required to access a user's details. MxTape does four things: 1) gets a list of 'recently played tracks' using an HTTP request, 2) chooses a list of seed tracks, 3) gets a list of recommended tracks, and 4) creates a new playlist on the user's Spotify account with the date of creation as the playlist title. After setting environment variables (spotipy client ID, spotipy client secret, spotipy redirect URI) in the command line interface and running app.py, the user is redirected to http://127.0.0.1:8080/ where they can sign in to their own Spotify account. OAuth 2.0 gets and validates the Spotify user access token, and the spotipy plugin provides functionality to check authentication and access API data. The Flask app then maps functions such as discover() as routes, i.e., specific URLs that the basic web UI displays as links to the user. More implementation details are included in Appendix B. What is significant from a data analytics viewpoint is that this shows how 'fresh' data, and other data such as user listening details, can be programmatically obtained: this could be used to develop more sophisticated models or find new insights.

9. Conclusion

The music industry is rapidly changing and like many other environments is moving to
algorithmic-driven online platforms (e.g., Spotify). Vast amounts of listening data, as well as
qualitative analysis of the music itself, can be used widely in these platforms, and through access
provided through APIs, to the wider industry (e.g., promoters, producers, artists, fans, advertisers).

The project's goals were to explore a dataset from the world’s most popular music platform,
Spotify, and look at how machine learning models might be used to predict certain song attributes.
In this work, we focused on popularity, and built a number of predictive regression models using
various features.

Multiple linear regression, XGBoost, and histogram-based gradient boosting models were
used, with some success. A number of tools were applied to demonstrate techniques: Python, Excel,
R, and Tableau. It was learned that popularity, as defined in the dataset, is a fundamentally difficult property to predict. Its temporal aspect (older listens are discounted) means that the time dimension strongly influences the outcome, while features such as danceability or acousticness have a limited impact. More data might provide a different perspective: for example, if the number_of_listens per day were retrieved for a period of time, it might be possible to create a model with some success at predicting which new releases would go on to become big hits, based on their early growth.

A further part of this work involved building a simple playlist-generating application that
directly accesses Spotify’s data via its APIs. This demonstrates that a wide range of data can be used,
such as specific user-listening habits. This fresh data could be used to build much richer and more
useful models, as noted above. An alternative approach would be to use the large datasets of tracks
and artists along with user listening data to improve recommendations. Data-driven music platforms
already dominate the listening habits of the developed world, and this is not going to change anytime soon. Data analytics will only grow in importance across the industry, and both its compelling strengths and potential pitfalls need to be carefully considered. It is vitally important that this data-driven change keeps in mind the many people who are within, or trying to get into, the music industry, often in precarious or uncertain circumstances, and how algorithms and data analytics can impact them.

Potential recommendations for future work include:

1. Implementing more complete HTML/CSS (e.g., Bootstrap) into the Flask app, to have a
cleaner and more visually appealing front-end output of data on our MxTape website, with
more options and user-entered data. The playlist generator could, for example, ask the user
for keywords or mood to influence the track selections.

2. Having the user select a track from MxTape results which is then used to find
matching/similar tracks from the Kaggle dataset, which are then used to seed the
recommendations and perform further exploratory/predictive analyses.

3. Answering the question, “Can we predict a track’s possible genre(s), given the rest of its
attributes?” by building a classification or clustering model.

4. Exploring user-listening taste clusters (“generalists” vs. “specialists” - see Appendix E).

5. Expanding feature selection (e.g., writing a function to cycle through all possible
combinations of features to find the optimal ones for predicting the target variable).

6. Performing similar analyses, but with Echo Nest’s Truffle Pig data.

10.1. Appendix A - Python Visualisations

Fig A1.1: Heatmap Python code

Fig. A1.2: Heatmap of Attribute Correlation
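The heatmap code in Fig. A1.1 follows this general pattern: a minimal seaborn sketch on a synthetic frame (column names are illustrative, and the Agg backend is used so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Spotify frame (columns are illustrative).
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((200, 4)),
                  columns=["popularity", "energy", "loudness", "danceability"])

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Attribute correlation")
plt.tight_layout()
plt.savefig("heatmap.png")
```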

Fig. A2: Heatmap showing correlation between popularity and energy

Fig. A3: Pie Chart of 10 Most Popular Genres

Fig. A4.1: Most Common Genres over the Years

Fig. A4.2: Larger Key

Here is a bar chart showing the 100 most common music artists. The most common artist in our dataset is Francisco Canaro with 2,228 songs. Some famous performers on the list include Chopin, Beethoven, Mozart, Sinatra, Johnny Cash, Bob Dylan, and The Beatles.

Fig A5: Bar chart of Most common artists

We used the log transformation and box-cox methods to transform our data so that its
distribution resembles a bell shape. We visualised this by going through every variable, plotting a
histogram of the distribution and a boxplot below it. A sample for field valence is as follows:

Fig A6: Histogram and box plot of valence feature
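The transformations can be sketched as follows. A skewed synthetic series stands in for valence; Box-Cox requires strictly positive input, so a small offset is added:

```python
import numpy as np
from scipy import stats

# Skewed synthetic stand-in for the valence feature.
rng = np.random.default_rng(1)
valence = rng.beta(2, 5, 1000)

log_t = np.log1p(valence)                     # log transform: log(1 + x) avoids log(0)
boxcox_t, lam = stats.boxcox(valence + 1e-6)  # Box-Cox with MLE-fitted lambda

print("skew before:", stats.skew(valence))
print("skew after Box-Cox:", stats.skew(boxcox_t))
```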

Similarly, we plot the histogram and boxplot for each variable, showing how close to normal each distribution is and how close the box tends to be to the ends of the whiskers. This also illustrates the spread of each variable in the data. We then bin the data by compressing the unique values of each feature into 20 bins. The distribution per variable is shown below with histograms.
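The binning step can be sketched with pandas (a synthetic series stands in for one feature):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a single numeric feature.
rng = np.random.default_rng(3)
tempo = pd.Series(rng.normal(120, 30, 1000))

# Compress the feature's unique values into 20 equal-width bins.
binned = pd.cut(tempo, bins=20)
print(binned.value_counts().sort_index().head())
```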

Fig A7: Binning using histogram for each feature

Fig. A8: Most Frequent Artist “words” in the ‘Irish Folk’ Subgenre

A wordcloud was created in order to investigate which words were the most frequently used
in song titles across genres. The most common word found was “love”, followed by “don’t” and
“like”.

Fig. A9: Wordcloud of most frequent words in song titles.

10.2. Appendix B - R Visualisations

R is another useful platform for data analytics. In this section, we will use it to generate
visualisations. As with Python, it can also be used to train complex predictive models.

The heatmap shows the correlation between the most popular music genres, on a scale from -1 to 1. A score of 1 indicates perfect positive correlation, -1 perfect negative correlation, and 0 no correlation. In the heatmap, latin and r&b display a moderate positive correlation with a score of 0.57. Pop and rock also have a positive correlation, with a score of 0.3.

Fig. B1: Correlation between Genres

A multiple line graph was created to show trends in popularity through the last two decades. Currently
R&B is the most popular, followed closely by Rap, Pop, and Latin genres. The six boxplots show the average
popularity of the genres rock, rap, R&B, pop, latin and EDM. The genre with the highest average popularity is
pop and the lowest is EDM.

Fig. B2: 21st Century Genre Popularity

Fig. B3: Popularity of Genres

This heatmap (Fig. B4) illustrates the correlation between song characteristics. The most strongly correlated features are energy and loudness (0.78), danceability and valence (0.54), and popularity and loudness (0.35). The histogram (Fig. B5) depicts the popularity of songs by year of release. Average popularity has risen greatly since the 1950s, which could be due to less data being retained from that era compared to the present day. The year with the highest average popularity was 2000, with a slow decrease since.

Fig. B4: Correlation between Spotify Song Characteristics

Fig. B5: Average Popularity through the Years

R also allows us to apply many kinds of sophisticated machine learning/data science algorithms. Fig. B6 shows an application of the generalised linear model (glm) algorithm on the Spotify data. glm allows for different (i.e., non-normal) error distributions in the response variable, which makes it more flexible. In the example below, the only feature used is year. As in Python, this could be extended to include many other features.

Fig. B6: Classification Generalised Linear Model with R example – Summary

The boxplot shows the distribution of song durations in the dataset. The mean is approximately 225 seconds, with the lower quartile at approximately 180 seconds and the upper quartile at approximately 260 seconds.

Fig. B7: Average duration of songs

10.3. Appendix C - MxTape Playlist Recommendation Application

Fig C1: Login screen

Fig C2: Code to display the index page

Fig C3: Index page displayed at redirect URI

Fig C4: Python track output

Fig C5: Same tracks added to a new MxTape playlist

Fig. C6: MxTape discover() function

Fig. C7: discover() function (continued)

Fig. C8: MxTape-generated playlists

10.4. Appendix D - Gradient Boosting to show Feature Trends over the Years

Gradient Boosting

We visualised the Kaggle dataset and created a gradient booster to predict the popularity of
songs.

Fig. D1: Importing necessary dataset and loading packages.

Various loaded packages include:

NumPy: for linear algebra and working with arrays.

Pandas: for data manipulation, processing, and analysis.

Seaborn: for statistical data visualisation.

Matplotlib: for visualisation and graphical plotting.

XGBoost: for gradient-boosted tree models.

Scikit-learn: for machine learning and statistical modelling (regression).

Sample of the dataset:

Fig. D2: CSV file in Excel sheet

Fig. D3: Displaying sample of the loaded Data Frame

Fig. D4: Statistical data of the numerical values of the Data Frame.

In this part of the analysis we build a predictive model. Our goal is to predict the "year"
feature using every other numerical variable of the given dataframe.

"X" contains every feature in the dataset except "year", while "y" is assigned to "year".
Afterwards, we split the data into train and test sets.

Fig. D5: Split data

One of the key aspects of supervised machine learning is model evaluation and validation. When you
evaluate the predictive performance of your model, it is essential that the process be unbiased.
Using train_test_split() from the data science library scikit-learn, you can split your dataset into
subsets that minimize the potential for bias in your evaluation and validation process.

The training set is applied to train, or fit, your model. The validation set is used for unbiased model
evaluation during hyperparameter tuning. The test set is needed for an unbiased evaluation of the
final model.

o x_train: The training part of the first sequence (x)


o x_test: The test part of the first sequence (x)
o y_train: The training part of the second sequence (y)
o y_test: The test part of the second sequence (y)
o Training: used to fit the model and optimise its hyperparameters
o Testing: used to check that the optimised model works on unseen data, i.e. that it
generalises well
o Validation: because tuning leaks some information about the test set into the model
through the chosen parameters, a final check is performed on completely unseen data
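The split described above can be sketched with train_test_split on synthetic stand-in data (in the report, X held every numeric column except "year"):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 tracks, five numeric features, "year" as target
X = rng.random((1000, 5))
y = rng.integers(1950, 2021, size=1000)

# Hold back 20% of the rows as an untouched test set, as in Fig. D5
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)  # (800, 5) (200, 5)
```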

Fig. D6: XGBoost Model Score and Cross Validation Score for 14 folds

The algorithm provides hyperparameters that should, and in some cases must, be tuned for
a specific dataset. Before tuning, a baseline is established: the score() method takes an input
X_test and its target y_test, computes y_pred for X_test, and returns a score for the prediction
using the model's optimisation metric. Here, the model score is found to be 0.8474353447896554.

cross_val_score estimates the expected accuracy of the model on out-of-training data, and
repeating the evaluation across folds indicates how precisely that score is being measured. Here,
the cross-validation score is found to be 0.8346448908441406. Combining the two accuracy
scores gives a better measure of global model performance: the XGBoost model is about 84%
accurate on this dataset.

10.5. Appendix E - Linear Regression to Determine popularity based on
Artist’s # of Followers

Fig. E1: Library imports; Set random seed

Fig. E2: Feature scaling and splitting of training and testing data

After the required libraries (pandas, seaborn, numpy, scikit-learn, pylab, and matplotlib)
are imported, a random seed is set. Once df_train is cleaned and the features are selected (x as
followers and y as popularity), feature scaling is applied to the independent variable x using the
standardisation formula shown above. This gives the followers distribution a mean of 0 and a
variance of 1. The dataset is then split into 80% training and 20% testing (168,204 is about 20%
of the total number of records, 841,023).
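The standardisation step can be sketched with scikit-learn's StandardScaler; the follower counts below are hypothetical, not drawn from the real 841,023-row dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical follower counts for five artists
followers = np.array([[30.0], [120.0], [5_000.0], [250_000.0], [1_200_000.0]])

# z = (x - mean) / std: the scaled column has mean 0 and unit variance
scaled = StandardScaler().fit_transform(followers)

print(scaled.mean(), scaled.std())  # ~0.0 and ~1.0
```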

Fig. E3: Fit training data to linear model; Get coefficient and intercept

scikit-learn allows linear regression to be fitted to the data quickly and efficiently, returning
the regression coefficient and y-intercept as displayed, from which the final formula for predicting
a given song's popularity is derived.

Fig. E4: Derived Linear Regression Formula

Fig. E5: Error Metrics
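The fit, derived formula, and error metrics (Figs. E3-E5) can be sketched as below. The data is synthetic, with popularity generated from a standardised followers feature, so the coefficient, intercept, and metrics it prints are illustrative rather than the report's values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(7)

# Synthetic stand-in: popularity rises with the standardised follower count
followers_scaled = rng.normal(0, 1, size=(500, 1))
popularity = 40 + 8 * followers_scaled.ravel() + rng.normal(0, 5, size=500)

reg = LinearRegression().fit(followers_scaled, popularity)

# Derived formula, as in Fig. E4: popularity ~ intercept + coef * followers_scaled
print(reg.intercept_, reg.coef_[0])

# Error metrics, as in Fig. E5
pred = reg.predict(followers_scaled)
print(mean_squared_error(popularity, pred), r2_score(popularity, pred))
```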

10.6. Appendix F - Spotify Meetup

The following images show slides from a Meetup we attended with a Research Scientist from
Spotify. This helped to motivate our project.

Fig. F1: Spotify Meetup - Diversity Metric (Maystre, 2021)

Fig. F2: Conversion Rates based on Diversity of Listening Activity (Maystre, 2021)

Fig. F3: 40-dimensional Spotify Model to Measure Track “Similarity” (Maystre, 2021)

Fig. F4: Genre “Islands” (Maystre, 2021)

Fig. F5: Importance of Diversifying Algorithmic Recommendations (Maystre, 2021)

10.7. Appendix G - Tableau Dashboard Visualisations

Dashboard 1:

Fig. G1: Dashboard 1

The dashboard contains three sheets in graphical form:

Count and Loudness of Few Artists: the graph shows that count and loudness are significant
features in the songs of the artists filtered from the list. The artist-specific colours make the
result easier to interpret.

Popularity vs Key: the packed-bubble chart shows how popularity varies across the musical keys
(numbered from C). The larger a key's bubble, the higher the popularity associated with songs in
that key.

Artists with Popularity > 90: artists whose popularity exceeds 90 are selected in order to find the
highest among them.

Dashboard 2:

Fig. G2: Dashboard 2

The dashboard contains two sheets in graphical form:

Popularity vs Max Value of Popularity: a bar chart of popularity against the maximum of
popularity, which forms an almost steadily decreasing curve from the value 35.

Max Popularity between 90-100 in the year 2021: this treemap focuses on the current year,
2021; the bluer the colour, the closer the popularity is to the 95-100 range.

Dashboard 3:

Figure G3: Dashboard 3

The dashboard contains three sheets in graphical form:

Danceability and Loudness w.r.t. Genres: this box-and-whisker plot relates danceability and
loudness to genre.

Popularity vs Genres: this horizontal bar chart shows how a song's popularity varies with genre.
For example, Dutch Hip Hop has popularity below 70, whereas Early Music is below 30, as those
songs receive fewer listens in this decade.

Energy vs Key (Instrumentalness as filter): energy and key are important characteristics of a song,
and this sheet shows how artists combine them. Where the count of energy lies between 35-45,
the keys range from 7-9.

Fig. G4: Count and Loudness of Few Artists

Fig. G5: Popularity vs Key

Fig. G6: Artists with popularity > 90

Fig. G7: Popularity vs Max Value of Popularity

Fig. G8: Max Popularity between 90-100 in the year 2021

Fig. G9: Danceability and Loudness w.r.t Genres

Fig. G10: Popularity Vs Genres

Fig. G11: Energy vs Key (Instrumentalness as filter)

10.8. Appendix H – Minutes of Meetings

Initial Email:

Hi Anatoli,

We have decided on Music Genre Classification as our topic for the Major Project, which we will
build using Python libraries integrated with the Spotify Web API: spotipy and tekore.
This presentation shows a decent, tentative outline for what we were thinking of replicating,
although we may only focus on 2-3 ML models, instead of all 8 shown. Here is another link we will
reference to build visualizations and determine what to put on a listener’s playlist, using reference
tracks from their listening history. 

Our app will essentially do 5 things:

a. Get last played tracks.
b. Choose seeds.
c. Get recommended tracks.
d. Decide playlist name.
e. Create Spotify playlist.

Will this be a sufficient idea for our project? We have other ideas as well that we could add
on later if need be, but for now, I think the meat of the project lies here. Please let us know when
you are available for a kick-off meeting.

Lastly, we were wondering about the due dates for our Outline Proposal (shown as Feb 14th in the
Course Outline on Blackboard). Due to the exam extensions caused by the high increase of COVID
cases, will there be extensions to these dates? How in depth should the proposal be?

Kick-off Meeting 08/02/21:


Present: Caitlin, Neeharika, Sabbarish, (Niall – COVID Test)

Agenda:
• Agree on methodology – CRISP-DM or else, should be stated early and followed.
• Stage 1: outline the project: 
• initial ideas, goals, formulate objectives.
• analytical tools and techniques to be used. 
• Stage 2: What dataset / source is, data structure (JSON format or else) re-structuring
needed or not?
• Look at the final report structure and over interim submissions try to address
components in one way or another to be reused later.

Moving Forward:
• Project proposal (2-3 pages): possible structure / components:
• Background.
• Motivation and project objective.
• Data overview.
• Tools & techniques.
• Scope, feasibility (e.g. access to data, knowledge, time,…), risk assessment.

Next Meeting: Tuesday 09/02/21.

Tues 09/02/21:
Present: Caitlin, Neeharika, Sabbarish & Niall

Agenda:
• Discuss Concerns from Kick-Off Meeting such as Data Extraction, Methodology, # of
Variables, etc.
• Project Proposal (possible components):
o Background.
o Motivation and project objective.
o Data overview.
o Tools & techniques.
o Scope, feasibility (e.g. access to data, knowledge, time, …), risk
assessment.

Moving Forward:
• Decided on meeting this Friday to start with software installation, creation of app,
discussion, and draft of Proposal.

Next Meeting: Friday 12/02/21.

Wed 10/02/21:
Present: Caitlin, Niall, Sabbarish

• Major Project Information Session on MS Teams with Michael Lang: Overview, Q&A,
Course Outline. 

Fri 12/02/21:
Present: Caitlin, Neeharika, Niall
Agenda:
• Team building, coffee meetup.
• Create application name & email address.
• Make a start on creation of application.

• Begin our draft proposal.
• Make suggestions on possible issues down the line.
• Create a group Google Drive account so all members have access to work on
project.
Moving Forward:
• Agreed on steps moving forward (more research on topic and have all members to
have a clear idea of project by next meeting).
• Plan a meeting for next week.
Next Meeting: Thursday 18/02/21.

Thurs 18/02/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• Finalize the Proposal document for submission for 21/02/21.
• Brainstorm more ideas.
• Inform Sabbarish of our progress.
Moving Forward:
• Plan a meeting for next week.
• Research further.
• Work with the code once members are finished with their SAP case studies which is
due the 19/03/21.
Next Meeting: Saturday 27/02/21.

Thurs 11/03/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• The organisation of the workload- try to work smarter.
• The Interim/progress report due Monday 22/03/21
• Adding more detail and direction to the original Project Proposal
• CRISP-DM Method
• Background: Authorization Flow (OAuth2.0), Web APIs – Discuss Issues/Setbacks
• Idea for moving forward: Incorporate more of Spotipy library and pandas, seaborn
data visualizations.
Moving Forward:

• Decisions made: Sabbarish and Caitlin will focus on code; Neeharika and Niall will
focus on Research/Report.
• Caitlin and Sabbarish will revisit DISC for Authentication issues
• Priority list – ranked list of necessary items and extra “nice-to-haves”/“could be
implemented” items if time allows.
Next Meeting: Friday 26/03/21.

Thurs 22/04/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• Recap of work done to date.
• Delegate tasks.
• Report Sections

To Do:
• Add Discover button to menu.
• We’re using 5 random seed tracks now – Idea: seeding for a specific playlist to be
built (i.e., a cardio workout playlist seeded from high liveness, energy, or danceability).
• User can select from HTML Sliders (Bootstrap slider) of features: liveness,
danceability, etc. – then search Kaggle dataset for tracks that do have these categories –
distance metrics. ML algm?
• Quantity of music data is growing exponentially. In 1920s wouldn’t have as much
musical metadata, or even recordings. Can talk more about history in Background. Can
show exponential growth over years.

Questions to Explore:
• Predict a column – scikit-learn Linear Regression – to predict one missing feature.
• X – everything but danceability, y: danceability -> predict y.
• Train on data_o.csv (audio features of tracks, original data, 170k rows)
• Mode will always be 1, can’t really predict year. Predict aspects you might assume
would be correlated to other aspects.
• If you have a new track with all features – predict genre.
• Clustering of genres.
• Explicit content over years (0 or 1 – classification), most popular key per genre, etc.

• Change vars around – e.g. compare loudness to energy, valence
to instrumentalness? Etc.
• Show distribution graphs similar to as seen on Kaggle.

Moving Forward:
• Discuss business value and insights such as promoting new/different artists in
certain regions/markets, determining Top 5 cities that listen to artists the most, # of
listens/(popularity), and user experience.
• Sabbarish will check for links for business insights.
• For 2nd Report: Section on Spotify’s Big Data – reference articles
• For Final Report: Relate findings from Kaggle to Every Noise at Once or produce
similar visualisations.
• Show we have more developments for the website upcoming (Html templates,
Bootstrap/CSS).
Next Meeting: TBC

Thurs 11/06/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• Prepare questions to ask Anatoli in meeting
• Ask about Breakdown of Project Methodology and Analytical Methodology
• Excel Models – Brief Description – Include in Appendix?
• Ask if we should use other tools or continue in Python.

Moving Forward:
• R - Glm – How do track attributes predict a song’s popularity? Ask Anatoli if binomial
or gaussian is better for logistic regression. – Caitlin
• Excel – Neeharika
• R Visualisations – Histograms, plots – Niall
• Tableau – Sabbarish – gradient boosting Python and linear regression in Tableau
• Decision trees, SVMs, Neural Networks, Ensemble methods, etc.
• We have done analysis on static data set but here we explore how to access user
data. Description of how you do it.
• Snippets to show last 50 songs.
• Screenshots of showing how to build the playlist

Next Meeting: Monday 14/06/21

Mon 14/06/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• Ask Anatoli questions which were prepared previously
• Get feedback about project so far
• What directions we should go in next

Insights from Supervisor:


• Code will not be graded, software not run – only include screenshots and snippets
from it in Report that are valuable.
• Comment on heatmap
• Regression vs. Classification – Decide which. Depends on target variables – if has a
couple + classes use classification.
• Findings – Show if have achieved Objective from beginning of report.
• Screenshot of code to reduce word count
• Cross Validation, Data Partitioning, etc. – Analytical Methodology

Moving Forward:
• Increase number of models to gain greatest insights and accuracy
• Update report structure in line with advice from supervisor
• Think of best models to create for next meeting

Potential Recommendation/Future Work:


This app could be expanded on to specifically filter only ‘popular’ songs, or by specific genre or other
features. Name of playlist generated could also be more creative and have time stamp in addition to
date to avoid duplicate titles.

Mon 21/06/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• Assign remaining jobs such as executive summary and CRISP DM overview
• Discuss Anatoli’s feedback
• Focus on how to structure methodologies sections

• Discuss further models to explore

Moving Forward:
• Try to make good progress for Thursday’s meeting
• Neeharika - continue with Excel
• Sabbarish – continue with Tableau and gradient boosting
• Caitlin – model techniques
• Niall – executive summary and overview of the CRISP DM
Next Meeting: Thursday 24/06/21

Thurs 24/06/21:
Present: Caitlin, Neeharika, Sabbarish
Agenda:
• Finalize everyone’s parts and structure for the Report
• Niall uploaded Meeting Minutes Doc and Exec Summary draft
• Caitlin completed Motivation, Target Objectives, Data Description and Cleansing in
report
Moving Forward:
• Everyone will have updates in by Tonight, will Review again over the weekend
before writing Conclusion and Recommendations
• Aim to submit before the 30th

Tue 06/07/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:
• Make final adjustments to certain parts of the group document
• Try to reduce work count and number of pages in main body
Moving Forward:
• Send draft to Anatoli
• Organising final meeting before submission
Next Meeting: Thursday 08/07/21

Thurs 08/07/21:
Present: Caitlin, Neeharika, Sabbarish & Niall
Agenda:

• Read the conclusion together
• Suggesting any improvements to the final document
• Discuss final aesthetics of the final document
Moving Forward:
• Submit document tomorrow

11. References

Anderson, A., Maystre, L., Mehrotra, R., Anderson, I. and Lalmas, M., 2020. Algorithmic Effects on
the Diversity of Consumption on Spotify. [ebook] Taipei, Taiwan: cs.toronto.edu, pp.1-11. Available
at: <http://www.cs.toronto.edu/~ashton/pubs/alg-effects-spotify-www2020.pdf> [Accessed 5 May
2021].

Araujo, I., 2021. How to Run 40 Regression Models with a Few Lines of Code. [online] Medium.
Available at: <https://towardsdatascience.com/how-to-run-40-regression-models-with-a-few-lines-
of-code-5a24186de7d> [Accessed 14 June 2021].

Ay, Y., 2021. Spotify Dataset 1922-2021, ~600k Tracks. [online] kaggle.com. Available at:
<https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks> [Accessed 4 May
2021].

Behrens, N., 2017. nab0310/SpotifyML. [online] GitHub. Available at:
<https://github.com/nab0310/SpotifyML/blob/master/spotify/Final%20Pretty%20Spotify%20Data%
20Classifier.ipynb> [Accessed 20 March 2021].

Brownlee, J., 2020. Histogram-Based Gradient Boosting Ensembles in Python. [online] Machine
Learning Mastery. Available at: <https://machinelearningmastery.com/histogram-based-gradient-
boosting-ensembles/> [Accessed 15 June 2021].

Canales, J., 2021. How to get your favorite artist’s music data from Spotify using the Python library
Spotify. [online] Medium. Available at: <https://medium.com/analytics-vidhya/how-to-get-your-
favourite-artist-music-data-from-spotify-using-the-python-library-spotipy-d5c25065c91e> [Accessed
19 March 2021].

Constine, J., 2014. Inside The Spotify – Echo Nest Skunkworks. [online] Techcrunch.com. Available at:
<https://techcrunch.com/2014/10/19/the-sonic-mad-
scientists/?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig
=AQAAACapzIuC_rs7PEKIUeRhR7MJTbR7EAsowO5QcJdxz5TRq9idw9n86Uyv4kqSmF_YNZdEmNk_oX
paWUzFSSKRa6Z-qZy688veWQeFH2786EWoGBtzSd4HWn9S-1GT8LazQjqDQEcBNj88kGTn36zhlv-
429-Yngrw2HV7j_qoVtDP> [Accessed 24 May 2021].

Data Science Project Management. 2021. CRISP-DM - Data Science Project Management. [online]
Available at: <https://www.datascience-pm.com/crisp-dm-2/> [Accessed 16 March 2021].

Dean, B., 2021. Spotify Usage and Growth Statistics: How Many People Use Spotify in 2021?. [online]
Backlinko. Available at: <https://backlinko.com/spotify-users> [Accessed 5 May 2021].

Developer.spotify.com. 2021. Web API | Spotify for Developers. [online] Available at:
<https://developer.spotify.com/documentation/web-
api/#:~:text=Web%20API%20also%20provides%20access,in%20the%20Your%20Music%20library.&t
ext=To%20access%20private%20data%20through,via%20the%20Spotify%20Accounts%20service.>
[Accessed 20 March 2021].

Developer.spotify.com. 2021. Web API Reference | Spotify for Developers. [online] Available at:
<https://developer.spotify.com/documentation/web-api/reference/#objects-index> [Accessed 10
June 2021].

Fadelli, I., 2019. Using Spotify data to predict what songs will be hits. [online] Techxplore.com.
Available at: <https://techxplore.com/news/2019-09-spotify-songs.html> [Accessed 7 May 2021].

GitHub. 2021. plamere/spotipy. [online] Available at:
<https://github.com/plamere/spotipy/blob/master/examples/app.py> [Accessed 6 May 2021].

Harris, J., 2020. Unless we start paying, making music will become the preserve of the elite | John
Harris. [online] the Guardian. Available at:
<https://www.theguardian.com/commentisfree/2020/dec/20/spending-money-on-music-
coronavirus-britain-musicians-spotify-albums> [Accessed 6 May 2021].

Ingham, T., 2021. Over 60,000 tracks are now uploaded to Spotify every day. That’s nearly one per
second. - Music Business Worldwide. [online] Music Business Worldwide. Available at:
<https://www.musicbusinessworldwide.com/over-60000-tracks-are-now-uploaded-to-spotify-daily-
thats-nearly-one-per-
second/#:~:text=Spotify%20confirmed%20in%20November%20last,to%20around%2070%20million
%20tracks.> [Accessed 6 May 2021].

Interiano, M., Kazemi, K., Wang, L., Yang, J., Yu, Z. and Komarova, N., 2018. Musical trends and
predictability of success in contemporary songs in and out of the top charts. Royal Society Open
Science, 5(5), p. 171274.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T., n.d. LightGBM: A Highly
Efficient Gradient Boosting Decision Tree. [online] Papers.nips.cc. Available at:
<https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf>
[Accessed 4 July 2021].

Lamere, P., 2014. Welcome to Spotipy! — spotipy 2.0 documentation. [online]
Spotipy.readthedocs.io. Available at: <https://spotipy.readthedocs.io/en/latest/> [Accessed 2 March
2021].

Lamere, P., 2021. plamere/spotipy. [online] GitHub. Available at:
<https://github.com/plamere/spotipy/blob/master/examples/app.py> [Accessed 2 March 2021].

Maia, A., Oliveira, S., Moura, D., Sarubbi, J., Vercellino, R., Medeiros, B. and Griska, P., 2013. A
decision-tree-based model for evaluating the thermal comfort of horses. Scientia Agricola, 70(6), pp.
377-383.

Maystre, L., 2021. Algorithmic Effects on the Diversity of Consumption on Spotify.

Nasreldin, M., 2018. Song Popularity Predictor. [online] Medium. Available at:
<https://towardsdatascience.com/song-popularity-predictor-1ef69735e380> [Accessed 6 May
2021].

Nijkamp, R., 2018. Prediction of product success: explaining song popularity by audio features from
Spotify data. Bachelor. University of Twente.

Pansentient.com. 2009. Song Popularity on Spotify – How it Works. [online] Available at:
<http://pansentient.com/2009/09/spotify-song-popularity/> [Accessed 14 June 2021].

Pierce, D., 2015. Inside Spotify's Hunt for the Perfect Playlist. [online] Wired.com. Available at:
<https://www.wired.com/2015/07/spotify-perfect-playlist/> [Accessed 7 May 2021].

Przybyla, M., 2021. Predicting Spotify Song Popularity. [online] Medium. Available at:
<https://towardsdatascience.com/predicting-spotify-song-popularity-49d000f254c7> [Accessed 10
June 2021].

PyData, 2019. Deep Dive into scikit-learn's HistGradientBoosting Classifier and Regressor. [video]
Available at: <https://www.youtube.com/watch?v=J9QQ6l_HToU> [Accessed 28 June 2021].

Rowe, W., 2018. Mean Square Error & R2 Score Clearly Explained. [online] BMC Blogs. Available at:
<https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/>
[Accessed 6 May 2021].

Scikit-learn.org. n.d. 1.11. Ensemble methods. [online] Available at: <https://scikit-
learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting> [Accessed 14 June
2021].

Scikit-learn.org. n.d. sklearn.metrics.r2_score — scikit-learn 0.24.2 documentation. [online] Available at:
<https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score>
[Accessed 20 June 2021].

Spotify. 2021. Spotify — Company Info. [online] Available at:
<https://newsroom.spotify.com/company-info/> [Accessed 6 May 2021].

Sydell, L., 2009. New Music Software Predicts The Hits. [online] Npr.org. Available at:
<https://www.npr.org/templates/story/story.php?storyId=113673324&t=1620394364754>
[Accessed 7 May 2021].

The Sound of AI, 2020. Create Spotify Playlists Using Python. [video] Available at:
<https://www.youtube.com/watch?v=3vvvjdmBoyc> [Accessed 20 December 2020].

Chen, T. and Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System. [online] Available at:
<https://arxiv.org/abs/1603.02754> [Accessed 20 June 2021].

Velardo, V., 2020. spotifyplaylistgenerator. [online] GitHub. Available at:
<https://github.com/musikalkemist/spotifyplaylistgenerator> [Accessed 20 December 2020].

Wijaya, C., 2021. CRISP-DM Methodology For Your First Data Science Project. [online] Medium.
Available at: <https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-
project-769f35e0346c> [Accessed 26 June 2021].

