
How the Spotify Algorithm Works from a Cloud-Service Perspective

As we move into the 2020s, an ever-increasing share of music consumption and discovery is mediated by AI-driven recommendation systems. Back in 2020, as much as 62% of consumers rated streaming platforms like Spotify and YouTube among their top sources of music discovery — and a healthy chunk of that discovery is driven by recommender systems. On Spotify, for instance, over one third of all new artist discoveries happen through "Made for You" recommendation sessions, according to the recently released Made to be Found report.
Yet, as algorithmic recommendations take center stage in the music discovery landscape, the professional community at large still perceives these recommender algorithms as black boxes. Music professionals rely on recommender systems across platforms like Spotify and YouTube to amplify ad budgets, connect with new audiences, and execute successful release campaigns — while often having no clear vision of how these systems operate or how to leverage them to amplify artist discovery.

How do recommendation and music discovery work on Spotify?


In a lot of ways, Spotify's recommendation engine handles a flow similar to TikTok's "For You" algorithm, playing matchmaker between the creators (or artists) and users (or fans) on a two-sided marketplace.

Behind the algorithm: understanding music and user tastes


In broad strokes, at the core of any AI recommender system, there's an ML model optimized
for the key business goals: user retention, time spent on the platform, and, ultimately,
generated revenue. For this recommendation system to work, it needs to understand the
content it recommends and the users it recommends it to. On each side of that proposition,
Spotify employs several independent ML models and algorithms to generate item
representations and user representations.
Generating Track Representations: Content-Based and Collaborative Filtering
Spotify's approach to track representation is made up of two primary components:

1. Content-based filtering: aiming to describe the track by examining the content itself
2. Collaborative filtering: aiming to describe the track through its connections with other tracks on the platform by studying user-generated assets

The recommendation engine needs data generated by both methods to get a holistic view of the content on the platform and to solve the cold start problem when dealing with newly uploaded tracks. First, let's take a look at the content-based filtering algorithms:
Analyzing artist-sourced metadata
As soon as Spotify ingests a new track, an algorithm analyzes all the general song metadata provided by the distributor, plus the Spotify-specific metadata sourced through the Spotify for Artists pitch form. In the ideal scenario, where all the metadata is filled in correctly and makes its way to the Spotify database, this list should include:

• Track title
• Release title
• Artist name
• Featured artists
• Songwriter credits
• Producer credits
• Label
• Release date
• Genre & sub-genre tags
• Music culture tags
• Mood tags
• Style tags
• Primary language
• Instruments used throughout the recording
• Track typology (Is it a cover? Is it a remix? Is it an instrumental?)
• Artist hometown/local market

The artist-sourced metadata is then passed downstream as input to other content-based models and to the recommender system itself.
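To make the list above concrete, here is a minimal sketch of what such an ingested metadata record could look like. The field names and values are illustrative assumptions, not Spotify's actual internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class TrackMetadata:
    """Hypothetical artist-sourced metadata record, mirroring the list above."""
    track_title: str
    release_title: str
    artist_name: str
    featured_artists: list[str] = field(default_factory=list)
    songwriters: list[str] = field(default_factory=list)
    producers: list[str] = field(default_factory=list)
    label: str = ""
    release_date: str = ""          # ISO 8601, e.g. "2022-06-17"
    genres: list[str] = field(default_factory=list)
    moods: list[str] = field(default_factory=list)
    styles: list[str] = field(default_factory=list)
    language: str = "en"
    instruments: list[str] = field(default_factory=list)
    typology: str = "original"      # "cover", "remix", "instrumental", ...
    local_market: str = ""          # artist hometown / home market

track = TrackMetadata(
    track_title="Example Song",
    release_title="Example EP",
    artist_name="Example Artist",
    genres=["indie rock"],
    moods=["melancholic"],
)
```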

Analyzing raw audio signals


The second step of content-based filtering is raw audio analysis, which runs as soon as the audio files, accompanied by the artist-sourced metadata, are ingested into Spotify's database. The precise way in which that analysis is carried out remains one of the secret sauces of the Spotify recommender system.
The audio features data available through the Spotify API consists of 12 metrics describing the sonic characteristics of the track. Most of these features are objective sonic descriptions. For example, the "instrumentalness" metric reflects the algorithm's confidence that the track has no vocals, scored on a scale from 0 to 1. However, on top of these "objective" audio attributes, Spotify generates at least three perceptual, high-level features designed to reflect how the track sounds in a more holistic way:

1. Danceability: describing how suitable a track is for dancing based on a combination of musical elements, including tempo, rhythm stability, beat strength, and overall regularity.
2. Energy: representing "a perceptual measure of intensity and activity", based on the track's dynamic range, perceived loudness, timbre, onset rate, and general entropy.
3. Valence: describing "the musical positiveness of the track". Generally speaking, tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while songs with low valence sound more negative (e.g., sad, depressed, angry).
Yet, these audio features are just the first component of Spotify's audio analysis system. In
addition to the audio feature extraction, a separate algorithm will also analyze the track's
temporal structure and split the audio into different segments of varying granularity: from
sections (defined by significant shifts in the song timbre or rhythm, that highlight transitions
between key parts of the track such as verse, chorus, bridge, solo, etc.) down to tatums
(representing the smallest cognitively meaningful subdivision of the main beat).
Educated assumption: the combination of data generated by these audio analysis methods should allow Spotify to discern the audio characteristics of a song and follow their development over time and between different sections of the track.
The chances are that, over almost ten years, Spotify's audio analysis algorithms have come a long way — which would mean that the audio feature data fed into the recommendation system today is much more detailed and granular than what's available through the company's public API.
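As an illustration, the 12-metric audio-features vector (and the granular temporal analysis) was exposed through Spotify's public Web API. Below is a minimal sketch of fetching it, assuming you already hold a valid OAuth access token (note that Spotify has since restricted these endpoints for new applications):

```python
import requests

API_BASE = "https://api.spotify.com/v1"
ACCESS_TOKEN = "..."  # assumed: a valid OAuth token obtained elsewhere

def get_audio_features(track_id: str) -> dict:
    """Fetch the track-level audio features (danceability, energy, valence, ...)."""
    resp = requests.get(
        f"{API_BASE}/audio-features/{track_id}",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()

# The companion endpoint f"{API_BASE}/audio-analysis/{track_id}" returns the
# temporal structure: sections, bars, beats, tatums, and segments.
features = get_audio_features("11dFghVXANMlKmJXsNCbNl")
print(features["danceability"], features["energy"], features["valence"])
```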
Analyzing text with Natural Language Processing models
The final component of content-based track representation is a set of Natural Language Processing (NLP) models, employed to extract semantic information describing the track or artist from music-related text content. These models are applied in three primary contexts:

1. Lyrics analysis: the primary goal here is to establish the prominent themes and the general meaning of the song's lyrics, while also scanning for potential "clues" that might be useful down the road, such as locations, brands, or people mentioned throughout the text.
2. Web-crawled data: focusing primarily on music blogs and online media outlets. Running NLP models against web-crawled data allows Spotify to uncover how people (and gatekeepers) describe music online by analyzing the terms and adjectives that most often co-occur with the song's title or the artist's name.
3. User-generated playlists: the NLP algorithms run against the user-generated playlists featuring the track on Spotify to uncover additional insights into the song's mood, style, and genre. "If the song appears on a lot of playlists with 'sad' in the title, it is a sad song."
The NLP models allow Spotify to tap into the track's cultural context and expand on the
sonic analysis of how the song sounds with a social dimension of how the song is perceived.
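A toy illustration of the playlist-title signal described in point 3 — counting descriptor terms across the titles of playlists that feature a track. The vocabulary and playlist titles are invented; Spotify's production NLP is far more sophisticated:

```python
from collections import Counter

DESCRIPTOR_TERMS = {"sad", "happy", "chill", "workout", "focus", "summer"}  # toy vocabulary

def descriptor_signal(playlist_titles: list[str]) -> Counter:
    """Count descriptor terms across titles of playlists featuring the track."""
    counts: Counter = Counter()
    for title in playlist_titles:
        counts.update(t for t in title.lower().split() if t in DESCRIPTOR_TERMS)
    return counts

# "If the song appears on a lot of playlists with 'sad' in the title, it is a sad song."
print(descriptor_signal(["sad hours", "late night sad songs", "chill study"]))
# Counter({'sad': 2, 'chill': 1})
```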
The three components outlined above — artist-sourced metadata, audio analysis, and NLP models — make up the content-based side of track representation within Spotify's recommender system. Yet, there's one more key ingredient in Spotify's recipe for track representation.
Collaborative Filtering
In many ways, collaborative filtering has become synonymous with Spotify's recommender system. The DSP giant pioneered the application of this so-called "Netflix approach" in the context of music recommendation — and has widely publicized collaborative filtering as the driving power behind its recommendation engine. So the chances are you've heard the process laid out time and again. At least the following version of it:
"We can understand songs to recommend to a user by looking at what other users with
similar tastes are listening to." The algorithm simply compares users' listening history: if user
A has enjoyed songs X, Y and Z, and user B has enjoyed songs X and Y (but haven't heard Z
yet), we should recommend song Z to them. By maintaining a massive user-item interaction
matrix covering all users and tracks on the platform, Spotify can tell if two songs are similar
(if similar users listen to them) and if two users are similar (if they listen to the same songs).
However, this item-user matrix approach comes with a host of issues that have to do with
accuracy, scalability, speed, and cold start problems. So, in recent years, Spotify has moved
away from consumption-based filtering — or at least drastically downplayed its role in track
representation. Instead, the current version of collaborative filtering focuses on the track's
organizational similarity: i.e., "two songs are similar if a user puts them on the same playlist".
By studying playlist and listening-session co-occurrence, collaborative filtering algorithms access a deeper level of detail and capture well-defined user signals. Simply put, streaming users often have pretty broad and diverse listening profiles — in fact, building listening diversity is one of Spotify's priorities — so the mere fact that the same user listens to two songs says little about whether those songs are actually related.
If, on the other hand, a lot of users put song A and song B on the same playlist, that is a much more conclusive sign that these two songs have something in common. On top of that, the playlist-centric approach also offers insight into the context in which these two songs are similar — and with playlist creation being one of the most widespread practices on the platform, Spotify has no shortage of collaborative filtering data to work through.
Today, the Spotify collaborative filtering model is trained on a sample of ~700 million user-
generated playlists selected out of the much broader set of all user-generated playlists on the
platform.
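To illustrate the organizational-similarity idea on a toy scale — Spotify's actual model learns embeddings from those ~700 million playlists, but the underlying signal is pairwise co-occurrence, roughly like this:

```python
from collections import defaultdict
from itertools import combinations

def playlist_cooccurrence(playlists: list[list[str]]) -> dict[tuple[str, str], int]:
    """Count how often each pair of tracks is placed on the same playlist."""
    counts: dict[tuple[str, str], int] = defaultdict(int)
    for playlist in playlists:
        for a, b in combinations(sorted(set(playlist)), 2):
            counts[(a, b)] += 1
    return counts

playlists = [
    ["song_A", "song_B", "song_C"],  # invented toy data
    ["song_A", "song_B"],
    ["song_B", "song_D"],
]
counts = playlist_cooccurrence(playlists)
print(counts[("song_A", "song_B")])  # 2 -> "these two songs have something in common"
```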
Now we've finally arrived at the point where the combination of collaborative and content-based approaches allows Spotify's recommender system to develop a holistic representation of the track. At this point, the track profile is further enriched by combining the outputs of several independent algorithms to generate higher-level vectors (think of these as mood, genre, and style tags, etc.).
In addition, to deal with the cold start problem when processing freshly uploaded releases
that don't have enough NLP/playlist data behind them, some of these properties are also
extrapolated to develop overarching artist algorithmic profiles.

Generating User Taste Profiles


The approach to user profiling on Spotify is quite a bit simpler, at least once track representations are solved. Essentially, the recommender engine logs all of the user's listening
activity, split into separate context-rich listening sessions. This context component is vital
when interpreting user activity to generate taste profiles. For instance, if the user engages
with Spotify's "What's New" tab, the primary goal of the listening session is often to quickly
explore music recently added to the platform. In that context, high skip rates are to be
expected, as the user's primary goal is to skim through the feed and save some of the content
served for later — which means that a track skip shouldn't be interpreted as a definite
negative signal. On the other hand, if the user skips a track when listening to a "Deep Focus"
playlist designed to be consumed in the background, that skip is a much stronger sign of user
dissatisfaction.
Generally speaking, user feedback can be split into two primary categories:

• Explicit, or active feedback: library saves, playlist adds, shares, skips, click-throughs to artist/album pages, artist follows, "downstream" plays
• Implicit, or passive feedback: listening session length, track playthrough, and repeat listens

In the case of the Spotify recommender system, explicit feedback weighs more when developing user profiles. Music is often enjoyed as off-screen content, meaning that uninterrupted consumption doesn't always signal enjoyment. User feedback data is then processed to develop the user profile, defined in terms of:

• Most-played and preferred songs and artists
• Saved songs and albums & followed artists
• Genre, mood, style, and era preferences
• Popularity and diversity preferences
• Temporal patterns
• Demographic & geolocation profile
The user taste profile is then further subdivided based on consumption context. In the end, Spotify ends up with a context-aware user profile that might look something like the sketch below.
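A hypothetical, heavily simplified version of such a context-aware profile — every key and value here is invented for illustration:

```python
user_profile = {
    "user_id": "user_123",
    "global_taste": {
        "top_artists": ["artist_A", "artist_B"],
        "genre_affinities": {"indie rock": 0.6, "lo-fi": 0.3, "techno": 0.1},
        "popularity_preference": "mid-tail",
        "diversity_score": 0.7,
    },
    "contexts": {  # taste subdivided by consumption context
        "morning_commute": {"genre_affinities": {"indie rock": 0.8}, "skip_rate": 0.10},
        "workout":         {"genre_affinities": {"techno": 0.9},     "skip_rate": 0.30},
        "evening_focus":   {"genre_affinities": {"lo-fi": 0.9},      "skip_rate": 0.05},
    },
}
```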
This history-based user profile constantly develops and expands with fresh consumption and interaction data. Recent activity is also prioritized over the historical profile: for instance, if the user gets into a new genre and it scores well in terms of user feedback, the recommender system will try to serve more adjacent music — even if the user's all-time favorite music is wildly different.
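Spotify hasn't published how exactly it weights recent activity, but a standard way to implement this kind of recency bias is exponential time decay over interaction events — a sketch under that assumption:

```python
import time

HALF_LIFE_DAYS = 30.0  # assumed half-life; the real decay schedule is not public

def recency_weight(event_timestamp: float, now: float | None = None) -> float:
    """Weight an interaction by age: the weight halves every HALF_LIFE_DAYS."""
    now = time.time() if now is None else now
    age_days = (now - event_timestamp) / 86_400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

# A play from today weighs ~1.0; one from 60 days ago weighs ~0.25,
# so a freshly discovered genre can quickly outrank all-time favorites.
```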
Recommending music: integrating user and track representations
The intertwined constellation of algorithms behind the Spotify recommendations has
produced the two core components — track and user representations — required to serve
relevant music. Now, we just need the algorithm to make the perfect match between the two
and find the right track for the right person (and the right moment).
However, the recommendation landscape on Spotify is much more diverse than on some of
the other consumption platforms. Just consider the range of Spotify features that are
generated with the help of the recommendation engine:

1. Discover Weekly & Release Radar playlists
2. Your Daily Mix playlists
3. Artist / Decade / Mood / Genre Mix playlists
4. Special personalized playlists (Your Time Capsule, On Repeat, Repeat Rewind, etc.)
5. Personalized editorial playlists
6. Personalized browse section
7. Personalized search results
8. Playlist suggestions & the Enhance Playlist feature
9. Artist/song radio and autoplay features
In one way or another, all these diverse spaces are mediated by the recommender engine, but each of them runs on a separate algorithm with its own inner logic and reward system. The track and user representations form a sort of universal foundation for these algorithms, providing a shared model layer designed to answer the common questions of feature-specific algorithms, such as:

• User-entity affinity: "How much does user X like artist A or track B? What are the favorite artists/tracks of user Y?"
• Item similarity: "How similar are artist A and artist B? What are the 10 tracks most similar to track C?"
• Item clustering: "How would we split these 50 tracks/artists into separate groups?"

The feature-specific algorithms can then tap into these unified models to generate
recommendations optimized for a given consumption space/context. For instance, the
algorithm behind Your Time Capsule playlists would primarily engage with user-entity
affinity data to try and find the tracks that users love but haven't listened to in a while. On the
other hand, Discover Weekly algorithms would employ a mix of affinity and similarity data
to find tracks similar to the user's preferences, which they haven't heard yet. Finally,
generating Your Daily Mix playlists would involve all three methods — first, clustering the
user's preferences into several groups and then expanding these lists with similar tracks.
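A toy sketch of how a single shared embedding space can answer all three questions, with random vectors standing in for the learned track and user representations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
track_vecs = {f"track_{i}": rng.normal(size=32) for i in range(50)}  # stand-in embeddings
user_vec = rng.normal(size=32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. User-entity affinity: how much does this user like each track?
affinity = {t: cosine(user_vec, v) for t, v in track_vecs.items()}

# 2. Item similarity: the 10 tracks most similar to track_0.
by_similarity = sorted(track_vecs,
                       key=lambda t: cosine(track_vecs["track_0"], track_vecs[t]),
                       reverse=True)
top_10 = [t for t in by_similarity if t != "track_0"][:10]

# 3. Item clustering: split the 50 tracks into 5 groups.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    np.stack(list(track_vecs.values())))
```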
What is BaRT?

BaRT stands for "Bandits for Recommendations as Treatments".


The BaRT algorithm can switch between two modes: exploration and exploitation.
The exploitation mode helps users find music based on what they already like, while the exploration mode helps them find new music without considering their music consumption history.
The data considered in exploitation mode includes skips, liked songs, and repeated songs. The exploration mode relies on data ranging from trending music to what other users enjoy — Spotify users don't just want to be fed what they already love; they also want to explore new music on the platform.
Primarily, recommender systems function in exploitation mode using collaborative filtering. It is the bee's knees when users provide it with sufficient data.

However, on the flip side, exploitation mode flops when there is no data for the algorithm to act on, meaning the user has not interacted enough with the Spotify app. The exploitation mode is always at the mercy of the user.
When the exploitation mode fails, the exploration mode comes into action. The designers of
the BaRT algorithm anticipated such situations. Exploration mode does not require data from
the user to function. In fact, it blossoms in uncertainty.

According to a research paper titled Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits, by James McInerney, Benjamin Lacker, Samantha Hansen, Karl Higley, Hugues Bouchard, Alois Gruson, and Rishabh Mehrotra: "Exploration recommends content with uncertain predicted user engagement for the purpose of gathering more information. The importance of exploration has been recognized in recent years, particularly in settings with new users, new items, non-stationary preferences, and attributes."

Simply put, the algorithm gives a new song a chance to shine even though it hasn't garnered enough user data. When the BaRT algorithm gives a new song exposure, it monitors and records how users interact with the record.

When users listen to a song for less than 30 seconds, it hurts the song's chances of getting more exposure. On the other hand, when users listen for more than 30 seconds and interact with the song further via playlisting, liking, or saving, the song gets more exposure.

The BaRT algorithm considers the negative and positive recommendations before choosing
what songs to recommend to more users.
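A minimal sketch of turning that feedback into a scalar reward signal for the bandit. The 30-second threshold comes from the text above; the specific reward values are assumptions:

```python
def implicit_reward(listen_seconds: float, playlisted: bool = False,
                    liked: bool = False, saved: bool = False) -> float:
    """Map one listening event to a reward (assumed weighting, not Spotify's)."""
    if listen_seconds < 30:            # early skip: hurts the song's exposure
        return 0.0
    reward = 1.0                       # played past 30 seconds: positive signal
    if playlisted or liked or saved:   # further interaction: extra exposure
        reward += 1.0
    return reward

print(implicit_reward(12))                    # 0.0 -> negative recommendation
print(implicit_reward(95, playlisted=True))   # 2.0 -> strong positive signal
```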

The exploration mode is useful when there is low certainty about the relevance of a record, while the reverse is true for the exploitation mode. Yes, the BaRT model can make mistakes, but the beauty of the algorithm is that it learns from its mistakes and predicts better next time. It gauges consumer satisfaction using click-through rates and positive feedback.
Through reinforcement learning, BaRT is able to log, learn, and adjust using its experience with millions of Spotify users. To provide optimal satisfaction and correctly recommend songs to users, the BaRT model utilizes a multi-armed bandit, which is programmed to initiate a specific action A in anticipation of the best reward R. Therefore, future actions are reliant on former actions and rewards.

A multi-armed bandit (MAB) is tasked with picking actions that optimize the overall sum of rewards. The first type of MAB did not have an eye for the context of the task, such as device, playlist, time of day, or user features. To better serve listeners, the Spotify team came up with a better version, known as the contextual multi-armed bandit. This version collects and considers contextual data before picking the appropriate action.
Four determinants are responsible for the success or failure of a contextual MAB, namely the context, the training procedure, the reward model, and the exploitation-exploration policy. The report quoted earlier gives a formula for the reward model, built from three parameters: an item j (the record), an explanation e (why the song was chosen), and the context x. Per the paper, the reward model is a logistic regression of roughly the form

r̄(j, e, x) = σ(θᵀ [1_j ; 1_e ; x])

where θ refers to the coefficients of the logistic regression, σ is the logistic (sigmoid) function, and 1_i denotes a one-hot vector of zeros with a single 1 at index i.

The reward function, in turn, drives the action: for user u in a specific context x, the policy picks the action with the maximal predicted reward,

(j∗, e∗) = argmax over (j, e) of r̄(j, e, x)

The authors of the report quoted earlier used epsilon-greedy as the exploration approach. According to them, this "gives equal probability mass to all non-optimal items in the validity set f(e, x) and (1−ε) additional mass to the optimal action (j∗, e∗)." [2] The policy is set to "either exploit or explore the item and explanation simultaneously".
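A compact sketch of that epsilon-greedy policy over (item, explanation) pairs. The reward model here is an arbitrary stand-in for the trained logistic regression, and the candidate pairs are invented:

```python
import random

EPSILON = 0.1  # exploration rate; the value used in production is not public

def predicted_reward(item: str, explanation: str, context: dict) -> float:
    """Stand-in for the logistic-regression reward model r̄(j, e, x)."""
    return (hash((item, explanation, tuple(sorted(context.items())))) % 100) / 100

def epsilon_greedy(pairs: list[tuple[str, str]], context: dict) -> tuple[str, str]:
    """With probability 1-ε exploit the optimal (item, explanation) pair;
    otherwise spread the remaining mass over the non-optimal pairs."""
    best = max(pairs, key=lambda p: predicted_reward(*p, context))
    if random.random() < EPSILON and len(pairs) > 1:
        return random.choice([p for p in pairs if p != best])
    return best

pairs = [("track_1", "because you like artist_A"), ("track_2", "trending in your area")]
print(epsilon_greedy(pairs, {"device": "mobile", "time_of_day": "morning"}))
```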

The procedure by which the algorithm trains itself is not out of the ordinary, as BaRT retrains periodically in batch mode. The problem that arises when the algorithm encounters a new artist with no prior songs to learn from is called the cold start problem: collaborative filtering is not the best method for new and relatively unknown songs.
The goals and rewards of Spotify recommendation algorithms
Now, as we mentioned at the beginning of this breakdown, the overarching goals of the Spotify recommender system have to do primarily with retention, time spent on the platform, and general user satisfaction. However, these top-level goals are way too broad to devise a balanced reward system for ML algorithms serving content recommendations across a variety of features and contexts — so the definition of success for the algorithms will largely depend on where and why the user engages with the system.
For instance, the success of the autoplay queue features is defined mainly in terms of user engagement — explicit and implicit feedback such as listen-through and skip rates, library and playlist saves, click-throughs to the artist profile and/or album, shares, and so on. In the case of Release Radar playlists, however, the set of rewards would be quite different, as users often skim through the playlist rather than listen to it from start to finish. So, instead of studying engagement with content, the algorithms optimize for long-term feature retention and feature-specific behavior: "Users are satisfied with the feature if they keep coming back to it every week; users are satisfied with Release Radar if they save tracks to their playlists or libraries."
Finally, in some cases, Spotify employs yet another set of algorithms just to devise the reward functions for a specific feature. For example, Spotify has trained a separate ML model to predict user satisfaction with Discover Weekly (with the training set sourced from user surveys). This model looks at the entire wealth of user interaction data, the user's past Discover Weekly behavior, and user goal clusters (i.e., whether the user engaged with Discover Weekly as background listening, to search for new music, to save music for later, etc.) — and then produces a unified satisfaction metric based on all that activity.
References

• https://www.loudlab.org/blog/understanding-how-spotify-algorithm-works/
• https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022#:~:text=%22We%20can%20understand%20songs%20to,recommend%20song%20Z%20to%20them.
• McInerney, J., Lacker, B., Hansen, S., Higley, K., Bouchard, H., Gruson, A., & Mehrotra, R. (2018). Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits. Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18).
• ChatGPT
