Introduction

  • Up until now, in all the sections previously, we’ve looked through many of the foundational elements of a recommendation system.
  • We’ve talked about topics ranging from candidate generation, to ranking and re-ranking, to a hands-on music rec project coded using pyspark.
  • Now, lets talk about the current research within this domain. As I make my way through more papers, I’ll keep updating this page with all the information I think is vital to share!

Multi-objective hyper-parameter optimization of behavioral song embeddings

  • In this paper by Massimo Quadrana, Antoine Larreche-Mouly and Matthias Mauch from Apple Music, the look at the results of tuning the hyperparameter on their song embeddings.
  • In this paper, they study how optimizing the hyperparameter of their song embeddings, (which is based on Word2Vec) on a few downstream tasks.
    • Specifically, the downstream tasks they focus on are next-song recommendation, false neighbor rejection and artist/genre clustering.
  • This paper also found that next-song recommendation quality of Word2Vec is anti-correlated with song popularity, helping negate the bias that comes with recommending popular songs.
  • Below, I have taken segments of this paper and divided them into smaller sections for ease of understanding. Let’s get started in looking at the details of this paper!
  • Note: All the work, experiments and images are taken from Multi-objective hyper-parameter optimization of behavioral song embeddings, I’d recommend going through that paper for yourself!

Background on RecSys

  • Today’s Recommendation Systems (RecSys) utilize embedding vectors to represent users and items.
  • These embeddings are leveraged to execute several tasks such as: song recommendation, search, tagging, and generating artist/ genre representations to later be used for music recommendation.
  • Usually, we create these embeddings early on in the RecSys pipeline, often through self-supervised methods such as Word2Vec.
    • Note: methods such as Word2Vec come with default hyperparameters tuned on non RecSys tasks such as NLP.
  • Due to this, the field needs resilient metrics to compare against these optimizing embeddings.
  • Recent research states that hyperparameter optimization can drastically improve recommendation quality from these tasks.
  • These authors believe that people who leverage music RecSys could benefit from a deeper understanding of how the song embedding behaves when its optimized in relation to factors such as song popularity to help alleviate popularity bias, amongst other things.

What the paper provides

  • Now that we’ve gotten a good background on the matter, lets go ahead and figure out what changes and contributions this paper makes to the field of RecSys.
  • This paper provides a framework with which you can monitor the performance of embedding models. Additionally, it also provides a means to leverage your embedding models for new and different tasks.
  • It does this by:
    • Defining metrics and optimization objectives for tasks such as next-song recommendation, false neighbor rejection, and genre/artist clustering.
    • Proposing a multi-objective optimization approach to combine recommendation and clustering tasks.
    • Showing next-song recommendation quality and song popularity are anti-correlated.
    • Studying different embedding optimizations at scale on a large dataset of billions of listening events.

How it all works

  • First lets talk about the embeddings.
    • This paper considers song embeddings based on Word2Vec. As we know, Word2Vec was created to represent words in a corpus through a self-supervised shallow neural network trained to learn dense vector representation of words from sentences based on their surrounding words using either Skipgram or CBOW.
    • This same principle is used in RecSys to compute item embeddings from user interactions such as site sessions or playlists.
    • This paper will study song embedding optimization for 4 tasks being: next-song prediction, false neighbor rejection, artist clustering and genre clustering.
  • Next-Song Prediction
    • This task recommends the next song to a user based on some context it has previously garnered. Sometimes, this means looking at the songs the user has previously listened to in a specific amount of time.
    • For this paper however, the goal was not to have the most efficient next song recommender but to study the effects of optimization on the embedding space and thus, they have only predicted next songs based on the single previously played song.
    • Here, the target song is the song from the evaluation set and the query song is the song the user has just played which we will use to recommend the target song.
    • HitRate is the number of times the target song was contained within the 100 nearest neighbors of the query song.
  • False Neighbor Rejection
    • False Neighbor Rejection task looks to filtering out songs that are within the closest neighborhood of another song but have little bearing to each other and have low metadata similarity.
    • Essentially these are the songs that have appeared in close proximity merely due to chance.
    • To do this, the paper uses a chi-squared \(X^2\) test to identify false neighbors within the dataset.
      • This test compares the observed co-occurrence frequencies of every pair of songs against expected co-occurrence frequencies.
      • A limitation here is that this approach requires a large amount of events to be able to successfully identify co-occurrence song pairs.
  • Artist and Genre Clustering
    • Lastly, the paper is interested in knowing how well these two classes, Artist and Genre, exist closely together in an embedding space.
    • Here, they introduce a concept of Local Genre Coherence of the embedding space as the average fraction of nearest neighbors belonging to the same artist as the query song.
    • “For example, an embedding space has Local Genre Coherence = 0.5 if on average 50% of the nearest neighbors of each song have its same primary genre”
    • However, since computing embeddings for every song can be very expensive, the paper proposes using a proxy metric called Calinski-Harabasz index or Variance Ratio Criterion (VRC).

The Experiments

  • Now that we have a general understanding of the tasks the authors wanted to target in this paper, lets take a deeper dive into their experiment.
  • The authors optimize the hyperparameters of their skipgram Word2Vec embedding by running Bayesian Hyper-Parameter Optimization that they initialized with 10 iterations of Random Search before having the Bayesian Search converge.
  • Word2Vec here uses the skipgram architecture with negative sampling because it provides computational efficiency.
  • The image below details the hyperparameters that Word2Vec used for their experiment.

  • \(d\) is the embedding size
  • \(L\) is the max window length
  • \(a\) alpha controls the negative sampling
  • \(N\) is the number of negative samples used to approximate the softmax
  • \(lambda\) is the learning rate

  • The table above represents the results of the two data sets when being optimized for a single or multiple tasks.

  • Next-Song Recommendation optimized
    • Lets start by looking at the first task and how they optimized for it.
    • The optimization was analyzed by running Hyper-Parameter Optimization (HPO) with the HitRate.
  • Genre and Artist Clustering optimized
    • Let’s move on to the next task here.
    • This optimization was done via Variance Ratio Criterion as a proxy objective.
    • The result of this was that the optimal configurations here had significantly worse next-song recommendation quality vs what we saw for single-objective next-song recommendation.
    • The learning here is that optimizing embeddings for clustering alone can hurt rather than help the recommendation quality.
  • Multi-objective optimization
    • As we saw above, optimizing for one property can directly harm the other, thus the paper looks into simultaneous optimization of both properties by using h MultiObjective Hyper-parameter Optimization (MOHPO).
    • This proved to be the most effective method; by combining recommendation and clustering objectives, both properties were able to mutually benefit from each other.
  • Popularity bias problem
    • In order to run this test, the songs were categorized into buckets of popularity.
    • The HitRate drops as the query and target song begin to differ in popularity. Thus, the HitRate is anti-correlated with popularity.
    • Optimization seems to balance the recommendation accuracy more across popularity buckets, which is what we want.
  • Try to use on another task
    • The authors also wanted to check if embedding optimization had benefits on tasks other than what it was originally optimized for.
    • It was found that “embedding optimization can have beneficial effects on tasks different from the one it was initially designed to address. Furthermore, it provides additional evidence on the superiority multi-objective optimization over single-objective one.”

TL;DR

  • Let’s talk about the key take-aways from this paper now.
  • This paper analyzed offline optimization of song embeddings by looking at individual tasks that we defined above.
  • The paper also proposed a way to leverage and optimize Word2Vec hyperparameters on recommendation and clustering tasks individualy and jointly.
  • We saw that there was substantial benefit over single-objective optimization variants when we used multi-objective optimization approach.
  • Personally, I found leveraging Word2Vec for song embeddings and then optimizing for each task pretty fascinating!