• Candidate generation is the first stage of recommendations and is typically achieved by finding features for users that relate to features of the items. Example, if the user searches for food, the search engine will look at the users location to determine possible candidates. It would filter out searches that were not in New York and look at latent representation of the user and find items that have latent representations that are close using approximate nearest neighbor algorithms.
  • Note: Candidate generation is for recommender systems like a search engine where the candidates are not particularly based on time (Eg: searching for how to tie a tie should roughly return the same results in every instance) whereas candidate retrieval is for news-feed based systems where time is important, such as Instagram or Facebook, to make sure you don’t keep seeing the same content again.
    • However, in order to achieve this, we would have to iterate over every item which could be computationally expensive.
  • Let’s dive a little deeper into candidate generation, aka the first step in a general recommendation architecture.
  • In recommender systems, candidate generation is the process of selecting a set of items that are likely to be recommended to a user. This process is typically performed after user preferences have been collected, and it involves filtering a large set of potential recommendations down to a more manageable number. The goal of candidate generation is to reduce the number of items that must be evaluated by the system, while still ensuring that the most relevant recommendations are included.
  • Given a query (user information), the model generates a set of relevant candidates (videos, movies).
  • There are two common candidate generation approaches:
    1. Content-based filtering: Uses similarity between content to recommend new content.
      • For e.g., if user watches corgi videos, the model will recommend more corgi videos.
    2. Collaborative filtering: Uses similarity between queries (2 or more users) and items (videos, movies) to provide recommendations.
      • For e.g., if user \(A\) watches corgi videos and user \(A\) is similar to user \(B\) (in demographics and other areas), then the model can recommend corgi videos to user \(B\) even if user \(B\) has never watched a corgi video before.


  • A quick note on notation you’ll see in the notes moving forward: we will represent \(r\) to be ratings, specifically if user \(i\) has rated item \(j\), \(n\) to hold a number value of users or items, and \(y\) to hold the rating given by a user to a particular item. We can see the notation in the image below (source)

Content-based Filtering

  • Content-based filtering is a type of recommendation system that suggests items to users based on the similarity between the features of items and the user’s preferences. In other words, it recommends items that are similar to items that a user has liked or interacted with in the past.
  • Content-based filtering works by analyzing the attributes or features of the items, such as text descriptions, images, or tags, and building a profile of the user’s preferences based on the attributes of the items they have interacted with. The system then suggests items that are similar to those the user has interacted with, based on the similarity between the item attributes and the user profile.
  • For example, consider a content-based filtering system for recommending movies. The system analyzes the attributes of the movies, such as the genre, actors, directors, and plot summary, and builds a user profile based on the movies they have liked in the past. If a user has previously liked action movies with Bruce Willis, the system might recommend other action movies that feature Bruce Willis.
  • Content-based filtering has several advantages over other recommendation approaches. For instance, it can be used effectively in situations where there is limited or no data on user behavior, such as in new or niche domains where there may not be a large user base. Additionally, content-based filtering can provide more personalized recommendations since it relies on the user’s individual preferences rather than the behavior of others.
  • However, content-based filtering also has limitations. For instance, it may fail to recommend items that are outside the user’s existing preferences or that are not similar to the items they have previously interacted with. Additionally, it may not be effective in situations where there is a lack of item information or where the items have limited attributes or features.
  • Item-based: In item-centered content-based filtering, the recommender system suggests new items based on their similarity to the previously selected items, which are considered as implicit feedback. We can see an example in the image below. (source)

  • User-based: User-centered content-based filtering is a recommendation system approach where user preferences are collected through explicit feedback, such as questionnaires. This information is then used to recommend items with similar features to the user’s liked items. We can see an example in the image below. (source)


  • Model does not need data about other users, only needs info for the current user. Makes it easier to scale.
  • Can recommend niche items to each user, items that other users may not be interested in.


  • Requires a lot of domain knowledge as the features are hand-engineered. Thus, the model is only as good as the hand-engineered features.
  • Model will only make recommendations based on existing interests of the user and is not able to expand on their interests.


  • Let’s look at an example, first by looking at features in the image below (source):
  • We will be creating an algorithm that learns to match users to items via Deep learning.
  • Below is what our architecture will look like (source):
  • Let’s look at a sequential model in TensorFlow implementing a content-based filtering model: (source)
    • We have 2 dense hidden layers and the final layer outputs 32 numbers. These are all using the relu activation function.

Code deep-dive

import numpy as np
import numpy.ma as ma
from numpy import genfromtxt
from collections import defaultdict
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
from recsysNN_utils import *
pd.set_option("display.precision", 1)

# Load Data, set configuration variables
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()

num_user_features = user_train.shape[1] - 3  # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1] - 1  # remove movie id at train time
uvs = 3  # user genre vector start
ivs = 3  # item genre vector start
u_s = 3  # start of columns to use in training, user
i_s = 1  # start of columns to use in training, items
scaledata = True  # applies the standard scalar to data if true
print(f"Number of training vectors: {len(item_train)}")
  • Let’s take a quick look at our feature vectors and data:
pprint_train(user_train, user_features, uvs,  u_s, maxcount=5)
[user id]	[rating count]	[rating ave]	Act ion	Adve nture	Anim ation	Chil dren	Com edy	Crime	Docum entary	Drama	Fan tasy	Hor ror	Mys tery	Rom ance	Sci -Fi	Thri ller
2	16	4.1	3.9	5.0	0.0	0.0	4.0	4.2	4.0	4.0	0.0	3.0	4.0	0.0	4.2	3.9
2	16	4.1	3.9	5.0	0.0	0.0	4.0	4.2	4.0	4.0	0.0	3.0	4.0	0.0	4.2	3.9
2	16	4.1	3.9	5.0	0.0	0.0	4.0	4.2	4.0	4.0	0.0	3.0	4.0	0.0	4.2	3.9
2	16	4.1	3.9	5.0	0.0	0.0	4.0	4.2	4.0	4.0	0.0	3.0	4.0	0.0	4.2	3.9
2	16	4.1	3.9	5.0	0.0	0.0	4.0	4.2	4.0	4.0	0.0	3.0	4.0	0.0	4.2	3.9

  • Let’s prepare the data by doing a bit of preprocessing:
# scale training data
if scaledata:
    item_train_save = item_train
    user_train_save = user_train

    scalerItem = StandardScaler()
    item_train = scalerItem.transform(item_train)

    scalerUser = StandardScaler()
    user_train = scalerUser.transform(user_train)

    print(np.allclose(item_train_save, scalerItem.inverse_transform(item_train)))
    print(np.allclose(user_train_save, scalerUser.inverse_transform(user_train)))
  • And now split it into test and train:
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test       = train_test_split(y_train,    train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test  data shape: {item_test.shape}")
movie/item training data shape: (46549, 17)
movie/item test  data shape: (11638, 17)
  • Now, let’s construct our neural network as the architecture displayed above. It will have two networks that are combined by a dot product.
# UNQ_C1

num_outputs = 32
user_NN = tf.keras.models.Sequential([
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),

item_NN = tf.keras.models.Sequential([
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = Model([input_user, input_item], output)

Model: "model"
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 14)]         0                                            
input_2 (InputLayer)            [(None, 16)]         0                                            
sequential (Sequential)         (None, 32)           40864       input_1[0][0]                    
sequential_1 (Sequential)       (None, 32)           41376       input_2[0][0]                    
tf_op_layer_l2_normalize/Square [(None, 32)]         0           sequential[0][0]                 
tf_op_layer_l2_normalize_1/Squa [(None, 32)]         0           sequential_1[0][0]               
tf_op_layer_l2_normalize/Sum (T [(None, 1)]          0           tf_op_layer_l2_normalize/Square[0
tf_op_layer_l2_normalize_1/Sum  [(None, 1)]          0           tf_op_layer_l2_normalize_1/Square
tf_op_layer_l2_normalize/Maximu [(None, 1)]          0           tf_op_layer_l2_normalize/Sum[0][0
tf_op_layer_l2_normalize_1/Maxi [(None, 1)]          0           tf_op_layer_l2_normalize_1/Sum[0]
tf_op_layer_l2_normalize/Rsqrt  [(None, 1)]          0           tf_op_layer_l2_normalize/Maximum[
tf_op_layer_l2_normalize_1/Rsqr [(None, 1)]          0           tf_op_layer_l2_normalize_1/Maximu
tf_op_layer_l2_normalize (Tenso [(None, 32)]         0           sequential[0][0]                 
tf_op_layer_l2_normalize_1 (Ten [(None, 32)]         0           sequential_1[0][0]               
dot (Dot)                       (None, 1)            0           tf_op_layer_l2_normalize[0][0]   
Total params: 82,240
Trainable params: 82,240
Non-trainable params: 0
  • Finally, it’s time to use a mean squared error loss and an Adam optimizer
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
  • Let’s get to training!
model.fit([user_train[:, u_s:], item_train[:, i_s:]], ynorm_train, epochs=30)
Train on 46549 samples
Epoch 1/30
46549/46549 [==============================] - 6s 122us/sample - loss: 0.1254
Epoch 2/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1187
Epoch 3/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1169
Epoch 4/30
46549/46549 [==============================] - 5s 118us/sample - loss: 0.1154
Epoch 5/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1142
Epoch 6/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.1130
Epoch 7/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.1119
Epoch 8/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.1110
Epoch 9/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.1095
Epoch 10/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1083
Epoch 11/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1073
Epoch 12/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1066
Epoch 13/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1059
Epoch 14/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1054
Epoch 15/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1047
Epoch 16/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.1041
Epoch 17/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1036
Epoch 18/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1030
Epoch 19/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1027
Epoch 20/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1021
Epoch 21/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.1018
Epoch 22/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1014
Epoch 23/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.1010
Epoch 24/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.1006
Epoch 25/30
46549/46549 [==============================] - 5s 116us/sample - loss: 0.1003
Epoch 26/30
46549/46549 [==============================] - 5s 114us/sample - loss: 0.0999
Epoch 27/30
46549/46549 [==============================] - 5s 115us/sample - loss: 0.0997
Epoch 28/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.0991
Epoch 29/30
46549/46549 [==============================] - 5s 113us/sample - loss: 0.0989
Epoch 30/30
46549/46549 [==============================] - 5s 112us/sample - loss: 0.0985
<tensorflow.python.keras.callbacks.History at 0x7fab691f12d0>
  • Evaluate the model to determine the loss on the test data.
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], ynorm_test)
11638/11638 [==============================] - 0s 33us/sample - loss: 0.1045
  • Making predictions is the next step, first lets start off by predicting for a new user:
new_user_id = 5000
new_rating_ave = 1.0
new_action = 1.0
new_adventure = 1
new_animation = 1
new_childrens = 1
new_comedy = 5
new_crime = 1
new_documentary = 1
new_drama = 1
new_fantasy = 1
new_horror = 1
new_mystery = 1
new_romance = 5
new_scifi = 5
new_thriller = 1
new_rating_count = 3

user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_action, new_adventure, new_animation, new_childrens,
                      new_comedy, new_crime, new_documentary,
                      new_drama, new_fantasy, new_horror, new_mystery,
                      new_romance, new_scifi, new_thriller]])
  • Let’s look at the top-rated movies for the new user:
# generate and replicate the user vector to match the number movies in the data set.
user_vecs = gen_user_vecs(user_vec,len(item_vecs))

# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs,  item_vecs, model, u_s, i_s, 
                                                                       scaler, scalerUser, scalerItem, scaledata=scaledata)

print_pred_movies(sorted_ypu, sorted_user, sorted_items, movie_dict, maxcount = 10)
y_p	movie id	rating ave	title	genres
4.86762	64969	3.61765	Yes Man (2008)	Comedy
4.86692	69122	3.63158	Hangover, The (2009)	Comedy|Crime
4.86477	63131	3.625	Role Models (2008)	Comedy
4.85853	60756	3.55357	Step Brothers (2008)	Comedy
4.85785	68135	3.55	17 Again (2009)	Comedy|Drama
4.85178	78209	3.55	Get Him to the Greek (2010)	Comedy
4.85138	8622	3.48649	Fahrenheit 9/11 (2004)	Documentary
4.8505	67087	3.52941	I Love You, Man (2009)	Comedy
4.85043	69784	3.65	Brüno (Bruno) (2009)	Comedy
4.84934	89864	3.63158	50/50 (2011)	Comedy|Drama
  • Now let’s make predictions for an existing user:
uid =  36 
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, scalerUser.inverse_transform(user_train), item_vecs, user_to_genre)

# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs, item_vecs, model, u_s, i_s, scaler, 
                                                                      scalerUser, scalerItem, scaledata=scaledata)
sorted_y = y_vecs[sorted_index]

#print sorted predictions
print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, item_features, ivs, uvs, movie_dict, maxcount = 10)
y_p	y	user	user genre ave	movie rating ave	title	genres
3.1	3.0	36	3.00	2.86	Time Machine, The (2002)	Adventure
3.0	3.0	36	3.00	2.86	Time Machine, The (2002)	Action
2.8	3.0	36	3.00	2.86	Time Machine, The (2002)	Sci-Fi
2.3	1.0	36	1.00	4.00	Beautiful Mind, A (2001)	Romance
2.2	1.0	36	1.50	4.00	Beautiful Mind, A (2001)	Drama
1.6	1.5	36	1.75	3.52	Road to Perdition (2002)	Crime
1.6	2.0	36	1.75	3.52	Gangs of New York (2002)	Crime
1.5	1.5	36	1.50	3.52	Road to Perdition (2002)	Drama
1.5	2.0	36	1.50	3.52	Gangs of New York (2002)	Drama

Collaborative Filtering

  • Collaborative filtering is a type of recommendation system that suggests items to users based on the similarities and patterns in the behavior of a group of users. In other words, it recommends items that users with similar preferences have liked or interacted with.

Collaborative filtering works on the premise that if two users share similar tastes, they’re likely to enjoy similar items/products.

  • Collaborative filtering works by analyzing the behavior and preferences of a group of users, such as their ratings, purchases, or clicks, and identifying users who have similar preferences or behavior. The system then recommends items to a user based on the items that have been liked or interacted with by other users who share similar preferences or behavior.
  • For example, consider a collaborative filtering system for recommending movies. The system analyzes the ratings or reviews of movies provided by a group of users and identifies users who have similar preferences. If a user has rated several movies highly that have also been highly rated by other users with similar preferences, the system might recommend unseen movies to the user that have been highly rated by those similar users.
  • Collaborative filtering has several advantages over other recommendation approaches. For instance, it can be used effectively in situations where there is limited information on item attributes or where the user base is large and diverse. Additionally, collaborative filtering can provide serendipitous recommendations since it can suggest items that a user may not have discovered otherwise.
  • However, collaborative filtering also has limitations. For instance, it may not be effective in situations where there is limited or no data on user behavior or where users have unique preferences that are not shared by others. Additionally, it may suffer from the “cold start” problem, where new items or users have limited data available, making it difficult to recommend items accurately.
  • To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user \(A\) based on the interests of a similar user \(B\).
  • In contrast to content-based filtering, collaborative filtering is able to learn embeddings automatically, without relying on hand-engineered features.
  • It is also able to take interests of user \(A\) and recommend it to user \(B\).
  • The goal of a collaborative filtering recommender system is to generate two vectors:
    • A ‘parameter vector’ for each user that embodies the item’s tastes of a user.
    • A feature vector for each item which embodies some description of the movie.
    • The dot product of the two vectors plus the bias term should produce an estimate of the rating the user might give to that movie.


  • We don’t need to have prior domain knowledge as the embeddings are automatically learned.


  • Cold start problem: If the item/document has not been seen during training, our model can not create an embedding from it and can not query the model for this item.
    • Mean Normalization can help with this, as we see below. Another solution is to show some reasonable items to the new users and have them rate a few items.
    • You can also use the user’s information (demographics, expressed preferences, web browser they are using, their IP) to recommend them reasonable items.

A Movie Recommendation Case-study

  • Consider a movie recommendation system (taken from Google’s course on Recommendation Systems in which the training data consists of a feedback matrix in which:
    • Each row represents a user.
    • Each column represents an item (a movie).
  • The feedback about movies falls into one of two categories:
    • Explicit: users specify how much they liked a particular movie by providing a numerical rating.
    • Implicit: if a user watches a movie, the system infers that the user is interested.
  • To simplify, we will assume that the feedback matrix is binary; that is, a value of 1 indicates interest in the movie.
  • When a user visits the homepage, the system should recommend movies based on both:
    • Similarity to movies the user has liked in the past.
    • Movies that similar users liked.
  • For the sake of illustration, let’s hand-engineer some features for the movies described in the following table: (source for image below)

1D Embedding

  • Suppose we assign to each movie a scalar in [-1, 1] that describes whether the movie is for children (negative values) or adults (positive values). Suppose we also assign a scalar to each user in [-1, 1] that describes the user’s interest in children’s movies (closer to -1) or adult movies (closer to +1). The product of the movie embedding and the user embedding should be higher (closer to 1) for movies that we expect the user to like. (source for image below)

  • In the diagram below (source), each checkmark identifies a movie that a particular user watched. The third and fourth users have preferences that are well explained by this feature—the third user prefers movies for children and the fourth user prefers movies for adults. However, the first and second users’ preferences are not well explained by this single feature.

2D Embedding

  • One feature was not enough to explain the preferences of all users. To overcome this problem, let’s add a second feature: the degree to which each movie is a blockbuster or an art-house movie. With a second feature, we can now represent each movie with the following two-dimensional embedding: (source)

  • We again place our users in the same embedding space to best explain the feedback matrix: for each \((user, item)\) pair, we would like the dot product of the user embedding and the item embedding to be close to 1 when the user watched the movie, and to 0 otherwise. (source)

  • Note: We represented both items and users in the same embedding space. This may seem surprising. After all, users and items are two different entities. However, you can think of the embedding space as an abstract representation common to both items and users, in which we can measure similarity or relevance using a similarity metric. In this example, we hand-engineered the embeddings. In practice, the embeddings can be learned automatically, which is the power of collaborative filtering models. In the next two sections, we will discuss different models to learn these embeddings, and how to train them.

  • The collaborative nature of this approach is apparent when the model learns the embeddings. Suppose the embedding vectors for the movies are fixed. Then, the model can learn an embedding vector for the users to best explain their preferences. Consequently, embeddings of users with similar preferences will be close together. Similarly, if the embeddings for the users are fixed, then we can learn movie embeddings to best explain the feedback matrix. As a result, embeddings of movies liked by similar users will be close in the embedding space.


  • As a real-world example, let’s look at an example (source) with binary labels signifying engagement (using favorites, likes, and clicks):

  • Let’s talk about the meaning of each rating in the above dataset. We can choose what each of our labels mean so here, we have chosen the following:
    • 1: user was engaged after being shown the item.
    • 0: user did not engage after being shown the item.
    • ?: the item is not yet shown to the user.

Matrix factorization

  • Matrix factorization is a simple embedding model. Given the feedback matrix \(\mathrm{A} \in R^{m \times n}\), where \(m\) is the number of users (or queries) and \(n\) is the number of items, the model learns:
    • A user embedding matrix \(U \in \mathbb{R}^{m \times d}\), where row \(i\) is the embedding for user \(i\).
    • An item embedding matrix \(V \in \mathbb{R}^{n \times d}\), where row \(j\) is the embedding for item \(j\). The image below depicts this: (source)

  • The embeddings are learned such that the product \(U V^{T}\) is a good approximation of the feedback matrix. Observe that the \((i, j)\) entry of \(U . V^{T}\) is simply the dot product \(\left\langle U_{i}, V_{j}\right\rangle\) of the embeddings of user \(i\) and item \(j\), which you want to be close to \(A_{i, j}\).

  • Note: Matrix factorization typically gives a more compact representation than learning the full matrix. The full matrix has \(O(n m)\) entries, while the embedding matrices \(U, V\) have \(O((n+m) d)\) entries, where the embedding dimension \(d\) is typically much smaller than \(m\) and \(n\). As a result, matrix factorization finds latent structure in the data, assuming that observations lie close to a low-dimensional subspace. In the preceding example, the values of \(n, m\), and \(d\) are so low that the advantage is negligible. In real-world recommendation systems, however, matrix factorization can be significantly more compact than learning the full matrix.

Choosing the Objective Function

  • One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:
\[\min _{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} \sum_{(i, j) \in \mathrm{obs}}\left(A_{i j}-\left\langle U_{i}, V_{j}\right\rangle\right)^{2}\]
  • In this objective function, you only sum over observed pairs (i, j), that is, over non-zero values in the feedback matrix. However, only summing over values of one is not a good idea-a matrix of all ones will have a minimal loss and produce a model that can’t make effective recommendations and that generalizes poorly. The image below (source), depicts this.

  • Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between \(A\) and its approximation \(U V^{T}\):
\[\min _{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}}\left\|A-U V^{T}\right\|_{F}^{2}\]
  • You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications, the matrix \(A\) may be very sparse. For example, think of all the videos on YouTube compared to all the videos a particular user has viewed. The solution \(U V^{T}\) (which corresponds to the model’s approximation of the input matrix) will likely be close to zero, leading to poor generalization performance.

  • In contrast, Weighted Matrix Factorization decomposes the objective into the following two sums:

    • A sum over observed entries.
    • A sum over unobserved entries (treated as zeroes).
\[\min _{U \in \mathbb{R}^{m \times d}, V \in \mathbb{R}^{n \times d}} \sum_{(i, j) \in \mathrm{obs}}\left(A_{i j}-\left\langle U_{i}, V_{j}\right\rangle\right)^{2}+w_{0} \sum_{(i, j) \notin \mathrm{obs}}\left(\left\langle U_{i}, V_{j}\right\rangle\right)^{2}\]
  • Here, \(w_{0}\) is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.

  • Note: In practical applications, you also need to weight the observed pairs carefully. For example, frequent items (for example, extremely popular YouTube videos) or frequent queries (for example, heavy users) may dominate the objective function. You can correct for this effect by weighting training examples to account for item frequency. In other words, you can replace the objective function by:

    \[\sum_{(i, j) \in \mathrm{obs}} w_{i, j}\left(A_{i, j}-\left\langle U_{i}, V_{j}\right\rangle\right)^{2}+w_{0} \sum_{(i, j) \notin \mathrm{obs}}\left\langle U_{i}, V_{j}\right\rangle^{2}\]
    • where \(w_{i, j}\) is a function of the frequency of query \(\mathrm{i}\) and item \(\mathrm{j}\).

Minimizing the Objective Function

  • Common algorithms to minimize the objective function include:
  • Stochastic gradient descent (SGD) is a generic method to minimize loss functions.
  • Weighted Alternating Least Squares (WALS) is specialized to this particular objective. The objective is quadratic in each of the two matrices \(\mathrm{U}\) and \(\mathrm{V}\). (Note, however, that the problem is not jointly convex.) WALS works by initializing the embeddings randomly, then alternating between:
    • Fixing \(U\) and solving for \(V\).
    • Fixing \(V\) and solving for \(U\).
  • Each stage can be solved exactly (via solution of a linear system) and can be distributed. This technique is guaranteed to converge because each step is guaranteed to decrease the loss.

Cost function for binary labels (regression to classification)

  • Let’s look at how we can generalize our algorithm, note that this is a lot like linear regression but we instead to predict these binary outputs, thus move from regression to a binary classification problem.
    • Below we have the cost function for this binary application: (source)
    • This is how we take linear regression-like collaborative filtering algorithm and generalize it to work for binary labels.


  • SGD and WALS have advantages and disadvantages. Review the information below to see how they compare:


  • (+) Very flexible—can use other loss functions.
  • (+) Can be parallelized.
  • (-) Slower—does not converge as quickly.
  • (-) Harder to handle the unobserved entries (need to use negative sampling or gravity).


  • (-) Reliant on Loss Squares only.
  • (+) Can be parallelized.
  • (+) Converges faster than SGD.
  • (+) Easier to handle unobserved entries.

Example: Learning process for vectors

  • Below is a diagram of how vectors are learned using matrix factorization. (source)

  • \(Y\) contains ratings; 0.5 to 5 inclusive in 0.5 steps. 0 if the movie has not been rated.
  • \(R\) has a 1 where movies have been rated.
  • Movies are in rows, users in columns. Each user has a parameter vector \(w_{user}\) and bias. Each movie has a feature vector \(x_{movie}\). These vectors are simultaneously learned by using the existing user/movie ratings as training data.
  • One training example is shown above: \(w^{(1)} \cdot x^{(1)} + b^{(1)}=4\).
  • The feature vector \(x_{movie}\) must satisfy all the users while the user vector \(w_{user}\) must satisfy all the movies. Since all the users collaborate to generate the rating set, we call it collaborative filtering.
  • Once the feature vectors and parameters are learned, they can be used to predict how a user might rate an unrated movie.

Code deep-dive

  • Now lets look into the code for a movie recommendation model from Andrew Ng’s Coursera specialization.
  • The first part of the exercise was to implement the collaborative filtering cost function. Let’s skip ahead to training:
    movieList, movieList_df = load_Movie_List_pd()
    my_ratings = np.zeros(num_movies) #  Initialize my ratings
    # Check the file small_movie_list.csv for id of each movie in our dataset
    # For example, Toy Story 3 (2010) has ID 2700, so to rate it "5", you can set
    my_ratings[2700] = 5 
    #Or suppose you did not enjoy Persuasion (2007), you can set
    my_ratings[2609] = 2;
    # We have selected a few movies we liked / did not like and the ratings we gave are as follows:
    my_ratings[929]  = 5   # Lord of the Rings: The Return of the King, The
    my_ratings[246]  = 5   # Shrek (2001)
    my_ratings[2716] = 3   # Inception
    my_ratings[1150] = 5   # Incredibles, The (2004)
    my_ratings[382]  = 2   # Amelie (Fabuleux destin d'Amélie Poulain, Le)
    my_ratings[366]  = 5   # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
    my_ratings[622]  = 5   # Harry Potter and the Chamber of Secrets (2002)
    my_ratings[988]  = 3   # Eternal Sunshine of the Spotless Mind (2004)
    my_ratings[2925] = 1   # Louis Theroux: Law & Disorder (2008)
    my_ratings[2937] = 1   # Nothing to Declare (Rien à déclarer)
    my_ratings[793]  = 5   # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
    my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]
    print('\nNew user ratings:\n')
    for i in range(len(my_ratings)):
        if my_ratings[i] > 0 :
            print(f'Rated {my_ratings[i]} for  {movieList_df.loc[i,"title"]}');
    New user ratings:
    Rated 5.0 for  Shrek (2001)
    Rated 5.0 for  Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
    Rated 2.0 for  Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
    Rated 5.0 for  Harry Potter and the Chamber of Secrets (2002)
    Rated 5.0 for  Pirates of the Caribbean: The Curse of the Black Pearl (2003)
    Rated 5.0 for  Lord of the Rings: The Return of the King, The (2003)
    Rated 3.0 for  Eternal Sunshine of the Spotless Mind (2004)
    Rated 5.0 for  Incredibles, The (2004)
    Rated 2.0 for  Persuasion (2007)
    Rated 5.0 for  Toy Story 3 (2010)
    Rated 3.0 for  Inception (2010)
    Rated 1.0 for  Louis Theroux: Law & Disorder (2008)
    Rated 1.0 for  Nothing to Declare (Rien à déclarer) (2010)
  • Above we are starting with my ratings and printing all the ratings > 0.
  • Now let’s add these reviews to 𝑌 and 𝑅 and normalize the ratings.
  # Reload ratings and add new ratings
  Y, R = load_ratings_small()
  Y    = np.c_[my_ratings, Y]
  R    = np.c_[(my_ratings != 0).astype(int), R]
  # Normalize the Dataset
  Ynorm, Ymean = normalizeRatings(Y, R)
  • Now, let’s initialize the parameters and select the Adam optimizer
  #  Useful Values
  num_movies, num_users = Y.shape
  num_features = 100
  # Set Initial Parameters (W, X), use tf.Variable to track these variables
  tf.random.set_seed(1234) # for consistent results
  W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
  X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
  b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')
  # Instantiate an optimizer.
  optimizer = keras.optimizers.Adam(learning_rate=1e-1)
  • The final step left here is to train our collaborative filtering model. This will learn the parameters \(X\), \(W\), and \(b\).
  • Since learning these parameters doesn’t follow the typical ‘layers’ offered in TensorFlow neural network package, we will need to use custom training loop instead of the traditional model flow: compile(), fit(), and predict().
  • In order to do this,we will first return the gradient of the loss relative to the tracked variables.
  • Further, the gradients can then be applied to the parameters using an optimizer.
  iterations = 200
  lambda_ = 1
  for iter in range(iterations):
      # Use TensorFlow’s GradientTape
      # to record the operations used to compute the cost 
      with tf.GradientTape() as tape:
          # Compute the cost (forward pass included in cost)
          cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)
      # Use the gradient tape to automatically retrieve
      # the gradients of the trainable variables with respect to the loss
      grads = tape.gradient( cost_value, [X,W,b] )
      # Run one step of gradient descent by updating
      # the value of the variables to minimize the loss.
      optimizer.apply_gradients( zip(grads, [X,W,b]) )
      # Log periodically.
      if iter % 20 == 0:
          print(f"Training loss at iteration {iter}: {cost_value:0.1f}")
  • Below is the loss we had logged periodically:
  Training loss at iteration 0: 2321138.6
  Training loss at iteration 20: 136158.3
  Training loss at iteration 40: 51858.7
  Training loss at iteration 60: 24596.9
  Training loss at iteration 80: 13629.5
  Training loss at iteration 100: 8487.1
  Training loss at iteration 120: 5807.2
  Training loss at iteration 140: 4311.2
  Training loss at iteration 160: 3434.9
  Training loss at iteration 180: 2901.7
  • Below we will start making our predictions which will be based off our initial ratings given earlier. We can compare the original and predicted value in the output below:
  # Make a prediction using trained weights and biases
  p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
  #restore the mean
  pm = p + Ymean
  my_predictions = pm[:,0]
  # sort predictions
  ix = tf.argsort(my_predictions, direction='DESCENDING')
  for i in range(17):
      j = ix[i]
      if j not in my_rated:
          print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')
  print('\n\nOriginal vs Predicted ratings:\n')
  for i in range(len(my_ratings)):
      if my_ratings[i] > 0:
          print(f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')
  Predicting rating 4.55 for movie My Sassy Girl (Yeopgijeogin geunyeo) (2001)
  Predicting rating 4.53 for movie Martin Lawrence Live: Runteldat (2002)
  Predicting rating 4.53 for movie Delirium (2014)
  Predicting rating 4.53 for movie Particle Fever (2013)
  Predicting rating 4.53 for movie Laggies (2014)
  Predicting rating 4.53 for movie One I Love, The (2014)
  Predicting rating 4.52 for movie Eichmann (2007)
  Predicting rating 4.52 for movie Battle Royale 2: Requiem (Batoru rowaiaru II: Chinkonka) (2003)
  Predicting rating 4.52 for movie Into the Abyss (2011)
  Predicting rating 4.52 for movie Son of the Bride (Hijo de la novia, El) (2001)
  Predicting rating 4.51 for movie Rivers and Tides (2001)
  Original vs Predicted ratings:
  Original 5.0, Predicted 4.89 for Shrek (2001)
  Original 5.0, Predicted 4.84 for Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
  Original 2.0, Predicted 2.13 for Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
  Original 5.0, Predicted 4.89 for Harry Potter and the Chamber of Secrets (2002)
  Original 5.0, Predicted 4.91 for Lord of the Rings: The Return of the King, The (2003)
  Original 3.0, Predicted 3.00 for Eternal Sunshine of the Spotless Mind (2004)
  Original 5.0, Predicted 4.89 for Incredibles, The (2004)
  Original 2.0, Predicted 2.15 for Persuasion (2007)
  Original 5.0, Predicted 4.80 for Toy Story 3 (2010)
  Original 3.0, Predicted 3.01 for Inception (2010)
  Original 1.0, Predicted 1.48 for Louis Theroux: Law & Disorder (2008)
  • Below we have a dataframe showing data from the top 20 ratings:
  filter=(movieList_df["number of ratings"] > 20)
  movieList_df["pred"] = my_predictions
  movieList_df = movieList_df.reindex(columns=["pred", "mean rating", "number of ratings", "title"])
  movieList_df.loc[ix[:300]].loc[filter].sort_values("mean rating", ascending=False)
  pred  mean rating number of ratings title
  2112  4.097693  4.238255  149 Dark Knight, The (2008)
  155 4.004862  4.155914  93  Snatch (2000)
  211 4.246676  4.122642  159 Memento (2000)
  929 4.911655  4.118919  185 Lord of the Rings: The Return of the King, The...
  2700  4.795389  4.109091  55  Toy Story 3 (2010)
  653 4.331339  4.021277  188 Lord of the Rings: The Two Towers, The (2002)
  2804  4.393629  3.989362  47  Harry Potter and the Deathly Hallows: Part 1 (...
  773 4.082829  3.960993  141 Finding Nemo (2003)
  1771  4.246275  3.944444  81  Casino Royale (2006)
  2455  4.244220  3.887931  58  Harry Potter and the Half-Blood Prince (2009)
  361 4.052389  3.871212  132 Monsters, Inc. (2001)
  246 4.893447  3.867647  170 Shrek (2001)
  1150  4.885246  3.836000  125 Incredibles, The (2004)
  366 4.839989  3.761682  107 Harry Potter and the Sorcerer's Stone (a.k.a. ...
  79  4.226652  3.699248  133 X-Men (2000)
  622 4.889378  3.598039  102 Harry Potter and the Chamber of Secrets (2002)

Deep Neural Network based Recommendations

  • Deep neural networks (DNNs) have been increasingly used in recommendation systems due to their ability to learn complex patterns in user behavior and item attributes. DNN-based recommendation systems typically use a neural network architecture to model the user-item interactions and make predictions for item recommendations.
  • One common approach for DNN-based recommendation is to use a matrix factorization model as a baseline and then incorporate additional layers of neural networks to capture more complex patterns in the user-item interactions. This is known as a deep matrix factorization model. The neural network layers can be used to learn non-linear transformations of the input features, which can improve the accuracy of the recommendations.
  • Another popular approach for DNN-based recommendation is to use a sequence modeling architecture, such as a recurrent neural network (RNN) or a transformer network. These models can capture temporal dependencies in user behavior and item popularity, allowing for more accurate and personalized recommendations. For example, an RNN can be used to model the sequence of items that a user has interacted with over time, and then use this information to predict which item the user is likely to interact with next.
  • “DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.” (source)
  • A possible DNN model for recommendation is the softmax model, which frames the problem as a multiclass prediction task where the input is the user query and the output is a probability vector with the same size as the item corpus. This vector represents the likelihood of a user interacting with each item, such as watching or clicking on a YouTube video.
  • The input features to the DNN can include both dense features like watch time and time since last watch, as well as sparse features such as watch history and country. Unlike matrix factorization, side features like age or country can also be added to the input vector, which is denoted as x.
  • The image below (source) illustrates how that architecture would look like.

  • By incorporating hidden layers and non-linear activation functions like ReLU, the model can capture more intricate relationships within the data. However, as the number of parameters increases, the model can become more challenging to train and more resource-intensive to deploy.

Two-tower Model

  • Many online platforms, like YouTube, Facebook, and TikTok, use the two-tower Model in their recommender systems.
  • This process is as follows:
    • The two-tower model consists of two sub-neural networks: query and item.
    • The query tower encodes user data; the item tower encodes product data.
    • The output of each tower is an embedding, i.e., a dense vector.
    • The similarity of a user and product pair is measured using the dot product.
    • The trained embeddings of query and item towers are stored for fast retrieval.
  • The following diagram (source) visually illustrates the two-tower architecture.

Candidate Retrieval

  • Now that you have an embedding model, how would you decide which items to recommend given a user?
  • At serve time, given a query, you start by doing one of the following:
    • For a matrix factorization model, the query (or user) embedding is known statically, and the system can simply look it up from the user embedding matrix.
    • For a DNN model, the system computes the query embedding \(\psi(x)\) at serve time by running the network on the feature vector \(x\).
  • Once you have the query embedding \(q\), search for item embeddings \(V_{j}\) that are close to \(q\) in the embedding space. This is a nearest neighbor problem. Practically, approximate nearest neighbor algorithms are used to tackle this aspect. For example, you can return the top \(\mathrm{k}\) items according to the similarity score \(s\left(q, V_{j}\right)\).” (source)

  • “You can use a similar approach in related-item recommendations. For example, when the user is watching a YouTube video, the system can first look up the embedding of that item, and then look for embeddings of other items that are close in the embedding space.” (source)

Large-scale Retrieval

  • “To compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:
    • If the query embedding is known statically, the system can perform exhaustive scoring offline, precomputing and storing a list of the top candidates for each query. This is a common practice for related-item recommendation.
    • Use approximate nearest neighbors.” (source)

Use Cases

  • Twitter: Twitter uses candidate generation to recommend accounts for users to follow. The recommendation engine generates a list of accounts that the user might be interested in based on their activity on the platform. Twitter’s algorithm considers factors such as the user’s tweets, retweets, likes, and interactions with other accounts to generate a list of recommended accounts. In addition to recommending accounts, Twitter also generates candidate tweets for users to engage with. The system analyzes the user’s activity and suggests tweets that they may be interested in liking or retweeting.
  • Pinterest: Pinterest uses candidate generation to recommend pins to users based on their interests. The recommendation engine analyzes the user’s activity on the platform, such as the pins they have saved, liked, and commented on, to generate a list of recommended pins. Pinterest also generates candidate boards for users to follow. The system identifies boards that are related to the user’s interests and suggests them as potential candidates for follow.
  • Netflix: Netflix uses candidate generation to recommend TV shows and movies to users. The recommendation engine analyzes the user’s viewing history and behavior on the platform to generate a list of recommended titles. Netflix’s algorithm considers factors such as the user’s watch history, search queries, and interactions with the platform to generate a personalized list of recommended titles. The platform also generates candidate trailers for users to watch. The system analyzes the user’s viewing history and suggests trailers for titles that the user might be interested in.
  • Amazon: Amazon uses candidate generation to recommend products to users based on their browsing and purchasing history. The recommendation engine analyzes the user’s activity on the platform, such as the items they have searched for, added to their cart, and purchased, to generate a list of recommended products. Amazon’s algorithm considers factors such as the user’s search history, purchase history, and interactions with the platform to generate a personalized list of recommended products. The platform also generates candidate reviews for users to read. The system analyzes the user’s browsing history and suggests reviews for products that the user might be interested in.


  • The image below (source) shows the candidate generation architecture.

  • At a high level, they use a multi-class classification neural network model that is a non-linear matrix factorization model based on user watch history data.
  • Here, they count a fully watched video as a positive example and negative examples are randomly sampled but they made sure these dont contain positive examples as well.
  • ANN with similarity measures are used at inference time based on user and video embeddings to get the top 100 or so candidates.
  • Now lets go back to the architecture from the image above. Embedded videos are created for each user and a weighted average is taken for different categories such as search(since the user could search something and not watch it), watch(shows a positive example), and geolocation.
  • During training, it leverages softmax to predict class probabilities but during serving it uses ANN.
  • This then goes through a feed forward neural network that uses a ReLU activation.

Negative Sampling

  • Negative sampling and the addition of items that a user won’t like are typically applied during the training phase of recommender systems. The purpose of these techniques is to address bias in the training data and improve the performance and fairness of the models.
  • During training, negative sampling involves randomly selecting items that the user has not interacted with and assigning them negative labels to represent disinterest or lack of preference. This helps balance the dataset and provides a more representative view of user preferences. By including negative samples, the model can learn to distinguish between items that a user is likely to prefer and those they are likely to dislike.
  • Negative samples are used to help the model learn patterns and make better predictions, but they are not meant to be used as actual recommendations to users.
  • During the serving phase, the recommender system utilizes the trained model to generate personalized recommendations based on user preferences and other relevant factors. The goal is to provide recommendations that are aligned with the user’s interests and preferences, rather than deliberately suggesting items that the user won’t like.
  • The aim of using negative sampling and introducing negative examples during training is to improve the overall performance and fairness of the recommender system. By considering both positive and negative examples during the training phase, the model can better capture the nuanced preferences of users and avoid biases that may arise from the limited availability of positive data.
  • It’s worth mentioning that the exact techniques and approaches for mitigating bias may vary depending on the specific recommender system and the characteristics of the data. The ultimate objective is to strike a balance between addressing biases and providing accurate and relevant recommendations to users.

  • “To illustrate this, consider a simple example where we have a user and three posts (0, 1, 2) on a platform. Posts 1 and 2 were shown to the user, and they engaged with post 1 but not post 2. An oracle reveals that post 2 is slightly more relevant to the user than post 0. Each post has a 1D embedding, where the embedding value corresponds to the post ID. Without negative sampling, training the user embedding using gradient descent would likely result in the user embedding being closer to post 0 than post 2.” (source)


  • To test if a recommender system has a good candidate generation or candidate retrieval pool, you can use various evaluation metrics that measure the quality of the recommended items. Some commonly used metrics are:
    • Precision: Precision measures the percentage of relevant items among the recommended items. In the context of candidate generation, precision measures how many of the recommended candidates are actually relevant to the user’s preferences.
    • Recall: Recall measures the percentage of relevant items that were recommended. In the context of candidate generation, recall measures how many of the relevant candidates were recommended to the user.
    • F1 Score: F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is a good overall metric for evaluating candidate generation performance.
    • Mean Average Precision (mAP): MAP measures the average precision across different queries. In the context of candidate generation, MAP measures the average precision of the recommended candidates across all the users.
    • Normalized Discounted Cumulative Gain (NDCG): NDCG measures the relevance of the recommended items by assigning higher scores to items that are more relevant. In the context of candidate generation, NDCG measures how well the recommended candidates are ranked according to their relevance.
  • By evaluating these metrics, you can assess the quality of the recommended items and identify areas where the candidate generation or retrieval pool needs improvement.

Further Reading