Recommendation Systems • MultiArmed Bandits
 Overview
 MAB Usecases
 Bandit Algorithms
 Contextual Bandit Algorithms
 Comparative Analysis of MAB Algorithms
 Advantages of MAB
 Contextual Bandits
 Offline Evaluation Replay
 Industry examples of bandits for RecSys
 Further Reading
 References
Overview
 The concept of MultiArmed Bandits (MAB) is inspired from the idea of gambling in a casino. MAB is a classic problem in probability theory and decisionmaking that involves a gambler who faces a row of slot machines, or “onearmed bandits,” each with a different unknown probability of winning. The gambler must decide which machines to play in order to maximize their total reward over a series of plays.
 The MAB problem is often used as a metaphor for decisionmaking under uncertainty in various fields, such as statistics, machine learning, reinforcement learning, and economics.
 MABs are a type of classical reinforcement learning that aims to strike a balance between exploration and exploitation. They achieve this by exploring new actions to understand their potential rewards and then exploiting the current best action to maximize overall reward. The objective is to gain knowledge about and select actions that maximize the total reward while minimizing regret.
 Let’s look at the different facets of this problem below.
MAB Usecases
 MAB are a class of algorithms used for decisionmaking and optimization in a variety of contexts. They are commonly used in the context of online testing and personalization.
Testing
 MAB A/B testing are both common approaches for testing and optimizing user experiences in various scenarios such as website design, online advertising, and recommendation systems. Both approaches involve running experiments to compare different versions of a product or service and measure their performance. However, they differ in several key respects, and each has its own strengths and limitations:
 A/B Testing:
 Method: A/B testing is a traditional approach where you split your audience into two (or more) groups and expose each group to a different version of your product or service. You then measure the performance of each version based on some defined metric (e.g., clickthrough rate, conversion rate) and pick the version that performs the best.
 Advantages: A/B testing is simple to implement and interpret. It gives you a clear comparison between two or more options. This approach is also robust to various statistical and methodological issues, as long as the test is properly designed.
 Limitations: A/B testing can be inefficient. While the test is running, a significant portion of your audience is exposed to suboptimal versions. Furthermore, it does not allow for realtime learning or adjustment based on the incoming data. A/B testing also doesn’t work well with personalization because it assumes each version performs uniformly across all audience segments.
 MAB:
 Method: MAB approaches aim to balance the explorationexploitation tradeoff. In other words, they try to learn from the ongoing experiment and adjust the allocation of traffic to different versions based on their observed performance. This allows for more efficient use of resources and realtime learning.
 Advantages: MAB approaches can be more efficient than A/B testing because they dynamically adjust the allocation of users to different versions based on their performance. This can lead to higher overall performance during the testing phase. MAB methods are also better suited to personalization because they can adapt to variations in performance across different user segments.
 Limitations: MAB methods can be more complex to implement and interpret. They also require careful tuning to balance exploration and exploitation. Furthermore, while MAB methods can learn and adapt more quickly, they might also be more sensitive to shortterm fluctuations in performance metrics.
Personalization
 In terms of personalization, MAB approaches often outperform A/B testing because they can handle more complex situations and adapt more quickly to the observed data. They can dynamically allocate more resources to versions that perform better for specific user segments, leading to more personalized experiences.
 However, the choice between A/B testing and MAB often depends on the specific context, the available resources, and the level of sophistication required.
Bandit Algorithms
 Let’s look at how the bandit algorithms work at a high level first.
 Looking at the image below (source), we see that there is a learner that interacts with the environment with an action and the environment responds back with a reward.
 The bandit game then, proceeds as follows for each round:
 The learner chooses one action from a set of actions and shares it with the environment.
 The environment generates a realvalued reward and sends it back to the learner.
 The goal for the learner here is to maximize the cumulative reward or minimize the cumulative regret, where regret is the difference in total reward gained in \(n\) rounds and the total reward that could have been gained had the optimal action been chosen.
Exploration vs. Exploitation Tradeoff
 Exploration vs. exploitation is at the heart of every bandit algorithm.
Exploitation
 Exploitation is about recommending the optimal candidate given all the evidence and data that is currently available. Let’s look at a few strategies for exploitation:
 Greedy exploit policy: the greedy exploit policy would involve always selecting the arm with the highest estimated reward and not exploring other arms. It assumes that the estimated rewards are accurate and the agent has a good understanding of the underlying reward distribution. It aims to make the best use of the current knowledge to select the arm with the highest expected reward at each time step, without allocating resources to explore other arms.
 The image below (source) shows an overview of how Netflix uses greedy exploit policy in recommending a title.
 When the member arrives at the homepage, we can see they have four titles in their candidate pool and the member features as well as the title features are extracted and provided to four pretrained models.
 At online time, the four models are scored based on the features that were extracted and a probability of playing the title is computed and the title with the highest probability is selected.
 Greedy exploit policy: the greedy exploit policy would involve always selecting the arm with the highest estimated reward and not exploring other arms. It assumes that the estimated rewards are accurate and the agent has a good understanding of the underlying reward distribution. It aims to make the best use of the current knowledge to select the arm with the highest expected reward at each time step, without allocating resources to explore other arms.
Exploration
 Exploration is about recommending other candidates to gather more feedback.
 While exploration sounds like a sacrifice, it may be a good long term strategy as it allows us to gather information to make the best overall decision. Let’s look at the different strategies of exploration:
 Naive Exploration or random exploration: Add noise to the greedy policy such as Egreedy algorithm. It chooses to explore without any specific strategy or information. In this approach, the system selects arms or recommendations randomly, without taking into account any knowledge or estimation of the probabilities of success or quality of the options.
 Optimism in the Face of Uncertainty: prefer actions where you have little information thus, gaining more information about unknown actions. Upper Confidence Bound (UCB) is a great example of this. Arms with higher uncertainty or unknown performance are given higher optimism or initial estimates of their potential rewards. These strategies are designed to encourage exploration of uncertain arms in order to gather more information and potentially identify higherrewarding arms.
 Probability Matching: select the action according to the best probability such as what Thompson Sampling does. It’s a strategy where an agent distributes its actions or choices across different arms in proportion to their probabilities of success, rather than consistently choosing the arm with the highest estimated reward.
Contextual Bandit Algorithms
Epsilon Greedy
 Epsilongreedy algorithm is a strategy that combines both exploitation and exploration in multiarmed bandit problems. Put simply, it is not solely an exploitation or exploration algorithm, but rather a hybrid approach that balances both aspects.
 The “exploitation” aspect of the epsilongreedy algorithm refers to the selection of the arm with the currently estimated highest reward based on the exploitation rate (1  \(\epsilon\)). This is done to maximize the shortterm rewards by choosing the arm that appears to be the best according to the current knowledge.
 The “exploration” aspect of the epsilongreedy algorithm refers to the selection of a random arm with a probability of \(\epsilon\). This is done to explore other arms and gather more information about their rewards, even if they are not currently estimated to be the best. This exploration allows the agent to potentially discover better arms that may have initially lower estimated rewards but could have higher rewards in the long run.
 By balancing exploitation and exploration, the epsilongreedy algorithm aims to find a tradeoff between taking advantage of the current best arm and exploring other arms to improve the overall performance. The value of epsilon (\(\epsilon\)) controls the level of exploration, where higher epsilon values result in more exploration, and lower epsilon values result in more exploitation.
 The pseudocode below (source) shows how the exploitation and exploration process work given the value of \(\epsilon\).
inputs:
machines :: [ GamblingMachine ]
num_plays :: Map(GamblingMachine > Int)
total_reward :: Map(GamblingMachine > Float)
select e // In the original post, e = 0.1
while True:
x < UNIFORM([0,1]) // A uniformly distributed random variable
if x < e: //exploration phase
let m = random_choice(machines)
reward < play_machine( m )
num_plays[m] += 1
total_reward[m] += reward
else: //exploitation phase
average_rewards = Map( (m, total_reward[m] / num_plays[m]) for m in machines )
best_machine = argmax( average_rewards ) //Find the machine with the highest reward
reward < play_machine( best_machine )
num_plays[best_machine] += 1
total_reward[best_machine] += reward
Thompson Sampling (Bayesian)
 This is a probabilistic algorithm that samples arms from their posterior distributions and selects the arm with the highest sampled reward. It uses Bayesian methods to update the posterior distributions based on observed data.
 The key idea behind Thompson Sampling is to maintain a probability distribution over the expected rewards of each arm or action, and sample from these distributions to determine which arm to choose at each decision point.
 The algorithm uses Bayesian methods to update the probability distributions based on observed rewards, which allows it to adaptively learn the optimal arm over time.
 In the code below, we assume there are 5 arms with unknown probabilities of success (
true_probs
), and we want to run the Thompson Sampling algorithm for 1000 rounds (n_rounds). We initialize counters for each arm (n_pulls
andn_successes
) as zeros, and in each round, we sample from the beta distribution using the updated counters, choose the arm with the highest sampled probability, simulate pulling the chosen arm, and observe the reward.
import numpy as np
# Number of arms
n_arms = 5
# True unknown probability of success for each arm (unknown to the algorithm)
true_probs = np.random.rand(n_arms)
# Number of rounds
n_rounds = 1000
# Initialize counters for each arm
n_pulls = np.zeros(n_arms)
n_successes = np.zeros(n_arms)
# Main loop for Thompson Sampling
for t in range(n_rounds):
# Sample from the probability distributions for each arm
sampled_probs = np.random.beta(n_successes + 1, n_pulls  n_successes + 1)
# Choose the arm with the highest sampled probability
chosen_arm = np.argmax(sampled_probs)
# Simulate pulling the chosen arm and observe the reward
reward = np.random.binomial(1, true_probs[chosen_arm])
# Update the counters for the chosen arm
n_pulls[chosen_arm] += 1
n_successes[chosen_arm] += reward
# Print the chosen arm and reward for this round
print("Round:", t+1)
print("Chosen Arm:", chosen_arm)
print("Reward:", reward)
print("")
# Print the estimated probabilities of success for each arm
estimated_probs = n_successes / n_pulls
print("Estimated Probabilities of Success:", estimated_probs)
Upper Confidence Bound
 As we saw earlier, this strategy is based on Optimism in the Face of Uncertainty principle.
 This algorithm uses an upper confidence bound to balance exploration and exploitation. It selects arms with higher upper confidence bounds, which represent the uncertainty of the estimated rewards.
 The UCB algorithm leverages the optimistic estimate of the expected rewards by using the upper confidence bound, which encourages exploration by choosing arms with uncertain or poorly estimated rewards. As the algorithm accumulates more data through the number of pulls, the confidence interval term shrinks, and the algorithm tends to exploit the arms with higher estimated rewards more often.
 In the code below, we can see how UCB is implemented with 5 arms.
import numpy as np
# Number of arms
n_arms = 5
# True unknown probability of success for each arm (unknown to the algorithm)
true_probs = np.random.rand(n_arms)
# Number of rounds
n_rounds = 1000
# Initialize counters for each arm
n_pulls = np.zeros(n_arms)
n_successes = np.zeros(n_arms)
# Main loop for UCB
for t in range(n_rounds):
# Calculate the upper confidence bound for each arm
ucb_values = n_successes / n_pulls + np.sqrt(2 * np.log(t + 1) / (n_pulls + 1e6))
# Choose the arm with the highest upper confidence bound
chosen_arm = np.argmax(ucb_values)
# Simulate pulling the chosen arm and observe the reward
reward = np.random.binomial(1, true_probs[chosen_arm])
# Update the counters for the chosen arm
n_pulls[chosen_arm] += 1
n_successes[chosen_arm] += reward
# Print the chosen arm and reward for this round
print("Round:", t+1)
print("Chosen Arm:", chosen_arm)
print("Reward:", reward)
print("")
# Print the estimated probabilities of success for each arm
estimated_probs = n_successes / n_pulls
print("Estimated Probabilities of Success:", estimated_probs)
Comparative Analysis of MAB Algorithms
 Credits for this section go to Eugene Yan’s Bandits for Recommender Systems blog.
 First, UCB and Thompson Sampling outperform \(\epsilon\)greedy. By default, \(\epsilon\)greedy is unguided and chooses actions uniformly at random. Put simply, at every trial, it randomly chooses an action with probability \(\epsilon\) and greedily chooses the highest value action with probability 1  \(\epsilon\), where \(\epsilon\) is parameter that balances the exploreexploit tradeoff, i.e., a higher \(\epsilon\) leads to more exploration while a lower \(\epsilon\) leads to more exploitation. In contrast, UCB and Thompson Sampling are guided by confidence bounds and probability distributions that shrink as the action is tried more often. As a result, because UCB and Thompson Sampling smartly explore arms that have higher uncertainty, they have lower regret.
 Second, when feedback is delayed, Thompson Sampling outperforms UCB. Delayed feedback, where useritem interactions are not processed immediately, is common for most realworld systems due to resource and runtime constraints. In this situation, because UCB selects arms deterministically, it chooses the same action until new feedback is incorporated. In contrast, because Thompson Sampling chooses actions stochastically by sampling from the posterior distribution, it randomizes over actions even without updated rewards. Yahoo’s evaluation of Thompson Sampling and Deezer’s music bandit observed that this led to wider exploration and thus better outcomes.
 To further study the impact of delayed rewards, Yahoo simulated different update delays of 10, 30, and 60 minutes. Thompson Sampling was competitive over all delays. On the other hand, while UCB outperformed Thompson Sampling when the delay was short (10 minutes), it performed worse than Thompson Sampling when the delay increased to 30 and 60 minutes. Overall, this suggests that stochastic policies are more robust to delay as they continue to explore even without updated rewards.
Advantages of MAB

MABs are particularly useful when there’s a need to balance exploration (trying new options) and exploitation (sticking with the bestknown option). Here are some of the benefits of using multiarmed bandits:

Efficient ExplorationExploitation Balance: MABs help in balancing the need to explore unknown options and exploit known good ones, optimizingfor/guidedby longterm rewards. They’re efficient because they can model uncertainty with limited useritem preference data (especially compared to deep learningbased models for recommender systems).

Online Learning: They are designed to learn and adapt over time, making them suitable for nonstationary environments where the underlying distribution can change.

Resource Optimization: By continually exploring and exploiting the best options, MABs can optimize the allocation of resources, such as budget, time, or any other limited resource.

Personalization: In contexts like recommendation systems, MABs can personalize content to individual users, adapting to their preferences and behaviors in realtime.

Ease of Implementation: Some MAB algorithms, like epsilongreedy, are quite simple to implement, making them accessible for a variety of applications.

Robustness: Many MAB algorithms are robust to noisy or incomplete information and can work well even when the assumptions about the underlying models are not exactly met.

Low Regret: MAB algorithms are designed to minimize regret, which is the difference between the reward obtained and the reward that would have been obtained by always choosing the optimal action. This ensures that they perform nearoptimally with enough observations.

Scalability: MABs can be applied to problems with many actions (or arms) and can scale to handle complex scenarios.

Realtime Decision Making: They enable realtime decisionmaking, allowing systems to respond quickly to changes in the environment.

Experimentation and A/B Testing: MABs can be used to conduct efficient A/B testing, dynamically allocating more traffic to betterperforming variations. This can speed up experimentation and lead to more rapid improvements.

Contextual Information: Contextual bandit algorithms, a variant of MAB, incorporate additional context information into the decisionmaking process, allowing for more nuanced and sophisticated strategies.

Application Across Various Domains: Multiarmed bandits have wide applicability across domains like healthcare (for personalized treatment strategies), online advertising (for ad placement), recommender systems, finance, and more.

Ethical Considerations: By potentially reducing unnecessary exploration in critical scenarios, MABs may help in making more ethical decisions, like in healthcare, where unnecessary exploration could have significant consequences.

In summary, the benefits of multiarmed bandits are manifold, and they provide a flexible, robust, and efficient means of decisionmaking and optimization in a wide array of applications and industries.
Contextual Bandits
Overview
 Contextual bandits extend the concept of multiarmed bandits by considering the context or additional information available before making each action selection. Instead of solely relying on historical data, contextual bandits take into account the specific context or conditions associated with each decision point.
 In the context of recommendation systems and search algorithms, this could involve considering various customerrelated data such as demographics, device type, historical preferences, as well as environmental factors like the day of the week or time of day. By incorporating context, contextual bandits aim to learn how different actions interact with specific contexts and their impact on the expected reward. This enables more personalized and optimized decisionmaking in dynamic environments.
Industrial Deployments
Spotify
 Spotify introduced a method that extends the use of contextual bandits, a technique commonly used in recommendation systems, to handle multiple objectives fairly. The approach uses a mathematical function called the Generalized Gini index (GGI) to combine and balance the different objectives. The paper proposes an algorithm that learns from user interactions and adjusts recommendations to maximize longterm rewards for each objective, as measured by the GGI function
 This paper discusses the design of multiobjective recommender systems for online platforms that involve multiple stakeholders. The recommender systems aim to optimize various objectives such as user satisfaction, fairness, diversity, and revenue. Traditional bandit models are limited in handling multiple objectives simultaneously, but contextual bandit models offer more flexibility in incorporating side information.
 The paper proposes a multiobjective formulation of a contextual bandit recommender system, which considers the uncertainty and relevance of items in the decisionmaking process. Unlike traditional bandits, contextual bandits use context or side information to guide recommendations. However, multiobjective variants of contextual bandits have received less attention in previous research.
 The focus of the paper is on developing multiobjective contextual bandit models for digital platforms with multiple stakeholders. The reward in this setting is assumed to be a noisy linear function of the context or side information. By considering multiple objectives, these models can effectively optimize recommendations while balancing different stakeholder goals.
 The paper describes the use of a multiobjective contextual bandit (MAB) model to optimize the recommender system on Spotify. The proposed model is based on the Generalized Gini Index (GGI) and assimilates contextual information to optimize multiple objectives simultaneously. The MAB framework allows Spotify to make sequential decisions by balancing exploration and exploitation, considering various usercentric and suppliercentric objectives.
 In the article, the authors explain that traditional bandit settings only consider a single scalar feedback after each action, while multiobjective systems require joint optimization of multiple criteria. By extending the contextual bandit framework to accommodate multiple objectives, Spotify aims to achieve efficiency and fairness in optimizing their recommender system.
 Therefore, the approach presented in the article involves the use of a multiobjective contextual bandit model, which leverages the GGI and contextual information to optimize for multiple objectives simultaneously.
 Existing research in multiarmed bandits has explored addressing the multiobjective nature of the problem. Various algorithms, such as the knowledge gradient, hierarchical optimistic optimization strategy, Thompson sampling, and combinations of biobjective optimization with combinatorial bandits, have been proposed to tackle multiobjective optimization (MOO) in traditional bandit settings. However, these approaches do not specifically address the contextual bandit setting, where additional context information is observed at each iteration.
 In the context of contextual bandits, there has been relatively less focus on multiobjective variants. Only a few algorithms exist that consider similarity information or dominant objectives, but these approaches rely on assumptions about access to distances between contextarm pairs rather than user behavioral data. These assumptions are often restrictive in realworld industrial settings, where rewards are derived from user behavioral data.
 Therefore, the proposed method aims to extend multiobjective bandit models specifically to the contextual bandit setting, considering the challenges and constraints of realworld scenarios where user behavioral data is used to derive rewards.
 The authors discusses the importance of optimizing multiple objectives in usercentric systems, such as music streaming platforms. Different usercentric objectives, like clicks, streams, and number of songs played, can be optimized, but they may have correlations and tradeoffs with each other. For example, optimizing for relevance may harm diversity. The article also considers additional objectives related to diversity and promotion. To understand the interplay between these objectives, the authors analyze user streaming data and estimate various metrics. They find that different objectives have different degrees of correlation, with usercentric objectives being strongly positively correlated and diversity and promotion objectives showing weak negative correlation with usercentric objectives.
 They emphasizes that optimizing for a single metric is insufficient in a multiobjective, multistakeholder platform setting. There is often a delicate tradeoff between objectives, and a recommender system designed for a single metric is not suitable. The authors propose exploring multiple metric optimization and describe a recommendation approach based on contextual bandits, which are commonly used to optimize a single user satisfaction metric.
 The article describes the use of contextual bandits for recommender systems. The recommendation problem is formalized as a combinatorial contextual bandit problem, where the system repeatedly interacts with users. The system observes a context, chooses an action (recommendation), and receives a reward based on user satisfaction. In the case of music streaming platforms, the context includes user features, playlist features, userplaylist affinity, and other contextual information.
 Actions in this context are sets of recommendations, where users are presented with playlists. The rewards in a traditional usercentric system are based on user satisfaction with the recommendation. However, in a multistakeholder system, multiple objectives are considered, and vectorial rewards are observed, with each objective having its corresponding reward.
 To optimize multiple objectives in content recommendation, the article proposes a multiobjective contextual bandit algorithm. The algorithm is described in detail, including mathematical notations and the problem setup. The armselection strategy is explained, which involves selecting a playlist to show to the user based on observed contextual features. The article highlights the benefits of multiobjective optimization for balancing competing objectives in recommender systems.
 The authors describes the challenges faced by recommender systems in multistakeholder platforms, where different stakeholders have different objectives. To address this, the researchers proposed MOLinCB, a multiobjective linear contextual bandit model. They used the Gini aggregation function to balance and optimize multiple objectives. The approach involved learning a recommendation policy using a scalable gradient ascent method.
 The findings of the study indicate that optimizing for multiple objectives can lead to improvements in all the objectives involved. By considering multiple satisfaction metrics, the model was able to provide better recommendations, as different metrics capture different aspects of user behavior. The experiments also showed that it is possible to achieve gains in both complementary and competing objectives through multiobjective optimization. Specifically, the proposed model successfully achieved gains in a promotional objective without negatively impacting user satisfaction metrics.
 While the presented model and findings have implications for recommender system design in multistakeholder platforms, there are several future research directions suggested. These include exploring alternative objective weighting functions to provide more control over the importance of different objectives, considering global objectives aggregated over time, quantifying objectives of other stakeholders, and expanding the range of objectives considered to include more competing objectives. These future research directions aim to further enhance the effectiveness and versatility of multiobjective recommender systems in multistakeholder settings.
Yahoo
 Yahoo implemented a contextual bandit approach for news article recommendations, focusing on the use of shared features across different arms (hybrid linear models) instead of disjoint linear models. In their experiments, they discovered that this approach enabled transfer learning, allowing clickthrough rate information from one article to inform the recommendations of other articles.
 For user representation, they considered 1,193 categorical features encompassing demographics, geography, and behavioral aspects related to news consumption history. News articles were represented by 83 categorical features, including URL category and manually tagged editor categories.
 To reduce dimensionality, they projected user features and article features into respective categories and clustered users and articles with similar preferences. This resulted in sixdimensional representations for both users and articles. To create userarticle cross features, they computed the outer product of the user and article features, resulting in a 36dimensional vector, which formed the basis of the hybrid linear models.
 Evaluating the performance of their policy was challenging due to the availability of offline data collected under a different policy. To address this, they made assumptions about the independence and identical distribution of individual events and the uniform random selection of arms by the policy that collected the logged data. They also had a learning bucket where some users were randomly assigned and served articles randomly.
 Their policy evaluator compared the learned policy with the logged policy. If the learned policy selected the same arm as the logged policy, the event was retained, added to the history, and used to update the payoff. If the learned policy chose a different arm, the event was ignored, and the algorithm proceeded to the next event without any changes in the state.
Netflix
 Netflix utilized contextual bandits to personalize movie images on the home page, aiming to provide a better user experience and reduce regret. They opted for contextual bandits because traditional batch machine learning approaches require significant time to collect data, train models, and conduct AB testing, resulting in a delay in delivering an improved experience to members.
 Initially, Netflix employed a noncontextual multiarmed bandit to determine the best artwork for all users. However, they transitioned to contextual bandits to personalize images for each individual. In this setup, the bandit selects from a set of images for each show (action) and observes the duration of time the user watches the show after being influenced by the image (reward). Additionally, the bandit takes into account user attributes (e.g., played titles, preferred genres, country, language preferences), temporal factors like the day of the week and time of day, and other relevant context information.
 Netflix mentioned several bandit models, including a greedy policy implemented through supervised regression models, epsilongreedy, LinUCB, and Thompson Sampling. However, they did not specify which specific model was used in their production system, suggesting that it may be an ensemble or a dynamic approach that is frequently updated.
Instacart
 “Contextual bandit models are a popular approach for personalizing the user experience by recommending relevant products. However, these models become challenging to train and evaluate when the number of actions, or products, in the recommendation pool become large.” (source)
 Within the MultiArmed Bandit (MAB) framework, the system acquires knowledge about the potential actions through realtime exploration. Following the execution of an action, the system observes a probabilistic outcome, representing the reward obtained. This process continues, and over time, the MAB model gains insights into the reward distribution associated with each action. Eventually, based on the learned information, the system selects the action with the highest average performance across all users as the production policy to be deployed.
 Contextual bandits contain context which includes userspecific features that allow the CB model to personalize the recommended action for each user. By leveraging this context, the CB model has the potential to achieve a higher average reward across all customers compared to an MAB model, which lacks context information
 At Instacart, the application of the Contextual Bandit (CB) framework was explored to enhance item retrieval and ranking, aiming to provide users with a more personalized shopping experience. Item retrieval and ranking involve finding and ordering products relevant to a user’s search query. Traditionally, a machine learning model is used to compute the relevance of each item to the query, based on the probability of it being added to the cart.
 However, the problem of ranking recommended items presents challenges for the CB framework. Retrieving several hundred candidate items from a large pool and sorting them based on their ranking scores does not have a clear set of discrete actions. Considering each individual item as a discrete action is also problematic since the reward for a specific item (click or purchase) depends on the presence of other items shown to the customer, violating the independent reward attribution assumption of CB.
 To address these challenges, one approach is to represent each potential action (sorted list of items) as a lowdimensional vector, as suggested in “OffPolicy Evaluation for Large Action Spaces via Embeddings.” For instance, in an item ranking problem, the vector can include features such as the average ranking score and average price of the top 5, 10, 15, 20 items. While additional numerical features can be added, the dimensionality of the action vector should be carefully considered to ensure the accuracy of the CB model in recommending the most suitable vector for each context.
 In 2022, Instacart initially took a simpler approach to apply CBs to the ranking problem. They observed that for specific queries like “milk,” a linear model that predicts relevance scores for items (ranking them in decreasing order) without relying on user features performs well by recommending popular milk products. On the other hand, for broader queries like “healthy snack,” a nonlinear item scoring model that considers various user features as inputs performs well by suggesting personalized items for the customer. These observations guided their early work in utilizing CBs for ranking problems.
 The Instacart team conducted experiments to investigate the effectiveness of Contextual Bandit (CB) models in improving search ranking and personalizing multiobjective ranking. Initially, they aimed to increase the relevance of search results by training a CB model to select the best search ranker for each userquery context, measured by cart_adds_per_search (items added to cart per search query). The CB model, implemented using the Xlearner framework, achieved a significant 0.66% increase in cart_adds_per_search compared to the neural network item scoring model.
 Building upon this success, the team explored the challenge of personalizing multiobjective ranking, which involves balancing personalized relevance, popularity, novelty, and other factors in recommending items. In their online experiment, they defined various ranking formulas that represented different objectives and treated them as potential actions for the CB model. Each action corresponded to a specific combination of weights assigned to the relevance, popularity, price, and availability scores of items.
 To evaluate the performance of the CB models, a randomized data collection experiment was conducted, exploring different coefficient values around those used by the existing production model. The Xlearner model and the XGBoost model demonstrated promising tradeoffs between counterfactual estimates for metrics such as Cart Adds Per Search (CAPS) and Gross Merchandise Value (GMV) per search. Subsequently, an online A/B test was conducted with these models, resulting in a statistically significant increase of nearly 0.6% in CAPS for Android users (who are more price sensitive), while GMV per user showed a positive but not statistically significant increase. Considering the lower latency of the XGBoost model, it was chosen for production deployment.
 The team plans to continue experimenting in this domain, exploring different action spaces for the CB model, and further refining the personalized ranking algorithms for improved user experience and business outcomes.
 The team is exploring the use of Contextual Bandit (CB) models not only for personalizing search ranking but also for recommending a preference vector over retrieval sources. The goal is to maximize a specific objective that can be computed from each sorted list of results. Different strategies can be employed to translate the preference vector into user treatment. This includes sampling candidates from each source based on their preferences, upranking candidates from the most preferred source, and utilizing CB preferences as an additional ranking feature.
 To train a CB model, a dataset needs to be collected where actions have been taken in various contexts, and the outcomes have been observed. Customers are randomly assigned to different variants, with each variant using a specific action for ranking items. The collected data is then used to train the CB model, learning the predicted reward for each action given a context. Counterfactual evaluation is performed to assess the trained model’s performance.
 There are different algorithms that can be used to train a CB model. A simple approach involves training an XGBoost or LightGBM model with the action as a categorical feature. To obtain recommended actions from the model, all allowed actions are sequentially substituted into the model, and the action with the highest predicted reward is selected.
 A more sophisticated approach is to use a neural network to represent the CB model, with input nodes corresponding to context features and output nodes representing predicted rewards for each possible action. During serving, the CB model can recommend the action with the highest predicted reward or make probabilistic recommendations based on predicted rewards. During training, the error between the predicted reward and the observed reward for the taken action is backpropagated through the corresponding output node.
 These training approaches enable the CB model to learn and make informed recommendations based on observed context and historical outcomes, allowing for personalized and effective decisionmaking in various domains.
 The images below (source) show the general architecture.
 A more advanced framework for training Contextual Bandit (CB) models called Xlearners was introduced in a 2019 paper. This method involves grouping the data based on the treatment (action) used for each group. For each group, a firstlevel machine learning (ML) model is trained to predict the observed reward. Then, secondlevel ML models are trained to predict the incremental lift, known as Conditional Average Treatment Effect (CATE), that would be observed if that group were to receive the “control” action (the currently deployed policy to be improved).
 During serving, the Xlearner CB model recommends the action with the highest predicted lift compared to the control action or makes probabilistic recommendations based on predicted lifts. The diagram provided illustrates the training steps for an Xlearner model with two actions: treatment0 (control) and treatment1. The input features are denoted as X, the recorded action as T, the observed rewards as Y (with Y0 for cases where T=0 was applied and Y1 for cases where T=1 was applied), and the predicted value of Y is represented as Y* by the firstlevel model.
 The Xlearner framework allows for a more sophisticated and accurate estimation of treatment effects by considering the incremental lift that can be achieved by different actions. By incorporating this framework into CB models, more informed and effective action recommendations can be made to optimize desired outcomes in various applications.
 The passage describes the implementation and evaluation of Contextual Bandit (CB) models for various applications at Instacart. The Xlearner framework, introduced in a 2019 paper, is used to train CB models. The framework involves grouping the data based on the treatment (action) used, training ML models to predict observed rewards, and then training additional models to predict the incremental lift compared to a control action. The trained CB models can recommend actions with the highest predicted lift or make probabilistic recommendations.
 To evaluate the trained CB models, two commonly used approaches are discussed. The first is Inverse Propensity Sampling (IPS), which weighs the rewards based on the ratio of probabilities between the new policy and the logging policy. The second is Doubly Robust (DR) estimation, which combines the predicted rewards and the IPS weighting for more accurate evaluation.
 Different tools and libraries are utilized for training and evaluating CB models. XGBoost is used for training CB models with discrete action spaces due to its fast training time and good predictive power. RLlib, an opensource library for training contextual bandit and reinforcement learning models, is used for neural networkbased CB models with discrete action spaces. RLlib supports a wide range of industry applications and can handle both discrete and continuous action spaces.
 Engineers from Anyscale collaborated with Instacart to create a Python script that leverages RLlib for training and evaluating CB/RL models. The script can be run on a laptop or in the cloud and provides IPS and DR estimates during training, which can be visualized using tensorboard. The bestperforming model based on IPS and DR metrics can be selected for further use.
 In conclusion, Instacart is actively exploring the use of CB models for retrieval and ranking problems in various areas. They have developed tools and techniques to train and evaluate these models and plan to share success stories in future blog posts.
Offline Evaluation Replay
 Offline evaluation replay is a method used to evaluate the performance of a recommendation or ranking system using historical data.
 In offline evaluation replay, the system’s recommendations or rankings are generated using a given algorithm, and then compared against the ground truth or actual outcomes from past interactions or historical data and is commonly used for contextual bandit algorithms.
 Offline evaluation replay has some limitations as well such as its inability to capture the dynamics of realtime online interactions, user feedback, or system updates. It assumes that the historical dataset is representative of the current user preferences and behaviors, which may not always be the case.
 Nevertheless, offline evaluation replay can provide valuable insights and initial assessment of the performance of recommendation or ranking algorithms before deploying them in live online systems.
Industry examples of bandits for RecSys
 Spotify: their recommending explanations for music recommendations, “recsplanations”, is an example of epsilongreedy.
 Yahoo: their news recommendations utilize UCB.
 Alibaba: also leverages UCB for item recommendations.
 DoorDash: uses Thompson Sampling for cuisine recommendations.
 Amazon: uses a bandit algorithm for realtime conversions on landing pages as well as clickthrough rates on search engine result pages.
 Twitter: uses bandit strategies to supply personalized recommendations.
Further Reading
 Recommended article from Eugene Yan’s’s blog: