Overview

  • In this article, we will delve into different architectures used in recommender systems.
  • “Recsys models are built from two types of features — dense (continuous real values) & sparse (categorical with low or high cardinality).
  • The model transforms can be divided in two parts:
  • Feed forward dense neural networks used to transform the dense features and interactions between dense features and sparse features.
  • Embedding network transforms for sparse features. This is usually replaced with lookup tables containing pre-computed embeddings.” (source)
  • Let’s look at an example:
  • Dense Features:
    1. Movie Ratings: This feature represents the continuous real values indicating the ratings given by users to movies. For example, a rating of 4.5 out of 5 would be a dense feature value.
    2. Movie Release Year: This feature represents the continuous real values indicating the year in which the movie was released. For example, the release year 2000 would be a dense feature value.
  • Sparse Features:
    1. Movie Genre: This feature represents the categorical information about the genre(s) of a movie, such as “Action,” “Comedy,” or “Drama.” These categorical values have low cardinality, meaning there are a limited number of distinct genres.
    2. Movie Actors: This feature represents the categorical information about the actors who starred in a movie. These categorical values can have high cardinality, as there could be numerous distinct actors in the dataset.
  • In the model architecture, the dense features like movie ratings and release year can be directly fed into a feed-forward dense neural network. The dense network performs transformations and computations on the continuous real values of these features.
  • On the other hand, the sparse features like movie genre and actors require a different approach. Instead of directly using the raw categorical values, an embedding network is employed. The embedding network maps each sparse feature value (e.g., genre or actor) to a low-dimensional dense vector representation called an embedding. These embeddings capture the semantic relationships and similarities between different categories. The embedding lookup tables contain pre-computed embeddings for each sparse feature value, allowing for efficient retrieval during the model’s inference.
  • By combining the outputs of the dense neural network and the embedding lookup tables, the model can capture the interactions between dense and sparse features, leading to better recommendations based on both continuous and categorical information.
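  • As a minimal sketch of this split (not from the source; the module layout, dimensions, and feature choices below are illustrative), a PyTorch model might embed the sparse genre/actor IDs, pass the dense rating/year features through an MLP, and concatenate both before a final scoring head:

```python
import torch
import torch.nn as nn

class ToyRecModel(nn.Module):
    def __init__(self, num_genres=20, num_actors=10_000, emb_dim=16):
        super().__init__()
        # Embedding tables for sparse (categorical) features
        self.genre_emb = nn.Embedding(num_genres, emb_dim)
        self.actor_emb = nn.Embedding(num_actors, emb_dim)
        # Feed-forward network for dense features (rating, release year)
        self.dense_mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, emb_dim))
        # Final layers combine dense and sparse representations
        self.head = nn.Sequential(nn.Linear(emb_dim * 3, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, dense_x, genre_id, actor_id):
        d = self.dense_mlp(dense_x)                     # (B, emb_dim) from dense features
        g = self.genre_emb(genre_id)                    # (B, emb_dim) genre embedding
        a = self.actor_emb(actor_id)                    # (B, emb_dim) actor embedding
        return self.head(torch.cat([d, g, a], dim=-1))  # (B, 1) score

model = ToyRecModel()
score = model(torch.tensor([[4.5, 2000.0]]), torch.tensor([3]), torch.tensor([42]))
```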

  • The image below, (source), displays this.
  • ✔️ Serving paradigms: The trained model can be optimized further for serving. This includes physical transformations (model processing such as pruning, quantization, etc.) and logical transformations (splitting the model, separating out embedding tables, etc.). The logical transformations are done to optimize model execution, storage, and latency with respect to the hardware available for serving. For example, embedding tables could be hosted separately on CPU machines with large memory and I/O bandwidth, while dense feature processing and dense-sparse interaction could be hosted on GPU cards, which benefit from parallel speedups. Since the hardware setup for serving is mostly fixed for an enterprise, you can create a set of supported serving paradigms; essentially, the deployment plan is the serving paradigm. A typical serving paradigm for recsys is using GPU cards for dense networks and high-memory CPU machines for embedding tables, as sketched below.
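  • As a rough illustration of such a serving split (a sketch only; the table size, layer widths, and device placement below are assumptions, not a prescribed setup), the embedding table can live on a CPU host while the dense network runs on a GPU:

```python
import torch
import torch.nn as nn

# Illustrative serving split: the large embedding table stays on a high-memory
# CPU host; the dense/interaction layers run on a GPU (falls back to CPU here).
gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")

actor_table = nn.Embedding(1_000_000, 64)                    # CPU-resident lookup table
dense_net = nn.Sequential(nn.Linear(64 + 2, 128), nn.ReLU(),
                          nn.Linear(128, 1)).to(gpu)         # GPU-resident dense part

def serve(dense_feats, actor_ids):
    emb = actor_table(actor_ids)                             # embedding lookup on CPU
    x = torch.cat([dense_feats, emb], dim=-1).to(gpu)        # ship only small vectors to GPU
    return dense_net(x)                                      # (B, 1) score

score = serve(torch.tensor([[4.5, 2000.0]]), torch.tensor([42]))
```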

Deep Neural Network Models for Recommendation

  • Deep neural network models have gained significant popularity in the field of recommendation systems. These models leverage various variations of artificial neural networks (ANNs) to effectively capture complex patterns and make accurate recommendations. Let’s explore some of these variations: (source for this section)
  • Pros: Capable of learning complex, non-linear relationships between inputs. Can handle a variety of feature types. Suitable for both candidate generation and fine ranking.
  • Cons: Can be computationally expensive and require a lot of data to train effectively. Might overfit on small datasets. The inner workings of the model can be hard to interpret (“black box”).
  • Use case: Best suited when you have a large dataset and require a model that can capture complex patterns and interactions between features.
    1. Feedforward Neural Networks (FNNs):
    • FNNs are a type of ANN where information flows in a unidirectional manner from one layer to the next.
    • Multilayer perceptrons (MLPs) are a specific type of FNNs that consist of at least three layers: an input layer, one or more hidden layers, and an output layer.
    • MLPs are versatile and can be applied to a wide range of scenarios.
    2. Convolutional Neural Networks (CNNs):
    • CNNs are primarily known for their effectiveness in image processing tasks, such as object identification and image classification.
    • They employ convolutional operations to extract important features from input data.
    3. Recurrent Neural Networks (RNNs):
    • RNNs are specifically designed to handle sequential data and capture temporal dependencies.
    • They are commonly used in natural language processing (NLP) tasks to parse language patterns and process sequential data.
  • In the realm of recommendation systems, deep learning (DL) models build upon traditional techniques like factorization to model interactions between variables. DL models also utilize embeddings to handle categorical variables. Embeddings are learned vector representations of entity features, where similar entities (users or items) lie close to each other in the vector space. For example, a deep learning approach to collaborative filtering can learn user and item embeddings based on their interactions using a neural network (see the sketch at the end of this section).
  • DL techniques tap into the extensive library of novel network architectures and optimization algorithms. They excel in training on large datasets, leverage the power of deep learning for feature extraction, and enable the creation of more expressive models.
  • There are several current DL-based models for recommender systems, including:
  1. Deep Learning Recommendation Model (DLRM)
  2. Wide and Deep (W&D)
  3. Neural Collaborative Filtering (NCF)
  4. Variational AutoEncoder (VAE)
  5. BERT (for NLP)
  • These models, offered as part of the NVIDIA GPU-accelerated DL model portfolio, cover a wide range of network architectures and applications beyond recommender systems. They are designed and optimized for training with popular deep learning frameworks like TensorFlow and PyTorch. In addition to recommender systems, these models find applications in domains such as image analysis, text analysis, and speech analysis.
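  • To make the embedding idea above concrete, here is a minimal sketch (not from the source; the names, sizes, and the single toy training step are illustrative) of an NCF-style model that learns user and item embeddings from interaction labels:

```python
import torch
import torch.nn as nn

class NeuralCF(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        # Learned embedding tables for users and items
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        # MLP over the concatenated user/item embeddings
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)
        v = self.item_emb(item_ids)
        return self.mlp(torch.cat([u, v], dim=-1)).squeeze(-1)  # interaction logit

model = NeuralCF(n_users=1000, n_items=5000)
loss_fn = nn.BCEWithLogitsLoss()
# One training step on a toy batch of (user, item, clicked/not-clicked) triples
users, items, labels = torch.tensor([1, 2]), torch.tensor([10, 20]), torch.tensor([1.0, 0.0])
loss = loss_fn(model(users, items), labels)
loss.backward()
```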

Deep FM

  • FM stands for factorization machines.
  • DeepFM combines factorization machines (FM) with deep neural networks. It utilizes FM to model pairwise feature interactions and a deep neural network to capture higher-order feature interactions. This architecture leverages both linear and non-linear relationships between features. A minimal sketch appears at the end of this section.
  • Pros: Combines the benefits of factorization machines (FM) and deep neural networks, capturing both pairwise and higher-order feature interactions.
  • Cons: Increased model complexity and potential challenges in training and optimization.
  • Advantages: Accurate modeling of both linear and non-linear relationships between features, providing a comprehensive understanding of feature interactions.
  • Example Use Case: Click-through rate prediction in online advertising or personalized recommendation systems.
  • Phase: Candidate Generation, Ranking. Recommendation Workflow: DeepFM is commonly utilized in both the candidate generation and ranking phases. It combines the strengths of factorization machines and deep neural networks. In the candidate generation phase, DeepFM can capture pairwise feature interactions efficiently. In the ranking phase, it can leverage deep neural networks to model higher-order feature interactions and improve the ranking of candidate items.
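  • A minimal DeepFM-style sketch (not from the source; the field sizes, layer widths, and toy input below are assumptions) showing an FM second-order term alongside a deep MLP over the same field embeddings:

```python
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, field_sizes, dim=8):
        super().__init__()
        # One embedding table per categorical field, shared by the FM and deep parts
        self.embs = nn.ModuleList([nn.Embedding(n, dim) for n in field_sizes])
        self.linear = nn.ModuleList([nn.Embedding(n, 1) for n in field_sizes])
        self.mlp = nn.Sequential(nn.Linear(len(field_sizes) * dim, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):  # x: (B, num_fields) of categorical ids
        vecs = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embs)], dim=1)  # (B, F, d)
        # First-order (linear) term
        first = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.linear)], dim=1).sum(dim=(1, 2))
        # FM second-order term: 0.5 * ((sum v)^2 - sum v^2), summed over embedding dims
        square_of_sum = vecs.sum(dim=1).pow(2)
        sum_of_square = vecs.pow(2).sum(dim=1)
        fm = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
        # Deep part over the concatenated field embeddings (higher-order interactions)
        deep = self.mlp(vecs.flatten(start_dim=1)).squeeze(-1)
        return first + fm + deep  # CTR logit

model = DeepFM(field_sizes=[100, 50, 1000])
logit = model(torch.tensor([[3, 7, 42]]))
```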

Deep and Cross Networks

  • The Deep and Cross Network (DCN) architecture includes a cross network component that captures cross-feature interactions. It combines a deep network with cross layers, allowing the model to learn explicit feature interactions and capture non-linear relationships between features. A minimal cross-layer sketch appears at the end of this section.
  • Pros: Captures explicit feature interactions and non-linear relationships through cross layers; explicitly learns high-order feature interactions and combines low-rank and high-rank features. Can handle both numerical and categorical inputs.
  • Cons: More complex than simple feed-forward networks; increased model complexity, potential overfitting on sparse data, and challenges in training large-scale models. May not perform well on tasks where feature interactions aren’t important.
  • Use case: Useful for tasks where high-order feature interactions are critical, such as CTR prediction and ranking tasks.
  • Advantages: Enhanced modeling of feature interactions and non-linear relationships, improving recommendation accuracy.
  • Example Use Case: Advertising platforms where understanding the interactions between user characteristics and ad features is essential for personalized ad targeting.
  • Phase: Ranking, Final Ranking. Recommendation Workflow: The deep and cross architecture is typically applied in the ranking phase and the final ranking phase. The deep and cross network captures explicit feature interactions and non-linear relationships, enabling accurate ranking of candidate items based on user preferences. It contributes to the final ranking of candidate items, leveraging its ability to model complex patterns and interactions.
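  • A minimal sketch of the cross-layer idea (dimensions, number of cross layers, and layer widths below are illustrative), where each cross layer computes x_{l+1} = x_0 * (w^T x_l) + b + x_l and a parallel deep branch handles implicit interactions:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, 1, bias=False)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, xl):
        # Each application adds one explicit interaction order
        return x0 * self.w(xl) + self.b + xl            # (B, dim)

class DCN(nn.Module):
    def __init__(self, dim=16, n_cross=3):
        super().__init__()
        self.cross = nn.ModuleList([CrossLayer(dim) for _ in range(n_cross)])
        self.deep = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Linear(dim + 64, 1)

    def forward(self, x):                                # x: (B, dim) input features/embeddings
        xc = x
        for layer in self.cross:
            xc = layer(x, xc)                            # explicit cross interactions
        xd = self.deep(x)                                # implicit deep interactions
        return self.head(torch.cat([xc, xd], dim=-1))    # (B, 1) logit

logit = DCN()(torch.randn(4, 16))
```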

Wide and Deep

  • Wide part: The wide part of the model is a generalized linear model that takes into account cross-product feature transformations, in addition to the original features. The cross-product transformations capture interactions between categorical features. For example, if you were building a real estate recommendation system, you might include a cross-product transformation of “city=San Francisco” AND “type=condo”. These cross-product transformations can effectively capture specific, niche rules, offering the model the benefit of memorization.
  • Deep part: The deep part of the model is a feed-forward neural network that takes all features as input, both categorical and continuous. However, categorical features are typically transformed into embeddings first, as neural networks work with numerical data. The deep part of the model excels at generalizing patterns from the data to unseen examples, offering the model the benefit of generalization.
  • By combining these two components, Wide & Deep models aim to achieve a balance between memorization and generalization, which can be particularly useful in recommendation systems, where both aspects can be important. The wide part can capture specific item combinations that a particular user might like (based on historical data), while the deep part can generalize from user behavior to recommend items that the user hasn’t interacted with yet but might find appealing based on their broader preferences.
  • Wide and Deep architectures combine a deep neural network component for capturing complex patterns and a wide component that models feature interactions explicitly. This allows the model to learn both deep representations and exploit feature interactions, providing a balance between memorization and generalization.
  • The wide & deep model thus consists of two parts: a generalized linear model over raw and cross-product features (wide), and a dense neural network over continuous features plus embeddings of the categorical features (deep). A minimal sketch appears at the end of this section.
  • The image below, (source), displays this.
  • Pros: Balances memorization (wide component) and generalization (deep component), capturing both complex patterns and explicit feature interactions.
  • Cons: Increased model complexity and potential challenges in training and optimization.
  • Advantages: Improved performance by leveraging both deep representations and explicit feature interactions.
  • Example Use Case: E-commerce platforms where a combination of user behavior and item features plays a crucial role in recommendations.
  • Phase: Candidate Generation, Ranking.
  • Recommendation Workflow: The deep and wide architecture is suitable for both candidate generation and ranking phases. The wide component can capture explicit feature interactions and enhance the candidate generation process. The deep component allows for learning complex patterns and interactions, improving the ranking of candidate items based on user-item preferences.
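  • A minimal Wide & Deep sketch (not from the source; the feature dimensions and layer widths below are made up) combining a linear wide part over one-hot/cross-product features with a deep MLP over embeddings and continuous features:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_wide_features, n_items, emb_dim=16, n_dense=4):
        super().__init__()
        # Wide part: linear model over (cross-product) one-hot features -> memorization
        self.wide = nn.Linear(n_wide_features, 1)
        # Deep part: embeddings + continuous features through an MLP -> generalization
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.deep = nn.Sequential(nn.Linear(emb_dim + n_dense, 64),
                                  nn.ReLU(), nn.Linear(64, 1))

    def forward(self, wide_x, item_id, dense_x):
        deep_in = torch.cat([self.item_emb(item_id), dense_x], dim=-1)
        return self.wide(wide_x) + self.deep(deep_in)    # combined logit

model = WideAndDeep(n_wide_features=1000, n_items=5000)
logit = model(torch.randn(2, 1000), torch.tensor([7, 9]), torch.randn(2, 4))
```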

GNN

  • GNN architectures utilize graph structures to capture relationships between users, items, and their interactions. GNNs propagate information through the user-item interaction graph, enabling the model to learn user and item representations that incorporate relational dependencies. This is particularly useful in scenarios with rich graph-based data. A small propagation sketch appears at the end of this section.
  • Pros: Captures relational dependencies and propagates information through graph structures, enabling better modeling of complex relationships.
  • Cons: Requires graph-based data and potentially higher computational resources for training and inference.
  • Advantages: Improved recommendations by incorporating the rich relational information among users, items, and their interactions.
  • Example Use Case: Social recommendation systems, where user-user connections or item-item relationships play a significant role in personalized recommendations.
  • Phase: Candidate Generation, Ranking, Retrieval. Recommendation Workflow: GNN architectures are suitable for multiple phases of the recommendation workflow. In the candidate generation phase, GNNs can leverage graph structures to capture relational dependencies and generate potential candidate items. In the ranking phase, GNNs can learn user and item embeddings that incorporate relational information, leading to improved ranking. In the retrieval phase, GNNs can assist in efficient retrieval of relevant items based on their graph-based representations.
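  • A small sketch of one round of propagation over a toy user-item interaction graph (a simplified, LightGCN-flavored neighbor averaging; the graph, sizes, and single hop are illustrative, not a full GNN library example):

```python
import torch
import torch.nn as nn

n_users, n_items, dim = 3, 4, 8
user_emb = nn.Embedding(n_users, dim)
item_emb = nn.Embedding(n_items, dim)

# interactions[u, i] = 1 if user u interacted with item i
interactions = torch.tensor([[1., 0., 1., 0.],
                             [0., 1., 1., 0.],
                             [0., 0., 1., 1.]])
# Normalize so each node averages over its neighbors
user_norm = interactions / interactions.sum(dim=1, keepdim=True).clamp(min=1)
item_norm = interactions.t() / interactions.t().sum(dim=1, keepdim=True).clamp(min=1)

# One propagation step: users aggregate their items, items aggregate their users
u0, i0 = user_emb.weight, item_emb.weight
u1 = user_norm @ i0        # (n_users, dim) user representations after one hop
i1 = item_norm @ u0        # (n_items, dim) item representations after one hop

# Score a user-item pair from the propagated representations
score = (u1[0] * i1[2]).sum()
```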

Split Network

  • A split network is a generalized version of a two-tower network. The same embedding-lookup optimization holds here as well; instead of a dot product, a simple neural network can be used to produce the output.
  • The image below, (source), displays this.
  • In a split network architecture, different components of the recommendation model are split and processed separately. For example, the user and item features may be processed independently and combined in a later stage. This allows for parallel processing and efficient handling of large-scale recommender systems. A minimal sketch appears at the end of this section.
  • Pros: Enables parallel processing, efficient handling of large-scale systems, and flexibility in designing and optimizing different components separately.
  • Cons: Requires additional coordination and synchronization between the split components, potentially increasing complexity.
  • Advantages: Scalability, flexibility, and improved performance in handling large-scale recommender systems.
  • Example Use Case: Recommendation systems with a massive number of users and items, where parallel processing is crucial for efficient computation.
  • Phase: Candidate Generation, Ranking, Final Ranking.
  • Recommendation Workflow: The split network architecture can be utilized in various phases. During the candidate generation phase, the split network can be used to process user and item features independently, allowing efficient retrieval of potential candidate items. In the ranking phase, the split network can be employed to learn representations and capture interactions between the user and candidate items. Finally, in the final ranking phase, the split network can contribute to the overall ranking of the candidate items based on learned representations.
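  • A minimal split-network sketch (dimensions and widths are illustrative): user and item features pass through separate sub-networks, and a small MLP replaces the dot product to combine them; the item side can be precomputed and cached:

```python
import torch
import torch.nn as nn

class SplitNetwork(nn.Module):
    def __init__(self, user_dim=32, item_dim=48, hidden=64):
        super().__init__()
        self.user_net = nn.Sequential(nn.Linear(user_dim, hidden), nn.ReLU())
        self.item_net = nn.Sequential(nn.Linear(item_dim, hidden), nn.ReLU())
        # Small MLP instead of a dot product to combine the two sides
        self.combine = nn.Sequential(nn.Linear(2 * hidden, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, user_x, item_x):
        u = self.user_net(user_x)                        # (1, hidden): once per request
        v = self.item_net(item_x)                        # (N, hidden): can be precomputed/cached
        u = u.expand(v.shape[0], -1)                     # broadcast the user to N candidates
        return self.combine(torch.cat([u, v], dim=-1))   # (N, 1) scores

model = SplitNetwork()
scores = model(torch.randn(1, 32), torch.randn(100, 48))  # score 100 candidates for one user
```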

Two towers

  • “One of the more popular architectures in personalization / recsys is the two-tower network. The two towers of the network usually represent the user tower (U) and the candidate tower (C). The towers produce dense vectors (embedding representations) of U and C respectively. The final network is just a dot product or cosine similarity function.
  • Let’s say the cost of executing the user tower is u, the cost of executing the candidate tower is c, and the cost of the dot product is d.
  • At request time, the cost of executing the whole network for ranking N candidates for one user is N*(u + c + d).
  • Since the user is fixed, you need to compute the user tower only once, so the cost becomes u + N*(c + d). Candidate embeddings could also be cached, so the final cost becomes u + N*d + k, where k is the cost of the cache lookup.” (source)
  • The image below, (source), displays this.
  • The two-tower architecture consists of two separate branches: a query tower and a candidate tower. The query tower learns user representations based on user history, while the candidate tower learns item representations based on item features. The two towers are typically combined in the final stage to generate recommendations. A minimal sketch appears at the end of this section.
  • Pros: Explicitly models user and item representations separately, allowing for better understanding of user preferences and item features.
  • Cons: Requires additional computation to learn and combine the representations from the query and candidate towers.
  • Advantages: Improved personalization by learning user and item representations separately, which can capture fine-grained preferences.
  • Example Use Case: Personalized recommendation systems where understanding the user’s historical behavior and item features separately is critical.
  • Phase: Candidate Generation, Ranking.
  • Recommendation Workflow: The two-tower architecture is often employed in the candidate generation and ranking phases. In the candidate generation phase, the two-tower architecture enables the separate processing of user and item features, capturing their respective representations. In the ranking phase, the learned representations from the query and candidate towers are combined to assess the relevance of candidate items to the user’s preferences.
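  • A minimal two-tower sketch (dimensions and widths are illustrative); the comments map the pieces onto the u, c, d, and k costs discussed above:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_dim=32, item_dim=48, emb_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
        self.item_tower = nn.Sequential(nn.Linear(item_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def score(self, user_x, item_x):
        u = self.user_tower(user_x)      # run once per request           (cost u)
        v = self.item_tower(item_x)      # often precomputed and cached   (cost c, or k if cached)
        return v @ u.squeeze(0)          # N dot products                 (cost N*d)

model = TwoTower()
scores = model.score(torch.randn(1, 32), torch.randn(1000, 48))  # rank 1000 candidates
```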

Reference