Recommendation Systems
Readings
- Netflix | Best Practices for Building and Deploying Recommender Systems
- TF recommenders | Guide & Tutorials
- Recommendation systems: Principles, methods and evaluation
- System Design for Recommendations and Search
- Patterns for Personalization in Recommendations and Search
- Bandits for Recommender Systems
- Lecture 41 — Overview of Recommender Systems | Stanford University
- 21. Recommender Systems | d2l.ai
- Deep Learning for Recommender Systems (Nick Pentreath)
Tools
Case Studies
- Machine Learning for Better User-Experience: A Reddit Case Study: Removing posts that users have already seen, boosting subreddits that users have interacted with previously, etc. The aim is to maximize interaction rate, likelihood to comment, time on subreddit, etc. It ended up using a simple linear regression model, with great success.
- Innovative Recommendation Applications Using Two Tower Embeddings at Uber
- Scaling deep retrieval with TensorFlow Recommenders and Vertex AI Matching Engine
- Machine Learning System Design (YouTube Recommendation System): It covers position bias, where recommendations at the top are more likely to be clicked regardless of other factors. YouTube added a shallow tower that predicts the conditional probability of engagement given the position at training time to remove this bias, which substantially increased the engagement metric in live experiments.
- 47th #ebaytechtalk: Deep Learning for Recommender Systems: It compares the architectures used at different big companies, then discusses a case with a data sparsity of 0.0046%, versus 0.0426% for the MovieLens 1M dataset.
User journey
- Seeing the link to the website
- Visit
- Search for items
- View an item
- Add to cart / Save for later
- Purchase / Order
- Finish the payment
- Review
Identifying the business objectives and data in hand
Identifying the business objectives that call for this framework is not obvious. The technical goal is to suggest the most suitable items to users, and many business objectives can lead there, e.g. increasing user engagement, retention, satisfaction, conversion, or click-through rate, or decreasing churn. When the objective is personalization, a recommendation system is often the key tool to achieve it.
A recommendation system not only solves a classification task, it also solves a ranking task: it returns a list of items sorted by how interesting they might be to a specific user. It therefore needs the interaction outcome for each (user id, item id) pair, and we need lots of these rows. ref
Use a thumbs system, not a stars system
By 2017, Facebook and Netflix had A/B tested the five-star system against a thumbs up/down system, and found that the simpler thumbs system collected twice as many ratings, since the Likert scale forced users to think too much. So unless you can give users a strong incentive to rate items, use the thumbs system. ref
Not only the choices can be personalized, but also the content
Look at Netflix: the thumbnails are personalized based on the user's behavior, e.g. the user's favorite actor, genre, etc. So when considering recommendation systems, we might also think of personalized visuals, e.g. thumbnails, titles, descriptions, etc. Note that Netflix modeled this as a multi-armed bandit problem and used contextual bandits to solve it. Blog: Artwork Personalization at Netflix
To create a candidate pool of best-moment pictures, Netflix uses AVA: The Art and Science of Image Discovery at Netflix. It considers three types of metadata: visual, contextual, and composition. The composition metadata encodes domain experts' know-how about what makes a picture look good, such as the rule of thirds and depth of field.
Resources
- Thumbnail Artwork: How It’s Helping Million Dollar Businesses
- How Netflix Uses Matching To Pick The Best Thumbnail For You
- Selecting the best artwork for videos through A/B testing
- How Netflix picks polished thumbnails from its shows with image processing and deep learning to boost clicks
Content-based filtering
Recommend items based on the similarity of their content. It suffers less from the cold start problem, because it only needs item profiles (e.g. some tags) and a specific user's interactions with some categories; no interactions from other users are required. google course blog
For example, if a user likes comedy, another comedy is recommended.
Vector space model
We can create vectors for items based on their content, and on the users' interaction history with that content.
If there are categories or tags on items, we can turn them into one-hot encodings using pd.crosstab, then calculate the Jaccard similarity between item tag sets to evaluate the intersection over union of the word sets.
If there are no categories or tags but we have the content or a summary of each item, we can use NLP techniques, e.g. TF-IDF, to extract document vectors, then calculate the cosine similarity to evaluate the angle between document vectors. Both are sketched below.
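A minimal sketch of both similarity computations, using a hypothetical toy item-tag table and hypothetical item summaries:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item-tag pairs, one row per (item, tag)
df = pd.DataFrame({
    "item": ["A", "A", "B", "B", "C"],
    "tag":  ["comedy", "romance", "comedy", "action", "action"],
})

# One-hot item-tag matrix via pd.crosstab
onehot = pd.crosstab(df["item"], df["tag"]).astype(bool)

def jaccard(u, v):
    # Intersection over union of two tag sets
    return (u & v).sum() / (u | v).sum()

print(jaccard(onehot.loc["A"], onehot.loc["B"]))  # 1 shared tag / 3 tags total

# TF-IDF document vectors + cosine similarity for free-text summaries
summaries = ["a funny romantic comedy", "a funny action comedy", "an action thriller"]
tfidf = TfidfVectorizer().fit_transform(summaries)
print(cosine_similarity(tfidf))  # pairwise angles between document vectors
```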
Classification model
Ranking can be modeled as a learning-to-rank or classification task, with the latter being more commonly seen. If deep learning is applied, the final output layer is either a softmax over a catalog of items, or a sigmoid predicting the likelihood of user interaction (e.g., click, purchase) for each user-item pair.
However, in general, a softmax over the catalog implies a fixed set of output items. Whenever new items are added to the catalog, you have to change the output layer and retrain the model. In addition, training with a large softmax layer is time-consuming, though several techniques can mitigate this. On the other hand, a sigmoid over user-item pairs works with any number of items, including new items (provided the item embeddings are available). That said, it requires one prediction per user-item pair, which can be costly when there are many pairs (contrast this with the softmax approach, where a single prediction yields probabilities for all items). This is why YouTube uses softmax instead of sigmoid. ref
User profiles could also be used as features in this setting, which could act as a pooling or generalization effect that benefits new users with no interaction history.
For example, Google used deep neural networks for YouTube recommendations, treating recommendation as extreme multiclass classification.
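A minimal sketch of the two output-layer choices discussed above, with hypothetical sizes; this is not YouTube's actual architecture:

```python
import tensorflow as tf

n_items, emb_dim = 10_000, 64

# (a) Softmax over the catalog: one forward pass scores every item,
# but the output layer is tied to a fixed catalog size.
user_vec = tf.keras.Input(shape=(emb_dim,))
catalog_probs = tf.keras.layers.Dense(n_items, activation="softmax")(user_vec)
softmax_model = tf.keras.Model(user_vec, catalog_probs)

# (b) Sigmoid per user-item pair: handles any (even new) item that has an
# embedding, but needs one prediction per candidate pair.
pair_vec = tf.keras.Input(shape=(2 * emb_dim,))  # concat(user_emb, item_emb)
hidden = tf.keras.layers.Dense(32, activation="relu")(pair_vec)
click_prob = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
sigmoid_model = tf.keras.Model(pair_vec, click_prob)
```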
Regression model
With a small sample size, we can first classify users into groups, then calculate each group's rate of interaction with each item (or each group of items), and finally use regression to predict the interaction rate of a new user with a new item. This is more robust against the low signal-to-noise ratio of a small sample. Note that this method is also known as cohort analysis.
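A minimal sketch of the cohort aggregation step, assuming a hypothetical `logs` frame with one row per impression and a binary `clicked` column:

```python
import pandas as pd

logs = pd.DataFrame({
    "user_group": ["young", "young", "senior", "senior", "young"],
    "item_group": ["news", "sports", "news", "news", "sports"],
    "clicked":    [1, 0, 1, 0, 1],
})

# Interaction rate per (user group, item group) cohort
rates = (logs.groupby(["user_group", "item_group"])["clicked"]
             .mean()
             .rename("interaction_rate")
             .reset_index())

# A new user inherits the rates of the cohort they fall into; a regression
# on cohort-level features could then be fit on `rates` instead of raw rows.
print(rates)
```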
Collaborative filtering
The idea is to use the wisdom of the crowd: recommend items based on the similarity of the users who have interacted with them, typically by learning user and item embeddings. In this framework we don't need user profiles or item profiles, only the user-item interaction matrix. The interactions can be explicit, e.g. ratings, or implicit, e.g. clicks, views, etc. The advantage of implicit feedback is that we learn users' revealed preferences, which might differ from what they admit. But collaborative filtering suffers from the cold start problem: a new user or a new item has no interaction history.
For example, if user A is similar to user B and user B likes a certain video, then this video is recommended to user A.
By training different embeddings, we can provide different recommendations depending on the context, e.g. "users similar to you also like", "items similar to items you liked", etc.
Memory-based
Memory-based CF systems work directly with recorded values from item-item or user-user interactions, assuming no model. Search is done with similarity measures and nearest-neighbour algorithms. For example, find the users closest to user A and suggest items purchased by them (see the sketch after the list below). ref
- User-based: Similar behaviors of users
- Item-based: "people who purchased this also purchased that". Note that this differs from content-based filtering: it does not consider the content of the items, only the user behaviors related to them, such as buying, viewing, liking, etc.
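A minimal user-based sketch over a hypothetical toy interaction matrix (rows = users, columns = items, values = implicit interactions):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

interactions = np.array([
    [1, 0, 1, 0],   # user 0
    [1, 1, 1, 0],   # user 1 (most similar to user 0)
    [0, 0, 0, 1],   # user 2
])

sims = cosine_similarity(interactions)   # user-user similarities
neighbour = np.argsort(sims[0])[-2]      # nearest neighbour of user 0, excluding self
new_items = (interactions[neighbour] == 1) & (interactions[0] == 0)
print(np.where(new_items)[0])            # items the neighbour has that user 0 lacks -> [1]
```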
Model-based
Model-based approaches assume a generative model that explains user-item interactions and makes new predictions. They make use of matrix factorization algorithms that decompose the sparse user-item matrix into a product of two matrices: user-factor and item-factor. A lot of model-based methods are being actively researched, for example association rules, clustering algorithms, deep neural networks, etc.
- KNN regression: in case of unknown ratings of items, we can use KNN regression to predict the ratings based on the known ratings of the nearest neighbors.
- Matrix factorization: due to the sparsity of the user-item matrix, we can use matrix factorization to reduce the dimensionality of the matrix and find latent factors that represent the users and items, which are much cheaper to compute with. However, it leads to information loss and lower interpretability. Note that the number of latent factors is an essential hyperparameter to tune: we need to weigh the information kept against the computation cost and the risk of overfitting (see the sketch after this list).
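A minimal SGD matrix-factorization sketch on a hypothetical toy ratings matrix, with the number of latent factors `k` as the key hyperparameter:

```python
import numpy as np

R = np.array([[5, 3, 0], [4, 0, 1], [0, 1, 5]], dtype=float)  # 0 = unrated
mask = R > 0
k = 2                                            # number of latent factors

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

lr, reg = 0.01, 0.05
for _ in range(2000):
    for u, i in zip(*np.nonzero(mask)):          # only observed entries
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 1))  # reconstruction; the zeros are now predictions
```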
Matrix Factorization
- Matrix factorization objective function
- Welcome to the Matrix Factorization Jungle
- Welcome to The Advanced Matrix Factorization Jungle
- Welcome to The Advanced Matrix Factorization Jungle Backup
Two Tower Model
- Two Tower Model Architecture: Current State and Promising Extensions
- Video Recommendations at Joyn: Two Tower or Not to Tower, That Was Never a Question
Data sparsity
Users only rate the few items they most favor or disfavor; most items go unrated because users have no strong feelings about them. This leads to a sparse user-item matrix, which makes it hard to compute similarities between users or items. Calculating similarity sometimes requires imputing the missing values. Given the skewness of ratings, we might impute the missing values with the mean instead of 0, or with KNN regression. But KNN itself suffers from the sparsity problem and performs poorly when sparsity is high; in that case, consider matrix factorization.
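A minimal sketch contrasting the two imputation options on a hypothetical toy matrix (NaN = unrated):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

R = np.array([
    [5.0, np.nan, 1.0],
    [4.0, np.nan, np.nan],
    [np.nan, 2.0, 5.0],
])

mean_filled = SimpleImputer(strategy="mean").fit_transform(R)  # column (item) means
knn_filled = KNNImputer(n_neighbors=1).fit_transform(R)        # nearest user's ratings
print(mean_filled)
print(knn_filled)
```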
Cold start problem
It exists in collaborative filtering and can be categorized into two types:
- User cold start: new users with no interaction history
- Starting page, onboarding, user surveys, etc., to collect user information through multiple-choice questions, items or topics of interest, etc.
- Item cold start: new items with no interaction history
Here is an example of using LightFM to solve the item cold start problem when recommending StackExchange questions.
Here is another example from Microsoft that solves the cold-start problem with unified Boltzmann machines, probabilistic models that combine collaborative and content information in a coherent manner.
Exploration and Exploitation
Maximizing immediate profit might lead to long-term loss, because the system will exploit the current best items and ignore potentially better ones; in the long run the recommendations become boring and repetitive. So we need to balance exploration and exploitation, even at a short-term loss compared to always recommending the most profitable item. To achieve this, we can use multi-armed bandit algorithms such as Thompson sampling (sketched below), or add an exploration term to the objective function that rewards the model for exploring new items.
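A minimal Thompson-sampling sketch for a Bernoulli bandit, where each arm is an item and the click rates are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.05, 0.10, 0.02]        # hidden click rate per item (simulation only)
alpha = np.ones(3)                    # Beta posterior: successes + 1
beta = np.ones(3)                     # Beta posterior: failures + 1

for _ in range(5000):
    samples = rng.beta(alpha, beta)   # sample a plausible CTR per item
    arm = int(np.argmax(samples))     # exploration and exploitation in one step
    clicked = rng.random() < true_ctr[arm]
    alpha[arm] += clicked
    beta[arm] += 1 - clicked

print(alpha / (alpha + beta))         # posterior means approach true_ctr
```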
Model selection
Simplicity over complexity. Since the 2006 Netflix Prize, after 16 years of research, Netflix has improved its recommendation recall by only 4%. Most of the time we don't need advanced models to achieve good results; a misuse of advanced models may even lead to poor results due to data sparsity, the cold start problem, and overfitting. Sometimes a best-seller list or a hand-crafted ranking may be the best recommendation system, especially for new platforms with limited data and development time. a history of netflix recommendation systems
Logistic Regression vs recommendation system
- Identifying machine learning techniques for classification of target advertising
- A comparative study: classification vs. user-based collaborative filtering for clinical prediction
- A Music Recommendation System Based on logistic regression and eXtreme Gradient Boosting
Evaluation
Readings
Similarity metrics
- Pearson correlation coefficient: good for ratings, because it normalizes raters' different rating scales by subtracting the mean ratings.
- Cosine similarity: Good for high-dimensional or sparse data, defined over vectors
- Jaccard similarity: Good for low-dimensional or dense data, defined over sets
For asymmetric binary vectors, i.e. sets or one-hot encodings, consider the difference between Jaccard similarity and cosine similarity:
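Writing both measures for binary vectors makes the contrast explicit (the dot product of two binary vectors equals their intersection size):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|},
\qquad
\cos(A, B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}
$$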
We can see that cosine similarity does not involve the intersection in its denominator, while Jaccard similarity does. More intersection leads to a higher Jaccard similarity, but not necessarily a higher cosine similarity. So if we are interested in the intersection of sets, e.g. common categories, we should use Jaccard similarity. If instead we are interested in the semantic meaning of documents, i.e. the non-binary case, e.g. comparing the summaries of two products, we should use cosine similarity; Jaccard similarity is not even applicable there.
Predictive metrics
They indicate the relevance of a recommended item to the user and can be calculated from interactions, e.g. clicks, views, purchases, etc. Note that they are usually a poor choice on their own; ranking metrics are usually better, because a recommendation is usually a ranked list, not a single item.
- graded relevance score: regression metrics
- binary relevance label: classification metrics
Precision at K
It measures the proportion of relevant items among the top K recommended items. It is easy to interpret. However, it has no rank awareness and is sensitive to the number of relevant items, so averaging precision at K across all users might not be a good idea, and perfect precision is impossible when K is larger than the total number of relevant items in the dataset: the highest precision at 10 with only 3 relevant items in total is stuck at 30%.
Recall at K
It measures the coverage of relevant items in the top K: its denominator is the number of all relevant items, while the denominator for precision is K. Recall at K works well for applications with only a few relevant items, for example topic-specific information retrieval, where you expect the system to return all relevant items in the search results, even at the cost of precision. However, it also has no rank awareness and is also sensitive to the number of relevant items: if the total number of relevant items is larger than K, recall can never reach 1, and when it is much larger, it may never even reach 0.1. It also requires knowing the number of all relevant items, which is often impossible, since we don't know the true relevance of items the user never saw. details on precision at K and recall at K
- F-score at K: precision focuses on how accurate the system is when it recommends items, while recall emphasizes capturing all relevant items, even at the cost of recommending some irrelevant ones. The F-beta score combines them into a balanced assessment, with the beta parameter controlling the balance.
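Minimal helpers for both metrics, assuming `recommended` is a ranked list and `relevant` is the set of all items the user actually found relevant:

```python
def precision_at_k(recommended, relevant, k):
    return sum(item in relevant for item in recommended[:k]) / k

def recall_at_k(recommended, relevant, k):
    return sum(item in relevant for item in recommended[:k]) / len(relevant)

recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"a", "c", "z"}
print(precision_at_k(recommended, relevant, 10))  # 0.2 -- capped by 3 relevant items
print(recall_at_k(recommended, relevant, 10))     # 2/3 -- "z" was never recommended
```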
Ranking metrics
Most recommendations and ranking scenarios involve lots of items. The users can’t feasibly interact with all of them. You can’t obtain enough actual labels.
- Mean Reciprocal Rank (MRR) calculates the average of the reciprocal ranks of the first relevant item, i.e. how soon you can find a relevant item. If the first relevant item takes the third place, the RR equals 1/3. MRR is the average over all users or queries. However, it focuses solely on the first relevant item and disregards all the rest (see the sketch after this list).
- Mean Average Precision (MAP) at K measures the average precision across different recall levels for a ranked list. It heavily rewards correct recommendations at the top of the list, but it might be hard to communicate.
- Hit rate at K measures the share of users that get at least one relevant recommendation in the top K, which is a binary score for each user, then be averaged across all users.
- Normalized Discounted Cumulative Gain (NDCG) at K is complicated; even I cannot understand it in 2 minutes, and it would be hard to explain to a layman, so just skip it.
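Minimal sketches of MRR's reciprocal rank and MAP's average precision, using the same `recommended` (ranked list) and `relevant` (set) convention as above:

```python
def reciprocal_rank(recommended, relevant):
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1 / rank              # only the first hit counts
    return 0.0

def average_precision_at_k(recommended, relevant, k):
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank         # precision at each hit position
    return score / min(len(relevant), k) if relevant else 0.0

print(reciprocal_rank(["x", "y", "a"], {"a"}))                 # 1/3
print(average_precision_at_k(["a", "x", "c"], {"a", "c"}, 3))  # (1/1 + 2/3) / 2
```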
Behavioral metrics
Metrics can go beyond accuracy and evaluate other important qualities of a recommender system:
- Diversity evaluates the variety of items recommended to users, which could be measured by intra-list diversity, i.e. the average cosine distance between pairs of items inside the list (sketched after this list).
- Novelty evaluates how unique or unusual the recommended items are, which could be computed as the negative logarithm of the probability of encountering a given item in a training set, i.e. the surprise in information theory. High novelty corresponds to long-tail items that few users interacted with, while low novelty corresponds to popular items. It reflects the system’s ability to recommend items that are not well-known in the dataset.
- Serendipity evaluates the unexpectedness of the recommended items, which could be measured by the dissimilarity (cosine distance) between successfully recommended items and a user's historical preferences. It differs from novelty in that it is about the specific user being recommended to, while novelty is about the general population.
- Popularity bias evaluates the tendency of the system to recommend popular items, indicating a lack of personalization. There are many ways to calculate this; one is the average overlap between the items in different users' lists.
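A minimal intra-list diversity sketch: the average pairwise cosine distance between the (hypothetical) embeddings of the items in one recommendation list:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

item_embeddings = np.array([   # hypothetical embeddings of one recommended list
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])

dist = cosine_distances(item_embeddings)
n = len(item_embeddings)
print(dist[np.triu_indices(n, k=1)].mean())  # mean over distinct pairs
```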
Business metrics
- Revenue
- Click-through rate (CTR)
- Conversion rate
- User engagement metrics, e.g. session duration, bounce rate (the percentage of users who leave after only viewing one page or item), etc.
Perspective
User perspective for insights on user engagement, especially if the business often sees conversions after multiple interactions:
- Conversion rate: number of purchases per user
- Click-through rate: number of clicks per user
Session perspective for insights into immediate conversion effectiveness, ideal for businesses where conversions typically happen in a single session:
- Session duration: the time spent on the platform per session
- Conversion rate: the percentage of purchases per session
- Click-through rate: the number of clicks per session
Item perspective for insights into immediate conversion effectiveness, ideal for businesses where conversions typically happen in a single recommendation, e.g. you show an ad and want the user to click on it without hesitation:
- Conversion rate: number of purchases per recommended item
- Click-through rate: number of clicks per recommended item
Time horizon
It depends on the business, i.e. how long the user journey is, or how frequently a conversion happens: for news recommendation the time window should be short, e.g. 1 day, while for e-commerce it could be longer, e.g. 1 week.
Evaluation after deployment
Methods
- A/B testing to check confidence intervals of the metrics of the treatment group and the control group, or the cumulative treatment effects.
- t-test
- Regression discontinuity: point comparison, only for immediate effect
- Difference-in-differences: aggregate the effects (see the sketch after this list)
- Bayesian Structural Time Series (Causal Impact): aggregate the effects, isolate some latent factors, e.g. seasonality, trend, etc.
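A minimal difference-in-differences sketch with statsmodels, assuming a hypothetical frame with a metric, a treated flag, and a post-launch flag:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "metric":  [1.0, 1.1, 1.0, 1.2, 1.1, 1.6, 1.0, 1.2],
    "treated": [0, 0, 1, 1, 0, 1, 0, 1],
    "post":    [0, 0, 0, 0, 1, 1, 1, 1],
})

# The coefficient on the interaction term is the DiD estimate of the effect.
model = smf.ols("metric ~ treated * post", data=df).fit()
print(model.params["treated:post"])
```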
Linking back to traditional statistics
- PCA is a kind of matrix factorization (see the sketch below). ref
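A minimal sketch of that link: SVD of the mean-centered data matrix recovers sklearn's principal components (up to sign):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))
Xc = X - X.mean(axis=0)                             # PCA centers the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U @ diag(S) @ Vt
pca = PCA(n_components=2).fit(X)

print(np.allclose(np.abs(Vt[:2]), np.abs(pca.components_)))  # True (up to sign)
```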
Readings
- A gentle introduction to common recommender system algorithms
- Designing and evaluating a news recommendation system with microeconomic models
- Introduction to Recommender systems@thingsolver
- Introduction to Recommendation systems@Google
- Matrix Factorization Techniques for Recommender Systems
- Recommendation Systems • Evaluation Metrics and Loss Functions
- 21.5. Personalized Ranking for Recommender Systems@d2l.ai