Multi-armed Bandit


The problem is to choose k out of n items at each time step t so that the cumulative reward is maximized, balancing exploration of potentially high-reward items against exploitation of the known high-reward items.

Must read A Multi-Armed Bandit Framework for Recommendations at Netflix | Netflix - The task is to provide a personalized homepage for each member. The goal is to quickly help members find content they’d like to watch; the risk is that members lose interest and abandon the service. The challenge is the need to serve an enormous base of 117M+ members. Another task is the billboard recommendation for each movie: the goal is to recommend a single relevant title to each member at the right time and respond quickly to member feedback.

Components

  • Arms: The items/actions to choose from
  • Reward: The feedback from the user
  • Policy: The strategy to choose the arms
  • Regret: The difference between the reward of the best arm and the reward of the chosen arm (formalized below)
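Formally, the cumulative regret over a horizon of $T$ steps is commonly written as

$$
\text{Regret}(T) = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)
$$

where $\mu^*$ is the expected reward of the best arm and $\mu_{a_t}$ is the expected reward of the arm chosen at step $t$. A good policy keeps this quantity growing sublinearly in $T$.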

Exploration vs Exploitation

The best long-term strategy may involve short-term sacrifices.

  • Exploration: Explore different options to learn more about them
  • Exploitation: Exploit the best known options to maximize the reward

Principles of exploration

  • Naive exploration: Add noise to the greedy policy, e.g. epsilon-greedy (sketched below)
  • Optimism in the face of uncertainty: Prefer actions with uncertain values, e.g. upper confidence bound (UCB) (sketched below)
  • Probability matching: Select actions according to the probability that they are optimal, e.g. Thompson sampling (covered in its own section further down)
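A minimal Python sketch of the first two principles, assuming Bernoulli-style rewards tracked with running means; the array names `counts` and `values` are illustrative, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(counts, values, epsilon=0.1):
    """Naive exploration: with probability epsilon pick a random arm, otherwise the best known arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))   # explore
    return int(np.argmax(values))               # exploit

def ucb1(counts, values):
    """Optimism in the face of uncertainty: add a confidence bonus that shrinks as an arm is pulled more."""
    if np.any(counts == 0):
        return int(np.argmin(counts))           # pull every arm at least once
    bonus = np.sqrt(2 * np.log(counts.sum()) / counts)
    return int(np.argmax(values + bonus))

def update(counts, values, arm, reward):
    """Incremental update of the running mean reward for the pulled arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```

In use, each time step calls one of the selection functions, observes a reward, and calls `update` on the chosen arm.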

Contextual Bandit

There are two bandit types: context-free and contextual. A context-free bandit is based entirely on historical reward data, which usually suits general recommendation that ignores the individual or any other context. A contextual bandit, on the other hand, conditions on the current context, i.e. the current state of the environment (e.g. user, product, or other features), which usually suits personalized recommendation.
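To make the contextual case concrete, here is a rough sketch of one common contextual policy, disjoint LinUCB, which keeps a separate linear model per arm over a context feature vector `x` (e.g. user features); the class and parameter names are illustrative, not tied to any particular system:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: a separate ridge-regression model per arm over context features."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                              # width of the optimism bonus
        self.A = np.stack([np.eye(dim)] * n_arms)       # per-arm d x d design matrices
        self.b = np.zeros((n_arms, dim))                # per-arm reward-weighted feature sums

    def choose(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                         # ridge estimate of the arm's weights
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Usage would look like `arm = policy.choose(x)` for the current user's feature vector, followed by `policy.update(arm, x, reward)` once feedback arrives.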

Contextual Bandit vs Recommendation System

  • Contextual bandits are well suited to handle 10-100 arms; when you have thousands, use a recommendation system ref
  • If you’re facing a cold-start problem, you’ll be better served by a contextual bandit
  • If you need to make many decisions for each user, a recommendation system is probably best; if only a few, contextual bandits will work great
  • For the biggest problems (and biggest opportunities), a recommendation system may be warranted. For everything else, there’s contextual bandits.

Thompson Sampling

It is a Bayesian learning method. In the case of website article recommendation, there are n articles we want to show on the front page, and we want to estimate their conversion rates so that we can prioritize the articles. With Thompson sampling, we assume the conversion rate of each of the n articles independently follows a beta distribution. We then draw one random sample from each beta distribution and pick the top-k articles with the highest sampled values to show. With user interactions and assuming conjugate priors, we can update the posterior distributions very quickly from the results of the listing, i.e. the clicks.

The beta distributions represent our current certainty about the conversion probabilities, and as we collect more data their variances decrease. If we are not certain how an article performs, i.e. low mean with high variance, it can still be chosen when its sampled value happens to be high, and the resulting user feedback provides more evidence, which lowers the variance. Eventually, an article that we know is certainly bad, i.e. a narrow distribution with low mean and variance, will almost never be chosen over a better one. In the end, the policy picks the same best options almost every time.
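A minimal Beta-Bernoulli sketch of the procedure above; the article count, `k`, and the Beta(1, 1) priors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_articles, k = 10, 3
alpha = np.ones(n_articles)   # 1 + observed clicks per article
beta = np.ones(n_articles)    # 1 + observed non-clicks per article

def choose_top_k():
    """Draw one conversion-rate sample per article and show the k highest."""
    samples = rng.beta(alpha, beta)
    return np.argsort(samples)[::-1][:k]

def record_feedback(article, clicked):
    """Conjugate Beta-Bernoulli update: a click raises alpha, a non-click raises beta."""
    if clicked:
        alpha[article] += 1
    else:
        beta[article] += 1
```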

Note that instead of streaming and updating the bandit in real time, we could aggregate data daily and update in a batch.
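Because the Beta posterior only depends on success/failure counts, the daily batch update is just an increment of the same `alpha`/`beta` arrays from the sketch above; the per-article daily `impressions` and `clicks` arrays here are assumed inputs:

```python
def batch_update(impressions, clicks):
    """Fold a day's aggregated per-article counts into the Beta posteriors in one step."""
    alpha[:] += clicks
    beta[:] += impressions - clicks
```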
