Ben Lau statistics . machine learning . programming . optimization . research

Marketing


There are three major models in marketing data science: survival analysis, marketing mix modeling (MMM), and customer lifetime value (CLV). Survival analysis models the time until an event of interest occurs, e.g. customer churn or product failure. MMM evaluates the effectiveness of advertising channels in order to allocate future advertising budgets, while CLV predicts the future value of a customer. MMM is an explanatory model, while CLV is a predictive model.

Readings

Two types of churn

There are two types of churn: contractual and non-contractual. The fundamental difference is whether there is a clear boundary between churned and active customers; retail and marketing, for example, typically involve non-contractual churn. Contractual churn occurs when a customer cancels a service by not continuing their periodic agreement, while non-contractual churn occurs when a customer simply becomes inactive. The former makes it easy to define the target variable; the latter makes it much harder, more artificial, and more subjective.

The shifted Beta-Geometric model can be used for contractual churn, while the BG/NBD model can be used for non-contractual churn; both are bottom-up approaches.

Modeling

Survival analysis

It is also called a time-to-event study. It is a good fit for contractual businesses.

Marketing mix modeling

It can differentiate changes due to advertising spend, holiday effects, seasonality, or macro-economic factors, and account for adstock (carry-over), saturation, and delayed effects of advertising. The adstock effect is the idea that the impact of advertising on sales persists for a period of time after the advertising ceases. The saturation effect is the idea that the marginal impact of advertising on sales diminishes as spend increases. The delayed effect is the idea that the impact of advertising on sales is not immediate, but occurs after a delay.

Business problem

We are a marketing agency that wants to optimize a client's marketing budget, and we have access to their sales and media spend data.

Some common questions that are best answered by MMM include (from the Robyn doc):

  • How much sales (online and offline) did each media channel drive?
  • What was the ROI of each marketing channel?
  • How should I allocate budget by channel so as to maximize my KPIs?
  • Where should my next marketing dollar go?
  • What is the optimal level of spend for each major marketing channel?
  • How would sales be impacted if I made X change to my marketing plan?
  • If I needed to cut my marketing budget by X%, where should the dollars come from?
  • How is performance of channels such as FB impacted by the way they are executed (e.g., buying objective, frequency, creative quality or targeting strategy used)?
  • Should we raise our prices? If so, by how much?
  • What is the impact of competitor advertising on the performance of our brands?
  • How much incremental revenue do trade and promotional activities drive?

Some attributes from each channel, or from the whole campaign, would be useful:

  • the target could be weekly sales
  • weekly spend on different media channels
  • some other domain knowledge about exogenous variables such as holiday effects, seasonality, macro-economic factors, or any special events

How to model it?

It is believed that the causal relationship between marketing and sales is non-linear; for example, a 10% increase in channel x1 spend does not necessarily translate into a 10% increase in sales. There could be a carry-over effect, i.e. the effect of spend on sales is not instantaneous but accumulates over time, or a saturation effect, i.e. the marginal effect of spend on sales diminishes at some point. MMM Example Notebook paper

The carry-over effect on sales can be modeled by geometric decay, and the saturation effect can be modeled by a logistic function.
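As a sketch, these two transforms can be written in a few lines of numpy; the decay rate, saturation parameter, and window length below are illustrative, not fitted values:

```python
import numpy as np

def geometric_adstock(spend, decay, l_max=8):
    """Carry-over effect: past spend keeps contributing with
    geometrically decaying weights decay**0, decay**1, ..., decay**(l_max-1)."""
    weights = decay ** np.arange(l_max)
    # convolve and keep the causal part, same length as the input
    return np.convolve(spend, weights)[: len(spend)]

def logistic_saturation(x, lam):
    """Saturation effect: concave map of spend into (0, 1) for x > 0,
    so each extra dollar buys less incremental response."""
    return (1 - np.exp(-lam * x)) / (1 + np.exp(-lam * x))

# a single burst of spend in week 0 still has an effect in later weeks
carried = geometric_adstock(np.array([100.0, 0.0, 0.0, 0.0]), decay=0.5)
```

In practice the adstocked spend is passed through the saturation curve before entering the regression, and both parameters are estimated from data, e.g. with priors in a Bayesian MMM.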

If the intercept varies over time, it can be modeled by a random walk or a Gaussian process.
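A random-walk intercept, for instance, is just a cumulative sum of Gaussian steps; the baseline level and step size here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, step_sd = 104, 0.1

# Gaussian random walk: intercept_t = intercept_{t-1} + Normal(0, step_sd)
intercept = 2.0 + np.cumsum(rng.normal(0.0, step_sd, size=n_weeks))
```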

Actual use case

Customer lifetime value

There are non-contractual and contractual businesses. This entire section is about non-contractual businesses. Here an experienced data scientist also discusses contractual CLV. ref

While MMM maximizes the mean of a target variable, e.g. sales or user signups, sometimes it is better to focus on a specific group of high-value customers. CLV predicts future purchases and quantifies the long-term value of each customer, so it can differentiate the value of customers and help allocate resources to the most valuable ones. blog

It builds on the Buy Till You Die (BTYD) framework, which tells the story of people buying until they become inactive. A model in the BTYD family includes both a repeat-purchase component and a churn component. blog

PyMC-Marketing’s CLV module includes a range of models, to predict future churn rates, purchase frequency, and monetary value of customers.

It uses the BG/NBD model to predict churn and purchase frequency, and the Gamma-Gamma model to predict monetary value; together they give CLV. The BG/NBD model is a latent attrition model, which assumes that all customers are active at the beginning of the observation period and that a customer can only drop out immediately following a transaction. Customers with no repeat transactions during the observation period haven't had a chance to drop out, so their probability of being alive equals 1. paper step-by-step derivation lifetime examples

There are three fundamental assumptions in the BG/NBD model:

  • Each customer has a different purchasing rate
  • Each customer can stop being your customer at any time
  • Deactivation is both permanent and latent. That is, if a customer is inactive, it is forever inactive. And they won’t explicitly tell you they’ve churned.

In summary, the model learns aggregate behavior from these individual behaviors and then makes a probabilistic estimate specific to each individual. After every purchase, the customer may have churned with some probability. The BG/NBD model probabilistically models two processes for the expected number of transactions. ref

  1. First Process: Transaction Process (Buy)
    • While active, the transactions made by a customer follow a Poisson process with transaction rate λ, so the number of transactions in an interval of length t is drawn from a Poisson distribution with mean λt. ref
    • Heterogeneous transaction rate between customers follows a gamma distribution with shape r and scale α
    • Note: The gamma distribution is the conjugate prior of the Poisson distribution, combining the two gives Negative Binomial Posterior Predictive Distribution, that is the NBD.
    • Note 2: Mean of the gamma distribution is rα, and variance is rα^2, which would be a good reference to determine the values of r and α
    • Note 3: Mean of the Poisson distribution is λ, and variance is also λ
    • Note 4: We can add covariates by simply multiplying α by exp(-Xβ). This might be related to link functions in GLMs
  2. Second Process : Dropout process (Till You Die) → process of becoming churn
    • Each customer becomes inactive after each transaction with probability p
    • That is, the number of transactions made by a customer before becoming inactive follows a geometric distribution with dropout probability p
    • Heterogeneous p follows a beta distribution with shape parameters a and b
    • Note: Mean of the beta distribution is a/(a+b), and variance is ab/((a+b)^2(a+b+1)), which would be a good reference to determine the values of a and b
    • Note 2: We can add covariates by simply multiplying a by exp(Xβ₁) and multiplying b by exp(Xβ₂)
  3. (Optional) Third Process: Monetary Value ref
    • Assumptions:
      • The monetary value of a customer’s given transaction varies randomly around their average transaction value.
      • Average transaction values vary across customers but do not vary over time for any given individual.
      • The distribution of average transaction values across customers is independent of the transaction process.
    • Customer monetary value per transaction follows a gamma distribution with shape parameter p and rate parameter v
    • The rate parameter v of that gamma distribution itself follows a gamma distribution with shape parameter q and rate parameter γ
    • All the other parameters except v could be modeled by a half normal distribution with educated guess of the variances
    • Note: We can add covariates by simply multiplying v by exp(-Xβ), i.e. similar to the treatment in the transaction process.
    • Fun facts:
      • The total spend across x transactions of any customer is also gamma distributed with shape px and rate v, due to the convolution property
      • The average spend across x transactions of any customer is also gamma distributed with shape px and rate vx, due to the scaling property
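The conjugacy note in the transaction process can be checked numerically: mixing a Poisson over a gamma-distributed rate (shape r, scale α, as above) yields a negative binomial with n = r and p = 1/(1 + α). A simulation sketch with illustrative parameter values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
r, alpha = 2.0, 3.0  # gamma shape and scale for the transaction rate

# draw a rate per customer, then a Poisson count at that rate
lam = rng.gamma(shape=r, scale=alpha, size=200_000)
counts = rng.poisson(lam)

# the marginal distribution of the counts is negative binomial (the NBD)
nb = stats.nbinom(r, 1.0 / (1.0 + alpha))
# both the simulated and closed-form means equal r * alpha
```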

Note that these models are applicable to any process that involves “transactions”. They can in fact be used to model any phenomenon that involves different “users” making repeated “transactions”, and to predict how many future transactions those users will make if they are still “active”. That is, transactions do not have to be purchases; they can be any event of interest, e.g. user logins. Monetary values do not have to be actual money; they can be any value of interest associated with the event, e.g. user session duration. For example: ref

  • Predicting the future usage frequency of a mobile app by analyzing users’ usage history.
  • Predicting if your website users will return to your website.
  • Predicting if your distant relative who used to call you periodically is still alive, literally, by analyzing their call pattern.
  • Predicting if your Tinder dates have become disinterested in you by looking at their texting frequency.

Resources

CLV will give you answers to questions such as:

  • What is the average order value for a single customer?
  • How much is my customer likely to spend in my webshop, or store, or both next year?
  • What is a single customer likely to spend next year?
  • What is the average lifetime value of each customer?
  • What is the likelihood of a customer leaving my business?
  • How many days have passed since a single customer’s first order?
  • How many days have passed since a single customer’s last order?
  • How many days usually pass between orders?

ref

Some attributes from each customer would be useful:

  • The target variable could be expected number of transaction * expected profit
  • frequency: number of repeated purchases
  • recency: the time between the first and last purchase; note that it is not the time between the last purchase and now, which often confuses people. A version that also considers length, known as the LRFM model, was introduced as an improvement over the RFM model to identify more relevant and exact consumer groups for profit maximization. paper1 paper2 discussion
  • first purchase: time duration between first purchase and the present
  • monetary value: average purchase value
  • membership

SQL to calculate recency, frequency, monetary value, and time since first purchase, from Databricks
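The same summary can be sketched in pandas; the transaction log and its column names below are hypothetical:

```python
import pandas as pd

# hypothetical transaction log: one row per purchase
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01",
        "2024-01-15", "2024-01-20", "2024-02-10"]),
    "value": [10.0, 20.0, 30.0, 5.0, 15.0, 8.0],
})
today = pd.Timestamp("2024-04-01")

summary = tx.groupby("customer_id").agg(
    first=("date", "min"), last=("date", "max"),
    n=("date", "count"), monetary=("value", "mean"))
summary["frequency"] = summary["n"] - 1                    # repeat purchases only
summary["recency"] = (summary["last"] - summary["first"]).dt.days
summary["T"] = (today - summary["first"]).dt.days          # time since first purchase
```

Note that some libraries (e.g. lifetimes) define monetary value as the average over repeat purchases only, excluding the first; check the convention before fitting.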

But how to calculate the churn rate?

It is actually somewhat complicated, but it works by considering frequency, recency, and time since first purchase. The basic idea is that there is not much evidence about customers with few repeat purchases, so we assume it is very likely they are still active. But when a customer has many repeat purchases, and the time between their first and last purchase is short relative to their history, i.e. the time since their last purchase is long, it is very likely they are no longer active. formula code

The simplest way to calculate the churn rate could be $p=(\frac{\text{frequency}_i}{\text{frequency}_{max}})(\frac{\text{recency}_i}{\text{recency}_{max}})^2$
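As a sketch, that heuristic is a one-liner (recency here is the first-to-last purchase gap, as defined above):

```python
def simple_churn_score(frequency, recency, freq_max, rec_max):
    """Heuristic churn score p = (f_i / f_max) * (r_i / r_max)^2."""
    return (frequency / freq_max) * (recency / rec_max) ** 2
```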

How to model the churn

The transaction process (Buy), i.e. the number of events within a fixed interval, can be modeled by a Poisson distribution with some transaction rate, and that rate can itself be modeled by a gamma distribution. Since the gamma distribution is the conjugate prior of the Poisson, the posterior predictive distribution is a negative binomial, which generalizes the geometric distribution (the number of trials until the first success) and allows for overdispersion. ref

After any transaction, a customer becomes inactive with probability p; this is the dropout process (Till You Die). The number of transactions before dropout can be seen as geometric, and p itself can be modeled by a beta distribution. notebook
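The two processes combine into a simple generative story, which can be simulated directly; this is a sketch of the data-generating process with illustrative parameter values, not the fitted model:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_customer(r, alpha, a, b, T):
    """Simulate one customer's repeat-purchase times in [0, T] under BG/NBD:
    a Poisson purchase process with a gamma-distributed rate, and a
    beta-distributed chance of dropping out after each purchase."""
    lam = rng.gamma(r, alpha)  # this customer's transaction rate
    p = rng.beta(a, b)         # this customer's dropout probability
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam)  # exponential waits => Poisson process
        if t > T:
            break
        times.append(t)
        if rng.random() < p:   # customer may die right after a purchase
            break
    return times

counts = [len(simulate_customer(2.0, 0.5, 1.0, 2.5, T=52.0)) for _ in range(1000)]
```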

The transaction rate and the dropout probability are assumed to vary independently across customers. Instead of estimating the parameters for each specific customer, we estimate them for a randomly chosen customer, i.e. the expected values. It then comes down to finding the posterior distribution of the parameters, which can be updated based on the purchase history of the customers.

Note that the BG/NBD model conditions on no purchases after the last observed one

For frequent buyers, the estimated probability of being alive drops very quickly once they stop purchasing, so be careful with interpretation. For them, a shorter observation period is preferred.

How to model the monetary value?

It can be modeled by Gamma-Gamma model. code paper

First, we should filter out all customers with only one purchase, i.e. no repeat purchases.

The model of spend per transaction is based on the following three general assumptions ref:

  • The monetary value of a customer’s given transaction varies randomly around their average transaction value.
  • Average transaction values vary across customers but do not vary over time for any given individual.
  • The distribution of average transaction values across customers is independent of the transaction process.

The monetary value can simply be the total spend divided by the number of transactions for each customer. The spend per transaction can then be modeled by a gamma distribution, such that the total spend across x transactions of any customer is also gamma distributed due to the convolution property, and the average spend across x transactions of any customer is also gamma distributed due to the scaling property.
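Both properties can be verified numerically; here the per-transaction gamma has shape p and a scale parameter, and the values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p, scale, x = 3.0, 2.0, 5  # per-transaction gamma and number of transactions

z = rng.gamma(p, scale, size=(100_000, x))  # spend per transaction
total = z.sum(axis=1)   # convolution property: Gamma(p * x, scale)
avg = z.mean(axis=1)    # scaling property:     Gamma(p * x, scale / x)

# moments match the closed forms:
# E[total] = p * x * scale, Var[avg] = p * x * (scale / x)**2
```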

Note that the Gamma-Gamma model assumes that there is no relationship between the monetary value and the purchase frequency, i.e. the distribution of average transaction values across customers is independent of the transaction process.

With the posterior distribution of the parameters, we can then predict the expected spend of each customer, since the average transaction value is gamma distributed across all customers.

Combining the BG/NBD model and the Gamma-Gamma model, we could then predict the CLV of each customer with a discounted cash flow model.
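The discounted cash flow step itself is a short sum; the per-period purchase expectation and spend would come from the two fitted models, but are hard-coded here for illustration:

```python
def clv_dcf(txns_per_period, spend_per_txn, n_periods, discount_rate):
    """CLV as discounted cash flow: sum_t E[txns] * E[spend] / (1 + d)^t."""
    return sum(
        txns_per_period * spend_per_txn / (1 + discount_rate) ** t
        for t in range(1, n_periods + 1)
    )

# e.g. 0.5 expected purchases/month at $40 each over 12 months, 1% monthly discount
value = clv_dcf(0.5, 40.0, 12, 0.01)
```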

Click-Through Rate (CTR)

When the business goal is to correctly predict the click-through rate, it can be framed as a regression or classification problem; but when the business goal is to increase the click-through rate, it is a recommendation system problem, which is usually the right problem to solve. Although correctly predicting the click-through rate can be a good upstream task whose output is passed to experts who choose the actions, manual decision making does not scale and may be infeasible at large scale, e.g. 1000+ ads or users.

Evaluation

The ROC curve and the calibration curve are both useful here. Notes on classification
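A minimal scikit-learn sketch of both checks on synthetic churn predictions (the data is made up; a real model's held-out scores would go in their place):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
# synthetic labels and scores: positives tend to get higher probabilities
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.3 * y_true + rng.uniform(0.0, 0.7, size=2000), 0.0, 1.0)

auc = roc_auc_score(y_true, y_prob)  # discrimination: ranking quality
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
# calibration: predicted probabilities vs. observed event rates per bin
```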

Customer Decision Tree (CDT)

Often called CDT (short for Customer Decision Tree), the Customer Decision Tree is the visual translation, in product groups and segments, of the successive logical questions a shopper is asking herself when buying a product in a category. HPT PEDIA

It is good for assortment optimization. McKinsey & Company | Analytical assortment optimization

Oracle has an implementation guide on it doc

Without an actual implementation given, it is hard to say how industry models it. But I assume it could be done by hierarchical clustering on product features.
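Under that assumption, a scipy sketch of hierarchical clustering on a made-up product feature matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical product feature matrix: rows = products,
# columns = e.g. price tier, pack size, flavor score
features = np.array([
    [1.0, 0.5, 0.2],
    [1.1, 0.6, 0.1],
    [5.0, 3.0, 2.0],
    [5.2, 2.9, 2.1],
])

# Ward linkage builds the tree; cutting it yields product segments
tree = linkage(features, method="ward")
segments = fcluster(tree, t=2, criterion="maxclust")
```

Cutting the tree at different depths would give the nested category and segment levels of a CDT.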

How to get a job in marketing data science?

Marketing analytics/data science can definitely be a mixed bag. You probably want to avoid any postings that mention Google/Adobe Analytics. Most of your Bayesian MMM type work will come from marketing measurement related positions; it gets called out in the JD pretty often. Large CPGs, US tech/ecommerce, and marketing measurement firms (NielsenIQ and the like) are good places to look. Marketing DS jobs related to performance marketing can be pretty good options as well. ref