Conjugate Priors
Introduction
Firstly, let me remind you of Bayes' theorem:

$$
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}
$$
In Bayesian theory, if, given a likelihood function, the posterior distribution is in the same probability distribution family as the prior distribution, the prior and posterior are called conjugate distributions with respect to that likelihood function, and the prior is called a conjugate prior for the likelihood function. wiki
But why conjugate priors? Because they simplify the calculation and interpretation of the posterior distribution and the updates by giving a closed-form expression for the posterior, i.e. getting rid of intractable integrals.
One more fantastic property of conjugate priors is that if the likelihood function belongs to the exponential family, then a conjugate prior is guaranteed to exist, often also in the exponential family.
Using conjugate priors, the prior predictive distribution of an exponential family distribution can be determined analytically. Despite their analytical tractability, such prior predictive distributions are usually not themselves members of the exponential family, e.g. the Student's t distribution and the beta-binomial distribution. ref
Moreover, when a conjugate prior is being used, the posterior predictive distribution belongs to the same family as the prior predictive distribution, and is determined simply by plugging the updated hyperparameters of the posterior distribution of the parameters into the formula for the prior predictive distribution.
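As a concrete illustration (using the Beta-Bernoulli pair derived in detail below): with a $\mathrm{Beta}(\alpha, \beta)$ prior on the success probability of $n$ Bernoulli trials, the prior predictive distribution of the number of successes $x$ is the beta-binomial distribution,

$$
p(x \mid \alpha, \beta) = \binom{n}{x}\,\frac{B(x + \alpha,\; n - x + \beta)}{B(\alpha, \beta)},
$$

and the posterior predictive for a future batch of trials is obtained by simply replacing $\alpha$ and $\beta$ with the updated hyperparameters $\alpha + x$ and $\beta + n - x$.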
Examples
Beta distribution for discrete data
To demonstrate the power of conjugate priors, let's look at the Bernoulli distribution and its conjugate prior, the beta distribution.
When $x_i \sim \mathrm{Bernoulli}(\theta)$ and $\theta$ is unknown, if a conjugate prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$ is selected, the posterior distribution of $\theta$ has a closed form, which is also a beta distribution:

$$
\theta \mid x_1, \dots, x_n \;\sim\; \mathrm{Beta}\!\left(\alpha + \sum_{i=1}^{n} x_i,\;\; \beta + n - \sum_{i=1}^{n} x_i\right)
$$

where $x_i \in \{0, 1\}$ is the outcome of the $i$-th trial being a success or not, $n$ is the number of trials, while $\alpha$ and $\beta$ can be interpreted as the number of successes and failures before the experiment.
By defining the updated hyperparameters

$$
\alpha' = \alpha + \sum_{i=1}^{n} x_i, \qquad \beta' = \beta + n - \sum_{i=1}^{n} x_i,
$$

and renaming the parameter from $p$ to $\theta$ to avoid ambiguity with the density notation $p(\cdot)$, the posterior predictive distribution of the next outcome $\tilde{x}$ can be calculated as

$$
p(\tilde{x} \mid x_1, \dots, x_n) = \int_0^1 p(\tilde{x} \mid \theta)\, p(\theta \mid \alpha', \beta')\, d\theta.
$$
Note that $p(\tilde{x} \mid \theta)$ is the likelihood function of the Bernoulli distribution, so it is just $\theta^{\tilde{x}}(1 - \theta)^{1 - \tilde{x}}$.
Since, for $\tilde{x} = 1$, this integral is just the expectation of $\theta$ under the posterior distribution, and given that the mean of any $\mathrm{Beta}(\alpha, \beta)$ distribution is

$$
\frac{\alpha}{\alpha + \beta},
$$

we have the posterior predictive probability of $\tilde{x} = 1$ as

$$
p(\tilde{x} = 1 \mid x_1, \dots, x_n) = \frac{\alpha'}{\alpha' + \beta'}.
$$
Since the outcome is binary, we have

$$
p(\tilde{x} = 0 \mid x_1, \dots, x_n) = 1 - \frac{\alpha'}{\alpha' + \beta'} = \frac{\beta'}{\alpha' + \beta'},
$$

and therefore

$$
\tilde{x} \mid x_1, \dots, x_n \;\sim\; \mathrm{Bernoulli}\!\left(\frac{\alpha'}{\alpha' + \beta'}\right).
$$
So the mean of the posterior predictive distribution is

$$
\frac{\alpha'}{\alpha' + \beta'},
$$

which is the same as the posterior mean of $\theta$.
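To make the update concrete, here is a minimal sketch in Python (NumPy/SciPy assumed available) of the closed-form Beta-Bernoulli update and the resulting posterior predictive probability; the function and variable names are mine, not from any particular library.

```python
import numpy as np
from scipy import stats

def update_beta_bernoulli(alpha, beta, outcomes):
    """Closed-form conjugate update: Beta prior + Bernoulli observations -> Beta posterior."""
    outcomes = np.asarray(outcomes)
    successes = outcomes.sum()
    failures = outcomes.size - successes
    return alpha + successes, beta + failures

# Prior Beta(2, 2), then observe 10 coin flips.
alpha0, beta0 = 2.0, 2.0
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

alpha_n, beta_n = update_beta_bernoulli(alpha0, beta0, data)

# Posterior predictive P(next outcome = 1) is the posterior mean alpha' / (alpha' + beta').
p_next_success = alpha_n / (alpha_n + beta_n)
print(alpha_n, beta_n, p_next_success)

# Sanity check: same value as the mean of the Beta posterior.
posterior = stats.beta(alpha_n, beta_n)
print(posterior.mean())
```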
another proof with application
For a practical usage, see Beta target encoding at #Use cases.
Similar procedures can be applied to other exponential family distributions, such as the Poisson distribution with a Gamma prior, or the Gaussian distribution with a Gaussian prior on the mean or an inverse-gamma prior on the variance, etc. discussion on Gaussian, proof of Gaussian, Table of conjugate distributions Wiki
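For instance, the Gamma-Poisson pair follows exactly the same pattern (standard result, shape-rate parameterization): with counts $x_1, \dots, x_n \sim \mathrm{Poisson}(\lambda)$ and a $\mathrm{Gamma}(\alpha, \beta)$ prior on $\lambda$,

$$
\lambda \mid x_1, \dots, x_n \;\sim\; \mathrm{Gamma}\!\left(\alpha + \sum_{i=1}^{n} x_i,\;\; \beta + n\right),
$$

and the posterior predictive of the next count is a negative binomial distribution, again available in closed form.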
Use cases of the posterior and posterior predictive distributions
- Beta target encoding on Bernoulli target response
- it is an interesting application because the target response is not binary but a probability instead. However, the same procedure of estimating a Bernoulli distribution with a Beta prior still applies, by treating the target response as the parameter of the Bernoulli distribution. In this way, $\alpha$ and $\beta$ are estimated rather than observed directly, but the same analogy applies. Moreover, because the encoded value comes from a beta distribution, we can use statistics other than the mean, such as the median, mode, etc., to estimate the target response.
- estimating the posterior distributions of the target response (a probability) for each category of each categorical variable, kinda like one-hot encoding but substituting the posterior statistics of the target response, and without a sparse matrix; see the sketch after this list.
- Kalman filter
- Hypothesis testing
- Parameter estimation
- Uncertainty quantification, maybe for decision making
- Model selection
- Forecasting
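A minimal sketch of the Beta target encoding idea for a binary target, assuming a pandas DataFrame with hypothetical columns `category` and `target`; the prior hyperparameters and column names are illustrative, not from any specific library.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "c"],
    "target":   [1,   1,   0,   0,   0,   1],
})

# Global prior Beta(alpha0, beta0); here chosen from the overall success rate.
prior_strength = 2.0
global_rate = df["target"].mean()
alpha0 = prior_strength * global_rate
beta0 = prior_strength * (1.0 - global_rate)

# Per-category conjugate update: alpha' = alpha0 + successes, beta' = beta0 + failures.
stats_by_cat = df.groupby("category")["target"].agg(successes="sum", trials="count")
stats_by_cat["alpha"] = alpha0 + stats_by_cat["successes"]
stats_by_cat["beta"] = beta0 + (stats_by_cat["trials"] - stats_by_cat["successes"])

# Encode each category with a posterior statistic, e.g. the posterior mean;
# the median or mode of the Beta posterior could be used instead.
stats_by_cat["encoding"] = stats_by_cat["alpha"] / (stats_by_cat["alpha"] + stats_by_cat["beta"])
df["category_encoded"] = df["category"].map(stats_by_cat["encoding"])
print(df)
```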
Let’s say I get a posterior distribution by imposing a conjugate prior on the likelihood function; how do I update it when new data arrives?
We could just update the parameters of the prior distribution using the parameters from the posterior distribution, i.e. use the posterior distribution as the prior distribution for the next inference, which is especially straightforward if conjugate priors are being used. pymc updating priors
This plays a great role in online learning. Online learning requires low latency, and conjugate priors give the corresponding posteriors in closed form, which provides the tractability needed for near-instantaneous updates. Thompson sampling is one example; see the sketch below.
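A minimal sketch of Thompson sampling on a Bernoulli bandit, where each arm keeps a Beta posterior that is updated in closed form after every observation; the arm probabilities are made up for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5, 0.7]          # hidden success probabilities (made up for the demo)
alpha = np.ones(len(true_rates))      # Beta(1, 1) prior for each arm
beta = np.ones(len(true_rates))

for t in range(1000):
    # Sample one plausible success rate per arm from its current Beta posterior ...
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))     # ... and play the arm that looks best under that draw.
    reward = rng.random() < true_rates[arm]
    # Conjugate update: the posterior immediately becomes the prior for the next round.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))         # posterior means should approach true_rates
```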
However, it is sometimes more efficient, without losing much accuracy, to update the model daily (in batches) instead of every time a new data point comes in. This way, we can also use any inference method to update the model.
But what if conjugate priors cannot be used?
Other Bayesian inference methods could help, such as the Laplace approximation, MCMC, and variational inference.
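For example, here is a minimal random-walk Metropolis sketch (plain NumPy, my own toy setup) for a non-conjugate case: Bernoulli data with a logistic-normal prior on the success probability, where no closed-form posterior exists.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random(50) < 0.7           # toy Bernoulli observations
successes, n = data.sum(), data.size

def log_posterior(z):
    """Unnormalized log posterior for z, where theta = sigmoid(z) and z ~ Normal(0, 1.5^2)."""
    theta = 1.0 / (1.0 + np.exp(-z))
    log_lik = successes * np.log(theta) + (n - successes) * np.log(1.0 - theta)
    log_prior = -0.5 * (z / 1.5) ** 2
    return log_lik + log_prior

# Random-walk Metropolis: propose z' ~ Normal(z, step^2), accept with prob min(1, ratio).
z, step, samples = 0.0, 0.5, []
for _ in range(5000):
    z_prop = z + step * rng.normal()
    if np.log(rng.random()) < log_posterior(z_prop) - log_posterior(z):
        z = z_prop
    samples.append(z)

theta_samples = 1.0 / (1.0 + np.exp(-np.array(samples[1000:])))  # discard burn-in
print(theta_samples.mean())           # posterior mean of theta, estimated by MCMC
```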