A post by Grace Yan, PhD student on the Compass programme.
Introduction
In many real-world problems, the exact posterior distribution is intractable, for example due to non-conjugate priors or high-dimensional datasets, so approximate Bayesian inference methods are used instead. Some well-known examples of these methods are Variational Bayes (VB), the Laplace approximation and Expectation Propagation (EP). In this blog post, I will focus on Expectation Propagation and explain what it is, how it works, its strengths and limitations, and how it relates to similar methods.

Background
EP was introduced by Minka in 2001 as an extension of assumed-density filtering (ADF), a one-pass sequential algorithm for obtaining an approximate posterior [2]. Like VB methods, its aim is to approximate an intractable posterior with tractable distributions by minimising the Kullback-Leibler (KL) divergence. Recall that the KL divergence measures how different two distributions $p$ and $q$ are; often, $p$ is the true distribution and $q$ is a model distribution that we use to approximate $p$. There are two kinds of KL divergence: the forward KL and the reverse KL. Assuming $x$ is continuous, these are defined as
\[ \mathrm{KL}(p(x) \| q(x)) = \int p(x) \log\frac{p(x)}{q(x)} \, dx \]
and
\[ \mathrm{KL}(q(x) \| p(x)) = \int q(x) \log\frac{q(x)}{p(x)} \, dx \]
respectively (in the discrete case, the integrals are replaced by sums). These two types are not equivalent; [3] gives a good explanation of how they differ. EP uses the forward KL.
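To make the distinction concrete, here is a small numerical sketch in Python; the bimodal mixture $p$, the single Gaussian $q$ and the integration grid are arbitrary illustrative choices of mine. Evaluating both divergences on a grid shows how differently they score the same pair of distributions: the reverse KL heavily penalises $q$ for placing mass where $p$ has almost none.

```python
# Numerical sketch: forward vs reverse KL between a bimodal mixture p
# and a single broad Gaussian q, evaluated by trapezoidal quadrature.
import numpy as np

x = np.linspace(-10, 10, 20001)
norm_pdf = lambda x, m, s: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

p = 0.5 * norm_pdf(x, -2.0, 0.7) + 0.5 * norm_pdf(x, 2.0, 0.7)  # bimodal "truth"
q = norm_pdf(x, 0.0, 2.2)                                       # broad approximation

kl = lambda a, b: np.trapz(a * np.log(a / b), x)
print(f"forward KL(p||q) = {kl(p, q):.3f}")
print(f"reverse KL(q||p) = {kl(q, p):.3f}")
```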
Expectation Propagation
Let $\mathbf{x}$ denote the observed data and $\boldsymbol{\theta}$ the parameters of interest. Recall that by Bayes’ theorem, the posterior is
\[ p(\boldsymbol{\theta}|\mathbf{x}) = \frac{p(\mathbf{x}, \boldsymbol{\theta})}{p(\mathbf{x})}, \]
where $p(\mathbf{x})$ is the model evidence. We can write the joint distribution $p(\mathbf{x}, \boldsymbol{\theta})$ in the form of a product of factors $f_i$, which are also called ‘sites’:
\[p(\mathbf{x}, \boldsymbol{\theta}) = p(\boldsymbol{\theta})p(\mathbf{x}|\boldsymbol{\theta}) = p(\boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i|\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}),\] where $p(\boldsymbol{\theta})$ is the prior and the factors $f_1, \dots, f_n$ are the likelihood partitioned into $n$ parts that are conditionally independent given $\boldsymbol{\theta}$ (e.g. each $i$ could correspond to a data point).
In the factors $f_i$, we drop the conditioning on $\mathbf{x}_i$ to simplify the notation. I use $f_j(\boldsymbol{\theta})$ to refer to one specific factor and $f_i(\boldsymbol{\theta})$ for the factors collectively. My notation is similar to the notation in [4].
The idea is to approximate the posterior by approximating the factors with $\tilde{f}_i$, which are often assumed to be Gaussian (or some other member of the exponential family). These approximations are refined one at a time in an iterative process until convergence. In EP, refining a factor $\tilde{f}_j$ is a “team effort”: it requires information from all of the other factors. This concept is known as message passing, since each update can be viewed as a message passed between the nodes of a factor graph (a perspective borrowed from computer science).

Using the approximations of the likelihood factors, the resulting approximate posterior is given by
\[q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).\]
In EP, the prior $p(\boldsymbol{\theta})$ is also taken to be Gaussian. Since a product of Gaussian densities is proportional to another Gaussian, $q$ is itself Gaussian, and its normalising constant is available in closed form; we therefore avoid the problem of normalising an intractable posterior.
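For reference, the two-factor case of this standard identity reads
\[ \mathcal{N}(\boldsymbol{\theta}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\, \mathcal{N}(\boldsymbol{\theta}; \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) \propto \mathcal{N}(\boldsymbol{\theta}; \boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}_1^{-1} + \boldsymbol{\Sigma}_2^{-1}, \quad \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = \boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2. \]
Dividing by a Gaussian works the same way with the plus signs replaced by minuses; this is exactly how the cavity distribution below is formed.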
To make the approximations as accurate as possible, we need a way to measure approximation quality. Naturally, the KL divergence comes to mind, so we might consider minimising the following global objective:
\[
\mathrm{KL}(p \| q) = \mathrm{KL} \left( \frac{1}{p(\mathbf{x})} p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}) \bigg\| p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}) \right).
\]
However, the global KL divergence is difficult to optimise. Instead, EP minimises the KL divergence locally to update each factor one at a time, using a distribution called the tilted distribution. When updating the factor $\tilde{f}_j$, the tilted distribution is defined by
\[ q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta})q_{\setminus j}(\boldsymbol{\theta}), \]
where $q_{\setminus j}$ is called the cavity distribution, which is essentially the approximate posterior with the factor $\tilde{f}_j$ removed:
\[ q_{\setminus j}(\boldsymbol{\theta}) = p(\boldsymbol{\theta}) \prod_{i \neq j} \tilde{f}_i(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}. \]
Then EP finds the $\tilde{f}_j$ that minimises the KL divergence between the tilted distribution and the updated approximate posterior (which we call $q^\text{new}$):
\[
\mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})), \]
where \[q^\text{new}(\boldsymbol{\theta}) \propto \tilde{f}_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}).
\] If $q^\text{new}$ is a member of the exponential family (e.g. Gaussian), then we can minimise $\mathrm{KL}(q^\text{tilt} \| q^\text{new})$ by matching the moments of $q^\text{new}$ with those of $q^\text{tilt}$. This trick is called moment matching. In general, when the approximating distribution belongs to the exponential family, matching its expected sufficient statistics (its moments) to those of the target distribution minimises the forward KL [5].
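Concretely, for a Gaussian $q^\text{new}(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$, setting the derivatives of $\mathrm{KL}(q^\text{tilt} \| q^\text{new})$ with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ to zero gives
\[ \boldsymbol{\mu} = \mathbb{E}_{q^\text{tilt}}[\boldsymbol{\theta}], \qquad \boldsymbol{\Sigma} = \mathbb{E}_{q^\text{tilt}}\left[(\boldsymbol{\theta} - \boldsymbol{\mu})(\boldsymbol{\theta} - \boldsymbol{\mu})^\top\right]. \]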
Note that the tilted distribution is generally not Gaussian, so it can be difficult to compute its moments analytically. Instead, the moments are often computed numerically: using MCMC, we can generate samples from the tilted distribution (in which case we do not need its normalising constant) and use the samples to estimate the moments empirically.
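As a small illustration of this route, the sketch below estimates the mean and variance of a one-dimensional tilted distribution with a random-walk Metropolis sampler. The example site likelihood and cavity, as well as the step size and chain length, are purely illustrative choices of mine; note that only the unnormalised log density is needed, since the normalising constant cancels in the acceptance ratio.

```python
# Random-walk Metropolis sketch for estimating the moments of a 1-D
# tilted distribution q_tilt(theta) ∝ f_j(theta) * q_cavity(theta).
import numpy as np

def tilted_moments(log_tilt, theta0=0.0, step=0.5, n=20000, burn=2000, seed=1):
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_tilt(theta0)
    samples = np.empty(n)
    for t in range(n):
        prop = theta + step * rng.standard_normal()   # symmetric proposal
        lp_prop = log_tilt(prop)
        if np.log(rng.random()) < lp_prop - lp:       # Metropolis accept/reject
            theta, lp = prop, lp_prop
        samples[t] = theta
    kept = samples[burn:]
    return kept.mean(), kept.var()                    # empirical moments

# Illustrative tilted density: site N(x_j = 1; theta, 1) times cavity N(0, 4).
# Both happen to be Gaussian here only so that the exact answer
# (mean 0.8, variance 0.8) is available as a sanity check.
log_tilt = lambda th: -(1.0 - th)**2 / 2.0 - th**2 / 8.0
print(tilted_moments(log_tilt))
```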
The Gaussian EP algorithm is given below:
- Initialise all the approximating factors \( \tilde{f}_i(\boldsymbol{\theta}), i=1,…,n \).
- Initialise the approximate posterior by setting\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}),
\]where \( p(\boldsymbol{\theta}) \) is a Gaussian prior.
- Until all \( \tilde{f}_i \) for \( i=1,…,n \) converge:
(a) Choose a factor \( \tilde{f}_j \) to refine.
(b) Evaluate the cavity distribution\[
q_{\setminus j}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}.
\]
(c) Set the tilted distribution\[
q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}).
\]
Calculate the mean \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \) of \( q^\text{tilt} \).
(d) Obtain the new posterior \( q^\text{new} \) that minimises \( \mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})) \) by matching its moments with \( \boldsymbol{\mu} \) and \( \boldsymbol{\Sigma} \).
(e) Evaluate and store the refined factor\[
\tilde{f}_j(\boldsymbol{\theta}) = \frac{q^{\text{new}}(\boldsymbol{\theta})}{q_{\setminus j}(\boldsymbol{\theta})}.
\]
(f) Use the refined factors to update the approximate posterior as\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).
\]
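To make the algorithm concrete, here is a minimal one-dimensional sketch for Minka's "clutter problem" [2], where each likelihood factor is a two-component Gaussian mixture. The tilted moments are computed by brute-force quadrature on a grid, and the specific constants (mixture weight, variances, grid, number of sweeps) are illustrative choices of mine rather than part of the algorithm; the comments map the code to steps (b)-(f) above.

```python
# Minimal 1-D Gaussian EP sketch for the clutter problem:
# x_i ~ (1-w) N(theta, 1) + w N(0, 10), with prior theta ~ N(0, 100).
import numpy as np

rng = np.random.default_rng(0)
w, prior_var, theta_true, n_data = 0.25, 100.0, 2.0, 50
x = np.where(rng.random(n_data) < w,
             rng.normal(0.0, np.sqrt(10.0), n_data),   # clutter
             rng.normal(theta_true, 1.0, n_data))      # signal

def site_lik(theta, xi):
    """Clutter likelihood f_i(theta) = (1-w) N(x_i; theta, 1) + w N(x_i; 0, 10)."""
    gauss = lambda m, v: np.exp(-(xi - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return (1 - w) * gauss(theta, 1.0) + w * gauss(0.0, 10.0)

# Work in natural parameters: precision r and precision-times-mean b.
r0, b0 = 1.0 / prior_var, 0.0        # Gaussian prior N(0, prior_var)
r_site = np.zeros(n_data)            # sites start flat: f~_i = 1
b_site = np.zeros(n_data)
grid = np.linspace(-10.0, 10.0, 4001)

for sweep in range(20):
    for j in range(n_data):
        r_q, b_q = r0 + r_site.sum(), b0 + b_site.sum()
        # (b) cavity q_{\j} = q / f~_j: subtract site j's natural parameters
        r_cav, b_cav = r_q - r_site[j], b_q - b_site[j]
        if r_cav <= 0:               # improper cavity: skip this update
            continue
        m_cav, v_cav = b_cav / r_cav, 1.0 / r_cav
        # (c) tilted distribution on the grid, then its mean and variance
        tilt = site_lik(grid, x[j]) * np.exp(-(grid - m_cav)**2 / (2 * v_cav))
        tilt /= np.trapz(tilt, grid)
        m_new = np.trapz(grid * tilt, grid)
        v_new = np.trapz((grid - m_new)**2 * tilt, grid)
        # (d)+(e) moment-matched q_new, then refined site f~_j = q_new / q_{\j}
        r_site[j] = 1.0 / v_new - r_cav
        b_site[j] = m_new / v_new - b_cav

# (f) final approximate posterior from the prior and the refined sites
r_q, b_q = r0 + r_site.sum(), b0 + b_site.sum()
print(f"EP posterior: mean {b_q / r_q:.3f}, variance {1 / r_q:.4f}")
```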
Benefits and limitations of EP
As with any method, EP has advantages and disadvantages. Its advantages include the following:
- EP updates the approximation factor-by-factor rather than globally, which often leads to more accurate approximations of the target distribution than optimising a single global objective.
- EP is typically faster and computationally cheaper than MCMC. It can also be used to speed up MCMC [1].
- EP is easy to parallelise [6].
- EP can easily be used in conjunction with other methods. Minka’s Roadmap on EP [7] provides a rich guide to the various areas that have employed EP, including regression, neural networks and nonlinear dynamic systems. EP has also been used with likelihood-free inference methods such as ABC (e.g. EP-ABC [8]).
- Due to its factorisation structure, EP is naturally suited to graphical models, such as Bayesian networks and Markov random fields.
However, EP has some serious limitations, which later works have tried to address:
- There is a lack of theoretical guarantees, e.g. convergence of the EP algorithm is not guaranteed.
- If the number of approximating factors is large, memory consumption can be substantial, since the parameters of every site must be stored.
Extensions of EP
EP is well-suited to parallelisation. The parallel version of the original algorithm (sometimes called ‘sequential EP’) is known as ‘parallel EP’. Here, the factor updates occur simultaneously: $q$ is not rebuilt after each factor refinement in step (f) of the algorithm above. Instead, every $\tilde{f}_j$ is refined using a cavity distribution formed from the current, not-yet-refined factors, and $q$ is updated only once all the factors have been refined (whereas in sequential EP it is updated after every factor update); the process then repeats for multiple rounds.
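Reusing the variable names and helper functions from the sequential sketch above (my own illustrative setup, not a reference implementation), the inner loop of parallel EP might look as follows; the only structural change is that every cavity is formed from the same frozen $q$ and the sites are written back in one synchronised step.

```python
# Parallel EP inner loop, reusing site_lik, x, grid, r0, b0, r_site, b_site
# from the sequential sketch above. All cavities come from the same frozen q.
for sweep in range(20):
    r_q, b_q = r0 + r_site.sum(), b0 + b_site.sum()   # frozen for this sweep
    r_new, b_new = r_site.copy(), b_site.copy()
    for j in range(len(x)):                           # independent: could run in parallel
        r_cav, b_cav = r_q - r_site[j], b_q - b_site[j]
        if r_cav <= 0:
            continue
        m_cav, v_cav = b_cav / r_cav, 1.0 / r_cav
        tilt = site_lik(grid, x[j]) * np.exp(-(grid - m_cav)**2 / (2 * v_cav))
        tilt /= np.trapz(tilt, grid)
        m_t = np.trapz(grid * tilt, grid)
        v_t = np.trapz((grid - m_t)**2 * tilt, grid)
        r_new[j] = 1.0 / v_t - r_cav
        b_new[j] = m_t / v_t - b_cav
    r_site, b_site = r_new, b_new                     # one synchronised update of q
```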
Since the introduction of EP, many variants have been developed, such as averaged EP (AEP) [9], power EP (PEP) [10] and stochastic EP (SEP) [11]. Different choices of divergence function have led to Variational Message Passing (VMP) [12], which uses the reverse KL, and Laplace propagation (LP) [13], which uses the Laplace approximation. Much work has been done to alleviate EP’s issues, such as guaranteeing convergence [14][9], bounding its approximation error [15], and reducing memory consumption (e.g. SEP).
Due to the close relation between EP and Variational Inference (VI), many methods have been developed by unifying the two. For example, Partitioned Variational Inference (PVI) [16] arises as a unification of several methods, including power EP, global VI and local VI.
References
[1] Barthelmé, S. (2016). The Expectation-Propagation Algorithm: a tutorial. Gipsa-lab, CNRS. https://www.cirm-math.fr/ProgWeebly/Renc1619/Barthelme_EP1.pdf
[2] Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference [Doctoral dissertation]. Massachusetts Institute of Technology.
[3] Jang, E. (2016). A Beginner’s Guide to Variational Methods: Mean-Field Approximation. https://blog.evjang.com/2016/08/variational-bayes.html
[4] Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
[5] Murray, I. (2017). Variational objectives and KL Divergence [Lecture notes]. University of Edinburgh. https://www.inf.ed.ac.uk/teaching/courses/mlpr/2017/notes/w9a_variational_kl.pdf
[6] Cseke, B. and Heskes, T. (2011). Approximate marginals in latent Gaussian models. Journal of Machine Learning Research, 12:417–454.
[7] Minka, T. P. (n.d.). A roadmap to research on EP. https://tminka.github.io/papers/ep/roadmap.html
[8] Barthelmé, S. and Chopin, N. (2012). Expectation Propagation for Likelihood-Free Inference. arXiv:1107.5959.
[9] Dehaene, G. and Barthelmé, S. (2018). Expectation propagation in the large data limit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):199–217.
[10] Minka, T. P. (2004). Power EP [Technical Report MSR-TR-2004-149]. Microsoft Research Ltd.
[11] Li, Y., Hernández-Lobato, J. M. and Turner, R. E. (2015). Stochastic Expectation Propagation. arXiv:1506.04132.
[12] Winn, J., Bishop, C. M. and Jaakkola, T. (2005). Variational message passing. Journal of Machine Learning Research, 6(4):661–694.
[13] Smola, A., Vishwanathan, S. V. N. and Eskin, E. (2003). Laplace propagation. Advances in Neural Information Processing Systems, 16.
[14] Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C. and Teh, Y. W. (2017). Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server. arXiv:1512.09327.
[15] Dehaene, G. and Barthelmé, S. (2016). Bounding errors of Expectation-Propagation. arXiv:1601.02387.
[16] Bui, T. D., Nguyen, C. V., Swaroop, S. and Turner, R. E. (2018). Partitioned Variational Inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206.
[17] Guo, H., Greengard, P., Wang, H., Gelman, A., Kim, Y. and Xing, E. P. (2023). Federated learning as variational inference: a scalable expectation propagation approach. arXiv:2302.04228.
[18] Wu, X., Huang, H., Ding, Y., Wang, H., Wang, Y. and Xu, Q. (2023). FedNP: Towards Non-IID Federated Learning via Federated Neural Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):10399–10407.