A post by Jack Simons, PhD student on the Compass programme.

Introduction

I began my PhD with my supervisors, Dr Song Liu and Professor Mark Beaumont with the intention of combining their respective fields of research; Density Ratio Estimation (DRE), and Simulation Based Inference (SBI):

DRE is a rapidly growing paradigm in machine learning which (broadly) provides efficient methods of comparing densities without the need to compute each density individually. For a comprehensive yet accessible overview of DRE in Machine Learning see [1].
SBI is a group of methods which seek to solve Bayesian inference problems when the likelihood function is intractable. If you wish for a concise overview of the current work, as well as motivation then I recommend [2].

Last year we released a paper, Variational Likelihood-Free Gradient Descent [3] which combined these fields. This blog post seeks to condense, and make more accessible, the contents of the paper.

Motivation: Likelihood-Free Inference

Let’s begin by introducing likelihood-free inference. We wish to do inference on the posterior distribution of parameters $\theta$ for a specific observation $x=x_{\mathrm{obs}}$ , i.e. we wish to infer $p(\theta|x_{\mathrm{obs}})$ which can be decomposed via Bayes’ rule as

$p(\theta|x_{\mathrm{obs}}) = \frac{p(x_{\mathrm{obs}}|\theta)p(\theta)}{\int p(x_{\mathrm{obs}}|\theta)p(\theta) \mathrm{d}\theta}.$

The likelihood-free setting is that, additional to the usual intractability of the normalising constant in the denominator, the likelihood, $p(x|\theta)$ , is also intractable. In lieu of this, we require an implicit likelihood which describes the relation between data $x$ and parameters $\theta$ in the form of a forward model/simulator (hence simulation based inference!).

This setting broadens the range of problems which we can perform posterior inference on. If we introduce fewer restrictions on the models scientists can perform inference on then we allow for more general models. There are other groups of methods which perform posterior inference in the likelihood-free setting, but generally these are hand-crafted for the model at hand. Consequently, SBI has seen a great success in many scientific fields since its inception, to name a few:

Population Dynamics – e.g. Lotka-Volterra
Epidemiology – e.g. Compartmental Models
Particle Physics – e.g. Particle Interactions in the Large Hadron Collider

( Left, Lotka Volterra Simulation LMaclean, A.D.A.M., CELSO, C.L. and STUMPF, M.P., Stem Cell Population Biology: Insights from Haematopoiesis. Middle, SIR Simulation Macal, C.M., 2010, December. To agent-based simulation from system dynamics. In Proceedings of the 2010 winter simulation conference (pp. 371-382). IEEE. Right, Large Hadron Collider Simulation CERN: https://cds.cern.ch/record/2646378))

We conclude our brief introduction to SBI, but would once again point readers to [2] if you would like to see an overview of the recent advances in the literature.

Method

Our idea, as in variational inference, is to minimize some divergence between our variational distribution $q_\Theta(\theta)$ and the target distribution $p(\theta|x_{\mathrm{obs}})$ . Specifically, we choose the reverse Kullback-Leibler (KL) divergence which is defined by

$\min_\Theta \mathrm{KL}[q_\Theta(\theta) || p(\theta|x_{\mathrm{obs}})] = \min_\Theta \int q_\Theta(\theta) \log\left(\frac{q_\Theta(\theta)}{p(\theta|x_{\mathrm{obs}})}\right)\mathrm{d}\theta$

$=\min_\Theta \int q_\Theta(\theta) \log\left(\frac{q_\Theta(\theta)p(x_{\mathrm{obs}})}{p(\theta)p(x_{\mathrm{obs}}|\theta)}\right)\mathrm{d}\theta$

$=\min_\Theta \mathbb{E}_{\theta\sim q_\Theta(\theta)}\left[ \log r(\theta,x_{\mathrm{obs}}) \right],$

where we have introduced the density ratio $r(\theta,x) := \frac{q_\Theta(\theta)p(x)}{p(\theta)p(x|\theta)}$ . Due to the intractability of some of the constituent densities of $r(\cdot,\cdot)$ we must estimate it using some $\hat{r}(\cdot,\cdot)$ . With this density ratio estimator we could estimate the above KL divergence using samples

$\mathrm{KL}[q_\Theta(\theta) || p(\theta|x_{\mathrm{obs}})] \approx \frac{1}{N} \sum_{i=1}^N \log \hat{r}(\theta_i,x_{\mathrm{obs}}),$

where $\theta_i \sim q_\Theta(\theta)$ .

Our procedure is comprised of repeating the following steps:

Estimate the density ratio $r(\cdot,\cdot)$ with $\hat{r}(\cdot,\cdot)$ via DRE
Perform a step of gradient descent w.r.t. the parameters of our variational distribution on our estimate of the KL

We name our method Variational Likelihood-Free Gradient Descent (VLFGD), and will explain exactly how we achieve these steps in subsequent sections.

Choice of Variational Family

We first specify exactly what form our variational distribution, $q_\Theta(\theta)$ , takes.
We select our variational distribution $q_\Theta$ to be an empirical distribution on particles $\{\theta'_i\}_{i=1}^{N}$ . To be clear, $\Theta$ , which is interpreted as the parameters of $q(\cdot)$ , are exactly the set of particles constituting the empirical distribution: $\Theta=\{\theta'_j\}_{j=1}^{N}$ . Consequently, when we write $\theta\sim q_{\Theta}(\theta)$ we actually mean sampling uniformly at random from $\Theta$ .

Our choice of perturbing particles via gradient descent to closer resemble the posterior is inspired by a popular Variational Inference method, Stein Variational Gradient Descent (SVGD) [4]. We note that VLFGD and SVGD have significant differences, i.e. SVGD is not applicable to the likelihood-free setting, and is based on the Kernelised Stein discrepancy not the KL.

Estimation of $r(\cdot,\cdot)$

We wish to estimate the density ratio

$r(\theta,x) = \frac{q_\Theta(\theta)p(x)}{p(\theta)p(x|\theta)},$

which we achieve with samples from the constituent densities. Samples from the denominator density, the joint distribution $(p(\theta)p(x|\theta))$ , are given by $\forall i: \theta_i \sim p(\theta),\ x_i \sim p(x|\theta_i)$ . Samples from the numerator are attained by pairing the $x$ samples from the denominator with samples $\theta' \sim q_\Theta(\theta)$ .

There is a well-established connection between binary classification and density ratio estimation [1]. We follow this connection and reframe our density ratio estimation problem as a binary classification problem.

Let’s label our numerator samples 1, and denominator samples 0. Once we have an estimate of the class probabilities $p(y=1|\theta,x) = 1-p(y=0|\theta,x)$ via some binary classification algorithm, then an estimate of the density ratio is recovered via

$\frac{q_\Theta(\theta)p(x)}{p(\theta)p(x|\theta)} = \frac{p(\theta,x|y=1)}{p(\theta,x|y=0)}$

$= \frac{p(y=1|\theta,x)p(\theta,x)}{p(y=1)} \cdot \frac{p(y=0)}{p(y=0|\theta,x)p(\theta,x)}$

$= \frac{p(y=1|\theta,x)}{p(y=0|\theta,x)},$

if the class proportions are equal. The binary classification algorithm we use is logistic regression. Specifically, the class probabilities are estimated by a function $g_\phi(\cdot,\cdot)$ wrapped in the logistic function

$p(y=1|\theta,x) = \sigma(g_\phi(\theta,x)) = \frac{1}{1+\exp(-g_\phi(\theta,x))},$

which is paired with the cross-entropy loss defined by

$\min_\phi \left\{ - \frac{1}{N}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \log \sigma(g_\phi(\theta_j',x_i)) -\frac{1}{N} \sum_{i=1}^{N} \log (1-\sigma(g_\phi(\theta_i,x_i))) \right\}.$

So, if we denote the solution to the above optimization problem as $\phi^{\star}$ then we have our estimate $\log\hat{r}(\cdot,\cdot)$ of $\log r(\cdot,\cdot)$ given by

$\log\hat{r}(\theta,x) = \log\left( \frac{\sigma(g_{\phi^{\star}}(\theta,x))}{1-\sigma(g_{\phi^{\star}}(\theta,x))} \right) = g_{\phi^{\star}}(\theta,x),$

via some simple algebra. We select the function $g_\phi$ to be a neural network.

Gradient of the KL

We are now in the position to estimate our global objective function

$\mathrm{KL}[q_\Theta(\theta) || p(\theta|x_{\mathrm{obs}})] \approx \frac{1}{N} \sum_{i=1}^N \log \hat{r}(\theta_i,x_{\mathrm{obs}}).$

We want to take the gradient of the above w.r.t. $\Theta$ (the parameters of our variational distribution and also the particles). Recalling that we defined $q_\Theta(\cdot)$ as an empirical distribution on particles $\{\theta_i\}_{i=1}^N$ , we see that

$\frac{\partial}{\partial{\theta_i}}\frac{1}{N} \sum_{j=1}^N \log \hat{r}(\theta_j,x_{\mathrm{obs}}) = \frac{1}{N} \frac{\partial}{\partial{\theta_i}} \log \hat{r}(\theta_i,x_{\mathrm{obs}})$

Thus the gradient descent update is given by

$\forall i: \quad \quad \theta_i \leftarrow \theta_i - \gamma_i \frac{1}{N} \frac{\partial}{\partial{\theta_i}}\log \hat{r}(\theta_i,x_{\mathrm{obs}}),$

where $\gamma$ is some learning rate.

Extras

Alternating between the density ratio estimation and gradient descent steps will drive the empirical distribution (our particles) towards the target posterior. Demonstration of our algorithm can be seen below

As intended, our particles more closely resemble samples from the posterior.

References

[1] Sugiyama, M., Suzuki, T. and Kanamori, T., 2012. Density ratio estimation in machine learning. Cambridge University Press.

[2] Cranmer, K., Brehmer, J. and Louppe, G., 2020. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48), pp.30055-30062.

[3] Simons, J., Liu, S. and Beaumont, M., 2021, November. Variational Likelihood-Free Gradient Descent. In Fourth Symposium on Advances in Approximate Bayesian Inference.

[4] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose bayesian inference algorithm. Advances in neural information processing systems, 29.