Student Perspectives: Factor-adjusted vector autoregressive models

A post by Dylan Dijk, PhD student on the Compass programme.


Introduction

My current project is looking to robustify the performance of time series models to heavy-tailed data. The models I have been focusing on are vector autoregressive (VAR) models, and additionally factor-adjusted VAR models. In this post I will not be covering the robust methodology, but will be introducing VAR models and providing the motivation for introducing the factor adjustment step when working with high-dimensional time series.

Vector autoregressive models

In time series analysis the objective is often to forecast a future value given past data, for example, one of the classical models for univariate time series is the autoregressive AR(d) model:
\[X_t = a_1 X_{t-1} + \dots + a_d X_{t-d} + \epsilon_t \, .\]
However, in many cases, the value of a variable is influenced not just by its own past values but also by past values of other variables. For example, in Economics, household consumption expenditures may depend on variables such as income, interest rates, and investment expenditures, therefore we would want to include these variables in our model.

The VAR model [1] is simply the multivariate generalisation of the univariate autoregressive model, that is, for a $p$-dimensional stochastic process $(\dots, \mathbf{X}_t, \mathbf{X}_{t+1}, \dots) \in \mathbb{R}^p$ we model an observation at time $t$ as a linear combination of previous observations up to some lag $d$ plus an error:

\[\mathbf{X}_t = \mathbf{A}_1 \mathbf{X}_{t-1} + \dots + \mathbf{A}_d \mathbf{X}_{t-d} + \boldsymbol{\epsilon}_t \, ,\]
where $\mathbf{A}_i$ are $p \times p$ coefficient matrices. Therefore, in addition to modelling serial dependence, the model takes into account cross-sectional dependence. This model can then be used for forecasting, and as an explanatory model to describe the dynamic interrelationships between a number of variables.

Estimation

Given a dataset of $n$ observations, $\{\mathbf{X}_1, \dots, \mathbf{X}_n \in \mathbb{R}^p\}$, we can aim to estimate the coefficient matrices. In order to do so, the model can be written in a stacked form:

\begin{align*} \underbrace{\left[\begin{array}{c}\left(\mathbf{X}_n\right)^{T} \\ \vdots \\ \left(\mathbf{X}_{d+1}\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{Y}}} & =\underbrace{\left[\begin{array}{ccc}\left(\mathbf{X}_{n-1}\right)^{T} & \cdots & \left(\mathbf{X}_{n-d}\right)^{T} \\ \vdots & \ddots & \vdots \\ \left(\mathbf{X}_{d}\right)^{T} & \cdots & \left(\mathbf{X}_1\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{X}}} \underbrace{\left[\begin{array}{c}\boldsymbol{A}_1^{T} \\ \vdots \\ \boldsymbol{A}_d^{T}\end{array}\right]}_{\boldsymbol{A}^T}+\underbrace{\left[\begin{array}{c}\left(\boldsymbol{\epsilon}_n\right)^{T} \\ \vdots \\ \left(\boldsymbol{\epsilon}_d\right)^{T}\end{array}\right]}_{\boldsymbol{E}}
\end{align*}
and subsequently vectorised to return a standard univariate linear regression problem
\begin{align*}
\operatorname{vec}(\boldsymbol{\mathcal{Y}}) & =\operatorname{vec}\left(\boldsymbol{\mathcal{X}} \boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \\ & =(\textbf{I} \otimes \boldsymbol{\mathcal{X}}) \operatorname{vec}\left(\boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \label{eq:stacked_var_regression_form}\\ \underbrace{\boldsymbol{Y}}_{N p \times 1} & =\underbrace{\boldsymbol{Z}}_{N p \times q} \underbrace{\boldsymbol{\beta}^*}_{q \times 1}+\underbrace{\operatorname{vec}(\boldsymbol{E})}_{N p \times 1}, \quad N=(n-d), \quad q=d p^2.
\end{align*}

Sparse VAR

There are $dp^2$ parameters to estimate in this model, and hence VAR estimation is naturally a high-dimensional statistical problem. Therefore, estimation methods and associated theory need to hold under high-dimensional scaling of the parameter dimension. Specifically, this means consistency is shown for when both $p$ and $n$ tend to infinity, as opposed to in classical statistics where $p$ is kept fixed.

The linear model in the high-dimensional setting is well understood [2]. To obtain a consistent estimator requires additional structural assumptions in the model, in particular, sparsity on the true vector $\boldsymbol\beta^*$. The common approach for estimation is lasso, which can be motivated from convex relaxation in the noiseless setting. Consistency of lasso is well studied [3][4], with consistency guaranteed under sparsity, and restrictions on the directions in which the hessian of the loss function is strictly positive.

The well known lasso objective is given by:

\begin{align*}
\underset{{\boldsymbol\beta \in \mathbb{R}^q}}{\text{argmin}} \, \|\boldsymbol{Y}-\boldsymbol{Z} \boldsymbol\beta\|_{2}^{2} + \lambda \|\boldsymbol\beta\|_1 \, ,
\end{align*}

and below, we give a simplified consistency result that can be obtained under certain assumptions.

We denote the sparsity of $\boldsymbol{A}$ by
$s_{0, j}=\left|\boldsymbol\beta^*_{(j)}\right|_0, s_0=\sum_{j=1}^p s_{0, j}$ and $s_{\text {in }}=\max _{1 \leq j \leq p} s_{0, j}$.

Lasso consistency result
Suppose
\begin{gather*}
\, s_{\text{in}} \leq C_1 \sqrt{\frac{n}{\log p}} \, \; \text{and } \; \lambda \geq C_2 (\|\boldsymbol{A}^T\|_{1,\infty} + 1)\sqrt{\frac{\log p}{n}} \; ,
\end{gather*}
then with high probability we have
\begin{align*}
|\widehat{\boldsymbol{A}} – \boldsymbol{A}|_2 \leq C_3 \sqrt{s_{0}} \lambda \quad \text{and} \quad |\widehat{\boldsymbol{A}} – \boldsymbol{A}|_1 \leq C_4 s_0 \lambda \, .
\end{align*}
What we mean here by consistency, is that as $n,p \rightarrow \infty$, the estimate $\widehat{\boldsymbol\beta}$ converges to $\boldsymbol\beta$ in probability. Where we think of $p$ as being a function of $n$, so the manner in which the dimension $p$ grows depends on the sample size. For example, in the result above, we can have consistency with $p = \exp(\sqrt{n})$.

The result indicates that for larger $p$ a more sparse solution, and a larger regularisation parameter is required. Similar results have been derived under various assumptions, for instance under a Gaussian VAR the result has been given in terms of the largest and smallest eigenvalues of the spectral density matrix of a series [5], and hence consistency requires that these quantities are bounded.

In summary, for lasso estimation to work we need $\boldsymbol{A}$ to be sufficiently sparse, and the largest eigenvalue of the spectral density matrix to be bounded. But are these reasonable assumptions to make?

First two leading eigenvalues of the spectral density matrix.

Heatmap of logged p-values for evidence of non-zero coefficients after fitting ridge regression model.

Well, intuitively, if a multivariate time series has strong cross-sectional dependence we would actually expect to have many non-zero entries in the VAR coefficients $\boldsymbol{A}_i$. The figures above, taken from [6], illustrate a real dataset in which there is statistical evidence for a non-sparse solution (heatmap), and that the leading eigenvalue of the spectral density matrix diverges linearly in $p$. Therefore providing an example in which two of the assumptions discussed above are unmet.

Factor-adjusted VAR

The idea now is to assume that the covariance of the observed vector $\mathbf{X}_t$ is driven by a lower dimensional latent vector. For example, the figures above were generated from a dataset of stock prices of financial institutions, in this case an interpretation of a latent factor could be overall market movements which captures the broad market trend, or a factor that captures the change in interest rates.
\begin{align}
\mathbf{X}_t &= \underset{p \times r}{\boldsymbol\Lambda} \underset{r \times 1}{\mathbf{F}_t} + \boldsymbol\xi_t \quad
\end{align}
Consequently, first fitting a factor model would account for strong cross-sectional correlations, leaving the remaining process to exhibit the individual behaviour of each series. Fitting a sparse VAR process will now be a more reasonable choice.

In the formula above, $\mathbf{F}_t$ is the factor random vector, and $\boldsymbol\Lambda$ the constant loading matrix, which quantifies the sensitivity of each variable to the common factors, and we can model $\boldsymbol\xi_t$ as a sparse VAR process, as described in the preceding sections.

References

[1] Lütkepohl, H. (2005) New introduction to multiple time series analysis. Berlin: Springer-Verlag.

[2] Wainwright, M. (2019) High-dimensional statistics: A non-asymptotic viewpoint – Chapter 7 – Sparse linear models in high dimensions. Cambridge, United Kingdom: Cambridge University Press.

[3] Geer, Sara A. van de, and Peter Bühlmann. (2009) On the Conditions Used to Prove Oracle Results for the Lasso. Electronic Journal of Statistics. Project Euclid, https://doi.org/10.1214/09-EJS506.

[4] Bickel, Peter J., Ya’acov Ritov, and Alexandre B. Tsybakov. (2009) Simultaneous Analysis of Lasso and Dantzig Selector. The Annals of Statistics. https://doi.org/10.1214/08-AOS620.

[5] Sumanta Basu, George Michailidis. (2015) Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics. https://doi.org/10.1214/15-AOS1315.

[6] Barigozzi, M., Cho, H. and Owens, D. (2024). FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series. Journal of Business & Economic Statistics.

Student Perspectives: How can we spot anomalies in networks?

A post by Rachel Wood, PhD student on the Compass programme.

Introduction

As our online lives expand, more data than we can reasonably consider at once is collected. Many of this is sparse and noisy data, needing methods which can recover information encoded in these structures. An example of these kind of datasets are networks. In this blog post, I explain how we can do this to identify changes between networks observing the same subjects (e.g. snapshots of the same graph over time).

Problem Set-Up

We consider two undirected graphs, represented by their adjacency matrices $\mathbf{A}^{(1)}, \mathbf{A}^{(2)} \in \{0,1\}^{n \times n}$. As we can see below, there are two clusters (pink nodes form one, the yellow and blue nodes form another) in the first graph but in the second graph the blue nodes change behaviour to become a distinct third cluster.

A graph at two time points, where the first time point shows two clusters and the second time point shows the blue nodes forming a new distinct cluster

Our question becomes, how can we detect this change without prior knowledge of the labels?

We can simply look at the adjacency matrices, but these are often sparse, noisy and computationally expensive to work with. Using dimensionality reduction, we can “denoise” the matices to obtain a $d$-dimensional latent representation of each node, which provides a natural measure of node behaviour and a simple space in which to measure change.

Graph Embeddings

There is an extensive body of research investigating graph embeddings, however here we will focus on spectral methods.
Specifically we will compare the approaches of Unfolded Adjacency Spectral Embedding (UASE) presented in [1] and CLARITY presented in [2]. Both of these are explained in more detail below.

UASE

UASE takes as input the unfolded adjacency matrix $\mathbf{A} = \left[ \mathbf{A}^{(1)}\big| \mathbf{A}^{(2)}\right] \in \{0,1\}^{2n \times n}$ and performs $d$ truncated SVD [3] to obtain a $d$-dimensional static and a $d$-dimensional dynamic representation:

Illustration of UASE where the green blocks show the rows and columns we keep, and the red blocks represent the rows and columns we discard.

Mathematically we can write this as:
\begin{equation*}
\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T = \mathbf{U}_{\mathbf{A}} \boldsymbol{\Sigma}_{\mathbf{A}} \mathbf{V}_{\mathbf{A}}^T + \mathbf{U}_{\perp \!\!\!\ } \ \boldsymbol{\Sigma}_{\perp \!\!\!\ } \ \mathbf{V}_{\perp \!\!\!\ }^T \ \approx \mathbf{U}_{\mathbf{A}} \boldsymbol{\Sigma}_{\mathbf{A}} \mathbf{V}_{\mathbf{A}}^T = \mathbf{X} \mathbf{Y}^T
\end{equation*}

where $\mathbf{U}_{\mathbf{A}}, \mathbf{V}_{\mathbf{A}}$ are the first $d$ columns of $\mathbf{U}$ and $\mathbf{V}$ respectively and $\boldsymbol{\Sigma}_{\mathbf{A}}$ is the diagonal matrix which forms the $d \times d$ upper left block of $\boldsymbol{\Sigma}$. This gives a static embedding $\mathbf{X} \in \mathbb{R}^{n \times d}$ and a time evolving embedding $\mathbf{Y} \in \mathbb{R}^{2n \times d}$.

The general approach in UASE literature is to measure change by comparing latent positions, which is backed by [4]. This paper gives a theoretical demonstration for longitudinal and cross-sectional stability in UASE, i.e. for observations $i$ at time $s$ and $j$ at time $t$ behaving similarly, their latent positions should be the same: $\hat Y_i^{(s)} \approx \hat Y_j^{(t)}$. This backs the general approach in the UASE literature of comparing latent positions to quantify change.

Going back to our example graphs, we apply UASE to the unfolded adjacency matrix and visualise the first two dimensions of the embedding for each of the graphs:

First two dimensions of the latent positions of observations in each graph

As we can see above, the pink nodes have retained their positions, the yellow nodes have moved a little and the blue nodes have moved the most.

CLARITY

Clarity takes a different approach, by estimating $\mathbf{A}^{(2)}$ from $\mathbf{A}^{(1)}$. An illustration of how it is done is shown below:

Illustration of Clarity, which represents $\mathbf{A}^{(1)}$ in low dimensions and asks whether this can provide a good representation of $\mathbf{A}^{(2)}$ as well. It keeps the same “structure matrix” ($\mathbf{U}^{(1)}$) by allowing a new non-diagonal “relationship” matrix ($\boldsymbol{\Sigma}^{(2)}$)

Again we provide a mathmatical explanation of the method. First we perform a $d$-dimensional truncated eigendecompositionon $\mathbf{A}^{(1)}$:

\begin{equation*}
\mathbf{A}^{(1)} = \mathbf{U}^{(1)} \boldsymbol{\Sigma}^{(1)} \mathbf{U}^{(1)T} + \mathbf{U}_{\perp \!\!\!\ } \ \boldsymbol{\Sigma}_{\perp \!\!\!\ } \ \mathbf{U}_{\perp \!\!\!\ }^T \ \approx \mathbf{U}^{(1)} \boldsymbol{\Sigma}^{(1)} \mathbf{U}^{(1)T} = \hat{\mathbf{A}}^{(1)}
\end{equation*}
where $\mathbf{U} \in \mathbb{R}^{n \times d}$ is a matrix of the first $d$ eigenvectors and $\Sigma \in \mathbb{R}^{d \times d}$ is a diagonal matrix with the first $d$ eigenvalues.

Then we estimate $\mathbf{A}^{(2)}$ as
\begin{equation*}
\hat{\mathbf{A}}^{(2)} = \mathbf{U}^{(1)} \boldsymbol{\Sigma }^{(2)} \mathbf{U}^{(1)T} \hspace{1cm} \text{where} \hspace{1cm} \boldsymbol{\Sigma}^{(2)} = \mathbf{U}^{(1)T} \mathbf{A}^{(2)} \mathbf{U}^{(1)}
\end{equation*}

As opposed to UASE, Clarity examines change between $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ by a quantity called persistence. These are defined as
\begin{equation*}
\mathbf{P}_i = \sum_{j =1}^{n}\left( \mathbf{A}_{ij}^{(2)} -\hat{\mathbf{A}}_{ij}^{(2)} \right)
\end{equation*}

The intuition here is that the persistences will capture structure in $\mathbf{A}^{(2)}$ that is not present in or explained by $\mathbf{A}^{(1)}$.

Returning to our example problem, we can see heatmaps of $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ alongside their Clarity estimates:

$\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ and their Clarity estimates

Looking at the figure above we can see that the Clarity estimate of ${\mathbf{A}^{(2)}}$ does not capture the third cluster that appears in the second graph and therefore should identify these nodes as anomalies.

Comparison

We can use receiver operating characteristic (ROC) curves to assess the success of our two methods. Given a score (in our case either the distance between latent positions or persistences) it plots the false positive rate against the true positive rate for a sequence of thresh-holds. We can see the ROCs below for $d = 2,3,4,5,6$

ROCs for Clarity (blue) and UASE (orange) for (left to right) $d = 2,3,4,5,6$

We can see that in lower dimensions UASE outperforms Clarity, but the performance degrades over time. This becomes a common problem in real world applications where the best choice for $d$ is unknown. Clarity on the other hand, does not have the same power as UASE but is more robust to dimension. Another difference between the two methods is that by allowing changes in relationship in the model, it is designed to cope with the entire graph changing a little bit.

Conclusion

We have now introduced two methods for identifying change and compared their performance in a simple example. One method produces stronger results overall but is much more sensitive to the choice of dimension than the other. My current research looks to investigate why Clarity succeeds in this area when many other methods fail, with the ultimate goal of using this knowledge to modify more powerful methods to also have this feature.

 

[1] Jones, A., & Rubin-Delanchy, P. (2020). The multilayer random dot product graph. arXiv preprint arXiv:2007.10455.

[2] Lawson, D. J., Solanki, V., Yanovich, I., Dellert, J., Ruck, D., & Endicott, P. (2021). CLARITY: comparing heterogeneous data using dissimilarity. Royal Society Open Science, 8(12), 202182.

[3] Wikipedia contributors. (2024, June 11). Singular value decomposition. In Wikipedia, The Free Encyclopedia. Retrieved 09:54, July 1, 2024, from https://en.wikipedia.org/w/index.php?title=Singular_value_decomposition&oldid=1228566091

[4] Gallagher, I., Jones, A., & Rubin-Delanchy, P. (2021). Spectral embedding for dynamic networks with stability guarantees. Advances in Neural Information Processing Systems, 34, 10158-10170.

Student Perspectives: Strategies for variational inference in non-conjugate problems

A post by Qi Chen, PhD student on the Compass programme.


Introduction

Variational inference is a method to approximate posterior distributions. In Bayesian statistics context, we would like to get access to the posterior distribution \[p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int_\mathcal{\theta} p(x|\theta)p(\theta) d\theta}\]
In most cases the denominator $p(D)$ is intractable, that is we can not compute it analytically. How should we proceed? There are two broad ways:

  • Using MCMC to simulate samples from the posterior distribution $p(\theta|D)$ to approximate the true posterior and get statistics of interest(mean, variance, etc.).
  • Approximate $p({\theta}|{x})\approx q(\theta)\in\mathcal{Q}$.

The former method is unbiased and the convergence is guaranteed by the law of large numbers. But it requires a large number of samples and is quite computational demanding if the dimension of parameters/dataset is large. The later one, called variational inference,is biased depends on the choice of $\mathcal{Q}$ but is much faster and more scalable.

We call $q$ the variational distributions. The idea behind variational inference, is to approximate the posterior $p({\theta}|{x})$ using $q({\theta})\in\mathcal{Q}$ by minimizing the KL divergence between $q({\theta})$ and the true posterior $p({\theta}|{x})$, with the following formal expression:\[q^*({\theta}) = argmin_{q\in\mathcal{Q}}\;KL(q({\theta})||p({\theta}|{x})) = \int_{\Theta} q({\theta})\log\left(\frac{q({\theta})}{p({\theta}|{x})}\right)d{\theta}\]
This is a traditional measure of distribution mismatch over the same domain, and it is easy to see that $q = p$ is equivalent to $KL(q||p)=0$.

There are broadly two questions we would like to answer:

  • How do we minimize $q$ over the space with the true posterior unknown?
  • How do choose the variational family $q$?

We now answer the first question:
Notice that \begin{align*}
\log p({x}) &= \int q({\theta})\log\left(\frac{p({x},\theta)q({\theta})}{p({\theta}|{x})q({\theta})}\right) d{\theta}\\
&= \int q({\theta}) \log\left(\frac{p({x},{\theta})}{q({\theta})}\right) d{\theta} + \int q({\theta})\log\left(\frac{q({\theta})}{p({\theta}|{x})}\right)d{\theta}
\end{align*}
From the above derivation, we see that the second part is simply just the KL divergence we wish to minimize. As $\log(p({x}))$ is fixed, minimizing KL divergence is equivalent to maximizing the first part. This answers the first question. The first part is called \textbf{evidence lower bound(ELBO)}, written in $\mathcal{L}(q({\theta}))$.

For the second question, in theory, suppose the variational distribution is parametrized by variational parameters ${\phi}$, we can start with any variaitonal distributions we like, following the basic criterion:
Supp($q({\theta};{\phi})\subseteq$Supp($p({\theta}|{x})$).
We also need Supp($p({\theta}|{x})\subseteq$Supp($p({\theta})$) which is guaranteed in most cases.

But randomly choosing some variational distributions with any model won’t make the algorithm always feasible. Indeed, all VI methods centered around the goal of optimizing the ELBO \[\phi^* = argmin_{{\phi}}\mathbb{E}_q\left[\log \frac{p({x},{\theta})}{q({\theta})}\right]\]

Traditional methods set the mean-field assumptions that all parameters are independent. This breaks down the objective and a local optimum could be achieved via a coordinate ascent algorithm. Some methods enlarge the mean-field space to some specific conditional dependences between parameters according to graphical models with conjugate exponential relationship between parent-child pairs[1]. This is further extended to non-conjugate pairs with custom approximations.

Some modern methods have been developed in the last decade based on the idea that the gradient of the ELBO could be expressed in the from of $\mathbb{E}_q(\cdot)$. This immediately brings attention to a combination of MC algorithms(for sampling from $q$) and stochastic gradient descent(for efficiency in the optimization). These methods benefits from the simplicity that there’s no need to analytically compute the gradients based on conditional dependence specifications for each model: it is an automatic algorithm, for a greater domain of models. But it is worth noting that even those methods are theoretically sound, they still face practical issue which I will show in the later sections.

In this post I will briefly go through some of these methods, specifically coordinate ascent variational inference, black-box variational inference and automatic differentiation variational inference.

Conjugate models: Coordinate Ascent Variational Inference

There are various assumptions we can make on $\mathcal{Q}$ . We start with the mean-field assumptions of the parameters [2] This is to assume the joint prior distributions of all parameters could be factorized completely. That is:\[q({\theta}) = \prod_{j=1}^m q_j({\theta}_j)\]
We now write ${\theta}_{-j}$ denote all the other latent variables except for ${\theta}_j$, with distribution $q_{-j}$.
If we only minimize $\mathcal{L}(q)$ against $q_j({\theta}_j)$, we are minimizing \[\mathbb{E}_{q_j}[\mathbb{E}_{q_{-j}}[\log p({\theta},{x})]] – \mathbb{E}_{q_j}[\log q_j({\theta}_j)]\]

Further write $r_j({\theta}_j) = \frac{1}{Z_j}\exp\{\mathbb{E}_{q_{-j}}[\log p({x},{\theta})]\}$ where $Z_j$ is some normalizing constant so that $r_j$ is a probability distribution. Then substitute in, we get \[\mathcal{L}(q_j) \propto \mathbb{E}_{q_j}[\log \frac{r_j({\theta}_j)}{q_j({\theta}_j)}] = -KL(q_j({\theta}_j)||r_j({\theta}_j))\]

Thus maximizing ELBO against $q_j$ is equivalent to set $q_j = r_j$, which is \[q_j({\theta}_j)\propto \exp\{\mathbb{E}_{q_{-j}}[\log p({x},{\theta})]\}\propto \exp\{\mathbb{E}_{q_{-j}}[\log p({\theta}_j|{\theta}_{-j},{x})]\}\]

Since we assume $q$ factorizes, maximizing $\mathcal{L}(q)$ is split into $m$ steps of maximizing $\mathcal{L}(q_j)$. This algorithm is called \textbf{coordinate ascent variational inference}(CAVI) or \textbf{block-coordinate assent}.

A algorithmic view is

  1.  Initialize $q({\theta}) = \prod_{j=1}^m q_j({\theta}_j)$
  2. Iterate until convergence:
    Update for each $q_j$ by $q_j = \frac{1}{Z_j}\exp(\mathbb{E}_{q_{-j}}[\log(p({\theta},{x}))])$This algorithm is guarantee to convergence since each iteration the ELBO increases.

This is not directly feasible for all cases, since we assume we can compute $r_j$ analytically. In case where there’s conditional conjugacy of likelihood and the prior on each $\theta_j$ conditioned on all other ${\theta}_{i\neq j}$. That is \[p(\theta_j|{\theta}_{i\neq j})\in \mathcal{A}(\alpha),\,p({x}|\theta_j, \theta_{i\neq j})\in \mathcal{B}(\theta_j)\rightarrow p(\theta_j|{x},{\theta}_{i\neq j})\in A(\alpha’)\]
this will be feasible. One particular family is that all complete conditionals lie in exponential family.

A distribution $p({\theta})$ is in exponential family if \[p(\theta) = h({\theta})\exp\{{\eta}^Tt({\theta}) – A({\eta})\}\]
Here $\eta$ is called natural parameter, and $A({\theta})$ satisfies \[A({\eta}) = \log \int h({\theta}) \exp \eta^Tt(\theta)d{\theta}\]
such that it integrates to 1.

Now assume that all the complete conditionals belong to an exponential family distribution, that is \[p({\theta}_j|{\theta}_{-j},{x}) = h({\theta}_j)\exp \{{\eta}_j^T({\theta}_{-j},{x}){\theta}_j – A({\eta}_j({\theta}_{-j},{x}))\}\]
where we assume that ${\theta}_j$ is already transformed to its appropriate sufficient statistic. We see now the CAVI becomes \begin{align*}
q_j({\theta}_j)&\propto \exp\{\log h({\theta}_j) + \mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]^T{\theta}_j – \mathbb{E}_{q_{-j}}[A({\eta}_j({\theta}_{-j},{x}))]\}\\
&\propto h({\theta}_j)\exp\{\mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]^T{\theta}_j\}
\end{align*}
where we see that the variational factors are in the same exponential family(due to conjugacy) as the complete conditionals, with the natural parameter updated to \[\phi_j = \mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]\]

But in most cases, for example Bayesian logistic regression, we do not have conditional conjugacy in our model. In this blog post, we introduce two methods which are developed in the last decade tackling the lack of conjugacy. Notice that variational inference is indeed an optimization problem, and these methods are derived from expressing the derivatives of the ELBO in terms of expectation over the vatiational distributions q: \[\frac{\partial ELBO}{\partial {\phi}} = \mathbb{E}_{q({\theta};{\phi})}[\cdot]\]
\section{Evaluable Models: Black Box Variational Inference}

We want to optimize \[\mathcal{L}({\phi}) = \mathbb{E}_{q}[\log p({\theta},{x})] – \mathbb{E}_q[\log q({\theta};{\phi})]\]
and we notice that
\begin{align*}
\triangledown_{{\phi}}\mathcal{L}({\phi}) &= \triangledown_{{\phi}}\int q({\theta};{\phi})\log \frac{p({\theta},{x})}{q(\theta;{\phi})} d{\theta}\\
&= \int q({\theta};{\phi})\triangledown_{{\phi}} \log q({\theta};{\phi})\log \frac{p({\theta},{x})}{q({\theta};{\phi})} + q({\theta};{\phi})\triangledown_{{\phi}}\log \frac{p({\theta},{x})}{q({\theta};{\phi})} d{\theta}\\
&= \mathbb{E}_{q}[\triangledown_{{\phi}}\log q({\theta};{\phi})(\log p({\theta},{x})-\log q({\theta};{\phi}))]
\end{align*}
This is proposed in [3]. We see this is an expectation under the variational distributions, and we only need

  • simulate from $q$.
  • evaluate the derivatives of $q$.
  • evaluate the model $p({\theta},{x})$.

This significantly relaxes the constraint of CAVI and enlarges the domain of models applicable.
In practice, we will use stochastic gradient descent to derive a noisy unbiased estimator of the gradient and adapt some step functions satisfying some conditions, for example \[\sum_j \rho_j =\infty\;\;\;\;\sum_j \rho_j^2 < \infty\]
A naive algorithm is as follows:

  • $t \gets 0$, $\delta \gets \infty$
  • While{$\delta > \tau$}{
    • $t \gets t+1$
    • ${\theta}^1,…,{\theta}^S\sim q({\theta},{\phi}_{t-1})$
    • $\hat{\triangledown}_{{\phi}}\mathcal{L}({\phi}_{t-1})\gets \frac{1}{S}\sum_{s=1}^S \triangledown_{{\phi}}\log q({\theta}^s;{\phi}_{t-1})(\log p({\theta}^s,{x})-\log q({\theta}^s;{\phi}_{t-1}))$
    • ${\phi}_t\gets{\phi}_{t-1} + \rho_t\hat{\triangledown}_{{\phi}}\mathcal{L}({\phi}_{t-1})$
    • $\delta \gets \frac{||{\phi}_t – {\phi}_{t-1}||}{||{\phi}_{t-1}||}$

}
Output{${\phi}^* = {\phi}^t$}

However, in practice, this algorithm does not produce meaningful result for non-trivial model, since the variance of this estimates grows linearly with the number of parameters in the model ${\theta}$. Due to the high variance, we need some variance reduction technique.

Rao-Blackwellization

Rao-Blackwellization reduces the variance of some estimator $J(X,Y)$ by defining another estimator \[\hat{J}(X) = \mathbb{E}[J(X,Y)|X]\]
It is clear that the expectation is preserved:\[\mathbb{E}[\hat{J}(X)] = \mathbb{E}[J(X,Y)]\]by tower law. The variance of this estimator is \[Var(\hat{J}(X)) = Var(J(X,Y)) + \mathbb{E}[\hat{J}(X)^2] – \mathbb{E}[J(X,Y)^2] = Var(J(X,Y)) – \mathbb{E}[(J(X,Y)-\hat{J}(X))^2]\]

Thus this new estimator always has less variance compared to $J(X,Y)$ unless $\hat{J}(X) = J(X,Y)$.

We now apply this to BBVI. Assume the approximating family follows the mean-field assumption, and let $p({x},{\theta}) = p_i({x},{\theta}_{(i)})p_{-i}({x},{\theta}_{-i})$
where $p_i$ are all the terms containing $\theta_i$, and $\theta_{(i)}$ is the collection of all latent variables that appear in $p_i$.
We can thus rewrite the derivatives of ELBO respect to $\theta_i$ as \[\hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i) = \mathbb{E}_{q_{(i)}}[\triangledown_{\phi_i}[\log q_i(z_i;\phi_i)(\log p_i({x},\theta_{(i)})-\log q_i(\theta_i;\phi_i))]]\]
This is a Rao-Blackwellized $\triangledown_{\phi_i}\mathcal{L}({\phi})$ as \[\mathbb{E}_q[\hat{\triangledown}_{\phi_i}\mathcal{L}({\phi}) – \hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i)] = C\mathbb{E}_{q_i}[\triangledown_{\phi_i}[\log q_i(\theta_i;\phi_i)]] = 0\]
with \[C = \mathbb{E}_{q_{-i}}[\log p_{-i}({x},{\theta}_{-i})] – \mathbb{E}_{q_{-i}}[\sum_{j\neq i}\log q_j(\theta_j;\phi_j)]\]
The detailed derivation could be found in [3].

Control variates

We now introduce another method using regression estimator. Suppose we want to estimate some parameter $\mu$ and we have an estimator $f$ with $\mathbb{E}[f(u)] = \mu$, u is a random variable. Furthermore, if we have a “similar” function $h$ such that $\mathbb{E}[h(u)] = \nu$ is known. Then we define a new estimator of $\mu$:\[g(u) = f(u)-\beta(h(u)-\nu)\]

This is clearly an unbiased estimator and for the variance term\[Var(g(u)) = Var(f(u)) + \beta^2 Var(h(u)) – 2\beta Cov(f(u),h(u))\]

In order to minimize this variance, we choose \[\hat{\beta} = \frac{Cov(h(u),f(u))}{Var(h(u))}\]

This is also the OLS estimator for the linear regression:\[f(u) = \mu + \beta(h(u)-\nu)\] Now plugging in this $\hat{\beta}$ we have \[Var(g(u)) = Var(f(u))(1-\rho^2_{fh})\] where $\rho^2_{fh}$ is the correlation between $f(u)$ and $h(u)$. Such $h$ is called the control variate.

Improved BBVI

The original author in [3] combined these two methods and choose $\triangledown_{\phi_i}\log q_i(\theta_i;\phi_i)$ as the control variate for $\hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i)$, which is shown below:

  • $t \gets 0$, $\delta \gets \infty$\
  • While{$\delta > \tau$}{
    • t \gets t+1$
    • ${\theta}^1,…,{\theta}^S\sim q({\theta},{\phi}_{t-1})$
    • For{$i\gets 1$to $n$}{
      • $f_i \gets \frac{1}{S}\sum_{s=1}^S \triangledown_{\phi_i}\log q(\theta_i^s;{\phi}^{t-1}_{i})(\log p_i({\theta}_{(i)}^s,{x})-\log q_i(\theta_i^s;{\phi}_i^{t-1}))$
      • $h_i\gets \frac{1}{S}\sum_{s} \triangledown_{\phi_i}[\log q_i(\theta_i^s;{\phi}_i^{t-1}))]$
      • $\hat{\beta}_i \gets \frac{\hat{Cov}(f_i,h_i)}{\hat{Var}(h_i)}$
      • $g_i \gets f_i-\hat{\beta}h_i$
      • $\phi_i^t\gets \phi_i^{t-1} + \rho_tg_i$
        }
    • $\delta \gets \frac{||{\phi}_t – \phi_{t-1}||}{||{\phi}_{t-1}||}$

}

  • Output{${\phi}^* = {\phi}^t$}

Final Conclusion for BBVI

According to the same authors in [4], they pointed out the limitation of BBVI. They found that the gradient can be very unstable for large values of their inputs, and adaptive step-size like AdaGrad needs extra tunning. Also, they found that, in the case of linear mixed effects model, it under-performs MH-Gibbs sampler. Also, they did experiment in LDA(Latent Dirichlet allocation), Gibbs sampler converged in couple of minutes for 20 topics but BBVI does not produce any reasonable results after hours of iterations for 2 topics. Thus, it requires more experiments and BBVI still has practical limitations.

Differentiable Models: Automatic Differentiation Variational Inference

The idea behind Automatic Differentiation Variational Inference(ADVI) is as follows

  • Transform the parameter space to real space: $T:Supp({\theta})\rightarrow\mathbb{R}^k$ by a one-to-one mapping.
  • Let ${\psi} = T({\theta})$ a joint normal distribution. That is \[q({\psi}|{\phi}) \sim \mathcal{N}({\mu},\Sigma)\] Notice that we need to ensure $\Sigma$ to be full rank. One way to do that is using Cholesky factorization: $\Sigma = LL^T$ where $L$ is a lower triangular matrix with dimension $(k+1)k/2$. Overall, ${\phi}$ lives in $\mathbb{R}^{(k+1)k/2+k}$ where $k$ is the dimension of parameters in our model. This comes with computational cost, so we may wish to make a mean-field assumption to ${\psi}$
  • Finally we make the standardization ${\eta} = S_{{\phi}}({\psi}) = L^{-1}({\psi}-{\mu})$. This makes $q({\eta}) = \mathcal{N}({\eta};{0},{I})$.

Following the above recipe, we can rewrite the ELBO as \[{\phi}^* = argmin_{\phi} \mathbb{E}_{\mathcal{N}({\eta};{0},{I})}\left[\log p\left({x},T^{-1}(S^{-1}_{{\phi}}({\eta}))\right) + \log |detJ_{T^{-1}}(S_{{\phi}}^{-1}({\eta}))|\right] + \mathbb{H}[q({\psi};{\phi})]\]
In this case, the variational parameters are contained in the transformation $S$. We now give the gradients:\[\triangledown_{{\mu}}\mathcal{L} = \mathbb{E}_{\mathcal{N}({\eta})}[\triangledown_{{\theta}}\log p({x},{\theta})\triangledown_{{\psi}}T^{-1}({\psi}) + \triangledown_{{\psi}}\log|detJ_{T^{-1}}({\psi})|]\]
and \[\triangledown_{L}\mathcal{L} =\mathbb{E}_{\mathcal{N}({\eta})}[\left(\triangledown_{{\theta}}\log p({x},{\theta})\triangledown_{{\psi}}T^{-1}({\psi}) + \triangledown_{{\psi}}\log|detJ_{T^{-1}}({\psi})|\right){\eta}^T] + (L^{-1})^T\]
Now similar to BBVI, we can use MC algorithm and SGD to get an approximate gradient and do gradient descent. In [5] they propose a gradient of the form
\[\rho_k^i = \eta\times i^{-1/2+\epsilon}\times\left(\tau + \sqrt{s_k^i}\right)^{-1}\]
where \[s_k^i = \alpha (g_k^i)^2 + (1-\alpha)s_k^{i-1}\]
Here $k$ is the kth element and $i$ is the ith iteration. $g_k^i$ is the gradient vector at iteration i, and $s_k^1 = (g_k^1)^2$

Notice that here $\eta$ is another variable controls the scale of the step size sequence, it could be searched among $\{0.001,0.1,1,10,100\}$. $\epsilon$ is set to be small, for example $\epsilon = 10^{-6}$, to satisfy the Robbins and Monro conditions. The last term is to keep the memory of the past gradients. More details could be found in [5].

 

It is shown that in ADVI, variance of estimates of the gradients is controled better compared to BBVI. The performance is also compared to those famous MC methods, result is also displayed below.

 

 

[1] John Winn and Christopher M. Bishop. Variational message passing. Journal of Machine Learning Research, 6(23):661–694, 2005.

[2] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, apr 2017.

[3] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference, 2013.

[4] Rajesh Ranganath, Sean Gerrish, and David Blei. Black Box Variational Inference. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on

Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822, Reykjavik, Iceland, 22–25 Apr 2014. PMLR.

[5] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. J. Mach. Learn. Res., 18(1):430–474, jan 2017.

 

Student Perspectives: Avoiding our problems ~ Noise Contrastive Estimation

A post by Henry Bourne, PhD student on the Compass programme.


Currently I’ve been researching Noise Contrastive Estimation (NCE) techniques for representation learning aided by my supervisor Dr. Rihuan Ke. Representation learning concerns itself with learning low-dimensional representations of high-dimensional data that can then be used to quickly solve a general downstream task, eg. after learning general representations for images you could quickly and cheaply train a classification model on top of the representations.  

NCE is a general estimator for parametrised probability models as I will explain in this blogpost. However, it can also be cleverly used to learn useful representations in an unsupervised (or equivalently self-supervised) manner, which I will also explain. I’ll start by explaining the problem that NCE was created to solve, then provide a quick comparison to other methods, explain how researchers have built on this method to carry out representation learning and finally discuss what I am currently working on. 

NCE solves the problem of computing a normalising constant by avoiding the problem altogether and solving some other proxy problem. Methods that are able to model unnormalised probability models are known as Energy Based Models (EBM’s). We will begin by describing the problem with the normalising constant before getting on to how we will avoid it. 

 

The problem … with the normalising constant 

Let’s say we have some arbitrary probability distribution, $p_{d}(\cdot)$, and a parametrised probability model, $p_{m}(\cdot ; \alpha)$, which we would like to accurately model the underlying probability distribution. Let’s further assume that we’ve picked our model well such that $\exists \alpha^{*}$ such that $p_{d}(\cdot) = p_{m}(\cdot ; \alpha^{*})$. 

Let’s just fit it to our data sampled from the underlying distribution using Maximum Likelihood Estimation! Sounds like a good idea, MLE has been extensively used, is reliable, is efficient and achieves the Cramer-Rao lower bound (the lowest possible bound an unbiased estimator can achieve for its variance/MSE), is asymptotically normal, is consistent, is unbiased and doesn’t assume normality. Moreover, there are a lot of tweaked MLE techniques out there that you can use if you would like an estimator with slightly different properties. 

First let’s look under the hood of our probability model, we can write it as so:

$
\begin{array}{l|l}
p_{m}(\cdot;\alpha)=\frac{p_{m}^{0}(\cdot; \alpha)}{Z(\alpha)} & \text{where,} \: Z(\alpha) = \int p_{m}^{0}(u; \alpha) du
\end{array}
$

The likelihood is our probability model for some $\alpha$ evaluated over our dataset. Evaluating the likelihood becomes tricky when there isn’t an analytical solution for the normalisation term, $Z(\alpha)$, and the possible set of values $u$ can take becomes large. For example if we would like to learn a probability distribution over images then this normalisation term becomes intractable.

By working with the log we get better numerical stability, it makes things easier to read and it makes calculations and taking derivatives easier. So, let’s take the log of the above:

$
\begin{aligned}
&{} p_{m}(\cdot;\alpha) = \frac{p_{m}^{0}(\cdot; \alpha)}{Z(\alpha)} \\
& \Rightarrow \log p_{m}(\cdot; \theta) = \log p_{m}^{0} (\cdot ; \alpha) +c
\end{aligned}
$
$\text{Where, } \\ \theta = \{\alpha, c \}, \\ \text{c an estimate of} -\log Z(\alpha)$

Where, we write $p_{m}^{0}(\cdot;\alpha)$ to represent our unnormalized probability model. After taking the $\log$ we can write our normalising constant as $c$ and then include it as a parameter of our model. So, our new model now parameterised by $\theta$, $p_{m}(\cdot;\theta)$, is self-normalising, ie. it estimates it’s normalising constant. Another approach to make the model self-normalising would be to simply set $c=0$, implicitly making the model self-normalising. This is what is normally done in practice, but it assumes that your model is complex enough to be able to indirectly model $c$.

Couldn’t we just use MLE to estimate $\log p_{m}(\cdot ; \theta)$? No we can’t! This is because the likelihood can be made arbitrarily large by making $c$ large.

This is where Noise Contrastive Estimation (NCE) comes in. NCE has been shown theoretically and empirically to be a good estimator when taking this self-normalizing assumption. We’ll assess it versus competing methods at the end of the blogpost. But before we do that let’s first describe the original NCE method named binary-NCE [1] later we will mention some of the more complex versions of this estimator. 

 

Binary-NCE

The idea with binary-NCE [1] is that by avoiding our problems we fix our problems! ie. We would like to create and solve an ‘easier’ proxy problem which in the process solves our original problem.

Let’s say we have some noise-distribution, $p_{n}(\cdot)$, which is easy to sample from, allows for an analytical expression of $\log p_{n} (\cdot)$ and is in some way similar to our $p_{d}(\cdot)$ (our underlying probability distribution which we are trying to estimate). We would also like $p_{n}(\cdot)$ to be non-zero wherever $p_{d}(\cdot)$ is non-zero. Don’t worry too much about these assumptions as they are normally quite easy to satisfy, apart from an analytical expression being available. They just are necessary for our theoretical properties to hold and for binary-NCE to work in practice. 

We would like to create and solve a proxy problem where given a sample we would like to classify whether it was drawn from our probability model or from our noise distribution. Consider the following density ratio.

$
\begin{aligned}
\frac{p_{m}(u;\alpha)}{p_{n}(u)}
\end{aligned}
$

If this density ratio is bigger than one then it means that $u$ is more likely to have come from our probability model, $p_{m}(\cdot;\alpha)$. If it is smaller than one then $u$ is more likely to have come from our noise distribution, $p_{n}(\cdot)$. Therefore, if we can model this density ratio then we will have a model for how likely a sample is to have come from our probability model as opposed to have being sampled from our noise distribution. 

Notice that we are modelling our normalised probability model above, we can rewrite it in terms of our unnormalised probability model as follows.

$
\begin{aligned}
& \log \left(\frac{p_{m}(u;\alpha)}{p_{n}(u)} \right) \\
& = \log \left(\frac{p_{m}^{0}(u;\alpha)}{Z(\alpha)} \cdot \frac{1}{p_{n}(u)} \right) \\
& = \log \left(\frac{p_{m}^{0}(u;\alpha)}{p_{n}(u)} \right) +c \\
& = \log p_{m}^{0}(u;\alpha) + c – \log p_{n}(u) \\
& = \log p_{m}(u;\theta) – \log p_{n}(u)
\end{aligned}
$

Let’s now define a score function $s$ that we will use to model our rewrite of the density ratio just above:

$
\begin{aligned}
s(u;\theta) = \log p_{m}(u;\theta) – log p_{n}(u)
\end{aligned}
$

One further step before introducing our objective function. We would like to model our score function somewhat as a probability, we would also like our model to not just increase the score indefinitely. So we will put our modelled density ratio through the sigmoid/ logistic function.

$
\begin{aligned}
\sigma(s(u;\theta)) = \frac{1}{1+ \exp(-s(u;\theta))}
\end{aligned}
$

We would like to classify according to our model of the density ratio whether the sample is ‘real’ / ‘positive or just ‘noise’/  ‘fake’/ ‘negative’. So a natural choice for the objective function is the cross-entropy loss.
$
\begin{aligned}
J(\theta) = \frac{1}{2N} \sum_{n} \log [ \sigma(s(x_{n};\theta))] + \log [1- \sigma(s(x_{n}’;\theta))]
\end{aligned}
$

Where $x_{i} \sim p_{d}$, $x_{i}’ \sim p_{n}$ for $i \in \{1,…,N\}$. Here we simply assume one noise sample per observation, but we can trivially extend it to any integer $K>0$ and in fact asymptotically the estimator gets better performance as we increase K.  

Once we’ve estimated our density ratio we can easily recover our normalised probability model of the underlying distribution by adding the log probability density of the noise function and taking the exponential. 

This estimator is consistent, efficient and asymptotically normal. In [1] they also showed it working empirically in a range of different settings.

 

How does it compare to other estimators of unnormalised parameterised probability density models?

NCE is not the only method we can use to solve the problem of estimating an unnormalised parameterised probability model. As we mentioned NCE belongs to a family of methods named Energy Based Models (EBM’s) which all aim to solve this very problem of estimating an unnormalised probability model. Let’s very briefly mention some of the alternatives from this family of methods, please do check out the references in this sub-section if you would like to learn more. We will talk about the methods as they appeared in their seminal form. 

One alternative is called contrastive divergence which estimates an unnormalised parametrised probability model by using a combination of MCMC and the KL divergence. Contrastive Divergence was originally introduced with Boltzmann machines in mind [9], MCMC is used to generate samples of the activations of the Boltzmann machine and then the KL divergence measures the difference between the distribution of the activations given by the real data and the simulated activations. We then aim to minimise the KL divergence. 

Score matching [11] models a parameterised probability model without the computation of the normalising term by estimating the gradient of the log density which it calls the score function. It does this by minimising the expected square distance between the score function and the score function of the observed data. However, obtaining the score function of the observed data requires estimating a non-parametric model from the data. They magically avoid doing this by deriving an alternative form of the objective function, through partial integration, leaving only the computation of the score function and it’s derivative. 

Importance sampling [10], which has been around for quite a while uses a weighted version of MCMC to focus on parts of the distribution that are ‘more important’ and in the process self-normalises. Which makes it better than regular MCMC because you can use it on unnormalised probability models and it should be more efficient and have lower variance. 

[1] contains a simple comparison between NCE, contrastive divergence, importance sampling and score matching. In their experimental setting they found contrastive divergence got the best performance, closely followed by NCE. They also measured computation time and found NCE to be the best in terms of error versus computation time. This by no means crowns NCE as the best estimator but is a good suggestion as to it’s utility, so is the countless ways it’s been used with high efficacy on a multitude of real-world problems. 

 

Building on Binary-NCE (Ranking-NCE and Info-NCE)

Taking inspiration from Binary-NCE a number of other estimators have been devised. One such estimator is Ranking-NCE [2]. This estimator has two important elements.

The first is that the estimator assumes that we are trying to model a conditional distribution, for example $p(y|x)$. By making this assumption our normalising constant is different for each value of the random variable we are conditioning on, ie. Our normalising term is now some $Z(x;\theta)$ and we have one for each possible value of x. This loosens the constraints on our estimator as we don’t require our optimal parameters, $\theta^{*}$, to satisfy $\log Z(x;\theta^{*}) = c$ for some $c$ for all possible values of $x$. This means we can apply our model to problems where the number of possible values of $x$ is much larger than the number of parameters in our model. For further details on this please refer to [2], section 2. 

The second is that it has an objective that given an observed sample $x$, and an integer $K>1$ samples from the noise distirbution, the objective ranks the samples in order of how likely they were to have come from the model versus the noise distribution. Again for further details please refer to [2]. 

Importantly this version of the estimator can be applied to more complex problems and empirically has been shown to achieve better performance. 

Now what we’ve been waiting for … how can we use NCE for representation learning? This is where Info(rmation) NCE comes in. It essentially is Ranking-NCE but we chose our conditional distribution and noise distribution in a specific way.

We consider a conditional probability of the form p(y|x) where $y \in \mathbb{R}^{d_{y}}$, $x \in \mathbb{R}^{d_{x}}$, $d_{y} < d_{x}$. Where $x$ is some data and $y$ is the low-dimensional representation we would like to learn for $x$. We then choose our noise distribution, $p_{n}$, to be the marginal distribution of our representation $y$, $p_{y}$. So our density ratio becomes.

$
\begin{aligned}
\frac{p_{m}(y|x; \theta)}{p_{y}(y)}
\end{aligned}
$

This is now a measure of how likely a given $y$ is to have come from the conditional distribution we are trying to model, ie. how likely is this representation to have been obtained from $x$, versus being some randomly sampled representation. 

A key thing to notice is that we are unlikely to have an analytical form of the $log$ of the marginal distribution of $y$. In fact, this doesn’t matter as we aren’t actually interested in modelling the conditional distribution in this case. What we are interested in is the fact that by employing a Ranking-NCE style estimator and modelling the above density ratio we maximise a lower bound on the mutual information between $Y$ and $X$, $I(Y;X)$. A proof for this along with the actual objective function can be found in [3].  

This is quite an amazing result! We solve a proxy problem of a proxy problem and we get an estimator with great theoretical guarantees that is computationally efficient that maximises a mutual information which allows us to, in an unsupervised manner, learn general representations for data. So we avoid our problems twice! I appreciate that above were two big jumps with not much detail but I hope it gives a sense as to the link between NCE in it’s basic form and representation learning. More specifically, NCE is known as a self-supervised learning method which simply means an unsupervised method which uses supervised methods but generates its own teaching signal. Even more specifically, NCE is a contrastive method which gets its name from the fact that it contrasts samples against each other in order to learn. The other popular category of self-supervised learning methods are called generative models, you may have heard of these! 

 

My Research

Now we know a little bit about NCE and how we can use it to do representation learning, what am I researching?

Info-NCE has been applied with great success in many self-supervised representation learning techniques, a good one to check out is [4]. Contrastive self-supervised learning techniques have been shown to outperform supervised learning in many areas. They also solve some of the key challenges that face generative representation learning techniques in more challenging domains than language such as images and video. This review [5] is a good starting point for learning more about what contrastive learning and generative learning are and some of their differences. 

However, there are still lots of problem areas where applying NCE, without very fancy neural network architectures and techniques, doesn’t do so well or outright fails. Moreover, many of these techniques introduce extra requirements on memory, compute or both. Additionally, they can often be highly complex and their ablation studies are poor. 

Currently, I’m looking at applying new kinds of density ratio estimation methods to representation learning, in a similar way to info-NCE. These new density ratio estimation techniques when applied in the correct way will hopefully lead to representation learning techniques that are more capable in problem areas such as multi-modal learning [6], multi-task learning [7] and continual learning [8]. 

Currently, of most interest to me is multi-modal learning. This is concerned with learning a joint representation over data comprised of more than one modality, eg. text and images. By being able to learn representations on data consisting of multiple modalities it’s possible to learn higher quality representations (more information) and makes us capable of solving more complex tasks that require working over multiple modalities, eg. most robotics tasks. However, multi-modal learning has a unique set of difficult challenges that make naively using representation learning techniques on it challenging. One of the key challenges is balancing a trade-off between learning to construct representations that exploit the synergies between the modalities and not allowing the quality of the representations to be degraded by the varying quality and bias of each of the modalities. We hope to solve this problem in an elegant and simple manner using density ratio estimation techniques to create a novel info-NCE style estimator.

Hope you enjoyed! If you would like to reach me or read some of my other blogposts (I have some more in-depth ones about NCE coming out soon) then checkout my website at /phd.h-0-0.com. 

 

References

[1] :
Gutmann, M. and Hyvärinen, A., 2010, March. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 297-304). JMLR Workshop and Conference Proceedings.

[2] :

Ma, Z. and Collins, M., 2018. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812.

[3] :

Oord, A.V.D., Li, Y. and Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[4] :

Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.

[5] :
Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J. and Tang, J., 2021. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering, 35(1), pp.857-876.
[6] :

Baltrušaitis, T., Ahuja, C. and Morency, L.P., 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2), pp.423-443.

[7] :

Zhang, Y. and Yang, Q., 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12), pp.5586-5609.

[8] :
Wang, L., Zhang, X., Su, H. and Zhu, J., 2024. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] :

Carreira-Perpinan, M.A. and Hinton, G., 2005, January. On contrastive divergence learning. In International workshop on artificial intelligence and statistics (pp. 33-40). PMLR.

[10] :

Kloek, T. and Van Dijk, H.K., 1978. Bayesian estimates of equation system parameters: an application of integration by Monte Carlo. Econometrica: Journal of the Econometric Society, pp.1-19.

[11] :

Hyvärinen, A. and Dayan, P., 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4).

 

 

Student Perspectives: What is Confounding?

A post by Emma Tarmey, PhD student on the Compass programme.

This blog post serves as an introduction to the problem of confounder handling within the broader topic of covariate selection and model selection for causal inference purposes.  In this post, we begin with a motivating example, describe the problem of confounding, describe current solutions to the problem and how statistical solution methods compare to knowledge-based solution methods.  It is intended that readers come away from this article understanding which use cases each of the solution methods are intended for, as well as what advantages and disadvantages each method provides.

Introduction

There exists a common saying, “correlation does not imply causation”.  This phrase is often used when discussing statistical analyses to describe the idea that just because two phenomena or patterns often appear together, does not automatically mean that one necessarily causes the other.  There are a number of reasons why two events, A and B, may occur together, with “A causes B” being only one of several explanations for the observed correlation.  In epidemiology, substantiating a causal claim, “causal inference”, can be highly valuable towards determining medical best practice and testing the effectiveness of medical treatments and interventions.  A correlation between two events, A and B, may be distorted or even fabricated whole-cloth by the influence of an outside event C, which mutually causes both.  As such, particularly in the context of clinical trials for medical treatments, verifying that no such outside influences are distorting our results is essential for producing valid causal inferences.

Yellow Fingers and Lung Cancer

To motivate the idea of a distorted correlation from the introduction, we look to a famous example: the association between the yellowing at the tip’s of ones fingers and incidence of lung cancer.[1][2]  We observe from the literature that, when attempting to predict incidence of lung cancer, the yellowing of ones finger tips makes an excellent predictor variable.[1]  However, there is no causal link between these two events,  instead, both the yellowing and lung cancer are mutually caused by smoking.[2]  This, in turn, creates an unhelpful statistical association between the two variables, one which is then correctly estimated in modelling but no longer corresponds just to our causal pathway.  As such, when attempting to understand causal factors to lung cancer, it becomes important not to declare yellowing as a cause despite the fact that yellowing may “look like a cause” based on the data itself.

One can imagine, in an isolated example like this, it can be straight-forward to detect this from first principles using the existing causal knowledge we have.  But, if for example a given study has not recorded smoking as a variable, we become unable to identify the phenomenon and thus unable to correctly attribute the source of our statistical associations.  The phenomenon within causal structures of a common cause is referred to as “confounding”, thus giving us the sub-problem of “confounder handling” when attempting to use our statistical models for causal inference.  Notably, causal pathways can be more complex than the above example.  If we have a longer pathway by which we add to the statistical association between X and Y, any such covariate on that pathway is a potential confounder, whose adjustment will solve our problem.  We define our problem formally as follows:

Problem: Confounder Handling

Confounding is defined as the phenomenon within any causal structure wherein both an exposure (input variable) and outcome (output variable) are mutually caused by a third outside variable (the confounder).  This in turn creates a statistical association between the two covariates which is not attributable to a causal pathway from X to Y.  This phenomenon takes the following general shape:

It can be helpful to think of confounder-handling as containing two sub-problems which we solve together:

  1. Confounder Identification: Identifying the set of all covariates which act as confounders within a given causal structure
  2. Control-Set Selection: Selecting an optimal (by some criterion) subset of these identified confounders to include in the model to best control for confounding

This problem can be though of in the following way:

We may control for a variable by means of including it within our regression or remove the influence of a variable altogether by stratifying our data.  These, in turn, both remove the statistical association attributable to confounding.  However, when the causal structure is larger and more complex, correctly handling confounding becomes trickier.  Firstly, we risk inducing “selection”, and thus creating more confounding pathways, if we adjust for covariate which confounds the X-Y relationship but is itself also caused by other covariates.  Secondly, if we adjust for an “instrument” of X, that being a covariate Z which is a cause of X but not of Y, then we risk amplifying bias from unseen confounding.  Thirdly, further issues arise if many covariates within the model are correlated with each other, as then estimating a given causal effect becomes much more difficult, even for an unconfounded model.

Additionally, though this may seem to go without saying, we only have the variables that we have.  Unmeasured confounding, from a covariate not within our dataset, can very much produce the same distortions but also be impossible to control for.  With all this in mind, we look to the existing solutions to these above problems.

Solution: Confounder-Handling

There exist two broad solution types to the problem of confounder-handling, those being:

  1. A direct approach working from causal knowledge
  2. An indirect approach working from observed data

Existing knowledge-based solutions include:

  • Back-door path criterion: [3]
    • The back-door path criterion states that the causal effect is identifiable if there does not exist any “back door path” connecting the exposure X and outcome Y within the causal structure.
    • As such, we may prevent confounding by controlling a variable present on any such existing path to “block” this path and thus prevent confounding via that path.
  • Front-door path criterion: [3]
    • The front-door path criterion states that the causal effect is identifiable (our statistical association is still a consistent estimator of the causal effect), even if the backdoor path criterion isn’t strictly satisfied.  If we have a “mediator” covariate M, a covariate which sits between two covariates creating a direct path via itself, between X and Y, the the X-Y causal effect remains identifiable if we satisfy all of the following:
      1. M intercepts all causal pathways from X to Y
      2. There does not exist any backdoor path between X and M
      3. X blocks every backdoor path from M to Y
  • Pre-treatment criterion: [4]
    • The pre-treatment criterion states that, if we control for all covariates which occur prior to the exposure X in time, then we must necessarily have controlled for all confounders, and thus our causal effect is identifiable.
  • Common-cause criterion : [4]
    • The common-cause criterion states that, if we control for any and all covariates who mutually cause both the exposure X and outcome Y, then we must necessarily have controlled for all confounders.
  • (Twice-modified) Disjunctive-cause criterion: [4]
    • The (twice-modified) disjunctive cause criterion states that we can construct a sufficient adjustment set S in the following way:
      1. Add to our set S any pre-exposure covariate which is a cause of X, Y or both
      2. Remove from S any covariate Z which acts as an instrument of X
      3. Add to S any covariate which, though not satisfying condition 1, can act as a good proxy for unmeasured confounders of the X-Y relationship
  • District criterion (iterative graph expansion): [5]
    • The district criterion states that we have controlled for confounding if we our adjustment set S does indeed leave covariates X and Y in separate “districts” of a specially defined sub-graph of our wider causal structure, the setup of which is beyond the scope of this blog article.
    • This criterion forms the theoretical justification to the method of iterative graph expansion proposed in the same paper, which readers are encouraged to find from the references if they would like to learn more.

Existing statistically-based solutions include:

  • Step-wise regression: [6]
    • Stepwise regression is a variable selection and model fitting procedure, which works by means of iteratively adding and removing explanatory variables (covariates other than X and Y) to form an optimal model where all explanatory variables are considered significant by some outside significance criterion (such as AIC).
  • LASSO (Least absolute shrinkage and selection operator): [7]
    • LASSO is a parameter estimation procedure typically employed for variable selection, which can be employed similarly for confounder identification.

More bespoke statistical solutions include:

  • Change-in-estimate approach: [8]
    • The change in estimate approach detects confounding via statistical significance testing, iteratively as covariates are added and removed.  The idea, intuitively, is that if removing an outside variable as explanatory has a significant impact on the X-Y relationship, then it was likely confounding the two, and is identified as such.
  • Targeted maximum likelihood estimators: [9]
    • Targeted maximum likelihood estimators (TMLEs) are doubly-robust parameter estimators, which can be used for determining regression coefficients for statistical models while optimizing the bias-variance trade-off.  This is used for confounder identification similarly to LASSO.

We have seen many approaches to the problem, but which is best?  In thinking this through, we conclude that which approach is best depends on one’s intended use case.  Specifically:

  1. Whether or not causal knowledge is available, with causal methods preferred as these provide guarantees of unconfoundedness in the result
  2. If causal knowledge is available, how much?  Are we able to fully enumerate our problem?

Since different knowledge-based methods require different amounts of causal knowledge and provide stronger and weaker results correspondingly, it makes sense to select the approach most suited to the DAG we’re presently examining.  However, knowledge-based methods scale poorly to larger causal structures, both in terms of running their algorithms and of enumerating the DAG to begin with – they quickly become intractable.  Hence – statistical approaches, which provide weaker results with regards to unconfoundedness, but scale much better to larger causal scenarios and in principle require no causal knowledge to execute.

Conclusion

In conclusion, there exists a problem of confounding within the field of causal inference, and different solutions to this problem offer different advantages and disadvantages.  Which solution is necessarily “best” depends upon your use case, specifically size of use-case and amount of causal knowledge available.

Contact Details

Miss Emma Jane Tarmey (she/her), University of Bristol, emma.tarmey@bristol.ac.uk

References

  1. Smith, George Davey and Phillips, Andrew N. Confounding in epidemiological studies: why ”independent” effects may not be all they seem. British Medical Journal, 305(6856):757–759, September 1992.
  2. Rothman, Kenneth J. et al. Serum Beta-Carotene: A Mechanism or ”Yellow Finger”? Epidemiology, 3(4):277–279, July 1992.
  3. Pearl, Judea. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995.
  4. VanderWeele, Tyler J. Principles of Confounder Selection. European Journal of Epidemiology, 34:211–219, 2019, Section 4
  5. F. Richard Guo and Qingyuan Zhao. Confounder Selection via Iterative Graph Expansion. arXiv, October 2023
  6. VanderWeele, Tyler J. Principles of Confounder Selection. European Journal of Epidemiology, 34:211–219, 2019, Section 5
  7. Susan M. Shortreed and Ashkan Ertefaie. Outcome-Adaptive Lasso: Variable Selection for Causal Inference. Biometrics, 73:1111–1122, 2017. Publisher: Wiley.
  8. Talbot, Denis and Diop, Awa and Lavigne-Robichaud, Mathilde and Brisson, Chantal. The change in estimate method for selecting confounders: A simulation study. Statistical Methods in Medical Research 30(9):2032–2044, 2021.
  9. Schuler, Megan S. and Rose, Sherri. Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. American Journal of Epidemiology, 185(1):65–73, January 2017.
Skip to toolbar