Student Perspectives: Bayesian LLM Finetuning

A post by Sam Bowyer, PhD student on the Compass programme.

Training large AI models is tricky business. First you’ll want to raise money — and lots of it. (OpenAI’s GPT-4 reportedly cost over $100 million to train, roughly equivalent to 0.5% of Bristol’s GDP.) With that money you’ll need to buy hardware (25,000 NVIDIA A100 GPUs should do), hire a team of talented engineers, and purchase licenses for vast quantities of data (though you might consider forgoing that last one and just hoping no one complains…). Once you’ve collected enough data (say, ~13 trillion tokens’ worth1) and settled on a model architecture with hundreds of billions, if not trillions, of parameters (each taking up at least a byte of memory), you can sit back and wait around for 100 days whilst your engineers firefight software and hardware crashes to steer your model’s training to completion.2

But for those of us who can’t afford the $10^{25}$ FLOPs (floating point operations) needed to train such a model (or who might want to avoid the associated environmental costs), what can we do? The answer lies in finetuning: taking one of the available pretrained ‘foundation’ models (such as ChatGPT, or an open source model such as one from Meta’s Llama series) and tweaking it to suit your own purposes.

The basic idea is this: these foundation models are great multitaskers, trained well enough to generate reasonable outputs for a wide variety of inputs. But if you’re only interested in using them on a particular set of data ($\mathcal{D}_\text{finetune}$), or for a particular task, then it might be a good idea to spend some extra time training on that data specifically, after the rest of (pre)training has taken place. Similarly, it’s worth noting that the foundation model you get straight out of pretraining will mimic its input dataset, $\mathcal{D}$. In the case that $\mathcal{D}$ is too large to be checked by humans (e.g. 13 trillion tokens — essentially including most of the public internet), your model will almost certainly have learnt undesirable behaviour and be capable of producing dangerous, offensive, and harmful output. Finetuning is critical to the pursuit of safe AI, putting guardrails in place and ensuring that a model’s behaviour is aligned with our desires, both in terms of utility and safety.3

In this blog post, I’ll give an overview of LLM finetuning, specifically parameter-efficient finetuning, which tackles the problem of finetuning models whilst avoiding the computational burden that was required for pretraining. Even if your finetuning dataset $\mathcal{D}_\text{finetune}$ is much smaller than your pretraining set $\mathcal{D}$, you’ve still got the computational problem of the model’s size: how do you efficiently4 do gradient-based optimisation on a model with potentially billions of parameters? I’ll also argue that taking a Bayesian approach can be beneficial, and that whilst the added computational cost of Bayes might not be feasible (or even all that helpful) in the pretraining setting, these costs are much less impactful when finetuning.

Parameter Efficient Finetuning

Perhaps the simplest way to finetune a model on $\mathcal{D}_\text{finetune}$ is to simply carry on training as before — with some gradient-based optimiser like Adam [1] — but on this new dataset (often repeatedly, i.e. for multiple ‘epochs’). This is known as full finetuning (FFT) and usually leads to the best results; however, it’s often infeasible due to the size of the model being finetuned.

Recall that the model we’re working with might have billions of parameters — in order to train these parameters we need to store not only their values, but also their gradients, as well as the activation values of each neuron in the network and, depending on your optimiser, potentially momentum and second-moment information (e.g. Adam keeps an exponential moving average of the gradients and an EMA of the squared gradients — all per parameter). On a model like Llama-7B, whose 7 billion parameters at 8-bit precision require 7GB of storage, these extra costs can easily overwhelm the 16GB capacity of a typical high-end consumer GPU such as an NVIDIA RTX 4080. (Add to that the fact that we usually want to batch our input data — that is, pass multiple input examples through the model at a time — and you can see where things start to spiral out of control.)
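To get a feel for the numbers, here is a rough back-of-the-envelope sketch of the memory needed for full finetuning with Adam. The byte counts are assumptions (16-bit parameters and gradients, two 32-bit Adam moments per parameter), and the function deliberately ignores activations and batching, so treat it as an illustration rather than a precise accounting.

```python
def full_finetune_memory_gb(n_params, param_bytes=2, grad_bytes=2, optimiser_bytes=8):
    """Rough memory (in GB) for parameters, gradients, and Adam's two EMA buffers.

    Assumes 16-bit parameters/gradients and two 32-bit Adam moments per parameter;
    activations and batching are ignored, so the real footprint is larger still.
    """
    total_bytes = n_params * (param_bytes + grad_bytes + optimiser_bytes)
    return total_bytes / 1e9

print(full_finetune_memory_gb(7e9))  # roughly 84 GB for a 7B-parameter model
```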

This motivates the need for finetuning algorithms that have a smaller memory footprint. There’s an exciting field of literature on model compression and quantisation — using compression techniques to represent your model and its gradients with fewer and fewer bits5 — but another approach is to simply reduce the number of parameters that you train during finetuning. However, choosing which parameters to train and which to freeze (thus freeing up space that would’ve gone to storing the gradient information of those parameters) is far from trivial.

Partial Finetuning

In order to discuss finetuning techniques, it’ll be useful to briefly touch on the basic architecture of neural networks. The simplest type of neural network is a multilayer perceptron, or MLP, which consists of $L$ layers in which the output of layer $l-1$, $x^{l-1} \in \mathbb{R}^{d_{l-1}}$, is multiplied by a learnable weight matrix $W^l \in \mathbb{R}^{d_{l} \times d_{l-1}}$ and added to a learnable bias vector $b^l \in \mathbb{R}^{d_{l}}$ before being transformed through a nonlinearity, such as a sigmoid $\sigma(x) = (1+e^{-x})^{-1}$:
$$x^l = \sigma(W^l x^{l-1} + b^l),$$
with $x^0 \in \mathbb{R}^{d_0}$ being input data.

A common strategy for finetuning is to freeze all weights in earlier layers, say, up until the final $\hat{L}$ layers, and only train the set of parameters $\{W^l, b^l \mid l > L-\hat{L}\}$. Assuming constant network width $d = d_0 = \ldots = d_L$ this reduces the number of trainable parameters from $L(d^2 + d)$ to $\hat{L}(d^2 + d)$.

Another simple finetuning strategy is BitFit [3], which works by only training the bias parameters, leading to a total of $Ld$ trainable parameters (though of course this does make the iterative finetuning updates significantly less expressive).
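As a concrete (if toy) illustration, here is a minimal PyTorch sketch of both strategies, using a small stand-in MLP rather than a real pretrained model: freeze everything, then re-enable gradients only for the final layers, or only for the biases (BitFit-style).

```python
import torch.nn as nn

# Stand-in for a pretrained network: 6 blocks of (Linear + nonlinearity).
mlp = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.Sigmoid()) for _ in range(6)])

# Freeze every parameter, so no gradients (or optimiser state) are stored for them...
for p in mlp.parameters():
    p.requires_grad = False

# ...then unfreeze only the final L_hat blocks (final-layers finetuning),
L_hat = 2
for block in list(mlp)[-L_hat:]:
    for p in block.parameters():
        p.requires_grad = True

# ...or instead unfreeze only the bias vectors (BitFit-style finetuning):
# for name, p in mlp.named_parameters():
#     p.requires_grad = name.endswith("bias")

print(sum(p.numel() for p in mlp.parameters() if p.requires_grad), "trainable parameters")
```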

It’s important to note that the final-layers-only approach can also be applied more generally. Most LLM architectures use transformers [4] as their backbone, which — very loosely speaking — consist of multi-headed attention layers (another, more complicated type of neural network) followed by an MLP (plus a whole bunch of other stuff containing yet more parameters), with each transformer block’s output typically going on to form the input of the next. So it’s common to see only the final transformer block finetuned, or even only the final block’s MLP.

Since it would be ill-advised to take a long detour into the definition of multi-headed attention here (as that’d be fairly involved and might take the momentum out of our finetuning discussion), I won’t do that. (Instead, I’ll banish it to yet another (increasingly-obnoxious) footnote6.)

Adapter Tuning

Rather than retraining the weights already in your model, most modern finetuning approaches actually add new parameters to the model, termed ‘adapters’, and only train these instead. For example, [5], [6], and [7] all essentially propose techniques in which we insert two-layer MLPs at different places inside a transformer, with varying results.

Adapter methods have the benefit of being ‘plug-and-play’, in the sense that you can train multiple adapters on different finetuning tasks and then insert them into your model if you detect that it would be helpful for a user’s given request.

Low Rank Adaptation (LoRA)

By far the most common (and almost de facto standard as of 2024) finetuning method is Low Rank Adaptation (LoRA) [8]. The intuition behind LoRA is that the parameters inside your pretrained model are probably fairly close to their finetuned optimal values already, in the sense that those optimal values can probably be reached using only updates in a low-rank subspace. As such we can pose our finetuning problem in terms of finding the low-rank matrix $\Delta W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$ that optimises a given pretrained weight matrix $W_0$, leading to
$$W_\text{finetune} = W_0 + \Delta W,$$
where the low-rank of $\Delta W$ is enforced by parameterising it as $$\begin{aligned}\Delta W & = B A \\ B & \in \mathbb{R}^{d_\text{in} \times r} \\ A & \in \mathbb{R}^{r \times d_\text{out}} \end{aligned}$$so that $\text{rank}(\Delta W) \leq r \ll \text{rank}(W_0) \leq \min (d_\text{in}, d_\text{out})$. (Note that LoRA places the adapter in parallel to a pretrained weight matrix $W_0$, in contrast to the serial/in-between placement of the MLP adapters mentioned in the previous section.)

Figure reproduced from [8], showing a Gaussian-initialisation of $A$ and a zero-initialisation of $B$.
By only learning $A$ and $B$, we reduce the number of parameters from $d_\text{in} d_\text{out}$ to $r(d_\text{in} + d_\text{out})$. In practice, often $d_\text{in}$ and $d_\text{out}$ will be in the range ~$10^3$-$10^4$ whilst we’ll choose $r$ to be somewhere between 4 and 128. To cut down on the number of trainable parameters even further, we often only apply LoRA adapters to certain weight matrices in a model, for example only the query and value matrices ($W^Q$ and $W^V$) of attention layers.
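Below is a minimal sketch of what a LoRA adapter around a single frozen linear layer might look like in PyTorch. The class name, initialisation scale and the $\alpha/r$ scaling factor are illustrative choices rather than the reference implementation, but the structure follows the parameterisation above: a frozen $W_0$ plus a trainable low-rank update $BA$.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer W_0 plus a trainable low-rank update BA."""
    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze W_0 (and its bias)
        d_out, d_in = pretrained.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian initialisation
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init, so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # W_finetune x = W_0 x + (B A) x, computed without ever forming the full update matrix.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
out = layer(torch.randn(4, 1024))                            # only A and B receive gradients
```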

LoRA’s success has led to a large number of variants, such as AdaLoRA [9], which adaptively decides which weight matrices to apply LoRA to based on their singular values. Other methods include PiSSA (Principal Singular Values and Singular Vectors Adaptation) [10], which performs LoRA updates only on the first few principal components of each weight matrix and freezes the ‘residuals’ which come from later principal components. One recent paper presents GaLore (Gradient Low-Rank Projection) [11], which performs PCA on the weight matrix every few iterations and performs low-rank updates by optimising only in the (low-rank) space spanned by the first few principal components.

Bayesian Finetuning

Although work has been done to introduce uncertainty estimation into pretraining, the results often aren’t worth the extra computational costs [12, 13]. Not only are the model sizes too large to make uncertainty quantification feasible, but the fact that your pretraining dataset, $\mathcal{D}$, is gigantic provides little uncertainty to reason about. However, in the context of finetuning we typically have a much smaller dataset, for which we’ll likely have much more uncertainty, and we also tend to work with far fewer parameters, allowing for extra computational budget to go towards the use of Bayesian methods.

Consider splitting up our finetuning set into prompt and target/response pairs $(X,y) \in \mathcal{D}_\text{finetune}$ where $X \in \mathcal{T}^{B \times n}$ is a matrix of $B$ sequences each of maximum length $n$ (potentially padded out with null-tokens) constructed with the token set $\mathcal{T}$, and $y \in \mathcal{Y}^B$ could be a corresponding batch of single tokens (in which case $\mathcal{Y} = \mathcal{T}$), or a batch of classification labels (e.g. in sentiment analysis, or multiple-choice Q&A, in which case $\mathcal{Y}$ might be different to $\mathcal{T}$).

What we fundamentally want to learn is a posterior distribution over all learnable parameters $$p(\theta | \mathcal{D}_\text{finetune}) = p(\theta | X, y),$$where, for example, in the case of LoRA finetuning, $\theta$ is the collection of all adapter weights $A$ and $B$. This not only gives us information about the uncertainty in the model’s parameters, which can be useful in itself, but can also be used to give us the posterior predictive distribution for a test input $x^* \in \mathcal{T}^n$, $$p(y^* | x^*, \mathcal{D}_\text{finetune}) = \int p(y^* | x^*, \theta)p(\theta|\mathcal{D}_\text{finetune})d\theta.$$
This is often more desirable than a predictive distribution that only uses a point estimate of $\theta$ and which would then ignore the uncertainty present in the model’s parameters.

Bayesian LoRA (via Laplace Approximation and KFAC)

Yang et al. [14] suggest a method for finding the posterior $$p(\theta | X, y) \propto p(y | X, \theta)p(\theta)$$post-hoc, i.e. after regular finetuning (with LoRA) using a Laplace approximation — which assumes the posterior is a Gaussian centered at the maximum a-posteriori (MAP) solution, $\theta_\text{MAP}$.

First, we note that the MAP solution can be written as the maximum of the log-joint $\mathcal{L}(y, X; \theta)$, $$\begin{align} \mathcal{L}(y, X; \theta) &= \log p(y | X, \theta) +\log p(\theta) = \log p(\theta | X, y) + \text{const} \\ \theta_\text{MAP} &= \arg\max_\theta \mathcal{L}(y, X; \theta). \end{align}$$
Then, assuming that finetuning successfully optimised $\theta$, i.e. reached the parameter values $\theta_\text{MAP}$, the Laplace approximation involves taking the second-order Taylor expansion of the log-joint around $\theta_\text{MAP}$, $$\mathcal{L}(y, X; \theta) \approx \mathcal{L}(y, X; \theta_\text{MAP}) - \frac{1}{2}(\theta - \theta_\text{MAP})^T(\nabla_\theta^2 \mathcal{L}(y, X; \theta)|_{\theta_\text{MAP}})(\theta - \theta_\text{MAP}).$$(The expansion’s first-order term disappears because the gradient of the MAP objective at $\theta_\text{MAP}$ is zero.) This quadratic form can then be written as a Gaussian density, with mean $\theta_\text{MAP}$ and covariance given by the inverse of the negative log-joint Hessian: $$\begin{align}p(\theta | X, y) &\approx \mathcal{N}(\theta ; \theta_\text{MAP}, \Sigma), \\
\Sigma &= -(\nabla_\theta^2 \mathcal{L}(y, X; \theta)|_{\theta_\text{MAP}})^{-1}.\end{align}$$
The authors make use of various tricks to render computing this Hessian inverse feasible, most notably Kronecker-Factored Approximate Curvature (KFAC) [15].

Using the Laplace approximation comes with added benefits. Specifically, we can make use of the Gaussian form of the (approximate) posterior to easily compute two values of interest: samples from the posterior predictive distribution, and estimates of the marginal likelihood.

For the first of these, we can linearise our model, with output $f_\theta(x^*)$ approximated by a first-order Taylor expansion around $\theta_\text{MAP}$, $$f_\theta(x^*) \approx f_{\theta_\text{MAP}}(x^*) + \nabla_\theta f_\theta(x^*)|^T_{\theta_\text{MAP}}(\theta - \theta_\text{MAP}).$$
We can write this as a Gaussian density $$f_\theta(x^*) \sim \mathcal{N}(f_{\theta_\text{MAP}}(x^*), \Lambda)$$ where $$\Lambda = (\nabla_\theta f_\theta(x^*)|^T_{\theta_\text{MAP}})\Sigma(\nabla_\theta f_\theta(x^*)|_{\theta_\text{MAP}}).$$
With this, we can easily obtain samples from our predictive posterior through reparameterised sampling of some Gaussian noise $\mathbf{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and a Cholesky decomposition $\Lambda = LL^T$: $$\hat{y} = f_\theta(x^*) = f_{\theta_\text{MAP}}(x^*) + L\mathbf{\xi}.$$
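In code, that sampling step is just a Cholesky factorisation and a matrix multiply. The sketch below assumes we already have the MAP output `f_map`, the Jacobian `J` of the output with respect to the (adapter) parameters at $\theta_\text{MAP}$, and the Laplace covariance `Sigma`; the small jitter term is a standard numerical safeguard rather than part of the method.

```python
import numpy as np

def sample_linearised_predictive(f_map, J, Sigma, n_samples=100, jitter=1e-6):
    """Draw samples y_hat = f_map + L xi from the linearised predictive.

    f_map: model output at theta_MAP, shape [k]
    J:     Jacobian of the output w.r.t. the parameters at theta_MAP, shape [k, p]
    Sigma: Laplace posterior covariance over the parameters, shape [p, p]
    """
    Lambda = J @ Sigma @ J.T                               # predictive covariance
    L = np.linalg.cholesky(Lambda + jitter * np.eye(len(f_map)))
    xi = np.random.randn(n_samples, len(f_map))            # standard Gaussian noise
    return f_map + xi @ L.T
```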

The second value of interest is the marginal likelihood (also known as the model evidence), which is useful for hyperparameter optimisation and can be computed simply as follows $$\begin{align}p(y|X) &= \int p(y|X,\theta)p(\theta)d\theta \\ &\approx \exp (\mathcal{L}(y, X; \theta_\text{MAP}))(2\pi)^{D/2}\det(\Sigma)^{1/2}.\end{align}$$

Using Stein Variational Gradient Descent (SVGD)

A reasonable question to ask is whether it might be feasible to learn the posterior distribution during finetuning, rather than afterwards. One such method for achieving this is Stein variational gradient descent (SVGD) [16], in which a collection of $n$ parameter particles $\{\theta_i^{(0)}\}_{i=1}^n$ is iteratively updated to fit the true posterior using some similarity function (i.e. a kernel) $k: \Theta \times \Theta \to \mathbb{R}$, $$\begin{align}\theta_i^{(t+1)} &= \theta_i^{(t)} + \epsilon_t \phi(\theta_i^{(t)}) \\ \phi(\theta_i) &= \frac{1}{n} \sum_{j=1}^n \left[\frac{1}{T}k(\theta_j,\theta_i)\nabla_{\theta_j}\log p(\theta_j | \mathcal{D}_\text{finetune}) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right],\end{align}$$where $\epsilon_t$ is a learning rate and $T$ is a temperature hyperparameter. The basic interpretation of the update is that the first term inside the summation drives particles towards areas of high posterior probability, whilst the second term penalises particles that are too similar to one another, acting as a repulsive force that encourages exploration of the parameter space.

Once the particles have converged, we can simply approximate the posterior predictive as the average output of the network across each parameter particle $\theta_i$, $$p(y^* | x^*, \mathcal{D}_\text{finetune}) \approx \frac{1}{n} \sum_{i=1}^n f_{\theta_i}(x^*).$$
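To make the update concrete, here is a small NumPy sketch of one SVGD step with an RBF kernel. The kernel bandwidth and step size are placeholder values, and `grad_log_post` is assumed to be supplied by whatever model is being finetuned (e.g. backpropagated through the LoRA adapters).

```python
import numpy as np

def rbf_kernel(thetas, h=1.0):
    """Kernel matrix k(theta_j, theta_i) and its gradients w.r.t. theta_j."""
    diffs = thetas[:, None, :] - thetas[None, :, :]        # [n, n, p], diffs[j, i] = theta_j - theta_i
    K = np.exp(-(diffs ** 2).sum(-1) / (2 * h ** 2))       # [n, n]
    grad_K = -diffs / h ** 2 * K[..., None]                # [n, n, p], d k(theta_j, theta_i) / d theta_j
    return K, grad_K

def svgd_step(thetas, grad_log_post, eps=1e-2, T=1.0):
    """One SVGD update for particles `thetas` [n, p], given grad log p(theta_j | D) [n, p]."""
    n = thetas.shape[0]
    K, grad_K = rbf_kernel(thetas)
    # First term pulls particles towards high posterior density, second term repels them.
    phi = (K @ grad_log_post / T + grad_K.sum(axis=0)) / n
    return thetas + eps * phi
```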

My current research lies in applying SVGD to LoRA adapters. The hopes are that we can learn a richer, multi-modal posterior distribution using SVGD’s particles without making the Gaussian posterior assumption of the Laplace approximation. Recent concurrent work [17] applies a very similar technique to computer-vision tasks and achieves promising results.

Conclusion

I hope this blog has been a useful introduction to the finetuning of LLMs. Feel free to get in touch if you’re interested! My email is sam.bowyer@bristol.ac.uk.

Footnotes

1: LLMs split input text up into a sequence of tokens. Roughly speaking, most words are split into one or two tokens depending on how common and how long they are. Using GPT-4’s tokenizer, this sentence is made from 17 tokens. (back to top)
2: Spare a moment, if you will, for the Meta engineers behind the OPT-175B (175 billion-parameter) model, whose training logbook reads at times like that of a doomed ship at sea. (back to top)

3: Note that in the case of LLMs specifically, the straight-out-of-pretraining model will also likely be a poor virtual assistant, in the way we tend to desire of chatbots like ChatGPT. A model which can complete sentences to match the general patterns found in $\mathcal{D}$ won’t necessarily be much good at the user-agent back-and-forth conversation style we’d like, and as such might not have properly ‘learnt’ how to, for example, follow instructions and answer questions. It’s because of this that most public-facing LLMs go through what’s known as instruction fine-tuning, in which the model is finetuned on a large dataset of instruction-following chat logs before being deployed. (back to top)
4: That is, without using 25,000 GPUs… (back to top)
5: Consider this paper [2] by Huang et al. which boasts 1.08-bit quantisation (a) of 16-bit models, all whilst retaining impressive levels of performance. (back to top)

a : i.e. representing parameters with an average precision of 1.08 bits.

6:

An (Ill-Advised) Aside: Attention

Attention layers work by taking three matrices as input, $Q_\text{input}, K_\text{input}, V_\text{input} \in \mathbb{R}^{n \times d_\text{model}}$, typically representing $d_\text{model}$-dimensional embeddings of a sequence of $n$ tokens. First we project these matrices using learnable weight matrices $W^Q, W^K \in \mathbb{R}^{d_\text{model} \times d_k}$, and $W^V \in \mathbb{R}^{d_\text{model} \times d_v}$ to obtain our queries, keys and values: $$\begin{align}
Q &= Q_\text{input} W^Q \in \mathbb{R}^{n \times d_k} \\
K &= K_\text{input} W^K \in \mathbb{R}^{n \times d_k} \\
V &= V_\text{input} W^V \in \mathbb{R}^{n \times d_v}.
\end{align}$$With these, we then compute attention as $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where $\text{softmax}$ is applied over each row such that, denoting the $i$th row of the matrix $A = QK^T$ as $A^{(i)}$ and that row’s $j$th element as $a^{(i)}_j$, we define: $$\text{softmax}(A)^{(i)} = \frac{\exp A^{(i)}}{\sum_{j=1}^n \exp a^{(i)}_j}.$$
The intuition behind this is that our $n \times n$ attention matrix $\text{softmax}(QK^T/\sqrt{d_k})$ has entries representing how much token $i$ relates (or ‘attends’) to token $j$. The $\text{softmax}$ normalises each row so that the entries all add up to one, allowing us to think of each row as a distribution over tokens. The final multiplication with $V$ might then be thought of as selecting (or weighting) tokens in $V$ according to those distributions.
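For the single-head case, the whole mechanism fits in a few lines of NumPy; the dimensions below are arbitrary illustrative choices.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for Q, K of shape [n, d_k] and V of shape [n, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # [n, n]: how much token i attends to token j
    return softmax(scores, axis=-1) @ V            # each output row is a weighted mix of V's rows

n, d_k, d_v = 5, 8, 8
out = attention(np.random.randn(n, d_k), np.random.randn(n, d_k), np.random.randn(n, d_v))
```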

One important limitation of the attention mechanism we’ve just described is that it only allows us to consider how each token attends to each other token in some universal way, whereas in reality there are multiple ways that words in a sentence (for example) can relate to each other. Because of this, most of the time we actually use multi-headed attention, in which we compute attention between the token sequences $H \in \mathbb{N}$ times, each time with different learnable weight matrices $W^Q_h, W^K_h, W^V_h$ for $h \in \{1,\ldots,H\}$. Then we combine these separate attention heads, using yet another learnable weight matrix $W^O \in \mathbb{R}^{H d_v \times d_\text{model}}$, $$\text{MultiHead}(Q_\text{input}, K_\text{input}, V_\text{input}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_H)W^O \in \mathbb{R}^{n \times d_\text{model}},$$ where $\text{head}_h = \text{Attention}(Q_\text{input}W^Q_h, K_\text{input}W^K_h, V_\text{input}W^V_h)$. Allowing the model to learn different types of attention on different heads makes MHA an incredibly powerful and expressive part of a neural network.

To summarise and return to the discussion of finetuning: MHA layers contain a ton of learnable parameters (specifically, $2H d_\text{model} (d_k + d_v)$ of them). (back to top)

 

References

[1] Kingma, D.P., 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[2] Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M. and Qi, X., 2024. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291.
[3] Zaken, E.B., Ravfogel, S. and Goldberg, Y., 2021. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199.
[4] Vaswani, A., 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
[5] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., 2019, May. Parameter-efficient transfer learning for NLP. In International conference on machine learning (pp. 2790-2799). PMLR.
[6] Lin, Z., Madotto, A. and Fung, P., 2020. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829.
[7] Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K. and Gurevych, I., 2021. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. EACL 2021.
[8] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[9] Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W. and Zhao, T., 2023. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512.
[10] Meng, F., Wang, Z. and Zhang, M., 2024. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948.
[11] Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A. and Tian, Y., 2024. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507.
[12] Cinquin, T., Immer, A., Horn, M. and Fortuin, V., 2021. Pathologies in priors and inference for Bayesian transformers. arXiv preprint arXiv:2110.04020.
[13] Chen, W. and Li, Y., 2023. Calibrating transformers via sparse gaussian processes. arXiv preprint arXiv:2303.02444.
[14] Yang, A.X., Robeyns, M., Wang, X. and Aitchison, L., 2023. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv:2308.13111.
[15] Martens, J. and Grosse, R., 2015, June. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning (pp. 2408-2417). PMLR.
[16] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose bayesian inference algorithm. Advances in neural information processing systems, 29.
[17] Doan, B.G., Shamsi, A., Guo, X.Y., Mohammadi, A., Alinejad-Rokny, H., Sejdinovic, D., Ranasinghe, D.C. and Abbasnejad, E., 2024. Bayesian Low-Rank LeArning (Bella): A Practical Approach to Bayesian Neural Networks. arXiv preprint arXiv:2407.20891.

Student Perspectives: How can we spot anomalies in networks?

A post by Rachel Wood, PhD student on the Compass programme.

Introduction

As our online lives expand, more data than we can reasonably consider at once is collected. Much of it is sparse and noisy, requiring methods which can recover the information encoded in its structure. Networks are one example of this kind of dataset. In this blog post, I explain how we can identify changes between networks observing the same subjects (e.g. snapshots of the same graph over time).

Problem Set-Up

We consider two undirected graphs, represented by their adjacency matrices $\mathbf{A}^{(1)}, \mathbf{A}^{(2)} \in \{0,1\}^{n \times n}$. As we can see below, there are two clusters (pink nodes form one, the yellow and blue nodes form another) in the first graph but in the second graph the blue nodes change behaviour to become a distinct third cluster.

A graph at two time points, where the first time point shows two clusters and the second time point shows the blue nodes forming a new distinct cluster

Our question becomes, how can we detect this change without prior knowledge of the labels?

We could simply look at the adjacency matrices, but these are often sparse, noisy and computationally expensive to work with. Using dimensionality reduction, we can “denoise” the matrices to obtain a $d$-dimensional latent representation of each node, which provides a natural measure of node behaviour and a simple space in which to measure change.

Graph Embeddings

There is an extensive body of research investigating graph embeddings, however here we will focus on spectral methods.
Specifically we will compare the approaches of Unfolded Adjacency Spectral Embedding (UASE) presented in [1] and CLARITY presented in [2]. Both of these are explained in more detail below.

UASE

UASE takes as input the unfolded adjacency matrix $\mathbf{A} = \left[ \mathbf{A}^{(1)}\big| \mathbf{A}^{(2)}\right] \in \{0,1\}^{n \times 2n}$ and performs a $d$-dimensional truncated SVD [3] to obtain a $d$-dimensional static and a $d$-dimensional dynamic representation:

Illustration of UASE where the green blocks show the rows and columns we keep, and the red blocks represent the rows and columns we discard.

Mathematically we can write this as:
\begin{equation*}
\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T = \mathbf{U}_{\mathbf{A}} \boldsymbol{\Sigma}_{\mathbf{A}} \mathbf{V}_{\mathbf{A}}^T + \mathbf{U}_{\perp} \boldsymbol{\Sigma}_{\perp} \mathbf{V}_{\perp}^T \approx \mathbf{U}_{\mathbf{A}} \boldsymbol{\Sigma}_{\mathbf{A}} \mathbf{V}_{\mathbf{A}}^T = \mathbf{X} \mathbf{Y}^T
\end{equation*}

where $\mathbf{U}_{\mathbf{A}}, \mathbf{V}_{\mathbf{A}}$ are the first $d$ columns of $\mathbf{U}$ and $\mathbf{V}$ respectively and $\boldsymbol{\Sigma}_{\mathbf{A}}$ is the diagonal matrix which forms the $d \times d$ upper left block of $\boldsymbol{\Sigma}$. This gives a static embedding $\mathbf{X} \in \mathbb{R}^{n \times d}$ and a time evolving embedding $\mathbf{Y} \in \mathbb{R}^{2n \times d}$.
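A minimal NumPy/SciPy sketch of this embedding might look as follows; here the singular values are split symmetrically between the two embeddings ($\mathbf{X} = \mathbf{U}_{\mathbf{A}}\boldsymbol{\Sigma}_{\mathbf{A}}^{1/2}$, $\mathbf{Y} = \mathbf{V}_{\mathbf{A}}\boldsymbol{\Sigma}_{\mathbf{A}}^{1/2}$), which is one common choice consistent with $\mathbf{A} \approx \mathbf{X}\mathbf{Y}^T$.

```python
import numpy as np
from scipy.sparse.linalg import svds

def uase(adjacency_matrices, d):
    """Unfolded adjacency spectral embedding for a list of n x n adjacency matrices.

    Returns a static embedding X of shape [n, d] and a dynamic embedding Y of shape [Tn, d],
    where Y stacks one n x d block per time point.
    """
    A = np.hstack(adjacency_matrices).astype(float)   # unfolded adjacency, n x Tn
    U, s, Vt = svds(A, k=d)                           # d-dimensional truncated SVD
    order = np.argsort(s)[::-1]                       # svds returns singular values in ascending order
    U, s, Vt = U[:, order], s[order], Vt[order]
    X = U * np.sqrt(s)                                # static embedding
    Y = Vt.T * np.sqrt(s)                             # time-evolving embedding
    return X, Y
```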

The general approach in the UASE literature is to measure change by comparing latent positions, an approach backed by [4], which gives a theoretical demonstration of longitudinal and cross-sectional stability for UASE: if observations $i$ at time $s$ and $j$ at time $t$ behave similarly, their latent positions should be approximately the same, $\hat Y_i^{(s)} \approx \hat Y_j^{(t)}$.

Going back to our example graphs, we apply UASE to the unfolded adjacency matrix and visualise the first two dimensions of the embedding for each of the graphs:

First two dimensions of the latent positions of observations in each graph

As we can see above, the pink nodes have retained their positions, the yellow nodes have moved a little and the blue nodes have moved the most.

CLARITY

Clarity takes a different approach, estimating $\mathbf{A}^{(2)}$ from $\mathbf{A}^{(1)}$. An illustration of how this is done is shown below:

Illustration of Clarity, which represents $\mathbf{A}^{(1)}$ in low dimensions and asks whether this can provide a good representation of $\mathbf{A}^{(2)}$ as well. It keeps the same “structure matrix” ($\mathbf{U}^{(1)}$) by allowing a new non-diagonal “relationship” matrix ($\boldsymbol{\Sigma}^{(2)}$)

Again we provide a mathematical explanation of the method. First we perform a $d$-dimensional truncated eigendecomposition on $\mathbf{A}^{(1)}$:

\begin{equation*}
\mathbf{A}^{(1)} = \mathbf{U}^{(1)} \boldsymbol{\Sigma}^{(1)} \mathbf{U}^{(1)T} + \mathbf{U}_{\perp} \boldsymbol{\Sigma}_{\perp} \mathbf{U}_{\perp}^T \approx \mathbf{U}^{(1)} \boldsymbol{\Sigma}^{(1)} \mathbf{U}^{(1)T} = \hat{\mathbf{A}}^{(1)}
\end{equation*}
where $\mathbf{U}^{(1)} \in \mathbb{R}^{n \times d}$ is the matrix of the first $d$ eigenvectors and $\boldsymbol{\Sigma}^{(1)} \in \mathbb{R}^{d \times d}$ is a diagonal matrix of the first $d$ eigenvalues.

Then we estimate $\mathbf{A}^{(2)}$ as
\begin{equation*}
\hat{\mathbf{A}}^{(2)} = \mathbf{U}^{(1)} \boldsymbol{\Sigma }^{(2)} \mathbf{U}^{(1)T} \hspace{1cm} \text{where} \hspace{1cm} \boldsymbol{\Sigma}^{(2)} = \mathbf{U}^{(1)T} \mathbf{A}^{(2)} \mathbf{U}^{(1)}
\end{equation*}

As opposed to UASE, Clarity measures change between $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ using quantities called persistences, which are defined as
\begin{equation*}
\mathbf{P}_i = \sum_{j =1}^{n}\left( \mathbf{A}_{ij}^{(2)} -\hat{\mathbf{A}}_{ij}^{(2)} \right)
\end{equation*}

The intuition here is that the persistences will capture structure in $\mathbf{A}^{(2)}$ that is not present in or explained by $\mathbf{A}^{(1)}$.
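A short NumPy sketch of this procedure, under the assumption that we keep the $d$ leading eigenpairs (by magnitude) of $\mathbf{A}^{(1)}$:

```python
import numpy as np

def clarity_persistences(A1, A2, d):
    """Represent A1 with a rank-d eigendecomposition, re-fit the relationship matrix on A2,
    and score each node by how poorly the A1 structure explains A2."""
    evals, evecs = np.linalg.eigh(A1.astype(float))
    keep = np.argsort(np.abs(evals))[::-1][:d]        # d leading eigenpairs of A1
    U1 = evecs[:, keep]
    Sigma2 = U1.T @ A2 @ U1                           # new, non-diagonal 'relationship' matrix
    A2_hat = U1 @ Sigma2 @ U1.T                       # Clarity estimate of A2
    return (A2 - A2_hat).sum(axis=1)                  # persistences P_i
```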

Returning to our example problem, we can see heatmaps of $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ alongside their Clarity estimates:

$\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ and their Clarity estimates

Looking at the figure above we can see that the Clarity estimate of ${\mathbf{A}^{(2)}}$ does not capture the third cluster that appears in the second graph and therefore should identify these nodes as anomalies.

Comparison

We can use receiver operating characteristic (ROC) curves to assess the success of our two methods. Given a score (in our case either the distance between latent positions or the persistences), they plot the false positive rate against the true positive rate over a sequence of thresholds. The ROCs are shown below for $d = 2,3,4,5,6$:

ROCs for Clarity (blue) and UASE (orange) for (left to right) $d = 2,3,4,5,6$

We can see that in lower dimensions UASE outperforms Clarity, but its performance degrades as the dimension increases. This is a common problem in real-world applications, where the best choice of $d$ is unknown. Clarity, on the other hand, does not have the same power as UASE but is more robust to the choice of dimension. Another difference between the two methods is that, by allowing the relationship matrix to change, Clarity is designed to cope with the entire graph changing a little bit.

Conclusion

We have now introduced two methods for identifying change and compared their performance in a simple example. One method produces stronger results overall but is much more sensitive to the choice of dimension than the other. My current research looks to investigate why Clarity succeeds in this area when many other methods fail, with the ultimate goal of using this knowledge to modify more powerful methods to also have this feature.

 

[1] Jones, A., & Rubin-Delanchy, P. (2020). The multilayer random dot product graph. arXiv preprint arXiv:2007.10455.

[2] Lawson, D. J., Solanki, V., Yanovich, I., Dellert, J., Ruck, D., & Endicott, P. (2021). CLARITY: comparing heterogeneous data using dissimilarity. Royal Society Open Science, 8(12), 202182.

[3] Wikipedia contributors. (2024, June 11). Singular value decomposition. In Wikipedia, The Free Encyclopedia. Retrieved 09:54, July 1, 2024, from https://en.wikipedia.org/w/index.php?title=Singular_value_decomposition&oldid=1228566091

[4] Gallagher, I., Jones, A., & Rubin-Delanchy, P. (2021). Spectral embedding for dynamic networks with stability guarantees. Advances in Neural Information Processing Systems, 34, 10158-10170.

Student Perspectives: Strategies for variational inference in non-conjugate problems

A post by Qi Chen, PhD student on the Compass programme.


Introduction

Variational inference is a method to approximate posterior distributions. In the Bayesian statistics context, we would like to get access to the posterior distribution \[p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int_{\Theta} p(x|\theta)p(\theta) d\theta}\]
In most cases the denominator $p(x)$ is intractable, that is, we cannot compute it analytically. How should we proceed? There are two broad ways:

  • Using MCMC to simulate samples from the posterior distribution $p(\theta|x)$ to approximate the true posterior and obtain statistics of interest (mean, variance, etc.).
  • Approximating $p({\theta}|{x})\approx q(\theta)\in\mathcal{Q}$.

The former method is unbiased and its convergence is guaranteed by the law of large numbers, but it requires a large number of samples and is quite computationally demanding if the dimension of the parameters/dataset is large. The latter, called variational inference, is biased (with the bias depending on the choice of $\mathcal{Q}$) but is much faster and more scalable.

We call $q$ the variational distribution. The idea behind variational inference is to approximate the posterior $p({\theta}|{x})$ using $q({\theta})\in\mathcal{Q}$ by minimizing the KL divergence between $q({\theta})$ and the true posterior $p({\theta}|{x})$, with the following formal expression:\[q^*({\theta}) = \arg\min_{q\in\mathcal{Q}}\;KL(q({\theta})||p({\theta}|{x})), \quad KL(q||p) = \int_{\Theta} q({\theta})\log\left(\frac{q({\theta})}{p({\theta}|{x})}\right)d{\theta}\]
This is a traditional measure of distribution mismatch over the same domain, and it is easy to see that $q = p$ is equivalent to $KL(q||p)=0$.

There are broadly two questions we would like to answer:

  • How do we minimize the KL divergence over $\mathcal{Q}$ when the true posterior is unknown?
  • How do we choose the variational family $\mathcal{Q}$?

We now answer the first question:
Notice that \begin{align*}
\log p({x}) &= \int q({\theta})\log\left(\frac{p({x},\theta)q({\theta})}{p({\theta}|{x})q({\theta})}\right) d{\theta}\\
&= \int q({\theta}) \log\left(\frac{p({x},{\theta})}{q({\theta})}\right) d{\theta} + \int q({\theta})\log\left(\frac{q({\theta})}{p({\theta}|{x})}\right)d{\theta}
\end{align*}
From the above derivation, we see that the second part is simply the KL divergence we wish to minimize. As $\log p({x})$ is fixed, minimizing the KL divergence is equivalent to maximizing the first part. This answers the first question. The first part is called the evidence lower bound (ELBO), written as $\mathcal{L}(q({\theta}))$.

For the second question: in theory, supposing the variational distribution is parametrized by variational parameters ${\phi}$, we can start with any variational distribution we like, following the basic criterion:
Supp($q({\theta};{\phi})$) $\subseteq$ Supp($p({\theta}|{x})$).
We also need Supp($p({\theta}|{x})$) $\subseteq$ Supp($p({\theta})$), which is guaranteed in most cases.

But randomly choosing some variational distribution for an arbitrary model won’t always make the algorithm feasible. Indeed, all VI methods are centered around the goal of optimizing the ELBO \[\phi^* = \arg\max_{{\phi}}\mathbb{E}_q\left[\log \frac{p({x},{\theta})}{q({\theta})}\right]\]

Traditional methods make the mean-field assumption that all parameters are independent. This breaks down the objective, and a local optimum can be reached via a coordinate ascent algorithm. Some methods enlarge the mean-field space to allow specific conditional dependencies between parameters, according to graphical models with conjugate exponential-family relationships between parent-child pairs [1]. This has been further extended to non-conjugate pairs with custom approximations.

Some modern methods have been developed in the last decade based on the idea that the gradient of the ELBO can be expressed in the form $\mathbb{E}_q(\cdot)$. This immediately suggests a combination of MC algorithms (for sampling from $q$) and stochastic gradient descent (for efficiency in the optimization). These methods benefit from the simplicity that there’s no need to analytically compute the gradients based on conditional dependence specifications for each model: it is an automatic algorithm, applicable to a greater domain of models. But it is worth noting that even though these methods are theoretically sound, they still face practical issues, which I will show in the later sections.

In this post I will briefly go through some of these methods, specifically coordinate ascent variational inference, black-box variational inference and automatic differentiation variational inference.

Conjugate models: Coordinate Ascent Variational Inference

There are various assumptions we can make on $\mathcal{Q}$. We start with the mean-field assumption on the parameters [2]. This assumes that the joint variational distribution over all parameters factorizes completely. That is:\[q({\theta}) = \prod_{j=1}^m q_j({\theta}_j)\]
We now write ${\theta}_{-j}$ to denote all the other latent variables except ${\theta}_j$, with distribution $q_{-j}$.
If we maximize $\mathcal{L}(q)$ with respect to $q_j({\theta}_j)$ only, we are maximizing \[\mathbb{E}_{q_j}[\mathbb{E}_{q_{-j}}[\log p({\theta},{x})]] - \mathbb{E}_{q_j}[\log q_j({\theta}_j)]\]

Further, write $r_j({\theta}_j) = \frac{1}{Z_j}\exp\{\mathbb{E}_{q_{-j}}[\log p({x},{\theta})]\}$ where $Z_j$ is a normalizing constant so that $r_j$ is a probability distribution. Substituting in, we get \[\mathcal{L}(q_j) = \mathbb{E}_{q_j}\left[\log \frac{r_j({\theta}_j)}{q_j({\theta}_j)}\right] + \text{const} = -KL(q_j({\theta}_j)||r_j({\theta}_j)) + \text{const}\]

Thus maximizing the ELBO with respect to $q_j$ is equivalent to setting $q_j = r_j$, which is \[q_j({\theta}_j)\propto \exp\{\mathbb{E}_{q_{-j}}[\log p({x},{\theta})]\}\propto \exp\{\mathbb{E}_{q_{-j}}[\log p({\theta}_j|{\theta}_{-j},{x})]\}\]

Since we assume $q$ factorizes, maximizing $\mathcal{L}(q)$ is split into $m$ steps of maximizing $\mathcal{L}(q_j)$. This algorithm is called coordinate ascent variational inference (CAVI) or block-coordinate ascent.

An algorithmic view is:

  1. Initialize $q({\theta}) = \prod_{j=1}^m q_j({\theta}_j)$
  2. Iterate until convergence:
    Update each $q_j$ by $q_j = \frac{1}{Z_j}\exp(\mathbb{E}_{q_{-j}}[\log(p({\theta},{x}))])$

This algorithm is guaranteed to converge, since the ELBO increases at each iteration.

This is not directly feasible in all cases, since we assume we can compute $r_j$ analytically. In cases where there is conditional conjugacy of the likelihood and the prior on each $\theta_j$ conditioned on all the other ${\theta}_{i\neq j}$, that is \[p(\theta_j|{\theta}_{i\neq j})\in \mathcal{A}(\alpha),\,p({x}|\theta_j, \theta_{i\neq j})\in \mathcal{B}(\theta_j)\rightarrow p(\theta_j|{x},{\theta}_{i\neq j})\in \mathcal{A}(\alpha'),\]
it will be feasible. One particularly convenient case is when all complete conditionals lie in an exponential family.

A distribution $p({\theta})$ is in an exponential family if \[p(\theta) = h({\theta})\exp\{{\eta}^Tt({\theta}) - A({\eta})\}\]
Here $\eta$ is called the natural parameter, and $A({\eta})$ satisfies \[A({\eta}) = \log \int h({\theta}) \exp\{\eta^Tt(\theta)\}d{\theta}\]
such that the density integrates to 1.

Now assume that all the complete conditionals belong to an exponential family distribution, that is \[p({\theta}_j|{\theta}_{-j},{x}) = h({\theta}_j)\exp \{{\eta}_j^T({\theta}_{-j},{x}){\theta}_j – A({\eta}_j({\theta}_{-j},{x}))\}\]
where we assume that ${\theta}_j$ is already transformed to its appropriate sufficient statistic. We see now the CAVI becomes \begin{align*}
q_j({\theta}_j)&\propto \exp\{\log h({\theta}_j) + \mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]^T{\theta}_j - \mathbb{E}_{q_{-j}}[A({\eta}_j({\theta}_{-j},{x}))]\}\\
&\propto h({\theta}_j)\exp\{\mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]^T{\theta}_j\}
\end{align*}
where we see that the variational factors are in the same exponential family (due to conjugacy) as the complete conditionals, with the natural parameter updated to \[\phi_j = \mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]\]

But in most cases, for example Bayesian logistic regression, we do not have conditional conjugacy in our model. In this blog post, we introduce two methods developed in the last decade that tackle the lack of conjugacy. Notice that variational inference is indeed an optimization problem, and these methods are derived by expressing the derivatives of the ELBO in terms of expectations over the variational distribution $q$: \[\frac{\partial ELBO}{\partial {\phi}} = \mathbb{E}_{q({\theta};{\phi})}[\cdot]\]
Evaluable Models: Black Box Variational Inference

We want to optimize \[\mathcal{L}({\phi}) = \mathbb{E}_{q}[\log p({\theta},{x})] - \mathbb{E}_q[\log q({\theta};{\phi})]\]
and we notice that
\begin{align*}
\triangledown_{{\phi}}\mathcal{L}({\phi}) &= \triangledown_{{\phi}}\int q({\theta};{\phi})\log \frac{p({\theta},{x})}{q(\theta;{\phi})} d{\theta}\\
&= \int q({\theta};{\phi})\triangledown_{{\phi}} \log q({\theta};{\phi})\log \frac{p({\theta},{x})}{q({\theta};{\phi})} + q({\theta};{\phi})\triangledown_{{\phi}}\log \frac{p({\theta},{x})}{q({\theta};{\phi})} d{\theta}\\
&= \mathbb{E}_{q}[\triangledown_{{\phi}}\log q({\theta};{\phi})(\log p({\theta},{x})-\log q({\theta};{\phi}))]
\end{align*}
This is proposed in [3]. We see this is an expectation under the variational distribution, and to compute it we only need to:

  • simulate from $q$,
  • evaluate the derivatives of $\log q({\theta};{\phi})$,
  • evaluate the model $p({\theta},{x})$.

This significantly relaxes the constraints of CAVI and enlarges the domain of applicable models.
In practice, we use a Monte Carlo estimate of this expectation as a noisy unbiased estimator of the gradient and perform stochastic gradient ascent, with a step-size sequence satisfying the Robbins-Monro conditions, for example \[\sum_j \rho_j =\infty\;\;\;\;\sum_j \rho_j^2 < \infty\]
A naive algorithm is as follows:

  • Initialize $t \gets 0$, $\delta \gets \infty$
  • While $\delta > \tau$:
    • $t \gets t+1$
    • Sample ${\theta}^1,…,{\theta}^S\sim q({\theta};{\phi}_{t-1})$
    • $\hat{\triangledown}_{{\phi}}\mathcal{L}({\phi}_{t-1})\gets \frac{1}{S}\sum_{s=1}^S \triangledown_{{\phi}}\log q({\theta}^s;{\phi}_{t-1})(\log p({\theta}^s,{x})-\log q({\theta}^s;{\phi}_{t-1}))$
    • ${\phi}_t\gets{\phi}_{t-1} + \rho_t\hat{\triangledown}_{{\phi}}\mathcal{L}({\phi}_{t-1})$
    • $\delta \gets \frac{||{\phi}_t - {\phi}_{t-1}||}{||{\phi}_{t-1}||}$
  • Output ${\phi}^* = {\phi}_t$
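As a concrete illustration, here is a small NumPy sketch of this algorithm for a toy conjugate model (a Gaussian mean with a Gaussian prior), using a Gaussian variational family. Everything here (the model, the fixed step size, the sample-mean baseline subtracted from $f$) is an illustrative choice; without even that crude baseline the raw score-function estimator is usually too noisy to be useful, which is exactly the problem discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)                    # toy data: x_i ~ N(theta, 1), prior theta ~ N(0, 1)

def log_joint(theta):
    return -0.5 * theta ** 2 - 0.5 * np.sum((x - theta) ** 2)

m, log_s, rho, S = 0.0, 0.0, 1e-3, 64                # variational N(m, s^2), step size, MC samples
for t in range(2000):
    s = np.exp(log_s)
    theta = rng.normal(m, s, size=S)                 # simulate from q
    log_q = -0.5 * ((theta - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    f = np.array([log_joint(th) for th in theta]) - log_q
    f = f - f.mean()                                 # crude constant baseline (the score has zero mean)
    grad_m = np.mean((theta - m) / s ** 2 * f)       # score-function gradient estimates
    grad_log_s = np.mean((((theta - m) / s) ** 2 - 1) * f)
    m, log_s = m + rho * grad_m, log_s + rho * grad_log_s

print(m, np.exp(log_s))                              # noisy estimates of the posterior mean and sd
```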

However, in practice this algorithm does not produce meaningful results for non-trivial models, since the variance of these estimates grows linearly with the number of parameters in ${\theta}$. Due to the high variance, we need some variance reduction techniques.

Rao-Blackwellization

Rao-Blackwellization reduces the variance of some estimator $J(X,Y)$ by defining another estimator \[\hat{J}(X) = \mathbb{E}[J(X,Y)|X]\]
It is clear that the expectation is preserved:\[\mathbb{E}[\hat{J}(X)] = \mathbb{E}[J(X,Y)]\]by tower law. The variance of this estimator is \[Var(\hat{J}(X)) = Var(J(X,Y)) + \mathbb{E}[\hat{J}(X)^2] - \mathbb{E}[J(X,Y)^2] = Var(J(X,Y)) - \mathbb{E}[(J(X,Y)-\hat{J}(X))^2]\]

Thus this new estimator always has less variance compared to $J(X,Y)$ unless $\hat{J}(X) = J(X,Y)$.

We now apply this to BBVI. Assume the approximating family follows the mean-field assumption, and let $p({x},{\theta}) = p_i({x},{\theta}_{(i)})p_{-i}({x},{\theta}_{-i})$,
where $p_i$ contains all the terms involving $\theta_i$, and $\theta_{(i)}$ is the collection of all latent variables that appear in $p_i$.
We can thus rewrite the derivative of the ELBO with respect to $\phi_i$ as \[\hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i) = \mathbb{E}_{q_{(i)}}[\triangledown_{\phi_i}[\log q_i(\theta_i;\phi_i)](\log p_i({x},\theta_{(i)})-\log q_i(\theta_i;\phi_i))]\]
This is a Rao-Blackwellized $\triangledown_{\phi_i}\mathcal{L}({\phi})$ as \[\mathbb{E}_q[\hat{\triangledown}_{\phi_i}\mathcal{L}({\phi}) - \hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i)] = C\,\mathbb{E}_{q_i}[\triangledown_{\phi_i}[\log q_i(\theta_i;\phi_i)]] = 0\]
with \[C = \mathbb{E}_{q_{-i}}[\log p_{-i}({x},{\theta}_{-i})] - \mathbb{E}_{q_{-i}}[\sum_{j\neq i}\log q_j(\theta_j;\phi_j)]\]
The detailed derivation could be found in [3].

Control variates

We now introduce another method, using a regression estimator. Suppose we want to estimate some quantity $\mu$ and we have an estimator $f$ with $\mathbb{E}[f(u)] = \mu$, where $u$ is a random variable. Furthermore, suppose we have a “similar” function $h$ such that $\mathbb{E}[h(u)] = \nu$ is known. Then we define a new estimator of $\mu$:\[g(u) = f(u)-\beta(h(u)-\nu)\]

This is clearly an unbiased estimator and for the variance term\[Var(g(u)) = Var(f(u)) + \beta^2 Var(h(u)) - 2\beta Cov(f(u),h(u))\]

In order to minimize this variance, we choose \[\hat{\beta} = \frac{Cov(h(u),f(u))}{Var(h(u))}\]

This is also the OLS estimator for the linear regression:\[f(u) = \mu + \beta(h(u)-\nu)\] Now plugging in this $\hat{\beta}$ we have \[Var(g(u)) = Var(f(u))(1-\rho^2_{fh})\] where $\rho^2_{fh}$ is the correlation between $f(u)$ and $h(u)$. Such $h$ is called the control variate.
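A quick numerical illustration of the idea, estimating $\mu = \mathbb{E}[e^u]$ for $u \sim \mathcal{N}(0,1)$ with $h(u) = u$ (whose mean $\nu = 0$ is known) as the control variate:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=100_000)
f = np.exp(u)                                  # estimator of mu = E[exp(u)] = exp(0.5)
h, nu = u, 0.0                                 # control variate with known mean
beta_hat = np.cov(h, f)[0, 1] / np.var(h)      # the variance-minimising (OLS) coefficient
g = f - beta_hat * (h - nu)

print(f.mean(), g.mean())                      # both estimate mu...
print(f.var(), g.var())                        # ...but g has noticeably smaller variance
```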

Improved BBVI

The original authors in [3] combined these two methods, choosing $\triangledown_{\phi_i}\log q_i(\theta_i;\phi_i)$ as the control variate for $\hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i)$; the resulting algorithm is shown below:

  • Initialize $t \gets 0$, $\delta \gets \infty$
  • While $\delta > \tau$:
    • $t \gets t+1$
    • Sample ${\theta}^1,…,{\theta}^S\sim q({\theta};{\phi}_{t-1})$
    • For $i = 1$ to $n$:
      • $f_i \gets \frac{1}{S}\sum_{s=1}^S \triangledown_{\phi_i}\log q_i(\theta_i^s;{\phi}^{t-1}_{i})(\log p_i({\theta}_{(i)}^s,{x})-\log q_i(\theta_i^s;{\phi}_i^{t-1}))$
      • $h_i\gets \frac{1}{S}\sum_{s=1}^S \triangledown_{\phi_i}\log q_i(\theta_i^s;{\phi}_i^{t-1})$
      • $\hat{\beta}_i \gets \frac{\hat{Cov}(f_i,h_i)}{\hat{Var}(h_i)}$
      • $g_i \gets f_i-\hat{\beta}_ih_i$
      • $\phi_i^t\gets \phi_i^{t-1} + \rho_tg_i$
    • $\delta \gets \frac{||{\phi}_t - {\phi}_{t-1}||}{||{\phi}_{t-1}||}$
  • Output ${\phi}^* = {\phi}_t$

Final Conclusion for BBVI

In [4], the same authors pointed out the limitations of BBVI. They found that the gradients can be very unstable for large input values, and that adaptive step-sizes like AdaGrad need extra tuning. They also found that, for a linear mixed effects model, BBVI under-performs a Metropolis-within-Gibbs sampler. Similarly, in an experiment with LDA (latent Dirichlet allocation), a Gibbs sampler converged in a couple of minutes for 20 topics, whereas BBVI did not produce any reasonable results after hours of iterations for just 2 topics. BBVI therefore still has practical limitations and requires further experimentation.

Differentiable Models: Automatic Differentiation Variational Inference

The idea behind Automatic Differentiation Variational Inference (ADVI) is as follows:

  • Transform the parameter space to real space via a one-to-one mapping $T:\text{Supp}({\theta})\rightarrow\mathbb{R}^k$.
  • Let ${\psi} = T({\theta})$ and give it a joint normal variational distribution, \[q({\psi}|{\phi}) = \mathcal{N}({\mu},\Sigma).\] Notice that we need to ensure $\Sigma$ is full rank (positive definite). One way to do this is using a Cholesky factorization, $\Sigma = LL^T$, where $L$ is a lower triangular matrix with $(k+1)k/2$ free entries. Overall, ${\phi}$ lives in $\mathbb{R}^{(k+1)k/2+k}$ where $k$ is the dimension of the parameters in our model. This comes with computational cost, so we may wish to make a mean-field assumption on ${\psi}$.
  • Finally we make the standardization ${\eta} = S_{{\phi}}({\psi}) = L^{-1}({\psi}-{\mu})$. This makes $q({\eta}) = \mathcal{N}({\eta};{0},{I})$.

Following the above recipe, we can rewrite the ELBO optimization as \[{\phi}^* = \arg\max_{\phi}\; \mathbb{E}_{\mathcal{N}({\eta};{0},{I})}\left[\log p\left({x},T^{-1}(S^{-1}_{{\phi}}({\eta}))\right) + \log |\det J_{T^{-1}}(S_{{\phi}}^{-1}({\eta}))|\right] + \mathbb{H}[q({\psi};{\phi})]\]
In this case, the variational parameters are contained in the transformation $S$. We now give the gradients:\[\triangledown_{{\mu}}\mathcal{L} = \mathbb{E}_{\mathcal{N}({\eta})}[\triangledown_{{\theta}}\log p({x},{\theta})\triangledown_{{\psi}}T^{-1}({\psi}) + \triangledown_{{\psi}}\log|\det J_{T^{-1}}({\psi})|]\]
and \[\triangledown_{L}\mathcal{L} =\mathbb{E}_{\mathcal{N}({\eta})}[\left(\triangledown_{{\theta}}\log p({x},{\theta})\triangledown_{{\psi}}T^{-1}({\psi}) + \triangledown_{{\psi}}\log|\det J_{T^{-1}}({\psi})|\right){\eta}^T] + (L^{-1})^T\]
Now, similar to BBVI, we can use an MC approximation and SGD to get an approximate gradient and optimize the ELBO. In [5] they propose a step-size sequence of the form
\[\rho_k^i = \eta\times i^{-1/2+\epsilon}\times\left(\tau + \sqrt{s_k^i}\right)^{-1}\]
where \[s_k^i = \alpha (g_k^i)^2 + (1-\alpha)s_k^{i-1}\]
Here $k$ indexes the element and $i$ the iteration, $g_k^i$ is the $k$th element of the gradient vector at iteration $i$, and $s_k^1 = (g_k^1)^2$.

Notice that here $\eta$ is another variable that controls the scale of the step-size sequence; it can be searched over $\{0.001,0.1,1,10,100\}$. $\epsilon$ is set to be small, for example $\epsilon = 10^{-6}$, to satisfy the Robbins and Monro conditions. The last term keeps a memory of the past gradients. More details can be found in [5].
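To make the recipe concrete, here is a minimal PyTorch sketch of mean-field ADVI for a single positive parameter $\sigma$ (the noise scale of some toy data) with an Exponential(1) prior, using $T(\sigma) = \log\sigma$ as the transformation. For simplicity it uses a single MC sample per step and the Adam optimiser instead of the step-size schedule above; all of these are illustrative choices rather than the reference implementation.

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(100) * 1.5                       # toy data with true scale 1.5

def log_joint(sigma):
    """log p(x, sigma), up to a constant: x_i ~ N(0, sigma^2), sigma ~ Exponential(1)."""
    return -sigma + torch.sum(-0.5 * (x / sigma) ** 2 - torch.log(sigma))

mu = torch.zeros(1, requires_grad=True)          # variational mean in the unconstrained space
omega = torch.zeros(1, requires_grad=True)       # log standard deviation of q(psi)
opt = torch.optim.Adam([mu, omega], lr=0.05)

for t in range(2000):
    eta = torch.randn(1)                         # standardised noise (the reparameterisation)
    psi = mu + torch.exp(omega) * eta            # unconstrained parameter psi = S_phi^{-1}(eta)
    sigma = torch.exp(psi)                       # T^{-1}: back to the constrained space
    log_det_jac = psi                            # log |det J_{T^{-1}}(psi)| for sigma = exp(psi)
    entropy = omega + 0.5 * math.log(2 * math.pi * math.e)
    elbo = log_joint(sigma) + log_det_jac + entropy
    loss = -elbo.sum()                           # negative ELBO (scalar) to minimise
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.exp(mu).item())                      # roughly the posterior median of sigma
```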

 

It has been shown that in ADVI the variance of the gradient estimates is better controlled than in BBVI. Its performance has also been compared to well-known MC methods in [5].

 

 

[1] John Winn and Christopher M. Bishop. Variational message passing. Journal of Machine Learning Research, 6(23):661–694, 2005.

[2] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, apr 2017.

[3] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference, 2013.

[4] Rajesh Ranganath, Sean Gerrish, and David Blei. Black Box Variational Inference. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822, Reykjavik, Iceland, 22–25 Apr 2014. PMLR.

[5] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. J. Mach. Learn. Res., 18(1):430–474, jan 2017.

 

Student Perspectives: Avoiding our problems ~ Noise Contrastive Estimation

A post by Henry Bourne, PhD student on the Compass programme.


Currently I’ve been researching Noise Contrastive Estimation (NCE) techniques for representation learning, aided by my supervisor Dr. Rihuan Ke. Representation learning concerns itself with learning low-dimensional representations of high-dimensional data that can then be used to quickly solve a general downstream task, e.g. after learning general representations for images you could quickly and cheaply train a classification model on top of those representations.

NCE is a general estimator for parametrised probability models as I will explain in this blogpost. However, it can also be cleverly used to learn useful representations in an unsupervised (or equivalently self-supervised) manner, which I will also explain. I’ll start by explaining the problem that NCE was created to solve, then provide a quick comparison to other methods, explain how researchers have built on this method to carry out representation learning and finally discuss what I am currently working on. 

NCE solves the problem of computing a normalising constant by avoiding the problem altogether and solving some other proxy problem. Methods that are able to work with unnormalised probability models are known as Energy Based Models (EBMs). We will begin by describing the problem with the normalising constant before getting on to how we will avoid it.

 

The problem … with the normalising constant 

Let’s say we have some arbitrary probability distribution, $p_{d}(\cdot)$, and a parametrised probability model, $p_{m}(\cdot ; \alpha)$, with which we would like to accurately model that underlying probability distribution. Let’s further assume that we’ve picked our model well, such that there exists an $\alpha^{*}$ with $p_{d}(\cdot) = p_{m}(\cdot ; \alpha^{*})$.

Let’s just fit it to our data sampled from the underlying distribution using Maximum Likelihood Estimation! Sounds like a good idea: MLE has been extensively used, is reliable and efficient (asymptotically achieving the Cramér-Rao lower bound, the lowest possible variance/MSE for an unbiased estimator), is asymptotically normal, is consistent, and doesn’t assume normality. Moreover, there are a lot of tweaked MLE techniques out there that you can use if you would like an estimator with slightly different properties.

First let’s look under the hood of our probability model, we can write it as so:

$
\begin{array}{l|l}
p_{m}(\cdot;\alpha)=\frac{p_{m}^{0}(\cdot; \alpha)}{Z(\alpha)} & \text{where,} \: Z(\alpha) = \int p_{m}^{0}(u; \alpha) du
\end{array}
$

The likelihood is our probability model for some $\alpha$ evaluated over our dataset. Evaluating the likelihood becomes tricky when there isn’t an analytical solution for the normalisation term, $Z(\alpha)$, and the possible set of values $u$ can take becomes large. For example if we would like to learn a probability distribution over images then this normalisation term becomes intractable.

By working with the log we get better numerical stability; it also makes things easier to read and makes calculations and taking derivatives easier. So, let’s take the log of the above:

$
\begin{aligned}
&{} p_{m}(\cdot;\alpha) = \frac{p_{m}^{0}(\cdot; \alpha)}{Z(\alpha)} \\
& \Rightarrow \log p_{m}(\cdot; \theta) = \log p_{m}^{0} (\cdot ; \alpha) +c
\end{aligned}
$
where $\theta = \{\alpha, c \}$ and $c$ is an estimate of $-\log Z(\alpha)$.

Here, we write $p_{m}^{0}(\cdot;\alpha)$ for our unnormalised probability model. After taking the $\log$ we can write our normalising constant as $c$ and then include it as a parameter of our model. So our new model, now parameterised by $\theta$, $p_{m}(\cdot;\theta)$, is self-normalising, i.e. it estimates its own normalising constant. Another approach to making the model self-normalising would be to simply set $c=0$, implicitly making the model self-normalising. This is what is normally done in practice, but it assumes that your model is complex enough to be able to indirectly model $c$.

Couldn’t we just use MLE to estimate $\log p_{m}(\cdot ; \theta)$? No we can’t! This is because the likelihood can be made arbitrarily large by making $c$ large.

This is where Noise Contrastive Estimation (NCE) comes in. NCE has been shown theoretically and empirically to be a good estimator when taking this self-normalizing assumption. We’ll assess it versus competing methods at the end of the blogpost. But before we do that, let’s first describe the original NCE method, named binary-NCE [1]; later we will mention some of the more complex versions of this estimator.

 

Binary-NCE

The idea with binary-NCE [1] is that by avoiding our problems we fix our problems! I.e., we would like to create and solve an ‘easier’ proxy problem which, in the process, solves our original problem.

Let’s say we have some noise distribution, $p_{n}(\cdot)$, which is easy to sample from, allows for an analytical expression of $\log p_{n} (\cdot)$ and is in some way similar to our $p_{d}(\cdot)$ (the underlying probability distribution which we are trying to estimate). We would also like $p_{n}(\cdot)$ to be non-zero wherever $p_{d}(\cdot)$ is non-zero. Don’t worry too much about these assumptions as they are normally quite easy to satisfy, apart from an analytical expression being available. They are just necessary for our theoretical properties to hold and for binary-NCE to work in practice.

We would like to create and solve a proxy problem: given a sample, classify whether it was drawn from our probability model or from our noise distribution. Consider the following density ratio.

$
\begin{aligned}
\frac{p_{m}(u;\alpha)}{p_{n}(u)}
\end{aligned}
$

If this density ratio is bigger than one, then $u$ is more likely to have come from our probability model, $p_{m}(\cdot;\alpha)$. If it is smaller than one, then $u$ is more likely to have come from our noise distribution, $p_{n}(\cdot)$. Therefore, if we can model this density ratio then we have a model for how likely a sample is to have come from our probability model as opposed to having been sampled from our noise distribution.

Notice that the ratio above is written in terms of our normalised probability model; we can rewrite it in terms of the unnormalised model as follows.

$
\begin{aligned}
& \log \left(\frac{p_{m}(u;\alpha)}{p_{n}(u)} \right) \\
& = \log \left(\frac{p_{m}^{0}(u;\alpha)}{Z(\alpha)} \cdot \frac{1}{p_{n}(u)} \right) \\
& = \log \left(\frac{p_{m}^{0}(u;\alpha)}{p_{n}(u)} \right) + c \\
& = \log p_{m}^{0}(u;\alpha) + c - \log p_{n}(u) \\
& = \log p_{m}(u;\theta) - \log p_{n}(u)
\end{aligned}
$

Let’s now define a score function $s$ that we will use to model our rewrite of the density ratio just above:

$
\begin{aligned}
s(u;\theta) = \log p_{m}(u;\theta) - \log p_{n}(u)
\end{aligned}
$

One further step before introducing our objective function: we would like the output of our score model to behave somewhat like a probability, and we don’t want the model to be able to simply increase the score indefinitely. So we pass our modelled density ratio through the sigmoid (logistic) function.

$
\begin{aligned}
\sigma(s(u;\theta)) = \frac{1}{1+ \exp(-s(u;\theta))}
\end{aligned}
$

We would like to classify, according to our model of the density ratio, whether a sample is ‘real’/‘positive’ or just ‘noise’/‘fake’/‘negative’. So a natural choice for the objective function is the cross-entropy loss.
$
\begin{aligned}
J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left( \log [ \sigma(s(x_{i};\theta))] + \log [1- \sigma(s(x_{i}';\theta))] \right)
\end{aligned}
$

where $x_{i} \sim p_{d}$ and $x_{i}' \sim p_{n}$ for $i \in \{1,\ldots,N\}$. Here we simply assume one noise sample per observation, but this extends trivially to any integer $K>0$ noise samples per observation, and in fact the estimator’s asymptotic performance improves as we increase $K$.

Once we’ve estimated our density ratio, we can easily recover our normalised probability model of the underlying distribution by adding back the log probability density of the noise distribution and taking the exponential.

This estimator is consistent, efficient and asymptotically normal. In [1] the authors also show it working empirically in a range of different settings.
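To make this concrete, here is a minimal sketch of binary-NCE on a toy one-dimensional problem. Everything in it (the Gaussian data, the wider Gaussian noise distribution, the parameterisation of the unnormalised model with a learnt constant $c$, and the use of an off-the-shelf optimiser) is an illustrative choice of mine rather than anything prescribed by [1]; it simply maximises the cross-entropy objective $J(\theta)$ above.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 5000
x_data = rng.normal(loc=2.0, scale=1.0, size=N)        # x_i  ~ p_d (unknown to us)
x_noise = rng.normal(loc=0.0, scale=3.0, size=N)        # x_i' ~ p_n (easy to sample)
log_pn = lambda u: norm.logpdf(u, loc=0.0, scale=3.0)   # analytical log p_n

def log_pm(u, theta):
    # Unnormalised Gaussian log-density plus the learnt constant c; theta = (mu, log_sd, c)
    mu, log_sd, c = theta
    return -0.5 * ((u - mu) / np.exp(log_sd)) ** 2 + c

def score(u, theta):
    # s(u; theta) = log p_m(u; theta) - log p_n(u)
    return log_pm(u, theta) - log_pn(u)

def neg_J(theta):
    # Negative binary-NCE objective: cross-entropy between 'real' and 'noise' labels
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    J = np.mean(np.log(sig(score(x_data, theta)) + 1e-12)
                + np.log(1.0 - sig(score(x_noise, theta)) + 1e-12))
    return -J

theta_hat = minimize(neg_J, x0=np.zeros(3)).x
print("mean, sd, c:", theta_hat[0], np.exp(theta_hat[1]), theta_hat[2])
# For this model -log Z(alpha) = -log(sd * sqrt(2 * pi)), so c should land close to that value.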

 

How does it compare to other estimators of unnormalised parameterised probability density models?

NCE is not the only method we can use to estimate an unnormalised parameterised probability model. As mentioned, NCE belongs to a family of methods for estimating energy-based models (EBMs), which all aim to solve this very problem. Let’s very briefly mention some of the alternatives from this family; please do check out the references in this subsection if you would like to learn more. We discuss the methods as they appeared in their seminal form.

One alternative is contrastive divergence, which estimates an unnormalised parameterised probability model using a combination of MCMC and the KL divergence. Contrastive divergence was originally introduced with Boltzmann machines in mind [9]: MCMC is used to generate samples of the activations of the Boltzmann machine, and the KL divergence then measures the difference between the distribution of activations given by the real data and that of the simulated activations. We then aim to minimise this KL divergence.

Score matching [11] fits a parameterised probability model without computing the normalising term by estimating the gradient of the log density, which it calls the score function. It does this by minimising the expected squared distance between the model’s score function and the score function of the observed data. However, obtaining the score function of the observed data would require estimating a non-parametric model from the data. This is elegantly avoided by deriving, via integration by parts, an alternative form of the objective function that involves only the model’s score function and its derivative.

Importance sampling [10], which has been around for quite a while, uses weighted samples to focus on parts of the distribution that are ‘more important’, and in the process self-normalises. This makes it applicable to unnormalised probability models, and it is typically more efficient, with lower variance, than plain Monte Carlo sampling.

[1] contains a simple comparison between NCE, contrastive divergence, importance sampling and score matching. In their experimental setting the authors found contrastive divergence achieved the best performance, closely followed by NCE. They also measured computation time and found NCE to be the best in terms of error versus computation time. This by no means crowns NCE as the best estimator, but it is a good indication of its utility, as are the countless ways it has been used with high efficacy on a multitude of real-world problems.

 

Building on Binary-NCE (Ranking-NCE and Info-NCE)

Taking inspiration from binary-NCE, a number of other estimators have been devised. One such estimator is Ranking-NCE [2]. This estimator has two important elements.

The first is that the estimator assumes we are trying to model a conditional distribution, for example $p(y|x)$. Under this assumption the normalising constant differs for each value of the variable we condition on, i.e. the normalising term is now some $Z(x;\theta)$, with one for each possible value of $x$. This loosens the constraints on our estimator, as we don’t require the optimal parameters $\theta^{*}$ to satisfy $\log Z(x;\theta^{*}) = c$ for a single $c$ across all possible values of $x$. This means we can apply our model to problems where the number of possible values of $x$ is much larger than the number of parameters in the model. For further details please refer to [2], Section 2.

The second is its objective: given an observed sample $x$ and an integer $K>1$ samples from the noise distribution, the objective ranks the samples in order of how likely they were to have come from the model versus the noise distribution. Again, for further details please refer to [2].

Importantly, this version of the estimator can be applied to more complex problems and has empirically been shown to achieve better performance.

Now, what we’ve been waiting for: how can we use NCE for representation learning? This is where Info(rmation)-NCE comes in. It is essentially Ranking-NCE, but with the conditional distribution and noise distribution chosen in a specific way.

We consider a conditional probability of the form $p(y|x)$ where $y \in \mathbb{R}^{d_{y}}$, $x \in \mathbb{R}^{d_{x}}$ and $d_{y} < d_{x}$. Here $x$ is some data and $y$ is the low-dimensional representation we would like to learn for $x$. We then choose our noise distribution, $p_{n}$, to be the marginal distribution of our representation $y$, $p_{y}$. So our density ratio becomes:

$
\begin{aligned}
\frac{p_{m}(y|x; \theta)}{p_{y}(y)}
\end{aligned}
$

This is now a measure of how likely a given $y$ is to have come from the conditional distribution we are trying to model, i.e. how likely this representation is to have been obtained from $x$, versus being some randomly sampled representation.

A key thing to notice is that we are unlikely to have an analytical form for the $\log$ of the marginal distribution of $y$. In fact, this doesn’t matter, as we aren’t actually interested in modelling the conditional distribution in this case. What we are interested in is that, by employing a Ranking-NCE-style estimator and modelling the above density ratio, we maximise a lower bound on the mutual information between $Y$ and $X$, $I(Y;X)$. A proof of this, along with the actual objective function, can be found in [3].

This is quite an amazing result! We solve a proxy problem of a proxy problem and get an estimator with great theoretical guarantees that is computationally efficient and maximises a mutual information, which allows us, in an unsupervised manner, to learn general representations for data. So we avoid our problems twice! I appreciate that those were two big jumps without much detail, but I hope it gives a sense of the link between NCE in its basic form and representation learning. More specifically, NCE is known as a self-supervised learning method, which simply means an unsupervised method that uses supervised machinery but generates its own teaching signal. Even more specifically, NCE is a contrastive method, which gets its name from the fact that it contrasts samples against each other in order to learn. The other popular category of self-supervised learning methods is generative models; you may have heard of these!
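To give a flavour of what this looks like in code, below is a minimal numpy sketch of an InfoNCE-style objective in the form commonly used for representation learning (e.g. in [4]): scores are cosine similarities scaled by a temperature, each item in a batch is paired with its own representation as the ‘positive’, and the other representations in the batch play the role of noise samples. The function name, the cosine-similarity score and the toy usage are my own choices, not the exact estimator of [3].

import numpy as np
from scipy.special import logsumexp

def info_nce_loss(x_emb, y_emb, temperature=0.1):
    # x_emb, y_emb: arrays of shape (batch, dim); row i of each forms a positive pair
    x = x_emb / np.linalg.norm(x_emb, axis=1, keepdims=True)
    y = y_emb / np.linalg.norm(y_emb, axis=1, keepdims=True)
    logits = x @ y.T / temperature                        # (batch, batch) matrix of scores
    log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
    return -np.mean(np.diag(log_probs))                   # cross-entropy with positives on the diagonal

# Toy usage: a batch of 8 'representations' and noisy views of them
rng = np.random.default_rng(0)
y = rng.normal(size=(8, 32))
x = y + 0.1 * rng.normal(size=(8, 32))
print(info_nce_loss(x, y))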

 

My Research

Now we know a little bit about NCE and how we can use it to do representation learning, what am I researching?

Info-NCE has been applied with great success in many self-supervised representation learning techniques, a good one to check out is [4]. Contrastive self-supervised learning techniques have been shown to outperform supervised learning in many areas. They also solve some of the key challenges that face generative representation learning techniques in more challenging domains than language such as images and video. This review [5] is a good starting point for learning more about what contrastive learning and generative learning are and some of their differences. 

However, there are still lots of problem areas where applying NCE, without very fancy neural network architectures and techniques, doesn’t do so well or outright fails. Moreover, many of these techniques introduce extra requirements on memory, compute or both. Additionally, they can often be highly complex and their ablation studies are poor. 

Currently, I’m looking at applying new kinds of density ratio estimation methods to representation learning, in a similar way to info-NCE. These new density ratio estimation techniques, when applied in the correct way, will hopefully lead to representation learning techniques that are more capable in problem areas such as multi-modal learning [6], multi-task learning [7] and continual learning [8].

Currently, of most interest to me is multi-modal learning. This is concerned with learning a joint representation over data comprised of more than one modality, e.g. text and images. By learning representations on data consisting of multiple modalities, it’s possible to obtain higher-quality representations (carrying more information) and to solve more complex tasks that require working over multiple modalities, e.g. most robotics tasks. However, multi-modal learning has a unique set of difficult challenges that make naively applying representation learning techniques to it challenging. One of the key challenges is balancing a trade-off between learning representations that exploit the synergies between the modalities and not allowing the quality of the representations to be degraded by the varying quality and bias of each modality. We hope to solve this problem in an elegant and simple manner, using density ratio estimation techniques to create a novel info-NCE-style estimator.

Hope you enjoyed! If you would like to reach me or read some of my other blog posts (I have some more in-depth ones about NCE coming out soon) then check out my website at /phd.h-0-0.com.

 

References

[1] : Gutmann, M. and Hyvärinen, A., 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 297-304). JMLR Workshop and Conference Proceedings.

[2] : Ma, Z. and Collins, M., 2018. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812.

[3] : Oord, A.v.d., Li, Y. and Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[4] : Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (pp. 1597-1607). PMLR.

[5] : Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J. and Tang, J., 2021. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1), pp. 857-876.

[6] : Baltrušaitis, T., Ahuja, C. and Morency, L.P., 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), pp. 423-443.

[7] : Zhang, Y. and Yang, Q., 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12), pp. 5586-5609.

[8] : Wang, L., Zhang, X., Su, H. and Zhu, J., 2024. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] : Carreira-Perpinan, M.A. and Hinton, G., 2005. On contrastive divergence learning. In International Workshop on Artificial Intelligence and Statistics (pp. 33-40). PMLR.

[10] : Kloek, T. and Van Dijk, H.K., 1978. Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econometrica: Journal of the Econometric Society, pp. 1-19.

[11] : Hyvärinen, A. and Dayan, P., 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4).

 

 

Student Perspectives: Are larger models always better?

A post by Emma Ceccherini, PhD student on the Compass programme.


In December 2023, I attended NeurIPS, a machine learning conference, with some COMPASS colleagues. There, I attended a tutorial titled “Reconsidering Overfitting in the Age of Overparameterized Models”. The findings the speakers presented overturn some traditional statistical concepts, so I’d like to share some of these innovative ideas with the COMPASS blog readers.

Classical statistician vs deep learning practitioners
Classical statisticians argue that small models have high bias but low variance (Figure 1 (left)) and large models have low bias but high variance (Figure 1 (right)). This is called the bias-variance trade-off and is a crucial notion found in all traditional statistics textbooks. Large, over-parameterised models perfectly interpolate the data points by fitting noise: they have near-zero training error but an increasing test error. This phenomenon is called overfitting and causes poor performance on unseen data. Overfitting implies poor generalisation, which can be thought of as the model’s performance on newly generated data at test time.

Figure 1: Examples of models with low complexity, good complexity, and large complexity.

Therefore, statistics textbooks recommend avoiding overfitting and improving generalization by finding a balance in the bias-variance trade-off, either by reducing the number of parameters or using regularisation (Figure 1 (centre)).

However, as available computational power has increased, practitioners have built larger and larger models. For example, neural networks have millions of parameters, more than enough to fit noise, yet they generalise very well in practice, performing significantly better than small models. These large over-parameterised models exceed the so-called interpolation threshold, the point at which the training error becomes approximately zero. Several theoretical statisticians are trying to work out what happens after this threshold. While we now have some answers, many questions are still up for debate!

Figure 2: The bias-variance trade-off.

 

The double descent

Nakkiran et al. [2019] show that in the under-parameterised regime, neural networks’ test errors exhibit the classical U-shape of the bias-variance trade-off, while in the over-parameterised regime, after the interpolation threshold, the test error decreases again, creating the so-called double descent (see Figure 3). Figure 4 shows the test error of a neural network classifier on CIFAR-10, a standard image dataset. The plot shows a double descent in the test error for neural networks trained until convergence (purple line).

Figure 3: The double descent.

The authors make two more innovative observations: harmless interpolation and good generalisation for large models. It can be observed from Figure 4 that regularisation, here in the form of early stopping (red line), is substantially beneficial around the interpolation threshold. However, as the model size grows, the test error of optimally early-stopped neural networks (red line) and that of neural networks trained until convergence (purple line) overlap. Therefore, for large models, interpolation (training until convergence) is no worse than regularisation (optimal early stopping); that is, interpolation is harmless. Finally, Figure 4 shows that the test error remains low as the size of the model grows. Hence, for large models, we can achieve reasonably good test accuracy as a result of good generalisation.

Figure 4: Classification using neural networks on CIFAR-10 Nakkiran et al. [2019].
Simple maths for linear models
Given these groundbreaking experimental results, statisticians have sought to understand, through theoretical analysis, when these three phenomena occur. Although neural networks were the initial motivation for this work, they are hard to analyse even in the shallow case, so statisticians have resorted to studying these phenomena starting from the well-understood linear model.

Over-parameterisation in linear models of the form $\mathbf{Y} = \mathbf{X}\theta^* + \mathbf{W}$ means there are more features $d$ than samples $n$, i.e. $d > n$ for an input matrix $\mathbf{X}$ of dimension $n \times d$. The system $\mathbf{X}\hat{\theta} = \mathbf{Y}$ then has infinitely many solutions, so we consider the minimum-norm solution $\hat{\theta} = \text{arg min}_{\theta \,:\, \mathbf{X}\theta = \mathbf{Y}} ||\theta||_2$.

After the interpolation threshold, the variance is dominating (see Figure 3) so it needs to go down for the test error to go down. Indeed, Bartlett et al. [2020] show that in this setup the variance decreases as $d \gg n$, precisely $$\text{variance} \asymp \frac{\sigma^2n}{d}. $$

It can be shown that the data are approximately orthogonal when $d \gg n$, namely $\langle X_i, X_j \rangle \approx 0$ for $i \neq j$, so the noise “energy” is spread out along the $d$ dimensions; hence as $d$ grows the noise contribution decreases.
However, Bartlett et al. [2020] also show that the bias increases with $d$, precisely $$\text{bias} \asymp \left(1-\frac{n}{d}\right)||\theta^*||_2^2.$$ This is because the signal “energy” is likewise spread out along the $d$ dimensions.

Eventually, the bias will dominate and the test error will increase again, see Figure 5 (left). Therefore under this framework, the double descent and harmless interpolation can be achieved but good generalisation cannot.
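As a rough illustration of this simple linear model story (a sketch of my own, with arbitrary choices of $n$, signal and noise level), the minimum-norm solution can be computed with the Moore-Penrose pseudoinverse, which returns the least-squares solution when $d \le n$ and the minimum $\ell_2$-norm interpolator when $d > n$. Sweeping $d$ past $n$, the test error should peak near the interpolation threshold $d \approx n$, descend again just beyond it, and then creep back up as the bias term takes over; how pronounced each phase is depends on these scaling choices.

import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_test = 50, 0.5, 2000

for d in [10, 25, 45, 50, 60, 100, 200, 500, 1000]:
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)        # unit-norm signal for every d
    X = rng.normal(size=(n, d))
    y = X @ theta_star + sigma * rng.normal(size=n)
    theta_hat = np.linalg.pinv(X) @ y               # minimum-norm solution of X theta = y
    X_test = rng.normal(size=(n_test, d))
    y_test = X_test @ theta_star + sigma * rng.normal(size=n_test)
    mse = np.mean((X_test @ theta_hat - y_test) ** 2)
    print(f"d = {d:5d}   test MSE = {mse:.3f}")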

Figure 5: Bias-variance trade-off after interpolation threshold for a simple linear model (left) and a linear model with spiked covariance (right).

Finally, Bartlett et al. [2020] show that in the special case where the $k$ features are “upweighted”, all three phenomena are observed. Assuming a spiked covariance $$\Sigma = \mathbb{E}[\mathbf{X}\mathbf{X}^T] = \begin{bmatrix}
R\mathbf{I}_k & \mathbf{0} \\
\mathbf{0} & \mathbf{I}_{d-k}
\end{bmatrix},$$ it can be shown that the variance and the bias will go to zero as $d \rightarrow \infty$ provided that $R \gg \frac{d}{n}$, therefore the double descent, harmless interpolation and good generalization are achieved (see Figure 5 (right)).

Many unanswered questions remain
Similar results to the ones described for linear models have been obtained for linear classification [Muthukumar et al., 2021], while such results for neural networks [Frei et al., 2022] are still limited. Moreover, there are still many open questions on benign overfitting for neural networks. For example, existing results focus on the $d \gg n$ regime, but there are no results on neural networks over-parameterised in low dimensions by increasing their width. Theoretical statisticians still have plenty of work to do to fully understand these phenomena!

References 

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, April 2020. ISSN 1091-6490. doi: 10.1073/pnas.1907378117. URL http://dx.doi.org/10.1073/pnas.1907378117.

Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, and Wei Hu. Implicit bias in leaky relu
networks trained on high-dimensional data, 2022.

Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant
Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter?,
2021.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep
double descent: Where bigger models and more data hurt, 2019.

Compass students at ICLR 2024

Congratulations to Compass students Edward Milsom and Ben Anson who, along with their supervisor, had their paper accepted for a poster at ICLR 2024.

 

Convolutional Deep Kernel Machines

Edward Milsom, Ben Anson, Laurence Aitchison 

Ed and Ben: In this paper we explore the importance of representation learning in convolutional neural networks, specifically in the context of an infinite-width limit called the Neural Network Gaussian Process (NNGP) that is often used by theorists. Representation learning refers to the ability of models to learn a transformation of the data that is tailored to the task at hand. This is in contrast to algorithms that use a fixed transformation of the data, e.g. a support vector machine with a fixed kernel function like the RBF kernel. Representation learning is thought to be critical to the success of convolutional neural networks in vision tasks, but networks in the NNGP limit do not perform representation learning, instead transforming the data with a fixed kernel function. A recent modification to the NNGP limit, called the Deep Kernel Machine (DKM), allows one to gradually “add representation learning back in” to the NNGP, using a single hyperparameter that controls the amount of flexibility in the kernel. We extend this algorithm to convolutional architectures, which required us to develop a new sparse inducing point approximation scheme. This allowed us to test on the full CIFAR-10 image classification dataset, where we achieved state-of-the-art test accuracy for kernel methods, with 92.7%.

In the plot below, we see how changing the hyperparameter (x-axis) to reduce flexibility too much harms the performance on unseen data.

 

Student Perspectives: Group Testing

A post by Rahil Morjaria, PhD student on the Compass programme.

What is Group Testing?

Group Testing was first introduced in the 1940s as a way to test military recruits for syphilis during World War II. By combining blood samples, they hoped to reduce the total number of tests needed to detect the diseased individuals (compared to testing each recruit individually).

Example of pooling blood samples, where red and green depict diseased and not diseased respectively.

Since then, there have been many advances in Group Testing with applications not just in detecting diseased individuals but also in communications, cybersecurity and data storage. In essence, whenever we have a situation where we need to detect a proportionally rare occurrence, Group Testing is probably applicable.

More formally, if we have some diseased set of individuals of size $k$ within a total population of size $n$, we might consider (instead of testing each individual separately) pooling people together into groups (with replacement) and testing these groups.

 

Matrix Form of Group Testing (each row depicting a test and each column depicting a member of the population) [1].

We often write our test design in matrix form, where each row is a group/test and each column indicates a member of the population, with a $1$ indicating an individual’s inclusion in a test.

The relationship between $k$ and $n$ is quite important, our focus is on the sparse case in which $k = O(n^\alpha)$ where $\alpha \in (0,1)$.

Often we assume our test apparatus is sensitive enough that a single diseased individual in a test will give a positive result (as shown in the image above); this is known as the noiseless case (there is a vast amount of work on different types of noise; for more information check out [1]).

Adaptive vs Non-Adaptive Group Testing

Adaptive Group Testing (as the name suggests) allows us to adapt our subsequent tests by the results of the previous. If we compare this to Non-Adaptive Group Testing, in which we have to define our tests (and thus our groups) before we obtain any results, we can expect stronger results.

As our tests have binary outputs, we can obtain at most $1$ bit of information per test. Since there are $\binom{n}{k}$ possible defective sets, we need $\log_2\binom{n}{k}$ bits to uniquely represent each possible set. This gives a lower limit of $\log_2\binom{n}{k}$ tests; this is known as a converse result, a fundamental limit which we are unable to overcome.

In the noiseless case, Adaptive Group Testing is able to achieve this fundamental limit. First, we split our $n$ items into $k$ (the number of defective items) subsets of length $n/k$ without replacement, and then perform Binary Splitting on each subset; a simple sketch follows the figure below.

Binary Splitting Adaptive Algorithm.
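Here is that sketch: a toy, noiseless implementation of my own, simplified relative to the refined algorithms analysed in the literature (such as Hwang’s generalised binary splitting). The test oracle simply checks whether a pool contains any defective; each positive subset is binary-searched for a defective, which is then removed, and the search repeats until the subset tests negative.

import numpy as np

def test(pool, defectives):
    # Noiseless group test: positive iff the pool contains at least one defective
    return any(i in defectives for i in pool)

def binary_split(pool, defectives):
    # Binary search for one defective inside a pool known to test positive
    pool = list(pool)
    while len(pool) > 1:
        half = pool[: len(pool) // 2]
        pool = half if test(half, defectives) else pool[len(pool) // 2:]
    return pool[0]

def adaptive_group_test(n, k, defectives):
    found = set()
    for subset in np.array_split(np.arange(n), k):   # k subsets of (roughly) length n/k
        remaining = list(subset)
        while remaining and test(remaining, defectives):
            d = binary_split(remaining, defectives)
            found.add(d)
            remaining.remove(d)
    return found

rng = np.random.default_rng(0)
n, k = 1024, 8
defectives = set(rng.choice(n, k, replace=False).tolist())
print(adaptive_group_test(n, k, defectives) == defectives)   # should print True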

While Adaptive Group Testing is able to reach fundamental limits, our main focus is on Non-Adaptive Group Testing, which has many applications due to its ability to be run in parallel (among other practical advantages).

Non-Adaptive Group Testing procedures are often designed randomly (each item’s inclusion in a test is $\text{Bern}(v)$ for some $v$) or with near-constant column weight (each item is in ‘nearly’ the same number of tests). Of these two designs, near-constant column weight gives stronger results; a sketch of the random Bernoulli design is given below.
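Below is a sketch of the random Bernoulli design just mentioned, decoded with COMP, a simple standard decoder from the group testing literature (covered in [1]) rather than anything specific to this post: any item that appears in a negative test cannot be defective, and every remaining item is declared defective. In the noiseless case this never misses a defective but may produce false positives. The parameter values are arbitrary illustrations, and the last line computes the rate in the sense used in the figure below.

import numpy as np
from scipy.special import comb

rng = np.random.default_rng(0)
n, k, T = 500, 10, 150            # population size, number of defectives, number of tests
v = 1 / k                          # Bernoulli inclusion probability (a common choice)

defective = np.zeros(n, dtype=bool)
defective[rng.choice(n, k, replace=False)] = True

X = rng.random((T, n)) < v         # test design: X[t, i] = 1 iff item i is in test t
y = (X.astype(int) @ defective.astype(int)) > 0   # noiseless outcomes: positive iff a defective is included

# COMP decoding: rule out every item that appears in at least one negative test
ruled_out = X[~y].any(axis=0)
estimate = ~ruled_out
print("missed defectives:", int(np.sum(defective & ~estimate)))
print("false positives:  ", int(np.sum(~defective & estimate)))
print("rate log2(C(n,k))/T:", np.log2(comb(n, k, exact=True)) / T)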

A graph comparing different group testing designs, where the $\text{Rate} = \log_2\binom{n}{k}/T$ where $T$ is the number of tests needed to recover all the defective items (with high probability for the red lines and with certainty for the purple line). [1]

Goals

While there are many strong results in Group Testing, there is still much to explore. From list decoding (in which we allow a list of possible defective sets to be output), to other forms of noise, to efficient algorithms that still cannot match our theoretically achievable performance, there is work to be done in all aspects. With improvements in technology, combined with the myriad of applications, the future of Group Testing definitely looks bright!

References:

[1] M. Aldridge, O. Johnson, and J. Scarlett. Group testing: an information theory perspective. CoRR, abs/1902.06002, 2019. URL http://arxiv.org/abs/1902.06002.

Student perspectives: Compass Annual Conference 2023

A post by Dominic Broadbent, PhD student on the Compass CDT, and Michael Whitehouse, PhD student of the Compass CDT (recently submitted thesis)

Introduction

September saw the second annual Compass Conference, hosted in the Fry Building, the home of the School of Mathematics. The event was particularly special as it is the first time that all five Compass cohorts were brought together, and it was a fantastic opportunity to celebrate the achievements and research of the Compass CDT with external partners.  This year the theme was “Communicating Research in Context“, focusing on how research can be better communicated, and the need to highlight the motivation and applications of mathematical research.

Research talks

The day began with four long-form talks touching on the topic of communicating research. The first, from Alessio Zakaria, delved into hypothesis tests, commenting on their critical role as the de facto statistical tool across the sciences, and how p-hacking has led to a replication crisis that undermines public confidence in research. The next talk, by Sam Stockman and Emerald Dilworth, discussed the challenges they faced, and the key takeaways from their shared experience of communicating mathematics with researchers in the geographical sciences. Following this, Ed Davis’s interactive talk “The Universal Language of Visualisations” explored how effective visualisation techniques should differ by the intended audience, with examples from his research and activities outside of academia. The last talk, by Dan Milner, explored his research on understanding the effect of environmental factors on outcomes of smallholder farmers in Kenya. He took us through the process of collecting data on the ground, to modelling and communicating results to external partners. After each talk there was an opportunity to ask questions, allowing for audience participation and sparking interesting discussions. The format mirrored that which is most frequently used at external academic conferences, giving the speakers a chance to practise their technique in front of friendly faces.

Lightning talks

After a short break, we jumped back into the fray with a series of 3-minute fast-paced lightning talks. A huge range of topics were covered, all the way from developing modelling techniques for the electric grid of the future, to predicting the incidence of Cerebral Vasospasm at the Southmead Hospital ICU. With such a short time to present, these talks were a great exercise in distilling research into just the essentials, knowing there is very limited time to garner the audience’s interest and convey an effective message.

Special guest lecture

After lunch, we reconvened to attend the special guest lecture. The talk, entitled Bridging the gap between research and industry, was delivered by Ruth Voisey, CEO of the Smith Institute. It outlined Ruth’s journey from writing her PhD thesis ‘Multiple wave scattering by quasiperiodic structures’, to CEO of the Smith Institute, via an internship with the acoustic research team at Dyson. It was particularly refreshing to hear Ruth’s candid account of her ‘non-linear’ rise to CEO, crediting her success to strong principles of clear research communication and ‘mathematical evangelicalism’.

As PhD students in the bubble of academia, the path to opportunities in the world of industry can often feel clouded – Ruth’s lecture painted a clear picture of how one can transition from university based research to a rewarding career outside of this bubble, applying such research to tangible problems in the real world. 

Panel discussion and poster session

The special guest lecture was followed by a discussion on communicating research in context, with panel members Ruth Voisey, David Greenwood, Helen Barugh, Oliver Johnson, plus Compass CDT students Ed Davis and Sam Stockman. The panel discussed the difficulty of communicating the nuances of research conclusions with the public, with a particular focus on handling these nuances when talking to journalists – stressing the importance of communicating the limitations of the research in question.

This was followed by a poster session, one enthusiastic student had the following comment “it was great to see of all the Compass students’ hard work being celebrated and shared with the wider data science community”.

Concluding remarks

To cap off the successful event, Compass students Hannah Sansford and Josh Givens delivered some concluding remarks which were drawn from comments made by students about what key points they’d taken from the day. These focused on the importance of clear communication of research throughout the whole pipeline, from inception in discussion with fellow academics to the dissemination of knowledge to the general population.

The day ended with a walk to Goldney Hall, where students, staff, and attendees enjoyed delicious food, wine, and access to the beautiful Orangery gardens.

Student Perspectives: Impurity Identification in Oligonucleotide Drug Samples

A post by Harry Tata, PhD student on the Compass programme.

Oligonucleotides in Medicine

Oligonucleotide therapies are at the forefront of modern pharmaceutical research and development, with recent years seeing major advances in treatments for a variety of conditions. Oligonucleotide drugs for Duchenne muscular dystrophy (FDA approved) [1], Huntington’s disease (Phase 3 clinical trials) [2], and Alzheimer’s disease [3] and amyotrophic lateral sclerosis (early-phase clinical trials) [4] show their potential for tackling debilitating and otherwise hard-to-treat conditions. With continuing development of synthetic oligonucleotides, analytical techniques such as mass spectrometry must be tailored to these molecules and keep pace with the field.

Working in conjunction with AstraZeneca, this project aims to advance methods for impurity detection and quantification in synthetic oligonucleotide mass spectra. In this blog post we apply a regularised version of the Richardson-Lucy algorithm, an established technique for image deconvolution, to oligonucleotide mass spectrometry data. This allows us to attribute signals in the data to specific molecular fragments, and therefore to detect impurities in oligonucleotide synthesis.

Oligonucleotide Fragmentation

If we have attempted to synthesise an oligonucleotide $\mathcal O$ with a particular sequence, we can take a sample from this synthesis and analyse it via mass spectrometry. In this process, molecules in the sample are first fragmented — broken apart into ions — and these charged fragments are then passed through an electromagnetic field. The trajectory of each fragment through this field depends on its mass/charge ratio (m/z), so measuring these trajectories (e.g. by measuring time of flight before hitting some detector) allows us to calculate the m/z of fragments in the sample. This gives us a discrete mass spectrum: counts of detected fragments (intensity) across a range of m/z bins [5].

To get an idea of how much of $\mathcal O$ is in a sample, and what impurities might be present, we first need to consider what fragments $\mathcal O$ will produce. Oligonucleotides are short strands of DNA or RNA; polymers with a backbone of sugars (such as ribose in RNA) connected by linkers (e.g. a phosphodiester bond), where each sugar has an attached base which encodes genetic information [6].

On each monomer, there are two sites where fragmentation is likely to occur: at the linker (backbone cleavage) or between the base and sugar (base loss). Specifically, depending on which bond within the linker is broken, there are four modes of backbone cleavage [7,8].
We include in $\mathcal F$ every product of a single fragmentation of $\mathcal O$ — any of the four backbone cleavage modes or base loss anywhere along the nucleotide — as well as the results of every combination of two fragmentations (different cleavage modes at the same linker are mutually exclusive).

Sparse Richardson-Lucy Algorithm

Suppose we have a chemical sample which we have fragmented and analysed by mass spectrometry. This gives us a spectrum across $n$ bins (each bin corresponding to a small m/z range), and we represent this spectrum with the column vector $\mathbf{b}\in\mathbb R^n$, where $b_i$ is the intensity in the $i^{th}$ bin. For a set $\{f_1,\ldots,f_m\}=\mathcal F$ of possible fragments, let $x_j$ be the amount of $f_j$ that is actually present. We would like to estimate the amounts of each fragment based on the spectrum $\mathbf b$.

If we had a sample comprising a unit amount of a single fragment $f_j$, so $x_j=1$ and $x_{k\ne j}=0$, and this produced a spectrum $\begin{pmatrix}a_{1j}&\ldots&a_{nj}\end{pmatrix}^T$, we can say the intensity contributed to bin $i$ by $x_j$ is $a_{ij}$. In mass spectrometry, the intensity in a single bin due to a single fragment is linear in the amount of that fragment, and the intensities in a single bin due to different fragments are additive, so in some general spectrum we have $b_i=\sum_j x_ja_{ij}$.

If we construct a library matrix $\mathbf{A}\in\mathbb R^{n\times m}$ such that $\{\mathbf A\}_{ij}=a_{ij}$ (so the columns of $\mathbf A$ correspond to fragments in $\mathcal F$), then in ideal conditions the vector of fragment amounts $\mathbf x=\begin{pmatrix}x_1&\ldots&x_m\end{pmatrix}^T$ solves $\mathbf{Ax}=\mathbf{b}$. In practice this exact solution is not found — due to experimental noise and potentially because there are contaminant fragments in the sample not included in $\mathcal F$ — and we instead make an estimate $\mathbf {\hat x}$ for which $\mathbf{A\hat x}$ is close to $\mathbf b$.

Note that the columns of $\mathbf A$ correspond to fragments in $\mathcal F$: the values in a single column represent intensities in each bin due to a single fragment only. We $\ell_1$-normalise these columns, meaning the total intensity (over all bins) of each fragment in the library matrix is uniform, and so the values in $\mathbf{\hat x}$ can be directly interpreted as relative abundances of each fragment.

The observed intensities — as counts of fragments incident on each bin — are realisations of latent Poisson random variables. Assuming these variables are i.i.d., it can be shown that the estimate of $\mathbf{x}$ which maximises the likelihood of the system is approximated by the iterative formula

$
\begin{aligned}
\mathbf {\hat{x}}^{(t+1)}=\left(\mathbf A^T \frac{\mathbf b}{\mathbf{A\hat x}^{(t)}}\right)\odot \mathbf{\hat x}^{(t)}.
\end{aligned}
$

Here, quotients and the operator $\odot$ represent (respectively) elementwise division and multiplication of two vectors. This is known as the Richardson-Lucy algorithm [9].

In practice, when we enumerate oligonucleotide fragments to include in $\mathcal F$, most of these fragments will not actually be produced when the oligonucleotide passes through a mass spectrometer; there is a large space of possible fragments and (beyond knowing what the general fragmentation sites are) no well-established theory allowing us to predict, for a new oligonucleotide, which fragments will be abundant or negligible. This means we seek a sparse estimate, where most fragment abundances are zero.

The Richardson-Lucy algorithm, as a maximum likelihood estimate for Poisson variables, is analogous to ordinary least squares regression for Gaussian variables. Likewise lasso regression — a regularised least squares regression which favours sparse estimates, interpretable as a maximum a posteriori estimate with Laplace priors — has an analogue in the sparse Richardson-Lucy algorithm:

$
\begin{aligned}
\mathbf {\hat{x}}^{(t+1)}=\left(\mathbf A^T \frac{\mathbf b}{\mathbf{A\hat x}^{(t)}}\right)\odot \frac{ \mathbf{\hat x}^{(t)}}{\mathbf 1 + \lambda},
\end{aligned}
$

where $\lambda$ is a regularisation parameter [10].
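As a concrete illustration, here is a minimal sketch of the sparse Richardson-Lucy update above. It is my own toy implementation: the variable names, the flat positive initialisation, the guard against division by zero and the synthetic three-fragment library are illustrative choices, not part of the algorithm as published in [10].

import numpy as np

def sparse_richardson_lucy(A, b, lam=0.01, n_iter=500):
    # A: (n_bins, n_fragments) library matrix with l1-normalised columns
    # b: (n_bins,) observed spectrum; lam = 0 recovers the plain Richardson-Lucy algorithm
    n_frag = A.shape[1]
    x = np.full(n_frag, b.sum() / n_frag)            # flat positive initialisation
    for _ in range(n_iter):
        ratio = b / np.maximum(A @ x, 1e-12)         # elementwise division, guarded against zeros
        x = (A.T @ ratio) * x / (1.0 + lam)          # multiplicative sparse Richardson-Lucy update
    return x

# Toy usage: three overlapping 'fragments', spectrum drawn from a Poisson mixture of two of them
rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(100, 3)))
A /= A.sum(axis=0, keepdims=True)                    # l1-normalise the columns
x_true = np.array([50.0, 0.0, 20.0])
b = rng.poisson(A @ x_true)
print(np.round(sparse_richardson_lucy(A, b), 2))     # roughly recovers x_true, shrunk slightly by lam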

Library Generation

For each oligonucleotide fragment $f\in\mathcal F$, we smooth and bin the m/z values of the most abundant isotopes of $f$, and store these values in the columns of $\mathbf A$. However, if these are the only fragments in $\mathcal F$ then impurities will not be identified: the sparse Richardson-Lucy algorithm will try to fit oligonucleotide fragments to every peak in the spectrum, even ones that correspond to fragments not from the target oligonucleotide. Therefore we also include ‘dummy’ fragments corresponding to single peaks in the spectrum — the method will fit these to non-oligonucleotide peaks, showing the locations of any impurities.

Results

For a mass spectrum from a sample containing a synthetic oligonucleotide, we generated a library of oligonucleotide and dummy fragments as described above, and applied the sparse Richardson-Lucy algorithm. Below, the model fit is plotted alongside the (smoothed, binned) spectrum and the ten most abundant fragments as estimated by the model. These fragments are represented as bars with binned m/z at the peak fragment intensity, and are separated into oligonucleotide fragments and dummy fragments indicating possible impurities. All intensities and abundances are Anscombe transformed ($x\rightarrow\sqrt{x+3/8}$) for clarity.

As the oligonucleotide in question is proprietary, its specific composition and fragmentation is not mentioned here, and the bins plotted have been transformed (without changing the shape of the data) so that individual fragment m/z values are not identifiable.

We see the data is fit extremely closely, and that the spectrum is quite clean: there is one very pronounced peak roughly in the middle of the m/z range. This peak corresponds to one of the oligonucleotide fragments in the library, although there is also an abundant dummy fragment slightly to the left inside the main peak. Fragment intensities in the library matrix are smoothed, and it may be the case that the smoothing here is inappropriate for the observed peak, hence other fragments being fit at the peak edge. Investigating these effects is a target for the rest of the project.

We also see several smaller peaks, most of which are modelled with oligonucleotide fragments. One of these peaks, at approximately bin 5352, has a noticeably worse fit if excluding dummy fragments from the library matrix (see below). Using dummy fragments improves this fit and indicates a possible impurity. Going forward, understanding and quantification of these impurities will be improved by including other common fragments in the library matrix, and by grouping fragments which correspond to the same molecules.

References

[1] Junetsu Igarashi, Yasuharu Niwa, and Daisuke Sugiyama. “Research and Development of Oligonucleotide Therapeutics in Japan for Rare Diseases”. In: Future Rare Diseases 2.1 (Mar. 2022), FRD19.

[2] Karishma Dhuri et al. “Antisense Oligonucleotides: An Emerging Area in Drug Discovery and Development”. In: Journal of Clinical Medicine 9.6 (6 June 2020), p. 2004.

[3] Catherine J. Mummery et al. “Tau-Targeting Antisense Oligonucleotide MAPTRx in Mild Alzheimer’s Disease: A Phase 1b, Randomized, Placebo-Controlled Trial”. In: Nature Medicine (Apr. 24, 2023), pp. 1–11.

[4] Benjamin D. Boros et al. “Antisense Oligonucleotides for the Study and Treatment of ALS”. In: Neurotherapeutics: The Journal of the American Society for Experimental NeuroTherapeutics 19.4 (July 2022), pp. 1145–1158.

[5] Ingvar Eidhammer et al. Computational Methods for Mass Spectrometry Proteomics. John Wiley & Sons, Feb. 28, 2008. 299 pp.

[6] Harri Lönnberg. Chemistry of Nucleic Acids. De Gruyter, Aug. 10, 2020.

[7] S. A. McLuckey, G. J. Van Berkel, and G. L. Glish. “Tandem Mass Spectrometry of Small, Multiply Charged Oligonucleotides”. In: Journal of the American Society for Mass Spectrometry 3.1 (Jan. 1992), pp. 60–70.

[8] Scott A. McLuckey and Sohrab Habibi-Goudarzi. “Decompositions of Multiply Charged Oligonucleotide Anions”. In: Journal of the American Chemical Society 115.25 (Dec. 1, 1993), pp. 12085–12095.

[9] Mario Bertero, Patrizia Boccacci, and Valeria Ruggiero. Inverse Imaging with Poisson Data: From Cells to Galaxies. IOP Publishing, Dec. 1, 2018.

[10] Elad Shaked, Sudipto Dolui, and Oleg V. Michailovich. “Regularized Richardson-Lucy Algorithm for Reconstruction of Poissonian Medical Images”. In: 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Mar. 2011, pp. 1754–1757.

 

Compass students attending AISTATS 2023 in Valencia

We (Ed Davis, Josh Givens, Alex Modell, and Hannah Sansford) attended the 2023 AISTATS conference in Valencia in order to explore the interesting research being presented as well as present some of our own work. While we talked about our work being published at the conference in this earlier blog post, having now attended the conference, we thought we’d talk about our experience there. We’ll spotlight some of the talks and posters which interested us most and talk about our highlights of Valencia as a whole.

Talks & Posters

Mode-Seeking Divergences: Theory and Applications to GANs

One especially interesting talk and poster at the conference was presented by Cheuk Ting Li on their work in collaboration with Farzan Farnia. This work aims to set up a formal classification of various probability measure divergences (such as f-divergences, Wasserstein distance, etc.) in terms of their mode-seeking or mode-covering properties. By mode-seeking/mode-covering we mean the behaviour of the divergence when used to fit a unimodal distribution to a multi-modal target. Specifically, a mode-seeking divergence will encourage the fitted distribution to capture just one of the modes, ignoring the others, while a mode-covering divergence will encourage the distribution to cover all modes, leading to less accurate fitting of any individual mode but better coverage of the full support of the target. While these notions of mode-seeking and mode-covering divergences had been discussed before, up to this point there was no formal definition of these properties, and there was disagreement on the appropriate categorisation of some divergences. This work presents such a definition and uses it to categorise many of the popular divergences. Additionally, they show how an additive combination of a mode-seeking f-divergence and the 1-Wasserstein distance retains the mode-seeking property of the f-divergence while being implementable using only samples from the target distribution (rather than knowledge of the distribution itself), making it a desirable divergence for use with GANs.

Talk: https://youtu.be/F7LdHIzZQow

Paper: https://proceedings.mlr.press/v206/ting-li23a.html

Using Sliced Mutual Information to Study Memorization and Generalization in
Deep Neural Networks

The benefit of attending large conferences like AISTATS is having the opportunity to hear talks that are not related to your main research topic. A talk by Wongso et al. was one such talk: although it did not overlap with any of our main research areas, we all found it very interesting.
The talk was on the topic of tracking memorisation and generalisation in deep neural networks (DNNs) through the use of sliced mutual information. Mutual information (MI) is commonly used in information theory and represents the reduction of uncertainty about one random variable given knowledge of the other. However, MI is hard to estimate in high dimensions, which makes it a prohibitive metric for use in neural networks.
Enter sliced mutual information (SMI). This metric is the average of the MI terms between one-dimensional projections of the two variables. The main difference between SMI and MI is that SMI is scalable to high dimensions, with estimation that scales much more favourably than for MI.
Next, let’s talk about memorisation. Memorisation is known to occur in DNNs: the network fits random or noisy labels seen during training, leading to bad generalisation. The authors demonstrate this behaviour by fitting a multi-layer perceptron to the MNIST dataset with various amounts of label noise. As the noise increased, the difference between the training and test accuracy became greater.
As the label noise increases, the MI between the features and target variable does not change, meaning that MI did not track the loss in generalisation. However, the authors show that the SMI did track the generalisation. As the label noise increased, the SMI decreased significantly as the MLP’s generalisation got worse. Their main theorem showed that SMI is lower-bounded by a term which includes the spherical soft-margin separation, a quantity which is used to track memorisation and generalisation!
In summary, unlike MI, SMI can track memorization and generalisation in DNNs. If you’d like to know more, you can find the full paper here: https://proceedings.mlr.press/v206/wongso23a.html.

Invited Speakers and the Test of Time Award

As well as the talks on papers that had been selected for oral presentation, each day began with a (longer) invited talk which, for many of us, was a highlight of the day. The invited speakers were extremely engaging and covered varied and interesting topics; from Arthur Gretton (UCL) presenting ‘Causal Effect Estimation with Context and Confounders’ to Shakir Mohamed (DeepMind) presenting ‘Elevating our Evaluations: Technical and Sociotechnical Standards of Assessment in Machine Learning’. A favourite amongst us was a talk from Tamara Broderick (MIT) titled ‘An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?’. In this talk she addressed a worry that researchers might have when the goal is to analyse a data sample and apply any conclusions to a new population: was a small proportion of the data sample instrumental to the original conclusion? Tamara and collaborators propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. They find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. In experiments they find that many data analyses are robust, but that the conclusions of several influential economics papers can be changed by removing (much) less than 1% of the data! A link to the talk can be found here: https://youtu.be/QYtIEqlwLHE

This year, AISTATS featured a Test of Time Award to recognise a paper from 10 years ago that has had a prominent impact in the field. It was awarded to Andreas Damianou and Neil Lawrence for the paper ‘Deep Gaussian Processes’, and their talk was a definite highlight of the conference. Many of us had seen Neil speak at a seminar at the University of Bristol last year and, being the engaging speaker he is, we were looking forward to hearing him speak again. Rather than focussing on the technical details of the paper, Neil’s talk concentrated on his (and the machine learning community’s) research philosophy in the years preceding the paper, and how the paper came about: a very interesting insight, and a refreshing break from the many technical talks!

Valencia

There was so much to like about Valencia even from our short stay there. We’ll try and give you a very brief highlight of our favourite things.

Food & Drink:

Obviously Valencia is renowned for being the birthplace of paella, and while the paella was good, we sampled many other delights during our stay. Our collective highlight was the nicest burrata any of us had ever had, which, in a stunning display of individualism, all four of us decided to get on our first day at the conference.

Artist rendition of our 4 meals.

Beach:

About half an hour’s tram ride from the conference centre are the beaches of Valencia. These stretch for miles, with a good 100m depth of (surprisingly hot) sand covering the lot. We visited after the end of the conference on the Thursday and, despite it being the only cloudy day of the week, it was a perfect way to relax at the end of a hectic few days, with the pleasantly temperate water an added bonus.

Architecture:

Valencia has so much interesting architecture scattered around the city centre. One of the most memorable places was the San Nicolás de Bari and San Pedro Mártir (Church of San Nicolás), known as the Sistine Chapel of Valencia (according to the audio guide for the church, at least), with its incredible painted ceiling and live organ playing.

Ceiling of the Church of San Nicolás

 
