# Tag: University of Bristol

## Student perspectives: Extending multilevel network meta-regression to disconnected networks and single-arm studies


A post by Sam Perren, PhD student on the Compass programme.

Over the past year, my research has been focused on a method called network meta-analysis (NMA), which is widely used in healthcare decision-making to summarise evidence on the relative effectiveness of different treatments. In particular, I have been interested in the challenges presented by disconnected networks of evidence and single-arm studies and aim to extend the *multinma* package to handle these challenges. Recently, I presented at the International Society for Clinical Biostatistics (ISCB) conference in Thessaloniki, Greece. In this blog post, I will outline the key points from that presentation and discuss the latest developments from my research.

**Network meta-analysis**

Network meta-analysis (NMA) pools summary treatment effects from randomised controlled trials (RCTs) to estimate relative effects between multiple treatments [1]. NMA summarises all direct and indirect evidence about treatment effects, allowing comparisons to be made between all pairs of treatments [2]. Covariates such as age, biomarker status, or disease severity can be either effect modifiers, which interact with treatment effects, or prognostic factors, which predict outcomes without interacting with treatment effects [3]. NMA requires a connected network, in which every pair of treatments is joined either directly or indirectly through a series of comparisons [4]. Plot 1 demonstrates the NMA assumption of constancy of relative effects: the AB effect observed in study AB would be exactly the same in study AC, had a B arm been included. However, this assumption can break down if there are differences in effect modifiers between studies, which can lead to bias [6].

**Population adjustments & IPD network meta-regression**

Population adjustment methods aim to relax the assumption of constancy of relative effects by using available individual patient data (IPD) to adjust for differences between study populations [3]. A network where IPD are available from every study enables the use of IPD network meta-regression, which is considered the gold standard. However, having IPD from every study in a network is rare; some studies may only provide aggregate data (AgD) in published papers.

**Multilevel Network Meta-Regression**

Multilevel network meta-regression (ML-NMR) is a population adjustment method that extends the NMA framework to synthesise mixtures of IPD and AgD. ML-NMR can produce estimates from networks of any size and for any given target population. It does this by first defining an individual-level regression model on the IPD, then averaging (integrating) this model over each aggregate study population to form the aggregate-level model, using efficient and general numerical integration [5].

**Disconnected networks**

Healthcare policymakers are increasingly encountering disconnected networks of evidence, which often include studies without control groups (single-arm studies) [6]. Very strong assumptions are required to make comparisons in a disconnected network, such as adjusting for all prognostic factors and all effect modifiers, which may not always be feasible with the available data. Current methods to handle disconnected networks include unanchored matching-adjusted indirect comparison (MAIC) [7] and simulated treatment comparison (STC) [8]. However, these methods have limitations: they cannot generate estimates for target populations outside the network of evidence that might be relevant to decision makers, and they are limited to a two-study scenario. So there remains a need for more flexible and robust methods, such as an extended version of the ML-NMR approach, to better handle disconnected networks of evidence.

**Example: Plaque Psoriasis**

We use a network of six active treatments plus placebo, all used to treat moderate-to-severe plaque psoriasis, previously analysed by Phillippo et al. [9]. In this network, we have AgD from the CLEAR, ERASURE, FEATURE, FIXTURE, and JUNCTURE studies, and IPD from the IXORA-S, UNCOVER-1, UNCOVER-2, and UNCOVER-3 studies. Outcomes of interest include success/failure to achieve at least 75%, 90%, or 100% improvement on the Psoriasis Area and Severity Index (PASI) scale at 12 weeks compared to baseline, denoted PASI 75, PASI 90, and PASI 100, respectively. We adjust for potential effect modifiers, including duration of psoriasis, previous systemic treatment, body surface area affected, weight, and psoriatic arthritis.

This network of evidence (Plot 2) is connected: every pair of treatments is joined by a path of study comparisons. We will now disconnect this network to illustrate different methods for reconnecting it using ML-NMR, and then compare the results back to the “true” results from the full evidence network. We removed the CLEAR study, removed the placebo arms from the ERASURE, FEATURE, and JUNCTURE studies, and removed the Secukinumab 150 mg and Secukinumab 300 mg arms from the FIXTURE study in the AgD. $N_1$ (left-hand side) contains studies comparing the two doses of Secukinumab, 150 mg and 300 mg; $N_2$ contains studies comparing all other treatments. We are then faced with the challenge of making valid comparisons between treatments in these two sub-networks, illustrated in Plot 3.

**Reconnected network – internal evidence**

One approach is to combine two AgD studies from opposite sides of the network into a single study. The FIXTURE study is the only AgD study in $N_2$. To determine the appropriate study in $N_1$ to combine it with, aggregate-level matching is used [10]. This involves selecting the study that minimises the Euclidean distance between the observed sets of covariates. Table 1 shows that the ERASURE study has the most similar characteristics to FIXTURE. As a result, these two studies are combined into a new four-arm study, referred to as FIXTURE/ERASURE, effectively bridging the gap in the network.
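As an illustration of the matching step, here is a minimal Python sketch of aggregate-level matching; the covariate summaries below are made-up numbers for illustration, not the real trial data.

```python
import numpy as np

# Hypothetical study-level covariate summaries (e.g. proportion male, mean
# duration of psoriasis, mean body surface area affected); illustrative only.
candidates = {
    "ERASURE":  [0.52, 17.3, 33.1],
    "FEATURE":  [0.61, 15.9, 29.8],
    "JUNCTURE": [0.58, 18.5, 35.0],
}
fixture = np.array([0.55, 17.0, 32.5])

names = list(candidates)
cov = np.array([candidates[s] for s in names])

# Standardise so that no single covariate dominates the Euclidean distance.
mean, sd = cov.mean(axis=0), cov.std(axis=0)
dists = np.linalg.norm((cov - mean) / sd - (fixture - mean) / sd, axis=1)

print(dict(zip(names, dists.round(2))), "->", names[int(dists.argmin())])
```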

**Reconnected network – external evidence**

Another method we used to reconnect the network is to incorporate external observational studies, specifically “Chiricozzi” and “PROSPECT”, both of which observe the effects of Secukinumab 300 mg. We incorporated these single-arm studies into the FIXTURE study as if they were part of the original trial, thereby bridging the network. As a result, we end up with two separate reconnected networks, each using one of the observational studies.

**Producing Population-Average Estimates**

We have four networks for comparison: the fully connected network; the network reconnected using a single-arm study (Chiricozzi); the network reconnected using a single-arm study (PROSPECT); and the network reconnected using aggregate-level matching (FIXTURE/ERASURE). For each network, we will run both ML-NMR and standard NMA without regression. These analyses will produce population-adjusted relative treatment effects and probabilities of achieving a 75% improvement on the Psoriasis Area and Severity Index (PASI 75).

The ML-NMR results in the fully connected network will serve as the gold standard. We will compare the results obtained from the different methods (ML-NMR vs. NMA) and across the various networks (full vs. reconnected) to evaluate the impact of the different approaches on the relative treatment effects and outcome probabilities.

**Relative Effects vs Placebo**

Plot 6 shows the probit relative treatment effects versus placebo across three populations: FEATURE ($N_1$), UNCOVER-1 ($N_2$), and the external population, PROSPECT. The results demonstrate that for treatments in $N_2$, the estimates produced by both NMA and ML-NMR are generally close to the gold standard. This similarity between NMA and ML-NMR is largely due to the homogeneity of the populations within the networks and the limited set of covariates we used, chosen to match the original analysis. However, the NMA results show narrower intervals than ML-NMR, which may suggest overconfidence in the NMA model; ML-NMR accounts for more sources of complexity and variability, and therefore produces wider estimates.

For the Prospect population, the NMA results exhibit slight bias, likely due to differences between the external population and the network populations.

Results for treatments in $N_1$ show varying degrees of accuracy when compared to the gold standard in all populations. Among the reconnected networks, the FIXTURE/ERASURE and PROSPECT reconnected networks perform relatively well, while the Chiricozzi-based network struggles to match the gold standard results. This is because Chiricozzi differs the most on covariates from all the other populations.

In other words, when comparisons are made across the created “bridges” in the reconnected networks, bias can be introduced into our results.

The plot above shows the network reconnected using the PROSPECT and Chiricozzi external studies, and illustrates what we mean by comparisons across the “bridge”. All of the results above are relative to placebo (PBO), which is in $N_2$. If we want to compare treatments from $N_1$ with placebo, we need to use these generated direct comparisons, or “bridges”.

### Absolute probability of PASI75

In Plot 8, the FEATURE population results are very close to the gold standard for treatments in $N_1$, but results for treatments in $N_2$ show some bias. Unlike for the probit differences, the reference treatments for FEATURE are now Secukinumab 150 mg and 300 mg (SEC_150 & SEC_300), so in order to estimate absolute outcomes for $N_2$ treatments we need to use our “bridges”, thereby incurring bias. The narrative is the same for the other two populations: UNCOVER-2 is in $N_2$, so estimates for treatments in $N_1$ are biased compared to the gold standard, depending on the network used; and PROSPECT’s reference treatment is Secukinumab 150 mg ($N_1$), so results for $N_2$ treatments vary from the gold standard.

### Key Findings

When producing estimates across reconnected networks, there’s a risk that the estimates may be biased or deviate from the true value. In our analysis, reconnecting the networks using ML-NMR showed little improvement over NMA. These results highlight the importance of carefully selecting studies to bridge networks and minimise bias. As disconnected networks become more common, it’s clear that better tools for evidence synthesis are needed to ensure reliable results that can inform clinical decisions and improve outcomes.

**Future Work**

To improve the performance of ML-NMR over NMA, we will try incorporating more covariates into the regression model. We also plan to conduct a comprehensive simulation study to compare methods under various scenarios and explore additional approaches, such as class effects. Developing methods to assess the strong assumptions required for reconnecting networks will be another priority. Finally, we aim to implement these methods within the *multinma* package.

### References

[1] – Sofia Dias, Anthony E Ades, Nicky J Welton, Jeroen P Jansen, and Alexander J Sutton. Network meta-analysis for decision-making. John Wiley & Sons, 2018.

[2] – Song F, Altman DG, Glenny AM, Deeks JJ. Validity of indirect comparison for estimating efficacy of competing interventions: empirical evidence from published meta-analyses. BMJ. 2003 Mar 1;326(7387):472.

[3] – David M Phillippo, Anthony E Ades, Sofia Dias, Stephen Palmer, Keith R Abrams, and Nicky J Welton. Methods for population-adjusted indirect comparisons in health technology appraisal. Medical decision making, 38(2):200–211, 2018

[4] – Sofia Dias, Alex J Sutton, AE Ades, and Nicky J Welton. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Medical Decision Making, 33(5):607–617, 2013

[5] – David M Phillippo, Sofia Dias, AE Ades, Mark Belger, Alan Brnabic, Alexander Schacht, Daniel Saure, Zbigniew Kadziola, and Nicky J Welton. Multilevel network meta-regression for population-adjusted treatment comparisons. Journal of the Royal Statistical Society. Series A (Statistics in Society), 183(3):1189, 2020

[6] – John W Stevens, Christine Fletcher, Gerald Downey, and Anthea Sutton. A review of methods for comparing treatments evaluated in studies that form disconnected networks of evidence. Research synthesis methods, 9(2):148–162, 2018

[7] – Signorovitch, James E., et al. “Matching-adjusted indirect comparisons: a new tool for timely comparative effectiveness research.” *Value in Health* 15.6 (2012): 940-947.

[8] – Caro JJ, Ishak KJ. No head-to-head trial? Simulate the missing arms. Pharmacoeconomics. 2010;28(10):957–67.

[9] – David M Phillippo, Sofia Dias, AE Ades, Mark Belger, Alan Brnabic, Daniel Saure, Yves Schymura, and Nicky J Welton. Validating the assumptions of population adjustment: application of multilevel network meta-regression to a network of treatments for plaque psoriasis. Medical Decision Making, 43(1):53–67, 2023

[10] – Leahy, Joy, et al. “Incorporating single‐arm evidence into a network meta‐analysis using aggregate level matching: assessing the impact.” *Statistics in medicine* 38.14 (2019): 2505-2523.

## Student Perspectives: Bayesian LLM Finetuning

A post by Sam Bowyer, PhD student on the Compass programme.

Training large AI models is tricky business. First you’ll want to raise money — and lots of it. (OpenAI’s GPT-4 reportedly cost over $100 million to train, roughly equivalent to 0.5% of Bristol’s GDP.) With that money you’ll need to buy hardware (25,000 NVIDIA A100 GPUs should do), hire a team of talented engineers, and purchase licensing to vast quantities of data (though you might consider foregoing that last one and just hope no one complains…). Once you’ve collected enough data (say, ~13 trillion tokens-worth^{1}) and settled on a model architecture with hundreds of billions, if not trillions, of parameters (each taking up at least a byte of memory), you can sit back and wait around for 100 days whilst your engineers firefight software and hardware crashes to steer your model’s training to completion.^{2}

But for those of us who can’t afford the $10^{25}$ FLOPs (floating point operations) needed to train such a model (or who might want to avoid the associated environmental costs), what can we do? The answer lies in finetuning: taking one of the available pretrained ‘foundation’ models (such as ChatGPT, or an open source model such as one from Meta’s Llama series) and tweaking them to suit your own purposes.

The basic idea is this: these foundation models are great multitaskers, they’ve been trained well enough to generate reasonable outputs to a wide variety of inputs, but if you’re only interested in using them on a particular set of data ($\mathcal{D}_\text{finetune}$), or for a particular task, then it might be a good idea to spend some extra time training on that data specifically, after the rest of (pre)training has taken place. Similarly, it’s worth noting that the foundation model you get straight out of pretraining will mimic its input dataset, $\mathcal{D}$. In the case that $\mathcal{D}$ is too large to be checked by humans (e.g. 13 trillion tokens — essentially including most of the public internet), your model will almost certainly have learnt undesirable behaviour and be capable of producing dangerous, offensive, and harmful output. Finetuning is critical to the pursuit of safe AI, putting guardrails in place and ensuring that a model’s behaviour is aligned with our desires, both in terms of utility *and* safety.^{3}

In this blog post, I’ll give an overview of LLM finetuning, specifically *parameter-efficient finetuning*, which tackles the problem of finetuning models whilst avoiding the computational burden that was required for pretraining. Even if your finetuning dataset $\mathcal{D}_\text{finetune}$ is much smaller than your pretraining set $\mathcal{D}$, you’ve still got the computational problem of the model’s size: how do you efficiently^{4} do gradient-based optimisation on a model with potentially billions of parameters? I’ll also argue that taking a Bayesian approach can be beneficial, and that whilst the added computational cost of Bayes might not be feasible (or even all that helpful) in the pretraining setting, these costs are much less impactful when finetuning.

## Parameter Efficient Finetuning

Perhaps the simplest way to finetune a model on $\mathcal{D}_\text{finetune}$ is to simply carry on training as before — with some gradient-based optimiser like Adam [1] — but on this new dataset (often repeatedly, i.e. for multiple ‘epochs’). This is known as *full finetuning* (FFT) and usually leads to the best results; however, it’s often infeasible due to the size of the model being finetuned.

Recall that the model we’re working with might have billions of parameters — in order to train these parameters we need to store not only their values, but also their gradients, as well as the activation values of each neuron in the network and, depending on your optimiser, potentially momentum and second-order gradient information (e.g. Adam makes use of the exponential moving average of gradients *and* the EMA of squared gradients — all per parameter). On a model like Llama-7B, whose 7 billion parameters at 8-bit precision require 7GB of storage, these extra gradient costs can easily overwhelm the 16GB capacity of a typical high-end consumer GPU such as an NVIDIA RTX 4080. (Add to that the fact that we usually want to batch our input data — that is, pass multiple input examples through the model at a time — and you can see where things start to spiral out of control.)
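As a back-of-the-envelope illustration (generously assuming the gradients and both of Adam's moving averages are also stored at 8-bit precision; in practice optimiser state is usually kept at higher precision, making things worse):

```python
params = 7e9  # Llama-7B parameter count

# Bytes per parameter for each quantity that must live in GPU memory.
costs = {"weights": 1, "gradients": 1, "Adam EMA of grads": 1, "Adam EMA of squared grads": 1}

total_gb = params * sum(costs.values()) / 1e9
print(f"{total_gb:.0f} GB")  # 28 GB: already above a 16 GB consumer GPU, before activations
```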

This motivates the need for finetuning algorithms with a smaller memory footprint. There’s an exciting field of literature on model compression and quantisation — using compression techniques to represent your model and its gradients with fewer and fewer bits^{5} — but another approach is to simply reduce the number of parameters that you train during finetuning. However, choosing which parameters to train and which to freeze (thus freeing up space that would’ve gone to storing the gradient information of those parameters) is far from trivial.

## Partial Finetuning

In order to discuss finetuning techniques, it’ll be useful to briefly touch on the basic architecture of neural networks. The simplest type of neural network is a multilayer perceptron, or MLP, which consists of $L$ layers in which the output of layer $l-1$, $x^{l-1} \in \mathbb{R}^{d_{l-1}}$, is multiplied by a learnable weight matrix $W^l \in \mathbb{R}^{d_{l} \times d_{l-1}}$ and added to a learnable bias vector $b^l \in \mathbb{R}^{d_{l}}$ before being transformed through a nonlinearity, such as a sigmoid $\sigma(x) = (1+e^{-x})^{-1}$:

$$x^l = \sigma(W^l x^{l-1} + b^l),$$

with $x^0 \in \mathbb{R}^{d_0}$ being input data.

A common strategy for finetuning is to freeze all weights in earlier layers, say, up until the final $\hat{L}$ layers, and only train the set of parameters $\{W^l, b^l \mid l > L-\hat{L}\}$. Assuming constant network width $d = d_0 = \ldots = d_L$, this reduces the number of trainable parameters from $L(d^2 + d)$ to $\hat{L}(d^2 + d)$.

Another simple finetuning strategy is BitFit [3], which works by only training the bias parameters, leading to a total of $Ld$ trainable parameters (though of course this does make the iterative finetuning updates significantly less expressive).
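A minimal PyTorch sketch of both strategies on a toy MLP (the architecture and the choice $\hat{L} = 2$ are hypothetical):

```python
import torch.nn as nn

# Toy L-layer MLP: each block is a Linear layer followed by a sigmoid.
mlp = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(6)])

# Strategy 1: freeze everything, then unfreeze only the final L_hat blocks.
for p in mlp.parameters():
    p.requires_grad = False
L_hat = 2
for block in list(mlp)[-L_hat:]:
    for p in block.parameters():
        p.requires_grad = True

# Strategy 2 (BitFit), applied instead: train only the bias vectors.
for name, p in mlp.named_parameters():
    p.requires_grad = name.endswith("bias")

print(sum(p.numel() for p in mlp.parameters() if p.requires_grad))  # trainable parameter count
```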

It’s important to note that the final-layers-only approach can also be applied more generally. Most LLM architectures use transformers [4] as their backbone, which — *very* loosely speaking — consist of multi-headed attention layers (another, more complicated type of neural network) followed by an MLP (plus a whole bunch of other stuff containing yet more parameters), with each transformer’s output typically going on to form the input of another transformer. So it’s common to see only the final transformer finetuned, or even only the final transformer’s MLP.

Since it would be ill-advised to take a long detour into the definition of multi-headed attention here (as that’d be fairly involved and might take the momentum out of our finetuning discussion), I won’t do that. (Instead, I’ll banish it to yet another (increasingly-obnoxious) footnote^{6}.)

## Adapter Tuning

Rather than retraining the weights already in your model, most modern finetuning approaches actually *add* new parameters to the model, termed ‘adapters’, and only train these instead. For example, [5], [6], and [7] all essentially propose techniques in which we insert two-layer MLPs at different places inside a transformer, with varying results.

Adapter methods have the benefit of being ‘plug-and-play’, in the sense that you can train multiple adapters on different finetuning tasks and then insert them into your model if you detect that it would be helpful for a user’s given request.

## Low Rank Adaptation (LoRA)

By far the most common (and almost de facto standard as of 2024) finetuning method is Low Rank Adaptation (LoRA) [8]. The intuition behind LoRA is that the parameters inside your pretrained model are probably fairly close to their finetuned optimal values already, in the sense that those optimal values can probably be reached using only updates in a low-rank subspace. As such, we can pose our finetuning problem as finding the low-rank update $\Delta W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$ to a given pretrained weight matrix $W_0$, leading to

$$W_\text{finetune} = W_0 + \Delta W,$$

where the low-rank of $\Delta W$ is enforced by parameterising it as $$\begin{aligned}\Delta W & = B A \\ B & \in \mathbb{R}^{d_\text{in} \times r} \\ A & \in \mathbb{R}^{r \times d_\text{out}} \end{aligned}$$so that $\text{rank}(\Delta W) \leq r \ll \text{rank}(W_0) \leq \min (d_\text{in}, d_\text{out})$. (Note that LoRA places the adapter in parallel to a pretrained weight matrix $W_0$, in contrast to the serial/in-between placement of the MLP adapters mentioned in the previous section.)
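To ground this, here is a minimal PyTorch sketch of a LoRA adapter placed in parallel to a frozen linear layer. Note that `nn.Linear` stores its weight as $(d_\text{out}, d_\text{in})$, the transpose of the convention above, so the shapes of $A$ and $B$ are swapped accordingly; the rank $r = 8$ and the initialisation scale are illustrative choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W_0 plus a trainable low-rank update BA."""
    def __init__(self, pretrained: nn.Linear, r: int = 8):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False                   # freeze W_0 (and its bias)
        d_out, d_in = pretrained.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init, so Delta W = 0 at the start

    def forward(self, x):
        # (W_0 + Delta W) x, with the adapter applied in parallel to W_0.
        return self.pretrained(x) + x @ (self.B @ self.A).T

layer = LoRALinear(nn.Linear(512, 512), r=8)  # 512*512 frozen weights vs 2*8*512 trainable
```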

LoRA’s success has led to a large number of variants, such as AdaLoRA [9], which adaptively decides which weight matrices to apply LoRA to based on their singular values. Other methods include PiSSA (Principal Singular Values and Singular Vectors Adaptation) [10], which performs LoRA updates only on the first few principal components of each weight matrix and freezes the ‘residuals’ which come from later principal components. One recent paper presents GaLore (Gradient Low Rank Projection) [11], which performs PCA on the weight matrix every few iterations and performs low-rank updates by optimising only in the (low-rank) space spanned by the first few principal components.

## Bayesian Finetuning

Although work has been done to introduce uncertainty estimation into pretraining, the results often aren’t worth the extra computational costs [12, 13]. Not only are the model sizes too large to make uncertainty quantification feasible, but the fact that your pretraining dataset, $\mathcal{D}$, is gigantic provides little uncertainty to reason about. However, in the context of finetuning we typically have a much smaller dataset, for which we’ll likely have much more uncertainty, and we also tend to work with far fewer parameters, allowing for extra computational budget to go towards the use of Bayesian methods.

Consider splitting up our finetuning set into prompt and target/response pairs $(X,y) \in \mathcal{D}_\text{finetune}$ where $X \in \mathcal{T}^{B \times n}$ is a matrix of $B$ sequences each of maximum length $n$ (potentially padded out with null-tokens) constructed with the token set $\mathcal{T}$, and $y \in \mathcal{Y}^B$ could be a corresponding batch of single tokens (in which case $\mathcal{Y} = \mathcal{T}$), or a batch of classification labels (e.g. in sentiment analysis, or multiple-choice Q&A, in which case $\mathcal{Y}$ might be different to $\mathcal{T}$).

What we fundamentally want to learn is a posterior distribution over all learnable parameters $$p(\theta | \mathcal{D}_\text{finetune}) = p(\theta | X, y),$$where, for example, in the case of LoRA finetuning, $\theta$ is the collection of all adapter weights $A$ and $B$. This not only gives us information about the uncertainty in the model’s parameters, which can be useful in itself, but can also be used to give us the posterior predictive distribution for a test input $x^* \in \mathcal{T}^n$, $$p(y^* | x^*, \mathcal{D}_\text{finetune}) = \int p(y^* | x^*, \theta)p(\theta|\mathcal{D}_\text{finetune})d\theta.$$

This is often more desirable than a predictive distribution that only uses a point estimate of $\theta$ and which would then ignore the uncertainty present in the model’s parameters.

### Bayesian LoRA (via Laplace Approximation and KFAC)

Yang et al. [14] suggest a method for finding the posterior $$p(\theta | X, y) \propto p(y | X, \theta)p(\theta)$$ post-hoc, i.e. after regular finetuning (with LoRA), using a Laplace approximation — which assumes the posterior is a Gaussian centered at the maximum a-posteriori (MAP) solution, $\theta_\text{MAP}$.

First, we note that the MAP solution can be written as the maximum of the log-joint $\mathcal{L}(y, X; \theta)$, $$\begin{align} \mathcal{L}(y, X; \theta) &= \log p(y | X, \theta) + \log p(\theta) = \log p(\theta | X, y) + \text{const} \\ \theta_\text{MAP} &= \arg\max_\theta \mathcal{L}(y, X; \theta). \end{align}$$

Then, assuming that the finetuning successfully optimised $\theta$, i.e. reached the parameter values $\theta_\text{MAP}$, the Laplace approximation involves taking the second-order Taylor expansion of the log-joint around $\theta_\text{MAP}$, $$\mathcal{L}(y, X; \theta) \approx \mathcal{L}(y, X; \theta_\text{MAP}) - \frac{1}{2}(\theta - \theta_\text{MAP})^T(\nabla_\theta^2 \mathcal{L}(y, X; \theta)|_{\theta_\text{MAP}})(\theta - \theta_\text{MAP}).$$ (The expansion’s first-order term disappears because the gradient of the MAP objective at $\theta_\text{MAP}$ is zero.) This quadratic form can then be written as a Gaussian density, with mean $\theta_\text{MAP}$ and covariance given by the inverse of the negative log-joint Hessian at the MAP: $$\begin{align}p(\theta | X, y) &\approx \mathcal{N}(\theta ; \theta_\text{MAP}, \Sigma), \\ \Sigma &= -(\nabla_\theta^2 \mathcal{L}(y, X; \theta)|_{\theta_\text{MAP}})^{-1}.\end{align}$$
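To make the recipe concrete, here is a toy sketch using a small Bayesian linear regression as a hypothetical stand-in for the LoRA log-joint; on an actual LLM the full Hessian is intractable, which is exactly why the KFAC approximation below is needed.

```python
import torch
from torch.autograd.functional import hessian

# Toy data and log-joint: y = X theta + noise, with a N(0, I) prior on theta.
torch.manual_seed(0)
X = torch.randn(50, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(50)

def log_joint(theta):
    log_lik = -0.5 * ((y - X @ theta) ** 2).sum() / 0.1**2
    log_prior = -0.5 * (theta ** 2).sum()
    return log_lik + log_prior

# Step 1: 'finetune' to the MAP by gradient ascent on the log-joint.
theta = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    (-log_joint(theta)).backward()
    opt.step()
theta_map = theta.detach()

# Step 2: Laplace approximation around the MAP.
H = hessian(log_joint, theta_map)   # Hessian of the log-joint at theta_MAP
Sigma = torch.linalg.inv(-H)        # approximate posterior: N(theta_MAP, Sigma)
```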

The authors make use of various tricks to render computing this Hessian inverse feasible, most notably Kronecker-Factored Approximate Curvature (KFAC) [15]. (A nice explanation of which can be found at this blog post.)

Using the Laplace approximation comes with added benefits. Specifically, we can make use of the Gaussian form of the (approximate) posterior to easily compute two values of interest: samples from the posterior predictive distribution, and estimates of the marginal likelihood.

For the first of these, we can *linearise* our model, with output $f_\theta(x^*)$ approximated by a first-order Taylor expansion around $\theta_\text{MAP}$, $$f_\theta(x^*) \approx f_{\theta_\text{MAP}}(x^*) + \nabla_\theta f_\theta(x^*)|^T_{\theta_\text{MAP}}(\theta - \theta_\text{MAP}).$$

We can write this as a Gaussian density $$f_\theta(x^*) \sim \mathcal{N}(y^*; f_{\theta_\text{MAP}}(x^*), \Lambda)$$ where $$\Lambda = (\nabla_\theta f_\theta(x^*)|^T_{\theta_\text{MAP}})\Sigma(\nabla_\theta f_\theta(x^*)|_{\theta_\text{MAP}}).$$

With this, we can easily obtain samples from our predictive posterior through reparameterised sampling of some Gaussian noise $\mathbf{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and a Cholesky decomposition $\Lambda = LL^T$: $$\hat{y} = f_\theta(x^*) = f_{\theta_\text{MAP}}(x^*) + L\mathbf{\xi}.$$
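Continuing the toy sketch above (with a scalar model output, so the Cholesky factor reduces to a square root):

```python
from torch.autograd.functional import jacobian

x_star = torch.randn(3)       # hypothetical test input
f = lambda th: x_star @ th    # model output f_theta(x*)

J = jacobian(f, theta_map)    # gradient of f w.r.t. theta at the MAP
Lam = J @ Sigma @ J           # predictive variance Lambda (a scalar here)
y_hat = f(theta_map) + Lam.sqrt() * torch.randn(())  # reparameterised predictive sample
```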

The second value of interest is the marginal likelihood (also known as the model evidence), which is useful for hyperparameter optimisation and can be computed simply as follows $$\begin{align}p(y|X) &= \int p(y|X,\theta)p(\theta)d\theta \\ &\approx \exp (\mathcal{L}(y, X; \theta_\text{MAP}))(2\pi)^{D/2}\det(\Sigma)^{1/2}.\end{align}$$

### Using Stein Variational Gradient Descent (SVGD)

A reasonable question to ask is whether it might be feasible to learn the posterior distribution *during* finetuning, rather than afterwards. One such method for achieving this is Stein variational gradient descent (SVGD) [16], in which a collection of $n$ parameter particles $\{\theta_i^{(0)}\}_{i=1}^n$ is iteratively updated to fit the true posterior using some similarity function (i.e. a kernel) $k: \Theta \times \Theta \to \mathbb{R}$, $$\begin{align}\theta_i^{(t+1)} &= \theta_i^{(t)} + \epsilon_i \phi(\theta_i^{(t)}) \\ \phi(\theta_i) &= \frac{1}{n} \sum_{j=1}^n \left[\frac{1}{T}k(\theta_j,\theta_i)\nabla_{\theta_j}\log p(\theta_j | \mathcal{D}_\text{finetune}) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right],\end{align}$$ where $\epsilon_i$ is a learning rate and $T$ is a temperature hyperparameter. The basic interpretation of the update is that the first term inside the summation drives particles towards areas of high posterior probability, whilst the second term penalises particles that are too similar to one another, acting as a repulsive force that encourages exploration of the parameter space.

Once the particles have converged, we can simply approximate the posterior predictive as the average output of the network across each parameter particle $\theta_i$, $$p(y^* | x^*, \mathcal{D}_\text{finetune}) \approx \frac{1}{n} \sum_{i=1}^n f_{\theta_i}(x^*).$$
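The update above translates fairly directly into code. Below is a self-contained sketch (with $T = 1$, an RBF kernel, and the common median-bandwidth heuristic) targeting a toy Gaussian posterior; for LoRA, $\theta$ would instead be the flattened adapter weights:

```python
import torch

def svgd_step(particles, log_post, eps=0.05):
    n = particles.shape[0]
    x = particles.detach().requires_grad_(True)
    grad_logp = torch.autograd.grad(log_post(x).sum(), x)[0]       # per-particle grad log p
    with torch.no_grad():
        d2 = torch.cdist(x, x) ** 2
        h = d2.median() / torch.log(torch.tensor(float(n)) + 1.0)  # median heuristic bandwidth
        K = torch.exp(-d2 / h)                                     # RBF kernel matrix
        # sum_j grad_{theta_j} k(theta_j, theta_i) = (2/h)(theta_i * sum_j K_ij - sum_j K_ij theta_j)
        grad_K = (2.0 / h) * (K.sum(1, keepdim=True) * x - K @ x)
        phi = (K @ grad_logp + grad_K) / n                         # driving + repulsive terms
        return x + eps * phi

log_post = lambda th: -0.5 * ((th - 2.0) ** 2).sum(-1)  # toy N(2, I) 'posterior'
particles = torch.randn(100, 2) * 3.0
for _ in range(500):
    particles = svgd_step(particles, log_post)
print(particles.mean(0))  # roughly [2, 2]
```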

My current research lies in applying SVGD to LoRA adapters. The hopes are that we can learn a richer, multi-modal posterior distribution using SVGD’s particles without making the Gaussian posterior assumption of the Laplace approximation. Recent concurrent work [17] applies a very similar technique to computer-vision tasks and achieves promising results.

## Conclusion

I hope this blog has been a useful introduction to the finetuning of LLMs. Feel free to get in touch if you’re interested! My email is sam.bowyer@bristol.ac.uk.

#### Footnotes

1: LLMs split input text up into a sequence of *tokens*. Roughly speaking, most words are split into one or two tokens depending on how common and how long they are. Using GPT-4’s tokenizer, this sentence is made from 17 tokens.

2: Spare a moment, if you will, for the Meta engineers behind the OPT-175B (175 billion-parameter) model, whose training logbook reads at times like that of a doomed ship at sea.

3: Note that in the case of LLMs specifically, the straight-out-of-pretraining model will also likely be a poor virtual assistant, in the way we tend to desire of chatbots like ChatGPT. A model which can complete sentences to match the general patterns found in $\mathcal{D}$ won’t necessarily be much good at the user-agent back-and-forth conversation style we’d like, and as such might not have properly ‘learnt’ how to, for example, follow instructions and answer questions. It’s because of this that most public-facing LLMs go through what’s known as *instruction fine-tuning*, in which the model is finetuned on a large dataset of instruction-following chat logs before being deployed.

4: That is, without using 25,000 GPUs…

5: Consider this paper [2] by Huang et al., which boasts 1.08-bit quantisation^{a} of 16-bit models, all whilst retaining impressive levels of performance.

a: i.e. representing parameters with an *average* precision of 1.08 bits.

### An (Ill-Advised) Aside: Attention

6:

Attention layers work by taking three matrices as input, $Q_\text{input}, K_\text{input}, V_\text{input} \in \mathbb{R}^{n \times d_\text{model}}$, typically representing $d_\text{model}$-dimensional embeddings of a sequence of $n$ tokens. First we project these matrices using learnable weight matrices $W^Q, W^K \in \mathbb{R}^{d_\text{model} \times d_k}$, and $W^V \in \mathbb{R}^{d_\text{model} \times d_v}$ to obtain our *queries*, *keys*, and *values*: $$\begin{align} Q &= Q_\text{input} W^Q \in \mathbb{R}^{n \times d_k} \\ K &= K_\text{input} W^K \in \mathbb{R}^{n \times d_k} \\ V &= V_\text{input} W^V \in \mathbb{R}^{n \times d_v}. \end{align}$$ With these, we then compute attention as $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where $\text{softmax}$ is applied over each row such that, denoting the $i$th row of the matrix $A = QK^T$ as $A^{(i)}$ and that row’s $j$th element as $a^{(i)}_j$, we define: $$\text{softmax}(A)^{(i)} = \frac{\exp A^{(i)}}{\sum_{j=1}^n \exp a^{(i)}_j}.$$

The intuition behind this is that our $n \times n$ attention matrix $\text{softmax}(QK^T/\sqrt{d_k})$ has entries representing how much token $i$ relates (or ‘attends’) to token $j$. The $\text{softmax}$ normalises each row so that the entries all add up to one, allowing us to think of each row as a distribution over tokens. The final multiplication with $V$ might then be thought of as selecting (or weighting) tokens in $V$ according to those distributions.
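To make this concrete, here is a minimal NumPy sketch of single-head attention (the shapes and random inputs are illustrative):

```python
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) matrix of token-token similarities
    return softmax(scores, axis=-1) @ V   # weight the rows of V by each row's distribution

n, d_model, d_k, d_v = 5, 16, 8, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))     # token embeddings (Q_input = K_input = V_input)
WQ = rng.standard_normal((d_model, d_k))
WK = rng.standard_normal((d_model, d_k))
WV = rng.standard_normal((d_model, d_v))
out = attention(X @ WQ, X @ WK, X @ WV)   # (n, d_v)
```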

One important limitation of the attention mechanism we’ve just described is that it only allows us to consider how each token attends to each other token in some universal way, whereas in reality there are multiple ways that words in a sentence (for example) can relate to each other. Because of this, most of the time we actually use *multi-headed* attention, in which we compute attention between the token sequences $H \in \mathbb{N}$ times, each time with different learnable weight matrices $W^Q_h, W^K_h, W^V_h$ for $h \in \{1,\ldots,H\}$. Then we combine these separate attention heads using yet another learnable weight matrix $W^O \in \mathbb{R}^{H d_v \times d_\text{model}}$, $$\text{MultiHead}(Q_\text{input}, K_\text{input}, V_\text{input}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_H)W^O \in \mathbb{R}^{n \times d_\text{model}},$$ where $\text{head}_h = \text{Attention}(Q_\text{input}W^Q_h, K_\text{input}W^K_h, V_\text{input}W^V_h)$. Allowing the model to learn different types of attention on different heads makes MHA an incredibly powerful and expressive part of a neural network.

To summarise and return to the discussion of finetuning: MHA layers contain a *ton* of learnable parameters (specifically, $2H d_\text{model} (d_k + d_v)$ of them).

### References

[1] Kingma, D.P., 2014. Adam: a method for stochastic optimization. *arXiv preprint arXiv:1412.6980*.

[2] Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M. and Qi, X., 2024. BiLLM: Pushing the limit of post-training quantization for LLMs. *arXiv preprint arXiv:2402.04291*.

[3] Zaken, E.B., Ravfogel, S. and Goldberg, Y., 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. *arXiv preprint arXiv:2106.10199*.

[4] Vaswani, A., 2017. Attention is all you need. *arXiv preprint arXiv:1706.03762*.

[5] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., 2019, May. Parameter-efficient transfer learning for NLP. In *International conference on machine learning* (pp. 2790-2799). PMLR.

[6] Lin, Z., Madotto, A. and Fung, P., 2020. Exploring versatile generative language model via parameter-efficient transfer learning. *arXiv preprint arXiv:2004.03829*.

[7] Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K. and Gurevych, I., 2021. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. EACL 2021.

[8] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. Lora: Low-rank adaptation of large language models. *arXiv preprint arXiv:2106.09685*.

[9] Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W. and Zhao, T., 2023. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. *arXiv preprint arXiv:2303.10512*.

[10] Meng, F., Wang, Z. and Zhang, M., 2024. Pissa: Principal singular values and singular vectors adaptation of large language models. *arXiv preprint arXiv:2404.02948*.

[11] Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A. and Tian, Y., 2024. Galore: Memory-efficient llm training by gradient low-rank projection. *arXiv preprint arXiv:2403.03507*.

[12] Cinquin, T., Immer, A., Horn, M. and Fortuin, V., 2021. Pathologies in priors and inference for Bayesian transformers. *arXiv preprint arXiv:2110.04020*.

[13] Chen, W. and Li, Y., 2023. Calibrating transformers via sparse gaussian processes. *arXiv preprint arXiv:2303.02444*.

[14] Yang, A.X., Robeyns, M., Wang, X. and Aitchison, L., 2023. Bayesian low-rank adaptation for large language models. *arXiv preprint arXiv:2308.13111*.

[15] Martens, J. and Grosse, R., 2015, June. Optimizing neural networks with kronecker-factored approximate curvature. In *International conference on machine learning* (pp. 2408-2417). PMLR.

[16] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose bayesian inference algorithm. *Advances in neural information processing systems*, *29*.

[17] Doan, B.G., Shamsi, A., Guo, X.Y., Mohammadi, A., Alinejad-Rokny, H., Sejdinovic, D., Ranasinghe, D.C. and Abbasnejad, E., 2024. Bayesian Low-Rank LeArning (Bella): A Practical Approach to Bayesian Neural Networks. *arXiv preprint arXiv:2407.20891*.

## Compass Away Day 2024

A post by Sam Bowyer and Emma Ceccherini, PhD students on the Compass programme.

The annual Compass Away Day took place this past June at Folly Farm, a fully sustainable and eco-friendly venue that offered Compass students the opportunity to take some time away from their regular research and enjoy a variety of activities in the Somerset countryside. Over the course of three days, the students learnt—among other things—how to craft an effective CV as a machine learning researcher; how each of our research areas overlap in surprising ways; how to improve the execution time of Python programs; and how to throw an axe.

On arrival, the students first took part in a Responsible Innovation talk led by Henry Bourne, in which we explored the connections, biases, and gaps in and between our individual research areas. This took the form of a creative mind-mapping exercise in which the students eagerly relished the opportunity to don party hats (see picture below).

After a hearty lunch with plenty of coffee, the students next heard three short talks by their colleagues. First, Ed Davis treated them to a talk on the benefits of just-in-time compilation, with a live demonstration of how one line of code (`@jit`) can speed up your Python scripts by multiple orders of magnitude. Secondly, the students saw another live-coding demonstration given by Kieran Morris as he walked through a project of his which utilised command-line automation and LLM APIs to create a text-based simulation of a well-known sporting event (The Hunger Games, but with Compass PhD students appearing as competitors). Finally, Edward Milsom gave a tongue-in-cheek TED-style talk entitled ‘How to be an AI Bro’, advising all present in the audience to pick a side between Doomers and Zoomers (anti- and pro-AI); to use self-attention mechanisms (the basis of modern LLMs; see https://arxiv.org/abs/1706.03762) whenever possible; and, following that last piece of advice, to follow @edward_milsom on X (formerly known as Twitter).
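For the curious, the demonstration was presumably along these lines; a minimal sketch using Numba, one common source of a `@jit` decorator (JAX provides another), with a made-up numerical kernel:

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # compiled to machine code on first call
def pairwise_abs_sum(x):
    total = 0.0
    for i in range(x.shape[0]):
        for j in range(i + 1, x.shape[0]):
            total += abs(x[i] - x[j])
    return total

x = np.random.randn(3000)
pairwise_abs_sum(x)  # the explicit loops now run at compiled, C-like speed
```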

The final session of the day was led by Emma Ceccherini and Sam Bowyer, in which the two Away Day student organisers presented on the topic of CVs and online presence. The presentation exceeded everyone’s expectations, in no small part thanks to Helen Mawdsley being present to answer questions from the audience.

Feeling refreshed and revitalised after a peaceful night’s sleep, the students awoke to sunshine and birdsong. But soon a competitive feeling came over the group; this next morning was devoted to sports: archery, axe-throwing and segwaying. We won’t dwell on the sports too much except to say that Sam Bowyer’s team won handily, and that one impressed student is reported to have said “[Sam] was a G at archery. Bosh.” (That same anonymous source, taking time out of her quantitative spatial science and networks research—located in the big Compass office somewhere between desks C and E—is also reported to have said “Ed D[redacted for anonymity], snapping all those single-use poles, not impressive segwaying.” We, the editors, feel it would be improper for us to comment on this matter.)

The final activity on the schedule was a writing retreat, taking place over the afternoon of the second day and the entirety of the third and final day. Being in a peaceful location and setting time aside to work on a single specific task proved useful to the students present.

To summarise, Away Day 2024 went extremely well, having been organised expertly by Crina Radu, to whom the students are all incredibly grateful.

## Student Perspectives: Factor-adjusted vector autoregressive models

A post by Dylan Dijk, PhD student on the Compass programme.

# Introduction

My current project is looking to robustify the performance of time series models to heavy-tailed data. The models I have been focusing on are vector autoregressive (VAR) models, and additionally factor-adjusted VAR models. In this post I will not be covering the robust methodology, but will be introducing VAR models and providing the motivation for introducing the factor adjustment step when working with high-dimensional time series.

In time series analysis, the objective is often to forecast a future value given past data. For example, one of the classical models for univariate time series is the autoregressive AR(d) model:

\[X_t = a_1 X_{t-1} + \dots + a_d X_{t-d} + \epsilon_t \, .\]

However, in many cases, the value of a variable is influenced not just by its own past values but also by past values of other variables. For example, in Economics, household consumption expenditures may depend on variables such as income, interest rates, and investment expenditures, therefore we would want to include these variables in our model.

# Vector autoregressive models

The VAR model [1] is simply the multivariate generalisation of the univariate autoregressive model; that is, for a $p$-dimensional stochastic process $(\dots, \mathbf{X}_t, \mathbf{X}_{t+1}, \dots)$ with $\mathbf{X}_t \in \mathbb{R}^p$, we model an observation at time $t$ as a linear combination of the previous observations up to some lag $d$, plus an error:

\[\mathbf{X}_t = \mathbf{A}_1 \mathbf{X}_{t-1} + \dots + \mathbf{A}_d \mathbf{X}_{t-d} + \boldsymbol{\epsilon}_t \, ,\]

where $\mathbf{A}_i$ are $p \times p$ coefficient matrices. Therefore, in addition to modelling serial dependence, the model takes into account cross-sectional dependence. This model can then be used for forecasting, and as an explanatory model to describe the dynamic interrelationships between a number of variables.
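As a quick illustration, here is a minimal NumPy simulation of a stationary VAR(1) process (the dimension, coefficient matrix, and sample size are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 500
A1 = 0.4 * np.eye(p) + 0.1 * (rng.random((p, p)) < 0.2)  # sparse-ish coefficient matrix
assert np.abs(np.linalg.eigvals(A1)).max() < 1            # check stationarity of the VAR(1)

X = np.zeros((n, p))
for t in range(1, n):
    X[t] = A1 @ X[t - 1] + rng.standard_normal(p)         # X_t = A_1 X_{t-1} + eps_t
```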

## Estimation

Given a dataset of $n$ observations, $\{\mathbf{X}_1, \dots, \mathbf{X}_n\} \subset \mathbb{R}^p$, we can aim to estimate the coefficient matrices. In order to do so, the model can be written in a stacked form:

\begin{align*} \underbrace{\left[\begin{array}{c}\left(\mathbf{X}_n\right)^{T} \\ \vdots \\ \left(\mathbf{X}_{d+1}\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{Y}}} & =\underbrace{\left[\begin{array}{ccc}\left(\mathbf{X}_{n-1}\right)^{T} & \cdots & \left(\mathbf{X}_{n-d}\right)^{T} \\ \vdots & \ddots & \vdots \\ \left(\mathbf{X}_{d}\right)^{T} & \cdots & \left(\mathbf{X}_1\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{X}}} \underbrace{\left[\begin{array}{c}\boldsymbol{A}_1^{T} \\ \vdots \\ \boldsymbol{A}_d^{T}\end{array}\right]}_{\boldsymbol{A}^T}+\underbrace{\left[\begin{array}{c}\left(\boldsymbol{\epsilon}_n\right)^{T} \\ \vdots \\ \left(\boldsymbol{\epsilon}_{d+1}\right)^{T}\end{array}\right]}_{\boldsymbol{E}} \end{align*}

and subsequently vectorised to return a standard univariate linear regression problem

\begin{align*} \operatorname{vec}(\boldsymbol{\mathcal{Y}}) & =\operatorname{vec}\left(\boldsymbol{\mathcal{X}} \boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \\ & =(\textbf{I} \otimes \boldsymbol{\mathcal{X}}) \operatorname{vec}\left(\boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \\ \underbrace{\boldsymbol{Y}}_{N p \times 1} & =\underbrace{\boldsymbol{Z}}_{N p \times q} \underbrace{\boldsymbol{\beta}^*}_{q \times 1}+\underbrace{\operatorname{vec}(\boldsymbol{E})}_{N p \times 1}, \quad N=(n-d), \quad q=d p^2. \end{align*}
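Continuing the simulation sketch above, the stacked and vectorised regression can be built directly (here with lag $d = 1$; the row ordering of $\boldsymbol{\mathcal{Y}}$ does not affect the least-squares problem):

```python
d = 1
Ymat = X[d:]                    # responses X_{d+1}, ..., X_n
Xmat = X[:-d]                   # lagged design (one block per lag; just one block here)
Z = np.kron(np.eye(p), Xmat)    # (I kron X), shape (N p, d p^2)
Yvec = Ymat.flatten(order="F")  # vec(Y): column-stacking, matching the vec operator
```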

There are $dp^2$ parameters to estimate in this model, and hence VAR estimation is naturally a high-dimensional statistical problem. Therefore, estimation methods and associated theory need to hold under high-dimensional scaling of the parameter dimension. Specifically, this means consistency is shown as both $p$ and $n$ tend to infinity, as opposed to classical statistics, where $p$ is kept fixed.

## Sparse VAR

The linear model in the high-dimensional setting is well understood [2]. Obtaining a consistent estimator requires additional structural assumptions on the model, in particular sparsity of the true vector $\boldsymbol\beta^*$. The common approach to estimation is the lasso, which can be motivated from convex relaxation in the noiseless setting. Consistency of the lasso is well studied [3][4], with consistency guaranteed under sparsity and under restrictions on the directions in which the Hessian of the loss function is strictly positive.

The well-known lasso objective is given by:

\begin{align*} \underset{{\boldsymbol\beta \in \mathbb{R}^q}}{\text{argmin}} \, \|\boldsymbol{Y}-\boldsymbol{Z} \boldsymbol\beta\|_{2}^{2} + \lambda \|\boldsymbol\beta\|_1 \, , \end{align*}
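Continuing the sketch, the lasso estimate of $\operatorname{vec}(\boldsymbol{A}^T)$ can be computed with scikit-learn, whose `alpha` plays the role of $\lambda$ up to a scaling convention:

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.05, fit_intercept=False)
lasso.fit(Z, Yvec)                                    # estimates vec(A^T)
A_hat = lasso.coef_.reshape(d * p, p, order="F").T    # un-vectorise back to A
```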

Below, we give a simplified consistency result that can be obtained under certain assumptions.

We denote the sparsity of $\boldsymbol{A}$ by $s_{0, j}=\left|\boldsymbol\beta^*_{(j)}\right|_0$, $s_0=\sum_{j=1}^p s_{0, j}$, and $s_{\text{in}}=\max_{1 \leq j \leq p} s_{0, j}$.

**Lasso consistency result**

Suppose

\begin{gather*} s_{\text{in}} \leq C_1 \sqrt{\frac{n}{\log p}} \, \; \text{and } \; \lambda \geq C_2 (\|\boldsymbol{A}^T\|_{1,\infty} + 1)\sqrt{\frac{\log p}{n}} \; , \end{gather*}

then with high probability we have

\begin{align*} |\widehat{\boldsymbol{A}} - \boldsymbol{A}|_2 \leq C_3 \sqrt{s_{0}} \lambda \quad \text{and} \quad |\widehat{\boldsymbol{A}} - \boldsymbol{A}|_1 \leq C_4 s_0 \lambda \, . \end{align*}

What we mean here by consistency is that, as $n,p \rightarrow \infty$, the estimate $\widehat{\boldsymbol\beta}$ converges to $\boldsymbol\beta^*$ in probability, where we think of $p$ as a function of $n$, so that the manner in which the dimension grows depends on the sample size. For example, in the result above, we can have consistency with $p = \exp(\sqrt{n})$.

The result indicates that for larger $p$, a sparser solution and a larger regularisation parameter are required. Similar results have been derived under various assumptions; for instance, for a Gaussian VAR the result has been given in terms of the largest and smallest eigenvalues of the spectral density matrix of the series [5], and hence consistency requires that these quantities are bounded.

In summary, for lasso estimation to work we need $\boldsymbol{A}$ to be sufficiently sparse, and the largest eigenvalue of the spectral density matrix to be bounded. But are these reasonable assumptions to make?

*Figure: the first two leading eigenvalues of the spectral density matrix.*

*Figure: heatmap of logged p-values for evidence of non-zero coefficients after fitting a ridge regression model.*

Well, intuitively, if a multivariate time series has strong cross-sectional dependence, we would actually expect many non-zero entries in the VAR coefficients $\boldsymbol{A}_i$. The figures above, taken from [6], illustrate a real dataset in which there is statistical evidence for a non-sparse solution (heatmap), and in which the leading eigenvalue of the spectral density matrix diverges linearly in $p$, providing an example in which two of the assumptions discussed above are unmet.

# Factor-adjusted VAR

The idea now is to assume that the covariance of the observed vector $\mathbf{X}_t$ is driven by a lower-dimensional latent vector. For example, the figures above were generated from a dataset of stock prices of financial institutions; in this case, an interpretation of a latent factor could be overall market movement, capturing the broad market trend, or a factor capturing changes in interest rates.

\begin{align} \mathbf{X}_t &= \underset{p \times r}{\boldsymbol\Lambda}\, \underset{r \times 1}{\mathbf{F}_t} + \boldsymbol\xi_t \end{align}

Consequently, first fitting a factor model would account for strong cross-sectional correlations, leaving the remaining process to exhibit the individual behaviour of each series. Fitting a sparse VAR process will now be a more reasonable choice.

In the formula above, $\mathbf{F}_t$ is the factor random vector, and $\boldsymbol\Lambda$ the constant loading matrix, which quantifies the sensitivity of each variable to the common factors, and we can model $\boldsymbol\xi_t$ as a sparse VAR process, as described in the preceding sections.
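A minimal sketch of the two-step idea, using the principal-components estimator (one standard choice) with the number of factors $r$ assumed known:

```python
def factor_adjust(X, r):
    """Estimate factors by PCA, then return the idiosyncratic component."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = U[:, :r] * S[:r]     # estimated factors F_t, shape (n, r)
    Lam = Vt[:r].T           # estimated loadings Lambda, shape (p, r)
    xi = Xc - F @ Lam.T      # remainder xi_t: fit the sparse VAR to this
    return F, Lam, xi

F, Lam, xi = factor_adjust(X, r=2)  # X as simulated earlier; r = 2 is illustrative
```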

**References**

[1] Lütkepohl, H. (2005) *New introduction to multiple time series analysis.* Berlin: Springer-Verlag.

[2] Wainwright, M. (2019) *High-dimensional statistics: A non-asymptotic viewpoint – Chapter 7 – Sparse linear models in high dimensions*. Cambridge, United Kingdom: Cambridge University Press.

[3] Geer, Sara A. van de, and Peter Bühlmann. (2009) *On the Conditions Used to Prove Oracle Results for the Lasso*. Electronic Journal of Statistics. Project Euclid, https://doi.org/10.1214/09-EJS506.

[4] Bickel, Peter J., Ya’acov Ritov, and Alexandre B. Tsybakov. (2009) *Simultaneous Analysis of Lasso and Dantzig Selector*. The Annals of Statistics. https://doi.org/10.1214/08-AOS620.

[5] Sumanta Basu, George Michailidis. (2015) *Regularized estimation in sparse high-dimensional time series models*. The Annals of Statistics. https://doi.org/10.1214/15-AOS1315.

[6] Barigozzi, M., Cho, H. and Owens, D. (2024). *FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series*. Journal of Business & Economic Statistics.

## Student Perspectives: Strategies for variational inference in non-conjugate problems

A post by Qi Chen, PhD student on the Compass programme.

# Introduction

Variational inference is a method for approximating posterior distributions. In the Bayesian context, we would like access to the posterior distribution \[p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int_{\Theta} p(x|\theta)p(\theta) d\theta}.\]

In most cases the denominator, the evidence $p(x)$, is intractable; that is, we cannot compute it analytically. How should we proceed? There are two broad ways:

- Use MCMC to simulate samples from the posterior distribution $p(\theta|x)$ to approximate the true posterior and obtain statistics of interest (mean, variance, etc.).
- Approximate $p({\theta}|{x})\approx q(\theta)\in\mathcal{Q}$.

The former method is unbiased, and its convergence is guaranteed by the law of large numbers; but it requires a large number of samples and is computationally demanding if the dimension of the parameters/dataset is large. The latter, called variational inference, is biased (with the bias depending on the choice of $\mathcal{Q}$) but is much faster and more scalable.

We call $q$ the variational distribution. The idea behind variational inference is to approximate the posterior $p({\theta}|{x})$ using $q({\theta})\in\mathcal{Q}$ by minimising the KL divergence between $q({\theta})$ and the true posterior $p({\theta}|{x})$: \[q^*({\theta}) = \operatorname{argmin}_{q\in\mathcal{Q}}\;KL(q({\theta})||p({\theta}|{x})), \quad KL(q||p) = \int_{\Theta} q({\theta})\log\left(\frac{q({\theta})}{p({\theta}|{x})}\right)d{\theta}.\]

This is a traditional measure of distribution mismatch over the same domain, and it is easy to see that $q = p$ is equivalent to $KL(q||p)=0$.

There are broadly two questions we would like to answer:

- How do we minimise the KL divergence when the true posterior is unknown?
- How do we choose the variational family $\mathcal{Q}$?

We now answer the first question:

Notice that \begin{align*} \log p({x}) &= \int q({\theta})\log\left(\frac{p({x},\theta)q({\theta})}{p({\theta}|{x})q({\theta})}\right) d{\theta}\\ &= \int q({\theta}) \log\left(\frac{p({x},{\theta})}{q({\theta})}\right) d{\theta} + \int q({\theta})\log\left(\frac{q({\theta})}{p({\theta}|{x})}\right)d{\theta}. \end{align*}

From the above derivation, we see that the second term is exactly the KL divergence we wish to minimise. As $\log p({x})$ is fixed, minimising the KL divergence is equivalent to maximising the first term, which answers the first question. The first term is called the **evidence lower bound (ELBO)**, written $\mathcal{L}(q({\theta}))$.

For the second question: in theory, supposing the variational distribution is parametrised by variational parameters ${\phi}$, we can start with any variational distribution we like, subject to the basic criterion

$\operatorname{Supp}(q({\theta};{\phi})) \subseteq \operatorname{Supp}(p({\theta}|{x}))$.

We also need $\operatorname{Supp}(p({\theta}|{x})) \subseteq \operatorname{Supp}(p({\theta}))$, which is guaranteed in most cases.

But randomly choosing some variational distribution for an arbitrary model won’t always make the algorithm feasible. Indeed, all VI methods centre around the goal of optimising the ELBO \[\phi^* = \operatorname{argmax}_{{\phi}}\mathbb{E}_q\left[\log \frac{p({x},{\theta})}{q({\theta})}\right].\]

Traditional methods make the mean-field assumption that all parameters are independent. This breaks the objective apart, and a local optimum can be reached via a coordinate ascent algorithm. Some methods enlarge the mean-field space to allow specific conditional dependencies between parameters according to graphical models with conjugate exponential relationships between parent-child pairs [1]. This has been further extended to non-conjugate pairs with custom approximations.

Some modern methods developed in the last decade are based on the idea that the gradient of the ELBO can be expressed in the form $\mathbb{E}_q(\cdot)$. This immediately suggests combining Monte Carlo methods (for sampling from $q$) with stochastic gradient descent (for efficiency in the optimisation). These methods benefit from the simplicity that there is no need to analytically compute gradients based on conditional-dependence specifications for each model: the algorithm is automatic and applies to a greater domain of models. But it is worth noting that even though these methods are theoretically sound, they still face practical issues, which I will show in later sections.

In this post I will briefly go through some of these methods, specifically coordinate ascent variational inference, black-box variational inference and automatic differentiation variational inference.

# Conjugate models: Coordinate Ascent Variational Inference

There are various assumptions we can make on $\mathcal{Q}$. We start with the mean-field assumption on the parameters [2]. This assumes the joint variational distribution of all parameters factorises completely, that is: \[q({\theta}) = \prod_{j=1}^m q_j({\theta}_j).\]

We write ${\theta}_{-j}$ to denote all the latent variables other than ${\theta}_j$, with distribution $q_{-j}$.

If we maximise $\mathcal{L}(q)$ with respect to $q_j({\theta}_j)$ only, we are maximising \[\mathbb{E}_{q_j}[\mathbb{E}_{q_{-j}}[\log p({\theta},{x})]] - \mathbb{E}_{q_j}[\log q_j({\theta}_j)].\]

Now write $r_j({\theta}_j) = \frac{1}{Z_j}\exp\{\mathbb{E}_{q_{-j}}[\log p({x},{\theta})]\}$, where $Z_j$ is a normalising constant making $r_j$ a probability distribution. Substituting in, we get \[\mathcal{L}(q_j) = \mathbb{E}_{q_j}\left[\log \frac{r_j({\theta}_j)}{q_j({\theta}_j)}\right] + \text{const} = -KL(q_j({\theta}_j)||r_j({\theta}_j)) + \text{const}.\]

Thus maximising the ELBO with respect to $q_j$ is equivalent to setting $q_j = r_j$, which is \[q_j({\theta}_j)\propto \exp\{\mathbb{E}_{q_{-j}}[\log p({x},{\theta})]\}\propto \exp\{\mathbb{E}_{q_{-j}}[\log p({\theta}_j|{\theta}_{-j},{x})]\}.\]

Since we assume $q$ factorises, maximising $\mathcal{L}(q)$ splits into $m$ steps of maximising $\mathcal{L}(q_j)$. This algorithm is called **coordinate ascent variational inference** (CAVI) or **block-coordinate ascent**.

An algorithmic view:

- Initialize $q({\theta}) = \prod_{j=1}^m q_j({\theta}_j)$
- Iterate until convergence:

Update each $q_j$ by $q_j = \frac{1}{Z_j}\exp(\mathbb{E}_{q_{-j}}[\log p({\theta},{x})])$.

This algorithm is guaranteed to converge since the ELBO increases at each iteration.
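To make the updates concrete, here is a minimal CAVI sketch in Python for a toy conjugate model: $x_i \sim \mathcal{N}(\mu, \tau^{-1})$ with a Normal-Gamma prior $\mu|\tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$, $\tau \sim \text{Gamma}(a_0, b_0)$. The model, hyperparameters and variable names are my own illustration rather than anything from the references; the closed-form updates are the standard ones for this model.

```python
import numpy as np

# Toy data: x_i ~ N(mu, 1/tau) with true mu = 2, tau = 4
rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=200)
n, xbar = len(x), x.mean()

# Normal-Gamma prior hyperparameters (chosen arbitrarily for this sketch)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Mean-field factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)
mu_n, lam_n = 0.0, 1.0
a_n = a0 + (n + 1) / 2      # fixed: the exponent of tau does not depend on q(mu)
b_n = b0

for _ in range(50):
    # Update q(mu): complete conditional is normal; plug in E_q[tau] = a_n / b_n
    e_tau = a_n / b_n
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam_n = (lam0 + n) * e_tau
    # Update q(tau): take expectations of the quadratic terms under q(mu)
    e_lik = np.sum((x - mu_n) ** 2) + n / lam_n       # E_q[sum_i (x_i - mu)^2]
    e_pri = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)  # E_q[lam0 (mu - mu0)^2]
    b_n = b0 + 0.5 * (e_lik + e_pri)

print(f"E_q[mu] = {mu_n:.3f}, E_q[tau] = {a_n / b_n:.3f}")  # close to the truth
```

Each pass through the loop is one sweep of coordinate ascent: every factor update uses only expectations under the current other factor, exactly as in the generic update above.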

This is not directly feasible in all cases, since we assume $r_j$ can be computed analytically. It is feasible when there is conditional conjugacy between the likelihood and the prior on each $\theta_j$ conditioned on all other ${\theta}_{i\neq j}$, that is, \[p(\theta_j|{\theta}_{i\neq j})\in \mathcal{A}(\alpha),\;p({x}|\theta_j, {\theta}_{i\neq j})\in \mathcal{B}(\theta_j)\;\Rightarrow\; p(\theta_j|{x},{\theta}_{i\neq j})\in \mathcal{A}(\alpha')\]

One particularly convenient case is when all complete conditionals lie in the exponential family.

A distribution $p({\theta})$ is in the exponential family if \[p({\theta}) = h({\theta})\exp\{{\eta}^T t({\theta}) - A({\eta})\}\]

Here ${\eta}$ is called the natural parameter, and the log-normalizer $A({\eta})$ satisfies \[A({\eta}) = \log \int h({\theta}) \exp\{{\eta}^T t({\theta})\}\,d{\theta}\] so that $p({\theta})$ integrates to 1.
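As a quick example (a standard identity, stated here only for concreteness), the Gaussian with known variance $\sigma^2$ fits this form:

\[\mathcal{N}(\theta;\mu,\sigma^2) = \underbrace{\frac{e^{-\theta^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}}_{h(\theta)} \exp\Big\{\underbrace{\frac{\mu}{\sigma^2}}_{\eta}\,\underbrace{\theta}_{t(\theta)} - \underbrace{\frac{\mu^2}{2\sigma^2}}_{A(\eta)}\Big\}\]

with natural parameter $\eta = \mu/\sigma^2$ and log-normalizer $A(\eta) = \sigma^2\eta^2/2$.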

Now assume that all the complete conditionals belong to an exponential family distribution, that is \[p({\theta}_j|{\theta}_{-j},{x}) = h({\theta}_j)\exp \{{\eta}_j^T({\theta}_{-j},{x}){\theta}_j – A({\eta}_j({\theta}_{-j},{x}))\}\]

where we assume that ${\theta}_j$ is already transformed to its appropriate sufficient statistic. The CAVI update then becomes \begin{align*}
q_j({\theta}_j)&\propto \exp\{\log h({\theta}_j) + \mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]^T{\theta}_j - \mathbb{E}_{q_{-j}}[A({\eta}_j({\theta}_{-j},{x}))]\}\\
&\propto h({\theta}_j)\exp\{\mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]^T{\theta}_j\}
\end{align*}

where we see that the variational factors are in the same exponential family (due to conjugacy) as the complete conditionals, with the natural parameter updated to \[\phi_j = \mathbb{E}_{q_{-j}}[{\eta}_j({\theta}_{-j},{x})]\]

But in many models, for example Bayesian logistic regression, we do not have conditional conjugacy. We now introduce two methods developed in the last decade that tackle the lack of conjugacy. Notice that variational inference is at heart an optimization problem, and both methods are derived by expressing the derivatives of the ELBO as an expectation over the variational distribution $q$: \[\frac{\partial \text{ELBO}}{\partial {\phi}} = \mathbb{E}_{q({\theta};{\phi})}[\cdot]\]

# Evaluable Models: Black Box Variational Inference

We want to optimize \[\mathcal{L}({\phi}) = \mathbb{E}_{q}[\log p({\theta},{x})] - \mathbb{E}_q[\log q({\theta};{\phi})]\]

and we notice that

\begin{align*}
\triangledown_{{\phi}}\mathcal{L}({\phi}) &= \triangledown_{{\phi}}\int q({\theta};{\phi})\log \frac{p({\theta},{x})}{q({\theta};{\phi})} \,d{\theta}\\
&= \int q({\theta};{\phi})\triangledown_{{\phi}} \log q({\theta};{\phi})\log \frac{p({\theta},{x})}{q({\theta};{\phi})} + q({\theta};{\phi})\triangledown_{{\phi}}\log \frac{p({\theta},{x})}{q({\theta};{\phi})} \,d{\theta}\\
&= \mathbb{E}_{q}[\triangledown_{{\phi}}\log q({\theta};{\phi})(\log p({\theta},{x})-\log q({\theta};{\phi}))]
\end{align*}

where the second term in the integrand vanishes because \[\int q({\theta};{\phi})\,\triangledown_{{\phi}}\log \frac{p({\theta},{x})}{q({\theta};{\phi})}\, d{\theta} = -\int \triangledown_{{\phi}} q({\theta};{\phi})\,d{\theta} = -\triangledown_{{\phi}}1 = 0\]

This is proposed in [3]. We see this is an expectation under the variational distribution, and to evaluate it we only need to

- simulate from $q$,
- evaluate the derivative $\triangledown_{{\phi}}\log q({\theta};{\phi})$,
- evaluate the model joint $p({\theta},{x})$.

This significantly relaxes the constraints of CAVI and enlarges the domain of applicable models.

In practice, we use Monte Carlo samples to form a noisy unbiased estimator of the gradient and perform stochastic gradient ascent with a step-size sequence $\rho_t$ satisfying the Robbins-Monro conditions \[\sum_t \rho_t =\infty,\qquad\sum_t \rho_t^2 < \infty,\] for example $\rho_t = 1/t$.

A naive algorithm is as follows:

- $t \gets 0$, $\delta \gets \infty$
- While $\delta > \tau$:
  - $t \gets t+1$
  - Draw ${\theta}^1,\dots,{\theta}^S\sim q({\theta};{\phi}_{t-1})$
  - $\hat{\triangledown}_{{\phi}}\mathcal{L}({\phi}_{t-1})\gets \frac{1}{S}\sum_{s=1}^S \triangledown_{{\phi}}\log q({\theta}^s;{\phi}_{t-1})(\log p({\theta}^s,{x})-\log q({\theta}^s;{\phi}_{t-1}))$
  - ${\phi}_t\gets{\phi}_{t-1} + \rho_t\hat{\triangledown}_{{\phi}}\mathcal{L}({\phi}_{t-1})$
  - $\delta \gets \frac{\|{\phi}_t - {\phi}_{t-1}\|}{\|{\phi}_{t-1}\|}$
- Output ${\phi}^* = {\phi}_t$
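To see what this looks like in practice, here is a minimal Python sketch for a toy scalar model with a standard normal prior and unit-variance Gaussian likelihood, using a Gaussian variational family $q(\theta;\phi) = \mathcal{N}(m, s^2)$. The model and all names are my own choices for illustration; because the model is conjugate, the exact posterior is available as a check. The noisiness of the learned scale also illustrates the variance problem discussed next.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(1.5, 1.0, size=50)   # data; exact posterior is N(sum(x)/(n+1), 1/(n+1))
n = len(x)

def log_joint(theta):
    # log p(theta) + log p(x | theta), up to additive constants
    return -0.5 * theta**2 - 0.5 * np.sum((x[:, None] - theta) ** 2, axis=0)

m, log_s = 0.0, 0.0                 # q(theta; phi) = N(m, exp(log_s)^2)
S = 100

for t in range(4000):
    s = np.exp(log_s)
    theta = rng.normal(m, s, size=S)
    # log p(theta, x) - log q(theta), up to constants (which do not bias the estimator)
    w = log_joint(theta) - (-log_s - 0.5 * ((theta - m) / s) ** 2)
    score_m = (theta - m) / s**2               # d/dm log q
    score_ls = ((theta - m) / s) ** 2 - 1.0    # d/dlog_s log q
    rho = 0.5 / (t + 50)                       # Robbins-Monro step size
    m += rho * np.mean(score_m * w)
    log_s += rho * np.mean(score_ls * w)

print(f"BBVI:  m = {m:.3f}, s = {np.exp(log_s):.3f}")
print(f"exact: m = {x.sum()/(n+1):.3f}, s = {np.sqrt(1/(n+1)):.3f}")
```

The only model-specific ingredient is `log_joint`; everything else treats the model as a black box, which is exactly the appeal of the method.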

However, in practice this algorithm does not produce meaningful results for non-trivial models, since the variance of the estimator grows linearly with the number of parameters in the model ${\theta}$. Due to this high variance, we need variance reduction techniques.

Rao-Blackwellization reduces the variance of some estimator $J(X,Y)$ by defining another estimator \[\hat{J}(X) = \mathbb{E}[J(X,Y)|X]\]

It is clear that the expectation is preserved, \[\mathbb{E}[\hat{J}(X)] = \mathbb{E}[J(X,Y)],\] by the tower law. The variance of this estimator is \[Var(\hat{J}(X)) = Var(J(X,Y)) + \mathbb{E}[\hat{J}(X)^2] - \mathbb{E}[J(X,Y)^2] = Var(J(X,Y)) - \mathbb{E}[(J(X,Y)-\hat{J}(X))^2]\]

Thus the new estimator always has lower variance than $J(X,Y)$, unless $\hat{J}(X) = J(X,Y)$ almost surely.
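A tiny numerical check of this identity, with toy variables of my own choosing: take $X \sim \mathcal{N}(0,1)$, $Y|X \sim \mathcal{N}(X,1)$ and $J(X,Y) = X+Y$, so that $\hat{J}(X) = \mathbb{E}[J|X] = 2X$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)   # Y | X ~ N(X, 1)

j = x + y                          # J(X, Y): estimates E[X + Y] = 0
j_rb = 2 * x                       # J_hat(X) = E[J(X, Y) | X] = 2X

# Var(J) = 5 and Var(J_hat) = 4 = Var(J) - E[(J - J_hat)^2]
print(j.var(), j_rb.var(), ((j - j_rb) ** 2).mean())
```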

We now apply this to BBVI. Assume the approximating family follows the mean-field assumption, and factor the joint as \[p({x},{\theta}) = p_i({x},{\theta}_{(i)})\,p_{-i}({x},{\theta}_{-i})\] where $p_i$ collects all the factors containing $\theta_i$, and ${\theta}_{(i)}$ is the collection of all latent variables that appear in $p_i$.

We can thus rewrite the derivative of the ELBO with respect to $\phi_i$ as \[\hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i) = \mathbb{E}_{q_{(i)}}[\triangledown_{\phi_i}\log q_i(\theta_i;\phi_i)(\log p_i({x},{\theta}_{(i)})-\log q_i(\theta_i;\phi_i))]\]

This is a Rao-Blackwellized version of $\triangledown_{\phi_i}\mathcal{L}({\phi})$, since \[\mathbb{E}_q[\hat{\triangledown}_{\phi_i}\mathcal{L}({\phi}) - \hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i)] = C\,\mathbb{E}_{q_i}[\triangledown_{\phi_i}\log q_i(\theta_i;\phi_i)] = 0\]

with \[C = \mathbb{E}_{q_{-i}}[\log p_{-i}({x},{\theta}_{-i})] - \mathbb{E}_{q_{-i}}[\sum_{j\neq i}\log q_j(\theta_j;\phi_j)]\]

The detailed derivation can be found in [3].

We now introduce another method, based on a regression estimator. Suppose we want to estimate some parameter $\mu$ and we have an estimator $f$ with $\mathbb{E}[f(u)] = \mu$, where $u$ is a random variable. Suppose furthermore that we have a "similar" function $h$ whose expectation $\mathbb{E}[h(u)] = \nu$ is known. Then we define a new estimator of $\mu$: \[g(u) = f(u)-\beta(h(u)-\nu)\]

This is clearly unbiased, and its variance is \[Var(g(u)) = Var(f(u)) + \beta^2 Var(h(u)) - 2\beta Cov(f(u),h(u))\]

In order to minimize this variance, we choose \[\hat{\beta} = \frac{Cov(h(u),f(u))}{Var(h(u))}\]

This is also the OLS estimator for the linear regression \[f(u) = \mu + \beta(h(u)-\nu).\] Plugging in this $\hat{\beta}$, we have \[Var(g(u)) = Var(f(u))(1-\rho^2_{fh})\] where $\rho^2_{fh}$ is the squared correlation between $f(u)$ and $h(u)$. Such an $h$ is called a control variate.
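A quick toy illustration, with an example of my own choosing: estimate $\mu = \mathbb{E}[e^U]$ for $U\sim\text{Uniform}(0,1)$ using the control variate $h(u) = u$ with known mean $\nu = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=10_000)
f = np.exp(u)            # E[f(U)] = e - 1, the quantity to estimate
h = u                    # control variate with known mean nu = 0.5

c = np.cov(f, h)
beta = c[0, 1] / c[1, 1]
g = f - beta * (h - 0.5)

# f and h are highly correlated, so the variance drops dramatically
print(f"plain MC: {f.mean():.4f} (sd {f.std():.4f})")
print(f"with CV:  {g.mean():.4f} (sd {g.std():.4f})")
```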

The original authors in [3] combined these two methods, choosing $\triangledown_{\phi_i}\log q_i(\theta_i;\phi_i)$ (which has expectation zero under $q_i$) as the control variate for $\hat{\triangledown}_{\phi_i}^{RB}\mathcal{L}(\phi_i)$, as shown below:

- $t \gets 0$, $\delta \gets \infty$
- While $\delta > \tau$:
  - $t \gets t+1$
  - Draw ${\theta}^1,\dots,{\theta}^S\sim q({\theta};{\phi}_{t-1})$
  - For $i \gets 1$ to $n$:
    - $f_i^s \gets \triangledown_{\phi_i}\log q_i(\theta_i^s;{\phi}^{t-1}_{i})(\log p_i({x},{\theta}_{(i)}^s)-\log q_i(\theta_i^s;{\phi}_i^{t-1}))$ for $s=1,\dots,S$
    - $h_i^s \gets \triangledown_{\phi_i}\log q_i(\theta_i^s;{\phi}_i^{t-1})$ for $s=1,\dots,S$
    - $\hat{\beta}_i \gets \hat{Cov}(f_i,h_i)/\hat{Var}(h_i)$, estimated from the $S$ samples
    - $g_i \gets \frac{1}{S}\sum_{s=1}^S (f_i^s-\hat{\beta}_i h_i^s)$
    - $\phi_i^t\gets \phi_i^{t-1} + \rho_t g_i$
  - $\delta \gets \frac{\|{\phi}_t - {\phi}_{t-1}\|}{\|{\phi}_{t-1}\|}$
- Output ${\phi}^* = {\phi}_t$
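In code, the inner update for one factor might look like the following sketch, assuming for simplicity a scalar $\phi_i$; the function name and interface are my own, not from [3].

```python
import numpy as np

def rb_cv_gradient(score, w):
    """Rao-Blackwellized, control-variate gradient estimate for one factor q_i.

    score: array (S,) of per-sample values  d/dphi_i log q_i(theta_i^s; phi_i)
    w:     array (S,) of per-sample values  log p_i(x, theta_(i)^s) - log q_i(theta_i^s; phi_i)
    """
    f = score * w                  # Rao-Blackwellized gradient samples
    h = score                      # control variate: E_q[h] = 0
    c = np.cov(f, h)
    beta = c[0, 1] / c[1, 1]       # estimated from the same S samples
    return np.mean(f - beta * h)
```

Note that estimating $\hat{\beta}_i$ from the same per-sample values is what makes the empirical covariance and variance in the algorithm well defined.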

The same authors pointed out limitations of BBVI in [4]. They found that the gradient can be very unstable for large input values, and that adaptive step sizes like AdaGrad need extra tuning. They also found that, for a linear mixed effects model, BBVI under-performs an MH-within-Gibbs sampler. In an experiment on LDA (latent Dirichlet allocation), a Gibbs sampler converged in a couple of minutes with 20 topics, while BBVI did not produce reasonable results after hours of iterations with only 2 topics. BBVI therefore still has practical limitations and requires further experimentation.

# Differentiable Models: Automatic Differentiation Variational Inference

The idea behind Automatic Differentiation Variational Inference (ADVI) is as follows:

- Transform the parameter space to the real coordinate space by a one-to-one mapping $T:\mathrm{Supp}({\theta})\rightarrow\mathbb{R}^k$.
- Give ${\psi} = T({\theta})$ a joint normal variational distribution, \[q({\psi};{\phi}) = \mathcal{N}({\mu},\Sigma).\] Notice that we need $\Sigma$ to be positive definite. One way to ensure this is the Cholesky factorization $\Sigma = LL^T$, where $L$ is a lower triangular matrix with $(k+1)k/2$ free entries. Overall, ${\phi}$ lives in $\mathbb{R}^{(k+1)k/2+k}$, where $k$ is the dimension of the parameters in our model. This comes with a computational cost, so we may instead make a mean-field assumption on ${\psi}$.
- Finally we standardize: ${\eta} = S_{{\phi}}({\psi}) = L^{-1}({\psi}-{\mu})$, which makes $q({\eta}) = \mathcal{N}({\eta};{0},{I})$.

Following the above recipe, we can rewrite the optimization of the ELBO as \[{\phi}^* = \arg\max_{{\phi}} \mathbb{E}_{\mathcal{N}({\eta};{0},{I})}\left[\log p\left({x},T^{-1}(S^{-1}_{{\phi}}({\eta}))\right) + \log \left|\det J_{T^{-1}}(S_{{\phi}}^{-1}({\eta}))\right|\right] + \mathbb{H}[q({\psi};{\phi})]\]

In this case, the variational parameters are contained in the transformation $S$. The gradients are \[\triangledown_{{\mu}}\mathcal{L} = \mathbb{E}_{\mathcal{N}({\eta})}[\triangledown_{{\theta}}\log p({x},{\theta})\triangledown_{{\psi}}T^{-1}({\psi}) + \triangledown_{{\psi}}\log|\det J_{T^{-1}}({\psi})|]\]

and \[\triangledown_{L}\mathcal{L} =\mathbb{E}_{\mathcal{N}({\eta})}[\left(\triangledown_{{\theta}}\log p({x},{\theta})\triangledown_{{\psi}}T^{-1}({\psi}) + \triangledown_{{\psi}}\log|\det J_{T^{-1}}({\psi})|\right){\eta}^T] + (L^{-1})^T\]

Now, similar to BBVI, we can use Monte Carlo samples and SGD to get an approximate gradient and perform gradient ascent. In [5] they propose a step-size sequence of the form

\[\rho_k^i = \eta\times i^{-1/2+\epsilon}\times\left(\tau + \sqrt{s_k^i}\right)^{-1}\]

where \[s_k^i = \alpha (g_k^i)^2 + (1-\alpha)s_k^{i-1}\]

Here $k$ indexes the element and $i$ the iteration: $g_k^i$ is the $k$th element of the gradient at iteration $i$, and $s_k^1 = (g_k^1)^2$.

Notice that here $\eta$ is a scalar controlling the scale of the step-size sequence; it can be searched over $\{0.001,0.1,1,10,100\}$. $\epsilon$ is set to a small value, for example $\epsilon = 10^{-6}$, to satisfy the Robbins-Monro conditions. The term $s_k^i$ keeps a memory of the past gradients. More details can be found in [5].
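Putting the pieces together, here is a minimal mean-field ADVI sketch in Python for a toy model of my own choosing: an exponential likelihood with rate $\lambda$ and a Gamma prior, so $T(\lambda) = \log\lambda$. The hyperparameters and variable names are assumptions for illustration, and the gradient of the log-joint (plus log-Jacobian) is written by hand rather than by automatic differentiation for transparency; the exact Gamma posterior is available as a check.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(4)
x = rng.exponential(scale=1 / 3.0, size=100)   # x_i ~ Exp(rate = 3)
n, sx = len(x), x.sum()
a, b = 2.0, 1.0                                # Gamma(a, b) prior on the rate

# In psi = log(lambda) space, log p(x, lambda) + log|Jacobian| equals
# (n + a) * psi - (b + sx) * exp(psi) + const; below is its derivative.
def grad_logjoint(psi):
    return (n + a) - (b + sx) * np.exp(psi)

mu, omega = 0.0, 0.0        # mean-field q(psi) = N(mu, exp(omega)^2)
s_mu = s_om = 0.0           # running squared gradients for the step size
eta_scale, tau, alpha, S = 0.1, 1.0, 0.1, 30

for i in range(1, 5001):
    eps = rng.normal(size=S)                       # standardized eta ~ N(0, I)
    g = grad_logjoint(mu + np.exp(omega) * eps)
    g_mu = g.mean()                                # dELBO/dmu
    g_om = np.mean(g * np.exp(omega) * eps) + 1.0  # dELBO/domega (+1 from entropy)
    # Step-size rule in the spirit of [5]
    s_mu = alpha * g_mu**2 + (1 - alpha) * s_mu if i > 1 else g_mu**2
    s_om = alpha * g_om**2 + (1 - alpha) * s_om if i > 1 else g_om**2
    rho = eta_scale * i ** (-0.5 + 1e-6)
    mu += rho / (tau + np.sqrt(s_mu)) * g_mu
    omega += rho / (tau + np.sqrt(s_om)) * g_om

# The exact posterior is Gamma(a + n, b + sx); compare E[log lambda]
print(f"ADVI E[psi] = {mu:.3f}, exact E[log lambda] = {digamma(a + n) - np.log(b + sx):.3f}")
```

Unlike the score-function estimator in BBVI, the gradient here differentiates through the samples themselves (the reparametrization ${\psi} = {\mu} + s\,{\eta}$), which is the main source of the variance reduction.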

It is shown in [5] that the variance of the gradient estimates in ADVI is better controlled than in BBVI, and that its performance is competitive with well-known MCMC methods.

[1] John Winn and Christopher M. Bishop. Variational message passing. Journal of Machine Learning Research, 6(23):661–694, 2005.

[2] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, April 2017.

[3] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. arXiv preprint, 2013.

[4] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822, Reykjavik, Iceland, 22–25 April 2014. PMLR.

[5] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(1):430–474, January 2017.