Student perspectives: AI UK 2025 Conference

Posted on 25th April 2025 by shaun.jordan

A post by Sam Bowyer, PhD student on the Compass programme.

Compass at AI UK

The Alan Turing Institute’s AI UK 2025 Conference was held last month in the QEII Centre, Westminster, and three Compass students – Emma Ceccherini, Sherman Khoo, and myself – were present for both days of the event. We attended a variety of sessions and spent time exploring the exhibition stalls, which showcased a wide range of AI projects from within academia, government and industry.

Compass CDT students and staff at the AI UK 2025 Conference. From left to right:
Dr Dan Lawson, Emma Ceccherini, Sam Bowyer, Sherman Khoo and Helen Mawdlsey

It was an eye-opening experience to learn about the work that The Alan Turing Institute does, and especially insightful to see the myriad downstream applications of the machine learning theory that we spend so much time thinking about.

Conference highlights

One particular favourite exhibition was that of the Ministry of Justice (MoJ). Emma and I talked to a data scientist at the MoJ who was working on a tool that uses LLMs to explain laws in plain English, in order to help regular people better understand their rights.

Another project involved aggregating various disconnected datasets from across government on the national and local level in order to research social factors that might lead to successful post-prison rehabilitation or equally to recidivism.

It was encouraging to see a variety of projects and organisations at the conference aiming to use AI for social and public good, with a significant number focused on climate and green tech.

Whilst Compass wasn’t presenting at AIUK, colleagues from the Informed AI Hub, the Interactive AI CDT, the AI For Collective Intelligence (AI4CI) Hub, and the Jean Golding Institute were.

It was great to not only see the other projects that are going on in the University, but also to be able to network with colleagues who only work down the road from the Fry Building (e.g. sharing Bristol restaurant recommendations)!

On the first day of the conference, Professor Charlotte Deane, Executive Chair of the Engineering and Physical Sciences Research Council (EPSRC), gave an informative keynote talk on the state of scientific research in UK academia. It was surprising to learn about the overall size of EPSRC and the range of activities they engage in, particularly their keenness for investing in spin-outs. I found Professor Deane’s talk to be very encouraging and optimistic.

The second day of the conference focused on governmental uses of AI, particularly in medicine and in defence. Professor the Lord Darzi, who recently led the Independent Investigation of the NHS in England, gave an incredibly thoughtful talk on the opportunities for AI within the NHS.

He likened the current AI boom to the development of keyhole surgery in the second half of the 20th century, urging fast, nationwide deployment in order to improve health outcomes and equality throughout the UK.

Three talks on defence and national security similarly stressed the importance of fast uptake of AI tools and made clear the desire for public-private partnerships (including with academia) in order to make this happen. (The importance of cross-sector collaboration was consistently a strong theme at AIUK, although the absence of frontier AI labs did, in my opinion, betray a slight limit to this stated commitment.)

Presentation karaoke

It wasn’t all so serious, however! The conference finished its first day with “Presentation Karaoke”, in which eight contestants competed to present unseen, five-minute, 10-slide PowerPoint presentations, each more bizarre than the last.

This fun, often slightly cringe-inducing, activity is now rumoured to be deployed at a future Compass student event. (Get practising your stand-up now…)

In summary, AIUK was a great opportunity to see how AI/ML research leads to real-world impact in the UK, and I would recommend attending to any CDT student in the future.

Guest Lecture: Professor Chris Breward

Posted on 3rd April 2025 by shaun.jordan

An Introduction to Knowledge Exchange

The Compass CDT was delighted to recently host a Guest Lecture by Professor Chris Breward, from the Mathematical Institute, University of Oxford.

Chris led an interactive session for our PhD students, which focused on getting started with knowledge exchange (KE), and explored the skills needed to engage with industrial and other external partners.

As Scientific Director of the Knowledge Exchange Hub for Mathematical Sciences and Co-Director of the EPSRC CDT in Industrially Focused Mathematical Modelling, Chris had a wide range of valuable advice to share.

Drawing on his experience building partnerships with companies and setting up projects with industry co-funding, he ran through the different ways researchers at all stages of their career can get involved in KE.

Attendees explored why companies might engage with mathematical scientists, discussed things to consider before meeting potential collaborators, and looked at what can sometimes go wrong with academic-business relationships.

Student reflections

“During his Guest Lecture, Chris chatted with all of us about ways to communicate with non-academics during shared projects and how to do positive work as mathematical consultants.

“The session covered the pragmatics and hard-skills of private sector contract work, as well as the soft skills of open body language, effective listening and people management.

“He described the barriers that can arise between researchers (mathematicians in particular) and industrial partners. We then chatted interactively through where these pitfalls come from and how best to avoid them.

“He also gave us an entry-level look into the broader differences between universities and industry.”

Emma Tarmey, Compass CDT student, Cohort 4

KE initiatives

Chris closed by encouraging attendees to get involved with some of the opportunities the KE Hub provides for PhD students and researchers, such as the online Triage Workshops. These events can provide a safe space for individuals to gain experience with knowledge exchange, by observing senior colleagues from across the country.

He expressed his hope that Compass students would benefit from the upcoming five-day European Study Group with Industry (ESGI), which will take place here at the University of Bristol from Monday, 14 July to Friday, 18 July 2025.

The Compass CDT was grateful to Chris for giving up his time to visit us in the School of Mathematics’ Fry Building, and we look forward to seeing him in Bristol again in the future.

As well as being an applied mathematician, lecturer and researcher at the University of Oxford, Chris is the co-founding Chief Moderator of the Mathematics-In-Industry Reports online KE repository, and a member of the Newton Gateway’s Scientific Advisory Board.

Student perspectives: Expectation Propagation

Posted on 27th February 2025 by Grace Yan

A post by Grace Yan, PhD student on the Compass programme.

Introduction

In many real-world problems, computing the exact posterior distribution is infeasible due to non-conjugate priors and high-dimensional datasets. Approximate Bayesian inference methods are therefore used to obtain an approximate posterior. Some well-known examples of these methods include Variational Bayes (VB), the Laplace approximation and Expectation Propagation (EP). In this blog post, I will focus on Expectation Propagation and explain what it is, how it works, its strengths and limitations, and its relation to similar methods.

Figure 1: A comparison of approximate Bayesian inference methods along a spectrum of computational speed and accuracy. Methods like Variational Bayes (VB) and the Laplace approximation are faster but less accurate, while approaches like Expectation Propagation (EP) and Markov Chain Monte Carlo (MCMC) are slower but provide higher accuracy. Source: [1].

Background

EP was introduced by Minka in 2001 as an extension of assumed-density filtering (ADF), which is a one-pass sequential algorithm for obtaining an approximate posterior [2]. Like VB methods, its aim is to approximate an intractable posterior with tractable distributions by minimising the Kullback-Leibler (KL) divergence. Recall that the KL divergence measures how different two distributions $p$ and $q$ are; often, $p$ is the true distribution and $q$ is a model distribution that we use to approximate $p$. There are two kinds of KL divergence: the forward KL and the reverse KL. Assuming $x$ is continuous, these are defined as
\[ \mathrm{KL}(p(x) \| q(x)) = \int p(x) \mathrm{log}\frac{p(x)}{q(x)} dx \]
and
\[ \mathrm{KL}(q(x) \| p(x)) = \int q(x) \mathrm{log}\frac{q(x)}{p(x)} dx \]
respectively (in the discrete case, the integrals are replaced by sums). These two types are not equivalent; [3] gives a good explanation of how they differ. EP uses the forward KL.
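
The asymmetry between the two divergences is easy to check numerically. The following is a minimal sketch for the discrete case; the function name `kl_divergence` and the two distributions `p` and `q` are arbitrary illustrative choices, not taken from any of the references.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Illustrative distributions over three outcomes (assumed values).
p = np.array([0.7, 0.2, 0.1])  # plays the role of the "true" distribution
q = np.array([0.4, 0.4, 0.2])  # plays the role of the approximation

forward_kl = kl_divergence(p, q)  # KL(p || q), the direction used by EP
reverse_kl = kl_divergence(q, p)  # KL(q || p), the direction used by VB

print(f"forward KL(p||q) = {forward_kl:.4f}")
print(f"reverse KL(q||p) = {reverse_kl:.4f}")
```

Both values are non-negative, but they differ, which is why the choice of direction matters when fitting $q$.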

Expectation Propagation

Let $\mathbf{x}$ denote the observed data and $\boldsymbol{\theta}$ the parameters of interest. Recall that by Bayes’ theorem, the posterior is
\[ p(\boldsymbol{\theta}|\mathbf{x}) = \frac{p(\mathbf{x}, \boldsymbol{\theta})}{p(\mathbf{x})}, \]
where $p(\mathbf{x})$ is the model evidence. We can write the joint distribution $p(\mathbf{x}, \boldsymbol{\theta})$ in the form of a product of factors $f_i$, which are also called ‘sites’:
\[p(\mathbf{x}, \boldsymbol{\theta}) = p(\boldsymbol{\theta})p(\mathbf{x}|\boldsymbol{\theta}) = p(\boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i|\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}),\] where $p(\boldsymbol{\theta})$ is the prior and the factors $f_1, \dots, f_n$ are the likelihood partitioned into $n$ parts, assuming iid data (e.g. each $i$ could index a data point).

For each $f_i$, we drop the dependence on $\mathbf{x}_i$ to simplify the notation. I use $f_j(\boldsymbol{\theta})$ to refer to one specific factor and $f_i(\boldsymbol{\theta})$ to refer to the factors in general. My notation is similar to the notation in [4].

The idea is to approximate the posterior by approximating the factors with $\tilde{f}_i$, which are often assumed to be Gaussian (or some other member of the exponential family). These approximations are refined one at a time in an iterative process until convergence. In EP, refining a factor $\tilde{f}_j$ is a “team effort”; it requires information from each of the other factors. This concept is known as message passing, a term borrowed from computer science, where messages are passed between different programs.

Figure 2: Illustration of message passing between three factors. The arrows show the exchange of information between the factors: each $f_j$ sends out its information to the other two factors and also receives information from them.

Using the approximations of the likelihood factors, the resulting approximate posterior is given by
\[q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).\]
In EP, the prior $p(\boldsymbol{\theta})$ is also taken to be Gaussian. Since the product of Gaussians results in another Gaussian, $q$ has to be a Gaussian distribution. Therefore, we do not face the issue of finding the normalising constant for an unnormalised posterior.
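
As a reminder of why this closure property holds, consider the one-dimensional case. The identity below is standard and is stated here only for illustration, with $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ as generic means and variances:

\[ \mathcal{N}(\theta; \mu_1, \sigma_1^2)\,\mathcal{N}(\theta; \mu_2, \sigma_2^2) \propto \mathcal{N}(\theta; \mu, \sigma^2), \qquad \frac{1}{\sigma^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}, \qquad \frac{\mu}{\sigma^2} = \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}. \]

In other words, multiplying Gaussians adds their precisions and precision-weighted means, and dividing by a Gaussian subtracts them. This is why EP implementations usually store each $\tilde{f}_i$ in terms of these two quantities.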

To make the approximations as accurate as we can, we need a kind of measurement. Naturally, the global KL divergence comes to mind, so we might want to consider minimising the following:

\[
\mathrm{KL}(p \| q) = \mathrm{KL} \left( \frac{1}{p(\mathbf{x})} p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}) \bigg\| p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}) \right).
\]

However, the global KL divergence is difficult to optimise. Instead, EP minimises the KL divergence locally to update each factor one at a time, using a distribution called the tilted distribution. When updating the factor $\tilde{f}_j$, the tilted distribution is defined by

\[ q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta})q_{\setminus j}(\boldsymbol{\theta}), \]
where $q_{\setminus j}$ is called the cavity distribution, which is essentially the approximate posterior with the factor $\tilde{f}_j$ removed:

\[ q_{\setminus j}(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i \neq j} \tilde{f}_i(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}. \]
Then EP finds the $\tilde{f}_j$ that minimises the KL divergence between the tilted distribution and the updated approximate posterior (which we call $q^\text{new}$):
\[
\mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})), \]
where \[q^\text{new}(\boldsymbol{\theta}) = \tilde{f}_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}).
\] If $q^\text{new}$ is a member of the exponential family (e.g. Gaussian), then we can minimise $\mathrm{KL}(q^\text{tilt} \| q^\text{new})$ by matching the moments of $q^\text{new}$ with the moments of $q^\text{tilt}$. This trick is called moment matching. In general, for approximating distributions from the exponential family, matching moments of the approximating distribution with those of the target distribution minimises the forward KL [5].
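
A short sketch of the standard argument (see e.g. [4]) shows where moment matching comes from. Suppose the approximation is written in exponential-family form as $q(\boldsymbol{\theta}) = h(\boldsymbol{\theta}) \exp\left( \boldsymbol{\eta}^\top \mathbf{u}(\boldsymbol{\theta}) - A(\boldsymbol{\eta}) \right)$, where the natural parameters $\boldsymbol{\eta}$, sufficient statistics $\mathbf{u}$ and log-normaliser $A$ are notation introduced here for illustration. For any target $p$, the forward KL as a function of $\boldsymbol{\eta}$ is

\[ \mathrm{KL}(p \| q) = -\boldsymbol{\eta}^\top \mathbb{E}_{p}[\mathbf{u}(\boldsymbol{\theta})] + A(\boldsymbol{\eta}) + \text{const}. \]

Setting the gradient with respect to $\boldsymbol{\eta}$ to zero and using $\nabla A(\boldsymbol{\eta}) = \mathbb{E}_{q}[\mathbf{u}(\boldsymbol{\theta})]$ gives

\[ \mathbb{E}_{q}[\mathbf{u}(\boldsymbol{\theta})] = \mathbb{E}_{p}[\mathbf{u}(\boldsymbol{\theta})]. \]

For a Gaussian, the sufficient statistics are $\boldsymbol{\theta}$ and $\boldsymbol{\theta}\boldsymbol{\theta}^\top$, so the condition amounts to matching the mean and covariance. In EP, $p$ is replaced by $q^\text{tilt}$ and $q$ by $q^\text{new}$.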

Note that the tilted distribution is not Gaussian and therefore it can be difficult to compute its moments analytically. Instead, the moments are often computed numerically: using MCMC, we can generate samples from the tilted distribution (in which case we would not need to calculate its normalising constant) and use the samples to calculate the moments empirically.

The Gaussian EP algorithm is given below:

1. Initialise all the approximating factors $ \tilde{f}_i(\boldsymbol{\theta}), i=1,\dots,n $.
2. Initialise the approximate posterior by setting\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}),
\]where $ p(\boldsymbol{\theta}) $ is a Gaussian prior.
3. Until all $ \tilde{f}_i $ for $ i=1,\dots,n $ converge:
(a) Choose a factor $ \tilde{f}_j $ to refine.
(b) Evaluate the cavity distribution\[
q_{\setminus j}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}.
\]
(c) Set the tilted distribution\[
q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}).
\]
Calculate the mean $ \boldsymbol{\mu} $ and covariance $ \boldsymbol{\Sigma} $ of $ q^\text{tilt} $.
(d) Obtain the new posterior $ q^\text{new} $ that minimises $ \mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})) $ by matching its moments with $ \boldsymbol{\mu} $ and $ \boldsymbol{\Sigma} $.
(e) Evaluate and store the refined factor\[
\tilde{f}_j(\boldsymbol{\theta}) = \frac{q^{\text{new}}(\boldsymbol{\theta})}{q_{\setminus j}(\boldsymbol{\theta})}.
\]
(f) Use the refined factors to update the approximate posterior as\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).
\]
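
The steps above can be sketched in a few lines for a scalar $\theta$. This is a minimal illustration under stated assumptions, not a reference implementation: the model (a zero-mean Gaussian prior with logistic factors $f_i(\theta) = \sigma(y_i \theta)$ for labels $y_i \in \{-1, +1\}$), the data, and the function names `log_site`, `gaussian_ep_1d` and `exact_moments_1d` are all hypothetical. The tilted moments are computed by quadrature on a grid rather than by MCMC, which is only practical in very low dimensions.

```python
import numpy as np

GRID = np.linspace(-10.0, 10.0, 4001)  # quadrature grid for theta (assumed range)

def log_site(y_j, theta):
    """Log of one exact factor f_j(theta) = sigmoid(y_j * theta) (assumed model)."""
    return -np.logaddexp(0.0, -y_j * theta)

def gaussian_ep_1d(y, prior_var=1.0, max_sweeps=50, tol=1e-8):
    """Gaussian EP for scalar theta; returns the mean and variance of q."""
    n = len(y)
    # Each Gaussian is stored as (precision r, shift b), i.e. it is
    # proportional to exp(-0.5 * r * theta**2 + b * theta).
    site_r, site_b = np.zeros(n), np.zeros(n)   # step 1: flat initial factors
    r = 1.0 / prior_var + site_r.sum()          # step 2: q = prior * sites
    b = 0.0 + site_b.sum()                      # (zero-mean prior)
    for _ in range(max_sweeps):                 # step 3
        largest_change = 0.0
        for j in range(n):                      # (a) pick a factor
            r_cav, b_cav = r - site_r[j], b - site_b[j]   # (b) cavity = q / site
            log_tilt = log_site(y[j], GRID) - 0.5 * r_cav * GRID**2 + b_cav * GRID
            w = np.exp(log_tilt - log_tilt.max())         # (c) tilted on the grid
            w /= w.sum()
            mu = np.sum(w * GRID)
            var = np.sum(w * (GRID - mu) ** 2)
            r_new, b_new = 1.0 / var, mu / var            # (d) moment matching
            new_r, new_b = r_new - r_cav, b_new - b_cav   # (e) site = q_new / cavity
            largest_change = max(largest_change,
                                 abs(new_r - site_r[j]), abs(new_b - site_b[j]))
            site_r[j], site_b[j] = new_r, new_b
            r, b = r_new, b_new   # (f) q_new equals prior times all current sites
        if largest_change < tol:
            break
    return b / r, 1.0 / r

def exact_moments_1d(y, prior_var=1.0):
    """Mean and variance of the exact posterior, by brute force on the grid."""
    log_post = -0.5 * GRID**2 / prior_var
    for y_j in y:
        log_post = log_post + log_site(y_j, GRID)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    mean = np.sum(w * GRID)
    return mean, np.sum(w * (GRID - mean) ** 2)

y = np.array([1, 1, 1, -1, 1, 1, -1, 1])  # toy labels (assumed data)
ep_mean, ep_var = gaussian_ep_1d(y)
exact_mean, exact_var = exact_moments_1d(y)
print(f"EP:    mean={ep_mean:.4f}, var={ep_var:.4f}")
print(f"exact: mean={exact_mean:.4f}, var={exact_var:.4f}")
```

Because the cavity, tilted and refined distributions only ever appear through sums and differences of precisions and shifts, steps (b), (e) and (f) reduce to simple arithmetic. Only step (c) requires numerical work.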

Benefits and limitations of EP

As with any method, EP has advantages and disadvantages. Its advantages include the following:

EP updates the approximation factor-by-factor rather than globally, which often leads to better approximations of the target distribution.
EP is faster and computationally cheaper than MCMC. It can also speed up MCMC [1].
EP is easy to parallelise [6].
EP can easily be used in conjunction with other methods. Minka’s Roadmap on EP [7] provides a rich guide to the various areas that have employed EP, including regression, neural networks and nonlinear dynamic systems. EP has also been used with likelihood-free inference methods such as ABC (e.g. EP-ABC [8]).
Due to its factorisation structure, EP is naturally suited to graphical models, such as Bayesian networks and Markov random fields.

However, EP has some serious limitations, which later works have tried to address:

There is a lack of theoretical guarantees, e.g. convergence of the EP algorithm is not guaranteed.
If the number of approximating factors is large, this can lead to substantial memory consumption.

Extensions of EP

EP is well-suited for parallelisation. The original EP algorithm is sometimes called ‘sequential EP’, and its parallel version is known as ‘parallel EP’. Here, factor updates occur simultaneously: $q$ is not updated after each factor refinement within step 3 of the algorithm above, so each $\tilde{f}_j$ is refined using a cavity distribution built from the not-yet-refined factors, excluding $\tilde{f}_j$. Only after all the factors have been updated once is $q$ updated (whereas in sequential EP it is updated after each factor update), and the process then repeats for multiple rounds.

Since the introduction of EP, many variants have been developed, such as averaged EP (AEP) [9], power EP (PEP) [10] and stochastic EP (SEP) [11]. Different choices of divergence function have led to Variational Message Passing (VMP) [12], which uses the reverse KL, and Laplace propagation (LP) [13], which uses the Laplace approximation. Much work has been done to alleviate EP’s issues, such as guaranteeing convergence [14][9], bounding its approximation errors [15], and reducing memory consumption (e.g. SEP).

Due to the close relation between EP and Variational Inference (VI), many methods have been developed from the unification of the two. For example, Partitioned Variational Inference (PVI) [16] arises from a mixture of several methods including power EP, global VI and local VI.

Figure 3: VI and EP schemes encompassed by the PVI framework. Source: [16].

In recent years, there has been growing interest in federated learning. This is where the dataset is partitioned across “clients” that train models locally before the model parameters are aggregated by the central server to update the global model. Since the posterior naturally factorises across partitioned client data, EP adapts well to this framework, producing algorithms such as FedEP [17] and Federated Neural Propagation (FedNP) [18].

References

[1] Barthelmé, S. (2016). The Expectation-Propagation Algorithm: a tutorial. Gipsa-lab, CNRS. https://www.cirm-math.fr/ProgWeebly/Renc1619/Barthelme_EP1.pdf
[2] Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference [Doctoral dissertation]. Massachusetts Institute of Technology.
[3] Jang, E. (2016). A Beginner’s Guide to Variational Methods: Mean-Field Approximation. https://blog.evjang.com/2016/08/variational-bayes.html
[4] Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
[5] Murray, I. (2017). Variational objectives and KL Divergence [Lecture notes]. University of Edinburgh. https://www.inf.ed.ac.uk/teaching/courses/mlpr/2017/notes/w9a_variational_kl.pdf
[6] Cseke, B. and Heskes, T. (2011). Approximate marginals in latent Gaussian models. Journal of Machine Learning Research, 12:417–454.
[7] Minka, T. P. (n.d.). A roadmap to research on EP. https://tminka.github.io/papers/ep/roadmap.html
[8] Barthelmé, S. and Chopin, N. (2012). Expectation Propagation for Likelihood-Free Inference. arXiv:1107.5959.
[9] Dehaene, G. and Barthelmé, S. (2018). Expectation propagation in the large data limit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):199–217.
[10] Minka, T. P. (2004). Power EP [Technical Report MSR-TR-2004-149]. Microsoft Research Ltd.
[11] Li, Y., Hernández-Lobato, J. M. and Turner, R. E. (2015). Stochastic Expectation Propagation. arXiv:1506.04132.
[12] Winn, J., Bishop, C. M. and Jaakkola, T. (2005). Variational message passing. Journal of Machine Learning Research, 6(4):661–694.
[13] Smola, A., Vishwanathan, S. V. N. and Eskin, E. (2003). Laplace propagation. Advances in Neural Information Processing Systems, 16.
[14] Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C. and Teh, Y. W. (2017). Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server. arXiv:1512.09327.
[15] Dehaene, G. and Barthelmé, S. (2016). Bounding errors of Expectation-Propagation. arXiv:1601.02387.
[16] Bui, T. D., Nguyen, C. V., Swaroop, S. and Turner, R. E. (2018). Partitioned Variational Inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206.
[17] Guo, H., Greengard, P., Wang, H., Gelman, A., Kim, Y. and Xing, E. P. (2023). Federated learning as variational inference: a scalable expectation propagation approach. arXiv:2302.04228.
[18] Wu, X., Huang, H., Ding, Y., Wang, H., Wang, Y. and Xu, Q. (2023). FedNP: Towards Non-IID Federated Learning via Federated Neural Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):10399-10407.

Student Perspectives: Unravelling Ancestry – When Genes Don’t Follow the Family Tree

Posted on 13th February 2025 (updated 17th February 2025) by ic23897

A post by Daniella Montgomery, PhD student on the Compass programme.

Introduction

In my project, I am working with my two supervisors, Dan Lawson in the School of Mathematics and Sion Bayliss from the School of Veterinary Science, to investigate the analysis of genomic data and their inferred ancestry trees, to detect problematic lineages of bacterial pathogens.

An ancestry tree is a tree which describes how genetic data is passed down through generations. By understanding the evolution of bacteria, we can develop strategies to alert us when dangerous pathogens evolve. Bacteria typically have only one parent, and if this were always the case, their evolution could be described as a tree. However, bacteria also frequently evolve using horizontal gene transfer, where genetic data is exchanged between lineages with different ancestries, as seen in Figure 1. This disrupts the traditional parent-to-offspring tree, and instead, one needs to represent the ancestry using a more complex graph.

Figure 1: An example of a phylogeny with horizontal gene transfer shown by the red dashed line and the resulting recombined lineage shown as a full red line breaking the structure of this tree.

In this case, each location on the genome may be described by a different tree obtained by following the correct parent at that location, i.e. the “left” or “right” parent of the red individual in Figure 1. These trees can be called “local ancestries”.

Simulating Ancestry with Msprime

The Python package msprime allows us to simulate genetic ancestral data using the coalescent method. The coalescent method is a backwards-in-time stochastic process in which n sample lineages are randomly selected from a population, as seen in Figure 2. As we go back in time, their parent nodes are iteratively redrawn from the population at random. Once two lineages pick the same random parent, they coalesce into one. This process is repeated until a common ancestor is reached.
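
To make the process concrete, here is a minimal pure-Python sketch of the backwards-in-time parent-picking scheme described above. It is a toy illustration and not msprime’s implementation or API; the function name `simulate_coalescent`, the population size, the sample size and the seed are all assumptions chosen for the example.

```python
import random

def simulate_coalescent(population_size, sample_size, seed=None):
    """Trace sampled lineages back in time until they share one ancestor.

    Returns a list whose g-th entry is the number of distinct lineages
    remaining g generations in the past.
    """
    rng = random.Random(seed)
    # Pick the sampled individuals from the present-day population.
    lineages = rng.sample(range(population_size), sample_size)
    lineage_counts = [len(lineages)]
    while len(lineages) > 1:
        # Every lineage picks a parent uniformly at random from the previous
        # generation; lineages that pick the same parent coalesce.
        parents = {rng.randrange(population_size) for _ in lineages}
        lineages = sorted(parents)
        lineage_counts.append(len(lineages))
    return lineage_counts

# Illustrative values: a population of 10 and a sample of 3 lineages.
lineage_counts = simulate_coalescent(population_size=10, sample_size=3, seed=1)
generations_to_common_ancestor = len(lineage_counts) - 1
print("lineages per generation back in time:", lineage_counts)
print("generations to common ancestor:", generations_to_common_ancestor)
```

The lengths of the runs of equal values in `lineage_counts` correspond to the waiting times between coalescence events. In this discrete scheme, more than two lineages can occasionally merge in the same generation.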

Figure 2: A depiction of the coalescent method taken from [1] for a population of 10 individuals and a sample size of 3. By keeping track of the times between coalescence events (T(3) and T(2)) and which lineages coalesce with which, we have a full picture of the phylogenetic tree.

The Impact of Gene Conversion

In this experiment, I am investigating how population structure manifests in genetic data and how this is affected by varying gene conversion rates. Gene conversion is a type of horizontal gene transfer where a donor genome replaces a sequence of DNA in a homologous acceptor genome. Our simulation has one population that splits into two populations with some gene conversion within the populations, as seen in Figure 3. From this, we can obtain local pedigrees across the genome for several sample genomes. Each local pedigree has a complex history, but gene conversion allows each gene to have a different random history.

Figure 3: A conceptual picture of the true population structure and the local pedigree of the sampled population obtained from simulation with nodes coloured by population. Blue represents the ancestral population and red and green represent the two descendent populations, A and B. The leaf nodes are labelled for comparison with future analysis.

Analyzing the Data

One common way to visualise complex histories is through Principal Component Analysis (PCA), in which the data undergoes an eigenvalue decomposition that groups similar genomes together in a far lower-dimensional space. This dimensionality reduction also allows us to visualise certain population structure characteristics [2]. For example, in all of our 2D PCA graphs in Figure 4, we can see a clear split between population A and population B.
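
As a rough illustration of the idea, the sketch below runs PCA by eigendecomposition on synthetic binary genomes from two populations. The data are randomly generated toy values, not output from the msprime simulation, and every name and parameter (`n_per_pop`, `n_sites`, the allele frequencies) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_sites = 20, 300

# Toy data: each population has its own allele frequency at every site.
freq_a = rng.uniform(0.05, 0.95, n_sites)
freq_b = rng.uniform(0.05, 0.95, n_sites)
genomes = np.vstack([
    rng.random((n_per_pop, n_sites)) < freq_a,  # rows 0..19: population A
    rng.random((n_per_pop, n_sites)) < freq_b,  # rows 20..39: population B
]).astype(float)

# Centre each site, then build the genome-by-genome similarity matrix.
centred = genomes - genomes.mean(axis=0)
similarity = centred @ centred.T / n_sites

# Eigendecomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(similarity)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Coordinates of each genome on the first two principal components.
pc_scores = eigvecs[:, :2] * np.sqrt(eigvals[:2])
pc1_a = pc_scores[:n_per_pop, 0]
pc1_b = pc_scores[n_per_pop:, 0]
print("mean PC1, population A:", pc1_a.mean())
print("mean PC1, population B:", pc1_b.mean())
```

In this toy setting the first principal component separates the two populations, which mirrors the split described for the 2D PCA graphs.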

However, there is a limit to how interpretable these PCs are. We use the dendrogram from hierarchical clustering to help sort our data such that more similar data is kept together. Then we create a covariance plot of how similar the principal components of each genome are to each other. This plot is a rudimentary method to help us visualize the population structure of the simulation’s resulting lineages seen in Figure 5a. The population structure is clear, but there is still structure given by the random pedigree shared by all individuals.

Figure 4: The principal component analysis plots with colours showing the true populations for a gene conversion rate of 1e-6.

Figure 5: Covariance matrices for increasing gene conversion rates (reading left to right, top to bottom) 1e-6, 1e-5, 1e-4, showing a breakdown of the sub-population structure.

In Figures 5a to 5c, we can see that as the gene conversion rate increases, the covariance matrix reflects a single random history less and less, and instead “averages out” into the population structure. This visualises how the dependence on the history breaks down as the genomes within each population become more similar to each other due to gene conversion.

If you would like to know more about this topic, please contact me at ic23897@bristol.ac.uk.

[1] Rosenberg, N., Nordborg, M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet 3, 380–390 (2002). (https://doi.org/10.1038/nrg795)

[2] McVean G. A genealogical interpretation of principal components analysis. PLoS Genetics, 5, e1000686 (2009). (https://doi.org/10.1371/journal.pgen.1000686)

Student perspectives: Genetic Boolean Models – How to Make One

Posted on 6th February 2025 by shaun.jordan

A post by Daniel Gardner, PhD student on the Compass programme.

Introduction

My research focuses on genetic interaction networks within lung cancer cells. Our (long-term) aim is to model such networks dynamically using a Boolean modelling framework, and then use this to tie changes in cancer cells’ physiology to certain, often mutated, genes of interest.

Aims and problems

This blog post will focus on the challenge we are currently working on: constructing the model itself. This is often the most challenging element of the research, as it underpins all results going forward, and often there does not exist enough data to fully define a unique model.

In some respects this is acceptable, as Boolean modelling is more of a qualitative approach. Each node in the network is a ‘species’, be that a gene, protein, small molecule, etc. Each directed, labelled edge is either ‘activating’, if an increase in species A causes an increase in species B, or ‘inhibiting’, if the opposite is the case [1].

With this definition, a lot of papers we have looked at define their model purely from the literature [2], [3], [4], either manually mining links, or using pre-existing databases like the Kyoto Encyclopedia of Genes and Genomes (KEGG) [6].

What we are more interested in are methods deriving these models in a far more quantitative way, straight from transcriptomic data. Whilst some of the papers referenced above justify their hand-built models in retrospect by showing they can replicate real-world results [3], we wish to work the other way round – beginning at the real-world results and then using a reverse engineering approach.

Figure 1: The Boolean model used in [2], based off a similar model constructed in [4]. It contains 98 nodes (species) and 254 directed edges (labelled interactions).

Potential solutions

The solutions we have found can be broadly split into two categories: methods that go from:

Raw Data → Interaction Network

and similarly:

Interaction Network → Boolean Model

The former is a much more difficult challenge. Generally, in a published network, each edge will reference experimental work that justifies, e.g. ‘A activates B’. However, datasets that combine many cell-line perturbation experiments are hard to come by, and the experiments are expensive to perform [5]. The problem is often also underdetermined, since the solution space of potential networks is far larger than the amount of data available. One option we may look into in the future, however, involves using other modelling techniques, such as ODEs or Bayesian networks.

The challenge of reverse engineering a Boolean network from a pre-built network is much more feasible. The main problem in this case is considering complex interactions. For example, if we had ‘A inhibits C’ and ‘B activates C’, how do they work in tandem?

Figure 2: Part of the optimisation algorithm from [7] applied to a toy model. In D, we classify each species in the network. All non-compressed nodes are those which we have data to train on. In E, we construct the hypergraph, where for any pair of combined interactions, both the ‘AND’ and ‘OR’ case are considered.

Sticking to the Boolean framework, these two interactions can either be joined through an ‘AND’ relation, or an ‘OR’ relation. For several proteins affecting one specific protein, the combinations of Boolean rules are non-trivial.
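
The difference between the two choices can be seen by writing out the truth tables. The sketch below uses the example of ‘A inhibits C’ and ‘B activates C’; the function names `c_and` and `c_or` are hypothetical and are introduced only for illustration.

```python
from itertools import product

def c_and(a, b):
    """'AND' rule: C is on only if inhibitor A is off AND activator B is on."""
    return (not a) and b

def c_or(a, b):
    """'OR' rule: C is on if inhibitor A is off OR activator B is on."""
    return (not a) or b

# Enumerate all four combinations of the two regulators.
truth_table = [
    (a, b, c_and(a, b), c_or(a, b))
    for a, b in product([False, True], repeat=2)
]

print("A      B      C (AND)  C (OR)")
for a, b, and_rule, or_rule in truth_table:
    print(f"{a!s:6} {b!s:6} {and_rule!s:8} {or_rule!s}")
```

The two rules disagree in two of the four rows, namely when A and B are both off or both on, so the choice changes the model’s behaviour whenever the regulators agree with each other.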

One paper we found that deals with this problem well is Saez-Rodriguez et al. [7], which attempts to train a hypergraph of the interaction network to cell line assay data. It contains a number of different techniques to do with graph and state space reduction, as well as some heuristic rules on which complex interactions to target. For example, it is unlikely in biology for a protein to require multiple other species to necessitate a change in function, so we can remove ‘AND’ links of more than N complex interactions from the state space.
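
To give a feel for how much such a heuristic shrinks the search space, the sketch below counts the candidate ‘AND’ gates for a single target with and without a cap on gate size. It is only a counting illustration and not the algorithm from [7]; the regulator names, the cap of 2 and the function `candidate_and_gates` are assumptions.

```python
from itertools import combinations
from math import comb

def candidate_and_gates(regulators, max_and_size):
    """List every 'AND' gate that combines 2..max_and_size regulators."""
    gates = []
    for size in range(2, max_and_size + 1):
        gates.extend(combinations(regulators, size))
    return gates

# Hypothetical regulators of one target species.
regulators = ["A", "B", "D", "E"]

all_gates = candidate_and_gates(regulators, max_and_size=len(regulators))
pruned_gates = candidate_and_gates(regulators, max_and_size=2)

print("AND gates without a cap:", len(all_gates))
print("AND gates with at most 2 inputs:", len(pruned_gates))
print("pairs check:", comb(len(regulators), 2))
```

Even for four regulators the cap removes almost half of the candidate gates, and the saving grows quickly with the number of regulators.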

One other model component we are looking for, which we have not currently looked into properly, is a ‘layered’ model, which includes different levels of genomic interaction. For example, many papers we have read use ‘protein interaction network (PIN)’ and ‘gene regulatory network (GRN)’ interchangeably. Whilst the two are greatly related, drawing a one-to-one equivalence between the two in all cases is incorrect.

Conclusion and future plans

Starting directly from data to build a network is perhaps too ambitious a challenge, especially with the limited data available. In fact, even to train a Boolean network for optimisation requires quite specific cell-line perturbation data. It could be that we make do with a network partially trained on limited data, and the rest taken from prior knowledge in the literature.

One promising sign is that [7] finds that it is best to begin with ‘too many’ interactions in a literature-curated interaction network, and then ‘prune’ spurious interactions via network optimisation. This is due to these large networks being built from many different sources, some using different tissue, conditions, etc. Therefore, when we desire a model specific to lung adenocarcinoma data, it is natural for the training to remove many of these genetic interactions.

In the future, we aim for this research topic to simply be one section of the wider project. Once we decide upon the most justified Boolean model for lung cancer, we aim to use patient mRNA and mutation data to personalise the models, in order to predict patient specific cell phenotype probabilities. Using this, along with multi-layer protein imaging data from Cancer Research UK, we aim to find a statistically significant link between certain gene mutations, and the resulting shape and, therefore, phenotype of a tumour of cancer cells.

Thank you for reading this blog post. If you have any questions, please feel free to get in touch with me at: daniel.gardner@bristol.ac.uk

References

[1] Abou-Jaoudé, W., Traynard, P., Monteiro, P. T., Saez-Rodríguez, J., Helikar, T., Thieffry, D., and Chaouiya, C. (2016). Logical modeling and dynamical analysis of cellular networks. Frontiers in Genetics, 7.

[2] Béal, J., Montagud, A., Traynard, P., Barillot, E., and Calzone, L. (2019). Personalization of logical models with multi-omics data allows clinical stratification of patients. Frontiers in Physiology, 9.

[3] Cohen, D. P. A., Martignetti, L., Robine, S., Barillot, E., Zinovyev, A., and Calzone, L. (2015). Mathematical modelling of molecular pathways enabling tumour cell invasion and migration. PLOS Computational Biology, 11.

[4] Fumiã, H. (2013). Boolean network model for cancer pathways: Predicting carcinogenesis and targeted therapy outcomes. PloS one, 8:e69008.

[5] Galindez, G., Sadegh, S., Baumbach, J., Kacprowski, T., and List, M. (2023). Network-based approaches for modeling disease regulation and progression. Computational and Structural Biotechnology Journal, 21:780–795.

[6] Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30.

[7] Saez-Rodriguez, J., Alexopoulos, L. G., Epperlein, J., Samaga, R., Lauffenburger, D. A., Klamt, S., and Sorger, P. K. (2009). Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Molecular Systems Biology, 5(1):331.

Student perspectives: Regional Sensitivity Analysis

Posted on 21st November 2024 by cecina.babichmorrow

A post by Cecina Babich Morrow, PhD student on the Compass programme.

Introduction

Sensitivity analysis seeks to understand how much changes in each input affect the output of a model. We want to be able to determine how variation in a model’s output can be attributed to variations in its input. Given the high amount of uncertainty present in most real-world modelling settings, it is crucial to understand the magnitude of this uncertainty’s impact on model results. Knowing how sensitive a model is to a particular parameter can help guide modellers in prioritising what level of precision is needed in estimating that parameter in order to produce valid results. Sensitivity analysis thus serves as a vital tool for modellers in numerous fields, allowing them to assess robustness and to identify key drivers of uncertainty. By systematically analysing the relative amount of influence that each input parameter has on the output, sensitivity analysis reveals which parameters have the greatest impact on the results.

By identifying these critical parameters, stakeholders can prioritize investments in data collection, parameter estimation, and uncertainty reduction. This targeted approach ensures that efforts are concentrated where they will have the most significant impact.

Why use Regional Sensitivity Analysis?

In this blog post, I will focus on one particular sensitivity analysis method that I have been using in my project so far to help understand the sensitivity of an output decision to input parameters which affect that decision. Regional Sensitivity Analysis (RSA) was developed in the field of hydrology, but has widespread applications in environmental modelling, disease modelling, and beyond.

My research focuses on environmental decision-making, so I frequently deal with models that output a decision that can take on one of several discrete values. For example, consider trying to make a decision about what to wear based on the weather. To make our decision, we use three input parameters about the weather: temperature, humidity, and wind speed. Then, our decision model can output one of three decisions: (1) stay home, (2) leave the house with a jacket, (3) leave the house without a jacket. We might then be interested in how sensitive our model is to each of our three weather-related input parameters to understand how much each one contributes to uncertainty in our ultimate decision. In this type of setting, we need to use a sensitivity analysis method that can handle continuous inputs, e.g. temperature, in conjunction with a discrete output, e.g. our decision.

For settings such as these where the inputs of our model are continuous and the outputs are discrete, RSA, also referred to as Monte Carlo filtering, is a potential method of sensitivity analysis [1]. RSA aims to identify which regions of input space correspond to specific values in the output space [2, 3]. Originally, the method was developed in the field of hydrology for cases where the output variable is binary, or made such by applying a threshold. It has since been extended by splitting the parameter space into more than two groups [3, 4]. RSA is well-suited to sensitivity analysis in the case where the output variable is categorical [5].

RSA is fundamentally a Bayesian approach. First, prior distributions are assigned to the input parameters. The model is then run multiple times, sampling input parameters from these priors, and recording the resulting output values. By analysing the relationship between input uncertainties and output uncertainties, RSA identifies which parameters significantly affect the model’s predictions.
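The sampling step can be sketched in a few lines for the weather example. Everything specific here is an assumption made for illustration: the uniform prior ranges, the toy `decide` rule and its coefficients are hypothetical and are not the model used in my project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 5000

# Hypothetical uniform priors on the three weather inputs.
temperature = rng.uniform(-5.0, 30.0, n_runs)   # degrees Celsius
humidity = rng.uniform(20.0, 100.0, n_runs)     # percent
wind_speed = rng.uniform(0.0, 60.0, n_runs)     # km/h


def decide(temp, hum, wind):
    """Toy decision rule: 1 = stay home, 2 = jacket, 3 = no jacket."""
    feels_like = temp - 0.1 * wind - 0.02 * (hum - 50.0)
    return np.where(feels_like < 0.0, 1, np.where(feels_like < 15.0, 2, 3))


# Run the model once per sampled input vector and record the decision.
decision = decide(temperature, humidity, wind_speed)

# Split the sampled temperatures by the decision they led to.
temperature_given_decision = {d: temperature[decision == d] for d in (1, 2, 3)}
for d, values in temperature_given_decision.items():
    print(d, values.size, round(float(values.mean()), 2))
```

The groups in `temperature_given_decision` are the samples from which the conditional distributions of temperature are later estimated.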

How does RSA work?

We will present the mathematical formalisation of RSA in a setting where we have a discrete output variable $y \in \{ y_1, y_2, \ldots, y_m \}$ which can take on one of $m$ possible output values, and a vector of $d$ continuous input variables $\mathbf{x} = [x_1, x_2, \ldots, x_d]$ . We start with prior distributions on the input vector $\mathbf{x}$ , from which we sample before running the model to calculate the output value for that particular input.

Then, RSA compares the empirical conditional cumulative distribution functions (CDFs) $F_{x_i | y_j}$ conditioned on the different output values of $y$ . That is, for the $i$ th input parameter, we take the empirical CDF conditioned on the output of the model being the $j$ th possible output value. For example, in our weather-based decision model, we would be considering the empirical CDF $F(\text{temperature } | \text{ decide to stay home})$ . We then compare these CDFs $F_{x_i|y_j}$ for each of the possible $j \in \{1, \ldots,m\}$ output values (in our case, each of the possible output decisions). If the conditional CDFs of $x_i$ differ greatly in distribution for one or more of the values of $y$ , then we can conclude that our model is sensitive to that particular input parameter. If $F(x_i) = F(x_i | y_1) = \ldots = F(x_i | y_m)$ , then the output is insensitive to $x_i$ on its own. (See the Extensions of RSA section for a discussion of variable interactions.)

The difference between these CDFs can be measured using several possible sensitivity indices. Typically, the Kolmogorov-Smirnov (KS) statistic is applied over all possible values of $y$ , and then some statistic (e.g. mean, median, maximum, etc.) is calculated to summarise the overall sensitivity of $y$ to $x_i$ :

$\text{stat}_{j,k} [KS(x_i)] = \text{stat}_{j,k} \left[\max_{x_i} \left \lvert F_{x_i | y_j} (x_i | y = y_j) - F_{x_i | y_k} (x_i | y = y_k) \right \rvert\right]$

where $j,k \in \{1, \ldots, m\}$ and $\text{stat}$ could be mean, median, maximum, etc.
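As a rough illustration of this index, the sketch below computes the KS statistic between every pair of conditional samples and then summarises them. It uses only NumPy rather than the SAFEpython toolbox, and the function names, the simulated inputs and the threshold model are assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def rsa_ks_index(x, y, stat=np.mean):
    """Summarise the KS statistics over all pairs of output values."""
    groups = [x[y == label] for label in np.unique(y)]
    pairwise = [ks_statistic(a, b) for a, b in combinations(groups, 2)]
    return float(stat(pairwise))


rng = np.random.default_rng(1)
x_sensitive = rng.uniform(350.0, 800.0, 3000)
x_insensitive = rng.uniform(80.0, 120.0, 3000)
# Hypothetical model: the output depends on x_sensitive only.
y = np.digitize(x_sensitive, [500.0, 650.0])

print(rsa_ks_index(x_sensitive, y))    # groups do not overlap, so the index is 1
print(rsa_ks_index(x_insensitive, y))  # small, since the output ignores this input
```

Passing `stat=np.max` or `stat=np.median` switches the summary statistic without changing the pairwise comparison.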

For instance, consider the following situation with an input parameter $x_i$ , where the output $y$ can take on one of three values. We assumed a uniform prior for $x_i \sim \text{Unif}(350, 800)$ . The blue, green, and red distributions shown in Fig. 1 below are the empirical conditional CDFs $F(x_i | y_1)$ , $F(x_i | y_2)$ , and $F(x_i | y_3)$ , respectively, giving the probability that $x_i$ is less than or equal to a given value given that the output result of the model was $y_j$ . The vertical dotted lines are the KS statistic between each of the three pairs of CDFs. Then a statistic, such as the mean, median, or maximum of those three KS values, can be calculated to represent the overall sensitivity of $y$ to the input parameter $x_i$ . For example, the mean KS statistic is 0.5505.

Figure 1. Visualisation of RSA using a summary statistic of the KS statistic as a sensitivity index. The blue, green, and red distributions are the empirical conditional CDFs $F(x_i | y_k)$ for $k \in \{1, 2, 3\}$ , and the vertical dotted lines represent the KS statistic between each of the three pairs of CDFs.

As an alternative to using the KS statistic, we can instead apply a statistic to spread, i.e. the area between the CDFs:

$\text{stat}_{j,k} [\text{spread}(x_i)] = \text{stat}_{j,k} \left[ \int_{-\infty}^\infty \max \left(F_{x_i | y_j} (x_i | y = y_j), F_{x_i | y_k} (x_i | y = y_k)\right) dx_i - \int_{-\infty}^\infty \min \left(F_{x_i | y_j} (x_i | y = y_j), F_{x_i | y_k} (x_i | y = y_k)\right) dx_i \right]$

where $j,k \in \{1,\ldots, m\}$ . In this case, we would be considering the area between each of the three distributions shown in Fig. 1 above and then averaging them (or applying some other summary statistic) as our sensitivity index. For instance, the mean spread between CDFs is 134.09.
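The area between two empirical CDFs can be computed exactly, because both are step functions that are constant between consecutive sample points. The following is a minimal NumPy sketch under that observation; `cdf_spread` is a hypothetical helper, not a function from the SAFEpython toolbox.

```python
import numpy as np


def cdf_spread(a, b):
    """Area between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    # Both CDFs are constant between consecutive grid points, so the
    # integral of |F_a - F_b| is an exact sum of rectangle areas.
    widths = np.diff(grid)
    return float(np.sum(np.abs(cdf_a - cdf_b)[:-1] * widths))


print(cdf_spread(np.array([0.0, 1.0]), np.array([2.0, 3.0])))  # 2.0
```

Unlike the KS statistic, the spread carries the units of $x_i$, so it may be worth rescaling by the width of the prior before comparing parameters with very different ranges.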

Higher values of either sensitivity index for a given input parameter $x_i$ suggest that the output is more sensitive to variations in that parameter, i.e. the distributions of input values leading to a given output value are more different from one another. For example, Figure 2 compares the conditional CDFs of $x_i$ with that of a different input parameter, $x_j$ , with a prior of $x_j \sim \text{Unif}(80,120)$ . We can see that the CDFs $F(x_i | y_k)$ show a high degree of separation, compared to the CDFs $F(x_j | y_k)$ , which do not. This is reflected in the sensitivity indices: for example, the mean KS statistic for $x_j$ is only 0.1648 and the mean spread is only 2.897. Comparing KS statistics in this manner makes RSA a tool well-suited for ranking, or factor prioritisation, one of the main goals of sensitivity analysis that aims to rank parameters by their contribution to variation in the output [1, 5].

Figure 2. Comparison of sensitivity of a model to two input parameters, $x_i$ and $x_j$ . The blue, green, and red distributions are the empirical conditional CDFs $F(x_i | y_k)$ and $F(x_j | y_k)$ for $k \in \{1, 2, 3\}$ .

Extensions of RSA

One notable limitation of RSA, identified since its inception [2], is its inability to handle parameter interactions. A zero value of the sensitivity index is a necessary condition for insensitivity, but it is not sufficient [2, 5]. Inputs that contribute to variation in the model output only through interactions can have the same univariate conditional CDFs, and thus RSA cannot properly identify their impact on model output. For theoretical examples, see Fig. 1 of [2] and Example 6 of Section 5.2.3 in [1]. In our toy example, we may have a decision model where the output decision is not particularly sensitive to temperature or humidity on their own, but it may be very sensitive to an interaction between these two parameters since their combined effects impact how warm or cool the weather actually feels.

In situations such as these where interactions between input variables may matter more than each variable on its own, RSA can be useful for ranking, but it cannot be used for screening, another goal of sensitivity analysis aiming to identify variables with little to no influence on output variability [1, 5]. To address this limitation, RSA can be augmented with machine learning methods such as random forests and density estimation trees [6]. Spear et al. performed a sensitivity analysis of a dengue epidemic model to demonstrate how these tree-based models can augment RSA [6].

First, the authors performed RSA in its original form, using the KS statistic to examine the difference between the univariate CDFs. Then, they used random forest analysis to classify model runs into the various output values. Then, a measure of variable importance, such as Gini impurity, was used to rank the input parameters in terms of their influence on the model output [6]. Random forest allows for the incorporation of the effects of variable interactions in ranking the importance of each parameter. By comparing the parameter ranking resulting from RSA with that from the random forest, they identified parameters which impacted the output through interaction. Finally, they used density estimation trees to help identify regions of parameter space corresponding to particular output values. Density estimation trees are the analogue of classification and regression trees, instead attempting to estimate the probability density function that gave rise to a particular region of output space [7]. By applying density estimation trees as part of the sensitivity analysis, Spear et al. were able to examine the effects of scale on sensitivity, identifying parameters which may be relatively unimportant when ranking across the entire parameter subspace, but are highly influential in small subspaces.

Further research such as this highlights the benefits of combining multiple sensitivity analysis methods in order to gain a full picture of how model inputs affect uncertainty in the output.

Conclusions

Hopefully this blog has been an informative crash course in regional sensitivity analysis! Note that the visualisations in this post have been created using the SAFEpython toolbox [8]. If you have any questions or comments, please feel free to get in touch at cecina.babichmorrow@bristol.ac.uk.

References

[1] A. Saltelli, Global sensitivity analysis: the primer. Wiley, 2008. [Online]. Available: https://onlinelibrary.wiley.com/doi/book/10.1002/9780470725184

[2] R. Spear and G. Hornberger, “Eutrophication in peel inlet—II. identification of critical uncertainties via generalized sensitivity analysis,” Water Research, vol. 14, no. 1, pp. 43–49, 1980. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0043135480900408

[3] J. Freer, K. Beven, and B. Ambroise, “Bayesian estimation of uncertainty in runoff prediction and the value of data: An application of the GLUE approach,” Water Resources Research, vol. 32, no. 7, pp. 2161–2173, 1996. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1029/95WR03723

[4] T. Wagener, D. P. Boyle, M. J. Lees, H. S. Wheater, H. V. Gupta, and S. Sorooshian, “A framework for development and application of hydrological models,” Hydrology and Earth System Sciences, vol. 5, no. 1, pp. 13–26, 2001. [Online]. Available: https://hess.copernicus.org/articles/5/13/2001/

[5] F. Pianosi, K. Beven, J. Freer, J. W. Hall, J. Rougier, D. B. Stephenson, and T. Wagener, “Sensitivity analysis of environmental models: A systematic review with practical workflow,” Environmental Modelling & Software, vol. 79, pp. 214–232, 2016. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1364815216300287

[6] R. C. Spear, Q. Cheng, and S. L. Wu, “An example of augmenting regional sensitivity analysis using machine learning software,” vol. 56, no. 4, p. e2019WR026379. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1029/2019WR026379

[7] P. Ram and A. G. Gray, “Density estimation trees,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 627–635. [Online]. Available: https://dl.acm.org/doi/10.1145/2020408.2020507

[8] F. Pianosi, F. Sarrazin, and T. Wagener, “A Matlab toolbox for global sensitivity analysis,” Environmental Modelling & Software, vol. 70, pp. 80–85, 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1364815215001188

Student perspectives: Compass Annual Conference 2024

Posted on 8th November 2024 (updated 14th November 2024) by ben.anson

A post by Compass students Ben Anson, Ollie Baker, Codie Wood and Rachel Wood.

Introduction

This October, we held our third annual Compass Conference. Unlike previous years, when the event was held in the University’s Fry Building, this time it took place at M Shed, offering scenic views of Bristol harbour. It was a great day for past and present Compass students, academics, and industry partners to come together and discuss this year’s theme: “The Future of Data Science”. With recent advances in machine learning and AI, it felt like a fitting time to learn from each other’s perspectives and to share ideas about how to move forward in this exciting space.

Panoramic view of Bristol harbour, as seen from M Shed

Student Research Talks

The morning started with four ten-minute research talks from Compass students. First was Rahil Morjaria’s talk on “Group Testing” which explored current developments in the field, including algorithms and information-theoretic limits.

Following this, Kieran Morris presented “A Trip to Bregman Geometry and Applications”, considering advancements such as natural gradient methods, Bregman K-means clustering, and EM-projection algorithms that Bregman Geometry has enabled.

Ettore Fincato talked us through “Gradient-Free Optimisation via Integration”, focusing on a novel yet easy-to-implement algorithm for optimisation using Monte Carlo methods. Finally, Ed Milsom spoke about “Data Modalities and the Bias-Variance Decomposition”, taking us through a history of neural networks and speculating about why certain data types are so powerful, and why the future of general-purpose AI must be multi-modal.

Student Lightning Talks

The lightning talks challenged ten students to present on a topic in just three minutes. The ability to quickly convey a message in an engaging and understandable manner, to an audience with diverse backgrounds, is crucial in both academia and industry, and the students rose to the occasion.

Their talks captured the interest of the audience and inspired interesting questions that forced the students to think on their feet. Topics ranged from neural networks and large language models (LLMs), to making music using mathematics.

Compass Alumni Panel

This year’s conference panel, chaired by Compass CDT Director, Professor Nick Whiteley, offered an engaging look into the professional journeys of Compass alumni Dominic Owens, Jake Spiteri and Michael Whitehouse since completing their PhDs. With shared experiences in finance, each panelist provided unique insights into the early career landscape and the skills that helped them succeed.

Jake delved into the details of his day-to-day work in the financial sector, while Dominic discussed the challenge and dedication required to secure a role through extensive networking and job applications. Michael shared details of his transition from finance to epidemiological research. Together, they sparked valuable discussions on what the future of data science might hold for upcoming Compass graduates.

Special Guest Lecture

The conference concluded with an enlightening special guest lecture by Professor Aline Villavicencio, Director of the Institute for Data Science and Artificial Intelligence at the University of Exeter. Her talk, “Testing the Idiomatic Language Limits of Foundation Models: The Strange Case of the Idiomatic Eager Beaver in Cloud Nine,” offered a fascinating counterpoint to the current enthusiasm surrounding LLMs.

Drawing from her research in Natural Language Processing (NLP), Professor Villavicencio demonstrated how even today’s most advanced models struggle with aspects of language that humans master naturally – particularly idioms and multi-word expressions. She illustrated a persistent gap between machine and human linguistic capabilities, reminding us that the path to truly human-like language understanding remains long and complex.

She also shared her perspective on the cyclical nature of NLP research, noting how, throughout her career, there have been multiple predictions about NLP research becoming obsolete as models improve. Yet, as her work on datasets like SemEval (Semantic Evaluation) shows, there remain fundamental challenges in representing and understanding idiomatic language.

Concluding remarks

The successful day of talks, poster sessions and networking culminated with Professor Whiteley sharing his thoughts on what we learned throughout the event. He concluded that the future of our field is certain to be exciting and will encompass a huge range of different areas and ideas. This year’s conference embodied this by providing a platform for students, academics, and industry professionals to share new insights from many different sectors, and to form strong relationships to help forge a path to the future of data science.

Student Perspectives: The trade-off between sample size and number of trials in meta-analysis

Posted on 1st November 2024 (updated 11th November 2024) by xinrui.shi

A post by Xinrui Shi, PhD student on the Compass programme.

Introduction

Meta-analysis is a widely used statistical method for combining evidence from multiple independent trials that compare the same pair of interventions [1]. It is mainly used in medicine and healthcare but has also been applied in other fields, such as education and psychology. In general, it is assumed that there is a numerical measure of effectiveness associated with each intervention, and the goal is to estimate the difference in effectiveness between the two interventions. In medicine, this difference is termed the relative treatment effect. We assume that relative effects vary across trials but are drawn from some shared underlying distribution. The objective is to estimate the mean and standard deviation of this distribution, which we denote by $d$ and $\tau$ respectively; $\tau$ is referred to as the heterogeneity parameter.

In medical trials, patients are randomly allocated to one of the two treatment options, and their subsequent health outcomes are monitored. Each trial then provides an observation of the relative treatment effect in that trial. Meta-analysis uses these observations to estimate the mean $d$ and variance $\tau^2$ of the distribution of relative effects. In this work, we are interested in understanding what conditions maximise the precision of these pooled estimates.

It is well-known that the precision of meta-analysis can be improved by either increasing the number of observations or improving the precision of individual observations. Both approaches, however, require more participants to be included in the meta-analysis. To understand the relative importance of these factors, we constrain the total number of participants to a fixed value. Then, if more trials are conducted to generate additional observations, each trial will necessarily include fewer participants, thereby reducing the precision of each individual observation. Given this trade-off between the precision of observations and their quantity, we ask: how should participants be optimally partitioned across trials to achieve the most precise estimates?

Meta-analysis background

Model

Suppose there are two treatments for a disease, labelled $T_1$ and $T_2$, and we want to know which one is more effective. There are a total of $n$ patients across $M$ trials, and patients in each trial are randomly allocated to one of the two treatments. We write $n_{ij}$ for the number of patients assigned to treatment $T_j$ in trial $i\in \{ 1,\ldots,M \}$.

Outcomes refer to a patient’s health status after treatment. Here, we assume a binary outcome, either recovered or not recovered. A natural measure of the effectiveness of a treatment is the probability of recovery. Let $p_{ij}$ denote the probability of recovery after receiving treatment $T_j$ in trial $i$, and $X_{ij}$ the number of these patients who recover. We assume that outcomes are independent across patients. It then follows that $X_{{ij}}$ has a Binomial distribution,
\[eq(1):\quad X_{{ij}} \sim \text{Binomial}(n_{ij}, p_{ij}).\]
Due to differences in trial populations and procedures, the recovery probabilities are not assumed to be the same across trials. Instead, we assume the exchangeability of relative effects.

We model relative treatment effects on the continuous scale. Therefore, we transform $p_{ij}$ to its log-odds,
\begin{equation*}
\quad Z_{ij} := \text{logit}(p_{ij}) = \log \frac{p_{ij}}{1-p_{ij}}
\label{eq:def_Zij}
\end{equation*}
and define the trial-specific relative treatment effect, $\Delta_{i,12}$, as the log odds ratio (LOR) between the two treatments in this trial,
\begin{equation}\label{eq:LOR}
eq(2):\quad \Delta_{i,12}:= Z_{i2}- Z_{i1}=\log \frac{p_{i2}(1-p_{i1})}{p_{i1}(1-p_{i2})}.
\end{equation}
In words, $\Delta_{i,12}$ represents the effect of $T_2$ relative to $T_1$ in the $i$-th trial; $\Delta_{i,12}>0$ indicates that $T_2$ is more effective than $T_1$.

The random effects (RE) model assumes that the treatment effects vary across trials,
\begin{equation}
eq(3):\quad \Delta_{i,12}\sim \text{Normal}(d_{12},\tau^2),
\label{eq:normal_assump2}
\end{equation}
where $d_{12}$ represents the true mean of relative treatment effects between $T_1$ and $T_2$. The fixed effect (FE) model is a special case of the RE model, in which the relative treatment effects in all trials are assumed to be equal, i.e, $\tau=0$ and $\Delta_{i,12} \equiv d_{12}$ for all $i \in \{1,\ldots,M\}$.

Data

To achieve the primary goal of estimating $d_{12}$ and $\tau$, we first derive expressions for the relative treatment effects in each trial from the available data.

We write $r_{ij}$ for the realisation of the random variable $X_{ij}$. The observed relative treatment effect in the $i$-th trial is then
\[
\hat{\Delta_{i,12}} = \log\frac{r_{i2}(n_{i1}-r_{i1})}{r_{i1}(n_{i2}-r_{i2})}.
\]
It can be shown for our binomial model that, as the numbers of patients $n_{i1}$ and $n_{i2}$ grow large, the distribution from which $\hat{\Delta_{i,12}}$ is sampled is asymptotically normal, centred on the true trial-specific effect $\Delta_{i,12}$ and with a sampling variance $\sigma_i^2$ that can be explicitly expressed in terms of $n_{i1}$, $n_{i2}$, and the unknown parameters $p_{i1}$ and $p_{i2}$. The true variance $\sigma^2_i$ is thus unknown, but can be estimated as follows,
\begin{equation}
eq(4): \quad \hat{\sigma^2_i} = \frac{1}{r_{i1}} +\frac{1}{n_{i1}-r_{i1}} +\frac{1}{r_{i2}} +\frac{1}{n_{i2}-r_{i2}}.
\end{equation}

In many practical applications of meta-analysis, it is only the relative treatment effects and their estimated variance that are reported in individual studies, and not the raw data. Hence, meta-analysis often starts by treating $\hat{\Delta_{i,12}}$ and $\hat{\sigma^2_i}$ as the primary data from the $i$-th trial. The goal is then to aggregate data across trials to estimate $d_{12}$, the true treatment effect.
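As a minimal sketch of how these two summaries could be computed from raw counts, the code below plugs the observed proportions $r_{ij}/n_{ij}$ into the log odds ratio of Equation (2) and sums the reciprocals of the four cell counts (recovered and not recovered in each arm) for the variance. The function names and the counts are hypothetical, and the sketch assumes no cell is zero, since a zero count would make both quantities undefined.

```python
import numpy as np


def observed_log_odds_ratio(r1, n1, r2, n2):
    """Observed log odds ratio of treatment 2 relative to treatment 1."""
    return np.log(r2 * (n1 - r1)) - np.log(r1 * (n2 - r2))


def estimated_sampling_variance(r1, n1, r2, n2):
    """Sum of reciprocal cell counts: recovered / not recovered in each arm."""
    return 1.0 / r1 + 1.0 / (n1 - r1) + 1.0 / r2 + 1.0 / (n2 - r2)


# Hypothetical counts for three trials (not real data).
r1 = np.array([30.0, 45.0, 12.0])
n1 = np.array([100.0, 120.0, 50.0])
r2 = np.array([40.0, 50.0, 20.0])
n2 = np.array([100.0, 120.0, 50.0])

delta_hat = observed_log_odds_ratio(r1, n1, r2, n2)
sigma2_hat = estimated_sampling_variance(r1, n1, r2, n2)
print(delta_hat)
print(sigma2_hat)
```

A positive entry of `delta_hat` means that a larger share of patients recovered under $T_2$ than under $T_1$ in that trial.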

Estimating model parameters

The estimate of $d_{12}$ is given by the weighted mean of estimates from each trial,
\begin{equation}\label{eq:d-hat}
eq(5): \quad\hat{d}_{12} = \frac{\sum_{i=1}^M w_i \hat{\Delta_{i,12}}} {\sum_{i=1}^M w_i }.
\end{equation}
The usual choice of the weight $w_i$ is the inverse of the variance estimate associated with trial $i$. For the FE model this is $w_i = \hat{\sigma_i^{-2}}$, and for the RE model, $w_i = 1/(\hat{\sigma_i^{2}}+\hat{\tau^2})$; here, $\hat{\sigma_i^2}$ is given in (4) and $\hat{\tau^2}$ in (7) below. The choice of inverse variance weights minimises the variance of the estimator $\hat{d_{12}}$, as can be shown using Lagrange multipliers. Substituting these weights in (5) and computing the variance, we obtain that
\begin{equation}
\label{eq:optimised_var}
eq(6): \quad \mbox{Var}(\hat{d_{12}}) =\frac{1}{M} \left( \frac{1}{M} \sum_{i=1}^M \frac{1}{\mbox{Var}(\hat{\Delta_{i,12}})} \right)^{-1},
\end{equation}
where $\mbox{Var}(\hat{\Delta_{i,12}})$ $=\hat{\sigma^2_i}$ in the FE model and $\hat{\sigma^2_i}+\hat{\tau^2}$ in the RE model, as noted above. In words, the variance of the meta-analysis estimate of the treatment effect is the scaled (by $1/M$) harmonic mean of the variances from the individual trials.
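The pooling step is short enough to sketch directly. The code below is an illustration with made-up trial estimates; `pooled_estimate` is a hypothetical helper in which `tau2_hat = 0` corresponds to the FE weights and a positive value to the RE weights.

```python
import numpy as np


def pooled_estimate(delta_hat, sigma2_hat, tau2_hat=0.0):
    """Inverse-variance weighted mean of the trial effects and its variance.

    tau2_hat = 0 gives the fixed effect weights; a positive value gives the
    random effects weights 1 / (sigma2_hat + tau2_hat).
    """
    weights = 1.0 / (np.asarray(sigma2_hat) + tau2_hat)
    d_hat = np.sum(weights * np.asarray(delta_hat)) / np.sum(weights)
    var_d_hat = 1.0 / np.sum(weights)
    return float(d_hat), float(var_d_hat)


delta_hat = np.array([0.10, -0.05, 0.20, 0.02])   # hypothetical trial effects
sigma2_hat = np.array([0.04, 0.09, 0.16, 0.02])   # hypothetical variances

d_fe, var_fe = pooled_estimate(delta_hat, sigma2_hat)
d_re, var_re = pooled_estimate(delta_hat, sigma2_hat, tau2_hat=0.01)

# Check against Equation (6): harmonic mean of the trial variances over M.
m = sigma2_hat.size
harmonic_mean = m / np.sum(1.0 / sigma2_hat)
print(np.isclose(var_fe, harmonic_mean / m))
print(var_re > var_fe)
```

Adding a positive heterogeneity estimate shrinks every weight, so the RE variance is never smaller than the FE variance for the same data.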

One class of methods for estimating the unknown heterogeneity parameter, $\tau$, is the so-called ‘method of moments’ [2], which equates the empirical between-trial variance with its expectation under the random effects model. The widely-used DerSimonian and Laird (DL) [1] estimator is a specific implementation of the method of moments given by
\begin{equation}
eq(7): \quad \hat{\tau^2} = \frac{\sum_{i=1}^M\hat{\sigma_i^{-2}}\left(
\hat{\Delta_{i,12}} - \frac{\sum_{l=1}^M \hat{\sigma_l^{-2}}\hat{\Delta_{l,12}}}{\sum_{l=1}^M \hat{\sigma_l^{-2}}}
\right)^2 - (M-1)}{\sum_{i=1}^M\hat{\sigma_i^{-2}} - \frac{\sum_{i=1}^M\hat{\sigma_i^{-4}}}{\sum_{i=1}^M\hat{\sigma_i^{-2}}}}.
\label{eq:DL_tau2}
\end{equation}
The right-hand side of the above formula can be negative, in which case $\hat{\tau^2}$ is set to zero.
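A minimal sketch of the DL estimator, including the truncation at zero, might look as follows. The function name is hypothetical, and returning zero for a single trial is an assumption made here because the denominator vanishes when $M=1$.

```python
import numpy as np


def dersimonian_laird_tau2(delta_hat, sigma2_hat):
    """DerSimonian and Laird estimate of the between-trial variance."""
    delta_hat = np.asarray(delta_hat, dtype=float)
    w = 1.0 / np.asarray(sigma2_hat, dtype=float)
    m = delta_hat.size
    if m < 2:
        # Assumption: with one trial the denominator is zero, so report 0.
        return 0.0
    d_fixed = np.sum(w * delta_hat) / np.sum(w)      # fixed effect mean
    q = np.sum(w * (delta_hat - d_fixed) ** 2)       # weighted squared deviations
    scale = np.sum(w) - np.sum(w ** 2) / np.sum(w)   # denominator of the estimator
    return max(0.0, float((q - (m - 1)) / scale))    # negative values set to zero


print(dersimonian_laird_tau2([0.0, 4.0], [1.0, 1.0]))  # 7.0
print(dersimonian_laird_tau2([0.0, 1.0], [1.0, 1.0]))  # 0.0 after truncation
```

In the first call, the sample variance of the two effects is 8 and the within-trial variance is 1, so the estimate of 7 is what remains for between-trial heterogeneity.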

Optimal partitioning of patients

Our aim is to determine the optimal allocation of participants across trials that yields the most precise meta-analysis estimates. We first address this analytically by seeking the allocation that minimises the variance of $\hat{d_{12}}$ in an asymptotic regime in which the number of patients tends to infinity. We complement the theoretical analysis with simulations over a wide range of numbers of patients.

Theoretical findings

To obtain analytic results, we make two simplifying assumptions. First, we assume that each trial, and each treatment within each trial, involves the same number of participants, i.e., $n_{ij}=\frac{n}{2M} \hspace{3pt}$ for all $\{i,j\}$. Second, we consider the limit in which the total number of participants, $n$, and the number in each trial, $n/M$, both tend to infinity. In this limiting regime, the observed number of recoveries, $r_{ij}$, in each arm and trial, satisfies $r_{ij}=np_{ij}/2M$, where $p_{ij}$ is the true probability of recovery. Substituting this in (4) yields
\begin{equation} \label{eq:var_est_symm}
eq(8): \quad \hat{\sigma}^2_i = \frac{2Ma_i}{n}, \mbox{ where } a_i=\frac{1}{p_{i1}}+\frac{1}{1-p_{i1}}+\frac{1}{p_{i2}}+\frac{1}{1-p_{i2}}.
\end{equation}

By approximating the asymptotic distribution of $\hat{d_{12}}$, the problem of minimising the variance of $\hat{d}_{12}$ is transformed into the following optimisation problem:
\begin{equation*}
\max_{M, \tau} \left[\sum_{i=1}^M \frac{1}{2Ma_i/n+\tau^2}\right], \quad a_i := \frac{1}{p_{i1}(1-p_{i1})}+\frac{1}{p_{i2}(1-p_{i2})}.
\label{eq:opt_problem_asymtotic}
\end{equation*}
Fixed effects: Under the FE model, $\tau=0$ and the optimisation problem reduces to
\[
\max_M \left[\frac{1}{2M}
\sum_{i=1}^M \frac{1}{a_i}
\right].
\]
Assuming the values of $a_i$ are roughly of the same order of magnitude, we approximate
$$\sum_{i=1}^M \frac{1}{a_i} \approx \frac{M}{\bar{a}}, \quad \bar{a} := \frac{1}{M}\sum_{i=1}^M a_i.$$
Hence, the objective function is independent of $M$, indicating that the partitioning of participants does not influence the precision of estimation. This result aligns with our expectation, as, in the FE model, we are only estimating the mean of the distribution and not the variance.
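This independence can be checked numerically in the idealised case where every trial has the same recovery probabilities, so that all $a_i$ are equal. The sketch below combines the asymptotic trial variance $2Ma_i/n$ with inverse-variance pooling; the function name and the chosen probabilities are assumptions for illustration.

```python
import numpy as np


def fe_asymptotic_variance(n, p1, p2):
    """Asymptotic FE variance of the pooled estimate for n participants.

    p1 and p2 hold the recovery probabilities of each trial; the number of
    trials M is their length, and each trial variance is 2 * M * a_i / n.
    """
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    m = p1.size
    a = 1.0 / (p1 * (1.0 - p1)) + 1.0 / (p2 * (1.0 - p2))
    sigma2 = 2.0 * m * a / n
    return float(1.0 / np.sum(1.0 / sigma2))


n = 20_000
for m in (1, 10, 100):
    print(m, fe_asymptotic_variance(n, [0.4] * m, [0.5] * m))
```

Each line prints the same variance, $2\bar{a}/n$, regardless of how many trials the participants are split into.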

Random effects: In the RE model, we must also estimate the between-trial variance $\tau^2$; this analysis is a work in progress.

Empirical findings

To assess whether findings based on asymptotic performance hold in practical scenarios, we conduct a simulation study involving a total of 20,000 participants. We vary the number of trials, $M$, in unit steps from 1 to 200. The number of participants assigned to each treatment in each trial, $n_{i1}=n_{i2}$, therefore varies from 10,000 to 50. We set the true relative treatment effect equal to $d_{12}=0.05$, with heterogeneity parameter $\tau=0.1$ for the RE model.

Data simulation: For each $M$ (and corresponding $n_{i1}=n_{i2}$), we sample trial-specific relative effects $\Delta_{i,12}$ from Equation (3). To construct the corresponding recovery probabilities, we sample $p_{i,1}$ from a standard uniform distribution and calculate $p_{i,2}$ by rearranging Equation (2) to give
\[
p_{i,2} = \frac{p_{i,1}e^{\Delta_{i,12}}}{1 + p_{i,1}(e^{\Delta_{i,12}}-1)}.
\]
Finally, we simulate the number of recovered patients, $r_{ij}$, from the binomial distribution in Equation (1). This yields the simulated data set,
$$\mathcal{D}= \left\{
(r_{i,j},n_{i,j}): i \in\{1,\ldots,M\}, j\in\{1,2\}
\right\},$$
which we use to estimate the model parameters via Equations (5) and (7).
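The data-generating steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes Equation (3) is a normal distribution $\Delta_{i,12} \sim \mathcal{N}(d_{12}, \tau^2)$ (with $\tau=0$ recovering the FE model), and the function name `simulate_trials` and its arguments are made up for this example rather than taken from our actual code.

```python
import numpy as np

def simulate_trials(M, n_total=20_000, d12=0.05, tau=0.1, rng=None):
    """Simulate one data set of M two-arm trials with binary outcomes."""
    rng = np.random.default_rng() if rng is None else rng
    # Equal allocation: participants per treatment in each trial.
    n_arm = n_total // (2 * M)
    # ASSUMPTION: Equation (3) is Normal(d12, tau^2) on the log odds ratio scale.
    delta = rng.normal(d12, tau, size=M)
    # Baseline recovery probabilities from a standard uniform distribution.
    p1 = rng.uniform(0.0, 1.0, size=M)
    # Rearranged Equation (2), exactly as written in the text.
    p2 = p1 * np.exp(delta) / (1.0 + p1 * (np.exp(delta) - 1.0))
    # Binomial numbers of recovered patients, as in Equation (1).
    r = np.column_stack([rng.binomial(n_arm, p1), rng.binomial(n_arm, p2)])
    n = np.full((M, 2), n_arm)
    return {"r": r, "n": n, "p": np.column_stack([p1, p2]), "delta": delta}

data = simulate_trials(M=10, rng=np.random.default_rng(0))
```

Repeating this call 100 times for each $M$ and passing each data set to the estimators would give the replicates summarised below.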

For each $M$, we repeat the simulation 100 times and calculate the median and the interquartile range (IQR) of the estimates $\hat{d}_{12}$ and $\hat{\tau}$.

Estimation of $\hat{d}_{12}$ in FE model: The following figure shows the median and IQR of the estimated mean relative treatment effect $\hat{d}_{12}$ and its standard error for the FE model. As $M$ increases, the standard error on $\hat{d}_{12}$ increases while its estimate fluctuates around the true parameter value. This indicates that the FE estimate $\hat{d}_{12}$ becomes less precise as participants are partitioned into more trials (with fewer participants in each).

Empirical results for estimated mean of relative effects in FE model

Estimation of $\hat{d}_{12}$ in RE model: The following figure shows the estimated mean and standard error of the relative treatment effect in the RE model. As before, the estimated mean is not affected by the number of trials. The standard error exhibits an initial sharp increase from $M=1$ (one large trial) and then decreases until the number of trials reaches approximately 40. After this, the standard error remains almost fixed. This indicates that for more than one trial, the estimated mean relative treatment effect is more precise when participants are partitioned into more trials.

Empirical results for estimated mean of relative effects in RE model

Estimation of $\hat{\tau}$ in RE model: The following figure shows the estimated mean and standard error of the heterogeneity estimate $\hat\tau$ in the RE model. For very few trials ($M<6$), heterogeneity is underestimated (at $M=1$ the estimate is zero, since there can be no between-trial variation with a single trial). As the number of trials increases, $\hat\tau$ fluctuates about its true value but with increasing variation (IQR). Beyond $M=1$ (where the standard error is necessarily zero), the standard error on $\hat{\tau}$ decreases with increasing $M$ up to approximately $M=10$, after which it increases again. This suggests that, for the scenario simulated in this study, the precision of the heterogeneity estimate is optimal when participants are partitioned into about 10 trials (with $n_{i1}=n_{i2}=1000$).

Empirical results for estimated heterogeneity in RE model

Summary: Even with a large number of participants, the theoretical results only hold for a smaller number of trials. This is because the number of participants per trial decreases when partitioning into more trials.

Future work: As our simulation only extended to 200 trials, it did not investigate scenarios with small numbers of participants per trial. In future work we will explore these more extreme scenarios, taking the number of trials to its maximum (i.e. with one participant per treatment in each trial). We will also investigate the generalizability of our findings to other parameter values ($d_{12}$ and $\tau$), continuous rather than binary outcomes, and Bayesian inference methods.

References

[1] Rebecca DerSimonian and Nan Laird. Meta-analysis in clinical trials. Controlled clinical trials, 7(3):177–188, 1986.
[2] Rebecca DerSimonian and Raghu Kacker. Random-effects model for meta-analysis of clinical trials: an update. Contemporary clinical trials, 28(2):105–114, 2007.

Student perspectives: Extending multilevel network meta-regression to disconnected networks and single-arm studies

Posted on 20th September 2024 by ad21883

A post by Sam Perren, PhD student on the Compass programme.

Over the past year, my research has been focused on a method called network meta-analysis (NMA), which is widely used in healthcare decision-making to summarise evidence on the relative effectiveness of different treatments. In particular, I have been interested in the challenges presented by disconnected networks of evidence and single-arm studies, and I aim to extend the multinma package to handle these challenges. Recently, I presented at the International Society for Clinical Biostatistics (ISCB) conference in Thessaloniki, Greece. In this blog post, I will outline the key points from that presentation and discuss the latest developments from my research.

Network meta-analysis

Network meta-analysis (NMA) pools summary treatment effects from randomised controlled trials (RCTs) to estimate relative effects between multiple treatments [1]. NMA summarises all direct and indirect evidence about treatment effects, allowing comparisons to be made between all pairs of treatments [2]. Covariates such as age, biomarker status, or disease severity can be either effect modifiers, which interact with treatment effects, or prognostic factors, which predict outcomes without interacting with treatment effects [3]. NMA requires a connected network: every pair of treatments must be linked, either directly or indirectly, through a series of comparisons [4]. Plot 1 demonstrates the assumption in NMA of constancy of relative effects; that is, the AB effect observed in study AB would be exactly the same in study AC if a B arm had been included. However, this assumption can break down if there are differences in effect modifiers between studies, which can lead to bias [6].

Plot 1: Simple scenario with A versus B and A versus C study: we assume constancy of relative effects when making an indirect comparison between treatments B and C via the common A arm
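
To make the indirect comparison in Plot 1 concrete, here is the standard consistency calculation (the notation $d_{AB}$, $d_{AC}$, $d_{BC}$ is introduced here for illustration and does not appear in the plot). Under constancy of relative effects, and assuming the AB and AC studies are independent,
$$\hat{d}_{BC} = \hat{d}_{AC} - \hat{d}_{AB}, \qquad \text{Var}(\hat{d}_{BC}) = \text{Var}(\hat{d}_{AC}) + \text{Var}(\hat{d}_{AB}).$$
The indirect estimate is therefore always less precise than either of the direct estimates it is built from.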

Population adjustments & IPD network meta-regression

Population adjustment methods aim to relax the assumption of constancy of relative effects by using available individual patient data (IPD) to adjust for differences between study populations [3]. A network where IPD is available from every study enables the use of IPD network meta-regression, which is considered the gold standard. However, having IPD from every study in a network is rare; some studies may only provide aggregate data (AgD) in published papers.

Multilevel Network Meta-Regression

Multilevel Network Meta-Regression (ML-NMR) is a population adjustment method that extends the NMA framework to synthesise mixtures of IPD and AgD. ML-NMR can produce estimates from networks of any size and for any given target population. It does this by first defining an individual-level regression model on the IPD, and then averaging (integrating) this model over each aggregate study population to form the aggregate-level model, using efficient and general numerical integration [5].
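
The "average the individual-level model over the aggregate population" step can be illustrated with a toy example. This is a deliberately simplified sketch and not the multinma implementation: it assumes a probit model with a single covariate, made-up coefficient values, a normally distributed covariate in the AgD study, and plain Monte Carlo samples as the integration points.

```python
import math
import numpy as np

def probit_inv(eta):
    """Standard normal CDF, i.e. the inverse of the probit link."""
    eta = np.asarray(eta, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(eta / math.sqrt(2.0)))

def individual_prob(x, mu, beta, gamma):
    """Individual-level event probability for covariate value(s) x.

    mu: study baseline, beta: covariate coefficient, gamma: treatment effect
    (all hypothetical values in this sketch).
    """
    return probit_inv(mu + beta * x + gamma)

def aggregate_prob(x_samples, mu, beta, gamma):
    """Average the individual-level model over a study's covariate distribution."""
    return float(np.mean(individual_prob(x_samples, mu, beta, gamma)))

rng = np.random.default_rng(1)
# ASSUMPTION: the AgD study reports a covariate with mean 0.5 and SD 1.0.
x_agd = rng.normal(loc=0.5, scale=1.0, size=100_000)
p_bar = aggregate_prob(x_agd, mu=-0.3, beta=0.4, gamma=0.8)
# Plugging in the mean covariate instead of integrating gives a different answer.
p_at_mean = float(individual_prob(0.5, mu=-0.3, beta=0.4, gamma=0.8))
```

The gap between `p_bar` and `p_at_mean` is the reason the aggregate-level model is formed by integration rather than by evaluating the individual-level model at the mean covariate values.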

Disconnected networks

Healthcare policymakers are increasingly encountering disconnected networks of evidence, which often include studies without control groups (single-arm studies) [6]. Very strong assumptions are required to make comparisons in a disconnected network, such as adjusting for all prognostic factors and all effect modifiers, which may not always be feasible with the available data. Current methods to handle disconnected networks include unanchored matching-adjusted indirect comparison (MAIC) [7] and simulated treatment comparison (STC) [8]. However, these methods have limitations: they cannot generate estimates for target populations outside the network of evidence that might be relevant to decision makers, and they are limited to a two-study scenario. There remains a need for more flexible and robust methods, such as an extended version of the ML-NMR approach, to better handle disconnected networks of evidence.

Example: Plaque Psoriasis

We use a network of 6 active treatments plus placebo, all used to treat moderate-to-severe plaque psoriasis, previously analysed by Phillippo et al. [9]. In this network, we have AgD from the following studies: CLEAR, ERASURE, FEATURE, FIXTURE, and JUNCTURE. Additionally, we have IPD from the IXORA-S, UNCOVER-1, UNCOVER-2, and UNCOVER-3 studies. Outcomes of interest include success/failure to achieve at least 75%, 90% or 100% improvement on the Psoriasis Area and Severity Index (PASI) scale at 12 weeks compared to baseline, denoted PASI 75, PASI 90, and PASI 100, respectively. We make adjustments for potential effect modifiers, including duration of psoriasis, previous systemic treatment, body surface area affected, weight, and psoriatic arthritis.

Plot 2: Network of studies comparing treatments for moderate-to-severe plaque psoriasis. PBO, placebo; IXE, ixekizumab; SEC, secukinumab; ETN, etanercept; UST, ustekinumab. IXE and SEC were each investigated with 2 different dosing regimens.

This network (Plot 2) of evidence is connected; every pair of treatments is joined by a path of study comparisons. We will now disconnect this network to illustrate different methods for reconnecting it using ML-NMR, and then compare the results back to the “true” results from the full evidence network. To do so, we removed the CLEAR study, the placebo arms from the ERASURE, FEATURE, and JUNCTURE studies, and the Secukinumab 150 mg and Secukinumab 300 mg arms from the FIXTURE study in the AgD. In the resulting disconnected network (Plot 3), $N_1$ (left-hand side) contains the studies comparing different doses of Secukinumab (150 mg and 300 mg), while $N_2$ contains the studies comparing all other treatments. We are then faced with the challenge of making valid comparisons between treatments in these two sub-networks.

Plot 3: Disconnected network comparing treatments for moderate-to-severe plaque psoriasis

Reconnected network – internal evidence

One approach is to combine two AgD studies from opposite sides of the network into a single study. The FIXTURE study is the only AgD study in $N_2$. To determine the appropriate study in $N_1$ to combine it with, aggregate-level matching is used [10]. This involves selecting the study that minimises the Euclidean distance between the observed sets of covariates. Table 1 shows that the ERASURE study has the most similar characteristics to FIXTURE. As a result, these two studies are combined into a new four-arm study, referred to as FIXTURE/ERASURE, effectively bridging the gap in the network.
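
The matching step amounts to a nearest-neighbour search over study-level covariate summaries. The sketch below shows the idea with made-up numbers for two hypothetical studies (these are not the values in Table 1); the optional `scale` argument, which puts covariates on a comparable scale before computing distances, is an assumption of this sketch rather than a detail of the analysis.

```python
import numpy as np

def closest_study(target, candidates, scale=None):
    """Return the candidate closest to `target` in Euclidean distance.

    target: sequence of aggregate covariate summaries for the target study.
    candidates: dict mapping study name -> sequence of the same summaries.
    scale: optional per-covariate scale (e.g. standard deviations).
    """
    names = list(candidates)
    X = np.array([candidates[name] for name in names], dtype=float)
    t = np.asarray(target, dtype=float)
    s = np.ones_like(t) if scale is None else np.asarray(scale, dtype=float)
    dist = np.sqrt((((X - t) / s) ** 2).sum(axis=1))
    best = names[int(np.argmin(dist))]
    return best, dict(zip(names, dist.tolist()))

# Hypothetical summaries: (mean weight in kg, mean disease duration in years).
target = [80.0, 17.0]
candidates = {"STUDY_A": [85.0, 17.0], "STUDY_B": [81.0, 18.0]}
best, distances = closest_study(target, candidates)
```

Here `best` is `"STUDY_B"`, since its distance to the target ($\sqrt{2}$) is smaller than that of `STUDY_A` (5).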

Table 1: Aggregate level matching results against FIXTURE study.

Plot 4: Reconnected network using aggregate-level matching, combining FIXTURE and ERASURE into one study

Reconnected network – external evidence

Another way we reconnected the network was to incorporate external observational studies, specifically “Chiricozzi” and “PROSPECT”, which observe the effects of Secukinumab 300mg. We incorporated these single-arm studies into the FIXTURE study as if they were part of the original trial, thereby bridging the network. As a result, we end up with two separate reconnected networks, each using one of the observational studies.

Plot 5: Reconnected network using external control studies

Producing Population-Average Estimates

We have four networks for comparison: the full connected network, the network reconnected using a single-arm study (Chiricozzi), the network reconnected using a single-arm study (PROSPECT), and the network reconnected using aggregate-level matching (FIXTURE/ERASURE). For each network, we will run both ML-NMR and standard NMA without regression. These analyses will produce population-adjusted relative treatment effects and probabilities of achieving at least a 75% improvement on the Psoriasis Area and Severity Index (PASI 75).

The ML-NMR results in the fully connected network will serve as the gold standard. We will compare the results obtained from the different methods (ML-NMR vs. NMA) and across the various networks (Full vs. reconnected) to evaluate the impact of different approaches on the relative treatment effects and outcome probabilities.

Relative Effects vs Placebo

Plot 6: Probit relative treatment effects vs placebo estimates. Target populations in columns, treatments with their disconnected subnetwork in the rows (right-hand side) and reconnected/original networks in the subrows (left-hand side). Coloured by method (MLNMR or NMA)

Plot 6 shows the probit relative treatment effects versus placebo across three populations: FEATURE ($N_1$), UNCOVER-1 ($N_2$), and the external population, PROSPECT. The results demonstrate that for treatments in $N_2$, the estimates produced by both NMA and ML-NMR are generally close to the gold standard. This similarity between NMA and ML-NMR is largely due to the homogeneity of the populations within the networks and the limited set of covariates we used to match the original analysis. However, the NMA results show narrower confidence intervals than ML-NMR, which may suggest overconfidence in the NMA model’s results. ML-NMR accounts for more complexity and variability, and therefore extrapolates its results.

For the Prospect population, the NMA results exhibit slight bias, likely due to differences between the external population and the network populations.

Results for treatments in $N_1$ show varying degrees of accuracy when compared to the gold standard in all populations. Among the reconnected networks, the FIXTURE/ERASURE and PROSPECT reconnected networks perform relatively well, while the Chiricozzi-based network struggles to match the gold standard results. This is because the Chiricozzi population differs the most on covariates from all other populations.

In other words, when comparisons are made across the created “bridges” in the reconnected networks, bias can be introduced into our results.

Plot 7: Reconnected network highlighting comparisons made over the “bridge”

Plot 7 shows the network reconnected using the PROSPECT and Chiricozzi external studies, and illustrates what we mean by comparisons across the “bridge”. All results in plot (1) are relative to placebo (PBO), which is in $N_2$. If we want to compare treatments from $N_1$ with placebo, we need to use these generated direct comparisons, or “bridges”.

Absolute probability of PASI75

Plot 8: Probability of absolute outcomes of PASI75. Target populations in columns, treatments with their disconnected subnetwork in the rows (right-hand side) and reconnected/original networks in the subrows (left-hand side). Coloured by method (MLNMR or NMA)

In Plot 8, the FEATURE population results are very close to the gold standard for treatments in $N_1$, but results for treatments in $N_2$ show some bias. Unlike in the probit differences, the reference treatments for FEATURE are now Secukinumab 150mg and 300mg (SEC_150 & SEC_300), so in order to estimate absolute outcomes for $N_2$ treatments we need to use our “bridges”, thereby incurring bias. The same story holds for the other two populations: since UNCOVER-2 is in $N_2$, estimates for treatments in $N_1$ are biased compared to the gold standard, to an extent that depends on the network used. For PROSPECT, the reference treatment is Secukinumab 150mg ($N_1$), so results for $N_2$ treatments vary from the gold standard.

Key Findings

When producing estimates across reconnected networks, there’s a risk that the estimates may be biased or deviate from the true value. In our analysis, reconnecting the networks using ML-NMR showed little improvement over NMA. These results highlight the importance of carefully selecting studies to bridge networks and minimise bias. As disconnected networks become more common, it’s clear that better tools for evidence synthesis are needed to ensure reliable results that can inform clinical decisions and improve outcomes.

Future Work

To improve the performance of ML-NMR over NMA, we will try incorporating more covariates into the regression model. We also plan to conduct a comprehensive simulation study to compare methods under various scenarios and explore additional approaches, such as class effects. Developing methods to assess the strong assumptions required for reconnecting networks will be another priority. Finally, we aim to implement these methods within the multinma package.

References

[1] – Sofia Dias, Anthony E Ades, Nicky J Welton, Jeroen P Jansen, and Alexander J Sutton. Network meta-analysis for decision-making. John Wiley & Sons, 2018.
[2] – Song F, Altman DG, Glenny AM, Deeks JJ. Validity of indirect comparison for estimating efficacy of competing interventions: empirical evidence from published meta-analyses. BMJ. 2003 Mar 1;326(7387):472.
[3] – David M Phillippo, Anthony E Ades, Sofia Dias, Stephen Palmer, Keith R Abrams, and Nicky J Welton. Methods for population-adjusted indirect comparisons in health technology appraisal. Medical decision making, 38(2):200–211, 2018
[4] – Sofia Dias, Alex J Sutton, AE Ades, and Nicky J Welton. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Medical Decision Making, 33(5):607–617, 2013
[5] – David M Phillippo, Sofia Dias, AE Ades, Mark Belger, Alan Brnabic, Alexander Schacht, Daniel Saure, Zbigniew Kadziola, and Nicky J Welton. Multilevel network meta-regression for population-adjusted treatment comparisons. Journal of the Royal Statistical Society. Series A (Statistics in Society), 183(3):1189, 2020
[6] – John W Stevens, Christine Fletcher, Gerald Downey, and Anthea Sutton. A review of methods for comparing treatments evaluated in studies that form disconnected networks of evidence. Research synthesis methods, 9(2):148–162, 2018
[7] – Signorovitch, James E., et al. “Matching-adjusted indirect comparisons: a new tool for timely comparative effectiveness research.” Value in Health 15.6 (2012): 940-947.
[8] – Caro JJ, Ishak KJ. No head-to-head trial? Simulate the missing arms. Pharmacoeconomics. 2010;28(10):957–67.
[9] – David M Phillippo, Sofia Dias, AE Ades, Mark Belger, Alan Brnabic, Daniel Saure, Yves Schymura, and Nicky J Welton. Validating the assumptions of population adjustment: application of multilevel network meta-regression to a network of treatments for plaque psoriasis. Medical Decision Making, 43(1):53–67, 2023
[10] – Leahy, Joy, et al. “Incorporating single‐arm evidence into a network meta‐analysis using aggregate level matching: assessing the impact.” Statistics in medicine 38.14 (2019): 2505-2523.

Student Perspectives: Bayesian LLM Finetuning

Posted on 28th August 2024 (updated 2nd September 2024) by sam.bowyer

A post by Sam Bowyer, PhD student on the Compass programme.

Training large AI models is tricky business. First you’ll want to raise money, and lots of it. (OpenAI’s GPT-4 reportedly cost over $100 million to train, roughly equivalent to 0.5% of Bristol’s GDP.) With that money you’ll need to buy hardware (25,000 NVIDIA A100 GPUs should do), hire a team of talented engineers, and purchase licensing to vast quantities of data (though you might consider forgoing that last one and just hope no one complains…). Once you’ve collected enough data (say, ~13 trillion tokens-worth¹) and settled on a model architecture with hundreds of billions, if not trillions, of parameters (each taking up at least a byte of memory), you can sit back and wait around for 100 days whilst your engineers firefight software and hardware crashes to steer your model’s training to completion.²

But for those of us who can’t afford the $10^{25}$ FLOPs (floating point operations) needed to train such a model (or who might want to avoid the associated environmental costs), what can we do? The answer lies in finetuning: taking one of the available pretrained ‘foundation’ models (such as ChatGPT, or an open source model such as one from Meta’s Llama series) and tweaking them to suit your own purposes.

The basic idea is this: these foundation models are great multitaskers, they’ve been trained well enough to generate reasonable outputs to a wide variety of inputs, but if you’re only interested in using them on a particular set of data ($\mathcal{D}_\text{finetune}$), or for a particular task, then it might be a good idea to spend some extra time training on that data specifically, after the rest of (pre)training has taken place. Similarly, it’s worth noting that the foundation model you get straight out of pretraining will mimic its input dataset, $\mathcal{D}$. In the case that $\mathcal{D}$ is too large to be checked by humans (e.g. 13 trillion tokens — essentially including most of the public internet), your model will almost certainly have learnt undesirable behaviour and be capable of producing dangerous, offensive, and harmful output. Finetuning is critical to the pursuit of safe AI, putting guardrails in place and ensuring that a model’s behaviour is aligned with our desires, both in terms of utility and safety.³

In this blog post, I’ll give an overview of LLM finetuning, specifically parameter-efficient finetuning, which tackles the problem of finetuning models whilst avoiding the computational burden that was required for pretraining. Even if your finetuning dataset $\mathcal{D}_\text{finetune}$ is much smaller than your pretraining set $\mathcal{D}$, you’ve still got the computational problem of the model’s size: how do you efficiently⁴ do gradient-based optimisation on a model with potentially billions of parameters? I’ll also argue that taking a Bayesian approach can be beneficial, and that whilst the added computational cost of Bayes might not be feasible (or even all that helpful) in the pretraining setting, these costs are much less impactful when finetuning.

Parameter Efficient Finetuning

Perhaps the simplest way to finetune a model on $\mathcal{D}_\text{finetune}$ is to simply carry on training as before (with some gradient-based optimiser like Adam [1]) but on this new dataset (often repeatedly, i.e. for multiple ‘epochs’). This is known as full finetuning (FFT) and usually leads to the best results; however, it’s often infeasible due to the size of the model being finetuned.

Recall that the model we’re working with might have billions of parameters: in order to train these parameters we need to store not only their values, but also their gradients, as well as the activation values of each neuron in the network and, depending on your optimiser, potentially momentum and second-moment gradient information (e.g. Adam makes use of the exponential moving average of gradients and the EMA of squared gradients, all per parameter). On a model like Llama-7B, whose 7 billion parameters at 8-bit precision require 7GB of storage, these extra gradient costs can easily overwhelm the 16GB capacity of a typical high-end consumer GPU such as an NVIDIA RTX 4080. (Add to that the fact that we usually want to batch our input data, that is, pass multiple input examples through the model at a time, and you can see where things start to spiral out of control.)
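
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes, unrealistically, that weights, gradients and both Adam moment buffers are all stored at the same 8-bit precision, and it ignores activations and batching entirely, so the figure it gives is optimistic. The function name `memory_gb` is just for this example.

```python
def memory_gb(n_params, bytes_per_value=1.0, train=True, optimizer_states=2):
    """Rough memory footprint in GB (1 GB = 1e9 bytes).

    Counts one copy of the weights, plus (when training) one copy of the
    gradients and `optimizer_states` extra per-parameter buffers
    (two for Adam: the EMAs of the gradients and of the squared gradients).
    Activations are deliberately ignored.
    """
    copies = 1
    if train:
        copies += 1 + optimizer_states
    return n_params * bytes_per_value * copies / 1e9

weights_only = memory_gb(7e9, train=False)  # just holding the model
full_finetune = memory_gb(7e9)              # weights + gradients + Adam state
```

Under these assumptions, holding a 7-billion-parameter model takes 7GB, but full finetuning with Adam takes roughly four times that, already well beyond a 16GB card before a single activation is stored.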

This motivates the need for finetuning algorithms that have a smaller memory footprint. There’s an exciting field of literature in model compression and quantisation — using compression techniques to represent your model and its gradients by fewer and fewer bits⁵, but another approach is to simply reduce the number of parameters that you train during finetuning. However, choosing which parameters to train and which to freeze (thus freeing up space that would’ve gone to storing the gradient information of those parameters) is far from trivial.

Partial Finetuning

In order to discuss finetuning techniques, it’ll be useful to briefly touch on the basic architecture of neural networks. The simplest type of neural network is a multilayer perceptron, or MLP, which consists of $L$ layers in which the output of layer $l-1$, $x^{l-1} \in \mathbb{R}^{d_{l-1}}$, is multiplied by a learnable weight matrix $W^l \in \mathbb{R}^{d_{l} \times d_{l-1}}$ and added to a learnable bias vector $b^l \in \mathbb{R}^{d_{l}}$ before being transformed through a nonlinearity, such as a sigmoid $\sigma(x) = (1+e^{-x})^{-1}$:
$$x^l = \sigma(W^l x^{l-1} + b^l),$$
with $x^0 \in \mathbb{R}^{d_0}$ being input data.

A common strategy for finetuning is to freeze all weights in earlier layers, say, all but the final $\hat{L}$ layers, and only train the set of parameters $\{W^l, b^l \mid l > L-\hat{L}\}$. Assuming constant network width $d = d_0 = \ldots = d_L$, this reduces the number of trainable parameters from $L(d^2 + d)$ to $\hat{L}(d^2 + d)$.

Another simple finetuning strategy is BitFit [3], which works by only training the bias parameters, leading to a total of $Ld$ trainable parameters (though of course this does make the iterative finetuning updates significantly less expressive).
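
The parameter counts for these strategies are easy to compare directly. The helper below is a small illustration for the constant-width MLP described above; the function name, the strategy labels, and the example sizes ($L=32$, $d=4096$) are chosen purely for this sketch.

```python
def mlp_trainable_params(L, d, strategy="full", L_hat=None):
    """Trainable parameter count for a width-d MLP with L layers."""
    per_layer = d * d + d  # one weight matrix plus one bias vector
    if strategy == "full":
        return L * per_layer
    if strategy == "last_layers":
        if L_hat is None or not 0 < L_hat <= L:
            raise ValueError("L_hat must satisfy 0 < L_hat <= L")
        return L_hat * per_layer
    if strategy == "bitfit":
        return L * d  # biases only
    raise ValueError(f"unknown strategy: {strategy}")

full = mlp_trainable_params(32, 4096)
last_two = mlp_trainable_params(32, 4096, "last_layers", L_hat=2)
bitfit = mlp_trainable_params(32, 4096, "bitfit")
```

With these sizes, training only the last two layers keeps 1/16 of the parameters trainable, while BitFit keeps fewer than one in four thousand.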

It’s important to note that the final-layers-only approach can also be applied more generally. Most LLM architectures use transformers [4] as their backbone, which, very loosely speaking, consist of multi-headed attention layers (another, more complicated type of neural network) followed by an MLP (plus a whole bunch of other stuff containing yet more parameters), and with each transformer’s output typically going on to form the input of another transformer. So it’s common to see only the final transformer finetuned, or even only the final transformer’s MLP.

Since it would be ill-advised to take a long detour into the definition of multi-headed attention here (as that’d be fairly involved and might take the momentum out of our finetuning discussion), I won’t do that. (Instead, I’ll banish it to yet another (increasingly-obnoxious) footnote⁶.)

Adapter Tuning

Rather than retraining the weights already in your model, most modern finetuning approaches actually add new parameters to the model, termed ‘adapters’, and only train these instead. For example, [5], [6], and [7] all essentially propose techniques in which we insert two-layer MLPs at different places inside a transformer, with varying results.

Adapter methods have the benefit of being ‘plug-and-play’, in the sense that you can train multiple adapters on different finetuning tasks and then insert them into your model if you detect that it would be helpful for a user’s given request.

Low Rank Adaptation (LoRA)

By far the most common (and almost de facto standard as of 2024) finetuning method is Low Rank Adaptation (LoRA) [8]. The intuition behind LoRA is that the parameters inside your pretrained model are probably fairly close to their finetuned optimal values already, in the sense that those optimal values can probably be reached using only updates in a low-rank subspace. As such we can pose our finetuning problem in terms of finding the low-rank matrix $\Delta W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$ that optimises a given pretrained weight matrix $W_0$, leading to
$$W_\text{finetune} = W_0 + \Delta W,$$
where the low rank of $\Delta W$ is enforced by parameterising it as $$\begin{aligned}\Delta W & = B A \\ B & \in \mathbb{R}^{d_\text{in} \times r} \\ A & \in \mathbb{R}^{r \times d_\text{out}} \end{aligned}$$so that $\text{rank}(\Delta W) \leq r \ll \text{rank}(W_0) \leq \min (d_\text{in}, d_\text{out})$. (Note that LoRA places the adapter in parallel to a pretrained weight matrix $W_0$, in contrast to the serial/in-between placement of the MLP adapters mentioned in the previous section.)

Figure reproduced from [8], showing a Gaussian-initialisation of $A$ and a zero-initialisation of $B$.

By only learning $A$ and $B$, we reduce the number of parameters from $d_\text{in} d_\text{out}$ to $r(d_\text{in} + d_\text{out})$. In practice, often $d_\text{in}$ and $d_\text{out}$ will be in the range ~$10^3$-$10^4$ whilst we’ll choose $r$ to be somewhere between 4 and 128. To cut down on the number of trainable parameters even further, we often only apply LoRA adapters to certain weight matrices in a model, for example only the query and value matrices ($W^Q$ and $W^V$) of attention layers.
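
As a minimal illustration of the parameterisation, here is a NumPy sketch of a LoRA-adapted linear map. It follows the shapes used above ($W_0$ and $\Delta W$ of shape $d_\text{in} \times d_\text{out}$, so inputs multiply from the left as row vectors), uses the zero/Gaussian initialisation from the figure, and omits details of real implementations such as the scaling factor and dropout. The class name and the initialisation scale of 0.02 are assumptions of this sketch.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update B @ A."""

    def __init__(self, W0, r, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        d_in, d_out = W0.shape
        self.W0 = W0                                      # frozen, never updated
        self.B = np.zeros((d_in, r))                      # zero init: delta_W = 0 at start
        self.A = rng.normal(0.0, 0.02, size=(r, d_out))   # Gaussian init (scale assumed)

    def delta_W(self):
        return self.B @ self.A

    def __call__(self, x):
        # x has shape (batch, d_in); the adapter runs in parallel to W0 and
        # the full d_in x d_out update matrix is never formed explicitly.
        return x @ self.W0 + (x @ self.B) @ self.A

    def n_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W0 = rng.normal(size=(64, 32))
layer = LoRALinear(W0, r=4, rng=rng)
x = rng.normal(size=(5, 64))
y = layer(x)  # identical to x @ W0 until B is trained away from zero
```

Only `A` and `B` would receive gradients during finetuning, giving $r(d_\text{in} + d_\text{out}) = 384$ trainable values here instead of $d_\text{in} d_\text{out} = 2048$.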

LoRA’s success has led to a large number of variants, such as AdaLoRA [9] which adaptively decides which weight matrices to apply LoRA to based on their singular values. Other methods include PiSSA (Principal Singular Values and Singular Vectors Adaptation) [10] which performs LoRA updates only on the first few principal components of each weight matrix and freezes the ‘residuals’ which come from later principal components. One recent paper presents GaLore (Gradient Low Rank Projection) [11], which performs an SVD of each weight matrix’s gradient every few iterations and performs low-rank updates by specifically only optimising in the (low-rank) space spanned by the leading singular vectors.

Bayesian Finetuning

Although work has been done to introduce uncertainty estimation into pretraining, the results often aren’t worth the extra computational costs [12, 13]. Not only are the model sizes too large to make uncertainty quantification feasible, but the fact that your pretraining dataset, $\mathcal{D}$, is gigantic provides little uncertainty to reason about. However, in the context of finetuning we typically have a much smaller dataset, for which we’ll likely have much more uncertainty, and we also tend to work with far fewer parameters, allowing for extra computational budget to go towards the use of Bayesian methods.

Consider splitting up our finetuning set into prompt and target/response pairs $(X,y) \in \mathcal{D}_\text{finetune}$ where $X \in \mathcal{T}^{B \times n}$ is a matrix of $B$ sequences each of maximum length $n$ (potentially padded out with null-tokens) constructed with the token set $\mathcal{T}$, and $y \in \mathcal{Y}^B$ could be a corresponding batch of single tokens (in which case $\mathcal{Y} = \mathcal{T}$), or a batch of classification labels (e.g. in sentiment analysis, or multiple-choice Q&A, in which case $\mathcal{Y}$ might be different to $\mathcal{T}$).

What we fundamentally want to learn is a posterior distribution over all learnable parameters $$p(\theta | \mathcal{D}_\text{finetune}) = p(\theta | X, y),$$where, for example, in the case of LoRA finetuning, $\theta$ is the collection of all adapter weights $A$ and $B$. This not only gives us information about the uncertainty in the model’s parameters, which can be useful in itself, but can also be used to give us the posterior predictive distribution for a test input $x^* \in \mathcal{T}^n$, $$p(y^* | x^*, \mathcal{D}_\text{finetune}) = \int p(y^* | x^*, \theta)p(\theta|\mathcal{D}_\text{finetune})d\theta.$$
This is often more desirable than a predictive distribution that only uses a point estimate of $\theta$ and which would then ignore the uncertainty present in the model’s parameters.

Bayesian LoRA (via Laplace Approximation and KFAC)

Yang et al. [14] suggest a method for finding the posterior $$p(\theta | X, y) \propto p(y | X, \theta)p(\theta)$$post-hoc, i.e. after regular finetuning (with LoRA) using a Laplace approximation — which assumes the posterior is a Gaussian centered at the maximum a-posteriori (MAP) solution, $\theta_\text{MAP}$.

First, we note that the MAP solution can be written as the maximum of the log-joint $\mathcal{L}(y, X; \theta)$, $$\begin{align} \mathcal{L}(y, X; \theta) &= \log p(y | X, \theta) +\log p(\theta) = \log p(\theta | X, y) + \text{const} \\ \theta_\text{MAP} &= \arg\max{}_\theta \mathcal{L}(y, X; \theta). \end{align}$$
Then assuming that the finetuning successfully optimised $\theta$, i.e. reached parameter values $\theta_\text{MAP}$, the Laplace approximation involves taking the second-order Taylor expansion of the log-joint around $\theta_\text{MAP}$, $$\mathcal{L}(y, X; \theta) \approx \mathcal{L}(y, X; \theta_\text{MAP}) + \frac{1}{2}(\theta - \theta_\text{MAP})^T(\nabla_\theta^2 \mathcal{L}(y, X; \theta)|_{\theta_\text{MAP}})(\theta - \theta_\text{MAP}).$$(The expansion’s first-order term disappears because the gradient of the MAP objective at $\theta_\text{MAP}$ is zero.) Since the Hessian at a maximum is negative definite, this quadratic form can then be written as a Gaussian density, with mean $\theta_\text{MAP}$ and covariance given by the inverse of the negative log-joint Hessian: $$\begin{align}p(\theta | X, y) &\approx \mathcal{N}(\theta ; \theta_\text{MAP}, \Sigma), \\
\Sigma &= -(\nabla_\theta^2 \mathcal{L}(y, X; \theta)|_{\theta_\text{MAP}})^{-1}.\end{align}$$
The authors make use of various tricks to render computing this Hessian inverse feasible, most notably Kronecker-Factored Approximate Curvature (KFAC) [15]. (A nice explanation of which can be found at this blog post.)

Using the Laplace approximation comes with added benefits. Specifically, we can make use of the Gaussian form of the (approximate) posterior to easily compute two values of interest: samples from the posterior predictive distribution, and estimates of the marginal likelihood.

For the first of these, we can linearise our model, with output $f_\theta(x^*)$ approximated by a first-order Taylor expansion around $\theta_\text{MAP}$, $$f_\theta(x^*) \approx f_{\theta_\text{MAP}}(x^*) + \nabla_\theta f_\theta(x^*)|^T_{\theta_\text{MAP}}(\theta - \theta_\text{MAP}).$$
Since the linearised output is an affine function of $\theta$, and $\theta$ is (approximately) Gaussian with covariance $\Sigma$, the output is itself Gaussian: $$f_\theta(x^*) \sim \mathcal{N}(f_{\theta_\text{MAP}}(x^*), \Lambda)$$ where $$\Lambda = (\nabla_\theta f_\theta(x^*)|^T_{\theta_\text{MAP}})\Sigma(\nabla_\theta f_\theta(x^*)|_{\theta_\text{MAP}}).$$
With this, we can easily obtain samples from our predictive posterior through reparameterised sampling of some Gaussian noise $\mathbf{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and a Cholesky decomposition $\Lambda = LL^T$: $$\hat{y} = f_\theta(x^*) = f_{\theta_\text{MAP}}(x^*) + L\mathbf{\xi}.$$
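As a rough illustration of this sampling step, the sketch below uses a hypothetical two-output, two-parameter model, with assumed values for $\theta_\text{MAP}$ and $\Sigma$ (in practice these would come from the Laplace approximation). Here `J` is the Jacobian with one row per output, so that $\Lambda = J \Sigma J^T$ plays the role of the expression above.

```python
import math
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def f(theta, x):
    """Hypothetical toy model: f_theta(x) = (theta_0 * x, tanh(theta_1 * x))."""
    return [theta[0] * x, math.tanh(theta[1] * x)]

def jacobian(theta, x):
    """Derivatives of each output (rows) with respect to each parameter (columns)."""
    return [[x, 0.0],
            [0.0, x * (1.0 - math.tanh(theta[1] * x) ** 2)]]

theta_map = [0.8, -0.3]                 # assumed MAP estimate
Sigma = [[0.20, 0.05], [0.05, 0.10]]    # assumed Laplace covariance
x_star = 1.5

J = jacobian(theta_map, x_star)
Lambda = matmul(matmul(J, Sigma), transpose(J))   # covariance of the linearised output
L = cholesky(Lambda)

rng = random.Random(0)

def sample_prediction():
    """One reparameterised sample: f_MAP(x*) + L xi, with xi ~ N(0, I)."""
    xi = [rng.gauss(0.0, 1.0) for _ in range(len(L))]
    mean = f(theta_map, x_star)
    return [mean[i] + sum(L[i][k] * xi[k] for k in range(len(L))) for i in range(len(L))]

samples = [sample_prediction() for _ in range(5)]
```

For an LLM the outputs would be logits over the vocabulary (or a set of answer tokens), and each sample would be passed through a softmax before averaging.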

The second value of interest is the marginal likelihood (also known as the model evidence), which is useful for hyperparameter optimisation and can be approximated as follows, $$\begin{align}p(y|X) &= \int p(y|X,\theta)p(\theta)d\theta \\ &\approx \exp (\mathcal{L}(y, X; \theta_\text{MAP}))(2\pi)^{D/2}\det(\Sigma)^{1/2},\end{align}$$where $D$ is the number of parameters in $\theta$.
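One way to sanity-check this formula is on a model where the log-joint is exactly quadratic in $\theta$, so that the Laplace estimate should coincide with the true evidence. The sketch below assumes a toy conjugate model (observations $y_i \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$, so $D = 1$); the data and candidate prior variances are made up for illustration.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_joint(theta, ys, noise_var, prior_var):
    """log p(y | theta) + log p(theta), including normalising constants."""
    return (sum(log_normal_pdf(y, theta, noise_var) for y in ys)
            + log_normal_pdf(theta, 0.0, prior_var))

def laplace_log_evidence(ys, noise_var, prior_var):
    """log of exp(L(theta_MAP)) * (2 pi)^(D/2) * det(Sigma)^(1/2), with D = 1."""
    n = len(ys)
    precision = n / noise_var + 1.0 / prior_var   # minus the Hessian of the log-joint
    theta_map = (sum(ys) / noise_var) / precision
    sigma2 = 1.0 / precision
    return (log_joint(theta_map, ys, noise_var, prior_var)
            + 0.5 * math.log(2.0 * math.pi) + 0.5 * math.log(sigma2))

def exact_log_evidence(ys, noise_var, prior_var):
    """Closed form: y ~ N(0, noise_var * I + prior_var * 1 1^T)."""
    n = len(ys)
    s1 = sum(ys)
    s2 = sum(y * y for y in ys)
    log_det = (n - 1) * math.log(noise_var) + math.log(noise_var + n * prior_var)
    quad = (s2 - prior_var * s1 ** 2 / (noise_var + n * prior_var)) / noise_var
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * log_det - 0.5 * quad

ys = [1.2, 0.7, 1.9, 1.4]
approx = laplace_log_evidence(ys, noise_var=0.5, prior_var=2.0)
exact = exact_log_evidence(ys, noise_var=0.5, prior_var=2.0)

# Hyperparameter selection: pick the prior variance with the highest evidence.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_prior_var = max(candidates, key=lambda v: laplace_log_evidence(ys, 0.5, v))
```

For a neural network the log-joint is not quadratic, so the two quantities would no longer agree exactly; the Laplace value is then only an estimate.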

Using Stein Variational Gradient Descent (SVGD)

A reasonable question to ask is whether it might be feasible to learn the posterior distribution during finetuning, rather than afterwards. One method for achieving this is Stein variational gradient descent (SVGD) [16], in which a collection of $n$ parameter particles $\{\theta_i^{(0)}\}_{i=1}^n$ is iteratively updated to fit the true posterior using some similarity function (i.e. a kernel) $k: \Theta \times \Theta \to \mathbb{R}$, $$\begin{align}\theta_i^{(t+1)} &= \theta_i^{(t)} + \epsilon_i \phi(\theta_i^{(t)}) \\ \phi(\theta_i) &= \frac{1}{n} \sum_{j=1}^n \left[\frac{1}{T}k(\theta_j,\theta_i)\nabla_{\theta_j}\log p(\theta_j | \mathcal{D}_\text{finetune}) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right],\end{align}$$where $\epsilon_i$ is a learning rate and $T$ is a temperature hyperparameter. The basic interpretation of the update is that the first term inside the summation drives particles towards areas of high posterior probability, whilst the second term penalises particles that are too similar to one another, acting as a repulsive force that encourages exploration of the parameter-space.
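A minimal sketch of the SVGD update on a one-dimensional toy problem may help. It assumes an RBF kernel with a fixed bandwidth, a Gaussian target $\mathcal{N}(2, 0.5^2)$ standing in for the posterior, and hand-picked step size and particle count; none of these choices come from the original text. Each particle moves along $\phi$, following the update rule of Liu and Wang [16].

```python
import math
import random

def rbf(a, b, h):
    """RBF kernel k(a, b) with bandwidth h."""
    return math.exp(-0.5 * (a - b) ** 2 / h ** 2)

def grad_rbf_first(a, b, h):
    """Derivative of rbf(a, b, h) with respect to its first argument."""
    return -(a - b) / h ** 2 * rbf(a, b, h)

def svgd_step(particles, score, step_size=0.1, bandwidth=0.5, temperature=1.0):
    """One SVGD update; `score` returns the gradient of the log posterior."""
    n = len(particles)
    updated = []
    for theta_i in particles:
        phi = 0.0
        for theta_j in particles:
            # Attractive term: kernel-weighted gradient of the log posterior.
            phi += rbf(theta_j, theta_i, bandwidth) * score(theta_j) / temperature
            # Repulsive term: pushes theta_i away from nearby particles.
            phi += grad_rbf_first(theta_j, theta_i, bandwidth)
        updated.append(theta_i + step_size * phi / n)
    return updated

def score(theta):
    """Gradient of log N(theta; 2, 0.5^2), the assumed toy posterior."""
    return -(theta - 2.0) / 0.25

rng = random.Random(0)
particles = [rng.gauss(-1.0, 1.0) for _ in range(20)]   # deliberately poor initialisation
for _ in range(1000):
    particles = svgd_step(particles, score)

mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
print(mean, spread)
```

Without the repulsive term every particle would collapse onto the mode, giving a set of identical MAP estimates rather than an approximation to the posterior.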

Once the particles have converged, we can simply approximate the posterior predictive as the average output of the network across each parameter particle $\theta_i$, $$p(y^* | x^*, \mathcal{D}_\text{finetune}) \approx \frac{1}{n} \sum_{i=1}^n f_{\theta_i}(x^*).$$
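The averaging step is short enough to sketch directly. The two-class `network` below and the particle values are hypothetical stand-ins; the point is only that predictions are averaged over particles, rather than computed from a single averaged parameter.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def network(theta, x):
    """Hypothetical two-class model with logits (theta * x, 0)."""
    return softmax([theta * x, 0.0])

def posterior_predictive(particles, x):
    """Average the network's output distribution over all particles."""
    outputs = [network(theta, x) for theta in particles]
    n = len(particles)
    return [sum(out[c] for out in outputs) / n for c in range(len(outputs[0]))]

particles = [0.4, 1.1, 1.9, 2.6]         # assumed converged SVGD particles
pred = posterior_predictive(particles, x=1.0)

# For comparison: a point estimate that plugs in the mean parameter instead.
point = network(sum(particles) / len(particles), 1.0)
print(pred, point)
```

In this toy case the averaged prediction is less confident than the point estimate, which is the kind of behaviour we hope to gain by accounting for parameter uncertainty.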

My current research lies in applying SVGD to LoRA adapters. The hope is that we can learn a richer, multi-modal posterior distribution using SVGD’s particles without making the Gaussian posterior assumption of the Laplace approximation. Recent concurrent work [17] applies a very similar technique to computer-vision tasks and achieves promising results.

Conclusion

I hope this blog has been a useful introduction to the finetuning of LLMs. Feel free to get in touch if you’re interested! My email is sam.bowyer@bristol.ac.uk.

Footnotes

1: LLMs split input text up into a sequence of tokens. Roughly speaking, most words are split into one or two tokens depending on how common and how long they are. Using GPT-4’s tokenizer, this sentence is made from 17 tokens. (back to top)
2: Spare a moment, if you will, for the Meta engineers behind the OPT-175B (175-billion-parameter) model, the training logbook of which reads at times like that of a doomed ship at sea. (back to top)

3: Note that in the case of LLMs specifically, the straight-out-of-pretraining model will also likely be a poor virtual assistant, in the way we tend to desire of chatbots like ChatGPT. A model which can complete sentences to match the general patterns found in $\mathcal{D}$ won’t necessarily be much good at the user-agent back-and-forth conversation style we’d like, and as such might not have properly ‘learnt’ how to, for example, follow instructions and answer questions. It’s because of this that most public-facing LLMs go through what’s known as instruction fine-tuning, in which the model is finetuned on a large dataset of instruction-following chat logs before being deployed. (back to top)
4: That is, without using 25,000 GPUs… (back to top)
5: Consider this paper [2] by Huang et al. which boasts 1.08-bit quantisation^a of 16-bit models, all whilst retaining impressive levels of performance. (back to top)

a : i.e. representing parameters with an average precision of 1.08 bits.

An (Ill-Advised) Aside: Attention

Attention layers work by taking three matrices as input, $Q_\text{input}, K_\text{input}, V_\text{input} \in \mathbb{R}^{n \times d_\text{model}}$, typically representing $d_\text{model}$-dimensional embeddings of a sequence of $n$ tokens. First we project these matrices using learnable weight matrices $W^Q, W^K \in \mathbb{R}^{d_\text{model} \times d_k}$, and $W^V \in \mathbb{R}^{d_\text{model} \times d_v}$ to obtain our queries, keys and values: $$\begin{align}
Q &= Q_\text{input} W^Q \in \mathbb{R}^{n \times d_k} \\
K &= K_\text{input} W^K \in \mathbb{R}^{n \times d_k} \\
V &= V_\text{input} W^V \in \mathbb{R}^{n \times d_v}.
\end{align}$$With these, we then compute attention as $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where $\text{softmax}$ is applied over each row such that, denoting the $i$th row of the matrix $A = QK^T/\sqrt{d_k}$ as $A^{(i)}$ and that row’s $j$th element as $a^{(i)}_j$, we define: $$\text{softmax}(A)^{(i)} = \frac{\exp A^{(i)}}{\sum_{j=1}^n \exp a^{(i)}_j},$$ with $\exp$ applied elementwise.
The intuition behind this is that our $n \times n$ attention matrix $\text{softmax}(QK^T/\sqrt{d_k})$ has entries representing how much token $i$ relates (or ‘attends’) to token $j$. The $\text{softmax}$ normalises each row so that the entries all add up to one, allowing us to think of each row as a distribution over tokens. The final multiplication with $V$ might then be thought of as selecting (or weighting) tokens in $V$ according to those distributions.
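The computation above can be sketched in a few lines of plain Python. The sizes and random weights below are made up for illustration, and the same input is passed as queries, keys and values (i.e. self-attention); a real implementation would use a tensor library rather than nested lists.

```python
import math
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def softmax_rows(a):
    """Apply softmax independently to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q_in, k_in, v_in, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; returns (output, attention weights)."""
    q = matmul(q_in, w_q)                       # n x d_k
    k = matmul(k_in, w_k)                       # n x d_k
    v = matmul(v_in, w_v)                       # n x d_v
    d_k = len(w_q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)              # n x n, each row sums to one
    return matmul(weights, v), weights

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n, d_model, d_k, d_v = 3, 4, 2, 5               # illustrative sizes only
x = rand_matrix(n, d_model)
out, weights = attention(x, x, x,
                         rand_matrix(d_model, d_k),
                         rand_matrix(d_model, d_k),
                         rand_matrix(d_model, d_v))
```

Row $i$ of `weights` is the distribution describing how much token $i$ attends to every other token, and `out` has one $d_v$-dimensional row per token.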

One important limitation of the attention mechanism we’ve just described is that it only allows us to consider how each token attends to each other token in some universal way, whereas in reality there are multiple ways that words in a sentence (for example) can relate to each other. Because of this, most of the time we actually use multi-headed attention (MHA), in which we compute attention between the token sequences $H \in \mathbb{N}$ times, each time with different learnable weight matrices $W^Q_h, W^K_h, W^V_h$ for $h \in \{1,\ldots,H\}$. Then we combine these separate attention heads, using yet another learnable weight matrix $W^O \in \mathbb{R}^{H d_v \times d_\text{model}}$, $$\text{MultiHead}(Q_\text{input}, K_\text{input}, V_\text{input}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_H)W^O \in \mathbb{R}^{n \times d_\text{model}},$$ where $\text{head}_h = \text{Attention}(Q_\text{input}W^Q_h, K_\text{input}W^K_h, V_\text{input}W^V_h)$. Allowing the model to learn different types of attention on different heads makes MHA an incredibly powerful and expressive part of a neural network.

To summarise and return to the discussion of finetuning: MHA layers contain a ton of learnable parameters (specifically, $2H d_\text{model} (d_k + d_v)$ of them). (back to top)
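The parameter count can be checked by adding up the shapes of the weight matrices defined above. The sketch below ignores bias terms, and the example sizes ($d_\text{model} = 512$, $H = 8$, $d_k = d_v = 64$) are those of the base Transformer in [4], used here purely for illustration.

```python
def mha_parameter_count(d_model, d_k, d_v, num_heads):
    """Count the weights in one multi-headed attention layer (no biases)."""
    per_head = d_model * d_k + d_model * d_k + d_model * d_v   # W^Q_h, W^K_h, W^V_h
    output_proj = num_heads * d_v * d_model                    # W^O
    return num_heads * per_head + output_proj

count = mha_parameter_count(d_model=512, d_k=64, d_v=64, num_heads=8)
formula = 2 * 8 * 512 * (64 + 64)
print(count, formula)   # both are 1048576
```

That is roughly a million weights for a single, fairly small attention layer, and an LLM stacks many much larger ones.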

References

[1] Kingma, D.P. and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[2] Huang, W., Liu, Y., Qin, H., Li, Y., Zhang, S., Liu, X., Magno, M. and Qi, X., 2024. BiLLM: Pushing the limit of post-training quantization for LLMs. arXiv preprint arXiv:2402.04291.
[3] Zaken, E.B., Ravfogel, S. and Goldberg, Y., 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199.
[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
[5] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., 2019, May. Parameter-efficient transfer learning for NLP. In International conference on machine learning (pp. 2790-2799). PMLR.
[6] Lin, Z., Madotto, A. and Fung, P., 2020. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829.
[7] Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K. and Gurevych, I., 2021. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. EACL 2021.
[8] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[9] Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W. and Zhao, T., 2023. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512.
[10] Meng, F., Wang, Z. and Zhang, M., 2024. PiSSA: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948.
[11] Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A. and Tian, Y., 2024. GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507.
[12] Cinquin, T., Immer, A., Horn, M. and Fortuin, V., 2021. Pathologies in priors and inference for Bayesian transformers. arXiv preprint arXiv:2110.04020.
[13] Chen, W. and Li, Y., 2023. Calibrating transformers via sparse gaussian processes. arXiv preprint arXiv:2303.02444.
[14] Yang, A.X., Robeyns, M., Wang, X. and Aitchison, L., 2023. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv:2308.13111.
[15] Martens, J. and Grosse, R., 2015, June. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning (pp. 2408-2417). PMLR.
[16] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose bayesian inference algorithm. Advances in neural information processing systems, 29.
[17] Doan, B.G., Shamsi, A., Guo, X.Y., Mohammadi, A., Alinejad-Rokny, H., Sejdinovic, D., Ranasinghe, D.C. and Abbasnejad, E., 2024. Bayesian Low-Rank LeArning (Bella): A Practical Approach to Bayesian Neural Networks. arXiv preprint arXiv:2407.20891.
