Compass students at ICML 2025

Posted on 2nd July 2025 (updated 10th July 2025) by shaun.jordan

The Compass CDT will be well-represented at this year’s International Conference on Machine Learning (ICML), with several of its students having papers accepted by the prestigious event.

One of these has been selected for oral presentation and one as a ‘spotlight position paper’, while another was co-authored by more than one Compass student – indicative of the strength and depth of the Centre’s machine learning and AI research:

Oral presentation: ‘Score Matching with Missing Data’
Josh Givens (lead author); Song Liu; Henry W J Reeve

Spotlight: ‘Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints’
Sam Bowyer (lead author); Laurence Aitchison; Desi R. Ivanova

Poster presentation: ‘Function-Space Learning Rates’
Ed Milsom (lead author) and Ben Anson (co-author); Laurence Aitchison

Poster presentation: ‘Flexible Tails for Normalizing Flows’
Tennessee Hickling (lead author); Dennis Prangle

This strong presence at ICML, which takes place in Vancouver next week, forms part of a packed conference summer, with Compass students presenting at other important UK and international events, including UAI 2025 in Brazil, and the ERSA Congress in Greece. (See full list).

ICML 2025 will take place in Vancouver, Canada from 13 to 19 July 2025.

“It’s fantastic to see our students making valuable contributions to one of the leading conferences in this field, and especially rewarding to have an oral presentation and spotlight paper as part of that,” said Professor Nick Whiteley, Director of the Compass CDT.

“Alongside NeurIPS and ICLR, both of which Compass students and alumni have contributed to in recent years, ICML provides an opportunity to showcase our work to machine learning researchers and practitioners from across academia and industry.

We are proud of the cutting-edge projects being undertaken by students and colleagues in the Centre, and across the wider Institute for Statistical Science here in the School of Mathematics, which span data science, statistics, machine learning and AI.

We’ve had some positive outcomes already this year, with our first Compass graduates finding employment with a range of organisations in the private and public sectors, as well as in academia, and current students securing some exciting internships.

Having papers accepted at high impact conferences, including ICML, and being involved in well-regarded events at home and overseas, is a fantastic way to round off the academic year.”

With Compass student Sherman Khoo winning an Early Career Researcher Poster Award in June, at Bayes Comp 2025 – a meeting of the Bayesian Computation Section of the International Society for Bayesian Analysis – the conference season is already off to a successful start.

Compass CDT student Sherman Khoo (furthest left) being announced as a winner of an Early Career Researcher Poster Award at Bayes Comp 2025, held in Singapore in June.

The fact that some papers accepted at conferences this year, including one at ICML, were co-authored by more than one Compass student has been particularly rewarding, as one of the Centre’s aims is to foster collaboration, and to build a community of researchers.

Emerald Dilworth will present a poster at Uncertainty in Artificial Intelligence (UAI), summarising a paper for which she was lead author, and Compass graduate Ed Davis was one of the co-authors, as was Compass Co-director Professor Daniel Lawson. “I reached out to Ed with my initial ideas, and we grew them together,” says Emerald.

“Some of the work belongs to each of us as parts we contributed individually, and other parts we worked on together, which is something we enjoyed. It’s a valuable, positive experience to collaborate on research, when a lot of what we do as PhD students can be solo work.”

This collaborative culture, alongside awards, presentations, spotlight slots, and a strong presence at ICML and other events, makes for a positive 2025 conference season, and suggests this trend will continue in future years, as students and alumni continue to gain experience.

“Travelling to Vancouver in July, and representing the University of Bristol, and the UK research community, at a forum as important as ICML, is a fantastic opportunity,” says Professor Whiteley. “The insights students gain will help to shape their future research.

The fact that students from both the 2021 and 2022 Compass cohorts will contribute to that conference, while all cohorts will be represented at a diverse range of leading events this summer, shows the breadth of Compass expertise. The research our students showcase, the knowledge they gain, and the networks they build, will have benefits for a long time to come.”

Compass student Cecina Babich Morrow (third from right) was a panellist at a GW4 AI and Data Science event focused on climate and health, held in Bristol in June.

Compass at ICML 2025

July 2025 – Vancouver, Canada

Student: Josh Givens (lead author)
Co-authors: Song Liu; Henry W J Reeve
Contribution: Oral presentation
Overview: ‘Score Matching with Missing Data’. Score matching is a technique used to learn intractable data distributions and serves as a foundational component of state-of-the-art image generation methods. In this paper, we adapt score matching (and its various extensions) to enable the learning of the original data distribution using only corrupted samples, where parts of each observation are missing or contain NaN values.

Student: Sam Bowyer (lead author)
Co-authors: Laurence Aitchison; Desi R. Ivanova
Contribution: Spotlight Position Paper
Overview: ‘Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints’. We argue that AI researchers need to improve the techniques they use to calculate error bars (i.e. confidence/credible intervals) over model-performance on benchmark test-sets (‘evals’). We examine the failure of common approaches that rely on the Central Limit Theorem (CLT) and suggest Bayesian and frequentist alternatives in a variety of eval settings, such as questions organised in subtasks/clusters, and comparisons between two models.

Student: Ed Milsom (lead author) and Ben Anson (co-author)
Co-author: Laurence Aitchison
Contribution: Poster
Overview: ‘Function-Space Learning Rates’. This will help AI researchers better understand why current AI systems work well and how to improve them in the future. We utilise our method to tune very large AI systems by tuning a small, cheap model and then copying the settings to the large model (which would be very computationally expensive to tune by itself). This is not usually possible, because the optimal settings change between small and large AI systems, but with our method, we can predict and therefore correct these changes.

Student: Tennessee Hickling (lead author)
Co-author: Dennis Prangle
Contribution: Poster
Overview: ‘Flexible Tails for Normalizing Flows’. Modern machine learning methods model uncertainty by transforming simple random inputs into data-like outputs, but they often underestimate extreme events. While some recent approaches inject extreme inputs directly, this can cause models to behave poorly. We propose a different solution: adding a final step that enables models to generate extreme outcomes from non-extreme inputs, improving performance on data with heavy tails.

‘Paired question model comparison setting’ – from Sam Bowyer’s ICML Spotlight Position Paper.

Compass at UAI 2025

July 2025 – Rio de Janeiro, Brazil

Student: Emerald Dilworth (lead author) and Ed Davis (co-author)
Co-authors: Daniel Lawson (Compass Co-director)
Contribution: Poster
Overview: ‘Valid Bootstraps for Network Embeddings with Applications to Network Visualisation’. Quantifying uncertainty in networks is an important step in modelling relationships and interactions between entities. Under certain assumptions on the network, we utilise an exchangeable network test that can empirically validate bootstrap samples generated by any method, by considering whether embeddings of the observed and bootstrapped networks are statistically indistinguishable. Existing methods fail this test, so we propose a principled, distribution-free network bootstrap using k-nearest neighbour smoothing, which can pass this exchangeable network test in many synthetic and real-data scenarios. We demonstrate the utility of this work in combination with the popular data visualisation method t-SNE, where uncertainty estimates from bootstrapping are used to explain whether visible structures represent real, statistically sound structures.

Student: Emma Ceccherini (lead author)
Co-authors: Ian Gallagher; Andrew Jones; Daniel Lawson (Compass Co-director)
Contribution: Poster
Overview: ‘Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees’. Most AI/ML methods have exchangeability or stability properties: loosely speaking, for the same input they produce the same output up to noise. However, this is not always true for dynamic graph embedding. We propose a dynamic attributed embedding that collectively embeds attributes and network information in the same low-dimensional space, and we prove uniform convergence, which establishes stability. As a result, our method performs better than competitors on downstream tasks.

A simulated results figure from Emerald Dilworth’s paper, co-authored with Ed Davis.

Compass at GOFCP 2025

(Goodness-of-fit, Change-point and Related Problems)

August 2025 – Charles University, Prague, Czech Republic

Student: Dylan Dijk
Contribution: Poster
Overview: This poster presents work on generalising the assumptions behind a widely used econometric model (the model is partly described in Dylan’s blog from July 2024). In real-world applications – especially in finance – large datasets often contain extreme values, particularly during periods of crisis. The goal is to adapt the model’s theoretical foundations to ensure reliable performance even when such heavy-tailed data are present.

Student: Yuqi Zhang
Contribution: Poster
Overview: Presents a method developed jointly with Dr Haeran Cho (Compass Projects Coordinator). We introduce a multiscale, bandwidth-free procedure for detecting multiple change points in large approximate factor models. The method combines the Narrowest-over-Threshold (NOT) principle with Seeded Binary Segmentation (SBS) to efficiently identify structural changes in high-dimensional time series without requiring prior bandwidth selection.

Compass contributions to other events:

Random Networks Workshop
Date/location: May 2025 – University of Sheffield, UK
Student: Ollie Baker (lead author)
Co-authors: Carl P. Dettmann
Contribution: Contributed talk
Overview: ‘Entropy of Random Geometric Graphs in High and Low Dimensions’. The paper uses information theory to discuss whether or not we can detect the geometric embedding of a spatial network when the dimension of the embedding gets very large. We then use this to derive a bound on the amount of information saved by embedding a network in a lower dimensional space.

Bayes Comp 2025
Date/location: June 2025 – National University of Singapore
Student: Sherman Khoo (lead author)
Contribution: Poster
Overview: ‘Approximate Maximum Likelihood Estimation with Local Score Matching’. We study the problem of likelihood maximization when the likelihood function is intractable, but model simulations are readily available. We propose a sequential, gradient-based optimization method that directly models the Fisher score based on a local score matching technique which uses simulations from a localized region around each parameter iterate.

UK AI Conference 2025
Date/location: June 2025 – London, UK
Student: Rachel Wood (lead author)
Contribution: Poster
Overview: We first consider how to define an anomaly in a network, then explore using existing reliable and efficient embedding methods for this purpose. Most methods perform well only for the correct (unknown) latent dimension $d$, and performance may deteriorate for other choices of $d$, so we propose an approach which retains accuracy when the dimension choice is mis-specified, offering a more robust solution for anomaly detection in dynamic and uncertain network environments.

GW4 AI and Data Science: AI, Climate and Health
Date/location: June 2025 – University of Bristol, UK
Student: Cecina Babich Morrow
Contribution: Lightning talk, panel member and poster
Overview: ‘From risk to action: Climate decision-making under deep uncertainty’. How can we make robust climate adaptation decisions despite our uncertainty about both the level of climate risk and the characteristics of our adaptations? I showed how uncertainty and sensitivity analysis might be helpful in addressing these issues.

46th Annual Conference of the International Society for Clinical Biostatistics (ISCB)
Date/location: August 2025 – University of Basel, Switzerland
Student: Vera Hudak (lead author)
Co-authors: Hayley Jones; Nicky J. Welton; Efthymia Derezea
Contribution: Poster
Overview: ‘Evaluating Diagnostic Tests Against Composite Reference Standards: Quantifying and Adjusting for Bias’. Our research focuses on diagnostic test accuracy studies where gold standard testing is only performed on a subset of study participants, specifically those with certain results from some initial imperfect reference test. We have quantified the bias that can arise in these scenarios and proposed a method to adjust for it, which was evaluated using a simulation study.

64th European Regional Science Association (ERSA) Congress: ‘Regional Science in Turbulent Times. In search of a resilient, sustainable and inclusive future’
Date/location: August 2025 – Panteion University, Athens, Greece
Student: Emerald Dilworth (presenter)
Co-authors: Emmanouil Tranos; Daniel Lawson (Compass Co-director)
Contribution: Special session presentation
Overview: ‘The Twin Transition in the UK through LLM labelled Web Crawl Data’. Many existing tools for tracking green and digital industries are expensive, limited in scale, or updated too infrequently, making it difficult to design effective policies for technological progress. To address this, we use freely available web data to identify UK firms involved in green and digital technologies by fine-tuning a language model to classify company websites. This allows us to track how these industries have evolved across regions and over time from 2014 to 2024, at a yearly time scale.

Global Optimization Workshop 2025
Date/location: September 2025 – KTH Royal Institute of Technology, Stockholm, Sweden
Student: Ettore Fincato (co-author)
Co-authors: Christophe Andrieu; Nicolas Chopin; Mathieu Gerber
Contribution: Paper presentation
Overview: ‘Gradient-free optimization via integration’. We develop and analyse a novel approach to optimize functions not assumed to be convex, differentiable or even continuous. The idea is to fit recursively the objective function to a parametric family of distributions, using a Bayesian update followed by a reprojection back onto the chosen family. Practically, the approach enables the optimization of a broad class of objectives via Monte Carlo sampling; theoretically, it establishes a link with gradient-based methods with smoothing.

Exact entropy curves from Ollie Baker’s paper, presented at the Random Networks Workshop.

Student Perspectives: Verification Bias in Diagnostic Test Accuracy Studies with Conditional Reference Standards

Posted on 27th June 2025 by ia23879

A post by Vera Hudak, PhD student on the Compass programme.

Introduction

To evaluate the accuracy of a new diagnostic test (the ‘index test’), the ideal approach is to compare it against an error-free reference standard, known as the gold standard.
However, gold standards may be unavailable, invasive, or costly. In such situations, a possible approach is to condition gold standard testing on the outcome of some initial imperfect reference standard test(s).

We focus on a conditional testing design which we refer to as ‘check the negatives’. Here, all participants receive the index test (Test A) and an imperfect reference standard (Test B), then those testing negative on Test B are followed up with the gold standard (GS). The diagnostic accuracy of Test A is assessed against observed disease status, derived from the test sequence combining Test B and the GS. Figure 1 illustrates this design.

Figure 1: Test sequence for ‘check the negatives’.

Now, if Test B were 100% specific, the ‘check the negatives’ design would lead to unbiased estimates of the sensitivity and specificity of Test A, given by:

$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \quad \text{and} \quad \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}.$

However, as Test B is an imperfect test, this is unlikely to be the case, and bias can be anticipated. We quantified this bias and proposed a Bayesian adjustment method.

Probabilities in ‘Check the Negatives’ Studies

Let $A_{se}$ and $A_{sp}$ denote the true sensitivity and specificity of Test A, $B_{sp}$ the true specificity of Test B, and $\pi$ the prevalence of the condition in the study population. We also introduce $c_0$ as the covariance of errors between Test A and Test B in the disease-free population.

In a ‘check the negatives’ design, we do not observe the complete set of outcomes from the index test, the imperfect reference standard, and the gold standard. Instead, we only observe a reduced set of results: each participant’s outcome on Test A and their observed disease status, as determined by the conditional diagnostic test sequence combining Test B and the GS. Table 1 shows the probabilities associated with these observed outcomes under conditional dependence between Test A and Test B [1]. The corresponding probabilities under conditional independence can be obtained by setting $c_0 = 0$.

Table 1: Probabilities observed in a ‘check the negatives’ study.

Quantifying Bias

We used the probabilities from Table 1 to find closed-form expressions for the naive estimates of the sensitivity ($\widehat{A_{se}}$) and specificity ($\widehat{A_{sp}}$) of Test A, and hence, the bias. The bias in the naive estimate of specificity is as follows:

$\text{Bias}(\widehat{A_{sp}}) = \frac{c_0}{B_{sp}}.$

Under conditional independence ( $c_0 = 0$ ), the naive specificity estimate is unbiased. If, however, Tests A and B are conditionally dependent among the disease-free population, then there is a bias which depends on $B_{sp}$ and $c_0$ .

Similarly, the bias in the naive estimate of the sensitivity of Test A can be expressed as:

$\text{Bias}(\widehat{A_{se}}) = \frac{(1-\pi)((1-A_{sp}) (1-B_{sp}) + c_0 - A_{se}(1-B_{sp}))}{(1-\pi)(1-B_{sp}) + \pi}.$

Assuming independence in the disease-free population ($c_0 = 0$), Figure 2 shows $\text{Bias}(\widehat{A_{se}})$ as a function of $B_{sp}$, for selected values of $\pi$, $A_{se}$ and $A_{sp}$.

Figure 2: Bias in the naive estimate of the sensitivity of Test A against the specificity of Test B for different values of Test A sensitivity, specificity and disease prevalence.

As expected from the study design, the bias tends to 0 as $B_{sp}$ tends to 1. The magnitude of the bias increases as the accuracy of Test A improves, i.e. when $A_{se}$ or $A_{sp}$ is larger, or as prevalence falls. We can see that when $\pi = 0.9$, the bias is almost negligible. However, bias can be substantial in some scenarios, specifically under low prevalence, even when Test B has high specificity (e.g. over 95%).
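The two bias expressions above are easy to evaluate directly. The following is a minimal sketch that codes them as written; the function names and the parameter values in the loop are our own illustrative choices, not the settings used for Figure 2.

```python
def bias_specificity(b_sp, c0=0.0):
    """Bias of the naive specificity estimate: c0 / B_sp."""
    return c0 / b_sp


def bias_sensitivity(prev, a_se, a_sp, b_sp, c0=0.0):
    """Bias of the naive sensitivity estimate, coded from the expression above."""
    # Disease-free participants who test positive on Test B and are
    # therefore never checked with the gold standard.
    unchecked = (1 - prev) * (1 - b_sp)
    numerator = (1 - prev) * ((1 - a_sp) * (1 - b_sp) + c0 - a_se * (1 - b_sp))
    return numerator / (unchecked + prev)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the post's figure).
    for prev in (0.1, 0.5, 0.9):
        for b_sp in (0.90, 0.95, 0.99, 1.0):
            b = bias_sensitivity(prev, a_se=0.9, a_sp=0.9, b_sp=b_sp)
            print(f"prevalence={prev:.1f}  B_sp={b_sp:.2f}  bias={b:+.3f}")
```

With $c_0 = 0$ and an index test better than chance ($A_{se} + A_{sp} > 1$), the sensitivity bias is negative, so the naive estimate understates the true sensitivity.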

Bias Adjustment

We propose that a Bayesian model with an informative prior on Test B specificity can be used to adjust for the bias in the naive estimate of the sensitivity of Test A, under conditional independence ($c_0 = 0$).

We let $\boldsymbol{x} = (TP, FN, FP, TN)$ be the data reported by a ‘check the negatives’ study evaluating Test A. Then $\boldsymbol{x} \sim \text{Multinomial}(\boldsymbol{p}, n)$, where $n$ is the number of participants in the study, and the probabilities $\boldsymbol{p} = (p_1, p_2, p_3, p_4)$ are as specified in Table 1, with $c_0 = 0$.

Suppose we have prior information about the specificity of Test B represented with a Beta prior distribution. Then a bias-adjusted estimate of $A_{se}$ could be obtained by fitting this multinomial distribution to the data with vague $\text{Beta}(1,1)$ priors for the remaining three parameters, $A_{se}$ , $A_{sp}$ , and $\pi$ .
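To make the idea concrete, here is a rough sketch of such a fit. Several things in it are assumptions rather than the method used in our study: the cell probabilities are re-derived from the design description (Table 1 is not reproduced in the text here), the $\text{Beta}(900, 100)$ prior on $B_{sp}$ is a hypothetical choice, and the posterior mean is approximated by weighting prior draws by the likelihood, which stands in for a proper MCMC sampler.

```python
import numpy as np


def cell_probs(prev, a_se, a_sp, b_sp):
    """(TP, FN, FP, TN) probabilities under c0 = 0.

    Assumption: observed disease status is positive if Test B is positive,
    or if Test B is negative and the gold standard is positive.
    """
    tp = prev * a_se + (1 - prev) * (1 - a_sp) * (1 - b_sp)
    fn = prev * (1 - a_se) + (1 - prev) * a_sp * (1 - b_sp)
    fp = (1 - prev) * (1 - a_sp) * b_sp
    tn = (1 - prev) * a_sp * b_sp
    return np.stack([tp, fn, fp, tn], axis=-1)


def adjusted_sensitivity(counts, b_prior=(900.0, 100.0), draws=500_000, seed=0):
    """Approximate posterior mean of A_se by likelihood-weighting prior draws."""
    rng = np.random.default_rng(seed)
    prev, a_se, a_sp = rng.uniform(size=(3, draws))  # vague Beta(1, 1) priors
    b_sp = rng.beta(*b_prior, size=draws)            # informative prior (hypothetical)
    probs = cell_probs(prev, a_se, a_sp, b_sp)       # shape (draws, 4)
    # Multinomial log-likelihood up to a constant that cancels in the weights.
    log_lik = np.asarray(counts, dtype=float) @ np.log(probs).T
    weights = np.exp(log_lik - log_lik.max())
    return float(np.sum(weights * a_se) / np.sum(weights))


# Illustration with expected counts from made-up true values.
truth = dict(prev=0.1, a_se=0.9, a_sp=0.9, b_sp=0.9)
counts = 1000 * cell_probs(**truth)
naive = counts[0] / (counts[0] + counts[1])          # TP / (TP + FN)
adjusted = adjusted_sensitivity(counts)
print(f"true A_se = {truth['a_se']:.3f}, naive = {naive:.3f}, adjusted = {adjusted:.3f}")
```

Weighting prior draws is wasteful, because few draws land where the likelihood is large, but it keeps the sketch dependency-free. For real analyses a dedicated sampler would be the more sensible choice.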

Simulation Study

We assessed this adjustment method through a simulation study under two scenarios: where the informative prior is correctly centred on the true specificity, and where it underestimates the truth by 5%, to examine the impact of moderate prior misspecification. Prior precision ($sd$) is also varied to assess the impact of increasing uncertainty. In Figure 3, we present some results from this simulation study for the correctly centred prior case. These results show that a correctly centred prior consistently eliminates bias under high precision and reduces it under lower precision.

Although not shown here, we found that overly pessimistic priors can over-correct, increasing absolute bias, especially when initial bias is small. This risk is mitigated when the informative prior is less precise. We are currently writing up this simulation study as a paper to be submitted for publication soon.

Future Work

Further work could be done to explore adjustment under conditional dependence between tests, or situations in which the third test in the sequence, here the GS, is imperfect.

References

[1] Pamela M. Vacek. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics, 41:959, December 1985.

Student Perspectives: The Distribution of High-Dimensional Random Geometric Graphs

Posted on 7th May 2025 by o.baker

A post by Ollie Baker, PhD student on the Compass programme.

Introduction

A random geometric graph is a model of a network with a geometric embedding, or a latent space embedding. Such graphs were originally introduced by Gilbert in 1961, who called them ‘random plane networks’ [3]. If we are studying a high-dimensional dataset, then we may be interested in modelling it using some form of network structure which contains information about the similarity of data points in the latent space. This can be represented as points in a high-dimensional space connected based on their closeness according to some distance metric. What can we say about the distribution of these random graph models when the dimension of the space gets large? The answer to this question has applications in clustering, and in graph compression once we consider the information theory of these network ensembles.

Figure 1: An example of a 500 node Random Geometric Graph on the unit square with connection range 0.1.

Let $(\Omega, \rho)$ be a metric space, where $\Omega \subset \mathbb{R}^d$ is compact. In this post, we will only consider $\Omega = [0,1]^d$ with $\rho = \|\cdot\|$ the Euclidean metric, and $\Omega = [0,1]^d$ with $\rho = \rho_t$, where $\rho_t(x,y) = \left(\sum_{i=1}^d\min(|x_i-y_i|, 1-|x_i-y_i|)^2 \right)^{\frac{1}{2}}$. This second metric space is known as the unit $d$-torus, and is denoted $\mathbb{T}^d$. To form a Random Geometric Graph (RGG) $G$, we distribute $n$ points at random according to a probability density $\nu$ in $\Omega$ to form the vertex set, and connect nodes $i$, $j$ if they have a mutual distance of at most $r_0$. That is, if $X_1,\dots,X_n$ are our points in $\Omega$, $i\sim j$ in $G$ if and only if $\rho(X_i,X_j) \leq r_0$. Figure 1 shows an example realisation of such a graph. We are interested in densities of the form:

\begin{equation}
\label{general_distribution}
\nu(\{x_i\}_{i=1}^d) = \prod_{i=1}^d \pi(x_i)
\end{equation}
That is, the coordinates of a point $x$ are i.i.d.
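The construction above can be sketched in a few lines of NumPy. This is only an illustration: the function names are our own, and it takes the marginal $\pi$ to be uniform on $[0,1]$, which is just one admissible choice of i.i.d. coordinates.

```python
import numpy as np


def pair_distances(points, torus=False):
    """All pairwise distances between the rows of `points` in [0,1]^d."""
    diff = np.abs(points[:, None, :] - points[None, :, :])
    if torus:
        diff = np.minimum(diff, 1.0 - diff)  # wrap around in each coordinate
    return np.sqrt((diff ** 2).sum(axis=-1))


def sample_rgg(n, d, r0, torus=False, rng=None):
    """Adjacency matrix and node positions of an RGG with uniform coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    points = rng.uniform(size=(n, d))
    adj = pair_distances(points, torus) <= r0  # connect pairs within range r0
    np.fill_diagonal(adj, False)               # no self-loops
    return adj.astype(int), points


# Same settings as the example in Figure 1: 500 nodes, unit square, range 0.1.
adj, pts = sample_rgg(n=500, d=2, r0=0.1, rng=np.random.default_rng(1))
print(adj.sum() // 2, "edges")
```

Passing `torus=True` swaps the Euclidean metric for $\rho_t$. Since wrapping can only shorten distances, the torus graph on the same points contains every edge of the cube graph.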
We are interested in computing the probability of a graph $G$ with adjacency matrix $A=\{a_{ij}\}_{i,j=1}^n$:
\begin{equation}
\mathbb{P}(G) = \int_{[0,D]^{\binom{n}{2}}} f_{\vec{R}}(\vec{r})\prod_{i<j} \left(a_{ij}\mathbb{I}(r_{ij}<r_0)+(1-a_{ij})\mathbb{I}(r_{ij}>r_0)\right)d\vec{r}
\end{equation}
where $D$ is the diameter of (largest possible distance between two points in) $\Omega$, $f_{\vec{R}}$ is the joint density of pair distances in $\Omega$ (which is well defined if $d>n+2$), and $d\vec{r}=\prod_{i<j}dr_{ij}$. Clearly, this is intractable for most choices of $\Omega$ and $n$. However, we can simplify things if we take the limit $d\rightarrow\infty$. An ensemble $\mathcal{G}$ of RGGs is the set of all possible RGGs that can be constructed with a fixed $r_0, n, \Omega, \nu$ and $\rho$. A sequence of ensembles $\{\mathcal{G}_d\}$ (with all parameters except $n$ dependent on $d$) equipped with probability measures $\mathbb{P}_d$ converges in distribution to another ensemble $\mathcal{G}$ as $d\rightarrow\infty$ if
\begin{equation}
\mathbb{P}_d(g) \rightarrow \mathbb{P}(g)
\end{equation}
for all graphs $g$ with $n$ nodes.

Central Limit Theorem for High-Dimensional Distance

We can use a central limit theorem (CLT) to prove that the vector of pair distances in $\Omega$ converges in distribution to a multivariate Gaussian as $d\rightarrow\infty$, which is a generalisation of what is done in [2].

Theorem 1:

Let $(\Omega, \rho)$ be a metric space, $\nu$ be a node density as described above, and $X_1,\dots,X_n$ be random vectors distributed according to $\nu$. Define $R_{ij}^{(k)} = \rho(X_i^{(k)}, X_j^{(k)})^2$, the squared distance between $X_i$ and $X_j$ in coordinate $k$, and $\mu := \mathbb{E}[R_{12}^{(1)}]$. Let
\begin{equation}
q_{ij} := \frac{1}{\sqrt{d}}\sum_{k=1}^d (R_{ij}^{(k)}-\mu)
\end{equation}
Then, as $d\rightarrow\infty$ the vector $\vec{q} = \{q_{ij}\}_{1\leq i<j\leq n} \in \mathbb{R}^{\binom{n}{2}}$ satisfies
\begin{equation}
\vec{q} \rightarrow Z \sim N(0_{\binom{n}{2}},\Sigma)
\end{equation}
where $\Sigma$ is the covariance matrix indexed by multi-indices $(i,j)$ given by $\Sigma_{(i,j),(i,j)} = \alpha := \mathbb{E}[(\rho(X_i^{(1)},X_j^{(1)})^2-\mu)^2]$, $\Sigma_{(i,j),(j,k)} = \beta := \mathbb{E}[(\rho(X_i^{(1)},X_j^{(1)})^2-\mu)(\rho(X_j^{(1)},X_k^{(1)})^2-\mu)]$, and $\Sigma_{(i,j),(k,l)} = 0$ when the pairs $(i,j)$ and $(k,l)$ share no node.

Essentially, this means that when $d\rightarrow\infty$, we can replace the intractable joint density $f_{\vec{R}}$ with a multivariate Gaussian density, which will make our calculations much easier.
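A quick Monte Carlo check of the covariance structure in Theorem 1 can be reassuring. The sketch below assumes uniform coordinates on the cube with the Euclidean metric. The reference values $\mu = 1/6$, $\alpha = 7/180$ and $\beta = 1/180$ were worked out by hand for this uniform case and are not quoted from the paper.

```python
import numpy as np


def normalised_distances(trials, d, mu, rng):
    """Draw q_12, q_23, q_34 for four uniform points in [0,1]^d, `trials` times."""
    x = rng.uniform(size=(trials, 4, d))

    def q(i, j):
        r = (x[:, i, :] - x[:, j, :]) ** 2  # R_ij^(k) for each coordinate k
        return (r - mu).sum(axis=1) / np.sqrt(d)

    return q(0, 1), q(1, 2), q(2, 3)


rng = np.random.default_rng(0)
mu = 1.0 / 6.0  # E[(X - Y)^2] for independent U(0, 1) coordinates
q12, q23, q34 = normalised_distances(trials=20_000, d=50, mu=mu, rng=rng)

alpha_hat = q12.var()          # estimates Sigma_{(1,2),(1,2)}
beta_hat = np.mean(q12 * q23)  # pairs sharing node 2
zero_hat = np.mean(q12 * q34)  # pairs with no shared node

print(f"alpha: {alpha_hat:.4f} (hand-computed 7/180 = {7 / 180:.4f})")
print(f"beta:  {beta_hat:.4f} (hand-computed 1/180 = {1 / 180:.4f})")
print(f"disjoint-pair covariance: {zero_hat:.4f} (expected 0)")
```

The covariances hold exactly for every $d$, because $q_{ij}$ is a normalised sum of i.i.d. terms. Only the Gaussian shape requires $d$ to be large.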

Erdős-Rényi Random Graphs

The Erdős-Rényi (ER) random graph, also denoted $G(n,p)$, is arguably the simplest model of a random graph. We take $n$ nodes, and connect each pair with fixed probability $p$. The probability of an ER graph $G$ is given by the binomial probability
\begin{equation}
\mathbb{P}(G) = p^{k}(1-p)^{\binom{n}{2}-k}
\end{equation}
where $k = \sum_{i<j} a_{ij}$ is the number of edges in $G$. Clearly, this model is a non-spatial network model, which is not good for modelling networks with a latent space structure! Therefore, we would like to guarantee that our model does not converge to $G(n,p)$ as $d\rightarrow\infty$, otherwise we would lose information about the spatial or latent space correlation of our data.
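For reference, the ER probability above is simple to evaluate for any given adjacency matrix. A minimal sketch follows (the function name is our own). It works on the log scale to avoid underflow for large $n$, and assumes $0 < p < 1$.

```python
import math

import numpy as np


def er_log_prob(adj, p):
    """log P(G) under G(n, p) for a symmetric 0/1 adjacency matrix `adj`."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    k = int(np.triu(adj, k=1).sum())  # number of edges (upper triangle only)
    pairs = n * (n - 1) // 2          # binom(n, 2) possible edges
    return k * math.log(p) + (pairs - k) * math.log(1 - p)


# A path on three nodes: two edges present, one absent.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(math.exp(er_log_prob(path, p=0.2)))
```

The probability depends on the graph only through its edge count $k$, which is exactly why the model carries no spatial information.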

When Does the RGG Converge to $G(n,p)$?

For the main results, we will provide a condition on the distribution of nodes in $\Omega$ for when we see convergence in distribution to $G(n,p)$. For a RGG with connection range $r_0$ in the metric space $(\Omega, \rho)$ and $\mu = \mathbb{E}[\rho(X_i^{(1)},X_j^{(1)})^2]$, define the normalised connection range $t = \frac{r_0^2}{\sqrt{d}} - \mu \sqrt{d}$. Recall that the probability of a graph is given by
\begin{equation}
\mathbb{P}(G) = \int_{[0,D]^{\binom{n}{2}}} f_{\vec{R}}(\vec{r})\prod_{i<j} \left(a_{ij}\mathbb{I}(r_{ij}<r_0)+(1-a_{ij})\mathbb{I}(r_{ij}>r_0)\right)d\vec{r}
\end{equation}
If the random distances converge to a Gaussian, then we have (after some algebra)
\begin{equation}
\mathbb{P}(G) \rightarrow \int_{\mathcal{A}} N(0,\Sigma)(\vec{q})d\vec{q}
\end{equation}
where $\mathcal{A} = \bigotimes_{i<j} A_{ij}$, with $\bigotimes$ denoting the Cartesian product of sets, and $A_{ij}$ is the set:
\begin{equation}
A_{ij} := \begin{cases}
(-\infty, t] & a_{ij} =1 \\
(t,\infty) & a_{ij} = 0
\end{cases}
\end{equation}
If $\Sigma$ is diagonal, then the integral above splits up into its marginals, and
\begin{equation}
\mathbb{P}(G) = \bar{p}(t)^k(1-\bar{p}(t))^{\binom{n}{2}-k}
\end{equation}
with $\bar{p}(t) := \int_{-\infty}^t N(0, \alpha)(q)dq$ with $\alpha$ being the diagonal elements of $\Sigma$. Note this is the exact probability of a graph in $G(n, \bar{p}(t))$. If there are non-zero off-diagonal elements, then there are correlations between edges, and we do not have convergence to $G(n,p)$.
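As a small worked example, the normalised connection range $t$ and the limiting edge probability $\bar{p}(t)$ can be computed with the standard library alone. The uniform-torus values $\mu = 1/12$ and $\alpha = 1/180$ used below are hand-computed assumptions for illustration (the per-coordinate wrapped distance is uniform on $[0, 1/2]$), and the function names are our own.

```python
import math


def normalised_range(r0, d, mu):
    """t = r0^2 / sqrt(d) - mu * sqrt(d)."""
    return r0 ** 2 / math.sqrt(d) - mu * math.sqrt(d)


def limiting_edge_prob(t, alpha):
    """p_bar(t): the N(0, alpha) distribution function evaluated at t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0 * alpha)))


# Uniform nodes on the torus (hand-computed moments, assumed for illustration).
mu, alpha = 1.0 / 12.0, 1.0 / 180.0
d = 1000
for scale in (0.95, 1.0, 1.05):
    r0 = scale * math.sqrt(mu * d)  # r0 must grow like sqrt(d) to keep t finite
    t = normalised_range(r0, d, mu)
    print(f"r0={r0:.3f}  t={t:+.3f}  p_bar={limiting_edge_prob(t, alpha):.4f}")
```

The loop makes the scaling visible: $r_0$ has to sit within $O(d^{1/4})$ of $\sqrt{\mu d}$ for the limiting edge probability to stay away from 0 and 1.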

The RGG in $[0,1]^d$

Suppose now that $(\Omega, \rho) = ([0,1]^d, \|\cdot\|)$. We will need that our node distribution $\nu$ is of the form we described earlier, where the marginals $\pi$ have a kurtosis greater than 1. It can be shown that the only distributions with unit kurtosis are the Bernoulli distribution with parameter $1/2$, and constant distributions.

Theorem 2 [1]

Suppose $\mathcal{G}$ is an ensemble of RGGs in $[0,1]^d$ with nodes distributed according to $\nu$, then, provided the kurtosis of the marginals $\pi$ is greater than 1, $\mathcal{G}$ does not converge to the ER ensemble as $d\rightarrow\infty$.

Sketch Proof:

The proof is direct. We set $\beta = 0$, which for the Euclidean distance metric means (after some rearranging),
\begin{equation}
\mathbb{E}[(X_i-\mu)^4] - \mathbb{E}[(X_i-\mu)^2]^2 = 0
\end{equation}
or equivalently, the kurtosis of $X_i$ is 1.

This means, for any ‘sufficiently nice’ distribution of nodes, we do not converge to $G(n,p)$, and maintain spatial properties which can be exploited in data analysis.
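The kurtosis condition in the sketch proof can be checked numerically for a few marginals. The snippet below is an illustration under our own naming: it evaluates the left-hand side of the displayed condition, the fourth central moment minus the squared variance, for a discrete distribution, and approximates the uniform distribution on $[0,1]$ by a fine equally weighted grid.

```python
import numpy as np


def beta_and_kurtosis(values, probs):
    """Return (fourth central moment - variance^2, kurtosis) for a discrete law."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = np.sum(probs * values)
    var = np.sum(probs * (values - mean) ** 2)
    m4 = np.sum(probs * (values - mean) ** 4)
    return m4 - var ** 2, m4 / var ** 2


m = 1000
grid = (np.arange(m) + 0.5) / m  # grid approximation to U(0, 1)

examples = {
    "Bernoulli(1/2)": ([0.0, 1.0], [0.5, 0.5]),
    "Bernoulli(0.3)": ([0.0, 1.0], [0.7, 0.3]),
    "Uniform(0,1), grid": (grid, np.full(m, 1.0 / m)),
}
for name, (vals, pr) in examples.items():
    b, kurt = beta_and_kurtosis(vals, pr)
    print(f"{name:20s} m4 - var^2 = {b:.5f}   kurtosis = {kurt:.3f}")
```

Only the symmetric Bernoulli case gives zero, so it is the only example here for which edges sharing a node become uncorrelated in the limit.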

The RGG in $\mathbb{T}^d$

In the torus, the following theorem shows that if the distribution of nodes is uniform, then we will in fact see convergence to $G(n,p)$. However, for any other distribution, we maintain the spatial correlation.

Theorem 3 [1]

Let $\mathcal{G}$ be an ensemble of RGGs on $\mathbb{T}^d$ with nodes distributed according to $\nu$. Then as $d\rightarrow\infty$, $\mathcal{G}$ converges in distribution to $G(n,p)$ if and only if $\nu$ is the uniform distribution.

Sketch Proof

The tactic for the proof is the same as for the cube. We will find a condition for which $\beta = 0$. In the torus, we have
\begin{equation}
\beta = 0 \iff \mathbb{E}_X[(\mathbb{E}_Y[\rho_t(X,Y)^2])^2] = \mathbb{E}_X[\mathbb{E}_Y[\rho_t(X,Y)^2]]^2
\end{equation}

From which we can deduce that $\mathbb{E}_Y[\rho_t(x,Y)^2]$ is constant $\pi$-almost-everywhere. This implies
\begin{equation}
\int_0^1 \rho_t(x,y)^2\pi(y)dy = \mu
\end{equation}
for $\pi$-almost every $x$. We can rewrite the left hand side above as the convolution of two periodic functions, and therefore taking a Fourier transform of both sides simplifies the problem. By equating Fourier modes, we find that the Fourier transform $\hat{\pi}$ of $\pi$ must be zero everywhere except when evaluated at $0$. This means that the original function $\pi$ must be constant on $[0,1]$.

So, if we are using a toroidal distance metric, then if our data is uniformly distributed, we will lose the latent space embedding as $d\rightarrow\infty$.
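The key step, that $\int_0^1 \rho_t(x,y)^2\pi(y)dy$ does not depend on $x$ only when $\pi$ is uniform, can be illustrated numerically in one coordinate. This is a sketch with our own function names. The `tilted` density is a hypothetical non-uniform example, and the integral is approximated with a midpoint rule.

```python
import numpy as np


def mean_sq_torus_distance(x, density, m=20_000):
    """Midpoint-rule estimate of the integral of rho_t(x, y)^2 pi(y) over [0, 1]."""
    y = (np.arange(m) + 0.5) / m
    gap = np.abs(x - y)
    rho_sq = np.minimum(gap, 1.0 - gap) ** 2  # squared torus distance
    return float(np.mean(rho_sq * density(y)))


def uniform(y):
    return np.ones_like(y)


def tilted(y):
    # Hypothetical non-uniform density on [0, 1]; positive and integrates to 1.
    return 1.0 + 0.5 * np.cos(2.0 * np.pi * y)


xs = np.linspace(0.0, 1.0, 11)
g_uniform = [mean_sq_torus_distance(x, uniform) for x in xs]
g_tilted = [mean_sq_torus_distance(x, tilted) for x in xs]

print("uniform spread over x:", max(g_uniform) - min(g_uniform))
print("tilted spread over x: ", max(g_tilted) - min(g_tilted))
```

For the uniform density the integral is flat in $x$, whereas the tilted density makes it vary with $x$, which is what forces $\beta \neq 0$.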

Example

Here we plot the distribution of edge counts in the high-dimension limit for RGGs with uniformly (figure 2) and Gaussian distributed (figure 3) nodes to illustrate the difference that changing the node distribution can make.

Figure 2: Comparison of the theoretical distribution as $d\rightarrow\infty$ of edge counts for RGGs in $[0,1]^d$ and $\mathbb{T}^d$ with uniform nodes. Top: $\mathbb{P}(\text{# of edges } = k)/\binom{n}{k}$ for RGGs in the cube for $n=3$ and $n=7$. The torus would have a uniform distribution and is therefore omitted. Bottom: $\mathbb{P}(\text{# of edges }=k)$ for RGGs with $n=7$ in $[0,1]^d$ and $\mathbb{T}^d$.

Figure 3: Comparison of the theoretical distribution as $d\rightarrow\infty$ of edge counts for RGGs in $[0,1]^d$ and $\mathbb{T}^d$ with Gaussian distributed nodes. Top: $\mathbb{P}(\text{# of edges } = k)/\binom{n}{k}$ for RGGs in $[0,1]^d$ and $\mathbb{T}^d$ for $n=3$ and $n=7$. Bottom: $\mathbb{P}(\text{# of edges }=k)$ for RGGs with $n=7$ in $[0,1]^d$ and $\mathbb{T}^d$.

Conclusion

In this blog post, we have defined the random geometric graph (RGG) with general node distributions, and shown that, most of the time, the spatial correlations between edges are preserved as the dimension of the underlying geometry tends to $\infty$. In the $d$-cube, geometry is preserved as long as the distribution of our nodes is neither constant nor Bernoulli with parameter $1/2$, and in the $d$-torus, geometry is preserved provided our distribution is not uniform. The result for the torus is especially interesting, since it challenges ideas in the literature about how we should model high-dimensional RGGs. In real-world data, the coordinates are unlikely to be uniformly distributed, yet the majority of theoretical high-dimensional random geometric graph studies use uniform distributions on the torus. The issue is that this is the only case where the torus behaves like a $G(n,p)$ graph. For a more in-depth explanation of this work, and extensions to the concepts of graph entropy, see our recent preprint [1].

References

[1] O. Baker and C.P. Dettmann “Entropy of Random Geometric Graphs in High and Low Dimensions”. arXiv preprint arXiv:2503.11418 (2025)

[2] V. Erba et al. “Random geometric graphs in high dimension”. Phys. Rev. E 102.1 (2020), 012306.

[3] E. N. Gilbert. “Random Plane Networks”. Journal of the Society for Industrial and Applied Mathematics 9.4 (1961), 533–543.

Student perspectives: AI UK 2025 Conference

Posted on 25th April 2025 by shaun.jordan

A post by Sam Bowyer, PhD student on the Compass programme.

Compass at AI UK

The Alan Turing Institute’s AI UK 2025 Conference was held last month in the QEII Centre, Westminster, and three Compass students – Emma Ceccherini, Sherman Khoo, and myself – were present for both days of the event. We attended a variety of sessions and spent time exploring the exhibition stalls, which showcased a wide range of AI projects from within academia, government and industry.

Compass students and staff pictured at the AI UK 2025 Conference

Compass CDT students and staff at the AI UK 2025 Conference. From left to right:
Dr Dan Lawson, Emma Ceccherini, Sam Bowyer, Sherman Khoo and Helen Mawdlsey

It was an eye-opening experience to learn about the work that The Alan Turing Institute does, and especially insightful to see the myriad downstream applications of the machine learning theory that we spend so much time thinking about.

Conference highlights

One particular favourite exhibition was that of the Ministry of Justice (MoJ). Emma and I talked to a data scientist at the MoJ who was working on a tool that uses LLMs to explain laws in plain English, in order to help regular people better understand their rights.

Another project involved aggregating various disconnected datasets from across government on the national and local level in order to research social factors that might lead to successful post-prison rehabilitation or equally to recidivism.

It was encouraging to see a variety of projects and organisations at the conference aiming to use AI for social and public good, with a significant number focussed on climate and green tech.

Whilst Compass wasn’t presenting at AIUK, colleagues from the Informed AI Hub, the Interactive AI CDT, the AI For Collective Intelligence (AI4CI) Hub, and Jean Golding Institute were.

It was great to not only see the other projects that are going on in the University, but also to be able to network with colleagues who only work down the road from the Fry Building (e.g. sharing Bristol restaurant recommendations)!

On the first day of the conference, Professor Charlotte Deane, Executive Chair of the Engineering and Physical Sciences Research Council (EPSRC), gave an informative keynote talk on the state of scientific research in UK academia. It was surprising to learn about the overall size of EPSRC and the range of activities they engage in, particularly their keenness for investing in spin-outs. I found Professor Deane’s talk to be very encouraging and optimistic.

The second day of the conference focused on governmental uses of AI, particularly in medicine and in defence. Professor the Lord Darzi, who recently led the Independent Investigation of the NHS in England, gave an incredibly thoughtful talk on the opportunities for AI within the NHS.

He likened the current AI boom to the development of keyhole surgery in the second half of the 20th century, urging fast, nationwide deployment in order to improve health outcomes and equality throughout the UK.

Three talks on defence and national security similarly stressed the importance of fast uptake of AI tools and made clear the desire for public-private partnerships (including with academia) in order to make this happen. (The importance of cross-sector collaboration was consistently a strong theme at AIUK, although the absence of frontier AI labs did, in my opinion, betray a slight limit to this stated commitment).

Presentation karaoke

It wasn’t all so serious, however! The conference finished its first day with “Presentation Karaoke”, in which eight contestants competed to present unseen 5-minute long, 10-slide PowerPoints, each more bizarre than the last.

This fun, often slightly cringe-inducing, activity is now rumoured to be deployed at a future COMPASS student event. (Get practising your stand-up now…)

In summary, AIUK was a great opportunity to see how AI/ML research leads to real-world impact in the UK, and I would recommend attending to any CDT student in the future.

Guest Lecture: Professor Chris Breward

Posted on 3rd April 2025 by shaun.jordan

An Introduction to Knowledge Exchange

The Compass CDT was delighted to recently host a Guest Lecture by Professor Chris Breward, from the Mathematical Institute, University of Oxford.

Chris led an interactive session for our PhD students, which focused on getting started with knowledge exchange (KE), and explored the skills needed to engage with industrial and other external partners.

As Scientific Director of the Knowledge Exchange Hub for Mathematical Sciences and Co-Director of the EPSRC CDT in Industrially Focused Mathematical Modelling, Chris had a wide range of valuable advice to share.

Drawing on his experience building partnerships with companies and setting up projects with industry co-funding, he ran through the different ways researchers at all stages of their career can get involved in KE.

Attendees explored why companies might engage with mathematical scientists, discussed things to consider before meeting potential collaborators, and looked at what can sometimes go wrong with academic-business relationships.

Student reflections

“During his Guest Lecture, Chris chatted with all of us about ways to communicate with non-academics during shared projects and how to do positive work as mathematical consultants.

“The session covered the pragmatics and hard-skills of private sector contract work, as well as the soft skills of open body language, effective listening and people management.

“He described the barriers that can arise between researchers (mathematicians in particular) and industrial partners. We then chatted interactively through where these pitfalls come from and how best to avoid them.

“He also gave us an entry-level look into the broader differences between universities and industry.”

Emma Tarmey, Compass CDT student, Cohort 4

KE initiatives

Chris closed by encouraging attendees to get involved with some of the opportunities the KE Hub provides for PhD students and researchers, such as the online Triage Workshops. These events can provide a safe space for individuals to gain experience with knowledge exchange, by observing senior colleagues from across the country.

He expressed his hope that Compass students would benefit from the upcoming five-day European Study Group with Industry (ESGI), which will take place here at the University of Bristol from Saturday, 14 July to Wednesday, 18 July 2025.

The Compass CDT was grateful to Chris for giving up his time to visit us in the School of Mathematics’ Fry Building, and we look forward to seeing him in Bristol again in the future.

As well as being an applied mathematician, lecturer and researcher at University of Oxford, Chris is co-founding Chief Moderator of the Mathematics-In-Industry Reports online KE repository, and a member of the Newton Gateway’s Scientific Advisory Board.

Student perspectives: Expectation Propagation

Posted on 27th February 2025 by Grace Yan

A post by Grace Yan, PhD student on the Compass programme.

Introduction

In many real-world problems, computing the exact posterior distribution is often infeasible due to non-conjugate priors and high-dimensional datasets. Thus, approximate Bayesian inference methods are used instead to obtain an approximate posterior. Some well-known examples of these methods include Variational Bayes (VB), Laplace approximation and Expectation Propagation (EP). In this blog post, I will focus on Expectation Propagation and explain what it is, how it works, its strengths and limitations, and its relation to similar methods.

Figure 1: A comparison of approximate Bayesian inference methods along a spectrum of computational speed and accuracy. Methods like Variational Bayes (VB) and the Laplace approximation are faster but less accurate, while approaches like Expectation Propagation (EP) and Markov Chain Monte Carlo (MCMC) are slower but provide higher accuracy. Source: [1].

Background

EP was introduced by Minka in 2001 as an extension of assumed-density filtering (ADF), which is a one-pass sequential algorithm for obtaining an approximate posterior [2]. Like VB methods, its aim is to approximate an intractable posterior with tractable distributions by minimising the Kullback-Leibler (KL) divergence. Recall that the KL divergence measures how different two distributions $p$ and $q$ are; often, $p$ is the true distribution and $q$ is a model distribution that we use to approximate $p$. There are two kinds of KL divergence: the forward KL and the reverse KL. Assuming $x$ is continuous, these are defined as
\[ \mathrm{KL}(p(x) \| q(x)) = \int p(x) \mathrm{log}\frac{p(x)}{q(x)} dx \]
and
\[ \mathrm{KL}(q(x) \| p(x)) = \int q(x) \mathrm{log}\frac{q(x)}{p(x)} dx \]
respectively (in the discrete case, the integrals are replaced by sums). These two types are not equivalent; [3] gives a good explanation of how they differ. EP uses the forward KL.
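
To make the asymmetry concrete, here is a minimal Python sketch that evaluates both divergences in the discrete case. The two distributions `p` and `q` are made-up numbers chosen purely for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


p = [0.7, 0.2, 0.1]  # treated as the "true" distribution (illustrative values)
q = [0.4, 0.4, 0.2]  # treated as the approximation (illustrative values)

forward_kl = kl_divergence(p, q)  # KL(p || q), the divergence EP works with
reverse_kl = kl_divergence(q, p)  # KL(q || p), the divergence used in VB

print(f"forward KL(p || q) = {forward_kl:.4f}")
print(f"reverse KL(q || p) = {reverse_kl:.4f}")
```

Swapping the arguments changes the value, which is why the choice between forward and reverse KL leads to genuinely different approximations.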

Expectation Propagation

Let $\mathbf{x}$ denote the observed data and $\boldsymbol{\theta}$ the parameters of interest. Recall that by Bayes’ theorem, the posterior is
\[ p(\boldsymbol{\theta}|\mathbf{x}) = \frac{p(\mathbf{x}, \boldsymbol{\theta})}{p(\mathbf{x})}, \]
where $p(\mathbf{x})$ is the model evidence. We can write the joint distribution $p(\mathbf{x}, \boldsymbol{\theta})$ in the form of a product of factors $f_i$, which are also called ‘sites’:
\[p(\mathbf{x}, \boldsymbol{\theta}) = p(\boldsymbol{\theta})p(\mathbf{x}|\boldsymbol{\theta}) = p(\boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i|\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}),\] where $p(\boldsymbol{\theta})$ is the prior and the factors $f_1, \dots, f_n$ make up the likelihood, partitioned into $n$ iid parts (e.g. each $i$ could be a data point).

For $f$, we drop the conditioning on $\mathbf{x}$ to simplify the notation. I use $f_j(\boldsymbol{\theta})$ to refer to one specific factor and $f_i(\boldsymbol{\theta})$ to refer to the factors in the plural sense. My notation is similar to the notation in [4].

The idea is to approximate the posterior by approximating the factors with $\tilde{f}_i$, which are often assumed to be Gaussian (or some other member of the exponential family). These approximations are refined one at a time in an iterative process until convergence. In EP, refining a factor $\tilde{f}_j$ is a “team effort”; it requires information from each of the other factors. This concept is known as message passing, a term borrowed from computer science, where messages are passed between different programs.

Figure 2: Illustration of message passing between three factors. The arrows show the exchange of information between the factors: each $f_j$ sends out its information to the other two factors and also receives information from them.

Using the approximations of the likelihood factors, the resulting approximate posterior is given by
\[q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).\]
In EP, the prior $p(\boldsymbol{\theta})$ is also taken to be Gaussian. Since the product of Gaussians results in another Gaussian, $q$ has to be a Gaussian distribution. Therefore, we do not face the issue of finding the normalising constant for an unnormalised posterior.

To make the approximations as accurate as we can, we need a kind of measurement. Naturally, the global KL divergence comes to mind, so we might want to consider minimising the following:

\[
\mathrm{KL}(p \| q) = \mathrm{KL} \left( \frac{1}{p(\mathbf{x})} p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}) \bigg\| p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}) \right).
\]

However, the global KL divergence is difficult to optimise. Instead, EP minimises the KL divergence locally to update each factor one at a time, using a distribution called the tilted distribution. When updating the factor $\tilde{f}_j$, the tilted distribution is defined by

\[ q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta})q_{\setminus j}(\boldsymbol{\theta}), \]
where $q_{\setminus j}$ is called the cavity distribution, which is essentially the approximate posterior with the factor $\tilde{f}_j$ removed:

\[ q_{\setminus j}(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i \neq j} \tilde{f}_i(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}. \]
Then EP finds the $\tilde{f}_j$ that minimises the KL divergence between the tilted distribution and the updated approximate posterior (which we call $q^\text{new}$):
\[
\mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})), \]
where \[q^\text{new}(\boldsymbol{\theta}) = \tilde{f}_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}).
\] If $q^\text{new}$ is a member of the exponential family (e.g. Gaussian), then we can minimise $\mathrm{KL}(q^\text{tilt} \| q^\text{new})$ by matching the moments of $q^\text{new}$ with the moments of $q^\text{tilt}$. This trick is called moment matching. In general, for approximating distributions from the exponential family, matching moments of the approximating distribution with those of the target distribution minimises the forward KL [5].
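
To see why moment matching works, here is a brief sketch of the standard argument. The symbols $\boldsymbol{\eta}$, $T$, $A$ and $h$ are introduced here for illustration and are not used elsewhere in this post. Suppose $q^\text{new}$ belongs to an exponential family with natural parameter $\boldsymbol{\eta}$, sufficient statistics $T(\boldsymbol{\theta})$ and log-normaliser $A(\boldsymbol{\eta})$:
\[ q^\text{new}(\boldsymbol{\theta}) = h(\boldsymbol{\theta}) \exp\left( \boldsymbol{\eta}^\top T(\boldsymbol{\theta}) - A(\boldsymbol{\eta}) \right). \]
Viewed as a function of $\boldsymbol{\eta}$, the forward KL is then
\[ \mathrm{KL}(q^\text{tilt} \| q^\text{new}) = -\boldsymbol{\eta}^\top \mathbb{E}_{q^\text{tilt}}[T(\boldsymbol{\theta})] + A(\boldsymbol{\eta}) + \text{const}. \]
Setting the gradient with respect to $\boldsymbol{\eta}$ to zero and using the identity $\nabla A(\boldsymbol{\eta}) = \mathbb{E}_{q^\text{new}}[T(\boldsymbol{\theta})]$ gives
\[ \mathbb{E}_{q^\text{new}}[T(\boldsymbol{\theta})] = \mathbb{E}_{q^\text{tilt}}[T(\boldsymbol{\theta})]. \]
For a Gaussian, $T(\boldsymbol{\theta}) = (\boldsymbol{\theta}, \boldsymbol{\theta}\boldsymbol{\theta}^\top)$, so this condition amounts to matching the mean and covariance.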

Note that the tilted distribution is not Gaussian and therefore it can be difficult to compute its moments analytically. Instead, the moments are often computed numerically: using MCMC, we can generate samples from the tilted distribution (in which case we would not need to calculate its normalising constant) and use the samples to calculate the moments empirically.

The Gaussian EP algorithm is given below:

1. Initialise all the approximating factors $ \tilde{f}_i(\boldsymbol{\theta}), i=1,\dots,n $.
2. Initialise the approximate posterior by setting\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}),
\]where $ p(\boldsymbol{\theta}) $ is a Gaussian prior.
3. Until all $ \tilde{f}_i $ for $ i=1,\dots,n $ converge:
(a) Choose a factor $ \tilde{f}_j $ to refine.
(b) Evaluate the cavity distribution\[
q_{\setminus j}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}.
\]
(c) Set the tilted distribution\[
q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta})
\]and calculate the mean $ \boldsymbol{\mu} $ and covariance $ \boldsymbol{\Sigma} $ of $ q^\text{tilt} $.
(d) Obtain the new posterior $ q^\text{new} $ that minimises $ \mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})) $ by matching its moments with $ \boldsymbol{\mu} $ and $ \boldsymbol{\Sigma} $.
(e) Evaluate and store the refined factor\[
\tilde{f}_j(\boldsymbol{\theta}) = \frac{q^{\text{new}}(\boldsymbol{\theta})}{q_{\setminus j}(\boldsymbol{\theta})}.
\]
(f) Use the refined factors to update the approximate posterior as\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).
\]
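
The steps above can be sketched in a few lines of Python for a toy one-dimensional problem. This is only an illustration under several assumptions that are not from this post: the parameter $\theta$ is a scalar, the prior is $N(0, 4)$, each factor is a logistic likelihood $f_i(\theta) = \sigma(y_i\theta)$ with labels $y_i = \pm 1$, and the tilted moments are computed by simple quadrature on a grid rather than by MCMC. All function and variable names are hypothetical.

```python
import math

GRID = [-10.0 + 0.01 * k for k in range(2001)]  # quadrature grid for theta


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tilted_moments(y, cav_mean, cav_var):
    """Mean and variance of q_tilt, proportional to sigmoid(y * theta) * N(theta; cav_mean, cav_var)."""
    weights = [
        sigmoid(y * t) * math.exp(-0.5 * (t - cav_mean) ** 2 / cav_var) for t in GRID
    ]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID)) / total
    return mean, var


def gaussian_ep(ys, prior_var=4.0, max_sweeps=50, tol=1e-8):
    """Sequential Gaussian EP; Gaussians are stored as (precision, precision * mean)."""
    site_prec = [0.0] * len(ys)  # step 1: start from flat sites
    site_pm = [0.0] * len(ys)
    q_prec, q_pm = 1.0 / prior_var, 0.0  # step 2: q = prior * product of sites
    for _ in range(max_sweeps):  # step 3
        biggest_change = 0.0
        for j, y in enumerate(ys):  # (a) pick a site
            cav_prec = q_prec - site_prec[j]  # (b) cavity = q / site_j
            cav_pm = q_pm - site_pm[j]
            mean, var = tilted_moments(y, cav_pm / cav_prec, 1.0 / cav_prec)  # (c)
            new_prec, new_pm = 1.0 / var, mean / var  # (d) moment matching
            new_site_prec = new_prec - cav_prec  # (e) site_j = q_new / cavity
            new_site_pm = new_pm - cav_pm
            biggest_change = max(
                biggest_change,
                abs(new_site_prec - site_prec[j]),
                abs(new_site_pm - site_pm[j]),
            )
            site_prec[j], site_pm[j] = new_site_prec, new_site_pm
            q_prec, q_pm = new_prec, new_pm  # (f) q = cavity * refined site_j
        if biggest_change < tol:
            break
    return q_pm / q_prec, 1.0 / q_prec


def exact_moments(ys, prior_var=4.0):
    """Posterior mean and variance by brute-force quadrature, for comparison only."""
    weights = [
        math.exp(-0.5 * t * t / prior_var) * math.prod(sigmoid(y * t) for y in ys)
        for t in GRID
    ]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID)) / total
    return mean, var


labels = [1, 1, 1, -1, 1]  # made-up data
ep_mean, ep_var = gaussian_ep(labels)
exact_mean, exact_var = exact_moments(labels)
print(f"EP:    mean = {ep_mean:.3f}, variance = {ep_var:.3f}")
print(f"exact: mean = {exact_mean:.3f}, variance = {exact_var:.3f}")
```

Working with precisions makes the divisions in steps (b) and (e) into subtractions, since dividing one Gaussian density by another subtracts their natural parameters.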

Benefits and limitations of EP

As with any method, EP has advantages and disadvantages. Its advantages include the following:

EP updates the approximation factor-by-factor rather than globally, which often leads to better approximations of the target distribution.
EP is faster and computationally cheaper than MCMC. It can also speed up MCMC [1].
EP is easy to parallelise [6].
EP can easily be used in conjunction with other methods. Minka’s Roadmap on EP [7] provides a rich guide to the various areas that have employed EP, including regression, neural networks and nonlinear dynamic systems. EP has also been used with likelihood-free inference methods such as ABC (e.g. EP-ABC [8]).
Due to its factorisation structure, EP is naturally suited to graphical models, such as Bayesian networks and Markov random fields.

However, EP has some serious limitations, which later works have tried to address:

There is a lack of theoretical guarantees, e.g. convergence of the EP algorithm is not guaranteed.
If the number of approximating factors is large, this can lead to substantial memory consumption.

Extensions of EP

EP is well-suited for parallelisation. The parallel version of the original EP algorithm (sometimes called ‘sequential EP’) is known as ‘parallel EP’. Here, factor updates occur simultaneously, meaning that $q$ is not updated at the end of each iteration in step 3 of the algorithm above, i.e. $\tilde{f}_j$ is refined using a cavity distribution that is the product of the unrefined factors with $\tilde{f}_j$ removed. $q$ is updated only after all the factors have been updated once (whereas in sequential EP, it is updated after each factor update), and the process then repeats for multiple rounds.

Since the introduction of EP, many variants have been developed, such as averaged EP (AEP) [9], power EP (PEP) [10] and stochastic EP (SEP) [11]. Different choices of divergence function have led to Variational Message Passing (VMP) [12], which uses the reverse KL, and Laplace propagation (LP) [13], which uses the Laplace approximation. Much work has been done to alleviate EP’s issues, such as guaranteeing convergence [14][9], bounding its approximation errors [15], and reducing memory consumption (e.g. SEP).

Due to the close relation between EP and Variational Inference (VI), many methods have been developed from the unification of the two. For example, Partitioned Variational Inference (PVI) [16] arises from a mixture of several methods including power EP, global VI and local VI.

Figure 3: VI and EP schemes encompassed by the PVI framework. Source: [16].

In recent years, there has been a growing interest in federated learning. This is where the dataset is partitioned across “clients” that train models locally before the model parameters are aggregated by the central server to update the global model. Since the posterior naturally factorises across partitioned client data, EP adapts well to this framework, producing algorithms such as FedEP [17] and Federated Neural Propagation (FedNP) [18].

References

[1] Barthelmé, S. (2016). The Expectation-Propagation Algorithm: a tutorial. Gipsa-lab, CNRS. https://www.cirm-math.fr/ProgWeebly/Renc1619/Barthelme_EP1.pdf
[2] Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference [Doctoral dissertation]. Massachusetts Institute of Technology.
[3] Jang, E. (2016). A Beginner’s Guide to Variational Methods: Mean-Field Approximation. https://blog.evjang.com/2016/08/variational-bayes.html
[4] Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
[5] Murray, I. (2017). Variational objectives and KL Divergence [Lecture notes]. University of Edinburgh. https://www.inf.ed.ac.uk/teaching/courses/mlpr/2017/notes/w9a_variational_kl.pdf
[6] Cseke, B. and Heskes, T. (2011). Approximate marginals in latent Gaussian models. Journal of Machine Learning Research, 12:417–454.
[7] Minka, T. P. (n.d.). A roadmap to research on EP. https://tminka.github.io/papers/ep/roadmap.html
[8] Barthelmé, S. and Chopin, N. (2012). Expectation Propagation for Likelihood-Free Inference. arXiv:1107.5959.
[9] Dehaene, G. and Barthelmé, S. (2018). Expectation propagation in the large data limit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):199–217.
[10] Minka, T. P. (2004). Power EP [Technical Report MSR-TR-2004-149]. Microsoft Research Ltd.
[11] Li, Y., Hernández-Lobato, J. M. and Turner, R. E. (2015). Stochastic Expectation Propagation. arXiv:1506.04132.
[12] Winn, J., Bishop, C. M. and Jaakkola, T. (2005). Variational message passing. Journal of Machine Learning Research, 6(4):661–694.
[13] Smola, A., Vishwanathan, S. V. N. and Eskin, E. (2003). Laplace propagation. Advances in Neural Information Processing Systems, 16.
[14] Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C. and Teh, Y. W. (2017). Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server. arXiv:1512.09327.
[15] Dehaene, G. and Barthelmé, S. (2016). Bounding errors of Expectation-Propagation. arXiv:1601.02387
[16] Bui, T. D., Nguyen, C. V., Swaroop, S. and Turner, R. E. (2018). Partitioned Variational Inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206.
[17] Guo, H., Greengard, P., Wang, H., Gelman, A., Kim, Y. and Xing, E. P. (2023). Federated learning as variational inference: a scalable expectation propagation approach. arXiv:2302.04228.
[18] Wu, X., Huang, H., Ding, Y., Wang, H., Wang, Y. and Xu, Q. (2023). FedNP: Towards Non-IID Federated Learning via Federated Neural Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):10399-10407.

Student Perspectives: Unravelling Ancestry – When Genes Don’t Follow the Family Tree

Posted on 13th February 2025 (updated 17th February 2025) by ic23897

A post by Daniella Montgomery, PhD student on the Compass programme.

Introduction

In my project, I am working with my two supervisors, Dan Lawson in the School of Mathematics and Sion Bayliss from the School of Veterinary Science, to investigate the analysis of genomic data and their inferred ancestry trees, to detect problematic lineages of bacterial pathogens.

An ancestry tree is a tree which describes how genetic data is passed down through generations. By understanding the evolution of bacteria, we can develop strategies to alert us when dangerous pathogens evolve. Bacteria typically have only one parent, and if this were always the case, their evolution could be described as a tree. However, bacteria also frequently evolve using horizontal gene transfer, where genetic data is exchanged between lineages with different ancestries, as seen in Figure 1. This disrupts the traditional parent-to-offspring tree, and instead, one needs to represent it using a complex graph.

Figure 1: An example of a phylogeny with horizontal gene transfer shown by the red dashed line and the resulting recombined lineage shown as a full red line breaking the structure of this tree.

In this case, each location on the genome may be described by a different tree obtained by following the correct parent at that location, i.e. the “left” or “right” parent of the red individual in Figure 1. These trees can be called “local ancestries”.

Simulating Ancestry with Msprime

The Python package msprime allows us to simulate genetic ancestral data using the coalescent method. The coalescent method is a backwards-in-time stochastic process where one has a population of lineages from which a sample of n is randomly selected, as seen in Figure 2. As we go back in time, the parent of each sampled lineage is drawn at random from the population, generation by generation. Once two lineages pick the same parent, they coalesce into one. This process is repeated until a common ancestor is reached.
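
The process described above can be sketched directly in a few lines of plain Python. This is a toy illustration only, not how msprime is implemented: it assumes a fixed population size with non-overlapping generations, and it records only when lineages merge rather than the full tree. The function name and parameters are hypothetical.

```python
import random


def coalescent_events(sample_size, population_size, seed=0):
    """Trace sampled lineages back in time until they share one common ancestor.

    Returns a list of (generations_ago, lineages_remaining) pairs, one entry
    for each generation in which at least two lineages picked the same parent.
    """
    rng = random.Random(seed)
    lineages = sample_size
    generation = 0
    events = []
    while lineages > 1:
        generation += 1
        # Every surviving lineage picks its parent uniformly at random.
        parents = {rng.randrange(population_size) for _ in range(lineages)}
        if len(parents) < lineages:
            # Lineages that chose the same parent have coalesced.
            lineages = len(parents)
            events.append((generation, lineages))
    return events


events = coalescent_events(sample_size=5, population_size=10, seed=1)
for generations_ago, remaining in events:
    print(f"{generations_ago} generations ago: {remaining} lineage(s) remain")
```

The gaps between successive entries play the role of the waiting times between coalescence events shown in Figure 2.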

Figure 2: A depiction of the coalescent method taken from [1] for a population of 10 individuals and a sample size of 10. By keeping track of the times between coalescence events (T(3) and T(2)) and which lineages coalesce with which, we have a full picture of the phylogenetic tree.

The Impact of Gene Conversion

In this experiment, I am investigating how population structure manifests in genetic data and how this is affected by varying gene conversion rates. Gene conversion is a type of horizontal gene transfer where a donor genome replaces a sequence of DNA in a homologous acceptor genome. Our simulation has one population that splits into two populations with some gene conversion within the populations, as seen in Figure 3. From this, we can obtain local pedigrees across the genome for several sample genomes. Each local pedigree has a complex history, but gene conversion allows each gene to have a different random history.

Figure 3: A conceptual picture of the true population structure and the local pedigree of the sampled population obtained from simulation with nodes coloured by population. Blue represents the ancestral population and red and green represent the two descendent populations, A and B. The leaf nodes are labelled for comparison with future analysis.

Analyzing the Data

One common way to visualise complex histories is through Principal Component Analysis (PCA), where the data undergoes eigenvalue decomposition, which groups similar genomes together in a far lower-dimensional space. This dimensionality reduction also allows us to visualise certain population structure characteristics [2]. For example, in all of our 2D PCA graphs in Figure 4, we can see a clear split between population A and population B.
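
To illustrate how an eigendecomposition can separate populations, here is a small self-contained Python sketch. It is not the analysis pipeline used in this project: the six binary genomes are invented, and the leading eigenvector of the genome-by-genome similarity matrix is found with plain power iteration instead of a full decomposition. All names are hypothetical.

```python
def centre_columns(rows):
    """Subtract the per-site mean from every genome."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - m for value, m in zip(row, means)] for row in rows]


def gram_matrix(rows):
    """Genome-by-genome matrix of inner products of the centred data."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    # Start from a non-constant vector: after centring, the constant vector
    # is mapped to zero by the Gram matrix.
    vec = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


# Invented toy data: rows are genomes, columns are variable sites (0/1 alleles).
genomes = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # population A
    [1, 1, 1, 0, 0, 0, 0, 0],  # population A
    [1, 1, 0, 1, 0, 0, 1, 0],  # population A
    [0, 0, 0, 0, 1, 1, 1, 1],  # population B
    [0, 0, 0, 1, 1, 1, 1, 0],  # population B
    [0, 1, 0, 0, 1, 1, 0, 1],  # population B
]

pc1 = leading_eigenvector(gram_matrix(centre_columns(genomes)))
for label, score in zip("AAABBB", pc1):
    print(f"population {label}: PC1 coordinate {score:+.3f}")
```

In this toy example the sign of the first principal coordinate is enough to tell the two populations apart, mirroring the split seen in the PCA plots.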

However, there is a limit to how interpretable these PCs are. We use the dendrogram from hierarchical clustering to help sort our data such that more similar data is kept together. Then we create a covariance plot of how similar the principal components of each genome are to each other. This plot is a rudimentary method to help us visualize the population structure of the simulation’s resulting lineages seen in Figure 5a. The population structure is clear, but there is still structure given by the random pedigree shared by all individuals.

Figure 4: The principal component analysis plots with colours showing the true populations for a gene conversion rate of 1e-6.

Figure 5: Covariance matrices for increasing gene conversion rates (reading left to right, top to bottom) 1e-6, 1e-5, 1e-4, showing a breakdown of the sub-population structure.

In Figures 5a to 5c, we can see that as gene conversion is increased, the covariance matrix reflects any one random history less, and instead “averages out” into the population structure. This is a visualization of the dependence on the history breaking down as the genomes within each population become more similar to each other due to gene conversion.

If you would like to know more about this topic, please contact me at ic23897@bristol.ac.uk.

[1] Rosenberg, N., Nordborg, M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet 3, 380–390 (2002). (https://doi.org/10.1038/nrg795)

[2] McVean G. A genealogical interpretation of principal components analysis. PLoS Genetics, 5, e1000686 (2009). (https://doi.org/10.1371/journal.pgen.1000686)

Student perspectives: Genetic Boolean Models – How to Make One

Posted on 6th February 2025 by shaun.jordan

A post by Daniel Gardner, PhD student on the Compass programme.

Introduction

My research focuses on genetic interaction networks within lung cancer cells. Our (long-term) aim is to model such networks dynamically using a Boolean modelling framework, and then use this to tie changes in cancer cells’ physiology to certain, often mutated, genes of interest.

Aims and problems

This blog post will focus on the challenge we are currently working on: constructing the model itself. This is often the most challenging element of the research, as it underpins all results going forward, and often there does not exist enough data to fully define a unique model.

In some respects this is acceptable, as Boolean modelling is more of a qualitative approach. Each node in the network is a ‘species’, be that a gene, protein, small molecule, etc. Each directed, labelled edge is either ‘activating’, if an increase in species A causes an increase in species B, or ‘inhibiting’, if the opposite is the case [1].

With this definition, a lot of papers we have looked at define their model purely from the literature [2], [3], [4], either manually mining links, or using pre-existing databases like the Kyoto Encyclopedia of Genes and Genomes (KEGG)[6].

What we are more interested in are methods deriving these models in a far more quantitative way, straight from transcriptomic data. Whilst some of the papers referenced above justify their hand-built models in retrospect by showing they can replicate real-world results [3], we wish to work the other way round – beginning at the real-world results and then using a reverse engineering approach.

Figure 1: The Boolean model used in [2], based off a similar model constructed in [4]. It contains 98 nodes (species) and 254 directed edges (labelled interactions).

Potential solutions

The solutions we have found can be broadly split into two categories: methods that go from:

Raw Data → Interaction Network

and similarly:

Interaction Network → Boolean Model

The former is a much more difficult challenge. Generally, in a published network, each edge will reference experimental work that justifies, e.g. ‘A activates B’. However, datasets that combine many cell-line perturbation experiments are hard to come by, and the experiments are expensive to perform [5]. The problem is often also underdetermined, since the solution space of potential networks is far larger than the amount of data available. One option we may look into in the future, however, involves using other modelling techniques, such as ODEs or Bayesian networks.

The challenge of reverse engineering a Boolean network from a pre-built network is much more feasible. The main problem in this case is considering complex interactions. For example, if we had ‘A inhibits C’ and ‘B activates C’, how do they work in tandem?

Figure 2: Part of the optimisation algorithm from [7] applied to a toy model. In D, we classify each species in the network. All non-compressed nodes are those which we have data to train on. In E, we construct the hypergraph, where for any pair of combined interactions, both the ‘AND’ and ‘OR’ case are considered.

Sticking to the Boolean framework, these two interactions can either be joined through an ‘AND’ relation, or an ‘OR’ relation. For several proteins affecting one specific protein, the combinations of Boolean rules are non-trivial.
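
As a small illustration of why the choice matters, the Python sketch below tabulates the two candidate rules for the example of ‘A inhibits C’ and ‘B activates C’. The function names are hypothetical, and reading an inhibiting edge as a logical NOT on the input is an assumption of this sketch.

```python
from itertools import product


def c_with_and(a, b):
    """C is on only when the activator B is on AND the inhibitor A is off."""
    return (not a) and b


def c_with_or(a, b):
    """C is on when the activator B is on OR the inhibitor A is off."""
    return (not a) or b


truth_table = [
    (a, b, c_with_and(a, b), c_with_or(a, b))
    for a, b in product([False, True], repeat=2)
]

print("A      B      C (AND)  C (OR)")
for a, b, c_and, c_or in truth_table:
    print(f"{a!s:<6} {b!s:<6} {c_and!s:<8} {c_or!s}")

# Input states on which the two candidate rules disagree.
disagreements = [(a, b) for a, b, c_and, c_or in truth_table if c_and != c_or]
print("rules disagree on:", disagreements)
```

Even with only two inputs the two rules give different outputs for half of the input states, and the number of possible rules grows quickly as more regulators are added.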

One paper we found that deals with this problem well is Saez-Rodriguez et al. [7], which attempts to train a hypergraph of the interaction network to cell line assay data. It contains a number of different techniques to do with graph and state space reduction, as well as some heuristic rules on which complex interactions to target. For example, it is unlikely in biology for a protein to require multiple other species to necessitate a change in function, so we can remove ‘AND’ links of more than N complex interactions from the state space.
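
To give a feel for how such a size limit shrinks the search space, here is a loose Python sketch that enumerates candidate ‘AND’ gates for a single target. It is not the algorithm from [7]; the species names, the extra input D and the function name are all hypothetical.

```python
from itertools import combinations


def candidate_and_gates(inputs, max_and_size):
    """All groups of at most max_and_size inputs that could be joined by 'AND'.

    A group of size 1 is just the original single edge; the target's rule is
    then an 'OR' over whichever groups the optimisation decides to keep.
    """
    gates = []
    for size in range(1, max_and_size + 1):
        gates.extend(combinations(inputs, size))
    return gates


# Hypothetical signed inputs to a target species C.
inputs_to_c = [("A", "inhibits"), ("B", "activates"), ("D", "activates")]

limited = candidate_and_gates(inputs_to_c, max_and_size=2)
unlimited = candidate_and_gates(inputs_to_c, max_and_size=len(inputs_to_c))

for gate in limited:
    print(" AND ".join(f"{name} ({sign})" for name, sign in gate))
print(f"{len(limited)} candidate gates with the limit, {len(unlimited)} without")
```

With only three inputs the saving is small, but the number of possible groups grows exponentially with the number of regulators, so capping the gate size can remove most of the candidates for well-connected species.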

One other model component we would like, but have not yet investigated properly, is a ‘layered’ model that includes different levels of genomic interaction. For example, many papers we have read use ‘protein interaction network (PIN)’ and ‘gene regulatory network (GRN)’ interchangeably. Whilst the two are closely related, treating them as equivalent in all cases is incorrect.

Conclusion and future plans

Starting directly from data to build a network is perhaps too ambitious a challenge, especially with the limited data available. In fact, even training a Boolean network by optimisation requires quite specific cell-line perturbation data. It could be that we make do with a network partially trained on limited data, with the rest taken from prior knowledge in the literature.

One promising sign is that [7] finds that it is best to begin with ‘too many’ interactions in a literature-curated interaction network, and then ‘prune’ spurious interactions via network optimisation. This is because these large networks are built from many different sources, some using different tissues, conditions, etc. Therefore, when we desire a model specific to lung adenocarcinoma data, it is natural for the training to remove many of these genetic interactions.

In the future, we aim for this research topic to be just one section of the wider project. Once we decide upon the most justified Boolean model for lung cancer, we aim to use patient mRNA and mutation data to personalise the models, in order to predict patient-specific cell phenotype probabilities. Using this, along with multi-layer protein imaging data from Cancer Research UK, we aim to find a statistically significant link between certain gene mutations and the resulting shape, and therefore phenotype, of a tumour.

Thank you for reading this blog post. If you have any questions, please feel free to get in touch with me at: daniel.gardner@bristol.ac.uk

References

[1] Abou-Jaoudé, W., Traynard, P., Monteiro, P. T., Saez-Rodríguez, J., Helikar, T., Thieffry, D., and Chaouiya, C. (2016). Logical modeling and dynamical analysis of cellular networks. Frontiers in Genetics, 7.

[2] Béal, J., Montagud, A., Traynard, P., Barillot, E., and Calzone, L. (2019). Personalization of logical models with multi-omics data allows clinical stratification of patients. Frontiers in Physiology, 9.

[3] Cohen, D. P. A., Martignetti, L., Robine, S., Barillot, E., Zinovyev, A., and Calzone, L. (2015). Mathematical modelling of molecular pathways enabling tumour cell invasion and migration. PLOS Computational Biology, 11.

[4] Fumiã, H. (2013). Boolean network model for cancer pathways: Predicting carcinogenesis and targeted therapy outcomes. PLoS ONE, 8:e69008.

[5] Galindez, G., Sadegh, S., Baumbach, J., Kacprowski, T., and List, M. (2023). Network-based approaches for modeling disease regulation and progression. Computational and Structural Biotechnology Journal, 21:780–795.

[6] Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30.

[7] Saez-Rodriguez, J., Alexopoulos, L. G., Epperlein, J., Samaga, R., Lauffenburger, D. A., Klamt, S., and Sorger, P. K. (2009). Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Molecular Systems Biology, 5(1):331.

Student perspectives: Regional Sensitivity Analysis

Posted on 21st November 2024 by cecina.babichmorrow

A post by Cecina Babich Morrow, PhD student on the Compass programme.

Introduction

Sensitivity analysis seeks to understand how much changes in each input affect the output of a model. We want to be able to determine how variation in a model’s output can be attributed to variations in its input. Given the high amount of uncertainty present in most real-world modelling settings, it is crucial to understand the magnitude of this uncertainty’s impact on model results. Knowing how sensitive a model is to a particular parameter can help guide modellers in prioritising what level of precision is needed in estimating that parameter in order to produce valid results. Sensitivity analysis thus serves as a vital tool for modellers in numerous fields, allowing them to assess robustness and to identify key drivers of uncertainty. By systematically analysing the relative amount of influence that each input parameter has on the output, sensitivity analysis reveals which parameters have the greatest impact on the results.

By identifying these critical parameters, stakeholders can prioritize investments in data collection, parameter estimation, and uncertainty reduction. This targeted approach ensures that efforts are concentrated where they will have the most significant impact.

Why use Regional Sensitivity Analysis?

In this blog post, I will focus on one particular sensitivity analysis method that I have been using in my project so far to help understand the sensitivity of an output decision to input parameters which affect that decision. Regional Sensitivity Analysis (RSA) was developed in the field of hydrology, but has widespread applications in environmental modelling, disease modelling, and beyond.

My research focuses on environmental decision-making, so I frequently deal with models that output a decision that can take on one of several discrete values. For example, consider trying to make a decision about what to wear based on the weather. To make our decision, we use three input parameters about the weather: temperature, humidity, and wind speed. Then, our decision model can output one of three decisions: (1) stay home, (2) leave the house with a jacket, (3) leave the house without a jacket. We might then be interested in how sensitive our model is to each of our three weather-related input parameters to understand how much each one contributes to uncertainty in our ultimate decision. In this type of setting, we need to use a sensitivity analysis method that can handle continuous inputs, e.g. temperature, in conjunction with a discrete output, e.g. our decision.

For settings such as these, where the inputs of our model are continuous and the outputs are discrete, RSA, also referred to as Monte Carlo filtering, is a potential method of sensitivity analysis [1]. RSA aims to identify which regions of input space correspond to specific values in the output space [2, 3]. Originally, the method was developed in the field of hydrology for cases where the output variable is binary, or made such by applying a threshold. It has since been extended by splitting the parameter space into more than two groups [3, 4]. RSA is well-suited to sensitivity analysis in the case where the output variable is categorical [5].

RSA is fundamentally a Bayesian approach. First, prior distributions are assigned to the input parameters. The model is then run multiple times, sampling input parameters from these priors, and recording the resulting output values. By analysing the relationship between input uncertainties and output uncertainties, RSA identifies which parameters significantly affect the model’s predictions.

How does RSA work?

We will present the mathematical formalisation of RSA in a setting where we have a discrete output variable $y \in \{ y_1, y_2, \ldots, y_m \}$ which can take on one of $m$ possible output values, and a vector of $d$ continuous input variables $\mathbf{x} = [x_1, x_2, \ldots, x_d]$ . We start with prior distributions on the input vector $\mathbf{x}$ , from which we sample before running the model to calculate the output value for that particular input.
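The sampling step can be sketched for the weather example as follows. The priors, thresholds and function names are invented purely for illustration; they are not taken from my project or from any of the cited papers.

```python
import random

DECISIONS = ("stay home", "jacket", "no jacket")


def decide(temperature_c, humidity_pct, wind_kmh):
    """Hypothetical decision model; every threshold here is made up."""
    if wind_kmh > 60 or temperature_c < -5:
        return "stay home"
    if temperature_c < 15 or wind_kmh > 30 or humidity_pct > 90:
        return "jacket"
    return "no jacket"


def sample_inputs(rng):
    """Draw one input vector x from assumed independent uniform priors."""
    temperature_c = rng.uniform(-10, 30)
    humidity_pct = rng.uniform(20, 100)
    wind_kmh = rng.uniform(0, 80)
    return temperature_c, humidity_pct, wind_kmh


def run_monte_carlo(n_runs, seed=0):
    """Sample inputs from the priors and record the model output for each."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        x = sample_inputs(rng)
        runs.append((x, decide(*x)))
    return runs


runs = run_monte_carlo(2000)
counts = {}
for _, y in runs:
    counts[y] = counts.get(y, 0) + 1
print(counts)
```

Each entry of `runs` pairs an input vector with the decision it produced, which is all RSA needs for the next step.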

Then, RSA compares the empirical conditional cumulative distribution functions (CDFs) $F_{x_i | y_j}$ conditioned on the different output values of $y$ . That is, for the $i$ th input parameter, we take the empirical CDF conditioned on the output of the model being the $j$ th possible output value. For example, in our weather-based decision model, we would be considering the empirical CDF $F(\text{temperature } | \text{ decide to stay home})$ . We then compare these CDFs $F_{x_i|y_j}$ for each of the possible $j \in \{1, \ldots,m\}$ output values (in our case, each of the possible output decisions). If the conditional CDFs of $x_i$ differ greatly in distribution for one or more of the values of $y$ , then we can conclude that our model is sensitive to that particular input parameter. If $F(x_i) = F(x_i | y_1) = \ldots = F(x_i | y_m)$ , then the output is insensitive to $x_i$ on its own. (See the Extensions of RSA section for a discussion of variable interactions.)
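As a minimal sketch of the conditional CDFs, the snippet below groups sampled temperatures by the decision they lead to and builds one empirical CDF per group. It assumes a deliberately simple, hypothetical one-input model in which the decision depends on temperature alone.

```python
import random
from bisect import bisect_right


def empirical_cdf(sample):
    """Return a function x -> fraction of the sample that is <= x."""
    sorted_sample = sorted(sample)
    n = len(sorted_sample)

    def cdf(x):
        return bisect_right(sorted_sample, x) / n

    return cdf


def decide(temperature_c):
    """Hypothetical model: the decision depends on temperature only."""
    if temperature_c < 0:
        return "stay home"
    if temperature_c < 15:
        return "jacket"
    return "no jacket"


rng = random.Random(1)
temperatures = [rng.uniform(-10, 30) for _ in range(5000)]

# Group the sampled inputs by the output value they produced.
groups = {}
for t in temperatures:
    groups.setdefault(decide(t), []).append(t)

# One empirical conditional CDF F(temperature | decision) per output value.
conditional_cdfs = {y: empirical_cdf(values) for y, values in groups.items()}

for y, cdf in conditional_cdfs.items():
    print(f"F(temperature <= 10 | {y}) = {cdf(10.0):.3f}")
```

Here the three conditional CDFs barely overlap, which is what strong sensitivity to temperature looks like.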

The difference between these CDFs can be measured using several possible sensitivity indices. Typically, the Kolmogorov-Smirnov (KS) statistic is computed for each pair of output values of $y$ , and then some statistic (e.g. mean, median, maximum, etc.) is calculated to summarise the overall sensitivity of $y$ to $x_i$ :

$\text{stat}_{j,k} [KS(x_i)] = \text{stat}_{j,k} \left[\max_{x_i} \left \lvert F_{x_i | y_j} (x_i | y = y_j) - F_{x_i | y_k} (x_i | y = y_k) \right \rvert\right]$

where $j,k \in \{1, \ldots, m\}$ and $\text{stat}$ could be mean, median, maximum, etc.
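The index above can be sketched in a few lines. This is a plain standard-library illustration rather than the SAFEpython implementation, and it assumes the summary statistic is taken over unordered pairs of distinct output values.

```python
from bisect import bisect_right
from itertools import combinations
from statistics import mean


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def rsa_index(groups, stat=mean):
    """Summarise the pairwise KS statistics of one input across output values.

    `groups` maps each output value y_j to the sampled values of x_i that
    produced it; `stat` is the summary statistic (mean, median, max, ...).
    """
    pairwise = [
        ks_statistic(groups[j], groups[k])
        for j, k in combinations(sorted(groups), 2)
    ]
    return stat(pairwise)


# Tiny hand-made example: y1 and y3 share a distribution, y2 is shifted.
groups = {"y1": [1, 2, 3, 4], "y2": [3, 4, 5, 6], "y3": [1, 2, 3, 4]}
print("mean KS:", rsa_index(groups))
print("max KS: ", rsa_index(groups, stat=max))
```

The choice of `stat` matters: the maximum flags an input as soon as one pair of outputs is separated, whereas the mean dilutes that signal across all pairs.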

For instance, consider the following situation with an input parameter $x_i$ , where the output $y$ can take on one of three values. We assumed a uniform prior for $x_i \sim \text{Unif}(350, 800)$ . The blue, green, and red distributions shown in Fig. 1 below are the empirical conditional CDFs $F(x_i | y_1)$ , $F(x_i | y_2)$ , and $F(x_i | y_3)$ , respectively, giving the probability that $x_i$ is less than or equal to a given value given that the output result of the model was $y_j$ . The vertical dotted lines are the KS statistic between each of the three pairs of CDFs. Then a statistic, such as the mean, median, or maximum of those three KS values, can be calculated to represent the overall sensitivity of $y$ to the input parameter $x_i$ . For example, the mean KS statistic is 0.5505.

Figure 1. Visualisation of RSA using a summary statistic of the KS statistic as a sensitivity index. The blue, green, and red distributions are the empirical conditional CDFs $F(x_i | y_k)$ for $k \in \{1, 2, 3\}$ , and the vertical dotted lines represent the KS statistic between each of the three pairs of CDFs.

As an alternative to using the KS statistic, we can instead apply a statistic to spread, i.e. the area between the CDFs:

$\text{stat}_{j,k} [\text{spread}(x_i)] = \text{stat}_{j,k} \left[ \int_{-\infty}^\infty \max \left(F_{x_i | y_j} (x_i | y = y_j), F_{x_i | y_k} (x_i | y = y_k)\right) dx_i - \int_{-\infty}^\infty \min \left(F_{x_i | y_j} (x_i | y = y_j), F_{x_i | y_k} (x_i | y = y_k)\right) dx_i \right]$

where $j,k \in \{1,\ldots, m\}$ . In this case, we would be considering the area between each of the three distributions shown in Fig. 1 above and then averaging them (or applying some other summary statistic) as our sensitivity index. For instance, the mean spread between CDFs is 134.09.
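The area between two empirical CDFs can be computed exactly, because both are step functions that only change at data points. The sketch below uses the fact that the integral of the maximum minus the integral of the minimum equals the integral of the absolute difference; the function name is my own.

```python
from bisect import bisect_right


def spread(sample_a, sample_b):
    """Area between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    grid = sorted(set(a + b))
    area = 0.0
    # Between consecutive data points both CDFs are constant, so the area
    # is a sum of rectangles. Outside the data range the gap is zero.
    for left, right in zip(grid, grid[1:]):
        gap = abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        area += gap * (right - left)
    return area


print(spread([1, 2, 3, 4], [3, 4, 5, 6]))  # shifted copy of the same sample
print(spread([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples
```

Note that, unlike the KS statistic, this quantity carries the units of $x_i$, so its size depends on the width of the prior as well as on how separated the CDFs are.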

Higher values of either sensitivity index for a given input parameter $x_i$ suggest that the output is more sensitive to variations in that parameter, i.e. the distributions of input values leading to each output value differ more from one another. For example, Figure 2 compares the conditional CDFs of $x_i$ with those of a different input parameter, $x_j$ , with a prior of $x_j \sim \text{Unif}(80,120)$ . We can see that the CDFs $F(x_i | y_k)$ show a high degree of separation, whereas the CDFs $F(x_j | y_k)$ do not. This is reflected in the sensitivity indices: for example, the mean KS statistic for $x_j$ is only 0.1648 and the mean spread is only 2.897. Comparing sensitivity indices in this manner makes RSA a tool well-suited for ranking, or factor prioritisation, one of the main goals of sensitivity analysis that aims to rank parameters by their contribution to variation in the output [1, 5].

Figure 2. Comparison of sensitivity of a model to two input parameters, $x_i$ and $x_j$ . The blue, green, and red distributions are the empirical conditional CDFs $F(x_i | y_k)$ and $F(x_j | y_k)$ for $k \in \{1, 2, 3\}$ .

Extensions of RSA

One notable limitation of RSA, identified since its inception [2], is its inability to handle parameter interactions. A zero value of the sensitivity index is a necessary condition for insensitivity, but it is not sufficient [2, 5]. Inputs that contribute to variation in the model output only through interactions can have the same univariate conditional CDFs, and thus RSA cannot properly identify their impact on model output. For theoretical examples, see Fig. 1 of [2] and Example 6 of Section 5.2.3 in [1]. In our toy example, we may have a decision model where the output decision is not particularly sensitive to temperature or humidity on their own, but it may be very sensitive to an interaction between these two parameters since their combined effects impact how warm or cool the weather actually feels.
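A small synthetic example shows this blind spot. The model below is hypothetical and chosen so that the output depends only on an interaction: it returns 1 when exactly one of the two inputs exceeds 0.5.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def model(x1, x2):
    """Hypothetical model driven purely by an interaction (XOR of thresholds)."""
    return int((x1 > 0.5) != (x2 > 0.5))


rng = random.Random(0)
samples = [(rng.random(), rng.random()) for _ in range(20000)]

# Univariate RSA: split the sampled x1 values by output value only.
x1_by_output = {0: [], 1: []}
for x1, x2 in samples:
    x1_by_output[model(x1, x2)].append(x1)
ks_marginal = ks_statistic(x1_by_output[0], x1_by_output[1])

# The same comparison restricted to runs where x2 is in its upper half.
x1_by_output_high_x2 = {0: [], 1: []}
for x1, x2 in samples:
    if x2 > 0.5:
        x1_by_output_high_x2[model(x1, x2)].append(x1)
ks_conditional = ks_statistic(x1_by_output_high_x2[0], x1_by_output_high_x2[1])

print(f"KS for x1, ignoring x2:  {ks_marginal:.3f}")
print(f"KS for x1, given x2>0.5: {ks_conditional:.3f}")
```

The univariate KS statistic for $x_1$ is close to zero even though the output is completely determined by $x_1$ and $x_2$ together; once we condition on $x_2$, the separation becomes total.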

In situations such as these where interactions between input variables may matter more than each variable on its own, RSA can be useful for ranking, but it cannot be used for screening, another goal of sensitivity analysis aiming to identify variables with little to no influence on output variability [1, 5]. To address this limitation, RSA can be augmented with machine learning methods such as random forests and density estimation trees [6]. Spear et al. performed a sensitivity analysis of a dengue epidemic model to demonstrate how these tree-based models can augment RSA [6].

First, the authors performed RSA in its original form, using the KS statistic to examine the difference between the univariate CDFs. Then, they used random forest analysis to classify model runs into the various output values. A measure of variable importance, such as the decrease in Gini impurity, was then used to rank the input parameters in terms of their influence on the model output [6]. Random forests allow the effects of variable interactions to be incorporated when ranking the importance of each parameter. By comparing the parameter ranking resulting from RSA with that from the random forest, they identified parameters which impacted the output through interaction. Finally, they used density estimation trees to help identify regions of parameter space corresponding to particular output values. Density estimation trees are the analogue of classification and regression trees for density estimation: instead of predicting a class or value, they attempt to estimate the probability density function that gave rise to a particular region of output space [7]. By applying density estimation trees as part of the sensitivity analysis, Spear et al. were able to examine the effects of scale on sensitivity, identifying parameters which may be relatively unimportant when ranking across the entire parameter space, but are highly influential in small subspaces.
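To give a feel for what a Gini-based importance measures, here is a minimal sketch of the impurity decrease from a single threshold split. A random forest aggregates many such decreases over many trees; this snippet is not a random forest and is not the software used in [6], and the data are made up.

```python
from collections import Counter


def gini_impurity(labels):
    """Probability that two labels drawn with replacement differ."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Gini impurity removed by splitting the runs at `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Made-up model runs: the output follows `informative` but not `uninformative`.
labels = ["a", "a", "b", "b"]
informative = [1.0, 2.0, 3.0, 4.0]
uninformative = [1.0, 3.0, 2.0, 4.0]

print(impurity_decrease(informative, labels, threshold=2.5))
print(impurity_decrease(uninformative, labels, threshold=2.5))
```

A parameter whose splits remove a lot of impurity separates the output values well, which is why summing these decreases gives a ranking of inputs.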

Further research such as this highlights the benefits of combining multiple sensitivity analysis methods in order to gain a full picture of how model inputs affect uncertainty in the output.

Conclusions

Hopefully this blog has been an informative crash course in regional sensitivity analysis! Note that the visualisations in this post have been created using the SAFEpython toolbox [8]. If you have any questions or comments, please feel free to get in touch at cecina.babichmorrow@bristol.ac.uk.

References

[1] A. Saltelli, Global sensitivity analysis: the primer. Wiley, 2008. [Online]. Available: https://onlinelibrary.wiley.com/doi/book/10.1002/9780470725184

[2] R. Spear and G. Hornberger, “Eutrophication in peel inlet—II. identification of critical uncertainties via generalized sensitivity analysis,” Water Research, vol. 14, no. 1, pp. 43–49, 1980. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0043135480900408

[3] J. Freer, K. Beven, and B. Ambroise, “Bayesian estimation of uncertainty in runoff prediction and the value of data: An application of the GLUE approach,” Water Resources Research, vol. 32, no. 7, pp. 2161–2173, 1996. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1029/95WR03723

[4] T. Wagener, D. P. Boyle, M. J. Lees, H. S. Wheater, H. V. Gupta, and S. Sorooshian, “A framework for development and application of hydrological models,” Hydrology and Earth System Sciences, vol. 5, no. 1, pp. 13–26, 2001. [Online]. Available: https://hess.copernicus.org/articles/5/13/2001/

[5] F. Pianosi, K. Beven, J. Freer, J. W. Hall, J. Rougier, D. B. Stephenson, and T. Wagener, “Sensitivity analysis of environmental models: A systematic review with practical workflow,” Environmental Modelling & Software, vol. 79, pp. 214–232, 2016. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1364815216300287

[6] R. C. Spear, Q. Cheng, and S. L. Wu, “An example of augmenting regional sensitivity analysis using machine learning software,” Water Resources Research, vol. 56, no. 4, p. e2019WR026379, 2020. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1029/2019WR026379

[7] P. Ram and A. G. Gray, “Density estimation trees,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 627–635. [Online]. Available: https://dl.acm.org/doi/10.1145/2020408.2020507

[8] F. Pianosi, F. Sarrazin, and T. Wagener, “A Matlab toolbox for global sensitivity analysis,” Environmental Modelling & Software, vol. 70, pp. 80–85, 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1364815215001188

Student perspectives: Compass Annual Conference 2024

Posted on 8th November 2024 (updated 14th November 2024) by ben.anson

A post by Compass students Ben Anson, Ollie Baker, Codie Wood and Rachel Wood.

Introduction

This October, we held our third annual Compass Conference. Unlike previous years, when the event was held in the University’s Fry Building, this time it took place at M Shed, offering scenic views of Bristol harbour. It was a great day for past and present Compass students, academics, and industry partners to come together and discuss this year’s theme: “The Future of Data Science”. With recent advances in machine learning and AI, it felt like a fitting time to learn from each other’s perspectives and to share ideas about how to move forward in this exciting space.

Panoramic view of Bristol harbour, as seen from M Shed

Student Research Talks

The morning started with four ten-minute research talks from Compass students. First was Rahil Morjaria‘s talk on “Group Testing” which explored current developments in the field, including algorithms and information-theoretic limits.

Following this, Kieran Morris presented “A Trip to Bregman Geometry and Applications”, considering advancements such as natural gradient methods, Bregman K-means clustering, and EM-projection algorithms that Bregman Geometry has enabled.

Ettore Fincato talked us through “Gradient-Free Optimisation via Integration”, focusing on a novel yet easy-to-implement algorithm for optimisation using Monte Carlo methods. Finally, Ed Milsom spoke about “Data Modalities and the Bias-Variance Decomposition”, taking us through a history of neural networks and speculating about why certain data types are so powerful, and why the future of general-purpose AI must be multi-modal.

Student Lightning Talks

The lightning talks challenged ten students to present on a topic in just three minutes. The ability to quickly convey a message in an engaging and understandable manner, to an audience with diverse backgrounds, is crucial in both academia and industry, and the students rose to the occasion.

Their talks captured the interest of the audience and inspired interesting questions that forced the students to think on their feet. Topics ranged from neural networks and large language models (LLMs), to making music using mathematics.

Compass Alumni Panel

This year’s conference panel, chaired by Compass CDT Director, Professor Nick Whiteley, offered an engaging look into the professional journeys of Compass alumni Dominic Owens, Jake Spiteri and Michael Whitehouse since completing their PhDs. With shared experiences in finance, each panelist provided unique insights into the early career landscape and the skills that helped them succeed.

Jake delved into the details of his day-to-day work in the financial sector, while Dominic discussed the challenge and dedication required to secure a role through extensive networking and job applications. Michael shared details of his transition from finance to epidemiological research. Together, they sparked valuable discussions on what the future of data science might hold for upcoming Compass graduates.

Special Guest Lecture

The conference concluded with an enlightening special guest lecture by Professor Aline Villavicencio, Director of the Institute for Data Science and Artificial Intelligence at the University of Exeter. Her talk, “Testing the Idiomatic Language Limits of Foundation Models: The Strange Case of the Idiomatic Eager Beaver in Cloud Nine,” offered a fascinating counterpoint to the current enthusiasm surrounding LLMs.

Drawing from her research in Natural Language Processing (NLP), Professor Villavicencio demonstrated how even today’s most advanced models struggle with aspects of language that humans master naturally – particularly idioms and multi-word expressions. She illustrated a persistent gap between machine and human linguistic capabilities, reminding us that the path to truly human-like language understanding remains long and complex.

She also shared her perspective on the cyclical nature of NLP research, noting how, throughout her career, there have been multiple predictions about NLP research becoming obsolete as models improve. Yet, as her work on datasets like SemEval (Semantic Evaluation) shows, there remain fundamental challenges in representing and understanding idiomatic language.

Concluding remarks

The successful day of talks, poster sessions and networking culminated with Professor Whiteley sharing his thoughts on what we learned throughout the event. He concluded that the future of our field is certain to be exciting and will encompass a huge range of different areas and ideas. This year’s conference embodied this by providing a platform for students, academics, and industry professionals to share new insights from many different sectors, and to form strong relationships to help forge a path to the future of data science.

Compass CDT student Sherman Khoo (furthest left) being announced as a winner of an Early Career Researcher Poster Award at Bayes Comp 2025, held in Singapore in June.

Compass student Cecina Babich Morrow (third from right) was a panellist at a GW4 AI and Data Science event focused on climate and health, held in Bristol in June.

Compass at ICML 2025

July 2025 – Vancouver, Canada

‘Paired question model comparison setting’ – from Sam Bowyer’s ICML Spotlight Position Paper.

Compass at UAI 2025

July 2025 – Rio de Janeiro, Brazil

A simulated results figure from Emerald Dilworth’s paper, co-authored with Ed Davis.

Compass at GOFCP 2025

(Goodness-of-fit, Change-point and Related Problems)

August 2025 – Charles University, Prague, Czech Republic

Compass contributions to other events:

Exact entropy curves from Ollie Baker’s paper, presented at the Random Networks Workshop.

Compass at AI UK

Compass CDT students and staff at the AI UK 2025 Conference. From left to right: Dr Dan Lawson, Emma Ceccherini, Sam Bowyer, Sherman Khoo and Helen Mawdlsey
