Student Perspectives: Verification Bias in Diagnostic Test Accuracy Studies with Conditional Reference Standards

Posted on 27th June 2025 by ia23879

A post by Vera Hudak, PhD student on the Compass programme.

Introduction

To evaluate the accuracy of a new diagnostic test (the ‘index test’), the ideal approach is to compare it against an error-free reference standard, known as the gold standard.
However, gold standards may be unavailable, invasive, or costly. In such situations, a possible approach is to condition gold standard testing on the outcome of some initial imperfect reference standard test(s).

We focus on a conditional testing design which we refer to as ‘check the negatives’. Here, all participants receive the index test (Test A) and an imperfect reference standard (Test B), then those testing negative on Test B are followed up with the gold standard (GS). The diagnostic accuracy of Test A is assessed against observed disease status, derived from the test sequence combining Test B and the GS. Figure 1 illustrates this design.

Figure 1: Test sequence for ‘check the negatives’.

Now, if Test B were 100% specific, the ‘check the negatives’ design would lead to unbiased estimates of the sensitivity and specificity of Test A, given by:

$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \quad \text{and} \quad \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}.$

However, as Test B is an imperfect test, this is unlikely to be the case, and bias can be anticipated. We quantified this bias and proposed a Bayesian adjustment method.

Probabilities in ‘Check the Negatives’ Studies

Let $A_{se}$ and $A_{sp}$ denote the true sensitivity and specificity of Test A, $B_{sp}$ the true specificity of Test B, and $\pi$ the prevalence of the condition in the study population. We also introduce $c_0$ as the covariance of errors between Test A and Test B in the disease-free population.

In a ‘check the negatives’ design, we do not observe the complete set of outcomes from the index test, the imperfect reference standard, and the gold standard. Instead, we only observe a reduced set of results: each participant’s outcome on Test A and their observed disease status, as determined by the conditional diagnostic test sequence combining Test B and the GS. Table 1 shows the probabilities associated with these observed outcomes under conditional dependence between Test A and Test B [1]. The corresponding probabilities under conditional independence can be obtained by setting $c_0 = 0$.

Table 1: Probabilities observed in a ‘check the negatives’ study.

Quantifying Bias

We used the probabilities from Table 1 to find closed-form expressions for the naive estimates of the sensitivity ($\widehat{A_{se}}$) and specificity ($\widehat{A_{sp}}$) of Test A, and hence, the bias. The bias in the naive estimate of specificity is as follows:

$\text{Bias}(\widehat{A_{sp}}) = \frac{c_0}{B_{sp}}.$

Under conditional independence ( $c_0 = 0$ ), the naive specificity estimate is unbiased. If, however, Tests A and B are conditionally dependent among the disease-free population, then there is a bias which depends on $B_{sp}$ and $c_0$ .

Similarly, the bias in the naive estimate of the sensitivity of Test A can be expressed as:

$\text{Bias}(\widehat{A_{se}}) = \frac{(1-\pi)((1-A_{sp}) (1-B_{sp}) + c_0 - A_{se}(1-B_{sp}))}{(1-\pi)(1-B_{sp}) + \pi}.$

Assuming independence in the disease-free population ($c_0 = 0$), Figure 2 shows $\text{Bias}(\widehat{A_{se}})$ as a function of $B_{sp}$, for selected values of $\pi$, $A_{se}$ and $A_{sp}$.

Figure 2: Bias in the naive estimate of the sensitivity of Test A against the specificity of Test B for different values of Test A sensitivity, specificity and disease prevalence.

As expected from the study design, the bias tends to 0 as $B_{sp}$ tends to 1. The magnitude of the bias increases as the accuracy of Test A improves, i.e. when $A_{se}$ or $A_{sp}$ is larger, and as prevalence decreases. When $\pi = 0.9$, the bias is almost negligible. However, bias can be substantial in some scenarios, specifically under low prevalence, even when Test B has high specificity (e.g. over 95%).
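To make the two bias formulas concrete, here is a minimal Python sketch that evaluates them directly. The function names and the parameter values in the loop are illustrative assumptions, not values taken from the study.

```python
def naive_specificity_bias(b_sp, c0=0.0):
    """Bias of the naive specificity estimate of Test A: c0 / B_sp."""
    return c0 / b_sp


def naive_sensitivity_bias(pi, a_se, a_sp, b_sp, c0=0.0):
    """Bias of the naive sensitivity estimate of Test A (formula above)."""
    numerator = (1 - pi) * ((1 - a_sp) * (1 - b_sp) + c0 - a_se * (1 - b_sp))
    denominator = (1 - pi) * (1 - b_sp) + pi
    return numerator / denominator


# Made-up scenario: an accurate Test A and a Test B with 95% specificity,
# under conditional independence (c0 = 0), at three prevalences.
for prevalence in (0.1, 0.5, 0.9):
    bias = naive_sensitivity_bias(prevalence, a_se=0.9, a_sp=0.9, b_sp=0.95)
    print(f"prevalence={prevalence:.1f}  bias={bias:+.3f}")
```

In this made-up scenario the naive sensitivity is underestimated by roughly 0.25 at a prevalence of 0.1, but by less than 0.005 at a prevalence of 0.9, in line with the pattern described above.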

Bias Adjustment

We proposed a Bayesian model with an informative prior on the specificity of Test B to adjust for the bias in the naive estimate of the sensitivity of Test A, under conditional independence ($c_0 = 0$).

We let $\boldsymbol{x} = (TP, FN, FP, TN)$ be the data reported by a ‘check the negatives’ study evaluating Test A. Then $\boldsymbol{x} \sim \text{Multinomial}(\boldsymbol{p}, n)$, where $n$ is the number of participants in the study, and the probabilities $\boldsymbol{p} = (p_1, p_2, p_3, p_4)$ are as specified in Table 1, with $c_0 = 0$.

Suppose we have prior information about the specificity of Test B represented with a Beta prior distribution. Then a bias-adjusted estimate of $A_{se}$ could be obtained by fitting this multinomial distribution to the data with vague $\text{Beta}(1,1)$ priors for the remaining three parameters, $A_{se}$ , $A_{sp}$ , and $\pi$ .
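The model can be sketched as an unnormalised log-posterior in plain Python. Two things here are assumptions rather than the authors' code: the cell probabilities are reconstructed from the design description (Table 1 is not reproduced in this post), chosen so that they agree with the bias formulas above, and all function names are hypothetical. Any generic sampler or optimiser could then be applied to `log_posterior`.

```python
import math


def cell_probabilities(pi, a_se, a_sp, b_sp, c0=0.0):
    """Probabilities of (TP, FN, FP, TN) against observed disease status.

    Assumed reconstruction: a participant is 'observed diseased' if truly
    diseased, or if disease-free but positive on Test B.
    """
    p_tp = pi * a_se + (1 - pi) * ((1 - a_sp) * (1 - b_sp) + c0)
    p_fn = pi * (1 - a_se) + (1 - pi) * (a_sp * (1 - b_sp) - c0)
    p_fp = (1 - pi) * ((1 - a_sp) * b_sp - c0)
    p_tn = (1 - pi) * (a_sp * b_sp + c0)
    return p_tp, p_fn, p_fp, p_tn


def log_beta_pdf(x, a, b):
    """Log-density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def log_posterior(params, counts, prior_a, prior_b):
    """Unnormalised log-posterior for (A_se, A_sp, pi, B_sp) with c0 = 0."""
    a_se, a_sp, pi, b_sp = params
    if not all(0.0 < value < 1.0 for value in params):
        return -math.inf
    probs = cell_probabilities(pi, a_se, a_sp, b_sp)
    log_lik = sum(x * math.log(p) for x, p in zip(counts, probs))
    # Beta(1, 1) priors on A_se, A_sp and pi only add a constant (zero).
    return log_lik + log_beta_pdf(b_sp, prior_a, prior_b)
```

For example, `log_posterior((0.9, 0.9, 0.1, 0.95), (10, 5, 3, 82), 95, 5)` evaluates the log-posterior for made-up counts under a made-up Beta(95, 5) prior on $B_{sp}$.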

Simulation Study

We assessed this adjustment method through a simulation study under two scenarios: where the informative prior is correctly centred on the true specificity, and where it underestimates the truth by 5%, to examine the impact of moderate prior misspecification. Prior precision ($sd$) is also varied to assess the impact of increasing uncertainty. In Figure 3, we present some results from this simulation study for the correctly centred prior case. These results show that a correctly centred prior consistently eliminates bias under high precision and reduces it under lower precision.

Although not shown here, we found that overly pessimistic priors can over-correct, increasing absolute bias, especially when initial bias is small. This risk is mitigated when the informative prior is less precise. We are currently writing up this simulation study as a paper to be submitted for publication soon.

Future Work

Further work could be done to explore adjustment under conditional dependence between tests, or situations in which the third test in the sequence, here the GS, is imperfect.

References

[1] Pamela M. Vacek. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics, 41:959, 12 1985.

Protected: Student Perspectives: Careers in Data Science

Posted on 23rd June 2025 (updated 27th June 2025) by shaun.jordan

Student Perspectives: The Distribution of High-Dimensional Random Geometric Graphs

Posted on 7th May 2025 by o.baker

A post by Ollie Baker, PhD student on the Compass programme.

Introduction

A random geometric graph is a model of a network with a geometric embedding, or a latent space embedding. They were originally introduced by Gilbert in 1961, who called them ‘random plane networks’ [3]. If we are studying a high-dimensional dataset, then we may be interested in modelling it using some form of network structure which contains information about similarity of data points in the latent space. This can be represented as points in a high-dimensional space connected based on their closeness according to some distance metric. What can we say about the distribution of these random graph models when the dimension of the space gets large? The answer to this question has applications in clustering, and graph compression once we consider the information theory of these network ensembles.

Figure 1: An example of a 500 node Random Geometric Graph on the unit square with connection range 0.1.

Let $(\Omega, \rho)$ be a metric space, where $\Omega \subset \mathbb{R}^d$ is compact. In this post, we will only consider $\Omega = [0,1]^d$ with $\rho = \|\cdot\|$ the Euclidean metric, and $\Omega = [0,1]^d$ with $\rho = \rho_t$, where $\rho_t(x,y) = \left(\sum_{i=1}^d\min(|x_i-y_i|, 1-|x_i-y_i|)^2 \right)^{\frac{1}{2}}$. This second metric space is known as the unit $d$-torus, and is denoted $\mathbb{T}^d$. To form a Random Geometric Graph (RGG) $G$, we distribute $n$ points at random according to a probability density $\nu$ in $\Omega$ to form the vertex set, and connect nodes $i$, $j$ if their mutual distance is at most $r_0$. That is, if $X_1,\dots,X_n$ are our points in $\Omega$, $i\sim j$ in $G$ if and only if $\rho(X_i,X_j) \leq r_0$. Figure 1 shows an example realisation of such a graph. We are interested in densities of the form:

\begin{equation}
\label{general_distribution}
\nu(\{x_i\}_{i=1}^d) = \prod_{i=1}^d \pi(x_i)
\end{equation}
That is, the coordinates of a point $x$ are i.i.d.
We are interested in computing the probability of a graph $G$ with adjacency matrix $A=\{a_{ij}\}_{i,j=1}^n$:
\begin{equation}
\mathbb{P}(G) = \int_{[0,D]^{\binom{n}{2}}} f_{\vec{R}}(\vec{r})\prod_{i<j} \left(a_{ij}\mathbb{I}(r_{ij}<r_0)+(1-a_{ij})\mathbb{I}(r_{ij}>r_0)\right)d\vec{r}
\end{equation}
where $D$ is the diameter of (largest possible distance between two points in) $\Omega$, $f_{\vec{R}}$ is the joint density of pair distances in $\Omega$ (which is well defined if $d>n+2$), and $d\vec{r}=\prod_{i<j}dr_{ij}$. Clearly, this is intractable for most choices of $\Omega$ and $n$. However, we can simplify things if we take the limit $d\rightarrow\infty$. An ensemble $\mathcal{G}$ of RGGs is the set of all possible RGGs that can be constructed with a fixed $r_0, n, \Omega, \nu$ and $\rho$. A sequence of ensembles $\{\mathcal{G}_d\}$ (with all parameters except $n$ dependent on $d$) equipped with probability measures $\mathbb{P}_d$ converges in distribution to another ensemble $\mathcal{G}$ as $d\rightarrow\infty$ if
\begin{equation}
\mathbb{P}_d(g) \rightarrow \mathbb{P}(g)
\end{equation}
for all graphs $g$ with $n$ nodes.
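The construction of an RGG described in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the default node distribution (i.i.d. uniform coordinates) is just one choice of the marginal $\pi$.

```python
import math
import random


def euclidean_distance(x, y):
    """Euclidean distance between two points given as coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def torus_distance(x, y):
    """Distance rho_t on the unit d-torus: wrap each coordinate difference."""
    return math.sqrt(sum(min(abs(a - b), 1 - abs(a - b)) ** 2
                         for a, b in zip(x, y)))


def random_geometric_graph(n, d, r0, distance, sample_coordinate=None, seed=None):
    """Return (points, edges), with i ~ j whenever distance(X_i, X_j) <= r0."""
    rng = random.Random(seed)
    if sample_coordinate is None:
        # Uniform marginal on [0, 1]; pass another sampler for a different pi.
        sample_coordinate = lambda generator: generator.random()
    # Coordinates are i.i.d., matching the product-form density nu.
    points = [[sample_coordinate(rng) for _ in range(d)] for _ in range(n)]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if distance(points[i], points[j]) <= r0]
    return points, edges


# Same setting as Figure 1: 500 nodes on the unit square, connection range 0.1.
points, edges = random_geometric_graph(500, 2, 0.1, euclidean_distance, seed=1)
print(len(edges), "edges")
```

Passing `torus_distance` instead of `euclidean_distance` gives the corresponding graph on $\mathbb{T}^d$. With the same points, every Euclidean edge is also a torus edge, since wrapping can only shorten distances.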

Central Limit Theorem for High-Dimensional Distance

We can use a central limit theorem (CLT) to prove that the vector of pair distances in $\Omega$ converges in distribution to a multivariate Gaussian as $d\rightarrow\infty$, which is a generalisation of what is done in [2].

Theorem 1:

Let $(\Omega, \rho)$ be a metric space, and $\nu$ be a node density as described above, and $X_1,\dots,X_n$ be random vectors distributed according to $\nu$. Define $R_{ij}^{(k)} = \rho(X_i^{(k)}, X_j^{(k)})^2$, the squared distance between $X_i$ and $X_j$ in coordinate $k$, and $\mu := \mathbb{E}[R_{12}^{(1)}]$. Let
\begin{equation}
q_{ij} := \frac{1}{\sqrt{d}}\sum_{k=1}^d (R_{ij}^{(k)}-\mu)
\end{equation}
Then, as $d\rightarrow\infty$ the vector $\vec{q} = \{q_{ij}\}_{1\leq i<j\leq n} \in \mathbb{R}^{\binom{n}{2}}$ satisfies
\begin{equation}
\vec{q} \rightarrow Z \sim N(0_{\binom{n}{2}},\Sigma)
\end{equation}
where $\Sigma$ is the covariance matrix indexed by multi-indexes $(i,j)$ given by $\Sigma_{(i,j),(i,j)} = \alpha := \mathbb{E}[(\rho(X_i^{(1)},X_j^{(1)})^2-\mu)^2]$, and $\Sigma_{(i,j),(j,k)} = \beta := \mathbb{E}[(\rho(X_i^{(1)},X_j^{(1)})^2-\mu)(\rho(X_j^{(1)},X_k^{(1)})^2-\mu)]$ for pairs sharing exactly one node, and $\Sigma_{(i,j),(k,l)} = 0$ when $i, j, k, l$ are all distinct.

Essentially, this means that when $d\rightarrow\infty$, we can replace the intractable joint density $f_{\vec{R}}$ with a multivariate Gaussian density, which will make our calculations much easier.
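The quantities $\mu$, $\alpha$ and $\beta$ in Theorem 1 can be estimated by simulation for a single coordinate. The sketch below is an illustration rather than part of the original analysis: the function names are hypothetical and only uniform marginals are tried. For uniform coordinates, a direct hand calculation gives $\beta = 1/180$ in the cube and $\beta = 0$ in the torus, so the estimates can be checked against those values.

```python
import random


def estimate_clt_parameters(squared_distance, sample_coordinate,
                            n_samples=100_000, seed=0):
    """Monte Carlo estimates of (mu, alpha, beta) for one coordinate."""
    rng = random.Random(seed)
    r12, r23 = [], []
    for _ in range(n_samples):
        x1, x2, x3 = (sample_coordinate(rng) for _ in range(3))
        r12.append(squared_distance(x1, x2))  # pair (1, 2)
        r23.append(squared_distance(x2, x3))  # pair (2, 3), sharing node 2
    mu = (sum(r12) + sum(r23)) / (2 * n_samples)
    alpha = sum((r - mu) ** 2 for r in r12) / n_samples
    beta = sum((a - mu) * (b - mu) for a, b in zip(r12, r23)) / n_samples
    return mu, alpha, beta


def cube_squared_distance(x, y):
    return (x - y) ** 2


def torus_squared_distance(x, y):
    return min(abs(x - y), 1 - abs(x - y)) ** 2


def uniform(rng):
    return rng.random()


cube = estimate_clt_parameters(cube_squared_distance, uniform)
torus = estimate_clt_parameters(torus_squared_distance, uniform)
print("cube  (mu, alpha, beta):", cube)
print("torus (mu, alpha, beta):", torus)
```

A non-zero $\beta$ means that edges sharing a node remain correlated in the limit, which is the property the later sections rely on.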

Erdős-Rényi Random Graphs

The Erdős-Rényi (ER) random graph, also denoted $G(n,p)$, is arguably the simplest model of a random graph. We take $n$ nodes, and connect each pair independently with fixed probability $p$. The probability of an ER graph $G$ is given by the binomial probability
\begin{equation}
\mathbb{P}(G) = p^{k}(1-p)^{\binom{n}{2}-k}
\end{equation}
where $k = \sum_{i<j} a_{ij}$ is the number of edges in $G$. Clearly, this model is a non-spatial network model, which is not good for modelling networks with a latent space structure! Therefore, we would like to guarantee that our model does not converge to $G(n,p)$ as $d\rightarrow\infty$, otherwise we would lose information about the spatial or latent space correlation of our data.

When Does the RGG Converge to $G(n,p)$?

For the main results, we will provide a condition on the distribution of nodes in $\Omega$ for when we see convergence in distribution to $G(n,p)$. For a RGG with connection range $r_0$ in the metric space $(\Omega, \rho)$ and $\mu = \mathbb{E}[\rho(X_i^{(1)},X_j^{(1)})^2]$, define the normalised connection range $t = \frac{r_0^2}{\sqrt{d}} - \mu \sqrt{d}$. Recall that the probability of a graph is given by
\begin{equation}
\mathbb{P}(G) = \int_{[0,D]^{\binom{n}{2}}} f_{\vec{R}}(\vec{r})\prod_{i<j} \left(a_{ij}\mathbb{I}(r_{ij}<r_0)+(1-a_{ij})\mathbb{I}(r_{ij}>r_0)\right)d\vec{r}
\end{equation}
If the random distances converge to a Gaussian, then we have (after some algebra)
\begin{equation}
\mathbb{P}(G) \rightarrow \int_{\mathcal{A}} N(0,\Sigma)(\vec{q})d\vec{q}
\end{equation}
where $\mathcal{A} = \bigotimes_{i<j} A_{ij}$, with $\bigotimes$ denoting the Cartesian product of sets, and $A_{ij}$ is the set:
\begin{equation}
A_{ij} := \begin{cases}
(-\infty, t] & a_{ij} =1 \\
(t,\infty) & a_{ij} = 0
\end{cases}
\end{equation}
If $\Sigma$ is diagonal, then the integral above splits up into its marginals, and
\begin{equation}
\mathbb{P}(G) \rightarrow \prod_{i<j} \bar{p}(t)^{a_{ij}}(1-\bar{p}(t))^{1-a_{ij}} = \bar{p}(t)^k(1-\bar{p}(t))^{\binom{n}{2}-k}
\end{equation}
with $\bar{p}(t) := \int_{-\infty}^t N(0, \alpha)(q)dq$ with $\alpha$ being the diagonal elements of $\Sigma$. Note this is the exact probability of a graph in $G(n, \bar{p}(t))$. If there are non-zero off-diagonal elements, then there are correlations between edges, and we do not have convergence to $G(n,p)$.
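When $\Sigma$ is diagonal, the limiting graph probability only needs a normal CDF. The sketch below evaluates $\bar{p}(t)$ and the resulting $G(n, \bar{p}(t))$ probability in Python; the function names are hypothetical, and the values $\mu = 1/12$, $\alpha = 1/180$ used in the example are hand-computed for uniform nodes on the torus, so treat them as assumptions.

```python
import math
from itertools import combinations


def limiting_edge_probability(r0, d, mu, alpha):
    """p_bar(t) = P(N(0, alpha) <= t), t = r0^2 / sqrt(d) - mu * sqrt(d)."""
    t = r0 ** 2 / math.sqrt(d) - mu * math.sqrt(d)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0 * alpha)))


def er_graph_probability(edges, n, p):
    """Probability of one labelled graph on n nodes under G(n, p)."""
    k = len(edges)
    return p ** k * (1 - p) ** (math.comb(n, 2) - k)


# Example: uniform nodes on the torus with d = 48 and r0 = 2.1.
p_bar = limiting_edge_probability(r0=2.1, d=48, mu=1 / 12, alpha=1 / 180)
print("limiting edge probability:", p_bar)
print("P(triangle on 3 nodes):",
      er_graph_probability([(0, 1), (0, 2), (1, 2)], 3, p_bar))

# Sanity check: the probabilities of all labelled graphs on 3 nodes sum to one.
pairs = list(combinations(range(3), 2))
total = sum(er_graph_probability(subset, 3, p_bar)
            for k in range(len(pairs) + 1)
            for subset in combinations(pairs, k))
print("total probability:", total)
```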

The RGG in $[0,1]^d$

Suppose now that $(\Omega, \rho) = ([0,1]^d, \|\cdot\|)$. We will need that our node distribution $\nu$ is of the form we described earlier, where the marginals $\pi$ have a kurtosis greater than 1. It can be shown that the only distributions with unit kurtosis are the Bernoulli distribution with parameter $1/2$, and constant distributions.

Theorem 2 [1]

Suppose $\mathcal{G}$ is an ensemble of RGGs in $[0,1]^d$ with nodes distributed according to $\nu$, then, provided the kurtosis of the marginals $\pi$ is greater than 1, $\mathcal{G}$ does not converge to the ER ensemble as $d\rightarrow\infty$.

Sketch Proof:

The proof is direct. We set $\beta = 0$, which for the Euclidean distance metric means (after some rearranging),
\begin{equation}
\mathbb{E}[(X_i-\mu)^4] - \mathbb{E}[(X_i-\mu)^2]^2 = 0
\end{equation}
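or equivalently, the kurtosis of $X_i$ is 1 (in this equation, $\mu$ denotes the mean of the coordinate $X_i$ rather than the mean squared distance).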

This means, for any ‘sufficiently nice’ distribution of nodes, we do not converge to $G(n,p)$, and maintain spatial properties which can be exploited in data analysis.

The RGG in $\mathbb{T}^d$

In the torus, the following theorem shows that if the distribution of nodes is uniform, then we will in fact see convergence to $G(n,p)$. However, for any other distribution, we maintain the spatial correlation.

Theorem 3 [1]

Let $\mathcal{G}$ be an ensemble of RGGs on $\mathbb{T}^d$ with nodes distributed according to $\nu$. Then as $d\rightarrow\infty$, $\mathcal{G}$ converges in distribution to $G(n,p)$ if and only if $\nu$ is the uniform distribution.

Sketch Proof

The tactic for the proof is the same as for the cube. We will find a condition for which $\beta = 0$. In the torus, we have
\begin{equation}
\beta = 0 \iff \mathbb{E}_X[(\mathbb{E}_Y[\rho_t(X,Y)^2])^2] = \mathbb{E}_X[\mathbb{E}_Y[\rho_t(X,Y)^2]]^2
\end{equation}

from which we can deduce that $\mathbb{E}_Y[\rho_t(x,Y)^2]$ is constant $\pi$-almost-everywhere. This implies
\begin{equation}
\int_0^1 \rho_t(x,y)^2\pi(y)dy = \mu
\end{equation}
for $\pi$-almost every $x$. We can rewrite the left hand side above as the convolution of two periodic functions, and therefore taking a Fourier transform of both sides simplifies the problem. By equating Fourier modes, we find that the Fourier transform $\hat{\pi}$ of $\pi$ must be zero everywhere except when evaluated at $0$. This means that the original function $\pi$ must be constant on $[0,1]$.

So, if we are using a toroidal distance metric, then if our data is uniformly distributed, we will lose the latent space embedding as $d\rightarrow\infty$.
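The key step of the sketch proof, that $\int_0^1 \rho_t(x,y)^2\pi(y)dy$ does not depend on $x$ when $\pi$ is uniform, can be checked numerically in one dimension. The Python sketch below uses a simple midpoint rule; the function names and the linear density $\pi(y) = 2y$ are hypothetical choices made purely for illustration.

```python
def mean_squared_torus_distance(x, density, n_grid=20_000):
    """Midpoint-rule approximation of the integral over [0, 1] of
    rho_t(x, y)^2 * density(y) dy, for a single coordinate."""
    total = 0.0
    for k in range(n_grid):
        y = (k + 0.5) / n_grid
        gap = min(abs(x - y), 1 - abs(x - y))  # one-dimensional torus distance
        total += gap ** 2 * density(y)
    return total / n_grid


def uniform_density(y):
    return 1.0


def linear_density(y):
    # Hypothetical non-uniform example, pi(y) = 2y on [0, 1].
    return 2.0 * y


for x in (0.25, 0.75):
    print(x,
          mean_squared_torus_distance(x, uniform_density),
          mean_squared_torus_distance(x, linear_density))
```

For the uniform density both values of $x$ give approximately $1/12$, whereas for the linear density the two values differ (roughly 0.115 and 0.052), so the integral is not constant in $x$ and the condition $\beta = 0$ fails.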

Example

Here we plot the distribution of edge counts in the high-dimension limit for RGGs with uniformly (figure 2) and Gaussian distributed (figure 3) nodes to illustrate the difference that changing the node distribution can make.

Figure 2: Comparison of the theoretical distribution as $d\rightarrow\infty$ of edge counts for RGGs in $[0,1]^d$ and $\mathbb{T}^d$ with uniform nodes. Top: $\mathbb{P}(\text{# of edges } = k)/\binom{n}{k}$ for RGGs in the cube for $n=3$ and $n=7$. The torus would have a uniform distribution and is therefore omitted. Bottom: $\mathbb{P}(\text{# of edges }=k)$ for RGGs with $n=7$ in $[0,1]^d$ and $\mathbb{T}^d$

Figure 3: Comparison of the theoretical distribution as $d\rightarrow\infty$ of edge counts for RGGs in $[0,1]^d$ and $\mathbb{T}^d$ with Gaussian distributed nodes. Top: $\mathbb{P}(\text{# of edges } = k)/\binom{n}{k}$ for RGGs in $[0,1]^d$ and $\mathbb{T}^d$ for $n=3$ and $n=7$. Bottom: $\mathbb{P}(\text{# of edges }=k)$ for RGGs with $n=7$ in $[0,1]^d$ and $\mathbb{T}^d$

Conclusion

In this blog post, we have defined the random geometric graph (RGG) with general node distributions, and proved that most of the time, the spatial correlations between edges are preserved as the dimension of the underlying geometry tends to $\infty$. In the $d$-cube, geometry is preserved as long as the distribution of our nodes is neither constant nor Bernoulli with parameter $1/2$, and in the $d$-torus, geometry is preserved provided our distribution is not uniform. The result for the torus is especially interesting, since it challenges ideas in the literature about how we should model high dimensional RGGs. In real-world data, the coordinates are unlikely to be uniformly distributed, yet the majority of theoretical high-dimensional random geometric graph studies use uniform distributions on the torus. The issue is that this is the only case where the torus behaves like a $G(n,p)$ graph. For a more in-depth explanation of this work, and extensions to the concepts of graph entropy, see our recent preprint [1].

References

[1] O. Baker and C.P. Dettmann “Entropy of Random Geometric Graphs in High and Low Dimensions”. arXiv preprint arXiv:2503.11418 (2025)

[2] V. Erba et al. “Random geometric graphs in high dimension”. Phys. Rev. E 102.1 (2020), 012306.

[3] E. N. Gilbert. “Random Plane Networks”. Journal of the Society for Industrial and Applied Mathematics 9.4 (1961), 533–543.

Student perspectives: AI UK 2025 Conference

Posted on 25th April 2025 by shaun.jordan

A post by Sam Bowyer, PhD student on the Compass programme.

Compass at AI UK

The Alan Turing Institute’s AI UK 2025 Conference was held last month in the QEII Centre, Westminster, and three Compass students – Emma Ceccherini, Sherman Khoo, and myself – were present for both days of the event. We attended a variety of sessions and spent time exploring the exhibition stalls, which showcased a wide range of AI projects from within academia, government and industry.

Compass CDT students and staff at the AI UK 2025 Conference. From left to right:
Dr Dan Lawson, Emma Ceccherini, Sam Bowyer, Sherman Khoo and Helen Mawdlsey

It was an eye-opening experience to learn about the work that The Alan Turing Institute does, and especially insightful to see the myriad downstream applications of the machine learning theory that we spend so much time thinking about.

Conference highlights

One particular favourite exhibition was that of the Ministry of Justice (MoJ). Emma and I talked to a data scientist at the MoJ who was working on a tool that uses LLMs to explain laws in plain English, in order to help regular people better understand their rights.

Another project involved aggregating various disconnected datasets from across government on the national and local level in order to research social factors that might lead to successful post-prison rehabilitation or equally to recidivism.

It was encouraging to see a variety of projects and organisations at the conference aiming to use AI for social and public good, with a significant amount focussed on the climate and green-tech.

Whilst Compass wasn’t presenting at AI UK, colleagues from the Informed AI Hub, the Interactive AI CDT, the AI For Collective Intelligence (AI4CI) Hub, and the Jean Golding Institute were.

It was great to not only see the other projects that are going on in the University, but also to be able to network with colleagues who only work down the road from the Fry Building (e.g. sharing Bristol restaurant recommendations)!

On the first day of the conference, Professor Charlotte Deane, Executive Chair of the Engineering and Physical Sciences Research Council (EPSRC), gave an informative keynote talk on the state of scientific research in UK academia. It was surprising to learn about the overall size of EPSRC and the range of activities they engage in, particularly their keenness for investing in spin-outs. I found Professor Deane’s talk to be very encouraging and optimistic.

The second day of the conference focused on governmental uses of AI, particularly in medicine and in defence. Professor the Lord Darzi, who recently led the Independent Investigation of the NHS in England, gave an incredibly thoughtful talk on the opportunities for AI within the NHS.

He likened the current AI boom to the development of keyhole surgery in the second half of the 20th century, urging fast, nationwide deployment in order to improve health outcomes and equality throughout the UK.

Three talks on defence and national security similarly stressed the importance of fast uptake of AI tools and made clear the desire for public-private partnerships (including with academia) in order to make this happen. (The importance of cross-sector collaboration was consistently a strong theme at AI UK, although the absence of frontier AI labs did, in my opinion, betray a slight limit to this stated commitment.)

Presentation karaoke

It wasn’t all so serious, however! The conference finished its first day with “Presentation Karaoke”, in which eight contestants competed to present unseen 5-minute long, 10-slide PowerPoints, each more bizarre than the last.

This fun, often slightly cringe-inducing, activity is now rumoured to be deployed at a future Compass student event. (Get practising your stand-up now…)

In summary, AIUK was a great opportunity to see how AI/ML research leads to real-world impact in the UK, and I would recommend attending to any CDT student in the future.

Student perspectives: Expectation Propagation

Posted on 27th February 2025 by Grace Yan

A post by Grace Yan, PhD student on the Compass programme.

Introduction

In many real-world problems, computing the exact posterior distribution is infeasible due to non-conjugate priors and high-dimensional datasets. Thus, approximate Bayesian inference methods are used instead to obtain an approximate posterior. Some well-known examples of these methods include Variational Bayes (VB), Laplace approximation and Expectation Propagation (EP). In this blog post, I will focus on Expectation Propagation and explain: what it is, how it works, its strengths and limitations, and its relation to similar methods.

Figure 1: A comparison of approximate Bayesian inference methods along a spectrum of computational speed and accuracy. Methods like Variational Bayes (VB) and the Laplace approximation are faster but less accurate, while approaches like Expectation Propagation (EP) and Markov Chain Monte Carlo (MCMC) are slower but provide higher accuracy. Source: [1].

Background

EP was introduced by Minka in 2001 as an extension of assumed-density filtering (ADF), which is a one-pass sequential algorithm for obtaining an approximate posterior [2]. Like VB methods, its aim is to approximate an intractable posterior with tractable distributions by minimising the Kullback-Leibler (KL) divergence. Recall that the KL divergence measures how different two distributions $p$ and $q$ are; often, $p$ is the true distribution and $q$ is a model distribution that we use to approximate $p$. There are two kinds of KL divergence: the forward KL and the reverse KL. Assuming $x$ is continuous, these are defined as
\[ \mathrm{KL}(p(x) \| q(x)) = \int p(x) \mathrm{log}\frac{p(x)}{q(x)} dx \]
and
\[ \mathrm{KL}(q(x) \| p(x)) = \int q(x) \mathrm{log}\frac{q(x)}{p(x)} dx \]
respectively (in the discrete case, the integrals are replaced by sums). These two types are not equivalent; [3] gives a good explanation of how they differ. EP uses the forward KL.
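To see that the two directions really differ, here is a small Python sketch for the discrete case, where the integrals become sums. The two distributions are made-up examples, and the function assumes $q$ is positive wherever $p$ is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i = 0 contribute zero; q_i must be positive where p_i is.
    """
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


p = [0.5, 0.4, 0.1]  # made-up 'true' distribution
q = [0.3, 0.3, 0.4]  # made-up approximating distribution

forward = kl_divergence(p, q)  # KL(p || q), the direction EP uses
reverse = kl_divergence(q, p)  # KL(q || p), the reverse direction
print("forward KL:", forward)
print("reverse KL:", reverse)
```

Both values are non-negative, and they are zero only when the two distributions coincide, but swapping the arguments generally changes the value.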

Expectation Propagation

Let $\mathbf{x}$ denote the observed data and $\boldsymbol{\theta}$ the parameters of interest. Recall that by Bayes’ theorem, the posterior is
\[ p(\boldsymbol{\theta}|\mathbf{x}) = \frac{p(\mathbf{x}, \boldsymbol{\theta})}{p(\mathbf{x})}, \]
where $p(\mathbf{x})$ is the model evidence. We can write the joint distribution $p(\mathbf{x}, \boldsymbol{\theta})$ in the form of a product of factors $f_i$, which are also called ‘sites’:
\[p(\mathbf{x}, \boldsymbol{\theta}) = p(\boldsymbol{\theta})p(\mathbf{x}|\boldsymbol{\theta}) = p(\boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i|\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}),\] where $p(\boldsymbol{\theta})$ is the prior and the factors $f_1, \dots, f_n$ are the likelihood partitioned into $n$ parts, assuming the data are i.i.d. (e.g. each $i$ could index a data point).

For $f$, we drop the conditioning on $\mathbf{x}$ to simplify the notation. I use $f_j(\boldsymbol{\theta})$ to refer to one specific factor and $f_i(\boldsymbol{\theta})$ as factors in the plural sense. My notation is similar to the notation in [4].

The idea is to approximate the posterior by approximating the factors with $\tilde{f}_i$, which are often assumed to be Gaussian (or some other member of the exponential family). These approximations are refined one at a time in an iterative process until convergence. In EP, refining a factor $\tilde{f}_j$ is a “team effort”; it requires information from each of the other factors. This concept is known as message passing, a term borrowed from computer science, where it describes messages being passed between different programs.

Figure 2: Illustration of message passing between three factors. The arrows show the exchange of information between the factors: each $f_j$ sends out its information to the other two factors and also receives information from them.

Using the approximations of the likelihood factors, the resulting approximate posterior is given by
\[q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).\]
In Gaussian EP, the prior $p(\boldsymbol{\theta})$ is also taken to be Gaussian. Since a product of Gaussian densities is proportional to another Gaussian density, $q$ is itself a Gaussian distribution. Therefore, we do not face the issue of finding the normalising constant for an unnormalised posterior.

To make the approximations as accurate as we can, we need a kind of measurement. Naturally, the global KL divergence comes to mind, so we might want to consider minimising the following:

\[
\mathrm{KL}(p \| q) = \mathrm{KL} \left( \frac{1}{p(\mathbf{x})} p(\boldsymbol{\theta})\prod_{i=1}^n f_i(\boldsymbol{\theta}) \bigg\| p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}) \right).
\]

However, the global KL divergence is difficult to optimise. Instead, EP minimises the KL divergence locally to update each factor one at a time, using a distribution called the tilted distribution. When updating the factor $\tilde{f}_j$, the tilted distribution is defined by

\[ q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta})q_{\setminus j}(\boldsymbol{\theta}), \]
where $q_{\setminus j}$ is called the cavity distribution, which is essentially the approximate posterior with the factor $\tilde{f}_j$ removed:

\[ q_{\setminus j}(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i \neq j} \tilde{f}_i(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}. \]
Then EP finds the $\tilde{f}_j$ that minimises the KL divergence between the tilted distribution and the updated approximate posterior (which we call $q^\text{new}$):
\[
\mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})), \]
where \[q^\text{new}(\boldsymbol{\theta}) = \tilde{f}_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}).
\] If $q^\text{new}$ is a member of the exponential family (e.g. Gaussian), then we can minimise $\mathrm{KL}(q^\text{tilt} \| q^\text{new})$ by matching the moments of $q^\text{new}$ with the moments of $q^\text{tilt}$. This trick is called moment matching. In general, for approximating distributions from the exponential family, matching moments of the approximating distribution with those of the target distribution minimises the forward KL [5].

Note that the tilted distribution is not Gaussian and therefore it can be difficult to compute its moments analytically. Instead, the moments are often computed numerically: using MCMC, we can generate samples from the tilted distribution (in which case we would not need to calculate its normalising constant) and use the samples to calculate the moments empirically.
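A short sketch of why moment matching minimises the forward KL, in the spirit of [5]. The symbols $h$, $\boldsymbol{\eta}$, $\mathbf{u}$ and $A$ are notation introduced here for illustration only. Write the exponential-family approximation as
\[ q^\text{new}(\boldsymbol{\theta}) = h(\boldsymbol{\theta}) \exp\left(\boldsymbol{\eta}^\top \mathbf{u}(\boldsymbol{\theta}) - A(\boldsymbol{\eta})\right), \]
where $\mathbf{u}$ are the sufficient statistics and $A$ is the log-normaliser. Then, as a function of $\boldsymbol{\eta}$,
\[ \mathrm{KL}(q^\text{tilt} \| q^\text{new}) = -\boldsymbol{\eta}^\top \mathbb{E}_{q^\text{tilt}}[\mathbf{u}(\boldsymbol{\theta})] + A(\boldsymbol{\eta}) + \text{const}. \]
Setting the gradient with respect to $\boldsymbol{\eta}$ to zero and using $\nabla A(\boldsymbol{\eta}) = \mathbb{E}_{q^\text{new}}[\mathbf{u}(\boldsymbol{\theta})]$ gives
\[ \mathbb{E}_{q^\text{new}}[\mathbf{u}(\boldsymbol{\theta})] = \mathbb{E}_{q^\text{tilt}}[\mathbf{u}(\boldsymbol{\theta})]. \]
For a Gaussian, $\mathbf{u}(\boldsymbol{\theta}) = (\boldsymbol{\theta}, \boldsymbol{\theta}\boldsymbol{\theta}^\top)$, so this amounts to matching the mean and covariance of $q^\text{new}$ to those of $q^\text{tilt}$.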

The Gaussian EP algorithm is given below:

1. Initialise all the approximating factors $ \tilde{f}_i(\boldsymbol{\theta}), i=1,\dots,n $.
2. Initialise the approximate posterior by setting\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}),
\]where $ p(\boldsymbol{\theta}) $ is a Gaussian prior.
3. Until all $ \tilde{f}_i $ for $ i=1,\dots,n $ converge:
(a) Choose a factor $ \tilde{f}_j $ to refine.
(b) Evaluate the cavity distribution\[
q_{\setminus j}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_j(\boldsymbol{\theta})}.
\]
(c) Set the tilted distribution\[
q^\text{tilt}(\boldsymbol{\theta}) \propto f_j(\boldsymbol{\theta}) q_{\setminus j}(\boldsymbol{\theta}),
\]
and calculate the mean $ \boldsymbol{\mu} $ and covariance $ \boldsymbol{\Sigma} $ of $ q^\text{tilt} $.
(d) Obtain the new posterior $ q^\text{new} $ that minimises $ \mathrm{KL}(q^\text{tilt}(\boldsymbol{\theta}) \| q^\text{new}(\boldsymbol{\theta})) $ by matching its moments with $ \boldsymbol{\mu} $ and $ \boldsymbol{\Sigma} $.
(e) Evaluate and store the refined factor\[
\tilde{f}_j(\boldsymbol{\theta}) = \frac{q^{\text{new}}(\boldsymbol{\theta})}{q_{\setminus j}(\boldsymbol{\theta})}.
\]
(f) Use the refined factors to update the approximate posterior as\[
q(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\prod_{i=1}^n \tilde{f}_i(\boldsymbol{\theta}).
\]
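The steps above can be sketched for a scalar parameter in plain Python. This is a minimal illustration under several assumptions, not a reference implementation: the function name is hypothetical, the tilted moments are computed by grid quadrature rather than MCMC, a fixed number of sweeps replaces a convergence check, and an update is skipped if the cavity precision is not positive. Gaussians are stored via their precision and precision-times-mean, so multiplying and dividing densities becomes addition and subtraction.

```python
import math


def gaussian_ep_1d(log_sites, prior_mean, prior_var,
                   n_sweeps=10, grid_size=2001, width=10.0):
    """Gaussian EP for scalar theta; log_sites[i](theta) returns log f_i(theta).

    Returns the mean and variance of the approximate posterior q.
    """
    r = [0.0] * len(log_sites)  # site precisions (flat initial factors)
    b = [0.0] * len(log_sites)  # site precision-times-mean terms
    r_q, b_q = 1.0 / prior_var, prior_mean / prior_var  # q starts at the prior
    for _ in range(n_sweeps):
        for j, log_f in enumerate(log_sites):
            # (b) Cavity: divide q by the current approximate factor j.
            r_c, b_c = r_q - r[j], b_q - b[j]
            if r_c <= 0.0:
                continue  # improper cavity, skip this update
            m_c, v_c = b_c / r_c, 1.0 / r_c
            # (c) Moments of the tilted distribution on a grid around the cavity.
            sd = math.sqrt(v_c)
            step = 2.0 * width * sd / (grid_size - 1)
            thetas = [m_c - width * sd + k * step for k in range(grid_size)]
            log_w = [log_f(t) - 0.5 * (t - m_c) ** 2 / v_c for t in thetas]
            top = max(log_w)
            w = [math.exp(value - top) for value in log_w]
            z = sum(w)
            mean = sum(wi * t for wi, t in zip(w, thetas)) / z
            var = sum(wi * (t - mean) ** 2 for wi, t in zip(w, thetas)) / z
            # (d) Moment matching: q_new is Gaussian with this mean and variance.
            r_new, b_new = 1.0 / var, mean / var
            # (e) Refined factor = q_new / cavity.
            r[j], b[j] = r_new - r_c, b_new - b_c
            # (f) Updated posterior = cavity * refined factor = q_new.
            r_q, b_q = r_new, b_new
    return b_q / r_q, 1.0 / r_q


# Sanity check with Gaussian likelihood factors (unit noise variance), where
# the exact posterior is Gaussian and EP should reproduce it.
data = [1.0, 2.0, 0.5]
sites = [lambda t, x=x: -0.5 * (x - t) ** 2 for x in data]
print(gaussian_ep_1d(sites, prior_mean=0.0, prior_var=4.0))
```

Replacing the entries of `sites` with any other log-likelihood terms (for instance logistic ones) gives an approximate Gaussian posterior for a non-conjugate model, although convergence is then not guaranteed.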

Benefits and limitations of EP

As with any method, EP has advantages and disadvantages. Its advantages include the following:

EP updates the approximation factor-by-factor rather than globally, which often leads to better approximations of the target distribution.
EP is faster and computationally cheaper than MCMC. It can also speed up MCMC [1].
EP is easy to parallelise [6].
EP can easily be used in conjunction with other methods. Minka’s Roadmap on EP [7] provides a rich guide to the various areas that have employed EP, including regression, neural networks and nonlinear dynamic systems. EP has also been used with likelihood-free inference methods such as ABC (e.g. EP-ABC [8]).
Due to its factorisation structure, EP is naturally suited to graphical models, such as Bayesian networks and Markov random fields.

However, EP has some serious limitations, which later works have tried to address:

There is a lack of theoretical guarantees, e.g. convergence of the EP algorithm is not guaranteed.
If the number of approximating factors is large, this can lead to substantial memory consumption.

Extensions of EP

EP is well-suited for parallelisation. The original EP algorithm is sometimes called ‘sequential EP’, and its parallel version is known as ‘parallel EP’. Here, factor updates occur simultaneously, meaning that $q$ is not updated after each factor refinement in step 3 of the algorithm above; that is, each $\tilde{f}_j$ is refined using a cavity distribution built from the unrefined factors, excluding $\tilde{f}_j$. Instead, $q$ is updated only after all the factors have been updated once (whereas in sequential EP it is updated after each factor update), and the process then repeats for multiple rounds.

Since the introduction of EP, many variants have been developed, such as averaged EP (AEP) [9], power EP (PEP) [10] and stochastic EP (SEP) [11]. Different choices of divergence function have led to Variational Message Passing (VMP) [12], which uses the reverse KL, and Laplace propagation (LP) [13], which uses the Laplace approximation. Much work has been done to alleviate EP’s issues, such as guaranteeing convergence [9, 14], bounding its approximation errors [15], and reducing memory consumption (e.g. SEP).

Due to the close relation between EP and Variational Inference (VI), many methods have been developed from the unification of the two. For example, Partitioned Variational Inference (PVI) [16] arises from a mixture of several methods including power EP, global VI and local VI.

Figure 3: VI and EP schemes encompassed by the PVI framework. Source: [16].

In recent years, there has been growing interest in federated learning. This is where the dataset is partitioned across “clients” that train models locally before the model parameters are aggregated by a central server to update the global model. Since the posterior naturally factorises across partitioned client data, EP adapts well to this framework, producing algorithms such as FedEP [17] and Federated Neural Propagation (FedNP) [18].

References

[1] Barthelmé, S. (2016). The Expectation-Propagation Algorithm: a tutorial. Gipsa-lab, CNRS. https://www.cirm-math.fr/ProgWeebly/Renc1619/Barthelme_EP1.pdf
[2] Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference [Doctoral dissertation]. Massachusetts Institute of Technology.
[3] Jang, E. (2016). A Beginner’s Guide to Variational Methods: Mean-Field Approximation. https://blog.evjang.com/2016/08/variational-bayes.html
[4] Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
[5] Murray, I. (2017). Variational objectives and KL Divergence [Lecture notes]. University of Edinburgh. https://www.inf.ed.ac.uk/teaching/courses/mlpr/2017/notes/w9a_variational_kl.pdf
[6] Cseke, B. and Heskes, T. (2011). Approximate marginals in latent Gaussian models. Journal of Machine Learning Research, 12:417–454.
[7] Minka, T. P. (n.d.). A roadmap to research on EP. https://tminka.github.io/papers/ep/roadmap.html
[8] Barthelmé, S. and Chopin, N. (2012). Expectation Propagation for Likelihood-Free Inference. arXiv:1107.5959.
[9] Dehaene, G. and Barthelmé, S. (2018). Expectation propagation in the large data limit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):199–217.
[10] Minka, T. P. (2004). Power EP [Technical Report MSR-TR-2004-149]. Microsoft Research Ltd.
[11] Li, Y., Hernández-Lobato, J. M. and Turner, R. E. (2015). Stochastic Expectation Propagation. arXiv:1506.04132.
[12] Winn, J., Bishop, C. M. and Jaakkola, T. (2005). Variational message passing. Journal of Machine Learning Research, 6(4):661–694.
[13] Smola, A., Vishwanathan, S. V. N. and Eskin, E. (2003). Laplace propagation. Advances in Neural Information Processing Systems, 16.
[14] Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C. and Teh, Y. W. (2017). Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server. arXiv:1512.09327.
[15] Dehaene, G. and Barthelmé, S. (2016). Bounding errors of Expectation-Propagation. arXiv:1601.02387
[16] Bui, T. D., Nguyen, C. V., Swaroop, S. and Turner, R. E. (2018). Partitioned Variational Inference: A unified framework encompassing federated and continual learning. arXiv:1811.11206.
[17] Guo, H., Greengard, P., Wang, H., Gelman, A., Kim, Y. and Xing, E. P. (2023). Federated learning as variational inference: a scalable expectation propagation approach. arXiv:2302.04228.
[18] Wu, X., Huang, H., Ding, Y., Wang, H., Wang, Y. and Xu, Q. (2023). FedNP: Towards Non-IID Federated Learning via Federated Neural Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):10399-10407.

Student Perspectives: Unravelling Ancestry – When Genes Don’t Follow the Family Tree

Posted on 13th February 2025 (updated 17th February 2025) by ic23897

A post by Daniella Montgomery, PhD student on the Compass programme.

Introduction

In my project, I am working with my two supervisors, Dan Lawson in the School of Mathematics and Sion Bayliss from the School of Veterinary Science, to investigate the analysis of genomic data and their inferred ancestry trees, to detect problematic lineages of bacterial pathogens.

An ancestry tree is a tree which describes how genetic data is passed down through generations. By understanding the evolution of bacteria, we can develop strategies to alert us when dangerous pathogens evolve. Bacteria typically have only one parent, and if this were always the case, their evolution could be described as a tree. However, bacteria also frequently evolve through horizontal gene transfer, where genetic data is exchanged between lineages with different ancestries, as seen in Figure 1. This disrupts the traditional parent-to-offspring tree, and the ancestry must instead be represented by a more complex graph.

Figure 1: An example of a phylogeny with horizontal gene transfer shown by the red dashed line and the resulting recombined lineage shown as a full red line breaking the structure of this tree.

In this case, each location on the genome may be described by a different tree obtained by following the correct parent at that location, i.e. the “left” or “right” parent of the red individual in Figure 1. These trees can be called “local ancestries”.

Simulating Ancestry with Msprime

The Python package msprime allows us to simulate genetic ancestral data using the coalescent method. The coalescent is a backwards-in-time stochastic process: we start from n lineages sampled at random from the population, as seen in Figure 2. As we go back in time, the parent of each lineage is drawn at random from the population in the previous generation. Whenever two lineages pick the same parent, they coalesce into one. This process is repeated until a single common ancestor is reached.
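The backwards-in-time process described above can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, assuming a constant population size and discrete generations; it is not how msprime is implemented, and the function name and arguments are hypothetical.

```python
import random


def simulate_coalescent(n_samples, pop_size, seed=None):
    """Trace n_samples lineages back in time until they share one ancestor.

    Returns a list of (generation, merged_lineages) coalescence events, where
    each lineage is the frozenset of sample labels that descend from it.
    """
    rng = random.Random(seed)
    lineages = [frozenset([i]) for i in range(n_samples)]
    events = []
    generation = 0
    while len(lineages) > 1:
        generation += 1
        # Every surviving lineage picks a parent uniformly at random.
        by_parent = {}
        for lineage in lineages:
            parent = rng.randrange(pop_size)
            by_parent.setdefault(parent, []).append(lineage)
        # Lineages that picked the same parent coalesce into one.
        next_lineages = []
        for children in by_parent.values():
            if len(children) > 1:
                events.append((generation, children))
            next_lineages.append(frozenset().union(*children))
        lineages = next_lineages
    return events


events = simulate_coalescent(n_samples=5, pop_size=10, seed=1)
for generation, children in events:
    print(generation, [sorted(child) for child in children])
```

The recorded events hold exactly the information mentioned in the caption of Figure 2: when each coalescence happened and which lineages were involved.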

Figure 2: A depiction of the coalescent method taken from [1] for a population of 10 individuals and a sample size of 10. By keeping track of the times between coalescence events (T(3) and T(2)) and which lineages coalesce with which, we have a full picture of the phylogenetic tree.

The Impact of Gene Conversion

In this experiment, I am investigating how population structure manifests in genetic data and how this is affected by varying gene conversion rates. Gene conversion is a type of horizontal gene transfer where a donor genome replaces a sequence of DNA in a homologous acceptor genome. Our simulation has one population that splits into two populations with some gene conversion within the populations, as seen in Figure 3. From this, we can obtain local pedigrees across the genome for several sample genomes. Each local pedigree has a complex history, but gene conversion allows each gene to have a different random history.

Figure 3: A conceptual picture of the true population structure and the local pedigree of the sampled population obtained from simulation with nodes coloured by population. Blue represents the ancestral population and red and green represent the two descendant populations, A and B. The leaf nodes are labelled for comparison with future analysis.

Analyzing the Data

One common way to visualise complex histories is through Principal Component Analysis (PCA), where the data undergoes an eigenvalue decomposition that groups similar genomes together in a far lower-dimensional space. This dimensionality reduction also allows us to visualise certain population structure characteristics [2]. For example, in all of our 2D PCA graphs in Figure 4, we can see a clear split between population A and population B.

However, there is a limit to how interpretable these PCs are. We use the dendrogram from hierarchical clustering to help sort our data such that more similar data is kept together. Then we create a covariance plot of how similar the principal components of each genome are to each other. This plot is a rudimentary method to help us visualize the population structure of the simulation’s resulting lineages seen in Figure 5a. The population structure is clear, but there is still structure given by the random pedigree shared by all individuals.

Figure 4: The principal component analysis plots with colours showing the true populations for a gene conversion rate of 1e-6.

Figure 5: Covariance matrices for increasing gene conversion rates (reading left to right, top to bottom) 1e-6, 1e-5, 1e-4, showing a breakdown of the sub-population structure.

In Figures 5a to 5c, we can see that as the gene conversion rate increases, the covariance matrix reflects a single random history less and less, and instead “averages out” into the population structure. This visualises how the dependence on the history breaks down as the genomes within each population become more similar to each other due to gene conversion.

If you would like to know more about this topic, please contact me at ic23897@bristol.ac.uk.

[1] Rosenberg, N., Nordborg, M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet 3, 380–390 (2002). (https://doi.org/10.1038/nrg795)

[2] McVean G. A genealogical interpretation of principal components analysis. PLoS Genetics, 5, e1000686 (2009). (https://doi.org/10.1371/journal.pgen.1000686)

Student perspectives: Genetic Boolean Models – How to Make One

Posted on 6th February 2025 by shaun.jordan

A post by Daniel Gardner, PhD student on the Compass programme.

Introduction

My research focuses on genetic interaction networks within lung cancer cells. Our (long-term) aim is to model such networks dynamically using a Boolean modelling framework, and then use this to tie changes in cancer cells’ physiology to certain, often mutated, genes of interest.

Aims and problems

This blog post will focus on the challenge we are currently working on: constructing the model itself. This is often the most challenging element of the research, as it underpins all results going forward, and often there does not exist enough data to fully define a unique model.

In some respects this is acceptable, as Boolean modelling is more of a qualitative approach. Each node in the network is a ‘species’, be that a gene, protein, small molecule, etc. Each directed, labelled edge is either ‘activating’, if an increase in species A causes an increase in species B, or ‘inhibiting’, if the opposite is the case [1].

With this definition, many of the papers we have looked at define their model purely from the literature [2], [3], [4], either manually mining links, or using pre-existing databases like the Kyoto Encyclopedia of Genes and Genomes (KEGG) [6].

What we are more interested in are methods deriving these models in a far more quantitative way, straight from transcriptomic data. Whilst some of the papers referenced above justify their hand-built models in retrospect by showing they can replicate real-world results [3], we wish to work the other way round – beginning at the real-world results and then using a reverse engineering approach.

Figure 1: The Boolean model used in [2], based on a similar model constructed in [4]. It contains 98 nodes (species) and 254 directed edges (labelled interactions).

Potential solutions

The solutions we have found can be broadly split into two categories. The first contains methods that go from:

Raw Data → Interaction Network

and similarly:

Interaction Network → Boolean Model

The former is a much more difficult challenge. Generally, in a published network, each edge will reference experimental work that justifies, e.g. ‘A activates B’. However, datasets that combine many cell-line perturbation experiments are hard to come by, and the experiments themselves are expensive to perform [5]. The problem is often also underdetermined, since the solution space of potential networks is far larger than the amount of data available. One option we may look into in the future, however, involves using other modelling techniques, such as ODEs or Bayesian networks.

The challenge of reverse engineering a Boolean network from a pre-built network is much more feasible. The main problem in this case is considering complex interactions. For example, if we had ‘A inhibits C’ and ‘B activates C’, how do they work in tandem?

Figure 2: Part of the optimisation algorithm from [7] applied to a toy model. In D, we classify each species in the network. All non-compressed nodes are those which we have data to train on. In E, we construct the hypergraph, where for any pair of combined interactions, both the ‘AND’ and ‘OR’ case are considered.

Sticking to the Boolean framework, these two interactions can either be joined through an ‘AND’ relation, or an ‘OR’ relation. For several proteins affecting one specific protein, the combinations of Boolean rules are non-trivial.
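As a small illustration of why the choice matters, the sketch below writes out both candidate update rules for C in the ‘A inhibits C’ and ‘B activates C’ example and prints their truth tables. The function names are hypothetical and chosen only for this example.

```python
from itertools import product


def rule_and(a, b):
    """C switches on only if the inhibitor A is off AND the activator B is on."""
    return (not a) and b


def rule_or(a, b):
    """C switches on if the inhibitor A is off OR the activator B is on."""
    return (not a) or b


# Enumerate every state of (A, B) and the next state of C under each rule.
table = [
    (a, b, rule_and(a, b), rule_or(a, b))
    for a, b in product([False, True], repeat=2)
]

print("A      B      AND    OR")
for a, b, c_and, c_or in table:
    print(f"{a!s:<6} {b!s:<6} {c_and!s:<6} {c_or!s:<6}")

# States of (A, B) for which the two rules predict a different C.
disagreements = [(a, b) for a, b, c_and, c_or in table if c_and != c_or]
print(disagreements)
```

The two rules disagree whenever A and B are both off or both on, so data from exactly those conditions would be needed to tell them apart.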

One paper we found that deals with this problem well is Saez-Rodriguez et al. [7], which attempts to train a hypergraph of the interaction network to cell line assay data. It contains a number of different techniques to do with graph and state space reduction, as well as some heuristic rules on which complex interactions to target. For example, it is unlikely in biology for a protein to require multiple other species to necessitate a change in function, so we can remove ‘AND’ links of more than N complex interactions from the state space.
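To give a feel for how much such a heuristic shrinks the search space, here is a rough sketch that enumerates candidate ‘AND’ gates feeding a single target node, with and without a cap on gate size. It assumes each candidate hyperedge is an AND over a subset of the signed inputs; this is a simplification for illustration, not the procedure of [7], and the input names and function are hypothetical.

```python
from itertools import combinations


def candidate_gates(inputs, max_and_size):
    """List candidate AND-gates for one target node.

    inputs: signed input literals, e.g. "B" (activator) or "!A" (inhibitor).
    max_and_size: largest number of inputs allowed in a single AND-gate (N).
    """
    gates = []
    for size in range(1, max_and_size + 1):
        gates.extend(frozenset(combo) for combo in combinations(inputs, size))
    return gates


# Four hypothetical regulators of one target protein.
inputs = ["!A", "B", "D", "!E"]

capped = candidate_gates(inputs, max_and_size=2)
uncapped = candidate_gates(inputs, max_and_size=len(inputs))
print(len(capped), len(uncapped))
```

With four regulators the cap at N = 2 removes a third of the candidates, and the saving grows quickly as the number of regulators increases.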

One other model component we are looking for, which we have not currently looked into properly, is a ‘layered’ model, which includes different levels of genomic interaction. For example, many papers we have read use ‘protein interaction network (PIN)’ and ‘gene regulatory network (GRN)’ interchangeably. Whilst the two are greatly related, drawing a one-to-one equivalence between the two in all cases is incorrect.

Conclusion and future plans

Starting directly from data to build a network is perhaps too ambitious a challenge, especially with the limited data available. In fact, even to train a Boolean network for optimisation requires quite specific cell-line perturbation data. It could be that we make do with a network partially trained on limited data, and the rest taken from prior knowledge in the literature.

One promising sign is that [7] finds that it is best to begin with ‘too many’ interactions in a literature-curated interaction network, and then ‘prune’ spurious interactions via network optimisation. This is because these large networks are built from many different sources, some using different tissues, conditions, etc. Therefore, when we desire a model specific to lung adenocarcinoma data, it is natural for the training to remove many of these genetic interactions.

In the future, we aim for this research topic to simply be one section of the wider project. Once we decide upon the most justified Boolean model for lung cancer, we aim to use patient mRNA and mutation data to personalise the models, in order to predict patient specific cell phenotype probabilities. Using this, along with multi-layer protein imaging data from Cancer Research UK, we aim to find a statistically significant link between certain gene mutations, and the resulting shape and, therefore, phenotype of a tumour of cancer cells.

Thank you for reading this blog post. If you have any questions, please feel free to get in touch with me at: daniel.gardner@bristol.ac.uk

References

[1] Abou-Jaoudé, W., Traynard, P., Monteiro, P. T., Saez-Rodríguez, J., Helikar, T., Thieffry, D., and Chaouiya, C. (2016). Logical modeling and dynamical analysis of cellular networks. Frontiers in Genetics, 7.

[2] Béal, J., Montagud, A., Traynard, P., Barillot, E., and Calzone, L. (2019). Personalization of logical models with multi-omics data allows clinical stratification of patients. Frontiers in Physiology, 9.

[3] Cohen, D. P. A., Martignetti, L., Robine, S., Barillot, E., Zinovyev, A., and Calzone, L. (2015). Mathematical modelling of molecular pathways enabling tumour cell invasion and migration. PLOS Computational Biology, 11.

[4] Fumiã, H. (2013). Boolean network model for cancer pathways: Predicting carcinogenesis and targeted therapy outcomes. PloS one, 8:e69008.

[5] Galindez, G., Sadegh, S., Baumbach, J., Kacprowski, T., and List, M. (2023). Network-based approaches for modeling disease regulation and progression. Computational and Structural Biotechnology Journal, 21:780–795.

[6] Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30.

[7] Saez-Rodriguez, J., Alexopoulos, L. G., Epperlein, J., Samaga, R., Lauffenburger, D. A., Klamt, S., and Sorger, P. K. (2009). Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Molecular Systems Biology, 5(1):331.

Student perspectives: Regional Sensitivity Analysis

Posted on 21st November 2024 by cecina.babichmorrow

A post by Cecina Babich Morrow, PhD student on the Compass programme.

Introduction

Sensitivity analysis seeks to understand how much changes in each input affect the output of a model. We want to be able to determine how variation in a model’s output can be attributed to variations in its input. Given the high amount of uncertainty present in most real-world modelling settings, it is crucial to understand the magnitude of this uncertainty’s impact on model results. Knowing how sensitive a model is to a particular parameter can help guide modellers in prioritising what level of precision is needed in estimating that parameter in order to produce valid results. Sensitivity analysis thus serves as a vital tool for modellers in numerous fields, allowing them to assess robustness and to identify key drivers of uncertainty. By systematically analysing the relative amount of influence that each input parameter has on the output, sensitivity analysis reveals which parameters have the greatest impact on the results.

By identifying these critical parameters, stakeholders can prioritize investments in data collection, parameter estimation, and uncertainty reduction. This targeted approach ensures that efforts are concentrated where they will have the most significant impact.

Why use Regional Sensitivity Analysis?

In this blog post, I will focus on one particular sensitivity analysis method that I have been using in my project so far to help understand the sensitivity of an output decision to input parameters which affect that decision. Regional Sensitivity Analysis (RSA) was developed in the field of hydrology, but has widespread applications in environmental modelling, disease modelling, and beyond.

My research focuses on environmental decision-making, so I frequently deal with models that output a decision that can take on one of several discrete values. For example, consider trying to make a decision about what to wear based on the weather. To make our decision, we use three input parameters about the weather: temperature, humidity, and wind speed. Then, our decision model can output one of three decisions: (1) stay home, (2) leave the house with a jacket, (3) leave the house without a jacket. We might then be interested in how sensitive our model is to each of our three weather-related input parameters to understand how much each one contributes to uncertainty in our ultimate decision. In this type of setting, we need to use a sensitivity analysis method that can handle continuous inputs, e.g. temperature, in conjunction with a discrete output, e.g. our decision.

For settings such as these where the inputs of our model are continuous and the outputs are discrete, RSA, also referred to as Monte Carlo filtering, is a potential method of sensitivity analysis [1]. RSA aims to identify which regions of input space correspond to specific values in the output space [2, 3]. Originally, the method was developed in the field of hydrology for cases where the output variable is binary, or made such by applying a threshold. It has since been extended by splitting the parameter space into more than two groups [3, 4]. RSA is well-suited to sensitivity analysis in the case where the output variable is categorical [5].

RSA is fundamentally a Bayesian approach. First, prior distributions are assigned to the input parameters. The model is then run multiple times, sampling input parameters from these priors, and recording the resulting output values. By analysing the relationship between input uncertainties and output uncertainties, RSA identifies which parameters significantly affect the model’s predictions.

How does RSA work?

We will present the mathematical formalisation of RSA in a setting where we have a discrete output variable $y \in \{ y_1, y_2, \ldots, y_m \}$ which can take on one of $m$ possible output values, and a vector of $d$ continuous input variables $\mathbf{x} = [x_1, x_2, \ldots, x_d]$ . We start with prior distributions on the input vector $\mathbf{x}$ , from which we sample before running the model to calculate the output value for that particular input.

Then, RSA compares the empirical conditional cumulative distribution functions (CDFs) $F_{x_i | y_j}$ conditioned on the different output values of $y$ . That is, for the $i$ th input parameter, we take the empirical CDF conditioned on the output of the model being the $j$ th possible output value. For example, in our weather-based decision model, we would be considering the empirical CDF $F(\text{temperature } | \text{ decide to stay home})$ . We then compare these CDFs $F_{x_i|y_j}$ for each of the possible $j \in \{1, \ldots,m\}$ output values (in our case, each of the possible output decisions). If the conditional CDFs of $x_i$ differ greatly in distribution for one or more of the values of $y$ , then we can conclude that our model is sensitive to that particular input parameter. If $F(x_i) = F(x_i | y_1) = \ldots = F(x_i | y_m)$ , then the output is insensitive to $x_i$ on its own. (See the Extensions of RSA section for a discussion of variable interactions.)
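The sampling and filtering steps can be sketched as follows for the weather example. Everything specific here is an assumption made for illustration: the uniform priors, the thresholds inside the hypothetical `decide` function, and the number of model runs.

```python
import random
from bisect import bisect_right


def decide(temperature, humidity, wind_speed):
    """Hypothetical decision model (humidity is deliberately ignored)."""
    if temperature < 0 or wind_speed > 60:
        return "stay home"
    if temperature < 15:
        return "jacket"
    return "no jacket"


def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


# Sample inputs from assumed uniform priors and run the model once per draw.
rng = random.Random(0)
groups = {}
for _ in range(5000):
    x = (rng.uniform(-10, 30), rng.uniform(0, 100), rng.uniform(0, 80))
    groups.setdefault(decide(*x), []).append(x)

# Conditional empirical CDFs of temperature, one per output decision.
temp_cdfs = {y: ecdf([x[0] for x in xs]) for y, xs in groups.items()}
for decision, cdf in temp_cdfs.items():
    print(decision, round(cdf(10.0), 3))
```

Because the toy model switches decision at fixed temperature thresholds, the three conditional CDFs of temperature look very different from one another, which is the signature of a sensitive input.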

The difference between these CDFs can be measured using several possible sensitivity indices. Typically, the Kolmogorov-Smirnov (KS) statistic is applied over all possible values of $y$ , and then some statistic (e.g. mean, median, maximum, etc.) is calculated to summarise the overall sensitivity of $y$ to $x_i$ :

$\text{stat}_{j,k} [KS(x_i)] = \text{stat}_{j,k} \left[\max_{x_i} \left \lvert F_{x_i | y_j} (x_i | y = y_j) - F_{x_i | y_k} (x_i | y = y_k) \right \rvert\right]$

where $j,k \in \{1, \ldots, m\}$ and $\text{stat}$ could be mean, median, maximum, etc.
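A minimal sketch of this index is given below. It assumes we already hold, for one input $x_i$, the sampled values grouped by output, and it summarises over unordered pairs $j < k$; the tiny hand-made samples and function names are purely illustrative.

```python
from bisect import bisect_right
from itertools import combinations
from statistics import mean


def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    # Both CDFs are step functions, so the maximum gap occurs at a sample point.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_sensitivity(groups, stat=mean):
    """Summarise the pairwise KS statistics of one input across output values."""
    pairs = combinations(sorted(groups), 2)
    return stat(ks_statistic(groups[j], groups[k]) for j, k in pairs)


# Values of a single input x_i, grouped by the output they produced.
groups = {
    "y1": [1, 2, 3, 4],
    "y2": [3, 4, 5, 6],
    "y3": [1, 2, 3, 4],
}
print(ks_sensitivity(groups, stat=mean), ks_sensitivity(groups, stat=max))
```

Choosing `max` flags an input as soon as any single pair of outputs is well separated, whereas `mean` rewards inputs that separate all outputs from each other.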

For instance, consider the following situation with an input parameter $x_i$ , where the output $y$ can take on one of three values. We assumed a uniform prior for $x_i \sim \text{Unif}(350, 800)$ . The blue, green, and red distributions shown in Fig. 1 below are the empirical conditional CDFs $F(x_i | y_1)$ , $F(x_i | y_2)$ , and $F(x_i | y_3)$ , respectively, giving the probability that $x_i$ is less than or equal to a given value given that the output result of the model was $y_j$ . The vertical dotted lines are the KS statistic between each of the three pairs of CDFs. Then a statistic, such as the mean, median, or maximum of those three KS values, can be calculated to represent the overall sensitivity of $y$ to the input parameter $x_i$ . For example, the mean KS statistic is 0.5505.

Figure 1. Visualisation of RSA using a summary statistic of the KS statistic as a sensitivity index. The blue, green, and red distributions are the empirical conditional CDFs $F(x_i | y_k)$ for $k \in \{1, 2, 3\}$ , and the vertical dotted lines represent the KS statistic between each of the three pairs of CDFs.

As an alternative to using the KS statistic, we can instead apply a statistic to spread, i.e. the area between the CDFs:

$\text{stat}_{j,k} [\text{spread}(x_i)] = \text{stat}_{j,k} \left[ \int_{-\infty}^\infty \max \left(F_{x_i | y_j} (x_i | y = y_j), F_{x_i | y_k} (x_i | y = y_k)\right) dx_i - \int_{-\infty}^\infty \min \left(F_{x_i | y_j} (x_i | y = y_j), F_{x_i | y_k} (x_i | y = y_k)\right) dx_i \right]$

where $j,k \in \{1,\ldots, m\}$ . In this case, we would be considering the area between each of the three distributions shown in Fig. 1 above and then averaging them (or applying some other summary statistic) as our sensitivity index. For instance, the mean spread between CDFs is 134.09.
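Since the pointwise maximum minus the pointwise minimum of two CDFs is just their absolute difference, the spread between two empirical CDFs can be computed exactly by summing rectangles between consecutive sample points. The sketch below does this for two small hand-made samples; the function name is hypothetical.

```python
from bisect import bisect_right


def spread(a, b):
    """Area between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a + b))
    area = 0.0
    # Both CDFs are constant between consecutive grid points, so the area is
    # a sum of rectangles: |F_a - F_b| at the left edge times the interval width.
    for left, right in zip(grid, grid[1:]):
        gap = abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        area += gap * (right - left)
    return area


print(spread([1, 2, 3, 4], [3, 4, 5, 6]))
```

Unlike the KS statistic, the spread carries the units of $x_i$ and so scales with the width of the prior. This may explain part of the difference in magnitude between spread values for inputs with very different ranges, so it seems safest to compare spreads only after rescaling the inputs.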

Higher values of either sensitivity index for a given input parameter $x_i$ suggest that the output is more sensitive to variations in that parameter, i.e. the distributions of input values leading to a given output value are more different from one another. For example, Figure 2 compares the conditional CDFs of $x_i$ with those of a different input parameter, $x_j$ , with a prior of $x_j \sim \text{Unif}(80,120)$ . We can see that the CDFs $F(x_i | y_k)$ show a high degree of separation, compared to the CDFs $F(x_j | y_k)$ , which do not. This is reflected in the sensitivity indices: for example, the mean KS statistic for $x_j$ is only 0.1648 and the mean spread is only 2.897. Comparing KS statistics in this manner makes RSA a tool well-suited for ranking, or factor prioritisation, one of the main goals of sensitivity analysis that aims to rank parameters by their contribution to variation in the output [1, 5].

Figure 2. Comparison of sensitivity of a model to two input parameters, $x_i$ and $x_j$ . The blue, green, and red distributions are the empirical conditional CDFs $F(x_i | y_k)$ and $F(x_j | y_k)$ for $k \in \{1, 2, 3\}$ .

Extensions of RSA

One notable limitation of RSA, identified since its inception [2], is its inability to handle parameter interactions. A zero value of the sensitivity index is a necessary condition for insensitivity, but it is not sufficient [2, 5]. Inputs that contribute to variation in the model output only through interactions can have the same univariate conditional CDFs, and thus RSA cannot properly identify their impact on model output. For theoretical examples, see Fig. 1 of [2] and Example 6 of Section 5.2.3 in [1]. In our toy example, we may have a decision model where the output decision is not particularly sensitive to temperature or humidity on their own, but it may be very sensitive to an interaction between these two parameters since their combined effects impact how warm or cool the weather actually feels.
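This limitation is easy to reproduce with a toy model in which the output depends on two inputs only through their interaction. The sketch below is an assumed example, not taken from [1] or [2]: the output is an exclusive-or of two thresholds, and the inputs are placed on a regular grid rather than sampled so that the result is exact.

```python
from bisect import bisect_right
from itertools import product


def model(x1, x2):
    """Toy model: the output is True when exactly one input exceeds 0.5."""
    return (x1 > 0.5) != (x2 > 0.5)


def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Evaluate the model on a regular 10 x 10 grid over the unit square.
grid = [(k + 0.5) / 10 for k in range(10)]
runs = [(x1, x2, model(x1, x2)) for x1, x2 in product(grid, grid)]

# Filter the x1 values by output and compare their conditional distributions.
x1_true = [x1 for x1, _, y in runs if y]
x1_false = [x1 for x1, _, y in runs if not y]
ks_x1 = ks_statistic(x1_true, x1_false)

# Yet moving x1 to the other side of the threshold always changes the output.
always_flips = all(model(x1, x2) != model(1 - x1, x2) for x1, x2 in grid_pairs) \
    if (grid_pairs := list(product(grid, grid))) else False
print(ks_x1, always_flips)
```

The KS index for $x_1$ is zero even though the output changes every time $x_1$ crosses its threshold, so a univariate RSA alone would wrongly screen this input out.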

In situations such as these where interactions between input variables may matter more than each variable on its own, RSA can be useful for ranking, but it cannot be used for screening, another goal of sensitivity analysis aiming to identify variables with little to no influence on output variability [1, 5]. To address this limitation, RSA can be augmented with machine learning methods such as random forests and density estimation trees [6]. Spear et al. performed a sensitivity analysis of a dengue epidemic model to demonstrate how these tree-based models can augment RSA [6].

First, the authors performed RSA in its original form, using the KS statistic to examine the difference between the univariate CDFs. Then, they used random forest analysis to classify model runs into the various output values. Next, a measure of variable importance, such as Gini impurity, was used to rank the input parameters in terms of their influence on the model output [6]. Random forests allow the effects of variable interactions to be incorporated when ranking the importance of each parameter. By comparing the parameter ranking resulting from RSA with that from the random forest, they identified parameters which impacted the output through interaction. Finally, they used density estimation trees to help identify regions of parameter space corresponding to particular output values. Density estimation trees are the density-estimation analogue of classification and regression trees: rather than predicting a class or value, they attempt to estimate the probability density function that gave rise to a particular region of output space [7]. By applying density estimation trees as part of the sensitivity analysis, Spear et al. were able to examine the effects of scale on sensitivity, identifying parameters which may be relatively unimportant when ranking across the entire parameter space, but are highly influential in small subspaces.

Further research such as this highlights the benefits of combining multiple sensitivity analysis methods in order to gain a full picture of how model inputs affect uncertainty in the output.

Conclusions

Hopefully this blog has been an informative crash course in regional sensitivity analysis! Note that the visualisations in this post have been created using the SAFEpython toolbox [8]. If you have any questions or comments, please feel free to get in touch at cecina.babichmorrow@bristol.ac.uk.

References

[1] A. Saltelli, Global sensitivity analysis: the primer. Wiley, 2008. [Online]. Available: https://onlinelibrary.wiley.com/doi/book/10.1002/9780470725184

[2] R. Spear and G. Hornberger, “Eutrophication in peel inlet—II. identification of critical uncertainties via generalized sensitivity analysis,” Water Research, vol. 14, no. 1, pp. 43–49, 1980. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0043135480900408

[3] J. Freer, K. Beven, and B. Ambroise, “Bayesian estimation of uncertainty in runoff prediction and the value of data: An application of the GLUE approach,” Water Resources Research, vol. 32, no. 7, pp. 2161–2173, 1996. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1029/95WR03723

[4] T. Wagener, D. P. Boyle, M. J. Lees, H. S. Wheater, H. V. Gupta, and S. Sorooshian, “A framework for development and application of hydrological models,” Hydrology and Earth System Sciences, vol. 5, no. 1, pp. 13–26, 2001. [Online]. Available: https://hess.copernicus.org/articles/5/13/2001/

[5] F. Pianosi, K. Beven, J. Freer, J. W. Hall, J. Rougier, D. B. Stephenson, and T. Wagener, “Sensitivity analysis of environmental models: A systematic review with practical workflow,” Environmental Modelling & Software, vol. 79, pp. 214–232, 2016. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1364815216300287

[6] R. C. Spear, Q. Cheng, and S. L. Wu, “An example of augmenting regional sensitivity analysis using machine learning software,” Water Resources Research, vol. 56, no. 4, p. e2019WR026379, 2020. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1029/2019WR026379

[7] P. Ram and A. G. Gray, “Density estimation trees,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 627–635. [Online]. Available: https://dl.acm.org/doi/10.1145/2020408.2020507

[8] F. Pianosi, F. Sarrazin, and T. Wagener, “A Matlab toolbox for global sensitivity analysis,” Environmental Modelling & Software, vol. 70, pp. 80–85, 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1364815215001188

Student perspectives: Compass Annual Conference 2024

Posted on 8th November 2024 (updated 14th November 2024) by ben.anson

A post by Compass students Ben Anson, Ollie Baker, Codie Wood and Rachel Wood.

Introduction

This October, we held our third annual Compass Conference. Unlike previous years, when the event was held in the University’s Fry Building, this time it took place at M Shed, offering scenic views of Bristol harbour. It was a great day for past and present Compass students, academics, and industry partners to come together and discuss this year’s theme: “The Future of Data Science”. With recent advances in machine learning and AI, it felt like a fitting time to learn from each other’s perspectives and to share ideas about how to move forward in this exciting space.

Panoramic view of Bristol harbour, as seen from M Shed

Student Research Talks

The morning started with four ten-minute research talks from Compass students. First was Rahil Morjaria’s talk on “Group Testing”, which explored current developments in the field, including algorithms and information-theoretic limits.

Following this, Kieran Morris presented “A Trip to Bregman Geometry and Applications”, considering advancements such as natural gradient methods, Bregman K-means clustering, and EM-projection algorithms that Bregman Geometry has enabled.

Ettore Fincato talked us through “Gradient-Free Optimisation via Integration”, focusing on a novel yet easy-to-implement algorithm for optimisation using Monte Carlo methods. Finally, Ed Milsom spoke about “Data Modalities and the Bias-Variance Decomposition”, taking us through a history of neural networks and speculating about why certain data types are so powerful, and why the future of general-purpose AI must be multi-modal.

Student Lightning Talks

The lightning talks challenged ten students to present on a topic in just three minutes. The ability to quickly convey a message in an engaging and understandable manner, to an audience with diverse backgrounds, is crucial in both academia and industry, and the students rose to the occasion.

Their talks captured the interest of the audience and inspired interesting questions that forced the students to think on their feet. Topics ranged from neural networks and large language models (LLMs), to making music using mathematics.

Compass Alumni Panel

This year’s conference panel, chaired by Compass CDT Director, Professor Nick Whiteley, offered an engaging look into the professional journeys of Compass alumni Dominic Owens, Jake Spiteri and Michael Whitehouse since completing their PhDs. With shared experiences in finance, each panelist provided unique insights into the early career landscape and the skills that helped them succeed.

Jake delved into the details of his day-to-day work in the financial sector, while Dominic discussed the challenge and dedication required to secure a role through extensive networking and job applications. Michael shared details of his transition from finance to epidemiological research. Together, they sparked valuable discussions on what the future of data science might hold for upcoming Compass graduates.

Special Guest Lecture

The conference concluded with an enlightening special guest lecture by Professor Aline Villavicencio, Director of the Institute for Data Science and Artificial Intelligence at the University of Exeter. Her talk, “Testing the Idiomatic Language Limits of Foundation Models: The Strange Case of the Idiomatic Eager Beaver in Cloud Nine,” offered a fascinating counterpoint to the current enthusiasm surrounding LLMs.

Drawing from her research in Natural Language Processing (NLP), Professor Villavicencio demonstrated how even today’s most advanced models struggle with aspects of language that humans master naturally – particularly idioms and multi-word expressions. She illustrated a persistent gap between machine and human linguistic capabilities, reminding us that the path to truly human-like language understanding remains long and complex.

She also shared her perspective on the cyclical nature of NLP research, noting how, throughout her career, there have been multiple predictions about NLP research becoming obsolete as models improve. Yet, as her work on datasets like SemEval (Semantic Evaluation) shows, there remain fundamental challenges in representing and understanding idiomatic language.

Concluding remarks

The successful day of talks, poster sessions and networking culminated with Professor Whiteley sharing his thoughts on what we learned throughout the event. He concluded that the future of our field is certain to be exciting and will encompass a huge range of different areas and ideas. This year’s conference embodied this by providing a platform for students, academics, and industry professionals to share new insights from many different sectors, and to form strong relationships to help forge a path to the future of data science.

Past conferences

Student Perspectives: The trade-off between sample size and number of trials in meta-analysis

Posted on 1st November 2024 (updated 11th November 2024) by xinrui.shi

A post by Xinrui Shi, PhD student on the Compass programme.

Introduction

Meta-analysis is a widely used statistical method for combining evidence from multiple independent trials that compare the same pair of interventions [1]. It is mainly used in medicine and healthcare but has also been applied in other fields, such as education and psychology. In general, it is assumed that there is a numerical measure of effectiveness associated with each intervention, and the goal is to estimate the difference in effectiveness between the two interventions. In medicine, this difference is termed the relative treatment effect. We assume that relative effects vary across trials but are drawn from some shared underlying distribution. The objective is to estimate the mean and standard deviation of this distribution, which we denote by $d$ and $\tau$ respectively; $\tau$ is referred to as the heterogeneity parameter.

In medical trials, patients are randomly allocated to one of the two treatment options, and their subsequent health outcomes are monitored. Each trial then provides an observation of the relative treatment effect in that trial. Meta-analysis uses these observations to estimate the mean $d$ and variance $\tau^2$ of the distribution of relative effects. In this work, we are interested in understanding what conditions maximise the precision of these pooled estimates.

It is well-known that the precision of meta-analysis can be improved by either increasing the number of observations or improving the precision of individual observations. Both approaches, however, require more participants to be included in the meta-analysis. To understand the relative importance of these factors, we constrain the total number of participants to a fixed value. Then, if more trials are conducted to generate additional observations, each trial will necessarily include fewer participants, thereby reducing the precision of each individual observation. Given this trade-off between the precision of observations and their quantity, we ask: how should participants be optimally partitioned across trials to achieve the most precise estimates?

Meta-analysis background

Model

Suppose there are two treatments for a disease, labelled $T_1$ and $T_2$, and we want to know which one is more effective. There are a total of $n$ patients across $M$ trials, and patients in each trial are randomly allocated to one of the two treatments. We write $n_{ij}$ for the number of patients assigned to treatment $T_j$ in trial $i\in \{ 1,\ldots,M \}$.

Outcomes refer to a patient’s health status after treatment. Here, we assume a binary outcome, either recovered or not recovered. A natural measure of the effectiveness of a treatment is the probability of recovery. Let $p_{ij}$ denote the probability of recovery after receiving treatment $T_j$ in trial $i$, and $X_{ij}$ the number of these patients who recover. We assume that outcomes are independent across patients. It then follows that $X_{{ij}}$ has a Binomial distribution,
\[eq(1):\quad X_{{ij}} \sim \text{Binomial}(n_{ij}, p_{ij}).\]
Due to differences in trial populations and procedures, the recovery probabilities are not assumed to be the same across trials. Instead, we assume the exchangeability of relative effects.

We model relative treatment effects on the continuous scale. Therefore, we transform $p_{ij}$ to its log-odds,
\begin{equation*}
\quad Z_{ij} := \text{logit}(p_{ij}) = \log \frac{p_{ij}}{1-p_{ij}}
\label{eq:def_Zij}
\end{equation*}
and define the trial-specific relative treatment effect, $\Delta_{i,12}$, as the log odds ratio (LOR) between the two treatments in this trial,
\begin{equation}\label{eq:LOR}
eq(2):\quad \Delta_{i,12}:= Z_{i2}- Z_{i1}=\log \frac{p_{i2}(1-p_{i1})}{p_{i1}(1-p_{i2})}.
\end{equation}
In words, $\Delta_{i,12}$ represents the effect of $T_2$ relative to $T_1$ in the $i$-th trial; $\Delta_{i,12}>0$ indicates that $T_2$ is more effective than $T_1$.

The random effects (RE) model assumes that the treatment effects vary across trials,
\begin{equation}
eq(3):\quad \Delta_{i,12}\sim \text{Normal}(d_{12},\tau^2),
\label{eq:normal_assump2}
\end{equation}
where $d_{12}$ represents the true mean of relative treatment effects between $T_1$ and $T_2$. The fixed effect (FE) model is a special case of the RE model, in which the relative treatment effects in all trials are assumed to be equal, i.e., $\tau=0$ and $\Delta_{i,12} \equiv d_{12}$ for all $i \in \{1,\ldots,M\}$.
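
As a rough illustration of the model above, the following Python sketch computes the log-odds and the log odds ratio of equation (2), and draws trial-specific effects from the normal distribution in equation (3). The function names and the example parameter values are our own choices for illustration, not part of the original analysis.

```python
import math
import random


def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def log_odds_ratio(p1, p2):
    """Effect of T2 relative to T1 on the log-odds scale, as in eq (2)."""
    return logit(p2) - logit(p1)


def sample_trial_effects(d12, tau, M, rng):
    """Draw M trial-specific effects from Normal(d12, tau^2), as in eq (3)."""
    return [rng.gauss(d12, tau) for _ in range(M)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
effects_re = sample_trial_effects(0.05, 0.1, 5, rng)  # random effects
effects_fe = sample_trial_effects(0.05, 0.0, 5, rng)  # tau = 0: fixed effect
```

With $\tau=0$ every sampled effect equals $d_{12}$, which is the FE special case described above.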

Data

To achieve the primary goal of estimating $d_{12}$ and $\tau$, we first derive expressions for the relative treatment effects in each trial from the available data.

We write $r_{ij}$ for the realisation of the random variable $X_{ij}$. The observed relative treatment effect in the $i$-th trial is then
\[
\hat{\Delta_{i,12}} = \log\frac{r_{i2}(n_{i1}-r_{i1})}{r_{i1}(n_{i2}-r_{i2})}.
\]
It can be shown for our binomial model that, as the numbers of patients $n_{i1}$ and $n_{i2}$ grow large, the distribution from which $\hat{\Delta_{i,12}}$ is sampled is asymptotically normal, centred on the true trial-specific effect $\Delta_{i,12}$ and with a sampling variance $\sigma_i^2$ that can be explicitly expressed in terms of $n_{i1}$, $n_{i2}$, and the unknown parameters $p_{i1}$ and $p_{i2}$. The true variance $\sigma^2_i$ is thus unknown, but can be estimated as follows,
\begin{equation}
eq(4): \quad \hat{\sigma^2_i} = \frac{1}{r_{i1}} +\frac{1}{n_{i1}-r_{i1}} +\frac{1}{r_{i2}} +\frac{1}{n_{i2}-r_{i2}}.
\end{equation}

In many practical applications of meta-analysis, it is only the relative treatment effects and their estimated variance that are reported in individual studies, and not the raw data. Hence, meta-analysis often starts by treating $\hat{\Delta_{i,12}}$ and $\hat{\sigma^2_i}$ as the primary data from the $i$-th trial. The goal is then to aggregate data across trials to estimate $d_{12}$, the true treatment effect.
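
The per-trial summaries described above can be sketched in Python as follows. This is a minimal sketch that assumes the plug-in estimate $r_{ij}/n_{ij}$ for each recovery probability in the log odds ratio of equation (2), and the sum of reciprocal cell counts for its variance. The function names are hypothetical, and the sketch simply rejects trials with an empty cell rather than applying a continuity correction.

```python
import math


def _check_cells(r1, n1, r2, n2):
    """Both summaries are undefined if any arm has 0 or n recoveries."""
    if not (0 < r1 < n1 and 0 < r2 < n2):
        raise ValueError("each arm needs at least one recovery and one non-recovery")


def observed_lor(r1, n1, r2, n2):
    """Observed log odds ratio of T2 versus T1 from recovery counts."""
    _check_cells(r1, n1, r2, n2)
    return math.log(r2 * (n1 - r1) / (r1 * (n2 - r2)))


def lor_variance(r1, n1, r2, n2):
    """Estimated sampling variance: sum of reciprocals of the four cell counts."""
    _check_cells(r1, n1, r2, n2)
    return 1 / r1 + 1 / (n1 - r1) + 1 / r2 + 1 / (n2 - r2)


# Example: 40 of 100 recover on T1 and 50 of 100 recover on T2.
lor = observed_lor(40, 100, 50, 100)
var = lor_variance(40, 100, 50, 100)
```

In this example the odds of recovery are $40/60$ on $T_1$ and $50/50$ on $T_2$, so the observed effect is $\log 1.5$.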

Estimating model parameters

The estimate of $d_{12}$ is given by the weighted mean of estimates from each trial,
\begin{equation}\label{eq:d-hat}
eq(5): \quad\hat{d}_{12} = \frac{\sum_{i=1}^M w_i \hat{\Delta_{i,12}}} {\sum_{i=1}^M w_i }.
\end{equation}
The usual choice of the weight $w_i$ is the inverse of the variance estimate associated with trial $i$. For the FE model this is $w_i = \hat{\sigma_i^{-2}}$, and for the RE model, $w_i = 1/(\hat{\sigma_i^{2}}+\hat{\tau^2})$; here, $\hat{\sigma_i^2}$ is given in (4) and $\hat{\tau^2}$ in (7) below. The choice of inverse variance weights minimises the variance of the estimator $\hat{d_{12}}$, as can be shown using Lagrange multipliers. Substituting these weights in (5) and computing the variance, we obtain that
\begin{equation}
\label{eq:optimised_var}
eq(6): \quad \mbox{Var}(\hat{d_{12}}) =\frac{1}{M} \left( \frac{1}{M} \sum_{i=1}^M \frac{1}{\mbox{Var}(\hat{\Delta_{i,12}})} \right)^{-1},
\end{equation}
where $\mbox{Var}(\hat{\Delta_{i,12}})$ $=\hat{\sigma^2_i}$ in the FE model and $\hat{\sigma^2_i}+\hat{\tau^2}$ in the RE model, as noted above. In words, the variance of the meta-analysis estimate of the treatment effect is the scaled (by $1/M$) harmonic mean of the variances from the individual trials.
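
The inverse-variance pooling in (5) and (6) can be sketched as below. The function name `pooled_estimate` and its `tau2` argument are assumptions made for this illustration: passing `tau2=0` corresponds to the FE weights, and a positive value to the RE weights.

```python
def pooled_estimate(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean of trial effects and its variance.

    effects   -- observed relative effects, one per trial
    variances -- estimated within-trial variances, one per trial
    tau2      -- between-trial variance (0 for the fixed effect model)
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total_weight = sum(weights)
    d_hat = sum(w * e for w, e in zip(weights, effects)) / total_weight
    # The variance of the weighted mean is the reciprocal of the total weight,
    # which equals the harmonic mean of the trial variances divided by M.
    return d_hat, 1.0 / total_weight


# Two equally precise trials: the pooled estimate is their simple average.
d_hat, var_d_hat = pooled_estimate([0.1, 0.3], [0.04, 0.04])
```

Here both trials have variance 0.04, so the pooled variance is $0.04/2 = 0.02$.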

One class of methods for estimating the unknown heterogeneity parameter, $\tau$, is the so-called `method of moments’ [2], which equates the empirical between-trial variance with its expectation under the random effects model. The widely-used DerSimonian and Laird (DL) [1] estimator is a specific implementation of the method of moments given by
\begin{equation}
eq(7): \quad \hat{\tau^2} = \frac{\sum_{i=1}^M\hat{\sigma_i^{-2}}\left(
\hat{\Delta_{i,12}} - \frac{\sum_{l=1}^M \hat{\sigma_l^{-2}}\hat{\Delta_{l,12}}}{\sum_{l=1}^M \hat{\sigma_l^{-2}}}
\right)^2 - (M-1)}{\sum_{i=1}^M\hat{\sigma_i^{-2}} - \frac{\sum_{i=1}^M\hat{\sigma_i^{-4}}}{\sum_{i=1}^M\hat{\sigma_i^{-2}}}}.
\label{eq:DL_tau2}
\end{equation}
The right-hand side of the above formula can be negative, in which case $\hat{\tau^2}$ is set to zero.
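
A minimal Python sketch of the DL estimator in (7), including the truncation at zero, is given below. The function name is hypothetical, and returning zero for a single trial is our own convention for the case where the formula is undefined.

```python
def dersimonian_laird_tau2(effects, variances):
    """DerSimonian-Laird moment estimate of the between-trial variance."""
    M = len(effects)
    if M < 2:
        return 0.0  # assumption: no heterogeneity can be estimated from one trial
    w = [1.0 / v for v in variances]  # fixed effect weights
    sum_w = sum(w)
    d_fe = sum(wi * e for wi, e in zip(w, effects)) / sum_w
    # Weighted squared deviations from the fixed effect pooled mean.
    q = sum(wi * (e - d_fe) ** 2 for wi, e in zip(w, effects))
    denom = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # A negative moment estimate is set to zero.
    return max(0.0, (q - (M - 1)) / denom)


# Two trials with effects 0 and 1, each with within-trial variance 0.1.
tau2_hat = dersimonian_laird_tau2([0.0, 1.0], [0.1, 0.1])
```

In this example the weighted sum of squared deviations is 5 and the denominator is 10, so the estimate is $(5-1)/10 = 0.4$.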

Optimal partitioning of patients

Our aim is to determine the optimal allocation of participants across trials that yields the most precise meta-analysis estimates. We first address this analytically by seeking the allocation that minimises the variance of $\hat{d_{12}}$ in an asymptotic regime in which the number of patients tends to infinity. We complement the theoretical analysis with simulations over a wide range of numbers of patients.

Theoretical findings

To obtain analytic results, we make two simplifying assumptions. First, we assume that each trial, and each treatment within each trial, involves the same number of participants, i.e., $n_{ij}=\frac{n}{2M}$ for all $i,j$. Second, we consider the limit in which the total number of participants, $n$, and the number in each trial, $n/M$, both tend to infinity. In this limiting regime, the observed number of recoveries, $r_{ij}$, in each arm and trial, satisfies $r_{ij}=np_{ij}/2M$, where $p_{ij}$ is the true probability of recovery. Substituting this in (4) yields
\begin{equation} \label{eq:var_est_symm}
eq(8): \quad \hat{\sigma}^2_i = \frac{2Ma_i}{n}, \mbox{ where } a_i=\frac{1}{p_{i1}}+\frac{1}{1-p_{i1}}+\frac{1}{p_{i2}}+\frac{1}{1-p_{i2}}.
\end{equation}

By approximating the asymptotic distribution of $\hat{d_{12}}$, the problem of minimising the variance of $\hat{d}_{12}$ is transformed into the following optimisation problem:
\begin{equation*}
\max_{M, \tau} \left[\sum_{i=1}^M \frac{1}{2Ma_i/n+\tau^2}\right], \quad a_i := \frac{1}{p_{i1}(1-p_{i1})}+\frac{1}{p_{i2}(1-p_{i2})}.
\label{eq:opt_problem_asymtotic}
\end{equation*}
Fixed effects: Under the FE model, $\tau=0$ and, dropping the constant factor $n$, the optimisation problem reduces to
\[
\max_M \left[\frac{1}{2M}
\sum_{i=1}^M \frac{1}{a_i}
\right].
\]
Assuming the values of $a_i$ are roughly of the same order of magnitude, we approximate
$$\sum_{i=1}^M \frac{1}{a_i} \approx \frac{M}{\bar{a}}, \quad \bar{a} := \frac{1}{M}\sum_{i=1}^M a_i.$$
Hence, the objective function is independent of $M$, indicating that the partitioning of participants does not influence the precision of estimation. This result aligns with our expectation, as, in the FE model, we are only estimating the mean of the distribution and not the variance.
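
As a sketch of the cancellation, assuming the per-trial variance $2Ma_i/n$ from (8) with $\tau=0$, and the same approximation for $\sum_i 1/a_i$ as above, the total inverse-variance weight and the resulting variance from (6) work out as
\[
\sum_{i=1}^M \frac{1}{\hat{\sigma}_i^2} = \frac{n}{2M}\sum_{i=1}^M \frac{1}{a_i} \approx \frac{n}{2M}\cdot\frac{M}{\bar{a}} = \frac{n}{2\bar{a}}, \qquad \text{so} \qquad \mbox{Var}(\hat{d}_{12}) \approx \frac{2\bar{a}}{n}.
\]
Under these assumptions the number of trials $M$ cancels, leaving only the total number of participants $n$ and the average $\bar{a}$. For instance, if every recovery probability were $0.5$, then $a_i = 8$ for all $i$ and the asymptotic variance would be $16/n$, whatever the value of $M$.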

Random effects: In the RE model, we must also estimate the between-trial variance $\tau^2$; the analysis of this case is work in progress.

Empirical findings

To assess whether findings based on asymptotic performance hold in practical scenarios, we conduct a simulation study involving a total of 20,000 participants. We vary the number of trials, $M$, in unit steps from 1 to 200. The number of participants assigned to each treatment in each trial, $n_{i1}=n_{i2}$, therefore varies from 10,000 to 50. We set the true relative treatment effect equal to $d_{12}=0.05$, with heterogeneity parameter $\tau=0.1$ for the RE model.

Data simulation: For each $M$ (and corresponding $n_{i1}=n_{i2}$), we sample trial-specific relative effects $\Delta_{i,12}$ from Equation (3). To construct the corresponding recovery probabilities, we sample $p_{i,1}$ from a standard uniform distribution and calculate $p_{i,2}$ by rearranging Equation (2) to give
\[
p_{i,2} = \frac{p_{i,1}e^{\Delta_{i,12}}}{1 + p_{i,1}(e^{\Delta_{i,12}}-1)}.
\]
Finally, we simulate the number of recovered patients, $r_{ij}$, from the binomial distribution in Equation (1). This yields the simulated data set,
$$\mathcal{D}= \left\{
(r_{i,j},n_{i,j}): i \in\{1,\ldots,M\}, j\in\{1,2\}
\right\},$$
which we use to estimate the model parameters via Equations (5) and (7).
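
The data simulation steps above can be sketched in Python using only the standard library. The function name `simulate_trials`, the tuple layout of the output, the rounding down of the arm size, and the seed are all assumptions made for this illustration and may differ from the code used to produce the results below.

```python
import math
import random


def simulate_trials(n_total, M, d12, tau, rng):
    """Simulate one meta-analysis data set as a list of (r1, n1, r2, n2)."""
    n_arm = n_total // (2 * M)  # equal arms; any remainder is dropped
    data = []
    for _ in range(M):
        delta = rng.gauss(d12, tau)  # trial-specific effect, eq (3)
        p1 = rng.random()  # baseline recovery probability, standard uniform
        odds_ratio = math.exp(delta)
        # Recovery probability on T2, obtained by rearranging eq (2).
        p2 = p1 * odds_ratio / (1.0 + p1 * (odds_ratio - 1.0))
        # Binomial counts from eq (1), drawn as sums of Bernoulli trials.
        r1 = sum(rng.random() < p1 for _ in range(n_arm))
        r2 = sum(rng.random() < p2 for _ in range(n_arm))
        data.append((r1, n_arm, r2, n_arm))
    return data


# Ten trials sharing 20,000 participants, i.e. 1,000 per arm.
data = simulate_trials(20000, 10, d12=0.05, tau=0.1, rng=random.Random(1))
```

Because the baseline probability is drawn uniformly, some simulated trials can have very few recoveries or non-recoveries in an arm, which matters when the arm size is small.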

For each $M$, we repeat the simulation 100 times and calculate the median and the interquartile range (IQR) of the estimates $\hat{d_{12}}$ and $\hat{\tau}$.

Estimation of $\hat{d}_{12}$ in FE model: The following figure shows the median and IQR of the estimated mean relative treatment effect $\hat{d}_{12}$ and its standard error for the FE model. As $M$ increases, the standard error on $\hat{d}_{12}$ increases while its estimate fluctuates around the true parameter value. This indicates that the FE estimate $\hat{d}_{12}$ becomes less precise as participants are partitioned into more trials (with fewer participants in each).

Empirical results for estimated mean of relative effects in FE model

Estimation of $\hat{d}_{12}$ in RE model: The following figure shows the estimated mean and standard error of the relative treatment effect in the RE model. As before, the estimated mean is not affected by the number of trials. The standard error exhibits an initial sharp increase from $M=1$ (one large trial) and then decreases until the number of trials reaches approximately 40. After this, the standard error remains almost fixed. This indicates that for more than one trial, the estimated mean relative treatment effect is more precise when participants are partitioned into more trials.

Empirical results for estimated mean of relative effects in RE model

Estimation of $\hat{\tau}$ in RE model: The following figure shows the estimated mean and standard error of the heterogeneity estimate $\hat\tau$ in the RE model. For very few trials ($M<6$), heterogeneity is underestimated (at $M=1$ the estimate is zero, since a single trial provides no between-trial variation). As the number of trials increases, $\hat\tau$ fluctuates about its true value but with increasing variation (IQR). Beyond $M=1$ (where the standard error is necessarily zero), the standard error on $\hat{\tau}$ decreases with increasing $M$ up to approximately $M=10$, after which it increases again. This suggests that, for the scenario simulated in this study, the precision of the heterogeneity estimate is optimal when participants are partitioned into about 10 trials (with $n_{i1}=n_{i2}=1000$).

Empirical results for estimated heterogeneity in RE model

Summary: Even with a large number of participants, the theoretical results only hold for a smaller number of trials. This is because the number of participants per trial decreases when partitioning into more trials.

Future work: As our simulation only extended to 200 trials, it did not investigate scenarios with small numbers of participants per trial. In future work we will explore these more extreme scenarios, taking the number of trials to its maximum (i.e. with one participant per treatment in each trial). We will also investigate the generalizability of our findings to other parameter values ($d_{12}$ and $\tau$), continuous rather than binary outcomes, and Bayesian inference methods.

References

[1] Rebecca DerSimonian and Nan Laird. Meta-analysis in clinical trials. Controlled clinical trials, 7(3):177–188, 1986.
[2] Rebecca DerSimonian and Raghu Kacker. Random-effects model for meta-analysis of clinical trials: an update. Contemporary clinical trials, 28(2):105–114, 2007.

Compass at AI UK

Compass CDT students and staff at the AI UK 2025 Conference. From left to right: Dr Dan Lawson, Emma Ceccherini, Sam Bowyer, Sherman Khoo and Helen Mawdlsey

