## Student Perspectives: Group Testing

A post by Rahil Morjaria, PhD student on the Compass programme.

### What is Group Testing?

Group Testing was first introduced in the 1940s as a way to test military recruits for syphilis during World War II. By combining blood samples, they hoped to reduce the total number of tests needed to detect the diseased individuals (compared to testing each recruit individually).

Example of pooling blood samples, where red and green depicts diseased and not diseased respectively.

Since then, there have been many advances in Group Testing with applications not just in detecting diseased individuals but also in communications, cybersecurity and data storage. In essence, whenever we have a situation where we need to detect a proportionally rare occurrence, Group Testing is probably applicable.

More formally, if we have some diseased set of individuals of size $k$ of a total population $n$, it might be considered (instead of testing each individual separately) to pool people together into groups (with replacement) and test these groups.

Matrix Form of Group Testing (each row depicting a test and each column depicting a member of the population) [1].

We often write our test design in matrix form where each row is a group/test and each column indicates a member of the population, where a $1$ indicates a individuals inclusion in a test.

The relationship between $k$ and $n$ is quite important, our focus is on the sparse case in which $k = O(n^\alpha)$ where $\alpha \in (0,1)$.

Often we assume our test apparatus is sensitive enough where a single diseased individual in the test will give us a positive result (as shown in the image above) this is known as the noiseless case (there is vast amounts of work done for different types of noise, for more information check out [1]).

Adaptive Group Testing (as the name suggests) allows us to adapt our subsequent tests by the results of the previous. If we compare this to Non-Adaptive Group Testing, in which we have to define our tests (and thus our groups) before we obtain any results, we can expect stronger results.

As our tests have binary outputs, we can obtain at most $1$ bit of information per test. As there are $\binom{n}{k}$ possible defective sets, we would need $\log_2\binom{n}{k}$ bits to uniquely represent each possible set. This gives us the limit of $\log_2\binom{n}{k}$ tests needed, this is known as a converse result, a fundamental limit which we are unable to overcome.

In the noiseless case, Adaptive Group Testing is able to achieve this fundamental limit. First we split our total items $n$ into $k$ (the number of defective items) subsets of length $n/k$ without replacement, and then perform Binary Splitting.

While Adaptive Group Testing is able to reach fundamental limits, our main focus is on Non-Adaptive Group Testing. Non-Adaptive Group Testing has many applications due to it’s ability to be ran in parallel (and other ease of use situations).

Non-Adaptive Group Testing procedures are often designed randomly (in which each items inclusion in a test is $Bern(v)$ for some $v$) or with near constant column weight (each item is in ‘nearly’ the same amount of tests). Out of these 2 designs near constant column weight gives stronger results.

A graph comparing different group testing designs, where the $\text{Rate} = \log_2\binom{n}{k}/T$ where $T$ is the number of tests needed to recover all the defective items (with high probability for the red lines and with certainty for the purple line). [1]

### Goals

While there are many strong results in Group Testing, there is still much to explore. From looking at List Decoding (in which we allow a list of possible defective sets to be outputted), other forms of noise and our efficient algorithms still not being able to match up to our theoretical achievable methods, there is work to be done in all aspects. With improvements in technology, combined with the myriad of applications, the future of Group Testing definitely looks bright!

### References:

[1] M. Aldridge, O. Johnson, and J. Scarlett. Group testing: an information theory perspective. CoRR, abs/1902.06002, 2019. URL http://arxiv.org/abs/1902.06002.

## Student Perspectives: SPREE Methods for Small Area Estimation

A post by Codie Wood, PhD student on the Compass programme.

This blog post is an introduction to structure preserving estimation (SPREE) methods. These methods form the foundation of my current work with the Office for National Statistics (ONS), where I am undertaking a six-month internship as part of my PhD. During this internship, I am focusing on the use of SPREE to provide small area estimates of population characteristic counts and proportions.

# Small area estimation

Small area estimation (SAE) refers to the collection of methods used to produce accurate and precise population characteristic estimates for small population domains. Examples of domains may include low-level geographical areas, or population subgroups. An example of an SAE problem would be estimating the national population breakdown in small geographical areas by ethnic group [2015_Luna].

Demographic surveys with a large enough scale to provide high-quality direct estimates at a fine-grain level are often expensive to conduct, and so smaller sample surveys are often conducted instead.

SAE methods work by drawing information from different data sources and similar population domains in order to obtain accurate and precise model-based estimates where sample counts are too small for high quality direct estimates. We use the term small area to refer to domains where we have little or no data available in our sample survey.

SAE methods are frequently relied upon for population characteristic estimation, particularly as there is an increasing demand for information about local populations in order to ensure correct allocation of resources and services across the nation.

## Structure preserving estimation

Structure preserving estimation (SPREE) is one of the tools used within SAE to provide population composition estimates. We use the term composition here to refer to a population break down into a two-way contingency table containing positive count values. Here, we focus on the case where we have a population broken down into geographical areas (e.g. local authority) and some subgroup or category (e.g. ethnic group or age).

Orginal SPREE-type estimators, as proposed in [1980_Purcell], can be used in the case when we have a proxy data source for our target composition, containing information for the same set of areas and categories but that may not entirely accurately represent the variable of interest. This is usually because the data are outdated or have a slightly different variable definition than the target.

We also incorporate benchmark estimates of the row and column totals for our composition of interest, taken from trusted, quality assured data sources and treated as known values. This ensures consistency with higher level known population estimates. SPREE then adjusts the proxy data to the estimates of the row and column totals to obtain the improved estimate of the target composition.

An illustration of the data required to produce SPREE-type estimates.

In an extension of SPREE, known as generalised SPREE (GSPREE) [2004_Zhang], the proxy data can also be supplemented by sample survey data to generate estimates that are less subject to bias and uncertainty than it would be possible to generate from each source individually. The survey data used is assumed to be a valid measure of the target variable (i.e. it has the same definition and is not out of date), but due to small sample sizes may have a degree of uncertainty or bias for some cells.

The GSPREE method establishes a relationship between the proxy data and the survey data, with this relationship being used to adjust the proxy compositions towards the survey data.

An illustration of the data required to produce GSPREE estimates.

GSPREE is not the only extension to SPREE-type methods, but those are beyond the scope of this post. Further extensions such as Multivariate SPREE are discussed in detail in [2016_Luna].

### Original SPREE methods

First, we describe original SPREE-type estimators. For these estimators, we require only well-established estimates of the margins of our target composition.

We will denote the target composition of interest by $\mathbf{Y} = (Y{aj})$, where $Y{aj}$ is the cell count for small area $a = 1,\dots,A$ and group $j = 1,\dots,J$. We can write $\mathbf Y$ in the form of a saturated log-linear model as the sum of four terms,

$$\log Y_{aj} = \alpha_0^Y + \alpha_a^Y + \alpha_j^Y + \alpha_{aj}^Y.$$

There are multiple ways to write this parameterisation, and here we use the centered constraints parameterisation given by $$\alpha_0^Y = \frac{1}{AJ}\sum_a\sum_j\log Y_{aj},$$ $$\alpha_a^Y = \frac{1}{J}\sum_j\log Y_{aj} – \alpha_0^Y,$$ $$\alpha_j^Y = \frac{1}{A}\sum_a\log Y_{aj} – \alpha_0^Y,$$ $$\alpha_{aj}^Y = \log Y_{aj} – \alpha_0^Y – \alpha_a^Y – \alpha_j^Y,$$

which satisfy the constraints $\sum_a \alpha_a^Y = \sum_j \alpha_j^Y = \sum_a \alpha_{aj}^Y = \sum_j \alpha_{aj}^Y = 0.$

Using this expression, we can decompose $\mathbf Y$ into two structures:

1. The association structure, consisting of the set of $AJ$ interaction terms $\alpha_{aj}^Y$ for $a = 1,\dots,A$ and $j = 1,\dots,J$. This determines the relationship between the rows (areas) and columns (groups).
2. The allocation structure, consisting of the sets of terms $\alpha_0^Y, \alpha_a^Y,$ and $\alpha_j^Y$ for $a = 1,\dots,A$ and $j = 1,\dots,J$. This determines the size of the composition, and differences between the sets of rows (areas) and columns (groups).

Suppose we have a proxy composition $\mathbf X$ of the same dimensions as $\mathbf Y$, and we have the sets of row and column margins of $\mathbf Y$ denoted by $\mathbf Y_{a+} = (Y_{1+}, \dots, Y_{A+})$ and $\mathbf Y_{+j} = (Y_{+1}, \dots, Y_{+J})$, where $+$ substitutes the index being summed over.

We can then use iterative proportional fitting (IPF) to produce an estimate $\widehat{\mathbf Y}$ of $\mathbf Y$ that preserves the association structure observed in the proxy composition $\mathbf X$. The IPF procedure is as follows:

1. Rescale the rows of $\mathbf X$ as $$\widehat{Y}_{aj}^{(1)} = X_{aj} \frac{Y_{+j}}{X_{+j}},$$
2. Rescale the columns of $\widehat{\mathbf Y}^{(1)}$ as $$\widehat{Y}_{aj}^{(2)} = \widehat{Y}_{aj}^{(1)} \frac{Y_{a+}}{\widehat{Y}_{a+}^{(1)}},$$
3. Rescale the rows of $\widehat{\mathbf Y}^{(2)}$ as $$\widehat{Y}_{aj}^{(3)} = \widehat{Y}_{aj}^{(2)} \frac{Y_{+j}}{\widehat{Y}_{+j}^{(2)}}.$$

Steps 2 and 3 are then repeated until convergence occurs, and we have a final composition estimate denoted by $\widehat{\mathbf Y}^S$ which has the same association structure as our proxy composition, i.e. we have $\alpha_{aj}^X = \alpha_{aj}^Y$ for all $a \in \{1,\dots,A\}$ and $j \in \{1,\dots,J\}.$ This is a key assumption of the SPREE implementation, which in practise is often restrictive, motivating a generalisation of the method.

### Generalised SPREE methods

If we can no longer assume that the proxy composition and target compositions have the same association structure, we instead use the GSPREE method first introduced in [2004_Zhang], and incorporate survey data into our estimation process.

The GSPREE method relaxes the assumption that $\alpha_{aj}^X = \alpha_{aj}^Y$ for all $a \in \{1,\dots,A\}$ and $j \in \{1,\dots,J\},$ instead imposing the structural assumption $\alpha_{aj}^Y = \beta \alpha_{aj}^X$, i.e. the association structure of the proxy and target compositions are proportional to one another. As such, we note that SPREE is a particular case of GSPREE where $\beta = 1$.

Continuing with our notation from the previous section, we proceed to estimate $\beta$ by modelling the relationship between our target and proxy compositions as a generalised linear structural model (GLSM) given by
$$\tau_{aj}^Y = \lambda_j + \beta \tau_{aj}^X,$$ with $\sum_j \lambda_j = 0$, and where \begin{align} \tau_{aj}^Y &= \log Y_{aj} – \frac{1}{J}\sum_j\log Y_{aj},\\ &= \alpha_{aj}^Y + \alpha_j^Y, \end{align} and analogously for $\mathbf X$.

It is shown in [2016_Luna] that fitting this model is equivalent to fitting a Poisson generalised linear model to our cell counts, with a $\log$ link function. We use the association structure of our proxy data, as well as categorical variables representing the area and group of the cell, as our covariates. Then we have a model given by $$\log Y_{aj} = \gamma_a + \tilde{\lambda}_j + \tilde{\beta}\alpha_{aj}^X,$$ with $\gamma_a = \alpha_0^Y + \alpha_a^Y$, $\tilde\lambda_j = \alpha_j^Y$ and $\tilde\beta \alpha_{aj}^X = \alpha_{aj}^Y.$

When fitting the model we use survey data $\tilde{\mathbf Y}$ as our response variable, and are then able to obtain a set of unbenchmarked estimates of our target composition. The GSPREE method then benchmarks these to estimates of the row and column totals, following a procedure analagous to that undertaken in the orginal SPREE methodology, to provide a final set of estimates for our target composition.

# ONS applications

The ONS has used GSPREE to provide population ethnicity composition estimates in intercensal years, where the detailed population estimates resulting from the census are outdated [2015_Luna]. In this case, the census data is considered the proxy data source. More recent works have also used GSPREE to estimate counts of households and dwellings in each tenure at the subnational level during intercensal years [2023_ONS].

My work with the ONS has focussed on extending the current workflows and systems in place to implement these methods in a reproducible manner, allowing them to be applied to a wider variety of scenarios with differing data availability.

# References

[1980_Purcell] Purcell, Noel J., and Leslie Kish. 1980. ‘Postcensal Estimates for Local Areas (Or Domains)’. International Statistical Review / Revue Internationale de Statistique 48 (1): 3–18. https://doi.org/10/b96g3g.

[2004_Zhang] Zhang, Li-Chun, and Raymond L. Chambers. 2004. ‘Small Area Estimates for Cross-Classifications’. Journal of the Royal Statistical Society Series B: Statistical Methodology 66 (2): 479–96. https://doi.org/10/fq2ftt.

[2015_Luna] Luna Hernández, Ángela, Li-Chun Zhang, Alison Whitworth, and Kirsten Piller. 2015. ‘Small Area Estimates of the Population Distribution by Ethnic Group in England: A Proposal Using Structure Preserving Estimators’. Statistics in Transition New Series and Survey Methodology 16 (December). https://doi.org/10/gs49kq.

[2016_Luna] Luna Hernández, Ángela. 2016. ‘Multivariate Structure Preserving Estimation for Population Compositions’. PhD thesis, University of Southampton, School of Social Sciences. https://eprints.soton.ac.uk/404689/.

[2023_ONS] Office for National Statistics (ONS), released 17 May 2023, ONS website, article, Tenure estimates for households and dwellings, England: GSPREE compared with Census 2021 data

## Student Perspectives: Semantic Search

A post by Ben Anson, PhD student on the Compass programme.

## Semantic Search

Semantic search is here. We already see it in use in search engines [13], but what is it exactly and how does it work?

Search is about retrieving information from a corpus, based on some query. You are probably using search technology all the time, maybe $\verb|ctrl+f|$, or searching on google. Historically, keyword search, which works by comparing the occurrences of keywords between queries and documents in the corpus, has been surprisingly effective. Unfortunately, keywords are inherently restrictive – if you don’t know the right one to use then you are stuck.

Semantic search is about giving search a different interface. Semantic search queries are provided in the most natural interface for humans: natural language. A semantic search algorithm will ideally be able to point you to a relevant result, even if you only provided the gist of your desires, and even if you didn’t provide relevant keywords.

Figure 1 illustrates a concrete example where semantic search might be desirable. The query ‘animal’ should return both the dog and cat documents, but because the keyword ‘animal’ is not present in the cat document, the keyword model fails. In other words, keyword search is susceptible to false negatives.

Transformer neural networks turn out to be very effective for semantic search [1,2,3,10]. In this blog post, I hope to elucidate how transformers are tuned for semantic search, and will briefly touch on extensions and scaling.

### The search problem, more formally

Suppose we have a big corpus $\mathcal{D}$ of documents (e.g. every document on wikipedia). A user sends us a query $q$, and we want to point them to the most relevant document $d^*$. If we denote the relevance of a document $d$ to $q$ as $\text{score}(q, d)$, the top search result should simply be the document with the highest score,

$$d^* = \mathrm{argmax}_{d\in\mathcal{D}}\, \text{score}(q, d).$$

This framework is simple and it generalizes. For $\verb|ctrl+f|$, let $\mathcal{D}$ be the set of individual words in a file, and $\text{score}(q, d) = 1$ if $q=d$ and $0$ otherwise. The venerable keyword search algorithm BM25 [4], which was state of the art for decades [8], uses this score function.

For semantic search, the score function is often set as the inner product between query and document embeddings: $\text{score}(q, d) = \langle \phi(q), \phi(d) \rangle$. Assuming this score function actually works well for finding relevant documents, and we use a simple inner product, it is clear that the secret sauce lies in the embedding function $\phi$.

### Transformer embeddings

We said above that a common score function for semantic search is $\text{score}(q, d) = \langle \phi(q), \phi(d) \rangle$. This raises two questions:

• Question 1: what should the inner product be? For semantic search, people tend to use the cosine similarity for their inner product.
•  Question 2: what should $\phi$ be? The secret sauce is to use a transformer encoder, which is explained below.

#### Quick version

Transformers magically gives us a tunable embedding function $\phi: \text{“set of all pieces of text”} \rightarrow \mathbb{R}^{d_{\text{model}}}$, where $d_{\text{model}}$ is the embedding dimension.

#### More detailed version

See Figure 2 for an illustration of how a transformer encoder calculates an embedding for a piece of text. In the figure we show how to encode “cat”, but we can encode arbitrary pieces of text in a similar way. The transformer block details are out of scope here; though, for these details I personally found Attention is All You Need [9] helpful, the crucial part being the Multi-Head Attention which allows modelling dependencies between words.

The transformer encoder is very flexible, with almost every component parameterized by a learnable weight / bias – this is why it can be used to model the complicated semantics in natural language. The pooling step in Figure 2, where we map our sequence embedding $X’$ to a fixed size, is not part of a ‘regular’ transformer, but it is essential for us. It ensures that our score function $\langle \phi(q), \phi(d) \rangle$ will work when $q$ and $d$ have different sizes.

### Making the score function good for search

There is a massive issue with transformer embedding as described above, at least for our purposes – there is no reason to believe it will satisfy simple semantic properties, such as,

$\text{score}(\text{“busy place”}, \text{“tokyo”}) > \text{score}(\text{“busy place”}, \text{“a small village”})$

‘But why would the above not work?’ Because, of course, transformers are typically trained to predict the next token in a sequence, not to differentiate pieces of text according to their semantics.

The solution to this problem is not to eschew transformer embeddings, but rather to fine-tune them for search. The idea is to encourage the transformer to give us embeddings that place semantically dissimilar items far apart. E.g. let $q=$’busy place’, then we want $d^+=$’tokyo’ to be close to $q$ and $d^-=$’a small village’ to be far away.

This semantic separation can be achieved by fine-tuning with a contrastive loss [1,2,3,10],

$$\text{maximize}_{\theta}\,\mathcal{L} = \log \frac{\exp(\text{score}(q, d^+))}{\exp(\text{score}(q, d^+)) + \exp(\text{score}(q, d^-))},$$

where $\theta$ represents the transformer parameters. The $\exp$’s in the contastive loss are to ensure we never divide by zero. Note that we can interpret the contrastive loss as doing classification since we can think of the argument to the logarithm as $p(d^+ | q)$.

That’s all we need, in principle, to turn a transformer encoder into a text embedder! In practice, the contrastive loss can be generalized to include more positive and negative examples, and it is indeed a good idea to have a large batch size [11] (intuitively it makes the separation of positive and negative examples more difficult, resulting in a better classifier). We also need a fine-tuning dataset – a dataset of positive/negative examples. OpenAI showed that it is possible to construct one in an unsupervised fashion [1]. However, there are also publicly available datasets for supervised fine-tuning, e.g. MSMARCO [12].

### Extensions

One really interesting avenue of research is training of general purposes encoders. The idea is to provide instructions alongside the queries/documents [2,3]. The instruction could be $\verb|Embed this document for search: {document}|$ (for the application we’ve been discussing), or $\verb|Embed this document for clustering: {document}|$ to get embeddings suitable for clustering, or $\verb|Embed this document for sentiment analysis: {document}|$ for embeddings suitable for sentiment analysis. The system is fine-tuned end-to-end with the appropriate task, e.g. a contrastive learning objective for the search instruction, a classification objective for sentiment analysis, etc., leaving us with an easy way to generate embeddings for different tasks.

### A note on scaling

The real power of semantic (and keyword) search comes when a search corpus is too large for a human to search manually. However if the corpus is enormous, we’d rather avoid looking at every document each time we get a query. Thankfully, there are methods to avoid this by using specially tailored data structures: see Inverted Indices for keyword algorithms, and Hierarchical Navigable Small World graphs [5] for semantic algorithms. These both reduce search time complexity from $\mathcal{O}(|\mathcal{D}|)$ to $\mathcal{O}(\log |\mathcal{D}|)$, where $|\mathcal{D}|$ is the corpus size.

There are many startups (Pinecone, Weviate, Milvus, Chroma, etc.) that are proposing so-called vector databases – databases in which embeddings are stored, and queries can be efficiently performed. Though, there is also work contesting the need for these types of database in the first place [7].

### Summary

We summarised search, semantic search, and how transformers are fine-tuned for search with a contrastive loss. I personally find this a very nice area of research with exciting real-world applications – please reach out (ben.anson@bristol.ac.uk) if you’d like to discuss it!

### References

[1]: Text and code embeddings by contrastive pre-training, Neelakantan et al (2022)

[2]: Task-aware Retrieval with Instructions, Asai et al (2022)

[3]: One embedder, any task: Instruction-finetuned text embeddings, Su et al (2022)

[4]: Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval, Robertson and Walker (1994)

[5]: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs, https://arxiv.org/abs/1603.09320

[6]: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, https://arxiv.org/abs/2010.11929

[7]: Vector Search with OpenAI Embeddings: Lucene Is All You Need, arXiv preprint arXiv:2308.14963

[8]: Complement lexical retrieval model with semantic residual embeddings, Advances in Information Retrieval (2021)

[9]: Attention is all you need, Advances in neural information processing systems (2017)

[10]: Sgpt: Gpt sentence embeddings for semantic search, arXiv preprint arXiv:2202.08904

[11]: Contrastive representation learning: A framework and review, IEEE Access 8 (2020)

[12]: Ms marco: A human generated machine reading comprehension dataset, arXiv preprint arXiv:1611.09268

## Student Perspectives: Machine Learning Models for Probability Distributions

A post by Tennessee Hickling, PhD student on the Compass programme.

Introduction
Probabilistic modelling provides a consistent way to deal with uncertainty in data. The central tool in this methodology is the probability distribution, which describes the randomness of observations. To create effective models of reality, we need to be able to specify probability distributions that are flexible enough to capture real phenomena whilst remaining feasible to estimate. In the past decade machine learning (ML) has developed many new and exciting ways to represent and learn potentially complex probability distributions.

ML has provided significant advances in modelling of high dimensional and highly structured data such as images or text. Many of these modern approaches are applied as “generative models”. The goal of such approaches is to sample new synthetic observations from an approximate distribution which closely matches the target distribution. For example, we may use many images of cats to learn an approximate distribution, from which we can sample new images of cats that appear realistic. Usually, a “generative model” indicates the requirement to sample from the model, but not necessarily assign probabilities to observed data. In this case, the model captures uncertainty by imitating the structure and randomness of the data.

Many of these modern methods work by transforming simple randomness (such as a Normal distribution) into the target complex randomness. In my own research, I work on a known limitation of such approaches to replicate a particular aspect of randomness – the tails of probability distributions [1, 11]. In this post, I wanted to take a step back and provide an overview of and connections between two ML methods that can be used to model probability distributions – Normalising Flows (NFs) and Variational Autoencoders (VAEs).

Figure 1: Basic illustration of ML learning of a distribution. We optimise the machine learning model to produce a distribution close to our target. This is often conceptualised in the generative direction, such that our ML model moves samples from the simple distribution to more closely match the target observations.

Some Background
Consider real valued vectors $z \in \mathbb{R}^{d_z}$ and $x \in \mathbb{R}^{d_x}$. In this post I mirror notation used in [2], where $p(x)$ refers to the density and distribution of $x$ and $x \sim p(x)$ indicates samples according to that distribution. The generic set up I am considering is that of density estimation – trying to model the distribution $p(x)$ of some observed data $\{x_i\}_{i=1}^{N}$. I use a semicolon to denote parameters, so $p(x; \beta)$ is a distribution over $x$ with parameters $\beta$. I also make use of different letters to distinguish different distributions, for example using $q(x)$ to denote an approximation to $p(x)$. The notation $\mathbb{E}_p[f(x)]$ refers to the standard expectation of $f(x)$ over the distribution $p$.

The discussed methods introduce some simple source of randomness arising from a known, simple latent distribution $p(z)$. This is also referred to in some literature as the prior, though the usage is not straightforwardly relatable to traditional Bayesian concepts. The goal is then to fit an approximate $q(x|z; \theta)$, that is a conditional distribution, such that $$q(x; \theta) = \int q(x|z; \theta)p(z)dz \approx p(x),$$in words, the marginal density over $x$ implied by the conditional density, is close to our target distribution $p(x)$. In general, we can’t compute $q(x)$, as we can’t solve the above integral for very flexible $q(x | z; \theta)$.

Variational Inference
We commonly make use of the Kullback-Leibler (KL) divergence, which can be interpreted as measuring the difference between two probability distributions. It is a useful practical tool, since we can compute and optimise the quantity in a wide variety of situations. Techniques which optimise a probability distribution using such divergences are known as variational methods. There are other choices of divergence, but KL is the most standard. Important properties of KL are that the quantity $KL(p|| q)$ is non-negative and non-symmetric i.e. $KL(p|| q) \neq KL(q || p)$.

Given this, we can see that a natural objective is to minimise the difference between distributions, as measured by the KL, $$KL(p(x) || q(x; \theta)) = \int p(x) \log \frac{p(x)}{q(x; \theta)} dx.$$Advances in this area have mostly developed new ways to make this optimisation tractable for flexible $q(x | z; \theta)$.

Normalising Flow
A normalising flow is a parameterised transformation of a random variable. The key simplifying assumption is that the forward generation is deterministic. That is, for $d_x = d_z$, that $$x = T(z; \theta),$$for some transformation function $T$. We additionally require that $T$ is a differentiable bijection. Given these requirements, we can express the approximate density of $x$ exactly as $$q_x(x; \theta) = p_z(T^{-1}(x; \theta))\big|\text{det } J_{T^{-1}}(x; \theta)\big|.$$Here, $\text{det }J_{T^{-1}}$ is the determinant of the Jacobian of the inverse transformation. Research on NFs has developed numerous ways to make the computation of the Jacobian term tractable. The standard approach is to use neural networks to produce $\theta$ (the parameters of the transformation), with numerous ways of configuring the model to capture dependency between dimensions. Additionally, we often stack many layers to provide more flexibility. See [10] and the review [2] for more details on how this is achieved.

As we have access to an analytic approximate density, we can minimise the negative log-likelihood of our model, $$\mathcal{J}(\theta) = -\sum_{i=1}^{N} \log q(x_i; \theta),$$which is the Monte-Carlo approximation of the KL loss (up to an additive constant). This is straightforward to optimise using stochastic gradient descent [9] and automatic differentiation.

Figure 2: Schematic of NF model. The ML model produces the parameters of our transformation, which are identical in the forward and backwards directions. We choose the transformation such that we can express an analytic density function for our approximate distribution.

Variational Autoencoder
In the Variational Autoencoder (VAE) [3] the conditional distribution $q(x| z; \theta)$ is known as the decoder. VAEs consider the marginal in terms of the posterior, that is $$q(x; \theta) = \frac{q(x | z; \theta)p(z)}{q(z | x; \theta)}.$$The posterior $q(z | x; \theta)$ is itself not generally tractable. VAEs proceed by introducing an encoder, which approximates $q(z | x; \theta)$. This is itself simply a conditional distribution $e(z | x; \psi)$. We use this approximation to express the log marginal over $x$ as below.
\begin{align} \log q(x; \theta) &= \mathbb{E}_{e}\bigg[\log q(x; \theta)\frac{e(z | x; \psi)}{e(z | x; \psi)}\bigg] \\ &= \mathbb{E}_{e}\bigg[\log\frac{q(x | z; \theta)p(z)}{q(z | x; \theta)}\frac{e(z | x; \psi)}{e(z | x; \psi)}\bigg] \\ &= \mathbb{E}_{e}\bigg[\log\frac{q(x | z; \theta)p(z)}{e(z | x; \psi)}\bigg] + KL(e(z | x; \psi) || q(z | x; \theta)) \\ &= \mathcal{J}_{\theta,\psi} + KL(e(z | x; \psi) || q(z | x; \theta)) \end{align}
The additional approximation gives a more complex expression and does not provide an analytical approximate density. However, as $KL(e(z | x; \psi) || q(z | x; \theta))$ is positive, $\mathcal{J}_{\theta,\psi}$ forms a lower bound on $\log q(x; \theta)$. This expression is commonly referred to as the Evidence Lower Bound (or ELBO). The second term, the divergence between our encoder and the implied posterior will be minimised as we increase $\mathcal{J}_{\theta,\psi}$ in $\psi$. As we increase $\mathcal{J}_{\theta,\psi}$, we will either be increasing $\log q(x; \theta)$ or reducing $KL(e(z | x; \psi) || q(z | x; \theta))$.

The goal is to increase $\log q(x; \theta)$, which we hope occurs as we optimise over both parameters. This ambiguity in optimisation results in well known issues, such as posterior collapse [12] and can result in some counter-intuitive behaviour [4]. Despite this, VAEs remain a powerful and popular approach. An important benefit is that we no longer require $d_x = d_z$, which means we can map high dimensional $x$ to low dimensional $z$ to perform dimensionality reduction.

Figure 3: Schematic of VAE model. We now have a stochastic forward transformation. To optimise this we introduce a decoder model which approximates the posterior implied by the forward transformation. We now have a more flexible transformation, but two models to train and no analytic approximate density.

Surjective Flows
We can now identify a connection between NFs and VAEs. Recent work has reinterpreted NFs within the VAE framework, permitting a broader class of transitions whilst retaining analytic tractability of NFs [5]. Considering our decoding conditional as $$q(x|z; \theta) = \delta(x – T(z; \theta)),$$ we have the posterior exactly as
$$q(z|x; \theta) = \delta(z – T^{-1}(x; \theta)),$$
where $\delta$ is the dirac delta function.

This provides a view of a NF as a special case of a VAE, where we don’t need to approximate the posterior. Considering our VAE approximation, $$\log q(x; \theta) = \mathbb{E}_e\big[\log p(z) + \log\frac{q(x | z; \theta)}{e(z | x; \psi)}\big] + KL(e(z | x; \psi) || q(z | x; \theta)),$$and taking $e(z | x; \psi) = q(z | x; \theta)$, then the final KL term is 0 by definition. In that case, we recover our analytic density for NFs (see [5] for details).

Note that accessing an analytic density depends on having $e(z | x; \psi) = q(z | x; \theta)$ and computing $$\mathbb{E}_e\big[\log p(z) + \log\frac{q(x | z; \theta)}{e(z | x; \psi)}\big].$$These requirements are actually weaker than those we apply in the case of standard NFs. Consider a deterministic transformation $T^{-1}(x)$ which is surjective, crucially we can have many $x$ which map to the same $z$, so no longer have an analytic inverse. However we can still choose a $q(x | z)$ which is stochastic inverse of $T^{-1}$. For example, consider an absolute value surjection, $q(z | x) = \delta(z – |x|)$, to invert this transformation we can choose $q(x | z) = \frac{1}{2}\delta(x – z) + \frac{1}{2}\delta(x + z)$, which we can sample from straight-forwardly. This transformation enforces symmetry across the origin in the approximate distribution, a potentially useful inductive bias. In this example, and many others, we can also compute the density exactly. This has led to a number of interesting extensions to NFs, such as “funnel flows” which have $d_z < d_x$ but retain an analytic approximate density [6]. As we retain an analytic approximate density, we can optimise them in the same way as NFs.

Figure 4: Schematic of a surjective transformation. We have a stochastic forward transformation, but the inverse is deterministic. This restricts what transformations we can have, but retains an analytic approximate density.

Conclusion
I have presented an overview of two widely-used methods for modelling probability distributions with machine learning techniques. I’ve also highlighted an interesting connection between these methods, an area of research that has led to the development of interesting new models. It’s worth noting that other important classes of ML models such as Generative Adversarial Networks and Diffusion models can also be interpreted as approximate probability distributions. There are many superficial connections between those methods and the ones discussed here. Exploring these theoretical similarities presents an compelling direction of research to sharpen understanding of such models’ relative advantages. Another promising direction is the synthesis of these methods, where researchers aim to harness the strengths of each approach [7,8]. This not only enhances the existing knowledge base but also offers opportunities for innovative applications in the field.

References
[1] Jaini, Priyank, et al. “Tails of Lipschitz triangular flows.” International Conference on Machine Learning. PMLR, 2020
[2] Papamakarios, G., et al. “Normalizing flows for probabilistic modeling and inference.” Journal of Machine Learning Research, 2021
[3] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv:1312.6114, 2013
[4] Rainforth, Tom, et al. “Tighter variational bounds are not necessarily better.” International Conference on Machine Learning. PMLR, 2018
[5] Nielsen, Didrik, et al. “Survae flows: Surjections to bridge the gap between vaes and flows.” Advances in Neural Information Processing, 2020
[6] Samuel Klein, et al. “Funnels: Exact maximum likelihood with dimensionality reduction.” arXiv:2112.08069, 2021
[7] Kingma, Durk P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in neural information processing systems , 2016
[8] Zhang, Qinsheng, and Yongxin Chen. “Diffusion normalizing flow.” Advances in Neural Information Processing Systems 34, 2021
[10] Daniel Ward “An introduction to normalising flows” https://compass.blogs.bristol.ac.uk/2022/07/13/2608/
[11] Tennessee Hickling and Dennis Prangle, “Flexible Tails for Normalising Flows, with Application to the Modelling of Financial Return Data”, 8th MIDAS Workshop, ECML PKDD, 2023
[12] Lucas, James, et al. “Don’t blame the elbo! a linear vae perspective on posterior collapse.” Advances in Neural Information Processing Systems 32, 2019

## Student perspectives: Compass Annual Conference 2023

A post by Dominic Broadbent, PhD student on the Compass CDT, and Michael Whitehouse, PhD student of the Compass CDT (recently submitted thesis)

## Introduction

September saw the second annual Compass Conference, hosted in the Fry Building, the home of the School of Mathematics. The event was particularly special as it is the first time that all five Compass cohorts were brought together, and it was a fantastic opportunity to celebrate the achievements and research of the Compass CDT with external partners.  This year the theme was “Communicating Research in Context“, focusing on how research can be better communicated, and the need to highlight the motivation and applications of mathematical research.

## Research talks

The day began with four long form talks touching on the topic of communicating research. Starting with Alessio Zakaria’s talk which delved into Hypothesis tests, commenting on their criticial role as the defacto statistical tool across the sciences, and how p-hacking has led to a replication crisis that undermines public confidence in research. The next talk by Sam Stockman and Emerald Dilworth discussed the challenges they faced, and the key takeaways from their shared experience of communicating mathematics with researchers in the geographical sciences. Following this, Ed Davis’s interactive talk “The Universal Language of Visualisations” explored how effective visualisation techniques should differ by the intended audience, with examples from his research and activities outside of academia. The last talk by Dan Milner explored his research on understanding the effect of environmental factors on outcomes of smallholder farmers in Kenya. He took us through the process of collecting data on the ground, to modelling and communicating results to external partners.  After each talk there was an opportunity to ask questions, allowing for audience participation and the sparking of interesting discussions. The format mirrored that which is most frequently used in external academic conferences, giving the speakers a chance to practice their technique in front of friendly faces.

## Lightning talks

After a short break, we jumped back into the fray with a series of 3-minute fast-paced lightning talks. A huge range of topics were covered, all the way from developing modelling techniques for the electric grid of the future, to predicting the incidence of Cerebral Vasospasm at the Southmead Hospital ICU. With such a short time to present, these talks were a great exercise in distilling research into just the essentials, knowing there is very limited time to garner the audience’s interest and convey an effective message.

## Special guest lecture

After lunch, we reconvened to attend the special guest lecture. The talk, entitled Bridging the gap between research and industry, was delivered by Ruth Voisey, CEO of the Smith institute. It outlined Ruth’s journey from writing her PhD thesis ‘Multiple wave scattering by quasiperiodic structures’, to CEO of the Smith Institute – via an internship with the acoustic research team at Dyson. It was particularly refreshing to hear Ruth’s candid account of her ‘non-linear’ rise to CEO, accrediting her success to strong principles of clear research communication and ‘mathematical evangelicalism’.

As PhD students in the bubble of academia, the path to opportunities in the world of industry can often feel clouded – Ruth’s lecture painted a clear picture of how one can transition from university based research to a rewarding career outside of this bubble, applying such research to tangible problems in the real world.

## Panel discussion and poster session

The special guest lecture was followed by a discussion on communicating research in context, with panel members Ruth Voisey, David Greenwood, Helen Barugh, Oliver Johnson, plus Compass CDT students Ed Davis and Sam Stockman. The panel discussed the difficulty of communicating the nuances of research conclusions with the public, with a particular focus on handling these nuances when talking to journalists – stressing the importance of communicating the limitations of the research in question.

This was followed by a poster session, one enthusiastic student had the following comment “it was great to see of all the Compass students’ hard work being celebrated and shared with the wider data science community”.

## Concluding remarks

To cap off the successful event, Compass students Hannah Sansford and Josh Givens delivered some concluding remarks which were drawn from comments made by students about what key points they’d taken from the day. These focused on the importance of clear communication of research throughout the whole pipeline, from inception in discussion with fellow academics to the dissemination of knowledge to the general population.

The day ended with a walk to Goldney Hall, where students, staff, and attendees enjoyed delicious food, wine, and access to the beautiful Orangery gardens.

## 2023/24 Compass research projects confirmed

Our Cohort 4 Compass students have confirmed their PhD projects for the next 3 years and are establishing the direction of their own research within the CDT.

Supervised by the Institute for Statistical Science:

Qi Chen:  Methodology for inferring directed graphs representing generative processes      Supervised by Dan Lawson

Emma Ceccherini:  Covariate Information for Dynamic Network Embedding.  Supervised by Ian Gallagher & Dan Lawson

Henry Bourne: Investigating the Effect of Latent Representations on Continual Learning Performance.  Supervised by Rihuan Ke

Rachel Wood:  Comparing qualitatively different data at scale.  Supervised by Dan Lawson

Dylan Dijk:  Robust estimation and inference for high-dimensional time series.  Supervised by Haeran Cho

Rahil Morjaria:  New Directions in Group Testing.  Supervised by Sid Jaggi

Xinrui Shi:  Collective decision making in distributed systems.  Supervised by Ayalvadi Ganesh

The following projects are supervised in collaboration with the Institute for Statistical Science (IfSS) and our other internal partners at the University of Bristol:

Codie Wood:  Misclassification in binary and categorical variables: development of methods and software for epidemiology.  Supervised by Kate Tilling & Rachael Hughes from Bristol Medical School (Population Health Science).  Plus Jonathan Bartlett (London School Hygiene Tropical Medicine)

Ben Anson:  Graph deep kernel machines. Supervised by Laurence Aitchison (Department of Computer Science)

Sam Bowyer:  Fast and correct Bayesian inference with massively parallel methods. Supervised by Laurence Aitchison (Department of Computer Science)

Sam Perren:  Validity of population adjustment methods for disconnected networks of evidence.  Supervised by Nicky Welton, David Phillippo & Hugo Pedder from Bristol Medical School (Population Health Science)

Emma Tarmey:   Variable selection in causal inference: development of methods and software for epidemiology.  Kate Tilling & Jonathan Sterne from Bristol Medical School (Population Health Science).  Plus Rhian Daniel (Cardiff University)

## Student Perspectives: Impurity Identification in Oligonucleotide Drug Samples

A post by Harry Tata, PhD student on the Compass programme.

# Oligonucleotides in Medicine

Oligonucleotide therapies are at the forefront of modern pharmaceutical research and development, with recent years seeing major advances in treatments for a variety of conditions. Oligonucleotide drugs for Duchenne muscular dystrophy (FDA approved) [1], Huntington’s disease (Phase 3 clinical trials) [2], and Alzheimer’s disease [3] and amyotrophic lateral sclerosis (early-phase clinical trials) [4] show their potential for tackling debilitating and otherwise hard-to-treat conditions. With continuing development of synthetic oligonucleotides, analytical techniques such as mass spectrometry must be tailored to these molecules and keep pace with the field.

Working in conjunction with AstraZeneca, this project aims to advance methods for impurity detection and quantification in synthetic oligonucleotide mass spectra. In this blog post we apply a regularised version of the Richardson-Lucy algorithm, an established technique for image deconvolution, to oligonucleotide mass spectrometry data. This allows us to attribute signals in the data to specific molecular fragments, and therefore to detect impurities in oligonucleotide synthesis.

# Oligonucleotide Fragmentation

If we have attempted to synthesise an oligonucleotide $\mathcal O$ with a particular sequence, we can take a sample from this synthesis and analyse it via mass spectrometry. In this process, molecules in the sample are first fragmented — broken apart into ions — and these charged fragments are then passed through an electromagnetic field. The trajectory of each fragment through this field depends on its mass/charge ratio (m/z), so measuring these trajectories (e.g. by measuring time of flight before hitting some detector) allows us to calculate the m/z of fragments in the sample. This gives us a discrete mass spectrum: counts of detected fragments (intensity) across a range of m/z bins [5].

To get an idea of how much of $\mathcal O$ is in a sample, and what impurities might be present, we first need to consider what fragments $\mathcal O$ will produce. Oligonucleotides are short strands of DNA or RNA; polymers with a backbone of sugars (such as ribose in RNA) connected by linkers (e.g. a phosphodiester bond), where each sugar has an attached base which encodes genetic information [6].

On each monomer, there are two sites where fragmentation is likely to occur: at the linker (backbone cleavage) or between the base and sugar (base loss). Specifically, depending on which bond within the linker is broken, there are four modes of backbone cleavage [7,8].
We include in $\mathcal F$ every product of a single fragmentation of $\mathcal O$ — any of the four backbone cleavage modes or base loss anywhere along the nucleotide — as well as the results of every combination of two fragmentations (different cleavage modes at the same linker are mutually exclusive).

# Sparse Richardson-Lucy Algorithm

Suppose we have a chemical sample which we have fragmented and analysed by mass spectrometry. This gives us a spectrum across n bins (each bin corresponding to a small m/z range), and we represent this spectrum with the column vector $\mathbf{b}\in\mathbb R^n$, where $b_i$ is the intensity in the $i^{th}$ bin. For a set $\{f_1,\ldots,f_m\}=\mathcal F$ of possible fragments, let $x_j$ be the amount of $f_j$ that is actually present. We would like to estimate the amounts of each fragment based on the spectrum $\mathbf b$.

If we had a sample comprising a unit amount of a single fragment $f_j$, so $x_j=1$ and $x_{k\ne j}=0,$ and this produced a spectrum $\begin{pmatrix}a_{1j}&\ldots&a_{nj}\end{pmatrix}^T$, we can say the intensity contributed to bin $i$ by $x_j$ is $a_{ij}.$ In mass spectrometry, the intensity in a single bin due to a single fragment is linear in the amount of that fragment, and the intensities in a single bin due to different fragments are additive, so in some general spectrum we have $b_i=\sum_j x_ja_{ij}.$

By constructing a library matrix $\mathbf{A}\in\mathbb R^{n\times m}$ such that $\{\mathbf A\}_{ij}=a_{ij}$ (so the columns of $\mathbf A$ correspond to fragments in $\mathcal F$), then in ideal conditions the vector of fragment amounts $\mathbf x=\begin{pmatrix}x_1&\ldots&x_m\end{pmatrix}^T$ solves $\mathbf{Ax}=\mathbf{b}$. In practice this exact solution is not found — due to experimental noise and potentially because there are contaminant fragments in the sample not included in $\mathcal F$ — and we instead make an estimate $\mathbf {\hat x}$ for which $\mathbf{A\hat x}$ is close to $\mathbf b$.

Note that the columns of $\mathbf A$ correspond to fragments in $\mathcal F$: the values in a single column represent intensities in each bin due to a single fragment only. We $\ell_1$-normalise these columns, meaning the total intensity (over all bins) of each fragment in the library matrix is uniform, and so the values in $\mathbf{\hat x}$ can be directly interpreted as relative abundances of each fragment.

The observed intensities — as counts of fragments incident on each bin — are realisations of latent Poisson random variables. Assuming these variables are i.i.d., it can be shown that the estimate of $\mathbf{x}$ which maximises the likelihood of the system is approximated by the iterative formula

$\mathbf {\hat{x}}^{(t+1)}=\left(\mathbf A^T \frac{\mathbf b}{\mathbf{A\hat x}^{(t)}}\right)\odot \mathbf{\hat x}^{(t)}.$

Here, quotients and the operator $\odot$ represent (respectively) elementwise division and multiplication of two vectors. This is known as the Richardson-Lucy algorithm [9].

In practice, when we enumerate oligonucleotide fragments to include in $\mathcal F$, most of these fragments will not actually be produced when the oligonucleotide passes through a mass spectrometer; there is a large space of possible fragments and (beyond knowing what the general fragmentation sites are) no well-established theory allowing us to predict, for a new oligonucleotide, which fragments will be abundant or negligible. This means we seek a sparse estimate, where most fragment abundances are zero.

The Richardson-Lucy algorithm, as a maximum likelihood estimate for Poisson variables, is analagous to ordinary least squares regression for Gaussian variables. Likewise lasso regression — a regularised least squares regression which favours sparse estimates, interpretable as a maximum a posteriori estimate with Laplace priors — has an analogue in the sparse Richardson-Lucy algorithm:

$\mathbf {\hat{x}}^{(t+1)}=\left(\mathbf A^T \frac{\mathbf b}{\mathbf{A\hat x}^{(t)}}\right)\odot \frac{ \mathbf{\hat x}^{(t)}}{\mathbf 1 + \lambda},$

where $\lambda$ is a regularisation parameter [10].

# Library Generation

For each oligonucleotide fragment $f\in\mathcal F$, we smooth and bin the m/z values of the most abundant isotopes of $f$, and store these values in the columns of $\mathbf A$. However, if these are the only fragments in $\mathcal F$ then impurities will not be identified: the sparse Richardson-Lucy algorithm will try to fit oligonucleotide fragments to every peak in the spectrum, even ones that correspond to fragments not from the target oligonucleotide. Therefore we also include ‘dummy’ fragments corresponding to single peaks in the spectrum — the method will fit these to non-oligonucleotide peaks, showing the locations of any impurities.

# Results

For a mass spectrum from a sample containing a synthetic oligonucleotide, we generated a library of oligonucleotide and dummy fragments as described above, and applied the sparse Richardson-Lucy algorithm. Below, the model fit is plotted alongside the (smoothed, binned) spectrum and the ten most abundant fragments as estimated by the model. These fragments are represented as bars with binned m/z at the peak fragment intensity, and are separated into oligonucleotide fragments and dummy fragments indicating possible impurities. All intensities and abundances are Anscombe transformed ($x\rightarrow\sqrt{x+3/8}$) for clarity.

As the oligonucleotide in question is proprietary, its specific composition and fragmentation is not mentioned here, and the bins plotted have been transformed (without changing the shape of the data) so that individual fragment m/z values are not identifiable.

We see the data is fit extremely closely, and that the spectrum is quite clean: there is one very pronounced peak roughly in the middle of the m/z range. This peak corresponds to one of the oligonucleotide fragments in the library, although there is also an abundant dummy fragment slightly to the left inside the main peak. Fragment intensities in the library matrix are smoothed, and it may be the case that the smoothing here is inappropriate for the observed peak, hence other fragments being fit at the peak edge. Investigating these effects is a target for the rest of the project.

We also see several smaller peaks, most of which are modelled with oligonucleotide fragments. One of these peaks, at approximately bin 5352, has a noticeably worse fit if excluding dummy fragments from the library matrix (see below). Using dummy fragments improves this fit and indicates a possible impurity. Going forward, understanding and quantification of these impurities will be improved by including other common fragments in the library matrix, and by grouping fragments which correspond to the same molecules.

# References

[1] Junetsu Igarashi, Yasuharu Niwa, and Daisuke Sugiyama. “Research and Development of Oligonucleotide Therapeutics in Japan for Rare Diseases”. In: Future Rare Diseases 2.1 (Mar. 2022), FRD19.

[2] Karishma Dhuri et al. “Antisense Oligonucleotides: An Emerging Area in Drug Discovery and Development”. In: Journal of Clinical Medicine 9.6 (6 June 2020), p. 2004.

[3] Catherine J. Mummery et al. “Tau-Targeting Antisense Oligonucleotide MAPTRx in Mild Alzheimer’s Disease: A Phase 1b, Randomized, Placebo-Controlled Trial”. In: Nature Medicine (Apr. 24, 2023), pp. 1–11.

[4] Benjamin D. Boros et al. “Antisense Oligonucleotides for the Study and Treatment of ALS”. In: Neurotherapeutics: The Journal of the American Society for Experimental NeuroTherapeutics 19.4 (July 2022), pp. 1145–1158.

[5] Ingvar Eidhammer et al. Computational Methods for Mass Spectrometry Proteomics. John Wiley & Sons, Feb. 28, 2008. 299 pp.

[6] Harri Lönnberg. Chemistry of Nucleic Acids. De Gruyter, Aug. 10, 2020.

[7] S. A. McLuckey, G. J. Van Berkel, and G. L. Glish. “Tandem Mass Spectrometry of Small, Multiply Charged Oligonucleotides”. In: Journal of the American Society for Mass Spectrometry 3.1 (Jan. 1992), pp. 60–70.

[8] Scott A. McLuckey and Sohrab Habibi-Goudarzi. “Decompositions of Multiply Charged Oligonucleotide Anions”. In: Journal of the American Chemical Society 115.25 (Dec. 1, 1993), pp. 12085–12095.

[9] Mario Bertero, Patrizia Boccacci, and Valeria Ruggiero. Inverse Imaging with Poisson Data: From Cells to Galaxies. IOP Publishing, Dec. 1, 2018.

[10] Elad Shaked, Sudipto Dolui, and Oleg V. Michailovich. “Regularized Richardson-Lucy Algorithm for Reconstruction of Poissonian Medical Images”. In: 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Mar. 2011, pp. 1754–1757.

## Student Perspectives: Density Ratio Estimation with Missing Data

A post by Josh Givens, PhD student on the Compass programme.

Density ratio estimation is a highly useful field of mathematics with many applications.  This post describes my research undertaken alongside my supervisors Song Liu and Henry Reeve which aims to make density ratio estimation robust to missing data. This work was recently published in proceedings for AISTATS 2023.

## Density Ratio Estimation

### Definition

As the name suggests, density ratio estimation is simply the task of estimating the ratio between two probability densities. More precisely for two RVs (Random Variables) $Z^0, Z^1$ on some space $\mathcal{Z}$ with probability density functions (PDFs) $p_0, p_1$ respectively, the density ratio is the function $r^*:\mathcal{Z}\rightarrow\mathbb{R}$ defined by

$r^*(z):=\frac{p_0(z)}{p_1(z)}$.

Density ratio estimation (DRE) is then the practice of using IID (independent and identically distributed) samples from $Z^0$ and $Z^1$ to estimate $r^*$. What makes DRE so useful is that it gives us a way to characterise the difference between these 2 classes of data using just 1 quantity, $r^*$.

### The Density Ratio in Classification

We now give demonstrate this characterisability in the case of classification. To frame this as a classification problem define $Y\sim\text{Bernoulli}(0.5)$ and $Z$ by $Z|Y=y\sim Z^{y}$. The task of predicting $Y$ given $Z$ using some function $\phi:\mathcal{Z}\rightarrow\{0,1\}$ is then our standard classification problem. In classification a common target is the Bayes Optimal Classifier, the classifier $\phi^*$ which maximises $\mathbb{P}(Y=\phi(Z)).$ We can write this classifier in terms of $r^*$  as we know that $\phi^*(z)=\mathbb{I}\{\mathbb{P}(Y=1|Z=z)>0.5\}$ where $\mathbb{I}$ is the indicator function. Then, by the total law of probability, we have

$\mathbb{P}(Y=1|Z=z)=\frac{p_{Z|Y=1}(z)\mathbb{P}(Y=1)}{p_{Z|Y=1}(z)\mathbb{P}(Y=1)+p_{Z|Y=0}(z)\mathbb{P}(Y=0)}$

$=\frac{p_1(z)\mathbb{P}(Y=1)}{p_1(z)\mathbb{P}(Y=1)+p_0(z)\mathbb{P}(Y=0)} =\frac{1}{1+\frac{1}{r}\frac{\mathbb{P}(Y=0)}{\mathbb{P}(Y=1)}}.$

Hence to learn the Bayes optimal classifier it is sufficient to learn the density ratio and a constant. This pattern extends well beyond Bayes optimal classification to many other areas such as error controlled classification, GANs, importance sampling, covariate shift, and others.  Generally speaking, if you are in any situation where you need to characterise the difference between two classes of data, it’s likely that the density ratio will make an appearance.

### Estimation Implementation – KLIEP

Now we have properly introduced and motivated DRE, we need to look at how we can go about performing it. We will focus on one popular method called KLIEP here but there are a many different methods out there (see Sugiyama et al 2012 for some additional examples.)

The intuition behind KLIEP is simple: as $r^* \cdot p_0=p_1$, if $\hat r\cdot p_0$ is “close” to $p_1$ then $\hat r$ is a good estimate of $r^*$. To measure this notion of closeness KLIEP uses the KL (Kullback-Liebler) divergence which measures the distance between 2 probability distributions. We can now formulate our ideal KLIEP objective as follows:

$\underset{r}{\text{min}}~ KL(p_1|p_0\cdot r)$

$\text{subject to:}~ \int_{\mathcal{Z}}r(z)p_0(z)\mathrm{d}z=1$

where $KL(p|p')$ represent the KL divergence from $p$ to $p'$. The constraint  ensures that the right hand side of our KL divergence is indeed a PDF. From the definition of the KL-divergence we can rewrite the solution to this as $\hat r:=\frac{\tilde r}{\mathbb{E}[r(X^0)]}$ where $\tilde r$ is the solution to the unconstrained optimisation

$\underset{r}{\text{min}}~\mathbb{E}[\log (r(Z^1))]-\log(\mathbb{E}[r(Z^0)]).$

As this is now just an unconstrained optimisation over expectations of known transformations of $Z^0, Z^1$, we can approximate this using samples. Given samples $z^0_1,\dotsc,z^0_n$ from $Z_0$ and samples $z^1_1,\dotsc,z^1_n$ from $Z_1$ our estimate of the density ratio will be $\hat r:=\left(\frac{1}{n}\sum_{i=1}^nr(z_i^0)\right)^{-1}\tilde r$  where $\tilde r$ solves

$\underset{r}{\min}~ \frac{1}{n}\sum_{i=1}^n \log(r(z^1_i))-\log\left(\frac{1}{n}\sum_{i=1}^n r(z^0_i)\right).$

Despite KLIEP being commonly used, up until now it has not been made robust to missing not at random data. This is what our research aims to do.

## Missing Data

Suppose that instead of observing samples from $Z$, we observe samples from some corrupted version of $Z$, $X$. We assume that $\mathbb{P}(\{X=\varnothing\}\cup \{X=Z\})=1$ so that either $X$ is missing or $X$ takes the value of $Z$. We also assume that whether $X$ is missing depends upon the value of $Z$. Specifically we assume $\mathbb{P}(X=\varnothing|Z=z)=\varphi(z)$ with $\varphi(z)$ not constant and refer to $\varphi$ as the missingness function. This type of missingness is known as missing not at random (MNAR) and when dealt with improperly can lead to biased result. Some examples of MNAR data could be readings take from a medical instrument which is more likely to err when attempting to read extreme values or recording responses to a questionnaire where respondents may be more likely to not answer if the deem their response to be unfavourable. Note that while we do not see what the true response would be, we do at least get a response meaning that we know when an observation is missing.

## Missing Data with DRE

We now go back to density ratio estimation in the case where instead of observing samples from  $Z^0,Z^1$  we observe samples from their corrupted versions $X^0, X^1$. We take their respective missingness functions to be $\varphi_0, \varphi_1$ and assume them to be known. Now let us look at what would happen if we implemented KLIEP with the data naively by simply filtering out the missing-values. In this case, the actual density ratio we would be estimating would be

$r'(z):=\frac{p_{X_1|X_1\neq\varnothing}(z)}{p_{X_0|X_o\neq\varnothing}(z)}\propto\frac{(1-\varphi_1(z))p_1(z)}{(1-\varphi_0(z))p_0(z)}\not{\propto}r^*(z)$

and so we would get inaccurate estimates of the density ratio no matter how many samples are used to estimate it. The image below demonstrates this in the case were samples in class $1$ are more likely to be missing when larger and class $0$ has no missingness.

### Our Solution

Our solution to this problem is to use importance weighting. Using relationships between the densities of $X$ and $Z$ we have that

$\mathbb{E}[g(Z)]=\mathbb{E}\left[\frac{\mathbb{I}\{X\neq\varnothing\}g(X)}{1-\varphi(X)}\right].$

As such we can re-write the KLIEP objective to keep our expectation estimation unbiased even when using these corrupted samples. This gives our modified objective which we call M-KLIEP as follows. Given samples $x^0_1,\dotsc,x^0_n$ from $X_0$ and samples $x^1_1,\dotsc,x^1_n$ from $X_1$ our estimate is $\hat r=\left(\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^0\neq\varnothing\}r(x_i^0)}{1-\varphi_o(x_i^o)}\right)^{-1}\tilde r$ where $\tilde r$ solves

$\underset{r}{\min}~\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^1\neq\varnothing\}\log(r(x_i^1))}{1-\varphi_1(x_i^1)}-\log\left(\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^0\neq\varnothing\}r(x_i^0)}{1-\varphi_0(x_i^0)}\right).$

This objective will now target $r^*$ even when used on MNAR data.

### Application to Classification

We now apply our density ratio estimation on MNAR data to estimate the Bayes optimal classifier. Below shows a plot of samples alongside the true Bayes optimal classifier and estimated classifiers from the samples via our method M-KLIEP and a naive method CC-KLIEP which simply ignores missing points. Missing data points are faded out.

As we can see, due to not accounting for the MNAR nature of the data, CC-KLIEP underestimates the true number of class 1 samples in the top left region and therefore produces a worse classifier than our approach.

As well as this modified objective our paper provides the following additional contributions:

• Theoretical finite sample bounds on the accuracy of our modified procedure.
• Methods for learning the missingness functions $\varphi_1,\varphi_0$.
• Expansions to partial missingness via a Naive-Bayes framework.
• Downstream implementation of our method within Neyman-Pearson classification.
• Adaptations to Neyman-Pearson classification itself making it robust to MNAR data.

For more details see our paper and corresponding github repository. If you have any questions on this work feel free to contact me at josh.givens@bristol.ac.uk.

### References

Givens, J., Liu, S., & Reeve, H. W. J. (2023). Density ratio estimation and neyman pearson classification with missing data. In F. Ruiz, J. Dy, & J.-W. van de Meent (Eds.), Proceedings of the 26th international conference on artificial intelligence and statistics (Vol. 206, pp. 8645–8681). PMLR.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.

## Compass students attending AISTATS 2023 in Valencia

We (Ed Davis, Josh Givens, Alex Modell, and Hannah Sansford) attended the 2023 AISTATS conference in Valencia in order to explore the interesting research being presented as well as present some of our own work. While we talked about our work being published at the conference in this earlier blog post, having now attended the conference, we thought we’d talk about our experience there. We’ll spotlight some of the talks and posters which interested us most and talk about our highlights of Valencia as a whole.

## Talks & Posters

### Mode-Seeking Divergences: Theory and Applications to GANs

One especially interesting talk and poster at the conference was presented by Cheuk Ting Li on their work in collaboration with Farzan Farnia. This work aims to set up a formal classification for various probability measure divergences (such as f-Divergences, Wasserstein distance, etc.) in terms of there mode-seeking or mode-covering properties. By mode-seeking/mode-covering we mean the behaviour of the divergence when used to fit a unimodal distribution to a multi-model target. Specifically a mode-seeking divergence will encourage the target distribution to fit just one of the modes ignoring the other while a mode covering divergence will encourage the distribution to cover all modes leading to less accurate fitting on an individual mode but better covering the full support of the distribution. While these notions of mode-seeking and mode-covering divergences had been discussed before, up to this point there seems to be no formal definition for these properties, and disagreement on the appropriate categorisation of some divergences. This work presents such a definition and uses it to categorise many of the popular divergence methods. Additionally they show how an additive combination a mode seeking f-divergence and the 1-Wasserstein distance retain the mode-seeking property of the f-divergence while being implementable using only samples from our target distribution (rather than knowledge of the distribution itself) making it a desirable divergence for use with GANs.

### Using Sliced Mutual Information to Study Memorization and Generalization inDeep Neural Networks

The benefit of attending large conferences like AISTATS is having the opportunity to hear talks that are not related to your main research topic. This was the case with a talk by Wongso et. al. was one such talk. Although it did not overlap with any of our main research areas, we all found this talk very interesting.
The talk was on the topic of tracking memorisation and generalisation in deep neural networks (DNNs) through the use of /sliced mutual information/. Mutual information (MI) is commonly used in information theory and represents the reduction of uncertainty about one random variable, given the knowledge of the other. However, MI is hard to estimate in high dimensions, which makes it a prohibitive metric for use in neural networks.
Enter sliced mutual information (SMI). This metric is the average of the MI terms between their one-dimensional projections. The main difference between SMI and MI is that SMI is scalable to high dimensions and scales faster than MI.
Next, let’s talk about memorisation. Memorisation is known to occur in DNNs and is where the DNN fits random labels in training as it has memorised noisy labels in training, leading to bad generalisation. The authors demonstrate this behaviour by fitting a multi-layer perceptron to the MNIST dataset with various amounts of label noise. As the noise increased, the difference between the training and test accuracy became greater.
As the label noise increases, the MI between the features and target variable does not change, meaning that MI did not track the loss in generalisation. However, the authors show that the SMI did track the generalisation. As the label noise increased, the SMI decreased significantly as the MLP’s generalisation got worse. Their main theorem showed that SMI is lower-bounded by a term which includes the spherical soft-margin separation, a quantity which is used to track memorisation and generalisation!
In summary, unlike MI, SMI can track memorization and generalisation in DNNs. If you’d like to know more, you can find the full paper here: https://proceedings.mlr.press/v206/wongso23a.html.

### Invited Speakers and the Test of Time Award

As well as the talks on papers that had been selected for oral presentation, each day began with a (longer) invited talk which, for many of us, were highlights of each day. The invited speakers were extremely engaging and covered varied and interesting topics; from Arthur Gretton (UCL) presenting ‘Causal Effect Estimation with Context and Confounders’ to Shakir Mohamed (DeepMind) presenting ‘Elevating our Evaluations: Technical and Sociotechnical Standards of Assessment in Machine Learning’. A favourite amongst us was a talk from Tamara Broderick (MIT) titled ‘An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?’. In this talk she addressed a worry that researchers might have when the goal is to analyse a data sample and apply any conclusions to a new population: was a small proportion of the data sample instrumental to the original conclusion? Tamara and collorators propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. They find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. In experiments they find that many data analyses are robust, but that the conclusions of severeal influential economics papers can be changed by removing (much) less than 1% of the data! A link to the talk can be found here: https://youtu.be/QYtIEqlwLHE

This year, AISTATS featured a Test of Time Award to recognise a paper from 10 years ago that has had a prominent impact in the field. It was awarded to Andreas Damianou and Neil Lawrence for the paper ‘Deep Gaussian Processes’, and their talk was a definite highlight of the conferece. Many of us had seen Neil speak at a seminar at the University of Bristol last year and, being the engaging speaker he is, we were looking forward to hearing him speak again. Rather than focussing on the technical details of the paper, Neils talk concentrated on his (and the machine learning community’s) research philosophy in the years preceeding writing the paper, and how the paper came about – a very interesting insight, and a refreshing break from the many technical talks!

## Valencia

There was so much to like about Valencia even from our short stay there. We’ll try and give you a very brief highlight of our favourite things.

### Food & Drink:

Obviously Valencia is renowned for being the birthplace of paella and while the paella was good we sampled many other delights our stay. Our collective highlight was the nicest Burrata any of us had ever had which, in a stunning display of individualism, all four of us decided to get on our first day at the conference.

### Beach:

About half an hours tram ride from the conference centre are the beaches of Valencia. These stretch for miles as well as having a good 100m in depth with (surprisingly hot) sand covering the lot. We visited these after the end of the conference on the Thursday and despite it being the only cloudy day of the week it was a perfect way to relax at the end of a hectic few days with the pleasantly temperate water being an added bonus.

### Architecture:

Valencia has so much interesting architecture scattered around the city centre. One of the most memorable remarkable places was the San Nicolás de Bari and San Pedro Mártir (Church of San Nicolás) which is known as the Sistine chapel of Valencia (according to the audio-guide for the church at least) with its incredible painted ceiling and live organ playing.

## Summer applications deadline for Compass CDT: 12 June 2023

### We are happy to announce our upcoming applications deadline is 12 June 2023, 23:59 (London UK time zone) for the final few fully funded places to start September 2023. For international applications there are limited scholarship funded places available. Early applications advised.

Compass is offering specific projects for PhD students to study from Sept 2023. We are pleased to announce that there are 4 new project opportunities to study. The full list of the projects/ supervisors has been updated. All the supervisors listed are open to discussion on the projects provided and can also talk to applicants about other project ideas. Please provide a ranked list of 3 projects of interest: 1 = project of highest interest. Project supervisors will be happy to respond to specific questions you have after reading the proposals. Applicants should contact them by email if they wish beforehand.

#### PhD Project Allocation Process

Application forms will be reviewed based on the 3 ranked projects specified or other proposed topic. Successful applicants will be invited to attend an interview with the Compass admissions tutors and the specific project supervisor. If you are made an offer of PhD study it will be published through the online application system. You will then have 2 weeks to consider the offer before deciding whether to accept or decline.

We welcome applications from all members of our community and are particularly encouraging those from diverse groups, such as members of the LGBT+ and black, Asian and minority ethnic communities, to join us.

12 June 2023, 23:59 (London UK time zone)

APPLY NOW

#### Advantages of being a Compass Student

• Stipend – a generous stipend of £22,622 pa tax free, paid in monthly payments. Plus your own expense budget of £1,000 pa towards travel and research activity.
• No fees – all tuition fees are covered by the EPSRC and University of Bristol.
• Bespoke training – first year units are designed specifically for the academic needs of each Compass student, which enables students to develop knowledge and capability to pursue cross-disciplinary PhD research.
• Supervisors – supervisors from across academic disciplines offer a range of research projects.
• Cohort – Compass students benefit from dedicated offices and collaboration spaces, enabling strong cohort links and opportunities for shared learning and research.

A 4-year bespoke PhD training programme in the statistical and computational techniques of data science, with partners from across the University of Bristol, industry and government agencies.

The cross-disciplinary programme offers exciting collaborations across medicine, computer science, geography, economics, life and earth sciences, as well as with our external partners who range from government organisations such as the Office for National Statistics, NCSC and the AWE, to industrial partners such as LV, Improbable, IBM Research, EDF, and AstraZeneca.

Students are co-located with the Institute for Statistical Science in the School of Mathematics, which occupies the Fry Building.

#### Hear from our students about their experience with the programme

• Compass has allowed me to advance my statistical knowledge and apply it to a range of exciting applied projects, as well as develop skills that I’m confident will be highly useful for a future career in data science. – Shannon, Cohort 2

• With the Compass CDT I feel part of a friendly, interactive environment that is preparing me for whatever I move on to next, whether it be in Academia or Industry. – Sam, Cohort 2

• An incredible opportunity to learn the ever-expanding field of data science, statistics and machine learning amongst amazing people. – Danny, Cohort 1

#### Compass CDT Video

Find out more about what it means to be a part of the Compass programme from our students in this short video.