## Student perspectives: How can we do data science without all of our data?

A post by Daniel Williams, Compass PhD student.

Imagine that you are employed by Chicago’s city council, and are tasked with estimating where the mean locations of reported crimes are in the city. The data that you are given only goes up to the city’s borders, even though crime does not suddenly stop beyond this artificial boundary. As a data scientist, how would you estimate these centres within the city? Your measurements are obscured past a very complex border, so regular methods such as maximum likelihood would not be appropriate.

This is an example of a more general problem in statistics named truncated probability density estimation. How do we estimate the parameters of a statistical model when data are not fully observed, and are cut off by some artificial boundary?

# The Problem

Incorporation of this boundary constraint into estimation is not straightforward. There are many factors which must be considered which make regular methods infeasible, we require:

• knowledge of the normalising constant in a statistical model, which is unknown in a truncated setting;
• incorporation of weighting points that are more ‘affected’ by the truncation.

A statistical model (a probability density function involving some parameters $\theta$) is defined as:

$p(x; \theta) = \frac{1}{Z(\theta)} \bar{p}(x;\theta).$

This is made up of two components:  $\bar{p}(x; \theta)$, the unnormalised part of the model, which is known analytically; and $Z(\theta)$, the normalising constant, which is oftentimes unknown or undefinable. The goal of estimation is to find a vector of parameters $\theta$ which characterises our model, and makes the model resemble the data as closely as possible. Usually, we can estimate $\theta$ by trying to minimise the difference between the model, $p(x;\theta)$, and the theoretical data density, $q(x)$.

# A Recipe for a Solution

## Ingredient #1: Score Matching

A novel estimation method called score matching enables estimation even when the normalising constant, $Z(\theta)$, is unknown. Score matching begins by taking the derivative of the logarithm of the probability density function, i.e.,

$\nabla_x \log p(x; \theta) = \nabla_x \log \bar{p}(x;\theta),$

which has become known as the score function. When taking the derivative, the dependence on $Z(\theta)$ is removed. To estimate the parameter vector $\theta$, we can minimise the difference between $q(x)$ and $p(x;\theta)$ by minimising the difference between the two score functions, $\nabla_x \log p(x; \theta)$ and $\nabla_x \log q(x)$. One such distance measure is the expected squared distance, so that score matching aims to minimise

$\mathbb{E} [\| \nabla_x \log p(x; \theta) - \nabla_x \log q(x)\|_2^2].$

With this first ingredient, we have eliminated the requirement that we must know the normalising constant.

## Ingredient #2: A Weighting Function

Heuristically, we can imagine our weighting function should vary with how close a point is to the boundary. To satisfy score matching assumptions, we require that this weighting function $g(x)$ must have the property that $g(x') = 0$ for any point $x'$ on the boundary. A natural candidate would be the Euclidean distance from a point $x$ to the boundary, i.e.,

$g(x) = \|x - \tilde{x}\|_2, \qquad \tilde{x} = \text{argmin}_{x'\text{ in boundary}} \|x - x'\|_2.$

This easily satisfies our criteria. The distance is going to be exactly zero on the boundary itself, and will approach zero the closer the points are to the edges.

## Ingredient #3: Any Statistical Model

Since we do not require knowledge of the normalising constant through the use of score matching, we can choose any probability density function $p(x; \theta)$ that is appropriate for our data. For example, if we are modelling count data, we may choose a Poisson distribution. If we have some sort of centralised location data, such as in the Chicago crime example in the introduction, we may choose a multivariate Normal distribution.

## Combine Ingredients and Mix

We aim to minimise the expected distance between the score functions, and we weight this by our function $g(x)$, to give

$\min_{\theta} \mathbb{E} [g(x) \| \nabla_x \log p(x; \theta) - \nabla_x \log q(x)\|_2^2].$

The only unknown in this equation now is the data density $q(x)$ (if we knew the true data density, there would be no point in estimating it with $p(x;\theta)$). However, we can rewrite this equation using integration by parts as

$\min_{\theta} \mathbb{E} \big[ g(x) [\|\nabla_x \log p(x; \theta)\|_2^2 + 2\text{trace}(\nabla_x^2 \log p(x; \theta))] + 2 \nabla_x g(x)^\top \nabla_x \log p(x;\theta)\big].$

This can be numerically minimised, or if $p(x; \theta)$ is in the exponential family, analytically minimised to obtain estimates for $\theta$.

# Results

## Artificial Data

As a sanity check for testing estimation methods, it is often reassuring to perform some experiments on simulated data before moving to real world applications. Since we know the true parameter values, it is possible to calculate how far our estimated values are from their true counterparts, thus giving a way to measure estimation accuracy.

Pictured below in Figure 2 are two experiments where data are simulated from a multivariate normal distribution with mean $\mu = [1, 1]$ and known variance $\sigma^2 = 1$. Our aim is to estimate the parameter $\theta$ to be close to $[1,1]$. The red crosses in the image are the true means at $[1,1]$, and the red dots are the estimates given by truncated score matching.

These figures clearly indicate that in this basic case, this method is giving sensible results. Even by ignoring most of our data, as long as we formulate our problem correctly, we can still get accurate results for estimation.

## Chicago Crime

Now we come back to the motivating problem, and the task is to estimate the centres of police reported crime in the city of Chicago, given the longitudes and latitudes of homicides in 2008. Our specification changes somewhat:

• from the original plots, it seems like there are two centres, so the statistical model we choose is a 2-dimensional mixed multivariate Normal distribution;
• the variance is no longer known, but to keep estimation straightforward, the standard deviation $\sigma$ is fixed so that $2\sigma$ roughly covers the ‘width’ of the city.

Figure 3 below shows applying this method to this application.

Truncated score matching has placed its estimates for the means in similar, but slightly different places than standard MLE. Whereas the maximum likelihood estimates are more central to the truncated dataset, the truncated score matching estimates are placed closer to the outer edges of the city. For the upper-most crime centre, around the city border the data are more ‘tightly packed’ – which is a property we expect of multivariate normal data. This result does not reflect the true likelihood of crime in a certain neighbourhood, and has only been presented to provide a visual explanation of the method. The actual likelihood depends on various characteristics in the city that are not included in our analysis, see here for more details.

# Final Thoughts

Score matching is a powerful tool, and by not requiring any form of normalising constant, enables the use of some more complicated models. Truncated probability density estimation is just one such example of an intricate problem which can be solved with this methodology, but it is one with far reaching and interesting applications. Whilst this blog post has focused on location-based data and estimation, truncated probability density estimation could have a range of applications. For example, consider disease/virus modelling such as the COVID-19 pandemic. The true number of cases is obscured by the number of tests that can be performed, so the density evolution of the pandemic through time could be fully modelled with incorporation of this constraint using this method. Other applications as well as extensions to this method will be fully investigated in my future PhD projects.

## Some Notes

Alternative methods which do not require the normalising constant: We could choose a model which has truncation built into its specification, such as the truncated normal distribution. For complicated regions of truncation, such as the city of Chicago, it is highly unlikely that an appropriate model specification is available. We could also perform some rejection algorithm to approximate the normalising constant, such as MCMC or ABC, but these are often very slow, and do not provide an exact solution. Score matching can have an analytical estimate for exponential family models, and even when it does not, minimisation is usually straightforward and fast.

Choice of the weighting function: We did not need to specify $g(x)$ before formulating our score matching objective function in ingredient #3. In fact, under some assumptions (the function $g(x)$ is 1-Lipschitz continuous) it can be shown that the choice of $g(x)$ which gives the supremum of the score matching expectation is in fact the Euclidean distance which was specified earlier, so it was not just a heuristic!

Score matching derivation conditions: This post has hand-waved some conditions of the derivation of the score matching objective function. There are two conditions which must be satisfied in the weighting function $g(x)$ so that the integration by parts ‘trick’ can work correctly. Firstly, $g(x)$ must be positive for every $x$. Secondly, for any $x'$ in the boundary, $g(x') = 0$. For more information, see the further reading section below.

If you are interested to learn more about any of these topics, check out any of the papers below:

Estimation of Non-Normalized Statistical Models by Score Matching, Aapo Hyvärinen (2005)

Estimating Density Models with Truncation Boundaries using Score Matching. Song Liu, Takafumi Kanamori, & Daniel J. Williams (2021)

And the Github page where all the plots and results were generated can be found here.

## Novel semi-supervised Bayesian learning to rapidly screen new oligonucleotide drugs for impurities.

This is an exciting opportunity to join Compass’ 4-year programme with integrated training in the statistical and computational techniques of Data Science. You will be part of a dynamic cohort of PhD researchers hosted in the historic Fry Building, which has recently undergone a £35 million refurbishment as the new home for Bristol’s School of Mathematics.

This fully-funded 4 year studentship covers:

• tuition fees at UK rate
• tax-free stipend of up £19,609 per year for living expenses and
• equipment and travel allowance to support research related activities.

This opportunity is open to UK, EU, and international students.

AstraZeneca is a global, science-led biopharmaceutical business whose innovative medicines are used by millions of patients worldwide. Oligonucleotide-based therapies are advanced novel interventions with the potential to provide a step-change in treatment for many. Nevertheless, as oligonucleotides are large complex molecules they are currently very difficult to profile for impurities, as the analysis is labour intensive and the data complexity is high.

The aim of this PhD is to develop Bayesian data science methodology that does this automatically, accurately, and delivers statistical measures of certainty. The challenge is a mathematical one, and no chemistry, biology or pharmacological background is expected of the student. More specifically, we have large batches of mass spectrometry data that will enable us to learn how to characterise the known oglionucleotide signal and deconvolute it from a number of known and unknown impurities longitudinally, in a semi-supervised learning framework. This will allow us to confirm the overall consistency of the profile, identify any change patterns, trends over batches, and any correlation between impurities.

The end goal is to establish a data analytics pipeline and embed it as part of routine analysis in AstraZeneca, so impurities can be monitored more closely and more precisely. The knowledge can then be used to identify possible issues in manufacturing and improve process chemistry by pinpointing impurities associated with different steps of the drug synthesis. This project would also improve the overall understanding of oligonucleotides and therefore, serve as a key step towards establishing an advanced analytical platform.

### Project Supervisor

The PhD will be supervised by statistical data scientist Prof Andrew Dowsey at Bristol in collaboration with AstraZeneca. Prof Dowsey’s group has extensive expertise and experience in Bayesian mass spectrometry analytics (e.g. Nature Comms Biology 2019Nature Scientific Reports 2016) and leads the development of the seaMass suite of tools for quantification and statistical analyses in mass spectrometry.

Application Deadline is 5.00pm Friday 18 June 2021. Please quote ‘Compass/ AstraZeneca’ in the funding section of the application form and in your Personal Statement to ensure your application is reviewed correctly. Please follow the Compass application guidance.

Interviews are expected to be held in the week commencing 12 July.

## Video: The Data Science behind COVID Modelling

We are excited to share Dr Daniel Lawson’s (Compass CDT Co-Director) latest video where he will tell you about the Data Science behind Bristol’s COVID Modelling.

Mathematics has had a hidden role in predicting how we can best fight COVID-19. How is mathematics used with data science and machine learning? Why is modelling epidemics such a hard problem? How can we do it better next time? What will data science be able to do in the future, and how do you become a part of it?

## Student Perspectives: Embedding probability distributions in RKHSs

A post by Jake Spiteri, Compass PhD student.

Recent advancements in kernel methods have introduced a framework for nonparametric statistics by embedding and manipulating probability distributions in Hilbert spaces. In this blog post we will look at how to embed marginal and conditional distributions, and how to perform probability operations such as the sum, product, and Bayes’ rule with embeddings.

## Embedding marginal distributions

Throughout this blog post we will make use of reproducing kernel Hilbert spaces (RKHS). A reproducing kernel Hilbert space is simply a Hilbert space with some additional structure, and a Hilbert space is just a topological vector space equipped with an inner product, which is also complete.

We will frequently refer to a random variable $X$ which has domain $\mathcal{X}$ endowed with the $\sigma$-algebra $\mathcal{B}_\mathcal{X}$.

Definition. A reproducing kernel Hilbert space $\mathcal{H}$ on $\mathcal{X}$ with associated positive semi-definite kernel function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a Hilbert space of functions $f: \mathcal{X} \rightarrow \mathbb{R}$. The inner product $\langle\cdot,\cdot\rangle_\mathcal{H}$ satisfies the reproducing property: $\langle f, k(x, \cdot) \rangle_\mathcal{H}, \forall f \in \mathcal{H}, x \in \mathcal{X}$. We also have $k(x, \cdot) \in \mathcal{H}, \forall x \in \mathcal{X}$.

It may be helpful to think of $\mathcal{H}$ as $\overline{\text{span}\{k(x, \cdot)\}}$; all functions in our space may be written as a sum of feature maps $k(x, \cdot)$. This can help us visualise what functions belong to our RKHS. We should also note that kernel functions are not limited to $\mathcal{X} = \mathbb{R}^d$ — kernel functions have been defined for inputs such as strings, graphs, images, etc.

Definition. Let $\mathcal{P}(\mathcal{X})$ the space of all probability measures on $\mathcal{X}$. We can embed any probability distribution $\mathbb{P} \in \mathcal{P}(\mathcal{X})$ into the RKHS $\mathcal{H}$ with associated kernel $k$ via the kernel mean embedding:

$\mu_\mathbb{P}: \mathcal{P}(\mathcal{X}) \rightarrow \mathcal{H}, \quad \mu_\mathbb{P}: \mathbb{P} \mapsto \int k(x, \cdot) d\mathbb{P}(x)$

The embedding $\mu_\mathbb{P}$ exists and belongs to the RKHS $\mathcal{H}$ if $\mathbb{E}_\mathbb{P}[\sqrt{k(X,X)}] < \infty$. The kernel mean embedding is representer of expectation in the RKHS: it satisfies the reproducing property $\mathbb{E}_\mathbb{P}[f(X)] = \langle f, \mu_\mathbb{P} \rangle_\mathcal{H}, \forall f \in \mathcal{H}$. I.e. can compute expectations of functions $f \in \mathcal{H}$ with respect to $\mathbb{P}$ by taking an inner product between $f$ and $\mu_\mathbb{P}$.

A useful property of the kernel mean embedding is that it captures all characteristics of a probability distribution, for a good choice of kernel. Provided that the feature mapping induced by the kernel is injective, the embedding is unique. A kernel which corresponds to an injective feature mapping is said to be characteristic. That is, for $\mathbb{P}, \mathbb{Q} \in \mathcal{P}(\mathcal{X})$, if $\mu_\mathbb{P} = \mu_\mathbb{Q}$ then we must have $\mathbb{P} = \mathbb{Q}$ for a characteristic kernel. Choosing an appropriate kernel function for a given problem is not straightfoward, and it is common to use a ‘default’ kernel choice such as the Gaussian or Laplace kernel, as they are characteristic. We should note here that whilst the mapping from distribution to kernel mean embedding is unique, reconstructing the original density function is non-trivial.

Empirical estimator. In practice it is unlikely that we would have access to the distribution $\mathbb{P}$, and so rather than compute the kernel mean embedding $\mu_\mathbb{P}$, we can only compute its empirical estimate. Given an i.i.d. sample $\{x_1, \dots, x_n\}$ from $\mathbb{P}$, we can compute the unbiased estimate $\hat{\mu}_\mathbb{P} := \frac{1}{n} \sum_{i=1}^n k(x_i, \cdot)$. This empirical estimator has $\sqrt{n}$-consistency.

Above we have introduced a quantity which makes use of the mean feature mapping in the RKHS. Similarly, we may also make use of the covariance of feature mappings in the RKHS.

Definition. Let $(X, Y)$ be a random variable taking values on domain $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{H}$ and $\mathcal{G}$ be RKHSs with kernel functions $k$ and $l$ on domains $\mathcal{X}$ and $\mathcal{Y}$ respectively. Let $\phi(x) := k(x, \cdot)$ and $\varphi(y) := l(y, \cdot)$ be the feature maps corresponding to kernels $k$ and $l$ respectively. Then we define the (uncentered) cross-covariance operator $\mathcal{C}_{YX}: \mathcal{H} \rightarrow \mathcal{G}$ as

$\mathcal{C}_{YX} := \mathbb{E}_{\mathbb{P}_{YX}}[\varphi(Y) \otimes \phi(X)],$

where $\otimes$ denotes the tensor product, and $\mathbb{P}_{YX}$ denotes the joint probability distribution of $(X, Y).$

The cross-covariance operator exists if $\mathbb{E}_X[k(X, X)] < \infty$ and $\mathbb{E}_Y[l(Y, Y)] < \infty$. It is useful to note that $\mathcal{C}_{YX}$ can be identified as the kernel mean embedding of the joint distribution in the tensor product space $\mathcal{G} \otimes \mathcal{H}$. We emphasise that the cross-covariance operator is an operator mapping from one RKHS to another. We see that when acting on an element $f \in \mathcal{H}$ we have $\mathcal{C}_{YX}(f)= \mathbb{E}_{\mathbb{P}_{YX}}[\varphi(Y) \langle f, \phi(X) \rangle_\mathcal{H}]\in\mathcal{G}$.

Empirical estimator. Given an i.i.d. sample $\{x_i, y_i\}_{i=1}^n$  from the joint probability distribution $\mathbb{P}_{XY},$ we have the following empirical estimator of the cross-covariance

$\hat{\mathcal{C}}_{YX} = \frac{1}{n} \Phi^T \Upsilon,$

where $\Phi = [ \phi(x_1), \dots, \phi(x_n)] \in \mathcal{H}^n$ and $\Upsilon = [ \varphi(y_1), \dots, \varphi(y_n)] \in \mathcal{G}^n$ are row vectors of functions in $\mathcal{H}$ and $\mathcal{G}$ respectively.

## Embedding conditional distributions

Using the cross-covariance operator, we can now define the conditional mean embedding. We want to extend the notion of embedding a marginal distribution $P(X)$ to that of embedding a conditional distribution $P(Y | X)$ and $P(Y | X = x)$ for some $x \in \mathcal{X}$. Embeddings which capture the dependence between variables will allow us to develop more complex algorithms, and will allow us to produce kernel versions of elementary probability operations such as the sum rule and product rule.

We start by specifying two conditions which we want the conditional mean embedding to satisfy based on our definiton of the kernel mean embedding. We will use $\mu_{Y | X=x} \in \mathcal{G}$ and $\mathcal{U}_{Y|X}:\mathcal{H} \rightarrow \mathcal{G}$ to denote the embeddings of the conditional distributions $P(Y | X = x)$ and $P(Y | X)$ respectively.

These conditions are

1. $\mu_{Y | X=x} = \mathbb{E}_{Y | X=x}[\varphi(Y) | X = x] = \mathcal{U}_{Y|X} k(x, \cdot), \quad \forall x\in\mathcal{X}$.
2. $\mathbb{E}_{Y | X=x}[g(Y) | X = x] = \langle g, \mu_{Y|X=x} \rangle_\mathcal{G}, \quad \forall g \in \mathcal{G}$

Note that for a given $x \in \mathcal{X}$ the embedding of $P(Y | X = x)$, $\mu_{Y | X=x}$, is an element in $\mathcal{G}$, whereas the embedding of $P(Y | X)$, $\mathcal{U}_{Y|X}$, is an operator which traces out a family of elements in the RKHS $\mathcal{G}$.

It is shown in (Song, 2009) that the following definition satisfies the above properties.

Definition. Let $\mathcal{C}_{XX}: \mathcal{H} \rightarrow \mathcal{H}$, and $\mathcal{C}_{YX}: \mathcal{H} \rightarrow \mathcal{G}$. Then we can define the conditional mean embeddings of the conditional distributions $P(Y | X)$ and $P(Y | X = x)$, for $x \in \mathcal{X}$, as

$\mathcal{U}_{Y|X} := \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1}, \quad \mu_{Y | X=x} := \mathcal{C}_{YX} \mathcal{C}_{XX}^{-1} k(x, \cdot),$

under the assumption that $\mathbb{E}_{Y|X}[g(Y) | X] \in \mathcal{H}$. It is easy to produce examples in which this does not hold, and so we often replace the above inverse with a Tikhonov-regularised inverse, and consider the embeddings to be approximations of the true embeddings.

Empirical estimator. Given a set of i.i.d. samples $\{x_i, y_i\}_{i=1}^n$ from the joint distribution $P(X,Y)$, the conditional mean embedding can be estimated using

$\hat{\mu}_{Y|X=x} = \sum_{i=1}^n \beta_i l(y_i, \cdot)$,

where $\beta := (K + n \lambda I)^{-1} k_x \in \mathbb{R}^n$ for $K\in\mathbb{R}^{n \times n}$ the Gram matrix associated with $\{ x_1, \dots, x_n\}$, and $k_x = [k(x, x_1), \dots, k(x, x_n)]$, and $\lambda$ a regularisation parameter.

An example. Here we look at a simple example of the conditional mean embedding. We generate data in a method inspired by Schuster et al. (2020). We start by drawing a circle on the $(x,y)$-plane, and specifying 50 equidistant points on the circle. Each of these points is the mean of a Gaussian distribution over $Y$, and we draw 50 samples from each distribution with equal probability. The conditional distribution $P(Y | X)$ in this scenario is a mixture of Gaussians. The top left plot of Figure 1 shows the data sampled from these Gaussians in grey, with conditional distributions overlaid in blue. The conditional mean embedding embeds each conditional distribution in the RKHS $\mathcal{G}$ illustrated in the top right plot, via $\mu_{Y|X=x} = \mathcal{U}_{Y|X} \phi(x)$. In the second row of Figure 1 we see three examples of the reconstructed conditional densities in red. These reconstructed densities have been obtained via the nonparametric estimator proposed in Kanagawa et al. (2014).

Figure 1. A visualisation of the conditional mean embedding.

## Operations: the Sum, Product, and Bayes’ rule

Using the conditional mean embedding and the cross-covariance operator, we can produce versions of the sum, product, and Bayes’ rule which allow us to manipulate distributions embedded in an RKHS.

Recall the sum rule and product rule: $P(X) = \sum_{Y \in \mathcal{Y}} P(X, Y)$, and $P(X, Y) = P(Y | X) P(X)$.

Kernel Sum rule. The sum rule allows us to compute the marginal density of a variable $X$ given the joint distribution over variables $X$ and $Y$, by marginalising out $Y$.

Using the law of total expectation, we have $\mu_X = \mathbb{E}_{XY}[\varphi(X)] = \mathbb{E}_Y \mathbb{E}_{X | Y}[\varphi(X)|Y]$. Thus, we have

$\mu_X = \mathcal{U}_{X|Y} \mathbb{E}_Y [\phi(Y)] = \mathcal{U}_{X|Y} \mu_Y$.

Kernel Product rule. To construct a kernel product rule, we consider the tensor product feature map $\varphi(X) \otimes \phi(Y)$. We can factorise the embedding $\mu_{XY} = \mathbb{E}_{XY}[\varphi(X) \otimes \phi(Y)]$ in two ways using the law of total expectation

1. $\mathbb{E}_Y \mathbb{E}_{X | Y} [\varphi(X) \otimes \phi(Y) | Y] = \mathcal{U}_{X | Y} \mathbb{E}_Y[\phi(Y) \otimes \phi(Y)]$,
2. $\mathbb{E}_X \mathbb{E}_{Y | X} [\phi(Y) \otimes \varphi(X)| X] = \mathcal{U}_{Y | X} \mathbb{E}_X[\varphi(X) \otimes \varphi(X)]$.

Let $\mu_Y^\otimes := \mathbb{E}_Y[\phi(Y) \otimes \phi(Y)]$, and $\mu_X^\otimes := \mathbb{E}_X[\varphi(X) \otimes \varphi(X)]$. Then we can rewrite the above as

$\mu_{XY} = \mathcal{U}_{X | Y} \mu_Y^\otimes = \mathcal{U}_{Y | X}\mu_X^\otimes$.

This looks similar to the sum rule, but we see that there is an additional copy of the variables $Y$ and $X$ passed as input, which is not marginalised out as is done in the sum rule.

Kernel Bayes rule. The kernel Bayes’ rule (KBR) is a nonparametric method which allows for Bayesian inference in the absence of a parametric model or likelihood. In KBR we embed the prior and likelihood in an RKHS via the kernel mean embedding and cross-covariance operator respectively, and use the sum and product rules to manipulate the embeddings in the RKHS.

The presentation of KBR shown here is that given in Muandet et al. (2017), which provides a concise summary of the original work (Fukumizu et al., 2013). Our aim is to compute the embedding of the posterior $P(Y|X)$ given a prior distribution $\Pi(Y)$. We obtain the embedding of the posterior distribution as $\mu_{Y | X=x} = \mathcal{U}_{Y | X}^\Pi \varphi(x)$, where we use a superscript $\Pi$ to denote dependence on the prior $\Pi(Y)$.  More precisely, we have

$\mu_{Y | X=x} = \mathcal{U}_{Y | X}^\Pi \varphi(x) = \mathcal{C}_{YX}^\Pi (\mathcal{C}_{XX}^\Pi)^{-1} \varphi(x)$,

where the cross-covariance operators depend on the prior, and are given by

$\mathcal{C}_{YX}^\Pi = (\mathcal{U}_{X | Y} \mathcal{C}_{YY}^\Pi)^{T}, \quad \mathcal{C}_{XX}^\Pi = \mathcal{U}_{(XX) | Y} \mu_Y^\Pi$.

These results follow from the product and sum rule respectively, where we have replaced the input feature map $\varphi(X)$ with the tensor product feature map $\varphi(X) \otimes \varphi(X)$. The embeddings $\mu_Y^\Pi$ and $\mathcal{C}_{YY}^\Pi$ are simply the embedded prior distribution corresponding to the feature maps $\phi(Y)$ and $\phi(Y) \otimes \phi(Y)$.

A KBR example. Here we look at a simple example of Bayesian inference without a likelihood, similar to that given in Fukumizu et al. (2013). In this toy example we assume that we cannot write down the likelihood function, but we can sample from it. We generate data from a Normal distribution with mean $\mu = 25$, and we assume the variance $\sigma^2 = 2.5^2$ is known. We place a Normal prior on $\mu$ with mean equal to the mean of the observed data, and standard deviation 4. Figure 2 below shows two ways of approaching posterior inference: in the first row we use the fact that the prior and likelihood are conjugate, and we derive the analytical posterior distribution; in the second row we embed the prior distribution and the joint distribution in the RKHS and use KBR to obtain an embedding of the posterior.

Figure 2. Top row: Posterior inference when the prior and likelihood are conjugate. Bottom row: Posterior inference via KBR when the likelihood is intractable.

## References

Fukumizu, K., L. Song, and A. Gretton (2013). “Kernel Bayes’ rule: Bayesian inference with positive definite kernels”. In: The Journal of Machine Learning Research 14.1, pp. 3753–3783.
Kanagawa, M. and K. Fukumizu (2014). “Recovering distributions from Gaussian RKHS embeddings”. In: Artificial Intelligence and Statistics. PMLR, pp. 457–465.
Muandet, K., K. Fukumizu, B. Sriperumbudur, and B. Schölkopf (2017). “Kernel Mean Embedding of Distributions: A Review and Beyond”. In: Foundations and Trends in Machine Learning.
Schuster, I., M. Mollenhauer, S. Klus, and K. Muandet (2020). “Kernel conditional density operators”. In: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 993–1004.
Song, L., J. Huang, A. Smola, and K. Fukumizu (2009). “Hilbert space embeddings of conditional distributions with applications to dynamical systems”. In: Proceedings of the 26th Annual International Conference on Machine
Learning, pp. 961–968.

## Launch of industry-focused seminar series DataScience@Work

Compass is excited to announce the launch of the DataScience@work seminar series. This new seminar series invites speakers from external organisations to talk about their experiences as Data Scientists in industry, government and the third sector. The dual meaning of DataScience@work focuses talks on both the technical side of the speakers’ roles as well as working as part of a wider organisation, and building a career in data science.

Highlighting the importance of the new seminar series, Prof Nick Whiteley (Compass Director) says…

Compass aims to develop scientific and professionally agility in its students. Our goal is to connect technical expertise in data science with experience of thinking, communicating and collaborating across disciplines and across sectors. In our new DataScience@Work seminar series, Compass partners from industry will share insights into the key role of Data Science within their organisations, their objectives and future outlook. This is a great opportunity for our students to learn about career trajectories beyond academia, helping shape their aspirations and personal goals for life beyond the PhD. I’m especially grateful to Adarga, CheckRisk, IBM Research, Improbable, and Shell for leading this first season of DataScience@Work and for their ongoing support for Compass.

For further information on the seminar series, including invited speakers to the 2020/21 session, see the DataScience@work page.

## Student Perspectives: Change in the air: Tackling Bristol’s nitrogen oxide problem

A post by Dom Owens, PhD student on the Compass programme.

“Air pollution kills an estimated seven million people worldwide every year” – World Health Organisation

Many particulates and chemicals are present in the air in urban areas like Bristol, and this poses a serious risk to our respiratory health. It is difficult to model how these concentrations behave over time due to the complex physical, environmental, and economic factors they depend on, but identifying if and when abrupt changes occur is crucial for designing and evaluating public policy measures, as outlined in the local Air Quality Annual Status Report.  Using a novel change point detection procedure to account for dependence in time and space, we provide an interpretable model for nitrogen oxide (NOx) levels in Bristol, telling us when these structural changes occur and describing the dynamics driving them in between.

## Model and Change Point Detection

We model the data with a piecewise-stationary vector autoregression (VAR) model:

In between change points the time series $\boldsymbol{Y}_{t}$, a $d$-dimensional vector, depends on itself linearly over $p \geq 1$ previous time steps through parameter matrices $\boldsymbol{A}_i^{(j)}, i=1, \dots, p$ with intercepts $\boldsymbol{\mu}^{(j)}$, but at unknown change points $k_j, j = 1, \dots, q$ the parameters switch abruptly. $\{ \boldsymbol{\varepsilon}_{t} \in \mathbb{R}^d : t \geq 1 \}$ are white noise errors, and we have $n$ observations.

We phrase change point detection as a test of the hypotheses $H_0: q = 0$ vs. $H_1: q \geq 1$, i.e. the null states there is no change, and the alternative supposes there are possibly multiple changes. To test this, we use moving sum (MOSUM) statistics $T_k(G)$ extracted from the model; these compare the scaled difference between prediction errors for $G$ steps before and after $k$; when $k$ is in a stationary region, $T_k(G)$  will be close to zero, but when $k$ is near or at a change point, $T_k(G)$  will be large. If the maximum of these exceeds a threshold suggested by the distribution under the null, we reject $H_0$ and estimate change point locations with local peaks of $T_k(G)$.

## Air Quality Data

With data from Open Data Bristol over the period from January 2010 to March 2021, we have hourly readings of NOx levels at five locations (with Site IDs) around the city of Bristol, UK: AURN St Paul’s (452); Brislington Depot (203); Parson Street School (215); Wells Road (270); Fishponds Road (463). Taking daily averages, we control for meteorological and seasonal effects such as temperature, wind speed and direction, and the day of the week, with linear regression then analyse the residuals.

We use the bandwidth $G=280$ days to ensure estimation accuracy relative to the number of parameters, which effectively asserts that two changes cannot occur in a shorter time span. The MOSUM procedure rejects the null hypothesis and detects ten changes, pictured above. We might attribute the first few to the Joint Local Transport Plan which began in 2011, while the later changes may be due to policy implemented after a Council supported motion in November 2016. The image below visualises the estimated parameters around the change point in November 2016; we can see that in the segment seven there is only mild cross-dependence, but in segment eight the readings at Wells Road, St. Paul’s, and Fishponds Road become strongly dependent on the other series.

Scientists have generally worked under the belief that these concentration series have long-memory behaviour, meaning values long in the past have influence on today’s values, and that this explains why the autocorrelation function (ACF) decays slowly, as seen above. Perhaps the most interesting conclusion we can draw from this analysis is that that structural changes explain this slow decay – the image below displays much shorter range dependence for one such stationary segment.

## Conclusion

After all our work, we have a simple, interpretable model for NOx levels. In reality, physical processes often depend on each other in a non-linear fashion, and according to the geographical distances between where they are measured; accounting for this might provide better predictions, or tell a slightly different story.  Moreover, there should be interactions with the other pollutants present in the air. Could we combine these for a large-scale, comprehensive air quality model? Perhaps the most important question to ask, however, is how this analysis can be used to benefit policy makers. If we could combine this with a causal model, we might we be able to identify policy outcomes, or to draw comparisons with other regions.

## Compass Special Lecture: Jonty Rougier

Compass is excited to announce that Jonty Rougier (2021 recipient of the Barnett Award) will be delivering a Compass Special Lecture.

Jonty’s experience lies in Computer Experiments, computational statistics and Machine Learning, uncertainty and risk assessment, and decision support. In 2021, he was awarded Barnett Award by the RSS, which is made to those internationally recognised for contributions in the field of environmental statistics, risk and uncertainty quantification. Rougier has also advised several UK Government departments and agencies, including a secondment to the Cabinet Office in 2016/17 to contribute to the UK National Risk Assessment.

13th April 2021
09:30 - 14:00

## Student perspectives: Stein’s paradox in statistics

A post by Alexander Modell, PhD student on the Compass programme.

In 1922, Sir R.A. Fisher laid the foundations of classical statistics with an idea he named maximum likelihood estimation. His principle, for determining the parameters of a statistical model, simply states that given some data, you should choose the parameters which maximise the probability of observing the data that you actually observed. It is based on a solid philosophical foundation which Fisher termed “logic of inductive reasoning” and provides intuitive estimators for all sorts of statistical problems. As a result, it is frequently cited as one of the most influential pieces of applied mathematics of the twentieth century. Eminent statistician Bradley Efron even goes as far as to write

“If Fisher had lived in the era of “apps”, maximum likelihood estimation would have made him a billionaire.” [1]

It came as a major shock to the statistical world when, in 1961, James and Stein exposed a paradox that had the potential to undermine what had become the backbone of classical statistics [2].

It goes something like this.

Suppose I choose a real number on a number line, and instead of telling you that number directly, I add some noise, that is, I perturb it a little and give you that perturbed number instead. We’ll assume that the noise I add is normally distributed and, to keep things simple, has variance one. I ask you to guess what my original number was – what do you do? With no other information to go on, most people would guess the number that I gave them, and they’d be right to do so. By a commonly accepted notion of what “good” is (we’ll describe this a little later on) – this is a “good” guess! Notably, this is also the maximum likelihood estimator.

Suppose now that I have two real numbers and, again, I independently add some noise to each and tell you those numbers. What is your guess of my original numbers? For the same reasons as before, you choose the numbers I gave you. Again, this is a good guess!

Now suppose I have three numbers. As usual, I independently add some noise to each one and tell you those numbers. Of course, you guess the numbers I gave you, right? Well, this no longer the best thing to do! If we imagine those three numbers as a point in three-dimensional space, you can actually do better by arbitrarily choosing some other point, let’s call it B, and shifting your estimate slightly from the point I gave you towards B. This is Stein’s paradox. At first glance, it seems rather absurd. Even after taking some time to take in its simple statement and proof, it’s almost impossible to see how it can possibly be true.

To keep things simple, from now on we will assume that B is the origin, that is that each dimension of B is zero, although keep in mind that all of this holds for any arbitrary point. Stated more formally, the estimator proposed by James and Stein is

$\hat{\mu}_i^{\text{(JS)}} = \left( 1 - \frac{(p-2)}{x_1^2+\cdots + x_p^2} \right) x_i.$

where p is the number of numbers you are estimating, and $x_1,\ldots,x_p$ are the noisy versions that I give you. Since the term in brackets is always less than one, their estimator shrinks each of the dimensions of the observation towards zero. For this reason, it is known as a “shrinkage” estimator. There are two central surprises here. Firstly, it seems totally paradoxical to estimate anything other than the values I gave you. With no other information, why on earth would you estimate anything different? Secondly, their estimate of one value depends on the observations of all of the other values, despite the fact that they’re totally independent!

To most, this is seemingly ludicrous, and to make this point, in his paper on the phenomenon [3], Richard Samworth presents an unusual example. Suppose we want to estimate the proportion of the US electorate who will vote for Barack Obama, the proportion of babies born in China that are girls and the proportion of Britons with light-coloured eyes, then our James-Stein estimate of the proportion of Democratic voters will depend on our hospital and eye colour data!

Before we discuss what we mean when we say that the James-Stein estimator is “better” than the maximum likelihood estimator, let’s take a step back and get some visual intuition as to why this might be true.

Suppose I have a circle on an infinite grid and I task you with guessing the position of the centre of that circle. Instead of just telling you the location of the centre, I draw a point from the circle at random and tell you that position. Let’s call this position A and we’ll call the true, unknown centre of the circle C. What do you propose as your guess? Intuition tells you to guess A, but Stein’s paradox would suggest choosing some other point B, and shifting your guess towards it.

Let’s suppose the circle has radius one and its true centre is at $x=0,y=2$. To keep things simple, suppose I set the point B to be the origin (that is $x=0,y=0$). Now, if I randomly draw a point from this circle, what proportion of the time will my guess actually be better if I shrink it slightly towards the point B. For this illustration, we’ll just consider shrinking towards B a very small amount, so we can rephrase this question as this: What proportion of the circle could get closer to the centre if it were shrunk towards B a little?

The answer to that question is the region marked “2” on the figure. A bit of geometry tells us that that’s about 61% of the circle, a little over half. So estimating a point slighter closer to B than the point we drew has the potential to improve our estimate more times than it hinders it! In fact, this holds true regardless of the position of B.

Now let’s consider this problem in three dimensions. We now have a sphere rather than a circle but everything else remains the same. In three dimensions, the region marked “2” covers just over 79% of the sphere, so about four times out of five, our shrinkage estimator does better than estimating the point A. If you’ve just tried to imagine a four-dimensional sphere, your head probably hurts a little but you can still do this maths. Region “2” now covers 87% of the sphere. In ten dimensions, it covers about 98% of the sphere and in one hundred dimensions it covers 99.99999% of it. That is to say, that only one in ten millions times, our estimate couldn’t be improved by shrinking it towards B.

Hopefully, you’re starting it see how powerful this idea can be, and how much more powerful it becomes as we move into higher dimensions. Despite this, the more you think about it, the more something seems amiss – yet it’s not.

Statisticians typically determine how “good” an estimator is through what is called a loss function. A loss function can be viewed as a penalty that increases the further your estimate is from the underlying truth. Examples are the squared error loss $\ell_2(\hat{\mu};\mu) = \sum_i(\hat{\mu}_i - \mu_i)^2,$ which is by far the most common choice, and the absolute error loss $\ell_1(\hat{\mu};\mu)\ = \sum_i |\hat{\mu}_i - \mu_i|$ where $\hat{\mu}$ is the estimate and $\mu$ is the truth. Since the loss function depends on the data that was observed, it is not suitable for comparing estimators, so instead the risk of the estimator is used, which is defined as the loss we would expect to see if we drew a random dataset from the underlying distribution.

The astonishing result that James and Stein proved, was that, under the squared error loss, their “shrinkage” estimator has a lower risk than the maximum likelihood estimator, regardless of the underlying truth. This came as a huge surprise to the statistics community at the time, which was firmly rooted in Fisher’s principle of maximum likelihood estimation.

It is natural to ask whether this is a quirk of the normal distribution and the squared error loss, but in fact, similar phenomena have been shown for a wide range of distributions and loss functions.

It would be easy to shrug this result off as merely a mathematical curiosity – many contemporaries of Stein did – yet in the era of big data, the ideas underpinning it have proved crucial to modern statistical methodology. Modern datasets regularly contain tens of thousands, if not millions of dimensions and in these settings, classical statistical ideas break down. Many modern machine learning algorithms, such as ridge and Lasso regression, are underpinned by these ideas of shrinkage.

References

[1]: Efron, B., & Hastie, T. (2016). Computer age statistical inference (Vol. 5). Cambridge University Press.

[2]: James, W. and Stein, C.M. (1961) Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, 361-379, University California Press, Berkeley.

[3]: Samworth, R. J., & Cambridge, S. (2012). Stein’s paradox. eureka, 62, 38-41.