## Ed Davis wins poster competition

Congratulations to Ed Davis who won a poster award as part of the Jean Golding Institute’s Beauty of Data competition.

This visualisation, entitled “The World Stage”, gives a new way of representing the positions of countries. Instead of placing them based on their geographical position, they have been placed based on their geopolitical alliances. Countries have been placed such to minimise the distance to their allies and maximise the distance to their non-allies based on 40 different alliances involving 161 countries. This representation was achieved by embedding the alliance network using Node2Vec, followed by principal component analysis (PCA) to reduce it to 2D.

## Student perspectives: Sampling from Gaussian Processes

A post by Anthony Stephenson, PhD student on the Compass programme.

## Introduction

The general focus of my PhD research is in some sense to produce models with the following characteristics:

1. Well-calibrated (uncertainty estimates from the predictive process reflect the true variance of the target values)
2. Non-linear
3. Scalable (i.e. we can run it on large datasets)

At a vague high-level, we can consider that we can have two out of three of those requirements without too much difficulty, but including the third causes trouble. For example, Bayesian linear models would satisfy good-calibration and scalability but (as the name suggests) fail at modelling non-linear functions. Similarly, neural-networks are famously good at modelling non-linear functions and much work has been spent on improving their efficiency and scalability, but producing well-calibrated predictions is a complex additional feature. I am approaching the problem from the angle of Gaussian Processes, which provide well-calibrated non-linear models; at the expense of scalability.

### Gaussian Processes (GPs)

See Conor’s blog post for a more detailed introduction to GPs; here I will provide a basic summary of the key facts we need for the rest of the post.

The functional view of GPs is that we define a distribution over functions:

$f(\cdot) \sim \mathcal{GP}(m(\cdot),k(\cdot, x))$

where $m$ and $k$ are the mean function and kernel function respectively, which play analogous roles to the usual mean and covariance of a Gaussian distribution.

In practice, we only ever observe some finite collection of points, corrupted by noise, which we can hence view as a draw from some multivariate normal distribution:

$y_n \sim \mathcal{N}(0, \underbrace{K_n + \sigma^2I_n}_{K_\epsilon})$

where

$y_n = f_n(x) + \epsilon_n$ with $\epsilon \sim \mathcal{N}(0,\sigma^2)$.

(Here subscript $n$ denotes dimensionality of the vector or matrix).

When we use GPs to generate predictions at some new test point $x_\star \notin X$ we use the following equations which I will not derive here (See [1]) for the predicted mean and variance respectively:

$\mu(x_\star) = k(x_\star,X)K_\epsilon^{-1}y_n$

$V(x_\star) = k(x_\star, x_\star) - k(x_\star, X)K_\epsilon^{-1}k(X,x_\star)$

The key point here is that both predictive functions involve the inversion of an $n\times n$ matrix at a cost of $\mathcal{O}(n^3)$.

## Student Perspectives: Application of Density Ratio Estimation to Likelihood-Free problems

A post by Jack Simons, PhD student on the Compass programme.

# Introduction

I began my PhD with my supervisors, Dr Song Liu and Professor Mark Beaumont with the intention of combining their respective fields of research; Density Ratio Estimation (DRE), and Simulation Based Inference (SBI):

• DRE is a rapidly growing paradigm in machine learning which (broadly) provides efficient methods of comparing densities without the need to compute each density individually. For a comprehensive yet accessible overview of DRE in Machine Learning see [1].
• SBI is a group of methods which seek to solve Bayesian inference problems when the likelihood function is intractable. If you wish for a concise overview of the current work, as well as motivation then I recommend [2].

Last year we released a paper, Variational Likelihood-Free Gradient Descent [3] which combined these fields. This blog post seeks to condense, and make more accessible, the contents of the paper.

# Motivation: Likelihood-Free Inference

Let’s begin by introducing likelihood-free inference. We wish to do inference on the posterior distribution of parameters $\theta$ for a specific observation $x=x_{\mathrm{obs}}$, i.e. we wish to infer $p(\theta|x_{\mathrm{obs}})$ which can be decomposed via Bayes’ rule as

$p(\theta|x_{\mathrm{obs}}) = \frac{p(x_{\mathrm{obs}}|\theta)p(\theta)}{\int p(x_{\mathrm{obs}}|\theta)p(\theta) \mathrm{d}\theta}.$

The likelihood-free setting is that, additional to the usual intractability of the normalising constant in the denominator, the likelihood, $p(x|\theta)$, is also intractable. In lieu of this, we require an implicit likelihood which describes the relation between data $x$ and parameters $\theta$ in the form of a forward model/simulator (hence simulation based inference!). (more…)

## Between 4th and 8th of April 2022 Compass CDT students are attending APTS Week 2 in Durham.

Academy for PhD Training in Statistics (APTS) organises, through a collaboration between major UK statistics research groups, four residential weeks of training each year for first-year PhD students in statistics and applied probability nationally. Compass students attend all four APTS courses hosted by prestigious UK Universities.

For their APTS Week in Durham Compass students will be attending the following modules:

• Applied Stochastic Processes (Nicholas Georgiou and Matt Roberts): This module will introduce students to two important notions in stochastic processes — reversibility and martingales — identifying the basic ideas, outlining the main results and giving a flavour of some of the important ways in which these notions are used in statistics.
• Statistical Modelling (Helen Ogden): The aim of this module is to introduce important aspects of statistical modelling, including model selection, various extensions to generalised linear models, and non-linear models.

## Student Perspectives: Multi-agent sequential decision making

A post by Conor Newton, PhD student on the Compass programme.

# Introduction

My research focuses on designing decentralised algorithms for the multi-agent variant of the Multi-Armed Bandit problem. This research is jointly supervised by Henry Reeve and Ayalvadi Ganesh.

(Image credit: Microsoft Research)

Many real-world optimisation problems involve repeated rather than one-off decisions. A decision maker (who we refer to as an agent) is required to repeatedly perform actions from a set of available options. After taking an action, the agent will receive a reward based on the action performed. The agent can then use this feedback to inform later decisions. Some examples of such problems are:

• Choosing advertisements to display on a website each time a page is loaded to maximise click-through rate.
• Calibrating the temperature to maximise the yield from a chemical reaction.
• Distributing a budget between university departments to maximise research output.
• Choosing the best route to commute to work.

In each case there is a fundamental trade-off between exploitation and exploration. On the one hand, the agent should act in ways which exploit the knowledge they have accumulated to promote their short term reward, whether that’s the yield of a chemical process or click-through rate on advertisements. On the other hand, the agent should explore new actions in order to increase their understanding of their environment in ways which may translate into future rewards. (more…)

## PhD Compass application deadline: 16 March 2022

EPSRC PhD in Computational Statistics and Data Science is now recruiting for its next available fully-funded home fees places to start September 2022.

We will be prioritising applicants who wish to work with the following potential supervisors:

Professor Nicky Welton – Professor Welton works in the the department of Population Health Sciences in the Bristol Medical School.  Her work as a Compass supervisor can include supervision in the areas of Medical Statistics and Health Economics, in particular methods for combining evidence from multiple sources to answer healthcare policy questions.

Dr Sidarth Jaggi – Dr Jaggi is an Associate Professor in the Institute of Statistical Science and a Turing Fellow.  His Compass PhD supervision can cover areas such as high-dimensional statistics, and robust machine learning.

Dr Rihuan Ke – Dr Ke is a Lecturer in the School of Mathematics.  His research is on machine learning and mathematical image analysis. He has been developing statistical learning approaches and data-driven models for solving problems in computation and data science, and in particular for large scale image analysis. The typical approaches that he takes are to combine mathematical structures and statistical knowledge with modern deep learning techniques, to enable automatic analysis of the intrinsic structure of imaging data and exploiting rich information encoded in the data for the underlying tasks. In his projects, he is also interested in relevant applications in material sciences, medical imaging, and remote sensing. He is supervising PhD projects in deep learning, image analysis, and more generally data science.

# Introduction

My work focuses on addressing the growing need for reliable, day-ahead energy demand forecasts in smart grids. In particular, we have been developing structured ensemble models for probabilistic forecasting that are able to incorporate information from a number of sources. I have undertaken this EDF-sponsored project with the help of my supervisors Matteo Fasiolo (UoB) and Yannig Goude (EDF) and in collaboration Christian Capezza (University of Naples Federico II).

## Motivation

One of the largest challenges society faces is climate change. Decarbonisation will lead to both a considerable increase in demand for electricity and a change in the way it is produced. Reliable demand forecasts will play a key role in enabling this transition. Historically, electricity has been produced by large, centralised power plants. This allows production to be relatively easily tailored to demand with little need for large-scale storage infrastructure. However, renewable methods are typically decentralised, less flexible and supply is subject to weather conditions or other unpredictable factors. A consequence of this is that electricity production will less able to react to sudden changes in demand, instead it will need to be generated in advance and stored. To limit the need for large-scale and expensive electricity storage and transportation infrastructure, smart grid management systems can instead be employed. This will involve, for example, smaller, more localised energy storage options. This increases the reliance on accurate demand forecasts to inform storage management decisions, not only at the aggregate level, but possibly down at the individual household level. The recent impact of the Covid-19 pandemic also highlighted problems in current forecasting methods which struggled to cope with the sudden change in demand patterns. These issues call attention to the need to develop a framework for more flexible energy forecasting models that are accurate at the household level. At this level, demand is characterised by a low signal-to-noise ratio, with frequent abrupt changepoints in demand dynamics. This can be seen in Figure 1 below.

The challenges posed by forecasting at a low level of aggregation motivate the use of an ensemble approach that can incorporates information from several models and across households. In particular, we propose an additive stacking structure where we can borrow information across households by constructing a weighted combination of experts, which is generally referred to as stacking regressions [2].

## Compass news round-up 2021

As we start 2022, we look back at our Compass achievements over 2021…

## Invited speakers and seminars

Over the course of the year we invited seminar speakers Ingmar Schuster on kernel methods, Nicolas Chopin offered a two-part lecture on sequential Monte Carlo samplers, Ioannis Kosmidis on reducing bias in estimation and a special two-part lecture from Barnett Award winning Jonty Rougier on Wilcoxon’s Two Sample Test.

In May, Compass PhD student, Mauro Camara Escudero, set up PAI-Link: a nation-wide AI postgraduate seminar series.

Last year also saw the launch of our DataScience@work seminar series, at which we had 5 external organisations speak (Adarga, CheckRisk, Shell, IBM Research and Improbable) and the British Geological Survey opened this academic year’s seminar series with a talk from alumna Dr Kathryn Leeming.

## Training and internships

We ran training sessions on themes such as interdisciplinary research, responsible innovation and a Hackathon run with Compass partners LV= General Insurance, which is recounted by Doug Corbin in his blog post. Compass held its first Science Focus Lab on multi-omics data and cancer treatment with colleagues from Bristol Integrative Epidemiology unit.

Five Compass students were recruited to internships with organisations such as Microsoft Research, Adarga, CheckRisk, Afiniti and Shell.

## Outreach

The Student Perspectives blog series started up last year with Three Days in the Life of a Silicon Valley Start-up. This student-authored series explored topics such as air pollution in Bristol,  the different

approaches of frequentists and Bayesians, and how to generalise kernel methods to probability distributions.

Michael Whitehouse contributed to a Sky News report on the potential impact of the pandemic on the Tokyo Olympics by modelling the rise of COVID-19 cases in Japan.

Compass ran its first Access to Data Science event – an immersive experience for prospective PhD students which aimed to increase diversity amongst data science researchers by encouraging participants such as women and members of the LGBTQ+ and BAME communities to join us.

## Research and studentships

Our second cohort of students selected their mini-projects (a precursor to their PhD research) and our third cohort of students joined the Compass programme in September 2021.

Annie Gray presented her paper ‘Matrix factorisation and the interpretation of geodesic distance’ at NeurIPS 2021. Conor Newton gave a talk at a workshop in conjunction with ACM Sigmetrics 2021 and he and Dom Owens won the poster session of the Fry Statistics Conference.  Jack Simons paper ‘Variational Likelihood-Free Gradient Descent’ was accepted at AABI 2022. Alex Modell’s paper ‘A Graph Embedding Approach to User Behavior Anomaly Detection’ was accepted to IEEE Big Data Conference 2021. Danny Williams and supervisor Song Liu were awarded an EPSRC Impact Acceleration Account for their project in collaboration with Adarga.

We also created links with new industrial partners – AstraZeneca, ILRI and EDF – who are each sponsoring Compass PhD projects for the following students: Harry Tata, Dan Milner, and Ben Griffiths and Euan Enticott.

## Student Perspectives: Gaussian Process Emulation

A post by Conor Crilly, PhD student on the Compass programme.

# Introduction

This project investigates uncertainty quantification methods for expensive computer experiments. It is supervised by Oliver Johnson of the University of Bristol, and is partially funded by AWE.

## Outline

Physical systems and experiments are commonly represented, albeit approximately, using mathematical models implemented via computer code. This code, referred to as a simulator, often cannot be expressed in closed form, and is treated as a ‘black-box’. Such simulators arise in a range of application domains, for example engineering, climate science and medicine. Ultimately, we are interested in using simulators to aid some decision making process. However, for decisions made using the simulator to be credible, it is necessary to understand and quantify different sources of uncertainty induced by using the simulator. Running the simulator for a range of input combinations is what we call a computer experiment [1]. As the simulators of interest are expensive, the available data is usually scarce. Emulation is the process of using a statistical model (an emulator) to approximate our computer code and provide an estimate of the associated uncertainty.

Intuitively, an emulator must possess two fundamental properties

• It must be cheap, relative to the code
• It must provide an estimate of the uncertainty in its output

A common choice of emulator is the Gaussian process emulator, which is discussed extensively in [2] and described in the next section.

## Types of Uncertainty

There are many types of uncertainty associated with the use of simulators including input, model and observational uncertainty. One type of uncertainty induced by using an expensive simulator is code uncertainty, described by Kennedy and O’Hagan in their seminal paper on calibration [3]. To paraphrase Kennedy and O’Hagan: In principle the simulator encodes a relationship between a set of inputs and a set of outputs, which we could evaluate for any given combination of inputs. However, in practice, it is not feasible to run the simulator for every combination, so acknowledging the uncertainty in the code output is required. (more…)

## Student Perspectives: The Importance of Stability in Dynamic Network Analysis

A post by Ed Davis, PhD student on the Compass programme.

## Introduction

Today is a great day to be a data scientist. More than ever, our ability to both collect and analyse data allow us to solve larger, more interesting, and more diverse problems. My research focuses on analysing networks, which cover a mind-boggling range of applications from modelling vast computer networks [1], to detecting schizophrenia in brain networks [2]. In this blog, I want to share some of the cool research I have been a part of since joining the COMPASS CDT, which has to do with the analysis of dynamic networks.

## Network Basics

A network can be defined as an ordered pair, $(V, E)$, where $V$ is a node (or vertex) set and $E$ is an edge set. From this definition, we can represent any $n$ node network in terms of an adjacency matrix, $A \in \mathbb{R}^{n \times n}$, where for nodes $i, j \in V$,

$A_{ij} = \Bigg\{ \begin{array}{ll} 1 & (i,j) \in E \\ 0, & (i,j) \not\in E \end{array}$.

When we model networks, we can assume that there are some unobservable weightings which mean that certain nodes have a higher connection probability than others. We then observe these in the adjacency matrix with some added noise (like an image that has been blurred). Under this assumption, there must exist some unobservable noise-free version of the adjacency matrix (i.e. the image) that we call the probability matrix, $\mathbf{P} \in \mathbb{R}^{n \times n}$. Mathematically, we represent this by saying

$A_{ij} \overset{\text{ind}}{\sim} \text{Bernoulli} \left(P_{ij} \right)$,

where we have chosen a Bernoulli distribution as it will return either a 1 or a 0. As the connection probabilities are not uniform across the network (inhomogeneous) and the adjacency is sampled from some probability matrix (random), we say that $\mathbf{A}$ is an inhomogeneous random graph.

Going a step further, we can model each node as having a latent position, which can be used to generate its connection probabilities and, hence define its behaviour. Using this, we can define node communities; a group of nodes that have the same underlying connection probabilities, meaning they have the same latent positions. We call this kind of model a latent position model. For example, in a network of social interactions at a school, we expect that pupils are more likely to interact with other pupils in their class. In this case, pupils in the same class are said to have similar latent positions and are part of a node community. Mathematically, we say there is a latent position $\mathbf{Z}_i \in \mathbb{R}^{k}$ assigned to each node, and then our probability matrix will be the gram matrix of some kernel, $f: \mathbb{R}^k \times \mathbb{R}^k \rightarrow [0,1]$. From this, we generate our adjacency matrix as

$A_{ij} \overset{\text{ind}}{\sim} \text{Bernoulli}\left( f \left\{ \mathbf{Z}_i, \mathbf{Z}_j \right\} \right)$.

Under this model, our goal is then to estimate the latent positions by analysing $\mathbf{A}$.