A post by Anthony Stephenson, PhD student on the Compass programme.

## Introduction

The general focus of my PhD research is in some sense to produce models with the following characteristics:

- Well-calibrated (uncertainty estimates from the predictive process reflect the true variance of the target values)
- Non-linear
- Scalable (i.e. we can run it on large datasets)

At a high level, we can generally achieve two out of three of those requirements without *too* much difficulty, but adding the third causes trouble. For example, Bayesian linear models are well-calibrated and scalable but (as the name suggests) fail at modelling non-linear functions. Similarly, neural networks are famously good at modelling non-linear functions, and much work has gone into improving their efficiency and scalability, but producing well-calibrated predictions from them is a complex additional feature. I am approaching the problem from the angle of Gaussian Processes, which provide well-calibrated non-linear models at the expense of scalability.

### Gaussian Processes (GPs)

See Conor’s blog post for a more detailed introduction to GPs; here I will provide a basic summary of the key facts we need for the rest of the post.

The functional view of GPs is that we define a distribution over functions:

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$

where $m(x)$ and $k(x, x')$ are the mean function and kernel function respectively, which play analogous roles to the usual mean and covariance of a Gaussian distribution.
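To make the "distribution over functions" idea concrete, here is a minimal sketch (not from the original post) that draws a few sample functions from a GP prior; it assumes a zero mean function and a squared-exponential (RBF) kernel, and the `rbf_kernel` helper is hypothetical:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Evaluating the GP on a finite grid turns the distribution over functions into
# an ordinary multivariate normal, which we can sample from directly.
x_grid = np.linspace(-3, 3, 200)
K = rbf_kernel(x_grid, x_grid)
mean = np.zeros_like(x_grid)            # zero mean function for simplicity
jitter = 1e-8 * np.eye(len(x_grid))     # small diagonal term for numerical stability
samples = np.random.default_rng(0).multivariate_normal(mean, K + jitter, size=3)
```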

In practice, we only ever observe some finite collection of points, corrupted by noise, which we can hence view as a draw from some multivariate normal distribution:

$$\mathbf{y}_N \sim \mathcal{N}\big(\mathbf{m}_N, K_N + \sigma^2 I_N\big)$$

where

$$[\mathbf{m}_N]_i = m(x_i), \qquad [K_N]_{ij} = k(x_i, x_j)$$

with $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.

(Here the subscript $N$ denotes the dimensionality of the vector or matrix.)
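As a quick illustration of this observation model (again a sketch, reusing the hypothetical `rbf_kernel` above and assuming a zero mean function), a noisy training set can be drawn directly from that multivariate normal:

```python
rng = np.random.default_rng(1)
N = 50
sigma2 = 0.1                                  # observation noise variance sigma^2

X_train = np.sort(rng.uniform(-3, 3, size=N))
K_N = rbf_kernel(X_train, X_train)            # [K_N]_ij = k(x_i, x_j)
m_N = np.zeros(N)                             # zero mean function
# y_N ~ N(m_N, K_N + sigma^2 I_N): latent function values corrupted by i.i.d. Gaussian noise
y_train = rng.multivariate_normal(m_N, K_N + sigma2 * np.eye(N))
```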

When we use GPs to generate predictions at some new test point $x_*$, we use the following equations, which I will not derive here (see [1]), for the predicted mean and variance respectively:

$$\mu(x_*) = m(x_*) + \mathbf{k}_*^\top (K_N + \sigma^2 I_N)^{-1} (\mathbf{y}_N - \mathbf{m}_N)$$

$$\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top (K_N + \sigma^2 I_N)^{-1} \mathbf{k}_*$$

where $[\mathbf{k}_*]_i = k(x_i, x_*)$.

The key point here is that both predictive equations involve the inversion of an $N \times N$ matrix, at a cost of $\mathcal{O}(N^3)$.
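To see where that cost comes from, here is a naive sketch of the predictive equations (again assuming the hypothetical `rbf_kernel` and zero mean function from the snippets above); the linear solves against the $N \times N$ matrix are the $\mathcal{O}(N^3)$ step:

```python
def gp_predict(X_train, y_train, X_test, sigma2=0.1):
    # K_N + sigma^2 I_N: the N x N matrix whose factorisation dominates the cost
    K = rbf_kernel(X_train, X_train) + sigma2 * np.eye(len(X_train))
    K_star = rbf_kernel(X_train, X_test)      # [k_*]_i = k(x_i, x_*), one column per test point
    K_ss = rbf_kernel(X_test, X_test)         # prior covariance between test points

    # In practice one would factorise K once (e.g. a Cholesky decomposition)
    # rather than invert it, but the O(N^3) scaling is the same.
    alpha = np.linalg.solve(K, y_train)       # (K_N + sigma^2 I_N)^{-1} y_N
    v = np.linalg.solve(K, K_star)            # (K_N + sigma^2 I_N)^{-1} k_* for all test points

    mean = K_star.T @ alpha                              # predictive mean at each test point
    var = np.diag(K_ss) - np.sum(K_star * v, axis=0)     # predictive variance at each test point
    return mean, var

mean, var = gp_predict(X_train, y_train, np.linspace(-3, 3, 100))
```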