Student perspectives: Sampling from Gaussian Processes

A post by Anthony Stephenson, PhD student on the Compass programme.

Introduction

The general focus of my PhD research is in some sense to produce models with the following characteristics:

  1. Well-calibrated (uncertainty estimates from the predictive process reflect the true variance of the target values)
  2. Non-linear
  3. Scalable (i.e. we can run it on large datasets)

At a vague high-level, we can consider that we can have two out of three of those requirements without too much difficulty, but including the third causes trouble. For example, Bayesian linear models would satisfy good-calibration and scalability but (as the name suggests) fail at modelling non-linear functions. Similarly, neural-networks are famously good at modelling non-linear functions and much work has been spent on improving their efficiency and scalability, but producing well-calibrated predictions is a complex additional feature. I am approaching the problem from the angle of Gaussian Processes, which provide well-calibrated non-linear models; at the expense of scalability.

Gaussian Processes (GPs)

See Conor’s blog post¬†for a more detailed introduction to GPs; here I will provide a basic summary of the key facts we need for the rest of the post.

The functional view of GPs is that we define a distribution over functions:

f(\cdot) \sim \mathcal{GP}(m(\cdot),k(\cdot, x))

where m and k are the mean function and kernel function respectively, which play analogous roles to the usual mean and covariance of a Gaussian distribution.

In practice, we only ever observe some finite collection of points, corrupted by noise, which we can hence view as a draw from some multivariate normal distribution:

y_n \sim \mathcal{N}(0, \underbrace{K_n + \sigma^2I_n}_{K_\epsilon})

where

y_n = f_n(x) + \epsilon_n with \epsilon \sim \mathcal{N}(0,\sigma^2).

(Here subscript n denotes dimensionality of the vector or matrix).

When we use GPs to generate predictions at some new test point x_\star \notin X we use the following equations which I will not derive here (See [1]) for the predicted mean and variance respectively:

\mu(x_\star) = k(x_\star,X)K_\epsilon^{-1}y_n

V(x_\star) = k(x_\star, x_\star) - k(x_\star, X)K_\epsilon^{-1}k(X,x_\star)

The key point here is that both predictive functions involve the inversion of an n\times n matrix at a cost of \mathcal{O}(n^3).

Motivation

(more…)

Skip to toolbar