Student perspectives: Sampling from Gaussian Processes

A post by Anthony Stephenson, PhD student on the Compass programme.

Introduction

The general focus of my PhD research is in some sense to produce models with the following characteristics:

  1. Well-calibrated (uncertainty estimates from the predictive process reflect the true variance of the target values)
  2. Non-linear
  3. Scalable (i.e. we can run it on large datasets)

At a high level, we can have any two of those three requirements without too much difficulty, but adding the third causes trouble. For example, Bayesian linear models satisfy good calibration and scalability but (as the name suggests) fail at modelling non-linear functions. Similarly, neural networks are famously good at modelling non-linear functions, and much effort has gone into improving their efficiency and scalability, but producing well-calibrated predictions from them remains a difficult additional requirement. I am approaching the problem from the angle of Gaussian Processes, which provide well-calibrated non-linear models, at the expense of scalability.

Gaussian Processes (GPs)

See Conor’s blog post for a more detailed introduction to GPs; here I will provide a basic summary of the key facts we need for the rest of the post.

The functional view of GPs is that we define a distribution over functions:

f(\cdot) \sim \mathcal{GP}(m(\cdot),k(\cdot, \cdot))

where m and k are the mean function and kernel function respectively, which play analogous roles to the usual mean and covariance of a Gaussian distribution.

In practice, we only ever observe some finite collection of points, corrupted by noise, which we can hence view as a draw from some multivariate normal distribution:

y_n \sim \mathcal{N}(0, \underbrace{K_n + \sigma^2I_n}_{K_\epsilon})

where

y_n = f_n + \epsilon_n with \epsilon_n \sim \mathcal{N}(0,\sigma^2 I_n).

(Here subscript n denotes dimensionality of the vector or matrix).
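As a concrete sketch of this setup (purely illustrative: the post does not specify a kernel, so I assume a squared-exponential kernel here, and the function and variable names are my own):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

n, d = 500, 1
X = np.random.randn(n, d)            # observed input locations
sigma2 = 0.1                         # observation-noise variance sigma^2
K_n = rbf_kernel(X, X)               # n x n kernel matrix K_n
K_eps = K_n + sigma2 * np.eye(n)     # K_eps = K_n + sigma^2 I_n, the covariance of y_n
```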

When we use GPs to generate predictions at some new test point x_\star \notin X, we use the following equations (which I will not derive here; see [1]) for the predicted mean and variance respectively:

\mu(x_\star) = k(x_\star,X)K_\epsilon^{-1}y_n

V(x_\star) = k(x_\star, x_\star) - k(x_\star, X)K_\epsilon^{-1}k(X,x_\star)

The key point here is that both predictive functions involve the inversion of an n\times n matrix at a cost of \mathcal{O}(n^3).
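A rough sketch of these two formulae follows, using a Cholesky solve in place of forming K_\epsilon^{-1} explicitly (still \mathcal{O}(n^3)); the helper below is hypothetical and takes any kernel function, e.g. the rbf_kernel sketched above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, x_star, kernel, sigma2):
    """Exact GP predictive mean and variance at a single test point x_star."""
    n = X.shape[0]
    K_eps = kernel(X, X) + sigma2 * np.eye(n)          # K_eps = K_n + sigma^2 I_n
    chol = cho_factor(K_eps)                           # O(n^3) factorisation
    k_star = kernel(x_star[None, :], X).ravel()        # k(x_star, X), an n-vector
    mu = k_star @ cho_solve(chol, y)                   # mu(x_star) = k(x_star, X) K_eps^{-1} y_n
    var = (kernel(x_star[None, :], x_star[None, :])[0, 0]
           - k_star @ cho_solve(chol, k_star))         # V(x_star) = k(x_star, x_star) - k(x_star, X) K_eps^{-1} k(X, x_star)
    return mu, var
```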

Motivation

So, our over-arching goal is to improve the scalability of Gaussian Processes, which we have seen naïvely scales as \mathcal{O}(n^3). This likely means developing approximation methods which scale better; but we also need to be able to verify that any such approximations still satisfy the requirements we defined. If we aim to approximate a GP, the best measure of success is a comparison against an exact GP. As such, we choose to generate data from a GP, on which we can subsequently train our methods and for which we can carefully assess how close to the objective truth they get.

Unfortunately, even generating data from a GP naïvely scales as \mathcal{O}(n^3). We can see this by considering the simplest way of doing so:

  1. Generate an n-dimensional vector u\sim \mathcal{N}(0,I_n).
  2. Transform: f_n = K^\frac{1}{2} u.
  3. Add noise: y_n = f_n + \epsilon_n.

Here K^\frac{1}{2} denotes some choice of matrix square root: we might use a Cholesky decomposition (technically not a square root, but it achieves the desired result) at a cost of \mathcal{O}(n^3), or perhaps an SVD (again at a cost of \mathcal{O}(n^3)).
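A minimal sketch of this naive \mathcal{O}(n^3) recipe, with a Cholesky factor playing the role of K^\frac{1}{2} (again, the helper names are illustrative and not from the post):

```python
import numpy as np

def naive_gp_sample(X, kernel, sigma2, jitter=1e-8):
    """Exact draw y_n ~ N(0, K_n + sigma^2 I_n) at O(n^3) cost."""
    n = X.shape[0]
    K = kernel(X, X) + jitter * np.eye(n)          # small jitter keeps the Cholesky numerically stable
    L = np.linalg.cholesky(K)                      # O(n^3); K = L L^T
    u = np.random.randn(n)                         # step 1: u ~ N(0, I_n)
    f = L @ u                                      # step 2: f_n = L u, so Cov(f_n) = L L^T = K_n
    y = f + np.sqrt(sigma2) * np.random.randn(n)   # step 3: add observation noise epsilon_n
    return y
```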

So we see that we struggle at the first hurdle; we need a method to approximately sample from a GP before we can begin to test our approximation methods (at moderately large data sizes).

Methods

Random Fourier Features (RFF)[2]

One possible method we could use to generate approximate draws from a GP is the RFF algorithm. The idea is based on Bochner’s theorem which, roughly speaking, states that the Fourier transform of any positive-definite shift-invariant kernel (i.e. k(x,x') = k(x-x')) is a probability density. At first glance this might not appear that useful, but it allows us to produce a Monte-Carlo approximation of any such kernel by sampling from this probability density.

More precisely, if

k(x-x') = \int_{R^d}p(\omega)\underbrace{e^{j\omega^T(x-x')}}_{\zeta_\omega(x)\zeta_\omega(x')^*}d\omega = E_\omega[\zeta_\omega(x)\zeta_\omega(x')^*], \quad \text{where } \zeta_\omega(x) = e^{j\omega^T x},

then we can approximate the expectation at each pair of points x, x' by sampling D values of \omega and averaging the outputs of the corresponding \zeta_\omega functions.

So we see that we can use RFFs to approximate a kernel.
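As a sketch for the squared-exponential kernel (whose spectral density p(\omega) is Gaussian), using the real-valued cosine features with a random phase b from [2]; the \sqrt{2/D} normalisation folds the averaging into the feature map. The function below is my own illustration, not code from the post:

```python
import numpy as np

def rff_features(X, D, lengthscale=1.0, rng=None):
    """Random Fourier feature matrix Z in R^{n x D} such that Z @ Z.T approximates k(X, X)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    omega = rng.standard_normal((d, D)) / lengthscale   # omega ~ p(omega), the spectral density of the SE kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)           # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# Usage: Z = rff_features(X, D); K_hat = Z @ Z.T approaches the exact kernel matrix as D grows.
```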

How is this useful for us?

In order to generate samples from our GP, we need some form of matrix factorisation (say K = LL^T à la Cholesky) of the kernel matrix so that the covariance of our approximate draw is a good approximation to the kernel, i.e. given some approximate (noiseless) draw \hat{f}_n,

\mathrm{Cov}(\hat{f}_n) = E[Zww^TZ^T] = ZE[ww^T]Z^T = ZZ^T \simeq K

for random vectors w \sim \mathcal{N}(0,I).

Since we generate our Monte Carlo approximation by taking inner products of our feature functions \zeta_\omega(x), it’s not hard to see that we have such a factorisation.

In particular, if we form a matrix Z \in R^{n\times D} whose rows are the D-vectors z(x) with entries z_i(x)=\sqrt{2/D}\cos(\omega_i^T x + b_i) at each location x, and then use this to transform a random \mathcal{N}(0,I_D) D-vector w, we can generate our n-dimensional approximate draw:

\hat{y}_n = Zw + \xi

for \xi \sim \mathcal{N}(0,\sigma^2 I_n), at a cost of \mathcal{O}(nD); so the question is, how large does D need to be?
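Putting the pieces together, a sketch of the \mathcal{O}(nD) approximate sampler (reusing the hypothetical rff_features helper from the previous sketch):

```python
import numpy as np

def rff_gp_sample(X, D, sigma2, lengthscale=1.0):
    """Approximate draw y_n ~ N(0, K_n + sigma^2 I_n) at O(nD) cost via RFFs."""
    Z = rff_features(X, D, lengthscale=lengthscale)      # n x D feature matrix
    w = np.random.randn(D)                               # w ~ N(0, I_D)
    f_hat = Z @ w                                        # approximate noiseless draw; Cov = Z Z^T ~= K_n
    xi = np.sqrt(sigma2) * np.random.randn(X.shape[0])   # observation noise xi ~ N(0, sigma^2 I_n)
    return f_hat + xi
```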

Guarantees

So we now have a procedure to generate approximate draws from a GP, but if we want to conduct experiments on this data to analyse the behaviour of different models on data from some “true” GP, we need to ensure that our draw is sufficiently close to the real thing that our experiments remain valid. For RFFs, the main lever we have to control the fidelity of our approximation is the number D of features used. What we require, then, is a bound on the size of D needed to give us a draw from our GP that is almost guaranteed to fool some third party into thinking it’s a draw from the true GP.

To make this more concrete: if we provide a sample generated via our RFF method to a suspicious colleague, and they run any hypothesis test they like to check we haven’t fobbed them off, then with high probability the sample passes the test. This is quite a strong requirement, and the line of argument that leads us from here to a bound on D is unfortunately so stringent that the number we come up with is D \sim \mathcal{O}(n^2\log n), giving an overall complexity for the approximate sampling method of \mathcal{O}(n^3\log n). Even worse than what we started with!

On a more positive note, we found an alternative method that satisfies our needs: we were able to show that Contour Integral Quadrature (CIQ) [3] can generate samples good enough to pass this hypothesis-test test at a (slightly more affordable) cost of \mathcal{O}(n^{5/2}\log n).

Acknowledgements

My research is guided by my Bristol supervisor Robert Allison and my industry supervisor Edward Pyzer-Knapp of IBM Research (who part-fund my PhD).

References

[1] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.

[2] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.

[3] G. Pleiss, M. Jankowiak, D. Eriksson, A. Damle, and J. R. Gardner. Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization. arXiv:2006.11267, 2020.
