Student Perspectives: Are larger models always better?

A post by Emma Ceccherini, PhD student on the Compass programme.


In December 2023, I attended NeurIPS, a machine learning conference, with some COMPASS colleagues. There, I attended a tutorial titled “Reconsidering Overfitting in the Age of Overparameterized Models”. The findings the speakers presented overturn some traditional statistical concepts, so I’d like to share some of these innovative ideas with the COMPASS blog readers.

Classical statistician vs deep learning practitioners
Classical statisticians argue that small models have high bias but large variance (Figure 1 (left)) and large models have low bias but high variance (Figure 1 (right)). This is called the bias-variance trade-off and is a crucial notion that can be found in all traditional statistic textbooks. Large, over-parameterised models perfectly interpolate the data points by fitting noise and they have a near-zero training error, but an increasing test error. This phenomenon is called overfitting and causes poor performances on unseen data. Overfitting implies low generalisation, which can be thought of as the model’s performance on newly generated data at test time.

Figure 1: Examples of models with low complexity, good complexity, and large complexity.

Therefore, statistics textbooks recommend avoiding overfitting and improving generalization by finding a balance in the bias-variance trade-off, either by reducing the number of parameters or using regularisation (Figure 1 (centre)).

However, as available computational power has increased, practitioners have made larger and larger models. For example, neural networks have millions of parameters, more than enough to fit noise, but they generalize very well in practice, performing significantly better than small models. These large over-parametrised models exceed the so-called interpolation threshold that is when the training error is approximately zero. Several theoretical statisticians are trying to infer what happens after this threshold. While we now have some answers, many questions are still up for debate!

Figure 2: The bias-variance trade-off.

 

The double descent

Nakkiran et al. [2019] show that in the under-parameterised regime, neural networks test errors exhibit the classical u-shape from the bias-variance trade-off, while in the over-parameterised regime, after the interpolation threshold, the test error decreases again creating the so-called double descent (see Figure 3). Figure 4 shows the test error of a neural network classifier on CIFAR-10, a standard image data set. The plot shows a double descent in the test error for neural networks trained until convergence (purple line).

Figure 3: The double descent.

The authors make two more innovative observations: harmless interpolation and good generalisation for large models. It can be observed from Figure 4 that regularisation, equivalent to early stopping (red line), is substantially beneficial around the interpolation threshold. However, as the model size grows the test error for optimal early stopped neural networks (red line) and the one of neural networks trained until convergence test (purple line) overlap. Therefore, For large models, interpolation (trained until convergence) is not worse than regularisation (optimal early stopped), that is interpolation is harmless. Finally, Figure 4 shows that the test error is low as the size of the model grows. Hence, for large models, we can achieve reasonably good test accuracy, namely as a result of good generalisation.

Figure 4: Classification using neural networks on CIFAR-10 Nakkiran et al. [2019].
Simple maths for linear models
Given these groundbreaking experimental results, statisticians seek to use theoretical analysis to understand when these three phenomena occur. Although neural networks were the initial motivation of this work, they are hard to analyse even for shallow networks. And so statisticians resorted to understanding these phenomena starting from the well-known linear models.

Over-parameterisation in linear models of the form $\mathbf{Y} = \mathbf{X}\theta^* + \mathbf{W}$ means there are more features $d$ than number of samples $n$, i.e. $d >n$ for an input matrix $\mathbf{X}$ of dimension $n \times d$. Then the system $\mathbf{X}\hat{\theta} = \mathbf{Y}$ has infinite solutions, thus consider the solution with minimum norm $\hat{\theta} = \text{arg min}||\hat{\theta}||_2$.

After the interpolation threshold, the variance is dominating (see Figure 3) so it needs to go down for the test error to go down. Indeed, Bartlett et al. [2020] show that in this setup the variance decreases as $d \gg n$, precisely $$\text{variance} \asymp \frac{\sigma^2n}{d}. $$

It can be shown that data is approximately orthogonal when $d \gg n$, namely $<X_i, X_j> \approx 0$ for $i \neq 0$, so the noise “energy” is spread out along the $d$ dimensions, hence as $d$ grows the noise contribution decreases.
However, Bartlett et al. [2020] also show that the bias increases with $d$, precisely $$\text{bias} \asymp (1-\frac{n}{d})||\theta^*||_2^2.$$ This is because the signal “energy” as well is spread out along $d$ dimensions.

Eventually, the bias will dominate and the test error will increase again, see Figure 5 (left). Therefore under this framework, the double descent and harmless interpolation can be achieved but good generalisation cannot.

Figure 5: Bias-variance trade-off after interpolation threshold for a simple linear model (left) and a linear model with spiked covariance (right).

Finally, Bartlett et al. [2020] show that in the special case where the $k$ features are “upweighted”, all three phenomena are observed. Assuming a spiked covariance $$\Sigma = \mathbb{E}[\mathbf{X}\mathbf{X}^T] = \begin{bmatrix}
R\mathbf{I}_k & \mathbf{0} \\
\mathbf{0} & \mathbf{I}_{d-k}
\end{bmatrix},$$ it can be shown that the variance and the bias will go to zero as $d \rightarrow \infty$ provided that $R \gg \frac{d}{n}$, therefore the double descent, harmless interpolation and good generalization are achieved (see Figure 5 (right)).

Many unanswered questions remain
Similar results to the ones described for linear models have been obtained for linear classification [Muthukumar et al., 2021]. While these types of results for neural networks [Frei et al., 2022] are still limited. Moreover, there are still many open questions on benign overfitting for neural networks. For example, the existing result focuses on $d \gg n$ regimes for neural networks, but there are no results on neural networks over-parameterised in low dimensions by increasing their width. Theoretical statisticians still have plenty of work to do to fully understand these phenomena!

References 

Peter L. Bartlett, Philip M. Long, G´abor Lugosi, and Alexander Tsigler. Benign overfitting in linear
regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, April 2020. ISSN
1091-6490. doi: 10.1073/pnas.1907378117. URL http://dx.doi.org/10.1073/pnas.1907378117.

Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, and Wei Hu. Implicit bias in leaky relu
networks trained on high-dimensional data, 2022.

Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant
Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter?,
2021.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep
double descent: Where bigger models and more data hurt, 2019.

Compass students at ICLR 2024

Congratulations to Compass students Edward Milsom and Ben Anson who, along with their supervisor, had their paper accepted for a poster at ICLR 2024.

 

Convolutional Deep Kernel Machines

Edward Milsom, Ben Anson, Laurence Aitchison 

Ed and Ben: In this paper we explore the importance of representation learning in convolutional neural networks, specifically in the context of an infinite-width limit called the Neural Network Gaussian Process (NNGP) that is often used by theorists. Representation learning refers to the ability of models to learn a transformation of the data that is tailored to the task at hand. This is in contrast to algorithms that use a fixed transformation of the data, e.g. a support vector machine with a fixed kernel function like the RBF kernel. Representation learning is thought to be critical to the success of convolutional neural networks in vision tasks, but networks in the NNGP limit do not perform representation learning, instead transforming the data with a fixed kernel function. A recent modification to the NNGP limit, called the Deep Kernel Machine (DKM), allows one to gradually “add representation learning back in” to the NNGP, using a single hyperparameter that controls the amount of flexibility in the kernel. We extend this algorithm to convolutional architectures, which required us to develop a new sparse inducing point approximation scheme. This allowed us to test on the full CIFAR-10 image classification dataset, where we achieved state-of-the-art test accuracy for kernel methods, with 92.7%.

In the plot below, we see how changing the hyperparameter (x-axis) to reduce flexibility too much harms the performance on unseen data.

 

Student Perspectives: Group Testing

A post by Rahil Morjaria, PhD student on the Compass programme.

What is Group Testing?

Group Testing was first introduced in the 1940s as a way to test military recruits for syphilis during World War II. By combining blood samples, they hoped to reduce the total number of tests needed to detect the diseased individuals (compared to testing each recruit individually).

Example of pooling blood samples, where red and green depicts diseased and not diseased respectively.

Since then, there have been many advances in Group Testing with applications not just in detecting diseased individuals but also in communications, cybersecurity and data storage. In essence, whenever we have a situation where we need to detect a proportionally rare occurrence, Group Testing is probably applicable.

More formally, if we have some diseased set of individuals of size $k$ of a total population $n$, it might be considered (instead of testing each individual separately) to pool people together into groups (with replacement) and test these groups.

 

Matrix Form of Group Testing (each row depicting a test and each column depicting a member of the population) [1].

We often write our test design in matrix form where each row is a group/test and each column indicates a member of the population, where a $1$ indicates a individuals inclusion in a test.

The relationship between $k$ and $n$ is quite important, our focus is on the sparse case in which $k = O(n^\alpha)$ where $\alpha \in (0,1)$.

Often we assume our test apparatus is sensitive enough where a single diseased individual in the test will give us a positive result (as shown in the image above) this is known as the noiseless case (there is vast amounts of work done for different types of noise, for more information check out [1]).

Adaptive vs Non-Adaptive Group Testing

Adaptive Group Testing (as the name suggests) allows us to adapt our subsequent tests by the results of the previous. If we compare this to Non-Adaptive Group Testing, in which we have to define our tests (and thus our groups) before we obtain any results, we can expect stronger results.

As our tests have binary outputs, we can obtain at most $1$ bit of information per test. As there are $\binom{n}{k}$ possible defective sets, we would need $\log_2\binom{n}{k}$ bits to uniquely represent each possible set. This gives us the limit of $\log_2\binom{n}{k}$ tests needed, this is known as a converse result, a fundamental limit which we are unable to overcome.

In the noiseless case, Adaptive Group Testing is able to achieve this fundamental limit. First we split our total items $n$ into $k$ (the number of defective items) subsets of length $n/k$ without replacement, and then perform Binary Splitting.

Binary Splitting Adaptive Algorithm.

While Adaptive Group Testing is able to reach fundamental limits, our main focus is on Non-Adaptive Group Testing. Non-Adaptive Group Testing has many applications due to it’s ability to be ran in parallel (and other ease of use situations).

Non-Adaptive Group Testing procedures are often designed randomly (in which each items inclusion in a test is $Bern(v)$ for some $v$) or with near constant column weight (each item is in ‘nearly’ the same amount of tests). Out of these 2 designs near constant column weight gives stronger results.

A graph comparing different group testing designs, where the $\text{Rate} = \log_2\binom{n}{k}/T$ where $T$ is the number of tests needed to recover all the defective items (with high probability for the red lines and with certainty for the purple line). [1]

Goals

While there are many strong results in Group Testing, there is still much to explore. From looking at List Decoding (in which we allow a list of possible defective sets to be outputted), other forms of noise and our efficient algorithms still not being able to match up to our theoretical achievable methods, there is work to be done in all aspects. With improvements in technology, combined with the myriad of applications, the future of Group Testing definitely looks bright!

References:

[1] M. Aldridge, O. Johnson, and J. Scarlett. Group testing: an information theory perspective. CoRR, abs/1902.06002, 2019. URL http://arxiv.org/abs/1902.06002.

Student Perspectives: Machine Learning Models for Probability Distributions

A post by Tennessee Hickling, PhD student on the Compass programme.


Introduction
Probabilistic modelling provides a consistent way to deal with uncertainty in data. The central tool in this methodology is the probability distribution, which describes the randomness of observations. To create effective models of reality, we need to be able to specify probability distributions that are flexible enough to capture real phenomena whilst remaining feasible to estimate. In the past decade machine learning (ML) has developed many new and exciting ways to represent and learn potentially complex probability distributions.

ML has provided significant advances in modelling of high dimensional and highly structured data such as images or text. Many of these modern approaches are applied as “generative models”. The goal of such approaches is to sample new synthetic observations from an approximate distribution which closely matches the target distribution. For example, we may use many images of cats to learn an approximate distribution, from which we can sample new images of cats that appear realistic. Usually, a “generative model” indicates the requirement to sample from the model, but not necessarily assign probabilities to observed data. In this case, the model captures uncertainty by imitating the structure and randomness of the data.

Many of these modern methods work by transforming simple randomness (such as a Normal distribution) into the target complex randomness. In my own research, I work on a known limitation of such approaches to replicate a particular aspect of randomness – the tails of probability distributions [1, 11]. In this post, I wanted to take a step back and provide an overview of and connections between two ML methods that can be used to model probability distributions – Normalising Flows (NFs) and Variational Autoencoders (VAEs).

Figure 1: Basic illustration of ML learning of a distribution. We optimise the machine learning model to produce a distribution close to our target. This is often conceptualised in the generative direction, such that our ML model moves samples from the simple distribution to more closely match the target observations.

Some Background
Consider real valued vectors $z \in \mathbb{R}^{d_z}$ and $x \in \mathbb{R}^{d_x}$. In this post I mirror notation used in [2], where $p(x)$ refers to the density and distribution of $x$ and $x \sim p(x)$ indicates samples according to that distribution. The generic set up I am considering is that of density estimation – trying to model the distribution $p(x)$ of some observed data $\{x_i\}_{i=1}^{N}$. I use a semicolon to denote parameters, so $p(x; \beta)$ is a distribution over $x$ with parameters $\beta$. I also make use of different letters to distinguish different distributions, for example using $q(x)$ to denote an approximation to $p(x)$. The notation $\mathbb{E}_p[f(x)]$ refers to the standard expectation of $f(x)$ over the distribution $p$.

The discussed methods introduce some simple source of randomness arising from a known, simple latent distribution $p(z)$. This is also referred to in some literature as the prior, though the usage is not straightforwardly relatable to traditional Bayesian concepts. The goal is then to fit an approximate $q(x|z; \theta)$, that is a conditional distribution, such that $$q(x; \theta) = \int q(x|z; \theta)p(z)dz \approx p(x),$$in words, the marginal density over $x$ implied by the conditional density, is close to our target distribution $p(x)$. In general, we can’t compute $q(x)$, as we can’t solve the above integral for very flexible $q(x | z; \theta)$.

Variational Inference
We commonly make use of the Kullback-Leibler (KL) divergence, which can be interpreted as measuring the difference between two probability distributions. It is a useful practical tool, since we can compute and optimise the quantity in a wide variety of situations. Techniques which optimise a probability distribution using such divergences are known as variational methods. There are other choices of divergence, but KL is the most standard. Important properties of KL are that the quantity $KL(p|| q)$ is non-negative and non-symmetric i.e. $KL(p|| q) \neq KL(q || p)$.

Given this, we can see that a natural objective is to minimise the difference between distributions, as measured by the KL, $$KL(p(x) || q(x; \theta)) = \int p(x) \log \frac{p(x)}{q(x; \theta)} dx.$$Advances in this area have mostly developed new ways to make this optimisation tractable for flexible $q(x | z; \theta)$.

Normalising Flow
A normalising flow is a parameterised transformation of a random variable. The key simplifying assumption is that the forward generation is deterministic. That is, for $d_x = d_z$, that $$
x = T(z; \theta),$$for some transformation function $T$. We additionally require that $T$ is a differentiable bijection. Given these requirements, we can express the approximate density of $x$ exactly as $$q_x(x; \theta) = p_z(T^{-1}(x; \theta))\big|\text{det } J_{T^{-1}}(x; \theta)\big|.$$Here, $\text{det }J_{T^{-1}}$ is the determinant of the Jacobian of the inverse transformation. Research on NFs has developed numerous ways to make the computation of the Jacobian term tractable. The standard approach is to use neural networks to produce $\theta$ (the parameters of the transformation), with numerous ways of configuring the model to capture dependency between dimensions. Additionally, we often stack many layers to provide more flexibility. See [10] and the review [2] for more details on how this is achieved.

As we have access to an analytic approximate density, we can minimise the negative log-likelihood of our model, $$\mathcal{J}(\theta) = -\sum_{i=1}^{N} \log q(x_i; \theta),$$which is the Monte-Carlo approximation of the KL loss (up to an additive constant). This is straightforward to optimise using stochastic gradient descent [9] and automatic differentiation.

Figure 2: Schematic of NF model. The ML model produces the parameters of our transformation, which are identical in the forward and backwards directions. We choose the transformation such that we can express an analytic density function for our approximate distribution.

Variational Autoencoder
In the Variational Autoencoder (VAE) [3] the conditional distribution $q(x| z; \theta)$ is known as the decoder. VAEs consider the marginal in terms of the posterior, that is $$q(x; \theta) = \frac{q(x | z; \theta)p(z)}{q(z | x; \theta)}.$$The posterior $q(z | x; \theta)$ is itself not generally tractable. VAEs proceed by introducing an encoder, which approximates $q(z | x; \theta)$. This is itself simply a conditional distribution $e(z | x; \psi)$. We use this approximation to express the log marginal over $x$ as below.
$$\begin{align}
\log q(x; \theta) &= \mathbb{E}_{e}\bigg[\log q(x; \theta)\frac{e(z | x; \psi)}{e(z | x; \psi)}\bigg] \\
&= \mathbb{E}_{e}\bigg[\log\frac{q(x | z; \theta)p(z)}{q(z | x; \theta)}\frac{e(z | x; \psi)}{e(z | x; \psi)}\bigg] \\
&= \mathbb{E}_{e}\bigg[\log\frac{q(x | z; \theta)p(z)}{e(z | x; \psi)}\bigg] + KL(e(z | x; \psi) || q(z | x; \theta)) \\
&= \mathcal{J}_{\theta,\psi} + KL(e(z | x; \psi) || q(z | x; \theta))
\end{align}$$
The additional approximation gives a more complex expression and does not provide an analytical approximate density. However, as $KL(e(z | x; \psi) || q(z | x; \theta))$ is positive, $\mathcal{J}_{\theta,\psi}$ forms a lower bound on $\log q(x; \theta)$. This expression is commonly referred to as the Evidence Lower Bound (or ELBO). The second term, the divergence between our encoder and the implied posterior will be minimised as we increase $\mathcal{J}_{\theta,\psi}$ in $\psi$. As we increase $\mathcal{J}_{\theta,\psi}$, we will either be increasing $\log q(x; \theta)$ or reducing $KL(e(z | x; \psi) || q(z | x; \theta))$.

The goal is to increase $\log q(x; \theta)$, which we hope occurs as we optimise over both parameters. This ambiguity in optimisation results in well known issues, such as posterior collapse [12] and can result in some counter-intuitive behaviour [4]. Despite this, VAEs remain a powerful and popular approach. An important benefit is that we no longer require $d_x = d_z$, which means we can map high dimensional $x$ to low dimensional $z$ to perform dimensionality reduction.

Figure 3: Schematic of VAE model. We now have a stochastic forward transformation. To optimise this we introduce a decoder model which approximates the posterior implied by the forward transformation. We now have a more flexible transformation, but two models to train and no analytic approximate density.

Surjective Flows
We can now identify a connection between NFs and VAEs. Recent work has reinterpreted NFs within the VAE framework, permitting a broader class of transitions whilst retaining analytic tractability of NFs [5]. Considering our decoding conditional as $$q(x|z; \theta) = \delta(x – T(z; \theta)),$$ we have the posterior exactly as
$$q(z|x; \theta) = \delta(z – T^{-1}(x; \theta)),$$
where $\delta$ is the dirac delta function.

This provides a view of a NF as a special case of a VAE, where we don’t need to approximate the posterior. Considering our VAE approximation, $$
\log q(x; \theta) = \mathbb{E}_e\big[\log p(z) + \log\frac{q(x | z; \theta)}{e(z | x; \psi)}\big] + KL(e(z | x; \psi) || q(z | x; \theta)),
$$and taking $e(z | x; \psi) = q(z | x; \theta)$, then the final KL term is 0 by definition. In that case, we recover our analytic density for NFs (see [5] for details).

Note that accessing an analytic density depends on having $e(z | x; \psi) = q(z | x; \theta)$ and computing $$\mathbb{E}_e\big[\log p(z) + \log\frac{q(x | z; \theta)}{e(z | x; \psi)}\big].$$These requirements are actually weaker than those we apply in the case of standard NFs. Consider a deterministic transformation $T^{-1}(x)$ which is surjective, crucially we can have many $x$ which map to the same $z$, so no longer have an analytic inverse. However we can still choose a $q(x | z)$ which is stochastic inverse of $T^{-1}$. For example, consider an absolute value surjection, $q(z | x) = \delta(z – |x|)$, to invert this transformation we can choose $q(x | z) = \frac{1}{2}\delta(x – z) + \frac{1}{2}\delta(x + z)$, which we can sample from straight-forwardly. This transformation enforces symmetry across the origin in the approximate distribution, a potentially useful inductive bias. In this example, and many others, we can also compute the density exactly. This has led to a number of interesting extensions to NFs, such as “funnel flows” which have $d_z < d_x$ but retain an analytic approximate density [6]. As we retain an analytic approximate density, we can optimise them in the same way as NFs.

Figure 4: Schematic of a surjective transformation. We have a stochastic forward transformation, but the inverse is deterministic. This restricts what transformations we can have, but retains an analytic approximate density.

Conclusion
I have presented an overview of two widely-used methods for modelling probability distributions with machine learning techniques. I’ve also highlighted an interesting connection between these methods, an area of research that has led to the development of interesting new models. It’s worth noting that other important classes of ML models such as Generative Adversarial Networks and Diffusion models can also be interpreted as approximate probability distributions. There are many superficial connections between those methods and the ones discussed here. Exploring these theoretical similarities presents an compelling direction of research to sharpen understanding of such models’ relative advantages. Another promising direction is the synthesis of these methods, where researchers aim to harness the strengths of each approach [7,8]. This not only enhances the existing knowledge base but also offers opportunities for innovative applications in the field.

References
[1] Jaini, Priyank, et al. “Tails of Lipschitz triangular flows.” International Conference on Machine Learning. PMLR, 2020
[2] Papamakarios, G., et al. “Normalizing flows for probabilistic modeling and inference.” Journal of Machine Learning Research, 2021
[3] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv:1312.6114, 2013
[4] Rainforth, Tom, et al. “Tighter variational bounds are not necessarily better.” International Conference on Machine Learning. PMLR, 2018
[5] Nielsen, Didrik, et al. “Survae flows: Surjections to bridge the gap between vaes and flows.” Advances in Neural Information Processing, 2020
[6] Samuel Klein, et al. “Funnels: Exact maximum likelihood with dimensionality reduction.” arXiv:2112.08069, 2021
[7] Kingma, Durk P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in neural information processing systems , 2016
[8] Zhang, Qinsheng, and Yongxin Chen. “Diffusion normalizing flow.” Advances in Neural Information Processing Systems 34, 2021
[9] Ettore Fincato “An Introduction to Stochastic Gradient Methods” https://compass.blogs.bristol.ac.uk/2023/03/14/stochastic-gradient-methods/
[10] Daniel Ward “An introduction to normalising flows” https://compass.blogs.bristol.ac.uk/2022/07/13/2608/
[11] Tennessee Hickling and Dennis Prangle, “Flexible Tails for Normalising Flows, with Application to the Modelling of Financial Return Data”, 8th MIDAS Workshop, ECML PKDD, 2023
[12] Lucas, James, et al. “Don’t blame the elbo! a linear vae perspective on posterior collapse.” Advances in Neural Information Processing Systems 32, 2019

Student perspectives: Compass Annual Conference 2023

A post by Dominic Broadbent, PhD student on the Compass CDT, and Michael Whitehouse, PhD student of the Compass CDT (recently submitted thesis)

Introduction

September saw the second annual Compass Conference, hosted in the Fry Building, the home of the School of Mathematics. The event was particularly special as it is the first time that all five Compass cohorts were brought together, and it was a fantastic opportunity to celebrate the achievements and research of the Compass CDT with external partners.  This year the theme was “Communicating Research in Context“, focusing on how research can be better communicated, and the need to highlight the motivation and applications of mathematical research.

Research talks

The day began with four long form talks touching on the topic of communicating research. Starting with Alessio Zakaria’s talk which delved into Hypothesis tests, commenting on their criticial role as the defacto statistical tool across the sciences, and how p-hacking has led to a replication crisis that undermines public confidence in research. The next talk by Sam Stockman and Emerald Dilworth discussed the challenges they faced, and the key takeaways from their shared experience of communicating mathematics with researchers in the geographical sciences. Following this, Ed Davis’s interactive talk “The Universal Language of Visualisations” explored how effective visualisation techniques should differ by the intended audience, with examples from his research and activities outside of academia. The last talk by Dan Milner explored his research on understanding the effect of environmental factors on outcomes of smallholder farmers in Kenya. He took us through the process of collecting data on the ground, to modelling and communicating results to external partners.  After each talk there was an opportunity to ask questions, allowing for audience participation and the sparking of interesting discussions. The format mirrored that which is most frequently used in external academic conferences, giving the speakers a chance to practice their technique in front of friendly faces.

Lightning talks

After a short break, we jumped back into the fray with a series of 3-minute fast-paced lightning talks. A huge range of topics were covered, all the way from developing modelling techniques for the electric grid of the future, to predicting the incidence of Cerebral Vasospasm at the Southmead Hospital ICU. With such a short time to present, these talks were a great exercise in distilling research into just the essentials, knowing there is very limited time to garner the audience’s interest and convey an effective message.

Special guest lecture

After lunch, we reconvened to attend the special guest lecture. The talk, entitled Bridging the gap between research and industry, was delivered by Ruth Voisey, CEO of the Smith institute. It outlined Ruth’s journey from writing her PhD thesis ‘Multiple wave scattering by quasiperiodic structures’, to CEO of the Smith Institute – via an internship with the acoustic research team at Dyson. It was particularly refreshing to hear Ruth’s candid account of her ‘non-linear’ rise to CEO, accrediting her success to strong principles of clear research communication and ‘mathematical evangelicalism’.

As PhD students in the bubble of academia, the path to opportunities in the world of industry can often feel clouded – Ruth’s lecture painted a clear picture of how one can transition from university based research to a rewarding career outside of this bubble, applying such research to tangible problems in the real world. 

Panel discussion and poster session

The special guest lecture was followed by a discussion on communicating research in context, with panel members Ruth Voisey, David Greenwood, Helen Barugh, Oliver Johnson, plus Compass CDT students Ed Davis and Sam Stockman. The panel discussed the difficulty of communicating the nuances of research conclusions with the public, with a particular focus on handling these nuances when talking to journalists – stressing the importance of communicating the limitations of the research in question.

This was followed by a poster session, one enthusiastic student had the following comment “it was great to see of all the Compass students’ hard work being celebrated and shared with the wider data science community”.

Concluding remarks

To cap off the successful event, Compass students Hannah Sansford and Josh Givens delivered some concluding remarks which were drawn from comments made by students about what key points they’d taken from the day. These focused on the importance of clear communication of research throughout the whole pipeline, from inception in discussion with fellow academics to the dissemination of knowledge to the general population.

The day ended with a walk to Goldney Hall, where students, staff, and attendees enjoyed delicious food, wine, and access to the beautiful Orangery gardens.

Student Perspectives: Density Ratio Estimation with Missing Data

A post by Josh Givens, PhD student on the Compass programme.

Density ratio estimation is a highly useful field of mathematics with many applications.  This post describes my research undertaken alongside my supervisors Song Liu and Henry Reeve which aims to make density ratio estimation robust to missing data. This work was recently published in proceedings for AISTATS 2023.

Density Ratio Estimation

Definition

As the name suggests, density ratio estimation is simply the task of estimating the ratio between two probability densities. More precisely for two RVs (Random Variables) Z^0, Z^1 on some space \mathcal{Z} with probability density functions (PDFs) p_0, p_1 respectively, the density ratio is the function r^*:\mathcal{Z}\rightarrow\mathbb{R} defined by

r^*(z):=\frac{p_0(z)}{p_1(z)}.

Plot of the scaled density ratio alongside the PDFs for the two classes.

Density ratio estimation (DRE) is then the practice of using IID (independent and identically distributed) samples from Z^0 and Z^1 to estimate r^*. What makes DRE so useful is that it gives us a way to characterise the difference between these 2 classes of data using just 1 quantity, r^*.

 

The Density Ratio in Classification

We now give demonstrate this characterisability in the case of classification. To frame this as a classification problem define Y\sim\text{Bernoulli}(0.5) and Z by Z|Y=y\sim Z^{y}. The task of predicting Y given Z using some function \phi:\mathcal{Z}\rightarrow\{0,1\} is then our standard classification problem. In classification a common target is the Bayes Optimal Classifier, the classifier \phi^* which maximises \mathbb{P}(Y=\phi(Z)). We can write this classifier in terms of r^*  as we know that \phi^*(z)=\mathbb{I}\{\mathbb{P}(Y=1|Z=z)>0.5\} where \mathbb{I} is the indicator function. Then, by the total law of probability, we have

\mathbb{P}(Y=1|Z=z)=\frac{p_{Z|Y=1}(z)\mathbb{P}(Y=1)}{p_{Z|Y=1}(z)\mathbb{P}(Y=1)+p_{Z|Y=0}(z)\mathbb{P}(Y=0)}

=\frac{p_1(z)\mathbb{P}(Y=1)}{p_1(z)\mathbb{P}(Y=1)+p_0(z)\mathbb{P}(Y=0)} =\frac{1}{1+\frac{1}{r}\frac{\mathbb{P}(Y=0)}{\mathbb{P}(Y=1)}}.

Hence to learn the Bayes optimal classifier it is sufficient to learn the density ratio and a constant. This pattern extends well beyond Bayes optimal classification to many other areas such as error controlled classification, GANs, importance sampling, covariate shift, and others.  Generally speaking, if you are in any situation where you need to characterise the difference between two classes of data, it’s likely that the density ratio will make an appearance.

Estimation Implementation – KLIEP

Now we have properly introduced and motivated DRE, we need to look at how we can go about performing it. We will focus on one popular method called KLIEP here but there are a many different methods out there (see Sugiyama et al 2012 for some additional examples.)

The intuition behind KLIEP is simple: as r^* \cdot p_0=p_1, if \hat r\cdot p_0 is “close” to p_1 then \hat r is a good estimate of r^*. To measure this notion of closeness KLIEP uses the KL (Kullback-Liebler) divergence which measures the distance between 2 probability distributions. We can now formulate our ideal KLIEP objective as follows:

\underset{r}{\text{min}}~ KL(p_1|p_0\cdot r)

\text{subject to:}~ \int_{\mathcal{Z}}r(z)p_0(z)\mathrm{d}z=1

where KL(p|p') represent the KL divergence from p to p'. The constraint  ensures that the right hand side of our KL divergence is indeed a PDF. From the definition of the KL-divergence we can rewrite the solution to this as \hat r:=\frac{\tilde r}{\mathbb{E}[r(X^0)]} where \tilde r is the solution to the unconstrained optimisation

\underset{r}{\text{min}}~\mathbb{E}[\log (r(Z^1))]-\log(\mathbb{E}[r(Z^0)]).

As this is now just an unconstrained optimisation over expectations of known transformations of Z^0, Z^1, we can approximate this using samples. Given samples z^0_1,\dotsc,z^0_n from Z_0 and samples z^1_1,\dotsc,z^1_n from Z_1 our estimate of the density ratio will be \hat r:=\left(\frac{1}{n}\sum_{i=1}^nr(z_i^0)\right)^{-1}\tilde r  where \tilde r solves

\underset{r}{\min}~ \frac{1}{n}\sum_{i=1}^n \log(r(z^1_i))-\log\left(\frac{1}{n}\sum_{i=1}^n r(z^0_i)\right).

Despite KLIEP being commonly used, up until now it has not been made robust to missing not at random data. This is what our research aims to do.

Missing Data

Suppose that instead of observing samples from Z, we observe samples from some corrupted version of Z, X. We assume that \mathbb{P}(\{X=\varnothing\}\cup \{X=Z\})=1 so that either X is missing or X takes the value of Z. We also assume that whether X is missing depends upon the value of Z. Specifically we assume \mathbb{P}(X=\varnothing|Z=z)=\varphi(z) with \varphi(z) not constant and refer to \varphi as the missingness function. This type of missingness is known as missing not at random (MNAR) and when dealt with improperly can lead to biased result. Some examples of MNAR data could be readings take from a medical instrument which is more likely to err when attempting to read extreme values or recording responses to a questionnaire where respondents may be more likely to not answer if the deem their response to be unfavourable. Note that while we do not see what the true response would be, we do at least get a response meaning that we know when an observation is missing.

Missing Data with DRE

We now go back to density ratio estimation in the case where instead of observing samples from  Z^0,Z^1  we observe samples from their corrupted versions X^0, X^1. We take their respective missingness functions to be \varphi_0, \varphi_1 and assume them to be known. Now let us look at what would happen if we implemented KLIEP with the data naively by simply filtering out the missing-values. In this case, the actual density ratio we would be estimating would be

r'(z):=\frac{p_{X_1|X_1\neq\varnothing}(z)}{p_{X_0|X_o\neq\varnothing}(z)}\propto\frac{(1-\varphi_1(z))p_1(z)}{(1-\varphi_0(z))p_0(z)}\not{\propto}r^*(z)

and so we would get inaccurate estimates of the density ratio no matter how many samples are used to estimate it. The image below demonstrates this in the case were samples in class 1 are more likely to be missing when larger and class 0 has no missingness.

A plot of the density ratio using both the full data and only the observed part of the corrupted data

Our Solution

Our solution to this problem is to use importance weighting. Using relationships between the densities of X and Z we have that

\mathbb{E}[g(Z)]=\mathbb{E}\left[\frac{\mathbb{I}\{X\neq\varnothing\}g(X)}{1-\varphi(X)}\right].

As such we can re-write the KLIEP objective to keep our expectation estimation unbiased even when using these corrupted samples. This gives our modified objective which we call M-KLIEP as follows. Given samples x^0_1,\dotsc,x^0_n from X_0 and samples x^1_1,\dotsc,x^1_n from X_1 our estimate is \hat r=\left(\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^0\neq\varnothing\}r(x_i^0)}{1-\varphi_o(x_i^o)}\right)^{-1}\tilde r where \tilde r solves

\underset{r}{\min}~\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^1\neq\varnothing\}\log(r(x_i^1))}{1-\varphi_1(x_i^1)}-\log\left(\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^0\neq\varnothing\}r(x_i^0)}{1-\varphi_0(x_i^0)}\right).

This objective will now target r^* even when used on MNAR data.

Application to Classification

We now apply our density ratio estimation on MNAR data to estimate the Bayes optimal classifier. Below shows a plot of samples alongside the true Bayes optimal classifier and estimated classifiers from the samples via our method M-KLIEP and a naive method CC-KLIEP which simply ignores missing points. Missing data points are faded out.

Faded points represent missing values. M-KLIEP represents our method, CC-KLIEP represents a Naive approach, BOC gives the Bayes optimal classifier

As we can see, due to not accounting for the MNAR nature of the data, CC-KLIEP underestimates the true number of class 1 samples in the top left region and therefore produces a worse classifier than our approach.

Additional Contributions

As well as this modified objective our paper provides the following additional contributions:

  • Theoretical finite sample bounds on the accuracy of our modified procedure.
  • Methods for learning the missingness functions \varphi_1,\varphi_0.
  • Expansions to partial missingness via a Naive-Bayes framework.
  • Downstream implementation of our method within Neyman-Pearson classification.
  • Adaptations to Neyman-Pearson classification itself making it robust to MNAR data.

For more details see our paper and corresponding github repository. If you have any questions on this work feel free to contact me at josh.givens@bristol.ac.uk.

References

Givens, J., Liu, S., & Reeve, H. W. J. (2023). Density ratio estimation and neyman pearson classification with missing data. In F. Ruiz, J. Dy, & J.-W. van de Meent (Eds.), Proceedings of the 26th international conference on artificial intelligence and statistics (Vol. 206, pp. 8645–8681). PMLR.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.

Compass students at AISTATS 2023

Congratulations to Compass students Josh Givens, Hannah Sansford and Alex Modell who, along with their supervisors have had their papers accepted to be published at AISTATS 2023.

 

‘Implications of sparsity and high triangle density for graph representation learning’

Hannah Sansford, Alexander Modell, Nick Whiteley, Patrick Rubin-Delanchy

Hannah: In this paper we explore the implications of two common characteristics of real-world networks, sparsity and triangle-density, for graph representation learning. An example of where these properties arise in the real-world is in social networks, where, although the number of connections each individual has compared to the size of the network is small (sparsity), often a friend of a friend is also a friend (triangle-density). Our result counters a recent influential paper that shows the impossibility of simultaneously recovering these properties with finite-dimensional representations of the nodes, when the probability of connection is modelled by the inner-product. We, by contrast, show that it is possible to recover these properties using an infinite-dimensional inner-product model, where representations lie on a low-dimensional manifold. One of the implications of this work is that we can ‘zoom-in’ to local neighbourhoods of the network, where a lower-dimensional representation is possible.

The paper has been selected for oral presentation at the conference in Valencia (<2% of submissions). 

 

Density Ratio Estimation and Neyman Pearson Classification with Missing Data

Josh Givens, Song Liu, Henry W J Reeve

Josh: In our paper we adapt the popular density ratio estimation procedure KLIEP to make it robust to missing not at random (MNAR) data and demonstrate its efficacy in Neyman-Pearson (NP) classification. Density ratio estimation (DRE) aims to characterise the difference between two classes of data by estimating the ratio between their probability densities. The density ratio is a fundamental quantity in statistics appearing in many settings such as classification, GANs, and covariate shift making its estimation a valuable goal. To our knowledge there is no prior research into DRE with MNAR data, a missing data paradigm where the likelihood of an observation being missing depends on its underlying value. We propose the estimator M-KLIEP and provide finite sample bounds on its accuracy which we show to be minimax optimal for MNAR data. To demonstrate the utility of this estimator we apply it the the field of NP classification. In NP classification we aim to create a classifier which strictly controls the probability of incorrectly classifying points from one class. This is useful in any setting where misclassification for one class is much worse than the other such as fault detection on a production line where you would want to strictly control the probability of classifying a faulty item as non-faulty. In addition to showing the efficacy of our new estimator in this setting we also provide an adaptation to NP classification which allows it to still control this misclassification probability even when fit using MNAR data.

Student Perspectives: An Introduction to Stochastic Gradient Methods

A post by Ettore Fincato, PhD student on the Compass programme.

This post provides an introduction to Gradient Methods in Stochastic Optimisation. This class of algorithms is the foundation of my current research work with Prof. Christophe Andrieu and Dr. Mathieu Gerber, and finds applications in a great variety of topics, such as regression estimation, support vector machines, convolutional neural networks.

We can see below a simulation by Emilien Dupont (https://emiliendupont.github.io/) which represents two trajectories of an optimisation process of a time-varying function. This well describes the main idea behind the algorithms we will be looking at, that is, using the (stochastic) gradient of a (random) function to iteratively reach the optimum.

Stochastic Optimisation

Stochastic optimisation was introduced by [1], and its aim is to find a scheme for solving equations of the form \nabla_w g(w)=0 given “noisy” measurements of g [2].

In the simplest deterministic framework, one can fully determine the analytical form of g(w), knows that it is differentiable and admits an unique minimum – hence the problem

w_*=\underset{w}{\text{argmin}}\quad g(w)

is well defined and solved by \nabla_w g(w)=0.

On the other hand, one may not be able to fully determine g(w) because his experiment is corrupted by a random noise. In such cases, it is common to identify this noise with a random variable, say V, consider an unbiased estimator \eta(w,V) s.t. \mathbb{E}_V[\eta(w,V)]=g(w) and to rewrite the problem as

w_*=\underset{w}{\text{argmin}}\quad\mathbb{E}_V[\eta(w,V)].

(more…)

Compass CDT is recruiting for its final few fully funded places to start September 2023

We are happy to announce our upcoming applications deadline of 16 March 2023 for Compass CDT programme. For international applications there are limited scholarship funded places available for this final recruitment round. Early applications advised.

Compass is offering specific projects for PhD students to study from Sept 2023. The projects are listed in the research section of our website. All the supervisors listed are open to discussion on the projects provided and can also talk to applicants about other project ideas. Please provide a ranked list of 3 projects of interest: 1 = project of highest interest. Project supervisors will be happy to respond to specific questions you have after reading the proposals. Applicants should contact them by email if they wish beforehand.

Also, we are pleased to announce a new project Genetic Similarity Based Cohort Building that has been added to the list for September 2023 start funded by Roche, one of the world’s largest biotech companies, as well as a leading provider of in-vitro diagnostics and a global supplier of transformative innovative solutions across major disease areas.

PhD Project Allocation Process

Application forms will be reviewed based on the 3 ranked projects specified or other proposed topic. Successful applicants will be invited to attend an interview with the Compass admissions tutors and the specific project supervisor. If you are made an offer of PhD study it will be published through the online application system. You will then have 2 weeks to consider the offer before deciding whether to accept or decline.

We welcome applications from all members of our community and are particularly encouraging those from diverse groups, such as members of the LGBT+ and black, Asian and minority ethnic communities, to join us.

APPLICATIONS DEADLINE

16 March 2023, 23:59 (London UK time zone)

APPLY NOW

Advantages of being a Compass Student

  • Stipend – a generous stipend of £21,668 pa tax free, paid in monthly payments. Plus your own expense budget of £1,000 pa towards travel and research activity.
  • No fees – all tuition fees are covered by the EPSRC and University of Bristol.
  • Bespoke training – first year units are designed specifically for the academic needs of each Compass student, which enables students to develop knowledge and capability to pursue cross-disciplinary PhD research.
  • Supervisors – supervisors from across academic disciplines offer a range of research projects.
  • Cohort – Compass students benefit from dedicated offices and collaboration spaces, enabling strong cohort links and opportunities for shared learning and research.

About Compass CDT

A 4-year bespoke PhD training programme in the statistical and computational techniques of data science, with partners from across the University of Bristol, industry and government agencies.

The cross-disciplinary programme offers exciting collaborations across medicine, computer science, geography, economics, life and earth sciences, as well as with our external partners who range from government organisations such as the Office for National Statistics, NCSC and the AWE, to industrial partners such as LV, Improbable, IBM Research, EDF, and AstraZeneca.

Students are co-located with the Institute for Statistical Science in the School of Mathematics, which occupies the Fry Building.

Hear from our students about their experience with the programme

  • Compass has allowed me to advance my statistical knowledge and apply it to a range of exciting applied projects, as well as develop skills that I’m confident will be highly useful for a future career in data science. – Shannon, Cohort 2

  • With the Compass CDT I feel part of a friendly, interactive environment that is preparing me for whatever I move on to next, whether it be in Academia or Industry. – Sam, Cohort 2

  • An incredible opportunity to learn the ever-expanding field of data science, statistics and machine learning amongst amazing people. – Danny, Cohort 1

Compass CDT Video

Find out more about what it means to be a part of the Compass programme from our students in this short video.

APPLICATIONS DEADLINE

16 March 2023, 23:59 (London UK time zone)

APPLY NOW

Compass at NeurIPS 2022

A post by Anthony Stephenson, Jack Simons, and Dan Ward, PhD students on the Compass programme.

Introduction

Ant Stephenson, Jack Simons, and I (Dan Ward) had the pleasure of attending the 2022 Conference on Neural Information Processing Systems (NeurIPS), one of the largest machine learning conferences in the world. The conference was held in New Orleans, which gave us an opportunity to explore a lively city full of culture with delicious local cuisine. We thought we’d collaborate on a blog post together covering some of the highlights.

Memorable Talks

The conference had broad range of talks including technical presentations of research, applied projects, and discussions of the philosophical and ethical questions that arise in AI. To give a taste of some of the talks, we picked out some of our favourites below.

Noam Brown: Human Modelling and Strategic Reasoning in the Game of Diplomacy.

The Game of Diplomacy is a strategic board game invented in 1954. It’s unique feature, and of crucial importance of the game, is that players interact via natural language to form allegiances. Whilst AI has been successful in beating humans in many purely adversarial games (e.g. Chess, Go), this collaborative element poses unique challenges. Firstly, it isn’t obvious how to evaluate/devise strategies for collaboration/betrayal, especially in the self-play-based reinforcement learning paradigm. Secondly, as communication happens via natural language, the AI must be able to translate their strategic plan into text. This strange combination of problems lead to interesting and innovative solutions. Paper link here.

Geoffrey Hinton: The Forward-Forward Algorithm for Training Deep Neural Networks.

Among the great line-up of speakers was Professor Geoffrey Hinton, known for popularising backpropogation for deep neural networks. Inspired by producing a more biologically plausible algorithm for learning, he has proposed the ‘Forward-Forward’ algorithm which he claims can also explain the phenomena of sleep! Professor Geoffrey Hinton then went on to express his belief that using biologically-inspired hardware, so-called neuromorphic computing, may play a key role in advancing AI. The talk was certainly unconventional, but nevertheless entertaining. Paper link here.

David Chalmers: Could a Large Language Model be Conscious?

Amongst all the machine-learning experts was David Chalmers, a philosopher! There are important questions regarding the possibility that language models might be conscious. David Chalmers aimed to educate the machine-learning audience in attendance of how we can better think about these problems and re-phrase the questions that we’re asking. We concluded that these questions are, unsurprisingly, best left to philosophers!

Poster Sessions


Jack and Dan:

I (Dan), presented a poster of my work at the conference, on Robust Neural Posterior Estimation (paper link here). I was definitely surprised by the scale of the poster sessions, and the broad scope of all the work taking place. Below is some of the posters that me and Jack found interesting:

Contrastive Neural Ratio Estimation
Benjamin K. Miller · Christoph Weniger · Patrick Forré
Authors propose NRE-C which aims to generalise NRE-A (Hermans et al. (2019)) and NRE-B (Durkan et al. (2020)) into one method. NRE-C can recover both methods by taking their two introduced hyperparameters at certain limits. Paper link here.

Towards Reliable Simulation-Based Inference with Balanced Neural Ratio Estimation
Arnaud Delaunoy · Joeri Hermans · François Rozet · Antoine Wehenkel · Gilles Louppe
These authors also make a contribution to the field of neural ratio estimation in the simulation-based inference context. Authors propose the notion of a “balanced” classifier, which is a classifier in which the average output from the classifier over the positive data class plus the average output over the negative data class equals to 1. The authors argue that if one has a classifier is balanced then it will lead to more conservative posterior estimates, which is something which practitioners seek. To integrate this into an algorithm they suggest adding a penalisation term onto the standard logistic-loss which punishes classifiers as they become less balanced. Paper link here.

Training and Inference on Any-Order Autoregressive Models the Right Way
Andy Shih · Dorsa Sadigh · Stefano Ermon
A joint distribution can be decomposed into its univariate conditionals by the chain rule, although by doing so we implicitly choose an ordering in a model, which prevents arbitrary conditional inference. Any-order autoregressive models circumvent this generally by being trained such that all possible univariate conditionals are considered, but this leads to learning redundant information. The paper proposes a new method to train autoregressive models, using a subset of univariate conditionals that still supports arbitrary conditional inference. This research was also presented as a talk, but sadly we missed it!  Paper link here.

Anthony:

The poster sessions formed the bulk of the conference timetable, with 2 2-hour sessions per day, on Tuesday, Wednesday and Thursday. These were very busy, with many posters on a wide-range of topics and a large congregation of attendees. As a result, it was sometimes difficult to track down the subset of posters on material of particular interest and when this feat was achieved, on occasion it was still hard work to actually have a detailed conversation with the author(s). Nonetheless, it was interesting to see the how varied the subjects of the poster were and in addition get a feeling for “themes” of the conference: recurring, clearly in-vogue topics. Amongst the sea of posters, I did manage to find a number relating to GPs; of these, those I found most interesting were:

Posterior and Computational Uncertainty in Gaussian Processes:
Jonathan Wenger · Geoff Pleiss · Marvin Pförtner · Philipp Hennig · John Cunningham
Here the authors propose a way to naturally incorporate uncertainty introduced from the use of (iterative) GP approximation methods. Paper link here.

Sparse Gaussian Process Hyperparameters: Optimize or Integrate?
Vidhi Lalchand · Wessel Bruinsma · David Burt · Carl Edward Rasmussen
The authors attempt to integrate a fully-Bayesian inference procedure for sparse GPs, as an alternative to the commonly adopted approach of optimising the kernel hyperparameters by maximum likelihood estimation. Paper link here.

Log-Linear-Time Gaussian Processes Using Binary Tree Kernels:
Michael K. Cohen · Samuel Daulton · Michael A Osborne
The idea here feels a bit unorthodox; they use a “binary-tree” kernel which discretises the space, with quantization error determined by the number of leaves. This would seem to lose interpretability on the properties of the function prior (e.g. smoothness), but does appear to give empirical benefits in their experiments. Paper link here.

Workshops

In addition to the main conference, on the Friday and Saturday at the end of the week there were a selection of workshops on a variety of sub-fields within machine learning. If you are fortunate enough for there to be a workshop dedicated to your research area, then they provide a space to gather people with research directly relevant to your own and facilitate helpful discussions and networking opportunities.

Anthony:

For me, the “Gaussian processes, spatiotemporal modeling and decision-making systems” workshop was the most useful part of the conference. It gave me the chance to speak to people working on interesting problems related to my own; discover the kind of directions they are heading in and lines of work they are contemplating. Additionally, I presented a poster during this workshop which allowed me to discuss my work with an audience well-versed on the topic and its possible significance.

The Big Easy

In addition to the actual conference, attending NeurIPS also gave us the opportunity to explore the city of New Orleans; aka The Big Easy. Upon arrival, we were immediately greeted in the airport by the sound of Louis Armstrong, a strong theme in the city, which features a park named after him. New ‘Awlins’ is well known for its jazz, but awareness of this fact does not necessarily prepare you for the sheer quantity, especially in the streets of the French Quarter, that awaits you. The real epicentre of jazz in the city is situated on Frenchmen street, on which a swathe of bars hosting nearly-nightly live music reside. We spent several evenings there, including one of particular note, where French president Emmanuel Macron suddenly appeared, trailed by an extensive retinue of blue-suited aides and bodyguards. Another street in New Orleans infamous for its nightlife is Bourbon street. Where Frenchmen street is focused on jazz, Bourbon street contains all manner of rowdy madness, assaulting your senses with noise, smells and sights as soon as you arrive. Both are necessary experiences when visiting The Big Easy.

Conclusion

All in all, the conference was a great opportunity to get a taste of the massive array of research that occurs in machine learning. We were all surprised by the scope of the research topics and talks, and enjoyed the opportunity to explore a new culture and city.

Skip to toolbar