Student perspectives: Compass Annual Conference 2023

Posted on 26th September 202326th September 2023 by dominic.broadbent

A post by Dominic Broadbent, PhD student on the Compass CDT, and Michael Whitehouse, PhD student of the Compass CDT (recently submitted thesis)

Introduction

September saw the second annual Compass Conference, hosted in the Fry Building, the home of the School of Mathematics. The event was particularly special as it is the first time that all five Compass cohorts were brought together, and it was a fantastic opportunity to celebrate the achievements and research of the Compass CDT with external partners. This year the theme was “Communicating Research in Context“, focusing on how research can be better communicated, and the need to highlight the motivation and applications of mathematical research.

Research talks

The day began with four long form talks touching on the topic of communicating research. Starting with Alessio Zakaria’s talk which delved into Hypothesis tests, commenting on their criticial role as the defacto statistical tool across the sciences, and how p-hacking has led to a replication crisis that undermines public confidence in research. The next talk by Sam Stockman and Emerald Dilworth discussed the challenges they faced, and the key takeaways from their shared experience of communicating mathematics with researchers in the geographical sciences. Following this, Ed Davis’s interactive talk “The Universal Language of Visualisations” explored how effective visualisation techniques should differ by the intended audience, with examples from his research and activities outside of academia. The last talk by Dan Milner explored his research on understanding the effect of environmental factors on outcomes of smallholder farmers in Kenya. He took us through the process of collecting data on the ground, to modelling and communicating results to external partners. After each talk there was an opportunity to ask questions, allowing for audience participation and the sparking of interesting discussions. The format mirrored that which is most frequently used in external academic conferences, giving the speakers a chance to practice their technique in front of friendly faces.

Lightning talks

After a short break, we jumped back into the fray with a series of 3-minute fast-paced lightning talks. A huge range of topics were covered, all the way from developing modelling techniques for the electric grid of the future, to predicting the incidence of Cerebral Vasospasm at the Southmead Hospital ICU. With such a short time to present, these talks were a great exercise in distilling research into just the essentials, knowing there is very limited time to garner the audience’s interest and convey an effective message.

Special guest lecture

After lunch, we reconvened to attend the special guest lecture. The talk, entitled Bridging the gap between research and industry, was delivered by Ruth Voisey, CEO of the Smith institute. It outlined Ruth’s journey from writing her PhD thesis ‘Multiple wave scattering by quasiperiodic structures’, to CEO of the Smith Institute – via an internship with the acoustic research team at Dyson. It was particularly refreshing to hear Ruth’s candid account of her ‘non-linear’ rise to CEO, accrediting her success to strong principles of clear research communication and ‘mathematical evangelicalism’.

As PhD students in the bubble of academia, the path to opportunities in the world of industry can often feel clouded – Ruth’s lecture painted a clear picture of how one can transition from university based research to a rewarding career outside of this bubble, applying such research to tangible problems in the real world.

Panel discussion and poster session

The special guest lecture was followed by a discussion on communicating research in context, with panel members Ruth Voisey, David Greenwood, Helen Barugh, Oliver Johnson, plus Compass CDT students Ed Davis and Sam Stockman. The panel discussed the difficulty of communicating the nuances of research conclusions with the public, with a particular focus on handling these nuances when talking to journalists – stressing the importance of communicating the limitations of the research in question.

This was followed by a poster session, one enthusiastic student had the following comment “it was great to see of all the Compass students’ hard work being celebrated and shared with the wider data science community”.

Concluding remarks

To cap off the successful event, Compass students Hannah Sansford and Josh Givens delivered some concluding remarks which were drawn from comments made by students about what key points they’d taken from the day. These focused on the importance of clear communication of research throughout the whole pipeline, from inception in discussion with fellow academics to the dissemination of knowledge to the general population.

The day ended with a walk to Goldney Hall, where students, staff, and attendees enjoyed delicious food, wine, and access to the beautiful Orangery gardens.

Student Perspectives: Density Ratio Estimation with Missing Data

Posted on 3rd July 20233rd July 2023 by josh.givens

A post by Josh Givens, PhD student on the Compass programme.

Density ratio estimation is a highly useful field of mathematics with many applications. This post describes my research undertaken alongside my supervisors Song Liu and Henry Reeve which aims to make density ratio estimation robust to missing data. This work was recently published in proceedings for AISTATS 2023.

Density Ratio Estimation

Definition

As the name suggests, density ratio estimation is simply the task of estimating the ratio between two probability densities. More precisely for two RVs (Random Variables) $Z^0, Z^1$ on some space $\mathcal{Z}$ with probability density functions (PDFs) $p_0, p_1$ respectively, the density ratio is the function $r^*:\mathcal{Z}\rightarrow\mathbb{R}$ defined by

$r^*(z):=\frac{p_0(z)}{p_1(z)}$ .

Plot of the scaled density ratio alongside the PDFs for the two classes.

Density ratio estimation (DRE) is then the practice of using IID (independent and identically distributed) samples from $Z^0$ and $Z^1$ to estimate $r^*$ . What makes DRE so useful is that it gives us a way to characterise the difference between these 2 classes of data using just 1 quantity, $r^*$ .

The Density Ratio in Classification

We now give demonstrate this characterisability in the case of classification. To frame this as a classification problem define $Y\sim\text{Bernoulli}(0.5)$ and $Z$ by $Z|Y=y\sim Z^{y}$ . The task of predicting $Y$ given $Z$ using some function $\phi:\mathcal{Z}\rightarrow\{0,1\}$ is then our standard classification problem. In classification a common target is the Bayes Optimal Classifier, the classifier $\phi^*$ which maximises $\mathbb{P}(Y=\phi(Z)).$ We can write this classifier in terms of $r^*$ as we know that $\phi^*(z)=\mathbb{I}\{\mathbb{P}(Y=1|Z=z)>0.5\}$ where $\mathbb{I}$ is the indicator function. Then, by the total law of probability, we have

$\mathbb{P}(Y=1|Z=z)=\frac{p_{Z|Y=1}(z)\mathbb{P}(Y=1)}{p_{Z|Y=1}(z)\mathbb{P}(Y=1)+p_{Z|Y=0}(z)\mathbb{P}(Y=0)}$

$=\frac{p_1(z)\mathbb{P}(Y=1)}{p_1(z)\mathbb{P}(Y=1)+p_0(z)\mathbb{P}(Y=0)} =\frac{1}{1+\frac{1}{r}\frac{\mathbb{P}(Y=0)}{\mathbb{P}(Y=1)}}.$

Hence to learn the Bayes optimal classifier it is sufficient to learn the density ratio and a constant. This pattern extends well beyond Bayes optimal classification to many other areas such as error controlled classification, GANs, importance sampling, covariate shift, and others. Generally speaking, if you are in any situation where you need to characterise the difference between two classes of data, it’s likely that the density ratio will make an appearance.

Estimation Implementation – KLIEP

Now we have properly introduced and motivated DRE, we need to look at how we can go about performing it. We will focus on one popular method called KLIEP here but there are a many different methods out there (see Sugiyama et al 2012 for some additional examples.)

The intuition behind KLIEP is simple: as $r^* \cdot p_0=p_1$ , if $\hat r\cdot p_0$ is “close” to $p_1$ then $\hat r$ is a good estimate of $r^*$ . To measure this notion of closeness KLIEP uses the KL (Kullback-Liebler) divergence which measures the distance between 2 probability distributions. We can now formulate our ideal KLIEP objective as follows:

$\underset{r}{\text{min}}~ KL(p_1|p_0\cdot r)$

$\text{subject to:}~ \int_{\mathcal{Z}}r(z)p_0(z)\mathrm{d}z=1$

where $KL(p|p')$ represent the KL divergence from $p$ to $p'$ . The constraint ensures that the right hand side of our KL divergence is indeed a PDF. From the definition of the KL-divergence we can rewrite the solution to this as $\hat r:=\frac{\tilde r}{\mathbb{E}[r(X^0)]}$ where $\tilde r$ is the solution to the unconstrained optimisation

$\underset{r}{\text{min}}~\mathbb{E}[\log (r(Z^1))]-\log(\mathbb{E}[r(Z^0)]).$

As this is now just an unconstrained optimisation over expectations of known transformations of $Z^0, Z^1$ , we can approximate this using samples. Given samples $z^0_1,\dotsc,z^0_n$ from $Z_0$ and samples $z^1_1,\dotsc,z^1_n$ from $Z_1$ our estimate of the density ratio will be $\hat r:=\left(\frac{1}{n}\sum_{i=1}^nr(z_i^0)\right)^{-1}\tilde r$ where $\tilde r$ solves

$\underset{r}{\min}~ \frac{1}{n}\sum_{i=1}^n \log(r(z^1_i))-\log\left(\frac{1}{n}\sum_{i=1}^n r(z^0_i)\right).$

Despite KLIEP being commonly used, up until now it has not been made robust to missing not at random data. This is what our research aims to do.

Missing Data

Suppose that instead of observing samples from $Z$ , we observe samples from some corrupted version of $Z$ , $X$ . We assume that $\mathbb{P}(\{X=\varnothing\}\cup \{X=Z\})=1$ so that either $X$ is missing or $X$ takes the value of $Z$ . We also assume that whether $X$ is missing depends upon the value of $Z$ . Specifically we assume $\mathbb{P}(X=\varnothing|Z=z)=\varphi(z)$ with $\varphi(z)$ not constant and refer to $\varphi$ as the missingness function. This type of missingness is known as missing not at random (MNAR) and when dealt with improperly can lead to biased result. Some examples of MNAR data could be readings take from a medical instrument which is more likely to err when attempting to read extreme values or recording responses to a questionnaire where respondents may be more likely to not answer if the deem their response to be unfavourable. Note that while we do not see what the true response would be, we do at least get a response meaning that we know when an observation is missing.

Missing Data with DRE

We now go back to density ratio estimation in the case where instead of observing samples from $Z^0,Z^1$ we observe samples from their corrupted versions $X^0, X^1$ . We take their respective missingness functions to be $\varphi_0, \varphi_1$ and assume them to be known. Now let us look at what would happen if we implemented KLIEP with the data naively by simply filtering out the missing-values. In this case, the actual density ratio we would be estimating would be

$r'(z):=\frac{p_{X_1|X_1\neq\varnothing}(z)}{p_{X_0|X_o\neq\varnothing}(z)}\propto\frac{(1-\varphi_1(z))p_1(z)}{(1-\varphi_0(z))p_0(z)}\not{\propto}r^*(z)$

and so we would get inaccurate estimates of the density ratio no matter how many samples are used to estimate it. The image below demonstrates this in the case were samples in class $1$ are more likely to be missing when larger and class $0$ has no missingness.

A plot of the density ratio using both the full data and only the observed part of the corrupted data

Our Solution

Our solution to this problem is to use importance weighting. Using relationships between the densities of $X$ and $Z$ we have that

$\mathbb{E}[g(Z)]=\mathbb{E}\left[\frac{\mathbb{I}\{X\neq\varnothing\}g(X)}{1-\varphi(X)}\right].$

As such we can re-write the KLIEP objective to keep our expectation estimation unbiased even when using these corrupted samples. This gives our modified objective which we call M-KLIEP as follows. Given samples $x^0_1,\dotsc,x^0_n$ from $X_0$ and samples $x^1_1,\dotsc,x^1_n$ from $X_1$ our estimate is $\hat r=\left(\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^0\neq\varnothing\}r(x_i^0)}{1-\varphi_o(x_i^o)}\right)^{-1}\tilde r$ where $\tilde r$ solves

$\underset{r}{\min}~\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^1\neq\varnothing\}\log(r(x_i^1))}{1-\varphi_1(x_i^1)}-\log\left(\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{I}\{x_i^0\neq\varnothing\}r(x_i^0)}{1-\varphi_0(x_i^0)}\right).$

This objective will now target $r^*$ even when used on MNAR data.

Application to Classification

We now apply our density ratio estimation on MNAR data to estimate the Bayes optimal classifier. Below shows a plot of samples alongside the true Bayes optimal classifier and estimated classifiers from the samples via our method M-KLIEP and a naive method CC-KLIEP which simply ignores missing points. Missing data points are faded out.

Faded points represent missing values. M-KLIEP represents our method, CC-KLIEP represents a Naive approach, BOC gives the Bayes optimal classifier

As we can see, due to not accounting for the MNAR nature of the data, CC-KLIEP underestimates the true number of class 1 samples in the top left region and therefore produces a worse classifier than our approach.

Additional Contributions

As well as this modified objective our paper provides the following additional contributions:

Theoretical finite sample bounds on the accuracy of our modified procedure.
Methods for learning the missingness functions $\varphi_1,\varphi_0$ .
Expansions to partial missingness via a Naive-Bayes framework.
Downstream implementation of our method within Neyman-Pearson classification.
Adaptations to Neyman-Pearson classification itself making it robust to MNAR data.

For more details see our paper and corresponding github repository. If you have any questions on this work feel free to contact me at josh.givens@bristol.ac.uk.

References

Givens, J., Liu, S., & Reeve, H. W. J. (2023). Density ratio estimation and neyman pearson classification with missing data. In F. Ruiz, J. Dy, & J.-W. van de Meent (Eds.), Proceedings of the 26th international conference on artificial intelligence and statistics (Vol. 206, pp. 8645–8681). PMLR.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.

Compass students attending AISTATS 2023 in Valencia

Posted on 14th June 202317th January 2024 by josh.givens

We (Ed Davis, Josh Givens, Alex Modell, and Hannah Sansford) attended the 2023 AISTATS conference in Valencia in order to explore the interesting research being presented as well as present some of our own work. While we talked about our work being published at the conference in this earlier blog post, having now attended the conference, we thought we’d talk about our experience there. We’ll spotlight some of the talks and posters which interested us most and talk about our highlights of Valencia as a whole.

Talks & Posters

Mode-Seeking Divergences: Theory and Applications to GANs

One especially interesting talk and poster at the conference was presented by Cheuk Ting Li on their work in collaboration with Farzan Farnia. This work aims to set up a formal classification for various probability measure divergences (such as f-Divergences, Wasserstein distance, etc.) in terms of there mode-seeking or mode-covering properties. By mode-seeking/mode-covering we mean the behaviour of the divergence when used to fit a unimodal distribution to a multi-model target. Specifically a mode-seeking divergence will encourage the target distribution to fit just one of the modes ignoring the other while a mode covering divergence will encourage the distribution to cover all modes leading to less accurate fitting on an individual mode but better covering the full support of the distribution. While these notions of mode-seeking and mode-covering divergences had been discussed before, up to this point there seems to be no formal definition for these properties, and disagreement on the appropriate categorisation of some divergences. This work presents such a definition and uses it to categorise many of the popular divergence methods. Additionally they show how an additive combination a mode seeking f-divergence and the 1-Wasserstein distance retain the mode-seeking property of the f-divergence while being implementable using only samples from our target distribution (rather than knowledge of the distribution itself) making it a desirable divergence for use with GANs.

Talk: https://youtu.be/F7LdHIzZQow

Paper: https://proceedings.mlr.press/v206/ting-li23a.html

Using Sliced Mutual Information to Study Memorization and Generalization in
Deep Neural Networks

The benefit of attending large conferences like AISTATS is having the opportunity to hear talks that are not related to your main research topic. This was the case with a talk by Wongso et. al. was one such talk. Although it did not overlap with any of our main research areas, we all found this talk very interesting.
The talk was on the topic of tracking memorisation and generalisation in deep neural networks (DNNs) through the use of /sliced mutual information/. Mutual information (MI) is commonly used in information theory and represents the reduction of uncertainty about one random variable, given the knowledge of the other. However, MI is hard to estimate in high dimensions, which makes it a prohibitive metric for use in neural networks.
Enter sliced mutual information (SMI). This metric is the average of the MI terms between their one-dimensional projections. The main difference between SMI and MI is that SMI is scalable to high dimensions and scales faster than MI.
Next, let’s talk about memorisation. Memorisation is known to occur in DNNs and is where the DNN fits random labels in training as it has memorised noisy labels in training, leading to bad generalisation. The authors demonstrate this behaviour by fitting a multi-layer perceptron to the MNIST dataset with various amounts of label noise. As the noise increased, the difference between the training and test accuracy became greater.
As the label noise increases, the MI between the features and target variable does not change, meaning that MI did not track the loss in generalisation. However, the authors show that the SMI did track the generalisation. As the label noise increased, the SMI decreased significantly as the MLP’s generalisation got worse. Their main theorem showed that SMI is lower-bounded by a term which includes the spherical soft-margin separation, a quantity which is used to track memorisation and generalisation!
In summary, unlike MI, SMI can track memorization and generalisation in DNNs. If you’d like to know more, you can find the full paper here: https://proceedings.mlr.press/v206/wongso23a.html.

Invited Speakers and the Test of Time Award

As well as the talks on papers that had been selected for oral presentation, each day began with a (longer) invited talk which, for many of us, were highlights of each day. The invited speakers were extremely engaging and covered varied and interesting topics; from Arthur Gretton (UCL) presenting ‘Causal Effect Estimation with Context and Confounders’ to Shakir Mohamed (DeepMind) presenting ‘Elevating our Evaluations: Technical and Sociotechnical Standards of Assessment in Machine Learning’. A favourite amongst us was a talk from Tamara Broderick (MIT) titled ‘An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?’. In this talk she addressed a worry that researchers might have when the goal is to analyse a data sample and apply any conclusions to a new population: was a small proportion of the data sample instrumental to the original conclusion? Tamara and collorators propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. They find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. In experiments they find that many data analyses are robust, but that the conclusions of severeal influential economics papers can be changed by removing (much) less than 1% of the data! A link to the talk can be found here: https://youtu.be/QYtIEqlwLHE

This year, AISTATS featured a Test of Time Award to recognise a paper from 10 years ago that has had a prominent impact in the field. It was awarded to Andreas Damianou and Neil Lawrence for the paper ‘Deep Gaussian Processes’, and their talk was a definite highlight of the conferece. Many of us had seen Neil speak at a seminar at the University of Bristol last year and, being the engaging speaker he is, we were looking forward to hearing him speak again. Rather than focussing on the technical details of the paper, Neils talk concentrated on his (and the machine learning community’s) research philosophy in the years preceeding writing the paper, and how the paper came about – a very interesting insight, and a refreshing break from the many technical talks!

Valencia

There was so much to like about Valencia even from our short stay there. We’ll try and give you a very brief highlight of our favourite things.

Food & Drink:

Obviously Valencia is renowned for being the birthplace of paella and while the paella was good we sampled many other delights our stay. Our collective highlight was the nicest Burrata any of us had ever had which, in a stunning display of individualism, all four of us decided to get on our first day at the conference.

Beach:

About half an hours tram ride from the conference centre are the beaches of Valencia. These stretch for miles as well as having a good 100m in depth with (surprisingly hot) sand covering the lot. We visited these after the end of the conference on the Thursday and despite it being the only cloudy day of the week it was a perfect way to relax at the end of a hectic few days with the pleasantly temperate water being an added bonus.

Architecture:

Valencia has so much interesting architecture scattered around the city centre. One of the most memorable remarkable places was the San Nicolás de Bari and San Pedro Mártir (Church of San Nicolás) which is known as the Sistine chapel of Valencia (according to the audio-guide for the church at least) with its incredible painted ceiling and live organ playing.

Summer applications deadline for Compass CDT: 12 June 2023

Posted on 18th May 202325th May 2023 by crina.radu

We are happy to announce our upcoming applications deadline is 12 June 2023, 23:59 (London UK time zone) for the final few fully funded places to start September 2023. For international applications there are limited scholarship funded places available. Early applications advised.

Compass is offering specific projects for PhD students to study from Sept 2023. We are pleased to announce that there are 4 new project opportunities to study. The full list of the projects/ supervisors has been updated. All the supervisors listed are open to discussion on the projects provided and can also talk to applicants about other project ideas. Please provide a ranked list of 3 projects of interest: 1 = project of highest interest. Project supervisors will be happy to respond to specific questions you have after reading the proposals. Applicants should contact them by email if they wish beforehand.

New PhD Projects available

Spatio-temporal methods for estimating infectious disease transmission – Juliette Unwin
Project: Unwin_2 (PDF, 125kB)
Comparison Theory for MCMC Algorithm – Sam Power
Project: Power (PDF, 133kB)
Estimating impact of complex crises on children – Juliette Unwin
Project: Unwin1 (PDF, 135kB)
Modern nonparameteric approaches to the estimation of heterogeneous casual effects in observational studies – Katarzyna Reluga
Project: Reluga (PDF, 207kB)

The full list of the projects/ supervisors.

PhD Project Allocation Process

Application forms will be reviewed based on the 3 ranked projects specified or other proposed topic. Successful applicants will be invited to attend an interview with the Compass admissions tutors and the specific project supervisor. If you are made an offer of PhD study it will be published through the online application system. You will then have 2 weeks to consider the offer before deciding whether to accept or decline.

We welcome applications from all members of our community and are particularly encouraging those from diverse groups, such as members of the LGBT+ and black, Asian and minority ethnic communities, to join us.

APPLICATIONS DEADLINE

12 June 2023, 23:59 (London UK time zone)

APPLY NOW

Advantages of being a Compass Student

Stipend – a generous stipend of £22,622 pa tax free, paid in monthly payments. Plus your own expense budget of £1,000 pa towards travel and research activity.
No fees – all tuition fees are covered by the EPSRC and University of Bristol.
Bespoke training – first year units are designed specifically for the academic needs of each Compass student, which enables students to develop knowledge and capability to pursue cross-disciplinary PhD research.
Supervisors – supervisors from across academic disciplines offer a range of research projects.
Cohort – Compass students benefit from dedicated offices and collaboration spaces, enabling strong cohort links and opportunities for shared learning and research.

About Compass CDT

A 4-year bespoke PhD training programme in the statistical and computational techniques of data science, with partners from across the University of Bristol, industry and government agencies.

The cross-disciplinary programme offers exciting collaborations across medicine, computer science, geography, economics, life and earth sciences, as well as with our external partners who range from government organisations such as the Office for National Statistics, NCSC and the AWE, to industrial partners such as LV, Improbable, IBM Research, EDF, and AstraZeneca.

Students are co-located with the Institute for Statistical Science in the School of Mathematics, which occupies the Fry Building.

Hear from our students about their experience with the programme

Compass has allowed me to advance my statistical knowledge and apply it to a range of exciting applied projects, as well as develop skills that I’m confident will be highly useful for a future career in data science. – Shannon, Cohort 2
With the Compass CDT I feel part of a friendly, interactive environment that is preparing me for whatever I move on to next, whether it be in Academia or Industry. – Sam, Cohort 2
An incredible opportunity to learn the ever-expanding field of data science, statistics and machine learning amongst amazing people. – Danny, Cohort 1

Compass CDT Video

Find out more about what it means to be a part of the Compass programme from our students in this short video.

APPLICATIONS DEADLINE

12 June 2023, 23:59 (London UK time zone)

APPLY NOW

Student Perspectives: Intro to Recommendation Systems

Posted on 12th May 202312th May 2023 by hannah.sansford

A post by Hannah Sansford, PhD student on the Compass programme.

Introduction

Like many others, I interact with recommendation systems on a daily basis; from which toaster to buy on Amazon, to which hotel to book on booking.com, to which song to add to a playlist on Spotify. They are everywhere. But what is really going on behind the scenes?

Recommendation systems broadly fit into two main categories:

1) Content-based filtering. This approach uses the similarity between items to recommend items similar to what the user already likes. For instance, if Ed watches two hair tutorial videos, the system can recommend more hair tutorials to Ed.

2) Collaborative filtering. This approach uses the the similarity between users’ past behaviour to provide recommendations. So, if Ed has watched similar videos to Ben in the past, and Ben likes a cute cat video, then the system can recommend the cute cat video to Ed (even if Ed hasn’t seen any cute cat videos).

Both systems aim to map each item and each user to an embedding vector in a common low-dimensional embedding space $E = \mathbb{R}^d$ . That is, the dimension of the embeddings ( $d$ ) is much smaller than the number of items or users. The hope is that the position of these embeddings captures some of the latent (hidden) structure of the items/users, and so similar items end up ‘close together’ in the embedding space. What is meant by being ‘close’ may be specified by some similarity measure.

Collaborative filtering

In this blog post we will focus on the collaborative filtering system. We can break it down further depending on the type of data we have:

1) Explicit feedback data: aims to model relationships using explicit data such as user-item (numerical) ratings.

2) Implicit feedback data: analyses relationships using implicit signals such as clicks, page views, purchases, or music streaming play counts. This approach makes the assumption that: if a user listens to a song, for example, they must like it.

The majority of the data on the web comes from implicit feedback data, hence there is a strong demand for recommendation systems that take this form of data as input. Furthermore, this form of data can be collected at a much larger scale and without the need for users to provide any extra input. The rest of this blog post will assume we are working with implicit feedback data.

Problem Setup

Suppose we have a group of $n$ users $U = (u_1, \ldots, u_n)$ and a group of $m$ items $I = (i_1, \ldots, i_m)$ . Then we let $\mathbf{R} \in \mathbb{R}^{n \times m}$ be the ratings matrix where position $R_{ui}$ represents whether user $u$ interacts with item $i$ . Note that, in most cases the matrix $\mathbf{R}$ is very sparse, since most users only interact with a small subset of the full item set $I$ . For any items $i$ that user $u$ does not interact with, we set $R_{ui}$ equal to zero. To be clear, a value of zero does not imply the user does not like the item, but that they have not interacted with it. The final goal of the recommendation system is to find the best recommendations for each user of items they have not yet interacted with.

Matrix Factorisation (MF)

A simple model for finding user emdeddings, $\mathbf{X} \in \mathbb{R}^{n \times d}$ , and item embeddings, $\mathbf{Y} \in \mathbb{R}^{m \times d}$ , is Matrix Factorisation. The idea is to find low-rank embeddings such that the product $\mathbf{XY}^\top$ is a good approximation to the ratings matrix $\mathbf{R}$ by minimising some loss function on the known ratings.

A natural loss function to use would be the squared loss, i.e.

$L(\mathbf{X}, \mathbf{Y}) = \sum_{u, i} \left(R_{ui} - \langle X_u, Y_i \rangle \right)^2.$

This corresponds to minimising the Frobenius distance between $\mathbf{R}$ and its approximation $\mathbf{XY}^\top$ , and can be solved easily using the singular value decomposition $\mathbf{R} = \mathbf{U S V}^\top$ .

Once we have our embeddings $\mathbf{X}$ and $\mathbf{Y}$ , we can look at the row of $\mathbf{XY}^\top$ corresponding to user $u$ and recommend the items corresponding to the highest values (that they haven’t already interacted with).

Logistic MF

Minimising the loss function in the previous section is equivalent to modelling the probability that user $u$ interacts with item $i$ as the inner product $\langle X_u, Y_i \rangle$ , i.e.

$R_{ui} \sim \text{Bernoulli}(\langle X_u, Y_i \rangle),$

and maximising the likelihood over $\mathbf{X}$ and $\mathbf{Y}$ .

In a research paper from Spotify [3], this relationship is instead modelled according to a logistic function parameterised by the sum of the inner product above and user and item bias terms, $\beta_u$ and $\beta_i$ ,

$R_{ui} \sim \text{Bernoulli} \left( \frac{\exp(\langle X_u, Y_i \rangle + \beta_u + \beta_i)}{1 + \exp(\langle X_u, Y_i \rangle + \beta_u + \beta_i)} \right).$

Relation to my research

A recent influential paper [1] proved an impossibility result for modelling certain properties of networks using a low-dimensional inner product model. In my 2023 AISTATS publication [2] we show that using a kernel, such as the logistic one in the previous section, to model probabilities we can capture these properties with embeddings lying on a low-dimensional manifold embedded in infinite-dimensional space. This has various implications, and could explain part of the success of Spotify’s logistic kernel in producing good recommendations.