## Student Perspectives: Factor-adjusted vector autoregressive models

A post by Dylan Dijk, PhD student on the Compass programme.

# Introduction

My current project is looking to robustify the performance of time series models to heavy-tailed data. The models I have been focusing on are vector autoregressive (VAR) models, and additionally factor-adjusted VAR models. In this post I will not be covering the robust methodology, but will be introducing VAR models and providing the motivation for introducing the factor adjustment step when working with high-dimensional time series.

# Vector autoregressive models

In time series analysis the objective is often to forecast a future value given past data, for example, one of the classical models for univariate time series is the autoregressive AR(d) model:
$X_t = a_1 X_{t-1} + \dots + a_d X_{t-d} + \epsilon_t \, .$
However, in many cases, the value of a variable is influenced not just by its own past values but also by past values of other variables. For example, in Economics, household consumption expenditures may depend on variables such as income, interest rates, and investment expenditures, therefore we would want to include these variables in our model.

The VAR model [1] is simply the multivariate generalisation of the univariate autoregressive model, that is, for a $p$-dimensional stochastic process $(\dots, \mathbf{X}_t, \mathbf{X}_{t+1}, \dots) \in \mathbb{R}^p$ we model an observation at time $t$ as a linear combination of previous observations up to some lag $d$ plus an error:

$\mathbf{X}_t = \mathbf{A}_1 \mathbf{X}_{t-1} + \dots + \mathbf{A}_d \mathbf{X}_{t-d} + \boldsymbol{\epsilon}_t \, ,$
where $\mathbf{A}_i$ are $p \times p$ coefficient matrices. Therefore, in addition to modelling serial dependence, the model takes into account cross-sectional dependence. This model can then be used for forecasting, and as an explanatory model to describe the dynamic interrelationships between a number of variables.

## Estimation

Given a dataset of $n$ observations, $\{\mathbf{X}_1, \dots, \mathbf{X}_n \in \mathbb{R}^p\}$, we can aim to estimate the coefficient matrices. In order to do so, the model can be written in a stacked form:

\begin{align*} \underbrace{\left[\begin{array}{c}\left(\mathbf{X}_n\right)^{T} \\ \vdots \\ \left(\mathbf{X}_{d+1}\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{Y}}} & =\underbrace{\left[\begin{array}{ccc}\left(\mathbf{X}_{n-1}\right)^{T} & \cdots & \left(\mathbf{X}_{n-d}\right)^{T} \\ \vdots & \ddots & \vdots \\ \left(\mathbf{X}_{d}\right)^{T} & \cdots & \left(\mathbf{X}_1\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{X}}} \underbrace{\left[\begin{array}{c}\boldsymbol{A}_1^{T} \\ \vdots \\ \boldsymbol{A}_d^{T}\end{array}\right]}_{\boldsymbol{A}^T}+\underbrace{\left[\begin{array}{c}\left(\boldsymbol{\epsilon}_n\right)^{T} \\ \vdots \\ \left(\boldsymbol{\epsilon}_d\right)^{T}\end{array}\right]}_{\boldsymbol{E}}
\end{align*}
and subsequently vectorised to return a standard univariate linear regression problem
\begin{align*}
\operatorname{vec}(\boldsymbol{\mathcal{Y}}) & =\operatorname{vec}\left(\boldsymbol{\mathcal{X}} \boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \\ & =(\textbf{I} \otimes \boldsymbol{\mathcal{X}}) \operatorname{vec}\left(\boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \label{eq:stacked_var_regression_form}\\ \underbrace{\boldsymbol{Y}}_{N p \times 1} & =\underbrace{\boldsymbol{Z}}_{N p \times q} \underbrace{\boldsymbol{\beta}^*}_{q \times 1}+\underbrace{\operatorname{vec}(\boldsymbol{E})}_{N p \times 1}, \quad N=(n-d), \quad q=d p^2.
\end{align*}

## Sparse VAR

There are $dp^2$ parameters to estimate in this model, and hence VAR estimation is naturally a high-dimensional statistical problem. Therefore, estimation methods and associated theory need to hold under high-dimensional scaling of the parameter dimension. Specifically, this means consistency is shown for when both $p$ and $n$ tend to infinity, as opposed to in classical statistics where $p$ is kept fixed.

The linear model in the high-dimensional setting is well understood [2]. To obtain a consistent estimator requires additional structural assumptions in the model, in particular, sparsity on the true vector $\boldsymbol\beta^*$. The common approach for estimation is lasso, which can be motivated from convex relaxation in the noiseless setting. Consistency of lasso is well studied [3][4], with consistency guaranteed under sparsity, and restrictions on the directions in which the hessian of the loss function is strictly positive.

The well known lasso objective is given by:

\begin{align*}
\underset{{\boldsymbol\beta \in \mathbb{R}^q}}{\text{argmin}} \, \|\boldsymbol{Y}-\boldsymbol{Z} \boldsymbol\beta\|_{2}^{2} + \lambda \|\boldsymbol\beta\|_1 \, ,
\end{align*}

and below, we give a simplified consistency result that can be obtained under certain assumptions.

We denote the sparsity of $\boldsymbol{A}$ by
$s_{0, j}=\left|\boldsymbol\beta^*_{(j)}\right|_0, s_0=\sum_{j=1}^p s_{0, j}$ and $s_{\text {in }}=\max _{1 \leq j \leq p} s_{0, j}$.

Lasso consistency result
Suppose
\begin{gather*}
\, s_{\text{in}} \leq C_1 \sqrt{\frac{n}{\log p}} \, \; \text{and } \; \lambda \geq C_2 (\|\boldsymbol{A}^T\|_{1,\infty} + 1)\sqrt{\frac{\log p}{n}} \; ,
\end{gather*}
then with high probability we have
\begin{align*}
|\widehat{\boldsymbol{A}} – \boldsymbol{A}|_2 \leq C_3 \sqrt{s_{0}} \lambda \quad \text{and} \quad |\widehat{\boldsymbol{A}} – \boldsymbol{A}|_1 \leq C_4 s_0 \lambda \, .
\end{align*}
What we mean here by consistency, is that as $n,p \rightarrow \infty$, the estimate $\widehat{\boldsymbol\beta}$ converges to $\boldsymbol\beta$ in probability. Where we think of $p$ as being a function of $n$, so the manner in which the dimension $p$ grows depends on the sample size. For example, in the result above, we can have consistency with $p = \exp(\sqrt{n})$.

The result indicates that for larger $p$ a more sparse solution, and a larger regularisation parameter is required. Similar results have been derived under various assumptions, for instance under a Gaussian VAR the result has been given in terms of the largest and smallest eigenvalues of the spectral density matrix of a series [5], and hence consistency requires that these quantities are bounded.

In summary, for lasso estimation to work we need $\boldsymbol{A}$ to be sufficiently sparse, and the largest eigenvalue of the spectral density matrix to be bounded. But are these reasonable assumptions to make?

First two leading eigenvalues of the spectral density matrix.

Heatmap of logged p-values for evidence of non-zero coefficients after fitting ridge regression model.

Well, intuitively, if a multivariate time series has strong cross-sectional dependence we would actually expect to have many non-zero entries in the VAR coefficients $\boldsymbol{A}_i$. The figures above, taken from [6], illustrate a real dataset in which there is statistical evidence for a non-sparse solution (heatmap), and that the leading eigenvalue of the spectral density matrix diverges linearly in $p$. Therefore providing an example in which two of the assumptions discussed above are unmet.

The idea now is to assume that the covariance of the observed vector $\mathbf{X}_t$ is driven by a lower dimensional latent vector. For example, the figures above were generated from a dataset of stock prices of financial institutions, in this case an interpretation of a latent factor could be overall market movements which captures the broad market trend, or a factor that captures the change in interest rates.
\begin{align}
\mathbf{X}_t &= \underset{p \times r}{\boldsymbol\Lambda} \underset{r \times 1}{\mathbf{F}_t} + \boldsymbol\xi_t \quad
\end{align}
Consequently, first fitting a factor model would account for strong cross-sectional correlations, leaving the remaining process to exhibit the individual behaviour of each series. Fitting a sparse VAR process will now be a more reasonable choice.

In the formula above, $\mathbf{F}_t$ is the factor random vector, and $\boldsymbol\Lambda$ the constant loading matrix, which quantifies the sensitivity of each variable to the common factors, and we can model $\boldsymbol\xi_t$ as a sparse VAR process, as described in the preceding sections.

References

[1] Lütkepohl, H. (2005) New introduction to multiple time series analysis. Berlin: Springer-Verlag.

[2] Wainwright, M. (2019) High-dimensional statistics: A non-asymptotic viewpoint – Chapter 7 – Sparse linear models in high dimensions. Cambridge, United Kingdom: Cambridge University Press.

[3] Geer, Sara A. van de, and Peter Bühlmann. (2009) On the Conditions Used to Prove Oracle Results for the Lasso. Electronic Journal of Statistics. Project Euclid, https://doi.org/10.1214/09-EJS506.

[4] Bickel, Peter J., Ya’acov Ritov, and Alexandre B. Tsybakov. (2009) Simultaneous Analysis of Lasso and Dantzig Selector. The Annals of Statistics. https://doi.org/10.1214/08-AOS620.

[5] Sumanta Basu, George Michailidis. (2015) Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics. https://doi.org/10.1214/15-AOS1315.

[6] Barigozzi, M., Cho, H. and Owens, D. (2024). FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series. Journal of Business & Economic Statistics.

# MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\$','\$']]}});

A post by Rachel Wood, PhD student on the Compass programme.

# Introduction

As our online lives expand, more data than we can reasonably consider at once is collected. Many of this is sparse and noisy data, needing methods which can recover information encoded in these structures. An example of these kind of datasets are networks. In this blog post, I explain how we can do this to identify changes between networks observing the same subjects (e.g. snapshots of the same graph over time).

# Problem Set-Up

We consider two undirected graphs, represented by their adjacency matrices $\mathbf{A}^{(1)}, \mathbf{A}^{(2)} \in \{0,1\}^{n \times n}$. As we can see below, there are two clusters (pink nodes form one, the yellow and blue nodes form another) in the first graph but in the second graph the blue nodes change behaviour to become a distinct third cluster.

Our question becomes, how can we detect this change without prior knowledge of the labels?

We can simply look at the adjacency matrices, but these are often sparse, noisy and computationally expensive to work with. Using dimensionality reduction, we can “denoise” the matices to obtain a $d$-dimensional latent representation of each node, which provides a natural measure of node behaviour and a simple space in which to measure change.

# Graph Embeddings

There is an extensive body of research investigating graph embeddings, however here we will focus on spectral methods.
Specifically we will compare the approaches of Unfolded Adjacency Spectral Embedding (UASE) presented in [1] and CLARITY presented in [2]. Both of these are explained in more detail below.

## UASE

UASE takes as input the unfolded adjacency matrix $\mathbf{A} = \left[ \mathbf{A}^{(1)}\big| \mathbf{A}^{(2)}\right] \in \{0,1\}^{2n \times n}$ and performs $d$ truncated SVD [3] to obtain a $d$-dimensional static and a $d$-dimensional dynamic representation:

Mathematically we can write this as:
\begin{equation*}
\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T = \mathbf{U}_{\mathbf{A}} \boldsymbol{\Sigma}_{\mathbf{A}} \mathbf{V}_{\mathbf{A}}^T + \mathbf{U}_{\perp \!\!\!\ } \ \boldsymbol{\Sigma}_{\perp \!\!\!\ } \ \mathbf{V}_{\perp \!\!\!\ }^T \ \approx \mathbf{U}_{\mathbf{A}} \boldsymbol{\Sigma}_{\mathbf{A}} \mathbf{V}_{\mathbf{A}}^T = \mathbf{X} \mathbf{Y}^T
\end{equation*}

where $\mathbf{U}_{\mathbf{A}}, \mathbf{V}_{\mathbf{A}}$ are the first $d$ columns of $\mathbf{U}$ and $\mathbf{V}$ respectively and $\boldsymbol{\Sigma}_{\mathbf{A}}$ is the diagonal matrix which forms the $d \times d$ upper left block of $\boldsymbol{\Sigma}$. This gives a static embedding $\mathbf{X} \in \mathbb{R}^{n \times d}$ and a time evolving embedding $\mathbf{Y} \in \mathbb{R}^{2n \times d}$.

The general approach in UASE literature is to measure change by comparing latent positions, which is backed by [4]. This paper gives a theoretical demonstration for longitudinal and cross-sectional stability in UASE, i.e. for observations $i$ at time $s$ and $j$ at time $t$ behaving similarly, their latent positions should be the same: $\hat Y_i^{(s)} \approx \hat Y_j^{(t)}$. This backs the general approach in the UASE literature of comparing latent positions to quantify change.

Going back to our example graphs, we apply UASE to the unfolded adjacency matrix and visualise the first two dimensions of the embedding for each of the graphs:

As we can see above, the pink nodes have retained their positions, the yellow nodes have moved a little and the blue nodes have moved the most.

## CLARITY

Clarity takes a different approach, by estimating $\mathbf{A}^{(2)}$ from $\mathbf{A}^{(1)}$. An illustration of how it is done is shown below:

Again we provide a mathmatical explanation of the method. First we perform a $d$-dimensional truncated eigendecompositionon $\mathbf{A}^{(1)}$:

\begin{equation*}
\mathbf{A}^{(1)} = \mathbf{U}^{(1)} \boldsymbol{\Sigma}^{(1)} \mathbf{U}^{(1)T} + \mathbf{U}_{\perp \!\!\!\ } \ \boldsymbol{\Sigma}_{\perp \!\!\!\ } \ \mathbf{U}_{\perp \!\!\!\ }^T \ \approx \mathbf{U}^{(1)} \boldsymbol{\Sigma}^{(1)} \mathbf{U}^{(1)T} = \hat{\mathbf{A}}^{(1)}
\end{equation*}
where $\mathbf{U} \in \mathbb{R}^{n \times d}$ is a matrix of the first $d$ eigenvectors and $\Sigma \in \mathbb{R}^{d \times d}$ is a diagonal matrix with the first $d$ eigenvalues.

Then we estimate $\mathbf{A}^{(2)}$ as
\begin{equation*}
\hat{\mathbf{A}}^{(2)} = \mathbf{U}^{(1)} \boldsymbol{\Sigma }^{(2)} \mathbf{U}^{(1)T} \hspace{1cm} \text{where} \hspace{1cm} \boldsymbol{\Sigma}^{(2)} = \mathbf{U}^{(1)T} \mathbf{A}^{(2)} \mathbf{U}^{(1)}
\end{equation*}

As opposed to UASE, Clarity examines change between $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ by a quantity called persistence. These are defined as
\begin{equation*}
\mathbf{P}_i = \sum_{j =1}^{n}\left( \mathbf{A}_{ij}^{(2)} -\hat{\mathbf{A}}_{ij}^{(2)} \right)
\end{equation*}

The intuition here is that the persistences will capture structure in $\mathbf{A}^{(2)}$ that is not present in or explained by $\mathbf{A}^{(1)}$.

Returning to our example problem, we can see heatmaps of $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$ alongside their Clarity estimates:

Looking at the figure above we can see that the Clarity estimate of ${\mathbf{A}^{(2)}}$ does not capture the third cluster that appears in the second graph and therefore should identify these nodes as anomalies.

## Comparison

We can use receiver operating characteristic (ROC) curves to assess the success of our two methods. Given a score (in our case either the distance between latent positions or persistences) it plots the false positive rate against the true positive rate for a sequence of thresh-holds. We can see the ROCs below for $d = 2,3,4,5,6$

We can see that in lower dimensions UASE outperforms Clarity, but the performance degrades over time. This becomes a common problem in real world applications where the best choice for $d$ is unknown. Clarity on the other hand, does not have the same power as UASE but is more robust to dimension. Another difference between the two methods is that by allowing changes in relationship in the model, it is designed to cope with the entire graph changing a little bit.

# Conclusion

We have now introduced two methods for identifying change and compared their performance in a simple example. One method produces stronger results overall but is much more sensitive to the choice of dimension than the other. My current research looks to investigate why Clarity succeeds in this area when many other methods fail, with the ultimate goal of using this knowledge to modify more powerful methods to also have this feature.

[1] Jones, A., & Rubin-Delanchy, P. (2020). The multilayer random dot product graph. arXiv preprint arXiv:2007.10455.

[2] Lawson, D. J., Solanki, V., Yanovich, I., Dellert, J., Ruck, D., & Endicott, P. (2021). CLARITY: comparing heterogeneous data using dissimilarity. Royal Society Open Science, 8(12), 202182.

[3] Wikipedia contributors. (2024, June 11). Singular value decomposition. In Wikipedia, The Free Encyclopedia. Retrieved 09:54, July 1, 2024, from https://en.wikipedia.org/w/index.php?title=Singular_value_decomposition&oldid=1228566091

[4] Gallagher, I., Jones, A., & Rubin-Delanchy, P. (2021). Spectral embedding for dynamic networks with stability guarantees. Advances in Neural Information Processing Systems, 34, 10158-10170.

21st June 2024
14:00 - 16:00

## Student Perspectives: Semantic Search

A post by Ben Anson, PhD student on the Compass programme.

## Semantic Search

Semantic search is here. We already see it in use in search engines [13], but what is it exactly and how does it work?

Search is about retrieving information from a corpus, based on some query. You are probably using search technology all the time, maybe $\verb|ctrl+f|$, or searching on google. Historically, keyword search, which works by comparing the occurrences of keywords between queries and documents in the corpus, has been surprisingly effective. Unfortunately, keywords are inherently restrictive – if you don’t know the right one to use then you are stuck.

Semantic search is about giving search a different interface. Semantic search queries are provided in the most natural interface for humans: natural language. A semantic search algorithm will ideally be able to point you to a relevant result, even if you only provided the gist of your desires, and even if you didn’t provide relevant keywords.

Figure 1 illustrates a concrete example where semantic search might be desirable. The query ‘animal’ should return both the dog and cat documents, but because the keyword ‘animal’ is not present in the cat document, the keyword model fails. In other words, keyword search is susceptible to false negatives.

Transformer neural networks turn out to be very effective for semantic search [1,2,3,10]. In this blog post, I hope to elucidate how transformers are tuned for semantic search, and will briefly touch on extensions and scaling.

### The search problem, more formally

Suppose we have a big corpus $\mathcal{D}$ of documents (e.g. every document on wikipedia). A user sends us a query $q$, and we want to point them to the most relevant document $d^*$. If we denote the relevance of a document $d$ to $q$ as $\text{score}(q, d)$, the top search result should simply be the document with the highest score,

$$d^* = \mathrm{argmax}_{d\in\mathcal{D}}\, \text{score}(q, d).$$

This framework is simple and it generalizes. For $\verb|ctrl+f|$, let $\mathcal{D}$ be the set of individual words in a file, and $\text{score}(q, d) = 1$ if $q=d$ and $0$ otherwise. The venerable keyword search algorithm BM25 [4], which was state of the art for decades [8], uses this score function.

For semantic search, the score function is often set as the inner product between query and document embeddings: $\text{score}(q, d) = \langle \phi(q), \phi(d) \rangle$. Assuming this score function actually works well for finding relevant documents, and we use a simple inner product, it is clear that the secret sauce lies in the embedding function $\phi$.

### Transformer embeddings

We said above that a common score function for semantic search is $\text{score}(q, d) = \langle \phi(q), \phi(d) \rangle$. This raises two questions:

• Question 1: what should the inner product be? For semantic search, people tend to use the cosine similarity for their inner product.
•  Question 2: what should $\phi$ be? The secret sauce is to use a transformer encoder, which is explained below.

#### Quick version

Transformers magically gives us a tunable embedding function $\phi: \text{“set of all pieces of text”} \rightarrow \mathbb{R}^{d_{\text{model}}}$, where $d_{\text{model}}$ is the embedding dimension.

#### More detailed version

See Figure 2 for an illustration of how a transformer encoder calculates an embedding for a piece of text. In the figure we show how to encode “cat”, but we can encode arbitrary pieces of text in a similar way. The transformer block details are out of scope here; though, for these details I personally found Attention is All You Need [9] helpful, the crucial part being the Multi-Head Attention which allows modelling dependencies between words.

The transformer encoder is very flexible, with almost every component parameterized by a learnable weight / bias – this is why it can be used to model the complicated semantics in natural language. The pooling step in Figure 2, where we map our sequence embedding $X’$ to a fixed size, is not part of a ‘regular’ transformer, but it is essential for us. It ensures that our score function $\langle \phi(q), \phi(d) \rangle$ will work when $q$ and $d$ have different sizes.

### Making the score function good for search

There is a massive issue with transformer embedding as described above, at least for our purposes – there is no reason to believe it will satisfy simple semantic properties, such as,

$\text{score}(\text{“busy place”}, \text{“tokyo”}) > \text{score}(\text{“busy place”}, \text{“a small village”})$

‘But why would the above not work?’ Because, of course, transformers are typically trained to predict the next token in a sequence, not to differentiate pieces of text according to their semantics.

The solution to this problem is not to eschew transformer embeddings, but rather to fine-tune them for search. The idea is to encourage the transformer to give us embeddings that place semantically dissimilar items far apart. E.g. let $q=$’busy place’, then we want $d^+=$’tokyo’ to be close to $q$ and $d^-=$’a small village’ to be far away.

This semantic separation can be achieved by fine-tuning with a contrastive loss [1,2,3,10],

$$\text{maximize}_{\theta}\,\mathcal{L} = \log \frac{\exp(\text{score}(q, d^+))}{\exp(\text{score}(q, d^+)) + \exp(\text{score}(q, d^-))},$$

where $\theta$ represents the transformer parameters. The $\exp$’s in the contastive loss are to ensure we never divide by zero. Note that we can interpret the contrastive loss as doing classification since we can think of the argument to the logarithm as $p(d^+ | q)$.

That’s all we need, in principle, to turn a transformer encoder into a text embedder! In practice, the contrastive loss can be generalized to include more positive and negative examples, and it is indeed a good idea to have a large batch size [11] (intuitively it makes the separation of positive and negative examples more difficult, resulting in a better classifier). We also need a fine-tuning dataset – a dataset of positive/negative examples. OpenAI showed that it is possible to construct one in an unsupervised fashion [1]. However, there are also publicly available datasets for supervised fine-tuning, e.g. MSMARCO [12].

### Extensions

One really interesting avenue of research is training of general purposes encoders. The idea is to provide instructions alongside the queries/documents [2,3]. The instruction could be $\verb|Embed this document for search: {document}|$ (for the application we’ve been discussing), or $\verb|Embed this document for clustering: {document}|$ to get embeddings suitable for clustering, or $\verb|Embed this document for sentiment analysis: {document}|$ for embeddings suitable for sentiment analysis. The system is fine-tuned end-to-end with the appropriate task, e.g. a contrastive learning objective for the search instruction, a classification objective for sentiment analysis, etc., leaving us with an easy way to generate embeddings for different tasks.

### A note on scaling

The real power of semantic (and keyword) search comes when a search corpus is too large for a human to search manually. However if the corpus is enormous, we’d rather avoid looking at every document each time we get a query. Thankfully, there are methods to avoid this by using specially tailored data structures: see Inverted Indices for keyword algorithms, and Hierarchical Navigable Small World graphs [5] for semantic algorithms. These both reduce search time complexity from $\mathcal{O}(|\mathcal{D}|)$ to $\mathcal{O}(\log |\mathcal{D}|)$, where $|\mathcal{D}|$ is the corpus size.

There are many startups (Pinecone, Weviate, Milvus, Chroma, etc.) that are proposing so-called vector databases – databases in which embeddings are stored, and queries can be efficiently performed. Though, there is also work contesting the need for these types of database in the first place [7].

### Summary

We summarised search, semantic search, and how transformers are fine-tuned for search with a contrastive loss. I personally find this a very nice area of research with exciting real-world applications – please reach out (ben.anson@bristol.ac.uk) if you’d like to discuss it!

### References

[1]: Text and code embeddings by contrastive pre-training, Neelakantan et al (2022)

[2]: Task-aware Retrieval with Instructions, Asai et al (2022)

[3]: One embedder, any task: Instruction-finetuned text embeddings, Su et al (2022)

[4]: Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval, Robertson and Walker (1994)

[5]: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs, https://arxiv.org/abs/1603.09320

[6]: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, https://arxiv.org/abs/2010.11929

[7]: Vector Search with OpenAI Embeddings: Lucene Is All You Need, arXiv preprint arXiv:2308.14963

[8]: Complement lexical retrieval model with semantic residual embeddings, Advances in Information Retrieval (2021)

[9]: Attention is all you need, Advances in neural information processing systems (2017)

[10]: Sgpt: Gpt sentence embeddings for semantic search, arXiv preprint arXiv:2202.08904

[11]: Contrastive representation learning: A framework and review, IEEE Access 8 (2020)

[12]: Ms marco: A human generated machine reading comprehension dataset, arXiv preprint arXiv:1611.09268

## Student Perspectives: Impurity Identification in Oligonucleotide Drug Samples

A post by Harry Tata, PhD student on the Compass programme.

# Oligonucleotides in Medicine

Oligonucleotide therapies are at the forefront of modern pharmaceutical research and development, with recent years seeing major advances in treatments for a variety of conditions. Oligonucleotide drugs for Duchenne muscular dystrophy (FDA approved) [1], Huntington’s disease (Phase 3 clinical trials) [2], and Alzheimer’s disease [3] and amyotrophic lateral sclerosis (early-phase clinical trials) [4] show their potential for tackling debilitating and otherwise hard-to-treat conditions. With continuing development of synthetic oligonucleotides, analytical techniques such as mass spectrometry must be tailored to these molecules and keep pace with the field.

Working in conjunction with AstraZeneca, this project aims to advance methods for impurity detection and quantification in synthetic oligonucleotide mass spectra. In this blog post we apply a regularised version of the Richardson-Lucy algorithm, an established technique for image deconvolution, to oligonucleotide mass spectrometry data. This allows us to attribute signals in the data to specific molecular fragments, and therefore to detect impurities in oligonucleotide synthesis.

# Oligonucleotide Fragmentation

If we have attempted to synthesise an oligonucleotide $\mathcal O$ with a particular sequence, we can take a sample from this synthesis and analyse it via mass spectrometry. In this process, molecules in the sample are first fragmented — broken apart into ions — and these charged fragments are then passed through an electromagnetic field. The trajectory of each fragment through this field depends on its mass/charge ratio (m/z), so measuring these trajectories (e.g. by measuring time of flight before hitting some detector) allows us to calculate the m/z of fragments in the sample. This gives us a discrete mass spectrum: counts of detected fragments (intensity) across a range of m/z bins [5].

To get an idea of how much of $\mathcal O$ is in a sample, and what impurities might be present, we first need to consider what fragments $\mathcal O$ will produce. Oligonucleotides are short strands of DNA or RNA; polymers with a backbone of sugars (such as ribose in RNA) connected by linkers (e.g. a phosphodiester bond), where each sugar has an attached base which encodes genetic information [6].

On each monomer, there are two sites where fragmentation is likely to occur: at the linker (backbone cleavage) or between the base and sugar (base loss). Specifically, depending on which bond within the linker is broken, there are four modes of backbone cleavage [7,8].
We include in $\mathcal F$ every product of a single fragmentation of $\mathcal O$ — any of the four backbone cleavage modes or base loss anywhere along the nucleotide — as well as the results of every combination of two fragmentations (different cleavage modes at the same linker are mutually exclusive).

# Sparse Richardson-Lucy Algorithm

Suppose we have a chemical sample which we have fragmented and analysed by mass spectrometry. This gives us a spectrum across n bins (each bin corresponding to a small m/z range), and we represent this spectrum with the column vector $\mathbf{b}\in\mathbb R^n$, where $b_i$ is the intensity in the $i^{th}$ bin. For a set $\{f_1,\ldots,f_m\}=\mathcal F$ of possible fragments, let $x_j$ be the amount of $f_j$ that is actually present. We would like to estimate the amounts of each fragment based on the spectrum $\mathbf b$.

If we had a sample comprising a unit amount of a single fragment $f_j$, so $x_j=1$ and $x_{k\ne j}=0,$ and this produced a spectrum $\begin{pmatrix}a_{1j}&\ldots&a_{nj}\end{pmatrix}^T$, we can say the intensity contributed to bin $i$ by $x_j$ is $a_{ij}.$ In mass spectrometry, the intensity in a single bin due to a single fragment is linear in the amount of that fragment, and the intensities in a single bin due to different fragments are additive, so in some general spectrum we have $b_i=\sum_j x_ja_{ij}.$

By constructing a library matrix $\mathbf{A}\in\mathbb R^{n\times m}$ such that $\{\mathbf A\}_{ij}=a_{ij}$ (so the columns of $\mathbf A$ correspond to fragments in $\mathcal F$), then in ideal conditions the vector of fragment amounts $\mathbf x=\begin{pmatrix}x_1&\ldots&x_m\end{pmatrix}^T$ solves $\mathbf{Ax}=\mathbf{b}$. In practice this exact solution is not found — due to experimental noise and potentially because there are contaminant fragments in the sample not included in $\mathcal F$ — and we instead make an estimate $\mathbf {\hat x}$ for which $\mathbf{A\hat x}$ is close to $\mathbf b$.

Note that the columns of $\mathbf A$ correspond to fragments in $\mathcal F$: the values in a single column represent intensities in each bin due to a single fragment only. We $\ell_1$-normalise these columns, meaning the total intensity (over all bins) of each fragment in the library matrix is uniform, and so the values in $\mathbf{\hat x}$ can be directly interpreted as relative abundances of each fragment.

The observed intensities — as counts of fragments incident on each bin — are realisations of latent Poisson random variables. Assuming these variables are i.i.d., it can be shown that the estimate of $\mathbf{x}$ which maximises the likelihood of the system is approximated by the iterative formula

$\mathbf {\hat{x}}^{(t+1)}=\left(\mathbf A^T \frac{\mathbf b}{\mathbf{A\hat x}^{(t)}}\right)\odot \mathbf{\hat x}^{(t)}.$

Here, quotients and the operator $\odot$ represent (respectively) elementwise division and multiplication of two vectors. This is known as the Richardson-Lucy algorithm [9].

In practice, when we enumerate oligonucleotide fragments to include in $\mathcal F$, most of these fragments will not actually be produced when the oligonucleotide passes through a mass spectrometer; there is a large space of possible fragments and (beyond knowing what the general fragmentation sites are) no well-established theory allowing us to predict, for a new oligonucleotide, which fragments will be abundant or negligible. This means we seek a sparse estimate, where most fragment abundances are zero.

The Richardson-Lucy algorithm, as a maximum likelihood estimate for Poisson variables, is analagous to ordinary least squares regression for Gaussian variables. Likewise lasso regression — a regularised least squares regression which favours sparse estimates, interpretable as a maximum a posteriori estimate with Laplace priors — has an analogue in the sparse Richardson-Lucy algorithm:

$\mathbf {\hat{x}}^{(t+1)}=\left(\mathbf A^T \frac{\mathbf b}{\mathbf{A\hat x}^{(t)}}\right)\odot \frac{ \mathbf{\hat x}^{(t)}}{\mathbf 1 + \lambda},$

where $\lambda$ is a regularisation parameter [10].

# Library Generation

For each oligonucleotide fragment $f\in\mathcal F$, we smooth and bin the m/z values of the most abundant isotopes of $f$, and store these values in the columns of $\mathbf A$. However, if these are the only fragments in $\mathcal F$ then impurities will not be identified: the sparse Richardson-Lucy algorithm will try to fit oligonucleotide fragments to every peak in the spectrum, even ones that correspond to fragments not from the target oligonucleotide. Therefore we also include ‘dummy’ fragments corresponding to single peaks in the spectrum — the method will fit these to non-oligonucleotide peaks, showing the locations of any impurities.

# Results

For a mass spectrum from a sample containing a synthetic oligonucleotide, we generated a library of oligonucleotide and dummy fragments as described above, and applied the sparse Richardson-Lucy algorithm. Below, the model fit is plotted alongside the (smoothed, binned) spectrum and the ten most abundant fragments as estimated by the model. These fragments are represented as bars with binned m/z at the peak fragment intensity, and are separated into oligonucleotide fragments and dummy fragments indicating possible impurities. All intensities and abundances are Anscombe transformed ($x\rightarrow\sqrt{x+3/8}$) for clarity.

As the oligonucleotide in question is proprietary, its specific composition and fragmentation is not mentioned here, and the bins plotted have been transformed (without changing the shape of the data) so that individual fragment m/z values are not identifiable.

We see the data is fit extremely closely, and that the spectrum is quite clean: there is one very pronounced peak roughly in the middle of the m/z range. This peak corresponds to one of the oligonucleotide fragments in the library, although there is also an abundant dummy fragment slightly to the left inside the main peak. Fragment intensities in the library matrix are smoothed, and it may be the case that the smoothing here is inappropriate for the observed peak, hence other fragments being fit at the peak edge. Investigating these effects is a target for the rest of the project.

We also see several smaller peaks, most of which are modelled with oligonucleotide fragments. One of these peaks, at approximately bin 5352, has a noticeably worse fit if excluding dummy fragments from the library matrix (see below). Using dummy fragments improves this fit and indicates a possible impurity. Going forward, understanding and quantification of these impurities will be improved by including other common fragments in the library matrix, and by grouping fragments which correspond to the same molecules.

# References

[1] Junetsu Igarashi, Yasuharu Niwa, and Daisuke Sugiyama. “Research and Development of Oligonucleotide Therapeutics in Japan for Rare Diseases”. In: Future Rare Diseases 2.1 (Mar. 2022), FRD19.

[2] Karishma Dhuri et al. “Antisense Oligonucleotides: An Emerging Area in Drug Discovery and Development”. In: Journal of Clinical Medicine 9.6 (6 June 2020), p. 2004.

[3] Catherine J. Mummery et al. “Tau-Targeting Antisense Oligonucleotide MAPTRx in Mild Alzheimer’s Disease: A Phase 1b, Randomized, Placebo-Controlled Trial”. In: Nature Medicine (Apr. 24, 2023), pp. 1–11.

[4] Benjamin D. Boros et al. “Antisense Oligonucleotides for the Study and Treatment of ALS”. In: Neurotherapeutics: The Journal of the American Society for Experimental NeuroTherapeutics 19.4 (July 2022), pp. 1145–1158.

[5] Ingvar Eidhammer et al. Computational Methods for Mass Spectrometry Proteomics. John Wiley & Sons, Feb. 28, 2008. 299 pp.

[6] Harri Lönnberg. Chemistry of Nucleic Acids. De Gruyter, Aug. 10, 2020.

[7] S. A. McLuckey, G. J. Van Berkel, and G. L. Glish. “Tandem Mass Spectrometry of Small, Multiply Charged Oligonucleotides”. In: Journal of the American Society for Mass Spectrometry 3.1 (Jan. 1992), pp. 60–70.

[8] Scott A. McLuckey and Sohrab Habibi-Goudarzi. “Decompositions of Multiply Charged Oligonucleotide Anions”. In: Journal of the American Chemical Society 115.25 (Dec. 1, 1993), pp. 12085–12095.

[9] Mario Bertero, Patrizia Boccacci, and Valeria Ruggiero. Inverse Imaging with Poisson Data: From Cells to Galaxies. IOP Publishing, Dec. 1, 2018.

[10] Elad Shaked, Sudipto Dolui, and Oleg V. Michailovich. “Regularized Richardson-Lucy Algorithm for Reconstruction of Poissonian Medical Images”. In: 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Mar. 2011, pp. 1754–1757.

## Student Perspectives: Intro to Recommendation Systems

A post by Hannah Sansford, PhD student on the Compass programme.

# Introduction

Like many others, I interact with recommendation systems on a daily basis; from which toaster to buy on Amazon, to which hotel to book on booking.com, to which song to add to a playlist on Spotify. They are everywhere. But what is really going on behind the scenes?

Recommendation systems broadly fit into two main categories:

1) Content-based filtering. This approach uses the similarity between items to recommend items similar to what the user already likes. For instance, if Ed watches two hair tutorial videos, the system can recommend more hair tutorials to Ed.

2) Collaborative filtering. This approach uses the the similarity between users’ past behaviour to provide recommendations. So, if Ed has watched similar videos to Ben in the past, and Ben likes a cute cat video, then the system can recommend the cute cat video to Ed (even if Ed hasn’t seen any cute cat videos).

Both systems aim to map each item and each user to an embedding vector in a common low-dimensional embedding space $E = \mathbb{R}^d$. That is, the dimension of the embeddings ($d$) is much smaller than the number of items or users. The hope is that the position of these embeddings captures some of the latent (hidden) structure of the items/users, and so similar items end up ‘close together’ in the embedding space. What is meant by being ‘close’ may be specified by some similarity measure.

# Collaborative filtering

In this blog post we will focus on the collaborative filtering system. We can break it down further depending on the type of data we have:

1) Explicit feedback data: aims to model relationships using explicit data such as user-item (numerical) ratings.

2) Implicit feedback data: analyses relationships using implicit signals such as clicks, page views, purchases, or music streaming play counts. This approach makes the assumption that: if a user listens to a song, for example, they must like it.

The majority of the data on the web comes from implicit feedback data, hence there is a strong demand for recommendation systems that take this form of data as input. Furthermore, this form of data can be collected at a much larger scale and without the need for users to provide any extra input. The rest of this blog post will assume we are working with implicit feedback data.

## Problem Setup

Suppose we have a group of $n$ users $U = (u_1, \ldots, u_n)$ and a group of $m$ items $I = (i_1, \ldots, i_m)$. Then we let $\mathbf{R} \in \mathbb{R}^{n \times m}$ be the ratings matrix where position $R_{ui}$ represents whether user $u$ interacts with item $i$. Note that, in most cases the matrix $\mathbf{R}$ is very sparse, since most users only interact with a small subset of the full item set $I$. For any items $i$ that user $u$ does not interact with, we set $R_{ui}$ equal to zero. To be clear, a value of zero does not imply the user does not like the item, but that they have not interacted with it. The final goal of the recommendation system is to find the best recommendations for each user of items they have not yet interacted with.

## Matrix Factorisation (MF)

A simple model for finding user emdeddings, $\mathbf{X} \in \mathbb{R}^{n \times d}$, and item embeddings, $\mathbf{Y} \in \mathbb{R}^{m \times d}$, is Matrix Factorisation. The idea is to find low-rank embeddings such that the product $\mathbf{XY}^\top$ is a good approximation to the ratings matrix $\mathbf{R}$ by minimising some loss function on the known ratings.

A natural loss function to use would be the squared loss, i.e.

$L(\mathbf{X}, \mathbf{Y}) = \sum_{u, i} \left(R_{ui} - \langle X_u, Y_i \rangle \right)^2.$

This corresponds to minimising the Frobenius distance between $\mathbf{R}$ and its approximation $\mathbf{XY}^\top$, and can be solved easily using the singular value decomposition $\mathbf{R} = \mathbf{U S V}^\top$.

Once we have our embeddings $\mathbf{X}$ and $\mathbf{Y}$, we can look at the row of $\mathbf{XY}^\top$ corresponding to user $u$ and recommend the items corresponding to the highest values (that they haven’t already interacted with).

## Logistic MF

Minimising the loss function in the previous section is equivalent to modelling the probability that user $u$ interacts with item $i$ as the inner product $\langle X_u, Y_i \rangle$, i.e.

$R_{ui} \sim \text{Bernoulli}(\langle X_u, Y_i \rangle),$

and maximising the likelihood over $\mathbf{X}$ and $\mathbf{Y}$.

In a research paper from Spotify [3], this relationship is instead modelled according to a logistic function parameterised by the sum of the inner product above and user and item bias terms, $\beta_u$ and $\beta_i$,

$R_{ui} \sim \text{Bernoulli} \left( \frac{\exp(\langle X_u, Y_i \rangle + \beta_u + \beta_i)}{1 + \exp(\langle X_u, Y_i \rangle + \beta_u + \beta_i)} \right).$

## Relation to my research

A recent influential paper [1] proved an impossibility result for modelling certain properties of networks using a low-dimensional inner product model. In my 2023 AISTATS publication [2] we show that using a kernel, such as the logistic one in the previous section, to model probabilities we can capture these properties with embeddings lying on a low-dimensional manifold embedded in infinite-dimensional space. This has various implications, and could explain part of the success of Spotify’s logistic kernel in producing good recommendations.

### References

[1] Seshadhri, C., Sharma, A., Stolman, A., and Goel, A. (2020). The impossibility of low-rank representations for triangle-rich complex networks. Proceedings of the National Academy of Sciences, 117(11):5631–5637.

[2] Sansford, H., Modell, A., Whiteley, N., and Rubin-Delanchy, P. (2023). Implications of sparsity and high triangle density for graph representation learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:5449-5473.

[3] Johnson, C. C. (2014). Logistic matrix factorization for implicit feedback data. Advances in Neural Information Processing Systems, 27(78):1–9.

## Compass students attending the Workshop on Functional Inference and Machine Intelligence (FIMI) at ISM Tokyo

A post by Compass CDT students Edward Milsom, Jake Spiteri, Jack Simons, and Sam Stockman.

### We (Edward Milsom, Jake Spiteri, Jack Simons, Sam Stockman) attended the 2023 Workshop on Functional Inference and Machine Intelligence (FIMI) taking place on the 14, 15 and 16th of March at the Institute of Statistical Mathematics in Tokyo, Japan. Our attendance to the workshop was to further collaborative ties between the two institutions. The in-person participants included many distinguished academics from around Japan as well as our very own Dr Song Liu. Due to the workshops modest size, there was an intimate atmosphere which nurtured many productive research discussions. Whilst staying in Tokyo, we inevitably sampled some Japanese culture, from Izakayas to cherry blossoms and sumo wrestling!

We thought we’d share some of our thoughts and experiences. We’ll first go through some of our most memorable talks, and then talk about some of our activities outside the workshop.

# Talks

## Sho Sonoda – Ridgelet Transforms for Neural Networks on Manifolds and Hilbert Spaces

We particularly enjoyed the talk given by Sho Sonoda, a Research Scientist from the Deep Learning Theory group at Riken AIP on “Ridgelet Transforms for Neural Networks on Manifolds and Hilbert Spaces.” Sonoda’s research aims to demystify the black box nature of neural networks, shedding light on how they work and their universal approximation capabilities. His talk provided valuable insights into the integral representations of neural networks, and how they can be represented using ridgelet transforms. Sonoda presented a reconstruction formula from which we see that if a neural network can be represented using ridgelet transforms, then it is a universal approximator. He went on to demonstrate that various types of networks, such as those on finite fields, group convolutional neural networks (GCNNs), and networks on manifolds and Hilbert spaces, can be represented in this manner and are thus universal approximators. Sonoda’s work improves upon existing universality theorems by providing a more unified and direct approach, as opposed to the previous case-by-case methods that relied on manual adjustments of network parameters or indirect conversions of (G)CNNs into other universal approximators, such as invariant polynomials and fully-connected networks. Sonoda’s work is an important step toward a more transparent and comprehensive understanding of neural networks.

## Greg Yang – The unreasonable effectiveness of mathematics in large scale deep learning

Greg Yang is a researcher at Microsoft Research who is working on a framework for understanding neural networks called “tensor programs”. Similar to Neural Tangent Kernels and Neural Network Gaussian Processes, the tensor program framework allows us to consider neural networks in the infinite-width limit, where it becomes possible to make statements about the properties of very wide networks. However, tensor programs aim to unify existing work on infinite-width neural networks by allowing one to take the infinite limit of a much wider range of neural network architectures using one single framework.

In his talk, Yang discussed his most recent work in this area, concerning the “maximal update parametrisation”. In short, they show that in this parametrisation, the optimal hyperparameters of very wide neural networks are the same as those for much smaller neural networks. This means that hyperparameter search can be done using small, cheap models, and then applied to very large models like GPT-3, where hyperparameter search would be too expensive. The result is summarised in this figure from their paper “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer”, which shows how this is not possible in the standard parametrisation. This work was only possible by building upon the tensor program framework, thereby demonstrating the value of having a solid theoretical understanding of neural networks.

# Statistical Seismology Seminar Series

In addition to the workshop, Sam attended the 88th Statistical Seismology seminar in the Risk Analysis Research Centre at ISM https://www.ism.ac.jp/~ogata/Ssg/ssg_statsei_seminarsE.html. The Statistical Seismology Research Group at ISM was created by Emeritus Professor Yosihiko Ogata and is one of the leading global research institutes for statistical seismology. Its most significant output has been the Epidemic-Type Aftershock Sequence (ETAS) model, a point process based earthquake forecasting model that has been the most dominant model for forecasting since its creation by Ogata in 1988.

As part of the Seminar series, Sam gave a talk on his most recent work (Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a Neural Point Process’, https://arxiv.org/abs/2301.09948) to the research group and other visiting academics.

Japan’s interest is earthquake science is due to the fact that they record the most earthquakes in the world. The whole country is in a very active seismic area, and they have the densest seismic network. So even though they might not actually have the most earthquakes in the world (which is most likely Indonesia) they certainly document the most. The evening before flying back to the UK, Sam and Jack felt a magnitude 5.2 earthquake 300km north of Tokyo in the Miyagi prefecture. At that distance all that was felt was a small shudder…

# Japan

It’s safe to say that the abundance of delicious food was the most memorable aspect of our trip. In fact, we never had a bad meal! Our taste buds were taken on a culinary journey as we tried a variety of Japanese dishes. From hearty, broth-based bowls of ramen and tsukemen, to fun conveyor-belt sushi restaurants, and satisfying tonkatsu (breaded deep-fried pork cutlet) with sticky rice or spicy udon noodles, we were never at a loss for delicious options. We even had the opportunity to cook our own food at an indoor barbecue!

Aside from the food, we thoroughly enjoyed our time in Tokyo – exploring the array of second-hand clothes shops, relaxing in bath-houses, and trying random things from the abundance of vending machines.

24th May 2023
11:00 - 14:00

## Student Perspectives: An Introduction to Deep Kernel Machines

A post by Edward Milsom, PhD student on the Compass programme.

This blog post provides a simple introduction to Deep Kernel Machines[1] (DKMs), a novel supervised learning method that combines the advantages of both deep learning and kernel methods. This work provides the foundation of my current research on convolutional DKMs, which is supervised by Dr Laurence Aitchison.

# Why aren’t kernels cool anymore?

Kernel methods were once top-dog in machine learning due to their ability to implicitly map data to complicated feature spaces, where the problem usually becomes simpler, without ever explicitly computing the transformation. However, in the past decade deep learning has become the new king for complicated tasks like computer vision and natural language processing.

## Neural networks are flexible when learning representations

The reason is twofold: First, neural networks have millions of tunable parameters that allow them to learn their feature mappings automatically from the data, which is crucial for domains like images which are too complex for us to specify good, useful features by hand. Second, their layer-wise structure means these mappings can be built up to increasingly more abstract representations, while each layer itself is relatively simple[2]. For example, trying to learn a single function that takes in pixels from pictures of animals and outputs their species is difficult; it is easier to map pixels to corners and edges, then shapes, then body parts, and so on.

## Kernel methods are rigid when learning representations

It is therefore notable that classical kernel methods lack these characteristics: most kernels have a very small number of tunable hyperparameters, meaning their mappings cannot flexibly adapt to the task at hand, leaving us stuck with a feature space that, while complex, might be ill-suited to our problem. (more…)