A post by Dylan Dijk, PhD student on the Compass programme.

# Introduction

My current project is looking to robustify the performance of time series models to heavy-tailed data. The models I have been focusing on are vector autoregressive (VAR) models, and additionally factor-adjusted VAR models. In this post I will not be covering the robust methodology, but will be introducing VAR models and providing the motivation for introducing the factor adjustment step when working with high-dimensional time series.

# Vector autoregressive models

In time series analysis the objective is often to forecast a future value given past data, for example, one of the classical models for univariate time series is the autoregressive AR(d) model:

\[X_t = a_1 X_{t-1} + \dots + a_d X_{t-d} + \epsilon_t \, .\]

However, in many cases, the value of a variable is influenced not just by its own past values but also by past values of other variables. For example, in Economics, household consumption expenditures may depend on variables such as income, interest rates, and investment expenditures, therefore we would want to include these variables in our model.

The VAR model [1] is simply the multivariate generalisation of the univariate autoregressive model, that is, for a $p$-dimensional stochastic process $(\dots, \mathbf{X}_t, \mathbf{X}_{t+1}, \dots) \in \mathbb{R}^p$ we model an observation at time $t$ as a linear combination of previous observations up to some lag $d$ plus an error:

\[\mathbf{X}_t = \mathbf{A}_1 \mathbf{X}_{t-1} + \dots + \mathbf{A}_d \mathbf{X}_{t-d} + \boldsymbol{\epsilon}_t \, ,\]

where $\mathbf{A}_i$ are $p \times p$ coefficient matrices. Therefore, in addition to modelling serial dependence, the model takes into account cross-sectional dependence. This model can then be used for forecasting, and as an explanatory model to describe the dynamic interrelationships between a number of variables.

## Estimation

Given a dataset of $n$ observations, $\{\mathbf{X}_1, \dots, \mathbf{X}_n \in \mathbb{R}^p\}$, we can aim to estimate the coefficient matrices. In order to do so, the model can be written in a stacked form:

\begin{align*} \underbrace{\left[\begin{array}{c}\left(\mathbf{X}_n\right)^{T} \\ \vdots \\ \left(\mathbf{X}_{d+1}\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{Y}}} & =\underbrace{\left[\begin{array}{ccc}\left(\mathbf{X}_{n-1}\right)^{T} & \cdots & \left(\mathbf{X}_{n-d}\right)^{T} \\ \vdots & \ddots & \vdots \\ \left(\mathbf{X}_{d}\right)^{T} & \cdots & \left(\mathbf{X}_1\right)^{T}\end{array}\right]}_{\boldsymbol{\mathcal{X}}} \underbrace{\left[\begin{array}{c}\boldsymbol{A}_1^{T} \\ \vdots \\ \boldsymbol{A}_d^{T}\end{array}\right]}_{\boldsymbol{A}^T}+\underbrace{\left[\begin{array}{c}\left(\boldsymbol{\epsilon}_n\right)^{T} \\ \vdots \\ \left(\boldsymbol{\epsilon}_d\right)^{T}\end{array}\right]}_{\boldsymbol{E}}

\end{align*}

and subsequently vectorised to return a standard univariate linear regression problem

\begin{align*}

\operatorname{vec}(\boldsymbol{\mathcal{Y}}) & =\operatorname{vec}\left(\boldsymbol{\mathcal{X}} \boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \\ & =(\textbf{I} \otimes \boldsymbol{\mathcal{X}}) \operatorname{vec}\left(\boldsymbol{A}^T\right)+\operatorname{vec}(\boldsymbol{E}), \label{eq:stacked_var_regression_form}\\ \underbrace{\boldsymbol{Y}}_{N p \times 1} & =\underbrace{\boldsymbol{Z}}_{N p \times q} \underbrace{\boldsymbol{\beta}^*}_{q \times 1}+\underbrace{\operatorname{vec}(\boldsymbol{E})}_{N p \times 1}, \quad N=(n-d), \quad q=d p^2.

\end{align*}

## Sparse VAR

There are $dp^2$ parameters to estimate in this model, and hence VAR estimation is naturally a high-dimensional statistical problem. Therefore, estimation methods and associated theory need to hold under high-dimensional scaling of the parameter dimension. Specifically, this means consistency is shown for when both $p$ and $n$ tend to infinity, as opposed to in classical statistics where $p$ is kept fixed.

The linear model in the high-dimensional setting is well understood [2]. To obtain a consistent estimator requires additional structural assumptions in the model, in particular, sparsity on the true vector $\boldsymbol\beta^*$. The common approach for estimation is lasso, which can be motivated from convex relaxation in the noiseless setting. Consistency of lasso is well studied [3][4], with consistency guaranteed under sparsity, and restrictions on the directions in which the hessian of the loss function is strictly positive.

The well known lasso objective is given by:

\begin{align*}

\underset{{\boldsymbol\beta \in \mathbb{R}^q}}{\text{argmin}} \, \|\boldsymbol{Y}-\boldsymbol{Z} \boldsymbol\beta\|_{2}^{2} + \lambda \|\boldsymbol\beta\|_1 \, ,

\end{align*}

and below, we give a simplified consistency result that can be obtained under certain assumptions.

We denote the sparsity of $\boldsymbol{A}$ by

$s_{0, j}=\left|\boldsymbol\beta^*_{(j)}\right|_0, s_0=\sum_{j=1}^p s_{0, j}$ and $s_{\text {in }}=\max _{1 \leq j \leq p} s_{0, j}$.

** Lasso consistency result **

Suppose

\begin{gather*}

\, s_{\text{in}} \leq C_1 \sqrt{\frac{n}{\log p}} \, \; \text{and } \; \lambda \geq C_2 (\|\boldsymbol{A}^T\|_{1,\infty} + 1)\sqrt{\frac{\log p}{n}} \; ,

\end{gather*}

then with high probability we have

\begin{align*}

|\widehat{\boldsymbol{A}} – \boldsymbol{A}|_2 \leq C_3 \sqrt{s_{0}} \lambda \quad \text{and} \quad |\widehat{\boldsymbol{A}} – \boldsymbol{A}|_1 \leq C_4 s_0 \lambda \, .

\end{align*}

What we mean here by consistency, is that as $n,p \rightarrow \infty$, the estimate $\widehat{\boldsymbol\beta}$ converges to $\boldsymbol\beta$ in probability. Where we think of $p$ as being a function of $n$, so the manner in which the dimension $p$ grows depends on the sample size. For example, in the result above, we can have consistency with $p = \exp(\sqrt{n})$.

The result indicates that for larger $p$ a more sparse solution, and a larger regularisation parameter is required. Similar results have been derived under various assumptions, for instance under a Gaussian VAR the result has been given in terms of the largest and smallest eigenvalues of the spectral density matrix of a series [5], and hence consistency requires that these quantities are bounded.

In summary, for lasso estimation to work we need $\boldsymbol{A}$ to be sufficiently sparse, and the largest eigenvalue of the spectral density matrix to be bounded. But are these reasonable assumptions to make?

First two leading eigenvalues of the spectral density matrix.

Heatmap of logged p-values for evidence of non-zero coefficients after fitting ridge regression model.

Well, intuitively, if a multivariate time series has strong cross-sectional dependence we would actually expect to have many non-zero entries in the VAR coefficients $\boldsymbol{A}_i$. The figures above, taken from [6], illustrate a real dataset in which there is statistical evidence for a non-sparse solution (heatmap), and that the leading eigenvalue of the spectral density matrix diverges linearly in $p$. Therefore providing an example in which two of the assumptions discussed above are unmet.

# Factor-adjusted VAR

The idea now is to assume that the covariance of the observed vector $\mathbf{X}_t$ is driven by a lower dimensional latent vector. For example, the figures above were generated from a dataset of stock prices of financial institutions, in this case an interpretation of a latent factor could be overall market movements which captures the broad market trend, or a factor that captures the change in interest rates.

\begin{align}

\mathbf{X}_t &= \underset{p \times r}{\boldsymbol\Lambda} \underset{r \times 1}{\mathbf{F}_t} + \boldsymbol\xi_t \quad

\end{align}

Consequently, first fitting a factor model would account for strong cross-sectional correlations, leaving the remaining process to exhibit the individual behaviour of each series. Fitting a sparse VAR process will now be a more reasonable choice.

In the formula above, $\mathbf{F}_t$ is the factor random vector, and $\boldsymbol\Lambda$ the constant loading matrix, which quantifies the sensitivity of each variable to the common factors, and we can model $\boldsymbol\xi_t$ as a sparse VAR process, as described in the preceding sections.

**References**

[1] Lütkepohl, H. (2005) *New introduction to multiple time series analysis.* Berlin: Springer-Verlag.

[2] Wainwright, M. (2019) *High-dimensional statistics: A non-asymptotic viewpoint – Chapter 7 – Sparse linear models in high dimensions*. Cambridge, United Kingdom: Cambridge University Press.

[3] Geer, Sara A. van de, and Peter Bühlmann. (2009) *On the Conditions Used to Prove Oracle Results for the Lasso*. Electronic Journal of Statistics. Project Euclid, https://doi.org/10.1214/09-EJS506.

[4] Bickel, Peter J., Ya’acov Ritov, and Alexandre B. Tsybakov. (2009) *Simultaneous Analysis of Lasso and Dantzig Selector*. The Annals of Statistics. https://doi.org/10.1214/08-AOS620.

[5] Sumanta Basu, George Michailidis. (2015) *Regularized estimation in sparse high-dimensional time series models*. The Annals of Statistics. https://doi.org/10.1214/15-AOS1315.

[6] Barigozzi, M., Cho, H. and Owens, D. (2024). *FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series*. Journal of Business & Economic Statistics.