A post by Ettore Fincato, PhD student on the Compass programme.

This post provides an introduction to Gradient Methods in Stochastic Optimisation. This class of algorithms is the foundation of my current research work with Prof. Christophe Andrieu and Dr. Mathieu Gerber, and finds applications in a great variety of topics, such as regression estimation, support vector machines, convolutional neural networks.

We can see below a simulation by Emilien Dupont (https://emiliendupont.github.io/) which represents two trajectories of an optimisation process of a time-varying function. This well describes the main idea behind the algorithms we will be looking at, that is, using the (stochastic) gradient of a (random) function to iteratively reach the optimum.

Stochastic Optimisation

Stochastic optimisation was introduced by [1], and its aim is to find a scheme for solving equations of the form $\nabla_w g(w)=0$ given “noisy” measurements of $g$ [2].

In the simplest deterministic framework, one can fully determine the analytical form of $g(w)$ , knows that it is differentiable and admits an unique minimum – hence the problem

$w_*=\underset{w}{\text{argmin}}\quad g(w)$

is well defined and solved by $\nabla_w g(w)=0$ .

On the other hand, one may not be able to fully determine $g(w)$ because his experiment is corrupted by a random noise. In such cases, it is common to identify this noise with a random variable, say $V$ , consider an unbiased estimator $\eta(w,V)$ s.t. $\mathbb{E}_V[\eta(w,V)]=g(w)$ and to rewrite the problem as

$w_*=\underset{w}{\text{argmin}}\quad\mathbb{E}_V[\eta(w,V)].$

Example

Consider a repeated experiment which gives pairs of input-output iid observations $(X_n,Y_n)$ , $n\geq 1$ with $X_n\in \mathbb{R}^d, Y_n \in \mathbb{R}$ . Assume that $Y_n$ depends on $X_n$ in the following way:

$Y_n=w_0+f_w(X_n)+\epsilon_n$

where $\epsilon_n$ is an error term, $w_0\in \mathbb{R}$ is an intercept and $w \in \mathbb{R}^d$ . This is for example a classical set-up of a Linear Regression procedure, where one generally assumes $f_w(X_n)=\langle w, X_n \rangle$ . More generally, $f_w$ could be a polynomial, a spline, a neural network.

In order to find the best fit, one generally assumes that $f$ is identified by the choice of $w$ , that is $f_w\neq f_{w'}$ if $w\neq w'$ , and investigates the Mean Squared Error problem

$\underset{w}{\text{argmin}}\quad g(w)$

where

$g(w):= \frac{1}{2}\mathbb{E}[||Y_n-f_w(X_n)||^2]$ .

In general, we do not know the true distribution of $(X_n,Y_n)$ and we only have access to this sample, hence we can only evaluate $g$ up to noise. This is a classical stochastic optimisation problem, which in several cases can be tackled by a recursive algorithm such as Stochastic Gradient Descent.

Stochastic Gradient Descent and Dynamics

The Stochastic Gradient Descent algorithm is the core of stochastic gradient methods. It is based on the simple idea of iteratively following the direction suggested by the gradient at the $n$ -th iteration. It is interesting to look at it from a dynamical systems point of view, that is, as a noisy approximation of a certain ordinary differential equation (ODE).

Noisy approximation of ODEs

Suppose we have an ODE

$\dot{x}(t)= h(x(t))$

describing the dynamics of a system. A well known scheme to numerically approximate the solution of such equation is the Euler Scheme

$x_{n+1}=x_n+\gamma h(x_n)$

where $\gamma$ is the so-called time-step. If one only has noisy evaluations of $h$ , he can replace it by an unbiased estimator $\eta(x_n,V)$ , where $V$ is some random variable.

Hence, in practice, the algorithm generates, independently for each $n$ , a random variable $V_{n+1}$ , and computes

$x_{n+1}=x_n+\gamma \eta(x_n,V_{n+1})$ .

By unbiasedness of $\eta(x_n,V)$ , we can write

$\eta(x_n,V_{n+1})=h(x_n)+U_{n+1}$

where $(U_{n})_n$ is a sequence of zero-mean random variables uncorrelated with the past (a martingale difference sequence). Hence the recursion above is equivalent to

$x_{n+1}=x_n+\gamma (h(x_n)+U_{n+1})$ .

This last formulation is the most used in the literature.

A further extension of the scheme is recovered by introducing non-constant time-steps, say $(\gamma_n)_n$ with $\gamma_n \underset{n}{\to} 0$ , writing

$x_{n+1}=x_n+\gamma_n ( h(x_n)+U_{n+1})$ .

The above is a stochastic approximation scheme which differs from the Euler’s scheme in the fact that it involves a noise $U_n$ and time-varying steps $\gamma_n$ [2].

Ensuring a good exploration of the space

These time-decaying steps must satisfy some stability criteria.

The algorithm needs to explore the space for long enough: hence it is common to assume $\sum_n \gamma_n = \infty$ .
Moreover, one needs the error component $\gamma_n U_{n+1}$ to go to zero quickly enough: hence one generally assumes $\sum_n \gamma_n^2<\infty$ , which can be shown to be equivalent to $\sum_n \gamma_n U_{n+1}<\infty$ when $U_n$ are iid with zero mean and finite variance [2].

Stochastic Gradient Descent

Now come back to the example of finding the minimum of a function $g$ for which we only have noisy measurements. Let

$h(w):=-\nabla_w g(w)$

In this case the the above scheme becomes

$w_{n+1}=w_n-\gamma_n (\nabla_w g(w_n)-U_{n+1})$ (1)

where we have used $-\nabla_w g(w_n)+U_{n+1}=\eta(w_n, V_{n+1})$ for $\eta(w, V)$ unbiased estimator of $h(w)$ .

It is called Stochastic Gradient Descent (SGD), and represents a noisy discretisation of

$\dot{w}(t)= -\nabla_w g (w(t))$

which is a well known ODE and is known to converge to the optimum $w_*$ s.t. $\nabla_w g(w_*)=0$ when $g$ is convex, or regular enough [2]. Intuitively, the ODE says that the underlying dynamics of $w$ should follow the gradient of $g$ .

Stochastic gradient descent can be shown to converge if the target function is smooth enough. See e.g. [3][4].

SGD and Stochastic Differential Equations

One could also see the SGD recursion as a time-discretisation of a stochastic differential equation (SDE). Under regular behaviour of time-steps, and if we assume $U_n$ iid $N(0,\sigma^2)$ , SGD approximates the below Itô diffusion SDE

$d W_t=-\gamma_t\nabla g(W_t)dt+\sqrt{\gamma_t}\sigma dB_t$

where $B_t$ is a standard Brownian Motion. If familiar with diffusion processes, one can see SGD as a Brownian motion with gradient drift and diffusion part $\sqrt{\gamma_t}\sigma$ . See figure below for an example of what a simple SGD motion looks like.

Variations and extensions

We briefly present two important variations of SGD method which are often seen in the literature.

Projected Stochastic Gradient Descent

Suppose now we are solving a constrained optimisation problem, that is, let $C$ be a subspace of our original space, say $C\in \mathbb{R}$ , and let $g(w)$ be the function we want to optimise. The constrained problem is therefore

$w_*=\underset{w\in C}{\text{argmin}}\quad g(w)$ .

To solve this problem in a “SGD style”, one can project every SGD step on $C$ . With an euclidean projection, this is equivalent to the following recursion

$w_{n+1}=\underset{w\in C}{\text{argmin}}||(w_n-\gamma_n s(x_n,V_{n+1})-w||_2^2$

where $s$ is a so-called sub-gradient, and the easiest example for its choice is an unbiased estimator $\eta$ of $\nabla g$ , as previous sections. More generally, it is a function that should resemble certain fundamental algebraic properties of a gradient.

By easy algebra, we see that the above is equivalent to

$w_{n+1}=\underset{w\in C}{\text{argmin}} \{\langle w, s(w_{n}, V_{n+1}) \rangle+\frac{1}{\gamma_n}\frac{||w_n-w||_2^2}{2}\}$

Hence we see that an immediate extension of this would be to replace the euclidean distance $||w_n-w||_2^2$ with some more general definition of divergence. In the literature, this is done by introducing a so-called Bregman divergence $D_L(w_n,w)$ , defining the class Stochastic Mirror Descent algorithms.

Stochastic Mirror Descent

Miming an euclidean distance, for which it holds that

$||w_n-w||_2^2=||w_n||^2+||w||^2-2\langle w_n , w \rangle=||w_n||^2-||w||^2-2\langle w , w_n-w \rangle$

the Bregman divergence is defined as

$D_L(w_n,w):=L(w_n)-L(w)-\langle\nabla L(w),w_n-w \rangle$

where $h$ is a so-called Legendre function. Examples for $L$ are

Energy $L(w)=\frac{1}{2}||w||_2^2$ in which case we recover an euclidean projection
Boltzmann-Shannon Entropy $L(w)=w\log(w)$
Fermi-Dirac Entropy $L(w)=w\log(w)+(1-w)\log(1-w)$

Using a Bregman divergence, we therefore get a more general recursion

$w_{n+1}=\underset{w\in C}{\text{argmin}}\{\langle w, s(w_{n}, V_{n+1}) \rangle+\frac{1}{\gamma_n}D_L(w_n,w)\}$

which is called Stochastic Mirror Descent (SMD) recursion. Hence, a SMD move is a tradeoff between following the slope of the (sub) gradient, as in a SGD algorithm, and moving according to the Bregman divergence. This generalisation allows to tackle in a “SGD style” a great variety of problems, such as non-convex problems and quadratic inverse problems [5].

References

[1] Robbins, H. and Monro, S. (1951). A Stochastic Approximation Method. The Annals of
Mathematical Statistics, 22(3):400 – 407.

[2] Borkar, V. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint. Cam-
bridge University Press.

[3] Turinici, G. (2021). The convergence of the stochastic gradient descent (sgd) : a self-
contained proof. Technical report.

[4] Patel, V., B. Tian, and S. Zhang (2021). Global convergence and stability of stochastic
gradient descent. CoRR abs/2110.01663.

[5] Bolte, J., Sabach, S., Teboulle, M., and Vaisbourd, Y. (2018). First order methods bey-
ond convexity and lipschitz gradient continuity with applications to quadratic inverse
problems. SIAM Journal on Optimization, 28(3):2131–2151.