Tuan Anh Le

Convergence of Gradient Descent

14 March 2018

The following draws from this paper and David Martinez Rubio’s master’s thesis.

Given a function \(f: \mathbb R^d \to \mathbb R\), we want to solve the optimization problem \begin{align} \argmin_{x \in \mathbb R^d} f(x) \label{eq:argmin} \end{align} by the gradient descent procedure \begin{align} x_{t + 1} = x_t - \alpha \nabla f(x_t), \qquad t = 0, 1, 2, \dotsc, \label{eq:gd} \end{align} for some learning rate \(\alpha > 0\) and initial point \(x_0 \in \mathbb R^d\).
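
As a concrete illustration (not part of the derivation below), here is a minimal Python sketch of this procedure; the quadratic objective, learning rate, starting point, and number of iterations are all illustrative choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_steps):
    """Run the update x_{t+1} = x_t - alpha * grad_f(x_t) for num_steps steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)
    return x

# Illustrative example: f(x) = 0.5 * ||x||^2, whose gradient is x and whose minimizer is 0.
x_final = gradient_descent(grad_f=lambda x: x,
                           x0=np.array([3.0, -2.0]),
                           alpha=0.1,
                           num_steps=100)
print(x_final)  # close to the minimizer at the origin
```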

We assume that the optimization problem has a non-empty solution set \(\mathcal X^*\) and we use \(f^*\) to denote the corresponding optimal function value.

Definition 1 (Smoothness). A function \(f\) is \(L\)-smooth if its gradient is \(L\)-Lipschitz continuous for some finite and positive \(L\); this implies the quadratic upper bound \begin{align} f(y) \leq f(x) + (\nabla f(x))^T (y - x) + \frac{L}{2} \| y - x \|^2 \label{eq:smoothness} \end{align} for all \(x, y \in \mathbb R^d\), which is the only consequence of smoothness used below.

Definition 2 (Strong Convexity). A function \(f\) is \(\mu\)-strongly convex if there exists a finite and positive \(\mu\) such that \begin{align} f(y) \geq f(x) + (\nabla f(x))^T (y - x) + \frac{\mu}{2} \| y - x \|^2, \label{eq:strong-convexity} \end{align} for all \(x, y \in \mathbb R^d\).

Definition 3 (Polyak-Łojasiewicz). A function \(f\) satisfies the Polyak-Łojasiewicz (PL) inequality if the following holds for some \(\mu > 0\): \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \mu (f(x) - f^*) \label{eq:pl-inequality} \end{align} for all \(x \in \mathbb R^d\).
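
To make Definitions 1–3 concrete, the following Python sketch numerically checks all three inequalities for the illustrative quadratic \(f(x) = \frac{1}{2} x^T A x\) with a symmetric positive definite \(A\), for which one may take \(L = \lambda_{\max}(A)\) and \(\mu = \lambda_{\min}(A)\) (and the PL inequality holds with constant \(\lambda_{\min}(A)\)); the specific matrix and random test points are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quadratic f(x) = 0.5 * x^T A x with a symmetric positive definite A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
f_star = 0.0                         # minimum value, attained at x* = 0
L = np.linalg.eigvalsh(A).max()      # smoothness constant
mu = np.linalg.eigvalsh(A).min()     # strong convexity / PL constant

for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    # Definition 1: quadratic upper bound.
    assert f(y) <= f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2) + 1e-9
    # Definition 2: quadratic lower bound.
    assert f(y) >= f(x) + grad_f(x) @ (y - x) + 0.5 * mu * np.sum((y - x) ** 2) - 1e-9
    # Definition 3: PL inequality.
    assert 0.5 * np.sum(grad_f(x) ** 2) >= mu * (f(x) - f_star) - 1e-9
```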

Proposition 1. Let \(f\) be an \(L\)-smooth and \(\mu\)-strongly convex function. Then \(f\) satisfies the PL inequality with constant \(\frac{\mu^2}{4L}\).

Proof. For \(x^*\) in the non-empty solution set \(\mathcal X^*\) we use the strong convexity property in \eqref{eq:strong-convexity} to write \begin{align} f^* \geq f(x) + (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2. \end{align}

Now, since \(f^* - f(x) \leq 0\), we can write \begin{align} 0 \geq f^* - f(x) \geq (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2, \end{align} from which we obtain \begin{align} (\nabla f(x))^T (x - x^*) \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Using the Cauchy-Schwarz inequality, we have \(\| \nabla f(x) \| \cdot \|x - x^* \| \geq (\nabla f(x))^T (x - x^*)\) and hence \begin{align} \| \nabla f(x) \| \cdot \|x - x^* \| \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Dividing both sides by \(\| x^* - x \|\) (the case \(x = x^*\) is trivial) and squaring, we get \begin{align} \| x^* - x \|^2 \leq \frac{4}{\mu^2} \| \nabla f(x) \|^2. \label{eq:temp} \end{align}

Now, using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x) &\leq f^* + (\nabla f(x^*))^T (x - x^*) + \frac{L}{2} \| x - x^* \|^2 \\
&= f^* + \frac{L}{2} \| x - x^* \|^2 && \text{(since } \nabla f(x^*) = 0 \text{)} \\
&\leq f^* + \frac{2L}{\mu^2} \| \nabla f(x) \|^2. && \text{(using \eqref{eq:temp})} \end{align} Rearranging, we obtain \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \frac{\mu^2}{4L} (f(x) - f^*), \end{align} which means that \(f\) satisfies the PL inequality with the constant \(\frac{\mu^2}{4L}\). \(\square\)
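
As a numerical sanity check of this conclusion, the sketch below evaluates \(\frac{1}{2}\|\nabla f(x)\|^2 \geq \frac{\mu^2}{4L}(f(x) - f^*)\) at random points for the same kind of illustrative quadratic as above; the matrix and test points are assumptions, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # f(x) = 0.5 * x^T A x, minimized at x* = 0 with f* = 0
L = np.linalg.eigvalsh(A).max()          # smoothness constant
mu = np.linalg.eigvalsh(A).min()         # strong convexity constant

for _ in range(1000):
    x = rng.normal(size=2)
    lhs = 0.5 * np.sum((A @ x) ** 2)                 # 0.5 * ||grad f(x)||^2
    rhs = (mu ** 2 / (4 * L)) * (0.5 * x @ A @ x)    # (mu^2 / 4L) * (f(x) - f*)
    assert lhs >= rhs - 1e-9
```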

Theorem 1. Given an \(L\)-smooth function \(f: \mathbb R^d \to \mathbb R\) that satisfies the PL inequality with constant \(\mu\), if the learning rate is set to \(\alpha = 1 / L\), gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu}{L}\right)^t (f(x_0) - f^*). \label{eq:convergence} \end{align}

Proof. For any given \(t \geq 0\), we have \begin{align} x_{t + 1} = x_t - \frac{1}{L} \nabla f(x_t). \label{eq:update} \end{align}

Using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x_{t + 1}) &\leq f(x_t) + (\nabla f(x_t))^T (x_{t + 1} - x_t) + \frac{L}{2} \| x_{t + 1} - x_t \|^2 \\
&= f(x_t) + (\nabla f(x_t))^T \left(-\frac{1}{L} \nabla f(x_t) \right) + \frac{L}{2} \left\| -\frac{1}{L} \nabla f(x_t) \right\|^2 && \text{(using \eqref{eq:update})} \\
&= f(x_t) - \frac{1}{2L} \left\| \nabla f(x_t) \right\|^2 && \text{(simplifying)} \\
&\leq f(x_t) - \frac{\mu}{L} (f(x_t) - f^*). && \text{(using the PL inequality \eqref{eq:pl-inequality})} \end{align} We rearrange to obtain \begin{align} f(x_{t + 1}) - f^* \leq \left(1 - \frac{\mu}{L}\right) (f(x_t) - f^*). \end{align} Applying this equation successively results in the desired convergence rate \eqref{eq:convergence}. \(\square\)
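
The following sketch checks the rate \eqref{eq:convergence} numerically for an illustrative strongly convex quadratic, for which the PL constant can be taken to be \(\lambda_{\min}(A)\); the matrix, starting point, and number of iterations are assumptions made for the example.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # f(x) = 0.5 * x^T A x, f* = 0 at x* = 0
L = np.linalg.eigvalsh(A).max()     # smoothness constant
mu = np.linalg.eigvalsh(A).min()    # PL constant for this quadratic

x = np.array([5.0, -4.0])
gap0 = 0.5 * x @ A @ x              # f(x_0) - f*
for t in range(1, 51):
    x = x - (1.0 / L) * (A @ x)     # gradient step with alpha = 1/L
    gap = 0.5 * x @ A @ x           # f(x_t) - f*
    assert gap <= (1 - mu / L) ** t * gap0 + 1e-12
print("linear rate bound holds for 50 iterations")
```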

Corollary 1. Given an \(L\)-smooth and \(\mu\)-strongly convex function \(f: \mathbb R^d \to \mathbb R\), if the learning rate is set to \(\alpha = 1 / L\), gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu^2}{4L^2}\right)^t (f(x_0) - f^*). \label{eq:convergence-2} \end{align}

Proof. By Proposition 1, \(f\) satisfies the PL inequality with constant \(\frac{\mu^2}{4L}\); substituting this constant for \(\mu\) in Theorem 1 gives the rate \(1 - \frac{\mu^2 / (4L)}{L} = 1 - \frac{\mu^2}{4L^2}\). \(\square\)
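
Analogously, the looser rate \eqref{eq:convergence-2} can be checked numerically on the same illustrative quadratic; it is weaker than \eqref{eq:convergence} because the constant \(\frac{\mu^2}{4L}\) from Proposition 1 is conservative. The matrix and starting point below are again assumptions made for the example.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # f(x) = 0.5 * x^T A x, f* = 0 at x* = 0
L = np.linalg.eigvalsh(A).max()     # smoothness constant
mu = np.linalg.eigvalsh(A).min()    # strong convexity constant

x = np.array([5.0, -4.0])
gap0 = 0.5 * x @ A @ x              # f(x_0) - f*
for t in range(1, 51):
    x = x - (1.0 / L) * (A @ x)     # gradient step with alpha = 1/L
    assert 0.5 * x @ A @ x <= (1 - mu ** 2 / (4 * L ** 2)) ** t * gap0 + 1e-12
```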
