Optimization#

Almost every machine learning task reduces to an optimization problem in a multidimensional space. Sometimes the solution can be found analytically.

Necessary condition of extremum

If \(\mathcal L(\boldsymbol\theta)\) is a smooth function and

\[ \boldsymbol{\widehat \theta} = \arg\min\limits_{\boldsymbol\theta} \mathcal L(\boldsymbol\theta), \]

then \(\nabla \mathcal L(\boldsymbol{\widehat \theta}) = \boldsymbol 0\).

A simple example, \(f(x) = x^3\) at the point \(x = 0\), shows that this condition is not sufficient.

Sufficient condition of extremum

If \(\mathcal L(\boldsymbol\theta)\) is twice differentiable, \(\nabla \mathcal L(\boldsymbol {\widehat \theta}) = \boldsymbol 0\), and \(\nabla^2\mathcal L(\boldsymbol{\widehat \theta})\) is positive (negative) definite, then \(\widehat{\boldsymbol \theta}\) is a point of local minimum (maximum).
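
As a quick illustration of both conditions, here is a minimal SymPy sketch (the choice of test functions is ours): for \(f(x) = x^3\) the necessary condition holds at \(x = 0\) but the second-derivative test is inconclusive, while for \(f(x) = x^2\) the sufficient condition confirms a minimum.

```python
import sympy as sp

x = sp.symbols("x")

for f in (x**3, x**2):
    grad = sp.diff(f, x)                      # first derivative (gradient in 1D)
    hess = sp.diff(f, x, 2)                   # second derivative (Hessian in 1D)
    critical = sp.solve(sp.Eq(grad, 0), x)    # points satisfying the necessary condition
    print(f, critical, [hess.subs(x, c) for c in critical])

# x**3: the only critical point is x = 0, where the second derivative is 0,
#       so the sufficient condition fails (and indeed x = 0 is not an extremum)
# x**2: the critical point x = 0 has second derivative 2 > 0, hence a local minimum
```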

Linear regression#

The loss function of the linear regression model has the form

(67)#\[ f(\boldsymbol x) = \frac 12\Vert \boldsymbol{Ax} - \boldsymbol b\Vert_2^2, \quad \boldsymbol A \in \mathbb R^{m\times n}, \quad\boldsymbol b \in \mathbb R^m.\]

Since

\[ f(\boldsymbol x) = \frac 12 (\boldsymbol{Ax} - \boldsymbol b)^\mathsf{T}(\boldsymbol{Ax} - \boldsymbol b), \]

we have

\[ df(\boldsymbol x) = (\boldsymbol{Ax} - \boldsymbol b)^\mathsf{T}\big(d(\boldsymbol{Ax} - \boldsymbol b)\big) = \underbrace{(\boldsymbol{Ax} - \boldsymbol b)^\mathsf{T} \boldsymbol A}_{\nabla f(\boldsymbol x)^\mathsf{T}}d\boldsymbol x. \]

Hence,

\[ \nabla f(\hat{\boldsymbol x}) = \boldsymbol 0 \iff \boldsymbol A^\mathsf{T} (\boldsymbol{A\hat{x}} - \boldsymbol b) = \boldsymbol 0 \iff \boldsymbol A^\mathsf{T} \boldsymbol{A\hat{x}} = \boldsymbol A^\mathsf{T} \boldsymbol b. \]

Assuming that \(\boldsymbol A\) has full column rank, we obtain

(68)#\[ \boldsymbol{\hat{x}} = (\boldsymbol A^\mathsf{T} \boldsymbol A)^{-1} \boldsymbol A^\mathsf{T} \boldsymbol b = \boldsymbol A^\dagger \boldsymbol b,\]

where \(\boldsymbol A^\dagger\) is the pseudoinverse of \(\boldsymbol A\).

We have checked only the necessary condition of an extremum. To verify that the sufficient condition also holds, calculate \(d^2f\):

\[ d^2f = d(df) = d\big((\boldsymbol{Ax} - \boldsymbol b)^\mathsf{T} \boldsymbol A d\boldsymbol x_1\big) = (\boldsymbol A d\boldsymbol x_2)^\mathsf{T} \boldsymbol A d\boldsymbol x_1 = d\boldsymbol x_2^\mathsf{T} \boldsymbol A^\mathsf{T} \boldsymbol A d\boldsymbol x_1. \]

Therefore, \(\nabla^2 f = \boldsymbol A^\mathsf{T} \boldsymbol A\). Note that it does not depend on \(\boldsymbol x\) and it is a positive definite matrix since

(69)#\[ \boldsymbol x^\mathsf{T}\boldsymbol A^\mathsf{T} \boldsymbol{Ax} = \Vert \boldsymbol{Ax} \Vert_2^2 >0 \text{ if } \boldsymbol x \ne \boldsymbol 0.\]
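
A short NumPy sketch illustrating both the closed-form solution (68) and the positive definiteness of the Hessian \(\boldsymbol A^\mathsf{T} \boldsymbol A\); the matrix sizes and the random generation of \(\boldsymbol A\) are illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.normal(size=(m, n))       # has full column rank with probability 1
b = rng.normal(size=m)

# Closed-form solution: solve the normal equations A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The same point via the pseudoinverse and via least squares
print(np.allclose(x_hat, np.linalg.pinv(A) @ b))                 # True
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True

# Hessian A^T A is positive definite: all eigenvalues are strictly positive
print(np.linalg.eigvalsh(A.T @ A).min() > 0)                     # True
```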

Gradient descent#

An analytic solution like (68) is a rare case in machine learning tasks. It also requires inverting a matrix, which is computationally expensive. This is why it is important to develop alternative methods of numerical optimization. One of the simplest is gradient descent.

Algorithm (Gradient descent)

To solve the optimization problem

\[ \mathcal L(\boldsymbol w) \to \min\limits_{\boldsymbol w} \]

do the following steps:

  1. initialize \(\boldsymbol w\) with some random values (e.g., drawn from \(\mathcal N(0, 1)\))

  2. choose tolerance \(\varepsilon > 0\) and learning rate \(\eta > 0\)

  3. while \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) do the gradient step

    \[ \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w) \]
  4. return \(\boldsymbol w\)

Note

If the condition \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) holds for too long, the loop in step 3 is terminated after some maximal number of iterations max_iter.

The application of the gradient descent algorithm to linear regression is described here.
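
Below is a minimal NumPy sketch of the algorithm applied to the linear regression loss (67); the names `gradient_descent`, `eta`, `tol`, and `max_iter` are illustrative, not part of any particular library.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-6, max_iter=10_000):
    """Generic gradient descent: w := w - eta * grad(w) while ||grad(w)|| > tol."""
    w = w0.copy()
    for _ in range(max_iter):          # safeguard against a non-terminating loop
        g = grad(w)
        if np.linalg.norm(g) <= tol:   # stopping criterion
            break
        w -= eta * g
    return w

# Example: minimize f(x) = 0.5 * ||Ax - b||^2, whose gradient is A^T (Ax - b)
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)
grad = lambda x: A.T @ (A @ x - b)

w = gradient_descent(grad, w0=rng.normal(size=3), eta=0.01)
print(np.allclose(w, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-4))  # True (for a suitable eta)
```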

Newton’s method#

If the objective function is twice differentiable, then a second-order optimization method can be applied. One such method is Newton’s method, in which the weights \(\boldsymbol w\) are updated by the formula

\[ \boldsymbol w := \boldsymbol w - \eta \big(\nabla^2 \mathcal L(\boldsymbol w)\big)^{-1}\nabla\mathcal L(\boldsymbol w) \]

Computing the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\) in Newton’s method can be costly. However, this is usually compensated by much faster convergence than that of first-order methods like gradient descent.
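
A minimal sketch of a single Newton update in NumPy, again for the quadratic loss (67), where \(\nabla^2 f = \boldsymbol A^\mathsf{T} \boldsymbol A\) and one step with \(\eta = 1\) already lands in the minimum; the helper name `newton_step` is illustrative.

```python
import numpy as np

def newton_step(w, grad, hess, eta=1.0):
    """One Newton update: w := w - eta * hess^{-1} grad (solve instead of an explicit inverse)."""
    return w - eta * np.linalg.solve(hess, grad)

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)
w = rng.normal(size=3)

# For f(x) = 0.5 * ||Ax - b||^2: gradient A^T (Aw - b), Hessian A^T A (constant in w)
w_new = newton_step(w, grad=A.T @ (A @ w - b), hess=A.T @ A)
print(np.allclose(w_new, np.linalg.lstsq(A, b, rcond=None)[0]))  # True: one step suffices for a quadratic
```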

Exercises#

  1. Show that (68) is the point of global minimum of the function (67).

  2. Prove (69) if \(\boldsymbol A\) has full column rank.

  3. Find global minimum of the function \(f(\boldsymbol x) = \boldsymbol a^\mathsf{T}\boldsymbol x + \frac \sigma 3\Vert \boldsymbol x\Vert_2^3\), \(\boldsymbol a \ne \boldsymbol 0\), \(\sigma > 0\).

  4. Find

    \[ \min\limits_{\boldsymbol X} \Vert \boldsymbol{AX} - \boldsymbol B \Vert_F^2, \quad \boldsymbol A \in \mathbb R^{k\times m},\quad \boldsymbol B \in \mathbb R^{k\times n},\quad \mathrm{rank}(\boldsymbol A) = m. \]
  5. Let \(\boldsymbol A \in \mathbb R^{n\times n}\) be a symmetric positive definite matrix and \(f(\boldsymbol X) = \mathrm{tr} (\boldsymbol X^{-1} \boldsymbol A) + \ln \vert \boldsymbol X\vert\). Find \(\min\limits_{\boldsymbol X \in \mathbb R^{n\times n}} f(\boldsymbol X)\).

  6. Let \(\boldsymbol A \in \mathbb R^{n\times n}\) be a symmetric matrix and

    \[ f(\boldsymbol x) = \frac{\boldsymbol x^\mathsf{T} \boldsymbol{Ax}}{\boldsymbol x^\mathsf{T} \boldsymbol x}, \quad \boldsymbol x \in \mathbb R^n. \]
    • Calculate \(\nabla f(\boldsymbol x)\) and prove that \(\nabla f(\boldsymbol x) = \boldsymbol 0\) iff \(\boldsymbol x\) is an eigenvector of \(\boldsymbol A\).

    • Prove that if \(\boldsymbol A\) is positive definite then

      \[ \max\limits_{\boldsymbol x \ne \boldsymbol 0} f(\boldsymbol x) = \lambda_{\max}(\boldsymbol A), \quad \min\limits_{\boldsymbol x \ne \boldsymbol 0} f(\boldsymbol x) = \lambda_{\min}(\boldsymbol A) \]