Optimization#
Almost every task of machine learning is reduced to some optimization problem in multidimensional space. Sometimes the solution can be found analytically.
Necessary condition of extremum
If \(\mathcal L(\theta)\) is a smooth function and
then \(\nabla \mathcal L(\boldsymbol{\widehat \theta}) = \boldsymbol 0\).
A simple example \(f(x) = x^3\) shows that this condition is not sufficient.
Sufficient condition of extremum
If \(\mathcal L(\theta)\) is twice differentiable, \(\nabla \mathcal L(\boldsymbol {\widehat \theta}) = \boldsymbol 0\) and \(\nabla^2\mathcal L(\boldsymbol{\widehat \theta})\) is positive (negative) definite, then then \(\widehat{\boldsymbol \theta}\) is a point of local minimum (maximum).
Linear regression#
The loss function of linear regression model has from
Since
we have
Hence,
Assuming that \(\boldsymbol A\) has full column rank, we obtain
where \(\boldsymbol A^\dagger\) is pseudo inverse matrix.
We’ve checked only necessary condition of extremum. To verify the fulfillment of the sufficient condition, calculate \(d^2f\):
Therefore, \(\nabla^2 f = \boldsymbol A^\mathsf{T} \boldsymbol A\). Note that it does not depend on \(\boldsymbol x\) and it is a positive definite matrix since
Gradient descent#
Analytic solution like (70) is a rare case in machine learning tasks. It also requires inversion of a matrix which is computationally expensive. This is why it is important to elaborate alternative ways of numeric optimization. One of the simplest methods is gradient descent.
Algorithm (Gradient descent)
To solve the optimization problem
do the followind steps:
- initialize \(\boldsymbol w\) by some random values (e.g., from \(\mathcal N(0, 1\))) 
- choose tolerance \(\varepsilon > 0\) and learning rate \(\eta > 0\) 
- while \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) do the gradient step \[ \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w) \]
- return \(\boldsymbol w\) 
Note
If condition \(\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon\) holds for too long, the loop in step 3 terminates after some number iterations max_iter.
Application of gradient descent algorithm to linear regression is described here.
Newton’s method#
If the optimized function is twice differentiable, then an optimization method of the second order can be applied. One of such methods is Newton’s method. In this method weights \(\boldsymbol w\) are updated by the formula
Computation of hessian \(\nabla^2 L(\boldsymbol w)\) in Newton’s method could be costly. However, this is usually compensated by much faster convergence than that of first order methods like gradient descent.
Exercises#
- Show that (70) is the point of global minimium of the function (69). 
- Prove (71) if \(\boldsymbol A\) has full column rank. 
- Find global minimum of the function \(f(\boldsymbol x) = \boldsymbol a^\mathsf{T}\boldsymbol x + \frac \sigma 3\Vert \boldsymbol x\Vert_2^3\), \(\boldsymbol a \ne \boldsymbol 0\), \(\sigma > 0\). 
- Find \[ \min\limits_{\boldsymbol X} \Vert \boldsymbol{AX} - \boldsymbol B \Vert_F^2, \quad \boldsymbol A \in \mathbb R^{k\times m},\quad \boldsymbol B \in \mathbb R^{k\times n},\quad \mathrm{rank}(\boldsymbol A) = m. \]
- Let \(\boldsymbol A \in \mathbb R^{n\times n}\) be a symmetric positive definite matrix and \[ f(\boldsymbol X) = \mathrm{tr} (\boldsymbol X^{-1} \boldsymbol A) + \ln \vert \boldsymbol X\vert. \]- Find \(\min\limits_{\boldsymbol X \in \mathbb R^{n\times n}} f(\boldsymbol X)\). 
- Let \(\boldsymbol A \in \mathbb R^{n\times n}\) be a symmetric matrix and \[ f(\boldsymbol x) = \frac{\boldsymbol x^\mathsf{T} \boldsymbol{Ax}}{\boldsymbol x^\mathsf{T} \boldsymbol x}, \quad \boldsymbol x \in \mathbb R^n. \]- Calculate \(\nabla f(\boldsymbol x)\) and prove that \(\nabla f(\boldsymbol x) = \boldsymbol 0\) iff \(\boldsymbol x\) is an eigenvector of \(\boldsymbol A\). 
- Prove that if \(\boldsymbol A\) is positive definite then \[ \max\limits_{\boldsymbol x \ne \boldsymbol 0} f(\boldsymbol x) = \lambda_{\max}(\boldsymbol A), \quad \min\limits_{\boldsymbol x \ne \boldsymbol 0} f(\boldsymbol x) = \lambda_{\min}(\boldsymbol A) \]
 
