Numeric optimization#
To find the optimal weights of logistic regression, we can use the gradient descent algorithm. To apply it, one needs to calculate the gradient of the loss function.
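For reference, starting from an initial point \(\boldsymbol w^{(0)}\), each iteration of gradient descent moves the weights in the direction opposite to the gradient (the learning rate \(\eta\) and the superscript notation are used here just for illustration):

\[
\boldsymbol w^{(k+1)} = \boldsymbol w^{(k)} - \eta \nabla \mathcal L(\boldsymbol w^{(k)}), \quad k = 0, 1, 2, \ldots
\]

The iterations stop once \(\|\nabla \mathcal L(\boldsymbol w^{(k)})\|\) is small enough or an iteration limit is reached, which is exactly the stopping rule used in the code below.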
Binary logistic regression#
Multiply the loss function (29) by \(n\):

\[
\mathcal L(\boldsymbol w) = -\sum_{i=1}^n \big(y_i \log\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) + (1 - y_i)\log(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w))\big).
\]
To find \(\nabla \mathcal L(\boldsymbol w)\) observe that

\[
\nabla \log\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = \frac{\sigma'(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)} \nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = \big(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big)\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w).
\]
Also,

\[
\nabla \log\big(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big) = -\frac{\sigma'(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)} \nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = -\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w).
\]
Q. What is \(\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\)?
Putting it altogether, we get

\[
\nabla \mathcal L(\boldsymbol w) = -\sum_{i=1}^n \big(y_i(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)) - (1 - y_i)\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big)\boldsymbol x_i = \sum_{i=1}^n \big(\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) - y_i\big)\boldsymbol x_i.
\]
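As a quick sanity check of this formula, one can compare the analytic gradient with a finite-difference approximation of the loss on random data (a minimal sketch; the toy data and helper names below are made up for illustration):

import numpy as np
from scipy.special import expit  # sigmoid

def loss(w, X, y):
    # the (unregularized) loss above: negative log-likelihood
    p = expit(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # analytic gradient: sum_i (sigma(x_i^T w) - y_i) x_i
    return X.T @ (expit(X @ w) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, grad(w, X, y)))  # should print True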
Question
How to write \(\nabla \mathcal L(\boldsymbol w)\) as a product of a matrix and a vector, avoiding the explicit summation?
Hint
The shape of \(\nabla \mathcal L(\boldsymbol w)\) is the same as that of \(\boldsymbol w\), i.e., \(d\times 1\). Now observe that

\[
\sigma(\boldsymbol X \boldsymbol w) - \boldsymbol y = \big(\sigma(\boldsymbol x_1^\mathsf{T} \boldsymbol w) - y_1, \ldots, \sigma(\boldsymbol x_n^\mathsf{T} \boldsymbol w) - y_n\big)^\mathsf{T}
\]

is a vector of shape \(n \times 1\).
What should we multiply by this vector to obtain \(\nabla \mathcal L\)?
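The snippet below (an illustrative sketch with synthetic data) checks that the matrix-vector form of the gradient agrees with the explicit summation:

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
w = rng.normal(size=5)

# explicit summation: sum_i (sigma(x_i^T w) - y_i) x_i
grad_sum = sum((expit(x @ w) - yi) * x for x, yi in zip(X, y))
# matrix-vector form: X^T (sigma(Xw) - y)
grad_mat = X.T @ (expit(X @ w) - y)

print(np.allclose(grad_sum, grad_mat))  # should print True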
Question
What is the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\)?
Answer

\[
\nabla^2 \mathcal L(\boldsymbol w) = \boldsymbol X^\mathsf{T} \boldsymbol S \boldsymbol X,
\]

where

\[
\boldsymbol S = \mathrm{diag}\big\{\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big)\big\}_{i=1}^n.
\]
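Assuming this expression for the Hessian, it can be computed and inspected as follows (a sketch with synthetic data; the variable names are arbitrary):

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)

p = expit(X @ w)
S = np.diag(p * (1 - p))     # sigma_i * (1 - sigma_i) on the diagonal
H = X.T @ S @ X              # X^T S X, shape (5, 5)

# the smallest eigenvalue should be nonnegative: the loss is convex in w
print(np.linalg.eigvalsh(H).min())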
Breast cancer dataset: numeric optimization#
Fetch the dataset:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
X.shape, y.shape
((569, 30), (569,))
Apply the gradient descent algorithm to the logistic regression:
import numpy as np
from scipy.special import expit
def logistic_regression_gd(X, y, C=1, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # start from random weights
    w = np.random.normal(size=X.shape[1])
    # gradient of the loss with an L2 penalty term: X^T(sigma(Xw) - y) + C*w
    gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    for i in range(max_iter):
        # stop as soon as the gradient norm is small enough
        if np.linalg.norm(gradient) <= tol:
            return w
        # gradient descent step
        w -= learning_rate * gradient
        gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    print("max_iter exceeded")
    return w
Fit the logistic regression on the whole dataset:
%time w = logistic_regression_gd(X, y, learning_rate=2e-7, max_iter=10**6)
w
max_iter exceeded
CPU times: user 57.2 s, sys: 4.75 s, total: 1min 1s
Wall time: 34.1 s
array([ 0.95017256, 0.20093416, 0.95389464, -0.0472658 , 0.47622005,
0.84587822, -0.41132275, -0.31228808, 0.08504519, 0.53761053,
-0.82640988, 0.58935115, -0.08576555, -0.16455134, 0.08608244,
-0.28015316, -1.08272513, 0.56933785, -0.1965725 , 0.74017633,
1.31894747, -0.5814842 , -0.55305047, -0.02001289, -0.75746949,
-1.12086515, -2.09220646, -0.24091238, -0.9334001 , -1.08400614])
Calculate the accuracy score:
from sklearn.metrics import accuracy_score
accuracy_score(expit(X.dot(w)) > 0.5, y)
0.9490333919156415
Compare with sklearn:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(fit_intercept=False, max_iter=5000)
log_reg.fit(X, y)
print(log_reg.score(X, y))
print(accuracy_score(log_reg.predict(X), y))
log_reg.coef_
0.9595782073813708
0.9595782073813708
array([[ 2.18332329, 0.11789056, -0.0696952 , -0.00350176, -0.16519865,
-0.4118028 , -0.67385012, -0.36497435, -0.23983217, -0.02406538,
-0.02186803, 1.20613335, 0.04838948, -0.0970984 , -0.01870574,
0.01626219, -0.03861864, -0.04276028, -0.04322983, 0.00846606,
1.29672397, -0.34212421, -0.12576754, -0.02460068, -0.30612789,
-1.14244163, -1.62895872, -0.6992171 , -0.72520831, -0.11275416]])
np.linalg.norm(w - log_reg.coef_)
3.176761458282468
Multinomial logistic regression#
Recall that the loss function in this case is

\[
\mathcal L(\boldsymbol W) = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \widehat y_{ik}, \qquad \widehat y_{ik} = \mathrm{softmax}_k(\boldsymbol x_i^\mathsf{T} \boldsymbol W),
\]

where \(\boldsymbol W\) is the \(d \times K\) matrix of weights and \(y_{ik} = 1\) if and only if the \(i\)-th sample belongs to class \(k\).
One can show that

\[
\nabla_{\boldsymbol W} \mathcal L(\boldsymbol W) = \boldsymbol X^\mathsf{T}\big(\widehat{\boldsymbol Y} - \boldsymbol Y\big),
\]

where \(\widehat{\boldsymbol Y}\) is the \(n \times K\) matrix of predicted probabilities and \(\boldsymbol Y\) is the one-hot encoded matrix of true labels.
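A gradient-descent step for the multinomial case can then be sketched as follows (an illustrative sketch assuming the matrix form of the gradient above and integer class labels; the function and variable names are made up):

import numpy as np
from scipy.special import softmax

def softmax_regression_gd(X, y, n_classes, learning_rate=0.01, max_iter=1000):
    # one-hot encode the labels: Y has shape (n, K)
    Y = np.eye(n_classes)[y]
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(max_iter):
        # predicted probabilities, shape (n, K)
        Y_hat = softmax(X @ W, axis=1)
        # gradient in matrix form: X^T (Y_hat - Y)
        gradient = X.T @ (Y_hat - Y)
        W -= learning_rate * gradient
    return W

Predicted classes are then obtained as np.argmax(X @ W, axis=1), since softmax preserves the ordering of the logits.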
TODO
Try numerical optimization on several datasets
Apply Newton’s method and compare its performance with GD (a rough sketch is given below)
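For the second item, a possible starting point is the Newton update \(\boldsymbol w \leftarrow \boldsymbol w - (\nabla^2 \mathcal L(\boldsymbol w))^{-1}\nabla \mathcal L(\boldsymbol w)\) with the Hessian derived above (a rough sketch, not a tuned implementation; the L2 penalty term mirrors the GD code):

import numpy as np
from scipy.special import expit

def logistic_regression_newton(X, y, C=1, tol=1e-6, max_iter=100):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = expit(X.dot(w))
        gradient = X.T.dot(p - y) + C * w
        if np.linalg.norm(gradient) <= tol:
            break
        # Hessian of the regularized loss: X^T S X + C I
        H = X.T.dot(X * (p * (1 - p))[:, None]) + C * np.eye(X.shape[1])
        # Newton step: solve H d = gradient instead of inverting H
        w -= np.linalg.solve(H, gradient)
    return w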