Numeric optimization#

To find the optimal weights of logistic regression, we can use the gradient descent algorithm. To apply it, one needs to calculate the gradient of the loss function.
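Recall that gradient descent starts from some initial guess \(\boldsymbol w^{(0)}\) and iteratively moves in the direction opposite to the gradient:

\[ \boldsymbol w^{(t+1)} = \boldsymbol w^{(t)} - \eta \nabla \mathcal L(\boldsymbol w^{(t)}), \]

where \(\eta > 0\) is the learning rate.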

Binary logistic regression#

Multiply the loss function (29) by \(n\):

\[ \mathcal L(\boldsymbol w) = -\sum\limits_{i=1}^n \big(y_i \log(\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)) + (1- y_i)\log(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w))\big). \]

To find \(\nabla \mathcal L(\boldsymbol w)\), observe that

\[ \nabla \log(\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)) = \frac {\nabla \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)} = \frac{\sigma'(\boldsymbol x_i^\mathsf{T} \boldsymbol w) \nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{{\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}}. \]

Also,

\[ \nabla \log(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)) = -\frac {\nabla \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)} = -\frac{\sigma'(\boldsymbol x_i^\mathsf{T} \boldsymbol w) \nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}. \]

Q. What is \(\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\)?

Putting it all together and using \(\sigma'(t) = \sigma(t)(1 - \sigma(t))\), we get

\[ \nabla \mathcal L(\boldsymbol w) = -\sum\limits_{i=1}^n \big(y_i(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w))\boldsymbol x_i - (1-y_i)\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\boldsymbol x_i\big) = \sum\limits_{i=1}^n (\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) - y_i)\boldsymbol x_i. \]
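As a quick sanity check, this analytic gradient can be compared with a finite-difference approximation on synthetic data. Below is a minimal sketch using scipy.optimize.check_grad; the helpers loss and grad are introduced here only for this check:

import numpy as np
from scipy.optimize import check_grad
from scipy.special import expit

def loss(w, X, y):
    # negative log-likelihood of the binary logistic model
    p = expit(X.dot(w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # analytic gradient derived above
    return X.T.dot(expit(X.dot(w)) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w0 = rng.normal(size=3)
check_grad(loss, grad, w0, X, y)  # should be a small number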

Question

How can \(\nabla \mathcal L(\boldsymbol w)\) be written as a product of a matrix and a vector, avoiding the explicit summation?

Question

What is the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\)?

Breast cancer dataset: numeric optimization#

Fetch the dataset:

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
X.shape, y.shape
((569, 30), (569,))

Apply the gradient descent algorithm to logistic regression (note that an \(\ell_2\) regularization term \(C \boldsymbol w\) is added to the gradient):

import numpy as np
from scipy.special import expit

def logistic_regression_gd(X, y, C=1, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # random initialization of the weights
    w = np.random.normal(size=X.shape[1])
    # gradient of the loss with the L2 penalty term C * w
    gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    for i in range(max_iter):
        # stop as soon as the gradient norm is small enough
        if np.linalg.norm(gradient) <= tol:
            return w
        # gradient descent step
        w -= learning_rate * gradient
        gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    print("max_iter exceeded")
    return w

Fit the logistic regression on the whole dataset:

%time w = logistic_regression_gd(X, y, learning_rate=2e-7, max_iter=10**6)
w
max_iter exceeded
CPU times: user 57.2 s, sys: 4.75 s, total: 1min 1s
Wall time: 34.1 s
array([ 0.95017256,  0.20093416,  0.95389464, -0.0472658 ,  0.47622005,
        0.84587822, -0.41132275, -0.31228808,  0.08504519,  0.53761053,
       -0.82640988,  0.58935115, -0.08576555, -0.16455134,  0.08608244,
       -0.28015316, -1.08272513,  0.56933785, -0.1965725 ,  0.74017633,
        1.31894747, -0.5814842 , -0.55305047, -0.02001289, -0.75746949,
       -1.12086515, -2.09220646, -0.24091238, -0.9334001 , -1.08400614])

Calculate the accuracy score:

from sklearn.metrics import accuracy_score
accuracy_score(expit(X.dot(w)) > 0.5, y)
0.9490333919156415

Compare with sklearn:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(fit_intercept=False, max_iter=5000)
log_reg.fit(X, y)
print(log_reg.score(X, y))
print(accuracy_score(log_reg.predict(X), y))
log_reg.coef_
0.9595782073813708
0.9595782073813708
array([[ 2.18332329,  0.11789056, -0.0696952 , -0.00350176, -0.16519865,
        -0.4118028 , -0.67385012, -0.36497435, -0.23983217, -0.02406538,
        -0.02186803,  1.20613335,  0.04838948, -0.0970984 , -0.01870574,
         0.01626219, -0.03861864, -0.04276028, -0.04322983,  0.00846606,
         1.29672397, -0.34212421, -0.12576754, -0.02460068, -0.30612789,
        -1.14244163, -1.62895872, -0.6992171 , -0.72520831, -0.11275416]])
np.linalg.norm(w - log_reg.coef_)
3.176761458282468

Multinomial logistic regression#

Recall that the loss function in this case is

\[\begin{split} \begin{multline*} \mathcal L(\boldsymbol W) = -\sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \bigg(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{k} -\log\Big(\sum\limits_{l=1}^K \exp(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{l})\Big)\bigg) = \\ = -\sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \bigg(\sum\limits_{j=1}^d x_{ij} w_{jk} -\log\Big(\sum\limits_{l=1}^K \exp\Big(\sum\limits_{j=1}^d x_{ij} w_{jl}\Big)\Big)\bigg) \end{multline*} \end{split}\]

One can show that

\[ \nabla \mathcal L(\boldsymbol W) = \boldsymbol X^\mathsf{T} (\boldsymbol {\widehat Y} - \boldsymbol Y) = \boldsymbol X^\mathsf{T} ( \sigma(\boldsymbol{XW}) - \boldsymbol Y), \]

where \(\sigma\) now denotes the softmax function applied to each row of \(\boldsymbol{XW}\), and \(\boldsymbol Y\) is the matrix of one-hot encoded labels \(y_{ik}\).
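A minimal sketch of this formula in code, assuming that Y is the one-hot encoded label matrix of shape (n, K) and W has shape (d, K) (the function name multinomial_gradient is introduced only for illustration):

import numpy as np
from scipy.special import softmax

def multinomial_gradient(W, X, Y):
    # predicted probabilities: softmax applied row-wise to XW, shape (n, K)
    P = softmax(X.dot(W), axis=1)
    # gradient of the multinomial loss, shape (d, K)
    return X.T.dot(P - Y)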

TODO

  • Try numerical optimization on several datasets

  • Apply Newton’s method and compare its performance with GD (a possible starting point is sketched below)
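For the last item, here is a minimal sketch of Newton’s method for the binary case. The function name logistic_regression_newton is ours; it assumes the same L2-regularized gradient as logistic_regression_gd and the Hessian \(\boldsymbol X^\mathsf{T} \boldsymbol S \boldsymbol X + C\boldsymbol I\), where \(\boldsymbol S\) is the diagonal matrix with entries \(\sigma(\boldsymbol x_i^\mathsf{T}\boldsymbol w)\big(1 - \sigma(\boldsymbol x_i^\mathsf{T}\boldsymbol w)\big)\):

import numpy as np
from scipy.special import expit

def logistic_regression_newton(X, y, C=1, tol=1e-6, max_iter=100):
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        p = expit(X.dot(w))
        gradient = X.T.dot(p - y) + C * w
        if np.linalg.norm(gradient) <= tol:
            return w
        # Hessian of the L2-regularized loss: X^T S X + C I
        hessian = (X.T * (p * (1 - p))).dot(X) + C * np.eye(X.shape[1])
        # Newton step: solve the linear system instead of inverting the Hessian
        w -= np.linalg.solve(hessian, gradient)
    print("max_iter exceeded")
    return w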