Numeric optimization#
To find the optimal weights of logistic regression, we can use the gradient descent algorithm. To apply it, one needs to calculate the gradient of the loss function.
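For reference, starting from an initial point \(\boldsymbol w^{(0)}\), each iteration of gradient descent moves the weights in the direction opposite to the gradient (the learning rate \(\eta\) and the superscript notation are used here just for illustration):

\[
\boldsymbol w^{(k+1)} = \boldsymbol w^{(k)} - \eta \nabla \mathcal L(\boldsymbol w^{(k)}), \quad k = 0, 1, 2, \ldots
\]

The iterations stop once \(\|\nabla \mathcal L(\boldsymbol w^{(k)})\|\) is small enough or an iteration limit is reached, which is exactly the stopping rule used in the code below.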
Binary logistic regression#
Multiply the loss function (29) by \(n\):

\[
\mathcal L(\boldsymbol w) = -\sum_{i=1}^n \big(y_i \log\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) + (1 - y_i)\log(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w))\big).
\]
To find \(\nabla \mathcal L(\boldsymbol w)\) observe that

\[
\nabla \log\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = \frac{\sigma'(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)} \nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = \big(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big)\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w).
\]
Also,

\[
\nabla \log\big(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big) = -\frac{\sigma'(\boldsymbol x_i^\mathsf{T} \boldsymbol w)}{1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)} \nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = -\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w).
\]
Q. What is \(\nabla(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\)?
Putting it altogether, we get

\[
\nabla \mathcal L(\boldsymbol w) = -\sum_{i=1}^n \big(y_i(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)) - (1 - y_i)\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big)\boldsymbol x_i = \sum_{i=1}^n \big(\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) - y_i\big)\boldsymbol x_i.
\]
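As a quick sanity check of this formula, one can compare the analytic gradient with a finite-difference approximation of the loss on random data (a minimal sketch; the toy data and helper names below are made up for illustration):

import numpy as np
from scipy.special import expit  # sigmoid

def loss(w, X, y):
    # the (unregularized) loss above: negative log-likelihood
    p = expit(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # analytic gradient: sum_i (sigma(x_i^T w) - y_i) x_i
    return X.T @ (expit(X @ w) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, grad(w, X, y)))  # should print True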
Question
How to write \(\nabla \mathcal L(\boldsymbol w)\) as a product of a matrix and a vector, avoiding the explicit summation?
Hint
The shape of \(\nabla \mathcal L(\boldsymbol w)\) is the same as that of \(\boldsymbol w\), i.e., \(d\times 1\). Now observe that

\[
\sigma(\boldsymbol X \boldsymbol w) - \boldsymbol y = \big(\sigma(\boldsymbol x_1^\mathsf{T} \boldsymbol w) - y_1, \ldots, \sigma(\boldsymbol x_n^\mathsf{T} \boldsymbol w) - y_n\big)^\mathsf{T}
\]

is a vector of shape \(n \times 1\).
What should we multiply by this vector to obtain \(\nabla \mathcal L\)?
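The snippet below (an illustrative sketch with synthetic data) checks that the matrix-vector form of the gradient agrees with the explicit summation:

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
w = rng.normal(size=5)

# explicit summation: sum_i (sigma(x_i^T w) - y_i) x_i
grad_sum = sum((expit(x @ w) - yi) * x for x, yi in zip(X, y))
# matrix-vector form: X^T (sigma(Xw) - y)
grad_mat = X.T @ (expit(X @ w) - y)

print(np.allclose(grad_sum, grad_mat))  # should print True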
Question
What is the Hessian \(\nabla^2 \mathcal L(\boldsymbol w)\)?
Answer

\[
\nabla^2 \mathcal L(\boldsymbol w) = \boldsymbol X^\mathsf{T} \boldsymbol S \boldsymbol X,
\]

where

\[
\boldsymbol S = \mathrm{diag}\big\{\sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\big)\big\}_{i=1}^n.
\]
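Assuming this expression for the Hessian, it can be computed and inspected as follows (a sketch with synthetic data; the variable names are arbitrary):

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)

p = expit(X @ w)
S = np.diag(p * (1 - p))     # sigma_i * (1 - sigma_i) on the diagonal
H = X.T @ S @ X              # X^T S X, shape (5, 5)

# the smallest eigenvalue should be nonnegative: the loss is convex in w
print(np.linalg.eigvalsh(H).min())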
Breast cancer dataset: numeric optimization#
Fetch the dataset:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
X.shape, y.shape
((569, 30), (569,))
Apply the gradient descent algorithm to the logistic regression:
import numpy as np
from scipy.special import expit
def logistic_regression_gd(X, y, C=1, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # start from random weights
    w = np.random.normal(size=X.shape[1])
    # gradient of the loss with an L2 penalty term: X^T(sigma(Xw) - y) + C*w
    gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    for i in range(max_iter):
        # stop as soon as the gradient norm is small enough
        if np.linalg.norm(gradient) <= tol:
            return w
        # gradient descent step
        w -= learning_rate * gradient
        gradient = X.T.dot(expit(X.dot(w)) - y) + C * w
    print("max_iter exceeded")
    return w
Fit the logistic regression on the whole dataset:
%time w = logistic_regression_gd(X, y, learning_rate=2e-7, max_iter=10**6)
w
max_iter exceeded
CPU times: user 57.2 s, sys: 4.75 s, total: 1min 1s
Wall time: 34.1 s
array([ 0.95017256, 0.20093416, 0.95389464, -0.0472658 , 0.47622005,
0.84587822, -0.41132275, -0.31228808, 0.08504519, 0.53761053,
-0.82640988, 0.58935115, -0.08576555, -0.16455134, 0.08608244,
-0.28015316, -1.08272513, 0.56933785, -0.1965725 , 0.74017633,
1.31894747, -0.5814842 , -0.55305047, -0.02001289, -0.75746949,
-1.12086515, -2.09220646, -0.24091238, -0.9334001 , -1.08400614])
Calculate the accuracy score:
from sklearn.metrics import accuracy_score
accuracy_score(expit(X.dot(w)) > 0.5, y)
0.9490333919156415
Compare with sklearn:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(fit_intercept=False, max_iter=5000)
log_reg.fit(X, y)
print(log_reg.score(X, y))
print(accuracy_score(log_reg.predict(X), y))
log_reg.coef_
0.9595782073813708
0.9595782073813708
array([[ 2.18332329, 0.11789056, -0.0696952 , -0.00350176, -0.16519865,
-0.4118028 , -0.67385012, -0.36497435, -0.23983217, -0.02406538,
-0.02186803, 1.20613335, 0.04838948, -0.0970984 , -0.01870574,
0.01626219, -0.03861864, -0.04276028, -0.04322983, 0.00846606,
1.29672397, -0.34212421, -0.12576754, -0.02460068, -0.30612789,
-1.14244163, -1.62895872, -0.6992171 , -0.72520831, -0.11275416]])
np.linalg.norm(w - log_reg.coef_)
3.176761458282468
Multinomial logistic regression#
Recall that the loss function in this case is

\[
\mathcal L(\boldsymbol W) = -\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \widehat y_{ik}, \qquad \widehat y_{ik} = \mathrm{softmax}_k(\boldsymbol x_i^\mathsf{T} \boldsymbol W),
\]

where \(\boldsymbol W\) is the \(d \times K\) matrix of weights and \(y_{ik} = 1\) if and only if the \(i\)-th sample belongs to class \(k\).
One can show that

\[
\nabla_{\boldsymbol W} \mathcal L(\boldsymbol W) = \boldsymbol X^\mathsf{T}\big(\widehat{\boldsymbol Y} - \boldsymbol Y\big),
\]

where \(\widehat{\boldsymbol Y}\) is the \(n \times K\) matrix of predicted probabilities and \(\boldsymbol Y\) is the one-hot encoded matrix of true labels.
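A gradient-descent step for the multinomial case can then be sketched as follows (an illustrative sketch assuming the matrix form of the gradient above and integer class labels; the function and variable names are made up):

import numpy as np
from scipy.special import softmax

def softmax_regression_gd(X, y, n_classes, learning_rate=0.01, max_iter=1000):
    # one-hot encode the labels: Y has shape (n, K)
    Y = np.eye(n_classes)[y]
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(max_iter):
        # predicted probabilities, shape (n, K)
        Y_hat = softmax(X @ W, axis=1)
        # gradient in matrix form: X^T (Y_hat - Y)
        gradient = X.T @ (Y_hat - Y)
        W -= learning_rate * gradient
    return W

Predicted classes are then obtained as np.argmax(X @ W, axis=1), since softmax preserves the ordering of the logits.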
TODO
Try numerical optimization on several datasets
Apply Newton’s method and compare its performance with GD (a rough sketch is given below)
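For the second item, a possible starting point is the Newton update \(\boldsymbol w \leftarrow \boldsymbol w - (\nabla^2 \mathcal L(\boldsymbol w))^{-1}\nabla \mathcal L(\boldsymbol w)\) with the Hessian derived above (a rough sketch, not a tuned implementation; the L2 penalty term mirrors the GD code):

import numpy as np
from scipy.special import expit

def logistic_regression_newton(X, y, C=1, tol=1e-6, max_iter=100):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = expit(X.dot(w))
        gradient = X.T.dot(p - y) + C * w
        if np.linalg.norm(gradient) <= tol:
            break
        # Hessian of the regularized loss: X^T S X + C I
        H = X.T.dot(X * (p * (1 - p))[:, None]) + C * np.eye(X.shape[1])
        # Newton step: solve H d = gradient instead of inverting H
        w -= np.linalg.solve(H, gradient)
    return w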