Numeric optimization#
Due to some computational issues it could be inpractical to calculate \(\widehat{\boldsymbol w}\) by the formula (23). If so, a numerical optimization method can be used. The most common one is gradient descent algorithm.
For ordinary linear regression the loss function is proportional to
therefore, \(\nabla \mathcal L(\boldsymbol w) = \boldsymbol X^\mathsf{T}(\boldsymbol{Xw} - \boldsymbol y)\).
Gradient descent#
Now let’s implement the gradient descent algorithm for this objective:
import numpy as np
def linear_regression_gd(X, y, learning_rate=0.01, tol=1e-3, max_iter=10000):
w = np.random.normal(size=X.shape[1], scale=2/(X.shape[0] + X.shape[1]))
gradient = X.T.dot(X.dot(w) - y)
for i in range(max_iter):
if np.linalg.norm(gradient) <= tol:
return w
w -= learning_rate * gradient
gradient = X.T.dot(X.dot(w) - y)
print("max_iter exceeded")
return w
Try in on some synthetic data:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=300, n_features=21, noise=4, random_state=42)
w = linear_regression_gd(X, y, learning_rate=1e-3, tol=1e-4)
print("MSE:", np.mean((X.dot(w) - y)**2))
MSE: 15.3087215455674
Double check by sklearn
:
from sklearn.linear_model import LinearRegression
LR = LinearRegression(fit_intercept=False)
LR.fit(X, y)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - w))
r-score: 0.9995298145938111
MSE: 15.308721545567236
Difference from w: 5.39008675221966e-07
Now apply to Boston dataset. Remove features age
, indus
,tax
, and add dummy feature:
import pandas as pd
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
target = boston['medv']
train = boston.drop(['medv', 'age', 'indus', 'tax'], axis=1)
# train = boston[['lstat', 'age']]
X_train = np.hstack([np.ones(506)[:, None], train])
Apply self-made gradient descent algorithm:
weights = linear_regression_gd(X_train, target.values, learning_rate=4e-6, max_iter=10**6)
print("MSE:", np.mean((X_train.dot(weights) - target)**2))
max_iter exceeded
MSE: 23.031660361639574
Compare with sklearn
result:
LR = LinearRegression(fit_intercept=False)
LR.fit(X_train, target)
print("r-score:", LR.score(X_train, target))
print("MSE:", np.mean((LR.predict(X_train) - target) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - weights))
r-score: 0.7272525819160474
MSE: 23.025215977387404
Difference from w: 2.054722721699466
Even one million iterations is not enough here for convergence…
Newton’s method#
TODO
Add more experiments for gradient descent along with visualizations
Apply Newton’s method to linear regression
Add regularization
Compare the performance of GD and Newton’s method