Numeric optimization#

Due to computational issues it can be impractical to calculate \(\widehat{\boldsymbol w}\) by formula (23), e.g., when the matrix \(\boldsymbol X^\mathsf{T}\boldsymbol X\) is large or ill-conditioned. In such cases a numerical optimization method can be used instead; the most common one is the gradient descent algorithm.

For ordinary linear regression the loss function is proportional to

\[ \mathcal L(\boldsymbol w) = \frac 12\Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2, \]

therefore, \(\nabla \mathcal L(\boldsymbol w) = \boldsymbol X^\mathsf{T}(\boldsymbol{Xw} - \boldsymbol y)\).
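
Gradient descent starts from some initial guess \(\boldsymbol w^{(0)}\) and iteratively steps against this gradient:

\[ \boldsymbol w^{(k+1)} = \boldsymbol w^{(k)} - \eta\nabla \mathcal L(\boldsymbol w^{(k)}) = \boldsymbol w^{(k)} - \eta\boldsymbol X^\mathsf{T}(\boldsymbol{Xw}^{(k)} - \boldsymbol y), \]

where \(\eta > 0\) is the learning rate. The iterations stop either when \(\Vert\nabla \mathcal L(\boldsymbol w^{(k)})\Vert_2\) falls below a tolerance or after a maximal number of steps.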

Gradient descent#

Now let’s implement the gradient descent algorithm for this objective:

import numpy as np

def linear_regression_gd(X, y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # random initialization of the weights
    w = np.random.normal(size=X.shape[1], scale=2/(X.shape[0] + X.shape[1]))
    gradient = X.T.dot(X.dot(w) - y)
    for i in range(max_iter):
        # stop as soon as the gradient norm falls below the tolerance
        if np.linalg.norm(gradient) <= tol:
            return w
        # take a step against the gradient direction
        w -= learning_rate * gradient
        gradient = X.T.dot(X.dot(w) - y)
    print("max_iter exceeded")
    return w
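
Each iteration costs \(O(nd)\) operations for the two matrix–vector products in the gradient, where \(n\) is the number of samples and \(d\) the number of features, whereas the closed-form solution requires roughly \(O(nd^2 + d^3)\) operations; this is why gradient descent becomes attractive when \(d\) is large.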

Try it on some synthetic data:

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=300, n_features=21, noise=4, random_state=42)
w = linear_regression_gd(X, y, learning_rate=1e-3, tol=1e-4)
print("MSE:", np.mean((X.dot(w) - y)**2))
MSE: 15.3087215455674

Double-check with sklearn:

from sklearn.linear_model import LinearRegression
LR = LinearRegression(fit_intercept=False)
LR.fit(X, y)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - w))
r-score: 0.9995298145938111
MSE: 15.308721545567236
Difference from w: 5.39008675221966e-07
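
As one more cross-check, np.linalg.lstsq solves the same least-squares problem (via an SVD rather than by inverting \(\boldsymbol X^\mathsf{T}\boldsymbol X\)), so its result should be close to both estimates:

w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares solution from numpy
print("Difference from w:", np.linalg.norm(w_lstsq - w))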

Now apply it to the Boston dataset. Remove the features age, indus, tax, and add a dummy intercept feature:

import pandas as pd
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
target = boston['medv']
train = boston.drop(['medv', 'age', 'indus', 'tax'], axis=1)
# train = boston[['lstat', 'age']]
# prepend a dummy column of ones serving as the intercept feature
X_train = np.hstack([np.ones((train.shape[0], 1)), train])

Apply the self-made gradient descent algorithm:

weights = linear_regression_gd(X_train, target.values, learning_rate=4e-6, max_iter=10**6)
print("MSE:", np.mean((X_train.dot(weights) - target)**2))
max_iter exceeded
MSE: 23.031660361639574

Compare with the sklearn result:

LR = LinearRegression(fit_intercept=False)
LR.fit(X_train, target)
print("r-score:", LR.score(X_train, target))
print("MSE:", np.mean((LR.predict(X_train) - target) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - weights))
r-score: 0.7272525819160474
MSE: 23.025215977387404
Difference from w: 2.054722721699466

Even one million iterations are not enough for convergence here…
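
A likely reason is that the features of the Boston dataset have very different scales, which makes the loss ill-conditioned and forces a tiny learning rate. A possible remedy is to standardize the non-intercept columns before running gradient descent; here is a sketch with an illustrative (not tuned) learning rate and iteration budget:

from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance, keeping the intercept column of ones
X_scaled = np.hstack([np.ones((train.shape[0], 1)), StandardScaler().fit_transform(train)])
weights_scaled = linear_regression_gd(X_scaled, target.values, learning_rate=1e-4, max_iter=10**5)
print("MSE:", np.mean((X_scaled.dot(weights_scaled) - target)**2))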

Newton’s method#
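
For the quadratic loss above the Hessian is constant, \(\nabla^2 \mathcal L(\boldsymbol w) = \boldsymbol X^\mathsf{T}\boldsymbol X\), so a single Newton step

\[ \boldsymbol w \mapsto \boldsymbol w - (\boldsymbol X^\mathsf{T}\boldsymbol X)^{-1}\nabla \mathcal L(\boldsymbol w) \]

lands exactly on the least-squares solution, whatever the starting point. A minimal sketch (it amounts to solving the normal equations, so it shares their numerical caveats):

def linear_regression_newton(X, y, w=None):
    # for a quadratic loss one Newton step is exact: it solves X^T X w = X^T y
    if w is None:
        w = np.zeros(X.shape[1])
    return w - np.linalg.solve(X.T.dot(X), X.T.dot(X.dot(w) - y))

weights_newton = linear_regression_newton(X_train, target.values)
print("MSE:", np.mean((X_train.dot(weights_newton) - target)**2))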

TODO

  • Add more experiments for gradient descent along with visualizations

  • Apply Newton’s method to linear regression

  • Add regularization

  • Compare the performance of GD and Newton’s method