Numeric optimization#

Due to computational issues it can be impractical to calculate \(\widehat{\boldsymbol w}\) by the formula (21). In such cases a numerical optimization method can be used instead; the most common one is the gradient descent algorithm.

For ordinary linear regression the loss function is proportional to

\[ \mathcal L(\boldsymbol w) = \frac 12\Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2, \]

therefore, \(\nabla \mathcal L(\boldsymbol w) = \boldsymbol X^\mathsf{T}(\boldsymbol{Xw} - \boldsymbol y)\).
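Before implementing anything, the gradient formula can be verified numerically. Below is a minimal sketch comparing the analytic gradient with central finite differences on random data; the sizes, the step eps and the use of np.allclose are arbitrary choices, not part of the original text:

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = rng.normal(size=50)
w_demo = rng.normal(size=4)

def loss(w):
    return 0.5 * np.sum((X_demo @ w - y_demo) ** 2)

# analytic gradient: X^T (Xw - y)
grad_analytic = X_demo.T @ (X_demo @ w_demo - y_demo)

# central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([
    (loss(w_demo + eps * e) - loss(w_demo - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(grad_analytic, grad_numeric))  # expected to print True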

Gradient descent#

Now let’s implement the gradient descent algorithm for this objective:

import numpy as np

def linear_regression_gd(X, y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # initialize the weights with small random values
    w = np.random.normal(size=X.shape[1], scale=2/(X.shape[0] + X.shape[1]))
    gradient = X.T.dot(X.dot(w) - y)
    for i in range(max_iter):
        # stop once the gradient is small enough
        if np.linalg.norm(gradient) <= tol:
            return w
        # take a step against the gradient and recompute it
        w -= learning_rate * gradient
        gradient = X.T.dot(X.dot(w) - y)
    print("max_iter exceeded")
    return w

Try it on some synthetic data:

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=300, n_features=21, noise=4, random_state=42)
w = linear_regression_gd(X, y, learning_rate=1e-3, tol=1e-4)
print("MSE:", np.mean((X.dot(w) - y)**2))
MSE: 15.308721545567387

Double-check with sklearn:

from sklearn.linear_model import LinearRegression
LR = LinearRegression(fit_intercept=False)
LR.fit(X, y)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - w))
r-score: 0.9995298145938111
MSE: 15.308721545567236
Difference from w: 5.39000912464501e-07

Now apply it to the Boston dataset. Remove the features age, indus, tax, and add a dummy feature for the intercept:

import pandas as pd
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
target = boston['medv']
train = boston.drop(['medv', 'age', 'indus', 'tax'], axis=1)
# prepend a dummy column of ones to model the intercept
X_train = np.hstack([np.ones(len(train))[:, None], train])

Apply the self-made gradient descent algorithm:

weights = linear_regression_gd(X_train, target.values, learning_rate=4e-6, max_iter=10**6)
print("MSE:", np.mean((X_train.dot(weights) - target)**2))
max_iter exceeded
MSE: 23.031661098509495

Compare with the sklearn result:

LR = LinearRegression(fit_intercept=False)
LR.fit(X_train, target)
print("r-score:", LR.score(X_train, target))
print("MSE:", np.mean((LR.predict(X_train) - target) ** 2))
print("Difference from w:", np.linalg.norm(LR.coef_ - weights))
r-score: 0.7272525819160474
MSE: 23.025215977387404
Difference from w: 2.054840189872839

Even one million iterations are not enough for convergence here…
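One likely reason is that the remaining Boston features live on very different scales, which makes the problem ill-conditioned and forces a tiny learning rate. A possible remedy is to standardize the features first; the following is just a sketch (StandardScaler from sklearn is assumed, and the learning rate and max_iter are untuned guesses):

from sklearn.preprocessing import StandardScaler

# standardize the feature columns, then re-attach the dummy column of ones
scaler = StandardScaler()
X_scaled = np.hstack([np.ones(len(train))[:, None], scaler.fit_transform(train)])

w_scaled = linear_regression_gd(X_scaled, target.values, learning_rate=1e-4, max_iter=10**5)
print("MSE:", np.mean((X_scaled.dot(w_scaled) - target.values)**2))

On standardized features gradient descent should behave much better, although the learned weights are then no longer directly comparable to LR.coef_ above.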

Newton’s method#

TODO

  • Add more experiments for gradient descent along with visualizations

  • Apply Newton’s method to linear regression (a first sketch is given after this list)

  • Add regularization

  • Compare the performance of GD and Newton’s method
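As a first step towards the Newton’s method item above: for the quadratic loss \(\mathcal L(\boldsymbol w) = \frac 12\Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2\) the Hessian is the constant matrix \(\boldsymbol X^\mathsf{T}\boldsymbol X\), so a single Newton step from any starting point lands exactly on the minimizer \(\widehat{\boldsymbol w}\) from formula (21). A minimal sketch, assuming \(\boldsymbol X^\mathsf{T}\boldsymbol X\) is invertible (the linear system is solved with np.linalg.solve rather than by explicit inversion):

def linear_regression_newton(X, y, w0=None):
    # Hessian of the quadratic loss is constant: X^T X
    H = X.T.dot(X)
    w = np.zeros(X.shape[1]) if w0 is None else w0
    gradient = X.T.dot(X.dot(w) - y)
    # Newton step: w <- w - H^{-1} * gradient; one step suffices for a quadratic loss
    return w - np.linalg.solve(H, gradient)

For instance, linear_regression_newton(X_train, target.values) should match LR.coef_ above up to numerical error, which makes it a convenient baseline for the planned comparison of GD and Newton’s method.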