Multiple linear regression#

In practice, there usually more than one feature. Suppose that we have \(n\) training samples and \(d\) features. Given a training sample \(\boldsymbol x \in\mathbb R^d\), we predict the target variable as follows:

(18)#\[ \widehat y = w_0 + w_1 x_1 + \ldots + w_d x_d = w_0 + \sum\limits_{j=1}^d w_j x_j\]

To fit the linear regression, minimize MSE or, equivalently, RSS:

(19)#\[ RSS = \sum\limits_{i=1}^n(y_i - \widehat y_i)^2 = \sum\limits_{i=1}^n \Big(y_i - w_0 - \sum\limits_{j=1}^k w_j x_j\Big)^2 \to \min\limits_{\boldsymbol w}.\]

2-d linear regression on Boston dataset#

Here we trained a simple linear regression with a single feature (lstat). Now add one more feature: age.

import pandas as pd
boston = pd.read_csv("../ISLP_datasets/Boston.csv")
boston[['age','lstat','medv']].head()
age lstat medv
0 65.2 4.98 24.0
1 78.9 9.14 21.6
2 61.1 4.03 34.7
3 45.8 2.94 33.4
4 54.2 5.33 36.2

Now fit the linear regression model:

from sklearn.linear_model import LinearRegression
import numpy as np

X, y = boston[['lstat', 'age']], boston['medv']
LR = LinearRegression()
LR.fit(X, y)
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
intercept: 33.2227605317929
coefficients: [-1.03206856  0.03454434]
r-score: 0.5512689379421003
MSE: 37.88167709241267

Q. Are these results better or worse than those of simple regression? Quadratic regression?

Analytical solution for best fit#

Important

The intercept \(w_0\) is a bit annoying since it has to be written separately all the time. For notational convenience introduce a fake feature \(x_0 =1 \) for all samples. Than (18) can be rewritten as

(20)#\[ \sum\limits_{j=0}^d w_j x_j = \boldsymbol x^\mathsf{T} \boldsymbol w,\]

where \(\boldsymbol w\) is the vector of weights of the linear regression.

Taking into account this note, write the optimization task (19) as

(21)#\[ \mathcal L(\boldsymbol w) = \sum\limits_{i=1}^n(y_i - \boldsymbol x_i^\mathsf{T} \boldsymbol w)^2 \to \min\limits_{\boldsymbol w},\]

or

\[ \mathcal L(\boldsymbol w) = \Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2 \to \min\limits_{\boldsymbol w}. \]

Observe that

\[ \Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2 = (\boldsymbol{Xw} - \boldsymbol y)^\mathsf{T}(\boldsymbol {Xw} - \boldsymbol y) = \boldsymbol w^\mathsf{T} \boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w - 2\boldsymbol y^\mathsf{T} \boldsymbol{Xw} + \boldsymbol y^\mathsf{T} \boldsymbol y. \]

Now use some matrix calculus:

  • \(\nabla_{\boldsymbol w} (\boldsymbol y^\mathsf{T} \boldsymbol{Xw}) = \boldsymbol X^\mathsf{T} \boldsymbol y\);

  • \(\nabla_{\boldsymbol w}(\boldsymbol w^\mathsf{T} \boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w) = 2\boldsymbol X^\mathsf{T} \boldsymbol X\boldsymbol w\).

That’s why the condition \(\nabla_{\boldsymbol w}\mathcal L(\boldsymbol w)=\boldsymbol 0\) implies that

(22)#\[ \boldsymbol X^\mathsf{T} \boldsymbol X\boldsymbol w - \boldsymbol X^\mathsf{T} \boldsymbol y = \boldsymbol 0\]

If the matrix \(\boldsymbol X\) has full column rank, then the square matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is invertible, and from (22) we obtain that the optimal weights are

(23)#\[ \widehat{\boldsymbol w} = (\boldsymbol X^\mathsf{T} \boldsymbol X)^{-1} \boldsymbol X^\mathsf{T} \boldsymbol y.\]

Note

The matrix \(\boldsymbol X^\dagger = (\boldsymbol X^\mathsf{T} \boldsymbol X)^{-1} \boldsymbol X^\mathsf{T}\) is called Moore-Penrose pseudo inverse.

Q. What is \(\boldsymbol X^\dagger\) if \(\boldsymbol X\) is a square invertible matrix?

Computational issues#

  • Multicollinearity. if there are highly correlated or even linearly dependent features in the design matrix \(\boldsymbol X\), then the matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is nearly singular. One can cope with multicollinearity by feature selection.

  • Curse of dimensionality. If \(d \gg 1\), it is too computationally expensive to invert \(\boldsymbol X^\mathsf{T} \boldsymbol X\). In this case instead analytical solution (23) numerical methods such as gradient descent are used.

  • Ill-conditioning. In practice the matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) has large Condition number. This implies the numerical instability of the solution (23).

12-d linear regression on Boston dataset#

Take all \(12\) features and train multiple regression!

target = boston['medv']
train = boston.drop(['medv', "Unnamed: 0"], axis=1)
X_train = np.hstack([np.ones(506)[:, None], train])
X_T_X = X_train.T.dot(X_train)
print("condition number:", np.linalg.cond(X_T_X))
w = np.linalg.inv(X_T_X).dot(X_train.T).dot(target)
print("intercept:", w[0])
print("coefficients:", w[1:])
print("MSE:", np.mean((X_train.dot(w) - target)**2))
condition number: 137383410.3557865
intercept: 41.61727017593546
coefficients: [-1.21388618e-01  4.69634633e-02  1.34676947e-02  2.83999338e+00
 -1.87580220e+01  3.65811904e+00  3.61071055e-03 -1.49075365e+00
  2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]
MSE: 22.42968143948993

Do the same thing with sklearn:

LR = LinearRegression()
LR.fit(train, target)
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(train, target))
print("MSE:", np.mean((LR.predict(train) - target) ** 2))
intercept: 41.61727017595457
coefficients: [-1.21388618e-01  4.69634633e-02  1.34676947e-02  2.83999338e+00
 -1.87580220e+01  3.65811904e+00  3.61071055e-03 -1.49075365e+00
  2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]
r-score: 0.7343070437613075
MSE: 22.429681439489933

Here we see that the improvement of MSE and \(R^2\)-score to those of 1-d and quadratic regression.

Connection to polynomial regression#

Note that the polynomial regression (16) is a special case of multiple regression. If the polynomial degree is \(m\), then (16) can be written in the form (20) with feature matrix

(24)#\[\begin{split} \boldsymbol X = [\boldsymbol 1 \;\boldsymbol x\;\boldsymbol x^2 \; \ldots\;\boldsymbol x^m] = \begin{pmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^m \\ 1 & x_2 & x_2^2 & \ldots & x_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \ldots & x_n^m \\ \end{pmatrix}\end{split}\]

Exercises#

  1. Prove that matrix \(\boldsymbol X\) from (24) has full column rank if \(m \leqslant n\) and \(x_i \ne x_j\) when \(i\ne j\).

TODO

  • Create references to matrix calculus chapters

  • Develop the idea of connection to polynomial regression

  • Add some simple quizzes

  • Add more datasets

  • Show the computational issues on real and/or simulated datasets