Multiple linear regression#
In practice, there is usually more than one feature. Suppose that we have \(n\) training samples and \(d\) features. Given a training sample \(\boldsymbol x \in \mathbb R^d\), we predict the target variable as follows:

$$
\hat y = w_0 + \sum_{j=1}^d w_j x_j.
$$

To fit the linear regression, minimize MSE or, equivalently, RSS:

$$
\mathcal L(w_0, \ldots, w_d) = \frac 1n \sum_{i=1}^n \Big(y_i - w_0 - \sum_{j=1}^d w_j x_{ij}\Big)^2 \to \min_{w_0, \ldots, w_d}.
$$
2-d linear regression on Boston dataset#
Here we trained a simple linear regression with a single feature (`lstat`). Now add one more feature: `age`.
import pandas as pd
boston = pd.read_csv("../ISLP_datasets/Boston.csv")
boston[['age','lstat','medv']].head()
|   | age  | lstat | medv |
|---|------|-------|------|
| 0 | 65.2 | 4.98  | 24.0 |
| 1 | 78.9 | 9.14  | 21.6 |
| 2 | 61.1 | 4.03  | 34.7 |
| 3 | 45.8 | 2.94  | 33.4 |
| 4 | 54.2 | 5.33  | 36.2 |
Now fit the linear regression model:
from sklearn.linear_model import LinearRegression
import numpy as np
X, y = boston[['lstat', 'age']], boston['medv']
LR = LinearRegression()
LR.fit(X, y)
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
intercept: 33.2227605317929
coefficients: [-1.03206856 0.03454434]
r-score: 0.5512689379421003
MSE: 37.88167709241267
Q. Are these results better or worse than those of simple regression? Quadratic regression?
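To make the comparison concrete, one can refit the single-feature baseline on the same data. A minimal sketch (it simply recomputes the `lstat`-only model from the previous section rather than reusing its stored results):

```python
# Single-feature baseline on lstat, for comparison with the two-feature model
LR1 = LinearRegression()
LR1.fit(boston[['lstat']], y)
print("r-score (lstat only):", LR1.score(boston[['lstat']], y))
print("MSE (lstat only):", np.mean((LR1.predict(boston[['lstat']]) - y) ** 2))
```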
Analytical solution for best fit#
Important
The intercept \(w_0\) is a bit annoying since it has to be written separately all the time. For notational convenience, introduce a fake feature \(x_0 = 1\) for all samples. Then (18) can be rewritten as

$$
\hat y = \sum_{j=0}^d w_j x_j = \boldsymbol x^\mathsf{T} \boldsymbol w,
$$
where \(\boldsymbol w\) is the vector of weights of the linear regression.
Taking into account this note, write the optimization task (19) as

$$
\mathcal L(\boldsymbol w) = \frac 1n \sum_{i=1}^n (\boldsymbol x_i^\mathsf{T} \boldsymbol w - y_i)^2 \to \min_{\boldsymbol w},
$$

or, in matrix form,

$$
\mathcal L(\boldsymbol w) = \frac 1n \Vert \boldsymbol{Xw} - \boldsymbol y \Vert_2^2 \to \min_{\boldsymbol w},
$$

where \(\boldsymbol X\) is the \(n \times (d+1)\) feature matrix with rows \(\boldsymbol x_i^\mathsf{T}\) and \(\boldsymbol y\) is the vector of targets. Observe that

$$
\Vert \boldsymbol{Xw} - \boldsymbol y \Vert_2^2 = (\boldsymbol{Xw} - \boldsymbol y)^\mathsf{T}(\boldsymbol{Xw} - \boldsymbol y)
= \boldsymbol w^\mathsf{T} \boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w - 2\boldsymbol y^\mathsf{T} \boldsymbol{Xw} + \boldsymbol y^\mathsf{T} \boldsymbol y.
$$
Now use some matrix calculus:
- \(\nabla_{\boldsymbol w} (\boldsymbol y^\mathsf{T} \boldsymbol{Xw}) = \boldsymbol X^\mathsf{T} \boldsymbol y\);
- \(\nabla_{\boldsymbol w}(\boldsymbol w^\mathsf{T} \boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w) = 2\boldsymbol X^\mathsf{T} \boldsymbol X\boldsymbol w\).
That’s why the condition \(\nabla_{\boldsymbol w}\mathcal L(\boldsymbol w)=\boldsymbol 0\) implies that

$$
\boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w = \boldsymbol X^\mathsf{T} \boldsymbol y.
$$

If the matrix \(\boldsymbol X\) has full column rank, then the square matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is invertible, and from (22) we obtain that the optimal weights are

$$
\widehat{\boldsymbol w} = (\boldsymbol X^\mathsf{T} \boldsymbol X)^{-1} \boldsymbol X^\mathsf{T} \boldsymbol y.
$$
Note
The matrix \(\boldsymbol X^\dagger = (\boldsymbol X^\mathsf{T} \boldsymbol X)^{-1} \boldsymbol X^\mathsf{T}\) is called the Moore-Penrose pseudoinverse of \(\boldsymbol X\).
Q. What is \(\boldsymbol X^\dagger\) if \(\boldsymbol X\) is a square invertible matrix?
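As a quick numerical check, `np.linalg.pinv` (which computes the pseudoinverse via SVD) agrees with the formula above whenever \(\boldsymbol X\) has full column rank. Here is a sketch on a random matrix (the shape and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))                      # a random 10x3 matrix; full column rank almost surely
X_dagger = np.linalg.inv(X.T @ X) @ X.T           # pseudoinverse via the explicit formula

# np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD
print(np.allclose(X_dagger, np.linalg.pinv(X)))   # True
```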
Computational issues#
- **Multicollinearity.** If there are highly correlated or even linearly dependent features in the design matrix \(\boldsymbol X\), then the matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is nearly singular. One can cope with multicollinearity by feature selection.
- **Curse of dimensionality.** If \(d \gg 1\), it is too computationally expensive to invert \(\boldsymbol X^\mathsf{T} \boldsymbol X\). In this case, instead of the analytical solution (23), numerical methods such as gradient descent are used (see the sketch after this list).
- **Ill-conditioning.** In practice the matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) often has a large condition number, which makes the solution (23) numerically unstable.
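Below is a minimal sketch of gradient descent for the MSE loss on simulated data; the sample size, dimension, learning rate, and number of iterations are arbitrary choices for illustration, not tuned values.

```python
import numpy as np

# Simulated data: n samples, d features, plus a column of ones for the intercept
rng = np.random.default_rng(0)
n, d = 1000, 20
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Analytical solution (23) via the normal equations
w_exact = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent on L(w) = ||Xw - y||^2 / n, whose gradient is (2/n) X^T (Xw - y)
w = np.zeros(d + 1)
lr = 0.1                                   # learning rate
for _ in range(500):
    w -= lr * (2 / n) * (X.T @ (X @ w - y))

print("max |w_gd - w_exact|:", np.max(np.abs(w - w_exact)))
```

With well-scaled features the iterates quickly approach the analytical solution; when \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is ill-conditioned, convergence becomes much slower.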
12-d linear regression on Boston dataset#
Take all \(12\) features and train a multiple regression!
target = boston['medv']
train = boston.drop(['medv', "Unnamed: 0"], axis=1)   # drop the target and the index column
X_train = np.hstack([np.ones(506)[:, None], train])   # prepend a column of ones for the intercept
X_T_X = X_train.T.dot(X_train)
print("condition number:", np.linalg.cond(X_T_X))
w = np.linalg.inv(X_T_X).dot(X_train.T).dot(target)   # analytical solution (23)
print("intercept:", w[0])
print("coefficients:", w[1:])
print("MSE:", np.mean((X_train.dot(w) - target)**2))
condition number: 137383410.3557865
intercept: 41.61727017593546
coefficients: [-1.21388618e-01 4.69634633e-02 1.34676947e-02 2.83999338e+00
-1.87580220e+01 3.65811904e+00 3.61071055e-03 -1.49075365e+00
2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]
MSE: 22.42968143948993
Do the same thing with `sklearn`:
LR = LinearRegression()
LR.fit(train, target)
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(train, target))
print("MSE:", np.mean((LR.predict(train) - target) ** 2))
intercept: 41.61727017595457
coefficients: [-1.21388618e-01 4.69634633e-02 1.34676947e-02 2.83999338e+00
-1.87580220e+01 3.65811904e+00 3.61071055e-03 -1.49075365e+00
2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]
r-score: 0.7343070437613075
MSE: 22.429681439489933
Here we see an improvement in both MSE and \(R^2\)-score compared to the 1-d and quadratic regressions.
Connection to polynomial regression#
Note that the polynomial regression (16) is a special case of multiple regression. If the polynomial degree is \(m\), then (16) can be written in the form (20) with feature matrix

$$
\boldsymbol X = \begin{pmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^m \\
1 & x_2 & x_2^2 & \ldots & x_2^m \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \ldots & x_n^m
\end{pmatrix}.
$$
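For instance, a quadratic fit on `lstat` can be reproduced by building such a feature matrix explicitly and running ordinary multiple regression on it. A minimal sketch, assuming the `boston` frame loaded above (the degree \(m = 2\) is just an example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

m = 2
x = boston['lstat'].values
X_poly = np.vander(x, N=m + 1, increasing=True)   # columns 1, x, x^2

# The first column already plays the role of the intercept, so fit_intercept=False
LR_poly = LinearRegression(fit_intercept=False)
LR_poly.fit(X_poly, boston['medv'])
print("coefficients:", LR_poly.coef_)
print("r-score:", LR_poly.score(X_poly, boston['medv']))
```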
Exercises#
Prove that the matrix \(\boldsymbol X\) from (24) has full column rank if \(m < n\) and \(x_i \ne x_j\) when \(i\ne j\).
TODO
Create references to matrix calculus chapters
Develop the idea of connection to polynomial regression
Add some simple quizzes
Add more datasets
Show the computational issues on real and/or simulated datasets