Multiple linear regression#
In practice, there is usually more than one feature. Suppose that we have \(n\) training samples and \(d\) features. Given a training sample \(\boldsymbol x \in \mathbb R^d\), we predict the target variable as follows:

$$
\hat y = w_0 + \sum_{j=1}^d w_j x_j.
$$

To fit the linear regression, minimize MSE or, equivalently, RSS:

$$
\mathcal L(w_0, \ldots, w_d) = \frac 1n \sum_{i=1}^n \Big(y_i - w_0 - \sum_{j=1}^d w_j x_{ij}\Big)^2 \to \min_{w_0, \ldots, w_d}.
$$
2-d linear regression on Boston dataset#
Here we trained a simple linear regression with a single feature (`lstat`). Now add one more feature: `age`.
import pandas as pd
boston = pd.read_csv("../ISLP_datasets/Boston.csv")
boston[['age','lstat','medv']].head()
|   | age  | lstat | medv |
|---|------|-------|------|
| 0 | 65.2 | 4.98  | 24.0 |
| 1 | 78.9 | 9.14  | 21.6 |
| 2 | 61.1 | 4.03  | 34.7 |
| 3 | 45.8 | 2.94  | 33.4 |
| 4 | 54.2 | 5.33  | 36.2 |
Now fit the linear regression model:
from sklearn.linear_model import LinearRegression
import numpy as np
X, y = boston[['lstat', 'age']], boston['medv']
LR = LinearRegression()
LR.fit(X, y)
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))
intercept: 33.2227605317929
coefficients: [-1.03206856 0.03454434]
r-score: 0.5512689379421003
MSE: 37.88167709241267
Q. Are these results better or worse than those of simple regression? Quadratic regression?
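To make the comparison concrete, one can refit the single-feature baseline on the same data. A minimal sketch (it simply recomputes the `lstat`-only model from the previous section rather than reusing its stored results):

```python
# Single-feature baseline on lstat, for comparison with the two-feature model
LR1 = LinearRegression()
LR1.fit(boston[['lstat']], y)
print("r-score (lstat only):", LR1.score(boston[['lstat']], y))
print("MSE (lstat only):", np.mean((LR1.predict(boston[['lstat']]) - y) ** 2))
```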
Analytical solution for best fit#
Important
The intercept \(w_0\) is a bit annoying since it has to be written separately all the time. For notational convenience, introduce a fake feature \(x_0 = 1\) for all samples. Then (18) can be rewritten as

$$
\hat y = \sum_{j=0}^d w_j x_j = \boldsymbol x^\mathsf{T} \boldsymbol w,
$$
where \(\boldsymbol w\) is the vector of weights of the linear regression.
Taking into account this note, write the optimization task (19) as

$$
\mathcal L(\boldsymbol w) = \frac 1n \sum_{i=1}^n (\boldsymbol x_i^\mathsf{T} \boldsymbol w - y_i)^2 \to \min_{\boldsymbol w},
$$

or, in matrix form,

$$
\mathcal L(\boldsymbol w) = \frac 1n \Vert \boldsymbol{Xw} - \boldsymbol y \Vert_2^2 \to \min_{\boldsymbol w},
$$

where \(\boldsymbol X\) is the \(n \times (d+1)\) feature matrix with rows \(\boldsymbol x_i^\mathsf{T}\) and \(\boldsymbol y\) is the vector of targets. Observe that

$$
\Vert \boldsymbol{Xw} - \boldsymbol y \Vert_2^2 = (\boldsymbol{Xw} - \boldsymbol y)^\mathsf{T}(\boldsymbol{Xw} - \boldsymbol y)
= \boldsymbol w^\mathsf{T} \boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w - 2\boldsymbol y^\mathsf{T} \boldsymbol{Xw} + \boldsymbol y^\mathsf{T} \boldsymbol y.
$$
Now use some matrix calculus:
- \(\nabla_{\boldsymbol w} (\boldsymbol y^\mathsf{T} \boldsymbol{Xw}) = \boldsymbol X^\mathsf{T} \boldsymbol y\);
- \(\nabla_{\boldsymbol w}(\boldsymbol w^\mathsf{T} \boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w) = 2\boldsymbol X^\mathsf{T} \boldsymbol X\boldsymbol w\).
That’s why the condition \(\nabla_{\boldsymbol w}\mathcal L(\boldsymbol w)=\boldsymbol 0\) implies that

$$
\boldsymbol X^\mathsf{T} \boldsymbol X \boldsymbol w = \boldsymbol X^\mathsf{T} \boldsymbol y.
$$

If the matrix \(\boldsymbol X\) has full column rank, then the square matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is invertible, and from (22) we obtain that the optimal weights are

$$
\widehat{\boldsymbol w} = (\boldsymbol X^\mathsf{T} \boldsymbol X)^{-1} \boldsymbol X^\mathsf{T} \boldsymbol y.
$$
Note
The matrix \(\boldsymbol X^\dagger = (\boldsymbol X^\mathsf{T} \boldsymbol X)^{-1} \boldsymbol X^\mathsf{T}\) is called the Moore-Penrose pseudoinverse of \(\boldsymbol X\).
Q. What is \(\boldsymbol X^\dagger\) if \(\boldsymbol X\) is a square invertible matrix?
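As a quick numerical check, `np.linalg.pinv` (which computes the pseudoinverse via SVD) agrees with the formula above whenever \(\boldsymbol X\) has full column rank. Here is a sketch on a random matrix (the shape and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))                      # a random 10x3 matrix; full column rank almost surely
X_dagger = np.linalg.inv(X.T @ X) @ X.T           # pseudoinverse via the explicit formula

# np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD
print(np.allclose(X_dagger, np.linalg.pinv(X)))   # True
```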
Computational issues#
- **Multicollinearity.** If there are highly correlated or even linearly dependent features in the design matrix \(\boldsymbol X\), then the matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is nearly singular. One can cope with multicollinearity by feature selection.
- **Curse of dimensionality.** If \(d \gg 1\), it is too computationally expensive to invert \(\boldsymbol X^\mathsf{T} \boldsymbol X\). In this case, instead of the analytical solution (23), numerical methods such as gradient descent are used (see the sketch after this list).
- **Ill-conditioning.** In practice the matrix \(\boldsymbol X^\mathsf{T} \boldsymbol X\) often has a large condition number, which makes the solution (23) numerically unstable.
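Below is a minimal sketch of gradient descent for the MSE loss on simulated data; the sample size, dimension, learning rate, and number of iterations are arbitrary choices for illustration, not tuned values.

```python
import numpy as np

# Simulated data: n samples, d features, plus a column of ones for the intercept
rng = np.random.default_rng(0)
n, d = 1000, 20
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Analytical solution (23) via the normal equations
w_exact = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent on L(w) = ||Xw - y||^2 / n, whose gradient is (2/n) X^T (Xw - y)
w = np.zeros(d + 1)
lr = 0.1                                   # learning rate
for _ in range(500):
    w -= lr * (2 / n) * (X.T @ (X @ w - y))

print("max |w_gd - w_exact|:", np.max(np.abs(w - w_exact)))
```

With well-scaled features the iterates quickly approach the analytical solution; when \(\boldsymbol X^\mathsf{T} \boldsymbol X\) is ill-conditioned, convergence becomes much slower.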
12-d linear regression on Boston dataset#
Take all \(12\) features and train a multiple regression!
target = boston['medv']
train = boston.drop(['medv', "Unnamed: 0"], axis=1)   # drop the target and the index column
X_train = np.hstack([np.ones(506)[:, None], train])   # prepend a column of ones for the intercept
X_T_X = X_train.T.dot(X_train)
print("condition number:", np.linalg.cond(X_T_X))
w = np.linalg.inv(X_T_X).dot(X_train.T).dot(target)   # analytical solution (23)
print("intercept:", w[0])
print("coefficients:", w[1:])
print("MSE:", np.mean((X_train.dot(w) - target)**2))
condition number: 137383410.3557865
intercept: 41.61727017593546
coefficients: [-1.21388618e-01 4.69634633e-02 1.34676947e-02 2.83999338e+00
-1.87580220e+01 3.65811904e+00 3.61071055e-03 -1.49075365e+00
2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]
MSE: 22.42968143948993
Do the same thing with `sklearn`:
LR = LinearRegression()
LR.fit(train, target)
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(train, target))
print("MSE:", np.mean((LR.predict(train) - target) ** 2))
intercept: 41.61727017595457
coefficients: [-1.21388618e-01 4.69634633e-02 1.34676947e-02 2.83999338e+00
-1.87580220e+01 3.65811904e+00 3.61071055e-03 -1.49075365e+00
2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]
r-score: 0.7343070437613075
MSE: 22.429681439489933
Here we see an improvement in both MSE and \(R^2\)-score compared to the 1-d and quadratic regressions.
Connection to polynomial regression#
Note that the polynomial regression (16) is a special case of multiple regression. If the polynomial degree is \(m\), then (16) can be written in the form (20) with feature matrix

$$
\boldsymbol X = \begin{pmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^m \\
1 & x_2 & x_2^2 & \ldots & x_2^m \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \ldots & x_n^m
\end{pmatrix}.
$$
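For instance, a quadratic fit on `lstat` can be reproduced by building such a feature matrix explicitly and running ordinary multiple regression on it. A minimal sketch, assuming the `boston` frame loaded above (the degree \(m = 2\) is just an example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

m = 2
x = boston['lstat'].values
X_poly = np.vander(x, N=m + 1, increasing=True)   # columns 1, x, x^2

# The first column already plays the role of the intercept, so fit_intercept=False
LR_poly = LinearRegression(fit_intercept=False)
LR_poly.fit(X_poly, boston['medv'])
print("coefficients:", LR_poly.coef_)
print("r-score:", LR_poly.score(X_poly, boston['medv']))
```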
Exercises#
Prove that the matrix \(\boldsymbol X\) from (24) has full column rank if \(m < n\) and \(x_i \ne x_j\) when \(i\ne j\).
TODO
Create references to matrix calculus chapters
Develop the idea of connection to polynomial regression
Add some simple quizzes
Add more datasets
Show the computational issues on real and/or simulated datasets