Multinomial logistic regression#

Let \(\mathcal Y\) be a finite set of classes, e.g. \(\mathcal Y=\{1, 2, \ldots, K\}\), and let the training dataset be

\[ \mathcal D = \{(\boldsymbol x_i, y_i)\}_{i=1}^n, \quad \boldsymbol x_i\in\mathbb R^d, \quad y_i\in \mathcal Y. \]

Multinomial logistic regression predicts a vector of probabilities

\[ \boldsymbol{\widehat y} = (p_1,\ldots, p_K), \quad p_k > 0, \quad \sum\limits_{k=1}^K p_k = 1. \]

How can we predict these probabilities? We can use a linear model whose output is a vector of \(K\) numbers:

\[ z_k = \boldsymbol x^\mathsf{T} \boldsymbol w_k, \quad \boldsymbol w_k \in \mathbb R^d, \quad k = 1, \ldots, K. \]

Now convert the vector \(\boldsymbol z \in \mathbb R^K\) to the vector of probabilities \(\boldsymbol{\widehat y}\) via softmax:

\[ \boldsymbol{\widehat y} = \mathrm{Softmax}(\boldsymbol z) = \bigg(\frac{e^{z_1}}{\sum_{j=1}^K e^{z_j}}, \ldots, \frac{e^{z_K}}{\sum_{j=1}^K e^{z_j}}\bigg). \]
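As a quick illustration, here is a minimal NumPy sketch of the softmax transformation (the helper name and the example vector are ours; subtracting the maximum is a standard numerical-stability trick that does not change the result):

import numpy as np

def softmax(z):
    # shift by the maximum for numerical stability; the result is unchanged
    e = np.exp(z - z.max())
    return e / e.sum()   # positive values that sum to 1

softmax(np.array([1.0, 2.0, 0.5]))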

If we need to pick a single class, we can choose the most probable one:

\[ \arg\max\limits_{1\leqslant k \leqslant K} p_k = \arg\max\limits_{1\leqslant k \leqslant K} \Big\{\frac{\exp(\boldsymbol x^\mathsf{T} \boldsymbol w_k)}{\sum_{j=1}^K \exp(\boldsymbol x^\mathsf{T} \boldsymbol w_j)}\Big\} = \arg\max\limits_{1\leqslant k \leqslant K} \boldsymbol x^\mathsf{T} \boldsymbol w_k, \]

since the denominator does not depend on \(k\) and \(\exp\) is increasing.

The parameters \(\boldsymbol w_k\) naturally form a matrix

\[ \boldsymbol W = [\boldsymbol w_1 \ldots \boldsymbol w_K] \]

Q. What is the shape of this matrix? How many parameters does multinomial logistic regression have?

Loss function#

The optimal parameters \(\boldsymbol W\) are solutions of the following optimization problem (maximization of the log-likelihood of the training labels):

(30)#\[\mathcal L (\boldsymbol W) = \sum\limits_{i=1}^n \bigg(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{y_i} -\log\Big(\sum\limits_{k=1}^K \exp(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{k})\Big)\bigg) \to \max\limits_{\boldsymbol w_{1}, \ldots, \boldsymbol w_{K}}\]

If the targets \(y_i\) are one-hot encoded, then they form a matrix

\[\begin{split} \boldsymbol Y = \begin{pmatrix} \boldsymbol y_1^\mathsf{T} \\ \vdots \\ \boldsymbol y_n^\mathsf{T} \end{pmatrix}, \quad y_{ik} \geqslant 0, \quad \sum\limits_{k=1}^K y_{ik} = 1. \end{split}\]

Accordingly, the objective (30) can be written as

(31)#\[\mathcal L (\boldsymbol W) = \sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik}\log\bigg(\frac{\exp(\boldsymbol x_i^\mathsf{T}\boldsymbol w_k)}{\sum\limits_{j=1}^K\exp(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{j})}\bigg) = \sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \log \widehat y_{ik},\]

and this is the cross-entropy loss (7) taken with the opposite sign.
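As a sketch (not scikit-learn's implementation), the objective (31) can be computed in NumPy for a design matrix X of shape (n, d), a weight matrix W of shape (d, K) and a one-hot target matrix Y of shape (n, K):

import numpy as np

def log_likelihood(X, W, Y):
    Z = X @ W                              # scores z_ik = x_i^T w_k, shape (n, K)
    Z = Z - Z.max(axis=1, keepdims=True)   # stabilization; cancels in the softmax
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)      # row-wise softmax, p_ik = y_hat_ik
    return np.sum(Y * np.log(P))           # sum_i sum_k y_ik * log(y_hat_ik)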

Regularized version (with a penalty on the Frobenius norm of \(\boldsymbol W\)):

\[ \sum\limits_{i=1}^n \bigg(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{y_i} -\log\Big(\sum\limits_{k=1}^K \exp(\boldsymbol x_i^\mathsf{T}\boldsymbol w_{k})\Big)\bigg) - C\Vert \boldsymbol W \Vert_F^2 \to \max\limits_{\boldsymbol W}, \quad C > 0. \]

Q. Why is there a minus sign in front of the regularization term?
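Continuing the sketch above, the penalized objective only adds the squared Frobenius norm of \(\boldsymbol W\); here C is a hypothetical regularization strength:

def regularized_objective(X, W, Y, C=1.0):
    # reuses log_likelihood from the previous sketch; ||W||_F^2 is the sum of squared entries
    return log_likelihood(X, W, Y) - C * np.sum(W ** 2)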

Example: MNIST#

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

X = X.astype(float).values / 255   # scale pixel intensities from [0, 255] to [0, 1]
Y = Y.astype(int).values           # labels come back as strings, convert them to integers

Visualize data#

Visualize some random samples:

(Figure: a random sample of MNIST digit images.)
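The plotting helper is hidden in this notebook and is reused later as plot_digits. A hypothetical sketch of such a helper could look like this (the signature and layout are our assumptions):

def plot_digits(X, y_true, y_pred=None, n=10):
    # show n random 28x28 images with the true label (and the predicted one, if given)
    idx = np.random.choice(len(X), size=n, replace=False)
    plt.figure(figsize=(1.5 * n, 2))
    for j, i in enumerate(idx):
        plt.subplot(1, n, j + 1)
        plt.imshow(X[i].reshape(28, 28), cmap='gray')
        plt.title(str(y_true[i]) if y_pred is None else f"{y_true[i]} / {y_pred[i]}")
        plt.axis('off')
    plt.show()

plot_digits(X, Y)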

Splitting into train and test#

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))

Check that the classes are balanced:

np.unique(y_test, return_counts=True)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([ 945, 1233,  995, 1036,  972,  903,  970, 1000,  950,  996]))
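The counts are close to uniform. If exact class proportions mattered, a stratified split could have been used instead (not what was done above):

# hypothetical alternative: a stratified split that preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000, stratify=Y)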

Fit and evaluate#

Fit the logistic regression:

%%time
LR = LogisticRegression(max_iter=100)
LR.fit(X_train, y_train)
CPU times: user 35.3 s, sys: 641 ms, total: 36 s
Wall time: 19 s
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
LogisticRegression()
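The warning means that lbfgs stopped at the iteration limit before converging. One way to address it (not rerun here) is to give the solver a larger budget:

# not executed above: the same model with more lbfgs iterations allowed
LR = LogisticRegression(max_iter=1000)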

Make predictions:

y_hat = LR.predict(X_test)
y_hat
array([5, 4, 4, ..., 3, 2, 3])

We can also predict probabilities:

y_proba = LR.predict_proba(X_test)
y_proba[:3]
array([[1.65095773e-04, 1.43385260e-07, 1.14780168e-05, 2.99636017e-01,
        5.98926346e-07, 6.39615667e-01, 1.80372814e-07, 1.76480310e-07,
        5.86493420e-02, 1.92130044e-03],
       [1.21475181e-04, 1.49673625e-06, 1.08891109e-03, 1.24093043e-05,
        9.14920772e-01, 1.95035318e-04, 1.16798504e-04, 1.85831413e-04,
        1.03925873e-02, 7.29646828e-02],
       [9.63965076e-09, 9.52335705e-09, 1.80277138e-07, 4.04423620e-09,
        9.99908083e-01, 2.32334829e-06, 7.00253101e-06, 8.06028699e-06,
        6.62953819e-06, 6.76977960e-05]])
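Each row of y_proba is a probability distribution over the ten digits, so the rows sum to one up to floating-point error:

np.allclose(y_proba.sum(axis=1), 1.0)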

Calculate metrics:

print("Accuracy:", accuracy_score(y_test, y_hat))
Accuracy: 0.9215

Visualize performance#

plt.figure(figsize=(10, 8))
plt.title("Logistic regression on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);
(Figure: confusion matrix heatmap "Logistic regression on MNIST".)

Plot some samples with predictions and ground truths:

plot_digits(X_test, y_test, y_hat)
(Figure: test digits shown with predicted and true labels.)

Exercises#

  1. Denote \(\boldsymbol{\widehat Y} = (\widehat y_{ik}) = \mathrm{Softmax}(\boldsymbol {XW})\) (softmax is applied to each row). Rewrite the loss function (31) in matrix form.