# Probabilistic models for logistic regression

Binary logistic regression predicts the probability of the positive class:

\[ \mathbb P(y = 1 \vert \boldsymbol x, \boldsymbol w) = \sigma(\boldsymbol x^{\mathsf T} \boldsymbol w), \quad \sigma(t) = \frac 1{1 + e^{-t}}. \]

Q. What is the probability of the negative class, \(\mathbb P(y = 0 \vert \boldsymbol x, \boldsymbol w)\)?

It follows that

\[ p(y \vert \boldsymbol x, \boldsymbol w) = \mathrm{Bern}(\sigma(\boldsymbol x^{\mathsf T} \boldsymbol w)) \]

Q. What is \(p(y \vert \boldsymbol x, \boldsymbol w)\) if \(y \in\{-1, +1\}\)?
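Returning to the \(\{0, 1\}\) encoding, here is a minimal NumPy sketch of the model above; the weight vector `w` and the sample `x` are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# arbitrary illustrative weights and a single sample
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])

p_pos = sigmoid(x @ w)   # P(y = 1 | x, w)
p_neg = 1.0 - p_pos      # P(y = 0 | x, w)
print(p_pos, p_neg)      # the two probabilities sum to 1
```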

## MLE

### Binary case

Note that if \(\xi \sim \mathrm{Bern}(q)\) then \(\mathbb P(\xi = t) = q^{t} (1-q)^{1-t}\) for \(t \in \{0, 1\}\). Denote \(\widehat y_i = \sigma(\boldsymbol x_i^{\mathsf T} \boldsymbol w)\) and write down the negative log-likelihood:

\[\begin{split} \begin{multline*} \mathrm{NLL}(\boldsymbol w) = -\sum_{i=1}^n \log p(y_i \vert \boldsymbol x_i, \boldsymbol w) = -\sum_{i=1}^n \log\big(\widehat y_i^{y_i} (1 - \widehat y_i)^{1 - y_i}\big) = \\ -\sum_{i=1}^n \big(y_i\log(\widehat y_i) + (1-y_i) \log(1 - \widehat y_i)\big). \end{multline*} \end{split}\]

This is exactly the binary cross-entropy loss between true labels \(y_i\) and predictions \(\widehat y_i\).
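A direct translation of this loss into NumPy might look as follows (a sketch, not an optimized implementation; the small constant `eps` only guards against \(\log 0\)):

```python
import numpy as np

def binary_nll(w, X, y, eps=1e-12):
    """Binary cross-entropy (negative log-likelihood) for labels y in {0, 1}."""
    y_hat = 1.0 / (1.0 + np.exp(-X @ w))     # sigma(x_i^T w) for every row of X
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```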

### Multinomial case

Suppose that \(\mathcal Y = \{1, \ldots, K\}\). Then the prediction for a sample \(\boldsymbol x \in \mathbb R^d\) is a vector of probabilities

\[ \boldsymbol{\widehat y} = \mathrm{Softmax}(\boldsymbol x^{\mathsf T} \boldsymbol W), \quad \boldsymbol W \in \mathbb R^{d\times K}. \]

Thus,

\[ p(\boldsymbol y \vert \boldsymbol X, \boldsymbol W) = \mathrm{Cat}(\mathrm{Softmax}(\boldsymbol X \boldsymbol W)), \]

where \(\mathrm{Cat}(\boldsymbol p)\) is the categorical (or multinoulli) distribution over the categories \(\{1, \ldots, K\}\).

Now encode the labels as one-hot vectors (\(y_{ik} = 1\) if the \(i\)-th label equals \(k\), otherwise \(y_{ik} = 0\)) and write down the negative log-likelihood:

\[ \mathrm{NLL}(\boldsymbol W) = -\log\prod_{i=1}^n \prod_{k=1}^K \widehat y_{ik}^{y_{ik}} = -\sum\limits_{i=1}^n \sum_{k=1}^K y_{ik} \log \widehat y_{ik}. \]

This is the cross-entropy loss.
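The same quantities in NumPy, assuming the labels are stored as one-hot rows of a matrix `Y` (a sketch under that assumption):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of the logit matrix Z = X @ W, shape (n, K)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row maxima for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multinomial_nll(W, X, Y, eps=1e-12):
    """Cross-entropy loss; Y contains one-hot encoded labels, shape (n, K)."""
    Y_hat = softmax(X @ W)                 # predicted class probabilities
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)))
```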

## MAP

Consider the binary case. As in linear regression, use a Gaussian prior:

\[ p(\boldsymbol w) = \mathcal N(\boldsymbol 0, \tau^2\boldsymbol I). \]

Then, up to an additive constant that does not depend on \(\boldsymbol w\),

\[ -\log p(\boldsymbol w \vert \boldsymbol X, \boldsymbol y) = -\sum_{i=1}^n \big(y_i\log(\widehat y_i) + (1-y_i) \log(1 - \widehat y_i)\big) + \frac 1{2\tau^2} \sum\limits_{j=1}^d w_j^2, \]

where \(\widehat y_i = \sigma(\boldsymbol x_i^{\mathsf T} \boldsymbol w)\). Hence,

\[ \boldsymbol {\widehat w}_{\mathrm{MAP}} = \arg\min\limits_{\boldsymbol w} \bigg(-\sum\limits_{i=1}^n \big(y_i\log(\widehat y_i) + (1-y_i) \log(1 - \widehat y_i)\big)+ \frac{1}{2\tau^2} \Vert \boldsymbol w \Vert_2^2\bigg). \]

This is \(L_2\)-regularized binary logistic regression. Taking a Laplace prior instead, we obtain \(L_1\) regularization.
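As a sketch, the MAP objective is just the binary NLL from above plus the quadratic penalty; minimizing it over \(\boldsymbol w\) (e.g. with `scipy.optimize.minimize`) gives \(\boldsymbol{\widehat w}_{\mathrm{MAP}}\). The code below assumes every coordinate of \(\boldsymbol w\) is penalized, i.e. there is no separate unpenalized bias term.

```python
import numpy as np

def map_objective(w, X, y, tau, eps=1e-12):
    """Negative log-posterior: binary NLL plus the Gaussian-prior penalty ||w||^2 / (2 tau^2)."""
    y_hat = np.clip(1.0 / (1.0 + np.exp(-X @ w)), eps, 1.0 - eps)
    nll = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return nll + w @ w / (2.0 * tau ** 2)
```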