Probabilistic models for logistic regression
Binary logistic regression predicts the probability of the positive class:
\[
\mathbb P(y = 1 \vert \boldsymbol x, \boldsymbol w) = \sigma(\boldsymbol x^{\mathsf T} \boldsymbol w), \quad \sigma(t) = \frac 1{1 + e^{-t}}.
\]
Q. What is the probability of the negative class \(\mathbb P(y = 0 \vert \boldsymbol x, \boldsymbol w)\)?
It follows that
\[
p(y \vert \boldsymbol x, \boldsymbol w) = \mathrm{Bern}(\sigma(\boldsymbol x^{\mathsf T} \boldsymbol w)).
\]
Q. What is \(p(y \vert \boldsymbol x, \boldsymbol w)\) if \(y \in\{-1, +1\}\)?
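To make the model concrete, here is a minimal NumPy sketch that computes \(\sigma(\boldsymbol x^{\mathsf T} \boldsymbol w)\) for a batch of samples; the feature matrix `X` and weights `w` below are made-up placeholders.

```python
import numpy as np

def sigmoid(t):
    # logistic function sigma(t) = 1 / (1 + exp(-t))
    return 1 / (1 + np.exp(-t))

# toy data: n = 3 samples, d = 2 features (placeholder values)
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
w = np.array([0.4, -0.7])

p_pos = sigmoid(X @ w)  # P(y = 1 | x, w) for each sample
print(p_pos)
```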
MLE
Binary case
Note that if \(\xi \sim \mathrm{Bern}(q)\) then \(\mathbb P(\xi = t) = q^{t} (1-q)^{1-t}\) for \(t \in \{0, 1\}\).
Denote \(\widehat y_i = \sigma(\boldsymbol x_i^{\mathsf T} \boldsymbol w)\) and write down the negative log-likelihood:
\[\begin{split}
\mathrm{NLL}(\boldsymbol w) = -\sum_{i=1}^n \log p(y_i \vert \boldsymbol x_i, \boldsymbol w) &= -\sum_{i=1}^n \log\big(\widehat y_i^{y_i} (1 - \widehat y_i)^{1 - y_i}\big) \\
&= -\sum_{i=1}^n \big(y_i\log(\widehat y_i) + (1-y_i) \log(1 - \widehat y_i)\big).
\end{split}\]
This is exactly the binary cross-entropy loss between true labels \(y_i\) and predictions \(\widehat y_i\).
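As a sanity check, this loss is easy to compute directly; below is a small NumPy sketch, assuming labels `y` in \(\{0, 1\}\) and predicted probabilities `y_hat` (both placeholder arrays).

```python
import numpy as np

def binary_nll(y, y_hat, eps=1e-12):
    # negative log-likelihood = binary cross-entropy, summed over samples
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.6, 0.8])
print(binary_nll(y, y_hat))
```

Up to averaging over \(n\), the same quantity is returned by standard implementations such as `sklearn.metrics.log_loss`.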
Multinomial case
Suppose that \(\mathcal Y = \{1, \ldots, K\}\), then prediction on a sample \(\boldsymbol x \in \mathbb R^d\) is a vector of probabilities
\[
\boldsymbol{\widehat y} = \mathrm{Softmax}(\boldsymbol x^{\mathsf T} \boldsymbol W), \quad \boldsymbol W \in \mathbb R^{d\times K}.
\]
Thus,
\[
p(\boldsymbol y \vert \boldsymbol X, \boldsymbol W) = \mathrm{Cat}(\mathrm{Softmax}(\boldsymbol X \boldsymbol W)),
\]
where \(\mathrm{Cat}(\boldsymbol p)\) is the categorical (or multinoulli) distribution over the categories \(\{1, \ldots, K\}\).
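For illustration, the row of predicted probabilities can be computed with a softmax over the scores \(\boldsymbol x^{\mathsf T} \boldsymbol W\); the sketch below uses placeholder values for `W` and `x`.

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

d, K = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d, K))  # placeholder weight matrix
x = rng.normal(size=d)       # placeholder sample
y_hat = softmax(x @ W)       # parameters of Cat: K class probabilities
print(y_hat, y_hat.sum())
```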
Now write down the negative log-likelihood, encoding each label \(y_i\) as a one-hot vector (\(y_{ik} = 1\) if the \(i\)-th sample belongs to class \(k\), and \(y_{ik} = 0\) otherwise):
\[
\mathrm{NLL}(\boldsymbol W) = -\log\prod_{i=1}^n \prod_{k=1}^K \widehat y_{ik}^{y_{ik}} = -\sum\limits_{i=1}^n \sum_{k=1}^K y_{ik} \log \widehat y_{ik}.
\]
This is the cross-entropy loss.
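In code this is the cross-entropy between one-hot labels and predicted probabilities; a minimal NumPy sketch with placeholder arrays:

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    # Y: one-hot labels of shape (n, K); Y_hat: predicted probabilities of the same shape
    Y_hat = np.clip(Y_hat, eps, 1.0)
    return -np.sum(Y * np.log(Y_hat))

Y = np.array([[1, 0, 0], [0, 0, 1]])                   # one-hot labels (placeholder)
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])   # predicted probabilities (placeholder)
print(cross_entropy(Y, Y_hat))
```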
MAP
Consider the binary case. As for linear regression, use a Gaussian prior:
\[
p(\boldsymbol w) = \mathcal N(\boldsymbol 0, \tau^2\boldsymbol I).
\]
Then, up to an additive constant,
\[
-\log p(\boldsymbol w \vert \boldsymbol X, \boldsymbol y) = -\sum_{i=1}^n \big(y_i\log(\widehat y_i) + (1-y_i) \log(1 - \widehat y_i)\big) + \frac 1{2\tau^2} \sum\limits_{j=1}^d w_j^2,
\]
where \(\widehat y_i = \sigma(\boldsymbol x_i^{\mathsf T} \boldsymbol w)\). Hence,
\[
\boldsymbol {\widehat w}_{\mathrm{MAP}} = \arg\min\limits_{\boldsymbol w}
\bigg(-\sum\limits_{i=1}^n \big(y_i\log(\widehat y_i) + (1-y_i) \log(1 - \widehat y_i)\big)+ \frac{1}{2\tau^2} \Vert \boldsymbol w \Vert_2^2\bigg).
\]
This is \(L_2\)-regularization of binary logistic regression. Taking a Laplacian prior instead, we obtain \(L_1\)-regularization.
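A sketch of the resulting MAP objective, assuming labels in \(\{0, 1\}\) and a placeholder dataset; `tau` is the prior scale \(\tau\). (In off-the-shelf solvers the same penalty usually enters through an inverse regularization strength, e.g. the `C` parameter of scikit-learn's `LogisticRegression`.)

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def neg_log_posterior(w, X, y, tau, eps=1e-12):
    # binary cross-entropy plus the L2 penalty coming from the Gaussian prior
    y_hat = np.clip(sigmoid(X @ w), eps, 1 - eps)
    nll = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return nll + np.sum(w ** 2) / (2 * tau ** 2)

# placeholder data
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
y = np.array([1, 0, 0])
w = np.zeros(2)
print(neg_log_posterior(w, X, y, tau=1.0))
```

Minimizing this function over \(\boldsymbol w\) (e.g. by gradient descent) gives \(\boldsymbol{\widehat w}_{\mathrm{MAP}}\).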