Linear classification
In a classification task the targets are categorical: \(y \in \mathcal Y = \{1, \ldots, K\}\).
Why not linear regression?
The class labels in this setting are numbers, so we could in principle fit a linear regression to them. Why is this not a good idea?
Answer
The possible problems are:
Inappropriate predictions: \(\widehat y\) could easily fall outside \(\mathcal Y\) (see the sketch after this list)
No inherent ordering: encoding classes as numbers imposes an order and distances between labels that usually have no meaning
Loss function mismatch: MSE is a poor measure of classification quality
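To see the first problem concretely, here is a minimal sketch (NumPy, synthetic data) of least squares fitted to 0/1 labels; its predictions readily leave the label set:

```python
# A minimal sketch (synthetic data): least squares fitted to 0/1 labels
# produces predictions outside the label set {0, 1}.
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(20, 1)), axis=0)
y = (X[:, 0] > 0).astype(float)           # binary labels 0/1

X1 = np.hstack([np.ones_like(X), X])      # add intercept column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ w
print(y_hat.min(), y_hat.max())           # goes below 0 and above 1
```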
Binary case
Suppose that \(\mathcal Y = \{-1, 1\}\). Then we can predict the label of \(\boldsymbol x \in \mathbb R^d\) by the formula

\[
\widehat y = \mathrm{sign}(\boldsymbol x^\mathsf{T} \boldsymbol w).
\]
What about the loss function? Rewrite the misclassification rate (5) as

\[
\frac 1n \sum_{i=1}^n [\widehat y_i \ne y_i] = \frac 1n \sum_{i=1}^n \big[y_i \boldsymbol x_i^\mathsf{T} \boldsymbol w < 0\big],
\]

where the equality holds because for \(y_i \in \{-1, 1\}\) the prediction \(\mathrm{sign}(\boldsymbol x_i^\mathsf{T} \boldsymbol w)\) is wrong exactly when \(y_i\) and \(\boldsymbol x_i^\mathsf{T} \boldsymbol w\) have opposite signs.
Margins
Define the margin of the training sample \((\boldsymbol x_i, y_i)\) as

\[
M_i = y_i \boldsymbol x_i^\mathsf{T} \boldsymbol w.
\]
If the margin is positive, the prediction is correct, and vice versa. Now we can express the loss function (27) in terms of margins:

\[
\frac 1n \sum_{i=1}^n [M_i < 0] = \frac 1n \sum_{i=1}^n \ell(M_i), \quad \ell(M) = [M < 0].
\]
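Both quantities are one-liners in code. A small sketch (NumPy, function names are mine) assuming labels \(y_i \in \{-1, 1\}\):

```python
# A small sketch: margins and the misclassification rate of a linear
# classifier with weights w, assuming labels in {-1, +1}.
import numpy as np

def margins(X, y, w):
    """M_i = y_i * <x_i, w>; positive means a correct prediction."""
    return y * (X @ w)

def misclassification_rate(X, y, w):
    """Average of the 0-1 loss l(M) = [M < 0] over the sample."""
    return np.mean(margins(X, y, w) < 0)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
y = np.array([1, -1, -1])
w = np.array([0.5, -0.2])
print(margins(X, y, w))                   # [ 0.1   0.6  -0.55]
print(misclassification_rate(X, y, w))    # 1/3 of objects misclassified
```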
The function \(\ell\) is discontinuous, which impedes direct optimization of this loss function. That's why \(\ell\) is often substituted by a continuous loss function \(\overline{\ell}(M)\) which estimates \(\ell(M)\) from above: \(\ell(M) \leqslant \overline{\ell}(M)\).
Popular choices of \(\overline{\ell}(M)\) (checked numerically in the sketch after the list):
\(V(M) = (1 - M)_+\) (SVM)
\(H(M) = (-M)_+\) (Hebb’s rule)
\(L(M) = \log_2(1 + e^{-M})\) (logistic regression)
\(Q(M) = (1 - M)^2\) (quadratic)
\(S(M) = \frac 2{1 + e^{M}}\) (sigmoid)
\(E(M) = e^{-M}\) (exponential)
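For a quick numerical check of the upper-bound property, the sketch below (all names are mine) evaluates each surrogate on a grid of margins. One caveat: Hebb's rule \(H(M) = (-M)_+\) in fact dips below the 0-1 loss for \(M \in (-1, 0)\), so it is left out of the assertion:

```python
# A numerical sanity check of the bound l(M) <= overline{l}(M)
# on a grid of margins.
import numpy as np

def zero_one(M): return (M < 0).astype(float)       # l(M) = [M < 0]
def V(M):        return np.maximum(0.0, 1.0 - M)    # hinge (SVM)
def H(M):        return np.maximum(0.0, -M)         # Hebb's rule
def L(M):        return np.log2(1.0 + np.exp(-M))   # logistic regression
def Q(M):        return (1.0 - M) ** 2              # quadratic
def S(M):        return 2.0 / (1.0 + np.exp(M))     # sigmoid
def E(M):        return np.exp(-M)                  # exponential

M = np.linspace(-3, 3, 601)
# H(M) = (-M)_+ falls below the 0-1 loss for M in (-1, 0),
# so it is excluded from the upper-bound assertion.
for f in (V, L, Q, S, E):
    assert np.all(f(M) >= zero_one(M)), f.__name__
print("V, L, Q, S, E dominate the 0-1 loss on the grid")
```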
Predicting probabilities
Another common way to classify objects with labels from \(\mathcal Y = \{0, 1\}\) is to predict the probability of the positive class:

\[
\widehat y = \mathbb P(y = 1 \mid \boldsymbol x) \in [0, 1].
\]
Linear regression \(\widehat y = \boldsymbol x^\mathsf{T} \boldsymbol w\) is not suitable for this purpose since \(\widehat y\) here can be any real number. However, it is not difficult to convert it to the desired probability: just apply the sigmoid function

\[
\sigma(t) = \frac 1{1 + e^{-t}}
\]

to the linear output, obtaining \(\widehat y = \sigma(\boldsymbol x^\mathsf{T} \boldsymbol w) \in (0, 1)\).
This is how logistic regression works.
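A minimal sketch of this forward pass (the weights are arbitrary here; a real logistic regression learns \(\boldsymbol w\) by maximizing the likelihood of the training labels):

```python
# A minimal sketch: squash linear scores into (0, 1) with the sigmoid.
# The weights are arbitrary, chosen only for illustration.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

X = np.array([[0.5, 1.2], [-2.0, 0.3], [3.1, -0.7]])
w = np.array([0.8, -0.5])

scores = X @ w                       # arbitrary real numbers
probs = sigmoid(scores)              # valid probabilities of class 1
labels = (probs >= 0.5).astype(int)  # threshold at 1/2
print(probs, labels)
```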