Logistic regression#

Simple logistic regression#

Just like in Simple linear regression, if we have only one feature \(x\) and two classes, \(0\) (negative) and \(1\) (positive), we introduce two parameters: the intercept \(w_0\) and the slope \(w_1\). Then we put

\[ \widehat y = \mathbb P(\text{class }1) = \sigma(w_0 + w_1 x) \]

where \(\sigma(t) = \frac 1{1 + e^{-t}}\) is the sigmoid function.

(Figure: graph of the sigmoid function \(\sigma(t)\).)
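A minimal NumPy sketch (with made-up values of \(w_0\), \(w_1\) and \(x\), not taken from any dataset) showing how the sigmoid turns the linear score into a probability in \((0, 1)\):

import numpy as np

def sigmoid(t):
    # maps any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-t))

w0, w1 = -2.0, 0.5                   # hypothetical intercept and slope
x = np.array([-1.0, 0.0, 4.0, 10.0]) # hypothetical feature values
print(sigmoid(w0 + w1 * x))          # predicted P(class 1) for each x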

Q. What is \(\mathbb P(\text{class } 0)\)?

Q. How can we predict classes if we know the predicted probabilities?

Binary logistic regression#

Now suppose that each sample is described by \(d\) numeric features, and we have a dataset \(\mathcal D = \{\boldsymbol x_i, y_i\}_{i=1}^n\), \(y_i \in \{0, 1\}\), \(\boldsymbol x_i \in \mathbb R^d\). The logistic regression model predicts the probability of the positive class:

\[ \widehat y_i = \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w) = \mathbb P(\boldsymbol x_i \in \text{class }1). \]

The linear output \(\boldsymbol x^\mathsf{T} \boldsymbol w\) is also called the logit.

Note

As in the linear regression model, the intercept is included in \(\boldsymbol w\) by adding a constant feature column \(\boldsymbol x_0 = \boldsymbol 1\).
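For illustration only, a small sketch of how such a constant column can be prepended to a feature matrix (toy values, not from the dataset below):

import numpy as np

X = np.array([[1.5, 2.0],
              [0.3, 1.1]])                        # toy feature matrix, n = 2, d = 2
X_ext = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the constant column x_0 = 1
print(X_ext)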

The loss function is binary cross-entropy

(27)#\[\begin{split} \begin{multline*} \mathcal L(\boldsymbol w) = -\frac 1n\sum\limits_{i=1}^n \big(y_i \log \widehat y_i + (1-y_i)\log(1-\widehat y_i)\big) = \\ =-\frac 1n\sum\limits_{i=1}^n \big(y_i \log(\sigma(\boldsymbol x_i^\mathsf{T}\boldsymbol w)) + (1- y_i)\log(1 - \sigma(\boldsymbol x_i^\mathsf{T} \boldsymbol w))\big). \end{multline*}\end{split}\]

Fitting a logistic regression model is achieved by minimizing this loss function with respect to \(\boldsymbol w\).
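A minimal NumPy sketch of this loss; the names sigmoid and bce_loss are illustrative, not part of any library:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def bce_loss(w, X, y):
    # binary cross-entropy averaged over the samples
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# toy data: 3 samples, 2 features (the first column plays the role of the intercept)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(bce_loss(np.zeros(2), X, y))   # equals log 2 ≈ 0.693 when w = 0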

Question

How will the cross entropy loss change if \(\mathcal Y = \{-1, 1\}\)?

Regularization#

Logistic regression can also suffer from multicollinearity. A possible solution is to add a regularization term of the form \(C\Vert \boldsymbol w \Vert\). For example, the loss function for \(L_2\)-regularized logistic regression with \(\mathcal Y = \{-1, 1\}\) is

\[ \mathcal L(\boldsymbol w) = \frac 1n\sum\limits_{i=1}^n \log \big(1 + e^{-y_i \boldsymbol x_i^\mathsf{T} \boldsymbol w}\big) + C \boldsymbol w^\mathsf{T} \boldsymbol w. \]

There are also versions with an \(L_1\) penalty or an elastic net penalty.
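In scikit-learn the penalty is chosen via the penalty argument of LogisticRegression, and the regularization strength via C, which (unlike the \(C\) in the formula above) is the inverse of the strength: smaller C means stronger regularization. A sketch of the three variants:

from sklearn.linear_model import LogisticRegression

l2_model = LogisticRegression(penalty="l2", C=0.1)                      # L2 penalty (the default)
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")  # L1 needs liblinear or saga
en_model = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga")  # elastic net needs saga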

Example: breast cancer dataset#

This is a dataset with \(30\) features and binary target.

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data['data'].shape, data['target'].shape
((569, 30), (569,))

Malignant or benign?

data.target_names
array(['malignant', 'benign'], dtype='<U9')
data.feature_names
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Divide the dataset into train and test:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

Train and evaluate \(30\) simple logistic regressions, one for each feature:

from sklearn.linear_model import LogisticRegression

for i, feature in enumerate(data.feature_names):
    log_reg = LogisticRegression()
    log_reg.fit(X_train[:, i][:, None], y_train)
    print(feature)
    print("  Train accuracy: {:.2%}".format(log_reg.score(X_train[:, i][:, None], y_train)) + 
          "  test accuracy: {:.2%}".format(log_reg.score(X_test[:, i][:, None], y_test)))
    print(f"  Intercept: {log_reg.intercept_}, coef: {log_reg.coef_[0]}")
mean radius
  Train accuracy: 88.35%  test accuracy: 87.72%
  Intercept: [15.22547401], coef: [-1.02574878]
mean texture
  Train accuracy: 67.69%  test accuracy: 73.68%
  Intercept: [5.03222197], coef: [-0.22715068]
mean perimeter
  Train accuracy: 89.01%  test accuracy: 88.60%
  Intercept: [15.73101082], coef: [-0.1631407]
mean area
  Train accuracy: 88.79%  test accuracy: 85.96%
  Intercept: [8.11064014], coef: [-0.01183704]
mean smoothness
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.66028713], coef: [-1.00675816]
mean compactness
  Train accuracy: 67.47%  test accuracy: 62.28%
  Intercept: [1.11667038], coef: [-5.29806788]
mean concavity
  Train accuracy: 76.04%  test accuracy: 76.32%
  Intercept: [1.216197], coef: [-7.19341908]
mean concave points
  Train accuracy: 65.49%  test accuracy: 62.28%
  Intercept: [0.83783383], coef: [-5.57518259]
mean symmetry
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.89207282], coef: [-1.81787956]
mean fractal dimension
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.56269606], coef: [0.01986425]
radius error
  Train accuracy: 80.22%  test accuracy: 78.07%
  Intercept: [2.90949024], coef: [-5.86425646]
texture error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.41744785], coef: [0.12174087]
perimeter error
  Train accuracy: 79.34%  test accuracy: 78.95%
  Intercept: [4.39973411], coef: [-1.39953561]
area error
  Train accuracy: 87.47%  test accuracy: 84.21%
  Intercept: [5.01455766], coef: [-0.1343004]
smoothness error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.56348868], coef: [0.06513609]
compactness error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.58980254], coef: [-1.04631197]
concavity error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.60690871], coef: [-1.35834348]
concave points error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.56964054], coef: [-0.48747656]
symmetry error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.5617867], coef: [0.10568277]
fractal dimension error
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.564076], coef: [-0.03866078]
worst radius
  Train accuracy: 92.09%  test accuracy: 89.47%
  Intercept: [19.74621311], coef: [-1.18154614]
worst texture
  Train accuracy: 73.41%  test accuracy: 70.18%
  Intercept: [5.68230831], coef: [-0.19463673]
worst perimeter
  Train accuracy: 92.53%  test accuracy: 88.60%
  Intercept: [20.94628998], coef: [-0.18987604]
worst area
  Train accuracy: 91.87%  test accuracy: 89.47%
  Intercept: [10.55749958], coef: [-0.01230494]
worst smoothness
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.82864235], coef: [-2.0121464]
worst compactness
  Train accuracy: 80.22%  test accuracy: 73.68%
  Intercept: [2.26847288], coef: [-6.60558801]
worst concavity
  Train accuracy: 84.40%  test accuracy: 80.70%
  Intercept: [2.47647793], coef: [-6.76251851]
worst concave points
  Train accuracy: 80.00%  test accuracy: 72.81%
  Intercept: [1.49975318], coef: [-7.91253492]
worst symmetry
  Train accuracy: 66.59%  test accuracy: 63.16%
  Intercept: [1.72204249], coef: [-3.98764921]
worst fractal dimension
  Train accuracy: 63.74%  test accuracy: 58.77%
  Intercept: [0.67024595], coef: [-1.27892772]

Now take all \(30\) features at once:

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
LogisticRegression()

The default value of max_iter is \(100\), which is not enough for convergence here. Nevertheless, the accuracy is not bad:

print("Train accuracy:", log_reg.score(X_train, y_train))
print("Test accuracy:", log_reg.score(X_test, y_test))
Train accuracy: 0.9582417582417583
Test accuracy: 0.8947368421052632

Now increase the max_iter argument:

log_reg = LogisticRegression(max_iter=3000)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=3000)

The improvement in accuracy does not seem significant:

print("Train accuracy:", log_reg.score(X_train, y_train))
print("Test accuracy:", log_reg.score(X_test, y_test))
Train accuracy: 0.9758241758241758
Test accuracy: 0.9122807017543859
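As the convergence warning suggested, scaling the features is an alternative to increasing max_iter. A sketch using a pipeline (the exact scores depend on the random train/test split):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# standardize the features so that the default lbfgs solver converges quickly
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("Train accuracy:", pipe.score(X_train, y_train))
print("Test accuracy:", pipe.score(X_test, y_test))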

Exercises#

  1. Suppose we collect data for a group of students in a statistics class with variables

    \[ x_1 = \text{hours studied},\quad x_2 = \text{undergrad GPA},\quad y = \text{receive an A}. \]

    We fit a logistic regression and produce estimated coefficients

    \[ w_0 = -6, \quad w_1 = 0.05, \quad w_2 = 1. \]
    • Estimate the probability that a student who studies for \(40\) h and has an undergrad GPA of \(3.5\) gets an A in the class.

    • How many hours would the student need to study to have a \(50\%\) chance of getting an A in the class?

  2. Write down the loss function of the logistic regression model with \(\mathcal Y = \{-1, 1\}\) in matrix-vector form.