ROC-AUC#
If a binary classifier predicts the probability \(\widehat y \in (0, 1)\) of the positive class, then the predicted label can usually be obtained as
\[ \mathbb I[\widehat y > 0.5]. \]
The value \(0.5\) plays the role of a threshold here, and it can actually be any number \(t \in (0, 1)\). How does the performance of the classifier depend on \(t\)?
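For instance, the same predicted probabilities produce different labels under different thresholds. Here is a minimal sketch (the probability values are made up for illustration):

import numpy as np

# hypothetical predicted probabilities of the positive class
y_hat = np.array([0.10, 0.40, 0.55, 0.80])

# the same scores yield different label predictions for different thresholds t
for t in [0.3, 0.5, 0.7]:
    print(f"t = {t}: {(y_hat > t).astype(int)}")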
One popular metric for such cases is ROC-AUC (Receiver Operating Characteristic - Area Under the Curve).
The ROC-AUC score quantifies the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds.
Precision-Recall curve#
To build the precision-recall curve, follow these steps (a manual sketch after the average precision definition below walks through them on a toy sample):
Rearrange the predictions in increasing order:
\[ \widehat y_{(1)} \leqslant \widehat y_{(2)} \leqslant \ldots \leqslant \widehat y_{(n)}. \]
For each threshold
\[ t = \widehat y_{(n)}, \quad t = \widehat y_{(n-1)}, \quad \ldots,\quad t = \widehat y_{(1)}, \quad t = 0, \]
make label predictions by the formula \(\mathbb I[\widehat y_i > t]\) and calculate the precision \(P_t\) and recall \(R_t\) metrics.
Connect the points \((R_{t}, P_{t})\) by line segments parallel to the coordinate axes.
The area under the precision-recall curve is called the average precision (AP) metric:
\[ \text{AP} = \sum\limits_k (R_k - R_{k-1}) P_k, \]
where \(P_k\) and \(R_k\) denote precision and recall at the \(k\)-th threshold, taken in the order of increasing recall. The greater this area, the better.
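To make the construction and the AP formula concrete, here is a small self-contained sketch that follows the steps above on the same toy sample as the random example below and compares the result with sklearn's average_precision_score. Setting precision to 1 when no sample is predicted positive is a convention mirroring how sklearn closes the curve at the point (recall = 0, precision = 1).

import numpy as np
from sklearn.metrics import average_precision_score

# the same toy sample as in the random example below (hard-coded for self-containment)
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0.09, 0.29, 0.54, 0.62, 0.93])

# thresholds: the sorted predictions in decreasing order, followed by 0
thresholds = np.concatenate([np.sort(y_hat)[::-1], [0]])

precisions, recalls = [], []
for t in thresholds:
    y_pred = (y_hat > t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # convention: precision equals 1 when nothing is predicted positive
    precisions.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
    recalls.append(tp / (tp + fn))

# step-wise area under the PR curve: precision times the recall increment
ap_manual = sum((r - r_prev) * p
                for r_prev, r, p in zip(recalls[:-1], recalls[1:], precisions[1:]))
print("Manual AP: ", ap_manual)
print("sklearn AP:", average_precision_score(y_true, y_hat))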
A random example#
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import PrecisionRecallDisplay, average_precision_score
def sim_labels_and_probs(size, seed=None):
    if seed is not None:
        np.random.seed(seed)
    # sorted predicted probabilities, rounded to two decimals
    probs = np.sort(np.round(np.random.rand(size), 2))
    # noisy labels correlated with the probabilities, clipped and rounded to {0, 1}
    labels = np.round(np.clip(np.random.normal(loc=probs, scale=1.1, size=size), 0, 1))
    return labels, probs
y_true, y_hat = sim_labels_and_probs(5, 905)
print("True labels:", y_true)
print("Predicted probabilities:", y_hat)
print(average_precision_score(y_true, y_hat))
True labels: [0. 1. 1. 0. 1.]
Predicted probabilities: [0.09 0.29 0.54 0.62 0.93]
0.8055555555555556
PrecisionRecallDisplay.from_predictions(y_true, y_hat, lw=2, c='r', marker='o', markeredgecolor='b')
%config InlineBackend.figure_format = 'svg'
plt.title("Precision-recall curve")
plt.grid(ls=":");
Q. What are the left-most and the right-most points of the PR curve?
ROC curve#
To build the ROC curve, follow these steps (see the sketch after this list for a worked computation):
Rearrange the predictions in increasing order:
\[ \widehat y_{(1)} \leqslant \widehat y_{(2)} \leqslant \ldots \leqslant \widehat y_{(n)}. \]
For each threshold
\[ t = \widehat y_{(n)}, \quad t = \widehat y_{(n-1)}, \quad \ldots,\quad t = \widehat y_{(1)}, \quad t = 0, \]
make label predictions by the formula \(\mathbb I[\widehat y_i > t]\) and calculate the true positive rate (TPR)
\[ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]
and the false positive rate (FPR):
\[ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}. \]
Connect the points \((\mathrm{FPR}_t, \mathrm{TPR}_t)\) by line segments.
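As a worked computation, here is a minimal sketch that traces the \((\mathrm{FPR}_t, \mathrm{TPR}_t)\) points on the same toy sample as the random example above; these are the points that RocCurveDisplay connects in the plot below.

import numpy as np

# the same toy sample as in the random example above (hard-coded for self-containment)
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0.09, 0.29, 0.54, 0.62, 0.93])

# thresholds: the sorted predictions in decreasing order, followed by 0
for t in np.concatenate([np.sort(y_hat)[::-1], [0]]):
    y_pred = (y_hat > t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)  # denominator: number of positive samples
    fpr = fp / (fp + tn)  # denominator: number of negative samples
    print(f"t = {t:.2f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")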
Note
True positive rate is also known as sensitivity or recall, while \(1 - \mathrm{FPR} = \frac{\text{TN}}{\text{FP} + \text{TN}}\) is called specificity.
The area under the ROC curve is called the ROC-AUC score. Once again, the greater this area, the better.
It turns out that the ROC-AUC score is equal to the fraction of correctly ordered pairs of predictions:
\[ \text{ROC-AUC} = \frac{1}{n_+ n_-} \sum\limits_{i\colon y_i = 1} \sum\limits_{j\colon y_j = 0} \mathbb I[\widehat y_i > \widehat y_j], \]
where \(n_+\) and \(n_-\) are the numbers of positive and negative samples respectively.
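A quick sanity check of this identity on the toy sample: compute the fraction of correctly ordered (positive, negative) pairs directly and compare it with roc_auc_score. Tied scores, if any, are counted with weight 1/2, which is the usual convention.

import numpy as np
from sklearn.metrics import roc_auc_score

# the same toy sample as in the random example above
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0.09, 0.29, 0.54, 0.62, 0.93])

pos_scores = y_hat[y_true == 1]   # n_plus predictions for positive samples
neg_scores = y_hat[y_true == 0]   # n_minus predictions for negative samples

# fraction of (positive, negative) pairs ranked correctly; ties count as 1/2
diffs = pos_scores[:, None] - neg_scores[None, :]
auc_pairs = np.mean((diffs > 0) + 0.5 * (diffs == 0))

print("Pairwise fraction:", auc_pairs)
print("roc_auc_score:    ", roc_auc_score(y_true, y_hat))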
The same random example#
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y_true, y_hat, lw=2, c='r', marker='o', markeredgecolor='b')
plt.title("ROC curve")
plt.grid(ls=":")
Q. What are the left-most and the right-most points of the ROC curve?
Q. Suppose that ROC-AUC score of a classifier is less than \(0.5\). How can we easily improve the performance of this classifier?
Credit dataset#
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
credit_df = pd.read_csv("../ISLP_datsets/creditcard.csv.zip")
y = credit_df['Class']
X = credit_df.drop("Class", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y)
dc_mf = DummyClassifier(strategy="most_frequent")
dc_mf.fit(X_train, y_train)
DummyClassifier(strategy='most_frequent')
Dummy classifier metrics:
print("Train ROC-AUC score:", roc_auc_score(y_train, dc_mf.predict_proba(X_train)[:, 1]))
print("Train AP score:", average_precision_score(y_train, dc_mf.predict_proba(X_train)[:, 1]))
print("Test ROC-AUC score:", roc_auc_score(y_test, dc_mf.predict_proba(X_test)[:, 1]))
print("Test AP score:", average_precision_score(y_test, dc_mf.predict_proba(X_test)[:, 1]))
Train ROC-AUC score: 0.5
Train AP score: 0.0018211184195126519
Test ROC-AUC score: 0.5
Test AP score: 0.0014465885789725008
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=10000)
Logistic regression metrics:
print("Train ROC-AUC score:", roc_auc_score(y_train, log_reg.predict_proba(X_train)[:, 1]))
print("Train AP score:", average_precision_score(y_train, log_reg.predict_proba(X_train)[:, 1]))
print("Test ROC-AUC score:", roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1]))
print("Test AP score:", average_precision_score(y_test, log_reg.predict_proba(X_test)[:, 1]))
Train ROC-AUC score: 0.9425492263032587
Train AP score: 0.6635935420107146
Test ROC-AUC score: 0.9505265255051857
Test AP score: 0.6253817235944542
Note that the ROC-AUC score looks much more optimistic on this heavily imbalanced dataset than the average precision metric.
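This optimism is largely a consequence of the class imbalance: the AP of a constant (dummy) classifier equals the fraction of positive samples, while its ROC-AUC stays at \(0.5\). A quick check of that fraction, assuming y and y_test from the cells above are still in scope:

# fraction of positive samples: the baseline level for average precision
print("Positive class fraction (full dataset):", y.mean())
print("Positive class fraction (test split):  ", y_test.mean())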