Cross-validation#
Cross-validation is a widely used technique in machine learning and statistics for assessing the performance and generalization of a predictive model. The basic idea is to split the dataset into multiple subsets, train and test the model on different combinations of these subsets, and then aggregate the results to get a more comprehensive performance evaluation.
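The split, train, test, aggregate cycle described above can be sketched by hand. This is a minimal illustration rather than a recommended implementation: the synthetic dataset, the choice of 4 subsets, the accuracy metric, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data (an assumption for illustration, not from the text).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Shuffle the row indices once and cut them into 4 roughly equal subsets.
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X_demo)), 4)

fold_scores = []
for i, test_idx in enumerate(folds):
    # Train on every subset except the i-th, then test on the i-th.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], model.predict(X_demo[test_idx])))

# Aggregate the per-subset results into a single estimate.
mean_score = float(np.mean(fold_scores))
print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", round(mean_score, 3))
```

In practice the helpers in `sklearn.model_selection` do this bookkeeping; the loop is spelled out here only to show what they do.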
A good source in Russian: ML Handbook chapter.
Hold-out#
Divide the dataset into a training set and a test set using train_test_split(). The model is fit on the training set and evaluated once on the held-out test set.
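A minimal hold-out sketch follows. The synthetic data, the 25% test fraction, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 100 rows, 4 features.
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # (75, 4) (25, 4)
```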
Stratify#
If the dataset has imbalanced classes, it is important to check that the class proportions of the full dataset are preserved in both the train and test sets.
from sklearn.datasets import fetch_openml
credit_g = fetch_openml(name="credit-g", version=1, parser='auto')
credit_g['target'].value_counts()
class
good 700
bad 300
Name: count, dtype: int64
Split into train and test:
from sklearn.model_selection import train_test_split
*_, y_test = train_test_split(credit_g['data'], credit_g['target'], test_size=0.2, random_state=9)
y_test.value_counts()
class
good 151
bad 49
Name: count, dtype: int64
Add stratification:
*_, y_test = train_test_split(
    credit_g['data'],
    credit_g['target'],
    test_size=0.2,
    random_state=9,
    stratify=credit_g['target'],
)
y_test.value_counts()
class
good 140
bad 60
Name: count, dtype: int64
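The same effect can be checked without downloading a dataset. In this sketch the 70/30 toy labels and all variable names are assumptions made for illustration; it compares class shares rather than raw counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (an assumption for illustration): 70 of class 0, 30 of class 1.
y_imb = np.array([0] * 70 + [1] * 30)
X_imb = np.arange(100).reshape(-1, 1)

# Same split size and seed, without and with stratification.
_, _, _, y_plain = train_test_split(X_imb, y_imb, test_size=0.2, random_state=0)
_, _, _, y_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

# Share of each class in the full data and in each test set.
print("full:      ", np.bincount(y_imb) / len(y_imb))
print("plain:     ", np.bincount(y_plain, minlength=2) / len(y_plain))
print("stratified:", np.bincount(y_strat, minlength=2) / len(y_strat))
```

With stratification the test set keeps the 70/30 ratio exactly (14 and 6 samples out of 20). The plain split may or may not, depending on the seed.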
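# The same check on a far more imbalanced dataset: credit card fraud,
# where the binary target is stored in the "Class" column.
import pandas as pd
credit_df = pd.read_csv("../ISLP_datsets/creditcard.csv.zip")
y = credit_df['Class']
X = credit_df.drop("Class", axis=1)
from sklearn.model_selection import train_test_split
# stratify=y keeps the rare positive class represented in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)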
y_test.value_counts()
Class
0 71079
1 123
Name: count, dtype: int64
K-Fold#
import numpy as np
from sklearn.model_selection import KFold
X = np.arange(27).reshape(9, 3)
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 3 6 7 8] TEST: [2 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [2 4 5 6 7 8] TEST: [0 1 3]
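Because `shuffle=True` is used without a `random_state`, the splits printed above will generally differ from run to run. The sketch below fixes the seed and checks two properties of K-fold: the splits are repeatable, and every sample lands in exactly one test fold. The helper name `collect_test_folds` and the seed value are assumptions introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(27).reshape(9, 3)

def collect_test_folds(seed):
    """Return the test indices of each fold for a seeded, shuffled 3-fold split."""
    kf = KFold(n_splits=3, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(X_demo)]

folds_a = collect_test_folds(seed=0)
folds_b = collect_test_folds(seed=0)

# Same seed, same folds.
print(folds_a == folds_b)  # True

# The test folds partition the row indices: each index appears exactly once.
print(sorted(i for fold in folds_a for i in fold))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```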
Apply K-fold cross-validation to a logistic regression model on the credit card fraud dataset:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=5000)
# X_train and y_train come from the stratified split above.
# For a classifier, cv=5 uses stratified 5-fold splitting by default.
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1')
print("Cross validation scores:", scores)
Cross validation scores: [0.73529412 0.66666667 0.672 0.75590551 0.76335878]
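The per-fold scores are usually summarized by their mean and spread. A small sketch, with the values copied from the output above; the variable name `reported_scores` is an assumption, and reporting mean and standard deviation is one common convention rather than the only option.

```python
import numpy as np

# F1 scores copied from the cross-validation output above.
reported_scores = np.array([0.73529412, 0.66666667, 0.672, 0.75590551, 0.76335878])

# Mean as the overall estimate, standard deviation as a rough measure of fold-to-fold variation.
print(f"F1: {reported_scores.mean():.3f} +/- {reported_scores.std():.3f}")
```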
Leave-one-out#
A special case of K-fold in which \(K=n\), where \(n\) is the number of samples. On each iteration the train set has \(n-1\) elements and the validation set has exactly one.
import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2 3] TEST: [1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
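The relationship to K-fold can be checked directly: an unshuffled `KFold` with `n_splits` equal to the number of samples should produce the same splits as `LeaveOneOut`. A minimal sketch; the variable names are assumptions introduced for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_small = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Collect (train, test) index pairs from each splitter as plain lists.
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_small)]
kf_splits = [
    (tr.tolist(), te.tolist())
    for tr, te in KFold(n_splits=len(X_small)).split(X_small)
]

print(loo_splits == kf_splits)  # True
```

Since leave-one-out fits the model \(n\) times, it tends to be practical mainly for small datasets.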