Cross-validation

Cross-validation is a widely used technique in machine learning and statistics for assessing the performance and generalization of a predictive model. The basic idea is to split the dataset into multiple subsets, train and test the model on different combinations of these subsets, and then aggregate the results to get a more comprehensive performance evaluation.

A good source in Russian: ML Handbook chapter.

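Hold-out

Split the dataset once into a training set and a test set using train_test_split(). The model is fitted on the training set and evaluated on the test set, which it has not seen during fitting.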
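
A minimal sketch of the hold-out procedure. The synthetic dataset from make_classification, the 80/20 ratio, and the variable names are assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption made for illustration)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training part only, then evaluate on the held-out part
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
holdout_accuracy = model.score(X_te, y_te)

print(X_tr.shape, X_te.shape)
print("Hold-out accuracy:", holdout_accuracy)
```

The score depends on which samples happen to fall into the test set, which is the main motivation for the repeated splits described later.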

Stratify

If the classes are imbalanced, it is important to check that the class proportions of the full dataset are preserved in both the training set and the test set.

from sklearn.datasets import fetch_openml
credit_g = fetch_openml(name="credit-g", version=1, parser='auto')
credit_g['target'].value_counts()

class
good    700
bad     300
Name: count, dtype: int64

Split into train and test:

from sklearn.model_selection import train_test_split 
*_, y_test = train_test_split(credit_g['data'], credit_g['target'], test_size=0.2, random_state=9)
y_test.value_counts()

class
good    151
bad      49
Name: count, dtype: int64

Add stratification:

*_, y_test = train_test_split(
    credit_g['data'],
    credit_g['target'],
    test_size=0.2,
    random_state=9,
    stratify=credit_g['target']
)
y_test.value_counts()

class
good    140
bad      60
Name: count, dtype: int64
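
The effect of stratify can also be checked numerically. The sketch below uses hypothetical labels with a 90/10 class ratio and placeholder features; both are assumptions chosen so the expected counts are easy to verify by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 900 negatives and 100 positives
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratify, both parts keep the original 10% share of positives
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Without stratify the shares would fluctuate from seed to seed, as the 151/49 split above illustrates.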

import pandas as pd

# Load the credit card fraud dataset; 'Class' is the target (1 marks a fraudulent transaction)
credit_df = pd.read_csv("../ISLP_datsets/creditcard.csv.zip")
y = credit_df['Class']
X = credit_df.drop("Class", axis=1)

# Stratified split with the default test size (25% of the samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
y_test.value_counts()

Class
0    71079
1      123
Name: count, dtype: int64

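K-Fold

K-fold cross-validation splits the data into K folds of roughly equal size. Each fold serves once as the validation set while the remaining K-1 folds are used for training, so every sample is used for validation exactly once.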

import numpy as np
from sklearn.model_selection import KFold
 
X = np.arange(27).reshape(9, 3)
y = np.array([1, 2, 3, 4, 5, 6 , 7, 8, 9])
kf = KFold(n_splits=3, shuffle=True)
 
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [0 1 3 6 7 8] TEST: [2 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [2 4 5 6 7 8] TEST: [0 1 3]
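
Two properties of the folds above can be checked directly: every sample appears in exactly one test fold, and a fixed seed makes the shuffled folds reproducible. In this sketch the random_state value and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(27).reshape(9, 3)

# Fixing random_state (an illustrative choice) makes the shuffled folds reproducible
kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)
test_folds = [test_idx for _, test_idx in kf_demo.split(X_demo)]

# Every sample lands in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
print(all_test)  # [0 1 2 3 4 5 6 7 8]

# A second splitter with the same seed yields identical folds
kf_again = KFold(n_splits=3, shuffle=True, random_state=0)
test_folds_again = [test_idx for _, test_idx in kf_again.split(X_demo)]
same_folds = all(
    np.array_equal(a, b) for a, b in zip(test_folds, test_folds_again)
)
print(same_folds)  # True
```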

Apply 5-fold cross-validation to a logistic regression model on the credit card fraud dataset, scoring each fold with the F1 metric:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# X_train and y_train come from the stratified split of the credit card data above
clf = LogisticRegression(max_iter=5000)
# cv=5 requests five folds; scoring='f1' computes the F1 score on each validation fold
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1')
print("Cross validation scores:", scores)

Cross validation scores: [0.73529412 0.66666667 0.672      0.75590551 0.76335878]
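
To show what cross_val_score does internally, the sketch below reproduces it with an explicit loop. It relies on the documented behaviour that an integer cv with a classifier uses StratifiedKFold without shuffling. The synthetic data from make_classification stands in for the fraud dataset and is an assumption, as are the variable names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the fraud dataset (an assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
clf_demo = LogisticRegression(max_iter=5000)

# Manual loop: fit a fresh copy of the model on each training part,
# then compute F1 on the corresponding validation fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(clf_demo).fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(f1_score(y_demo[val_idx], y_pred))

# The library helper should give the same per-fold scores
library_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring="f1")
scores_match = bool(np.allclose(manual_scores, library_scores))

print("Scores match:", scores_match)
print("Mean F1:", np.mean(manual_scores), "Std:", np.std(manual_scores))
```

Reporting the mean together with the standard deviation across folds is a common way to aggregate the per-fold results into a single estimate.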

Leave-one-out

A special case of K-fold with \(K=n\), where \(n\) is the number of samples. On each iteration the training set contains \(n-1\) samples and the validation set contains exactly one.

import numpy as np
from sklearn.model_selection import LeaveOneOut
 
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
loo = LeaveOneOut()
 
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2 3] TEST: [1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
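
The equivalence with K-fold at \(K=n\) can be verified on the same toy data. The sketch also passes LeaveOneOut to cross_val_score; the choice of LinearRegression and of the neg_mean_squared_error scorer is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y_demo = np.array([1, 2, 3, 4])
n = len(X_demo)

# Collect the index splits as plain lists so they can be compared directly
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kfold_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=n).split(X_demo)]
print(loo_splits == kfold_splits)  # True

# Leave-one-out fits the model n times and returns one score per sample
loo_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)
print(len(loo_scores))  # 4
```

Because the model is refitted \(n\) times, leave-one-out becomes expensive on large datasets such as the credit card data used earlier.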