Random forest#

Random forest is an enhancement of bagging tailored specifically for decision trees: in addition to bootstrapping the training samples, it adds feature randomness to boost ensemble diversity and decorrelate the base learners. This makes the ensemble more robust and less prone to overfitting. See ML Handbook for details.
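In scikit-learn this feature randomness is controlled by the max_features parameter: at every split each tree considers only a random subset of the features. A minimal sketch of the idea (the hyperparameter values are illustrative, not tuned; one third of the features is a common rule of thumb for regression):

from sklearn.ensemble import RandomForestRegressor

# Each tree is grown on a bootstrap sample of the rows, and at every
# split only a random subset of the features is considered, which
# decorrelates the trees in the ensemble
rf_sketch = RandomForestRegressor(
    n_estimators=100,   # number of trees
    max_features=1/3,   # fraction of features examined at each split
    bootstrap=True,     # the bagging part: sample rows with replacement
    random_state=42,
)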

Boston dataset#

Apply random forest to the Boston dataset:

import pandas as pd
from sklearn.model_selection import train_test_split

boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
y = boston['medv']
X = boston.drop('medv', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Train a random forest model:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Train RF R2-score:", rf.score(X_train, y_train))
print("Test RF R2-score:", rf.score(X_test, y_test))
Train RF R2-score: 0.9826817129099186
Test RF R2-score: 0.8896941842251395
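As a quick sanity check on the fitted forest, scikit-learn exposes impurity-based feature importances; a short sketch of inspecting them (the exact values depend on the random split, so the output is omitted):

# Impurity-based importances of the fitted forest, largest first
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())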

Bagging#

Now apply bagging on top of plain decision trees:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

base_model = DecisionTreeRegressor(random_state=42)
bagging_model = BaggingRegressor(base_model, n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)

print("Train bagging R2-score:", bagging_model.score(X_train, y_train))
print("Test bagging R2-score:", bagging_model.score(X_test, y_test))
Train bagging R2-score: 0.9824645880169679
Test bagging R2-score: 0.8870116086741119

The train and test coefficients of determination are very close for these two models.
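A single random split can be noisy, so a cross-validated comparison makes this claim more robust; a sketch using cross_val_score (the exact numbers will vary):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R2 for both ensembles on the full Boston data;
# the two scores are expected to land in the same ballpark
rf_cv = cross_val_score(rf, X, y, cv=5, scoring="r2")
bag_cv = cross_val_score(bagging_model, X, y, cv=5, scoring="r2")
print("RF CV R2-score:     ", rf_cv.mean())
print("Bagging CV R2-score:", bag_cv.mean())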

Default dataset#

Default is a synthetic dataset with the binary target default, which indicates whether a customer defaulted on their credit card debt:

default = pd.read_csv("../datasets/ISLP/Default.csv")
default.head()
  default student      balance        income
0      No      No   729.526495  44361.625074
1      No     Yes   817.180407  12106.134700
2      No      No  1073.549164  31767.138947
3      No      No   529.250605  35704.493935
4      No      No   785.655883  38463.495879
default.shape
(10000, 4)
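Before modeling it is worth checking the class balance: defaults are rare in this dataset, so plain accuracy will look high even for a model that always predicts 'No'. A quick check:

# Count the target classes; 'Yes' (a default) is a small minority
print(default['default'].value_counts())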

Split into train and test:

y = (default['default'] == 'Yes').astype(int)  # binary target: 1 means default
X = default.drop('default', axis=1)
X['student_ohe'] = (X['student'] == 'Yes').astype(int)  # encode the categorical column
X.drop('student', axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
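With such a rare positive class, a stratified split keeps the default rate identical in train and test; a sketch of this alternative (not used for the results below, which keep the split above):

# Alternative split that preserves the class ratio in both parts;
# illustrative only, the models below use the unstratified split
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)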

Fit random forest classifier:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Train RF accuracy:", rf.score(X_train, y_train))
print("Test RF accuracy:", rf.score(X_test, y_test))
Train RF accuracy: 1.0
Test RF accuracy: 0.9705
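Since roughly 97% of the customers do not default, an accuracy of 0.97 is close to what the constant 'No' prediction would score. A confusion matrix on the test set is more informative (the numbers depend on the split, so the output is omitted):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes;
# the interesting cell is how many actual defaults are caught
print(confusion_matrix(y_test, rf.predict(X_test)))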

Apply bagging with a decision tree as the base classifier:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

base_model = DecisionTreeClassifier(random_state=42)
bagging_model = BaggingClassifier(base_model, n_estimators=200, random_state=42)
bagging_model.fit(X_train, y_train)

print("Train bagging accuracy:", bagging_model.score(X_train, y_train))
print("Test bagging accuracy:", bagging_model.score(X_test, y_test))
Train bagging accuracy: 1.0
Test bagging accuracy: 0.97
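Again random forest and bagging score almost identically. Both estimators also support an out-of-bag estimate of the generalization score, computed for free on the samples each tree never saw during bootstrapping; a sketch with the random forest (the exact value will vary):

# Out-of-bag accuracy: each tree is scored on the samples left out
# of its bootstrap sample, so no extra held-out set is needed
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print("RF OOB accuracy:", rf_oob.oob_score_)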