Random forest#
Random forest is an enhancement of bagging tailored specifically to decision trees: on top of bootstrap sampling, each split considers only a random subset of the features, which boosts ensemble diversity and decorrelates the base trees. This makes the ensemble model more robust and less prone to overfitting. See ML Handbook for details.
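The feature-subsampling idea can be sketched directly with scikit-learn: `max_features` controls how many features each split may consider, and setting it to `None` (all features) essentially reduces a random forest to plain bagging of trees. The data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data: two informative features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# max_features < n_features is what distinguishes a random forest from
# plain bagging of trees: each split sees only a random feature subset
rf = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)
rf.fit(X, y)

# with max_features=None every split considers all features,
# which recovers ordinary bagging over decision trees
bagged = RandomForestRegressor(n_estimators=50, max_features=None, random_state=0)
bagged.fit(X, y)

print("RF train R2:", rf.score(X, y))
print("Bagged-trees train R2:", bagged.score(X, y))
```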
Boston dataset#
Apply random forest to the Boston dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
y = boston['medv']
X = boston.drop('medv', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Train a random forest model:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Train RF R2-score:", rf.score(X_train, y_train))
print("Test RF R2-score:", rf.score(X_test, y_test))
Train RF R2-score: 0.9826817129099186
Test RF R2-score: 0.8896941842251395
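A fitted random forest also exposes `feature_importances_`, which is often the next thing to inspect after scoring. Since the Boston CSV may not be at hand, the sketch below uses hypothetical stand-in data; in the notebook one would simply pass `rf.feature_importances_` and `X.columns` from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data: "lstat" drives the target, "noise" does not
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "lstat": rng.normal(size=500),
    "rm": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = 3 * X["lstat"] + X["rm"] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# importances sum to 1; higher values mean the feature contributed
# more impurity reduction across the forest's splits
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```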
Bagging#
Now apply bagging to decision trees:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
base_model = DecisionTreeRegressor(random_state=42)
bagging_model = BaggingRegressor(base_model, n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)
print("Train bagging R2-score:", bagging_model.score(X_train, y_train))
print("Test bagging R2-score:", bagging_model.score(X_test, y_test))
Train bagging R2-score: 0.9824645880169679
Test bagging R2-score: 0.8870116086741119
The train and test coefficients of determination are very close for these two models. This is expected: in recent scikit-learn versions `RandomForestRegressor` defaults to `max_features=1.0`, i.e. every split considers all features, so it behaves almost identically to bagging over decision trees.
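One practical bonus of bagging is out-of-bag (OOB) evaluation: each tree is trained on a bootstrap sample, so the roughly 37% of rows it never saw can serve as a built-in validation set. A minimal sketch on synthetic data, assuming `oob_score=True`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# Hypothetical synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.2, size=400)

# oob_score=True reuses the samples left out of each bootstrap
# as a validation set, so no separate test split is required
bagging = BaggingRegressor(
    DecisionTreeRegressor(random_state=0),
    n_estimators=100,
    oob_score=True,
    random_state=0,
)
bagging.fit(X, y)
print("OOB R2:", bagging.oob_score_)
```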
Default dataset#
A synthetic dataset with binary target `default`:
default = pd.read_csv("../datasets/ISLP/Default.csv")
default.head()
|   | default | student | balance | income |
|---|---------|---------|---------|--------|
| 0 | No | No | 729.526495 | 44361.625074 |
| 1 | No | Yes | 817.180407 | 12106.134700 |
| 2 | No | No | 1073.549164 | 31767.138947 |
| 3 | No | No | 529.250605 | 35704.493935 |
| 4 | No | No | 785.655883 | 38463.495879 |
default.shape
(10000, 4)
Split into train and test:
y = (default['default'] == 'Yes').astype(int)
X = default.drop('default', axis=1)
X['student_ohe'] = (X['student'] == 'Yes').astype(int)
X.drop('student', axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
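The manual 0/1 encoding of `student` above can equivalently be done with `pd.get_dummies`, which generalizes to columns with more than two categories. A small sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame mimicking Default's categorical column (hypothetical values)
df = pd.DataFrame({"student": ["Yes", "No", "No"]})

# drop_first=True drops the alphabetically first category ("No"),
# leaving a single student_Yes indicator, just like the manual
# (X['student'] == 'Yes').astype(int) encoding
encoded = pd.get_dummies(df["student"], prefix="student", drop_first=True, dtype=int)
print(encoded)
```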
Fit a random forest classifier:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Train RF accuracy:", rf.score(X_train, y_train))
print("Test RF accuracy:", rf.score(X_test, y_test))
Train RF accuracy: 1.0
Test RF accuracy: 0.9705
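A 97% test accuracy looks strong, but the Default dataset is heavily imbalanced (only a few percent of customers default), so a classifier that always predicts "No" scores nearly as well. The sketch below uses synthetic data with a similar class ratio to show why the confusion matrix is worth checking alongside accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced data mimicking Default's ~3% positive rate
X, y = make_classification(
    n_samples=5000, n_features=3, n_informative=2, n_redundant=0,
    weights=[0.97], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# accuracy is dominated by the majority class...
print("majority-class baseline:", 1 - y_test.mean())
print("model accuracy:", rf.score(X_test, y_test))
# ...so look at the confusion matrix to see how the minority class fares
print(confusion_matrix(y_test, rf.predict(X_test)))
```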
Apply bagging with a decision tree as the base classifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
base_model = DecisionTreeClassifier(random_state=42)
bagging_model = BaggingClassifier(base_model, n_estimators=200, random_state=42)
bagging_model.fit(X_train, y_train)
print("Train bagging accuracy:", bagging_model.score(X_train, y_train))
print("Test bagging accuracy:", bagging_model.score(X_test, y_test))
Train bagging accuracy: 1.0
Test bagging accuracy: 0.97
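Because accuracy barely separates the two ensembles here, a threshold-free metric such as ROC AUC, computed from `predict_proba`, gives a fairer comparison on imbalanced data. A sketch on synthetic data, not the Default CSV:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical imbalanced data for a probability-based comparison
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=100, random_state=0
).fit(X_train, y_train)

# ROC AUC ranks predicted probabilities, so it is not distorted by
# the dominant negative class the way raw accuracy is
aucs = {}
for name, model in [("RF", rf), ("bagging", bag)]:
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, "AUC:", round(aucs[name], 3))
```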