Random forests

In summary, both bagging and random forests are ensemble methods that reduce overfitting and improve predictive performance by combining many decision trees. Random forests go a step further: at each split, a tree considers only a random subset of the features. This decorrelates the trees, so averaging them typically reduces variance more than bagging alone. As a result, random forests are often preferred over plain bagging for decision-tree ensembles across a wide range of tasks.
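To make the difference concrete, here is a minimal sketch that fits both ensembles on the same data. The synthetic dataset from `make_regression` and the choice of `max_features=4` are assumptions made for illustration only; they are not taken from this section. Note that in recent scikit-learn versions `RandomForestRegressor` defaults to `max_features=1.0`, which considers every feature at each split and so behaves like bagged trees, so the feature subsampling has to be requested explicitly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
X, y = make_regression(
    n_samples=500, n_features=12, n_informative=6, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Bagging: each tree is grown on a bootstrap sample and may use
# all 12 features at every split
bagging = BaggingRegressor(
    DecisionTreeRegressor(random_state=0), n_estimators=100, random_state=0
)

# Random forest: each tree is grown on a bootstrap sample, and every
# split only considers a random subset of 4 features
forest = RandomForestRegressor(n_estimators=100, max_features=4, random_state=0)

bagging_r2 = bagging.fit(X_train, y_train).score(X_test, y_test)
forest_r2 = forest.fit(X_train, y_train).score(X_test, y_test)

print(f"Bagging test R2:       {bagging_r2:.3f}")
print(f"Random forest test R2: {forest_r2:.3f}")
```

Which of the two scores higher depends on the data; the point of the sketch is only to show where the extra randomness is configured.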

Boston dataset

import pandas as pd
from sklearn.model_selection import train_test_split

boston = pd.read_csv("../ISLP_datasets/Boston.csv").drop("Unnamed: 0", axis=1)
y = boston['medv']
X = boston.drop('medv', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Train a random forest model:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X_train, y_train)
print("Train score:", rf.score(X_train, y_train))
print("Test score:", rf.score(X_test, y_test))

Train a bagging model on the California Housing dataset:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset as an example
data = fetch_california_housing()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base decision tree regressor
base_model = DecisionTreeRegressor(random_state=42)

# Create a Bagging Regressor with 100 base models (decision trees)
bagging_model = BaggingRegressor(base_model, n_estimators=100, random_state=42)

# Train the Bagging Regressor on the training data
bagging_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bagging_model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
Mean Squared Error: 0.26

In this code:

  • We load the California Housing dataset from scikit-learn as an example regression dataset.

  • The dataset is split into training and testing sets using train_test_split.

  • We create a base model, which is a decision tree regressor.

  • We create a Bagging Regressor with 100 base models (decision trees) using BaggingRegressor.

  • The Bagging Regressor is trained on the training data using fit.

  • We make predictions on the test data using predict.

  • Finally, we evaluate the model’s performance using the mean squared error (MSE).

You can modify this code to work with your own dataset and adjust hyperparameters as needed. Bagging can also be applied to classification tasks using BaggingClassifier in scikit-learn.
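As a rough sketch of the classification case, the same pattern carries over to `BaggingClassifier`. The Iris dataset, the stratified split, and accuracy as the metric are assumptions chosen for illustration; they do not appear in the example above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small classification dataset (an assumption for illustration)
X, y = load_iris(return_X_y=True)

# Split the data, keeping class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create a Bagging Classifier with 100 decision trees as base models
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=100, random_state=42
)

# Train on the training data and predict class labels for the test data
bagging_clf.fit(X_train, y_train)
y_pred = bagging_clf.predict(X_test)

# Evaluate with accuracy instead of mean squared error
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

The ensemble predicts a class by combining the predictions of the individual trees, so a classification metric such as accuracy replaces the MSE.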

Random forest

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the California Housing dataset as an example
data = fetch_california_housing()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest Regressor with 100 trees (estimators)
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Random Forest Regressor on the training data
rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")
Mean Squared Error: 0.26
R-squared (R2) Score: 0.81

In this code:

  • We load the California Housing dataset from scikit-learn as an example regression dataset.

  • The dataset is split into training and testing sets using train_test_split.

  • We create a Random Forest Regressor with 100 decision trees (estimators) using RandomForestRegressor. You can adjust the n_estimators parameter to change the number of trees in the forest.

  • The Random Forest Regressor is trained on the training data using fit.

  • We make predictions on the test data using predict.

  • Finally, we evaluate the model’s performance using metrics such as mean squared error (MSE) and R-squared (R2) score.

You can modify this code to work with your own dataset and adjust hyperparameters as needed. Random forest regression averages the predictions of many decision trees, which keeps the flexibility of individual trees while reducing their tendency to overfit.
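One possible way to explore the effect of `n_estimators` without a separate validation set is the out-of-bag (OOB) score, which evaluates each tree on the training samples left out of its bootstrap sample. The sketch below is illustrative only: the synthetic data from `make_regression` and the candidate values 50, 100, and 200 are assumptions, not settings recommended by this section.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration only)
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=42)

# Fit one forest per candidate size and record its out-of-bag R2
oob_scores = {}
for n_trees in (50, 100, 200):
    model = RandomForestRegressor(
        n_estimators=n_trees,
        oob_score=True,  # score each sample using only trees that did not see it
        random_state=42,
    )
    model.fit(X, y)
    oob_scores[n_trees] = model.oob_score_
    print(f"n_estimators={n_trees}: OOB R2 = {model.oob_score_:.3f}")
```

Other hyperparameters, such as `max_features` or `max_depth`, can be compared in the same way by changing what the loop varies.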