Classification and regression trees#
See also
A decision tree chapter in ML Handbook
Binary splitting#
Suppose we have a dataset \(\mathcal D = (\boldsymbol X, \boldsymbol y)\) with \(n\) training samples, each of which has \(d\) numeric features. To make a binary split one needs to do the following:
take the \(j\)-th feature, \(1\leqslant j \leqslant d\)
select a threshold \(t\)
split the dataset by the condition \(\boldsymbol X_{:, j} \leqslant t\)
After that the dataset \(\mathcal D\) is split into two parts, \(\mathcal D_\ell\) and \(\mathcal D_r\), such that the \(j\)-th feature of samples from \(\mathcal D_\ell\) does not exceed \(t\), whereas the \(j\)-th feature of samples from \(\mathcal D_r\) is greater than \(t\).
This splitting operation continues recursively: apply binary splitting to the left and right nodes with datasets \(\mathcal D_\ell\) and \(\mathcal D_r\) respectively. In this way we obtain a binary decision tree consisting of decision nodes.
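As an illustration, a single binary split fits in a few lines of NumPy (binary_split is our own helper, not a library function):

import numpy as np

def binary_split(X, y, j, t):
    # Samples whose j-th feature does not exceed t go to the left part
    mask = X[:, j] <= t
    return (X[mask], y[mask]), (X[~mask], y[~mask])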
Q. How many variants are there for selecting a feature \(j\) and a threshold \(t\) at the first step, when we create the root node? Suppose that we want to avoid trivial splits, in which one of the subnodes is empty.
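A numeric check of your answer could look like this (a sketch; count_candidate_splits is a hypothetical helper):

import numpy as np

def count_candidate_splits(X):
    # For each feature, a non-trivial threshold can be placed between
    # every pair of consecutive distinct sorted values of that feature
    return sum(len(np.unique(X[:, j])) - 1 for j in range(X.shape[1]))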
Stopping criteria#
Where should we stop and refrain from splitting a decision node any further? There are different strategies:
All data points in a node belong to the same class (for classification)
A pre-defined depth of the tree is reached
A pre-defined minimum leaf size is reached
Further splitting doesn’t add much value (e.g., the decrease in impurity is too small); see the sketch after this list
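In scikit-learn these criteria map onto hyperparameters of the tree estimators; a minimal sketch with arbitrary example values:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,                 # pre-defined depth of the tree
    min_samples_leaf=10,         # pre-defined minimum leaf size
    min_impurity_decrease=1e-3,  # ignore splits that add too little value
)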
If a node is not split, it becomes a leaf. Each leaf produces a prediction: the majority class of its samples (for classification) or the average of their targets (for regression). In particular, a regression tree is always a piecewise constant model.
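The piecewise constant behavior is easy to see on synthetic 1-D data (a minimal sketch):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

reg = DecisionTreeRegressor(max_depth=3).fit(x[:, None], y)
grid = np.linspace(0, 2 * np.pi, 500)[:, None]

plt.scatter(x, y, s=10, alpha=0.5)
plt.plot(grid, reg.predict(grid), c="r");  # the prediction is a step function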
Iris dataset#
This is an example from sklearn.
Download and visualize the Iris dataset:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species");
Fit a decision tree classifier:
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
y = iris['species']
X = iris.drop("species", axis=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)
print("Accuracy:", clf.score(X, y))
Accuracy: 1.0
Plot the tree:
plot_tree(clf, filled=True);
A prettier tree can be drawn by graphviz:
import graphviz
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.columns[:-1],
                           class_names=['setosa', 'versicolor', 'virginica'],
                           filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph
To reduce overfitting, tree depth is usually limited. Depth \(2\) is enough for this toy dataset:
clf = DecisionTreeClassifier(max_depth=2)
clf = clf.fit(X, y)
print("Accuracy:", clf.score(X, y))
Accuracy: 0.96
MNIST#
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
%config InlineBackend.figure_format = 'svg'
X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X = X.astype(float).values / 255
Y = Y.astype(int).values
Visualize data:
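For instance, the first few digits can be plotted with matplotlib (a minimal sketch):

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(28, 28), cmap="gray")  # each row of X is a 28x28 image
    ax.set_title(Y[i])
    ax.axis("off")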
Split into train and test:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))
Fit a decision tree model:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
DecisionTreeClassifier()
Accuracy scores:
print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy:", DT.score(X_test, y_test))
Train accuracy: 1.0
Test accuracy: 0.8735
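The perfect train accuracy combined with a noticeably lower test accuracy indicates overfitting. Plot the confusion matrix of the test predictions: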
plt.figure(figsize=(10, 8))
plt.title("Decision tree on MNIST")
sns.heatmap(confusion_matrix(y_test, DT.predict(X_test)), annot=True);
Limit the tree depth and the minimum leaf size:
DT = DecisionTreeClassifier(max_depth=15, min_samples_leaf=3)
DT.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=15, min_samples_leaf=3)
print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy:", DT.score(X_test, y_test))
Train accuracy: 0.9615166666666667
Test accuracy: 0.8776
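Regularization decreases the train accuracy but slightly improves the test accuracy.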
Boston dataset#
Let’s apply decision trees to a regression task.
import pandas as pd
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
boston.head()
|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |
Select features and targets:
y = boston['medv']
X = boston.drop('medv', axis=1)
Split into train and test:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Train a regression decision tree:
from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor()
DT.fit(X_train, y_train)
print("Train R-score:", DT.score(X_train, y_train))
print("Test R-score:", DT.score(X_test, y_test))
Train R-score: 1.0
Test R-score: 0.7221625063841336
The model overfits since the tree is quite deep:
plot_tree(DT, filled=True);
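The actual depth and the number of leaves of the fitted tree can be queried directly:

print("Tree depth:", DT.get_depth())
print("Number of leaves:", DT.get_n_leaves())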
Let’s limit its depth:
DT = DecisionTreeRegressor(max_depth=4)
DT.fit(X_train, y_train)
print("Train R-score:", DT.score(X_train, y_train))
print("Test R-score:", DT.score(X_test, y_test))
Train R-score: 0.8990851061603367
Test R-score: 0.7005887273107992
plot_tree(DT, filled=True);
dot_data = export_graphviz(DT, out_file=None,
                           feature_names=boston.columns[:-1],
                           filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph