Classification and regression trees#

https://dataaspirant.com/wp-content/uploads/2017/01/B03905_05_01-compressor.png

See also

A decision tree chapter in ML Handbook

Binary splitting#

Suppose we have a dataset \(\mathcal D = (\boldsymbol X, \boldsymbol y)\) with \(n\) training samples, each of which has \(d\) numeric features. To make a binary split one needs to do the following:

  • take the \(j\)-th feature, \(1\leqslant j \leqslant d\)

  • select a threshold \(t\)

  • split the dataset by the condition \(\boldsymbol X_{:, j} \leqslant t\)

After that the dataset \(\mathcal D\) has been split into two parts, \(\mathcal D_\ell\) and \(\mathcal D_r\), such that the \(j\)-th feature of samples from \(\mathcal D_\ell\) does not exceed \(t\), whereas the \(j\)-th feature of samples from \(\mathcal D_r\) is greater than \(t\).

This splitting operation continues recursively: apply binary splitting to both the left and right nodes with datasets \(\mathcal D_\ell\) and \(\mathcal D_r\) respectively. In this way we obtain a binary decision tree consisting of decision nodes.
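A minimal sketch of a single binary split in NumPy (the feature index j, the threshold t, and the toy data here are purely illustrative):

import numpy as np

def binary_split(X, y, j, t):
    # split (X, y) by the condition X[:, j] <= t
    mask = X[:, j] <= t
    return (X[mask], y[mask]), (X[~mask], y[~mask])

# toy data: 5 samples with 2 features each
X_toy = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 2.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([0, 0, 1, 1, 1])
(X_l, y_l), (X_r, y_r) = binary_split(X_toy, y_toy, j=0, t=2.5)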

Q. How many variants are there for selecting a feature \(j\) and a threshold \(t\) at the first step, when we create the root node? Suppose that we want to avoid trivial splits in which one of the child nodes is empty.

Stopping criteria#

When should we stop and not split a decision node any further? There are several possible strategies.

  • All data points in a node belong to the same class (for classification)

  • A pre-defined depth of the tree is reached

  • A pre-defined minimum leaf size is reached

  • Further splitting doesn’t add much value

If a node is not split, it becomes a leaf. Each leaf produces a prediction as the majority class of the samples in this leaf (for classification) or the average of their targets (for regression). A regression tree is therefore always a piecewise constant model.
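A toy sketch of how a leaf's prediction could be computed (this illustrates the idea only, not the actual sklearn internals):

import numpy as np

y_leaf_clf = np.array([0, 1, 1, 1, 2])             # class labels of samples in a leaf
majority_class = np.bincount(y_leaf_clf).argmax()  # -> 1

y_leaf_reg = np.array([22.0, 24.5, 21.0])          # targets of samples in a leaf
leaf_prediction = y_leaf_reg.mean()                # -> 22.5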

https://raw.githubusercontent.com/Wei2624/AI_Learning_Hub/master/machine-learning/images/cs229_trees_3.png

Iris dataset#

This is an example from sklearn.

Download and visualize the Iris dataset:

import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species");
../_images/f15343f4f2c2ee6746a647b63b3026e2c342cd3be6b60b934570a2a9bbba6b1b.png

Fit decision tree classifier:

from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz

y = iris['species']
X = iris.drop("species", axis=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)
print("Accuracy:", clf.score(X, y))
Accuracy: 1.0

Plot the tree:

plot_tree(clf, filled=True);
../_images/48f02e4d92b7c8c1ecb0e9d74cee349924f759b6c3b0fc2fba0ec610736bacbb.png

A prettier tree can be drawn with graphviz:

import graphviz

dot_data = export_graphviz(clf, out_file=None, 
                     feature_names=iris.columns[:-1],  
                     class_names=['setosa', 'versicolor', 'virginica'],  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
../_images/29401f9cf315d4982eeaeaa5944f6e23c8e686640245858cc3be9c0054ffefcb.svg

To reduce overfitting, tree depth is usually limited. Depth \(2\) is enough for this toy dataset:

clf = DecisionTreeClassifier(max_depth=2)
clf = clf.fit(X, y)
print("Accuracy:", clf.score(X, y))
Accuracy: 0.96
../_images/6de197877034cd57eabc6b799301ea137ee6d6ad6b114df3a31fe4a09b0226bc.svg
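The accuracy above is still measured on the training data. A hedged sketch of estimating out-of-sample accuracy with cross-validation (the choice of 5 folds is arbitrary):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(max_depth=2), X, y, cv=5)
print("CV accuracy:", scores.mean())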

MNIST#

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

X = X.astype(float).values / 255
Y = Y.astype(int).values

Visualize data:
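Each row of X is a flattened 28×28 image; a possible sketch for showing the first few digits (the number of digits displayed is arbitrary):

fig, axes = plt.subplots(1, 10, figsize=(10, 2))
for i, ax in enumerate(axes):
    ax.imshow(X[i].reshape(28, 28), cmap="gray")
    ax.set_title(Y[i])
    ax.axis("off")
plt.show()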

../_images/0949e5cc2953092c04cfb58cc4024406306384ec62d84f3a510d6ef97a9dea47.svg

Split into train and test:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))

Fit a decision tree model:

from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
DecisionTreeClassifier()

Accuracy score:

print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy:", DT.score(X_test, y_test))
Train accuracy: 1.0
Test accuracy: 0.8735
plt.figure(figsize=(10, 8))
plt.title("Decision tree on MNIST")
sns.heatmap(confusion_matrix(y_test, DT.predict(X_test)), annot=True);
../_images/616bfd66635690d9ffe35e384d13763c5f4cdb22d81c03e248409af177492a83.svg

Limit the tree depth and size of leaves:

DT = DecisionTreeClassifier(max_depth=15, min_samples_leaf=3)
DT.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=15, min_samples_leaf=3)
print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy:", DT.score(X_test, y_test))
Train accuracy: 0.9615166666666667
Test accuracy: 0.8776
../_images/2b71a1368a6766ebdc46e3a8f1acad3a5314571a2307323e1f5b7cd23432672a.svg
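The values max_depth=15 and min_samples_leaf=3 above were picked by hand. A hedged sketch of searching over these hyperparameters with a grid search (the grid and the 3-fold CV are illustrative and may be slow on the full training set):

from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [10, 15, 20], "min_samples_leaf": [1, 3, 5]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)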

Boston dataset#

Let’s apply decision trees to a regression task.

import pandas as pd
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
boston.head()
crim zn indus chas nox rm age dis rad tax ptratio lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2

Select features and targets:

y = boston['medv']
X = boston.drop('medv', axis=1)

Split into train and test:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Train a regression decision tree:

from sklearn.tree import DecisionTreeRegressor

DT = DecisionTreeRegressor()
DT.fit(X_train, y_train)
print("Train R-score:", DT.score(X_train, y_train))
print("Test R-score:", DT.score(X_test, y_test))
Train R-score: 1.0
Test R-score: 0.7221625063841336

The model overfits since the tree is quite deep:

plot_tree(DT, filled=True);
../_images/bbc9bc15ff12ad70bb9233e2483a67507018d7ad9379c43669a284a9ae572305.svg
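To check how deep the unconstrained tree actually grows, one can query the fitted estimator (get_depth and get_n_leaves are standard sklearn tree methods):

print("Depth:", DT.get_depth())
print("Number of leaves:", DT.get_n_leaves())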

Let’s limit its depth:

DT = DecisionTreeRegressor(max_depth=4)
DT.fit(X_train, y_train)
print("Train R-score:", DT.score(X_train, y_train))
print("Test R-score:", DT.score(X_test, y_test))
Train R-score: 0.8990851061603367
Test R-score: 0.7005887273107992
plot_tree(DT, filled=True);
../_images/360f70aad6668341e6d63b3e38476d468fdc6069bf75374220cc4e945ed428b6.svg
dot_data = export_graphviz(DT, out_file=None, 
                           feature_names=boston.columns[:-1],   
                           filled=True, rounded=True,  
                           special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
../_images/85e2f38ed1ff804644d0601dee4b6d64b13bf2d1522f9a8efb0a525caf3d6bb3.svg
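The depth of 4 above was chosen by hand. A hedged sketch of sweeping max_depth and comparing train and test R² scores (the range of depths is arbitrary):

for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    print(depth,
          round(model.score(X_train, y_train), 3),
          round(model.score(X_test, y_test), 3))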