Classification tree#

Iris dataset#

This is an example from the scikit-learn documentation.

Download and visualize the Iris dataset:

import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species");
[figure: pairplot of the Iris features colored by species]

Fit decision tree classifier:

from sklearn import tree

y = iris['species']
X = iris.drop("species", axis=1)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
clf.score(X, y)
1.0
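A training score of 1.0 only says that the unrestricted tree memorized the training set. As a quick check (not part of the original example), cross-validation gives a fairer estimate of generalization:

from sklearn.model_selection import cross_val_score

# accuracy estimated on held-out folds instead of on the training data
cross_val_score(tree.DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()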

Plot the tree:

tree.plot_tree(clf, filled=True);
[figure: plot of the fitted decision tree]

A prettier tree can be drawn with graphviz:

import graphviz

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.columns[:-1],
                                class_names=['setosa', 'versicolor', 'virginica'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
[figure: graphviz rendering of the decision tree]

A depth of \(2\) is enough for this toy dataset:

clf = tree.DecisionTreeClassifier(max_depth=2)
clf = clf.fit(X, y)
clf.score(X, y)
0.96
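The rendering below is presumably the same graphviz export as above applied to this shallower tree; for completeness, a sketch of that call:

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.columns[:-1],
                                class_names=['setosa', 'versicolor', 'virginica'],
                                filled=True, rounded=True,
                                special_characters=True)
graphviz.Source(dot_data)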
[figure: graphviz rendering of the depth-2 decision tree]

MNIST#

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

X = X.astype(float).values / 255
Y = Y.astype(int).values

Visualize the data:
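The plotting code is not included here; a minimal sketch (assuming X and Y as loaded above) that shows the first ten digits:

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, img, label in zip(axes.ravel(), X[:10], Y[:10]):
    ax.imshow(img.reshape(28, 28), cmap="gray")  # each row of X is a flattened 28x28 image
    ax.set_title(label)
    ax.axis("off")
plt.show()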

[figure: sample MNIST digit images]

Split into train and test:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((60000, 784), (10000, 784), (60000,), (10000,))

Fit a decision tree model:

from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
DecisionTreeClassifier()

Accuracy score:

print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy:", DT.score(X_test, y_test))
Train accuracy: 1.0
Test accuracy: 0.8735
plt.figure(figsize=(10, 8))
plt.title("Decision tree on MNIST")
sns.heatmap(confusion_matrix(y_test, DT.predict(X_test)), annot=True);
[figure: confusion matrix heatmap for the unrestricted tree]

Limit the tree depth and the minimum number of samples per leaf:

DT = DecisionTreeClassifier(max_depth=15, min_samples_leaf=3)
DT.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=15, min_samples_leaf=3)
print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy:", DT.score(X_test, y_test))
Train accuracy: 0.9615166666666667
Test accuracy: 0.8776
[figure: confusion matrix heatmap for the restricted tree]
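The values max_depth=15 and min_samples_leaf=3 look hand-picked; a small grid search (a sketch, not part of the original notebook) is one way to choose them. Searching on a subsample keeps it fast on MNIST:

from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [10, 15, 20], "min_samples_leaf": [1, 3, 10]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train[:10000], y_train[:10000])  # subsample of the training set
search.best_params_, search.best_score_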

Splitting conditions#

Each non-terminal node contains a splitting condition that determines whether a sample goes to the left or to the right subtree. The splitting condition usually consists of comparing the value of some feature \(x_j\) with a threshold \(t\):

\[ \mathbb I[x_j \leqslant t], \quad 1\leqslant j \leqslant d. \]

According to the splitting condition, the training sample \(X\) is split into two subsamples \(X_l\) and \(X_r\) such that \(X = X_l \cup X_r\).
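In code, such a split is just a boolean mask on one feature. A minimal NumPy sketch (the feature index j and threshold t are arbitrary here):

import numpy as np

X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]])
j, t = 0, 5.5                  # feature index and threshold, chosen for illustration

mask = X[:, j] <= t            # the indicator I[x_j <= t] for every sample
X_l, X_r = X[mask], X[~mask]   # left and right subsamples, X = X_l ∪ X_r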

ChatGPT suggestions#

  1. Node Splitting

At each internal node, the tree algorithm selects a feature and a splitting criterion to divide the data into two or more child nodes. The goal is to create splits that maximize the purity or homogeneity of the class labels within each node.

  2. Leaf Nodes

The leaf nodes are the terminal nodes of the tree. Each leaf node contains a predicted class label, representing the majority class of the training samples in that node.

  3. Predictive Modeling

To make predictions for new data, you traverse the tree from the root to a leaf node based on the feature values of the new data point. The class label in the selected leaf node is the predicted class for that data point (a manual traversal of a fitted scikit-learn tree is sketched after this list).

  4. Recursive Partitioning

The process of building a classification tree is recursive. The algorithm starts with the entire dataset and recursively splits it into subsets by choosing the best feature and split criterion at each node, continuing until a stopping condition is met.

  5. Stopping Criteria

Stopping criteria are used to determine when to stop growing the tree. Common stopping criteria include limiting the tree depth, setting a minimum number of samples per leaf, or using a minimum impurity reduction threshold.

  6. Impurity Measures

In classification trees, impurity measures such as Gini impurity, entropy, or the misclassification rate are used to evaluate how much a split improves the purity (homogeneity) of the class labels. The split that minimizes the weighted impurity of the resulting child nodes is selected (a brute-force threshold search is sketched after this list).

  7. Pruning

After building a classification tree, it may be pruned to reduce overfitting. Pruning involves removing nodes that do not significantly improve the tree’s performance on a validation dataset (a cost-complexity pruning sketch follows the list).

  8. Visualization

Classification trees can be visualized graphically, making it easy to interpret and understand the model’s decision-making process.

  9. Ensemble Methods

Classification trees are often used as building blocks in ensemble methods like Random Forests and Gradient Boosting, which combine multiple trees to improve predictive accuracy and reduce overfitting (a minimal random forest example is sketched after this list).

  10. Advantages

Classification trees are interpretable, and their decision-making process is easy to understand. They can capture complex decision boundaries and interactions between features.

  11. Limitations

They can be prone to overfitting, especially if the tree is allowed to grow deep. Single trees may not generalize well on certain types of data. Ensembling methods can mitigate these limitations.
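To make the splitting and impurity items above concrete, here is a minimal sketch (my own illustration, not from the original notes) of Gini impurity, entropy, and a brute-force search for the best threshold on a single feature:

import numpy as np

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y, impurity=gini):
    # exhaustively search thresholds on one feature, minimizing the
    # weighted impurity of the two child nodes
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= t], y[x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy example: one feature, two classes
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
best_threshold(x, y)   # the threshold 3.0 separates the classes perfectly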
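Prediction by traversal can be reproduced by hand with the arrays scikit-learn stores in clf.tree_ (children_left, children_right, feature, threshold, value); a sketch on the Iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(data.data, data.target)
tr = clf.tree_

def predict_one(x):
    # walk from the root to a leaf, then return the majority class of that leaf
    node = 0
    while tr.children_left[node] != -1:               # -1 marks a leaf
        if x[tr.feature[node]] <= tr.threshold[node]:
            node = tr.children_left[node]
        else:
            node = tr.children_right[node]
    return np.argmax(tr.value[node])

x = data.data[0]
predict_one(x), clf.predict([x])[0]   # the two predictions should agree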
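scikit-learn implements pruning as cost-complexity pruning via the ccp_alpha parameter; a sketch of choosing alpha on a held-out validation set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# grow a full tree, then prune it back with increasing cost-complexity alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"val accuracy={pruned.score(X_val, y_val):.3f}")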
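And a minimal ensemble example: a random forest, which averages many randomized trees and usually outperforms a single tree:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 randomized trees, evaluated with 5-fold cross-validation
rf = RandomForestClassifier(n_estimators=100, random_state=0)
cross_val_score(rf, X, y, cv=5).mean()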

Classification trees are widely used in various domains, including healthcare, finance, and natural language processing, for tasks such as spam email detection, disease diagnosis, and sentiment analysis. Proper tuning of hyperparameters and consideration of potential overfitting are essential when working with classification trees.