Impurity and information criteria#
Information criteria are statistical measures used to evaluate and compare models, including decision trees, based on how well they fit the data while penalizing model complexity. In decision trees they are used when splitting a node to determine the best feature and split point. Two common criteria in this context are entropy and the Gini impurity.
Impurity measure in classification trees#
Let \(\mathcal Y = \{1, \ldots, K\}\). Estimate the probability of class \(k\) at some node \(V\) as the fraction of training samples in \(V\) that belong to class \(k\):

\[
p_k = \frac 1{\vert V \vert} \sum\limits_{(x_i, y_i) \in V} \mathbb 1[y_i = k], \quad k = 1, \ldots, K.
\]
The two main impurity measures for classification are entropy and Gini.
Entropy#
The entropy of the distribution \((p_k)\) is

\[
\mathbb H[p] = -\sum\limits_{k=1}^K p_k \log p_k.
\]
If the logarithm is binary, entropy is measured in bits; if it is natural, in nats.
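A minimal numerical sketch of this definition (the function entropy and the example distribution below are illustrative, not from the text):

import numpy as np

def entropy(p, base=np.e):
    # entropy of a discrete distribution p; zero probabilities contribute 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.25, 0.25]))          # in nats
print(entropy([0.5, 0.25, 0.25], base=2))  # in bits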
Q. What is the entropy of the uniform distribution, i.e., \(p_k = \frac 1K\), \(k=1,\ldots, K\)?
Gini#
Note that \(-\ln(1-t) \sim t\) as \(t \to 0\). Taking \(t = 1 - p_k\) and replacing \(-\log p_k\) by \(1-p_k\) in the entropy formula, the entropy turns into the Gini impurity measure

\[
\mathbb G[p] = \sum\limits_{k=1}^K p_k (1 - p_k).
\]
Q. What is the Gini measure of the uniform distribution \(p_k = \frac 1K\), \(k=1,\ldots, K\)?
The Gini measure can also be written as

\[
\mathbb G[p] = \sum\limits_{k=1}^K p_k (1 - p_k) = 1 - \sum\limits_{k=1}^K p_k^2.
\]
From here it is clear that \(\mathbb G[p] < 1\) for any distribution \(p\).
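A quick numerical check that the two forms agree (the helper names are illustrative):

import numpy as np

def gini(p):
    # Gini impurity as the sum of p_k * (1 - p_k)
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))

def gini_alt(p):
    # equivalent form: 1 minus the sum of squared probabilities
    p = np.asarray(p, dtype=float)
    return 1 - np.sum(p ** 2)

p = [0.5, 0.25, 0.25]
print(gini(p), gini_alt(p))  # both give 0.625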
If \(K=2\), then the probabilities are equal to \(p\) and \(1-p\) for some \(p \in [0, 1]\). For this particular case the comparison of the entropy and Gini measures is shown in the next figure.
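The figure can be reproduced with a short matplotlib sketch (binary entropy taken in nats; the styling of the original picture is unknown):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
H = -p * np.log(p) - (1 - p) * np.log(1 - p)  # binary entropy in nats
G = 2 * p * (1 - p)                           # binary Gini impurity

plt.plot(p, H, label="entropy")
plt.plot(p, G, label="Gini")
plt.xlabel("p")
plt.legend()
plt.show()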
Information gain#
Suppose that in the node \(V\) the distribution of classes is \(p\), and after the split the distributions in the child nodes \(V^l\) and \(V^r\) are \(p^l\) and \(p^r\) respectively. The information gain of the split \(V \to (V^l, V^r)\) is

\[
\mathbb H[p] - \frac{\vert V^l \vert}{\vert V \vert} \mathbb H[p^l] - \frac{\vert V^r \vert}{\vert V \vert} \mathbb H[p^r]
\]
(the Gini measure could be used instead of the entropy). The greater the information gain, the more uncertainty was removed by the split.
Q. Suppose that \(p^l = p^r = p\). What is the information gain of such split?
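A small sketch that computes the information gain of a candidate split from raw class counts (the function names are illustrative):

import numpy as np

def entropy_from_counts(counts):
    # entropy of a node given its class counts
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def information_gain(parent, left, right):
    # entropy of the parent minus the size-weighted entropies of the children
    n, n_l, n_r = np.sum(parent), np.sum(left), np.sum(right)
    return (entropy_from_counts(parent)
            - n_l / n * entropy_from_counts(left)
            - n_r / n * entropy_from_counts(right))

# a perfect split of a balanced two-class node removes all uncertainty
print(information_gain([50, 50], [50, 0], [0, 50]))  # ln 2 ≈ 0.693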
Iris dataset#
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, export_graphviz
iris = sns.load_dataset("iris")
y = iris['species']
X = iris.drop("species", axis=1)
# depth-2 tree trained with the entropy criterion
clf_ent = DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf_ent = clf_ent.fit(X, y)
print("Entropy accuracy:", clf_ent.score(X, y))
# depth-2 tree trained with the Gini criterion
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=2)
clf_gini = clf_gini.fit(X, y)
print("Gini accuracy:", clf_gini.score(X, y))
Entropy accuracy: 0.96
Gini accuracy: 0.96
Entropy decision tree:
import graphviz
dot_data = export_graphviz(clf_ent, out_file=None,
feature_names=iris.columns[:-1],
class_names=['setosa', 'versicolor', 'virginica'],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
Gini decision tree:
import graphviz
dot_data = export_graphviz(clf_gini, out_file=None,
feature_names=iris.columns[:-1],
class_names=['setosa', 'versicolor', 'virginica'],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
Q. What is the information gain of each split in this tree?
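One way to check the answer is to read the impurities stored in the fitted tree; a sketch assuming the clf_ent trained above (sklearn keeps impurities, children and sample counts in the tree_ attribute):

tree = clf_ent.tree_
n = tree.weighted_n_node_samples
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left != -1:  # internal node, i.e. a split
        gain = (tree.impurity[node]
                - n[left] / n[node] * tree.impurity[left]
                - n[right] / n[node] * tree.impurity[right])
        print(f"node {node}: information gain {gain:.3f}")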
Impurity measure in regression trees#
If we choose MSE as the loss function, the impurity measure of the leaf \(V\) is given by

\[
\mathbb H[V] = \frac 1{\vert V \vert} \sum\limits_{(x_i, y_i) \in V} (y_i - \bar y_V)^2, \quad
\bar y_V = \frac 1{\vert V \vert} \sum\limits_{(x_i, y_i) \in V} y_i.
\]

After splitting into \(V^l\) and \(V^r\) we obtain

\[
\frac{\vert V^l \vert}{\vert V \vert} \mathbb H[V^l] + \frac{\vert V^r \vert}{\vert V \vert} \mathbb H[V^r],
\]

where \(\mathbb H[V^l]\) and \(\mathbb H[V^r]\) are the MSE impurities of the child nodes, computed in the same way from the targets that fall into \(V^l\) and \(V^r\). The best split maximizes the information gain

\[
\mathbb H[V] - \frac{\vert V^l \vert}{\vert V \vert} \mathbb H[V^l] - \frac{\vert V^r \vert}{\vert V \vert} \mathbb H[V^r].
\]
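A small numerical sketch of this criterion on a toy one-dimensional sample (the data and threshold are made up for illustration):

import numpy as np

def mse_impurity(y):
    # mean squared deviation of the targets from their mean
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def split_gain(x, y, threshold):
    # impurity of the node minus the size-weighted impurities of the children
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    left, right = y[x <= threshold], y[x > threshold]
    n, n_l, n_r = len(y), len(left), len(right)
    return mse_impurity(y) - n_l / n * mse_impurity(left) - n_r / n * mse_impurity(right)

x = np.array([1., 2., 3., 10., 11., 12.])
y = np.array([1., 1., 2., 9., 10., 11.])
print(split_gain(x, y, threshold=5))  # large gain: the split separates small and large targets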
Boston dataset#
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# load the Boston housing data (ISLP version) and drop the index column
boston = pd.read_csv("../datasets/ISLP/Boston.csv").drop("Unnamed: 0", axis=1)
y = boston['medv']
X = boston.drop('medv', axis=1)
# fit a depth-2 regression tree and evaluate R^2 on the training data
DT = DecisionTreeRegressor(max_depth=2)
DT.fit(X, y)
print("Train R^2 score:", DT.score(X, y))
Train R^2 score: 0.695574477973027
import graphviz
dot_data = export_graphviz(DT, out_file=None,
feature_names=boston.columns[:-1],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
Q. Find information gain of each split in this regression tree.
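As a hint, the gain of the root split can be recomputed directly from the data; a sketch assuming the DT, X, y from above (ddof=0 makes pandas' variance match the MSE impurity):

tree = DT.tree_
# feature and threshold chosen at the root, read from the fitted tree
feature, threshold = tree.feature[0], tree.threshold[0]
mask = X.iloc[:, feature] <= threshold
# variance of the node minus the size-weighted variances of the two children
gain = y.var(ddof=0) - mask.mean() * y[mask].var(ddof=0) - (~mask).mean() * y[~mask].var(ddof=0)
print(gain)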