Decision Trees are another type of classification model, and they form the foundation on which Random Forests are built.
Intuitively: a tree is built by recursive binary splitting. At each step the algorithm picks a feature and the threshold that best separates the classes toward the classification goal, then repeats the process inside each resulting region, taking the previous splits into account. (We will peek at the splits a fitted tree actually chooses further below.)
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# some data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');
from sklearn.tree import DecisionTreeClassifier
from ipywidgets import interact
def DCTree_wPlot(model, X, y):
    # Illustrating the decision boundaries is considerably more difficult than fitting
    ax = plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap='viridis',
               clim=(y.min(), y.max()), zorder=3)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # fit the model, then evaluate it on a fine grid spanning the plot
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    # put the result into a color plot of the predicted class regions
    n_classes = len(np.unique(y))
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.2,
                levels=np.arange(n_classes + 1) - 0.5,
                cmap='rainbow', clim=(y.min(), y.max()),
                zorder=1)
    ax.set(xlim=xlim, ylim=ylim)

# Building the model is pretty trivial
model = DecisionTreeClassifier(max_depth=3)
DCTree_wPlot(model, X, y)
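To connect this back to the splitting intuition above, we can print the thresholds the fitted tree actually chose. A minimal sketch using scikit-learn's export_text; the names 'x0' and 'x1' are just labels I've given the two blob coordinates.
from sklearn.tree import export_text
# print the split rules learned by the depth-3 tree fitted above
print(export_text(model, feature_names=['x0', 'x1']))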
The example above uses a fixed depth of 3. Let's try again with a depth of 6 for illustration.
As you can see below, the tighter fit comes at a cost in complexity and overfitting. Think of a high-degree polynomial in regression: you can increase the degree until the training error converges to 0, but the trade-off is a high-variance model that fits the noise rather than the signal.
model = DecisionTreeClassifier(max_depth=6)
DCTree_wPlot(model,X,y)
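One way to see the trade-off concretely is to compare training and held-out accuracy at the two depths. A rough sketch; the 70/30 split and random_state are arbitrary choices, not from the text above.
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)
for depth in (3, 6):
    m = DecisionTreeClassifier(max_depth=depth).fit(Xtr, ytr)
    # training accuracy keeps climbing with depth; test accuracy is what matters
    print(depth, m.score(Xtr, ytr), m.score(Xte, yte))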
At this time scikit-learn has no function that computes the best depth directly. However, our friends at Stack Overflow have provided a general algorithm.
Credit to Kasra Manshaei
let x = a sequence of potential tree depths (be sure to include some small and some large values)
for each i in x:
    split the data into train & test sets (70/30)
    create a model on the train set at depth i
    test the model on the test set
    record e[i] = the error at depth i
compute d_best = the i with the minimum e[i]
now take d_best and repeat the search over depths close to d_best
repeat the process
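A minimal Python sketch of the first pass of that algorithm, assuming the simple 70/30 split described above; the candidate depths and random_state are arbitrary.
from sklearn.model_selection import train_test_split
depths = [1, 2, 3, 4, 6, 8, 12, 16]            # some small and some large
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
errors = {}
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d).fit(Xtr, ytr)
    errors[d] = 1 - clf.score(Xte, yte)        # test-set error at depth d
d_best = min(errors, key=errors.get)
print(d_best, errors)
# next pass: repeat the search over depths close to d_best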
Imagine that you could combine multiple overfitted models into one. It turns out you can, and doing so improves the final result.
Bagging is an ensemble approach. It takes multiple estimators, each of which over-fits on its own, fits each one to a random subset of the data, and combines them into a single model by voting/averaging over their predictions.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
dtree = DecisionTreeClassifier()
bagmdl = BaggingClassifier(dtree, n_estimators=100, max_samples=0.8,
                           random_state=1)
DCTree_wPlot(bagmdl,X,y)
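For reference, scikit-learn packages this idea (bagged decision trees, plus random feature selection at each split) as RandomForestClassifier. A minimal sketch with an arbitrary number of trees:
from sklearn.ensemble import RandomForestClassifier
# a random forest is essentially an optimized ensemble of bagged, randomized trees
forest = RandomForestClassifier(n_estimators=100, random_state=1)
DCTree_wPlot(forest, X, y)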