Decision Trees

Decision trees are another type of classification model; they also form the building block underlying random forests.

Intuitively, a tree is built by recursive splitting: at each node the algorithm picks a feature and the threshold on it that best separates the classes, then repeats the process on each resulting subset, taking the previous splits into account. We will inspect the splits a fitted tree actually learns a little further below.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

The data

In [2]:
# some data 
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

Modeling and Plotting

In [3]:
from sklearn.tree import DecisionTreeClassifier
from ipywidgets import interact

def DCTree_wPlot(model, X, y):
    # Fitting the model is trivial; illustrating it is considerably more work
    ax = plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap='viridis', zorder=3)

    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    model.fit(X, y)

    # Evaluate the model on a fine grid covering the plot area
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the predicted class of each grid point into a color plot
    n_classes = len(np.unique(y))
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.2,
                levels=np.arange(n_classes + 1) - 0.5,
                cmap='rainbow', zorder=1)
    ax.set(xlim=xlim, ylim=ylim)

# Building the model is pretty trivial
model = DecisionTreeClassifier(max_depth=3)
DCTree_wPlot(model, X, y)
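To connect this back to the splitting intuition, we can print the thresholds the fitted tree actually chose. A minimal check, assuming the cell above has been run (scikit-learn 0.21+); the feature names x0 and x1 are just labels for the two blob coordinates:

from sklearn.tree import export_text

# Text dump of the learned splits, one line per node
print(export_text(model, feature_names=['x0', 'x1']))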

Overfitting

The above example uses a fixed depth of 3. Let's try again with a depth of 6 for illustrative purposes.

As you can see below, the finer decision boundary comes at a cost in complexity and variance. Think of it like a high-degree polynomial in regression: you can increase the degree until your training error converges to 0, but the trade-off is that the model overfits the training data and generalizes poorly.

In [4]:
model = DecisionTreeClassifier(max_depth=6)
DCTree_wPlot(model,X,y)
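One way to see the overfitting numerically is to compare training accuracy against cross-validated accuracy as the depth grows. A quick sketch; the depths tried here are arbitrary choices:

from sklearn.model_selection import cross_val_score

for depth in (3, 6, 12, None):   # None = grow the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    train_acc = tree.score(X, y)   # accuracy on the data it was trained on
    cv_acc = cross_val_score(
        DecisionTreeClassifier(max_depth=depth), X, y, cv=5).mean()
    print(f'depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}')

Training accuracy keeps climbing with depth while the cross-validated accuracy levels off or drops, which is the signature of overfitting.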

Best Depth

At the time of writing there is no built-in function that returns the best depth directly. However, our friends at Stack Overflow have provided a general search procedure.

Credit to Kasra Manshaei

let x = a sequence of candidate tree depths (include some small and some large values)
for each i in x:
    split the data into train & test sets (70/30)
    fit a model on the train set with depth i
    evaluate the model on the test set
    record e[i] = error at depth i

d_best = the i that minimizes e[i]

Now take d_best, rerun the search over depths close to d_best,
and repeat the process.
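A minimal sketch of the first pass of this search, using scikit-learn's train_test_split; the candidate depths and the random seed are arbitrary choices:

from sklearn.model_selection import train_test_split

# candidate depths -- a few small and a few large values
depths = [1, 2, 3, 4, 6, 8, 12, 16]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

errors = {}
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d).fit(X_train, y_train)
    errors[d] = 1 - clf.score(X_test, y_test)   # test error at this depth

d_best = min(errors, key=errors.get)
print('best depth:', d_best, 'test error:', errors[d_best])
# next pass: repeat the loop over depths close to d_best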

Random Forests

Bagging

Imagine that you could combine multiple overfit models into one. It turns out you can, and doing so improves the final result.

Bagging is an ensemble approach: it fits many estimators, each of which overfits on its own, on random subsets of the data and then combines their predictions by voting. The averaging washes out much of the individual trees' overfitting.

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

dtree = DecisionTreeClassifier()
bagmdl = BaggingClassifier(dtree, n_estimators=100, max_samples=0.8,
                        random_state=1)
DCTree_wPlot(bagmdl,X,y)
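A random forest packages this idea (bagged trees, plus extra randomness in which features each split considers) into a single estimator. A minimal sketch equivalent to the cell above; n_estimators is an arbitrary choice:

from sklearn.ensemble import RandomForestClassifier

# bagged decision trees with per-split feature randomness, in one estimator
forest = RandomForestClassifier(n_estimators=100, random_state=0)
DCTree_wPlot(forest, X, y)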