
Support Vector Machines - SVMs

SVMs are another classification algorithm, applied to the same kinds of problems as Naive Bayes. Where Naive Bayes builds a generative model that describes how each class's points are distributed, an SVM builds a discriminative model that directly separates the groups. The fitted model usually appears as a line or curve that separates the groups with the greatest possible distance, or margin, padding it on either side.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

import seaborn as sns
sns.set()

The idea

We begin by creating some junk data to help visualize the type of problem SVMs are used for. This will also let us watch the SVM at work.

In [2]:
from sklearn.datasets import make_blobs  # note: the old samples_generator path no longer exists in recent sklearn
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

Our clusters are pretty far apart. We want to draw a line between them; this line will form our linear discriminative classifier. In the graph below we see three possible lines, and the x simulates a future value. How do we pick the right line to use? Simple ... we choose the line that most evenly divides the space between the clusters.

In [3]:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.plot([0.6], [2.1], 'x', color='red', markeredgewidth=1, markersize=10)

# three candidate separating lines
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    plt.plot(xfit, m * xfit + b, '-k')

# dashed lines sketching the margin around the middle candidate
for j, i in [(0.5, 1.2), (0.5, 2)]:
    plt.plot(xfit, j * xfit + i, dashes=[1, 1])
    
    
plt.xlim(-1, 3.5);

In the above graph there are three potential linear discriminants, drawn as solid black lines. However, it is the middle one that best divides the space between the two clusters. We define best as having the largest margins, represented by the dashed lines. Of course the above graph is simply an illustration; the SVM algorithm does a much better job than eyeballing the data.

In short: an SVM calculates the line whose margin, the distance to the nearest points of each cluster, is as large as possible.
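
For the mathematically inclined, this is the standard hard-margin formulation (stated here in its usual textbook form, with labels coded as $y_i = \pm 1$):

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{ for all } i$$

The separating line is $w \cdot x + b = 0$ and the margin width works out to $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.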

Linear SVM Model

As usual the model creation is quite trivial.

In [4]:
from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
Out[4]:
SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [5]:
# there's no built in graphing so we google for one
# Credit to https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none');
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);

print(model.support_vectors_)  # Array of boundary points
[[0.5323772  3.31338909]
 [2.11114739 3.57660449]
 [1.46870582 1.86947425]]

Pretty nice, eh?

The array above lists the points which directly affect the model. SVMs care mostly about boundary points, which are used to determine the distance between the clusters. In other words, a cluster can be described mathematically by its edge elements.
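
As a quick sanity check (a sketch, assuming model, X and y from the cells above are still in scope), a model refit on only the support vectors should recover essentially the same line:

# refit using just the support vectors and their labels
sv_model = SVC(kernel='linear', C=1E10)
sv_model.fit(model.support_vectors_, y[model.support_])

print(model.coef_, model.intercept_)        # original fit
print(sv_model.coef_, sv_model.intercept_)  # fit on the support vectors alone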

We can illustrate this by increasing the number of data points and recreating the model for each set.

In [6]:
from ipywidgets import interact, fixed

def plot_svm(N=10, ax=None):
    # generate enough samples for the largest N option below
    X, y = make_blobs(n_samples=500, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)
    
    ax = ax or plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 6)
    plot_svc_decision_function(model, ax)


interact(plot_svm, N=[100, 250, 500], ax=fixed(None));

While the shape and size of the clusters have changed, it's still the boundary points which matter. Inner points do not change the distance between the clusters and therefore don't matter. Note that the line above changes little as N increases.
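
If you're reading a static render of this notebook, the widget above won't respond. The sketch below draws the same comparison side by side by reusing plot_svm for a few values of N:

fig, ax = plt.subplots(1, 3, figsize=(16, 5))
for axi, N in zip(ax, [100, 250, 500]):
    plot_svm(N, axi)                             # refit and draw the boundary for the first N points
    axi.set_title('N = {0}'.format(N), size=14)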

Nonlinear SVM

The example above is meant to build intuition about SVMs. Chances are you'll rarely run into such a simple case in real life.

Recall the model constructor:

from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

Note the argument kernel='linear'. We defined the model as a linear one. When our clusters are linearly separable this is fine, but what if they're not? Well, SKLearn has us covered: http://scikit-learn.org/stable/modules/svm.html#svm-kernels

There are several built-in options (a short usage sketch follows the list). You can also build your own.

  • linear
  • polynomial
  • rbf - Radial Basis Function (we use this one in the code below)
  • sigmoid
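
Each built-in kernel is selected by name when constructing the classifier. A minimal sketch (the parameter values shown are only illustrative, not recommendations):

SVC(kernel='linear')                  # straight-line boundary
SVC(kernel='poly', degree=3)          # polynomial
SVC(kernel='rbf', gamma=1.0)          # radial basis function
SVC(kernel='sigmoid')                 # sigmoid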

In the example below we can clearly see that a linear model is not applicable.
We have two options:

  • We can use a different kernel, such as rbf.
  • Or we can transform the data, in a process similar to adding basis functions in regression.
In [7]:
from sklearn.datasets import make_circles
X, y = make_circles(100, factor=.1, noise=.1)

clf = SVC(kernel='linear').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf, plot_support=False);
In [8]:
from mpl_toolkits import mplot3d

r = np.exp(-(X ** 2).sum(1))

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='prism')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')

plot_3D()
#interact(plot_3D, elev=[-90, 90], azip=(-180, 180),
#         X=fixed(X), y=fixed(y));

The introduction of the r variable adds a third dimension. Now we can clearly see a gap between the two clusters, and it becomes trivial to model a plane that evenly bisects the space. r = 0.6 or 0.7 might be a decent guess, but let's let the algorithm take care of that.

This approach does have a glaring problem; can you see it? It requires knowledge of the radial function r. As with all machine learning algorithms, we would prefer to take the human out of the equation. Thankfully, there is a better approach.

Kernel Transformation

A basis function is computed centered on every point in the data, and the SVM sifts through the results and chooses the best. This may seem costly, but thanks to the kernel trick (https://en.wikipedia.org/wiki/Kernel_method) we never need to build the entire set of basis functions. The model can be fit implicitly rather than explicitly, which would otherwise require an N-dimensional array. The math behind the kernel trick is fairly involved, but it is built into the SVM routine as the rbf kernel, so we need not rehash it.
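
For reference, the rbf kernel used below measures the similarity of two points with the standard radial basis function (stated here without derivation):

$$K(x, x') = \exp\!\left(-\gamma\,\lVert x - x' \rVert^2\right)$$

where gamma controls how quickly the similarity decays with distance; it is the same gamma we tune in the grid search at the end of this notebook.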

In [9]:
clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)
Out[9]:
SVC(C=1000000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [10]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=300, lw=1, facecolors='none');

Using rbf we get a kernelized SVM capable of discriminating between two clusters that no straight line could separate.
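
As a small usage check (assuming clf and the circles data from above), we can ask the kernelized model to classify a couple of new points; a point near the origin should receive the inner-circle label and a distant point the outer-circle label:

print(clf.predict([[0.0, 0.0], [1.0, 1.0]]))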

Soft SVMs

Our data so far has been clearly delineated. In reality most data will have some overlap, which we'll need to handle. You may have noticed the second argument, C=1E6, when we create an SVM model object. C is the penalty applied to points that violate the margin. The larger C is, the less forgiving the margin; this is fine when the data has no overlap, but the more overlap there is, the more forgiving the model must be.
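
In the soft-margin formulation this forgiveness shows up as slack variables in the objective (the usual textbook form, with labels coded as $y_i = \pm 1$):

$$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

A large C makes slack expensive, so points are pushed out of the margin; a small C tolerates margin violations in exchange for a wider margin.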

In the graph on the left C=10.0 and there is very little forgiveness in the margins. On the right, however, C=0.1 and a few points sneak into the margin.

In [11]:
X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=0.8)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for axi, C in zip(ax, [10.0, 0.1]):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none');
    axi.set_title('C = {0:.1f}'.format(C), size=15)

Parameter Calibration

The following snippet comes from the SKLearn documentation on SVMs. It demonstrates how to calibrate C (along with the rbf kernel's gamma), which together control the margins of the SVM classifier.

In [12]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV


C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
grid.fit(X, y)

print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
The best parameters are {'C': 1.0, 'gamma': 1.0} with a score of 1.00
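
Since GridSearchCV refits on the full data by default, the fitted grid also exposes the winning model directly, so it can be used for prediction like any other estimator (a short sketch, assuming grid and X from above):

best_model = grid.best_estimator_   # an SVC refit with the best C and gamma
print(best_model.predict(X[:5]))    # classify the first few training points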