Credit (https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)
The K-means algorithm looks to assign each point in a dataset to one of a pre-defined number of groups, or clusters. It does this with an iterative algorithm called Expectation-Maximization (EM).
The E-step (expectation) is pretty straightforward, as it's a simple assignment: each point is assigned to its nearest cluster centre. The M-step (maximization) then shifts each centre to the mean of the points assigned to it.
A much better, albeit much more advanced, explanation can be found on Wikipedia.
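To make the two steps concrete, here is a minimal NumPy sketch of one E-step and one M-step on a handful of made-up 2D points (the points and starting centres are just for illustration):
import numpy as np

# A few made-up 2D points and two starting centres (illustration only)
points = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
centres = np.array([[0.0, 0.0], [5.0, 5.0]])

# E-step: assign each point to its nearest centre
distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# M-step: move each centre to the mean of the points assigned to it
centres = np.array([points[labels == k].mean(axis=0) for k in range(len(centres))])

print(labels)   # -> [0 0 1 1]
print(centres)  # -> [[0.25 0.1 ] [5.1  4.9 ]]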
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # for plot styling
import numpy as np
# As usual we create some sample data grouped rather nicely for illustration
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
The data above contains 4 clusters, which we'll now feed into the k-means algorithm.
Let's see it in action.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], s=50)
# A more colourful version
# plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
# Plot the k-means centre points
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=500, alpha=0.75);
The algorithm does not always return the globally best solution. The result is highly dependent on the predetermined number of clusters fed into the system (for example, try n_clusters = 4 or 5 in the above code). This can be OK if you know beforehand how many classifications there are; ideally, though, you'll want the data to tell you. Try using sklearn.metrics.silhouette_score to help determine the optimal number of clusters, as sketched below.
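For example, using silhouette_score from sklearn.metrics on the X generated above (the range of k values tried here is arbitrary):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Score a range of cluster counts and keep the one with the highest silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

print(scores)
print('Best k:', max(scores, key=scores.get))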
The larger the dataset, the slower it becomes.
The algorithm is geared towards linear cluster boundaries. Similar to Naive Bayes, if the groups are not linearly separable then the model breaks down. However, just as SVM can elevate the data to a higher dimension, there are algorithms for doing so here as well (see the sketch below).
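One such option in scikit-learn is SpectralClustering, which uses a graph/kernel-style representation of the data before clustering. A small sketch on a non-linear "moons" dataset (the dataset and parameters are just for illustration, not from the original text):
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

# Two interleaved half-moons: plain k-means would split these badly
X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# A nearest-neighbours affinity lets the non-linear structure come through
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, s=50, cmap='viridis');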
Let's see if we can implement the algorithm ourselves.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
#plt.scatter(X[:, 0], X[:, 1], s=50);
type(X)
X[:10,:]
#np.random.choice(X,5)
#Credit: https://pythonprogramming.net/k-means-from-scratch-2-machine-learning-tutorial
# /?completed=/k-means-from-scratch-machine-learning-tutorial/
#Sklearn signatures
#kmeans = KMeans(n_clusters=4)
#kmeans.fit(X)
#y_kmeans = kmeans.predict(X)
class myKMeans:
    def __init__(self, clusters=2, tolerance=0.001, maxIterations=300):
        self.clusters = clusters
        self.tolerance = tolerance
        self.maxIterations = maxIterations

    def fit(self, data):
        self.centres = {}
        # Seed the centres with the first few elements of the data.
        # This should really be randomized.
        for i in range(self.clusters):
            self.centres[i] = data[i]
        for i in range(self.maxIterations):
            self.class_labels = {}
            for j in range(self.clusters):
                self.class_labels[j] = []
            # E-step: assign each point to its nearest centre
            for featureset in data:
                distances = [np.linalg.norm(featureset - self.centres[centroid]) for centroid in self.centres]
                classification = distances.index(min(distances))
                self.class_labels[classification].append(featureset)
            old_centres = dict(self.centres)
            # M-step: move each centre to the mean of its assigned points
            for pt in self.class_labels:
                self.centres[pt] = np.average(self.class_labels[pt], axis=0)
            # We now need to determine whether the algorithm has done its best:
            # stop once no centre moves by more than the tolerance (in percent)
            optimal = True
            for c in self.centres:
                prv_centres = old_centres[c]
                new_centres = self.centres[c]
                movement = np.sum(np.abs((new_centres - prv_centres) / prv_centres * 100.0))
                if movement > self.tolerance:
                    print(movement)
                    optimal = False
            if optimal:
                break

    def predict(self, data):
        # Classify a single point by the centre it is closest to
        distances = [np.linalg.norm(data - self.centres[centroid]) for centroid in self.centres]
        classification = distances.index(min(distances))
        return classification
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

# One colour per cluster ('o' is a marker code, not a colour, so stick to valid colour codes)
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']

X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
#plt.scatter(X[:, 0], X[:, 1], s=50);

myclust = myKMeans(clusters=5)
myclust.fit(X)

# Plot the fitted centres in black
for centre in myclust.centres:
    plt.scatter(myclust.centres[centre][0], myclust.centres[centre][1],
                marker="o", color="k", s=150, linewidths=5)

# Plot each point coloured by the cluster it was assigned to
for class_label in myclust.class_labels:
    color = colors[class_label]
    for featureset in myclust.class_labels[class_label]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)

plt.show()
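As a quick sanity check of the predict method, we can classify a couple of new points against the fitted centres (the points below are made up purely for illustration):
new_points = np.array([[0.0, 2.0], [2.0, 1.0]])
for pt in new_points:
    print(pt, '-> cluster', myclust.predict(pt))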