The idea

Credit (https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)

The K-means algo look to define the set of points belonging to a pre-defined set of groups or clusters. It does this by

  1. measuring the arithmetic mean of all the points belonging to the cluster
  2. and by grouping the points closest to the centre than to other centres

In order to determine the groups it will use iterative algorithm called the Expectation-Maximization.

  1. It takes a guess at the cluster centre
  2. While not converged
    a. E_Step: assign points to nearest cluster centre
    b. M_Step: set cluster centres to mean

The E, for expectation, is pretty straight forward as it's a simple assignment. The M_step however shifts the centre to the mean distance of the closest points.

A much better, albeit much more advanced, explanation can be found at wikipedia

Shortcomings

  1. The algorithm does not always return the global best solution. That is highly dependent on the predetermined clusters fed into the system. (For example try n_clusters = 4 or 5 in the above code). This can be ok if you know beforehand how many classifications there are. Ideally though you'll want the data to tell you. Try using sklearn.metrics silhouette_score to help determine the optimal number of clusters

  2. The larger the dataset the slower it becomes.

  3. The algo is geared towards linear relationships. Similar to Naive Bayes if the groups are not linear then the model breaks down. However just as SVM can elevate the data to a higher dimension there are algos for doing so here as well

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np

Basic Example

Sample Data

In [2]:
# As usual we create some sample data grouped rather nicely for illustration
from sklearn.datasets.samples_generator import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);

In the above data is 4 clusters which we'll now feed into the k means algo to compute.

The Model

Let's see it in action

In [3]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], s=50)
# A more colourful version
# plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
#Plot the kmeans Centre points
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=500, alpha=0.75);

Potential Downfalls

  1. The algorithm does not always return the global best solution. That is highly dependent on the predetermined clusters fed into the system. (For example try n_clusters = 4 or 5 in the above code). This can be ok if you know beforehand how many classifications there are. Ideally though you'll want the data to tell you. Try using sklearn.metrics silhouette_score to help determine the optimal number of clusters

  2. It cans be slow

  3. The algo is geared towards linear relationships. Similar to Naive Bayes if the groups are not linear then the model breaks down. However just as SVM can elevate the data to a higher dimension there are algos for doing so here as well

K-Means from scratch

Algorithm

Let's see if we can implement the algo ourselves

  1. Let K = expected number of clusters
  2. Randomly select K points from the data to serve as our starting point
  3. Calculate the distance of points, not in K, to the points in K
  4. Assign each point to the closest K (You will now have K groups)
  5. Take the mean distance of each point in the K group
  6. Define the new K = mean from 5 if new K = old K then quit else goto 3 and repeat process
In [4]:
from sklearn.datasets.samples_generator import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
#plt.scatter(X[:, 0], X[:, 1], s=50);
type(X)
X[:10,:]
#np.random.choice(X,5)
Out[4]:
array([[ 0.53225417,  2.44603331],
       [-1.45500868,  7.30316404],
       [ 1.19994871,  5.29857675],
       [-0.96120645,  7.80977831],
       [ 1.07537238,  2.14126167],
       [ 3.78320504,  0.10265179],
       [-1.87855215,  1.2668013 ],
       [ 1.523088  ,  4.40052451],
       [ 0.05589873,  8.58125581],
       [-0.57681057,  8.86668307]])
In [5]:
#Credit: https://pythonprogramming.net/k-means-from-scratch-2-machine-learning-tutorial
#                /?completed=/k-means-from-scratch-machine-learning-tutorial/

#Sklearn signatures
#kmeans = KMeans(n_clusters=4)
#kmeans.fit(X)
#y_kmeans = kmeans.predict(X)

class myKMeans:
    def __init__(self,clusters=2,tolerance=0.001,maxIterations=300):
        self.clusters = clusters
        self.tolerance = tolerance
        self.maxIterations = maxIterations
    
    def fit(self,data):
        self.centres = {}
        
        #select elements from the data
        # This should be randomized
        for i in range(self.clusters):
            self.centres[i] = data[i]
        
        for i in range(self.maxIterations):
            self.class_labels = {}
            
            for j in range(self.clusters):
                self.class_labels[j] = []
                
            for featureset in data:
                distances = [np.linalg.norm(featureset-self.centres[centroid]) for centroid in self.centres]
                classification = distances.index(min(distances))
                self.class_labels[classification].append(featureset)  
            
            old_centres = dict(self.centres)
            
            for pt in self.class_labels:
                self.centres[pt] = np.average(self.class_labels[pt],axis=0)
                
            # we now need to determine if the algo has done it's best
            optimal = True
            
            for c in self.centres:
                prv_centres = old_centres[c]
                new_centres = self.centres[c]
                if np.sum( (new_centres-prv_centres)/prv_centres*100.0  ) > self.tolerance:
                    print(np.sum( (new_centres-prv_centres)/prv_centres*100.0  ))
                    optimal = False
            
            if optimal:
                break
                
    def predict(self,data):
        distances = [np.linalg.norm(featureset-self.centres[centroid]) for centroid in self.centres]
        classification = distances.index(min(distances))
        return classification
    
In [31]:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

colors = ['r','g','b','c','k','o','y']
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
#plt.scatter(X[:, 0], X[:, 1], s=50);

myclust = myKMeans(clusters=5)
myclust.fit(X)

for centre in myclust.centres:
    plt.scatter(myclust.centres[centre][0], myclust.centres[centre][1],
                marker="o", color="k", s=150, linewidths=5)

for class_label in myclust.class_labels:
    color = colors[class_label]
    for featureset in myclust.class_labels[class_label]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)
        
plt.show()
27.289701744117796
31.085636066572405
16.704170844538528
5.042676142273882
2.804843035233753
0.7209592661545037
3.5812352331969985
0.8622631996613725