Credit (https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)
The K-means algorithm looks to assign each point in a dataset to one of a pre-defined number of groups, or clusters. It does this with an iterative algorithm called Expectation-Maximization (EM).
The E-step (expectation) is pretty straightforward, as it's a simple assignment: each point is assigned to its nearest cluster centre. The M-step (maximization) then shifts each centre to the mean of the points assigned to it.
A much better, albeit much more advanced, explanation can be found on Wikipedia.
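To make the two steps concrete, here is a minimal NumPy sketch of one E-step and one M-step on a handful of made-up 2D points (the points and starting centres are just for illustration):
import numpy as np

# A few made-up 2D points and two starting centres (illustration only)
points = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
centres = np.array([[0.0, 0.0], [5.0, 5.0]])

# E-step: assign each point to its nearest centre
distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# M-step: move each centre to the mean of the points assigned to it
centres = np.array([points[labels == k].mean(axis=0) for k in range(len(centres))])

print(labels)   # -> [0 0 1 1]
print(centres)  # -> [[0.25 0.1 ] [5.1  4.9 ]]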
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # for plot styling
import numpy as np
# As usual we create some sample data grouped rather nicely for illustration
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
The data above contains 4 clusters, which we'll now feed into the k-means algorithm.
Let's see it in action.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], s=50)
# A more colourful version
# plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
# Plot the k-means centre points
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=500, alpha=0.75);
The algorithm does not always return the globally best solution. The result is highly dependent on the predetermined number of clusters fed into the system (for example, try n_clusters = 4 or 5 in the above code). This can be OK if you know beforehand how many classifications there are; ideally, though, you'll want the data to tell you. Try using sklearn.metrics.silhouette_score to help determine the optimal number of clusters, as sketched below.
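For example, using silhouette_score from sklearn.metrics on the X generated above (the range of k values tried here is arbitrary):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Score a range of cluster counts and keep the one with the highest silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

print(scores)
print('Best k:', max(scores, key=scores.get))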
The larger the dataset, the slower it becomes.
The algorithm is geared towards linear cluster boundaries. Similar to Naive Bayes, if the groups are not linearly separable then the model breaks down. However, just as SVM can elevate the data to a higher dimension, there are algorithms for doing so here as well (see the sketch below).
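One such option in scikit-learn is SpectralClustering, which uses a graph/kernel-style representation of the data before clustering. A small sketch on a non-linear "moons" dataset (the dataset and parameters are just for illustration, not from the original text):
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

# Two interleaved half-moons: plain k-means would split these badly
X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# A nearest-neighbours affinity lets the non-linear structure come through
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, s=50, cmap='viridis');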
Let's see if we can implement the algorithm ourselves.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
#plt.scatter(X[:, 0], X[:, 1], s=50);
type(X)
X[:10,:]
#np.random.choice(X,5)
#Credit: https://pythonprogramming.net/k-means-from-scratch-2-machine-learning-tutorial
# /?completed=/k-means-from-scratch-machine-learning-tutorial/
#Sklearn signatures
#kmeans = KMeans(n_clusters=4)
#kmeans.fit(X)
#y_kmeans = kmeans.predict(X)
class myKMeans:
    def __init__(self, clusters=2, tolerance=0.001, maxIterations=300):
        self.clusters = clusters
        self.tolerance = tolerance
        self.maxIterations = maxIterations

    def fit(self, data):
        self.centres = {}
        # Seed the centres with the first few elements of the data.
        # This should really be randomized.
        for i in range(self.clusters):
            self.centres[i] = data[i]
        for i in range(self.maxIterations):
            self.class_labels = {}
            for j in range(self.clusters):
                self.class_labels[j] = []
            # E-step: assign each point to its nearest centre
            for featureset in data:
                distances = [np.linalg.norm(featureset - self.centres[centroid]) for centroid in self.centres]
                classification = distances.index(min(distances))
                self.class_labels[classification].append(featureset)
            old_centres = dict(self.centres)
            # M-step: move each centre to the mean of its assigned points
            for pt in self.class_labels:
                self.centres[pt] = np.average(self.class_labels[pt], axis=0)
            # We now need to determine whether the algorithm has done its best:
            # stop once no centre moves by more than the tolerance (in percent)
            optimal = True
            for c in self.centres:
                prv_centres = old_centres[c]
                new_centres = self.centres[c]
                movement = np.sum(np.abs((new_centres - prv_centres) / prv_centres * 100.0))
                if movement > self.tolerance:
                    print(movement)
                    optimal = False
            if optimal:
                break

    def predict(self, data):
        # Classify a single point by the centre it is closest to
        distances = [np.linalg.norm(data - self.centres[centroid]) for centroid in self.centres]
        classification = distances.index(min(distances))
        return classification
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

# One colour per cluster ('o' is a marker code, not a colour, so stick to valid colour codes)
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']

X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.75, random_state=0)
#plt.scatter(X[:, 0], X[:, 1], s=50);

myclust = myKMeans(clusters=5)
myclust.fit(X)

# Plot the fitted centres in black
for centre in myclust.centres:
    plt.scatter(myclust.centres[centre][0], myclust.centres[centre][1],
                marker="o", color="k", s=150, linewidths=5)

# Plot each point coloured by the cluster it was assigned to
for class_label in myclust.class_labels:
    color = colors[class_label]
    for featureset in myclust.class_labels[class_label]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)

plt.show()
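As a quick sanity check of the predict method, we can classify a couple of new points against the fitted centres (the points below are made up purely for illustration):
new_points = np.array([[0.0, 2.0], [2.0, 1.0]])
for pt in new_points:
    print(pt, '-> cluster', myclust.predict(pt))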