a.k.a. Simple Bayes or Independence Bayes
Recall from statistics 101:
Bayes' theorem gives the conditional probability of an event A occurring given that an event B has occurred.
Mathematically:
$$ \text{Bayes' theorem:}\quad P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\,P(B|A)}{P(B)} $$
Verbalizing the last expression: $$ \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} $$
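As a toy illustration (the numbers here are invented purely for the example): if P(A) = 0.3, P(B|A) = 0.5 and P(B) = 0.25, then
$$ P(A|B) = \frac{P(A)\,P(B|A)}{P(B)} = \frac{0.3 \times 0.5}{0.25} = 0.6 $$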
In data terms, A could be a label and B a set of observed descriptors (the features).
So let d1 and d2 be the two possible labels (the sample space for A), and let cols denote the features / data columns; then
$$ \frac{P(d1 | cols)}{P(d2 | cols)} = \frac{P(cols | d1)P(d1)}{P(cols | d2)P(d2)} $$
P(d1) and P(d2) are straightforward to calculate. P(cols | d1) and P(cols | d2) are the difficult calculations. To compute them we'll build a generative model of the features under each label. This is very difficult to do in general, so we make a "naive" assumption of independence between the features in order to obtain a workable approximate model. Different classifiers use different assumptions about how each feature is distributed.
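Concretely, the naive independence assumption lets the class-conditional likelihood factor over the individual features x1, ..., xn:
$$ P(cols \mid d) = P(x_1, x_2, \dots, x_n \mid d) \approx \prod_{i=1}^{n} P(x_i \mid d) $$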
Examples of such assumptions are the Gaussian and multinomial variants demonstrated below.
In any case, we want to label/classify each observation into two or more buckets: the d1s and the d2s.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
Assumption: The data from each label is drawn from a simple Gaussian/Normal distribution.
i.e. each feature, conditional on the label, is normally distributed.
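Under that assumption the class-conditional likelihood of a point x = (x1, ..., xn) is a product of one-dimensional normals, with a mean and variance estimated per feature and per label:
$$ P(x \mid d) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{d,i}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{d,i})^{2}}{2\sigma_{d,i}^{2}}\right) $$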
Methodology:
http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
SKLearn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=2.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');
Above are 100 points in two clusters with some overlap, just to make life a little more difficult (lower cluster_std to make the problem easier). Suppose there were no overlap? In that case the per-label mean and standard deviation alone could form a classifier: points within one standard deviation of a label's mean would belong to that label and only that label. It would be difficult to extend this simple idea to larger, messier datasets, which is where the full Gaussian model below comes in.
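A minimal sketch of that mean/std idea on the blobs above (an illustrative toy, not the library implementation): estimate a mean and standard deviation per label, then assign each point to the label whose independent-Gaussian log-likelihood is higher.
# Toy classifier: per-label mean/std, pick the label with the higher
# independent-Gaussian log-likelihood. Illustrative only.
means = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
stds = np.array([X[y == k].std(axis=0) for k in (0, 1)])

def toy_predict(points):
    # sum of log N(x_i; mu_i, sigma_i^2) over both features, for each label
    ll = np.array([
        (-0.5 * np.log(2 * np.pi * stds[k] ** 2)
         - (points - means[k]) ** 2 / (2 * stds[k] ** 2)).sum(axis=1)
        for k in (0, 1)
    ])
    return ll.argmax(axis=0)

(toy_predict(X) == y).mean()  # fraction of the 100 training points recovered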
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y); # X,y are defined above
# generate some new data
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)
# Plot our original data in bold colours
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
# Plot our new points in the backdrop
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);
The decision boundary of Gaussian Naive Bayes is quadratic in general, hence the subtle curve visible where the two predicted regions meet.
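To see that curve explicitly, one option (a small sketch, not part of the original walkthrough) is to evaluate the fitted model on a dense grid and draw the contour where the predicted label flips:
# Sketch: trace the decision boundary by predicting on a dense grid
xx, yy = np.meshgrid(np.linspace(-6, 8, 300), np.linspace(-14, 4, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = model.predict(grid).reshape(xx.shape)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
plt.contour(xx, yy, zz, levels=[0.5], colors='black');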
# Return probability estimates for the new points in Xnew.
# Each row below is the posterior probability that the corresponding point belongs to each label
yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)
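The columns of yprob line up with model.classes_ and each row sums to 1; a quick sanity check (a small aside):
print(model.classes_)         # column order of the probabilities
print(yprob.sum(axis=1)[:5])  # each row sums to 1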
Similar in spirit to the Gaussian version, but this is the discrete (finite/countable) version.
Assumption: the features are independent and distributed according to a multinomial distribution (recall that the multinomial distribution is discrete).
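For word counts x = (x1, ..., xn) over a vocabulary of n terms, this amounts (up to a constant) to the class-conditional likelihood
$$ P(x \mid d) \propto \prod_{i=1}^{n} \theta_{d,i}^{\,x_i}, \qquad \sum_{i} \theta_{d,i} = 1 $$
where theta_{d,i} is the estimated probability of term i under label d.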
Our demonstration comes from
https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
# Text classification using scikit learn's newsgroups dataset
# Documentation: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
data = fetch_20newsgroups() # this may take a few minutes as it's 14MB
#data.target_names # uncomment to see categories
# we'll focus on just a few categories
cats = ['talk.religion.misc', 'soc.religion.christian',
'sci.space', 'comp.graphics']
trn = fetch_20newsgroups(subset='train', categories=cats)
tst = fetch_20newsgroups(subset='test', categories=cats)
# print(trn.data[4]) # An arbitrarily chosen sample point
# because we are dealing with text data we will need to convert it to a numerical representation
# TfidfVectorizer will tokenize the text, weight it, then normalize the weights
# to produce a vector that can be piped into our classifier to produce the model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# The model can now be fitted
model.fit(trn.data, trn.target)
# and used to predict
labels = model.predict(tst.data)
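Before evaluating, a quick aside on what the TfidfVectorizer step actually produces (the two demo strings here are made up for illustration):
vec = TfidfVectorizer()
demo = vec.fit_transform(['the shuttle carried a payload to orbit',
                          'render the image at a higher resolution'])
demo.shape  # (2 documents, vocabulary-size columns), stored as a sparse matrix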
# We're now ready to evaluate our model
# Classifiers are best evaluated using a confusion matrix
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(tst.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=trn.target_names, yticklabels=trn.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
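Summarising the matrix with a single number (an extra check, not in the source notebook):
from sklearn.metrics import accuracy_score
accuracy_score(tst.target, labels)  # overall fraction of test posts labelled correctly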
Not bad, eh? For most topics it does pretty well, but it struggles with religion. Not entirely surprising, as Christianity is a subset of religion, so those two categories overlap heavily. Our naive assumption of independence has led us down a shaky path with respect to this data.
We can now use this model going forward:
def get_category(s, train=trn, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]
print(get_category('sending a payload to the ISS'))
print(get_category('discussing islam vs atheism'))
print(get_category('determining the screen resolution'))