
Bayesian Classification

aka Simple Bayes, or Independence Bayes

The Math

Recall from statistics 101:
Bayes' Theorem: the conditional probability of an event A given that event B has occurred.
Mathematically:
$$ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B|A)}{P(B)} $$

Verbalizing the last term: $$ \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} $$

In data terms, the event is a label we want to predict and the conditioning event is a set of observed descriptors (features).

So let d1 & d2 be the two possible outcomes (labels) for A, and let cols be the features / data columns. Then, since the evidence P(cols) cancels when we take the ratio of the two posteriors,

$$ \frac{P(d1 | cols)}{P(d2 | cols)} = \frac{P(cols | d1)P(d1)}{P(cols | d2)P(d2)} $$

P(d1) & P(d2) are straightforward calculations. P(cols | d1) and P(cols | d2) are the difficult calculations. To compute them we'll create a generative model for each label. This is a very difficult task to do in general, so we make a "naive" assumption of independence between features in order to find a workable approximation model. Different naive Bayes classifiers use different assumptions about the distribution of each feature.
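
Spelling the naive assumption out as a formula (standard naive Bayes, stated here for completeness): with features $x_1, \dots, x_n$ making up cols, the class-conditional likelihood factorizes as

$$ P(cols \mid d) = P(x_1, \dots, x_n \mid d) \approx \prod_{i=1}^{n} P(x_i \mid d) $$

so each variant of the classifier only has to model one-dimensional distributions $P(x_i \mid d)$.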

Business Case

Examples:

  • What's the probability of a customer defaulting on a loan given that their debt exceeds their income?
  • What's the probability that a customer will leave our company?
  • What's the probability of an email being spam?

In every case we want to label/classify the output into 2 or more buckets, the d1s and d2s from above.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Methods

Gaussian Naive Bayes

Assumption: the data from each label is drawn from a simple Gaussian/normal distribution,
i.e. within each class every feature is normally distributed (and independent of the others).
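
Written out, this is the standard Gaussian naive Bayes likelihood, with $\mu_{d,i}$ and $\sigma_{d,i}$ estimated from the training points of class $d$:

$$ P(x_i \mid d) = \frac{1}{\sqrt{2\pi\sigma_{d,i}^2}} \exp\!\left(-\frac{(x_i - \mu_{d,i})^2}{2\sigma_{d,i}^2}\right) $$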

Methodology:
http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

SKLearn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [2]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=2.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');

Above are 100 points with some overlap ... just to make life a little more difficult (lower cluster_std to make it easier). Suppose there were no overlap? In that case the mean and standard deviation alone could be used to form a classifier: points within one standard deviation of a class mean would belong to one and only one class. It would be difficult to extend that simple rule to larger datasets.
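To make the generative idea concrete, here is a minimal hand-rolled sketch of the same recipe GaussianNB follows (the function names and the use of scipy.stats are my own illustration, not part of the original notebook): estimate a prior, mean and standard deviation per class, then score new points by log-prior plus Gaussian log-likelihood.

from scipy.stats import norm    # np is already imported above

def fit_gaussian_nb(X, y):
    # per-class prior, feature means and feature standard deviations
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        params[label] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0))
    return params

def predict_gaussian_nb(params, Xnew):
    # score = log P(d) + sum_i log P(x_i | d); pick the best-scoring label
    # (assumes labels are 0..k-1, as make_blobs produces)
    scores = [np.log(prior) + norm.logpdf(Xnew, mu, sigma).sum(axis=1)
              for label, (prior, mu, sigma) in sorted(params.items())]
    return np.argmax(np.vstack(scores), axis=0)

On these blobs the predictions should closely match sklearn's GaussianNB, which adds refinements such as variance smoothing.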

In [3]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);         # X,y are defined above

# generate some new data 
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)

# Plot our original data in bold colours
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()

# Plot our new points in the backdrop
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);

The decision boundary of Gaussian naive Bayes is quadratic in general, hence the subtle curve between the two coloured regions above.
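
To see that curve more explicitly, we can evaluate the model's posterior on a grid and draw the P = 0.5 contour (a small add-on to the plot above; the grid limits just match the range of Xnew):

xx, yy = np.meshgrid(np.linspace(-6, 8, 200), np.linspace(-14, 4, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
probs = model.predict_proba(grid)[:, 1].reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
plt.contour(xx, yy, probs, levels=[0.5], colors='black');   # the quadratic boundary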

In [4]:
# Return probability estimates for the new points in Xnew.
# What you see below is the posterior probability that each point belongs to each label
# (columns follow the order of model.classes_)
yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)
Out[4]:
array([[0.78, 0.22],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.96, 0.04],
       [0.99, 0.01],
       [0.  , 1.  ],
       [0.28, 0.72]])

Multinomial Naive Bayes

Similar to the Gaussian case, this is the finite/countable version.

Assumption: features are independent and distributed according to a multinomial distribution (recall that the multinomial distribution is discrete), which makes it a natural fit for word counts and other frequency data.
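
For reference, the standard multinomial naive Bayes likelihood (stated here for completeness): if $x_i$ is the count of feature $i$ in a document and $\theta_{d,i}$ is the probability of feature $i$ under class $d$, then

$$ P(cols \mid d) \propto \prod_{i} \theta_{d,i}^{\,x_i} $$

with the $\theta_{d,i}$ estimated from (smoothed) feature counts in the training data.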

Our demonstration comes from
https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html

In [5]:
# Text classification using scikit learn's newsgroups dataset
# Documentation: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups() # this may take a few minutes as it's 14MB
#data.target_names          # uncomment to see categories
In [6]:
# we'll focus on just a few categories
cats = ['talk.religion.misc', 'soc.religion.christian',
        'sci.space', 'comp.graphics']
trn = fetch_20newsgroups(subset='train', categories=cats)
tst = fetch_20newsgroups(subset='test', categories=cats)

# print(trn.data[4])      # An arbitrarily chosen sample point
# because we are dealing with text data we will need to convert it to a numerical representation
# TfidfVectorizer will tokenize the text, weight it, then normalize the weights
# to produce a vector that can be piped into our classifier to produce the model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# The model can now be fitted
model.fit(trn.data, trn.target)
# and used to predict
labels = model.predict(tst.data)
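
If you're curious what the vectorizer hands to the classifier, a throwaway example on a couple of toy strings (not part of the newsgroups pipeline; get_feature_names_out assumes a recent scikit-learn, older versions call it get_feature_names) looks like this:

vec = TfidfVectorizer()
toy = vec.fit_transform(['space shuttle launch', 'launch the graphics engine'])
print(vec.get_feature_names_out())   # the learned vocabulary
print(toy.toarray().round(2))        # one TF-IDF-weighted row per document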
In [7]:
# We're now ready to evaluate our model
# Classifiers are best evaluated using a confusion matrix
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(tst.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=trn.target_names, yticklabels=trn.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

Not bad eh? For most topics it's pretty good, but it stumbles on religion: many talk.religion.misc posts get labelled soc.religion.christian. Not entirely surprising, as Christianity is a subset of religion, so the two categories share a lot of vocabulary. Our naive assumption of independence has led us down a shaky path with respect to this data.
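
If you want a single summary number to go with the matrix (a small addition using sklearn.metrics):

from sklearn.metrics import accuracy_score
print(accuracy_score(tst.target, labels))   # fraction of test posts labelled correctly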

We can now use this model going forward

In [8]:
def get_category(s, train=trn, model=model):
    # predict the category index for a single string and return its name
    pred = model.predict([s])
    return train.target_names[pred[0]]

print(get_category('sending a payload to the ISS'))

print(get_category('discussing islam vs atheism'))

print(get_category('determining the screen resolution'))
sci.space
soc.religion.christian
comp.graphics