In [1]:
# Start by importing the needed libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
warnings.filterwarnings('ignore')

Logistic Regression

In this tutorial we build upon linear regression to perform logistic regression, one of the most basic generalized linear models.
Linear Reg: Predicts the expected values of the response variable
Logistic Reg: Predicts the probability of the response taking a given binary value

Binary: Response variable may only take one of two values
Nominal: Response variable may take 3 or more values with no natural ordering
Ordinal: Response variable may take 3 or more ordered values with uneven spacing/intervals between them

Examples:
Binary: (yes, no), (true, false), etc.
Nominal: Category 1, Category 2, etc.
Ordinal: survey data from a hot-wing contest where people are asked to rate the heat between 1 and 5; the difference between a 1 and a 2 would differ from person to person

Binary Form

Examples: What is the probability of getting a job given a certain level of education? What is the probability of passing a test given gender?

Notice that in both of these examples the predictor is not numerical; that makes no difference to logistic regression, since categorical predictors can be encoded numerically before fitting (a quick sketch follows below). What makes this logistic rather than linear regression is the binary response.
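As a quick illustration (the DataFrame, column names, and values below are made up for this sketch, not part of the tutorial's data), a categorical predictor can be converted into numeric dummy columns with pandas before fitting:

import pandas as pd

# Hypothetical data: a categorical predictor and a binary response
jobs = pd.DataFrame({
    'education': ['highschool', 'bachelors', 'masters', 'bachelors'],
    'got_job':   [0, 1, 1, 0],
})

# One-hot encode the categorical predictor; drop_first avoids a redundant column
X_jobs = pd.get_dummies(jobs['education'], drop_first=True)
print(X_jobs)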

The Math

Define the logistic function as: $$ \text{logistic}(x) = \frac{1}{1+e^{-x}} $$ (NB: machine learning algorithms often call this the sigmoid function; its inverse is the logit.)
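A minimal numerical sketch (not from the original notebook) showing that this function squashes any real input into the interval (0, 1):

import numpy as np

def logistic(x):
    # The logistic (sigmoid) function: maps the real line onto (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(np.array([-5.0, 0.0, 5.0])))  # close to 0, exactly 0.5, close to 1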

Then our model is $$ P(Y=1) = P(Y=1 \mid X) = \frac{ e^{b_0+b_1X} }{1+e^{b_0+b_1X}} $$ Note that $$ \ln\left( \frac{P(Y=1)}{1-P(Y=1)} \right) = b_0+b_1X $$ which shows the relationship between the linear model and the logistic one. To make this a binary classifier we create a rule that maps probabilities onto a binary value (e.g. if $P(Y=1 \mid X) > 0.5$ then 1, else 0).

To fit this model we find $b_0$ and $b_1$. Linear regression estimates the coefficients by minimizing the sum of squared errors; logistic regression maximizes the likelihood function. In layman's terms, this means finding the $b_0$ and $b_1$ that best approximate the observed data. Thankfully Python has already done all this for us, so let's get to an example.

NB: $\frac{P(Y=1)}{1-P(Y=1)}$ is called the odds. $\ln(\text{odds})$ is the log-odds, or logit.
Then it follows that $odds = e^{b_0+b_1X}$.
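Putting the pieces together, here is a small sketch with made-up coefficients $b_0$ and $b_1$ (illustrative values only, not fitted to any data): it computes $P(Y=1 \mid X)$, the odds, and applies the 0.5 decision rule.

import numpy as np

b0, b1 = -3.0, 1.5                      # illustrative values only, not fitted
x = np.array([0.5, 2.0, 4.0])           # a few example predictor values

log_odds = b0 + b1 * x                              # the linear part, ln(odds)
p = np.exp(log_odds) / (1 + np.exp(log_odds))       # P(Y=1 | X)
odds = p / (1 - p)                                  # equals exp(b0 + b1*x)

prediction = (p > 0.5).astype(int)      # map probabilities onto 0/1
print(p, odds, prediction)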

Example

In [2]:
df = pd.read_csv('Data/001LogisticRegression_Iris.csv') 
#id is irrelevant 
df = df.drop(['Id'],axis=1)
#df.groupby(['Species']).size()

#turn this into a binary problem by dropping Iris-virginica
df = df.drop( df[df.Species == 'Iris-virginica'].index ,axis=0)

#turn text category into binary 1/0
encoding = {"Iris-setosa":0, "Iris-versicolor":1  }
df.replace(encoding,inplace=True)
#Check   df.groupby(['Species']).size()
#Check   df.head()

# I'd like to slim things down even further so
#c = df.corr()

#For those that prefer a visual
#f, ax = plt.subplots(figsize=(11, 9))
#sns.heatmap(c)

# PetalLengthCm and PetalWidthCm have the highest correlation with Species,
# so we drop the other two features
df = df.drop(['SepalLengthCm','SepalWidthCm'],axis=1)
df.head()  #Finally looks good for our purposes

plt.scatter(df['PetalLengthCm'], df['PetalWidthCm'], c=df['Species'], cmap=plt.cm.Spectral)
plt.xlabel('PetalLengthCm')
plt.ylabel('PetalWidthCm')
plt.show()
In [3]:
#As usual the modeling is quite trivial in python
from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression(random_state=0, solver='lbfgs')

X = df[['PetalLengthCm','PetalWidthCm']].values
y = df['Species'].values
clf = logReg.fit(X, y)

print(clf.predict(X))
#print(clf.predict_proba(X))  # a large array: one probability per class, per sample
print(clf.score(X, y))        # mean accuracy: predictions vs actual
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
1.0
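To connect back to the math above, the fitted parameters can be inspected directly; intercept_ plays the role of $b_0$ and coef_ holds the coefficients on the two predictors:

print(clf.intercept_)   # b0
print(clf.coef_)        # coefficients for PetalLengthCm and PetalWidthCm
# The modeled log-odds for a sample x is clf.intercept_ + clf.coef_ @ x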

We achieved 100% accuracy on our predictions. This isn't surprising, since our data is super clean: the two remaining species are cleanly separated in petal length and width, as the scatter plot above shows.
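Note that we scored the model on the same data it was trained on. For a more honest estimate one would normally hold out a test set; a minimal sketch using scikit-learn's train_test_split (the 30% split fraction is an arbitrary choice for illustration):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for evaluation (arbitrary illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on data the model has not seen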