# Start by importing the needed libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
warnings.filterwarnings('ignore')
In this tutorial we build upon linear regression to perform logistic regression, one of the simplest generalized linear models.
Linear Reg: Predicts the expected value of the response variable
Logistic Reg: Predicts the probability of the response taking a binary value
Binary: Response variable may take only one of two values
Nominal: Response variable may take 3 or more values with no natural ordering
Ordinal: Response variable may take 3 or more ordered values with uneven spacing/intervals between them
Examples:
Binary: (yes, no), (true, false), etc.
Nominal: Category 1, Category 2, etc.
Ordinal: survey data from a hot-wings contest where people are asked to rate the heat between 1 and 5; the difference between a 1 and a 2 would differ from person to person.
Examples: What's the probability of getting a job given a certain level of education? What's the probability of passing a test given gender?
Notice that in both of these examples the predictor is not numerical. That makes no difference to logistic regression; a categorical predictor simply has to be encoded numerically (e.g. as dummy variables) before fitting, as sketched below.
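For illustration, a non-numeric predictor such as education level can be turned into 0/1 dummy columns with pd.get_dummies. This is a hypothetical sketch: the jobs DataFrame, its column names, and the category labels are made up, and it relies on the pandas import at the top of the notebook.
# hypothetical data: education level (categorical) and whether the person got a job
jobs = pd.DataFrame({
    'Education': ['HighSchool', 'Bachelors', 'Masters', 'Bachelors', 'HighSchool'],
    'GotJob':    [0, 1, 1, 0, 1],
})
# one-hot encode the categorical predictor; drop_first avoids a redundant column
X_jobs = pd.get_dummies(jobs['Education'], drop_first=True)
print(X_jobs.head())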
Define the logistic function as $$ \sigma(x) = \frac{1}{1+e^{-x}} $$ (NB: machine learning texts often call this the sigmoid function; its inverse is the logit).
Then our model is $$ P(Y=1) = P(Y=1 \text{ | } X) = \frac{ e^{b_0+b_1X} }{1+e^{b_0+b_1X}} $$ Note that $$ \ln\left( \frac{P(Y=1)}{1-P(Y=1)} \right) = b_0+b_1X, $$ which shows the relationship between the linear model and the logistic one. To make this a binary classifier we create a rule that maps probabilities onto a binary value (e.g. if $P(Y=1|X) > 0.5$ then 1, else 0). To fit this model we find $b_0$ and $b_1$. Linear regression estimates the coefficients by minimizing the sum of squared residuals; logistic regression maximizes the likelihood function. In layman's terms, this means we need to find the $b_0$ and $b_1$ that best explain the observed data. Thankfully Python has already done all this for us, so let's get to an example.
NB: $\frac{P(Y=1)}{1-P(Y=1)}$ is called the odds, and $\ln(\text{odds})$ is the log-odds or logit.
Then it follows that $odds = e^{b_0+b_1X}$.
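Concretely, "maximizes the likelihood function" means choosing $b_0$ and $b_1$ to maximize the log-likelihood $$ \ell(b_0, b_1) = \sum_i \left[ y_i \ln p_i + (1-y_i)\ln(1-p_i) \right], \quad \text{where } p_i = \frac{1}{1+e^{-(b_0+b_1 x_i)}}. $$ The sketch below is purely illustrative (the function names and the toy coefficients are made up, and it uses the numpy import from the top of the notebook); it shows the logistic function turning a linear predictor into a probability and the 0.5 rule turning that probability into a class label.
def logistic(z):
    # maps any real-valued linear predictor onto a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(b0, b1, x, y):
    # log-likelihood of the observed 0/1 responses y under coefficients b0, b1
    p = logistic(b0 + b1 * x)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy example with made-up coefficients
b0, b1 = -3.0, 2.0
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([0, 0, 1, 1])
p = logistic(b0 + b1 * x)          # P(Y=1 | X=x)
y_hat = (p > 0.5).astype(int)      # map probabilities onto a binary value
print(p, y_hat, log_likelihood(b0, b1, x, y))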
df = pd.read_csv('Data/001LogisticRegression_Iris.csv')
#id is irrelevant
df = df.drop(['Id'],axis=1)
#df.groupby(['Species']).size()
#turn this into a binary problem by dropping Iris-virginica
df = df.drop( df[df.Species == 'Iris-virginica'].index ,axis=0)
#turn text category into binary 1/0
encoding = {"Iris-setosa":0, "Iris-versicolor":1 }
df.replace(encoding,inplace=True)
#Check df.groupby(['Species']).size()
#Check df.head()
# I'd like to slim things down even further so
#c = df.corr()
#For those that prefer a visual
#f, ax = plt.subplots(figsize=(11, 9))
#sns.heatmap(c)
# PetalLengthCm,PetalWidthCm have the highest correlation
# we drop the other two
df = df.drop(['SepalLengthCm','SepalWidthCm'],axis=1)
df.head() #Finally looks good for our purposes
plt.scatter(df['PetalLengthCm'],df['PetalWidthCm'],c=df['Species'],cmap=plt.cm.Spectral )
plt.show()
#As usual the modeling is quite trivial in python
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(random_state=0, solver='lbfgs')
X = df[['PetalLengthCm','PetalWidthCm']].values
y = df['Species'].values
clf = logReg.fit(X, y)
print(clf.predict(X))
#print(clf.predict_proba(X) ) This is a large array of probabilities
print(clf.score(X, y)) #accuracy = predictions vs actual
We achieved 100% accuracy in our predictions. This isn't surprising: we are scoring on the same data we fit, and Iris-setosa and Iris-versicolor are linearly separable on petal length and width.
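To tie the fitted model back to the formulas above: the intercept $b_0$ is in clf.intercept_ and the per-feature coefficients are in clf.coef_, and applying the logistic function to the linear predictor should reproduce clf.predict_proba. Here is a minimal sanity-check sketch, assuming the clf, X, and numpy objects defined above (note that sklearn applies L2 regularization by default, so these coefficients are not the pure maximum-likelihood estimates, but the probability formula is unchanged).
b0 = clf.intercept_[0]
b = clf.coef_[0]                          # one coefficient per feature
z = b0 + X @ b                            # linear predictor b0 + b1*x1 + b2*x2
p_manual = 1.0 / (1.0 + np.exp(-z))       # logistic function applied by hand
p_sklearn = clf.predict_proba(X)[:, 1]    # P(Y=1 | X) straight from sklearn
print(np.allclose(p_manual, p_sklearn))   # expected to print True
print(np.exp(z)[:5])                      # first few fitted odds, e^(b0 + b1*X1 + b2*X2)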