Linear Regression

In this tutorial we will perform the most basic of regression techniques using Python.

One should have a rough idea of what's happening behind the code. Let's get started with a quick math refresher.

The Math

The model we are going to fit is as follows

$$Y_i = a + bX_i + \epsilon_i $$

Y - is the output variable (aka: response variable)

X - is the input variable (aka: predictor variable)

a - is the intercept

b - is the slope of our line

$\epsilon$ - is the random error term (the variation the line does not explain)

Assumptions:

  • Linear relationship - the response moves in tandem with the predictor, allowing for some random variation
  • Multivariate normality - the residuals are (approximately) normally distributed
  • No or little multicollinearity - the predictors are not strongly correlated with each other
  • No auto-correlation - the residuals are independent of one another
  • Homoscedasticity - the residuals have roughly constant variance across the range of the predictor

An input variable (x) and an output variable (y). Recall that the equation for a straight line is $ y = a + bx $ where a denotes the intercept and b the slope. The job of a linear regression algorithm is to estimate a & b such that the deviations (residuals) from the fitted line to the actual data are kept to a minimum. As it turns out, this happens when the following equations are satisfied $$ b =\frac{ \sum{(x_i - \mu_x) ( y_i - \mu_y ) }} {\sum{(x_i - \mu_x)^2} } $$ $$ a = \mu_y - b\,\mu_x $$
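
To make these formulas concrete, here is a minimal sketch (not part of the original notebook) that computes b and a directly from a small made-up sample:

import numpy as np

# Tiny made-up sample, purely to illustrate the formulas above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 5.8, 8.2, 9.9])

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()   # slope
a = y.mean() - b * x.mean()                                               # intercept
print('slope:', b, 'intercept:', a)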

That's enough to get us going.

Business Case

Our data contains several columns. Just to keep things simple we'll stick to the new column and try to predict the future based on what we know. Assume new is the number of new accounts opened at a bank in one month. We want to forecast the number of new accounts for the coming year. This is a pretty common task at almost any large institution.

In [1]:
# Start by importing the needed libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

Data Exploration

Let's create some data and check it out

In [2]:
# Let's make some data to play with
# For demonstration purposes we'll generate a set of random numbers
rng = np.random.RandomState(25)          # a seeded random number generator
time = 5 * rng.rand(50)                  # 50 uniformly distributed numbers in [0, 5)
target = (5 * time) + 5 + rng.randn(50)  # a linear function of time plus Gaussian noise
data_raw = pd.DataFrame({'time': time,'target' : target})

print(data_raw.head())

plt.figure(figsize=(25,7))
plt.scatter(data_raw['time'], data_raw['target']);
      target      time
0  27.117731  4.350621
1  19.053971  2.911385
2  11.745222  1.394195
3   9.082243  0.929556
4  15.380898  2.055501

Pretty safe to say that the linear relation holds; we did create it using a line, after all.

In [3]:
print(data_raw.head())  

print('\nBasic Statistics')
print('  Mean    :',data_raw[['target']].mean().values[0]) 
print('  Var     :',data_raw[['target']].var().values[0])
print('  Std Dev :',data_raw[['target']].std().values[0])
print('  Skewness:',data_raw[['target']].skew().values[0])
print('  Kurtosis:',data_raw[['target']].kurtosis().values[0])  
      target      time
0  27.117731  4.350621
1  19.053971  2.911385
2  11.745222  1.394195
3   9.082243  0.929556
4  15.380898  2.055501

Basic Statistics
  Mean    : 17.54698058398409
  Var     : 38.84229446277452
  Std Dev : 6.232358659670873
  Skewness: -0.004329766013003645
  Kurtosis: -0.39519345283431484

Basic Methods

scipy.stats.linregress

Documentation: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html)

This function is specialised for linear, and only linear, regression.

In [4]:
from scipy import stats

x = data_raw['time']
y = data_raw['target']
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# To get coefficient of determination (r_squared)
print("Slope     :", slope)
print("Intercept :", intercept)
print("Corr Coeff:", r_value)
print("R^2       :", r_value**2)
print("Std Error :", std_err)
print("P-Value   :", p_value)

#Plot the data along with the fitted line
plt.figure(figsize=(25,10))
plt.ylim(0,40)  
plt.xlim(0,6)  

plt.plot(x, y, 'o', label='original data')
plt.plot(x, intercept + slope*x, 'r', label='fitted line')
plt.legend()
plt.show()
Slope     : 5.051307710178239
Intercept : 4.904042510643183
Corr Coeff: 0.9883872057268769
R^2       : 0.9769092684445837
Std Error : 0.11209208248142495
P-Value   : 6.115473438536561e-41

scipy.optimize.curve_fit

Uses non-linear least squares to fit a function you provide to the data.

In [5]:
from scipy.optimize import curve_fit

def linfunc(x, a, b):
    return (a * x) + b    # the same linear form used to create the data

#curve_fit requires args to be in 1dim arrays
x0 = data_raw[['time']].values.reshape(50,)
y0 = data_raw[['target']].values.reshape(50,)

popt, pcov = curve_fit(linfunc, x0, y0)

plt.plot(x0, y0, '.', label='data')
plt.plot(x0, linfunc(x0,*popt), 'r-', label='pred')
print(popt)
[5.0513077  4.90404254]
  • Slope is the rate of change in the response y for one unit of change in the predictor x
  • Intercept is the model starting point (the value of y when x = 0)
  • Corr Coeff is the correlation coefficient r, which describes the strength and direction of the relationship between the response and the predictor
  • R^2 (R-squared) is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
  • Std Error is a measure of the accuracy of the estimated slope
  • P-Value is a significance value. We've only used one input variable, so its significance will be high as it's the only one. This value becomes much more important when we use multiple variables

How do we start predicting? Simply take the slope and intercept returned by the function and create your model.
Our fitted equation is as follows

$$\hat{Y} = 4.904 + 5.051X $$

The slope estimate has a standard error of about 0.11, as reported above.
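
As a minimal sketch (not part of the original cell), we can plug new time values into the fitted equation; the new_time values below are hypothetical.

# Hypothetical future time values, purely for illustration
new_time = np.array([1.5, 3.0, 4.5])
forecast = intercept + slope * new_time   # Y = a + bX with the estimates from linregress
print(forecast)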

sklearn.linear_model

sklearn (http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)

In [6]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Initial data => data_raw
X = data_raw[['time']]
y = data_raw[['target']]

# Let's do a 90/10 split into train/test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)


linreg = linear_model.LinearRegression()        # Create a linear regression object
linreg.fit(X_train, y_train)                    # Train the model
score = linreg.score(X_train, y_train)          # R^2 on the training data

y_predict = linreg.predict(X_train)
# sklearn does not report a standard error, so we compute it manually
std_err = (((y_train - y_predict )**2).sum() / y_train.count()[0])**(1/2)

print('Coefficients: ', linreg.coef_[0])        # Print the coefficients
print('Intercept   : ', linreg.intercept_[0] )
print('R^2         : ', score )
print('Std error   : ', std_err[0])
print('MeanSqr_err : ', mean_squared_error(y_train, y_predict))
print('Var score   : ', r2_score(y_train, y_predict))

#print('  Compare the above numbers to the results from section 3.1!!  ')    

#Plot the data along with the fitted line
plt.figure(figsize=(25,10))
plt.ylim(0,40)  
plt.xlim(0,6)  

plt.plot(X_train, y_train, 'o', label='original data')
plt.plot(X_train, y_predict, 'r', label='fitted line')
plt.legend()
plt.show()
Coefficients:  [5.02950453]
Intercept   :  4.9903273143597175
R^2         :  0.9793839311264774
Std error   :  0.8651222512271788
MeanSqr_err :  0.7484365095683818
Var score   :  0.9793839311264774

As expected, it's very close to our original model.

statsmodels.OLS

We'll make this one short as it works similarly to the two above.

In [7]:
import statsmodels.api as sm


X3 = data_raw[['time']]
y3 = data_raw[['target']]

X3 = sm.add_constant(X3)   # add the constant term ( intercept )
m3 = sm.OLS(y3,X3)         # Create model using Ordinary least squares
results = m3.fit()         # Fit OLS model
print(results.summary())   # display the results 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 target   R-squared:                       0.977
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     2031.
Date:                Sun, 09 Sep 2018   Prob (F-statistic):           6.12e-41
Time:                        15:51:45   Log-Likelihood:                -67.722
No. Observations:                  50   AIC:                             139.4
Df Residuals:                      48   BIC:                             143.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9040      0.311     15.744      0.000       4.278       5.530
time           5.0513      0.112     45.064      0.000       4.826       5.277
==============================================================================
Omnibus:                        0.019   Durbin-Watson:                   1.901
Prob(Omnibus):                  0.990   Jarque-Bera (JB):                0.113
Skew:                           0.042   Prob(JB):                        0.945
Kurtosis:                       2.783   Cond. No.                         7.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Statsmodels is really made for statistics! Most of the terms are pretty straightforward and should already be familiar to you. Let's go over one that may not be:

Prob(Omnibus): the p-value of the omnibus test of normality on the residuals; a value close to 1 means there is no evidence against normally distributed residuals.

Numpy.polyfit

Note that Scipy also has a polyfit that works similarly.

In [8]:
# import numpy as np    # We've already imported before

X4 = data_raw['time']
y4 = data_raw['target']

coeff = np.polyfit(X4,y4,1)
model = np.poly1d(coeff)
y4_pred = model(X4)   
    
print('Coefficients', coeff)
print('Model:  ',model)
Coefficients [5.05130771 4.90404251]
Model:    
5.051 x + 4.904

Transformations

Sometimes our data is not linear. This is not to say that linear regression can't be used; we simply need to transform the data first.

Here's a small list of common transformations:

  • $ \ln(y) = \beta_0 + \beta_1 \ln(x) $ - logarithmic transform
  • $ \mu_y = \beta_0 + \beta_1 (\frac{1}{x}) $ - reciprocal transform
  • $ \mu_y = \beta_0 + \beta_1 e^{-x} $ - exponential (decay) transform

In [9]:
rng     = np.random.RandomState(25)           # a seeded random number generator
time    = 5 * rng.rand(75)                    # 75 uniformly distributed numbers in [0, 5)
targ    = 5 + (np.exp(time)) + rng.randn(75)  # an exponential function of time plus Gaussian noise
target  = np.log(targ)
tmp = pd.DataFrame({'time': time,'target' : target,'targ' : targ })

print(tmp.head())

plt.figure(figsize=(25,7))
plt.scatter(tmp['time'], tmp['target']);
        targ    target      time
0  81.931332  4.405881  4.350621
1  23.877192  3.172924  2.911385
2   8.314944  2.118054  1.394195
3   8.695821  2.162843  0.929556
4  11.324154  2.426938  2.055501
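
As a minimal sketch (not part of the original cell), we can reuse linregress from earlier on the log-transformed target, then exponentiate predictions back to the original scale:

# Fit a straight line on the log scale ...
slope_t, intercept_t, r_t, p_t, se_t = stats.linregress(tmp['time'], tmp['target'])
pred_log  = intercept_t + slope_t * tmp['time']
pred_orig = np.exp(pred_log)               # ... then transform back to the 'targ' scale
print('slope:', slope_t, 'intercept:', intercept_t, 'R^2:', r_t**2)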

Multivariate Linear Regression

It is also known as multivariable (or multiple) linear regression. Much of what we've seen can be extended to the multivariate case. We're also going to use a mix of the libraries used above.

The model we fit is simply an extension of the one-variable case

$$ Y_i =\beta_0+\beta_1 X_1 +\beta_2 X_2 + \dots + \epsilon $$

The most important thing to note is that the model is still linear in the predictors: it's a polynomial of degree 1.

Business Case

The business wants to know where they're getting the best bang for their buck.

Let's determine the relationship between spending on advertising channels and the effect on sales.

We've been provided with a csv containing the following

  • TV = dollars spent on T.V.
  • radio = Advertising dollars spent on radio
  • newspaper = Advertising dollars spent on newspapers
  • sales figures

Why choose regression: the output/response is continuous (a float)

Model Construction

In [10]:
import pandas as pd
import numpy as np

# Let's load & look at the data
df_raw = pd.read_csv('./Data/Advertising.csv', header=0)
print('Shape :',df_raw.shape)
print(df_raw.head())
print(df_raw.tail())
Shape : (200, 5)
   Unnamed: 0     TV  radio  newspaper  sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
     Unnamed: 0     TV  radio  newspaper  sales
195         196   38.2    3.7       13.8    7.6
196         197   94.2    4.9        8.1    9.7
197         198  177.0    9.3        6.4   12.8
198         199  283.6   42.0       66.2   25.5
199         200  232.1    8.6        8.7   13.4
In [11]:
# Let's plot each predictor against sales to get a sense of the relationship
import seaborn as sns
sns.pairplot(df_raw, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales', size=7, aspect=0.7, kind='reg')
# We notice that newspaper seems very scattered
# we may have issues using this in our model
Out[11]:
<seaborn.axisgrid.PairGrid at 0x18653b75908>
In [12]:
features = ['TV', 'radio','newspaper']   # Split up our data into predictors
X = df_raw[features]
y = df_raw['sales']                      # and a response variable
# Note that X is a pandas dataframe and y is a pandas series
# 

# import and split them up
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Default will be a 75/25 train-test split using random sampling

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(150, 3)
(50, 3)
(150,)
(50,)
In [13]:
# we use statsmodels for its detailed summary of metrics
import statsmodels.api as sm  
X = sm.add_constant(X)
est = sm.OLS(y,X).fit()
print('Statsmodels Metrics \n')
print(est.summary())

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()       # instantiate
linreg.fit(X_train, y_train)      # fit the model to the training data (learn the coefficients)
print('\n SKlearn Results')
print(linreg.intercept_)          # print the intercept and coefficients
print(linreg.coef_)
Statsmodels Metrics 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Sun, 09 Sep 2018   Prob (F-statistic):           1.58e-96
Time:                        15:51:48   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
radio          0.1885      0.009     21.893      0.000       0.172       0.206
newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
==============================================================================
Omnibus:                       60.414   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.241
Skew:                          -1.327   Prob(JB):                     1.44e-33
Kurtosis:                       6.332   Cond. No.                         454.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

 SKlearn Results
2.8769666223179318
[0.04656457 0.17915812 0.00345046]

The results are pretty consistent (statsmodels was fit on the full dataset while sklearn was fit only on the training split, hence the small differences). Note the p-value for newspaper though.
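
As a sketch of how the fitted model would be used, here is a prediction for one hypothetical month of spend (the figures below are made up):

# Hypothetical spend: 100 on TV, 20 on radio, 10 on newspaper (same columns as X_train)
spend = pd.DataFrame({'TV': [100.0], 'radio': [20.0], 'newspaper': [10.0]})
print(linreg.predict(spend))   # uses the sklearn model fitted on the training split above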

Model Evaluation

Math refresher

Recall:

$$ Mean Absolute Error (MAE) = \frac{1}{n} \sum_{i=1}^n | Y_i - \hat{Y_i} | $$ This is basically your average error size.

$$ Mean Squared Error (MSE) = \frac{1}{n} \sum_{i=1}^n ( Y_i - \hat{Y_i} ) ^ 2 $$ Similar to MAE but with a magnifying effect on larger errors.

$$ Root Mean Squared Error (RMSE) = \sqrt{ \frac{1}{n} \sum_{i=1}^n ( Y_i - \hat{Y_i} )^2 } $$ Interpretable in terms of the response y.
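
To connect the formulas to code, here is a minimal numpy sketch (y_true and y_hat are assumed arrays of actuals and predictions); the sklearn helpers in the next cell compute the same quantities:

# y_true, y_hat: assumed arrays of actual and predicted values
err  = np.asarray(y_true) - np.asarray(y_hat)   # residuals Y_i - Y_hat_i
mae  = np.mean(np.abs(err))                     # Mean Absolute Error
mse  = np.mean(err ** 2)                        # Mean Squared Error
rmse = np.sqrt(mse)                             # Root Mean Squared Error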

In [14]:
# make predictions 
y_pred = linreg.predict(X_test)

# Evaluate the model predictions
from sklearn import metrics
print(' MAE  :',metrics.mean_absolute_error(y_test, y_pred))
print(' MSE  :',metrics.mean_squared_error(y_test, y_pred))
print('RMSE  :',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
 MAE  : 1.0668917082595208
 MSE  : 1.9730456202283375
RMSE  : 1.4046514230328953

Our mean squared error is pretty good! Could it be improved? I'm betting that removing newspaper would yield an even better model. We'll leave that as an exercise for the reader.