In this tutorial we will perform the most basic of regression techniques using Python.
One should have a rough idea of what's happening behind the code. Let's get started with a quick math refresher.
The model we are going to fit is as follows
$$Y_i = a + bX_i + \epsilon_i $$
Y - is the output variable (aka: response variable)
X - is the input variable (aka: predictor variable)
a - is the intercept
b - is the slope of our line
$\epsilon$ - is the error term (the random noise our line cannot explain)
Assumptions:
An input var (x) and an output var (y). Recall that the equation for a straight line is $ y = a + bx $, where a denotes the intercept and b the slope of the line. A linear regression algorithm estimates a and b such that the deviations (residuals) between the fitted line and the actual data are kept to a minimum. As it turns out, this is the case when the following equations are satisfied $$ b = \frac{ \sum{(x_i - \mu_x)(y_i - \mu_y)} }{ \sum{(x_i - \mu_x)^2} } $$ $$ a = \mu_y - b\,\mu_x $$
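To make the formulas concrete, here's a minimal sketch that computes the slope and intercept directly in NumPy; the x and y values are made up purely for illustration.
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # illustrative input values
y = np.array([7.2, 8.9, 11.1, 13.0, 15.2]) # illustrative output values
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()  # slope
a = y.mean() - b * x.mean()                                              # intercept
print('intercept:', a, 'slope:', b)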
That's enough to get us going.
Our data contains several columns. Just to keep things simple we'll stick to the new column and try to predict the future based on what we know. Assume new is the number of new accounts opened at a bank in a given month. We want to forecast the number of new accounts for the coming year. This is a pretty common task at almost any large institution.
# Start by importing the needed libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
Let's create some data and check it out
# Let's make some data to play with
# For demonstration purposes we'll generate a synthetic dataset:
# a uniformly distributed predictor and a linear response with normally distributed noise
rng = np.random.RandomState(25) # seeded random number generator, so results are reproducible
time = 5 * rng.rand(50) # 50 uniformly distributed values between 0 and 5
target = (5 * time) + 5 + rng.randn(50) # a line (intercept 5, slope 5) plus normally distributed noise
data_raw = pd.DataFrame({'time': time,'target' : target})
print(data_raw.head())
plt.figure(figsize=(25,7))
plt.scatter(data_raw['time'], data_raw['target']);
Pretty safe to say that the linear relation holds; we did create the data from a line, after all.
print(data_raw.head())
print('\nBasic Statistics')
print(' Mean :',data_raw[['target']].mean().values[0])
print(' Var :',data_raw[['target']].var().values[0])
print(' Std Dev :',data_raw[['target']].std().values[0])
print(' Skewness:',data_raw[['target']].skew().values[0])
print(' Kurtosis:',data_raw[['target']].kurtosis().values[0])
Documentation: (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html)
This function is specialised for linear, and only linear, regression.
from scipy import stats
x = data_raw['time']
y = data_raw['target']
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# r_value is the correlation coefficient; square it to get the coefficient of determination (r_squared)
print("Slope :", slope)
print("Intercept :", intercept)
print("Corr Coeff:", r_value)
print("R^2 :", r_value**2)
print("Std Error :", std_err)
print("P-Value :", p_value)
#Plot the data along with the fitted line
plt.figure(figsize=(25,10))
plt.ylim(0,40)
plt.xlim(0,6)
plt.plot(x, y, 'o', label='original data')
plt.plot(x, intercept + slope*x, 'r', label='fitted line')
plt.legend()
plt.show()
curve_fit uses non-linear least squares to fit a function you provide to the data.
from scipy.optimize import curve_fit
def linfunc(x, a, b):
    return (a * x) + b # the same linear form we used to create the data
# curve_fit expects 1-d arrays
x0 = data_raw['time'].values
y0 = data_raw['target'].values
popt, pcov = curve_fit(linfunc, x0, y0)
plt.plot(x0, y0, '.', label='data')
plt.plot(x0, linfunc(x0,*popt), 'r-', label='pred')
print(popt)
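curve_fit also returns the covariance matrix of the fitted parameters (pcov). If you want per-parameter uncertainties comparable to the standard error reported by linregress, a common pattern is to take the square root of its diagonal; a quick sketch:
perr = np.sqrt(np.diag(pcov)) # one standard error per fitted parameter
print('Std errors of a and b:', perr)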
How do we start predicting? Simple: take the slope and intercept returned by the function and build your model.
Our fitted equation is as follows
$$\hat{Y} = \mathrm{intercept} + \mathrm{slope} \cdot X $$
For this generated data it should come out close to $ 5 + 5X $, the values we baked in, and the standard error reported above tells us how much uncertainty there is in the estimated slope.
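Putting that into code, here's a minimal sketch of a prediction function built from the values linregress returned; the input value of 6 is just an illustrative point in the future.
def predict(x_new):
    return intercept + slope * x_new # our fitted line
print('Forecast at time = 6 :', predict(6))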
sklearn (http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Initial data => data_raw
X = data_raw[['time']]
y = data_raw[['target']]
# Let's do an 80/20 split into train/test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
linreg = linear_model.LinearRegression() # Create a linear regression object
linreg.fit(X_train, y_train) # Train the model
score = linreg.score(X_train, y_train) # score() already returns R^2 on the given data
print('Coefficients: ', linreg.coef_[0]) # Print the coefficients
print('Intercept : ', linreg.intercept_[0] )
print('R^2 : ', score )
y_predict = linreg.predict(X_train)
# sklearn does not report a standard error directly,
# so we estimate the residual standard deviation (RMSE on the training data) by hand
std_err = (((y_train - y_predict )**2).sum() / y_train.count()[0])**(1/2)
print('Coefficients: ', linreg.coef_[0]) # Print the coefficients
print('Intercept : ', linreg.intercept_[0] )
print('R^2 : ', score )
print('Std error : ', std_err[0])
print('MeanSqr_err : ', mean_squared_error(y_train, y_predict))
print('Var score : ', r2_score(y_train, y_predict))
# Compare the above numbers to the results from section 3.1!
#Plot the data along with the fitted line
plt.figure(figsize=(25,10))
plt.ylim(0,40)
plt.xlim(0,6)
plt.plot(X_train, y_train, 'o', label='original data')
plt.plot(X_train, y_predict, 'r', label='fitted line')
plt.legend()
plt.show()
As expected, it's very close to our original model.
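Since we held back a test set earlier, we can also check how the model does on data it has never seen; a quick sketch using the metrics we already imported:
y_test_pred = linreg.predict(X_test) # predictions on the held-out data
print('Test R^2 :', r2_score(y_test, y_test_pred))
print('Test MSE :', mean_squared_error(y_test, y_test_pred))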
We'll make this one short, as it works similarly to the two approaches above; this time we use statsmodels.
import statsmodels.api as sm
X3 = data_raw[['time']]
y3 = data_raw[['target']]
X3 = sm.add_constant(X3) # add the constant term ( intercept )
m3 = sm.OLS(y3,X3) # Create model using Ordinary least squares
results = m3.fit() # Fit OLS model
print(results.summary()) # display the results
Statsmodels is really made for statistics! Most of the terms are pretty straightforward and should already be familiar to you. Let's go over one that may not be:
Prob(Omnibus): the p-value of the Omnibus test for normality of the residuals; a large value is consistent with normally distributed residuals, which is one of the assumptions behind OLS.
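If you only want a few of those numbers rather than the whole table, the fitted results object exposes them as attributes; a quick sketch:
print('Params :\n', results.params) # intercept (const) and slope
print('R^2    :', results.rsquared)
print('P-vals :\n', results.pvalues)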
Note that NumPy also has a polyfit that works similarly.
# import numpy as np # We've already imported before
X4 = data_raw['time']
y4 = data_raw['target']
coeff = np.polyfit(X4,y4,1)
model = np.poly1d(coeff)
y4_pred = model(X4)
print('Coefficients', coeff)
print('Model: ',model)
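The poly1d object is callable, so making a prediction at a new point is just a function call; the input value of 6 below is purely illustrative, and the coefficients should line up with what linregress gave us earlier.
print('Prediction at time = 6 :', model(6)) # evaluate the fitted polynomial
print('Slope and intercept    :', coeff[0], coeff[1]) # polyfit returns the highest degree first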
Sometimes our data is not linear. This is not to say that linear regression can't be used, we simply need to transform it.
Here's a small list of common transformations:
$ \ln(y) = \beta_0 + \beta_1 \ln(x) $ - logarithmic transform (take the log of one or both sides)
$ \mu_y = \beta_0 + \beta_1 (\frac{1}{x}) $ - reciprocal transform
$ \mu_y = \beta_0 + \beta_1 e^{-x} $ - exponential transform
rng = np.random.RandomState(25) # seeded random number generator
time = 5 * rng.rand(75) # 75 uniformly distributed values between 0 and 5
targ = 5 + (np.exp(time)) + rng.randn(75) # an exponential relationship plus normally distributed noise
target = np.log(targ) # log-transform the response so the relationship becomes roughly linear
tmp = pd.DataFrame({'time': time,'target' : target,'targ' : targ })
print(tmp.head())
plt.figure(figsize=(25,7))
plt.scatter(tmp['time'], tmp['target']);
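After the log transform the relationship is roughly linear again, so the same tools apply; here's a minimal sketch fitting the transformed column with linregress (the stats module was imported earlier):
slope_t, intercept_t, r_t, p_t, se_t = stats.linregress(tmp['time'], tmp['target'])
print('Slope     :', slope_t)
print('Intercept :', intercept_t)
print('R^2       :', r_t**2)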
Multiple linear regression is also known as multivariable linear regression. Much of what we've seen can be extended to the multivariate case. We're also going to use a mix of the libraries used above.
The model we fit is simply an extension of the 1 variable case
$$ Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots $$
The most important thing to note is that the model is still linear in the predictors: each one enters with degree 1, so it's a polynomial of degree 1 in several variables.
The business wants to know where they're getting the best bang for their buck.
Let's determine the relationship between spending on advertising channels and the effect on sales.
We've been provided with a CSV containing the advertising spend on TV, radio, and newspaper, along with the corresponding sales.
Why choose regression: the output/response (sales) is a continuous value (a float), not a category.
import pandas as pd
import numpy as np
# Let's load & look at the data
df_raw = pd.read_csv('./Data/Advertising.csv', header=0)
print('Shape :',df_raw.shape)
print(df_raw.head())
print(df_raw.tail())
# Let's plot each predictor against sales to get a sense of the relationship
import seaborn as sns
sns.pairplot(df_raw, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales', height=7, aspect=0.7, kind='reg')
# We notice that newspaper seems very scattered
# we may have issues using this in our model
features = ['TV', 'radio','newspaper'] # Split up our data into predictors
X = df_raw[features]
y = df_raw['sales'] # and a response variable
# Note that X is a pandas dataframe and y is a pandas series
#
# import and split them up
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Default will be a 75/25 train-test split using random sampling
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# we use statsmodels for the detailed statistical summary
import statsmodels.api as sm
X = sm.add_constant(X)
est = sm.OLS(y,X).fit()
print('Statsmodels Metrics \n')
print(est.summary())
from sklearn.linear_model import LinearRegression
linreg = LinearRegression() # instantiate
linreg.fit(X_train, y_train) # fit the model to the training data (learn the coefficients)
print('\n SKlearn Results')
print(linreg.intercept_) # print the intercept and coefficients
print(linreg.coef_)
The results are pretty consistent. Note the p-value for newspaper, though: it suggests newspaper spend tells us little about sales once TV and radio are accounted for.
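To see what those coefficients mean in practice, here's a quick sketch of a manual prediction; the spend values below are made up purely for illustration.
example_spend = pd.DataFrame([[100, 25, 10]], columns=features) # hypothetical TV, radio, newspaper spend
manual_pred = linreg.intercept_ + np.dot(example_spend.values, linreg.coef_) # intercept + sum of coef * spend
print('Manual prediction :', manual_pred[0])
print('Via predict()     :', linreg.predict(example_spend)[0])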
Math refresher
Recall:
$$ \text{Mean Absolute Error (MAE)} = \frac{1}{N} \sum_{i=1}^N | Y_i - \hat{Y_i} | $$ This is basically your average error size.
$$ \text{Mean Squared Error (MSE)} = \frac{1}{N} \sum_{i=1}^N ( Y_i - \hat{Y_i} )^2 $$ Similar to MAE but with a magnifying effect on larger errors.
$$ \text{Root Mean Squared Error (RMSE)} = \sqrt{ \frac{1}{N} \sum_{i=1}^N ( Y_i - \hat{Y_i} )^2 } $$ Interpretable in terms of the response y.
# make predictions
y_pred = linreg.predict(X_test)
# Evaluate the model predictions
from sklearn import metrics
print(' MAE :',metrics.mean_absolute_error(y_test, y_pred))
print(' MSE :',metrics.mean_squared_error(y_test, y_pred))
print('RMSE :',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
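As a sanity check, here are the same three metrics computed by hand straight from the formulas above; they should match the sklearn values.
err = y_test - y_pred # residuals on the test set
print(' MAE :', np.mean(np.abs(err)))
print(' MSE :', np.mean(err**2))
print('RMSE :', np.sqrt(np.mean(err**2)))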
Our mean squared error is pretty good! Could it be improved? I'm betting that removing newspaper would yield an even better model. We'll leave that as an exercise for the reader.