Principal Component Analysis

Intro

Principal Component Analysis (PCA) is a procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called the principal components. It does this by repeatedly finding the direction that explains the greatest amount of the remaining variability in the data while being orthogonal to the preceding components (a constraint that trivially does not apply to the first component). The result is a set of vectors that forms an uncorrelated orthogonal basis for the data.

For those needing a refresher (like I did):
Correlation := a standard measure of the interdependence between two variables.
Orthogonal := in a statistical context this means uncorrelated, which geometrically corresponds to two vectors sitting at 90 degrees to each other. The orthogonal basis set is a candidate set of axes, a.k.a. the principal axes. I like to think of these as the x, y, z axes in a 3-dimensional diagram.
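
As a tiny sanity check of those two definitions, here is a minimal sketch (standalone toy vectors, not the notebook's data) showing a perfect correlation and a zero dot product between orthogonal vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1                      # a linear function of a, so perfectly correlated
print(np.corrcoef(a, b)[0, 1])     # 1.0

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(np.dot(u, v))                # 0.0, i.e. orthogonal: the vectors are at 90 degrees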

In machine learning applications PCA is used to reduce the number of dimensions. High-dimensional data can be easier to work with mathematically, but it is considerably more difficult to relate back to the real world.

Basically it's a great way to take a lot of data and reduce it to the important stuff.
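
Before diving into the scikit-learn version, the whole procedure described above fits in a few lines of NumPy. This is only an illustrative sketch (the helper name pca_sketch and its arguments are mine and are not used later in the notebook): center the data, compute the covariance matrix, and keep its eigenvectors sorted by eigenvalue.

import numpy as np

def pca_sketch(X, k):
    # center each feature so the covariance matrix describes spread around the mean
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest-variance directions
    components = eigvecs[:, order]             # principal axes as columns
    return components, Xc @ components         # axes and the projected (reduced) data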

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns 
sns.set()
In [2]:
rng = np.random.RandomState(1)
# mix three random signals into two features to get 250 correlated 2-D points
X = np.dot(rng.rand(2, 3), rng.randn(3, 250)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
In [3]:
# Obviously this dataset is highly correlated. 
# Instead of trying to predict y from x we shall try to learn about their relationship. 

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print("Principal Components: ")
print(pca.components_)

print("Explained Variance: ")
print(pca.explained_variance_)
Principal Components: 
[[-0.94310415 -0.33249746]
 [ 0.33249746 -0.94310415]]
Explained Variance: 
[0.85137301 0.03735463]
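Two quick checks help demystify these numbers (a sketch assuming the pca object fit above): the rows of components_ form an orthonormal basis, and explained_variance_ is just the set of eigenvalues of the data's covariance matrix.

# the component vectors are unit length and mutually orthogonal
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))        # True

# the explained variances are the covariance eigenvalues, sorted high to low
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.allclose(eigvals, pca.explained_variance_))                      # True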
In [4]:
# The above appears rather cryptic so let's try to visualize it
def draw_vector(v0, v1, ax=None):
    # draw an annotation arrow from point v0 to point v1 on the given (or current) axes
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
In [5]:
# The axes above are the principal axes for the dataset.
# The length of each axis/vector reflects its importance in explaining the variance
# and thus in describing the distribution.

#Now let's see what happens when we reduce the dimensionality
# recreate the original diagram and draw the vectors
pca = PCA(n_components=2)
pca.fit(X)

pca2 = PCA(n_components=1)
pca2.fit(X)
X_pca = pca2.transform(X)
#print("original shape:   ", X.shape)
#print("transformed shape:", X_pca.shape)
X_new = pca2.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], c='red', alpha=0.8)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
    
plt.axis('equal');

What we've done is create 2 models.

Model 1: Retains both dimensions and computes the principal component vectors.

Model 2: Sets n_components=1, thereby reducing the number of dimensions from 2 to 1. It transforms the data (zeroing out the less valuable component) and stores the result in X_pca, then applies the inverse transform, which extrapolates the full 2-D coordinates back from X_pca.

The end result is that the extrapolated data (the red points) retains the majority of the information in the original while storing only half as many numbers per point.
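
We can put a number on "the majority of the information" (a quick check using the pca2 and X_new objects from the cell above): the explained-variance ratio of the single retained component, and the average reconstruction error of the red points.

# share of the total variance captured by the one retained component
print(pca2.explained_variance_ratio_)                 # roughly [0.96] for this dataset

# mean squared distance between the original and the reconstructed points
print(np.mean(np.sum((X - X_new) ** 2, axis=1)))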

Real-World Example 1

It can be difficult to appreciate the application of PCA from the concocted data above, so let's take a look at something in higher dimensions.

The following example can be found at:
https://www.kaggle.com/nirajvermafcb/principal-component-analysis-explained/notebook

In [6]:
import pandas as pd
raw = pd.read_csv('./data/hr_comma_sep.csv')
print('Shape: ', raw.shape)

columns_names = raw.columns.tolist()
print("Columns names:")
print(columns_names)

raw.head()
Shape:  (14999, 10)
Columns names:
['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'sales', 'salary']
Out[6]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [7]:
correlation = raw.corr()
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x221b6d7ed30>
In [8]:
raw2 = raw.drop(labels=['sales','salary'],axis=1)       # drop the non-numeric columns
# Move target column to the front
cols = raw2.columns.tolist()
cols.insert(0, cols.pop(cols.index('left')))
raw2 = raw2.reindex(columns= cols)
#raw2.columns.tolist()

#Separate features and labels
X = raw2.iloc[:,1:8].values       #Cols 1 through 7 (the upper bound 8 is exclusive)
y = raw2.iloc[:,0].values         #Col  0 = left
X
Out[8]:
array([[0.38, 0.53, 2.  , ..., 3.  , 0.  , 0.  ],
       [0.8 , 0.86, 5.  , ..., 6.  , 0.  , 0.  ],
       [0.11, 0.88, 7.  , ..., 4.  , 0.  , 0.  ],
       ...,
       [0.37, 0.53, 2.  , ..., 3.  , 0.  , 0.  ],
       [0.11, 0.96, 6.  , ..., 4.  , 0.  , 0.  ],
       [0.37, 0.52, 2.  , ..., 3.  , 0.  , 0.  ]])
In [9]:
from sklearn.preprocessing import StandardScaler

#Standardize features by removing the mean and scaling to unit variance
X_std = StandardScaler().fit_transform(X)

#Get mean vector
mean_vec = np.mean(X_std, axis=0)

#Get feature var-covar matrix
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)
Covariance matrix 
[[ 1.00006668  0.10502822 -0.14297912 -0.02004945 -0.1008728   0.05870115
   0.02560689]
 [ 0.10502822  1.00006668  0.34935588  0.33976445  0.1315995  -0.00710476
  -0.00868435]
 [-0.14297912  0.34935588  1.00006668  0.41723845  0.19679901 -0.00474086
  -0.00606436]
 [-0.02004945  0.33976445  0.41723845  1.00006668  0.12776343 -0.01014356
  -0.00354465]
 [-0.1008728   0.1315995   0.19679901  0.12776343  1.00006668  0.00212056
   0.06743742]
 [ 0.05870115 -0.00710476 -0.00474086 -0.01014356  0.00212056  1.00006668
   0.03924805]
 [ 0.02560689 -0.00868435 -0.00606436 -0.00354465  0.06743742  0.03924805
   1.00006668]]
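A small aside, assuming X_std and cov_mat from the cell above: np.cov reproduces the manual computation, and because the features are standardized this matrix is essentially the correlation matrix (up to the n/(n-1) factor left over from StandardScaler using the population variance).

# np.cov with rowvar=False uses the same (n-1) denominator as the manual version
print(np.allclose(cov_mat, np.cov(X_std, rowvar=False)))                          # True

# standardized features: covariance and correlation agree up to the n/(n-1) factor
n = len(X_std)
print(np.allclose(cov_mat, (n / (n - 1)) * np.corrcoef(X_std, rowvar=False)))     # True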
In [10]:
plt.figure(figsize=(8,8))
sns.heatmap(cov_mat, vmax=1, square=True,annot=True,cmap='cubehelix')
plt.title('Correlation between features')
Out[10]:
Text(0.5,1,'Correlation between features')
In [10]:
# Eigenvector/Value decomposition
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
Eigenvectors 
[[ 0.08797699  0.29189921  0.27784886  0.33637135  0.79752505  0.26786864
  -0.09438973]
 [-0.50695734 -0.30996609 -0.70780994  0.07393548  0.33180877  0.1101505
  -0.13499526]
 [-0.5788351   0.77736008 -0.00657105 -0.19677589 -0.10338032 -0.10336241
  -0.02293518]
 [-0.54901653 -0.45787675  0.63497294 -0.25170987  0.10388959 -0.01034922
  -0.10714981]
 [-0.31354922 -0.05287224  0.12200054  0.78782241 -0.28404472  0.04036861
   0.42547869]
 [ 0.01930249 -0.04433104 -0.03622859 -0.05762997  0.37489883 -0.8048393
   0.45245222]
 [-0.00996933 -0.00391698 -0.04873036 -0.39411153  0.10557298  0.50589173
   0.75836313]]

Eigenvalues 
[1.83017431 0.54823098 0.63363587 0.84548166 1.12659606 0.95598647
 1.06036136]
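Note that np.linalg.eig does not promise any particular ordering of the eigenvalues, which is why the next cell sorts them. Since a covariance matrix is symmetric, np.linalg.eigh is the more natural routine (real eigenvalues, returned in ascending order); a quick equivalent using cov_mat from above:

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eig_vals_h, eig_vecs_h = np.linalg.eigh(cov_mat)
print(eig_vals_h[::-1])     # the same eigenvalues as above, now sorted high to low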
In [11]:
# Selecting Principal Components
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
#print('Eigenvalues in descending order:')
#for i in eig_pairs:
#    print(i[0])
    
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
with plt.style.context('dark_background'):
    plt.figure(figsize=(6, 4))

    plt.bar(range(7), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
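The bar chart shows each component's individual contribution; printing the running total (using the var_exp list from the cell above) makes the cut-off decision easier to read:

cum_var_exp = np.cumsum(var_exp)     # cumulative explained variance, in percent
for i, cum in enumerate(cum_var_exp, start=1):
    print('First %d component(s): %.1f%%' % (i, cum))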
In [12]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(7,1), 
                      eig_pairs[1][1].reshape(7,1)
                    ))
print('Matrix W:\n', matrix_w)
Matrix W:
 [[ 0.08797699  0.79752505]
 [-0.50695734  0.33180877]
 [-0.5788351  -0.10338032]
 [-0.54901653  0.10388959]
 [-0.31354922 -0.28404472]
 [ 0.01930249  0.37489883]
 [-0.00996933  0.10557298]]
In [13]:
Y = X_std.dot(matrix_w)
Y
Out[13]:
array([[ 1.90035018, -1.12083103],
       [-2.1358322 ,  0.2493369 ],
       [-3.05891625, -1.68312693],
       ...,
       [ 2.0507165 , -1.182032  ],
       [-2.91418496, -1.42752606],
       [ 1.91543672, -1.17021407]])
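This manual projection should agree with scikit-learn's output up to the sign of each component (eigenvector signs are arbitrary, which is why the first column here is flipped relative to the sklearn result further down); a quick comparison, assuming X_std and Y are still in scope:

from sklearn.decomposition import PCA

Y_check = PCA(n_components=2).fit_transform(X_std)
# compare magnitudes column by column, since each principal axis is only defined up to sign
print(np.allclose(np.abs(Y), np.abs(Y_check)))       # expected: True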
In [14]:
from sklearn.decomposition import PCA
pca = PCA().fit(X_std)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlim(0, 7)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
Out[14]:
Text(0,0.5,'Cumulative explained variance')
In [15]:
#The plot above shows that the first 6 components explain over 90% of the variance, so we can drop the 7th component.

from sklearn.decomposition import PCA 
sklearn_pca = PCA(n_components=6)
Y_sklearn = sklearn_pca.fit_transform(X_std)
print(Y_sklearn.shape)
print(Y_sklearn)
(14999, 6)
[[-1.90035018 -1.12083103 -0.0797787   0.03228437 -0.07256447  0.06063013]
 [ 2.1358322   0.2493369   0.0936161   0.50676925  1.2487747  -0.61378158]
 [ 3.05891625 -1.68312693 -0.301682   -0.4488635  -1.12495888  0.29066929]
 ...
 [-2.0507165  -1.182032   -0.04594506  0.02441143 -0.01553247  0.24980658]
 [ 2.91418496 -1.42752606 -0.36333357 -0.31517759 -0.97107375  0.51444624]
 [-1.91543672 -1.17021407 -0.07024077  0.01486762 -0.09545357  0.01773844]]
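
As a final check (using the sklearn_pca object fit above), the explained-variance ratios confirm how much of the total variance the six retained components keep:

print(sklearn_pca.explained_variance_ratio_)         # per-component share of the total variance
print(sklearn_pca.explained_variance_ratio_.sum())   # roughly 0.92 of the variance retained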