Thursday, 11 May 2017

PCA - An introduction

Introduction 


The origin of PCA is closely analogous to TLS (Total Least Squares), which is used for regression, but PCA extends well beyond it: data compression (reducing the dimensionality of the data), data imputation (filling in missing data), Iterative PCA (IPCA, used for estimating the errors in the data), Non-Negative Matrix Factorisation (NMF, used for extracting clean signals from noisy data - the famous cocktail party problem), Network Component Analysis (NCA, used mainly in bioinformatics), and finally Kernel PCA (KPCA, used to model more general, nonlinear physical data). PCA is a powerful tool that can be used for all of the above, provided computational power is not a constraint; that is, assuming we have enough computational power, we can devise an algorithm for each of the applications mentioned.

Introduction to SVD 

PCA uses a mathematical tool called the Singular Value Decomposition (SVD). It is very much analogous to other factorisation techniques such as the QR decomposition, the eigenvalue decomposition, etc.
Let us define a matrix $A$ of size $m*n$; its (full) SVD is given by:
$A_{m*n} = U_{m*m}S_{m*n}V_{n*n}^{'}$

The properties of the factors: the columns of $U$ and $V$ are the left and right singular vectors respectively, and each set is orthonormal. $S$ is a diagonal matrix containing the singular values; generically, the first few singular values are non-zero and arranged in descending order, followed by (approximately) zero singular values,
i.e. $s_{1} \geq s_{2} \geq s_{3} \geq \cdots \geq s_{r} \geq s_{r+1} \approx \cdots \approx 0$ (here the singular values after the $r$-th value are approximately equal to 0).
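
A minimal sanity check of these properties, using MATLAB's svd function (introduced more formally later in this post) on a toy matrix whose size is purely an assumption for illustration:

    % Minimal sketch on a toy matrix (sizes assumed for illustration).
    A = randn(4,6);
    [U,S,V] = svd(A);            % full SVD: U is 4x4, S is 4x6, V is 6x6

    norm(A - U*S*V')             % ~ 0 : the factorisation reproduces A
    norm(U'*U - eye(4))          % ~ 0 : columns of U are orthonormal
    norm(V'*V - eye(6))          % ~ 0 : columns of V are orthonormal
    diag(S)'                     % singular values, non-negative and in descending order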

It can now be easily seen that :
$AA^{'} = (USV^{'} )*(VS^{'}U^{'} ) $
$\implies AA^{'} = U(SS^{'})U^{'}$
Hence the columns of $U$ are the eigenvectors, and the diagonal entries of $SS^{'}$ the eigenvalues, of the matrix $AA^{'}$. It can also be noted that if $A$ is a data matrix that contains each data point as a column vector (i.e. the matrix sizing is Variables x No. of data points), then the matrix $AA^{'}$ acts rather like a (scaled, uncentred) covariance matrix across the variables.

Similarly ,
$A^{'}A = (VS^{'}U^{'} )*(USV^{'} ) $
$\implies A^{'}A = V(S^{'}S)V^{'}$
Hence the columns of $V$ are the eigenvectors, and the diagonal entries of $S^{'}S$ the eigenvalues, of the matrix $A^{'}A$. Similarly, if $A$ is a data matrix that contains each data point as a column vector, then the matrix $A^{'}A$ acts rather like a (scaled, uncentred) covariance matrix across the data points.
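
A quick numerical check of both eigen relations (a sketch on a toy matrix, sizes assumed):

    A = randn(4,6);
    [U,S,V] = svd(A);

    norm(A*A' - U*(S*S')*U')       % ~ 0 : U diagonalises A*A'
    norm(A'*A - V*(S'*S)*V')       % ~ 0 : V diagonalises A'*A

    sort(eig(A*A'),'descend')'     % eigenvalues of A*A' ...
    (diag(S).^2)'                  % ... equal the squared singular values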

Now it can also be seen that 
$A = \sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'}$
where $U_{i}$ and $V_{i}$ are the $i$-th columns of the $U$ and $V$ matrices respectively, $s_{i}$ is the $i$-th singular value, and $k = min(m,n)$.
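
To see this expansion concretely, here is a minimal sketch (toy matrix assumed) in which the $k$ rank-one terms rebuild $A$ exactly; truncating the sum after the largest singular values is what gives the data compression mentioned in the introduction:

    A = randn(5,8);
    [U,S,V] = svd(A);
    k = min(size(A));                             % k = min(m,n) = 5

    Asum = zeros(size(A));
    for i = 1:k
        Asum = Asum + U(:,i) * S(i,i) * V(:,i)';  % U_i * s_i * V_i'
    end
    norm(A - Asum)                                % ~ 0 : the k rank-one terms reproduce A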
Now that adequate details have been given about the SVD and the properties of the decomposed matrices, I hope you can yourself formulate the method required to compute the decomposition (i.e. the SVD).

Algorithm 

Setup of the problem:
For all purposes, let us define some basic conventions that will be used throughout the series of blogs under MVDA.
Let $A_{m*n}$ be the data matrix (Variables x No. of data points).
Define the matrix $Z = A - \bar{A}$, where $\bar{A}$ is the mean of the data matrix (the mean of each variable, subtracted from every data point).
For all further purposes we shall be dealing with the $Z$ matrix.
Now, some inferences we shall make about the $Z$ matrix:
First, let us assume that the $Z$ matrix contains no errors and that the system follows exact linear relations; then the number of constraints = $m$ - row rank of $Z$.
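
A small sketch of this setup (the toy data and the exact linear relation are assumptions for illustration):

    % Toy data: 3 variables x 50 data points obeying one exact relation x3 = x1 + 2*x2.
    X = randn(2,50);
    A = [X; X(1,:) + 2*X(2,:)];     % m = 3 variables, n = 50 data points

    Abar = mean(A,2);               % mean of each variable (each row)
    Z = A - Abar;                   % mean-centred data matrix
                                    % (implicit expansion, R2016b+; use repmat/bsxfun on older MATLAB)
    m = size(Z,1);
    rank(Z)                         % = 2 : one linear relation among the 3 variables
    m - rank(Z)                     % number of constraints = 1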

Algorithm Approach:
Let us perform an SVD on $Z$.
If we take the left singular vector ($U_{m}$) corresponding to the last (smallest) singular value, we get the following.
Note: when $m \leq n$ (a well-posed data matrix), $k = m$, hence:

$ U_{k}^{'}  Z = U_{k}^{'}* (\sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'})$
$\implies U_{k}^{'}  Z =  s_{k}V_{k}^{'} \approx  0$ (pre-multiplying with the last left singular vector; the other terms vanish by orthonormality, and $s_{k} \approx 0$).
Now you may see where the TLS algorithm discussed in my previous blog originates.

If the data matrix is ill-posed, $m > n$, then $k = n < m$, hence

$ U_{m}^{'}  Z = U_{m}^{'}* (\sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'})$
$\implies U_{m}^{'}  Z =  0$ (pre-multiplying with the last left singular vector, which is orthogonal to $U_{1}, \ldots, U_{k}$).
Now, if there are multiple constraints, we take the last few columns of the $U$ matrix (those corresponding to the approximately zero singular values); this set can now be called a regression set.
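
Putting the pieces together, a sketch of extracting the regression (constraint) set, continuing the toy $Z$ and $m$ assumed in the earlier sketch:

    [U,S,V] = svd(Z);                % Z is the mean-centred m x n data matrix

    nconstr = m - rank(Z);           % number of (approximately) zero singular values
    Ureg = U(:, m-nconstr+1:end);    % last few columns of U form the regression set

    norm(Ureg' * Z)                  % ~ 0 : each constraint vector annihilates the data
    Ureg'                            % each row is one linear relation among the variables
                                     % (here proportional to [1 2 -1], i.e. x1 + 2*x2 - x3 = 0,
                                     % up to sign and scaling)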

Now, as mentioned, the $Z$ matrix contains each and every data point as a column; if the data matrix you are using is laid out the other way around (data points as rows), then you can instead post-multiply with the last few columns of $V$ corresponding to the smallest singular values.

Matlab Function

Since we have seen a brief intro to the SVD, I shall now introduce the Matlab function that will do the decomposition for you:
Matlab Function : [U,S,V] = svd(A).

The above function can always be used to perform the SVD and thereby obtain a regression set.
Since it was already mentioned that $k = min(m,n)$, consider the case where $m \leq n$: the extra columns of the $V$ matrix (i.e. columns $k+1$ to $n$) are then computed unnecessarily, thereby increasing the computational cost.

Hence we can use a specialized version of the same function under certain conditions:

[U,S,V] = svd(A,'econ')

This version ignores the above-mentioned extra columns of the $V$ matrix and returns $S$ as a square matrix.
Note (assuming the blog convention for the data matrix):
           The 'econ' version should only be used when $m \leq n$.
           If $m > n$, 'econ' should not be used, since it would then also drop the last columns of $U$, which are exactly the vectors needed for the regression set.
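
A quick size comparison illustrating this note (toy sizes assumed):

    % Well-posed case, m <= n : 'econ' drops only the unneeded columns of V.
    Z1 = randn(4,100);
    [U1,S1,V1] = svd(Z1,'econ');   % U1 is 4x4, S1 is 4x4, V1 is 100x4 -- U1 is still complete

    % Ill-posed case, m > n : 'econ' would also truncate U and discard its last columns,
    % which are exactly the regression-set vectors, so the full SVD is used instead.
    Z2 = randn(100,4);
    [U2,S2,V2] = svd(Z2);          % U2 is 100x100, S2 is 100x4, V2 is 4x4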


