Thursday, 11 May 2017

PCA - An introduction

Introduction 


The origin of PCA is closely analogous to TLS (Total Least Squares), which is used for regression, but PCA extends well beyond it: data compression (reducing the dimensionality of the data), data imputation (filling in missing data), Iterative PCA (IPCA, used for estimating the errors in the data), Non-Negative Matrix Factorisation (NMF, used for extracting clean signals from noisy data - the famous cocktail party problem), Network Component Analysis (NCA, used mainly in bioinformatics), and finally Kernel PCA (KPCA, used to model more general, nonlinear physical data). PCA is a powerful tool that can be used for all of the above, provided computational power is not a constraint; that is, assuming we have enough computational power, we can devise an algorithm for each of the applications mentioned.

Introduction to SVD 

PCA uses a mathematical tool called the Singular Value Decomposition (SVD). It is very much analogous to other factorisation techniques such as the QR decomposition, the eigenvalue decomposition, etc.
Let us define a matrix $A$ of size $m*n$; its (full) SVD is given by:
$A_{m*n} = U_{m*m}S_{m*n}V_{n*n}^{'}$

The properties of the factors: the columns of $U$ and $V$ are the left and right singular vectors respectively, and each set is orthonormal. $S$ is a diagonal matrix containing the singular values; generically, the first few singular values are non-zero and arranged in descending order, followed by (approximately) zero singular values,
i.e. $s_{1} \geq s_{2} \geq s_{3} \geq \cdots \geq s_{r} \geq s_{r+1} \approx \cdots \approx 0$ (here the singular values after the $r$-th value are approximately equal to 0).
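
A minimal sanity check of these properties, using MATLAB's svd function (introduced more formally later in this post) on a toy matrix whose size is purely an assumption for illustration:

    % Minimal sketch on a toy matrix (sizes assumed for illustration).
    A = randn(4,6);
    [U,S,V] = svd(A);            % full SVD: U is 4x4, S is 4x6, V is 6x6

    norm(A - U*S*V')             % ~ 0 : the factorisation reproduces A
    norm(U'*U - eye(4))          % ~ 0 : columns of U are orthonormal
    norm(V'*V - eye(6))          % ~ 0 : columns of V are orthonormal
    diag(S)'                     % singular values, non-negative and in descending order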

It can now be easily seen that :
$AA^{'} = (USV^{'} )*(VS^{'}U^{'} ) $
$\implies AA^{'} = U(SS^{'})U^{'}$
Hence the columns of $U$ are the eigenvectors, and the diagonal entries of $SS^{'}$ the eigenvalues, of the matrix $AA^{'}$. It can also be noted that if $A$ is a data matrix that contains each data point as a column vector (i.e. the matrix sizing is Variables x No. of data points), then the matrix $AA^{'}$ acts rather like a (scaled, uncentred) covariance matrix across the variables.

Similarly ,
$A^{'}A = (VS^{'}U^{'} )*(USV^{'} ) $
$\implies A^{'}A = V(S^{'}S)V^{'}$
Hence the columns of $V$ are the eigenvectors, and the diagonal entries of $S^{'}S$ the eigenvalues, of the matrix $A^{'}A$. Similarly, if $A$ is a data matrix that contains each data point as a column vector, then the matrix $A^{'}A$ acts rather like a (scaled, uncentred) covariance matrix across the data points.
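
A quick numerical check of both eigen relations (a sketch on a toy matrix, sizes assumed):

    A = randn(4,6);
    [U,S,V] = svd(A);

    norm(A*A' - U*(S*S')*U')       % ~ 0 : U diagonalises A*A'
    norm(A'*A - V*(S'*S)*V')       % ~ 0 : V diagonalises A'*A

    sort(eig(A*A'),'descend')'     % eigenvalues of A*A' ...
    (diag(S).^2)'                  % ... equal the squared singular values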

Now it can also be seen that 
$A = \sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'}$
where $U_{i}$ and $V_{i}$ are the $i$-th columns of the $U$ and $V$ matrices respectively, $s_{i}$ is the $i$-th singular value, and $k = min(m,n)$.
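
To see this expansion concretely, here is a minimal sketch (toy matrix assumed) in which the $k$ rank-one terms rebuild $A$ exactly; truncating the sum after the largest singular values is what gives the data compression mentioned in the introduction:

    A = randn(5,8);
    [U,S,V] = svd(A);
    k = min(size(A));                             % k = min(m,n) = 5

    Asum = zeros(size(A));
    for i = 1:k
        Asum = Asum + U(:,i) * S(i,i) * V(:,i)';  % U_i * s_i * V_i'
    end
    norm(A - Asum)                                % ~ 0 : the k rank-one terms reproduce A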
Now that adequate details have been given about the SVD and the properties of the decomposed matrices, I hope you can yourself formulate the method required to compute the decomposition (i.e. the SVD).

Algorithm 

Setup of the problem:
For all purposes, let us define some basic conventions that will be used throughout the series of blogs under MVDA.
Let $A_{m*n}$ be the data matrix (Variables x No. of data points).
Define the matrix $Z = A - \bar{A}$, where $\bar{A}$ is the mean of the data matrix (the mean of each variable, subtracted from every data point).
For all further purposes we shall be dealing with the $Z$ matrix.
Now, some inferences we shall make about the $Z$ matrix:
First, let us assume that the $Z$ matrix contains no errors and that the system follows exact linear relations; then the number of constraints = $m$ - row rank of $Z$.
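
A small sketch of this setup (the toy data and the exact linear relation are assumptions for illustration):

    % Toy data: 3 variables x 50 data points obeying one exact relation x3 = x1 + 2*x2.
    X = randn(2,50);
    A = [X; X(1,:) + 2*X(2,:)];     % m = 3 variables, n = 50 data points

    Abar = mean(A,2);               % mean of each variable (each row)
    Z = A - Abar;                   % mean-centred data matrix
                                    % (implicit expansion, R2016b+; use repmat/bsxfun on older MATLAB)
    m = size(Z,1);
    rank(Z)                         % = 2 : one linear relation among the 3 variables
    m - rank(Z)                     % number of constraints = 1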

Algorithm Approach:
Let us perform an SVD on $Z$.
If we take the left singular vector ($U_{m}$) corresponding to the last (smallest) singular value, we get the following.
Note: when $m \leq n$ (a well-posed data matrix), $k = m$, hence:

$ U_{k}^{'}  Z = U_{k}^{'}* (\sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'})$
$\implies U_{k}^{'}  Z =  s_{k}V_{k}^{'} \approx  0$ (pre-multiplying with the last left singular vector; the other terms vanish by orthonormality, and $s_{k} \approx 0$).
Now you may see where the TLS algorithm discussed in my previous blog originates.

If the data matrix is ill-posed, $m > n$, then $k = n < m$, hence

$ U_{m}^{'}  Z = U_{m}^{'}* (\sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'})$
$\implies U_{m}^{'}  Z =  0$ (pre-multiplying with the last left singular vector, which is orthogonal to $U_{1}, \ldots, U_{k}$).
Now, if there are multiple constraints, we take the last few columns of the $U$ matrix (those corresponding to the approximately zero singular values); this set can now be called a regression set.
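
Putting the pieces together, a sketch of extracting the regression (constraint) set, continuing the toy $Z$ and $m$ assumed in the earlier sketch:

    [U,S,V] = svd(Z);                % Z is the mean-centred m x n data matrix

    nconstr = m - rank(Z);           % number of (approximately) zero singular values
    Ureg = U(:, m-nconstr+1:end);    % last few columns of U form the regression set

    norm(Ureg' * Z)                  % ~ 0 : each constraint vector annihilates the data
    Ureg'                            % each row is one linear relation among the variables
                                     % (here proportional to [1 2 -1], i.e. x1 + 2*x2 - x3 = 0,
                                     % up to sign and scaling)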

Now, as mentioned, the $Z$ matrix contains each and every data point as a column; if the data matrix you are using is laid out the other way around (data points as rows), then you can instead post-multiply with the last few columns of $V$ corresponding to the smallest singular values.

Matlab Function

Since we have seen a brief intro to the SVD, I shall now introduce the Matlab function that will do the decomposition for you:
Matlab Function : [U,S,V] = svd(A).

The above function can always be used to perform the SVD and thereby obtain a regression set.
Since it was already mentioned that $k = min(m,n)$, consider the case where $m \leq n$: the extra columns of the $V$ matrix (i.e. columns $k+1$ to $n$) are then computed unnecessarily, thereby increasing the computational cost.

Hence we can use a specialized version of the same function under certain conditions:

[U,S,V] = svd(A,'econ')

This version ignores the above-mentioned extra columns of the $V$ matrix and returns $S$ as a square matrix.
Note (assuming the blog convention for the data matrix):
           The 'econ' version should only be used when $m \leq n$.
           If $m > n$, 'econ' should not be used, since it would then also drop the last columns of $U$, which are exactly the vectors needed for the regression set.
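
A quick size comparison illustrating this note (toy sizes assumed):

    % Well-posed case, m <= n : 'econ' drops only the unneeded columns of V.
    Z1 = randn(4,100);
    [U1,S1,V1] = svd(Z1,'econ');   % U1 is 4x4, S1 is 4x4, V1 is 100x4 -- U1 is still complete

    % Ill-posed case, m > n : 'econ' would also truncate U and discard its last columns,
    % which are exactly the regression-set vectors, so the full SVD is used instead.
    Z2 = randn(100,4);
    [U2,S2,V2] = svd(Z2);          % U2 is 100x100, S2 is 100x4, V2 is 4x4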


