Introduction
The origin of PCA is analogous to TLS (Total Least Squares), which is used for regression, but PCA extends to data compression (reducing the dimensionality of the data), data imputation (filling in missing data), Iterative PCA (IPCA) (used for estimating the errors in the data), Non-Negative Matrix Factorisation (NMF) (used for extracting clean data from noisy data, as in the famous cocktail party problem), Network Component Analysis (NCA) (used mainly in bioinformatics) and, finally, Kernel PCA (KPCA) (to model any physical data). PCA is a powerful tool that can handle all of the above provided computational cost is not a concern; that is, assuming we have enough computational power, we can devise an algorithm for each of the tasks mentioned.
Introduction to SVD
PCA uses a mathematical tool called the Singular Value Decomposition (SVD), which is available in Matlab. It is analogous to other factorisation techniques such as the QR and eigenvalue decompositions.
Let us define a matrix $A$; its (full) SVD is given by:
$A_{m \times n} = U_{m \times m} S_{m \times n} V_{n \times n}^{'}$
The properties of the matrices: $U$ and $V$ contain the left and right singular vectors respectively, and the columns of each are orthonormal. $S$ is a diagonal matrix consisting of the singular values arranged in descending order (generically, the first few singular values are non-zero, followed by singular values that are zero or approximately zero).
(i.e. $s_{1} \geq s_{2} \geq s_{3} \geq \cdots \geq s_{r} \geq s_{r+1} \approx \cdots \approx 0$; here the singular values after the $r$th value are approximately equal to 0.)
It can now be easily seen that :
$AA^{'} = (USV^{'} )*(VS^{'}U^{'} ) $
$\implies AA^{'} = U(SS^{'})U^{'}$
Hence the columns of $U$ are the eigenvectors of the matrix $AA^{'}$, and the diagonal entries of $SS^{'}$ are the corresponding eigenvalues. Note also that if $A$ is a data matrix containing each data point as a column vector (i.e. the matrix is Variable x No. of datapoints), then the matrix $AA^{'}$ acts, in effect, like a covariance matrix across the variables.
Similarly ,
$A^{'}A = (VS^{'}U^{'} )*(USV^{'} ) $
$\implies A^{'}A = V(S^{'}S)V^{'}$
Hence the columns of $V$ are the eigenvectors of the matrix $A^{'}A$, and the diagonal entries of $S^{'}S$ are the corresponding eigenvalues. Note also that if $A$ is a data matrix containing each data point as a column vector (i.e. the matrix is Variable x No. of datapoints), then the matrix $A^{'}A$ acts, in effect, like a covariance matrix across the datapoints.
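As a quick sanity check (a minimal sketch on an arbitrary random matrix, not tied to any particular data set), the two relations above can be verified numerically in Matlab:

A = randn(4, 6);                     % arbitrary 4 x 6 example matrix
[U, S, V] = svd(A);                  % full SVD: U is 4x4, S is 4x6, V is 6x6
disp(norm(U'*(A*A')*U - S*S'))       % ~ 0 : columns of U diagonalise AA', eigenvalues on diag(SS')
disp(norm(V'*(A'*A)*V - S'*S))       % ~ 0 : columns of V diagonalise A'A, eigenvalues on diag(S'S)
disp(norm(sort(eig(A*A'),'descend') - sort(diag(S).^2,'descend')))   % eigenvalues = squared singular values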
Now it can also be seen that
$A = \sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'}$
where $U_{i}$ and $V_{i}$ are the $i$th columns of the $U$ and $V$ matrices respectively, $s_{i}$ is the $i$th singular value, and $k = \min(m,n)$.
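To see this rank-one expansion in action, here is a short sketch (again on an arbitrary matrix) that rebuilds $A$ term by term:

A = randn(5, 3);
[U, S, V] = svd(A);
k = min(size(A));                    % k = min(m, n)
Ak = zeros(size(A));
for i = 1:k
    Ak = Ak + U(:, i) * S(i, i) * V(:, i)';   % add the ith rank-one term
end
disp(norm(A - Ak))                   % ~ 0 : the k terms reproduce A exactly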
Now that adequate detail has been given about the SVD and the properties of the decomposed matrices, I hope you can formulate for yourself the method required to perform the decomposition (i.e. the SVD).
Algorithm
Setup of the problem:
For all purposes, let us define some basic conventions that will be used throughout the series of blogs under MVDA.
Let $A_{m \times n}$ be the data matrix (Variable x No. of Datapoints).
Define the matrix $Z = A - \bar{A}$, where $\bar{A}$ holds the mean of the data matrix taken per variable (i.e. per row).
For all purposes we shall now be dealing with the $Z$ matrix.
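For concreteness, here is a minimal sketch of the mean-centring step under the above convention (the data in it is just a random placeholder):

A    = randn(3, 100);        % 3 variables, 100 data points (placeholder data)
Abar = mean(A, 2);           % 3 x 1 vector of per-variable (row) means
Z    = A - Abar;             % mean-centred data (implicit expansion, R2016b+; use bsxfun on older versions)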
Now some inferences we shall make about the $Z$ matrix:
First, let us assume that the $Z$ matrix contains no errors, i.e. the system follows an exact linear relation; then the number of constraints = $m$ - (row rank of the matrix $Z$).
Algorithm Approach :
Let us now perform an SVD on $Z$.
If we take the left singular vector corresponding to the smallest singular value and pre-multiply $Z$ by it, we get the following:
Note: when $m \leq n$ (a well-posed data matrix), $k = m$; hence:
$ U_{k}^{'} Z = U_{k}^{'}* (\sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'})$
$\implies U_{k}^{'} Z = s_{k}V_{k}^{'} \approx 0$ (pre-multiplying with the vector; the product is approximately zero because $s_{k}$, the smallest singular value, is approximately zero).
Now you may see where the TLS algorithm discussed in my previous blog originated.
If the data matrix is ill-posed ($m > n$), then $m > k$, and hence:
$ U_{m}^{'} Z = U_{m}^{'}* (\sum _{i = 1} ^{k} U_{i}s_{i}V_{i}^{'})$
$\implies U_{m}^{'} Z = 0$ (pre-multiplying with the vector; the product is exactly zero because $U_{m}$ is orthogonal to $U_{1}, \ldots, U_{k}$).
Now, if there are multiple constraints, we simply take the last few columns of the $U$ matrix; these columns can now be called a regression set.
As mentioned, the $Z$ matrix holds each data point as a column; if the data matrix you are using is oriented the other way around, you can instead post-multiply with the last few columns of $V$ corresponding to the smallest singular values.
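Putting the steps together, here is a minimal sketch of the whole procedure, using Matlab's svd function (introduced formally in the next section); the variables, sizes and the linear relation below are made up purely for illustration:

n = 200;                              % number of data points (assumed)
X = randn(2, n);                      % 2 independent variables
A = [X; 1.5*X(1,:) - 0.5*X(2,:)];     % 3rd variable obeys an exact linear relation -> 1 constraint
m = size(A, 1);
Z = A - mean(A, 2);                   % mean-centre along each row
[U, S, ~] = svd(Z);                   % full SVD of the centred data
s = diag(S);
r = sum(s > 1e-10 * s(1));            % numerical row rank of Z (tolerance is a choice)
nConstraints = m - r;                 % here 3 - 2 = 1
Ureg = U(:, r+1:end);                 % regression set: last (m - r) column(s) of U
disp(norm(Ureg' * Z))                 % ~ 0 : each column encodes one linear constraint among the variables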
Matlab Function
Since we have seen a brief introduction to SVD, I shall now introduce the Matlab function that does the decomposition for you:
Matlab Function : [U,S,V] = svd(A).
The above function can always be used to perform the SVD and thereby obtain a regression set.
Since it was already mentioned that $k = \min(m,n)$, let's consider a case where $m \leq n$; then the extra columns of the $V$ matrix (i.e. columns $k+1$ to $n$) are computed unnecessarily, which wastes computation.
Hence we can use a specialized version of the same function under certain conditions:
[U,S,V] = svd(A,'econ')
This version simply skips the above-mentioned extra columns of the $V$ matrix and also returns $S$ as a square matrix.
Note: assuming the blog convention for the data matrix:
The econ version should only be used when $m \leq n$.
If $m \geq n$, econ shouldn't be used, because it would then drop the trailing columns of $U$ that form the regression set.
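For completeness, a small sketch comparing the two calls on a wide matrix ($m \leq n$); only the sizes matter here:

A = randn(3, 50);                 % 3 variables, 50 data points
[U,  S,  V ] = svd(A);            % U: 3x3, S: 3x50, V: 50x50
[Ue, Se, Ve] = svd(A, 'econ');    % U: 3x3, S: 3x3,  V: 50x3  (the extra 47 columns of V are skipped)
disp(norm(U*S*V' - Ue*Se*Ve'))    % ~ 0 : both forms reconstruct A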