K-Means Algorithm
Supervised learning algorithms are those that require prior labeling, i.e. the existence of a target variable. For example, in my previous blog on the Perceptron algorithm, each and every data point had a label attached to it. Another example would be predicting the price of an object given certain features, by running a supervised algorithm on objects whose prices are already known. Long story short, supervised algorithms are those that require a tag on each and every data point.
Unsupervised algorithms are those that do not require such a tagging of data points to make a prediction. We will look at a specific algorithm of this category: K-means clustering.
INTRODUCTION
This algorithm is a clustering algorithm that segregates data points based on how they group together in the n-dimensional space: points that are closer together form groups or clusters. To get a visual representation, let us look at a 3-D plot.
- MatLab code 3-D Graph
The above 3-D plot shows the segregation of students based on their scores (in this case three test scores are taken into consideration). Although there aren't clear clusters, the algorithm forms its own.
If you look closer at the 3-D plot, you will see 3 X marks apart from all those circles. These X marks are what define the territory of the data points, i.e. the data points closest to an X mark belong to that particular X mark (hence a particular category). So the question is: how do we find those X marks?
Why determine those X marks? It is necessary to find the position of those X marks to categorize the data points.
So let us look into the algorithm.
Algorithm
1. Initially, randomly assign positions to the K vertices in the n-dimensional space.
2. Form clusters by assigning each data point to the K vertex closest to it.
3. Reassign each K vertex to the mean of the data points assigned to it.
4. If there are no changes in the K vertices, break out of the loop.
5. Else, go to step 2. (A MATLAB sketch of these steps is given below.)
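The following is a minimal MATLAB sketch of these steps, written for this post; the function and variable names (kmeans_sketch, X, verts, labels) are my own choices, and it assumes a MATLAB version with implicit expansion (R2016b or later). Save it as kmeans_sketch.m, where X is an n-by-d matrix of data points and K is the number of vertices.

% A minimal K-means sketch. X: n-by-d data matrix, K: number of vertices.
function [verts, labels] = kmeans_sketch(X, K)
    n = size(X, 1);
    % Step 1: initialize the K vertices at randomly chosen data points.
    verts = X(randperm(n, K), :);
    labels = zeros(n, 1);
    while true
        oldLabels = labels;
        % Step 2: assign every data point to its closest vertex.
        for i = 1:n
            dists = sum((verts - X(i, :)).^2, 2);  % squared distances to each vertex
            [~, labels(i)] = min(dists);
        end
        % Step 4: if no assignment changed, the vertices won't move either.
        if isequal(labels, oldLabels)
            break;
        end
        % Step 3: move each vertex to the mean (centroid) of its cluster.
        for k = 1:K
            members = X(labels == k, :);
            if ~isempty(members)  % keep a vertex in place if its cluster is empty
                verts(k, :) = mean(members, 1);
            end
        end
    end
end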
The above algorithm should give you a clear idea of how it functions, but there may be questions of convergence and performance. About convergence: the algorithm always converges, a fact that can be easily proved (see below). The performance is decent, but it depends heavily on the initial values of the K vertices; given the simplicity of the algorithm, the results can be disastrous if the initial K vertices are not assigned properly!
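Since the result depends on the initial vertices, one common remedy is to run the algorithm several times from different random initializations and keep the clustering with the lowest cost. A sketch, reusing the kmeans_sketch function from above (the 10 restarts is an arbitrary choice):

bestCost = inf;
for trial = 1:10
    [verts, labels] = kmeans_sketch(X, K);
    cost = sum(sum((X - verts(labels, :)).^2));  % total squared distance to assigned vertices
    if cost < bestCost
        bestCost = cost;
        bestVerts = verts;
        bestLabels = labels;
    end
end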
It should also be noted that in every iteration the sum of squared Euclidean distances of the data points from their respective K vertices (a.k.a. the cost) monotonically decreases, i.e. the algorithm converges.
Proof of Convergence
1) Each data point is assigned to its closest vertex, hence the cost reduces.
2) Also, since each vertex is reassigned to the centroid of the data points closest to it, and the centroid minimizes the sum of squared distances to those points, the cost reduces further.
Hence we have proved that the cost monotonically decreases. Also, since the data set is finite, there are only a finite number of possible clusterings ($latex K^{n}$, where K is the number of vertices and n is the number of data points); hence the algorithm converges in a finite number of steps.
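To make the argument concrete, the cost that both steps decrease can be written out explicitly (the notation below is mine, not from the original post):
$latex J = \sum_{i=1}^{n} \left\| x_{i} - \mu_{c(i)} \right\|^{2} $
where $latex x_{i}$ is the i-th data point, $latex \mu_{k}$ is the position of the k-th vertex, and $latex c(i)$ is the index of the vertex to which $latex x_{i}$ is assigned. Step 2 of the algorithm minimizes J over the assignments $latex c(i)$ for fixed vertices, and step 3 minimizes J over the vertices for fixed assignments (the mean of a set of points minimizes the sum of squared distances to them), so J can never increase.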
Further information
At times it is necessary to choose the right number of vertices for clustering; in such cases the number can be chosen by looking at a plot of Cost vs K.
- Cost vs K
Looking at the above graph, we can see an elbow. Typically the number of vertices (K) at which the elbow occurs is taken as the ideal value. There are situations in which such an elbow isn't apparent; in such cases random heuristics should be used.
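Here is a sketch of how such a Cost vs K curve can be produced, again reusing kmeans_sketch from above (the range 1 to 10 is an arbitrary choice):

maxK = 10;
costs = zeros(maxK, 1);
for K = 1:maxK
    [verts, labels] = kmeans_sketch(X, K);
    costs(K) = sum(sum((X - verts(labels, :)).^2));  % cost at this K
end
plot(1:maxK, costs, '-o');
xlabel('K (number of vertices)');
ylabel('Cost');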
MatLab Code
This is a sample MATLAB code that implements this algorithm; you should get the same graph as above. Try tinkering with it. The program shows how the vertices move towards the least-cost configuration iteration by iteration (you need to press Enter to move from one iteration to the next).
The data set used is from the KDD Cup. Code Link
Note: The convergence condition doesn't imply the attainment of the least-cost configuration; the algorithm can settle into a locally optimal clustering.
Things to be tried by you: 1) Find an algorithm for the least-cost configuration.