
visualdatatools.com Discussion for DataTank and DataGraph

ijstokes
Joined: 27 Jan 2010 Posts: 24

Posted: Fri Feb 26, 2010 5:52 pm Post subject: kmeans grouping of data 


It would be brilliant to have a kmeans grouping function. The user could select any number of columns to act as the input to the kmeans fit, specify the value of "k", and then get a synthesized column showing the group each row falls in, possibly with a second column giving the geometric distance from the group's mean vector. On top of that, it is trivial to calculate the hyperellipsoid for each group over the selected columns (just use the eigenvectors and eigenvalues of the group's covariance matrix), which for 2D graphs can then be displayed as a flattened ellipse.
_________________
Ian StokesRees
Harvard Medical School 



David Site Admin
Joined: 25 Nov 2006 Posts: 1949 Location: Chapel Hill, NC

Posted: Fri Feb 26, 2010 10:51 pm Post subject: 


I'm not familiar with this. Can you point me to a reference?
David 



ijstokes
Joined: 27 Jan 2010 Posts: 24

Posted: Fri Feb 26, 2010 11:09 pm Post subject: kmeans reference 


Wikipedia has a pretty good and concise description. Kmeans is really easy to implement, and very intuitive:
http://en.wikipedia.org/wiki/Kmeans_clustering
I think you'd want to add a function that takes an arbitrary number of parameters (anywhere from 1 to N, where N is the number of dimensions), each of which is a column name or index. It would create a new column holding an arbitrary identifier (a letter or an integer) that says which group each row belongs to.
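To make the request concrete, here is a minimal sketch of the standard kmeans iteration (Lloyd's algorithm) in plain Python. It is not DataGraph code: the function name `kmeans`, its parameters, and the random choice of starting centroids are all assumptions made for illustration. It returns the group label for each row plus the distance from each row to its group's mean vector, which are the two synthesized columns described above.

```python
import math
import random

def kmeans(rows, k, max_iter=100, seed=0):
    """Hypothetical helper: group rows (equal-length numeric tuples) into k clusters."""
    rng = random.Random(seed)
    # Assumption: start from k rows picked at random as the initial centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # no row changed group, so the fit has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # an empty group keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Distance from each row to the mean vector of its own group.
    distances = [math.dist(row, centroids[lab]) for row, lab in zip(rows, labels)]
    return labels, distances, centroids

rows = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, distances, centroids = kmeans(rows, k=2)
```

The integer labels are arbitrary, so which cluster ends up as 0 and which as 1 depends on the starting centroids. The result can also depend on the start for less cleanly separated data, so a real implementation would probably want several restarts.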
The part people often miss out on is the ease of getting a hyperellipsoid (or simply an ellipse, in 2D) after Kmeans clustering has been done (or for any data set, for that matter).
I haven't done work on this in many years, but from memory, the process is really straightforward:
1. Calculate the eigenvalues and eigenvectors of the covariance matrix of the data in question. An SVD of the mean-centered data will give this.
2. These values can be used to normalize and remap the data, which effectively does a rotation and a scale. The rotation matrix (the eigenvectors) gives the major and minor axes of the ellipse (hyperellipsoid), and the scale gives the radius along each axis (the square root of the corresponding eigenvalue).
3. Using the rotation matrix and eigenvalues, you can directly infer the parameters of the ellipse.
4. Sometimes it is better to draw a smaller or bigger ellipse, just for visual purposes.
_________________
Ian StokesRees
Harvard Medical School 
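
The four steps above can be sketched for the 2D case in plain Python. This is only an illustration under stated assumptions: the function name `covariance_ellipse` and the `n_std` scale factor are hypothetical, and instead of an SVD it uses the closed-form eigen-decomposition of the symmetric 2x2 covariance matrix, which gives the same axes for 2D data. The semi-axis lengths are taken as `n_std` times the square root of each eigenvalue, so `n_std` is the knob for drawing a smaller or bigger ellipse.

```python
import math

def covariance_ellipse(points, n_std=2.0):
    """Hypothetical helper: return (center, semi_major, semi_minor, angle) for 2D points.

    angle is the direction of the major axis in radians, measured from the x axis.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean_var = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    lam_major = mean_var + spread
    lam_minor = max(mean_var - spread, 0.0)  # guard against tiny negative round-off
    # Direction of the eigenvector that belongs to the larger eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    semi_major = n_std * math.sqrt(lam_major)
    semi_minor = n_std * math.sqrt(lam_minor)
    return (mx, my), semi_major, semi_minor, angle

center, a, b, angle = covariance_ellipse(
    [(-2, 0), (2, 0), (0, -1), (0, 1)], n_std=1.0
)
```

To get one ellipse per group, the same function would be called once on the rows of each kmeans group. For more than two dimensions a general eigen-solver or SVD routine would be needed in place of the closed form.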





