visualdatatools.com Forum Index visualdatatools.com
Discussion for DataTank and DataGraph
 

k-means grouping of data

 
ijstokes



Joined: 27 Jan 2010
Posts: 24

Posted: Fri Feb 26, 2010 5:52 pm    Post subject: k-means grouping of data

It would be brilliant to have a k-means grouping function. The user could select any number of columns as the input to the k-means fit, specify the value of "k", and then get a synthesized column showing each row's group, possibly with the geometric distance from the group's mean vector. On top of that, it is straightforward to calculate the hyper-ellipsoid for each group from the selected columns (just use the eigenvectors and eigenvalues of the group's covariance matrix), which for 2D graphs can then be displayed as a flattened ellipse.
_________________
Ian Stokes-Rees
Harvard Medical School
David
Site Admin


Joined: 25 Nov 2006
Posts: 1940
Location: Chapel Hill, NC

Posted: Fri Feb 26, 2010 10:51 pm    Post subject:

I'm not familiar with this. Can you point me to a reference?

David
ijstokes



Joined: 27 Jan 2010
Posts: 24

Posted: Fri Feb 26, 2010 11:09 pm    Post subject: k-means reference

Wikipedia has a pretty good and concise description. K-means is really easy to implement, and very intuitive:

http://en.wikipedia.org/wiki/K-means_clustering

I think you'd want to add a function that takes an arbitrary number of parameters (anywhere from 1 to N, where N is the number of dimensions), where each parameter is a column name or index. It would create a new column holding an arbitrary identifier for each row (a letter or an integer) that says which group the row is in.
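To make the idea concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in pure Python. It is only an illustration, not how DataGraph would have to do it: the function name `kmeans`, its parameters, and the choice of random starting rows are all assumptions of mine. Each row is a tuple built from the selected columns, and the returned `labels` list is what would go into the new column.

```python
import math
import random


def kmeans(rows, k, max_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (labels, centroids, distances): the group index of each row,
    the mean vector of each group, and each row's Euclidean distance to
    the mean of its own group.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct rows picked at random.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # an empty group keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    distances = [math.dist(row, centroids[lab]) for row, lab in zip(rows, labels)]
    return labels, centroids, distances


# Two selected columns, two obvious groups.
rows = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids, distances = kmeans(rows, k=2)
```

Which group ends up as 0 and which as 1 depends on the random start, so the identifiers really are arbitrary, as described above.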

The part people often miss out on is the ease of getting a hyper-ellipsoid (or simply an ellipse, in 2D) after K-means clustering has been done (or for any data set, for that matter).

I haven't done work on this in many years, but from memory, the process is really straightforward:

1. Calculate the eigenvalues and eigenvectors of the covariance matrix of the data in question. An SVD of the mean-centered data will give this.

2. These values can be used to normalize and re-map the data, which is effectively a rotation and a scale. The rotation matrix (the eigenvectors) gives the major and minor axes of the ellipse (hyper-ellipsoid), and the scale gives the radius along each axis (the square root of the eigenvalue, i.e. the standard deviation along that axis).

3. Using the rotation matrix and eigenvalues, you can directly infer the parameters of the ellipse.

4. Sometimes it is better to draw a smaller or bigger ellipse by multiplying every radius by a constant factor, just for visual purposes.
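For the 2D case the steps above can be sketched without any linear algebra library, because a symmetric 2x2 covariance matrix has closed-form eigenvalues. Note that this swaps the SVD for that closed form, so it only covers the flat ellipse and not the general hyper-ellipsoid. The function name `covariance_ellipse` and the `scale` parameter (the "smaller or bigger" factor from step 4) are my own assumptions.

```python
import math


def covariance_ellipse(points, scale=1.0):
    """Return (center, semi_major, semi_minor, angle) for 2D points.

    angle is the direction of the major axis in radians, measured
    counter-clockwise from the x axis. Needs at least two points.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    lam_major = half_trace + spread
    lam_minor = max(half_trace - spread, 0.0)  # clamp rounding noise
    # Direction of the eigenvector that belongs to the larger eigenvalue.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Radii are square roots of the eigenvalues, times the visual scale.
    semi_major = scale * math.sqrt(lam_major)
    semi_minor = scale * math.sqrt(lam_minor)
    return (mx, my), semi_major, semi_minor, angle


# One ellipse for a group of points; scale=2.0 draws it at two
# standard deviations along each axis.
group = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
center, a, b, theta = covariance_ellipse(group, scale=2.0)
```

After k-means you would call this once per group, passing only the rows that carry that group's identifier.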
_________________
Ian Stokes-Rees
Harvard Medical School
All times are GMT - 3 Hours