Lattice Insight, LLC
Mathematical Statistics Consulting

Choosing the number of clusters for a cluster analysis

5/13/2015

[Figure: Calinski-Harabasz pseudo-F statistic plotted against the number of clusters k]
It's the question without an obvious answer in cluster analysis: into how many clusters should we group our data? You are lucky if the data groups into a small number of visible, compact clusters that are clearly distinct from one another. Usually the choice is ambiguous, often depending on the clustering method used.

I am a big fan of Ward's method of clustering. The dendrograms associated with Ward's method lend themselves to fairly obvious recommendations for the number of clusters, because the higher branches tend to be the longest, especially in comparison to dendrograms arising from other agglomerative hierarchical clustering methods. (In technical terms, dendrograms arising from Ward's method tend to have larger agglomerative coefficients, closer to 1, compared with complete linkage [2nd best], average linkage [3rd best], and single linkage [the worst].)
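To make the agglomerative coefficient concrete, here is a minimal sketch that computes it from a SciPy linkage matrix for the four linkage methods mentioned above. The helper name agglomerative_coefficient and the synthetic data are assumptions for illustration, and SciPy's merge heights may be scaled differently from those in other software, so the exact values need not match other packages.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def agglomerative_coefficient(Z):
    """Mean of 1 - (height of each point's first merge) / (final merge height).

    Z is a SciPy linkage matrix. Hypothetical helper, not a SciPy function.
    """
    n = Z.shape[0] + 1
    first_merge_height = np.empty(n)
    for left, right, height, _size in Z:
        for node in (int(left), int(right)):
            if node < n:  # indices below n are original observations
                first_merge_height[node] = height
    return float(np.mean(1.0 - first_merge_height / Z[-1, 2]))


# Synthetic stand-in data (an assumption): three Gaussian blobs in the plane.
rng = np.random.default_rng(1)
centers = ([0.0, 0.0], [8.0, 0.0], [0.0, 8.0])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2)) for c in centers])

coefficients = {
    method: agglomerative_coefficient(linkage(X, method=method))
    for method in ("ward", "complete", "average", "single")
}
for method, value in coefficients.items():
    print(f"{method:>8}: {value:.3f}")
```

A coefficient near 1 means that most points join a cluster at a height that is small relative to the final merge, which is what produces long upper branches in the dendrogram.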

The Calinski-Harabasz pseudo-F statistic tries to explain the most variation between clusters using the fewest clusters, employing a formula that suspiciously resembles an F-statistic. When the data is multivariate normal and the clustering method is Ward's, the pseudo-F statistic is in fact an F-statistic!
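For reference, the pseudo-F statistic for a partition of n observations into k clusters can be written as follows. The symbols (B_k for the between-cluster sum of squares, W_k for the within-cluster sum of squares, C_j for the j-th cluster with n_j members and centroid \bar{x}_j, and \bar{x} for the grand mean) are notation introduced here, not taken from the post.

```latex
F_{\text{pseudo}}(k) = \frac{B_k / (k - 1)}{W_k / (n - k)},
\qquad
B_k = \sum_{j=1}^{k} n_j \,\lVert \bar{x}_j - \bar{x} \rVert^2,
\qquad
W_k = \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2 .
```

The numerator and denominator are each a sum of squares divided by its degrees of freedom, which is the same shape as the F-statistic in a one-way analysis of variance.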

Typically we try to maximize the pseudo-F statistic. For one dataset I studied recently, the maximum occurred at k = 5 clusters, as shown in the graph above, with k = 8 a close second. The dendrogram reveals that 5 clusters and 8 clusters are natural choices for this data.
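The search over k can be sketched as below: cut a Ward dendrogram at each candidate k, compute the pseudo-F statistic, and keep the k with the largest value. The data here is synthetic (three well-separated blobs), not the dataset discussed above, and the helper pseudo_f is a hypothetical name written out by hand for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def pseudo_f(X, labels):
    """Calinski-Harabasz pseudo-F: (B / (k - 1)) / (W / (n - k))."""
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    grand_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - grand_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Synthetic stand-in data (an assumption): three Gaussian blobs in the plane.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 2)) for c in centers])

Z = linkage(X, method="ward")  # Ward's method on Euclidean distances

scores = {}
for k in range(2, 11):
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters
    scores[k] = pseudo_f(X, labels)

best_k = max(scores, key=scores.get)
print("pseudo-F by k:", {k: round(v, 1) for k, v in scores.items()})
print("k maximizing pseudo-F:", best_k)
```

With blobs this well separated, the maximum should fall at k = 3. On real data the curve is usually flatter, and a close runner-up such as the k = 8 mentioned above is worth checking against the dendrogram.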

But when the pseudo-F statistic is in fact an F-statistic, an alternative is to determine the p-value for the F-values above, and choose the number of clusters that minimizes p, giving us the most significant value of k. In the graph below, the partitioning of variance among 8 clusters is more significant than among 5 clusters. (Horizontal lines on this graph of log p show the location of p = .05 [upper] and p = .01 [lower].)
[Figure: log p plotted against the number of clusters k]
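A minimal sketch of the p-value comparison follows. It assumes the pseudo-F value is referred to an F distribution with k - 1 and n - k degrees of freedom, each multiplied by an optional dims factor; the post does not state the degrees of freedom used, so that choice is an assumption. The function name pseudo_f_log_p, the sample size, and the two F-values are illustrative only and are not taken from the dataset in the post. Working with log p avoids underflow when p is extremely small.

```python
import math

from scipy.stats import f as f_dist


def pseudo_f_log_p(f_value, n, k, dims=1):
    """Natural log of the upper-tail p-value of f_value.

    Assumed degrees of freedom: dims * (k - 1) and dims * (n - k).
    Hypothetical helper for illustration.
    """
    dfn = dims * (k - 1)
    dfd = dims * (n - k)
    return float(f_dist.logsf(f_value, dfn, dfd))


# Illustrative values only: a slightly larger pseudo-F at k = 5 than at k = 8.
n_obs = 200
candidates = {5: 40.0, 8: 35.0}

log_p = {k: pseudo_f_log_p(F, n_obs, k) for k, F in candidates.items()}
most_significant_k = min(log_p, key=log_p.get)

for k, value in log_p.items():
    print(f"k = {k}: log p = {value:.1f}")
print("reference lines: log(.05) =", round(math.log(0.05), 2),
      "and log(.01) =", round(math.log(0.01), 2))
print("most significant k:", most_significant_k)
```

Because the degrees of freedom change with k, the k with the largest pseudo-F need not be the k with the smallest p. Since the clusters are themselves chosen from the data, the p-values are best read as a way of ranking candidate values of k.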
    Author

    Hal M. Switkay, Ph.D. is a professional mathematician and statistician.

Website contents copyright Lattice Insight, LLC, 2015. All rights reserved.