Contributor: Matthew Piekenbrock

Estimating the Empirical Cluster Tree

Mentors: Mikhail Belkin, Michael Hahsler
Organization: R project for statistical computing

The aim of this project is to provide a standalone, scalable, and extensible R package that unifies existing methodologies for estimating the empirical cluster tree.

In 1975, John Hartigan established a statistical definition of a high-density cluster. High-density clusters are defined on a population with density f in r dimensions to be the maximal connected sets of form { x : f(x) >= lambda }, for any lambda > 0. Varying this density level lambda across all possible density thresholds forms a hierarchy of density level sets known as the cluster tree.

In 1981, the asymptotic behavior of various linkage criteria were studied by by Hartigan in the context of the cluster tree. During this effort, a notion of consistency---now often referred to as "Hartigan consistency"---was introduced. In the same seminal paper, single-linkage was proven by Hartigan to be inconsistent for dimensions r >= 2.

It was not until 2010 that the first computationally tractable and provably consistent algorithm for estimating the cluster tree for r >= 2 was proven. These results have undoubtedly played a large role in rekindling a number of recent research efforts related to the cluster tree.