On Seeking Consensus Between Document Similarity Measures

What is it about?

This paper investigates the application of consensus clustering and meta-clustering to the set of all possible partitions of a data set. We show that when a "complement" of the Rand Index is used as the measure of distance between partitions, the total-separation partition, which puts each element in a separate set, is chosen.
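The effect can be checked by brute force on a very small set. The following Python sketch is only an illustration under stated assumptions, not the paper's own procedure: it assumes the "complement" of the Rand Index is 1 minus the Rand Index (the fraction of element pairs on which two partitions disagree), and it assumes consensus means the median partition, that is, the one with the smallest total distance to all partitions in the universe. All function names are hypothetical.

```python
from itertools import combinations


def all_partitions(items):
    """Yield every set partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in all_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partial


def co_clustered_pairs(partition):
    """Return the set of element pairs that share a block."""
    return {frozenset(pair)
            for block in partition
            for pair in combinations(block, 2)}


def cluster_difference(p, q, n):
    """Assumed definition: 1 - Rand Index, i.e. the fraction of pairs
    that are together in one partition and apart in the other."""
    disagreements = len(co_clustered_pairs(p) ^ co_clustered_pairs(q))
    return disagreements / (n * (n - 1) // 2)


def consensus_over_all_partitions(items):
    """Assumed consensus rule: the partition minimising the summed
    distance to every partition of `items` (a median partition)."""
    universe = list(all_partitions(items))
    n = len(items)
    return min(
        universe,
        key=lambda p: sum(cluster_difference(p, q, n) for q in universe),
    )


if __name__ == "__main__":
    print(consensus_over_all_partitions(["a", "b", "c", "d"]))
```

For four elements the universe holds 15 partitions, and the sketch selects the partition of four singletons. The reason is visible pair by pair: any two elements are apart in 10 of the 15 partitions and together in only 5, so keeping every pair apart minimises the total disagreement. Exhaustive enumeration is feasible only for tiny sets, since the number of partitions grows very quickly.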

Why is it important?

As the number of clustering algorithms applicable to the same data grows, and their outputs may differ substantially, methodologies for reconciling them, such as meta-clustering and consensus clustering, are under development. In this paper we demonstrate that both consensus clustering and meta-clustering, using Cluster Difference (derived from the Rand Index) as the measure of distance between partitions, point to the partition that places each element in a separate set as the best compromise when applied to the universe of all possible partitions. It is quite easy to invent clustering algorithms that deliver any desired clustering of the same data set. In the space of all partitions, however, both techniques lead us astray: meta-clustering yields a structure of partitions that has nothing to do with the data, and consensus clustering delivers the most trivial consensus, which likewise has nothing to do with the data. This suggests that a user performing a clustering task must have at least an approximate picture of the geometry of the data space. Only then can these techniques help in choosing an appropriate compromise clustering.

The following have contributed to this page:

Mieczysław Kłopotek