
Jack Penzer:

Posting some thoughts here as we've tackled similar challenges over the years:

"Clustering workflows that better address noise. I know HDBSCAN is useful here. I often still lean towards a first K-Means step, but then disregard clusters with a few number of papers as noise, because they might not actually coherently be a part of a coherent grouping despite their semantic similarity."

I find the main issue with HDBSCAN is that areas of low density are treated as noise, but in unstructured data it is often the highest-density areas that are actually noise (slop, bot spam, and the like). Those regions are virtually guaranteed to be labelled as clusters so long as there are enough nearest neighbours to satisfy `min_cluster_size`.

Conversely, kMeans assumes nothing is noise, when in practice there is always noise.

So I prefer to run both kMeans and HDBSCAN over the data independently, kMeans with a low-ish k, and then see how the HDBSCAN clusters fall within the kMeans clusters. I find it tends to be the case that:

- HDBSCAN clusters have high precision, low recall

- kMeans clusters have high recall, low precision

- Individual HDBSCAN clusters tend to fall ~100% within a single kMeans cluster

- Similar HDBSCAN clusters fall into the same kMeans cluster

If we can label the HDBSCAN clusters appropriately, we should be able to use those labels to label the enclosing kMeans clusters. Eventually we end up with labelled high-level clusters/topics (kMeans) and specific clusters/subtopics (HDBSCAN).
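As a rough illustration, here's a minimal sketch of that cross-tabulation step, assuming precomputed document embeddings in `X`. The file path, `n_clusters=10`, and `min_cluster_size=25` are illustrative values, not anything from the post:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, HDBSCAN  # HDBSCAN needs scikit-learn >= 1.3

# X: (n_docs, dim) embedding matrix, assumed computed upstream.
X = np.load("embeddings.npy")  # illustrative path

# Run both clusterers independently: kMeans with a low-ish k for broad
# topics, HDBSCAN for tight, high-precision subtopics.
kmeans_labels = KMeans(n_clusters=10, n_init="auto", random_state=42).fit_predict(X)
hdbscan_labels = HDBSCAN(min_cluster_size=25).fit_predict(X)

df = pd.DataFrame({"kmeans": kmeans_labels, "hdbscan": hdbscan_labels})

# Cross-tabulate: rows are HDBSCAN clusters, columns are kMeans clusters.
# If the observations above hold, each row (excluding -1) should sit
# almost entirely in one column.
xtab = pd.crosstab(df["hdbscan"], df["kmeans"])

# Map each HDBSCAN subtopic to its dominant kMeans topic, so labels
# assigned to HDBSCAN clusters can be propagated up to kMeans clusters.
dominant_topic = xtab.drop(index=-1, errors="ignore").idxmax(axis=1)
print(dominant_topic)
```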

There's still the challenge of what to do with the HDBSCAN outliers. At this point I prefer to evaluate the HDBSCAN -1 category as k different noise subsets, one per kMeans cluster, and then determine whether the points where `kmeans_cluster == 1 and hdbscan_cluster == -1` are themselves clearly part of the named kMeans cluster, or whether they appear to be genuine noise. If the -1 points are noisy, drop them from the kMeans cluster. Now our kMeans clusters will tend towards higher precision and lower recall.
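A sketch of that outlier pass, continuing from the hypothetical `df` above. Scoring the -1 points by cosine similarity to the cluster centroid, and the 0.5 threshold, are my assumptions for illustration; manually reviewing a sample of each cluster's -1 points works just as well:

```python
from sklearn.metrics.pairwise import cosine_similarity

SIM_THRESHOLD = 0.5  # assumption: tune per dataset, or eyeball samples instead

keep = np.ones(len(df), dtype=bool)
for k in df["kmeans"].unique():
    in_k = (df["kmeans"] == k).to_numpy()
    outlier = in_k & (df["hdbscan"] == -1).to_numpy()
    core = in_k & ~outlier
    if not outlier.any() or not core.any():
        continue
    # Compare this cluster's -1 points against the centroid of its
    # non-outlier members.
    centroid = X[core].mean(axis=0, keepdims=True)
    sims = cosine_similarity(X[outlier], centroid).ravel()
    # Drop the -1 points that don't look like the named cluster; the
    # surviving kMeans clusters trade recall for precision.
    drop_idx = np.flatnonzero(outlier)[sims < SIM_THRESHOLD]
    keep[drop_idx] = False

df_pruned = df[keep]
```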
