Posting some thoughts here as we've tackled similar challenges over the years:
"Clustering workflows that better address noise. I know HDBSCAN is useful here. I often still lean towards a first K-Means step, but then disregard clusters with a few number of papers as noise, because they might not actually coherently be a part of a coherent grouping despite their semantic similarity."
I find the main issue with HDBSCAN is that areas of low density are treated as noise, but in unstructured data it's often the highest-density areas that are actually noise (slop, bot spam and such). Those are virtually guaranteed to be labelled as clusters as long as there are enough nearest neighbours to satisfy `min_cluster_size`.
Conversely, kMeans assumes nothing is noise, and there is always noise.
So I prefer to run both kMeans and HDBSCAN over the data independently, kMeans with a low-ish k, and then see how the HDBSCAN clusters fall within the kMeans clusters. I find it tends to be the case that:
- HDBSCAN clusters have high precision, low recall
- kMeans clusters have high recall, low precision
- ~100% of individual HDBSCAN clusters tend to fall into single kMeans clusters
- Similar HDBSCAN clusters fall into the same kMeans cluster
If we can label the HDBSCAN clusters appropriately, we should be able to use them to label the kMeans clusters. Eventually we end up with labelled high-level clusters/topics (kMeans) and specific clusters/subtopics (HDBSCAN).
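The roll-up of subtopic names into the coarse clusters is just a grouped count. A minimal sketch with toy label arrays (the subtopic names here are hypothetical placeholders; in practice they'd come from whatever labelling step you use):

```python
from collections import Counter, defaultdict

# Toy assignments: index i is document i. -1 is HDBSCAN noise.
km_labels  = [0, 0, 0, 0, 1, 1, 1, 1]
hdb_labels = [0, 0, 1, -1, 2, 2, -1, -1]
hdb_names  = {0: "diffusion models", 1: "flow matching", 2: "RL theory"}

# Each kMeans cluster inherits the names of the HDBSCAN clusters inside it,
# weighted by how many documents carry each name.
km_topics = defaultdict(Counter)
for km, hdb in zip(km_labels, hdb_labels):
    if hdb != -1:
        km_topics[km][hdb_names[hdb]] += 1

print(dict(km_topics))
```

The dominant names per kMeans cluster then become the high-level topic label, with the full `Counter` giving the subtopics.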
There's still the challenge of what to do with the HDBSCAN outliers. At this point I prefer to treat the HDBSCAN -1 category as k different noise categories: for each kMeans cluster, determine whether its slice of the -1 points (e.g. `kmeans_cluster == 1 and hdbscan_cluster == -1`) is clearly part of the named kMeans cluster, or whether it appears to be genuine noise. If it is noisy, drop those points from the kMeans cluster. Now our kMeans clusters will tend towards higher precision and lower recall.
Love this overview, Jay! It’s exciting to see where AI is headed. The landscape is evolving so quickly, and NeurIPS always feels like a window into the future. Can’t wait to see how these trends shape our everyday lives in the coming years! #AI #NeurIPS
It is amazing; however, for me it lacks a list view, so I could browse the papers that way. It would also be great if there were options to filter the dots, for example to a specific category or a specific poster session at NeurIPS. Can the data be downloaded in some easy way, for example to build such custom views?
I have created a simple browser: https://kotwic4.github.io/Neurips2025Papers/
Great list, thanks for sharing. This is a big list of gems worth reading.
Very cool application of topic modelling. The approach is very reminiscent of Anthropic’s recent Clio paper.