Posting some thoughts here as we've tackled similar challenges over the years:
"Clustering workflows that better address noise. I know HDBSCAN is useful here. I often still lean towards a first K-Means step, but then disregard clusters with a few number of papers as noise, because they might not actually coherently be a part of a coherent grouping despite their semantic similarity."
I find the main issue with HDBSCAN is that areas of low density are considered noise, but often in unstructured data it's the highest-density areas that are actually noise (slop, bot spam and such). Those are virtually guaranteed to be labelled as clusters, so long as there are enough nearest neighbours to satisfy the value of `min_cluster_size`.
Conversely, kMeans assumes nothing is noise, and there is always noise.
So I prefer to run both kMeans and HDBSCAN over the data independently, kMeans with a low-ish value of k, and then see how the HDBSCAN clusters fall within the kMeans clusters. I find it tends to be the case that:
- HDBSCAN clusters have high precision, low recall
- kMeans clusters have high recall, low precision
- ~100% of individual HDBSCAN clusters tend to fall into single kMeans clusters
- Similar HDBSCAN clusters fall into the same kMeans cluster
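To check the containment claim on your own data, something like the following could work. It is only a sketch: it assumes you already have one kMeans label and one HDBSCAN label per document (for example from scikit-learn's `KMeans` and `HDBSCAN`, each fitted independently on the same embeddings), and the function name and toy labels are made up for illustration.

```python
from collections import Counter, defaultdict


def hdbscan_containment(kmeans_labels, hdbscan_labels):
    """For each HDBSCAN cluster, return (dominant kMeans cluster, share of its points there)."""
    counts = defaultdict(Counter)
    for k, h in zip(kmeans_labels, hdbscan_labels):
        if h == -1:  # HDBSCAN marks outliers as -1; they are handled separately
            continue
        counts[h][k] += 1

    result = {}
    for h, per_kmeans in counts.items():
        dominant, n = per_kmeans.most_common(1)[0]
        result[h] = (dominant, n / sum(per_kmeans.values()))
    return result


# Toy labels (hypothetical): 10 documents, 2 kMeans clusters, 3 HDBSCAN clusters.
kmeans_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
hdbscan_labels = [0, 0, 1, 1, 2, 2, 2, -1, -1, 2]

containment = hdbscan_containment(kmeans_labels, hdbscan_labels)
for h, (k, share) in sorted(containment.items()):
    print(f"HDBSCAN {h} -> kMeans {k} ({share:.0%} of its points)")
```

A share close to 1.0 for most HDBSCAN clusters is what the "~100% fall into single kMeans clusters" observation would look like; clusters with a low share are the ones worth inspecting by hand.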
If we can label the HDBSCAN clusters appropriately, we should be able to use those labels to label the kMeans clusters. Eventually we end up with labelled high-level clusters/topics (kMeans) and specific clusters/subtopics (HDBSCAN).
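A rough sketch of that roll-up, assuming the HDBSCAN clusters have already been named somehow (by hand or otherwise). The names, the function and the majority-vote assignment are my own illustrative choices here, not a fixed recipe:

```python
from collections import Counter, defaultdict


def subtopics_per_kmeans(kmeans_labels, hdbscan_labels, hdbscan_names):
    """Group HDBSCAN cluster names under the kMeans cluster that holds most of their points."""
    votes = defaultdict(Counter)
    for k, h in zip(kmeans_labels, hdbscan_labels):
        if h != -1:  # ignore HDBSCAN outliers for labelling purposes
            votes[h][k] += 1

    grouped = defaultdict(list)
    for h in sorted(votes):
        parent = votes[h].most_common(1)[0][0]  # majority kMeans cluster
        grouped[parent].append(hdbscan_names[h])
    return dict(grouped)


# Toy labels and hypothetical subtopic names.
kmeans_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
hdbscan_labels = [0, 0, 1, 1, 2, 2, 2, -1, -1, 2]
hdbscan_names = {0: "diffusion sampling", 1: "diffusion guidance", 2: "RLHF"}

topics = subtopics_per_kmeans(kmeans_labels, hdbscan_labels, hdbscan_names)
print(topics)
```

The list of subtopic names under each kMeans cluster is then the raw material for naming the high-level topic.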
There's still the challenge of what to do with the HDBSCAN outliers. At this point I prefer to treat the HDBSCAN -1 category as k different noise categories, one per kMeans cluster, and then determine whether each slice (e.g. `kmeans_cluster == 1 and hdbscan_cluster == -1`) is itself clearly part of the named kMeans cluster or whether it appears to be genuine noise. If a -1 slice is noisy, drop those points from the kMeans cluster. Now our kMeans clusters will tend towards higher precision and lower recall.
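In code, the bookkeeping for that might look roughly like this. The judgement of which slices are noisy is assumed to come from a manual review and is passed in as `noisy_kmeans_ids`; that parameter, the function names, and the reuse of -1 as the "dropped" marker are all assumptions for the sake of the sketch.

```python
from collections import defaultdict


def noise_buckets(kmeans_labels, hdbscan_labels):
    """Split the HDBSCAN -1 points into one bucket of row indices per kMeans cluster."""
    buckets = defaultdict(list)
    for i, (k, h) in enumerate(zip(kmeans_labels, hdbscan_labels)):
        if h == -1:
            buckets[k].append(i)
    return dict(buckets)


def drop_noisy_buckets(kmeans_labels, buckets, noisy_kmeans_ids):
    """Return kMeans labels with the reviewed-as-noisy -1 slices removed (set to -1)."""
    cleaned = list(kmeans_labels)
    for k in noisy_kmeans_ids:
        for i in buckets.get(k, []):
            cleaned[i] = -1  # assumption: -1 doubles as the "dropped" marker
    return cleaned


# Toy labels (hypothetical).
kmeans_labels = [0, 0, 0, 1, 1, 1, 1]
hdbscan_labels = [0, 0, -1, 1, 1, -1, -1]

buckets = noise_buckets(kmeans_labels, hdbscan_labels)
# Suppose review found the -1 slice of kMeans cluster 1 to be genuine noise,
# while the -1 slice of cluster 0 clearly belongs to its topic.
cleaned = drop_noisy_buckets(kmeans_labels, buckets, noisy_kmeans_ids={1})
print(buckets, cleaned)
```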
What I find most interesting about this visualization is how the geometry reflects history. You can see distinct clusters for ML Theory/Optimization and statistical learning — the mathematical foundations that were largely worked out in the 1940s through 1960s (Bellman's dynamic programming, Vapnik's statistical learning theory, information-theoretic bounds from Shannon). Those foundations are still structurally separate from the applied LLM clusters, which tells you something about how theory and practice have diverged.
NeurIPS itself was born in 1987 from a small interdisciplinary workshop — about 400 attendees debating whether neural networks were worth taking seriously again. The jump from that to 6,000 accepted papers is the arc of a field growing from a contested fringe idea into the central organizing force of computer science.
The map is essentially a cross-section of that long arc at one moment in time. I write The Long Compile, which tries to trace that arc from the 1940s forward — pieces like one I did recently on Claude Shannon, whose work on information theory you can see quietly underpinning half the clusters in your visualization.
It is amazing, but for me it lacks a list view so I could look at the papers that way. It would also be great to have options to filter the dots, for example to a specific category or a specific poster session at NeurIPS. Can the data be downloaded in some easy way, for example to build such custom views?
I have created a simple browser: https://kotwic4.github.io/Neurips2025Papers/
Great list, thanks for sharing. This is a big list of gems for reading.
Very cool application of topic modelling. The approach is very reminiscent of Anthropic’s recent CLIO paper.