Explore HDBSCAN as a replacement for DBSCAN in RGDR #136

Open
BSchilperoort opened this issue Nov 16, 2022 · 4 comments
Labels
enhancement New feature or request RDGR Issues relating to the RGDR module

Comments

@BSchilperoort
Contributor

I recently stumbled upon the alternative clustering method HDBSCAN. Its authors promise the following:

Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.

And also:

In particular performance on low dimensional data is better than sklearn's DBSCAN

On top of that, it seems to be basically a drop-in replacement for DBSCAN, which we currently use. It could be quite interesting to explore whether it makes RGDR more robust and improves its performance.

@BSchilperoort BSchilperoort added enhancement New feature or request RDGR Issues relating to the RGDR module labels Nov 16, 2022
@BSchilperoort BSchilperoort changed the title HDBSCAN vs DBSCAN Explore HDBSCAN as a replacement for DBSCAN in RGDR Nov 16, 2022
@semvijverberg
Member

Cool! I stumbled upon this method a long time ago and thought I should revisit it, but I completely forgot!

@jannesvaningen
Contributor

Okay guys, I have spent some time on this. HDBSCAN is in principle an improvement over DBSCAN, but I'm not really sure yet whether it is a real improvement for us. I'll give some explanation here. I can also give a presentation showing a notebook soon.

The main improvement of HDBSCAN over DBSCAN is that it does not rely on a single global distance threshold (the eps parameter, which HDBSCAN expresses as lambda = 1/eps) to determine the clusters. Instead, it maximizes the total persistence of the selected clusters, under the constraint that the selected clusters do not overlap. A bit less formally: it checks whether splitting one cluster into two results in more 'mass' than before. If it does, it splits the cluster. If it doesn't, it keeps it as one. That way, it determines the lambda values itself.
[Two images, not reproduced here. Source: https://pberba.github.io/stats/2020/01/17/hdbscan/]
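The split-or-keep rule described above can be sketched on a toy cluster tree. This is not the library's implementation: the `Node` class, the `select` function and all stability numbers are hypothetical, and unlike HDBSCAN's default behaviour this toy version would also allow the root itself to be selected.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    stability: float                      # persistence ("mass") of this cluster
    children: list = field(default_factory=list)


def select(node):
    """Return (total stability, chosen cluster names) for the subtree at node."""
    if not node.children:
        return node.stability, [node.name]
    child_total, child_choice = 0.0, []
    for child in node.children:
        total, names = select(child)
        child_total += total
        child_choice += names
    # Split only if the children together carry more mass than the parent.
    if child_total > node.stability:
        return child_total, child_choice
    return node.stability, [node.name]


# Made-up tree: the root splits into A and B, and A could split into A1 and A2.
tree = Node("root", 3.0, [
    Node("A", 2.5, [Node("A1", 0.8), Node("A2", 0.9)]),
    Node("B", 1.0),
])

total, chosen = select(tree)
print(total, chosen)
```

Here A is kept whole because A1 and A2 together (1.7) carry less mass than A (2.5), while the root is split because A and B together (3.5) carry more than the root (3.0).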

As promised, the only parameter that needs tuning is the minimum cluster size (min_cluster_size). It is intuitive to use, because you can indicate that you only want clusters of at least 5 cells. This is arguably better than the eps_km parameter, because eps_km requires the user to have some idea about the spatial scale of the data. However, although min_cluster_size is easy to use, it can also lead to 'cutoff' scenarios: if all candidate regions are smaller than 5 cells, the default of 5 leads to no regions being found at all. So does it lead to more robust clusters? I don't know, to be honest.

I also tested the speed in the notebook and it does not look like HDBSCAN is much faster than DBSCAN. It was actually slower in my case.

We (@semvijverberg and @geek-yang) discussed this a bit already. One way to proceed could be to use HDBSCAN with min_cluster_size set to 2 (the lowest setting) and then apply @BSchilperoort's extra layer that removes areas smaller than min_area_km2. Maybe we could also look at the correlation of the time series between regions, as @semvijverberg has suggested.
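The proposed second step, filtering clusters by area after a permissive clustering, could look roughly like the sketch below. The function `drop_small_clusters` and its arguments are hypothetical and are not the existing RGDR code; it assumes that each grid cell comes with a cluster label (-1 for noise) and a precomputed area in km2.

```python
import numpy as np


def drop_small_clusters(labels, cell_area_km2, min_area_km2):
    """Relabel clusters whose summed cell area is below min_area_km2 as noise (-1)."""
    labels = np.asarray(labels).copy()        # do not modify the caller's array
    cell_area_km2 = np.asarray(cell_area_km2, dtype=float)
    for cluster_id in np.unique(labels):
        if cluster_id == -1:                  # skip cells that are already noise
            continue
        in_cluster = labels == cluster_id
        if cell_area_km2[in_cluster].sum() < min_area_km2:
            labels[in_cluster] = -1
    return labels


# Made-up example: cluster 0 covers 300 km2, cluster 1 only 120 km2.
labels = np.array([0, 0, 0, 1, 1, -1])
areas = np.array([100.0, 100.0, 100.0, 60.0, 60.0, 40.0])
filtered = drop_small_clusters(labels, areas, min_area_km2=200.0)
print(filtered)  # cluster 1 is relabelled as noise, cluster 0 is kept
```

Filtering on area rather than on cell count keeps the threshold meaningful when cell sizes vary with latitude.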

@BSchilperoort
Contributor Author

Thanks for exploring this, Jannes! I have some questions!

  1. Did you test it on high resolution data? (instead of the very coarse data we have for testing).
  2. Those plots are nice, but are mostly for data with many more points than what we have. What does HDBSCAN's clusterer.condensed_tree_.plot() look like for the s2s data?
  3. Do the clusters come out basically the same with HDBSCAN?

@geek-yang
Member

geek-yang commented Feb 17, 2023

Just saw your post. We discussed the results last Wednesday.

  1. Did you test it on high resolution data? (instead of the very coarse data we have for testing).

Jannes tested it on a larger dataset with higher resolution, but the results are similar to those with the coarse data.

  2. Those plots are nice, but are mostly for data with many more points than what we have. What does HDBSCAN's clusterer.condensed_tree_.plot() look like for the s2s data?

@jannesvaningen Can you comment on it?

  3. Do the clusters come out basically the same with HDBSCAN?

The clusters are similar in general, though some details differ. But as with DBSCAN, the results are not very robust, especially for points at the cluster edges.

These methods are designed to cluster data based on density, which in practice means differences in the distances between points. However, since our data is on a structured grid, where neighbouring points are evenly spaced, this is difficult in some cases. We might get more robust results with unevenly distributed data, I guess. Ocean model data, for instance, is often not on a regular grid. Maybe we can test our methods on some oceanic reanalysis data, e.g. ORAS5 or SODA3.
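A small sketch of why a structured grid gives density-based methods little to work with: on a regular grid every point has the same nearest-neighbour distance. The 5 x 6 grid with 1-degree spacing is invented, and the coordinates are treated as planar, ignoring the latitude dependence of real distances.

```python
import numpy as np

# Regular 5 x 6 grid with 1-degree spacing (illustrative, planar coordinates).
lat = np.arange(0.0, 5.0, 1.0)
lon = np.arange(0.0, 6.0, 1.0)
lon2d, lat2d = np.meshgrid(lon, lat)
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])

# Pairwise Euclidean distances between all grid points.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)   # exclude each point's distance to itself

# Distance from every point to its nearest neighbour.
nearest = dist.min(axis=1)
print(np.unique(nearest))
```

Because the nearest-neighbour distance is identical everywhere, any variation in density can only come from which grid cells are selected for clustering, not from the spacing of the points themselves.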

Anyway, I think HDBSCAN is a nice option to add; at the least, it gives the user an alternative.
