Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plotDiffHeatmap : ambiguity when multiple clusterings could match #381

Open
SamGG opened this issue Jan 5, 2024 · 2 comments
Open

plotDiffHeatmap : ambiguity when multiple clusterings could match #381

SamGG opened this issue Jan 5, 2024 · 2 comments

Comments

@SamGG
Copy link

SamGG commented Jan 5, 2024

IIUC, when multiple columns of cluster_codes could match the levels of cluster_id, the first one is selected silently.

if (is.null(k)) {
kids <- levels(y$cluster_id)
same <- vapply(cluster_codes(x), function(u)
identical(levels(u), kids), logical(1))
if (!any(same))
stop("Couldn't match any clustering",
" in input data 'x' with results in 'y'.")
k <- names(cluster_codes(x))[same][1]
} else {
k <- .check_k(x, k)
}
x$cluster_id <- cluster_ids(x, k)

IMO, if sum(same) > 1 a warning must be raised show the selected column/clustering, or, an error must be raised to force the user to select the clustering column (k argument).

@HelenaLC
Copy link
Owner

HelenaLC commented Jan 8, 2024

Thanks for this! I agree that this is a weak point. I think the most suitable place for this would actually be in .check_k here, independent of plotDiffHeatmap. Specifically, all functions accepting a k argument typically call k <- .check_k(sce, k) and then use cluster_ids(sce, k) to retrieve the corresponding assignments. So, .check_k could include a warning when multiple clusterings match k, as you suggested - agreed? I'll try to implement this soon, and thank you kindly again.

@SamGG
Copy link
Author

SamGG commented Jan 8, 2024

Thanks for your feedback. Here are my thoughts, probably not crystal clear.

  1. The vapply code I cited above smartly matches the levels of the y diffcyt object with the levels of the x CATALYST object. This goal could not be addressed by .check_k because .check_k works only x.
  2. In fact, I think I raised the current issue because I wrongly tried to hack the CATALYST object in order to manage multiple independent clustering. Having reviewed the .check_k and cluster_ids functions, it's now clear to me that it is not possible to add a column to cluster_codes. cluster_codes and cluster_ids are intimately linked together. The aims of cluster_codes is to store the hierarchical merging of high a resolution clustering into lower resolution clusterings. There could be only one single high resolution clustering in cluster_codes. If I want to add a clustering (a high resolution clustering, e.g. the result of kmeans), I need to replace cluster_ids AND cluster_codes together. That's what you wrote in section 8.2 of the vignette, but I read it too quickly.

So, I think that:

  1. there is no need to change .check_k.
  2. there should be no reason to get a more than one result in the vapply code, but it would be good to raise an error in the case there is more than one result.
  3. we could try to improve the vignette by a) first stating the link/complementary between cluster_ids and cluster_codes and b) giving a simpler example such as kmeans or encompassing the current code in a function that returns a simple vector of numerical values.

We can discuss this offline if you want. Sorry for my misunderstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants