Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The question about the redcap algorithm #2467

Open
JagnDC opened this issue Nov 16, 2023 · 17 comments
Open

The question about the redcap algorithm #2467

JagnDC opened this issue Nov 16, 2023 · 17 comments

Comments

@JagnDC
Copy link

JagnDC commented Nov 16, 2023

I would like to ask a question about the REDCAP algorithm, I wrote the REDCAP algorithm according to the paper, but the calculation results seem to be different from the results of geoda, and some problems are found through comparison.

For FullOrder-AverageLinkage, my data property values are as follows:

[[6 5 4]
[8 9 3]
[5 9 0]
[8 7 8]
[4 7 1]]

The initial adjacency relation is as follows.

[[0.1.1.1.1.]
[1.0.1.1 1.0.]
[1.1.0.0.1.]
[1.1.0.0.0.]
[1.0.1.0.0.]]

I used this to cluster five nodes and set the number of clusters to two.
My code calculates the following:
{0: [0, 4, 1, 2], 1: [3]}
geoda calculates the following:
{0: [0, 1, 3], 1: [2, 4]}

I didn't know which of these calculations was correct, so I looked at the spanning tree, and I found that my algorithm built the following tree adjacency:

[[0.0.0.0.1.]
[0. 0. 1.
[0.1.0.0.1.]
[0.1. 0. 0.]
[1.0.1.0.0.]]

geoda's tree is:

0 5 "clusters" POLY_ID
3 5 6
5 3 6
1 2 21
21 21
1 5 17
5 1 17
1 4 24
4 1 24

After manual calculation, I found that according to my understanding, the results of the paper should be consistent with the results of my algorithm. May I ask if there is a problem with my calculation results? Or maybe there's something wrong in some other place that I don't understand. Thanks to the Author

@lixun910
Copy link
Member

lixun910 commented Nov 16, 2023

Thanks for your question! @JagnDC Can you post your step-by-step manual calculation so we can check what might be the problem? You can see a step-by-step manual calculation using FullOrder-CompleteLinkage in our GeoDa workbook: https://geodacenter.github.io/workbook/9c_spatial3/lab9c.html#appendix

@JagnDC
Copy link
Author

JagnDC commented Nov 22, 2023

image
Thank you for your reply. This is my example of the spanning tree construction process diagram. I don’t know if I calculated it correctly, but I checked it many times.
Here, I use the FullOrder-AverageLinkage

@lixun910
Copy link
Member

Can you also post how did you update the centroid (with average linkage) of the merged cluster after making a connection? I didn't see it in your attached image. You can also post the dendrogram of the hierarchical clustering of average linkage, which helps to explain the calculation (like in the GeoDa documentation I posted). Thanks!

@JagnDC
Copy link
Author

JagnDC commented Nov 23, 2023

Thank you for your reply.

First we need to update the result above. In the last step I made a mistake, but the conclusion remains the same.

image

In addition, here are the distance matrix of each step and the result of merging. I don't know if you can understand my meaning

image

@JagnDC
Copy link
Author

JagnDC commented Nov 23, 2023

By the way, when I set the number of clusters to 1, will the spanning tree I get be a complete spanning tree?
image

@lixun910
Copy link
Member

The calculation doesn't look correct to me: after merging 2 and 4, why all the sudden 3 has connection to 2?

Yes, It's better to post a screenshot of the dendrogram of SCHC, see in GeoDa documentation: Thanks!

A careful consideration of the various REDCAP algorithms reveals that they essentially consist of three steps. First, a dendrogram for contiguity constrained hierarchical clustering is constructed, using the given linkage function. This yields the exact same dendrogram as produced by SCHC. Next, this dendrogram is turned into a spanning tree, using standard graph manipulation principles. Finally, the optimal cuts in the spanning tree are obtained using the same logic (and computations) as in SKATER, up to the desired level of k.

@JagnDC
Copy link
Author

JagnDC commented Nov 24, 2023

image

Here's the dendrogram I drew and the SCHC dendrogram, and I've also labeled the order of the edges.

To be clear, I'm just building a spanning tree here.

For your question, you can see my clustering on the right side of the distance matrix diagram. Since 2 and 4 are merged into one cluster, cluster 2 in state 2 represents the set of cluster 2 and cluster 4 in state 1. When updating the distance between cluster 2 and other clusters in the calculation process, all distances are calculated. I devised a separate step to determine whether clusters are adjacent.

To avoid misunderstandings, I have updated the process plot of the distance matrix here.

image

@lixun910
Copy link
Member

lixun910 commented Nov 24, 2023 via email

@JagnDC
Copy link
Author

JagnDC commented Nov 24, 2023

Thank you for your reply. For your question, the average of all distances between two clusters I calculated here, because it is the calculation between feature values, so no judgment of adjacency is made, and constraints are made in the subsequent calculation process, so the final calculation results will not be affected.

However, I would like to thank you for reminding me. I have optimized my calculation process and reduced unnecessary calculation process.

@lixun910
Copy link
Member

lixun910 commented Nov 24, 2023 via email

@JagnDC
Copy link
Author

JagnDC commented Nov 24, 2023

Thank you for your reply. I see what you mean. I want to briefly explain my calculation process.

According to the full pattern in the original article, when I merge nodes 2,4 in state 1, I should calculate the distance between the other clusters, which are cluster 0 and cluster 1. You are confused about the distance between cluster 1 and cluster 2

dis(node 1, node 2)=√(9+0+9)=√18
dis(node 1, node 4)= √(16+4+4)=√24
avgdis(cluster 1, cluster 2)= (√18+ √24)/2=4.5708

dis(node 0, node 1)=√21
avgdis(cluster 0, cluster 2)= √21/1=4.5826

avgdis(cluster 1, cluster 2) < avgdis(cluster 0, cluster 2)

@lixun910
Copy link
Member

lixun910 commented Nov 24, 2023 via email

@JagnDC
Copy link
Author

JagnDC commented Nov 24, 2023

image
Thank you for your reply. I think my expression is not clear enough, so I updated the graph of distance matrix and refer to the format of your GeoDa documentation.
As you can see in this graph, 4.57081 is the average of 4.2426 and 4.8989. I have repeated the calculation here many times, I think it should be no problem.

@lixun910
Copy link
Member

lixun910 commented Nov 24, 2023 via email

@JagnDC
Copy link
Author

JagnDC commented Nov 24, 2023

image
Thank you for your reply. What I refer to here is the composition of this matrix diagram in the GeoDa documentation. What I marked in the red box is the distance of adjacency relationship, the non-red box is the non-adjacency, and the yellow background color is the minimum distance of adjacent nodes as the merge target.

@lixun910
Copy link
Member

I see. Can you send me the files you used in GeoDa? So I can check the details. Thanks!

@JagnDC
Copy link
Author

JagnDC commented Nov 24, 2023

test_data.zip
Thank you very much for your attention and answer. Here is my data, where feature1 and feature2 are the location coordinates, feature3-5 are the feature columns used, and clusters is the result of my clustering.

Feel free to contact me if you have any new findings or questions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants