Guidance on improving chances EM algorithm will converge? #61
Comments
Disclaimer: I am a regular user, not a fastLink developer. Does a simpler model without partial matching converge more reliably? I would use age instead of date of birth (dob). Does dropping the race variable lead to more frequent convergence? (I work for the Florida cancer registry and almost never link on race because it is not reliable enough as a linkage variable.) It would help if you could add a linkage variable with more values, such as SSN (in the US) or street number + ZIP code.
I appreciate this. A few answers:
Exact matching is much faster and simpler to compute, so it should converge without problems. How many exact matches are there? How much overlap is there between the two datasets? fastLink struggles if you have close to 0% or 100% overlap. Imbalance matters too: how large are the two datasets? Are you sure that every missing value is actually coded as missing? Administrative datasets often have hard-coded values such as 99 for missing, which need to be recoded to NA before using fastLink. Is birth sex available as a linkage variable?
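To make the checks suggested above concrete, here is a minimal sketch in plain Python (not fastLink code, and not R) that recodes hard-coded missing values, reports the share of missing values per field, and estimates overlap from exact matches. The sentinel codes, field names, and records are all made up for illustration; real files document their own codes.

```python
# Hypothetical sentinel codes; check the data dictionary of each file.
SENTINELS = {"", "99", "999", "UNKNOWN"}
FIELDS = ["first", "last", "age"]  # hypothetical linkage fields


def recode_missing(records, fields, sentinels=SENTINELS):
    """Return copies of the records with sentinel codes replaced by None."""
    cleaned = []
    for rec in records:
        row = dict(rec)
        for f in fields:
            value = row.get(f)
            if value is None or str(value).strip().upper() in sentinels:
                row[f] = None
        cleaned.append(row)
    return cleaned


def missing_share(records, fields):
    """Proportion of records with a missing value, per field."""
    n = len(records)
    return {f: sum(r[f] is None for r in records) / n for f in fields}


def exact_overlap(a, b, fields):
    """Share of records in `a` whose complete key also appears in `b`."""
    def key(r):
        k = tuple(r[f] for f in fields)
        return None if None in k else k  # incomplete keys never match

    keys_b = {key(r) for r in b} - {None}
    return sum(key(r) in keys_b for r in a) / len(a)


# Made-up records standing in for the two files.
births = [
    {"first": "ANA", "last": "DIAZ", "age": "31"},
    {"first": "MEI", "last": "CHEN", "age": "99"},      # 99 = missing code
    {"first": "UNKNOWN", "last": "ROSS", "age": "28"},  # placeholder name
]
hospital = [
    {"first": "ANA", "last": "DIAZ", "age": "31"},
    {"first": "MEI", "last": "CHEN", "age": "40"},
]

births_clean = recode_missing(births, FIELDS)
hospital_clean = recode_missing(hospital, FIELDS)

print(missing_share(births_clean, FIELDS))
print(exact_overlap(births_clean, hospital_clean, FIELDS))
```

An exact-match overlap close to 0 is only a rough lower bound on the true overlap, but it is a cheap way to see whether a pair of files falls into the near-0% regime described above.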
I really appreciate the time you've put in here, thank you. There is very little overlap in many cases. For example, only a few new moms from 1980 would show up in hospitalization data from, say, 1990 in the same state. That could very well be a big part of the issue. My initial testing used data that was closer in time, on the assumption that the approach would still work as the gap got larger, so I had convergence in many cases. Answers to your questions:
I'll experiment with removing the string matches, but I suspect you're right that in some of these datasets there will be very few true matches and this will be an issue.
I am concerned that you will not be able to get useful results without stronger linkage variables.
Hi @zross, just wanted to follow up to see if you gleaned any more tips for getting the EM algorithm to converge. Thanks!
Not really. Missing values definitely play a role sometimes, and it seems that more than 20% or 30% missing will be a problem, but not all of the non-convergence seemed to be related to this.
Closed issue #30 seems similar, and there Ted gave some additional advice not yet mentioned here, e.g., changing the tolerance criteria. However, to me, the basic issue here is still that we have no output to comment on. Regarding the amount of missing data, my experience is the same: it causes a convergence issue only if it is over 30% or so.
Hi, Having a large number of missing values in one field can affect the model's ability to converge, since it must rely on the available information. Another issue is merging on many fields that have only a few possible values, such as race or gender. In such cases, the model will rely on fields that provide more discriminating power, like first and last names. One suggestion is to use partial matching instead of binary comparison for string-valued fields. Another idea is to provide different starting values for the relevant parameters. Currently, our fastLink wrapper function does not have an argument for different starting values, but we are revising it and plan to add them to the new version we will release this summer. If anything else comes up, do not hesitate to let us know. All my best, Ted
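For readers who want to see how the tolerance and the starting values mentioned in this thread enter an EM fit, here is a toy sketch in plain Python. It is not fastLink's implementation: it assumes a simple two-class mixture (match / non-match) with binary agree/disagree comparisons, conditional independence across fields, and missing comparisons that are simply skipped. All function names, parameter names, and simulated values are hypothetical.

```python
import random

EPS = 1e-6  # keeps probabilities away from exactly 0 or 1


def clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def em_two_class(patterns, lam=0.1, m=None, u=None, tol=1e-5, max_iter=500):
    """Toy EM for a match / non-match mixture over agreement patterns.

    patterns: tuples of 1 (agree), 0 (disagree) or None (missing).
    lam, m, u: starting values for the match share and for the agreement
    probabilities among matches (m) and non-matches (u).
    Returns (lam, m, u, iterations, converged).
    """
    k = len(patterns[0])
    lam = clip(lam)
    m = [clip(p) for p in m] if m is not None else [0.9] * k
    u = [clip(p) for p in u] if u is not None else [0.1] * k
    for iteration in range(1, max_iter + 1):
        # E-step: posterior probability that each pair is a match.
        post = []
        for g in patterns:
            pm, pu = lam, 1.0 - lam
            for j, x in enumerate(g):
                if x is None:
                    continue  # a missing comparison adds no evidence
                pm *= m[j] if x else 1.0 - m[j]
                pu *= u[j] if x else 1.0 - u[j]
            post.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the weighted pairs.
        new_lam = clip(sum(post) / len(patterns))
        new_m, new_u = list(m), list(u)
        for j in range(k):
            obs = [(g[j], w) for g, w in zip(patterns, post) if g[j] is not None]
            if not obs:
                continue  # field never observed: keep the old values
            w_match = max(sum(w for _, w in obs), EPS)
            w_non = max(sum(1.0 - w for _, w in obs), EPS)
            new_m[j] = clip(sum(w for x, w in obs if x) / w_match)
            new_u[j] = clip(sum(1.0 - w for x, w in obs if x) / w_non)
        # Convergence check: largest absolute change in any parameter.
        old, new = [lam] + m + u, [new_lam] + new_m + new_u
        change = max(abs(a - b) for a, b in zip(new, old))
        lam, m, u = new_lam, new_m, new_u
        if change < tol:
            return lam, m, u, iteration, True
    return lam, m, u, max_iter, False


def simulate(n, lam, m, u, p_missing, seed=0):
    """Draw n agreement patterns from the same toy model."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        probs = m if rng.random() < lam else u
        out.append(tuple(
            None if rng.random() < p_missing else int(rng.random() < p)
            for p in probs
        ))
    return out


# Simulated comparisons on three fields; all values are made up.
pairs = simulate(2000, lam=0.1, m=[0.95, 0.9, 0.85], u=[0.05, 0.1, 0.3],
                 p_missing=0.05)

loose = em_two_class(pairs, tol=1e-2)
tight = em_two_class(pairs, tol=1e-6)
other_start = em_two_class(pairs, lam=0.3, m=[0.7] * 3, u=[0.3] * 3, tol=1e-6)

for label, fit in [("loose", loose), ("tight", tight), ("other start", other_start)]:
    print(label, "iterations:", fit[3], "converged:", fit[4], "lam:", round(fit[0], 4))
```

With the same starting values, a looser tolerance can only stop the loop at the same iteration or earlier, which is why relaxing it may turn a "did not converge" into a "converged" without changing the estimates much. Comparing fits from different starting values is a simple way to check whether the result depends on where the algorithm began.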
@tedenamorado and @kosukeimai thanks so much for making all of this hard work available in this R package! I'm wondering if you had published some guidance or suggestions on what situations lead the EM algorithm to fail to converge.
Unfortunately, my data is not shareable, so I'm having trouble giving you a reprex. Broadly, I'm linking birth data with hospitalization data for many different years and I'm having trouble pinpointing what is causing the failure to converge. Sometimes it converges, sometimes it doesn't.
It does seem that if I exclude any record with any NA value I get convergence more often. But I'd really like to keep these records, and the proportion of NA in the variables (max 4.5%) does not "seem" too high. Excluding NA values, in any case, is not a solution that works often.
I'm running the linkage, in many cases, on a 200k subsample in my efforts to figure out where the issue is. Some facts:
Any guidance on what I might do to improve the chances the EM algorithm will converge?