[SSPN][Migration] Affiliation matching values #238

Open
zzacharo opened this issue Oct 30, 2024 · 5 comments

@zzacharo
Contributor

zzacharo commented Oct 30, 2024

Context

ROR organization API docs: https://ror.readme.io/docs/api-affiliation

ROR matching algorithms

PHRASE: the entire phrase matched to a variant of the organization's name
COMMON TERMS: the matching was done by comparing the words separately
FUZZY: the matching was done by fuzzy-comparing the words separately
HEURISTICS: "University of X" was matched to "X University"
ACRONYM: matched by acronym
EXACT: exact match of the entered string in name, aliases, labels or acronyms fields

We used the ROR v1 API to analyze the following scenarios:

  • The API returned a high-confidence result using all matching algorithms
  • The API returned a high-confidence result using only the EXACT matching algorithm
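
A minimal sketch of how the two scenarios could be separated when reading a ROR v1 affiliation response. It assumes that a "high-confidence result" corresponds to the item the API flags with `chosen: true`, and that each item carries a `matching_type` field holding one of the algorithm names listed above. The function name and the sample response (including the placeholder ROR id) are made up for illustration and are not taken from the attached files.

```python
from typing import Any, Dict, Optional


def confident_match(response: Dict[str, Any], exact_only: bool = False) -> Optional[Dict[str, Any]]:
    """Return the item flagged as chosen by the API, or None.

    With exact_only=True the chosen item is kept only if it was
    produced by the EXACT matching algorithm.
    """
    for item in response.get("items", []):
        if not item.get("chosen"):
            continue
        if exact_only and item.get("matching_type") != "EXACT":
            continue
        return item
    return None


# Illustrative response shape only; the values are invented.
sample_response = {
    "number_of_results": 2,
    "items": [
        {
            "substring": "Example University",
            "score": 1.0,
            "matching_type": "PHRASE",
            "chosen": True,
            "organization": {"id": "https://ror.org/placeholder", "name": "Example University"},
        },
        {
            "substring": "Example",
            "score": 0.6,
            "matching_type": "COMMON TERMS",
            "chosen": False,
            "organization": {"id": "https://ror.org/placeholder2", "name": "Example College"},
        },
    ],
}

all_algorithms = confident_match(sample_response)
exact_algorithm = confident_match(sample_response, exact_only=True)
```

In this sample the first scenario yields the PHRASE match, while the second scenario yields nothing because the chosen item was not an EXACT match.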

Results

Action

  • We need to evaluate how many false positives each of the 2 attached files contains, so that we can decide which criteria identify confident matches, i.e. values that can replace the legacy arbitrary input without requiring curation.
  • To count the false positives, the 2 files need to be checked manually.
@PaulinaBaranowska
Collaborator

Some thoughts:
Affiliations can be input through the form, manually, or in the controlled INSPIRE format (ICN).
Perhaps we could split the three different cases:
  • When the affiliation comes from autocomplete through a submission form, it has the country code in brackets and an email address, and the CCID is in $$0.
  • When the affiliation has been submitted manually through a submission form, there is no associated email.
  • When the affiliation is in the INSPIRE format, the record has 035__a:oai:inspirehep.net:* or is a thesis.

Maybe we could look at this in these three groups? For affiliations coming through autocomplete (which comes from Foundation?), the strings are often poor, so maybe we could use the extra data available in Foundation for the matching?

For INSPIRE, we often have RORs already, so we can do some matching there.
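
The three-way split suggested above could be sketched roughly as follows. The record layout is hypothetical: `affiliation`, `email`, `ccid`, `oai_ids` (the 035__a values) and `is_thesis` are illustrative keys, not the actual migration schema. The order in which the rules are checked, the accepted bracket styles, and the fallback `"unknown"` group are also assumptions.

```python
import re
from typing import Any, Dict

# Assumption: the country code is a two-letter code in round or square
# brackets at the end of the affiliation string, e.g. "CERN (CH)".
COUNTRY_CODE_SUFFIX = re.compile(r"[(\[][A-Z]{2}[)\]]\s*$")
INSPIRE_OAI_PREFIX = "oai:inspirehep.net:"


def classify_affiliation_source(record: Dict[str, Any]) -> str:
    """Assign a record to one of the three proposed groups."""
    oai_ids = record.get("oai_ids", [])
    if record.get("is_thesis") or any(i.startswith(INSPIRE_OAI_PREFIX) for i in oai_ids):
        return "inspire"

    has_country_code = bool(COUNTRY_CODE_SUFFIX.search(record.get("affiliation", "")))
    if has_country_code and record.get("email") and record.get("ccid"):
        return "autocomplete"

    if not record.get("email"):
        return "manual"

    # Has an email but no country code or CCID: not covered by the three cases.
    return "unknown"


groups = [
    classify_affiliation_source({"affiliation": "CERN (CH)", "email": "a@example.org", "ccid": "123"}),
    classify_affiliation_source({"affiliation": "Some Institute"}),
    classify_affiliation_source({"affiliation": "CERN", "oai_ids": ["oai:inspirehep.net:12345"]}),
]
```

Grouping the values this way would let each group be sent to the ROR API, or matched against existing RORs, with its own confidence criteria.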

@zzacharo
Contributor Author

zzacharo commented Nov 7, 2024

Hey @PaulinaBaranowska, thanks a lot for your input. I will have a look and come back to you. In either case, we would also really appreciate some feedback on the ROR API responses, since we will need it anyway when we try to disambiguate the free-text values.

@PaulinaBaranowska
Collaborator

@zzacharo Would it be possible to provide the example .csv files with an extra column containing the ROR name of the institution that corresponds to each ROR identifier? It would make the comparison much easier and faster.

@zzacharo
Contributor Author

Hey @PaulinaBaranowska I updated the files with a new column called ror_match_info where you get the ROR information regarding each match.
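
For illustration, a column like this could be appended with the standard `csv` module. This is only a sketch: apart from `ror_match_info`, the column names (`legacy_value`, `ror_id`) and the lookup table are hypothetical, and the sketch assumes the new column holds the ROR organization name, which may be less than what the actual files contain.

```python
import csv
import io
from typing import Dict


def add_ror_match_info(csv_text: str, ror_names: Dict[str, str]) -> str:
    """Return the CSV text with an extra ror_match_info column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + ["ror_match_info"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Leave the cell empty when the identifier is not in the lookup table.
        row["ror_match_info"] = ror_names.get(row["ror_id"], "")
        writer.writerow(row)
    return out.getvalue()


# Invented sample data with placeholder identifiers.
sample_csv = "legacy_value,ror_id\nExample Univ.,https://ror.org/placeholder\nUnknown Lab,https://ror.org/missing\n"
sample_names = {"https://ror.org/placeholder": "Example University"}

updated_csv = add_ror_match_info(sample_csv, sample_names)
```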

@zzacharo zzacharo moved this from For review to In Progress in CDS-RDM - Library tasks Nov 19, 2024
@zzacharo
Contributor Author

> Some thoughts: Affiliations can be input through the form or manually or in the controlled INSPIRE format (ICN). Perhaps we could split the three different cases:
> When the affiliation comes from autocomplete through a submission form, it has the country code in brackets and an email address, and there is the CCID in $$0
> When the affiliation has been submitted manually through a submission form, there will not be an email associated
> When the affiliation is in the INSPIRE format, the record will have 035__a:oai:inspirehep.net:* or a thesis
>
> Maybe we could look at this in these three groups? For things that are coming through autocomplete (which is coming from Foundation?), the strings are often poor, maybe we could use extra data available in Foundation for the matching?
>
> For INSPIRE, we often have RORs already, so we can do some matching there.

I created #260 to track improvements to affiliation matching, since SSPN records do not contain 035__a:oai:inspirehep.net:*.
