Data loader expects human gene nomenclature #44

bussec · 2021-05-27T11:33:11Z

The data loader (dataload/annotation.py, around lines 250-300) assumes that gene calls use a human gene nomenclature format (e.g., IGHV1-23*04), including an all-caps gene name. Non-compliant calls are simply dropped. This creates problems for mouse datasets, even if they use IMNC nomenclature instead of legacy naming schemes (e.g., Johnston et al.).


bcorrie · 2021-05-27T16:23:34Z

This is our attempt at making gene calls comparable 8-) We need a mechanism that handles the idiosyncrasies of the annotation tools' gene calling and creates something that is "Interoperable" and "Reusable". We also use an internal mechanism to derive gene names so that we can build allele -> gene -> family relationships. As you know, this has been discussed at length (ad nauseam?)

airr-community/airr-standards#295

Don't get me started 8-)

Our goal here is at a minimum to ensure that any data in any two Turnkey repositories is interoperable and reusable. So we do force this to some degree, as we use the IMGT nomenclature for human genes. This works well for most annotation tools for human data.
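For readers unfamiliar with the allele -> gene -> family relationship mentioned above, here is a minimal sketch of how it can be derived from an IMGT-style call such as IGHV1-23*04. This is not the Turnkey's actual implementation; `split_gene_call` is a hypothetical helper, and it assumes the allele follows `*` and the family is the part of the gene name before the first `-`, which will not hold for every gene (e.g., orphon genes).

```python
def split_gene_call(call: str) -> dict:
    """Split an IMGT-style call into allele, gene, and family (illustrative only)."""
    allele = call.strip()
    # Assumption: the allele designation follows the first '*'.
    gene = allele.split("*", 1)[0]
    # Assumption: the family is everything before the first '-'.
    family = gene.split("-", 1)[0]
    return {"allele": allele, "gene": gene, "family": family}


if __name__ == "__main__":
    print(split_gene_call("IGHV1-23*04"))
```

A call without an allele or a hyphen (e.g., IGHJ4) simply yields the same string for gene and family.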

I admit we don't have a lot of mouse data, so we may need to modify our mechanism for determining valid gene names for mouse (and other species). At the same time, I would stand pretty strongly behind the premise that once data is loaded into an iReceptor Turnkey that the gene names need to be comparable. We can do some of that (we already convert gene names from various annotation tools to a consistent format), but we have to put some onus on the researcher to provide us with a reasonable starting point. Note that the Turnkey will happily load custom fields, so you can still store your original gene names in custom fields, but the v_call/d_call/j_call need to be well defined to start with.

So I think we need some help in determining what that starting point is for mouse gene names - and we can certainly make some changes to load that data more easily for the user. But what is that starting point - and shouldn't that starting point be mentioned as part of the AIRR Spec?

bussec · 2021-05-28T00:41:22Z

> At the same time, I would stand pretty strongly behind the premise that once data is loaded into an iReceptor Turnkey that the gene names need to be comparable.

I fully agree with that, as this is the idea of the whole standardization exercise ;-) My point is that standardization does not mean that you have to toss species-specific nomenclature out of the window -- as long as this also follows a standard. Mouse needs to be matched with mouse and human with human, but I do not see why both species would need to use ALLCAPS gene symbols.
As a permissive sanity check for VDJ genes we use /^(Ig[hkl]|Tr[abdg])[vdj][1-9].*/ for mouse and the all-caps version for human.
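A minimal sketch of such a species-aware check, using the mouse pattern quoted above and its all-caps form for human. The function name `is_plausible_gene_call` and the species keys are hypothetical, chosen only for illustration; this is not code from the data loader.

```python
import re

# Mouse pattern as given in the comment; the human pattern is its all-caps version.
MOUSE_VDJ = re.compile(r"^(Ig[hkl]|Tr[abdg])[vdj][1-9].*")
HUMAN_VDJ = re.compile(r"^(IG[HKL]|TR[ABDG])[VDJ][1-9].*")

# Hypothetical species keys, for illustration only.
PATTERNS = {"mouse": MOUSE_VDJ, "human": HUMAN_VDJ}


def is_plausible_gene_call(call: str, species: str) -> bool:
    """Permissive sanity check: does the call look like a V/D/J gene for this species?"""
    pattern = PATTERNS.get(species)
    if pattern is None:
        raise ValueError(f"no pattern defined for species {species!r}")
    return pattern.match(call) is not None


if __name__ == "__main__":
    for call, species in [("Ighv1-23*04", "mouse"), ("IGHV1-23*04", "human"),
                          ("IGHV1-23*04", "mouse")]:
        print(call, species, is_plausible_gene_call(call, species))
```

The check is deliberately loose: it only validates the locus prefix, the segment letter, and a leading non-zero digit, and accepts anything after that.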

In general I would like to avoid using custom fields, as it will IMO lead to less compatibility in the long run. I think that the solution is a proper germline gene ontology, but that's a discussion for another issue ;-)

bcorrie · 2021-06-02T18:41:49Z

> At the same time, I would stand pretty strongly behind the premise that once data is loaded into an iReceptor Turnkey that the gene names need to be comparable.

> I fully agree with that, as this is the idea of the whole standardization exercise ;-) My point is that standardization does not mean that you have to toss species-specific nomenclature out of the window -- as long as this also follows a standard. Mouse needs to be matched with mouse and human with human, but I do not see why both species would need to use ALLCAPS gene symbols.
> As a permissive sanity check for VDJ genes we use /^(Ig[hkl]|Tr[abdg])[vdj][1-9].*/ for mouse and the all-caps version for human.

Yeah, but why, why, why do they need to be different when they could have been the same 8-) I know it's too late, but mixing and matching just makes it harder for everyone... Sour grapes with regard to biologists and standards, I know, but really... 8-)

> In general I would like to avoid using custom fields, as it will IMO lead to less compatibility in the long run.

Yes, didn't mean that you should use them for key fields, but whenever we do a conversion (e.g. for the Adaptive data) we keep custom fields using the original nomenclature so that it is possible to see how the conversion was done - in case you think we messed it up 8-)
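As a rough illustration of that idea (not the actual iReceptor conversion code), the sketch below rewrites the v_call/d_call/j_call fields of a record while keeping the original value in a custom field. The `_original` suffix, the `convert_calls` function, and the `convert` callback are all assumptions made for this example.

```python
from typing import Callable

CALL_FIELDS = ("v_call", "d_call", "j_call")


def convert_calls(record: dict, convert: Callable[[str], str],
                  suffix: str = "_original") -> dict:
    """Return a copy of record with converted calls and the originals preserved."""
    out = dict(record)
    for field in CALL_FIELDS:
        value = record.get(field)
        if not value:
            continue  # nothing to convert for this field
        out[field + suffix] = value   # hypothetical custom field with the raw call
        out[field] = convert(value)   # converted, comparable call
    return out


if __name__ == "__main__":
    # str.strip stands in for a real tool-specific conversion.
    print(convert_calls({"v_call": " IGHV1-23*04 ", "junction_aa": "CARW"}, str.strip))
```

Keeping both values side by side makes it possible to audit a conversion after the fact without treating the custom field as the key field.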

> I think that the solution is a proper germline gene ontology, but that's a discussion for another issue ;-)

8-)

bussec mentioned this issue May 28, 2021

iR+ Process for gene naming and analysis ireceptor-plus/issues#58

Open
