NCBI Taxon ID optimalisation #54

Merged
merged 12 commits into from
Sep 2, 2019

bedroesb
Member

@bedroesb bedroesb commented Aug 22, 2019

New format:

| Source Name | Characteristics[Organism] | Term Source REF | Term Accession Number | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cork oak Barradas daSerra 03 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS03 | Growth | BS3 | plantnumber | [block]1; [plot]1; [plant]BS3; [replicate]1 |
| Corkoak Barradas da Serra 04 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS04 | Growth | BS4 | plantnumber | [block]1; [plot]1; [plant]BS4; [replicate]2 |
| Corkoak Barradas da Serra 05 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS05 | Growth | BS5 | plantnumber | [block]1; [plot]1; [plant]BS5; [replicate]3 |
| Corkoak Barradas da Serra 06 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS06 | Growth | BS6 | plantnumber | [block]1; [plot]1; [plant]BS6; [replicate]4 |

| Source Name | Characteristics[Organism] | Term Source REF | Term Accession Number | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Characteristics[Material Source DOI] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[fruit load] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301054 | plant | [X]1054; [plot]0; [plant]29301054; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301618 | plant | [X]1618; [plot]0; [plant]29301618; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301030 | plant | [X]1030; [plot]0; [plant]29301030; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29302127 | plant | [X]2127; [plot]0; [plant]29302127; [replicate]1 | low (pruned till one fruit) |

generated with:

python brapi_to_isa.py -e https://brapi.biodata.pt/brapi/v1/ -t 2
python brapi_to_isa.py -e https://www.eu-sol.wur.nl/webapi/tomato/brapi/v1/ -t 2

@proccaserra
Collaborator

@bedroesb nice one, you beat me to it! I was about to push the changes.
relates to MIAPPE/ISA-Tab-for-plant-phenotyping#17
@PapoutsoglouE

Collaborator

@proccaserra proccaserra left a comment


this is fine, but I think the 'if' block above is affected (line 84/84),
and again where the create_isa_characteristic function is invoked, at line 118 and line 122:

    for key in mapping_dictionnary:
        if key in all_germplasm_attributes and all_germplasm_attributes[key]:
            c = self.create_isa_characteristic(
                mapping_dictionnary[key], str(all_germplasm_attributes[key]), "", "")
        else:
            c = self.create_isa_characteristic(
                mapping_dictionnary[key], "", "", "")

@bedroesb
Member Author

Well it was a small thingy so sorry about that ;) I don't really see a problem with the if block to be honest.

I know that I need to change the block when taxonIDs are given through BrAPI.

@proccaserra
Collaborator

@bedroesb for the if block, in order to be consistent, I think we need to make sure we use a similar pattern, so:

    if 'taxonId' in all_germplasm_attributes and all_germplasm_attributes['taxonId']:
        taxonids = []
        organism = "multiple organisms"
        ncbitaxon = OntologySource(name='NCBITaxon', description="NCBI Taxonomy")
        for taxonid in all_germplasm_attributes['taxonId']:
            taxonids.append(att_test(taxonid, 'sourceName', 'NCBI') + ":" + str(taxonid['taxonId']))
        c = self.create_isa_characteristic('Organism', organism, ';'.join(taxonids), ncbitaxon.name, ';'.join(taxonids))
        returned_characteristics.append(c)

sorry didn't test

@bedroesb
Member Author

The taxonIds attribute looks like this:

        "taxonIds": [
            {
                "sourceName": "ncbiTaxon",
                "taxonId": "2340"
            },
            {
                "sourceName": "ciradTaxon",
                "taxonId": "E312"
            }
        ],

So the problem is how to handle the URI when it is not an NCBI taxon.

If I assume it is always an NCBI taxon ID, then it is indeed an easy thing to implement.

@bedroesb
Member Author

I guess we can just look for a sourceName == ncbiTaxon and then take the value delivered by taxonId; otherwise we use the current implementation (using the genus and species).
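The lookup described here could be sketched roughly as follows. This is only an illustration, not the code that was merged: `pick_ncbi_taxon_id` is a hypothetical helper, the case-insensitive match on `sourceName` is an assumption, and the sample values (including the `genus` and `species` keys) are illustrative.

```python
from typing import Optional


def pick_ncbi_taxon_id(germplasm: dict) -> Optional[str]:
    """Return the first taxonId whose sourceName is ncbiTaxon, else None.

    Hypothetical helper: the name and the case-insensitive comparison
    are assumptions, not part of brapi_to_isa.py.
    """
    for entry in germplasm.get("taxonIds") or []:
        source = str(entry.get("sourceName", "")).lower()
        taxon_id = entry.get("taxonId")
        if source == "ncbitaxon" and taxon_id:
            return str(taxon_id)
    return None


# Illustrative germplasm record with two taxonIds entries.
germplasm = {
    "genus": "Quercus",
    "species": "suber",
    "taxonIds": [
        {"sourceName": "ciradTaxon", "taxonId": "E312"},
        {"sourceName": "ncbiTaxon", "taxonId": "58331"},
    ],
}

ncbi_id = pick_ncbi_taxon_id(germplasm)
# Fall back to genus and species when no NCBI entry is present.
organism = ncbi_id if ncbi_id else f"{germplasm['genus']} {germplasm['species']}"
```

Returning `None` rather than an empty string keeps the "no NCBI entry" case explicit, so the caller decides how to fall back.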

@proccaserra
Collaborator

right, but I can't remember off the top of my head whether that situation (multiple taxonIds) occurs when there is one species+genus and the multiple taxonIds are a listing of 'alternate identifiers' for the same organism,
or
whether it describes a hybrid organism, where it is necessary to list all the different taxa from the parent lines.

either way, concatenating the multiple entries will not necessarily be pretty in a tabular format.

@bedroesb
Member Author

true that, I am changing it

@bedroesb
Member Author

bedroesb commented Aug 22, 2019

@proccaserra I've made a new function to make things more logical.

I will add some documentation to it

@bedroesb
Member Author

WUR endpoint delivered the URI link as taxonId, while the Portuguese one gave the NCBI ID itself, but this is handled in the script now.
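A minimal sketch of how both forms could be reduced to the bare numeric ID is shown below. It is not the code from the script: `extract_ncbi_taxon_number` is a hypothetical helper, and since the exact URI returned by the WUR endpoint is not shown in this thread, the purl.bioontology.org URL from the Term Accession Number column is used as an assumed stand-in.

```python
import re

# Matches a run of digits that is either the whole string or the last
# segment after '/', ':' or '_', optionally followed by a trailing slash.
_TRAILING_ID = re.compile(r"(?:^|[/:_])(\d+)/?$")


def extract_ncbi_taxon_number(raw: str) -> str:
    """Return the numeric taxon ID from a bare ID or from a URI ending in it.

    Hypothetical helper; raises ValueError when no such number is found,
    e.g. for a non-NCBI identifier like 'E312'.
    """
    match = _TRAILING_ID.search(raw.strip())
    if match is None:
        raise ValueError(f"no numeric taxon ID found in {raw!r}")
    return match.group(1)


bare = extract_ncbi_taxon_number("58331")
from_uri = extract_ncbi_taxon_number(
    "http://purl.bioontology.org/ontology/NCBITAXON/4081"
)
```

Rejecting values such as `E312` instead of stripping the letter avoids silently turning a non-NCBI identifier into a wrong NCBI one.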

@PapoutsoglouE

PapoutsoglouE commented Aug 26, 2019

Take another look at the crosslinked issue on the MIAPPE side. I am not sure that this is the best option, so let's still consider some alternatives!

@bedroesb
Member Author

So you propose an extra column called Characteristics[NCBI] with the NCBI id ? Not a problem at all to implement

@bedroesb
Member Author

bedroesb commented Aug 26, 2019

Does this look like a good output?:

VIB:

| Source Name | Characteristics[NCBI] | Term Source REF | Term Accession Number | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[water regimen]0 | Factor Value[water regimen]1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_13 | plant | [plant]13 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_27 | plant | [plant]27 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_24 | plant | [plant]24 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_3 | plant | [plant]3 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_17 | plant | [plant]17 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |

PT:

| Source Name | Characteristics[NCBI] | Term Source REF | Term Accession Number | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cork oak Barradas daSerra 03 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS03 | Growth | BS3 | plantnumber | [block]1; [plot]1; [plant]BS3; [replicate]1 |
| Corkoak Barradas da Serra 04 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS04 | Growth | BS4 | plantnumber | [block]1; [plot]1; [plant]BS4; [replicate]2 |
| Corkoak Barradas da Serra 05 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS05 | Growth | BS5 | plantnumber | [block]1; [plot]1; [plant]BS5; [replicate]3 |
| Corkoak Barradas da Serra 06 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS06 | Growth | BS6 | plantnumber | [block]1; [plot]1; [plant]BS6; [replicate]4 |
| Corkoak Barradas da Serra 07 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS07 | Growth | BS7 | plantnumber | [block]1; [plot]1; [plant]BS7; [replicate]5 |

WUR:

| Source Name | Characteristics[NCBI] | Term Source REF | Term Accession Number | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Characteristics[Material Source DOI] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[fruit load] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29302110 | plant | [X]2110; [plot]0; [plant]29302110; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301054 | plant | [X]1054; [plot]0; [plant]29301054; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301824 | plant | [X]1824; [plot]0; [plant]29301824; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29302127 | plant | [X]2127; [plot]0; [plant]29302127; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301317 | plant | [X]1317; [plot]0; [plant]29301317; [replicate]1 | low (pruned till one fruit) |

@PapoutsoglouE

@DanFaria

Please check my post on the related issue on the MIAPPE github: MIAPPE/ISA-Tab-for-plant-phenotyping#17 (comment)

If the goal is for BrAPI2ISA to generate MIAPPE-compliant ISA-Tab, then what I said there holds here as well. We should not be modeling Organism in a way that differs from the MIAPPE 1.1 checklist, even if that means we cannot use some of the functionalities from ISA.

@bedroesb
Member Author

@DanFaria
So if I am following correctly, it will stay the same as it was (so without the Characteristics[NCBI], Term Source REF and Term Accession Number columns), but with NCBITAXON:xxxx instead of NCBI:xxxx for the Characteristics[Organism] column.

@DanFaria

@bedroesb
Yes, I think that is the best solution, as I don't see a way to improve functionality on the ISA side without deviating from the MIAPPE checklist.
I would give it a couple of days to see if anyone expresses a different opinion on the pending MIAPPE ISA-Tab issue, but after that, I think you can go ahead with that configuration.

Eliana has already posted an issue on the MIAPPE checklist to update the NCBI prefix to NCBITAXON, and hopefully that can be done still within the MIAPPE 1.1 release, as it is a non-functional change.

@proccaserra
Collaborator

@bedroesb @DanFaria I guess the ambiguity lies in the fact that for the MIAPPE organism an identifier is expected, where intuitively an organism name would be supplied (following the pattern for Genus and Species).

so maybe a minor change would be to use 'organism ID' in both MIAPPE and the ISA configuration to remove that uncertainty.

@DanFaria

so maybe a minor change would be to use 'organism ID' in both MIAPPE and the ISA configuration to remove that uncertainty.

I agree that this would make the field more intuitive. I'll raise the issue on the MIAPPE checklist, and if approved, we can update the ISA configuration.

@bedroesb
Member Author

bedroesb commented Aug 28, 2019

WUR:

| Source Name | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Characteristics[Material Source DOI] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[fruit load] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. lycopersicum cv. M82 | NCBITAXON:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301824 | plant | X:1824;plot:0;plant:29301824;replicate:1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | NCBITAXON:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301642 | plant | X:1642;plot:0;plant:29301642;replicate:1 | low (pruned till one fruit) |

PT:

| Source Name | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cork oak Barradas daSerra 03 | NCBITAXON:58331 | Quercus | suber | INIAV:BS03 | Growth | BS3 | plantnumber | block:1;plot:1;plant:BS3;replicate:1 |
| Corkoak Barradas da Serra 04 | NCBITAXON:58331 | Quercus | suber | INIAV:BS04 | Growth | BS4 | plantnumber | block:1;plot:1;plant:BS4;replicate:2 |

VIB:

| Source Name | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[water regimen] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OE-2-1 | NCBITAXON:3702 | Arabidopsis | thaliana | Growth | pot_10 | plant | plant:10 | jobau_wellwatered_10-21DAS,jobau_wellwatered_3-9DAS |
| OE-2-1 | NCBITAXON:3702 | Arabidopsis | thaliana | Growth | pot_24 | plant | plant:24 | jobau_drought_10-21DAS,jobau_wellwatered_3-9DAS |

The VIB output also includes the fix for the treatments problem mentioned before.
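Comparing the VIB tables in this thread, the earlier output had one `Factor Value[water regimen]` column per treatment (suffixed 0 and 1), while this one has a single column with a comma-joined value. One way such a merge could be done is sketched below. This is an assumption about the approach, not the code from the PR: `merge_factor_values`, its input shape and the de-duplication step are all hypothetical.

```python
from collections import defaultdict


def merge_factor_values(treatments):
    """Collapse (factor, value) pairs into one comma-joined cell per factor.

    Hypothetical helper: the input shape and the de-duplication step are
    assumptions, not taken from brapi_to_isa.py.
    """
    grouped = defaultdict(list)
    for factor, value in treatments:
        if value not in grouped[factor]:  # keep the first occurrence only
            grouped[factor].append(value)
    return {
        f"Factor Value[{factor}]": ",".join(values)
        for factor, values in grouped.items()
    }


row = merge_factor_values([
    ("water regimen", "jobau_drought_10-21DAS"),
    ("water regimen", "jobau_wellwatered_3-9DAS"),
])
print(row)
```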

@PapoutsoglouE

PapoutsoglouE commented Aug 28, 2019

Off the top of my head, I don't recall any of the WUR germplasm having S. lycopersicum in their name/ID. I am also unsure where the cv. M82 came from.
@bedroesb, could you elaborate on how the Source Name is formed in this case?
(I may be misremembering and there might indeed be germplasm with that information)

(Also, the format for Spatial Distribution has been changed from using square brackets to colons, i.e. from [block] 1;[plot] 2 to block:1;plot:2.)
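To make the two Spatial Distribution notations concrete, here is a small converter from the bracket form to the colon form. It is only an illustration of the formats: `brackets_to_colons` is a hypothetical helper, and the script presumably builds the new format directly from the BrAPI data rather than converting strings.

```python
import re


def brackets_to_colons(spatial: str) -> str:
    """Convert '[block]1; [plot]2' into 'block:1;plot:2'.

    Hypothetical helper; raises ValueError on a chunk that does not
    start with a bracketed level name.
    """
    parts = []
    for chunk in spatial.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        match = re.fullmatch(r"\[([^\]]+)\]\s*(.*)", chunk)
        if match is None:
            raise ValueError(f"unexpected chunk: {chunk!r}")
        parts.append(f"{match.group(1)}:{match.group(2)}")
    return ";".join(parts)


converted = brackets_to_colons("[block]1; [plot]1; [plant]BS3; [replicate]1")
print(converted)
```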

@PapoutsoglouE

I double checked, and indeed our database has some entries with that germplasm name. Apologies!

@bedroesb
Member Author

No problem! I just updated the examples in my previous post with the latest code changes concerning Characteristics[Spatial Distribution]


4 participants