Oryza sativa: Found GMT files use different gene codes from that used in BioMart #4

Open
IngoGiebel opened this issue Apr 26, 2023 · 3 comments

Comments

@IngoGiebel

Checked GMT files: http://structuralbiology.cau.edu.cn/PlantGSEA/download.php

- GO (Gene Ontology) gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_GO

- Gene Family based gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_GFam

- KEGG gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_KEGG

- PO gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_PO

None of these files fully adheres to the GMT standard, which states that the genes must be separated by tabs. In these files the genes are separated by ",". That issue can of course be tackled. When doing so, however, a knockout problem arises: the codes for the genes differ from the codes used in the reference genome file "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-56/plants/fasta/oryza_sativa/cdna/".

For example:
BioMart gene codes: Os12g0469300, Os07g0249200
MSU Rice Genome Annotation Project gene codes (used in the GMT files): LOC_Os01g07760, LOC_Os01g40630, LOC_Os03g59220
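
The separator problem on its own can be handled with a few lines of Python. This is only a minimal sketch: it assumes each line has a set name and a description as the first two tab-separated fields, followed by one or more fields holding comma-separated gene codes. The exact layout of the PlantGSEA files is not shown in this thread, and the function names are made up for illustration.

```python
def normalize_gmt_line(line: str) -> str:
    """Return one GMT line with every gene in its own tab-separated field.

    Assumed input layout (not confirmed for the PlantGSEA files):
    name <TAB> description <TAB> gene1,gene2,...
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(
            f"expected at least 3 tab-separated fields, got {len(fields)}"
        )
    name, description, gene_fields = fields[0], fields[1], fields[2:]

    genes = []
    for field in gene_fields:
        for gene in field.split(","):
            gene = gene.strip()
            # Skip empty entries and duplicates, keep first-seen order.
            if gene and gene not in genes:
                genes.append(gene)
    return "\t".join([name, description, *genes])


def normalize_gmt(src_path: str, dst_path: str) -> None:
    """Rewrite a whole file line by line, skipping blank lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(normalize_gmt_line(line) + "\n")
```

This only fixes the separators. The gene codes themselves stay in the MSU scheme, so the mismatch with the BioMart codes remains.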

At http://plants.ensembl.org/Oryza_sativa/Location/Viewdb=core;g=Os03g0786000;r=3:32624612-32627796;t=Os03t0786000-01 I found the following information (and only there) when displaying the information for one of the genes:

Transcript LOC_Os01g02240.1.1
Gene LOC_Os01g02240
Protein product LOC_Os01g02240.1
Location Chromosome 1: 678,778-684,594
Gene type Msu gene
Strand Reverse
Base pairs 4,758
Amino acids 1,585
Analysis Genes (MSU)
Annotation method Gene annotation by MSU Rice Genome Annotation Project dated 2011-10-31. These genes are included alongside the IRGSP annotations, but are not included in Compara or BioMart. Read more...;


Genome Analysis
rGREAT: an R/bioconductor package for functional
enrichment on genomic regions

image

Unfortunately, I could not find any other suitable GMT files that use the BioMart gene codes (the codes used with kallisto, the reference genome file, and tximport).
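
A quick way to check whether a candidate GMT file is usable is to look at which naming scheme its gene codes follow and how many of them occur in the reference. The sketch below is an assumption-laden helper, not part of any package: the two regular expressions are inferred only from the example codes quoted above (`Os12g0469300` versus `LOC_Os01g07760`), and the function names and the "IRGSP" label are invented for illustration.

```python
import re

# Patterns inferred from the example codes in this issue (an assumption):
#   Os12g0469300   -> "Os" + 2 digits + "g" + 7 digits (BioMart codes)
#   LOC_Os01g07760 -> "LOC_Os" + 2 digits + "g" + 5 digits (MSU codes)
IRGSP_GENE = re.compile(r"^Os\d{2}g\d{7}$")
MSU_GENE = re.compile(r"^LOC_Os\d{2}g\d{5}$")


def id_scheme(gene_id: str) -> str:
    """Classify a gene code as 'IRGSP', 'MSU' or 'unknown'."""
    if IRGSP_GENE.match(gene_id):
        return "IRGSP"
    if MSU_GENE.match(gene_id):
        return "MSU"
    return "unknown"


def overlap_fraction(gmt_genes, reference_genes) -> float:
    """Fraction of distinct GMT gene codes that occur in the reference."""
    gmt_set = set(gmt_genes)
    if not gmt_set:
        return 0.0
    return len(gmt_set & set(reference_genes)) / len(gmt_set)
```

An overlap close to zero, together with mostly "MSU" classifications, would indicate the mismatch described in this issue.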

@fi4sko
Contributor

fi4sko commented Apr 26, 2023

It seems the MSU Rice Genome Annotation Project is very dated. I would not spend much time on it. What I would suggest is getting the annotations using Mercator: https://www.plabipd.de/mercator_main.html
It requires a FASTA file of the peptides for the genome you used to map the reads and returns an annotation within minutes.
However, it will require a bit of coding to turn the result into GMT gene sets. I will try it tomorrow.
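
The "bit of coding" could look roughly like the sketch below. It does not parse Mercator output directly, because the layout of that file is not described in this thread. Instead it assumes the annotation has already been reduced to (term, gene) pairs, which is a hypothetical intermediate format. The function name and the "NA" description placeholder are also assumptions.

```python
from collections import defaultdict


def rows_to_gmt_lines(rows):
    """Group (term, gene) pairs into GMT lines.

    Each output line is: term <TAB> description <TAB> gene1 <TAB> gene2 ...
    The description is set to "NA" because no description is assumed to be
    available in the input pairs.
    """
    gene_sets = defaultdict(list)
    for term, gene in rows:
        # Keep first-seen order and drop empty or duplicate gene codes.
        if gene and gene not in gene_sets[term]:
            gene_sets[term].append(gene)
    return [
        "\t".join([term, "NA", *genes])
        for term, genes in sorted(gene_sets.items())
        if genes  # skip terms that ended up without any gene
    ]


# Usage with placeholder terms and the BioMart codes quoted in this issue:
pairs = [
    ("term_A", "Os12g0469300"),
    ("term_A", "Os07g0249200"),
    ("term_B", "Os12g0469300"),
]
gmt_lines = rows_to_gmt_lines(pairs)
```

Writing `gmt_lines` to a file, one entry per line, gives a tab-separated GMT file whose gene codes match whatever peptide FASTA was annotated.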

@fi4sko
Contributor

fi4sko commented Apr 26, 2023

Did it work with gost in gprofiler2?

@IngoGiebel
Author

gprofiler2 works fine! Well, the graph could be nicer, but yes, it works.

image

image
