Mapping to Pfam IDs #75
Comments
Hello, thanks for your interest in our repo! To get the original Pfam ID, you'll unfortunately have to compare the residue sequences directly. If it is helpful, you can find the mapping from Pfam index to Pfam family in s3 here. The process for creating our dataset is as follows: we downloaded Pfam-A.fasta from the Pfam 31 release (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/), shuffled it, and then split it into train/validation/test as described in our paper. So the |
Thanks! Are you sure that you used Pfam 31? There are a lot of sequences in the dataset that are not in Pfam 31, but all appear in Pfam 32. Also, if you are interested, I can send you the mapping if other people might need it. |
Ah yes, thank you for the correction. It should be most similar to Pfam 32. We downloaded Pfam-A.fasta from the "current release" FTP link (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release) in March 2019. Pfam 32 had already been released in August 2018, and the last modification to Pfam 33 was in March 2020. If there are sequences that don't appear in Pfam 32, I would check Pfam 33. And thanks for offering to send the mapping, that would be helpful to share with others! |
+1 for the mapping to original Pfam IDs - I would be very interested in them! |
The columns are |
This is awesome! Thank you! Out of curiosity, how did you link back to the pfam_ids? Did you actually just compare every literal sequence string between the tape dataset and the Pfam release? |
Yes. I just parsed The script also creates a version of the lmdb databases that contains all the information about pfam mappings, species etc. I can share them if someone is interested (however, they are trivial to make with the mappings). |
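For anyone who wants to rebuild the mapping themselves, here is a minimal sketch of the exact-string matching discussed above. It is not the script mentioned in this thread. It assumes Pfam-A.fasta headers have the form `>NAME/start-end ACCESSION PFxxxxx.v;family_name;` (worth checking against your own download), and the function names `read_fasta` and `build_sequence_index` as well as the toy records are made up for illustration.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_sequence_index(handle):
    """Map each literal sequence string to the Pfam regions that carry it.

    Assumed header layout (hypothetical example):
        PROT1_HUMAN/5-9 P00001.1 PF00001.21;7tm_1;
    """
    index = defaultdict(list)
    for header, seq in read_fasta(handle):
        fields = header.split()
        region_id = fields[0]
        # Third field is assumed to be "PFxxxxx.v;family_name;"
        pfam_acc = fields[2].split(";")[0] if len(fields) > 2 else None
        index[seq.upper()].append((region_id, pfam_acc))
    return index


# Toy stand-in for Pfam-A.fasta; in practice pass open("Pfam-A.fasta").
toy_fasta = io.StringIO(
    ">PROT1_HUMAN/5-9 P00001.1 PF00001.21;7tm_1;\n"
    "MKVLA\n"
    ">PROT2_MOUSE/1-6 P00002.1 PF00002.24;7tm_2;\n"
    "GAV\nLIM\n"
)
index = build_sequence_index(toy_fasta)

# Look up a sequence taken from the dataset (here a made-up one).
hits = index.get("GAVLIM", [])
print(hits)
```

An identical sequence can occur in more than one Pfam entry, so the index keeps a list of hits per sequence rather than a single ID. The region ID (e.g. `PROT2_MOUSE/1-6`) is what you would then use to look up species or other annotations.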
Hi,
first thanks for creating this repo, it's really useful.
One question: it's not clear to me how I can get back to the original Pfam ID for a sequence from the LMDB databases. I want to do this because I need species annotations for a task.
Also, I did not find information on how the data was created (which part of Pfam was used, whether there was any preprocessing, etc.). Is this documented somewhere and I just didn't see it?