
A Python-package for convenient access to information provided by UniProt.


c-feldmann/UniProtClient


UniProtClient

IMPORTANT! Mapping is UNDER CONSTRUCTION and currently not working.

Python classes in this package allow convenient access to UniProt for protein ID mapping and information retrieval.

Usage

Mapping

Protein IDs differ from database to database. The class UniProtMapper maps protein IDs from one database to the corresponding IDs of another database. Both databases are specified by letter codes.

from UniProtClient import UniProtMapper
origin_database = 'P_GI'  # NCBI GI number
target_database = 'ACC'  # UniProt accession
gi_2_acc_mapping = UniProtMapper(origin_database, target_database)

The resulting object provides the method map_protein_ids, which takes a list of protein IDs (as strings) and returns a pandas DataFrame. The DataFrame has two columns, "From" and "To", holding the origin and target ID, respectively.

gi_numbers = ['224586929', '224586929', '4758208']  # IDs must be passed as a list of strings
# A pandas DataFrame with the columns "From" and "To" is returned
mapping_df = gi_2_acc_mapping.map_protein_ids(gi_numbers)
uniprot_accessions = mapping_df['To'].tolist()
mapping_df
   From       To
0  224586929  Q9Y2R2
1  224586929  B4DZW8
2  4758208    P51452
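
In the example output above, one origin ID (224586929) maps to two accessions. The following is a minimal sketch of how such one-to-many results could be collected per origin ID. It uses plain Python and hard-coded (From, To) pairs copied from the table; the helper name group_targets is hypothetical and not part of UniProtClient.

```python
from collections import defaultdict

# (From, To) pairs copied from the example output above.
rows = [
    ("224586929", "Q9Y2R2"),
    ("224586929", "B4DZW8"),
    ("4758208", "P51452"),
]


def group_targets(pairs):
    """Collect all target IDs per origin ID, keeping first-seen order."""
    grouped = defaultdict(list)
    for origin_id, target_id in pairs:
        # Skip repeated pairs, e.g. when the same ID was queried twice.
        if target_id not in grouped[origin_id]:
            grouped[origin_id].append(target_id)
    return dict(grouped)


targets_by_origin = group_targets(rows)
print(targets_by_origin)
```

With a real DataFrame, the same grouping should be obtainable with mapping_df.groupby("From")["To"].apply(list).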

Protein information

UniProt provides a variety of protein-specific information, such as protein family, organism, function, EC number, and many more. The class UniProtProteinInfo is initialized with column identifiers specifying the requested information. Spaces in column names should be replaced by underscores.
If no columns are specified, the following defaults are used:

Column-ID
id
entry_name
protein_names
families
organism
ec
genes(PREFERRED)
go(molecular_function)

The column "protein_names" contains all protein names; secondary names are given in brackets or parentheses. If this column is requested, the primary name is extracted and added as a new column called "primary_name".
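
The extraction of the primary name can be sketched as follows. This is not the package's actual implementation: the helper extract_primary_name is hypothetical, and it assumes that the primary name is everything before the first opening parenthesis or bracket that follows a space. The input string is a shortened, illustrative example.

```python
import re


def extract_primary_name(protein_names: str) -> str:
    """Return the text before the first ' (' or ' [' (assumed convention)."""
    match = re.search(r"\s[\(\[]", protein_names)
    if match is None:
        # No secondary names present: the whole string is the primary name.
        return protein_names.strip()
    return protein_names[:match.start()].strip()


# Shortened, illustrative input string.
names = "Dual specificity protein phosphatase 3 (EC 3.1.3.16) (EC 3.1.3.48)"
primary_name = extract_primary_name(names)
print(primary_name)
```

A rule this simple would cut off primary names that themselves contain parentheses, so a robust version would need additional handling.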

from UniProtClient import UniProtProteinInfo
info = UniProtProteinInfo()
info.load_protein_info(["B4DZW8", "Q9Y2R2", "P51452"])
        entry_name    protein_names                                      protein_families                                   organism              ec_number           gene_names(primary)  gene_ontology(molecular_function)                  primary_name                                       subfamily                                      family                               superfamily
entry
P51452  DUS3_HUMAN    Dual specificity protein phosphatase 3 (EC 3.1...  Protein-tyrosine phosphatase family, Non-recep...  Homo sapiens (Human)  3.1.3.16; 3.1.3.48  DUSP3                cytoskeletal protein binding [GO:0008092]; MAP...  Dual specificity protein phosphatase 3             Non-receptor class dual specificity subfamily  Protein-tyrosine phosphatase family  None
Q9Y2R2  PTN22_HUMAN   Tyrosine-protein phosphatase non-receptor type...  Protein-tyrosine phosphatase family, Non-recep...  Homo sapiens (Human)  3.1.3.48            PTPN22               kinase binding [GO:0019900]; non-membrane span...  Tyrosine-protein phosphatase non-receptor type 22  Non-receptor class 4 subfamily                 Protein-tyrosine phosphatase family  None
B4DZW8  B4DZW8_HUMAN  cDNA FLJ55436, highly similar to Tyrosine-prot...                                                     Homo sapiens (Human)                                           protein tyrosine phosphatase activity [GO:0004...  cDNA FLJ55436, highly similar to Tyrosine-prot...  None                                           None                                 None

Protein Families

If the column 'protein_families' is downloaded, its content is parsed automatically and split into the categories subfamily, family and superfamily. Some proteins belong to multiple families. By default, the individual categories are extracted and the values of all family associations are merged into a single string, separated by "; ".

# Widen the displayed column width. Not required for the extraction itself.
import pandas as pd
pd.set_option('display.max_colwidth', 400)
info = UniProtProteinInfo(merge_multi_fam_strings="string")  # Default behaviour
info.load_protein_info(["Q923J1"])[["organism", "subfamily", "family", "superfamily"]]
        organism              subfamily                        family                                                                  superfamily
entry
Q923J1  Mus musculus (Mouse)  ALPK subfamily; LTrpC subfamily  Alpha-type protein kinase family; Transient receptor (TC 1.A.4) family  Protein kinase superfamily; -
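
The parsing and the default merge can be illustrated with a minimal sketch. It is not the package's actual code. It assumes that the raw string separates family associations with ";" and the levels within one association with ",", and that each level can be recognized by its suffix. The raw string below is assembled from the values in the output above; the function names are hypothetical.

```python
LEVELS = ("subfamily", "family", "superfamily")


def parse_family_string(raw):
    """Split a raw family string into one dict per family association."""
    associations = []
    for chunk in raw.split(";"):
        levels = {level: None for level in LEVELS}
        for part in chunk.split(","):
            part = part.strip()
            lowered = part.lower()
            # Check the longer suffixes first: all three end in "family".
            if lowered.endswith("superfamily"):
                levels["superfamily"] = part
            elif lowered.endswith("subfamily"):
                levels["subfamily"] = part
            elif lowered.endswith("family"):
                levels["family"] = part
        associations.append(levels)
    return associations


def merge_as_string(associations, missing="-"):
    """Join each level across associations with '; ', marking gaps."""
    return {
        level: "; ".join(a[level] or missing for a in associations)
        for level in LEVELS
    }


# Assumed raw format, assembled from the values shown above.
raw = ("Protein kinase superfamily, Alpha-type protein kinase family, "
       "ALPK subfamily; Transient receptor (TC 1.A.4) family, LTrpC subfamily")
merged = merge_as_string(parse_family_string(raw))
print(merged)
```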

Setting merge_multi_fam_strings to 'list' arranges the family associations in a list. To keep types consistent, this also applies to proteins with only one family.

info = UniProtProteinInfo(merge_multi_fam_strings="list")  # Family associations as lists
info.load_protein_info(["Q923J1", "Q9Y2R2"])[["organism", "subfamily", "family", "superfamily"]]
        organism              subfamily                          family                                                                    superfamily
entry
Q923J1  Mus musculus (Mouse)  [ALPK subfamily, LTrpC subfamily]  [Alpha-type protein kinase family, Transient receptor (TC 1.A.4) family]  [Protein kinase superfamily, None]
Q9Y2R2  Homo sapiens (Human)  [Non-receptor class 4 subfamily]   [Protein-tyrosine phosphatase family]                                     [None]

Setting merge_multi_fam_strings to None creates an individual row for each family association. The remaining protein information is identical across these rows.

info = UniProtProteinInfo(merge_multi_fam_strings=None)
info.load_protein_info(["Q923J1"])[["organism", "subfamily", "family", "superfamily"]]
        organism              subfamily        family                                superfamily
entry
Q923J1  Mus musculus (Mouse)  ALPK subfamily   Alpha-type protein kinase family      Protein kinase superfamily
Q923J1  Mus musculus (Mouse)  LTrpC subfamily  Transient receptor (TC 1.A.4) family  None
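
The difference between the 'list' and None settings can be sketched on already parsed data. The snippet below is only an illustration in plain Python, not the package's implementation: the association dicts are typed in by hand from the output above, and the helpers as_lists and as_rows are hypothetical.

```python
LEVELS = ("subfamily", "family", "superfamily")

# Family associations of Q923J1, copied from the output above.
associations = [
    {"subfamily": "ALPK subfamily",
     "family": "Alpha-type protein kinase family",
     "superfamily": "Protein kinase superfamily"},
    {"subfamily": "LTrpC subfamily",
     "family": "Transient receptor (TC 1.A.4) family",
     "superfamily": None},
]


def as_lists(associations):
    """'list' style: one list per level, one element per association."""
    return {level: [a[level] for a in associations] for level in LEVELS}


def as_rows(entry, shared, associations):
    """None style: one row per association, shared fields repeated."""
    return [{"entry": entry, **shared, **a} for a in associations]


list_style = as_lists(associations)
row_style = as_rows("Q923J1", {"organism": "Mus musculus (Mouse)"}, associations)
print(list_style)
print(row_style)
```

The row-wise form is convenient for filtering or grouping by a single family, while the list form keeps exactly one row per protein.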
