Monday, 31 July 2006

A Quick Guide to IPI: The International Protein Index

Determining the proteomes of higher eukaryotes was one of the main reasons for sequencing their genomes. However, different methods of computational prediction of proteomes from genomic sequence yield different results, many of which have never been experimentally verified. Consequently there are many different sets of predicted protein sequences, with only partial overlap between them.
 
As well as this inter-database redundancy, there may be redundancy within databases, caused by the submission of many sequences representing the same protein. Moreover, sequence similarity alone can be a poor guide as to whether two database entries really are redundant, because different computational and experimental methods can produce different sequences that represent the same entity (for example, fragmentary sequence may be generated). Yet there is massive natural diversity (particularly that resulting from alternative splicing), so two sequences of high similarity may nonetheless represent real biological variants.
 
What is IPI?
IPI (www.ebi.ac.uk/IPI) provides a top-level guide to the main databases that describe the proteomes of higher eukaryotic organisms (currently human, mouse, rat, zebrafish, Arabidopsis thaliana, chicken and cow).
 
In IPI, the results of sequence comparisons are interpreted in the context of what is known about the relationship between database entries and individual biological entities, such as splice variants and genes; this knowledge is derived from annotation in the source databases. The EBI uses this information to filter redundant sequences from many sources, preserving the documented diversity in the data set produced for each complete proteome while typically reducing the final product to only 40 percent of the cumulative size of the input data.
 
IPI effectively maintains a database of cross-references between the primary data sources, thereby linking predicted and verified sequences. It provides minimally redundant yet maximally complete sets of proteins, with one sequence per transcript. It maintains stable identifiers (with incremental versioning) so that sequences can be tracked from one IPI release to the next. It is also cross-referenced to Entrez Gene, UniGene and to model organism databases such as Genew (the database of the HUGO Gene Nomenclature Committee), MGI, RGD, ZFIN and TAIR.
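
Stable, versioned identifiers make it straightforward to compare two releases programmatically. The following is a minimal sketch, assuming identifiers take the form "IPI" plus eight digits with an optional ".N" version suffix (for example IPI00000001.2). The function names and the example identifiers are hypothetical and are not part of any IPI tooling.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional

# Assumed identifier shape: "IPI" + 8 digits, optionally followed by ".<version>".
IPI_ID_PATTERN = re.compile(r"^(IPI\d{8})(?:\.(\d+))?$")


class IpiId(NamedTuple):
    accession: str          # stable part, e.g. "IPI00000001"
    version: Optional[int]  # incremental version, or None if absent


def parse_ipi_id(text: str) -> IpiId:
    """Split an identifier into its stable accession and version number."""
    match = IPI_ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised IPI identifier: {text!r}")
    accession, version = match.groups()
    return IpiId(accession, int(version) if version is not None else None)


def versions_by_accession(ids: Iterable[str]) -> Dict[str, Optional[int]]:
    """Map each accession in a release to its version."""
    return {parsed.accession: parsed.version for parsed in map(parse_ipi_id, ids)}


def changed_between_releases(old_ids: Iterable[str], new_ids: Iterable[str]) -> List[str]:
    """Return accessions found in both releases whose version has increased."""
    old = versions_by_accession(old_ids)
    new = versions_by_accession(new_ids)
    changed = []
    for accession in old.keys() & new.keys():
        old_version, new_version = old[accession], new[accession]
        # Only compare when both releases actually carry a version number.
        if old_version is not None and new_version is not None and new_version > old_version:
            changed.append(accession)
    return sorted(changed)


# Hypothetical identifier lists standing in for two releases.
previous_release = ["IPI00000001.1", "IPI00000005.1", "IPI00000012.3"]
current_release = ["IPI00000001.2", "IPI00000005.1", "IPI00000020.1"]
print(changed_between_releases(previous_release, current_release))
```

Accessions that appear in only one of the two lists are ignored here; whether an entry was deleted or merged would have to be looked up in the history data rather than inferred from the identifiers alone.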
 
Protein classification according to InterPro, an integrated documentation resource for protein families, domains and functional sites, is provided for all IPI proteins. IPI is updated monthly in accordance with the latest data released by the primary data sources.
 
Figure: Data sources for IPI and different ways of accessing the data.
 
 
What can I do with IPI?
With IPI you can: compare annotation for a single protein from different databases; move from a computationally derived protein (such as an Ensembl prediction) to manual annotation that exists in another data source (such as UniProt); track proteins between releases through the use of IPI identifiers, even though the protein predictions in the source databases may be highly unstable; and use the data as comprehensive, low-redundancy, sequence-specific data sets for sequence comparisons. IPI is particularly suitable for identifying unknown proteins, because a single data source may not be adequate to ensure that matches are not missed, and it is increasingly used in proteomics experiments for this purpose.
 
Retrieving data from IPI
Quick Search.
IPI can be easily searched using the QuickSearch facility available on the IPI home page (www.ebi.ac.uk/IPI).
 
SRS.
Advanced searches, including cross-queries with other databases, can be performed using the EBI SRS server (srs.ebi.ac.uk/). Deleted and secondary ID codes can also be tracked by searching the IPI_history database in SRS (see www.ebi.ac.uk/IPI/IPIhelp.html for a worked example). IPI and IPI_history can be selected on the SRS ‘Library’ page, under the menu for ‘Other Protein Sequence Databases’.
 
Sequence similarity search.
IPI can be searched using the EBI’s BLAST (www.ebi.ac.uk/blast) and FASTA (www.ebi.ac.uk/fasta) servers.
 
FTP.
IPI data sets for each complete proteome can be downloaded by FTP (ftp://ftp.ebi.ac.uk/pub/databases/IPI), in FASTA or UniProt format, together with ancillary data files. All previous releases of IPI are also available at this site.
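
Once a FASTA file has been downloaded, the cross-references can be pulled out of each header line. The sketch below is illustrative only. It assumes that the first whitespace-delimited token of a header is a "|"-separated list of DATABASE:ID fields (for example "IPI:IPI00000001.2|SWISS-PROT:O95793-1"), with multiple IDs for one database separated by ";". Check this against the files you actually download. The function names and the example record, including its sequence, are invented for illustration.

```python
import io
from typing import Dict, Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_cross_references(header: str) -> Dict[str, List[str]]:
    """Map database names to ID lists, under the assumed header layout."""
    tokens = header.split(None, 1)
    first_token = tokens[0] if tokens else ""
    references: Dict[str, List[str]] = {}
    for field in first_token.split("|"):
        database, separator, ids = field.partition(":")
        if separator:  # skip fields that are not of the form DATABASE:ID
            references.setdefault(database, []).extend(
                identifier for identifier in ids.split(";") if identifier
            )
    return references


# Invented record standing in for a downloaded file.
example_file = io.StringIO(
    ">IPI:IPI00000001.2|SWISS-PROT:O95793-1 Tax_Id=9606 Example protein\n"
    "MSQVQVQV\n"
    "QPTTPAAA\n"
)
for header, sequence in read_fasta(example_file):
    print(parse_cross_references(header), len(sequence))
```

For a real download, the io.StringIO object would be replaced by an open file handle, for instance one returned by open() on the local copy of the data set.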
 
Web services.
Programmatic access to IPI is available through the EBI’s web services at www.ebi.ac.uk/Tools/webservices.
 