
THE MISPRED PROJECT

Project objectives

The main objective of the MisPred project is to identify mispredicted and abnormal genes/proteins primarily from metazoan genomes in order to improve the quality of predictions.
The MisPred approach is based on the principle that a protein-coding gene is likely to be mispredicted if some of the features of the predicted protein conflict with our current knowledge about proteins. The routines of MisPred serve to identify suspicious predicted protein sequences which conflict with at least one of the dogmas and could not, therefore, become correctly folded and functional macromolecules in vivo.
By identifying erroneous protein sequences, the MisPred pipeline informs both the creators of the predictive algorithms and experimentalists about the reliability of predictions, and thereby helps to improve the quality of the available datasets. The principles of quality control are illustrated below with five of the MisPred tools.

Principles of quality control

1. Conflict between the presence of extracellular Pfam-A domain(s) in a protein and the absence of appropriate sequence signals.

Conflict 1 is based on the dogma that the subcellular localization of extracellular and transmembrane proteins is defined by the presence of appropriate sequence signals. For the domain-based prediction of the subcellular localization of proteins, only those Pfam-A domain families (Finn et al. 2006) that are exclusively extracellular, exclusively cytoplasmic or exclusively nuclear have been incorporated into the MisPred pipeline. Pfam-A domains that are known not to be restricted to a particular cellular compartment, such as immunoglobulin domains and fibronectin type III domains (i.e. domains that are ‘multilocale’), were not used in these analyses. Our domain co-occurrence analyses (Tordai et al. 2005) identified 166 obligatory extracellular, 115 obligatory cytoplasmic and 126 obligatory nuclear Pfam-A domain families as being restricted to the respective subcellular compartment; the majority of these are also identified as such in the SMART database (Letunic et al. 2004). These Pfam-A domains are listed in Tables 1-3.
This MisPred tool identifies proteins containing extracellular Pfam-A domains, i.e. domains that occur exclusively in extracellular proteins or in the extracytoplasmic parts of type I, type II and type III single-pass or multispanning transmembrane proteins. It then examines whether these proteins also have a secretory signal peptide, a signal anchor or transmembrane segments that could target the domains to the extracellular space. Proteins that contain obligatory extracellular domains but lack a secretory signal peptide, a signal anchor and transmembrane segment(s) are considered erroneous: in the absence of these signals, their extracellular domain (usually rich in disulfide bonds) will not be delivered to the extracytoplasmic space where it is properly folded, stable and functional. Mislocalized extracellular domains are likely to be misfolded in the reducing milieu of the cytoplasm, and such proteins are likely to be rapidly degraded by the protein quality control system of the cell.
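The decision rule for Conflict 1 can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the MisPred implementation: the function name, its parameters and the domain names in `EXTRACELLULAR_DOMAINS` are hypothetical placeholders, and the input is assumed to be the already-parsed output of the domain, signal peptide and transmembrane predictors.

```python
# Hypothetical placeholder names; the real obligatory extracellular
# families are those listed in the MisPred tables.
EXTRACELLULAR_DOMAINS = frozenset({"EC_domain_A", "EC_domain_B"})


def has_conflict_1(domains, has_signal_peptide=False,
                   has_signal_anchor=False, tm_segment_count=0):
    """Return True if a protein carries an obligatory extracellular
    domain but no signal that could export it from the cytoplasm."""
    has_extracellular = bool(set(domains) & EXTRACELLULAR_DOMAINS)
    has_targeting_signal = (
        has_signal_peptide or has_signal_anchor or tm_segment_count > 0
    )
    return has_extracellular and not has_targeting_signal


# An extracellular domain without any targeting signal is suspicious.
print(has_conflict_1(["EC_domain_A"]))
# The same domain preceded by a signal peptide is not flagged.
print(has_conflict_1(["EC_domain_A"], has_signal_peptide=True))
```

A protein without any obligatory extracellular domain is never flagged by this rule, whatever signals it carries.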

2. Conflict between the presence of extracellular and cytoplasmic Pfam-A domains in a protein and the absence of transmembrane segments.

Conflict 2 is based on the principle that multidomain proteins that contain both obligatory extracellular and obligatory cytoplasmic domains must have at least one transmembrane segment to pass through the cell membrane. The MisPred tool identifies proteins containing both extracellular and cytoplasmic Pfam-A domains and examines whether they also contain transmembrane helices. Proteins that contain both obligatory extracellular and obligatory cytoplasmic domains but lack transmembrane segment(s) separating them are considered suspicious (abnormal and nonviable).
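A minimal Python sketch of the Conflict 2 rule follows. It is an illustration under stated assumptions rather than the actual MisPred code: the function name and the domain names in the two sets are hypothetical, and the check only counts predicted transmembrane helices without verifying that a helix lies between the two kinds of domain.

```python
# Hypothetical placeholder names standing in for the obligatory
# extracellular and cytoplasmic Pfam-A families.
EXTRACELLULAR_DOMAINS = frozenset({"EC_domain_A", "EC_domain_B"})
CYTOPLASMIC_DOMAINS = frozenset({"CYT_domain_A", "CYT_domain_B"})


def has_conflict_2(domains, tm_helix_count):
    """Return True if extracellular and cytoplasmic domains co-occur
    in a protein that has no predicted transmembrane helix."""
    found = set(domains)
    has_extracellular = bool(found & EXTRACELLULAR_DOMAINS)
    has_cytoplasmic = bool(found & CYTOPLASMIC_DOMAINS)
    return has_extracellular and has_cytoplasmic and tm_helix_count == 0


# Both domain classes, no membrane-spanning segment: flagged.
print(has_conflict_2(["EC_domain_A", "CYT_domain_B"], tm_helix_count=0))
# Same domains with one transmembrane helix: not flagged.
print(has_conflict_2(["EC_domain_A", "CYT_domain_B"], tm_helix_count=1))
```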

3. Co-occurrence of nuclear and extracellular Pfam-A domains in a predicted multidomain protein.

Conflict 3 is based upon the rule that protein domains that occur exclusively in the extracellular space and those that occur exclusively in the nucleus do not co-occur in a single multidomain protein (Tordai et al. 2005). The explanation for this rule is that a protein that contains both extracellular and nuclear domains would not be delivered to a compartment where both types of domains would be correctly folded and fully functional. Accordingly, proteins that contain both obligatory extracellular and obligatory nuclear domains are considered abnormal and nonviable since they cannot be delivered to a cellular compartment relevant for both types of the constituent domains.
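The co-occurrence rule of Conflict 3 amounts to two set intersections. The sketch below is illustrative only: the function name and the domain names are hypothetical placeholders for the obligatory extracellular and nuclear families, and the input is assumed to be the list of Pfam-A domains already detected in one predicted protein.

```python
# Hypothetical placeholder names for obligatory extracellular and
# obligatory nuclear Pfam-A families.
EXTRACELLULAR_DOMAINS = frozenset({"EC_domain_A", "EC_domain_B"})
NUCLEAR_DOMAINS = frozenset({"NUC_domain_A", "NUC_domain_B"})


def conflict_3_domains(domains):
    """Return the (extracellular, nuclear) domains that co-occur in a
    protein, or None if the protein does not mix the two classes."""
    found = set(domains)
    extracellular = sorted(found & EXTRACELLULAR_DOMAINS)
    nuclear = sorted(found & NUCLEAR_DOMAINS)
    if extracellular and nuclear:
        return extracellular, nuclear
    return None


print(conflict_3_domains(["EC_domain_A", "NUC_domain_B", "other"]))
print(conflict_3_domains(["NUC_domain_A", "NUC_domain_B"]))
```

Returning the offending domains rather than a bare boolean makes it easier to inspect why a given prediction was flagged.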

4. Domain size deviation

Conflict 4 is based on the observation that the protein fold is highly conserved within a domain family, so the number of amino acid residues in closely related members of a globular domain family usually falls into a relatively narrow range (Wheelan et al. 2000, Wolf et al. 2007). This phenomenon reflects the fact that the insertion of longer segments into structural domains, or their deletion from them, may yield proteins that are unable to fold efficiently into a viable and stable three-dimensional structure (Tordai et al. 2005, Watters et al. 2007, Wolf et al. 2007). MisPred uses Pfam-A domain families that have a well-defined and conserved sequence length range, i.e. families whose well-characterized members (in the UniProtKB/Swiss-Prot database) do not deviate from the average domain size by more than 2 SD values. Approximately 85% of all Pfam-A families present in Metazoa turned out to be suitable for this task. Proteins containing domains that consist of a significantly larger or smaller number of residues than closely related members of that domain family (in this conservative approach the cutoff was set to 40% of the actual domain length) may be suspected to be abnormal and nonviable.
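The two numerical criteria of Conflict 4 can be sketched as follows. This is a hedged illustration, not the MisPred code: the function names are hypothetical, the population standard deviation is an arbitrary choice, and the 40% cutoff is read here as a deviation from the family's mean length exceeding 40% of that mean, which is only one possible interpretation of the text.

```python
from statistics import mean, pstdev


def family_is_suitable(reference_lengths, max_sd=2.0):
    """A family qualifies if every well-characterized member lies
    within max_sd standard deviations of the mean domain length."""
    mu = mean(reference_lengths)
    sd = pstdev(reference_lengths)
    return all(abs(n - mu) <= max_sd * sd for n in reference_lengths)


def deviates_in_size(domain_length, reference_lengths, cutoff=0.40):
    """Flag a predicted domain whose length differs from the family
    mean by more than `cutoff` times that mean (assumed reading)."""
    mu = mean(reference_lengths)
    return abs(domain_length - mu) > cutoff * mu


# Invented reference lengths for a hypothetical domain family.
reference = [100, 102, 98, 101, 99]
print(family_is_suitable(reference))
print(deviates_in_size(55, reference))   # truncated domain
print(deviates_in_size(90, reference))   # within tolerance
```

In this sketch a family with a single strong outlier among its reference members fails `family_is_suitable` and would simply be excluded from the size check.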

5. Chimeric proteins parts of which are encoded by exons located on different chromosomes.

Conflict 5 is based on the rule that a protein is encoded by exons located on a single chromosome. According to this dogma, proteins whose parts are encoded by two or more different genes located on distinct chromosomes are considered abnormal.
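Once a protein has been aligned to the genome, the Conflict 5 rule reduces to asking whether its aligned segments fall on more than one chromosome. The sketch below assumes a hypothetical input format, a list of `(protein_start, protein_end, chromosome)` tuples holding the best genomic match for each non-overlapping part of the protein; it does not parse real alignment output, and the function names are illustrative.

```python
def chromosomes_used(blocks):
    """Collect the chromosomes hit by the aligned protein segments.

    `blocks` is a list of (protein_start, protein_end, chromosome)
    tuples; this layout is an assumption made for the example.
    """
    return {chromosome for _start, _end, chromosome in blocks}


def has_conflict_5(blocks):
    """Return True if parts of one protein map to different chromosomes."""
    return len(chromosomes_used(blocks)) > 1


# N-terminal part on chr2, C-terminal part on chr7: chimeric prediction.
chimeric = [(1, 150, "chr2"), (151, 320, "chr7")]
# All parts on the same chromosome: consistent with a single gene.
normal = [(1, 150, "chr2"), (151, 320, "chr2")]

print(has_conflict_5(chimeric))
print(has_conflict_5(normal))
```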

Methods

Each MisPred routine is based upon generally accepted rules about the properties of protein-coding genes and correctly folded, functionally competent protein molecules. Each routine combines reliable bioinformatic methodologies with in-house programs to analyze protein sequences. The obligatory extracellular, cytoplasmic and nuclear Pfam-A domain families used for Conflicts 1, 2 and 3 are listed in Tables 1-3. The Pfam-A domain families suitable for the study of domain integrity in the case of Conflict 4 are shown in Table 4. Secretory signal peptides (PrediSi, SignalP, Phobius), transmembrane helices (TMHMM, Phobius) and Pfam-A domains (hmmpfam) are identified by standard methods. Sequence alignments were carried out using BLAST. For Conflict 5, protein sequences were aligned onto the genome using the BLAT program.

Results of the analysis of different protein databases

We have analyzed protein sequences of the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase (UniProtKB), and protein sequences predicted by the EnsEMBL and NCBI/GNOMON pipelines. We have analyzed eleven species: Homo sapiens, Mus musculus, Gallus gallus, Rattus norvegicus, Caenorhabditis elegans, Fugu rubripes, Monodelphis domestica, Ciona intestinalis, Xenopus tropicalis, Drosophila melanogaster and Danio rerio. The results of these analyses on different database versions are summarized under Statistics.

The results of the latest analyses are stored in the MisPred database and are accessible through the query page under Search MisPred. The database can be queried by protein accession number or protein sequence, and the results can be filtered by species, type of conflict, source database, or any combination of these. The query page also provides links to various databases and resources for further information about the protein and its domains. The results of the analyses of previous database versions are also accessible by ticking the Search in archive checkbox.

The annotations relating to EnsEMBL proteins can also be viewed through EnsEMBL using the Distributed Annotation System.

We have extended our analyses to the Gencode sequences. The results of these analyses are found under Gencode info.

Analyze your sequence

If the protein sequence you are interested in is not in the MisPred database, you can analyze it under Analyze Your Sequence.

References

Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A (2006) Pfam: clans, web tools and services. Nucleic Acids Research 34:D247-51.
Letunic I, Copley R, Schmidt S, Ciccarelli F, Doerks T, Schultz J, Ponting C, Bork P (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Research 32:D142-4.
Tordai H, Nagy A, Farkas K, Bányai L, Patthy L (2005) Modules, multidomain proteins and organismic complexity. The FEBS Journal 272:5064-78.
Watters AL, Deka P, Corrent C, Callender D, Varani G, Sosnick T, Baker D (2007) The highly cooperative folding of small naturally occurring proteins is likely the result of natural selection. Cell 128:613-24.
Wheelan S, Marchler-Bauer A, Bryant S (2000) Domain size distributions can predict domain boundaries. Bioinformatics 16:613-618.
Wolf Y, Madej T, Babenko V, Shoemaker B, Panchenko A (2007) Long-term trends in evolution of indels in protein sequences. BMC Evolutionary Biology 7:19.

Citing MisPred

Publications related to the MisPred project are found under Publications.

If you wish to cite MisPred in your paper, please cite the article below:
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L and Patthy L. (2008) Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics 9, 353.