THE MISPRED PROJECT
Project objectives
The main objective of the MisPred project is to identify mispredicted and abnormal genes/proteins primarily from metazoan genomes
in order to improve the quality of predictions.
The MisPred approach is based on the principle that a protein-coding gene is likely to be mispredicted if some of the features
of the predicted protein conflict with our current knowledge about proteins. The routines of MisPred serve to identify suspicious
predicted protein sequences that conflict with at least one of these accepted rules (dogmas) and therefore could not become correctly folded
and functional macromolecules in vivo.
By identifying erroneous protein sequences, the MisPred pipeline informs both the creators of the predictive algorithms
and experimentalists about the reliability of predictions, and thereby helps to improve the quality of the available datasets.
The principles of quality control are illustrated below with five of the MisPred tools.
Principles of quality control
1. Conflict between the presence of extracellular Pfam-A domain(s) in a protein and the absence of appropriate sequence signals.
Conflict 1 is based on the dogma that the subcellular localization of extracellular and transmembrane proteins is defined by the presence
of appropriate sequence signals. For the domain-based prediction of subcellular localization of proteins only those Pfam-A domain families
(Finn et al. 2006) have been incorporated into the MisPred pipeline that are exclusively extracellular, cytoplasmic or nuclear, respectively.
Pfam-A domains that are known not to be restricted to a particular cellular compartment, such as immunoglobulin domains and fibronectin
type III domains (i.e. domains that are ‘multilocale’), were not utilized in these analyses. Our domain co-occurrence analyses (Tordai et al. 2005)
have identified 166 obligatory extracellular, 115 obligatory cytoplasmic and 126 obligatory nuclear Pfam-A domain families as being restricted
to the respective subcellular compartment, the majority of which are also identified as such in the SMART database (Letunic et al. 2004).
These Pfam-A domains are listed in
Tables 1-3.
This MisPred tool identifies proteins containing extracellular Pfam-A domains which occur exclusively in extracellular proteins or extracytoplasmic
parts of type I, type II, and type III single pass or multispanning transmembrane proteins and examines whether the proteins also have a secretory
signal peptide, a signal anchor or transmembrane segments that could target these domains to the extracellular space. Proteins that contain obligatory
extracellular domains but lack a secretory signal peptide, a signal anchor and transmembrane segment(s) are considered erroneous since in the absence
of these signals their extracellular domain (usually rich in disulfide-bonds) will not be delivered to the extracytoplasmic space where it is
properly folded, stable and functional. Mislocalized extracellular domains are likely to be misfolded in the reductive milieu of the cytoplasm
and such proteins are likely to be rapidly degraded by the protein quality control system of the cell.
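The check described for Conflict 1 can be sketched as follows. This is only an illustration, not the MisPred implementation: it assumes that domain assignments and signal predictions have already been computed by external tools, and the record layout (ProteinFeatures), the function name and the family identifier used in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ProteinFeatures:
    """Precomputed annotations for one predicted protein (hypothetical layout)."""
    domains: FrozenSet[str]           # Pfam-A family identifiers found in the sequence
    has_signal_peptide: bool = False  # secretory signal peptide predicted
    has_signal_anchor: bool = False   # signal anchor predicted
    tm_segments: int = 0              # number of predicted transmembrane segments


def has_conflict_1(protein: ProteinFeatures, extracellular: FrozenSet[str]) -> bool:
    """Flag a protein that carries an obligatory extracellular domain
    but no signal that could target it to the extracellular space."""
    has_extracellular_domain = bool(protein.domains & extracellular)
    has_targeting_signal = (
        protein.has_signal_peptide
        or protein.has_signal_anchor
        or protein.tm_segments > 0
    )
    return has_extracellular_domain and not has_targeting_signal


# Usage with a placeholder family name (not a real Pfam-A identifier).
obligatory_extracellular = frozenset({"EC_A"})
suspicious = ProteinFeatures(domains=frozenset({"EC_A"}))
print(has_conflict_1(suspicious, obligatory_extracellular))
```

The set of obligatory extracellular families is passed in as an argument so that the sketch does not hard-code any part of the lists in Tables 1-3.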
2. Conflict between the presence of extracellular and cytoplasmic Pfam-A domains in a protein and the absence of transmembrane segments.
Conflict 2 is based on the principle that multidomain proteins that contain both obligatory extracellular and obligatory cytoplasmic domains
must have at least one transmembrane segment to pass through the cell membrane. The MisPred tool identifies proteins containing both extracellular
and cytoplasmic Pfam-A domains and examines whether they also contain transmembrane helices. Proteins that contain both obligatory extracellular
and obligatory cytoplasmic domains but lack transmembrane segment(s) separating them are considered suspicious (abnormal and nonviable).
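A minimal sketch of the Conflict 2 check is given below. It assumes that domain hits and predicted transmembrane helices are available as residue coordinate intervals, and it flags a protein when some extracellular/cytoplasmic domain pair has no helix lying between the two domains. This positional test is one possible reading of "separating them"; a simpler variant would only ask whether any helix is present at all. The tuple layouts, function name and family names are hypothetical.

```python
from typing import List, Set, Tuple

Hit = Tuple[str, int, int]  # (Pfam-A family, start, end); hypothetical layout
Segment = Tuple[int, int]   # (start, end) of a predicted transmembrane helix


def has_conflict_2(hits: List[Hit], tm: List[Segment],
                   extracellular: Set[str], cytoplasmic: Set[str]) -> bool:
    """Return True if an obligatory extracellular and an obligatory cytoplasmic
    domain occur in the protein without a transmembrane helix between them."""
    ec = [(s, e) for fam, s, e in hits if fam in extracellular]
    cy = [(s, e) for fam, s, e in hits if fam in cytoplasmic]
    for ec_start, ec_end in ec:
        for cy_start, cy_end in cy:
            # Boundaries of the gap between the two domains along the sequence.
            if ec_end < cy_start:
                left, right = ec_end, cy_start
            else:
                left, right = cy_end, ec_start
            # A helix separates the pair only if it lies entirely inside the gap.
            separated = any(left < t_start and t_end < right for t_start, t_end in tm)
            if not separated:
                return True
    return False
```

Overlapping domain hits leave no gap for a helix and are therefore flagged as well, which seems the safer default for a quality-control screen.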
3. Co-occurrence of nuclear and extracellular Pfam-A domains in a predicted multidomain protein.
Conflict 3 is based upon the rule that protein domains that occur exclusively in the extracellular space and those that occur exclusively
in the nucleus do not co-occur in a single multidomain protein (Tordai et al. 2005). The explanation for this rule is that a protein
that contains both extracellular and nuclear domains would not be delivered to a compartment where both types of domains would be
correctly folded and fully functional. Accordingly, proteins that contain both obligatory extracellular and obligatory nuclear domains
are considered abnormal and nonviable since they cannot be delivered to a cellular compartment relevant for both types of the constituent domains.
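The co-occurrence rule of Conflict 3 amounts to a set intersection, sketched below. The function name and the family names in the test data are hypothetical, and the two family sets are assumed to be supplied by the caller (for example from the lists in Tables 1-3).

```python
from typing import Iterable, List, Set, Tuple


def conflict_3_pairs(domains: Iterable[str],
                     extracellular: Set[str],
                     nuclear: Set[str]) -> List[Tuple[str, str]]:
    """List every (extracellular, nuclear) family pair found in one protein.
    An empty list means the protein does not violate the rule."""
    found = set(domains)
    ec = sorted(found & extracellular)
    nu = sorted(found & nuclear)
    return [(e, n) for e in ec for n in nu]
```

Returning the offending pairs, rather than a bare yes/no, makes it easier to see which domain assignments triggered the conflict when a flagged protein is reviewed.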
4. Domain size deviation
Conflict 4 is based on the observation that the protein fold is highly conserved in a domain family; therefore, the number of amino acid
residues in closely related members of a globular domain family usually falls into a relatively narrow range (Wheelan et al. 2000, Wolf et al. 2007).
This phenomenon reflects the fact that insertion/deletion of longer segments into/from structural domains may yield proteins that are
unable to fold efficiently into a correctly folded, viable and stable three-dimensional structure (Tordai et al. 2005, Watters et al. 2007,
Wolf et al. 2007).
MisPred uses only those Pfam-A domain families that have a well-defined and conserved sequence length range, i.e. families whose well-characterized members
(in the UniProtKB/Swiss-Prot database) do not deviate from the average domain size by more than 2 SD values. Approximately 85% of all Pfam-A families
present in Metazoa turned out to be suitable for this task. Proteins containing domains that consist of a significantly larger or
smaller number of residues than closely related members of that domain family (in this conservative approach the cutoff was set to 40%
of the actual domain length) may be suspected to be abnormal and nonviable.
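The two steps of Conflict 4 (selecting suitable families, then flagging deviating domains) can be sketched as follows. Several points are assumptions rather than documented MisPred behaviour: the population standard deviation is used, the 2 SD test is applied literally to every reference member, and the 40% cutoff is interpreted as a deviation of more than 40% from the family's average length. The function names and reference lengths are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def family_is_suitable(ref_lengths: Sequence[int], max_sd: float = 2.0) -> bool:
    """True if every reference member of the family lies within
    max_sd standard deviations of the family's average length."""
    avg = mean(ref_lengths)
    sd = pstdev(ref_lengths)  # population SD; an assumption
    return all(abs(n - avg) <= max_sd * sd for n in ref_lengths)


def has_conflict_4(domain_length: int, ref_lengths: Sequence[int],
                   cutoff: float = 0.40) -> bool:
    """Flag a domain whose length differs from the family average
    by more than `cutoff` (as a fraction of that average)."""
    avg = mean(ref_lengths)
    return abs(domain_length - avg) > cutoff * avg


# Usage with made-up reference lengths for one family.
reference = [100, 102, 98, 101, 99]
if family_is_suitable(reference):
    print(has_conflict_4(55, reference))
```

If the cutoff is instead meant relative to the length of the domain under examination, only the right-hand side of the comparison in has_conflict_4 needs to change.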
5. Chimeric proteins parts of which are encoded by exons located on different chromosomes.
Conflict 5 is based on the rule that a protein is encoded by exons located on a single chromosome. According to this dogma, proteins whose parts
are encoded by two or more different genes located on distinct chromosomes are considered abnormal.
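Once a protein has been aligned onto the genome, the Conflict 5 test reduces to asking whether its parts map to more than one chromosome. The sketch below assumes that the input already contains one best genomic hit per segment of the protein, so that alternative hits to paralogous loci have been filtered out beforehand. The tuple layout, function names and chromosome labels are hypothetical, and no alignment output is parsed here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Block = Tuple[str, int, int]  # (chromosome, query_start, query_end); hypothetical layout


def parts_by_chromosome(blocks: List[Block]) -> Dict[str, List[Tuple[int, int]]]:
    """Group the aligned parts of one protein by the chromosome they map to."""
    parts = defaultdict(list)
    for chrom, q_start, q_end in blocks:
        parts[chrom].append((q_start, q_end))
    return {chrom: sorted(spans) for chrom, spans in parts.items()}


def has_conflict_5(blocks: List[Block]) -> bool:
    """True if different parts of the protein map to different chromosomes."""
    return len(parts_by_chromosome(blocks)) > 1
```

Keeping the per-chromosome query ranges makes it possible to report which part of a suspected chimera comes from which chromosome.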
Methods
Each MisPred routine is based upon generally accepted rules about the properties of protein-coding genes and correctly folded, functionally
competent protein molecules. Each routine combines reliable bioinformatic methodologies, as well as in-house programs, to analyze protein sequences.
The lists of obligatory extracellular, cytoplasmic and nuclear Pfam-A domain families used for Conflicts 1, 2 and 3 are given in
Tables 1-3.
The lists of Pfam-A domain families suitable for the study of domain integrity in the case of Conflict 4 are shown in
Table 4.
Secretory signal peptides (PrediSi, SignalP, Phobius), transmembrane helices (TMHMM, Phobius) and Pfam-A domains (hmmpfam) are identified by
standard methods. Sequence alignments are carried out using BLAST. Protein sequences are aligned onto the genome using the BLAT program in the case
of Conflict 5.
Results of the analysis of different protein databases
We have analyzed protein sequences of the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase (UniProtKB), and protein sequences
predicted by the EnsEMBL and NCBI/GNOMON pipelines. We have analyzed eleven species: Homo sapiens, Mus musculus, Gallus gallus, Rattus norvegicus,
Caenorhabditis elegans, Fugu rubripes, Monodelphis domestica, Ciona intestinalis, Xenopus tropicalis, Drosophila melanogaster and Danio rerio.
The results of these analyses on different database versions are summarized under
Statistics.
The results of the latest analyses are stored in the MisPred database and are accessible through the query page under
Search MisPred.
The database can be queried by protein accession number or protein sequence, and the results can be filtered by species,
type of conflict, source database, or any combination of these features. The query page also provides links to various databases/resources for
further information about the protein and its domains. The results of the analysis of previous database versions are also accessible
by selecting the Search in archive checkbox.
The annotations relating to EnsEMBL proteins can also be viewed through EnsEMBL using the Distributed Annotation System.
We have extended our analyses to the Gencode sequences. The results of these analyses are found under
Gencode info.
Analyze your sequence
If the protein sequence you are interested in is not in the MisPred database, you can analyze it under
Analyze Your Sequence.
References
Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR,
Sonnhammer EL, Bateman A (2006) Pfam: clans, web tools and services. Nucleic Acids Research 34:D247-51.
Letunic I, Copley R, Schmidt S, Ciccarelli F, Doerks T, Schultz J, Ponting C, Bork P (2004) SMART 4.0: towards genomic data integration.
Nucleic Acids Research 32:D142-4.
Tordai H, Nagy A, Farkas K, Bányai L, Patthy L (2005) Modules, multidomain proteins and organismic complexity. The FEBS Journal 272:5064-78.
Watters AL, Deka P, Corrent C, Callender D, Varani G, Sosnick T, Baker D (2007) The highly cooperative folding of small naturally occurring
proteins is likely the result of natural selection. Cell 128:613-24.
Wheelan S, Marchler-Bauer A, Bryant S (2000) Domain size distributions can predict domain boundaries. Bioinformatics 16:613-618.
Wolf Y, Madej T, Babenko V, Shoemaker B, Panchenko A (2007) Long-term trends in evolution of indels in protein sequences. BMC Evolutionary
Biology 7:19.
Citing MisPred
Publications related to the MisPred project are found under
Publications.
If you wish to cite MisPred in your paper, please cite the article below:
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L (2008) Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics 9:353.