OLD COMPARATIVE MODELS Dataset


This page gives detailed information about our old comparative models dataset. The information is organized into four major sections on this page.




General description of the dataset

This database contains a total of 3375 correct models ("good" models) and 6270 incorrect models ("bad" models). The two sets of models were calculated by large-scale comparative modeling of protein chains representative of the Protein Data Bank of known protein structures (PDB; Berman et al., 2002). The models were classified as good or bad depending on their structural similarity to the actual structure of the target protein. These sets of models have been used to develop the fold assessment module of the ModPipe software (Sánchez and Sali, 1998; Melo et al., in preparation) and to optimize hundreds of statistical potentials for fold assessment (Melo et al., 2002). Distributions of some features for the good and bad models are available in Figure 1 of Melo et al., 2002. Details about how these model sets were built are described below.


Models with the correct fold ("good" models).

The good models were built based on the correct templates and mostly correct alignments between the target sequences and the template structures. The models were obtained by applying ModPipe to 1,085 chains representative of the PDB (Sali and Blundell, 1993; Sánchez and Sali, 1998). These representative sequences corresponded to the protein chains in the PDB that shared less than 30% sequence identity or were more than 30 residues different in size. The templates for comparative modeling were 1,637 PDB chains with less than 80% identity to each other or more than 30 residues difference in length. Each target sequence was aligned separately with each one of the 1,637 known structures using the program ALIGN, which implements local sequence alignment by dynamic programming (Altschul, 1998). Only the target-template alignments with a significance score higher than 22 nats (corresponding approximately to a PSI-BLAST E-value of 10E-4) were used, resulting in 3,993 models. Models with less than 30% structural overlap with the actual experimental structure were eliminated. Structural overlap was defined as the fraction of equivalent Cα atoms upon least-squares superposition of the two structures with a 3.5 Å cutoff. This procedure also removed models based on correct templates that had a poor alignment and models based on templates that had large domain or rigid body movements with respect to the target structure. The final set contained 3,375 good models.
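The structural overlap measure described above can be illustrated with a minimal sketch. It assumes the two structures have already been superposed by least squares and that position i in one coordinate list is equivalent to position i in the other. The function name, the choice of the number of equivalent pairs as denominator, and the inclusive cutoff are assumptions made for illustration, not details taken from ModPipe.

```python
import math


def structural_overlap(model_ca, target_ca, cutoff=3.5):
    """Fraction of equivalent C-alpha pairs within `cutoff` Angstroms.

    Assumes both lists hold (x, y, z) tuples that are already superposed,
    and that index i in one list is equivalent to index i in the other.
    """
    if len(model_ca) != len(target_ca):
        raise ValueError("coordinate lists must have the same length")
    if not model_ca:
        return 0.0
    # Count pairs whose Euclidean distance does not exceed the cutoff.
    close = sum(
        1 for a, b in zip(model_ca, target_ca) if math.dist(a, b) <= cutoff
    )
    return close / len(model_ca)


# Toy coordinates (hypothetical): pair distances are 1, 2, 4 and 9 Angstroms.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
target = [(0.0, 1.0, 0.0), (3.8, 0.0, 2.0), (7.6, 4.0, 0.0), (11.4, 0.0, 9.0)]
overlap = structural_overlap(model, target)
```

In this toy case two of the four pairs fall within 3.5 Å, so the overlap is 0.5 and the model would pass the 30% threshold used for the good set.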


Models with an incorrect fold ("bad" models).

The bad models were built based on a template with an incorrect fold, a template structure with large rigid body shifts, or an incorrect alignment with the correct template. The models were obtained as described above, except that only the target-template alignments with a significance score between 15 and 20 nats were used; this procedure resulted in 7,669 models for the 1,085 representative chains. Models with more than 15% structural overlap with the actual target structure were eliminated. The final set contained 6,270 bad models.
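The selection rules for the two sets can be summarized in a short sketch. It takes the alignment significance score (in nats) and the structural overlap (as a fraction between 0 and 1) as precomputed inputs. The function and constant names are hypothetical, and the handling of values that fall exactly on a threshold is an assumption, since the text does not specify it.

```python
GOOD_MIN_SCORE = 22.0            # nats; good models come from alignments scoring above this
BAD_SCORE_RANGE = (15.0, 20.0)   # nats; bad models come from alignments in this range
GOOD_MIN_OVERLAP = 0.30          # good candidates below this overlap are eliminated
BAD_MAX_OVERLAP = 0.15           # bad candidates above this overlap are eliminated


def classify_model(score_nats, overlap):
    """Return "good", "bad", or None when the model belongs to neither set."""
    if score_nats > GOOD_MIN_SCORE:
        # Candidate good model: keep it only if it resembles the target.
        return "good" if overlap >= GOOD_MIN_OVERLAP else None
    low, high = BAD_SCORE_RANGE
    if low <= score_nats <= high:
        # Candidate bad model: keep it only if it does not resemble the target.
        return "bad" if overlap <= BAD_MAX_OVERLAP else None
    # Scores outside both ranges were never modeled for either set.
    return None
```

With these rules, the counts reported above mean that 618 of the 3,993 candidate good models and 1,399 of the 7,669 candidate bad models were eliminated by the overlap filters.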


Subsets of models.

The good and bad models were each subdivided into two sets: a training set containing 400 models and a testing set containing the remaining models. The training-set models were randomly selected from the complete sets, stratified by size: 100 very small models (fewer than 50 residues), 100 small models (50-100 residues), 100 medium models (100-200 residues) and 100 large models (more than 200 residues). Thus, the training sets contain models that are representative of all protein sizes. Moreover, the average and standard deviation of the model length and percentage sequence identity distributions are equivalent between the complete initial sets and the final training sets.
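The size-stratified selection can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the published lists: the function name, the fixed random seed, and the bin boundaries (a model of exactly 100 residues is treated as medium) are all hypothetical choices.

```python
import random

# Size bins from the text; assigning exactly 100 residues to "medium"
# is an assumption, because the stated ranges overlap at that value.
SIZE_BINS = {
    "very small": lambda n: n < 50,
    "small": lambda n: 50 <= n < 100,
    "medium": lambda n: 100 <= n <= 200,
    "large": lambda n: n > 200,
}


def split_train_test(lengths, per_bin=100, seed=0):
    """Split models into (train, test) name lists.

    `lengths` maps model name -> number of residues. `per_bin` models are
    drawn at random from each size bin; all remaining models form the test set.
    """
    rng = random.Random(seed)
    train = []
    for label, in_bin in SIZE_BINS.items():
        members = sorted(name for name, n in lengths.items() if in_bin(n))
        if len(members) < per_bin:
            raise ValueError(f"not enough {label} models: {len(members)}")
        train.extend(rng.sample(members, per_bin))
    chosen = set(train)
    test = sorted(name for name in lengths if name not in chosen)
    return sorted(train), test
```

Applied to the 3,375 good models this would yield 400 training and 2,975 testing models, and to the 6,270 bad models 400 training and 5,870 testing models, matching the sizes of the list files.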




Raw data


Filename | Type | File format | Number of elements | General description
README | N.A. | plain text | N.A. | Contains a description of the models (naming procedure, modeling and structural data)
GOOD MODELS
good.dat | data | plain text | 3375 vectors | A list with the model data (15 values) for the 3375 good models (see the 'README' file)
good.rms | data | plain text | 3375 vectors | A list with the structural comparison data (11 values) between the model and the target structure for the 3375 good models (see the 'README' file)
good.pdb.tar.gz | coordinates (plain text) | PDB format | 3375 files | The PDB files of the 3375 good models (69 MB)
good.ali.tar.gz | alignments (plain text) | PIR format | 3375 files | PIR files containing the alignments used to build the 3375 good models
good.list | list | plain text | 3375 model names | A list of all the good models
good.train.list | list | plain text | 400 model names | A list of all the good models of the training set
good.test.list | list | plain text | 2975 model names | A list of all the good models of the testing set
BAD MODELS
bad.dat | data | plain text | 6270 vectors | A list with the model data (15 values) for the 6270 bad models (see the 'README' file)
bad.rms | data | plain text | 6270 vectors | A list with the structural comparison data (11 values) between the model and the target structure for the 6270 bad models (see the 'README' file)
bad.pdb.tar.gz | coordinates (plain text) | PDB format | 6270 files | The PDB files of the 6270 bad models (82 MB)
bad.ali.tar.gz | alignments (plain text) | PIR format | 6270 files | PIR files containing the alignments used to build the 6270 bad models
bad.list | list | plain text | 6270 model names | A list of all the bad models
bad.train.list | list | plain text | 400 model names | A list of all the bad models of the training set
bad.test.list | list | plain text | 5870 model names | A list of all the bad models of the testing set
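
After downloading the list files, it may be useful to confirm that the training and testing lists partition the complete list. The sketch below assumes each list file holds one model name per line; that layout is an assumption and should be checked against the 'README' file. The helper names are hypothetical.

```python
def read_names(path):
    """Read model names from a list file, assuming one name per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def check_split(all_names, train_names, test_names):
    """Return True if train and test are disjoint and together cover all names."""
    train, test = set(train_names), set(test_names)
    return (
        not (train & test)                       # no model in both sets
        and (train | test) == set(all_names)     # nothing missing, nothing extra
        and len(train_names) + len(test_names) == len(all_names)  # no duplicates
    )
```

For the good models, the call would be check_split(read_names("good.list"), read_names("good.train.list"), read_names("good.test.list")), with 400 + 2975 = 3375 names expected; for the bad models the corresponding files should give 400 + 5870 = 6270 names.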







References

Altschul, S. (1998)
Generalized affine gap costs for protein sequence alignment.
Proteins 32, 88-96.
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., and Zardecki, C. (2002)
The Protein Data Bank.
Acta Crystallogr D Biol Crystallogr. 58, 899-907.
Melo, F., Sánchez, R., and Sali, A. (2002)
Statistical potentials for fold assessment.
Protein Science 11, 430-448.
Melo, F., Sánchez, R., and Sali, A. (2004)
Automated model assessment for large-scale comparative modeling.
(in preparation).
Sali, A. and Blundell, T.L. (1993)
Comparative protein modelling by satisfaction of spatial restraints.
J. Mol. Biol. 234, 779-815.
Sánchez, R. and Sali, A. (1998)
Large-scale protein structure modeling of the Saccharomyces cerevisiae genome.
Proc. Natl. Acad. Sci. USA 95, 13597-13602.
Pieper, U., Eswar, N., Stuart, A.C., Ilyin, V.A. and Sali, A. (2002)
MODBASE, a database of annotated comparative protein structure models.
Nucleic Acids Res. 30, 255-259.