The output of the `search.top' script is written to the `search.log' file. MODELLER always produces a log file. Errors and warnings in log files can be found by searching for the `_E>' and `_W>' strings, respectively. At the end of the log file, MODELLER lists the hits sorted by alignment significance. Because the log file can be very long, a separate data file, `search.dat', is created that contains the summary of the search. Only the top 10 hits from that file are shown here.
      # CODE_1 CODE_2   LEN1   LEN2    NID   %ID1   %ID2     SCORE  SIGNI
    -----------------------------------------------------------------
      1 TvLDH  1bdmA     335    318    153   45.7   48.1   212557.   28.9
      2 TvLDH  1lldA     335    313    103   30.7   32.9   183190.   10.1
      3 TvLDH  1ceqA     335    304     95   28.4   31.3   179636.    9.2
      4 TvLDH  2hlpA     335    303     86   25.7   28.4   177791.    8.9
      5 TvLDH  1ldnA     335    316     91   27.2   28.8   180669.    7.4
      6 TvLDH  1hyhA     335    297     88   26.3   29.6   175969.    6.9
      7 TvLDH  2cmd      335    312    108   32.2   34.6   182079.    6.6
      8 TvLDH  1db3A     335    335     91   27.2   27.2   181928.    4.9
      9 TvLDH  9ldtA     335    331     95   28.4   28.7   181720.    4.7
     10 TvLDH  1cdb      335    105     69   20.6   65.7    80141.    3.8
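The log-file check mentioned before the table (searching for `_E>' and `_W>') is easy to automate. The following is a minimal Python sketch, not part of MODELLER; the function name `find_log_messages` and the sample lines are hypothetical placeholders invented for illustration, and real MODELLER messages will look different.

```python
def find_log_messages(lines):
    """Split log lines into errors ('_E>') and warnings ('_W>')."""
    errors = [ln.rstrip("\n") for ln in lines if "_E>" in ln]
    warnings = [ln.rstrip("\n") for ln in lines if "_W>" in ln]
    return errors, warnings


# Placeholder lines only: these are NOT real MODELLER output.
sample_log = [
    "some ordinary log line\n",
    "placeholder_W> example warning text\n",
    "another ordinary log line\n",
    "placeholder_E> example error text\n",
]

errors, warnings = find_log_messages(sample_log)
print(f"{len(errors)} error(s), {len(warnings)} warning(s)")
for message in errors + warnings:
    print(message)
```

To scan an actual file, pass an open file object instead of the list, for example `find_log_messages(open("search.log"))`.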
The most important columns in the SEQUENCE_SEARCH output are the `CODE_2', `%ID1', `%ID2' and `SIGNI' columns. The `CODE_2' column reports the code of the PDB sequence that was compared with the target sequence. The PDB code in each line is the representative of a group of PDB sequences that share 40% or more sequence identity to each other and differ in length by less than 30 residues or 30%. All the members of the group can be found in the MODELLER `CHAINS_3.0_40_XN.grp' file. The `%ID1' and `%ID2' columns report the percentage sequence identity between TvLDH and a PDB sequence, normalized by the length of TvLDH (`LEN1') and by the length of the PDB sequence (`LEN2'), respectively; for the first hit, 153 identical residues give 153/335 = 45.7% and 153/318 = 48.1%. In general, a `%ID' value above approximately 25% indicates a potential template unless the alignment is short (i.e., fewer than 100 residues). A better measure of the significance of the alignment is given by the `SIGNI' column [72]. A value above 6.0 is generally significant irrespective of the sequence identity and length.
In this example, one protein family represented by 1bdmA shows significant similarity with the target sequence, at more than 40% sequence identity. While some other hits are also significant, the differences between 1bdmA and the other top-scoring hits are so pronounced that we use only the first hit as the template. As expected, 1bdmA is a malate dehydrogenase (from a thermophilic bacterium). Other structures closely related to 1bdmA (and thus not scanned against by SEQUENCE_SEARCH) can be extracted from the `CHAINS_3.0_40_XN.grp' file: 1b8vA, 1bmdA, 1b8uA, 1b8pA, 1bdmA, 1bdmB, 4mdhA, 5mdhA, 7mdhA, 7mdhB, and 7mdhC. All these proteins are malate dehydrogenases. During the project, all of them and other malate and lactate dehydrogenase structures were compared and considered as templates (there were 19 structures in total). However, for the sake of illustration, we will investigate only four of the proteins that are most similar in sequence to the target: 1bmdA, 4mdhA, 5mdhA, and 7mdhA.
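The selection rules described above can be applied to the `search.dat' rows mechanically. The sketch below is a hedged illustration in Python rather than anything MODELLER provides: the names `Hit`, `parse_hits`, `percent_identities` and `SIGNI_CUTOFF` are assumptions made for this example. It recomputes `%ID1' and `%ID2' from `NID', `LEN1' and `LEN2', and keeps the hits whose `SIGNI' exceeds 6.0.

```python
from typing import NamedTuple

# Rows copied from the search.dat excerpt (whitespace-separated columns).
SEARCH_DAT = """\
 1 TvLDH 1bdmA 335 318 153 45.7 48.1 212557. 28.9
 2 TvLDH 1lldA 335 313 103 30.7 32.9 183190. 10.1
 3 TvLDH 1ceqA 335 304  95 28.4 31.3 179636.  9.2
 4 TvLDH 2hlpA 335 303  86 25.7 28.4 177791.  8.9
 5 TvLDH 1ldnA 335 316  91 27.2 28.8 180669.  7.4
 6 TvLDH 1hyhA 335 297  88 26.3 29.6 175969.  6.9
 7 TvLDH 2cmd  335 312 108 32.2 34.6 182079.  6.6
 8 TvLDH 1db3A 335 335  91 27.2 27.2 181928.  4.9
 9 TvLDH 9ldtA 335 331  95 28.4 28.7 181720.  4.7
10 TvLDH 1cdb  335 105  69 20.6 65.7  80141.  3.8
"""


class Hit(NamedTuple):
    rank: int
    code_1: str
    code_2: str
    len1: int
    len2: int
    nid: int
    pct_id1: float
    pct_id2: float
    score: float
    signi: float


def parse_hits(text):
    """Parse data rows; header, separator and blank lines are skipped."""
    hits = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 10 or not f[0].isdigit():
            continue
        hits.append(Hit(int(f[0]), f[1], f[2], int(f[3]), int(f[4]),
                        int(f[5]), float(f[6]), float(f[7]),
                        float(f[8]), float(f[9])))
    return hits


def percent_identities(hit):
    """Identity normalized by the target length and by the PDB length."""
    return 100.0 * hit.nid / hit.len1, 100.0 * hit.nid / hit.len2


SIGNI_CUTOFF = 6.0  # threshold quoted in the text

hits = parse_hits(SEARCH_DAT)
significant = [h for h in hits if h.signi > SIGNI_CUTOFF]

for h in significant:
    id1, id2 = percent_identities(h)
    print(f"{h.code_2:6s} SIGNI={h.signi:5.1f} %ID1={id1:5.1f} %ID2={id2:5.1f}")
```

Note that the last row (1cdb) has a high `%ID2' only because the PDB sequence is short (105 residues); its `SIGNI' value is well below the cutoff, which is why significance is the more reliable filter.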
The following script performs all pairwise comparisons among the selected proteins (file `compare.top').
    READ_ALIGNMENT FILE = '$(LIB)/CHAINS_all.seq',;
                   ALIGN_CODES = '1bmdA' '4mdhA' '5mdhA' '7mdhA'
    MALIGN
    MALIGN3D
    COMPARE
    ID_TABLE
    DENDROGRAM
The READ_ALIGNMENT command reads the protein sequences and information about their PDB files. MALIGN calculates their multiple sequence alignment, which is used as the starting point for the multiple structure alignment. The MALIGN3D command performs an iterative least-squares superposition of the four 3D structures. The COMPARE command compares the structures according to the alignment constructed by MALIGN3D. It does not make an alignment, but it calculates the RMS and DRMS deviations between atomic positions and distances, differences between the mainchain and sidechain dihedral angles, percentage sequence identities, and several other measures. Finally, the ID_TABLE command writes a file with pairwise sequence distances that can be used directly as the input to the DENDROGRAM command (or the clustering programs in the PHYLIP package [42]). DENDROGRAM calculates a clustering tree from the input matrix of pairwise distances, which helps in visualizing differences among the template candidates. Excerpts from the log file are shown below (file `compare.log').
    >> Least-squares superposition (FIT)           : T
       Atom types for superposition/RMS (FIT_ATOMS): CA
       Atom type for position average/variability (DISTANCE_ATOMS[1]): CA

       Position comparison (FIT_ATOMS):

           Cutoff for RMS calculation: 3.5000

           Upper = RMS, Lower = numb equiv positions

                1bmdA   4mdhA   5mdhA   7mdhA
    1bmdA       0.000   1.038   0.979   0.992
    4mdhA         310   0.000   0.504   1.210
    5mdhA         308     329   0.000   1.173
    7mdhA         320     306     307   0.000

    >> Sequence comparison:

           Diag=numb res, Upper=numb equiv res, Lower = % seq ID

                1bmdA   4mdhA   5mdhA   7mdhA
    1bmdA         327     168     168     158
    4mdhA          51     333     328     137
    5mdhA          51      98     333     138
    7mdhA          48      41      41     351

            .---------------------------------------------------- 1bmdA @1.9
            |
            |                                                .--- 4mdhA @2.5
            |                                                |
          .---------------------------------------------------------- 5mdhA @2.4
          |
        .------------------------------------------------------------ 7mdhA @2.4
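The grouping in the dendrogram can be reproduced approximately from the `% seq ID' table. The Python sketch below makes two assumptions that are not confirmed by the text: that the pairwise distance is 100 minus the percentage identity, and that the tree is built by weighted pair-group average clustering (the distance from a merged cluster to any other cluster is the mean of the two distances being merged). The function name `wpgma` is hypothetical, and MODELLER's DENDROGRAM command may differ in its details.

```python
from itertools import combinations

names = ["1bmdA", "4mdhA", "5mdhA", "7mdhA"]

# Lower triangle of the "% seq ID" table in compare.log.
percent_id = {
    ("1bmdA", "4mdhA"): 51, ("1bmdA", "5mdhA"): 51, ("1bmdA", "7mdhA"): 48,
    ("4mdhA", "5mdhA"): 98, ("4mdhA", "7mdhA"): 41, ("5mdhA", "7mdhA"): 41,
}

# Assumed distance measure: 100 - percentage identity.
distance = {pair: 100 - pid for pair, pid in percent_id.items()}


def wpgma(names, dist):
    """Agglomerative clustering; returns (cluster_a, cluster_b, height) merges."""
    clusters = [(n,) for n in names]  # each cluster is a tuple of member names
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        merged = a + b
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # Weighted average: both merged branches count equally.
            d[frozenset((merged, c))] = 0.5 * (d[frozenset((a, c))]
                                               + d[frozenset((b, c))])
        clusters = rest + [merged]
    return merges


merges = wpgma(names, distance)
for a, b, height in merges:
    print(f"join {'+'.join(a)} with {'+'.join(b)} at distance {height}")
```

Under these assumptions, 4mdhA and 5mdhA are joined first, 1bmdA joins them next, and 7mdhA is added last, which is the same branching order as in the dendrogram above.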
The comparison above shows that 5mdhA and 4mdhA are almost identical, both in sequence and in structure. They were solved at similar resolutions, 2.4 and 2.5Å, respectively. However, 4mdhA has a better crystallographic R-factor (16.7 versus 20%), eliminating 5mdhA. Inspection of the PDB file for 7mdhA reveals that its crystallographic refinement was based on 1bmdA. In addition, 7mdhA was refined at a lower resolution than 1bmdA (2.4 versus 1.9Å), eliminating 7mdhA. These observations leave only 1bmdA and 4mdhA as potential templates. Finally, 4mdhA is selected because of its higher overall sequence similarity to the target sequence.
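For readers unfamiliar with the two structural measures reported by COMPARE, the following Python sketch illustrates the standard textbook definitions of the positional RMS deviation and the distance RMS (DRMS) deviation. It is an illustration under stated assumptions, not MODELLER's implementation: the function names `rms` and `drms` are hypothetical, the toy coordinates are invented, no superposition is performed, and the 3.5Å cutoff that appears in the log is not applied.

```python
import math
from itertools import combinations


def rms(coords_a, coords_b):
    """RMS deviation between equivalent atom positions (no fitting done)."""
    sq = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(sq) / len(sq))


def drms(coords_a, coords_b):
    """RMS deviation between equivalent intramolecular distances."""
    pairs = list(combinations(range(len(coords_a)), 2))
    sq = [(math.dist(coords_a[i], coords_a[j])
           - math.dist(coords_b[i], coords_b[j])) ** 2 for i, j in pairs]
    return math.sqrt(sum(sq) / len(sq))


# Invented toy coordinates (four points), for illustration only.
a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0), (0.0, 4.0, 0.0)]
# The same shape, translated by 1 unit along x.
b = [(x + 1.0, y, z) for x, y, z in a]

print("RMS :", rms(a, b))
print("DRMS:", drms(a, b))
```

The translated copy has a non-zero RMS but a zero DRMS, because DRMS depends only on internal distances. This is why positional RMS is meaningful only after a superposition such as the one MALIGN3D performs, whereas DRMS does not require one.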