Selecting a template

The output of the `search.top' script is written to the `search.log' file. MODELLER always produces a log file. Errors and warnings in log files can be found by searching for the `_E>' and `_W>' strings, respectively. At the end of the log file, MODELLER lists the hits sorted by alignment significance. Because the log file is sometimes very long, a separate data file is created that contains the summary of the search. The example shows only the top 10 hits (file `search.dat').
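The search for `_E>' and `_W>' strings described above can be automated. The following is a minimal Python sketch, not part of MODELLER; the function name `find_log_messages' and the three sample lines are invented for illustration, and real log messages will look different.

```python
def find_log_messages(lines):
    """Return (errors, warnings) as lists of (line_number, text) tuples.

    A line counts as an error if it contains '_E>' and as a warning
    if it contains '_W>', following the convention described in the text.
    """
    errors, warnings = [], []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if "_E>" in text:
            errors.append((number, text))
        if "_W>" in text:
            warnings.append((number, text))
    return errors, warnings


# Made-up sample lines (NOT real MODELLER output), used only to show the idea.
sample_log = [
    "example_123_> an informational line (made up)\n",
    "example_W> a made-up warning\n",
    "example_E> a made-up error\n",
]

errors, warnings = find_log_messages(sample_log)
for number, text in errors + warnings:
    print(number, text)

# With a real log file one could write, for example:
#     with open("search.log") as handle:
#         errors, warnings = find_log_messages(handle)
```

The summary of hits written to `search.dat' for this example follows.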

  # CODE_1      CODE_2  LEN1 LEN2 NID  %ID1  %ID2    SCORE  SIGNI 
-----------------------------------------------------------------
   1 TvLDH      1bdmA    335  318 153  45.7  48.1  212557.   28.9 
   2 TvLDH      1lldA    335  313 103  30.7  32.9  183190.   10.1 
   3 TvLDH      1ceqA    335  304  95  28.4  31.3  179636.    9.2 
   4 TvLDH      2hlpA    335  303  86  25.7  28.4  177791.    8.9 
   5 TvLDH      1ldnA    335  316  91  27.2  28.8  180669.    7.4 
   6 TvLDH      1hyhA    335  297  88  26.3  29.6  175969.    6.9 
   7 TvLDH      2cmd     335  312 108  32.2  34.6  182079.    6.6 
   8 TvLDH      1db3A    335  335  91  27.2  27.2  181928.    4.9 
   9 TvLDH      9ldtA    335  331  95  28.4  28.7  181720.    4.7 
  10 TvLDH      1cdb     335  105  69  20.6  65.7   80141.    3.8

The most important columns in the SEQUENCE_SEARCH output are the `CODE_2', `%ID1', `%ID2' and `SIGNI' columns. The `CODE_2' column reports the code of the PDB sequence that was compared with the target sequence. The PDB code in each line is the representative of a group of PDB sequences that share 40% or more sequence identity with each other and differ in length by fewer than 30 residues or 30% of the sequence length. All the members of the group can be found in the MODELLER `CHAINS_3.0_40_XN.grp' file. The `%ID1' and `%ID2' columns report the percentage sequence identity between TvLDH and a PDB sequence, normalized by the length of TvLDH (`LEN1') and of the PDB sequence (`LEN2'), respectively. In general, a `%ID' value above approximately 25% indicates a potential template unless the alignment is short (i.e., fewer than 100 residues). A better measure of the significance of the alignment is given by the `SIGNI' column [72]. A value above 6.0 is generally significant irrespective of the sequence identity and length.

In this example, one protein family represented by 1bdmA shows significant similarity with the target sequence, at more than 40% sequence identity. While some other hits are also significant, the differences between 1bdmA and the other top-scoring hits are so pronounced that we use only the first hit as the template. As expected, 1bdmA is a malate dehydrogenase (from a thermophilic bacterium).

Other structures closely related to 1bdmA (and thus not scanned by SEQUENCE_SEARCH) can be extracted from the `CHAINS_3.0_40_XN.grp' file: 1b8vA, 1bmdA, 1b8uA, 1b8pA, 1bdmA, 1bdmB, 4mdhA, 5mdhA, 7mdhA, 7mdhB, and 7mdhC. All these proteins are malate dehydrogenases. During the project, all of them and other malate and lactate dehydrogenase structures were compared and considered as templates (there were 19 structures in total). However, for the sake of illustration, we will investigate only four of the proteins that are most similar in sequence to the target: 1bmdA, 4mdhA, 5mdhA, and 7mdhA.
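To make the reading of the `search.dat' columns concrete, here is a small Python sketch that parses the ten rows shown above, recomputes the percentage identity from `NID' and a sequence length, and lists the hits whose `SIGNI' exceeds 6.0. It is an illustration only: the function names are invented here, and the whitespace-separated layout is assumed from the excerpt rather than taken from a format specification.

```python
# Rows copied from the search.dat excerpt shown earlier in this section.
SEARCH_DAT = """\
  # CODE_1      CODE_2  LEN1 LEN2 NID  %ID1  %ID2    SCORE  SIGNI
-----------------------------------------------------------------
   1 TvLDH      1bdmA    335  318 153  45.7  48.1  212557.   28.9
   2 TvLDH      1lldA    335  313 103  30.7  32.9  183190.   10.1
   3 TvLDH      1ceqA    335  304  95  28.4  31.3  179636.    9.2
   4 TvLDH      2hlpA    335  303  86  25.7  28.4  177791.    8.9
   5 TvLDH      1ldnA    335  316  91  27.2  28.8  180669.    7.4
   6 TvLDH      1hyhA    335  297  88  26.3  29.6  175969.    6.9
   7 TvLDH      2cmd     335  312 108  32.2  34.6  182079.    6.6
   8 TvLDH      1db3A    335  335  91  27.2  27.2  181928.    4.9
   9 TvLDH      9ldtA    335  331  95  28.4  28.7  181720.    4.7
  10 TvLDH      1cdb     335  105  69  20.6  65.7   80141.    3.8
"""


def parse_hits(text):
    """Parse data rows (10 whitespace-separated fields, first one an integer)."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skips the header and the dashed separator
        hits.append({
            "code": fields[2],
            "len1": int(fields[3]),
            "len2": int(fields[4]),
            "nid": int(fields[5]),
            "signi": float(fields[9]),
        })
    return hits


def percent_identity(nid, length):
    """Identical residues as a percentage of one sequence's length."""
    return 100.0 * nid / length


def significant_hits(hits, min_signi=6.0):
    """Codes of hits whose SIGNI value exceeds the threshold from the text."""
    return [hit["code"] for hit in hits if hit["signi"] > min_signi]


hits = parse_hits(SEARCH_DAT)
top = hits[0]
print(top["code"],
      round(percent_identity(top["nid"], top["len1"]), 1),
      round(percent_identity(top["nid"], top["len2"]), 1))
print(significant_hits(hits))
```

For the first row, 153 identical residues out of 335 and 318 reproduce the tabulated 45.7% and 48.1%. Seven hits pass the 6.0 cutoff, which is why the choice among them still has to be made by inspection, as described above.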
The following script performs all pairwise comparisons among the selected proteins (file `compare.top').

READ_ALIGNMENT FILE = '$(LIB)/CHAINS_all.seq',;
     ALIGN_CODES = '1bmdA' '4mdhA' '5mdhA' '7mdhA'
MALIGN
MALIGN3D
COMPARE
ID_TABLE 
DENDROGRAM

The READ_ALIGNMENT command reads the protein sequences and information about their PDB files. MALIGN calculates their multiple sequence alignment, which is used as the starting point for the multiple structure alignment. The MALIGN3D command performs an iterative least-squares superposition of the four 3D structures. The COMPARE command compares the structures according to the alignment constructed by MALIGN3D. It does not make an alignment, but it calculates the RMS and DRMS deviations between atomic positions and distances, differences between the mainchain and sidechain dihedral angles, percentage sequence identities, and several other measures. Finally, the ID_TABLE command writes a file with pairwise sequence distances that can be used directly as the input to the DENDROGRAM command (or the clustering programs in the PHYLIP package [42]). DENDROGRAM calculates a clustering tree from the input matrix of pairwise distances, which helps to visualize differences among the template candidates. Excerpts from the log file are shown below (file `compare.log').
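As a rough illustration of the two deviation measures named above, the Python sketch below computes an RMS deviation between equivalent atomic positions and a DRMS deviation between equivalent intramolecular distances. It is not MODELLER's implementation: it assumes the two coordinate sets are already superposed and listed in equivalent order, it applies no distance cutoff, and the three-point coordinates are toy values invented for the example.

```python
import math


def rms(coords_a, coords_b):
    """RMS deviation between equivalent positions of two superposed structures."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def drms(coords_a, coords_b):
    """RMS difference between equivalent intramolecular distances.

    Because it compares distances within each structure, the result
    does not depend on how the two structures are superposed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total, pairs = 0.0, 0
    for i in range(len(coords_a)):
        for j in range(i + 1, len(coords_a)):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


# Toy data: structure B is structure A translated by 1 unit along z.
struct_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
struct_b = [(x, y, z + 1.0) for x, y, z in struct_a]

print(rms(struct_a, struct_b))   # positions differ by the translation
print(drms(struct_a, struct_b))  # internal distances are unchanged
```

The toy case shows why both measures are useful: a rigid shift changes the positional RMS but leaves the DRMS at zero. In the `compare.log' excerpt that follows, the RMS values appear in the upper triangle of the position comparison table.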

>> Least-squares superposition (FIT)           :       T

   Atom types for superposition/RMS (FIT_ATOMS): CA
   Atom type for position average/variability (DISTANCE_ATOMS[1]): CA

   Position comparison (FIT_ATOMS): 

       Cutoff for RMS calculation:     3.5000

       Upper = RMS, Lower = numb equiv positions

           1bmdA   4mdhA   5mdhA   7mdhA   
1bmdA      0.000   1.038   0.979   0.992
4mdhA        310   0.000   0.504   1.210
5mdhA        308     329   0.000   1.173
7mdhA        320     306     307   0.000

>> Sequence comparison: 

       Diag=numb res, Upper=numb equiv res, Lower = % seq ID

            1bmdA   4mdhA   5mdhA   7mdhA   
1bmdA         327     168     168     158
4mdhA          51     333     328     137
5mdhA          51      98     333     138
7mdhA          48      41      41     351

         .---------------------------------------------------- 1bmdA @1.9
         |
         |                                                .--- 4mdhA @2.5
         |                                                |
   .---------------------------------------------------------- 5mdhA @2.4
   |
 .------------------------------------------------------------ 7mdhA @2.4
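The way a clustering tree can be derived from the pairwise table above is sketched below in Python. The sketch makes two assumptions that are not stated in the log: the distance between two sequences is taken as 100 minus their percentage identity, and clusters are joined by average linkage. The DENDROGRAM command may use a different distance or clustering rule, so the merge heights here are illustrative only.

```python
# Percentage sequence identities from the lower triangle of the table above.
PERCENT_ID = {
    ("1bmdA", "4mdhA"): 51,
    ("1bmdA", "5mdhA"): 51,
    ("1bmdA", "7mdhA"): 48,
    ("4mdhA", "5mdhA"): 98,
    ("4mdhA", "7mdhA"): 41,
    ("5mdhA", "7mdhA"): 41,
}


def distance(a, b):
    """Assumed distance between two sequences: 100 minus % identity."""
    key = (a, b) if (a, b) in PERCENT_ID else (b, a)
    return 100 - PERCENT_ID[key]


def cluster_distance(c1, c2):
    """Average-linkage distance: mean over all cross-cluster pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def average_linkage(names):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = average_linkage(["1bmdA", "4mdhA", "5mdhA", "7mdhA"])
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 1))
```

Under these assumptions 4mdhA and 5mdhA join first, 1bmdA joins that pair next, and 7mdhA joins last.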

The comparison above shows that 5mdhA and 4mdhA are almost identical, both in sequence and in structure. They were solved at similar resolutions, 2.4 and 2.5 Å, respectively. However, 4mdhA has a better crystallographic R-factor (16.7% versus 20%), eliminating 5mdhA. Inspection of the PDB file for 7mdhA reveals that its crystallographic refinement was based on 1bmdA. In addition, 7mdhA was refined at a lower resolution than 1bmdA (2.4 versus 1.9 Å), eliminating 7mdhA. These observations leave only 1bmdA and 4mdhA as potential templates. Finally, 4mdhA is selected because of its higher overall sequence similarity to the target sequence.

Andras Fiser
2001-08-09