Tutorial

Basic example:
Modeling lactate dehydrogenase from Trichomonas vaginalis based on a single template

All input and output files for this example are available to download, in either zip format (for Windows) or .tar.gz format (for Unix/Linux).

A novel gene for lactate dehydrogenase was identified from the genomic sequence of Trichomonas vaginalis (TvLDH). The corresponding protein had a higher similarity to the malate dehydrogenase of the same species (TvMDH) than to any other LDH. We hypothesized that TvLDH arose from TvMDH by convergent evolution relatively recently. Comparative models were constructed for TvLDH and TvMDH to study the sequences in the structural context and to suggest site-directed mutagenesis experiments for elucidating specificity changes in this apparent case of convergent evolution of enzymatic specificity. The native and mutated enzymes were expressed and their activities were compared.

The individual modeling steps of this example are explained below. Note that we go through every step in this tutorial to build a model knowing only the amino acid sequence. In practice you may already know the related structures, and may even have an alignment from another program, so you can skip one or more steps. Alternatively, for very simple applications you may be able to use the ModWeb web server rather than Modeller itself.

1. Searching for structures related to TvLDH

First, it is necessary to put the target TvLDH sequence into the PIR format readable by MODELLER (file `TvLDH.ali').

>P1;TvLDH
sequence:TvLDH:::::::0.00: 0.00
MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGFVATTDPKA
AFKDIDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTNCEIAMLHAKNLKPEN
FSSLSMLDQNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEGKTQKVVDVLDHDYVFDTFFKKI
GHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIPVPEGNPYGIKPGVVFSFPCNVDKEGKIHVV
EGFKVNDWLREKLDFTEKDLFHEKEIALNHLAQGG*

File: TvLDH.ali

The first line contains the sequence code, in the format `>P1;code'. The second line with ten fields separated by colons generally contains information about the structure file, if applicable. Only two of these fields are used for sequences, `sequence' (indicating that the file contains a sequence without known structure) and `TvLDH' (the model file name). The rest of the file contains the sequence of TvLDH, with `*' marking its end. The standard one-letter amino acid codes are used. (Note that they must be upper case; some lower case letters are used for non-standard residues. See the file modlib/restyp.lib in the Modeller distribution for more information.)

A search for potentially related sequences of known structure can be performed by the SEQUENCE_SEARCH command of MODELLER. The following script uses the query sequence `TvLDH' assigned to the variable ALIGN_CODES from the file `TvLDH.ali', which in turn is assigned to the variable FILE (see file `seq_search.top').

SET SEARCH_RANDOMIZATIONS = 100
SET FILE = 'TvLDH.ali'
READ_SEQUENCE_DB
SEQUENCE_SEARCH ALIGN_CODES = 'TvLDH', DATA_FILE  = ON

File: seq_search.top

Modeller should be directed to run this script by using the following command at your command line (be aware that it will take some time to run - perhaps an hour or so - so be patient!):

mod7v7 seq_search.top

(You can get a command line using xterm or GNOME Terminal in Linux, Terminal in Mac OS X, or Command Prompt in Windows. For more information on running Modeller, see the manual.)

The SEQUENCE_SEARCH command has many options, but in this example only SEARCH_RANDOMIZATIONS and DATA_FILE are set to non-default values. SEARCH_RANDOMIZATIONS specifies the number of times the query sequence is randomized during the calculation of the significance score for each sequence-sequence comparison. The higher the number of randomizations, the more accurate the significance score. DATA_FILE = ON triggers creation of an additional summary output file (see `seqsearch.dat').

2. Selecting a template

The output of the `seq_search.top' script is written to the `seq_search.log' file. MODELLER always produces a log file. Errors and warnings in log files can be found by searching for the `_E>' and `_W>' strings, respectively. At the end of the log file, MODELLER lists the hits sorted by alignment significance. Because the log file is sometimes very long, a separate data file is created that contains the summary of the search. The example shows only the top 10 hits (file `seqsearch.dat'). (Note that your results may differ slightly from these, due to differences in the search database and different floating-point precision on different computers.)

   1 TvLDH      1bdmA    335  318 139  41.5  43.7  202461.   36.1 -999.0 -999.0
   2 TvLDH      1civA    335  374 115  30.7  34.3  195914.   24.7 -999.0 -999.0
   3 TvLDH      1guzA    335  305  71  21.2  23.3  161076.   10.8 -999.0 -999.0
   4 TvLDH      2cmd     335  312  78  23.3  25.0  161524.    8.9 -999.0 -999.0
   5 TvLDH      1lldA    335  313  70  20.9  22.4  161130.    8.9 -999.0 -999.0
   6 TvLDH      2hlpA    335  303  62  18.5  20.5  154717.    8.2 -999.0 -999.0
   7 TvLDH      1ceqA    335  304  68  20.3  22.4  156953.    8.0 -999.0 -999.0
   8 TvLDH      1hyeA    335  307  73  21.8  23.8  157429.    7.4 -999.0 -999.0
   9 TvLDH      6ldh     335  329  60  17.9  18.2  161816.    6.0 -999.0 -999.0
  10 TvLDH      1ez4A    335  307  61  18.2  19.9  155527.    5.9 -999.0 -999.0

File: seqsearch.dat

The most important columns in the SEQUENCE_SEARCH output are the `CODE_2', `%ID' and `SIGNI' columns. The `CODE_2' column reports the code of the PDB sequence that was compared with the target sequence. The PDB code in each line is the representative of a group of PDB sequences that share 40% or more sequence identity to each other and have less than 30 residues or 30% sequence length difference. All the members of the group can be found in the MODELLER `CHAINS_3.0_40_XN.grp' file. The `%ID1' and `%ID2' columns report the percentage sequence identities between TvLDH and a PDB sequence normalized by their lengths, respectively. In general, a `%ID' value above approximately 25% indicates a potential template unless the alignment is short (i.e., less than 100 residues). A better measure of the significance of the alignment is given by the `SIGNI' column. A value above 6.0 is generally significant irrespective of the sequence identity and length. In this example, one protein family represented by 1bdm:A shows significant similarity with the target sequence, at more than 40% sequence identity. While some other hits are also significant, the differences between 1bdm:A and other top scoring hits are so pronounced that we use only the first hit as the template. As expected, 1bdm:A is a malate dehydrogenase (from a thermophilic bacterium). Other structures closely related to 1bdm:A (and thus not scanned against by SEQUENCE_SEARCH) can be extracted from the `CHAINS_3.0_40_XN.grp' file: 1b8v:A, 1bmd:A, 1b8u:A, 1b8p:A, 1bdm:A, 1bdm:B, 4mdh:A, 5mdh:A, 7mdh:A, 7mdh:B, and 7mdh:C. All these proteins are malate dehydrogenases. During the project, all of them and other malate and lactate dehydrogenase structures were compared and considered as templates (there were 19 structures in total).However, for the sake of illustration, we will investigate only four of the proteins that are sequentially most similar to the target, 1bmd:A, 4mdh:A, 5mdh:A, and 7mdh:A. The following script performs all pairwise comparisons among the selected proteins (file `compare.top').

READ_ALIGNMENT FILE = '$(LIB)/CHAINS_all.seq',;
	ALIGN_CODES = '1bmdA' '4mdhA' '5mdhA' '7mdhA'
MALIGN
MALIGN3D
COMPARE
ID_TABLE 
DENDROGRAM

File: compare.top

The READ_ALIGNMENT command reads the protein sequences and information about their PDB files. MALIGN calculates their multiple sequence alignment, used as the starting point for the multiple structure alignment. The MALIGN3D command performs an iterative least-squares superposition of the four 3D structures. COMPARE command compares the structures according to the alignment constructed by MALIGN3D. It does not make an alignment, but it calculates the RMS and DRMS deviations between atomic positions and distances, differences between the mainchain and sidechain dihedral angles, percentage sequence identities, and several other measures. Finally, the ID_TABLE command writes a file with pairwise sequence distances that can be used directly as the input to the DENDROGRAM command (or the clustering programs in the PHYLIP package). DENDROGRAM calculates a clustering tree from the input matrix of pairwise distances, which helps visualizing differences among the template candidates. Excerpts from the log file are shown below (file `compare.log').

Sequence identity comparison (ID_TABLE):

   Diagonal       ... number of residues;
   Upper triangle ... number of identical residues;
   Lower triangle ... % sequence identity, id/min(length).

         1bmdA @14mdhA @25mdhA @27mdhA @2
1bmdA @1      327     168     168     159
4mdhA @2       51     333     328     137
5mdhA @2       51      98     333     138
7mdhA @2       49      41      41     351


        .---------------------------------------------------- 1bmdA @1.9    49.0000
        |
        |                                                .--- 4mdhA @2.5     2.0000
        |                                                |
  .---------------------------------------------------------- 5mdhA @2.4    55.5000
  |
.------------------------------------------------------------ 7mdhA @2.4

+----+----+----+----+----+----+----+----+----+----+----+----+

Excerpts of the file compare.log

The comparison above shows that 5mdh:A and 4mdh:A are almost identical, both sequentially and structurally. They were solved at similar resolutions, 2.4 and 2.5Å, respectively. However, 4mdh:A has a better crystallographic R-factor (16.7 versus 20%), eliminating 5mdh:A. Inspection of the PDB file for 7mdh:A reveals that its crystallographic refinement was based on 1bmd:A. In addition, 7mdh:A was refined at a lower resolution than 1bmdA (2.4 versus 1.9), eliminating 7mdh:A. These observations leave only 1bmd:A and 4mdh:A as potential templates. Finally, 4mdh:A is selected because of the higher overall sequence similarity to the target sequence.

3. Aligning TvLDF with the template

A good way of aligning the sequence of TvLDH with the structure of 4mdh:A is the ALIGN2D command in MODELLER. Although ALIGN2D is based on a dynamic programming algorithm, it is different from standard sequence sequence alignment methods because it takes into account structural information from the template when constructing an alignment. This task is achieved through a variable gap penalty function that tends to place gaps in solvent exposed and curved regions, outside secondary structure segments, and between two positions that are close in space. As a result, the alignment errors are reduced by approximately one third relative to those that occur with standard sequence alignment techniques. This improvement becomes more important as the similarity between the sequences decreases and the number of gaps increases. In the current example, the template-target similarity is so high that almost any alignment method with reasonable parameters will result in the same alignment. The following MODELLER script aligns the TvLDH sequence in file `TvLDH.ali' with the 4mdh:A structure in the PDB file `4mdh.pdb' (file `align2d.top').

READ_MODEL FILE = '4mdh'
SEQUENCE_TO_ALI ALIGN_CODES = '4mdh'
READ_ALIGNMENT FILE = 'TvLDH.ali', ALIGN_CODES = ALIGN_CODES 'TvLDH', ADD_SEQUENCE = on
ALIGN2D
WRITE_ALIGNMENT FILE='TvLDH-4mdhA.ali', ALIGNMENT_FORMAT = 'PIR'
WRITE_ALIGNMENT FILE='TvLDH-4mdhA.pap', ALIGNMENT_FORMAT = 'PAP'

File: align2d.top

In the first line, MODELLER reads the 4mdh structure file. The SEQUENCE_TO_ALI command transfers the sequence to the alignment array and assigns it the name of `4mdh' (ALIGN_CODES). The third line reads the TvLDH sequence from file `TvLDH.seq', assigns it the name `TvLDH' (ALIGN_CODES) and adds it to the alignment array (`ADD_SEQUENCE = ON'). The fourth line executes the ALIGN2D command to perform the alignment. Finally, the alignment is written out in two formats, PIR (`TvLDH-4mdhA.ali') and PAP (`TvLDH-4mdhA.pap'). The PIR format is used by MODELLER in the subsequent model building stage. The PAP alignment format is easier to inspect visually. Due to the high target-template similarity, there are only a few gaps in the alignment. In the PAP format, all identical positions are marked with a `*' (file `TvLDH-4mdhA.pap').

 _aln.pos         10        20        30        40        50        60
4mdh      -SEPIRVLVTGAAGQIAYSLLYSIGNGSVFGKDQPIILVLLDITPMMGVLDGVLMELQDCALPLLKDV 
TvLDH     MSEAAHVLITGAAGQIGYILSHWIASGELYG-DRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGF 
 _consrvd  **   ** ******* * *   *  *   * *    * **** * *  *    *** *** * *


 _aln.p   70        80        90       100       110       120       130
4mdh      IATDKEEIAFKDLDVAILVGSMPRRDGMERKDLLKANVKIFKCQGAALDKYAKKSVKVIVVGNPANTN 
TvLDH     VATTDPKAAFKDIDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTN 
 _consrvd  **     **** * * ** ***   *  * **   *  ***  *  * * ** **** * *** ***


 _aln.pos  140       150       160       170       180       190       200
4mdh      CLTASKSAPSIPKENFSCLTRLDHNRAKAQIALKLGVTSDDVKNVIIWGNHSSTQYPDVNHAKVKLQA 
TvLDH     CEIAMLHAKNLKPENFSSLSMLDQNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEG 
 _consrvd *  *   *     **** *  ** ***    * ****   **   * ****      *   *


 _aln.pos    210       220       230       240       250       260       270
4mdh      KEVGVYEAVKDDSWLKGEFITTVQQRGAAVIKARKLSSAMSAAKAICDHVRDIWFGTPEGEFVSMGII 
TvLDH     KTQKVVDVLDHD-YVFDTFFKKIGHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIP 
 _consrvd *   *      *      *      *       *   ** *  **   *     ***  **  ****


 _aln.pos      280       290       300       310       320       330
4mdh      SD-GNSYGVPDDLLYSFPVTIK-DKTWKIVEGLPINDFSREKMDLTAKELAEEKETAFEFLSSA- 
TvLDH     VPEGNPYGIKPGVVFSFPCNVDKEGKIHVVEGFKVNDWLREKLDFTEKDLFHEKEIALNHLAQGG 
 _consrvd    ** **       ***           ***   **  *** * * * *  *** *   *

File: TvLDH-4mdhA.pap

4. Model building

Once a target-template alignment is constructed, MODELLER calculates a 3D model of the target completely automaticaly. The following script will generate five similar models of TvLDH based on the 4mdh template structure and the alignment in file `TvLDH-4mdhA.ali' (file `model-single.top').

INCLUDE
SET ALNFILE = 'TvLDH-4mdhA.ali'
SET KNOWNS = '4mdh'
SET SEQUENCE = 'TvLDH'
SET STARTING_MODEL = 1
SET ENDING_MODEL = 5
CALL ROUTINE = 'model'

File: model-single.top

The first line includes many standard variable and routine definitions. The following five lines set parameter values for the `model' routine. ALNFILE names the file that contains the target-template alignment in the PIR format. KNOWNS defines the known template structure(s) in ALNFILE (`TvLDH-4mdh.ali'). SEQUENCE defines the name of the target sequence in ALNFILE. STARTING_MODEL and ENDING_MODEL define the number of models that are calculated (their indices will run from 1 to 5). The last line in the file calls the `model' routine that actually calculates the models. The most important output files are `model-single.log', which reports warnings, errors and other useful information including the input restraints used for modeling that remain violated in the final model, and `TvLDH.B99990001.pdb', which contains the model coordinates in the PDB format. The model can be viewed by any program that reads the PDB format, such as Chimera.

5. Model evaluation

If several models are calculated for the same target, the ``best'' model can be selected by picking the model with the lowest value of the MODELLER objective function, which is reported in the second line of the model PDB file. The value of the objective function in MODELLER is not an absolute measure in the sense that it can only be used to rank models calculated from the same alignment.

Once a final model is selected, there are many ways to assess it. In this example, PROSAII is used to evaluate the model fold and PROCHECK is used to check the model's stereochemistry. Before any external evaluation of the model, one should check the log file from the modeling run for runtime errors (`model-single.log') and restraint violations (see the MODELLER manual for details). Both PROSAII and PROCHECK confirm that a reasonable model was obtained, with a Z-score comparable to that of the template (-10.53 and -12.69 for the model and the template, respectively). However, the PROSAII energy profile indicates a possible error in the long active site loop between residues 90 and 100.

PROSAII profile for template 4mdh:A

PROSAII profile for model TvLDH.B99990001

This loop interacts with region 220-250, that forms the other half of the active site. This latter part is well resolved in the template and probably correctly modeled in the target structure, but due to the unfavourable non-bonded interactions with the 90-100 region, it is also reported to be in error by PROSAII. In general, an error indicated by PROSAII is not neccessarily an actual error, especially if it highlights an active site or a protein-protein interface. However, in this case, the same active site loops have a better profile in the template structure, which strengthens the assessment that the model is probably incorrect in the active site region. This problem is addressed in the advanced modeling tutorial.