Modeling with cryo-EM


Step 1: Search for suitable templates

For this step of the tutorial, all input and output files can be found in the step1_search directory of the file archive available on the index page.

The very first thing to do is to put the 'target' sequence of TvLDH into a form that can be read by MODELLER. The file TvLDH.ali can be created with any text editor (e.g. Windows Notepad, Mac TextEdit, Linux gedit, emacs, vi etc.) Note that this file, like most of the files used by MODELLER, should be plain text - do not use a word processor (e.g. Microsoft Word, OpenOffice) to create it! (If using TextEdit, select "Make Plain Text" from the "Format" menu; if using Notepad, make sure you select "ANSI" encoding when you save the file, not "Unicode" or "UTF-8".)

>P1;TvLDH
sequence:::::::::
MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGFVATTDPKA
AFKDIDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTNCEIAMLHAKNLKPEN
FSSLSMLDQNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEGKTQKVVDVLDHDYVFDTFFKKI
GHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIPVPEGNPYGIKPGVVFSFPCNVDKEGKIHVV
EGFKVNDWLREKLDFTEKDLFHEKEIALNHLAQGG*

File: TvLDH.ali

This file is in the PIR alignment format. The first line contains the sequence code, in the format ">P1;code". The code is used by MODELLER to extract sequences from an alignment file containing multiple sequences, as shall be seen later. The second line with ten fields separated by colons generally contains information about the structure file, if applicable. Only the first of these fields is used for sequences, "sequence" (indicating that the file contains a sequence without known structure). The remaining lines in the file contain the sequence of TvLDH, with "*" marking its end. The standard one-letter amino acid codes are used. (Note that they must be upper case; some lower case letters are used for non-standard residues. See the file modlib/restyp.lib in the Modeller distribution for more information.)

In order to search for sequences similar to this one, a database of known sequences is required. This is the file pdb_95.pir, which can be found in the file archive on the index page (in the step1_search directory). Alternatively, up to date copies can be downloaded from the MODELLER website. This file is also in PIR format, and contains sequences for the structures in the PDB database. (To eliminate redundancy, the PDB sequences are clustered at 95% identity and only the cluster representatives are included in this file.) Unlike the sequence file created earlier, the second line of each sequence typically starts with structureX or structureN. This indicates to MODELLER that that an X-ray or NMR structure for the sequence is available, and provides the PDB file name, residue numbers, and chain IDs, to allow MODELLER to read the corresponding atomic coordinates if needed.

As a first demonstration of the use of MODELLER, the database of known sequences will be filtered and converted from PIR format to binary format. (MODELLER can work with either format, but binary files are much faster to use.) To run MODELLER, a Python script file must be created, here named convert_db.py:

from modeller import *

log.verbose()
env = environ()

sdb = sequence_db(env)
sdb.convert(seq_database_file='pdb_95.pir', seq_database_format='PIR',
            chains_list='ALL', minmax_db_seq_len=(30, 4000),
            clean_sequences=True, outfile='pdb_95.hdf5')

File: convert_db.py

The script, taken line by line, does the following:

  1. Imports the 'modeller' Python module. This is typically the first line in any MODELLER Python script.
  2. Requests verbose output in the log file.
  3. Initializes the environment for this run, by creating a new 'environ' object. Almost all MODELLER scripts require this step, as the new object (which is called here 'env', but can be given other names if desired) is needed to build most other useful objects.
  4. Creates a new 'sequence_db' object, calling it 'sdb'. 'sequence_db' objects are used to contain large databases of protein sequences.
  5. Calls sequence_db's 'convert' method, which reads in the file pdb_95.pir in PIR format, discards sequences which have fewer than 30 or more than 4000 residues, removes any non-standard residues, and writes out a new file pdb_95.hdf5 in binary format.

To actually run the script, first obtain a terminal window or command line (on the Mac, the Terminal application in /Applications/Utilities is probably best; on Windows, use the Command Prompt in the Modeller menu on the Start Menu; on Linux, a GNOME Terminal or xterm). Change into the directory containing the input files with the cd command. (For example, if running the examples directly from the extracted archive available on the index page, "cd cryoem_files/step1_search" on Linux or Mac, or "cd cryoem_files\step1_search" on Windows, is probably sufficient.) Then run MODELLER with:

python3 convert_db.py > convert_db.log

After only a few seconds, the output database pdb_95.hdf5 and the MODELLER log file convert_db.log should be produced. The log file contains miscellaneous information about the MODELLER run, including warnings. If a fatal error is encountered, a standard Python exception is raised, which you will see in your terminal window. (If this does not work, please see below [*].)

A search for potentially related sequences of known structure can now be performed by the profile.build() command of MODELLER. The following script reads in the sequence database created in the previous step, and then reads the query sequence "TvLDH" from the alignment file "TvLDH.ali" and converts it to a profile called 'prf'. Profiles contain similar information to alignments, but are more compact and better for sequence database searching. The sequence database 'sdb' is then searched with the query profile 'prf' using profile.build(); matches from the sequence database are added to the profile. Finally the profile of the query sequence and its homologs is written out to build_profile.prf; the equivalent information is also written out in PIR alignment format.

from modeller import *

log.verbose()
env = environ()

# Read in the sequence database in binary format
sdb = sequence_db(env, seq_database_file='pdb_95.hdf5',
                  seq_database_format='BINARY', chains_list='ALL')

# Read in the target sequence in PIR alignment format
aln = alignment(env)
aln.append(file='TvLDH.ali', alignment_format='PIR', align_codes='ALL')

# Convert the input sequence "alignment" into profile format
prf = aln.to_profile()

# Scan sequence database to pick up homologous sequences
prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
          gap_penalties_1d=(-500, -50), n_prof_iterations=1,
          check_profile=False, max_aln_evalue=0.01)

# Write out the profile in text format
prf.write(file='build_profile.prf', profile_format='TEXT')

# Convert the profile back to alignment format
aln = prf.to_alignment()

#- Write out a PIR alignment file
aln.write(file='build_profile.ali', alignment_format='PIR')

File: build_profile.py

The profile.build() command has many options. In this example rr_file is set to use the BLOSUM62 similarity matrix (file blosum62.sim.mat provided in the MODELLER distribution). Accordingly, the parameters matrix_offset and gap_penalties_1d are set to the appropriate values for the BLOSUM62 matrix. For this example, only one search iteration is run by setting the parameter n_prof_iterations equal to 1. Thus, there is no need to check the profile for deviation (check_profile set to False). Finally, the parameter max_aln_evalue is set to 0.01, indicating that only sequences with e-values smaller than or equal to 0.01 will be included in the final profile.

As before, run this script with Python ("python3 build_profile.py > build_profile.log"). It will generate an output log file and a profile build_profile.prf. An extract of this profile (omitting the aligned sequences) can be seen next. The first 6 commented lines indicate the input parameters used in MODELLER to build the profile. Subsequent lines correspond to the detected similarities by profile.build().

# Number of sequences:     41
# Length of profile  :    335
# N_PROF_ITERATIONS  :      1
# GAP_PENALTIES_1D   :   -500.0   -50.0
# MATRIX_OFFSET      : -450.0
# RR_FILE            : ${LIB}/blosum62.sim.mat
    1 TvLDH      S     0   335     1   335     0     0     0    0.    0.0
    2 1a5zA      X     1   312    75   242    63   229   164   28.   0.19E-07
    3 2a92A      X     1   316     8   191     6   186   174   26.   0.41E-04
    4 1b8pA      X     1   327     7   331     6   325   316   42.    0.0
    5 1civA      X     1   374     6   334    33   358   325   35.    0.0
    6 2cmdA      X     1   312     7   320     3   303   289   27.   0.37E-05
    7 3d5tA      X     1   321     4   325     3   316   312   44.    0.0
    8 2dfdA      X     1   314     5   198     2   190   186   26.   0.60E-06
    9 2ewdA      X     1   317     1   301     1   288   270   26.   0.33E-07
   10 1ez4B      X     1   318    69   239    55   230   159   31.   0.79E-06
   11 1guzA      X     1   305    13   301     8   280   265   25.   0.65E-08
   12 1gv0A      X     1   301    13   323     8   289   274   26.   0.65E-04
   13 1gv1B      X     1   290    99   301    80   271   191   23.   0.66E-03
   14 2hjrA      X     1   313    13   333    10   309   288   26.   0.78E-06
   15 1hyeA      X     1   307     7   191     3   183   173   29.   0.32E-07
   16 1i0zA      X     1   332    85   300    94   304   207   25.   0.15E-04
   17 1i10A      X     1   331    85   295    93   298   196   26.   0.20E-04
   18 1ldnA      X     1   316    78   298    73   301   214   26.   0.44E-03
   19 2ldxA      X     1   331    66   306    67   306   227   26.   0.56E-04
   20 5ldhA      X     1   333    85   300    94   304   207   26.   0.69E-05
   21 6ldhA      X     1   329    47   301    56   302   244   23.   0.38E-02
   22 9ldtA      X     1   331    85   301    93   304   207   26.   0.24E-05
   23 1llcA      X     1   321    64   239    53   234   164   26.   0.45E-03
   24 1lldA      X     1   313    13   242     9   233   216   31.   0.72E-07
   25 5mdhA      X     1   333     2   332     1   331   328   44.    0.0
   26 7mdhA      X     1   351     6   334    14   339   325   34.    0.0
   27 1mldA      X     1   313     5   198     1   189   183   26.   0.13E-05
   28 1o6zA      X     1   303     7   320     3   287   278   26.   0.61E-05
   29 1oc4A      X     1   315     5   191     4   186   174   28.   0.41E-04
   30 1ojuA      X     1   294    78   320    68   285   218   28.   0.99E-05
   31 1pzgA      X     1   327    74   191    71   190   114   30.   0.37E-06
   32 1smkA      X     1   313     7   202     4   198   188   34.    0.0
   33 1sovA      X     1   316    81   256    76   248   160   27.   0.21E-02
   34 1t2dA      X     1   315     5   256     4   250   238   25.   0.15E-03
   35 1u5aA      X     1   306    97   256    84   240   155   26.   0.92E-03
   36 1ur5A      X     1   299    13   191     9   171   158   31.   0.57E-02
   37 1uxkA      X     1   296    13   191    10   169   155   32.   0.95E-02
   38 1v6aA      X     1   332    94   300   103   304   197   25.   0.28E-03
   39 2v65A      X     1   326    94   300    97   298   197   24.   0.59E-03
   40 1y6jA      X     1   289    77   191    58   167   109   33.   0.75E-05
   41 1y7tA      X     1   327     1   325     1   319   318   45.    0.0

File: build_profile.prf

The most important columns in the profile.build() output are the second, tenth, eleventh and twelfth columns. The second column reports the code of the PDB sequence that was compared with the target sequence. The eleventh column reports the percentage sequence identities between TvLDH and a PDB sequence normalized by the lengths of the alignment (indicated in the tenth column). In general, a sequence identity value above approximately 25% indicates a potential template unless the alignment is short (i.e., less than 100 residues). A better measure of the significance of the alignment is given in the twelfth column by the e-value of the alignment. In this example, seven PDB sequences show very significant similarities to the query sequence with e-values equal to 0 (1b8p:A, 1civ:A, 3d5t:A, 5mdh:A, 7mdh:A, 1smk:A, and 1y7t:A) and are thus potential templates.

As an aside, the search for templates in this case is relatively straightforward, as there are several with relatively high sequence identity. In more difficult cases, where templates share 15%-25% sequence identity, multiple iterations of profile.build() may be required, or it may be necessary to enrich the profiles with additional sequences (e.g. from UniProt) and then use the profile.scan() method to scan this profile against similarly-built profiles for all known PDB sequences. This is, however, out of the scope of this tutorial.

On to the next step, or back to the index.


[*] Problems running MODELLER with Python
On a Mac running a recent version of Mac OS, this should work "out of the box". On other systems, MODELLER requires some version of Python between 2.3 and 3.8 to be installed, which can be obtained from the Python website if it is not already present on the machine.

On a Windows machine, it may be necessary to use the full path to the Python binary, e.g. C:\Python27\python.

On a Linux or Unix system, with MODELLER installed using the .tar.gz (not the RPM) package, it may be necessary to prefix this command with [TOP]/bin/modpy.sh, where [TOP] is the directory MODELLER was installed in - see the instructions displayed by the MODELLER installer.

If for some reason this command does not work, it may be possible to fall back to 'mod9v5 convert_db.py' which uses a copy of Python 2.3 built in to the MODELLER distribution. This, however, will only work for simple scripts that do not use the Python standard library, since MODELLER does not include it.