Modeling with cryo-EM
Step 1: Search for suitable templates
For this step of the tutorial, all input and output files can be found in the step1_search directory of the file archive available on the index page.
The very first thing to do is to put the 'target' sequence of TvLDH into a form that can be read by MODELLER. The file TvLDH.ali can be created with any text editor (e.g. Windows Notepad, Mac TextEdit, Linux gedit, emacs, vi etc.) Note that this file, like most of the files used by MODELLER, should be plain text - do not use a word processor (e.g. Microsoft Word, OpenOffice) to create it! (If using TextEdit, select "Make Plain Text" from the "Format" menu; if using Notepad, make sure you select "ANSI" encoding when you save the file, not "Unicode" or "UTF-8".)
>P1;TvLDH sequence::::::::: MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGFVATTDPKA AFKDIDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTNCEIAMLHAKNLKPEN FSSLSMLDQNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEGKTQKVVDVLDHDYVFDTFFKKI GHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIPVPEGNPYGIKPGVVFSFPCNVDKEGKIHVV EGFKVNDWLREKLDFTEKDLFHEKEIALNHLAQGG*
File: TvLDH.ali
This file is in the PIR alignment format. The first line contains the sequence code, in the format ">P1;code". The code is used by MODELLER to extract sequences from an alignment file containing multiple sequences, as shall be seen later. The second line with ten fields separated by colons generally contains information about the structure file, if applicable. Only the first of these fields is used for sequences, "sequence" (indicating that the file contains a sequence without known structure). The remaining lines in the file contain the sequence of TvLDH, with "*" marking its end. The standard one-letter amino acid codes are used. (Note that they must be upper case; some lower case letters are used for non-standard residues. See the file modlib/restyp.lib in the Modeller distribution for more information.)
In order to search for sequences similar to this one, a database of known sequences is required. This is the file pdb_95.pir, which can be found in the file archive on the index page (in the step1_search directory). Alternatively, up to date copies can be downloaded from the MODELLER website. This file is also in PIR format, and contains sequences for the structures in the PDB database. (To eliminate redundancy, the PDB sequences are clustered at 95% identity and only the cluster representatives are included in this file.) Unlike the sequence file created earlier, the second line of each sequence typically starts with structureX or structureN. This indicates to MODELLER that that an X-ray or NMR structure for the sequence is available, and provides the PDB file name, residue numbers, and chain IDs, to allow MODELLER to read the corresponding atomic coordinates if needed.
As a first demonstration of the use of MODELLER, the database of known sequences will be filtered and converted from PIR format to binary format. (MODELLER can work with either format, but binary files are much faster to use.) To run MODELLER, a Python script file must be created, here named convert_db.py:
from modeller import * log.verbose() env = Environ() sdb = SequenceDB(env) sdb.convert(seq_database_file='pdb_95.pir', seq_database_format='PIR', chains_list='ALL', minmax_db_seq_len=(30, 4000), clean_sequences=True, outfile='pdb_95.hdf5')
File: convert_db.py
The script, taken line by line, does the following:
- Imports the 'modeller' Python module. This is typically the first line in any MODELLER Python script.
- Requests verbose output in the log file.
- Initializes the environment for this run, by creating a new 'Environ' object. Almost all MODELLER scripts require this step, as the new object (which is called here 'env', but can be given other names if desired) is needed to build most other useful objects.
- Creates a new 'SequenceDB' object, calling it 'sdb'. 'SequenceDB' objects are used to contain large databases of protein sequences.
- Calls SequenceDB's 'convert' method, which reads in the file pdb_95.pir in PIR format, discards sequences which have fewer than 30 or more than 4000 residues, removes any non-standard residues, and writes out a new file pdb_95.hdf5 in binary format.
To actually run the script, first obtain a terminal window or command line (on the Mac, the Terminal application in /Applications/Utilities is probably best; on Windows, use the Command Prompt in the Modeller menu on the Start Menu; on Linux, a GNOME Terminal or xterm). Change into the directory containing the input files with the cd command. (For example, if running the examples directly from the extracted archive available on the index page, "cd cryoem_files/step1_search" on Linux or Mac, or "cd cryoem_files\step1_search" on Windows, is probably sufficient.) Then run MODELLER with:
python3 convert_db.py > convert_db.log
After only a few seconds, the output database pdb_95.hdf5 and the MODELLER log file convert_db.log should be produced. The log file contains miscellaneous information about the MODELLER run, including warnings. If a fatal error is encountered, a standard Python exception is raised, which you will see in your terminal window. (If this does not work, please see below [*].)
A search for potentially related sequences of known structure can now be performed by the Profile.build() command of MODELLER. The following script reads in the sequence database created in the previous step, and then reads the query sequence "TvLDH" from the alignment file "TvLDH.ali" and converts it to a profile called 'prf'. Profiles contain similar information to alignments, but are more compact and better for sequence database searching. The sequence database 'sdb' is then searched with the query profile 'prf' using Profile.build(); matches from the sequence database are added to the profile. Finally the profile of the query sequence and its homologs is written out to build_profile.prf; the equivalent information is also written out in PIR alignment format.
from modeller import * log.verbose() env = Environ() # Read in the sequence database in binary format sdb = SequenceDB(env, seq_database_file='pdb_95.hdf5', seq_database_format='BINARY', chains_list='ALL') # Read in the target sequence in PIR alignment format aln = Alignment(env) aln.append(file='TvLDH.ali', alignment_format='PIR', align_codes='ALL') # Convert the input sequence "alignment" into profile format prf = aln.to_profile() # Scan sequence database to pick up homologous sequences prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat', gap_penalties_1d=(-500, -50), n_prof_iterations=1, check_profile=False, max_aln_evalue=0.01) # Write out the profile in text format prf.write(file='build_profile.prf', profile_format='TEXT') # Convert the profile back to alignment format aln = prf.to_alignment() #- Write out a PIR alignment file aln.write(file='build_profile.ali', alignment_format='PIR')
File: build_profile.py
The Profile.build() command has many options. In this example rr_file is set to use the BLOSUM62 similarity matrix (file blosum62.sim.mat provided in the MODELLER distribution). Accordingly, the parameters matrix_offset and gap_penalties_1d are set to the appropriate values for the BLOSUM62 matrix. For this example, only one search iteration is run by setting the parameter n_prof_iterations equal to 1. Thus, there is no need to check the profile for deviation (check_profile set to False). Finally, the parameter max_aln_evalue is set to 0.01, indicating that only sequences with e-values smaller than or equal to 0.01 will be included in the final profile.
As before, run this script with Python ("python3 build_profile.py > build_profile.log"). It will generate an output log file and a profile build_profile.prf. An extract of this profile (omitting the aligned sequences) can be seen next. The first 6 commented lines indicate the input parameters used in MODELLER to build the profile. Subsequent lines correspond to the detected similarities by Profile.build().
# Number of sequences: 41 # Length of profile : 335 # N_PROF_ITERATIONS : 1 # GAP_PENALTIES_1D : -500.0 -50.0 # MATRIX_OFFSET : -450.0 # RR_FILE : ${LIB}/blosum62.sim.mat 1 TvLDH S 0 335 1 335 0 0 0 0. 0.0 2 1a5zA X 1 312 75 242 63 229 164 28. 0.19E-07 3 2a92A X 1 316 8 191 6 186 174 26. 0.41E-04 4 1b8pA X 1 327 7 331 6 325 316 42. 0.0 5 1civA X 1 374 6 334 33 358 325 35. 0.0 6 2cmdA X 1 312 7 320 3 303 289 27. 0.37E-05 7 3d5tA X 1 321 4 325 3 316 312 44. 0.0 8 2dfdA X 1 314 5 198 2 190 186 26. 0.60E-06 9 2ewdA X 1 317 1 301 1 288 270 26. 0.33E-07 10 1ez4B X 1 318 69 239 55 230 159 31. 0.79E-06 11 1guzA X 1 305 13 301 8 280 265 25. 0.65E-08 12 1gv0A X 1 301 13 323 8 289 274 26. 0.65E-04 13 1gv1B X 1 290 99 301 80 271 191 23. 0.66E-03 14 2hjrA X 1 313 13 333 10 309 288 26. 0.78E-06 15 1hyeA X 1 307 7 191 3 183 173 29. 0.32E-07 16 1i0zA X 1 332 85 300 94 304 207 25. 0.15E-04 17 1i10A X 1 331 85 295 93 298 196 26. 0.20E-04 18 1ldnA X 1 316 78 298 73 301 214 26. 0.44E-03 19 2ldxA X 1 331 66 306 67 306 227 26. 0.56E-04 20 5ldhA X 1 333 85 300 94 304 207 26. 0.69E-05 21 6ldhA X 1 329 47 301 56 302 244 23. 0.38E-02 22 9ldtA X 1 331 85 301 93 304 207 26. 0.24E-05 23 1llcA X 1 321 64 239 53 234 164 26. 0.45E-03 24 1lldA X 1 313 13 242 9 233 216 31. 0.72E-07 25 5mdhA X 1 333 2 332 1 331 328 44. 0.0 26 7mdhA X 1 351 6 334 14 339 325 34. 0.0 27 1mldA X 1 313 5 198 1 189 183 26. 0.13E-05 28 1o6zA X 1 303 7 320 3 287 278 26. 0.61E-05 29 1oc4A X 1 315 5 191 4 186 174 28. 0.41E-04 30 1ojuA X 1 294 78 320 68 285 218 28. 0.99E-05 31 1pzgA X 1 327 74 191 71 190 114 30. 0.37E-06 32 1smkA X 1 313 7 202 4 198 188 34. 0.0 33 1sovA X 1 316 81 256 76 248 160 27. 0.21E-02 34 1t2dA X 1 315 5 256 4 250 238 25. 0.15E-03 35 1u5aA X 1 306 97 256 84 240 155 26. 0.92E-03 36 1ur5A X 1 299 13 191 9 171 158 31. 0.57E-02 37 1uxkA X 1 296 13 191 10 169 155 32. 0.95E-02 38 1v6aA X 1 332 94 300 103 304 197 25. 0.28E-03 39 2v65A X 1 326 94 300 97 298 197 24. 0.59E-03 40 1y6jA X 1 289 77 191 58 167 109 33. 0.75E-05 41 1y7tA X 1 327 1 325 1 319 318 45. 0.0
File: build_profile.prf
The most important columns in the Profile.build() output are the second, tenth, eleventh and twelfth columns. The second column reports the code of the PDB sequence that was compared with the target sequence. The eleventh column reports the percentage sequence identities between TvLDH and a PDB sequence normalized by the lengths of the alignment (indicated in the tenth column). In general, a sequence identity value above approximately 25% indicates a potential template unless the alignment is short (i.e., less than 100 residues). A better measure of the significance of the alignment is given in the twelfth column by the e-value of the alignment. In this example, seven PDB sequences show very significant similarities to the query sequence with e-values equal to 0 (1b8p:A, 1civ:A, 3d5t:A, 5mdh:A, 7mdh:A, 1smk:A, and 1y7t:A) and are thus potential templates.
As an aside, the search for templates in this case is relatively straightforward, as there are several with relatively high sequence identity. In more difficult cases, where templates share 15%-25% sequence identity, multiple iterations of Profile.build() may be required, or it may be necessary to enrich the profiles with additional sequences (e.g. from UniProt) and then use the Profile.scan() method to scan this profile against similarly-built profiles for all known PDB sequences. This is, however, out of the scope of this tutorial.
On to the next step, or back to the index.
[*] Problems running MODELLER
with Python
On a Mac running a recent version of Mac OS, this should work "out of the box".
On other systems, MODELLER requires some version of Python between 2.3 and 3.8
to be installed,
which can be obtained from the Python
website if it is not already present on the machine.
On a Windows machine, it may be necessary to use the full path to the Python binary, e.g. C:\Python27\python.
On a Linux or Unix system, with MODELLER installed using the .tar.gz (not the RPM) package, it may be necessary to prefix this command with [TOP]/bin/modpy.sh, where [TOP] is the directory MODELLER was installed in - see the instructions displayed by the MODELLER installer.
If for some reason this command does not work, it may be possible to fall back to 'mod9v5 convert_db.py' which uses a copy of Python 2.3 built in to the MODELLER distribution. This, however, will only work for simple scripts that do not use the Python standard library, since MODELLER does not include it.