Rebuilding sequence databases: Difference between revisions

Latest revision as of 21:17, 16 August 2022

Older versions of the Modeller distribution contain a sequence database, in the files modlib/CHAINS_*. These files are:

CHAINS_all.seq or pdball.pir: sequences for every chain in every structure in the PDB.
CHAINS_3.0.95_XN.cod or pdb_95.cod: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
CHAINS_3.0.95_XN.grp or pdb_95.grp: for each representative, the other chains in its cluster, i.e. those that are at least 95% sequence identical to it.
CHAINS_3.0.40_XN.cod and CHAINS_3.0.40_XN.grp: similar files, clustered at 40% sequence identity.

These files are not updated whenever the PDB is, and are not included at all with newer versions of Modeller. You can download updated copies from our supplemental data file download page (https://salilab.org/modeller/supplemental.html), or regenerate them yourself if you have a local copy of the PDB. First, to build CHAINS_all.seq or pdball.pir:

For each PDB file, run a script similar to the one below. Set code to the PDB code and set atom_files_directory to the directory containing your local copy of the PDB:

from modeller import *

e = Environ()
# Directories that Modeller searches for PDB files
e.io.atom_files_directory = ['/database/pdb/']

code = '1xyz'
m = Model(e, file=code)

# Write a .chn file for each chain that passes the filters below
m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
              chop_nonstd_termini=True, max_nonstdres=10,
              minimal_resolution=99.0, structure_types='structureN structureX')
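The script above handles a single PDB code. As a minimal sketch of running it over a whole local mirror, the helper below collects codes from file names and then repeats the same make_chains call for each one. The names find_pdb_codes, make_all_chains and pdb_dir are hypothetical, and the sketch assumes a flat directory of files named pdbXXXX.ent (optionally with a .gz suffix); adjust the pattern if your mirror is laid out differently.

```python
import os
import re

# Assumption: files are named like pdb1xyz.ent or pdb1xyz.ent.gz
_PDB_FILE = re.compile(r'^pdb([0-9][a-z0-9]{3})\.ent(\.gz)?$')


def find_pdb_codes(pdb_dir):
    """Return the sorted, unique four-character PDB codes found in pdb_dir."""
    codes = set()
    for name in os.listdir(pdb_dir):
        match = _PDB_FILE.match(name.lower())
        if match:
            codes.add(match.group(1))
    return sorted(codes)


def make_all_chains(pdb_dir):
    """Repeat the single-code make_chains script for every code in pdb_dir."""
    # Imported here so that find_pdb_codes can be used without Modeller
    from modeller import Environ, Model

    e = Environ()
    e.io.atom_files_directory = [pdb_dir]
    for code in find_pdb_codes(pdb_dir):
        m = Model(e, file=code)
        # Same arguments as the single-code script
        m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
                      chop_nonstd_termini=True, max_nonstdres=10,
                      minimal_resolution=99.0,
                      structure_types='structureN structureX')
```

Calling make_all_chains('/database/pdb/') would then write the .chn files into the current working directory. Entries that Modeller cannot read will raise an exception and stop the loop, so a real run may need error handling around the loop body.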

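Each run produces a .chn file for every chain in that PDB entry that passes the filters, so running it over the whole PDB yields a .chn file for every such chain. Concatenate these together (e.g. with the Unix cat command) to make the new CHAINS_all.seq or pdball.pir file.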
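With very many .chn files, a single cat *.chn command can exceed the shell's argument-length limit. As an alternative, here is a small Python sketch of the same concatenation step. The function name concatenate_chain_files and its arguments are hypothetical, and it assumes all .chn files sit in one directory.

```python
import glob
import os
import shutil


def concatenate_chain_files(chn_dir, out_path):
    """Append every .chn file in chn_dir to out_path, in sorted name order.

    Returns the number of .chn files that were concatenated.
    """
    chn_files = sorted(glob.glob(os.path.join(chn_dir, '*.chn')))
    with open(out_path, 'wb') as out:
        for path in chn_files:
            with open(path, 'rb') as src:
                # Copy bytes unchanged, exactly as cat would
                shutil.copyfileobj(src, out)
    return len(chn_files)
```

For example, concatenate_chain_files('.', 'CHAINS_all.seq') gathers the .chn files from the current directory into a new CHAINS_all.seq.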

Now you can build the .cod and .grp files for any sequence identity cutoff using the following script (adjust the seqid_cut argument and the output file names accordingly):

from modeller import *

e = Environ()

# Read sequences of 30 to 3000 residues from the PIR-format database
s = SequenceDB(e, seq_database_file='CHAINS_all.seq', chains_list='all',
               seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
               clean_sequences=True)

# Cluster at the seqid_cut cutoff (here 40%) and write the .grp and .cod files
s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
         gap_penalties_1d=(-500, -50), seqid_cut=40,
         output_grp_file='CHAINS_3.0.40_XN.grp', output_cod_file='CHAINS_3.0.40_XN.cod')

Note that this script takes a long time to run. For high (>90%) sequence identity cutoffs, it is more efficient to use CD-HIT (http://weizhongli-lab.org/cd-hit/) instead. A script that automates this is included as part of ModPipe (https://salilab.org/modpipe/) as python/ClusterPDB.py.
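CD-HIT reads FASTA rather than PIR, so the concatenated database has to be converted first. The following is a minimal sketch of that conversion, not the ModPipe script. The names pir_to_records and write_fasta are hypothetical. It assumes each PIR entry is a header line (e.g. >P1;1xyzA), a single description line, and sequence lines ending in *.

```python
def pir_to_records(lines):
    """Yield (code, sequence) pairs from an iterable of PIR-format lines."""
    code = None
    seq = []
    expect_description = False
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            if code is not None:
                yield code, ''.join(seq)
            # Header looks like '>P1;1xyzA'; keep the part after the ';'
            code = line.split(';', 1)[-1]
            seq = []
            expect_description = True
        elif expect_description:
            # The line right after the header is the description; skip it
            expect_description = False
        elif code is not None and line and not line.startswith('C;'):
            # Sequence line; drop the '*' that terminates the entry
            seq.append(line.rstrip('*'))
    if code is not None:
        yield code, ''.join(seq)


def write_fasta(records, handle, width=60):
    """Write (code, sequence) pairs to handle in FASTA format."""
    for code, seq in records:
        handle.write('>%s\n' % code)
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + '\n')
```

After writing the records to a file such as pdball.fasta (a hypothetical name), a command along the lines of cd-hit -i pdball.fasta -o pdb_95 -c 0.95 -n 5 clusters at 95% identity. CD-HIT writes its clusters in its own .clstr format, so turning that output into .cod and .grp files is a further step not shown here.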

@@ Line 1: / Line 1: @@
 __NOTOC__
-The Modeller distribution contains a sequence database, in the files `modlib/CHAINS_*`. These files are
+<!-- ## page was renamed from Rebuilding_sequence_databases -->
-* `CHAINS_all.seq`: sequences for every chain in every structure in the PDB.
+Older versions of the Modeller distribution contain a sequence database, in the files <code>modlib/CHAINS_*</code>. These files are
-* `CHAINS_3.0.95_XN.cod`: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
+* <code>CHAINS_all.seq</code> or <code>pdball.pir</code>: sequences for every chain in every structure in the PDB.
-* `CHAINS_3.0.95_XN.grp`: for each representative, the other chains which are 95% sequence identical.
+* <code>CHAINS_3.0.95_XN.cod</code> or <code>pdb_95.cod</code>: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
-* `CHAINS_3.0.40_XN.cod` and `CHAINS_3.0.40_XN.grp`: similar files, clustered at 40% sequence identity.
+* <code>CHAINS_3.0.95_XN.grp</code> or <code>pdb_95.grp</code>: for each representative, the other chains which are 95% sequence identical.
+* <code>CHAINS_3.0.40_XN.cod</code> and <code>CHAINS_3.0.40_XN.grp</code>: similar files, clustered at 40% sequence identity.
-These files are obviously not updated whenever the PDB is, but you can download updated copies from our [http://salilab.org/modeller/supplemental.html supplemental data file download page], or regenerate them yourself if you have a local copy of PDB. Firstly, to build `CHAINS_all.seq`:
+These files are obviously not updated whenever the PDB is, and are not included at all with newer versions of Modeller, but you can download updated copies from our [https://salilab.org/modeller/supplemental.html supplemental data file download page], or regenerate them yourself if you have a local copy of PDB. Firstly, to build <code>CHAINS_all.seq</code> or <code>pdball.pir</code>:
-# For each PDB file, run a script similar to that below. Set `code` to the PDB code and set `atom_files_directory` to the directory containing your local copy of PDB:
+For each PDB file, run a script similar to that below. Set <code>code</code> to the PDB code and set <code>atom_files_directory</code> to the directory containing your local copy of PDB:
-<pre><nowiki>#!python
+<syntaxhighlight lang="python">
-e = environ()
+from modeller import *
-e.io.atom_files_directory = '/database/pdb/'
+e = Environ()
+e.io.atom_files_directory = ['/database/pdb/']
 code = '1xyz'
-m = model(e, file=code)
+m = Model(e, file=code)
 m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
-               chop_nonstd_terminii=True, max_nonstdres=10,
+               chop_nonstd_termini=True, max_nonstdres=10,
                minimal_resolution=99.0, structure_types='structureN structureX')
-</nowiki></pre>
+</syntaxhighlight>
+This will produce a <code>.chn</code> file for every chain in PDB. Concatenate these together (e.g. with the Unix <code>cat</code> command) to make the new <code>CHAINS_all.seq</code> or <code>pdball.pir</code> file.
-# This will produce a `.chn` file for every chain in PDB. Concatenate these together (e.g. with the Unix `cat` command) to make the new `CHAINS_all.seq` file.
+Now you can build the <code>.cod</code> and <code>.grp</code> files for any sequence identity cutoff using the following script (adjust the <code>seqid_cut</code> variable accordingly):
-Now you can build the `.cod` and `.grp` files for any sequence identity cutoff using the following script (adjust the `seqid_cut` variable accordingly):
+<syntaxhighlight lang="python">
+from modeller import *
-<pre><nowiki>#!python
+e = Environ()
-e = environ()
-s = sequence_db(e, seq_database_file='CHAINS_all.seq', chains_list='all',
+s = SequenceDB(e, seq_database_file='CHAINS_all.seq', chains_list='all',
-                seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
+               seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
-                clean_sequences=True)
+               clean_sequences=True)
 s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
           gap_penalties_1d=(-500, -50), seqid_cut=40,
           output_grp_file='CHAINS_3.0.40_XN.grp', output_cod_file='CHAINS_3.0.40_XN.cod')
-</nowiki></pre>
+</syntaxhighlight>
+Note that this will take a long time to run. For high (>90%) sequence identity cutoffs, it is more efficient to use [http://weizhongli-lab.org/cd-hit/ CD-HIT] instead. A script that automates this is included as part of [https://salilab.org/modpipe/ ModPipe] (<code>python/ClusterPDB.py</code>).
+[[Category:Examples]]