Rebuilding sequence databases


Revision as of 14:22, 22 October 2009

Older versions of the Modeller distribution contain a sequence database, in the files `modlib/CHAINS_*`. These files are:

  • `CHAINS_all.seq` or `pdball.pir`: sequences for every chain in every structure in the PDB.
  • `CHAINS_3.0.95_XN.cod` or `pdb_95.cod`: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
  • `CHAINS_3.0.95_XN.grp` or `pdb_95.grp`: for each representative, the other chains in its cluster, i.e. those that are 95% sequence identical to it.
  • `CHAINS_3.0.40_XN.cod` and `CHAINS_3.0.40_XN.grp`: similar files, clustered at 40% sequence identity.

These files are not updated when the PDB is, and newer versions of Modeller do not include them at all. You can download updated copies from our supplemental data file download page (http://salilab.org/modeller/supplemental.html), or regenerate them yourself if you have a local copy of PDB. First, to build `CHAINS_all.seq` or `pdball.pir`:

  1. For each PDB file, run a script similar to that below. Set `code` to the PDB code and set `atom_files_directory` to the directory containing your local copy of PDB:
#!python
from modeller import *

e = environ()
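# Directory (or directories) containing your local copy of PDB
e.io.atom_files_directory = ['/database/pdb/']

# PDB code of the structure to process
code = '1xyz'
m = model(e, file=code)

# Write one .chn file for each chain that passes the filters below
m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
              chop_nonstd_termini=True, max_nonstdres=10,
              minimal_resolution=99.0, structure_types='structureN structureX')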


  2. This will produce a `.chn` file for every chain in PDB. Concatenate these together (e.g. with the Unix `cat` command) to make the new `CHAINS_all.seq` or `pdball.pir` file.
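
If you would rather not rely on the shell, the concatenation step can also be done in plain Python. The following is only a minimal sketch: the function name `concatenate_chain_files`, the directory layout, and the choice to process files in sorted order are assumptions made for illustration, not part of Modeller.

```python
import glob
import os


def concatenate_chain_files(chn_dir, output_file, pattern="*.chn"):
    """Join every file matching `pattern` in `chn_dir` into `output_file`.

    Hypothetical helper, not a Modeller function. Files are read in sorted
    name order so that repeated runs give the same result. Returns the
    number of files that were concatenated.
    """
    paths = sorted(glob.glob(os.path.join(chn_dir, pattern)))
    out_abs = os.path.abspath(output_file)
    count = 0
    with open(output_file, "w") as out:
        for path in paths:
            # Never read the output file back into itself
            if os.path.abspath(path) == out_abs:
                continue
            with open(path) as chn:
                text = chn.read()
            out.write(text)
            # Make sure each entry ends with a newline before the next starts
            if text and not text.endswith("\n"):
                out.write("\n")
            count += 1
    return count
```

For example, `concatenate_chain_files('.', 'pdball.pir')` would gather all `.chn` files in the current directory, which is roughly what `cat *.chn > pdball.pir` does.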

Now you can build the `.cod` and `.grp` files for any sequence identity cutoff using the following script (adjust the `seqid_cut` variable accordingly):

#!python
from modeller import *

e = environ()

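# Read the concatenated database built above (use 'pdball.pir' instead if
# that is the name you gave it), keeping sequences of 30 to 3000 residues
s = sequence_db(e, seq_database_file='CHAINS_all.seq', chains_list='all',
                seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
                clean_sequences=True)

# Cluster at the seqid_cut threshold (here 40%) and write the .grp and
# .cod files for that cutoff
s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
         gap_penalties_1d=(-500, -50), seqid_cut=40,
         output_grp_file='CHAINS_3.0.40_XN.grp', output_cod_file='CHAINS_3.0.40_XN.cod')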
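
When building files for several cutoffs, it is easy to change `seqid_cut` and forget to rename the output files. One way to keep them in step is to derive the names from the cutoff. The sketch below assumes the "modern" `pdb_95.cod` / `pdb_95.grp` naming pattern extends to other cutoffs; `cluster_file_names` is a hypothetical helper, not part of Modeller.

```python
def cluster_file_names(seqid_cut, prefix="pdb"):
    """Return (grp_name, cod_name) for a percentage sequence identity cutoff.

    Hypothetical helper: assumes the naming pattern '<prefix>_<cutoff>.grp'
    and '<prefix>_<cutoff>.cod', as in pdb_95.grp and pdb_95.cod.
    """
    cut = int(seqid_cut)
    # Reject fractional or out-of-range percentages rather than guess a name
    if cut != seqid_cut or not 0 < cut <= 100:
        raise ValueError("seqid_cut must be a whole percentage from 1 to 100")
    return "%s_%d.grp" % (prefix, cut), "%s_%d.cod" % (prefix, cut)
```

The two names can then be passed as `output_grp_file` and `output_cod_file` in the `s.filter()` call above, alongside the same `seqid_cut` value.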