Rebuilding sequence databases


The Modeller distribution contains a sequence database, in the files `modlib/CHAINS_*`. These files are:

  • `CHAINS_all.seq`: sequences for every chain in every structure in the PDB.
  • `CHAINS_3.0.95_XN.cod`: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
  • `CHAINS_3.0.95_XN.grp`: for each representative, the other chains in its cluster, i.e. those that are 95% sequence identical to it.
  • `CHAINS_3.0.40_XN.cod` and `CHAINS_3.0.40_XN.grp`: similar files, clustered at 40% sequence identity.

These files are not updated every time the PDB is, so they will lag behind it, but you can regenerate them yourself if you have a local copy of the PDB. First, to build `CHAINS_all.seq`:

  1. For each PDB file, run a script similar to that below. Set `code` to the PDB code and set `atom_files_directory` to the directory containing your local copy of PDB:
#!python
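from modeller import *

e = environ()
# Directory containing your local copy of PDB
e.io.atom_files_directory = '/database/pdb/'

code = '1xyz'  # PDB code of the structure to process
m = model(e, file=code)

# Write a .chn file for each chain that passes the length and quality filters
m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
              chop_nonstd_terminii=True, max_nonstdres=10,
              minimal_resolution=99.0, structure_types='structureN structureX')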


  2. Running this script for every PDB file produces a `.chn` file for each chain in PDB. Concatenate these files (e.g. with the Unix `cat` command) to make the new `CHAINS_all.seq` file.
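
The bookkeeping around these two steps can be sketched in plain Python, as a portable alternative to `cat`. This is only an illustration: the helper names `pdb_code_from_filename` and `concatenate_chn_files` are hypothetical, the file name patterns (`pdb1xyz.ent.gz`, `1xyz.pdb`) are an assumption about how your local PDB copy is laid out, and the `.chn` files are assumed to have been collected in a single directory.

```python
import glob
import os
import re
import shutil


def pdb_code_from_filename(path):
    """Extract a 4-character PDB code from a file name.

    Assumes names such as 'pdb1xyz.ent.gz' or '1xyz.pdb'; returns None
    for anything that does not match these (assumed) patterns.
    """
    name = os.path.basename(path).lower()
    match = re.match(r'^(?:pdb)?([0-9][a-z0-9]{3})\.(?:ent|pdb)(?:\.gz)?$', name)
    return match.group(1) if match else None


def concatenate_chn_files(chn_dir, output_file='CHAINS_all.seq'):
    """Concatenate every .chn file in chn_dir into output_file.

    Files are appended in sorted name order; returns the number of
    files that were concatenated.
    """
    chn_files = sorted(glob.glob(os.path.join(chn_dir, '*.chn')))
    with open(output_file, 'wb') as out:
        for chn in chn_files:
            with open(chn, 'rb') as src:
                shutil.copyfileobj(src, out)
    return len(chn_files)
```

For example, `pdb_code_from_filename` could supply the value of `code` for each file in your PDB directory, and `concatenate_chn_files('.')` could then be called once all of the `.chn` files have been written to the current directory.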

Now you can build the `.cod` and `.grp` files for any sequence identity cutoff using the following script (adjust the `seqid_cut` argument, and the output file names, accordingly):

#!python
from modeller import *

e = environ()

# Read every sequence of 30 to 3000 residues from the rebuilt database
s = sequence_db(e, seq_database_file='CHAINS_all.seq', chains_list='all',
                seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
                clean_sequences=True)

# Cluster at 40% sequence identity and write the .grp and .cod files
s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
         gap_penalties_1d=(-500, -50), seqid_cut=40,
         output_grp_file='CHAINS_3.0.40_XN.grp', output_cod_file='CHAINS_3.0.40_XN.cod')
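
To regenerate the files for more than one cutoff (for example both 95 and 40), the script above can be wrapped in a function. The following is a minimal sketch rather than part of Modeller: the names `cluster_output_names` and `build_clusters` are hypothetical, and the file naming simply follows the `CHAINS_3.0.<cutoff>_XN` pattern used by the distributed files. Modeller is imported inside `build_clusters`, so only that function needs a working Modeller installation.

```python
def cluster_output_names(seqid_cut, prefix='CHAINS_3.0'):
    """Return (grp_file, cod_file) following the CHAINS_3.0.<cutoff>_XN pattern."""
    stem = '%s.%d_XN' % (prefix, seqid_cut)
    return stem + '.grp', stem + '.cod'


def build_clusters(seqid_cut, seq_file='CHAINS_all.seq'):
    """Run the filter step shown above for a single cutoff (requires Modeller)."""
    # Imported lazily so that the helper above works without Modeller
    from modeller import environ, sequence_db

    e = environ()
    s = sequence_db(e, seq_database_file=seq_file, chains_list='all',
                    seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
                    clean_sequences=True)
    grp_file, cod_file = cluster_output_names(seqid_cut)
    # Same filter settings as the script above; only the cutoff and names vary
    s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
             gap_penalties_1d=(-500, -50), seqid_cut=seqid_cut,
             output_grp_file=grp_file, output_cod_file=cod_file)
    return grp_file, cod_file
```

Calling `build_clusters(95)` and then `build_clusters(40)` would then write both pairs of files that ship with the distribution.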