Rebuilding sequence databases

The Modeller distribution contains a sequence database, in the files `modlib/CHAINS_*`. These files are:

`CHAINS_all.seq`: sequences for every chain in every structure in the PDB.
`CHAINS_3.0.95_XN.cod`: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
`CHAINS_3.0.95_XN.grp`: for each representative, the other chains which are 95% sequence identical.
`CHAINS_3.0.40_XN.cod` and `CHAINS_3.0.40_XN.grp`: similar files, clustered at 40% sequence identity.
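
To check what a `CHAINS_all.seq` file actually contains, it can help to list the codes of its entries. The following is only a sketch: it assumes the file is in Modeller's PIR format, where each entry starts with a `>P1;code` header line. The `pir_codes` helper and the two sample entries are made up for illustration and are not part of Modeller.

```python
def pir_codes(lines):
    """Yield the code from every PIR header line of the form '>P1;code'."""
    for line in lines:
        if line.startswith('>P1;'):
            yield line[4:].strip()


# Two invented entries, standing in for the contents of CHAINS_all.seq
sample = """>P1;1abcA
structureX:1abc:   1 :A: 120 :A:example protein:example organism: 1.90: 0.20
MKVLA*
>P1;2xyzB
structureN:2xyz:   1 :B:  45 :B:another example:example organism:-1.00:-1.00
GSHMT*
""".splitlines()

codes = list(pir_codes(sample))
print(codes)
```

For a real database, pass an open file instead of `sample`, e.g. `pir_codes(open('CHAINS_all.seq'))`.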

The `CHAINS_*` files are not updated every time the PDB is, but you can regenerate them yourself if you have a local copy of the PDB. First, to build `CHAINS_all.seq`:

For each PDB file, run a script similar to the one below. Set `code` to the PDB code and set `atom_files_directory` to the directory containing your local copy of the PDB:

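#!python
from modeller import *

e = environ()
# Directory containing your local copy of the PDB
e.io.atom_files_directory = '/database/pdb/'

code = '1xyz'  # PDB code of the entry to process
m = model(e, file=code)

# Write each chain that passes the length, residue and resolution
# filters below to its own .chn file
m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
              chop_nonstd_terminii=True, max_nonstdres=10,
              minimal_resolution=99.0, structure_types='structureN structureX')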

Running the `make_chains` script for every PDB file produces a `.chn` file for each chain that passes the filters. Concatenate these files (e.g. with the Unix `cat` command) to make the new `CHAINS_all.seq` file.
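
If you would rather drive these two steps from Python than from the shell, something like the following sketch may help. It makes assumptions not stated above: that your PDB mirror uses file names such as `pdb1xyz.ent`, `pdb1xyz.ent.gz` or `1xyz.pdb`, and that all the `.chn` files end up in a single directory. Both helper functions are hypothetical and are not part of Modeller.

```python
import glob
import os
import re


def pdb_code_from_filename(filename):
    """Return the 4-character PDB code for names like 'pdb1xyz.ent',
    'pdb1xyz.ent.gz' or '1xyz.pdb', or None if the name does not match."""
    base = os.path.basename(filename).lower()
    match = re.match(r'^(?:pdb)?([0-9][a-z0-9]{3})\.(?:ent|pdb)(?:\.gz)?$',
                     base)
    return match.group(1) if match else None


def concatenate_chains(chn_dir, output_file):
    """Concatenate every .chn file in chn_dir into output_file, in sorted
    order, and return the number of files that were joined."""
    paths = sorted(glob.glob(os.path.join(chn_dir, '*.chn')))
    with open(output_file, 'w') as out:
        for path in paths:
            with open(path) as chn:
                text = chn.read()
            out.write(text)
            # Make sure the next entry starts on a fresh line
            if text and not text.endswith('\n'):
                out.write('\n')
    return len(paths)
```

For example, `pdb_code_from_filename('/database/pdb/pdb1xyz.ent')` gives `'1xyz'`, which can be used as `code` in the `make_chains` script, and `concatenate_chains('.', 'CHAINS_all.seq')` plays the role of the `cat` command.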

Now you can build the `.cod` and `.grp` files for any sequence identity cutoff using the following script (adjust the `seqid_cut` argument, and the output file names, accordingly):

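#!python
from modeller import *

e = environ()

# Read every sequence of 30 to 3000 residues from the PIR-format database
s = sequence_db(e, seq_database_file='CHAINS_all.seq', chains_list='all',
                seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
                clean_sequences=True)

# Cluster at 40% sequence identity, writing the representatives (.cod)
# and the members of each cluster (.grp)
s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
         gap_penalties_1d=(-500, -50), seqid_cut=40,
         output_grp_file='CHAINS_3.0.40_XN.grp', output_cod_file='CHAINS_3.0.40_XN.cod')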
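
When building files for several cutoffs, it is easy to change `seqid_cut` and forget to change the output file names. One way to keep them in step is to derive the names from the cutoff. This is a minimal sketch: the `cluster_filenames` helper is hypothetical, and it simply assumes you want to follow the `CHAINS_3.0.<cutoff>_XN` naming pattern of the files shipped with Modeller.

```python
def cluster_filenames(seqid_cut, prefix='CHAINS_3.0', suffix='XN'):
    """Return (grp_file, cod_file) names for a given percentage cutoff,
    following the CHAINS_3.0.<cutoff>_XN naming pattern."""
    stem = '%s.%d_%s' % (prefix, seqid_cut, suffix)
    return stem + '.grp', stem + '.cod'


grp_file, cod_file = cluster_filenames(40)
print(grp_file, cod_file)
```

The two names can then be passed as `output_grp_file` and `output_cod_file` alongside the same `seqid_cut` value.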