The Modeller distribution contains a sequence database, in the files `modlib/CHAINS_*`. These files are:
- `CHAINS_all.seq`: sequences for every chain in every structure in the PDB.
- `CHAINS_3.0.95_XN.cod`: all chains are clustered at 95% sequence identity, and for each cluster, the PDB code of the representative chain is listed in this file.
- `CHAINS_3.0.95_XN.grp`: for each representative, the other chains which are 95% sequence identical.
- `CHAINS_3.0.40_XN.cod` and `CHAINS_3.0.40_XN.grp`: similar files, clustered at 40% sequence identity.
These files are not updated every time the PDB is, but you can regenerate them yourself if you have a local copy of the PDB. First, to build `CHAINS_all.seq`:
- For each PDB file, run a script similar to the one below. Set `code` to the PDB code and set `atom_files_directory` to the directory containing your local copy of the PDB:

  ```python
  from modeller import *

  e = environ()
  e.io.atom_files_directory = '/database/pdb/'  # local copy of the PDB
  code = '1xyz'                                 # PDB code to process
  m = model(e, file=code)
  # Write one .chn file for each chain that passes the filters
  m.make_chains(file=code, minimal_chain_length=30, minimal_stdres=30,
                chop_nonstd_terminii=True, max_nonstdres=10,
                minimal_resolution=99.0,
                structure_types='structureN structureX')
  ```
- This will produce a `.chn` file for every chain in the PDB. Concatenate these files (e.g. with the Unix `cat` command) to make the new `CHAINS_all.seq` file.
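
If `cat` is not available, or the number of `.chn` files is too large for a single shell command line, the concatenation step can also be done in plain Python. The following is only a minimal sketch: the function name `concatenate_chn_files` and its arguments are hypothetical (not part of Modeller), and it assumes all the `.chn` files sit in one directory.

```python
import glob
import os


def concatenate_chn_files(chn_dir, output_path):
    """Join every .chn file in chn_dir into a single sequence file.

    Hypothetical helper, not part of Modeller.
    Returns the number of files that were concatenated.
    """
    # Sort so that the output order is reproducible between runs.
    chn_files = sorted(glob.glob(os.path.join(chn_dir, '*.chn')))
    with open(output_path, 'w') as out:
        for path in chn_files:
            with open(path) as chn:
                text = chn.read()
            out.write(text)
            # Guard against a file without a trailing newline, which
            # would otherwise fuse two entries onto one line.
            if text and not text.endswith('\n'):
                out.write('\n')
    return len(chn_files)
```

For example, `concatenate_chn_files('.', 'CHAINS_all.seq')` would collect the `.chn` files from the current directory. Because the output name does not end in `.chn`, it is never picked up as one of its own inputs.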
Now you can build the `.cod` and `.grp` files for any sequence identity cutoff using the following script (adjust the `seqid_cut` argument and the output file names accordingly):
```python
from modeller import *

e = environ()
# Read every chain sequence between 30 and 3000 residues long
s = sequence_db(e, seq_database_file='CHAINS_all.seq', chains_list='all',
                seq_database_format='PIR', minmax_db_seq_len=(30, 3000),
                clean_sequences=True)
# Cluster at 40% sequence identity; write representatives and groups
s.filter(matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
         gap_penalties_1d=(-500, -50), seqid_cut=40,
         output_grp_file='CHAINS_3.0.40_XN.grp',
         output_cod_file='CHAINS_3.0.40_XN.cod')
```