SequenceDB.filter() — cluster sequences by sequence-identity

filter(seqid_cut, output_grp_file, output_cod_file, gap_penalties_1d=(-900.0, -50.0), matrix_offset=0.0, rr_file='$(LIB)/as1.sim.mat', max_diff_res=30, window_size=512)
This command clusters a set of sequences by sequence identity. The command uses a greedy algorithm: the first sequence in the file becomes the first group representative. All other sequences are compared with this and if they are similar enough, as specified in seqid_cut, they are added as members of this group. These sequences are not used for further comparisons. The next non-member sequence becomes the next group representative and so on.

The initial set of sequences must be read previously by the SequenceDB.read() command with seq_database_format being either 'PIR' or 'FASTA'.

rr_file is the residue-residue substitution matrix and matrix_offset its offset. The command only handles similarity matrices for efficiency purposes.

The command uses the Smith-Waterman dynamic programming method (as in Alignment.align()) for the best sequence alignment, given the gap creation and extension penalties specified by gap_penalties_1d and residue type scores read from file rr_file. gap_penalties_1d[0] is a gap creation penalty and gap_penalties_1d[1] is a gap extension penalty.

The final list of groups and their members is written out to output_grp_file. The codes of the representative sequences is written out to output_cod_file.

The clustering algorithm evaluates the following conditions in hierarchical order before adding a sequence to a group:

  1. The difference in length: If the difference in the number of residues between the group representative and the sequence being compared is greater than max_diff_res, the sequence will not be included into that group.

  2. The number of unaligned residues: After the local alignment is performed, a sequence will not be considered for addition into a group unless the difference between the smaller of the two sequences and the number of aligned positions in the alignment is less than max_unaligned_res.

  3. Sequence Identity: Finally, if the sequence identity calculated from the alignment is greater than seqid_cut, the sequence is added to a group.

If the initial set of sequences read were in 'PIR' format with values in the resolution field, then the group representative is the sequence with the highest resolution. This is especially useful when clustering sequences from the PDB.

See SequenceDB.read() for a discussion of the window_size parameter. Note that this function acts on two regions of the database simultaneously (it does an all-against-all comparison) and so the default window size is half that of other functions.

Example: examples/commands/seqfilter.py

from modeller import *

log.verbose()
env = Environ()

sdb = SequenceDB(env, seq_database_file='sequences.pir',
                 seq_database_format='PIR',
                 chains_list='ALL', minmax_db_seq_len=[30, 3000],
                 clean_sequences=True)

sdb.filter(rr_file='${LIB}/id.sim.mat', gap_penalties_1d=[-3000, -1000],
           max_diff_res=30, seqid_cut=95, output_grp_file='seqfilt.grp',
           output_cod_file='seqfilt.cod')