This command clusters a set of sequences by sequence identity. The command uses a greedy algorithm: the first sequence in the file becomes the first group representative. All other sequences are compared with this and if they are similar enough, as specified in seqid_cut, they are added as members of this group. These sequences are not used for further comparisons. The next non-member sequence becomes the next group representative and so on.
The initial set of sequences must be read previously by the sequence_db.read() command with seq_database_format being either 'PIR' or 'FASTA'.
rr_file is residue-residue substitution matrix. The command only handles similarity matrices for efficiency purposes.
The command uses the Smith-Waterman dynamic programming method for the best sequence alignment, given the gap creation and extension penalties specified by gap_penalties_1d and residue type scores read from file rr_file. gap_penalties_1d[0] is a gap creation penalty and gap_penalties_1d[1] is a gap extension penalty. The command only works with similarity matrices for efficiency reasons.
The final list of groups and their members is written out to output_grp_file. The codes of the representative sequences is written out to output_cod_file.
The clustering algorithm evaluates the following conditions in hierarchial order before adding a sequence to a group:
If the initial set of sequences read were in 'PIR' format with values in the resolution field, then the group representative is the sequence with the highest resolution. This is especially useful when clustering sequences from the PDB.
from modeller import * log.verbose() env = environ() sdb = sequence_db(env, seq_database_file='sequences.pir', seq_database_format='PIR', chains_list='ALL', minmax_db_seq_len=[30, 3000], clean_sequences=True) sdb.filter(rr_file='${LIB}/id.sim.mat', gap_penalties_1d=[-3000, -1000], max_diff_res=30, seqid_cut=95, output_grp_file='seqfilt.grp', output_cod_file='seqfilt.cod')