The initial set of sequences must first be read with the sequence_db.read() command, with seq_database_format set to either 'PIR' or 'FASTA'.
rr_file is the residue-residue substitution matrix file. For efficiency, the command handles only similarity matrices.
The command uses the Smith-Waterman dynamic programming method to find the best sequence alignment, given the gap creation and extension penalties specified by gap_penalties_1d and the residue type scores read from rr_file. gap_penalties_1d[1] is the gap creation penalty and gap_penalties_1d[2] is the gap extension penalty.
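The Smith-Waterman scoring described above can be illustrated with a minimal sketch. This is not the library's implementation: the function names local_alignment_score and identity_score, the match/mismatch scores, and the convention that a gap of length n costs gap_open + n * gap_extend are all assumptions made for illustration. Penalties are negative numbers, as in the example call for this command.

```python
def identity_score(x, y):
    """Hypothetical similarity score: reward identical residues."""
    return 10 if x == y else -10


def local_alignment_score(a, b, score, gap_open, gap_extend):
    """Best Smith-Waterman local alignment score with affine gap penalties.

    gap_open and gap_extend are negative. Assumed convention: a gap of
    length n costs gap_open + n * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # E[i][j]: best score ending with a gap that consumes b[j-1]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    # F[i][j]: best score ending with a gap that consumes a[i-1]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one
            E[i][j] = max(H[i][j - 1] + gap_open + gap_extend,
                          E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open + gap_extend,
                          F[i - 1][j] + gap_extend)
            # The floor at 0 is what makes the alignment local
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          E[i][j],
                          F[i][j])
            best = max(best, H[i][j])
    return best


# Mild penalties allow a one-residue gap; very large ones forbid gaps
mild = local_alignment_score("AAAACCCC", "AAAAGCCCC", identity_score, -3, -1)
harsh = local_alignment_score("AAAACCCC", "AAAAGCCCC", identity_score, -3000, -1000)
print(mild, harsh)
```

With very large penalties the best alignment contains no gaps, so the score comes from substitutions alone.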
The final list of groups and their members is written out to output_grp_file. The codes of the representative sequences are written out to output_cod_file.
The clustering algorithm evaluates the following conditions in hierarchical order before adding a sequence to a group:
If the initial set of sequences was read in 'PIR' format with values in the resolution field, then the group representative is the sequence with the highest resolution. This is especially useful when clustering sequences from the PDB.
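The choice of a representative by resolution can be sketched as follows. This is an illustration only, not the command's actual code: the function name pick_representative and the sequence codes are hypothetical, "highest resolution" is assumed to mean the smallest value in angstroms, and falling back to the first member when no resolution is available is likewise an assumption.

```python
def pick_representative(members):
    """Return the code of the group member with the best resolution.

    members is a list of (code, resolution) pairs, where resolution is a
    value in angstroms or None when the field is empty. Assumption: a
    smaller value means a higher resolution.
    """
    with_resolution = [m for m in members if m[1] is not None]
    if not with_resolution:
        # Assumed fallback: no resolution data, keep the first member
        return members[0][0]
    return min(with_resolution, key=lambda m: m[1])[0]


# Hypothetical group of three chains
group = [("1abcA", 2.5), ("2xyzA", 1.8), ("3defA", None)]
print(pick_representative(group))
```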
from modeller import *

log.verbose()
env = environ()

# Read the sequence database, keeping chains of 30 to 3000 residues
sdb = sequence_db(env, seq_database_file='sequences.pir',
                  seq_database_format='PIR',
                  chains_list='ALL', minmax_db_seq_len=[30, 3000],
                  clean_sequences=True)

# Cluster the sequences and write out the groups and representative codes
sdb.filter(rr_file='${LIB}/id.sim.mat', gap_penalties_1d=[-3000, -1000],
           max_diff_res=30, seqid_cut=95,
           output_grp_file='seqfilt.grp',
           output_cod_file='seqfilt.cod')