The initial set of sequences must first be read by the READ_SEQUENCE_DB command, with SEQ_DATABASE_FORMAT set to either 'PIR' or 'FASTA'.
RR_FILE is the residue-residue substitution matrix file. For efficiency reasons, the command handles only similarity matrices.
The command uses the Smith-Waterman dynamic programming method to find the best sequence alignment, given the gap creation and extension penalties specified by GAP_PENALTIES_1D and the residue type scores read from the file RR_FILE. GAP_PENALTIES_1D[1] is the gap creation penalty and GAP_PENALTIES_1D[2] is the gap extension penalty.
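The scoring step described above can be illustrated with a minimal Python sketch of Smith-Waterman scoring with separate gap creation and extension penalties. It is not the SEQFILTER implementation. The function names, the identity scoring function (1000 for a match, 0 otherwise) and the convention that a gap of length k costs creation + (k - 1) * extension are all assumptions made for illustration. Penalties are negative numbers added to the score, as in GAP_PENALTIES_1D = -3000 -1000.

```python
def identity_sim(a, b):
    # Hypothetical similarity score: 1000 for identical residues, 0 otherwise.
    return 1000 if a == b else 0


def smith_waterman_score(seq_a, seq_b, sim, gap_open, gap_extend):
    """Return the best local alignment score (affine gaps, Gotoh recurrences).

    gap_open and gap_extend are negative values added to the score.
    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    m = len(seq_b)
    h_prev = [0] * (m + 1)        # best scores for the previous row
    f_prev = [neg_inf] * (m + 1)  # best scores ending in a gap in seq_b
    best = 0
    for i in range(1, len(seq_a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # best score ending in a gap in seq_a
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            diag = h_prev[j - 1] + sim(seq_a[i - 1], seq_b[j - 1])
            # A local alignment may restart anywhere, hence the 0 floor.
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACDEFGHIK", "ACDEGHIK", identity_sim, -3000, -1000))
```

The 0 floor in the recurrence relies on matches scoring higher than mismatches, which is one reason a local alignment of this kind needs a similarity matrix rather than a distance matrix.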
The final list of groups and their members is written out to OUTPUT_GRP_FILE. The codes of the representative sequences are written out to OUTPUT_COD_FILE.
The clustering algorithm evaluates the following conditions in hierarchical order before adding a sequence to a group:
If the initial set of sequences was read in 'PIR' format with values in the resolution field, then the group representative is the sequence with the highest resolution. This is especially useful when clustering sequences from the PDB.
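The choice of representative can be sketched in Python as follows. This is only an illustration, not the command's code. It assumes that resolution is given in angstroms, so the highest resolution is the smallest value, that a missing resolution is represented by None, and that the first member is used when no member has a resolution. The function name and the data layout are hypothetical.

```python
def pick_representative(members):
    """Pick the code of the group member with the best resolution.

    members: list of (code, resolution) pairs, where resolution is a float
    in angstroms or None if unknown (hypothetical layout).
    """
    with_resolution = [m for m in members if m[1] is not None]
    if not with_resolution:
        # Assumed fallback: no resolution data, keep the first member.
        return members[0][0]
    # Smaller value in angstroms means higher resolution.
    return min(with_resolution, key=lambda m: m[1])[0]


if __name__ == "__main__":
    group = [("1abcA", 2.5), ("2xyzA", 1.8), ("3nmrA", None)]
    print(pick_representative(group))
```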
SET OUTPUT_CONTROL = 1 1 1 1 1
SET MINMAX_DB_SEQ_LEN = 30 3000, CLEAN_SEQUENCES = on

READ_SEQUENCE_DB SEQ_DATABASE_FILE = 'sequences.pir', ;
                 CHAINS_LIST = 'all', ;
                 SEQ_DATABASE_FORMAT = 'PIR'

SET RR_FILE = '${LIB}/id.sim.mat'
SET GAP_PENALTIES_1D = -3000 -1000
SET MAX_DIFF_RES = 30
SET MAX_UNALIGNED_RES = 10
SET OUTPUT_GRP_FILE = 'seqfilt.grp'
SET OUTPUT_COD_FILE = 'seqfilt.cod'

SEQFILTER SEQID_CUT = 95