next up previous contents index
Next: The density class: handling Up: The sequence_db class: using Previous: sequence_db.search() search   Contents   Index

sequence_db.filter() -- cluster sequences by sequence-identity

rr_file = <str:1> '$(LIB)/as1.sim.mat' input residue-residue scoring file
gap_penalties_1d = <float:2> 900 50 gap creation and extension penalties for sequence/sequence alignment
matrix_offset = <float:1> 0.00 substitution matrix offset for local alignment
output_grp_file = <str:1> 'seqfilt.grp' output file for seqfilter groups
output_cod_file = <str:1> 'seqfilt.cod' output file for seqfilter representative groups
seqid_cut = <int:1> 95 Sequence Identity cut-off for SEQFILTER
max_diff_res = <int:1> 30 Length cut-off for SEQFILTER

Description:
This command clusters a set of sequences by sequence identity. The command uses a greedy algorithm: the first sequence in the file becomes the first group representative. All other sequences are compared with this and if they are similar enough, as specified in seqid_cut, they are added as members of this group. These sequences are not used for further comparisons. The next non-member sequence becomes the next group representative and so on.

The initial set of sequences must be read previously by the sequence_db.read() command with seq_database_format being either 'PIR' or 'FASTA'.

rr_file is residue-residue substitution matrix. The command only handles similarity matrices for efficiency purposes.

The command uses the Smith-Waterman dynamic programming method for the best sequence alignment, given the gap creation and extension penalties specified by gap_penalties_1d and residue type scores read from file rr_file. gap_penalties_1d[1] is a gap creation penalty and gap_penalties_1d[2] is a gap extension penalty. The command only works with similarity matrices for efficiency reasons.

The final list of groups and their members is written out to output_grp_file. The codes of the representative sequences is written out to output_cod_file.

The clustering algorithm evaluates the following conditions in hierarchial order before adding a sequence to a group:

  1. The difference in length: If the difference in the number of residues between the group representative and the sequence being compared is greater than max_diff_res, the sequence will not be included into that group.

  2. The number of unaligned residues: After the local alignment is performed, a sequence will not be considered for addition into a group unless the difference between the smaller of the two sequences and the number of aligned positions in the alignment is less than max_unaligned_res.

  3. Sequence Identity: Finally, if the sequence identity calculated from the alignment is greater than seqid_cut, the sequence is added to a group.

If the initial set of sequences read were in 'PIR' format with values in the resolution field, then the group representative is the sequence with the highest resolution. This is especially useful when clustering sequences from the PDB.

Example: examples/commands/seqfilter.py


log.verbose()
env = environ()

sdb = sequence_db(env, seq_database_file='sequences.pir',
                  seq_database_format='PIR',
                  chains_list='ALL', minmax_db_seq_len=[30, 3000],
                  clean_sequences=True)

sdb.filter(rr_file='${LIB}/id.sim.mat', gap_penalties_1d=[-3000, -1000],
           max_diff_res=30, seqid_cut=95, output_grp_file='seqfilt.grp',
           output_cod_file='seqfilt.cod')


next up previous contents index
Next: The density class: handling Up: The sequence_db class: using Previous: sequence_db.search() search   Contents   Index
Ben Webb 2006-02-28