alignment.segment_matching() -- align segments

Next: The profile class: using Up: The alignment class: comparison Previous: alignment.expand() put Contents Index

alignment.segment_matching() -- align segments

rr_file = <str:1> '$(LIB)/as1.sim.mat' input residue-residue scoring file

align_block = <int:1> 0 the last sequence in the first block of sequences

segment_report = <int:1> 1D6 for SEGMENT_MATCHING

segment_cutoff = <float:1> 999999 cutoff for writing out an alignment in SEGMENT_MATCHING

segment_shifts = <int:0> segment shifts +- in SEGMENT_MATCHING

segment_growth_n = <int:0> reducing/growing segment N-termini in SEGMENT_MATCHING

segment_growth_c = <int:0> reducing/growing segment C-termini in SEGMENT_MATCHING

min_loop_length = <int:0> inter-segment minimal lengths in SEGMENT_MATCHING

file = <str:1> 'default' partial or complete filename

output_directory = <str:1> '' output directory

root_name = <str:1> 'undf' root of a filename for filename construction

file_id = <str:1> 'default' file id for filename construction

file_ext = <str:1> '' file extension for filename construction

Description:

This command enumerates alignments between two blocks of sequences. More precisely, it enumerates the alignments between the segments in the first block and the sequences in the second block. The segments can be moved to the left and right as well as lengthened and shortened, relative to the initial alignment. The regions not in segments or not aligned with segments are left un-aligned, possibly to be modeled as insertions. Typically, the first block of sequences corresponds to structures, the segments to secondary structure elements in the first block, and the second block to the sequences one of which is to be modeled later on. The command is useful for generating many alignments which can then be used by another MODELLER script to generate and evaluate the corresponding 3D models.

All the sequences and segments are defined in the alignment array. The first block of sequences, the ones with segments, are the first align_block sequences. The regions corresponding to the segments are defined by the last entry in the alignment as contiguous blocks of non-gap residues. Any standard single character residue code may be used. The segments must be separated by gap residues, `-'. The remaining sequences from align_block + 1 to $\mbox{NSEQ}-1$ are the second block of sequences. The alignment of the sequences within the two blocks does not change. A sample alignment file is

The enumeration of alignments explores all possible combinations of alignments between each segment and the 2nd block of sequences: The starting position of each segment is varied relative to the input alignment in the interval from segment_shift[] to segment_shift[]. There has to be at least min_loop_length[] and min_loop_length[] residues that are not in any segment before and after the -th segment, respectively. The location of the N-terminus of segment is varied relative to the location in the input alignment in the interval from segment_growth_n[] to segment_growth_n[]. Similarly, the location of the C-terminus of segment is varied relative to the location in the input alignment in the interval from segment_growth_c[] to segment_growth_c[]. The shortening and lengthening of the segments may be useful in determining the best anchor regions for modeling of a loop.

Each alignment is scored according to the similarity scoring matrix specified by filename rr_file. This matrix may contain residue--gap scores, the gap being residue type 21; otherwise the value is set to the smallest value in the matrix. The score for an alignment is obtained by summing scores only over all alignment positions corresponding to the segments (no gap penalty is added for loops). When there is more than one sequence in any of the two blocks, the position score is an average of all pairwise comparisons between the two blocks of sequences. In the case where the number of positions in the alignment changes (i.e., the segments grow or shorten), the scores are not comparable to each other. It is feasible to enumerate on the order of $10^{10}$ different alignments in less than one hour of CPU time.

In general, two runs are required. In the first run, the alignments are scored and a histogram of the scores is written to file file. Then this file must be inspected to determine the cutoff segment_cutoff. In the second run, all the alignments with a score higher than segment_cutoff are written to files in the PIR format, using the standard filenaming convention: output_directory/root_namefile_idfile_ext, where is the alignment file counter. In addition, the alignments are also written out in the PAP format for easier inspection by eye. Thus, segment_cutoff has to be set to a very large value in the first run, to avoid writing alignment files. During a run, a message is written to the log every segment_report aligments; this is useful for knowing what is going on during very long runs.

Example: examples/commands/segment_matching.py

# Example for: alignment.segment_matching()

log.level(1, 1, 1, 1, 0)
env = environ()

aln = alignment(env, file='ednf2.pap', align_codes=('7rsa', 'edn', 'templ'),
                alignment_format='PAP', remove_gaps=True)

aln.segment_matching(output_directory='./', file='segmatch.dat',
                     align_block=1, rr_file='$(LIB)/as1.sim.mat',
                     segment_shifts=(-8, 8, 0, 0),
                     segment_growth_n=(0, 0, 0, 0),
                     segment_growth_c=(0, 0, 0, 0),
                     min_loop_length=(0,2,0),
                     segment_report=1000000, segment_cutoff=0,
                     root_name='segmatch', file_ext='.ali', file_id='default')

Next: The profile class: using Up: The alignment class: comparison Previous: alignment.expand() put Contents Index

Ben Webb 2005-04-21

rr_file = `<str:1>`	`'$(LIB)/as1.sim.mat'`	input residue-residue scoring file
align_block = `<int:1>`	`0`	the last sequence in the first block of sequences
segment_report = `<int:1>`	`1D6`	for SEGMENT_MATCHING
segment_cutoff = `<float:1>`	`999999`	cutoff for writing out an alignment in SEGMENT_MATCHING
segment_shifts = `<int:0>`		segment shifts +- in SEGMENT_MATCHING
segment_growth_n = `<int:0>`		reducing/growing segment N-termini in SEGMENT_MATCHING
segment_growth_c = `<int:0>`		reducing/growing segment C-termini in SEGMENT_MATCHING
min_loop_length = `<int:0>`		inter-segment minimal lengths in SEGMENT_MATCHING
file = `<str:1>`	`'default'`	partial or complete filename
output_directory = `<str:1>`	`''`	output directory
root_name = `<str:1>`	`'undf'`	root of a filename for filename construction
file_id = `<str:1>`	`'default'`	file id for filename construction
file_ext = `<str:1>`	`''`	file extension for filename construction