ALIGN -- align two (blocks of) sequences

RR_FILE = <string:1> '$(LIB)/as1.sim.mat' input residue-residue scoring file

DIRECTORY = <string:1> '' directory list (e.g., 'dir1:dir2:dir3:./:/')

GAP_PENALTIES_1D = <real:2> -900 -50 gap creation and extension penalties for sequence/sequence alignment

ALIGN_BLOCK = <integer:1> 0 the last sequence in the first block of sequences

STOP_ON_ERROR = <integer:1> 1 whether to stop on error

OFF_DIAGONAL = <integer:1> 100 to speed up the alignment

MATRIX_OFFSET = <real:1> 0.00 substitution matrix offset for local alignment

OVERHANG = <integer:1> 0 un-penalized overhangs in protein comparisons

LOCAL_ALIGNMENT = <logical:1> off whether to do local as opposed to global alignment

ALIGN_WHAT = <string:1> 'BLOCK' what to align in ALIGN; 'BLOCK' | 'ALIGNMENT' | 'LAST' | 'PROFILE'

READ_WEIGHTS = <logical:1> off whether to read the whole NxM weight matrix for ALIGN*

WRITE_WEIGHTS = <logical:1> off whether to write the whole NxM weight matrix for ALIGN*

INPUT_WEIGHTS_FILE = <string:1> ''

OUTPUT_WEIGHTS_FILE = <string:1> ''

WEIGH_SEQUENCES = <logical:1> off whether or not to weigh sequences in a profile

SMOOTH_PROF_WEIGHT = <real:1> 10 for smoothing the profile aa frequency with a prior

Output:: MODELLER_STATUS = <integer:1>

Description:

This command aligns two blocks of sequences.

The two blocks of sequences to be aligned are sequences 1 to ALIGN_BLOCK and ALIGN_BLOCK+1 to the last sequence. The sequences within the two blocks should already be aligned; their alignment does not change.

The command can do either the global (similar to [Needleman & Wunsch, 1970]; LOCAL_ALIGNMENT = off) or local dynamic programming alignment (similar to [Smith & Waterman, 1981]; LOCAL_ALIGNMENT = on).
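The difference between the two modes can be illustrated with a minimal Python sketch. It is not MODELLER's implementation: the function name `align_score`, the toy match/mismatch scores, and the single linear gap penalty (instead of separate creation and extension penalties) are assumptions made to keep the example short.

```python
def align_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0, local=False):
    """Best similarity score of a global or local alignment of a and b.

    Toy scoring with a linear gap penalty; all names are illustrative.
    """
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, n + 1):
            H[i][0] = i * gap
        for j in range(1, m + 1):
            H[0][j] = j * gap
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(H[i - 1][j - 1] + s,   # match or mismatch
                    H[i - 1][j] + gap,     # gap in b
                    H[i][j - 1] + gap)     # gap in a
            if local:
                # Local alignment: restart whenever the score drops below 0.
                h = max(h, 0.0)
                best = max(best, h)
            H[i][j] = h
    # Global: score of the full-length alignment; local: best cell anywhere.
    return best if local else H[n][m]


print(align_score("ACGT", "AGT"))                      # global
print(align_score("TTACGTTT", "GGACGGG", local=True))  # local
```

In the global mode the score is read from the last cell, so every residue of both sequences contributes; in the local mode the best cell anywhere in the matrix is used, so only the best-matching segment contributes.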

For the global alignment, set the overhang length OVERHANG to more than 0 so that the corresponding number of residues at any of the four termini is not penalized by any gap penalties (this makes it a pseudo local alignment).

To speed up the calculation, set OFF_DIAGONAL to a number smaller than the shortest sequence length. Alignments that match residues $i$ and $j$ with $\vert i-j\vert >$ OFF_DIAGONAL are not considered at all in the search for the best alignment.
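The effect of restricting the search to a band around the diagonal can be sketched in Python as follows. This is an illustration only, under assumptions not taken from MODELLER: the function name `banded_global_score`, the toy scores, the linear gap penalty, and the choice to exclude every cell outside the band (not only matched residue pairs).

```python
import math


def banded_global_score(a, b, off_diagonal,
                        match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment score using only cells with |i - j| <= off_diagonal."""
    n, m = len(a), len(b)
    neg_inf = -math.inf
    # Cells outside the band stay at -inf and can never be part of a path.
    H = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    H[0][0] = 0.0
    for i in range(n + 1):
        lo = max(0, i - off_diagonal)
        hi = min(m, i + off_diagonal)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append(H[i - 1][j - 1] + s)
            if i > 0:
                candidates.append(H[i - 1][j] + gap)
            if j > 0:
                candidates.append(H[i][j - 1] + gap)
            H[i][j] = max(candidates)
    # -inf means no alignment fits inside the band.
    return H[n][m]


print(banded_global_score("ACGT", "AGT", off_diagonal=1))
print(banded_global_score("ACGT", "AGT", off_diagonal=0))
```

Only about (2 x off_diagonal + 1) cells per row are filled, which is where the speed-up comes from. The second call shows the cost of too narrow a band: when the band is narrower than the difference in sequence lengths, no global alignment is reachable.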

The gap initiation and extension penalties are specified by GAP_PENALTIES_1D. The default values of -900 -50 for the 'as1.sim.mat' similarity matrix were found to be optimal for pairwise alignments of sequences that share from 30% to 45% sequence identity (RS and AŠ, in preparation).

The residue type - residue type scores are read from file RR_FILE. The routine automatically determines whether it has to maximize similarity or minimize distance.

MATRIX_OFFSET applies to local alignment only and influences its length. MATRIX_OFFSET should be somewhere between the lowest and highest residue-residue scores. A smaller value of this parameter will make the local alignments shorter when distance is minimized, and longer when similarity is maximized. This works as follows: the recursively constructed dynamic programming comparison matrix is reset to 0 at position $(i, j)$ when the current alignment score becomes larger (distance) or smaller (similarity) than MATRIX_OFFSET. Note that this is equivalent to the usual shifting of the residue-residue scoring matrix, in the sense that there are two combinations of GAP_PENALTIES_1D and MATRIX_OFFSET values that give exactly the same alignments: one in which the matrix is actually offset (with 0 used to restart local alignments in dynamic programming), and one in which the matrix is not offset but MATRIX_OFFSET is used as the cutoff for restarting local alignments. For the same reason, the matrix offset has no effect on global alignments if the gap extension penalty is also shifted by half of the matrix offset.
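The statement about global alignments can be checked with a short derivation. It is a sketch under assumptions not stated in this section: a global alignment with OVERHANG = 0, and a gap of length $\ell$ scored as $u + v\ell$, where $u$ and $v$ are the creation and extension penalties of GAP_PENALTIES_1D. The symbols $k$, $g$, and $c$ are introduced here for illustration only.

```latex
% n, m : lengths of the two sequences
% k    : number of aligned residue-residue pairs
% g    : total number of residues aligned to gaps (sum of all gap lengths)
% c    : constant added to every residue-residue score s
\begin{align*}
n + m &= 2k + g
  \quad\Longrightarrow\quad k = \tfrac{1}{2}\,(n + m - g), \\
S  &= \sum_{\text{pairs}} s(a_i, b_j) + \sum_{\text{gaps}} (u + v\,\ell), \\
S' &= \sum_{\text{pairs}} \bigl(s(a_i, b_j) + c\bigr)
    + \sum_{\text{gaps}} \bigl(u + (v + \tfrac{c}{2})\,\ell\bigr) \\
   &= S + c\,k + \tfrac{c}{2}\,g
    = S + \tfrac{c}{2}\,(n + m).
\end{align*}
```

Under these assumptions every global alignment of the same two sequences changes by the same constant $\tfrac{c}{2}(n + m)$, so the ranking of alignments, and therefore the optimal alignment, is unchanged.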

The position-position score is an average residue-residue score for all possible pairwise comparisons between the two blocks ($n \times m$ comparisons are done, where $n$ and $m$ are the numbers of sequences in the two blocks, respectively). The first exception to this is when ALIGN_WHAT is set to 'ALIGNMENT', in which case the two alignments defined by ALIGN_BLOCK are aligned; i.e., the score is obtained by comparing only equivalent positions between the two alignment blocks (only $n$ comparisons are done, where $n$ is the number of sequences in each of the two blocks). This option is useful in combination with COMPARE_ALIGNMENTS and WRITE_ALIGNMENT for evaluation of various alignment parameters and methods. The second exception is when ALIGN_WHAT is set to 'LAST', in which case only the last sequences in the two blocks are used to get the scores. In 'BLOCK', 'ALIGNMENT', and 'LAST' comparisons, the penalty for comparing a gap with a residue during the calculation of the scoring matrix is obtained from the score file (a gap-gap match should have a score of 0.0).
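The three ways of scoring a pair of alignment positions can be sketched in Python. This is a hedged illustration, not MODELLER code: `position_score` and `toy_score` are hypothetical names, the toy scores stand in for the values read from RR_FILE, and pairing the k-th sequence of one block with the k-th sequence of the other for 'ALIGNMENT' is an assumption about what the equivalent comparisons are.

```python
def toy_score(x, y):
    """Hypothetical residue-residue score; real values come from RR_FILE."""
    if x == "-" and y == "-":
        return 0.0  # gap-gap match scores 0.0
    return 1.0 if x == y else 0.0


def position_score(col_a, col_b, rr_score=toy_score, align_what="BLOCK"):
    """Score one position of block A against one position of block B.

    col_a, col_b: residues found at that position in each block's sequences.
    """
    if align_what == "BLOCK":
        # All n x m pairwise comparisons between the two blocks.
        pairs = [(x, y) for x in col_a for y in col_b]
    elif align_what == "ALIGNMENT":
        # Assumed pairing: only the n equivalent sequences are compared.
        if len(col_a) != len(col_b):
            raise ValueError("both blocks need the same number of sequences")
        pairs = list(zip(col_a, col_b))
    elif align_what == "LAST":
        # Only the last sequence of each block is used.
        pairs = [(col_a[-1], col_b[-1])]
    else:
        raise ValueError("unsupported align_what: " + repr(align_what))
    return sum(rr_score(x, y) for x, y in pairs) / len(pairs)


col_a = ["A", "G"]  # one position in a block of two sequences
col_b = ["A", "C"]
for mode in ("BLOCK", "ALIGNMENT", "LAST"):
    print(mode, position_score(col_a, col_b, align_what=mode))
```

The 'PROFILE' option is left out of the sketch because this section does not describe how its scores are computed.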

Only the 20 standard residue types, plus Asx (changes to Asn) and Glx (changes to Gln) are recognized. Every other unrecognized residue, except for a gap and a chain break, changes to Gly for comparison purposes.
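The residue type conversion described above amounts to a small lookup, sketched here in Python. The one-letter codes (B for Asx, Z for Glx), the '-' gap symbol, the '/' chain break symbol, and the function names are assumptions for illustration; the section itself only names the residue types.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard types


def normalize_residue(code):
    """Map a one-letter residue code to the type used for comparison."""
    c = code.upper()
    if c in STANDARD_RESIDUES:
        return c
    if c == "B":          # Asx is treated as Asn
        return "N"
    if c == "Z":          # Glx is treated as Gln
        return "Q"
    if c in ("-", "/"):   # gaps and chain breaks are left unchanged
        return c
    return "G"            # anything else is treated as Gly


def normalize_sequence(seq):
    """Apply normalize_residue to every position of a sequence string."""
    return "".join(normalize_residue(c) for c in seq)


print(normalize_sequence("ABZX-/U"))
```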

If you receive an error message asking you to increase the MAXRES constant, you can try increasing the gap penalties first. Here and elsewhere in MODELLER, MAXRES is both the maximal number of residues in a protein and the maximal length of an alignment. If the length of the alignment arrays is too small, MODELLER_STATUS becomes 1 (Section 2.1.3).

For the time being, this and the other alignment commands (MALIGN, ALIGN2D, ALIGN3D, and MALIGN3D) remove chain break information from the CALN array, which means that chain breaks are not retained when the alignment is written to a file after executing these commands.

Example:

# Example for: ALIGN

# This will read two sequences, align them, and write the alignment
# to a file:

SET OUTPUT_CONTROL = 1 1 1 1 1

READ_ALIGNMENT FILE = 'toxin.ali', ALIGN_CODES = '1fas' '2ctx'
# The as1.sim.mat similarity matrix is used by default:
ALIGN GAP_PENALTIES_1D = -600 -400
WRITE_ALIGNMENT FILE = 'toxin-seq.ali'

Ben Webb 2004-10-04
