io = <io_data> | Options for reading atom files | |
align3d_trf = <bool:1> | False | whether to transform the distances before dynamic programming |
align_block = <int:1> | 0 | the last sequence in the first block of sequences |
alignment_type = <str:1> | 'PROGRESSIVE' | 'PAIRWISE' 'TREE' 'PROGRESSIVE' for SALIGN |
align_what = <str:1> | 'BLOCK' | what to align in ALIGN; 'BLOCK' | 'ALIGNMENT' | 'LAST' | 'PROFILE' |
auto_overhang = <bool:1> | False | overhang values made dependent on sequence length difference |
comparison_type = <str:1> | 'MAT' | 'MAT' or 'PSSM' for comparing matrices or PSSMs when profiles are compared |
current_directory = <bool:1> | True | whether to write output .fit files to current directory |
dendrogram_file = <str:1> | '' | File into which the SALIGN dendrogram is written out |
ext_tree = <bool:1> | False | To read a pre computed tree for SALIGN |
feature_weights = <float:6> | 1 0 0 0 0 0 | feature weights for SALIGN |
fit_atoms = <str:1> | 'CA' | atom type(s) being superposed |
fit_on_first = <bool:1> | False | whether or not all structures are to be fit on the first structure, given the final alignment |
fit_pdbnam = <bool:1> | True | whether or not to add _fit to the PDB file name in output alifile by SALIGN |
fit = <bool:1> | True | whether to do pairwise least-squares fitting or ALIGN2D alignment |
fix_offsets = <float:5> | 0 1 2 3 4 | offsets of the ALIGN2D alignment score for "fixed" positions indicated by ' 123456789' in line '_fix_pos' |
gap_function = <bool:1> | False | whether or not to switch on functional gap penalty in salign |
gap_gap_score = <float:1> | 0 | dissimilarity score for aligning gap with gap, in SALIGN |
gap_penalties_1d = <float:2> | 900 50 | gap creation and extension penalties for sequence/sequence alignment |
gap_penalties_2d = <float:9> | 3.5 3.5 3.5 0.2 4.0 6.5 2.0 0 0 | gap penalties for sequence/structure alignment: helix, beta, accessibility, straightness, and CA-CA distance factor, dst min, dst power, t, structure_profile ; best U,V=-100,0 |
gap_penalties_3d = <float:2> | 0.0 1.75 | gap creation and extension penalties for structure/structure superposition |
gap_residue_score = type1 | 0 | dissimilarity score for aligning gap with residue, in SALIGN |
improve_alignment = <bool:1> | True | whether or not to optimize alignment in SALIGN |
input_weights_file = <str:1> | '' | Exteral weight matrix input to MODELLER (SALIGN/ALIGN) |
local_alignment = <bool:1> | False | whether to do local as opposed to global alignment |
matrix_comparison = <str:1> | 'CC' | 'CC', 'MAX', 'AVE', - kinds of matrix comparisons |
matrix_offset = <float:1> | 0.00 | substitution matrix offset for local alignment |
matrix_offset_3d = <float:1> | 0.1 | substitution matrix offset for local alignment (RMS) |
max_gap_length = <int:1> | 999999 | maximal length of gap in protein comparisons |
normalize_pp_scores = <bool:1> | False | whether or not to normalize position-position scores in SALIGN |
nsegm = <int:1> | 2 | number of segments for DRMS computation |
no_ter = <bool:1> | False | whether to not write TER into PDB |
n_subopt = <int:1> | 1 | number of optimal and suboptimal alignments ALIGN/ALIGN2D |
off_diagonal = <int:1> | 100 | to speed up the alignment |
output_weights_file = <str:1> | '' | File into which the weight file is wriiten (iff WRITE_WEIGHTS = 'on') |
output = <str:1> | 'LONG' | what and/or how to output |
overhang = <int:1> | 0 | un-penalized overhangs in protein comparisons |
overhang_auto_limit = <int:1> | 60 | auto_overhang effective if seq length diff greater than overhang_auto_limit |
overhang_factor = <float:1> | 0.4 | Factor to multiply seq. length diff with when auto_overhang is on |
rms_cutoffs = <float:11> | 3.5 3.5 60 60 15 60 60 60 60 60 60 | cutoffs for RMS, DRMS, Alpha Phi Psi Omega chi1 chi2 chi3 chi4 chi5 |
rr_file = <str:1> | '$(LIB)/as1.sim.mat' | input residue-residue scoring file |
similarity_flag = <bool:1> | False | when turned on, the SALIGN command does not convert numbers into a distance sense. |
smooth_prof_weight = <float:1> | 10 | for smoothing the profile aa frequency with a prior |
subopt_offset = <float:1> | 0.0 | offset for residue-residue score in getting suboptimals in ALIGN/ALIGN2D |
substitution = <bool:1> | False | whether to use the background in PSSM comparison |
weights_type = <str:1> | 'SIMILAR' | or 'DISTANCE' for the kind of substitution values |
write_fit = <bool:1> | False | whether to write out fitted coordinates to .fit files |
write_weights = <bool:1> | False | whether to write the whole NxM weight matrix for ALIGN* |
write_whole_pdb = <bool:1> | True | whether to write out all lines in the input PDB file |
Please note that the method is still in development, and has not yet been fully benchmarked. As with any other alignment method, generated alignments should be assessed for quality.
Broadly classifying, three different types of protein alignment categories are tackled by this command:
The command incorporates the functionality of several old MODELLER commands (alignment.align(), alignment.align2d(), alignment.malign(), alignment.align3d(), and alignment.malign3d()). Some of the examples below illustrate the equivalent script files to replace the old alignment commands with alignment.salign().
In addition to these, this command has several new alignment features including profile-profile sequence alignments and a dendrogram based multiple sequence/structure alignment among others.
All pair-wise alignments make use of local or global dynamic programming. A switch from one to another can be effected by setting local_alignment to True or False. The dynamic programming can be carried out using affine gap penalties (as previously used in alignment.align()) or an environment dependent gap penalty function (as used in alignment.align2d()). The choice of gap penalty types can be regulated by switching the variable gap_function on or off. All arguments that associated to the alignment.align() and alignment.align2d() commands apply.
If at least one of the blocks in a pairwise alignment consists of structures, dynamic programming can be performed using structure dependent gap penalties.
The values of both improve_alignment and fit are used in the calculation of the position feature. That is, the initial alignment and the orientation of the coordinates can be selected not to change at all during the calculation of the inter-molecular distance matrix.
When the calculation of the inter-molecular distance matrix is finished, all the structures in the second sub-alignment are rotated and translated following the optimal rotation and translation of the second average on the first average. These superpositions prepare the individual structures for the next of the stages of the progressive multiple alignment, and also orient all the structures for writing out to 'alignment[i].code_fit.pdb' atom files if write_fit = True. If fit_pdbnam = False, the PDB filenames in the output alignment file will not have the '_fit.pdb' extensions. Thus, feature 2 needs to be selected by feature_weight[2] if you wish to write out the structures superposed according to the tree-following procedure; also, fit_on_first must be False, otherwise the structures are written out superposed on the first structure according to the final alignment (see also below).
The alignment produced within the routine that calculates does not generally correspond to the alignment calculated based on . Therefore, the multiply superposed structures are not necessarily superposed based on the final multiple alignment produced by alignment.salign(). If you wish such a superposition, you can use alignment.malign3d() with fit = False and write_fit = True (the meaning of fit is different between alignment.salign() and alignment.malign3d()).
Unless the position feature is selected, the initial alignment does not matter. If the position feature is selected, a good starting alignment is a multiple sequence alignment, obtained either by alignment.malign() or by alignment.salign() used without the position feature (the initial alignment can also be prepared using the position feature). If the position feature is used, each pair of structures needs to have at least 3 aligned residues at all points during the alignment.
There are several possibilities as to the final orientation of the input coordinates. If fit_on_first is True, all the coordinate sets are superposed on the first structure, using the final multi-feature multiple alignment. If fit_on_first is False, and position feature was used, and fit was True, the coordinates will be superposed in the progressive manner guided by the tree, by the routine that calculates the inter-molecular distance matrices; this superposition is based only on the positions of the selected atoms (feature 2), not on other features such as residue type, secondary, structure, etc. If improve_alignment is False, it does not make much sense to have fit = True (use fit_on_first = True).
For local alignments, the matrix offset variable is matrix_offset_3d.
# profile-profile alignment using salign log.level(1, 0, 1, 1, 1) env = environ() aln = alignment(env, file='mega_prune.faa', alignment_format='FASTA') aln.salign(rr_file='${LIB}/blosum62.sim.mat', gap_penalties_1d=(-500, 0), output='', align_block=15, # no. of seqs. in first MSA align_what='PROFILE', alignment_type='PAIRWISE', comparison_type='PSSM', # or 'MAT' (Caution: Method NOT benchmarked # for 'MAT') similarity_flag=True, # The score matrix is not rescaled substitution=True, # The BLOSUM62 substitution values are # multiplied to the corr. coef. #write_weights=True, #output_weights_file='test.mtx', # optional, to write weight matrix smooth_prof_weight=0.10) # For mixing data with priors #write out aligned profiles (MSA) aln.write(file='salign.ali', alignment_format='PIR') # Make a pairwise alignment of two sequences aln = alignment(env, file='salign.ali', alignment_format='PIR', align_codes=('12asA', '1b8aA')) aln.write(file='salign_pair.ali', alignment_format='PIR') aln.write(file='salign_pair.pap', alignment_format='PAP')
# align2d/align using salign # parameters to be input by the user # 1. gap_penalties_1d # 2. gap_penalties_2d = (4.35, 1.2, 0.9, 1.2, 0.6, 8.6, 1.2, 0., 0.) (default) # 3. input alignment file log.verbose() env = environ() env.io.atom_files_directory='./' aln = alignment(env, file='align2d_in.ali', align_codes='all') aln.salign(rr_file='$(LIB)/as1.sim.mat', # Substitution matrix used output='', max_gap_length=20, gap_function=True, # If False then align2d not done feature_weights=(1., 0., 0., 0., 0., 0.), gap_penalties_1d=(-100, 0), gap_penalties_2d=(3.5, 3.5, 3.5, 0.2, 4.0, 6.5, 2.0, 0.0, 0.0), # d.p. score matrix #write_weights=True, output_weights_file='salign.mtx' similarity_flag=True) # Ensuring that the dynamic programming # matrix is not scaled to a difference matrix aln.write(file='align2d.ali', alignment_format='PIR') aln.write(file='align2d.pap', alignment_format='PAP')
The alignment of proteins within a sub-alignment does not change when the sub-alignment is aligned with another protein or sub-alignment. The pairwise alignment of sub-alignments is guided by the dendrogram. First, the most similar pair of proteins are aligned. Second, the next most similar pair of proteins are aligned, or the third protein is aligned with the sub-alignment of the first two, as indicated by the dendrogram. This greedy, progressive procedure requires steps to align all proteins, and each step requires a pairwise alignment of two sub-alignments.
If in a multiple alignment, overhangs are to be penalized differently for the pairs of alignments that create the multiple, auto_overhang can be set to True. This will ensure that the value of overhang changes as overhang_factor times the numerical difference in the residues of the pair. Further, this is only effected if the difference is greater than overhang_auto_limit.
The dendrogram can be written out in a separate file by specifying the file name to dendrogram_file.
# Illustrates the SALIGN multiple structure/sequence alignment log.verbose() env = environ() env.io.atom_files_directory = './:../atom_files/' aln = alignment(env) for (code, chain) in (('1is4', 'A'), ('1uld', 'D'), ('1ulf', 'B'), ('1ulg', 'B'), ('1is5', 'A')): mdl = model(env, file=code, model_segment=('FIRST:'+chain, 'LAST:'+chain)) aln.append_model(mdl, atom_files=code, align_codes=code+chain) for (weights, write_fit, whole) in (((1., 0., 0., 0., 1., 0.), False, True), ((1., 0.5, 1., 1., 1., 0.), False, True), ((1., 1., 1., 1., 1., 0.), True, False)): aln.salign(rms_cutoffs=(3.5, 6., 60, 60, 15, 60, 60, 60, 60, 60, 60), normalize_pp_scores=False, rr_file='$(LIB)/as1.sim.mat', overhang=30, gap_penalties_1d=(-450, -50), gap_penalties_3d=(0, 3), gap_gap_score=0, gap_residue_score=0, dendrogram_file='1is3A.tree', alignment_type='tree', # If 'progresive', the tree is not # computed and all structues will be # aligned sequentially to the first #ext_tree=True, # Tree building can be avoided if the tree # is input #input_weights_file='1is3A_exmat.mtx', feature_weights=weights, # For a multiple sequence alignment only # the first feature needs to be non-zero improve_alignment=True, fit=True, write_fit=write_fit, write_whole_pdb=whole, output='ALIGNMENT QUALITY') aln.write(file='1is3A.pap', alignment_format='PAP') aln.write(file='1is3A.ali', alignment_format='PIR') # The number of equivalent positions at different RMS_CUTOFF values can be # computed by changing the RMS value and keeping all feature weights = 0 aln.salign(rms_cutoffs=(1.0, 6., 60, 60, 15, 60, 60, 60, 60, 60, 60), normalize_pp_scores=False, rr_file='$(LIB)/as1.sim.mat', overhang=30, gap_penalties_1d=(-450, -50), gap_penalties_3d=(0, 3), gap_gap_score=0, gap_residue_score=0, dendrogram_file='1is3A.tree', alignment_type='progressive', feature_weights=[0]*6, improve_alignment=False, fit=False, write_fit=True, write_whole_pdb=False, output='QUALITY')
log.verbose() env = environ() aln = alignment(env, file='fm07254_test.ali', alignment_format='PIR') aln.salign(feature_weights=(1., 0, 0, 0, 0, 0), gap_penalties_1d=(-450, -50), n_subopt = 5, subopt_offset = 15)
# Demonstrating the use of alignment restraints, only available in # align2d and salign: log.verbose() env = environ() # The special alignment entry '_fix_pos' has to be the last entry in the # alignment array. Its sequence contains characters blank, 1, 2, 3, and 4 # at the restrained alignment positions. The residue-residue score from # the substitution matrix for these positions will be offset by the scalar # value FIX_OFFSETS[1..5]. aln = alignment(env, file='fix_positions.ali', align_codes=('1leh', '3btoA', '_fix_pos')) # fix_offsets specifies the offset corresponding to character ' 1234' in the # _fix_pos entry in the alignment # (this offsets unlabeled positions for 0, the ones indicated by 1 by # 1000, those indicated by 2 by 2000, etc.) aln.salign(fix_offsets=(0, 1000, 2000, 3000, 4000), gap_penalties_2d=(0, 0, 0, 0, 0, 0, 0, 0, 0), # Any values are # possible here local_alignment=False, # Local alignment works, too gap_penalties_1d=(-600, -400)) # This is best with the default value # of gap_penalties_2d # Write it out, the _fix_pos is erased automatically in salign: aln.write(file='fix_positions_salign.pap', alignment_format='PAP')
External weight matrix: An example of using feature 6.
# Reads an external matrix log.verbose() env = environ() aln = alignment(env, file='1dubA-1nzyA.ali', align_codes='all') aln.salign(alignment_type='pairwise', output='', rr_file='$(LIB)/blosum62.sim.mat', #rr_file='$(LIB)/as1.sim.mat', #max_gap_length=20, gap_function=False, input_weights_file='external.mtx', # External weight matrix #weights_type='DISTANCE', # type of ext. wgt. mtx # ensure appropriate gap penalites for the ext. matrix #feature_weights=(1., 0., 0., 0., 0., 0.), gap_penalties_1d=(30, 26), #write_weights=True, output_weights_file='score.mtx', feature_weights=(1., 0., 0., 0., 0., 1.), gap_penalties_1d=(-500, -300)) aln.write(file='output.ali', alignment_format='PIR') aln.write(file='output.pap', alignment_format='PAP')
Multiple structure alignment according to a user specified dendrogram The user has the option of inputting an X matrix from which a dendrogram can be inferred. The multiple tree alignment is then confined to follow this externally input dendrogram. To effect this, ext_tree is set to True and the external matrix file is specified in input_weights_file. A word of caution: it is not advisible to have feature 6 as non zero when an external dendrogram matrix is read.
The gap correction function is , where and are the number of proteins in the two sub-alignments, is the number of gap-residue pairs, and is the number of gap-gap pairs when comparing protein positions from one sub-alignment with protein position from the other sub-alignment, is gap_residue_score and is gap_gap_score. The smaller (even negative) is gap_gap_score, and the larger is gap_residue_score, the more will the gaps be aligned with gaps.
For each pairwise alignment, the weight matrix has dimensions and that correspond to the lengths of the sub-alignments to be aligned based on the weight matrix . The dissimilarity score for aligning position with position is calculated as , where the sum runs over all selected features , and is a function that may be used to correct the score for the presence of gaps within the sub-alignments (see below). A feature is selected when its weight (specified in feature_weights) is non-zero. The matrices are normalized to have the mean of 0 and standard deviation of 1 when normalize_pp_scores is True, but it is recommended not to use this option for now (i.e., use feature_weights to scale the contributions of the different features to the final ). The weights of 1 will weigh the different features approximately evenly (the residue-residue dissimilarities of feature 1 are scaled to a range from 0 to 1, the position differences of feature 2 are in angstroms, the fractional solvent accessibility scores of feature 3 and the secondary structure scores of feature 4 range from 0 to 2, and the DRMS difference of feature 5 is expressed in angstroms).
If you enable verbose logging with log.verbose(), there will be more output in the 'log' file, such as the dendrogram. The dendrogram can also be written out in a separate file by specifying the file name to dendrogram_file.
Argument output can contain the following values:
If write_fit is True, the fitted atom files are written out in their fitted orientations. For this and other options below, also read the text above.
If write_weights is True, the dynamic programming weight matrix is written out into the file specified by output_weights_file.
If current_directory is True, the output _pdb.fit files will be written to the current directory. Otherwise, the output will be in the directory with the original files.
If write_whole_pdb is True, the whole PDB files are written out; otherwise only the parts corresponding to the aligned sequences are output.
If fit is False, the initial superposition is not changed. This is useful when all the structures have to be compared with a given alignment as is, without changing their relative orientation.
If fit_on_first is True, the structures are fit to the first structure according to the final alignment before they are written out.
If improve_alignment is False, the initial alignment is not changed, though the structures may still be superimposed if fit = True. This is useful when all the structures have to be superimposed with the initial alignment.