next up previous contents index
Next: DELETE_ALIGNMENT delete Up: Comparison and searching of Previous: EXPAND_ALIGNMENT put   Contents   Index

SEQUENCE_SEARCH -- search for similar sequences

RR_FILE = $\langle{\tt string:1}\rangle$ '$(LIB)/as1.sim.mat' input residue-residue scoring file
FILE = $\langle{\tt string:1}\rangle$ 'default' file with the target sequence
ALIGN_CODES = $\langle{\tt string:0}\rangle$ 'all' the code of the target sequence
DIRECTORY = $\langle{\tt string:1}\rangle$ '' directory list (e.g., 'dir1:dir2:dir3:./:/')
GAP_PENALTIES_1D = $\langle{\tt real:2}\rangle$ -900 -50 gap creation and extension penalties for sequence/sequence alignment
OFF_DIAGONAL = $\langle{\tt integer:1}\rangle$ 100 to speed up the alignment
MATRIX_OFFSET = $\langle{\tt real:1}\rangle$ 0.00 substitution matrix offset for local alignment
OVERHANG = $\langle{\tt integer:1}\rangle$ 0 un-penalized overhangs in protein comparisons
LOCAL_ALIGNMENT = $\langle{\tt logical:1}\rangle$ off whether to do local as opposed to global alignment
SEARCH_CHAINS_LIST = $\langle{\tt string:1}\rangle$ '$(LIB)/CHAINS_3.0_40_XN.cod' file with sequences
SEARCH_CHAINS_FILE = $\langle{\tt string:1}\rangle$ '$(LIB)/CHAINS_all.seq' file with a list of sequence codes
SEARCH_GROUP_LIST = $\langle{\tt string:1}\rangle$ '$(LIB)/CHAINS_3.0_40_XN.grp' file with 40% groups of sequences
ALIGNMENT_FORMAT = $\langle{\tt string:0}\rangle$ 'PIR' sequence file formats; has to be 'PIR'
ALIGNMENT_FEATURES = $\langle{\tt string:0}\rangle$ 'INDICES CONSERVATION' what alignment features to write out: 'ACCURACY' | 'HELIX' | 'BETA' | 'ACCESSIBILITY' | 'STRAIGHTNESS' | 'CONSERVATION' | 'INDICES' | 'ALL' | 'GAPS'
REMOVE_GAPS = $\langle{\tt logical:1}\rangle$ on whether to remove all-gap positions in input alignment
SEARCH_TOP_LIST = $\langle{\tt integer:1}\rangle$ 20 the length of the output hits list
OUTPUT = $\langle{\tt string:1}\rangle$ 'LONG' 'SHORT' | 'LONG'
STOP_ON_ERROR = $\langle{\tt integer:1}\rangle$ 1 whether to stop on error
SEARCH_SORT = $\langle{\tt string:1}\rangle$ 'LONGER' which sequence to use for normalization when sorting the hit list: 'SHORTER' | 'LONGER'
SEARCH_RANDOMIZATIONS = $\langle{\tt integer:1}\rangle$ 0 number of randomizations for calculating the significance of a sequence/sequence similarity
RAND_SEED = $\langle{\tt integer:1}\rangle$ -8123 random seed from -50000 to -2
FAST_SEARCH = $\langle{\tt logical:1}\rangle$ off whether to use fast sequence search or not
FAST_SEARCH_CUTOFF = $\langle{\tt real:1}\rangle$ 1.0 if FAST_SEARCH is ON only sequences with database scan significance higher than this value are considered for randomization significance
DATA_FILE = $\langle{\tt logical:1}\rangle$ off whether results go to a separate file or not
SIGNIF_CUTOFF = $\langle{\tt real:2}\rangle$ 4.0 5.0 cutoff for adding sequences to alignment, max difference from the best

Output:
MODELLER_STATUS = $\langle{\tt integer:1}\rangle$

Description:
This command searches a sequence database for proteins that are similar to a given target sequence.

Target sequence is read from file FILE.

ALIGN_CODES specifies the code of the target sequence in the FILE file. If only one sequence is in the file, you can use ALIGN_CODES = 'all' to read it without bothering about the actual sequence code.

SEARCH_CHAINS_LIST specifies a file that contains protein codes for the proteins to be compared with the target sequence.

The database sequences specified in SEARCH_CHAINS_LIST file must occur in the SEARCH_CHAINS_FILE file.

The command uses the dynamic programming method for the best sequence alignment, given the gap creation and extension penalties specified by GAP_PENALTIES_1D and residue type scores read from file RR_FILE. GAP_PENALTIES_1D[1] is a gap creation penalty and GAP_PENALTIES_1D[2] is a gap extension penalty.

The SEARCH_TOP_LIST top hits are written to the log file at the end. The hits are sorted according to the fractional sequence identity score obtained by dividing the number of identical residue pairs by the length of the longer sequence (SEARCH_SORT = 'LONGER') or the shorter sequence (SEARCH_SORT = 'SHORTER').

The final list of hits contains three different significance values:

  1. SIGNI. Z-score from sequence randomizations. This is the most accurate significance score, but the slowest one to calculate. For each pairwise comparison, the two sequences are shuffled a specified number of times (SEARCH_RANDOMIZATIONS) to obtain the mean and standard deviation of ``random'' scores from which the Z-score for an alignment score of a given pair of sequences is calculated.

  2. SIGNI2. Z-score for sequence identity from the database scan. After comparison of the target sequence with all sequences in the database is done, the comparisons are sorted by the length of the database sequence. The pairwise sequence identities of the 20 sequences closest in length to the target sequence are used to calculate the average and standard deviation of the percentage sequence identities for subsequent calculation of the Z-score for the percentage sequence identity of a given pairwise alignment.

  3. SIGNI3. Z-score for alignment score from the database scan. The procedure is the same as for SIGNI2, except that the alignment scores are used instead of the pairwise sequence identities.

The calculation of the Z-scores assumes that the random scores are distributed according to the Gaussian distribution, instead of the extreme value distribution [Karlin & Altschul, 1990], which is more correct.

SEARCH_RANDOMIZATIONS specifies how many alignments of the shuffled sequences are done to calculate the significance score for the overall sequence similarity. If 0, the significance is not calculated. If more than 5 randomizations are done, the significance score, not sequence identity, is used for sorting the hit list.

When FAST_SEARCH is on only those sequences that have a database-scan alignment score significance (SIGNI3 in output) above FAST_SEARCH_CUTOFF are used for the ``full'' randomization-based significance calculation. Since the mean and the standard deviation of the distribution obtained by randomizing the two compared sequences are much more appropriate than the corresponding quantities for the target/database comparisons, FAST_SEARCH should be on only when you are in a hurry and the database is large.

If DATA_FILE is on the final results (list of PDB codes with significances, etc.) are also written to a separate file 'seqsearch.dat'.

If OUTPUT is 'LONG', the best alignment for each sequence in SEARCH_CHAINS_FILE and its various scores are also written to the log file. If OUTPUT is 'VERY_LONG', individual scores obtained for randomized sequences are also written to the log file (this is almost never needed).

If the selected significance score is larger than SIGNIF_CUTOFF[1] and not more than SIGNIF_CUTOFF[2] units worse than the best hit, all the members of the same group, as defined in SEARCH_GROUP_LIST, are added to the alignment array. Subsequent MALIGN, DENDROGRAM and WRITE_ALIGNMENT can then be used to write out all related PDB chains aligned to the target sequence.

Example:


# Example for: SEQUENCE_SEARCH

# This will search the MODELLER database of representative protein chains
# for chains similar to the specified sequence.

SET OUTPUT_CONTROL = 1 1 1 1 1
SET SEARCH_RANDOMIZATIONS = 20 # should use 100 in real life;
SET GAP_PENALTIES_1D = -800 -400
SET SEARCH_CHAINS_LIST = 'very-short-for-test.cod'
SEQUENCE_SEARCH FILE = 'toxin.ali', ALIGN_CODES = '1nbt'
MALIGN
WRITE_ALIGNMENT FILE = 'toxin-search.pap', ALIGNMENT_FORMAT = 'PAP'


next up previous contents index
Next: DELETE_ALIGNMENT delete Up: Comparison and searching of Previous: EXPAND_ALIGNMENT put   Contents   Index
Ben Webb 2004-04-20