aln = <alignment> | Alignment containing the target sequence | |
io = <io_data> | Options for reading atom files | |
rr_file = <str:1> | '$(LIB)/as1.sim.mat' | input residue-residue scoring file |
gap_penalties_1d = <float:2> | 900 50 | gap creation and extension penalties for sequence/sequence alignment |
off_diagonal = <int:1> | 100 | to speed up the alignment |
matrix_offset = <float:1> | 0.00 | substitution matrix offset for local alignment |
overhang = <int:1> | 0 | un-penalized overhangs in protein comparisons |
local_alignment = <bool:1> | False | whether to do local as opposed to global alignment |
seq_database_file = <str:1> | '$(LIB)/CHAINS_all.seq' | file with a list of sequence codes |
search_group_list = <str:1> | '$(LIB)/CHAINS_3.0_40_XN.grp' | file with 40% groups of sequences |
alignment_features = <str:1> | 'INDICES CONSERVATION' | what alignment features to write out: 'ACCURACY' | 'HELIX' | 'BETA' | 'ACCESSIBILITY' | 'STRAIGHTNESS' | 'CONSERVATION' | 'INDICES' | 'ALL' | 'GAPS' |
search_top_list = <int:1> | 20 | the length of the output hits list |
output = <str:1> | 'LONG' | 'SHORT' | 'LONG' |
search_sort = <str:1> | 'LONGER' | which sequence to use for normalization when sorting the hit list: 'SHORTER' | 'LONGER' |
search_randomizations = <int:1> | 0 | number of randomizations for calculating the significance of a sequence/sequence similarity |
fast_search = <bool:1> | False | whether to use fast sequence search or not |
fast_search_cutoff = <float:1> | 1.0 | if FAST_SEARCH is ON only sequences with database scan significance higher than this value are considered for randomization significance |
data_file = <bool:1> | False | whether results go to a separate file or not |
signif_cutoff = <float:2> | 4.0 5.0 | cutoff for adding sequences to alignment, max difference from the best |
This command searches a sequence database for proteins that are similar to a given target sequence.
The target sequence should be the only sequence in the provided alignment, aln.
The database of sequences to be scanned against must be read previously by the sequence_db.read() command.
The command uses the dynamic programming method for the best sequence alignment, given the gap creation and extension penalties specified by gap_penalties_1d and residue type scores read from file rr_file. gap_penalties_1d[0] is a gap creation penalty and gap_penalties_1d[1] is a gap extension penalty.
The search_top_list top hits are written to the log file at the end. The hits are sorted according to the fractional sequence identity score obtained by dividing the number of identical residue pairs by the length of the longer sequence (search_sort = 'LONGER') or the shorter sequence (search_sort = 'SHORTER').
The final list of hits contains three different significance values:
The calculation of the Z-scores assumes that the random scores are distributed according to the Gaussian distribution, instead of the extreme value distribution [Karlin & Altschul, 1990], which is more correct.
search_randomizations specifies how many alignments of the shuffled sequences are done to calculate the significance score for the overall sequence similarity. If 0, the significance is not calculated. If more than 5 randomizations are done, the significance score, not sequence identity, is used for sorting the hit list.
When fast_search is True only those sequences that have a database-scan alignment score significance (SIGNI3 in output) above fast_search_cutoff are used for the ``full'' randomization-based significance calculation. Since the mean and the standard deviation of the distribution obtained by randomizing the two compared sequences are much more appropriate than the corresponding quantities for the target/database comparisons, fast_search should be True only when you are in a hurry and the database is large.
If data_file is True the final results (list of PDB codes with significances, etc.) are also written to a separate file 'seqsearch.dat'.
If output is 'LONG', the best alignment for each sequence in the database and its various scores are also written to the log file. If output is 'VERY_LONG', individual scores obtained for randomized sequences are also written to the log file (this is almost never needed).
If the selected significance score is larger than signif_cutoff[0] and not more than signif_cutoff[1] units worse than the best hit, all the members of the same group, as defined in search_group_list, are added to the alignment array. These sequences are taken from seq_database_file, which is often (but not always) the same file previously provided to sequence_db.read(). Subsequent alignment.malign(), environ.dendrogram() and alignment.write() can then be used to write out all related PDB chains aligned to the target sequence.
# Example for: sequence_db.search() # This will search the MODELLER database of representative protein chains # for chains similar to the specified sequence. from modeller import * log.verbose() env = environ() # Read in the sequences of all PDB structures try: sdb = sequence_db(env, seq_database_file='$(LIB)/CHAINS_all.seq', chains_list='very-short-for-test.cod') except IOError: print """ Could not read sequence database file. This file is not included by default in the Modeller distribution, but you can download it from the Modeller downloads page (http://salilab.org/modeller/download_installation.html). Note: it is recommended to use profile.build() rather than sequence_db.search(). See step 1 of the Modeller basic tutorial at http://salilab.org/modeller/tutorial/basic.html """ raise # Read in the query sequence in alignment format aln = alignment(env, file='toxin.ali', align_codes='1nbt') sdb.search(aln, search_randomizations=20, # should use 100 in real life seq_database_file='$(LIB)/CHAINS_all.seq', search_group_list = '$(LIB)/CHAINS_3.0_40_XN.grp', off_diagonal=9999, gap_penalties_1d=(-800, -400), signif_cutoff=(1.5, 5.0)) aln.malign() aln.write(file='toxin-search.pap', alignment_format='PAP')