sdb = <sequence_db> | Input database of sequences | |
rr_file = <str:1> | '$(LIB)/as1.sim.mat' | input residue-residue scoring file |
gap_penalties_1d = <float:2> | 900 50 | gap creation and extension penalties for sequence/sequence alignment |
matrix_offset = <float:1> | 0.00 | substitution matrix offset for local alignment |
n_prof_iterations = <int:1> | 3 | number of iterations in PROFILE_SEARCH |
check_profile = <bool:1> | True | whether to monitor profile degenration |
output_scores = <bool:1> | False | whether to output individual scores in a build_profile scan |
output_score_file = <str:1> | 'default' | output file for writing out individual scores in seqfilter |
max_aln_evalue = <float:1> | 0.1 | Max. E-value of alignments to include in BUILD_PROFILE |
gaps_in_target = <bool:1> | False | whether to include gaps in target when using build_profile |
Alignments with e-values below max_aln_evalue are then added to the current alignment. A position-specific scoring matrix is then calculated for the current alignment and is used to search the sequence database. This procedure is repeated for n_prof_iterations or until there are are no significant alignments below the threshold, whichever occurs first.
The initial sequence or alignment can be read in either in the profile format, with profile.read(), or as an alignment using alignment.append(). In the latter case, the alignment has to be converted to the profile format using alignment.to_profile().
The output contains a multiple sequence alignment (assembled) of all the homologues of the input sequence found in the database. The output can be formatted as a profile with profile.write() or converted into any of the standard alignment formats using profile.to_alignment(). It can then be written out to a file with alignment.write().
The fit between the observed and theoretical distributions of the z-scores is calculated after each iteration and is reported in the log file. The fit is calculated using the Kolmogorov-Smirnov D-statistic. If the check_profile flag is set to 'on', then the command will not proceed if the fit deviates by more than 0.04 (D-statistic).
By default, regions of the alignment that introduce gaps in the target sequence are ignored (deleted) in the final multiple alignment. But if gaps_in_target is set to 'on', then the gaps are retained. (See below for comments).
If the output_scores flag is set to 'on', then the scores of each alignment between the input sequence and each database sequence, from all iterations, will be written out to the file specified in output_score_file.
Comments:
log.verbose() env = environ() #-- Prepare the input files #-- Read in the sequence database sdb = sequence_db(env) sdb.read(seq_database_file='pdb95.fsa', seq_database_format='FASTA', chains_list='ALL', minmax_db_seq_len=(1, 40000), clean_sequences=True) #-- Write the sequence database in binary form sdb.write(seq_database_file='pdb95.bin', seq_database_format='BINARY', chains_list='ALL') #-- Now, read in the binary database sdb.read(seq_database_file='pdb95.bin', seq_database_format='BINARY', chains_list='ALL') #-- Read in the target sequence/alignment aln = alignment(env) aln.append(file='toxin.ali', alignment_format='PIR', align_codes='ALL') #-- Convert the input sequence/alignment into # profile format prf = aln.to_profile() #-- Scan sequence database to pick up homologous sequences prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat', gap_penalties_1d=(-500, -50), n_prof_iterations=5, check_profile=False, max_aln_evalue=0.01, gaps_in_target=False) #-- Write out the profile prf.write(file='buildprofile.prf') #-- Convert the profile back to alignment format aln = prf.to_alignment() #-- Write out the alignment file aln.write(file='buildprofile.ali', alignment_format='PIR')