next up previous contents index
Next: The sequence_db class: using Up: The profile class: using Previous: profile.scan() Compare   Contents   Index

profile.build() -- Build a profile for a given sequence or alignment

sdb = <sequence_db>   Input database of sequences
rr_file = <str:1> '$(LIB)/as1.sim.mat' input residue-residue scoring file
gap_penalties_1d = <float:2> 900 50 gap creation and extension penalties for sequence/sequence alignment
matrix_offset = <float:1> 0.00 substitution matrix offset for local alignment
n_prof_iterations = <int:1> 3 number of iterations in PROFILE_SEARCH
check_profile = <bool:1> True whether to monitor profile degenration
output_scores = <bool:1> False whether to output individual scores in a build_profile scan
output_score_file = <str:1> 'default' output file for writing out individual scores in seqfilter
max_aln_evalue = <float:1> 0.1 Max. E-value of alignments to include in BUILD_PROFILE
gaps_in_target = <bool:1> False whether to include gaps in target when using build_profile

Description:
This command iteratively scans a database of sequences to build a profile for the input sequence or alignment. The command calculates the score for a Smith-Waterman local alignment between the input sequence and each of the sequences in the database. The significance of the alignment scores (e-values) are calculated using a procedure similar to that described by Pearson (1998).

Alignments with e-values below max_aln_evalue are then added to the current alignment. A position-specific scoring matrix is then calculated for the current alignment and is used to search the sequence database. This procedure is repeated for n_prof_iterations or until there are are no significant alignments below the threshold, whichever occurs first.

The initial sequence or alignment can be read in either in the profile format, with profile.read(), or as an alignment using alignment.append(). In the latter case, the alignment has to be converted to the profile format using alignment.to_profile().

The output contains a multiple sequence alignment (assembled) of all the homologues of the input sequence found in the database. The output can be formatted as a profile with profile.write() or converted into any of the standard alignment formats using profile.to_alignment(). It can then be written out to a file with alignment.write().

The fit between the observed and theoretical distributions of the z-scores is calculated after each iteration and is reported in the log file. The fit is calculated using the Kolmogorov-Smirnov D-statistic. If the check_profile flag is set to 'on', then the command will not proceed if the fit deviates by more than 0.04 (D-statistic).

By default, regions of the alignment that introduce gaps in the target sequence are ignored (deleted) in the final multiple alignment. But if gaps_in_target is set to 'on', then the gaps are retained. (See below for comments).

If the output_scores flag is set to 'on', then the scores of each alignment between the input sequence and each database sequence, from all iterations, will be written out to the file specified in output_score_file.

Comments:

  1. The procedure has been optimized only for the BLOSUM62 similarity matrix.

  2. The dynamic programming algorithm has been optimized for performance on Intel Itanium2 architecture. Nevertheless, the calculation is sufficiently CPU intensive. It takes about 20 min for an iteration, using an input sequence of 250aa against a database containing 500,000 sequences on an Itanium2 machine. It could take much longer on any other machine.

  3. It is advisable to have gaps_in_target set to 'off', when scanning against large databases to avoid the local-alignments inserting a huge number of gaps in the final alignments.

  4. The statistics will not be accurate (or may even fail) if the database does not have sequences that represent the entire range of lengths possible.

  5. The method can be used for fold-assignment by first building a profile for the target sequence by scanning against a large non-redundant sequence database (like swissprot) and then using the resulting profile to scan once against a database of sequences extracted from PDB structures. gaps_in_target can be set to 'on' in the second step to get the complete alignments that can then be used for modeling.

Example: examples/commands/build_profile.py


log.verbose()
env = environ()

#-- Prepare the input files

#-- Read in the sequence database
sdb = sequence_db(env)
sdb.read(seq_database_file='pdb95.fsa', seq_database_format='FASTA',
         chains_list='ALL', minmax_db_seq_len=(1, 40000), clean_sequences=True)

#-- Write the sequence database in binary form
sdb.write(seq_database_file='pdb95.bin', seq_database_format='BINARY',
          chains_list='ALL')

#-- Now, read in the binary database
sdb.read(seq_database_file='pdb95.bin', seq_database_format='BINARY',
         chains_list='ALL')

#-- Read in the target sequence/alignment
aln = alignment(env)
aln.append(file='toxin.ali', alignment_format='PIR', align_codes='ALL')

#-- Convert the input sequence/alignment into
#   profile format
prf = aln.to_profile()

#-- Scan sequence database to pick up homologous sequences
prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
          gap_penalties_1d=(-500, -50), n_prof_iterations=5,
          check_profile=False, max_aln_evalue=0.01, gaps_in_target=False)

#-- Write out the profile
prf.write(file='buildprofile.prf')

#-- Convert the profile back to alignment format
aln = prf.to_alignment()

#-- Write out the alignment file
aln.write(file='buildprofile.ali', alignment_format='PIR')


next up previous contents index
Next: The sequence_db class: using Up: The profile class: using Previous: profile.scan() Compare   Contents   Index
Ben Webb 2005-04-21