READ_SEQUENCE_DB -- read a database of sequences


CHAINS_LIST = <string:1> '$(LIB)/CHAINS_3.0_40_XN.cod' file with a list of sequence codes

SEQ_DATABASE_FILE = <string:1> '$(LIB)/CHAINS_all.seq' file with sequences

SEQ_DATABASE_FORMAT = <string:1> 'PIR' 'PIR' | 'FASTA' | 'BINARY'; for READ/WRITE_SEQUENCE_DB

CLEAN_SEQUENCES = <logical:1> on whether or not to clean non-standard residues

MINMAX_DB_SEQ_LEN = <integer:2> 0 999999 minimal/maximal database sequence length

OUTPUT_CONTROL = <integer:5> 1 0 1 1 0 selects output, flow-control msgs, warnings, errors, dynamic mem msgs

Description:

This command reads a database of sequences in PIR, FASTA, or BINARY format.

If the format is PIR or FASTA:

It is possible to clean all sequences of non-standard residue types by setting CLEAN_SEQUENCES to on.
Sequences shorter than MINMAX_DB_SEQ_LEN[1] or longer than MINMAX_DB_SEQ_LEN[2] are eliminated.
Only sequences whose codes are listed in the CHAINS_LIST file are read from the SEQ_DATABASE_FILE sequence file. If CHAINS_LIST is 'all', all sequences in SEQ_DATABASE_FILE are read, and no CHAINS_LIST file is needed.
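
The selection rules above can be sketched in Python. This only illustrates the semantics described here and is not MODELLER's implementation: the function name select_sequences, the (code, sequence) pairs standing in for the files, and the treatment of the length bounds as inclusive are all assumptions.

```python
def select_sequences(records, chains_list="all", minmax_len=(0, 999999)):
    """Return the (code, sequence) pairs that survive the filters.

    records     -- iterable of (code, sequence) pairs; a hypothetical
                   stand-in for the contents of SEQ_DATABASE_FILE
    chains_list -- 'all', or a collection of codes to keep; a stand-in
                   for the contents of the CHAINS_LIST file
    minmax_len  -- (min, max) sequence length; bounds assumed inclusive
    """
    min_len, max_len = minmax_len
    wanted = None if chains_list == "all" else set(chains_list)
    kept = []
    for code, seq in records:
        if wanted is not None and code not in wanted:
            continue  # code is not listed in CHAINS_LIST
        if len(seq) < min_len or len(seq) > max_len:
            continue  # length falls outside MINMAX_DB_SEQ_LEN
        kept.append((code, seq))
    return kept


# Made-up example data, for illustration only.
db = [("1abcA", "MKV"), ("2xyzB", "MKVLAAG"), ("3defC", "M")]
print(select_sequences(db, chains_list=["1abcA", "3defC"], minmax_len=(2, 5)))
```

With the default arguments (chains_list='all' and the default length bounds) every record in the example is kept.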

For the PIR and FASTA formats, access is faster when the sequence codes in CHAINS_LIST appear in the same order as the sequences in SEQ_DATABASE_FILE. The sequence file may contain more sequences than there are codes in the codes file.
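
One way to see why matching order helps is the following Python sketch of a single forward pass. It is an illustration under stated assumptions, not MODELLER's reader: read_in_order and the in-memory (code, sequence) pairs are hypothetical, and the source does not say how MODELLER handles codes that are out of order. This sketch simply misses them.

```python
def read_in_order(records, codes):
    """Collect the requested codes in one forward pass over the records.

    Assumes the codes appear in the same order as the records; the
    sequence source is never rewound.
    """
    stream = iter(records)
    found = []
    for wanted in codes:
        # The inner loop resumes where the previous search stopped.
        for code, seq in stream:
            if code == wanted:
                found.append((code, seq))
                break
    return found


# Made-up example data, for illustration only.
seq_file = [("a", "AA"), ("b", "BB"), ("c", "CC")]
in_order = read_in_order(seq_file, ["a", "c"])
out_of_order = read_in_order(seq_file, ["c", "a"])
print(in_order)
print(out_of_order)
```

When the codes are in file order, both are found in one pass. When they are not, this single-pass reader runs off the end of the data after finding "c", so a reader that must still succeed would need to rewind or index the file, which is the extra cost that consistent ordering avoids.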

Additionally, if the sequences are in 'PIR' format, the protein type and resolution fields are stored in the database (see Section 2.4.1 for a description of the 'PIR' fields).

The protein type field is encoded as a single letter: 'S' for sequences and 'X' for structures of any kind. This information is transferred to the profile arrays when using BUILD_PROFILE (see also READ_PROFILE).
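
The single-letter encoding can be sketched as follows. This is a hedged illustration: encode_protein_type is a hypothetical helper, and it assumes the PIR protein type field is spelled 'sequence' for sequences and begins with 'structure' for every kind of structure (see Section 2.4.1 for the authoritative field description).

```python
def encode_protein_type(pir_type_field):
    """Map a PIR protein type field to 'S' (sequence) or 'X' (structure).

    Assumption: the field is 'sequence' or starts with 'structure'.
    """
    field = pir_type_field.strip().lower()
    if field == "sequence":
        return "S"
    if field.startswith("structure"):
        return "X"  # structures of any kind share one code
    raise ValueError("unrecognized protein type field: %r" % pir_type_field)


for name in ("sequence", "structureX", "structureN"):
    print(name, "->", encode_protein_type(name))
```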

The resolution field is used to pick representatives from the clusters in SEQFILTER.

None of the options above apply to the BINARY format, which, in return, is very fast (e.g., 3 seconds to read the 800,000 sequences, about 300 MB, of the TrEMBL database).

Example: See BUILD_PROFILE command.

Ben Webb 2004-10-04
