SequenceDB.read() — read a database of sequences

read(chains_list, seq_database_file, seq_database_format, clean_sequences=True, minmax_db_seq_len=(0, 999999))
This command reads a database of sequences in PIR, FASTA, or BINARY format.

If the format is PIR or FASTA:

For the PIR and FASTA formats, make sure that the sequences appear in the same order in chains_list and in seq_database_file, since this gives faster access. (The sequence file can of course contain more sequences than there are sequence codes in the codes file.)
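The benefit of a matching order can be illustrated with a small pure-Python sketch. It is not how MODELLER is implemented internally (that is not described here); it only shows that, when the wanted codes follow the file order, every one of them can be found in a single forward pass with no rewinding. The helper names iter_fasta and pick_in_order and the demo codes are hypothetical.

```python
import io

def iter_fasta(handle):
    """Yield (code, sequence) pairs from a FASTA-formatted text stream."""
    code, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if code is not None:
                yield code, ''.join(chunks)
            # Use the first word after '>' as the sequence code.
            code, chunks = line[1:].split()[0], []
        elif line and code is not None:
            chunks.append(line)
    if code is not None:
        yield code, ''.join(chunks)

def pick_in_order(codes, handle):
    """Collect the sequences named in `codes` in one forward pass.

    Assumes the codes are listed in the same relative order as in the
    file; a code that appears earlier in the file than the previous
    match is simply not found, because the stream is never rewound.
    """
    found = {}
    records = iter_fasta(handle)  # shared iterator: the position is kept
    for wanted in codes:
        for code, seq in records:
            if code == wanted:
                found[wanted] = seq
                break
    return found

# Hypothetical three-entry database; only two entries are requested.
demo = io.StringIO(">1abcA\nMKV\nLAA\n>2xyzB\nGGS\n>3defC\nTTP\n")
picked = pick_in_order(["1abcA", "3defC"], demo)
```

With the codes in file order, both requested entries are found. With the order reversed, the second code would be missed in a single pass, which is why a reader that cannot assume a matching order has to do extra work.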

Additionally, if the sequences are in 'PIR' format, the protein type and resolution fields are stored in the database (see Section B.1 for a description of the 'PIR' fields).

The protein type field is encoded as a single letter: 'S' for sequences and 'X' for structures of any kind. This information is transferred to the profile arrays when using Profile.build() (see also Profile.read()).

The resolution field is used to pick representatives from the clusters in SequenceDB.filter().

None of the options above apply to the BINARY format, which, in return, is very fast. Binary files are standard HDF5 files (see Section B.4).
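A call to read() might be set up along the lines of the sketch below. The keyword names are taken from the signature above. Everything else is an assumption made for illustration: the file name, the helper names build_read_args and read_database, the 'ALL' value for chains_list, and the Environ and SequenceDB class names as imported from the modeller package. Check these against your installed MODELLER version.

```python
def build_read_args(db_file, db_format, codes='ALL',
                    min_len=0, max_len=999999, clean=True):
    """Assemble keyword arguments for SequenceDB.read(), with basic checks.

    `codes` defaults to 'ALL', assumed here to mean "read every sequence".
    """
    db_format = db_format.upper()
    if db_format not in ('PIR', 'FASTA', 'BINARY'):
        raise ValueError("unsupported format: %s" % db_format)
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    return dict(chains_list=codes,
                seq_database_file=db_file,
                seq_database_format=db_format,
                clean_sequences=clean,
                minmax_db_seq_len=(min_len, max_len))

def read_database(**read_args):
    """Create a SequenceDB and read a database into it."""
    # Imported lazily so that build_read_args() works without MODELLER.
    # Class names are assumed to match recent MODELLER releases.
    from modeller import Environ, SequenceDB
    env = Environ()
    sdb = SequenceDB(env)
    sdb.read(**read_args)
    return sdb

# Hypothetical file name; sequence lengths limited to 1..40000 residues.
args = build_read_args('my_sequences.fsa', 'fasta', min_len=1, max_len=40000)
# sdb = read_database(**args)  # run where MODELLER and the file are available
```

Keeping the argument checks separate from the call means that a misspelled format or an inverted length range is caught before any time is spent loading a large file.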

When using PIR or FASTA files, the entire sequence database is stored in memory. Thus, extremely large databases, such as UNIPROT, will require your computer to have a large amount of system memory (RAM) available, to store the database and to provide working space. In cases where the database requires more than 2 gigabytes of memory, you will also need to use a 64-bit machine, such as Alpha, Itanium, or x86_64 (Opteron/EM64T).

On the other hand, when using a binary file, only part of the file is read into memory on demand. (Functions which utilize sequence databases have a window_size parameter, which determines how much of the file is read in at a time. A larger window size will generally result in faster execution, at the expense of increased memory use.) Thus, binary files are strongly recommended whenever speed or memory is a concern.

If you intend to read in a sequence database simply to write it out again in binary format, consider using the SequenceDB.convert() function instead, as it does not need to keep the whole database in memory.

Example: See Profile.build() command.