The preferred format for comparative modeling is related to the PIR database format:
C; A sample alignment in the PIR format; used in tutorial >P1;5fd1 structureX:5fd1:1 :A:106 :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19 AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA EVWPNITEKKDPLPDAEDWDGVKGKLQHLER* >P1;1fdx sequence:1fdx:1 : :54 : :ferredoxin:Peptococcus aerogenes: 2.00:-1.00 AYVINDSC--IACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED----------------- -------------------------------*
The first line of each sequence entry specifies the protein code after the >P1; line identifier. The line identifier must occur at the beginning of the line. For example, 1fdx is the protein code of the first entry in the alignment above. The protein code corresponds to the Sequence.code variable. (Conventionally, this code is the PDB code followed by an optional one-letter chain ID, but this is not required.)
The second line of each entry contains information necessary to extract atomic coordinates of the segment from the original PDB coordinate set. The fields in this line are separated by colon characters, `:'. The fields are as follows:
The unspecified beginning and ending residue numbers and chain id's for a structure entry in an alignment file are taken automatically from the corresponding atom file, if possible. The first matching sequence in the atom file that also satisfies the explicitly specified residue numbers and chain id's is used. A residue number is not specified when a blank character or a dot, `.', is given. A chain id is not specified when a dot, `.', is given. This slight difference between residue and chain id's is necessary because a blank character is a valid chain id.
A residue identifier is simply the 5-letter PDB residue number, and a chain identifier the 1-letter PDB chain code. For example, '10I:A' is residue number '10I' in chain 'A', and '6:' is residue number '6' in a chain without a name.
The residue number for the first position (resID1) in the model_segment range 'resID1:chainID1 resID2:chainID2' can be either a real residue number or 'FIRST' (which indicates the first residue in a matching chain). The residue number for the second position (resID2) in the model_segment range can be either: (1) a real residue number; (2) 'LAST' (which indicates the last residue in a matching chain); (3) '+nn' (which requests the total number of residues to read, in which case the chain id is ignored); or 'END' (which indicates the last residue in the PDB file). The chain id for either position in the model_segment range (chainID1 or chainID2) can be either: (1) a real chain id (including a blank/space/null/empty); or '@', which matches any chain id.
Examples, assuming a two chain PDB file (chains A and B):
For the selection_segment the string containing '@' will match any residue number and chainID. For example, '@:A' is the first residue in chain 'A' and '@:@' is the first residue in the coordinate file. The last chain can not be specified in a general way, except if it is the last residue in the file.
When an alignment file is used in conjunction with structural information, the first two fields must be filled in; the rest of them can be empty. If the alignment is not used in conjunction with structural data, all but the first field can be empty. This means that in comparative modeling, the template structures must have at least the first two fields specified while the target sequence must only have the first field filled in. Thus, a simple second line of an entry in an alignment file in the 'PIR' format is
structure:pdb_file:.:.:.:.::::
This entry will result in reading from PDB file pdb_file the structure segment corresponding to the sequence in the subsequent lines of the alignment entry.
Each sequence must be terminated by the terminating character, `*'.
When the first character of the sequence line is the terminating character, `*', the sequence is obtained from the specified PDB coordinate file (Section 5.1.3).
Chain breaks are indicated by `/'. There should not be more than one chain break character to indicate a single chain break (use gap characters instead, `-'). All residue types specified in $RESTYP_LIB, but not patching residue types, are allowed; there are on the order of 100 residue types specified in the $RESTYP_LIB library. To add your own residue types to this library, see Section 3.1, Question 8.
The alignment file can contain any number of blank lines between the protein entries. Comment lines can occur outside protein entries and must begin with the identifiers `C;' or `R;' as the first two characters in the line.
An alignment file is also used to input non-aligned sequences.