The preferred format for comparative modeling is related to the PIR database format:
C; A sample alignment in the PIR format; used in tutorial

>P1;5fd1
structureX:5fd1:1 : :106 : :ferredoxin:Azotobacter vinelandii: 1.90: 0.19
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*

>P1;1fdx
sequence:1fdx:1 : :54 : :ferredoxin:Peptococcus aerogenes: 2.00:-1.00
AYVINDSC--IACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED-----------------
-------------------------------*
The first line of each sequence entry specifies the protein code after the >P1; line identifier. The line identifier must occur at the beginning of the line. For example, 1fdx is the protein code of the second entry in the alignment above. The protein code corresponds to the ALIGN_CODES variable.
The second line of each entry contains information necessary to extract atomic coordinates of the segment from the original PDB coordinate set. The fields in this line are separated by colon characters, `:'. The fields are as follows:
The unspecified beginning and ending residue numbers and chain id's for a structure entry in an alignment file are taken automatically from the corresponding atom file, if possible. The first matching sequence in the atom file that also satisfies the explicitly specified residue numbers and chain id's is used. A residue number is not specified when a blank character or a dot, `.', is given. A chain id is not specified when a dot, `.', is given. This slight difference between residue and chain id's is necessary because a blank character is a valid chain id.
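The asymmetry between residue numbers and chain id's can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not code from the program itself; the function names are hypothetical.

```python
def residue_number_specified(field: str) -> bool:
    """True if a residue-number field carries an explicit value.

    A blank (or empty) field and a dot both mean 'not specified'.
    """
    return field.strip() not in ("", ".")


def chain_id_specified(field: str) -> bool:
    """True unless the chain field is a dot.

    A blank counts as specified, because a blank is a valid chain id.
    """
    return field.strip() != "."


for raw in ("106", " ", "."):
    print(repr(raw), residue_number_specified(raw), chain_id_specified(raw))
```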
A residue identifier consists of a residue number and an optional chain identifier. They must be separated by a colon, `:'. For example, '10I:A' is residue number '10I' in chain 'A', and '6' or '6:' is residue number '6' in a chain without a name. Free format can be used; that is, blank characters are ignored. The residue number is a string of up to 5 characters, as found in the PDB atom file, and consists of the PDB residue number proper (22X,A4 in the PDB ATOM record) and the PDB residue insertion code (26X,A1). The chain identifier is a single character, as found in the PDB atom file (21X,A1).
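As an illustration of the residue identifier syntax, the following Python sketch splits an identifier into its residue number and chain id. It is not part of the program; the function name is hypothetical, and representing a chain without a name as a single blank character is an assumption made here for convenience.

```python
from typing import Tuple


def parse_residue_id(text: str) -> Tuple[str, str]:
    """Split 'resnum:chain' into (resnum, chain), ignoring blanks."""
    compact = text.replace(" ", "")            # free format: drop blanks
    resnum, _, chain = compact.partition(":")  # the chain part is optional
    if not resnum or len(resnum) > 5:
        raise ValueError(f"bad residue number: {resnum!r}")
    if len(chain) > 1:
        raise ValueError(f"chain id must be a single character: {chain!r}")
    # Assumption: an unnamed chain is returned as a single blank.
    return resnum, chain or " "


print(parse_residue_id("10I:A"))
print(parse_residue_id("6"))
print(parse_residue_id("6:"))
```

Both '6' and '6:' give the same result here, matching the equivalence described above.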
The residue number for the first position (resID1) in the MODEL_SEGMENT range 'resID1:chainID1 resID2:chainID2' can be either: (1) a real residue number; or (2) 'FIRST' (which indicates the first residue in a matching chain). The residue number for the second position (resID2) in the MODEL_SEGMENT range can be either: (1) a real residue number; (2) 'LAST' (which indicates the last residue in a matching chain); or (3) 'END' (which indicates the last residue in the PDB file). The chain id for either position in the MODEL_SEGMENT range (chainID1 or chainID2) can be either: (1) a real chain id (including a blank/space/null/empty); or (2) '@', which matches any chain id.
Examples, assuming a two chain PDB file (chains A and B):
For the SELECTION_SEGMENT, a string containing '@' will match any residue number and chain id. For example, '@:A' is the first residue in chain 'A' and '@:@' is the first residue in the coordinate file. The last chain cannot be specified in a general way, except if it is the last residue in the file.
When an alignment file is used in conjunction with structural information, the first two fields must be filled in; the rest of them can be empty or even missing entirely. If the alignment is not used in conjunction with structural data, all but the first field can be empty. This means that in comparative modeling, the template structures must have at least the first two fields specified, while the target sequence needs only the first field filled in. Thus, a simple second line of an entry in an alignment file in the 'PIR' format is
structure:pdb_file:.:.:.:.
This entry will result in reading from PDB file pdb_file the structure segment corresponding to the sequence in the subsequent lines of the alignment entry.
The fields that do not exist are assigned blank values. Thus,
structure:pdb_file
is equivalent to
structure:pdb_file: : : : : : : :
which will achieve what was probably intended (read in the structure segment from file pdb_file that corresponds to the sequence in the subsequent lines of the alignment entry) only if the chain id is a blank character.
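The padding rule can be illustrated with a short Python sketch. It is not the program's own parser; the function name is hypothetical, and the field count of ten is an assumption read off the padded line above.

```python
from typing import List

N_FIELDS = 10  # assumption: number of fields in the padded line shown above


def split_pir_fields(line: str, n_fields: int = N_FIELDS) -> List[str]:
    """Split a PIR second line on ':' and pad missing fields with blanks."""
    fields = line.rstrip("\n").split(":")
    if len(fields) > n_fields:
        raise ValueError(f"expected at most {n_fields} fields, got {len(fields)}")
    # Treat empty fields as blank, then append blanks for the absent ones.
    fields = [field if field.strip() else " " for field in fields]
    return fields + [" "] * (n_fields - len(fields))


short = split_pir_fields("structure:pdb_file")
padded = split_pir_fields("structure:pdb_file: : : : : : : :")
print(short == padded)
```

Under this sketch the chain id fields of the short form come out as blanks, not dots, which is why the short form only behaves as intended when the chain id really is a blank character.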
Each sequence must be terminated by the terminating character, `*'.
When the first character of the sequence line is the terminating character, `*', the sequence is obtained from the specified PDB coordinate file (Section 2.1.4).
Chain breaks are indicated by `/'. There should not be more than one chain break character to indicate a single chain break (use gap characters instead, `-'). All residue types specified in $RESTYP_LIB, but not patching residue types, are allowed; there are on the order of 100 residue types specified in the $RESTYP_LIB library. To add your own residue types to this library, see Section 1.9, Question 17.
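The sequence conventions (terminator, chain breaks, gaps) can be sketched in Python as follows. This is an illustration only, with a hypothetical function name; it does not check residue types against $RESTYP_LIB, and returning None for a leading `*' is a choice made for this sketch.

```python
from typing import List, Optional


def read_sequence(lines: List[str]) -> Optional[List[str]]:
    """Join sequence lines up to '*' and return the ungapped chains.

    Returns None when the sequence starts with '*', meaning it is to be
    taken from the PDB coordinate file instead.
    """
    joined = "".join(line.strip() for line in lines)
    if "*" not in joined:
        raise ValueError("sequence is not terminated by '*'")
    body = joined.split("*", 1)[0]
    if body == "":
        return None
    if "//" in body:
        raise ValueError("use a single '/' per chain break")
    # Split on chain breaks, then drop the gap characters from each chain.
    return [chain.replace("-", "") for chain in body.split("/")]


print(read_sequence(["AC--DE/", "FG-H*"]))
print(read_sequence(["*"]))
```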
The alignment file can contain any number of blank lines between the protein entries. Comment lines can occur outside protein entries and must begin with the identifiers `C;' or `R;' as the first two characters in the line.
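A minimal reader that honors blank lines and comment lines might look like the Python sketch below. It is a simplified illustration under the rules stated in this section, not the program's reader; the function name and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional


def read_pir_entries(text: str) -> List[Dict[str, str]]:
    """Collect protein entries, skipping blank and comment lines between them."""
    entries: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    expecting_fields = False
    for line in text.splitlines():
        if current is None:
            # Outside an entry: skip blank lines and 'C;' / 'R;' comments.
            if not line.strip() or line.startswith(("C;", "R;")):
                continue
            if not line.startswith(">P1;"):
                raise ValueError(f"unexpected line outside an entry: {line!r}")
            current = {"code": line[4:].strip(), "fields": "", "sequence": ""}
            expecting_fields = True
        elif expecting_fields:
            current["fields"] = line           # colon-separated second line
            expecting_fields = False
        else:
            current["sequence"] += line.strip()
            if "*" in line:                    # terminator closes the entry
                current["sequence"] = current["sequence"].split("*", 1)[0]
                entries.append(current)
                current = None
    if current is not None:
        raise ValueError("last entry is not terminated by '*'")
    return entries


sample = "\n".join([
    "C; a comment line",
    "",
    ">P1;5fd1",
    "structureX:5fd1",
    "AFVV",
    "TDNC*",
])
print([entry["code"] for entry in read_pir_entries(sample)])
```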
An alignment file is also used to input non-aligned sequences.
The best way to generate initial alignment files containing PDB sequences, which can later be edited by hand, is to follow this example:
# Specify the PDB and protein codes in the alignment:
SET ATOM_FILES = '1fdx' '5fd1', ALIGN_CODES = '1fdx' '5fd1'
READ_MODEL FILE = '1fdx', MODEL_SEGMENT = '@:@' 'X:X'  # Read the whole 1fdx atom file
SEQUENCE_TO_ALI                                        # Copy the residues to the alignment array
READ_MODEL FILE = '5fd1', MODEL_SEGMENT = '1:' '63:'   # Read 5fd1 atom file from 1-63
SEQUENCE_TO_ALI ADD_SEQUENCE = on                      # Add this segment to the alignment array
MALIGN GAP_PENALTIES = -500 -300                       # align them by sequence
WRITE_ALIGNMENT FILE = 'fer1-seq.ali'
MALIGN3D GAP_PENALTIES = 0.0 2.0                       # align them by structure
CHECK_ALIGNMENT                                        # check the alignment for its suitability for modeling
WRITE_ALIGNMENT FILE = 'fer1.ali'