

Alignment file format

The preferred format for comparative modeling is related to the PIR database format:


C; A sample alignment in the PIR format; used in tutorial
>P1;5fd1
structureX:5fd1:1    : :106  : :ferredoxin:Azotobacter vinelandii: 1.90: 0.19
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*
>P1;1fdx
sequence:1fdx:1    : :54   : :ferredoxin:Peptococcus aerogenes: 2.00:-1.00
AYVINDSC--IACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED-----------------
-------------------------------*

The first line of each sequence entry specifies the protein code after the >P1; line identifier. The line identifier must occur at the beginning of the line. For example, 5fd1 is the protein code of the first entry in the alignment above, and 1fdx is the protein code of the second. The protein code corresponds to the ALIGN_CODES variable.

The second line of each entry contains information necessary to extract atomic coordinates of the segment from the original PDB coordinate set. The fields in this line are separated by colon characters, `:'. The fields are as follows:

Field 1:
A specification of whether or not a 3D structure is available and, if so, of the method used to obtain it (structureX, X-ray; structureN, NMR; structureM, model; sequence, sequence only). The plain value structure, without a method suffix, is also valid.

Field 2:
The PDB code. While the protein code in the first line of an entry, which is used to identify the entry, must be unique for all proteins in the file, the PDB code in this field, which is used to get structural data, does not have to be unique. It is a good idea to use the PDB code with an optional chain identifier as the protein code. The PDB code corresponds to the ATOM_FILES variable and can also contain the full atom filename, directory included.

Fields 3-6:
The residue identifiers (see below) for the first (fields 3-4) and last residue (fields 5-6) of the sequence in the subsequent lines. There is no need to edit the coordinate file if a contiguous sequence of residues is required -- simply specify the beginning and ending residues of the required contiguous region of the chain. If the beginning residue is not found, no segment is read in. If the ending residue identifier is not found in the coordinate file, the last residue in the coordinate file is used. By default, the whole file is read in.

The unspecified beginning and ending residue numbers and chain id's for a structure entry in an alignment file are taken automatically from the corresponding atom file, if possible. The first matching sequence in the atom file that also satisfies the explicitly specified residue numbers and chain id's is used. A residue number is not specified when a blank character or a dot, `.', is given. A chain id is not specified when a dot, `.', is given. This slight difference between residue and chain id's is necessary because a blank character is a valid chain id.
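The blank-versus-dot rule above can be written as two small predicates. This is an illustrative Python sketch, not part of MODELLER; the function names are hypothetical, and it assumes each field has already been split out of the colon-separated line.

```python
def residue_number_is_unspecified(field: str) -> bool:
    """A residue number is unspecified when it is blank or a dot."""
    return field.strip() in ("", ".")


def chain_id_is_unspecified(field: str) -> bool:
    """A chain id is unspecified only when it is a dot.

    A blank field is NOT unspecified: blank is a valid chain id.
    """
    return field.strip() == "."
```

With these definitions a blank chain field selects the unnamed chain, while a dot leaves the choice to the atom file.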

Field 7:
Protein name. Optional.

Field 8:
Source of the protein. Optional.

Field 9:
Resolution of the crystallographic analysis. Optional.

Field 10:
R-factor of the crystallographic analysis. Optional.
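As a rough illustration of the ten-field layout described above, the following Python sketch splits the second line of an entry into named fields. It is not MODELLER code: the class name, the attribute names, and the function name are hypothetical labels chosen for this example, and the sketch assumes that no field itself contains a colon.

```python
from typing import NamedTuple


class PirHeader(NamedTuple):
    # Attribute names are illustrative labels for fields 1-10.
    kind: str            # field 1: structureX, structureN, structureM, structure, sequence
    pdb_code: str        # field 2
    first_residue: str   # field 3
    first_chain: str     # field 4
    last_residue: str    # field 5
    last_chain: str      # field 6
    name: str            # field 7
    source: str          # field 8
    resolution: str      # field 9
    r_factor: str        # field 10


def parse_pir_header(line: str) -> PirHeader:
    """Split a header line on ':' and strip surrounding blanks from each field."""
    fields = [f.strip() for f in line.rstrip("\n").split(":")]
    if len(fields) > 10:
        raise ValueError("expected at most 10 colon-separated fields")
    # Missing trailing fields are treated as blank; a blank chain becomes ''.
    fields += [""] * (10 - len(fields))
    return PirHeader(*fields)
```

Applied to the 5fd1 header in the sample alignment, this yields kind 'structureX', first_residue '1', last_residue '106', and blank chain fields.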

A residue identifier consists of a residue number and an optional chain identifier, separated by a colon, `:'. For example, '10I:A' is residue number '10I' in chain 'A', and '6' or '6:' is residue number '6' in a chain without a name. Free format can be used; that is, blank characters are ignored. The residue number is a string of up to 5 characters, as found in the PDB atom file, and consists of the PDB residue number proper (22X,A4 in the PDB ATOM record) and the PDB residue insertion code (26X,A1). The chain identifier is a single character, as found in the PDB atom file (21X,A1).
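A minimal Python sketch of the residue identifier syntax follows. The function name is hypothetical, and the sketch only checks the lengths stated above; it does not look anything up in a coordinate file.

```python
def parse_residue_id(text: str) -> tuple[str, str]:
    """Split 'number:chain' into (residue_number, chain_id).

    Blanks are ignored (free format). An empty chain_id stands for
    a chain without a name.
    """
    compact = "".join(text.split())          # drop all blank characters
    number, _, chain = compact.partition(":")
    if len(number) > 5:
        raise ValueError("residue number is at most 5 characters")
    if len(chain) > 1:
        raise ValueError("chain identifier is a single character")
    return number, chain
```

For instance, '10I:A' gives ('10I', 'A'), while both '6' and '6:' give ('6', '').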

The residue number for the first position (resID1) in the MODEL_SEGMENT range 'resID1:chainID1 resID2:chainID2' can be either a real residue number or 'FIRST' (which indicates the first residue in a matching chain). The residue number for the second position (resID2) can be: (1) a real residue number; (2) 'LAST' (which indicates the last residue in a matching chain); or (3) 'END' (which indicates the last residue in the PDB file). The chain id for either position (chainID1 or chainID2) can be either: (1) a real chain id (including a blank/space/null/empty); or (2) '@', which matches any chain id.

Examples, assuming a two chain PDB file (chains A and B):

For the SELECTION_SEGMENT, a string containing '@' will match any residue number and chainID. For example, '@:A' is the first residue in chain 'A' and '@:@' is the first residue in the coordinate file. The last chain cannot be specified in a general way, except if it is the last residue in the file.

When an alignment file is used in conjunction with structural information, the first two fields must be filled in; the rest can be empty or missing entirely. If the alignment is not used in conjunction with structural data, all but the first field can be empty. In comparative modeling, this means that the template structures must have at least the first two fields specified, while the target sequence needs only the first field. Thus, a simple second line of an entry in an alignment file in the 'PIR' format is

structure:pdb_file:.:.:.:.

This entry will result in reading from PDB file pdb_file the structure segment corresponding to the sequence in the subsequent lines of the alignment entry.

The fields that do not exist are assigned blank values. Thus,

structure:pdb_file

is equivalent to

structure:pdb_file: : : : : : : :

which will achieve what was probably intended (read in the structure segment from file pdb_file that corresponds to the sequence in the subsequent lines of the alignment entry) only if the chain id is a blank character.
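The equivalence above, and its difference from the dotted form, can be checked with a short Python sketch. The function name is hypothetical and the padding rule is simply the one stated in the text: fields that do not exist are assigned blank values.

```python
def normalize_fields(line: str, n_fields: int = 10) -> list[str]:
    """Return the colon-separated fields, padded with blanks to n_fields."""
    fields = line.rstrip("\n").split(":")
    fields += [""] * (n_fields - len(fields))
    # Represent every empty or all-blank field as a single blank character.
    return [f.strip() or " " for f in fields]


short = normalize_fields("structure:pdb_file")
blank = normalize_fields("structure:pdb_file: : : : : : : :")
dotted = normalize_fields("structure:pdb_file:.:.:.:.")

print(short == blank)          # the two forms carry the same fields
print(short[3], dotted[3])     # chain id of first residue: blank vs dot
```

The chain fields differ between the two forms: a blank is read as the blank chain id, whereas a dot leaves the chain id unspecified.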

Each sequence must be terminated by the terminating character, `*'.

When the first character of the sequence line is the terminating character, `*', the sequence is obtained from the specified PDB coordinate file (Section 2.1.4).

Chain breaks are indicated by `/'. There should not be more than one chain break character to indicate a single chain break (use gap characters instead, `-'). All residue types specified in $RESTYP_LIB, but not patching residue types, are allowed; there are on the order of 100 residue types specified in the $RESTYP_LIB library. To add your own residue types to this library, see Section 1.9, Question 17.
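To illustrate the chain break convention, the following Python sketch splits an aligned sequence into its chains. The function name is hypothetical, and the sketch does not validate residue types against $RESTYP_LIB.

```python
def split_chains(sequence: str) -> list[str]:
    """Split a sequence at '/' chain breaks and drop '-' gap characters."""
    body = sequence.rstrip("*")              # remove the terminating character
    if "//" in body:
        # A single chain break must be marked by exactly one '/'.
        raise ValueError("use one '/' per chain break; pad with '-' instead")
    return [segment.replace("-", "") for segment in body.split("/")]
```

For example, 'ACD/EF-G*' describes two chains, 'ACD' and 'EFG'.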

The alignment file can contain any number of blank lines between the protein entries. Comment lines can occur outside protein entries and must begin with the identifiers `C;' or `R;' as the first two characters in the line.
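Putting the layout rules together, a reader for this format might look like the Python sketch below. It is an illustration only, not MODELLER's own parser: the function name and the returned tuple layout are assumptions, and the header line is returned unparsed.

```python
def read_pir(text: str) -> list[tuple[str, str, str]]:
    """Return (protein_code, header_line, sequence) for each entry."""
    entries = []
    lines = iter(text.splitlines())
    for line in lines:
        # Blank lines and 'C;' / 'R;' comments may occur between entries.
        if not line.strip() or line.startswith(("C;", "R;")):
            continue
        # The line identifier must be at the beginning of the line.
        if not line.startswith(">P1;"):
            raise ValueError(f"unexpected line outside an entry: {line!r}")
        code = line[4:].strip()
        header = next(lines, None)
        if header is None:
            raise ValueError(f"entry {code!r} has no second line")
        chunks = []
        for seq_line in lines:
            chunks.append(seq_line.strip())
            if chunks[-1].endswith("*"):     # terminating character
                break
        else:
            raise ValueError(f"entry {code!r} is not terminated by '*'")
        # An empty sequence means the first sequence character was '*'.
        entries.append((code, header, "".join(chunks)[:-1]))
    return entries
```

Run on the sample alignment at the top of this section, this would return two entries, with protein codes '5fd1' and '1fdx'.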

An alignment file is also used to input non-aligned sequences.

The best way to generate initial alignment files containing PDB sequences, which can later be edited by hand, is to follow this example:

# Specify the PDB and protein codes in the alignment:
SET ATOM_FILES = '1fdx' '5fd1', ALIGN_CODES = '1fdx' '5fd1'
READ_MODEL FILE = '1fdx', MODEL_SEGMENT = '@:@' 'X:X' # Read the whole 1fdx atom file
SEQUENCE_TO_ALI # Copy the residues to the alignment array
READ_MODEL FILE = '5fd1', MODEL_SEGMENT = '1:' '63:' # Read 5fd1 atom file from 1-63
SEQUENCE_TO_ALI ADD_SEQUENCE = on # Add this segment to the alignment array
MALIGN GAP_PENALTIES = -500 -300   # align them by sequence
WRITE_ALIGNMENT FILE = 'fer1-seq.ali'
MALIGN3D GAP_PENALTIES = 0.0 2.0   # align them by structure
CHECK_ALIGNMENT   # check the alignment for its suitability for modeling
WRITE_ALIGNMENT FILE = 'fer1.ali'




Ben Webb 2004-04-20