Alignment file (PIR)

The preferred format for comparative modeling is related to the PIR database format:

C; A sample alignment in the PIR format; used in tutorial

>P1;5fd1
structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*

>P1;1fdx
sequence:1fdx:1    : :54   : :ferredoxin:Peptococcus aerogenes: 2.00:-1.00
AYVINDSC--IACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED-----------------
-------------------------------*

The first line of each sequence entry specifies the protein code after the >P1; line identifier. The line identifier must occur at the beginning of the line. For example, 1fdx is the protein code of the first entry in the alignment above. The protein code corresponds to the Sequence.code variable. Conventionally, this code is the PDB code followed by an optional one-letter chain ID, but this is not required; codes can be any unique identifier. If the code in the alignment file is made up of several words (separated by spaces or tabs), only the first is read in; the remainder are treated by MODELLER as comments.

The second line of each entry contains information necessary to extract atomic coordinates of the segment from the original PDB or mmCIF coordinate set. The fields in this line are separated by colon characters, `:'. The fields are as follows:

Field 1:

A specification of whether or not 3D structure is available and of the type of the method used to obtain the structure (structureX, X-ray; structureN, NMR; structureM, model; sequence, sequence). Only structure is also a valid value.

Field 2:

The PDB or mmCIF filename or code. While the protein code in the first line of an entry, which is used to identify the entry, must be unique for all proteins in the file, the name in this field, which is used to get structural data, does not have to be unique. It can be a full file name with path (e.g., '/home/foo/pdbs/mystructure.pdb'), a file name without a path (e.g., 'mystructure.pdb' or 'mystructure.cif'), or a PDB code (e.g., '1abc'; MODELLER will automatically convert the code to a filename by adding '.pdb', '.cif' or '.ent' file extensions as necessary, and/or a 'pdb' prefix). In the latter two cases, where no path is given, MODELLER will search in the directories specified by io_data.atom_files_directory to find PDB or mmCIF files.

Fields 3-6:

The residue and chain identifiers (see below) for the first (fields 3-4) and last residue (fields 5-6) of the sequence in the subsequent lines. There is no need to edit the coordinate file if a contiguous sequence of residues is required -- simply specify the beginning and ending residues of the required contiguous region of the chain. If the beginning residue is not found, no segment is read in. If the ending residue identifier is not found in the coordinate file, the last residue in the coordinate file is used. By default, the whole file is read in.

The unspecified beginning and ending residue numbers and chain id's for a structure entry in an alignment file are taken automatically from the corresponding atom file, if possible. The first matching sequence in the atom file that also satisfies the explicitly specified residue numbers and chain id's is used. A residue number is not specified when a blank character or a dot, `.', is given. A chain id is not specified when a dot, `.', is given. This slight difference between residue and chain id's is necessary because a blank character is a valid chain id.

Field 7:

Protein name. Optional.

Field 8:

Source of the protein. Optional.

Field 9:

Resolution of the crystallographic analysis. Optional.

Field 10:

R-factor of the crystallographic analysis. Optional.

A residue identifier is simply the 5-letter PDB residue number (including insertion code, if any), and a chain identifier the 1-letter PDB chain code. For example, '10I:A' is residue number '10I' in chain 'A', and '6:' is residue number '6' in a chain without a name.

The residue number for the first position (resID1) in the model_segment range 'resID1:chainID1 resID2:chainID2' can be either a real residue number or 'FIRST' (which indicates the first residue in a matching chain). The residue number for the second position (resID2) in the model_segment range can be either: (1) a real residue number; (2) 'LAST' (which indicates the last residue in a matching chain); (3) '+nn' (which requests the total number of residues to read, in which case the chain id is ignored); or 'END' (which indicates the last residue in the PDB file). The chain id for either position in the model_segment range (chainID1 or chainID2) can be either: (1) a real chain id (including a blank/space/null/empty); or '@', which matches any chain id.

Examples, assuming a two chain PDB file (chains A and B):

'15:A 75:A' reads residues 15 to 75 in chain A.
'FIRST:@ 75:@' reads the first 75 residues in chain A (the first chain).
'FIRST:@ LAST:@' reads all residues in chain A, assuming 'FIRST' is not a real number of the non-first residue in chain A.
'FIRST:@ +125:' reads a total of 125 residues, regardless of the PDB numbering, starting from the first residue in chain A.
'10:@ LAST:' reads all residues from 10 in chain A to the end of the file (chain id for the last residue is irrelevant), again assuming 'LAST' is not a real residue number of a non-last residue.
'FIRST:@ END:' reads the whole file no matter what, the chainID is ignored completely.

For the selection_segment the string containing '@' will match any residue number and chainID. For example, '@:A' is the first residue in chain 'A' and '@:@' is the first residue in the coordinate file. The last chain can not be specified in a general way, except if it is the last residue in the file.

When an alignment file is used in conjunction with structural information, the first two fields must be filled in; the rest of them can be empty. If the alignment is not used in conjunction with structural data, all but the first field can be empty. This means that in comparative modeling, the template structures must have at least the first two fields specified while the target sequence must only have the first field filled in. Thus, a simple second line of an entry in an alignment file in the 'PIR' format is

structure:pdb_file:.:.:.:.::::

This entry will result in reading from PDB or mmCIF file pdb_file the structure segment corresponding to the sequence in the subsequent lines of the alignment entry.

Each sequence must be terminated by the terminating character, `*'.

When the first character of the sequence line is the terminating character, `*', the sequence is obtained from the specified PDB or mmCIF coordinate file (Section 5.1.3).

Chain breaks are indicated by `/'. There should not be more than one chain break character to indicate a single chain break (use gap characters instead, `-'). All residue types specified in $RESTYP_LIB, but not patching residue types, are allowed; there are on the order of 100 residue types specified in the $RESTYP_LIB library. To add your own residue types to this library, see Section 3.1, Question 8.

The alignment file can contain any number of blank lines between the protein entries. Comment lines can occur outside protein entries and must begin with the identifiers `C;' or `R;' as the first two characters in the line.

An alignment file is also used to input non-aligned sequences.

Automatic builds 2014-07-26