The preferred format for comparative modeling is related to the PIR database format:
C; A sample alignment in the PIR format; used in tutorial

>P1;5fd1
structureX:5fd1:1 : :106 : :ferredoxin:Azotobacter vinelandii: 1.90: 0.19
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*

>P1;1fdx
sequence:1fdx:1 : :54 : :ferredoxin:Peptococcus aerogenes: 2.00:-1.00
AYVINDSC--IACGACKPECPVNIIQGS--IYAIDADSCIDCGSCASVCPVGAPNPED-----------------
-------------------------------*
The first line of each sequence entry specifies the protein code after the >P1; line identifier. The line identifier must occur at the beginning of the line. For example, 5fd1 is the protein code of the first entry in the alignment above. The protein code corresponds to the alignment[x].code variable.
The second line of each entry contains information necessary to extract atomic coordinates of the segment from the original PDB coordinate set. The fields in this line are separated by colon characters, `:'. The fields are as follows:
The unspecified beginning and ending residue numbers and chain id's for a structure entry in an alignment file are taken automatically from the corresponding atom file, if possible. The first matching sequence in the atom file that also satisfies the explicitly specified residue numbers and chain id's is used. A residue number is not specified when a blank character or a dot, `.', is given. A chain id is not specified when a dot, `.', is given. This slight difference between residue and chain id's is necessary because a blank character is a valid chain id.
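The difference between the two "not specified" rules can be written down as a small Python sketch. The function names are hypothetical and are not part of MODELLER; the sketch only restates the rule in the paragraph above.

```python
def residue_number_is_unspecified(field):
    """A residue number is unspecified when it is blank or a dot."""
    return field.strip() in ("", ".")


def chain_id_is_unspecified(field):
    """A chain id is unspecified only when it is a dot.

    A blank is NOT treated as unspecified, because a blank character
    is itself a valid chain id.
    """
    return field.strip() == "."


print(residue_number_is_unspecified(" "))  # blank residue number
print(chain_id_is_unspecified(" "))        # blank chain id
```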
A residue identifier consists of a residue number and an optional chain identifier. They must be separated by a colon, `:'. For example, '10I:A' is residue number '10I' in chain 'A', and '6' or '6:' is residue number '6' in a chain without a name. Free format can be used; that is, blank characters are ignored. The residue number is a string of up to 5 characters, as found in the PDB atom file, and consists of the PDB residue number proper (22X,A4 in the PDB ATOM record) and the PDB residue insertion code (26X,A1). The chain identifier is a single character, as found in the PDB atom file (21X,A1).
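As an illustration of the residue identifier syntax, the following Python sketch splits an identifier into its residue number and chain id. The function name parse_residue_id is hypothetical (it is not a MODELLER call), and representing an unnamed chain by a single blank is an assumption of this sketch.

```python
def parse_residue_id(text):
    """Split 'number[:chain]' into (residue_number, chain_id)."""
    compact = text.replace(" ", "")  # free format: blanks are ignored
    number, _, chain = compact.partition(":")
    if not number or len(number) > 5:
        raise ValueError(f"residue number must be 1 to 5 characters: {text!r}")
    if len(chain) > 1:
        raise ValueError(f"chain id must be a single character: {text!r}")
    # Assumption: a chain without a name is returned as a blank character.
    return number, chain or " "


for identifier in ("10I:A", "6", "6:"):
    print(identifier, "->", parse_residue_id(identifier))
```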
The residue number for the first position (resID1) in the model_segment range 'resID1:chainID1 resID2:chainID2' can be either a real residue number or 'FIRST' (which indicates the first residue in a matching chain). The residue number for the second position (resID2) in the model_segment range can be: (1) a real residue number; (2) 'LAST' (which indicates the last residue in a matching chain); or (3) 'END' (which indicates the last residue in the PDB file). The chain id for either position in the model_segment range (chainID1 or chainID2) can be either: (1) a real chain id (including a blank/space/null/empty); or (2) '@', which matches any chain id.
Examples, assuming a two chain PDB file (chains A and B):
For the selection_segment the string containing '@' will match any residue number and chainID. For example, '@:A' is the first residue in chain 'A' and '@:@' is the first residue in the coordinate file. The last chain can not be specified in a general way, except if it is the last residue in the file.
When an alignment file is used in conjunction with structural information, the first two fields must be filled in; the rest can be empty or even missing entirely. If the alignment is not used in conjunction with structural data, all but the first field can be empty. This means that in comparative modeling, the template structures must have at least the first two fields specified, while the target sequence needs only the first field filled in. Thus, a simple second line of an entry in an alignment file in the 'PIR' format is
structure:pdb_file:.:.:.:.
This entry will result in reading from PDB file pdb_file the structure segment corresponding to the sequence in the subsequent lines of the alignment entry.
The fields that do not exist are assigned blank values. Thus,
structure:pdb_file
is equivalent to
structure:pdb_file: : : : : : : :
which will achieve what was probably intended (read in the structure segment from file pdb_file that corresponds to the sequence in the subsequent lines of the alignment entry) only if the chain id is a blank character.
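The padding rule can be sketched in a few lines of Python. The helper name split_fields is hypothetical, and the field count of ten is an assumption read off the sample entries at the top of this section rather than a documented constant.

```python
FIELD_COUNT = 10  # assumption: number of colon-separated fields in the sample entries


def split_fields(line):
    """Split the second line of an entry and pad missing fields with blanks."""
    fields = line.split(":")
    fields += [" "] * (FIELD_COUNT - len(fields))
    return fields


short = split_fields("structure:pdb_file")
full = split_fields("structure:pdb_file: : : : : : : :")

# Both spellings carry the same information once blanks are stripped.
print([f.strip() for f in short] == [f.strip() for f in full])
```

Note that the padded chain id fields are blank, not '.', which is why the short form only matches a chain whose id is a blank character.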
Each sequence must be terminated by the terminating character, `*'.
When the first character of the sequence line is the terminating character, `*', the sequence is obtained from the specified PDB coordinate file (Section 4.1.3).
Chain breaks are indicated by `/'. There should not be more than one chain break character to indicate a single chain break (use gap characters instead, `-'). All residue types specified in $RESTYP_LIB, but not patching residue types, are allowed; there are on the order of 100 residue types specified in the $RESTYP_LIB library. To add your own residue types to this library, see Section 1.8, Question 10.
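To make the roles of the terminator, the gap character and the chain break character concrete, here is a minimal Python sketch that recovers the per-chain residue strings from an aligned sequence. The function name chain_segments is hypothetical, and the sketch assumes every residue type is written as a single character.

```python
def chain_segments(sequence):
    """Return the ungapped residues of each chain in an aligned sequence."""
    body = sequence.rstrip("*")      # drop the terminating character
    ungapped = body.replace("-", "")  # gaps carry no residues
    return ungapped.split("/")        # '/' marks a chain break


print(chain_segments("AC-D/EF--G*"))
```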
The alignment file can contain any number of blank lines between the protein entries. Comment lines can occur outside protein entries and must begin with the identifiers `C;' or `R;' as the first two characters in the line.
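The layout rules above (comment lines, blank lines, the >P1; header, the colon-separated second line and the `*' terminator) can be combined into a small reader. This is only an illustrative sketch: read_pir is a hypothetical name, not MODELLER's own parser, and the two entries in the usage example are made up.

```python
def read_pir(text):
    """Return a list of entries as dicts with 'code', 'fields' and 'sequence'."""
    entries = []
    lines = iter(text.splitlines())
    for line in lines:
        # Blank lines and 'C;' / 'R;' comments may appear between entries.
        if not line.strip() or line.startswith(("C;", "R;")):
            continue
        # The line identifier must be at the very beginning of the line.
        if not line.startswith(">P1;"):
            raise ValueError(f"unexpected line outside an entry: {line!r}")
        code = line[4:].strip()
        fields = next(lines, "").split(":")
        chunks = []
        terminated = False
        for seq_line in lines:
            chunk = seq_line.strip()
            if chunk.endswith("*"):
                chunks.append(chunk[:-1])
                terminated = True
                break
            chunks.append(chunk)
        if not terminated:
            raise ValueError(f"entry {code!r} is not terminated by '*'")
        entries.append({"code": code, "fields": fields,
                        "sequence": "".join(chunks)})
    return entries


# Made-up input for illustration only.
example = """C; two short made-up entries
>P1;tmpl
structureX:tmpl:1 :A:4 :A::::
ACDE*

>P1;targ
sequence:targ
AC-
E*
"""
for entry in read_pir(example):
    print(entry["code"], entry["sequence"])
```

An entry whose sequence line starts with `*' yields an empty sequence string here; fetching the sequence from the PDB file in that case is left to MODELLER.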
An alignment file is also used to input non-aligned sequences.
# This demonstrates one way to generate an initial alignment between two
# PDB sequences. It can later be edited by hand.

from modeller import *

# Set Modeller environment (including search path for model.read())
env = environ()
env.io.atom_files_directory = "./:../atom_files/"

# Create a new empty alignment and model:
aln = alignment(env)
mdl = model(env)

# Read the whole 1fdx atom file
code = '1fdx'
mdl.read(file=code, model_segment=('FIRST:@', 'END:'))

# Add the model sequence to the alignment
aln.append_model(mdl, align_codes=code, atom_files=code)

# Read 5fd1 atom file from 1-63, and add to alignment
code = '5fd1'
mdl.read(file=code, model_segment=('1:', '63:'))
aln.append_model(mdl, align_codes=code, atom_files=code)

# Align them by sequence
aln.malign(gap_penalties_1d=(-500, -300))
aln.write(file='fer1-seq.ali')

# Align them by structure
aln.malign3d(gap_penalties_3d=(0.0, 2.0))

# Check the alignment for its suitability for modeling
aln.check()

aln.write(file='fer1.ali')