Joy - protein structure and alignment analysis

SYNOPSIS

joy -options file

DESCRIPTION

Joy is an analysis and formatting program for multiple protein sequence alignments or single protein structures. It produces a number of files that are used either directly or by other programs. There are a large number of options, but the defaults are usually what you want for a basic alignment. The program is invoked as:

joy -options file

The options may be omitted if you are happy with the defaults (see below for a description of these).

It is your responsibility to make sure that the PDB file is o.k., i.e. it should not contain alternate atoms, hydrogens, etc. For this reason it is recommended that the PDB file you give to joy has a .pdb extension; the file will then be preprocessed (to a .atm file) before use.

If your files follow this simple naming convention then you can drop the extension from the command line, as in:

joy gamma

What joy does next depends on the existence of various files, but you should get something on standard output. In the above example the first file searched for is the alignment file gamma.ali. If this file exists, it is read and processed. If the .ali file does not exist, joy looks for the file gamma.seq and uses it if it exists. If it does not, joy searches the current directory for gamma.atm, which should be a standard PDB format file. If the .atm file is also missing, joy checks for a file with a .pdb extension, which is converted to a .atm file with the filter pdb2atm. The program then creates a .seq file from the contents of the .atm file and does the processing as before. Note that any existing .seq file is overwritten if you explicitly specify gamma.atm on the command line.
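
The search order above can be sketched in a few lines of Python. This is only an illustration of the documented lookup sequence (.ali, then .seq, then .atm, then .pdb), not joy's own code; the function name first_input_file and the directory argument are hypothetical.

```python
from pathlib import Path

# Lookup order as described in the text: alignment, sequence,
# processed coordinates, raw PDB coordinates.
SEARCH_ORDER = (".ali", ".seq", ".atm", ".pdb")


def first_input_file(stem, directory="."):
    """Return the first existing file for `stem` in the documented order, or None.

    Hypothetical helper: it only reports which file would be picked up.
    It does not run pdb2atm or create a .seq file.
    """
    for ext in SEARCH_ORDER:
        candidate = Path(directory) / f"{stem}{ext}"
        if candidate.is_file():
            return candidate
    return None
```

Calling first_input_file("gamma") in a directory that holds only gamma.atm and gamma.pdb would return the path to gamma.atm, mirroring the case where joy builds a .seq file from the .atm file.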

If you want to include structural information into the analysis the program is smart enough to try to calculate any missing data as long as you have the corresponding .atm for all entries in the alignment in the current directory (and joy is correctly installed). This feature relies on the presence of the programs psa, hbond, sstruc, pdb2atm, and atm2seq in your path. See below in the section on data files for more details.
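
Because the automatic calculation depends on those helper programs being in your path, it can be worth checking for them before running joy. The following is a minimal sketch using Python's standard shutil.which; the function name missing_helpers is hypothetical, and the list of programs is taken from the paragraph above.

```python
import shutil

# Helper programs named in the text as required for automatic data generation.
HELPERS = ("psa", "hbond", "sstruc", "pdb2atm", "atm2seq")


def missing_helpers(names=HELPERS, path=None):
    """Return the helper programs that cannot be found on the search path.

    `path` defaults to the PATH environment variable (shutil.which behaviour).
    """
    return [name for name in names if shutil.which(name, path=path) is None]
```

An empty list from missing_helpers() means all five programs were found; anything else names the programs that joy would not be able to call.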

The default file extensions used by joy are detailed below.

extension   contents
pdb         Raw PDB format coordinates
atm         Processed PDB format coordinates
hbd         Hydrogen-bonding data
psa         Accessibility data
sst         Secondary structure data
seg         Segment definition file
lbl         Label data for a structure
ali         Alignment
seq         Sequence
sub         Substitution data
tem         File containing a `template' representation of structure
tex         LaTeX file containing alignment

One of the files that joy produces has a .sub extension; it contains a breakdown of residue substitutions classified according to the local environment. As you would expect, this data is quite sparse, so there is an ancillary program called summer to merge the data from many datasets. The data produced by summer is then used by a number of other programs.

Another file produced by the program, usually with a .tex extension, can be used to produce a pretty alignment on a typesetter: simply process it with latex to get a nicely formatted alignment. The .tem file is the main input to the qslave and pslave template alignment programs.

Joy has a large number of options. To see the current ones, simply type joy at the command line and the options will be listed. Some of the more important options are:

FORMAT OF THE ALIGNMENT FILE

The program uses a pir(1) type format for alignment files, with a few extensions to allow easy labelling of the alignment. The only restriction is that all sequences (including N- and C-terminal insertion codes) should be the same length, and that they are formatted in blocks of 75. This restriction will be removed shortly. See the documentation of pir for further details. A blank line (or end of file) acts as a signal to stop reading sequences. Remember the program does no alignment; what you put in is what you get out. An example alignment file (for the crystallin family) is:

>P1;1gcr
structure
--GKITFYEDRGFQGHCYECSSDCPNLQP-YFSRCNSIRVDSGCWMLYERPNYQGHQYFLRRGDYPDYQQWMGF-
-NDSIRSCRLIPQHTGTFRMRIYERDDFRGQMSEITD-DCPSLQDRFHLSEVHSLNVLEGSWVLYEMPSYRGRQY
LLRPGEYRRYLDWGAMNAKVGSLRRVMDFY-*
>P1;2gcr
structure
--GKITFYEDRGFQGRHYECSSDHSNLQP-YFSRCNSIRVDSGCWMLYEQPNFTGCQYFLRRGDYPDYQQWMGF-
-SDSVRSCRLIP-HTSSHRLRIYEREDYRGQMVEITE-DCSSLQDRFHFSDIHSFHVMEGYWVLYEMPNYRGRQY
LLRPGDYRRYLDWGAANARVGSLRRAVDFY-*
>P1;1bb2      
structure
LNPKIIIFEQENFQGHSHELNGPCPNLKETGVEKAGSVLVQAGPWVGYEQANCKGEQFVFEKGEYPRWDSWTSSR
RTDSLSSLRPIKVDSQEHKITLYENPNFTGKKMEVIDDDVPSFHAHGYQEKVSSVRVQSGTWVGYQYPGYRGLQY
LLEKGDYKDSGDFGAPQPQVQSVRRIRDMQW*
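
To make the format concrete, here is a rough Python sketch of a reader for files like the example above. It is not joy's parser. It assumes, from the example and the rules stated in this section, that each entry is a `>` header whose title follows the semicolon, one description line, and sequence lines ending in `*`; that a blank line stops reading; and that lines starting with `#` are comments. The function name read_alignment is hypothetical.

```python
def read_alignment(text):
    """Parse PIR-style alignment text into a {title: sequence} dict.

    Raises ValueError if the sequences are not all the same length.
    """
    entries = {}
    title, chunks, skip_description = None, [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):            # comment line
            continue
        if not line:
            if title is not None or entries:
                break                       # blank line: stop reading sequences
            continue                        # assumption: leading blank lines are ignored
        if line.startswith(">"):            # header such as ">P1;1gcr"
            title = line[1:].split(";", 1)[-1].strip()
            chunks, skip_description = [], True
            continue
        if title is None:                   # stray text outside any entry
            continue
        if skip_description:                # line after the header, e.g. "structure"
            skip_description = False
            continue
        if line.endswith("*"):              # "*" terminates the sequence
            chunks.append(line[:-1])
            entries[title] = "".join(chunks)
            title = None
        else:
            chunks.append(line)
    if len({len(seq) for seq in entries.values()}) > 1:
        raise ValueError("all sequences in the alignment must be the same length")
    return entries
```

Run on the crystallin example, this sketch would return three entries keyed 1gcr, 2gcr and 1bb2. An entry cut short by a blank line before its `*` is dropped, which is a choice made for this sketch rather than documented behaviour.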

By default, the consensus secondary structure is shown underneath the alignment. It should be obvious what it all means (if it isn't, then what can you expect to gain from using the program?). The definition of `consensus' is that a fraction greater than 0.7 is in a particular conformational state at a position. If you want to change this fraction there is a hidden flag so you can fiddle things. Also underneath the alignment is a series of bullets showing the positions of consensus buried residues. You can turn this feature off if you want to.

The current limitations on the size of various things the user is likely to encounter are:

total length of alignment            1000
total number of structures           35
total number of `plain' sequences    30
number of text strings               6
number of label strings              3

You may want to mix `featured' and `plain' sequences in the formatted alignment. To do this, simply prefix the title of the sequence with a `*'; this marks the sequence as simply a string of characters, and no data files are required. A comment line may be added to the alignment file by preceding it with a `#'.
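
The consensus rule (a state is reported only when a fraction greater than 0.7 shares it at a position) can be illustrated with a short Python sketch. It is not taken from joy. The one-letter state codes, the `-` gap symbol, and the choice to compute the fraction over the structures that actually have a residue at the position are all assumptions made for the example.

```python
from collections import Counter


def consensus(states, threshold=0.7, gap="-", blank=" "):
    """Return one consensus state per alignment position.

    `states` is a list of equal-length strings, one per structure, holding a
    one-letter conformational state per position (hypothetical codes).
    A position gets a state only if strictly more than `threshold` of the
    non-gap entries share it; otherwise it gets `blank`.
    """
    if len({len(s) for s in states}) > 1:
        raise ValueError("all state strings must be the same length")
    result = []
    for column in zip(*states):
        present = [c for c in column if c != gap]
        if not present:
            result.append(blank)
            continue
        state, count = Counter(present).most_common(1)[0]
        result.append(state if count / len(present) > threshold else blank)
    return "".join(result)
```

With four structures, a position needs all four in agreement or three of four (0.75) to be reported; seven of ten (exactly 0.7) would not qualify because the comparison is strict.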

KEY TO FORMATTED ALIGNMENT

The key for the featured alignment is as follows:

UPPERCASE    : solvent inaccessible
lowercase    : solvent accessible
bold         : H-bond to amide proton
underline    : H-bond to mainchain carbonyl
tilde (~)    : H-bond to other sidechain
dot (.)      : H-bond to heterogen
breve        : cis-peptide bond
cedilla      : half cystine

So, for example, a bold, underlined D is an aspartic acid that is buried and hydrogen bonded to both a mainchain amide proton and a mainchain carbonyl; an italic s with a tilde is a surface serine in a positive phi conformation hydrogen bonded to another sidechain.

DATA FILES

To produce a `featured' alignment you must have a set of data files for each sequence in the alignment. These must have the same name as the title of the sequence in the alignment and be present in the same directory as the alignment file. See the notes concerning the use of the g option. If needed, joy can create all the data files automatically (see earlier).

The hydrogen bond data comes from a file with a .hbd extension produced by the program hbond. You should not try to interpret the contents of this file directly, as these files contain some invalid data that is filtered out by joy. See the relevant documentation of hbond for further information.

The accessibility data comes from a file with a .psa extension, produced by the psa program. By default the cutoff value for deciding if a residue is inaccessible is a relative total sidechain accessibility of 7%; this can be changed by using a command line flag to joy.

The secondary structure and omega data comes from the .sst file produced by the program sstruc, written by David Smith.
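
As an illustration of how the accessibility cutoff feeds the upper/lowercase convention of the formatted alignment, here is a small Python sketch. It is not joy's implementation: the function name is hypothetical, the input is assumed to be one relative sidechain accessibility percentage per residue (None for gaps), and treating a value strictly below the cutoff as inaccessible is an assumption, since the text does not say how a value of exactly 7% is handled.

```python
def case_by_accessibility(sequence, accessibility, cutoff=7.0):
    """Uppercase inaccessible residues and lowercase accessible ones.

    `accessibility` holds one relative sidechain accessibility (percent) per
    position, or None where no value applies (for example at a gap).
    """
    if len(sequence) != len(accessibility):
        raise ValueError("need one accessibility value per position")
    out = []
    for residue, value in zip(sequence, accessibility):
        if value is None or not residue.isalpha():
            out.append(residue)             # gaps and unknowns are left untouched
        elif value < cutoff:                # assumption: strictly below the cutoff
            out.append(residue.upper())     # solvent inaccessible
        else:
            out.append(residue.lower())     # solvent accessible
    return "".join(out)
```

Passing a different cutoff plays the role of the command line flag mentioned above.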

CAVEATS

There is a bug in the LaTeX output of residues that have many `features'. For example, if a residue is hydrogen bonded to another sidechain and has a cis-peptide bond then the tilde and breve will not line up above the letter. This is probably due to a problem with nesting of parentheses. If it gives you problems, edit the .tex file.

There is a bug in the numbering routine: if you try to number more than 1 structure then the number labels may appear in the wrong rows of the table. This should be fixed.

Occasionally you will run out of memory in LaTeX. If this happens you will have to manually put a line \clearpage in the .tex file, just before the alignment block that causes latex to run out of memory.

There are bugs; many things probably do not work now. If you find anything like this I will try to fix things by email (overingtonj@pfizer.com). The parts of the code that have almost certainly broken are the substitution table generation routines. I would exercise extreme caution in using these.

REFERENCING

If you use joy and publish anything with it, it would be nice to be referenced. The reference you should use is:

J.P. Overington, M.S. Johnson, A. Šali and T.L. Blundell (1990) ``Tertiary structural constraints on protein evolutionary diversity: Templates, key residues and structure prediction'', Proc. Roy. Soc. Lond., 241, pp. 132-145.

I will also write a small paper specifically on joy in the near future.

SEE ALSO

summer(1), psa(1), orgasmus(1), sstruc(1), atm2seq(1), hbond(1), pdb2atm(1), pir(4)

The program joy is highly complementary to the profile programs of Mark Johnson, who should be contacted for details of these.

RELEASE LEVEL

This document describes joy version 2.7 and later. (The manual page has not been updated since then).