This file contains the modifications made to the alignments database:

August 3, 1993:

This is the first official release of the alignments database. It was published in Sali, Overington, and Karplus, 1993.

ALBASE 1: 97 alignment files, with igvar split into igvar-v and igvar-l in the ALBASE_1.list file.
ALBASE 1a: the same as ALBASE 1, except that the igvar alignments are completely removed.

August 13, 1993 (AS):

tren, 2hsc, 1pai, 1esp, 1sav, 1ypa PDB codes were changed to start with X to indicate that they have not yet been deposited in PDB.
mucor pusilus --> mucor pusillus; spelling only.
fkbp was changed from an NMR structure to an X-ray structure.

August 19, 1993 (AS):

Agreed with JPO on C; and R; comments replacing all other comments (changes to *.ali, joy, MODELLER).

Jan 7, 1993 (AS):

Obtained the whole set of alignments from JPO. Created ALBASE_2. Major changes from ALBASE_1:

Added: ace.ali, cyp.ali, cyto.ali, igcell2.ali, porin.ali, rnase2.ali, sh3.ali
Deleted: Hla_ig.ali
Renamed (modified):
   igcell.ali --> igcell1.ali
   igvar.ali --> igvar-h.ali and igvar-l.ali
   sod2.ali --> sodcu.ali
Modified by JPO: most of the files.

All alignments in one directory.

Editing zf-CCHH.ali to make full entry specification records for the Pabo zinc-finger sequences.
Editing asp, igvar-h, sermam, serpin, az (and maybe one or two others) to correct small chain specification mistakes.
Editing sh3.ali to make full entry specification records for the 1pnj entry.
Editing cyp.ali (N271 changed to E271).
Non-existing 1neaN PDB filename changed to 1nea (I hope JPO did not rename/modify the original PDB for a good purpose).
Non-existing 1zaa1, 1zaa2, and 1zaa3 PDB filenames were changed to 1zaa.
Changing the special (non-PDB) names back to X:

   mv 2hsc.atm Xhsc.atm (actin.ali)
   mv tren.atm Xren.atm (asp.ali)
   mv 1ypa.atm Xypa.atm (asp.ali)
   mv 1pai.atm Xpai.atm (serpin.ali)
   mv 1esp.atm Xesp.atm (subt.ali)
   mv 1sav.atm Xsav.atm (subt.ali)

JPO: I beg you to accept this convention that non-PDB files start with X, not a digit or something else. This allows an easy identification of such structures by grep and in printed Tables where we do have to acknowledge the individual sources (like in our database paper). The 1pai situation is another reason: there is a model in PDB file 1pai, which is not your 1pai (my Xpai).

The database was checked by producing TeX files for all alignments with joy and by running MDT on the alignments and PDB, producing the minimal PDB subset and some other statistics.

January 20, 1994: ALBASE_2a.

JPO sends back an upgrade of AS ALBASE_2. It is called ALBASE_2a. This is what he said about his changes to ALBASE_2:

Two new families (or ones you did not email back)
   p450 - cytochrome p450s
   hpr - histidine carrier proteins

A small number of changes (arrows are the wrong way around for diffs, sorry)

1) fkbp
     C; class: alpha plus beta
   > C; class: akoha plus beta

2) icd - line 15 (D - E) mutation
   < HPELTDMVIFRENSEDIYAGIEWKADSADAEKVIKFLREEMGVKKIRFPDHCGIGIKPCSEEGTKRLVRAAIEYA
   > HPELTDMVIFRENSEDIYAGIEWKADSADAEKVIKFLREEMGVKKIRFPEHCGIGIKPCSEEGTKRLVRAAIEYA

3) igcell1
     family: immunoglobulin -- cell surface - type 1
   > family: immunoglobulin cell surface - type 1

4) igcell2
     family: immunoglobulin -- cell surface - type 2
   > family: immunoglobulin cell surface - type 2

5) kazal
     family: serine proteinase inhibitor -- Kazal-type
   > family: serine proteinase inhibitor Kazal-type

6) rhv - chain breaks made explicit in alignment (/ character where appropriate); also a name change for one of the proteins (it was too long for tables). Diffs are not shown, because they are nominally substantial.
I also made some alignment changes along the lines that chain A is never equivalenced with chain B of a differing protein. I think this is a more consistent treatment of the data.

7) sermam - new structure (salmon trypsin) (1tbs)

   rat trypsin (N->D), as discussed in previous email; for my own selfish evolutionary reasons I like the D.

   chymotrypsin, took out S early on in sequence (only N in file for this)
   < --------CGVPAIQPVL///////////////////IVNGEEAVPGSWPWQVSLQDKT---GFHFCGGSLINEN $
   > --------CGVPAIQPVLS//////////////////IVNGEEAVPGSWPWQVSLQDKT---GFHFCGGSLINEN

   tonin (I think) deleted I, similar reason to above. + Chain break here
   < WVITAAHCY------SN----NYQVLLGRNNLFKDE-PFAQRRLVRQSFRHPDYIPL/PVHDHSNDLMLLHLSEP $
   > WVITAAHCY------SN----NYQVLLGRNNLFKDE-PFAQRRLVRQSFRHPDYIPLIPVHDHSNDLMLLHLSEP

8) sh3 - name change, too long for table
   < structureN:1pnj: 1A: : 84 : :p85-alpha subunit SH3 domain:-1.00:-1.00
   ---
   > structureN:1pnj: 1A: : 84 : :phosphatidylinositol 3-kinase (p85-alpha subun

9) sodcu - new structure (1srd) + consequent changes in alignment to accommodate this new sequence

10) subt - change 1sav structure (Unilever proprietary) to public domain 1st3.

11) zf-CCHH - new structure 1ard + couple of name changes

I hope this is enough detail for you.

Also I have put in name changes, to the proper XABC code for private coord sets (affects asp, serpin and subt families).

I then renamed the relevant files, ran the thing through joy again (with no errors at all), regenerated the tables, and enclose at the bottom of the file the latest version of the table for the database. Now that the thing truly is automated it is a lot easier to keep the thing up to date.

Do you have groff? Do you have Phylip?
(You need groff to print the table from troff source, and you need phylip to get nice trees from the alignments.)

I have partly changed joy so that the phylip tree order is the order found on the LaTeX output (so that one can stick the two things together and get a nice figure), but this is buggy at the moment for several reasons. I will email you joy with all your/mine modifs soon, but I am sure you are in no hurry for this.

P.S. As you will see there is a bug in the numbering of the families in the first table. This is not as easy to fix as it should be, but I will do it soon. I will also put the whole thing under the control of make (i.e. from atm files to everything). I imagine that you could then extend this to take into account all of the modeller db stuff as well.

------------------------------------------------------------------------------

AS made the following changes (January 25, 1994; remains ALBASE_2a) to allow an automated cross-reference with the original PDB:

Do a diff on sh3.ali and zf-CCHH.ali; the changes are obvious.

John:

There is an old problem with several ali files (e.g. 3icd in icd.ali, and 1ton and 1trm in sermam.ali, etc.). I think we have to have a strict rule of what must be a sequence of a protein with a given name in the .ali file. I think the only reasonable rule is that the sequence must be exactly the same as the sequence of residues that have at least one atom in the PDB file with the same name. There is no logical reason here to distinguish CA atoms from the others and require that the CA atom must be present. Also, I do not think that the SEQRES records should be taken into account even though they may be in better agreement with the sequence databases. If we agree, a simple solution to your historical problems is that you slightly change the root of the original PDB name, store the non-PDB sequence in that file, and also include it in the alignment, together with the original PDB sequence.
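The rule proposed above (one sequence position for every residue that has at least one atom in the PDB file, with no CA requirement and no use of SEQRES) can be sketched as follows. This is only an illustration under stated assumptions, not the code used by MODELLER, MDT, or joy: the function name sequence_from_atoms is hypothetical, only ATOM records of the first model are read, HETATM records are skipped, and any residue name outside the 20 standard amino acids is written as X.

```python
# Hypothetical sketch: derive a one-letter sequence from PDB ATOM records.
# Assumptions (not from the original notes): standard fixed-column PDB
# format, first model only, HETATM ignored, unknown residues become "X".

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_atoms(pdb_lines, chain=None):
    """Return the sequence of residues that have at least one ATOM record.

    A residue is counted as soon as any of its atoms is seen; the CA atom
    is not required and SEQRES records are never consulted. Residues are
    reported in the order in which they occur in the file.
    """
    sequence = []
    previous = None
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # stop after the first model of a multi-model entry
        if not line.startswith("ATOM  ") or len(line) < 27:
            continue
        chain_id = line[21]                # column 22: chain identifier
        if chain is not None and chain_id != chain:
            continue
        residue = (chain_id, line[22:27])  # columns 23-27: number + insertion code
        if residue != previous:
            sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
            previous = residue
    return "".join(sequence)
```

Comparing this string with the sequence stored in an .ali entry (after removing gap and chain break characters) would flag entries such as 3icd, 1ton, or 1trm that have drifted from the PDB file.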
The current .ali files conform to the rule above. It is difficult to see how the database could be portable to other machines, which have to rely on the original PDB, if we do not accept this rule.

Also, I had to make a lot of changes for the second time. I could see why you did not want to accept some of these changes, but not all of them. The only important thing for me is that we agree on the rules and that we follow them. I do not mind if the rules are not entirely mine. I apologize for being a pain in the ass here but I hate repetitive, dull, and unnecessary work.

The thing is that you are probably using manually edited PDB files in which you have manually taken care of the problems that are encountered by the automated PDB access because the rules are not strictly observed. Also, I do not know how joy treats partially present residues. Given all that, I would still like to convince you that the future is in complete automation. For example, with the new alignment commands in MODELLER, the superposed sets of all PDB structures in the alignments can easily be generated automatically (given your alignments), so there is no need at all for any manually edited files -- we only need the original PDB distribution, which is just one ftp command away, not days of error-prone editing away. I'd better stop my propaganda now.

Non-existing 1zaa1, 1zaa2, and 1zaa3 PDB filenames were again changed to 1zaa: the code in >P1;code can be 1zaa1, but the second line should contain the root for the PDB filename (1zaa if the original PDB file is to be used; otherwise, Xzaa1 or something like that should be specified and a new PDB file created).

A similar change again from 1neaN to 1nea.

I could not find 2hsc and 1esp gift structures in PDB today. I changed them back to Xhsc and Xesp. What is the situation here?
Did you maybe forget that the second line contains the PDB filename spec, and that the first line (>P1;Xesp) can be anything, although it is best if it is the same as the PDB root plus chain ID (this chain ID rule is not strictly enforced and is not essential, but breaking it complicates the Table because we then cannot say that the chain ID is the last uppercase letter in the protein code, where applicable).

Chain IDs for the first and last residue were added again to 1lya in asp.ali.
Residue numbers of 1azu and 2plt were corrected again.
Sequence of 2hfl in igvar-h.ali edited again to reflect the PDB sequence.
Sequence of 2cyp in peroxidase.ali edited again to reflect the PDB sequence.
Chain ID for 1tbs in sermam.ali corrected.
Sequence of 2gch edited again to reflect the PDB sequence.
Chain IDs and residue numbers changed for 1ppb in sermam.ali.
1pai PDB code changed again to Xpai in serpin.ali.
1hle residue numbers in serpin.ali changed again.

February 7, 1994, fkbp.ali:

1fkb structureN is changed to structureX.

February 28:

Changed 2cyp.ali to reflect the changed sequence in PDB (272 goes to Asn).

March 18:

Commented out all sequence entries (igcell1.ali and zf-CCHH.ali).

March 19:

Changed multiple chain break characters ///// to /---- to allow easier MODELLER code (sermam.ali, rvp.ali, rsp.ali).

May 10:

Received a set of test files for Release 3 from JPO. He could include his changes here.

May 11:

Editing the JPO test set to make it run with the latest PDB release, mdt and joy:

1) multiple //// replaced by a single / in all alignments: ace_NEW.ali asp_NEW.ali cyto_NEW.ali flavbb_NEW.ali hemocyan_NEW.ali ins_NEW.ali ldh_NEW.ali prc_NEW.ali rhv_NEW.ali rnh_NEW.ali sermam_NEW.ali serpin_NEW.ali.
2) PDB entries changed, so the corresponding .ali files had to be updated: 1f3g (gpr.ali); 1hom, 1lfb (hom.ali); 1ppb, 1bbr (sermam.ali); 2ins & 4ins (ins.ali); 2ctx, 1fas (toxin.ali); 1bbs (asp.ali); 2tbv (bv.ali). This is probably also true for other .ali files which you have not sent me.

3) 1prc in prc.ali has residue FOR inserted in the .ali file so that it now corresponds to the PDB file (JPO: either split the alignment into separate subunits so that the first FOR can be omitted, or introduce the FOR residue type in joy; otherwise we cannot distribute prc.ali for the PDB files as they are).

4) ins.ali: the order of some segments is not the same as in the PDB file, which is not permitted; omitted 2gf1, which necessitates this (in effect, using the ALBASE2 version of ins.ali).

May 30:

JPO sent me his final complete set, which I edited again to make it consistent with the April 94 PDB release and my programs (ins.ali changed to ins-jpo.ali, and the old ins.ali with the PDB order of segments used instead). There were many sequence and chain ID changes also. This is my release 3. No HETATMs are included in the alignments because SSTRUC does not work with HETATM; they are filtered out by mkfile() in MDT. Tested with PSA, DIH, PDB, NGH, and SSTRUC.
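
The two chain break clean-ups recorded above (March 19: ///// changed to /----; May 11: multiple //// replaced by a single /) can be sketched as below. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, and the functions are assumed to be applied only to sequence lines of an .ali file, never to the >P1; header, the structure specification line, or C; and R; comment lines.

```python
import re

# A run of two or more chain break characters.
_BREAK_RUN = re.compile(r"/{2,}")


def pad_chain_breaks(sequence_line):
    """Keep one '/' per run and turn the rest into gaps ('/////' -> '/----').

    The line keeps its length, so the alignment columns are not shifted.
    """
    return _BREAK_RUN.sub(
        lambda match: "/" + "-" * (len(match.group()) - 1), sequence_line
    )


def collapse_chain_breaks(sequence_line):
    """Replace each run by a single '/' ('/////' -> '/').

    This shortens the line, so every sequence in the alignment would have
    to be adjusted consistently; the notes do not say how that was handled.
    """
    return _BREAK_RUN.sub("/", sequence_line)
```

The padded form is the safer of the two for a hand-maintained alignment, because the other sequences in the family do not need to be touched.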