PDB updates for Modeller


Bruno Afonso

2 Oct 2004

8:38 p.m.

Dear modeller hackers :),

I'm trying to use modeller to find good PDB templates, and at least one structure is missing from modeller's database, which is good since it was deposited in the PDB in June 2003 and released in August. Could it be because its resolution is 3.3 Å?

Since I know this PDB is probably a good model and it won't come up in seq_search I was wondering how I could manually update CHAINS_all.seq or create my own sequence database.

best, BA

ps: I found in the archives a person asking a similar question regarding a PDB database update but no script or url was given to get a newer version, or without resolution cut-off. :-)

-- Bruno Afonso http://brunoafonso.net http://dequim.ist.utl.pt/~bruno/sciTocs/ - Bruno's SciTocs! http://freebsd-pt.org/forum/ - Portuguese FreeBSD forum


Eswar Narayanan

2 Oct 2004

8:59 p.m.

> Since I know this PDB is probably a good model and it won't come up in
> seq_search I was wondering how I could manually update CHAINS_all.seq
> or create my own sequence database.

The latest release of MODELLER (version 7v7, released last month) has a new command called SEQFILTER that can be used to cluster PDB sequences. You can use MAKE_CHAINS (also in the latest release) to collect the PDB chains prior to running SEQFILTER.

--- Eswar Narayanan, Ph.D Mission Bay Genentech Hall 600 16th Street, Suite N474Q University of California - San Francisco San Francisco, CA 94143-2240 (CA 94158 for courier) Tel +1 (415) 514-4233; Fax +1 (415) 514-4231 http://www.salilab.org/~eashwar

Bruno Afonso

3 Oct 2004

7:47 a.m.

Eswar Narayanan wrote:
>> Since I know this PDB is probably a good model and it won't come up in
>> seq_search I was wondering how I could manually update CHAINS_all.seq
>> or create my own sequence database.
>
> The latest release of MODELLER (version 7v7, released last month) has a
> new command called SEQFILTER that can be used to cluster PDB sequences.
> You can use MAKE_CHAINS (also in the latest release) to collect the PDB
> chains prior to running SEQFILTER.

There was a mistake in my previous e-mail. :) The PDB sequence is missing, which is *bad*, not good. I'm sorry to ask these questions, but I'm still puzzled as to how to deal with this:

1) What are the criteria for making CHAINS_all.seq? I ask this because clearly not all of PDB is there :) and there are sequences there with resolutions as high as 5.0 angstroms...

2) Can't I make a CHAINS_all.seq-like file with MY criteria without writing my own script? i.e., is there a "right way"(TM) to do it?

3) I can use MAKE_CHAINS and then load the .chn as a database, but that involves me first finding the good PDBs that aren't in modeller's DB, which kind of misses the whole point. I was using modeller to try to find the good ones in the first place.

Thanks for the tip on seqfilter, but my problem was the sequence missing in the modeller's default database in the first place ;-)

sorry for the inconvenience, BA


J B Procter

4 Oct 2004

6:37 a.m.

It is possible to build new sequence databases for modeller - and, as Eswar said, there are two relevant commands. Writing a script to do this is unavoidable, though, unless the caretaker has one ready for everyone to download!

As a very quick fix, you could get the current pdb sequence list from here : ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt

Then, follow the script in modeller7v7/examples/commands/build_profile.top, which shows how you can read in a simple FASTA sequence flatfile database, like the one from the pdb website, and then use it to align against your sequence in order to build a sequence profile (and by that, retrieve all homologous sequences from the PDB).
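Before pointing modeller at the downloaded file, it can be worth checking that the entry you care about is actually in it. The following is a minimal Python sketch, not part of modeller; it assumes the pdb_seqres.txt header layout `>101m_A mol:protein length:154  MYOGLOBIN`, and the function names `read_fasta` and `chains_for_entry` are made up for this illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def chains_for_entry(records, pdb_id):
    """Return (chain_label, sequence) for every chain of one PDB entry.

    Assumes each header starts with '<pdb id>_<chain id>', as in
    pdb_seqres.txt; other FASTA files will need a different test.
    """
    prefix = pdb_id.lower() + "_"
    return [(h.split()[0], s) for h, s in records
            if h.lower().startswith(prefix)]
```

Typical use would be `chains_for_entry(read_fasta(open("pdb_seqres.txt")), "101m")`; an empty list means the entry is not in the file.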

To do the job properly, you need to apply the make_chains command (modeller7v7/examples/commands/make_chains.top) to generate the extra information that is written into the PIR information fields, and used by modeller to fetch the correct PDB file for each sequence in the database.

If you have a mirror of the PDB, then this script (for unix) might work:

#!/bin/bash
# Makes chain records and appends them to pdb_seq.chn in the current working
# directory.
# You need to change this to point to your local copy of the PDB.

PDBDIR="/projects/biodata/pdb/data/structures/all/pdb"

for p in "$PDBDIR"/*.ent.Z
do
    # Strip the directory and the .ent.Z suffix, e.g. pdb1abc.ent.Z -> pdb1abc
    y=$(basename "$p" .ent.Z)
    # Double quotes let the shell expand $p and still pass the single quotes
    # through to the TOP script; backslash-newline keeps it on one line.
    echo "READ_MODEL FILE = '$p'" > make_chains_.top
    echo "MAKE_CHAINS MINIMAL_CHAIN_LENGTH = 30, \
MINIMAL_RESOLUTION = 2.0, MINIMAL_STDRES = 30, \
CHOP_NONSTD_TERMINII = on, \
STRUCTURE_TYPES = 'structureN structureX'" >> make_chains_.top
    mod7v7 make_chains_.top
    # Collect the .chn files written for this entry (pdb1abc -> ./1abc.*.chn),
    # then remove them.
    cat ${y/pdb/./}.*.chn >> pdb_seq.chn
    rm ${y/pdb/./}.*.chn
done

After that, which will take some time to run, pdb_seq.chn will contain a subset of all the PDB chains, in a similar form to the CHAINS_all.seq file.

You should, then, be able to read this new database in, apply SEQFILTER (see examples/commands/seqfilter.top), and write out the list of chain representatives (at 95%, for instance). For best use, you should rewrite the database (via READ_SEQUENCE_DB and WRITE_SEQUENCE_DB) in binary format and limit it to just the representative sequences generated by SEQFILTER (by specifying the CHAINS_LIST option on READ_SEQUENCE_DB).

Enjoy! j.

_______________________________________________________________________ Dr JB Procter:Biomolecular Modelling at ZBH - Center for Bioinformatics Hamburg http://www.zbh.uni-hamburg.de/staff.php

Eswar Narayanan

5:41 p.m.

Procter is right. BUILD_PROFILE can be seen as a command that supersedes SEQUENCE_SEARCH, to identify potential templates and get a reliable alignment for modeling.

Eswar.


Eswar Narayanan

5:37 p.m.

On Oct 3, 2004, at 7:47 AM, Bruno Afonso wrote:

> Eswar Narayanan wrote:
>>> Since I know this PDB is probably a good model and it won't come up
>>> in seq_search I was wondering how I could manually update
>>> CHAINS_all.seq or create my own sequence database.
>>
>> The latest release of MODELLER (version 7v7, released last month) has
>> a new command called SEQFILTER that can be used to cluster PDB
>> sequences. You can use MAKE_CHAINS (also in the latest release) to
>> collect the PDB chains prior to running SEQFILTER.
>
> There was a mistake in my previous e-mail. :) The PDB sequence is
> missing, which is *bad*, not good. I'm sorry to ask these questions,
> but I'm still puzzled as to how to deal with this:

If you know exactly what your template(s) is(are) going to be, you do not have to use SEQUENCE_SEARCH to "identify" your template. You can use any of the alignment commands (ALIGN, ALIGN2D etc) to create your alignment and model your sequence based on that alignment.

> 1) What are the criteria for making CHAINS_all.seq? I ask this because
> clearly not all of PDB is there :) and there are sequences there with
> resolutions as high as 5.0 angstroms...

One usually wants to use a non-redundant version of PDB to search for templates. One way is to first select sequences of all X-ray structures that are solved at a resolution better than 3.5 Å, are longer than 30 aa, have no more than 10 non-standard residues, and have at least 30 standard residues. These can all be specified as options to MAKE_CHAINS. You can then cluster these sequences using SEQFILTER to remove redundancies with a sequence identity threshold (usually set at 30% or 95%).
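To make the selection criteria above concrete, here they are written as a plain Python predicate. This is only an illustration of the logic, not how MAKE_CHAINS is implemented; the `ChainSummary` record and its field names are hypothetical, and you would have to fill them in from your own parsing of the PDB files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChainSummary:
    """Hypothetical per-chain summary; not a MODELLER data structure."""
    code: str                    # e.g. "1abcA"
    method: str                  # "X-RAY", "NMR", ...
    resolution: Optional[float]  # in angstroms; None if not reported
    n_standard: int              # number of standard residues
    n_nonstandard: int           # number of non-standard residues

    @property
    def length(self) -> int:
        return self.n_standard + self.n_nonstandard


def passes_filter(chain, max_resolution=3.5, min_length=30,
                  max_nonstandard=10, min_standard=30):
    """Apply the criteria from the paragraph above to one chain."""
    return (chain.method == "X-RAY"
            and chain.resolution is not None
            and chain.resolution < max_resolution  # "better than" means lower
            and chain.length > min_length
            and chain.n_nonstandard <= max_nonstandard
            and chain.n_standard >= min_standard)
```

With these defaults a 3.3 Å X-ray chain of reasonable length is kept, while a 5.0 Å one is dropped; clustering the survivors is still a job for SEQFILTER.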

Ben has put these files on the web at http://salilab.org/modeller/supplemental.html. These are the representative sequences derived from PDB files at 30% and 95% sequence identity. All X-ray and NMR PDB chains, with no limits on resolution, that are at least 30 aa long, have more than 30 standard residues and not more than 10 non-standard residues were used to get these files. This is just the output of SEQFILTER on last week's release (09-28-04) of PDB.

> 2) Can't I make a CHAINS_all.seq-like file with MY criteria without
> writing my own script? i.e., is there a "right way"(TM) to do it?

See the comments above.

> 3) I can use MAKE_CHAINS and then load the .chn as a database, but
> that involves me first finding the good PDBs that aren't in
> modeller's DB, which kind of misses the whole point. I was using
> modeller to try to find the good ones in the first place.
>
> Thanks for the tip on seqfilter, but my problem was the sequence
> missing in the modeller's default database in the first place ;-)

The reviews listed on the modeller web-site (http://salilab.org/modeller/documentation.html) will help you understand the process of identifying a useful template for modelling.

J B Procter

6:44 a.m.

New subject: Hard limits for size of sequence database

Hi. I was wondering if there is a hard wired maximum size for the number of sequences that can be held in memory. I got this error :

Input/Output Error 192: Record too large

In Procedure: read_sequence_db_module..read_sequence_list1
At Line: 156

Statement: Formatted READ
Unit: 11
Connected To: /local/procter/Nrdb2mod7/nr.fsa
Form: Formatted
Access: Sequential
Records Read : 769635
Records Written: 0

End of diagnostics

when I tried to read the NCBI non-redundant sequence database into Modeller (v7).

ta! j.


Modeller Caretaker

6 Oct 2004

10:37 a.m.

New subject: Hard limits for size of sequence database

On Mon, Oct 04, 2004 at 03:44:25PM +0200, J B Procter wrote:
> Hi. I was wondering if there is a hard wired maximum size for the number
> of sequences that can be held in memory. I got this error :
>
> Input/Output Error 192: Record too large

No, but there is a maximum size for a 'record', which translates to the line length. There appears to be no way round this in Fortran, short of ditching formatted IO entirely. Modeller uses a maximum line length of 16384 characters, which I guess used to be 'long enough for everyone', but some of the NR database IDs end up being longer than this, if many corrections have been made. The simplest workaround for now is to chop any excessively long lines before feeding the database to Modeller (e.g. with a Perl script). We will look into getting round this problem 'properly' in the next release.

Ben Webb, Modeller Caretaker

-- modeller-care@salilab.org http://www.salilab.org/modeller/ Modeller mailing list: http://salilab.org/mailman/listinfo/modeller_usage
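The workaround suggested above (chopping over-long lines before feeding the database to Modeller) was done with a Perl script in this thread; that script was not posted. What follows is a rough Python equivalent, offered as a sketch only. It assumes the over-long lines are FASTA header lines (starting with `>`) and that sequence lines are already wrapped; the function names and the one-character safety margin below 16384 are my own choices, not anything Modeller specifies.

```python
MAX_LINE = 16384  # line-length limit quoted in the reply above


def chop_headers(lines, limit=MAX_LINE - 1):
    """Yield lines unchanged, except FASTA headers cut to `limit` characters."""
    for line in lines:
        body = line.rstrip("\r\n")
        # Only headers are shortened; cutting a sequence line would lose residues.
        if body.startswith(">") and len(body) > limit:
            body = body[:limit]
        yield body + "\n"


def chop_file(src_path, dst_path, limit=MAX_LINE - 1):
    """Copy a FASTA file, shortening over-long header lines on the way."""
    with open(src_path, "r", errors="replace") as src, \
         open(dst_path, "w") as dst:
        for line in chop_headers(src, limit):
            dst.write(line)
```

For example, `chop_file("nr.fsa", "nr_chopped.fsa")` would write a copy in which no header exceeds the limit. If your database also has very long unwrapped sequence lines, those need re-wrapping rather than truncation, which this sketch does not do.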

J B Procter

7 Oct 2004

9:24 a.m.

New subject: Hard limits for size of sequence database

On Wed, 6 Oct 2004 10:37:57 -0700 Modeller Caretaker modeller-care@salilab.org wrote:

>> Input/Output Error 192: Record too large
>
> No, but there is a maximum size for a 'record', which translates to
> the line length. There appears to be no way round this in Fortran,

I should have remembered that :-)

> short of ditching formatted IO entirely. Modeller uses a maximum line
> length of 16384 characters, which I guess used to be 'long enough for
> everyone', but some of the NR database IDs end up being longer than
> this, if many corrections have been made.

I see - not something you notice with the way that ?blast truncates the id line to three or so rows in its output. The perl script has been written....

Thanks!

