ModPipe command line tools

ModPipe is usually run through the modpipe command line tool, found in the bin directory. A typical run first uses the modpipe add command to populate the ModPipe filesystem with input FASTA sequences, then runs modpipe build for each sequence to find hits and build models. Finally, the modpipe gather command selects the best model(s) from the run. Other commands handle further tasks, such as manipulating alignments or benchmarking the ModPipe system.

Common options

To get help on a given command, use modpipe help <command>. For example, modpipe help add shows the options supported by the ‘add’ command.

modpipe add

Given a FASTA file containing one or more sequences, this command populates the ModPipe filesystem so that it is ready for modpipe build. It also creates a mapping file that maps FASTA identifiers to ModPipe sequence IDs (see Unique mapping file).

modpipe build

Required arguments

--conf_file The name of the configuration file (typically, modpipe.conf) must be provided.

--sequence_id The identifier of the sequence to be modelled must be provided, as created by modpipe add.

Optional arguments

--hits_mode Option that specifies the search methods used to find templates (“hits”) matching the input (“target”) sequence. See Fold assignment methods for full details on each method. The following methods can be requested:

  • Seq-Seq: Sequence-Sequence search.

  • Prf-Seq: Profile-Sequence search using MODELLER Profile.build() profiles.

  • PSI-Blast-Prf-Seq: Profile-Sequence search using PSI-BLAST profiles.

  • Prf-Prf: Profile-Profile search using MODELLER Profile.build() profiles.

  • PSI-Blast-Prf-Prf: Profile-Profile search using PSI-BLAST profiles.

  • Seq-Prf: Sequence-Profile search.

  • Max-PSSM-Seq-Prf: Sequence-Profile search with Max-PSSM scoring.

  • Max-Freq-Seq-Prf: Sequence-Profile search with Max-frequency scoring.

Multiple methods can be requested with several --hits_mode options or with a single comma-separated list. “Seq-Seq” is always added, regardless of user input.

The default for --hits_mode is “Seq-Seq,Seq-Prf”.

Each search method specified will be used independently. See also the CLUSTERALI variable in the configuration file if you use multiple methods.
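The rules above (comma-separated lists, repeated options, the default, and the automatic addition of “Seq-Seq”) can be sketched in Python as follows. This is a minimal illustration, not ModPipe's own option parser: the function names normalize_hits_modes and build_command are hypothetical, the exact-match validation against the documented method names is a choice made for this sketch, and the sequence ID in the usage lines is a placeholder.

```python
# Method names as listed in this section.
KNOWN_HITS_MODES = (
    "Seq-Seq", "Prf-Seq", "PSI-Blast-Prf-Seq", "Prf-Prf",
    "PSI-Blast-Prf-Prf", "Seq-Prf", "Max-PSSM-Seq-Prf", "Max-Freq-Seq-Prf",
)


def normalize_hits_modes(requested=None):
    """Return the list of search methods a run would use (hypothetical helper).

    `requested` holds the values of zero or more --hits_mode options; each
    value may itself be a comma-separated list.
    """
    if not requested:
        requested = ["Seq-Seq,Seq-Prf"]  # documented default
    modes = []
    for value in requested:
        for mode in value.split(","):
            mode = mode.strip()
            if not mode:
                continue
            if mode not in KNOWN_HITS_MODES:
                raise ValueError("unknown hits mode: %r" % mode)
            if mode not in modes:  # drop duplicates, keep first occurrence
                modes.append(mode)
    if "Seq-Seq" not in modes:
        modes.insert(0, "Seq-Seq")  # always added, regardless of user input
    return modes


def build_command(conf_file, sequence_id, hits_modes=None):
    """Assemble an argument list for 'modpipe build' (hypothetical helper)."""
    argv = ["modpipe", "build",
            "--conf_file", conf_file,
            "--sequence_id", sequence_id]
    if hits_modes:
        argv += ["--hits_mode", ",".join(hits_modes)]
    return argv


print(normalize_hits_modes(["Prf-Prf", "Seq-Prf,Prf-Prf"]))
print(build_command("modpipe.conf", "SEQUENCE_ID_FROM_ADD", ["Prf-Seq"]))
```

The argument list returned by build_command could be passed to subprocess.run() once the placeholder is replaced by an identifier created by modpipe add.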

modpipe gather

This command is designed to be run after the main modpipe build command. It parses all of the generated models in the models file and generates a final models file that contains the ‘best’ models (in the same YAML format as the input models file). This can be done either for a single sequence (--seq_id option) or for all sequences in the unique mapping (unq) file (--unq_file option).

One or more selection methods (e.g. pick the model generated from the highest identity template, or that has the best DOPE score) can be specified with the --final_models_by option. The final models file will contain a single model per method, or potentially fewer if multiple methods select the same model (e.g. the model with the highest identity template also happens to have the best DOPE score). The gather command will generally select only models with good scores. The following criteria are used:

  • GA341: score>=0.7

  • z-DOPE: score<0

  • MPQS: score>1.0

  • SEQID: no threshold, model with highest sequence identity will be selected

  • LONGEST_GA341: longest model with GA341 score>=0.7

  • LONGEST_DOPE: longest model with z-DOPE<0

  • TSVMOD: predicted native overlap (3.5 angstrom cutoff) >0.4

  • INPUT_TEMPLATE: model with the highest sequence identity built from the input template; used in template-based calculations.
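The score thresholds in this list can be expressed as simple comparisons. The Python sketch below is illustrative only and is not taken from the ModPipe source: the dictionary layout (a "scores" mapping keyed by criterion name, plus a "length" entry) and the function names are assumptions made for this example.

```python
import operator

# Thresholds as listed above; SEQID has no threshold and is omitted here.
THRESHOLDS = {
    "GA341": (operator.ge, 0.7),
    "z-DOPE": (operator.lt, 0.0),
    "MPQS": (operator.gt, 1.0),
    "TSVMOD": (operator.gt, 0.4),
}


def passing_criteria(scores):
    """Return the names of the criteria whose threshold the scores satisfy.

    `scores` maps a criterion name to a number (hypothetical layout).
    Criteria missing from `scores` are skipped.
    """
    passed = []
    for name, (compare, limit) in THRESHOLDS.items():
        if name in scores and compare(scores[name], limit):
            passed.append(name)
    return passed


def longest_ga341(models):
    """Pick the longest model whose GA341 score is at least 0.7, or None."""
    good = [m for m in models if m["scores"].get("GA341", 0.0) >= 0.7]
    return max(good, key=lambda m: m["length"], default=None)


models = [
    {"name": "m1", "length": 120, "scores": {"GA341": 0.95, "z-DOPE": -0.8}},
    {"name": "m2", "length": 240, "scores": {"GA341": 0.40, "z-DOPE": 0.3}},
    {"name": "m3", "length": 180, "scores": {"GA341": 0.70, "z-DOPE": -0.1}},
]
for model in models:
    print(model["name"], passing_criteria(model["scores"]))
print(longest_ga341(models)["name"])
```

Here m2 is the longest model overall but fails the GA341 threshold, so the LONGEST_GA341-style selection returns m3.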

Often no template spans the whole query sequence. For example, in a two-domain system, one set of templates may yield models for the first domain and another set may yield models for the second domain, while no template covers both domains. Model selection may then pick a good model for one domain and discard the models for the other. In such cases it may make sense to turn on --select_by_region. This clusters the models by region and then applies the selection criteria to each region individually rather than only to the whole sequence. Thus, at most one model per selection method per region will be returned.
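The idea of selecting per region can be illustrated with a small Python sketch. The clustering used here (merging models whose residue ranges overlap) is an assumption chosen for simplicity; the source does not describe how ModPipe clusters models, and the field names "region" and "seqid" as well as both function names are hypothetical.

```python
def cluster_by_region(models):
    """Group models whose (start, end) residue ranges overlap.

    Simple greedy interval merging after sorting by start position; this is
    an illustrative stand-in, not ModPipe's clustering procedure.
    """
    clusters = []
    for model in sorted(models, key=lambda m: m["region"]):
        start, end = model["region"]
        if clusters and start <= clusters[-1]["end"]:
            clusters[-1]["models"].append(model)
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            clusters.append({"start": start, "end": end, "models": [model]})
    return clusters


def best_per_region(models, score_key, higher_is_better=True):
    """Return one model per region cluster, chosen by a single score."""
    pick = max if higher_is_better else min
    return [pick(cluster["models"], key=lambda m: m[score_key])
            for cluster in cluster_by_region(models)]


# Two-domain example: no model covers both domains.
models = [
    {"name": "domain1_a", "region": (1, 100), "seqid": 40.0},
    {"name": "domain1_b", "region": (5, 110), "seqid": 55.0},
    {"name": "domain2_a", "region": (150, 300), "seqid": 30.0},
]
print([m["name"] for m in best_per_region(models, "seqid")])
```

Selecting over the whole sequence by sequence identity would return only domain1_b; selecting per region also keeps domain2_a, which is the situation --select_by_region is meant to address.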

modpipe benchmark

This command is used to benchmark the performance of the ModPipe system by comparing generated models for a sequence to a known structure (usually a PDB crystal structure) of the same sequence.

Given a sequence identifier in the ModPipe filesystem, the command parses the models file, compares each model in that file to the native structure, and writes out a new YAML file. This file is similar to the original models file but contains an extra ‘native_benchmark’ field for each model, holding the benchmark data.

The sequence is identified using the --conf_file and --sequence_id options. The PDB code and chain ID of the known structure are specified using the --native and --native_chain options (use the --pdb_repository option to specify a list of directories to search for PDB files). Finally, the --output_filename option names the output file containing model and benchmark data.

The benchmark data in the output file looks similar to:

native_benchmark:
  code: 1abc
  chain: A
  length: 116
  region: ['1', '116']
  cutoff_rms:
  - {cutoff: 1.0, num_equiv_pos: 115, rms: 0.037}
  - {cutoff: 2.0, num_equiv_pos: 115, rms: 0.037}
  - {cutoff: 3.0, num_equiv_pos: 116, rms: 0.229}
  - {cutoff: 4.0, num_equiv_pos: 116, rms: 0.229}
  - {cutoff: 5.0, num_equiv_pos: 116, rms: 0.229}
  mean_cutoff_rms: 0.152
  mean_num_equiv_pos: 115
  cutoff_rms_35: 0.229
  num_equiv_pos_35: 116
  global_rms: 0.229

The benchmarking data is created using Modeller’s Selection.superpose() method; a simple 1:1 alignment (no gaps) is created between the model sequence and the native structure, using C-alpha atoms to define each residue’s spatial position. The fields reported are:

code

The PDB code of the native structure.

chain

The PDB chain ID of the native structure.

length

The number of residues in the model.

region

The starting and ending PDB residue numbers in the native structure that correspond to the model. (The model may not cover the entire native chain.)

cutoff_rms

A list of comparisons between the model and native structure. Each row is a superposition using a different cutoff to Selection.superpose(). For each cutoff, the number of equivalent positions (number of model residues within the cutoff distance from the same residue in the superposed native structure) is reported. The root-mean-square deviation of model residue positions from the native positions is also reported, for all residues that are within the cutoff distance.

mean_cutoff_rms, mean_num_equiv_pos

The average of the rms and num_equiv_pos values for all rows in cutoff_rms.

cutoff_rms_35, num_equiv_pos_35

The equivalent of the rms and num_equiv_pos values from the cutoff_rms table, for a 3.5 angstrom cutoff.

global_rms

The root-mean-square deviation of model residue positions from native positions, for all residues in the structure.
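The relationship between these fields can be checked with a short Python sketch using the example values shown earlier. It is an illustration of the field definitions, not ModPipe's code. The helper rms_within_cutoff is hypothetical and operates on per-residue C-alpha distances from a single superposition that is assumed to have been done already, whereas ModPipe runs a separate superposition for each cutoff; treating "within the cutoff" as less than or equal is also an assumption.

```python
import math

# Rows copied from the example native_benchmark output above.
cutoff_rms = [
    {"cutoff": 1.0, "num_equiv_pos": 115, "rms": 0.037},
    {"cutoff": 2.0, "num_equiv_pos": 115, "rms": 0.037},
    {"cutoff": 3.0, "num_equiv_pos": 116, "rms": 0.229},
    {"cutoff": 4.0, "num_equiv_pos": 116, "rms": 0.229},
    {"cutoff": 5.0, "num_equiv_pos": 116, "rms": 0.229},
]


def summarize(rows):
    """Average the rms and num_equiv_pos columns over all cutoff rows."""
    count = len(rows)
    mean_rms = sum(row["rms"] for row in rows) / count
    mean_equiv = sum(row["num_equiv_pos"] for row in rows) / count
    return mean_rms, mean_equiv


def rms_within_cutoff(distances, cutoff=None):
    """Count residues within `cutoff` and compute their RMS deviation.

    `distances` are model-to-native distances per residue (hypothetical
    input). With cutoff=None every residue is used, as for global_rms.
    """
    kept = [d for d in distances if cutoff is None or d <= cutoff]
    if not kept:
        return 0, 0.0
    return len(kept), math.sqrt(sum(d * d for d in kept) / len(kept))


mean_rms, mean_equiv = summarize(cutoff_rms)
print(round(mean_rms, 3))  # 0.152, matching mean_cutoff_rms in the example
print(mean_equiv)          # 115.6
print(rms_within_cutoff([0.5, 1.5, 3.0], cutoff=2.0))
```

The example output reports mean_num_equiv_pos as 115, which is consistent with the average of 115.6 being truncated to an integer; the source does not state how that value is rounded.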

modpipe convert

This tool converts an input alignment file to a different format. Since ModPipe accepts only FASTA format for the input sequence (see modpipe add), this tool is useful if your sequence is in another format, such as PIR.