File formats

Hits file

The hits file (usually with a .hit extension, or .sel for hits selected by filtering) is in standard YAML format. It contains a header indicating which version of ModPipe generated the file, followed by a list of hits, each starting with the line ‘- !<Hit>’. Each hit contains the following fields:

  • sequence: contains a unique identifier for the target sequence (id) and its length.

  • alignment: contains a unique identifier for the target-template alignment, the chi-squared and KS statistics for the agreement between the observed and expected score distributions, the E-value of the alignment, and the percentage of gaps in the alignment.

  • region: the actual beginning and end of the region modeled.

  • fold_assignment_method: the method used to detect the template and create the alignment (see Fold assignment methods for a fuller description).

  • highest_sequence_identity: the highest sequence identity to the target, as a percentage, among all templates.

  • templates: a list of the templates used. For each template, the PDB code and chain, the region used in the alignment, and the sequence identity between this template and the target are listed.

A hits file may contain multiple YAML documents. This can be useful for breaking up very large files when using YAML parsers that read in a whole document at a time.
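To illustrate why multiple documents help with large files, the following minimal sketch splits a stream into per-document chunks so that each one can be handed to a YAML parser separately. It assumes that documents are separated by the standard YAML ‘---’ marker at the start of a line and that no ‘%’ directives appear between documents. The function names are hypothetical and are not part of ModPipe; the sketch does not use modpipe.serialize.

```python
def is_document_marker(line):
    """Return True if the line is a YAML document start marker ('---')."""
    stripped = line.rstrip("\r\n")
    # The marker must sit at column 0 and be followed by whitespace or
    # the end of the line (it may carry content, e.g. '--- !tag').
    return stripped == "---" or stripped.startswith(("--- ", "---\t"))


def split_yaml_documents(lines):
    """Yield the text of each YAML document found in an iterable of lines.

    Only one document is held in memory at a time. The marker line is kept
    at the start of the document it opens, so every chunk is valid YAML.
    """
    current = []
    for line in lines:
        if is_document_marker(line) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, and each yielded chunk can then be parsed on its own.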

Models file

The models file (usually with a .mod extension) is in YAML format. It contains a header indicating which version of ModPipe generated the file, followed by a list of models, each starting with the line ‘- !<Model>’. Each model contains the same fields as in the hits file, plus the following:

  • id: a unique identifier for the model.

  • hetatms: number of HETATM residues in the model.

  • waters: number of waters in the model.

  • score: a set of assessment scores. These include the MODELLER objective function; the DOPE and DOPE-HR scores; the GA341 score, compactness score, and its individual components (pairwise distance energy, surface area energy, combined energy) and the Z scores of each; the TSVMod scores (if requested); the normalized DOPE Z score; and the ModPipe quality score. The quality score is defined as (model_len / target_len) + (sequence_identity / 100) + (GA341_score / 10) - (percentage_gaps / 100) - alignment_ks_stat - (normalized_DOPE_score / 10); higher scores are better.

  • rating: a string of 9 digits that rates the quality of the alignment and the model by various measures (1 if the measure is good, 0 if it is poor). The digits correspond, in order, to the following measures:

    1. Coverage

    2. Chi-squared alignment score

    3. Alignment E-value

    4. Gap ratio

    5. Sequence identity

    6. GA341 score

    7. GA341 compactness score

    8. DOPE Z score

    9. TSVMod Predicted Native Overlap (3.5)
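The quality score formula and the rating string described above can be sketched in Python as follows. This is an illustration only: the function names, parameter names, and measure labels are hypothetical and are not taken from ModPipe, and the rating is assumed to be available as a 9-character string of ‘0’ and ‘1’ digits.

```python
def quality_score(model_len, target_len, sequence_identity, ga341_score,
                  percentage_gaps, alignment_ks_stat, normalized_dope_score):
    """Combine the terms of the quality score formula; higher is better.

    sequence_identity and percentage_gaps are percentages (0 to 100).
    """
    return (model_len / target_len
            + sequence_identity / 100
            + ga341_score / 10
            - percentage_gaps / 100
            - alignment_ks_stat
            - normalized_dope_score / 10)


# Hypothetical labels for the nine rating digits, in the documented order.
RATING_MEASURES = (
    "coverage",
    "chi_squared_alignment_score",
    "alignment_e_value",
    "gap_ratio",
    "sequence_identity",
    "ga341_score",
    "ga341_compactness_score",
    "dope_z_score",
    "tsvmod_predicted_native_overlap",
)


def decode_rating(rating):
    """Map a 9-digit rating string to {measure: True if good, False if poor}."""
    if len(rating) != len(RATING_MEASURES) or set(rating) - {"0", "1"}:
        raise ValueError("rating must be a string of nine 0/1 digits")
    return {name: digit == "1"
            for name, digit in zip(RATING_MEASURES, rating)}
```

For example, decode_rating("110000000") marks only the coverage and chi-squared measures as good. Note that in quality_score a more negative normalized DOPE score raises the result, since that term is subtracted.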

As for hits files, models files may contain multiple YAML documents.

Note

The Python module modpipe.serialize contains functions and classes that Python scripts can use to handle the YAML data in hits files and models files.