2/1/2005

#######################################################
#  MIFS for non-synonymous SNP annotation             #
#        INSTRUCTIONS                                 #
#######################################################

Software used in
"Improving functional annotation of non-synonymous SNPs
with information theory", Karchin et al., PSB 2005


The Perl scripts are designed to run on any Linux system.
They should work on other Unix-based systems, but have only been
tested on Linux.


After unpacking the tar ball, you should have the following files:

--------------------------------------------------------------------------

PERL SCRIPTS:

mutual-info-2D    takes as input a two-column data file in RDB format 
                  (see EXAMPLE DATA FILES below and 
                  http://www.soe.ucsc.edu/research/compbio/rdb/index.html
                  for RDB documentation) and estimates the 
                  mutual information between the data in the two columns. 

MIFS              implementation of Battiti's MIFS algorithm



PERL MODULE:

RDBtable.pm   perl module of routines to manipulate RDB tables

------------------------------------------------------------------------

EXAMPLE DATA FILES

Grantham-feature-pairs.rdb         RDB formatted input files to mutual-info-2D
ASA_WT-Grantham-feature-pairs.rdb

Grantham-MI                        mutual-info-2D output files
ASA_WT-Grantham-MI

candidate-features                 list of features to evaluate

best-features-example              MIFS output file

----------------------------------------------------------------------------

Makefile                           demonstrates example usage of 
                                   mutual-info-2D and MIFS

----------------------------------------------------------------------------



To run the mutual-info-2D script, you will also need two perl modules
available from http://www.cpan.org

Algorithm
Statistics

Download and install these before beginning.

----------------------------------------------------------------------------
Instructions:

1.  Prior to using the MIFS script, you must have selected:
        --  a functionally annotated set of examples.

            Example: a list of nsSNPs and their functional effects.
            We used experimentally characterized point mutations in 
            lac repressor and lysozyme obtained from Pauline Ng 
            email: sift@fhcrc.org
 
        --  a discrete description of the functional effects ("class labels")
            we use "N" and "E" for "no effect" and "effect"

        --  a list of features describing the examples that you wish to 
            evaluate.  

            Example: Grantham values of the nsSNP amino acid changes, 
            solvent accessibility of the amino-acid residue positions, and 
            so forth.

2.  Use the mutual-info-2D script to:

   --  compute the mutual information between each feature and the class labels
   --  compute the mutual information between each pair of features

   HOW?

   Discretize each feature into bins and assign a letter to each bin.

   Example:  For Grantham values, we chose to use five equal-frequency bins
   with the lac repressor/lysozyme data.  
   Bin "A"   0 <= x < 52
   Bin "B"   52 <= x < 84
   Bin "C"   84 <= x < 102
   Bin "D"   102 <= x < 132
   Bin "E"   x >= 132
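
   The binning above can be sketched in a few lines of Python.  This is
   only an illustration and is not part of the distributed Perl scripts;
   the names GRANTHAM_EDGES, BIN_LETTERS and bin_letter are hypothetical.
   The bin edges are the ones listed above for the lac repressor/lysozyme
   data; you would substitute edges computed from your own data.

```python
from bisect import bisect_right

# Upper-exclusive bin edges copied from the Grantham example above.
# These names are hypothetical and not used by the Perl scripts.
GRANTHAM_EDGES = [52, 84, 102, 132]
BIN_LETTERS = "ABCDE"


def bin_letter(value, edges=GRANTHAM_EDGES, letters=BIN_LETTERS):
    """Return the bin letter for a non-negative feature value."""
    if value < 0:
        raise ValueError("feature value must be >= 0")
    # bisect_right counts how many edges are <= value; that count is
    # the index of the bin, so a value equal to an edge falls in the
    # higher bin (e.g. 52 goes to "B").
    return letters[bisect_right(edges, value)]


if __name__ == "__main__":
    for grantham in (0, 51, 52, 101, 132, 215):
        print(grantham, bin_letter(grantham))
```

   The same function can be reused for any other feature by passing a
   different list of edges.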

   For each feature, create a two-column file of class labels and feature
   bin letters in RDB format.

   Example:  see Grantham-feature-pairs.rdb
             
             The first two rows of the file are column names and 
             column definitions that conform to RDB format.

             The first column is a list of class labels for the examples in
             your data set.  In the coding nsSNP setting, we use two class 
             labels: N (no effect) and E (effect), but the choice of letters 
             is not important.  You can use as many classes and whatever 
             letters you like to describe them.

             The second column is a list of feature labels.  
             You can subdivide your data into bins according to your own 
             criteria, use as many bins as you like, and assign bin letters 
             of your own choosing.
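
   As a rough illustration, a two-column file of this shape could be
   written as follows.  The sketch assumes tab-separated columns and uses
   "1S" (a one-character string) as the column definition; check both
   against the RDB documentation and the example files before relying on
   them.  The function name write_rdb and the data rows are made up.

```python
import io


def write_rdb(handle, col1, col2, rows, coldef="1S"):
    """Write a two-column table: names row, definitions row, data rows.

    Assumption: fields are tab-separated and "1S" is an acceptable
    column definition for single-letter labels.
    """
    handle.write(f"{col1}\t{col2}\n")
    handle.write(f"{coldef}\t{coldef}\n")
    for first, second in rows:
        handle.write(f"{first}\t{second}\n")


if __name__ == "__main__":
    # Made-up (class label, feature bin) pairs, for illustration only.
    examples = [("N", "A"), ("E", "D"), ("E", "E"), ("N", "B")]
    buffer = io.StringIO()
    write_rdb(buffer, "LABEL", "FEATURE", examples)
    print(buffer.getvalue(), end="")
```

   To write to disk instead, pass a file opened for writing in place of
   the StringIO buffer.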

    Create a two-column file for each pair of features in RDB format.  The
    format is exactly the same as the two-column file of classes and features.

    Example: see ASA_WT-Grantham-feature-pairs.rdb

             Here we have binned the solvent accessible surface area (in
             units of square Angstroms) of each wild-type amino-acid residue 
             in the lac repressor/lysozyme data set, and assigned letters 
             "A,B,C,D,E" to the bins.  

             Bin "A"   0 <= x < 40
             Bin "B"   40 <= x < 70
             Bin "C"   70 <= x < 125
             Bin "D"   125 <= x < 175
             Bin "E"   x >= 175


            So, the first data row describes an example in which
            the wild-type amino acid has solvent accessibility >= 175 A^2
            (bin "E") and a Grantham value of at least 102 and less than
            132 (bin "D").


3. Edit the mutual-info-2D Perl script.

The script uses two modules downloadable from CPAN (www.cpan.org): Algorithm
and Statistics.  You must install these on your system.

You also need to set the perl library path to point to where Algorithm.pm,
Statistics.pm, and RDBtable.pm are installed on your system.

4. For each RDB file you create, run the mutual-info-2D script.

Example usage:

mutual-info-2D -inputfile Grantham-feature-pairs.rdb -colname1 LABEL \
    -colname2 FEATURE -numpermutes 1000

The script will compute the mutual information between the letters in the LABEL
column and the letters in the FEATURE column of Grantham-feature-pairs.rdb.

Note that the "colname" parameters must match the column headings in your rdb
file.

Another example usage:

mutual-info-2D -inputfile ASA_WT-Grantham-feature-pairs.rdb -colname1 ASA_WT \
    -colname2 Grantham -numpermutes 1000


The script will compute the mutual information between the letters in the 
ASA_WT column and the letters in the Grantham column of 
ASA_WT-Grantham-feature-pairs.rdb.

Mutual information is symmetric, so it does not matter which feature you put in
the first column.

To correct for small-sample effects, the label-feature (or feature-feature)
pairs are repeatedly scrambled and the mutual information of each scrambled
data set is computed.  The number of scramblings is set by the -numpermutes
parameter.  The mean of this random mutual information distribution is then
subtracted from the mutual information of the unscrambled data.
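
The correction can be sketched in Python as follows.  This is an
independent illustration of the idea, not the code in mutual-info-2D:
the function names, the use of base-2 logarithms (bits), and the fixed
random seed are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits, an assumption)."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def excess_mi(xs, ys, numpermutes=1000, seed=0):
    """Return (observed, expected, excess) mutual information.

    "expected" is the mean mutual information over numpermutes random
    scramblings of the second column; "excess" is observed - expected.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    total = 0.0
    for _ in range(numpermutes):
        rng.shuffle(shuffled)  # breaks the pairing, keeps both marginals
        total += mutual_information(xs, shuffled)
    expected = total / numpermutes
    return observed, expected, observed - expected


if __name__ == "__main__":
    # Made-up labels and bins, for illustration only.
    labels = list("NNNNEEEENNEE")
    bins = list("AABBDDEEABDE")
    print(excess_mi(labels, bins, numpermutes=200))
```

Scrambling only one column is sufficient because it destroys the pairing
between the columns while leaving each column's distribution unchanged.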

Example output of the script is found in the Grantham-MI file and in the
ASA_WT-Grantham-MI file.  The script reports:

    the entropy of the distribution in each column
    the mutual information of the data in the first and second column
      prior to small sample correction ("Observed MI")
    the maximum possible mutual information of the data in the first and 
      second column
    the mean and standard error of the distribution of random mutual 
     information ("Expected MI")
    the corrected estimate of mutual information ("Excess MI").

By default, the output of mutual-info-2D is printed to STDOUT.
You should redirect the output to a file.
Our file naming convention is as follows:

For (class,feature) mutual information

featurename-MI

For (feature,feature) mutual information

featurename-featurename-MI
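
As a small illustration of this convention, the sketch below lists the
output file names expected for a set of candidate features.  The function
names are hypothetical, and the order of the two feature names within a
pair file name is an assumption (here it follows the order of the
candidate list, which reproduces the shipped ASA_WT-Grantham-MI example).

```python
from itertools import combinations


def class_mi_filename(feature):
    """File name for the (class, feature) mutual information output."""
    return f"{feature}-MI"


def pair_mi_filename(feature1, feature2):
    """File name for the (feature, feature) mutual information output."""
    return f"{feature1}-{feature2}-MI"


def expected_files(features):
    """Return (class files, pair files) for a list of candidate features."""
    class_files = [class_mi_filename(f) for f in features]
    # Assumption: one file per unordered pair, named in candidate-list order.
    pair_files = [pair_mi_filename(a, b) for a, b in combinations(features, 2)]
    return class_files, pair_files


if __name__ == "__main__":
    print(expected_files(["ASA_WT", "Grantham"]))
```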

5. Prepare a list of candidate features for MIFS to evaluate. The feature
names can be whatever you like, as long as they match the column headings
in your rdb files.


6.  After you have precomputed the mutual information between each feature of
interest and your class labels and between each pair of features, and prepared
the list of candidate features, you are ready to run the MIFS script.
MIFS currently expects your mutual information output files to follow the
naming convention described in step 4.


MIFS usage:

MIFS -feature_class_mi_dir foo -feature_feature_mi_dir bar -nselect 7 \
    -beta 0.5 -objective <1,2> < feature-list

The list of candidate features (feature-list) is read from standard input.

parameters:

-feature_class_mi_dir

Full path to the directory where you stored your feature/class mutual
information output files (from mutual-info-2D script).


-feature_feature_mi_dir

Full path to the directory where you stored your feature/feature mutual
information output files (from mutual-info-2D script).

-nselect

Number of features you wish to select from your feature candidate list.

-beta

MIFS parameter that controls trade-off between two terms in the objective function.

-objective

The two choices are 1 or 2:
1 is the original objective function from Battiti's MIFS paper.
2 is a modified version from the Kwak and Choi paper.

     Citations:

     Kwak N. and Choi C.H. (1999) in: IJCNN, Vol. 2 pp. 1313-1318
     Battiti (1994) IEEE Trans. Neural Networks 5, 537-550
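
For orientation, the greedy selection described in the two papers can be
sketched in Python as below.  This is not the MIFS script itself, and it
may differ from it in detail.  Objective 1 scores a candidate f as
I(C;f) - beta * sum over selected s of I(f;s).  Objective 2 weights each
term of the sum by I(C;s)/H(s), which is the form usually given for the
Kwak and Choi variant; treating that as what -objective 2 computes is an
assumption.  The function name, argument layout and numbers are made up.

```python
def mifs_select(candidates, class_mi, pair_mi, nselect, beta=0.5,
                objective=1, entropy=None):
    """Greedy MIFS-style feature selection (illustrative sketch).

    class_mi: dict mapping feature -> I(C; feature)
    pair_mi:  dict mapping frozenset({f, s}) -> I(f; s)
    entropy:  dict mapping feature -> H(feature); only for objective 2
    """
    if objective not in (1, 2):
        raise ValueError("objective must be 1 or 2")
    if objective == 2 and entropy is None:
        raise ValueError("objective 2 needs feature entropies")

    remaining = list(candidates)
    selected = []

    def score(f):
        # Relevance to the class minus beta times redundancy with the
        # features that have already been selected.
        penalty = 0.0
        for s in selected:
            weight = 1.0 if objective == 1 else class_mi[s] / entropy[s]
            penalty += weight * pair_mi[frozenset((f, s))]
        return class_mi[f] - beta * penalty

    while remaining and len(selected) < nselect:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Made-up mutual information values, for illustration only.
    class_mi = {"a": 0.5, "b": 0.45, "c": 0.2}
    pair_mi = {frozenset("ab"): 0.4, frozenset("ac"): 0.0,
               frozenset("bc"): 0.1}
    print(mifs_select("abc", class_mi, pair_mi, nselect=2, beta=1.0))
```

With beta set to 0 the redundancy term vanishes and features are ranked
by their mutual information with the class labels alone.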


Example output of an MIFS run is found in the file
best-features-example
