Navigation Banner
  next up previous
Next: Global vesus local parameters: Up: Defining Substitution Models Previous: Defining Substitution Models


Simple Nucleotide Models: modeldefs.bf

One of the primary objectives of Hy-Phy is to free users from relying on the substitution models chosen by authors of software. While a relatively small set of model choices may be sufficient for performing phylogenetic analyses, having only a few potential models is often limiting for studies of substitution rates and patterns. To define a model in Hy-Phy, one needs only to describe the elements in a substitution rate matrix. If the characters being studied have $ n$ states, the rate matrix is $ n \times n$. For example, nucleotide models are $ 4 \times 4$; models of amino acid change are $ 20 \times
20$; codon-based models might be $ 64 \times 64$. Hy-Phy can work properly with any member of the class of general time reversible models. Instantaneous rate matrices in this class of models satisfy the condition $ \pi_i R_{ij} = \pi_j R_{ji}$, where $ \pi_i$ is the equilibrium frequency of character $ i$ (for nucleotide data, ) and $ R_{ij}$ is the $ ij^{th}$ entry in the instantaneous rate matrix. Hy-Phy comes with many predefined rate matrices for commonly used substitution models. You can find examples in the batchfiles under the Examples and TemplateBatchFiles directories of the Hy-Phy distribution.

To illustrate the basics of model definition, examine the batch file modeldefs.bf from the Tutorial directory:

SetDialogPrompt("Select a nucleotide data file:"); 
DataSet myData  = ReadDataFile(PROMPT_FOR_FILE);
DataSetFilter myFilter = CreateFilter(myData,1);
HarvestFrequencies(obsFreqs,myFilter,1,1,1);
F81RateMatrix  = {{*,m,m,m}{m,*,m,m}{m,m,*,m}{m,m,m,*}};
Model F81 = (F81RateMatrix, obsFreqs);
Tree myTree = ((a,b),c,d);
fprintf(stdout,"\n\n F81 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

fprintf(stdout,"\n\n HKY85 Analysis  \n\n");
HKY85RateMatrix = {{*,b,a,b}{b,*,b,a}{a,b,*,b}{b,a,b,*}};
Model HKY85 = (HKY85RateMatrix, obsFreqs);
Tree myTree = ((a,b),c,d);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
This batch file illustrates two new concepts. First, and most importantly, the lines
HKY85RateMatrix = {{*,b,a,b}{b,*,b,a}{a,b,*,b}{b,a,b,*}};
Model HKY85 = (HKY85RateMatrix, obsFreqs);
illustrate the definition of a new substitution matrix. In this case, we have defined the model of Hasegawa, Kishino, and Yano (1985) and named the model HKY85. If you are familiar with the HKY85 model, you will probably recognize the form of the matrix: transitions occur with rate a and transversions occur with rate b, with each of those values multiplied by the appropriate nucleotide frequency. The second important point to note is that we must associate the model with a tree before we can do anything useful. In this case, we simply redefined the old tree to use the HKY85 model instead of the F81 model (Recall that a tree consists of both the topology and the substitution matrices attached to its branches). When the statement Tree myTree = ((a,b),c,d); is issued, the variable myTree is assigned the topology ((a,b),c,d) and the branches are assigned the HKY85 substitution model, which was the most recently defined Model. If we wanted to preserve the original variable myTree, we could simply have defined a new Tree structure using a command like Tree myNextTree = ((a,b),c,d);

Finally, for completeness, we created a new Tree and assigned it the F81 model and reproduced the original F81 analysis. These steps simply illustrate how predefined Models can be assigned to Trees using the UseModel command.

UseModel(F81);
Tree myTree = ((a,b),c,d);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

One of the most general models of nucleotide substitution is the general time reversible model (REV). The instantaneous rate matrix for the REV model is

$\displaystyle R_{\mbox{\tiny {REV}}} = \bordermatrix{ &\texttt{A} & \texttt{C} ...
...h{\pi_A}& \theta_4 \ensuremath{\pi_C}& \theta_5 \ensuremath{\pi_G}& -\sum \cr }$    

It is simple to implement this model in Hy-Phy

REVRateMatrix  = {{*,a,b,c}{a,*,d,e}{b,d,*,f}{c,e,f,*}};
Model REV = (REVRateMatrix, obsFreq);
does the job.

To illustrate these notions in a more usefule context, consider the batchfile models.bf:

SetDialogPrompt("Select a nucleotide data file:"); 
DataSet myData  = ReadDataFile(PROMPT_FOR_FILE);
DataSetFilter myFilter = CreateFilter(myData,1);
HarvestFrequencies(obsFreqs,myFilter,1,1,1);

equalFreqs = {{0.25}{0.25}{0.25}{0.25}};

myTopology = "((a,b),c,og)";

F81RateMatrix  = {{*,m,m,m}{m,*,m,m}{m,m,*,m}{m,m,m,*}};
Model F81 = (F81RateMatrix, obsFreqs);

HKY85RateMatrix = {{*,b,a,b}{b,*,b,a}{a,b,*,b}{b,a,b,*}};
Model HKY85 = (HKY85RateMatrix, obsFreqs);

REVRateMatrix  = {{*,a,b,c}{a,*,d,e}{b,d,*,f}{c,e,f,*}};
Model REV = (REVRateMatrix, obsFreqs);

Model JC69 = (F81RateMatrix, equalFreqs);

Model K2P = (HKY85RateMatrix, equalFreqs);

UseModel(JC69);
Tree myTree = myTopology;
fprintf(stdout,"\n\n JC69 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

UseModel(F81);
Tree myTree = myTopology;
fprintf(stdout,"\n\n F81 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

UseModel(K2P);
Tree myTree = myTopology;
fprintf(stdout,"\n\n K2P Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

UseModel(HKY85);
Tree myTree = myTopology;
fprintf(stdout,"\n\n HKY85 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

UseModel(REV);
Tree myTree = myTopology;
fprintf(stdout,"\n\n REV Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);

If you understand the components of models.bf, then you are ready to define substitution models, attach them to tree topologies, and find maximum likelihood estimates. models.bf also demonstrates a few useful Hy-Phy features. First, notice the definition of the simple vector (0.25,0.25,0.25,0.25). In a similar manner, we define the string constant myTopology. By changing the topology in the definition of myTopology, the entire analysis can be repeated using the new topology. This single step is faster than updating the topology for every Tree statement. Finally, note the reuse of the three substitution matrices and the two frequency vectors. The original matrix definitions are used as templates by the Model statements.


next up previous
Next: Global vesus local parameters: Up: Defining Substitution Models Previous: Defining Substitution Models
Spencer Muse
2000-05-31
 
Sergei L. Kosakovsky Pond and Spencer V. Muse, 1997-2002