|
|
Next: Global vesus local parameters:
Up: Defining Substitution Models
Previous: Defining Substitution Models
Simple Nucleotide Models: modeldefs.bf
One of the primary objectives of Hy-Phy is to free users from relying on
the substitution models chosen by authors of software. While a
relatively small set of model choices may be sufficient for performing
phylogenetic analyses, having only a few potential models is often
limiting for studies of substitution rates and patterns. To define a
model in Hy-Phy, one needs only to describe the elements in a
substitution rate matrix. If the characters being studied have
states, the rate matrix is
. For example, nucleotide
models are
; models of amino acid change are
; codon-based models might be
. Hy-Phy can work
properly with any member of the class of general time reversible
models. Instantaneous rate matrices in this class of models satisfy
the condition
, where
is the
equilibrium frequency of character
(for nucleotide data, ) and
is the
entry in the instantaneous rate matrix. Hy-Phy
comes with many predefined rate matrices for commonly used
substitution models. You can find examples in the batchfiles under the
Examples and TemplateBatchFiles directories of the
Hy-Phy distribution.
To illustrate the basics of model definition, examine the batch file
modeldefs.bf from the Tutorial directory:
SetDialogPrompt("Select a nucleotide data file:");
DataSet myData = ReadDataFile(PROMPT_FOR_FILE);
DataSetFilter myFilter = CreateFilter(myData,1);
HarvestFrequencies(obsFreqs,myFilter,1,1,1);
F81RateMatrix = {{*,m,m,m}{m,*,m,m}{m,m,*,m}{m,m,m,*}};
Model F81 = (F81RateMatrix, obsFreqs);
Tree myTree = ((a,b),c,d);
fprintf(stdout,"\n\n F81 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
fprintf(stdout,"\n\n HKY85 Analysis \n\n");
HKY85RateMatrix = {{*,b,a,b}{b,*,b,a}{a,b,*,b}{b,a,b,*}};
Model HKY85 = (HKY85RateMatrix, obsFreqs);
Tree myTree = ((a,b),c,d);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
This batch file illustrates two new concepts. First, and most
importantly, the lines
HKY85RateMatrix = {{*,b,a,b}{b,*,b,a}{a,b,*,b}{b,a,b,*}};
Model HKY85 = (HKY85RateMatrix, obsFreqs);
illustrate the definition of a new substitution matrix. In this case,
we have defined the model of Hasegawa, Kishino, and Yano (1985) and
named the model HKY85. If you are familiar with the HKY85 model, you
will probably recognize the form of the matrix: transitions occur with
rate a and transversions occur with rate b, with each of those values
multiplied by the appropriate nucleotide frequency. The second
important point to note is that we must associate the model with a
tree before we can do anything useful. In this case, we simply
redefined the old tree to use the HKY85 model instead of the F81 model
(Recall that a tree consists of both the topology and the substitution
matrices attached to its branches). When the statement Tree
myTree = ((a,b),c,d); is issued, the variable myTree is
assigned the topology ((a,b),c,d) and the branches are assigned the
HKY85 substitution model, which was the most recently defined
Model. If we wanted to preserve the original variable
myTree, we could simply have defined a new Tree
structure using a command like Tree myNextTree = ((a,b),c,d);
Finally, for completeness, we created a new Tree and assigned
it the F81 model and reproduced the original F81 analysis. These
steps simply illustrate how predefined Models can be assigned
to Trees using the UseModel command.
UseModel(F81);
Tree myTree = ((a,b),c,d);
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
One of the most general models of nucleotide substitution is the
general time reversible model (REV). The instantaneous rate matrix for
the REV model is
It is simple to implement this model in Hy-Phy
REVRateMatrix = {{*,a,b,c}{a,*,d,e}{b,d,*,f}{c,e,f,*}};
Model REV = (REVRateMatrix, obsFreq);
does the job.
To illustrate these notions in a more usefule context, consider the
batchfile models.bf:
SetDialogPrompt("Select a nucleotide data file:");
DataSet myData = ReadDataFile(PROMPT_FOR_FILE);
DataSetFilter myFilter = CreateFilter(myData,1);
HarvestFrequencies(obsFreqs,myFilter,1,1,1);
equalFreqs = {{0.25}{0.25}{0.25}{0.25}};
myTopology = "((a,b),c,og)";
F81RateMatrix = {{*,m,m,m}{m,*,m,m}{m,m,*,m}{m,m,m,*}};
Model F81 = (F81RateMatrix, obsFreqs);
HKY85RateMatrix = {{*,b,a,b}{b,*,b,a}{a,b,*,b}{b,a,b,*}};
Model HKY85 = (HKY85RateMatrix, obsFreqs);
REVRateMatrix = {{*,a,b,c}{a,*,d,e}{b,d,*,f}{c,e,f,*}};
Model REV = (REVRateMatrix, obsFreqs);
Model JC69 = (F81RateMatrix, equalFreqs);
Model K2P = (HKY85RateMatrix, equalFreqs);
UseModel(JC69);
Tree myTree = myTopology;
fprintf(stdout,"\n\n JC69 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
UseModel(F81);
Tree myTree = myTopology;
fprintf(stdout,"\n\n F81 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
UseModel(K2P);
Tree myTree = myTopology;
fprintf(stdout,"\n\n K2P Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
UseModel(HKY85);
Tree myTree = myTopology;
fprintf(stdout,"\n\n HKY85 Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
UseModel(REV);
Tree myTree = myTopology;
fprintf(stdout,"\n\n REV Analysis \n\n");
LikelihoodFunction theLikFun = (myFilter, myTree);
Optimize(results,theLikFun);
fprintf(stdout,theLikFun);
If you understand the components of models.bf, then you are
ready to define substitution models, attach them to tree topologies,
and find maximum likelihood estimates. models.bf also
demonstrates a few useful Hy-Phy features. First, notice the definition
of the simple vector (0.25,0.25,0.25,0.25). In a similar manner, we
define the string constant myTopology. By changing the
topology in the definition of myTopology, the entire analysis
can be repeated using the new topology. This single step is faster
than updating the topology for every Tree statement. Finally,
note the reuse of the three substitution matrices and the two
frequency vectors. The original matrix definitions are used as
templates by the Model statements.
Next: Global vesus local parameters:
Up: Defining Substitution Models
Previous: Defining Substitution Models
Spencer Muse
2000-05-31
|
|