HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
Theoretical questions >> Sequence Analysis >> model parameters and branch lengths
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1207111830

Message started by eliot on Apr 1st, 2008 at 9:50pm

Title: model parameters and branch lengths
Post by eliot on Apr 1st, 2008 at 9:50pm
hello,
                                                                               
I'm wondering about the relationship between the parameters in a model (say the F81 model) and branch lengths.
                                                                               
Take for example, the file relrate.bf. If I print out the parameter myTree.b.mu (in the unconstrained case), it is different from the branch length (in the tree that's calculated by HYPHY) by a fixed ratio. This ratio seems to be the same for different branches. What is the meaning of this ratio? What is myTree.b.mu? I thought it was the mu in the F81 rate matrix. But if this were the case, it shouldn't always have the same proportion with branch length, right? Or am I thinking about this wrong?
                                                                               
If we had a model like HKY85, we would have myTree.b.transversion and myTree.b.transition. What would be the relationship between these and the branch length?
                                                                               
thanks!

Title: Re: model parameters and branch lengths
Post by artpoon on Apr 2nd, 2008 at 1:22pm
Hello Eliot,

Branch lengths are measured in units of expected number of substitutions.  The parameter mu is different from the branch length reported in HyPhy because the latter is scaled by the expected number of substitutions, calculated by summing the rates of all possible substitutions (all parameters in the off-diagonal of the rate matrix).

In order to see this, try the following example code:


Code (]
DataSet            nucleotideSequences = ReadDataFile ("data/3.seq");
DataSetFilter      filteredData = CreateFilter (nucleotideSequences,1);

HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);

F81RateMatrix =

           [*,mu,mu,mu}
            {mu,*,mu,mu}
            {mu,mu,*,mu}
            {mu,mu,mu,*];

Model      F81 = (F81RateMatrix, observedFreqs);


ALLOW_SEQUENCE_MISMATCH = 1;        

Tree      threeTaxaTree = (a,b,c);

LikelihoodFunction  theLnLik = (filteredData, threeTaxaTree);

Optimize (paramValues, theLnLik);
unconstrainedLnLik = paramValues[1):

[0];

fprintf  (stdout, "\n0).UNCONSTRAINED MODEL:", theLnLik);
fprintf (stdout, "\n", observedFreqs, "\n");

/*_____________________________________________________________________ */
function computeScalingFactorB(rateMatrix, baseFreqs)
{
     B = 0;
     for (n1 = 0; n1 < Rows(rateMatrix); n1 = n1+1)
     {
           for (n2 = 0; n2 < Columns(rateMatrix); n2 = n2+1)
           {
                 if (n2 != n1)
                 {
                       B = B + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];
                 }
           }
     }
     return B;
}

fprintf (stdout, "scaling factor = ", computeScalingFactorB (F81RateMatrix, observedFreqs), "\n");


Hope this helps,
- Art.

Title: Re: model parameters and branch lengths
Post by eliot on Apr 3rd, 2008 at 12:47pm
hi Art,

That does help. I can see that

threeTaxaTree.a.mu = branch length * mu (i.e. entry in the rate matrix) / B

This makes it seem like there are two parameters which must be estimated on each branch, mu and branch length. Is that right? If so, why does paramValues[1][1] say there are only 3 parameters (ie one per branch)?

thanks again.

Title: Re: model parameters and branch lengths
Post by artpoon on Apr 3rd, 2008 at 3:15pm
Hi eliot,

The three parameters being estimated are the branch lengths, in units of expected number of substitutions.   The local parameter mu is confounded with time, in that it's impossible to estimate mu and (t) separately for each branch.  :-)

Cheers,
- Art.

Title: Re: model parameters and branch lengths
Post by eliot on Apr 3rd, 2008 at 5:26pm
hi Art,

I'm still confused about something here. Ok, so when I run your file on 3.seq I get

Tree threeTaxaTree=(a:0.0850301,b:0.0733855,c:0.165368);
threeTaxaTree.a.mu: 0.116828

scaling factor = 0.165368

You said that the branch length is scaled by the expected number of substitutions, which is the scaling factor. I was expecting this would mean you multiply threeTaxaTree.a.mu by the scaling factor.

But 0.116828*0.165368 obviously is not 0.0850301

If your computeScalingFactorB function had the line

B = B + baseFreqs[n1]*baseFreqs[n2];

instead of

B = B + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];

then B=0.727825 and it works out.


Why does B have the rateMatrix term in it? And what is the interpretation of "my version" of B?

Thanks again for your help,
Eliot

Title: Re: model parameters and branch lengths
Post by artpoon on Apr 3rd, 2008 at 6:07pm
Hi eliot,

Okay, let's forget the scaling factor for one sec --- I'm not sure it's working correctly anyhow.  :-[

The expected number of substitutions on a branch is the number reported when HyPhy spools results from the likelihood function:


Code (]
0).UNCONSTRAINED MODEL:Log Likelihood = -944.001000454328;
Tree threeTaxaTree=(a:0.0850301,b:0.0733855,c:0.165368);
[/code):

So for branch 'c', E[s] = 0.165368.  mu is a local parameter in the F81 model that is estimated for every branch in the tree, and for 'c', mu = 0.227208.   How do we get from mu to E[s]?

E[s] is calculated by taking the negative of the weighted sum of the rate matrix Q, i.e.

E[s] = - (p(A) (-mu p(C) - mu p(G) - mu p(T)) + p(C) (-mu p(A) - mu p(G) - mu p(T) ) + ...


Now our stationary distribution is estimated from observed nucleotide frequencies:
[code]
>observedFreqs
{
{    0.349333333333}
{    0.151111111111}
{    0.214222222222}
{    0.285333333333}
}



Proceeding through the formula for E[s], we get:
[code]
>0.349333333333 * 0.227208119226 * (1-0.349333333333)
0.0516443
>0.151111111111 * 0.227208119226 * (1-0.151111111111)
0.0291455
>0.214222222222 * 0.227208119226 * (1-0.214222222222)
0.0382462
>0.285333333333 * 0.227208119226 * (1-0.285333333333)
0.0463319
>0.0516443 + 0.0291455 + 0.0382462 + 0.0463319
0.165368
[/code]

What this formula for E[s] is doing is taking into account multiple substitutions.  In other words, a nucleotide might not only change from A to G, but subsequently back to A over the evolutionary time represented by the branch.  (Recall that this time is already confounded by, i.e. represented in, mu).

Better?   :)
- Art.

Title: Re: model parameters and branch lengths
Post by eliot on Apr 4th, 2008 at 2:47pm
Yes, that helps. Thanks!

HyPhy message board » Powered by YaBB 2.5.2!
YaBB Forum Software © 2000-2024. All Rights Reserved.