HyPhy message board - Print Page

HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl Methodology Questions >> How to >> Unrestricted/non reversible models http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1135050643 Message started by Jacek Majewski on Dec 19^th, 2005 at 7:50pm

Title: Unrestricted/non reversible models
Post by Jacek Majewski on Dec 19^th, 2005 at 7:50pm

Hi,

Does HyPhy handle unconstrained, non-time reversible models? I'm trying to implement an 11 rate-parameter model (similar to PAML nucleotide model 8), but so far I've had not luck - the results seem very suspicious. Most likely I'm doing something drastically wrong.

I want the rate parameters to be constant across the phylogeny, so I define my parameters and matrix as below:

ACCEPT_ROOTED_TREES=1;

global a;
global b;
global c;
global d;
global e;
global f;
global g;
global h;
global i;
global j;
global k;
global l;

UNRESTRateMatrix = [*,m*a,m*b,m*c}{m*d,*,m*e,m*f}{m*g,m*h,*,m*i}{m*j,m*l,m*k,*];
Model UNREST = (UNRESTRateMatrix, obsFreqs, 0);

Here are some problems:

1) I don't want to multiply the rate matrix by nucleotide frequencies, since all the parameters can vary independently and I don't need time-reversibility - > I use the 0 option as the third parameter of the "Model". Why do I still need to supply the frequency vector as the second argument? How are those frequencies used? The likelihood, tree, and parameter values change if I use equalFreqs vs. obsFreqs, even though I don't multiply the matrix by the frequency vector.

2. The model, as it is above, seems over-parametriized. I think I should be able to get rid of one of the global rate parameters (a - k). I tried this by setting one of the parameters to 1 (i.e. replacing, say, a by 1) in the matrix definition, but this alters the likelihood and is dependent on which parameter is held constant.

What am I doing wrong and how should I parametrize the rate matrix? Thanks in advance for any help,

Jacek

Title: Re: Unrestricted/non reversible models
Post by Sergei on Dec 19^th, 2005 at 10:06pm

Dear Jacek,

HyPhy doesn't restrict models to be time-reversible (just Markovian). To answer your questions...

In order to compute the likelihood of a given tree, HyPhy needs to know the distrubution of characters at the root of the tree (so that summation over all possible ancestral states in the pruning algorithm can be seeded). This is what obsFreqs is for in the Model statment. It is also improtant to keep in mind that another standard assumption of phylogenetic evolutonary models is stationarity of base frequencies along the entire tree (for reversible and non-reversible models). If your transition matrix for time t is R(t) = exp (Q*t), then stationary distribution pi (a row vector here) satisfies pi * R(t) = pi for all t>=0 (this is the same as stating that pi * Q = 0), i.e. the substitution process does not modify the distribution of characters.

If the distribution of nucleotides you specify for the root is not stationary for Q, then at each internal node (and tips of the tree), the expected frequency of nucleotides is going to be different, depending on the branch length; this may or may not be biologicall sensible, but one should at the very least like to have the expected frequency of nucleotides at the leaves (i.e. for observed sequences) to be the same (or similar) to the proportion observed in these sequences; unless care is taken, a non-stationary process may fail this spectacularly.

Now we are ready to begin adressing your second question. In general, the stationary distribution of the unrestricted model is not going to match your observed frequencies. If you wish this restriction to hold, 3 out of 12 rate parameters have to be constrained (based on the values of obsFreqs). Take a look at the file NRM.bf in TemplateBatchFiles/TemplateModels; it defines a general model (non-reversible) which is forced to have the stationary distribution which matches observed counts. Incidentally, this model is an available option in AnalyzeNucProtData.bf standard analysis (so you can use HyPhy to fit non-reversible models out of the box).

If you attempt to fit an unrestricted model as defined in your example, the estimation procedure will probably attempt to adjust rates in such a way as to approximate the stationary distribution by something close to obsFreq. When you constrain some of the parameters to be '1', you may be breaking the ability of the model to adjust rates and frequencies, while for other parameters the '1' constaint is fine.

An alternative solution is to replace the obsFreq in the defintion of the model by the actual stationary distribution defined via a-k parameters. You can work out what this distribution is by solving

(pi_a, pi_c, pi_g, pi_t) * UNRESTRateMatrix = 0, analytically for pi_{a,c,g,t} (each irreducible discrete state Markov process is guaranteed to have a stationary distribution, so the above equation will have a solution in terms of a-k). In this case you can arbitarily set one of the rates to 1 (it is best to choose a transition, e.g. A<->G or C<->T).

Then you could define equilibFreq = [expression1}{expression2}{expression3}{expression4], using the solutions of the above system and then write,

Model UNREST = (UNRESTRateMatrix, equilibFreq, 0);

Hope this helps. Please don't hesitate to ask for clarification.

Cheers,
Sergei

Title: Re: Unrestricted/non reversible models
Post by Jacek on Dec 20^th, 2005 at 1:08pm

Hi Sergei.

Thanks for the reply. I like the second option - solving the equations analytically for the equilibrium frequencies, since that would allow me to use the unconstrained model and interpret the parameters directly as relative rates of substitutions. BUT, I remember at some point trying to solve the equilibrium distribution for the 12 parameter model, and the solution is painfully long...

So I chose the first suggestion: I used the NRM.mdl definition. However, I seem to run into the same problem that I had before. When I use your original code, which sets AG as 1, I get an answer that for my dataset may differ by as much as 20 LogLikelihood units from setting another parameter, GT = 1 (I made sure that I replaced GT = 1 in both the matrix definition and the global parameter definition for TG).

global TA:=AT+(AC-CA)*r1__+(AG-GA)*r2__;
global TC:=CT+(CA-AC)*r0__+(CG-GC)*r2__;
global TG:=1+(GA-AG)*r0__+(GC-CG)*r1__;

ModelMatrixName =
[*,t*AC,t*AG,t*AT}
{t*CA,*,t*CG,t*CT}
{t*GA,t*GC,*,t*1}
{t*TA,t*TC,t*TG,*];

Fixing most of the other parameters alters the likelihood to some extent.

I also tried this with the GTR model, and I still get differences when using different parameters as a reference (although not as big, about 2 units).

I've redone this many times now, and can't find a problem. Do you have any ideas?

Jacek

Title: Re: Unrestricted/non reversible models
Post by Sergei on Dec 20^th, 2005 at 2:27pm

Dear Jacek,

This sounds (at least for GTR) as a convergence issue, unless you did not remember to add a
[code]
global AG=1;
[/code]
statement before model definition. In this case, each branch will have its own AG rate (you can check this by printing all parameter values after the likelihood optimization; if AG appears as givenTree.nodename.AG a bunch of times, then you have a local parameter), which will yield better likelihoods than the default parameterization.

Check messages.log to see if the optimization routines terminate prematurely (before convergence).

If you data set is fairly small, default optimization settings might be bad. Try adjusting OPTIMIZATION_PRECISION and MAXIMUM_ITERATIONS_PER_VARIABLE parameters in the batch file.

Also, if you pick a rate which is very small as the reference, other relative rates may be very large (and difficult to estimate well).

Lastly, there is a way to have HyPhy deduce correct stationary distribution for a model run-time (so you don't need to solve analytically). I can give an example of how, if you want.

Hope this helps,
Sergei

Title: Re: Unrestricted/non reversible models
Post by Jacek Majewski on Dec 21^st, 2005 at 7:27am

Thanks!!!

I'm glad I asked instead of banging my head against the wall. The MAXIMUM_ITERATIONS_PER_VARIABLE was indeed the problem. Now everything converges fine. I would very much appreciate and example of how to:

"deduce correct stationary distribution for a model run-time (so you don't need to solve analytically)".

HyPhy seems like a great tool, but expectedly with all its flexibility it takes a while to figure out. I may be bothering you again...

Jacek

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jan 4^th, 2006 at 5:34pm

Dear Jacek,

wrote on Dec 21^st, 2005 at 7:27am:

"deduce correct stationary distribution for a model run-time (so you don't need to solve analytically)".

Sorry I didn't get back to you sooner (holidays intervened).

At any rate, here's the code for such a model

Code (]
/* This file defines the transition matrix for the General Non-Reversible model without expicit frequency parameterization
The file should be used as follows:

01/04/2006 by Sergei L. Kosakovsky Pond
*/

global AC = 1;
global AT = 1;
global CA = 1;
global CG = 1;
global CT = 1;
global GA = 1;
global GC = 1;
global GT = 1;
global TA = 1;
global TC = 1;
global TG = 1;

function PopulateModelMatrix (ModelMatrixName&)
{
/* All the global rate parameters are defined relative to
the rate for A->G. For instance, CG represents the ratio
of the rates C->G/A->G. These parameters are also going
to be confounded with character frequency parameters */

ModelMatrixName =
[*,t*AC,t,t*AT}
{t*CA,*,t*CG,t*CT}
{t*GA,t*GC,*,t*GT}
{t*TA,t*TC,t*TG,*];

return 0;
}

NRM = 0;
MULTIPLY_BY_FREQS = PopulateModelMatrix ("NRM");

dynamicFreqs = {4,1};

vectorOfFrequencies = {4,1};
vectorOfFrequencies [0):

:= computeEQF (AC,AT,CG,CT,GT,CA,GA,TA,GC,TC,TG);
vectorOfFrequencies [1] := dynamicFreqs[1];
vectorOfFrequencies [2] := dynamicFreqs[2];
vectorOfFrequencies [3] := dynamicFreqs[3];

Model GRMModel = (NRM, vectorOfFrequencies, MULTIPLY_BY_FREQS);
FREQUENCY_SENSITIVE = 0;

/*--------------------------------------------------------*/

ffunction computeEQF (dummy1,dummy2,dummy3,dummy4,dummy5,dummy6,dummy7,dummy8,dummy9,dummy10,dummy11)
{
t = 1;
tempMatrix = NRM;
for (k=0; k<4; k=k+1)
{
tempMatrix[k][3] = 1;
}
tempMatrix = Inverse(tempMatrix);
for (k=0; k<4; k=k+1)
{
dynamicFreqs[k] = tempMatrix[3][k];
}
return dynamicFreqs[0];
}

.

The trick (pointed out to me by Raazesh Sainudiin) is to notice that if you take the rate matrix for the Markov process (Q), set all the entries of the last column to '1' and then invert it, the equilibrium frequencies of the process will constitute the last row of the inverse.

Cheers,
Sergei

Title: Re: Unrestricted/non reversible models
Post by Marissa on Jun 2^nd, 2009 at 9:59am

Hello,

In the code that is provided here, is the output the substitution rates? If not, how can I calculate them?

Thanks,

Marissa

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jun 3^rd, 2009 at 10:28pm

Dear Marissa,

If you are thinking in terms of standard (e.g. GTR or HKY85) substitution rates, you need to factor out the equilibrium frequencies, by diving each entry of the estimated matrix (i,j) by the j-th entry of the estimated equilibrium frequency vector.

Code (]
t=1;
MLE_rate_matrix = NRM;
MLE_base_freqs = vectorOfFrequencies;

for (r = 0; r < 16; r=r+1)
{
MLE_rate_matrix[r):

= MLE_rate_matrix[r]/MLE_base_freqs[r%4];
}

Cheers,
Sergei

Title: Re: Unrestricted/non reversible models
Post by Marissa on Jun 11^th, 2009 at 4:11pm

Hello Sergei,

Thank you for the help and the prompt response!

After thinking about this more, I think I was not asking about quite the right thing. To give you an idea about what it is that I am looking for, I will explain what I am doing and ask if you have any insight about what I should be calculating.

I'm trying to compare the nucleotide substitution process in two different data sets. Both of these consist of ancestor descendent alignments where we have an ancestral transposon aligned to the human genome. I would like to be able to compare the rate of one particular substitution (say C->T) in the two data sets.

The processes I'm interested in are not time-reversible, so the model you created in your 2006 code earlier in this thread seems to be the right one to use.

I believe that what I want to compare is the instantaneous rate matrix for the two data sets. However, it seems to me that with the model in your 2006 code, I cannot compare the values I get for C to T substitutions (in the rate matrix) between two different data sets, because everything is relative to A to G substitutions. Is that correct?

Is the instantaneous rate matrix what I want? And if so, is there a way to compare it between the two datasets?

Thanks,

Marissa

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jun 12^th, 2009 at 1:01pm

Dear Marissa,

For any evolutionary model (reversible or not), we cannot independently estimate rates and times (all you get is evolutionary distance, which is of course the product of the two). It is therefore very difficult to compare absolute rates if you don't have any calibration information (i.e. a way to measure evolutionary time).

The best you can do without this information is compare relative rates (e.g. does one sample accumulate C->T faster, PER UNIT TIME, than the other?). For that purpose, the actual parameterization of the model does not matter very much -- all you would do is fit the NonREV model to both alignments separately to get the alternative (C->T are different), and then fit a (more complicated to set up) model where the fitting is done jointly, but C->T (i.e. CT/dynamicFreq[4]) are constrained to be the same between two alignments.

If you need pointers, I can help you modify the code to fit the null model.

Cheers,
Sergei

Title: Re: Unrestricted/non reversible models
Post by Marissa on Jun 15^th, 2009 at 11:57am

Dear Sergei,

For our two data sets, we know that the evolutionary time is the same for the two data sets. Given that information, is there some way to compare the rates?

Also, here is an attempt I made at doing what you suggest (which to me sounds like calculating a joint likelihood and performing a log ratio test), modified from your 2006 code. However, I think this code isn't doing quite what I want it to. I would certainly appreciate any pointers!

Code (]
/* Load data 1 */
DataSet seq1 = ReadDataFile ("Data/human/NNNA2.fa");
DataSetFilter filter1 = CreateFilter (seq1, 1);

/* Load data 2 */
DataSet seq2 = ReadDataFile ("Data/human/NNNG2.fa");
DataSetFilter filter2 = CreateFilter (seq2, 1);

/* Define general non-reversible model */
/* This file defines the transition matrix for the General Non-Reversible model
without explicit frequency parameterization.
01/04/2006 by Sergei L. Kosakovsky Pond
*/
global AC = 1;
global AG = 1;
global AT = 1;
global CA = 1;
global CG = 1;
global CT = 1;
global GA = 1;
global GC = 1;
global GT = 1;
global TA = 1;
global TC = 1;
global TG = 1;

function PopulateModelMatrix (ModelMatrixName&)
{
ModelMatrixName =
{{*,t*AC,t,t*AT}
{t*CA,*,t*CG,t*CT}
{t*GA,t*GC,*,t*GT}
{t*TA,t*TC,t*TG,*}};
return 0;
}

/* NRM = Non-Reversible Model */
NRM = 0;
MULTIPLY_BY_FREQS = PopulateModelMatrix ("NRM");

/* Calculate equilibrium frequencies */
dynamicFreqs = {4,1};
vectorOfFrequencies = {4,1};
vectorOfFrequencies [0):

Thank you for all the help!

Marissa

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jun 16^th, 2009 at 8:58pm

Dear Marissa,

All the rate parameters are global, hence they will be shared by both data sets. The idea is to clone the model definition but change all variable names to AC_2, AG_2 (for example).

Finally, because in this parameterization, the rate is actually confounded with the frequencies (ie the parameter CT is a product of the C->T rate times the frequency of the target, T), the constraint you probably want is more along the lines of:

Code (]
CT_2 := CT*dynamicFreqs_2[3):

/dynamicFreqs[3]

Give that a go and let me know if you run into problems.

Sergei

Title: Re: Unrestricted/non reversible models
Post by Marissa on Jun 18^th, 2009 at 3:55pm

Dear Sergei,

I fixed that up and now it's working fine. Thanks!

Now that I know that the instantaneous substitution rate (in my case CT) is larger in one data set than the other, what I would also like to know is the magnitude of the difference between the C->T rates between the data sets, given we know that the evolutionary time is the same for both data sets.

In your 2006 code, it looks like I should be comparing the values in NRM (and in my fixed code, NRM_2 for the second data set). However, I did some testing with modeled data, and the values in NRM seem to fluctuate related to the ACGT content of the sequence even when the underlying substitution rates are the same, but not in any obvious way that I could simply factor out.

My question in, what do the values in NRM represent and what should I be comparing to find out the relative substitution rates between my two data sets?

Thanks in advance,

Marissa

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jun 22^nd, 2009 at 1:19pm

Dear Marissa,

Can you paste in your current code?

Sergei

Title: Re: Unrestricted/non reversible models
Post by Marissa on Jun 22^nd, 2009 at 1:26pm

Here is the code I've been using:

Code (]
/* Load data 1 */
DataSet seq1 = ReadDataFile ("Data/modeled/AT1_s_1.fa");
DataSetFilter filter1 = CreateFilter (seq1, 1);

/* Load data 2 */
DataSet seq2 = ReadDataFile ("Data/human/AT2_s_1.fa");
DataSetFilter filter2 = CreateFilter (seq2, 1);

/* Define general non-reversible model */
/* This file defines the transition matrix for the General Non-Reversible model
without explicit frequency parameterization.
01/04/2006 by Sergei L. Kosakovsky Pond
*/

global AC = 1;
global AG = 1;
global AT = 1;
global CA = 1;
global CG = 1;
global CT = 1;
global GA = 1;
global GC = 1;
global GT = 1;
global TA = 1;
global TC = 1;
global TG = 1;

global AC_2 = 1;
global AG_2 = 1;
global AT_2 = 1;
global CA_2 = 1;
global CG_2 = 1;
global CT_2 = 1;
global GA_2 = 1;
global GC_2 = 1;
global GT_2 = 1;
global TA_2 = 1;
global TC_2 = 1;
global TG_2 = 1;

function PopulateModelMatrix (ModelMatrixName&)
{
ModelMatrixName =
{{*,t*AC,t,t*AT}
{t*CA,*,t*CG,t*CT}
{t*GA,t*GC,*,t*GT}
{t*TA,t*TC,t*TG,*}};
return 0;
}

function PopulateModelMatrix_2 (ModelMatrixName&)
{
ModelMatrixName =
{{*, t_2*AC_2, t_2, t_2*AT_2}
{t_2*CA_2, *, t_2*CG_2, t_2*CT_2}
{t_2*GA_2, t_2*GC_2, *, t_2*GT_2}
{t_2*TA_2, t_2*TC_2, t_2*TG_2, *}};
return 0;
}

/* Create two trees, one for seq1 and one for seq2.
We set the anc branch to 0 by setting t to 0. */

/* NRM = Non-Reversible Model */
NRM = 0;
MULTIPLY_BY_FREQS = PopulateModelMatrix ("NRM");

/* Calculate equilibrium frequencies */
dynamicFreqs = {4,1};
vectorOfFrequencies = {4,1};
vectorOfFrequencies [0):

:= computeEQF (AC,AT,CG,CT,GT,CA,GA,TA,GC,TC,TG);
vectorOfFrequencies [1] := dynamicFreqs[1];
vectorOfFrequencies [2] := dynamicFreqs[2];
vectorOfFrequencies [3] := dynamicFreqs[3];

/* Create the model for the */
Model GRMModel = (NRM, vectorOfFrequencies, MULTIPLY_BY_FREQS);
FREQUENCY_SENSITIVE = 0;
ACCEPT_ROOTED_TREES = 1;

Tree tree1 = (anc,der);
tree1.anc.t := 0;

NRM_2 = 0;
MULTIPLY_BY_FREQS_2 = PopulateModelMatrix_2 ("NRM_2");

dynamicFreqs_2 = {4,1};
vectorOfFrequencies_2 = {4,1};
vectorOfFrequencies_2 [0] := computeEQF_2 (AC_2,AT_2,CG_2,CT_2,GT_2,CA_2,GA_2,TA_2,GC_2,TC_2,TG_2);
vectorOfFrequencies_2 [1] := dynamicFreqs_2[1];
vectorOfFrequencies_2 [2] := dynamicFreqs_2[2];
vectorOfFrequencies_2 [3] := dynamicFreqs_2[3];

Model GRMModel_2 = (NRM_2, vectorOfFrequencies_2, MULTIPLY_BY_FREQS_2);
FREQUENCY_SENSITIVE = 0;
ACCEPT_ROOTED_TREES = 1;

Tree tree2 = (anc,der);
tree2.anc.t_2 := 0;

/*---------- Unconstrained Model ----------*/

LikelihoodFunction theLnLik = (filter1, tree1, filter2, tree2);

Optimize (paramValues, theLnLik);
unconstrLik = paramValues[1][0];
unConstrainedParamCount = paramValues[1][1];
fprintf (stdout, "unconstrained param count: ", unConstrainedParamCount,"\n");

fprintf (stdout, "NRM:\n", NRM, "\n");

/*----------- Constrained Model -----------*/

LikelihoodFunction theLnLik = (filter1, tree1, filter2, tree2);

/* Constraint */
t_2 := t*dynamicFreqs_2[2]/dynamicFreqs[2];

Optimize (paramValues, theLnLik);
constrainedLnLik = paramValues[1][0];
constrainedParamCount = paramValues[1][1];
fprintf (stdout, "constrained param count: ", constrainedParamCount, "\n");

fprintf (stdout, "NRM_2:\n", NRM_2, "\n");

/* Compute and print the log ratio */
LRT = 2 (unconstrLik - constrainedLnLik);
fprintf (stdout, "LRT: ", LRT, "\n");

/*--------------- function to compute equilibrium frequencies ---------------*/

ffunction computeEQF (dummy1,dummy2,dummy3,dummy4,dummy5,dummy6,dummy7,dummy8,dummy9,dummy10,dummy11)
{
t = 1;
tempMatrix = NRM;
for (k=0; k<4; k=k+1)
{
tempMatrix[k][3] = 1;
}
tempMatrix = Inverse(tempMatrix);
for (k=0; k<4; k=k+1)
{
dynamicFreqs[k] = tempMatrix[3][k];
}
return dynamicFreqs[0];
}

ffunction computeEQF_2 (dummy1,dummy2,dummy3,dummy4,dummy5,dummy6,dummy7,dummy8,dummy9,dummy10,dummy11)
{
t_2 = 1;
tempMatrix = NRM_2;
for (k=0; k<4; k=k+1)
{
tempMatrix[k][3] = 1;
}
tempMatrix = Inverse(tempMatrix);
for (k=0; k<4; k=k+1)
{
dynamicFreqs_2[k] = tempMatrix[3][k];
}
return dynamicFreqs_2[0];
}

The two data sets being loaded are modeled data with equal ACGT content (seq1) and double AT content (seq2), where the instantaneous substitution rate to any other base is 1%.

Marissa

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jun 22^nd, 2009 at 5:38pm

Dear Marissa,

Could you also append the files you've been using? I'll run the script and see what happens and also if I can spot any issues.

Sergei

Title: Re: Unrestricted/non reversible models
Post by Marissa on Jun 24^th, 2009 at 10:20am

Hi Sergei,

The following code and test data sets I made might help make my question clearer. (both are a little different from what I was referring to in my last post).

The fasta files 1.fa and 2.fa have ancestor descendent alignments in them. The ACGT freqs in the ancestors are 25%. In 1.fa, the probability of an ancestral base A mutating to a C in the descendent is 0.005. All 11 other types of substitution have the same probability, e.g.:

AC:0.005
AG:0.005
AT:0.005
CA:0.005
CG:0.005
CT:0.005
GA:0.005
GC:0.005
GT:0.005
TA:0.005
TC:0.005
TG:0.005

In 2.fa all probabilities are the same as 1.fa, except the probability of an ancestral C mutating to a T in the descendent is now 0.01

AC:0.005
AG:0.005
AT:0.005
CA:0.005
CG:0.005
CT:0.01
GA:0.005
GC:0.005
GT:0.005
TA:0.005
TC:0.005
TG:0.005

I've run the attached hyphy code on this. I tries to do several things. One is to run a likelihood ratio test as you suggested to see if CT is significantly different in the two.

In addition to this, I'd like to have some way to compare the magnitude of the difference in CT rates. In my mock data sets, the probability of C changing to T differs by a factor of 2. Since these probabilities are small, I guess the instantaneous rates should also differ by something close to this. (If it helps we can assume that the time over which 1.fa and 2.fa have evolved is the same--this assumption will also be true of the real dataset I want to work on).

But I wasn't quite sure how to get those instantaneous rates. I gather NRM isn't quite it. In my code here I took the entry in NRM and divided by the target frequency.

My code produces an output table like this

NRM matrix / freqs
1 2 1/2
A->C 4.58358 4.25391 1.0775
A->G 4.00145 3.99932 1.00053
A->T 4.67443 3.45216 1.35406
C->A 4.44949 3.49971 1.27139
C->G 4.68142 3.44173 1.36019
C->T 4.53659 6.4503 0.703314
G->A 4.36394 3.91865 1.11363
G->C 4.4771 4.32558 1.03503
G->T 4.4755 3.50581 1.2766
T->A 4.44616 4.28324 1.03804
T->C 4.60668 4.81588 0.95656
T->G 4.63384 4.30495 1.0764

The ratio between C->T for the two data sets is different, but it isn't quite 0.5, which is roughly what I was expecting. Also there's a fair amount of variation in the other ratios, which should all be close to 1. (when I made my test data sets larger, but having the same underlying probabilities, this didn't change.) Am I comparing the wrong thing? Doing something else wrong?

Thanks again for all your help!

--Marissa

http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?action=downloadfile;file=mquitt.zip (5 KB | )

Title: Re: Unrestricted/non reversible models
Post by Sergei on Jul 9^th, 2009 at 9:15am

Hi Marissa,

The code is doing something funny (there may be a bug in HyPhy with regards to dynamically computed frequencies), but there is also the following issue of interpretation.

Traditional phylogenetic methods assume that the composition of frequencies in the sequence has reached equilibrium. Each substitution process (encoded by its rate matrix), if run long enough, will attain equilibrium. For example, your simulated rate matrix for dataset 2:
Q=
{
{ -0.015, 0.005, 0.005, 0.005}
{ 0.005, -0.02, 0.005, 0.01}
{ 0.005, 0.005, -0.015, 0.005}
{ 0.005, 0.005, 0.005, -0.015}
}

Has the equilibrium frequency distribution of {{0.25,0.2,0.25,0.3}}, which simply reflects the intuitive understanding that if you start at equal frequencies, the process will redistribute C's to T's until an equilibrium has been reached.

The sequences that you simulated are far from this equilibrium distribution (they are very close to 0.25 each), hence the assumption of the model is violated and you have 'biased' rate estimates (e.g. CT = 0.01/0.3, TC = 0.005/0.2, CT/TC = 4/3 instead of 2 as you expected). This is due to how we interpret the term 'rate': namely it's the cumulative substitution rate, with the base frequency factored out so that we can directly compare the rates between data sets with different base frequencies.

I hope that makes sense,
Sergei