YaBB - Yet another Bulletin Board
 
 
Unrestricted/non reversible models
Sergei
YaBB Administrator
Datamonkeys are forever...
Posts: 1658
UCSD
Re: Unrestricted/non reversible models
Reply #15 - Jun 22nd, 2009 at 5:38pm
 
Dear Marissa,

Could you also append the files you've been using? I'll run the script and see what happens and also if I can spot any issues.

Sergei
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
Marissa
YaBB Newbies
Feed your monkey!
Posts: 6
Re: Unrestricted/non reversible models
Reply #16 - Jun 24th, 2009 at 10:20am
 
Hi Sergei,

The following code and test data sets I made might help clarify my question (both are slightly different from what I described in my last post).

The FASTA files 1.fa and 2.fa contain ancestor-descendant alignments. The A, C, G, and T frequencies in the ancestors are each 25%. In 1.fa, the probability of an ancestral base A mutating to a C in the descendant is 0.005, and all 11 other substitution types have the same probability:

AC:0.005
AG:0.005
AT:0.005
CA:0.005
CG:0.005
CT:0.005
GA:0.005
GC:0.005
GT:0.005
TA:0.005
TC:0.005
TG:0.005

In 2.fa all probabilities are the same as in 1.fa, except that the probability of an ancestral C mutating to a T in the descendant is now 0.01:

AC:0.005
AG:0.005
AT:0.005
CA:0.005
CG:0.005
CT:0.01
GA:0.005
GC:0.005
GT:0.005
TA:0.005
TC:0.005
TG:0.005
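For what it's worth, data of this kind can be mocked up in a few lines. The sketch below (plain Python with illustrative names and sizes, not the HyPhy code I attached) builds the per-branch substitution probability matrix from the lists above and simulates one ancestor-descendant pair:

```python
import random

BASES = "ACGT"

def branch_matrix(ct=0.005, other=0.005):
    """Per-branch substitution probabilities; rows index the ancestral
    base, columns the descendant base (A, C, G, T order). Off-diagonal
    entries come from the lists above; diagonals make each row sum to 1."""
    P = [[other] * 4 for _ in range(4)]
    P[1][3] = ct  # C -> T (0.01 for the 2.fa-style data set)
    for i in range(4):
        P[i][i] = 1.0 - sum(P[i][j] for j in range(4) if j != i)
    return P

def simulate(n_sites, P, rng):
    """Return (ancestor, descendant) sequences of length n_sites,
    with uniform 25% ancestral base frequencies."""
    anc = "".join(rng.choice(BASES) for _ in range(n_sites))
    dec = "".join(
        rng.choices(BASES, weights=P[BASES.index(a)])[0] for a in anc
    )
    return anc, dec

rng = random.Random(1)
anc, dec = simulate(200_000, branch_matrix(ct=0.01), rng)  # like 2.fa
ct = sum(a == "C" and d == "T" for a, d in zip(anc, dec))
ca = sum(a == "C" and d == "A" for a, d in zip(anc, dec))
# With ct=0.01 vs 0.005 elsewhere, C->T counts should run roughly
# twice the C->A counts, up to sampling noise.
print(ct, ca)
```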

I've run the attached HyPhy code on this. It tries to do several things. One is to run a likelihood ratio test, as you suggested, to see whether the CT rate differs significantly between the two data sets.

In addition to this, I'd like to have some way to compare the magnitude of the difference in CT rates. In my mock data sets, the probability of C changing to T differs by a factor of 2. Since these probabilities are small, I guess the instantaneous rates should also differ by something close to this. (If it helps we can assume that the time over which 1.fa and 2.fa have evolved is the same--this assumption will also be true of the real dataset I want to work on).

But I wasn't quite sure how to get those instantaneous rates. I gather the NRM matrix isn't quite it. In my code I took each entry of NRM and divided it by the frequency of the target base.

My code produces an output table like this

NRM matrix / freqs

Sub     1.fa      2.fa      1.fa/2.fa
A->C    4.58358   4.25391   1.0775
A->G    4.00145   3.99932   1.00053
A->T    4.67443   3.45216   1.35406
C->A    4.44949   3.49971   1.27139
C->G    4.68142   3.44173   1.36019
C->T    4.53659   6.4503    0.703314
G->A    4.36394   3.91865   1.11363
G->C    4.4771    4.32558   1.03503
G->T    4.4755    3.50581   1.2766
T->A    4.44616   4.28324   1.03804
T->C    4.60668   4.81588   0.95656
T->G    4.63384   4.30495   1.0764


The C->T ratio between the two data sets does differ, but it isn't close to 0.5, which is roughly what I was expecting. There's also a fair amount of variation in the other ratios, which should all be near 1. (Making the test data sets larger, with the same underlying probabilities, didn't change this.) Am I comparing the wrong thing, or doing something else wrong?

Thanks again for all your help!

--Marissa
 
[Attachment: 5 KB, available to registered members]
 
 
Sergei
Re: Unrestricted/non reversible models
Reply #17 - Jul 9th, 2009 at 9:15am
 
Hi Marissa,

The code is doing something funny (there may be a bug in HyPhy with regard to dynamically computed frequencies), but there is also the following issue of interpretation.

Traditional phylogenetic methods assume that the base composition of the sequences has reached equilibrium. Each substitution process (encoded by its rate matrix), if run long enough, will attain equilibrium. For example, your simulated rate matrix for dataset 2:
Q=
{
{            -0.015,             0.005,             0.005,             0.005}
{             0.005,             -0.02,             0.005,              0.01}
{             0.005,             0.005,            -0.015,             0.005}
{             0.005,             0.005,             0.005,            -0.015}
}

has the equilibrium frequency distribution {{0.25, 0.2, 0.25, 0.3}} (in A, C, G, T order), which simply reflects the intuitive fact that if you start at equal frequencies, the process will redistribute C's to T's until equilibrium is reached.
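That equilibrium vector is easy to verify numerically. A plain-Python sketch (the matrix Q is copied from above, rows and columns in A, C, G, T order; the iteration scheme is just one convenient way to find the stationary distribution):

```python
# Rate matrix for dataset 2 (rows/columns in A, C, G, T order).
Q = [
    [-0.015,  0.005,  0.005,  0.005],
    [ 0.005, -0.020,  0.005,  0.010],
    [ 0.005,  0.005, -0.015,  0.005],
    [ 0.005,  0.005,  0.005, -0.015],
]

def equilibrium(Q, dt=1.0, steps=20_000):
    """Power-iterate pi <- pi (I + Q*dt) until pi Q = 0. Because the
    entries of Q are small, I + Q*dt is a valid transition matrix for
    dt = 1, and the iteration converges to the stationary distribution."""
    pi = [0.25] * 4
    for _ in range(steps):
        pi = [sum(pi[i] * ((i == j) + Q[i][j] * dt) for i in range(4))
              for j in range(4)]
    return pi

pi = equilibrium(Q)
print([round(p, 4) for p in pi])  # -> [0.25, 0.2, 0.25, 0.3]
```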

The sequences you simulated are far from this equilibrium distribution (their base frequencies are very close to 0.25 each), so the assumption of the model is violated and you get 'biased' rate estimates (e.g. CT = 0.01/0.3, TC = 0.005/0.2, hence CT/TC = 4/3 instead of the 2 you expected). This comes from how the term 'rate' is interpreted: it is the cumulative substitution rate with the equilibrium base frequency factored out, so that rates can be compared directly between data sets with different base frequencies.
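Spelling out that arithmetic (plain Python; the per-branch probabilities come from the 2.fa lists and the equilibrium frequencies from the Q matrix above, with the rate defined as the probability divided by the equilibrium frequency of the target base):

```python
# Per-branch substitution probabilities simulated for 2.fa.
p_ct, p_tc = 0.01, 0.005

# Equilibrium frequencies of the target bases T and C under Q
# (pi = {0.25, 0.2, 0.25, 0.3} in A, C, G, T order).
pi_t, pi_c = 0.3, 0.2

# 'Rate' = cumulative substitution probability with the target's
# equilibrium frequency factored out.
rate_ct = p_ct / pi_t
rate_tc = p_tc / pi_c

print(rate_ct / rate_tc)  # 4/3 rather than the naive 0.01/0.005 = 2
```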

I hope that makes sense,
Sergei
 