YaBB - Yet another Bulletin Board
 
 
Unrestricted/non reversible models
Sergei
YaBB Administrator
Datamonkeys are forever...
Posts: 1658
UCSD
Re: Unrestricted/non reversible models
Reply #15 - Jun 22nd, 2009 at 5:38pm
 
Dear Marissa,

Could you also append the files you've been using? I'll run the script and see what happens and also if I can spot any issues.

Sergei
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
Marissa
YaBB Newbies
Feed your monkey!
Posts: 6
Re: Unrestricted/non reversible models
Reply #16 - Jun 24th, 2009 at 10:20am
 
Hi Sergei,

The following code and test data sets I made might help clarify my question (both are slightly different from what I described in my last post).

The FASTA files 1.fa and 2.fa contain ancestor-descendant alignments. The A, C, G, and T frequencies in the ancestors are each 25%. In 1.fa, the probability of an ancestral base A mutating to a C in the descendant is 0.005, and all 11 other substitution types have the same probability:

AC:0.005
AG:0.005
AT:0.005
CA:0.005
CG:0.005
CT:0.005
GA:0.005
GC:0.005
GT:0.005
TA:0.005
TC:0.005
TG:0.005

In 2.fa all probabilities are the same as in 1.fa, except that the probability of an ancestral C mutating to a T in the descendant is now 0.01:

AC:0.005
AG:0.005
AT:0.005
CA:0.005
CG:0.005
CT:0.01
GA:0.005
GC:0.005
GT:0.005
TA:0.005
TC:0.005
TG:0.005
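For what it's worth, data of this kind can be mocked up in a few lines. The sketch below (plain Python with illustrative names and sizes, not the HyPhy code I attached) builds the per-branch substitution probability matrix from the lists above and simulates one ancestor-descendant pair:

```python
import random

BASES = "ACGT"

def branch_matrix(ct=0.005, other=0.005):
    """Per-branch substitution probabilities; rows index the ancestral
    base, columns the descendant base (A, C, G, T order). Off-diagonal
    entries come from the lists above; diagonals make each row sum to 1."""
    P = [[other] * 4 for _ in range(4)]
    P[1][3] = ct  # C -> T (0.01 for the 2.fa-style data set)
    for i in range(4):
        P[i][i] = 1.0 - sum(P[i][j] for j in range(4) if j != i)
    return P

def simulate(n_sites, P, rng):
    """Return (ancestor, descendant) sequences of length n_sites,
    with uniform 25% ancestral base frequencies."""
    anc = "".join(rng.choice(BASES) for _ in range(n_sites))
    dec = "".join(
        rng.choices(BASES, weights=P[BASES.index(a)])[0] for a in anc
    )
    return anc, dec

rng = random.Random(1)
anc, dec = simulate(200_000, branch_matrix(ct=0.01), rng)  # like 2.fa
ct = sum(a == "C" and d == "T" for a, d in zip(anc, dec))
ca = sum(a == "C" and d == "A" for a, d in zip(anc, dec))
# With ct=0.01 vs 0.005 elsewhere, C->T counts should run roughly
# twice the C->A counts, up to sampling noise.
print(ct, ca)
```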

I've run the attached HyPhy code on this. It tries to do several things. One is to run a likelihood ratio test, as you suggested, to see whether the CT rate differs significantly between the two data sets.

In addition to this, I'd like to have some way to compare the magnitude of the difference in CT rates. In my mock data sets, the probability of C changing to T differs by a factor of 2. Since these probabilities are small, I guess the instantaneous rates should also differ by something close to this. (If it helps we can assume that the time over which 1.fa and 2.fa have evolved is the same--this assumption will also be true of the real dataset I want to work on).

But I wasn't quite sure how to get those instantaneous rates. I gather the NRM matrix isn't quite it. In my code I took each entry of NRM and divided it by the frequency of the target base.

My code produces an output table like this

NRM matrix / freqs

Sub     1.fa      2.fa      1.fa/2.fa
A->C    4.58358   4.25391   1.0775
A->G    4.00145   3.99932   1.00053
A->T    4.67443   3.45216   1.35406
C->A    4.44949   3.49971   1.27139
C->G    4.68142   3.44173   1.36019
C->T    4.53659   6.4503    0.703314
G->A    4.36394   3.91865   1.11363
G->C    4.4771    4.32558   1.03503
G->T    4.4755    3.50581   1.2766
T->A    4.44616   4.28324   1.03804
T->C    4.60668   4.81588   0.95656
T->G    4.63384   4.30495   1.0764


The C->T ratio between the two data sets does differ, but it isn't close to 0.5, which is roughly what I was expecting. There's also a fair amount of variation in the other ratios, which should all be near 1. (Making the test data sets larger, with the same underlying probabilities, didn't change this.) Am I comparing the wrong thing, or doing something else wrong?

Thanks again for all your help!

--Marissa
 
[Attachment: 5 KB, available to registered members]
 
 
Sergei
Re: Unrestricted/non reversible models
Reply #17 - Jul 9th, 2009 at 9:15am
 
Hi Marissa,

The code is doing something funny (there may be a bug in HyPhy with regard to dynamically computed frequencies), but there is also the following issue of interpretation.

Traditional phylogenetic methods assume that the base composition of the sequences has reached equilibrium. Each substitution process (encoded by its rate matrix), if run long enough, will attain equilibrium. For example, your simulated rate matrix for dataset 2:
Q=
{
{            -0.015,             0.005,             0.005,             0.005}
{             0.005,             -0.02,             0.005,              0.01}
{             0.005,             0.005,            -0.015,             0.005}
{             0.005,             0.005,             0.005,            -0.015}
}

has the equilibrium frequency distribution {{0.25, 0.2, 0.25, 0.3}} (in A, C, G, T order), which simply reflects the intuitive fact that if you start at equal frequencies, the process will redistribute C's to T's until equilibrium is reached.
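That equilibrium vector is easy to verify numerically. A plain-Python sketch (the matrix Q is copied from above, rows and columns in A, C, G, T order; the iteration scheme is just one convenient way to find the stationary distribution):

```python
# Rate matrix for dataset 2 (rows/columns in A, C, G, T order).
Q = [
    [-0.015,  0.005,  0.005,  0.005],
    [ 0.005, -0.020,  0.005,  0.010],
    [ 0.005,  0.005, -0.015,  0.005],
    [ 0.005,  0.005,  0.005, -0.015],
]

def equilibrium(Q, dt=1.0, steps=20_000):
    """Power-iterate pi <- pi (I + Q*dt) until pi Q = 0. Because the
    entries of Q are small, I + Q*dt is a valid transition matrix for
    dt = 1, and the iteration converges to the stationary distribution."""
    pi = [0.25] * 4
    for _ in range(steps):
        pi = [sum(pi[i] * ((i == j) + Q[i][j] * dt) for i in range(4))
              for j in range(4)]
    return pi

pi = equilibrium(Q)
print([round(p, 4) for p in pi])  # -> [0.25, 0.2, 0.25, 0.3]
```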

The sequences you simulated are far from this equilibrium distribution (their base frequencies are very close to 0.25 each), so the assumption of the model is violated and you get 'biased' rate estimates (e.g. CT = 0.01/0.3, TC = 0.005/0.2, hence CT/TC = 4/3 instead of the 2 you expected). This comes from how the term 'rate' is interpreted: it is the cumulative substitution rate with the equilibrium base frequency factored out, so that rates can be compared directly between data sets with different base frequencies.
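Spelling out that arithmetic (plain Python; the per-branch probabilities come from the 2.fa lists and the equilibrium frequencies from the Q matrix above, with the rate defined as the probability divided by the equilibrium frequency of the target base):

```python
# Per-branch substitution probabilities simulated for 2.fa.
p_ct, p_tc = 0.01, 0.005

# Equilibrium frequencies of the target bases T and C under Q
# (pi = {0.25, 0.2, 0.25, 0.3} in A, C, G, T order).
pi_t, pi_c = 0.3, 0.2

# 'Rate' = cumulative substitution probability with the target's
# equilibrium frequency factored out.
rate_ct = p_ct / pi_t
rate_tc = p_tc / pi_c

print(rate_ct / rate_tc)  # 4/3 rather than the naive 0.01/0.005 = 2
```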

I hope that makes sense,
Sergei
 