Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Likelihood computation (Read 4698 times)
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Likelihood computation
Sep 4th, 2008 at 5:13am
 
Hi Sergei,

I have been trying to compute the likelihood for a codon model and have found that the answer I get with HyPhy is different from the one I get if I compute the likelihood "by hand"; that is, by multiplying together the transition probbilities and summing over all ancestoral nodes.

I have attached the bf file for a simple (fake) data set (seq.fasta) comprising of 3 taxa with only a single observed codon site which is the same for all taxa. The rate matrices and equilibrium frequencies, which were previously estimated and are contained in the file ParameterEstimates.txt, are allowed to differ between branches as indicated in the specification of the tree topology.

Please could you help me understand why my computed likelihood (as given in ComputeLikelihood.bf) differs from that computed by HyPhy.

Much appreciated!

Miguel
Back to top
 

Likelihood.zip (10 KB | )
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Likelihood computation
Reply #1 - Sep 4th, 2008 at 7:28am
 
Dear Miguel,

HYPHY was unrooting the ((a,b),c) tree to (a,b,c) and losing the model on the interior branch. You need to set a flag to override this behavior. I also show an example of how to use LFCompute to evaluate the LF; since you don't need to Optimize. Finally, because your models have different stationary distros, use the LikelihoodFunction3 construct to tell HyPhy which one to use at the root.

[code]
...
ACCEPT_ROOTED_TREES = 1;
Tree myTree = "((N1{K1},N2{K0})A2{KX},N3{K0})";
LikelihoodFunction3 logL = (CodonData, myTree,rx);
LFCompute(logL,LF_START_COMPUTE);
LFCompute(logL,value);
LFCompute(logL,LF_DONE_COMPUTE);

fprintf(stdout, "Likelihood according to HyPhy: ", Exp(value), "\n");
...
[/code]

Results
[code]
Likelihood according to HyPhy: 0.022623
Likelihood computed "by hand": 0.022623
[/code]

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Re: Likelihood computation
Reply #2 - Sep 6th, 2008 at 7:56am
 
Thanks so much for your quick reply! Your code works with this example, but I can't seem to get it to work with a more complicated example (probably due to an unrelated problem!)...

I have attached another batch file (ComputingLikelihood3.bf) which is essentially an extension of the previous one I sent you. Here I have expressed the EFVs r0, r1 and rx as functions of two variables k0 and k1 as well as a previously estimated EFV (pi.txt). An estimated rate matrix is given (Q.txt) but must be multiplied by the equilibrium frequencies, so you end up with three different rate matrices and EFVs which apply to the different branches of the topology as specified in the Tree statement. The simplified topology consists of 3 taxa with only one observed codon site for each taxon. Taxon N1 has codon AAA and taxa N2 and N3 each have codon GGA.

Given Q, pi, k0 and k1, I have computed the likelihood of the tree. But, if I compute the likelihood a second time by simply "copy-and-paste"-ing the previous code below it, I get a different likelihood value! This is relevant as I intend looping over (k0, k1) pairs and computing the likelihood for each pair, which currently does not work.

Also, neither of the above two likelihoods equal the likelihood that I have computed manually, which is also concerning.

Your help would be greatly appreciated!!

Thanks in advance,
Miguel
Back to top
 
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login (3 KB | )
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Likelihood computation
Reply #3 - Sep 8th, 2008 at 12:44pm
 
Dear Miguel,

Turns out HyPhy was overwriting the Q matrix during LF calculations, because I never tested the code with constant (i.e. no independent parameters of any kind, not even time) Q matrices. This is why you were getting different results for different LF calls.

This has now been fixed and the result is

Code:
First Likelihood according to HyPhy: 8.82249e-07
Second Likelihood according to HyPhy: 8.82249e-07
Likelihood computed "by hand": 8.82249e-07
 

.

Please follow the instructions on Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login to get and compile the latest developmental version that has the fix in place.

Thanks for alerting me to this very serious bug (fortunately none of the standard analyses use constant Q matrices...)

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Re: Likelihood computation
Reply #4 - Nov 18th, 2008 at 1:33am
 
Dear Sergei,

I am trying to implement a GY94 model of codon substitution and cannot understand one of the results I'm getting.

I have attached a simplified version of the problem where most of the parameters (alpha, kappa, mu and t) are fixed. The only parameter (k) which is estimated through optimization of the LF occurs in the EFV.

The frequency r_{i} of codon i is computed as:

[code]
global k = 0.5;
k:>-0.000000000000000000000001; k:<1.00000000000000000000000000001;

r={Rows(pi),1};
        
for(i=0; i<Rows(pi); i=i+1){
   
   if(pi[i][1]==wt){
     /* codon codes for wild AA */
     global r[i][0]:=k*pi__[i__][0]/aa_freq__[wt__][0];
   }
   else{
     /* codon does NOT code for wild AA */
     global r[i][0]:=(1-k)*pi__[i__][0]/(1-aa_freq__[wt__][0]);            
   }
}
[/code]

where pi is the "usual" EFV computed using HarvestFrequencies, aa_freq is a vector containing the frequencies of the amino acids and wt is an index corresponding to the "wild type" amino acid occurring at a given site.

The code I have given you focuses on only a single site in Gag where the only amino acid that is observed across all HIV strains is E (so that wt=16 corresponding to E). In such a case, I would expect the MLE of k to be 1 (or at least very close to it). However, the MLE according to HyPhy is very close to 0 - the exact opposite of what I would expect! This would mean that the columns in Q corresponding to the codons encoding E are all zero vectors - surely this cannot be the case!?

I have played around a lot with this and have discovered that if I make all the branch lengths equal to one, the MLE for k is 1 as expected - I'm not sure why though.

Hope you can help. Thanks a lot!

Miguel






Back to top
 

OptimizeK.zip (66 KB | )
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Likelihood computation
Reply #5 - Nov 18th, 2008 at 11:23am
 
Dear Miguel,

I am getting k = 1 using the current code from the SVN tree. Also, please delete 'global' from 'global r[i][0]' as it is redundant and will be ignored.

Sergei

[code]
-------------------------------------------------------------------------------------
PARAMETER ESTIMATES:
--------------------------------------------------------------------------------------

Likelihood Function's Current Value = -451.496222426653
k=1
mu13.3622==13.3622
kappa9.78031==9.78031
alpha0.653415==0.653415
...
[/code]
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Re: Likelihood computation
Reply #6 - Nov 19th, 2008 at 12:37am
 
This is really weird. I was using the development version you told me to previously in this thread and get:

Code:
Likelihood Function's Current Value = -34.5934741151561
k=0.00383036
mu13.3622==13.3622
kappa9.78031==9.78031
alpha0.653415==0.653415 



I have now downloaded the latest version and get the same answer! What could I be doing wrong?

Miguel
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Likelihood computation
Reply #7 - Nov 19th, 2008 at 7:29am
 
Dear Miguel,

I am sure it's not you that's being wrong. The developmental version is constantly changing - could have been a transient bug. I'll add you example file to the test suite and make sure it gets checked with new builds.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Re: Likelihood computation
Reply #8 - Nov 20th, 2008 at 3:24am
 
Hi Sergei,

I'm sorry for being stupid  Embarrassed , but where can I get hold of the version of HyPhy that you are using to get the correct answer above? I have downloaded the latest hyphy-r298.zip from Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login, but I'm still getting the wrong answer.

Thanks again,

Miguel
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Likelihood computation
Reply #9 - Nov 20th, 2008 at 1:32pm
 
Dear Miguel,

Hmm; on second thought - the inferred value for the original code (close to 0) is probably correct. This is because the observed distribution of codons in position 39 is 99.3% GAG (index 34) and 0.7% codon GAA (index 33). Setting k = 1 in your parameterization forces the equilibrium distribution to be very far for what is observed at the tips (32.9% for GAG and 67.1% for GAA), and any value of k is going to force an about 2:1 ratio on GAA/GAG but also permit intermediate states, which is probably more optimal to correct for the discrepancy between the estimated alignment-wide and site-specific codon usage bias.. You may need to add another parameter for codon usage bias to fit these types of situations.

Sergei

PS Now I am confused why my previous run generated an estimate of 1 for the 'k' parameter; I can't reproduce that either.
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Re: Likelihood computation
Reply #10 - Nov 24th, 2008 at 4:49am
 
Thanks! This makes sense!! Smiley

Perhaps you could also help me understand the following observation: Whenever I set all the branch lengths to a very small number, say 0.01, k goes to zero, but setting all the branch lengths to a larger number, say 1, produces the desired result k=1. I can't quite grasp why this should be the case.

Cheers,
Miguel
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Likelihood computation
Reply #11 - Nov 24th, 2008 at 7:35am
 
Dear Miguel,

This is probably because P{x->y | t} -> 1/61 for larger t, hence the observed distribution of codon frequencies at a site and the stationary distribution of the model are not quite as co-dependant as they are for smaller branch lengths.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged