Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Parameter estimation across multiple trees? (Read 4505 times)
Aidan Budd
YaBB Newbies
*
Offline


Monkey fed...

Posts: 29
Heidelberg, Germany
Gender: male
Parameter estimation across multiple trees?
Feb 22nd, 2008 at 5:04am
 
Hi all,

I'm working with a single binary character, that has been scorred for OTUs present in a set of different trees.

I'd like to set up an analysis where I use the same time-non-reversible model to investigate the evolution of the character across all these trees. Thus, for this character, I'd like to estimate the forward and reverse rate parameters, using an EM estimate of the data across all the trees.

I've been looking through the hyphy documentation, and I see that the batch language command LikelihoodFunction will accept a list of (tree, dataset) pairs.

Am I right that, by submitting the set of tree/datasets I want to use to estimate these parameters across, I can use this function to carry out this analysis?

Any comments/suggestions/info much appreciated!

Aidan
Back to top
 

Aidan Budd&&Computational Biologist&&EMBL Heideberg, Germany
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Parameter estimation across multiple trees?
Reply #1 - Feb 25th, 2008 at 4:32pm
 
Dear Aidan,

The basic idea for you analysis is to set up a non-reversible binary model using something like
(the eqFrequencies frequencies for a binary non-rev model are completely defined by the rates).

Code:
/* read in your data; set up filters etc */

global forwardRate   = 1;
global backwardRate = 1;

binaryQ = {{*, forwardRate*t}
		     {backwardRate*t,*}};

eqFrequencies = {{backwardRate/(forwardRate+ backwardRate), forwardRate/(forwardRate+ backwardRate)}};

Model nonRevBinary = (binaryQ, eqFrequencies,0);

/* set up trees; tree1 etc are simply example identifiers; you can use your own */

Tree tree1 = ...

LikelihoodFunction lf = (myFilter1, tree1, ...);
Optimize (ref,lf);

fprintf (stdout, lf);

 



This should do it. Sufficiently recent versions of HyPhy can read binary data directly, assuming 0 and 1 are the two state characters.
Otherwise you may need to fiddle with datasets commands (BASESET etc, do a search on this board to see some exchanges re: their use)

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Aidan Budd
YaBB Newbies
*
Offline


Monkey fed...

Posts: 29
Heidelberg, Germany
Gender: male
Re: Parameter estimation across multiple trees?
Reply #2 - Feb 26th, 2008 at 8:05am
 
Dear Sergei,

Thanks for your help - I'm really pleased to have come across HyPhy, it looks like it will be able to help me do lots of things I never got around to trying as I thought they looked too hard to code on my own. Now I just need to get it running...  Smiley

So, I have a couple of follow-up questions:

I've been hunting around the hyphy distribution and documentation for tips and examples that deal with binary characters in general using the batch langauge, but haven't had much luck finding things like that (could well be searching with just the wrong terms...). Would be great if you could point me somewhere appropriate (might deal with all the questions I have below already...)

Thus - I've yet to successfully execute the script you sent - I think the problem is that I haven't worked out how to import the data in a way that it is recognised as binary (I tried adding a line at the start
Code:
BASESET = "01";
 


but this didn't fix it)

The code I'm trying to run looks like this (basically your code plus a few extra lines defining tree and data)

Code:
x = ">a 10 >b 10 >c 10 >d 01";

DataSet myData = ReadFromString (x);
DataSetFilter myFilter1 = CreateFilter (myData,1);

global forwardRate   = 1;
global backwardRate  = 1;

binaryQ = {{*, forwardRate*t}
                 {backwardRate*t,*}};

eqFrequencies = {{backwardRate/(forwardRate+ backwardRate), forwardRate/(forwardRate+ backwardRate)}};
Model nonRevBinary = (binaryQ, eqFrequencies,0);

/* set up trees; tree1 etc are simply example identifiers; you can use your own */
Tree myTree1 = (((a:1,b:1):1,c:1):1,d:1);
LikelihoodFunction lf = (myFilter1, myTree1);
Optimize (ref,lf);

fprintf (stdout, lf);

 


Any suggestions on how to fix this? At the moment, when I run it, I get the complaint:

"The dimensions of the equilibrium frequencies vector eqFrequencies (2) doesn't match the number of states in the dataset filter (4)
Current BL Command:Construct the following likelihood function:myFilter1 ; myTree1;


The two other quick questions

- is it possible to import trees from external files using something like ReadDataFile?

- running the osX intel binaries, is it possible to pass batch files to the software and run them on the command line? I've tried things like this with no success

$ /Applications/HyPhy/HyPhy.app/Contents/MacOS/HyPhy < batch_file


Would I  need to compile from source the *nix version to get this functionality?

Lots of questions, I know - as I said before, it looks like the software should be able to do loads of very useful things, so very keen to get it up working. Any help/advice much appreciated

Best wishes

Aidan
Back to top
 

Aidan Budd&&Computational Biologist&&EMBL Heideberg, Germany
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Parameter estimation across multiple trees?
Reply #3 - Feb 27th, 2008 at 2:44pm
 
Dear budd,

Try this code (note the inclusion of \n in the inline FASTA file). Current versions of HyPhy do not need the explicit BASESET instruction to recognize 0,1 binary data.

Code:
x = ">a\n100001\n>b\n100010\n>c\n100000\n>d\n111000";

DataSet myData = ReadFromString (x);
DataSetFilter myFilter1 = CreateFilter (myData,1);

global forwardRate  = 1;
global backwardRate = 1;

binaryQ = {{*, forwardRate*t}
	   {backwardRate*t,*}};

eqFrequencies = {{backwardRate/(forwardRate+ backwardRate), forwardRate/(forwardRate+ backwardRate)}};
Model nonRevBinary = (binaryQ, eqFrequencies,0);

/* set up trees; tree1 etc are simply example identifiers; you can use your own */
Tree myTree1 = (((a:1,b:1):1,c:1):1,d:1);
LikelihoodFunction lf = (myFilter1, myTree1);
Optimize (ref,lf);

fprintf (stdout, lf);
 



To address your other points:

Quote:
is it possible to import trees from external files using something like ReadDataFile?


Yes. You can simple use (#include or ExecuteAFile) the "queryTree.bf" file found in TemplateBatchFiles. Look at the source of that if you want more control.

Quote:
running the osX intel binaries, is it possible to pass batch files to the software and run them on the command line?


You would have to compile the command line version from source. It builds out of the box on OS X however.

Cheers,
Sergei


Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Aidan Budd
YaBB Newbies
*
Offline


Monkey fed...

Posts: 29
Heidelberg, Germany
Gender: male
Re: Parameter estimation across multiple trees?
Reply #4 - Mar 3rd, 2008 at 10:51am
 
Hi again, Sergi,

getting there, slowly

I have a couple more questions - put them all together than a bunch of separate posts, sorry for the long mail.

There are a couple of things I'm trying to do.

Firstly, I'd like to evaluate the likelihood of some binary data for a given topology and branch length dataset, and a given forward and backward rate parameter. The idea here is to sample a range of different forward and reverse rate parameters to generate a plot of the likelihood surface associated with the parameters.

(I've tried running without the $BASESET:01 statement, but it seems to have trouble reading the data in that case - also if I use fasta format).

This seems to work reasonably, although I don't understand why, after setting new values of forwardRate and backwardRate, the branch length parameters for the tree change by some scaling factor. I think this is something to do with "t" in the binary matrix (not clear to me what it's for...). Looks like I'm missing something fundamental here, would be great if you could point me to a part of the HyPhy docs that I could read through to get to grips with this better.

Code:
dss = "$BASESET:01\n#a\n10111\n#b\n10101\n#c\n01100\n#d\n11100";
DataSet myData = ReadFromString(dss);
DataSetFilter myFilter = CreateFilter (myData,1);

ACCEPT_BRANCH_LENGTHS = 1;
ACCEPT_ROOTED_TREES   = 1;

global forwardRate   = 1;
global backwardRate  = 1;

binaryQ = {{*, forwardRate*t}
           {backwardRate*t,*}};

eqFrequencies = {{backwardRate/(forwardRate+ backwardRate), forwardRate/(forwardRate+ backwardRate)}};

Model nonRevBinary = (binaryQ, eqFrequencies,0);
Tree myTree = "(((a:1,b:3):1,c:1):1,d:1);";

LikelihoodFunction lf = (myFilter,myTree);

LIKELIHOOD_FUNCTION_OUTPUT=5;
fprintf (stdout, "\n",lf);
LIKELIHOOD_FUNCTION_OUTPUT=2;
fprintf (stdout,"\n", lf);

forwardRate = 0.9;
backwardRate = 0.4;

LIKELIHOOD_FUNCTION_OUTPUT=5;
fprintf (stdout,"\n", lf);
LIKELIHOOD_FUNCTION_OUTPUT=2;
fprintf (stdout,"\n", lf);
 



Secondly, I'd like to run this analysis, but optimising forward and backward rates - again, this seems to work reasonably, again for fixed topology and branch lengths. (I want to do this to check that the analysis where I explore the values myself finds similar/the same likelihood and parameter values as when the optimisation in HyPhy is run)

Again, this seems to work reasonably well, although I have again this issue with the scaling that I don't get -

Code:
fprintf (stdout,"\n\nOptimise rates but not branch lengths");

dss = "$BASESET:01\n#a\n10111\n#b\n10101\n#c\n01100\n#d\n11100";
DataSet myData = ReadFromString(dss);
DataSetFilter myFilter = CreateFilter (myData,1);

ACCEPT_BRANCH_LENGTHS = 1;
ACCEPT_ROOTED_TREES   = 1;

global forwardRate   = 1;
global backwardRate  = 1;

binaryQ = {{*, forwardRate*t}
           {backwardRate*t,*}};

eqFrequencies = {{backwardRate/(forwardRate+ backwardRate), forwardRate/(forwardRate+ backwardRate)}};

UseModel (USE_NO_MODEL);
Tree dummy = "(((a:1,b:3):1,c:1):1,d:1);";  /* NB it has to be a Tree, not a Topology object */
branchNames  	  = BranchName   (dummy,-1); /* populate a vector of branch names; post order traversal */
branchLengthsNow  = BranchLength (dummy,-1); /* populate a vector of branch lengths; post order traversal */

Model nonRevBinary = (binaryQ, eqFrequencies,0);
Tree myTree = "(((a:1,b:3):1,c:1):1,d:1);";

for (k = 0; k < Columns (branchNames)-1; k=k+1)
{
 	ExecuteCommands ("myTree."+branchNames[k]+".t:="+branchLengthsNow[k]+";");
}  

LikelihoodFunction lf = (myFilter,myTree);
Optimize (res,lf);

LIKELIHOOD_FUNCTION_OUTPUT=5;
fprintf (stdout, "\n",lf);
LIKELIHOOD_FUNCTION_OUTPUT=2;
fprintf (stdout,"\n", lf);
 



Thirdly, I'd like to run the analysis using a second binary model, attached to some of the internal branches (the idea here is to attache this model to internal branches that are spawned after a duplication event - will allowing these two extra parameters [or perhaps just one extra parameter that scales the non-post-duplication-matrix by a constant] provide a significantly better fit to the data using a likelihood ratio test. I think I need to set the equilibrium frequencies for the two models equal - I've tried to do that in this script, but not sure I've done it properly, just wanted to ask how it looked to you

Code:
fprintf (stdout,"\n\nOptimise rates but not branch lengths, use different rate matrices for different branches");

dss = "$BASESET:01\n#a\n10111\n#b\n10101\n#c\n01100\n#d\n11100\n#e\n11100\n#f\n11001";
DataSet myData = ReadFromString(dss);
DataSetFilter myFilter = CreateFilter (myData,1);

ACCEPT_BRANCH_LENGTHS = 1;
ACCEPT_ROOTED_TREES   = 1;


global forwardRateA   = 1;
global backwardRateA  = 1;

binaryQ_A = {{*, forwardRateA*t}
           {backwardRateA*t,*}};

/* I need to constrain the two equilibrium frequencies to be the same, so I get them from just one of the two matrices...*/

eqFrequencies = {{backwardRateA/(forwardRateA+ backwardRateA), forwardRateA/(forwardRateA+ backwardRateA)}};

global forwardRateB   = 1;
global backwardRateB  = 1;

binaryQ_B = {{*, forwardRateB*t}
           {backwardRateB*t,*}};


UseModel (USE_NO_MODEL);
Tree dummy = "((((a:1{nonRevBinaryA},b:1{nonRevBinaryA}):0.2{nonRevBinaryB},(c:1{nonRevBinaryA},d:1{nonRevBinaryA}):0.2{nonRevBinaryB}):1{nonRevBinaryA},e:1.5{nonRevBinaryA}):1{nonRevBinaryA},f:2{nonRevBinaryA});";  /* NB it has to be a Tree, not a Topology object */
branchNames  	  = BranchName   (dummy,-1); /* populate a vector of branch names; post order traversal */
branchLengthsNow  = BranchLength (dummy,-1); /* populate a vector of branch lengths; post order traversal */

Model nonRevBinaryA = (binaryQ_A, eqFrequencies,0);
Model nonRevBinaryB = (binaryQ_B, eqFrequencies,0);



Tree myTree = ""+dummy;

for (k = 0; k < Columns (branchNames)-1; k=k+1)
{
 	ExecuteCommands ("myTree."+branchNames[k]+".t:="+branchLengthsNow[k]+";");
}  

LikelihoodFunction lf = (myFilter,myTree);
Optimize (res,lf);

LIKELIHOOD_FUNCTION_OUTPUT=5;
fprintf (stdout, "\n",lf);
LIKELIHOOD_FUNCTION_OUTPUT=2;
fprintf (stdout,"\n", lf);
 



Any help or feedback much appreciated - and thanks again for all your help so far.

Aidan
Back to top
 

Aidan Budd&&Computational Biologist&&EMBL Heideberg, Germany
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Parameter estimation across multiple trees?
Reply #5 - Mar 3rd, 2008 at 2:34pm
 
Dear Aidan,

Re: BASESET; you need to download the most recent version. I added built-in binary character support in October 2007, so if your version is older than that you need to include BASESET.

The thing about branch length is this: forward and backward rates are shared by all branches in the tree (because of the 'global' term in their declarations), but 't' is a local (automatically created for each branch) time parameter.

A branch length for any branch will look like (take a look at the GettingStarted.pdf for more about branch lengths)

BL (t, pi, F,B) = t*pi_0 * F + t*(1-pi_0) * B

where t is the branch length, F is the forward rate, B is the backward rate and pi is the vector of equilibrium frequencies (which is determined by F and B).

You actually need to do a bit of work to (e.g. see Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login) to fix branch lengths under a given model.

It is probably easiest for me to modify one of your examples to get you started, because HyPhy does not make fixing branch lengths particularly easy (mostly because it is not easy to do under an arbitrary model). Stay tuned.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Aidan Budd
YaBB Newbies
*
Offline


Monkey fed...

Posts: 29
Heidelberg, Germany
Gender: male
Re: Parameter estimation across multiple trees?
Reply #6 - Mar 4th, 2008 at 11:25am
 
Hi Sergei,

Right - got this fixed now, reading your chapter in Rasmus Nielsen's book helped, along with your pointing out the expression for the branch lengths, helped a lot.

Just in case it helps others, I've posted the code at the end of the mail that successfully changes forward and reverse rates while maintaining constant branch lengths.

Just want to check I understand this right -  "Optimize" optimizes forwardRate and reverseRate (but doesn't attempt to optimize "t" due to the constraints that are set)?

Also, would  be great if you could give me a hint why the non-time-reversible models need to include "t" in their Q while time-rev models seem not to need it - not obvious to me why this is.

Thanks

Aidan

Code:
dss = "$BASESET:01\n#a\n10111\n#b\n10101\n#c\n01100\n#d\n11100";
DataSet myData = ReadFromString(dss);
DataSetFilter myFilter = CreateFilter (myData,1);

ACCEPT_BRANCH_LENGTHS = 1;
ACCEPT_ROOTED_TREES   = 1;

global forwardRate   = 1;
global backwardRate  = 1;

t=1;

binaryQ = {{*, forwardRate*t}
           {backwardRate*t,*}};

eqFrequencies = {{backwardRate/(forwardRate+ backwardRate), forwardRate/(forwardRate+ backwardRate)}};

Model nonRevBinary = (binaryQ, eqFrequencies,0);

Tree myTree = "(((a:1,b:3):1,c:1):1,d:1);";
/* get all branch names and branch lengths*/
branchNames  = BranchName  (myTree,-1);
branchLengthsInitial = BranchLength (myTree,-1);


treeString = Format (myTree, 1, 1);
fprintf (stdout, "tree string with initial forward and backward rates\n", treeString, "\n");

forwardRate   = 0.8;
backwardRate  = 1.1;

/* scalingFactor =  ( ( backwardRate/(forwardRate+ backwardRate) ) * forwardRate ) + ( ( forwardRate/(forwardRate+ backwardRate) ) * backwardRate ); which is the same thing, of course, as */
scalingFactor = 2 * ( ( backwardRate * forwardRate ) / (forwardRate + backwardRate) );

LikelihoodFunction lf = (myFilter,myTree);
Optimize (res,lf);

for (k = 0; k < Columns (branchNames)-1; k=k+1)
{
 ExecuteCommands ("myTree."+branchNames[k]+".t="+branchLengthsInitial[k]+"/scalingFactor;");
}

LIKELIHOOD_FUNCTION_OUTPUT=1;

fprintf (stdout, "\n",lf);
fprintf (stdout, "tree string with forwardRate:",forwardRate," backwardRate:",backwardRate,"\n",  treeString, "\n");
 



Back to top
 

Aidan Budd&&Computational Biologist&&EMBL Heideberg, Germany
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Parameter estimation across multiple trees?
Reply #7 - Mar 18th, 2008 at 11:13pm
 
Dear Aidan,

You are correct about Optimize - it will only adjust free parameters. Actually, in the non-rev model you can only estimate the relative rates (i.e. forward/backward ratio). I didn't mention this in the beginning to avoid even more confusion. This is because you can only 'absorb' one of the rates into the 't' parameter, e.g. redefine t' to be forwardRate * t, then the (1,0) entry of the rate matrix will go from backwardRate * t to (backwardRate/forwardRate * t'). Hence:

Code:
global BtoF_ratio   = 1;

t=1;

binaryQ = {{*,t}
	     {BtoF_ratio*t,*}};

...
 



However if you can fix the scale of time (i.e. you are no longer free to absorb rates into it), e.g. by using estimated divergence times (fossil record etc), you can decouple the rates again. Just keep in mind that in the above parameterization t is really defined as time*forwardRate.

This should also help answer your question about why there is no explicit 't' because the rate parameter is a product of time and rate already.

HTH,
Sergei

Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged