Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
How to simulate data with user branch lengths (Read 1111 times)
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
How to simulate data with user branch lengths
May 18th, 2005 at 10:42am
 
I received this query by e-mail
Quote:
Here is what I tried for HKY.  But I got a bunch of sequences that were all identical.

****

HKYM =  {{*,trvs,trst,trvs}
                {trvs,*,trvs,trst}
                {trst,trvs,*,trvs}
                {trvs,trst,trvs,*}};


eqf = {{0.2}{0.3}{0.2}{0.3}};
Model  HKY = (HKYM,eqf,1);

ACCEPT_ROOTED_TREES = 1;
Tree L_1 = (((T1:0.0728981,T2:0.156032):0.821206,T3:0.829292):0.247948,((T4:0.123786,(T5:0.
0252201,T6:0.0267699):0.0845906):0.646862,((T7:0.420598,T8:0.297769):0.648832,(T
9:0.113452,T10:0.136777):0.684275):0.155985):0.44216);

characters = {{"A","C","G","T"}
             {"1","","",""}};


DataSet sim = Simulate (L_1,eqf,characters,10000,1,"testDataHKY");



Under HKY85 the branch length is defined as a weighted sum of the transversion rate and a transition rate (see below):

BranchLength (trvs, trst) = 2*trst*(f_A*f_G+f_C*f_T) + 2*trsv (f_A*f_C+f_A*f_T+f_C*f_G+f_G*f_T).

ecause a branch length is the function of two parameters, HyPhy can't map a branch length from the tree string to actual values of transition and transversion; thus both are set to '0' (default values) and sequences are generated along a zero-length tree. You could check for this by calling and printing the value of BranchLength(L_1, 0) - 0 is simply the first branch in post-order traversal.

To fix the the problem of branch length->parameter mapping, I have modified your file to

(1). Use HKY85 with a global tranversion/transition ratio (fixed by the user). Thus BranchLength becomes the function of a single parameter (t).
(2). Wrote a function which computes the scaling factor B is BranchLength = B*t. B is a function of R and base frequencies.
(3). Added a loop after the tree has been read to set correct branch lengths.

The file is attached. If you go to other models, you'll need to update the function which computes 'B', but the rest of the analysis should work fine.

Code:
global TvTs = 0.3; /* the global transition/transversion ratio */

HKYM =  {{*,TvTs*t,t,TvTs*t}
	   {TvTs*t,*,TvTs*t,t}
	   {t,TvTs*t,*,TvTs*t}
	   {TvTs*t,t,TvTs*t,*}};


eqf = {{0.2}{0.3}{0.2}{0.3}};


Model  HKY = (HKYM,eqf,1);

ACCEPT_ROOTED_TREES = 1;

t 		 = 1;

/* it is important to set the branch length parameter to 1, so that the rate matrix is evaluated properly for the function,
otherwise the scaling factor will be of by a factor of 1/t */

scalingB = computeScalingFactorB (HKYM, eqf);

Tree L_1 = (((T1:0.0728981,T2:0.156032):0.821206,T3:0.829292):0.247948,((T4:0.123786,(T5:0.0252201,T6:0.0267699):0.0845906):0.646862,((T7:0.420598,T8:0.297769):0.648832,(T9:0.113452,T10:0.136777):0.684275):0.155985):0.44216);

/* after this step, HyPhy will set .t parameters of each branch to the value in the string, e.g L_1.T1.t = 0.0728981.
But the branch length of T1 will then be B*0.0728981. Thus we need to divide the .t of each branch by B to obtain
correct branch lengths. I added the print statments for diagnostics */

fprintf (stdout, "\nBranch scaling factor computed to be ", scalingB,"\n\n");

/* get all branch names and branch lengths*/

branchNames 	= BranchName 	(L_1,-1);
branchLenghts	= BranchLength	(L_1,-1);

for (k = 0; k < Columns (branchNames)-1; k=k+1)
{
	ExecuteCommands ("L_1."+branchNames[k]+".t=L_1."+branchNames[k]+".t/scalingB;");
	fprintf (stdout, "Branch ", branchNames[k],".\n\t Length before scaling: ", Format (branchLenghts[k],10,5),
											   "\n\t Length after scaling : ", Format (BranchLength (L_1,k),10,5), "\n");
}

characters = {{"A","C","G","T"}
		  {"1","","",""}};


DataSet sim = Simulate (L_1,eqf,characters,10000,1,"testDataHKY");


function computeScalingFactorB (rateMatrix, baseFreqs)
{
	B = 0;
	for (n1 = 0; n1 < 4; n1 = n1+1)
	{
		for (n2 = 0; n2 < 4; n2 = n2+1)
		{
			if (n2!=n1)
			{
				B = B + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];
			}
		}
	}
	return B;
}
 



Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged