Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Pages: 1 2 3 4
Modification of site specific rate estimation (Read 20970 times)
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Modification of site specific rate estimation
Jul 24th, 2007 at 8:29am
 
Hi Sergei,

I am Francesc from NESCent Course. I am still having problems with the modified batch file to estimate site rates (SiteRates2.bf).

Our goal: Estimates rates for each site independently, using the branch lengths of an input tree (BL=time).

When I am running the batch file, I load the nucleotide data file with the tree, then I select a model (K2P), then Global, then the program recognize the tree in the file, and then I get the following message errors:
Quote:
Expression syntax/sematics error.
Current BL Command:BranchLength(refL,referenceTree,-1)
RPN:refL,referenceTree,1,-,BranchLength contains errors.
Current BL Command:BranchLength(refL,referenceTree,-1)
Problem occurred in line:BranchLength(refL,referenceTree,-1)
Current BL Command:BranchLength(refL,referenceTree,-1)


The messages.log continues with:
Quote:
Row and/or column partition is empty. All the data will be used by default.
Overwritten previously defined function:"GaussLaguerreQuardature"
Overwritten previously defined function:"GammaQuadWeight"
Umay has 0 parameters - branch length not assigned
…..


Probably you already have it, in any case, here is the batch file:
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login

Any suggestion to solve this?
Let me know if you need more information.

Thank you very much for your help and time!!!!

Francesc
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #1 - Jul 27th, 2007 at 11:32am
 
Dear Francesc,

Instead of
Code:
BranchLength (a,b,-1);
 



use

Code:
a = BranchLength (b,-1);
 




Also, you should constrain global parameter estimates (e.g. kappa) after you fit the model to the whole dataset (otherwise they are re-estimated for each site).

Insert
Code:
	ExecuteAFile ("Utility/GrabBag.bf");
	fixGlobalParameters ("lf");
 



after the first call to 'Optimize'  (line 81).

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #2 - Jul 30th, 2007 at 4:47pm
 
Thanks Sergei!
I did the changes you told me (except last one) and it works.
There is also a couple of things I would like to comment. Please, let me know if I am wrong.

To constrain the model, I fixed the model to be (K2P with kappa=2) changing this line (#64):
Code:
SelectTemplateModel(filteredData); /*Prompts the user to select a template model */ 


By:
Code:
equalFreqs = {{0.25}{0.25}{0.25}{0.25}};
K2Pkappa2RateMatrix = {{*,b,b*2,b}{b,*,b,b*2}{b*2,b,*,b}{b,b*2,b,*}};
Model K2P = (K2Pkappa2RateMatrix, equalFreqs);
UseModel(K2P); 


One of my worries is the position of UseModel(K2P) since, below in the batch file, there is the following:
      Code:
UseModel(USE_NO_MODEL);
	Tree 	 referenceTree = treeString;
	refL = BranchLength(referenceTree,-1);
	estL = BranchLength(givenTree,-1); 


the estL is an array with each branch length of a tree calculated from the topology inputted and the whole alignment, right?

How is the "USE_NO_MODEL" affecting the estimation of the total length of the tree? Should I change the position of UseModel(K2P)?

I think now I understand better why we should apply the correction factor for each rate. Is this the only way that Hyphy can estimate rates for each site using the branch lengths of the input tree?

So, is optimizing the substitutions of a given tree topology and a model the only way to get a rate estimation for each site?

The approach is different for DNArates or Rate4site: based on a model, the program tries to find a factor X (mutation rate) that multiplies the branch length that maximizes the function (or in other words calculate the different probabilities of nucleotide changes along a branch of known length given a rate). Well, more or less. Our branch lengths are in time units (trees are chronograms) that that approach appears to work fine. The problem is the super high rates that we get with DNArates.

I really appreciate your help,

Francesc
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #3 - Jul 30th, 2007 at 9:45pm
 
Dear Francesc,

The K2P bit looks OK. UseModel does not affect trees that you constructed BEFORE the call (hence UseModel(USE_NO_MODEL); will have no effect on BranchLength(givenTree,-1).

Let me know how this approach compares to DNARates, i.e. if you get rid of the very high rates. You may still get them in HyPhy but that should indicate that a given site is saturated...

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #4 - Aug 1st, 2007 at 9:05am
 
Thanks again Sergei!

Yes. We still have the high rates - or even higher - but not always. There is some high values for DNArates that are not high in Hyphy. So, the number of saturated sites for hyphy is lower.

In any case, as commented before, the approach is different. DNArates is trying to optimize a rate for each site independently using the topology and the branch length of the input tree. The rate is a factor which multiplies the branch length in the model.

In this batch file, we are optimizing the number of substitutions for each site independently, based only on the topology of the input tree. Then, we apply a correction factor which is the tree length of the input tree divided by the optimized tree length based on the whole alignment. Thus, the rate estimation is not "totally independent for each site". if we change the length of the alignment we probably also change the rates.

Here is the batch file (the rates are printed in the log file):
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #5 - Aug 1st, 2007 at 9:31am
 
Dear Francesc,

In the SiteRates2.bf each site rate is estimated independently from all other sites; global branch lengths are not estimated at all, but taken from the input tree. The only reason there exists an Optimize for the entire alignment, is that we may wish to estimate global parameters (kappa etc) from the entire alignment. Since you have hardcoded K2P, with a fixed kappa, there is nothing left to optimize. In essence the approach in SiteRates2.bf should be conceptually identical to DNARates.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #6 - Aug 1st, 2007 at 11:34am
 
Hi again,

then...  I am not sure if I fully understand the correction factor which is applied to each rate.
Code:
	  refL = BranchLength(referenceTree,-1);
	estL = BranchLength(givenTree,-1);

	referenceLength = 0;
	estimatedLength = 0;

	for (i=0; i< Columns(refL); i=i+1)
	{
	   referenceLength = referenceLength + refL[i];
	   estimatedLength = estimatedLength + estL[i];
	}
	scalingFactor = estimatedLength/referenceLength;
 


The scaling factor is a constant value that we apply to each site.

referenceLength is the tree length of the input tree.

what is the estimatedLength? is the estimatedLength the tree length of the tree estimated based on the hardcoded model and the whole alignment?

If we input half of the alignment then this factor will change, won't it?

Thanks!

Best,

Francesc
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #7 - Aug 1st, 2007 at 9:44pm
 
Dear Francesc,

You don't actually need to estimate the global tree for your model; it's simple a left over construct now (for the old code which fitted the tree to the data). You have a reference (chrono) tree, call it R, the global tree, call it T, and the site tree S. Branch lengths are arranged as follows: S = alpha * T; T = beta * R; hence S = alpha*beta*R = gamma * R, i.e. S is directly proportional to the reference tree, and you could estimate 'gamma' directly at each site or as a multiplier to some 'global' beta (as the code does). The 'beta' factor only depends on the base frequencies and the global model rates - for K2P with a fixed kappa, these will be independent of the input alignment; hence 'beta' is independent of the input alignment. We could in fact compute 'beta' directly from the rate matrix, but the code in place now is more compact.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #8 - Aug 2nd, 2007 at 8:40am
 
OK. I got the point.
Rate for each site = S/T if we fix the model and its parameters.

Now, I can change the batch file and simplify the scalingfactor to be   scalingFactor = 1/referenceLength and rid of  the estimation of the global tree for the model.

I hadn't realized until now. I used this approach with PAML. The problem with PAML was that I had to send to PAML site by site.

In any case, I am still seeing differences between this approach and DNArates one. I am not saying it is better. DNArates uses the individual branch length of the input tree to calculate the rate.

How Hyphy or the other ML approaches deals with saturated sites? I mean how do they determine the rate when the likelihood distribution keeps increasing? is there a arbitrary maximum for the rate in the code of the program? depends on the tree length of the input tree?

Thanks again for your patience. I am learning a lot thanks to that Wink
Francesc


Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #9 - Aug 2nd, 2007 at 9:44am
 
Dear Francesc,

You are exactly right - the global tree estimation is now redundant. I am not sure what DNARates does differently - this approach seems to be the most logical given the data:)

I'll be curious to see how different the results are anyway...

HyPhy has an arbitrary numerical infinity (10000) for branch length parameters; a saturated site will probably hit that bound.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #10 - Aug 13th, 2007 at 2:13pm
 
Hi again, Sorry for the delay replying.

I am going to try to explain what DNArates does (I think).

it is calculates the ML score for each position based on the input tree and a model (K2P with kappa =2). So, It is gonna calculate the ML score by summing the probabilities for all possible internal states (in the internal nodes of the tree) using each individual branch length (BL) and a parameter r (rate). The only unknown parameter is r which multiplies the branch length in the model.
For a Junkes-Cantor model the probabilities would be:

(when state i is equal to state j)       Pij = 0.25 + 0.75*e^(r*BL)
(when state i is not equal to state j) Pij = 0.25 – 0.25*e^(r*BL)
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #11 - Aug 13th, 2007 at 3:25pm
 
Dear Francesc,

Based on your description, modified site rate in HyPhy and DNARates should be doing exactly the same thing (at least if you use K2P with fixed kappa).

Best,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #12 - Aug 14th, 2007 at 4:37pm
 
I don't want to be boring Tongue but I am still thinking it is not the same (or worse neither)

I did a test:
I took a tree. I reorganized the values of the branch lengths - without changing the topology - so the  tree length was the same.

With Hyphy we should get the same values of rates with both trees.
However, with DNArates the rates changed. DNArates takes each individual branch length to calculate the probabilities of each possible internal state. If we change a branch lengths, for exemple: from the tip A (with a known nucleotide) to the internal node B (unknown nucleotide), the probabilities, the score and I the optimization of rate will change even if we compensate the tree length changing another branch.

am I wrong?

I think it would be esier if I explain wit details the DNArates algorithms

In Pupko et al. 2002. Rate4site program. Explains the algorithm used for proteins and I think it is the same one used in DNArates.

Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #13 - Aug 14th, 2007 at 4:57pm
 
Dear Francesc,

The values in HyPhy should also change if you adjust branch lengths in a way to preserve total tree length (as it will alter the likelihood of the tree, even for a fixed rate). If it doesn't, then something is wrong with the implementation. The total tree length is simply used as a scaling factor - it doesn't have to be the total tree length - you can use the length of a single (non-zero) branch as a reference; taking the sum over the tree guarantees (unless the sequences are all identical) that it will not be zero, so when we divide by it (recall the BL = rate * time, we have time and we have BL, hence we can compute rate = BL/time).

The Bioinformatics paper essentially describes a standard ML procedure (summing over unknown internal states, ect) for fitting a model to a tree, except that branch lengths are fixed.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #14 - Aug 14th, 2007 at 5:06pm
 
Yes, I know it is the standard ML procedure.
I though the Hyphy only uses the topology of the input tree but no the branch lengths to calculate the rates.

I will test it.

Then, this approach (Hyphy one) is different to the one I was using with PAML - which only takes into account the topology if you want to do a similar approach.
Back to top
 
 
IP Logged
 
Pages: 1 2 3 4