Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Pages: 1 2 3 4 
Modification of site specific rate estimation (Read 20974 times)
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #15 - Aug 14th, 2007 at 5:24pm
 
Dear Francesc,

It is amazing how many different ways there are to say the same thing and implement the same algorithm:)

The modified version of the HBL script definitely uses input branch lengths for rate estimation.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #16 - Aug 16th, 2007 at 3:32pm
 
Hi,

I did the tests (changing the branch lengths but keeping the tree length) and it seems to work.  Grin

??? I though Hyphy was only using the information of branch lengths from the input tree to calculate scaling factor (1/Tree Length). I though, during the loop for each site in the alignment, hyphy was optimizing the tree length for that position taking into account only the topology. And that tree length for each site was the rate on which we applied the scaling factor. (that was the approach I used with PAML).

Thanks for your help!!!!
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #17 - Aug 16th, 2007 at 3:48pm
 
Dear Francesc,

I am glad we settled this issue:)
Let me know how the actual rate estimation goes...

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #18 - Aug 24th, 2007 at 9:51am
 
Hi,

Here we have some results. Some sites from a gene and 2 graphics comparing both programs. The second chart is a detail of the lowest rates. I am also including the batch file in case you wanna take a look.

General things:
My tree length is almost 1.
I think DNArates is using K2P (kappa=2) and I am sure Hyphy is using that.
the correlates pretty well but, with DNArates, I am getting 3 - 4 times bigger rates.
There are some saturated positions in DNArates which are not saturated in Hyphy.

if a unit length in my chronogram is a billion of years, the rates units in hyphy after applying the scaling factor are substitution per billion years, aren’t they?

Even the position is invariant , shouldn’t Hyphy show some rate value?

I am going to take a look to the columns in the alignment with macclade and see what’s going on with this saturated positions.

another option would be to simulate data with another program assuming K2P and test both programs. In any case, I feel more confident with Hyphy because I have more control what is happening (if  the batch file is correct.)

Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login

Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login


site     DNArates   HYPHY
1      3.88          0.96
2      0.49            0
3      74.49            26.45
4      0.49            0
5      0.49            0
6      4.11            0.98
7      2.09            0.63
8      3.64            0.98
9      12.06            2.76
10      5.61            1.35
11      0.97            0.29
12      44.05            9.05
13      0.49            0
14      0.49            0
15      5.04            1.37
16      0.49            0
17      0.49            0
18      7.55            1.75
19      0.49            0
20      0.49            0
21      11.83            2.49
22      2.28            0.64
23      0.49            0
24      10.4            2.24
25      6.52            1.44
26      3.47            0.98
27      3.49            0.95
28      0.49            0
29      0.49            0
30      24.65            3.04
31      1.04            0.3
32      1.05            0.3
33      3.31            0.98
34      0.49            0
35      0.49            0
36      14.11            3.82
37      2.64            0.65
38      5            1.38
39      13.57            3.12
40      8.1            1.91
41      0.97            0.29
42      27.77            5
43      7.8            1.72
44      4.1            1.02
45      55.61            11.69
46      5.81            1.53
47      5.81            1.53
48      26.7            8.23
49      1.09            0.31
50      2.38            0.66
51      19.4            3.35
52      13.57            3.16
53      10.27            2.62
54      499.44      3.51
55      20.56            4.01
56      5.93            1.41
57      20.04            4.75
58      9.45            2.65
59      7.34            1.99
60      65.5            13.71
61      11.22            2.81
62      22.27            4.77
63      499.44      399.65
.....
397      499.44      2.98
.....
Back to top
« Last Edit: Aug 30th, 2007 at 11:39am by N/A »  
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #19 - Aug 28th, 2007 at 1:42pm
 
Dear Francesc,

Quote:
the correlates pretty well but, with DNArates, I am getting 3 - 4 times bigger rates.
There are some saturated positions in DNArates which are not saturated in Hyphy.



Not sure where the 3-4x factor is coming from; my guess is that the models are still not the same. In your batch file, line 118 is commented out, and line 119 is put in instead; this should be reversed. Line 118 is the proper way to compute the scaling factor; try printing out both and seeing what difference it makes.

Quote:
if a unit length in my chronogram is a billion of years, the rates units in hyphy after applying the scaling factor are substitution per billion years, aren’t they?


Yes; substitution/site/billion years.

Quote:
Even the position is invariant , shouldn’t Hyphy show some rate value?


The way the model is parameterized, the ML rate estimate for an invariant site is exactly '0'.

Quote:

another option would be to simulate data with another program assuming K2P and test both programs. In any case, I feel more confident with Hyphy because I have more control what is happening (if  the batch file is correct.)


Simulations are indeed a good idea; you can do that in HyPhy as well if you want.

Cheers,
Sergei

Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #20 - Aug 29th, 2007 at 8:55am
 
Hi Sergei,

Quote:
In your batch file, line 118 is commented out, and line 119 is put in instead; this should be reversed. Line 118 is the proper way to compute the scaling factor; try printing out both and seeing what difference it makes.


The reason I changed this
scalingFactor = estimatedLength/referenceLength;
for this:
scalingFactor = 1/referenceLength;

is, as we had commented before (replies #7 and #8), because  we don’t need to estimate the global tree for the model if we fix the model and its parameters (right?). So, I commented the lines related with estimatedLenght like   /*estimatedLength = estimatedLength + estL[i];*/.

When I find some time I will try the simulations.

Thanks!
Francesc
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #21 - Aug 29th, 2007 at 11:58am
 
Dear Francesc,

Your scaling factor (for K2P) should is like 1/referenceLength, but only works for kappa = 2; if you want to try other model parameters, restore the treeLength numerator.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
Ex Member


Re: Modification of site specific rate estimation
Reply #22 - Aug 30th, 2007 at 11:35am
 
More test. x=DNArates y=Hyphy.
Pretty good correlation except for the BRCA1 gene.
I erased a couple of datapoints which DNArates found saturated but not Hyphy.
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #23 - Aug 30th, 2007 at 12:02pm
 
Dear Francesc,

I am still puzzled as to why the slope of the line would differ from 1 in some cases. It must be something different in the model, at least for the BRCA1 dataset.

Overall the agreement is quite good!

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #24 - Mar 17th, 2008 at 1:29pm
 
Hi Sergei,
Longtime! I gave you some rest cos I guess you were tired to explain things to a such stubborn.  Wink

the thing is I still don't understand one thing about this batch file: Why do we need to apply a scaling factor?

Goal: Calculate substitution rate for each column of our alignment, given a tree with a branch lengths (chronogram) and fix model and its parametres (e.g. Kimura k=2).

code to calculate scaling factor:
Code:
UseModel(USE_NO_MODEL);
Tree 	 referenceTree = treeString;
refL = BranchLength(referenceTree,-1);

referenceLength = 0;
for (i=0; i< Columns(refL); i=i+1)
{
   referenceLength = referenceLength + refL[i];
}

scalingFactor = 1/referenceLength; 



You told me that once:
Quote:
You have a reference (chrono) tree, call it R, the global tree, call it T, and the site tree S. Branch lengths are arranged as follows: S = alpha * T; T = beta * R; hence S = alpha*beta*R = gamma * R, i.e. S is directly proportional to the reference tree, and you could estimate 'gamma' directly at each site or as a multiplier to some 'global' beta (as the code does). The 'beta' factor only depends on the base frequencies and the global model rates - for K2P with a fixed kappa, these will be independent of the input alignment; hence 'beta' is independent of the input alignment. We could in fact compute 'beta' directly from the rate matrix, but the code in place now is more compact.


I don't understand what global tree is?

I though that, with the following part of the code, we already estimated the substitution rate for each site using the branch lengths of the inputted tree. Then, in the same units of the tree.
Code:
		for (siteCount = 0; siteCount < filteredData.sites; siteCount = siteCount+1)
		{
			siteMap = dupInfo[siteCount];
			if (alreadyDone[siteMap] == 0)
			{
				filterString = "" + siteCount;
				DataSetFilter siteFilter = CreateFilter (ds,1,filterString);
				LikelihoodFunction siteLikelihood = (siteFilter, siteTree);
				siteRate = 1;
				Optimize (site_res, siteLikelihood);
				alreadyDone[siteMap]  = 1;
				siteLengths = BranchLength (siteTree,-1);
				doneSites[siteMap][0] = siteRate;
				doneSites[siteMap][1] = site_res[1][0];/*Loglikelihood*/
			}
			dummy = ReportSite (siteCount, siteMap);
		}
		 



What is siteRate? what units has?

Thank you very much for your help and time,
Francesc
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #25 - Mar 17th, 2008 at 10:32pm
 
Dear Francesc,

Quote:
I don't understand what global tree is?


In this context this is the tree that is estimated from the entire alignment (average rate tree if you will).


Quote:
What is siteRate? what units has?


I don't remember what version of the code we settled on, but siteRate is most likely measured in terms of substitutions/site.
You need to divide it by the time factor (which is obtained from your chronogram) to get substitutions/site/unit time.

Cheers,
Sergei

Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #26 - Mar 18th, 2008 at 7:58am
 
Hi Sergei,

if siteRate uses the branch length of the inputted tree to calculate the rate, then shouldn't the rate be already in the same units of the tree?

if we have to correct the rate, shouldn't we divide the rate by the tree length?

Code:
UseModel(USE_NO_MODEL);
Tree 	 referenceTree = treeString;
refL = BranchLength(referenceTree,-1);

referenceLength = 0;
for (i=0; i< Columns(refL); i=i+1)
{
   referenceLength = referenceLength + refL[i];
}

scalingFactor = 1/referenceLength;  



Reference length is not the tree length, right?

This is the batch file, just in case.
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login

Thanks!
Francesc
     
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Modification of site specific rate estimation
Reply #27 - Mar 18th, 2008 at 9:02am
 
Dear Francesc,

referenceLength is the total length of the chronogram (in whatever units you specified it).
siteRate will eventually be divided by referenceLength (or multiplied by the scalingFactor) to disentangle rates and times (in reportSite).

HTH,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #28 - Mar 18th, 2008 at 2:06pm
 
Hi,
Let me put you an example and walkthrough the output. Let me know if I am wrong.

The input tree:
Code:
((((((Human:0.18,Macaca:0.18):0.69,(Mouse:0.16,Rat:0.16):0.71):0.07,(((Sheep:0.56,Cow:0.56):0.05,Boar:0.61):0.21,Dog:0.82):0.12):2.32,Gallus:3.26):0.44,Xenopus:3.70):1.06,Salmon:4.76);
 


tree length (manually): 20.62

The alignment:
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login

The firs position in the alignment looks like:
Code:
Dog    C
Gallus	T
Human     C
Rat	 C
Salmon    C
Xenopus    C
Cow	   C
Macaca    C
Sheep    C
Boar	C
Mouse	C 



OUTPUT:
Quote:
Tree givenTree=((Xenopus:0.480423,Salmon:0.618058)Node2:0.0571314,Gallus:0.423292,(((
Human:0.0233719,Macaca:0.0233719)Node8:0.0895924,(Mouse:0.0207751,Rat:0.0207751)
Node11:0.0921893)Node7:0.00908909,(((Sheep:0.0727127,Cow:0.0727127)Node16:0.0064
922,Boar:0.0792049)Node15:0.0272673,Dog:0.106472)Node14:0.0155813)Node6:0.301238
);

Where does this tree come from? The treelength is 2.53975029

Quote:
Reference tree length (units time): 19.56 (not equal to 20.62)
Scaling factor: 0.0511247
Site    1 Rate =  0.0225 Not scaled =  0.4411 Log(L) -4.8025



It looks like referenceLength is not including the bold number: Xenopus:3.70):1.06,Salmon:4.76);
Do we always need to provide and unrooted tree? With the same tree unrooted I am getting the correct tree length.

For tis site, the total substitutions (independently of the model) should be at least 1. And the rate bigger than 1/19.56 = 0.0511.
but siteRate = 0.4411.
Are you sure this is substitutions/site?

If I divide by 10 each branch of my tree. I get the following.
Quote:
Reference tree length (units time): 1.956
Scaling factor: 0.511247
Site    1 Rate =  0.2255 Not scaled =  0.4411 Log(L) -4.8025


Clearly, we have to a apply a correction to the unscaled value but I am still thinking that unscaled value should be bigger than one.

I run the same situation in DNArates. We get a rate of 0.0524 that multiplied by the tree length 20.62 gives 1.08 substitutions.

Is not possible that the unscaled value is being scaled by that tree with a tree length of 2.53?

I am eager to apply hyphy for rate estimation cos I can control the model and the parameters easily.

Thanks a lot,
Francesc
Back to top
« Last Edit: Mar 18th, 2008 at 3:38pm by Francesc »  
 
IP Logged
 
Francesc
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 32
Re: Modification of site specific rate estimation
Reply #29 - Mar 18th, 2008 at 3:45pm
 
.... Continuation of the last post....

I just realized about one thing. When I am running only one position (the fist column, see last post) I am getting this:

Rooted tree:
Quote:
Reference tree length (units time): 19.56
Scaling factor: 0.0511247
Site    1 Rate =  0.0511 Not scaled =  1.00000000 Log(L) -4.8025


Or

Using unrooted tree:
Quote:
Reference tree length (units time): 20.62
Scaling factor: 0.0484966
Site    1 Rate =  0.0485 Not scaled =  0.99999546 Log(L) -4.8524


Then, There is something wrong in the batch file I am using. The estimation should be independent for each site.
I should get the same rate for a specific position if I am using only the one or the whole alignment.
The model and its parameters are fixed.

What could it be?
Back to top
 
 
IP Logged
 
Pages: 1 2 3 4