Hi.
I have been in contact before concerning to a BF that I created with your help to estimate substitution rates in an alignment for each site independently. I am using the BF in a pipeline of a webapp. THANKS!
Please, find attached the BF and samples files.
The BF has hardcoded the input files and the model (HKY85 with kappa=2).
First issue:I am estimating the base freqs using:
Code:HarvestFrequencies (Freqs, myFilter, 1, 1, 1);
and I am getting the following:
Quote:Base freqs: {
{ 0.191379310345}
{ 0.170689655172}
{ 0.274137931034}
{ 0.363793103448}
}
If I calculate them manually or with phyml I am getting:
Quote:0.1463
0.1098
0.2927
0.4512
Estimating the freqs in the maximum likelihood framework with phyml:
Quote:0.2413
0.1309
0.2497
0.3781
Isn't that function estimating the frequencies from the data?
Is there any error in Hyphy? Are they estimated in the LH framework but I am getting different results than PhyML? Which one should I use?
Issue 2:I am getting rates when the column in the alignment is full of indels except for 1 site ("one informative site", sites 1 and 2 in the sample alignment, see below). That doesn't make much sense. Is there any way to "filter" that sites (or filter sites with a minimum X number of informative sites) and maybe show them us "undefined rate"?
Issue 3:Usually, I use chronograms with millions years as units for branch lengths. That implies that some branches can be easily over 100. I usually change the unit to work with values <10 and then I modify the rates with the same factor accordingly. I am including the same tree in different magnitudes and there is a problem when units are too high.
Why is this happening? What's the proper range to work with?
Original tree:
Quote:Chronogram length (time units): 578
Site 1 Rate = 1.1109 subst/time, info sites = 1
Site 2 Rate = 1.1109 subst/time, info sites = 1
Site 3 Rate = 1.0148 subst/time, info sites = 2
Site 4 Rate = 1.1477 subst/time, info sites = 5
Site 5 Rate = 1.0514 subst/time, info sites = 5
Site 6 Rate = 1.1476 subst/time, info sites = 5
Site 7 Rate = 1.0514 subst/time, info sites = 5
Site 8 Rate = 1.0514 subst/time, info sites = 5
Site 9 Rate = 1.0514 subst/time, info sites = 5
Site 10 Rate = 1.1477 subst/time, info sites = 5
Original tree divided by 100:
Quote:Chronogram length (time units): 5.78
Site 1 Total Rate = 0.9914 subst/time, info sites = 1
Site 2 Total Rate = 1.0522 subst/time, info sites = 1
Site 3 Total Rate = 0.0000 subst/time, info sites = 2
Site 4 Total Rate = 0.0000 subst/time, info sites = 5
Site 5 Total Rate = 0.0000 subst/time, info sites = 5
Site 6 Total Rate = 0.3230 subst/time, info sites = 5
Site 7 Total Rate = 0.2319 subst/time, info sites = 5
Site 8 Total Rate = 2.9807 subst/time, info sites = 5
Site 9 Total Rate = 0.0000 subst/time, info sites = 5
Site 10 Total Rate = 0.0000 subst/time, info sites = 5
Original tree divided by 1000:
Quote:Chronogram length (time units): 0.578
Site 1 Rate = 1.1470 subst/time, info sites = 1
Site 2 Rate = 1.0697 subst/time, info sites = 1
Site 3 Rate = 0.0000 subst/time, info sites = 2
Site 4 Rate = 0.0000 subst/time, info sites = 5
Site 5 Rate = 0.0000 subst/time, info sites = 5
Site 6 Rate = 3.2301 subst/time, info sites = 5
Site 7 Rate = 2.3185 subst/time, info sites = 5
Site 8 Rate = 29.8072 subst/time, info sites = 5
Site 9 Rate = 0.0000 subst/time, info sites = 5
Site10 Rate = 0.0000 subst/time, info sites = 5
For the last two trees in which there is no branch lengths bigger than two, the columns with enough informative sites have the same rate (after applying correction factor).
Thank you very much for your time and help,
Francesc