HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1166812752

Message started by ofedrigo on Dec 22nd, 2006 at 10:39am

Title: Site rate
Post by ofedrigo on Dec 22nd, 2006 at 10:39am
Hi,

I have a theoretical question. I am doing a sliding window analysis to detect regions of an alignment with a high substitution rate. For this, I try to infer the rate at each site and calculate the average for each window.
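The windowing step described here can be sketched in a few lines of Python. This is only an illustration, assuming the per-site rates have already been estimated by some method and exported as a plain list; the function name `sliding_window_means` and its parameters are hypothetical and not part of HyPhy.

```python
def sliding_window_means(site_rates, window, step=1):
    """Average per-site rates over each full window.

    site_rates: list of rate estimates, one per alignment column (assumed input).
    window:     number of sites per window.
    step:       offset between the starts of consecutive windows.
    Returns a list of (start_index, mean_rate) tuples, 0-based.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    results = []
    # Only full windows are reported; a trailing partial window is skipped.
    for start in range(0, len(site_rates) - window + 1, step):
        chunk = site_rates[start:start + window]
        results.append((start, sum(chunk) / window))
    return results


if __name__ == "__main__":
    rates = [0.2, 0.4, 1.8, 2.2, 0.1, 0.3]  # made-up values for illustration
    for start, mean_rate in sliding_window_means(rates, window=3):
        print(start, round(mean_rate, 3))
```

Windows with a high mean are then the candidate fast-evolving regions; how noisy those means are depends on how the per-site rates were obtained.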

What is the most reliable and precise method?

1- Fit a model with rate classes (e.g. G+I) to the whole dataset, and assign a rate class to each site with Naive Empirical Bayes.

2- Use a DNArates-type approach, as implemented in siteRates.bf.

Or maybe there is another obvious method I did not think of.

Thanks,

Olivier

Title: Re: Site rate
Post by Sergei on Dec 22nd, 2006 at 12:00pm
Dear Olivier,

Both methods (not surprisingly) have their pros and cons. The rule of thumb is probably to use random effects likelihood (REL: G+I + empirical Bayes) for smaller data sets, and fixed effects likelihood (FEL: siteRates.bf) for larger data sets.

The strength of FEL is that it is distribution-free, so it won't force a potentially wrong distribution (G+I or any other) onto the rates in your alignment. Its weakness is that it is fairly noisy: because you are estimating a rate from a single site, your sample size is about the number of sequences, so you need about 25 or more sequences to avoid serious overfitting.

The strength of REL is that it can avoid overfitting by 'pooling' similar sites into a single rate class, but this is also its weakness, because one has to assume something about the unknown distribution of rates. The practical result is that the inference can 'smooth' the rates, forcing sites with different patterns into the same rate class, if the underlying distribution is insufficiently flexible. The sample size for the hyperparameters of REL (alpha and P_I for G+I) is the length of the alignment, so it will almost surely do poorly for short sequences. I would recommend using beta-Gamma, as it is much more flexible than G+I with only one extra parameter.
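To make the empirical Bayes step concrete, here is a minimal Python sketch. It is not HyPhy code: it assumes you already have, for each site, the likelihood conditional on each rate class, together with the class rates and class weights from the fitted model. The function and argument names are hypothetical.

```python
def empirical_bayes_rates(site_likelihoods, class_rates, class_weights):
    """Posterior class probabilities and posterior mean rate for each site.

    site_likelihoods: list of lists; site_likelihoods[s][k] is the likelihood
                      of site s given rate class k (assumed precomputed).
    class_rates:      rate of each class (e.g. 0 for an invariant class).
    class_weights:    prior probability of each class; should sum to 1.
    """
    posteriors = []
    mean_rates = []
    for lik in site_likelihoods:
        # Bayes' rule: posterior is proportional to prior weight times likelihood.
        joint = [w * l for w, l in zip(class_weights, lik)]
        total = sum(joint)
        post = [j / total for j in joint]
        posteriors.append(post)
        # Posterior mean rate: a weighted average of the class rates.
        mean_rates.append(sum(p * r for p, r in zip(post, class_rates)))
    return posteriors, mean_rates


if __name__ == "__main__":
    # Two sites, three rate classes; all numbers are made up for illustration.
    liks = [[1e-4, 5e-4, 1e-4], [1e-6, 2e-5, 9e-5]]
    post, means = empirical_bayes_rates(liks, [0.0, 0.5, 2.5], [0.2, 0.5, 0.3])
    print(post)
    print(means)
```

Because the posterior mean is a weighted average of the class rates, it can never fall outside the range of those rates, which is one way to see the smoothing effect: sites are pulled toward whatever classes the assumed distribution provides.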

HTH,
Sergei
