Site rate
ofedrigo
Site rate
Dec 22nd, 2006 at 10:39am
 
Hi,

I have a theoretical question. I am doing a sliding window analysis to detect regions of an alignment with a high substitution rate. For this, I try to infer the site-specific rate and calculate the average for each window.

What is the most reliable and precise method?

1- Fit a model with rate classes (e.g. G+I) on the whole dataset, and assign a rate class to each site with Naive Empirical Bayes.

2- Use a DNArates-type approach, as implemented in siteRates.bf.

Or maybe there is another obvious method I did not think about.
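
For reference, the windowing step itself does not depend on which of the two methods produces the per-site rates. A minimal sketch in Python, assuming the rates have already been estimated and exported as a plain list (the function name, the `window` and `step` parameters, and the toy values are all hypothetical and are not part of HyPhy):

```python
def sliding_window_means(site_rates, window, step=1):
    """Average per-site rates over consecutive windows.

    site_rates: list of per-site rate estimates, in alignment order.
    window:     number of sites per window.
    step:       offset between the starts of consecutive windows.
    Returns a list of (start_index, mean_rate) pairs, 0-based.
    """
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive integers")
    means = []
    # Only full windows are reported; a trailing partial window is skipped.
    for start in range(0, len(site_rates) - window + 1, step):
        chunk = site_rates[start:start + window]
        means.append((start, sum(chunk) / window))
    return means


# Hypothetical toy rates for a 5-site alignment.
example_rates = [1.0, 2.0, 3.0, 4.0, 5.0]
example_windows = sliding_window_means(example_rates, window=3)
```

Windows with unusually high means would then be the candidate fast-evolving regions; how reliable they are depends on how noisy the underlying per-site estimates are.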

Thanks,

Olivier
 
Sergei
Re: Site rate
Reply #1 - Dec 22nd, 2006 at 12:00pm
 
Dear Olivier,

Both methods (not surprisingly) have their pros and cons. The rule of thumb is probably to use random effects likelihood (REL: G+I + empirical Bayes) for smaller data sets, and fixed effects likelihood (FEL: siteRates.bf) for larger data sets.

The strength of FEL is that it is distribution-free: it won't force a potentially wrong distribution (G+I or any other) onto the rates in your alignment. Its weakness is that it is fairly noisy, because you are estimating a rate from a single site, so your sample size is about the number of sequences. Hence you need about 25 sequences or more to avoid serious overfitting.

The strength of REL is that it can avoid overfitting by 'pooling' similar sites into a single rate class, but this is also its weakness, because one has to assume something about the unknown distribution of rates. The practical result is that the inference can 'smooth' the rates, forcing sites with different patterns into the same rate class, if the underlying distribution is insufficiently flexible. The sample size for the hyperparameters of REL (alpha and P_I for G+I) is the length of the alignment, so it will almost surely do poorly for short sequences. I would recommend using beta-Gamma, as it is much more flexible than G+I with only one extra parameter.
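
To make the empirical Bayes step concrete: once the rate-class model is fitted, each site's posterior over classes is the class weight times the site likelihood under that class, normalised across classes, and the site rate can be taken either as the posterior mean or as the rate of the most probable class. A rough sketch in Python, assuming the per-class site likelihoods have already been computed elsewhere (the function name and all numbers below are hypothetical illustrations, not HyPhy output):

```python
def posterior_site_rates(site_likelihoods, class_weights, class_rates):
    """Empirical Bayes assignment of rates to sites.

    site_likelihoods[s][k]: likelihood of site s given rate class k.
    class_weights[k]:       fitted prior weight of class k (sums to 1).
    class_rates[k]:         fitted rate of class k.
    Returns one (posterior_mean_rate, most_probable_class) pair per site.
    """
    results = []
    for lik in site_likelihoods:
        # Bayes' rule: P(class k | site) is proportional to weight_k * L(site | k).
        joint = [w * l for w, l in zip(class_weights, lik)]
        total = sum(joint)
        posterior = [j / total for j in joint]
        mean_rate = sum(p * r for p, r in zip(posterior, class_rates))
        best_class = max(range(len(posterior)), key=posterior.__getitem__)
        results.append((mean_rate, best_class))
    return results


# Hypothetical three-class model: an invariant class plus two rate classes,
# with weights and rates chosen so that the mean rate is 1.
class_rates = [0.0, 0.5, 2.0]
class_weights = [0.2, 0.4, 0.4]
site_likelihoods = [
    [0.0, 1e-4, 4e-4],   # variable site: the invariant class is impossible
    [3e-3, 1e-3, 1e-4],  # constant-looking site: slow classes are favoured
]
site_rates = posterior_site_rates(site_likelihoods, class_weights, class_rates)
```

Because every site is mapped onto the same small set of class rates, the 'smoothing' effect is visible directly: no estimate can fall outside the range spanned by `class_rates`.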

HTH,
Sergei
Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego