HyPhy message board


(Moderators: Sergei, Simon)


Site rate (Read 1972 times)

ofedrigo


Site rate
Dec 22nd, 2006 at 10:39am

Hi,

I have a theoretical question. I am doing a sliding window analysis to detect regions of an alignment with a high substitution rate. For this, I try to infer the rate at each site and then calculate the average for each window.
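The windowing step itself is simple once per-site rates are available. Here is a minimal Python sketch, assuming the rates have already been estimated by some method and exported as a plain list. The function name `sliding_window_means` and the example rates are made up for illustration and are not part of HyPhy.

```python
def sliding_window_means(site_rates, window, step=1):
    """Return (start_index, mean_rate) for every full window.

    site_rates: per-site rate estimates in alignment order (assumed input).
    window:     number of sites per window.
    step:       number of sites to advance between windows.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    results = []
    # Only full windows are reported; a trailing partial window is skipped.
    for start in range(0, len(site_rates) - window + 1, step):
        chunk = site_rates[start:start + window]
        results.append((start, sum(chunk) / window))
    return results


# Hypothetical per-site rates, for illustration only.
rates = [0.5, 1.0, 1.5, 4.0, 3.0, 0.5]
windows = sliding_window_means(rates, window=3)
for start, mean_rate in windows:
    print(start, round(mean_rate, 3))
```

Whichever method produces the per-site rates, the window average inherits their noise or smoothing, so the choice of estimator matters more than the windowing code.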

What is the most reliable and precise method?

1- Fit a model with rate classes (e.g. G+I) to the whole dataset, and assign a rate class to each site with Naive Empirical Bayes.

2- Use a DNArates-type approach, as implemented in siteRates.bf.

Or maybe there is another obvious method I did not think about.

thanks,

Olivier


Sergei


Re: Site rate
Reply #1 - Dec 22nd, 2006 at 12:00pm

Dear Olivier,

Both methods (not surprisingly) have their pros and cons. The rule of thumb is probably to use random effects (G+I + Empirical Bayes) for smaller data sets, and fixed effects (siteRates.bf) for larger data sets.

The strength of FEL (fixed effects likelihood) is that it is distribution-free: it won't force a potentially wrong distribution (G+I or any other) onto the rates in your alignment. Its weakness is that it is fairly noisy, because you are estimating a rate from a single site. Your sample size is therefore roughly the number of sequences, so you need about 25 or more to avoid serious overfitting.
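To get a feel for why single-site estimates are noisy, here is a toy Python sketch. It is not what siteRates.bf does. It assumes, purely for illustration, that the number of substitutions at a site is Poisson with mean rate × total tree length, so that the maximum likelihood rate estimate is count / tree length and its relative standard error is 1 / sqrt(rate × tree length). The function name and the tree lengths are hypothetical.

```python
import math


def relative_standard_error(site_rate, tree_length):
    """Relative standard error of a per-site rate estimate.

    Toy assumption (not the HyPhy method): substitutions at the site are
    Poisson with mean site_rate * tree_length, so the estimate
    count / tree_length has standard error sqrt(site_rate / tree_length).
    """
    expected_substitutions = site_rate * tree_length
    if expected_substitutions <= 0:
        raise ValueError("site_rate and tree_length must be positive")
    return 1.0 / math.sqrt(expected_substitutions)


# More sequences usually means a longer total tree, hence more information.
for tree_length in (0.25, 1.0, 4.0):
    print(tree_length, relative_standard_error(1.0, tree_length))
```

Under this toy model, the error shrinks only with the square root of the expected number of substitutions at the site, which is consistent with needing a reasonable number of sequences before per-site estimates settle down.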

The strength of REL (random effects likelihood) is that it can avoid overfitting by 'pooling' similar sites into a single rate class. This is also its weakness, because one has to assume something about the unknown distribution of rates. In practice, the inference can 'smooth' the rates, forcing sites with different patterns into the same rate class, if the underlying distribution is insufficiently flexible. The sample size for the hyperparameters of REL (alpha and P_I for G+I) is the length of the alignment, so it will almost surely do poorly for short sequences. I would recommend using beta-Gamma, as it is much more flexible than G+I with only one extra parameter.
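The empirical Bayes step can be sketched in a few lines of Python. This assumes the model has already been fitted, so that the rate of each class, the weight of each class, and the likelihood of each site conditional on each class are available as plain numbers. The function name and all values below are hypothetical. The sketch reports a posterior mean rate per site, whereas assigning each site to its most probable class is the other common choice.

```python
def posterior_mean_rates(site_likelihoods, class_rates, class_weights):
    """Empirical Bayes posterior mean rate for each site.

    site_likelihoods[s][k]: likelihood of site s given rate class k
                            (assumed to come from a fitted model).
    class_rates[k]:         rate of class k.
    class_weights[k]:       prior weight of class k (weights sum to 1).
    """
    means = []
    for lik in site_likelihoods:
        # Bayes' rule: posterior is proportional to prior weight * likelihood.
        joint = [w * l for w, l in zip(class_weights, lik)]
        total = sum(joint)
        posterior = [j / total for j in joint]
        means.append(sum(p * r for p, r in zip(posterior, class_rates)))
    return means


# Hypothetical three-class example: invariant, slow, fast.
class_rates = [0.0, 0.5, 2.0]
class_weights = [0.2, 0.4, 0.4]
site_likelihoods = [
    [0.9, 0.5, 0.1],  # a conserved-looking site
    [0.0, 0.1, 0.4],  # a variable site (impossible under the invariant class)
]
means = posterior_mean_rates(site_likelihoods, class_rates, class_weights)
print(means)
```

The smoothing effect is visible here: every site's rate is a weighted mix of the same few class rates, so no site can be assigned a rate outside the range the fitted distribution allows.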

HTH,
Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
