HyPhy message board - Minimum number of sequences

hy_newbie

YaBB Newbies

Offline

Feed your monkey!

Posts: 2

Minimum number of sequences
Aug 7th, 2010 at 7:06pm

Hi,

I am interested in using HyPhy to measure the selection pressure (global beta/alpha) on a viral data set, and was wondering what minimum number of sequences is needed to get a sensible answer. Is there a reference for this? Thank you

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: Minimum number of sequences
Reply #1 - Aug 9th, 2010 at 4:44pm

Hi there,

This depends on your question, as with any sample size (power) estimate in statistics. For instance, the answer to "How many sequences and what divergence level are needed to distinguish dN/dS of X from dN/dS of 1?" will be different if X = 5 or X = 1.5 (more data are needed for the latter). I would direct you to take a look at [link visible to registered members only], sections 1.4.4.x on confidence intervals and LRT.
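For readers unfamiliar with the LRT mentioned here, a minimal sketch of the arithmetic follows. It assumes the usual chi-square approximation with one degree of freedom (the null fixes dN/dS at 1, the alternative leaves it free). The function name and the log-likelihood values are hypothetical; in practice the two log-likelihoods would come from fitting both models in HyPhy.

```python
import math


def lrt_pvalue_1df(lnl_null, lnl_alt):
    """Approximate p-value for a nested LRT with one extra free parameter.

    lnl_null: maximized log-likelihood with dN/dS constrained to 1
    lnl_alt:  maximized log-likelihood with dN/dS estimated freely
    """
    # LRT statistic; clamp at 0 in case of tiny numerical optimization noise.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For a chi-square with 1 df, P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Made-up log-likelihoods, for illustration only.
p = lrt_pvalue_1df(lnl_null=-1234.5, lnl_alt=-1231.2)
print(f"LRT p-value: {p:.4f}")
```

The same data can give a clear rejection when the true dN/dS is far from 1 and no rejection when it is close to 1, which is why the required sample size depends on the effect being tested.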

Sergei


Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego


hy_newbie


Re: Minimum number of sequences
Reply #2 - Aug 10th, 2010 at 6:12pm

Hi Sergei,

Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense. It is true that the LRT is the most powerful test, but is there a way to determine the sample size (i.e. the minimum number of sequences) that HyPhy needs to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha = 1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?

Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences), the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology (i.e. just do all the pairwise comparisons)?


Sergei


Re: Minimum number of sequences
Reply #3 - Aug 13th, 2010 at 7:44am

Hi there,

hy_newbie wrote on Aug 10th, 2010 at 6:12pm:

Hi Sergei,

Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense. It is true that the LRT is the most powerful test, but is there a way to determine the sample size (i.e. the minimum number of sequences) that HyPhy needs to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha = 1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?

Sounds like you are interested in a power analysis; generally speaking, any statistical power analysis requires a model and assumptions about the effects you are trying to detect. In your application there are three major factors contributing to the sample size:

Number of sequences (N)
Length of sequences (L)
Divergence, e.g. measured by the total tree length (T).

The sample size for estimating global omega is proportional to the product of L and N, and depends on T in a non-trivial and topology-dependent way (i.e. too high a divergence will lead to saturation, and too low a divergence will lead to high variance for smaller N and L).

If you are interested in a proper power study, you could use HyPhy to simulate data under the global omega model, vary N, L and T, generate, say, 100 replicates for each combination and see how many reject the null of omega_0 = 1 for a given value of omega > 1.
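The shape of such a power study can be sketched in Python as below. This is not HyPhy code, and the simulation step is a deliberately crude toy: instead of simulating codon alignments over N, L and T, it draws a single count of substitutions and treats each one as nonsynonymous with a probability set by omega. The names `toy_lrt_pvalue`, `estimate_power`, `n_events` and the constant `k` are all hypothetical. In a real study the toy function would be replaced by "simulate an alignment under the global omega model in HyPhy, fit the null and alternative, and return the LRT p-value", and each grid point would be a combination of N, L, T and omega.

```python
import math
import random


def toy_lrt_pvalue(n_events, omega, rng, k=3.0):
    """Toy stand-in for: simulate data, fit both models, run the LRT.

    Each of n_events substitutions is nonsynonymous with probability
    p = omega*k / (omega*k + 1), where k is a made-up ratio of
    nonsynonymous to synonymous opportunity. Under the null (omega = 1)
    this probability is p0 = k / (k + 1).
    """
    p_true = omega * k / (omega * k + 1.0)
    p_null = k / (k + 1.0)
    x = sum(rng.random() < p_true for _ in range(n_events))  # binomial draw

    def loglik(p):
        # Binomial log-likelihood without the constant term; 0*log(0) := 0.
        ll = 0.0
        if x > 0:
            ll += x * math.log(p)
        if x < n_events:
            ll += (n_events - x) * math.log(1.0 - p)
        return ll

    stat = max(0.0, 2.0 * (loglik(x / n_events) - loglik(p_null)))
    return math.erfc(math.sqrt(stat / 2.0))  # chi-square (1 df) tail


def estimate_power(simulate_pvalue, grid, replicates=100, alpha=0.05, seed=1):
    """Fraction of replicates rejecting the null at each grid point."""
    rng = random.Random(seed)
    power = {}
    for combo in grid:
        rejected = sum(simulate_pvalue(*combo, rng) < alpha
                       for _ in range(replicates))
        power[combo] = rejected / replicates
    return power


# Grid of (number of substitutions, true omega); omega = 1.0 gives the
# false-positive rate rather than power.
grid = [(n, w) for n in (20, 200, 2000) for w in (1.0, 1.5, 3.0)]
for combo, frac in estimate_power(toy_lrt_pvalue, grid).items():
    print(combo, frac)
```

Only the loop structure carries over to a real analysis: fix an effect size, simulate replicates, count rejections, and repeat across the grid. The toy says nothing about how N, L and T actually trade off.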

Quote:

Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences), the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology (i.e. just do all the pairwise comparisons)?

One should always attempt to use a reasonable topology for ANY number of sequences. For three sequences there is only one unrooted topology (star), and for four sequences even short alignments usually provide strong signal to infer the correct one out of 3 possible topologies.
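The topology counts quoted above follow from the standard formula for unrooted binary trees on n labelled tips, (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5). A small sketch to check the numbers (the function name is an illustrative choice, not part of HyPhy):

```python
def count_unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (3, 4, 5, 6):
    print(n, count_unrooted_topologies(n))
# prints:
# 3 1
# 4 3
# 5 15
# 6 105
```

The count grows quickly with n, but for three or four sequences the search space is small enough to evaluate every topology.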

Sergei
