Hi there,
hy_newbie wrote on Aug 10th, 2010 at 6:12pm:
Hi Sergei,
Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense… The LRT is the most powerful test, it is true, but is there a way to determine the sample size required by HyPhy (i.e. minimum number of sequences) to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha=1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?
Sounds like you are interested in a power analysis; generally speaking, any statistical power analysis requires a model and assumptions about the effects you are trying to detect. In your application there are three major factors contributing to the sample size:
- Number of sequences (N)
- Length of sequences (L)
- Divergence, e.g. measured by the total tree length (T).
The sample size for estimating global omega is proportional to the product of L and N, and depends on T in a non-trivial and topology-dependent way (i.e. too high a divergence will lead to saturation, while too low a divergence will lead to high variance for smaller N and L).
If you are interested in a proper power study, you could use HyPhy to simulate data under the global omega model, vary N, L and T, generate, say, 100 replicates for each combination and see how many reject the null of omega_0 = 1 for a given value of omega > 1.
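Once the replicates have been fitted, tallying the rejections could look roughly like the Python sketch below. It assumes you have already extracted, for each simulated replicate, the maximized log-likelihood under the null (omega = 1) and under the alternative (omega free), and that the LRT statistic is compared against a chi-square distribution with 1 degree of freedom. The function names and the example numbers are made up for illustration; none of this is a HyPhy API.

```python
import math

def lrt_p_value(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test, assuming chi-square with 1 df."""
    # LRT statistic; clamp at 0 in case the optimizer leaves lnl_alt slightly below lnl_null.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def estimated_power(replicates, alpha=0.05):
    """Fraction of replicates in which the null is rejected at level alpha."""
    rejected = sum(
        1 for lnl_null, lnl_alt in replicates
        if lrt_p_value(lnl_null, lnl_alt) < alpha
    )
    return rejected / len(replicates)

# Hypothetical (lnL_null, lnL_alt) pairs for one combination of N, L and T.
replicates = [
    (-1000.0, -997.0),
    (-1200.0, -1199.5),
    (-900.0, -895.0),
    (-1100.0, -1100.0),
]
print(estimated_power(replicates))
```

Keep in mind that with 100 replicates per combination the estimated power is itself a binomial proportion, so its standard error can be as large as 0.05 (when the true power is near 0.5).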
Quote:
Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences) the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology? (i.e. just do all the pairwise comparisons?)
One should always attempt to use a reasonable topology for ANY number of sequences. For three sequences there is only one unrooted topology (star), and for four sequences even short alignments usually provide strong signal to infer the correct one out of 3 possible topologies.
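For reference, the topology counts above follow from the standard formula for the number of unrooted, fully resolved (binary) trees on n labeled sequences, (2n-5)!! = 1 x 3 x 5 x ... x (2n-5). A small Python sketch (the function name is just for illustration):

```python
def num_unrooted_binary_topologies(n):
    """Number of unrooted binary tree topologies on n labeled tips: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (an empty product for n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

for n in (3, 4, 5, 6):
    print(n, num_unrooted_binary_topologies(n))
```

The count grows very quickly with n, but for three or four sequences the search space is trivially small.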
Sergei