HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
Methodology Questions >> How to >> Minimum number of sequences
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1281233166

Message started by hy_newbie on Aug 7th, 2010 at 7:06pm

Title: Minimum number of sequences
Post by hy_newbie on Aug 7th, 2010 at 7:06pm
Hi,

I am interested in using HyPhy to measure the selection pressure (global beta/alpha) on a viral data set. What is the minimum number of sequences needed to get a sensible answer? Is there a reference for this?

Thank you

Title: Re: Minimum number of sequences
Post by Sergei on Aug 9th, 2010 at 4:44pm
Hi there,

This depends on your question, as with any sample size (power) estimate in statistics. For instance, the answer to "How many sequences and what divergence level are needed to distinguish dN/dS of X from dN/dS of 1?" will be different if X = 5 or X = 1.5 (more data are needed for the latter). I would direct you to take a look at [link not available], sections 1.4.4.x on confidence intervals and LRT.

Sergei

Title: Re: Minimum number of sequences
Post by hy_newbie on Aug 10th, 2010 at 6:12pm
Hi Sergei,

Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense… The LRT is the most powerful test, it is true, but is there a way to determine the sample size (i.e. minimum number of sequences) that HyPhy needs to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha=1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?

Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences) the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology (i.e. just do all the pairwise comparisons)?


Title: Re: Minimum number of sequences
Post by Sergei on Aug 13th, 2010 at 7:44am
Hi there,


hy_newbie wrote on Aug 10th, 2010 at 6:12pm:
Hi Sergei,

Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense… The LRT is the most powerful test, it is true, but is there a way to determine the sample size (i.e. minimum number of sequences) that HyPhy needs to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha=1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?


Sounds like you are interested in a power analysis; generally speaking, any statistical power analysis requires a model and assumptions about the effects you are trying to detect. In your application there are three major factors contributing to the sample size:

  • Number of sequences (N)
  • Length of sequences (L)
  • Divergence, e.g. measured by the total tree length (T).


The sample size for estimating global omega is proportional to the product of L and N, and depends on T in a non-trivial and topology-dependent way (i.e. too high a divergence will lead to saturation, and too low a divergence will lead to high variance for smaller N and L).

If you are interested in a proper power study, you could use HyPhy to simulate data under the global omega model, vary N, L and T, generate, say, 100 replicates for each combination and see how many reject the null of omega_0 = 1 for a given value of omega > 1.
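
The replicate-counting loop described above can be sketched in a few lines of Python. This is only a toy illustration and not HyPhy code: simulate_counts is a hypothetical stand-in that draws substitution counts from a binomial instead of simulating a codon alignment on a tree, NONSYN_SHARE is an assumed illustrative constant, and N and T are collapsed into a single total tree length. In a real study, the simulation and the model fits would be done in HyPhy, and only the outer loop would look like this.

```python
import itertools
import math
import random

# Assumption: share of substitutions that are nonsynonymous when omega = 1.
# The value is illustrative only; it does not come from the thread or HyPhy.
NONSYN_SHARE = 0.75


def simulate_counts(omega, length, tree_length, rng):
    """Toy stand-in for simulating an alignment (hypothetical helper).

    Returns (nonsynonymous substitutions, total substitutions).
    """
    events = max(1, round(length * tree_length))
    weight = omega * NONSYN_SHARE
    p_nonsyn = weight / (weight + (1.0 - NONSYN_SHARE))
    nonsyn = sum(rng.random() < p_nonsyn for _ in range(events))
    return nonsyn, events


def log_lik(nonsyn, events, p):
    """Binomial log-likelihood, omitting the constant term."""
    total = 0.0
    if nonsyn > 0:
        total += nonsyn * math.log(p)
    if events - nonsyn > 0:
        total += (events - nonsyn) * math.log(1.0 - p)
    return total


def lrt_p_value(nonsyn, events):
    """LRT of omega = 1 (null) against a freely estimated omega."""
    p_hat = nonsyn / events
    stat = 2.0 * (log_lik(nonsyn, events, p_hat)
                  - log_lik(nonsyn, events, NONSYN_SHARE))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def estimate_power(omega, length, tree_length,
                   replicates=100, alpha=0.05, seed=1):
    """Fraction of simulated replicates that reject omega = 1."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(replicates):
        nonsyn, events = simulate_counts(omega, length, tree_length, rng)
        if lrt_p_value(nonsyn, events) < alpha:
            rejected += 1
    return rejected / replicates


# Vary alignment length (codons) and total tree length for omega = 1.5.
for length, tree_length in itertools.product([50, 300], [0.1, 1.0]):
    power = estimate_power(1.5, length, tree_length)
    print(f"L={length} T={tree_length} power={power:.2f}")
```

Running the same loop with omega set to 1 gives the false positive rate of the test, which is a useful sanity check before reading anything into the power estimates.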



Quote:
Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences) the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology (i.e. just do all the pairwise comparisons)?


One should always attempt to use a reasonable topology for ANY number of sequences. For three sequences there is only one unrooted topology (star), and for four sequences even short alignments usually provide strong signal to infer the correct one out of 3 possible topologies.
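
The topology counts quoted above follow from the standard formula for the number of unrooted bifurcating trees on n labelled sequences, (2n - 5)!!. A minimal Python sketch (the function name is made up for illustration) reproduces them:

```python
def unrooted_topologies(n_sequences):
    """Number of unrooted bifurcating topologies, (2n - 5)!!, for n >= 3."""
    if n_sequences < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for factor in range(3, 2 * n_sequences - 4, 2):
        count *= factor
    return count


for n in (3, 4, 5, 6):
    print(n, unrooted_topologies(n))
```

The count grows very quickly with n, but for 3 or 4 sequences the search space is trivially small.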

Sergei


