Minimum number of sequences
hy_newbie
YaBB Newbies
Posts: 2
Minimum number of sequences
Aug 7th, 2010 at 7:06pm
 
Hi,

I am interested in using HyPhy to measure the selection pressure (global beta/alpha) on a viral data set, and I was wondering what minimum number of sequences is needed to get a sensible answer. Is there a reference for this?

Thank you
 
Sergei
YaBB Administrator
Posts: 1658
UCSD
Gender: male
Re: Minimum number of sequences
Reply #1 - Aug 9th, 2010 at 4:44pm
 
Hi there,

This depends on your question, as with any sample size (power) estimate in statistics. For instance, the answer to "How many sequences and what divergence level are needed to distinguish dN/dS of X from dN/dS of 1?" will be different if X = 5 or X = 1.5 (more are needed for the latter). I would direct you to take a look at [link visible to registered members only], sections 1.4.4.x on confidence intervals and LRT.

Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
hy_newbie
Re: Minimum number of sequences
Reply #2 - Aug 10th, 2010 at 6:12pm
 
Hi Sergei,

Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense… It is true that the LRT is the most powerful test, but is there a way to determine the sample size (i.e. minimum number of sequences) that HyPhy needs to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha=1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?

Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences) the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology (i.e. just do all the pairwise comparisons)?

 
Sergei
Re: Minimum number of sequences
Reply #3 - Aug 13th, 2010 at 7:44am
 
Hi there,

hy_newbie wrote on Aug 10th, 2010 at 6:12pm:
Hi Sergei,

Thank you for your reply and the reference; it is a very insightful chapter. I am not a statistician by training, so I hope my questions make sense… It is true that the LRT is the most powerful test, but is there a way to determine the sample size (i.e. minimum number of sequences) that HyPhy needs to obtain a specified power? That is, is there a power analysis for the nested LRT in HyPhy comparing beta/alpha with beta/alpha=1? You also mentioned the CIs. Is inference based on CIs statistically completely equivalent to inference based on the LRT?



Sounds like you are interested in a power analysis; generally speaking, any statistical power analysis requires a model and assumptions about the effects you are trying to detect. In your application there are three major factors contributing to the sample size:

  • Number of sequences (N)
  • Length of sequences (L)
  • Divergence, e.g. measured by the total tree length (T).


The sample size for estimating global omega is proportional to the product of L and N, and depends on T in a non-trivial and topology-dependent way (i.e. too high a divergence will lead to saturation, and too low a divergence will lead to high variance for smaller N and L).

If you are interested in a proper power study, you could use HyPhy to simulate data under the global omega model, vary N, L and T, generate, say, 100 replicates for each combination and see how many reject the null of omega_0 = 1 for a given value of omega > 1.
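The replicate-and-count procedure described above can be sketched in Python. This is not HyPhy code and it does not simulate codon alignments on a tree: as a deliberately simplified stand-in, it draws synonymous and nonsynonymous substitution counts from Poisson distributions whose means scale with L and with the total tree length T = branch_len * (2N - 3), then applies an LRT of omega = 1 to each replicate. The function names, the `syn_fraction` of 0.25, and the equal-branch-length tree are all assumptions made for illustration, and the toy model ignores saturation and topology effects, so the numbers it prints only show the shape of the exercise.

```python
import math
import random

CHI2_1DF_95 = 3.841  # 95% quantile of the chi-square distribution, 1 d.f.


def poisson(rng, lam):
    """Draw a Poisson(lam) count by counting unit-rate arrivals in [0, lam]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(anything) = 0."""
    return x * math.log(y) if x > 0 else 0.0


def lrt_statistic(x_s, x_n, a_s, a_n):
    """LRT of omega = 1 for Poisson counts x_s, x_n with exposures a_s, a_n."""
    total = x_s + x_n
    if total == 0:
        return 0.0
    mu0 = total / (a_s + a_n)  # shared rate under the null (omega = 1)
    return 2.0 * (xlogy(x_s, x_s / (mu0 * a_s)) + xlogy(x_n, x_n / (mu0 * a_n)))


def estimate_power(omega, n_seqs, n_codons, branch_len,
                   replicates=100, seed=0, syn_fraction=0.25):
    """Fraction of simulated replicates that reject omega = 1 at the 5% level."""
    rng = random.Random(seed)
    # Assumption: unrooted binary tree with 2N - 3 branches of equal length.
    tree_len = branch_len * (2 * n_seqs - 3)
    a_s = syn_fraction * n_codons * tree_len          # synonymous exposure
    a_n = (1.0 - syn_fraction) * n_codons * tree_len  # nonsynonymous exposure
    rejections = 0
    for _ in range(replicates):
        x_s = poisson(rng, a_s)
        x_n = poisson(rng, omega * a_n)
        if lrt_statistic(x_s, x_n, a_s, a_n) > CHI2_1DF_95:
            rejections += 1
    return rejections / replicates


if __name__ == "__main__":
    # Vary N and L for two effect sizes; branch length is held fixed here.
    for omega in (1.5, 5.0):
        for n_seqs in (4, 10, 30):
            for n_codons in (50, 300):
                power = estimate_power(omega, n_seqs, n_codons, branch_len=0.05)
                print(f"omega={omega} N={n_seqs} L={n_codons} power={power:.2f}")
```

Running the same grid with omega = 1 gives the false positive rate of the test, which is a useful sanity check before reading anything into the power estimates.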


Quote:
Beyond the statistical issues, I was concerned that when the number of sequences is small (say 3-4 sequences) the method employed by HyPhy to estimate global beta and alpha is possibly nonsensical, since it relies on a phylogenetic tree with a very small number of sequences. Is there a critical number of sequences below which it is better to assume a star topology than to estimate the topology (i.e. just do all the pairwise comparisons)?



One should always attempt to use a reasonable topology for ANY number of sequences. For three sequences there is only one unrooted topology (star), and for four sequences even short alignments usually provide strong signal to infer the correct one out of 3 possible topologies.
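The topology counts quoted above (one for three sequences, three for four) follow from the standard formula (2n - 5)!! for the number of unrooted, fully bifurcating topologies on n labelled tips. A small Python sketch, with a function name chosen here purely for illustration, shows how quickly the count grows:

```python
def count_unrooted_topologies(n_tips: int) -> int:
    """Number of unrooted binary tree topologies on n labelled tips: (2n - 5)!!"""
    if n_tips < 3:
        raise ValueError("the formula is applied here only for 3 or more tips")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_tips - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in range(3, 7):
        print(n, count_unrooted_topologies(n))
    # prints: 3 1, 4 3, 5 15, 6 105 (one pair per line)
```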

Sergei

