HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
[DISCUSSION] How many samples are in an alignment?
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1164656629
Message started by Sergei on Nov 27th, 2006 at 11:43am
Title: [DISCUSSION] How many samples are in an alignment? Post by Sergei on Nov 27th, 2006 at 11:43am
This is a thread for discussing a topic to which there seems to be no agreed-upon answer in the literature. Please feel free to post your comments, suggestions, and questions.
Konrad Scheffler and I have been discussing how to count the number of independent observations in a sample of N taxa and S sites for the purposes of fitting a likelihood model. This is relevant for at least two reasons.
There seems to be no consensus in the literature (thanks to Konrad for these quotes) Quote:
Intuitively, at least when the topology is given and we are after estimating model parameters such as branch lengths and substitution rates, the amount of information in the alignment increases both with the number of sites and the number of taxa. A further insight can be gained by considering the following:
These two points, taken together, suggest that the effective sample size for the alignment is proportional to the product NS. Some simple simulations (to be posted shortly) seem to bear this out, and I am currently trying to work out the scaling constant k in the linear relationship k*NS and to see whether it holds for a variety of settings.
Cheers,
Sergei
Title: Re: [DISCUSSION] How many samples are in an alignm Post by David on Dec 4th, 2006 at 3:08am
I think the sample size is between S and NS. However, I am not sure about setting NS as the standard. Would an alignment with 20 taxa and 1000 sites really have a sample size of 20000? Hmm. Still, the arguments above make sense, and NS might be a better approximation than S, with N being the number of haplotypes.
From Posada and Buckley (2004): "Both in the AICc and the BIC descriptions above, the total number of characters was used as an estimate of sample size. However, effective sample sizes in phylogenetic studies are poorly understood, and depend on the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et al., 2000). Characters in an alignment will often not be independent, so using the total number of characters as a surrogate for sample size (Minin et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using only the number of variable sites as an estimate of sample size is a more conservative approach, but could be an underestimate (note that all sites are used when estimating base frequencies or the proportion of invariable sites). Indeed, sample size also depends on the number of taxa. Importantly, sample size can have an effect on the outcome of model selection with the AICc. In our example above, if we were to use the number of variable characters (301 sites) as the sample size, instead of the total number of characters (1927 sites), the best AICc model would not change, but the second and third AIC models would exchange their rankings."
Best,
D.
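To make the sample-size point concrete, here is a minimal sketch (not from the thread or the paper) of how the choice of n feeds into the standard formulas AIC = 2k - 2 ln L, AICc = AIC + 2k(k+1)/(n-k-1), and BIC = k ln(n) - 2 ln L. The two models, their log-likelihoods, and their parameter counts are made-up numbers for illustration only; the 20 taxa and 1000 sites are taken from the example in the post above, and the function names are hypothetical.

```python
import math


def aic(log_l, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_l


def aicc(log_l, k, n):
    """Small-sample corrected AIC; only defined when n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_l, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(log_l, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_l


# Hypothetical fits (made-up values): name -> (log-likelihood, free parameters)
MODELS = {"simple": (-5000.0, 10), "rich": (-4985.0, 14)}

# Alignment dimensions from the 20 taxa x 1000 sites example in the thread
N_TAXA, N_SITES = 20, 1000


def best_model(models, n, criterion):
    """Return the name of the model with the lowest criterion score at sample size n."""
    return min(models, key=lambda name: criterion(*models[name], n))


if __name__ == "__main__":
    for label, n in (("S", N_SITES), ("N*S", N_TAXA * N_SITES)):
        print(label, n,
              "BIC picks", best_model(MODELS, n, bic),
              "| AICc picks", best_model(MODELS, n, aicc))
```

With these made-up numbers, BIC prefers the richer model when n = S and the simpler one when n = NS, because its penalty grows with ln(n). The AICc correction term shrinks as n grows, so AICc moves toward plain AIC under the larger sample size.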
Title: Re: [DISCUSSION] How many samples are in an alignm Post by konrad on Feb 2nd, 2007 at 11:25am
Hmm, I never got around to looking up this thread at the time Sergei and I were discussing it. As I recall, my opinion was that NS probably makes sense for estimating parameters when the topology is known, but for estimating topology things are more complicated because of it being a discrete "parameter". For topology estimation, one either has to use a sample size less than NS, or treat the topology as containing more than one estimable parameter (I remember having some ideas on how one might do this and may be able to dig them up if anyone is interested - but it's not clear that that's the right way to go).
Konrad