HyPhy message board - [DISCUSSION] How many samples are in an alignment?

Sergei

[DISCUSSION] How many samples are in an alignment?
Nov 27th, 2006 at 11:43am

This is a thread for discussing a topic on which there seems to be no agreed-upon answer in the literature. Please feel free to post your comments/suggestions/questions.

Konrad Scheffler and I have been discussing how to count the number of independent observations in a sample of N taxa and S sites for the purposes of fitting a likelihood model.

This is relevant for at least two reasons.

Deciding whether the alignment is large enough for good asymptotics to apply, such as the approximate normality of parameter estimates and the use of the chi^2 distribution for likelihood ratio testing.
Model selection based on criteria which incorporate the number of samples (e.g. small sample AIC and BIC).
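
As a concrete reminder of where the sample size enters, here is a minimal Python sketch of the standard AIC, AICc and BIC formulas with n passed in explicitly. The function names are mine, and the log-likelihood, parameter count and alignment dimensions in the usage lines are made-up illustration values, not results from any real alignment.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion; does not depend on sample size."""
    return -2.0 * log_likelihood + 2.0 * k


def aicc(log_likelihood, k, n):
    """Small-sample AIC; the correction term shrinks as n grows."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_likelihood, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion; the penalty grows with log(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show how the choice of n matters.
    log_likelihood, k = -5000.0, 45
    n_taxa, n_sites = 20, 1000
    for label, n in (("n = S", n_sites), ("n = N*S", n_taxa * n_sites)):
        print(label, n, aicc(log_likelihood, k, n), bic(log_likelihood, k, n))
```

Note that the two criteria react in opposite directions: a larger n makes the AICc correction smaller but the BIC penalty larger, so the choice between S and NS can push model selection either way.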

There seems to be no consensus in the literature (thanks to Konrad for these quotes):

Quote:

E.g. from the README file for MrAIC by Johan Nylander:

"In this script, sample size (n) used in AICc and BIC is assumed to be the
number of characters in the data matrix. This is probably not correct when
it comes to phylogenetic analyses (Nylander, 2004), but serve as an
approximation to the true n." (The reference Nylander 2004 contains no
discussion of this.)

And from David Posada's ProtTest manual:

"What is the sample size of a protein alignment is very unclear. ProtTest
offers different criteria for sample size determination:

- Alignment length (default).
- Number of variable sites.
- Shannon entropy summed over all alignment positions [description removed]
- Number of sequences × length of the alignment × normalized Shannon’s
entropy [description removed]
- Number of sequences × length of the alignment.
- User’s provided size."
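
To make the candidates in the list above concrete, here is a rough Python sketch that computes some of them for a toy alignment given as a list of equal-length strings. The entropy here is plain per-column Shannon entropy in bits with gaps treated as ordinary characters; that is my assumption, and ProtTest's own definition may differ since the descriptions were removed from the quote. The normalized-entropy criterion is left out because its normalization is not given here.

```python
import math
from collections import Counter


def columns(alignment):
    """Return the alignment columns; all sequences must have equal length."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    return list(zip(*alignment))


def column_entropy(column):
    """Shannon entropy (bits) of the character frequencies in one column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sample_size_candidates(alignment):
    """Candidate sample sizes; the dictionary keys are hypothetical labels."""
    cols = columns(alignment)
    n_seqs, n_sites = len(alignment), len(cols)
    return {
        "alignment_length": n_sites,
        "variable_sites": sum(1 for col in cols if len(set(col)) > 1),
        "summed_entropy": sum(column_entropy(col) for col in cols),
        "sequences_x_length": n_seqs * n_sites,
    }


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACTA", "AGTA"]  # made-up data
    for name, value in sample_size_candidates(toy).items():
        print(name, value)
```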

Intuitively, at least when the topology is given and we are after estimating model parameters such as branch lengths and substitution rates, the amount of information in the alignment increases both with the number of sites and the number of taxa.

A further insight can be gained by considering the following:

For parameters which describe the evolution along a single branch (e.g. branch lengths), the effective sample size is close to the number of sites S (per parameter). Indeed, if ancestral states were known, then inferring the length of branch 'i' would be intuitively like fitting a Markov model (with a fixed rate matrix) to S observations (a_i, b_i)_s, where a_i and b_i are the characters at the two ends of branch 'i' at site 's'. There may be more information because of correlation with neighboring branches (although this correlation, I believe, decays very rapidly as you move away from the branch of interest), but we also lose information because ancestral states are not actually known, and because shared parameters (see below) are inferred from the same data.
For parameters which affect all branches (e.g. the transition/transversion ratio), the sample size should be on the order of S*N, where N is the number of sequences, because by the same logic, each branch at each site provides an independent realization of the substitution process to infer shared parameters from. Unknown ancestral states and branch lengths being inferred from the same data will lower the effective sample size.

These two points, taken together, suggest that the effective sample size for the alignment is proportional to the product NS.
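
The idealized version of this argument can be checked with a toy simulation. The Python sketch below is not the simulation work mentioned in this post; it assumes the best case in which ancestral states are known and every (branch, site) pair is an independent draw that shows a change with the same probability p_change, a stand-in for a single shared parameter. All names and numbers are hypothetical. Under those assumptions the variance of the estimate should fall roughly as 1/(branches × sites), and an unrooted binary tree with N tips has 2N - 3 branches, so this is on the order of 1/(NS).

```python
import random
import statistics


def simulate_change_fraction(n_branches, n_sites, p_change, rng):
    """Estimate p_change as the fraction of (branch, site) pairs that changed."""
    total = n_branches * n_sites
    changes = sum(1 for _ in range(total) if rng.random() < p_change)
    return changes / total


def estimator_variance(n_branches, n_sites, p_change, replicates, seed):
    """Sample variance of the estimate across independent replicate data sets."""
    rng = random.Random(seed)
    estimates = [
        simulate_change_fraction(n_branches, n_sites, p_change, rng)
        for _ in range(replicates)
    ]
    return statistics.variance(estimates)


if __name__ == "__main__":
    n_sites, p_change = 50, 0.2  # made-up settings
    for n_taxa in (5, 10, 20):
        n_branches = 2 * n_taxa - 3  # branches in an unrooted binary tree
        var = estimator_variance(n_branches, n_sites, p_change, 400, seed=n_taxa)
        # If the 1/(branches * sites) scaling holds, this product stays
        # roughly constant, near p_change * (1 - p_change).
        print(n_taxa, n_branches, var * n_branches * n_sites)
```

Unknown ancestral states and jointly estimated branch lengths are exactly what this toy ignores, so it says nothing about the size of the scaling constant in a real analysis.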

Some simple simulations (to be posted shortly) seem to bear this out, and I am currently trying to work out the scaling constant k in the linear relationship k*NS and see if it holds for a variety of settings.

Cheers,
Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

David

Re: [DISCUSSION] How many samples are in an alignment?
Reply #1 - Dec 4th, 2006 at 3:08am

I think the sample size is between S and NS. However, I am not sure about setting NS as the standard. Does an alignment with 20 taxa and 1000 sites really have a sample size of 20000? Hmm. But the arguments above make sense, and NS might be a better approximation than S, with N being the number of haplotypes.

From Posada and Buckley (2004): "Both in the AICc and the BIC descriptions above, the total number of characters was used as an estimate of sample size. However, effective sample sizes in phylogenetic studies are poorly understood, and depend on the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et al., 2000). Characters in an alignment will often not be independent, so using the total number of characters as a surrogate for sample size (Minin et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using only the number of variable sites as an estimate of sample size is a more conservative approach, but could be an underestimate (note that all sites are used when estimating base frequencies or the proportion of invariable sites). Indeed, sample size also depends on the number of taxa. Importantly, sample size can have an effect on the outcome of model selection with the AICc. In our example above, if we were to use the number of variable characters (301 sites) as the sample size, instead of the total number of characters (1927 sites), the best AICc model would not change, but the second and third AIC models would exchange their rankings."

Best,

D.

konrad

Re: [DISCUSSION] How many samples are in an alignment?
Reply #2 - Feb 2nd, 2007 at 11:25am

Hmm, I never got around to looking up this thread at the time Sergei and I were discussing it. As I recall, my opinion was that NS probably makes sense for estimating parameters when the topology is known, but for estimating topology things are more complicated because the topology is a discrete "parameter". For topology estimation, one either has to use a sample size less than NS, or treat the topology as containing more than one estimable parameter (I remember having some ideas on how one might do this and may be able to dig them up if anyone is interested, but it's not clear that that's the right way to go).

Konrad
