This is a thread for discussing a topic to which there seems to be no agreed-upon answer in the literature. Please feel free to post your comments/suggestions/questions.
Konrad Scheffler and I have been discussing how to count the number of independent observations in a sample of N taxa and S sites for the purposes of fitting a likelihood model.
This is relevant for at least two reasons.
- Deciding whether the alignment is large enough for asymptotic results to hold, such as the approximate normality of parameter estimates and the applicability of the chi^2 distribution to likelihood ratio tests.
- Selecting models with criteria that incorporate the sample size (e.g. the small-sample AIC, i.e. AICc, and BIC).
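To make the second point concrete, here is a minimal sketch of how the choice of n enters AICc and BIC, using the standard textbook definitions. The log-likelihood, parameter count, and alignment dimensions are made-up numbers chosen only for illustration.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """Small-sample AIC; the correction term depends on the sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_likelihood, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion; the penalty grows with log(n)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical values, for illustration only.
num_taxa, num_sites = 20, 500
log_likelihood, k = -4321.0, 45

for label, n in [("n = S", num_sites), ("n = N*S", num_taxa * num_sites)]:
    print(label, round(aicc(log_likelihood, k, n), 2), round(bic(log_likelihood, k, n), 2))
```

Moving from n = S to n = N*S shrinks the AICc correction and enlarges the BIC penalty, so the two conventions can rank models differently.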
There seems to be no consensus in the literature (thanks to Konrad for these quotes):
Quote:
E.g. from the README file for MrAIC by Johan Nylander:
"In this script, sample size (n) used in AICc and BIC is assumed to be the
number of characters in the data matrix. This is probably not correct when
it comes to phylogenetic analyses (Nylander, 2004), but serve as an
approximation to the true n." (The reference Nylander 2004 contains no
discussion of this.)
And from David Posada's ProtTest manual:
"What is the sample size of a protein alignment is very unclear. ProtTest
offers different criteria for sample size determination:
- Alignment length (default).
- Number of variable sites.
- Shannon entropy summed over all alignment positions [description removed]
- Number of sequences × length of the alignment × normalized Shannon’s
entropy [description removed]
- Number of sequences × length of the alignment.
- User’s provided size."
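For a sense of how far apart these definitions can be, here is a rough sketch that computes three of the listed quantities for a toy alignment. It is not ProtTest code: the function name and the toy sequences are hypothetical, gaps and ambiguity codes are treated as ordinary characters, and the entropy-based measures are left out because their descriptions are not reproduced above.

```python
def candidate_sample_sizes(alignment):
    """Return a few candidate sample sizes for a list of aligned sequences."""
    num_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    # A site counts as variable if its column holds more than one distinct character.
    variable = sum(1 for column in zip(*alignment) if len(set(column)) > 1)
    return {
        "alignment_length": length,
        "variable_sites": variable,
        "sequences_x_length": num_seqs * length,
    }


# Toy alignment, invented for illustration.
toy = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTTCGA"]
print(candidate_sample_sizes(toy))
```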
Intuitively, at least when the topology is given and we are after estimating model parameters such as branch lengths and substitution rates, the amount of information in the alignment increases both with the number of sites and the number of taxa.
A further insight can be gained by considering the following:
- For parameters which describe the evolution along a single branch (e.g. branch lengths), the effective sample size is close to the number of sites S (per parameter). Indeed, if ancestral states were known, then inferring the length of branch 'i' would be like fitting a Markov model (with a fixed rate matrix) to S observations (a_i, b_i)_s, where a_i and b_i are the characters at the two ends of branch 'i' at site 's'. There may be more information because of correlation with neighboring branches (although this correlation, I believe, decays very rapidly as you move away from the branch of interest), but we also lose information because ancestral states are in fact unknown, and because shared parameters (see below) are inferred from the same data.
- For parameters which affect all branches (e.g. the transition/transversion ratio), the sample size should be on the order of S*N, where N is the number of sequences, because by the same logic, each branch at each site provides an independent realization of the substitution process to infer shared parameters from. Unknown ancestral states and branch lengths being inferred from the same data will lower the effective sample size.
These two points, taken together, suggest that the effective sample size for the alignment is proportional to the product NS.
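The single-branch part of this argument can be illustrated with a toy calculation. This is only a sketch under strong assumptions, and not the simulation study referred to in this post: it assumes the Jukes-Cantor (JC69) model, a single branch whose end states are both observed, and independent sites. Under those assumptions the ML branch length depends only on the fraction of sites where the two ends differ, and its variance should fall roughly as 1/S.

```python
import math
import random
import statistics


def jc69_diff_prob(t):
    """Probability that the two ends of a branch of length t differ under JC69."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def jc69_mle(num_diff, num_sites):
    """ML branch length from the number of sites whose end states differ."""
    p_hat = num_diff / num_sites
    if p_hat >= 0.75:
        return math.inf  # saturated: the likelihood keeps increasing with t
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def estimator_variance(t, num_sites, replicates, rng):
    """Sample variance of the ML branch length over simulated replicates."""
    p = jc69_diff_prob(t)
    estimates = []
    for _ in range(replicates):
        # Each site independently shows a difference with probability p.
        num_diff = sum(rng.random() < p for _ in range(num_sites))
        estimates.append(jc69_mle(num_diff, num_sites))
    return statistics.variance(estimates)


rng = random.Random(1)
true_t = 0.1  # expected substitutions per site; arbitrary choice
for s in (100, 400, 1600):
    var = estimator_variance(true_t, s, 500, rng)
    print(f"S = {s:5d}  var(t_hat) = {var:.2e}  S * var = {s * var:.3f}")
```

If the 1/S scaling holds, the last column should stay roughly constant as S grows. The sketch says nothing about the shared-parameter case, where unknown ancestral states and jointly estimated branch lengths matter.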
Some simple simulations (to be posted shortly) seem to bear this out, and I am currently trying to work out the scaling constant k in the linear relationship k*NS and to see whether it holds across a variety of settings.
Cheers,
Sergei