Sergei
Hi there,
HyPhy uses the standard (Shannon) entropy: the sum of -p log(p), where the sum is over nucleotides (A, C, G, T), p is the frequency of a given nucleotide, and the log has base 2. The only trick is how HyPhy deals with ambiguous bases (e.g. R, Y): each one adds fractional counts to its possible resolutions (e.g. an R would contribute 0.5 A and 0.5 G).
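To make the calculation concrete, here is a minimal Python sketch of per-site entropy with fractional counts for ambiguous bases. It is not HyPhy's own code. The function name `site_entropy` and the IUPAC lookup table are made up for illustration, and two choices are assumptions on my part: every ambiguity code (including N) is split equally among its resolutions, and gaps or unrecognized characters are skipped.

```python
import math

# IUPAC nucleotide codes mapped to the bases they can resolve to.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def site_entropy(column):
    """Base-2 entropy of one alignment column (a string of characters).

    Ambiguous bases add equal fractional counts to each resolution,
    e.g. R adds 0.5 to A and 0.5 to G.
    Assumption: gaps and unknown characters are ignored.
    """
    counts = dict.fromkeys("ACGT", 0.0)
    for ch in column.upper():
        resolutions = IUPAC.get(ch)
        if resolutions is None:
            continue  # assumption: skip gaps and anything unrecognized
        for base in resolutions:
            counts[base] += 1.0 / len(resolutions)

    total = sum(counts.values())
    if total == 0:
        return 0.0

    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

For example, `site_entropy("ACGT")` gives the maximal value of 2 bits, and a column consisting of a single R gives 1 bit because it counts as half A and half G.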
There are a number of ways to present a plot of sequence variation. A moving average is reasonable, but the right choice depends on what you are trying to show. I would actually encourage you NOT to use entropy, because it ignores the fact that all sequences are related by a phylogenetic tree. As an extreme case, consider a site with 25% of each base: the entropy is 2, which is the maximal value for nucleotide data. However, if the sequences are related by a tree like this ((A,C),(G,T)), where A stands for all the sequences that have an A, etc., then only 3 substitutions are needed to explain the observed pattern.
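The count of 3 substitutions can be checked with a small parsimony sketch. I am using Fitch's algorithm here as one convenient way to count the minimum number of changes on a binary tree; this is my choice for illustration, not something HyPhy does for you in this context. The function name `fitch` and the nested-tuple tree encoding are hypothetical, and each leaf stands for a whole clade of sequences sharing the same base.

```python
def fitch(tree):
    """Minimum substitutions for one site on a rooted binary tree.

    `tree` is either a leaf (a one-letter string such as "A") or a
    pair (left, right) of subtrees. Returns (state_set, substitutions),
    where state_set is the Fitch set of candidate states at the node.
    """
    if isinstance(tree, str):
        return {tree}, 0

    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)

    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change.
        return shared, left_cost + right_cost
    # The children disagree: one substitution is required at this node.
    return left_states | right_states, left_cost + right_cost + 1


states, substitutions = fitch((("A", "C"), ("G", "T")))
print(substitutions)
```

So the site with the highest possible entropy is explained by just three changes on this tree, which is the point: entropy alone says nothing about how many evolutionary events produced the pattern.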
I would recommend that you use some estimate of site-by-site evolutionary rates, and then smooth that (e.g. using the SiteRates.bf standard analysis). Better yet, just estimate MEAN rates (or whatever you are interested in) using phylogenetic likelihood (implemented in SlidingWindowAnalysis.bf) -- the output is a CSV file which reports sliding window averages.
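If you go the first route and smooth per-site rate estimates yourself, a plain moving average might look like the sketch below. This is only the smoothing step, assuming you already have a list of per-site rates from some analysis; it is not what SlidingWindowAnalysis.bf does internally, and the function name `moving_average` is just for illustration.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values.

    Returns len(values) - window + 1 averages; the first covers
    sites 0..window-1, the next is shifted by one site, and so on.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")

    running_total = sum(values[:window])
    averages = [running_total / window]
    for i in range(window, len(values)):
        # Slide the window: add the new site, drop the oldest one.
        running_total += values[i] - values[i - window]
        averages.append(running_total / window)
    return averages
```

A wider window gives a smoother curve but blurs sharp local changes in rate, so the window size is something to pick based on what you want the plot to show.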
Sergei