Sergei
|
Hi Marissa,
The code is doing something funny (there may be a bug in HyPhy with regards to dynamically computed frequencies), but there is also the following issue of interpretation.
Traditional phylogenetic methods assume that the composition of frequencies in the sequence has reached equilibrium. Each substitution process (encoded by its rate matrix), if run long enough, will attain equilibrium. For example, your simulated rate matrix for dataset 2: Q= { { -0.015, 0.005, 0.005, 0.005} { 0.005, -0.02, 0.005, 0.01} { 0.005, 0.005, -0.015, 0.005} { 0.005, 0.005, 0.005, -0.015} }
Has the equilibrium frequency distribution of {{0.25,0.2,0.25,0.3}}, which simply reflects the intuitive understanding that if you start at equal frequencies, the process will redistribute C's to T's until an equilibrium has been reached.
The sequences that you simulated are far from this equilibrium distribution (they are very close to 0.25 each), hence the assumption of the model is violated and you have 'biased' rate estimates (e.g. CT = 0.01/0.3, TC = 0.005/0.2, CT/TC = 4/3 instead of 2 as you expected). This is due to how we interpret the term 'rate': namely it's the cumulative substitution rate, with the base frequency factored out so that we can directly compare the rates between data sets with different base frequencies.
I hope that makes sense, Sergei
|