Hello all,
The field of molecular adaptation is evolving so quickly, and you guys are developing so many nice methods, that for me, a biologist, it's becoming quite confusing (and from reading the other posts, I don't think I'm alone...). Can we try to describe a good approach to the problem of detecting positive selection using HyPhy? By that I mean: can you correct me if I'm wrong or missing something? I would really prefer to discover my mistakes now rather than in a review...
Let's imagine we have a huge cluster running HyPhy and we want to know whether a gene (100 taxa x 2000 sites) has evolved under positive selection and, if so, which codons and which lineages show evidence of positive selection. The question may already appear biased to a statistician (multiple hypothesis testing?), I don't know, but I think most biologists have this approach in mind.
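To make the multiple-testing worry concrete: if every codon gets its own test, some correction across sites seems necessary. Below is a minimal Python sketch (not HyPhy code) of one common choice, the Benjamini-Hochberg false discovery rate procedure. The function name, the per-site p-values and the 0.05 level are all made up for illustration, and I am only assuming that a site-by-site method returns one p-value per codon.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests rejected at the given false discovery rate."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every test ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical per-codon p-values (illustration only, not real results).
site_p_values = [0.001, 0.20, 0.012, 0.04, 0.65, 0.03]
significant_sites = benjamini_hochberg(site_p_values, fdr=0.05)
print("Sites kept after correction:", significant_sites)
```

Note that sites with an uncorrected p-value below 0.05 can still be dropped after correction, which is exactly the point I am unsure how to handle with 2000 sites.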
From what I have read so far:
- Recombination should be considered first (Scheffler et al 2006 Bioinformatics in press, Kosakovsky Pond et al 2006 MBE in press, ...). If present, the data set should be cut into fragments that each share a single history, using for example GARD (GARecomb.bf?).
- Using a model that allows both dS and dN rate variation across sites is better (Kosakovsky Pond and Muse 2005 MBE 22(12)). So PAML is out, and, as far as I know, the only program to do that is HyPhy.
- A nucleotide model should be used in addition to the codon model (e.g. MG94xREV). The nucleotide model can be selected using NucModelCompare.bf (?).
- Phylogenetic uncertainty might bias the results (Pie 2006 MBE in press). If there is some uncertainty, it might be good to test several likely topologies and check their influence on the results.
So basically, let's say we have a reliable topology, the best-fitting model (e.g. MG94xREV), and an alignment free of recombination.
1. What should we do to test globally for positive selection? Compare the Dual model to the Dual(-), as in Sorhannus and Kosakovsky Pond (2006 JME 63) (using dNdSRateAnalysis.bf for the Dual, a modified version of it for the Dual(-), and then a parametric bootstrap to compare them?)
2. If there is positive selection, how do we determine which codons have evolved under positive selection? Using one or a combination of SLAC, FEL and REL methods (QuickSelectionDetection.bf?)? The Bayes Factor of the Dual model?
3. How do we determine if there is lineage-specific positive selection? By mapping non-synonymous substitutions onto the topology (but that's not a test)? By running the genetic algorithm of Kosakovsky Pond and Frost (2005 MBE 22(3)) (not included in HyPhy sources)? By comparing the Dual model to the Lineage Dual model?
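For point 1, here is how I picture the parametric bootstrap comparison, as a minimal Python sketch and not HyPhy code. It assumes the log-likelihoods of the Dual(-) (null) and Dual (alternative) models are already available for the real alignment and for each alignment simulated under the fitted null model. The function names and all the numbers are hypothetical.

```python
def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null) for nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def bootstrap_p_value(observed_lrt, simulated_lrts):
    """Fraction of null-simulated statistics at least as large as the observed one.

    The +1 in numerator and denominator counts the observed data set itself,
    so the estimated p-value is never exactly zero.
    """
    exceed = sum(1 for s in simulated_lrts if s >= observed_lrt)
    return (exceed + 1) / (len(simulated_lrts) + 1)


# Hypothetical log-likelihoods for the real alignment (not real results).
observed = lrt_statistic(lnl_null=-10250.0, lnl_alt=-10242.0)

# Hypothetical (null, alternative) log-likelihood pairs, one pair per
# alignment simulated under the fitted null model.
simulated_fits = [
    (-10100.0, -10099.0),
    (-10310.5, -10308.0),
    (-10275.0, -10274.5),
    (-10190.0, -10181.0),
]
simulated = [lrt_statistic(null, alt) for null, alt in simulated_fits]

p_value = bootstrap_p_value(observed, simulated)
print(f"LRT = {observed:.1f}, bootstrap p = {p_value:.2f}")
```

In practice there would be many more than four replicates, and each replicate means refitting both models, which is why I mention the cluster. Is this the right way to think about it?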
Thanks, and sorry for this accumulation of questions...
Tristan