YaBB - Yet another Bulletin Board
 
 
which approach to test for positive selection?
tlefebure
Cornell University
which approach to test for positive selection?
Sep 15th, 2006 at 9:16am
 
Hello all,

The field of molecular adaptation is evolving so quickly, and you guys are developing so many nice methods, that for me, a biologist, it's becoming quite confusing (and judging by the other posts, I don't think I'm alone...). Can we try to describe a good approach to the problem of detecting positive selection using HyPhy? By that I mean: can you correct me if I'm wrong or missing something? I would much rather discover my mistakes now than in a review... Wink

Let's imagine we have a huge cluster running HyPhy and we want to know whether a gene (100 taxa x 2000 sites) has evolved under positive selection, and if so, which codons and which lineages show evidence of positive selection. The question may already look biased to a statistician (multiple hypothesis testing?), I don't know, but I think this is the approach most biologists have in mind.

From what I have read so far:

- Recombination should be considered first (Scheffler et al 2006 Bioinformatics in press, Kosakovsky Pond et al 2006 MBE in press, ...). If present, the data set should be cut into fragments with the same history, using for example GARD (GARecomb.bf?).
- Using a model that allows both dS and dN rate variation across sites is better (Kosakovsky Pond and Muse 2005 MBE 22(12)). So PAML is out, and, as far as I know, HyPhy is the only program that does this.
- A nucleotide model should be used in addition to the codon model (e.g. MG94xREV). The nucleotide model can be selected using NucModelCompare.bf (?).
- Phylogenetic uncertainty might bias the results (Pie 2006 MBE in press). If there is some uncertainty, it might be good to try several likely topologies and test their influence on the results.

So basically, let's say we have a reliable topology, the most likely model (e.g. MG94xREV), and an alignment free of recombination.
1. What should we do to test globally for positive selection? Compare the Dual model to the Dual(-), as in Sorhannus and Kosakovsky Pond (2006 JME 63) (using dNdSRateAnalysis.bf for the Dual model, a modification of it for the Dual(-), and then a parametric bootstrap to compare them?)
2. If there is positive selection, how do we determine which codons have evolved under positive selection? Using one or a combination of the SLAC, FEL and REL methods (QuickSelectionDetection.bf?)? The Bayes factor of the Dual model?
3. How do we determine whether there is lineage-specific positive selection? By mapping non-synonymous substitutions onto the topology (but that's not a test)? By running the genetic algorithm of Kosakovsky Pond and Frost (2005 MBE 22(3)) (not included in the HyPhy sources)? By comparing the Dual model to the Lineage Dual model?
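For question 1, the parametric bootstrap boils down to: fit both models to the real data, simulate replicate alignments under the fitted Dual(-) null, refit both models to each replicate, and see how often the simulated likelihood-ratio statistic reaches the observed one. A minimal sketch of just that last step in Python (the fitting and simulation themselves would be done with dNdSRateAnalysis.bf; the +1 correction is one common convention, not anything HyPhy-specific):

```python
def bootstrap_pvalue(observed_lr, null_lrs):
    """Parametric-bootstrap p-value: how often likelihood-ratio
    statistics simulated under the null model reach the value
    observed on the real data. The +1 terms avoid reporting an
    impossible p-value of exactly zero."""
    exceed = sum(1 for lr in null_lrs if lr >= observed_lr)
    return (1 + exceed) / (1 + len(null_lrs))

# e.g. an observed LR against 99 simulated null LRs, one of which
# exceeds it, gives p = (1 + 1) / (1 + 99) = 0.02
```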


Thanks, and sorry for this accumulation of questions...
Tristan
 
Sergei
Datamonkeys are forever...

UCSD
Re: which approach to test for positive selection?
Reply #1 - Sep 15th, 2006 at 10:09am
 
Dear Tristan,

We are actually in the process of writing a book chapter for a new edition of the Phylogenetic Handbook on how to do positive selection analyses, so hopefully that'll address the shortcomings.

In the meantime let me try to answer your points.

  • Recombination should indeed be considered first and dealt with as you suggest. In practice, however, the effect of low to moderate levels of recombination on the detection of selection seems to be small; but it is better to test anyway. It's probably best to run GARD on our server for now, because the analyses are written to require an MPI environment with at least 10-16 nodes. GARecomb.bf (inside TemplateBatchFiles) will run GARD for a fixed number of breakpoints, if you want to try it locally.
  • dS variation does seem pervasive, and can bias the identification of selected sites if left out. This is especially true for larger (>20 sequences) data sets. There have been a number of papers which used both PAML and our methods (datamonkey.org, for example), and they show varying levels of agreement between the two tools on the same data sets; however, our simulations (in the "Not So Different..." MBE paper) show that PAML-type models can consistently misidentify relaxed functional constraint (i.e. elevated dS and dN, but dS>dN) as positive selection. I would recommend always checking for dS rate variation, because it adds little computational cost.
  • A nucleotide model is nice to add, but its effects are fairly minor unless nucleotide substitution biases deviate strongly from HKY85. NucModelCompare.bf (or datamonkey.org) will let you do that.
  • Phylogenetic uncertainty is generally not very important (unless there is recombination); even the Pie paper says as much in the context of PAML tests. Effectively, if you have enough data to reliably infer a topology, you will also have enough data to run robust selection analyses. When you have low-diversity/divergence data or short sequences, both will suffer, but then there never was much power to detect selection (unless it is very strong). I would say that if you have several nearly equally likely topologies which are very different (i.e. the scenario in which selection analyses are not robust), there is probably recombination, gene conversion, or a serious lack of power to be concerned about. However, it is a matter of due diligence to try a couple of phylogenies if you have doubts.
  • To test for selection globally using variable dS and dN, you could use the Dual and Dual(-) models in dNdSRateAnalysis.bf (both should be there), with the chi^2 test to be on the conservative side (if the evidence for selection is strong, it will pass), and the parametric bootstrap (which is slow) for borderline cases. There is also an alternative parameterization in the Scheffler paper, with accompanying batch files.
  • The Bayes factor in the Dual model is REL. If you look at our "Not So Different..." paper, we argue that for large data sets all methods give very similar results, so you can just pick the fastest; for intermediate ones it is probably best to run all three and decide based upon method consensus; and for small ones it doesn't really matter what you do - there just isn't enough power to reliably call a single site as selected, unless you make a whole lot of modeling assumptions to boost said power (e.g. dS = 1, a specific distribution of omegas, etc.). datamonkey.org is well suited for rapid site-level analysis (there are also messages on these boards on how to run SLAC/FEL/REL locally).
  • Lineage-specific selection is a bit tricky, indeed. I do recommend using GA model selection (our server has a Web portal to run this on our cluster, and there are also batch files to run it locally), mostly because it gives you a robust way to explore the data and run multi-model inference without needing to assume that you know which branches are selected while forcing the rest of the tree to do unrealistic things. This analysis will look for evidence of alignment-wide selection along a given branch. Lineage Dual vs. Dual is a good way to test whether there is any variation in mean branch rates "somewhere" in the tree, accounting for site-to-site variation as well, but it is not well suited to identifying which branches drive the variation (except by post-hoc analyses, which are not statistically robust).
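The chi^2 comparison of the Dual and Dual(-) fits described above is an ordinary likelihood-ratio test. A self-contained Python sketch (the log-likelihoods and the degrees of freedom would come from the two dNdSRateAnalysis.bf fits; the chi^2 tail probability is computed here from the incomplete-gamma series, not by any HyPhy function):

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with df degrees of freedom,
    via the series for the regularized lower incomplete gamma."""
    a, z = df / 2.0, x / 2.0
    if z <= 0.0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15 and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))

def lrt_pvalue(lnl_alt, lnl_null, df):
    """Likelihood-ratio test: 2 * (lnL_alt - lnL_null) is compared
    to a chi^2 distribution with df degrees of freedom."""
    return chi2_sf(max(0.0, 2.0 * (lnl_alt - lnl_null)), df)
```

As noted above, the chi^2 approximation is used to stay on the conservative side; a borderline p-value is the cue to fall back on the (slow) parametric bootstrap.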
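And the Bayes factor that REL reports for a site is nothing more than the posterior odds of omega > 1 at that site divided by the prior odds. A tiny sketch (the probabilities here are made-up numbers; in practice both come out of the fitted Dual/REL model):

```python
def bayes_factor(posterior_p, prior_p):
    """Bayes factor for omega > 1 at a site: posterior odds of
    selection divided by the prior odds."""
    posterior_odds = posterior_p / (1.0 - posterior_p)
    prior_odds = prior_p / (1.0 - prior_p)
    return posterior_odds / prior_odds

# a site with posterior 0.9 under a prior of 0.1 gets a Bayes
# factor of (0.9/0.1) / (0.1/0.9) = 81
```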


HTH,
Sergei

 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
tlefebure
Re: which approach to test for positive selection?
Reply #2 - Sep 18th, 2006 at 7:25am
 
Thanks a lot Sergei! Things are clearer in my mind.

I'm very happy to learn that REL is just the application of the Dual model to identifying sites under PS. That was causing me a lot of trouble...
Regarding using datamonkey to run most of these time-consuming analyses: I'm unfortunately limited by the size of my data sets (~200 taxa) or by the number of genes to analyze per data set (~1000), which is not doable through a web interface. I have asked my university's cluster administrators to install HyPhy on one of their big clusters...

Tristan