HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1157562913

Message started by Sarah on Sep 6th, 2006 at 10:15am

Title: Review methods for dN and dS on trees
Post by Sarah on Sep 6th, 2006 at 10:15am
I'm new to HyPhy and have been trying to understand the array of different methods for inferring dN and dS on trees. I've looked at the static HyPhy documentation, messages here, and the batch files themselves, and I'm still unclear.

In particular, I was wondering if you could explain the differences between:

  • dNdSRateAnalysis.bf: What exactly is being done here? What does it mean to test "all available models"? What are the default initial value options? How long should this analysis take? I've been running it for MG94-HKY85 on 253 sequences (984 nucs each) with a 3x3 gamma dist for syn and nonsyn on my Intel iMac; three models have been tested in 90 h.
  • dNdSpost.bf
  • post_sns.bf
  • Loading a partitioned codon file, assigning a tree MG94xHKY85..._Rates, building a likelihood function, and optimizing the function as described on p. 27 of the "HyPhy: Hypothesis Testing Using Phylogenies" documentation.
  • Because of my calculations I can't see the Standard Analyses menu now, but I believe there are similar-looking dNdS calculators under the Codon and Positive Selection submenus. A few lines on each (or references to where they are described) would be very appreciated.

Please feel free to direct me to explanatory sources I might have missed. Apologies for the broad questions--I can be more specific about what I'm trying to calculate, but I thought others might benefit from an overview too.

Thanks,
Sarah

Title: Re: Review methods for dN and dS on trees
Post by artpoon on Sep 6th, 2006 at 12:49pm
Dear Sarah,

Hi, I'm covering for Sergei this week while he's abroad.

Regarding the batch file dNdSRateAnalysis.bf:
• HyPhy attempts to fit one of several available codon substitution rate models to your data.  The parameters of the codon rate matrix are estimated as either global variables, or local to each branch of your tree.  
• By selecting "Run all available models" at the "Rate Variation Options" window, HyPhy will iterate through all five possible models of rate variation across sites (starting at line 530) so that you can compare their likelihoods.
• The option to select between default or randomized initial values for the rate distribution parameters (line 407) actually appears to be an unused option -- possibly left-over from earlier versions of the batch file.
• Given the number of sequences that you are trying to fit, and the complexity of the model that you're fitting, the amount of time elapsed doesn't seem unreasonable for your computer.  (There are a lot of parameters being fit to a lot of data!)  You might want to keep an eye on the messages.log file for error messages, just to be sure that nothing's gone awry.

I can't find the batch file dNdSpost.bf anywhere on my computer!  Is this from an earlier version of HyPhy?

As far as I can tell, post_sns.bf is a post-processing batch file that generates and displays trees based specifically on either synonymous or non-synonymous rates of substitution that were estimated from fitting a codon model.  In other words, one would use this after executing something like dNdSRateAnalysis.bf.

The step-by-step procedure for fitting a codon model to a codon partition is, as you say, in the documentation.  This is basically equivalent to dNdSRateAnalysis.bf, except executed through the HyPhy GUI.

Under the "Codon Selection" submenu, there is a batch file dNdSBivariateRateAnalysis.bf which applies the site-specific estimation of synonymous substitution rates described in Kosakovsky Pond and Muse (2005) MBE 22(12):2375.

Under the "Positive Selection" submenu, the batch file QuickSelectionDetection.bf applies methods described in Kosakovsky Pond et al. (2005) MBE 22(5): 1208.  This first fits a codon model to the data in a similar manner to dNdSRateAnalysis.bf before reconstructing ancestral states for inferring the number of NS and S substitutions per site.
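
As a rough illustration of the counting step described above (not the actual QuickSelectionDetection.bf code), the sketch below classifies the change between an ancestral and a descendant codon on one branch as synonymous or nonsynonymous. It assumes the standard genetic code and that the ancestral codons have already been reconstructed; the function names are hypothetical, and changes involving more than one nucleotide are simply flagged rather than resolved along a pathway.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_change(ancestor, descendant):
    """Classify the codon change along a single branch (hypothetical helper).

    Returns 'none', 'synonymous', 'nonsynonymous', or 'multiple' when more
    than one nucleotide differs. Codons must use uppercase A, C, G, T only.
    """
    diffs = sum(a != b for a, b in zip(ancestor, descendant))
    if diffs == 0:
        return "none"
    if diffs > 1:
        return "multiple"
    same_aa = CODE[ancestor] == CODE[descendant]
    return "synonymous" if same_aa else "nonsynonymous"


def count_changes(branch_pairs):
    """Tally change classes over (ancestor, descendant) codon pairs."""
    counts = {"none": 0, "synonymous": 0, "nonsynonymous": 0, "multiple": 0}
    for ancestor, descendant in branch_pairs:
        counts[classify_change(ancestor, descendant)] += 1
    return counts
```

For example, `classify_change("CTT", "CTC")` is a Leu to Leu change and counts as synonymous, while `classify_change("CTT", "CCT")` is Leu to Pro and counts as nonsynonymous.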

Okay!  I'm sure Sergei could provide a much more informed overview, but I hope this helps you out a bit.  Lemme know if you need more detailed explanations on stuff.

- Art.


Title: Re: Review methods for dN and dS on trees
Post by Sarah on Sep 7th, 2006 at 11:38am
Thanks, Art! I'm still trying to learn the HyPhy batch language (and some of the techniques in this area) and really appreciate your help.

On dNdSRateAnalysis.bf:
  • Though dNdSRateAnalysis.bf compares AIC for different models of nonsyn and syn substitution rates, I still had to choose between MG94xREV v. MG94 v. MG94xHKY85, etc. I could/should have used CodonModelCompare.bf to choose the best instantaneous rate matrix model for the dNdSRateAnalysis.bf, right? (In other words, dNdSRateAnalysis.bf does not consider the appropriateness of my rate matrix.)
  • To obtain bootstraps and variance for dN and dS values on branches, one uses bootstrap.bf?
  • I also wanted to check that it is tractable to create a partition and recalculate dN/dS for every branch with respect to that partition.
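
The AIC comparison mentioned in the first point above can be sketched as follows. This is a generic illustration, not HyPhy's code: the log-likelihoods and parameter counts would come from the fitted models, and what to use as the sample size in the small-sample correction (here `n_obs`, commonly the number of alignment sites) is an assumption.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def aic_c(log_likelihood, n_params, n_obs):
    """Small-sample corrected AIC; requires n_obs > n_params + 1."""
    if n_obs <= n_params + 1:
        raise ValueError("n_obs must exceed n_params + 1")
    correction = 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic(log_likelihood, n_params) + correction


def rank_models(fits):
    """Sort models by AIC.

    fits maps a model name to (log_likelihood, n_params).
    Returns a list of (aic, name) pairs, best model first.
    """
    return sorted((aic(ll, k), name) for name, (ll, k) in fits.items())
```

A model with more parameters wins only if its log-likelihood improves by more than the number of extra parameters.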

On dNdSBivariateRateAnalysis.bf:
  • It seems to me that the last two models ("Dual" and "Lineage Dual") of dNdSRateAnalysis are similar, if not identical, to the bivariate rate analyses described in Kosakovsky Pond & Muse (2005) MBE 22(12):2375. It's confusing that there's a separate batch file for the bivariate analyses. Does dNdSBivariateRateAnalysis.bf do anything different from the last two models of dNdSRateAnalysis? Does the bivariate version automatically use discrete distributions instead of gamma distributions? I don't recall being prompted to choose which to use.

dNdSpost.bf was referenced here: [link unavailable]

My goal is to compare dN/dS rates across the branches of a tree, comparing results from partitioned and nonpartitioned datasets. I hope to do this with the bivariate analyses and bootstrap the confidence intervals. There are other files that also look like they could assess significant differences between dN/dS on branches: BranchClassDNDS.bf, SelectionLRT.bf, TestBranchDNDS.bf, MRPositiveSelection.bf, SubtreeSelectionComparison.bf. I don't quite see their purpose if I can show differences in dNdSBivariateRateAnalysis.bf. Do they differ hugely in assumptions or power? Are they documented anywhere other than the message boards and the HyPhy user manuals?

I have also run FEL on the data to obtain selected codons. Is there an easy way to map inferred substitutions on the tree? Is there a way to map the (dis)appearance of certain motifs on the tree?

Thanks again so much for your help.

Sarah

p.s. Computer froze after 98 h of dNdSRateAnalysis.bf, so I restarted with dNdSBivariateRateAnalysis.bf 15 h ago. Two odd things: first, the program timer isn't running, though the "LF Optimization. Value X and Y evals/sec" changes every hour or two, and there's occasionally movement on the progress bar. Second, the status light remains yellow. I encountered similar issues with CodonModelCompare.bf, except that there, after the first minute, I saw no sign for the next 20 min or so that anything was running. Is this a problem with my computer (2 GHz Intel iMac), the latest Universal Binary version of HyPhy, or am I misinterpreting something?

Title: Re: Review methods for dN and dS on trees
Post by Sarah on Sep 7th, 2006 at 11:56am
Regarding the difference between dNdSRateAnalysis.bf and dNdSBivariateRateAnalysis.bf, one comment that concerns me is


Quote:
Dear Albert,

I don't think dSdNtree will work properly with the general bivariate analysis, because this analysis uses a fundamentally different way to set up the likelihood function...; I'll take a look into the cause of the crash and get back to you.

Cheers,
Sergei

(from [link unavailable])

This suggests there are major differences between the two implementations, but I can't find where they are described.

Thanks again,
Sarah

Title: Re: Review methods for dN and dS on trees
Post by artpoon on Sep 7th, 2006 at 2:21pm
Dear Sarah,

• Sure, it seems reasonable to me to apply CodonModelCompare.bf before running dNdSRateAnalysis.bf.

• I don't think bootstrap.bf will yield what you're looking for.  Instead, try simpleBootstrap.bf (please refer to [link unavailable]).

• Again, a partition is handled like any other data set filter in HyPhy, so there's no reason why it shouldn't work with any batch file.
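
For the bootstrap point above, here is a minimal sketch of a nonparametric bootstrap over codon columns. It is not what simpleBootstrap.bf does internally (the thread does not say); it only illustrates the general idea, assuming an in-frame alignment held as a dictionary of equal-length nucleotide strings. All names are hypothetical.

```python
import math
import random


def bootstrap_codon_alignment(alignment, rng):
    """Resample codon columns with replacement.

    alignment maps sequence name to an in-frame nucleotide string; all
    strings must have the same length, a multiple of 3. The same sampled
    columns are used for every sequence so that sites stay aligned.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all have the same length")
    n_nt = lengths.pop()
    if n_nt % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    n_codons = n_nt // 3
    columns = [rng.randrange(n_codons) for _ in range(n_codons)]
    return {
        name: "".join(seq[3 * c:3 * c + 3] for c in columns)
        for name, seq in alignment.items()
    }


def percentile_interval(values, level=0.95):
    """Simple percentile interval from bootstrap replicate estimates."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    low = ordered[int(math.floor(tail * (len(ordered) - 1)))]
    high = ordered[int(math.ceil((1.0 - tail) * (len(ordered) - 1)))]
    return low, high
```

Each replicate alignment would be refitted with the same model, and the per-branch dN and dS estimates collected across replicates would then be passed to `percentile_interval`.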

More later, =)
- Art.

Title: Re: Review methods for dN and dS on trees
Post by Simon on Sep 7th, 2006 at 2:23pm
Dear Sarah,

Phew! That's a lot of questions...I thought that I would chip in too...


Quote:
My goal is to compare dN/dS rates across the branches of a tree, comparing results from partitioned and nonpartitioned datasets. I hope to do this with the bivariate analyses and bootstrap the confidence intervals. There are other files that also look like they could assess significant differences between dN/dS on branches: BranchClassDNDS.bf, SelectionLRT.bf, TestBranchDNDS.bf, MRPositiveSelection.bf, SubtreeSelectionComparison.bf. I don't quite see their purpose if I can show differences in dNdSBivariateRateAnalysis.bf. Do they differ hugely in assumptions or power? Are they documented anywhere other than the message boards and the HyPhy user manuals?


We had a paper looking at selection varying both across branches and across sites [link unavailable], although, unlike the 'branch-sites' sorts of models, we assumed that the distributions across branches and sites were independent. Is this what you want?

MFPositiveSelection is an old batch file used to estimate dN/dS jointly across multiple datasets (like HIV sequences from different infected individuals). The different branch batch files really depend on which branches you want to look at; in the tips of the tree, an isolated branch, a subtree, or some more complicated setup.

Best
Simon

Title: Re: Review methods for dN and dS on trees
Post by Simon on Sep 7th, 2006 at 2:46pm
Dear Sarah,

It's probably a reasonable approximation to estimate the nucleotide biases using a nucleotide model (NucModelCompare), which will be much faster than using the codon model (CodonModelCompare). I'd be interested to hear if they came up with different answers for the best fitting nucleotide model.

Rather than estimate confidence intervals using bootstrap, I would use profile likelihood confidence intervals, which is going to be much faster.
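
To show what a profile likelihood confidence interval is, here is a small self-contained sketch on a toy problem (the mean of normally distributed data, with the variance as a nuisance parameter), not on a codon model. For each candidate value of the parameter of interest, the nuisance parameter is re-maximised, and the 95% interval is the set of values whose profile log-likelihood lies within 1.92 units of the maximum (half of 3.84, the 95% point of a chi-square distribution with one degree of freedom). Function names are hypothetical.

```python
import math


def profile_loglik(mu, data):
    """Profile log-likelihood of the mean (additive constants dropped).

    For a fixed mu the variance that maximises the normal likelihood is
    the mean squared deviation from mu, so it can be profiled out directly.
    """
    n = len(data)
    s2 = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * math.log(s2)


def profile_ci(data, drop=1.92, tol=1e-9):
    """Return (lower, mle, upper) for the mean; data must not be constant."""
    mle = sum(data) / len(data)
    target = profile_loglik(mle, data) - drop

    def bound(direction):
        # Step outwards until the profile falls below the target...
        step = 1.0
        inside, outside = mle, mle + direction * step
        while profile_loglik(outside, data) > target:
            step *= 2.0
            outside = mle + direction * step
        # ...then bisect to locate the crossing point.
        while abs(outside - inside) > tol:
            mid = 0.5 * (inside + outside)
            if profile_loglik(mid, data) > target:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    return bound(-1.0), mle, bound(1.0)
```

Only one optimisation per trial value is needed, instead of refitting the whole model for each of hundreds of bootstrap replicates, which is where the speed advantage comes from.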

If you know which branch(es) to look at, try using TestBranchDNDS, which will allow you to specify dual rate variation (using two independent beta-gamma distributions), a custom nucleotide bias, and to specify one or more branches, although I would start assuming no site-to-site rate variation first, if that is just a nuisance term. Following the analysis, you can then plot dN and dS trees (Analyses>Results>Syn and nonsyn trees), obtain profile likelihood confidence intervals (under Analyses>Results>Variance Estimates) etc.

Best
Simon

Title: Re: Review methods for dN and dS on trees
Post by Sarah on Sep 7th, 2006 at 4:44pm
Hi, Art & Simon,

First, your GA approach is impressive and ideally what I would do, but I don't have the skills/time at the moment. If these preliminary analyses are suggestive, I would like to pursue it. (I assume it's not implemented anywhere in HyPhy.)

TestBranchDNDS.bf could be bad here: I don't want to specify branches a priori because I expect a lot of variation in selection across the tree. I was hoping that dNdSBivariateRateAnalysis.bf (or the last model in dNdSRateAnalysis.bf) would provide a way around this problem. Lineage Dual is obviously not as nice as pulling from discrete categories of omegas, but it's a start... though I'm not sure if it's what I'm running now... or if anything's running... or exactly what form the output will take. It would definitely be good to run independent branch and site models and compare them (I suspect I'll need both branch and site variation). Will later try the free ratio model in PAML.

I'll test NucModelCompare.bf and CodonModelCompare.bf once dNdSBivariateRateAnalysis.bf has finished.

Thanks again for your help; I wish I weren't so new to this. Rather than peppering you with more questions, I will focus on understanding the batch code and linking the literature to the options in the software.

Thanks again,
Sarah

Title: Re: Review methods for dN and dS on trees
Post by Simon on Sep 7th, 2006 at 5:02pm
Dear Sarah,

The GA branch analyses are available for download separately [link unavailable]. This approach is only feasible for a small number of sequences (I'd say 20 or less, in which case you could run the analyses on our cluster - it's very easy), and I'd use no site-to-site rate variation to begin with.

Best
Simon

Title: Re: Review methods for dN and dS on trees
Post by Simon on Sep 7th, 2006 at 5:27pm
Dear Sarah,

One more thing; you can fit models allowing different dN/dS classes for each branch in HyPhy by specifying 'local' models rather than 'global' models. This is easy to do in the graphical user interface - if the model you want isn't there, you can make your own up (see [link unavailable]). You can fit a 'local' model, with separate dN/dS categories per branch in much the same way as described in one of Sergei's excellent tutorials ([link unavailable]).
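
To get a feel for what 'local' costs in parameters, the sketch below counts dN/dS parameters under each choice, assuming a fully resolved (bifurcating) unrooted tree and one dN/dS ratio per branch in the local case. The function names are hypothetical.

```python
def n_branches_unrooted(n_tips):
    """Number of branches in a fully resolved unrooted tree."""
    if n_tips < 3:
        raise ValueError("need at least 3 tips")
    return 2 * n_tips - 3


def omega_parameter_count(n_tips, local):
    """dN/dS parameters: one per branch if local, a single shared one if global."""
    return n_branches_unrooted(n_tips) if local else 1
```

For the 253-sequence alignment mentioned at the start of the thread, this gives 503 dN/dS parameters under a local model against 1 under a global model.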

Have fun learning about HyPhy!
Simon

Title: Re: Review methods for dN and dS on trees
Post by Sarah on Sep 13th, 2006 at 12:43pm
Thanks again for all the help. I've been doing a lot of rereading and my earlier confusion is slowly resolving.

Is the GA then your only "branch-site" model, i.e., model that infers which codons have been selected on particular branches?

Is there a reference for dNdSBivariateRateAnalysis.bf?

I've a single Athlon processor chugging away on CodonModelCompare.bf. I'll let you know the results when they're in.

Sarah

p.s. As an aside, I started a 13 sequence, 984 bp job running on your GA server on Friday. Between Friday and Tuesday, I never got beyond the "This page will update every 15 minutes until your program starts running" page, even though it was clearly running when I checked the job queue, and I refreshed my browser's cache regularly. Today the same page yields "Not found" and the job is no longer running. Did I lose the results? (Did they ever exist?)

Title: Re: Review methods for dN and dS on trees
Post by Sergei on Sep 13th, 2006 at 1:12pm
Dear Sarah,

GA-Branch is not a branch-site method, it looks for signatures of alignment-wide selection at the level of a single branch. You can include site-by-site rate variation as well, but it will be independent of the branch-by-branch rate variation (i.e. a 'slow' site will be slow in all branches of the tree). I am not sure why your analysis died - would you please try it again and let me know if the problem persists?

I did implement a PAML style Branch-Site model upon request (search the message boards way back and you should find the reference), in case you want to try that.

There is no reference for dNdSBivariateRateAnalysis.bf because it is an as yet unpublished method - it is very similar in spirit to dNdSRateAnalysis.bf (which is described in the paper we wrote with Spencer Muse in MBE), except that dN and dS are no longer assumed to come from independent distributions, but rather from a general discrete bivariate distribution. The key difference is that dN and dS can (and will in general) co-vary in the dNdSBivariateRateAnalysis.bf analysis. Another thing about this analysis is that selection strength does not vary along the tree.
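
The distinction between independent distributions and a general discrete bivariate distribution can be illustrated numerically. In the sketch below (hypothetical names and made-up rate values, not HyPhy code), a joint distribution built as the product of two independent marginals always has zero covariance between dS and dN, whereas a general set of weighted (dS, dN) classes need not.

```python
def covariance(classes):
    """Covariance of dS and dN over weighted rate classes.

    classes is a list of (dS, dN, weight) tuples whose weights sum to 1.
    """
    total = sum(w for _, _, w in classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mean_s = sum(s * w for s, _, w in classes)
    mean_n = sum(n * w for _, n, w in classes)
    return sum(w * (s - mean_s) * (n - mean_n) for s, n, w in classes)


def independent_product(syn, nonsyn):
    """Joint classes from independent marginals.

    syn and nonsyn are lists of (rate, weight); every pairing of a
    synonymous and a nonsynonymous class gets the product of the weights.
    """
    return [(s, n, ws * wn) for s, ws in syn for n, wn in nonsyn]
```

With two synonymous and two nonsynonymous classes, the independent product always has four joint classes with fixed product weights, while the general bivariate form is free to put all of its weight on, say, (low dS, low dN) and (high dS, high dN).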

Based on what you want to do (compare alignment-wide dN/dS along branches between different data sets), a GA-branch style approach seems ideal. It will

(a). Free you from having to assign branches to rate classes a priori
(b). Avoid model overfitting as the free ratio model will almost certainly do (hence leading to very large variances in parameter estimates)
(c). Provide natural confidence intervals for dN/dS over branches averaged over models.
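
Point (c) can be illustrated with generic Akaike-weight model averaging. This is a sketch under the assumption that each candidate model is weighted by its c-AIC score; the thread does not spell out the exact weighting the GA-branch code uses, and the function names are hypothetical.

```python
import math


def akaike_weights(scores):
    """Convert information-criterion scores (lower is better) to weights."""
    best = min(scores)
    raw = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


def model_averaged(estimates, scores):
    """Weighted average of one branch's dN/dS estimate across models."""
    weights = akaike_weights(scores)
    return sum(w * e for w, e in zip(weights, estimates))
```

A branch whose dN/dS is above 1 in nearly all well-supported models keeps a model-averaged estimate above 1, whereas a branch that is high in only one poorly supported model is pulled back towards the consensus.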

Let me know if I can be of further assistance.

Sergei


Title: Re: Review methods for dN and dS on trees
Post by Sarah on Sep 13th, 2006 at 5:13pm
Dear Sergei,

I just ran a GA job on the same sequences without problems. The program and output formats are great! I'm looking into running the program on a local cluster so I can see the effects of site-by-site variation and different topologies.

Argh, you're (obviously) right about the GA model not being branch-site--don't know what I was thinking. It seems that one could, in theory, run FEL on the same tree used for GA, and map codon substitutions to the tree to see on which branches selection may have occurred at particular codons--but this clearly does not yield the same precision claimed by the branch-site model. That said, I am wary of false positives in the branch-site model.

Thanks,
Sarah


Title: Re: Review methods for dN and dS on trees
Post by Sergei on Sep 14th, 2006 at 12:14pm
Dear Sarah,

Let me know if you run into problems running jobs locally.

Indeed, one could use FEL on the same tree to look for codons under selection post-hoc (i.e. use the GA to find putatively selected branches, then use FEL to test for selection on those branches), but there are some statistical issues of bias here, whereby one uses the same data to first formulate a hypothesis (i.e. find branches under selection) and then test for evidence for/against it using the same data. It's permissible for exploratory data analysis, though.

Alternatively, you could use SLAC/FEL to infer substitution histories for each codon, and then correlate them with the branches under selection globally. However, this is also mostly an exploratory analysis, without much statistical rigor.

Generally speaking, there is not very much power in any branch-site type method, even if the 'foreground' branch is specified a priori (as shown in the simulation studies by Yang et al. in their 2005 MBE paper). Also, there is a problem of model mis-specification (i.e. postulating which branches are under selection).

Cheers,
Sergei

Title: Re: Review methods for dN and dS on trees
Post by Sarah on Sep 17th, 2006 at 1:20pm
I don't have administrative privileges on the cluster where we're trying to install the GA batch file, but so far the administrators haven't run into any preliminary compilation problems. In the meantime, I've tried to run a few more analyses on your server. The queue has remained the same since Friday, and one analysis (started Thursday, #7627?) seems stuck on generation 100--the progress page hasn't changed. Am I using too many system resources? Again, I plan to run locally soon.

I've submitted overlapping sequence files in the GA analysis, e.g. A, A+B, B, B+C, etc. I've noticed that a few of the branches get quite different results between the single-group analysis (A) and the paired analysis (A+B). I was hoping to explore this problem when running the GA routine locally, first by increasing the number of categories. Is that possible? If not, could I constrain a few? It seems like it would also help to constrain analysis to a specific nucleotide model and provide consistent topologies, rather than relying on NJ. Does this seem reasonable?

Sarah

p.s. The FEL would be exploratory only; I plan to use the percentiles generated by the GA to infer statistical support for dN>dS on particular branches (I do have a hypothesis which ones will be selected). That looks like the approach you took in your 2005 MBE paper.

Title: Re: Review methods for dN and dS on trees
Post by Sergei on Sep 17th, 2006 at 1:45pm
Dear Sarah,

One of the nodes on our cluster seems to have developed hardware problems - thus the hung jobs. I restarted the queue just now, having taken the problem node out of the pool; we'll see how that goes.

You can indeed increase the number of categories when running the analysis locally, and also tighten the convergence criterion (e.g. not 30 generations with no c-AIC improvement, but 50 or 100).
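
The convergence criterion described above amounts to a stall counter. A minimal sketch, assuming the search reports the best c-AIC found in each generation (lower is better); the function name and the `patience` parameter are hypothetical and not taken from the GA batch file:

```python
def generations_until_stall(best_scores, patience=30):
    """Stop once `patience` consecutive generations bring no improvement.

    best_scores is an iterable giving the best c-AIC of each generation.
    Returns (generations_run, best_score_seen).
    """
    best = float("inf")
    stalled = 0
    generation = 0
    for generation, score in enumerate(best_scores, start=1):
        if score < best:
            best = score
            stalled = 0  # any improvement resets the counter
        else:
            stalled += 1
            if stalled >= patience:
                break
    return generation, best
```

Raising `patience` from 30 to 50 or 100 makes the search run longer after its last improvement, trading run time for a lower chance of stopping at a poor local optimum.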

Using a consistent tree would be a good idea, because lineage specific analyses could be influenced quite a bit by the changes in the topology. I could take a look at the specific outputs which generate different results and see if anything else stands out; it's a bit difficult to speak in generalities and be helpful here:)


Cheers,
Sergei

Title: Re: Review methods for dN and dS on trees
Post by Sarah on Nov 13th, 2006 at 10:07am

Simon wrote on Sep 7th, 2006 at 2:46pm:
It's probably a reasonable approximation to estimate the nucleotide biases using a nucleotide model (NucModelCompare), which will be much faster than using the codon model (CodonModelCompare). I'd be interested to hear if they came up with different answers for the best fitting nucleotide model.

Just wanted to say that CodonModelCompare found three good, statistically indistinguishable models, and NucModelCompare (after two months on two processors) settled on the most likely of those three.

Sarah

Title: Re: Review methods for dN and dS on trees
Post by Sergei on Nov 13th, 2006 at 10:16am
Dear Sarah,

Two months? Wow! Did you use the branch length approximation heuristic? It speeds things up by a factor of 50x or so, and almost always gets the same results.

Cheers,
Sergei

P.S. Also, check out [link unavailable]

Title: Re: Review methods for dN and dS on trees
Post by Sarah on Nov 13th, 2006 at 11:33am
My mistake--it was a single processor (Athlon XP2000+ I think). I used the branch length approximation.

I'd love to read that chapter.

Sarah
