HyPhy message board - GA Analysis: Number of rate categories

Sarah

YaBB Newbies

Offline

Posts: 47

GA Analysis: Number of rate categories
Nov 29th, 2006 at 4:43pm

These questions could probably be answered numerically, but I'm hoping for more insight.

I'm trying to test whether there is episodic selection on a large tree (~250 taxa), i.e., whether pre-specified branches have more positive selection than all the other branches. I don't want to lose information by averaging their dN/dS values or assume that selection on individual sites does not vary among the pre-specified branches. I am thus conducting GA analysis instead of some foreground-background test.

It's computationally intractable to run the analysis on the entire tree. But (how) does my choice of how to subdivide the tree affect results? Assuming I stick with the same codon/nucleotide model for all the subtrees, it seems that my results could still be affected by, for example, the number of branches that must be grouped into the same category. Shouldn't one ideally search models from one rate category to many (with "many" depending on resolution or branch lengths?)? Is eight or fewer reasonable for large trees? It seems like there has to be some tradeoff in accuracy and precision as one changes tree size.

I also assume that I should enforce the nucleotide/codon model determined for the whole tree in each of the subtrees. Can the rate matrix values be defined in the batch code? I worry that each subtree will arrive at different (overfitted) values if I define the model (e.g. 012340) only.

I've a ways to think on this but appreciate all the catalysts I can get!

Sarah


Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: GA Analysis: Number of rate categories
Reply #1 - Nov 29th, 2006 at 5:46pm

Dear Sarah,

The newest GA analysis will automatically determine the appropriate number of categories starting with 1 and stepping up until no further AIC score improvement can be found.
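The stepping rule described above can be sketched roughly as follows. This is only an illustration in Python, not the HBL code the GA analysis actually uses: `fit` is a hypothetical callback standing in for "run the search with k rate classes and report the best log-likelihood and parameter count", and the toy numbers are made up.

```python
from typing import Callable, Tuple

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def choose_rate_classes(
    fit: Callable[[int], Tuple[float, int]], max_classes: int = 50
) -> Tuple[int, float]:
    """Step up from one rate class until AIC stops improving.

    `fit(k)` is a hypothetical callback that returns
    (log-likelihood, parameter count) for the best k-class model found.
    """
    best_k, best_aic = 1, aic(*fit(1))
    for k in range(2, max_classes + 1):
        score = aic(*fit(k))
        if score >= best_aic:  # no improvement: keep the previous model
            break
        best_k, best_aic = k, score
    return best_k, best_aic

# Made-up log-likelihoods, for illustration only.
TOY_LOG_LIKELIHOODS = {1: -1000.0, 2: -980.0, 3: -975.0, 4: -974.5}

def toy_fit(k: int) -> Tuple[float, int]:
    # Assumes 10 shared parameters plus one dN/dS value per rate class.
    return TOY_LOG_LIKELIHOODS[k], 10 + k

print(choose_rate_classes(toy_fit))
```

With these toy values the fourth class raises the log-likelihood by only 0.5, which does not pay for the extra parameter, so the search settles on three classes.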

A 250 taxon tree would probably take too long to run in its entirety, even with the new, much faster GA version. There are several subsampling strategies:

1). Simply divide your tree into a number of clades, so each subsample forms a monophyletic group, if you can (might be difficult for very laddery trees). They don't have to be the same size, but should probably contain at least 20 sequences each. Then you can analyze each clade separately and collate the results. The drawbacks of this sampling approach are:

   - You lose the information on branches connecting the clades.
   - Even though two branches in different clades may have the same dN/dS rate in a joint analysis (if we could run it), they may end up in different classes in individual analyses. This can be fixed by hacking the analysis to use the same set of dN/dS values on all subtrees (analyzed jointly). Hence, instead of searching a single 250 taxon tree, you could jointly search (say) 60, 50, 40, 55 and 45 taxon trees, where in each tree the allocation of branches to dN/dS classes is done independently of the other trees, but the values of dN/dS for each rate class are estimated jointly from all trees.

2). Collapse all short branches. E.g, if the maximum pairwise distance (e.g. nucleotide TN93 metric) between K taxa is less than some a priori threshold, represent all K taxa with one of them (chosen randomly). This is not a bad idea if you have a sample of viruses from different hosts/localities or of different subtypes. You can use ClusterByDistanceRange.bf (a standard analysis under Data File Tools) to see if your data has collapsible clusters.

3). Random samples. Probably not a good idea, unless a large number of them are run and properly collated.
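The collapsing rule in option 2 can be sketched in a few lines of Python. This is a simple greedy grouping, not the algorithm used by ClusterByDistanceRange.bf; the function names and the distance lookup keyed by unordered taxon pairs are assumptions made for the illustration, and the distances themselves (e.g. TN93) would have to be computed elsewhere.

```python
import random
from typing import Dict, FrozenSet, List, Sequence

Distances = Dict[FrozenSet[str], float]

def cluster_by_threshold(
    taxa: Sequence[str], dist: Distances, threshold: float
) -> List[List[str]]:
    """Greedily group taxa so that every pair within a cluster is closer
    than `threshold` (i.e. the maximum pairwise distance stays below it)."""
    clusters: List[List[str]] = []
    for taxon in taxa:
        for cluster in clusters:
            if all(dist[frozenset((taxon, m))] < threshold for m in cluster):
                cluster.append(taxon)
                break
        else:
            # No existing cluster accepts this taxon: start a new one.
            clusters.append([taxon])
    return clusters

def pick_representatives(
    clusters: List[List[str]], rng: random.Random
) -> List[str]:
    """Represent each cluster by one randomly chosen member."""
    return [rng.choice(cluster) for cluster in clusters]
```

Because the grouping is greedy, the result can depend on the order in which taxa are visited, so it is worth checking that the clusters look sensible before discarding sequences.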

In terms of nucleotide rates, I would not worry about overfitting too much. The precision of nucleotide rate estimates from 50 and from 250 taxa on long alignments (yours is flu HA, right?) is not qualitatively different.

HTH,
Sergei


Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego


Sarah


Re: GA Analysis: Number of rate categories
Reply #2 - Nov 29th, 2006 at 6:00pm

Oh, dear--my memory updates erratically!!! I had forgotten that the eight categories were replaced by this AIC-improvement criterion. Sorry about that.

I think I will pursue the second bullet under option 1 and then option 2. It will be interesting to compare results. (And yes, I'm working with HA.)

Thanks!

Sarah


Sergei


Re: GA Analysis: Number of rate categories
Reply #3 - Nov 29th, 2006 at 6:04pm

Dear Sarah,

Split tree - joint rate (STJR) analysis may be tedious to implement. My HBL GA code is not commented (bad me) and may be hard to modify. I would recommend starting with option 2 (it can run with existing code), and then trying STJR. You may actually be able to run the entire 250 taxon tree through, depending on how patient you are. My guess is that it would finish in a week or two.
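To make the STJR idea concrete, here is a minimal Python sketch of the parameterization only: one shared list of dN/dS values, plus a separate branch-to-class allocation for each subtree. It is not the HBL GA code. The names are hypothetical, and `tree_log_likelihood` is an assumed callback standing in for whatever evaluates a subtree's likelihood given per-branch dN/dS values.

```python
from typing import Callable, Dict, List

def branch_omegas(
    shared_omegas: List[float], assignments: Dict[str, List[int]]
) -> Dict[str, List[float]]:
    """Map each subtree's branch-to-class allocation onto the shared dN/dS values."""
    return {
        tree: [shared_omegas[c] for c in classes]
        for tree, classes in assignments.items()
    }

def joint_log_likelihood(
    shared_omegas: List[float],
    assignments: Dict[str, List[int]],
    tree_log_likelihood: Callable[[str, List[float]], float],
) -> float:
    """Sum the subtree log-likelihoods under one shared set of rate classes.

    `tree_log_likelihood(tree, omegas)` is a hypothetical callback; subtrees
    are treated as independent, which ignores the connecting branches.
    """
    per_branch = branch_omegas(shared_omegas, assignments)
    return sum(tree_log_likelihood(t, w) for t, w in per_branch.items())
```

In this layout the search would change the allocation lists one subtree at a time, while the shared dN/dS values would be re-estimated against the summed score.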

Cheers,
Sergei


Sarah

Re: GA Analysis: Number of rate categories
Reply #4 - Nov 29th, 2006 at 6:08pm

Is the limiting factor in GA analysis the generation time or the population size (or the landscape)? Many of the branches on HA are so short--and end in egg-adapted sequences anyway--that collapsing them should be straightforward.

Thanks,
Sarah

Sergei


Re: GA Analysis: Number of rate categories
Reply #5 - Nov 29th, 2006 at 6:12pm

Dear Sarah,

Generation time (i.e. the cost of computing the fitness of an individual) is one limiting factor (probably the primary one), and the size of the search space (which grows factorially with the size of the tree) is another (but less important in my experience) - it simply results in slower convergence. Boosting generation SIZE (i.e. the number of nodes you run the analysis on) should alleviate the second factor to a large extent.
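One rough way to get a feel for how quickly the search space grows is to count the ways of splitting the branches of a tree into k rate classes. The Python sketch below assumes an unrooted, fully resolved tree (2N - 3 branches for N taxa) and treats a model as a partition of the branches into k non-empty, unlabelled classes, which is counted by a Stirling number of the second kind. Both are simplifying assumptions for illustration, not a description of how the GA enumerates models.

```python
def n_branches(n_taxa: int) -> int:
    """Branches in an unrooted, fully resolved tree with n_taxa tips."""
    return 2 * n_taxa - 3

def stirling2(n: int, k: int) -> int:
    """Ways to partition n labelled items into k non-empty unlabelled classes."""
    row = [1] + [0] * k  # row[j] holds S(0, j); S(0, 0) = 1
    for _ in range(n):
        # Update from high j to low j so row[j - 1] is still the previous row.
        for j in range(k, 0, -1):
            row[j] = j * row[j] + row[j - 1]
        row[0] = 0  # S(n, 0) = 0 for n > 0
    return row[k]

for taxa in (20, 50, 250):
    count = stirling2(n_branches(taxa), 3)
    print(taxa, "taxa:", len(str(count)), "digits in the 3-class count")
```

Even at three rate classes the count is astronomically large for a 250 taxon tree, which is why a heuristic search is used instead of exhaustive enumeration.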

Collapsing really short branches is a good idea anyway - you can't really estimate dN/dS all that well on them anyway, even with a good model.

Cheers,
Sergei


Sarah


Re: GA Analysis: Number of rate categories
Reply #6 - Dec 1st, 2006 at 10:40am

Quote:

Collapsing really short branches is a good idea anyway - you can't really estimate dN/dS all that well on them anyway, even with a good model.

Generally in phylogenetic analysis it's better to "cut" branches because extra sequences add information about the order of substitutions. (This is the remedy for long branch attraction under parsimony.) It seems like these extra taxa also contribute to the correct inference of nucleotide or substitution model in ML analysis. But shorter branches have fewer substitutions and thus potentially less power to demonstrate positive selection. It seems like more power should be possible if one groups adjacent branches. Is my thinking on track and, if so, has anyone addressed this issue?

Sarah


Sergei


Re: GA Analysis: Number of rate categories
Reply #7 - Dec 1st, 2006 at 10:57am

Dear Sarah,

Additional sequences do provide more information, but adding, for example, 10 copies of the same sequence (i.e. the same virus sampled from 10 epi-linked hosts) will provide no useful information at all about substitution rates, and could in fact lower the rate estimates because of sampling bias.

By collapsing short branches, I was referring to two steps:

1). Removing all identical sequences from the analysis
2). If there are very short internal branches, they may signify poorly resolved polytomies. Forcing a specific structure on a poorly resolved clade is not, in my mind, a robust approach.
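Step 1 is easy to do outside HyPhy. A minimal Python sketch, assuming the alignment has already been read into a dictionary of taxon name to aligned sequence (the function name and data layout are hypothetical):

```python
from typing import Dict, List, Tuple

def drop_identical_sequences(
    alignment: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Keep the first taxon seen for each distinct sequence.

    Returns the reduced alignment and a map from each kept taxon to the
    names of the identical taxa that were dropped in its favour.
    """
    kept: Dict[str, str] = {}
    first_seen: Dict[str, str] = {}  # sequence -> name of the taxon kept for it
    dropped: Dict[str, List[str]] = {}
    for name, seq in alignment.items():
        key = seq.upper()  # compare case-insensitively
        if key in first_seen:
            dropped.setdefault(first_seen[key], []).append(name)
        else:
            first_seen[key] = name
            kept[name] = seq
    return kept, dropped
```

The dropped taxa would also have to be pruned from the tree, or the tree rebuilt from the reduced alignment, before running the analysis.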

Also, it is simply difficult to infer any rates reliably from very short branches.

HTH,
Sergei


Sarah


Re: GA Analysis: Number of rate categories
Reply #8 - Dec 1st, 2006 at 11:34am

Dear Sergei,

I understand the kind of collapsing you suggest for the GA analysis. My question was more general--the shortness of branches depends on how finely sampled the evolutionary process is. If the branches in the tree are quite short, there will be less evidence on any particular branch for positive selection, even though aggregating several contiguous branches would show a strong signal. The latter seems equivalent to sampling less often.

I'm wondering if anyone has tried to make corrections by considering selection across multiple contiguous branches, or if there is something flawed with this approach.
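A toy calculation shows the power argument. It is a crude counting approach, not the codon-model likelihood used in the GA analysis, and every number in it is hypothetical: it assumes that under neutrality a fixed fraction `p_neutral` of substitutions would be nonsynonymous, and asks how surprising it is to see at least the observed number.

```python
from math import comb

def binomial_tail(n_nonsyn: int, n_total: int, p_neutral: float) -> float:
    """P(X >= n_nonsyn) for X ~ Binomial(n_total, p_neutral)."""
    return sum(
        comb(n_total, i) * p_neutral**i * (1 - p_neutral) ** (n_total - i)
        for i in range(n_nonsyn, n_total + 1)
    )

P_NEUTRAL = 0.75  # hypothetical neutral share of nonsynonymous changes

# One short branch: 3 substitutions, all nonsynonymous.
single = binomial_tail(3, 3, P_NEUTRAL)

# Five contiguous branches like it, pooled: 15 of 15 nonsynonymous.
pooled = binomial_tail(15, 15, P_NEUTRAL)

print(f"single branch: {single:.3f}, pooled: {pooled:.3f}")
```

With these made-up counts the single branch gives a tail probability of about 0.42, while the pooled branches give about 0.013, so the same per-branch pattern only becomes detectable after aggregation.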

Sarah


« Last Edit: Dec 1st, 2006 at 4:11pm by Sarah »


Sergei

Re: GA Analysis: Number of rate categories
Reply #9 - Dec 1st, 2006 at 1:05pm

Dear Sarah,

You can definitely try to subsample (by cluster), to coalesce the short branches. I am not sure if there are any published results on this though.

Cheers,
Sergei
