HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1108651563

Message started by Jennifer Knies on Feb 17th, 2005 at 6:46am

Title: Number of Sites Required for Ti/Tv
Post by Jennifer Knies on Feb 17th, 2005 at 6:46am
How robust is the Ti/Tv estimate (built using maximum likelihood) to a small number of sites?  We want to estimate Ti/Tv for 5 sites from an ~200bp alignment that has ~110 sequences.  We will use the 200bp alignment to determine the base pair frequencies and have built a tree based on the full genome for these sequences (they are HIV sequences).  Any references you have on this topic would also be appreciated.  Thanks!

Title: Re: Number of Sites Required for Ti/Tv
Post by Simon on Feb 17th, 2005 at 10:49am
Dear Jennifer,

Your message raises a number of points. To answer your main question: if you try to estimate a parameter such as the transition/transversion ratio with too little data, the ML estimates may be highly biased. Imagine you have so few changes that there are no transversions; the ML estimate of the ratio will then be infinity. For small datasets, it's highly recommended that you inspect the confidence intervals on the parameters. HyPhy provides a simple interface for calculating confidence intervals based on profile likelihood.
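To make the point about small samples concrete, here is a toy sketch in Python. It is not HyPhy code and not the HKY85 likelihood on a tree. It assumes, purely for illustration, that you could directly count the observed changes as transitions (n_ts) or transversions (n_tv) and treat them as binomial draws. The function names and the grid search are my own assumptions. The interval is a likelihood-ratio interval using the usual 1.92 log-likelihood cutoff (half the 95% chi-square critical value with 1 degree of freedom), which is the same idea a profile likelihood interval is built on.

```python
import math

def log_lik(p, n_ts, n_tv):
    """Binomial log-likelihood of the transition fraction p (constant dropped)."""
    ll = 0.0
    if n_ts:
        ll += n_ts * math.log(p)
    if n_tv:
        ll += n_tv * math.log(1.0 - p)
    return ll

def ratio_mle(n_ts, n_tv):
    """ML estimate of the ts/tv count ratio in this toy model."""
    return math.inf if n_tv == 0 else n_ts / n_tv

def lr_interval(n_ts, n_tv, cutoff=1.92, grid=100_000):
    """Approximate 95% likelihood-ratio interval for the ts/tv ratio.

    Keeps every p on a grid whose log-likelihood is within `cutoff`
    of the maximum, then converts the end points to ratios p / (1 - p).
    """
    n = n_ts + n_tv
    if n == 0:
        raise ValueError("need at least one observed change")
    ll_max = log_lik(n_ts / n, n_ts, n_tv)
    kept = [
        p for p in ((i + 0.5) / grid for i in range(grid))
        if ll_max - log_lik(p, n_ts, n_tv) <= cutoff
    ]
    # The interval is open-ended when one of the two counts is zero.
    lower = 0.0 if n_ts == 0 else kept[0] / (1.0 - kept[0])
    upper = math.inf if n_tv == 0 else kept[-1] / (1.0 - kept[-1])
    return lower, upper

if __name__ == "__main__":
    for counts in [(4, 0), (40, 20), (400, 200)]:
        print(counts, ratio_mle(*counts), lr_interval(*counts))
```

With 4 transitions and no transversions the point estimate is infinite and the interval has no upper bound, while the same 2:1 ratio observed from 60 or 600 changes gives progressively tighter intervals. A real analysis would get the interval from the fitted model in HyPhy rather than from raw counts.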

As an aside, you should be careful about interpreting the results obtained by fitting an HKY85 model, as it may not be the best model for your data. Spencer Muse has a chapter in Keith Crandall's Evolution of HIV book which shows that HIV genes don't often conform to one of the simple 'named' models such as HKY85, TN93, etc. John Huelsenbeck et al. also demonstrated that the best-fitting models are not always 'named' models. Also have a look at David Posada and Thomas Buckley's model selection paper in Systematic Biology. If you want to get a quick idea of which is the best model, upload the data to DataMonkey and use the model selection option, which will fit 203 reversible nucleotide models to your data. There's also an option in HyPhy that allows you to do the same thing, but unless you have an MPI-enabled cluster of computers handy, DataMonkey will be much faster.
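For anyone wondering where the figure of 203 comes from: a reversible nucleotide model has six exchange rates (AC, AG, AT, CG, CT, GT), and each candidate model is one way of grouping those six rates into classes that share a value. The number of such groupings is the sixth Bell number, 203. The Python sketch below only enumerates the groupings as six-digit strings; it does not fit anything, and the function name and the rate ordering are assumptions made for the illustration.

```python
RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]  # assumed ordering of the six rates

def rate_class_strings(n=len(RATES)):
    """Yield every grouping of n rates as a string of class labels.

    Labels follow the 'restricted growth' rule: the first rate is class 0,
    and each later rate either reuses an existing class or opens the next
    unused one, so each grouping is listed exactly once.
    """
    def extend(prefix, max_label):
        if len(prefix) == n:
            yield "".join(str(label) for label in prefix)
            return
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))
    yield from extend([], -1)

if __name__ == "__main__":
    models = list(rate_class_strings())
    print(len(models))
    # A few familiar rate structures under the assumed ordering:
    for name, code in [("one rate", "000000"),
                       ("HKY85-like", "010010"),
                       ("TN93-like", "010020"),
                       ("GTR-like", "012345")]:
        print(name, code, code in models)
```

The strings only describe how rates are shared. Whether a given string corresponds to a 'named' model also depends on how the base frequencies are treated, so the labels in the example are approximate.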

In addition, it may be wise to use the tree based on the alignment of interest, rather than the tree estimated from the whole genome, as this will be less likely to be affected by recombination. Even though a short alignment carries less information for building the tree, the tree shouldn't affect the estimates of the substitution parameters much.

Best wishes
Simon
