Number of Sites Required for Ti/Tv
Jennifer Knies
Guest


Number of Sites Required for Ti/Tv
Feb 17th, 2005 at 6:46am
 
How robust is the Ti/Tv estimate (built using maximum likelihood) to a small number of sites?  We want to estimate Ti/Tv for 5 sites from an ~200bp alignment that has ~110 sequences.  We will use the 200bp alignment to determine the base pair frequencies and have built a tree based on the full genome for these sequences (they are HIV sequences).  Any references you have on this topic would also be appreciated.  Thanks!
 
Simon
Ex Member


Re: Number of Sites Required for Ti/Tv
Reply #1 - Feb 17th, 2005 at 10:49am
 
Dear Jennifer,

Your message raises a number of points. To answer your main question: if you try to estimate a parameter such as the transition/transversion ratio from too little data, the ML estimate may be highly biased. For example, if there are so few changes that no transversions are observed, the ML estimate of the ratio will be infinity. For small datasets, it is highly recommended that you inspect the confidence intervals on the parameters. HyPhy provides a simple interface for calculating confidence intervals based on profile likelihood.
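
To make the point about small samples concrete, here is a rough Python sketch. It is not HyPhy and it is not a phylogenetic likelihood: it assumes, purely for illustration, that the observed changes are independent draws that are each either a transition or a transversion (a binomial model), and it finds a profile-likelihood-style interval by keeping every ratio whose log-likelihood is within 1.92 units of the maximum. The function names `log_lik` and `titv_interval` are made up for this example.

```python
import math


def log_lik(p, ti, tv):
    """Binomial log-likelihood (constant dropped) of transition proportion p."""
    ll = 0.0
    # Skip terms with a zero count so that 0 * log(0) is treated as 0.
    if ti > 0:
        ll += ti * math.log(p) if p > 0 else float("-inf")
    if tv > 0:
        ll += tv * math.log(1 - p) if p < 1 else float("-inf")
    return ll


def titv_interval(ti, tv, cutoff=1.92, grid=100000):
    """Return (estimate, lower, upper) for the ratio ti/tv.

    The interval holds every ratio whose log-likelihood lies within
    `cutoff` of the maximum; 1.92 is half the 95% chi-square critical
    value with one degree of freedom.
    """
    n = ti + tv
    if n == 0:
        raise ValueError("need at least one observed change")
    best = log_lik(ti / n, ti, tv)
    # Scan proportions strictly between 0 and 1 and keep the supported ones.
    inside = [i / grid for i in range(1, grid)
              if best - log_lik(i / grid, ti, tv) <= cutoff]

    def to_ratio(p):
        return p / (1 - p)

    estimate = math.inf if tv == 0 else ti / tv
    lower = 0.0 if ti == 0 else to_ratio(min(inside))
    upper = math.inf if tv == 0 else to_ratio(max(inside))
    return estimate, lower, upper


# Five changes, all transitions: the estimate and upper bound are infinite.
print(titv_interval(5, 0))
# Same kind of data, but many more changes: the interval is finite.
print(titv_interval(50, 25))
```

With five transitions and no transversions the estimate is infinite and the interval has no upper bound, whereas 50 transitions against 25 transversions gives an estimate of 2 with a finite interval around it. A real analysis would compute the profile on the tree-based likelihood, as HyPhy does.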

As an aside, you should be careful about interpreting the results obtained by fitting an HKY85 model, as it may not be the best model for your data. Spencer Muse has a chapter in Keith Crandall's Evolution of HIV book which shows that HIV genes often do not conform to one of the simple 'named' models such as HKY85, TN93, etc. John Huelsenbeck et al. also demonstrated that the best-fitting models are not always 'named' models. Also have a look at David Posada and Thomas Buckley's model selection paper in Systematic Biology. If you want to get a quick idea of which model is best, upload the data to DataMonkey and use the model selection option, which will fit 203 reversible nucleotide models to your data. There's also an option in HyPhy that allows you to do the same thing, but unless you have an MPI-enabled cluster of computers handy, DataMonkey will be much faster.
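
For anyone wondering where the number 203 comes from: it matches the number of ways to sort the six substitution rates of a time-reversible model (A-C, A-G, A-T, C-G, C-T, G-T) into groups that share a rate. The short Python sketch below simply enumerates those groupings. It assumes that this is how the 203 models are defined, and the names `RATES` and `partitions` are made up for the example; it does not fit any model to data.

```python
RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]


def partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing group in turn...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ...or give it a group of its own.
        yield [[first]] + smaller


models = list(partitions(RATES))
print(len(models))  # 203

# HKY85 is the grouping with one rate for transitions (AG, CT)
# and one rate for transversions (AC, AT, CG, GT).
hky85 = {frozenset(["AG", "CT"]), frozenset(["AC", "AT", "CG", "GT"])}
print(any({frozenset(g) for g in m} == hky85 for m in models))  # True
```

At one extreme all six rates share a single group (equal rates), at the other each rate has its own group (the general reversible model), and the 'named' models are a handful of the groupings in between.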

In addition, it may be wise to use a tree based on the alignment of interest rather than the tree estimated from the whole genome, because a tree for the short region is less likely to be affected by recombination. Although a short alignment contains less information, and so gives a less certain tree, the tree shouldn't affect the estimates of the substitution parameters much.

Best wishes
Simon