HyPhy message board - Inconsistent results between Hyphy and datamonkey

(Moderators: Sergei, Simon)

Inconsistent results between Hyphy and datamonkey (Read 3052 times)

Gabriel

YaBB Newbies

Offline

Feed your monkey!

Posts: 11

Inconsistent results between Hyphy and datamonkey
Feb 25th, 2010 at 1:13pm

Hello everyone,

I'm trying to compare the results of a positive selection analysis run in HyPhy with the same analysis run on the Datamonkey website. The problem is that the two give me completely different results.

With the following options in HyPhy, I get the results shown below:

First get the tree with:

(9) Phylogeny Reconstruction
(3) Perform a phylogeny reconstuction for nucleotide, protein or codon data with user-selectable models using the method of neighbor joining.
(1):[Distance formulae] Use one of the predefined distance measures based on data comparisons. Fast.
(2):[Codon] Codon (several available genetic codes).
(1):[Universal] Universal code. (Genebank transl_table=1).
Please choose a codon data file::/home/.....
(2):[Force Zero] Negative Branch Lengths are Forced to 0.
(1):[Nei_Gojobori] Nei and Gojobori (1986) method.
(1):[Joint p-distance(Default)] Observed N/E[N]+Observed S/E[S]

And then, my analysis of positive selection:

(1):[Universal] Universal code. (Genebank transl_table=1).
(1):[New Analysis] Perform a new analysis.
Please specify a codon data file:: /home/.....
(1):[Default] Use HKY85 and MG94xHKY85.
Please select a tree file for the data::/home/.....
/Save nucleotide model fit to::/home/....
(1):[Neutral] dN/dS=1
(1):[Single Ancestor Counting] Use the most likely ancestor state
Branch Corrections Factor (<0 to estimate):-1
(1):[Full tree] Analyze the entire tree
(1):[Averaged] All possible resolutions are considered and averaged.
(1):[Approximate] Use the approximate extended binomial distribution (fast)
Significance level for a site to be classified as positively/negatively selected?0.01

******* FOUND 6 POSITIVELY SELECTED SITES ********

+--------------+--------------+--------------+--------------+
| Index        | Site Index   | dN-dS        | p-value      |
+--------------+--------------+--------------+--------------+
| 1            | 266.000000   | 16.975860    | 0.004127     |
+--------------+--------------+--------------+--------------+
| 2            | 711.000000   | 17.775800    | 0.001779     |
+--------------+--------------+--------------+--------------+
| 3            | 871.000000   | 15.384795    | 0.007866     |
+--------------+--------------+--------------+--------------+
| 4            | 933.000000   | 13.977618    | 0.008030     |
+--------------+--------------+--------------+--------------+
| 5            | 963.000000   | 15.981690    | 0.008440     |
+--------------+--------------+--------------+--------------+
| 6            | 964.000000   | 20.738813    | 0.001250     |
+--------------+--------------+--------------+--------------+

(2):[Export to File] Output is spooled to a tab separated file.
Export tab separated data to::/home/....
(1):[Skip] Skip the estimation of number of dS and dN rate classes

Now, with the following options in datamonkey, I get the following results:

Read 35 sequences and 1367 codon alignment columns and 1 partitions.
Nucleotide composition
A 22.0984%
C 29.0403%
G 26.9878%
T 21.8734%
Method: SLAC
Use this tree set:Neighbor Joining Tree
Define a custom (or choose a "named" HKY85) nucleotide substitution bias model
Global dN/dS value is: Neutral 1.0
Handling ambiguities: Averaged
Significance Level (p-value/Bayes Factor/posterior probability): 0.01

Data summary
35 sequences with 1 partition
Partition 1: 1367 codons 17.0656 subs/site
Nucleotide Model(010010) Fit Results
Log(L) = -102728.673
Relative substitution rates

     A    C          G          T
A    *    0.890379   1          0.890379
C    -    *          0.890379   1
G    -    -          *          0.890379
T    -    -          -          *

Codon Model Fit Results (processor time taken: 100.97 seconds)
Log(L) = -101613 mean dN/dS = 1

Found 7 positively selected sites ( significance level 0.01)

Codon   dN-dS     Normalized dN-dS   p-value
77      21.7536   0.424902           0.00730707
175     33.3654   0.651709           0.00675704
434     25.8345   0.504611           0.00679637
738     25.4569   0.497236           0.00283785
1095    31.5538   0.616323           0.0021509
1143    24.7542   0.48351            0.00337522
1198    26.4843   0.517304           0.00461045
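
A note on the two dN-dS columns in the table above. The short Python sketch below divides each raw dN-dS value by its normalized counterpart, using only the numbers posted here. The ratio comes out constant and very close to three times the 17.0656 subs/site reported in the data summary. That is an observation from these figures, not a statement taken from the Datamonkey documentation; reading it as "normalized by tree length in codon units" is an assumption. The variable names are illustrative.

```python
# (codon, dN-dS, normalized dN-dS) copied from the Datamonkey table above.
rows = [
    (77, 21.7536, 0.424902),
    (175, 33.3654, 0.651709),
    (434, 25.8345, 0.504611),
    (738, 25.4569, 0.497236),
    (1095, 31.5538, 0.616323),
    (1143, 24.7542, 0.48351),
    (1198, 26.4843, 0.517304),
]

# Tree length per nucleotide site, as reported in the data summary.
subs_per_site = 17.0656

# Ratio of the raw to the normalized value for every reported codon.
ratios = [raw / normalized for _, raw, normalized in rows]

for (codon, _, _), ratio in zip(rows, ratios):
    print(f"codon {codon:5d}: dN-dS / normalized dN-dS = {ratio:.4f}")

print(f"3 x subs/site = {3 * subs_per_site:.4f}")
```

Because the scaling factor is the same for every site, the normalization does not change which sites are reported or their ranking; it only rescales the values.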

As you can see, the two analyses give different results: each one reports a different set of sites under positive selection. I ran the same exercise with the sample file provided on the Datamonkey website (Influenza A H5N1 hemagglutinin), and in that case both analyses gave the same results, which is what I expected for my own data as well.

For this reason I suspect there may be something wrong with my alignment. I have attached my file in case someone can help. Another possibility is that the options I am using to build the tree in HyPhy differ from those used by the Datamonkey website.

Well, I hope someone can help me.

Gabriel
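
The disagreement between the two runs can be made explicit with a few lines of Python. This is only a sketch: the site numbers are copied from the two result tables in this post, the variable names are illustrative, and it assumes both runs number codons against the same alignment.

```python
# Positively selected sites (p < 0.01) reported by each run in this post.
hyphy_sites = {266, 711, 871, 933, 963, 964}
datamonkey_sites = {77, 175, 434, 738, 1095, 1143, 1198}

# Sites reported by both runs, and sites unique to each run.
shared = sorted(hyphy_sites & datamonkey_sites)
only_hyphy = sorted(hyphy_sites - datamonkey_sites)
only_datamonkey = sorted(datamonkey_sites - hyphy_sites)

print("shared:", shared)
print("HyPhy only:", only_hyphy)
print("Datamonkey only:", only_datamonkey)
```

If the alignment is edited between runs (for example by removing columns), site numbers shift and have to be mapped back to the original coordinates before a comparison like this is meaningful.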

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: Inconsistent results between Hyphy and datamonkey
Reply #1 - Feb 25th, 2010 at 1:38pm

Dear Gabriel,

Try downloading the NJ tree built by datamonkey (you can get it via the [Information:Other analyses] link on the SLAC results page) for the HyPhy analysis. Datamonkey uses nucleotide NJ (TN93) trees, whereas you built a codon based one. Usually this won't matter much, but your alignment is very gappy, so the trees could be quite different. Also, I do not recommend using dN/dS = 1 for the analysis - this will OVERESTIMATE the number of non-synonymous changes in most datasets. The best 'default' option is to [Estimate dN/dS only]. Try that and see if the results are in better agreement.

Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
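
One way to check how different two trees are is to count the splits (bipartitions of the leaves) that appear in one tree but not in the other, which is the Robinson-Foulds distance. The following is a minimal pure-Python sketch under stated assumptions: both trees are plain Newick strings over the same leaf labels, labels contain no parentheses, commas, colons, semicolons or quotes, and branch lengths are ignored. The helper names (`clades`, `splits`, `rf_distance`) and the toy trees are hypothetical; they are not part of HyPhy or Datamonkey.

```python
import re


def clades(newick):
    """Return (all leaf labels, set of clades) for a simple Newick string."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack = [set()]  # stack[-1] collects the leaves of the clade being read
    found = set()
    previous = None
    for token in tokens:
        if not token.strip():
            continue  # skip whitespace between symbols
        if token == "(":
            stack.append(set())
        elif token == ")":
            finished = stack.pop()
            found.add(frozenset(finished))
            stack[-1] |= finished
        elif token not in (",", ";") and previous != ")":
            # A leaf label, optionally followed by ":branch_length".
            name = token.split(":")[0].strip()
            if name:
                stack[-1].add(name)
        # Text right after ")" is an internal label or length and is ignored.
        previous = token
    return frozenset(stack[0]), found


def splits(newick):
    """Non-trivial unrooted splits, each stored as the side without the anchor leaf."""
    leaves, found = clades(newick)
    anchor = min(leaves)
    result = set()
    for clade in found:
        side = leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            result.add(side)
    return result


def rf_distance(newick_a, newick_b):
    """Number of splits present in exactly one of the two trees."""
    if clades(newick_a)[0] != clades(newick_b)[0]:
        raise ValueError("trees must have the same leaf labels")
    return len(splits(newick_a) ^ splits(newick_b))


# Toy trees (hypothetical): same taxa, different groupings.
tree_one = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.1):0.02,E:0.4);"
tree_two = "((A,C),(B,D),E);"
print(rf_distance(tree_one, tree_one))
print(rf_distance(tree_one, tree_two))
```

A distance of zero means the two topologies are identical; larger values mean more groupings differ. For real analyses, an established phylogenetics package is a safer choice than this sketch.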

Gabriel

Re: Inconsistent results between Hyphy and datamonkey
Reply #2 - Mar 8th, 2010 at 6:51am

Dear Sergei,

Thank you very much for your reply. I used the nucleotide NJ (TN93) tree for my HyPhy analysis and also removed the poorly aligned positions and divergent regions of my alignment with Gblocks. The results are now in much better agreement.

I take this opportunity to ask: what parameters do you recommend for testing positive selection with sequences from widely divergent species?

Gabriel
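
For readers who want a quick look at how gappy a codon alignment is before filtering it, here is a rough Python sketch. It is not the Gblocks algorithm: it only measures, for each codon column, the fraction of sequences with at least one gap, and drops columns above a threshold. It assumes in-frame sequences of equal length with "-" as the only gap character. The function names, the 0.5 threshold and the toy alignment are all hypothetical.

```python
def codon_gap_fractions(sequences):
    """Fraction of sequences that have a gap in each codon column."""
    length = len(sequences[0])
    if length % 3 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be in frame and of equal length")
    fractions = []
    for start in range(0, length, 3):
        gapped = sum("-" in seq[start:start + 3] for seq in sequences)
        fractions.append(gapped / len(sequences))
    return fractions


def keep_codons(sequences, max_gap_fraction=0.5):
    """Drop codon columns whose gap fraction exceeds the threshold.

    Returns the kept 0-based codon indices and the filtered sequences.
    """
    fractions = codon_gap_fractions(sequences)
    kept = [i for i, frac in enumerate(fractions) if frac <= max_gap_fraction]
    filtered = ["".join(seq[3 * i:3 * i + 3] for i in kept) for seq in sequences]
    return kept, filtered


# Toy alignment (hypothetical): three sequences, three codons.
toy = ["ATGAAACCC", "ATG---CCC", "ATG---CC-"]
kept, filtered = keep_codons(toy)
print(kept)
print(filtered)
```

Keeping the list of retained codon indices matters here: once columns are removed, site numbers in the selection results refer to the filtered alignment and need to be mapped back before comparing them with results from the original one.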

Sergei

Re: Inconsistent results between Hyphy and datamonkey
Reply #3 - Mar 8th, 2010 at 12:23pm

Hi Gabriel,

I am glad my suggestions improved reproducibility. Generally speaking, traditional molecular selection approaches may not work well with highly divergent species, because alignment and saturation (very long branch lengths) make inference difficult. Saturation means, in particular, that the methods may be unable to estimate the background rate of synonymous substitutions reliably, because it is beyond the upper limit of detection. I would suggest that you limit your analyses to more conserved (reliably aligned) blocks, and perhaps consider . How divergent are your taxa?

Sergei
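
The "upper limit of detection" mentioned above can be illustrated with the simplest nucleotide model, Jukes-Cantor (JC69). This is a deliberately simplified sketch: JC69 is not the HKY85 or MG94xHKY85 model used in the analyses in this thread, and the function names are illustrative. Under JC69 the expected proportion of differing sites is p = 3/4 (1 - exp(-4d/3)), and the distance recovered from an observed p is d = -3/4 ln(1 - 4p/3).

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites at true distance d (subs/site)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """JC69 distance estimated from an observed proportion of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); the estimate is undefined beyond that")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# As the true distance grows, the observable difference flattens out near 0.75,
# so very different distances become indistinguishable from the data.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true d = {d:4.1f}   expected p = {jc69_expected_p(d):.4f}")
```

Once p sits close to its ceiling, a tiny change in the observed value maps to a huge change in the estimated distance, which is why rates on very long branches cannot be estimated reliably. With the tree length reported earlier in this thread (17.0656 subs/site), this kind of saturation is a realistic concern.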
