Inconsistent results between Hyphy and datamonkey (Read 3052 times)
Gabriel
Inconsistent results between Hyphy and datamonkey
Feb 25th, 2010 at 1:13pm
 
Hello everyone,

I'm trying to compare the results of a positive selection analysis run in HyPhy with the results from the datamonkey website. The problem is that the two give me completely different results.

With the following options in Hyphy, I get the following results:

First get the tree with:

(9) Phylogeny Reconstruction
(3) Perform a phylogeny reconstuction for nucleotide, protein or codon data with user-selectable models using the method of neighbor joining.
(1):[Distance formulae] Use one of the predefined distance measures based on data comparisons. Fast.
(2):[Codon] Codon (several available genetic codes).
(1):[Universal] Universal code. (Genebank transl_table=1).
Please choose a codon data file::/home/.....
(2):[Force Zero] Negative Branch Lengths are Forced to 0.
(1):[Nei_Gojobori] Nei and Gojobori (1986) method.
(1):[Joint p-distance(Default)] Observed N/E[N]+Observed S/E[S]


And then, my analysis of positive selection:

(1):[Universal] Universal code. (Genebank transl_table=1).
(1):[New Analysis] Perform a new analysis.
Please specify a codon data file:: /home/.....
(1):[Default] Use HKY85 and MG94xHKY85.
Please select a tree file for the data::/home/.....
/Save nucleotide model fit to::/home/....
(1):[Neutral] dN/dS=1
(1):[Single Ancestor Counting] Use the most likely ancestor state
Branch Corrections Factor (<0 to estimate):-1
(1):[Full tree] Analyze the entire tree
(1):[Averaged] All possible resolutions are considered and averaged.
(1):[Approximate] Use the approximate extended binomial distribution (fast)
Significance level for a site to be classified as positively/negatively selected?0.01


******* FOUND 6 POSITIVELY SELECTED SITES ********

+--------------+--------------+--------------+--------------+
| Index        | Site Index   | dN-dS        | p-value      |
+--------------+--------------+--------------+--------------+
|            1 |   266.000000 |    16.975860 |     0.004127 |      
+--------------+--------------+--------------+--------------+
|            2 |   711.000000 |    17.775800 |     0.001779 |      
+--------------+--------------+--------------+--------------+
|            3 |   871.000000 |    15.384795 |     0.007866 |      
+--------------+--------------+--------------+--------------+
|            4 |   933.000000 |    13.977618 |     0.008030 |      
+--------------+--------------+--------------+--------------+
|            5 |   963.000000 |    15.981690 |     0.008440 |      
+--------------+--------------+--------------+--------------+
|            6 |   964.000000 |    20.738813 |     0.001250 |      
+--------------+--------------+--------------+--------------+


(2):[Export to File] Output is spooled to a tab separated file.
Export tab separated data to::/home/....
(1):[Skip] Skip the estimation of number of dS and dN rate classes


Now, with the following options in datamonkey, I get the following results:

Read 35 sequences and 1367 codon alignment columns and 1 partitions.
Nucleotide composition
    A 22.0984%
    C 29.0403%
    G 26.9878%
    T 21.8734%
Method: SLAC
Use this tree set:Neighbor Joining Tree
Define a custom (or choose a "named" HKY85) nucleotide substitution bias model
Global dN/dS value is: Neutral      1.0
Handling ambiguities: Averaged
Significance Level (p-value/Bayes Factor/posterior probability): 0.01

Data summary
35 sequences with 1 partition
Partition 1: 1367 codons 17.0656 subs/site
Nucleotide Model(010010) Fit Results
Log(L) = -102728.673
Relative substitution rates

          A          C          G          T
A         *   0.890379          1   0.890379
C         -          *   0.890379          1
G         -          -          *   0.890379
T         -          -          -          *

Codon Model Fit Results (processor time taken: 100.97 seconds)
Log(L) = -101613  mean dN/dS = 1

Found 7 positively selected sites ( significance level 0.01)

Codon   dN-dS     Normalized dN-dS   p-value
77      21.7536   0.424902           0.00730707
175     33.3654   0.651709           0.00675704
434     25.8345   0.504611           0.00679637
738     25.4569   0.497236           0.00283785
1095    31.5538   0.616323           0.0021509
1143    24.7542   0.48351            0.00337522
1198    26.4843   0.517304           0.00461045

As you can see, the two analyses give different results: each one reports a different set of sites under positive selection. I ran the same exercise with the sample file from the datamonkey website (Influenza A H5N1 hemagglutinin), and in that case both analyses gave the same results, which is what I expected for my own data. For this reason I suspect there may be something wrong with my alignment. I have attached my file in case someone can help. Another possibility is that the options I am using to build the tree in HyPhy differ from those used by the datamonkey website.
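
To make the disagreement concrete, the two site lists can be compared as sets. The snippet below is only a small illustrative sketch in Python; the site numbers are copied from the two result tables above, and the helper name compare_sites is made up for this example.

```python
# Sites reported at p < 0.01, copied from the two result tables in this post.
hyphy_sites = {266, 711, 871, 933, 963, 964}
datamonkey_sites = {77, 175, 434, 738, 1095, 1143, 1198}


def compare_sites(a, b):
    """Return (shared, only_a, only_b) as sorted lists of site indices."""
    return sorted(a & b), sorted(a - b), sorted(b - a)


shared, only_hyphy, only_datamonkey = compare_sites(hyphy_sites, datamonkey_sites)
print("shared:", shared)
print("HyPhy only:", only_hyphy)
print("Datamonkey only:", only_datamonkey)
```

The shared list comes out empty, so not a single site is reported by both runs.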

Well, I hope someone can help me.

Gabriel
Sergei
Re: Inconsistent results between Hyphy and datamonkey
Reply #1 - Feb 25th, 2010 at 1:38pm
 
Dear Gabriel,

Try downloading the NJ tree built by datamonkey (you can get it via the [Information:Other analyses] link on the SLAC results page) and using it for the HyPhy analysis. Datamonkey uses nucleotide NJ (TN93) trees, whereas you built a codon-based one. Usually this won't matter much, but your alignment is very gappy, so the trees could be quite different. Also, I do not recommend using dN/dS = 1 for the analysis - this will OVERESTIMATE the number of non-synonymous changes in most datasets. The best 'default' option is to [Estimate dN/dS only]. Try that and see if the results are in better agreement.
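
One rough way to see how gappy an alignment is: compute the fraction of gap characters in each column. The following is a minimal Python sketch, not part of HyPhy or Datamonkey. The function names, the 0.5 threshold, and the toy sequences are all assumptions made up for illustration.

```python
def gap_fraction_per_column(seqs, gap_chars="-?"):
    """Fraction of gap characters in each alignment column."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    n = len(seqs)
    return [sum(s[i] in gap_chars for s in seqs) / n for i in range(length)]


def gappy_columns(seqs, threshold=0.5):
    """0-based indices of columns whose gap fraction exceeds the threshold."""
    fractions = gap_fraction_per_column(seqs)
    return [i for i, f in enumerate(fractions) if f > threshold]


# Toy alignment, invented for illustration (not the poster's data).
toy = ["ATG---GCA", "ATGAAAGCA", "ATG---GC-", "A-G---GCA"]
print(gappy_columns(toy))
```

Columns flagged this way are the ones most likely to make a codon-based tree and a nucleotide-based tree disagree, since each method has less shared data to work with there.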

Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
Gabriel
Re: Inconsistent results between Hyphy and datamonkey
Reply #2 - Mar 8th, 2010 at 6:51am
 
Dear Sergei,

Thank you very much for your reply. I used the nucleotide NJ (TN93) tree for my HyPhy analysis and also removed the poorly aligned positions and divergent regions of my alignment with Gblocks, and the results were in much better agreement.

I take this occasion to ask: what parameters do you recommend for testing positive selection with sequences from widely divergent species?

Gabriel
Sergei
Re: Inconsistent results between Hyphy and datamonkey
Reply #3 - Mar 8th, 2010 at 12:23pm
 
Hi Gabriel,

I am glad my suggestions improved reproducibility. Generally speaking, traditional molecular selection approaches may not work well with highly divergent species, because alignment and saturation (very long branch lengths) make inference difficult. Saturation means, in particular, that the methods may be unable to estimate the background rate of synonymous substitutions reliably, because it is beyond the upper limit of detection.  I would suggest that you limit your analyses to more conserved (reliably aligned) blocks, and perhaps consider . How divergent are your taxa?
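
As a simplified analogy for the saturation problem, consider the Jukes-Cantor (JC69) nucleotide model. This is not the codon model used by HyPhy or Datamonkey; it is swapped in here only because its formulas are short. Under JC69 the expected proportion of differing sites approaches 0.75 as the true distance grows, so very different true distances become indistinguishable from the observed data. The function names below are made up for this sketch.

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites after d substitutions/site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Corrected distance from observed proportion p; unbounded for p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The expected p flattens out near 0.75 as the true distance d increases.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = jc69_expected_p(d)
    print(f"true d = {d:4.1f}  expected p = {p:.4f}")
```

Near the plateau, a tiny change in the observed proportion maps to a huge change in the corrected distance, which is the sense in which a rate can be beyond the upper limit of detection.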

Sergei