What is Hyphy actually doing with the gaps?
mm
What is Hyphy actually doing with the gaps?
Jan 24th, 2008 at 10:07am
 
Dear HyPhy team,

I am comparing the results of several RRT analyses that I made using HyPhy (standalone) with several versions of an alignment. Some of these alignment versions include sequences with a large number of gaps in one of the most informative regions. Could you please tell me how HyPhy handles gaps by default? I would like to know whether including these gappy sequences is eliminating positions that are useful (in case HyPhy employs complete deletion) or introducing additional information (in case the gaps are by default considered as an additional state). Your answer is important for me because, in addition, I am comparing the results of other analyses made with HyPhy and PAML. In the case of PAML I have explicitly eliminated the gapped positions.

Thanks in advance and looking forward to your answer!

M&M
 
Art Poon
Re: What is Hyphy actually doing with the gaps?
Reply #1 - Jan 24th, 2008 at 4:21pm
 
Hi mm,

By default, HyPhy handles gaps as missing data, i.e. completely ambiguous character states.  In other words, a gap '-' would be treated as if an 'N' occurred at that position.

- Art.
 
Sergei
Re: What is Hyphy actually doing with the gaps?
Reply #2 - Jan 24th, 2008 at 5:44pm
 
Dear mm,

One other thing to add to Art's answer: by default HyPhy will count a gap as contributing 1/4 to each of the 'observed' nucleotide counts (or 1/20 to each protein residue if it is a protein alignment). This could have subtle effects on model parameter estimates when compared with PAML. You can override this behavior by setting:

Code:
COUNT_GAPS_IN_FREQUENCIES = 0;
 



at the beginning of the analysis file.
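
To make the 1/4 rule concrete, here is a minimal Python sketch (not HyPhy code) of how observed nucleotide frequencies shift when gaps are or are not counted. The function name, the `count_gaps` parameter, and the choice to treat '-', '?' and 'N' as fully ambiguous are assumptions for illustration; partial ambiguity codes such as R or Y are simply skipped here.

```python
NUCLEOTIDES = "ACGT"
GAP_LIKE = set("-?N")  # assumption: characters treated as fully ambiguous


def nucleotide_frequencies(sequences, count_gaps=True):
    """Return observed base frequencies for a list of aligned sequences.

    With count_gaps=True each gap-like character adds 1/4 to every base,
    mimicking the default described above. With count_gaps=False such
    characters are skipped, analogous to COUNT_GAPS_IN_FREQUENCIES = 0.
    """
    counts = {base: 0.0 for base in NUCLEOTIDES}
    for seq in sequences:
        for ch in seq.upper():
            if ch in counts:
                counts[ch] += 1.0
            elif ch in GAP_LIKE and count_gaps:
                for base in NUCLEOTIDES:
                    counts[base] += 0.25
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no countable characters in the alignment")
    return {base: counts[base] / total for base in NUCLEOTIDES}


# Toy alignment: one sequence is half gaps.
alignment = ["AAAA----", "AAAACCGT"]
with_gaps = nucleotide_frequencies(alignment, count_gaps=True)
without_gaps = nucleotide_frequencies(alignment, count_gaps=False)
print(with_gaps)
print(without_gaps)
```

In this toy case counting the gaps pulls the frequencies toward 1/4 each, which is the kind of subtle difference from PAML mentioned above.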

To have HyPhy automatically prune all positions with gaps, use the preferences dialog (under Data Read/Write Settings). Also, intuitively, treating gaps as missing data is the same as removing from the analysis, at a given site, all branches (and subtrees) that have only gaps at the tips. Hence a site with all gaps contributes no information to the analysis, a site with only 3 non-gap characters carries the same information as a 3-taxon tree, etc.
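
The "gap equals pruned branch" intuition can be checked numerically. The sketch below is a toy Python calculation, not HyPhy's implementation: it assumes a star tree and the JC69 model purely for simplicity, and all function names are hypothetical. A gap at a tip is represented by a vector of ones, so its branch contributes a factor of exactly 1 to the site likelihood.

```python
import math

BASES = "ACGT"


def jc69_prob(i, j, t):
    """JC69 transition probability from base i to base j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def tip_vector(ch):
    """Conditional likelihoods at a tip: all ones for a gap, an indicator otherwise."""
    if ch == "-":
        return [1.0] * 4
    return [1.0 if b == ch else 0.0 for b in BASES]


def star_tree_site_likelihood(tips):
    """Likelihood of one site on a star tree; tips is a list of (character, branch_length)."""
    total = 0.0
    for root in BASES:
        term = 0.25  # equal base frequencies under JC69
        for ch, t in tips:
            vec = tip_vector(ch)
            term *= sum(jc69_prob(root, BASES[s], t) * vec[s] for s in range(4))
        total += term
    return total


# A 4-taxon site with one gap versus the same site with that taxon removed.
four_taxa_one_gap = star_tree_site_likelihood(
    [("A", 0.1), ("C", 0.2), ("A", 0.3), ("-", 0.4)]
)
three_taxa = star_tree_site_likelihood([("A", 0.1), ("C", 0.2), ("A", 0.3)])

# A site made only of gaps has likelihood 1 (log-likelihood 0): no information.
all_gaps = star_tree_site_likelihood([("-", 0.1), ("-", 0.2), ("-", 0.3)])
print(four_taxa_one_gap, three_taxa, all_gaps)
```

The first two values agree up to floating-point rounding because each row of the transition matrix sums to 1, which is why the gapped branch drops out.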

Cheers,
Sergei

 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
mm
Re: What is Hyphy actually doing with the gaps?
Reply #3 - Jan 25th, 2008 at 7:39am
 
Thanks a lot for your answers!

M&M
 
Jamie
How to deploy COUNT_GAPS_IN_FREQUENCIES?
Reply #4 - Mar 11th, 2011 at 2:04am
 
Sergei wrote on Jan 24th, 2008 at 5:44pm:
You can override this behavior by setting:

Code:
COUNT_GAPS_IN_FREQUENCIES = 0;
 



at the beginning of the analysis file.




This is a useful thread and advice regarding how to port PAML-like behavior to HyPhy, but I have a few more questions about trying to implement this.  In particular, can you please be more specific about what 'analysis file' means here?  And will the positioning of this line of code matter depending on which analysis is being run?

I am using the 'inputRedirect' approach to run SelectionSubtreeComparison.bf.  So do I put COUNT_GAPS_IN_FREQUENCIES=0; at the top of my control file above 'inputRedirect ={}', or do I create a new version of SelectionSubtreeComparison.bf with this line at the top (or somewhere else)?  

Also, is there an obvious way to know from the output that this has performed as expected relative to default?  For instance will the dataviewer show a partition excluding these sites?  Will the reported list of sites analyzed be reduced by the number of gaps relative to default?

Thanks,  Jamie
 
Sergei
Re: How to deploy COUNT_GAPS_IN_FREQUENCIES?
Reply #5 - Mar 11th, 2011 at 2:06pm
 
Hi Jamie,


In HyPhy, all variables are effectively global, so if you write

Code:
COUNT_GAPS_IN_FREQUENCIES = 0;
 



BEFORE you call ExecuteAFile, that should set the environment variable for the analysis.

Sergei
 
Jamie
Re: What is Hyphy actually doing with the gaps?
Reply #6 - Mar 14th, 2011 at 5:57am
 
Sergei, thanks for offering the more explicit instructions.  However, in doing so, I realized that I was looking for a cleaver and you gave me a scalpel (to reference the prez's budgeting...).

To explain further, and in reference to mm who started this thread, PAML has a 'cleandata' option which causes the program to ignore any site that has a gap.  This is what I mean by a cleaver.  My understanding of PAML is that, with this option turned on, if you give it a 10 codon alignment and there is a gap in one or more taxa for 1 codon, it analyzes only 9 codons.  In HyPhy terms, it creates a partition that excludes any codon sites with a gap in the alignment.

Now, assuming I'm understanding things correctly, the 'COUNT_GAPS_IN_FREQUENCIES = 0;' "scalpel" is much more nuanced than this.  Using this option does not affect the partitioning of the data in any way -- all 10 sites in our alignment will be analyzed.  It just affects how the nucleotide frequencies are estimated, as you describe above.

So what I should have asked is whether there is a way to tell HyPhy to ignore every alignment position that includes a gap.  I believe this is what mm was asking for.  So, in the case of an FEL analysis (for instance in the SelectionSubtreeComparison that I'm working with), I should have an output where estimates are reported for only 9 of 10 sites.

I do recognize that FELs are conveniently 'quantized' by site and that it would not be all that difficult to generate a 'gapped T/F' vector externally via some script that parses the alignment.  But this is not the case for REL analyses, which combine info across sites.  So, for anyone migrating from PAML to HyPhy (as I think we all should...), having this option in HyPhy would help the pilgrims.
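
For reference, the external 'gapped T/F' vector could look something like the following Python sketch. It assumes the alignment has already been read into a list of equal-length, in-frame strings; the function name and the toy 10-codon alignment are made up for illustration.

```python
def gapped_codon_mask(sequences, gap_chars="-"):
    """Return one boolean per codon site: True if any sequence has a gap there."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n = lengths.pop()
    if n % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    mask = []
    for start in range(0, n, 3):
        has_gap = any(
            ch in gap_chars for s in sequences for ch in s[start:start + 3]
        )
        mask.append(has_gap)
    return mask


# Toy 10-codon alignment with a gap in the fourth codon of one taxon.
toy = [
    "ATGGCTAAACCCGGGTTTACGCTAGATTGA",
    "ATGGCTAAA---GGGTTTACGCTAGATTGA",
]
mask = gapped_codon_mask(toy)

# 1-based codon sites whose per-site FEL estimates would be kept.
kept_sites = [i + 1 for i, gapped in enumerate(mask) if not gapped]
print(mask)
print(kept_sites)
```

Such a mask only helps for per-site output like FEL; it does nothing for REL, where the gapped sites would still influence the shared parameter estimates.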

Thanks for helping me out with the nitty-gritty here...

Jamie
 
Sergei
Re: What is Hyphy actually doing with the gaps?
Reply #7 - Mar 14th, 2011 at 6:16am
 
Hi Jamie,

HyPhy has a similar, but not exactly identical, flag (albeit poorly named):

Code:
SKIP_OMISSIONS = 1;
 



When the flag is set to 1, any DataSetFilter object will exclude sites with at least one N-fold ambiguity (including '-','?','N'). This should approximate the behavior of PAML quite closely.
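
As a rough illustration of the effect described here (a Python sketch of the idea, not HyPhy's own filtering code), the function below drops every site at which any sequence carries a fully ambiguous character. The function name, the `unit` parameter (1 for nucleotide sites, 3 for codon sites) and the default character set are assumptions taken from the description above.

```python
def drop_fully_ambiguous_sites(sequences, ambiguous="-?N", unit=1):
    """Remove every site (of `unit` characters) containing a fully ambiguous character."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    if length % unit != 0:
        raise ValueError("alignment length must be a multiple of unit")
    kept = []
    for start in range(0, length, unit):
        # All characters observed at this site across the alignment.
        column = "".join(s[start:start + unit] for s in sequences).upper()
        if not any(ch in ambiguous for ch in column):
            kept.append(start)
    return ["".join(s[k:k + unit] for k in kept) for s in sequences]


# Sites 2, 3 and 5 (1-based) each contain '-', '?' or 'N' in some sequence.
seqs = ["ACGTAC", "A-GTNC", "AC?TAC"]
filtered = drop_fully_ambiguous_sites(seqs)
print(filtered)
```

With `unit=3` the same call works on whole codons, which is closer to the 9-out-of-10 codon example discussed earlier in the thread.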

Sergei