HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1201198027

Message started by mm on Jan 24th, 2008 at 10:07am

Title: What is Hyphy actually doing with the gaps?
Post by mm on Jan 24th, 2008 at 10:07am
Dear HyPhy team,

I am comparing the results of several RRT analyses that I ran using HyPhy (standalone) on several versions of an alignment. Some of these alignment versions include sequences with a large number of gaps in one of the most informative regions. Could you please tell me how HyPhy handles gaps by default? I would like to know whether including these gappy sequences eliminates positions that are useful (if HyPhy employs complete deletion) or introduces additional information (if gaps are by default treated as an additional state). Your answer is important to me because, in addition, I am comparing the results of other analyses made with HyPhy and PAML. In the case of PAML, I have explicitly eliminated the gapped positions.

Thanks in advance, and looking forward to your answer!

M&M

Title: Re: What is Hyphy actually doing with the gaps?
Post by artpoon on Jan 24th, 2008 at 4:21pm
Hi mm,

By default, HyPhy handles gaps as missing data, i.e. completely ambiguous character states.  In other words, a gap '-' would be treated as if an 'N' occurred at that position.

- Art.

Title: Re: What is Hyphy actually doing with the gaps?
Post by Sergei on Jan 24th, 2008 at 5:44pm
Dear mm,

One other thing to add to Art's answer: by default HyPhy will count a gap as contributing 1/4 to each of the 'observed' nucleotide counts (or 1/20 to each protein residue if it is a protein alignment). This could have subtle effects on model parameter estimates when compared with PAML. You can override this behavior by setting:

[code]
COUNT_GAPS_IN_FREQUENCIES = 0;
[/code]

at the beginning of the analysis file.
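
To make the 1/4 counting concrete, here is a minimal Python sketch (not HyPhy code) of the arithmetic described above. The function name, the `count_gaps` argument, and the toy alignment are made up for illustration, and the sketch deliberately ignores every ambiguity code other than '-', so it only approximates what HyPhy does internally.

```python
NUCLEOTIDES = "ACGT"

def nucleotide_frequencies(sequences, count_gaps=True):
    """Observed base frequencies for a list of aligned nucleotide strings.

    With count_gaps=True each '-' adds 1/4 to every base count (the default
    behavior described in this thread); with count_gaps=False gaps are
    ignored. Other ambiguity codes are skipped in both cases, which is a
    simplification of this sketch.
    """
    counts = {base: 0.0 for base in NUCLEOTIDES}
    for seq in sequences:
        for ch in seq.upper():
            if ch in counts:
                counts[ch] += 1.0
            elif ch == "-" and count_gaps:
                for base in NUCLEOTIDES:
                    counts[base] += 0.25
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no countable characters in the alignment")
    return {base: counts[base] / total for base in NUCLEOTIDES}

# Hypothetical toy alignment: 10 resolved bases and 2 gaps.
toy_alignment = ["AAAA", "AA--", "CCGT"]
with_gaps = nucleotide_frequencies(toy_alignment, count_gaps=True)
without_gaps = nucleotide_frequencies(toy_alignment, count_gaps=False)
print(with_gaps)
print(without_gaps)
```

In this toy case the two gaps pull the frequency of A from 6/10 down to 6.5/12, which is the kind of subtle shift in the frequency estimates that could show up when comparing against PAML.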

To have HyPhy automatically prune all positions with gaps, use the preferences dialog (under Data Read/Write Settings). Also, intuitively, treating gaps as missing data is the same as removing from the analysis, at a given site, all branches (and subtrees) that have only gaps at the tips; hence a site with all gaps contributes no information to the analysis, a site with only 3 non-gap characters carries the same information as a 3-taxon tree, etc.
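
As a rough way to see how much each site still contributes under the missing-data view, the following Python sketch (again not HyPhy code) counts the non-gap sequences per column. The function name, the set of characters treated as missing, and the toy alignment are assumptions for illustration only.

```python
# Assumption of this sketch: nucleotide data, with '-', '?' and 'N'
# all treated as completely missing.
MISSING = set("-?N")

def informative_taxa_per_site(sequences):
    """Return, for each alignment column, how many sequences are not missing."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [
        sum(1 for seq in sequences if seq[i].upper() not in MISSING)
        for i in range(length)
    ]

# Hypothetical 4-taxon toy alignment.
toy_alignment = ["ACGT-", "A-GT-", "AC-T-", "-C-T-"]
per_site = informative_taxa_per_site(toy_alignment)
print(per_site)
```

A column with a count of 0 corresponds to the all-gap site that contributes nothing, and a count of 3 corresponds to the "3-taxon tree" case above.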

Cheers,
Sergei


Title: Re: What is Hyphy actually doing with the gaps?
Post by mm on Jan 25th, 2008 at 7:39am
Thanks a lot for your answers!

M&M  :D

Title: How to deploy COUNT_GAPS_IN_FREQUENCIES?
Post by Jamie on Mar 11th, 2011 at 2:04am

Sergei wrote on Jan 24th, 2008 at 5:44pm:
You can override this behavior by setting:

[code]
COUNT_GAPS_IN_FREQUENCIES = 0;
[/code]

at the beginning of the analysis file.



This is a useful thread and advice regarding how to port PAML-like behavior to HYPHY, but I have a few more questions about trying to implement this.  In particular, can you please be more specific about what 'analysis file' means here?  And will the positioning of this line of code matter depending on which analysis is being run?

I am using the 'inputRedirect' approach to run SelectionSubtreeComparison.bf.  So do I put COUNT_GAPS_IN_FREQUENCIES=0; at the top of my control file above 'inputRedirect ={}', or do I create a new version of SelectionSubtreeComparison.bf with this line at the top (or somewhere else)?  

Also, is there an obvious way to know from the output that this has performed as expected relative to default?  For instance will the dataviewer show a partition excluding these sites?  Will the reported list of sites analyzed be reduced by the number of gaps relative to default?

Thanks,  Jamie

Title: Re: How to deploy COUNT_GAPS_IN_FREQUENCIES?
Post by Sergei on Mar 11th, 2011 at 2:06pm
Hi Jamie,


In HyPhy, all variables are effectively global, so if you write

[code]
COUNT_GAPS_IN_FREQUENCIES = 0;
[/code]

BEFORE you call ExecuteAFile, that should set the environment variable for the analysis.

Sergei

Title: Re: What is Hyphy actually doing with the gaps?
Post by Jamie on Mar 14th, 2011 at 5:57am
Sergei, thanks for offering the more explicit instructions.  However, in doing so, I realized that I was looking for a cleaver and you gave me a scalpel (to reference the prez's budgeting...).

To explain further, and in reference to mm who started this thread, PAML has a 'dirty data' option which causes the program to ignore any site that has a gap.  This is what I mean by a cleaver.  My understanding of PAML is that, with this option turned on, if you give it a 10 codon alignment and there is a gap in one or more taxa for 1 codon, it analyzes only 9 codons.  In hyphy terms, it creates a partition that excludes any codon sites with a gap in the alignment.

Now, assuming I'm understanding things correctly, the 'COUNT_GAPS_IN_FREQUENCIES = 0;' "scalpel" is much more nuanced than this.  Using this option does not affect the partitioning of the data in any way -- all 10 sites in our alignment will be analyzed.  It just affects how the nucleotide frequencies are estimated for that codon site, as you describe above.

So what I should have asked is whether there is a way to tell HYPHY to ignore every alignment position that includes a gap.  I believe this is what mm was asking for.  So, in the case of an FEL analysis (for instance in the SelectionSubtreeComparison that I'm working with), I should have an output where estimates are reported for only 9 of 10 sites.  

I do recognize that FELs are conveniently 'quantized' by site and that it would not be all that difficult to generate a 'gapped T/F' vector externally via some script that parses the alignment.  But this is not the case for REL analyses which combine info across sites.  So, for anyone migrating from PAML to HYPHY (as I think we all should...), having this option in HYPHY would help the pilgrims.
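
For what it's worth, the external 'gapped T/F' vector mentioned above could be sketched along these lines in Python. This is only an illustration under stated assumptions: the function name, the set of gap-like characters, and the toy alignment are hypothetical, and the sequences are assumed to be already aligned, in frame, and a multiple of 3 in length.

```python
# Assumption of this sketch: '-', '?' and 'N' all mark a codon as gapped.
GAP_LIKE = set("-?N")

def gapped_codon_mask(sequences, unit=3):
    """Return one boolean per codon: True if any sequence has a gap-like
    character anywhere in that codon, False otherwise."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    if length % unit != 0:
        raise ValueError("alignment length must be a multiple of the unit size")
    mask = []
    for start in range(0, length, unit):
        gapped = any(
            ch.upper() in GAP_LIKE
            for seq in sequences
            for ch in seq[start:start + unit]
        )
        mask.append(gapped)
    return mask

# Hypothetical 3-codon toy alignment with a gap in the second codon.
toy_alignment = ["ATGAAACCC", "ATG---CCC", "ATGAAACCG"]
mask = gapped_codon_mask(toy_alignment)
kept_sites = [i + 1 for i, gapped in enumerate(mask) if not gapped]  # 1-based
print(mask)
print(kept_sites)
```

The resulting vector can be lined up against per-site FEL output to drop the gapped codons after the fact; as noted above, this kind of post hoc filtering does not carry over to REL analyses, which combine information across sites.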

Thanks for helping me out with the nitty-gritty here...

Jamie

Title: Re: What is Hyphy actually doing with the gaps?
Post by Sergei on Mar 14th, 2011 at 6:16am
Hi Jamie,

HyPhy has a similar, but not exactly identical, flag (albeit poorly named):

[code]
SKIP_OMISSIONS = 1;
[/code]

When the flag is set to 1, any DataSetFilter object will exclude sites with at least one N-fold ambiguity (including '-','?','N'). This should approximate the behavior of PAML quite closely.

Sergei
