HyPhy

HyPhy tutorial

This tutorial highlights three primary ways to interact with HyPhy:

  1. Through menu-driven standard analyses
  2. Through the interactive graphical user interface
  3. Through HyPhy Batch Language Scripts

Documents

Sections of this tutorial refer to two PDF documents:

  1. The HyPhy book chapter in this book
  2. The soon to be published Applied Methods Chapter on "Evolution of viral genomes - Interplay between selection, recombination and other forces" in this book

Using standard analyses

In this section, we will learn how to interact with HyPhy via its standard analyses. The section assumes access to a HyPhy executable with GUI capabilities. Most of the following can also be performed in the command-line version of the program. To access the list of standard analyses in the command-line version, simply cd to the installation directory and start HyPhy.

Running SLAC

In this exercise we will screen an alignment of 19 West Nile virus sequences (NS3 gene) for evidence of diversifying positive selection using the SLAC method implemented in HyPhy.

  1. Download File:WN NS3.fas to your computer saving it into a new directory (name it WNNS3)
  2. Launch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Positive Selection to expand the list further, choose QuickSelectionDetection.bf. This file implements a wide variety of analyses, each having multiple options, and can be quite intimidating! For detailed discussion of many of the options, please take a look at Section 2 of The Positive Selection in HyPhy book chapter. In the following listing of input options to the analysis, we are going to use the shorthand Option : Value
  1. Choose Genetic Code : Universal
  2. New/Restore : New Analysis
  3. Please specify a codon data file : Use the navigation box to find WN_NS3.fas
  4. Model Options : Custom
  5. Please enter a 6 character model designation (e.g:010010 defines HKY85): 012345
  6. A tree was found in the data file...: Y (type into the lower box of the console window and press Enter)
  7. Save nucleotide model fit to : Use the navigation box to save the file to the same directory as WN_NS3.fas, but name it WN_NS3.HKY85.fit
  8. dN/dS bias parameter options : Estimate dN/dS only
  9. Ancestor counting options : Single Ancestor Counting
  10. SLAC Options : Full tree
  11. Treatment of Ambiguities : Averaged
  12. Test Statistic : Approximate
  13. ... HYPHY will do some work and print some text to the console window ...
  14. Significance level for a site to be classified as positively/negatively selected? : 0.1 (type into the lower box of the console window and press Enter)
  15. ... More text to the screen, including the list of sites under positive and negative selection ...
  16. Output Options : Chart
  17. Rate Class Estimator : Skip

At the end of the execution several windows will be displayed, and a single positively selected site should be reported:


******* FOUND 1 POSITIVELY SELECTED SITES ********
+--------------+--------------+--------------+--------------+
| Index        | Site Index   | dN-dS        | p-value      | 
+--------------+--------------+--------------+--------------+
|            1 |   249.000000 |     3.304463 |     0.087525 |
+--------------+--------------+--------------+--------------+

SLAC screen shot

Take a look at the row for site 249 in the results chart: this is the site reported as positively selected. The method inferred that a total of 7 substitutions took place at that site, with 0 of them (Observed S Changes) being synonymous and 7 (Observed NS Changes) being non-synonymous. How does this observed synonymous proportion (0, Observed S. Prop.) compare to the expected proportion under neutrality? This neutral expectation is computed by averaging the imputed proportions of synonymous and non-synonymous 'sites' along the tree. These 'site' counts can be more accurately thought of as the proportions of random one-nucleotide substitutions that are expected to be synonymous (E[S Sites] = 0.8817) and non-synonymous (E[NS Sites] = 2.002). The proportion derived from them (P{S} = 0.293884) is the proportion of substitutions expected to be synonymous under neutral evolution. For site 249, this means that about 2 of the 7 substitutions would be expected to be synonymous, and the probability of getting 0/7 synonymous substitutions under the binomial distribution with the probability parameter P{S} = 0.293884 is (1-P{S})^7 = 0.087525. This number is the one reported in the column P{S leq. observed}, and is the p-value for having as many or fewer synonymous substitutions at a site (positive selection), with the analogous quantity for negative selection reported in the column P{S geq. observed}. dS and dN are simply S/E[S Sites] and NS/E[NS Sites], dN-dS is their difference, and Scaled dN-dS is dN-dS normalized by the total length of the tree.
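
The binomial calculation described above can be written out as a worked equation. This is a sketch that uses only the numbers quoted for site 249; the symbols S (number of synonymous substitutions) and p (shorthand for P{S}) are notation introduced here for illustration, not names used by HyPhy.

```latex
% 7 inferred substitutions at the site, p = P{S} = 0.293884, observed S = 0
\[
\begin{aligned}
\Pr(S \le 0) &= \binom{7}{0}\, p^{0}\, (1-p)^{7} \\
             &= (1 - 0.293884)^{7} \\
             &= 0.706116^{7} \approx 0.0875
\end{aligned}
\]
```

Because no synonymous change was observed, the lower tail of the binomial distribution collapses to a single term, which is why the p-value is simply a seventh power.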



Using Single Breakpoint Recombination (SBP)

In this exercise we will examine an alignment of 9 HIV-1 pol sequences for evidence of recombination using SBP.

  1. Download File:HIV pol BC 08.fas to your computer saving it into a new directory (name it SBP)
  2. Launch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Recombination to expand the list further (if needed), choose SingleBreakpointRecomb.bf. The analysis examines every possible location for a single recombination breakpoint (i.e. every variable site in the alignment) and computes the goodness of fit of the model that allows recombination (i.e. has two trees, left and right of the breakpoint) versus the universal null (single tree).
  1. Data type : Nucleotide
  2. Locate a nucleotide data file : Use the navigation box to find HIV_pol_BC_08.fas
  3. KH Testing : Skip (we'll do it later)
  4. Choose one of the standard models : HKY85
  5. Model options : Global
  6. Save analysis results to : Use the navigation box to save the file to the same directory as HIV_pol_BC_08.fas, but name it SBP.txt
  7. ... HYPHY will do some work and print some text to the console window ...

For each possible breakpoint, the program will fit the 2-tree model and report its goodness of fit according to three information criteria: AIC, small-sample AIC (AICc, the default), and BIC. AIC is the least conservative of the three and BIC is by far the most conservative.
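
For reference, the three criteria are usually defined as follows. These are the standard textbook forms, given as a reminder rather than as a description of HyPhy's exact implementation. Here ln L is the maximized log-likelihood, k the number of estimated parameters and n the sample size; how HyPhy counts n (for example, as the number of alignment sites) is an assumption that this tutorial does not confirm.

```latex
\[
\begin{aligned}
\mathrm{AIC}     &= 2k - 2\ln L \\
\mathrm{AIC}_{c} &= \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \\
\mathrm{BIC}     &= k\ln n - 2\ln L
\end{aligned}
\]
```

The two-tree model adds a second set of branch lengths, so the per-parameter penalty matters. BIC charges ln n per parameter instead of 2, which exceeds the AIC penalty whenever n is larger than about 7, and this is why BIC can reject a breakpoint that AIC and AICc accept.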

A typical output line may look like this, where AIC and AICc both support locating a breakpoint, but BIC does not.

Breakpoint at position    260. dAIC =       63.21 dAICc =      61.54 dBIC =    -110.89

At the end of the run, a summary is presented:

AIC

Best supported breakpoint is located at position 260
AIC = 5155.5 : an improvement of 63.2106 AIC points

AIC-c

Best supported breakpoint is located at position 260
AIC = 5157.78 : an improvement of 61.5429 AIC points

BIC

There seems to be NO recombination in this alignment

The analysis will generate two tree files (one for the tree to the left of the breakpoint, and one for the tree to the right of it), written to the same directory as the main result file (SBP.txt in this case) and called SBP.txt.trees1 and SBP.txt.trees2.

Next, we will perform a standard topological incongruence test, using the GARDProcessor.bf file in the Recombination section of standard analyses.

  1. Please load a nucleotide data file : Use the navigation box to find HIV_pol_BC_08.fas
  2. Please load a GA partition analysis result file: : Use the navigation box to find SBP.txt_cAIC.splits
  3. ... HYPHY will do some work and print some text to the console window ...

HyPhy will perform a variety of tests to determine whether the breakpoint can be attributed to topological incongruence (i.e. recombination and not rate variation or heterotachy), and report that in this case, the breakpoint at position 261 is likely due to recombination according to the Kishino-Hasegawa test:

Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p 
       261 |   0.00050 |        0.00100 |   0.00540 |        0.01080

At p = 0.01 there are 0 significant breakpoints
At p = 0.05 there are 1 significant breakpoints
At p = 0.1 there are 1 significant breakpoints
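
The adjusted p-values in the table are exactly twice the raw ones, which is what a Bonferroni correction for two tests (one on each side of the breakpoint) would produce. This reading is an inference from the printed numbers, not something the tutorial states about GARDProcessor.bf, and the symbol m is notation introduced here.

```latex
% m = 2 tests (LHS and RHS of the single breakpoint) is an assumption
\[
p_{\text{adj}} = \min\left(1,\; m\, p_{\text{raw}}\right), \qquad m = 2:
\qquad 2 \times 0.00050 = 0.00100, \qquad 2 \times 0.00540 = 0.01080
\]
```

Under this reading the breakpoint is counted as significant at a given level only when both adjusted p-values fall below it, which agrees with the summary: 0.01080 exceeds 0.01 but is below 0.05.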

Using a multi-partition fixed effects likelihood (FEL) to correct for the confounding effect of recombination

In this exercise we will examine an alignment of 13 glycoprotein sequences from Cache Valley Fever virus, with and without correcting for recombination.

  1. Download File:CVV G.fas and File:CVV G GARD.nex to your computer saving it into a new directory (name it FEL)
  2. Launch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Selection/Recombination to expand the list further (if needed), choose QuickSelectionDetectionMF.bf. The analysis allows SLAC and FEL to be run on a dataset which is partitioned into multiple non-recombinant fragments (e.g. by GARD), i.e. using the PARRIS approach (http://bioinformatics.oxfordjournals.org/content/22/20/2493.abstract).
  1. Choose Genetic Code : Universal
  2. New/Restore : New Analysis
  3. Model Options : Custom
  4. Please enter a 6 character model designation (e.g:010010 defines HKY85):012345 (type into the lower box of the console window and press Enter)
  5. How many datafiles are to be analyzed (>=1):?: 1 (type into the lower box of the console window and press Enter)
  6. Please specify a codon data file : Use the navigation box to find CVV_G_GARD.nex (this file contains the alignment and the five partitions inferred to be non-recombinant by GARD).
  7. Save nucleotide model fit to : Use the navigation box to save it to CVV_G_GARD.REV
  8. dN/dS bias parameter options : Estimate dN/dS only
  9. Which method? : FEL
  10. ... HYPHY will do some work and print some text to the console window ...
  11. Significance level for Likelihood Ratio Tests (between 0 and 1)? : 0.1 (type into the lower box of the console window and press Enter)
  12. Branch option : All
  13. ... HYPHY will do some work and print some text to the console window ...
  14. Save results to : Save site-by-site LRT results to CVV_G_GARD.csv

A typical output line may look like this:

Site  195 dN/dS =     inf dN =  4.9848 dS =  0.0000 dS(=dN)  2.3353 Full Log(L) = -14.5463 LRT=  3.9208 p-value =  0.04769 *P

Here, codon 195 has the maximum likelihood synonymous rate (dS) inferred at 0 and the non-synonymous rate (dN) at 4.9848 (their ratio is infinite). The log-likelihood of the site with these parameters is -14.5463. The null model, which forces dN=dS, infers the shared value at 2.3353. The likelihood ratio test for non-neutral evolution has a test statistic of 3.9208 and a p-value of 0.04769 (which is significant at the specified level). The site is called positively selected (*P), because the test is significant and dN>dS.
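
The reported p-value is consistent with comparing the LRT statistic to a chi-squared distribution with one degree of freedom, since the null model removes one free parameter by forcing dN=dS. The choice of one degree of freedom is an assumption checked only against the numbers in the example line above.

```latex
% LRT = 3.9208 from the example line; 1 degree of freedom is assumed
\[
p = \Pr\left(\chi^{2}_{1} \ge 3.9208\right)
  = \operatorname{erfc}\left(\sqrt{3.9208 / 2}\right)
  \approx 0.0477
\]
```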

Now repeat the analysis on the same file but not corrected for recombination (File:CVV G.fas). Make sure not to overwrite any of the result files. If you have access to a plotting program (e.g. Excel or R), load the two resulting .csv files and plot the p-values against each other. Do you think that there is an effect depending on whether or not we correct for the possible confounding caused by recombination?

GARD or no GARD?

Finding co-evolving sites

Follow the exercise described in section 1.8 of the Applied Methods Chapter (pages 21-24). You will need to use this alignment file.


Using the HyPhy Graphical User Interface (GUI)

Fitting a simple nucleotide model

Perform the steps for the Example: Basic Analysis described on pages 4-8 of the HyPhy book chapter.

After the analysis is finished, click on the RT Gene Shared TVTS line in the Parameter Table, then choose the Likelihood:Covariance, Sampler and CI menu, select menu options as shown in the following figures and obtain a 95% confidence interval estimate for this parameter.

Confidence Intervals Options

Please note that some of the values and parameter names may be different in the current implementation of HyPhy compared to the linked documents.

Local branch parameters

Perform the steps for the Local Branch Parameters described on pages 12-15 of the HyPhy book chapter.

Multiple partitions and hypothesis testing

Perform the steps for the Multiple partitions and hypothesis testing described on pages 15-19 of the HyPhy book chapter.


HyPhy Batch Language

The HBL files needed for this section can be found in [1]

HBL 101

Open the file basics.bf in a text editor and follow section 3.1 in the HyPhy book chapter to see what each line of the file does.

Automating HyPhy analyses

Interactive, dialog driven analyses are useful for learning, exploring new options and running analyses infrequently. However, if a large number of sequence files must be subjected to the same analysis flow, then a mechanism to automate making the same choices over and over again is desirable. HyPhy provides a mechanism for scripting any standard analysis using input redirection. To instruct HyPhy to make any given set of selections automatically, one must first execute the standard analysis for which the selections must be made and make notes of the actual choices being made. For instance, to use the standard analysis AnalyzeCodonData.bf with a local MG94 × 012232 substitution model, 6 choices must be made: genetic code to use (Universal), alignment file to import, substitution model (MG94CUSTOM), model options (Local), nucleotide bias (012232) and the tree to use. Having made these choices, one next creates a text file with a script in the HyPhy batch language which may look like this (assuming that the tree included in the file is to be used for step 6).

inputRedirect = {};
inputRedirect["01"]="Universal";
inputRedirect["02"]="/Users/sergei/Desktop/MyFiles/somealignment.nex";
inputRedirect["03"]="MG94CUSTOM";
inputRedirect["04"]="Local";
inputRedirect["05"]="012232";
inputRedirect["06"]="y";
ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles"+ DIRECTORY_SEPARATOR+"AnalyzeCodonData.bf", inputRedirect);

inputRedirect is a data structure (an associative array) which stores a succession of inputs, indexed by the order in which they will be used. The ExecuteAFile command executes the needed standard analysis, using some predefined variables to specify the path to that file and fetching user responses from inputRedirect. All standard analyses reside in the same directory, so this command can be easily adjusted for other analyses. The input for step "02" must, of course, refer to an existing file. Another option is to leave that input blank (""), and have HyPhy prompt just for the file, keeping other options as specified. To execute a file like this, invoke File:Open:Open Batch File.
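
As a further illustration, the same pattern could be applied to the SBP analysis run earlier in this tutorial. The following is a minimal sketch, not a tested script: the option strings are copied from the SBP step list above and may need to match the menu entries of your HyPhy version exactly, the two /path/to/... entries are hypothetical placeholders, and an extra prompt (for example, about a tree found in the data file) might require an additional entry.

```hbl
/* Sketch only: answers are taken from the SBP walkthrough in this tutorial */
inputRedirect = {};
inputRedirect["01"]="Nucleotide";                      /* Data type */
inputRedirect["02"]="/path/to/SBP/HIV_pol_BC_08.fas";  /* placeholder path to the alignment */
inputRedirect["03"]="Skip";                            /* KH Testing */
inputRedirect["04"]="HKY85";                           /* standard model */
inputRedirect["05"]="Global";                          /* Model options */
inputRedirect["06"]="/path/to/SBP/SBP.txt";            /* placeholder path for the results file */
ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "SingleBreakpointRecomb.bf", inputRedirect);
```

If the analysis stops and waits for input, the console prompt shows which answer is missing or mismatched; compare it against the list of choices noted during the interactive run.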

Exercise 1

Create a HyPhy 'wrapper' batch file for AnalyzeCodonData.bf as described above. Execute it on p51.nex, both leaving option "02" blank, and filling it with the path to the file.

Exercise 2

First create a similar wrapper file for the SLAC analysis we performed earlier in this tutorial. Second, wrap it in a simple loop structure to apply it to several files. For example, assume that you want to apply the same analysis to files

/Users/sergei/testdata/file1.fas and /Users/sergei/testdata/file2.fas

First, create a text file that contains the full path to each alignment file, one per line, for example (try doing it with files p51.nex and Integrase_BDA.nex in the data directory of the HyPhy distribution):

/Users/sergei/testdata/file1.fas
/Users/sergei/testdata/file2.fas

Next, create the following text file to be executed by HyPhy (this example uses AnalyzeCodonData.bf). When you execute this HBL in HyPhy, provide the file with path names you have just created as input.

fileToExe = HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "AnalyzeCodonData.bf";
 
/* a  list of file paths */
SetDialogPrompt ( "Provide a list of files to process:" );
fscanf ( PROMPT_FOR_FILE, "Lines", _inDirectoryPaths );
 
fprintf (stdout, "[READ ", Columns (_inDirectoryPaths), " file path lines]\n");
 
/* the options passed to the GUI are encoded here */
inputRedirect = {};
inputRedirect["01"]="Universal";
inputRedirect["03"]="MG94CUSTOM";
inputRedirect["04"]="Local";
inputRedirect["05"]="012232";
inputRedirect["06"]="y";
 
for ( _fileLine = 0; _fileLine < Columns ( _inDirectoryPaths ); _fileLine = _fileLine + 1 ) {
 
	inputRedirect [ "02" ]	= _inDirectoryPaths[ _fileLine ];
	ExecuteAFile ( fileToExe, inputRedirect );
}

Using the code above as a guide, modify inputRedirect to work for SLAC and run the batched analysis on two files.
