FUBAR
Citation: If you are using FUBAR, please cite ^{[1]}
Motivation: Modelbased selection analyses (such as those performed by PAML and HyPhy) can be slow, becoming impractical for large alignments. We present a method to model and detect selection much faster than existing methods and to leverage Bayesian MCMC to robustly account for parameter estimation errors.
Results: By exploiting some commonly used approximations, FUBAR can perform detection of positive selection under a model that allows rich site tosite rate variation about 30 to 50 times faster than existing random effects likelihood methods, and 10 to 30 times faster than existing fixed effects likelihood methods. We introduce an ultrafast MCMC routine that allows a flexible prior specification, with no parametric constraints on the prior shape. Furthermore, our method allows us to visualize Bayesian inference for each site, revealing the model supported by the data.
Contents

Batch files and examples
FUBAR can be run on Datamonkey, or in HyPhy, using the HBL files available from GitHub. Note that several phases of the analysis have been coded to run in parallel on multiple MPI nodes, when available.
Obtain a recent HyPhy version
Please note that FUBAR uses some of the recently added functionality only available in the 2.2 distribution.
FUBAR workflow
FUBAR consists of five phases. The analysis will checkpoint each of the phases, by writing intermediate result files into the same directory as the data file. To redo the analysis from scratch, please delete all these intermediate result files. For the purposes of this example, we will assume that the original alignment and tree files reside in /home/sergei/FUBAR/test/lysin.nex
Initial startup
Upon startup (e.g. via the HYPHYMP /path/to/FUBAR.bf), the analysis will display a general description text and its version
============================= FUBAR v1.0 ============================ This file will perform a Fast Unconstrained Bayesian AppRoximation (FUBAR) analysis of a coding sequence alignment to determine whether some sites have been subject to pervasive purifying or diversifying selection. For details of the method and explanations of various settings, please see http://www.hyphy.org/wiki/FUBAR Please note that a FUBAR analysis generates many files in the same directory as the original alignment. HyPhy needs to have write privileges to this directory. For example if the original file is in /home/sergei/FUBAR/data/pol.nex then at the end of a FUBAR run, there will also exist files such as /home/sergei/FUBAR/data/pol.nex.samples, /home/sergei/FUBAR/data/pol.nex.gridInfo etc. Many of these files can be further examined for diagnostic and other purposes. They also provide checkpointing so that a partially completed analysis can be restarted. =====================================================================
The user is prompted to designate
 Choose Genetic Code : which genetic code is appropriate for the alignment at hand (Universal for our example)
 How many datafiles are to be analyzed (>=1):: how many parts does this alignment come in (1 for our example). Note, that FUBAR can handle partitioned data to account for recombination, e.g. by supplying a NEXUS file output by GARD, one can request that different trees be used for different parts for the data (e.g. File:CVV G GARD.nex)
 Please specify codon data #N: paths to alignment files (/home/sergei/FUBAR/test/lysin.nex for this example; replace with the appropriate path).
[FUBAR PHASE 1] Optimizing relative branch lengths under the nucleotide REV model
FUBAR will first fit branch lengths and nucleotide substitution parameters under the general time reversible nucleotide model of substitution to use for later stages. This phase is not MPI enabled.
Example output
[FUBAR PHASE 1] Optimizing relative branch lengths under the nucleotide REV model [FUBAR PHASE 1 FINISHED] log(L) = 4592.95 Length of tree 1 (substitutions/site) = 2.50902 [DIAGNOSTIC] FUBAR wrote the selfcontained nucleotide fit file to /home/sergei/FUBAR/test/lysin.nex.gtr_fit
Files written
It will generate a .gtr_fit NEXUS file, which contains a serialized HyPhy likelihood function.
[FUBAR PHASE 2] Grid calculations
In this phase FUBAR will:
 Prompt the user for the size of the grid:
 <<Number of grid points per dimension (total number is D^2) (permissible range = [5,50], default value = 20, integer): decide on how many points (D) will be used to represent synonymous (α) and nonsynonynmous (β) rates.
 Grid points are defined according to the following simple algorithm:
 70% of points (rounded up to the nearest integer, N = ceil(7D / 10)) are allocated to [01), using even spacing
 d_{N} = 1
 The remainder of the points, P = D − N, are mapped to (1,50]), using a cubic spacing:
 For D = 20, this results in a grid with weights at 0,0.0714286,0.142857,0.214286,0.285714,0.357143,0.428571,0.5,0.571429,0.642857,0.714286,0.785714,0.857143,1,1.392,4.136,11.584,26.088,50.
 Evaluate the likelihood of the MuseGaut 94 model on each (αβ) grid point. The point which maximizes the likelihood is then used to rescale branch length in the codon model to match those obtained with the nucleotide GTR model.
 Reevaluate the likelihood of the MuseGaut 94 model on each (α,β) grid point using the branch lengths from the previous step. This set of values gives likelihood weighting on the grid of D*D possible values of (αβ), i.e. a discretized likelihood function.
Example output
<<Number of grid points per dimension (total number is D^2) (permissible range = [5,50], default value = 20, integer)>>20 [DIAGNOSTIC] FUBAR will use a 20X20 grid [FUBAR PHASE 2] Determining appropriate branch scaling using the 20X20 grid points. Computing the likelihood function on grid points 400/400 00:00:05 Best scaling achieved for dN/dS = 1.00. Computing sitebysite likelihoods at 20X20 grid points Computing the likelihood function on grid points 400/400 00:00:05 Finished with likelihood calculations. Achieved throughput of 40.00 calculations/second [DIAGNOSTIC] FUBAR wrote the selfcontained codon fit file to /home/sergei/FUBAR/test/lysin.nex.codon_fit [DIAGNOSTIC] FUBAR wrote the the site likelihoods file to /home/sergei/FUBAR/test/lysin.nex.grid_info
Files written
 .codon_fit  a selfcontained NEXUS serialized likelihood function containing the MG94 model fit with the best branch length scaling
 .grid_info  a text file containing
 A D^{2}x 2 matrix (in HBL notation) showing the grid points as (α,β) pairs
 An Associative List with two keys
 conditionals  a DxS (number of sites) matrix, where the (i,j) value is the (potentially scaled) likelihood of site j given rate pair (α_{j},β_{j}).
 scalers  sitespecific scaling (logspace), i.e. the jth column of the matrix above needs to be multiplied by e^{s}, where s is the scaler.
To use these objects in HyPhy, one could employ this code:
fscanf (grid_file, "NMatrix,Raw", grid, site_probs); site_probs = Eval (site_probs);
[FUBAR PHASE 3] Estimating the (α,β) weights using MCMC
In this phase FUBAR will:
 Prompt the user for various MCMC parameters:
 <<Number of MCMC chains to run (permissible range = [2,20], default value = 5, integer)>>: decide on how many independent chains (≥2) will be run. Multiple chains are needed to estimate convergence via the potential scale reduction factor.
 <<The length of each chain (permissible range = [500000,100000000], default value = 5000000, integer)>>: decide on how long each chain will be.
 <<Discard this many samples as burnin (permissible range = [250000,4750000], default value = 2500000, integer)>>: decide on the length of chain burnin.
 <<How many samples should be drawn from each chain (permissible range = [10,2500000], default value = 1000, integer)>>: how many samples to draw from the chain, post burnin (thinning).
 <<The concentration parameter of the Dirichlet prior (permissible range = [0.001,1], default value = 0.1)>>: decide on the properties of the prior distribution.
 Run multiple MCMC chains to evaluate the prior on the rate distribution to use for sitewise empirical Bayes inference.
Example output
<<Number of MCMC chains to run (permissible range = [2,20], default value = 5, integer)>>5 [DIAGNOSTIC] FUBAR will use run 5 independent chains <<The length of each chain (permissible range = [500000,100000000], default value = 5000000, integer)>>5000000 [DIAGNOSTIC] FUBAR will run the chains for 5000000 steps <<Discard this many samples as burnin (permissible range = [250000,4750000], default value = 2500000, integer)>>2500000 [DIAGNOSTIC] FUBAR will run discard 2500000 steps as burnin <<How many samples should be drawn from each chain (permissible range = [10,2500000], default value = 1000, integer)>>1000 [DIAGNOSTIC] FUBAR will run thin each chain down to 1000 samples <<The concentration parameter of the Dirichlet prior (permissible range = [0.001,1], default value = 0.1)>>0.1 [DIAGNOSTIC] FUBAR will use the Dirichlet prior concentration parameter of 0.1 [FUBAR PHASE 3] Running an MCMC chain (ID 0) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burnin, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1. Running MCMC chain ID 0. Current step: 5000000/5000000. Mean sampled log(L) = 4368.47. Acceptance rate = 0.0778732 [FUBAR PHASE 3] Running an MCMC chain (ID 1) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burnin, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1. Running MCMC chain ID 1. Current step: 5000000/5000000. Mean sampled log(L) = 4367.12. Acceptance rate = 0.0816364 [FUBAR PHASE 3] Running an MCMC chain (ID 2) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burnin, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1. Running MCMC chain ID 2. Current step: 5000000/5000000. Mean sampled log(L) = 4368.49. Acceptance rate = 0.0776762 [FUBAR PHASE 3] Running an MCMC chain (ID 3) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burnin, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1. Running MCMC chain ID 3. Current step: 5000000/5000000. Mean sampled log(L) = 4368.36. Acceptance rate = 0.0839152 [FUBAR PHASE 3] Running an MCMC chain (ID 4) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burnin, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1. Running MCMC chain ID 4. Current step: 5000000/5000000. Mean sampled log(L) = 4367.96. Acceptance rate = 0.0890902[FUBAR PHASE 3 DONE] Finished running the MCMC chains; drew 5x1000 samples from chains of length 5000000 after discarding 2500000 burnin steps. Achieved throughput of 56180 moves/sec. [DIAGNOSTIC] FUBAR wrote samples from 5 independent chains to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.samples[04]
Files written
 .samples.C  where C is the (0based) index of the MCMC chain, each file contains data on S samples.
 An Sx1 matrix (HBL) with the loglikelihoods of each sampled set of prior weights
 A D^{2}xS matrix where each column each column gives the weights of each grid point (in the same order as written to the .grid_info file.
 .samples  a file with S samples collected from all chains (1/C from each). The same format as above, except the first entry in the file is the number of chains run.
[FUBAR PHASE 4/5] Tabulate the results and obtain False Discovery Rate estimates
In this phase FUBAR will:
 Compute posterior probabilities of positive selection (α<β) at each site s (PP_{s}), potential scale reduction factors for PP_{s}, and effective sample sizes for PP_{s}.
 Simulate 1000 neutral/negatively selected sites using (α≥β) pairs from the grid, with weights proportional to the priors.
 Compute conditional likelihoods for each of the simulated sites on each of the grid points.
 Generate the list of simulated rates of positive selection detection at a given PP thershold, T
 For each T, compute the estimated FDR as . Here is the fraction of simulated sites which have posterior probabilities of positive selection of at least T,  the fraction of actual sites which have posterior probabilities of positive selection of at least T, and Pr(H_{0}) is the total mass assigned to points with (α≥β) on the mean grid (based on the final MCMC sample).
 Report all sites at which PP ≥ 0.9.
Example output
Tabulating results for site 133/134 00:00:00 [FUBAR PHASE 5] Performing 1,000 simulations under the dataderived composite null model to derive False Discovery Rate (FDR) estimates Tabulating results for site 999/1000 00:00:04 [DIAGNOSTIC] FUBAR wrote the results of its analysis to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.fubar.csv [DIAGNOSTIC] FUBAR wrote FDR simulation data to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.sim_codon_fit [DIAGNOSTIC] FUBAR wrote FDR grid information to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.sim_codon_fit [RESULTS] At posterior probability >= 0.9 there were 11 sites under diversifying positive selection Codon Prob[dN/dS>1] PSRF N_eff FDR 44 0.987168 1.29891 12.2584 0.230109 36 0.984098 1.51963 8.81218 0.115055 87 0.9823 1.59808 8.21254 0.0767031 126 0.970761 1.59559 8.22901 0.086291 106 0.966982 1.21704 15.359 0.0690328 75 0.949868 1.2491 13.8996 0.0958789 41 0.943569 1.65716 7.85887 0.0986183 12 0.932069 1.84566 7.0748 0.100673 113 0.930575 1.78016 7.30186 0.089487 127 0.926549 1.31789 11.7699 0.0805383 40 0.916834 1.48757 9.11496 0.0836761
Files written
 .fubar.csv  the penultimate result file (comma separated) with the following columns for each site
 Codon : the index of the codon
 alpha: posterior mean synonymous substitution rate
 beta: posterior mean nonsynonymous substitution rate
 betaalpha: selfexplanatory
 Prob[alpha<beta]: the (empirical Bayes, given the mean sampled grid weights) posterior probability of positive diversifying selection
 Prob[alpha>beta]: same for negative selection
 PSRF: the estimated potential scale reduction factor
 Neff: effective sample size
 FDR: false discovery rate if the threshold is set at Prob[alpha<beta].
 .sim_codon_fit  analogous to the codon fit file, but for simulated data
 .sim_grid_info  analogous to the grid info file, but for simulate data
 An Sx1 matrix (HBL) with the loglikelihoods of each sampled set of prior weights
 A D^{2}xS matrix where each column each column gives the weights of each grid point (in the same order as written to the .grid_info file.
 .samples  a file with S samples collected from all chains (1/C from each). The same format as above, except the first entry in the file is the number of chains run.
References
 ↑ Ben Murrell, Sasha Moola, Amandla Mabona, Thomas Weighill, Daniel Sheward, Sergei L. Kosakovsky Pond, and Konrad Scheffler FUBAR: A Fast, Unconstrained Bayesian AppRoximation for Inferring Selection Mol Biol Evol (2013) 30 (5): 11961205 first published online February 18, 2013 http://dx.doi.org/10.1093/molbev/mst030