# FUBAR


Citation: If you are using FUBAR, please cite [1]

Motivation: Model-based selection analyses (such as those performed by PAML and HyPhy) can be slow, becoming impractical for large alignments. We present a method to model and detect selection much faster than existing methods and to leverage Bayesian MCMC to robustly account for parameter estimation errors.

Results: By exploiting some commonly used approximations, FUBAR can perform detection of positive selection under a model that allows rich site-to-site rate variation about 30 to 50 times faster than existing random effects likelihood methods, and 10 to 30 times faster than existing fixed effects likelihood methods. We introduce an ultra-fast MCMC routine that allows a flexible prior specification, with no parametric constraints on the prior shape. Furthermore, our method allows us to visualize Bayesian inference for each site, revealing the model supported by the data.

### Batch files and examples

FUBAR can be run on Datamonkey, or in HyPhy, using the HBL files available from GitHub. Note that several phases of the analysis have been coded to run in parallel on multiple MPI nodes, when available.

Obtain a recent HyPhy version

Please note that FUBAR uses recently added functionality that is only available in the 2.2 distribution.

### FUBAR workflow

FUBAR consists of five phases. The analysis will checkpoint each of the phases, by writing intermediate result files into the same directory as the data file. To redo the analysis from scratch, please delete all these intermediate result files. For the purposes of this example, we will assume that the original alignment and tree files reside in /home/sergei/FUBAR/test/lysin.nex

#### Initial startup

Upon startup (e.g. via HYPHYMP /path/to/FUBAR.bf), the analysis will display a general description and its version:


============================= FUBAR v1.0 ============================

This file will perform a Fast Unconstrained Bayesian AppRoximation (FUBAR)
analysis of a coding sequence alignment to determine whether some
sites have been subject to pervasive purifying or diversifying
selection. For details of the method and explanations of various

Please note that a FUBAR analysis generates many files in the same
directory as the original alignment. HyPhy needs to have write
privileges to this directory. For example if the original file is in
/home/sergei/FUBAR/data/pol.nex then at the end of a FUBAR run, there
will also exist files such as /home/sergei/FUBAR/data/pol.nex.samples,
/home/sergei/FUBAR/data/pol.nex.gridInfo etc. Many of these files can
be further examined for diagnostic and other purposes. They also
provide checkpointing so that a partially completed analysis can be
restarted.

=====================================================================


The user is prompted to designate

1. Choose Genetic Code : which genetic code is appropriate for the alignment at hand (Universal for our example)
2. How many datafiles are to be analyzed (>=1): how many parts does this alignment come in (1 for our example). Note that FUBAR can handle partitioned data to account for recombination; e.g. by supplying a NEXUS file output by GARD, one can request that different trees be used for different parts of the data (e.g. File:CVV G GARD.nex)
3. Please specify codon data #N: paths to alignment files (/home/sergei/FUBAR/test/lysin.nex for this example; replace with the appropriate path).

#### [FUBAR PHASE 1] Optimizing relative branch lengths under the nucleotide REV model

FUBAR will first fit branch lengths and nucleotide substitution parameters under the general time reversible nucleotide model of substitution to use for later stages. This phase is not MPI enabled.

##### Example output
[FUBAR PHASE 1] Optimizing relative branch lengths under the nucleotide REV model
[FUBAR PHASE 1 FINISHED] log(L) = -4592.95
Length of tree 1 (substitutions/site) = 2.50902
[DIAGNOSTIC] FUBAR wrote the self-contained nucleotide fit file to /home/sergei/FUBAR/test/lysin.nex.gtr_fit

##### Files written

It will generate a .gtr_fit NEXUS file, which contains a serialized HyPhy likelihood function.

#### [FUBAR PHASE 2] Grid calculations

In this phase FUBAR will:

1. Prompt the user for the size of the grid:
1. <<Number of grid points per dimension (total number is D^2) (permissible range = [5,50], default value = 20, integer)>>: decide on how many points (D) will be used to represent synonymous (α) and non-synonymous (β) rates.
2. Grid points are defined according to the following simple algorithm:
1. 70% of the points (rounded up to the nearest integer, N = ceil(7D / 10)) are allocated to [0,1), using even spacing $d_i = i/N, i = 0\ldots N-1$
2. $d_N = 1$
3. The remaining P = D - N - 1 points are mapped to (1,50], using a cubic spacing: $d_{N+i} = 1+(s\times i)^3, i = 1\ldots P, s = \frac{49^{1/3}}{P}$
4. For D = 20, this results in a grid with points at 0, 0.0714286, 0.142857, 0.214286, 0.285714, 0.357143, 0.428571, 0.5, 0.571429, 0.642857, 0.714286, 0.785714, 0.857143, 0.928571, 1, 1.392, 4.136, 11.584, 26.088, 50.
2. Evaluate the likelihood of the Muse-Gaut 94 (MG94) model at each (α,β) grid point. The point which maximizes the likelihood is then used to rescale branch lengths in the codon model to match those obtained with the nucleotide GTR model.
3. Re-evaluate the likelihood of the MG94 model at each (α,β) grid point using the branch lengths from the previous step. This set of values gives a likelihood weighting on the grid of D*D possible values of (α,β), i.e. a discretized likelihood function.
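The grid spacing above can be sketched in a few lines of Python. `fubar_grid` is a hypothetical helper name (not part of the HBL code), and the full grid is formed by taking all D^2 (α,β) pairs of these values:

```python
import math

def fubar_grid(D=20):
    """Reconstruct the 1-D FUBAR rate grid for D points per dimension.
    (Illustrative helper; not part of the actual HBL implementation.)"""
    N = math.ceil(7 * D / 10)            # 70% of points, rounded up
    points = [i / N for i in range(N)]   # evenly spaced on [0, 1)
    points.append(1.0)                   # d_N = 1
    P = D - N - 1                        # remaining points land on (1, 50]
    s = 49 ** (1 / 3) / P
    points += [1 + (s * i) ** 3 for i in range(1, P + 1)]  # cubic spacing
    return points

# The full grid is every (alpha, beta) pair of these values, D^2 in total.
grid = fubar_grid(20)
```

For D = 20 this reproduces the values listed above, with the last point landing exactly on 50 because $(s \times P)^3 = 49$.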
##### Example output

<<Number of grid points per dimension (total number is D^2) (permissible range = [5,50], default value = 20, integer)>>20
[DIAGNOSTIC] FUBAR will use a 20X20 grid
[FUBAR PHASE 2] Determining appropriate branch scaling using the 20X20 grid points.
Computing the likelihood function on grid points 400/400 00:00:05	Best scaling achieved for dN/dS =  1.00.
Computing site-by-site likelihoods at 20X20 grid points
Computing the likelihood function on grid points 400/400 00:00:05	Finished with likelihood calculations. Achieved throughput of  40.00 calculations/second
[DIAGNOSTIC] FUBAR wrote the self-contained codon fit file to /home/sergei/FUBAR/test/lysin.nex.codon_fit
[DIAGNOSTIC] FUBAR wrote the site likelihoods file to /home/sergei/FUBAR/test/lysin.nex.grid_info


##### Files written
1. .codon_fit - a self-contained NEXUS serialized likelihood function containing the MG94 model fit with the best branch length scaling
2. .grid_info - a text file containing
1. A D^2 x 2 matrix (in HBL notation) showing the grid points as (α,β) pairs
2. An Associative List with two keys
1. conditionals - a D^2 x S (S = number of sites) matrix, where the (i,j) value is the (potentially scaled) likelihood of site j given rate pair (α_i,β_i).
2. scalers - site-specific scalers (log-space), i.e. the j-th column of the matrix above needs to be multiplied by e^s, where s is the scaler.

To use these objects in HyPhy, one could employ this code:

```
fscanf (grid_file, "NMatrix,Raw", grid, site_probs);
site_probs = Eval (site_probs);
```
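Outside HyPhy, undoing the per-site scaling is a one-liner per column. The values below are made-up placeholders standing in for the actual .grid_info contents:

```python
import math

# Toy stand-ins for the .grid_info contents (values are invented):
# conditionals[i][j] = scaled likelihood of site j at grid point i
# scalers[j]         = log-space scaler for site j
conditionals = [[0.5, 0.02],
                [0.3, 0.08]]
scalers = [-2.0, -5.0]

# Recover the unscaled likelihoods: multiply column j by e^(scalers[j])
likelihoods = [[conditionals[i][j] * math.exp(scalers[j])
                for j in range(len(scalers))]
               for i in range(len(conditionals))]
```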

#### [FUBAR PHASE 3] Estimating the (α,β) weights using MCMC

In this phase FUBAR will:

1. Prompt the user for various MCMC parameters:
1. <<Number of MCMC chains to run (permissible range = [2,20], default value = 5, integer)>>: decide on how many independent chains (≥2) will be run. Multiple chains are needed to estimate convergence via the potential scale reduction factor.
2. <<The length of each chain (permissible range = [500000,100000000], default value = 5000000, integer)>>: decide on how long each chain will be.
3. <<Discard this many samples as burn-in (permissible range = [250000,4750000], default value = 2500000, integer)>>: decide on the length of chain burn-in.
4. <<How many samples should be drawn from each chain (permissible range = [10,2500000], default value = 1000, integer)>>: how many samples to draw from the chain, post burn-in (thinning).
5. <<The concentration parameter of the Dirichlet prior (permissible range = [0.001,1], default value = 0.1)>>: decide on the properties of the prior distribution.
2. Run multiple MCMC chains to evaluate the prior on the rate distribution to use for site-wise empirical Bayes inference.
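FUBAR's actual sampler is an ultra-fast routine implemented in HBL/C++; as a rough illustration of the idea only, a toy Metropolis-Hastings sampler over the grid-point weights with a Dirichlet prior might look like the sketch below. The names (`log_posterior`, `mh_weights`) and the mass-moving proposal are illustrative assumptions, not the actual implementation:

```python
import math
import random

def log_posterior(w, cond, gamma=0.1):
    """log Dirichlet(gamma) prior plus log likelihood of grid weights w,
    where cond[i][j] is the likelihood of site j at grid point i."""
    if min(w) <= 0:
        return -math.inf
    lp = sum((gamma - 1) * math.log(wi) for wi in w)
    for j in range(len(cond[0])):
        lp += math.log(sum(w[i] * cond[i][j] for i in range(len(w))))
    return lp

def mh_weights(cond, steps=2000, delta=0.01, gamma=0.1, seed=1):
    """Toy Metropolis-Hastings over grid-point weights: each step moves a
    small amount of mass between two random grid points. (The proposal is
    treated as symmetric, which is only approximate near the simplex
    boundary; acceptable for a sketch.)"""
    rng = random.Random(seed)
    n = len(cond)
    w = [1.0 / n] * n                   # start from uniform weights
    lp = log_posterior(w, cond, gamma)
    accepted = 0
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        w2 = list(w)
        w2[a] -= delta                  # move mass from point a to b
        w2[b] += delta
        lp2 = log_posterior(w2, cond, gamma)
        if rng.random() < math.exp(min(0.0, lp2 - lp)):
            w, lp = w2, lp2
            accepted += 1
    return w, accepted / steps
```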
##### Example output
<<Number of MCMC chains to run (permissible range = [2,20], default value = 5, integer)>>5
[DIAGNOSTIC] FUBAR will run 5 independent chains
<<The length of each chain (permissible range = [500000,100000000], default value = 5000000, integer)>>5000000
[DIAGNOSTIC] FUBAR will run the chains for 5000000 steps
<<Discard this many samples as burn-in (permissible range = [250000,4750000], default value = 2500000, integer)>>2500000
[DIAGNOSTIC] FUBAR will discard 2500000 steps as burn-in
<<How many samples should be drawn from each chain (permissible range = [10,2500000], default value = 1000, integer)>>1000
[DIAGNOSTIC] FUBAR will thin each chain down to 1000 samples
<<The concentration parameter of the Dirichlet prior (permissible range = [0.001,1], default value = 0.1)>>0.1
[DIAGNOSTIC] FUBAR will use the Dirichlet prior concentration parameter of 0.1

[FUBAR PHASE 3] Running an MCMC chain (ID 0) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burn-in, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1.
Running MCMC chain ID 0. Current step: 5000000/5000000. Mean sampled log(L) = -4368.47. Acceptance rate = 0.0778732
[FUBAR PHASE 3] Running an MCMC chain (ID 1) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burn-in, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1.
Running MCMC chain ID 1. Current step: 5000000/5000000. Mean sampled log(L) = -4367.12. Acceptance rate = 0.0816364
[FUBAR PHASE 3] Running an MCMC chain (ID 2) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burn-in, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1.
Running MCMC chain ID 2. Current step: 5000000/5000000. Mean sampled log(L) = -4368.49. Acceptance rate = 0.0776762
[FUBAR PHASE 3] Running an MCMC chain (ID 3) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burn-in, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1.
Running MCMC chain ID 3. Current step: 5000000/5000000. Mean sampled log(L) = -4368.36. Acceptance rate = 0.0839152
[FUBAR PHASE 3] Running an MCMC chain (ID 4) to obtain a posterior sample of grid point weights: 5000000 total steps, of which 2500000 will be discarded as burn-in, and sampling every 2500 steps. Dirichlet prior concentration parameter = 0.1.
Running MCMC chain ID 4. Current step: 5000000/5000000. Mean sampled log(L) = -4367.96. Acceptance rate = 0.0890902
[FUBAR PHASE 3 DONE] Finished running the MCMC chains; drew 5x1000 samples from chains of length 5000000 after discarding 2500000 burn-in steps. Achieved throughput of  56180 moves/sec.

[DIAGNOSTIC] FUBAR wrote samples from 5 independent chains to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.samples[0-4]


##### Files written
1. .samples.C - where C is the (0-based) index of the MCMC chain, each file contains data on S samples.
1. An Sx1 matrix (HBL) with the log-likelihoods of each sampled set of prior weights
2. A D^2 x S matrix where each column gives the weights of each grid point (in the same order as written to the .grid_info file).
2. .samples - a file with S samples collected from all chains (1/C from each). The same format as above, except the first entry in the file is the number of chains run.

#### [FUBAR PHASE 4/5] Tabulate the results and obtain False Discovery Rate estimates

In this phase FUBAR will:

1. Compute posterior probabilities of positive selection (α<β) at each site s (PP_s), potential scale reduction factors for PP_s, and effective sample sizes for PP_s.
2. Simulate 1000 neutral/negatively selected sites using (α≥β) pairs from the grid, with weights proportional to the priors.
3. Compute conditional likelihoods for each of the simulated sites on each of the grid points.
4. Tabulate the rate at which simulated (null) sites are detected as positively selected at a given PP threshold, T
5. For each T, compute the estimated FDR as $FDR(PP \ge T) = \frac{Pr (PP \ge T | H_0) Pr (H_0)}{Pr(PP \ge T)}$. Here $Pr (PP \ge T | H_0)$ is the fraction of simulated sites with posterior probabilities of positive selection of at least T, $Pr (PP \ge T)$ is the fraction of actual sites with posterior probabilities of positive selection of at least T, and $Pr(H_0)$ is the total mass assigned to points with (α≥β) on the mean grid (based on the final MCMC sample).
6. Report all sites at which PP ≥ 0.9.
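Steps 1 and 5 above reduce to a few lines each. The function and variable names below are illustrative: `cond_col` stands for one site's column of the grid likelihood matrix, `w` for the (mean sampled) grid weights, and `positive[i]` flags grid points with α<β:

```python
def positive_selection_pp(w, cond_col, positive):
    """Empirical Bayes posterior probability of positive selection at one
    site: the weight-averaged likelihood over grid points with alpha < beta,
    normalized by the total over all grid points."""
    total = sum(w[i] * cond_col[i] for i in range(len(w)))
    pos = sum(w[i] * cond_col[i] for i in range(len(w)) if positive[i])
    return pos / total

def estimated_fdr(T, sim_pps, real_pps, p_h0):
    """FDR(PP >= T) = Pr(PP >= T | H0) * Pr(H0) / Pr(PP >= T)."""
    pr_t_h0 = sum(p >= T for p in sim_pps) / len(sim_pps)
    pr_t = sum(p >= T for p in real_pps) / len(real_pps)
    return pr_t_h0 * p_h0 / pr_t if pr_t > 0 else 0.0
```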
##### Example output
Tabulating results for site 133/134 00:00:00
[FUBAR PHASE 5] Performing 1,000 simulations under the data-derived composite null model to derive False Discovery Rate (FDR) estimates
Tabulating results for site 999/1000 00:00:04
[DIAGNOSTIC] FUBAR wrote the results of its analysis to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.fubar.csv
[DIAGNOSTIC] FUBAR wrote FDR simulation data to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.sim_codon_fit
[DIAGNOSTIC] FUBAR wrote FDR grid information to /Volumes/TeraMonkey/Users/sergei/Coding/HBL/FUBAR/test/lysin.nex.sim_grid_info

[RESULTS] At posterior probability >= 0.9 there were 11 sites under diversifying positive selection

Codon	Prob[dN/dS>1]	PSRF	N_eff	FDR
44	0.987168	1.29891	12.2584	0.230109
36	0.984098	1.51963	8.81218	0.115055
87	0.9823	1.59808	8.21254	0.0767031
126	0.970761	1.59559	8.22901	0.086291
106	0.966982	1.21704	15.359	0.0690328
75	0.949868	1.2491	13.8996	0.0958789
41	0.943569	1.65716	7.85887	0.0986183
12	0.932069	1.84566	7.0748	0.100673
113	0.930575	1.78016	7.30186	0.089487
127	0.926549	1.31789	11.7699	0.0805383
40	0.916834	1.48757	9.11496	0.0836761


##### Files written
1. .fubar.csv - the main result file (comma separated) with the following columns for each site
1. Codon : the index of the codon
2. alpha: posterior mean synonymous substitution rate
3. beta: posterior mean non-synonymous substitution rate
4. beta-alpha: self-explanatory
5. Prob[alpha<beta]: the (empirical Bayes, given the mean sampled grid weights) posterior probability of positive diversifying selection
6. Prob[alpha>beta]: same for negative selection
7. PSRF: the estimated potential scale reduction factor
8. Neff: effective sample size
9. FDR: false discovery rate if the threshold is set at Prob[alpha<beta].
2. .sim_codon_fit - analogous to the codon fit file, but for simulated data
3. .sim_grid_info - analogous to the grid info file, but for simulated data
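The PSRF column in the results table is the Gelman-Rubin potential scale reduction factor computed across the independent chains. A minimal sketch for a single scalar quantity (e.g. one site's PP), following the standard Gelman-Rubin definition (the N_eff computation is not shown):

```python
def psrf(chains):
    """Gelman-Rubin potential scale reduction factor for one scalar
    quantity (e.g. a single site's PP) sampled by m chains of n draws."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    var_hat = (n - 1) / n * W + B / n                         # pooled variance
    return (var_hat / W) ** 0.5
```

Values close to 1 indicate the chains agree; large values (as in the example output above) flag quantities whose posterior samples have not converged across chains.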

### References

1. Murrell B, Moola S, Mabona A, Weighill T, Sheward D, Kosakovsky Pond SL, Scheffler K (2013). FUBAR: A Fast, Unconstrained Bayesian AppRoximation for Inferring Selection. Mol Biol Evol 30(5):1196-1205. http://dx.doi.org/10.1093/molbev/mst030