HyPhy message board - BEB, NEB and Bayes Factor...Ahhhhhhhhhhh!!!

(Moderators: Sergei, Simon)

BEB, NEB and Bayes Factor...Ahhhhhhhhhhh!!! (Read 5782 times)

matty

Jun 15th, 2007 at 3:29pm

Hi Sergei,
I have just run the YangNielsenBranchSite2005.bf test, and I have found several sites under positive selection with Bayes Factors well over 100.

I have also run the branch site model implemented in PAML, specifically Test 2. PAML returns the same sites as HYPHY with significant values for Naive Empirical Bayes (NEB), but returns no positively selected sites with Bayes Empirical Bayes (BEB).

The authors of PAML recommend ignoring the NEB values, thus I am wondering which results are correct. Should I follow Yang's advice and ignore NEB sites, and thus disregard the HYPHY results? Or should I use the results returned by HYPHY?

I'm thinking that it could be that the BEB was not powerful enough to pick up any sites, even though positive selection did occur (given a significant LRT and an omega > 1). What do you believe?

Thanks for your time,
Matt

Sergei

Reply #1 - Jun 15th, 2007 at 4:10pm

Dear Matt,

BEB is really just a hack to attempt to approximate the results of a fully Bayesian inference without actually doing it; it allows one to incorporate sampling errors in parameter estimates.

Generally speaking, if your data set is small and parameter estimate errors are large (the most likely scenario where NEB and BEB might disagree), there is very little power to detect individual sites under selection, regardless of which method you use. The fact that the LRT is significant (for omega>1) does not guarantee that you will be able to find at least one site under selection; e.g. the LRT result could be due to small contributions from a number of sites, while individually no single site is reliably significant.
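
To make the last point concrete, here is a small arithmetic sketch. The per-site numbers are invented for illustration and do not come from any real fit. It relies only on the fact that, with independent sites, the log-likelihood is a sum over sites, so the LRT statistic is twice the sum of per-site gains. Comparing a single site's share to the same threshold is just an informal yardstick, not a formal per-site test.

```python
# Hypothetical per-site log-likelihood gains (alternative model minus null model).
# Invented values: 100 sites, each improving the fit by a tiny amount.
site_gains = [0.03] * 100

# 5% critical value of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_5PCT = 3.84

# The gene-wide LRT statistic is twice the summed per-site gains.
lrt_statistic = 2 * sum(site_gains)

# Informal yardstick: the largest contribution from any single site.
largest_single_site = 2 * max(site_gains)

print(f"LRT statistic over all sites: {lrt_statistic:.2f}")
print(f"Largest single-site share:    {largest_single_site:.2f}")
print("Gene-wide test exceeds threshold:", lrt_statistic > CHI2_1DF_5PCT)
print("Any single site exceeds it:      ", largest_single_site > CHI2_1DF_5PCT)
```

The gene-wide statistic clears the threshold even though no single site comes close, which is the situation described above.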

If you are really interested in pinning down the significance of your findings, do some parametric simulations: generate 100 data sets using the omega distribution (and all other parameters, such as base frequencies and branch lengths) inferred from your alignment, and see what false positive rate you get when detecting individual sites under selection. Then you can tweak the Bayes factor cutoff to make that rate small.
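
The simulate-and-refit loop can be sketched on a toy model. This is only an illustration of the mechanics, assuming an exponential waiting-time model in place of a codon model; `fit_rate` and `parametric_bootstrap` are hypothetical names, and nothing here calls HyPhy.

```python
import random
import statistics


def fit_rate(sample):
    """Maximum-likelihood estimate of an exponential rate: n / sum(x)."""
    return len(sample) / sum(sample)


def parametric_bootstrap(sample, n_replicates=100, seed=1):
    """Simulate replicates under the fitted model and refit each one.

    Returns the original estimate and the list of replicate estimates.
    """
    rng = random.Random(seed)
    rate_hat = fit_rate(sample)
    estimates = []
    for _ in range(n_replicates):
        # Generate a data set of the same size from the fitted model.
        simulated = [rng.expovariate(rate_hat) for _ in sample]
        # Run the same inference procedure on the simulated data.
        estimates.append(fit_rate(simulated))
    return rate_hat, estimates


observed = [0.5, 1.5, 1.0, 1.0]  # toy data, invented for illustration
rate_hat, estimates = parametric_bootstrap(observed)
print("estimate from observed data:", rate_hat)
print("bootstrap mean:", statistics.mean(estimates))
print("bootstrap standard deviation:", statistics.stdev(estimates))
```

In the real analysis the "fit" step is the full selection analysis and the quantity recorded per replicate is the set of per-site Bayes factors rather than a single rate.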

Cheers,
Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

matty

Reply #2 - Jun 20th, 2007 at 10:10am

Hi Sergei,

Thanks for your response. I am familiar with how to do the parametric simulations you suggested using the likelihood functions created via the data panel interface (I performed this on one of your example datasets). However, I cannot get it to work for a standard analysis on my own dataset. It complains that I cannot save a likelihood function if I did not use the data panel interface (i.e. I just loaded in my data).

Any help would be appreciated.

Matt

Sergei

Reply #3 - Jun 20th, 2007 at 10:29am

Dear Matt,

Ah yes - because likelihood function states are attached to data panels, they are not available for reloaded data (even though parameter estimate tables load just fine). This is a fairly fundamental design assumption and will not be easy to change, I am afraid (it has to do with the need for the GUI to know how to build a copy of the likelihood function on simulated data, and likelihood functions created directly in the batch language are next to impossible to clone at runtime).

Do this however:

1). Load your model fit (e.g. REL); simulate 100 copies from it using the little gear wheels icon from the console window (you should get a Simulate LF option).

2). Write a little wrapper script following this pattern:

for (file = 0; file < 100; file = file + 1)
{
     /* Set up the analysis options for this replicate in "options". */
     ExecuteAFile ("analysis file", options);
     /* Record relevant results here
        (or better yet, just save a copy of the entire
        likelihood function for each replicate, and
        then write a separate processor script to
        tabulate what you need, say omega or
        sites under selection). */
}

Cheers,
Sergei

matty

Reply #4 - Jun 20th, 2007 at 5:08pm

Hey Sergei,

Thank you for all your help. I'm in the process of implementing what you described, but I've hit a little snag. When it comes time to choose the node to set as the foreground, I'm having trouble listing my choices of nodes in stdinRedirect[]. I would like to find a way to let the program know that there are no more nodes I would like to select. When running this on the command line without a batch file, I would normally hit "d", but it does not accept this as part of my stdin array.

I found that if I make the number of nodes I want to choose implicit in choiceList, then the problem is solved. But then if I want to choose only ONE node as my foreground, I get an error stating "Operation MAccess is not defined for 0"

Do you have any idea on how to get around this?

Thank you again for your help!

Matt

Sergei

Reply #5 - Jun 20th, 2007 at 5:39pm

Dear Matt,

Instead of "d", place an empty string "" in the stdinArray structure to terminate the variable length selection.

HTH,
Sergei

Andrew_Roth

Reply #6 - Jun 20th, 2007 at 10:25pm

Hi Sergei,
I am working with matty on this project. Thanks for all the help so far; we have got our batch script to run and we are just waiting on our server to finish analysing the simulated data sets. The script is set up to save the likelihood function from each dataset.

I was wondering if there were any articles or books you could point us towards that might help us understand how to make use of the bootstrap data. I have done some reading online, but all the articles seem to be about using bootstrapping to test the robustness of inferred trees.
We are looking to find out if the positively selected sites we have found are indeed positively selected. They tend to have very high Bayes factors, anywhere from 70 to 40000. As per your suggestion we would like to use the bootstrapping procedure to find a reasonable cutoff for the Bayes factor, but I am unsure of how to go about this.

Thanks again for all the help.

Cheers,
Andrew

Sergei

Reply #7 - Jun 22nd, 2007 at 10:25am

Dear Andrew,

If you want general statistical books on the bootstrap, then any by the inventor of the procedure (Bradley Efron) would do. In your case, the application is rather simple:

0). Fit your data using some model M, with estimated parameter vector R (either direct model parameters, like branch lengths, or functions of model parameters and the data, such as the Bayes factor for dN/dS at site 10).
1). The objective of the bootstrap is to estimate sampling properties of elements of R (e.g. mean, variance etc) as functions of random data. Ideally, one would prefer to repeat the same analysis on a large number of independent data samples which were generated using the same process as that which produced the observed data. Since such independent replicates are not available, we do the next best thing and assume that the model we estimated was what actually generated the data, and then simulate under it. Efron showed that this actually works really well asymptotically.

In your case, for each simulated data set, record the actual dN/dS for each site.
2). Run the same inference procedure as the one you used in step 0.
3). Tabulate the proportions of false positives (p_F) (sites with dN<dS which were inferred to be under selection) and of true positives (p_T) (sites with dN>dS which were inferred to be under selection) as a function of the significance level of your test (e.g. the Bayes factor = 50, 100, 200 etc). If you plot p_T vs p_F, with one point for each value of the Bayes factor cutoff, you will get what is known as an ROC curve.
4). Decide what Bayes factor gives you decent performance (low p_F and high p_T; decreasing the BF will increase both, increasing the BF will decrease both).
5). Now use that Bayes factor cutoff to reparse your original results and claim that the sites you found under selection are reasonably robust based on bootstrapped sampling properties of the estimator.
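
Steps 3 to 5 can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the procedure used in any of the cited papers: the input layout (pairs of simulated dN/dS and inferred Bayes factor, pooled over replicates), the function names, and all numbers are hypothetical. It also assumes p_F is computed relative to all sites simulated with dN/dS <= 1 and p_T relative to all sites simulated with dN/dS > 1.

```python
def roc_table(sites, cutoffs):
    """Return (cutoff, p_F, p_T) rows for simulated sites pooled over replicates.

    sites: list of (true_omega, bayes_factor) pairs.
    """
    selected = [bf for omega, bf in sites if omega > 1.0]
    not_selected = [bf for omega, bf in sites if omega <= 1.0]
    if not selected or not not_selected:
        raise ValueError("need simulated sites both with and without selection")
    rows = []
    for cutoff in sorted(cutoffs):
        # A site is called "under selection" when its Bayes factor meets the cutoff.
        p_f = sum(bf >= cutoff for bf in not_selected) / len(not_selected)
        p_t = sum(bf >= cutoff for bf in selected) / len(selected)
        rows.append((cutoff, p_f, p_t))
    return rows


def pick_cutoff(rows, max_p_f):
    """Smallest cutoff whose false positive proportion is at most max_p_f."""
    for cutoff, p_f, _p_t in rows:
        if p_f <= max_p_f:
            return cutoff
    return None


def sites_passing(bayes_factors, cutoff):
    """Reparse original results: site indices whose Bayes factor meets the cutoff."""
    return sorted(site for site, bf in bayes_factors.items() if bf >= cutoff)


# Invented simulation output: (simulated dN/dS, inferred Bayes factor).
simulated = [(0.2, 5), (0.5, 60), (0.1, 150), (0.9, 20),
             (2.5, 300), (3.0, 80), (1.8, 40), (4.0, 1000)]
rows = roc_table(simulated, [50, 100, 200])
cutoff = pick_cutoff(rows, max_p_f=0.05)

# Invented results for the original alignment: site index -> Bayes factor.
original = {10: 95.0, 27: 420.0, 58: 40000.0}
for row in rows:
    print(row)
print("chosen cutoff:", cutoff)
print("sites retained:", sites_passing(original, cutoff))
```

Taking the smallest acceptable cutoff keeps p_T as high as possible for the chosen bound on p_F, since both proportions can only fall as the cutoff rises.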

Anisimova and Yang had two papers on this in MBE in 2001/2002, and our Not So Different ... paper in MBE 2005 uses the same procedure.

HTH,
Sergei

P.S. Take a look at [link not available]
