HyPhy message board - confidence intervals around Fst values

Jon

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 11

confidence intervals around Fst values
Jul 21^st, 2006 at 10:17am

Hi Sergei,

We'd like to write a batch file producing condidence intervals around Fst values based on the F_ST.bf currently included in HYPHY. To do this, the bootstrap resampling should be done only within the groups defined by the user rather than from the whole metapopulation as it is now implemented in F_ST.bf.

This would e.g. allow comparisons of F_ST values obtained by considering non-synonymous positions at functional positively selected sites vs those from synonymous positions to detect adaptive divergence.

We're struggling a bit with the code though. Would it be possible to get some hints about which modifications would be needed to get the bootstrap done within groups to get a C.I. around Fst ?

For instance, would simply replacing "overallSample" and "bothCladeSize" by "cladeA", "cladeB" and "clASize", "clBSize" in the bootsrap routine do the job?

Any guidance would be appreciated.

Cheers,

Jonatan

Back to top

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: confidence intervals around Fst values
Reply #1 - Jul 24^th, 2006 at 9:01am

Dear Jonatan,

If I understand your setup correctly, you want to use bootstrap with resapmling within each subpopulation, as opposed to a global population permutation? You need to use resampling within each population, because F_st is invariant with respect to within-population permutations.

I can help you with the modifications if this is what you had in mind...

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

Jon YaBB Newbies Offline I love YaBB 1G - SP1! Posts: 11	Re: confidence intervals around Fst values Reply #2 - Jul 24^th, 2006 at 10:39am Hi Sergei, You understood correctly. Sorry I was perhaps not very clear. We would indeed like to use within subpopulations bootstrap resampling instead of the existing global permutation. This would give us the sampling variance of Fst values. I would greatly appreciate your help. Jonatan
Back to top	IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: confidence intervals around Fst values
Reply #3 - Jul 31^st, 2006 at 2:47pm

Dear Jonatan,

In order to implement this you'd need to modify the variable (for each replicate) distanceMatrix (which stores the pairwise distances between all sequences) and then call computeCompartmentValues (p1vector,p2vector) where p1vector indexes all sequences in population 1 and p2vector - all in population 2.

Here's the general idea for the (slow) code which will generate resampled distanceMatrix. Let me know if this makes any sense...

Code:

saveDM 	   = distanceMatrix;

p1Size		   = Abs (p1vector);
p2Size		   = Abs (p2vector);
basePartition1 = {1,p1Size};
basePartition2 = {1,p2Size};

for (k=0; k<p1Size;k=k+1)
{
	basePartition1[k] = p1vector[k];
}
for (k=0; k<p2Size;k=k+1)
{
	basePartition2[k] = p2vector[k];
}

/* repeat the following code for each replicate */
resampleP1 = Random (basePartition1,1); /* resample vector of indices with replacement */
resampleP2 = Random (basePartition1,2);

for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
	ki = resampleP1[k];
	for (k2 = k+1; k2 < p1Size; k2=k2+1)
	{
		k2i = resampleP1[k2];
		distanceMatrix [k][k2] = saveDM[ki][ki2];
		distanceMatrix [k2][k] = saveDM[ki2][ki];
	}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
	ki = resampleP2[k];
	for (k2 = k+1; k2 < p2Size; k2=k2+1)
	{
		k2i = resampleP2[k2];
		distanceMatrix [k][k2] = saveDM[ki][ki2];
		distanceMatrix [k2][k] = saveDM[ki2][ki];
	}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
	ki = resampleP1[k];
	for (k2 = 0; k2 < p2Size; k2=k2+1)
	{
		k2i = resampleP2[k2];
		distanceMatrix [k][k2] = saveDM[ki][ki2];
		distanceMatrix [k2][k] = saveDM[ki2][ki];
	}
}

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

Jonatan Blais

Guest

Re: confidence intervals around Fst values
Reply #4 - Sep 5^th, 2006 at 1:16pm

Hi Sergei,

Thank you so much for the bootstrapping program.

We managed to insert it to replace the old permutation routine in the F_ST.bf and it seems that HYPHY reads it through (after correcting a little mistake by replacing “basePartition1,2” by “basePartition2,1” in “resampleP2 = Random (basePartition2,1)”).

However, the output doesn’t seem to make sense. All Fst estimates from bootstrapped samples are lower than the observed one. Normally of course, some should be lower and some higher so a confidence interval could be obtained. I’ve pasted the last part of the F_ST.bf containing the newly inserted bootstrapping code. I suspect there is something wrong in the way the bootstrapping is done but can’t manage to figure out what it could be.

Cheers,

Jon

F_ST_OBS = {{pi_D/(resMx[1]+pi_D),pi_D/(resMx[1]+resMx[2]),1-resMx[1]/resMx[0]}};

fprintf (stdout, "\n\nPopulation characterisitcs:",
"\nMetapopulation diversity (pi_T) = ", resMx[0],
"\nMean subpopulation diversity (pi_S) = ", resMx[1],
"\nMean interpopulation diversity (pi_B) = ", resMx[2],
"\n\nF_ST\n",
"\nHudson, Slatkin and Madison (Genetics    132:583-589): ",F_ST_OBS[0],
"\nSlatkin    (Evolution 47:264-279) : ",F_ST_OBS[1],
"\nHudson, Boos and Kaplan    (Mol Bio Evol 9: 138-151) : ",F_ST_OBS[2], "\n");

F_ST_OBS = F_ST_OBS;

ChoiceList (resample,"Permutation Test",1,SKIP_NONE,
"Skip","Do not perform a permutation test.",
      "But of course","Randomly allocate sequences into subpopulations and tabulate the distribution of various F_ST statistics.");

if (resample < 1)
{
return 0;
}

sampleCount = 0;
while (sampleCount < 1)
{
fprintf (stdout, "\nHow many permutations would you like to perform?");
fscanf (stdin,"Number",sampleCount);
}

F_ST_1 = {sampleCount,2};
F_ST_2 = {sampleCount,2};
F_ST_3 = {sampleCount,2};
step = sampleCount$20;

fprintf (stdout, "Running the simulations...\n");

/* New bootstrapping within groups replacing old permutation code */

saveDM = distanceMatrix;

p1Size = Abs (p1vector);
p2Size = Abs (p2vector);
basePartition1 = {1,p1Size};
basePartition2 = {1,p2Size};

for (k=0; k<p1Size;k=k+1)
{
basePartition1[k] = p1vector[k];
}
for (k=0; k<p2Size;k=k+1)
{
basePartition2[k] = p2vector[k];
}

for (sampleID = 0; sampleID < sampleCount; sampleID = sampleID + 1)
{

/* repeat the following code for each replicate */
resampleP1 = Random (basePartition1,1); /* resample vector of indices with replacement */
resampleP2 = Random (basePartition2,1);

for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
ki = resampleP1[k];
for (k2 = k+1; k2 < p1Size; k2=k2+1)
{
k2i = resampleP1[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
ki = resampleP2[k];
for (k2 = k+1; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
ki = resampleP1[k];
for (k2 = 0; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}
/*End of new bootstrapping within groups code*/

resMx = computeCompartentValues (p1_1,p2_1);*/
resMx = computeCompartentValues (p1vector,p2vector);
pi_D = resMx[2]-resMx[1];
F_ST_1[sampleID][1] = pi_D/(resMx[1]+pi_D);
F_ST_2[sampleID][1] = pi_D/(resMx[1]+resMx[2]);
F_ST_3[sampleID][1] = 1-resMx[1]/resMx[0];

if ((sampleID+1)%step == 0)
{
fprintf (stdout, Format ((sampleID+1)*100/sampleCount, 6, 2), "% done\n");
}
}

F_ST_1 = F_ST_1%1;
F_ST_2 = F_ST_2%1;
F_ST_3 = F_ST_3%1;

pv = {{sampleCount,sampleCount,sampleCount}};

for (sampleID = 0; sampleID < sampleCount; sampleID = sampleID + 1)
{
step = sampleID/sampleCount;
F_ST_1[sampleID][0] = step;
F_ST_2[sampleID][0] = step;
F_ST_3[sampleID][0] = step;
if (pv[0]==sampleCount)
{
if (F_ST_1[sampleID][1] > F_ST_OBS[0])
{
pv[0] = sampleID;
}
}
if (pv[1]==sampleCount)
{
if (F_ST_2[sampleID][1] > F_ST_OBS[1])
{
pv[1] = sampleID;
}
}
if (pv[2]==sampleCount)
{
if (F_ST_3[sampleID][1] > F_ST_OBS[2])
{
pv[2] = sampleID;
}
}
}

fprintf (stdout, "\n\nProb {Random F_ST > Observed F_ST}\n",
"\nHudson, Slatkin and Madison : ",(sampleCount-pv[0])/sampleCount,
"\nSlatkin    : ",(sampleCount-pv[1])/sampleCount,
"\nHudson, Boos and Kaplan    : ",(sampleCount-pv[2])/sampleCount, "\n");

labels = {{"Cumulative Weight","F_ST"}};

OpenWindow (CHARTWINDOW,{{"Hudson, Slatkin and...

Back to top

IP Logged

Art Poon

Global Moderator

Offline

Feed your monkey!

Posts: 0

Re: confidence intervals around Fst values
Reply #5 - Sep 5^th, 2006 at 3:38pm

Dear Jon,

Hi, I'm covering for Sergei this week. I had a look at the batch file fragment in your last post and noticed something odd at lines 97-98:

resMx = computeCompartentValues (p1_1,p2_1);*/ resMx = computeCompartentValues (p1vector,p2vector);

I can't find a matching open-comment in the document, so I'm wondering if this is a typo. In the meantime, I'll have a closer look at this code.

Cheers,
- Art.

Back to top

IP Logged

Art Poon

Global Moderator

Offline

Feed your monkey!

Posts: 0

Re: confidence intervals around Fst values
Reply #6 - Sep 5^th, 2006 at 4:28pm

Okay, I've had a good look at the batch snippet now. A possible place to look for problems is when repopulating the global variable distanceMatrix with values from the saved distance matrix saveDM. It seems to me that the second iteration, repopulating distances for the second clade, will overwrite entries from the first clade. This would cause the resulting distance matrix to contain a lot more zero entries than you were expecting there to be.

In other words, the three loops should each be filling a different quadrant of the distance matrix. The first fills the upper left, the second fills the lower right, and the last (interpopulation distances) fills the upper right/lower left.

Adjusting the indices in the for loops (while retaining indexing over the resampling vectors) might fix your problem.

I hope.

=)
- Art.

Back to top

IP Logged

Jon

Ex Member

Re: confidence intervals around Fst values
Reply #7 - Sep 12^th, 2006 at 9:02am

Dear Art,

I understand your point, but I'm not sure how to change the indices so each clade-specific resampling will fill its own distance matrix quadrant. I'm a beginner and not yet familiar with HBL. I tried modifying the [k] in clade 2 for [m] or the i index with j in clade 2 but without success. What index exactly should I change and where?

Cheers

Jon

Back to top

IP Logged

Sergei YaBB Administrator Offline Datamonkeys are forever... Posts: 1658 UCSD Gender:	Re: confidence intervals around Fst values Reply #8 - Sep 14^th, 2006 at 11:55am Dear Jon, There were (as Art pointed out) indexing issues with the code I gave you (sorry about that!) It's all been fixed, and incorporated as an option into the standard F_st analysis as of today's build. Download it and give that a try. Thanks for your patience! Sergei
Back to top	Associate Professor Division of Infectious Diseases Division of Biomedical Informatics School of Medicine University of California San Diego WWW IP Logged

Art Poon Global Moderator Offline Feed your monkey! Posts: 0	Re: confidence intervals around Fst values Reply #9 - Sep 14^th, 2006 at 12:03pm Yay! I was right about something!
Back to top	IP Logged

Jon

Ex Member

Re: confidence intervals around Fst values
Reply #10 - Sep 15^th, 2006 at 1:45pm

Hi guys,

Tried the new version with the bootstrap included but it still give me non sense values. This time, bootstrap values are way off in the other direction (they're too high). E.g., in my data set, the observed value is 0.078 for Hudson, Slatkin, Madison (HSM), but the bootstrap mean is 0.315 with CI of 0.282-0.347. Same pattern for the other two Fst.

Then I tried it with a different data set with an observed HSM Fst of 0.023 and this time mean bootstrap was way too low and negative: -0.243 with CI of -0.269--0.200.

I'm clueless...

Jon

Back to top

IP Logged

Sergei YaBB Administrator Offline Datamonkeys are forever... Posts: 1658 UCSD Gender:	Re: confidence intervals around Fst values Reply #11 - Sep 15^th, 2006 at 3:19pm Dear Jon, This is odd, since the 5 different data sets I tried the bootstrap on all gave sensible results. Could you maybe e-mail me one of the problematic data sets (and the partition), so that I can try to troubleshoot the issue? Cheers, Sergei
Back to top	Associate Professor Division of Infectious Diseases Division of Biomedical Informatics School of Medicine University of California San Diego WWW IP Logged

Jon

Ex Member

Re: confidence intervals around Fst values
Reply #12 - Oct 31^st, 2006 at 8:19am

Dear Sergei,

The implementation of the bootstrap calculation of Fst CI and standard deviation in the last HYPHY build gives me huge bootstrap biases (bootstrap mean half the observed mean), quite large CI, and strangely, what seems rather small standard deviations.

While I guess it might not be impossible that these bootstrap biases are real, I wonder if there couldn't be something wrong with the calculation, although I can't spot anything obvious in the code myself. Before I submit the results for publication, would it be possible to double check that the results are reliable ?

Cheers

Jon

Back to top

IP Logged

Sergei YaBB Administrator Offline Datamonkeys are forever... Posts: 1658 UCSD Gender:	Re: confidence intervals around Fst values Reply #13 - Oct 31^st, 2006 at 11:21am Dear Jon, OK; I'll take a look on some of the example files we have here... It's a good idea to check the results before you publish, indeed :P Cheers, Sergei
Back to top	Associate Professor Division of Infectious Diseases Division of Biomedical Informatics School of Medicine University of California San Diego WWW IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: confidence intervals around Fst values
Reply #14 - Nov 14^th, 2006 at 5:31pm

Dear Jon,

There was another typo (ki2 instead of k2i) in the F_st code (bugger!). Please replace the code starting at line 347 of F_st.bf with
[code]
for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
ki = resampleP1[k];
ii = p1vector[k];
for (k2 = k+1; k2 < p1Size; k2=k2+1)
{
k2i = resampleP1[k2];
ii2 = p1vector[k2];
distanceMatrix [ii][ii2] = saveDM[ki][k2i];
distanceMatrix [ii2][ii] = saveDM[k2i][ki];
}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
   ki = resampleP2[k];
ii = p2vector [k];
   for (k2 = k+1; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
ii2 = p2vector [k2];
distanceMatrix [ii][ii2] = saveDM[ki][k2i];
distanceMatrix [ii2][ii] = saveDM[k2i][ki];
}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
ki = resampleP1[k];
ii = p1vector[k];
for (k2 = 0; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
ii2 = p2vector [k2];
distanceMatrix [ii][ii2] = saveDM[ki][k2i];
distanceMatrix [ii2][ii] = saveDM[k2i][ki];
}
}
[/code]

I got these results on a compartmentalized HIV patient (expected population segregation):

[code]
Hudson, Slatkin and Madison (Genetics    132:583-589)
Observed value : 0.189
Bootst. mean : 0.253
Bootst. median : 0.250
Bootst. st. dev.: 0.059
Bootst. 95% CI : 0.146 - 0.375

Slatkin    (Evolution 47:264-279)
Observed value : 0.104
Bootst. mean : 0.146
Bootst. median : 0.143
Bootst. st. dev.: 0.039
Bootst. 95% CI : 0.079 - 0.231

Hudson, Boos and Kaplan    (Mol Bio Evol 9: 138-151)
Observed value : 0.100
Bootst. mean : 0.140
Bootst. median : 0.137
Bootst. st. dev.: 0.038
Bootst. 95% CI : 0.075 - 0.222
[/code]

The upwards bias in BS estimates is what you would expect, because resampling within group with replacement will introduce some artificial '0' distances, leading to inflated within-group similarity and between-group dissimilarity

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

	Welcome, Guest. Please Login