HyPhy message board - Print Page

Hi Sergei,

We'd like to write a batch file producing condidence intervals around Fst values based on the F_ST.bf currently included in HYPHY. To do this, the bootstrap resampling should be done only within the groups defined by the user rather than from the whole metapopulation as it is now implemented in F_ST.bf.

This would e.g. allow comparisons of F_ST values obtained by considering non-synonymous positions at functional positively selected sites vs those from synonymous positions to detect adaptive divergence.

We're struggling a bit with the code though. Would it be possible to get some hints about which modifications would be needed to get the bootstrap done within groups to get a C.I. around Fst ?

For instance, would simply replacing "overallSample" and "bothCladeSize" by "cladeA", "cladeB" and "clASize", "clBSize" in the bootsrap routine do the job?

Any guidance would be appreciated.

Cheers,

Jonatan

Dear Jonatan,

If I understand your setup correctly, you want to use bootstrap with resapmling within each subpopulation, as opposed to a global population permutation? You need to use resampling within each population, because F_st is invariant with respect to within-population permutations.

I can help you with the modifications if this is what you had in mind...

Cheers,
Sergei

Hi Sergei,

You understood correctly. Sorry I was perhaps not very clear.

We would indeed like to use within subpopulations bootstrap resampling instead of the existing global permutation. This would give us the sampling variance of Fst values.

I would greatly appreciate your help.

Jonatan

Dear Jonatan,

In order to implement this you'd need to modify the variable (for each replicate) distanceMatrix (which stores the pairwise distances between all sequences) and then call computeCompartmentValues (p1vector,p2vector) where p1vector indexes all sequences in population 1 and p2vector - all in population 2.

Here's the general idea for the (slow) code which will generate resampled distanceMatrix. Let me know if this makes any sense...

Code (]
saveDM = distanceMatrix;

p1Size = Abs (p1vector);
p2Size = Abs (p2vector);
basePartition1 = {1,p1Size};
basePartition2 = {1,p2Size};

for (k=0; k<p1Size;k=k+1)
{
basePartition1[k):

= p1vector[k];
}
for (k=0; k<p2Size;k=k+1)
{
basePartition2[k] = p2vector[k];
}

/* repeat the following code for each replicate */
resampleP1 = Random (basePartition1,1); /* resample vector of indices with replacement */
resampleP2 = Random (basePartition1,2);

for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
ki = resampleP1[k];
for (k2 = k+1; k2 < p1Size; k2=k2+1)
{
k2i = resampleP1[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
ki = resampleP2[k];
for (k2 = k+1; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
ki = resampleP1[k];
for (k2 = 0; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

Cheers,
Sergei

Hi Sergei,

Thank you so much for the bootstrapping program.

We managed to insert it to replace the old permutation routine in the F_ST.bf and it seems that HYPHY reads it through (after correcting a little mistake by replacing “basePartition1,2” by “basePartition2,1” in “resampleP2 = Random (basePartition2,1)”).

However, the output doesn’t seem to make sense. All Fst estimates from bootstrapped samples are lower than the observed one. Normally of course, some should be lower and some higher so a confidence interval could be obtained. I’ve pasted the last part of the F_ST.bf containing the newly inserted bootstrapping code. I suspect there is something wrong in the way the bootstrapping is done but can’t manage to figure out what it could be.

Cheers,

Jon

F_ST_OBS = [pi_D/(resMx[1]+pi_D),pi_D/(resMx[1]+resMx[2]),1-resMx[1]/resMx[0]];

fprintf (stdout, "\n\nPopulation characterisitcs:",
"\nMetapopulation diversity (pi_T) = ", resMx[0],
"\nMean subpopulation diversity (pi_S) = ", resMx[1],
"\nMean interpopulation diversity (pi_B) = ", resMx[2],
"\n\nF_ST\n",
"\nHudson, Slatkin and Madison (Genetics 132:583-589): ",F_ST_OBS[0],
"\nSlatkin (Evolution 47:264-279) : ",F_ST_OBS[1],
"\nHudson, Boos and Kaplan (Mol Bio Evol 9: 138-151) : ",F_ST_OBS[2], "\n");

F_ST_OBS = F_ST_OBS;

ChoiceList (resample,"Permutation Test",1,SKIP_NONE,
"Skip","Do not perform a permutation test.",
"But of course","Randomly allocate sequences into subpopulations and tabulate the distribution of various F_ST statistics.");

if (resample < 1)
{
return 0;
}

sampleCount = 0;
while (sampleCount < 1)
{
fprintf (stdout, "\nHow many permutations would you like to perform?");
fscanf (stdin,"Number",sampleCount);
}

F_ST_1 = {sampleCount,2};
F_ST_2 = {sampleCount,2};
F_ST_3 = {sampleCount,2};
step = sampleCount$20;

fprintf (stdout, "Running the simulations...\n");

/* New bootstrapping within groups replacing old permutation code */

saveDM = distanceMatrix;

p1Size = Abs (p1vector);
p2Size = Abs (p2vector);
basePartition1 = {1,p1Size};
basePartition2 = {1,p2Size};

for (k=0; k<p1Size;k=k+1)
{
basePartition1[k] = p1vector[k];
}
for (k=0; k<p2Size;k=k+1)
{
basePartition2[k] = p2vector[k];
}

for (sampleID = 0; sampleID < sampleCount; sampleID = sampleID + 1)
{

/* repeat the following code for each replicate */
resampleP1 = Random (basePartition1,1); /* resample vector of indices with replacement */
resampleP2 = Random (basePartition2,1);

for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
ki = resampleP1[k];
for (k2 = k+1; k2 < p1Size; k2=k2+1)
{
k2i = resampleP1[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
ki = resampleP2[k];
for (k2 = k+1; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
ki = resampleP1[k];
for (k2 = 0; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
distanceMatrix [k][k2] = saveDM[ki][ki2];
distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}
/*End of new bootstrapping within groups code*/

resMx = computeCompartentValues (p1_1,p2_1);*/
resMx = computeCompartentValues (p1vector,p2vector);
pi_D = resMx[2]-resMx[1];
F_ST_1[sampleID][1] = pi_D/(resMx[1]+pi_D);
F_ST_2[sampleID][1] = pi_D/(resMx[1]+resMx[2]);
F_ST_3[sampleID][1] = 1-resMx[1]/resMx[0];

if ((sampleID+1)%step == 0)
{
fprintf (stdout, Format ((sampleID+1)*100/sampleCount, 6, 2), "% done\n");
}
}

F_ST_1 = F_ST_1%1;
F_ST_2 = F_ST_2%1;
F_ST_3 = F_ST_3%1;

pv = [sampleCount,sampleCount,sampleCount];

for (sampleID = 0; sampleID < sampleCount; sampleID = sampleID + 1)
{
step = sampleID/sampleCount;
F_ST_1[sampleID][0] = step;
F_ST_2[sampleID][0] = step;
F_ST_3[sampleID][0] = step;
if (pv[0]==sampleCount)
{
if (F_ST_1[sampleID][1] > F_ST_OBS[0])
{
pv[0] = sampleID;
}
}
if (pv[1]==sampleCount)
{
if (F_ST_2[sampleID][1] > F_ST_OBS[1])
{
pv[1] = sampleID;
}
}
if (pv[2]==sampleCount)
{
if (F_ST_3[sampleID][1] > F_ST_OBS[2])
{
pv[2] = sampleID;
}
}
}

fprintf (stdout, "\n\nProb {Random F_ST > Observed F_ST}\n",
"\nHudson, Slatkin and Madison : ",(sampleCount-pv[0])/sampleCount,
"\nSlatkin : ",(sampleCount-pv[1])/sampleCount,
"\nHudson, Boos and Kaplan : ",(sampleCount-pv[2])/sampleCount, "\n");

labels = ["Cumulative Weight","F_ST"];

OpenWindow (CHARTWINDOW,["Hudson, Slatkin and...

Dear Jon,

Hi, I'm covering for Sergei this week. I had a look at the batch file fragment in your last post and noticed something odd at lines 97-98:

resMx = computeCompartentValues (p1_1,p2_1);*/ resMx = computeCompartentValues (p1vector,p2vector);

I can't find a matching open-comment in the document, so I'm wondering if this is a typo. In the meantime, I'll have a closer look at this code.

Cheers,
- Art.

Okay, I've had a good look at the batch snippet now. A possible place to look for problems is when repopulating the global variable distanceMatrix with values from the saved distance matrix saveDM. It seems to me that the second iteration, repopulating distances for the second clade, will overwrite entries from the first clade. This would cause the resulting distance matrix to contain a lot more zero entries than you were expecting there to be.

In other words, the three loops should each be filling a different quadrant of the distance matrix. The first fills the upper left, the second fills the lower right, and the last (interpopulation distances) fills the upper right/lower left.

Adjusting the indices in the for loops (while retaining indexing over the resampling vectors) might fix your problem.

I hope.

=)
- Art.

Dear Art,

I understand your point, but I'm not sure how to change the indices so each clade-specific resampling will fill its own distance matrix quadrant. I'm a beginner and not yet familiar with HBL. I tried modifying the [k] in clade 2 for [m] or the i index with j in clade 2 but without success. What index exactly should I change and where?

Cheers

Jon

Dear Jon,

There were (as Art pointed out) indexing issues with the code I gave you (sorry about that!)

It's all been fixed, and incorporated as an option into the standard F_st analysis as of today's build. Download it and give that a try.

Thanks for your patience!
Sergei

Yay! I was right about something! ;D

Hi guys,

Tried the new version with the bootstrap included but it still give me non sense values. This time, bootstrap values are way off in the other direction (they're too high). E.g., in my data set, the observed value is 0.078 for Hudson, Slatkin, Madison (HSM), but the bootstrap mean is 0.315 with CI of 0.282-0.347. Same pattern for the other two Fst.

Then I tried it with a different data set with an observed HSM Fst of 0.023 and this time mean bootstrap was way too low and negative: -0.243 with CI of -0.269--0.200.

I'm clueless...

Jon

Dear Jon,

This is odd, since the 5 different data sets I tried the bootstrap on all gave sensible results. Could you maybe e-mail me one of the problematic data sets (and the partition), so that I can try to troubleshoot the issue?

Cheers,
Sergei

Dear Sergei,

The implementation of the bootstrap calculation of Fst CI and standard deviation in the last HYPHY build gives me huge bootstrap biases (bootstrap mean half the observed mean), quite large CI, and strangely, what seems rather small standard deviations.

While I guess it might not be impossible that these bootstrap biases are real, I wonder if there couldn't be something wrong with the calculation, although I can't spot anything obvious in the code myself. Before I submit the results for publication, would it be possible to double check that the results are reliable ?

Cheers

Jon

Dear Jon,

OK; I'll take a look on some of the example files we have here...
It's a good idea to check the results before you publish, indeed :P

Cheers,
Sergei

Dear Jon,

There was another typo (ki2 instead of k2i) in the F_st code (bugger!). Please replace the code starting at line 347 of F_st.bf with

Code (]
for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
ki = resampleP1[k):

;
ii = p1vector[k];
for (k2 = k+1; k2 < p1Size; k2=k2+1)
{
k2i = resampleP1[k2];
ii2 = p1vector[k2];
distanceMatrix [ii][ii2] = saveDM[ki][k2i];
distanceMatrix [ii2][ii] = saveDM[k2i][ki];
}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
ki = resampleP2[k];
ii = p2vector [k];
for (k2 = k+1; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
ii2 = p2vector [k2];
distanceMatrix [ii][ii2] = saveDM[ki][k2i];
distanceMatrix [ii2][ii] = saveDM[k2i][ki];
}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
ki = resampleP1[k];
ii = p1vector[k];
for (k2 = 0; k2 < p2Size; k2=k2+1)
{
k2i = resampleP2[k2];
ii2 = p2vector [k2];
distanceMatrix [ii][ii2] = saveDM[ki][k2i];
distanceMatrix [ii2][ii] = saveDM[k2i][ki];
}
}

I got these results on a compartmentalized HIV patient (expected population segregation):

[code]
Hudson, Slatkin and Madison (Genetics 132:583-589)
Observed value : 0.189
Bootst. mean : 0.253
Bootst. median : 0.250
Bootst. st. dev.: 0.059
Bootst. 95% CI : 0.146 - 0.375

Slatkin (Evolution 47:264-279)
Observed value : 0.104
Bootst. mean : 0.146
Bootst. median : 0.143
Bootst. st. dev.: 0.039
Bootst. 95% CI : 0.079 - 0.231

Hudson, Boos and Kaplan (Mol Bio Evol 9: 138-151)
Observed value : 0.100
Bootst. mean : 0.140
Bootst. median : 0.137
Bootst. st. dev.: 0.038
Bootst. 95% CI : 0.075 - 0.222
[/code]

The upwards bias in BS estimates is what you would expect, because resampling within group with replacement will introduce some artificial '0' distances, leading to inflated within-group similarity and between-group dissimilarity

Cheers,
Sergei

HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl HYPHY Package >> Feature Additions >> confidence intervals around Fst values http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1153502230 Message started by Jon on Jul 21^st, 2006 at 10:17am