Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
confidence intervals around Fst values (Read 13075 times)
Jon
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 11
confidence intervals around Fst values
Jul 21st, 2006 at 10:17am
 
Hi Sergei,

We'd like to write a batch file producing condidence intervals around Fst values based on the F_ST.bf currently included in HYPHY. To do this, the bootstrap resampling should be done only within the groups defined by the user rather than from the whole metapopulation as it is now implemented in F_ST.bf.

This would e.g. allow comparisons of F_ST values obtained by considering non-synonymous positions at functional positively selected sites vs those from synonymous positions to detect adaptive divergence.

We're struggling a bit with the code though. Would it be possible to get some hints about which modifications  would be needed to get the bootstrap done within groups to get a C.I. around Fst ?

For instance, would simply replacing "overallSample" and "bothCladeSize" by "cladeA", "cladeB" and "clASize", "clBSize" in the bootsrap routine do the job?

Any guidance would be appreciated.

Cheers,

Jonatan

Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: confidence intervals around Fst values
Reply #1 - Jul 24th, 2006 at 9:01am
 
Dear Jonatan,

If I understand your setup correctly, you want to use bootstrap with resapmling within each subpopulation, as opposed to a global population permutation? You need to use resampling within each population, because F_st is invariant with respect to within-population permutations.

I can help you with the modifications if this is what you had in mind...

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Jon
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 11
Re: confidence intervals around Fst values
Reply #2 - Jul 24th, 2006 at 10:39am
 
Hi Sergei,

You understood correctly. Sorry I was perhaps not very clear.

We would indeed like to use within subpopulations bootstrap resampling instead of the existing global permutation. This would give us the sampling variance of Fst values.

I would greatly appreciate your help.

Jonatan
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: confidence intervals around Fst values
Reply #3 - Jul 31st, 2006 at 2:47pm
 
Dear Jonatan,

In order to implement this you'd need to modify the variable (for each replicate) distanceMatrix (which stores the pairwise distances between all sequences) and then call computeCompartmentValues (p1vector,p2vector) where p1vector indexes all sequences in population 1 and p2vector - all in population 2.

Here's the general idea for the (slow) code which will generate resampled distanceMatrix. Let me know if this makes any sense...

Code:
saveDM 	   = distanceMatrix;

p1Size		   = Abs (p1vector);
p2Size		   = Abs (p2vector);
basePartition1 = {1,p1Size};
basePartition2 = {1,p2Size};

for (k=0; k<p1Size;k=k+1)
{
	basePartition1[k] = p1vector[k];
}
for (k=0; k<p2Size;k=k+1)
{
	basePartition2[k] = p2vector[k];
}

/* repeat the following code for each replicate */
resampleP1 = Random (basePartition1,1); /* resample vector of indices with replacement */
resampleP2 = Random (basePartition1,2);

for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
	ki = resampleP1[k];
	for (k2 = k+1; k2 < p1Size; k2=k2+1)
	{
		k2i = resampleP1[k2];
		distanceMatrix [k][k2] = saveDM[ki][ki2];
		distanceMatrix [k2][k] = saveDM[ki2][ki];
	}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
	ki = resampleP2[k];
	for (k2 = k+1; k2 < p2Size; k2=k2+1)
	{
		k2i = resampleP2[k2];
		distanceMatrix [k][k2] = saveDM[ki][ki2];
		distanceMatrix [k2][k] = saveDM[ki2][ki];
	}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
	ki = resampleP1[k];
	for (k2 = 0; k2 < p2Size; k2=k2+1)
	{
		k2i = resampleP2[k2];
		distanceMatrix [k][k2] = saveDM[ki][ki2];
		distanceMatrix [k2][k] = saveDM[ki2][ki];
	}
}

 



Cheers,
Sergei


Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Jonatan Blais
Guest


Re: confidence intervals around Fst values
Reply #4 - Sep 5th, 2006 at 1:16pm
 
Hi Sergei,

Thank you so much for the bootstrapping program.

We managed to insert it to replace the old permutation routine in the F_ST.bf and it seems that HYPHY reads it through (after correcting a little mistake by replacing “basePartition1,2” by “basePartition2,1” in “resampleP2 = Random (basePartition2,1)”).

However, the output doesn’t seem to make sense. All Fst estimates from bootstrapped samples are lower than the observed one. Normally of course, some should be lower and some higher so a confidence interval could be obtained. I’ve pasted the last part of the F_ST.bf containing the newly inserted bootstrapping code. I suspect there is something wrong in the way the bootstrapping is done but can’t manage to figure out what it could be.

Cheers,

Jon



F_ST_OBS = {{pi_D/(resMx[1]+pi_D),pi_D/(resMx[1]+resMx[2]),1-resMx[1]/resMx[0]}};

fprintf (stdout, "\n\nPopulation characterisitcs:",
                        "\nMetapopulation diversity (pi_T)       = ", resMx[0],
                        "\nMean subpopulation diversity (pi_S)   = ", resMx[1],
                        "\nMean interpopulation diversity (pi_B) = ", resMx[2],
                        "\n\nF_ST\n",
                        "\nHudson, Slatkin and Madison (Genetics     132:583-589): ",F_ST_OBS[0],
                        "\nSlatkin                     (Evolution    47:264-279) : ",F_ST_OBS[1],
                        "\nHudson, Boos and Kaplan          (Mol Bio Evol 9: 138-151) : ",F_ST_OBS[2], "\n");
                       
F_ST_OBS = F_ST_OBS;
                       
ChoiceList (resample,"Permutation Test",1,SKIP_NONE,
                              "Skip","Do not perform a permutation test.",
                            "But of course","Randomly allocate sequences into subpopulations and tabulate the distribution of various F_ST statistics.");
                       

if (resample < 1)
{
     return 0;
}

sampleCount = 0;
while (sampleCount < 1)
{
     fprintf (stdout, "\nHow many permutations would you like to perform?");
     fscanf (stdin,"Number",sampleCount);
}

F_ST_1 = {sampleCount,2};
F_ST_2 = {sampleCount,2};
F_ST_3 = {sampleCount,2};
step   = sampleCount$20;

fprintf (stdout, "Running the simulations...\n");

/* New bootstrapping within groups replacing old permutation code */

saveDM  = distanceMatrix;

p1Size  = Abs (p1vector);
p2Size  = Abs (p2vector);
basePartition1 = {1,p1Size};
basePartition2 = {1,p2Size};

for (k=0; k<p1Size;k=k+1)
{
basePartition1[k] = p1vector[k];
}
for (k=0; k<p2Size;k=k+1)
{
basePartition2[k] = p2vector[k];
}

for (sampleID = 0; sampleID < sampleCount; sampleID = sampleID + 1)
{


/* repeat the following code for each replicate */
resampleP1 = Random (basePartition1,1); /* resample vector of indices with replacement */
resampleP2 = Random (basePartition2,1);

for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
{
ki = resampleP1[k];
for (k2 = k+1; k2 < p1Size; k2=k2+1)
{
  k2i = resampleP1[k2];
  distanceMatrix [k][k2] = saveDM[ki][ki2];
  distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
{
  ki = resampleP2[k];
  for (k2 = k+1; k2 < p2Size; k2=k2+1)
  {
   k2i = resampleP2[k2];
   distanceMatrix [k][k2] = saveDM[ki][ki2];
   distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}

for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
{
ki = resampleP1[k];
for (k2 = 0; k2 < p2Size; k2=k2+1)
{
  k2i = resampleP2[k2];
  distanceMatrix [k][k2] = saveDM[ki][ki2];
  distanceMatrix [k2][k] = saveDM[ki2][ki];
}
}
/*End of new bootstrapping within groups code*/

     resMx = computeCompartentValues (p1_1,p2_1);*/
     resMx = computeCompartentValues (p1vector,p2vector);
     pi_D = resMx[2]-resMx[1];      
     F_ST_1[sampleID][1] = pi_D/(resMx[1]+pi_D);
     F_ST_2[sampleID][1] = pi_D/(resMx[1]+resMx[2]);
     F_ST_3[sampleID][1] = 1-resMx[1]/resMx[0];
     
     if ((sampleID+1)%step == 0)
     {
           fprintf (stdout, Format ((sampleID+1)*100/sampleCount, 6, 2), "% done\n");
     }
}

F_ST_1 = F_ST_1%1;
F_ST_2 = F_ST_2%1;
F_ST_3 = F_ST_3%1;

pv = {{sampleCount,sampleCount,sampleCount}};

for (sampleID = 0; sampleID < sampleCount; sampleID = sampleID + 1)
{
     step = sampleID/sampleCount;
     F_ST_1[sampleID][0] = step;
     F_ST_2[sampleID][0] = step;
     F_ST_3[sampleID][0] = step;
     if (pv[0]==sampleCount)
     {
           if (F_ST_1[sampleID][1] > F_ST_OBS[0])
           {
                 pv[0] = sampleID;
           }
     }
     if (pv[1]==sampleCount)
     {
           if (F_ST_2[sampleID][1] > F_ST_OBS[1])
           {
                 pv[1] = sampleID;
           }
     }
     if (pv[2]==sampleCount)
     {
           if (F_ST_3[sampleID][1] > F_ST_OBS[2])
           {
                 pv[2] = sampleID;
           }
     }
}


fprintf (stdout, "\n\nProb {Random F_ST > Observed F_ST}\n",
                        "\nHudson, Slatkin and Madison : ",(sampleCount-pv[0])/sampleCount,
                        "\nSlatkin                     : ",(sampleCount-pv[1])/sampleCount,
                        "\nHudson, Boos and Kaplan          : ",(sampleCount-pv[2])/sampleCount, "\n");


labels = {{"Cumulative Weight","F_ST"}};

OpenWindow (CHARTWINDOW,{{"Hudson, Slatkin and...
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: confidence intervals around Fst values
Reply #5 - Sep 5th, 2006 at 3:38pm
 
Dear Jon,

Hi, I'm covering for Sergei this week.  I had a look at the batch file fragment in your last post and noticed something odd at lines 97-98:

       resMx = computeCompartentValues (p1_1,p2_1);*/
     resMx = computeCompartentValues (p1vector,p2vector);


I can't find a matching open-comment in the document, so I'm wondering if this is a typo.  In the meantime, I'll have a closer look at this code.

Cheers,
- Art.
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: confidence intervals around Fst values
Reply #6 - Sep 5th, 2006 at 4:28pm
 
Okay, I've had a good look at the batch snippet now.  A possible place to look for problems is when repopulating the global variable distanceMatrix with values from the saved distance matrix saveDM.  It seems to me that the second iteration, repopulating distances for the second clade, will overwrite entries from the first clade.  This would cause the resulting distance matrix to contain a lot more zero entries than you were expecting there to be.

In other words, the three loops should each be filling a different quadrant of the distance matrix.  The first fills the upper left, the second fills the lower right, and the last (interpopulation distances) fills the upper right/lower left.

Adjusting the indices in the for loops (while retaining indexing over the resampling vectors) might fix your problem.

I hope.

=)
- Art.
Back to top
 
 
IP Logged
 
Jon
Ex Member


Re: confidence intervals around Fst values
Reply #7 - Sep 12th, 2006 at 9:02am
 
Dear Art,

I understand your point, but I'm not sure how to change the indices so each clade-specific resampling will fill its own distance matrix quadrant. I'm a beginner and not yet familiar with HBL. I tried modifying the [k] in clade 2 for [m] or the i index with j in clade 2 but without success. What index exactly should I change and where?

Cheers

Jon
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: confidence intervals around Fst values
Reply #8 - Sep 14th, 2006 at 11:55am
 
Dear Jon,

There were (as Art pointed out) indexing issues with the code I gave you (sorry about that!)

It's all been fixed, and incorporated as an option into the standard F_st analysis as of today's build. Download it and give that a try.

Thanks for your patience!
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: confidence intervals around Fst values
Reply #9 - Sep 14th, 2006 at 12:03pm
 
Yay!  I was right about something!  Grin
Back to top
 
 
IP Logged
 
Jon
Ex Member


Re: confidence intervals around Fst values
Reply #10 - Sep 15th, 2006 at 1:45pm
 
Hi guys,

Tried the new version with the bootstrap included but it still give me non sense values. This time, bootstrap values are way off in the other direction (they're too high). E.g., in my data set, the observed value is 0.078 for Hudson, Slatkin, Madison (HSM), but the bootstrap mean is 0.315 with CI of 0.282-0.347. Same pattern for the other two Fst.

Then I tried it with a different data set with an observed HSM Fst of 0.023 and this time mean bootstrap was way too low and negative: -0.243 with CI of -0.269--0.200.

I'm clueless...

Jon
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: confidence intervals around Fst values
Reply #11 - Sep 15th, 2006 at 3:19pm
 
Dear Jon,

This is odd, since the 5 different data sets I tried the bootstrap on all gave sensible results. Could you maybe e-mail me one of the problematic data sets (and the partition), so that I can try to troubleshoot the issue?

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Jon
Ex Member


Re: confidence intervals around Fst values
Reply #12 - Oct 31st, 2006 at 8:19am
 
Dear Sergei,

The implementation of the bootstrap calculation of Fst CI and standard deviation in the last HYPHY build gives me huge bootstrap biases (bootstrap mean half the observed mean), quite large CI, and strangely, what seems rather small standard deviations.

While I guess it might not be impossible that these bootstrap biases are real, I wonder if there couldn't be something wrong with the calculation, although I can't spot anything obvious in the code myself. Before I submit the results for publication, would it be possible to double check that the results are reliable ?

Cheers

Jon
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: confidence intervals around Fst values
Reply #13 - Oct 31st, 2006 at 11:21am
 
Dear Jon,

OK; I'll take a look on some of the example files we have here...
It's a good idea to check the results before you publish, indeed :P

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: confidence intervals around Fst values
Reply #14 - Nov 14th, 2006 at 5:31pm
 
Dear Jon,

There was another typo (ki2 instead of k2i) in the F_st code (bugger!). Please replace the code starting at line 347 of F_st.bf with
[code]
            for (k=0; k<p1Size;k=k+1) /* repopulate distances for population 1 */
            {
                 ki = resampleP1[k];
                 ii = p1vector[k];
                 for (k2 = k+1; k2 < p1Size; k2=k2+1)
                 {
                       k2i = resampleP1[k2];
                       ii2 = p1vector[k2];
                       distanceMatrix [ii][ii2] = saveDM[ki][k2i];
                       distanceMatrix [ii2][ii] = saveDM[k2i][ki];
                 }
            }

            for (k=0; k<p2Size;k=k+1) /* repopulate distances for population 2 */
            {
                   ki = resampleP2[k];
                 ii = p2vector  [k];
                   for (k2 = k+1; k2 < p2Size; k2=k2+1)
                 {
                       k2i = resampleP2[k2];
                       ii2 = p2vector  [k2];
                       distanceMatrix  [ii][ii2] = saveDM[ki][k2i];
                       distanceMatrix  [ii2][ii] = saveDM[k2i][ki];
                 }
            }

           for (k=0; k<p1Size;k=k+1) /* repopulate interpopulation distances */
           {
                 ki = resampleP1[k];
                 ii = p1vector[k];
                 for (k2 = 0; k2 < p2Size; k2=k2+1)
                 {
                       k2i = resampleP2[k2];
                       ii2 = p2vector  [k2];
                       distanceMatrix [ii][ii2] = saveDM[ki][k2i];
                       distanceMatrix [ii2][ii] = saveDM[k2i][ki];
                 }
           }
[/code]

I got these results on a compartmentalized HIV patient (expected population segregation):

[code]
Hudson, Slatkin and Madison (Genetics     132:583-589)
Observed value  :    0.189
Bootst. mean    :    0.253
Bootst. median  :    0.250
Bootst. st. dev.:    0.059
Bootst. 95% CI  :    0.146 -    0.375

Slatkin                     (Evolution    47:264-279)
Observed value  :    0.104
Bootst. mean    :    0.146
Bootst. median  :    0.143
Bootst. st. dev.:    0.039
Bootst. 95% CI  :    0.079 -    0.231

Hudson, Boos and Kaplan     (Mol Bio Evol 9: 138-151)
Observed value  :    0.100
Bootst. mean    :    0.140
Bootst. median  :    0.137
Bootst. st. dev.:    0.038
Bootst. 95% CI  :    0.075 -    0.222
[/code]

The upwards bias in BS estimates is what you would expect, because resampling within group with replacement will introduce some artificial '0' distances, leading to inflated within-group similarity and between-group dissimilarity

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged