splitting simpleBootstrap.bf into multiple jobs
Danny
YaBB Newbies
*
Offline



Posts: 31
splitting simpleBootstrap.bf into multiple jobs
Mar 2nd, 2009 at 4:55pm
 
I thought it might be useful (e.g. for adding more bootstrap replicates later), and probably faster, to split the bootstrapping into multiple jobs using the single-processor version of HyPhy.

I wanted to avoid optimizing on the original dataset every time, so I needed to include the lf values and the optimization receptacle matrix for the optimized lf.  These values can be printed out once and reused for different bootstrap runs. For example:

ExecuteAFile(lf_values_file);

where lf_values_file is the output of LIKELIHOOD_FUNCTION_OUTPUT=5; fprintf(lf_values_file, lf)

and

res = receptacle_matrix

where receptacle_matrix is the output of fprintf("receptacle_matrix", CLEAR_FILE, res);

If you do this, a problem occurs in the simpleBootstrap.bf script when the native global variables are reset for the next bootstrap iteration.  If you have multiple global variables, their values may get scrambled, which causes the simulation to create a dataset based on the wrong model. This seems to be caused by a reindexing of the lf array when Optimize is run.  For example, before Optimize your nt rate indices may be ordered as:

AC, AT, GT

but after Optimize it might change to something like

AC, GT, AT

One solution is to get the optimized global variable values from the input lf instead of the res matrix.
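
To illustrate why positional lookup goes wrong here, the following is a small Python model of the problem. It is not HyPhy code, and the rate values are made up for the example; it only shows that pairing old names with new values by index mislabels them, while pairing each value with the name it is currently reported under does not.

```python
# Toy model of the reindexing problem; the values are hypothetical.
names_before = ["AC", "AT", "GT"]   # parameter order before Optimize
names_after = ["AC", "GT", "AT"]    # parameter order after Optimize
values_after = [0.5, 1.7, 0.9]      # optimized values, in the *after* order


def restore_by_position(old_names, new_values):
    """Pair the old name order with the new values by index (the scrambling case)."""
    return dict(zip(old_names, new_values))


def restore_by_name(current_names, current_values):
    """Pair each value with the name it is currently reported under."""
    return dict(zip(current_names, current_values))


by_position = restore_by_position(names_before, values_after)
by_name = restore_by_name(names_after, values_after)

print("by position:", by_position)  # AT and GT end up swapped
print("by name:    ", by_name)
```

Looking the values up by name is the same idea as the GetString-based change that follows.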

In the simpleBootstrap.bf you can change:

  if (SAVE_GLOBALS) {
    globalSpoolMatrix = {1, SAVE_GLOBALS};
    for (bsCounter = 0; bsCounter < SAVE_GLOBALS; bsCounter = bsCounter + 1) {
     globalSpoolMatrix[bsCounter] = res[0][bsCounter];
    }
  }

to

  if (SAVE_GLOBALS) {
    globalSpoolMatrix = {1, SAVE_GLOBALS};
    for (bsCounter = 0; bsCounter < SAVE_GLOBALS; bsCounter = bsCounter + 1) {
     GetString (_i, lf, bsCounter);
     globalSpoolMatrix[bsCounter] = valueGrab(_i);
    }
  }

valueGrab should be defined somewhere before it's called:

function valueGrab (varName&) {
  return varName;
}

Since the order of the values in the specified res matrix will likely be different than the GetString order of the specified lf, the MLE order in the tabulated and summary output files will not match the headers.  This can be fixed by changing the order of the res matrix.

change

  for (bsCounter = 0; bsCounter < dataDimension; bsCounter = bsCounter + 1) {
    GetString (_i, lf, bsCounter);
    _i = _i ^ {{"givenTree\\.", ""}}; /* gets rid of the givenTree. from variables */
    _variableMap[_i] = Abs (_variableMap);
  }

to

  for (bsCounter = 0; bsCounter < dataDimension; bsCounter = bsCounter + 1) {
    GetString (_i, lf, bsCounter);
    res[0][bsCounter] = valueGrab(_i);
    _i = _i ^ {{"givenTree\\.", ""}}; /* gets rid of the givenTree. from variables */
    _variableMap[_i] = Abs (_variableMap);
  }


It would be nice if simpleBootstrap.bf didn't have to use the res matrix at all.  I think all that is needed from it is the dataDimension and SAVE_GLOBALS.  These can probably be obtained directly from the lf. Would

dataDimension = Columns (lf_summary["Global Independent"]) + Columns(lf_summary["Local Independent"]) + Columns(lf_summary["Global Constrained"]);

and

SAVE_GLOBALS = Columns (lf_summary["Global Independent"]);

be reliable?

-danny
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: splitting simpleBootstrap.bf into multiple jobs
Reply #1 - Mar 2nd, 2009 at 7:56pm
 
Dear Danny,

The easiest thing to do for saving and reloading a complete likelihood state is to use

Code:
LIKELIHOOD_FUNCTION_OUTPUT=7;
fprintf (savedLF, CLEAR_FILE,lf);
...
for (it = 0; it < max_it; it = it+1)
{
	/* restore the LF and all corresponding parameter values */
	ExecuteAFile (savedLF);
	/* simulate from lf, etc. */
}

 



Option '7' will write out a NEXUS file with a HyPhy block spooling out the batch code needed to recreate the likelihood function from scratch.

HTH,
Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
Danny
Reply #2 - Mar 3rd, 2009 at 3:13am
 
Dear Sergei,

Thanks for the LIKELIHOOD_FUNCTION_OUTPUT=7; tip.

simpleBootstrap.bf still wants a treeString and needs to know if _Genetic_Code is true when using codon models.

I defined treeString = Format(givenTree,1,0) + ";";
and had to include chooseGeneticCode.def.

If anyone could benefit from running bootstraps after the lf is defined, here is a way that is working for me.  First you need a local script; call it bsLF.bf.

[code]

/* the following 4 lines are for when you are using a codon model only */
skipCodeSelectionStep = 1;
#include "/home/dwrice/hyphy/TemplateBatchFiles/TemplateModels/chooseGeneticCode.def";
modelType = 0; /* for universal genetic code */
ApplyGeneticCodeTable(modelType);


/* savedLF can be any file that you wrote with LIKELIHOOD_FUNCTION_OUTPUT=7; fprintf (some_file, CLEAR_FILE,lf); after optimizing the lf */
savedLF = "savedLF";
ExecuteAFile (savedLF);


/* arguments for BootStrapFunction */
bsIterates = 2;
tabulatedFileName = "bsLF.tab";
summaryFileName = "bsLF.summary";
parametricOrNot = 1;

#include "/home/dwrice/hyphy/TemplateBatchFiles/simpleBootstrapLF.bf";
BootStrapFunction(bsIterates, tabulatedFileName, summaryFileName, parametricOrNot);

[/code]

run it like

hyphy BASEPATH=path-to-directory-before-TemplateBatchFiles full-directory-path-to-above-batch-file/bsLF.bf

The modified simpleBootstrap.bf follows. I called it simpleBootstrapLF.bf.  It should do the same thing as the original simpleBootstrap.bf, but it also figures out how to deal with lf variables that are indexed differently from the optimized variables.  It also writes the summary to the specified file instead of stdout, although if you are running this many times you probably don't care about the individual summaries.  I'll post a script later to summarize several different runs.

The simpleBootstrapLF.bf file is attached.

 
Danny
Reply #3 - Mar 3rd, 2009 at 4:42pm
 
The batch file attached above was not printing the lnL correctly on the MLE line.  I added:

LFCompute (lf,LF_START_COMPUTE);
LFCompute (lf,LFCompute_lnL);
LFCompute (lf,LF_DONE_COMPUTE);
res[1][0] = LFCompute_lnL;

to get the lnL value.  Attached is an updated batch file.
 
Sergei
Reply #4 - Mar 4th, 2009 at 10:08am
 
Dear Danny,

Thanks for the script! I am actually in the process of redesigning the HyPhy web site, and will include a procedure for users to submit files to the HyPhy library. When that is available, it would be great if you could add any files that you wish to share to that repository.

Cheers,
Sergei
 
Danny
Reply #5 - Mar 4th, 2009 at 12:50pm
 
Dear Sergei,

A user repository would be great, thanks!

On another topic, I've noticed that calling batch scripts from a local script can lead to various path problems depending on how things are done, and I don't understand the rules for the paths.  Sometimes when a file can't be found, the Path Stack that is printed out contains the necessary path, but it wasn't tried.  Is only the directory at the top of the stack tried?  I can usually get things to work by a combination of explicit paths and setting BASEPATH and HYPHY_BASE_DIRECTORY correctly.  If I understood better how the paths work it might make things easier, but it would be nice if the working directory were always the priority path for input and output, the rest of the path could be specified with multiple values that remained static, and the path were searched in order like a shell $PATH.

-Danny
 
Sergei
Reply #6 - Mar 4th, 2009 at 8:54pm
 
Dear Danny,

When you start HyPhy, the path that becomes the root for all relative paths is taken from

1). USEPATH command line argument (if provided)
2). BASEPATH command line argument (if provided and USEPATH is not provided)
3). Current working directory otherwise.

This path will be used for any top-level batch file (i.e. one that is not #include'd or ExecuteAFile'd). When you include a batch file from another batch file, the path stack is augmented with the directory where that batch file is (i.e. relative paths are always in relation to the included file).

HyPhy does not search the path stack; it will throw an error if a relative path composed with the top of the stack does not resolve to an existing object.

There are also a few useful runtime path variables that HyPhy sets and makes available to any batch file
  • PATH_TO_CURRENT_BF The absolute path to the current batch file
  • HYPHY_BASE_DIRECTORY The base directory (either BASEPATH or current working directory). This directory is expected to contain TemplateBatchFiles and other distribution modules
  • DIRECTORY_SEPARATOR The directory separator character appropriate for the host platform ('/' for Linux, ':' for Mac OS and '\' for Windows). It is useful if you want to compose a file path at run time (e.g. an absolute path + directory_name + DIRECTORY_SEPARATOR + filename).


There is also a function in TemplateBatchFiles/Utility/GrabBag.bf called splitFilePath that takes an absolute path and returns a dictionary with "DIRECTORY", "FILENAME" and "EXTENSION" keys.

HTH,
Sergei
 
Danny
Reply #7 - Mar 5th, 2009 at 1:24am
 
Dear Sergei,

Thanks. I think I might have it, as long as the answers to the two questions below are yes.

Sergei wrote on Mar 4th, 2009 at 8:54pm:
When you include a batch file from another batch file, the path stack is augmented with the directory where that batch file is (i.e. relative paths are always in relation to the included file).


And the stack is popped when the included file returns?

Quote:
HyPhy does not search the path stack; it will throw an error if a relative path composed with the top of the stack does not resolve to an existing object.


ok.

So once you're in an included file in a new subdirectory you have to call any other includes relative to that subdirectory?

-Danny
 
Danny
Reply #8 - Mar 5th, 2009 at 1:41am
 
For anyone who uses Perl, I wrote a Perl function to parse one or more table files from simpleBootstrap.bf.

The header is below and the function is attached.

perl function ParseBootstrapTables

Parses the tabular output(s) of simpleBootstrap.bf

The parameter headers in the different input table files do NOT have to be in the same order.

usage

ParseBootstrapTable(@list_of_table_files);

return values

return (\@headers, \%headers, \@p, \@mean, \@sd, \@MLE, $nbs, $length_of_longest_header_string);

\@headers - array of headers
\%headers - hash with header strings as keys and indices as values
\@p - 2D matrix where row $i corresponds to $headers[$i]; columns correspond to replicates
\@MLE - array of MLEs for the native dataset where $MLE[$i] corresponds to $headers[$i]
\@mean - $mean[$i] corresponds to mean of replicates for $headers[$i]
\@sd - $sd[$i] corresponds to standard deviation of replicates for $headers[$i]
$nbs - the number of bootstrap replicates
$length_of_longest_header_string - length of longest header string

The %headers hash allows you to easily index into one of the arrays when you know the header string but not the index, e.g.

$header = some-header-string-that-you-got-somehow

$sd_im_interested_in = $sd->[$headers{$header}];
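
For readers without the attachment, the merging idea can be sketched in Python as below. This is not the attached Perl function. The table layout is an assumption made for the example (tab-separated, a header row, then one row of MLEs, then one row per replicate), the function name merge_tables and the sample numbers are hypothetical, and the sample standard deviation is one possible choice. The point is only that keying every column by its header string makes the column order of each input file irrelevant.

```python
import statistics


def merge_tables(tables):
    """Merge bootstrap tables whose columns may be in different orders.

    Each table is a string in an ASSUMED layout: tab-separated, first row
    headers, second row MLEs, remaining rows bootstrap replicates.
    """
    mle = {}
    replicates = {}
    for text in tables:
        rows = [line.split("\t") for line in text.strip().splitlines()]
        header, mle_row, rep_rows = rows[0], rows[1], rows[2:]
        for col, name in enumerate(header):
            # Index by header string, never by column position.
            mle.setdefault(name, float(mle_row[col]))
            replicates.setdefault(name, []).extend(float(r[col]) for r in rep_rows)
    summary = {
        name: (statistics.mean(vals), statistics.stdev(vals))
        for name, vals in replicates.items()
    }
    return mle, replicates, summary


# Two hypothetical runs with the columns in a different order.
run_a = "AC\tAT\n0.5\t0.9\n0.4\t1.0\n0.6\t0.8\n"
run_b = "AT\tAC\n0.9\t0.5\n1.1\t0.5\n"

mle, replicates, summary = merge_tables([run_a, run_b])
for name, (mean, sd) in summary.items():
    print(name, "MLE", mle[name], "mean", round(mean, 4), "sd", round(sd, 4))
```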
 
Sergei
Reply #9 - Mar 5th, 2009 at 6:48am
 
Dear Danny,

Yes, the stack is popped. Actually, what I said in the previous posting is not entirely correct: even the first file that you execute will add its enclosing directory to the path stack, so that you can always think of paths relative to the location of the currently executing file. USEPATH, BASEPATH, and the cwd are used to resolve partial pathnames to find the batch file to execute. I'll write an example to demonstrate.
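
As a reader's illustration of the rules described in this thread (a Python model, not HyPhy code and not the example promised above), the sketch below treats the path stack as a plain list. The class and function names, the directory layout, and the set standing in for the file system are all assumptions made for the example. It shows the three points discussed: USEPATH, then BASEPATH, then the working directory locate the top-level file; every executed file pushes its own directory; and a relative path is tried against the top of the stack only.

```python
import posixpath


def choose_root(usepath=None, basepath=None, cwd="/"):
    """Root used to locate the top-level batch file:
    USEPATH if given, else BASEPATH if given, else the working directory."""
    if usepath:
        return usepath
    if basepath:
        return basepath
    return cwd


class PathStack:
    """Toy path stack: relative paths are tried against the top entry only."""

    def __init__(self, root, top_level_file, existing):
        self.existing = set(existing)  # stand-in for the file system
        top = posixpath.normpath(posixpath.join(root, top_level_file))
        if top not in self.existing:
            raise FileNotFoundError(top)
        # Even the first file pushes its enclosing directory.
        self.stack = [posixpath.dirname(top)]

    def resolve(self, relative):
        candidate = posixpath.normpath(posixpath.join(self.stack[-1], relative))
        if candidate not in self.existing:
            # Lower stack entries are not searched: fail immediately.
            raise FileNotFoundError(candidate)
        return candidate

    def include(self, relative):
        path = self.resolve(relative)
        self.stack.append(posixpath.dirname(path))
        return path

    def end_include(self):
        self.stack.pop()


files = {
    "/proj/run/bsLF.bf",
    "/proj/run/data.txt",
    "/proj/lib/boot.bf",
    "/proj/lib/util/helper.bf",
}
paths = PathStack(choose_root(basepath="/proj"), "run/bsLF.bf", files)
paths.include("../lib/boot.bf")            # stack: /proj/run, /proj/lib
inside = paths.resolve("util/helper.bf")   # relative to the included file
try:
    paths.resolve("data.txt")              # /proj/run is on the stack but not tried
    missed = False
except FileNotFoundError:
    missed = True
paths.end_include()                        # stack popped when the include returns
after = paths.resolve("data.txt")
print(inside, missed, after)
```

In this model, data.txt is found only after the include has returned, which matches the behaviour where the needed directory appears in the printed stack but is not tried.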

Cheers,
Sergei