Genetic Algorithm with HyPhy
tlefebure
Cornell University
Sep 20th, 2006 at 12:03pm
 
Hello HyPhy community,

I plan to run two HyPhy GA analyses (BranchSelector and GARD) on a cluster, because I cannot run them with Datamonkey (too many taxa), and I have some questions:

  • Where is the GARD batch file? GARecomb.bf does not implement the complete breakpoint search, but instead fixes the number of breakpoints to a predefined value, right?

  • Is there a rule for choosing the number of CPUs and the CHC population size? In the GARD paper, you wrote: "All sequence analyses and model fitting were performed using the HyPhy (Kosakovsky Pond et al. 2005) software on a P-node message passing interface cluster. P - 1 slave nodes were used to fit various models, and a single master node dispatched the jobs and assembled the results. The size of CHC population was set to 2P - 2 individuals. We set P = 17 for the analyses in this article." Should we first define the CHC population size according to the data set analyzed, and then apply the P = (pop_size + 2)/2 rule?


Thanks
Tristan
 
Sergei
UCSD
Reply #1 - Sep 20th, 2006 at 8:33pm
 
Dear Tristan,

The GARD batch file as implemented in datamonkey.org/gard is not (yet) a part of the HyPhy distribution. Let me make some small changes to it (so it is more interactive), and post a version here.

The population size in GARD is wholly determined by the number of nodes (P) you run it on (pop size = 2P-2). Anything with 16 CPUs or over should be fine, and it doesn't really depend on the size of the problem; larger problems might take more generations to converge, that's about it.
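As a rough illustration of the rule above (a sketch only, not code taken from the GARD batch file; the function names are made up for this example), the relationship between the node count P and the CHC population size can be written in Python as:

```python
def gard_population_size(nodes: int) -> int:
    """Population size for P MPI nodes (one master plus P - 1 workers)."""
    if nodes < 2:
        raise ValueError("need at least one master and one worker node")
    return 2 * nodes - 2


def nodes_for_population(pop_size: int) -> int:
    """Invert the rule: P = (pop_size + 2) / 2, for even population sizes."""
    if pop_size < 2 or pop_size % 2:
        raise ValueError("population size must be a positive even number")
    return (pop_size + 2) // 2


if __name__ == "__main__":
    # Print the population size for a few node counts.
    for p in (16, 17, 32, 40):
        print(p, "nodes ->", gard_population_size(p), "models")
```

With P = 17, as in the paper quoted earlier in the thread, this gives a population of 32 models.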

Cheers,
Sergei
Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
 
Sergei
Reply #2 - Sep 27th, 2006 at 8:54pm
 
Dear Tristan,

To run GARD locally you need to download two files (linked below) into the same directory and then call:

Code:
 mpirun -np xx ./HYPHYMPI pathto/LocalGARD.bf 



Follow the prompts and wait.

The analysis prints many informational messages and gives live updates of the form:
Code:
GENERATION 68 with 3 breakpoints (~82% converged)
Breakpoints    c-AIC  Delta c-AIC [BP	1] [BP	2] [BP	3]
	    0  9395.61
	    1  9230.63	164.979	 1105
	    2  9147.48	 83.151	 1105	  1883
	    3  9119.19	 28.289	  440	  1105	  1883
GA has considered	  1118/     1179616 (2221 over all runs) unique models
Total run time	     0 hrs 2 mins 23 seconds
Throughput		   15.53 models/second
Allocated time remaining 999 hrs 57 mins 37 seconds (approx. 5.59111e+07 more models.)
 



At the end of the run you will have 4 files:

  • HTML summary (written to the path you specified when prompted; call it OUTPATH)
  • OUTPATH_splits: a file that contains the inferred non-recombinant fragments and their respective trees. It looks like this:
    Code:
    0-440
    ((B_US_83_RF_ACC_M17451:0.0139054,B_US_90_WEAU160_ACC_U21135:0.00229248):0,(B_FR_83_HXB2_ACC_K03455:0.00691885,B_US_86_JRFL_ACC_U63632:0):0.00229712,(((RecombStrain:0.00460448,(D_CD_83_ELI_ACC_K03454:0.0185925,D_CD_83_NDK_ACC_M27323:0):0.00460677):0.00461784,D_CD_84_84ZR085_ACC_U88822:0.00460458):0,D_UG_94_94UG114_ACC_U88824:0.023297):0.00459755)
    441-1105
    ((((D_UG_94_94UG114_ACC_U88824:0.0183339,D_CD_84_84ZR085_ACC_U88822:0.0122237):0.00753758,B_US_83_RF_ACC_M17451:0.00604776):0.00146116,B_FR_83_HXB2_ACC_K03455:0.00149575):0,B_US_90_WEAU160_ACC_U21135:0.00905084,((RecombStrain:0.00299745,(D_CD_83_ELI_ACC_K03454:0.00449303,D_CD_83_NDK_ACC_M27323:0):0.00148771):0.00301784,B_US_86_JRFL_ACC_U63632:0.00147536):0.00752275)
    1106-1883
    ((((D_CD_83_ELI_ACC_K03454:0.00775632,D_CD_83_NDK_ACC_M27323:0.00517493):0.00126575,B_US_86_JRFL_ACC_U63632:0.00387136):0.00516652,B_US_90_WEAU160_ACC_U21135:0.00775083):0,B_FR_83_HXB2_ACC_K03455:0.00905566,(((RecombStrain:0.0025888,D_UG_94_94UG114_ACC_U88824:0.0182976):0.0130763,D_CD_84_84ZR085_ACC_U88822:0.00517239):0.00523374,B_US_83_RF_ACC_M17451:0.0104269):0.0012376)
    1884-2399
    (((RecombStrain:0.00389106,D_CD_83_NDK_ACC_M27323:0.00194546):0,D_CD_83_ELI_ACC_K03454:0.00584616):0.00196102,D_UG_94_94UG114_ACC_U88824:0.017756,(D_CD_84_84ZR085_ACC_U88822:0.00785642,(B_US_83_RF_ACC_M17451:0.0078008,((B_FR_83_HXB2_ACC_K03455:0,B_US_86_JRFL_ACC_U63632:0.00585428):0,B_US_90_WEAU160_ACC_U21135:0):0.0019538):0.00982158):0.00190309)
     
    
    

    You can feed this file to [link not preserved] to do some postprocessing, such as tree split identity and pairwise SH tests.
  • OUTPATH_finalout: a NEXUS file with the original data, splits and trees, which you can feed to other packages that understand NEXUS
  • OUTPATH_ga_details: the raw output from the GA run; I'll post another script soon to generate model-averaged support charts for the placement of breakpoints.
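
For anyone who wants to read the _splits file outside HyPhy, here is a minimal Python sketch. It assumes only the layout shown in the listing above: a "first-last" site range on one line, followed by a Newick tree that may wrap over several lines. The function names and the toy three-taxon trees are invented for the example.

```python
import re
from typing import List, Tuple

# A fragment header such as "441-1105".
RANGE_LINE = re.compile(r"^(\d+)-(\d+)$")


def parse_splits(text: str) -> List[Tuple[int, int, str]]:
    """Return (first_site, last_site, newick) for every fragment."""
    fragments: List[Tuple[int, int, str]] = []
    current = None  # [first_site, last_site, list of tree lines]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = RANGE_LINE.match(line)
        if match:
            # A new range line closes the previous fragment, if any.
            if current is not None:
                fragments.append((current[0], current[1], "".join(current[2])))
            current = [int(match.group(1)), int(match.group(2)), []]
        elif current is not None:
            # Tree text; wrapped trees are joined back into one string.
            current[2].append(line)
    if current is not None:
        fragments.append((current[0], current[1], "".join(current[2])))
    return fragments


def breakpoints(fragments: List[Tuple[int, int, str]]) -> List[int]:
    """Last site of every fragment except the final one."""
    return [last for _, last, _ in fragments[:-1]]


# Toy input with made-up trees, only to show the layout.
EXAMPLE = """0-440
(A:0.1,B:0.2,C:0.3)
441-1105
(A:0.1,(B:0.2,
C:0.3):0.05)
1106-2399
(C:0.1,A:0.2,B:0.3)
"""

if __name__ == "__main__":
    for first, last, tree in parse_splits(EXAMPLE):
        print(first, last, tree)
```

Taking the last site of each fragment as the breakpoint follows the convention of the BP columns in the live update (440, 1105, 1883 in the run shown earlier).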


Please let me know if this works.

Sergei

File 1 (MAIN): [link not preserved]
File 2 (supporting): [link not preserved]
 
tlefebure
Reply #3 - Oct 3rd, 2006 at 9:46am
 
It works fine!
I analyzed this recombinant data set [link not preserved] with the following command:
Code:
$MPIDIR/mpirun -np $PROCS -machinefile machines $HYPHY /state/partition1/LocalGARD.bf < answers >& output 


where $PROCS=40 and the predefined answers file contained:

Code:
/state/partition1/test.fasta
010010
2
4
/state/partition1/result 
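
If you run many data sets, the answers file can be generated rather than typed. The Python sketch below only mirrors the five lines shown above (data file, 6-character model designation, rate variation option, number of bins, output path). The helper name is made up, and whether the bins prompt appears for every rate option is an assumption that was not checked, so the bins line is optional.

```python
import sys
from typing import Optional


def gard_answers(data_path: str, model: str, rate_option: int,
                 rate_bins: Optional[int], out_path: str) -> str:
    """Build the text piped to LocalGARD.bf, one answer per line."""
    if len(model) != 6:
        raise ValueError("model designation must have 6 characters, e.g. 010010")
    answers = [data_path, model, str(rate_option)]
    if rate_bins is not None:
        # Only written when a bin count is supplied (assumed to be optional).
        answers.append(str(rate_bins))
    answers.append(out_path)
    return "\n".join(answers) + "\n"


if __name__ == "__main__":
    # Reproduces the answers file used for the test data set.
    sys.stdout.write(gard_answers("/state/partition1/test.fasta", "010010",
                                  2, 4, "/state/partition1/result"))
```

Redirect the output to a file and pass it to mpirun with `<`, as in the command above.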



The computation was done in 818 seconds, and here is the _splits file:


Code:
0-427
((B_US_83_RF_ACC_M17451:0.442157,B_US_90_WEAU160_ACC_U21135:0.0698794):0,(B_FR_83_HXB2_ACC_K03455:0.219894,B_US_86_JRFL_ACC_U63632:0):0.0702914,(((RecombStrain:0.142335,(D_CD_83_ELI_ACC_K03454:0.599337,D_CD_83_NDK_ACC_M27323:0):0.141972):0.143386,D_CD_84_84ZR085_ACC_U88822:0.142707):0,D_UG_94_94UG114_ACC_U88824:0.675537):0.141978)
428-1105
(((D_UG_94_94UG114_ACC_U88824:0.643429,D_CD_84_84ZR085_ACC_U88822:0.388112):0.23253,B_US_83_RF_ACC_M17451:0.186243):0.0398654,B_FR_83_HXB2_ACC_K03455:0.0442666,(((RecombStrain:0.0899883,(D_CD_83_ELI_ACC_K03454:0.134285,D_CD_83_NDK_ACC_M27323:0):0.0434083):0.0920238,B_US_86_JRFL_ACC_U63632:0.042531):0.22826,B_US_90_WEAU160_ACC_U21135:0.274624):0)
1106-1883
((((D_CD_83_ELI_ACC_K03454:0.245332,D_CD_83_NDK_ACC_M27323:0.163252):0.0384636,B_US_86_JRFL_ACC_U63632:0.121158):0.162595,B_US_90_WEAU160_ACC_U21135:0.244594):0,B_FR_83_HXB2_ACC_K03455:0.28691,(((RecombStrain:0.0820236,D_UG_94_94UG114_ACC_U88824:0.600629):0.42888,D_CD_84_84ZR085_ACC_U88822:0.163998):0.168732,B_US_83_RF_ACC_M17451:0.339635):0.0356817)
1884-2399
(((RecombStrain:0.114621,D_CD_83_NDK_ACC_M27323:0.0573596):0,D_CD_83_ELI_ACC_K03454:0.173408):0.0585003,D_UG_94_94UG114_ACC_U88824:0.55106,(D_CD_84_84ZR085_ACC_U88822:0.241296,(B_US_83_RF_ACC_M17451:0.232004,((B_FR_83_HXB2_ACC_K03455:0,B_US_86_JRFL_ACC_U63632:0.173414):0,B_US_90_WEAU160_ACC_U21135:0):0.0578578):0.300282):0.05317)



So a very similar result to yours...
Is that OK, doctor?

Tristan
 
Sergei
Reply #4 - Oct 3rd, 2006 at 10:35am
 
Dear Tristan,

The results look fine; there should be 3 breakpoints (the true breakpoints are at 500, 1100, and 1800).

For smaller data sets like this you actually want fewer than 40 processors, because latency and overhead become too big. Try running it with 16, 32, and 48 and see what happens in terms of time taken and convergence.

Cheers,
Sergei
 
tlefebure
Reply #5 - Oct 3rd, 2006 at 12:54pm
 
Dear Sergei,

The number of processors has an unexpected impact on the time taken:

16 processors: 474 sec, 3 breakpoints: 428-1106-1884
32 processors: 602 sec, 3 breakpoints: 428-1106-1884
40 processors: 818 sec, 3 breakpoints: 428-1106-1884
48 processors: (the cluster is busy at the moment)

Also, did you use rate variation across sites in your simulation? I used it with 4 rate categories, and two infrequent categories show "non-biological" rates:

Code:
Using general discrete distribution of rates across sites
 Rate : 0.024   Weight : 0.977
 Rate : 0.566   Weight : 0.022
 Rate : 525.729 Weight : 0.000
 Rate : 821.162 Weight : 0.001
 



So I'm wondering whether using rate variation across sites might have an impact on GARD results (e.g. on the position of the breakpoints).

Tristan
 
Sergei
Reply #6 - Oct 3rd, 2006 at 3:02pm
 
Dear Tristan,

For small datasets like that adding more processors (after a point) will indeed slow the analysis, because of latency and overhead. Effectively, because each candidate model is fitted very quickly, by the time the master sends out a job to node #20 (hypothetically speaking), node #1 will be done, and will have to sit idle while nodes 21 through 40 are being loaded. For larger files you should not see this.
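
A toy model can make this concrete. The Python sketch below is a back-of-the-envelope illustration, not a description of how HyPhy schedules jobs. It assumes the master needs a fixed `dispatch_seconds` to hand out each job, every model takes `fit_seconds` to fit, and the population is 2(P - 1) models per generation; both timings are hypothetical values, not measurements.

```python
def generation_time(nodes: int, fit_seconds: float,
                    dispatch_seconds: float) -> float:
    """Seconds to evaluate one generation under the toy model."""
    workers = nodes - 1
    population = 2 * workers
    # The master can hand out at most 1/dispatch_seconds jobs per second,
    # and the workers can finish at most workers/fit_seconds jobs per second.
    throughput = min(1.0 / dispatch_seconds, workers / fit_seconds)
    return population / throughput


def worker_utilization(nodes: int, fit_seconds: float,
                       dispatch_seconds: float) -> float:
    """Fraction of time a worker spends fitting instead of waiting."""
    workers = nodes - 1
    return min(1.0, fit_seconds / (dispatch_seconds * workers))


if __name__ == "__main__":
    # Hypothetical timings: quick fits, non-trivial dispatch cost.
    for p in (16, 32, 40, 48):
        print(p, generation_time(p, 1.0, 0.05), worker_utilization(p, 1.0, 0.05))
```

Under these assumptions, once the master becomes the bottleneck the time per generation grows with P, because the population grows with the node count while the dispatch rate stays fixed. If the number of generations needed stays roughly the same, the total run time grows too.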

This particular simulation did not have any rate variation, which is why the inferred rates tend to collapse to a single rate class. In our experience, rate variation has only a slight effect on the positioning of breakpoints, unless of course the rates are strongly correlated (e.g. sites 1-100 are all 'slow' and sites 101-200 are all 'fast'), but this situation is quite infrequent.

Cheers,
Sergei
 
tlefebure
Reply #7 - Oct 4th, 2006 at 7:33am
 
Dear Sergei

Unfortunately, GARD crashes with a "real" data set.

The data set contains 61 taxa and 2100 sites.
HyPhy was run with:
Code:
$MPIDIR/mpirun -np $PROCS -machinefile machines $HYPHY /state/partition1/LocalGARD.bf < $ANSW >& output
 


Where $PROCS=32, and $ANSW is a file containing:

Code:
/state/partition1/pbp1a.phy
012343
2
4
/state/partition1/results
 



Here is the output file:

Code:
OB

Initialized GARD on 32 MPI nodes.
Population size is 62 models

(/state/partition1/) Nucleotide file to screen:
Please enter a 6 character model designation (e.g:010010 defines HKY85):

				+----------------------+
				|Rate variation options|
				+----------------------+


	  (1):Homogeneous rates across sites (fastest)
	  (2):General discrete distribution on N-bins
	  (3):Adaptively discretized gamma N-bins. Try is GDD doesn't converge well, or if you suspect a lot of rate classes and do not want to overparameterize the model

 Please choose an option (or press q to cancel selection):How many distribution bins [2-32]?:
Using 4 distribution bins

(/state/partition1/) Save results to:Error:MPI Node:0
_INTERNAL_REROOT_TREE_.TCTATGATGA is already being used - please rename one of the two variables.
Current BL Command:d=treeString*0
p0_27734:  p4_error: interrupt SIGSEGV: 11 



and the two log files written by the first node:

- messages.log.mpinode0:
Code:
_INTERNAL_REROOT_TREE_.TCTATGATGA is already being used - please rename one of the two variables.
Current BL Command:d=treeString*0
 


- messages.log.mpinode1:
Code:
Node 1 is shutting down 



Doing exactly the same thing with the test data set works perfectly, so I suspect a problem within LocalGARD.bf.

Thanks for any help....
Tristan
 
Sergei
Reply #8 - Oct 4th, 2006 at 8:48am
 
Dear Tristan,

Based on the error message it seems like HyPhy is not reading the data file properly - a part of the sequence, TCTATGATGA, is made into a taxon name. I would check that first (you can open the file in a data viewer using the GUI version) and see if tweaking the format solves the problem.
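
If the GUI is not at hand, a quick scripted check can flag the same symptom. The Python sketch below is a heuristic only: it assumes you already have the list of taxon names as HyPhy (or any other reader) parsed them, and it flags names made up entirely of nucleotide characters. The function name and the length threshold are arbitrary choices for this example.

```python
from typing import Iterable, List

# Characters treated as sequence data for the purpose of this check.
NUCLEOTIDE_CHARS = set("ACGTUN-")


def suspicious_names(names: Iterable[str], min_length: int = 8) -> List[str]:
    """Return the names that look like stretches of sequence."""
    flagged = []
    for name in names:
        letters = set(name.upper())
        if len(name) >= min_length and letters <= NUCLEOTIDE_CHARS:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    parsed = ["RecombStrain", "TCTATGATGA", "B_US_83_RF_ACC_M17451"]
    print(suspicious_names(parsed))
```

A non-empty result suggests the file header or line layout does not match what the parser expects.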

Cheers,
Sergei
 
tlefebure
Reply #9 - Oct 4th, 2006 at 9:06am
 
Shame on me, I used an old version of the data...
Now it's running.
Thanks Sergei, and sorry for such a stupid morning question.
Tristan
 
Sergei
Reply #10 - Oct 25th, 2006 at 9:11am
 
Dear Natasha,

The errors you are seeing occur because the GA population size is defined as 2*(MPI_NODE_COUNT-1), which must be positive.

Edit line 3 of the LocalGARD.bf file (the produceOffspring variable) to be half of the population size you want (e.g. set it to 8).

I haven't tested the code in a non-MPI (single computer) environment, but it should work.

Cheers,
Sergei
 
Natasha
Cape Town
Reply #11 - Oct 31st, 2006 at 5:21am
 
Hi Sergei,

Thanks for your comments on running LocalGARD on a single processor. We tried setting produceOffspring = 0.5, and therefore populationSize = 1, but it did not seem to work. However, we have now got MPI running on the cluster, so we will run LocalGARD.bf with mpirun.
Thanks again,

Natasha
 
Sergei
Reply #12 - Oct 31st, 2006 at 5:49am
 
Dear Natasha,

A population size of 1 cannot work in principle :) You need at least 2 individuals to run a GA and, realistically, a population size of about 10 is probably the absolute minimum you can get away with.
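
To sanity-check a produceOffspring value before editing the batch file, something like the following Python sketch could be used. It assumes, as discussed in this thread, that the population size is twice produceOffspring. The function name and the returned labels are invented for the example, and the threshold of 10 is just the rough figure mentioned above.

```python
def check_population(produce_offspring: float,
                     practical_minimum: int = 10) -> str:
    """Classify the population size implied by a produceOffspring value."""
    population = 2 * produce_offspring
    # A GA needs a whole number of individuals, and at least two of them.
    if population != int(population) or population < 2:
        return "invalid"
    if population < practical_minimum:
        return "too small in practice"
    return "ok"


if __name__ == "__main__":
    for value in (0.5, 2, 8):
        print(value, check_population(value))
```

Here 0.5 (population of 1) is rejected outright, while 8 (population of 16) passes.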

Cheers,
Sergei