HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
Methodology Questions >> How to >> Genetic Algorithm with HyPhy
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1158779001

Message started by tlefebure on Sep 20th, 2006 at 12:03pm

Title: Genetic Algorithm with HyPhy
Post by tlefebure on Sep 20th, 2006 at 12:03pm
Hello HyPhy community,

I will run 2 HyPhy GA analyses (BranchSelector and GARD) on a cluster, because I cannot run them with datamonkey (too many taxa), and I have some questions:

  • Where is the GARD batch file? GARecomb.bf does not implement the complete breakpoint search, but instead fixes the number of breakpoints to a predefined value, right?

  • Is there any rule to find the number of CPUs and CHC population size to use? In the GARD paper, you wrote: "All sequence analyses and model fitting were performed using the HyPhy (Kosakovsky Pond et al. 2005) software on a P-node message passing interface cluster. P - 1 slave nodes were used to fit various models, and a single master node dispatched the jobs and assembled the results. The size of CHC population was set to 2P - 2 individuals. We set P = 17 for the analyses in this article." Should we first define the CHC population size according to the data set analyzed, and then apply the P = (pop_size +2)/2 rule?


Thanks
Tristan

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Sep 20th, 2006 at 8:33pm
Dear Tristan,

The GARD batch file as implemented in datamonkey.org/gard is not (yet) a part of the HyPhy distribution. Let me make some small changes to it (so it is more interactive), and post a version here.

The population size in GARD is wholly determined by the number of nodes (P) you run it on (pop size = 2P-2). Anything with 16 CPUs or over should be fine, and it doesn't really depend on the size of the problem; larger problems might take more generations to converge, that's about it.
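
The sizing rule above (population size = 2P - 2 for P MPI nodes) is easy to check by hand. A minimal Python sketch follows; the function names are illustrative and not part of HyPhy.

```python
def gard_population_size(nodes: int) -> int:
    """CHC population size for a run on `nodes` MPI nodes (pop size = 2P - 2)."""
    if nodes < 2:
        raise ValueError("need one master node and at least one worker node")
    return 2 * nodes - 2


def nodes_for_population(pop_size: int) -> int:
    """Invert the rule: P = (pop size + 2) / 2."""
    if pop_size < 2 or pop_size % 2:
        raise ValueError("population size must be a positive even number")
    return (pop_size + 2) // 2


for p in (16, 17, 32):
    print(p, "nodes ->", gard_population_size(p), "models")
```

With P = 17, as in the GARD paper, this gives 32 individuals.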

Cheers,
Sergei

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Sep 27th, 2006 at 8:54pm
Dear Tristan,

To run GARD locally you need to download two files (linked below) into the same directory and then call:


[code]mpirun -np xx ./HYPHYMPI pathto/LocalGARD.bf[/code]

Follow the prompts and wait.

The analysis prints a lot of informational messages and gives live updates of the form:
[code]
GENERATION 68 with 3 breakpoints (~82% converged)
Breakpoints    c-AIC  Delta c-AIC [BP      1] [BP      2] [BP      3]
         0  9395.61
         1  9230.63      164.979       1105
         2  9147.48       83.151       1105        1883
         3  9119.19       28.289        440        1105        1883
GA has considered        1118/     1179616 (2221 over all runs) unique models
Total run time           0 hrs 2 mins 23 seconds
Throughput               15.53 models/second
Allocated time remaining 999 hrs 57 mins 37 seconds (approx. 5.59111e+07 more models.)
[/code]
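
The figures in this status block appear to be related by simple arithmetic: each Delta c-AIC looks like the drop from the previous row, and the throughput matches the all-runs model count divided by the elapsed seconds. The Python sketch below re-derives them from the numbers shown. This reading is inferred from the values themselves, not from the GARD source.

```python
# c-AIC values copied from the status block, indexed by number of breakpoints.
c_aic = [9395.61, 9230.63, 9147.48, 9119.19]

# Improvement gained by adding each successive breakpoint.
deltas = [prev - cur for prev, cur in zip(c_aic, c_aic[1:])]

elapsed_s = 2 * 60 + 23            # "0 hrs 2 mins 23 seconds"
models_all_runs = 2221             # "(2221 over all runs)"
throughput = models_all_runs / elapsed_s

remaining_s = 999 * 3600 + 57 * 60 + 37   # "Allocated time remaining"
models_left = throughput * remaining_s

print([round(d, 2) for d in deltas])
print(round(throughput, 2), "models/second")
print(f"{models_left:.5e} more models")
```

The small differences from the printed Delta c-AIC column are consistent with the c-AIC values being displayed to two decimals.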


At the end of the run you will have 4 files:

  • HTML summary (written to the path (call it OUTPATH) you specified when prompted)
  • OUTPATH_splits. A file which contains inferred non-recombinant fragments and their respective trees and looks like this:
    [code]
    0-440
    ((B_US_83_RF_ACC_M17451:0.0139054,B_US_90_WEAU160_ACC_U21135:0.00229248):0,(B_FR_83_HXB2_ACC_K03455:0.00691885,B_US_86_JRFL_ACC_U63632:0):0.00229712,(((RecombStrain:0.00460448,(D_CD_83_ELI_ACC_K03454:0.0185925,D_CD_83_NDK_ACC_M27323:0):0.00460677):0.00461784,D_CD_84_84ZR085_ACC_U88822:0.00460458):0,D_UG_94_94UG114_ACC_U88824:0.023297):0.00459755)
    441-1105
    ((((D_UG_94_94UG114_ACC_U88824:0.0183339,D_CD_84_84ZR085_ACC_U88822:0.0122237):0.00753758,B_US_83_RF_ACC_M17451:0.00604776):0.00146116,B_FR_83_HXB2_ACC_K03455:0.00149575):0,B_US_90_WEAU160_ACC_U21135:0.00905084,((RecombStrain:0.00299745,(D_CD_83_ELI_ACC_K03454:0.00449303,D_CD_83_NDK_ACC_M27323:0):0.00148771):0.00301784,B_US_86_JRFL_ACC_U63632:0.00147536):0.00752275)
    1106-1883
    ((((D_CD_83_ELI_ACC_K03454:0.00775632,D_CD_83_NDK_ACC_M27323:0.00517493):0.00126575,B_US_86_JRFL_ACC_U63632:0.00387136):0.00516652,B_US_90_WEAU160_ACC_U21135:0.00775083):0,B_FR_83_HXB2_ACC_K03455:0.00905566,(((RecombStrain:0.0025888,D_UG_94_94UG114_ACC_U88824:0.0182976):0.0130763,D_CD_84_84ZR085_ACC_U88822:0.00517239):0.00523374,B_US_83_RF_ACC_M17451:0.0104269):0.0012376)
    1884-2399
    (((RecombStrain:0.00389106,D_CD_83_NDK_ACC_M27323:0.00194546):0,D_CD_83_ELI_ACC_K03454:0.00584616):0.00196102,D_UG_94_94UG114_ACC_U88824:0.017756,(D_CD_84_84ZR085_ACC_U88822:0.00785642,(B_US_83_RF_ACC_M17451:0.0078008,((B_FR_83_HXB2_ACC_K03455:0,B_US_86_JRFL_ACC_U63632:0.00585428):0,B_US_90_WEAU160_ACC_U21135:0):0.0019538):0.00982158):0.00190309)
    [/code]
    You can feed this file to [link not preserved] to do some postprocessing, such as tree split identity and pairwise SH tests.
  • OUTPATH_finalout. A NEXUS file with the original data, splits and trees, which you can feed to other packages that understand NEXUS.
  • OUTPATH_ga_details. The raw output from the GA run; I'll post another script soon to generate model-averaged support charts for the placement of breakpoints.
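
Judging from the sample shown, the _splits file alternates a "start-end" line with a Newick tree line. A minimal Python sketch for reading it is given here, assuming every tree sits on a single line; the function names and the toy trees are illustrative only.

```python
from typing import List, Tuple


def parse_splits(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, newick) for each fragment in a _splits file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2:
        raise ValueError("expected alternating range and tree lines")
    fragments = []
    for span, tree in zip(lines[0::2], lines[1::2]):
        start, end = (int(x) for x in span.split("-"))
        fragments.append((start, end, tree))
    return fragments


def breakpoints(fragments: List[Tuple[int, int, str]]) -> List[int]:
    """Last site of every fragment except the final one."""
    return [end for _, end, _ in fragments[:-1]]


# Toy input with made-up three-taxon trees.
example = """0-440
(A:0.1,B:0.2,C:0.3)
441-1105
(A:0.1,C:0.2,B:0.3)
"""
print(breakpoints(parse_splits(example)))
```

A tree that has been wrapped across two lines (for example after copy and paste) will trip the alternation check, so rejoin it first.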


Please let me know if this works.

Sergei

File 1 (MAIN): [link not preserved]
File 2 (supporting): [link not preserved]

Title: Re: Genetic Algorithm with HyPhy
Post by tlefebure on Oct 3rd, 2006 at 9:46am
It works fine!  ;D
I analyzed this recombinant data set [link not preserved] with the following command:

[code]$MPIDIR/mpirun -np $PROCS -machinefile machines $HYPHY /state/partition1/LocalGARD.bf < answers >& output[/code]

where $PROCS=40 and the predefined answers file contained:

[code]/state/partition1/test.fasta
010010
2
4
/state/partition1/result
[/code]
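
Each line of the answers file is consumed by one LocalGARD.bf prompt. Going by the prompts echoed in the run log later in this thread, the order is: alignment file, 6-character model designation, rate variation option, number of distribution bins, output path. A minimal Python sketch for producing such a file is shown here; the helper name and its arguments are illustrative, and it assumes a rate variation option that asks for a bin count.

```python
def gard_answers(alignment, model, rate_option, bins, out_path):
    """Return the answers text, one answer per line, in prompt order."""
    if len(model) != 6:
        raise ValueError("model designation must be 6 characters, e.g. 010010")
    answers = [alignment, model, str(rate_option), str(bins), out_path]
    return "\n".join(answers) + "\n"


text = gard_answers(
    alignment="/state/partition1/test.fasta",
    model="010010",      # HKY85, per the prompt text
    rate_option=2,       # general discrete distribution
    bins=4,
    out_path="/state/partition1/result",
)
print(text, end="")
```

Redirect the printed text to a file and pass it to the batch file on standard input, as in the mpirun command above.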


The computation was done in 818 seconds, and here is the _splits file:


[code]
0-427
((B_US_83_RF_ACC_M17451:0.442157,B_US_90_WEAU160_ACC_U21135:0.0698794):0,(B_FR_83_HXB2_ACC_K03455:0.219894,B_US_86_JRFL_ACC_U63632:0):0.0702914,(((RecombStrain:0.142335,(D_CD_83_ELI_ACC_K03454:0.599337,D_CD_83_NDK_ACC_M27323:0):0.141972):0.143386,D_CD_84_84ZR085_ACC_U88822:0.142707):0,D_UG_94_94UG114_ACC_U88824:0.675537):0.141978)
428-1105
(((D_UG_94_94UG114_ACC_U88824:0.643429,D_CD_84_84ZR085_ACC_U88822:0.388112):0.23253,B_US_83_RF_ACC_M17451:0.186243):0.0398654,B_FR_83_HXB2_ACC_K03455:0.0442666,(((RecombStrain:0.0899883,(D_CD_83_ELI_ACC_K03454:0.134285,D_CD_83_NDK_ACC_M27323:0):0.0434083):0.0920238,B_US_86_JRFL_ACC_U63632:0.042531):0.22826,B_US_90_WEAU160_ACC_U21135:0.274624):0)
1106-1883
((((D_CD_83_ELI_ACC_K03454:0.245332,D_CD_83_NDK_ACC_M27323:0.163252):0.0384636,B_US_86_JRFL_ACC_U63632:0.121158):0.162595,B_US_90_WEAU160_ACC_U21135:0.244594):0,B_FR_83_HXB2_ACC_K03455:0.28691,(((RecombStrain:0.0820236,D_UG_94_94UG114_ACC_U88824:0.600629):0.42888,D_CD_84_84ZR085_ACC_U88822:0.163998):0.168732,B_US_83_RF_ACC_M17451:0.339635):0.0356817)
1884-2399
(((RecombStrain:0.114621,D_CD_83_NDK_ACC_M27323:0.0573596):0,D_CD_83_ELI_ACC_K03454:0.173408):0.0585003,D_UG_94_94UG114_ACC_U88824:0.55106,(D_CD_84_84ZR085_ACC_U88822:0.241296,(B_US_83_RF_ACC_M17451:0.232004,((B_FR_83_HXB2_ACC_K03455:0,B_US_86_JRFL_ACC_U63632:0.173414):0,B_US_90_WEAU160_ACC_U21135:0):0.0578578):0.300282):0.05317)
[/code]

So a very similar result to yours...
Is that OK, doctor?

Tristan

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Oct 3rd, 2006 at 10:35am
Dear Tristan,

The results look fine; there should be 3 breakpoints (the true breakpoints are at 500, 1100 and 1800).

For smaller data sets like this you actually want fewer than 40 processors, because latency and overhead become too big. Try running it with 16, 32 and 48 and see what happens in terms of time taken and convergence.

Cheers,
Sergei

Title: Re: Genetic Algorithm with HyPhy
Post by tlefebure on Oct 3rd, 2006 at 12:54pm
Dear Sergei,

The number of processors has an unexpected impact on the time taken:

16 processors: 474 sec, 3 breakpoints: 428-1106-1884
32 processors: 602 sec, 3 breakpoints: 428-1106-1884
40 processors: 818 sec, 3 breakpoints: 428-1106-1884
48 processors: (the cluster is busy at the moment)

Also, did you use any rate variation across sites in your simulation? I used some, with 4 categories of sites, and the two infrequent ones show "non-biological" rates:

[code]Using general discrete distribution of rates across sites
Rate : 0.024   Weight : 0.977
Rate : 0.566   Weight : 0.022
Rate : 525.729 Weight : 0.000
Rate : 821.162 Weight : 0.001
[/code]

So I'm wondering whether using rate variation across sites might have an impact on GARD results (e.g. on the position of the breakpoints).

Tristan

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Oct 3rd, 2006 at 3:02pm
Dear Tristan,

For small datasets like that adding more processors (after a point) will indeed slow the analysis, because of latency and overhead. Effectively, because each candidate model is fitted very quickly, by the time the master sends out a job to node #20 (hypothetically speaking), node #1 will be done, and will have to sit idle while nodes 21 through 40 are being loaded. For larger files you should not see this.
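
The idle-worker effect described above can be illustrated with a toy steady-state model in Python. This is only a sketch under stated assumptions: the master hands out jobs one at a time, and both timings are hypothetical rather than measured on any cluster.

```python
def steady_state_throughput(workers: int, fit_time: float, dispatch_time: float) -> float:
    """Upper bound on models fitted per second.

    The workers can finish at most workers / fit_time models per second,
    and a master that dispatches jobs one at a time can start at most
    1 / dispatch_time jobs per second.
    """
    return min(workers / fit_time, 1.0 / dispatch_time)


def idle_fraction(workers: int, fit_time: float, dispatch_time: float) -> float:
    """Share of total worker time spent waiting for the master."""
    busy = steady_state_throughput(workers, fit_time, dispatch_time) * fit_time
    return 1.0 - busy / workers


# Hypothetical timings: 1.0 s to fit a model, 0.05 s to dispatch one job.
for nodes in (16, 32, 40, 48):
    workers = nodes - 1  # one node acts as the master
    print(nodes, "nodes: idle fraction", round(idle_fraction(workers, 1.0, 0.05), 2))
```

With these made-up timings the master saturates at about 20 workers, so every worker added beyond that only adds idle time. The model covers idle time only; it does not try to predict total run time.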

This particular simulation did not have any rate variation, which is why you see the rates tend to collapse to a single rate class. In our experience, rate variation has only a slight effect on the positioning of breakpoints, unless of course the rates are strongly correlated along the sequence (e.g. sites 1-100 are all 'slow' and sites 101-200 are all 'fast'), but this situation is quite infrequent.

Cheers,
Sergei

Title: Re: Genetic Algorithm with HyPhy
Post by tlefebure on Oct 4th, 2006 at 7:33am
Dear Sergei

Unfortunately GARD crashes with a "real" data set.

The data set contains 61 taxa and 2100 sites.
HyPhy was run with:

[code]
$MPIDIR/mpirun -np $PROCS -machinefile machines $HYPHY /state/partition1/LocalGARD.bf < $ANSW >& output
[/code]


Where $PROCS=32, and $ANSW is a file containing:

[code]/state/partition1/pbp1a.phy
012343
2
4
/state/partition1/results
[/code]


Here is the output file:


[code]
Initialized GARD on 32 MPI nodes.
Population size is 62 models

(/state/partition1/) Nucleotide file to screen:
Please enter a 6 character model designation (e.g:010010 defines HKY85):

                       +----------------------+
                       |Rate variation options|
                       +----------------------+


       (1):Homogeneous rates across sites (fastest)
       (2):General discrete distribution on N-bins
       (3):Adaptively discretized gamma N-bins. Try is GDD doesn't converge well, or if you suspect a lot of rate classes and do not want to overparameterize the model

Please choose an option (or press q to cancel selection):How many distribution bins [2-32):

?:
Using 4 distribution bins

(/state/partition1/) Save results to:Error:MPI Node:0
_INTERNAL_REROOT_TREE_.TCTATGATGA is already being used - please rename one of the two variables.
Current BL Command:d=treeString*0
p0_27734:  p4_error: interrupt SIGSEGV: 11
[/code]


and the two log files written by the first node:

- messages.log.mpinode0:

[code]_INTERNAL_REROOT_TREE_.TCTATGATGA is already being used - please rename one of the two variables.
Current BL Command:d=treeString*0
[/code]


- messages.log.mpinode1:
[code]
Node 1 is shutting down
[/code]


Doing exactly the same thing with the test data set works perfectly, so I suspect a problem within LocalGARD.bf.

Thanks for any help....
Tristan

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Oct 4th, 2006 at 8:48am
Dear Tristan,

Based on the error message it seems like HyPhy is not reading the data file properly - a part of the sequence, TCTATGATGA, is made into a taxon name. I would check that first (you can open the file in a data viewer using the GUI version) and see if tweaking the format solves the problem.
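
One quick way to spot this kind of misread is to look for taxon names that consist only of nucleotide characters. A minimal Python sketch follows, assuming you already have the list of names as HyPhy parsed them; the helper name and the length threshold are arbitrary choices for illustration.

```python
import re

# IUPAC nucleotide codes plus gap and missing-data characters.
_NUCLEOTIDE_RUN = re.compile(r"[ACGTURYKMSWBDHVN\-?]+", re.IGNORECASE)


def looks_like_sequence(name: str, min_length: int = 8) -> bool:
    """True if `name` is long enough and made only of nucleotide characters."""
    return len(name) >= min_length and _NUCLEOTIDE_RUN.fullmatch(name) is not None


names = ["B_FR_83_HXB2_ACC_K03455", "RecombStrain", "TCTATGATGA"]
suspects = [n for n in names if looks_like_sequence(n)]
print(suspects)
```

Any name flagged this way is worth checking against the raw file, since it may be a stretch of sequence that ended up where a taxon label was expected.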

Cheers,
Sergei

Title: Re: Genetic Algorithm with HyPhy
Post by tlefebure on Oct 4th, 2006 at 9:06am
Shame on me, I used an old version of the data...
Now it's running.
Thanks Sergei, and sorry for such a stupid morning question.
Tristan

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Oct 25th, 2006 at 9:11am
Dear Natasha,

The errors you are seeing are because the GA population size is defined as 2*(MPI_NODE_COUNT-1), and it must be positive.

Edit line 3 of the LocalGARD.bf file (the produceOffspring variable) to be half of the population size you want (e.g. set it to 8).

I haven't tested the code in a non-MPI (single computer) environment, but it should work.

Cheers,
Sergei

Title: Re: Genetic Algorithm with HyPhy
Post by Natasha on Oct 31st, 2006 at 5:21am
Hi Sergei,

Thanks for your comments on running localGARD on a single processor.  We tried setting produceOffspring = 0.5, and therefore the populationSize = 1, but it did not seem to work.  However, we have now got MPI running on the cluster  :D , so will run localGARD.bf with mpirun.
Thanks again,

Natasha

Title: Re: Genetic Algorithm with HyPhy
Post by Sergei on Oct 31st, 2006 at 5:49am
Dear Natasha,

A population size of 1 cannot work in principle :) You need at least 2 individuals to run a GA and, realistically, a population size of about 10 is probably the absolute minimum you can get away with.
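
As a sanity check before editing produceOffspring, the relationship described in this thread (produceOffspring is half the population size) can be sketched in Python. The function names and the status labels are illustrative, and the threshold of 10 is just the rough figure quoted above.

```python
def population_from_offspring(produce_offspring: float) -> float:
    """Population size implied by produceOffspring (half the population)."""
    return 2 * produce_offspring


def check_population(pop_size: float, practical_min: int = 10) -> str:
    """Classify a population size as invalid, too small, or ok."""
    if pop_size < 2:
        return "invalid"      # a GA needs at least 2 individuals
    if pop_size < practical_min:
        return "too small"    # below the rough practical minimum
    return "ok"


for offspring in (0, 0.5, 1, 8):
    pop = population_from_offspring(offspring)
    print(offspring, "->", pop, check_population(pop))
```

The first case corresponds to a single node, where 2*(MPI_NODE_COUNT-1) evaluates to 0.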

Cheers,
Sergei
