Dear Sergei,
I have performed a GARD analysis and have noticed a few idiosyncrasies when processing the results using the latest versions of GARDProcessor.bf and KHTest.bf posted in this thread.
Firstly, running GARDProcessor.bf multiple times on the same dataset and splits file produces (slightly) different results. For example, here are the KH tests for 3 runs with the same input files:
Code:
RUN 1:
----------------------------------------------------------------------
Fitting tree 1 to partition 1
Log Likelihood = -4655.72380305908;
Fitting tree 2 to partition 1
Log Likelihood = -4730.55612145537;
KH Testing partition 1
Tree 2 base LRT = 149.665. p-value = 0.0044
Fitting tree 1 to partition 2
Log Likelihood = -5043.76888590518;
Fitting tree 2 to partition 2
Log Likelihood = -4836.52025035318;
KH Testing partition 2
Tree 1 base LRT = 414.497. p-value = 0.0001
Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p
507 | 0.00010 | 0.00020 | 0.00440 | 0.00880
----------------------------------------------------------------------
RUN 2:
----------------------------------------------------------------------
Fitting tree 1 to partition 1
Log Likelihood = -4655.72380305908;
Fitting tree 2 to partition 1
Log Likelihood = -4730.55612145537;
KH Testing partition 1
Tree 2 base LRT = 149.665. p-value = 0.0038
Fitting tree 1 to partition 2
Log Likelihood = -5043.76888590518;
Fitting tree 2 to partition 2
Log Likelihood = -4836.52025035318;
KH Testing partition 2
Tree 1 base LRT = 414.497. p-value = 0.0001
Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p
507 | 0.00010 | 0.00020 | 0.00380 | 0.00760
----------------------------------------------------------------------
RUN 3:
----------------------------------------------------------------------
Fitting tree 1 to partition 1
Log Likelihood = -4655.31331109413;
Fitting tree 2 to partition 1
Log Likelihood = -4730.60159498021;
KH Testing partition 1
Tree 2 base LRT = 150.577. p-value = 0.0042
Fitting tree 1 to partition 2
Log Likelihood = -5043.72987443926;
Fitting tree 2 to partition 2
Log Likelihood = -4836.4869482817;
KH Testing partition 2
Tree 1 base LRT = 414.486. p-value = 0.0001
Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p
507 | 0.00010 | 0.00020 | 0.00420 | 0.00840
------------------------------------------------------------------------
You'll notice that the likelihoods and p-values differ slightly between runs. Could this just be due to different starting values in the optimisation? (I'm actually just curious - the differences are trivial!)
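For my own curiosity I tried to reproduce the p-value jitter outside HyPhy. If KHTest.bf builds its null distribution by RELL-style bootstrap resampling of per-site log-likelihoods (the standard KH procedure - I'm assuming this, I haven't read the batch file closely), then the p-value is itself a Monte Carlo estimate and will wobble between runs unless the random seed is fixed. Here is a minimal Python sketch; kh_pvalue and the simulated data are my own stand-ins, not anything from KHTest.bf:
Code:
import numpy as np

def kh_pvalue(lnl_a, lnl_b, n_boot=10000, seed=None):
    """KH-style p-value via RELL bootstrap (illustrative stand-in only)."""
    # Per-site log-likelihood differences between the two trees
    delta = np.asarray(lnl_a) - np.asarray(lnl_b)
    observed = delta.sum()
    # Centre the differences so the null (no difference) holds on average
    centred = delta - delta.mean()
    rng = np.random.default_rng(seed)
    # RELL: resample sites with replacement and re-sum the differences
    idx = rng.integers(0, delta.size, size=(n_boot, delta.size))
    null = centred[idx].sum(axis=1)
    # Two-sided p-value: fraction of resampled sums at least as extreme
    return float((np.abs(null) >= abs(observed)).mean())

# Simulated per-site log-likelihoods for two trees (made-up data)
data_rng = np.random.default_rng(0)
lnl_tree1 = data_rng.normal(-9.0, 1.0, size=507)
lnl_tree2 = lnl_tree1 - data_rng.normal(0.05, 0.5, size=507)

# Same inputs, different bootstrap seeds -> slightly different p-values
print(kh_pvalue(lnl_tree1, lnl_tree2, seed=1))
print(kh_pvalue(lnl_tree1, lnl_tree2, seed=2))
The two calls return slightly different p-values purely because of the resampling, which looks a lot like the run-to-run jitter above.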
Of more concern to me is that I get very different results when I run the processor file locally versus on my cluster, which runs HYPHYMP_DEV SVN415 (so probably a different version of HyPhy). I obtained the above results on my machine, but this is what I get from two runs on the cluster:
Code:
RUN 1:
---------------------------------------------------------------------
Fitting tree 1 to partition 1
Log Likelihood = -4655.74764230272;
Fitting tree 2 to partition 1
Log Likelihood = -4730.57601129128;
KH Testing partition 1
Tree 2 base LRT = 149.657. p-value = 0.3469
Fitting tree 2 to partition 2
Log Likelihood = -4836.49821145768;
KH Testing partition 2
Tree 1 base LRT = 414.467. p-value = 0.0001
Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p
507 | 0.00010 | 0.00020 | 0.34690 | 0.69380
---------------------------------------------------------------------
RUN 2:
---------------------------------------------------------------------
Fitting tree 1 to partition 1
Log Likelihood = -4655.74764230272;
Fitting tree 2 to partition 1
Log Likelihood = -4730.57601129128;
KH Testing partition 1
Tree 2 base LRT = 149.657. p-value = 0.354
Fitting tree 1 to partition 2
Log Likelihood = -5043.73191448845;
Fitting tree 2 to partition 2
Log Likelihood = -4836.49821145768;
KH Testing partition 2
Tree 1 base LRT = 414.467. p-value = 0.0001
Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p
507 | 0.00010 | 0.00020 | 0.35400 | 0.70800
---------------------------------------------------------------------
Again, I have copied the GARDProcessor.bf and KHTest.bf files from this thread into TemplateBatchFiles on the cluster. As you can see, the breakpoint is no longer significant on the cluster: the RHS adjusted p-value jumps from ~0.008 locally to ~0.7.
Could you please advise which of the above results is correct? I have attached the relevant files.
Thanks a lot as always!
Miguel
PS: While I'm here... I have one more question.
I have a dataset with ~500 sequences and 1100 nucleotides - i.e. too many sequences relative to the number of sites for GARD to handle. So I have divided the dataset into random samples of 50 sequences and am running GARD on each of these smaller datasets (sketched below). Is that what you would advise?
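In case it helps to see exactly what I mean, here is roughly how I generate the subsamples (a sketch of my approach - the parse_fasta helper, the file names, and the non-overlapping split are my own choices, not anything GARD prescribes):
Code:
import random

def parse_fasta(path):
    """Minimal FASTA reader: returns a list of (header, sequence) pairs."""
    records, header, seq = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

records = parse_fasta("alignment.fas")   # ~500 aligned sequences
random.seed(42)                          # reproducible subsamples
random.shuffle(records)

# Write non-overlapping random samples of 50 sequences each
n_full = len(records) - len(records) % 50
for i in range(0, n_full, 50):
    with open(f"subsample_{i // 50 + 1}.fas", "w") as out:
        for header, seq in records[i:i + 50]:
            out.write(f">{header}\n{seq}\n")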
Thanks again!