Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Pages: 1 2 
comparing star trees  with non-star trees (Read 11790 times)
bruno
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 13
Re: comparing star trees  with non-star trees
Reply #15 - Apr 12th, 2007 at 8:34am
 
Dear Sergei,

A couple of questions:

-The number of d.f. for the HKY+G+I is 3. Is this because the nucleotide frequencies do not count when the empirical values are used?

-For this analysis the Jackknife part should be done with or without replacement?

-I used the ConstructCategoryMatrix function to get a matrix of likelihoods for each site on the alignment. I got a matrix with the correct dimensions (1 x n.sites) but the sum of the likelihoods of each site gives me a value that does not agree with the result of optimize. Below the code

     LikelihoodFunction small_ES = (filteredData, es_tree);
     Optimize (res_ES, small_ES);
       BestMatrix = {};
     ConstructCategoryMatrix (BestMatrix, small_ES, COMPLETE);

       numberColumns = Columns(BestMatrix);
       log_counter=0;
       for (hj=0;hj<numberColumns;hj=hj+1){
              log_counter = log_counter+BestMatrix[0][hj];
              }

But then log_counter = 177.826
while res_ES[1][0] = -1683.34
(for a random replicate)

How can i get the likelihood of the alignment from the sum of the likelihood of the sites?

Thanks,
Cheers,
Bruno

Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: comparing star trees  with non-star trees
Reply #16 - Apr 12th, 2007 at 3:05pm
 
Dear Bruno,

Quote:
-The number of d.f. for the HKY+G+I is 3. Is this because the nucleotide frequencies do not count when the empirical values are used?



They still count for statistical purposes, but HyPhy does not know about it because the frequencies are fixed before the likelihood function is constructed and do not show up as adjustable parameters. Add 3 to all parameter counts

Quote:
-For this analysis the Jackknife part should be done with or without replacement?



With

Quote:
-I used the ConstructCategoryMatrix function to get a matrix of likelihoods for each site on the alignment. I got a matrix with the correct dimensions (1 x n.sites) but the sum of the likelihoods of each site gives me a value that does not agree with the result of optimize. Below the code

How can i get the likelihood of the alignment from the sum of the likelihood of the sites?



You need to marginalize over the values of category (Gamma+I) variable.
Use GetInformation(catInfo,c) to populate the vector of rate values and probabilities and then, for each column of the matrix returned by ConstructCategoryMatrix, do something like

Code:
ConstructCategoryMatrix (catMat, lf, COMPLETE);

siteCount = Columns(catMat);
rateCount = Rows(catMat);

bySiteLogL = {siteCount,1};

for (siteID =0; siteID < siteCount; siteID = siteID+1)
{
     siteLik = 0;
     for (categID = 0; categID < rateCount; categID= categID+1)
     {
	   siteLik = siteLik + catMat[categID][siteID]*catInfo[1][categID];
     }
     bySiteLogL[siteID] = Log(siteLik);
}

  



Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
bruno
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 13
Re: comparing star trees  with non-star trees
Reply #17 - Apr 13th, 2007 at 6:34am
 
Dear Sergei,

There is still something not working properly. I followed your instructions and now it calculates the correct value of LogLik for the best tree, but it returns a different value for the star tree than the one obtained with optimize.
The part of the code related to this:

--------------------------------------------
     
/* make a star tree and optimize its likelihood */

     LikelihoodFunction smallLF = (filteredData, star);
     Optimize (resStar, smallLF);
     fprintf (stdout, "LogLik star tree: ", resStar[1][0], "\n");
 
  /*  Optimize the likfunc with the best tree found through exaustive search */
     
     LikelihoodFunction small_ES = (filteredData, es_tree);
     Optimize (res_ES, small_ES);
     fprintf (stdout, "LogLik best tree: ", res_ES[1][0], "\n");
           
     /* Take into account the diferent rate heterogeneity classes */
     
     GetInformation(catInfo,c);
     
     /* For best tree        */

     ConstructCategoryMatrix (catMatBest, small_ES, COMPLETE);
     siteCount = Columns(catMatBest);
     rateCount = Rows(catMatBest);
     bySiteLogLBest = {siteCount,1};
     for (siteID =0; siteID < siteCount; siteID = siteID+1)
           {
           siteLik = 0;
           for (categID = 0; categID < rateCount; categID= categID+1)
                 {
                 siteLik = siteLik + catMatBest[categID][siteID]*catInfo[1][categID];
                 }
           bySiteLogLBest[siteID] = Log(siteLik);

           }

     /*  Vector bySiteLogLBest [siteID] now contains the LogLik of each site for the best tree */
     
     /* For Star tree */
     
     ConstructCategoryMatrix (catMatStar, smallLF, COMPLETE);
     siteCount = Columns(catMatStar);
     rateCount = Rows(catMatStar);
     bySiteLogLStar = {siteCount,1};
     for (siteID =0; siteID < siteCount; siteID = siteID+1)
           {
           siteLik = 0;
           for (categID = 0; categID < rateCount; categID= categID+1)
                 {
                 siteLik = siteLik + catMatStar[categID][siteID]*catInfo[1][categID];
                 }
           bySiteLogLStar[siteID] = Log(siteLik);
           }

     /*  Vector bySiteLogLStar [siteID] now contains the LogLik of each site for the star tree*/

     rows_vector = Rows(catInfo);
     columns_vector = Columns (catInfo);

     /* This just shows the catInfo matrix */
     
fprintf (stdout, "Vector of cat Info has ", rows_vector, " rows and ", columns_vector, " columns\n");
for (ghj=0;ghj<rows_vector;ghj=ghj+1){
     fprintf (stdout, "\n");
     for (jhg=0;jhg<columns_vector;jhg=jhg+1){
           fprintf (stdout, " ", catInfo[ghj][jhg]);
           }
     }

     
     /* This calculates the LogLik of the 2 trees by summing over the logLik of each site */
     
     log_counterBest=0;
     log_counterStar = 0;
     for (siteID =0; siteID < siteCount; siteID = siteID+1) {
           log_counterBest= log_counterBest + bySiteLogLBest[siteID];
           log_counterStar= log_counterStar + bySiteLogLStar [siteID];
           }
fprintf (stdout, "\nLogLik of Best tree calculated by the sum of likelihoods = ", log_counterBest, " calculated by hyphy = ", res_ES[1][0], " \n");
fprintf (stdout, "LogLik of Star tree calculated by sum of likelihoods = ", log_counterStar, " calculated by hyphy = ", resStar[1][0], " \n");

--------------------------- end of code -------------------------------
This part would print (for a random replicate)


LogLik star tree: -1602.19
LogLik best tree: -1595.8
Vector of cat Info has 2 rows and 5 columns

1e-150 0.259531 1.17978 2.87653 7.89522
0.672428 0.081893 0.081893 0.081893 0.081893
LogLik of Best tree calculated by the sum of likelihoods = -1595.8 calculated by hyphy = -1595.8
LogLik of Star tree calculated by sum of likelihoods = -1630.57 calculated by hyphy = -1602.19
---------------------- end of output -------------------

Strange? I have no clue whats going on here, the same code works for the best tree but not for the star tree. I also removed the part of the best tree (it could be some variable/ counter that was not properly reseted) but the same problem still happens. Any ideas?

Cheers,
Bruno
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: comparing star trees  with non-star trees
Reply #18 - Apr 13th, 2007 at 7:39am
 
Dear bruno,

The second optimization may re-adjust global parameters (alpha, kappa etc). Move
GetInformation(catInfo,c); (and ConstructCategory...) for the star phylogeny to be called BEFORE the second Optimize.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
bruno
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 13
Re: comparing star trees  with non-star trees
Reply #19 - Apr 13th, 2007 at 9:33am
 
Dear Sergei,

Great, it did the trick, now both the optimize and the sum of site likelihoods returns the same value.
Now i think it is more or less finished, but i have what i hope to be the last questions on this analysis (and thanks again for all your help).
below is the part of the code implementing the Jackknife protocol:

-----------------------------------------------------------------------
     pv_significant = 0;
     AIC_significant = 0;
     AICcc_significant = 0;
     AICcl_significant = 0;
     
     /* Starts looping over JackKnife replicates */
     
     for (JackIterate = 0; JackIterate < Njackknife; JackIterate = JackIterate + 1) {
           fprintf (stdout, "Calculating JackKnife replicate ", JackIterate, "\n");
           Loglik_best = 0;
           Loglik_worst = 0;
           
           /* Calculates the likelihood by resampling (with replacement) a specified number of sites
                 the resulting likelihood is assumed to be the sum of individual sites' likelihood */
           
           for (JackSite = 0; JackSite < JackLength; JackSite =  JackSite+1){
                 aNumber = Random (0, filteredData.sites-1);
                 anIntegerRandomNumber = aNumber$1;

                 Loglik_best = Loglik_best + bySiteLogLBest[anIntegerRandomNumber];
                 Loglik_worst = Loglik_worst + bySiteLogLStar[anIntegerRandomNumber];
                 }
           
           fprintf (stdout, "\n\n Loglikelihood of star ", Loglik_worst, " Loglik of best ", Loglik_best, "\n");
           
           /* Now the LRT and AIC tests */
           
           LRTtest = calculateLRT (Loglik_best, Loglik_worst, DFdifference);
           
           fprintf (stdout, "LRT significance: ", LRTtest, "\n");
           
           if (LRTtest < 0.05){
                 pv_significant = pv_significant+1;
                 }
           AICtest_star = calculateAIC (Loglik_worst, DF_star);
           AICtest_es = calculateAIC (Loglik_best, DF_es);
           
           fprintf (stdout, "AICstar and AIC best tree: ", AICtest_star, " ", AICtest_es, "\n");
           
           if (AICtest_star > AICtest_es){
                 AIC_significant = AIC_significant+1;
                 }
           AICcctest_star = calculateAICc (Loglik_worst, DF_star, JackLength);
           AICcctest_es = calculateAICc (Loglik_best, DF_es, JackLength);
           
     fprintf (stdout, "AICcc star and AICcc best tree: ", AICcctest_star, " ", AICcctest_es, "\n");

           
           if (AICcctest_star > AICcctest_es){
                 AICcc_significant = AICcc_significant+1;
                 }
           AICcltest_star = calculateAICc (Loglik_worst, DF_star, JackLength*cladesDefined);
           AICcltest_es = calculateAICc (Loglik_best, DF_es, JackLength*cladesDefined);
           
     fprintf (stdout, "AICcl star and AICcl best tree: ", AICcltest_star, " ", AICcltest_es, "\n");

           if (AICcltest_star > AICcltest_es){
                 AICcl_significant = AICcl_significant+1;
                 }
           }
     RESULTS_LRT[iterate] = pv_significant;
     RESULTS_AIC[iterate] = AIC_significant;
     RESULTS_AICcc[iterate] = AICcc_significant;
     RESULTS_AICcl[iterate] = AICcl_significant;
     }
------------------   end of code  ------------------------

I think it is working fine, but im not sure. specially the line
aNumber = Random (0, filteredData.sites-1);
does this sample from the entire alignment? Or the -1 should not be there?

Then i also have some questions regarding the AIC calculation (the LRT i used your formula so it should be ok). I am not quite sure of what value to use in the formula

AIC = 2k - 2*lik

Does lik in this formula correspond to LogLik or to LnLik? I add the following 2 functions for AIC and AICc:

function calculateAIC (lik, DegreesFreedom){
           AIC = 2*DegreesFreedom-2*lik;
           return AIC;
           }
     
     /* Now the AICc - this can be calculated either with the nulber of sites (conservative
           estimate) or with the number of sites*number of sequences (liberal estimate)
           just feed it the correct number of observations in the 3rd parameter */
           
     function calculateAICc (lik, DegreesFreedom, Nobs){
           AICc = 2*DegreesFreedom - 2*lik + 2*DegreesFreedom*(DegreesFreedom+1)/(Nobs-DegreesFreedom-1);
           return AICc;
           }

--------------------------------------------------------

And for each case i supplied the LogLikelihood as calculated by the sum of LogLik of each site. Should this be correct?


And finally, is there any way i can test this script? Just to make sure that there is nothing wrong with it before actually using the results of the analysis.


Thank you very much,
Cheers,
Bruno
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: comparing star trees  with non-star trees
Reply #20 - Apr 13th, 2007 at 10:32am
 
Dear Bruno,

[quote author=bruno  link=1174559973/15#19 date=1176481983]Dear Sergei,
I think it is working fine, but im not sure. specially the line
aNumber = Random (0, filteredData.sites-1);
does this sample from the entire alignment? Or the -1 should not be there?

Then i also have some questions regarding the AIC calculation (the LRT i used your formula so it should be ok). I am not quite sure of what value to use in the formula

AIC = 2k - 2*lik

Does lik in this formula correspond to LogLik or to LnLik? I add the following 2 functions for AIC and AICc:

[/quote]

Do not use "-1" (otherwise you will never sample the last site). "$1" takes care of rounding down in the range 0..sites-1

Log is the same as Ln (natural log).

[quote]

And for each case i supplied the LogLikelihood as calculated by the sum of LogLik of each site. Should this be correct?

[/quote]

Yes; this is the standard way to do it for the KH test

[quote]
And finally, is there any way i can test this script? Just to make sure that there is nothing wrong with it before actually using the results of the analysis.

[/quote]

Run two tests

1). Collapse all the deep branches in your species tree to simulate star-like evolution. Simulate some data with this tree and evaluate the rate of false positives.

2). Define several trees with well resolved basal branches and use those to simulate the data for power analysis.

You can do both in HyPhy - just load your original alignment into the GUI, run an HKY+G model fit, then modify branch lengths at the base as needed (using the parameter table window), and finally simulate some data from the sequence viewer window.


Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
bruno
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 13
Re: comparing star trees  with non-star trees
Reply #21 - Apr 16th, 2007 at 9:16am
 
Dear Sergei,

I have simulated several datasets using hyphy following your directions. In a first case i changed all branch lengths connecting the clades so that the tree had a basal polytomie from where all the 6 clades emerged. These 10 datasets are then supposed to be run through the script using 100 subsamples with 1000 Jackknife replicates each. Ive already run the first 2 datasets and the result was that the best tree was never preferred over the star tree. Indeed the likelihood of the 2 trees seem to be in all cases the same. I guess this is good, the script is working ok, but was this the test you told me to do? Shouldnt there be some cases in which the best tree would be significantly better than the star tree, if for nothing else at least by stochasticity?
The other simulation was pretty much the same, except that 2 clades were a sister group to all other clades, and the length of the branch connecting them to the remaining clades was set to a high value (0.05). Ive also run already 2 of these simulated datasets, and in each of the random subsamples there are at least 1/3 Jackknife replicates that prefer the best tree over the star tree (usualy between 300 and 600 - out of 1000 Jackknife iterations - reject the null hypothesis) meaning that the null hypothesis would always be rejected (which i guess is good because thats what we tried to simulate).
Are these tests ok? Was this what you meant? And should the second test (power analysis) be done with a different value for the branch connecting the 2 clades to the remaining ones?
And a final question - i keep getting this error message when the script finishes. It says "Bad symbols in expression" and then you have 2 buttons, one say yes and the other says no. Everything is already printed into the stdout and the file, so whatever the choice i make nothing happens. Also this only happens in windows, not in mac. Any idea?

Cheers,
Bruno
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: comparing star trees  with non-star trees
Reply #22 - Apr 17th, 2007 at 10:25pm
 
Dear Bruno,

These are the tests I wanted you to run. In  the null case (complete basal polytomy), you should see some samples with likelihoods of best trees being better than the likelihood of the star tree, but rarely. If you never see any improvement, I would be worried that the script is doing something wrong.

For power simulations, you should indeed try different branch lengths and different tree shapes. I am a bit concerned that the jackknife proportion is low; 0.3 - 0.6 would actually suggest that the null hypothesis (star tree) cannot be confideently rejected (i.e. a substantial proportion of replicates support the null). You want the jackknife proportion to be 0.95 or more.

In terms of the error message: look for invisible (non-printable) characters at the end of the .bf; they may be confusing the Windows build.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
bruno
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 13
Re: comparing star trees  with non-star trees
Reply #23 - Apr 18th, 2007 at 4:32am
 
Dear Sergei,

I have now run 10 simulated datasets for the topology with 1 basal polytomie for the 6 clades. For each of the 10 simulated datasets i run 100 subsample replicates, and for each of these 1000 Jackknife replicates. In no case has there been a higher likelihood for the best tree than for the star tree (at least not significantly so, and AIC didnt choose the best tree either). Maybe i am doing something wrong in the simulations. Could you please give me some more detailed information on how to properly simulate the datasets (for both the false positive and power analysis)? for the false positive analysis i do not know if there should be a node then a branch leading to each lineage and then the sequences within each lineage, or if all sequences irrespective of lineage should be connected to the original root node (this would make a comb-like tree i guess). Also i am not sure of the exact steps and order of these steps to make sure i am simulating the datasets following the correct tree (not the original one).

Cheers,
Bruno
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: comparing star trees  with non-star trees
Reply #24 - Apr 18th, 2007 at 8:33am
 
Dear Bruno,

For null simulations, you should use the tree with a 6-fold basal polytomy (i.e. collapse the branches starting at the MRCA of each clade). Send me your code and I will take a look at the simulation bits to make sure nothing is amiss.

For power simulations, take the levels of basal divergence (i.e. the distances between clade MRCAs) that you get from an ML tree as a start (i.e. simulate from the ML tree), and then scale that distance 2x, 4x, until you get good power. Also, try simple MRCA trees (e.g. balanced, ladder and some random trees) at a single level of divergence - high enough so that the ML basal structure is reliably recongnized by the permutation test.

HTH,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Pages: 1 2