HyPhy message board - comparing star trees with non-star trees

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

comparing star trees with non-star trees
Mar 22^nd, 2007 at 3:39am

Hello,

I have a question that i hope i can answer with hyphy. I am working with a dataset, and whatever phylogenetic reconstruction method i use i always get 6 clades. The branching order of these clades varies depending on method (MP,ML,quartet puzzling) and substitution model (ML) / weigthing schemes (MP). Doing a bootstrap analysis (ML) results in a fully unresolved politomie from which all 6 clades arise. Likelihood mapping analysis calculates c. 80% fully resolved quartets, so i guess this is not exclusively a artifact due to lack of phylogenetic signal: it seems likely thtat it represents an almost simultaneous spliting of the 6 clades.

in order to test this i have thought of the following scheme: chose randomly a sequence from each of the 6 clades. Create a dataset with those sequences and choose the best model of sequence evolution for it (with modeltest eg). Then load the data into hyphy, infer the best tree topology and calculate the likelihood. Then load a "bogus" tree file that represents a fully star-shaped phylogeny for the 6 sequences. Calculate the likelihood of this tree (with the same model of sequence evolution). Compare the likelihoods of the 2 trees and see wether they are significantly different. Go back to the beginning, select randomly 6 more sequences and create a new dataset, then do this all over again. Do this a few times (maybe 10, maybe 100, i dont know how many times i should do it).

Now is where the questions come, and i have 2:

i) i guess i can use a LRT to compare the likelihoods of the 2 trees, but how many degrees of freedom should this test have? I notice that the unconstrained tree can be resolved with a variable number of branches / nodes (i.e. according to the sequences i pick i can have a fully dicotomic tree, but i can also have some polytomies in that tree). How do i know how many d.f. do i have to use?

ii) In case the 2 trees are not significantly different, does this actually support the hypothesis that the 6 clades diverged simultaniously?

Thats all, sorry for the long text but i wanted to be as clear as i possibly can (and i hope that will be enough).

Thank you for your help on this,
cheers,
Bruno

Back to top

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: comparing star trees with non-star trees
Reply #1 - Mar 22^nd, 2007 at 12:50pm

Dear Bruno,

The resampling test seems sensible.

1). One way to do this comparison is to run a Shimodaira-Hasegawa test on the two trees using AIC to correct for the different number of parameters. Effectively, you fit both trees to the same data, compute the likelihood of every site under the model, and then conduct a number of permutation tests where you draw sites with replacement to make a synthetic dataset, compute the AIC of both models on a dataset (this is fast, because you assume that the likelihood of a new dataset is the sum of site likelihoods) and compare the difference AIC2-AIC1 to 0. If tree 2 is better than tree 1 you will find that the number of replicates with negative differences is very small (e.g. < 0.01 or something like that).

2). Not really - it is suggestive though; failure to reject a hypothesis is not evidence FOR the hypothesis - rather it may be our lack of power to detect the splits.

Also, what does Bayesian analysis suggest (e.g. in BEAST or MrBayes)? What about the possibility of recombination confounding tree building?

The permutation analysis can be scripted in HyPhy fairly easily...

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

Re: comparing star trees with non-star trees
Reply #2 - Mar 23^rd, 2007 at 2:32am

Dear Sergei,
Thank you for your quick reply. I have a few more questions though.

i) I was talking about resampling sequences, not sites. Or do you mean i have to do several replicates of 6 sequences each, and then for each of the replicates resample the sites too (and in this case, should i resample an equal number of sites as the alignment, or do jacknifing - what do you mean with a synthetic dataset?)?

ii) Also why using the AIC and not a LRT? I guess the start tree would always be a special case of the non-star tree (thus the models would be nested and a LRT could be used).

iii) I am still not sure on how to count the number of d.f. of the trees. Is it simply the number of branches? In that case i guess that for each replicate i have to manually count the number of branches of the "unconstrained" tree, because that number will differ between replicates (wether the trees are fully bifurcating or if some polytomies exist).

iv) A question on numbers. I know it is difficult and probably it will depend on the outcome of the analysis (if i have a "close-to-significant" value than i should probably do more replicates) but should i aim at doing 10, 100 or 1000 replicates?

Now answering your questions:
-I have not (yet) run a bayesian sampling, but i do intend to use MrBayes soon (unfortunatly the number of computers at my disposal is not infinite...)
-I work with the control region of mtDNA so recombination should not be a problem

Thanks,
Cheers,
Bruno

Back to top

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: comparing star trees with non-star trees
Reply #3 - Mar 23^rd, 2007 at 10:55am

Dear bruno,

Quote:

i) I was talking about resampling sequences, not sites. Or do you mean i have to do several replicates of 6 sequences each, and then for each of the replicates resample the sites too (and in this case, should i resample an equal number of sites as the alignment, or do jacknifing - what do you mean with a synthetic dataset?)?

Yes, you do dual resampling - first of taxa, then for a given sample of data you jackknife the sites to assess significance.

Quote:

ii) Also why using the AIC and not a LRT? I guess the start tree would always be a special case of the non-star tree (thus the models would be nested and a LRT could be used).

You could indeed collapse any tree to a star tree by setting internal branch lengths to 0 - however the appropriate LRT statistic would be very ugly (and impossible to work out a priori) - odd things happen when the simpler model is obtained by setting parameters to extreme values. AIC is simpler to use.

Quote:

iii) I am still not sure on how to count the number of d.f. of the trees. Is it simply the number of branches? In that case i guess that for each replicate i have to manually count the number of branches of the "unconstrained" tree, because that number will differ between replicates (wether the trees are fully bifurcating or if some polytomies exist).

For phylogenetic ML estimation the number of branches is the number of estimated parameters. What # to assign to the structure of the tree itself is a more difficult question, but it is probably irrelevant to your question. You can use some HyPhy code to automatically count the # of tree branches for you.

Quote:

iv) A question on numbers. I know it is difficult and probably it will depend on the outcome of the analysis (if i have a "close-to-significant" value than i should probably do more replicates) but should i aim at doing 10, 100 or 1000 replicates?

It should be feasible to do 1000 - aim for that.

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

Re: comparing star trees with non-star trees
Reply #4 - Apr 3^rd, 2007 at 9:26am

Dear Sergei,

I've been playing around with HYPHY trying to learn some scripting but it has not been easy... I still need some help with this analysis, so here follows what i have manage to do so far and some questions i still have.

I have built 1000 Jack-knifed datasets using Felsenstein's program SeqBoot. These 1000 datasets were then run through phyml. Model used was HKY+G+I, and all parameters were estimated. I ended up with 1000 datasets + 1000 trees. Removed the branch lenghts from the trees so only topology is used further in HYPHY.

As i have not seen any comments on reading datafiles with multiple datasets, im planning on using a perl/expect script to read the 1000 datasets file, write 1 dataset at a time to a file name data.fas, and then create a batchfile for HYPHY that should look something like

-------------------- BATCH FILE ---------------------------

#include "lib/displayfunction.bf";

DataSet myData = ReadDataFile ("data.fas");
DataSetFilter myFilter = CreateFilter (myData,1);

global Pinvar;
global shape;

category invClass = (2,{{Pinvar}{1-Pinvar}},MEAN, ,{{0}{1}}, 0,1);
category rateCat =(4, EQUAL, MEAN, GammaDist(_x_,shape,shape), CGammaDist(_x_,shape,shape),0,1e25);
HarvestFrequencies (obsFreqs, myFilter, 1, 1, 1);
HKY_G_I_RateMatrix ={{*,b*invClass*rateCat,a*invClass*rateCat,b*invClass*rateCat}
{b*invClass*rateCat,*,b*invClass*rateCat,a*invClass*rateCat}
{a*invClass*rateCat,b*invClass*rateCat,*,b*invClass*rateCat}
{b*invClass*rateCat,a*invClass*rateCat,b*invClass*rateCat,*}};

Model HKY_G_I = (HKY_G_I_RateMatrix, obsFreqs);

useModel(HKY_G_I);

Tree StarTree = (a,b,c,d,e,f);

LikelihoodFunction theLikFunStar = (myFilter,StarTree);
Optimize(paramValues,theLikFunStar);

fprintf(stdout,theLikFunStar);

Tree phymlTree = ((((e,a),d),b),f,c);

LikelihoodFunction theLikFunPhyml = (myFilter,phymlTree);
Optimize(paramValues,theLikFunPhyml);

fprintf(stdout,theLikFunPhyml);

---------------- END OF BATCH FILE ---------------

The perl/expect script will rewrite this batch file for each dataset so that the correct tree is loaded in line "Tree phymlTree = *". Then the batch file should be run through HYPHY to get the likelihoods of the star tree and ML-tree for each dataset. Hopefully it will also calculate the AIC (for small datasets) and output the result of this test, which can then be stored in an array in the perl/expect script.

As you can see i based the definition of HKY+G+I on the thread of Jose_Patane, could you please let me know if it is well defined? I omitted the term "m" on his matrix because im not sure where it comes from... is it a scalling factor? Should i use it?

If the above batch file is ok, then i should end up with the likelihoods of each dataset according to the 2 trees. How do i perform the AICc from within the batch file? And how can i redirect this result into standard output / some file so i can get it into my perl/expect file?

Sorry if these questions are so naive, but i am very new to HYPHY and it is quite complicated to get started with. At least for me.

Thank you for your help,
cheers,
Bruno

Back to top

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: comparing star trees with non-star trees
Reply #5 - Apr 4^th, 2007 at 10:10am

Dear Bruno,

You can actually script the entire pipeline in HyPhy in, I'm guessing about 200 lines of HBL. I'll be happy to get you started with some example code, but first let me make sure I recall your problem correctly:

1). Start a large alignment with 6 basal clades that you wish to resolve (i.e. a sequence file and a list which breaks all taxa into 6 groups).
2). Draw a large number of 6-taxon samples with one from each clade
3). Compare the ML topology of the sample to the star tree using AIC and the KH test
4). Report significant improvements

I'd like you to add as much specifics to this as possible, e.g. the input and the output formats, which substitution model to use, etc. Once I have that, I can write a barebones script which you can then modify to your needs. I think this exercise will be a useful example to have as a HyPhy custom pipeline tutorial - hence I am willing to contribute the effort to write it.

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

Re: comparing star trees with non-star trees
Reply #6 - Apr 4^th, 2007 at 10:35am

Dear Sergei,

That is exactely what i want to do.

Output and input formats should not be a problem, fasta phylip or nexus is ok, whatever suits you better. Sequential formats are in my opinion easier to handle than the (sometimes) strange interleaved formats. The model im planning to use is the HKY+G+I which was chosen by modeltest for the complete alignment.

Hope this helps,
Cheers,
Bruno

Back to top

IP Logged

Sergei YaBB Administrator Offline Datamonkeys are forever... Posts: 1658 UCSD Gender:	Re: comparing star trees with non-star trees Reply #7 - Apr 4^th, 2007 at 11:04am Dear Bruno, How do you want to specify the 6 clades? Perhaps the easiest thing to do is to provide a tree with 6 clades and nothing else, i.e. ((a1,a2,...,aN),(b1,...), (f1,...))? Could you e-mail me the data (spond at ucsd dot edu) Cheers, Sergei
Back to top	Associate Professor Division of Infectious Diseases Division of Biomedical Informatics School of Medicine University of California San Diego WWW IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: comparing star trees with non-star trees
Reply #8 - Apr 8^th, 2007 at 11:13pm

Dear Bruno,

Here's some code for you (taking the format we discussed by e-mail: a sequence file and a list of sequences - one per line - with blank lines separating clades). At this point the code will compare star trees to NJ trees - more advanced tree building methods can be plugged in. Also you can modify the method to screen a star versus some hypothesized topology. Let me know if you have problems/questions.

Cheers,
Sergei

Code:

RequireVersion ("0.9920070401"); /* require a recent HyPhy version */

VERBOSITY_LEVEL = -1;

/* include a standard NJ module */
 	  ExecuteAFile (HYPHY_BASE_DIRECTORY+"TemplateBatchFiles"+DIRECTORY_SEPARATOR+"Utility"+DIRECTORY_SEPARATOR+"NJ.bf");

/* first, read the data file and get a list of taxon names */

	SetDialogPrompt ("A nucleotide alignment file:");
	DataSet allData = ReadDataFile (PROMPT_FOR_FILE);
	availableTaxonNames = {}; /* an associative array of the form array[taxon] = taxon # for
								 all taxa which are in the data file */

	for (i = 0; i<allData.species; i=i+1)
	{
		GetString (ithTaxon,allData,i);
		availableTaxonNames[ithTaxon&&1] = i+1; /* &&1 capitalizes the string;
												 this is used to make comparisons
												 case insensitive */
	}

	fprintf (stdout, "[PHASE 0] Read a data file with ", allData.species, " taxa and ",
								allData.sites, " nucleotides.\n");

/* now, read and process the list of clades */

	SetDialogPrompt ("A list of taxa in every clade:");
	currentClade = 1;
	taxaByClade  = {}; /* assoc. array of the form array[clade#] = a list of taxon names */

	fscanf (PROMPT_FOR_FILE, "Lines", taxonList); /* read the file into an array
													 of lines*/
	isLastLineEmpty = 0;

	for (i = 0; i<Columns (taxonList); i=i+1)
	{
		currentLine = taxonList[i]&&1;
		if (Abs(currentLine) == 0 ) /* empty line */
		{
			if (isLastLineEmpty == 0)
			{
				currentClade    = currentClade + 1;
				isLastLineEmpty = 1;
			}
		}
		else
		{
			isLastLineEmpty = 0;
			if (availableTaxonNames[currentLine] > 0) /* valid taxon name */
			{
				if (Abs(taxaByClade[currentClade])==0)
				{
					(taxaByClade)[currentClade] = {};
				}
				((taxaByClade)[currentClade])[Abs((taxaByClade)[currentClade])] = currentLine;
			}
		}
	}

	cladesDefined = Abs(taxaByClade);
	fprintf (stdout, "[PHASE 1] Defined ",cladesDefined , " clades");
	taxaInClade   = {cladesDefined,1}; /* how many taxa in each clade */
	for (i=1; i<=cladesDefined; i=i+1)
	{
		fprintf (stdout, "\n\tClade ", i);
		taxaInClade[i-1] = Abs(taxaByClade[i]);
		for (j=0; j<Abs(taxaByClade[i]); j=j+1)
		{
			fprintf (stdout, "\n\t\t", (taxaByClade[i])[j]);
		}
	}
	fprintf (stdout, "\n");

/* select a model to use */

	DataSetFilter		filteredData = CreateFilter (allData,1);
	SelectTemplateModel (filteredData); /* invoke a user-driven selection of substitution model */

	fprintf (stdout, "[PHASE 2] Fitting global model parameters using a global tree.\n");

/* see if there is a tree for the entire alignment to infer model parameters from;
   if not use neighbor joining to get one */

   if (Abs(DATAFILE_TREE)) /* have tree */
   {
 		  Tree globalTree = DATAFILE_TREE;
 		  fprintf (stdout, "\tInference from a user (in-file) tree\n");
   }
   else
   {
 		  njTreeString = InferTreeTopology(0);
 		  Tree globalTree = njTreeString;
 		  fprintf (stdout, "\tInference from a NJ tree\n");
   }
   LikelihoodFunction global_lf = (filteredData,globalTree);
   Optimize (global_res, global_lf);
   fprintf  (stdout, "\tLogL = ", global_res[1][0],"\n");
   /* load more utility functions */
   ExecuteAFile (HYPHY_BASE_DIRECTORY+"TemplateBatchFiles"+DIRECTORY_SEPARATOR+"Utility"+DIRECTORY_SEPARATOR+"GrabBag.bf");
   /* constrain all global model parameters (e.g. trans/transv ratio, gamma shape parameter etc) based on the global
	alignment */
   fixGlobalParameters ("global_lf");

/* prompt for the number of replicates and where to write the results*/

	fprintf (stdout, "How many samples should be drawn:?");
	fscanf  (stdin,"Number",sampleCount);
	SetDialogPrompt ("Write the results to:");
	fprintf (PROMPT_FOR_FILE, CLEAR_FILE, KEEP_OPEN,"Iterate\tTaxa\tStarLog\tNJ Log\tDeltaDegFreedom\tp-value\tNJ Tree");
	baseOutPath = LAST_FILE_PATH;


/* loop over the replicates */

	for (iterate = 0; iterate < sampleCount; iterate = iterate + 1)
	{
		selectedTaxa = {}; /* which taxa fall in this iterate;
							  an assoc. array of the form array[taxon INDEX] = 1 for chosen taxa*/
		selectedByClade = {};
		for (clade = 0; clade < cladesDefined; clade = clade + 1)
		{
			taxonName									    = (taxaByClade[clade+1])[Random (0,taxaInClade[clade])$1];
			selectedTaxa [availableTaxonNames[taxonName]-1] = 1;
			selectedByClade[clade] 							= taxonName;
		}
		DataSetFilter 	 filteredData = CreateFilter (allData,1,"",selectedTaxa[speciesIndex]>0);
		fprintf (baseOutPath, "\n", iterate+1,"\t");
		starTree = "(";
		for (i=0; i<cladesDefined; i=i+1)
		{
			if (i>0)
			{
				fprintf (baseOutPath, ",");
				starTree = starTree + "," + selectedByClade[i];
			}
			else
			{
				starTree = starTree + selectedByClade[i];
			}
			fprintf			 (baseOutPath, selectedByClade[i]);
		}
		starTree = starTree + ")";

		/* make a star tree */
		Tree 				  star = starTree;
		LikelihoodFunction smallLF = (filteredData, star);
		Optimize (resStar, smallLF);

		/* make a NJ tree */
 		  njTreeString = InferTreeTopology(0);
		Tree 				  nj = njTreeString;
		LikelihoodFunction smallLF = (filteredData, nj);
		Optimize (resNJ, smallLF);

		/* compute LRT p-value and write the results */
		DF = resNJ[1][1]-resStar[1][1];
		pv = 1-CChi2(2(resNJ[1][0]-resStar[1][0]),DF);
		fprintf (baseOutPath,"\t",resStar[1][0],"\t",resNJ[1][0],"\t",DF,"\t",pv,"\t",Format(nj,0,1));
		fprintf (stdout, "[ITERATE ", iterate + 1, "/",sampleCount," p = ", pv, "]\n");
	}

/* close file */

fprintf (baseOutPath, CLOSE_FILE);

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

bruno YaBB Newbies Offline I love YaBB 1G - SP1! Posts: 13	Re: comparing star trees with non-star trees Reply #9 - Apr 9^th, 2007 at 1:52am Dear Sergei, Thank you very much. Ill play around with the code and ill post again if i have some questions, Cheers, Bruno
Back to top	IP Logged

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

Re: comparing star trees with non-star trees
Reply #10 - Apr 10^th, 2007 at 9:15am

Dear Sergei,

Ive looked at the code and ive done a few runs and all seems to be ok, but i have some simple questions:

-The version that the code requires is more recent than the most recent version available for download in HYPHY.org. I have removed the line requiring the version number and all *seems* to be working fine. Is this ok, or might there be some problems in the older version?

-The line
njTreeString = InferTreeTopology(0);
makes a NJ tree, but does it implement some king of branch swapping (eg. NNI)?

-Is the Jackknifing protocol we talked about implemented? And it seems that the code uses the LRT and not the AIC as we discussed above?

Please take into consideration that the code is (or at least seems to be) working properly, and these questions are just to try and understand whats going on in the analysis with HYPHY. Also if branchswapping and Jackknifing is not implemented, i would like to try and implement it myself.

Thanks,
Bruno

Back to top

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: comparing star trees with non-star trees
Reply #11 - Apr 10^th, 2007 at 11:03am

Dear Bruno,

I posted the newest build up yesterday, but, as you discovered, there is no need to use the latest version.

1). AIC should be very easy to implement based on the LRT code. The first argument populated by Optimize will have the number of optimized parameters in the second column of the second row ([1][1] index).

2). The procedure implements taxon sampling, but not the per-site non-parameteric bootstrap. I was not sure if you wanted to do that or not. Add this after everything else works to your satisfaction.

3). NJ does not carry out branch swapping - since it produces a possibly suboptimal tree, the test is conservative. Ideally, you could actually run an exhaustive search on each 6-taxon sample (105 trees are fast to fit). If you are interested, I can tell you which bits of code to snip out from 'TopologySearch.bf' to make this happen.

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

Re: comparing star trees with non-star trees
Reply #12 - Apr 11^th, 2007 at 8:54am

Dear Sergei,

I think i have managed to include exhaustive search and AIC calculation in the code you posted. For AIC it was relativelly straightforward, while for exhaustive search things were a bit more complicated. Basically i copy pasted all the file topologySearch.bf into our file. Then i removed all "user interface" commands and replaced all appearances of "ds.species" by "cladesDefined". I also had to change 1 or 2 more lines (for some reason treeVar did not contain the correct number of leaves). I think it should be ok now (at least it runs and outputs somethng that makes sense) but maybe you can have a look. It also counts the number of significantly different results (for LRT) and the number of times the AIC of the startTree is bigger than the AIC of the best tree.

Of course im not done with questions just yet:

-Next i would like to implement the Jackknife method, could you give me some pointers?

-I would like to use AICc instead of AIC, but i dont know how to count the number of observations... is it somehow related to http://www.hyphy.org/cgi-bin/yabb/YaBB.pl?board=MOLEV;action=display;num=1164656629
or is this a simpler question?

-Would it make sense to optimize the values of the substitution model for each subsample instead of using the values for the complete alignment? How time consuming would it be?

-Ive seen that the number of d.f. of the estimated tree does not change. For the star tree d.f.=6, while for the estimate tree d.f. = 9. If i am not making some confusion, 6 is the number of free parameters of the HKY+G+I model itself (1 ts/tv + 3 nuc.freqs. + G + I). So for the star tree there are no parameters estimated? I know the tree topology is fixed, but the branch lengths are estimated, so why are they not counted as parameters? And as for the estimated tree, why are there always 9 d.f. if the number of nodes / branches varies amongst replicates (some best trees have politomies on them)? Is it because the branch exists but its estimated length is zero? Im sorry to get back to the d.f. thing (i think ive asked about d.f. in all the posts so far) but it seems to be an important value so i want to understand how it is calculated.

Ill post the code in a next message, i tried to put it here but it says the message is too big...

Cheers,
Bruno

Back to top

IP Logged

bruno

YaBB Newbies

Offline

I love YaBB 1G - SP1!

Posts: 13

Re: comparing star trees with non-star trees
Reply #13 - Apr 11^th, 2007 at 9:03am

VERBOSITY_LEVEL = -1;

/* include the topologySearch.bf module
ExecuteAFile (HYPHY_BASE_DIRECTORY+"TemplateBatchFiles"+DIRECTORY_SEPARATOR+"TopologySearch2.
bf.txt");

include a standard NJ module
ExecuteAFile (HYPHY_BASE_DIRECTORY+"TemplateBatchFiles"+DIRECTORY_SEPARATOR+"Utility" +DIRECTORY_SEPARATOR+"NJ.bf");

*/

/* first, read the data file and get a list of taxon names */

SetDialogPrompt ("A nucleotide alignment file:");
DataSet allData = ReadDataFile (PROMPT_FOR_FILE);
availableTaxonNames = {}; /* an associative array of the form array[taxon] = taxon # for
   all taxa which are in the data file */

for (i = 0; i<allData.species; i=i+1)
{
GetString (ithTaxon,allData,i);
availableTaxonNames[ithTaxon&&1] = i+1; /* &&1 capitalizes the string;
this is used to make comparisons
case insensitive */
}

fprintf (stdout, "[PHASE 0] Read a data file with ", allData.species, " taxa and ",
   allData.sites, " nucleotides.\n");

/* now, read and process the list of clades */

SetDialogPrompt ("A list of taxa in every clade:");
currentClade = 1;
taxaByClade = {}; /* assoc. array of the form array[clade#] = a list of taxon names */

fscanf (PROMPT_FOR_FILE, "Lines", taxonList); /* read the file into an array
   of lines*/
isLastLineEmpty = 0;

for (i = 0; i<Columns (taxonList); i=i+1)
{
currentLine = taxonList[i]&&1;
if (Abs(currentLine) == 0 ) /* empty line */
{
if (isLastLineEmpty == 0)
{
currentClade = currentClade + 1;
isLastLineEmpty = 1;
}
}
else
{
isLastLineEmpty = 0;
if (availableTaxonNames[currentLine] > 0) /* valid taxon name */
{
if (Abs(taxaByClade[currentClade])==0)
{
   (taxaByClade)[currentClade] = {};
}
((taxaByClade)[currentClade])[Abs((taxaByClade)[currentClade])] = currentLine;
}
}
}

cladesDefined = Abs(taxaByClade);
fprintf (stdout, "[PHASE 1] Defined ",cladesDefined , " clades");
taxaInClade = {cladesDefined,1}; /* how many taxa in each clade */
for (i=1; i<=cladesDefined; i=i+1)
{
fprintf (stdout, "\n\tClade ", i);
taxaInClade[i-1] = Abs(taxaByClade[i]);
for (j=0; j<Abs(taxaByClade[i]); j=j+1)
{
fprintf (stdout, "\n\t\t", (taxaByClade[i])[j]);
}
}
fprintf (stdout, "\n");

/* select a model to use */

DataSetFilter filteredData = CreateFilter (allData,1);
SelectTemplateModel (filteredData); /* invoke a user-driven selection of substitution model */

fprintf (stdout, "[PHASE 2] Fitting global model parameters using a global tree.\n");

/* see if there is a tree for the entire alignment to infer model parameters from;
if not use neighbor joining to get one */

if (Abs(DATAFILE_TREE)) /* have tree */
{
   Tree globalTree = DATAFILE_TREE;
   fprintf (stdout, "\tInference from a user (in-file) tree\n");
}
else
{
   njTreeString = InferTreeTopology(0);
   Tree globalTree = njTreeString;
   fprintf (stdout, "\tInference from a NJ tree\n");
}
LikelihoodFunction global_lf = (filteredData,globalTree);
Optimize (global_res, global_lf);
fprintf (stdout, "\tLogL = ", global_res[1][0],"\n");
/* load more utility functions */
ExecuteAFile (HYPHY_BASE_DIRECTORY+"TemplateBatchFiles"+DIRECTORY_SEPARATOR+"Utility" +DIRECTORY_SEPARATOR+"GrabBag.bf");
/* constrain all global model parameters (e.g. trans/transv ratio, gamma shape parameter etc) based on the global
alignment */
fixGlobalParameters ("global_lf");

/* prompt for the number of replicates and where to write the results*/

fprintf (stdout, "How many samples should be drawn:?");
fscanf (stdin,"Number",sampleCount);
SetDialogPrompt ("Write the results to:");
fprintf (PROMPT_FOR_FILE, CLEAR_FILE, KEEP_OPEN,"Iterate\tTaxa\tStarLog\tNJ Log\tDeltaDegFreedom\tp-value\tNJ Tree");
baseOutPath = LAST_FILE_PATH;

/* loop over the replicates */

for (iterate = 0; iterate < sampleCount; iterate = iterate + 1)
{
selectedTaxa = {}; /* which taxa fall in this iterate;
   an assoc. array of the form array[taxon INDEX] = 1 for chosen taxa*/
selectedByClade = {};
for (clade = 0; clade < cladesDefined; clade = clade + 1)
{
taxonName = (taxaByClade[clade+1])[Random (0,taxaInClade[clade])$1];
selectedTaxa [availableTaxonNames[taxonName]-1] = 1;
selectedByClade[clade]    = taxonName;
}
DataSetFilter filteredData = CreateFilter (allData,1,"",selectedTaxa[speciesIndex]>0);
fprintf (baseOutPath, "\n", iterate+1,"\t");
starTree = "(";
for (i=0; i<cladesDefined; i=i+1)
{
if (i>0)
{
fprintf (baseOutPath, ",");
starTree = starTree + "," + selectedByClade[i];
}
else
{
starTree = starTree + selectedByClade[i];
}
fprintf (baseOutPath, selectedByClade[i]);
}
starTree = starTree + ")";

/* make a star tree */
Tree star = starTree;
LikelihoodFunction smallLF = (filteredData, star);
Optimize (resStar, smallLF);

/* ################################# BEGGINING OF TOPOLOGYSEARCH.bf ##################### */
/* ____________________________________________*/

function TreeMatrix2TreeString (levelIndex)
{
treeString = "";
p = 0;
k = 0;
m = treeNodes[0][levelIndex+1];
n = treeNodes[0][levelIndex];

while (m)
{
if (m>p)
{
if (p)
{
treeString = treeString+",";
}
for (j=p;j<m;j=j+1)
{
treeString = treeString+"(";
}
}
else
{
if (m<p)
{
for (j=m;j<p;j=j+1)
{
treeString = treeString+")";
}
}
else
{
treeString = treeString+",";
}
}
if (n<cladesDefined)
{

/* Next line was also changed */

GetString (nodeName, filteredData, n);
treeString = treeString+nodeName;
}
k=k+1;
p=m;
n=treeNodes[k][levelIndex];
m=treeNodes[k][levelIndex+1];
}

for (j=m;j<p;j=j+1)
{
treeString = treeString+")";
}

return treeString;
}

/* ____________________________________________*/

function _PrepareForTreeSearch (treesToBeSearched)
{
bestTreesStash = {10,2};
globalTreeCounter = 0;
treeStatistics = {treesToBeSearched, 1};
for (ii=0; ii<10; ii=ii+1)
{
bestTreesStash [ii][1] = -1e100;
bestTreesStash [ii][0] = "";
}
return 1;
}

/* ____________________________________________*/

function _AddTreeToResults (currentTreeString, currentLFValue)
{
treeStatistics [globalTreeCounter][0] = currentLFValue;
globalTreeCounter = globalTreeCounter+1;

for (ii = 0; ii<10; ii=ii+1)
{
if (currentLFValue>bestTreesStash[ii][1])
{
break;
}
}
if (ii<10)
{
for (ii2 = 8; ii2>=ii; ii2=ii2-1)
{
bestTreesStash [ii2+1][1] = bestTreesStash[ii2][1];
bestTreesStash [ii2+1][0] = bestTreesStash[ii2][0];
}
bestTreesStash [ii][0] = currentTreeString;
bestTreesStash [ii][1] = currentLFValue;
}
return 1;
}

/*---------------------------------------------------------------*/

function ReceiveJobs (sendOrNot)
{
if (MPI_NODE_COUNT>1)
{
MPIReceive (-1, fromNode, result_String);

saveTree = MPINodeTree[fromNode-1];
saveIdx = MPINodeInfo[fromNode-1][1];

if (sendOrNot)
{
MPISend (fromNode,current_MPIJOB);
MPINodeTree[fromNode-1] = thisTree;
MPINodeInfo[fromNode-1][1] = treeCounter;
}
else
{
MPINodeInfo[fromNode-1][0] = 0;
MPINodeInfo[fromNode-1][1] = -1;
MPINodeTree[fromNode-1]    = "";
}

thisTree = saveTree;
ExecuteCommands ("currentLL = " + result_String + ";");

saveCounter = treeCounter;
treeCounter = saveIdx;
}
else
{
currentLL = res[1][0];
}

dummy = _AddTreeToResults (thisTree, currentLL);
if (currentLL>bestValue)
{
bestValue = currentLL;
bestTree = thisTree;
}
fprintf (stdout, " ==> logLhd = ", currentLL);

if (MPI_NODE_COUNT>1)
{
treeCounter = saveCounter;
}

return fromNode-1;
}

/* ____________________________________________*/

function _ReportTreeStatistics (currentLFValue)
{
ii = 0;
fprintf (stdout, "\n\n**************************\n",
      "*    TREE REPORT *\n",
      "**************************\n\n");

fprintf (stdout, "\n#### BEST TREES #####\n\n");

for (ii=0; ii<10; ii = ii+1)
{
if (bestTreesStash[ii][1]==(-1e100))
{
break;
}
fprintf (stdout, ii+1, ").");

if (ii>0)
{
fprintf (stdout, " Worse by: ", bestTreesStash[ii][1]-currentLFValue);
}
fprintf (stdout,"\n", bestTreesStash[ii][0], "\nLog-likelihood = ", bestTreesStash[ii][1], "\n\n");
}

fprintf (stdout, "\n#### STATISTICS #####\n\n");

bestTreesStash [0][0] = 0.1;
bestTreesStash [1][0] = 0.5;
bestTreesStash [2][0] = 1;
bestTreesStash [3][0] = 5;
bestTreesStash [4][0] = 10;
bestTreesStash [5][0] = 50;
bestTreesStash [6][0] = 100;
bestTreesStash [7][0] = 1000;
bestTreesStash [8][0] = 10000;
bestTreesStash [9][0] = 1e100;

bestTreesStash [0][1] = 0;
bestTreesStash [1][1] = 0;
bestTreesStash [2][1] = 0;
bestTreesStash [3][1] = 0;
bestTreesStash [4][1] = 0;
bestTreesStash [5][1] = 0;
bestTreesStash [6][1] = 0;
bestTreesStash [7][1] = 0;
bestTreesStash [8][1] = 0;
bestTreesStash [9][1] = 0;

probabilityOfTheData = 0;

for (i=0; i<globalTreeCounter; i=i+1)
{
diff = currentLFValue-treeStatistics[i];
j = 0;
while (diff>bestTreesStash[j][0])
{
j=j+1;
}
bestTreesStash [j][1] = bestTreesStash [j][1] + 1;
probabilityOfTheData = probabilityOfTheData+Exp(-diff);
}

bestTreesStash [0][1] = bestTreesStash [0][1]-1;

ii = "+---------------+---------------+---------------+---------------+\n";
fprintf (stdout, "\n\n", ii,
   "| From Best + | To Best + | Tree Count | % of total    |\n",
ii);
for (i=0; i<10; i=i+1)
{
if (i)
{
fprintf (stdout, "| " , Format (bestTreesStash [i-1][0],13,1));
}
else
{
fprintf (stdout, "| 0");
}
if (i<9)
{
fprintf (stdout, " | " , Format (bestTreesStash [i][0],13,1));
}
else
{
fprintf (stdout, " | Infinity");
}
fprintf (stdout, " | ", Format (bestTreesStash [i][1],13,0), " | ", Format (100*bestTreesStash [i][1]/globalTreeCounter,13,8), " |\n",ii);
}

fprintf (stdout, "\n\nPosterior probability of the best tree (with uninformative prior) = ",1./probabilityOfTheData,"\n\n");

fprintf (stdout, "\n\n***********Save full tree statistics to a file (y/n)?");

fscanf (stdin, "String", resp);

if ((resp!="n")&&(resp!="N"))
{
SetDialogPrompt ("Write tree stats string to:");
fprintf (PROMPT_FOR_FILE,CLEAR_FILE,treeStatistics);
}
treeStatistics = 0;

return 1;
}

/* #####################################################################    */

MESSAGE_LOGGING = 0;
VERBOSITY_LEVEL = -1;
/*
SetDialogPrompt ("Please choose a nucleotide or amino-acid data file:");

DataSet ds    = ReadDataFile (PROMPT_FOR_FILE);
DataSetFilter filteredData = CreateFilter (ds,1,"","");

SelectTemplateModel(filteredData);
*/

if (MPI_NODE_COUNT > 1)
{
mpiJobPrefix = "";
mpiJobPrefix * 128;
mpiJobPrefix * ("DataSet ds = ReadDataFile (\""+LAST_FILE_PATH+"\");DataSetFilter filteredData = CreateFilter (ds,1);");
Export (modelExp, USE_LAST_MODEL);
mpiJobPrefix * modelExp;
mpiJobPrefix * 0;
mpiJobSuffix = "LikelihoodFunction lf = (filteredData,givenTree);Optimize(res,lf);return res[1][0];";
}

treeNodes = {2*(cladesDefined+1),2*(cladesDefined-2)};
cladesInfo = {cladesDefined,2*(cladesDefined-2)};

branchIndex = {cladesDefined-3,1};
currentLevel = 0;

done = false;

i = 2*cladesDefined-5;
j = 1;
while (i>1)
{
j = j*i;
i = i-2;
}

dummy = _PrepareForTreeSearch (j);

treeNodes[0][0]=0;
treeNodes[0][1]=1;
treeNodes[1][0]=1;
treeNodes[1][1]=1;
treeNodes[2][0]=2;
treeNodes[2][1]=1;
treeNodes[3][0]=cladesDefined;
treeNodes[3][1]=0;
cladesInfo[0][0]=0;
cladesInfo[0][1]=4;

bestTree ="";
bestValue=-1e20;

done = 0;
treeCounter = 0;

if (MPI_NODE_COUNT>1)
{
MPINodeInfo = {MPI_NODE_COUNT-1,2};
MPINodeTree = {MPI_NODE_COUNT-1,1};
MPINodeTree[0] = "";
}

while (!done)
{
if (branchIndex[currentLevel]<2*currentLevel+3)
{
i = 0;
shift = 0;
j = 2*currentLevel;
k = j+2;
m = j+1;
while (treeNodes[i][m])
{
/*copy tree from prev level to this level */
if (i==branchIndex[currentLevel])
/*insert new branch*/
{
shift = 2;
if (treeNodes[i][j]<cladesDefined)
/* simple branch */
{
treeNodes[i][k]=treeNodes[i][j];
treeNodes[i][k+1]=treeNodes[i][m]+1;
treeNodes[i+1][k]=currentLevel+3;
treeNodes[i+1][k+1]=treeNodes[i][m]+1;
treeNodes[i+2][k]=currentLevel+cladesDefined+1;
treeNodes[i+2][k+1]=treeNodes[i][m];
cladesInfo[currentLevel+1][k] = i;
cladesInfo[currentLevel+1][k+1] = 3;
}
else
{
/* update node depths for the entire clade now*/
l = treeNodes[i][j]-cladesDefined;
s = cladesInfo[l][j];
for (p=s+cladesInfo[l][m]-1; p>=s; p=p-1)
{
treeNodes[i][k]=treeNodes[i][j];
treeNodes[i][k+1]=treeNodes[i][m]+1;
i=i-1;
}
i=i+cladesInfo[l][m];
/* new clade record */
cladesInfo[currentLevel+1][k] = cladesInfo[l][j];
cladesInfo[currentLevel+1][k+1] = cladesInfo[l][m]+2;
/* now we need to insert two more nodes */
treeNodes[i+1][k]=currentLevel+3;
treeNodes[i+1][k+1]=treeNodes[i][m]+1;
treeNodes[i+2][k]=currentLevel+cladesDefined+1;
treeNodes[i+2][k+1]=treeNodes[i][m];
}
for (p=0; p<=currentLevel; p=p+1)
{
if (cladesInfo[p][j]>i)
{
cladesInfo[p][k] = cladesInfo[p][j]+2;
}
else
{
cladesInfo[p][k] = cladesInfo[p][j];
}

if ((cladesInfo[p][j]<=i)&&((cladesInfo[p][j]+cladesInfo[p][m])>i+1))
{
cladesInfo[p][k+1] = cladesInfo[p][m]+2;
}
else
{
cladesInfo[p][k+1] = cladesInfo[p][m];
}
}
}
else
{
treeNodes[i+shift][k]=treeNodes[i][j];
treeNodes[i+shift][k+1]=treeNodes[i][m];
}
i = i+1;
}
treeNodes[i+2][k]=treeNodes[i][j];
treeNodes[i+2][k+1]=treeNodes[i][j+1];
if (currentLevel<cladesDefined-4)
{
currentLevel = currentLevel+1;
}
else
{

thisTree = TreeMatrix2TreeString (2*(currentLevel+1));

branchIndex[currentLevel]=branchIndex[currentLevel]+1;
fprintf (stdout, "\nTree#",Format(treeCounter,0,0)," ", thisTree);
current_MPIJOB = mpiJobPrefix + "Tree givenTree = " + thisTree + ";" + mpiJobSuffix;
if (MPI_NODE_COUNT>1)
{
for (mpiNode = 0; mpiNode < MPI_NODE_COUNT-1; mpiNode = mpiNode+1)
{
if (MPINodeInfo[mpiNode][0]==0)
{
break;
}
}

if (mpiNode==MPI_NODE_COUNT-1)
/* all nodes busy */
{
mpiNode = ReceiveJobs (1);
}
else
{
MPINodeInfo[mpiNode][0] = 1;
MPINodeInfo[mpiNode][1] = treeCounter;
MPINodeTree[mpiNode] = thisTree;
MPISend (mpiNode+1,current_MPIJOB);
}
}
else
{
Tree treeVar = thisTree;
LikelihoodFunction lf = (filteredData, treeVar);
Optimize (res, lf);
ReceiveJobs (0);
}
treeCounter = treeCounter+1;
}
}
else
{
branchIndex[currentLevel]=0;
if (currentLevel==0)
{
done = 1;
}
else
{
currentLevel = currentLevel-1;
branchIndex[currentLevel]=branchIndex[currentLevel]+1;
}
}
}

if (MPI_NODE_COUNT>1)
{
while (1)
{
for (nodeCounter = 0; nodeCounter < MPI_NODE_COUNT-1; nodeCounter = nodeCounter+1)
{
if (MPINodeInfo[nodeCounter][0]==1)
{
fromNode = ReceiveJobs (0);
break;
}
}
if (nodeCounter == MPI_NODE_COUNT-1)
{
break;
}
}
}

VERBOSITY_LEVEL = 0;

Tree es_tree = bestTree;
/*
fprintf (stdout,"\n\n --------------------- RESULTS --------------------- \n\n");
fprintf (stdout,"\n\n BestTree =", bestTree);

dummy = _ReportTreeStatistics (bestValue);

LikelihoodFunction lf = (filteredData, tr);
Optimize (res,lf);
fprintf (stdout, "\n",lf, "\n\n***********Save this tree to a file (y/n)?");

fscanf (stdin, "String", resp);

if ((resp!="n")&&(resp!="N"))
{
SetDialogPrompt ("Write tree string to:");
fprintf (PROMPT_FOR_FILE,CLEAR_FILE,bestTree,";");
}

*/

/* ####################################### END OF TOPOLOGYSEARCH.bf ###################### */

/* Optimize the likfunc with the best tree found through exaustive search */
LikelihoodFunction small_ES = (filteredData, es_tree);
Optimize (res_ES, small_ES);

/* compute LRT p-value and write the results */
DF = res_ES[1][1]-resStar[1][1];
pv = 1-CChi2(2(res_ES[1][0]-resStar[1][0]),DF);
fprintf (baseOutPath,"\t",resStar[1][0],"\t",res_ES[1][0],"\t",DF,"\t",pv,"\t",Format(es_tree,0,1));
fprintf (stdout, "[ITERATE ", iterate + 1, "/",sampleCount," p = ", pv, "]\n");

if (pv<0.05){
counter_pv = counter_pv+1;
}

/* compute AIC and write results */

AICstar = 2*resStar[1][1] - 2*resStar[1][0];
AICes = 2*res_ES[1][1] - 2*res_ES[1][0];
if (AICes<AICstar){
   counter_aic = counter_aic+1;
   }

fprintf (stdout, "AIC ", " : ", "Star Tree", " = ", AICstar, "\t", "ES Tree", " = ", AICes, "\n");
fprintf (baseOutPath,"\tAIC results(df):\tStarTree(", resStar[1][1], "):\t", AICstar, "\tExaustiveTree(", res_ES[1][1], "):\t", AICes);

}

fprintf (stdout, "Proportion of significant p-values: ", counter_pv, "/", sampleCount, "\n");
fprintf (stdout, "Proportion of AIC_ES < AIC_Star: ", counter_aic, "/", sampleCount, "\n");
fprintf (baseOutPath, "Proportion of significant p-values: ", counter_pv, "/", sampleCount, "\n");
fprintf (baseOutPath, "Proportion of AIC_ES < AIC_Star: ", counter_aic, "/", sampleCount, "\n");

/* close file */
fprintf (baseOutPath, CLOSE_FILE);

Back to top

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: comparing star trees with non-star trees
Reply #14 - Apr 11^th, 2007 at 9:14am

Dear Bruno,

[quote]
-Next i would like to implement the Jackknife method, could you give me some pointers?
[/quote]

Take a look at the KH.bf file in TemplateBatchFiles.

[quote]
I would like to use AICc instead of AIC, but i dont know how to count the number of observations... is it somehow related to http://www.hyphy.org/cgi-bin/yabb/YaBB.pl?board=MOLEV;action=display;num=1164656629
or is this a simpler question?
[/quote]

It actually IS a difficult question; use the number of sites (filter.sites) to be on the conservative sde, use the number of characters (filter.sites*filter.species) to be on the liberal site.

[quote]
-Would it make sense to optimize the values of the substitution model for each subsample instead of using the values for the complete alignment? How time consuming would it be?
[/quote]

It would make very little difference in terms of scores but increase the run time significantly. Unless you think that the substitution process may vary from sample to sample for biological reasons, I would not estimate these parameters.

[quote]
-Ive seen that the number of d.f. of the estimated tree does not change. For the star tree d.f.=6, while for the estimate tree d.f. = 9. If i am not making some confusion, 6 is the number of free parameters of the HKY+G+I model itself (1 ts/tv + 3 nuc.freqs. + G + I). So for the star tree there are no parameters estimated? I know the tree topology is fixed, but the branch lengths are estimated, so why are they not counted as parameters? And as for the estimated tree, why are there always 9 d.f. if the number of nodes / branches varies amongst replicates (some best trees have politomies on them)? Is it because the branch exists but its estimated length is zero? Im sorry to get back to the d.f. thing (i think ive asked about d.f. in all the posts so far) but it seems to be an important value so i want to understand how it is calculated.
[/quote]

6 is the number of branch lengths in a star tree on 6 sequences and 9 is the number of branches in a fully resolved unrooted tree on 6 sequences. These are the counts of parameters which are actually adjusted by Optimize. Global and frequency parameters are not included in these counts (because they are fixed at values estimated previosuly). For LRT this does not matter, because only the DIFFERENCE in the numbers of parameters is important; for AIC - they are. You should take res[1][1] and add the number of global and frequency parameters. res[1][2] (from the first likelihood optimization) will store the number of global parameters.

Cheers,
Sergei

Back to top

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

	Welcome, Guest. Please Login