Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
GY vs MG simulations (Read 4415 times)
wayne
Global Moderator
*****
Offline


I love YaBB 1G - SP1!

Posts: 57
San Diego, CA
Gender: male
GY vs MG simulations
May 2nd, 2008 at 9:58am
 
Hi Sergei, I'm having trouble with a Muse Gaut simulation of data versus that of Goldman Yang. Goldman Yang where ef's are multiplied into the model after the rate matrix is defined, works well. however, when i incorporate frequencies of target nucleotides into the rate matrix i get substantially longer branch lengths than expected. i can prep some code to demo this if necessary. was just wondering if you had an immediate response. thanks ./w
Back to top
 

Assistant Project Scientist&&Antiviral Research Center&&Department of Pathology&&University of California, San Diego
WWW WWW  
IP Logged
 
wayne
Global Moderator
*****
Offline


I love YaBB 1G - SP1!

Posts: 57
San Diego, CA
Gender: male
Re: GY vs MG simulations
Reply #1 - May 2nd, 2008 at 9:59am
 
thanks for the promotion to vervet  Cheesy
Back to top
 

Assistant Project Scientist&&Antiviral Research Center&&Department of Pathology&&University of California, San Diego
WWW WWW  
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: GY vs MG simulations
Reply #2 - May 2nd, 2008 at 10:18am
 
Dear Wayne,

Make sure to include the third argument with value 0 when you define MG94 as in:

Code:
Model myModel = (rateMatrix, baseFrequencies, 0);
 



If you don't then each rate entry will be multiplied BOTH by the nuc. frequency (which you do explicitly) and the codon frequency (which is included by if the third argument is not 0).

HTH,
Sergei



Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: GY vs MG simulations
Reply #3 - May 2nd, 2008 at 2:52pm
 
wayne wrote on May 2nd, 2008 at 9:59am:
thanks for the promotion to vervet  Cheesy


You get to be Baboon after 50 posts:)

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
wayne
Global Moderator
*****
Offline


I love YaBB 1G - SP1!

Posts: 57
San Diego, CA
Gender: male
Re: GY vs MG simulations
Reply #4 - May 2nd, 2008 at 4:00pm
 
Hi, thanks. was doing this. i've included some test code. unless there is something i'm not following correctly with the setting of branch lengths for simulation. even so the results shouldn't differ between the two models right?

running the attached code with the GY option i get:

Tree length of base data = 1.48502
Tree length of simulated data = 0.40802

whereas i get the following with MG:

Tree length of base data = 1.48502
Tree length of simulated data = 4.89443

there is a 3*multiplier on the calculated nucleotide branch lengths so i do expect the tree length of simulated data to be longer, yet this should be consistent between the two models, right? please let me know if you can spot my error. thanks and regards ./w

Code:
/* set some parameters */
INPUTFILE = "Brumme_nef.30.nex";
TREEFILE = "Brumme_nef.30.tre";
SIMDATAFILE = "testsim.simdata";
INSTALL_DIR = "/usr/local/HYPHY_Source/";
fprintf ( SIMDATAFILE, CLEAR_FILE );
DATA_FILE_PRINT_FORMAT = 4;
BLMULTI = 3;
SIM1 = 0; /* if SIM1 = 1, then uses Simulate function, else uses SimulateDataSet function */
GY = 0; /* if true uses Goldman Yang parametisation, else Muse Gaut */

CODONS = { { "AAA", "AAC", "AAG", "AAT",
		   "ACA", "ACC", "ACG", "ACT",
		   "AGA", "AGC", "AGG", "AGT",
		   "ATA", "ATC", "ATG", "ATT",
		   "CAA", "CAC", "CAG", "CAT",
		   "CCA", "CCC", "CCG", "CCT",
		   "CGA", "CGC", "CGG", "CGT",
		   "CTA", "CTC", "CTG", "CTT",
		   "GAA", "GAC", "GAG", "GAT",
		   "GCA", "GCC", "GCG", "GCT",
		   "GGA", "GGC", "GGG", "GGT",
		   "GTA", "GTC", "GTG", "GTT",
		   "TAA", "TAC", "TAG", "TAT",
		   "TCA", "TCC", "TCG", "TCT",
		   "TGA", "TGC", "TGG", "TGT",
		   "TTA", "TTC", "TTG", "TTT" } };
AA = "FLIMVSPTAY*HQNKDECWRG";
DATA_FILE_PRINT_FORMAT = 4;

/* functions */
function GeneticCodeStuff ( nil )
{
	skipCodeSelectionStep = 1;
	incFileName = INSTALL_DIR + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "TemplateModels" + DIRECTORY_SEPARATOR + "chooseGeneticCode.def";
	ExecuteCommands ( "#include \""+incFileName+"\";");
	dummy = ApplyGeneticCodeTable ( DEFAULT_CODE_TABLE );
	ModelMatrixDimension = 64;
	for ( h = 0; h < 64; h = h + 1 ) {
		if ( _Genetic_Code[ h ] == 10 ) {
			ModelMatrixDimension = ModelMatrixDimension - 1;
		}
	}
	/* count number of alternate codons per aa and fill matrix */
	NumberAlternateCodons = { 21, 1 };
	for ( a = 0; a < 21; a = a + 1 ) {
		countAlternates = 0;
		for ( h = 0; h < 64; h = h + 1 ) {
			if ( _Genetic_Code [ h ] != 10 ) {
				if ( a == _Genetic_Code [ h ] ) {
					countAlternates = countAlternates + 1;
				}
			}
		}
		NumberAlternateCodons [ a ] = countAlternates;
	}
	return 0;
}

function CodonFrequencies ( nucFreqMatrix, mSize )
{
	PIStop = 1.0;
	result = { mSize, 1 };
	hshift = 0;
	for ( h = 0; h < 64; h = h + 1 )	{
		first = h$16;
		second = h%16$4;
		third = h%4;
		if ( _Genetic_Code[ h ] == 10 ) {
			hshift = hshift + 1;
			PIStop = PIStop - nucFreqMatrix[first][0]*nucFreqMatrix[second][1]*nucFreqMatrix[third][2];
		}
		else {
			result[ h-hshift ][0] = nucFreqMatrix[first][0]*nucFreqMatrix[second][1]*nucFreqMatrix[third][2];
		}
	}
	return result*( 1.0/PIStop );
}

function PopulateModelMatrix_GY ( ModelMatrixName&, EFV )
{
	ModelMatrixName = {ModelMatrixDimension,ModelMatrixDimension};
	hshift = 0;
	for (h=0; h<64; h=h+1)	{
		if (_Genetic_Code[h]==10) {
			hshift = hshift+1;
			continue;
		}
		vshift = hshift;
		for (v = h+1; v<64; v=v+1)	{
			diff = v-h;
			if (_Genetic_Code[v]==10)  {
				vshift = vshift+1;
				continue;
			}
			if ((h$4==v$4)||((diff%4==0)&&(h$16==v$16))||(diff%16==0)) {
				if (h$4==v$4)	{
					transition = v%4;
					transition2= h%4;
				}
				else {
					if(diff%16==0)	{
						transition = v$16;
						transition2= h$16;
					}
					else	{
						transition = v%16$4;
						transition2= h%16$4;
					}
				}
				if (_Genetic_Code[0][h]==_Genetic_Code[0][v]) {
					if (Abs(transition-transition2)%2)	{
						ModelMatrixName[h-hshift][v-vshift] := kappa*khs_synRate*t;
						ModelMatrixName[v-vshift][h-hshift] := kappa*khs_synRate*t;
					}
					else	{
						ModelMatrixName[h-hshift][v-vshift] := khs_synRate*t;
						ModelMatrixName[v-vshift][h-hshift] := khs_synRate*t;
					}
				}
				else	{
					if (Abs(transition-transition2)%2)	{
						ModelMatrixName[h-hshift][v-vshift] := kappa*c*khs_synRate*t;
						ModelMatrixName[v-vshift][h-hshift] := kappa*c*khs_synRate*t;
					}
					else	{
						ModelMatrixName[h-hshift][v-vshift] := c*khs_synRate*t;
						ModelMatrixName[v-vshift][h-hshift] := c*khs_synRate*t;
					}
				}
			}
		}
	}
	return 0;
}

function PopulateModelMatrix_MG ( ModelMatrixName&, EFV )
{
	ModelMatrixName = {ModelMatrixDimension,ModelMatrixDimension};
	hshift = 0;
	for (h=0; h<64; h=h+1)	{
		if (_Genetic_Code[h]==10) {
			hshift = hshift+1;
			continue;
		}
		vshift = hshift;
		for (v = h+1; v<64; v=v+1)	{
			diff = v-h;
			if (_Genetic_Code[v]==10)  {
				vshift = vshift+1;
				continue;
			}
			if ((h$4==v$4)||((diff%4==0)&&(h$16==v$16))||(diff%16==0)) {
				if (h$4==v$4)	{
					transition = v%4;
					transition2= h%4;
					nucPositionInCodon = 2;
				}
				else {
					if(diff%16==0)	{
						transition = v$16;
						transition2= h$16;
						nucPositionInCodon = 0;
					}
					else	{
						transition = v%16$4;
						transition2= h%16$4;
						nucPositionInCodon = 1;
					}
				}
				if (_Genetic_Code[0][h]==_Genetic_Code[0][v]) {
					if (Abs(transition-transition2)%2)	{
						ModelMatrixName[h-hshift][v-vshift] := kappa*khs_synRate*t*EFV__[transition__][nucPositionInCodon__];
						ModelMatrixName[v-vshift][h-hshift] := kappa*khs_synRate*t*EFV__[transition2__][nucPositionInCodon__];
					}
					else	{
						ModelMatrixName[h-hshift][v-vshift] := khs_synRate*t*EFV__[transition__][nucPositionInCodon__];
						ModelMatrixName[v-vshift][h-hshift] := khs_synRate*t*EFV__[transition2__][nucPositionInCodon__];
					}
				}
				else	{
					if (Abs(transition-transition2)%2)	{
						ModelMatrixName[h-hshift][v-vshift] := kappa*c*khs_synRate*t*EFV__[transition__][nucPositionInCodon__];
						ModelMatrixName[v-vshift][h-hshift] := kappa*c*khs_synRate*t*EFV__[transition2__][nucPositionInCodon__];
					}
					else	{
						ModelMatrixName[h-hshift][v-vshift] := c*khs_synRate*t*EFV__[transition__][nucPositionInCodon__];
						ModelMatrixName[v-vshift][h-hshift] := c*khs_synRate*t*EFV__[transition2__][nucPositionInCodon__];
					}
				}
			}
		}
	}
	return 0;
}

/* end of functions */

DataSet ds = ReadDataFile( INPUTFILE );
fscanf ( TREEFILE, "String", TREE );

/* get some nucleotide branch lengths */
DataSetFilter nucData = CreateFilter ( ds, 1 );
HarvestFrequencies ( nucFreqs, nucData, 1, 1, 1 );
global kappa_inv = 1;
HKY85_Matrix =	{	{ *, t*kappa_inv, t, t*kappa_inv }
				{ t*kappa_inv, *, kappa_inv*t,	t }
				{ t, t*kappa_inv, *, kappa_inv*t }
				{ t*kappa_inv,	t, kappa_inv*t,	* } };
Model HKY85_Model = ( HKY85_Matrix, nucFreqs, 1 );
Tree nucTree = TREE;
LikelihoodFunction lf = ( nucData, nucTree );
Optimize ( res, lf );
tvec = { 2*nucData.species-2, 1 };
branchNames = BranchName ( nucTree, -1 );
nucBL = BranchLength ( nucTree, -1);
sumBL = 0;
for ( k = 0; k < 2*nucData.species-2; k = k + 1 ) {
	thisbranch = branchNames [ k ];
	ExecuteCommands ( "tvec [ k ] = nucTree." + thisbranch + ".t;" );
	sumBL = sumBL + nucBL [ k ];
}
fprintf ( stdout, "Tree length of base data = ", sumBL, "\n" );

/* get base codon frequencies for simulation*/
dummy = GeneticCodeStuff ( 0 );
DataSetFilter codonData = CreateFilter ( ds, 3, "", "", GeneticCodeExclusions );
HarvestFrequencies ( nucleotidesByPos, codonData, 3, 1, 1 );
vectorOfFrequencies = CodonFrequencies ( nucleotidesByPos, ModelMatrixDimension );
/* set up simulation parameters and model */
global khs_synRate = 1;
global kappa = kappa_inv;
category c = ( 3, {{0.30,0.50,0.20}} , MEAN, ,{{0.05,1,4}}, 0, 1e25 );
modelMatrix = 0;
if ( GY ) {
	PopulateModelMatrix_GY ( "modelMatrix", nucleotidesByPos );
	Model M2amodel = ( modelMatrix, vectorOfFrequencies, 1 );
}
else {
	PopulateModelMatrix_MG ( "modelMatrix", nucleotidesByPos );
	Model M2amodel = ( modelMatrix, vectorOfFrequencies, 0 );
}
Tree M2atree = TREE;

/* set branches to scaled nucleotide length */

branchNames = BranchName ( M2atree, -1 );
for ( k = 0; k < Columns ( branchNames) - 1; k = k + 1 ) {
	thisbranch = branchNames [ k ];
	ExecuteCommands ( "M2atree." + thisbranch + ".t :=" + BLMULTI + "*" + tvec [ k ] + ";" );
}

charactersUniversalCode =	{ {"A","C","G","T" }
							{"3",GeneticCodeExclusions,"",""} };
if ( SIM1 ) {
	DataSet simData = Simulate ( M2atree, vectorOfFrequencies, charactersUniversalCode, codonData.sites, 0);
}
else {
	LikelihoodFunction lf = ( codonData, M2atree );
	DataSet simData = SimulateDataSet(lf);
}

/* calculate simulated BLs using nuc model */
DataSetFilter simDataFilter = CreateFilter ( simData, 1 );
HarvestFrequencies ( simFreqs, simDataFilter, 1, 1, 1 );
global kappa_inv = 1;
HKY85_Matrix =	{	{ *, t*kappa_inv, t, t*kappa_inv }
				{ t*kappa_inv, *, kappa_inv*t,	t }
				{ t, t*kappa_inv, *, kappa_inv*t }
				{ t*kappa_inv,	t, kappa_inv*t,	* } };
Model HKY85_Model = ( HKY85_Matrix, simFreqs, 1 );
Tree simTree = TREE;
LikelihoodFunction lf = ( simDataFilter, simTree );
Optimize ( res, lf );

BL = BranchLength ( simTree, -1);
sumBL = 0;
for ( k = 0; k < 2*simDataFilter.species-2; k = k + 1 ) {
	sumBL = sumBL + BL [ k ];
}
fprintf ( stdout, "Tree length of simulated data = ", sumBL, "\n" );

fprintf ( SIMDATAFILE, simDataFilter, "\n" );


 

Back to top
 

Assistant Project Scientist&&Antiviral Research Center&&Department of Pathology&&University of California, San Diego
WWW WWW  
IP Logged
 
wayne
Global Moderator
*****
Offline


I love YaBB 1G - SP1!

Posts: 57
San Diego, CA
Gender: male
Re: GY vs MG simulations
Reply #5 - May 4th, 2008 at 8:20am
 
hi, this has been racking my brain all weekend. as another test i modified your simulation code (i think that used in the "not so different paper") to compare MG versus GY. the modifications i made were

1. replace MG94custom.mdl with GY94.mdl in the simulation.config file
2. replace MG94custom matrix with GY94 matrix in the CodonSimulator.bf file

all other parameters were the same using random tree and branch lengths drawn @ random from exp distro (0.05) and default omega matrix. problem is the average BL's are an order of magnitude shorter on GY versus MG. any ideas? cheers ./w
Back to top
 

Assistant Project Scientist&&Antiviral Research Center&&Department of Pathology&&University of California, San Diego
WWW WWW  
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: GY vs MG simulations
Reply #6 - May 4th, 2008 at 1:18pm
 
Dear Wayne,

Will take a look today and see what I can find.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: GY vs MG simulations
Reply #7 - May 5th, 2008 at 7:34am
 
Dear Wayne,

This is probably the simplest way to get the scaling worked out properly:

Code:
/* one needs to work out a dynamic branch length scaler between nuc and codon models;
   here's one simple way to do it:

   Let N be the total branch length of the nucleotide tree

   1). Copy low level t parameters from nucTree to codonTree
   2). Compute the uncorrected length of codonTree (CU)
   3). Compute the ratio N/CU
   4). Rescale all low level parameters by the factor N/CU

*/

/* the constraint below is equivalent to the loop */

ReplicateConstraint ("this1.?.t:=this2.?.t", M2atree, nucTree);
codonBL = BranchLength (M2atree,-1);
codonBLSum = codonBL * Transpose(codonBL["1"]);
scalerNuc2Codon = sumBL/codonBLSum[0];
ClearConstraints (M2atree);
ReplicateConstraint ("this1.?.t:=scalerNuc2Codon__*this2.?.t", M2atree, nucTree);
 




Code:
codonBL["1"]
 


Populates a matrix of the same dimension as codonBL with 1's

HTH,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
wayne
Global Moderator
*****
Offline


I love YaBB 1G - SP1!

Posts: 57
San Diego, CA
Gender: male
Re: GY vs MG simulations
Reply #8 - May 5th, 2008 at 12:12pm
 
thanks Sergei, you have solved my life's problems.  Smiley i suspected it was something to do with the branch scaling, yet had tried several approaches. they didn't work since i was calculating a scaling factor without an optimisation (none of the smiley's have the dumb@$$ look). thanks for the help. much appreciated. cheers ./w
Back to top
 

Assistant Project Scientist&&Antiviral Research Center&&Department of Pathology&&University of California, San Diego
WWW WWW  
IP Logged