Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Pages: 1 2 
codon model (Read 8993 times)
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
codon model
Mar 27th, 2008 at 8:40am
 
Hi,

I am wondering if hyphy can help me estimate parameters of a model of molecular evolution.  Here is an example of the kind of model I am considering.  It is a mixture of five codon rate matrices, and is defined by these variables:
1 global kappa variable > 0
4 global nucleotide variables constrained to be positive and sum to 1
5 mixture variables constrained to be positive and sum to 1
20 category-specific variables constrained to be positive and sum to 1 for each category
I want to find the maximum likelihood estimate of all of these parameters, which is like
1+(4-1)+(5-1)+5*(20-1) = 105 degrees of freedom, given a bunch of ungapped codon alignments and a phylogenetic tree topology as input.  For each category I know how to write the category rate matrix as the product of a symmetric matrix and a vector of codon stationary frequencies.

Thanks,
Alex
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #1 - Mar 27th, 2008 at 11:40am
 
Hi Alex,
That's an awful lot of parameters to estimate in your model --- I hope you have tons of data!   It would make it easier to answer your question if you could be clearer about your variables.  I'm guessing that you're fitting a mixture of codon models crossed with HKY85, with stationary nucleotide frequencies to be estimated (else, why constrain your global variables to 1?). I'm also guessing that your 20 category-specific variables correspond to amino-acid stationary distributions that modify the codon substitution matrix, modeling variation over sites.

You can certainly carry out this sort of analysis using HyPhy.  The simplest approach would be to define your five separate codon substitution models and then assign all five to the same likelihood function with your "mixture" parameters to specify how to calculate site likelihood.  I'll post an example in a bit.

- Art.




Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #2 - Mar 27th, 2008 at 12:23pm
 
Okay, here's an example:

[code]
VERBOSITY_LEVEL = 11;

DataSet                  spectrinData = ReadDataFile ("/Applications/HyPhy/data/p51.nex");
DataSetFilter            filteredData = CreateFilter (spectrinData,1);

HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);

fprintf (stdout, observedFreqs, "\n");

/* stationary distribution in nucleotide frequencies */
global      eqFreqA = 0.25;
global      eqFreqC = 0.25;
global      eqFreqG = 0.25;
global      eqFreqT := 1.0 - eqFreqA - eqFreqC - eqFreqG;      eqFreqT :> 0;

/* another distribution for nucleotide frequencies */
global      eqFreqA2 = 0.25;
global      eqFreqC2 = 0.25;
global      eqFreqG2 = 0.25;
global      eqFreqT2 := 1.0 - eqFreqA2 - eqFreqC2 - eqFreqG2;      eqFreqT2 :> 0;

/* transition/transversion ratio */
global      kappa = 1.0;      kappa :> 0;

HKY85RateMatrix =

           {{*,a,kappa*a,a}

            {a,*,a,kappa*a}

            {kappa*a,a,*,a}

            {a,kappa*a,a,*}};

/* construct vectors from global variables */
estFreqs1 = {
                 {eqFreqA,
                 eqFreqC,
                 eqFreqG,
                 eqFreqT}
                 };
                 
estFreqs2 = {
                 {eqFreqA2,
                 eqFreqC2,
                 eqFreqG2,
                 eqFreqT2}
                 };

/* assign models to trees */
Model      m1 = (HKY85RateMatrix, estFreqs1);

Tree      givenTree      = DATAFILE_TREE;

Model      m2 = (HKY85RateMatrix, estFreqs2);

Tree      otherTree      = DATAFILE_TREE;

global      P = 0.5; /* mixing proportion */
P :> 0; P :< 1;

/* construct likelihood function using mixture of two rate distributions */
LikelihoodFunction      lf = (filteredData, givenTree, filteredData, otherTree, "Log(SITE_LIKELIHOOD[0]*P+(1-P)*SITE_LIKELIHOOD[1])");

Optimize(res, lf);

fprintf(stdout, res);
[/code]

This example is based on a code snippet that Sergei wrote in this forum: [url]http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1153484845[/url].
Unfortunately, the likelihood function isn't be computed correctly in either example.  I'm diagnosing it right now. 

- Art.
Back to top
 
 
IP Logged
 
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
Re: codon model
Reply #3 - Mar 27th, 2008 at 12:34pm
 
Thank you, that looks very much like what I want to do.
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #4 - Mar 27th, 2008 at 12:37pm
 
Okay, it turns out that my development build of HyPhy, which I used to test out the snippet, is broken for this function (I do this on a regular basis while developing new code).  Embarrassed
But this code snippet works fine with our distributed build of HyPhy.  Of course, you'll have to change the file path in the first line to your own data file, and you'll need to define your amino acid rate matrices (I carried out an example for mixtures of nucleotide rate matrices only, because I don't know how exactly you intend to specify alternate codon rate matrices).

Hope this helps!
- Art.
Back to top
 
 
IP Logged
 
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
Re: codon model
Reply #5 - Mar 31st, 2008 at 7:42am
 
This batch file seems to estimate a different set of branch lengths for each rate matrix.  How do I make it estimate only a single set of branch lengths?  Also, how would I tell HyPhy to use the branch lengths specified in the nexus file rather than estimating them?

Thanks,
Alex
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #6 - Mar 31st, 2008 at 10:07am
 
Hi Alex,

You can ask HyPhy to use the branch lengths specified in the NEXUS file by using this line:

Code:
ACCEPT_BRANCH_LENGTHS = 1; 



If you want to constrain both (all) components of the mixture model to use the branch lengths from your data file, use the following code snippet:

Code:
ACCEPT_BRANCH_LENGTHS = 1;
Tree myTree = DATAFILE_TREE;  /* arbitrary variable name */

branchNames  = BranchName (myTree, -1);  /* vector populated with names for each branch in tree */
branchLengths = BranchLength (myTree, -1);  /* vector populated with branch lengths */

for (k = 0; k <  Columns(branchNames)-1; k = k+1)
{
  ExecuteCommands("myTree."+branchNames[k]+".a:="+branchLengths[k]+";");
}
 



You'd need to append this snippet immediately following each tree definition. 

- Art.
Back to top
 
 
IP Logged
 
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
Re: codon model
Reply #7 - Apr 1st, 2008 at 1:04pm
 
After making those changes (at least my interpretation of those changes), the branch lengths of the two trees are still different, and they are different from those of the reference tree.  However, they appear to be different from the reference tree by only a scaling factor that is the same for each branch within a tree.

Are these scaling factors being estimated?  If so, how can I fix them to 1.0?  On the other hand if they are deterministic consequences of, for example, normalizing the expected rate of the component rate matrices, then how exactly are they calculated?

I have a separate question about how the likelihood is maximized.  Is HyPhy taking advantage of the 'mixture' structure of my likelihood function as it searches for the MLE?  It is possible to analytically find the maximum likelihood values of some of the parameters in the model conditional on the the data and the values of the other parameters.

Thanks,
Alex



[code]
VERBOSITY_LEVEL = 11;

DataSet                         spectrinData = ReadDataFile ("sim2.nex");

DataSetFilter           filteredData = CreateFilter (spectrinData,1);

HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);

fprintf (stdout, observedFreqs, "\n");

/* stationary distribution in nucleotide frequencies */
global  eqFreqA = 0.25;
global  eqFreqC = 0.25;
global  eqFreqG = 0.25;
global  eqFreqT := 1.0 - eqFreqA - eqFreqC - eqFreqG;   eqFreqT :> 0;

/* another distribution for nucleotide frequencies */
global  eqFreqA2 = 0.25;
global  eqFreqC2 = 0.25;
global  eqFreqG2 = 0.25;
global  eqFreqT2 := 1.0 - eqFreqA2 - eqFreqC2 - eqFreqG2;       eqFreqT2 :> 0;

/* transition/transversion ratio */
global  kappa = 1.0;    kappa :> 0;

HKY85RateMatrix =
               {{*,a,kappa*a,a}
                {a,*,a,kappa*a}
                {kappa*a,a,*,a}
                {a,kappa*a,a,*}};

/* construct vectors from global variables */
estFreqs1 = {
                       {eqFreqA,
                       eqFreqC,
                       eqFreqG,
                       eqFreqT}
                       };

estFreqs2 = {
                       {eqFreqA2,
                       eqFreqC2,
                       eqFreqG2,
                       eqFreqT2}
                       };
/* assign models to trees */
Model   m1 = (HKY85RateMatrix, estFreqs1);

ACCEPT_BRANCH_LENGTHS = 1;
Tree    givenTree       = DATAFILE_TREE;
branchNames  = BranchName (givenTree, -1);  /* vector populated with names for each branch in tree */
branchLengths = BranchLength (givenTree, -1);  /* vector populated with branch lengths */
for (k = 0; k <  Columns(branchNames)-1; k = k+1)
{
  ExecuteCommands("givenTree."+branchNames[k]+".a:="+branchLengths[k]+";");
}

Model   m2 = (HKY85RateMatrix, estFreqs2);

ACCEPT_BRANCH_LENGTHS = 1;
Tree    otherTree       = DATAFILE_TREE;
branchNames  = BranchName (otherTree, -1);  /* vector populated with names for each branch in tree */
branchLengths = BranchLength (otherTree, -1);  /* vector populated with branch lengths */
for (k = 0; k <  Columns(branchNames)-1; k = k+1)
{
  ExecuteCommands("otherTree."+branchNames[k]+".a:="+branchLengths[k]+";");
}

global  P = 0.5; /* mixing proportion */
P :> 0; P :< 1;

/* construct likelihood function using mixture of two rate distributions */
LikelihoodFunction      lf = (filteredData, givenTree, filteredData, otherTree, "Log(SITE_LIKELIHOOD[0]*P+(1-P)*SITE_LIKELIHOOD[1])");

Optimize(res, lf);

fprintf(stdout, res);
fprintf(stdout, lf);

[/code]
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #8 - Apr 1st, 2008 at 3:16pm
 
Hi Alex,

Of course, estimates of branch lengths are confounded with respect to evolutionary rates and chronological time.  Hence, the assignment of different substitution models in your mixture model has a scaling effect on the tree.  You can get HyPhy to compute the scaling factors quite easily; here's a revised version of the code that reports scaling factors for each model/tree:

Code:
function computeScalingFactorB(rateMatrix, baseFreqs)
{
	B = 0;
	for (n1 = 0; n1 < Rows(rateMatrix); n1 = n1+1)
	{
		for (n2 = 0; n2 < Columns(rateMatrix); n2 = n2+1)
		{
			if (n2 != n1)
			{
				B = B + baseFreqs[n1]*baseFreqs[n2]*rateMatrix[n1][n2];
			}
		}
	}
	return B;
}

VERBOSITY_LEVEL = 1;

ACCEPT_BRANCH_LENGTHS = 1;

DataSet				 spectrinData = ReadDataFile ("/Applications/HyPhy/data/p51_1.nex");

DataSetFilter	     filteredData = CreateFilter (spectrinData,1);

HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);

fprintf (stdout, observedFreqs, "\n");

/* stationary distribution in nucleotide frequencies */
global  eqFreqA = 0.25;
global  eqFreqC = 0.25;
global  eqFreqG = 0.25;
global  eqFreqT := 1.0 - eqFreqA - eqFreqC - eqFreqG;   eqFreqT :> 0;

/* another distribution for nucleotide frequencies */
global  eqFreqA2 = 0.25;
global  eqFreqC2 = 0.25;
global  eqFreqG2 = 0.25;
global  eqFreqT2 := 1.0 - eqFreqA2 - eqFreqC2 - eqFreqG2;	 eqFreqT2 :> 0;

/* transition/transversion ratio */
global  kappa = 1.0;    kappa :> 0;

HKY85RateMatrix =
		    {{*,a,kappa*a,a}
		     {a,*,a,kappa*a}
		     {kappa*a,a,*,a}
		     {a,kappa*a,a,*}};

/* construct vectors from global variables */
estFreqs1 = {
				{eqFreqA,
				eqFreqC,
				eqFreqG,
				eqFreqT}
				};

estFreqs2 = {
				{eqFreqA2,
				eqFreqC2,
				eqFreqG2,
				eqFreqT2}
				};
/* assign models to trees */
Model   m1 = (HKY85RateMatrix, estFreqs1);


Tree    givenTree	 = DATAFILE_TREE;
branchNames  = BranchName (givenTree, -1);  /* vector populated with names for each branch in tree */
branchLengths = BranchLength (givenTree, -1);  /* vector populated with branch lengths */
for (k = 0; k <  Columns(branchNames)-1; k = k+1)
{
  ExecuteCommands("givenTree."+branchNames[k]+".a:="+branchLengths[k]+";");
}

Model   m2 = (HKY85RateMatrix, estFreqs2);

ACCEPT_BRANCH_LENGTHS = 1;
Tree    otherTree	 = DATAFILE_TREE;
branchNames  = BranchName (otherTree, -1);  /* vector populated with names for each branch in tree */
branchLengths = BranchLength (otherTree, -1);  /* vector populated with branch lengths */
for (k = 0; k <  Columns(branchNames)-1; k = k+1)
{
  ExecuteCommands("otherTree."+branchNames[k]+".a:="+branchLengths[k]+";");
}

global  P = 0.5; /* mixing proportion */
P :> 0; P :< 1;

/* construct likelihood function using mixture of two rate distributions */
LikelihoodFunction	lf = (filteredData, givenTree, filteredData, otherTree, "Log(SITE_LIKELIHOOD[0]*P+(1-P)*SITE_LIKELIHOOD[1])");

Optimize(res, lf);

/* fprintf(stdout, res); */
fprintf(stdout, lf);

fprintf (stdout, "\nScaling factors:", computeScalingFactorB(HKY85RateMatrix, estFreqs1), ", ", computeScalingFactorB(HKY85RateMatrix, estFreqs2), "\n");
 




Note that I had to bump up the statement "ACCEPT_BRANCH_LENGTHS" in order to handle the NEXUS file correctly.  Also, I toned down the output by setting VERBOSITY_LEVEL back to 1; this speeds things up a bit.

- Art.

Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #9 - Apr 1st, 2008 at 3:43pm
 
In response to the second part of your message:

HyPhy is computing likelihood in the mixture model for each site of the alignment using the formula provided as the fifth argument in LikelihoodFunction() so, as you say, HyPhy is using the mixture structure of the likelihood function in its model fitting procedure.  Though I have to admit that I find a mixture of stationary frequency distributions rather odd.  Huh

I don't understand what you're asking for, with regards to finding MLEs conditional on the values of other parameters.  I suppose you could either implement meta-parameters in the model to represent conditional dependence between the existing parameters.

Best,
- Art.
Back to top
 
 
IP Logged
 
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
Re: codon model
Reply #10 - Apr 7th, 2008 at 11:59am
 
Hi,

HyPhy seems to estimate a rate for each mixture component.  How do I ask HyPhy to not do this?

I simplified the model down to a 'mixture' of a single HKY component (i.e. not a mixture).  Here is the script I used:

[code]
VERBOSITY_LEVEL = 1;
ACCEPT_BRANCH_LENGTHS = 1;
DataSet spectrinData = ReadDataFile ("/var/www/python_scripts/data/hyphy.nex");
DataSetFilter filteredData = CreateFilter (spectrinData,1);
HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);
/* stationary distribution in nucleotide frequencies */
global  eqFreqA = 0.25;
global  eqFreqC = 0.25;
global  eqFreqG = 0.25;
global  eqFreqT := 1.0 - eqFreqA - eqFreqC - eqFreqG;   eqFreqT :> 0;
/* transition/transversion ratio */
global  kappa = 1.0;    kappa :> 0;
HKY85RateMatrix =
 {{*,a,kappa*a,a}
  {a,*,a,kappa*a}
  {kappa*a,a,*,a}
  {a,kappa*a,a,*}};
/* construct vectors from global variables */
estFreqs =
 {{eqFreqA,
   eqFreqC,
   eqFreqG,
   eqFreqT}
 };
/* assign models to trees */
Model m1 = (HKY85RateMatrix, estFreqs);
Tree givenTree = DATAFILE_TREE;
branchNames = BranchName (givenTree, -1);  /* vector populated with names for each branch in tree */
branchLengths = BranchLength (givenTree, -1);  /* vector populated with branch lengths */
for (k = 0; k < Columns(branchNames)-1; k = k+1)
{
 ExecuteCommands("givenTree."+branchNames[k]+".a:="+branchLengths[k]+";");
}
LikelihoodFunction lf = (filteredData, givenTree, "Log(SITE_LIKELIHOOD[0])");
Optimize(res, lf);
fprintf(stdout, "\n");
fprintf(stdout, res);
fprintf(stdout, lf);
[/code]

The Nexus file is:
[code]
#NEXUS
BEGIN TAXA;
 DIMENSIONS ntax = 5;
 TAXLABELS Human Gorilla Gibbon Orangutan Chimpanzee;
END;
BEGIN TREES;
 TREE primates = (((Human:0.1, Chimpanzee:0.2):0.8, Gorilla:0.3):0.7, Orangutan:0.4, Gibbon:0.5);
END;
BEGIN CHARACTERS;
 DIMENSIONS nchar = 100;
 FORMAT datatype = DNA;
 MATRIX
   Human      CGGATGACGGTCTGGCGAATCCCCCTTCATGACCTTTGTTGGTCTAAGGTTGCGGGTTAGCCTAATGCACAGGCGGGGGCCGGTCGGGCGCCGCACCGTC
   Gorilla    CAAAACATTTGCGGAAACGTGAGCCCTCGTCTCCTGCGTCGCCTCAGTGAGCCGTCACCTCGACCAAGGCGGACGCCCGCCGGTTCGCCGCCGTGCCGCG
   Gibbon     GGCAGTATTTCCATTGCGATGGTTCACTTACACCGGTGAACACCAAATTCGCAGCCACGTCATCAAGCGGGCGTCTAAGCGGAGTGGTTGGCTGGCCCCC
   Orangutan  CGTTCGATCTGCTCCTCGGTAGTCCCCTTACACCTGTGGTACCCAAGTTTACGGCCCCCGCAAGCCGAGGTCGCCCACGTGGTTTCGACGTTCGGCGCTC
   Chimpanzee CGGATGACCGTCTGGCGGATCACTCTTCATGTGCTTTGTTAGTGTAAGGGTGCGCGTCAGCCATTCGCACGGGCAGGGACCGGTTGGGCGCCGCACCCAC
 ;
END;
[/code]

This gives the following results:
[code]
tree : (((HUMAN:0.0798308, CHIMPANZEE:0.159662)Node2:0.638646, GORILLA:0.239492)Node1:0.558816, ORANGUTAN:0.319323, GIBBON:0.399154);
log likelihood : -585.138801549
kappa : 2.39211
A : 0.173
C : 0.315233
G : 0.303122
T : 0.208646
[/code]

For comparison, I used PAML to do the same thing.  It gives:
[code]
log likelihood : -583.874748
kappa : 1.89613
A : 0.1712
C : 0.3193
G : 0.29861
T : 0.21089
[/code]

I independently verified these likelihood results, so I understand the relationship among the rate matrices, tree topology, branch lengths, and sequence alignment.  It seems that HyPhy is using an extra degree of freedom that PAML is not.  This is being used to estimate a scaling factor for the branch lengths, or equivalently, a scaling factor for the rate matrix.  Is it possible to define a mixture model in HyPhy so that this extra degree of freedom is not estimated for each component?

Thanks,
Alex
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #11 - Apr 7th, 2008 at 1:11pm
 
Hi Alex,

You can constrain the mixture component parameters by the following:

Code:
P := 0.5;
 



As for your example, HyPhy is reporting only 4 free parameters, namely three nucleotide frequencies and kappa.  I'm not picking up this extra parameter you're talking about  Undecided  Scaling of the tree to the expected number of substitutions should not require any additional parameters, as this is only a function of the stationary distribution of nucleotide frequencies and parameters in the rate matrix.

Cheers,
- Art.


Back to top
 
 
IP Logged
 
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
Re: codon model
Reply #12 - Apr 7th, 2008 at 2:40pm
 
Hi Art,

Now I see that the estimated human branch length (0.0798308) divided by the expected rate of the estimated rate matrix (1.0644...) is equal to the initial human branch length (0.1) times the expected rate of the initial rate matrix (.75).  I'm probably being very dense, but could you please help me understand why it's like this?

Thanks,
Alex
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: codon model
Reply #13 - Apr 7th, 2008 at 3:45pm
 
Hi Alex,

Branch lengths are scaled in units of expected number of substitutions, which is obtained from the substitution rate matrix and is initially equal to 1.  Though you've constrained branch lengths to values specified in the data file, the units are being rescaled as parameter estimates (nucleotide frequencies, kappa) are updated.  Does that make sense?

Cheers,
- Art.
Back to top
 
 
IP Logged
 
alex
YaBB Newbies
*
Offline



Posts: 8
Gender: male
Re: codon model
Reply #14 - Apr 10th, 2008 at 3:26am
 
In that case I want to constrain the branch lengths with units that are scaled with respect to the updated parameter estimates instead of with respect to the initial parameter values.  Is there a way to do this?

Thanks,
Alex
Back to top
 
 
IP Logged
 
Pages: 1 2