HyPhy message board - Print Page

Hi everyone,

For rate category variables, I'm looking for a way to combine a discretised gamma with equal-probability rate categories, (see code snippet), with a fixed-location discrete component.

This would be much like the "Gamma with a class of invariant sites" model, except I want to place the discrete component at a particular non-zero rate r and have probability p (so that the Gamma distribution is multiplied by 1-p).

I don't see any code in "Examples/Categories" that implements a Gamma+Invariant model, and my attempts to achieve this effect have bombed out with errors like "can't have a category variable depend on a category variable".

Does anyone have an idea on how to go about doing this?

global alpha = 0.5; alpha:>0.01; alpha:<100; beta:=alpha; category c = ( 5, /* number of rates */ EQUAL, /* probs. of rates */ MEAN, /* sampling method */ GammaDist(_x_,alpha,beta), /* density */ CGammaDist(_x_,alpha,beta), /* CDF */ 0, /* left bound */ 1e5, /* right bound */ CGammaDist(_x_,alpha+1,beta)*alpha/beta /* "antiderivative" of x f(x) */ );

Dear jettlogic,

Perhaps the most robust way to do this is to utilize two-level model mixing - a category variable to define the gamma component, and a likelihood function composition to add a discrete component. You can also define it as a mixture density, but I found that gamma+point tends to be numerically unstable on the boundaries. I am attaching a simple example with nucleotide data for you to work with. Please don't hesistate to ask if something is unclear

Code (]
GetURL (theAlignment,"Multimedia File Viewing and Clickable Links are available for Registered Members only!! You need to Login;
DataSet nucleotideSequences = ReadFromString (theAlignment);
DataSetFilter filteredData = CreateFilter (nucleotideSequences,1);
HarvestFrequencies (observedFreqs,filteredData,1,1,1);

global TV_TS = 1;

/* define two matrices:
one for the gamma component and
one for the discrete component */

global alpha = 1;
alpha:>0.01;alpha:<100;
global beta = 1;
beta:>0.01;beta:<100;

category c = (4, /* number of rates */
EQUAL, /* probs. of rates */
MEAN, /* sampling method */
GammaDist(_x_,alpha,beta), /* density */
CGammaDist(_x_,alpha,beta), /*CDF*/
0, /*left bound*/
1e25, /*right bound*/
CGammaDist(_x_,alpha+1,beta)*alpha/beta
);

HKY85RateMatrixGamma =
[*,c*t*TV_TS,c*t,c*t*TV_TS}
{c*t*TV_TS,*,c*t*TV_TS,c*t}
{c*t,c*t*TV_TS,*,c*t*TV_TS}
{c*t*TV_TS,c*t,c*t*TV_TS,*];

global PT = 0.5; /* this is the point mass */

HKY85RateMatrixPoint =
[*,PT*t*TV_TS,PT*t,PT*t*TV_TS}
{PT*t*TV_TS,*,PT*t*TV_TS,PT*t}
{PT*t,PT*t*TV_TS,*,PT*t*TV_TS}
{PT*t*TV_TS,PT*t,PT*t*TV_TS,*];

Model HKY85Gamma = (HKY85RateMatrixGamma, observedFreqs);
Tree givenTree1 = DATAFILE_TREE;

Model HKY85Point = (HKY85RateMatrixPoint, observedFreqs);
Tree givenTree2 = DATAFILE_TREE;

global P = 0.5; /* mixing proportion of the gamma + point*/
P:<1;

/* note that the mean of the rate distribution will be confounded
with the length of the tree. Hence we constrain PT, so that the
mean of P * gamma + (1-P) * point = 1, i.e.

beta/alpha * P + (1-P) * PT = 1 => PT = (1-beta/alpha*P) / (1-P).

Sometimes this can force PT < 0, but boundary violations are heavily
penalized. If alpha = beta, then PT is always > 0) */

beta:=alpha;
PT := (1-beta/alpha*P) / (1-P);

/* note the extra parameter in call to likeFunc constructor to indicate
how the site wise probabilities from each partition are to be combined
into a log likelihood */

LikelihoodFunction likeFunc = (filteredData, givenTree1,filteredData,givenTree2,"Log(SITE_LIKELIHOOD[0):

*P+(1-P)*SITE_LIKELIHOOD[1])");
Optimize (paramValues, likeFunc);
fprintf (stdout, likeFunc);

Hi Sergei,

Thank you for the tips! I wouldn't have thought to use the extra parameter to the LikelihoodFunction to carry out the mixture as a mixture of two complete models.

The resulting likelihoods behave well when comparing two mixtures (constraining some parameters reduces the likelihood, as expected).

However, a simple (non-mixture) model like plain gamma or even plain point mass (!) comes out more likely than the mixture (by a few log-likelihood points), although the mixture is more general. (point mass significantly less likely than gamma of course)

I have a few hypothesis, but after a lot of testing I suspect the mixture model is converging on some particular local optimum, or is converging too slowly to reach the true optimum in a reasonable step limit.

I think the right approach is to optimise params for the gamma model, then for the point-mass model (keeping the TV_TS from gamma?), then optimise P such that the sum of "Log(SITE_LIKELIHOOD[0]*P+(1-P)*SITE_LIKELIHOOD[1])" over sites is maximised. However, if HyPhy is optimising everything simultaneously it could explain the local or slow convergence (?)

Does this sound right to you? Would you like to see the test code?

Dear jettlogic,

wrote on Aug 1^st, 2006 at 9:25am:

Does this sound right to you? Would you like to see the test code?

This does sound like a convergence issue. I'd definitely like to see the test code (you could just e-mail it to me) and help you resolve the issue.

Cheers,
Sergei

To check that the model mixture isn't converging, I compared it to something conceptually equivalent but implemented without model mixture.

Since I don't know how to make gamma+point mass using just a category variable (would like to know how, please!), I used gamma+invariant. The first uses 5-cat category var (first being invariant sites), and the second a 4-cat gamma mixed with an invariant model. The results are for P51.nex.

For this code, the model-mixture gets a worse likelihood, and reject having invariant sites by setting IP=0 (!). I also noticed that with gamma+point-mass the model mixture rejects putting a point-mass anywhere (P=0), and produces worse-than-gamma likelihoods.

What do you think is causing the above?

Code (]
Gamma+Invariant using category variable
Log Likelihood = -3285.62637458753;
Shared Parameters:
IP=0.205486
alpha=0.369268
TV_TS=0.0989337
beta=alpha/(1-IP)=0.464773

Gamma+Invariant using model mixture
Log Likelihood = -3286.09788196153;
Shared Parameters:
IP=0
alpha=0.225829
TV_TS=0.0992244
beta=alpha/(1-IP)=0.225829
[/code):

With code being:

[code]
/********************************************************************
Read input data
********************************************************************/

/* Read in P51 HyPhy example alignment */
DataSet nucleotideSequences = ReadDataFile ( "p51.nex" );

/* Filter the data (all data, as nucleotides) */
DataSetFilter filteredData = CreateFilter ( nucleotideSequences, 1 );

/* Get observed nucleotide frequencies */
HarvestFrequencies ( observedFreqs, filteredData, 1, 1, 1 );

/********************************************************************
Gamma+Invariant LF using category variable
********************************************************************/

fprintf ( stdout, "Gamma+Invariant using category variable\n\n" );

global alpha = 1.0; alpha:>0.01; alpha:<100; /* shape parameter */
global beta = 1.0; beta:>0.01; beta:<100; /* beta paramter */
global TV_TS; TV_TS:>0; /* transversion/transition rate ratio */
global ncats = 5; ncats:=5; /* number of rate categories */
global IP = 1/ncats; IP :< 1-(1e-10); /* weight of invariant category */
beta:=alpha/(1-IP); /* mean of rate dist:=1 => (1-IP)*(beta/alpha)+IP*0 := 1 */

catFreqMatrix = { ncats, 1 };
catFreqMatrix[0] := IP;

for (h=1; h<ncats; h=h+1) {
catFreqMatrix[h] := (1-IP)/(ncats-1);
}

category c = (
ncats, /* number of rates */
catFreqMatrix, /* weights */
MEAN, /* sampling method */
(1-IP)*GammaDist(_x_,alpha,beta)*(_x_>0), /* density */
(1-IP)*CGammaDist(_x_,alpha,beta)*(_x_>0)+IP, /* CDF */
0, /* left bound */
1e5, /* right bound */
(1-IP)*CGammaDist(_x_,alpha+1,beta)*alpha/beta*(_x_>0) /* mean cumulative function */
);

HKY85RateMatrixGamma =
[* ,c*t*TV_TS,c*t ,c*t*TV_TS}
{c*t*TV_TS,* ,c*t*TV_TS,c*t }
{c*t ,c*t*TV_TS,* ,c*t*TV_TS}
{c*t*TV_TS,c*t ,c*t*TV_TS,* ];

Model HKY85Gamma = ( HKY85RateMatrixGamma, observedFreqs );
Tree givenTree = DATAFILE_TREE;

LikelihoodFunction likeFunc = ( filteredData, givenTree );

Optimize ( paramValues, likeFunc );
fprintf ( stdout, likeFunc, "\n\n" );

/********************************************************************
Gamma+Invariant LF using model mixture
********************************************************************/

fprintf ( stdout, "Gamma+Invariant using model mixture\n\n" );

global alpha = 1.0; alpha:>0.01; alpha:<100; /* shape parameter */
global beta = 1.0; beta:>0.01; beta:<100; /* beta paramter */
global TV_TS; TV_TS:>0; /* transversion/transition rate ratio */
global IP = 0.5; IP :< 1-(1e-10); /* Weight of invariant */
beta:=alpha/(1-IP); /* mean of rate dist:=1 => (1-IP)*(beta/alpha)+IP*0 := 1 */

category c = (
ncats-1, /* number of rates */
EQUAL, /* weights */
MEAN, /* sampling method */
GammaDist(_x_,alpha,beta), /* density */
CGammaDist(_x_,alpha,beta), /* CDF */
0, /* left bound */
1e5, /* right bound */
CGammaDist(_x_,alpha+1,beta)*alpha/beta /* mean cumulative function */
);

HKY85RateMatrixGamma =
[* ,c*t*TV_TS,c*t ,c*t*TV_TS}
{c*t*TV_TS,* ,c*t*TV_TS,c*t }
{c*t ,c*t*TV_TS,* ,c*t*TV_TS}
{c*t*TV_TS,c*t ,c*t*TV_TS,* ];

Model HKY85Gamma = ( HKY85RateMatrixGamma, observedFreqs );
Tree givenTree1 = DATAFILE_TREE;

RateMatrixInvariant =
[1,0,0,0}
{0,1,0,0}
{0,0,1,0}
{0,0,0,1];

Model Invariant = ( RateMatrixInvariant, observedFreqs );
Tree givenTree2 = DATAFILE_TREE;

LikelihoodFunction likeFunc = ( filteredData, givenTree1, filteredData, givenTree2,
"Log(SITE_LIKELIHOOD[0]*(1-IP)+SITE_LIKELIHOOD[1]*IP)" );

Optimize ( paramValues, likeFunc );
fprintf ( stdout, likeFunc, "\n\n" );

Dear Graham,

This example code revealed a bug in how the HyPhy core handles model mixtures when one of the components had category variables, and the other did not.

I greatly appreciate your bringing this to my attention. The problem has now been fixed (as of Aug 9th 2006) builds, and the slightly modified example code from your example (pasted here), now gives very similar results with two parameterizations. I realized that my earlier example was not scaling the mean of the mixture to be one correctly, and I also adjusted the number of rate categories to be the same in the gamma component in both cases.

Code (]
DataSet nucleotideSequences = ReadDataFile ("../data/p51.nex");
DataSetFilter filteredData = CreateFilter ( nucleotideSequences, 1 );

/* Get observed nucleotide frequencies */
HarvestFrequencies ( observedFreqs, filteredData, 1, 1, 1 );

/********************************************************************
Gamma+Invariant LF using category variable
********************************************************************/

fprintf ( stdout, "Gamma+Invariant using category variable\n\n" );

global alpha = 1.0; alpha:>0.01; alpha:<100; /* shape parameter */
global beta = 1.0; beta:>0.01; beta:<100; /* beta paramter */
global TV_TS; TV_TS:>0; /* transversion/transition rate ratio */
global ncats = 5; /* number of rate categories */
global IP = 1/ncats;
IP :< 1-(1e-10); /* weight of invariant category */
beta:=alpha*(1-IP); /* mean of rate dist:=1 => (1-IP)*(alpha/beta)+IP*0 := 1 */

catFreqMatrix = { ncats, 1 };
catFreqMatrix[0):

:= IP;

for (h=1; h<ncats; h=h+1) {
catFreqMatrix[h] := (1-IP)/(ncats__-1);
}

category c = (
ncats, /* number of rates */
catFreqMatrix, /* weights */
MEAN, /* sampling method */
(1-IP)*GammaDist(_x_,alpha,beta)*(_x_>0), /* density */
(1-IP)*CGammaDist(_x_,alpha,beta)*(_x_>0)+IP, /* CDF */
0, /* left bound */
1e5, /* right bound */
(1-IP)*CGammaDist(_x_,alpha+1,beta)*alpha/beta*(_x_>0) /* mean cumulative function */
);

HKY85RateMatrixGamma =
[* ,c*t*TV_TS,c*t ,c*t*TV_TS}
{c*t*TV_TS,* ,c*t*TV_TS,c*t }
{c*t ,c*t*TV_TS,* ,c*t*TV_TS}
{c*t*TV_TS,c*t ,c*t*TV_TS,* ];

Model HKY85Gamma = ( HKY85RateMatrixGamma, observedFreqs );
Tree givenTree = DATAFILE_TREE;

LikelihoodFunction likeFunc = ( filteredData, givenTree );

Optimize ( paramValues, likeFunc );
fprintf ( stdout, likeFunc, "\n\n" );

/********************************************************************
Gamma+Invariant LF using model mixture
********************************************************************/

fprintf ( stdout, "Gamma+Invariant using model mixture\n\n" );

category c = (
ncats-1, /* number of rates */
EQUAL, /* weights */
MEAN, /* sampling method */
GammaDist(_x_,alpha,beta), /* density */
CGammaDist(_x_,alpha,beta), /* CDF */
0, /* left bound */
1e5, /* right bound */
CGammaDist(_x_,alpha+1,beta)*alpha/beta /* mean cumulative function */
);

HKY85RateMatrixGamma =
[* ,c*t*TV_TS,c*t ,c*t*TV_TS}
{c*t*TV_TS,* ,c*t*TV_TS,c*t }
{c*t ,c*t*TV_TS,* ,c*t*TV_TS}
{c*t*TV_TS,c*t ,c*t*TV_TS,* ];

Model HKY85Gamma = ( HKY85RateMatrixGamma, observedFreqs );
Tree givenTree1 = DATAFILE_TREE;

RateMatrixInvariant =
[0,0,0,0}
{0,0,0,0}
{0,0,0,0}
{0,0,0,0];

Model Invariant = ( RateMatrixInvariant, observedFreqs );
Tree givenTree2 = DATAFILE_TREE;

LikelihoodFunction likeFunc = ( filteredData, givenTree1, filteredData, givenTree2,
"Log(SITE_LIKELIHOOD[0]*(1-IP)+SITE_LIKELIHOOD[1]*IP)" );

Optimize ( paramValues, likeFunc );
fprintf ( stdout, likeFunc, "\n\n" );

Output
[code]
amma+Invariant using category variable

Log Likelihood = -3285.32599393655;
Shared Parameters:
IP=0.438837
alpha=0.791996
TV_TS=0.0980606
beta=alpha*(1-IP)=0.444439

Tree givenTree=((((D_CD_83_ELI_ACC_K03454:0.0214343,D_CD_83_NDK_ACC_M27323:0.0101549)Node3:0.0121809,D_UG_94_94UG114_ACC_U88824:0.0659841)Node2:0.00275392,D_CD_84_84ZR085_ACC_U88822:0.0326156)Node1:0.0278006,B_US_83_RF_ACC_M17451:0.0301682,((B_FR_83_HXB2_ACC_K03455:0.0124699,B_US_86_JRFL_ACC_U63632:0.0189762)Node10:0.00189701,B_US_90_WEAU160_ACC_U21135:0.0225559)Node9:0.00470909);

Gamma+Invariant using model mixture

Log Likelihood = -3285.32535002373;
Shared Parameters:
IP=0.440838
alpha=0.799072
TV_TS=0.0980331
beta=alpha*(1-IP)=0.446811

Tree givenTree1=((((D_CD_83_ELI_ACC_K03454:0.0383084,D_CD_83_NDK_ACC_M27323:0.0181054)Node3:0.0218335,D_UG_94_94UG114_ACC_U88824:0.118069)Node2:0.00489725,D_CD_84_84ZR085_ACC_U88822:0.0583811)Node1:0.0497151,B_US_83_RF_ACC_M17451:0.0539374,((B_FR_83_HXB2_ACC_K03455:0.0222191,B_US_86_JRFL_ACC_U63632:0.0340032)Node10:0.00337515,B_US_90_WEAU160_ACC_U21135:0.0403464)Node9:0.0084376);
Tree givenTree2=((((D_CD_83_ELI_ACC_K03454:0,D_CD_83_NDK_ACC_M27323:0)Node3:0,D_UG_94_94UG114_ACC_U88824:0)Node2:0,D_CD_84_84ZR085_ACC_U88822:0)Node1:0,B_US_83_RF_ACC_M17451:0,((B_FR_83_HXB2_ACC_K03455:0,B_US_86_JRFL_ACC_U63632:0)Node10:0,B_US_90_WEAU160_ACC_U21135:0)Node9:0);
[/code]

Cheers,
Sergei

HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl Methodology Questions >> How to >> Category vars: how to do gamma+discrete component? http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1153484845 Message started by jettlogic on Jul 21^st, 2006 at 5:27am