HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
Methodology Questions >> How to >> building likelihood function
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1229363502

Message started by gustavo on Dec 15th, 2008 at 9:51am

Title: building likelihood function
Post by gustavo on Dec 15th, 2008 at 9:51am
Hi,

I am working with site-specific models and then I have to build a likelihood function containing the n positions in an alignment like this:

LikelihoodFunction lk = (class0,tree0, class1,tree1,class2, tree2....classn,treen);

I have tried using ExecuteCommands as:

ExecuteCommands("LikelihoodFunction lk = (class0,tree0,");
      for(i=1;i<29;i=i+1){
       ExecuteCommands("class"+i+",tree"+i+", ");
      }
       ExecuteCommands("class29,tree29); ");

for an alignment with n=30, but it doesn't work.
Could you please help me with this script?
Thank you very much.

gustavo

Title: Re: building likelihood function
Post by Sergei on Dec 15th, 2008 at 10:39am
Dear Gustavo,

ExecuteCommands does not permit command accumulation - it expects fully formed statements. Try building up the entire command as in the following code snippet.

Sergei

[code]
lfDefCommand = "";
lfDefCommand * "LikelihoodFunction lk = (class0,tree0,";
for(i=1;i<29;i=i+1){
      lfDefCommand * ("class"+i+",tree"+i+", ");
}
lfDefCommand * ("class"+i+",tree"+i+") ");
lfDefCommand * 0;
fprintf (stdout, "\nExecuting: ", lfDefCommand, "\n");
ExecuteCommands (lfDefCommand);
[/code]

Title: Re: building likelihood function
Post by gustavo on Dec 24th, 2008 at 8:59am
Dear Sergei,
Thank you very much for your help!

I have still some problems using site-specific substitution matrices. In the script that follows I would like to use a gamma distribution with the use of site-specific substitution matrices (however here in the example I use JTT model):

   lk=0;
   global alpha = .5;
   alpha:>0.01;alpha:<100;
   category c = (4, EQUAL, MEAN, GammaDist(_x_,alpha,alpha),CGammaDist(_x_,alpha,alpha),0,1e25,CGammaDist(_x_,alpha+1,alpha));

   for(i=0;i<len;i=i+1){
       ExecuteCommands("DataSetFilter class"+i+" = CreateFilter (data,1,\""+i+"\",\"\");");
       obsFreqs={20,1};
       obsFreqs=0;
       ExecuteCommands("HarvestFrequencies (obsFreqs"+i+", class"+i+", 1, 1, 1);");
       ExecuteCommands("#include Jones2.mdl\";");
       ExecuteCommands("Model JTT"+i+" = (jones, obsFreqs"+i+",1);");
       ExecuteCommands("Tree tree"+i+" = treestring;");
   }

lfDefCommand = "";
lfDefCommand * "LikelihoodFunction lk = (class0,tree0,";
for(i=1;i<len-1;i=i+1){
        lfDefCommand * ("class"+i+",tree"+i+", ");
}
i=len-1;
lfDefCommand * ("class"+i+",tree"+i+"); ");
lfDefCommand * 0;
fprintf (stdout, "\nExecuting: ", lfDefCommand, "\n");
ExecuteCommands (lfDefCommand);

   for(i=1;i<len;i=i+1){
       ExecuteCommands("ReplicateConstraint (\"this1.?.a:=this2.?.a\",tree"+i+",tree0);");
   }

   Optimize(params,lk);

   fprintf(prueba1.out,"\n jones Log(L) + F + gamma con constraints lk = ", params[1][0]);

The problem with the script is that never ends! Would you please check it?

Additionally, I would like to ask you why is so important to optimize the time as a common variable for all the positions (using ReplicateConstraint in the script). In this way it is true that you have a given branch length common for all the positions for a given topology. However, if you optimize the time for each position and then derives the total likelihood  adding the log lik, the value you obtained is much better than using the time as a common variable among positions.

thank you very much again and merry christmas!

gustavo


Title: Re: building likelihood function
Post by Sergei on Jan 2nd, 2009 at 10:29pm
Dear Gustavo,

Does the script become stuck in the optimization phase, or does it not even get to that stage?

Could you add the following statement:

[code]
LIKELIHOOD_FUNCTION_OUTPUT=7;
fprintf ("pathtosomefile",CLEAR_FILE,lk);
[/code]

right before Optimize, where pathtosomefile is some file name to direct the result to and send me the resulting file? It should contain an exported likelihood function as a HyPhy script embedded inside a NEXUS file.

As far as branch lengths go, you are correct that estimating branch lengths independently at each site will give you a better likelihood (adding parameter can't worsen the likelihood), but at the cost of hundreds of extra parameters. If you compute the likelihood ratio test or AIC of the two models, I can guarantee that the model with shared branch lengths will come ahead (or fit no worse than) the branch-lengths/site model.

Keep in mind that estimating branch lengths from a single site leads to incredible overfitting (sample size of 1). Imagine trying to estimate the probability of heads from a single coin toss. Maximum likelihood estimate will either be 0 (if tails came up) or a 1 (if heads came up). Clearly, you need multiple tosses to estimate the probability with any degree of confidence.

HTH,
Sergei

Title: Re: building likelihood function
Post by gustavo on Jan 3rd, 2009 at 7:15pm
Dear Sergei,
After 6 hs of calculation using the script:

REPLACE_TREE_STRUCTURE = 1;

DataSet data = ReadDataFile ("seqfile");

file1 = "treefile";
fscanf (file1,"Tree",tree);
treestring=""+tree;

file2 = "len";
fscanf (file2,"Number",len);

   lk=0;
   global alpha = .5;
   alpha:>0.01;alpha:<100;
   category c = (4, EQUAL, MEAN, GammaDist(_x_,alpha,alpha),CGammaDist(_x_,alpha,alpha),0,1e25,CGammaDist(_x_,alpha+1,alpha));

   for(i=0;i<len;i=i+1){
       ExecuteCommands("DataSetFilter class"+i+" = CreateFilter (data,1,\""+i+"\",\"\");");
       obsFreqs={20,1};
       obsFreqs=0;
       ExecuteCommands("HarvestFrequencies (obsFreqs"+i+", class"+i+", 1, 1, 1);");
       ExecuteCommands("#include \"Jones2.mdl\";");
       ExecuteCommands("Model JTT"+i+" = (jones, obsFreqs"+i+",1);");
       ExecuteCommands("Tree tree"+i+" = treestring;");
   }

lfDefCommand = "";
lfDefCommand * "LikelihoodFunction lk = (class0,tree0,";
for(i=1;i<len-1;i=i+1){
        lfDefCommand * ("class"+i+",tree"+i+", ");
}
i=len-1;
lfDefCommand * ("class"+i+",tree"+i+"); ");
lfDefCommand * 0;
fprintf (stdout, "\nExecuting: ", lfDefCommand, "\n");
ExecuteCommands (lfDefCommand);

   for(i=1;i<len;i=i+1){
       ExecuteCommands("ReplicateConstraint (\"this1.?.a:=this2.?.a\",tree"+i+",tree0);");
   }

LIKELIHOOD_FUNCTION_OUTPUT=7;
fprintf ("forsergei.out",CLEAR_FILE,lk);

   Optimize(params,lk);

   fprintf(prueba1.out,"\n jones Log(L) + F + gamma con constraints lk = ", params[1][0]);


the file "forsergei.out" contains a "0"!

Thanks very much for the answer about the estimation site-by-site.
best regards,

gustavo

Title: Re: building likelihood function
Post by Sergei on Jan 4th, 2009 at 5:27pm
Dear Gustavo,

Hmm - very odd that the file contains nothing but "0". Could you e-mail me all the files used in the analysis so (data and Jones2.mdl) so that I could try to diagnose the issue for you?

Cheers,
Sergei

Title: Re: building likelihood function
Post by gustavo on Jan 4th, 2009 at 5:43pm
hi sergei,
here are the files.
regards,

gustavo
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?action=downloadfile;file=files.zip (10 KB | )

Title: Re: building likelihood function
Post by Sergei on Jan 5th, 2009 at 6:37pm
Dear Gustavo,

Because your model used named the branch length parameter 't', and ReplicateConstraint was applied to an 'a' parameter, the resulting likelihood function was actually unconstrained with about 30000 parameters (which would take forever to optimize).
Even with the constraint, the computational cost of this analysis is quite high and it took the entire day on my 8-core Mac Pro, but then again, you are fitting 262 different models to a 70 sequence alignment:)

Also, because you are using different models (e.g. different base frequencies for each site), you need to be a bit more careful in how you constrain branch lengths. A branch length is a function (in your example) of the 't' parameter and the vector of base frequencies. Because the latter differ between sites, there is a scaling factor involved in constraining all branch lengths to be the same. Finally, for invariable sites, you are setting base frequencies to 1 for the observed residue and 0 for all others; this results in a model that does not allow any transitions in or out of the constant state.

The attached .bf file incorporates all these changes and a few minor tweaks (you can obtain the number of sites in an alignment at run time).

FYI: The resulting log-likelihood is approximately -16379.2

Cheers,
Sergei



http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?action=downloadfile;file=jones-gamma.bf (2 KB | )

HyPhy message board » Powered by YaBB 2.5.2!
YaBB Forum Software © 2000-2024. All Rights Reserved.