Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
building likelihood function (Read 3431 times)
gustavo
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 12
building likelihood function
Dec 15th, 2008 at 9:51am
 
Hi,

I am working with site-specific models and then I have to build a likelihood function containing the n positions in an alignment like this:

LikelihoodFunction lk = (class0,tree0, class1,tree1,class2, tree2....classn,treen);

I have tried using ExecuteCommands as:

ExecuteCommands("LikelihoodFunction lk = (class0,tree0,");
      for(i=1;i<29;i=i+1){
       ExecuteCommands("class"+i+",tree"+i+", ");
      }
       ExecuteCommands("class29,tree29); ");

for an alignment with n=30, but it doesn't work.
Could you please help me with this script?
Thank you very much.

gustavo
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: building likelihood function
Reply #1 - Dec 15th, 2008 at 10:39am
 
Dear Gustavo,

ExecuteCommands does not permit command accumulation - it expects fully formed statements. Try building up the entire command as in the following code snippet.

Sergei

Code:
lfDefCommand = "";
lfDefCommand * "LikelihoodFunction lk = (class0,tree0,";
for(i=1;i<29;i=i+1){
	 lfDefCommand * ("class"+i+",tree"+i+", ");
}
lfDefCommand * ("class"+i+",tree"+i+") ");
lfDefCommand * 0;
fprintf (stdout, "\nExecuting: ", lfDefCommand, "\n");
ExecuteCommands (lfDefCommand);
 

Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
gustavo
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 12
Re: building likelihood function
Reply #2 - Dec 24th, 2008 at 8:59am
 
Dear Sergei,
Thank you very much for your help!

I have still some problems using site-specific substitution matrices. In the script that follows I would like to use a gamma distribution with the use of site-specific substitution matrices (however here in the example I use JTT model):

    lk=0;
    global alpha = .5;
    alpha:>0.01;alpha:<100;
    category c = (4, EQUAL, MEAN, GammaDist(_x_,alpha,alpha),CGammaDist(_x_,alpha,alpha),0,1e25,CGammaDist(_x_,alp
ha+1,alpha));

    for(i=0;i<len;i=i+1){
       ExecuteCommands("DataSetFilter class"+i+" = CreateFilter (data,1,\""+i+"\",\"\");");
       obsFreqs={20,1};
       obsFreqs=0;
       ExecuteCommands("HarvestFrequencies (obsFreqs"+i+", class"+i+", 1, 1, 1);");
       ExecuteCommands("#include Jones2.mdl\";");
       ExecuteCommands("Model JTT"+i+" = (jones, obsFreqs"+i+",1);");
       ExecuteCommands("Tree tree"+i+" = treestring;");
    }

lfDefCommand = "";
lfDefCommand * "LikelihoodFunction lk = (class0,tree0,";
for(i=1;i<len-1;i=i+1){
        lfDefCommand * ("class"+i+",tree"+i+", ");
}
i=len-1;
lfDefCommand * ("class"+i+",tree"+i+"); ");
lfDefCommand * 0;
fprintf (stdout, "\nExecuting: ", lfDefCommand, "\n");
ExecuteCommands (lfDefCommand);

    for(i=1;i<len;i=i+1){
       ExecuteCommands("ReplicateConstraint (\"this1.?.a:=this2.?.a\",tree"+i+",tree0);");
    }

    Optimize(params,lk);

    fprintf(prueba1.out,"\n jones Log(L) + F + gamma con constraints lk = ", params[1][0]);

The problem with the script is that never ends! Would you please check it?

Additionally, I would like to ask you why is so important to optimize the time as a common variable for all the positions (using ReplicateConstraint in the script). In this way it is true that you have a given branch length common for all the positions for a given topology. However, if you optimize the time for each position and then derives the total likelihood  adding the log lik, the value you obtained is much better than using the time as a common variable among positions.

thank you very much again and merry christmas!

gustavo

Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: building likelihood function
Reply #3 - Jan 2nd, 2009 at 10:29pm
 
Dear Gustavo,

Does the script become stuck in the optimization phase, or does it not even get to that stage?

Could you add the following statement:

Code:
LIKELIHOOD_FUNCTION_OUTPUT=7;
fprintf ("pathtosomefile",CLEAR_FILE,lk);
 



right before Optimize, where pathtosomefile is some file name to direct the result to and send me the resulting file? It should contain an exported likelihood function as a HyPhy script embedded inside a NEXUS file.

As far as branch lengths go, you are correct that estimating branch lengths independently at each site will give you a better likelihood (adding parameter can't worsen the likelihood), but at the cost of hundreds of extra parameters. If you compute the likelihood ratio test or AIC of the two models, I can guarantee that the model with shared branch lengths will come ahead (or fit no worse than) the branch-lengths/site model.

Keep in mind that estimating branch lengths from a single site leads to incredible overfitting (sample size of 1). Imagine trying to estimate the probability of heads from a single coin toss. Maximum likelihood estimate will either be 0 (if tails came up) or a 1 (if heads came up). Clearly, you need multiple tosses to estimate the probability with any degree of confidence.

HTH,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
gustavo
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 12
Re: building likelihood function
Reply #4 - Jan 3rd, 2009 at 7:15pm
 
Dear Sergei,
After 6 hs of calculation using the script:

REPLACE_TREE_STRUCTURE = 1;

DataSet data = ReadDataFile ("seqfile");

file1 = "treefile";
fscanf (file1,"Tree",tree);
treestring=""+tree;

file2 = "len";
fscanf (file2,"Number",len);

    lk=0;
    global alpha = .5;
    alpha:>0.01;alpha:<100;
    category c = (4, EQUAL, MEAN, GammaDist(_x_,alpha,alpha),CGammaDist(_x_,alpha,alpha),0,1e25,CGammaDist(_x_,alp
ha+1,alpha));

    for(i=0;i<len;i=i+1){
       ExecuteCommands("DataSetFilter class"+i+" = CreateFilter (data,1,\""+i+"\",\"\");");
       obsFreqs={20,1};
       obsFreqs=0;
       ExecuteCommands("HarvestFrequencies (obsFreqs"+i+", class"+i+", 1, 1, 1);");
       ExecuteCommands("#include \"Jones2.mdl\";");
       ExecuteCommands("Model JTT"+i+" = (jones, obsFreqs"+i+",1);");
       ExecuteCommands("Tree tree"+i+" = treestring;");
    }

lfDefCommand = "";
lfDefCommand * "LikelihoodFunction lk = (class0,tree0,";
for(i=1;i<len-1;i=i+1){
        lfDefCommand * ("class"+i+",tree"+i+", ");
}
i=len-1;
lfDefCommand * ("class"+i+",tree"+i+"); ");
lfDefCommand * 0;
fprintf (stdout, "\nExecuting: ", lfDefCommand, "\n");
ExecuteCommands (lfDefCommand);

    for(i=1;i<len;i=i+1){
       ExecuteCommands("ReplicateConstraint (\"this1.?.a:=this2.?.a\",tree"+i+",tree0);");
    }

LIKELIHOOD_FUNCTION_OUTPUT=7;
fprintf ("forsergei.out",CLEAR_FILE,lk);

    Optimize(params,lk);

    fprintf(prueba1.out,"\n jones Log(L) + F + gamma con constraints lk = ", params[1][0]);


the file "forsergei.out" contains a "0"!

Thanks very much for the answer about the estimation site-by-site.
best regards,

gustavo
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: building likelihood function
Reply #5 - Jan 4th, 2009 at 5:27pm
 
Dear Gustavo,

Hmm - very odd that the file contains nothing but "0". Could you e-mail me all the files used in the analysis so (data and Jones2.mdl) so that I could try to diagnose the issue for you?

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
gustavo
YaBB Newbies
*
Offline


I love YaBB 1G - SP1!

Posts: 12
Re: building likelihood function
Reply #6 - Jan 4th, 2009 at 5:43pm
 
hi sergei,
here are the files.
regards,

gustavo
Back to top
 
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login (10 KB | )
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: building likelihood function
Reply #7 - Jan 5th, 2009 at 6:37pm
 
Dear Gustavo,

Because your model used named the branch length parameter 't', and ReplicateConstraint was applied to an 'a' parameter, the resulting likelihood function was actually unconstrained with about 30000 parameters (which would take forever to optimize).
Even with the constraint, the computational cost of this analysis is quite high and it took the entire day on my 8-core Mac Pro, but then again, you are fitting 262 different models to a 70 sequence alignment:)

Also, because you are using different models (e.g. different base frequencies for each site), you need to be a bit more careful in how you constrain branch lengths. A branch length is a function (in your example) of the 't' parameter and the vector of base frequencies. Because the latter differ between sites, there is a scaling factor involved in constraining all branch lengths to be the same. Finally, for invariable sites, you are setting base frequencies to 1 for the observed residue and 0 for all others; this results in a model that does not allow any transitions in or out of the constant state.

The attached .bf file incorporates all these changes and a few minor tweaks (you can obtain the number of sites in an alignment at run time).

FYI: The resulting log-likelihood is approximately -16379.2

Cheers,
Sergei


Back to top
 
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login (2 KB | )

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged