Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Dual Model (Read 2115 times)
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Dual Model
Jan 21st, 2009 at 3:04am
 
Hi Sergei,

Could you please confirm that my substitution matrix below is in fact an implementation of your dual GY94 model with 4-category independent general discrete distributions for each of the synonymous and non-synonymous substitution rates?

Code:
/* ------------------------- SUBSTITUTION RATE MATRIX --------------------------- */

global kappa = 1;

/* synonomous rate variation */
global dS_1 := 0; /* invariant category */
global dS_2;
global dS_3;
global dS_4;

global dS_p1;
global dS_p2;
global dS_p3;
global dS_p4 := 1-(dS_p1+dS_p2+dS_p3);

dS_p1:>-0.0000000000000001;   dS_p1:<1.00000000000000001;
dS_p2:>-0.0000000000000001;   dS_p2:<1.00000000000000001;
dS_p3:>-0.0000000000000001;   dS_p3:<1.00000000000000001;
dS_p4:>-0.0000000000000001;   dS_p4:<1.00000000000000001;

categFreqMatrixdS = {{dS_p1, dS_p2, dS_p3, dS_p4}};
categRateMatrixdS = {{ dS_1,  dS_2,  dS_3,  dS_4}};

category dS = (4, categFreqMatrixdS, MEAN, , categRateMatrixdS, 0, 1e25);


/* nonsynonomous rate variation */
global dN_1 := 0; /* invariant category */
global dN_2;
global dN_3;
global dN_4;

global dN_p1;
global dN_p2;
global dN_p3;
global dN_p4 := 1-(dN_p1+dN_p2+dN_p3);

dN_p1:>-0.0000000000000001;   dN_p1:<1.00000000000000001;
dN_p2:>-0.0000000000000001;   dN_p2:<1.00000000000000001;
dN_p3:>-0.0000000000000001;   dN_p3:<1.00000000000000001;
dN_p4:>-0.0000000000000001;   dN_p4:<1.00000000000000001;

categFreqMatrixdN = {{dN_p1, dN_p2, dN_p3, dN_p4}};
categRateMatrixdN = {{ dN_1,  dN_2,  dN_3,  dN_4}};

category dN = (4, categFreqMatrixdN, MEAN, , categRateMatrixdN, 0, 1e25);


SubsMatrix = {61,61};

hshift = 0;

for (h=0; h<63; h=h+1)
{
    if (UniversalGeneticCode[h] == 10) /* stop codon */
    {
        hshift = hshift + 1;
    }
    else
    {
        vshift = hshift;
        for (v=h+1; v<64; v=v+1)
        {
            if (UniversalGeneticCode[v] == 10)  /* stop codon */
            {
                vshift = vshift + 1;
            }
            else
            {
                /* We must determine how many subsitutions are needed to get from codon h to codon v.
                   Only pairs with exactly one substitution are assigned non-zero rates.
                  
                   Two codons indexed by h and v differ in a single nucleotide position if and only if one of the following conditions
                   are met:
                  
                   1). h$4==v$4 (same result of integer division by 4), the change is in the third position
                   2). (v-h)%16 is 0 (same remainder of integer division by 16), the change is in the first position
                   3). (h$16==v$16) and (v-h)%4 is 0 (same result of integer division by 16 and same remainder by integer division by 4), the change in the second position            
                */
                
                diff = v-h;
                
                if ((h$4==v$4)||((diff%4==0)&&(h$16==v$16))||(diff%16==0)) /* differ by one subsitution only */
                {
                    if (h$4==v$4) /* third position */
                    {
                        transition = v%4; /* get targets nucleotides for h->v and v->h substitutions */
                        transition2= h%4;
                        nucPosInCodon = 2; /* change is in the third position */
                    }
                    else
                    {
                        if(diff%16==0) /* first position */
                        {global scale = 1;
                            transition = v$16; /* get targets nucleotides for h->v and v->h substitutions */
                            transition2= h$16;
                            nucPosInCodon = 0; /* change is in the first position */
                        }
                        else
                        {
                            transition = v%16$4; /* get targets nucleotides for h->v and v->h substitutions */
                            transition2= h%16$4;
                            nucPosInCodon = 1;  /* change is in the second position */
                        }
                    }  
                    if (UniversalGeneticCode[h]==UniversalGeneticCode[v]) /* synonymous */
                    {
                        if (Abs(transition-transition2)%2) /* transversion: difference is not divisible by 2 */
                        {
                            SubsMatrix[h-hshift][v-vshift] := dS*t;
                            SubsMatrix[v-vshift][h-hshift] := dS*t;
                        }
                        else /* transition */
                        {
                            SubsMatrix[h-hshift][v-vshift] := kappa*dS*t;
                            SubsMatrix[v-vshift][h-hshift] := kappa*dS*t;
                        }
                    }
                    else /* non-synonymous */
                    {
                        if (Abs(transition-transition2)%2) /* transversion: difference is not divisible by 2 */
                        {
                            SubsMatrix[h-hshift][v-vshift] := dN*t;
                            SubsMatrix[v-vshift][h-hshift] := dN*t;
                        }
                        else /* transition */
                        {
                            SubsMatrix[h-hshift][v-vshift] := kappa*dN*t;
                            SubsMatrix[v-vshift][h-hshift] := kappa*dN*t;
                        }
                    }
                }              
            }
        }
    }
}
 



I am working with over 500 sequences of 370 codons each and the code is very slow, even with fewer rate categories. Optimising branch lengths and model parameters using, for example, dSdNRateAnalysis.bf does not appear to be an option - it just takes far too long. Any suggestions?

Thanks

Miguel




Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Dual Model
Reply #1 - Jan 21st, 2009 at 11:02am
 
Dear Miguel,

You need to normalize dS to be one, otherwise you end up confounding the mean of this distribution with the length of the tree. Also, I found it better to parameterize the weights of a GDD in a stick-breaking fashion. For a four category variable it would look like this:

Code:
global dS_p1 = 1/4; dS_p1 :< 1;
global dS_p2 = 1/3; dS_p2 :< 1;
global dS_p3 = 1/2; dS_p3 :< 1;

dS_weights = {4,1};
dS_weights[0] := dS_p1;
dS_weights[1] := (1-dS_p1)* dS_p2;
dS_weights[2] := (1-dS_p1)*(1-dS_p2)* dS_p3;
dS_weights[3] := (1-dS_p1)*(1-dS_p2)* (1-dS_p3);

category dS = (4, dS_weights, ...)

 



And to normalize, define the mean of dS (e.g. dS_norm := dS_1 * dS_p1 + ...) and then set the values of dS to

Code:
dS_rates = {4,1};
dS_rates [0] := dS_1 / dS_norm;
...

 



As far as speed goes; are you using the latest SVN build - that should help a bit. Also, I would definitely advise estimating branch lengths using a simpler model (e.g a one rate MG94) and then fixing them while you estimate the distribution of dS and dN rates.

This should be implemented as an option in the dNdSRateAnalysis.bf however - have you tried it?

S.

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Miguel Lacerda
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male
Re: Dual Model
Reply #2 - Feb 12th, 2009 at 4:53am
 
Dear Sergei

I have implemented your suggestion above. I estimated the branch lengths separately using a GY94 model and am now trying to optimize the parameters of the 4X4 GDD's for the synonymous and nonsynonymous rates as well as kappa, given the previously estimated branch lengths.

The problem is it is still taking a long time and I'm not sure if it will ever finish. I have been running the MPI version of HyPhy for over 3 days now on 10 processors. Is this normal? Perhaps there is something wrong with my attached code? Or is 4 categories for each of dN and dS just too much?

Thanks,
Miguel
Back to top
 
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login (61 KB | )
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Dual Model
Reply #3 - Feb 12th, 2009 at 2:32pm
 
Dear Miguel,

I will take a look at your code. The SVN build of HyPhy does not run MPI_OPTIMIZER properly, so that may be a part of your problem.
Also, the new caching scheme will be quite memory hungry (I need to benchmark it vs the old code), as for a 500 taxon tree a 16 rate category model will allocate approximately 16*16*61*B*S bytes, where B is the number of tree branches and S is the number of site patterns in the data. In your case this should be about 930MB -- check memory usage on the cluster and see if the system is swapping memory in and out excessively.

Cheers,
Sergei

Cheers,
Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged