HyPhy message board - Dual Model

	Welcome, Guest. Please Login

Home

Help

HyPhy message board › Methodology Questions › How to › Dual Model

(Moderators: Sergei, Simon)

‹ Previous Topic | Next Topic ›

Pages: 1

Send Topic

Dual Model (Read 2115 times)

Miguel Lacerda

YaBB Newbies

Offline

Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male

Dual Model
Jan 21^st, 2009 at 3:04am

Hi Sergei,

Could you please confirm that my substitution matrix below is in fact an implementation of your dual GY94 model with 4-category independent general discrete distributions for each of the synonymous and non-synonymous substitution rates?

Code:

/* ------------------------- SUBSTITUTION RATE MATRIX --------------------------- */

global kappa = 1;

/* synonomous rate variation */
global dS_1 := 0; /* invariant category */
global dS_2;
global dS_3;
global dS_4;

global dS_p1;
global dS_p2;
global dS_p3;
global dS_p4 := 1-(dS_p1+dS_p2+dS_p3);

dS_p1:>-0.0000000000000001;   dS_p1:<1.00000000000000001;
dS_p2:>-0.0000000000000001;   dS_p2:<1.00000000000000001;
dS_p3:>-0.0000000000000001;   dS_p3:<1.00000000000000001;
dS_p4:>-0.0000000000000001;   dS_p4:<1.00000000000000001;

categFreqMatrixdS = {{dS_p1, dS_p2, dS_p3, dS_p4}};
categRateMatrixdS = {{ dS_1,  dS_2,  dS_3,  dS_4}};

category dS = (4, categFreqMatrixdS, MEAN, , categRateMatrixdS, 0, 1e25);


/* nonsynonomous rate variation */
global dN_1 := 0; /* invariant category */
global dN_2;
global dN_3;
global dN_4;

global dN_p1;
global dN_p2;
global dN_p3;
global dN_p4 := 1-(dN_p1+dN_p2+dN_p3);

dN_p1:>-0.0000000000000001;   dN_p1:<1.00000000000000001;
dN_p2:>-0.0000000000000001;   dN_p2:<1.00000000000000001;
dN_p3:>-0.0000000000000001;   dN_p3:<1.00000000000000001;
dN_p4:>-0.0000000000000001;   dN_p4:<1.00000000000000001;

categFreqMatrixdN = {{dN_p1, dN_p2, dN_p3, dN_p4}};
categRateMatrixdN = {{ dN_1,  dN_2,  dN_3,  dN_4}};

category dN = (4, categFreqMatrixdN, MEAN, , categRateMatrixdN, 0, 1e25);


SubsMatrix = {61,61};

hshift = 0;

for (h=0; h<63; h=h+1)
{
    if (UniversalGeneticCode[h] == 10) /* stop codon */
    {
        hshift = hshift + 1;
    }
    else
    {
        vshift = hshift;
        for (v=h+1; v<64; v=v+1)
        {
            if (UniversalGeneticCode[v] == 10)  /* stop codon */
            {
                vshift = vshift + 1;
            }
            else
            {
                /* We must determine how many subsitutions are needed to get from codon h to codon v.
                   Only pairs with exactly one substitution are assigned non-zero rates.
                  
                   Two codons indexed by h and v differ in a single nucleotide position if and only if one of the following conditions
                   are met:
                  
                   1). h$4==v$4 (same result of integer division by 4), the change is in the third position
                   2). (v-h)%16 is 0 (same remainder of integer division by 16), the change is in the first position
                   3). (h$16==v$16) and (v-h)%4 is 0 (same result of integer division by 16 and same remainder by integer division by 4), the change in the second position            
                */
                
                diff = v-h;
                
                if ((h$4==v$4)||((diff%4==0)&&(h$16==v$16))||(diff%16==0)) /* differ by one subsitution only */
                {
                    if (h$4==v$4) /* third position */
                    {
                        transition = v%4; /* get targets nucleotides for h->v and v->h substitutions */
                        transition2= h%4;
                        nucPosInCodon = 2; /* change is in the third position */
                    }
                    else
                    {
                        if(diff%16==0) /* first position */
                        {global scale = 1;
                            transition = v$16; /* get targets nucleotides for h->v and v->h substitutions */
                            transition2= h$16;
                            nucPosInCodon = 0; /* change is in the first position */
                        }
                        else
                        {
                            transition = v%16$4; /* get targets nucleotides for h->v and v->h substitutions */
                            transition2= h%16$4;
                            nucPosInCodon = 1;  /* change is in the second position */
                        }
                    }  
                    if (UniversalGeneticCode[h]==UniversalGeneticCode[v]) /* synonymous */
                    {
                        if (Abs(transition-transition2)%2) /* transversion: difference is not divisible by 2 */
                        {
                            SubsMatrix[h-hshift][v-vshift] := dS*t;
                            SubsMatrix[v-vshift][h-hshift] := dS*t;
                        }
                        else /* transition */
                        {
                            SubsMatrix[h-hshift][v-vshift] := kappa*dS*t;
                            SubsMatrix[v-vshift][h-hshift] := kappa*dS*t;
                        }
                    }
                    else /* non-synonymous */
                    {
                        if (Abs(transition-transition2)%2) /* transversion: difference is not divisible by 2 */
                        {
                            SubsMatrix[h-hshift][v-vshift] := dN*t;
                            SubsMatrix[v-vshift][h-hshift] := dN*t;
                        }
                        else /* transition */
                        {
                            SubsMatrix[h-hshift][v-vshift] := kappa*dN*t;
                            SubsMatrix[v-vshift][h-hshift] := kappa*dN*t;
                        }
                    }
                }              
            }
        }
    }
}

I am working with over 500 sequences of 370 codons each and the code is very slow, even with fewer rate categories. Optimising branch lengths and model parameters using, for example, dSdNRateAnalysis.bf does not appear to be an option - it just takes far too long. Any suggestions?

Thanks

Miguel

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: Dual Model
Reply #1 - Jan 21^st, 2009 at 11:02am

Dear Miguel,

You need to normalize dS to be one, otherwise you end up confounding the mean of this distribution with the length of the tree. Also, I found it better to parameterize the weights of a GDD in a stick-breaking fashion. For a four category variable it would look like this:

Code:

global dS_p1 = 1/4; dS_p1 :< 1;
global dS_p2 = 1/3; dS_p2 :< 1;
global dS_p3 = 1/2; dS_p3 :< 1;

dS_weights = {4,1};
dS_weights[0] := dS_p1;
dS_weights[1] := (1-dS_p1)* dS_p2;
dS_weights[2] := (1-dS_p1)*(1-dS_p2)* dS_p3;
dS_weights[3] := (1-dS_p1)*(1-dS_p2)* (1-dS_p3);

category dS = (4, dS_weights, ...)

And to normalize, define the mean of dS (e.g. dS_norm := dS_1 * dS_p1 + ...) and then set the values of dS to

Code:

dS_rates = {4,1};
dS_rates [0] := dS_1 / dS_norm;
...

As far as speed goes; are you using the latest SVN build - that should help a bit. Also, I would definitely advise estimating branch lengths using a simpler model (e.g a one rate MG94) and then fixing them while you estimate the distribution of dS and dN rates.

This should be implemented as an option in the dNdSRateAnalysis.bf however - have you tried it?

S.

Cheers,
Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

Miguel Lacerda

YaBB Newbies

Offline

Feed your monkey!

Posts: 36
Natl Univ of Ireland, Galway
Gender: male

Re: Dual Model
Reply #2 - Feb 12^th, 2009 at 4:53am

Dear Sergei

I have implemented your suggestion above. I estimated the branch lengths separately using a GY94 model and am now trying to optimize the parameters of the 4X4 GDD's for the synonymous and nonsynonymous rates as well as kappa, given the previously estimated branch lengths.

The problem is it is still taking a long time and I'm not sure if it will ever finish. I have been running the MPI version of HyPhy for over 3 days now on 10 processors. Is this normal? Perhaps there is something wrong with my attached code? Or is 4 categories for each of dN and dS just too much?

Thanks,
Miguel

Multimedia File Viewing and Clickable Links are available for Registered Members only!! You need to

IP Logged

Sergei

YaBB Administrator

Offline

Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male

Re: Dual Model
Reply #3 - Feb 12^th, 2009 at 2:32pm

Dear Miguel,

I will take a look at your code. The SVN build of HyPhy does not run MPI_OPTIMIZER properly, so that may be a part of your problem.
Also, the new caching scheme will be quite memory hungry (I need to benchmark it vs the old code), as for a 500 taxon tree a 16 rate category model will allocate approximately 16*16*61*B*S bytes, where B is the number of tree branches and S is the number of site patterns in the data. In your case this should be about 930MB -- check memory usage on the cluster and see if the system is swapping memory in and out excessively.

Cheers,
Sergei

Cheers,
Sergei

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego

WWW

IP Logged

Pages: 1

Send Topic

‹ Previous Topic | Next Topic ›

« Home

‹ Board

Top of this page