Welcome, Guest. Please Login
YaBB - Yet another Bulletin Board
 
  HomeHelpSearchLogin  
 
Very similar dN/dS values (Read 6430 times)
ken_uct
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 1
Very similar dN/dS values
Sep 26th, 2010 at 12:23pm
 
I am trying estimate the dn ds values for the evolution  of mycobacterium tuberculosis with the help of an implementation of the QuickSelection.bf within the HyPhy package. I align genes from different strains of the same bacteria ( so the the sequences are highly similar). Strangely (to me) enough, most of the results(dn/ds values) are very similar even upto the sixth decimal place. Is there an explanation for this?

Secondly could you please direct me to where I can obtain how the scaling factors are calculated (raw, branch).

Thank you
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: Very similar dN/dS values
Reply #1 - Sep 27th, 2010 at 9:04am
 
Hi,

It sounds like you're dealing with a low-variance alignment, so many of the reconstructed substitution counts will be low.  That means that most of your sites probably have only zero, one or two substitutions in the phylogeny so there is a high probability that many of the columns in your alignment (codons) will have the same dN-dS value estimated.

Cheers,
- rt.
Back to top
 
 
IP Logged
 
Art Poon
Global Moderator
*****
Offline


Feed your monkey!

Posts: 0
Re: Very similar dN/dS values
Reply #2 - Sep 27th, 2010 at 9:36am
 
Re: scaling factor, I believe you're looking for lines 174 and onward in qndHelper2.ibf

Cheers,
- Art.
Back to top
 
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Very similar dN/dS values
Reply #3 - Oct 13th, 2010 at 10:43pm
 
Hi Ken,

Could you post a link to the results page where you see this behavior if you run the alignment through datamonkey? I am curios to see what is going on.

Scaling factors (reported by quickselectiondetection) are there to convert nucleotide model branch length parameters into codon model branch length parameters, and they are simply a diagnostic tool. Unless you want to enter the global dN/dS value directly (as opposed to having the program estimate it for you), you probably won't care about the value of the scaling parameter.

Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Simon_UCT
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 3
Re: Very similar dN/dS values
Reply #4 - Oct 17th, 2010 at 3:27pm
 
Hi Sergei,

Im working on the project that Ken has mentioned. The figure we got for the DnDs rate seemed peculiar (for e.g. out of about 2700 alignments looked at about 2200 of them all had the value 0.529849).

I have discovered what I think to be the cause of this result, perhaps you could confirm this?
Data Source:
The species in the analysis are all strains of M. tuberculosis, with genes restricted to only those for which there exists a known orthologue in all 5 TB strains.

Problem:
After a reexamination of this data (by running CleanStopCodons.bf on each of the files containing the ortholog sequences ) it became clear that in most of these files there was actually only 1 of 2 unique sequences present, thus, if I understand correctly, the derivation of Dn/Ds on such sequences would fail to provide any statistically useful data (and should produce none in the case of submitting only 1 unique sequence - yet somehow even in such cases a figure was returned from somewhere?)

I actually realized this by manually running individual orthologue clusters on DataMonkey which immediately picked up that there were too few unique sequences to proceed.

What I dont get however is what HyPhy was doing?
What are the numbers it generated and why did it not throw an error like DataMonkey did.

I attach below Ken's batch script which ran the analysis:

/*this wrapper assumes you have a file which lists all the aligned files you want to process but without a .fasta or .nex extension
ie.
file1
file2
...
fileN

where the actual data files are

file1.nex
file2.nex
...
fileN.nex

and trees are in files

file1.tree
file2.tree
...
fileN.tree

these can easily be generated in *nix

$ find `pwd` -name "*.fasta" | sed 's/.fasta//g' > files.list

have a look at the files.list that I've sent though the paths will not be correct

*/

ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "Utility" + DIRECTORY_SEPARATOR + "ReadDelimitedFiles.bf");
ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "Utility" + DIRECTORY_SEPARATOR + "GrabBag.bf");

/* you can replace the following two lines with a path to file. i.e.: fscanf ( "pathtofile", "Lines", _inDirectoryPaths );*/
SetDialogPrompt ( "Provide a list of files to process:" );
fscanf ( PROMPT_FOR_FILE, "Lines", _inDirectoryPaths );

fprintf (stdout, "[READ ", Columns (_inDirectoryPaths), " file path lines]\n");

/* these options would be passed to the gui/menu, some of which are options implemented in the QuickSelectionDetection.bf which I hacked for this analysis */
_options                        = {};
_options [ "00" ]            = "Universal";
_options [ "01" ]            = "New Analysis";
/* option2 is the aligned codon file, see for loop below */
_options [ "03" ]            = "Default";
/* option4 is the tree file, see for loop below  */
/* option5 is the save file for nucleotide model fit, see for loop below  */
_options [ "06" ]            = "Estimate";


for ( _fileLine = 0; _fileLine < Columns ( _inDirectoryPaths ); _fileLine = _fileLine + 1 ) {
     
     pathParts   = splitFilePath(_inDirectoryPaths[_fileLine]);
     dir_prefix  = pathParts["DIRECTORY"];
     file_name   = pathParts["FILENAME"];
     
     fprintf ( stdout, "Processing ", file_name, "\n" );
     
     _options [ "02" ]            = _inDirectoryPaths[_fileLine] + ".aln.nuc"; /* note I am adding the file extension here - change to what is appropriate for you */
     _options [ "04" ]            = _inDirectoryPaths[_fileLine] + ".tree";
     _options [ "05" ]            = _inDirectoryPaths[_fileLine] + ".nuc.fit";
     
     ExecuteAFile ( "globaldNdS.bf", _options );
     outfile = "";
     outfile * ( _inDirectoryPaths [_fileLine ] + ".globaldNdS" );
     outfile * 0;
     fprintf ( outfile, CLEAR_FILE, dNdS, "\n" );
     
     Delete (lf); /* removes the existing likelihood function */

}

thanks
Simon
Back to top
 
 
IP Logged
 
Simon_UCT
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 3
Re: Very similar dN/dS values
Reply #5 - Oct 17th, 2010 at 3:40pm
 
As an example the attached input file generated a dNdS value of 58.5413
yet in actual fact there are only 2 unique sequences present differing in only a single base. Where did this figure therefore derive from?
Back to top
 
Multimedia File Viewing and Clickable Links are available for Registered Members only!!  You need to Login Login (0 KB | )
 
IP Logged
 
Sergei
YaBB Administrator
*****
Offline


Datamonkeys are forever...

Posts: 1658
UCSD
Gender: male
Re: Very similar dN/dS values
Reply #6 - Oct 17th, 2010 at 8:17pm
 
Hi Simon,

In very sparse alignments (as the one you attach), POINT estimates of dN/dS are unreliable. For example, if there is a single non-syn substitution and a no syn substitutions, then the dN/dS estimate will effectively be infinite (with huge error bounds). I sent Kenneth a file that computes such error bounds -- you may want to ask him about using it to ensure that point estimates from sparse data are not over-interpreted.

Sergei
Back to top
 

Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
WWW WWW  
IP Logged
 
Simon_UCT
YaBB Newbies
*
Offline


Feed your monkey!

Posts: 3
Re: Very similar dN/dS values
Reply #7 - Oct 17th, 2010 at 10:05pm
 
Thanks for your help, that sounds precisely what is needed here.

What still worries me though is that since Im primarily looking at only a single species (in this case hoping to identify trends as you move from a non-virulent to virulent stain of TB) that we wont find any (or hardly any) genes with enough difference between strains to get any credible kind of result? 


very much appreciated,

Simon
Back to top
 
 
IP Logged