YaBB - Yet another Bulletin Board
 
 
Very similar dN/dS values (Read 5983 times)
ken_uct
Very similar dN/dS values
Sep 26th, 2010 at 12:23pm
 
I am trying to estimate dN/dS values for the evolution of Mycobacterium tuberculosis using an implementation of QuickSelection.bf within the HyPhy package. I align genes from different strains of the same bacterium (so the sequences are highly similar). Strangely (to me), most of the results (dN/dS values) are very similar, even up to the sixth decimal place. Is there an explanation for this?

Secondly, could you please point me to where I can find how the scaling factors (raw, branch) are calculated?

Thank you
Art Poon
Re: Very similar dN/dS values
Reply #1 - Sep 27th, 2010 at 9:04am
 
Hi,

It sounds like you're dealing with a low-variance alignment, so many of the reconstructed substitution counts will be low. Most of your sites probably have only zero, one, or two substitutions in the phylogeny, so there is a high probability that many of the columns (codons) in your alignment will be assigned the same estimated dN-dS value.
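This point can be illustrated with a small counting sketch (plain Python, not HyPhy code; the toy sequences below are made up): when strains are nearly identical, almost every codon column is invariant, so the per-site estimates are computed from identical substitution counts and therefore come out identical.

```python
# Toy illustration (not HyPhy code): in a low-variance alignment,
# most codon columns carry zero substitutions, so any per-site
# statistic computed from substitution counts takes the same value
# at almost every site.
seqs = [
    "ATGAAACCCGGGTTT",
    "ATGAAACCCGGGTTT",
    "ATGAAACCAGGGTTT",  # one synonymous change in codon 3 (CCC -> CCA)
]

# Collect the set of distinct codons observed at each codon position.
codons = [{s[i:i + 3] for s in seqs} for i in range(0, len(seqs[0]), 3)]
variable = [len(c) > 1 for c in codons]

print(sum(variable), "of", len(codons), "codon sites are variable")
```

Here only one of five codon sites varies, so four of the five per-site estimates are computed from an identical substitution count of zero.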

Cheers,
- Art.
Art Poon
Re: Very similar dN/dS values
Reply #2 - Sep 27th, 2010 at 9:36am
 
Re: scaling factor, I believe you're looking for lines 174 and onward in qndHelper2.ibf

Cheers,
- Art.
Sergei
Re: Very similar dN/dS values
Reply #3 - Oct 13th, 2010 at 10:43pm
 
Hi Ken,

Could you post a link to the results page where you see this behavior when you run the alignment through Datamonkey? I am curious to see what is going on.

Scaling factors (reported by QuickSelectionDetection) are there to convert nucleotide-model branch length parameters into codon-model branch length parameters, and they are simply a diagnostic tool. Unless you want to enter the global dN/dS value directly (as opposed to having the program estimate it for you), you probably won't care about the value of the scaling parameter.
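As a purely numerical illustration of that conversion (all numbers below are hypothetical, not HyPhy output), the scaling factor acts as a multiplier taking nucleotide-model branch lengths to codon-model branch lengths:

```python
# Illustrative sketch (made-up numbers, not HyPhy output): the scaling
# factor converts nucleotide-model branch lengths into codon-model
# branch lengths by simple multiplication.
nuc_branch_lengths = {"t1": 0.012, "t2": 0.008, "internal": 0.004}
scaling_factor = 3.1  # hypothetical value reported by the analysis

codon_branch_lengths = {
    name: length * scaling_factor
    for name, length in nuc_branch_lengths.items()
}
print(round(codon_branch_lengths["t1"], 4))
```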

Sergei
Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego
Simon_UCT
Re: Very similar dN/dS values
Reply #4 - Oct 17th, 2010 at 3:27pm
 
Hi Sergei,

I'm working on the project that Ken mentioned. The figures we got for the dN/dS rate seemed peculiar (e.g. out of about 2700 alignments examined, about 2200 of them all had the value 0.529849).

I have discovered what I think is the cause of this result; perhaps you could confirm it?
Data source:
The species in the analysis are all strains of M. tuberculosis, with genes restricted to those for which a known orthologue exists in all 5 TB strains.

Problem:
After re-examining this data (by running CleanStopCodons.bf on each of the files containing the orthologue sequences), it became clear that most of these files actually contained only 1 or 2 unique sequences. If I understand correctly, deriving dN/dS from such sequences would fail to provide any statistically useful result (and should produce no result at all when only 1 unique sequence is submitted, yet somehow even in such cases a figure was returned from somewhere).

I actually realized this by manually running individual orthologue clusters through Datamonkey, which immediately detected that there were too few unique sequences to proceed.

What I don't get, however, is what HyPhy was doing. What are the numbers it generated, and why did it not throw an error the way Datamonkey did?
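One way to avoid feeding such files into the batch in the first place would be a pre-filter on unique sequence counts (a sketch assuming plain FASTA input; the helper and the three-sequence threshold are my own, intended to mirror the kind of uniqueness check Datamonkey applies):

```python
# Sketch of a pre-filter (assumes plain FASTA input): skip alignments
# that, like the files described above, contain too few unique
# sequences for a meaningful dN/dS estimate.
def unique_sequences(fasta_text):
    """Return the set of distinct sequences in a FASTA-formatted string."""
    seqs, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
            current = []
        else:
            current.append(line.strip())
    if current:
        seqs.append("".join(current))
    return set(seqs)

# Toy alignment: three strains, but only two distinct sequences.
fasta = """>strain1
ATGAAACCC
>strain2
ATGAAACCC
>strain3
ATGAAACCA
"""

uniq = unique_sequences(fasta)
if len(uniq) < 3:
    print("skip: only", len(uniq), "unique sequences")
```

A loop like the batch script's could call this on each alignment and `continue` past files that fail the check instead of handing them to the analysis.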

I attach below Ken's batch script which ran the analysis:

/*this wrapper assumes you have a file which lists all the aligned files you want to process but without a .fasta or .nex extension
ie.
file1
file2
...
fileN

where the actual data files are

file1.nex
file2.nex
...
fileN.nex

and trees are in files

file1.tree
file2.tree
...
fileN.tree

these can easily be generated in *nix

$ find `pwd` -name "*.fasta" | sed 's/.fasta//g' > files.list

have a look at the files.list that I've sent though the paths will not be correct

*/

ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "Utility" + DIRECTORY_SEPARATOR + "ReadDelimitedFiles.bf");
ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" + DIRECTORY_SEPARATOR + "Utility" + DIRECTORY_SEPARATOR + "GrabBag.bf");

/* you can replace the following two lines with a path to file. i.e.: fscanf ( "pathtofile", "Lines", _inDirectoryPaths );*/
SetDialogPrompt ( "Provide a list of files to process:" );
fscanf ( PROMPT_FOR_FILE, "Lines", _inDirectoryPaths );

fprintf (stdout, "[READ ", Columns (_inDirectoryPaths), " file path lines]\n");

/* these options would be passed to the gui/menu, some of which are options implemented in the QuickSelectionDetection.bf which I hacked for this analysis */
_options                        = {};
_options [ "00" ]            = "Universal";
_options [ "01" ]            = "New Analysis";
/* option2 is the aligned codon file, see for loop below */
_options [ "03" ]            = "Default";
/* option4 is the tree file, see for loop below  */
/* option5 is the save file for nucleotide model fit, see for loop below  */
_options [ "06" ]            = "Estimate";


for ( _fileLine = 0; _fileLine < Columns ( _inDirectoryPaths ); _fileLine = _fileLine + 1 ) {
     
     pathParts   = splitFilePath(_inDirectoryPaths[_fileLine]);
     dir_prefix  = pathParts["DIRECTORY"];
     file_name   = pathParts["FILENAME"];
     
     fprintf ( stdout, "Processing ", file_name, "\n" );
     
     _options [ "02" ]            = _inDirectoryPaths[_fileLine] + ".aln.nuc"; /* note I am adding the file extension here - change to what is appropriate for you */
     _options [ "04" ]            = _inDirectoryPaths[_fileLine] + ".tree";
     _options [ "05" ]            = _inDirectoryPaths[_fileLine] + ".nuc.fit";
     
     ExecuteAFile ( "globaldNdS.bf", _options );
     outfile = "";
     outfile * ( _inDirectoryPaths [_fileLine ] + ".globaldNdS" );
     outfile * 0;
     fprintf ( outfile, CLEAR_FILE, dNdS, "\n" );
     
     Delete (lf); /* removes the existing likelihood function */

}

thanks
Simon
Simon_UCT
Re: Very similar dN/dS values
Reply #5 - Oct 17th, 2010 at 3:40pm
 
As an example, the attached input file generated a dN/dS value of 58.5413, yet in actual fact there are only 2 unique sequences present, differing by only a single base. Where, therefore, did this figure derive from?
(Attachment: example input file, available to registered members.)
Sergei
Re: Very similar dN/dS values
Reply #6 - Oct 17th, 2010 at 8:17pm
 
Hi Simon,

In very sparse alignments (such as the one you attached), POINT estimates of dN/dS are unreliable. For example, if there is a single non-synonymous substitution and no synonymous substitutions, then the dN/dS estimate will effectively be infinite (with huge error bounds). I sent Kenneth a file that computes such error bounds; you may want to ask him about using it to ensure that point estimates from sparse data are not over-interpreted.
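This instability is easy to see with a crude counting sketch (naive substitution/site counts only, not the codon-model estimate HyPhy actually computes; the site totals below are invented):

```python
# Crude sketch (counting only, not a codon-model estimate): with very
# few substitutions, the dN/dS point estimate is dominated by noise.
def naive_dn_ds(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites):
    """Ratio of per-site non-synonymous to synonymous substitution rates."""
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0.0:
        return float("inf")  # the point estimate blows up
    return dn / ds

# Hypothetical site counts for a short gene.
N, S = 700.0, 300.0

print(naive_dn_ds(1, 0, N, S))  # one non-syn, no syn: infinite estimate
print(naive_dn_ds(1, 1, N, S))  # one extra syn change: finite but unstable
```

A single additional synonymous change swings the estimate from infinity down to well below 1, which is exactly why error bounds matter for sparse data.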

Sergei
Simon_UCT
Re: Very similar dN/dS values
Reply #7 - Oct 17th, 2010 at 10:05pm
 
Thanks for your help; that sounds like precisely what is needed here.

What still worries me, though, is that, since I'm primarily looking at only a single species (in this case hoping to identify trends as you move from a non-virulent to a virulent strain of TB), we may not find any (or hardly any) genes with enough difference between strains to get any credible kind of result.


very much appreciated,

Simon