Dear Sunil,
Good questions!
Quote: 1. The global mean dN/dS value doesn’t give any idea about rate heterogeneity at different sites. I would like to know what the classical criterion is for categorizing a gene as “positively/negatively selected”. I mean, is it global dN/dS OR the number of selected sites OR their proportion OR selection strength?
I would say that the traditional test is to claim positive selection if there is a proportion p>0 of sites with dN>dS (i.e. selection operates on some sites). You could also say that a gene is under selection if at least one site is detected by a site-wise method (but only after an appropriate correction for multiple testing). Take a look at our selection book chapter for further insight. If you are after this test, I would recommend PARRIS (implemented in datamonkey.org), as it can also deal with recombinant data.
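As a rough illustration of the "at least one site after multiple-testing correction" criterion, here is a minimal Python sketch. It uses Holm's step-down procedure, which is only one possible correction and is my choice for the example, not something prescribed above. The function names and the per-site p-values are hypothetical; in practice the p-values would come from a site-wise method such as FEL or SLAC.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per site: True if the site remains significant
    after Holm's step-down correction at family-wise level alpha."""
    m = len(p_values)
    # Indices of sites, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = 0, 1, ...) is compared to alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first p-value that fails
    return reject


def gene_under_selection(p_values, alpha=0.05):
    """Hypothetical gene-level call: at least one site survives correction."""
    return any(holm_reject(p_values, alpha))


# Made-up per-site p-values, for illustration only.
site_p = [0.0001, 0.04, 0.2, 0.6]
print(holm_reject(site_p))
print(gene_under_selection(site_p))
```

Note that a site with an uncorrected p-value of 0.04 does not survive here, which is the point of the correction: with many sites tested, a few small p-values are expected by chance.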
Quote: 2. Will a nucleotide substitution bias model and codon model estimated for one mammalian sequence alignment hold for other mammalian datasets as well? OR is it necessary to estimate these models for each dataset?
It is probably best to estimate the appropriate model for each data set; it's not computationally expensive.
Different genes may call for different models, and so may alignments of different sizes (simpler models for smaller alignments, more complex models for larger alignments).
Quote: 3. If the study is focused on mammalian phylogeny, the number of species-specific genic sequences is usually small (4 – 8). Is this dataset large enough for inferring positively selected sites? SLAC is for larger datasets, and REL gives more false positives for smaller datasets. Can FEL results be expected to hold for these small datasets? If yes, what would be a conservative level of significance?
Generally speaking, 4-8 sequences have almost no power to detect selection at a single site. You are better off testing for selection in general (e.g. with PARRIS), or doing a region test - e.g. you can partition your gene into surface and buried residues (if that is known), or similarly based on structure, and compare dN/dS between the two partitions. SLAC and FEL will have very little power with 4-8 sequences, and REL could suffer from errors associated with parameter estimates derived from small data sets. To detect selection from 4-8 sequences your model must be spot on and have very few parameters - in reality even the best models are biologically wrong, and we don't really know a priori which parameters matter (e.g. do synonymous rates vary from site to site) and which don't.
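One way to frame the region test is as a likelihood ratio test: a null model with a single dN/dS shared by both partitions versus an alternative with a separate dN/dS for each partition (one extra parameter). The Python sketch below only does the final comparison step. The framing, the function name and the log-likelihood values are assumptions for illustration; the two log-likelihoods would have to come from fitting both models in your software of choice.

```python
import math


def lrt_pvalue_df1(lnL_null, lnL_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    lnL_null: maximized log-likelihood with one shared dN/dS
    lnL_alt:  maximized log-likelihood with a separate dN/dS per partition
    Returns (test statistic, p-value from a chi-square with 1 d.f.).
    """
    # The alternative can never fit worse than the null; clamp numerical noise.
    stat = max(0.0, 2.0 * (lnL_alt - lnL_null))
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up log-likelihoods, for illustration only.
stat, p = lrt_pvalue_df1(lnL_null=-1000.0, lnL_alt=-998.0)
print(f"LRT statistic = {stat:.2f}, p = {p:.4f}")
```

Because this is a single test with one extra parameter, it asks far less of a 4-8 sequence alignment than a site-by-site analysis does.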
Quote: 4. Can a universally accepted tree topology based on mitochondrial DNA be used for analyzing mammalian datasets, OR is it necessary to construct a tree from the sequence alignment itself? Since the chance of horizontal gene transfer at the mammalian level is low, is there any reason to consider GARD analysis for recombination detection prior to selection analysis?
The first part has to do with gene trees vs species trees; I can't really say without looking at your genes. As far as screening for possible recombination goes: in some gene families in mammals (e.g. immune genes such as interferon), there is a lot of gene conversion, which would look like recombination in phylogenies. GARD is fairly conservative; you could always try your analysis with and without a GARD screen to see if that makes a difference.
Cheers,
Sergei