ML with defined constraints ? (Read 1613 times)
Bryony Mackenzie
Guest


ML with defined constraints ?
May 11th, 2005 at 3:22am
 
Hello,

I have a very large protein data set (104 partitions, 76 taxa, ~17000 amino acid positions). Obviously it is far too large for ML analysis. I would like to set constraints so that the well-defined subgroups in my tree are fixed and ML analysis is carried out for the deep branches only.

To do this I need either 1) a program that can take a constraints tree and the large dataset and calculate ML using only deep branch arrangements or 2) to reconstruct the ancestor sequences for each subgroup and use these to reduce the number of taxa in my analysis.

Would HYPHY be suitable for either of these methods ? How large a data set can it cope with on a single processor ? There is a grid system available to me if the program could run in parallel.

many thanks
basm101
Sergei
YaBB Administrator
Re: ML with defined constraints ?
Reply #1 - May 11th, 2005 at 7:25am
 
Dear Bryony,

Your problem is definitely an interesting one...

Quote:
Hello,

I have a very large protein data set (104 partitions, 76 taxa, ~17000 amino acid positions). Obviously it is far too large for ML analysis. I would like to set constraints so that the well-defined subgroups in my tree are fixed and ML analysis is carried out for the deep branches only.

To do this I need either 1) a program that can take a constraints tree and the large dataset and calculate ML using only deep branch arrangements or 2) to reconstruct the ancestor sequences for each subgroup and use these to reduce the number of taxa in my analysis.


Would HYPHY be suitable for either of these methods ?



HyPhy can be used to implement both approaches; however, each one would require some work. One could try to do (2) out of the box by: creating a separate data set for each resolved subtree; fitting an amino-acid model to that subtree independently (you'd have to root each subtree using an outgroup, since most amino-acid models are time reversible); reconstructing ancestral sequences based on that model fit; and replacing the subtree with its MRCA. This can be done via standard analyses (AnalyzeNucProtData.bf, followed by Reconstruct Ancestral Sequences from the Analyses->Results menu).
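
To make the last step concrete, here is a minimal Python sketch (not HyPhy code) of replacing each resolved subgroup with a single placeholder leaf that stands in for its reconstructed MRCA sequence. The nested-tuple tree, the taxon names and the `anc_*` labels are all made up for illustration; the reconstruction itself would still come from the model fit described above.

```python
def leaves(node):
    """Return the set of leaf names under a nested-tuple tree node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def collapse(node, clades):
    """Replace every subtree whose leaf set exactly matches a known clade
    with one placeholder leaf (standing in for that clade's MRCA)."""
    if isinstance(node, str):
        return node
    members = frozenset(leaves(node))
    if members in clades:
        return clades[members]
    return tuple(collapse(child, clades) for child in node)


# Hypothetical 7-taxon tree with three well-resolved subgroups.
tree = ((("A1", "A2"), "A3"), (("B1", "B2"), ("C1", "C2")))
clades = {
    frozenset({"A1", "A2", "A3"}): "anc_A",
    frozenset({"B1", "B2"}): "anc_B",
    frozenset({"C1", "C2"}): "anc_C",
}

reduced = collapse(tree, clades)
print(reduced)  # ('anc_A', ('anc_B', 'anc_C'))
```

The reduced tree has one leaf per subgroup, so only the deep branches remain to be resolved.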

Are you trying to fit each of the 104 partitions with an individual model, or just treat all 17K residues as a contiguous block?

(1) could also be implemented; however, it would require quite a bit of custom HyPhy scripting. If you are interested in pursuing this, I'll be happy to give you some pointers and help you along the way.


Quote:
How large a data set can it cope with on a single processor ? There is a grid system available to me if the program could run in parallel.


You should be able to fit your data to a given tree on a single-processor machine in a reasonable time (depending on which models, and how many different models, are used). The HyPhy batch language can be used to implement the search part of (1) in parallel on an MPI cluster.
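
As a rough illustration of why fixing the subgroups makes the search tractable, the Python sketch below (again, not HyPhy batch code) counts the fully resolved unrooted topologies on n leaves, which is (2n - 5)!!, and deals the candidate backbones out to workers round-robin. The figures of 6 subgroups and 4 workers are assumptions for the example only, and the round-robin split is just a stand-in for whatever scheduling an actual MPI implementation would use.

```python
def unrooted_topologies(n_taxa):
    """Number of fully resolved unrooted topologies on n_taxa leaves:
    (2n - 5)!! for n >= 3, and 1 for fewer leaves."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def split_round_robin(n_items, n_workers):
    """Assign candidate indices 0..n_items-1 to workers in round-robin order."""
    return [list(range(w, n_items, n_workers)) for w in range(n_workers)]


n_subgroups = 6  # hypothetical number of fixed subgroups
n_workers = 4    # hypothetical number of cluster nodes

n_backbones = unrooted_topologies(n_subgroups)
jobs = split_round_robin(n_backbones, n_workers)

print(n_backbones)             # 105
print([len(j) for j in jobs])  # [27, 26, 26, 26]
```

Each worker would then fit the data to its share of the backbone arrangements and report the log-likelihoods back for comparison. For all 76 taxa unconstrained, `unrooted_topologies(76)` is far too large to enumerate, which is the reason for constraining the search in the first place.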

Feel free to ask further questions. It would be helpful, however, to have more details, especially about how you want to deal with the 104 separate partitions: a joint analysis (with one or multiple models) or a separate fit on each partition.

HTH,
Sergei
Associate Professor
Division of Infectious Diseases
Division of Biomedical Informatics
School of Medicine
University of California San Diego