HyPhy message board
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1115806935

Message started by Bryony Mackenzie on May 11th, 2005 at 3:22am

Title: ML with defined constraints ?
Post by Bryony Mackenzie on May 11th, 2005 at 3:22am
Hello,

I have a very large protein data set (104 partitions, 76 taxa, ~17000 amino acid positions). Obviously it is far too large for ML analysis :( I would like to set constraints so that the well-defined subgroups in my tree are fixed and ML analysis is carried out for the deep branches only.

To do this I need either 1) a program that can take a constraints tree and the large dataset and calculate ML using only deep branch arrangements or 2) to reconstruct the ancestral sequences for each subgroup and use these to reduce the number of taxa in my analysis.

Would HYPHY be suitable for either of these methods? How large a data set can it cope with on a single processor? There is a grid system available to me if the program could run in parallel.

many thanks
basm101  :)

Title: Re: ML with defined constraints ?
Post by Sergei on May 11th, 2005 at 7:25am
Dear Bryony,

Your problem is definitely an interesting one...


Bryony Mackenzie wrote on May 11th, 2005 at 3:22am:
Hello,

I have a very large protein data set (104 partitions, 76 taxa, ~17000 amino acid positions). Obviously it is far too large for ML analysis :( I would like to set constraints so that the well-defined subgroups in my tree are fixed and ML analysis is carried out for the deep branches only.

To do this I need either 1) a program that can take a constraints tree and the large dataset and calculate ML using only deep branch arrangements or 2) to reconstruct the ancestral sequences for each subgroup and use these to reduce the number of taxa in my analysis.

Would HYPHY be suitable for either of these methods?



HyPhy can be used to implement both approaches; however, each one would take some work. One could try to do (2) out of the box by: creating a separate data set for each resolved subtree; fitting an amino-acid model to that subtree independently (you'd have to root each subtree using an outgroup, since most amino-acid models are time-reversible); reconstructing ancestral sequences based on that model fit; and replacing the subtree with its MRCA. This can be done via standard analyses (AnalyzeNucProtData.bf, followed by Reconstruct Ancestral Sequences from the Analyses->Results menu).
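
The last step of (2), replacing each resolved subtree with its MRCA, can be illustrated with a small Python sketch. This is not HyPhy code and does not fit any model or reconstruct sequences; it only shows the tree bookkeeping. The nested-tuple tree format, the taxon names and the `MRCA_*` placeholder labels are all assumptions made up for the illustration.

```python
# Toy sketch: collapse each well-defined subgroup of a tree to a single
# placeholder tip standing in for its reconstructed MRCA sequence.
# Trees are nested tuples of taxon names; a leaf is a plain string.

def leaves(tree):
    """Return the set of tip names below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def collapse(tree, subgroups):
    """Replace every subtree whose tip set equals a named subgroup
    with that subgroup's placeholder label."""
    if isinstance(tree, str):
        return tree
    tip_set = leaves(tree)
    for label, members in subgroups.items():
        if tip_set == frozenset(members):
            return label
    return tuple(collapse(child, subgroups) for child in tree)


# Hypothetical 8-taxon tree with three resolved subgroups.
full_tree = ((("a1", "a2"), "a3"), (("b1", "b2"), ("c1", ("c2", "c3"))))
subgroups = {
    "MRCA_A": {"a1", "a2", "a3"},
    "MRCA_B": {"b1", "b2"},
    "MRCA_C": {"c1", "c2", "c3"},
}

reduced = collapse(full_tree, subgroups)
print(reduced)
```

In a real analysis each placeholder tip would carry the ancestral sequence reconstructed for that subgroup, and the reduced tree would be the starting point for the deep-branch search.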

Are you trying to fit each of the 104 partitions with an individual model, or just treat all 17K residues as a contiguous block?

(1) could also be implemented; however, one would need to write quite a bit of custom HyPhy scripting. If you are interested in pursuing this, I'll be happy to give you some pointers and help you along the way.
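
To get a feel for how much a constraint tree shrinks the search in (1), one can count unrooted bifurcating topologies, which number (2n-5)!! for n labelled tips. The Python sketch below is only that arithmetic, not a tree search. The figure of 8 subgroups is a made-up assumption (the number of subgroups was not given), and the count assumes each fixed subgroup behaves like a single tip.

```python
# Count unrooted bifurcating topologies on n labelled tips: (2n-5)!!
# Only the 76-taxon figure comes from the thread; the subgroup count
# below is a hypothetical value chosen for illustration.

def unrooted_topologies(n_tips):
    """Return 1 * 3 * 5 * ... * (2*n_tips - 5); 1 for fewer than 4 tips."""
    count = 1
    for k in range(3, 2 * n_tips - 4, 2):
        count *= k
    return count


n_taxa = 76        # from the original question
n_subgroups = 8    # assumption: number of fixed subgroups is not stated

unconstrained = unrooted_topologies(n_taxa)
constrained = unrooted_topologies(n_subgroups)

print("digits in unconstrained count:", len(str(unconstrained)))
print("deep-branch arrangements for", n_subgroups, "subgroups:", constrained)
```

With a handful of fixed subgroups the deep-branch arrangements can be enumerated exhaustively, which is also what makes the search easy to split across processors.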



Quote:
How large a data set can it cope with on a single processor? There is a grid system available to me if the program could run in parallel.


You should be able to fit your data to a given tree on a single-processor machine in a reasonable time (depending on which models, and how many different models, are used). The HyPhy batch language can be used to implement the search part of (1) in parallel on an MPI cluster.

Feel free to ask further questions :) It would be helpful, however, to have more details, especially about how you want to deal with the 104 separate partitions: a joint analysis (with one or multiple models), or a separate fit on each partition.

HTH,
Sergei
