Dear Sarah,
The newest GA analysis will automatically determine the appropriate number of categories, starting with 1 and stepping up until no further improvement in the AIC score can be found.
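To make the stopping rule concrete, here is a minimal Python sketch of the stepping-up logic. It is not the actual GA code: `fit` is a hypothetical callback standing in for whatever fits a model with k rate categories and returns its log-likelihood and parameter count, and `max_classes` is an arbitrary safety cap I added.

```python
from typing import Callable, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2p - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def choose_rate_classes(
    fit: Callable[[int], Tuple[float, int]],
    max_classes: int = 20,
) -> Tuple[int, float]:
    """Step up the number of categories until AIC stops improving.

    `fit(k)` is a hypothetical stand-in for a model fit with k categories;
    it must return (log_likelihood, number_of_parameters).
    """
    best_k = 1
    best_aic = aic(*fit(1))
    for k in range(2, max_classes + 1):
        score = aic(*fit(k))
        if score >= best_aic:
            # Adding a category did not lower AIC, so stop here.
            break
        best_k, best_aic = k, score
    return best_k, best_aic
```

Note that this stops at the first k that fails to improve the score, so it never looks past a local plateau.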
A 250-taxon tree would probably take too long to run in its entirety, even with the new, much faster GA version. There are several subsampling strategies:
1). Simply divide your tree into a number of clades, so each subsample forms a monophyletic group, if you can (might be difficult for very laddery trees). They don't have to be the same size, but should probably contain at least 20 sequences each. Then you can analyze each clade separately and collate the results. The drawbacks of this sampling approach are:
- You lose the information on branches connecting the clades
- Even though two branches in different clades may have the same dN/dS rate in a joint analysis (if we could run it), they may end up in different classes in individual analyses. This can be fixed by hacking the analysis to use the same set of dN/dS values on all subtrees (analyzed jointly). Hence, instead of searching a single 250-taxon tree, you could jointly search (say) 60-, 50-, 40-, 55- and 45-taxon trees, where in each tree the allocation of branches to dN/dS classes is done independently of the other trees, but the values of dN/dS for each rate class are estimated jointly from all trees.
2). Collapse all short branches. E.g., if the maximum pairwise distance (e.g. under the nucleotide TN93 metric) between K taxa is less than some a priori threshold, represent all K taxa with one of them (chosen randomly). This is not a bad idea if you have a sample of viruses from different hosts/localities or of different subtypes. You can use ClusterByDistanceRange.bf (a standard analysis under Data File Tools) to see if your data has collapsible clusters.
3). Random samples. Probably not a good idea, unless a large number of them are run and properly collated.
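For strategy 2, the collapsing rule can be sketched in a few lines of Python. This is only an illustration and not what ClusterByDistanceRange.bf does internally: it assumes you already have a symmetric matrix of pairwise distances (e.g. TN93) computed elsewhere, and the function name, the greedy grouping and the `seed` argument are my own assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def collapse_by_distance(
    names: Sequence[str],
    dist: Sequence[Sequence[float]],
    threshold: float,
    seed: Optional[int] = None,
) -> List[Tuple[str, List[str]]]:
    """Group taxa so that every pairwise distance inside a group is below
    `threshold`, then pick one random representative per group.

    `dist[i][j]` is an assumed precomputed symmetric distance matrix.
    Returns a list of (representative, members) pairs.
    """
    clusters: List[List[int]] = []
    for i in range(len(names)):
        for cluster in clusters:
            # Join the first group where i is close to *every* member,
            # which keeps the maximum within-group distance below threshold.
            if all(dist[i][j] < threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no suitable group: start a new one

    rng = random.Random(seed)
    return [
        (names[rng.choice(cluster)], [names[j] for j in cluster])
        for cluster in clusters
    ]
```

The grouping is greedy, so the result can depend on the order of the taxa; the only guarantee is that no two taxa in the same group are at or above the threshold distance from each other.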
In terms of nucleotide rates, I would not worry about overfitting too much. The precision of nucleotide rate estimates from 50 and 250 taxa on long alignments (yours is flu HA, right?) is not qualitatively different.
HTH,
Sergei