Dear Sarah,
'Trust' in statistics is a dangerous concept :)
Nested hypotheses work well when they represent all plausible explanations of a phenomenon. Otherwise, one may find that H_A is better than H_0, while in reality both are wrong and the correct model is some H_1, nested neither in H_0 nor in H_A.
There are several diagnostics for GA results:
1). Repeatability of runs. If multiple runs return very similar results, both in terms of the top N models found and in terms of model-averaged inference, then convergence has likely been achieved. Otherwise, try tightening the convergence criterion or increasing the population size. (A quick way to quantify run-to-run agreement is shown in the first sketch after this list.)
2). Size of confidence sets. If the 95% confidence sets are very large (say, > 30% of the total number of models), then the model being considered is probably too complex (e.g. too many rate classes), and only model-averaged inference should be trusted. (The second sketch after this list shows how a confidence set is assembled from Akaike weights.)
3). AIC-based model selection has an analogue of p-values: evidence ratios. To assess whether model 1 fits much better than model 2, you can compute the Akaike weight of each and take their ratio. A sufficiently large ratio (e.g. 20) can serve as evidence that model 1 has a much better fit than model 2. If the models found by the GA have high evidence ratios compared to the 'extreme' models (e.g. single ratio, free ratio, or a priori), then we can have more confidence in the GA. (See the second sketch after this list for the arithmetic.)
4). Sanity checks. For example, when you are running GABranch, you can always compare its output with that of a local model (a separate dN/dS for every branch) and see if the results point in the same direction.
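
To make diagnostic (1) concrete, here is a minimal Python sketch of one way to quantify run-to-run agreement. The model identifiers are made up for illustration; in practice you would read the top-ranked models out of each GA run's output:

def top_n_overlap(run_a, run_b, n=10):
    """Fraction of the top-n models shared by two GA runs."""
    return len(set(run_a[:n]) & set(run_b[:n])) / float(n)

# Hypothetical model identifiers, ranked best-first by each run
run1 = ["m07", "m12", "m03", "m44", "m09"]
run2 = ["m12", "m07", "m03", "m31", "m09"]

print("Top-5 overlap: %.0f%%" % (100 * top_n_overlap(run1, run2, n=5)))

An overlap close to 100%, together with similar model-averaged estimates, is a good sign that the runs have converged.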
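For diagnostics (2) and (3), the Akaike weights, the evidence ratio, and the size of the 95% confidence set can all be computed directly from the raw AIC scores. A minimal sketch, assuming a made-up list of AIC values:

import math

def akaike_weights(aic_scores):
    """Convert AIC scores into Akaike weights."""
    best = min(aic_scores)
    # Relative likelihood of each model: exp(-delta_i / 2),
    # where delta_i = AIC_i - min(AIC)
    rel = [math.exp(-(a - best) / 2.0) for a in aic_scores]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical AIC scores, for illustration only
aics = [1000.0, 1002.5, 1006.0, 1011.3, 1012.8]
weights = akaike_weights(aics)

# Evidence ratio of model 1 vs model 2: w_1 / w_2,
# which equals exp((AIC_2 - AIC_1) / 2)
print("Evidence ratio (model 1 vs model 2): %.2f" % (weights[0] / weights[1]))

# 95% confidence set: the smallest set of top-ranked models
# whose cumulative weight reaches 0.95
cum, size = 0.0, 0
for w in sorted(weights, reverse=True):
    cum += w
    size += 1
    if cum >= 0.95:
        break
print("95%% confidence set contains %d of %d models" % (size, len(aics)))

Note that the evidence ratio between two models depends only on their AIC difference, so it can be read off directly from the score table.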
For a very readable book on model selection issues, I would recommend …
Cheers,
Sergei