The following is a guest post by Joseph Mudge, who recently published a paper with colleagues arguing for a different approach to setting α values. Joseph has written a summary of the argument below. Please chime in with your thoughts. Do you agree/disagree?

by Joseph F. Mudge,

Ph.D. candidate (biology), University of New Brunswick,

Saint John, N.B., Canada,

joe.mudge@unb.ca

Ecologists are often charged with making binary decisions concerning ecological data. This causes problems because most patterns in nature exist as gradients. Null hypothesis significance tests have been a traditional tool for making binary decisions from gradient patterns in ecology, and they remain common in ecological research, despite many well-known problems. One of the most obvious problems with null hypothesis significance testing in ecology is the dogmatic adherence to the arbitrary α level of 0.05 as the significance threshold for decision-making. Two consequences of always applying this arbitrary standard are (1) the decoupling of statistical significance (or lack thereof) and biological significance (or lack thereof) and (2) radical inconsistencies in statistical power to detect biologically relevant effects between studies (ranging from near 0% to near 100%). Although the α problem has been well discussed by ecologists over the last few decades, the use of arbitrary α levels has persisted in the ecological literature due to the continuing need for a statistical binary decision-making tool and because a more rational, yet still universally applicable, approach to setting α levels has not been available.

If a researcher can quantify two things that I believe should always be important considerations for any ecological research question, namely (1) the level of effect that would be considered biologically meaningful if it were to exist, and (2) the relative seriousness of Type I vs. Type II errors, it becomes possible to set an optimal, study-specific, α level for decision-making that minimizes either the combined probabilities or costs of both Type I errors and biologically relevant Type II errors. Although specifying a biologically meaningful critical effect size and the relative importance of Type I and Type II errors is not a trivial matter for many ecological questions, it should be noted that implicit and unexamined decisions about biologically meaningful effect sizes and the relative importance of Type I and II error are made when α is set at 0.05. You can’t avoid making these decisions; you can only ignore that you’ve made them. It seems ill-advised to set an arbitrary (albeit easier) decision-making threshold that fails to minimize either chances or costs of mistakes when consideration of important factors related to the research being undertaken can minimize the chances and/or costs of making mistakes.

Once biologically relevant effect sizes and relative importance of Type I and Type II errors have been identified for an ecological research question, determining the optimal α level is simple, as long as the researcher can calculate power for the null hypothesis significance test. The researcher only needs to calculate statistical power for the biologically relevant effect size at a variety of different α levels, and the optimal α level is that which converges on the smallest average of α and β (1-statistical power), with α and β weighted by their respective relative costs of Type I or Type II error. Weighting α and β equally with respect to their relative costs assumes that errors are equally undesirable regardless of whether they be Type I or Type II, and the result for this scenario is the optimal α level that minimizes the combined probabilities of Type I and Type II error.
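
As a rough illustration of the procedure just described (a sketch, not the code used in Mudge et al. 2012), the search over α levels can be written in a few lines. Several things here are assumptions made only for the example: the test is a one-sided, one-sample z-test with known standard deviation, the critical effect size is expressed in standard deviation units, the search is a simple grid, and the names `beta_one_sided_z`, `optimal_alpha` and `cost_ratio` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the z-test


def beta_one_sided_z(alpha, effect_size, n):
    """Type II error rate (1 - power) of a one-sided one-sample z-test.

    effect_size is the critical effect size in standard deviation units
    (an assumption of this sketch); n is the sample size.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(z_crit - effect_size * n ** 0.5)


def optimal_alpha(effect_size, n, cost_ratio=1.0, grid_size=9999):
    """Grid-search the alpha that minimizes the weighted average of alpha and beta.

    cost_ratio is the cost of a Type I error relative to a Type II error;
    1.0 treats the two errors as equally serious.
    Returns (alpha, beta, weighted_average).
    """
    best = None
    for i in range(1, grid_size + 1):
        alpha = i / (grid_size + 1)
        beta = beta_one_sided_z(alpha, effect_size, n)
        average = (cost_ratio * alpha + beta) / (cost_ratio + 1.0)
        if best is None or average < best[2]:
            best = (alpha, beta, average)
    return best


if __name__ == "__main__":
    for ratio in (1.0, 4.0):
        alpha, beta, average = optimal_alpha(effect_size=0.5, n=16, cost_ratio=ratio)
        print(f"cost ratio {ratio}: alpha={alpha:.4f} beta={beta:.4f} average={average:.4f}")
```

With equal weights, this particular test ends up with α and β roughly equal (about 0.16 for a critical effect of 0.5 standard deviations and n = 16); making Type I errors four times as costly pushes the optimal α down and β up.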

The result of applying the optimal α approach in ecological research is that studies with high sample sizes end up with very small optimal α levels (while still having very high power to detect biologically relevant effects), reflecting the excellence of the study. In contrast, studies with low sample size end up with larger optimal α levels (in order to maintain some power to detect biologically relevant effect sizes), reflecting the lower quality of the study. Thus, optimal α can be an important and useful indicator of study quality.
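
The dependence on sample size can be seen in a small sketch. It assumes the same simplified setting as before (one-sided, one-sample z-test, equal costs for the two error types, critical effect size in standard deviation units); in that setting, and only in that setting, the sum of α and β is smallest when the critical value sits halfway between the null and alternative means, so α and β are equal and have a closed form. The function name is hypothetical.

```python
from statistics import NormalDist


def equal_cost_optimal_alpha(effect_size, n):
    """Optimal alpha for a one-sided one-sample z-test with equal error costs.

    Under these assumptions alpha + beta is minimized when the critical
    value is effect_size * sqrt(n) / 2, which makes alpha equal to beta.
    """
    return 1.0 - NormalDist().cdf(effect_size * n ** 0.5 / 2.0)


if __name__ == "__main__":
    # Critical effect size of 0.5 standard deviations, increasing sample size.
    for n in (8, 16, 32, 64, 128):
        alpha = equal_cost_optimal_alpha(0.5, n)
        print(f"n={n:4d}  optimal alpha = beta = {alpha:.4f}  power = {1 - alpha:.4f}")
```

The printed optimal α shrinks steadily as n grows while power rises, which is the pattern described above.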

I encourage ecologists to try using the optimal α approach to choose a statistical threshold for decision-making in their ecological research. I can think of no rationale for continuing to use an arbitrary threshold for decision-making in ecological research, other than that (1) it requires less thought, and (2) everybody else is using it.

Further reading:

Robinson, D.H., & Wainer, H. 2002. On the past and future of null hypothesis significance testing. *Journal of Wildlife Management* 66, 263-271.

Mudge, J.F., Baker, L.F., Edge, C.B., & Houlahan, J.E. 2012. Setting an optimal α that minimizes errors in null hypothesis significance tests. *PLoS ONE* 7(2), e32734. doi:10.1371/journal.pone.0032734.

I like this pragmatic approach. I’m curious, though, what Mudge et al. think of the idea of discussing p-values openly, as levels of support for rejecting the null, rather than a yes/no decision point.

I am in complete agreement about the problems with traditional null hypothesis testing discussed in the link provided in the comment by jebyrnes. I also believe that for the results of a study to be useful, someone needs to be able to make a call concerning whether to proceed as if something close to the alternate hypothesis is true, or to proceed as if something closer to the null hypothesis is true. This constitutes a binary yes/no type decision. Personally, I find it annoying when the authors of a paper suspend judgement on how their results ought to be interpreted and simply conclude ‘moderate support for rejecting the null’, leaving the burden of results interpretation entirely on the reader. The authors of a paper (usually) ought to be considered the foremost experts on their research question, and as such, they should be the most qualified to make the decision of whether or not there is enough evidence to proceed as if something close to the alternate hypothesis is true. Using the optimal alpha approach would aid the authors in making this recommendation in a transparent and explicit manner. The reader may then agree or disagree with the authors’ interpretation of the results, and luckily, the optimal alpha approach also gives readers a tool to determine their own decision-making thresholds for study-designs. This can also come in handy when the authors suspend judgement and leave the burden of interpretation on the reader.

The non-binary approach that I find very interesting is the consideration of multiple alternate hypotheses. Under this scenario, there could be different optimal alphas for different potentially biologically meaningful effects. This would lead to potentially different conclusions depending on what the magnitude of effect size is considered meaningful, and different levels of confidence associated with the different potential conclusions. A result could be considered significant if small effects are considered meaningful, but the small critical effect size would yield a large optimal alpha, so there would be weak evidence in support of a small effect. That same result could be considered non-significant if only large effects are considered meaningful, and the large critical effect size would yield high power (and also a small optimal alpha), so there would be strong evidence against a large effect.

I agree that blind acceptance of alpha = 0.05 is a problem in NHST, but I think a bigger problem in ecology is that the null hypothesis is usually known to be false a priori. In most cases, the null hypothesis is a nil null. Rejecting it is usually uninteresting. I think any NHST needs a null that has at least some reasonable a priori support.

The other big problem is that failure to reject the null is often interpreted as evidence that the null is true, rather than looking at levels of support for the alternative hypotheses. NHST can only falsify the null hypothesis, not confirm it. This fallacy is a big problem if combined with trivial nil nulls.

An alternative that leads to much easier implementation and interpretation is to focus on effect sizes and the uncertainty in the effect sizes, including more frequent use of meta-analysis.

Everyone reading this should look for papers by Fiona Fidler on the topic of appropriate statistical reporting, including the following which really should have been cited in the Mudge et al. paper:

Fidler, F., Burgman, M., Cumming, G., Buttrose, R. & Thomason, N. (2006). Impact of criticism of null hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539-1544.

Other publications are listed at:

http://www.botany.unimelb.edu.au/envisci/about/staff/fidler.html

Fiona and some of her colleagues have experimental evidence of how to improve statistical reporting and interpretation. Now I love opinions, but I find evidence more compelling.

But for more opinions, readers might also look at related posts (by Bob O’Hara and me):

http://blogs.nature.com/boboh/2008/08/19/why-p-values-are-evil

http://mickresearch.wordpress.com/2012/02/25/interpreting-variation-in-data/

Cheers,

Mick (mickresearch.wordpress.com)

I agree with Mick that the misuse of null hypothesis significance testing to reject hypotheses that are known (a priori) to be false is indeed disappointingly common. Knowing a hypothesis to be false ought to discourage any perceived need to subject the hypothesis to probabilistic rejection. Null hypothesis significance tests only have utility for situations in which the null hypothesis (or something close to it) is considered to be plausible. It is also true that the probability of any precise point hypothesis being exactly true is infinitesimally small, but this need not prevent us from testing whether precise point hypotheses are approximately true. This is the realm of NHST.

Fidler et al. point out that, six years ago, people were still using NHST. Fidler et al. also attempt to steer researchers away from NHST. We do cite several papers that argue for a turn away from NHST, and although I don’t know the percent frequency of use of NHST for any journal, it seems clear to me that NHST remains common in ecological research in 2012. We make two claims: (1) there are contexts where null hypothesis testing is appropriate, and (2) when it is appropriate, there is a better way to do it. To the first claim: I can provide a dozen examples where null hypothesis testing fits perfectly well with our intuitive sense of a question. For example, does smoking give you a higher risk of lung cancer? We could identify several potential causes of lung cancer and test competing hypotheses but we would not be testing the question we care about. Is global warming presently occurring at a faster rate than over another time period? Is ecosystem stability associated with higher biological diversity? These are questions that are perfectly reasonably addressed with a null hypothesis approach. Of course, they don’t have to be addressed with the NHST approach but it wouldn’t be wrong to do so. I take the continued use of NHST in ecology not as solely a measure of the statistical ignorance of ecologists, nor of their resistance to change, but instead as an indication that, for certain research questions, NHST provides ecologists with some benefit that cannot be readily attained by other statistical approaches. My previous comment describes the benefits that null hypothesis significance tests provide as decision-making tools over simply looking at levels of support for (or against) a hypothesis.

To the second claim: the optimal alpha approach simply allows the decision-making threshold to be chosen in a manner that minimizes the probability/cost of making a mistake, based on careful consideration of biologically meaningful effect sizes and relative costs of Type I and Type II errors.

The recommendation to use effect sizes and confidence intervals ignores the fact that decisions are usually binary and effect sizes usually continuous, which implies that a decision threshold must be identified. Observed effect sizes are always important to report but using only effect sizes and confidence intervals does not provide a decision-making tool unless decisions are made using some non-overlap rule, which would also be subject to some arbitrary confidence level analogous to an arbitrary alpha level. What’s so special about 95% confidence intervals? Why not 80%, 90%, 99% or 99.9%? If decisions are being made from confidence intervals then the appropriate interval percentile should depend on the relative seriousness of Type I vs. Type II errors, and not be held to some arbitrarily agreed upon percentile.

I also agree with Mick that failure to reject a null hypothesis is often too hastily interpreted as evidence that the null is true. This is especially true when researchers ignore power issues associated with their experimental design, which is why I advocate for statistical power to play a more central role in null hypothesis significance testing. Failure to reject a null hypothesis should always result in investigation of the statistical power to detect a meaningful alternate hypothesis. This allows a researcher to take the failure to reject a null hypothesis as evidence that the null (or something close to it) is true if, and only if, they can show that the probability of making a biologically relevant Type II error is very small. Statistical power calculation is a necessary component of the calculation of an optimal alpha level, so I believe that the optimal alpha approach would solve this problem, as non-significant results could not be interpreted without knowledge of statistical power to detect relevant effects.

I agree with Joseph that statistical power should be considered when using null hypothesis significance testing (NHST) in ecology, and that it almost never is. But that is not the only problem.

I also agree that there are some circumstances in which NHST can be used effectively. But the evidence is that almost every NHST paper in ecology includes nil nulls, in which cases NHST is rarely if ever suitable and a binary decision is not required. So before we ask people to go to the trouble of doing a power analysis and think about the correct choice of alpha, we should ask them to think about whether NHST is appropriate in the first place.

Despite the above, non-significant results are incorrectly interpreted as evidence that the null is true about half the time (sometimes more frequently). Combined with nil nulls, no power analysis, and a common failure to report effect sizes, we have a problem.

My suggestion to quote effect sizes was not meant to imply that people should only do that, or that it is the only answer (see below). In those circumstances where NHST is appropriate, it is fine to quote the p-value. If using Bayesian or information theoretic methods, the weight of evidence in favour of different hypotheses might be quoted.

But regardless of the choice of statistical method, effect sizes should be quoted (and interpreted carefully) in almost every case (yet this occurs much too rarely).

An added benefit of effect sizes is that much of the same interpretation that is contained in a p-value and power analysis can be derived from an effect size with confidence interval. Because most ecologists have no experience with calculating power (more an indictment of the discipline of ecology than of ecologists), the pragmatist in me says that asking for well-considered power analyses won’t change practices, because we rarely get any power analyses. But I hope I am wrong.

Most of the information above is based on Fiona Fidler’s paper that has data on surveys of published articles in major ecology and conservation biology journals:

Fidler, F., Burgman, M., Cumming, G., Buttrose, R. & Thomason, N. (2006). Impact of criticism of null hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539-1544.

But effect sizes won’t solve all difficulties, and my experiences bear that out. They need to be interpreted carefully. One of Fiona’s other papers has a title that communicates that point beautifully:

Fidler, F., Thomason, N., Cumming, G., Finch, S. & Leeman, J. (2004). Editors can lead researchers to confidence intervals but they can’t make them think: Statistical reform lessons from Medicine. Psychological Science, 15, 119-126.

Clear thinking is the key.

Many years ago, Shrader-Frechette and McCoy wrote a book arguing that ecologists should take type II errors more seriously and type I errors less seriously.

Personally, I think there are virtues to having a universal standard, even if there are reasons to deviate from it in particular cases. For instance, I can easily imagine researchers simply adopting a liberal alpha value, and providing apparently-cogent arguments for it, simply as a way to compensate for poor study design or lack of replication. That would be bad. We almost certainly have far too many “false positives” in science as it is. And while I agree we should be more concerned with effect sizes, and with estimation of biologically-meaningful parameters rather than with rejection of statistical null hypotheses, I don’t see why we need to change the alpha=0.05 standard to some flexible standard in order to encourage discussion of effect sizes and a focus on parameter estimation. I also question whether costs of type I vs. type II errors can ever be estimated with sufficient precision on a case-by-case basis to make it sensible to use those costs to set case-by-case error rates and decision rules.

Doing statistics well involves making a lot of judgment calls. I would never say statistics can be reduced to a set of rules, a “recipe” to be blindly followed. But I think individual investigators make the best judgment calls when they start from a deep appreciation of the point of widely-applicable “rules”. If those rules are too flexible, at some point they stop being rules.

Andrew Gelman and Hal Stern have highlighted one of the biggest problems with the use and abuse of NHST. Unfortunately “the difference between significant and not significant is not itself statistically significant”.

A must read.

http://www.stat.columbia.edu/%7Egelman/research/published/signif4.pdf

While it might be difficult, I think it’s definitely worth it to attempt to quantify the trade-off between Type I and Type II error for particular systems. Carl Boettiger had a great example in his ESA talk (posted here: http://precedings.nature.com/documents/6857/version/1) of using receiver operating characteristic (ROC) curves to examine this trade-off and make a decision about whether false positives or false negatives are more important.

Terrific paper! Is there now any justification for doing anything else if performing NHST?

@Jeremy Fox: The optimization method doesn’t give us more wiggle room. In fact, it gives us less, as there’s a single optimal alpha level. Also, the optimization approach forces us to be explicit about effect sizes.

I would argue that, in basic research, false positives and false negatives are equally bad. (In fact, an early false negative might be worse, as it can close off a whole line of research for a long time.) Anything other than equality needs to be justified, and in applied work such justifications certainly exist.

“it becomes possible to set an optimal, study-specific, α level for decision-making that minimizes either the combined probabilities or costs of both Type I errors and biologically relevant Type II errors”

Not quite. As far as I can tell, what this approach does is minimise (α+β) (optionally, weighted by cost), but that’s NOT the same as minimising the combined probability of Type I and Type II errors.

α represents the probability of type I errors *conditional on the null hypothesis being true*, and β is the probability of type II *conditional on the null hypothesis being false*.

Minimising (α+β) is only equivalent to minimising combined Type I and Type II errors in the case where the null and alternate hypotheses are equally likely, which is an awfully big assumption.

Example: suppose I’m trying to detect forged banknotes. For the sake of argument, suppose that the relationship between α and β is such that α=0.0001/β, and let’s assume Type I and Type II errors have equal cost.

Out of a million banknotes, I might expect 10000 to be forged. The approach outlined here suggests that I should set tolerances such that α=β=0.01. Under that approach, I’ll expect (0.01*990000)+(0.01*10000) = 9900 Type 1 and 100 Type 2 errors, or 10,000 errors in total.

But in fact, a bit of calculus shows that the optimal choice is approximately α = 0.001, β=0.1. At these tolerances we get 990 Type 1 and 1000 Type 2 errors – five times better than the result from minimising α+β.
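
For anyone who wants to check the arithmetic, here is a minimal sketch of the banknote example. The numbers and the relationship α = 0.0001/β come from the example above; the variable and function names are just for illustration, and the two closed-form α values follow from setting the derivative of each objective to zero.

```python
N_NOTES = 1_000_000
N_FORGED = 10_000
N_GENUINE = N_NOTES - N_FORGED
K = 0.0001  # assumed trade-off from the example: alpha * beta = K


def expected_errors(alpha):
    """Expected (Type I, Type II) error counts across all notes."""
    beta = K / alpha
    type_1 = alpha * N_GENUINE  # genuine notes wrongly flagged as forged
    type_2 = beta * N_FORGED    # forged notes wrongly passed as genuine
    return type_1, type_2


# Minimizing alpha + beta subject to alpha * beta = K gives alpha = beta = sqrt(K).
alpha_min_sum = K ** 0.5

# Minimizing the expected error count N_GENUINE * alpha + N_FORGED * K / alpha
# gives alpha = sqrt(N_FORGED * K / N_GENUINE).
alpha_min_count = (N_FORGED * K / N_GENUINE) ** 0.5

if __name__ == "__main__":
    for label, alpha in (("min alpha+beta", alpha_min_sum),
                         ("min error count", alpha_min_count)):
        t1, t2 = expected_errors(alpha)
        print(f"{label}: alpha={alpha:.5f} beta={K / alpha:.5f} "
              f"type1={t1:.0f} type2={t2:.0f} total={t1 + t2:.0f}")
```

The first rule gives about 10,000 expected errors and the second about 1,990, which matches the roughly fivefold difference quoted above.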

Geoffrey Brent makes a good point that minimizing the combined probabilities or costs of Type I and Type II errors does ultimately also depend on the a priori relative probabilities of the null and alternate hypotheses being true. In the paper referred to in the reference section of this blog post (Mudge et al. 2012), we describe how ignoring the a priori relative probabilities of the null and alternate hypotheses assumes that they are equally likely to be true (as pointed out by Geoffrey Brent). We also describe how the average of α and β can be weighted by relative costs of error and also the relative prior probabilities of error, if either or both of these are known, to provide an optimal alpha that more accurately minimizes probabilities or costs of Type I and Type II errors.

Null hypothesis significance tests are most appropriate for situations in which researchers do not wish to incorporate prior probability information into the statistical decision-making process (Bayesian approaches are better suited to deal with statistical problems involving known priors). Nevertheless, the choice to undertake a null hypothesis significance test implies that the researcher believes there must be some chance that the null hypothesis (or something close to it) is true and there must also be some chance that a biologically relevant alternate hypothesis is true. Without incorporating any additional information that could potentially be used to generate prior probabilities for null and alternate hypotheses, Laplace’s principle of indifference states that we are justified in assigning equal prior probabilities to the null and alternate hypotheses. In fact, it is the only reasonable assumption that can be made in the absence of any prior probability information. Thus, without incorporating prior probability information into a null hypothesis significance test, we can use the optimal alpha approach to effectively say “given that both types of error are possible, but we are not able to make a confident claim about their relative prior probabilities, here is the optimal α level that most effectively guards against both types of errors.” I believe that this is a useful thing to be able to say, and I would recommend Bayesian statistics to those who might disagree.

I’m not convinced it’s valid to apply the principle of indifference here. A typical formulation of that principle is that when n mutually exclusive and collectively exhaustive possibilities are indistinguishable *except for their names/labels*, it’s reasonable to set the probability of each outcome to 1/n.

This last condition is rarely applicable in experimental research. Assuming P0=P1 leaves us in the position of tacitly accepting an arbitrarily-chosen and unexamined prior.

As an example of the problems with applying this principle in such situations, let’s consider two questions: “Is there life on Mars?” and “Is there intelligent life on Mars?”

Applying this version of the principle of indifference to each question in turn tells us that there is a 50% probability of there being life on Mars, and a 50% probability of there being intelligent life on Mars. Since the latter event is a subset of the former, it follows that the existence of ANY life on Mars implies the existence of intelligent life there.

Obviously this is absurd; we might not be able to estimate accurate priors for these two questions, but any sensible analysis should at least acknowledge that the priors are unequal.

Using this version of PoI also leaves us vulnerable to biased experimental design. Take parapsychology as an example: if I want to disprove the existence of ESP, I’d pick a single broad hypothesis (say, “humans have the ability to predict future events”) and test that. But if I want to prove it, I’d pick lots of different hypotheses and examine them one at a time: “Bob can predict results of coin-tosses”, “Susan can tell me what card is next”, and so on. Using a 50/50 prior guarantees that if I test enough hypotheses, eventually I’ll get a positive “finding”, and I only need one.

We can mitigate that problem by applying a Bonferroni adjustment – but this is essentially an admission that the default priors would be overly generous and/or highly correlated. (Which is to say that the prior for Experiment B is heavily conditional on the outcome of Experiment A, and hence that if Experiment A gives a negative result then the prior for Experiment B should be very low).

One way or another, we end up having to abandon the assumption of equal priors (and in most cases, replace it with some other assumption).

BTW, just about all of these criticisms apply to the “alpha = 0.05” approach too; I think an approach that considers both alpha and beta is an improvement over one that only considers alpha. But in the end all methods depend on making some shaky assumptions about priors, and I’d prefer that people explicitly acknowledge those assumptions rather than package them inside the method where they don’t have to think about them. At least that way the assumptions are being made by the person who’s doing the work and is responsible for its quality.

(Sorry, I tend to go on a bit…)

I agree that prior probabilities are often extremely important to consider as Geoffrey Brent has pointed out, and there really isn’t a great way to incorporate prior probabilities into the frequentist statistical framework of null hypothesis significance testing.

If one can specify the relative prior probabilities of null and alternate hypotheses, it is possible to calculate an optimal α that minimizes the prior probability weighted average of α and β. But I don’t think that the relative prior probabilities of null and alternate hypotheses are something that most users of null hypothesis significance tests will be able to specify. If that is the case, users of null hypothesis significance tests who wish to optimize their α level will either have to make the Laplacian equal prior probability assumption, or ignore prior probabilities outright and settle for minimizing the average of two conditional probabilities of error (Type I and Type II) that they are concerned about, but which they have no idea how likely they are to occur. The latter essentially amounts to an a priori ‘bet hedging’ process between 2 different probability spaces, the likelihoods of which are completely unknown.
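
To make the prior-weighted version concrete, here is a hedged sketch. It assumes a one-sided, one-sample z-test with the critical effect size in standard deviation units, weights α by the prior probability of the null (times an optional relative cost) and β by the prior probability of the alternative, and finds the minimum by grid search. The function and parameter names are hypothetical and this is not the code from Mudge et al. 2012.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def weighted_optimal_alpha(effect_size, n, prior_null=0.5, cost_ratio=1.0, steps=9999):
    """Alpha minimizing the prior- and cost-weighted average of alpha and beta.

    prior_null is the assumed prior probability that the null is true;
    cost_ratio is the cost of a Type I error relative to a Type II error.
    """
    weight_1 = prior_null * cost_ratio  # weight on alpha
    weight_2 = 1.0 - prior_null         # weight on beta
    shift = effect_size * n ** 0.5      # standardized distance between hypotheses

    def weighted_error(alpha):
        beta = STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - shift)
        return (weight_1 * alpha + weight_2 * beta) / (weight_1 + weight_2)

    candidates = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(candidates, key=weighted_error)


if __name__ == "__main__":
    for prior in (0.1, 0.5, 0.9):
        alpha = weighted_optimal_alpha(0.5, 16, prior_null=prior)
        print(f"P(null) = {prior}: optimal alpha = {alpha:.4f}")
```

With equal priors this reduces to the unweighted case; the more probable the null is assumed to be, the smaller the resulting α.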

If the researcher cannot specify prior probabilities and is also unwilling to assume equal prior probabilities, α levels can still be optimized to find the best possible compromise between Type I errors if the null hypothesis is true and Type II errors if the alternate hypothesis is true, but the researcher then has to be ok with not being able to make any estimate of the probability of having made an error (as is normally the case for frequentist statistical techniques).

I’m not trying to argue that the approach I describe is superior to Bayesian approaches. I’m merely suggesting that if we’re going to continue using null hypothesis significance tests, we should use them with more careful consideration of critical effect sizes, Type II errors and relative costs of Type I vs. Type II errors.

(I appreciate the feedback, and as you can see, I tend to go on a bit too…)

I definitely agree that Type II errors should be given as much consideration as Type I.

Hm, it occurs to me that even if you don’t have a good estimate of priors, there is one big advantage of keeping an eye on your βs as well as αs: If you know both, then you can at least set an upper and lower bound on the total a priori error probability, which makes the results easier to communicate. “This test gives the wrong result 2-5% of the time” is a much more intuitive statement than “when the true answer is negative this test gives false positives 5% of the time”.
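
A quick sketch of why the bounds hold: whatever the (unknown) prior probability of the null, the total error probability is a weighted average of α and β, so it can never fall outside the range between them. The function names and the example values α = 0.05, β = 0.02 are illustrative only.

```python
def total_error(alpha, beta, prior_null):
    """Unconditional error probability for a given prior on the null."""
    return prior_null * alpha + (1.0 - prior_null) * beta


def total_error_bounds(alpha, beta):
    """Lower and upper bounds on the total error probability over all priors."""
    return min(alpha, beta), max(alpha, beta)


if __name__ == "__main__":
    alpha, beta = 0.05, 0.02
    low, high = total_error_bounds(alpha, beta)
    print(f"total error lies between {low:.0%} and {high:.0%}")
    for prior in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"P(null) = {prior:.2f}: total error = {total_error(alpha, beta, prior):.4f}")
```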

Considering how many errors are introduced by misinterpretation of statistical results, that benefit is not to be sneezed at. Might even be an argument for designing to equalise alpha and beta in some cases – although relative costs of errors and Bonferroni-type issues would still require exceptions to that. I suspect equalising might often give a similar result to minimising the sum anyway.

I find the argument made in the Mudge et al. paper very interesting. Having said that, it feels like the authors have left out an important part of the puzzle, namely the t-distribution or Normal distribution assumption. In practice, data never exactly follow any of these distributions; at best they do so only approximately. Violation of the distribution assumption affects the performance of both traditional alpha tests and optimal alpha tests, but to different degrees. The performance of the optimal alpha vs. traditional alpha is likely to depend on how ‘close’ the approximation is.

For example in the case of a one sample hypothesis test where the true distribution is ‘slightly’ left skewed (but not enough for us to reject the assumption of a t distribution), H_a: mu > value, and there is a small sample size, the optimal alpha approach could potentially drastically ‘overestimate’ the appropriate significance level by quantifying a greater area under the curve than the traditional significance level under the ‘wrong’ symmetric distribution curve.

Violation of test assumptions, including distributional assumptions, can certainly be a major problem for any statistical test, and the optimal alpha approach is no exception. Optimal alpha levels for traditional parametric tests do rely on a normality assumption in their calculation. So for parametric tests they are optimal, assuming the sampled data come from normal distributions. Alpha = 0.05 for parametric tests also assumes a normal distribution (i.e. there is a 5% chance of making a Type I error if the null hypothesis is true, only if the data are sampled from normal distributions). P-values for these parametric tests also assume data were collected from normally distributed populations. If the sampled data do not come from normal distributions, the p-value resulting from the data becomes unreliable and does not reflect the true probability of observing a result more extreme than what was observed, if the null hypothesis were true. An inaccurate p-value is just as unreliable when compared to alpha = 0.05 as it would be when compared to an optimal alpha level. I think the main point to be made here is that making good conclusions from statistical tests requires that test assumptions be met as closely as possible.

Luckily, the optimal alpha approach can also be applied in non-parametric or randomization tests, in which there are fewer test assumptions to violate (or at least more difficult test assumptions to violate) and in these cases having data sampled from normal distributions is not necessary.

I’ve written a version that works for bootstrap. It’s computationally intensive but not difficult.

Is your code public for that, Jane?

The definition of the significance level is alpha = P(Reject H_0 | H_0 is true). There’s no distribution assumption made in the definition or when it is arbitrarily fixed at a value, typically 0.05. However, we do conduct the hypothesis test using the Normal distribution. As a result, the p-value is dependent on the Normal distribution; the traditional significance level IS NOT. But in the case of the optimal alpha, its value DOES indeed depend on the distribution assumption. As people in the finance world have found out, estimating tail events is difficult, in part because the Normal distribution is never exactly true.

I wonder how the optimal alpha performs when the Normal distribution is only ‘approximately’ true (this is always the case in real life and could be simulated). I wouldn’t be surprised if, in some situations, depending on the degree of skewness (even if small) of the true distribution, the optimal alpha approach overestimates or underestimates the combined error probability. This could lead to misleading significance level estimates. It would be interesting to see how sensitive the optimal alpha is to the distribution assumption. For example, when the distribution is only barely approximately Normal (at least according to a normality test), how is the optimal alpha procedure affected?