This post was authored by Thomas Stewart as part of a series of articles prepared by Vanderbilt University Medical Center’s VICTR Research Methods group.



Suppose you’ve performed a study comparing the risk of complication between two surgical procedures, generically called A and B. The study results are below.

Procedure   Complication   No complication
A           30             70
B           40             60

Your estimate of the treatment effect is an odds ratio estimated from logistic regression.

. logistic complication procedure [fweight = frequency]

Logistic regression                             Number of obs     =        200
                                                LR chi2(1)        =       2.20
                                                Prob > chi2       =     0.1377
Log likelihood =  -128.3876                     Pseudo R2         =     0.0085

------------------------------------------------------------------------------
complication | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   procedure |   1.555556   .4648111     1.48   0.139     .8660449    2.794027
       _cons |   .2755102   .1327441    -2.68   0.007     .1071565    .7083646
------------------------------------------------------------------------------

The estimated odds ratio is 1.56, with a 95% confidence interval of (0.87, 2.79). The point estimate is relatively large, but the confidence interval is wide, so wide that it spans the point null of 1. The p-value is 0.139.
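As an aside, you can reproduce these numbers by hand from the 2-by-2 table. The short Stata sketch below is not part of the original analysis; it simply uses the display command and the usual Wald standard error for the log odds ratio, sqrt(1/30 + 1/70 + 1/40 + 1/60), as a check on the regression output.

* Odds of complication under each procedure (from the table above)
display "Odds, procedure A: " 30/70
display "Odds, procedure B: " 40/60

* Odds ratio (B vs A) and its Wald 95% CI computed on the log-odds scale
display "Odds ratio:    " (40/60)/(30/70)
display "95% CI, lower: " exp(ln((40/60)/(30/70)) - invnormal(0.975)*sqrt(1/30 + 1/70 + 1/40 + 1/60))
display "95% CI, upper: " exp(ln((40/60)/(30/70)) + invnormal(0.975)*sqrt(1/30 + 1/70 + 1/40 + 1/60))

These commands reproduce the 1.56 (0.87, 2.79) reported in the logistic regression output.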

Pop quiz

How should you interpret the results? Here are some options:

A. There were no differences in the complication rates of procedures A and B.

B. The data failed to reject the hypothesis that the complication rates were the same for procedures A and B, so we conclude that the rates are the same.

C. The data suggests that procedures A and B have a similar complication rate as they are not statistically (or significantly) different.

D. None of the above.

The correct answer is D, none of the above. And, you probably knew that. Or, at least knew that we as statisticians wanted you to say none of the above. Perhaps you are like our colleagues who honestly say: “I know the statisticians don’t want me to say the complication rates are the same, but for all practical purposes, I’m going to think of them as being the same.”

The truth is, interpreting results when P > 0.05 is nuanced, and the mistake of concluding too much from a null result is probably the most common mistake in scientific research. In this blog post, we aim to give you a vocabulary for interpreting null results (and non-null results as well) that will help you identify and understand the subtle error in options A, B, and C of the pop quiz. We hope this vocabulary will help you interpret your study results when P > 0.05.

Study goals vs study outcomes

A large chunk of research studies are designed to show that two groups are different or that two groups are the same. For example, a randomized controlled trial comparing surgical procedures A and B may be designed to show that the complication rate is higher under A than under B. In another setting, one might design a study to show that the complication rate is the same for patients receiving procedure A or procedure B.

Oftentimes, studies will successfully show a difference or a similarity. However, there is a third possible result: the study may be inconclusive. In such a situation, the data fail to provide conclusive evidence for establishing a difference or a similarity in the study endpoint between the two groups. We discuss this in greater detail in later sections, but it is helpful to begin the conversation by establishing the terms conclusive difference, conclusive similarity, and inconclusive.

How to demonstrate a similarity

Think about the study of surgical procedures presented at the start of this post. Look at the following hypothetical results from a series of studies. Which of the studies, if any, conclusively demonstrate for you that the complication rate is the same between procedures A and B? Why?

The figure shows the confidence interval and associated estimated difference in complication rate on the percentage-point scale. (If procedure A has 30% complications and procedure B has 27% complications, then the difference in complication rates on the percentage-point scale is 3.)

What information did you use to determine which studies demonstrated a conclusive similarity and which did not?

It is possible that none of the studies convinced you that the complication rates are the same. But if some of the studies were convincing, we are willing to wager that the most important information was gleaned from the 95% confidence interval. For a study to show a conclusive similarity, not only must the point estimate of the difference be close to zero, the entire confidence interval must also lie close to zero.

How much wiggle room around zero did you allow before being convinced that the complication rates were similar?

This, of course, is an abstract example, and your answer to that question will probably change depending on the severity of the complication. If the complication is a mild surgical site infection, you may be willing to accept that the procedures are similar if the confidence interval for the difference is within +/- 5 percentage points of zero. However, if the complication were more serious, you might only be convinced that the complication rates are similar if the confidence interval is within +/- 1 percentage point of zero.

A study designed to demonstrate similarity begins by defining the amount of wiggle room. The formal term for wiggle room is equivalence threshold. Differences larger than the threshold are considered meaningful. Differences smaller than the threshold are considered practically null. The region around zero bounded by the lower and upper equivalence thresholds is the null region, as it contains all the effect sizes that are zero or essentially zero.

An equivalence threshold can be defined on both the absolute and relative scales. For example, earlier we defined an equivalence threshold for the difference in complication rates as +/- 1 percentage point of zero; that is a threshold on the absolute scale. Alternatively, we could have defined a threshold on the relative scale, say a risk ratio from 0.97 to 1.03.

Note that zero is the center of the null region on the absolute scale and one is the center of the null region on the relative scale.

The analysis plan for a study designed to establish a similarity is essentially these steps (a short Stata sketch follows the list):

  1. Establish an equivalence threshold.
  2. Collect data.
  3. Calculate the confidence interval.
  4. If the CI falls entirely within the equivalence thresholds (that is, inside the null region), then the analysis demonstrates a conclusive similarity.
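To make step 4 concrete, here is a minimal Stata sketch. The equivalence threshold of 5 percentage points and the confidence limits below are hypothetical placeholders, not numbers from the study above; substitute the values from your own analysis.

* Hypothetical equivalence threshold (percentage points) and hypothetical 95% CI for the difference
local delta    = 5
local ci_lower = -2.1
local ci_upper =  3.4

* Step 4: displays 1 if the entire CI sits inside the null region (conclusive similarity), 0 otherwise
display (`ci_lower' > -`delta') & (`ci_upper' < `delta')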

The figure below shows three possible confidence interval estimates overlaid on the null region. The first demonstrates a conclusive similarity, whereas the bottom two are inconclusive: they include both practically null and clinically meaningful effect sizes, so they do not exclusively support practically null effect sizes.

How to demonstrate a difference

If a study demonstrates a conclusive similarity when the confidence interval falls within the null region, and if a study is inconclusive when the confidence interval straddles the equivalence threshold, then how might a study demonstrate a conclusive difference?

Where would the confidence interval fall?

A study demonstrates a conclusive difference when the confidence interval for the effect size is completely outside the null region.
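Putting the two rules together, the three possible outcomes (conclusive similarity, conclusive difference, and inconclusive) can be expressed as a single decision rule. As before, this Stata sketch uses hypothetical confidence limits and a hypothetical threshold, not results from the study above.

* Hypothetical threshold and CI limits; the null region runs from -delta to +delta
local delta    = 5
local ci_lower = 6.2
local ci_upper = 14.8

if (`ci_lower' > -`delta') & (`ci_upper' < `delta') {
    display "Conclusive similarity: the CI lies entirely inside the null region."
}
else if (`ci_lower' > `delta') | (`ci_upper' < -`delta') {
    display "Conclusive difference: the CI lies entirely outside the null region."
}
else {
    display "Inconclusive: the CI straddles an equivalence threshold."
}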

The point null

You might be thinking: “Hey, I’ve done a fair number of analyses in my time. In all those studies, I don’t recall a single time when the research team established an equivalence threshold.” Truth is, the vast majority of studies in clinical research do not establish an equivalence threshold. Rather, research teams most often opt to side-step the question of which effect sizes are clinically meaningful and use a point null. When the equivalence threshold is zero, the null region shrinks to a single point, the null value. This single-point null region is the point null.

In the context of the surgical complication study, rather than specifying +/- 1 percentage point difference as the null region, a research team may opt for a point null at zero. Or, if analyzing relative differences, a research team might specify a point null at one for the odds ratio or risk ratio.

The consequence of specifying a point null is that the analysis cannot demonstrate a conclusive similarity. (Extra credit: Why is this true? What would the confidence interval for the effect estimate need to be to show a conclusive similarity if the researchers specify a point null?)

The most common mistake in data analysis

The most common mistake when interpreting analysis results is to use language that suggests a conclusive similarity when in fact the result is inconclusive.

Most of the content in introductory statistics courses focuses on showing differences; very little is taught about establishing similarities between groups. Introductory courses focus so heavily on showing differences that students may come away from intro stats with the mistaken idea that analyses that fail to show a difference must show a similarity.

What should be said when P > 0.05?

First, be upfront that the analysis is inconclusive.

Second, take advantage of the confidence interval. If you are looking at complication rates, for example, and the confidence interval for the difference falls within +/- 5 percentage points, you might write: “If the complication rate differs between procedure A and procedure B, the data suggests that the difference is no larger than 5 percentage points.”
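For the running example, one way to obtain such an interval is a simple normal-approximation (Wald) interval for the difference in complication proportions. The Stata sketch below is only illustrative; the original analysis reported an odds ratio, and other interval methods for a difference in proportions exist.

* Difference in complication proportions, B minus A (30/100 under A, 40/100 under B),
* with a Wald 95% CI, reported in percentage points
display "Difference:    " (.40 - .30)*100
display "95% CI, lower: " (.40 - .30 - invnormal(0.975)*sqrt(.30*.70/100 + .40*.60/100))*100
display "95% CI, upper: " (.40 - .30 + invnormal(0.975)*sqrt(.30*.70/100 + .40*.60/100))*100

With these data the interval runs from roughly -3 to +23 percentage points, so the honest statement is weaker than the +/- 5 point hypothetical above: the data are consistent with procedure B having a complication rate anywhere from about 3 percentage points lower to about 23 percentage points higher than procedure A.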

Conclusion

Let’s return to the initial example at the start of the post. The key result from the analysis was an estimated odds ratio and 95% confidence interval of 1.56 (0.87, 2.79). It would be misleading to say that there is no treatment effect as the confidence interval includes odds ratios that are large.

Look at the options of our pop quiz. You’ll note that each is a manifestation of the same mistake.

A. There were no differences in the complication rates of procedures A and B.

This is a concise statement, but it is also a misinterpretation of the analysis. The range of treatment effects supported by the data includes some rather large values. Even the point estimate of 1.56 is a moderate treatment effect!

B. The data failed to reject the hypothesis that the complication rates were the same for procedures A and B, so we conclude that the rates are the same.

This option begins with a technically correct statement, but it draws the wrong conclusion. Failure to show a difference is not the same as showing a similarity.

C. The data suggests that procedures A and B have a similar complication rate as they are not statistically (or significantly) different.

Like the option before it, this option mistakes what it means to be similar. It fails to acknowledge that conclusive similarities are established with predetermined equivalence thresholds and narrow confidence intervals.

If you find yourself reading (or writing) a conclusion that two groups are the same with regard to a trait or outcome, ask yourself: “What was the equivalence threshold for this claim? Or, did the analysis use a point null?” Chances are, the analysis is based on a point null and no equivalence threshold was established.

For this example, an interpretation of the analysis will communicate the range of odds ratios supported by the data. It may include the word inconclusive. It need not be long. At the risk of being overly pedantic, we might write something like the following:

The odds ratio comparing the surgical complication risk of procedure B to procedure A was estimated at 1.56 (95% CI: 0.87, 2.79); the data establish neither a conclusive difference nor a conclusive similarity in complication risk between procedures A and B. For this hypothesis, the data are inconclusive.

Now you give it a go. What would you write?