Using summary statistics to determine whether a non-significant result supports the absence of an effect

Dan Quintana
6 min read · Jul 21, 2018

I just came across a new study that investigated the relationship between generalised anxiety symptoms and autonomic nervous system function. This is an interesting research question, given that generalised anxiety symptoms are often associated with autonomic arousal symptoms, like sweating and palpitations.

Generalised anxiety symptoms are often associated with autonomic arousal symptoms

In this study, 32 participants with social anxiety disorder and 23 neurotypical controls were recruited. Heart rate variability (HRV) was used to approximate autonomic control of the heart rate. The authors concluded that there was no relationship between generalised anxiety symptoms and HRV.

Now, the thing with conventional null hypothesis test p-values is that you can only reject the null hypothesis. A non-significant test is not support for the null hypothesis, no matter how large the p-value.

I was curious as to whether the data actually supported the absence of an effect. In this post, I’m going to walk you through what I did to better understand this result.

Before we get into the analysis, let’s first look at the effect size that the study design had 80% statistical power to detect using the “pwr” R package. To calculate this, we only need to enter the group numbers, a specified significance level, and the desired power level.

The R script to perform the power analysis of this study
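The original screenshot isn't reproduced here, but a minimal sketch of that power analysis with the pwr package looks something like this (the group sizes come from the paper; alpha and power are the conventional 0.05 and 80%):

```r
library(pwr)

# Solve for the effect size (d) that can be detected with 80% power,
# given the two group sizes and a two-sided significance level of 0.05
pwr.t2n.test(n1 = 32, n2 = 23, sig.level = 0.05, power = 0.80)
# d is approximately 0.78
```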

For an independent samples t-test, this study can detect an effect size of d = 0.78 with 80% power. This just falls short of a conventionally large effect size (d = 0.8) according to Cohen's guidelines. However, Cohen only intended his effect size guidelines to be used as a last resort, when the researcher does not have a good grasp of expected effect sizes from the published literature.

Fortunately, I already went to the trouble of calculating the distribution of effect sizes from HRV case-control studies, because I’m a nerd like that.

I found that for anxiety studies, a large effect size should be classified as d = 0.64 (instead of d = 0.8, which is Cohen’s suggested threshold for large effects), as this was the 75th percentile of effect sizes from 55 studies. In other words, only 25% of anxiety case control HRV studies have effect sizes that are larger than d = 0.64.

The distribution of effect sizes for HRV case-control studies categorised by different disorder groups.

So it looks like this study is only statistically powered to detect large effect sizes. But let’s plow ahead anyway.

For our first approach to better understand this non-significant result, we're going to use equivalence testing. Equivalence testing uses a frequentist framework to determine whether you can reject the presence of effects at least as large as a smallest effect size of interest (SESOI).

The first thing we need to do is specify our SESOI. There are several ways to do this, but here I will specify our SESOI as d = 0.26. I've chosen this value because a significant equivalence test would then rule out the upper 75% of effect sizes in the anxiety case-control HRV literature, which seems pretty reasonable to me.

We’ll have a look at the RMSSD measure of HRV, as this is one of the most commonly used measures in the literature. From the paper, we can extract group numbers, means, and standard deviations and then run a short script in R using the TOSTER package.

The R script to perform the equivalence test
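That script isn't reproduced here either, but it amounts to a single call to TOSTER's TOSTtwo() function (this is the older interface; newer TOSTER versions use tsum_TOST() instead). The means and standard deviations below are placeholders, to be swapped for the RMSSD values reported in the paper:

```r
library(TOSTER)

TOSTtwo(m1 = 40, sd1 = 20, n1 = 32,    # anxiety group (placeholder mean and SD)
        m2 = 44, sd2 = 22, n2 = 23,    # control group (placeholder mean and SD)
        low_eqbound_d  = -0.26,        # lower equivalence bound (our SESOI)
        high_eqbound_d =  0.26,        # upper equivalence bound (our SESOI)
        alpha = 0.05,
        var.equal = FALSE)             # Welch's t-test
```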

As the equivalence test was non-significant (p = 0.35), we cannot reject effects that are as large as or larger than d = 0.26. The plot below visualises this result. Notice that the thick line, which is the 90% confidence interval used by the two one-sided tests (TOST) procedure, extends beyond the dashed equivalence bounds of −0.26 and 0.26.

The plot of our equivalence test

My second approach was a Bayesian hypothesis test. With this test, you can quantify how much more likely the data are under a null hypothesis model than under an alternative model, given a prior distribution for the effect size (here's a primer if you're new to Bayesian hypothesis testing).

To perform a Bayesian hypothesis test from summary statistics, we just need to enter a t-statistic and group numbers. The authors didn’t report t-statistics in their results, but it’s possible to calculate this from the means, standard deviations, and group sizes.

The method section in this paper is very light on detail, so it’s unclear whether the p-value was generated from a Student’s t-test (assuming equal variances), Welch’s t-test (equal variances not assumed), or a Mann-Whitney test. RMSSD values are usually normally distributed, so I’ve gone with Welch’s t-test to be safe.

To calculate the t-statistic and corresponding p-value, I used the function below from this post.
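The function itself isn't shown here; the sketch below is my own version of the same idea, computing Welch's t-test from group means, standard deviations, and sizes:

```r
# Welch's t-test from summary statistics
welch_from_summary <- function(m1, m2, sd1, sd2, n1, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)           # standard error, unequal variances
  t  <- (m1 - m2) / se                          # Welch's t-statistic
  df <- se^4 / ((sd1^2 / n1)^2 / (n1 - 1) +     # Welch-Satterthwaite degrees of freedom
                (sd2^2 / n2)^2 / (n2 - 1))
  p  <- 2 * pt(-abs(t), df)                     # two-sided p-value
  c(t = t, df = df, p = p)
}
```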

The t-statistic was -0.59 and the p-value was 0.56. This isn't exactly what was reported in the paper (p = 0.594), which could be due to a variety of reasons, but it's still quite close, so we'll use it anyway for pedagogic purposes.

Before we run our Bayesian hypothesis test, we also need to specify our prior distribution. Here, I'll use a Cauchy prior with a scale of 0.44, which places 50% of the prior probability on true effect sizes between −0.44 and 0.44. I chose this figure because 0.44 is the median effect size for anxiety case-control HRV studies. Note that there are a few other options for specifying your prior for a Bayesian t-test in JASP, which might be better suited to your research question.

Now, we just plug our t-statistic, group numbers, and prior distribution into the Summary Stats module of JASP. This gives us a BF01 value of 2.26, which means the data are 2.26 times more likely under the null model than under the alternative model. Conventionally speaking, this is only considered anecdotal evidence for the null model relative to the alternative model, but you can make up your own mind as to the degree of evidence.

The prior and posterior plot for our analysis
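JASP's Bayesian t-tests are built on the BayesFactor R package, so the same number should be reproducible directly in R from the t-statistic and group sizes (a sketch; small differences can arise from rounding the t-statistic):

```r
library(BayesFactor)

# Bayes factor for the alternative over the null (BF10), computed from the
# t-statistic and group sizes, with a Cauchy prior of scale 0.44
bf10 <- ttest.tstat(t = -0.59, n1 = 32, n2 = 23,
                    rscale = 0.44, simple = TRUE)

1 / bf10   # BF01: evidence for the null relative to the alternative
```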

If we use a more generous Cauchy width, the relative evidence for the null model increases (see below). However, I would argue that the "wide" and "ultrawide" priors are too broad given what we know about the literature.

A robustness check can assess the impact of various Cauchy widths on the Bayes factor
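If you want to run this kind of robustness check yourself from summary statistics, one option is to loop over a few Cauchy scales (a sketch; 0.707, 1, and 1.414 correspond to the "medium", "wide", and "ultrawide" defaults):

```r
library(BayesFactor)

# BF01 across a range of Cauchy prior scales
sapply(c(0.44, 0.707, 1, 1.414), function(r) {
  1 / ttest.tstat(t = -0.59, n1 = 32, n2 = 23, rscale = r, simple = TRUE)
})
```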

By using both a frequentist and Bayesian approach, we can conclude that this non-significant result was probably not indicative of the absence of an effect. What’s more likely is that this non-significant result was due to the study being statistically underpowered.

If you liked this post, you’d probably enjoy our podcast episode with Daniel Lakens on p-values. Follow this link or listen using the player below. After 65 episodes and over 125,000 downloads, this is one of our most popular episodes.
