By David Becker
Are you fed up with those unsightly wrinkles? Have you tried everything to get rid of them, but nothing seems to work? Well, throw away your anti-aging creams and forget about those harmful Botox injections! Did you know that you can reverse the aging process just by listening to music? Scientists have found that listening to songs about growing old will decrease your age by one-and-a-half years. Amazing!
Not only is this finding amazing, it’s also a complete lie. The study itself was real, but the researchers (Simmons, Nelson, & Simonsohn, 2016) weren’t actually interested in testing the effects of music on age. Their true intent was to shine a light on researchers’ overreliance on and misuse of significance testing by generating a statistically significant result that has no bearing on reality whatsoever.
Many people believe that a statistically significant result affirms that a study’s findings are real, but that’s not what statistical significance measures. Statistical significance is indicated by the p value, which measures how likely it would be to obtain data at least as extreme as the study’s data if the null hypothesis were true. The null hypothesis essentially opposes the hypothesis that the researchers are testing. In other words, if the researchers hypothesize that a new therapeutic approach will have a meaningful effect or will be superior to an established approach, the null hypothesis refers to the possibility that there will be no meaningful effect or no differences between the two approaches. A low p value (.05 is the most commonly used cutoff) means the data would be surprising if the null hypothesis were true, which is taken as grounds for rejecting it. However, rejecting the null hypothesis doesn’t make the researchers’ hypothesis true: the p value is not the probability that the null hypothesis is false, nor is it a measure of how likely the findings are to be real. Even so, it’s very easy to play with the data until you achieve statistical significance.
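To make that definition concrete, here is a minimal simulation (my own sketch, not taken from any of the studies discussed) of what a p value promises. Both groups are drawn from the same distribution, so the null hypothesis is true by construction, and a permutation test delivers a significant result only about 5% of the time:

```python
import random
import statistics

random.seed(42)

def null_experiment(n=20):
    """Two groups drawn from the SAME distribution, so any difference
    between their means is pure chance (the null hypothesis is true)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    return a, b

def permutation_p_value(a, b, n_perm=200):
    """Two-sided permutation test: the p value is the share of
    label-shuffled datasets whose mean difference is at least as
    extreme as the observed one."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) -
                   statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# When the null hypothesis is true, p values land below .05 only
# about 5% of the time: exactly the error rate the cutoff promises.
trials = 200
false_positives = sum(
    permutation_p_value(*null_experiment()) < .05 for _ in range(trials)
)
print(false_positives / trials)
```

The 5% cutoff keeps its promise only when the analysis is decided before the data are seen, which is precisely the guarantee that the practices described next undermine.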
Simmons and colleagues illustrated in their study how easy it is to get a false positive by manipulating what they refer to as “researcher degrees of freedom.” These are the decisions that researchers make about how to collect, analyze, and report data—decisions that include when to stop collecting data, which variables to analyze, and which subsets of data to report. Many researchers make these decisions throughout the research process. For instance, researchers may decide to stop collecting data as soon as they have achieved a significant result. Or they might change their hypothesis based on the data. Practices like these are often referred to as “p hacking,” the idea being that researchers will massage the data until they come up with a low-enough p value. Simmons et al. estimate that abusing researcher degrees of freedom can lead to a false positive rate as high as 61% in a given study, which they admit might be a conservative figure.
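Simmons et al. ran their own simulations; the sketch below is a simplified illustration of just one degree of freedom—optional stopping—assuming a one-sample z test with known variance. Peeking at the data once and then collecting more if the result isn’t significant pushes the false-positive rate well above the nominal 5%:

```python
import math
import random

random.seed(7)

def z_test_p(sample):
    """Two-sided p value for H0: true mean = 0, known sd = 1."""
    n = len(sample)
    z = sum(sample) / n * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def honest_study(n=40):
    """Fix the sample size in advance and test exactly once."""
    data = [random.gauss(0, 1) for _ in range(n)]
    return z_test_p(data) < .05

def peeking_study(n1=20, n2=20):
    """Optional stopping: test after n1 observations; if the result
    isn't significant, collect n2 more and test again."""
    data = [random.gauss(0, 1) for _ in range(n1)]
    if z_test_p(data) < .05:
        return True
    data += [random.gauss(0, 1) for _ in range(n2)]
    return z_test_p(data) < .05

# The null is true in every simulated study, so every "significant"
# result below is a false positive.
trials = 5000
honest_rate = sum(honest_study() for _ in range(trials)) / trials
peeking_rate = sum(peeking_study() for _ in range(trials)) / trials
print(honest_rate, peeking_rate)  # peeking roughly 0.08 vs. honest 0.05
```

A single extra peek inflates the error rate by more than half, and each additional degree of freedom (extra variables, selective reporting) compounds the damage—which is how Simmons et al. reached estimates as high as 61%.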
It’s important to note that most researchers who p hack aren’t purposefully trying to be deceptive. Many of them don’t fully grasp the potential consequences of what seem like minor decisions. In a recent example, psychologist Dana Carney rejected a psychological phenomenon she helped popularize—power poses—and essentially admitted that she and her fellow researchers unknowingly abused researcher degrees of freedom to achieve a statistically significant result that had no real-world significance (Peters, 2016).
Even when p hacking is unintentional, social psychologist Harris Cooper (2016) argues that researchers have an ethical responsibility to the scientific community and to the greater public good, one that obliges them to be aware of the decisions they make and of those decisions’ impact. He advises researchers to make decisions about data collection as early as possible—ideally in the planning phase, before the study has even begun—and to stick with them throughout the study. If the researchers change course, they need to fully report their new decisions so that anyone scrutinizing the study’s findings and methodology will understand the full context. For example, if researchers decide to create new data sets based on the original data—a practice that Cooper advises should rarely be used—the changes need to be explicitly catalogued with clear explanations. And under no circumstances should researchers edit or omit any of the original data, regardless of which data sets they choose to analyze.
In addition to carefully documenting their decisions, researchers need to better understand the true purpose of p values and how to properly use them. Clinical psychologist Rex Kline (2013) notes in his book Beyond Significance Testing: Statistics Reform in the Behavioral Sciences, now in its second edition, that cognitive errors surrounding significance testing are so common that they can be considered a form of “trained incapacity” (p. 10). Even statistics instructors don’t fully understand p values, according to Kline, which feeds the “ongoing cycle of misinformation” (p. 10). The problem is so pervasive that last year the American Statistical Association (ASA) released a policy statement laying out six principles to help the scientific community better understand and apply p values. This marked the first time in its 177-year history that the ASA took an official position on statistical practices.
The ASA also suggested other approaches that researchers might use in addition to or instead of significance testing, including one of the most popular alternatives proposed by advocates of statistics reform: Bayesian analysis. The premise behind Bayesian statistics is fairly simple: “Begin with an estimate of the probability that any claim, belief, [or] hypothesis is true, then look at any new data and update the probability given the new data” (Novella, 2016, para. 2).
Kline (2013, Chapter 10) supports the Bayesian approach because it reflects the fundamental tenets of science and critical thinking. Namely, extraordinary claims that seem implausible given what we know about the universe (e.g., listening to music will make you younger) need to be supported by extraordinary evidence. Kline further argues that Bayesian analysis allows us to compare the competing hypotheses of researchers who have differing interpretations about our existing body of knowledge and are studying the same subject from alternate perspectives. Being able to examine new and divergent findings against our current understanding of the world encourages scientists to reevaluate the likelihood of existing hypotheses, which is fundamental to science’s self-critical and self-correcting nature.
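The updating rule Novella describes fits in a few lines of Python. The prior and likelihoods below are invented purely for illustration: an extraordinary claim starts with a tiny prior, and even a single p < .05 result only nudges it upward.

```python
def bayes_update(prior, p_data_if_true, p_data_if_false):
    """Bayes' theorem: posterior probability that a hypothesis is
    true after seeing new data."""
    evidence = prior * p_data_if_true + (1 - prior) * p_data_if_false
    return prior * p_data_if_true / evidence

# Implausible claim (music reverses aging): start with a 1% prior.
prior = 0.01
# Suppose a study reports data that would occur 80% of the time if the
# claim were true and 5% of the time if it were false.
posterior = bayes_update(prior, 0.80, 0.05)
print(round(posterior, 3))  # 0.139: stronger, but still far from convincing
```

Chaining updates is what lets evidence accumulate across studies—yesterday’s posterior becomes today’s prior—which is exactly the self-correcting behavior Kline highlights.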
Simmons, Nelson, and Simonsohn (2016), however, are skeptical of Bayesian statistics as an alternative to significance testing. They contend that it gives researchers even more opportunities to manipulate data, in addition to those provided by significance testing. Simonsohn (2015) also argues that the default Bayesian test in psychology is biased against small effects.
If there’s one clear takeaway from this controversy, it’s that there isn’t one perfect alternative to significance testing. In fact, as the ASA points out in its policy statement, significance testing can be useful, so long as it’s properly applied. Therefore, completely avoiding p values doesn’t seem like an ideal, catch-all solution. Rather, scientists must experiment with a variety of solutions to see how best to test the validity of their findings.
In the meantime, it pays to be skeptical. The popular media tends to simplify, overhype, and misinterpret the findings of a single study—much like I did at the beginning of this post—without accounting for the complexities of scientific research. It can be difficult for those of us who are not scientific experts to figure out what to believe and what not to believe, especially when our mental filters are already overwhelmed by the constant deluge of information that floods over us every day.
But look on the bright side: At least you can be pretty confident that you won’t become younger and younger until you vanish from existence just by listening to The Beatles’ “When I’m Sixty-Four” on repeat.
References

Cooper, H. (2016). Ethical choices in research: Managing data, writing reports, and publishing results in the social sciences. https://doi.org/10.1037/14859-000
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd ed.). https://doi.org/10.1037/14136-000
Novella, S. (2016, January 8). What is Bayes theorem? [Blog post]. Retrieved from http://theness.com/neurologicablog/index.php/what-is-bayes-theorem/
Peters, M. (2016, October 1). ‘Power poses’ co-author: ‘I do not believe the effects are real.’ Retrieved from http://www.npr.org/2016/10/01/496093672/power-poses-co-author-i-do-not-believe-the-effects-are-real
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2016). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. In A. E. Kazdin (Ed.), Methodological issues and strategies in clinical research (4th ed., pp. 547–555). https://doi.org/10.1037/14805-033
Simonsohn, U. (2015, April 9). The default Bayesian test is prejudiced against small effects [Blog post]. Retrieved from http://datacolada.org/35