1. The p-value is an important instrument for measuring the inductive evidence against the null hypothesis, and it is reported across a wide range of psychological studies. Over the last decades, null hypothesis significance testing has been the main instrument for evaluating results.
The p-value is used as an indicator of the importance of a result and as a test of the author’s contribution to psychological research. The classical statistical approach to evaluating a hypothesis, based on computing the p-value and comparing it to critical levels (p < 0.05 and p < 0.01), not only prevents the accumulation of evidence for hypotheses but also yields extremely low reproducibility: results obtained in one study, on one sample, often fail to hold in other cases. A small p-value indicates only that the chosen data and the proposed model are incompatible.
The results should be interpreted as follows: the smaller the p-value, the greater the statistical incompatibility between the data and the null hypothesis. Today, specialists in statistical methodology have expressed serious concerns about the baseless, widespread application of null hypothesis testing results and their incorrect interpretation (Baker, 2016).
As a result of that discussion, a special working group of the American Psychological Association was created. Its main purpose was to prepare new requirements, to explain how statistical methods should be used in scientific publications, and to shed light on disputed points concerning the use of statistics, significance testing, and its alternatives. The American Statistical Association reported on the impossibility of determining to what extent a hypothesis is true and to what extent results are important (Baker, 2016), and advised researchers to avoid making scientific decisions solely on the basis of p-values. A further problem with null hypothesis testing appears when a straw-man null hypothesis is rejected and the alternative hypothesis is given priority: the researcher then obtains nonexistent evidence (Gelman, 2016).
The questions of reproducibility and replicability of research conclusions are widely discussed. This ambiguity provokes much confusion, doubt, and even outright bans on null hypothesis significance testing. Misinterpretation or misuse of statistical results (inappropriately chosen techniques, improperly conducted analyses, incorrect interpretation) causes the “reproducibility crisis”.
The concept of statistical significance holds that the p-value can be used as a statistical measurement, but in practice it is misused and misinterpreted, or even discouraged by scientists and scientific journals. Moreover, some researchers argue that statistical significance is not so meaningful, because most statistically significant comparisons are systematically overestimated and can even have the wrong sign.
To sum up, p-values are commonly used to test a null hypothesis. This hypothesis can be interpreted in two ways: it may claim that two groups do not differ, or it may state that a pair of characteristics is uncorrelated.
When the p-value is small, there is a small probability that a result at least as extreme as the observed one would occur by chance if the null hypothesis were true. When a p-value is 0.05 or less, it is generally taken as license to call the findings statistically significant and publishable.
However, according to the American Statistical Association, this is not the whole truth. A p-value of 0.05 does not mean there is a 95% chance that the chosen hypothesis is true (Baker, 2016). Rather, it means there is a 5% chance of obtaining results as extreme as the observed ones, provided that all other assumptions made are valid.
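This definition can be sketched with a minimal, hypothetical coin-flip example (the data, the null hypothesis of a fair coin, and the function name are purely illustrative, not taken from the cited studies): the p-value is the total probability, under the null hypothesis, of every outcome at least as unlikely as the one observed.

```python
from math import comb

def binomial_p_value(n, k):
    """Two-sided p-value for observing k heads in n coin flips
    under the null hypothesis of a fair coin (p = 0.5)."""
    prob = lambda i: comb(n, i) * 0.5 ** n  # exact null probability of i heads
    observed = prob(k)
    # Sum the probabilities of every outcome at least as extreme
    # (i.e., no more likely under the null) as the observed one.
    return sum(prob(i) for i in range(n + 1) if prob(i) <= observed)

# 60 heads in 100 flips: the data are only marginally compatible
# with a fair coin (p lands just above the conventional 0.05 cutoff).
p = binomial_p_value(100, 60)
```

Note that the result is a statement about the data given the null hypothesis, not about the probability that the coin is fair.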
2. The seriousness of the p-value problem is underscored by the fact that, when the American Statistical Association issued such explicit recommendations, it was the first time in the organization’s 177-year history (Baker, 2016).
The p-value is claimed not to be an objective measure for statistical testing of hypotheses, because it exaggerates the evidence against the null hypothesis. Accordingly, many published studies that rest on small p-values can be called into doubt.
To understand why the p-value as a statistical tool meets with increasing criticism, and sometimes rejection, from the scientific community, it is necessary to understand its essence and the misconceptions that surround it.
The first misconception is that if p < 0.05, there is only a 5% chance that the null hypothesis was rejected erroneously, i.e., a 5% probability of a Type I error, also known as a false positive. In fact, such a figure holds only under the assumption that the null hypothesis is true.
Some further misconceptions include:
– the conviction that for p > 0.05 there are no differences between the groups (in fact, this only means that the absence of differences agrees better with the results);
– the belief that if p < 0.05, a statistically significant conclusion is also of clinical significance, i.e., of value for psychological practice (although the p-value provides no information about the size of the effect).
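The first misconception above can be demonstrated with a small simulation (a hypothetical sketch using a z-test with known variance, chosen only because it needs no external libraries; the numbers are illustrative): with a small sample, a genuinely real effect frequently produces p > 0.05.

```python
import random
from statistics import NormalDist, mean

random.seed(42)

def z_test_p(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means,
    assuming equal group sizes and known standard deviation sigma."""
    n = len(sample_a)
    se = sigma * (2 / n) ** 0.5  # standard error of the mean difference
    z = (mean(sample_a) - mean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A real effect (true mean difference 0.5) with small samples:
# many simulated studies still give p > 0.05, so "p > 0.05"
# does not mean "no difference" -- it may simply mean low power.
n, effect, runs = 15, 0.5, 1000
misses = sum(
    z_test_p([random.gauss(effect, 1) for _ in range(n)],
             [random.gauss(0, 1) for _ in range(n)]) > 0.05
    for _ in range(runs)
)
print(f"{misses / runs:.0%} of simulated studies miss the real effect")
```

Here roughly two-thirds of the simulated studies fail to reach significance despite the effect being real.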
The problems of the p-value are much discussed in medicine and in psychology, where making decisions about, for example, the effectiveness of a treatment method solely on the basis of p-value significance can lead to deplorable results in a very short time.
The lack of confirmation of scientific findings is the result of a convenient but unjustified decision-making strategy based only on a formal p-value estimate. For example, some findings for biomedical treatments indicate that, at p < 0.05, the risk factors for cancer include electric shavers, a broken arm (for women), fluorescent lamps, allergies, reindeer herding, steward’s duty, poultry breeding, being tall, hot dogs, and refrigerators. Another example claimed that aggressive men have more sons.
Furthermore, some psychological journals have announced that they will not publish articles containing p-values, as these are often used to substantiate low-quality research. To resolve this misunderstanding, it is important for researchers to take many additional factors into account in their studies.
This means that p-values alone cannot guarantee the truth or falsity of a decision. According to Wasserstein and Lazar (2016), good statistical practice emphasizes proper design and conduct of studies, summaries that include different forms of data (graphics, numbers), exploration of the research phenomenon, interpretation of results in context, and complete reporting and understanding of the results obtained.
In conclusion, the p-value familiar to psychologists has many significant shortcomings: it is tied to hypothetical data and depends on the objectives of the researcher and on the size of the sample, yet it provides no statistical evidence, neither in favor of the truth of the null hypothesis nor in favor of the truth of alternative hypotheses.
The p-value shows how probable the data obtained in the research are; it does not guarantee that the hypothesis itself is true. In either case, it should be kept in mind that the calculation assumes the null hypothesis is true.
3. Statisticians have pointed to a number of measures that might help. To avoid misunderstanding and hesitation over whether results are significant or not, researchers should always report effect sizes and confidence intervals (Nuzzo, 2014). These convey what a p-value does not: the magnitude and relative importance of an effect.
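As a sketch of this recommendation, a standardized effect size (Cohen's d) and a confidence interval can be computed alongside any test (a simplified normal-approximation version with made-up data; real analyses typically use the t-distribution):

```python
from statistics import NormalDist, mean, stdev

def effect_size_and_ci(a, b, confidence=0.95):
    """Cohen's d and a normal-approximation confidence interval
    for the difference in means of two independent samples
    (equal-size groups assumed for brevity)."""
    diff = mean(a) - mean(b)
    # Pooled standard deviation of the two groups
    sd_pooled = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
    d = diff / sd_pooled                      # effect size in SD units
    se = sd_pooled * (1 / len(a) + 1 / len(b)) ** 0.5
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return d, (diff - z * se, diff + z * se)  # CI for the raw difference

# Hypothetical scores for a treated and a control group
treated = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0]
control = [4.6, 4.4, 4.9, 4.7, 4.5, 4.8, 4.3, 4.6]
d, (lo, hi) = effect_size_and_ci(treated, control)
```

Unlike a bare p-value, the pair (d, CI) tells the reader both how large the effect is and how precisely it was estimated.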
Many researchers emphasize the need to replace the p-value with methods that take Bayes’ rule into consideration. This rule emphasizes thinking about probability as the plausibility of an outcome rather than as the potential frequency of that outcome. Others support an approach that applies multiple methods to the same data set. Some statisticians stress the importance of marking papers as “p-certified” or “not p-hacked”. Nuzzo (2014) proposed that researchers first conduct small exploratory studies to gather interesting findings and then decide how to confirm those findings.
The American Statistical Association proposed the following recommendations:
– report not only the sample size but also information about how the decision on the sample size was made, including the size of the statistical effects;
– show how the effect-size estimates were obtained from previous studies and theories, in order to dispel the suspicion that they were calculated from the data used in the study or, worse, falsified to justify the sample size;
– report the magnitude of the statistical effect along with the traditional p-value.
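The decision about the sample size is usually grounded in a power analysis. A minimal sketch of one standard version (a normal-approximation formula for a two-sided, two-group comparison; the function name and default values are illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_d, power=0.8, alpha=0.05):
    """Per-group sample size needed for a two-sided two-group z-test
    to detect a standardized effect of size effect_d (Cohen's d)
    with the given power at significance level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    return ceil(2 * ((z_alpha + z_beta) / effect_d) ** 2)

# A "medium" effect of d = 0.5 needs roughly 63 participants per group;
# smaller effects require sharply larger samples.
n_medium = sample_size_per_group(0.5)
n_small = sample_size_per_group(0.2)
```

Reporting such a calculation is precisely what the first recommendation asks for: it shows the sample size was planned around an expected effect, not chosen after the fact.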
This means that communicating and interpreting effects in the context of previously obtained results is an integral part of good research. It allows the reader to assess the robustness of the results with respect to samples, designs, and methods of data analysis.
Various ideas have been proposed to resolve the replication crisis. They include: science communication (no restrictions on publication according to statistical significance; interaction between disagreeing researchers); design and data collection (preregistration; prior estimates; more attention to accurate measurement); and data analysis (Bayesian inference; hierarchical modelling; meta-analysis; control of error rates) (Gelman & Geurts, 2017).
To sum up, the Bayes factor, also known as the likelihood ratio, can be chosen as an alternative to the p-value. It does not share the p-value’s deficiencies. It is the ratio of the probability of the data under the condition that one hypothesis is true to the probability of the data under the condition that the other hypothesis is true, and it is much easier to interpret as an assessment of the evidence a study provides in favor of a particular hypothesis.
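A minimal sketch of this ratio, for a hypothetical coin-flip data set with two point hypotheses (both the data and the hypotheses p = 0.6 and p = 0.5 are illustrative choices, not drawn from the cited studies):

```python
from math import comb

def bayes_factor(n, k, p1, p2):
    """Likelihood ratio P(data | H1) / P(data | H2) for observing
    k heads in n flips, where H1 says the heads probability is p1
    and H2 says it is p2."""
    likelihood = lambda p: comb(n, k) * p ** k * (1 - p) ** (n - k)
    return likelihood(p1) / likelihood(p2)

# 60 heads in 100 flips: the data favor a biased coin (p = 0.6)
# over a fair one (p = 0.5) by a factor of roughly 7.5 to 1.
bf = bayes_factor(100, 60, 0.6, 0.5)
```

Unlike a p-value, this number directly compares two competing hypotheses: a Bayes factor of 7.5 says the observed data are 7.5 times more probable under the first hypothesis than under the second.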
The recommendations of the American Statistical Association about the importance of design, understanding, and context should be taken into consideration as well. As can be seen, the statistical practice associated with p-values should not become a misuse of, or a substitute for, design, understanding, and context (Gelman, 2016).
Statisticians should describe not only the analyses that produce statistically significant results but all statistical tests and all choices made in calculations. Otherwise, results may appear falsely robust (Baker, 2016).
The disadvantages of the p-value as a statistical tool can be mitigated by using alternative methods and by following the recommendations of the American Statistical Association. These disadvantages are its ties to hypothetical data and to the researcher’s intentions, its dependence on the sample size, and its lack of statistical evidence.
As an alternative instrument for evaluating the likelihood of statistical hypotheses, the Bayes factor can be considered. This measure, also known as the likelihood ratio, can help resolve the controversial questions above. The Bayes factor is the ratio of the probability of the analyzed data under the condition that one hypothesis is true to its probability under the condition that the other hypothesis is true. Its interpretation is mathematically simple and grounded in the concept of evidence.