Significance Testing

Understand the concept of significance and testing of significance

Significance testing is:

  • The process of determining whether an observed difference between groups in a study reflects a real difference or chance alone
  • Performed using p-values
  • Does not imply clinical significance
    For a result to be statistically significant, there must be a 'real' difference between groups.
    • This difference does not have to be clinically meaningful
      • e.g. A drug may reliably cause a 5mmHg decrease in SBP - this is unlikely to cause a meaningful drop in cardiovascular mortality but may be statistically significant

P Values

The p-value is the probability of obtaining a summary statistic (e.g. a mean) equal to or more extreme than the observed result, provided the null hypothesis is true.

The p-value is commonly (mis)used in frequentist significance testing.

  • Prior to performing an experiment, a significance threshold (α) is selected
    • Traditionally 0.05 (5%) or 0.01 (1%)
      This threshold sets the accepted false-positive (type I error) rate.
      • When multiple tests are being performed on one set of data, the chance of a false-positive will increase
        • To reduce the chance of a false positive occurring, the significance threshold for each test can be reduced. One method is the Bonferroni correction, in which α is divided by the number of tests being performed.
  • Then the experiment is performed, and a value for p is calculated
    If p < α, the results are inconsistent with the null hypothesis (at that significance level), and the null hypothesis is rejected.
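
A minimal sketch of this process in Python, using a two-sample t-test on simulated blood pressure data. The group means, standard deviations, sample sizes, and number of tests below are invented for illustration only:

```python
# Frequentist significance testing: choose alpha, run the experiment,
# calculate p, and compare p with alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

alpha = 0.05                        # significance threshold chosen before the experiment
control = rng.normal(140, 15, 50)   # simulated SBP (mmHg), control group
treated = rng.normal(135, 15, 50)   # simulated SBP (mmHg), treatment group

# p is the probability of a difference in means at least this extreme,
# assuming the null hypothesis (no real difference) is true
t_stat, p = stats.ttest_ind(treated, control)
print(f"p = {p:.3f}; reject H0 at alpha = {alpha}: {p < alpha}")

# Bonferroni correction: when several tests are run on one dataset,
# divide alpha by the number of tests to limit the overall false-positive rate
n_tests = 4
print(f"Bonferroni-corrected threshold = {alpha / n_tests:.4f}")
```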

Problems with P-values

P-values are, when employed correctly, useful. However, they have several weaknesses:

  • Assume the null hypothesis is true
    The p-value assumes that there is no real difference between groups.
    • This may not be the case
    • Not all hypotheses are created equal
      There may be substantial prior evidence supporting (or refuting) the alternative hypothesis (HA) - this is ignored when interpreting a p-value.
      • Any study with significant results must therefore be interpreted in the context of:
        • Biological plausibility of those results
        • The previous evidence on the topic
    • It is a common misconception that the p-value estimates the chance that the result is true
      This is not the case. The p-value measures how inconsistent the observed results are with the null hypothesis.
  • A threshold of 0.05 is not always appropriate
    The cost of being wrong must be considered when interpreting a p-value. If this is a true result, what are the potential benefits? If it is a false positive, what are the potential harms?
  • Vulnerable to multiple comparisons
    Conducting repeated analyses will eventually find a 'significant' result. At an α of 0.05, we would expect 1 in 20 analyses to be a false positive, so conducting 20 analyses would be expected to generate, on average, one false-positive result (a short simulation of this is sketched after this list).
  • Does not quantify effect size
    A significant p-value simply suggests that a difference exists; it does not measure how big that difference is.
    • A result may be statistically significant but clinically unimportant, e.g. an antihypertensive medication causing a decrease in SBP by 2mmHg may be statistically significant, but clinically unimportant.
  • Related to sample size
    P-values are affected by sample size:
    • A large effect size may be hidden by a non-significant p-value if the sample size is small
    • Similarly, a tiny effect size may be detected (i.e. a significant p-value) if the sample size is large (see the second sketch after this list)
  • Does not account for bias
    Like other statistical tests, the p-value cannot account for bias or confounding.
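
A small simulation of the multiple comparisons problem, assuming arbitrary illustrative parameters: both groups are drawn from the same distribution (the null hypothesis is true), yet repeated testing still produces 'significant' results at roughly the rate set by α:

```python
# Repeatedly test two groups drawn from the same distribution; any
# 'significant' result is, by construction, a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_analyses = 1000

false_positives = 0
for _ in range(n_analyses):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)          # same distribution as a
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

print(f"False-positive rate: {false_positives / n_analyses:.1%}")   # close to 5%
```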
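
A second sketch showing the effect of sample size on the p-value for a fixed, clinically trivial effect (a 2 mmHg difference in mean SBP with an SD of 15 mmHg; all numbers are invented for illustration):

```python
# The same tiny true effect is usually 'non-significant' with a small
# sample and almost always 'significant' with a very large one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

for n in (20, 200, 20000):
    control = rng.normal(140, 15, n)   # simulated SBP (mmHg)
    treated = rng.normal(138, 15, n)   # true difference of 2 mmHg
    _, p = stats.ttest_ind(treated, control)
    print(f"n = {n:>5} per group -> p = {p:.4f}")
```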

References

  1. Wasserstein RL, Lazar NA. The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129-133.
Last updated 2020-07-26
