Understanding Statistics in Science: P-Values, Confidence, and Bias

Statistical reasoning is the hidden infrastructure of scientific knowledge — the machinery that separates a signal from the noise of a messy, variable world. This page covers the core tools that researchers use to evaluate evidence: p-values, confidence intervals, statistical power, and the various forms of bias that can quietly undermine even carefully designed studies. These concepts shape which findings get published, which drugs get approved, and which public health recommendations reach millions of people.

Definition and scope
Core mechanics or structure
Causal relationships or drivers
Classification boundaries
Tradeoffs and tensions
Common misconceptions
Checklist or steps (non-advisory)
Reference table or matrix

Definition and scope

A p-value is a probability: specifically, the probability of observing results at least as extreme as the ones in hand, assuming the null hypothesis is true. A confidence interval is a range of parameter values produced by a procedure that, across repeated studies, captures the true value at a specified rate (95%, for example). Statistical bias refers to systematic errors that push estimates away from the true value in a predictable direction. Together, these three concepts form the evaluative core of quantitative science, from clinical trials to particle physics to behavioral economics.

The scope is broad. The American Statistical Association (ASA) issued a formal statement in 2016 — and a follow-up special issue of The American Statistician in 2019 — specifically because misuse of p-values had become pervasive enough to warrant institutional intervention (ASA Statement on P-Values, 2016). The problem isn't the tools themselves. It's the gap between what these tools actually measure and what researchers, journalists, and policymakers frequently believe they measure.

The methodology underlying statistical inference rests on a deceptively simple question: how surprising would these data be if the null hypothesis were true?

Core mechanics or structure

The null hypothesis framework. Most classical statistical tests are structured around a null hypothesis (H₀) — typically the proposition that there is no effect, no difference, or no relationship. The p-value quantifies how compatible the observed data are with that null world. A p-value of 0.03 means there is a 3% probability of obtaining results this extreme (or more extreme) if the null hypothesis were true.
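One way to make "results this extreme if the null hypothesis were true" concrete is a permutation test. The sketch below is illustrative only: the two groups of measurements are invented, and the function name `permutation_p_value` is a hypothetical helper, not part of any library. It shuffles the group labels many times (which is what a world with no group difference looks like) and counts how often the shuffled difference in means is at least as large as the observed one.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical measurements, invented for illustration only.
treated = [5.1, 6.0, 5.8, 6.3, 5.9, 6.1]
control = [4.8, 5.2, 5.0, 5.5, 4.9, 5.3]
p = permutation_p_value(treated, control)
print(f"permutation p-value: {p:.4f}")
```

The returned number is the share of label shufflings that look at least as extreme as the real data, which is exactly the quantity a p-value describes. It says nothing about the probability that the null hypothesis itself is true.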

The conventional threshold of p < 0.05 traces to statistician Ronald Fisher, who proposed it in his 1925 book Statistical Methods for Research Workers as a rough rule of thumb — not as a hard decision boundary. That historical context matters enormously.

Confidence intervals (CIs). A 95% confidence interval is constructed so that, if the same study were repeated 100 times with the same methodology, approximately 95 of those intervals would contain the true population parameter. This is not the same as saying there is a 95% probability the true value lies within any single calculated interval — a distinction that trips up even experienced researchers.
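The repeated-study interpretation can be checked by simulation. The following is a minimal sketch under stated assumptions: the data are drawn from a normal distribution with a known true mean (something no real study has), the interval uses the large-sample normal approximation rather than a t-interval, and the parameter values are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def ci_coverage(true_mean=10.0, true_sd=2.0, n=50, n_studies=2_000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
    covered = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        center = mean(sample)
        half_width = z * stdev(sample) / n ** 0.5
        if center - half_width <= true_mean <= center + half_width:
            covered += 1
    return covered / n_studies

coverage = ci_coverage()
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The 95% figure describes how often the procedure succeeds across many studies. Each individual interval in the simulation either contains the true mean or misses it.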

Statistical power. Power is the probability that a test will detect a true effect when one exists. A study with 80% power has a 20% chance of a false negative — missing a real signal. Power depends on three factors: sample size, effect size, and significance threshold. Underpowered studies are common in psychology and medicine; a 2013 analysis published in Nature Reviews Neuroscience by Katherine Button and colleagues estimated median statistical power in neuroscience studies at approximately 20% (Button et al., 2013).
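The dependence of power on sample size, effect size, and significance threshold can be sketched with a normal approximation for a two-group comparison. This is an approximation chosen for brevity, not an exact t-test power calculation; `approx_power` is a hypothetical helper name, and the effect size is expressed as a standardized mean difference (Cohen's d).

```python
from statistics import NormalDist

def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d.
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability the statistic lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (20, 64, 200):
    print(f"d = 0.5, n per group = {n}: power ~ {approx_power(0.5, n):.2f}")
```

Holding the effect size fixed, power rises with sample size; holding sample size fixed, a stricter threshold (smaller `alpha`) lowers it. With no true effect, the function returns `alpha` itself, which is the false positive rate.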

Causal relationships or drivers

Statistical significance does not imply causation. This bears repeating because the principle is violated in practice with stunning regularity, including in peer-reviewed publications. The conceptual overview of how science works addresses the broader logic of inference, but within statistics specifically, several structural forces drive misinterpretation.

Publication bias operates at the level of the scientific literature itself. Studies that find statistically significant results are more likely to be submitted, accepted, and cited than null results. A 2012 analysis by Daniele Fanelli in Scientometrics found that the frequency of papers reporting positive results grew by more than 22% between 1990 and 2007 across scientific disciplines (Fanelli, 2012, Scientometrics).
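A small simulation shows why a literature filtered on significance overstates effects. It is a deliberately simplified sketch: every study estimates the same small true effect, each estimate is drawn directly from its sampling distribution, and "publication" is modeled as a hard filter at p < 0.05. All parameter values are assumptions for illustration.

```python
import random
from statistics import NormalDist, fmean

def simulate_literature(true_effect=0.2, n_per_group=30, n_studies=3_000, seed=3):
    """Compare the mean estimate across all studies with the 'published' subset."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    # Standard error of a difference in means when each group has sd = 1.
    se = (2 / n_per_group) ** 0.5
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        all_estimates.append(estimate)
        if abs(estimate / se) > z_crit:  # only "significant" studies get published
            published.append(estimate)
    return fmean(all_estimates), fmean(published), len(published) / n_studies

mean_all, mean_published, share_published = simulate_literature()
print(f"mean estimate, all studies:       {mean_all:.2f}")
print(f"mean estimate, published studies: {mean_published:.2f}")
print(f"share of studies published:       {share_published:.2f}")
```

Averaged over every study, the estimates center on the true effect. The published subset does not, because small studies only cross the significance threshold when chance pushes their estimate well away from the truth.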

Researcher degrees of freedom compound the problem. Decisions made during data collection and analysis — when to stop collecting data, which covariates to include, how to handle outliers — can inflate false positive rates dramatically. Simmons, Nelson, and Simonsohn demonstrated in a 2011 paper in Psychological Science that flexible analytic choices could push false positive rates to 61% under plausible research conditions.
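One of these choices, deciding when to stop collecting data, is easy to simulate. The sketch below is not a reproduction of the Simmons, Nelson, and Simonsohn analysis; it isolates optional stopping alone, assumes a known standard deviation of 1 so a simple z-test applies, and uses arbitrary values for the peeking schedule and sample cap.

```python
import random
from statistics import NormalDist

def false_positive_rate(peek_every, max_n=100, n_sims=4_000, alpha=0.05, seed=2):
    """Share of null-true experiments declared significant at any peek."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # the null is true: mean 0, sd 1
            # z statistic for the running mean is total / sqrt(n) when sd = 1.
            if n % peek_every == 0 and abs(total / n ** 0.5) > z_crit:
                false_positives += 1
                break  # stop collecting as soon as the result looks significant
    return false_positives / n_sims

single_look = false_positive_rate(peek_every=100)
ten_looks = false_positive_rate(peek_every=10)
print(f"one test at n = 100:        {single_look:.3f}")
print(f"peek every 10 observations: {ten_looks:.3f}")
```

Testing once keeps the false positive rate near the nominal 5%. Testing repeatedly and stopping at the first significant result pushes it well above that, even though each individual test uses the same threshold.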

Confounding occurs when a third variable causally influences both the exposure and the outcome, creating a spurious association. Controlling for confounders requires identifying them in advance — which requires subject-matter knowledge that no statistical test can supply on its own.
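The following sketch simulates a confounder directly. It assumes a linear world with invented variables: the confounder drives both exposure and outcome, and the exposure has no effect on the outcome at all. The helper names `pearson_r` and `residuals` are hypothetical, written out here so the example stays self-contained.

```python
import random
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, zs):
    """What is left of ys after removing its least-squares linear fit on zs."""
    mz, my = fmean(zs), fmean(ys)
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]

rng = random.Random(4)
n = 5_000
confounder = [rng.gauss(0, 1) for _ in range(n)]
exposure = [z + rng.gauss(0, 1) for z in confounder]  # driven by the confounder
outcome = [z + rng.gauss(0, 1) for z in confounder]   # also driven by it, not by exposure

raw_r = pearson_r(exposure, outcome)
adjusted_r = pearson_r(residuals(exposure, confounder), residuals(outcome, confounder))
print(f"raw correlation:      {raw_r:.2f}")
print(f"adjusted correlation: {adjusted_r:.2f}")
```

The raw association is substantial and would be highly significant, yet it disappears once the confounder is accounted for. The adjustment only works here because the simulation knows which variable to adjust for, which is the part subject-matter knowledge has to supply in real research.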

Classification boundaries

Statistical tests are classified by the type of data and the question being asked. The main distinctions:

Parametric vs. nonparametric: Parametric tests (t-test, ANOVA, linear regression) assume the data follow a specific distribution, typically normal. Nonparametric tests (Mann-Whitney U, Kruskal-Wallis) make fewer distributional assumptions but generally have somewhat lower power when the parametric assumptions actually hold.
One-tailed vs. two-tailed tests: A one-tailed test asks whether the effect is in a specific direction; a two-tailed test asks whether there is any difference at all. When the observed effect falls in the predicted direction, a one-tailed test halves the p-value, so choosing it without prior justification inflates false positive risk.
Frequentist vs. Bayesian frameworks: Frequentist statistics (the dominant paradigm in most journals) defines probability as long-run frequency. Bayesian statistics defines probability as a degree of belief, updated by evidence via Bayes' theorem. The two frameworks answer subtly different questions and can produce different conclusions from identical data.
Type I vs. Type II error: A Type I error is a false positive — rejecting a true null hypothesis. A Type II error is a false negative — failing to reject a false null hypothesis. The p < 0.05 threshold conventionally controls Type I error at 5%, but says nothing about Type II error without additional power calculations.
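The one-tailed versus two-tailed distinction above comes down to simple arithmetic on the same test statistic. A minimal sketch, assuming the statistic is a z-score with a standard normal null distribution and that the predicted direction is "greater than zero"; `p_values_from_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (one-tailed upper p-value, two-tailed p-value) for a z statistic."""
    std_normal = NormalDist()
    upper_tail = 1 - std_normal.cdf(z)            # predicted direction: effect > 0
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return upper_tail, two_tailed

for z in (1.8, -1.8):
    one, two = p_values_from_z(z)
    print(f"z = {z:+.1f}: one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
```

For a statistic in the predicted direction, the one-tailed p-value is exactly half the two-tailed value, which can move a result across the 0.05 line. For a statistic in the opposite direction, the one-tailed p-value is large, which is why the direction has to be fixed before the data are seen.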

Tradeoffs and tensions

The p < 0.05 threshold sits at the center of a genuine scientific controversy. In 2019, over 800 signatories to a comment in Nature titled "Retire Statistical Significance" argued that binary pass/fail thinking around a single threshold distorts scientific reasoning (Amrhein, Greenland, McShane, Nature 2019). Opponents argued that abolishing the threshold without a replacement standard would make it harder, not easier, to interpret evidence.

A parallel tension exists between statistical significance and practical significance. A study with a sample size of 50,000 might detect a statistically significant effect that accounts for 0.01% of variance in an outcome — real, but clinically or practically irrelevant. Effect sizes (Cohen's d, odds ratios, r²) communicate magnitude; p-values do not.
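The arithmetic behind that example can be checked directly. The sketch below assumes the effect is a Pearson correlation and uses the Fisher z-transformation, a standard large-sample approximation, to get a two-sided p-value; `correlation_p_value` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Two-sided p-value for a Pearson correlation (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

r = 0.01          # correlation whose square is 0.0001, i.e. 0.01% of variance
n = 50_000
p = correlation_p_value(r, n)
print(f"variance explained: {r ** 2:.4%}")
print(f"two-sided p-value:  {p:.3f}")
```

The same correlation in a sample of 500 would be nowhere near significant. The p-value reflects sample size as much as effect magnitude, which is why it cannot stand in for an effect size.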

Bayesian methods offer some solutions but introduce their own complications: results depend on the prior probability assigned to hypotheses, and the choice of prior is inherently subjective. In domains where prior evidence is rich — such as drug repurposing trials — Bayesian approaches can be more informative. In entirely novel research areas, the prior is largely a guess.
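Prior dependence is easiest to see in a conjugate example. The sketch below assumes a success rate with a Beta prior and binomial data, a textbook case chosen for its closed-form update; the priors and counts are invented for illustration.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Same data, two different priors: flat Beta(1, 1) and skeptical Beta(2, 8).
for successes, trials in ((7, 10), (700, 1000)):
    flat = posterior_mean(1, 1, successes, trials)
    skeptical = posterior_mean(2, 8, successes, trials)
    print(f"{successes}/{trials}: flat prior -> {flat:.3f}, skeptical prior -> {skeptical:.3f}")
```

With ten observations the two priors lead to noticeably different conclusions from identical data. With a thousand observations the data dominate and the choice of prior barely matters.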

Transparency mechanisms like pre-registration — posting hypotheses and analysis plans before data collection — have been formalized by platforms like the Open Science Framework and are now required by a growing number of journals, including those affiliated with the Center for Open Science.

Common misconceptions

Misconception 1: A p-value tells you the probability that the null hypothesis is true. It does not. The p-value assumes the null is true and asks how surprising the data would be under that assumption. The probability that the null is true requires Bayesian reasoning and a prior.
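A short calculation with Bayes' theorem shows how far apart the two quantities can be. This sketch conditions on the event "the test came out significant" rather than on an exact p-value, and the prior probabilities and power values are assumptions picked for illustration.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.8):
    """P(null is true | result is significant), by Bayes' theorem."""
    false_positives = alpha * prior_null        # null true, test significant
    true_positives = power * (1 - prior_null)   # null false, test significant
    return false_positives / (false_positives + true_positives)

for prior_null, power in ((0.5, 0.8), (0.9, 0.8), (0.9, 0.2)):
    posterior = prob_null_given_significant(prior_null, power=power)
    print(f"prior P(null) = {prior_null}, power = {power}: "
          f"P(null | significant) = {posterior:.2f}")
```

If nine in ten tested hypotheses are null, more than a third of significant results are false positives even at 80% power, and the share grows further when power is low. None of these figures equals the 5% threshold.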

Misconception 2: A non-significant result means no effect exists. Absence of evidence is not evidence of absence — especially in underpowered studies. A p-value of 0.20 in a study with 40 participants says very little about whether a real effect exists.

Misconception 3: Replication failure means the original study was fraudulent. Most replication failures reflect underpowered original studies, publication bias, or context-specific effects — not misconduct. The Reproducibility Project: Psychology, led by Brian Nosek and published in Science in 2015, successfully replicated approximately 36% of 100 psychology experiments, with effect sizes averaging about half the original magnitude (Open Science Collaboration, Science, 2015).

Misconception 4: Confidence intervals are immune to the misreadings that plague p-values. Confidence intervals do carry more information than p-values: they show the range of plausible effect sizes, not just a binary verdict. But they suffer from their own misinterpretation: once computed, a particular 95% CI either contains the true value or does not; the 95% describes the long-run success rate of the procedure, not the probability that this one interval is right.

Checklist or steps (non-advisory)

Elements present in a well-reported statistical analysis:
