The Science: Key Data, Statistics, and Metrics

Numbers are where scientific understanding earns its credibility. A hypothesis becomes a finding when it survives measurement, and a finding becomes knowledge when that measurement can be replicated, compared, and situated alongside other data. This page covers the core types of quantitative evidence used across scientific disciplines — how statistics are generated, what the most critical metrics signal, and where the numbers are most likely to mislead.

Definition and scope

Scientific data is any systematically collected, reproducible observation expressed in a form that allows comparison and analysis. Statistics are the mathematical tools applied to that data to describe patterns, test hypotheses, and estimate the reliability of conclusions. Metrics are the specific, agreed-upon measurements a field uses to track phenomena over time — things like VO₂ max in exercise physiology, or the reproduction number (R) in epidemiology.

The scope of this domain is genuinely vast. The National Science Foundation's National Center for Science and Engineering Statistics tracks research output, funding flows, and workforce data across every discipline. As of its 2023 report, the United States spent approximately $886 billion on research and development — with the federal government contributing roughly $176 billion of that total — making the infrastructure for generating scientific data one of the largest organized human enterprises on the planet.

Not all data is quantitative. Qualitative data — observations, interviews, textual records — is legitimate scientific material, particularly in fields like anthropology and clinical research. The distinction matters because the statistical tools appropriate for a dataset depend entirely on what kind of data it contains. Applying a parametric test to categorical data is one of the more reliable ways to produce a confidently wrong answer.

For a broader orientation to the ideas and methods that produce this data, The Science: Key Concepts Glossary provides working definitions for the terminology used throughout this page.

How it works

Data collection begins with operationalization — the process of defining exactly what will be measured and how. "Health" is an idea. Systolic blood pressure measured in millimeters of mercury by a calibrated sphygmomanometer is a datum. That translation from concept to measurable unit is where most of the hard work in scientific research actually lives, even though it rarely makes the headline.

Once collected, data moves through a structured pipeline:

  1. Cleaning and quality control — identifying and handling missing values, outliers, and measurement errors.
  2. Descriptive statistics — summarizing the dataset using measures like mean, median, standard deviation, and range.
  3. Inferential statistics — using probability theory to draw conclusions about a larger population from a sample, with tools like t-tests, ANOVA, or regression models.
  4. Effect size estimation — quantifying how large a relationship is, not just whether it exists. A result can be statistically significant and practically trivial.
  5. Confidence intervals — expressing uncertainty around an estimate. A 95% confidence interval does not mean there is a 95% probability the true value falls inside it; it means that if the experiment were repeated 100 times under identical conditions, approximately 95 of those intervals would contain the true value. (Steps 2 through 5 are illustrated in the short sketch after this list.)
  6. Peer review and replication — the social mechanism by which the broader scientific community stress-tests a finding.
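
To make steps 2 through 5 concrete, the following is a minimal sketch in Python (using NumPy and SciPy) applied to two hypothetical samples; every number in it is invented for illustration rather than drawn from any real study.

```python
# Minimal sketch of pipeline steps 2-5 on two hypothetical samples
# (e.g., a treatment group and a control group). All values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(loc=5.4, scale=1.2, size=50)  # hypothetical measurements
control = rng.normal(loc=5.0, scale=1.2, size=50)

# 2. Descriptive statistics
print("mean:", treatment.mean(), "median:", np.median(treatment),
      "sd:", treatment.std(ddof=1))

# 3. Inferential statistics: two-sample t-test
t_stat, p_value = stats.ttest_ind(treatment, control)
print("t =", t_stat, "p =", p_value)

# 4. Effect size: Cohen's d (mean difference divided by pooled standard deviation)
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd
print("Cohen's d =", cohens_d)

# 5. Approximate 95% confidence interval for the mean difference
#    (normal-quantile approximation, kept simple for illustration)
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"95% CI for the difference: ({low:.2f}, {high:.2f})")
```
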

The companion page The Science: Methodology covers the experimental frameworks that govern how data is gathered in the first place — a necessary complement to understanding what the numbers ultimately mean.

Common scenarios

Three scenarios account for the majority of public misunderstanding about scientific statistics.

Correlation presented as causation. Two variables can move together closely — ice cream sales and drowning rates, for instance — without one causing the other. Both are driven by a third variable (summer). Establishing causation requires either a randomized controlled trial or a carefully structured natural experiment, not a correlation coefficient, however large.
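
The confounding is easy to demonstrate with a small, entirely hypothetical simulation: a temperature variable drives both outcomes, and the two end up strongly correlated even though neither causes the other. All coefficients below are invented for illustration.

```python
# Hypothetical simulation: a lurking variable (temperature) drives both
# outcomes, producing a strong correlation without any causal link.
import numpy as np

rng = np.random.default_rng(1)
temperature = rng.uniform(10, 35, size=365)                    # daily temperature
ice_cream_sales = 20 * temperature + rng.normal(0, 50, 365)    # depends only on temperature
drownings = 0.3 * temperature + rng.normal(0, 2, 365)          # also depends only on temperature

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"correlation between sales and drownings: r = {r:.2f}")
# The large r reflects the shared driver, not causation in either direction.
```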

P-value thresholds treated as verdicts. The p < 0.05 threshold is a convention, not a law of nature. The American Statistical Association issued a formal statement in 2016 (ASA Statement on P-Values) explicitly warning against using p-values as binary pass/fail tests for truth. A p-value of 0.049 and one of 0.051 are not meaningfully different, yet the former is routinely described as "significant" and the latter as "not significant."
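
One way to see why a hard cutoff misleads is to simulate the same modest, real effect many times and watch the p-value land on both sides of 0.05 purely through sampling noise. The sketch below is a hypothetical simulation, not a reanalysis of any published result.

```python
# Hypothetical simulation: the same true effect, sampled repeatedly, yields
# p-values scattered on both sides of the 0.05 convention.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p_values = []
for _ in range(1000):
    a = rng.normal(0.0, 1.0, 30)   # control group
    b = rng.normal(0.4, 1.0, 30)   # treatment group with a real, modest effect
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print("fraction of runs with p < 0.05:", (p_values < 0.05).mean())
print("fraction of runs with 0.04 < p < 0.07:", ((p_values > 0.04) & (p_values < 0.07)).mean())
# Identical experiments straddle the cutoff purely by sampling noise.
```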

Small samples extrapolated to large populations. A study of 40 college students cannot reliably characterize human behavior broadly. Sample size determines the precision of an estimate, and under-powered studies produce noisy results that often fail to replicate. The replication crisis — documented extensively in psychology and medicine — is substantially a story about what happens when incentive structures reward small, headline-grabbing findings over large, rigorous ones.
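
The cost of a small sample can be estimated directly. The sketch below simulates a two-group study with an assumed small true effect and counts how often it clears the p < 0.05 threshold; the effect size and group sizes are illustrative assumptions, not figures from any particular study.

```python
# Hypothetical power estimate by simulation: how often does a small study
# detect a true but modest effect at the p < 0.05 threshold?
import numpy as np
from scipy import stats

def simulated_power(n_per_group, effect_size, runs=5000, alpha=0.05, seed=3):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(runs):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / runs

print("n = 20 per group:", simulated_power(20, effect_size=0.3))
print("n = 200 per group:", simulated_power(200, effect_size=0.3))
# The small study detects the effect only a minority of the time, so most of
# its "significant" results will be overestimates or flukes.
```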

More detailed treatment of replication and controversy can be found at The Science: Controversies and Debates.

Decision boundaries

The decisions made with scientific data — clinical guidelines, regulatory thresholds, public health recommendations — require translating probabilistic evidence into binary choices. A drug either gets approved or it does not. A contaminant either exceeds the permissible exposure limit or it does not. That translation involves value judgments, not just statistics.

The distinction between Type I error (false positive — concluding an effect exists when it does not) and Type II error (false negative — missing a real effect) is the core tension in any decision boundary. Screening programs for rare diseases, for example, must set thresholds carefully: a very sensitive test catches nearly every true case but generates substantial false positives, each of which carries its own cost — financial, psychological, and medical.
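
A worked example with invented but plausible numbers makes the trade-off concrete: for a rare condition, even a test with high sensitivity and specificity yields mostly false positives, because true cases are so heavily outnumbered by the healthy people being screened.

```python
# Hypothetical screening example: Bayes' rule applied to a rare condition.
prevalence = 0.001     # 1 in 1,000 people has the condition (assumed)
sensitivity = 0.99     # P(test positive | condition present) (assumed)
specificity = 0.95     # P(test negative | condition absent) (assumed)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_positive   # P(condition | positive test)

print(f"probability a positive result is a true case: {ppv:.1%}")
# With these numbers, roughly 2% of positives are true cases; the other ~98%
# are false positives, each carrying follow-up costs.
```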

The National Institute of Standards and Technology publishes guidelines on expressing measurement uncertainty that apply across disciplines, from clinical chemistry to materials testing. The threshold for "good enough" data is always relative to what the data will be used to decide.
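
As a rough sketch of the standard approach (assuming independent error sources and the common coverage factor k = 2 for roughly 95% coverage), individual standard uncertainties are combined in quadrature and then scaled to an expanded uncertainty; the component values below are hypothetical.

```python
# Sketch: combining independent standard uncertainties in quadrature and
# reporting an expanded uncertainty with coverage factor k = 2 (~95% coverage).
# The component values are hypothetical.
import math

components = {
    "instrument calibration": 0.12,   # standard uncertainties, same unit throughout
    "repeatability": 0.08,
    "operator/readout": 0.05,
}

combined = math.sqrt(sum(u ** 2 for u in components.values()))
expanded = 2 * combined   # coverage factor k = 2

print(f"combined standard uncertainty: {combined:.3f}")
print(f"expanded uncertainty (k = 2):  {expanded:.3f}")
```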

The starting point for the broader body of evidence behind these principles is The Science — the reference foundation for this entire network of topics.

References