Reproducibility and Data Integrity in Scientific Research
When a 2011 study by Bayer HealthCare found that only 25% of published preclinical cancer studies could be replicated internally, it wasn't a quiet footnote — it set off a sustained reckoning across biomedical science. Reproducibility and data integrity sit at the center of that reckoning: not as abstract ideals, but as measurable properties of research that determine whether scientific knowledge actually accumulates. This page covers the definitions, structural mechanics, failure modes, and practical standards that govern reproducibility and data integrity across scientific disciplines.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps
- Reference table or matrix
Definition and scope
Reproducibility, as defined by the National Academies of Sciences, Engineering, and Medicine (NASEM) in their 2019 report Reproducibility and Replicability in Science, refers specifically to obtaining consistent computational results using the same input data, methods, code, and analytic conditions. Replicability — a related but distinct concept — refers to obtaining consistent results across studies addressing the same scientific question with independently collected data. The two terms are frequently conflated, which itself contributes to confusion about what any given "reproducibility crisis" claim actually means.
Data integrity is the broader condition under which data are accurate, complete, consistent, and unaltered from collection through analysis and reporting. The U.S. Food and Drug Administration's data integrity guidance documents, developed primarily for regulated industries, use the ALCOA+ framework — Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available.
Scope-wise, these concepts apply across every empirical discipline: clinical trials, psychology, physics, ecology, social science, and economics all face reproducibility pressures, though the specific failure modes vary considerably by field. The problem is large enough that the National Science Foundation has funded dedicated metascience initiatives to measure and address it.
For anyone trying to understand how science works as a self-correcting process, reproducibility is the mechanism by which that self-correction actually operates — or fails to.
Core mechanics or structure
Three structural components determine whether a study is reproducible: transparency of methods, availability of data, and fidelity of analysis.
Methods transparency requires that protocols be documented in enough detail for an independent team to execute them. The EQUATOR Network, a consortium of reporting guideline developers, maintains over 500 reporting guidelines for different study types — including CONSORT for randomized trials and STROBE for observational studies — specifically because informal methods reporting creates irreproducible gaps.
Data availability means the underlying observations, measurements, or recordings are accessible and documented. A related structural intervention is pre-registration: publicly logging a study's hypothesis, design, and analysis plan before data collection begins. It is one of the more effective safeguards because it separates confirmatory from exploratory analysis at the point of origin. The Open Science Framework (OSF), operated by the Center for Open Science, hosts over 900,000 pre-registered study records, according to its public statistics page.
Analysis fidelity addresses whether the same code and data produce the same numerical output. This is where computational reproducibility lives. Version control (via tools like Git), containerized software environments (Docker, for instance), and archived data repositories each address a different failure point in the analysis chain.
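One of those failure points, silent drift in the input data, can be caught by recording cryptographic checksums of the raw files alongside the analysis code and verifying them before each run. A minimal sketch in plain Python follows; the manifest-of-digests convention shown here is an illustrative pattern, not a formal standard:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_inputs(manifest: dict[str, str]) -> list[str]:
    """Check data files against previously recorded digests.

    Returns the paths whose current contents no longer match the
    manifest, i.e. the inputs that have drifted since archiving.
    """
    return [path for path, expected in manifest.items()
            if sha256_of_file(path) != expected]
```

Committing the manifest to version control alongside the analysis code means any later re-run can prove it is operating on byte-identical inputs before reporting a "reproduction."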
Causal relationships or drivers
The reproducibility problem does not have a single cause. The NASEM 2019 report identifies a cluster of contributing factors: publication bias toward positive results, inadequate statistical power in original studies, undisclosed flexibility in data analysis (sometimes called "researcher degrees of freedom"), and selective reporting of outcomes.
Publication bias is structurally reinforced by journal incentives. An analysis by Fanelli (2012), published in Scientometrics, found that the proportion of positive results in the scientific literature grew by over 22% between 1990 and 2007 across multiple disciplines, a pattern inconsistent with stable underlying effect sizes and consistent with increasing publication selectivity.
Statistical underpowering deserves particular attention. When a study is designed with insufficient sample size, it cannot reliably detect true effects of the magnitude being investigated. A well-known 2013 paper by Button et al. in Nature Reviews Neuroscience estimated median statistical power in neuroscience studies at just 21%, meaning that even when a true effect exists, 79% of individual experiments would fail to detect it. Low-powered studies that do report positive findings are more likely to be false positives — a counterintuitive but mathematically predictable outcome described by epidemiologist John Ioannidis in his widely cited 2005 paper "Why Most Published Research Findings Are False".
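Ioannidis's argument can be made concrete with a few lines of arithmetic. The sketch below computes the positive predictive value of a significant finding from power, the significance threshold, and the prior probability that the tested hypothesis is true; the 10% prior used in the example is an illustrative assumption, not a figure from either paper:

```python
def positive_predictive_value(power: float, alpha: float, prior: float) -> float:
    """Probability that a statistically significant finding reflects a
    true effect, given study power, the significance threshold, and the
    prior probability that the tested hypothesis is true."""
    true_positives = power * prior          # true effects correctly detected
    false_positives = alpha * (1 - prior)   # null effects crossing the threshold
    return true_positives / (true_positives + false_positives)


# Button et al.'s median neuroscience power (21%), conventional alpha,
# and an assumed 10% prior on the hypothesis being true:
low_power_ppv = positive_predictive_value(power=0.21, alpha=0.05, prior=0.10)
high_power_ppv = positive_predictive_value(power=0.80, alpha=0.05, prior=0.10)
```

Under these assumptions, fewer than a third of significant results from 21%-powered studies would reflect true effects, while raising power to 80% pushes the figure above 60%. This is the mathematically predictable outcome the text describes.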
Data fabrication and selective outcome switching represent the more deliberate end of the integrity spectrum. These behaviors, though less common than inadvertent methodology failures, carry severe formal consequences in funded research — NSF regulations at 45 CFR Part 689 define research misconduct and authorize debarment from federal funding.
Classification boundaries
The field distinguishes between three categories of reproducibility failure, each requiring different remediation:
- Computational failures — the same data and code produce different results, usually due to software version drift, platform differences, or undocumented preprocessing steps.
- Statistical failures — results do not replicate because the original study was underpowered, used inappropriate tests, or engaged in undisclosed multiple comparisons.
- Conceptual failures — results do not replicate because the underlying construct was measured differently, the population differed, or the theoretical model does not generalize.
These categories matter because conflating them produces bad diagnostics. A finding that fails conceptual replication tells a different story than one that fails computational reproducibility. The NASEM report drew distinctions of this kind specifically to prevent researchers and journalists from treating all non-replication as equivalent evidence of fraud or sloppiness.
Data integrity failures follow a parallel classification: errors (unintentional), protocol deviations (procedural), and misconduct (deliberate falsification or fabrication). Only the third category constitutes research misconduct under 42 CFR Part 93, the federal regulation governing misconduct in Public Health Service-funded research.
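The "undisclosed multiple comparisons" driver behind many statistical failures is easy to quantify: across independent tests in which every null hypothesis is true, the chance of at least one false positive is 1 - (1 - α)^m. A short sketch, with the standard Bonferroni correction shown for contrast:

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that holds the familywise error
    rate at or below alpha."""
    return alpha / n_tests
```

At α = 0.05, twenty undisclosed comparisons carry roughly a 64% chance of producing at least one spurious "significant" result, which is why flexibility in analysis must be reported, not merely avoided in spirit.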
Tradeoffs and tensions
The drive for reproducibility creates genuine friction with other legitimate scientific values.
Speed vs. rigor: Detailed documentation, pre-registration, and data archiving all add time to the research cycle. In fast-moving fields like infectious disease epidemiology — where preprint servers like medRxiv became primary information channels during the COVID-19 pandemic — the infrastructure for reproducibility routinely lags behind publication pace.
Openness vs. privacy: Data sharing mandates, increasingly required by funders like the NIH Data Management and Sharing Policy (effective January 2023), can conflict with human subjects protections under HIPAA and IRB protocols. De-identification is imperfect; genomic data in particular resists true anonymization.
Replication cost vs. novelty rewards: Independent replication studies are expensive and rarely funded at the same level as original research. The academic incentive structure, measured through hiring, tenure, and grant success, still rewards novel findings over confirmatory work, even as funding agencies increasingly signal the opposite preference.
Common misconceptions
Misconception: Failure to replicate proves the original finding was fraudulent.
Correction: Most replication failures trace to statistical underpowering, methodological variation, or contextual differences, not misconduct. The Open Science Collaboration's 2015 replication project, published in Science, repeated 100 psychology studies and found that only about 36% of the replication attempts produced statistically significant results, but the authors explicitly cautioned against interpreting this as evidence of pervasive fraud.
Misconception: Pre-registration eliminates researcher bias.
Correction: Pre-registration reduces flexibility in analysis choices but does not eliminate all forms of bias. Hypotheses can still be vague, primary outcomes can be swapped after data collection if registration documents are poorly monitored, and deviations from registered protocols are common. Pre-registration shifts the problem; it does not solve it.
Misconception: High-impact journals have better reproducibility records.
Correction: Evidence points in the opposite direction. Ioannidis (2005) specifically noted that journals with high selectivity and prestige may publish more surprising findings, which by definition are less likely to be true under low prior probability conditions. The Bayer replication study (Prinz et al., 2011, Nature Reviews Drug Discovery) found no correlation between reproducibility and the prestige or impact factor of the journal in which the original work appeared.
Checklist or steps
The following elements appear in fully reproducible study designs as documented by the Center for Open Science's Transparency and Openness Promotion (TOP) Guidelines, which define eight modular standards:
- Citation standards for data, code, and research materials
- Data transparency (deposits in trusted repositories)
- Analytic methods (code) transparency
- Research materials transparency
- Design and analysis reporting transparency
- Preregistration of studies
- Preregistration of analysis plans
- Replication (journal policies on publishing replication studies)
Reference table or matrix
| Failure type | Primary driver | Detection method | Remediation |
|---|---|---|---|
| Computational irreproducibility | Software version drift, undocumented preprocessing | Re-run code on archived data | Containerization, archived environments |
| Statistical non-replication | Low power, undisclosed multiple comparisons | Meta-analysis, registered replication | Pre-registration, power analysis |
| Conceptual non-replication | Construct variability, population differences | Multi-site replication | Explicit operationalization, heterogeneous samples |
| Data fabrication | Deliberate misconduct | Statistical forensics (GRIM, SPRITE tests) | Institutional investigation, 42 CFR Part 93 |
| Selective reporting | Publication bias | Funnel plot asymmetry, outcome switching audit | Outcome pre-registration, results-neutral review |
| Protocol deviation | Poor documentation, site variability | Audit trail review, monitoring visits | ALCOA+ compliance, SOPs |
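The GRIM test named in the table (introduced by Brown and Heathers, 2016) exploits a simple granularity constraint: the mean of n integer-valued responses must equal k/n for some integer k, so many reported means are arithmetically impossible for the stated sample size. A minimal sketch of that check, not the authors' reference implementation:

```python
def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM check: can a sample of n integer values produce a mean that
    rounds to the reported value at the given precision?"""
    target = round(mean, decimals)
    # The true mean must be k/n for some integer k; test every candidate
    # sum k whose mean could round to the reported value.
    half_grain = 0.5 * 10 ** -decimals
    k_low = max(int(n * (target - half_grain)) - 1, 0)
    k_high = int(n * (target + half_grain)) + 1
    return any(round(k / n, decimals) == target
               for k in range(k_low, k_high + 1))
```

For example, with n = 28 integer-scale responses, a reported mean of 5.18 is attainable (145/28), while 5.19 falls between the achievable values and flags the report for closer scrutiny.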
The science authority model described across thescienceauthority.com treats reproducibility not as a bureaucratic compliance exercise but as the load-bearing structure of scientific knowledge — the mechanism by which provisional findings either earn the status of established fact or get quietly retired. Understanding it is foundational to evaluating any empirical claim.