The Science Methodology: How Research Is Conducted
Science methodology is the structured set of practices that transforms a question into a defensible answer — or, just as importantly, into a more refined question. This page covers how research is designed, executed, evaluated, and classified, from the logic of experimental controls to the contested boundaries between qualitative and quantitative approaches. The methodology is not separate from the science; it is the science, operating beneath every published result.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps (non-advisory)
- Reference table or matrix
Definition and scope
The replication crisis touching psychology, medicine, and nutrition research, driven home by a 2015 project published in Science that found only about a third of roughly 100 psychology findings held up under direct replication, made one thing unavoidable: the quality of a scientific claim depends almost entirely on the rigor of the method that produced it. Methodology is not procedural housekeeping. It is the mechanism that separates a testable, falsifiable proposition from an educated guess with a graph attached.
In scope, scientific methodology encompasses the full sequence from hypothesis formation through data collection, analysis, peer review, and replication. It applies across disciplines — a randomized controlled trial in pharmacology and a field-observation protocol in ecology share the same underlying logic, even when the tools look nothing alike. The National Science Foundation's guidance on research design formally distinguishes basic research, applied research, and experimental development as the three primary modes, a classification that shapes funding eligibility, ethical review requirements, and how results are interpreted.
The scope also includes meta-methodology: research about how research is done. Systematic reviews, meta-analyses, and reproducibility audits are all methodological instruments, applied not to the natural world directly but to the body of evidence that has already accumulated about it.
Core mechanics or structure
The most durable structure in empirical science is the hypothetico-deductive model, developed formally through the work of philosophers including Karl Popper and Carl Hempel. The sequence moves through five functional stages: observation, hypothesis formation, prediction, testing, and revision. What makes it powerful is the revision step — a methodology that cannot update on negative results is not science, it is advocacy.
Within that frame, two families of methods handle most of the actual work. Experimental methods introduce a controlled manipulation — an independent variable — and measure its effect on a dependent variable while holding confounds constant. Observational methods measure without manipulation, relying on statistical controls rather than physical ones. The distinction matters enormously: experimental designs can support causal inference, observational ones generally cannot, though sophisticated techniques like instrumental variable analysis and difference-in-differences regression narrow that gap.
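To make that concrete, the sketch below computes a difference-in-differences estimate on synthetic data. Everything in it is invented for illustration: the group baselines, the shared time trend, and the +2.0 "true effect" the estimator should recover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic panel: treated and control groups, observed before and after
# a policy change. The groups differ at baseline and share a common time
# trend; only the treated group receives the true effect of +2.0.
n = 1_000
trend, true_effect = 1.5, 2.0

pre_control = rng.normal(10.0, 1, n)
post_control = rng.normal(10.0 + trend, 1, n)
pre_treated = rng.normal(12.0, 1, n)
post_treated = rng.normal(12.0 + trend + true_effect, 1, n)

# Difference-in-differences: subtract the control group's change from the
# treated group's change, cancelling both the baseline gap and the trend.
did = ((post_treated.mean() - pre_treated.mean())
       - (post_control.mean() - pre_control.mean()))
print(f"DiD estimate: {did:.2f}  (true effect: {true_effect})")
```

The estimate holds only under the parallel-trends assumption, which is why the reference table below flags quasi-experimental designs as requiring strong assumptions about counterfactuals.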
Measurement reliability and validity are the mechanical substrate of both approaches. Reliability means a measure produces consistent results across repeated applications; validity means it actually captures what it claims to capture. A bathroom scale is reliable if it gives the same reading twice in a row; it is valid if that reading corresponds to actual body mass. The two can come apart — a poorly calibrated but consistent instrument is reliable but not valid. The Standards for Educational and Psychological Testing, published jointly by the AERA, APA, and NCME, treat the reliability-validity relationship as a foundational testing principle, and the framework has been widely adopted across social and behavioral sciences.
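The bathroom-scale example takes only a few lines to simulate. The +5 kg calibration bias and the noise level below are invented for illustration; the point is that consistency and accuracy are assessed separately.

```python
import numpy as np

rng = np.random.default_rng(1)

true_mass = 80.0  # actual body mass, kg
# A consistent but miscalibrated scale: +5 kg offset, very little noise.
readings = true_mass + 5.0 + rng.normal(0, 0.1, 10)

# Reliability: repeated readings agree closely with one another.
print(f"spread across readings (std): {readings.std():.2f} kg")  # tiny
# Validity: the readings systematically miss the true value.
print(f"mean error vs. true mass: {readings.mean() - true_mass:+.2f} kg")  # ~ +5
```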
The peer review system, which routes submitted findings through independent expert evaluation before publication, functions as a methodological checkpoint — not a guarantee of truth, but a structured filter for obvious errors in design and inference.
Causal relationships or drivers
Methodology shapes outcomes through three primary drivers: design choices, sampling decisions, and analytical strategy.
Design choices determine what kinds of claims the data can support. A cross-sectional study — one snapshot in time — cannot distinguish cause from effect. A longitudinal cohort study tracks the same subjects over time and can establish temporal precedence, a necessary (though not sufficient) condition for causal inference. Randomized controlled trials support causal inference by distributing unmeasured confounds roughly equally across conditions, which is why the National Institutes of Health describes the RCT as the gold standard for evaluating medical interventions.
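A minimal simulation shows why randomization handles even confounds nobody thought to measure; the "baseline health" variable and sample size here are illustrative, not drawn from any real trial.

```python
import numpy as np

rng = np.random.default_rng(2)

# An unmeasured confound (say, baseline health on a 0-1 scale) for 10,000 subjects.
confound = rng.uniform(0, 1, 10_000)

# Random assignment never looks at the confound...
treated = rng.integers(0, 2, confound.size).astype(bool)

# ...yet the two arms end up nearly identical on it, so an outcome difference
# between arms cannot be explained by this (or any other stable) variable.
print(f"treatment arm mean confound: {confound[treated].mean():.3f}")
print(f"control arm mean confound:   {confound[~treated].mean():.3f}")
```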
Sampling decisions determine generalizability. A convenience sample of 200 undergraduate students can reveal a lot about undergraduates. Extrapolating those findings to the adult population requires assumptions that must be stated explicitly, not buried. Probability sampling — where every member of a defined population has a known, nonzero chance of selection — is the mechanism that makes statistical generalization defensible. The U.S. Census Bureau's Survey Methodology documents probability sampling as central to population inference at national scale.
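The mechanism is easy to sketch. In the illustration below (the income distribution and selection probabilities are invented), high values are deliberately oversampled, yet because every unit's selection probability is known, weighting by its inverse recovers the population mean.

```python
import numpy as np

rng = np.random.default_rng(3)

# A synthetic population of 100,000 incomes (arbitrary units).
population = rng.lognormal(mean=10, sigma=0.5, size=100_000)

# Probability sample: every unit has a known, nonzero chance of selection.
# Units above the median are twice as likely to be drawn.
p = np.where(population > np.median(population), 0.02, 0.01)
sampled = rng.random(population.size) < p

naive = population[sampled].mean()  # ignores the design: biased high
weighted = np.average(population[sampled], weights=1 / p[sampled])  # design-weighted

print(f"population mean:   {population.mean():,.0f}")
print(f"naive sample mean: {naive:,.0f}")
print(f"weighted estimate: {weighted:,.0f}")
```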
Analytical strategy determines what patterns emerge from collected data. The same dataset can support contradictory conclusions depending on whether the analyst controls for age, income, or pre-existing conditions — a fact that has produced serious debates around what is sometimes called "researcher degrees of freedom," the accumulated small choices that collectively tilt a result in one direction. Pre-registration, in which researchers publicly commit to their hypotheses and analysis plan before collecting data, is the primary institutional response. The Open Science Framework (OSF) hosted over 100,000 pre-registered studies as of 2023, a number that reflects how widely the practice has been adopted since gaining traction around 2013.
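A simulation makes the stakes of those small choices visible. Below, two groups are drawn from the same distribution, so the true effect is zero; the specific analysis variants the flexible analyst tries are invented for illustration, but the inflation of false positives is the general phenomenon pre-registration exists to prevent.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
trials, n = 2_000, 50
planned_hits = flexible_hits = 0

for _ in range(trials):
    # Two groups drawn from the SAME distribution: no true effect exists.
    a, b = rng.normal(0, 1, (2, n))
    covariate = rng.normal(0, 1, n)

    # The pre-registered analyst runs the one planned test.
    p_planned = stats.ttest_ind(a, b).pvalue
    planned_hits += p_planned < 0.05

    # The flexible analyst also tries trimming "outliers" and subsetting on a
    # covariate, and reports success if ANY variant crosses the threshold.
    p_trimmed = stats.ttest_ind(np.sort(a)[2:-2], np.sort(b)[2:-2]).pvalue
    p_subset = stats.ttest_ind(a[covariate > 0], b[covariate > 0]).pvalue
    flexible_hits += min(p_planned, p_trimmed, p_subset) < 0.05

print(f"false-positive rate, pre-registered: {planned_hits / trials:.1%}")  # ~5%
print(f"false-positive rate, flexible:       {flexible_hits / trials:.1%}")  # above 5%
```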
Classification boundaries
Methodology gets classified along several axes, and the boundaries between them are less rigid than textbooks imply.
Quantitative vs. qualitative: Quantitative methods produce numerical data amenable to statistical analysis. Qualitative methods — interviews, ethnographies, grounded theory — produce textual or observational data that requires interpretive analysis. The distinction is real, but the sharper version (that qualitative research is "soft" and quantitative is "hard") is an oversimplification. A poorly designed survey produces worse knowledge than a carefully executed interview study.
Primary vs. secondary research: Primary research generates new data. Secondary research analyzes data others have already collected. A systematic review is secondary research; a clinical trial is primary.
Basic vs. applied: Basic research pursues understanding without regard to immediate application. Applied research targets a defined practical problem. The NSF definition formalizes this as a distinction in intent, not in rigor.
Exploratory vs. confirmatory: Exploratory research searches for patterns in data without a prior hypothesis. Confirmatory research tests a specific, pre-stated hypothesis. The replication crisis is partly explained by confirmatory-framed papers that were actually exploratory in design — a practice sometimes called HARKing (Hypothesizing After Results are Known), discussed extensively in Kerr (1998) in Personality and Social Psychology Review.
The full landscape of scientific principles underlying these classifications is explored in The Science Principles and Theories.
Tradeoffs and tensions
Every methodological choice involves giving something up. Internal validity — the degree to which a study's results reflect a real causal relationship — and external validity — the degree to which results generalize beyond the study context — are in persistent tension. Laboratory experiments maximize control and internal validity. Field studies maximize ecological realism and external validity. A study almost never maximizes both.
Speed trades against rigor. Preprint culture, which accelerated sharply during the COVID-19 pandemic, makes preliminary findings available before peer review, a practice that serves genuine public interest while also propagating methodological errors at scale. The CDC's science and research section has addressed the challenge of communicating findings at different stages of the evidence pipeline to non-specialist audiences.
Statistical significance thresholds create their own tension. The conventional p < 0.05 standard was not designed to be a universal cutoff for truth; Ronald Fisher, who introduced the 0.05 threshold in his 1925 text Statistical Methods for Research Workers, described it as a rough guide, not a law. The American Statistical Association's 2016 statement on p-values explicitly warned against treating statistical significance as binary. Despite this, journals, grant agencies, and public science reporting routinely treat p < 0.05 as the border between real and not-real.
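A short calculation shows how arbitrary the border is. Hold a study's observed effect fixed (the 0.25 effect size and sample sizes below are invented) and vary only the number of subjects: the p-value slides smoothly across 0.05 while the evidence barely changes.

```python
import math
from scipy import stats

# Hypothetical one-sample studies all reporting the SAME effect:
# observed mean 0.25, standard deviation 1.0. Only n differs.
mean, sd = 0.25, 1.0
for n in (55, 60, 62, 65, 70):
    t = mean / (sd / math.sqrt(n))         # one-sample t statistic
    p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value
    print(f"n={n}: t={t:.2f}, p={p:.4f}")  # crosses 0.05 between n=62 and n=65
```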
These tensions are also documented in The Science Controversies and Debates.
Common misconceptions
"Correlation doesn't imply causation" means correlational data is useless. Epidemiology, economics, and climate science operate largely on observational data and have produced foundational causal knowledge. The claim means that correlation alone is insufficient for causal inference — not that it contributes nothing.
"Peer review catches errors." Peer review is a screen for plausible methodology, not a verification of raw data or a replication of the analysis. Errors, fabrications, and p-hacking have cleared peer review at top journals. The process raises the bar; it does not guarantee that everything clearing it is sound.
"A larger sample size always improves a study." A large sample amplifies the precision of whatever the study is measuring, including its flaws. A biased measurement instrument applied to 10,000 subjects produces a very precise estimate of the wrong thing.
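A quick simulation, using an invented 3-unit calibration bias, shows the failure mode: the standard error shrinks as n grows, while every estimate converges on the wrong value.

```python
import numpy as np

rng = np.random.default_rng(6)
true_value, bias = 50.0, 3.0  # the instrument adds a constant +3 offset

for n in (100, 10_000, 1_000_000):
    readings = rng.normal(true_value + bias, 5.0, n)
    se = readings.std() / np.sqrt(n)  # precision improves with n...
    # ...but the estimate converges on 53, not 50: a precise wrong answer.
    print(f"n={n:>9,}: estimate = {readings.mean():.2f} ± {se:.3f} (truth: {true_value})")
```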
"Replication failure means the original study was fraudulent." Most replication failures reflect differences in sample, context, procedure, or simply statistical noise across independent trials. Fraud is rare; methodological fragility is common.
Checklist or steps (non-advisory)
The following sequence describes the components present in a complete empirical study as documented by organizations including the HHS Office of Research Integrity (ORI):
- Phenomenon identification — A natural or social pattern is observed and deemed worthy of systematic investigation.
- Literature review — Existing findings are catalogued to identify what is established, what is contested, and where genuine gaps exist.
- Research question formulation — A specific, bounded question is stated. Broad questions get operationalized into measurable terms.
- Hypothesis specification — A falsifiable prediction is stated before data collection begins. Pre-registration documents this step publicly.
- Research design selection — Experimental, quasi-experimental, or observational design is chosen based on the question's causal demands and ethical constraints.
- Sampling and recruitment — A population is defined, and a sampling strategy is selected. Inclusion and exclusion criteria are documented.
- Instrument development or selection — Measures, surveys, devices, or observation protocols are chosen and validated for the study population.
- Ethical review — Institutional Review Board (IRB) or equivalent body evaluates risk, consent procedures, and data handling. Governed in the US by 45 CFR Part 46 ("The Common Rule").
- Data collection — Data are gathered according to the pre-specified protocol.
- Data analysis — Pre-registered analytical procedures are applied. Deviations from the pre-registered plan are disclosed.
- Interpretation and write-up — Results are interpreted within the study's design limits. Effect sizes and confidence intervals are reported alongside significance tests.
- Peer review and publication — Manuscript is submitted, reviewed by independent experts, revised, and published or posted as a preprint.
- Replication and meta-analysis — Independent teams attempt to reproduce the findings. Results are aggregated across studies through systematic review.
The full resources landscape for those working through this sequence is detailed at The Science Trusted Resources.
Reference table or matrix
| Design Type | Causal Inference? | Typical Use Case | Key Strength | Key Limitation |
|---|---|---|---|---|
| Randomized Controlled Trial | Strong | Drug efficacy, behavioral interventions | Controls unmeasured confounds | High cost; ethical constraints on randomization |
| Quasi-experimental (e.g., difference-in-differences) | Moderate | Policy evaluation, natural experiments | Feasible when randomization is impossible | Requires strong assumptions about counterfactuals |
| Longitudinal cohort | Moderate | Chronic disease, developmental change | Establishes temporal order | Attrition; confounding accumulates over time |
| Cross-sectional | Weak | Prevalence estimates, screening | Fast, low-cost | Cannot establish cause; snapshot only |
| Case-control | Moderate (retrospective) | Rare disease etiology | Efficient for low-prevalence outcomes | Recall bias; selection bias risk |
| Systematic review / Meta-analysis | Derived | Synthesizing a literature | Maximizes statistical power across studies | Quality limited by quality of included studies |
| Ethnographic / Qualitative | Contextual | Mechanism exploration, underserved populations | Captures complexity and context | Not generalizable through standard statistical logic |
A deeper look at how these methods connect to specific landmark findings lives in The Science Landmark Discoveries, and the broader context of what methodology has revealed about the world is mapped at the main science reference index.