Inferential Statistics: A Complete Guide

Inferential statistics is the branch of statistics that uses a sample to draw conclusions about a larger population. You can't measure everyone, so you measure some people, then reason carefully from what you found to what's probably true of the whole group. That reasoning, with its built-in account of uncertainty, is what turns a set of numbers into a finding. Where descriptive statistics summarize the data you have, inferential statistics let you generalize beyond it.

This guide is the map for that territory. It explains how inference works, walks through the core machinery of hypothesis testing, p-values, and the two kinds of error a test can make, and points you to the specific tests you'll use for real analyses. Whether you're writing your first results section or tightening a manuscript for review, this page shows how the pieces fit together and where to go deeper on each one.

Start Here: Choose Your Path

New to inference? Read this guide top to bottom, then follow the link to hypothesis testing. That is the foundation everything else rests on.

Setting up a study? Go straight to hypothesis testing and Type I and Type II errors, which cover how to frame your test and justify your sample size.

Interpreting results? The p-values guide explains what your output actually means, and what it doesn't.

Need the descriptive side first? See the companion guide to descriptive statistics, which covers summarizing the data before you generalize from it.

What Is Inferential Statistics?

Inferential statistics is the set of methods for using a sample to make claims about a population. The population is the entire group you care about: all doctoral students, all patients with a condition, all households in a country. The sample is the smaller group you actually measure. Inference is the bridge between them.

The reason inference is necessary is practical. Measuring a whole population is almost always impossible, too expensive, or too slow. So you draw a sample, calculate something from it, and use that to estimate the corresponding value in the population. A sample mean estimates a population mean. A sample proportion estimates a population proportion. The estimate won't be exact, and the genius of inferential statistics is that it quantifies how far off the estimate might be.

That last point is what separates inferential statistics from simply describing your data. If you survey 200 graduate students and find that men report higher financial risk tolerance than women, descriptive statistics tell you that's true in your 200. Inferential statistics tell you whether you can reasonably conclude it's true of graduate students in general, or whether the gap you saw could easily be an accident of which 200 people you happened to sample.

Descriptive vs. Inferential Statistics

The two branches of statistics do different jobs, and most analyses use both. Descriptive statistics summarize and describe the data in front of you: the mean, the standard deviation, the shape of the distribution. They make no claims beyond the sample. Inferential statistics take the next step, using the sample to reach conclusions about a population you didn't fully measure.

A simple way to keep them straight: descriptive statistics answer "what does my data look like?" while inferential statistics answer "what can I conclude about the world from my data?" You almost always describe before you infer, because you need to understand the shape and spread of your sample before you can reason about the population. For the full treatment of the descriptive side, see the companion guide to descriptive statistics. This guide picks up where that one leaves off.

Populations, Samples, and Sampling Distributions

Three ideas hold inference together. The first is the population parameter, the true value you want to know but can't measure directly, like the average risk tolerance of all graduate students. The second is the sample statistic, the value you actually calculate from your sample, which estimates the parameter. The third, and the one that makes inference work, is the sampling distribution.

A sampling distribution is the distribution of a statistic across all the possible samples you could have drawn. Imagine taking your sample, calculating the mean, then doing it again with a fresh sample, and again, thousands of times. Those means would form their own distribution. That distribution has a predictable shape and spread, and that predictability is what lets you state how close your single sample mean is likely to be to the true population mean.

The sampling distribution is why inference can attach probabilities to conclusions. Because the distribution of sample statistics often follows a known mathematical form, frequently the normal distribution or a close relative such as the t distribution, you can calculate how surprising a given result would be. That calculation is the engine inside every hypothesis test.
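
The repeated-sampling thought experiment described above can be sketched as a short simulation. This is a minimal illustration, not a real analysis: the population of 10,000 scores is made up, and the function name simulate_sample_means is hypothetical. It draws 2,000 samples of 100, records each sample mean, and compares the spread of those means with the standard error that theory predicts (the population standard deviation divided by the square root of the sample size).

```python
import random
import statistics


def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 10,000 made-up risk-tolerance scores.
population_rng = random.Random(42)
population = [population_rng.gauss(50, 10) for _ in range(10_000)]

# The sampling distribution of the mean, approximated by 2,000 samples of 100.
sample_means = simulate_sample_means(population, sample_size=100, n_samples=2_000)

population_mean = statistics.fmean(population)
observed_spread = statistics.stdev(sample_means)
predicted_spread = statistics.pstdev(population) / 100 ** 0.5

print(f"population mean:         {population_mean:.2f}")
print(f"mean of sample means:    {statistics.fmean(sample_means):.2f}")
print(f"spread of sample means:  {observed_spread:.2f}")
print(f"predicted standard error: {predicted_spread:.2f}")
```

The sample means cluster around the population mean, and their spread should land close to the predicted standard error. In a real study you only get one of those 2,000 samples, which is why knowing the shape of the whole distribution matters.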

The Core of Inference: Hypothesis Testing

Most inferential statistics, in practice, comes down to hypothesis testing. A hypothesis test is a formal procedure for deciding whether a pattern in your sample is strong enough to conclude that a real pattern exists in the population, or whether it could plausibly be chance.

Every test starts with two competing statements. The null hypothesis says there's no effect: no difference, no relationship, nothing going on. The alternative hypothesis says there is. You assume the null is true, calculate how likely a result at least as extreme as yours would be under that assumption, and reject the null only if such a result would be very unlikely were it true. The logic is a probabilistic version of proof by contradiction: if assuming "no effect" makes your data look bizarre, you have grounds to doubt the assumption.

This framework is the foundation for every specific test that follows, from t-tests to regression. Getting the setup right (writing the hypotheses correctly, choosing one-tailed or two-tailed, and setting your significance level in advance) determines whether everything downstream holds together. The full procedure, with a worked example, is covered in the guide to hypothesis testing and how to set up the null and alternative. If you read only one linked article from this page, make it that one.
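
The assume-the-null logic can be made concrete with a permutation test. This is a different technique from the t-tests discussed later in this guide, chosen here only because it shows the reasoning directly: if the null is true, group labels are arbitrary, so you shuffle them many times and count how often a shuffled difference is at least as large as the one you observed. The two groups of scores below are made up for illustration, and permutation_test is a hypothetical helper name.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null, labels are interchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add 1 to both counts so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up scores for two small groups.
group_a = [5.1, 6.3, 5.8, 7.0, 6.6, 5.9, 6.8, 6.1]
group_b = [4.9, 5.2, 4.4, 5.6, 5.0, 4.7, 5.3, 4.8]

observed_diff, p_value = permutation_test(group_a, group_b)
print(f"observed difference in means: {observed_diff:.4f}")
print(f"estimated p-value: {p_value:.4f}")
```

Very few shuffles reproduce a gap as large as the observed one, so the "no effect" assumption makes these data look unusual, which is the condition for rejecting the null.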

Reading the Evidence: P-Values

Once you run a test, it produces a p-value, and the p-value is where most misunderstanding of statistics lives. A p-value is the probability of getting a result at least as extreme as yours if the null hypothesis were true. A small p-value means your data would be surprising under the null, which gives you grounds to reject it.

What a p-value is not is equally important. It is not the probability that the null hypothesis is true. It is not the probability your result was a fluke. It is not a measure of how large or important your effect is. Treating it as any of those is the single most common error in applied research, and it draws reviewer comments more reliably than almost anything else. By convention, a p-value below 0.05 is called statistically significant, though that threshold is a chosen convention rather than a law of nature.
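
As a rough sketch of where the number comes from, the block below computes a p-value for the simplest possible case: a one-sample z-test, which assumes the population standard deviation is known. The inputs (a sample mean of 52 against a null value of 50, a standard deviation of 10, and 100 observations) are invented for illustration, and z_test_p_value is a hypothetical function name. It uses only NormalDist from Python's standard library.

```python
from statistics import NormalDist


def z_test_p_value(sample_mean, null_mean, population_sd, n, two_tailed=True):
    """Return the z statistic and p-value for a one-sample z-test."""
    standard_error = population_sd / n ** 0.5
    z = (sample_mean - null_mean) / standard_error

    # Probability, under the null, of a z at least this far out in one tail.
    tail = 1 - NormalDist().cdf(abs(z))

    # Two-tailed counts both directions; one-tailed here means the tail
    # in the direction of the observed difference.
    p = 2 * tail if two_tailed else tail
    return z, p


z, p = z_test_p_value(sample_mean=52.0, null_mean=50.0, population_sd=10.0, n=100)
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")  # z = 2.00, two-tailed p = 0.0455
```

Notice what the calculation conditions on: every step assumes the null mean of 50 is correct. The result is a statement about the data given the null, and nothing in the code estimates how likely the null itself is.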

Because so much rides on reading the p-value correctly, and reporting it the way style guides expect, it has its own dedicated guide: p-values explained, what they mean and what they don't. It covers the misinterpretations to avoid and the rules for reporting p-values in a results section.

When Tests Go Wrong: Type I and Type II Errors

Every hypothesis test can be wrong in two ways, and understanding both is part of understanding inference itself. A Type I error is a false positive: you reject a true null and report an effect that isn't real. A Type II error is a false negative: you fail to reject a false null and miss an effect that's really there.

These errors connect to three quantities you'll see throughout inferential work. Alpha is the probability of a Type I error when the null is true; it is the significance level you set, usually 0.05. Beta is the probability of a Type II error when a real effect of a given size exists. Statistical power, equal to 1 minus beta, is your chance of detecting that effect, with 0.80 treated as the usual minimum for a sound study. The three are linked, and managing them is the heart of good study design.

The relationship between these errors, the tradeoff that ties them together, and the role of sample size in escaping that tradeoff are covered in the guide to Type I and Type II errors. That article also explains power analysis, which reviewers increasingly expect to see justifying your sample size.
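
The link between alpha, power, effect size, and sample size can be sketched with a normal approximation for comparing two group means. Treat this as an approximation under stated assumptions, not a replacement for a proper power analysis: it uses the normal distribution instead of the t distribution, assumes equal group sizes, and takes the effect size as a standardized mean difference (Cohen's d). The function names are hypothetical.

```python
import math
from statistics import NormalDist


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How many standard errors the true difference sits away from zero.
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Chance the test statistic lands beyond either critical value.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


def n_per_group_for_power(effect_size, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(n_per_group_for_power(0.5))            # 63 per group for a medium effect
print(round(approximate_power(0.5, 63), 3))  # just over 0.80
print(round(approximate_power(0.5, 20), 3))  # well under 0.80: underpowered
print(round(approximate_power(0.0, 50), 3))  # 0.05: with no effect, "power" is alpha
```

The last line shows the connection between the two error types: when the true effect is zero, the only way to reject is a false positive, so the rejection rate equals alpha. A t-based calculation gives a slightly larger sample size than this approximation.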

A Worked Thread: Risk Tolerance by Gender

To see how the pieces connect, follow one research question through the whole framework. Fisher and Yao (2017) studied gender differences in financial risk tolerance. Suppose you want to test, in your own sample of graduate students, whether men and women differ in mean risk tolerance.

You begin with hypotheses. The null says there's no difference between men and women. The alternative says there is. You set your significance level at 0.05 before collecting data. You then survey your sample, calculate the mean risk tolerance for each group, and run a test that compares them. The test produces a test statistic and a p-value.

Say the p-value comes back at 0.001. Because that's well below 0.05, you reject the null and conclude that the data show a difference between the groups. But the framework keeps you honest about what that means. The p-value of 0.001 says a gap at least this large would be very rare if there were no real difference; it does not say the gap is large or important, which is a separate question about effect size. And your conclusion carries a small risk of being a Type I error, a false positive, if the null is actually true. Had your sample been much smaller and the result non-significant, you'd have faced the opposite risk, a Type II error: missing a real difference because the study lacked the power to detect it.
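
To make the thread concrete, here is a sketch of the comparison using invented summary statistics. None of these numbers come from Fisher and Yao (2017) or any real study: the means, standard deviations, and group sizes are assumptions chosen for illustration. The block computes Welch's t statistic, its approximate degrees of freedom, and Cohen's d. Because Python's standard library has no t distribution, the p-value uses a normal approximation, which is reasonable only when the groups are large; statistical software would use the t distribution.

```python
import math
from statistics import NormalDist


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t, its approximate degrees of freedom, and Cohen's d."""
    se1_sq, se2_sq = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    # Cohen's d: the mean difference in pooled standard deviation units.
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (mean1 - mean2) / pooled_sd
    return t, df, d


# Hypothetical summary statistics for two groups of 100 students each.
t, df, d = welch_t_from_summary(3.40, 1.0, 100, 2.95, 1.0, 100)

# Large-sample normal approximation to the two-tailed p-value.
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))

print(f"t = {t:.2f}, df = {df:.0f}, approximate p = {p_approx:.4f}")
print(f"Cohen's d = {d:.2f}")
```

With these made-up inputs the p-value is close to 0.001, while Cohen's d is 0.45. The two numbers answer different questions: the first is about how surprising the gap would be under the null, the second about how big the gap is.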

That single thread touches every concept on this page: hypotheses, significance level, p-value, the two error types, effect size, and power. Each linked guide takes one piece and works it out in full.

Estimation: Confidence Intervals

Hypothesis testing isn't the only form of inference. The other major branch is estimation, and its main tool is the confidence interval. Where a hypothesis test gives a yes-or-no verdict on an effect, a confidence interval gives a range of plausible values for the thing you're estimating.

A 95% confidence interval for a mean, for example, comes from a procedure that captures the true population mean in 95% of samples drawn the same way. It carries more information than a p-value alone, because it shows whether an effect is likely real (does the interval exclude zero?), how large it might be (where does the range sit?), and how precisely you've estimated it (how wide is the range?). Many journals now ask for confidence intervals alongside or instead of bare significance tests, precisely because they convey magnitude and precision as well as significance. Confidence intervals are a core part of the common-tests articles in this cluster.
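
The "95% of samples" claim can be checked by simulation. The sketch below builds a confidence interval for a mean using a normal critical value, an approximation that suits larger samples (software typically uses the t distribution instead), then repeats a made-up study 2,000 times and counts how often the interval contains the true mean. The population values and function names are assumptions for illustration.

```python
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


def coverage_rate(true_mean, true_sd, n, n_studies=2_000, seed=0):
    """Share of simulated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / n_studies


rate = coverage_rate(true_mean=50, true_sd=10, n=100)
print(f"share of intervals containing the true mean: {rate:.3f}")
```

The share should come out close to 0.95. The 95% describes the long-run behavior of the procedure across many samples; any single interval either contains the true mean or it doesn't.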

Choosing the Right Test

Once you understand the framework, the practical question becomes which specific test to run. The choice depends on what kind of data you have and what you're comparing. The major families are worth knowing at a glance.

  • T-tests compare the means of one or two groups. The risk tolerance example above, comparing men and women, is a job for an independent-samples t-test.
  • ANOVA (analysis of variance) compares means across three or more groups at once, avoiding the inflated Type I error rate that running many separate t-tests would cause.
  • Chi-square tests work with categorical data, testing whether the distribution of counts across categories departs from what you'd expect by chance.
  • Correlation and regression measure relationships between continuous variables, from the strength of a simple association to models that predict an outcome from several predictors at once.
  • Non-parametric tests step in when your data violates the assumptions the tests above rely on, such as normality, offering alternatives that make fewer assumptions.

Every one of these is a different way of generating a test statistic and a p-value, but they all share the hypothesis-testing logic at the center of this guide. Learn the framework once, and each specific test becomes a variation on a pattern you already understand. The common-tests guides in this cluster cover each family with worked examples.
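
The list above can be read as a rough decision rule, sketched here as a small function. It is deliberately simplified and only mirrors the families named in this guide: the function name, its parameters, and the category labels are all hypothetical, and real test selection also depends on study design details such as paired versus independent samples.

```python
def suggest_test_family(outcome, groups=None, assumptions_met=True):
    """Map a rough description of an analysis to a test family.

    outcome: "continuous" or "categorical"
    groups: number of groups being compared, or None when the question
            is about a relationship between continuous variables
    assumptions_met: False if assumptions such as normality are violated
    """
    if outcome == "categorical":
        return "chi-square test"
    if outcome != "continuous":
        raise ValueError("outcome must be 'continuous' or 'categorical'")
    if not assumptions_met:
        return "non-parametric test"
    if groups is None:
        return "correlation or regression"
    if groups <= 2:
        return "t-test"
    return "ANOVA"


# The risk tolerance example: a continuous score compared across two groups.
print(suggest_test_family("continuous", groups=2))   # t-test
print(suggest_test_family("continuous", groups=4))   # ANOVA
print(suggest_test_family("categorical"))            # chi-square test
```

A lookup like this is a starting point for choosing a family, not a substitute for checking the assumptions of the specific test you end up running.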

Common Mistakes in Inferential Statistics

The same handful of errors account for most of the statistical problems flagged in graduate work and peer review. Knowing them in advance is the cheapest way to avoid them.

  • Misreading the p-value. Treating it as the probability the null is true, or as a measure of effect size, rather than what it is. This is the most common error of all.
  • Confusing statistical and practical significance. In a large sample, even a trivially small effect can be statistically significant. Always report an effect size alongside the p-value.
  • Treating a non-significant result as proof of no effect. Failing to reject the null isn't the same as confirming it, especially in an underpowered study.
  • Skipping the power analysis. Without one, a non-significant result is hard to interpret, because you can't tell a true null from a missed effect.
  • Running many tests without correction. Each test carries its own Type I error risk, so testing many hypotheses inflates the overall false-positive rate unless you adjust for it.
  • Choosing the test after seeing the data. Decisions like one-tailed versus two-tailed, or which test to run, belong before data collection, not after the results are in.
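
Two of the mistakes above come down to simple arithmetic, sketched below. The first function shows how the chance of at least one false positive grows with the number of tests, assuming the tests are independent and every null is true. The second applies the Bonferroni correction, one common adjustment among several. The third uses a normal approximation to show how sample size alone can push a tiny standardized effect below 0.05. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def two_group_p_value(effect_size, n_per_group):
    """Approximate two-tailed p-value when the observed standardized
    difference between two equal-sized groups equals effect_size."""
    z = effect_size * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Twenty uncorrected tests at 0.05: roughly a 64% chance of a false positive.
print(round(familywise_error_rate(0.05, 20), 3))  # 0.642

# Bonferroni brings the overall rate back under 0.05.
corrected = bonferroni_alpha(0.05, 20)
print(round(familywise_error_rate(corrected, 20), 3))  # 0.049

# The same tiny effect (d = 0.02) at two very different sample sizes.
print(round(two_group_p_value(0.02, 100), 3))       # far from significant
print(two_group_p_value(0.02, 100_000) < 0.001)     # True: significant but trivial
```

The last two lines show why an effect size belongs next to every p-value: the effect is identical in both cases, and only the sample size changed.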

Each of these is covered in depth in the linked guides, but the pattern is worth noting on its own: most statistical errors are errors of interpretation and reporting, not of calculation. The software computes the right number. The mistake is in what the researcher claims it means.

That's also where careful editing earns its place. The statistical writing in a dissertation or manuscript is exactly where a subject-matter editor catches an overstated p-value, a missing power analysis, or a non-significant result described as proof of no effect, before a reviewer does. Editor World's editors hold advanced degrees and read this kind of analysis every day across dissertation editing and journal article editing. You choose your own editor by field, so the person reviewing your statistics knows the conventions of your discipline, and a free sample edit of your first 300 words is available before you commit.

Where This Fits in the Statistics Cluster

Inferential statistics is one of two halves of the field. This guide is the hub for the inferential side, linking to the foundations of hypothesis testing, p-values, and error types, and onward to the specific tests built on them. Its companion, descriptive statistics, covers the other half: summarizing and visualizing data before you generalize from it. For the full map of both halves and where to start, see the complete statistics guide for researchers.


Frequently Asked Questions

What is inferential statistics?

Inferential statistics is the branch of statistics that uses a sample to draw conclusions about a larger population. Because measuring an entire population is usually impossible, you measure a smaller sample and reason from it to the population, with a built-in account of how uncertain that reasoning is. A sample mean estimates a population mean, a sample proportion estimates a population proportion, and inferential methods quantify how far the estimate might be from the true value. That's what lets a set of sample numbers support a general finding.


What is the difference between descriptive and inferential statistics?

Descriptive statistics summarize and describe the data you actually have, using measures like the mean, standard deviation, and the shape of the distribution, and they make no claims beyond the sample. Inferential statistics use the sample to reach conclusions about a population you didn't fully measure. Descriptive statistics answer "what does my data look like?" while inferential statistics answer "what can I conclude about the population?" Most analyses use both, describing the sample before generalizing from it.


What are the main methods of inferential statistics?

The two major branches are hypothesis testing and estimation. Hypothesis testing decides whether a pattern in the sample is strong enough to conclude a real effect exists in the population, and it includes tests like t-tests, ANOVA, chi-square tests, correlation, and regression. Estimation produces a range of plausible values for a quantity, most commonly through confidence intervals. Both rest on the idea of a sampling distribution, which describes how a sample statistic varies across all the samples that could have been drawn, and which lets you attach probabilities to conclusions.


What is a sampling distribution?

A sampling distribution is the distribution of a sample statistic across all the possible samples that could be drawn from a population. If you repeatedly drew fresh samples and calculated the mean each time, those means would form their own distribution with a predictable shape and spread. That predictability is what makes inference possible, because it lets you state how close a single sample statistic is likely to be to the true population value. Sampling distributions often follow the normal distribution or a close relative, which is what allows the probability calculations in hypothesis tests.


Why is a p-value not the probability that the null hypothesis is true?

A p-value is calculated by assuming the null hypothesis is true, so it can't also report the probability that the null is true. It's the probability of getting data at least as extreme as yours if the null were true, which is a statement about the data given the null, not about the null given the data. Treating a p-value as the probability the null is true, or as the probability a result was due to chance, is the most common misinterpretation in applied research and a frequent source of reviewer comments.


How do I choose the right statistical test?

The choice depends on the type of data and what you're comparing. A t-test compares the means of one or two groups. ANOVA compares means across three or more groups. A chi-square test works with categorical count data. Correlation and regression measure relationships between continuous variables. Non-parametric tests step in when the data violates the assumptions other tests rely on, such as normality. All of these share the same underlying hypothesis-testing logic, so understanding the framework once makes each specific test a variation on a familiar pattern.


Page last reviewed: June 2026. Content reviewed and edited by the Editor World editorial team. Editor World, founded in 2010 by Patti Fisher, PhD, provides professional human-only editing, proofreading, and writing services for graduate students, academics, and researchers worldwide. 100% human editing, no AI at any stage. BBB A+ accredited since 2010 with 5.0 / 5 Google Reviews and 5.0 / 5 Facebook Reviews. More than 100 million words edited for over 8,000 clients in 65+ countries. Recommended by the Boston University Economics Department.