P-Values Explained: What They Mean and What They Don't

A p-value is the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true. That single sentence is the whole definition, and almost every misunderstanding of p-values comes from quietly dropping one of its parts. The p-value is not the probability that your hypothesis is correct. It is not the probability that your result happened by chance. It is a conditional statement: if nothing were going on, how often would data like yours show up anyway?

This guide explains what a p-value actually measures, how to read it without falling into the traps that catch experienced researchers, what the 0.05 threshold means and where it came from, and the five interpretations of p-values that are flatly wrong but appear constantly in published work. We'll work through a concrete example with real numbers, then cover how to report p-values correctly so a reviewer never sends your manuscript back over a statistics sentence.

Quick Answer

What it is. A p-value is the probability of observing data at least as extreme as yours if the null hypothesis were true. It measures how compatible your data is with a model of no effect.

How to read it. A small p-value (typically below 0.05) means your data would be surprising under the null hypothesis, so you reject the null. A large p-value means your data is consistent with the null, so you fail to reject it.

What it is not. The p-value is not the probability that the null hypothesis is true, not the probability your result was due to chance, and not a measure of how large or important the effect is.

Why the cautions matter. Misreading a p-value as "the chance my finding is real" leads to overstated conclusions, one of the statistical errors reviewers flag most often.

What a P-Value Actually Measures

To understand p-values, you have to start with the null hypothesis, because a p-value only has meaning in relation to one. If you're not yet comfortable with how null and alternative hypotheses are set up, it's worth reading the guide to hypothesis testing and how to set up the null and alternative first, since everything below builds on that framework.

Here's the logic. You assume the null hypothesis is true, which usually means assuming there's no real effect: no difference between groups, no relationship between variables. Under that assumption, you ask how probable it is that random sampling alone would produce a result at least as far from "no effect" as the one you actually got. That probability is the p-value.

Think of it as a measure of surprise. If the null hypothesis is true and you still see a large difference in your sample, one of two things happened. Either you witnessed an unlikely fluke, or your starting assumption was wrong. The smaller the p-value, the harder it is to write your result off as a fluke, and the more reasonable it becomes to conclude that the null assumption doesn't hold.

A p-value of 0.03 means this: if there were truly no effect, you'd see a result this extreme or more extreme about 3% of the time. A p-value of 0.40 means you'd see a result at least as extreme as yours 40% of the time even with no real effect, which is often enough that the result gives you little reason to doubt the null.
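
To make "how often would data like this show up if nothing were going on" concrete, here is a minimal sketch of a permutation test. It is one way to estimate a p-value, not the only one, and the function name and the two sets of scores are made up for illustration. The idea: if the null is true, group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(group_a) + list(group_b)
    observed = abs_mean_diff(pooled)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # under the null, labels are interchangeable
        # small tolerance guards against floating-point noise in ties
        if abs_mean_diff(pooled) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Made-up scores for two small groups (illustrative only)
group_a = [5.1, 6.3, 5.8, 7.0, 6.1, 5.5, 6.8, 6.4]
group_b = [4.9, 5.2, 5.6, 4.7, 5.9, 5.0, 5.3, 5.4]
print(permutation_p_value(group_a, group_b))
```

The number printed is the share of shuffles that produced a gap at least as large as the real one, which is the definition of a p-value applied directly.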

The 0.05 Threshold: Where It Comes From

By convention, researchers compare the p-value against a cutoff called the significance level, or alpha. The most common alpha is 0.05. When the p-value falls below 0.05, the result is called "statistically significant" and the null hypothesis is rejected. When it's at or above 0.05, the result is "not statistically significant" and you fail to reject the null.

The 0.05 line is not a law of nature. It traces back to the statistician Ronald Fisher in the 1920s, who suggested a 1-in-20 threshold as a reasonable, convenient standard. It stuck, and it's now the default across most of the social and biological sciences. Some fields use stricter cutoffs. Particle physics famously requires five sigma, a threshold equivalent to a p-value around 0.0000003, before announcing a discovery.

The important point is that 0.05 is a chosen convention, not a bright line between truth and falsehood. A result with a p-value of 0.049 and a result with a p-value of 0.051 are nearly identical in strength of evidence. Treating one as a triumph and the other as a failure is a habit worth resisting, even though journals often reinforce it.
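
One way to see how close 0.049 and 0.051 really are is to convert each into the z-score it corresponds to. The sketch below assumes a two-sided test on a standard normal statistic and uses Python's built-in statistics module; the function name is ours. It also computes the one-sided tail area beyond five standard deviations, which is how the particle physics figure is usually derived.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def two_sided_critical_z(alpha):
    """z-score whose two-sided tail area equals alpha."""
    return std_normal.inv_cdf(1 - alpha / 2)

for p in (0.051, 0.05, 0.049):
    print(f"p = {p}: z = {two_sided_critical_z(p):.3f}")

# One-sided tail area beyond five standard deviations
five_sigma_p = 1 - std_normal.cdf(5)
print(f"five sigma: p = {five_sigma_p:.1e}")
```

The z-scores on either side of the 0.05 line differ by less than 0.02, which is a very small gap to hang a "significant" label on.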

What a P-Value Does Not Tell You

This is the section that matters most, because the wrong interpretations of p-values are so widespread they appear in textbooks and published papers. Here are the five most common, each followed by the correction.

Mistake 1: "The p-value is the probability the null hypothesis is true"

This is the most damaging error. A p-value of 0.03 does not mean there's a 3% chance the null hypothesis is true and a 97% chance your effect is real. The p-value is calculated by assuming the null is true. It cannot then turn around and tell you the probability of that same assumption. The p-value is the probability of the data given the null, not the probability of the null given the data. Those are different quantities, and confusing them is the root of most p-value misuse.
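
A short calculation shows how far apart the two quantities can be. To get the probability of the null given a significant result, you need information a p-value never contains: how often tested effects are real, and how much power the study had. The sketch below applies Bayes' rule with assumed values for both. It conditions on "significant at alpha" rather than on an exact p-value, so treat it as an illustration of the gap, not a way to convert a specific p-value.

```python
def prob_null_given_significant(prior_real, power, alpha=0.05):
    """P(null is true | result is significant), by Bayes' rule."""
    false_positives = (1 - prior_real) * alpha  # null true, significant anyway
    true_positives = prior_real * power         # effect real and detected
    return false_positives / (false_positives + true_positives)

# Assumed inputs: 10% of tested hypotheses are real, studies have 80% power
print(prob_null_given_significant(prior_real=0.10, power=0.80))
```

With those assumed inputs the answer is 0.36: more than a third of significant results would come from true nulls, even though every one of them has p below 0.05. Change the assumptions and the number changes, which is exactly why the p-value alone can't supply it.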

Mistake 2: "The p-value is the probability the result was due to chance"

Close to the first mistake, and just as wrong. The p-value already assumes chance is the only thing operating, because it assumes the null is true. So it can't also be the probability that chance produced your result. A better phrasing: the p-value tells you how often chance alone would produce a result this extreme, not the probability that chance is the explanation for what you saw.

Mistake 3: "A small p-value means a large or important effect"

A p-value says nothing about the size of an effect. With a large enough sample, a trivial difference can produce a tiny p-value, because large samples make even small effects detectable. A p-value of 0.001 on a meaningless 0.1-point difference is entirely possible. This is why you must report an effect size alongside the p-value. Significance is about detectability. Effect size is about importance. They're separate questions.
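
Here's a rough illustration with invented numbers: a 0.1-point difference on a scale with a standard deviation of 5, measured in 50,000 people per group. The sketch uses a normal approximation with the standard deviation treated as known, which is a simplification of a real t-test, and the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def z_test_from_summary(mean_diff, sd, n_per_group):
    """Two-sided p-value and Cohen's d for two equal-sized groups (normal approximation)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = mean_diff / sd  # effect size does not depend on sample size
    return p_value, cohens_d

# Hypothetical study: tiny difference, very large sample
p, d = z_test_from_summary(mean_diff=0.1, sd=5.0, n_per_group=50_000)
print(f"p = {p:.4f}, Cohen's d = {d:.2f}")
```

The p-value comes out well below 0.01 while Cohen's d is 0.02, far below what is conventionally called even a small effect. Reporting both numbers lets the reader see that at a glance.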

Mistake 4: "A non-significant p-value proves there is no effect"

Failing to reject the null is not the same as confirming it. A p-value of 0.30 means your data is consistent with no effect, but it's also consistent with a small or moderate effect your study was too small to detect. Absence of evidence is not evidence of absence. The correct conclusion is that you did not find sufficient evidence of an effect, not that you proved there is none.
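
Statistical power puts a number on "too small to detect." The sketch below approximates the power of a two-sided, two-group comparison using the normal distribution. The effect size (Cohen's d of 0.3) and the sample sizes are assumptions chosen for illustration, and a proper power analysis for a t-test would give slightly different figures.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(cohens_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    shift = cohens_d * sqrt(n_per_group / 2)  # expected z when the effect is real
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)

for n in (20, 100, 400):
    print(n, round(approx_power(cohens_d=0.3, n_per_group=n), 2))
```

With 20 people per group, a real effect of this size would reach significance only around 16% of the time under these assumptions. A non-significant result from that study says very little about whether the effect exists.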

Mistake 5: "A significant p-value means the finding will replicate"

A single significant result is not a guarantee that the next study will find the same thing. P-values vary from sample to sample, sometimes a lot. A p-value of 0.04 in one study can easily become 0.20 in a direct replication, even when a real effect exists. Replication depends on the true effect size, the sample size, and study design, not on how impressive one p-value looked.
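
A quick simulation makes the variability visible. The sketch below runs the same hypothetical study 2,000 times with a real effect built in (a true difference of 0.4 standard deviations, 50 people per group) and records the p-value each time. All the numbers are assumptions, and the test treats the standard deviation as known to keep the code short.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulate_p_values(true_diff, sd, n_per_group, n_studies=2_000, seed=1):
    """Run many identical studies of one real effect and collect their p-values."""
    rng = random.Random(seed)
    normal = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)  # sd treated as known, for simplicity
    p_values = []
    for _ in range(n_studies):
        group_a = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        p_values.append(2 * (1 - normal.cdf(abs(z))))
    return p_values

p_values = simulate_p_values(true_diff=0.4, sd=1.0, n_per_group=50)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"significant in {share_significant:.0%} of studies")
print(f"smallest p = {min(p_values):.4f}, largest p = {max(p_values):.2f}")
```

Every simulated study measures the same real effect with the same design, yet only about half of them cross the 0.05 line and the p-values range from tiny to large. One study's p-value is a single draw from that spread.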

A Worked Example: Reading a P-Value Correctly

Let's continue the example from the hypothesis testing guide so the numbers connect. Suppose you tested whether men and women differ in mean financial risk tolerance, a question studied by Fisher and Yao (2017) in consumer economics. Your sample produced a difference of 0.5 points, men scoring higher, and an independent-samples t-test returned a p-value of 0.001.

Here is the correct interpretation, stated carefully.

What you can say. If there were truly no difference between men and women in the population, a sample difference this large or larger would occur only about once in a thousand studies. Because that is very unlikely, you reject the null hypothesis and conclude the data provides strong evidence of a real difference.

What you cannot say. You cannot say there is a 99.9% chance the effect is real. You cannot say there is a 0.1% chance the result was a fluke. You cannot say the 0.5-point gap is large or important just because the p-value is small.

Notice how much narrower the legitimate claim is than the tempting one. The p-value licenses you to reject a specific assumption. It does not hand you a probability that your theory is correct. Keeping that distinction sharp in your writing is what separates a clean results section from one that invites reviewer comments.

Now change one detail. Suppose the same 0.5-point difference came from a much smaller sample and returned a p-value of 0.12. Nothing about the observed effect changed: the gap is still 0.5 points. But you'd now fail to reject the null, because a difference that size is no longer surprising given the smaller sample. The effect size stayed the same. The p-value moved because p-values depend heavily on sample size. That dependence is exactly why a p-value should never be read as a measure of how big or real an effect is.
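
You can turn that dependence around and ask how many people per group a fixed 0.5-point gap needs before it crosses a given threshold. The sketch below does that with a normal approximation. The standard deviation of 1.5 is a hypothetical value chosen for illustration, not a figure from the example above, and the function name is ours.

```python
from math import ceil
from statistics import NormalDist

def min_n_per_group(mean_diff, sd, alpha=0.05):
    """Smallest equal group size at which an observed difference reaches p < alpha.

    Normal approximation with sd treated as known.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(2 * (z_crit * sd / mean_diff) ** 2)

# Same 0.5-point gap throughout; sd = 1.5 is a hypothetical value
for alpha in (0.05, 0.001):
    print(f"alpha = {alpha}: n per group = {min_n_per_group(0.5, 1.5, alpha)}")
```

Under these assumptions the gap becomes significant at the 0.05 level with about 70 people per group. Nothing about the effect changes on either side of that number. Only the amount of data does.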

How to Report P-Values Correctly

Once you've interpreted a p-value correctly, you still have to write it up to the standard your style guide and reviewers expect. A few rules cover most situations.

Report exact values when you can. Write p = 0.03, not p < 0.05, for any value above 0.001. Exact values give readers more information than threshold statements.
Use p < 0.001 for very small values. Below 0.001, extra digits rarely change the conclusion, so the convention is to report p < 0.001 rather than p = 0.0004.
Pair every p-value with an effect size. A p-value alone tells a reader nothing about magnitude. Report the mean difference, correlation, Cohen's d, or other effect size alongside it.
Don't write "p = 0.000." Software sometimes rounds to 0.000, but a probability is never exactly zero. Report p < 0.001 instead.
Avoid "approached significance." A p-value of 0.06 is not significant at the 0.05 level. Phrases like "approached significance" or "trending toward significance" are widely criticized. Report the value and let it speak.
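
If you generate results programmatically, the first, second, and fourth rules above can be folded into a small helper. This is a sketch under our own naming, not part of any statistics package. The leading_zero switch is there because some style guides, APA among them, drop the zero before the decimal point.

```python
def format_p(p, decimals=3, leading_zero=True):
    """Format a p-value for a results section."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    floor = 10 ** -decimals  # 0.001 when decimals is 3
    if p < floor:
        # covers software output that rounds to 0.000
        text = f"p < {floor:.{decimals}f}"
    else:
        text = f"p = {p:.{decimals}f}"
    return text if leading_zero else text.replace("0.", ".", 1)

for value in (0.0, 0.0004, 0.03, 0.06):
    print(format_p(value), "|", format_p(value, leading_zero=False))
```

The helper always prints three decimals, so 0.03 appears as 0.030. Trim the trailing zero by hand if your style guide prefers two decimals.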

The statistical machinery that produces a p-value rests on a test statistic following a known probability distribution under the null. Many of those distributions are versions of, or approximations to, the normal distribution, and the standardized values that feed into the calculation are the same z-scores covered elsewhere in this series. Understanding those building blocks makes the p-value far less mysterious, because you can see where the probability actually comes from.
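
For the simplest case, a test statistic that follows the standard normal distribution under the null, the conversion takes one line: the two-sided p-value is the area in both tails beyond the observed z-score. Here's a minimal sketch using Python's built-in statistics module. Real analyses often use a t, chi-square, or F distribution instead, and the logic is the same with a different curve.

```python
from statistics import NormalDist

def two_sided_p_from_z(z):
    """Area in both tails of the standard normal beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

for z in (1.0, 1.96, 2.58, 3.29):
    print(f"z = {z}: p = {two_sided_p_from_z(z):.4f}")
```

A z-score of 1.96 lands almost exactly on p = 0.05, which is where the familiar "about two standard deviations" rule of thumb comes from.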

Statistical Significance and Practical Significance

The deepest problem with over-relying on p-values is that they answer only one question: is this effect detectable? They never answer the question your reader actually cares about: does this effect matter? A drug that lowers blood pressure by a statistically significant 0.2 points is detectable and useless. A teaching method that raises scores by a meaningful margin might miss significance in a small pilot study.

Good statistical writing always reports both. State the p-value so readers can see how surprising the data would be if there were no effect, and state the effect size so they can judge whether it's worth caring about. A results section built on p-values alone is incomplete, no matter how many asterisks it carries.

Quick Self-Check Before You Submit

Run your results section against this checklist before a committee or reviewer sees it.

Have you avoided describing the p-value as "the probability the null is true" or "the chance the result was a fluke"?
Did you report an effect size alongside every p-value, so significance isn't mistaken for importance?
Are your p-values reported as exact values where possible, and as p < 0.001 only below that point?
Did you avoid "p = 0.000" and phrases like "approached significance"?
For non-significant results, did you write "did not find evidence of an effect" rather than "proved there is no effect"?
Does your interpretation match what the p-value can actually establish, with no overstated probability claims?

If any of those points is shaky, it's worth fixing before submission, because a misstated p-value is one of the first things a careful reviewer flags. The results section is where a subject-matter editor adds the most value, since the person reviewing your work needs to recognize a statistical overstatement when they see one. Editor World lets you choose your own editor by field across dissertation editing and journal article editing, so the editor reading your p-values understands the conventions your discipline holds you to.

Frequently Asked Questions

What does a p-value actually mean?

A p-value is the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true. It measures how compatible your data is with a model in which there's no real effect. A small p-value means your data would be surprising if the null were true, which gives you reason to reject it. It's a conditional probability about the data, not a statement about how likely any hypothesis is to be correct.

Does a p-value tell you the probability that the null hypothesis is true?

No. This is the most common misreading of p-values. The p-value is calculated by assuming the null is true, so it can't also tell you the probability that the null is true. A p-value of 0.03 doesn't mean there's a 3% chance the null is correct. It means that if the null were true, data at least as extreme as yours would show up about 3% of the time. The p-value is the probability of the data given the null, not the probability of the null given the data.

Why is 0.05 used as the significance threshold?

The 0.05 threshold is a convention, not a fixed rule. It traces back to the statistician Ronald Fisher in the 1920s, who proposed a 1-in-20 standard as a convenient benchmark. It became the default across most social and biological sciences. Some fields use stricter cutoffs like 0.01, and particle physics requires a far smaller threshold before announcing a discovery. A p-value of 0.049 and one of 0.051 are nearly identical in evidence, so the threshold isn't a sharp line between truth and falsehood.

Does a small p-value mean a large or important effect?

No. A p-value says nothing about the size or importance of an effect. With a large enough sample, even a trivial difference can produce a very small p-value, because large samples make small effects detectable. Statistical significance measures whether an effect is detectable, while effect size measures whether it's large enough to matter. That's why you should always report an effect size, such as a mean difference or Cohen's d, alongside the p-value.

How should I report a p-value in APA format?

Report exact p-values when you can rather than threshold statements like p < .05, for any value at or above .001. APA style drops the leading zero because a p-value can never exceed 1, so write p = .03, not p = 0.03. For very small values, use p < .001 rather than many decimal places. Never write p = .000, since a probability is never exactly zero. Always pair the p-value with an effect size so readers can judge magnitude. And avoid phrases like "approached significance," because a p-value either falls below your threshold or it doesn't.

What does a non-significant p-value tell you?

A non-significant p-value, one at or above your significance level, means you fail to reject the null hypothesis. It doesn't prove there's no effect. A non-significant result is consistent with no effect, but it's also consistent with a small or moderate effect your study was too small to detect. The correct conclusion is that you didn't find sufficient evidence of an effect, not that you proved none exists. Absence of evidence isn't the same as evidence of absence.

Page last reviewed: June 2026. Content reviewed and edited by the Editor World editorial team. Editor World, founded in 2010 by Patti Fisher, PhD, provides professional human-only editing, proofreading, and writing services for graduate students, academics, and researchers worldwide. 100% human editing, no AI at any stage. BBB A+ accredited since 2010 with 5.0 / 5 Google Reviews and 5.0 / 5 Facebook Reviews. More than 100 million words edited for over 8,000 clients in 65+ countries. Recommended by the Boston University Economics Department.