There are at least around 20 or so common misunderstandings and abuses of p-values and NHST [Null Hypothesis Significance Testing]. Most of them are related to the definition of p-value … Other misunderstandings are about the implications of statistical significance.

Statistical significance does not mean substantive significance: just because an observation (or a more extreme observation) was unlikely had there been no differences in the population does not mean that the observed differences are large enough to be of practical relevance. At high enough sample sizes, any difference will be statistically significant​ regardless of effect size.

Statistical non-significance does not entail equivalence: a failure to reject the null hypothesis is just that. It does not mean that the two groups are equivalent, since statistical non-significance can be due to low sample size.

Low p-value does not imply large effect sizes: because p-values depend on several other things besides effect size, such as sample size and spread.

It is not the probability of the null hypothesis: as we saw, it is the conditional probability of the data, or more extreme data, given the null hypothesis.

It is not the probability of the null hypothesis given the results: this is the fallacy of transposed conditionals as p-value is the other way around, the probability of at least as extreme data, given the null.

It is not the probability of falsely rejecting the null hypothesis: that would be alpha, not p.

It is not the probability that the​ results are a statistical fluke: since the test statistic is calculated under the assumption that all deviations from the null is due to chance. Thus, it cannot be used to estimate that probability of a statistical fluke since it is already assumed to be 100%.

Rejection null hypothesis is not confirmation of causal mechanism: you can imagine a great number of potential explanations for deviations from the null. Rejecting the null does not prove a specific one. See the above example with suicide rates.

NHST promotes arbitrary data dredging (“p-value fishing”): if you test your entire dataset and does not attain statistical significance, it is tempting to test a number of subgroups. Maybe the real effect occurs in me, women, old, young, whites, blacks, Hispanics, Asians, thin, obese etc.? More likely, you will get a number of spurious results that appear statistically significant but are really false positives. In the quest for statistical significance, this unethical behavior is common.

Debunking Denialism