by Deborah G. Mayo

“[C]onfusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth.” (George Barnard 1985, p. 2)

“Relevant clarifications of the nature and roles of statistical evidence in scientific research may well be achieved by bringing to bear in systematic concert the scholarly methods of statisticians, philosophers and historians of science, and substantive scientists…” (Allan Birnbaum 1972, p. 861).

“In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered.” (p. 57, Committee Investigating fraudulent research practices of social psychologist Diederik Stapel)

I was the lone philosophical observer at a special meeting convened by the American Statistical Association (ASA) in 2015 to construct a non-technical document to guide users of statistical significance tests–one of the most common methods used to distinguish genuine effects from chance variability across a landscape of social, physical and biological sciences.

It was, by the ASA Director’s own description, “historical”, but it was also highly philosophical, and its ramifications are only now being discussed and debated. Today, introspection on statistical methods is rather common due to the “statistical crisis in science”. What is it? In a nutshell: high-powered computer methods make it easy to arrive at impressive-looking ‘findings’ that too often disappear when others try to replicate them under conditions where hypotheses and data analysis protocols must be fixed in advance.
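A minimal simulation sketch can make the mechanism concrete. It is an illustration under stated assumptions, not an analysis drawn from any actual study: every outcome is pure noise (no genuine effect exists), each hypothetical study measures 20 outcomes, and each outcome gets a two-sided z-test at the 0.05 level. The function names and all parameter values are invented for the example. A researcher who reports whichever outcome looks best is compared with one who fixes a single outcome in advance.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n=25):
    """Two-sided z-test p-value for H0: mean = 0, using n draws from N(0, 1).

    The data are generated with H0 true, so any 'significant' result
    is a false positive by construction.
    """
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5  # known standard deviation of 1 (assumption)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(n_studies=2000, n_outcomes=20, alpha=0.05, seed=1):
    """Return (prespecified_rate, searched_rate) of 'significant' findings.

    prespecified_rate: the outcome to test is fixed in advance (the first one).
    searched_rate: the smallest of the n_outcomes p-values is reported.
    """
    rng = random.Random(seed)
    prespecified_hits = 0
    searched_hits = 0
    for _ in range(n_studies):
        p_values = [null_p_value(rng) for _ in range(n_outcomes)]
        prespecified_hits += p_values[0] < alpha
        searched_hits += min(p_values) < alpha
    return prespecified_hits / n_studies, searched_hits / n_studies


if __name__ == "__main__":
    prespecified, searched = simulate()
    print(f"fixed in advance: {prespecified:.3f}")
    print(f"best of 20:       {searched:.3f}")
```

With independent outcomes, the chance that at least one of 20 null tests falls below 0.05 is 1 − 0.95^20, about 0.64, so the searched rate should land near that figure while the pre-specified rate stays near 0.05. A replication that fixes the hypothesis in advance is in the second situation, which is one reason searched-for findings tend to vanish.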

How should scientific integrity be restored? Experts do not agree, and the disagreement is intertwined with fundamental disagreements regarding the nature, interpretation, and justification of methods and models used to learn from incomplete and uncertain data. Today’s reformers, fraudbusters, and replication researchers increasingly call for more self-critical scrutiny of philosophical foundations. Philosophers should take this seriously. While philosophers of science are interested in helping to clarify, if not also to resolve, matters of evidence and inference, they are rarely consulted in practice to this end. The assumptions behind today’s competing evidence reforms–issues of what I will call evidence-policy–are largely hidden from those outside the loop of the philosophical foundations of statistics and data analysis, or PhilStat. This is a crucial obstacle to scrutinizing the consequences for science policy, clinical trials, personalized medicine, and a wide landscape of Big Data modeling.

Statistics has a fascinating and colorful history of philosophical debate, marked by unusual heights of passion, personality, and controversy for at least a century. Wars between frequentists and Bayesians have been so contentious that everyone wants to believe we are long past them: we now have unifications and reconciliations, and practitioners only care about what works. The truth is that both brand new and long-standing battles simmer below the surface in questions about scientific trustworthiness. They show up unannounced in the current problems of scientific integrity, questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports, the ASA Statement being just one. There isn’t even agreement as to what it means to say that a method “works”. These are key themes in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP).

Many of the key problems in today’s evidence-policy disputes inherit the conceptual confusions of the underlying methods for evidence and inference. They are intertwined with philosophical terms that often remain vague, such as inference, reliability, testing, rationality, explanation, induction, confirmation, and falsification. This hampers communication among various stakeholders, making it difficult to even recognize and articulate where they agree. The philosopher’s penchant for laying bare presuppositions of claims and arguments would let us cut through the unclarity that blocked the experts at the ASA meeting from clearly pinpointing where and why they agree or disagree. (As a mere “observer”, I rarely intervened.) We should put philosophy to work on the popular memes: “All models are false”, “Everything is equally subjective and objective”, “P-values exaggerate evidence”, and “most published research findings are false”.

So am I calling on my fellow philosophers (at least some of them) to learn formal statistics? That would be both too much and too little. Too much because it would be impractical; too little because, despite technical sophistication, basic concepts of statistical testing and inference are more unsettled than ever. P-values–whether to redefine them, lower their thresholds, or ban them altogether–are the subject of heated discussion and journalistic debate. Megateams of seventy or more authors array themselves on either side of the debate (e.g., Benjamin et al. 2017, Lakens et al. 2018), including some philosophers (I was a co-author in Lakens et al., arguing that redefining significance would not help with the problem of replication). The deepest problems underlying the replication crisis go beyond formal statistics–into measurement, experimental design, and communication of uncertainty. Yet these rarely occupy center stage in all the brouhaha. By focusing just on the formal statistical issues, the debates give short shrift to the need to tie formal methods to substantive inferences, to a general account of collecting and learning from data, and to entirely non-statistical types of inference. The goal becomes: who can claim to offer the highest proportion of “true” effects among those outputted by a formal method?

You might say my project is only relevant for philosophers of science, logic, formal epistemology and the like. While they are the obvious suspects, it goes further. Despite the burgeoning of discussions of ethics in research and in data science, the work is generally done by practitioners apart from philosophy, or by philosophers apart from the nitty-gritty details of the data sciences themselves. Without grasping the basic statistics, informed by an understanding of contrasting views of the nature and goals of using probability in learning, it’s impossible to see where the formal issues leave off and informal, value-laden issues arise or intersect. Philosophers in research ethics can wind up building arguments that forfeit a stronger stance that a critical assessment of the methods would afford (e.g., arguing for a precautionary stance when the data in fact show evidence of a genuine risk increase, despite non-significant results). Experimental philosophy is another area that underscores the importance of critically assessing the statistical methods on which it is based. Formal methods, logic and probability are staples of philosophy; why not methods of inference based on probabilistic methods? That’s what statistics is.

Not only is PhilStat relevant to addressing some long-standing philosophical problems of evidence, inference and knowledge, it offers a superb avenue for philosophers to genuinely impact scientific practice and policy. Even a sufficient understanding of the inference methods together with a platform for raising questions about fallacies and pitfalls could be extremely effective. What is at stake is a critical standpoint that we may be in danger of losing. Without it, we forfeit the ability to communicate with, and hold accountable, the “experts,” the agencies, the quants, and all those data handlers increasingly exerting power over our lives. It goes beyond philosophical outreach–as important as that is–to becoming citizen scholars and citizen scientists.

I have been pondering how to overcome these obstacles, and am keen to engage fellow philosophers in the project. I am going to take one step toward exploring and meeting this goal, together with a colleague, Aris Spanos, in economics. We are running a two-week immersive seminar on PhilStat for philosophy faculty and post-docs who wish to acquire or strengthen their background in PhilStat as it relates to philosophical problems of evidence and inference, to today’s statistical crisis of replication, and to associated evidence-policy debates. The logistics are modeled on the NEH Summer Seminars for college faculty that I directed in 1999 (on Philosophy of Experiment: Induction, Reliability, and Error). The content reflects Mayo (2018), which is written as a series of Excursions and Tours in a “Philosophical Voyage” to illuminate statistical inference. Consider joining me. In the meantime, I would like to hear from philosophers interested or already involved in this arena. Do you have references to existing efforts in this direction? Please share them.

Barnard, G. (1985). A Coherent View of Statistical Inference, Statistics Technical Report Series. Department of Statistics & Actuarial Science, University of Waterloo, Canada.

Benjamin, D. et al. (2017). Redefine Statistical Significance, Nature Human Behaviour 2, 6–10.

Birnbaum, A. (1972). More on Concepts of Statistical Evidence, Journal of the American Statistical Association 67, 858–861.

Lakens, D. et al. (2018). Justify Your Alpha, Nature Human Behaviour 2, 168–71.

Levelt Committee, Noort Committee, Drenth Committee (2012). Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel (www.commissielevelt.nl/).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (CUP). (The first chapter [Excursion 1 Tour I] is here.)

Wasserstein, R. & Lazar, N. (2016). The ASA’s Statement on P-values: Context, Process and Purpose (and supplemental materials), The American Statistician 70(2), 129–33.

Credit for the ‘statistical cruise ship’ artwork goes to Mickey Mayo of Mayo Studios, Inc.