In this article, we report a study in which 109 research‐active mathematicians were asked to judge the validity of a purported proof in undergraduate calculus. Significant results from our study were as follows: (a) there was substantial disagreement among mathematicians regarding whether the argument was a valid proof, (b) applied mathematicians were more likely than pure mathematicians to judge the argument valid, (c) participants who judged the argument invalid were more confident in their judgments than those who judged it valid, and (d) participants who judged the argument valid usually did not change their judgment when presented with a reason raised by other mathematicians for why the proof should be judged invalid. These findings suggest that, contrary to some claims in the literature, there is not a single standard of validity among contemporary mathematicians.

1. Introduction Mathematical proof is a primary means of conveying mathematical content. Although it may be impossible to provide an operational definition of what constitutes a mathematical proof (e.g., Davis & Hersh, 1981), a preliminary definition describes a proof as a sequence of statements each of which is either accepted as true a priori or is a necessary logical consequence of previous statements in the proof (e.g., Auslander, 2008; Kitcher, 1984). One purported virtue of this genre of argumentation is that each statement in a proof is either a valid logical consequence of previous assertions or it is not. As Rota (1993) noted, as opposed to argumentation in most other disciplines, “mathematical proof does not admit degrees” (p. 93). As a consequence, many in the mathematical community believe that the validity of a proof is not a subjective issue but an objective fact, and they cite the high level of agreement between mathematicians on what constitutes a proof as support for their position. Azzouni's (2004) article, for instance, attempted to explain why “mathematicians are so good at agreeing with one another on whether some proof convincingly establishes a theorem” (p. 84). McKnight, Magid, Murphy, and McKnight (2000) asserted that “all agree that something is either a proof or it is not and what makes it a proof is that every assertion in it is correct” (p. 1). Selden and Selden (2003) remarked on “the unusual degree of agreement about the correctness of arguments and the truth of theorems arising from the validation process” (p. 7); they contended that validity is a function only of the argument and not of the reader: “Mathematicians say that an argument proves a theorem, not that it proves it for Smith and possibly not for Jones” (p. 11). Other mathematicians and philosophers, however, have been more skeptical about whether there are universally adopted standards of mathematical proof. 
Auslander (2008) suggested that “standards of proof vary over time and even among mathematicians at a given time” (p. 62). Rav (2007) argued that mathematical practice has a “pluralistic nature” and concluded that “not only is mathematical proof ‘time‐dependent,’ but because of the historical and methodological wealth of proof practices (plural), any attempt to encapsulate such multifarious practices in a unique and uniform one‐block perspective is bound to be defective” (p. 299). The general purpose of this article is to provide empirical support for the latter position. To situate our contribution more precisely, we note two different sources of disagreement that mathematicians may have when evaluating a proof: performance errors in the validation of a proof, and different standards about what constitutes a proof in an established branch of mathematics. With regard to performance errors, two mathematicians may disagree on whether a proof is valid because one mathematician has overlooked a flaw in the proof. Indeed, some argue that this occurs surprisingly frequently in mathematical practice. Davis (1972), for instance, estimates that as many as half of all published proofs contain logical errors, perhaps because journal referees do not always check every line of a proof that they review (Geist, Löwe, & Van Kerkhove, 2010; Szpiro, 2003; Weber & Mejia‐Ramos, 2011). However, when disagreements are due to performance errors, it is believed that the mathematicians could resolve these disagreements with discussions about the problematic areas in the proof (as Selden & Selden, 2003, suggest). In this article, we demonstrate that there is a second source of disagreement about what mathematicians consider to be a proof.
By asking 109 mathematicians to determine whether an elementary proof from undergraduate calculus was valid, we found that mathematicians used different standards in judging the validity of this proof and that the standards held by mathematicians were related to their research areas.

2. A negative characterization of individual validity judgments To frame both the design and results of our empirical study, we first discuss the ways in which individuals make judgments about validity. We argue that while mathematicians are apparently willing to make positive judgments that particular proofs are valid, these can more accurately be characterized as negative judgments that such proofs are not invalid. We suggest here that this characterization leads to testable predictions for an individual validator's behavior. A positive characterization of individual validity judgments is based on the idea that, when reading a proof, mathematicians attempt to mentally reconstruct the flow of inferences in the proof using the written text and their knowledge of the mathematical domain (perhaps using “inference packages” as suggested by Azzouni, 2005). In our negative characterization, on the other hand, we suggest that a validation attempt can more accurately be seen as a search for problems with a purported proof. Such problems might be of (at least) two types, the detection of either of which might not be trivial. One type of problem is a genuine logical error. Such an error might be identified through a detailed line‐by‐line check, which reveals a statement that does not follow as a necessary consequence of previous statements. Or it might be identified through a higher level check that establishes that methods have been applied inappropriately or that a new method is invalid (see Rav, 1999; Weber & Mejia‐Ramos, 2011; for further discussion of approaches to proof validation). Of course, errors can be subtle: Contemporary mathematics contains many invalid arguments that were believed to be proofs because the errors in these arguments were difficult to detect (Devlin, 2003, described several such cases in detail). Consequently, any individual validator knows that they might fail to detect such a problem. Another type of problem is a serious gap. 
Gaps are not always problematic: In general, they are a natural feature of the proofs typically found in textbooks and research papers. Such proofs differ from what Rav (1999) described as derivations, the latter being formal sequences of formulae where each element in the sequence is either an axiom or follows from an axiom by an accepted rule of inference. Because the vast majority of proofs are not derivations, not every statement in a proof follows directly from the earlier statements; often a reader must bridge gaps by constructing subproofs. Consequently, when a validator is reading a proof, he or she must decide whether any gaps that he or she finds are sufficiently serious or large to warrant rejecting the proof as invalid. Again, any individual validator knows that he or she might fail to make an appropriate judgment about such a gap. For the mathematical community as a whole, the potential existence of errors and gaps in proofs is not unduly disruptive. Even with a negative characterization of validity judgments in mind, very high levels of confidence in a theorem and its proof can arise in time, for two reasons. First, if many mathematicians study a proof, we gain confidence that if an error existed, it would be discovered. Second, as Dawson (2006) argued, very high levels of confidence may also be gained through the generation of different proofs of the same theorem, which in a sense may serve as independent verifications of one another. For individual validators, however, the potential existence of errors and gaps has consequences in terms of the balance of confidence with which validity judgments can be made. If a gap or a problematic statement or method is located, the validator can be confident that the proof (as written) is not correct. If, however, no such gap, statement, or method is found, the validator cannot with absolute confidence conclude that none exists: A problem might simply have eluded detection. 
Why then would a validator ever conclude that a proof is valid? We suggest that this happens when, after a validation attempt conducted with (what they perceive to be) due care and attention, the validator has failed to reject it as invalid. Thus, the positive and negative characterizations described here differ in the way they characterize the validator's goals. In one case the goal is to find a problem; in the other it is to reconstruct the flow of inferences. Based on the assumption that validators will be more confident about their judgment (and will find it easier to justify) when it is based on reaching their goal than when it is based on failing to reach their goal, our negative characterization, therefore, leads to the following two predictions about validator behavior:

1. When validating a purported proof, those who regard it as invalid will be more confident in their judgment than those who regard it as valid (because they have found a problem, rather than merely having failed to find one).
2. It will be easier for validators to justify their response if they have rejected the proof as invalid rather than accepting it as valid (because those who rate it valid have nothing to say beyond that they have failed to find a problem).

Note that a positive characterization of validity would make the opposite predictions. If validators successfully reconstruct the flow of inferences from premises to conclusions, they would confidently accept the proof as being valid, so we would expect those mathematicians who rated the proof as “valid” would be more confident in their judgments than those who rated the proof “invalid” (as they had successfully achieved their aim of reconstructing the flow of inferences).
Similarly, under a positive characterization of proof, those mathematicians who rated the proof invalid would have nothing to offer in justification beyond reporting that they failed to reconstruct the proof, whereas those who rated it valid would be able to elaborate upon their reconstruction and report that it had occurred successfully. We therefore believe that predictions 1 and 2 offer a method of empirically distinguishing between positive and negative characterizations of proof validation, and one of the goals of our study was to test these accounts.

3. Study design in relation to prior work on validity judgments The work reported here builds on two psychological studies (Inglis & Alcock, 2012; Weber, 2008) that we conducted in response to Selden and Selden's (2003) research on the cognitive processes of undergraduate mathematics majors engaged in proof validation. In their study, Selden and Selden (both published research mathematicians) asked students to evaluate an argument they labeled “the real thing,” which they judged to be a fully valid proof, and another they labeled “the gap,” which they evaluated as invalid. Weber (2008) gave these and other proofs to eight mathematicians and asked them to determine whether they were valid. One declared “the real thing” invalid, two judged “the gap” to be valid, and another said that such a judgment about “the gap” was impossible without context. In a separate study, Inglis and Alcock (2012) found similar results: 5 of 12 mathematicians judged “the real thing” invalid and 5 of 12 mathematicians judged “the gap” to be valid. Selden and Selden's proofs were short and simple, so we were surprised to see this level of disagreement among other mathematicians. However, these findings gave only limited information about actual mathematical practice since they were based on variants of student‐produced proofs and were therefore somewhat awkwardly written. This drawback was addressed to some extent by the inclusion in Inglis and Alcock's (2012) study of the following purported proof:

Theorem Proof We know that Rearranging the constant of integration gives Set and take the limit as k → −1 as follows. Let m = k + 1, and rearrange Set by properties of e. As we have , so . In other words, as , so as . So

This proof is more substantial than those used by Selden and Selden, and it is written in a standard format and style.
Nonetheless, it involves relatively straightforward undergraduate mathematics, and the results were similar: 6 of the 12 mathematicians who participated in Inglis and Alcock's study judged the proof valid and 6 judged it invalid. These studies all had small sample sizes because they were all designed to focus on the methods by which mathematicians and undergraduates validate purported proofs, not their final judgments. To investigate whether these apparent differences in validity judgments were genuine, we thus asked a large number of research‐active mathematicians to judge the validity of this proof. This article reports the results.
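The steps named in the purported proof (rearranging the constant of integration, substituting m = k + 1, and appealing to properties of e) can be written out as a short sketch. What follows is one reading consistent with those steps, not a quotation of the proof as shown to participants. The theorem statement, the symbols y, n, and c', and the limit k → −1 are assumptions introduced here for illustration.

```latex
% Assumed theorem: \int x^{-1}\,dx = \ln x + c.
\begin{align*}
\int x^{k}\,dx &= \frac{x^{k+1}}{k+1} + c \qquad (k \neq -1) \\
               &= \frac{x^{k+1} - 1}{k+1} + c', \qquad c' = c + \frac{1}{k+1}
                  && \text{(rearranging the constant; note that $c'$ depends on $k$)} \\[4pt]
y &= \frac{x^{m} - 1}{m}, \quad m = k + 1
                  && \Longrightarrow \quad x^{m} = 1 + my \\
\frac{1}{n} &= my
                  && \Longrightarrow \quad x = \Bigl(1 + \tfrac{1}{n}\Bigr)^{1/m}
                     = \Bigl[\Bigl(1 + \tfrac{1}{n}\Bigr)^{n}\Bigr]^{y} \\
m \to 0 &\;\Longrightarrow\; n \to \infty
                  && \Longrightarrow \quad \Bigl(1 + \tfrac{1}{n}\Bigr)^{n} \to e,
                     \quad\text{so}\quad y \to \ln x \\[4pt]
\int x^{-1}\,dx &= \lim_{k \to -1} \int x^{k}\,dx = \ln x + c'
                  && \text{(limit and integral exchanged without justification)}
\end{align*}
```

Written this way, the two steps that draw criticism are easy to locate: the constant c' absorbs a term that depends on k, and the final line passes a limit through the integral sign.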

4. Method Given the general difficulty of obtaining large samples of research‐active mathematicians, we decided to maximize our sample size by collecting our data through the internet. Web‐based research methods present some practical difficulties, such as the possibility of participants submitting multiple responses. However, by taking certain precautions (outlined by Reips, 2000), these methods have been found to produce results that are consistent with those found by traditional experimental methods (Gosling, Vazire, Srivastava, & John, 2004; Krantz & Dalal, 2000). We followed the strategies employed by Inglis and Mejia‐Ramos (2009a,b) to conduct internet studies in mathematics education research. 4.1. Participants Participants were 109 research‐active mathematicians (56 PhD students and 53 academic staff) associated with Australian and Canadian universities. They were recruited through an email sent via their departmental secretary. Those mathematicians who chose to take part in the study clicked a link contained in the email, which directed them to the study website. 4.2. Procedure The study website consisted of six pages of information and questions which participants moved through at their own pace. Page 1 contained background to the study and reminded participants that they should only continue if they were a research‐active mathematician. Page 2 asked participants to provide demographic information about themselves: their status (PhD student or academic staff), number of years of experience in teaching undergraduates, broad research area (applied mathematics, pure mathematics, or statistics), and their specific research area (AMS subject classification). Page 3 gave the following instruction, which contextualized the task: “Below is a proof of the type that might be submitted to a recreational mathematics journal such as The Mathematical Gazette. 
Please read the proof and decide whether you think it is valid.” Participants were then presented with the theorem and its purported proof, as given earlier. After reading the proof, participants were asked two questions: “Do you think the proof is valid or invalid?” and “How certain are you that your answer is correct?” They responded to the first by selecting either “valid” or “invalid,” and to the second via a five‐point Likert scale (from “1: It was a complete guess” to “5: I am completely certain”). Finally, they were given the opportunity of explaining their answer via a free response text box. Page 4 asked participants to estimate what percentage of mathematicians would agree with their judgment about the validity of the purported proof (they responded by selecting 0–9%, 10–19%, etc.). They were also asked to suggest reasons another mathematician might have for disagreeing with their judgment. Page 5 presented participants with an explicit objection to the proof that had been given by a mathematician in a pilot study:

“They do not say anything about the type of convergence they are dealing with. Therefore, in the last line they seem to assume that the limit and the integral commute, which is false in general.”

The original proof was presented alongside the objection, and participants were asked to state whether this objection was “reasonable” (responding “yes,” “no,” or “unsure”), and whether the objection (on its own) was “enough to render the proof invalid” (again responding “yes,” “no,” or “unsure”). Finally, they were given the opportunity to explain their response via a free text box. Page 6 thanked participants for their time and gave information about how to receive information about the purpose of the experiment.

5. Results and discussion

5.1. Overview: Disagreement and results by disciplinary area Of the 109 participants who completed the survey, 29 (27%) judged the argument valid, and 80 (73%) judged it invalid. There was no significant difference between the responses of academic staff and those of doctoral students, p = .522 (throughout this paper, if no test statistic is reported, as here, this indicates that the significance level was calculated using a 2 × 2 Fisher's Exact Test); consequently, we do not distinguish between these groups in the analyses that follow. Participants’ responses were related to their research area, with applied mathematicians more likely than pure mathematicians to judge the argument valid, p = .002. These data are shown in Table 1. The association between research area and validity judgment retained significance when doctoral students were removed from the analysis, p = .011.

Table 1. Participants' responses to the proof, by research area

Research Area          Number of Valid Ratings   Number of Invalid Ratings
Applied mathematics    12                        12
Pure mathematics       13                        64

5.2. Predictions based on the negative characterization of validity judgments As predicted by our negative characterization of mathematical validity judgments, those who judged the argument invalid had higher confidence in their responses, M = 4.16, than those who rated it valid, M = 3.41, U = 658, p < .001. Also as predicted, participants who judged the argument invalid seemed to find it easier to justify their responses: They were more likely to leave a comment explaining their answer than were those who rated the argument valid, 64% versus 35%, p = .011.

Comments left by those who judged the proof invalid fell into three main categories. Some participants complained about the interchange of the limit and the integral in the last line of the proof (thus flagging the same problem as the pilot study participant whose comment we used). For example, one wrote the following:

“Even if all the statements in a “proof” are correct, it is not a correct proof unless there is a justification of the transition from each to the next. What is the justification for taking the limit under the integral sign?”

Another wrote:

“The issue is whether it is valid to exchange integration with taking the limit ( ). Since the limit is a priori pointwise, one must use a convergence theorem here (e.g., restrict to a compact interval away from zero and use the dominated convergence theorem, or show that the limit is uniform). It seems that this detail can be corrected, but it would be imprudent to call the proof valid without explaining this point.”

A second broad category of comments expressed concern about the manipulation of the constant of integration. One participant wrote the following:

“The line ‘rearranging the constant of integration’ is invalid, since now depends on k, but previously it did not. I did not read further.”

The final category of comments was complaints about a lack of clarity regarding what definitions the author was using:

“The proof does not make any sense without defining what “ln” and “e” mean. Defining ln(x) by means of the integral of 1/x and exp to be the inverse function of ln is one of the standard ways; if this proof were showing something, it would be that this definition is equal to some other definition—but which one?”

We note that all of these comments have a claim to be valid criticisms of the proof, meaning that the 27% of participants who judged the argument valid (among them a disproportionate number of applied mathematicians) must have either not noticed these problems or not considered them sufficient to reject the proof as invalid. We consider various hypotheses to account for these results in the next section.

5.3. Agreement estimates Both those who judged the proof valid and those who judged it invalid believed that theirs would be the majority view.
The overall mean estimate of the percentage of mathematicians who would agree with participants’ judgments was 75.0%. The mean agreement estimate for those who believed the proof was valid was 64.7%, significantly higher than 50%, t(28) = 4.59, p < .001. The equivalent figure for those who believed the proof was invalid was 78.7%, again significantly higher than 50%, t(79) = 13.28, p < .001. However, these figures indicate that participants did not all believe that the majority judgment would be overwhelming. Allowing that (say) one quarter of one's colleagues might disagree seems to indicate a recognition that one might have missed a problem or that others might evaluate gaps as more or less serious. Indeed, relatively few participants thought that there would be uniform agreement: Only 30% of participants believed that over 90% of mathematicians would agree with them. This is perhaps surprising given the suggestion that mathematicians exhibit near uniform agreement about validity that has been expressed by Azzouni (2004), McKnight et al. (2000), and others.

5.4. Willingness to change a judgment All the results presented so far, however, leave open the question of whether the participants who judged the proof valid did not find any problems, or whether they found problems but judged them insufficient to render the proof invalid. Thus, we do not yet know why pure and applied mathematicians differed in their evaluations of the proof. The hypothesis that we wish to advance is that pure and applied mathematicians use different standards when deciding whether a problem in a proof is sufficient to render the proof invalid: that they evaluate potential problems differently. However, there are alternative hypotheses. Perhaps applied mathematicians, who in their practice may be more concerned with computation than deduction, are less adept at seeking logical errors.
Or perhaps they read the proof less carefully than the pure mathematicians and consequently were more likely to overlook the problems cited by others. These accounts attribute the disagreement among participants to performance errors on the part of those who rated the proof valid. Alternatively, perhaps applied mathematicians simply look for different sorts of problems when validating a proof, so did not spot the particular problems raised by our proof. All of these alternative hypotheses suggest that instead of noticing a problem with this proof but not judging it sufficiently serious, those mathematicians who rated the proof valid simply did not notice any problems at all. To investigate this issue, we turn to participants’ responses to the latter section of the instrument, where they were asked to read a specific objection given by a participant in the pilot study. This objection, given earlier, related to how the author of the proof commuted the limit and integral in the last line. Recall that participants were asked two questions about the presented objection: whether it was “reasonable,” and whether, on its own, it was enough to render the proof invalid. A substantial majority of all participants believed the objection was reasonable (82% said it was, with a further 9% saying they were not sure), and these reasonableness judgments were not significantly related to their original validity ratings, p = .113. Crucially, however, participants’ responses to the second question were related to their original validity ratings, p < .001. A majority (65%) of those who had originally claimed the proof was invalid said that this objection was, on its own, enough to render the proof invalid. In contrast only one participant who had originally rated the proof valid (3%) believed that this objection was sufficient. These data are shown in Table 2. Table 2. 
Participants’ responses to the question “Do you think this objection (on its own) is enough to render the proof invalid?”

Original Rating   Yes   No   Unsure
Valid             1     24   4
Invalid           52    19   9

This result allows us to address the above hypotheses. If those participants who judged the proof valid simply had not noticed that the exchange of limit and integral was problematic, then we would have expected them to change their minds when this was explicitly pointed out to them. In fact, only one participant did so. These data provide strong support for our claim that the reason for the disagreement about the validity of the original proof was not that a subset of participants failed to notice problems that others spotted. Rather, we suggest that some participants, disproportionately applied mathematicians, applied different standards when deciding whether a potential problem with the proof was sufficient to render it invalid.
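The 2 × 2 Fisher's Exact Test used throughout this section can be recomputed from the reported counts. The sketch below is illustrative only: the function name `fisher_exact_two_sided` is hypothetical, the two-sided rule (summing every table no more likely than the observed one) is one common convention and may not be the one the analysis software used, and merging the “No” and “Unsure” columns of Table 2 into a single column is an assumption made here to obtain a 2 × 2 table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is
    no more likely than the observed one (one common two-sided convention).
    """
    row1, row2 = a + b, c + d
    col1 = a + c

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Integer weights keep the "no more likely than observed" comparison exact.
    extreme = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    return extreme / comb(row1 + row2, col1)


# Table 1: rows are research areas (applied, pure), columns are (valid, invalid).
p_area = fisher_exact_two_sided(12, 12, 13, 64)

# Table 2 with "No" and "Unsure" merged (an assumption made for this sketch):
# rows are original ratings (valid, invalid), columns are (yes, not yes).
p_objection = fisher_exact_two_sided(1, 24 + 4, 52, 19 + 9)

# Share of each group answering "Yes" in Table 2.
share_valid_yes = 1 / (1 + 24 + 4)
share_invalid_yes = 52 / (52 + 19 + 9)

print(f"Table 1: p = {p_area:.3f}")
print(f"Table 2 (collapsed): p = {p_objection:.2g}")
print(f"Yes among 'valid' raters: {share_valid_yes:.0%}")
print(f"Yes among 'invalid' raters: {share_invalid_yes:.0%}")
```

Run as a script, the Table 1 value comes out at about .002, in line with the figure reported in Section 5.1, and the two shares match the 3% and 65% quoted for Table 2.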

6. General discussion The results of this study provide empirical support for the claim that there is not universal agreement among mathematicians regarding what constitutes a valid proof, in the context of a submission to a journal such as The Mathematical Gazette. Our findings suggest that pure and applied mathematicians adopt different standards in judging proof validity; in particular, that they apply different standards when judging whether a potential problem in a proof is sufficient to render it invalid. This claim is true even in the domain of relatively elementary mathematics, such as the content of undergraduate calculus. Our study thus has implications for at least three academic audiences. First, mathematicians are usually aware that students struggle to engage in formal proof‐based mathematics. They are also usually aware that they as individuals might require different standards of proof from students at different levels: a beginning student might be required to spell out every step in a way that a more advanced student is not. However, mathematicians might not be aware of the extent to which students could be grappling with contradictory messages; our study indicates that even within central content areas and with a requirement to judge in a particular way, mathematicians disagree about what constitutes validity. They may therefore be giving students inconsistent feedback on what features a proof needs to have to be considered valid. Second, because of these different judgments, researchers conducting psychological studies on proof should take care in their interpretations, particularly when classifying specific proofs as “obviously” valid or invalid. Our studies cast doubt on the validity status of two proofs used by Selden and Selden (2003) (Inglis & Alcock, 2012; Weber, 2008). Judgments about the validity of proofs, even at the undergraduate level, appear to be far more nuanced than is commonly thought.
Our interpretations of research on student understanding of such proofs should perhaps be similarly nuanced. Finally, our results allow us to address the widely (albeit, not universally) held philosophical assumption that there is a remarkably high degree of agreement among mathematicians regarding whether an argument constitutes a proof (Azzouni, 2004; McKnight et al., 2000). Our results indicate that this assumption may not be correct. Instead, they indicate heterogeneity in mathematical practice that is more in line with Rav's (2007) claim that mathematical practice is “pluralistic.” We thus argue that philosophers should be cautious in any attempts to explain the high level of agreement among mathematicians. Such agreement might simply not exist.

Acknowledgments This study was partially funded by a Royal Society Worshipful Company of Actuaries Research Fellowship (to M.I.). We are grateful to Chris Sangwin for bringing to our attention the purported proof used in this study, and to three anonymous reviewers for insightful comments on an earlier draft of the manuscript.