First published Mon Mar 2, 2015; substantive revision Tue Apr 28, 2015

Perhaps the best way to get a feel for formal epistemology is to look at concrete examples. We’ll take a few classic epistemological questions and look at popular formal approaches to them, to see what formal tools bring to the table. We’ll also look at some applications of these formal methods outside epistemology.

The tools formal epistemologists apply to these questions also share much history and interest with other fields, both inside and outside philosophy. So formal epistemologists often ask questions that aren’t part of the usual epistemological core, questions about decision-making (§5.1) or the meaning of hypothetical language (§5.3), for example.

The questions that drive formal epistemology are often the same as those that drive “informal” epistemology. What is knowledge, and how is it different from mere opinion? What separates science from pseudoscience? When is a belief justified? What justifies my belief that the sun will rise tomorrow, or that the external world is real and not an illusion induced by Descartes’ demon?

Formal epistemology explores knowledge and reasoning using “formal” tools, tools from math and logic. For example, a formal epistemologist might use probability theory to explain how scientific reasoning works. Or she might use modal logic to defend a particular theory of knowledge.

How does scientific reasoning work? In the early 20th century, large swaths of mathematics were successfully reconstructed using first-order logic. Many philosophers sought a similar systematization of the reasoning in empirical sciences, like biology, psychology, and physics. Though empirical sciences rely heavily on non-deductive reasoning, the tools of deductive logic still offer a promising starting point.

Consider a hypothesis like All electrons have negative charge, which in first-order logic is rendered \(\forall x (Ex \supset Nx)\). Having identified some object \(a\) as an electron, this hypothesis deductively entails a prediction, \(Na\), that \(a\) has negative charge:

\[ \begin{array}{l} \forall x (Ex \supset Nx)\\ Ea\\ \hline Na \end{array} \]

If we test this prediction and observe that, indeed, \(Na\), this would seem to support the hypothesis.

Scientific hypothesis-testing thus appears to work something like “deduction in reverse” (Goodman 1954). If we swap the hypothesis and the predicted datum in the above deduction, we get an example of confirmation:

\[ \begin{array}{l} Ea\\ Na\\ \overline{\overline{\forall x (Ex \supset Nx)}} \end{array} \]

Here the double-line represents non-deductive inference. The inference is very weak in this case, since the hypothesis has only been verified in one instance, \(a\). But as we add further instances \(b\), \(c\), etc., it becomes stronger (provided we discover no counter-instances, of course).

These observations suggest a proposal due to Nicod (1930) and famously examined by Hempel (1945):

Nicod’s Criterion

A universal generalization is confirmed by its positive instances (as long as no counter-instances are discovered): \(\forall x(Fx \supset Gx)\) is confirmed by \(Fa \wedge Ga\), by \(Fb \wedge Gb\), etc.

The general idea is that hypotheses are confirmed when their predictions are borne out. To capture this idea formally in deductive logic, we’re equating prediction with logical entailment. When an object is \(F\), the hypothesis \(\forall x(Fx \supset Gx)\) entails/predicts that the object is \(G\). So any discovery of an object that is both \(F\) and \(G\) confirms the hypothesis.

One classic challenge for Nicod’s criterion is the notorious raven paradox. Suppose we want to test the hypothesis that all ravens are black, which we formalize \(\forall x(Rx \supset Bx)\). That’s logically equivalent to \(\forall x(\neg Bx \supset \neg Rx)\), by contraposition. And Nicod’s Criterion says this latter hypothesis is confirmed by the discovery of any object that is not black and not a raven—a red shirt, for example, or a pair of blue underpants (Hempel 1937, 1945). But walking the halls of my department noting non-black non-ravens hardly seems a reasonable way to verify that all ravens are black. How can “indoor ornithology” (Goodman 1954) be good science?

A second, more general challenge for the prediction-as-deduction approach is posed by statistical hypotheses. Suppose we want to test the theory that only 50% of ravens are black. This hypothesis entails nothing about the color of an individual raven; it might be one of the black ones, it might not. In fact, even a very large survey of ravens, all of which turn out to be black, does not contradict this hypothesis. It’s always possible that the 50% of ravens that aren’t black weren’t caught up in the survey. (Maybe non-black ravens are exceptionally skilled at evasion.)

This challenge suggests some important lessons. First, we need a laxer notion of prediction than deductive entailment. The 50% hypothesis may not entail that a large survey of ravens will have some non-black ravens, but it does suggest this prediction pretty strongly. Second, as a sort of corollary, confirmation is quantitative: it comes in degrees. A single, black raven doesn’t do much to support the hypothesis that 50% of ravens are black, but a large sample of roughly half black, half non-black ravens would. Third and finally, degrees of confirmation should be understood in terms of probability. The 50% hypothesis doesn’t make it very probable that a single raven will be black, but it makes it highly probable that a much larger collection will be roughly half black, half non-black. And the all-black hypothesis predicts that any sample of ravens will be entirely black with 100% probability.
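These contrasts can be checked numerically. The sketch below assumes a hypothetical survey of 100 ravens, modeled as independent draws (the survey size and the independence assumption are ours, purely for illustration):

```python
from math import comb

# Hypothetical survey of n = 100 ravens, modeled as independent draws
# (an illustrative assumption, not part of either hypothesis as stated).
n = 100

def prob_k_black(k, p=0.5):
    # Under the 50% hypothesis, the number of black ravens is binomial.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability the sample is "roughly half black" (40-60 black ravens):
roughly_half = sum(prob_k_black(k) for k in range(40, 61))
print(f"P(40-60 black | 50% hypothesis) = {roughly_half:.3f}")  # high

# Probability the sample is entirely black under the 50% hypothesis:
all_black = prob_k_black(100)
print(f"P(all 100 black | 50% hypothesis) = {all_black:.2e}")   # minuscule

# The all-black hypothesis, by contrast, assigns the all-black
# sample probability 1.
```

So the 50% hypothesis strongly "predicts" a roughly half-black sample without entailing it, while an all-black sample would be astronomically improbable under it.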

A quantitative approach also promises to help resolve the raven paradox. The most popular resolution says that observing a red shirt does confirm that all ravens are black, just by a very minuscule amount. The raven paradox is thus an illusion: we mistake a minuscule amount of confirmation for none at all (Hosiasson-Lindenbaum 1940). But to make this response convincing, we need a proper, quantitative theory of confirmation that explains how a red shirt could be relevant to a hypothesis about ravens, but only just slightly relevant.

Let’s start with the idea that to confirm a hypothesis is to make it more probable. The more a piece of evidence increases the probability of a hypothesis, the more it confirms the hypothesis.

What we need then is a theory of probability. The standard theory begins with a function, \(p\), which takes in a proposition and returns a number, \(x\), the probability of that proposition: \(p(A)=x\). To qualify as a probability function, \(p\) must satisfy three axioms:

1. For any proposition \(A\), \(0 \leq p(A) \leq 1\).[1]
2. For any tautology \(A\), \(p(A)=1\).
3. For any logically incompatible propositions \(A\) and \(B\), \(p(A \vee B) = p(A) + p(B)\).

The first axiom sets the scale of probability, from 0 to 1, which we can think of as running from 0% probability to 100% probability.[2] The second axiom places tautologies at the top of this scale: nothing is more probable than a tautology.[3] And finally, the third axiom tells us how to figure out the probability of a hypothesis by breaking it into parts. For example, the probability that an American country will be the first to develop a cure for Alzheimer’s can be figured by adding the probability that a North American country will be first to the probability that a South American country will be.[4]
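To make the axioms concrete, here is a quick numerical check in a finite model of our own devising (the four "worlds" and their uniform weights are illustrative assumptions, not part of the axioms themselves):

```python
from fractions import Fraction

# A toy model: propositions are sets of "possible worlds", and p(A) is
# the total weight of the worlds where A is true.
worlds = {"w1", "w2", "w3", "w4"}
weight = {w: Fraction(1, 4) for w in worlds}  # uniform, for illustration

def p(A):
    return sum(weight[w] for w in A)

A = {"w1", "w2"}    # some contingent proposition
B = {"w3"}          # logically incompatible with A (no shared worlds)
tautology = worlds  # true in every world

# Axiom 1: probabilities lie between 0 and 1.
assert 0 <= p(A) <= 1
# Axiom 2: tautologies get probability 1.
assert p(tautology) == 1
# Axiom 3: additivity for incompatible propositions.
# (Here '|' is set union, i.e., disjunction of propositions.)
assert p(A | B) == p(A) + p(B)

print(p(A), p(tautology), p(A | B))  # 1/2 1 3/4
```

The third assertion mirrors the Alzheimer’s example: the probability of the disjunction is the sum of the probabilities of its incompatible disjuncts.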

What about conditional probabilities, like the probability of doing well in your next philosophy class given that you’ve done well in previous ones? So far we’ve only formalized the notion of absolute probability, \(p(A)=x\). Let’s introduce conditional probability by definition:

Definition. The conditional probability of \(B\) given \(A\) is written \(p(B\mid A)\), and is defined: \[p(B\mid A) = \frac{p(B \wedge A)}{p(A)}.\]

Why this definition? A helpful heuristic is to think of the probability of \(B\) given \(A\) as something like the portion of the \(A\)-possibilities that are also \(B\)-possibilities. For example, the probability of rolling a high number (4, 5, or 6) on a six-sided die given that the roll is even is 2/3. Why? There are 3 even possibilities (2, 4, 6), so \(p(A) = 3/6\). Of those 3 possibilities, 2 are also high numbers (4, 6), so \(p(B \wedge A) = 2/6\). Thus \[p(B\mid A) = \frac{p(B \wedge A)}{p(A)} = \frac{2/6}{3/6} = 2/3.\] Generalizing this idea, we start with the quantity of \(A\)-possibilities as a sort of baseline by putting \(p(A)\) in the denominator. Then we consider how many of those are also \(B\)-possibilities by putting \(p(B \wedge A)\) in the numerator.

Notice, by the way, that \(p(B\mid A)\) is undefined when \(p(A) = 0\). This might seem fine at first. Why worry about the probability of \(B\) when \(A\) is true if there’s no chance \(A\) is true? In fact there are deep problems lurking here (Hájek manuscript—see Other Internet Resources), though we won’t stop to explore them.

Instead, let’s take advantage of the groundwork we’ve laid to state our formal definition of quantitative confirmation. Our guiding idea is that evidence confirms a hypothesis to the extent that it increases its probability. So we are comparing \(p(H\mid E)\) to \(p(H)\) by looking at the difference between them:

Definition. The degree to which \(E\) confirms \(H\), called the degree of confirmation, is written \(c(H,E)\) and is defined: \[c(H,E) = p(H\mid E) - p(H).\]

When \(c(H,E)\) is negative, \(E\) actually decreases the probability of \(H\), and we say that \(E\) disconfirms \(H\). When \(c(H,E)\) is 0, we say that \(E\) is neutral with respect to \(H\).
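The die example above can be used to check both definitions with exact arithmetic; this is just a restatement of the calculation in the text, with the six equiprobable faces as the possible worlds:

```python
from fractions import Fraction

# Possible worlds are the six faces of a fair die, each equally probable.
worlds = {1, 2, 3, 4, 5, 6}

def p(A):
    return Fraction(len(A), len(worlds))

def p_given(B, A):
    # Conditional probability: p(B | A) = p(B & A) / p(A)
    return p(B & A) / p(A)

def c(H, E):
    # Degree of confirmation: c(H, E) = p(H | E) - p(H)
    return p_given(H, E) - p(H)

even = {2, 4, 6}
high = {4, 5, 6}

print(p_given(high, even))  # 2/3, as computed in the text
print(c(high, even))        # 2/3 - 1/2 = 1/6: evenness confirms "high"
```

Since \(c > 0\) here, learning that the roll is even confirms the hypothesis that it is high.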

Minimal as they are, these simple axioms and definitions are enough to derive many interesting claims about probability and confirmation. The following two subsections introduce some elementary, yet promising results. See the technical supplement for proofs.

Let’s start with some elementary theorems that illustrate how probability interacts with deductive logic:

Theorem (No Chance for Contradictions). When \(A\) is a contradiction, \(p(A) = 0\).

Theorem (Complementarity for Contradictories). For any \(A\), \(p(A) = 1 - p(\neg A)\).

Theorem (Equality for Equivalents). When \(A\) and \(B\) are logically equivalent, \(p(A) = p(B)\).

Theorem (Conditional Certainty for Logical Consequences). When \(A\) logically entails \(B\), \(p(B\mid A)=1\).

The next three theorems go a bit deeper, and are useful for building up more interesting results:

Theorem (Conjunction Costs Probability). For any \(A\) and \(B\), \(p(A) > p(A \wedge B)\) unless \(p(A \wedge \neg B)=0\), in which case \(p(A) = p(A \wedge B)\).

One way of thinking about what Conjunction Costs Probability says is that the stronger a statement is, the greater the risk of falsehood. If we strengthen \(A\) by adding \(B\) to it, the resulting, stronger statement is less probable. Unless, that is, there was no chance of \(A\) being true without \(B\) to begin with. In that case, adding \(B\) to \(A\) doesn’t change the risk of falsehood, because there was no chance of \(A\) being true without \(B\) anyway.

Theorem (The Conjunction Rule). For any \(A\) and \(B\) such that \(p(B) \neq 0\), \(p(A \wedge B) = p(A\mid B)p(B)\).

This says we can calculate how likely two statements \(A\) and \(B\) are to be true together by temporarily taking \(B\) for granted, assessing the probability of \(A\) in that light, and then giving the result as much weight as \(B\)’s probability on its own merits.

Theorem (The Law of Total Probability). For any \(A\), and any \(B\) whose probability is neither \(0\) nor 1: \[p(A) = p(A\mid B)p(B) + p(A\mid \neg B)p(\neg B).\]

The Law of Total Probability basically says that we can calculate the probability of \(A\) by breaking it down into two possible cases: \(B\) and \(\neg B\). We consider how likely \(A\) is if \(B\) is true and how likely it is if \(B\) is false. We then give each case appropriate “weight”, by multiplying it against the probability that it holds, then adding together the results. For this to work, \(p(A\mid B)\) and \(p(A\mid \neg B)\) have to be well-defined, so \(p(B)\) can’t be 0 or 1.

This classic theorem, known as Bayes’ theorem, relates a conditional probability \(p(H\mid E)\) to the unconditional probability, \(p(H)\): \[ p(H\mid E) = p(H)\frac{p(E\mid H)}{p(E)}\]

The theorem is philosophically important, as we’ll see in a moment. But it’s also useful as a tool for calculating \(p(H\mid E)\), because the three terms on the right hand side can often be inferred from available statistics.

Consider, for example, whether a student at University X having high grades (\(E\)) says anything about the likelihood of her taking a class in philosophy (\(H\)). The registrar tells us that 35% of students take a philosophy class at some point, so \(p(H) = 35/100\). They also tell us that only 20% of students campus-wide have high grades (defined as a GPA of 3.5 or above), so \(p(E) = 20/100\). But they don’t keep track of any more detailed information. Luckily, the philosophy department can tell us that 25% of students who take their classes have high grades, so \(p(E\mid H) = 25/100\). That’s everything we need to apply Bayes’ theorem: \[\begin{split} p(H\mid E) &= p(H)\frac{p(E\mid H)}{p(E)}\\ &= 35/100 \times \frac{25/100}{20/100}\\ &= 7/16\end{split}\]
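The arithmetic can be double-checked with exact fractions, using only the three statistics given in the example:

```python
from fractions import Fraction

# The University X example, with the registrar's statistics as fractions.
p_H = Fraction(35, 100)          # p(H): takes a philosophy class
p_E = Fraction(20, 100)          # p(E): has high grades (GPA >= 3.5)
p_E_given_H = Fraction(25, 100)  # p(E|H): high grades among philosophy students

# Bayes' theorem: p(H|E) = p(H) * p(E|H) / p(E)
p_H_given_E = p_H * p_E_given_H / p_E
print(p_H_given_E)               # 7/16

# The degree of confirmation c(H,E) = p(H|E) - p(H) comes out positive:
print(p_H_given_E - p_H)         # 7/80
```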

That’s higher than \(p(H) = 35/100\), so we can also see that a student’s having high grades confirms the hypothesis that she will take a philosophy class.

What’s the philosophical significance of Bayes’ theorem? It unifies a number of influential ideas about confirmation and scientific methodology, binding them together in a single, simple equation. Let’s see how.

Theoretical Fit. It’s a truism that the better a theory fits the evidence, the more the evidence supports it. But what does it mean for a theory to fit the evidence? When \(H\) entails \(E\), the theory says the evidence must be true, so the discovery of the evidence fits the theory perfectly. Our formalism vindicates the truism in this special case as follows. When \(H\) entails \(E\), Conditional Certainty for Logical Consequences tells us that \(p(E\mid H)=1\), so Bayes’ theorem becomes: \[p(H\mid E) = p(H)\frac{1}{p(E)}\] Provided \(p(E)\) is less than 1, this amounts to multiplying \(p(H)\) by a ratio greater than 1, which means \(p(H\mid E)\) comes out larger than \(p(H)\). Moreover, since 1 is the greatest quantity that can appear in the numerator, the case where \(H\) entails \(E\) and thus \(p(E\mid H)=1\) gives the greatest possible boost to the probability of \(H\). In other words, confirmation is greatest when the theory fits the evidence as well as possible.

(What if \(p(E) = 1\), though? Then \(H\) may fit \(E\), but so may \(\neg H\). If \(p(E)=1\), we can prove that \(p(E\mid H)=1\) and \(p(E\mid \neg H)=1\) (hint: combine The Law of Total Probability with Complementarity for Contradictories). In other words, \(E\) fits both \(H\) and its negation perfectly. So it shouldn’t be able to discriminate between these two hypotheses. And, indeed, in this case \(p(H\mid E)\) comes out the same as \(p(H)\), so \(c(H,E)=0\).)

What about when the theory fits the evidence less than perfectly? If we think of fit as the certainty with which \(H\) predicts \(E\), \(p(E\mid H)\), then the previous analysis generalizes nicely. Suppose \(H\) predicts \(E\) strongly, but not with absolute certainty: \(p(E\mid H) = 1 - \varepsilon\), for some small number \(\varepsilon\). Applying Bayes’ theorem again, we have: \[ p(H\mid E) = p(H)\frac{1-\varepsilon}{p(E)}\] This again amounts to multiplying \(p(H)\) by a ratio larger than 1, provided \(p(E)\) isn’t close to 1. So \(p(H\mid E)\) will come out larger than \(p(H)\). Of course, the larger \(\varepsilon\) gets, the weaker the confirmation becomes, befitting the weakness with which \(H\) then predicts \(E\).

Novel Prediction. Another truism is that novel predictions count more. When a theory predicts something we wouldn’t otherwise expect, it’s confirmed especially strongly if the prediction is borne out. For example, Poisson derided the theory that light is a wave because it predicted a bright spot should appear at the center of certain shadows. No one had previously observed such bright spots, making it a novel prediction. When the presence of these bright spots was then verified, it was a boon for the wave theory. Once again, our formalization vindicates the truism. Suppose as before that \(H\) predicts \(E\) and thus \(p(E\mid H) = 1\), or nearly so. A novel prediction is one where \(p(E)\) is low, or at least not very high. It’s a prediction one wouldn’t expect. Our previous analysis exposed that, in such circumstances, we multiply \(p(H)\) by a large ratio in Bayes’ theorem. Thus \(p(H\mid E)\) comes out significantly larger than \(p(H)\), making \(c(H,E)\) large. So novel predictions turn out especially confirmatory.

Prior Plausibility. A final truism: new evidence for a theory has to be weighed against the theory’s prior plausibility. Maybe the theory is inherently implausible, being convoluted or metaphysically fraught. Or maybe the theory had become implausible because it clashed with earlier evidence. Or maybe the theory was already pretty plausible, being elegant and fitting well with previous evidence. In any case, the new evidence has to be evaluated in light of these prior considerations. Once again, Bayes’ theorem vindicates this truism. \(p(H\mid E)\) is calculated by multiplying \(p(H)\) by the factor \(p(E\mid H)/p(E)\). We can think of the factor \(p(E\mid H)/p(E)\) as capturing the extent to which the evidence counts for \(H\) (or against it, if \(p(E\mid H)/p(E)\) is less than 1), which we then multiply against the previous probability of \(H\), \(p(H)\), in order to obtain \(H\)’s new, all-things-considered plausibility. If \(H\) was already implausible, \(p(H)\) will be low and the result of this multiplication will be smaller than it would be if \(H\) had already been plausible, and \(p(H)\) had thus been high.

Let’s pause to summarize. Bayes’ theorem isn’t just a useful calculational tool. It also vindicates three truisms about confirmation, unifying them in a single equation. Each truism corresponds to a term in Bayes’ theorem:

\(p(E\mid H)\) corresponds to theoretical fit. The better the hypothesis fits the evidence, the greater this quantity will be. Since this term appears in the numerator in Bayes’ theorem, better fit means a larger value for \(p(H\mid E)\).

\(p(E)\) corresponds to predictive novelty, or rather the lack of it. The more novel the prediction is, the less we expect \(E\) to be true, and thus the smaller \(p(E)\) is. Since this term appears in the denominator of Bayes’ theorem, more novelty means a larger value for \(p(H\mid E)\).

\(p(H)\) corresponds to prior plausibility. The more plausible \(H\) is before the discovery of \(E\), the greater this quantity will be, and thus the greater \(p(H\mid E)\) will be.

But what about the raven paradox?

Recall the raven paradox: the hypothesis that all ravens are black is logically equivalent to the hypothesis that all non-black things are non-ravens. So the latter would seem to be confirmed with each discovery of a non-black non-raven: red shirts, blue underpants, etc. Yet examining the contents of your neighbor’s clothesline doesn’t seem a good way to research an ornithological hypothesis. (Nor does it seem a good way to treat your neighbor.)

The classic, quantitative solution originates with Hosiasson-Lindenbaum (1940). It holds that the discovery of blue underpants does confirm the hypothesis that all ravens are black, just by so little that we overlook it. How could blue underpants be relevant to the hypothesis that all ravens are black? Informally, the idea is that an object which turns out to be a blue pair of underpants could instead have turned out to be a white raven. When it turns out not to be such a counterexample, our hypothesis passes a weak sort of test. Does our formal theory of confirmation vindicate this informal line of thinking? The answer is, “yes, but…”.

The ‘but…’ will prove crucial to the fate of Nicod’s Criterion (spoiler: outlook not good). But let’s start with the ‘yes’.

We vindicate the ‘yes’ with a theorem: discovering an object to be a non-raven that isn’t black, \(\neg R \wedge \neg B\), just slightly boosts the probability of the hypothesis that all ravens are black, \(H\), if we make certain assumptions. Here is the theorem (see the technical supplement for a proof):

Theorem (Raven Theorem). If (i) \(p(\neg R \mid \neg B)\) is very high and (ii) \(p(\neg B\mid H)=p(\neg B)\), then \(p(H\mid \neg R \wedge \neg B)\) is just slightly larger than \(p(H)\).

The first assumption, that \(p(\neg R \mid \neg B)\) is very high, seems pretty sensible. With all the non-ravens in the world, the probability that a given object will be a non-raven is quite high, especially if it’s not black. The second assumption is that \(p(\neg B\mid H)=p(\neg B)\). In other words, assuming that all ravens are black doesn’t change the probability that a given object will not be black. This assumption is more controversial (Vranas 2004). If all the ravens are black, then some of the things that might not have been black are, namely the ravens. In that case shouldn’t \(p(\neg B\mid H) < p(\neg B)\) instead? On the other hand, maybe all the ravens being black doesn’t increase the number of black things in the universe. Maybe it just means that other kinds of things are black slightly less often. Luckily, it turns out we can replace (ii) with less dubious assumptions (Fitelson 2006; Fitelson and Hawthorne 2010; Rinard 2014). But we can’t do without any assumptions at all, which brings us to two crucial points about confirmation and probability.

The first point is that Nicod’s Criterion fails. Assumptions like (i) and (ii) of the Raven Theorem don’t always hold. In fact, in some situations, discovering a black raven would actually lower the probability that all ravens are black. How could this be? The trick is to imagine a situation where the very discovery of a raven is bad news for the hypothesis that all ravens are black. This would happen if the only way for all the ravens to be black is for there to be very few of them. Then stumbling across a raven would suggest that ravens are actually plentiful, in which case they aren’t all black. Good (1967) offers the following, concrete illustration. Suppose there are only two possibilities:

All ravens are black, though there are only \(100\) ravens and a million other things.

There is one non-black raven out of \(1,000\) ravens, and there are a million other things.

In this case, happening upon a raven favors \(\neg H\) because \(\neg H\) makes ravens ten times less exotic. That the raven is black fits slightly better with \(H\), but not enough to outweigh the first effect: black ravens are hardly a rarity on \(\neg H\). This is the ‘but…’ to go with our earlier ‘yes’.
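Good’s scenario can be checked with exact arithmetic. The object counts come from the example above; the equal prior over the two possibilities is an added assumption of ours, and we model the evidence as sampling a random object that turns out to be a black raven:

```python
from fractions import Fraction

# Good's two-world scenario; an equal prior over the two possibilities
# is our illustrative assumption.
p_H = Fraction(1, 2)

# Probability a randomly sampled object is a black raven, in each world:
p_BR_given_H    = Fraction(100, 1_000_100)  # 100 ravens, all black
p_BR_given_notH = Fraction(999, 1_001_000)  # 999 of 1,000 ravens black

posterior_H = (p_H * p_BR_given_H) / (
    p_H * p_BR_given_H + (1 - p_H) * p_BR_given_notH)

print(float(posterior_H))  # well below 1/2: the black raven disconfirms H
```

The posterior drops to roughly a tenth, so in this contrived situation a positive instance disconfirms the generalization, just as Good argued.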

The second point is a far-reaching moral: that the fates of claims about confirmation often turn crucially on what assumptions we make about the values of \(p\). Nicod’s criterion fails in situations like Good’s, where \(p\) assigns a lower value to \(p(R \wedge B\mid H)\) than to \(p(R \wedge B\mid \neg H)\). But in another situation, where things are reversed, Nicod’s Criterion does apply. Likewise, a diagnosis of the raven paradox like the standard one only applies given certain assumptions about \(p\), like assumptions (i) and (ii) of the Raven Theorem. The probability axioms alone generally aren’t enough to tell us when Nicod’s Criterion applies, or when confirmation is small or large, positive or negative.

This last point is a very general, very important phenomenon. Like the axioms of first-order logic, the axioms of probability are quite weak (Howson and Urbach 1993; Christensen 2004). Unless \(H\) is a tautology or contradiction, the axioms only tell us that its probability is somewhere between \(0\) and 1. If we can express \(H\) as a disjunction of two logically incompatible sub-hypotheses, \(H_1\) and \(H_2\), and we know the probabilities of these sub-hypotheses, then the third axiom lets us compute \(p(H) = p(H_1)+p(H_2)\). But this just pushes things back a step, since the axioms by themselves only tell us that \(p(H_1)\) and \(p(H_2)\) must themselves lie between \(0\) and 1.

This weakness of the probability axioms generates the famous problem of the priors, the problem of saying where initial probabilities come from. Are they always based on evidence previously collected? If so, how does scientific inquiry get started? If instead they’re not based on previous evidence but are a priori, what principles govern this a priori reasoning? Formal epistemologists are split on this question. The so-called objectivists see the probability axioms as incomplete, waiting to be supplemented by additional postulates that determine the probabilities with which inquiry should begin. (The Principle of Indifference (PoI) is the leading candidate here. See the entry on the interpretation of probability.) The so-called subjectivists think instead that there is no single, correct probability function \(p\) with which inquiry should begin. Different inquirers may begin with different values for \(p\), and none of them is thereby more or less scientific or rational than the others.

In later sections the problem of the priors will return several times, illustrating its importance and ubiquity.

We’ve seen that formalizing confirmation using probability theory yields an account that succeeds in several significant ways: it vindicates several truisms about confirmation, it unifies those truisms in a single equation, and it resolves a classic paradox (not to mention others we didn’t discuss (Crupi and Tentori 2010)).

We also saw that it raises a problem, though: the problem of the priors, which formal epistemologists are divided on how to resolve. And there are other problems we didn’t explore, most notably the problems of logical omniscience and old evidence (see subsections of the entry on Bayesian epistemology).

These and other problems have led to the exploration and development of other approaches to scientific reasoning, and reasoning in general. Some stick to the probabilistic framework but develop different methodologies within it (Fisher 1925; Neyman and Pearson 1928a,b; Royall 1997; Mayo 1996; Mayo and Spanos 2011; see entry on the philosophy of statistics). Others depart from standard probability theory, like Dempster-Shafer theory (Shafer 1976; see entry on formal representations of belief), a variant of probability theory meant to solve the problem of the priors and make other improvements. Ranking theory (Spohn 1988, 2012; again see entry on formal representations of belief) also bears some resemblance to probability theory but draws much inspiration from possible-world semantics for conditionals (see entry on indicative conditionals). Bootstrapping theory (Glymour 1980; Douven and Meijs 2006) leaves the probabilistic framework behind entirely, drawing inspiration instead from the deduction-based approach we began with. Still other approaches develop non-monotonic logics (see entry), logics for making not only deductive inferences, but also defeasible, inductive inferences (Pollock 1995, 2008; Horty 2012). Formal learning theory provides a framework for studying the long-run consequences of a wide range of methodologies.

For the next two sections we’ll build on the probabilistic approach introduced here, since it’s currently the most popular and influential approach to formal epistemology. But it’s important to remember that there is a rich and variegated range of alternative approaches, and that this one has its problems, some consequences of which we’ll soon encounter.

A lot of our reasoning seems to involve projecting observed patterns onto unobserved instances. For example, suppose I don’t know whether the coin I’m holding is biased or fair. If I flip it 9 times and it lands tails every time, I’ll expect the 10th toss to come up tails too. What justifies this kind of reasoning? Hume famously argued that nothing can justify it. In modern form, Hume’s challenge is essentially this: a justification for such reasoning must appeal to either an inductive argument or a deductive one. Appealing to an inductive argument would be unacceptably circular, while a deductive argument would have to show that unobserved instances will resemble observed ones, which is not a necessary truth, and hence not demonstrable by any valid argument. So no argument can justify projecting observed patterns onto unobserved cases. (Russell and Restall (2010) offer a formal development. Haack (1976) discusses the supposed asymmetry between induction and deduction here.)

Can probability come to the rescue here? What if instead of deducing that unobserved instances will resemble observed ones we just deduce that they’ll probably resemble the observed ones? If we can deduce from the probability axioms that the next toss is likely to come up tails given that it landed tails 9 out of 9 times so far, that would seem to solve Hume’s problem.

Unfortunately, no such deduction is possible: the probability axioms simply don’t entail the conclusion we want. How can that be? Consider all the different sequences of heads (\(\mathsf{H}\)) and tails (\(\mathsf{T}\)) we might get in the course of 10 tosses:

\[\begin{array}{c} \mathsf{HHHHHHHHHH}\\ \mathsf{HHHHHHHHHT}\\ \mathsf{HHHHHHHHTH}\\ \mathsf{HHHHHHHHTT}\\ \vdots\\ \mathsf{TTTTTTTTTH}\\ \mathsf{TTTTTTTTTT}\\ \end{array}\]

There are 1024 possible sequences, so the probability of each possible sequence would seem to be \(1/1024\). Of course, only two of them begin with 9 tails in a row, namely the last two. So, once we’ve narrowed things down to a sequence that begins with 9 out of 9 tails, the probability of tails on the 10th toss is \(1/2\), same as heads. More formally, applying the definition of conditional probability gives us:

\[\begin{align} p(T_{10} \mid T_{1\ldots9}) &= \frac{p(T_{10} \wedge T_{1\ldots9})}{p(T_{1\ldots9})}\\ &= \frac{1/1024}{2/1024}\\ &= \frac{1}{2}\end{align}\]

So it looks like the axioms of probability entail that the first 9 tosses tell us nothing about the 10th toss.

In fact, though, the axioms of probability don’t even entail that—they don’t actually say anything about \(p(T_{10} \mid T_{1\ldots9})\). In the previous paragraph, we assumed that each possible sequence of tosses was equally probable, with \(p(\ldots)=1/1024\) the same for each sequence. But the probability axioms don’t require this “uniform” assignment. As we saw earlier when we encountered the problem of the priors (1.4), the probability axioms only tell us that tautologies have probability 1 (and contradictions probability \(0\)). Contingent propositions can have any probability from \(0\) to 1, and this includes the proposition that the sequence of tosses will be \(\mathsf{HHHHHHHTHT}\), or any other sequence of \(\mathsf{H}\)s and \(\mathsf{T}\)s.

We can exploit this freedom and get more sensible, induction-friendly results if we assign prior probabilities using a different scheme advocated by Carnap (1950). Suppose instead of assigning each possible sequence the same probability, we assign each possible number of \(\mathsf{T}\)s the same probability. We could get anywhere from 0 to 10 \(\mathsf{T}\)s, so each possible number of \(\mathsf{T}\)s has probability 1/11. Now, there’s just one way of getting 0 \(\mathsf{T}\)s:

\[\mathsf{HHHHHHHHHH}\]

So \(p(H_{1\ldots10})=1/11\). But there are 10 ways of getting 1 \(\mathsf{T}\):

\[\begin{array}{c} \mathsf{HHHHHHHHHT}\\ \mathsf{HHHHHHHHTH}\\ \mathsf{HHHHHHHTHH}\\ \vdots\\ \mathsf{THHHHHHHHH}\end{array}\]

So this possibility’s probability of \(1/11\) is divided 10 ways, yielding probability \(1/110\) for each subpossibility, e.g., \(p(\mathsf{HHHHHHHTHH})=1/110\). And then there are 45 ways of getting 2 \(\mathsf{T}\)s:

\[\begin{array}{c} \mathsf{HHHHHHHHTT}\\ \mathsf{HHHHHHHTHT}\\ \mathsf{HHHHHHTHHT}\\ \vdots\\ \mathsf{TTHHHHHHHH}\end{array}\]

So here the probability of \(1/11\) is divided \(45\) ways, yielding a probability of \(1/495\) for each subpossibility, e.g., \(p(\mathsf{HTHHHHHTHH})=1/495\). And so on.

What then becomes of \(p(T_{10} \mid T_{1\ldots9})\)?

\[\begin{align} p(T_{10} \mid T_{1\ldots9}) &= \frac{p(T_{10} \wedge T_{1\ldots9})}{p(T_{1\ldots9})}\\ &= \frac{p(T_{1\ldots10})}{p(T_{1\ldots10} \vee [T_{1\ldots9} \wedge H_{10}])}\\ &= \frac{p(T_{1\ldots10})}{p(T_{1\ldots10}) + p(T_{1\ldots9} \wedge H_{10})}\\ &= \frac{1/11}{1/11 + 1/110}\\ &= \frac{10}{11}\end{align}\]

So we get a much more reasonable result when we assign prior probabilities according to Carnap’s two-stage scheme. However, this scheme is not mandated by the axioms of probability.
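Carnap’s two-stage scheme can be checked the same way. This sketch (my own coding of the scheme described above) gives each possible number of tails probability \(1/11\), splits that mass evenly among the sequences with that many tails, and recovers the \(10/11\) result:

```python
from itertools import product
from math import comb

# Carnap's two-stage prior for 10 tosses: each count of tails k (0..10) gets
# probability 1/11, divided evenly among the C(10, k) sequences with k tails.
prior = {}
for seq in product("HT", repeat=10):
    k = seq.count("T")
    prior[seq] = (1 / 11) / comb(10, k)

p_t9 = sum(p for seq, p in prior.items() if seq[:9] == ("T",) * 9)
p_t10_and_t9 = prior[("T",) * 10]

print(p_t10_and_t9 / p_t9)  # ≈ 0.90909... = 10/11
```

The denominator here is \(1/11 + 1/110 = 1/10\), matching the derivation above.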

One thing this teaches us is that the probability axioms are silent on Hume’s problem. Inductive reasoning is compatible with the axioms, since Carnap’s way of constructing the prior probabilities makes a 10th \(\mathsf{T}\) quite likely given an initial string of \(9\) \(\mathsf{T}\)s. But the axioms are also compatible with skepticism about induction. On the first way of constructing the prior probabilities, a string of \(\mathsf{T}\)s never makes the next toss any more likely to be a \(\mathsf{T}\), no matter how long the string gets! In fact, there are further ways of constructing the prior probabilities that yield “anti-induction”, where the more \(\mathsf{T}\)s we observe, the less likely the next toss is to be a \(\mathsf{T}\).
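Such an “anti-inductive” prior is easy to construct. Here is one hypothetical scheme (my own construction, not drawn from the sources cited): build each sequence’s probability as a product of conditionals in which, after seeing \(k\) tails in \(n\) tosses, the next tail has probability \((n-k+1)/(n+2)\), so observed tails make further tails less likely:

```python
from itertools import product

# A hypothetical anti-inductive prior: the probability of a sequence is the
# product of conditionals where, after k tails in n tosses so far, the next
# toss is tails with probability (n - k + 1) / (n + 2).
def anti_prior(seq):
    p, k = 1.0, 0
    for n, toss in enumerate(seq):
        p_tail = (n - k + 1) / (n + 2)
        p *= p_tail if toss == "T" else 1 - p_tail
        k += toss == "T"
    return p

prior = {seq: anti_prior(seq) for seq in product("HT", repeat=10)}

p_t9 = sum(p for seq, p in prior.items() if seq[:9] == ("T",) * 9)
print(prior[("T",) * 10] / p_t9)  # ≈ 0.0909 = 1/11: nine tails make a tenth tail unlikely
```

Since each conditional is a genuine probability, the resulting assignment satisfies the probability axioms just as well as the uniform or Carnapian ones do.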

We also learn something else though, something more constructive: that Hume’s problem is a close cousin of the problem of the priors. If we could justify Carnap’s way of assigning prior probabilities, we would be well on our way to solving Hume’s problem. (Why only on our way? More on that in a moment, but very briefly: because we’d still have to justify using conditional probabilities as our guide to the new, unconditional probabilities.) Can we justify Carnap’s two-stage scheme? This brings us to a classic debate in formal epistemology.

If you had to bet on a horserace without knowing anything about any of the horses, which one would you bet on? It probably wouldn’t matter to you: each horse is as likely to win as the others, so you’d be indifferent between the available wagers. If there are 3 horses in the race, each has a 1/3 chance of winning; if there are 5, each has a 1/5 chance; etc. This kind of reasoning is common and is often attributed to the Principle of Indifference:[5]

The Principle of Indifference (PoI)

Given \(n\) mutually exclusive and jointly exhaustive possibilities, none of which is favored over the others by the available evidence, the probability of each is \(1/n\).

PoI looks quite plausible at first, and may even have the flavor of a conceptual truth. How could one possibility be more probable than another if the evidence doesn’t favor it? And yet, the PoI faces a classic and recalcitrant challenge.

Consider the first horse listed in the race, Athena. There are two possibilities, that she will win and that she will lose. Our evidence (or lack thereof) favors neither possibility, so the PoI says the probability that she’ll win is \(1/2\). But suppose there are three horses in the race: Athena, Beatrice, and Cecil. Since our evidence favors none of them over any other, the PoI requires that we assign probability \(1/3\) to each, which contradicts our earlier conclusion that Athena’s probability of winning is \(1/2\).

The source of the trouble is that possibilities can be subdivided into further subpossibilities. The possibility of Athena losing can be subdivided into two subpossibilities, one where Beatrice wins and another where Cecil wins. Because we lack any relevant evidence, the available evidence doesn’t seem to favor the coarser possibilities over the finer subpossibilities, leading to contradictory probability assignments. What we need, it seems, is some way of choosing a single, privileged way of dividing up the space of possibilities so that we can apply the PoI consistently.

It’s natural to think we should use the more fine-grained division of possibilities, the three-way division in the case of Athena, Beatrice, and Cecil. But we can actually divide things further—infinitely further in fact. For example, Athena might win by a full length, by half a length, by a quarter of a length, etc. So the possibility that she wins is actually infinitely divisible. We can extend the PoI to handle such infinite divisions of possibilities in a natural way by saying that, if Athena wins, the probability that she’ll win by between 1 and 2 lengths is twice the probability that she’ll win by between \(1/2\) and 1 length. But the same problem we were trying to solve still persists, in the form of the notorious Bertrand paradox (Bertrand 2007 [1888]).

The paradox is nicely illustrated by the following example from van Fraassen (1989). Suppose a factory cuts iron cubes with edge-lengths ranging from \(0\) cm to 2 cm. What is the probability that the next cube to come off the line will have edges between \(0\) cm and 1 cm in length? Without further information about how the factory goes about producing cubes, the PoI would seem to say the probability is \(1/2\). The range from \(0\) to 1 covers \(1/2\) the full range of possibilities from \(0\) to 2. But now consider this question: what is the probability that the next cube to come off the line will have volume between \(0\) cubic cm and 1 cubic cm? Here the PoI seems to say the probability is \(1/8\). For the range from \(0\) to 1 covers only \(1/8\) the full range of possible volumes from \(0\) to \(8\) cubic cm. So we have two different probabilities for equivalent propositions: a cube has edge-length between \(0\) and 1 cm if and only if it has a volume between \(0\) cubic cm and 1 cubic cm. Once again, the probabilities given by the PoI seem to depend on how we describe the range of possible outcomes. Described in terms of length, we get one answer; described in terms of volume, we get another.

Importantly, Bertrand’s paradox applies quite generally. Whether we’re interested in the size of a cube, the distance by which a horse will win, or any other parameter measured in real numbers, we can always redescribe the space of possible outcomes so that the probabilities assigned by the PoI come out differently. Even an infinitely fine division of the space of possibilities doesn’t fix the problem: the probabilities assigned by the PoI still depend on how we describe the space of possibilities.

We face essentially this problem when we frame the problem of induction in probabilistic terms. Earlier we saw two competing ways of assigning prior probabilities to sequences of coin tosses. One way divides the possible outcomes according to the exact sequence in which \(\mathsf{H}\) and \(\mathsf{T}\) occur. The PoI assigns each possible sequence a probability of \(1/1024\), with the result that the first 9 tosses tell us nothing about the 10th toss. The second, Carnapian way instead divides the possible outcomes according to the number of \(\mathsf{T}\)s, regardless of where they occur in the sequence. The PoI then assigns each possible number of \(\mathsf{T}\)s the same probability, \(1/11\). The result then is that the first 9 tosses tell us a lot about the 10th toss: if the first 9 tosses are tails, the 10th toss has a \(10/11\) chance of coming up tails too.

So one way of applying the PoI leads to inductive skepticism, the other yields the inductive optimism that seems so indispensable to science and daily life. If we could clarify how the PoI should be applied, and justify its use, we would have our answer to Hume’s problem (or at least the first half—we still have to address the issue of using conditional probabilities as a guide to new, unconditional probabilities). Can it be clarified and justified?

Here again we run up against one of the deepest and oldest divides in formal epistemology, that between subjectivists and objectivists. The subjectivists hold that any assignment of probabilities is a legitimate, reasonable way to start one’s inquiry. One need only conform to the three probability axioms to be reasonable. They take this view largely because they despair of clarifying the PoI. They see no reason, for example, that we should follow Carnap in first dividing according to the number of \(\mathsf{T}\)s, and only then subdividing according to where in sequence those \(\mathsf{T}\)s appear. Closely related to this skepticism is a skepticism about the prospects for justifying the PoI, even once clarified, in a way that would put it on a par with the three axioms of probability. We haven’t yet touched on how the three axioms are supposed to be justified. But the classic story is this: a family of theorems—Dutch book theorems (see entry) and representation theorems (see entry)—are taken to show that any deviation from the three axioms of probability leads to irrational decision-making. For example, if you deviate from the axioms, you will accept a set of bets that is bound to lose money, even though you can see that losing money is inevitable a priori. These theorems don’t extend to violations of the PoI though, however it’s clarified. So subjectivists conclude that violating the PoI is not irrational.

Subjectivists aren’t thereby entirely helpless in the face of the problem of induction, though. According to them, any initial assignment of probabilities is reasonable, including Carnap’s. So if you do happen to start out with a Carnap-esque assignment, you will be an inductive optimist, and reasonably so. It’s just that you don’t have to start out that way. You could instead start out treating each possible sequence of \(\mathsf{H}\)s and \(\mathsf{T}\)s as equally probable, in which case you’ll end up an inductive skeptic. That’s reasonable too. According to subjectivism, induction is perfectly rational, it just isn’t the only rational way to reason.

Objectivists hold instead that there’s just one way to assign initial probabilities (though some allow a bit of flexibility (Maher 1996)). These initial probabilities are given by the PoI, according to orthodox objectivism. As for the PoI’s conflicting probability assignments depending on how possibilities are divided up, some objectivists propose restricting it to avoid these inconsistencies (Castell 1998). Others argue that it’s actually appropriate for probability assignments to depend on the way possibilities are divvied up, since this reflects the language in which we conceive the situation, and our language reflects knowledge we bring to the matter (Williamson 2007). Still others argue that the PoI’s assignments don’t actually depend on the way possibilities are divided up—it’s just hard to tell sometimes when the evidence favors one possibility over another (White 2009).

What about justifying the PoI though? Subjectivists have traditionally justified the three axioms of probability by appeal to one of the aforementioned theorems: the Dutch book theorem or some form of representation theorem. But as we noted earlier, these theorems don’t extend to the PoI.

Recently, a different sort of justification has been gaining favor, one that may extend to the PoI. Arguments that rely on Dutch book or representation theorems have long been suspect because of their pragmatic character. They aim to show that deviating from the probability axioms leads to irrational choices, which seems to show at best that obeying the probability axioms is part of pragmatic rationality, as opposed to epistemic rationality. (But see Christensen (1996, 2001) and Vineberg (1997, 2001) for replies.) Preferring a more properly epistemic approach, Joyce (1998, 2009) argues that deviating from the probability axioms takes one unnecessarily far from the truth, no matter what the truth turns out to be. Pettigrew (forthcoming) adapts this approach to the PoI, showing that violations of the PoI increase one’s risk of being further from the truth. (But see Carr (manuscript—see Other Internet Resources) for a critical perspective on this general approach.)

Whether we prefer the subjectivist’s response to Hume’s problem or the objectivist’s, a crucial element is still missing. Earlier we noted that justifying a Carnapian assignment of prior probabilities only gets us half way to a solution. We still have to turn these prior probabilities into posterior probabilities: initially, the probability of tails on the tenth toss was \(1/2\), but after observing the first 9 tosses come out tails, it’s supposed to be \(10/11\). Having justified our initial assignment of probabilities—whether the subjectivist way or the objectivist way—we can prove that \(p(T_{10}\mid T_{1\ldots9})=10/11\) compared to \(p(T_{10})=1/2\). But that doesn’t mean the new probability of \(T_{10}\) is \(10/11\). Remember, the symbolism \(p(T_{10}\mid T_{1\ldots9})\) is just shorthand for the fraction \(p(T_{10} \wedge T_{1\ldots9})/p(T_{1\ldots9})\). So the fact that \(p(T_{10}\mid T_{1\ldots9})=10/11\) just means that this ratio is \(10/11\), which is still just a fact about the initial, prior probabilities.

To appreciate the problem, it helps to forget probabilities for a moment and think in simple, folksy terms. Suppose you aren’t sure whether \(A\) is true, but you believe that if it is true, then so is \(B\). If you then learn that \(A\) is in fact true, you then have two options. You might conclude that \(B\) is true, but you might instead decide that you were wrong at the outset to think \(B\) is true if \(A\) is. Faced with the prospect of accepting \(B\), you might find it too implausible to accept, and thus abandon your initial, conditional belief that \(B\) is true if \(A\) is (Harman 1986).

Likewise, we might start out unsure whether the first \(9\) tosses will come up tails, but believe that if they do, then the probability of the \(10\)th toss coming up tails is \(10/11\). Then, when we see the first \(9\) tosses come up tails, we might conclude that the \(10\)th toss has a \(10/11\) chance of coming up tails. Or we might instead decide we were wrong at the outset to think the \(10\)th toss would have a \(10/11\) chance of coming up tails if the first \(9\) tosses did.

The task is to justify taking the first route rather than the second: sticking to our conditional belief that, if \(T_{1\ldots9}\), then \(T_{10}\) has probability \(10/11\), even once we’ve learned that indeed \(T_{1\ldots9}\). Standing by one’s conditional probabilities in this way is known as “conditionalizing”, because one thereby turns the old conditional probabilities into new, unconditional probabilities. To see why sticking by your old conditional probabilities amounts to turning them into unconditional probabilities, let’s keep using \(p\) to represent the prior probabilities, and let’s introduce \(p'\) to stand for the new, posterior probabilities after we learn that \(T_{1\ldots9}\). If we stand by our prior conditional probabilities, then \[p'(T_{10}\mid T_{1\ldots9}) = p(T_{10}\mid T_{1\ldots9})=10/11.\] And since we now know that \(T_{1\ldots9}\), \(p'(T_{1\ldots9})=1\). It then follows that \(p'(T_{10})=10/11\):

\[\begin{align} p'(T_{10}\mid T_{1\ldots9}) &= \frac{p'(T_{10} \wedge T_{1\ldots9})}{p'(T_{1\ldots9})}\\ &= p'(T_{10} \wedge T_{1\ldots9})\\ &= p'(T_{10})\\ &= 10/11\end{align}\]

The first line follows from the definition of conditional probability. The second follows from the fact that \(p'(T_{1\ldots9})=1\), since we’ve seen how the first \(9\) tosses go. The third line follows from an elementary theorem of the probability axioms: conjoining \(A\) with another proposition \(B\) that has probability 1 results in the same probability, i.e., \(p(A \wedge B)=p(A)\) when \(p(B)=1\). (Deriving this theorem is left as an exercise for the reader.) Finally, the last line just follows from our assumption that \[p'(T_{10}\mid T_{1\ldots9}) = p(T_{10}\mid T_{1\ldots9})=10/11.\] The thesis that we should generally update probabilities in this fashion is known as conditionalization.

Conditionalization

Given the prior probability assignment \(p(H\mid E)\), the new, unconditional probability assignment to \(H\) upon learning \(E\) should be \(p'(H)=p(H\mid E)\).
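The update can be made concrete with the Carnapian prior from earlier. In this sketch (my own), learning \(E = T_{1\ldots9}\) means zeroing out every world where \(E\) fails and renormalizing; the old conditional probability then reappears as the new unconditional one:

```python
from itertools import product
from math import comb

# Carnap's prior over 10 tosses, as before.
prior = {}
for seq in product("HT", repeat=10):
    prior[seq] = (1 / 11) / comb(10, seq.count("T"))

# Conditionalize on E = "the first nine tosses are tails": discard worlds
# where E fails, renormalize the rest by p(E).
def evidence(seq):
    return seq[:9] == ("T",) * 9

p_e = sum(p for seq, p in prior.items() if evidence(seq))
posterior = {seq: (p / p_e if evidence(seq) else 0.0) for seq, p in prior.items()}

# p'(T10) now equals the old conditional probability p(T10 | T1..9) = 10/11.
print(posterior[("T",) * 10])  # ≈ 0.90909...
```

The only surviving world with a tenth tail is the all-tails sequence, so its posterior weight just is \(p'(T_{10})\).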

A number of arguments have been given for this principle, many of them parallel to the previously mentioned arguments for the axioms of probability. Some appeal to Dutch books (Teller 1973; Lewis 1999), others to the pursuit of cognitive values (Greaves and Wallace 2006), especially closeness to the truth (Leitgeb and Pettigrew 2010a,b), and still others to the idea that one should generally revise one’s beliefs as little as possible when accommodating new information (Williams 1980).

The details of these arguments can get very technical, so we won’t examine them here. The important thing for the moment is to appreciate that (i) inductive inference is a dynamic process, since it involves changing our beliefs over time, but (ii) the general probability axioms, and particular assignments of prior probabilities like Carnap’s, are static, concerning only the initial probabilities. Thus (iii) a full theory of inference that answers Hume’s challenge must appeal to additional, dynamic principles like Conditionalization. So (iv) we need to justify these additional, dynamic principles in order to justify a proper theory of inference and answer Hume’s challenge.

Importantly, the morals summarized in (i)–(iv) are extremely general. They don’t just apply to formal epistemologies based in probability theory. They also apply to a wide range of theories based in other formalisms, like Dempster-Shafer theory, ranking theory, belief-revision theory, and non-monotonic logics. One way of viewing the takeaway here, then, is as follows.

Formal epistemology gives us precise ways of stating how induction works. But these precise formulations do not themselves solve a problem like Hume’s, for they rely on assumptions like the probability axioms, Carnap’s assignment of prior probabilities, and Conditionalization. Still, they do help us isolate and clarify these assumptions, and then formulate various arguments in their defense. Whether formal epistemology thereby aids in the solution of Hume’s problem depends on whether these formulations and justifications are plausible, which is controversial.

The problem of induction challenges our inferences from the observed to the unobserved. The regress problem challenges our knowledge at an even more fundamental level, questioning our ability to know anything by observation in the first place (see Weintraub 1995 for a critical analysis of this distinction).

To know something, it seems you must have some justification for believing it. For example, your knowledge that Socrates taught Plato is based on testimony and textual sources handed down through the years. But how do you know these testimonies and texts are reliable sources? Presumably this knowledge is itself based on some further justification—various experiences with these sources, their agreement with each other, with other things you’ve observed independently, and so on. But the basis of this knowledge too can be challenged. How do you know that these sources even say what you think they say, or that they even exist—maybe every experience you’ve had reading The Apology has been a mirage or a delusion.

The famous Agrippan trilemma identifies three possible ways this regress of justification might ultimately unfold. First, it could go on forever, with \(A\) justified by \(B\) justified by \(C\) justified by …, ad infinitum. Second, it could cycle back on itself at some point, with \(A\) justified by \(B\) justified by \(C\) justified by…justified by \(B\), for example. Third and finally, the regress might stop at some point, with \(A\) justified by \(B\) justified by \(C\) justified by…justified by \(N\), which is not justified by any further belief.

These three possibilities correspond to three classic responses to this regress of justification. Infinitists hold that the regress goes on forever, coherentists that it cycles back on itself, and foundationalists that it ultimately terminates. The proponents of each view reject the alternatives as unacceptable. Infinitism looks psychologically unrealistic, requiring an infinite tree of beliefs that finite minds like ours could not accommodate. Coherentism seems to make justification unacceptably circular, and thus too easy to achieve. And foundationalism seems to make justification arbitrary, since the beliefs at the end of the regress apparently have no justification.

The proponents of each view have long striven to answer the concerns about their own view, and to show that the concerns about the alternatives cannot be adequately answered. Recently, methods from formal epistemology have begun to be recruited to examine the adequacy of these answers. We’ll look at some work that’s been done on coherentism and foundationalism, since these have been the focus of both informal and formal work. (For work on infinitism, see Turri and Klein 2014. See Haack (1993) for a hybrid option, “foundherentism”.)

The immediate concern about coherentism is that it makes justification circular. How can a belief be justified by other beliefs which are, ultimately, justified by the first belief in question? If cycles of justification are allowed, what’s to stop one from believing anything one likes, and appealing to it as a justification for itself?

Coherentists usually respond that justification doesn’t actually go in cycles. In fact, it isn’t even really a relationship between individual beliefs. Rather, a belief is justified by being part of a larger body of beliefs that fit together well, that cohere. Justification is thus global, or holistic. It is a feature of an entire body of beliefs first, and only of individual beliefs second, in virtue of their being part of the coherent whole. When we trace the justification for a belief back and back and back until we come full circle, we aren’t exposing the path by which it’s justified. Rather, we are exposing the various interconnections that make the whole web justified as a unit. That these connections can be traced in a circle merely exposes how interconnected the web is, being connected in both directions, from \(A\) to \(B\) to …to \(N\), and then from \(N\) all the way back to \(A\) again.

Still, arbitrariness remains a worry: you can still believe just about anything, provided you also believe many other things that fit well with it. If I want to believe in ghosts, can I just adopt a larger world view on which supernatural and paranormal phenomena are rife? This worry leads to a further one, a worry about truth: given that almost any belief can be embedded in a larger, just-so story that makes sense of it, why expect a coherent body of beliefs to be true? There are many coherent stories one can tell, the vast majority of which will be massively false. If coherence is no indication of truth, how can it provide justification?

This is where formal methods come in: what does probability theory tell us about the connection between coherence and truth? Are more coherent bodies of belief more likely to be true? Less likely?

Klein and Warfield (1994) argue that coherence often decreases probability. Why? Increases in coherence often come from new beliefs that make sense of our existing beliefs. A detective investigating a crime may be puzzled by conflicting testimony until she learns that the suspect has an identical twin, which explains why some witnesses report seeing the suspect in another city the day of the crime. And yet, adding the fact about the identical twin to her body of beliefs actually decreases its probability. This follows from a theorem of the probability axioms we noted earlier (§1.2), Conjunction Costs Probability, which says that conjoining \(A\) with \(B\) generally yields a lower probability than for \(A\) alone (unless \(p(A \wedge \neg B)=0\)). Intuitively, the more things you believe the more risks you take with the truth. But making sense of things often requires believing more.

Merricks (1995) replies that it’s only the probability of the entire belief corpus that goes down when beliefs are added. But the individual probabilities of the beliefs it contains are what’s at issue. And from the detective’s point of view, her individual beliefs do become more probable when made sense of by the additional information that the suspect has an identical twin. Shogenji (1999) differs: coherence of the whole cannot influence probability of the parts. Coherence is for the parts to stand or fall together, so just as coherence makes all the members more likely to be true together, it makes it more likely that they are all false (at the expense of the possibility that some will turn out true and others false).

Instead, Shogenji prefers to answer Klein & Warfield at the collective level, the level of the whole belief corpus. He argues that the corpora Klein & Warfield compare differ in probability because they are of different strengths. The more beliefs a corpus contains, or the more specific its beliefs are, the stronger it is. In the case of the detective, adding the information about the twin increases the strength of her beliefs. And, in general, increasing strength decreases probability, since as we’ve seen, \(p(A \wedge B) \leq p(A)\). Thus the increase in the coherence of the detective’s beliefs is accompanied by an increase in strength. The net effect, argues Shogenji, is negative: the probability of the corpus goes down because the increase in strength outweighs the increase in coherence.

To vindicate this diagnosis, Shogenji appeals to a formula for measuring the coherence of a belief-set in probabilistic terms, which we’ll label coh:

\[ \textit{coh}(A_1,\ldots,A_n) = \frac{p(A_1 \wedge \ldots \wedge A_n)}{p(A_1) \times \ldots \times p(A_n)}\]

To see the rationale behind this formula, consider the simple case of just two beliefs:

\[\begin{align} \textit{coh}(A,B) &= \frac{p(A \wedge B)}{p(A) \times p(B)}\\ &= \frac{p(A \mid B)}{p(A)}\end{align}\]

When \(B\) has no bearing on \(A\), \(p(A\mid B)=p(A)\), and this ratio just comes out 1, which is our neutral point. If instead \(B\) raises the probability of \(A\), this ratio comes out larger than 1; and if \(B\) lowers the probability of \(A\), it comes out smaller than 1. So \(\textit{coh}(A,B)\) measures the extent to which \(A\) and \(B\) are related. Shogenji’s formula \(\textit{coh}(A_1,\ldots,A_n)\) generalizes this idea for larger collections of propositions.
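A toy model makes the measure’s behavior vivid. In this sketch (the die model is my own illustration, not from the text), the worlds are rolls of a fair six-sided die, and \(\textit{coh}\) is computed with exact rational arithmetic:

```python
from fractions import Fraction

# Probability of an event in the fair-die model: count the rolls (1..6)
# where the event holds, out of 6.
def p(event):
    return Fraction(sum(1 for w in range(1, 7) if event(w)), 6)

# Shogenji's measure: joint probability over the product of the marginals.
def coh(*events):
    joint = p(lambda w, es=events: all(e(w) for e in es))
    prod = 1
    for e in events:
        prod *= p(e)
    return joint / prod

low = lambda w: w <= 2    # "the roll is 1 or 2"
small = lambda w: w <= 3  # "the roll is at most 3"
high = lambda w: w >= 5   # "the roll is 5 or 6"

print(coh(low, small))  # 2: each belief raises the probability of the other
print(coh(low, high))   # 0: the beliefs cannot both be true
```

Values above 1 mark mutual support, 1 marks independence, and values below 1 mark tension, with 0 for outright inconsistency.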

How does measuring coherence this way vindicate Shogenji’s reply to Klein & Warfield, that the increase in the detective’s coherence is outweighed by an increase in the strength of her beliefs? The denominator in the formula for \(\textit{coh}\) tracks strength: the more propositions there are, and the more specific they are, the smaller this denominator will be. So if we compare two belief-sets with the same strength, their denominators will be the same. Thus, if one is more coherent than the other, it must be because its numerator is greater. Thus coherence increases with overall probability, provided strength is held constant. Since in the detective’s case overall probability does not increase despite the increase in coherence, it must be because the strength of her commitments had an even stronger influence.

Shogenji’s measure of coherence is criticized by other authors, many of whom offer their own, preferred measures (Akiba 2000; Olsson 2002, 2005; Glass 2002; Bovens & Hartmann 2003; Fitelson 2003; Douven and Meijs 2007). Which measure is correct, if any, remains controversial, as does the fate of Klein & Warfield’s argument against coherentism. Another line of probabilistic attack on coherentism, which we won’t explore here, comes from Huemer (1997) and is endorsed by Olsson (2005). Huemer (2011) later retracts the argument though, on the grounds that it foists unnecessary commitments on the coherentist. More details are available in the entry on coherentism.

Foundationalists hold that some beliefs are justified without being justified by other beliefs. Which beliefs have this special, foundational status? Foundationalists usually identify either beliefs about perceived or remembered matters, like “there’s a door in front of me” or “I had eggs yesterday”, or else beliefs about how things seem to us, like “there appears to be a door in front of me” or “I seem to remember having eggs yesterday”. Either way, the challenge is to say how these beliefs can be justified if they are not justified by any other beliefs.

One view is that these beliefs are justified by our perceptual and memorial states. When it looks like there’s a door in front of me, this perceptual state justifies me in believing that there is a door there, provided I have no reason to distrust this appearance. Or, at least, I am justified in believing that there appears to be a door there. So foundational beliefs are not arbitrary, they are justified by closely related perceptual and memorial states. Still, the regress ends there, because it makes no sense to ask what justifies a state of perception or memory. These states are outside the domain of epistemic normativity.

A classic criticism of foundationalism now arises, a version of the infamous Sellarsian dilemma. Must you know that your (say) vision is reliable to be justified in believing that there’s a door in front of you on the basis of its looking that way? If so, we face the first horn of the dilemma: the regress of justification is revived. For what justifies your belief that your vision is reliable? Appealing to previous cases where your vision proved reliable just pushes things back a step, since the same problem now arises for the reliability of your memory. Could we say instead that the appearance of a door is enough by itself to justify your belief in the door? Then we face the second horn: such a belief would seem to be arbitrary, formed on the basis of a source you have no reason to trust, namely your vision (Sellars 1956; Bonjour 1985; Cohen 2002).

This second horn is sharpened by White (2006), who formalizes it in probabilistic terms. Let \(A(D)\) be the proposition that there appears to be a door before you, and \(D\) the proposition that there really is a door there. The conjunction \(A(D) \wedge \neg D\) represents the possibility that appearances are misleading in this case: there appears to be a door, but there isn’t really one. Using the probability axioms, we can prove that \(p(D\mid A(D)) \leq p(\neg(A(D) \wedge \neg D))\) (see technical supplement §3). In other words, the probability that there really is a door given that there appears to be one cannot exceed the initial probability that appearances are not misleading in this case. So it seems that any justification \(A(D)\) lends to belief in \(D\) must be preceded by some justification for believing that appearances are not misleading, i.e., \(\neg(A(D) \wedge \neg D)\). Apparently then, you must know (or have reason to believe) your sources are reliable before you can trust them. (Pryor 2013 elucidates some tacit assumptions in this argument.)
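White’s inequality can be spot-checked numerically. This sketch (my own check, not from White’s paper) draws random probability assignments over the four combinations of \(A(D)\) and \(D\) and confirms that \(p(D\mid A(D))\) never exceeds \(p(\neg(A(D) \wedge \neg D))\):

```python
import random

random.seed(1)

# For any probability assignment over the four cells, p(D | A(D)) cannot
# exceed p(not(A(D) and not-D)), i.e., 1 minus the "misleading" cell.
for _ in range(10_000):
    cells = [random.random() for _ in range(4)]
    total = sum(cells)
    # Joint probabilities: [A&D, A&notD, notA&D, notA&notD], summing to 1.
    a_d, a_nd, na_d, na_nd = (c / total for c in cells)
    p_d_given_a = a_d / (a_d + a_nd)
    p_not_misleading = 1 - a_nd
    assert p_d_given_a <= p_not_misleading + 1e-12

print("inequality held in all 10,000 random trials")
```

The check accords with the algebra: writing \(a = p(A(D) \wedge D)\) and \(b = p(A(D) \wedge \neg D)\), the inequality \(a/(a+b) \leq 1-b\) reduces to \(b(1-a-b) \geq 0\), which always holds.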

Lying in wait at the other horn of the Sellarsian dilemma is the Principle of Indifference (PoI). What is the initial probability that the appearance as of a door is misleading, according to the PoI? On one way of thinking about it, your vision can be anywhere from 100% reliable to 0% reliable. That is, the way things appear to us might be accurate all the time, none of the time, or anywhere in between. If we regard every degree of reliability from 0% to 100% as equally probable, the effect is the same as if we just assumed experience to be 50% reliable. The PoI will then assign \(p(D\mid A(D))=1/2\). This result effectively embraces skepticism, since we remain agnostic about the presence of the door despite appearances.

We saw earlier (§2.1) that the PoI assigns different probabilities depending on how we divide up the space of possibilities. What if we divide things up this way instead:

\[\begin{array}{c|cc}
 & D & \neg D\\
\hline
A(D) & 1/4 & 1/4\\
\neg A(D) & 1/4 & 1/4\\
\end{array}\]

Once again, we get the skeptical, agnostic result that \(p(D\mid A(D))=1/2\). Other ways of dividing up the space of possibilities will surely deliver better, anti-skeptical results. But then some argument for preferring those ways of dividing things up will be wanted, launching the regress of justification all over again.
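The agnostic verdict from this four-way division can be computed directly. A quick sketch using exact fractions (the cell labels are my own shorthand):

```python
from fractions import Fraction

# The uniform Principle-of-Indifference assignment: 1/4 to each cell.
q = Fraction(1, 4)
p = {("A", "D"): q, ("A", "notD"): q, ("notA", "D"): q, ("notA", "notD"): q}

p_A = p[("A", "D")] + p[("A", "notD")]   # p(A(D)) = 1/2
p_D_given_A = p[("A", "D")] / p_A        # p(D | A(D))
print(p_D_given_A)                       # 1/2: the skeptical, agnostic result
```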

Subjectivists, who reject the PoI and allow any assignment of initial probabilities as long as it obeys the probability axioms, may respond that it’s perfectly permissible to assign a high initial probability to the hypothesis that our senses are (say) 95% reliable. But they must also admit that it is permissible to assign a high initial probability to the hypothesis that our senses are 0% reliable, i.e., wrong all the time. Subjectivists can say that belief in the external world is justified, but they must allow that skepticism is justified too. Some foundationalists may be able to live with this result, but many seek to understand how experience justifies external world beliefs in a stronger sense—in a way that can be used to combat skeptics, rather than merely agreeing to disagree with them.

So far we’ve used just one formal tool, probability theory. We can get many similar results in the above applications using other tools, like Dempster-Shafer theory or ranking theory. But let’s move to a new application, and a new tool. Let’s use modal logic to explore the limits of knowledge.

The language of modal logic is the same as ordinary, classical logic, but with an additional sentential operator, \(\Box\), thrown in to represent necessity. If a sentence \(\phi\) isn’t just true, but necessarily true, we write \(\Box \phi\).

There are many kinds of necessity, though. Some things are logically necessary, like tautologies. Others may not be logically necessary, but still metaphysically necessary. (That Hesperus and Phosphorus are identical is a popular example; more controversial candidates are God’s existence or facts about parental origin, e.g., the fact that Ada Lovelace’s father was Lord Byron.)

But the kind of necessity that concerns us here is epistemic necessity, the necessity of things that must be true given what we know. For example, it is epistemically necessary for you that the author of this sentence is human. If you didn’t know that already (maybe you hadn’t considered the question), it had to be true given other things you did know: that humans are the only beings on Earth capable of constructing coherent surveys of formal epistemology, and that this is such a survey (I hope).

In epistemic modal logic then, it makes sense to write \(K \phi\) instead of \(\Box \phi\), where \(K \phi\) means that \(\phi\) is known to be true, or at least follows from what is known to be true. Known by whom? That depends on the application. Let’s assume we are talking about your knowledge unless specified otherwise.

What axioms should epistemic modal logic include? Well, any tautology of propositional logic should be a theorem, like \(\phi \supset \phi\). For that matter, formulas with the \(K\) operator that are similarly truth-table valid, like \(K \phi \supset K \phi\), should be theorems too. So we’ll just go ahead and make all these formulas theorems in the crudest way possible, by making them all axioms:

(P) Any sentence that is truth-table valid by the rules of classical logic is an axiom.

Adopting P immediately makes our list of axioms infinite. But they’re all easily identified by the truth-table method, so we won’t worry about it.

Moving beyond classical logic, all so-called “normal” modal logics share an axiom that looks pretty sensible for epistemic applications:

\[\tag{\(\bf K\)} K (\phi \supset \psi) \supset (K \phi \supset K \psi) \]

If you know that \(\phi \supset \psi\) is true, then if you also know \(\phi\), you also know \(\psi\). Or at least, \(\psi\) follows from what you know if \(\phi \supset \psi\) and \(\phi\) do. (The ‘K’ here stands for ‘Kripke’ by the way, not for ‘knowledge’.) Another common axiom shared by all “alethic” modal logics also looks good:

\[\tag{\(\bf T\)} K \phi \supset \phi \]

If you know \(\phi\), it must be true. (Note: K and T are actually axiom schemas, since any sentence of these forms is an axiom. So each of these schemas actually adds infinitely many axioms, all of the same general form.)

To these axioms we’ll add two inference rules. The first, familiar from classical logic, states that from \(\phi \supset \psi\) and \(\phi\), one may derive \(\psi\). Formally:

\[\tag{\(\bf{MP}\)} \phi \supset \psi, \phi \vdash \psi \]

The second rule is specific to modal logic and states that from \(\phi\) one can infer \(K \phi\). Formally:

\[\tag{\(\textbf{NEC}\)} \phi \vdash K \phi \]

The NEC rule looks immediately suspect: doesn’t it make everything true known? Actually, no: our logic only admits axioms and things that follow from them by MP. So only logical truths will be subject to the NEC rule, and these are epistemically necessary: they’re either known, or they follow from what we know, because they follow given no assumptions at all. (NEC stands for ‘necessary’, epistemically necessary in the present system.)

The three axiom schemas P, K, and T, together with the derivation rules MP and NEC, complete our minimal epistemic modal logic. They allow us to derive some basic theorems, one of which we’ll use in the next section:

Theorem (\(\wedge\)-distribution). \(K(\phi \wedge \psi) \supset (K \phi \wedge K \psi)\)

(See the technical supplement for a proof). This theorem says roughly that if you know a conjunction, then you know each conjunct. At least, each conjunct follows from what you know (I’ll be leaving this qualifier implicit from now on), which seems pretty sensible.

Can we prove anything more interesting? With some tweaks here and there, we can derive some quite striking results about the limits of our knowledge.

Can everything that is true be known? Or are there some truths that could never be known, even in principle? A famous argument popularized by Fitch (1963) and originally due to Alonzo Church (Salerno 2009) suggests not: some truths are unknowable. For if all truths were knowable in principle, we could derive that all truths are actually known already, which would be absurd.

The argument requires a slight extension of our epistemic logic, to accommodate the notion of knowability. For us, \(K\) means known (or entailed by the known), whereas knowability adds an extra modal layer: what it’s possible to know. So we’ll need a sentential operator \(\Diamond\) in our language to represent metaphysical possibility. Thus \(\Diamond \phi\) means “it’s metaphysically possible for \(\phi\) to be true”. In fact, \(\Diamond \phi\) is just short for \(\neg \Box \neg \phi\), since what doesn’t have to be false can be true. So we can actually add the \(\Box\) instead and assume that, like the \(K\) operator, it obeys the NEC rule. (As with the NEC rule for the \(K\) operator, it’s okay that we can always derive \(\Box \phi\) from \(\phi\), because we can only derive \(\phi\) in the first place when \(\phi\) is a logical truth.) \(\Diamond\) is then just \(\neg \Box \neg\) by definition.

With this addition to our language in place, we can derive the following lemma (see the technical supplement for the derivation):

Lemma (Unknowns are Unknowable). \(\neg \Diamond K(\phi \wedge \neg K \phi)\)

This lemma basically says you can’t know a fact of the sort, “\(\phi\) is true but I don’t know it’s true”, which seems pretty sensible. If you knew such a conjunction, the second conjunct would have to be true, which conflicts with your knowing the first conjunct. (This is where \(\wedge\)-distribution proves useful.)

Yet this plausible looking lemma leads almost immediately to the unknowability of some truths. Suppose for reductio that everything true could be known, at least in principle. That is, suppose we took as an axiom:

Knowledge Without Limits

\(\phi \supset \Diamond K \phi\)

We would then be able to derive in just a few lines that everything true is actually known, i.e., \(\phi \supset K \phi\).

\begin{array}{rll} 1.& (\phi \wedge \neg K \phi) \supset \Diamond K (\phi \wedge \neg K \phi)& \textbf{Knowledge Without Limits}\\ 2.& \neg (\phi \wedge \neg K\phi)& 1,\ \textbf{Unknowns are Unknowable, P}\\ 3.& \phi \supset K\phi& 2,\ \textbf{P}\\ \end{array}

If \(K\) represents what God knows, this would be fine. But if \(K\) represents what you or I know, it seems absurd! Not only are there truths we don’t know, most truths don’t even follow from what we know. Knowledge Without Limits appears to be the culprit here, so it seems there are some things we could not know, even in principle. But see the entry on Fitch’s paradox of knowability for more discussion.

Even if we can’t know some things, might we at least have unlimited access to our own knowledge? Are we at least always able to discern whether we know something? A popular axiom in the logic of metaphysical necessity is the so-called S4 axiom: \(\Box \phi \supset \Box \Box \phi\). This says that whatever is necessary had to be necessary. In epistemic logic, the corresponding formula is:

\[\tag{\(\bf KK\)} K \phi \supset KK \phi \]

This says roughly that whenever we know something, we know that we know it. Hintikka (1962) famously advocates including KK as an axiom of epistemic logic. But an influential argument due to Williamson (2000) suggests otherwise.

The argument hinges on the idea that knowledge can’t be had by luck. Specifically, to know something, it must be that you couldn’t have been wrong very easily. Otherwise, though you might be right, it’s only by luck. For example, you might correctly guess that there are exactly 967 jellybeans in the jar on my desk, but even though you’re right, you just got lucky. You didn’t know there were 967 jellybeans, because there could easily have been 968 jellybeans without you noticing the difference.

To formalize this “no-luck” idea, let the propositions \(\phi_1, \phi_2\), etc. say that the number of jellybeans is at least 1, at least 2, etc. We’ll assume you’re eyeballing the number of jellybeans in the jar, not counting them carefully. Because you’re an imperfect estimator of large quantities of jellybeans, you can’t know that there are at least 967 jellybeans in the jar. If you think there are at least 967 jellybeans, you could easily make the mistake of thinking there are at least 968, in which case you’d be wrong. So we can formalize the “not easily wrong” idea in this scenario as follows:

Safety

\(K \phi_i \supset \phi_{i+1}\) when \(i\) is large (at least \(100\) let’s say).

The idea is that knowledge requires a margin for error, a margin of at least one jellybean in our example. Presumably more than one jellybean, but at least one. Within one jellybean of the true number, you can’t discern truth from falsehood. (See Nozick (1981) for a different conception of a “no luck” requirement on knowledge, which Roush (2005; 2009) formalizes in probabilistic terms.)

Having explained all this to you though, here’s something else you now know: that the Safety thesis is true. So we also have:

Knowledge of Safety

\(K(K \phi_i \supset \phi_{i+1})\) when \(i\) is large.

And combining Knowledge of Safety with KK yields an absurd result:

\begin{array}{rll} 1.& K \phi_{100}& \mbox{Assumption}\\ 2.& KK \phi_{100}& 1, \mathbf{KK}\\ 3.& K(K \phi_{100} \supset \phi_{101})& \textbf{Knowledge of Safety}\\ 4.& KK \phi_{100} \supset K \phi_{101}& 3, \mathbf{K}\\ 5.& K \phi_{101}& 2,4, \mathbf{MP}\\ &&\mbox{repeat steps (2)–(5) for }\phi_{101}, \phi_{102}, \ldots, \phi_n\\ m.& K \phi_n& m-1, \mathbf{MP}\\ m'.& \phi_n& m, \mathbf{T}\\ \end{array}

Given the assumption on line (1), that you know there are at least \(100\) jellybeans in the jar (which you can plainly see), we can show that there are more jellybeans in the jar than stars in the galaxy. Set \(n\) high enough and the jellybeans even outnumber the particles in the universe! (Notice that we don’t rely on NEC anywhere in this derivation, so it’s okay to use non-logical assumptions like line (1) and Knowledge of Safety.)
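The derivation’s engine is the repetition of steps (2)–(5). A sketch of that loop in Python, where each pass stands in for one round of KK, Knowledge of Safety, axiom K, and MP (the variable names and the cutoff \(n\) are illustrative):

```python
known = {100}   # line (1): K(phi_100), i.e. you know there are at least 100
n = 1000        # set n as high as you like; nothing stops the regress

i = 100
while i < n:
    # From K(phi_i): KK yields KK(phi_i); Knowledge of Safety yields
    # K(K(phi_i) -> phi_{i+1}); axiom K and MP then yield K(phi_{i+1}).
    i += 1
    known.add(i)

print(max(known))  # 1000: "knowledge" of absurdly many jellybeans
```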

What’s the philosophical payoff if we join Williamson in rejecting KK on these grounds? Skeptical arguments that rely on KK might be disarmed. For example, a skeptic might argue that to know something, you must be able to rule out any competing alternatives. To know the external world is real, for instance, you must be able to rule out the possibility that you are being deceived by Descartes’ demon (Stroud 1984). But then you must also be able to rule out the possibility that you don’t know the external world is real, since this is plainly an alternative to your knowing it is real. That is, you must \(K\neg\neg K\phi\), and thus \(KK\phi\) (Greco forthcoming). So the driving premise of this skeptical argument entails the KK thesis, which we’ve seen reason to reject.

Other skeptical arguments don’t rely on KK, of course. For example, a different skeptical tack begins with the premise that a victim of Descartes’ demon has exactly the same evidence as a person in the real world, since their experiential states are indistinguishable. But if our evidence is the same in the two scenarios, we have no justification for believing we are in one rather than the other. Williamson (2000: ch. 8) deploys an argument similar to his reductio of KK against the premise that the evidence is the same in the real world and the demon world. The gist is that we don’t always know what evidence we have in a given scenario, much as we don’t always know what we know. Indeed, Williamson argues that any interesting feature of our own minds is subject to a similar argument, including that it appears to us that \(\phi\): \(A\phi \supset KA\phi\) faces a similar reductio to that for \(K\phi \supset KK \phi\). For further analysis and criticism, see Hawthorne (2005), Mahtani (2008), Ramachandran (2009), Cresto (2012), and Greco (forthcoming).

Gettier (1963) famously deposed the theory that knowledge is justified true belief (JTB) with a pair of intuitively compelling counterexamples. But such appeals to intuition have come under fire recently (Weinberg, Nichols, and Stich 2001; Buckwalter and Stich 2011) (though see Nagel 2012). Some also think it would be better anyway to retain the simplicity of the JTB theory and bite Gettier’s counterintuitive bullets, rather than pursue baroque revisions of the JTB account (Weatherson 2003).

Can epistemic logic help here? T. Williamson (2013a) argues that a simple model in epistemic logic vindicates Gettier’s initial insight: there are cases of justified true belief without knowledge. To formulate this argument though, we have to turn to the semantics of epistemic logic, rather than its axioms and derivation rules.

The standard semantics for modal logic revolves around possible worlds. For something to be necessarily true is for it to hold no matter how things could be, i.e., in every possible world. For epistemic logic, we use epistemically possible worlds, ways things could be for all we know, i.e., compatible with the sum total of our knowledge. For example, scenarios where I become an astronaut were epistemically possible for me when I was young, but they are not epistemic possibilities for me now. (I have no regrets.)

To represent these possible worlds, we introduce a set of objects we’ll label \(W\). \(W\) can be populated with natural numbers, dots on a page, or any other objects we choose as stand-ins for the possibilities under discussion. For now, let’s just label \(W\)’s members \(w\), \(w'\), \(w''\), etc.

Epistemic possibility is relative. In a scenario where my thermostat reads \(23\) degrees Celsius, the real temperature might be as high as \(25\) or as low as \(21\). (I really should get it looked at.) But a world where the actual temperature is \(29\) is not epistemically possible when the thermostat reads \(23\); it’s not that unreliable. Still, the \(29\) scenario is epistemically possible relative to a world where the thermostat reads (say) \(28\). So epistemic possibility is relative: what’s possible in one situation isn’t always the same as what’s possible in another situation.

To capture this relativity, let’s introduce a relation, \(R\), to express the fact that scenario \(w'\) is possible relative to \(w\). For example, if \(w\) is a scenario where the thermostat reads \(23\) and \(w'\) a scenario where the real temperature is \(25\), then \(wRw'\). That is, from the point of view of \(w\), \(w'\) is an epistemic possibility—when the thermostat reads \(23\), the real temperature might be \(25\), for all I know.

To apply all this to our epistemic logic, we just need to settle which sentences are true in which worlds. That is, we need a two-place function \(v(\phi,w)\) that returns \({\textsf{T}}\) if \(\phi\) is true in world \(w\), and \({\textsf{F}}\) otherwise. Then we can give truth-conditions for the \(K\) operator as follows:

\[ v(K\phi,w)= \textsf{T} \text{ iff } v(\phi,w')= \textsf{T} \text{ for every } w' \text{ such that } wRw'. \]

In other words, \(\phi\) is known just in case it’s true in every epistemically possible scenario.
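This truth-condition is straightforward to mechanize. A minimal sketch (the worlds, the accessibility relation, and the valuation below are invented for illustration):

```python
# Worlds are labels; R maps each world to the set of worlds epistemically
# possible from it; a proposition is the set of worlds where it is true.
W = {"w", "w1", "w2"}
R = {"w": {"w", "w1"}, "w1": {"w1"}, "w2": {"w2"}}
phi = {"w", "w1"}  # phi is true at w and at w1, false at w2

def knows(prop, world):
    """K-prop holds at world iff prop holds at every accessible world."""
    return all(v in prop for v in R[world])

print(knows(phi, "w"))   # True: phi holds at both worlds accessible from w
print(knows(phi, "w2"))  # False: phi fails at w2, which w2 can access
```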

Even before we specify which worlds bear relation \(R\) to which others, or which formulas are true at which worlds, we can see that axiom K comes out true in every possible world. Recall axiom K: \(K (\phi \supset \psi) \supset (K \phi \supset K \psi)\). If the antecedent \(K (\phi \supset \psi)\) is true in world \(w\), then every epistemically possible scenario \(w'\) is one where either \(\phi\) is false or where \(\psi\) is true. If \(K\phi\) is also true in \(w\), then every one of these epistemically possible scenarios \(w'\) is not of the former kind, but instead of the latter kind. That is, \(\psi\) is true in every epistemically possible world \(w'\), i.e., \(K\psi\) is true in \(w\).

The same can’t be said for the T axiom, however. Recall T: \(K\phi \supset \phi\). Now imagine a simple model with just two possible worlds, \(w\) and \(w'\), with \(\phi\) true at \(w\) but not \(w'\): \(v(\phi,w)={\textsf{T}}\), \(v(\phi,w')={\textsf{F}}\). This model might be used to represent the outcome of a coin flip, for example, with \(w\) the heads world and \(w'\) the tails world. Now suppose we stipulate that \(w'Rw\), but not \(w'Rw'\).

(In a diagram, an arrow from \(w'\) to \(w\) would represent the fact that \(w'Rw\).) Then we find that \(v(K\phi,w')={\textsf{T}}\) but \(v(\phi,w')={\textsf{F}}\), violating T. \(K\phi\) comes out true at \(w'\) because \(\phi\) is true in every world possible relative to \(w'\), namely just \(w\). But that seems absurd: how can I know that \(\phi\) is true in a world where it’s not actually true?!

The fix is to stipulate that \(w'Rw'\). In general, to ensure that T always comes out true, we stipulate that \(wRw\) for every \(w \in W\).

When the possibility relation \(R\) does this, we say it’s reflexive: every world is possible relative to itself. And surely, the actual world is always possible given what one knows. Genuine knowledge never rules out the truth.
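The coin-flip countermodel and its repair can be checked mechanically. A sketch, with the strings "w" and "w1" standing in for \(w\) and \(w'\):

```python
def knows(prop, world, R):
    return all(v in prop for v in R[world])

phi = {"w"}                                 # phi true at w (heads), false at w1 (tails)
R_bad = {"w": {"w"}, "w1": {"w"}}           # w1 sees only w: not reflexive
R_fixed = {"w": {"w"}, "w1": {"w", "w1"}}   # stipulate w1 R w1 as well

# Without reflexivity, T fails at w1: K-phi holds there though phi does not.
print(knows(phi, "w1", R_bad), "w1" in phi)  # True False
# With reflexivity restored, K-phi no longer holds at w1, so T is safe.
print(knows(phi, "w1", R_fixed))             # False
```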

There are various further constraints one might impose on \(R\), which yield other axioms one might be interested in. For example, if we wanted to make sure the KK axiom is always true, we could stipulate that \(R\) is transitive, i.e., if \(wRw'\) and \(w'Rw''\), then \(wRw''\). But we saw in the previous section that KK might not be a plausible result, so we won’t impose the transitivity requirement here.

To construct a model of a Gettier case, let’s run with the thermostat example. We are considering two factors, the real temperature and the apparent temperature displayed on the thermostat. A possible scenario can thus be represented by an ordered pair, \((r,a)\), where \(r\) is the real temperature and \(a\) the apparent temperature. So \(W\) is now a set of pairs of numbers, \((r,a)\). For simplicity, we’ll stick with integers—the thermostat is digital, and the real temperature can always be rounded to the nearest integer. We’ll also pretend there’s no absolute zero, not even on the thermostat. So \(r\) and \(a\) can be any integers.

That’s \(W\), what about the relation of epistemic possibility, \(R\)? I always know what the thermostat says, so we can stipulate that for \((r',a')\) to be epistemically possible in world \((r,a)\), \(a'\) must equal \(a\). When the thermostat reads \(23\), it’s not epistemically possible for me that it reads \(24\), or anything other than \(23\).

Furthermore, let’s stipulate that the thermostat is only reliable up to \(\pm 2\). In the best-case scenario then, when the thermostat reads accurately, i.e., \(r=a\), the most I can know is that the temperature is \(a\pm2\). I can’t know on the basis of the thermostat’s readings anything more precise than what the thermostat reliably tells me. So the thermostat’s range of reliability places an upper limit on the precision of what I can know in our example. At most, my knowledge has precision \(\pm 2\).

In addition to that limit, we’ll stipulate one other. The further off the thermostat is from the true temperature, the less I know about the actual temperature. The worse the case, the weaker my grip on reality, and thus the weaker my knowledge. For definiteness, let’s say that for every degree the reading is off by, my knowledge becomes one degree weaker. If the thermostat reads \(23\) when the true temperature is \(22\), the most I can know is that that temperature is between \(20\) and \(26\) (\(23 \pm (2 + 1)\)). If the thermostat reads \(24\) when the true temperature is \(22\), the most I can know is that that temperature is between \(20\) and \(28\) \((24 \pm (2 + 2))\). And so on. If the thermostat is off by a bit, then my access to the true temperature is somewhat compromised. If the thermostat is off by a lot, my access to the truth is significantly compromised.

Putting all these stipulations together, we can define our \(R\) thus:

Temperate Knowledge

\((r,a)R(r',a')\) iff (i) \(a'=a\), and (ii) \(r'\) is within \(a \pm (2 + \left| r-a \right| )\).

Condition (i) captures the fact that I know what the thermostat says. Condition (ii) captures both the fact that the thermostat is only reliable up to \(\pm 2\), and the fact that the less accurate a reading is, the less knowledge it gives me about the true temperature.

With \(W\) and \(R\) specified, we’ve settled what I know in each possible world. For example, if \(\phi\) is the proposition that the real temperature is between \(10\) and \(20\), then I know \(\phi\) in (say) world \((15,16)\). The accessible worlds \((r',a')\) are those where \(a'=16\) and \(r'\) is between \(13\) and \(19\). So I know the true temperature is somewhere between \(13\) and \(19\), and thus certainly lies between \(10\) and \(20\).
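Temperate Knowledge is concrete enough to implement. A sketch that checks the example just given, restricting attention to a finite grid of worlds (the grid bounds and helper names are my own):

```python
# Worlds are (real, apparent) temperature pairs.
def accessible(w, w2):
    (r, a), (r2, a2) = w, w2
    margin = 2 + abs(r - a)            # reliability +/-2, widened by the reading's error
    return a2 == a and a - margin <= r2 <= a + margin

def knows(prop, w, worlds):
    """prop: predicate on worlds. True iff prop holds at every accessible world."""
    return all(prop(w2) for w2 in worlds if accessible(w, w2))

worlds = [(r, a) for r in range(0, 41) for a in range(0, 41)]

# At world (15, 16) the reading is off by 1, so the accessible worlds have
# a' = 16 and r' between 13 and 19; hence I know the temperature is 10-20.
print(knows(lambda w2: 10 <= w2[0] <= 20, (15, 16), worlds))  # True
print(knows(lambda w2: 14 <= w2[0] <= 20, (15, 16), worlds))  # False: 13 is accessible
```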

What about belief? I generally believe more than I know (sadly), so how do we express what I believe in this model? We’ll assume that whatever I know, I justifiedly believe. In the best-case scenario where \(r=a\), I know that the apparent temperature is \(a\) and the true temperature is within \(a\pm2\). So these are also justified beliefs. We’ll further assume that my beliefs are based solely on the thermostat’s readings. (Pretend I have no bodily sense of the temperature.) So in any other world where the apparent temperature is the same, I justifiedly believe the same things. Thus:

Temperate Justified Belief

In world \((r,a)\), I justifiedly believe that the thermostat reads \(a\) and that the true temperature lies in \(a\pm2\).

In fact, let’s say that’s all I believe. We don’t want me adding any stronger beliefs about the true temperature, since they might not be justified—even in the best case scenario where the reading is accurate, the most I could know is that the true temperature is \(a\pm2\).

(We could be more formal about adding justified belief to the model. We could add a second modal operator for justified belief, \(J\), to our language. And we could add a corresponding possibility relation \(R_J\) to the model. But for our purposes stipulating Temperate Justified Belief will serve.)

That completes our model. Now let’s see how it contains Gettier scenarios.

Consider the possible world \((19,20)\). Since the reading is off by 1, the most I can know is that the temperature is \(20\pm(2+1)\), i.e., \(20\pm3\). But recall, I justifiedly believe everything I would know if this reading were correct, since knowledge entails justified belief and my belief is based solely on the reading. So I justifiedly believe the true temperature is within \(20\pm2\). This means my justified beliefs include that the true temperature is not \(23\). Which is true. But I do not know it. For all I know, the true temperature might be as high as \(23\). My knowledge is tempered by 1 degree since, unbeknownst to me, the thermostat is off by 1 degree.

In fact, our model is rife with such scenarios. Any world where \(r \neq a\) will have at least one Gettier belief, since my justified beliefs will have precision \(\pm2\) but my knowledge will only have precision \(\pm(2+n)\), where \(n\) is the degree of error in the reading, i.e., \(\left| r-a\right|\). Notice though, only some beliefs are Gettiered in these scenarios, as we would expect. For example, in the \((19,20)\) case, my weaker belief that the true temperature is not \(30\) isn’t only justified and true, but known as well, since 30 lies outside the \(20\pm3\) range.
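The Gettier verdict at \((19,20)\) can be verified with the same machinery. A sketch (again restricting to a finite grid of worlds; the variable names are my own):

```python
def accessible(w, w2):
    (r, a), (r2, a2) = w, w2
    margin = 2 + abs(r - a)
    return a2 == a and a - margin <= r2 <= a + margin

worlds = [(r, a) for r in range(0, 41) for a in range(0, 41)]
w = (19, 20)  # real temperature 19, thermostat reads 20

# Justified belief: the true temperature lies in 20 +/- 2, hence "not 23".
justified = abs(23 - 20) > 2   # 23 falls outside the believed range
true_belief = w[0] != 23       # and indeed the real temperature isn't 23
# But not known: (23, 20) is accessible from (19, 20), since 23 <= 20 + 3.
known = all(w2[0] != 23 for w2 in worlds if accessible(w, w2))

print(justified, true_belief, known)  # True True False: justified true belief, unknown
```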

We made a number of questionable assumptions on our way to this model. Many of them were simplifying idealizations that we can abandon without undermining the main result. For example, we could let temperatures be real numbers with an absolute zero. In cases well above zero, the same stipulations and results would obtain. We could also make the reliability of the thermostat more realistic, by making the margin of reliability smaller or asymmetric, for example. We could also change the “rate” at which my knowledge weakens as the reading gets further from the true temperature.

One thing we can’t abandon, however, is the very broad assumption that my knowledge does weaken as the reading becomes less accurate. It doesn’t have to weaken by 1 degree for every 1 degree the thermostat’s reading is off. It could weaken much more slowly, it could happen non-linearly, even quite erratically. But my knowledge must be weaker when the reading is more erroneous. Otherwise, my justified beliefs, which are based on the precision of my knowledge in cases where the reading is dead-on, won’t outstrip my knowledge in scenarios where the reading isn’t dead-on.

Cohen and Comesaña (2013) challenge this assumption and argue for a different definition of \(R\). Their definition still delivers Gettier cases, but interestingly, it also vindicates the \({\textbf{KK}}\) principle. (Their definition makes \(R\) transitive, per our earlier discussion in §4.4.1.) Nagel (2013) explores the motivations behind the model’s various other stipulations, and other authors in the same volume offer further illuminating discussion (Goodman 2013; Weatherson 2013). Williamson responds in T. Williamson (2013b).

Tools like probability theory and epistemic logic have numerous uses in many areas of philosophy besides epistemology. Here we’ll look briefly at just a few examples: how to make decisions, whether God exists, and what hypothetical discourses like ‘if…then …’ mean.

Should you keep reading this section, or should you stop here and go do something else? That all depends: what might you gain by continuing reading, and what are the odds those gains will surpass the gains of doing something else instead? Decision theory weighs these considerations to determine which choice is best.

To see how the weighing works, let’s start with a very simple example: betting on the outcome of a die-roll. In particular, let’s suppose a 5 or 6 will win you $19, while any other outcome loses you $10. Should you take this bet? We can represent the choice you face in the form of a table:

\[\begin{array}{r|cc}
 & \text{Roll 1–4} & \text{Roll 5 or 6}\\
\hline
\text{Bet} & -\$10 & +\$19\\
\text{Don't bet} & \$0 & \$0\\
\end{array}\]

So far, taking the bet looks pretty good: you stand to gain almost twice as much as you stand to lose. What the table doesn’t show, however, is that you’re twice as likely to lose as to win: \(2/3\) vs. \(1/3\). So let’s add this information in:

\[\begin{array}{r|cc}
 & \text{Roll 1–4} & \text{Roll 5 or 6}\\
\hline
\text{Bet} & \substack{-\$10\\ p=2/3} & \substack{+\$19\\ p=1/3}\\
\text{Don't bet} & \substack{-\$0\\ p=2/3} & \substack{+\$0\\ p=1/3}\\
\end{array}\]

Now we can see that the potential downside of betting, namely losing $10, isn’t outweighed by the potential upside. What you stand to win isn’t quite twice what you’d lose, but the probability of losing is twice as much. Formally, we can express this line of thinking as follows:

\[ (-10 \times 2/3) + (19 \times 1/3) = -1/3 < 0\]

In other words, when the potential losses and gains are weighed against their respective probabilities, their sum total fails to exceed 0. But $0 is what you can expect if you don’t bet. So betting doesn’t quite measure up to abstaining in this example.
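The weighing can be spelled out with exact fractions. A quick sketch of the calculation (variable names are my own):

```python
from fractions import Fraction

p_lose, p_win = Fraction(2, 3), Fraction(1, 3)   # roll 1-4 vs. roll 5-6
ev_bet = p_lose * (-10) + p_win * 19             # weigh payoffs by probabilities
ev_abstain = 0                                   # not betting pays $0 either way

print(ev_bet)                # -1/3
print(ev_bet > ev_abstain)   # False: abstaining comes out ahead
```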

That’s the basic idea at the core of decision theory, but it’s still a long way from being satisfactory. For one thing, this calculation assumes money is everything, which it surely isn’t. Suppose you need exactly $29 to get a bus home for the night, and all you have is the $10 bill in your pocket, which on its own is no use (even the cheapest drink at the casino bar is $11). So losing your $10 isn’t really much worse than keeping it—you might as well be broke either way. But gaining $19, now that’s worth a lot to you. If you can just get the bus back home, you won’t have to sleep rough for the night.

So we have to consider how much various dollar-amounts are worth to you. Losing $10 is worth about the same to you as losing $0, though gaining $19 is much, much more valuable. To capture these facts, we introduce a function, \(u\), which represents the utility of various possible outcomes. For you, \(u(-\$10) \approx u(-\$0)\), but \(u(+\$19) \gg u(-\$0)\).

Exactly how much is gaining $19 worth to you? What is \(u(+\$19)\), exactly? We can actually answer this question if we just set a scale first. For example, suppose we want to know exactly how much you value a gain of $19 on a scale that ranges from gaining nothing to gaining $100. Then we set \(u(+\$0)=0\) and \(u(+\$100)=1\), so that our scale ranges from 0 to 1. Then we can calculate \(u(+\$19)\) by asking how much you would be willing to risk to gain $100 instead of just $19. That is, suppose you had a choice between just being handed $19 with no strings attached vs. being offered a (free) gamble that pays $100 if you win, but nothing otherwise. How high would the probability of winning that $100 have to be for you to take a chance on it instead of the guaranteed $19? Given what’s at stake—making it home for the night vs. sleeping rough—you probably wouldn’t accept much risk for the chance at the full $100 instead of the guaranteed $19. Let’s say you’d accept at most .01 risk, i.e., the chance of winning the full $100 would have to be at least .99 for you to trade the guaranteed $19 for the chance at the full $100. Well, then, on a scale from gaining $0 to gaining $100, you value gaining $19 quite highly: .99 out of 1. (This method of measuring utility was popularized by von Neumann and Morgenstern (1944), though essentially the same idea was previously discovered by Ramsey (1964 [1926]).)
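The elicitation arithmetic is simple. A sketch on the 0-to-1 scale just described (the .01 risk figure is the one assumed in the example):

```python
u_0, u_100 = 0.0, 1.0   # anchor the scale: u(+$0) = 0, u(+$100) = 1
q = 0.99                # minimum winning chance you'd accept to give up the sure $19

# The utility of the sure $19 equals the expected utility of the gamble
# at the point of indifference: q * u(+$100) + (1 - q) * u(+$0).
u_19 = q * u_100 + (1 - q) * u_0
print(u_19)  # 0.99
```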

Our full decision theory relies on two functions then, \(p\) and \(u\). The probability function \(p\) reflects how likely you think the various possible outcomes of an action are to obtain, while \(u\) represents how desirable each outcome is. Faced with a choice between two possible courses of action, \(A\) and \(\neg A\), with two possible states the world might be in, \(S\) and \(\neg S\), there are four possible outcomes, \(O_1,\ldots,O_4\). For example, if you bet $1 on a coin-flip coming up heads and it does come up heads, outcome \(O_1\) obtains and you win $1; if instead it comes up tails, outcome \(O_2\) obtains and you lose $1. The general shape of such situations is thus:

\[
\begin{array}{c|cc}
 & S & \neg S \\
\hline
A & \substack{u(O_1)\\ p(S)} & \substack{u(O_2)\\ p(\neg S)} \\
\neg A & \substack{u(O_3)\\ p(S)} & \substack{u(O_4)\\ p(\neg S)}
\end{array}
\]

To weigh the probabilities and the utilities against each other, we then define the notion of expected utility:

Definition. The expected utility of act \(A\), \(EU(A)\), is defined: \[ EU(A) = p(S)u(O_1) + p(\neg S)u(O_2).\] The expected utility of act \(\neg A\), \(EU(\neg A)\), is likewise: \[ EU(\neg A) = p(S)u(O_3) + p(\neg S)u(O_4).\]

(Why “expected” utility? If you faced the same decision problem over and over again, and each time you chose option \(A\), in the long run you could expect your average utility to be approximately \(EU(A)\).) The same idea extends to cases with more than two ways things could turn out simply by adding columns to the table and multiplying/summing all the way across. When there are more than two possible actions, we just add more rows and do the same.
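The multiply-then-add recipe can be sketched directly, for any number of states. The probabilities and utilities below are illustrative assumptions: the $1 coin-flip bet from the example, taking utility to simply equal the dollar amount for the sake of the sketch:

```python
# Expected utility as multiply-then-add, for any number of states.
# The particular probabilities and utilities are illustrative assumptions.

def expected_utility(probs, utils):
    """EU = sum over states of p(state) * u(outcome in that state)."""
    return sum(p * u for p, u in zip(probs, utils))

# Two-state coin-flip bet: win $1 on heads, lose $1 on tails,
# with utility equated to dollars won for simplicity.
eu_bet = expected_utility([0.5, 0.5], [1, -1])
print(eu_bet)  # 0.0
```

Adding a column to the table corresponds to appending one more probability–utility pair to the lists; the sum runs all the way across regardless of how many states there are.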

Finally, our decision theory culminates in the following norm:

Expected Utility Maximization

Choose the option with the highest expected utility. (In case of a tie, either option is acceptable.)
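The norm itself amounts to taking an argmax over the available acts. A toy sketch, with hypothetical act names and numbers (again equating utility with dollars won for simplicity):

```python
# A toy implementation of Expected Utility Maximization.
# Act names, probabilities, and utilities are hypothetical.

def expected_utility(probs, utils):
    return sum(p * u for p, u in zip(probs, utils))

def choose(acts, probs):
    """Return the act(s) with highest expected utility.
    In case of a tie, every tied act is acceptable, so all are returned."""
    eus = {name: expected_utility(probs, utils) for name, utils in acts.items()}
    best = max(eus.values())
    return [name for name, eu in eus.items() if eu == best]

# Hypothetical 2x2 decision: bet $1 on heads vs. decline the bet.
acts = {"bet": [1, -1], "decline": [0, 0]}
print(choose(acts, [0.5, 0.5]))  # both acts tie at EU 0, so both are acceptable
```

Returning every maximizing act, rather than one, mirrors the tie clause of the norm: when expected utilities are equal, either option is acceptable.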

We haven’t given much of an argument for this rule, except that it “weighs” the desirability of each possible outcome against the probability that it will obtain. There are various ways one might develop this weighing idea, however. The one elaborated here is due to Savage (1954). It is considered the classic/orthodox approach in social sciences like economics and psychology. Philosophers, however, tend to prefer variations on Savage’s basic approach: either the “evidential” decision theory developed by Jeffrey (1965) or some form of “causal” decision theory (see entry) (Gibbard and Harper 1978; Skyrms 1980; Lewis 1981; Joyce 1999).

These approaches all agree on the broad idea that the correct decision rule weighs probabilities and utilities in linear fashion: multiply then add (see the entry on expected utility). A different approach recently pioneered by Buchak (2013, forthcoming) holds that (in)tolerance for risk throws a non-linear wrench into this equation, however (see also Steele 2007). And taking account of people’s cognitive limitations has long been thought to require further departures from the traditional, linear model (Kahneman and Tversky 1979; Payne, Bettman, and Johnson 1993; Gigerenzer, Todd, and the ABC Research Group 1999; Weirich 2004; Weisberg 2013).

The mathematical theories of probability and decision emerged together in correspondence between Blaise Pascal and Pierre de Fermat in the mid-17th century. Pascal went on to apply them to theological questions, developing his famous “wager” argument (see entry on Pascal’s Wager) for belief in God. Probability theory now commonly appears in discussions of other arguments for and against theism, especially the argument from design. Though Darwin is generally thought to have toppled theistic appeals to biological design, newer findings in cosmology and physics seem to support a new probabilistic argument for God’s