Searching for information is critical in many situations. In medicine, for instance, careful choice of a diagnostic test can help narrow down the range of plausible diseases that the patient might have. In a probabilistic framework, test selection is often modeled by assuming that people's goal is to reduce uncertainty about possible states of the world. In cognitive science, psychology, and medical decision making, Shannon entropy is the most prominent and most widely used model to formalize probabilistic uncertainty and the reduction thereof. However, a variety of alternative entropy metrics (Hartley, Quadratic, Tsallis, Rényi, and more) are popular in the social and the natural sciences, computer science, and philosophy of science. Particular entropy measures have been predominant in particular research areas, and it is often an open issue whether these divergences emerge from different theoretical and practical goals or are merely due to historical accident. Cutting across disciplinary boundaries, we show that several entropy and entropy reduction measures arise as special cases in a unified formalism, the Sharma–Mittal framework. Using mathematical results, computer simulations, and analyses of published behavioral data, we discuss four key questions: How do various entropy models relate to each other? What insights can be obtained by considering diverse entropy models within a unified framework? What is the psychological plausibility of different entropy models? What new questions and insights for research on human information acquisition follow? Our work provides several new pathways for theoretical and empirical research, reconciling apparently conflicting approaches and empirical findings within a comprehensive and unified information‐theoretic formalism.

1 Introduction

A key topic in the study of rationality, cognition, and behavior is the effective search for relevant information or evidence. Information search is also closely connected to the notion of uncertainty. Typically, an agent will seek to acquire information to reduce uncertainty about an inference or decision problem. Physicians prescribe medical tests in order to handle arrays of possible diagnoses. Detectives seek witnesses in order to identify the culprit of a crime. And, of course, scientists gather data in order to discriminate among different hypotheses.

In psychology and cognitive science, most early work on information acquisition adopted a logical, deductive inference perspective. In the spirit of Popper's (1959) influential falsificationist philosophy of science, the idea was that learners should seek information that could help them falsify hypotheses (e.g., expressed as a conditional or a rule; Wason, 1960, 1966, 1968). However, many human reasoners did not seem to believe that information is useful if and only if it can potentially rule out (falsify) a hypothesis. From the 1980s, cognitive scientists started analyzing human information search with a closer look at inductive inference, using probabilistic models to quantify the value of information and endorsing them as normative benchmarks (e.g., Baron, 1985; Klayman & Ha, 1987; Skov & Sherman, 1986; Slowiaczek, Klayman, Sherman, & Skov, 1992; Trope & Bassok, 1982, 1983). This research was inspired by seminal work in philosophy of science (e.g., Good, 1950), statistics (e.g., Lindley, 1956), and decision theory (Savage, 1972). In this view, each outcome of a query could modify an agent's beliefs about the hypotheses being considered, thus providing some amount of information. For instance, the key theoretical point of Oaksford and Chater's (1994, 2003) analysis of Wason's selection task was to conceptualize information acquisition as a piece of probabilistic inductive reasoning, assuming that people's goal is to reduce uncertainty about whether a rule holds or not. In a similar vein, researchers in vision science have used measures of uncertainty reduction to predict visual queries for gathering information (i.e., eye movements; Legge, Klitz, & Tjan, 1997; Najemnik & Geisler, 2005, 2009; Nelson & Cottrell, 2007; Renninger, Coughlan, Verghese, & Malik, 2005), or to guide a robot's eye movements (Denzler & Brown, 2002). Probabilistic models of uncertainty reduction have also been used to predict human query selection in causal reasoning (Bramley, Lagnado, & Speekenbrink, 2015), hypothesis testing (Austerweil & Griffiths, 2011; Navarro & Perfors, 2011; Nelson, Divjak, Gudmundsdottir, Martignon, & Meder, 2014; Nelson, Tenenbaum, & Movellan, 2001), and categorization (Meder & Nelson, 2012; Nelson, McKenzie, Cottrell, & Sejnowski, 2010).

If reducing uncertainty is a major cognitive goal and motivation for information acquisition, a critical issue is how uncertainty and the reduction thereof can be represented in a rigorous manner. A fruitful approach to formalizing uncertainty relies on the mathematical notion of entropy, which in turn generates a corresponding model of the informational utility of an experiment as the expected reduction of entropy (uncertainty), sometimes called expected information gain. In many disciplines, including psychology and neuroscience (Hasson, 2016), the most prominent model is Shannon (1948) entropy.
However, a number of non-equivalent measures of entropy have been suggested, and are being used, in a variety of research domains. Examples include the application of quadratic entropy in ecology (Lande, 1996), the family of Rényi (1961) entropies in computer science and image processing (Boztas, 2014; Sahoo & Arora, 2004), and Tsallis entropies in physics (Tsallis, 2011). It is currently unknown whether these other entropy models have the potential to address key theoretical and empirical questions in cognitive science. Here, we bring together these different models in a comprehensive theoretical framework, the Sharma–Mittal formalism (from Sharma & Mittal, 1975), which incorporates a large number of prominent entropy measures as special cases. Careful consideration of the formal properties of this family of entropy measures will reveal important implications for modeling uncertainty and information search behavior. Against this rich theoretical background, we will draw on existing behavioral data and novel simulations to explore how different models relate to each other, elucidate their psychological meaning and plausibility, and show how they can generate new testable predictions.

The remainder of this paper is organized as follows. We begin by spelling out what an entropy measure is and how it can be employed to represent uncertainty and the informational value of queries (questions, tests, experiments) (Section 2). Subsequently, we review four representative and influential definitions of entropy, namely Quadratic, Hartley, Shannon, and Error entropy (Section 3). These models have been, and continue to be, of importance in different areas of research. In the main theoretical section of the paper, we describe a unified formal framework generating a biparametric continuum of entropy measures. Drawing on work in generalized information theory, we show that many extant models of entropy and expected entropy reduction can be embedded in this comprehensive formalism (Section 4). We provide a number of new mathematical results in this section. We also address the theoretical meaning of the parameters involved when the target domain of application is human reasoning, with implications for both normative and descriptive approaches. We then further elaborate on the connection with experimental research in several ways (Section 5). First, we present simulation results from an extensive exploration of information search decision problems in which alternative models provide strongly diverging, empirically testable predictions (Section 5.1). Second, we report and discuss an overarching analysis of the information-theoretic account of the most widely known experimental paradigm for the study of information gathering, i.e., Wason's (1966, 1968) abstract selection task (Section 5.2). Then we investigate which models perform better against data from a range of experience-based studies on human information search behavior (Meder & Nelson, 2012; Nelson et al., 2010) (Section 5.3). We also point out that some entropy models from this framework offer a potential explanation of human information search behavior in experiments where probabilities are conveyed through words and numbers, behavior which has so far resisted theoretical explanation (Section 5.4). Finally, we show that new models offer a theoretically satisfying and descriptively adequate unification of disparate results across different kinds of tasks (Section 5.5).
In the General Discussion (Section 6), we outline and assess the prospects of a generalized information-theoretic framework for guiding the study of human inference and decision making. Part of our discussion relies on and elaborates mathematical analyses, including novel results. Moreover, although a number of the mathematical points in the paper can be found scattered through the mathematics and physics literature, here we bring them together systematically. We provide Supplementary Materials where non-trivial derivations are given according to our unified notation. Throughout each section of the text, statements requiring a mathematical proof are flagged by brackets (Supplementary Material S1), and the proof is then presented in the corresponding subsection of the Supplementary Materials file. Among the formal results that are, to the best of our knowledge, novel, we find the following especially important: the ordinal equivalence of Sharma–Mittal entropy measures of the same order (proof in Supplementary Material S1, section 4), the additivity of all Sharma–Mittal measures of expected entropy reduction for sequential tests (again Supplementary Material S1, section 4), and the distinctive role of the degree parameter in information search tasks such as the Person Game (Supplementary Material S1, section 5). Further novel results include the subsumption of diverse models such as the Arimoto (1971) and the Power entropies within the Sharma–Mittal framework (Supplementary Material S1, section 3), and the specification of how a number of different entropy measures can be construed within the general theory of means (Table 2).

2 Entropies, uncertainty, and information search

According to a well-known anecdote, the origins of information theory were marked by a witty joke of John von Neumann. Claude Shannon was doubtful how to call the key concept of his groundbreaking work on the "mathematical theory of communication" (Shannon, 1948). "You should call it entropy," von Neumann suggested. Of course, von Neumann must have been aware of the close connections between Shannon's formula and Boltzmann's definition of entropy in classical statistical mechanics. But the most important reason for his suggestion, von Neumann quipped, was that "nobody knows what entropy really is, so in a debate you will always have the advantage" (see Tribus & McIrvine, 1971). Shannon accepted the advice.

Several decades later, von Neumann's remark seems even more pointed, if anything. Influential observers have voiced caution and concern about the proliferation of mathematical analyses of entropy and related notions (Aczél, 1984, 1987). Meanwhile, many applications have been developed, for instance in physics and ecology (see, e.g., Beck, 2009; Keylock, 2005). But recurrent theoretical controversies have arisen, too, along with occasional complaints of conceptual confusion (see Cho, 2002; and Jost, 2006, respectively). Luckily, these thorny issues will be tangential to our main concerns. Although a given formalization of entropy can be considered for the representation and measurement of different constructs in each of a variety of domains, we focus on one target concept for which entropies can be employed, namely the uncertainty concerning a variable X given a probability distribution P. In this regard, the key question is the following: How much uncertainty is conveyed about variable X by a given probability distribution P? This notion is central to the normative and descriptive study of human cognition.

Suppose, for instance, that an infection can be caused by three different types of virus, and label x_1, x_2, x_3 the corresponding possibilities. Consider two different probability assignments over these possibilities, say P and P*. Is the uncertainty about X = {x_1, x_2, x_3} greater under P or under P*? An entropy measure enables us to give precise quantitative values in both cases, and hence a clear answer. Importantly, however, the answer will often be measure-dependent, for different entropy measures convey different ideas of uncertainty and exhibit distinct mathematical properties of theoretical interest. We will see this in detail later on. Once uncertainty as our conceptual target has been outlined, we can turn to entropy as a mathematical object.
Consider a finite set X of n mutually exclusive and jointly exhaustive possibilities x_1, …, x_n on which a probability distribution P(X) is defined, so that P(X) = {P(x_1), …, P(x_n)}, with P(x_i) ≥ 0 for any i (1 ≤ i ≤ n) and Σ_i P(x_i) = 1. The n elements in X = {x_1, …, x_n} can be taken as representing different kinds of entities, such as events, categories, or propositions. For our purposes, ent is an entropy measure if it is a function f of the relevant probability values only, i.e., ent_P(X) = f(P(x_1), …, P(x_n)), and function f satisfies a small number of basic properties (see below). Notice that, in general, an entropy function can be readily extended to the case of a conditional probability distribution given some datum y. In fact, under the conditional probability distribution P(X|y), one has ent_P(X|y) = f(P(x_1|y), …, P(x_n|y)).

Shannon entropy has been so prominent in cognitive science that some readers will ask: why not just stick with it? More specific objections in this vein include that Shannon entropy is uniquely axiomatically motivated, that it is already central to psychological theories of the value of information, or that it is optimal in certain applied situations. Each objection can be addressed separately. First, a number of entropy metrics in our generalized framework (not only Shannon) have been or can be uniquely derived from specific sets of axioms (see Csiszár, 2008). Second, although Shannon entropy has a number of intuitively desirable properties, it is not a descriptively competitive psychological model of the value of information in some tasks (e.g., Nelson et al., 2010). Third, several published papers in applied domains report superior performance when other entropy measures are used (e.g., Ramírez-Reyes, Hernández-Montoya, Herrera-Corral, & Domínguez-Jiménez, 2016). Indeed, Shannon's (1948) own view was that although axiomatic characterization can lend plausibility to measures of entropy and information, "the real justification" (p. 393) rests on the measures' operational relevance. A generalized mathematical framework can increase our theoretical understanding of the relationships among different measures, unify diverse psychological findings, and generate novel questions for future research.

Scholars have used different properties as defining an entropy measure (see, e.g., Csiszár, 2008). Besides some usual technical requirements (like non-negativity), a key idea is that entropy should be appropriately sensitive to how even or uneven a distribution is, at least with respect to the extreme cases of a uniform probability function, U(X) = {1/n, …, 1/n}, and of a deterministic function V(X) where V(x_i) = 1 for some i (1 ≤ i ≤ n) and 0 for all other xs. (In the latter case, the distribution actually reflects a truth-value assignment, in logical parlance.) In our setting, U(X) represents the highest possible degree of uncertainty about X, while under V(X) the true value of X is known for sure, and no uncertainty is left. Hence it must hold that, for any X and P(X), ent_V(X) ≤ ent_P(X) ≤ ent_U(X), with at least one inequality strict. This basic and minimal condition we label evenness sensitivity. It is conveyed by Shannon entropy as well as many others, as we shall see, and it guarantees, for instance, that entropy is strictly higher for a distribution like {1/3, 1/3, 1/3} than for {1, 0, 0}.
Once the idea of an entropy measure is characterized, one can study different measures of expected entropy reduction. This amounts to considering two variables X and Y, and defining the expected reduction of the initial entropy of X across the elements of Y. To illustrate, in the viral infection example mentioned above, X may concern the type of virus actually involved, while Y could be some clinically observable marker (like the result of a blood test) which is informationally relevant for X. Mathematically, given a joint probability distribution P(X,Y) over the combination of two variables X and Y (i.e., their Cartesian product X × Y), the actual change in entropy about X determined by an element y in Y can be represented as ent_P(X) − ent_P(X|y). Accordingly, the expected reduction of the initial entropy of X across the elements of Y can be computed in a standard way, as follows:

R_P(X,Y) = Σ_j P(y_j) [ent_P(X) − ent_P(X|y_j)]

The notation is adapted from work on the foundations of Bayesian statistics, where the expected reduction in entropy is seen as measuring the dependence of variable X on variable Y, or the relevance of Y for X (see, e.g., Dawid & Musio, 2014). Very much as for entropy itself, the expected reduction of entropy remains as general and neutral a notion as possible. R measures, too, can be given different interpretations in different domains. In many contexts, it is plausibly assumed that reduction of uncertainty is a major dimension of the purely informational (or epistemic) value of the search for more data. We will thus consider a measure R as providing a formal approach to questions of the following kind: Given X as a target of investigation, what is the expected usefulness of finding out about Y from a purely informational point of view? Hence, the notion of uncertainty is tightly coupled to the rational assessment of the expected informational utility of pursuing a given search for additional evidence (performing a query, executing a test, running an experiment). (See Crupi & Tentori, 2014; Nelson, 2005, 2008. For more discussion, also see Evans & Over, 1996; Roche & Shogenji, 2016.)

Formally, X and Y can just be seen as partitions of possibilities. In this interpretation, however, they play quite different roles in R_P(X,Y). The first argument, X, represents the overall goal of the inquiry, while the second, Y, is supposed to be directly accessible to the information seeker. In a typical application, Y will be a more or less useful test to learn about target X, although unable to conclusively establish what the true hypothesis in X is. In general, the occurrence of one particular element y of Y does not need to reduce the initial entropy about X; it might as much increase it, hence making ent_P(X) − ent_P(X|y) negative. This quantity can be negative if (for instance) datum y changes probabilities from P(X) = {0.9, 0.1} to P(X|y) = {0.6, 0.4}. But can R_P(X,Y), i.e., the expected informational usefulness of Y for learning about X, be negative?
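Before addressing that question, it may help to make the definition of R concrete. The following minimal Python sketch (our own illustration, not part of the original analyses; the function names are ours) computes R_P(X,Y) from a joint distribution, plugging in Shannon entropy; any entropy function with the same signature could be swapped in.

```python
import math

def ent_shannon(dist):
    """Shannon entropy of a probability distribution (natural log)."""
    return sum(p * math.log(1 / p) for p in dist if p > 0)

def expected_reduction(joint, ent=ent_shannon):
    """R_P(X,Y): expected reduction of the entropy of X across Y.
    joint[j][i] holds P(x_i, y_j): rows are the outcomes of the query Y,
    columns are the hypotheses in X."""
    prior = [sum(row[i] for row in joint) for i in range(len(joint[0]))]
    total = 0.0
    for row in joint:
        p_y = sum(row)
        if p_y > 0:
            posterior = [p / p_y for p in row]
            total += p_y * (ent(prior) - ent(posterior))
    return total

joint = [[0.50, 0.00],   # outcome y1: x1 becomes certain
         [0.25, 0.25]]   # outcome y2: x1 and x2 end up equally likely
print(expected_reduction(joint))  # about 0.216 > 0: the query is useful
```

Note that in this toy case the second outcome's term in the sum is negative (that outcome increases entropy), yet the expectation across outcomes remains positive; this prefigures the question just raised about whether R itself can go negative.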
Some R measures are strictly non-negative, but others can in fact be negative in expectation; this depends on key properties of the underlying entropy measure, as we discuss later on. To summarize, in the domain of human cognition, probability distributions can be employed to represent an agent's degrees of belief (be they based on objective statistical information or subjective confidence), with entropy providing a formalization of the uncertainty about X (given P). Relying on the reduction of uncertainty as an informational utility, R_P(H,E) is then interpreted as a measure of the expected usefulness of a query (test, experiment) E relative to a target hypothesis space H. From now on, to emphasize this interpretation, we will often use H = {h_1, …, h_n} to denote a hypothesis set of interest and E = {e_1, …, e_m} for a possible search for evidence. Table 1 summarizes our terminology in this respect as well as for the subsequent sections.

Table 1. Notation employed

H = {h_1, …, h_n}: A partition of n possibilities (or hypothesis space)
P(H): Probability distribution P defined over the elements of H
P(H|e): Probability distribution P defined over the elements of H conditional on e
U(H): Uniform probability distribution over the elements of H
V(H): A probability distribution such that V(h_i) = 1 for some i (1 ≤ i ≤ n) and 0 for all other hs
H × E: The variable obtained by the combination (Cartesian product) of variables H and E
P(H,E): Joint probability distribution over the combination of variables H and E
H ⊥_P E: Given P(H,E), variables H and E are statistically independent
H ⊥_P E | F: Given P(H,E,F), variables H and E are statistically independent conditional on each element in F
ent_P(H): Entropy of H given P(H)
ent_P(H|e): Conditional entropy of H on e given P(H|e)
ent_P(H) − ent_P(H|e): Reduction of the initial entropy of H provided by e
R_P(H,E): Expected reduction of the entropy of H across the elements of E, given P(H,E)
R_P(H,E|f): Expected reduction of the entropy of H across the elements of E, given P(H,E|f)
R_P(H,E|F): Expected value of R_P(H,E|f) across the elements of F, given P(H,E,F)
ln_t(x): The Tsallis generalization of the natural logarithm (with parameter t)
e_t(x): The Tsallis generalization of the ordinary exponential (with parameter t)

3 Four influential entropy models

We will now briefly review four important models of entropy and the corresponding models of expected entropy reduction.

3.1 Quadratic entropy

Entropy/Uncertainty. Some interesting entropy measures were originally proposed long before the exchange between Shannon and von Neumann, when entropy was not yet a scientific term outside statistical thermodynamics. Here is one major instance:

ent_Quad_P(H) = Σ_h P(h)(1 − P(h))

Labeled Quadratic entropy in Vajda and Zvárová (2007), this measure is widely known as the Gini (or Gini–Simpson) index, after Gini (1912) and Simpson (1949) (also see Gibbs & Martin, 1962). It is often employed as an index of biological diversity (see, e.g., Patil & Taille, 1982) and sometimes spelled out in the equivalent formulation ent_Quad_P(H) = 1 − Σ_h P(h)². The first formula suggests a meaningful interpretation with H amounting to a partition of hypotheses considered by an uncertain agent. In this reading, ent_Quad computes the average (expected) surprise that the agent would experience in finding out what the true element of H is, given 1 − P(h) as a measure of the surprise that arises in case h obtains (see Crupi & Tentori, 2014).2

Entropy reduction/Informational value of queries. Quadratic entropy reduction, namely ent_Quad_P(H) − ent_Quad_P(H|e), has been occasionally mentioned in philosophical analyses of scientific inference (Niiniluoto & Tuomela, 1973, p. 67). In turn, its associated expected reduction measure, R_Quad_P(H,E), was applied by Horwich (1982, pp. 127–129), again in formal philosophy of science, and studied in computer science by Raileanu and Stoffel (2004).

3.2 Hartley entropy

Entropy/Uncertainty. Gini's work did not play any apparent role in the development of Shannon's (1948) theory. A seminal paper by Hartley (1928), however, was a starting point for Shannon's analysis. One lasting insight of Hartley was the introduction of logarithmic functions, which have become ubiquitous in information theory ever since. As Hartley also realized, the choice of a base for the logarithm is a matter of conventionally setting a unit of measurement (Hartley, 1928, pp. 539–541). Throughout our discussion, we will employ the natural logarithm, denoted as ln. Inspired by Hartley's (1928) original idea that observing one among n possible values of a variable is increasingly informative the larger n is, and that this immediately reflects the entropy of that variable, one can define the Hartley entropy as follows (Aczél, Forte, & Ng, 1974):

ent_Hartley_P(H) = ln(Σ_h P(h)⁰)

Under the convention 0⁰ = 0 (which is standard in the entropy literature), and given that P(h_i)⁰ = 1 whenever P(h_i) > 0, ent_Hartley computes the logarithm of the number of all non-null probability elements in H.

Entropy reduction/Informational value of queries. When applied to the domain of reasoning and cognition, the implications of Hartley entropy reveal an interesting Popperian flavor. A piece of evidence e is useful, it turns out, only to the extent that it excludes ("falsifies") at least some of the hypotheses in H, for otherwise the reduction in Hartley entropy, ent_Hartley_P(H) − ent_Hartley_P(H|e), is just zero.
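To make the contrast tangible, here is a minimal Python sketch (our own illustration; function names and numbers are ours) of the two measures just defined. Note how Hartley entropy reduction ignores any belief change that falsifies nothing:

```python
import math

def ent_quadratic(dist):
    """Quadratic (Gini-Simpson) entropy: expected surprise 1 - P(h)."""
    return sum(p * (1 - p) for p in dist)

def ent_hartley(dist):
    """Hartley entropy: log of the number of non-null possibilities."""
    return math.log(sum(1 for p in dist if p > 0))

prior = [0.5, 0.3, 0.2]
posterior_soft = [0.8, 0.1, 0.1]   # beliefs shift, nothing is falsified
posterior_hard = [0.6, 0.4, 0.0]   # one hypothesis is ruled out

# Quadratic entropy registers the shift in the first case...
print(ent_quadratic(prior) - ent_quadratic(posterior_soft))  # 0.28 > 0
# ...whereas Hartley reduction is zero unless something is falsified.
print(ent_hartley(prior) - ent_hartley(posterior_soft))      # 0.0
print(ent_hartley(prior) - ent_hartley(posterior_hard))      # ln(3) - ln(2) > 0
```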
An agent adopting such a measure of informational utility would then only value a test outcome e insofar as it conclusively rules out at least one hypothesis in H. If no possible outcome in E is potentially a "falsifier" for some hypothesis in H, then the expected reduction of Hartley entropy, R_Hartley, is also zero, implying that query E has no expected usefulness at all with respect to H.

3.3 Shannon entropy

Entropy/Uncertainty. In many contexts, the notion of entropy is simply and immediately equated with Shannon's formalism. Overall, such special consideration is well-deserved and motivated by countless applications spread over virtually all branches of science. The form of Shannon entropy is fairly well-known:

ent_Shannon_P(H) = Σ_h P(h) ln(1/P(h))

Concerning the interpretation of the formula, many points made earlier for Quadratic entropy apply to Shannon entropy too, given relevant adjustments. In fact, ln(1/P(h)) is another measure of the surprise in finding out that a state of affairs h obtains, and thus ent_Shannon is its overall expected value relative to H.3

Entropy reduction/Informational value of queries. The reduction of Shannon entropy, ent_Shannon_P(H) − ent_Shannon_P(H|e), is sometimes called information gain and is often considered a measure of the informational utility of a datum e. Its expected value, also called expected information gain, R_Shannon_P(H,E), is then viewed as a measure of the usefulness of query E for learning about H. (See, e.g., Austerweil & Griffiths, 2011; Bar-Hillel & Carnap, 1953; Lindley, 1956; Oaksford & Chater, 1994, 2003; and Ruggeri & Lombrozo, 2015; also see Benish, 1999; and Nelson, 2005, 2008, for more discussion.)

3.4 Error entropy

Entropy/Uncertainty. Given a distribution P(H) and the goal of predicting the true element of H, a rational agent would plausibly select h* such that P(h*) = max_i P(h_i), and 1 − max_i P(h_i) would then be the probability of error. Since Fano's (1961) seminal work, this quantity has received considerable attention in information theory. Also known as Bayes's error, we will call this quantity Error entropy:

ent_Error_P(H) = 1 − max_i P(h_i)

Note that ent_Error is only concerned with the largest value in the distribution P(H), namely max_i P(h_i). The lower that value, the higher the chance of error were a guess to be made, thus the higher the uncertainty about H.

Entropy reduction/Informational value of queries. Unlike the other models above, Error entropy has seldom been considered in the natural or social sciences. However, it can be taken as a sound basis for the analysis of rational behavior. In the latter domain, it is quite natural to rely on the reduction of the expected probability of error as the utility of a datum (often labeled probability gain; see Baron, 1985; Nelson, 2005, 2008) and on its expected value, R_Error_P(H,E), as the usefulness of a query or test. Indeed, there are important occurrences of this model in the study of human cognition.4
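A companion sketch (again our own illustration, with numbers chosen by us) shows the defining trait of Error entropy: it tracks only the modal probability, whereas Shannon entropy also registers differences elsewhere in the distribution:

```python
import math

def ent_shannon(dist):
    """Shannon entropy: expected surprise ln(1/P(h))."""
    return sum(p * math.log(1 / p) for p in dist if p > 0)

def ent_error(dist):
    """Error entropy (Bayes's error): 1 minus the modal probability."""
    return 1 - max(dist)

# Two distributions with the same modal probability:
p1 = [0.6, 0.2, 0.2]
p2 = [0.6, 0.4, 0.0]
print(ent_error(p1), ent_error(p2))      # identical: 0.4 and 0.4
print(ent_shannon(p1), ent_shannon(p2))  # differ: p2 is the less uncertain
```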

4 A unified framework for uncertainty and information search

The set of models introduced above represents a diverse sample in historical, theoretical, and mathematical terms (see Fig. 1 for a graphical illustration). Is the prominence of particular models due to fundamental distinctive properties, or largely due to historical accident? What are the relationships among these models? In this section we show how all of these models can be embedded in a unified mathematical formalism, providing new insight.

Figure 1. A graphical illustration of Quadratic, Hartley, Shannon, and Error entropy as distinct measures of uncertainty over a binary hypothesis set H = {h, ¬h} as a function of the probability of h.

4.1 Sharma–Mittal entropies

Let us take Shannon entropy again as a convenient starting point. As noted above, Shannon entropy is an average, more precisely a self-weighted average, displaying the following structure:

ent_P(H) = Σ_h P(h) · inf[P(h)]

The label self-weighted indicates that each probability P(h) serves as a weight for the value of function inf having that same probability as its argument, namely inf[P(h)]. The function inf can be seen as capturing a notion of atomic information (or surprise), assigning a value to each distinct element of H on the basis of its own probability (and nothing else). An obvious requirement here is that inf should be a decreasing function, because a finding that was antecedently highly probable (improbable) provides little (much) new information (an idea that Floridi, 2013, calls the "inverse relationship principle" after Barwise, 1997, p. 491). In Shannon entropy, one has inf(x) = ln(1/x). Given inf(x) = 1 − x, instead, Quadratic entropy arises from the very same scheme above.

A self-weighted average is a special case of a generalized (self-weighted) mean, which can be characterized as follows:

ent_P(H) = g⁻¹(Σ_h P(h) · g(inf[P(h)]))

where g is a differentiable and strictly monotonic function (see Wang & Jiang, 2005; also see Muliere & Parmigiani, 1993, for the fascinating history of these ideas). For different choices of g, different kinds of (self-weighted) means are instantiated. With g(x) = x, the weighted average above obtains once again. For another standard instance, g(x) = 1/x gives rise to the harmonic mean. Let us now consider the form of generalized (self-weighted) means above and focus on the following setting:

inf(x) = ln_t(1/x) and g(x) = e_t(x)^(1−r)

where ln_t(x) = (x^(1−t) − 1)/(1 − t) and e_t(x) = [1 + (1 − t)x]^(1/(1−t)) are generalized versions of the natural logarithm and exponential functions, respectively, often associated with Tsallis's (1988) work. Importantly, the ln_t function recovers the ordinary natural logarithm ln in the limit for t → 1, so that one can safely equate ln_t(x) = ln(x) for t = 1 and have a nice and smooth generalized logarithmic function.5
Similarly, it is assumed that e_t(x) = e^x for t = 1, as this is the limit for t → 1 (Supplementary Material S1, section 1). Negative values of the parameters r and t need not concern us here: we'll be assuming r, t ≥ 0 throughout. Once fed into the generalized means equation, these specifications of inf(x) and g(x) yield a two-parameter family of entropy measures of order r and degree t (Supplementary Material S1, section 2):

ent_SM^(r,t)_P(H) = (1/(1 − t)) [(Σ_h P(h)^r)^((1−t)/(1−r)) − 1]

The label SM refers to Sharma and Mittal (1975), where this formalism was originally proposed (also see Hoffmann, 2008; and Masi, 2005). All functions in the Sharma–Mittal family are evenness sensitive (see Section 2 above), thus in line with a basic characterization of entropies (Supplementary Material S1, section 2). Also, with appropriate settings of r and t, one can embed the whole set of four classic measures in our initial list. More precisely (Supplementary Material S1, section 3):

Quadratic entropy can be derived from the Sharma–Mittal family for r = t = 2, that is, ent_SM^(2,2) = ent_Quad;

Hartley entropy can be derived from the Sharma–Mittal family for r = 0 and t = 1, that is, ent_SM^(0,1) = ent_Hartley;

Shannon entropy can be derived from the Sharma–Mittal family for r = t = 1, that is, ent_SM^(1,1) = ent_Shannon;

Error entropy is recovered from the Sharma–Mittal family in the limit for r → ∞ when t = 2, so that we have ent_SM^(∞,2) = ent_Error.

A good deal more can be said about the scope of this approach: see Figs. 2 and 3, Table 2, and Supplementary Material S1 (section 3) for additional material. Here, we will only mention briefly three important further points about R-measures in the Sharma–Mittal framework and their meaning for modeling information search behavior. They are presented after the following figures and table.

Figure 2. The Sharma–Mittal family of entropy measures represented in a Cartesian quadrant, with values of the order parameter r and of the degree parameter t lying on the x- and y-axis, respectively. Each point in the quadrant corresponds to a specific entropy measure; each line corresponds to a distinct one-parameter generalized entropy function. Several special cases are highlighted. (Relevant references and formulas are listed in Table 2.)

Figure 3. A graphical illustration of the generalized atomic information function ln_t(1/P(h)) for four different values of the parameter t (0, 1, 2, and 5, respectively, for the curves from top to bottom). Appropriately, the amount of information arising from finding out that h is the case is a decreasing function of P(h). For high values of t, however, such decrease is flattened: with t = 5 (the lowest curve in the figure), finding out that h is true provides almost the same amount of information for a large set of initial probability assignments.

Table 2. A summary of the Sharma–Mittal framework and several of its special cases, with the defining (r,t)-setting and a key reference for each.

Sharma–Mittal: r ≥ 0, t ≥ 0 (Sharma & Mittal, 1975)
Effective Numbers: r ≥ 0, t = 0 (Hill, 1973)
Rényi: r ≥ 0, t = 1 (Rényi, 1961)
Power entropies: r ≥ 0, t = 2 (Laakso & Taagepera, 1979)
Gaussian: r = 1, t ≥ 0 (Frank, 2004)
Arimoto: r ≥ 0, t = 2 − 1/r (Arimoto, 1971)
Tsallis: r = t ≥ 0 (Tsallis, 1988)
Quadratic: r = t = 2 (Gini, 1912)
Shannon: r = t = 1 (Shannon, 1948)
Hartley: r = 0, t = 1 (Hartley, 1928)
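To check the parameter settings just listed numerically, here is a minimal Python sketch of our own (not from the original paper; the function names are ours). It implements the Sharma–Mittal formula through the generalized logarithm and approximates the r → ∞ limiting case with a large finite order:

```python
import math

def ln_t(x, t):
    """Tsallis generalized logarithm; recovers ln(x) in the limit t -> 1."""
    return math.log(x) if abs(t - 1) < 1e-9 else (x ** (1 - t) - 1) / (1 - t)

def ent_sm(dist, r, t):
    """Sharma-Mittal entropy of order r and degree t, computed as ln_t of
    the r-order effective number of the distribution."""
    dist = [p for p in dist if p > 0]
    if abs(r - 1) < 1e-9:  # order-1 limit: exp of Shannon entropy
        eff = math.exp(sum(p * math.log(1 / p) for p in dist))
    else:
        eff = sum(p ** r for p in dist) ** (1 / (1 - r))
    return ln_t(eff, t)

P = [0.7, 0.2, 0.1]
print(ent_sm(P, 2, 2), 1 - sum(p * p for p in P))            # Quadratic
print(ent_sm(P, 0, 1), math.log(3))                          # Hartley
print(ent_sm(P, 1, 1), sum(p * math.log(1 / p) for p in P))  # Shannon
print(ent_sm(P, 500, 2), 1 - max(P))                         # Error (r large)
```

The first three calls match the Quadratic, Hartley, and Shannon values up to floating-point error; the last agrees with Error entropy to about three decimal places, as expected of a finite approximation of the r → ∞ limit.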
Additivity of expected entropy reduction. For any H, E, F and P(H,E,F), R_P(H, E × F) = R_P(H,E) + R_P(H,F|E). This statement means that, for any Sharma–Mittal R-measure, the informational utility of a combined test E × F for H amounts to the sum of the plain utility of E and the utility of F that is expected considering all possible outcomes of E (Supplementary Material S1, section 4). (Formally, R_P(H,F|E) = Σ_j P(e_j) R_P(H,F|e_j), where R_P(H,F|e_j) denotes the expected entropy reduction of H provided by F as computed when all relevant probabilities are conditionalized on e_j.) According to Nelson's (2008) discussion, this elegant additivity property of expected entropy reduction is important and highly desirable as concerns the analysis of the rational assessment of tests or queries. Moreover, one can see that the additivity of expected entropy reduction extends to any finite chain of queries and can thus be applied to sequential search tasks such as those experimentally investigated by Nelson et al. (2014).

Irrelevance. For any H, E and P(H,E), if either E = {e} or H ⊥_P E, then R_P(H,E) = 0. This statement says that two special kinds of queries can be known in advance to be of no use, that is, informationally inconsequential relative to the hypothesis set of interest. One is the case of an empty test E = {e} with a single possibility that is already known to obtain with certainty, so that P(e) = 1. As suggested vividly by Floridi (2009, p. 26), this would be like consulting the raven in Edgar Allan Poe's famous poem, which is known to give one and the same answer no matter what (it always spells out "Nevermore"). The other case is when variables H and E are unrelated, that is, statistically independent according to P (H ⊥_P E in our notation). In both of these circumstances, R_P(H,E) = 0 simply because the prior and posterior distributions on H are identical for each possible value of E, so that no entropy reduction can ever obtain.

By the irrelevance condition, empty and unrelated queries have zero expected utility, but can a query E have a negative expected utility? If so, a rational agent would be willing to pay a cost just for not being told what the true state of affairs is as concerns E, much as an abandoned lover who wants to be spared being told whether her/his beloved is or is not happy because s/he expects more harm than good. Note, however, that for the lover non-informational costs are clearly involved, while we are assuming queries or tests to be assessed in purely informational terms, bracketing all further factors (for work involving situation-specific payoffs, see, e.g., Raiffa & Schlaifer, 1961; Meder & Nelson, 2012; and Markant & Gureckis, 2012). In this perspective, it is reasonable and common to see irrelevance as the worst-case scenario and exclude the possibility of informationally harmful tests: an irrelevant test (whether empty or statistically unrelated) simply cannot tell us anything of interest, but that is as bad as it can get (for seminal analyses, see Good, 1967, and Goosens, 1976; also see Dawid, 1998).6 Interestingly, not all Sharma–Mittal measures of expected entropy reduction are non-negative. Some of them do allow for the controversial idea that there could exist detrimental tests in purely informational terms, such that an agent should rank them worse than an irrelevant search and take active measures to avoid them (despite them having, by assumption, no intrinsic cost). Mathematically, a non-negative measure is generated if and only if the underlying entropy measure is a concave function (Supplementary Material S1, section 4), and the condition for concavity is as follows:

Concavity: ent_SM^(r,t)_P(H) is a concave function of {P(h_1), …, P(h_n)} just in case t ≥ 2 − 1/r.7

In terms of Fig. 2, this means that any entropy (represented by a point) below the Arimoto curve is not generally concave (see Fig. 4 for a graphical illustration of a strongly non-concave entropy measure).
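For a concrete feel of what non-concavity allows, the following sketch (our own illustration) uses a degree-1 measure of order 4, i.e., a Rényi entropy, which lies below the Arimoto curve; this is our stand-in for a non-concave measure, not necessarily the one used elsewhere in the paper. The probabilities are those of test E in case 6 of Table 3 below, and the expected reduction comes out negative:

```python
import math

def ent_renyi(dist, r):
    """Renyi entropy of order r: a degree-1 Sharma-Mittal measure."""
    return math.log(sum(p ** r for p in dist if p > 0)) / (1 - r)

# Test E of case 6 in Table 3: certainty with probability 0.49,
# a completely flat posterior with probability 0.51.
prior = [0.66, 0.17, 0.17]
outcomes = [(0.49, [1.0, 0.0, 0.0]), (0.51, [1/3, 1/3, 1/3])]

r = 4  # degree 1, order 4: below the Arimoto curve, hence non-concave
expected = sum(p_e * (ent_renyi(prior, r) - ent_renyi(post, r))
               for p_e, post in outcomes)
print(expected)  # about -0.009: rated worse than an irrelevant test
```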
Thus, if the concavity of ent is required (to preserve the non-negativity of R), then many prominent special cases are retained (including Quadratic, Hartley, Shannon, and Error entropy), but a significant part of the whole Sharma–Mittal parameter space is ruled out. This concerns, for instance, entropies of degree 1 and order higher than 1 (see Ben-Bassat & Raviv, 1978).

Figure 4. Graphical illustration of a non-concave entropy for a binary hypothesis set H = {h, ¬h} as a function of the probability of h.

4.2 Psychological interpretation of the order and degree parameters

4.2.1 The order parameter r: Imbalance and continuity

What is the meaning of the order parameter in the Sharma–Mittal formalism when entropies and expected entropy reduction measures represent uncertainty and the value of queries, respectively? To clarify, let us consider what happens with extreme values of r, i.e., if r = 0 or r goes to infinity, respectively (Supplementary Material S1, section 3). Given the convention 0⁰ = 0, Σ_h P(h)⁰ simply computes the number of all elements in H with a non-null probability. Accordingly, when r = 0, entropy becomes an (increasing) function of the mere number of the "live" (non-zero probability) options in H. When r goes to infinity, on the other hand, entropy becomes a (decreasing) function of the probability of a single element in H, i.e., the most likely hypothesis. This shows that the order parameter r is an index of the imbalance of the entropy function, which indicates how much the entropy measure discounts minor (low probability) hypotheses. For order-0 measures, the actual probability distribution is neglected: non-zero probability hypotheses are just counted, as if they were all equally important (see Gauvrit & Morsanyi, 2014). For order-∞ measures, on the other hand, only the most probable hypothesis matters, and all other hypotheses are disregarded altogether. For intermediate values of r, more likely hypotheses count more, but less likely hypotheses do retain some weight. The higher (lower) r is, the more (less) the likely hypotheses are regarded and the unlikely hypotheses are discounted. Importantly, for extreme values of the order parameter, an otherwise natural idea of continuity fails in the measurement of entropy: when r goes to either zero or infinity, it is not the case that small (large) changes in the probability distribution P(H) produce comparably small (large) changes in entropy values.

To see better how order-0 entropy measures behave, consider the simplest of them, ent_SM^(0,0)_P(H) = n⁺ − 1, where n⁺ = Σ_h P(h)⁰, so n⁺ denotes the number of hypotheses in H with a non-null (strictly positive) probability. Given the −1 correction, this quantity can be interpreted as the "number of contenders" for each entity in set H, because it takes value 0 when only one element is left. For future reference, we will label it Origin entropy because it marks the origin of the quadrant in Fig. 2. Importantly, the expected reduction of Origin entropy is just the expected number of hypotheses in H conclusively falsified by a test E.
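A toy computation (our own numbers, purely illustrative) makes this reading of order-0 measures concrete:

```python
def ent_origin(dist):
    """Origin entropy: the number of non-null 'contenders' minus one."""
    return sum(1 for p in dist if p > 0) - 1

# Three live hypotheses; outcome e falsifies one of them, outcome not-e none.
prior = [0.6, 0.3, 0.1]
outcomes = [(0.5, [0.9, 0.1, 0.0]), (0.5, [0.3, 0.5, 0.2])]

r_origin = sum(p_e * (ent_origin(prior) - ent_origin(post))
               for p_e, post in outcomes)
print(r_origin)  # 0.5: one hypothesis is conclusively falsified half the time
```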
To the extent that all details of the prior and posterior probability distributions over H are neglected, computational demands are significantly decreased with order-0 entropies. As a consequence, measures of the expected reduction of an order-0 entropy (and especially Origin entropy) also amount to comparably frugal, heuristic or quasi-heuristic models of information search (see Baron et al., 1988, p. 106). Lack of continuity, too, is associated with heuristic models, which often rely on discrete elements instead of continuous representations (see Gigerenzer, Hertwig, & Pachur, 2011; Katsikopoulos, Schooler, & Hertwig, 2010). More generally, when the order parameter approaches 0, entropy measures become more and more balanced, meaning that they treat all live hypotheses more and more equally. The associated expected entropy reduction measures, in turn, become more and more "Popperian" in spirit. In fact, for order-0 relevance measures, a test E will deliver some non-null expected informational utility about hypothesis set H if and only if some of the possible outcomes of E can conclusively rule out some element in H. Otherwise, the expected entropy reduction will be zero, no matter how large the changes in probability that might arise from E. Cognitively, relevance measures of low order would then describe the information search preferences of an agent who is distinctively eager to prune down the list of candidate hypotheses, an attitude which might prevail in earlier stages of an inquiry, when such a list can be sizable.

Among entropy measures of order infinity, we already know ent_SM^(∞,2) as Error entropy. What this illustrates is that, when r goes to infinity, entropy measures become more and more decision-theoretic in a short-sighted kind of way: in the limit, they are only affected by the probability of a correct guess given the currently available information. A notable consequence for the associated measures of expected entropy reduction is that a test E can deliver some non-null expected informational utility only if some of the possible outcomes of E can alter the probability of the modal hypothesis in H. If that is not the case, then the expected utility will be zero, no matter how significant the changes in the probability distribution arising from E. Cognitively, then, R-measures of very high order would describe the information search preferences of an agent who is predominantly concerned with an estimate of the probability of error in an impending choice from set H.

4.2.2 The degree parameter t: Perfect tests and certainty

Let us now consider briefly the meaning of the degree parameter t in the Sharma–Mittal formalism when entropies and relevance measures represent uncertainty and the value of queries, respectively. A remarkable fact about the degree parameter t is that (unlike the order parameter r) it does not affect the ranking of entropy values. Indeed, one can show that any Sharma–Mittal entropy measure is a strictly increasing function of any other measure of the same order r, regardless of the degree (for any hypothesis set H and any probability distribution P) (Supplementary Material S1, section 4). Thus, concerning the ordinal comparison of entropy values, divergences between pairs of SM entropy measures can arise only if the order differs. On the other hand, the implications of the degree parameter for measures of expected entropy reduction are significant and have not received much attention.
As a useful basis for discussion, suppose that variables H and E are independent, in the standard sense that for any h_i ∈ H and any e_j ∈ E, P(h_i ∩ e_j) = P(h_i)P(e_j), denoted as H ⊥_P E. Then we have (Supplementary Material S1, section 4):

R_P(E,E) − R_P(H × E, E) = (t − 1) · ent_P(H) · ent_P(E)

If expected entropy reduction is interpreted as a measure of the informational utility of queries or tests, this equality governs the relationship between the computed utilities of E in case it is a "perfect" (conclusive) test and in case it is not. More precisely, the first term on the left, R_P(E,E), measures the expected informational utility of a perfect test because the test itself and the target of investigation are the same, hence finding out the true value of E removes all relevant uncertainty. On the other hand, E is no longer a perfect test in the second term of the equation above, R_P(H × E, E), for here a more fine-grained hypothesis set H × E is at issue, thus a more demanding epistemic target; hence finding out the true value of E would not remove all relevant uncertainty. (Recall that, by assumption, H is statistically independent of E, so the uncertainty about H would remain untouched, as it were, after knowing about E.) With entropies of degree 1 (including Shannon), the associated measures of expected entropy reduction imply that E has exactly identical utility in both cases, because t = 1 nullifies the right-hand side of the equation, regardless of the order parameter r. With t > 1 the right-hand side is positive, so E is a strictly more useful test when it is conclusive than when it is not. With t < 1, on the contrary, the right-hand side is negative, so E is a strictly less useful test when it is conclusive than when it is not. Note that these are ordinal relationships (rankings). In comparing the expected informational utility of queries, the degree parameter t can thus play a crucial role. Crupi and Tentori (2014, p. 88) provided some simple illustrations which can be adapted as favoring an entropy with t > 1 as the basis for the R-measure of the expected utility of queries (here, we present an illustration in Fig. 5).

Figure 5. Consider a standard 52-card playing deck, with Suit corresponding to the four equally probable suits, Value corresponding to the 13 equally probable numbers (or faces) that a card can take (2 through 10, Jack, Queen, King, Ace), and Suit × Value corresponding to the 52 equally probable individual cards in the deck. Suppose that you will be told the suit of a randomly chosen card. Is this more valuable to you if (i) (perfect test case) your goal is to learn the suit, i.e., R_P(Suit, Suit), or (ii) (inconclusive test case) your goal is to learn the specific card, i.e., R_P(Suit × Value, Suit)? What is the ratio of the value of the expected entropy reduction in (i) versus (ii)? For degree 1, the information to be obtained has equal value in each case. For degrees > 1, the perfect test is more useful. For degrees < 1, the inconclusive test is more useful. Interestingly, as the figure shows, the degree parameter uniquely determines the relative value of R_P(Suit, Suit) and R_P(Suit × Value, Suit), regardless of the order parameter. In the figure, values of the order parameter r and of the degree parameter t lie on the x- and y-axis, respectively.
Color represents the log of the ratio between the conclusive test and the inconclusive test case in the card example above: black means that the information values of the tests are equal (log of the ratio is 0); warm/cool shades indicate that the conclusive test has a higher/lower value, respectively (log of the ratio is positive/negative).

The meaning of a high degree parameter is of particular interest in the so-called Tsallis family of entropy measures, obtained from the Sharma–Mittal family when r = t (see Table 2). Consider Tsallis entropy of degree 30, that is, ent_SM^(30,30). With this measure, entropy remains very close to an upper bound value of 1/(t − 1) ≈ 0.0345 unless the probability distribution reflects near-certainty about the true element in the hypothesis set H. For instance, for as uneven a distribution as {0.90, 0.05, 0.05}, ent_SM^(30,30) yields entropy 0.0330, still close to 0.0345, while it quickly approaches 0 when the probability of one hypothesis exceeds 0.99. Non-certainty entropy seems a useful label for future reference, as this measure essentially implies that entropy is almost invariant as long as an appreciable lack of certainty (a "reasonable doubt", as it were) endures. Accordingly, the entropy reduction from a piece of evidence e is largely negligible unless one is led to acquire a very high degree of certainty about H, and it approaches the upper bound of 1/(t − 1) as the posterior probability comes close to matching a truth-value assignment (with P(h_i) = 1 for some i and 0 for all other hs). Up to the inconsequential normalizing constant t − 1, the expected reduction of this entropy amounts to a smooth variant of Nelson et al.'s (2010) "probability-of-certainty heuristic", where a datum e_i ∈ E has informational utility 1 if it reveals the true element in H with certainty and utility 0 otherwise, so that the expected utility of E itself is just the overall probability that certainty about H is eventually achieved by that test. These remarks further illustrate that a larger degree t implies an increasing tendency of the corresponding R-measure to value highly the attainment of certainty or quasi-certainty about the target hypothesis set when assessing a test.
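The following sketch (our own; distributions other than {0.90, 0.05, 0.05} are our choices) traces this behavior numerically, and the second printed value reproduces the 0.0330 figure just mentioned:

```python
def ent_tsallis(dist, t):
    """Tsallis entropy of degree t (Sharma-Mittal with r = t)."""
    return (1 - sum(p ** t for p in dist if p > 0)) / (t - 1)

t = 30  # upper bound 1/(t - 1), about 0.0345
for dist in ([1/3, 1/3, 1/3], [0.90, 0.05, 0.05],
             [0.99, 0.005, 0.005], [0.999, 0.0005, 0.0005]):
    print(round(ent_tsallis(dist, t), 5))
# 0.03448, 0.03302, 0.00898, 0.00102: entropy hugs the bound until one
# hypothesis approaches certainty, then quickly drops toward 0.
```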

5 A systematic exploration of how key information search models diverge

Depending on different entropy functions, two measures R and R* of the expected reduction of entropy as the informational utility of tests may disagree in their rankings. Formally, there exist variables H, E, and F and a probability distribution P(H,E,F) such that R_P(H,E) > R_P(H,F) while R*_P(H,E) < R*_P(H,F); thus, R-measures are not generally ordinally equivalent. In the following, we will focus on an illustrative sample of measures in the Sharma–Mittal framework and show that such divergences can be widespread, strong, and telling about the specific tenets of those measures. This means that different entropy measures can provide markedly divergent implications in the assessment of possible queries' expected usefulness. Depending on the interpretation of the models, this in turn implies conflicting empirical predictions and/or incompatible normative recommendations.

Our list includes three classical models that are standard at least in some domains, namely Shannon, Quadratic, and Error entropy. It also includes three measures which we previously labeled heuristic or quasi-heuristic in that they largely or completely disregard quantitative information conveyed by the relevant probability distribution P: these are Origin entropy (or the "number of contenders"), Hartley entropy, and Non-certainty entropy, as defined above. For a wider coverage and comparison, we also include an entropy function lying well below the Arimoto curve in Fig. 2, and thus labeled Non-concave (see Fig. 4).

We ran simulations to identify cases of strong disagreement between our seven measures of expected entropy reduction, on a pairwise basis, about which of two tests is taken to be more useful. In each simulation, we considered a scenario with a threefold hypothesis space H = {h_1, h_2, h_3} and two binary tests, E = {e, ¬e} and F = {f, ¬f}.8 The goal of each simulation was to find a case, that is, a specific joint probability distribution P(H,E,F), where two R-measures strongly disagree about which of two tests is most useful. The ideal scenario here is a case where the expected reduction of one kind of entropy (say, Origin) implies that E is as useful as can possibly be found, while F is as bad as it can be, and the expected reduction of another kind of entropy (say, Shannon) implies the opposite, with equal strength of conviction.

The quantification of the disagreement between two R-measures in a given case, i.e., for a given P(H,E,F), arises from three steps (also see Nelson et al., 2010). (i) Normalization: for each measure, we divide nominal values of expected entropy reduction (for each of E and F) by the expected entropy reduction of a conclusive test for three equally probable hypotheses, that is, by ent_U(H) with n = 3 (since a conclusive test reduces entropy to zero, its expected reduction equals the initial entropy). (ii) Preference strength: for each measure, we compute the simple difference between the (normalized) expected entropy reduction for test E and for test F. (iii) Disagreement strength (DS): if the two measures agree on whether E or F is most useful, DS is defined as zero; if they disagree, DS is defined as the geometric mean of those measures' respective absolute preference strengths in step (ii). In the simulations, a variety of techniques were involved in order to maximize disagreement strength, including random generation of prior probabilities over H and of likelihoods for E and F, optimization of likelihoods alone, and joint optimization of likelihoods and priors.
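These three steps can be implemented in a few lines of Python (a sketch of our own; function names are ours). As an illustration, the numbers below reproduce case 1 of Table 3, pitting Shannon against the pure form of Non-certainty entropy described after the table; the printed preference strengths, approximately 0.119 and −0.250, match the tabled values:

```python
import math

def ent_shannon(dist):
    return sum(p * math.log(1 / p) for p in dist if p > 0)

def ent_noncertainty(dist):
    """Purest form of Non-certainty entropy: 0 iff some hypothesis is certain."""
    return 0.0 if max(dist) == 1 else 1.0

def expected_reduction(ent, prior, outcomes):
    """outcomes: list of (outcome probability, posterior over H) pairs."""
    return sum(p_e * (ent(prior) - ent(post)) for p_e, post in outcomes)

def disagreement_strength(ent1, ent2, prior, test_E, test_F):
    """Steps (i)-(iii): normalize, take preference strengths, combine."""
    strengths = []
    for ent in (ent1, ent2):
        norm = ent([1/3, 1/3, 1/3])  # entropy removed by a conclusive test
        s = (expected_reduction(ent, prior, test_E)
             - expected_reduction(ent, prior, test_F)) / norm
        strengths.append(s)
    s1, s2 = strengths
    ds = 0.0 if s1 * s2 >= 0 else math.sqrt(abs(s1) * abs(s2))
    return s1, s2, ds

# Case 1 of Table 3:
prior = [0.50, 0.25, 0.25]
test_E = [(0.5, [0.5, 0.5, 0.0]), (0.5, [0.5, 0.0, 0.5])]
test_F = [(0.25, [1.0, 0.0, 0.0]), (0.75, [1/3, 1/3, 1/3])]
print(disagreement_strength(ent_shannon, ent_noncertainty, prior, test_E, test_F))
# approx (0.119, -0.250, 0.173): Shannon prefers E, Non-certainty prefers F
```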
Each example reported here was found in the attempt to maximize DS for a particular pair of measures. We relied on the simulations largely as a heuristic tool, thus selecting and slightly adapting the numerical examples to make them more intuitive and improve clarity.9 For each pair of R-measures in our sample of seven, at least one case of moderate or strong disagreement was found (Table 3). Thus, for each pairwise comparison one can identify probabilities for which the models make diverging claims about which test is more useful. In what follows, we append a short discussion to the cases in which Shannon entropy strongly disagrees with each competing model. Such discussion is illustrative and qualitative, to intuitively highlight the underlying properties of different models. Similar explications could be provided for all other pairwise comparisons, but are omitted for the sake of brevity.

Table 3. Cases of strong disagreement between seven measures of expected entropy reduction. Two binary tests E and F are considered for a ternary hypothesis set H; P(e) and P(f) denote the probability of the first outcome of each test, the complementary outcome having the remaining probability. Preference strength is the difference between (normalized) values of expected entropy reduction for E and F, respectively: it is positive if test E is strictly preferred, negative if F is strictly preferred, and null if they are rated equally. Preference strengths are listed for the Non-certainty, Origin, Hartley, Non-concave, Shannon, Quadratic, and Error measures, in that order; for each pair of R-measures in our sample of seven, the table includes at least one case of moderate or strong disagreement.

Case 1. P(H) = {0.50, 0.25, 0.25}. Test E: P(H|e) = {0.5, 0.5, 0}, P(H|¬e) = {0.5, 0, 0.5}, P(e) = 0.5. Test F: P(H|f) = {1, 0, 0}, P(H|¬f) = {1/3, 1/3, 1/3}, P(f) = 0.25. Preference strengths: −0.250, 0.250, 0.119, 0.250, 0.119, 0, 0.

Case 2. P(H) = {0.67, 0.17, 0.17}. Test E: P(H|e) = {0.82, 0.17, 0.01}, P(H|¬e) = {0.01, 0.17, 0.82}, P(e) = 0.8. Test F: P(H|f) = {1, 0, 0}, P(H|¬f) = {1/3, 1/3, 1/3}, P(f) = 0.49. Preference strengths: −0.487, −0.490, −0.490, 0.394, 0.046, 0.062, 0.240.

Case 3. P(H) = {0.67, 0.10, 0.23}. Test E: P(H|e) = {0.899, 0.1, 0.001}, P(H|¬e) = {0.001, 0.1, 0.899}, P(e) = 0.74. Test F: P(H|f) = {1, 0, 0}, P(H|¬f) = {0.40, 0.18, 0.42}, P(f) = 0.45. Preference strengths: −0.409, −0.450, −0.450, 0.342, 0.218, 0.249, 0.329.

Case 4. P(H) = {0.6, 0.1, 0.3}. Test E: P(H|e) = {1, 0, 0}, P(H|¬e) = {1/3, 1/6, 1/2}, P(e) = 0.4. Test F: P(H|f) = {0.7, 0.3, 0}, P(H|¬f) = {0.55, 0, 0.45}, P(f) = 1/3. Preference strengths: 0.400, −0.100, 0.031, 0.045, 0.051, 0.155, 0.150.

Case 5. P(H) = {0.5, 0.499, 0.001}. Test E: P(H|e) = {0.998, 0.001, 0.001}, P(H|¬e) = {0.001, 0.998, 0.001}, P(e) = 0.501. Test F: P(H|f) = {0.501, 0.499, 0}, P(H|¬f) = {0, 0.499, 0.501}, P(f) = 0.998. Preference strengths: 0.942, −0.500, −0.369, 0.499, 0.617, 0.744, 0.746.

Case 6. P(H) = {0.66, 0.17, 0.17}. Test E: P(H|e) = {1, 0, 0}, P(H|¬e) = {1/3, 1/3, 1/3}, P(e) = 0.49. Test F: P(H|f) = {0.66, 0.17, 0.17}, P(H|¬f) = {0.66, 0.17, 0.17}, P(f) = 0.5. Preference strengths: 0.490, 0.490, 0.490, −0.236, 0.288, 0.250, 0.

Case 7. P(H) = {0.53, 0.25, 0.22}. Test E: P(H|e) = {1, 0, 0}, P(H|¬e) = {0.295, 0.375, 0.330}, P(e) = 1/3. Test F: P(H|f) = {0.53, 0.25, 0.22}, P(H|¬f) = {0.53, 0.25, 0.22}, P(f) = 0.5. Preference strengths: 0.333, 0.333, 0.333, −0.123, 0.261, 0.249, 0.080.

Case 8. P(H) = {0.50, 0.14, 0.36}. Test E: P(H|e) = {0.72, 0.14, 0.14}, P(H|¬e) = {0.14, 0.14, 0.72}, P(e) = 0.62. Test F: P(H|f) = {0.5, 0.5, 0}, P(H|¬f) = {0.5, 0, 0.5}, P(f) = 0.28. Preference strengths: 0, −0.500, −0.369, 0.293, −0.085, 0.086, 0.330.

Case 9. P(H) = {0.50, 0.18, 0.32}. Test E: P(H|e) = {0.65, 0.18, 0.17}, P(H|¬e) = {0.17, 0.18, 0.65}, P(e) = 0.69. Test F: P(H|f) = {0.5, 0.5, 0}, P(H|¬f) = {0.5, 0, 0.5}, P(f) = 0.36. Preference strengths: 0, −0.180, −0.133, 0.213, −0.179, −0.024, 0.225.

Case 10. P(H) = {0.42, 0.42, 0.16}. Test E: P(H|e) = {0.5, 0.5, 0}, P(H|¬e) = {0, 0, 1}, P(e) = 0.84. Test F: P(H|f) = {0.66, 0.24, 0.10}, P(H|¬f) = {0.10, 0.66, 0.24}, P(f) = 0.57. Preference strengths: 0.160, 0.580, 0.470, −0.146, 0.241, 0.115, −0.120.

Shannon versus Non-certainty entropy (case 3 in Table 3; DS = 0.30). In its purest form, Non-certainty entropy equals 0 if one hypothesis in H is known to be true with certainty, and 1 otherwise.
As a consequence, the entropy reduction expected from a test E just amounts to the probability that full certainty will be achieved after E is performed. Within the Sharma–Mittal framework, this behavior can often be closely approximated by an entropy measure such as Tsallis of degree 30, as explained above.10 One example where the expected reductions of Shannon and Non-certainty entropy disagree significantly involves a prior P(H) = {0.67, 0.10, 0.23}. The Non-certainty measure rates very poorly a test E such that P(H|e) = {0.899, 0.100, 0.001}, P(H|¬e) = {0.001, 0.100, 0.899}, and P(e) = 0.74, and strongly prefers a test F such that P(H|f) = {1, 0, 0}, P(H|¬f) = {0.40, 0.18, 0.42}, and P(f) = 0.45, because the probability of attaining full certainty from F is sizable (45%). The expected reduction of Shannon entropy implies the opposite ranking, because test E, while unable to provide full certainty, will invariably yield a highly skewed posterior as compared to the prior.

Shannon versus Origin and Hartley Entropy (case 5 in Table 3; DS = 0.56 and DS = 0.48, respectively). The reductions of Origin and Hartley entropy both rest on the idea of counting how many hypotheses are conclusively ruled out by the evidence. For example, with prior P(H) = {0.500, 0.499, 0.001}, the expected reduction of either Origin or Hartley entropy assigns value zero to a test E such that P(H|e) = {0.998, 0.001, 0.001}, P(H|¬e) = {0.001, 0.998, 0.001}, and P(e) = 0.501, because no hypothesis is ever ruled out conclusively, and instead prefers a test F such that P(H|f) = {0.501, 0.499, 0}, P(H|¬f) = {0, 0.499, 0.501}, and P(f) = 0.998. The expected reduction of Shannon entropy implies the opposite ranking, because F will almost always yield only a tiny change in overall uncertainty.

Shannon versus Non-concave Entropy (case 6 in Table 3; DS = 0.26). For non-concave entropies, the expected entropy reduction may turn out to be negative, thus indicating an allegedly detrimental query, that is, a test whose expected utility is lower than that of a completely irrelevant test. This feature yields cases of significant disagreement between the expected reduction of our illustrative non-concave entropy and that of classical concave measures such as Shannon. With a prior P(H) = {0.66, 0.17, 0.17}, the non-concave measure rates a test E such that P(H|e) = {1, 0, 0}, P(H|¬e) = {1/3, 1/3, 1/3}, and P(e) = 0.49 much lower than an irrelevant test F such that P(H|f) = P(H|¬f) = P(H). Indeed, the non-concave R-measure assigns a significantly negative value to test E. This critically depends on one interesting fact: for the non-concave entropy, going from P(H) to a completely flat posterior, P(H|¬e), is an extremely aversive outcome (i.e., it implies a very large increase in uncertainty), while the 49% chance of achieving certainty through datum e is not highly valued (a feature of low-degree measures, as we know). The expected reduction of Shannon entropy implies the opposite ranking instead, as it conveys the principle that no test can be informationally less useful than an irrelevant test (such as F).

Shannon versus Quadratic Entropy (case 8 in Table 3; DS = 0.09). Shannon and Quadratic entropies are similar in many ways, yet at least some cases of moderate disagreement can be found. One arises with prior P(H) = {0.50, 0.14, 0.36}. Test E is such that P(H|e) = {0.72, 0.14, 0.14}, P(H|¬e) = {0.14, 0.14, 0.72}, and P(e) = 0.62, while with test F one has P(H|f) = {0.5, 0.5, 0}, P(H|¬f) = {0.5, 0, 0.5}, and P(f) = 0.28.
Expected Quadratic entropy reduction ranks E over F, as it puts a particularly high value on posterior distributions where one single hypothesis comes to prevail. In comparison, this is less important for the reduction of Shannon entropy, as long as some hypotheses are completely (or largely) ruled out, as occurs with F. Accordingly, the Shannon measure prefers F over E.

Shannon versus Error Entropy (case 9 in Table 3; DS = 0.20). A stronger disagreement arises between Shannon and Error entropy. Consider prior P(H) = {0.50, 0.18, 0.32}, a test E such that P(H|e) = {0.65, 0.18, 0.17}, P(H|¬e) = {0.17, 0.18, 0.65}, and P(e) = 0.69, and a test F such that P(H|f) = {0.5, 0.5, 0}, P(H|¬f) = {0.5, 0, 0.5}, and P(f) = 0.36. The expected reduction of Error entropy is sizable with E but zero with F, because the latter will leave the modal probability untouched. (Note that it does not matter that the hypothesis carrying the maximum probability changes.) However, test F, unlike E, will invariably rule out a hypothesis that was a priori significantly probable, and for this reason it is preferred by the Shannon R-measure.
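As a quick illustration of how Table 3 can be recomputed, the snippet below applies the functions sketched at the beginning of this section to case 9; the resulting preference strengths agree with the tabled values up to rounding, and their geometric mean recovers the reported disagreement strength. Error entropy is evaluated directly as 1 − max(p), its limiting (infinite-order, degree-2) form.

```python
# Case 9 of Table 3: the prior and the two candidate tests (posteriors, outcome probabilities)
prior  = [0.50, 0.18, 0.32]
test_E = ([[0.65, 0.18, 0.17], [0.17, 0.18, 0.65]], [0.69, 0.31])
test_F = ([[0.5, 0.5, 0.0],    [0.5, 0.0, 0.5]],    [0.36, 0.64])

shannon = lambda p: sm_entropy(p, r=1, t=1)
error   = lambda p: 1.0 - max(p)    # Error entropy as the infinite-order, degree-2 limit

pref_shannon = preference_strength(shannon, prior, test_E, test_F)   # approx. -0.179 (F preferred)
pref_error   = preference_strength(error,   prior, test_E, test_F)   # 0.225 (E preferred)
print(pref_shannon, pref_error, disagreement_strength(pref_shannon, pref_error))  # DS approx. 0.20
```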

6 Model comparison: Prediction and behavior

Now that we have seen examples illustrating the theoretical properties of a variety of Sharma–Mittal relevance measures, we turn to the question of whether these measures can help with the psychological or normative theory of the value of information.

6.1 Comprehensive analysis of Wason's abstract selection task

The single most widely studied experimental information search paradigm is Wason's (1966) selection task. In the classical, abstract version, participants are presented with a conditional hypothesis (or "rule"), h = "if A (antecedent), then C (consequent)". The hypothesis concerns some cards, each of which has a letter on one side and a number on the other, for instance A = "the card has a vowel on one side" and C = "the card has an even number on the other side". One side is displayed for each of four cards: one instantiating A (e.g., showing letter E), one instantiating not-A (e.g., showing letter K), one instantiating C (e.g., showing number 4), and one instantiating not-C (e.g., showing number 7). Participants therefore have four information search options in order to assess the truth or falsity of hypothesis h: turning over the A, the not-A, the C, or the not-C card. They are asked to choose which ones they would select as useful to establish whether the hypothesis holds or not. All, none, or any subset of the four cards can be selected.

According to Wason's (1966) original "Popperian" reading of the task, the A and not-C search options are useful because they could falsify h (by possibly revealing an odd number and a vowel, respectively), so a rational agent should select them. The not-A and C options, on the contrary, could not provide conclusively refuting evidence, so they are worthless in this interpretation. However, observed choice frequencies depart markedly from these prescriptions. In Oaksford and Chater's (1994, p. 613) meta-analysis, they were 89%, 16%, 62%, and 25% for A, not-A, C, and not-C, respectively.

Oaksford and Chater (1994, 2003) devised Bayesian models of the task in which agents treat the four cards as sampled from a larger deck and are assumed to maximize the expected reduction of uncertainty, with Shannon entropy as the standard measure. Oaksford and Chater postulated a foil hypothesis in which A and C are statistically independent and a target hypothesis h under which C always (or almost always) follows A. In Oaksford and Chater's (1994) "deterministic" analysis, C always followed A under the dependence hypothesis h. A key innovation in Oaksford and Chater (2003, p. 291) was the introduction of an "exception" parameter, such that P(C|A) = 1 – P(exception) under h. The model also requires parameters α and γ for the probabilities P(A) and P(C) of the antecedent and consequent of h. We implement Oaksford and Chater's (2003) model, positing α = 0.22 and γ = 0.27 (according to the "rarity" assumption), and a uniform prior on H = {h, ¬h}, as suggested in Oaksford and Chater (2003, p. 296). We explored the implications of calculating the expected usefulness of turning over each card, not only according to Shannon entropy reduction, but for the whole set of entropy measures from the Sharma–Mittal framework.11
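Before turning to the data, the following rough Python sketch (relying on the sm_entropy function introduced in Section 5) illustrates the structure of this kind of computation. It is a simplified rendering of the model, under the assumptions that the prior over the dependence and independence hypotheses is uniform and that the marginals P(A) = α and P(C) = γ are held fixed under both hypotheses; the exact likelihood and normalization conventions behind Fig. 6 may differ, so the values returned here should not be expected to match the reported relevance numbers digit for digit. The function name wason_card_values is ours.

```python
import numpy as np  # also requires sm_entropy from the sketch in Section 5

def wason_card_values(alpha=0.22, gamma=0.27, eps=0.0, r=1.0, t=1.0, prior_h=0.5):
    """Expected Sharma-Mittal entropy reduction of turning each card, in a simplified
    dependence-vs-independence rendering of the conditional rule "if A then C"."""
    # P(C|A) under the dependence hypothesis (1 - exceptions) and under independence (just gamma)
    pC_A = np.array([1 - eps, gamma])
    # Under dependence, choose P(C|not-A) so that the marginal P(C) stays at gamma (a simplifying assumption)
    pC_notA = np.array([(gamma - alpha * (1 - eps)) / (1 - alpha), gamma])

    prior = np.array([prior_h, 1 - prior_h])   # P(dependence), P(independence)

    def value(likelihoods):
        """likelihoods[i] = P(hidden side shows the 'positive' outcome | hypothesis i)."""
        total = 0.0
        for outcome in (likelihoods, 1 - likelihoods):
            p_outcome = float(prior @ outcome)
            if p_outcome > 0:
                posterior = prior * outcome / p_outcome
                total += p_outcome * sm_entropy(posterior, r, t)
        return sm_entropy(prior, r, t) - total

    return {
        "A":     value(pC_A),                              # hidden side: even vs. odd number
        "not-A": value(pC_notA),                           # hidden side: even vs. odd number
        "C":     value(pC_A * alpha / gamma),              # hidden side: vowel vs. consonant
        "not-C": value((1 - pC_A) * alpha / (1 - gamma)),  # hidden side: vowel vs. consonant
    }
```

For instance, wason_card_values() gives Shannon-based values for the deterministic reading, while wason_card_values(eps=0.1, r=11, t=11) corresponds to the degree-11 Tsallis measure in the model with exceptions.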
Empirical data

We first address how well different expected entropy reduction measures correspond to empirical aggregate card selection frequencies in the task, within Oaksford and Chater's (2003) model. For the selection frequencies, we use the abstract selection task data as reported by Oaksford and Chater (1994, p. 613) and mentioned above (89%, 16%, 62%, and 25% for A, not-A, C, and not-C, respectively). Fig. 6 (top row) shows the rank correlation between relevance values and empirical selection frequencies for each order and degree value from 0 to 20, in steps of 0.25.

First consider the results for the model with P(exception) = 0 (Fig. 6, top left subplot). A wide range of measures, including the expected reduction of Shannon and Quadratic entropy, of some non-concave entropies, and of measures with fairly high degree, correlate perfectly with the rank of selection frequencies. However, if a high-degree measure with moderate or high order is used, the rank correlation is not perfect. Consider for instance the Tsallis measure of degree 20 (i.e., the Sharma–Mittal measure of order 20 and degree 20). This leads to relevance values for the A, not-A, C, and not-C cards of 0.0281, 0.0002, 0.0008, and 0.0084, respectively. Because the relative ordering of the C and the not-C card is incorrect (from the perspective of observed choices), the rank correlation is only 0.8. The same rank correlation of 0.8 is obtained, but for a different reason, from strongly non-concave relevance measures. One of these, for instance, gives values of 1.181, 0.380, 1.054, and 0.372 (again for the A, not-A, C, and not-C cards, respectively), so that the not-A card is deemed more informative than the not-C card by this relevance measure.

Figure 6. Plots of rank correlation values for the expected reduction of various Sharma–Mittal entropies in Oaksford and Chater's (2003) model of the Wason selection task. In the top row, models of expected entropy reduction are compared with empirical aggregate card selection frequencies. In the bottom row, instead, the comparison is with the theoretical choices implied by Wason's original analysis of the task. In the left versus right columns, the conditional probability representation of "if vowel, then even number" rules out exceptions or allows for them (with probability 0.1), respectively.

Let us now consider the expected reduction of Origin entropy as an example of the 0-order measures. It gives relevance values of 0.527, 0, 0, and 0.159 for the A, not-A, C, and not-C cards, respectively. This is similar to Wason's analysis of the task: only the A and the not-C cards can falsify a hypothesis (namely, the dependence hypothesis h), thus only those two cards have value. The other cards could change the relative plausibility of h versus ¬h; however, according to 0-order measures, no informational value is achieved because no hypothesis is definitely ruled out. In this sense, 0-order measures can be thought of as bringing elements of the original logical interpretation of the selection task into the same unified information-theoretic framework that includes Shannon and generalized entropies (see below for more on this). Interestingly, this does not imply that the A and the not-C cards are equally valuable: in the model, the A card offers a higher chance of falsifying h than the not-C card, so it is more valuable according to this analysis. Thus, while incorporating the basic idea of the importance of possible falsification, the 0-order Sharma–Mittal formalization of informational value offers something that the standard logical reading does not: a rationale for assessing the relative value among those queries (the A and the not-C card) providing the possibility of falsifying a hypothesis.
The Origin entropy values and the empirical data agree that the A card is most useful and (up to a tie) that the not-A card is least useful, but disagree on virtually everything else; the Origin measure's rank correlation with empirical card selection frequencies is 0.6325.

What if Oaksford and Chater's (2003) model is combined with exception parameter P(exception) = 0.1, rather than 0? In this case, the empirical selection frequencies perfectly correlate with the theoretical values for an even wider range of measures than for the "deterministic" model (Fig. 6, top right plot). For instance, the Tsallis measure of degree 11, which had a rank correlation of 0.8 with P(exception) = 0, has a perfect rank correlation with P(exception) = 0.1. This is due to the relative ordering of the C and the not-C cards. For the P(exception) = 0 model, the A, not-A, C, and not-C cards had relevance of 0.059, 0.002, 0.012, and 0.016, respectively; with P(exception) = 0.1, the cards' respective relevance values are 0.019, 0.001, 0.007, and 0.005. In addition, a dramatic difference between P(exception) = 0 and P(exception) = 0.1 arises for the 0-order measures. If P(exception) > 0, even if very small, no amount of obtained data can ever lead to ruling out a hypothesis in the model. Therefore, with P(exception) = 0.1 all cards have zero value for 0-order measures, and the correlation with behavioral data is undefined (plotted black in Fig. 6).

A probabilistic understanding of Wason's normative indications

Finally, we discuss how well the expected informational value of the cards, as calculated using Oaksford and Chater's (2003) model and various Sharma–Mittal measures, corresponds to Wason's original interpretation of the task. We thus conducted the same analyses as above, but instead of using the human selection frequencies we assumed that the A card was selected with 100%, the not-A card with 0%, the C card with 0%, and the not-C card with 100% probability. The 0-order relevance measures, again within Oaksford and Chater's (2003) model with P(exception) = 0, provide a probabilistic understanding of Wason's normative indications. Like Wason, the 0-order measures deem only the A and the not-C cards to be useful when P(exception) = 0. The rank correlation with the theoretical selection frequencies from Wason's analysis is 0.94 (see Fig. 6, bottom left plot). Why is the correlation not perfect? The probabilistic understanding proposed, as discussed above, goes beyond the logical analysis: because the A card offers a higher probability of falsification than the not-C card does in the probability model, the 0-order relevance measures value the former more than the latter. Recall that our hypothetical participants always select both cards that entail the possibility of falsifying the dependence hypothesis; thus, the correlation is < 1. The worst correlation with Wason's ranking is obtained from the strongly non-concave measures; this correlation is exactly zero.

The Wason selection task illustrates the theoretical potential of the Sharma–Mittal framework: whereas other authors noted the robustness of probabilistic analyses of the task across different measures of informational utility (see Fitelson & Hawthorne, 2010; Nelson, 2005, pp. 985–986; Oaksford & Chater, 2007), the variety of measures involved in those analyses arose in an ad hoc way. We extend those results, and show that even the traditional, allegedly anti-Bayesian reading of the task can be recovered smoothly in one overarching framework.
In particular, the implications of Wason's Popperian interpretation can be represented well by the maximization of the expected reduction of an entirely balanced (order-0) Sharma–Mittal measure (such as Origin or Hartley entropy) in a deterministic reading of the task (i.e., with P(exception) = 0). Conversely, this means that adopting a probabilistic approach to Wason's task is not by itself sufficient to account for observed behavior. Even then, in fact, people's choices would still diverge from at least some theoretically viable models of information search.

6.2 Information search in experience-based studies

Is the same expected uncertainty reduction measure able to account for human behavior across a variety of tasks? To explore this issue, we reviewed experimental scenarios employed in experience-based investigations of information search behavior. In this experimental paradigm, participants learn the underlying statistical structure of an environment where items (plankton specimens) are visually displayed and subject to a binary classification (kind A vs. B) for which two binary features (yellow vs. black eye; dark vs. light claw) are potentially relevant. Immediate feedback is provided after each trial in a learning phase, until a performance criterion is reached, indicating adequate mastery of the environmental statistics. In a subsequent information-acquisition test phase of this procedure, both features (eye and claw) are obscured, and participants have to select the most informative/useful feature relative to the target categories (kinds of plankton). (See Nelson et al., 2010, for a detailed description.) In our current terms, these scenarios concern a binary hypothesis space H = {specimen of kind A, specimen of kind B} and two binary tests E = {yellow eye, black eye} and F = {dark claw, light claw}. In each case, the experience-based learning phase conveyed the structure of the joint probability distribution P(H,E,F) to participants. The test phase, in which either feature E or F can be viewed, represents a way to see whether the participants deemed R(E) or R(F) to be greater.

Overall, we found eight relevant experimental scenarios from the experimental paradigm described above (listed in Table 4) in which there was at least some interesting disagreement among the Sharma–Mittal measures about which feature is more useful. For each, we derived values of expected uncertainty reduction from Sharma–Mittal measures of order and degree from 0 to 20, in increments of 0.25, and we computed the simple proportion of cases in which each measure's ranking of R(E) and R(F) matched the most prevalent observed choice.

Table 4. Choices between two binary tests/experiments (E vs. F) for a binary classification problem (H) in experience-based experimental procedures. Cases 1–3 are taken from Nelson et al. (2010); cases 4–5 are from Exp. 3 in the same article; case 6 is an unpublished study using the same experimental procedure; cases 7–8 are from Meder and Nelson (2012, Exp. 1). The last entry of each case reports the percentage (and count) of participants choosing test E.
Case 1. P(H) = {0.7, 0.3}. Test E: P(H|e) = {0, 1}, P(H|¬e) = {0.754, 0.246}, P(e) = 0.072, P(¬e) = 0.928. Test F: P(H|f) = {1, 0}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.399, P(¬f) = 0.601. Observed choices of E: 82% (23/28).

Case 2. P(H) = {0.7, 0.3}. Test E: P(H|e) = {0, 1}, P(H|¬e) = {0.767, 0.233}, P(e) = 0.087, P(¬e) = 0.913. Test F: P(H|f) = {1, 0}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.399, P(¬f) = 0.601. Observed choices of E: 82% (23/28).

Case 3. P(H) = {0.7, 0.3}. Test E: P(H|e) = {0.109, 0.891}, P(H|¬e) = {0.978, 0.022}, P(e) = 0.320, P(¬e) = 0.680. Test F: P(H|f) = {1, 0}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.399, P(¬f) = 0.601. Observed choices of E: 97% (28/29).

Case 4. P(H) = {0.7, 0.3}. Test E: P(H|e) = {0, 1}, P(H|¬e) = {0.733, 0.267}, P(e) = 0.045, P(¬e) = 0.955. Test F: P(H|f) = {1, 0}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.399, P(¬f) = 0.601. Observed choices of E: 89% (8/9).

Case 5. P(H) = {0.7, 0.3}. Test E: P(H|e) = {0.201, 0.799}, P(H|¬e) = {0.780, 0.220}, P(e) = 0.139, P(¬e) = 0.861. Test F: P(H|f) = {1, 0}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.399, P(¬f) = 0.601. Observed choices of E: 70% (14/20).

Case 6. P(H) = {0.7, 0.3}. Test E: P(H|e) = {0.135, 0.865}, P(H|¬e) = {0.848, 0.152}, P(e) = 0.208, P(¬e) = 0.792. Test F: P(H|f) = {1, 0}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.399, P(¬f) = 0.601. Observed choices of E: 70% (14/20).

Case 7. P(H) = {0.44, 0.56}. Test E: P(H|e) = {0.595, 0.405}, P(H|¬e) = {0.331, 0.669}, P(e) = 0.414, P(¬e) = 0.586. Test F: P(H|f) = {0, 1}, P(H|¬f) = {0.502, 0.498}, P(f) = 0.123, P(¬f) = 0.877. Observed choices of E: 60% (12/20).

Case 8. P(H) = {0.36, 0.64}. Test E: P(H|e) = {0.090, 0.910}, P(H|¬e) = {0.707, 0.293}, P(e) = 0.562, P(¬e) = 0.438. Test F: P(H|f) = {0, 1}, P(H|¬f) = {0.501, 0.499}, P(f) = 0.282, P(¬f) = 0.718. Observed choices of E: 79% (15/19).

Nelson et al. (2010) devised their scenarios to dissociate the predictions of a sample of competing and historically influential models of rational information search. Their conclusion was that the expected reduction of Error entropy (expected probability gain, in their terminology) accounted for participants' behavior and outperformed the expected reduction of Shannon entropy (expected information gain, in their terminology). A more comprehensive analysis within our current approach yields a richer picture. The data set employed can be accurately accounted for within the Sharma–Mittal framework for a significant range of degree values, provided that the order parameter is high enough (the results are displayed in Fig. 7, left side). Observed choices are especially consistent with the expected reduction of a quite unbalanced (e.g., r ≥ 4), concave or quasi-concave (t close to 2) Sharma–Mittal entropy measure. Importantly, there is overlap between the results from modeling the Wason selection task and these experience-based learning data, giving hope to the idea that a unified theoretical explanation of human behavior may extend across several tasks.

Figure 7. On the left, a graphical illustration of the empirical accuracy of Sharma–Mittal measures relative to binary information search choices in eight experience-based experimental scenarios (described in Table 4). The shade at each point illustrates the proportion of choices (out of 8) correctly predicted by the expected reduction of the corresponding underlying entropy, with white and black indicating maximum (8/8) and minimum (0/8) accuracy, respectively. Results suggest that an Arimoto metric of moderate or high order is highly consistent with human choices. On the right, an illustration of the empirical accuracy of Sharma–Mittal measures in theoretically similar tasks in which probabilistic information is presented in a standard explicit format (with numeric prior probabilities and test likelihoods). In these tasks, individual participants' test choices are highly noisy. Can a systematic theory still account for the modal results across tasks? We analyzed 13 cases (described in Table 5) of binary information search preferences. The shade at each point illustrates the proportion of comparisons (out of 13) correctly predicted by the expected reduction of the corresponding underlying entropy, with white and black again indicating maximum (13/13) and minimum (0/13) accuracy, respectively.
Results show that a wide range of measures is consistent with the available experimental findings, including Shannon entropy as well as a variety of high-degree measures (degree much higher than the Arimoto curve).

6.3 Information search in words-and-numbers studies

The experience-based learning tasks discussed above were inspired by analogous tasks in which the prior probabilities of categories and the feature likelihoods were presented to participants using words and numbers (e.g., Skov & Sherman, 1986). We refer to such tasks as Planet Vuma experiments, reflecting their typically whimsical content, such as classifying species of aliens on Planet Vuma, designed not to conflict with people's experience with real object categories. Whereas the expected reduction of Error entropy, and other models as discussed above, gives a plausible explanation of the experience-based learning task data, individual data in words-and-numbers studies are very noisy, and no attempt had been made to see whether a unified theory could account for the modal responses across these tasks. We therefore re-analyzed empirical data from several Planet Vuma experiments, in a manner analogous to our analyses of the experience-based learning data above (Fig. 7). What do the results show? To our surprise, the results suggest that there may be a systematic explanation of people's behavior on words-and-numbers tasks. The degree of the most plausible measures is considerably above the Arimoto curve, although not as high as, for instance, Non-certainty entropy (order 30). From a descriptive psychological standpoint, a plausible interpretation is that when confronted with words-and-numbers-type tasks, people have a strong focus on the chances of obtaining a certain or near-to-certain result, and are less concerned with (or, perhaps, attuned to) the details of the individual items in the probability distribution. The Sharma–Mittal framework thus provides a potential explanation for heretofore perplexing experimental results, while also highlighting key questions (e.g., exactly how much preference for near-certainty subjects have) for future empirical research on words-and-numbers tasks.

6.4 Unifying theory and intuition in the Person Game (Having your cake and eating it too)

In this section, we introduce another theoretical conundrum from the literature, and show how the Sharma–Mittal framework may help solve it. As pointed out above, the expected reduction of Error entropy initially appeared to provide the best explanation of people's intuitions and behavior on experience-based information search tasks (Nelson et al., 2010). But this model leads to potentially counterintuitive behavior on another interesting kind of information search task, namely the Person Game (a variant of the Twenty Questions game). In this game, n cards (say, 20) with different faces are presented. One of those faces has been chosen at random (with equal probability) to be the correct face in a particular round of the game. The player's task is to find the true face in the smallest number of yes/no questions about physical features of the faces. For instance, asking whether the person has a beard would be a possible question, E = {e, ¬e}, with e = beard and ¬e = no beard. If k < n is the number of characters with a beard, then P(e) = k/n and P(¬e) = (n – k)/n. Moreover, a "yes" answer will leave k equiprobable guesses still in play, and a "no" answer n – k such guesses.
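In entropy terms this is easy to write down. If $H_{r,t}(U_m)$ denotes the Sharma–Mittal entropy of a uniform distribution over $m$ remaining guesses (our notation here), the expected usefulness of such a question is

\[
R(E) \;=\; H_{r,t}(U_n) \;-\; \left[\frac{k}{n}\,H_{r,t}(U_k) \;+\; \frac{n-k}{n}\,H_{r,t}(U_{n-k})\right].
\]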
Several papers have reported (see Nelson et al., 2014, for references) that people preferentially ask about features that are possessed by close to 50% of the remaining possible items, thus with P(e) close to 0.5. This strategy can be labeled the split-half heuristic. It is optimal for minimizing the expected number of questions needed under some task variants (Navarro & Perfors, 2011), although not in the general case (Nelson, Meder, & Jones, 2016), and it can be accounted for using expected Shannon entropy reduction. But expected Shannon entropy reduction cannot account for people's behavior on experience-based learning information search tasks, as our above analyses show. Can expected Error entropy reduction account for these results and intuitions? Put more broadly, can the same entropy model provide a satisfying account for both the Person Game and the experience-based learning tasks? As it happens, Error entropy cannot account for the preference to split the remaining items close to 50%. In fact, every possible question (unless its answer is known already, because none or all of the remaining faces have the feature) has exactly the same expected Error entropy reduction, namely 1/k, where there are k items remaining (Nelson et al., 2016). This might lead us to wonder whether we must have different entropy/information models to account for people's intuitions and behavior across these different tasks. Indeed, it would call into question the potential for a unified and general-purpose theory of the psychological value of information.

It turns out that the findings on why expected Shannon entropy reduction favors questions close to a 50:50 split, and why Error entropy has no such preference, apply much more generally than to Shannon and Error entropy. In fact, for all Sharma–Mittal measures, the ordinal evaluation of questions in the Person Game is solely a function of the degree of the entropy measure, and has nothing to do with the order of the measure (Supplementary Material S1, section 5). Among other things, this implies that all entropy-based measures with degree 1 have exactly the same preferences as expected Shannon entropy reduction, and all of them quantify the usefulness of querying a feature as a function of the proportion of remaining items that possess that feature. Similarly, all degree-2 measures, and not only Error entropy, deem all questions to be equally useful in the Person Game. The core of this insight stems from the fact that, if a probability distribution is uniform, then the entropy of that distribution depends only on the degree of a Sharma–Mittal entropy measure. More formally, for any set of hypotheses H = {h1, h2, …, hn} with a uniform probability distribution U(H), the Sharma–Mittal entropy equals (n^(1−t) − 1)/(1 − t) (or ln n in the degree-1 limit), a quantity that does not depend on the order r. Fig. 8 shows how possible questions are valued, in the Person Game, as a function of the proportion of remaining items that possess a particular feature.
We see that if t = 1, as for Shannon and all Rényi entropies, questions with close to a 50:50 split are preferred. If the degree t is > 1 but < 2, questions with close to a 50:50 split are still preferred, but less so. If t = 2, then 1:99 and 50:50 questions are deemed equally useful. Remarkably, if the degree is > 2, then a 1:99 question is preferred to a 50:50 question.

Figure 8. The expected entropy reduction of a binary question E = {e, ¬e} in the Person Game with a hypothesis set H of size 40 (the possible guesses, that is, characters initially in play), as a function of the proportion of possible guesses remaining after getting datum e (e.g., a "yes" answer to "does the chosen person have a beard?"). Questions are deemed most valuable with the zero-degree entropy measures (bottom right plot). Although the shape of the curve is similar for the degree t = 0 and degree t = 1 measures, the actual information value (see the y axis) decreases as the degree increases. For degree t = 2 (for example, for Error entropy), every question is equally useful (provided that there is some uncertainty about the answer; bottom left plot). If the degree is > 2, then the most unevenly split questions (e.g., 1:39 questions, in the case of 40 items) are deemed most useful (left column, top and middle row). The order parameter is irrelevant for purposes of evaluating questions' expected usefulness in the Person Game, because all prior and possible posterior probability distributions are uniform (see text).

While observed preferences in the Person Game alone only partly constrain the choice of a particular Sharma–Mittal measure (in particular, they do not constrain the value of the order parameter r), nothing in principle guarantees that a joint and coherent account of such behavior and of the other findings above exists. It is therefore important to point out that one can, in fact, pick an entropy measure under which the experience-based data above are accommodated along with a greater informational value for 50:50 questions than for 1:99 questions in the Person Game. Medium-order Arimoto entropies, for instance, will work.
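To close this section, here is a small numerical check of the degree-only dependence discussed above, once more assuming the sm_entropy sketch from Section 5; changing the order r in the calls below leaves the orderings untouched.

```python
def person_game_value(k, n, r, t):
    """Expected Sharma-Mittal entropy reduction of a yes/no question that splits
    n equiprobable guesses into k ("yes") and n - k ("no") remaining guesses."""
    uniform = lambda m: [1.0 / m] * m
    return (sm_entropy(uniform(n), r, t)
            - (k / n) * sm_entropy(uniform(k), r, t)
            - ((n - k) / n) * sm_entropy(uniform(n - k), r, t))

n = 40
for t in (1.0, 2.0, 3.0):
    values = {f"{k}:{n - k}": round(person_game_value(k, n, r=1.0, t=t), 4) for k in (1, 10, 20)}
    print(f"degree t = {t} ->", values)
# Expected pattern: with t = 1 the 20:20 question is best, with t = 2 every question is
# worth exactly 1/40, and with t = 3 the 1:39 question comes out on top.
```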

7 General discussion

In this paper, we have presented a general framework for the formal analysis of uncertainty, the Sharma–Mittal entropy formalism. This framework generates a comprehensive approach to the informational value of queries (questions, tests, experiments) as the expected reduction of uncertainty. The amount of theoretical insight and unification achieved is remarkable, in our view. Moreover, such a framework can help us understand existing empirical results, and it points out important research questions for future investigation of human intuition and reasoning processes as concerns uncertainty and information search. Mathematically, the parsimony of the Sharma–Mittal formalism is appealing and also yields decisive advantages in analytic manipulations, derivations, and calculations.

Within the domain of cognitive science, no earlier attempt has been made to unify so many existing models of information search/acquisition behavior. Notably, this involves both popular candidate rational measures of informational utility (such as the expected reduction of Shannon or Error entropy) and avowedly heuristic models, such as Baron et al.'s (1988, p. 106) quasi-Popperian heuristic (maximization of the expected number of hypotheses ruled out, i.e., the expected reduction of Origin entropy) and Nelson et al.'s (2010, p. 962) "probability-of-certainty" heuristic (closely approximated by the expected reduction of a high-degree Tsallis entropy, or a similar measure). In addition, once applied to uncertainty and information search, the Sharma–Mittal parameters are not mere mathematical devices, but rather capture cognitively and behaviorally meaningful ideas. Roughly, the order parameter, r, captures how much one disregards minor hypotheses (via the kind of mean applied to the probability values in P(H)). The degree parameter, t, on the other hand, captures how much one cares about getting (very close) to certainty (via the behavior of the surprise/atomic information function; see Fig. 3). Thus, a high order indicates a strong focus on the prevalent (most likely) element in the hypothesis set and a lack of consideration for minor possibilities. A very low order, on the other hand, implies a Popperian or quasi-Popperian attitude in the assessment of tests, with a marked appreciation of potentially falsifying or almost falsifying evidence. The degree parameter, in turn, has important implications for how much potentially conclusive experiments are valued, as compared to experiments that are informative but not conclusive. Moreover, for each particular order, if the degree is higher than that of the corresponding Arimoto entropy (and in any case if the order is < 0.5 or the degree is at least 2), then the concavity of the entropy measure guarantees that no experiment will be rated as having negative expected usefulness.

Even according to fairly cautious views such as Aczél's (1984), the above remarks seem to provide strong motivation for pursuing a generalized approach. Here is another possible concern, however. Uncertainty and the informational value of tests may be involved in many arguments concerning human cognition. Now we see that those notions can be formalized in many different ways, such that different properties (say, additivity or non-negativity) are or are not implied. Thus, the arguments at issue might be valid for some choices of the corresponding measures and not for others.
This point has been labeled the issue of measure-sensitivity in related areas (Fitelson, 1999) — is it something to be worried about? Does it raise problems for our proposal? It is not uncommon for measure-sensitivity to foster skeptical or dismissive reactions concerning the prospects of a formal analysis of the concept at issue (e.g., Hurlbert, 1971; Kyburg & Teng, 2001, pp. 98 ff.). However, measure-sensitivity is a widespread and mundane phenomenon. In areas related to the formal analysis of reasoning, the issue arises, for instance, for Bayesian theories of inductive confirmation (e.g., Brössel, 2013; Crupi & Tentori, 2016; Festa & Cevolani, 2017; Glass, 2013; Hájek & Joyce, 2008; Roche & Shogenji, 2014), scoring rules and measures of accuracy (e.g., D'Agostino & Sinigaglia, 2010; Leitgeb & Pettigrew, 2010a,b; Levinstein, 2012; Predd et al., 2009), and measures of causal strength (e.g., Fitelson & Hitchcock, 2011; Griffiths & Tenenbaum, 2005, 2009; Meder, Mayrhofer, & Waldmann, 2014; Sprenger, 2016). Our treatment contributes to making the same point explicit for measures of uncertainty and the informational value of experiments. This we see as a constructive contribution. The prominence of one specific measure in one research domain may well have been partly affected by historical contingencies. As a consequence, when a theoretical or experimental inference relies on the choice of one measure, it makes sense to check how robust it is across different choices or, alternatively, to acknowledge which measure-specific properties support the conclusion and how compelling they are. Having a plurality of related measures available is indeed an important opportunity. It prompts thorough investigation of the features of alternative options and their relationships (e.g., Crupi, Chater, & Tentori, 2013; Huber & Schmidt-Petri, 2009; Nelson, 2005, 2008), it can provide a rich source of tools for both theorizing and the design of new experimental investigations (e.g., Rusconi et al., 2014; Schupbach, 2011; Tentori, Crupi, Bonini, & Osherson, 2007), and it makes it possible to tailor specific models to varying tasks and contexts within an otherwise coherent approach (e.g., Crupi & Tentori, 2014; Dawid & Musio, 2014; Oaksford & Hahn, 2007).

Which Sharma–Mittal measures are more consistent with observed behavior overall? According to our analyses, a subset of Sharma–Mittal information search models receives a significant amount of convergent support. We found that measures of high but finite order accounting for the experience-based (plankton task) data (Fig. 7, left side) are also empirically adequate for the abstract selection task data (Fig. 6, top row) and for results from a Twenty-Questions-style task such as the Person Game (Fig. 8). On the other hand, the best fit with the words-and-numbers (Planet Vuma) information search tasks indicates a different kind of model within the Sharma–Mittal framework (Fig. 7, right side). For these cases, our analysis thus suggests that people's behavior may comply with different measures in different situations. A key question therefore arises about which features of a task affect such variation in a consistent way: for example, whether an experimental procedure that conveys environmental statistics through explicit verbal and numerical stimuli prompts a comparatively stronger appreciation of certainty or quasi-certainty. Beyond this broad outlook, our discussion also allows for the resolution of a number of puzzles. Let us mention a final one. Nelson et al.
(2010) had concluded from their experimental investigations that human information search in an experience-based setting was appropriately accounted for by the maximization of the expected reduction of Error entropy. This specific model, however, exhibits some questionable properties related to its lack of mathematical continuity: in particular, if the most likely hypothesis in H is not changed by any possible evidence in E, then the latter has no informational utility whatsoever according to RError, even if it can rule out other non-negligible hypotheses in the set (see, e.g., cases 1 and 6 in Table 4). Findings from Baron et al. (1988) suggest that this might not describe human judgment adequately. In that study, participants were given a fictitious medical diagnosis scenario with P(H) = {0.64, 0.24, 0.12} and a series of possible binary tests, including a test E such that P(H|e) = {0.47, 0.35, 0.18}, P(H|¬e) = {1, 0, 0}, and P(e) = 0.68, and another, completely irrelevant test F (with an even chance of a positive/negative result for each of the elements in H, so that P(H|f) = P(H|¬f) = P(H)). According to RError, tests E and F are both equally worthless — RError(E) = RError(F) = 0 — because hypothesis h1 ∈ H remains t