In this section, we develop the notion of a decision-making process with limited resources, following the simple assumption that any decision-making process reduces uncertainty and that reducing uncertainty consumes resources.

Starting from an intuitive interpretation of uncertainty and resources, these concepts are refined incrementally until a precise definition of a decision-making process is given at the end of this section (Definition 7). Here, a decision-making process is a comprehensive term that describes all kinds of biological as well as artificial systems that are searching for solutions to given problems, for example a human decision-maker that burns calories while thinking, or a computer that uses electric energy to run an algorithm. However, resources do not necessarily refer to a real consumable quantity but can also be represented by more explicit quantities (e.g., time) serving as a proxy, for example the number of binary comparisons in a search algorithm, the number of forward simulations in a reinforcement learning algorithm, the number of samples in a Monte Carlo algorithm, or, even more abstractly, they can express the limited availability of some source of information, for example the amount of data that is available to an inference algorithm (see Section 4).

From a probabilistic perspective, a decision-making process as described above is a transition from a uniform probability distribution over N options to a uniform probability distribution over N′ < N options, which converges to the Dirac measure centered at the optimum x∗ in the fully rational limit. From this point of view, the restriction to uniform distributions is artificial. A decision-maker that is uncertain about the optimal decision x∗ might indeed have a bias towards a subset A ⊂ Ω without completely excluding other options (the ones in Ω ∖ A), so that the behavior must be properly described by a probability distribution. Therefore, in the following section, we extend Equations (1) and (2) to transitions between probability distributions. In particular, we must replace the power set of Ω by the space of probability distributions on Ω, denoted by P_Ω.

For example, a rational decision-maker can afford the cost C({x∗}), whereas a decision-maker with limited resources can typically only afford an uncertainty reduction with cost C(A) < C({x∗}).

Summarizing, we conclude that a decision-making process with decision space Ω that successively eliminates options can be represented by a mapping ϕ between subsets of Ω, together with a cost function C that quantifies the total expenses of arriving at a given subset, such that

In utility theory, decision-making is modeled as an optimization process that maximizes a so-called utility function U (which can itself be an expected utility with respect to a probabilistic model of the environment, in the sense of von Neumann and Morgenstern [1]). A decision-maker that is optimizing a given utility function U obtains, on average, a utility of (1/|A|) ∑_{x∈A} U(x) after reducing the set of uncertain options from Ω to A ⊂ Ω (see Figure 2). A decision-maker that completely reduces uncertainty by finding the optimum x∗ = argmax_{x∈Ω} U(x) is called fully rational (without loss of generality, we can assume that x∗ is unique, by redefining Ω in the case when it is not). Since uncertainty reduction generally comes with a cost, a utility-optimizing decision-maker with limited resources, correspondingly called bounded rational (see Section 3), in contrast will obtain only uncertain decisions from a subset A ⊂ Ω. Such decision-makers seek satisfactory rather than optimal solutions, for example by taking the first option that satisfies a minimal utility requirement, which Herbert A. Simon calls a satisficing solution [2].
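As a toy illustration, the following sketch computes the average utility obtained by uniformly uncertain decisions from a subset A versus the fully rational optimum; the option names and utility values are invented for this example, not taken from the text.

```python
# Hypothetical utilities over four options (values are our own invention).
U = {"a": 1.0, "b": 3.0, "c": 7.0, "d": 2.0}

def avg_utility(options):
    """Average utility obtained when deciding uniformly among `options`."""
    return sum(U[x] for x in options) / len(options)

x_star = max(U, key=U.get)          # optimum found by a fully rational decision-maker
print(avg_utility(U))               # fully uncertain: (1 + 3 + 7 + 2) / 4 = 3.25
print(avg_utility({"b", "c"}))      # partially reduced uncertainty: (3 + 7) / 2 = 5.0
print(avg_utility({x_star}))        # fully rational: 7.0
```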

In its most basic form, the concept of decision-making can be formalized as the process of looking for a decision x in a discrete set of options Ω. We say that a decision is certain, if repeated queries of the decision-maker will result in the same decision, and it is uncertain, if repeated queries can result in different decisions. Uncertainty reduction then corresponds to reducing the amount of uncertain options. Hence, a decision-making process that transitions from a space Ω of options to a strictly smaller subset A ⊊ Ω reduces the amount of uncertain options from |Ω| to |A|, with the possible goal to eventually find a single certain decision. Such a process is generally costly: the more uncertainty is reduced, the more resources it costs (Figure 1). The explicit mapping between uncertainty reduction and resource cost depends on the details of the underlying process and on which explicit quantity is taken as the resource. For example, if the resource is given by time (or any monotone function of time), then a search algorithm that eliminates options sequentially until the target value is found (linear search) is less cost efficient than an algorithm that takes a sorted list and in each step removes half of the options by comparing the midpoint to the target (logarithmic search). Abstractly, any real-valued function C on the power set of Ω that satisfies C(A′) < C(A) whenever A ⊊ A′ might be used as a cost function, in the sense that the difference C(A) − C(A′) quantifies the expenses of reducing the uncertainty from A′ to A.
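The linear versus logarithmic search comparison can be made concrete by counting comparisons as the abstract resource. This is our own minimal sketch, counting one comparison per inspected element (linear search) or per halving step (binary search):

```python
# Comparisons consumed as an abstract resource by two search strategies.
def linear_search(sorted_list, target):
    comparisons = 0
    for i, v in enumerate(sorted_list):
        comparisons += 1
        if v == target:
            return i, comparisons
    return None, comparisons

def binary_search(sorted_list, target):
    comparisons, lo, hi = 0, 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1                  # one halving step counted as one comparison
        if sorted_list[mid] == target:
            return mid, comparisons
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, comparisons

data = list(range(1024))
print(linear_search(data, 1000)[1])   # 1001 comparisons
print(binary_search(data, 1000)[1])   # 10 comparisons here; at most 11 for 1024 options
```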

2.2. Probabilistic Decision-Making

Let Ω be a discrete decision space of N = |Ω| < ∞ options, so that P_Ω consists of discrete distributions p, often represented by probability vectors p = (p_1, ⋯, p_N). However, many of the concepts presented in this and the following section can be generalized to the continuous case [27,28].

Intuitively, the uncertainty contained in a distribution p ∈ P_Ω is related to the relative inequality of its entries: the more similar its entries are, the higher the uncertainty. This means that uncertainty is increased by moving some probability weight from a more likely option to a less likely option. It turns out that this simple idea leads to a concept widely known as majorization [27,29,30,31,32,33,34], which has roots in the economic literature of the early 20th century [26,35], where it was introduced to describe income inequality, later known as the Pigou–Dalton Principle of Transfers. Here, the operation of moving weight from a more likely to a less likely option corresponds to the transfer of income from one individual of a population to a relatively poorer individual (also known as a Robin Hood operation [30]). Since a decision-making process can be viewed as a sequence of uncertainty reducing computations, we call the inverse of such a Pigou–Dalton transfer an elementary computation.

Definition 1 (Elementary computation). A transformation on P_Ω of the form

T_ε : p ↦ (p_1, ⋯, p_m + ε, ⋯, p_n − ε, ⋯, p_N), (3)

where m, n are such that p_m ≤ p_n, and 0 < ε ≤ (p_n − p_m)/2, is called a Pigou–Dalton transfer (see Figure 3). We call its inverse T_ε^{−1} an elementary computation.
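A Pigou–Dalton transfer and its inverse from Definition 1 can be sketched directly; the function names below are ours, not from the text.

```python
def pigou_dalton(p, m, n, eps):
    """T_eps of Equation (3): move weight eps from the more likely entry n
    to the less likely entry m, increasing uncertainty."""
    assert p[m] <= p[n] and 0 < eps <= (p[n] - p[m]) / 2
    q = list(p)
    q[m] += eps
    q[n] -= eps
    return q

def elementary_computation(p, m, n, eps):
    """The inverse transfer: move weight eps back from entry m to entry n,
    making the two entries more dissimilar and thus reducing uncertainty."""
    q = list(p)
    q[m] -= eps
    q[n] += eps
    return q

p = [0.1, 0.4, 0.2, 0.3]
q = pigou_dalton(p, m=0, n=1, eps=0.1)               # [0.2, 0.3, 0.2, 0.3]
print(elementary_computation(q, m=0, n=1, eps=0.1))  # recovers p (up to rounding)
```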

Since making two probability values more similar and making them more dissimilar are the only two ways to minimally transform a probability distribution, elementary computations are the most basic mechanism by which uncertainty is reduced. Hence, we conclude that a distribution p′ has more uncertainty than a distribution p if and only if p can be obtained from p′ by finitely many elementary computations (and permutations, which are not considered an elementary computation due to the choice of ε).

Definition 2 (Uncertainty). We say that p′ ∈ P_Ω contains more uncertainty than p ∈ P_Ω, denoted by

p′ ≺ p, (4)

if and only if p can be obtained from p′ by a finite number of elementary computations and permutations.

Note that, mathematically, this defines a preorder on P_Ω, i.e., a reflexive (p ≺ p for all p ∈ P_Ω) and transitive (if p″ ≺ p′ and p′ ≺ p, then p″ ≺ p, for all p, p′, p″ ∈ P_Ω) binary relation.

In the literature, there are different names for the relation between p and p′ expressed by Definition 2: for example, p′ is called more mixed than p [36], more disordered than p [37], more chaotic than p [32], or an average of p [29]. Most commonly, however, p is said to majorize p′, a notion which started with the early influences of Muirhead [38] and Hardy, Littlewood, and Pólya [29], and was developed by many authors into the field of majorization theory (a standard reference was published by Marshall, Olkin, and Arnold [27]), with far-reaching applications until today, especially in non-equilibrium thermodynamics and quantum information theory [39,40,41].

There are plenty of equivalent (arguably less intuitive) characterizations of p′ ≺ p, some of which are summarized below. However, one characterization makes use of a concept very closely related to Pigou–Dalton transfers, known as T-transforms [27,32], which expresses the fact that moving some weight from a more likely option to a less likely option is equivalent to taking (weighted) averages of the two probability values. More precisely, a T-transform is a linear operator on P_Ω with a matrix of the form T = (1 − λ) I + λ Π, where I denotes the identity matrix on R^N, Π denotes a permutation matrix of two elements, and 0 ≤ λ ≤ 1. If Π permutes p_m and p_n, then (Tp)_k = p_k for all k ∉ {m, n}, and

(Tp)_m = (1 − λ) p_m + λ p_n,  (Tp)_n = λ p_m + (1 − λ) p_n. (5)

Hence, a T-transform considers any two probability values p_m and p_n of a given p ∈ P_Ω, calculates their weighted averages with weights (1 − λ, λ) and (λ, 1 − λ), and replaces the original values with these averages. From Equation (5), it follows immediately that a T-transform with parameter 0 < λ ≤ 1/2 and a permutation Π of p_m, p_n with p_m ≤ p_n is a Pigou–Dalton transfer with ε = (p_n − p_m) λ. In addition, allowing 1/2 ≤ λ ≤ 1 means that T-transforms include permutations; in particular, p′ ≺ p if and only if p′ can be derived from p by successive applications of finitely many T-transforms.
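The matrix form of a T-transform, and its agreement with a Pigou–Dalton transfer of size ε = λ(p_n − p_m), can be checked numerically; the helper name in this NumPy sketch is ours.

```python
import numpy as np

def t_transform(p, m, n, lam):
    """Apply T = (1 - lam) * I + lam * Pi, with Pi permuting entries m and n."""
    N = len(p)
    Pi = np.eye(N)
    Pi[[m, n]] = Pi[[n, m]]          # permutation matrix of the two elements
    T = (1 - lam) * np.eye(N) + lam * Pi
    return T @ p

p = np.array([0.1, 0.2, 0.3, 0.4])
q = t_transform(p, m=0, n=3, lam=0.25)

# Same result as a Pigou-Dalton transfer with eps = lam * (p_n - p_m) = 0.075:
eps = 0.25 * (p[3] - p[0])
print(q)                             # approximately [0.175, 0.2, 0.3, 0.325]
```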

Due to a classic result by Hardy, Littlewood, and Pólya ([29] (p. 49)), this characterization can be stated in an even simpler form by using doubly stochastic matrices, i.e., matrices A = (A_{ij})_{i,j} with A_{ij} ≥ 0 and ∑_i A_{ij} = 1 = ∑_j A_{ij} for all i, j. By writing xA := A^T x for all x ∈ R^N, and e := (1, ⋯, 1), these conditions are often stated as

A_{ij} ≥ 0,  Ae = e,  eA = e. (6)

Note that doubly stochastic matrices can be viewed as generalizations of T-transforms in the sense that a T-transform takes an average of two entries, whereas if p′ = pA with a doubly stochastic matrix A, then p′_j = ∑_i A_{ij} p_i is a convex combination, or a weighted average, of p with coefficients (A_{ij})_i for each j. This is also why p′ is then called more mixed than p [36]. Therefore, similar to T-transforms, we might expect that, if p′ is the result of an application of a doubly stochastic matrix, p′ = pA, then p′ is an average of p and therefore contains more uncertainty than p. This is exactly what is expressed by Characterization (iii) in the following theorem. A similar characterization of p′ ≺ p is that p′ must be given by a convex combination of permutations of the elements of p (see property (iv) below).
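Both viewpoints can be illustrated together: a convex combination of permutation matrices is doubly stochastic (and, by the Birkhoff–von Neumann theorem, every doubly stochastic matrix arises this way), and applying it to p yields a more mixed p′. A sketch under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
# Convex combination of five random permutation matrices -> doubly stochastic A.
perms = [np.eye(N)[rng.permutation(N)] for _ in range(5)]
theta = rng.dirichlet(np.ones(5))
A = sum(t * P for t, P in zip(theta, perms))

# Conditions of Equation (6): nonnegative entries, Ae = e, eA = e.
assert np.all(A >= 0)
assert np.allclose(A.sum(axis=0), 1) and np.allclose(A.sum(axis=1), 1)

p = np.array([0.6, 0.2, 0.15, 0.05])
p_mixed = p @ A                   # p' = pA, a weighted average of permutations of p
print(np.sort(p_mixed)[::-1])     # flatter than p, i.e., more uncertainty
```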

Without having the concept of majorization, Schur proved that functions of the form p ↦ ∑_i f(p_i) with a convex function f are monotone with respect to the application of a doubly stochastic matrix [42] (see property (v) below). Functions of this form are an important class of cost functions for probabilistic decision-makers, as we discuss in Example 1.

Theorem 1 (Characterizations of p′ ≺ p [27]). For p, p′ ∈ P_Ω, the following are equivalent:
(i) p′ ≺ p, i.e., p′ contains more uncertainty than p (Definition 2);
(ii) p′ is the result of finitely many T-transforms applied to p;
(iii) p′ = pA for a doubly stochastic matrix A;
(iv) p′ = ∑_{k=1}^K θ_k Π_k(p), where K ∈ N, ∑_{k=1}^K θ_k = 1, θ_k ≥ 0, and Π_k is a permutation for all k ∈ {1, ⋯, K};
(v) ∑_{i=1}^N f(p′_i) ≤ ∑_{i=1}^N f(p_i) for all continuous convex functions f;
(vi) ∑_{i=1}^k (p′)_i^↓ ≤ ∑_{i=1}^k p_i^↓ for all k ∈ {1, ⋯, N − 1}, where p^↓ denotes the decreasing rearrangement of p.
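Characterization (vi) yields a direct algorithmic test for the relation ≺; a minimal sketch (the function name is ours):

```python
import numpy as np

def majorizes(p, p_prime, tol=1e-12):
    """True iff p' ≺ p by characterization (vi): the partial sums of the
    decreasing rearrangement of p' never exceed those of p."""
    ps = np.cumsum(np.sort(p)[::-1])
    qs = np.cumsum(np.sort(p_prime)[::-1])
    return bool(np.all(qs <= ps + tol))

# Checking part of the chain of Equation (7) for N = 4:
dirac   = [1.0, 0.0, 0.0, 0.0]
half    = [0.5, 0.5, 0.0, 0.0]
third   = [1/3, 1/3, 1/3, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
print(majorizes(dirac, half), majorizes(half, third), majorizes(third, uniform))
# True True True
```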

As argued above, the equivalence between (i) and (ii) is straightforward. The equivalences among (ii), (iii), and (vi) are due to Muirhead [38] and Hardy, Littlewood, and Pólya [29]. The implication (v) ⇒ (iii) is due to Karamata [43] and Hardy, Littlewood, and Pólya [44], whereas (iii) ⇒ (v) goes back to Schur [42]. Mathematically, (iv) means that p′ belongs to the convex hull of all permutations of the entries of p, and the equivalence (iii) ⇔ (iv) is known as the Birkhoff–von Neumann theorem. Here, we state all relations for probability vectors p ∈ P_Ω, even though they are usually stated for all p, p′ ∈ R^N with the additional requirement that ∑_{i=1}^N p_i = ∑_{i=1}^N p′_i.

Condition (vi) is the classical and most commonly used definition of majorization [27,29,34], since it is often the easiest to check in practical examples. For example, from (vi), it immediately follows that uniform distributions over N options contain more uncertainty than uniform distributions over N′ < N options, since ∑_{i=1}^k 1/N = k/N ≤ k/N′ = ∑_{i=1}^k 1/N′ for all k ≤ N′ (and the partial sums of the smaller uniform distribution equal 1 ≥ k/N for k > N′), i.e., for N ≥ 3 we have

(1/N, ⋯, 1/N) ≺ (1/(N−1), ⋯, 1/(N−1), 0) ≺ ⋯ ≺ (1/2, 1/2, 0, ⋯, 0) ≺ (1, 0, ⋯, 0). (7)

In particular, if A ⊂ A′ ⊂ Ω, then the uniform distribution over A contains less uncertainty than the uniform distribution over A′, which shows that the notion of uncertainty introduced in Definition 2 is indeed a generalization of the notion of uncertainty given by the number of uncertain options introduced in the previous section.

Note that, ≺ being only a preorder on P_Ω, in general two distributions p′, p ∈ P_Ω are not necessarily comparable, i.e., we can have both p′ ⊀ p and p ⊀ p′. In Figure 4, we visualize the regions of all comparable distributions for two exemplary distributions on a three-dimensional decision space (N = 3), represented on the two-dimensional simplex of probability vectors p = (p_1, p_2, p_3). For example, p = (1/2, 1/4, 1/4) and p′ = (2/5, 2/5, 1/5) cannot be compared under ≺, since 1/2 > 2/5, but 3/4 < 4/5.
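The incomparability of these two example distributions can be verified directly from the partial sums used in characterization (vi):

```python
import numpy as np

p  = np.array([1/2, 1/4, 1/4])
pp = np.array([2/5, 2/5, 1/5])

ps = np.cumsum(np.sort(p)[::-1])     # [0.5, 0.75, 1.0]
qs = np.cumsum(np.sort(pp)[::-1])    # [0.4, 0.8, 1.0]
# Neither vector of partial sums dominates the other:
print(bool(np.all(qs <= ps)), bool(np.all(ps <= qs)))   # False False
```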

Cost functions can now be generalized to probabilistic decision-making by noting that the property C(A′) < C(A) whenever A ⊊ A′ in Equation (2) means that C is strictly monotonic with respect to the preorder given by set inclusion.

Definition 3 (Cost functions on P_Ω). We say that a function C : P_Ω → R_+ is a cost function, if it is strictly monotonically increasing with respect to the preorder ≺, i.e., if

p′ ≺ p ⇒ C(p′) ≤ C(p), (8)

with equality only if p and p′ are equivalent, p′ ∼ p, which is defined as p′ ≺ p and p ≺ p′. Moreover, for a parameterized family of posteriors (p_r)_{r∈I}, we say that r is a resource parameter with respect to a cost function C, if the mapping I → R_+, r ↦ C(p_r) is strictly monotonically increasing.
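To illustrate the notion of a resource parameter, consider a hypothetical one-parameter family of posteriors, here a softmax family over invented utilities (our choice for illustration only, not prescribed by the text), with the negative Shannon entropy as cost function:

```python
import numpy as np

U = np.array([1.0, 2.0, 4.0, 3.0])   # invented utilities

def p_r(r):
    """Softmax (Gibbs) posterior, sharpening around the optimum as r grows."""
    w = np.exp(r * U)
    return w / w.sum()

def cost(p):
    """C(p) = -H(p), the negative Shannon entropy (a Schur-convex cost)."""
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

costs = [cost(p_r(r)) for r in (0.0, 0.5, 1.0, 2.0)]
print(costs)   # strictly increasing, so r is a resource parameter w.r.t. C
```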

Since monotonic functions with respect to majorization were first studied by Schur [42], functions with this property are usually called (strictly) Schur-convex ([27] (Ch. 3)).

Example 1 (Generalized entropies). From (v) in Theorem 1, it follows that functions of the form

C(p) = ∑_{i=1}^N f(p_i), (9)

where f is strictly convex, are examples of cost functions. Since many entropy measures used in the literature can be seen to be special cases of Equation (9) (with a concave f), functions of this form are often called generalized entropies [45]. In particular, for the choice f(t) = t log t, we have C(p) = −H(p), where H(p) denotes the Shannon entropy of p. Thus, if p′ contains more uncertainty than p in the sense of Definition 2 (p′ ≺ p), then the Shannon entropy of p′ is larger than the Shannon entropy of p, and therefore p′ also contains more uncertainty than p in the sense of classical information theory. Similarly, for f(t) = −log(t) we obtain the (negative) Burg entropy, and for functions of the form f(t) = ±t^α for α ∈ R∖{0,1} we get the (negative) Tsallis entropy, where the sign is chosen depending on α such that f is convex (see, e.g., [46] for more examples). Moreover, the composition of any (strictly) monotonically increasing function g with Equation (9) generates another class of cost functions, which contains for example the (negative) Rényi entropy [23]. Note also that entropies of the form of Equation (9) are special cases of Csiszár's f-divergences [47] for uniform reference distributions (see Example 3 below). In Figure 5, several examples of cost functions are shown for N = 3. In this case, the two-dimensional probability simplex P_Ω is given by the triangle in R^3 with vertices (1,0,0), (0,1,0), and (0,0,1). Cost functions are visualized in terms of their level sets.

We prove in Proposition A1 in Appendix A that all cost functions of the form of Equation (9) are superadditive with respect to coarse-graining. This seems to be a new result and an improvement upon the fact that generalized entropies (and f-divergences) satisfy information monotonicity [48]. More precisely, if a decision in Ω, represented by a random variable Z, is split up into two steps by partitioning Ω = ⋃_{i∈I} A_i and first deciding about the partition i ∈ I, correspondingly described by a random variable X with values in I, and then choosing an option inside of the selected partition A_i, represented by a random variable Y, i.e., Z = (X, Y), then

C(Z) ≥ C(X) + C(Y|X), (10)

where C(X) := C(p(X)) and C(Y|X) := E_{p(X)}[C(p(Y|X))]. For symmetric cost functions (such as Equation (9)) this is equivalent to

C(p_1, ⋯, p_N) ≥ C(p_1 + p_2, p_3, ⋯, p_N) + (p_1 + p_2) C(p_1/(p_1 + p_2), p_2/(p_1 + p_2)). (11)

The case of equality in Equations (10) and (11) (see Figure 6) is sometimes called separability [49], strong additivity [50], or recursivity [51], and it is often used to characterize Shannon entropy [23,52,53,54,55,56]. In fact, we also show in Appendix A (Proposition A2) that cost functions C that are additive under coarse-graining are proportional to the negative Shannon entropy −H. See also Example 3 in the next section, where we discuss the generalization to arbitrary reference distributions.
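Equation (11) can be checked numerically: for the negative Shannon entropy it holds with equality (recursivity), while for a Tsallis-type cost with f(t) = t² − t (shifted so that f(1) = 0 and deterministic sub-decisions cost nothing) the inequality is strict. A sketch with invented distribution values:

```python
import numpy as np

def C_shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))      # f(t) = t log t

def C_tsallis(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p**2 - p))           # f(t) = t^2 - t (convex, f(1) = 0)

def rhs(C, p):
    """Right-hand side of Equation (11), coarse-graining the first two options."""
    s = p[0] + p[1]
    return C(np.concatenate(([s], p[2:]))) + s * C(np.array([p[0] / s, p[1] / s]))

p = np.array([0.1, 0.3, 0.4, 0.2])
print(C_shannon(p) - rhs(C_shannon, p))   # ~0: equality (recursivity)
print(C_tsallis(p) - rhs(C_tsallis, p))   # > 0: strict superadditivity
```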

We can now refine the notion of a decision-making process introduced in the previous section as a mapping ϕ together with a cost function C satisfying Equation (2). Instead of simply mapping from sets A′ to smaller subsets A ⊊ A′ by successively eliminating options, we now allow ϕ to be a mapping between probability distributions such that ϕ(p) can be obtained from p by a finite number of elementary computations (without permutations), and we require C to be a cost function on P_Ω, so that

p ⋨ ϕ(p),  C(p) < C(ϕ(p))  ∀ p ∈ P_Ω. (12)

Here, C ( p ) quantifies the total costs of arriving at a distribution p , and p ′ ⋨ p means that p ′ ≺ p and p ⊀ p ′ . In other words, a decision-making process can be viewed as traversing probability space by moving pieces of probability from one option to another option such that uncertainty is reduced.

Up to now, we have ignored one important property of a decision-making process, the distribution q with minimal cost, i.e., satisfying C(q) ≤ C(p) for all p, which must be identified with the initial distribution of a decision-making process with cost function C. As one might expect (see Figure 5), it turns out that all cost functions according to Definition 3 have the same minimal element.

Proposition 1 (Uniform distributions are minimal). The uniform distribution (1/N, ⋯, 1/N) is the unique minimal element in P_Ω with respect to ≺, i.e.,

(1/N, ⋯, 1/N) ≺ p  ∀ p ∈ P_Ω. (13)

Once Equation (13) is established, it follows from Equation (8) that C((1/N, ⋯, 1/N)) ≤ C(p) for all p; in particular, the uniform distribution corresponds to the initial state of all decision-making processes with cost function C satisfying Equation (12). Moreover, it contains the maximum amount of uncertainty with respect to any entropy measure of the form of Equation (9), which is known as the second Khinchin axiom [49]; e.g., for Shannon entropy, 0 ≤ H(p) ≤ log N. Proposition 1 follows from Characterization (iv) in Theorem 1 after noticing that every p ∈ P_Ω can be transformed to a uniform distribution by averaging over all cyclic permutations of its elements (see Proposition A3 in Appendix A for a detailed proof).
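The cyclic-averaging argument behind Proposition 1 is easy to verify: the equal-weight average of a distribution over all N cyclic shifts is a convex combination of permutations of p (characterization (iv)) and always equals the uniform distribution.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])
N = len(p)
# Convex combination (equal weights 1/N) of the N cyclic permutations of p:
avg = sum(np.roll(p, k) for k in range(N)) / N
print(avg)   # [0.25 0.25 0.25 0.25]
```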