Games for studying cooperation

We describe the benchmark of games used in our studies, provide an overview of S++, and describe S# in more detail. Details that are informative but not essential to gaining an understanding of the main results of the paper are provided in the Supplementary Notes. We begin with a discussion about games for benchmarking cooperation.

We study cooperation between people and algorithms in long-term relationships (rather than one-shot settings53) in which the players do not share all the same preferences. In this paper, we model these interactions as RSGs. An RSG, played by players i and −i, consists of a set of rounds. In each round, the players engage in a sequence of stage games drawn from the set of stage games S. In each stage s ∈ S, both players choose an action from a finite set. Let A(s) = A i (s) × A −i (s) be the set of joint actions available in stage s, where A i (s) and A −i (S) are the action sets of players i and −i, respectively. Each player simultaneously selects an action from its set of actions. Once joint action a = (a i, a −i ) is played in stage s, each player receives a finite reward, denoted r i (s, a) and r −i (s, a), respectively. The world also transitions to some new stage s′ with probability defined by P M (s, a, s′). Each round of an RSG begins in the start stage \(\hat s \in S\) and terminates when some goal stage \(s_g \in G \subseteq S\) is reached. A new round then begins in stage \(\hat s\). The game repeats for an unknown number of rounds. In this work, we assume perfect information.

Real-world interactions often permit cheap talk, in which players send non-binding, costless signals to each other before taking actions. In this paper, we add cheap talk to repeated games by allowing each player, at the beginning of each round, to send a set of messages (selected from a pre-determined set of speech acts M) prior to acting in that round. Formally, let \({\cal P}(M)\) denote the power set of M, and let \(m_i(t) \in {\cal P}(M)\) be the set of messages sent by player i prior to round t. Only after sending the message m i (t) can player i view the message m −1 (t) (sent by its partner) and vice versa.

As with all historical grand challenges in AI, it is important to identify a class of benchmark problems to compare the performance of different algorithms. When it comes to human cooperation, a fundamental benchmark has been 2 × 2, general-sum, repeated games54. This class of games has been a workhorse for decades in the fields of behavioral economics45, mathematical biology45, psychology46 sociology47, computer science36, and political science26. These fields have revealed many aspects of human cooperative behavior through canonical games, such as Prisoner’s Dilemmas, Chicken, Battle of the Sexes, and the Stag Hunt. Such games, therefore, provide a well-established, extensively studied, and widely understood benchmark for studying the capabilities of machines to develop cooperative relationships.

Thus, for foundational purposes, we initially focus on two-player, two-action normal-form games, or RSGs with a single stage (i.e., |S| = 1). This allows us to fully enumerate the problem domain under the assumption that payoff functions follow strict ordinal preference orderings. The periodic table of 2 × 2 games54,55,56,57,58 (see Supplementary Figure 1 along with Supplementary Note 2) identifies and categorizes 144 unique game structures that present many unique scenarios in which machines may need to cooperate. We use this set of game structures as a benchmark against which to compare the abilities of algorithms to cooperate. Successful algorithms should forge successful relationships with both people and machines across all of these repeated games. In particular, we can use these games to quantify the abilities of various state-of-the-art machine-learning algorithms to satisfy the properties advocated in the introduction: generality across games, flexibility across opponent types (including humans), and speed of learning.

Though we initially focus on normal-form RSGs, we are interested in algorithms that can be used in more general settings, such as RSGs in which |S| > 1. These games require players to reason over multiple actions in each round. Thus, we also study the algorithms in a set of such games, including the Block Game shown in Fig. 6a. Additional results are reported in Supplementary Note 6.

The search for metrics that properly evaluate successful behavior in repeated games has a long history, for which we refer the reader to Supplementary Note 2. In this paper, we focus on two metrics of success: empirical performance and proportion of mutual cooperation. Ultimately, the success of a player in an RSG is measured by the sum of the payoffs the player receives over the duration of the game. A successful algorithm should have high empirical performance across a broad range of games when paired with many different kinds of partners. However, since the level of mutual cooperation (i.e., how often both players cooperate with each other) often highly correlates with a player’s empirical performance26, the ability to establish cooperative relationships is a key attribute of successful algorithms. However, we do not consider mutual cooperation as a substitute for high empirical performance, but rather as a supporting factor.

The term “cooperation” has specific meaning in well-known games such as the Prisoner’s Dilemma. In other games, the term is much more nebulous. Furthermore, mutual cooperation can be achieved in degrees; it is usually not an all or nothing event. However, for simplicity in this work, we define mutual cooperation as the Nash bargaining solution of the game59, defined as the unique solution that maximizes the product of the players’ payoffs minus their maximin values. Supplementary Table 4 specifies the Nash bargaining solutions for the games used in our user studies. Interestingly, the proportion of rounds that players played mutually cooperative solutions (as defined by this measure) was strongly correlated with the payoffs a player received in our user studies. For example, in our third user study, the correlation between payoffs received and proportion of mutual cooperation was r (572)=0.909.

Overview of S++

S# is derived from S++37,44, an expert algorithm that combines and builds on decades of research in computer science, economics, and the behavioral and social sciences. Since understanding S++ is key to understanding S#, we first overview S++.

S++ is defined by a method for computing a set of experts for arbitrary RSGs and a method for choosing which expert to follow in each round (called the expert-selection mechanism). Given a set of experts, S++’s expert-selection mechanism uses a meta-level control strategy based on aspiration learning34,60,61 to dynamically prune the set of experts it considers following in a round. Formally, let E denote the set of experts computed by S++. In each epoch (beginning in round t), S++ computes the potential ρ j (t) of each expert e j ∈ E, and compares this potential with its aspiration level α(t) to form the reduced set E(t) of experts:

$$E(t) = \{ e_j \in E:\rho _j(t) \ge \alpha (t)\} .$$ (1)

This reduced set consists of the experts that S++ believes could potentially produce satisfactory payoffs. It then selects one expert e sel (t) ∈ E(t) using a satisficing decision rule34,60. Over the next m rounds, it follows the strategy prescribed by e sel (t). After these m rounds, it updates its aspiration level as follows:

$$\alpha (t + m) \leftarrow \lambda ^m\alpha (t) + (1 - \lambda ^m)\,R,$$ (2)

where λ ∈ (0, 1) is the learning rate and R is the average payoff S++ obtained in the last m rounds. It also updates each expert e j ∈ E based on its peculiar reasoning mechanism. A new epoch then begins.

S++ uses the description of the game environment to compute a diverse set of experts. Each expert uses distinct mathematics and assumptions to produce a strategy over the entire space of the game. The set of experts used in the implementation of S++ used in our user studies includes five expectant followers, five trigger strategies, a preventative strategy44, the maximin strategy, and a model-based reinforcement learner (MBRL-1). For illustrative purposes relevant to the description of S#, we overview how the expectant followers and trigger strategies are computed.

Both trigger strategies and expectant followers (which are identical to trigger strategies except that they omit the punishment phase of the strategy) are defined by a joint strategy computed over all stages of the RSG. Thus, to create a set of such strategies, S++ first computes a set of pareto-optimal joint strategies, each of which offers a different compromise between the players. This is done by solving Markov decision processes (MDPs) over the joint-action space of the RSG. These MDPs are defined by A, S, and P M of the RSG, as well as a payoff function defined as a convex combination of the players’ payoffs62:

$$y^\omega (s,{\mathbf{a}}) = \omega r_i(s,{\mathbf{a}}) + (1 - \omega )r_{ - i}(s,{\mathbf{a}}),$$ (3)

where ω ∈ [0, 1]. Then, the value of joint-action a in state s is

$$Q^\omega (s,{\mathbf{a}}) = y^\omega (s,{\mathbf{a}}) + \mathop {\sum}\limits_{s{\prime} \in S} P_M(s,{\mathbf{a}},s{\prime})V^\omega (s{\prime}),$$ (4)

where \(V^\omega (s) = \mathop {{max}}

olimits_{{\mathbf{a}} \in A(s)} Q^\omega (s,{\mathbf{a}})\). The MDP can be solved in polynomial time using linear programming63.

By solving MDPs of this form for multiple values of ω, S++ computes a variety of possible pareto-optimal62 joint strategies. We call the resulting solutions “pure solutions.” These joint strategies produce joint payoff profiles. Let MDP(ω) denote the joint strategy produced by solving an MDP for a particular ω. Also, let \(V_i^\omega (s)\) be player i’s expected future payoff from stage s when MDP(ω) is followed. Then, the ordered pair \(\left( {V_i^\omega (\hat s),V_{ - i}^\omega (\hat s)} \right)\) is the joint payoff vector for the joint strategy defined by MDP(ω). Additional solutions, or “alternating solutions,” are obtained by alternating across rounds between different pure solutions. Since longer cycles are difficult for a partner to model, S++ only includes cycles of length two.

Applying this technique to a 0-1-3-5 Prisoner’s Dilemma produces the joint payoffs depicted in Fig. 9a. Notably, MDP(0.1), MDP(0.5), and MDP(0.9) produce the joint payoffs (0, 5), (3, 3), and (5, 0), respectively. Alternating between these solutions produces three other solutions whose (average) payoff profiles are also shown. These joint payoffs reflect different compromises, of varying degrees of fairness, that the two players could possibly agree upon.

Fig. 9 Target solutions computed by S++. The x’s and o’s are the joint payoffs of possible target solutions computed by S++. Though games vary widely, the possible target solutions computed by S++ in each game represent a range of different compromises, of varying degrees of fairness, that the two players could possibly agree upon. For example, tradeoffs in solution quality are similar for a a 0-1-3-5 Prisoner’s Dilemma and b a more complex micro-grid scenario44. Solutions marked in red are selected by S++ as target solutions Full size image

Regardless of the structure of the RSG, including whether it is simple or complex, this technique produces a set of potential compromises available in the game. For example, Fig. 9b shows potential solutions computed for a two-player micro-grid scenario44 (with asymmetric payoffs) in which players must share energy resources. Despite the differences in the dynamics of the Prisoner’s Dilemma and this micro-grid scenario, these games have similar sets of potential compromises. As such, in each game, S++ must learn which of these compromises to play, including whether to make fair compromises, or compromises that benefit one player over the other (when they exist). These game-independent similarities can be exploited by S# to provide signaling capabilities that can be used in arbitrary RSG’s, an observation we exploited to develop S#’s signaling mechanisms (see the next subsection for details).

The implementation of S++ used in our user studies selects five of these compromises, selected to reflect a range of different compromises. They include the solution most resembling the game’s egalitarian solution62, and the two solutions that maximize each player’s payoff subject to the other player receiving at least its maximin value (if such solutions exist). The other two solutions are selected to maximize the Euclidean distance between the payoff profiles of selected solutions. The five selected solutions form five expectant followers (which simply repeatedly follow the computed strategy) and five trigger strategies. For the trigger strategies, the selected solutions constitute the offers. The punishment phase is the strategy that minimizes the partner’s maximum expected payoff, which is played after the partner deviates from the offer until the sum of its partner’s payoffs (from the time of the deviation) are below what it would have obtained had it not deviated from the offer. This makes following the offer of the trigger strategy the partner’s optimal strategy.

The resulting set of experts available to S++ generalizes many popular strategies that have been studied in past work. For example, for prisoner’s dilemmas, the computed trigger strategies include Generous Tit-for-Tat and other strategies that resemble zero-determinant strategies64 (e.g., Bully65). Furthermore, computed expectant followers include Always Cooperate, while the expert MBRL-1 quickly learns the strategy Always Defect when paired with a partner that always cooperates. In short, the set of strategies (generalized to the game being played) available for S++ to follow give it the flexibility to effectively adapt to many different kinds of partners.

Description of S#

S# builds on S++ in two ways. First, it generates speech acts to try to influence its partner’s behavior. Second, it uses its partner’s speech acts to make more informed choices regarding which experts to follow.

Signaling to its partner: S#’s signaling mechanism was designed with three properties in mind: game independence, fidelity between signals and actions, and human understandability (i.e., the signals should be communicated at a level conducive to human understanding). One way to do this is to base signal generation on game-independent, high-level ideals rather than game-specific attributes. Example ideals include proficiency assessment10,27,61,66, fairness67,68, behavioral expectations25, and punishment and forgiveness26,69. These ideals package well-established concepts of interaction in terms people are familiar with.

Not coincidentally, S++’s internal state and algorithmic mechanisms are largely defined in terms of these high-level ideals. First, S++’s decision-making is governed by proficiency assessment. It continually evaluating its own proficiency and that of its experts by comparing its aspiration level (Eq. 2), which encodes performance expectations, with its actual and potential performance (see descriptions of traditional aspiration learning34,60,61 and Eq. 1). S# also evaluates its partner’s performance against the performance it expects it partner to have. Second, the array of compromises computed for expectant followers and trigger strategies encode various degrees of fairness. As such, these experts can be defined and even referred to by references to fairness. Third, strategies encoded by expectant followers and trigger strategies define expectations for how the agent and its partner should behave. Finally, transitions between the offer and punishment phases of trigger strategies define punishment and forgiveness.

The cheap talk generated by S# is based not on the individual attributes of the game, but rather on events taken in context of these five game-independent principles (as they are encoded in S++). Specifically, S# automatically computes a finite-state machine (FSM) with output (specifically, a Mealy machine70) for each expert. The states and transitions in the state machine are defined by proficiency assessments, behavioral expectations, and (in the case of experts encoding trigger strategies) punishment and forgiveness. The outputs of each FSM are speech acts that correspond to the various events and that also refer to the fairness of outcomes and potential outcomes.

For example, consider the FSM with output for a trigger strategy that offers a pure solution, which is shown in Fig. 3b. States s 0 – s 6 are states in which S# has expectations that its partner will conform with the trigger strategy’s offer. Initially (when transitioning from state s 0 to s 1 ), S# voices these behavioral expectations (speech act #15), along with a threat that if these expectations are not met, it will punish its partner (speech act #0). If these expectations are met (event labeled s), S# praises its partner (speech acts #5–6). On the other hand, when behavioral expectations are not met (events labeled d and g), S# voices its dissatisfaction with its partner (speech acts #11–12). If S# determines that its partner has benefitted from the deviation (proficiency assessment), the expert transitions to state s 7 , while telling its partner that it will punish him (speech act #13). S# stays in this punishment phase and voices pleasure in reducing its partner’s payoffs (speech act #14) until the punishment phase is complete. It then transitions out of the punishment phase (into state s 8 ) and expresses that it forgives its partner, and then returns to states in which it renews behavioral expectations for its partner.

FSMs for other kinds of experts along with specific details for generating them, are given in Supplementary Note 4.

Because S#’s speech generation is based on game-independent principles, the FSMs for speech generation are the same for complex RSGs as they are for simple (normal-form) RSGs. The exception to this statement is the expression of behavioral expectations, which are expressed in normal-form games simply as sequences of joint actions. However, more complex RSGs (such as the Block Game; Fig. 6) have more complex joint strategies that are not as easily expressed in a generic way that people understand. In these cases, S# uses game-invariant descriptions of fairness to specify solutions, and then depends on its partner to infer the details. It refers to a solution in which players get similar payoffs as a “cooperative” or “fair” solution, and a solution in which one player scores higher than the other as a solution in which “you (or I) get the higher payoff.” While not as specific, our results demonstrate that such expressions can be sufficient to communicate behavioral expectations.

While signaling via cheap talk has great potential to increase cooperation, honestly signaling one’s internal states exposes a player to the potential of being exploited. Furthermore, the so-called silent treatment is often used by humans as a means of punishment and an expression of displeasure. For these two reasons, S# also chooses not to speak when its proposals are repeatedly not followed by its partner. The method describing how S# determines whether or not to voice speech acts is described in Supplementary Note 4.

In short, S# essentially voices the stream of consciousness of its internal decision-making mechanisms, which are tied to the aforementioned game-independent principles. Since these principles also tend to be understandable to humans and are present in all forms of RSGs, S#’s signal-generation mechanism tends to achieve the three properties we desired to satisfy: game independence, fidelity between signals and actions, and human understandability.

Responding to its partner: In addition to voicing cheap talk, the ability to respond to a partner’s signals can substantially enhance one’s ability to quickly coordinating on cooperative solutions (see Supplementary Note 4, including Supplementary Table 12). When its partner signals a desire to play a particular solution, S# uses proficiency assessment to determine whether it should consider playing it. If this assessment indicates that the proposed solution could be a desirable outcome to S#, it determines which of its experts play strategies consistent (or congruent) with the proposed solution to further reduce the set of experts that it considers following in that round (step 3 in Fig. 3a). Formally, let E cong (t) denote the set of experts in round t that are congruent with the last joint plan proposed by S#’s partner. Then, S# considers selecting experts from the set defined as:

$$E(t) = \{ e_j \in E_{{\mathrm{cong}}}(t):\rho _j(t) \ge \alpha (t)\} .$$ (5)

If this set is empty (i.e., no desirable options are congruent with the partner’s proposal), E(t) is calculated as with S++ (Eq. 1).

The congruence of the partner’s proposed plan with an expert is determined by comparing the strategy proposed by the partner with the solution espoused by the expert. In the case of trigger strategies and expectant followers, we compare the strategy proposed by the partner with the strategies’ offer. This is done rather easily in normal-form games, as the partner can easily and naturally express its strategy as a sequence of joint actions, which is then compared to the sequence of joint actions of the expert’s offer. For general RSGs, however, this is more difficult because a joint action is viewed as a sequence of joint strategies over many stages. In this case, S# must rely on high-level descriptions of the solution. For example, solutions described as “fair” and “cooperative” can be assumed to be congruent with solutions that have payoff profiles similar to those of the Nash bargaining solution.

Listening to its partner can potentially expose S# to being exploited. For example, in a (0-1-3-5)-Prisoner’s Dilemma, a partner could continually propose that both players cooperate, a proposal S# would continually accept and act on when \(\alpha _i^t \le 3\). However, if the partner did not follow through with its proposal, it could potentially exploit S# to some degree for a period of time. To avoid this, S# listens to its partner less frequently the more the partner fails to follow through with its own proposals. The method describing how S# determines whether or not to listen to its partner is described in Supplementary Note 4.

User studies and statistical analysis

Three user studies were conducted as part of this research. Complete details about these studies and the statistical analysis used to analyze the results are given in Supplementary Notes 5–7.

Data availability

The data sets from our user studies, the computer code used to generate the comparison of algorithms, and our implementation of S# can be obtained by contacting Jacob Crandall.