Subjects and apparatus

Our subjects were six male kea at Willowbank Wildlife Reserve (see Supplementary Table 1). Kea were housed in a large outdoor aviary, where food and water were available ad libitum. Participation in the study was voluntary and subjects were free to leave mid-session at any point. This research was conducted under ethics approval from The University of Auckland Ethics Committee (reference number 001816).

Each subject was allocated an individual training platform (42 cm × 42 cm) within the aviary on which they were tested. Performance in trials was rewarded with soaked Science Hill Diet pellets. A small wooden shelf (60 cm × 20 cm) with a plexiglass screen (43 cm × 29 cm) was used to separate subjects from the apparatus and the experimenter during testing. Transparent jars (⌀10.5 cm, 16 cm tall) were used during training and testing which contained populations of either rewarding (black) or unrewarding (orange) wooden tokens (7 cm × 1 cm × 1 cm). Each jar held a maximum of 120 tokens. When the jars were too large for a population of tokens, tokens sat on a cardboard platform that was placed inside the jar, to ensure subjects could not see the experimenter’s hands during sampling. Semicircular cardboard lids (⌀11.5 cm, 5.5 cm tall) were attached to the top of each jar to ensure subjects could not see which tokens were being sampled. Where barriers were used, a blue foam disk (⌀10.5 cm, 1 cm thick) was added into the jar.

Training and procedures

Throughout training and testing, subjects were required to select which of two closed hands contained an out-of-sight rewarding token, while ignoring the hand containing the unrewarding token. The rewarding token would then be exchanged with the experimenter for a food reward. Where subjects attempted to exchange an unrewarding token, this was taken by the experimenter but not rewarded.

Subjects had been trained to attend to and track hand trajectories for a previous study. For this study, they were trained specifically on hand tracking, so that they could follow the motion of sampling, and on making inferences about sampling from token populations in two jars: they learned to select a hand that picked a rewarding token from a population of 100% rewarding tokens over a hand that sampled from a population of 100% unrewarding tokens (see Supplementary Methods). To allow full counterbalancing of trial presentations at test and to minimise side biasing, subjects were also taught to attend simultaneously to the side on which jars were placed and to whether hands were presented in parallel or crossed over. This was trained over four separate training steps (outlined in detail in the Supplementary Methods).

Before each experimental session, subjects were given motivation trials, in which they had to select a rewarding (black) token, ignoring a nearby unrewarding (orange) token, and exchange it with the experimenter three times in a row. This ensured subjects were eager to work and remembered which of the two tokens they should search for at test. Testing was carried out by three experimenters who were blind to experimental design and hypotheses and who wore mirrored sunglasses. Subjects only proceeded to the next testing condition or experiment upon reaching a criterion of 17/20 correct choices within the same block, or upon completing 240 trials (12 blocks) without reaching criterion. This ensured that subjects were confident in the current task before proceeding to a more demanding one. Throughout testing, hand presentation (parallel or crossed) and the location of the rewarding hand at the time of choice were pseudorandomised and counterbalanced within blocks of 20 trials. Throughout training and testing, kea could only see the experimenter’s hand disappear behind the cardboard occluder on the top of the jar. Therefore, subjects were unable to see how far down into the populations the experimenter’s hand reached, or which token it sampled from the population. In test conditions with very disparate ratios of rewarding-to-unrewarding tokens, we ensured that at least two tokens from the minority population were fully visible to the subjects in every trial.
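The block structure and progression rule described above can be sketched in code. The following Python sketch is illustrative only: the function names are hypothetical, a plain shuffle stands in for whatever pseudorandomisation constraints were actually applied (these are not specified here), and the final function simply computes the binomial probability that a subject choosing at chance would reach 17/20 in a single block.

```python
import random
from itertools import product
from math import comb

BLOCK_SIZE = 20   # trials per block (from the text)
CRITERION = 17    # correct choices required within one block (from the text)
MAX_BLOCKS = 12   # 12 blocks = 240 trials (from the text)


def make_block(rng):
    """Return 20 trials in which each presentation x side combination occurs equally often."""
    combos = list(product(["parallel", "crossed"], ["left", "right"]))
    trials = combos * (BLOCK_SIZE // len(combos))  # 4 combinations x 5 repeats
    rng.shuffle(trials)  # assumption: plain shuffle, no limits on run length
    return trials


def block_of_progression(block_scores):
    """Return the 1-based block after which the subject moves on, or None if still testing."""
    for i, score in enumerate(block_scores[:MAX_BLOCKS], start=1):
        if score >= CRITERION:
            return i
    return MAX_BLOCKS if len(block_scores) >= MAX_BLOCKS else None


def p_criterion_by_chance():
    """Probability of scoring at least 17/20 in one block when choosing at random (p = 0.5)."""
    hits = sum(comb(BLOCK_SIZE, k) for k in range(CRITERION, BLOCK_SIZE + 1))
    return hits / 2 ** BLOCK_SIZE


block = make_block(random.Random(1))
print(block[:4])
print(block_of_progression([12, 15, 17]))
print(p_criterion_by_chance())
```

Under these assumptions, a subject choosing at random would meet the criterion in a given block with a probability of roughly 0.001, which illustrates why 17/20 is a demanding threshold.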

Experiment 1

This experiment investigated whether kea are able to make statistical inferences about two populations of objects using relative frequencies. Over three conditions, we tested whether kea would prefer a sample from a population containing a majority of rewarding tokens over a sample from a population where they were in the minority, and whether, when choosing between samples from two populations, kea rely on (1) relative frequencies, (2) the absolute number of rewarding tokens or (3) the absolute number of unrewarding tokens. Illustrations for the three conditions are provided in Fig. 1.

The first condition aimed to test whether kea would prefer a sampled token from a population where there was a higher probability of randomly sampling a rewarding token, as opposed to a population where there was a higher probability of sampling an unrewarding token. Two jars were presented: one contained a 1:5 ratio of rewarding-to-unrewarding tokens, and the other contained a 5:1 rewarding-to-unrewarding ratio. Both jars contained 120 tokens in total.

The second condition tested whether kea were making their choices based on absolute frequencies or relative frequencies. In order to make this distinction, subjects were presented with two jars containing the same number of rewarding (black) tokens, in differing proportions. One jar had a 1:5 rewarding-to-unrewarding population of 120 tokens, whilst the other had a 5:1 rewarding-to-unrewarding population of 24 tokens. If kea were using the absolute number of rewarding tokens to guide their choices, we predicted they would choose at chance. If, in contrast, they were taking into account the relative proportion of rewarding-to-unrewarding tokens, we predicted they would choose the jar with only four unrewarding tokens.

In the third condition, we presented subjects with two jars containing the same number of unrewarding tokens: one jar had a 57:63 rewarding-to-unrewarding population (120 tokens total), whilst the other had a 3:63 rewarding-to-unrewarding population (66 tokens total). If kea were simply selecting the jar containing the fewest unrewarding tokens rather than comparing between the frequencies of token populations between jars, they should perform at chance in this condition.
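The token counts implied by the three conditions can be checked with a short calculation. The Python sketch below uses only the ratios and jar totals stated above; the helper name `split` and the dictionary layout are illustrative choices rather than part of the original analysis.

```python
from fractions import Fraction


def split(ratio, total):
    """Convert a rewarding:unrewarding ratio and a jar total into token counts."""
    rewarding_part, unrewarding_part = ratio
    unit, remainder = divmod(total, rewarding_part + unrewarding_part)
    if remainder:
        raise ValueError("total is not a whole multiple of the ratio")
    return rewarding_part * unit, unrewarding_part * unit


# (ratio, total) for the two jars in each condition, as given in the text.
CONDITIONS = {
    1: [((1, 5), 120), ((5, 1), 120)],
    2: [((1, 5), 120), ((5, 1), 24)],
    3: [((57, 63), 120), ((3, 63), 66)],
}

counts = {c: [split(r, t) for r, t in jars] for c, jars in CONDITIONS.items()}

for condition, jars in counts.items():
    for rewarding, unrewarding in jars:
        p = Fraction(rewarding, rewarding + unrewarding)
        print(f"condition {condition}: {rewarding} rewarding, "
              f"{unrewarding} unrewarding, P(rewarding) = {p}")
```

The output makes the logic of the controls explicit: in condition 2 both jars hold 20 rewarding tokens, and in condition 3 both hold 63 unrewarding tokens, so only the relative frequencies distinguish the jars.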

Experiment 2

This experiment investigated whether kea are able to integrate physical constraints into their sampling inferences. We first gave kea both egocentric and allocentric experience of a foam barrier. Kea were first allowed to sample from two small jars (⌀6 cm, 7.5 cm tall) where a population of 20 rewarding tokens was either physically accessible or impeded by a barrier, then observed as an experimenter attempted to sample from two populations with a similar configuration (details of training are provided in the Supplementary Methods). Over two conditions, we presented kea with two populations of tokens which were split in the middle by physical barriers, and tested whether kea understood that only the population above the barrier could be sampled from. Both conditions are illustrated in Supplementary Table 1.

In the first condition, each jar contained 40 rewarding and 40 unrewarding tokens. One jar had a 1:1 rewarding-to-unrewarding population (40 tokens) both above and below the barrier, and the other had a 5:1 rewarding-to-unrewarding population (24 tokens) above the barrier and a 5:9 rewarding-to-unrewarding population (56 tokens) below it. This was used to test whether kea were simply attending to which jar had the largest number of rewarding tokens near the top, which should lead to performance at chance, as opposed to comparing between the relative frequencies of tokens for the two accessible populations. Subjects were also expected to perform at chance in this condition if they were comparing between the token frequencies of the two jars without taking the barrier into account, as both jars contained the same absolute numbers and the same overall 1:1 ratio of rewarding to unrewarding tokens (40 rewarding, 40 unrewarding).

The second condition was identical, but with reversed proportions: one jar had a 1:1 rewarding-to-unrewarding population of 40 tokens above and below the barrier, whilst the other had a 1:5 rewarding-to-unrewarding population (24 tokens) above the barrier and the remaining tokens, a 9:5 rewarding-to-unrewarding population, below it. This condition tested whether kea were selecting the jar with the fewest unrewarding tokens near the top, in which case they should perform at chance, or comparing between the relative frequencies of the two accessible populations in the two jars. Again, both jars contained the same absolute number and relative frequencies of rewarding and unrewarding tokens.
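The composition of each jar above and below the barrier can be worked through in the same way. In the Python sketch below, the labels `jar_a` and `jar_b` are hypothetical, and the 56-token total below the barrier in the second condition is inferred from the 80-token jar total (80 − 24) rather than stated directly in the text.

```python
from fractions import Fraction


def split(ratio, total):
    """Convert a rewarding:unrewarding ratio and a total into token counts."""
    unit, remainder = divmod(total, sum(ratio))
    if remainder:
        raise ValueError("total is not a whole multiple of the ratio")
    return ratio[0] * unit, ratio[1] * unit


# Each jar is (above barrier, below barrier); each part is (ratio, total).
CONDITIONS = {
    1: {"jar_a": (((1, 1), 40), ((1, 1), 40)),
        "jar_b": (((5, 1), 24), ((5, 9), 56))},
    2: {"jar_a": (((1, 1), 40), ((1, 1), 40)),
        "jar_b": (((1, 5), 24), ((9, 5), 56))},  # 56 inferred as 80 - 24
}


def describe(jar):
    """Return counts above the barrier, counts for the whole jar, and P(rewarding) above."""
    above, below = (split(ratio, total) for ratio, total in jar)
    whole = (above[0] + below[0], above[1] + below[1])
    return {"above": above, "whole": whole,
            "p_accessible": Fraction(above[0], sum(above))}


summary = {c: {name: describe(jar) for name, jar in jars.items()}
           for c, jars in CONDITIONS.items()}

for condition, jars in summary.items():
    for name, info in jars.items():
        print(condition, name, info)
```

In both conditions every jar contains 40 rewarding and 40 unrewarding tokens overall, whereas the accessible populations above the barrier differ (1/2 versus 5/6 in the first condition and 1/2 versus 1/6 in the second). The first condition equates the number of rewarding tokens above the barrier (20 in each jar), and the second equates the number of unrewarding tokens above it (20 in each jar).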

Experiment 3

Experiment 3 tested whether kea could take a biased sampler’s biases into account during a sampling event. Two experimenters were randomly assigned and counterbalanced between birds as either unbiased (hereafter ‘E2’) or biased (hereafter ‘E1’). The procedure was based on the study by Eckert and colleagues (ref. 22) with chimpanzees, and required four experience phases.

In the first phase, we ensured that kea could tell the difference between the two experimenters: E1 and E2 stood next to each other and each picked up either a food pellet or nothing in their right hand, then closed their fist. E1 and E2 then either switched sides or remained on the same side for 5 s, before calling the subject’s name in turn and presenting their hands simultaneously for the subject to make a choice. The experimenters’ sides, the order of their actions, whether or not they switched sides (and whether the experimenter that switched sides did so by walking behind or in front of the other), and the order in which the subject’s name was called, were all pseudorandomised and counterbalanced within sessions of ten trials. Subjects received this training until they achieved a 17/20 criterion.

Following this, subjects were given a preference test. E1 and E2 each offered an empty hand to the subject as it held a rewarding token. The subject then had a choice of whom to deliver the token to, in exchange for a reward. Which experimenter placed the token on the platform and the side on which each experimenter stood were pseudorandomised and counterbalanced within blocks of 20 trials. In order to proceed to the next stage, subjects were required to show no preference for either experimenter, that is, to select E1 on between 9 and 11 of the 20 trials.

Subjects then observed demonstrations by the two experimenters where they had the opportunity to learn that E2 picked randomly from a population of tokens, whilst E1 acted as a biased sampler. For the demonstration, E1 and E2 stood next to each other and neither wore mirrored sunglasses, so that the kea could see their eyes. E2 always had a 10:1 rewarding-to-unrewarding population of 110 tokens, whilst E1 always had a 1:10 rewarding-to-unrewarding population of 110 tokens. Therefore, based on sampling probability alone, E2 was far more likely to sample a rewarding token than E1. During the demonstrations, E1 and E2 took turns sampling: E2 always tilted their head back and looked up whilst sampling, whilst E1 lowered their head close to the jar and looked into it as they made a choice, keeping their hand in the jar for 3 s. Both experimenters always sampled a rewarding token, so that they were equally reinforced. After sampling, the experimenters either remained on the same side for 5 s or switched sides, before presenting their closed fists to the subject simultaneously. Which side each experimenter stood on, who sampled first, and whether or not they switched sides (and whether they did so by going behind or in front of the other experimenter) were all pseudorandomised and counterbalanced within sessions of ten trials. In order to proceed to the next experience phase, subjects had to select E1 on at least 9 of 20 trials, showing that they had no preference for E2 and were therefore not simply attending to the token populations within jars during demonstrations. All subjects passed this criterion within 20 trials except for Neo, who experienced two blocks (40 trials) of demonstrations.

The final experience phase before test was a memory probe. In this phase, E2 presented each bird with a block of 20 trials in which two jars of 120 tokens each contained either 100% rewarding or 100% unrewarding tokens. E2 wore mirrored sunglasses for this phase, and presented their hands in parallel or crossed over, as in previous experiments. This was done by E2 because they were the unbiased experimenter. We predicted that if greater exposure to one or another person before test could affect test results, then carrying out an extra set of trials with E2 would make the choice of E1 less likely at test. Similarly, an increased number of positive ‘rewarding token’ experiences with E2 should make the choice of E1 less likely at test. Jar sides and hand presentation were counterbalanced and pseudorandomised. This phase ensured that subjects could and would still attend to the contents of jars following the demonstrations, and had not simply learned to ignore jar contents during the demonstration phase.

Subjects were then given the experimental task. They observed three trials of demonstrations identical to before, and then jars were swapped to 1:1 rewarding-to-unrewarding populations of 110 tokens. Based on token probability alone, E1 and E2 were now equally likely to sample a rewarding token. However, E1 and E2 behaved in identical fashion to demonstration trials, suggesting that they were biased and unbiased samplers, respectively. At test, E2 sampled truly randomly, whilst E1 continued to sample only the rewarding token in each trial. We expected that if kea understood that E1 was a biased sampler, they should choose them significantly above chance.
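The sampling probabilities behind this design can be made explicit. The Python sketch below uses the demonstration and test populations described above; the function names are hypothetical, and the run length of three corresponds to the three demonstration trials that preceded the test jars. It asks how likely an unbroken run of rewarding tokens would be if an experimenter were sampling at random.

```python
from fractions import Fraction


def p_rewarding(ratio):
    """Probability of drawing a rewarding token at random from a rewarding:unrewarding ratio."""
    rewarding, unrewarding = ratio
    return Fraction(rewarding, rewarding + unrewarding)


def p_run_of_rewarding(ratio, n_draws):
    """Probability that n independent random draws all yield a rewarding token."""
    return p_rewarding(ratio) ** n_draws


demo_e2 = p_rewarding((10, 1))   # unbiased experimenter's demonstration jar
demo_e1 = p_rewarding((1, 10))   # biased experimenter's demonstration jar
test_jar = p_rewarding((1, 1))   # both jars at test

print(demo_e2, demo_e1, test_jar)
print(p_run_of_rewarding((1, 10), 3))   # E1 drawing three rewarding tokens by chance
print(p_run_of_rewarding((10, 1), 3))   # E2 doing the same
```

A random sampler drawing from E1's 1:10 demonstration jar would produce three rewarding tokens in a row with probability 1/1331, compared with 1000/1331 for E2's 10:1 jar, so E1's consistent success is informative about bias. At test, the 1:1 jars give both experimenters a probability of 1/2 under random sampling, leaving the experimenters' behaviour as the only cue.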

Analyses

All trials were filmed and coded in situ. Subject performance was blind coded for 10% of all video data and compared to in situ coded data. Inter-observer reliability was high (Cohen’s kappa = 1.0).

Performance in the first 20 trials of each condition was analysed at the individual level, using two-tailed Bayesian binomial tests with a test value of 0.5. We used Bayesian correlation tests to investigate average performance across the first 20 trials of each condition over trial number, and average performance on the first 20 trials of each condition across the 6 experimental conditions. We used default parameters (non-directional correlation, prior width = 1) for all correlation tests. These statistical analyses were carried out in JASP 0.9.2 (ref. 77). We followed the convention that a Bayes factor (BF) < 0.33 shows substantial support for the null hypothesis, whilst a BF > 3 shows substantial support for the competing hypothesis (ref. 78).
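To illustrate how a Bayes factor of this kind relates to a score out of 20, the Python sketch below computes BF10 for a two-sided Bayesian binomial test against a test value of 0.5. It assumes a uniform Beta(1, 1) prior on the success probability under the alternative hypothesis; the prior used for the binomial tests is not stated here, so this is an assumption, and the function name is hypothetical. The calculation is a closed-form stand-in, not a reproduction of the JASP analysis.

```python
from math import comb


def bf10_binomial(successes, trials, test_value=0.5):
    """Bayes factor for H1 (theta ~ Uniform(0, 1)) against H0 (theta = test_value)."""
    # Under a uniform prior, the binomial likelihood integrates to 1 / (trials + 1).
    marginal_h1 = 1 / (trials + 1)
    likelihood_h0 = (comb(trials, successes)
                     * test_value ** successes
                     * (1 - test_value) ** (trials - successes))
    return marginal_h1 / likelihood_h0


for score in (10, 15, 17):
    print(score, round(bf10_binomial(score, 20), 2))
```

Under this assumed prior, a score of 17/20 gives a BF of about 44, well above the threshold of 3, whereas 10/20 gives a BF of about 0.27, below 0.33 and therefore in favour of the null hypothesis.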

We also analysed first-trial performance at the group level using a Bayesian intercept-only model, using a Bernoulli distribution. We fitted our model to all thirty-six first-trial data points, across all individuals and conditions. Intercepts were given weakly informative Gaussian priors (M = 0, SD = 1), to reduce overfitting. Reported pMCMC values reflect the probability of performance differing from a 0.5 chance baseline. This analysis was conducted in R 3.4.1 (ref. 79) using the “brms” package (ref. 80). We used Stan to run Hamiltonian Monte Carlo estimations (ref. 81).
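The structure of an intercept-only Bernoulli model with a Gaussian prior can be illustrated without MCMC. The Python sketch below replaces Hamiltonian Monte Carlo with a simple grid approximation over the logit-scale intercept, and returns the posterior probability that the intercept lies above zero (that is, that performance exceeds 0.5). It is not the brms or Stan analysis; the function name and grid settings are hypothetical, and the counts in the example are made up for illustration and are not the study's data.

```python
import math


def prob_above_chance(successes, trials, prior_sd=1.0, half_width=8.0, steps=2000):
    """Grid-approximate P(intercept > 0 | data) for a Bernoulli model with a N(0, prior_sd) prior."""
    step = half_width / steps
    above = 0.0
    total = 0.0
    for i in range(-steps, steps + 1):
        beta = i * step                           # intercept on the logit scale
        log_p = -math.log1p(math.exp(-beta))      # log of the success probability
        log_q = -math.log1p(math.exp(beta))       # log of the failure probability
        log_post = (successes * log_p
                    + (trials - successes) * log_q
                    - 0.5 * (beta / prior_sd) ** 2)  # Gaussian prior, up to a constant
        weight = math.exp(log_post)
        total += weight
        if beta > 0:
            above += weight
        elif beta == 0:
            above += 0.5 * weight                 # split the grid point at exactly zero
    return above / total


# Hypothetical counts out of 36 first trials, for illustration only.
print(prob_above_chance(18, 36))
print(prob_above_chance(30, 36))
```

With an even split of correct and incorrect first trials the posterior is symmetric about zero and the function returns 0.5; as the proportion of correct first trials rises, the probability approaches 1. The zero-centred prior pulls the estimate towards chance, which is the sense in which it guards against overfitting.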

All raw data are available in Supplementary Data 1. Code and MCMC chain diagnostics are also provided as Supplementary Information.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.