The aim of Experiment 1 was to measure participants’ subjective certainty during concept learning and to predict it using plausible model-based and behavioral predictors. In this experiment, certainty judgments concerned which underlying concept (rule) generated the data participants saw, as opposed to their certainty about the correct answer on any given trial (see Experiment 3).

We tested 552 participants recruited via Amazon Mechanical Turk in a standard Boolean concept-learning task during which we measured their knowledge of a hidden concept (via yes or no responses) and their certainty throughout the learning process (see Figure 1 and Table 1). In this experiment, participants were shown positive and negative examples of a target concept “daxxy,” where membership was determined by a latent rule on a small set of feature dimensions (e.g., color, shape, size), following experimental work by Shepard et al. (1961) and Feldman (2000). The latent rules participants were required to learn spanned a variety of logical forms. After responding to each item, participants received feedback and then rated their certainty about what the word “daxxy” meant. For our analyses we considered and compared several different models of what might drive uncertainty (see Table 2). These predictors fall into two broad categories: model-based predictors were calculated using our ideal learning model, while behavioral predictors were calculated from the behavioral data (see Appendix A in the Supplemental Materials [Martí, Mollica, Piantadosi, & Kidd, 2018] for additional method details).
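To make the task structure concrete, the following minimal sketch (ours, not the authors’ stimulus code) shows how a latent Boolean rule over a small feature space can label objects as positive or negative examples of the concept. The feature values and the example rule are illustrative assumptions, not the experiment’s actual materials.

```python
from itertools import product

# Illustrative feature dimensions (the experiment's actual features may differ).
COLORS, SHAPES, SIZES = ["blue", "green"], ["circle", "square"], ["small", "large"]

# All possible objects in this small feature space (2 x 2 x 2 = 8 stimuli).
stimuli = [
    {"color": c, "shape": s, "size": z}
    for c, s, z in product(COLORS, SHAPES, SIZES)
]

# A hypothetical latent rule for "daxxy". Rules of varying logical form
# (conjunctions, disjunctions, etc.) determine which objects are positive
# examples of the concept.
def is_daxxy(obj):
    return obj["color"] == "blue" or (obj["shape"] == "circle" and obj["size"] == "small")

# Each trial presents one object; the participant answers yes/no, receives
# feedback, and then rates certainty about what "daxxy" means.
for obj in stimuli:
    print(obj, "-> daxxy" if is_daxxy(obj) else "-> not daxxy")
```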

Results

We first plot participants’ certainty and accuracy for each concept in order to show (a) whether certainty and accuracy improved over the course of the experiment, (b) whether theoretically harder concepts (according to Feldman, 2000) were, in fact, more difficult for participants, and (c) whether participants’ certainty correlated with their accuracy in general.

Figure 2 shows participants’ certainty and accuracy (y-axis) over trials of the experiment (x-axis). The accuracy curves indicate that participants learned the concepts in some conditions but not others. This is beneficial to our analysis because it allows us to examine conditions and trials in which participants should have had high uncertainty. Overall, participant certainty was inversely related to concept difficulty. Certainty generally increased over trials, but reached high values only in conditions in which participants also achieved high accuracy. The increasing trend of certainty in conditions for which accuracy never exceeded 50% may reflect overconfidence. It is also important to note that even when participants had received exhaustive evidence, multiple logically equivalent rules remained correct; despite this, participants still became certain over time.

We first consider each predictor as a separate model in order to determine which best predicts certainty. We then build a combined model from the best predictors of each type in order to determine the unique contribution of each.

We assessed our predictors with generalized logistic mixed-effects models fit by maximum likelihood with random subject and condition effects.1 First, this analysis shows that model accuracy significantly predicts behavioral accuracy (R2 = .50, β = .748, z = 30.423, p < .001; Figure 3), meaning that overall performance is reasonably well predicted by the learning model.
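The regressions described here appear to be lme4-style glmer fits in R; as a rough analogue, the sketch below uses statsmodels’ Bayesian binomial mixed GLM, which captures the random subject and condition intercepts but is fit by variational Bayes rather than maximum likelihood. The file and column names are hypothetical.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Assumed data layout: one row per trial with columns
#   correct        (0/1)  participant's accuracy on the trial
#   model_accuracy (0-1)  ideal learning model's predicted accuracy
#   subject, condition    grouping factors
df = pd.read_csv("experiment1_trials.csv")  # hypothetical file name

# Random intercepts for subject and condition, approximating the paper's
# random-effects structure (variational fit, not maximum likelihood).
vc = {"subject": "0 + C(subject)", "condition": "0 + C(condition)"}
model = BinomialBayesMixedGLM.from_formula("correct ~ model_accuracy", vc, df)
result = model.fit_vb()
print(result.summary())
```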

Figure 4 then shows mean certainty responses for each trial and condition (y-axis) over several key predictors of certainty (x-axis). A perfect model here would have data points lying along the line y = x, with a high R2 and very little residual variance. Local Accuracy 5 Back, the accuracy averaged over the past five items, has a high R2, meaning that individuals with low local accuracy were uncertain and individuals with high local accuracy were highly certain. Likewise, Domain Entropy also has a high R2 and shows a much more orderly relationship with certainty than the other model-based predictors (see Figure A.1 in the Supplemental Materials [Martí et al., 2018] for additional predictor visualizations).
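As an illustration of how a Local Accuracy predictor can be computed, here is a minimal sketch (ours) using a rolling window over one participant’s hypothetical response sequence; whether the current trial is included is one of the variants the analysis compares.

```python
import pandas as pd

# Hypothetical per-trial responses for one participant: 1 = correct, 0 = incorrect.
trials = pd.Series([0, 1, 1, 0, 1, 1, 1, 1, 0, 1])

# Local Accuracy 5 Back: mean accuracy over the preceding 5 trials, here
# excluding the current trial via shift(1). min_periods=1 lets early trials
# use however many previous trials exist; the first trial has no history,
# so its value is NaN.
local_acc_5 = trials.shift(1).rolling(window=5, min_periods=1).mean()
print(local_acc_5)
```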

Table A.2 in the Supplemental Materials (Martí et al., 2018) shows the full model results, giving the performance of each model in predicting certainty ratings.2 These have been sorted by Akaike information criterion (AIC), which quantifies the fit of each model while penalizing for its number of free parameters (lower values are better). Each AIC score is derived from a generalized logistic mixed-effects model fit by maximum likelihood with random subject and condition effects. The table also provides an R2 measure, calculated from the Pearson correlation between the means of the responses and of the predictor for each trial and condition (this ignores variance across participants). As the table makes clear, the behavioral predictors tend to outperform the model predictors, at times by a substantial amount. The best predictor, Local Accuracy 5 Back, accounts for 58% of the variance. Additionally, the Local Accuracy models outperform most of the other alternatives, a pattern that is robust to how local accuracy is quantified (e.g., how many previous trials are counted or whether the current trial is included). The quantitatively best Local Accuracy model tracks accuracy over the past five trials. One possible explanation is that participants were simply basing their certainty on recent performance. The strong performance of both Local Accuracy and Total Correct implies that people’s certainty is largely driven by their own perception of how well they are doing on the task.
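For reference, the two metrics can be sketched as follows (our reconstruction from the description above; the data frame columns are assumptions, not the paper’s actual variable names).

```python
import pandas as pd
from scipy.stats import pearsonr

# AIC = 2k - 2*log(L): lower values indicate a better fit after penalizing
# for the number of free parameters k.
def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

# R^2 as described in the text: correlate the mean certainty response with
# the mean predictor value within each trial x condition cell, ignoring
# participant-level variance. df is assumed to have columns
# trial, condition, certainty, and one column per predictor.
def cell_mean_r2(df, predictor):
    means = df.groupby(["trial", "condition"])[["certainty", predictor]].mean()
    r, _ = pearsonr(means["certainty"], means[predictor])
    return r ** 2
```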

Strikingly, the lackluster performance of the majority of ideal learner models suggests that subjective certainty is not calibrated to the ideal learner. This is consistent with the theory that learners were likely not maintaining more than one hypothesis: perhaps they stored a sample from the posterior but did not have access to the full posterior distribution. Notably, the idealized model of entropy over hypotheses (what might have been our best a priori guess for what certainty should reflect) performs especially poorly, worse than many behavioral and other model-based predictors. Such a failure of metacognition is consistent with the poor performance of Current Accuracy, a measure of whether or not the participant got the current trial correct. Subjective certainty does not accurately predict accuracy on the current trial, or vice versa.
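For reference, the entropy over hypotheses that an ideal learner could compute takes the standard form below (notation ours; the analysis distinguishes several entropy-based predictors, so this is only the generic quantity). A learner holding a single sampled hypothesis, rather than the full posterior, would have no direct access to it.

```latex
H(d) = -\sum_{h \in \mathcal{H}} p(h \mid d) \, \log_2 p(h \mid d)
```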

Our first analysis treated each predictor separately and identified the best one, but what if multiple predictors were jointly allowed to predict certainty? To answer this, we created a model using the top three behavioral predictors and the top three model predictors in order to determine the unique contribution of each (see Table 3).3,4 As the table makes clear, all of the behavioral predictors, along with Domain Entropy, make significant, unique contributions to certainty. Conversely, Entropy and Log Maximum Likelihood were not significant when controlling for the other predictors, indicating that they provide no unique contribution to certainty. In line with the results of our AIC analysis, the (normalized) beta weights, which quantify the strength of each predictor’s influence, reveal that the behavioral predictors have the largest influence.
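A hedged sketch of such a joint model appears below: predictors are z-scored so that their fitted betas act as normalized weights whose magnitudes can be compared, and a plain regression stands in for the generalized logistic mixed-effects model actually used (random effects are omitted for brevity). The predictor column and file names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiment1_predictors.csv")  # hypothetical file name

# Hypothetical names for the top three behavioral and top three model-based
# predictors (Table 3 lists the actual set). Z-scoring puts all predictors
# on a common scale so the betas are directly comparable.
predictors = ["local_acc_5", "total_correct", "current_acc",
              "domain_entropy", "entropy", "log_max_lik"]
df[predictors] = (df[predictors] - df[predictors].mean()) / df[predictors].std()

# One model with all six predictors entered jointly; each beta reflects a
# predictor's unique contribution when controlling for the others.
formula = "certainty ~ " + " + ".join(predictors)
fit = smf.ols(formula, data=df).fit()
print(fit.summary())
```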