Despite these limitations, the intelligence community continues to hold the belief that ACH encourages critical thinking and cognitively debiases analysts (e.g., Marrin, 2008). Indeed, ACH is one of the few techniques listed in the U.S. Government's (2009) Tradecraft Primer. ACH also features prominently in the UK Ministry of Defence's (2013) Quick Wins for Busy Analysts handbook. The popularity of ACH is surprising given the dearth of empirical research testing its utility (Chang et al., 2018; Dhami, Mandel, Mellers, & Tetlock, 2015; Pool, 2010).

Critics of ACH have noted several shortcomings (e.g., Chang, Berdini, Mandel, & Tetlock, 2018; Jones, 2017; Mandel, in press; Murukannaiah, Kalia, Telang, & Singh, 2015). It is vague in multiple respects. For instance, it is unclear how hypotheses should be selected, what criteria should be used to rate evidence as being consistent or inconsistent with a hypothesis, or what criteria should be used to judge evidence diagnosticity. This vagueness permits the analyst's judgement process to become unreliable. Finally, ACH does not represent some features of relevant normative methods, such as Bayesianism, for revising beliefs in the face of uncertain evidence. For instance, it provides no guidance on how prior beliefs should be revised in light of new evidence, and so it may be prone to base rate neglect (Bar‐Hillel, 1980). ACH's information integration rule involves merely counting the number of weighted inconsistent evidence items for any given hypothesis, while discounting the amount of supporting evidence. Consequently, ACH will diverge from the predictions of Bayes theorem under some conditions, such as when the prior probability distribution over the set of hypotheses is far from uniform (as in the experiment reported here).
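The divergence can be illustrated with a minimal sketch; all numbers below are invented for illustration, not drawn from the experiment. Two evidence items each favour a rival hypothesis, so ACH's inconsistency-counting rule selects that rival, yet a far-from-uniform prior keeps the original hypothesis more probable under Bayes theorem.

```python
# Illustrative contrast between ACH's integration rule and Bayes theorem.
# All numbers are invented for illustration.

priors = {"H1": 0.9, "H2": 0.1}  # far-from-uniform prior over hypotheses

# P(evidence item | hypothesis) for two evidence items
likelihoods = {
    "e1": {"H1": 0.4, "H2": 0.8},
    "e2": {"H1": 0.4, "H2": 0.8},
}

def ach_choice(likelihoods, hypotheses):
    """ACH-style rule: an item counts as inconsistent with a hypothesis when it
    is more likely under a rival; the fewest inconsistencies wins. Priors and
    the amount of supporting evidence play no role."""
    scores = {h: 0 for h in hypotheses}
    for item in likelihoods.values():
        best = max(item, key=item.get)
        for h in hypotheses:
            if h != best:
                scores[h] += 1
    return min(scores, key=scores.get), scores

def bayes_posterior(priors, likelihoods):
    """Naive-Bayes update: prior times product of likelihoods, renormalized."""
    post = dict(priors)
    for item in likelihoods.values():
        for h in post:
            post[h] *= item[h]
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

choice, scores = ach_choice(likelihoods, ["H1", "H2"])   # ACH picks "H2"
post = bayes_posterior(priors, likelihoods)
# Bayes: P(H1 | e1, e2) = 0.9*0.4*0.4 / (0.9*0.16 + 0.1*0.64) ≈ .69, so H1 stays ahead
```

Here the two rules disagree purely because the counting rule ignores the prior; with a uniform prior they would agree on the ranking.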

In an effort to assist analysts to think critically and avoid bias, the intelligence community has adopted the use of “structured analytic techniques.” The analysis of competing hypotheses (ACH; Heuer, 1999, 2005) is one such technique. It is designed to help analysts avoid “confirmation bias” in several respects, namely, by explicitly requiring them to (a) consider alternative hypotheses; (b) rate evidence as inconsistent (or consistent) with each hypothesis under consideration; (c) adjust their belief in a hypothesis in accordance with evidence diagnosticity (or credibility); (d) select the most likely hypothesis based solely on (it being the one with the least) inconsistent evidence; and (e) identify indicators that will disconfirm (or confirm) a hypothesis in the future.

Intelligence analysts are required to assess evidence to test alternative accounts of a current or future situation. In performing such a cognitively complex task, analysts may resort to using simple strategies that can bias their thinking and result in judgement errors (Belton & Dhami, in press). In particular, it is argued that analysts may suffer from “confirmation bias” (Heuer, 1999). This can manifest itself in a number of ways (see Klayman, 1995; Nickerson, 1998). Analysts are often portrayed as not considering alternative hypotheses; searching for evidence supporting rather than disconfirming their prior beliefs; reaching conclusions about a hypothesis based on the presence of supporting rather than conflicting evidence; and insufficiently adjusting their belief in a hypothesis when existing (supporting) evidence is discredited (e.g., Cook & Smallman, 2008; Lehner, Adelman, Cheikes, & Brown, 2008; Lehner et al., 2009). Indeed, confirmation bias is a popular explanation for intelligence failures such as the Iraq weapons of mass destruction mis‐estimate (Jervis, 2006).

The aforementioned studies have several shortcomings. They were based on very small samples, precluding statistical testing of the reliability and size of any effects reported. Lehner et al. (2008) studied 24 individuals. Convertino et al.'s (2008) study involved nine three‐member, geographically distributed groups of students. Kretz et al. (2012) and Kretz and Granderson (2013) studied 27 junior engineers without analytic experience. In addition, some past studies did not include relevant control groups against which ACH could be compared. In Convertino et al.'s (2008) study, all groups used a collaborative version of ACH. Kretz et al. (2012) and Kretz and Granderson (2013) compared ACH with two techniques whose primary function is not hypothesis testing (see Dhami, Belton, & Careless, 2016). Furthermore, some of the studies using a control group were confounded by the fact that some “control” participants were familiar with ACH and may have used it (Kretz et al., 2012; Kretz & Granderson, 2013; Lehner et al., 2008). Finally, none of the studies measured whether analysts in the ACH group applied ACH correctly.

Lehner et al. (2008) reported that ACH reduced confirmation bias (measured in terms of the size of the significant positive correlations between participants' confidence in an initial hypothesis and their ratings of the extent to which subsequent evidence supported that hypothesis) in participants with no analytic experience but not in those with experience. This was partly because the latter group initially demonstrated less bias. ACH, however, did not appear to reduce participants' resistance to change from one hypothesis to another. Convertino et al. (2008) found that confirmation bias (measured in terms of belief in the initially supported hypothesis in later phases and the importance attached to evidence supporting the favoured hypothesis) was evident across all groups studied, but stronger in the group with similar beliefs rather than dissimilar beliefs. Kretz et al. (2012) and Kretz and Granderson (2013) found that participants using ACH did not consistently outperform those using one of two other techniques (i.e., link analysis and information extraction and weighting) in terms of the number of hypotheses generated and how often the chosen hypothesis was supported by the evidence overall.

The small body of past research is conceptually vague in terms of the features of ACH being tested, although there is a general focus on measuring some aspects of confirmation bias (Convertino, Billman, Pirolli, Massar, & Shrager, 2008; Kretz & Granderson, 2013; Kretz, Simpson, & Graham, 2012; Lehner et al., 2008). Specifically, the studies induce confirmation bias in participants before testing ACH by presenting evidence in stages such that it initially favours one hypothesis, and then in later stages, it either balances out across the hypotheses (Convertino et al., 2008), supports a hypothesis it initially conflicted with (Lehner et al., 2008), or conflicts with the hypothesis it initially supported (Kretz et al., 2012; Kretz & Granderson, 2013). Thus, researchers cannot comment on whether ACH's other features reduce other aspects of confirmation bias, namely, explicitly requiring analysts to consider alternative hypotheses, to rate evidence as inconsistent (or consistent) with each hypothesis under consideration, to adjust belief in a hypothesis in accordance with evidence diagnosticity (or credibility), and to identify indicators that will disconfirm (or confirm) a hypothesis in the future.

The final aim was to compare the accuracy of the ACH and untrained groups. Although ACH was designed to reduce judgement bias and error, as Dhami et al. (2016) point out, techniques such as ACH cannot guarantee accuracy. This is partly because they rely on the judgement skills of the analyst and on his or her subjective input of the information and interpretation of the outputs. Past research on ACH does not sufficiently comment on its ability to help analysts arrive at the correct solution; however, the implicit belief among the intelligence community is that ACH can improve the accuracy of those analysts who use it as opposed to those who do not.

The second aim was to measure the extent of within‐individual consistency in the judgement processes of the ACH and untrained groups. It is reasonable to expect that analysts taught to use ACH may demonstrate greater consistency in how they approach a hypothesis testing task compared with those who have not been taught to use ACH.

We examine all of the features of ACH using a comparatively large sample of practising intelligence analysts, half of whom were randomly assigned to be trained to use ACH and half of whom were not. The experiment had three main aims. First, we sought to compare the judgement processes of analysts trained (and instructed) to use ACH against analysts from the same cohort not trained in ACH and not instructed to use any particular technique (i.e., control group). According to proponents of ACH, the untrained group ought to demonstrate greater “confirmation bias” than the ACH group in several respects. In the context of our experiment, this bias is conceptualized as: (a) not considering all alternative hypotheses; (b) only evaluating evidence based on whether it is consistent with each hypothesis under consideration; (c) not adjusting belief in a hypothesis in accordance with evidence diagnosticity; (d) selecting the most likely hypothesis based solely on evidence that is consistent with it; and (e) identifying indicators that will only confirm a hypothesis in the future.

The data in the written protocols were coded using a structured coding scheme (a copy is available from the first author). This scheme was divided into three parts. The first enabled coding of variables pertaining to data that could potentially be available for both groups (e.g., selection of tribe membership of the target individual and whether the analyst took account of base rate information). The second part contained codes for variables pertaining to data we would expect to observe for the ACH group only given the contents of their training (e.g., did they draw an ACH matrix?). The final part enabled coding of data that could be available for the untrained group only (e.g., did they reformat the data? If so, how?).

Data were collected using a written protocol. The ACH group was told “In order to solve the analytic task presented, we would like you to use the technique called ‘Analysis of Competing Hypotheses’ (ACH). This consists of the steps described below. Please use the space provided to detail your analysis using ACH” (see Table A3). The control group was told “Report your conclusions in the box below. Consider the relative likelihood of all of the hypotheses. State which items of information were the most diagnostic, and how compelling a case they make in identifying the most likely hypothesis. Also say why alternative hypotheses were rejected. (You can use the page overleaf to make any notes you need.)”

Participants were then told “Assume that your government has already determined the following information which is at your disposal.” See Table A2 for a summary description of the four tribes in terms of the 12 evidence items, and the information about the target.

The scenario was as follows: “In the Zuma region of Zanda, there are four tribes called Acanda, Bango, Conda, and Dengo. They represent 5%, 20%, 30%, and 45% of Zuma's population, respectively. Assume that Acanda and Conda are hostile tribes, whereas Bango and Dengo are friendly. Your government would like to improve its understanding of this region and has captured a randomly chosen inhabitant to be interviewed. The inhabitant was given a truth serum and will have provided accurate information. In this sense, your task is already easier than in real life since you don't have to worry about inaccuracies in the information provided. Moreover, you may assume that this target, when released, will have no memory of the capture and his brief absence will not have been noticed by any Zumans. Finally, the sex of the target (male) is non‐diagnostic since all tribes have the same ratio of males to females (1:1).”

Analysts were instructed as follows: “In this task, you will be asked to assess the tribe membership of a randomly selected person from a region. The region and groups are fictitious and bear no intended relationship to any real groups in any region on Earth. Your task is to use the information provided to offer the best assessment you can of the target person's tribe membership. After reading the scenario, you will be asked to detail your analysis. Then, you will be asked to assess the likelihood of specific hypotheses and the usefulness of the various pieces of information that you received.”

Analysts each performed an analytic task (i.e., judging the likelihood that a target individual belongs to one tribe) comprising four hypotheses (i.e., tribes) and 12 evidence items (e.g., language spoken). The probability of occurrence of each evidence item (i.e., the diagnostic probability) was provided, as was the base rate information for each hypothesis (see Table A2 for task properties). The task enabled analysts to apply all of the steps of ACH and arrive at a normatively correct conclusion by relying solely on the available information. A statistical analysis using Bayes theorem under the assumption of cue independence (i.e., a naïve Bayes model of the evidence) shows that the most probable hypothesis is that the target is a member of the Conda tribe (46% chance). The probabilities for the other tribes are Dengo (31%), Bango (15%), and Acanda (8%). (Although we acknowledge that the task does not demand the simplifying assumption of cue independence, we found no evidence—such as discussion of inter‐cue relationships—to suggest the invalidity of the assumption.)
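As a sketch of how such a naive‐Bayes benchmark is computed, the snippet below combines the scenario's base rates with per‐tribe evidence likelihoods. The two likelihood rows are invented placeholders; the study's actual 12 items and their probabilities are given in its Table A2.

```python
# Naive-Bayes benchmark for the tribe task: posterior proportional to
# base rate x product of evidence likelihoods (cue independence assumed).
base_rates = {"Acanda": 0.05, "Bango": 0.20, "Conda": 0.30, "Dengo": 0.45}

# P(evidence item | tribe) for two illustrative items (invented values;
# the study used 12 items, e.g., language spoken).
evidence = [
    {"Acanda": 0.7, "Bango": 0.2, "Conda": 0.9, "Dengo": 0.3},
    {"Acanda": 0.5, "Bango": 0.3, "Conda": 0.8, "Dengo": 0.4},
]

posterior = dict(base_rates)
for item in evidence:
    for tribe in posterior:
        posterior[tribe] *= item[tribe]   # multiply in each likelihood
z = sum(posterior.values())
posterior = {t: p / z for t, p in posterior.items()}  # renormalize
# On the study's full evidence set, this model yields Conda 46%, Dengo 31%,
# Bango 15%, and Acanda 8%.
```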

Analysts undergoing their regular training at a UK intelligence organization were asked by the trainers to participate in the experiment. In total, 50 analysts participated, and there was no attrition. Fifty‐seven per cent of the sample was male. The mean age of the sample was 27.79 years (SD = 5.03). The mean number of months' experience working as an analyst was 14.08 (SD = 29.50). Half of the sample was randomly allocated to the experimental group and half to the control group. The two groups did not differ significantly on any of the aforementioned demographic variables.

Finally, we also explored the relationship between within‐individual consistency and accuracy across both groups. A McNemar test revealed that a statistically significantly greater proportion of analysts (across both groups) who applied their evidence assessment rule consistently across evidence items were accurate in their choice of most likely tribe (i.e., 80%, n = 4 out of 5), compared with 31% (n = 9 out of 29) of analysts who were inconsistent, p = .021, odds ratio = 8.89. Similarly, a statistically significantly greater proportion of analysts (across both groups) who applied their evidence integration rule consistently across hypotheses chose the correct tribe (i.e., 38.7%, n = 12 out of 31), compared with 22.2% (n = 2 out of 9) of those who were inconsistent, p < .001, odds ratio = 2.21.

The other way of measuring accuracy was on a categorical/binary scale (i.e., whether analysts chose the correct tribe as the most likely). Thirty‐six per cent (n = 9) of analysts in the ACH group and 33% (n = 8) in the untrained group chose the correct hypothesis (i.e., the Conda tribe), and this difference between groups was not statistically significant, χ²(1, N = 49) = .04, p = .845, ϕ = .03.

One way of measuring accuracy was on an ordinal scale (i.e., correctness of analysts' ranking of tribe membership from most to least likely). Here, only one (4%) of the 25 analysts in the ACH group produced a correct rank ordering of the four hypotheses compared with two (4.9%) of 16 analysts in the untrained group. This difference between groups was not statistically significant, χ²(1, N = 41) = .31, p = .308, ϕ = −.16. A further examination of the data revealed that a statistically significantly greater proportion (i.e., 80%, n = 20 of 25) of analysts in the ACH group had one or more tied ranks between hypotheses compared with 19% (n = 3 of 16) of the untrained group, χ²(1, N = 41) = 14.86, p < .001, ϕ = .60.

The final conclusion reached by 64% (n = 16) of the ACH group matched the judgements made in their revised matrix (preceding judgements). By contrast, where it was possible to evaluate, the final conclusion presented by all of the analysts in the untrained group (n = 16) was consistent with their preceding judgement process. This difference between groups was statistically significant, χ²(1, N = 41) = 7.38, p = .007, ϕ = −.42.

Seventy‐two per cent (n = 18) of analysts in the ACH group provided at least one indicator. A total of 68 indicators were provided by these analysts, with 22 indicators potentially confirming their conclusion, 19 disconfirming it, and 27 being neutral (i.e., that could either confirm or disconfirm their conclusion depending on the circumstances). A Friedman test found no statistically significant difference in the type of indicators provided by those 18 analysts in the ACH group who provided indicators, χ²(2, N = 18) = .13, p = .936, Kendall's W = .004.

Finally, ACH requires analysts to assess the sensitivity of their conclusions and identify indicators for future observation that would support or contest their conclusion. A statistically significantly greater proportion of analysts in the ACH group (i.e., 60%, n = 15) checked the sensitivity of their conclusions to a change in assumptions compared with 4% (n = 1) of the untrained group, χ²(1, N = 50) = 18.02, p < .001, ϕ = .60.

When selecting the most likely hypothesis, ACH requires analysts to add up only evidence inconsistent with each hypothesis, ignoring evidence consistent with it, and to consider the hypothesis with the lowest number of inconsistent ratings as most likely. We found a statistically significant difference between the two groups in how they selected the most likely hypothesis, χ²(2, N = 50) = 6.58, p = .037, V = .38. Post hoc analyses were conducted to further explore the source of this difference. We found that despite their training, only 20% (n = 5) of analysts in the ACH group relied solely on inconsistent evidence, and none of the untrained group did so. This difference between groups was statistically significant, χ²(1, N = 50) = 5.14, p = .023, ϕ = .33. A small minority of analysts in both groups added up only evidence consistent with each hypothesis (i.e., ACH: 4%, n = 1 and untrained: 22%, 4 out of n = 20), χ²(1, N = 50) = 2.88, p = .090, ϕ = −.25. Finally, the majority of analysts in both groups added up both consistent and inconsistent evidence (i.e., ACH: 76%, n = 19 and untrained: 78%, 16 out of n = 20), χ²(1, N = 50) = 0.01, p = .954, ϕ = −.01.
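The selection rule described above can be sketched as follows. The ratings (CC/C/N/I/II, for highly consistent through highly inconsistent) and their weights are illustrative, but the rule follows the text: only inconsistent ratings are summed, and the hypothesis with the lowest total is deemed most likely.

```python
# Sketch of ACH's hypothesis-selection rule: sum weighted inconsistent ratings
# in each hypothesis's column; consistent evidence contributes nothing.
matrix = {                                # evidence item -> rating per hypothesis
    "e1": {"H1": "C",  "H2": "I"},
    "e2": {"H1": "II", "H2": "C"},
    "e3": {"H1": "C",  "H2": "N"},
}
weights = {"CC": 0, "C": 0, "N": 0, "I": 1, "II": 2}   # only inconsistency counts

def inconsistency_scores(matrix):
    scores = {}
    for ratings in matrix.values():
        for h, rating in ratings.items():
            scores[h] = scores.get(h, 0) + weights[rating]
    return scores

scores = inconsistency_scores(matrix)      # H1: 2, H2: 1
most_likely = min(scores, key=scores.get)  # "H2", despite H1's consistent items
```

Note that H1's two consistent ratings are simply ignored, which is precisely the feature of the rule that most analysts in both groups declined to follow.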

Eighty per cent (n = 20) of the ACH group took some account of evidence diagnosticity, as instructed (i.e., by deleting some evidence items in their revised matrix and/or reordering their matrix based on diagnosticity). Thirty‐two per cent (n = 8) of the untrained group ranked evidence items in some way based on diagnosticity or stated that they took account of diagnosticity in reaching their conclusion. The difference between groups was statistically significant, χ²(1, N = 50) = 11.69, p = .001, ϕ = .48.

Individual analysts in the untrained group did not list enough evidence items as diagnostic to compute individual correlations between their rankings and an objective measure. Therefore, the correlational analysis was computed across the whole group by comparing the percentage of analysts who identified each evidence item as diagnostic with the items' ranking on the Shannon entropy reduction measure. Kendall's tau‐b was .44, p = .07.

In order to evaluate how well analysts assessed evidence diagnosticity, we examined the ACH group and untrained group separately. Only 11 analysts produced, as instructed, an amended ACH matrix (see Step 4 of the ACH process in Table A1) with evidence items reordered from the original (Step 3) matrix based on their diagnosticity. A further nine analysts removed one or more items but did not reorder their matrix. For each of these 20 analysts, we compared the rankings of the evidence items with the ranking computed using “information gain,” an information utility measure that gauges reduction in Shannon entropy (see Nelson, 2005). Mean Kendall's tau‐b was .63 (SD = 0.16). The tau‐b correlations were statistically significant for eight analysts (p < .05), indicating a degree of accuracy in these analysts' assessments of evidence diagnosticity.
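For a binary evidence item, the information‐gain measure can be sketched as below. The priors are the scenario's base rates; the two cue likelihood profiles are invented for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution over hypotheses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def information_gain(priors, likelihood):
    """Expected reduction in entropy from observing a binary evidence item,
    where likelihood[h] = P(item present | hypothesis h)."""
    gain = entropy(priors)
    for present in (True, False):
        joint = {h: priors[h] * (likelihood[h] if present else 1 - likelihood[h])
                 for h in priors}
        p_outcome = sum(joint.values())
        posterior = {h: p / p_outcome for h, p in joint.items()}
        gain -= p_outcome * entropy(posterior)  # subtract expected残 entropy
    return gain

priors = {"Acanda": 0.05, "Bango": 0.20, "Conda": 0.30, "Dengo": 0.45}
diagnostic = {"Acanda": 0.9, "Bango": 0.1, "Conda": 0.8, "Dengo": 0.2}
uninformative = {h: 0.5 for h in priors}   # identical likelihoods across tribes
# information_gain(priors, diagnostic) > 0; information_gain(priors, uninformative) ≈ 0
```

Ranking the 12 evidence items by this quantity gives the objective ordering against which analysts' rankings were correlated.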

All of the ACH group applied the scoring rule for assessing evidence in relation to each hypothesis, as instructed. Eighty per cent (n = 20) of the untrained group used some form of scoring rule. The difference between groups was statistically significant, χ²(1, N = 50) = 5.56, p = .018, ϕ = .33. Of those 20 analysts in the untrained group who used a scoring rule, eight added up the evidence likelihood percentages for each hypothesis or performed a similar calculation, whereas 12 attached points for matching evidence in different ways (four of these divided the scale in half so that >50% = 1 point; six divided the scale into several intervals so that ≥75% = 3 points, ≥50% = 2 points, and ≥25% = 1 point; and the remaining two gave a point to the hypothesis that was the best match for each evidence item in terms of having the highest likelihood for that item).

ACH requires analysts to represent the task in terms of a matrix with hypotheses as columns and evidence as rows, and all of the ACH group did this. Eighty per cent (n = 20) of the untrained group also reformatted the data. The difference between groups was statistically significant, χ²(1, N = 50) = 5.56, p = .018, ϕ = .33. A closer examination of data from the 20 analysts in the untrained group who reformatted the task revealed that 16 drew a matrix (i.e., 14 with hypotheses as columns and evidence as rows, and two with evidence as columns and hypotheses as rows) and four made a list.

The coded data are presented in Table 1 . In order to examine the association between group (ACH or untrained/control) and performance on specific aspects of the analytic task, we analysed the data using chi‐square tests of independence supplemented with effect size measures. The results are presented below in order of the three main aims of the study.

6 DISCUSSION

The intelligence community believes that ACH helps analysts to think critically and avoid “confirmation bias.” The present study examined ACH in practice. We found that most analysts trained (and instructed) to use ACH deviated from one or more of the steps prescribed by this technique. In particular, they departed from ACH's Step 5, which refers to evidence integration (see Table A1). Past research on ACH has not measured the extent to which participants fully applied ACH. However, Trent, Voshell, and Patterson (2007) reported that army intelligence officers resisted using ACH after being trained and repeatedly instructed to do so. In fact, intelligence organizations also find themselves deviating from some of the steps prescribed by ACH. For instance, in its manual describing ACH, UK Defence Intelligence (UK Ministry of Defence, 2013, p. 15) asks analysts to consider “If this hypothesis were true, how likely would this evidence be?” Analysts must enter a score of 0 to 4, where 0 represents less than 10%, 1 represents 10–25%, 2 represents 25–50%, 3 represents 50–75%, and 4 represents more than 75%. Then, they must add up the scores for each hypothesis. These are significant departures from ACH, and yet both analysts and their organizations would believe they are applying ACH. Clearly, future research ought to examine the efficacy of ACH as designed, and if it is found to be useful, then more needs to be done to persuade analysts and intelligence organizations to use it. Meanwhile, our discussion of the present findings below focuses on how ACH is used in practice.

Before we discuss the present findings, we highlight potential concerns some may raise about their external validity, given the nature of the analytic task used in the present study. Although intelligence analysts seldom face such neat problems (i.e., where all hypotheses are provided and are mutually exclusive, and where all relevant evidence is available and precisely quantified), we do not believe this implies that analysts would perform better when faced with real intelligence problems. This is because real problems are murky, unlike the present task: there may not be enough relevant data, or there may be large volumes of data; the credibility of data sources may vary; the data may be formatted in different ways (e.g., structured/unstructured, textual/visual/audio) and may be ambiguous, unreliable, and sometimes intentionally misleading; and there may be time pressure and high stakes involved. We see no reason why ACH should help under these conditions when it does not help hypothesis evaluation under the more modest conditions of the present experimental task, where the information available to analysts could be easily subjected to the consistency tests that ACH requires. We would expect that ACH would perform better in the simple analytic task used in the present study than in the much more complex tasks encountered by analysts in practice. Nevertheless, it would be useful to conduct future research on ACH involving a diverse set of tasks. Indeed, this could help to identify some of the conditions under which ACH does better or worse.

6.1 Confirmation bias, consistency, and accuracy

In the context of our experiment, confirmation bias was conceptualized as follows: (a) not considering all alternative hypotheses; (b) only evaluating evidence based on whether it is consistent with each hypothesis under consideration; (c) not adjusting belief in a hypothesis in accordance with evidence diagnosticity; (d) selecting the most likely hypothesis based solely on evidence that is consistent with it; and (e) identifying indicators that will only confirm a hypothesis in the future. We found that analysts in the ACH group were no more likely than their untrained counterparts to identify the four alternative hypotheses in the present experiment. On the other hand, the ACH group were more likely to rate evidence as being either inconsistent or consistent with each hypothesis (as opposed to simply more or less consistent) and to take account of evidence diagnosticity. Both the ACH and untrained groups were equally likely to focus solely on consistent evidence when selecting the most likely hypothesis, although the majority of analysts in both groups selected the most likely hypothesis based on an integration of consistent and inconsistent evidence. Finally, analysts in the ACH group who provided indicators for future observation were no more likely to provide indicators that would disconfirm (as opposed to confirm) the hypotheses. Taken from the perspective of untrained analysts, these findings reiterate that, like participants in other psychological studies on “confirmation bias” (e.g., Beattie & Baron, 1988), analysts do not all suffer from such bias. Nevertheless, it seems that analysts may need explicit instructions to differentiate between evidence that is consistent versus inconsistent with a hypothesis and to remove nondiagnostic information from their “working out” (Kemmelmeier, 2004), especially when it may lead to the “dilution” of judgements based on diagnostic information (Shelton, 1999).
The fact that ACH is vague on these two issues means that it has limited value in this regard. Taken from the perspective of analysts trained to use ACH, the above findings highlight not only that analysts may resist applying its evidence integration rule but also that they prefer to use (like their untrained counterparts) a cognitively more complex strategy (i.e., adding up both consistent and inconsistent evidence for each hypothesis). The strategy used by most of the present analysts is beneficial because there is no “loss” of relevant information and the credibility of all available evidence (rather than just the disconfirming evidence) can be taken into account. Perhaps one benefit of any sort of structured analytic technique such as ACH is that it can make the analytic process more transparent and easier to manage and audit by increasing within‐individual consistency. However, we found that the ACH group demonstrated significantly less consistency in terms of evidence assessment and the match between final conclusions and preceding judgements, compared with their untrained counterparts. A large proportion of analysts in both groups also applied their evidence integration strategy inconsistently across hypotheses. Inconsistency in evidence assessment may be partly explained by the fact that, although ACH asks analysts to distinguish between evidence that is highly inconsistent or inconsistent (vs. highly consistent or consistent) with a hypothesis, it does not specify how this should be done. These results support recent warnings about how structured analytic techniques, in general, can foster inconsistency in assessments (Chang et al., 2018; Mandel & Tetlock, 2018). Reducing inconsistency is important because it is difficult to identify the source of error if an analyst is behaving inconsistently. Decision‐support tools may be useful in this domain because they can reduce the cognitive burden on analysts.
Increasing consistency is also important because, as we found (across both groups), it was associated with the accuracy of conclusions reached. Indeed, one could argue that the ultimate goal of analysts is to arrive at an accurate conclusion about a current or future situation. However, we found that only one of the ACH group correctly ranked the four hypotheses from most to least likely and two of the untrained group did so. Analysts in the ACH group were significantly more likely than their untrained counterparts to produce tied ranks between hypotheses, partly because ACH encourages analysts to reduce probabilistic (continuous) data regarding consistency or inconsistency to a 5‐point ordinal scale. Unsurprisingly, the ACH group was no more likely than the untrained group to choose the correct hypothesis (also see Mandel, Karvetski, & Dhami, 2018).

6.2 Other findings on how analysts test competing hypotheses

Several other findings emerged that shed some light on how analysts may solve a hypothesis testing task. First, the majority of untrained analysts reformatted the data in the task. Over half of this group drew an ACH‐style matrix with hypotheses as columns and evidence as rows. It is unclear if this format is helpful. Psychological research suggests that the way in which information is formatted can aid or hinder information processing in a range of cognitive tasks (e.g., Garcia‐Retamero & Dhami, 2011, 2013; Gigerenzer & Hoffrage, 1995). Indeed, Cook and Smallman (2008) found that graphical information displays reduced the attention that naval personnel paid to confirming evidence. Future research ought to systematically examine the effects of ACH's recommended matrix format on analysts' hypothesis testing compared with alternative information formats. Second, although ACH is unclear about how analysts should assess evidence diagnosticity, we observed a correlation between an objective measure of information diagnosticity and judgements of diagnosticity made by individual analysts in the ACH group as well as across analysts in the untrained group. It would, however, be premature to suggest that people may have some “intuitive” capacity to judge diagnosticity, since a variety of strategies can be correlated with objective measures such as the one we used here (i.e., information gain). Future research could more fully explore analysts' strategies for judging information diagnosticity against other existing measures (see Nelson, 2005). Finally, as mentioned earlier, ACH does not take account of base rate information, and unsurprisingly, we found that analysts in the ACH group were significantly less likely to do so compared with their untrained counterparts. Nevertheless, only around half of the untrained group used base rate information.
Base rate neglect is common (e.g., Kahneman & Tversky, 1973; Tversky & Kahneman, 1982). Base rate information is useful because it provides an indication of the prior probability of a hypothesis being true before any evidence is presented. In the present study, such information was useful for arriving at the correct conclusion because of the inequality in base rates for the four hypotheses.

Some believe that ACH may be particularly useful for collaborative analysis, where it can provide analysts with a better understanding of differences of opinion, depersonalize issues, and guide discussion (Heuer, 2007). However, there is, as yet, little empirical evidence to support this view. In Convertino et al.'s (2008) study, reviewed earlier, all groups used a collaborative version of ACH, and yet “confirmation bias” remained evident in all groups (i.e., here, evidence initially favoured one hypothesis but then balanced out across hypotheses later on). Clearly, more research is needed to test the benefits of ACH when applied in a collaborative context.