It is very unlikely that my blog, even if I continue it for decades, will ever have the impact of Stephen Jay Gould’s (1981) “The Mismeasure of Man”. It was a best seller and has been cited in the academic literature over 10,000 times, including 445 times in 2017 alone. It continues to meet an audience need.

Why was it so popular? I read it and found that it was written in a very engaging way. In my view Gould had an excellent prose style. I enjoyed his essays. His book attacked intelligence tests, which had fallen in popularity, and had come to be seen as a Bad Thing. Intelligence testing had originally been seen as a very good thing, providing opportunities to bright young children who could not afford fancy schools, but who deserved the opportunity of good quality education and employment. Intelligence tests were meritocratic, not aristocratic. You could not fool them with specific knowledge derived from private tuition. They were the great levellers. Although it is hardly relevant to their actual veracity, they were warmly received by the political Left, who saw in these assessments a vindication of the working-class talents which had been suppressed by private education.

Why was it that SJ Gould had such an impact when he argued that the tests were biased against working class and minority racial groups? Moreover, how did his views ever take hold when the issue of bias in intelligence testing had just been comprehensively evaluated in Arthur Jensen’s (1980) “Bias in Mental Testing”? Jensen showed that, far from under-predicting African-American achievements, the tests perhaps slightly over-predicted them. I presume that Jensen’s volume was less often read, though it was written by an expert, not a polemicist. Perhaps precisely because it was written by an expert, in a restrained and far from folksy style, it had less impact on popular culture, which is what tends to determine public debates.

I leave the full explanation to others, but I think that a good prose style, no equations, few numbers and little in the way of statistical and logical arguments generally increases readership. That would be predicted by the bell curve, which makes it plain that technical books about difficult subjects are a minority interest.

Gould’s book made a number of assertions. Two that stuck in people’s minds were: that measures of brain size derived from the study of skulls of different races had been biased, and that many items on the Army tests of intelligence were culturally biased.

The debate about the ancient skulls has raged to and fro for a long time, but it seems highly probable that the measures were taken correctly.

Now the redoubtable Russell Warne has taken a detailed look at what Gould said about the Army Beta test, and finds that on that topic he has been unreliable and incorrect.

https://res.mdpi.com/jintelligence/jintelligence-07-00006/article_deploy/jintelligence-07-00006.pdf?filename=&attachment=1

A number of points:

Face validity. Sure, it helps if a test item looks relevant to the job you are applying for. However, a test item may have high predictive value without seeming to. This is the famous “indifference of the indicator” dictum. If it predicts, use it. Furthermore, you cannot dismiss an item simply because you yourself can think of a way in which it might be misinterpreted, as Gould did. You need to show that such misinterpretations actually exist (and compare them with the misinterpretations which arise on items which seem fine to you).

Testees were not baffled by the use of numbers, in the sense of digits, as Gould implied. All language speakers had knowledge of digits because they had had some years of education.

Gould twists things. His reading of the instructions was that the men would be “scared shitless” whereas an officer who had actually done the testing wrote later: “It was touching to see the intense effort put into answering the questions, often by men who never before had held a pencil in their hands”. A shade different, don’t you think?

Gould claimed that “vast numbers of men” earned zero scores, and therefore, must not have been able to understand the Army Beta test instructions and/or stimuli. However, only 4% scored less than 10 in total, and only 2.6% of testees scored less than 5 points. Gould neglected to point out that the standard procedure in the Army was that the low scorers were then individually tested on the Stanford Binet to give them yet another chance to do well.

Gould reports an officer’s unfavourable view of the testing, but does not show that 13 other officers were favourable.

Gould criticised the short time limits on some subtests, saying they were also too short for his biology students, on whom he used the test (see later). Warne politely explains that short time limits on process tasks are required because otherwise the tasks are too easy, and discriminate poorly. Short time limits are a good feature, not a bug. (This is a common misunderstanding. See Hyde on sex differences in the speed of completion of tasks.)
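The point about time limits can be illustrated with a toy simulation. This is only a sketch under my own assumptions, not anything taken from Warne's paper: the function names, the 20-item subtest, and the distribution of working speeds are all hypothetical. Each simulated person works through easy items at their own speed, and the score is the number of items reached before time runs out.

```python
import random
import statistics


def simulate_scores(time_limit, n_people=1000, n_items=20, seed=0):
    """Return simulated subtest scores for a given time limit (arbitrary units)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_people):
        # Hypothetical working speed in items per unit time, never below 0.1.
        speed = max(0.1, rng.gauss(1.0, 0.3))
        # Items are easy, so the score is simply how many were reached,
        # capped at the length of the subtest.
        scores.append(min(n_items, int(speed * time_limit)))
    return scores


def ceiling_share(scores, n_items=20):
    """Fraction of people who obtained the maximum possible score."""
    return sum(s == n_items for s in scores) / len(scores)


short = simulate_scores(time_limit=10)  # strict limit
long = simulate_scores(time_limit=60)   # generous limit

for label, scores in (("short", short), ("long", long)):
    print(label,
          "share at ceiling:", round(ceiling_share(scores), 3),
          "score SD:", round(statistics.pstdev(scores), 2))
```

Under these assumptions the generous limit lets almost everyone finish, so the scores pile up at the maximum and can no longer separate fast workers from slow ones. The strict limit keeps the scores spread out, which is what a test needs in order to discriminate.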

Gould criticised the Beta test, saying that poor testing circumstances meant that it could not be considered a test of innate intelligence. He failed to tell his readers test-constructor Boring’s opinion that the tests had predictive value. Also, the test creators rarely mentioned “innate intelligence”. They simply found that test results helped them predict who would do well on the tasks the army required, which was the whole purpose of testing.

The test creators believed that different levels of education, and immigrant status, were likely to have influenced performance on the test. Gould cast Yerkes as dismissing that factor, when in fact Yerkes discussed it and correctly said that a correlation between years spent in the US and higher test scores showed an acculturation effect, but did not identify a cause.

Gould also downplayed the work done on establishing the validity of the Army tests. Scores on the Army Beta correlated positively with scores on other intelligence tests, including the Army Alpha (r = 0.811) and the Stanford-Binet (r = 0.727), both the “gold standard” of intelligence measurement at the time ([15], p. 634). Army Beta scores also correlated positively with external criteria, such as the number of years of schooling a recruit had (both as children and adults), commanding officers’ ratings of soldiers’ job performance, and army rank.

After all this, I would have regarded Gould as having given an unfair account of the test, and left it at that, job done. Warne, perhaps prisoner of the American work ethic, has gone further. He gave the Beta test to his students, and also pre-registered his expectations. This is excellent. Instead of getting the results and saying “I told you so” he puts his prior assumptions up for examination. If only Gould had done that.

For me the most interesting result is that the test picks up what looks like a confirmation of a secular trend. As more people get to go to college, scores go down and more resemble the average in the general population from which students are selected.
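The arithmetic behind that selection effect can be sketched in a few lines. This is an idealised illustration of my own, not Warne's analysis: it assumes ability is normally distributed in the population and that colleges admit strictly from the top down, which real admissions do not do. The function name `mean_of_selected` is hypothetical.

```python
from statistics import NormalDist


def mean_of_selected(top_fraction):
    """Mean ability, in population SD units, of the top `top_fraction` of people."""
    std_normal = NormalDist(0.0, 1.0)
    # Cut-off score above which the chosen fraction of the population lies.
    cutoff = std_normal.inv_cdf(1.0 - top_fraction)
    # Mean of a standard normal distribution truncated below at `cutoff`.
    return std_normal.pdf(cutoff) / top_fraction


for fraction in (0.05, 0.20, 0.50, 0.80):
    print(f"top {fraction:.0%} admitted: mean = {mean_of_selected(fraction):+.2f} SD")
```

As the admitted fraction grows, the average of those admitted slides towards the population mean of zero, which is the pattern described above.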

Warne says:

Given these results from our replication, it seems that Gould’s criticism of time limits and his argument that the Army Beta did not measure intelligence are without basis. Despite the short time limits for each Army Beta subtest, the results of this replication support the World War I psychologists’ belief that the Army Beta measured intelligence. We demonstrated this in the following four relevant results of our replication:

Here is my summary of those four results.

1 Gould’s Harvard students were 1.3 standard deviations brighter than the students at Warne’s open access college. Selection was more important than the Flynn Effect. (This result makes a lot of sense.)

2 Gould had administered the test properly (against Warne’s own supposition, which he could have covered up if he had not been honest enough to preregister his suppositions).

3 The Beta subtest scores correlate positively (positive manifold), and the total score correlates with two self-reported measures of scholastic attainment.

4 The Beta test is best described by a single factor.
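For readers who want to see what “positive manifold” and “a single factor” look like in numbers, here is a rough sketch on simulated data. Everything in it is an assumption of mine: the five subtests, their loadings on a general factor, and the sample size are invented, and the check uses the largest eigenvalue of the correlation matrix as a simple stand-in for the factor analysis Warne actually reported.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate_subtests(n_people=2000, loadings=(0.8, 0.7, 0.6, 0.6, 0.5), seed=1):
    """Simulate subtest scores driven by one general factor plus independent noise."""
    rng = random.Random(seed)
    columns = [[] for _ in loadings]
    for _ in range(n_people):
        g = rng.gauss(0, 1)  # hypothetical general ability
        for column, loading in zip(columns, loadings):
            noise = rng.gauss(0, 1) * math.sqrt(1 - loading ** 2)
            column.append(loading * g + noise)
    return columns


def correlation_matrix(columns):
    return [[pearson(a, b) for b in columns] for a in columns]


def largest_eigenvalue(matrix, iterations=200):
    """Power iteration: approximates the largest eigenvalue of a symmetric matrix."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return eigenvalue


corr = correlation_matrix(simulate_subtests())
k = len(corr)
# Positive manifold: every pair of different subtests correlates positively.
positive_manifold = all(corr[i][j] > 0 for i in range(k) for j in range(k) if i != j)
# Share of total variance carried by the first component.
first_component_share = largest_eigenvalue(corr) / k

print("positive manifold:", positive_manifold)
print("variance share of first component:", round(first_component_share, 2))
```

In data built this way every subtest correlates positively with every other, and one component accounts for far more variance than any single subtest would on its own. That is the general shape of result points 3 and 4 describe, though the real analysis is of course done on actual test scores.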

In summary, Warne has done an excellent job in showing that Gould traduced the Army Beta test and the researchers that created it. Bluntly, Gould mis-represented the test, and misled his readers. Gould probably achieved his objective, which was to trash intelligence testing in the eyes of a generation of academics.

Warne has shown that the Beta test still works. It is a good predictor of intelligence, which correlates with current measures of scholastic attainment, shows a positive manifold and resolves into a common factor. Meeting a standard which Gould never attempted, Warne pre-registered his prior assumptions so that the results of his experiment could be plainly seen by the reader, and so that the facts could prove him wrong.

Warne’s achievement is to have shown that Gould got it wrong.