There seems to be an expectation in science that the people who gather a dataset should also be the ones who analyze it. But often that doesn’t make sense: what it takes to gather relevant data has little to do with what it takes to perform a reasonable analysis. Indeed, the imperatives of analysis can even impede data-gathering, if people have confused ideas of what they can and can’t do with their data.

I’d like us to move to a world in which gathering and analysis of data are separated, in which researchers can get full credit for putting together a useful dataset without the expectation that they also perform a serious analysis. I think that could get around some research bottlenecks.

It’s my impression that this is already done in many areas of science—for example, there are public datasets on genes, and climate, and astronomy, and all sorts of areas in which many teams of researchers are studying common datasets. And in social science we have the NES, GSS, NLSY, etc. Even silly things like the Electoral Integrity Project—I don’t think these data are so great, but I appreciate the open spirit in which they are shared.

In many smaller projects, though—including on some topics of general interest—data are collected by people who aren’t well prepared to do a serious analysis. Sometimes the problems come from conflicts of interest (as with disgraced primatologist Marc Hauser or food researcher Brian Wansink, both of whom seem to have succumbed to strong and continuing incentives to find positive results from their own data); other times it’s as simple as the challenge of using real and imperfect data to answer real and complex questions.

The above thoughts were motivated by a communication I received from Simon Franklin, a post-doc in economics at the London School of Economics, who pointed me to this paper by Stéphane Bermon and Pierre-Yves Garnier: “Serum androgen levels and their relation to performance in track and field: mass spectrometry results from 2127 observations in male and female elite athletes.”

From the abstract of the article in question:

Methods 2127 observations of competition best performances and mass spectrometry-measured serum androgen concentrations, obtained during the 2011 and 2013 International Association of Athletics Federations World Championships, were analysed in male and female elite track and field athletes. To test the influence of serum androgen levels on performance, male and female athletes were classified in tertiles according to their free testosterone (fT) concentration and the best competition results achieved in the highest and lowest fT tertiles were then compared. Results The type of athletic event did not influence fT concentration among elite women, whereas male sprinters showed higher values for fT than male athletes in other events. Men involved in all throwing events showed significantly (p<0.05) lower testosterone and sex hormone binding globulin than men in other events. When compared with the lowest female fT tertile, women with the highest fT tertile performed significantly (p<0.05) better in 400 m, 400 m hurdles, 800 m, hammer throw, and pole vault with margins of 2.73%, 2.78%, 1.78%, 4.53%, and 2.94%, respectively. Such a pattern was not found in any of the male athletic events.
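To make the abstract’s method concrete, here is a minimal sketch of the comparison it describes: sort athletes into fT tertiles, then compare mean performance in the highest vs. lowest tertile with an unpaired Student’s t-test. Everything here is illustrative—the data are simulated, and the lognormal fT distribution, performance scale, and group sizes are my assumptions, not the paper’s:

```python
# Illustrative sketch only: simulated data, made-up parameters.
import math
import random

random.seed(1)
athletes = [(random.lognormvariate(0.0, 0.5),   # simulated fT level
             random.gauss(100.0, 5.0))          # simulated performance mark
            for _ in range(90)]

athletes.sort(key=lambda a: a[0])               # sort by fT
n = len(athletes) // 3
low  = [perf for _, perf in athletes[:n]]       # lowest fT tertile
high = [perf for _, perf in athletes[-n:]]      # highest fT tertile

def t_stat(x, y):
    """Unpaired Student's t with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    sy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = ((nx - 1) * sx + (ny - 1) * sy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp * (1 / nx + 1 / ny))

t = t_stat(high, low)
# Two-sided p-value via the normal approximation (close enough at ~58 df)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
print(f"t = {t:.2f}, approx p = {p:.3f}")
```

Note that this throws away the middle tertile entirely and dichotomizes a continuous predictor—two of the design choices that make the paper’s analysis hard to interpret.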

Franklin writes:

I’m sure you wouldn’t be surprised to see these kinds of mistakes in published work. What is more distressing is that this evidence is said to be a key submission in the IAAF’s upcoming case at CAS [the Court of Arbitration for Sport], since the CAS has argued that sex classification on the basis of T levels is only justified if high T confers a “significant competitive advantage”. (Of course, one might reasonably disagree with that standard, but that is the standard the IAAF, and this paper, are trying to meet. An IAAF official is, in fact, a co-author.) Media coverage here, here, and here.

The paper correlates testosterone levels in athletes with their performance at a recent World Championship and makes causal claims about the effects of testosterone on female performance. There are more than a few problems with the paper, not least the fact that it makes causal claims from correlations in a highly selective sample, and the bizarre choice of comparing averages within the highest and lowest tertiles of fT levels using a Student’s t-test (without any other statistical tests presented).

But most problematic is the multiple hypothesis testing. The authors test for a correlation between T levels and performance across a total of over 40 events (men and women) and find a significant correlation in 5 events, at the 5% level. They then conclude: “Female athletes with high fT levels have a significant competitive advantage over those with low fT in 400 m, 400 m hurdles, 800 m, hammer throw, and pole vault.” These are the 5 events for which they found significant correlations! And we are led to believe that there is no such advantage for any of the other events.

Further, when I attempt to replicate the p-values (using the limited data available) I find only 3 out of the 5 with p<0.05, and at least three women’s events with p<0.15 with signs in the opposite direction (high-T athletes perform worse), strongly suggesting that a joint test on standardized performance measures would fail to reject. Note also that this study is being done precisely because there are currently at least a few high-performing hyperandrogenic women in world athletics, and these women are (presumably) included in their sample. Now, of course, there are all sorts of endogeneity problems that could be leading to a downward bias in these estimates. And indeed I’m surprised to see such a weak correlation in so many events, given what I’ve read about the physiology. But the conclusion of this paper cannot possibly be justified on the basis of the evidence.
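Franklin’s multiple-comparisons point is easy to quantify with a back-of-the-envelope calculation. Assume, as a simplification (the paper’s tests are neither exactly 40 in number nor independent), 40 independent event-level tests at the 5% level and no true effects anywhere. Then the expected number of “significant” results is 2, and the chance of seeing 5 or more can be computed exactly from the binomial distribution:

```python
# How many "significant" events would chance alone produce?
# Simplifying assumptions: ~40 independent tests, alpha = 0.05, all nulls true.
from math import comb

n_tests, alpha = 40, 0.05

# Expected number of false positives under the global null
expected_fp = n_tests * alpha  # = 2.0

# Exact binomial probability of 5 or more p < .05 results by chance
p_5_or_more = sum(
    comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)
    for k in range(5, n_tests + 1)
)

print(f"expected false positives: {expected_fp:.1f}")
print(f"P(>= 5 significant | all nulls true): {p_5_or_more:.3f}")
```

This is only a crude bound on the problem—the tests are correlated, and the events reported as significant were selected after the fact—but it shows why “5 of 40 at the 5% level” is weak evidence on its own.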

It’s hard for me to judge this, in part because I know next to nothing about doping in sports, and in part because the statistical analysis and data processing in this paper are such a mess that I can’t really figure out what data they are working with, what exactly they are doing, or the connection between some of their analyses and their scientific goals. So, without making any comment on the substance of the matter—the analysis in that paper is so tangled and I don’t have a strong motivation to work it all out—let me just say that statistics is hard, and papers like this give me more of an appreciation for the sort of robotic-style data analyses that are sometimes recommended in biostatistics. Cookbook rules can be pretty silly, but it all gets worse when people just start mixing and matching various recipes (“Data distributions were assessed for normality using visual inspection, calculation of skewness and kurtosis, and the Kolmogorov-Smirnov test. . . . The effects of the type of athletic event were tested with a one-way analysis of variance . . . Tukey HSD (Spjotvoll/Stoline) post hoc test when appropriate . . . athletic performances and Hb concentrations of the highest and lowest fT tertiles were compared by using non-paired Student’s t-test. These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required. . . . When appropriate, a χ2 test was used. Correlations were tested by the Pearson correlation test.”) I have no reason to think these authors are cheating in any way; this just appears to be the old, old story of a data analysis that is disconnected from the underlying questions being asked.

What to recommend here? I feel like any such analysis would really have to start over from scratch: the authors of the above paper would be a good resource, but you’d have to look more carefully at the questions of interest and how they can be addressed by the data. That takes effort and expertise, and expertise is not so easy to come by—the IAAF is not, I assume, an organization that has a bunch of statisticians kicking around.

Given all this, I think the way to go is for people such as Bermon and Garnier to publish their data and their speculations, and leave the data analysis to others.