
Pete Warden, "Why you should never trust a data scientist", 7/18/2013:

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. […]

I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form as databases turn over, and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there’s no external checks in the system to keep them that way.

Amen. Except that Warden's trust in that "web of references and peer review" is naive and unfounded. Most peer-reviewed scientific papers are based on unpublished (and typically unavailable) data, and under-documented (and often crucially errorful) methods. Journals are reluctant to publish negative results (for the plausible reason that there are lots of ways to screw up an experiment), and equally reluctant to publish failures to replicate positive ones. For these and other reasons, most peer-reviewed scientific papers are wrong, and the more prominent the journal, the less likely published results are to be replicable.

A serious effort is underway to ameliorate if not fix these problems, but there's a long way to go.

Warden's concluding advice is likely to make things worse:

What should you do? If you’re a social scientist, don’t let us run away with all the publicity, jump in and figure out how to work with all these new sources.

Successful scientific PR is not necessarily antithetical to valid science, but there's good evidence of a negative correlation. A couple of random examples, out of dozens from LL coverage over the years: "Debasing the coinage of rational inquiry", 4/22/2009; "'Vampirical' hypotheses", 4/28/2011.

It's not helpful to urge scientists to grab hunks of "big data" and pursue publicity even more avidly. Regular LL readers know that I'm a strong proponent of empirical methods in studies of speech, language, and communication — but whatever the dataset sizes and analysis methods involved, the key methodological issue is reproducibility. This normally requires publication of all (raw) data and (implemented) methods.

Ironically, traditional "armchair" syntax and semantics is entirely reproducible: The explicandum is a pattern of judgments about specified examples. You can disagree about the judgments, or about the argument from the pattern of judgments to a conclusion, but all the cards are on the table. The same thing is true of traditional work in phonology and morphology, which makes assertions about patterns of documented lexical fact.

But traditional experimental research in phonetics, psycholinguistics, sociolinguistics, corpus linguistics, neurolinguistics etc. is generally not reproducible: the raw data is usually not available; detailed annotations or classifications of the data may be withheld along with documentation of the methods used to create them; the fine details of the statistical analysis may be unavailable (e.g. decisions about data inclusion and exclusion, specific methods used, possible algorithmic or coding errors).

Does this matter? Often, the lack of transparency in scientific publication hides over-interpretation, mistakes, and even outright fraud — see e.g. the priming controversy, the fall of Marc Hauser, the Duke biomarkers scandal, and so on.

This is unfortunately not an unusual situation — there are many examples within linguistics where false or misleading ideas have become widely accepted on the basis of flawed experimental evidence, and where access to the experimental data would probably have limited the damage.

Here's one example among many: A series of important papers from 1976 onwards argued for a categorical distinction between e.g.

le mappe di città [v:]ecchie "the maps of old cities"

le mappe di città [v]ecchie "the old maps of cities"

This conclusion was originally based on native-speaker intuitions, though none of the original authors spoke a relevant dialect of Italian. Intuition was later supported by a small phonetic experiment, which was crucially effective in countering native speakers who doubted the judgments. This "fact" was crucial evidence in favor of a widely-accepted hypothesis, namely that well-defined prosodic constituents exist, arranged in a "prosodic hierarchy", and that crisp formal rules define the relationship between syntactic structures and prosodic structures, which in turn govern the application of certain external sandhi rules, of which raddoppiamento (fono)sintattico became a paradigm example.

This argument was very influential throughout the 1980s and 1990s.

But in fact, the basic observation was completely wrong. Italian raddoppiamento sintattico works rather like English flapping and voicing — it can apply anywhere in connected speech. The cited phonetic measurements were apparently due to facultative disambiguation: insertion of overt silent pauses to disambiguate (somewhat unnatural) sentences presented as minimal pairs. For a detailed summary of the situation, see e.g. Matthew Absalom et al., "A Typology of Spreading, Insertion and Deletion or What You Weren't Told About Raddoppiamento Sintattico in Italian", ALS 2002.

The view of RS as an "anywhere" rule was strongly supported by corpus-based work, e.g. in Agostiniani, "Su alcuni aspetti del rafforzamento sintattico in Toscana e sulla loro importanza per la qualificazione del fenomeno in generale", Quaderni del Dipartimento di Linguistica, Università degli studi di Firenze (1992). And in 1997, one of the original authors admitted that "… notably in Tuscan and romanesco, raddoppiamento fonosintattico [RS] seems to apply throughout sentences without regard to their syntactic (and [derived] phonological) constituency" (Irene Vogel, "Prosodic phonology", in Maiden & Parry (eds), The Dialects of Italy).

How did this happen? I'm not asking about the sociological process whereby a 35-year-old false generalization, abandoned by its originators 15 years ago, continues to be treated by some as part of the foundations of the field. Rather, I want to discuss the natural processes that led to the wrong generalization in the first place.

1. "Facultative disambiguation" — The natural contrastive effect of considering a minimal pair in juxtaposition usually leads to an exaggeration of the natural distinctions, and sometimes to the deployment of unusual (phonetic or pragmatic) resources in order to create a clear separation.

2. "Selection bias" — It's natural to choose cases where a phenomenon of interest seems to be especially clear, and this often leads to the selection of examples from the ends of a continuum or from widely-separated regions of a more complex space; or perhaps examples where some additional associated characteristics re-inforce the apparent differences.

3. "Confirmation bias" — As an apparent pattern begins to emerge, we (individually or as a field) tend to focus on evidence that confirms the pattern, and to put problematic or equivocal evidence into the background.

All of these things can easily happen with laboratory experiments as well as with intuitions: We choose experimental materials that seem to work especially well ("selection bias"); experimental subjects are likely to notice (near-) minimal pairs, and to exaggerate the contrasts that they imply ("facultative disambiguation"); and experiments often don't work for irrelevant reasons, and so it's tempting (and often correct) to put "failed" experiments aside in favor of "successful" ones.
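To make the last point concrete, here is a toy simulation of what happens when "failed" experiments are set aside. It is only a sketch under invented assumptions: the true effect is exactly zero, each experiment has 20 subjects with unit-variance noise, and an experiment counts as "successful" when its mean observed effect exceeds an arbitrary threshold of 0.3. None of these numbers, and none of the function names, come from any real study.

```python
import random
import statistics


def run_experiment(rng, n_subjects=20, true_effect=0.0):
    """Mean observed effect in one simulated experiment (hypothetical setup)."""
    observations = [rng.gauss(true_effect, 1.0) for _ in range(n_subjects)]
    return statistics.fmean(observations)


def simulate(n_experiments=1000, threshold=0.3, seed=0):
    """Run many null experiments and keep only the 'successful' ones."""
    rng = random.Random(seed)  # fixed seed, so the run can be repeated exactly
    effects = [run_experiment(rng) for _ in range(n_experiments)]
    # "Successful" is an assumption here: observed mean above the threshold.
    kept = [e for e in effects if e > threshold]
    return effects, kept


if __name__ == "__main__":
    effects, kept = simulate()
    print("experiments run:      ", len(effects))
    print("experiments kept:     ", len(kept))
    print("mean effect, all runs:", round(statistics.fmean(effects), 3))
    print("mean effect, kept:    ", round(statistics.fmean(kept), 3))
```

The average over all runs sits near zero, while the average over the kept runs is necessarily above the threshold, even though nothing real is going on. A reader who sees only the kept runs cannot tell the difference; one with access to all of the runs can.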

Of course, all of these things — especially selection bias and confirmation bias — can also happen in corpus-based research. But both in laboratory experiments and corpus studies, the best way to avoid or fix such mistakes is to make sure that all of the data and methods are available for others to check and extend.

Beyond possible problems with flawed, mistaken, or outright fraudulent studies, there are significant positive benefits to "reproducibility": it reduces barriers to entry, and speeds up extension as well as replication. The greatest benefits accrue to the original researchers themselves, who don't have to waste time trying to remember or recreate what they did to get some results from a few years (or even a few months) earlier.

So we can hope that some day, experimental research on speech and language will be as reproducible as armchair linguistics always has been.
