A prototype radio dish for the Square Kilometre Array was unveiled in Shijiazhuang, China, in February. Credit: SKA Organisation

When astronomer Kai Polsterer’s laptop was stolen, the thieves made off with more than hardware. The laptop contained Polsterer’s only copy of a collection of thousands of stars and galaxies, a sample that a computer algorithm had randomly selected from a data set consisting of millions of celestial objects. Because Polsterer could not re-create what the algorithm had done, he could not exactly reproduce his data set for a work-in-progress journal article. And without a data set, nobody could exactly reproduce his results.

Irreproducibility and the black-box nature of machine learning plague many fields of science, from Earth observation to drug discovery. But astronomy represents a notable case study because its quantity of data is growing at an unprecedented rate. The installation of new data-churning telescopes, combined with marked improvements in pattern-finding algorithms, has led astronomers to turn to sophisticated software for data-crunching they cannot do manually. And the more powerful the analyses become, the less transparent it is how they were performed.

Polsterer, now a group leader of astroinformatics at the Heidelberg Institute for Theoretical Studies, is leading a charge to reform astronomy publishing and reproducibility. The idea is to use source code libraries and other internet resources to publish explanations of what has been done to data from the moment they are collected and to make all data sets available for peer review.

Astronomy data pipeline

In the 1960s, a career in radio astronomy doubled as a career in radio engineering. Those scientists were intimately acquainted with every aspect of data collection, from the specific components of a given telescope to the statistical techniques used to make sense of the data it produced. But as radio astronomy (and other branches of astronomy) became more complex, researchers became more specialized; scientists who understand the full data value chain have become increasingly rare.

Today, algorithms play a central role at every stage of an astronomy observing run. By the time astronomers receive a data set from an observatory, software has already removed signal interference and reduced the mass of information into something manageable. Then researchers slice and dice the data with their own algorithms.

Polsterer gives an example of two research groups hunting for quasars using the same raw data. Each team employs slightly different thresholds for whether a luminous object qualifies as a quasar. Those biases get integrated into each group’s software, and after millions of objects in a sky survey are analyzed, this can produce very different results. “Although both labs say they used the same data, there is no way of defining benchmark data sets,” Polsterer says. “And for publications, no one cares.”
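Polsterer's quasar example can be made concrete with a minimal sketch. The luminosities, cutoff values, and sample size below are all invented for illustration; the point is only that two defensible but slightly different classification thresholds, applied to identical raw data, yield different catalogs.

```python
import random

# Hypothetical sketch: two groups classify the same simulated sources as
# quasars using slightly different luminosity cutoffs (arbitrary units).
# Neither cutoff is "wrong," but over 100,000 objects the catalogs diverge.
random.seed(0)  # fixed seed so the simulated data set itself is reproducible
luminosities = [random.gauss(1.0, 0.3) for _ in range(100_000)]

CUTOFF_GROUP_A = 1.00  # group A's quasar threshold
CUTOFF_GROUP_B = 1.05  # group B's slightly stricter threshold

quasars_a = sum(1 for lum in luminosities if lum > CUTOFF_GROUP_A)
quasars_b = sum(1 for lum in luminosities if lum > CUTOFF_GROUP_B)

print(f"Group A finds {quasars_a} quasars; group B finds {quasars_b}.")
```

Both groups can truthfully say they analyzed "the same data," yet their published quasar counts differ by thousands of objects, which is why Polsterer argues the thresholds themselves belong in the publication record.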

Biases can also sneak into random sampling studies like Polsterer’s, especially ones that deal with rare objects. “It might be that a certain class of galaxy gets overrepresented by chance,” he says. “To make the publication fully reproducible, it would be mandatory to publish the randomly selected subsamples too.”
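One low-cost way to make a random selection like Polsterer's reproducible is to record the random seed (or the selected identifiers themselves) alongside the paper. The catalog size, subsample size, and seed below are invented for illustration:

```python
import random

# Hypothetical sketch: drawing a random subsample from a large catalog.
# Publishing the seed, or the chosen IDs, lets anyone regenerate the
# exact subsample later -- even if the original copy is lost.
catalog_ids = list(range(1_000_000))  # stand-in for millions of objects

SEED = 20240117  # arbitrary value, recorded alongside the publication

rng = random.Random(SEED)
subsample = rng.sample(catalog_ids, k=5_000)

# A second run with the same seed reproduces the selection exactly.
rng_check = random.Random(SEED)
assert rng_check.sample(catalog_ids, k=5_000) == subsample
```

A recorded seed only guarantees reproducibility if the sampling code and library version are also preserved, which is part of the argument for publishing the full analysis pipeline rather than the seed alone.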

Whether based on a chance overrepresentation in an algorithm’s random sampling or an astronomer’s decision on how to define a galaxy, biases become even more pronounced when the data set grows—and that is happening quickly. Sixty years ago a scientist could count sources identified during a radio survey by hand; today it would be impossible for a dozen researchers to do it in their lifetimes. The data streams of current telescopes, such as Australia’s ASKAP, are in the region of terabits a second. The Square Kilometre Array, slated for completion in the 2030s, will be 50 times more sensitive than today’s telescopes, and with that sensitivity will come more data.

Code transparency

Alice Allen is trying to introduce transparency into the algorithms informing astrophysics. The retired computer programmer now oversees the Astrophysics Source Code Library, an online repository of the algorithms that astronomers have used to filter and manipulate their data. The library is home to more than 1,600 source codes, many of which Allen has tracked down herself. To be included in the registry, a code must have been used in research that has appeared in the peer-reviewed literature.

Allen says she considers the data-manipulation process a computational method that should be open to scrutiny. “Science is ultimately a human endeavor, and humans make mistakes and bring their own perspective,” she says. “They are not machines, and so what we do on machines in some way reflects their human creators.”

In a recent study, Allen and colleagues investigated the availability of source code for about 160 randomly selected astrophysics articles published in 2015. Of those papers, only about three in five made their code available for download and scrutiny.

People are reluctant to show how they manipulated their data. Some coders say their work is proprietary and don’t want to lose their intellectual property by making their algorithms open source. Other researchers hesitate for simpler reasons, including embarrassment—often the code is messy, or the principal investigator is largely in the dark about the software that a graduate student created.

“A lot of care must be paid when reading papers based on machine-learning results,” says Giuseppe Longo, a professor of astrophysics at the University of Naples Federico II. He says that younger researchers are particularly susceptible to blindly relying on algorithms because of the publish-or-perish pressures of a science career. “Very often they do not pay attention to the previous literature and are prone either to rediscover the wheel or to make very naïve errors,” Longo says.

To counter the opacity of this data-heavy era, both Allen and Polsterer advocate for complete transparency from the moment data are collected through to journal publication. “The problem with publications is that they are still oriented toward the letter- and paper-based way used 100 years ago,” Polsterer says. “In a digital age, we need to consider how to transform it.”

At an astroinformatics conference held in Cape Town, South Africa, last year, Polsterer’s suggestions brought mixed reactions, with some people strongly supporting his campaign and others doubting whether their institutions would allow them to publish their source code.

Polsterer’s ideal process would involve extensive transparency and collaboration on the internet. A researcher would explain an idea on a webpage, publish all the data associated with the problem, and include code and explanations of how the data were manipulated. Then other researchers would comment and work collaboratively to achieve a result. “In principle, you create a workbench, and a lot of researchers interested in the same research start working together, but in a public place,” Polsterer says.

The first step in achieving full transparency is requiring researchers to publish their process, whether in Allen’s code library or in an appendix with their article. In the end, the issue boils down to scientists engaging with the publishing process, Polsterer says. “I’m working with publishers to rethink the way we publish and to start with some good examples of how to do it,” he says. “I’m just learning from my bad mistakes.”