Kiret Dhindsa, postdoctoral fellow1, Mohit Bhandari, professor2, Ranil R Sonnadara, associate professor2

1Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
2Department of Surgery, McMaster University, Hamilton, Ontario, Canada

Correspondence to: K Dhindsa dhindsj{at}mcmaster.ca

Poor data quality, incompatible datasets, inadequate expertise, and hype

Big data refers to datasets that are too large or complex to analyse with traditional methods.1 Instead we rely on machine learning, self-updating algorithms that build predictive models by finding patterns in data.2 In recent years, a so-called "big data revolution" in healthcare has been promised3 4 5 so often that researchers are now asking why this supposed inevitability has not happened.6 Although some technical barriers have been correctly identified,7 there is a deeper issue: many of the data are of poor quality and held in small, incompatible datasets.

Current practices around collection, curation, and sharing of data make it difficult to apply machine learning to healthcare on a large scale. We need to develop, evaluate, and adopt modern health data standards that guarantee data quality, ensure that datasets from different institutions are compatible for pooling, and allow timely access to datasets by researchers and others. These prerequisites for …