About this Episode

A discussion with Katharine Jarmul, aka kjam, about some of the challenges of testing in data science.

Some of the topics we discuss:

experimentation vs testing

testing pipelines and pipeline changes

automating data validation

property-based testing

schema validation and detecting schema changes

using unit test techniques to test data pipeline stages

testing nodes and transitions in DAGs

testing expected and unexpected data

missing data and non-signals

corrupting a dataset with noise

fuzz testing for both data pipelines and web APIs

datafuzz

hypothesis

testing internal interfaces

documenting and sharing domain expertise to build a good sense of what is reasonable

intermediary data and stages

neural networks

speaking at conferences
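
To make a few of these topics concrete (schema validation, unit testing a pipeline stage, and corrupting a dataset with noise), here is a minimal sketch using only the Python standard library. It is not taken from the episode and does not use datafuzz or Hypothesis; the names SCHEMA, validate_row, clean_stage, and corrupt are all hypothetical, chosen only for illustration.

```python
import random

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"user_id": int, "amount": float}


def validate_row(row):
    """Return True if the row has exactly the schema's keys with the right types."""
    return set(row) == set(SCHEMA) and all(
        isinstance(row[key], expected) for key, expected in SCHEMA.items()
    )


def clean_stage(rows):
    """Hypothetical pipeline stage: keep only rows that match the schema."""
    return [row for row in rows if validate_row(row)]


def corrupt(rows, rate, seed=0):
    """Return a copy of rows where roughly `rate` of them lose their amount."""
    rng = random.Random(seed)  # seeded so the test is repeatable
    noisy = []
    for row in rows:
        row = dict(row)  # copy, so the clean input is left untouched
        if rng.random() < rate:
            row["amount"] = None  # simulate missing data
        noisy.append(row)
    return noisy


good = [{"user_id": i, "amount": float(i)} for i in range(100)]
noisy = corrupt(good, rate=0.2)
cleaned = clean_stage(noisy)

# Expected data passes through unchanged.
assert clean_stage(good) == good
# Unexpected data is dropped by the stage instead of flowing downstream.
assert all(validate_row(row) for row in cleaned)
assert len(cleaned) == sum(row["amount"] is not None for row in noisy)
```

The stage is tested like any other function: once with data it should accept, and once with deliberately damaged data it should reject. A real pipeline would likely need richer corruption (wrong types, duplicates, out-of-range values), which is the kind of thing the libraries linked below aim to help with.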

Episode Links

@kjam on Twitter — Data Magic and Computer Sorcery

Kjamistan: Data Science

datafuzz Python library — The goal of datafuzz is to give you the ability to test your data science code and models with BAD data.

Hypothesis Python library — Hypothesis is a Python library for finding edge cases in your code you wouldn’t have thought to look for.