Modern data engineering and analysis workflows often rely on data manipulation libraries, which in the Python ecosystem usually means pandas. One problem you may have encountered with this powerful tool is that a dataframe can be an opaque object whose contents, data types, and other properties are hard to reason about.

One tool that may help you with this problem is pandera, which was accepted by pyOpenSci as part of its ecosystem of packages in September 2019. Pandera provides a flexible and expressive data validation toolkit that helps users make statistical assertions about pandas data structures.

A Statistical Data Validation Toolkit for Pandas

To illustrate pandera’s capabilities, let’s use a small toy example. Suppose you’re analyzing data in the context of a mission-critical project, where it’s vital to ensure the quality of the datasets that you’re looking at.

Each row in the dataset is uniquely identified by a person_id, and the columns describe that person’s height_in_cm and age_category.

    import pandas as pd

    dataset = pd.DataFrame(
        data={
            "height_in_cm": [150, 145, 122, 176, 137, 151],
            "age_category": ["20-30", "10-20", "10-20", "20-30", "10-20", "20-30"],
        },
        index=pd.Series([100, 101, 102, 103, 104, 105], name="person_id"),
    )

    print(dataset)

               height_in_cm age_category
    person_id
    100                 150        20-30
    101                 145        10-20
    102                 122        10-20
    103                 176        20-30
    104                 137        10-20
    105                 151        20-30

You want to ensure that some columns have the correct data type, or that the dataset fulfills certain statistical properties. Pandera allows you to validate a DataFrame to ensure that these conditions are met. It allows you to spend less time worrying about the correctness of a DataFrame’s data so you can make the right assumptions in analyzing it.

Column Presence and Type Checking

The most basic type of schema is one that simply checks that specific columns exist with specific datatypes.

    import pandera as pa

    schema = pa.DataFrameSchema(
        columns={
            "height_in_cm": pa.Column(pa.Int),
            "age_category": pa.Column(pa.String),
        },
        index=pa.Index(pa.Int, name="person_id"),
    )

    schema(dataset)

The schema object is callable, so you can validate the dataset by passing it in as an argument to the schema call. If the dataframe passes schema validation, schema simply returns the dataframe.
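For comparison, here is a minimal sketch of what the same column presence and type checks might look like when written by hand in plain pandas. The helper name check_columns_by_hand and its message wording are made up for this illustration and are not part of pandera; the point is only to show the kind of boilerplate a schema replaces.

```python
import pandas as pd
from pandas.api import types as ptypes

dataset = pd.DataFrame(
    data={
        "height_in_cm": [150, 145, 122, 176, 137, 151],
        "age_category": ["20-30", "10-20", "10-20", "20-30", "10-20", "20-30"],
    },
    index=pd.Series([100, 101, 102, 103, 104, 105], name="person_id"),
)


def check_columns_by_hand(df):
    """Hypothetical helper: return a list of problems found in df."""
    # Map each expected column to a predicate that accepts its dtype.
    expected = {
        "height_in_cm": ptypes.is_integer_dtype,
        "age_category": ptypes.is_string_dtype,
    }
    problems = []
    for column, dtype_is_ok in expected.items():
        if column not in df.columns:
            problems.append(f"column {column!r} not in dataframe")
        elif not dtype_is_ok(df[column]):
            problems.append(
                f"column {column!r} has unexpected dtype {df[column].dtype}"
            )
    return problems


problems = check_columns_by_hand(dataset)
renamed = dataset.rename(columns={"height_in_cm": "weight_in_kg"})
problems_after_rename = check_columns_by_hand(renamed)
```

Every new column or constraint means another branch in a function like this, whereas the schema keeps the same information in one declarative object.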

If validation fails, pandera raises a SchemaError with a useful error message:

    invalid_dataframe = pd.DataFrame({
        "weight_in_kg": [44, 31, 55, 61, 55, 62],
        "age_category": ["20-30", "10-20", "10-20", "20-30", "10-20", "20-30"],
    })

    schema(invalid_dataframe)

    SchemaError: column 'height_in_cm' not in dataframe
       weight_in_kg age_category
    0            44        20-30
    1            31        10-20
    2            55        10-20
    3            61        20-30
    4            55        10-20

Basic Statistical Checks

If you want to make stricter assertions about the empirical properties of the dataset, you can supply the checks keyword argument to the Column and Index constructors with a Check or a list of Checks.

    schema = pa.DataFrameSchema(
        columns={
            "height_in_cm": pa.Column(
                pa.Int,
                # height in centimeters should be between 100 and 300
                checks=pa.Check(lambda s: (100 < s) & (s < 300)),
            ),
            "age_category": pa.Column(
                pa.String,
                # check allowable age categories
                checks=pa.Check(lambda s: s.isin(["10-20", "20-30"])),
            ),
        },
        index=pa.Index(
            pa.Int,
            name="person_id",
            checks=[
                # id is a positive integer
                pa.Check(lambda s: s > 0),
                # id is unique
                pa.Check(lambda s: s.duplicated().sum() == 0),
            ],
        ),
    )

    schema(dataset)

A Check object specifies the exact implementation of how to validate a column or index. The first positional argument in its constructor is a callable with the signature:

    Callable[[pd.Series], Union[bool, pd.Series[bool]]]

Notice that the only constraint on the callable is that it takes a Series as input and returns a boolean or a boolean Series. By design, checks have access to the entire pandas Series API to make assertions about the properties of a particular column or index.
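To make the two return types concrete, here is a small pandas-only sketch of both kinds of callable applied directly to a Series, outside of any schema. The function names and the three-row Series are illustrative assumptions; the function bodies mirror the lambdas used in the schema above.

```python
import pandas as pd

heights = pd.Series([150, 145, 91], index=[100, 101, 102], name="height_in_cm")


def in_plausible_range(s):
    """Element-wise check: returns one boolean per value in the Series."""
    return (100 < s) & (s < 300)


def has_no_duplicates(s):
    """Series-level check: returns a single boolean for the whole Series."""
    return bool(s.duplicated().sum() == 0)


element_wise = in_plausible_range(heights)  # boolean Series: True, True, False
whole_series = has_no_duplicates(heights)   # single bool: True
```

Either function could be passed as the first argument to a Check. The element-wise form has the advantage that a failure can be traced back to specific index values.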

Indexed Error Messages

In cases where the Check returns a boolean Series, violations of the schema are reported by the index location of failure cases.

    invalid_data = pd.DataFrame(
        data={
            "height_in_cm": [91, 105, 87, 87],
            "age_category": ["10-20", "10-20", "10-20", "10-20"],
        },
        index=pd.Series([200, 201, 202, 203], name="person_id"),
    )

    schema(invalid_data)

    pandera.errors.SchemaError: <Schema Column: 'height_in_cm' type=int64> failed element-wise validator 0:
    <lambda>
    failure cases:
                   person_id  count
    failure_case
    87            [202, 203]      2
    91                 [200]      1

The error is reported as a stringified dataframe in which the failure_case index enumerates the height_in_cm values that failed validation, the person_id column lists the index locations of each failure case, and the count column displays the number of instances of that failure case.
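To see how a summary of this shape can arise from a boolean Series, here is a pandas-only sketch that rebuilds the same table from the failing heights. It illustrates the idea and is not pandera's actual implementation; the variable names are assumptions made for this example.

```python
import pandas as pd

heights = pd.Series(
    [91, 105, 87, 87],
    index=pd.Index([200, 201, 202, 203], name="person_id"),
    name="height_in_cm",
)

# Same element-wise condition as the height check in the schema.
passed = (100 < heights) & (heights < 300)
failures = heights[~passed]

# Group the failing rows by value, collecting where each value occurred
# and how many times it occurred.
grouped = failures.reset_index().groupby("height_in_cm")["person_id"]
summary = pd.DataFrame({
    "person_id": grouped.apply(list),
    "count": grouped.size(),
})
summary.index.name = "failure_case"

print(summary)
```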

Statistical Hypothesis Tests

What if we wanted to test the hypothesis that older people tend to be taller? We can achieve this with the Hypothesis check:

    schema = pa.DataFrameSchema(
        columns={
            "height_in_cm": pa.Column(
                # perform a one-sided two-sample t-test of
                # the distribution of heights by age category,
                # with an alpha value of 5%
                checks=pa.Hypothesis.two_sample_ttest(
                    groupby="age_category",
                    sample1="20-30",
                    relationship="greater_than",
                    sample2="10-20",
                    alpha=0.05,
                    equal_var=True,
                )
            ),
            "age_category": pa.Column(
                pa.String,
                checks=pa.Check(lambda s: s.isin(["10-20", "20-30"])),
            ),
        }
    )

    schema(dataset)
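To get a feel for what this check evaluates, the sketch below works through a textbook pooled-variance two-sample t-test by hand on the toy heights, using only the standard library. It assumes that equal_var=True corresponds to the classic pooled-variance (Student's) test, and it compares the statistic against the tabulated one-sided critical value for 4 degrees of freedom instead of computing a p-value. It is an illustration of the statistics, not of how pandera computes the result internally.

```python
from math import sqrt
from statistics import mean, variance

# Heights from the toy dataset, split by age_category.
older = [150, 176, 151]    # "20-30"
younger = [145, 122, 137]  # "10-20"


def pooled_t_statistic(sample1, sample2):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = (
        (n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)
    ) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


t_stat = pooled_t_statistic(older, younger)  # roughly 2.24

# One-sided critical value from a t-table: alpha = 0.05, df = 3 + 3 - 2 = 4.
T_CRITICAL = 2.132

# "greater_than": reject the null only if the older group is taller.
older_is_taller = t_stat > T_CRITICAL
```

With only three observations per group the statistic clears the critical value by a small margin, which is a reminder that a hypothesis check on a sample this small is sensitive to individual rows.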

Whether you use this tool in Jupyter notebooks, one-off scripts, ETL pipeline code, or unit tests, pandera enables you to make pandas code more readable and robust by enforcing the deterministic and statistical properties of pandas data structures at runtime.

Hopefully this post has given you a flavor of what pandera can do. It also offers a few more features that you may find useful.

What’s Next?

I’m actively developing this project and have some exciting features coming up soon, such as built-in checks, first-class Dask support, and YAML schema specification. If you’d like to contribute to this project, you’re welcome to head on over to the GitHub repo!