I always lookout for libraries that will make my life easy and one of the key tasks that I spend lot of time is in data validation. I am sure lot of enterprises, business and jobs are solely created for Data validation and Data cleaning. With the advent of Big Data this aspect takes even more predominance as more and more data is ingested and in real time and the expected result is near perfect. And today when my friend mentioned about Cutplace. Now the documentation given in the links section is far more elaborate than what I will cover here. I would rather cover some of the basics with a sample example and also try to check the performance against large Dataset.

Now I had some 100 csv files where the data was not clean and was a great candidate to test for cutplace data validation.

Installation

The simplest way to install the cutplace is using pip

pip install cutplace

Cutplace Interface Definition

As a first step you need to create a Cutplace Interface Definition. Basically it is a file where you define the validation rules and the format of the data that is to be validated by Cutplace.

More on it in the below link:

In this file that I defined is given below to define the format of the csv/Excel/ODS file and the field details to be parsed:

The CID file can be saved as a csv, Fixed length or Excel/ODS format.

While cutplace parses this file it will ignore all the fields that do not start with a D or F descriptor and are considered comments.

The D descriptor describes the file format and which row to check for the header. This is very helpful as many a times in a csv or excel the actual data will not start from first row. There are many other fields which we can define in D descriptor for which I suggest you to refer the documentation.

The F descriptor describes the fixed data fields that are present, here you can see that around 9 fields are defined. By default it will check if these may fields are present or not.

Execution

A sample file on which it was run is given below:

You can see the file has 9 fields and the header names match with the filed names defined in CID file.

$cutplace cid_ifc.csv IFCB2009_19_modified.csv

INFO:cutplace:read CID from “cid_ifc.csv”

INFO:cutplace:validate “IFCB2009_19_modified.csv”

INFO:cutplace: accepted 1 rows

To test a sample file you can directly run cutplace from command line as given above.

Now since I wanted to run it on 100 files I used the cutplace API. The code is given below:

The above program takes two parameters where first is the CID file name and the 2nd takes the wild card character sequence to extract the list of all csv files that we want to process. To understand further please see the command below:

$python validate_files.py cid_ifc.csv IFC*modified.csv > op.log

This generated the following sample log:

Here you can see that in some files the contact field is empty, it goes on to show that its an optional field. Based on this I modified the CID file which is given below:

Now is is a more elaborate CID file where you have more options to be defined with the fields. Just to quickly explain:

Example column is to document sample of the filed and its not processed by cutplace

Empty: If the field is optional you can mention X in this column

length: You can mention either fixed length or a range e.g 11 means the field will have length of 11 and 0..9 means field length is from 0 to 9.

type & rule columns define the type of data and any rules required to validate the data. For that I suggest you to refer the cutplace CID documentation link.

$python validate_files.py cid_ifc1.csv IFC*modified.csv > op.log

After running the above command with the new CID file I got the following output log which gives me more fine grained output.

By this you can find the fine grained output which was missed in the earlier CID definition file. Hence CID definition is very critical while while validating using cutplace.

Conclusion

From my early assessment I can say that this is a very useful tool and soon I am planning to use it more for CSV and Excel file validation. Please refer to the below links to get a deeper idea of this tool and how to use it.

Contact: bobquest33@gmail.com

Twitter: https://twitter.com/twitmyreview

Linkedin: https://www.linkedin.com/in/priyabrata-dash-21616b15/

Python Library Link

Documentation

Github