As companies become more data-driven there is often a proliferation of data from both internal sources as well as third parties being consumed. Rarely have I seen firms try and centralise where datasets are stored. Instead, data is often copied onto infrastructure for individual teams and departments. This allows teams to not disrupt others with their work as well as avoid disruption from other teams.

Data sources are often refreshed in batches ranging from every few minutes to monthly updates. The file formats, compression schemes and encryption systems used to proliferate these datasets can vary greatly.

There is no one single tool I use for collection and analysis of new datasets. I do my best to pick tools that help me avoid writing a lot of bespoke code while taking advantage of the hardware available on any one system I may be using.

In this guide I'll walk through a exercise in consuming, transforming and analysing a data dump of the English language version of Wikipedia.

Installing Prerequisites The following commands were run on a fresh install of Ubuntu 16.04.2 LTS on a system with 16 GB of RAM. I'll first add the repository information needed to install ClickHouse, an open source analytics database, from Debian packages. $ sudo apt-key adv \ --keyserver hkp://keyserver.ubuntu.com:80 \ --recv E0C56BD4 $ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | \ sudo tee /etc/apt/sources.list.d/clickhouse.list $ sudo apt update I'll then install ClickHouse, Python, a JSON manipulation tool and a few decompression utilities (some of which can take advantage of multiple CPU cores). $ sudo apt install \ clickhouse-client \ clickhouse-server-common \ jq \ lbzip2 \ pigz \ python-pip \ python-virtualenv \ software-properties-common \ unzip I'll be using a number of Java-based utilities in this exercise so I'll install Oracle's Java 8 distribution. $ sudo add-apt-repository ppa:webupd8team/java $ sudo apt update $ sudo apt install oracle-java8-installer I'll create a Python virtual environment and install a couple data transformation and analysis utilities. $ virtualenv ~/.feed $ source ~/.feed/bin/activate $ pip install \ csvkit \ pandas \ xmltodict Finally, I'll download and unpack a few libraries for the Hadoop-centric file formats I'll be working with. $ wget -c https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.2/orc-tools-1.5.2-uber.jar $ wget -c https://www-eu.apache.org/dist/avro/avro-1.8.2/java/avro-tools-1.8.2.jar $ wget -c https://github.com/jairamc/csv-avro-converter/releases/download/v0.1-alpha/csv-avro-converter-cli-1.0.zip $ unzip csv-avro-converter-cli-1.0.zip

Downloading Wikipedia Wikipedia dumps page contents and metadata for their English-language website every couple of weeks. As of this writing the dump is made up of 55 bzip2-compressed XML files. They're around ~300 MB when compressed and ~1.5 GB uncompressed each. The following will download the 55 compressed XML files creating a little over 14 GB of compressed content. This command will pull the URLs from a manifest file and transform them so that they are complete URLs. The URLs are piped through xargs so that two files will be downloaded in parallel at any one time. $ wget -qO- https://dumps.wikimedia.org/enwiki/20180920/dumpstatus.json \ | jq '.jobs.articlesdump.files[].url' \ | sed 's/\"//g' \ | sed 's/^/https:\/\/dumps.wikimedia.org/' \ | xargs -n 1 \ -P 2 \ wget -c

Converting XML into JSON XML files can be challenging to work with. Some libraries expect to work with files as a whole and can quickly exhaust system memory with simple operations. Other libraries can be unintuitive to work with and require a great deal of cognitive effort for simple tasks. For these reasons I try and convert XML files into JSON as soon as I receive them. I'll be streaming out records from each of the bzip2-compressed XML files and dumping each record onto individual lines using JSON serialisation. I will use a Python library to first convert the XML into a Python dictionary. Then the Python script below will convert the Python dictionary into a JSON string. $ vi dump_pages.py import json import marshal import sys while True : try : _ , page = marshal . load ( sys . stdin ) print json . dumps ( page ) except EOFError : break Below is a bash script that will find each of the bzip2-compressed XML files. Each file will be decompressed using a multi-core-optimised bzip2 utility. The decompressed XML will be converted into Python dictionaries with the xmltodict library. Then the script above will convert each record into a single-line JSON string. Finally, a multi-core-optimised gzip drop-in replacement will compress them using gzip compression. $ for FILE in *.bz2 ; do echo $FILE lbunzip2 --stdout $FILE \ | python xmltodict.py 2 \ | python dump_pages.py \ | pigz > $FILE .json.gz done The transformed content in the bzip2 files grew by 33% once they turned into line-delimited, gzip-compressed JSON files.

Converting JSON into CSV Now that I have the data in compressed JSON format I want to be able to flatten out the nested elements into their own columns. For reference, this is a truncated and pretty-printed view of what one JSON record currently looks like: { "id" : "41067206" , "ns" : "0" , "revision" : { "comment" : "###" , "contributor" : { "id" : "244263" , "username" : "Czar" }, "format" : "text/x-wiki" , "id" : "817456361" , "model" : "wikitext" , "parentid" : "788230430" , "sha1" : "2egdyphfnavxhsjrxvhjpfvb6ndh0wi" , "text" : { "#text" : "###" , "@xml:space" : "preserve" }, "timestamp" : "2017-12-28T14:11:08Z" }, "title" : "###" } I'm looking to transform the above into the following columns. Note I'm using an underscore to denote a nested child node. id ns redirect_@title revision_comment revision_contributor_id revision_contributor_username revision_format revision_id revision_model revision_parentid revision_sha1 revision_text_#text revision_text_@xml:space revision_timestamp title revision_contributor_ip revision_minor restrictions The following Python script will iterate through each of the compressed JSON files, flatten out their nested data structures and save the data as a gzip-compressed CSV file. The only dataset-specific check I have in this code is making sure there is a title element in each record. This distinguishes the page records from other types held in the dump. $ python from glob import glob import gzip import json import pandas as pd from pandas.io.json.normalize import nested_to_record for filename in glob ( '*.json.gz' ): df = pd . DataFrame () for line in gzip . open ( filename , 'rb' ): page = nested_to_record ( json . loads ( line ), sep = '_' ) if 'title' in page . keys (): df = df . append ( page , ignore_index = True ) df . to_csv ( ' %s .csv.gz' % filename . split ( '.' )[ 0 ], index = False , encoding = 'utf-8' , compression = 'gzip' ) The gzip-compressed CSV files are now a little over double the size of the bzip2-compressed XML files. Note that the columns in each CSV file may not align with columns in others CSV files if the datasets vary in structure. You can examine the consistency of the headers across the gzip-compressed CSV files with the following: $ for FILE in *.csv.gz ; do echo $FILE pigz -d -c $FILE | head -n1 done

Cherry-Picking Columns of Interest I want to cut down on the amount of data I'll be working with in this exercise. To start, I'll examine the first thousand lines of one of the gzip-compressed CSV files and see what field names and data types it contains. Note I've set the maximum field size csvstat will allow for to eight million bytes so that none of the lengthy Markdown contents will cause an issue. I've also had to declare the delimiter as the last row is being cut off mid-record on the 1000th line (records can span multiple lines) and this is throwing csvstat's delimiter auto-detection off. $ pigz -d -c enwiki-20180920-pages-articles1.csv.gz \ | head -n1000 \ | csvstat --maxfieldsize 8000000 \ --delimiter ',' \ --type 1. id: Number 2. ns: Boolean 3. redirect_@title: Text 4. revision_comment: Text 5. revision_contributor_id: Number 6. revision_contributor_username: Text 7. revision_format: Text 8. revision_id: Number 9. revision_model: Text 10. revision_parentid: Number 11. revision_sha1: Text 12. revision_text_#text: Text 13. revision_text_@xml:space: Text 14. revision_timestamp: DateTime 15. title: Text 16. revision_contributor_ip: Text 17. revision_minor: Boolean 18. restrictions: Boolean I'll create a new gzip-compressed CSV file made up of five columns of interest. $ pigz -d -c enwiki-20180920-pages-articles1.csv.gz \ | csvcut --maxfieldsize 8000000 \ --columns 1,5,6,14,15 \ | pigz > metadata.csv.gz I'll now examine the statistics of each column in the newly-created gzip-compressed CSV file. Note, I can examine the compressed file, there's no need to decompress it ahead of time. $ csvstat metadata.csv.gz 1. "id" Type of data: Number Contains null values: False Unique values: 19833 Smallest value: 10 Largest value: 30,302 Sum: 305,772,174 Mean: 15,417.344 Median: 15,524 StDev: 8,567.878 Most common values: 10 (1x) 12 (1x) 13 (1x) 14 (1x) 15 (1x) 2. "revision_contributor_id" Type of data: Number Contains null values: True (excluded from calculations) Unique values: 5197 Smallest value: 0 Largest value: 34,699,348 Sum: 204,280,488,782 Mean: 11,801,299.179 Median: 9,612,106 StDev: 11,423,342.666 Most common values: None (2523x) 9,784,415 (1248x) 27,015,025 (462x) 212,624 (412x) 194,203 (390x) 3. "revision_contributor_username" Type of data: Text Contains null values: True (excluded from calculations) Unique values: 5198 Longest value: 51 characters Most common values: None (2523x) Tom.Reding (1248x) InternetArchiveBot (462x) MZMcBride (412x) Graham87 (390x) 4. "revision_timestamp" Type of data: DateTime Contains null values: False Unique values: 19788 Smallest value: 2002-02-25 15:43:11+00:00 Largest value: 2018-09-20 11:47:36+00:00 Most common values: 2002-02-25 15:51:15+00:00 (11x) 2002-02-25 15:43:11+00:00 (8x) 2018-09-16 18:47:23+00:00 (2x) 2018-09-20 06:53:14+00:00 (2x) 2018-09-13 20:16:20+00:00 (2x) 5. "title" Type of data: Text Contains null values: True (excluded from calculations) Unique values: 19833 Longest value: 97 characters Most common values: AccessibleComputing (1x) Anarchism (1x) AfghanistanHistory (1x) AfghanistanGeography (1x) AfghanistanPeople (1x) The above is incredibly handy for building data fluency. I have enough information to create a fairly granular schema in a structured data store. Note, the values printed out have been transformed to be more consistent and readable. Timestamps share the same format and numbers are printed using commas. The following will print the first ten records of the new CSV file. Note the data has been formatted for easier reading here as well. $ pigz -d -c metadata.csv.gz \ | head -n10 \ | csvlook | id | revision_contributor_id | revision_contributor_username | revision_timestamp | title | | -- | ----------------------- | ----------------------------- | ------------------------- | ------------------------------ | | 10 | 23,257,138 | Godsy | 2018-08-14 06:47:24+00:00 | AccessibleComputing | | 12 | | | 2018-09-19 12:07:26+00:00 | Anarchism | | 13 | 9,784,415 | Tom.Reding | 2017-06-05 04:18:18+00:00 | AfghanistanHistory | | 14 | 9,784,415 | Tom.Reding | 2017-06-05 04:18:23+00:00 | AfghanistanGeography | | 15 | 9,784,415 | Tom.Reding | 2017-06-05 04:19:42+00:00 | AfghanistanPeople | | 18 | 9,784,415 | Tom.Reding | 2017-06-05 04:19:45+00:00 | AfghanistanCommunications | | 19 | 9,784,415 | Tom.Reding | 2017-06-04 21:42:11+00:00 | AfghanistanTransportations | | 20 | 9,784,415 | Tom.Reding | 2017-06-04 21:43:11+00:00 | AfghanistanMilitary | | 21 | 9,784,415 | Tom.Reding | 2017-06-04 21:43:14+00:00 | AfghanistanTransnationalIssues | If you want to see the data in tabular form but without the transformations run the command again with the no inference flag. $ pigz -d -c metadata.csv.gz \ | head -n10 \ | csvlook --no-inference | id | revision_contributor_id | revision_contributor_username | revision_timestamp | title | | -- | ----------------------- | ----------------------------- | -------------------- | ------------------------------ | | 10 | 23257138 | Godsy | 2018-08-14T06:47:24Z | AccessibleComputing | | 12 | | | 2018-09-19T12:07:26Z | Anarchism | | 13 | 9784415 | Tom.Reding | 2017-06-05T04:18:18Z | AfghanistanHistory | | 14 | 9784415 | Tom.Reding | 2017-06-05T04:18:23Z | AfghanistanGeography | | 15 | 9784415 | Tom.Reding | 2017-06-05T04:19:42Z | AfghanistanPeople | | 18 | 9784415 | Tom.Reding | 2017-06-05T04:19:45Z | AfghanistanCommunications | | 19 | 9784415 | Tom.Reding | 2017-06-04T21:42:11Z | AfghanistanTransportations | | 20 | 9784415 | Tom.Reding | 2017-06-04T21:43:11Z | AfghanistanMilitary | | 21 | 9784415 | Tom.Reding | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues | The csvlook utility does support a max-rows parameter as well but it takes a lot longer to pull the first ten records from a compressed CSV file than pigz and head do.

Converting CSV to AVRO AVRO files store data in a self-describing form by maintaining a schema of each individual record along with its data. Data migrations are supported as well with schema versioning for both reading and writing records. AVRO files themselves can be read and worked with using a large variety of tools in the Hadoop ecosystem as well as with various Python and Java libraries. These features make AVRO an excellent transport format for micro-batches of data. Below I'll use a simple CSV to AVRO conversion tool. It doesn't support reading data from gzip-compressed files so I'll decompress the CSV file first. $ pigz -d --keep metadata.csv.gz $ ~/csv-avro-converter-cli-1.0/bin/csv-avro-converter-cli \ --in metadata.csv \ --out metadata.avro The file sizes of the uncompressed CSV and AVRO are about the same. The tool will perform some basic analysis on the contents of the CSV data and put together a schema for the dataset. Note, the timestamp column ended up being stored as a string rather than as any sort of smaller timestamp column type. Below is a dump of the schema the conversion tool put together. $ java -jar ~/avro-tools-1.8.2.jar \ getschema \ metadata.avro { "type" : "record" , "name" : "csv" , "namespace" : "metadata" , "fields" : [ { "name" : "id" , "type" : "int" }, { "name" : "revision_contributor_id" , "type" : [ "null" , "int" ], "default" : null }, { "name" : "revision_contributor_username" , "type" : [ "null" , "string" ], "default" : null }, { "name" : "revision_timestamp" , "type" : "string" }, { "name" : "title" , "type" : "string" } ] } Below I'll extract a record from the AVRO file. Note that the data hasn't been truncated or degraded in any way. $ java -jar ~/avro-tools-1.8.2.jar \ tojson \ metadata.avro \ | head -n1 \ | python -mjson.tool { "id" : 10 , "revision_contributor_id" : { "int" : 23257138 }, "revision_contributor_username" : { "string" : "Godsy" }, "revision_timestamp" : "2018-08-14T06:47:24Z" , "title" : "AccessibleComputing" }

Converting CSV to ORC ORC files store data in a columnar and compressed form. The project itself began in early 2013 and has been lead by Hortonwork's Founder and Technical Fellow Owen O'Malley. The format is very popular in the Hadoop ecosystem and in 2014, Facebook announced they had successfully converted 10s of petabytes of data in their then-300 PB data warehouse into ORC format. Statistics are kept for each column throughout an ORC file and for each group of 10,000 rows within each column. This allows queries to skip the parts of the file that aren't relevant and does a good job of speeding up queries by narrowing down the search space. When the ORC project started it was closely tied to Hive but since 2016 releases no longer depend on it. Different Hadoop projects often don't share the same version of their underlying ORC library but this year it was announced that the next release of Spark, version 2.4, will use the same version of the ORC library that Hive 3.0 uses. Below I'll convert the compressed CSV file into an ORC file. The tool I'm using doesn't support feeding in compressed data nor can it skip the header row so I'll decompress all but the first line of data into a new CSV file. $ pigz -d -c metadata.csv.gz \ | tail -n +2 \ > metadata.noheader.csv ORC files have schemas with field names and data types. Below I'll define the schema for the five fields within the CSV file. I've kept the field names a, b, c, d and e in order to keep the command short. Longer field names are supported if you don't wish to be so ambiguous. $ java -jar ~/orc-tools-1.5.2-uber.jar \ convert \ metadata.noheader.csv \ --schema "struct<a:int,b:int,c:string,d:timestamp,e:string>" \ --timestampformat "yyyy-MM-dd'T'HH:mm:ss'Z'" \ --output metadata.orc The resulting ORC file is around 25% of the size of the uncompressed CSV data. I'll print out the properties of the resulting ORC file. Below you can see the file is compressed using zlib, has a single stripe and statistics on the minimum, maximum and sum values for each column are given. The third column that contains the revision contributor usernames is dictionary encoded. $ java -jar ~/orc-tools-1.5.2-uber.jar \ meta \ metadata.orc Processing data file metadata.orc [length: 341494] Structure for metadata.orc File Version: 0.12 with ORC_135 Rows: 19833 Compression: ZLIB Compression size: 262144 Type: struct<a:int,b:int,c:string,d:timestamp,e:string> Stripe Statistics: Stripe 1: Column 0: count: 19833 hasNull: false Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174 Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782 Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207 Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0 Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852 File Statistics: Column 0: count: 19833 hasNull: false Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174 Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782 Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207 Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0 Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852 Stripes: Stripe: offset: 3 data: 340464 rows: 19833 tail: 144 index: 412 Stream: column 0 section ROW_INDEX start: 3 length 21 Stream: column 1 section ROW_INDEX start: 24 length 62 Stream: column 2 section ROW_INDEX start: 86 length 62 Stream: column 3 section ROW_INDEX start: 148 length 115 Stream: column 4 section ROW_INDEX start: 263 length 62 Stream: column 5 section ROW_INDEX start: 325 length 90 Stream: column 1 section DATA start: 415 length 5511 Stream: column 2 section PRESENT start: 5926 length 1686 Stream: column 2 section DATA start: 7612 length 43310 Stream: column 3 section PRESENT start: 50922 length 1686 Stream: column 3 section DATA start: 52608 length 27735 Stream: column 3 section LENGTH start: 80343 length 3229 Stream: column 3 section DICTIONARY_DATA start: 83572 length 32190 Stream: column 4 section DATA start: 115762 length 70288 Stream: column 4 section SECONDARY start: 186050 length 15 Stream: column 5 section DATA start: 186065 length 140495 Stream: column 5 section LENGTH start: 326560 length 14319 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 Encoding column 2: DIRECT_V2 Encoding column 3: DICTIONARY_V2[5197] Encoding column 4: DIRECT_V2 Encoding column 5: DIRECT_V2 File length: 341494 bytes Padding length: 0 bytes Padding ratio: 0% I'll dump out the first ten rows of data from the ORC file and as you can see, the data looks to be in place. $ java -jar ~/orc-tools-1.5.2-uber.jar \ data \ metadata.orc \ | head { "a" : 10 , "b" : 23257138 , "c" : "Godsy" , "d" : "2018-08-14 06:47:24.0" , "e" : "AccessibleComputing" } { "a" : 12 , "b" : null , "c" : null , "d" : "2018-09-19 12:07:26.0" , "e" : "Anarchism" } { "a" : 13 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-05 04:18:18.0" , "e" : "AfghanistanHistory" } { "a" : 14 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-05 04:18:23.0" , "e" : "AfghanistanGeography" } { "a" : 15 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-05 04:19:42.0" , "e" : "AfghanistanPeople" } { "a" : 18 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-05 04:19:45.0" , "e" : "AfghanistanCommunications" } { "a" : 19 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-04 21:42:11.0" , "e" : "AfghanistanTransportations" } { "a" : 20 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-04 21:43:11.0" , "e" : "AfghanistanMilitary" } { "a" : 21 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-04 21:43:14.0" , "e" : "AfghanistanTransnationalIssues" } { "a" : 23 , "b" : 9784415 , "c" : "Tom.Reding" , "d" : "2017-06-05 04:19:50.0" , "e" : "AssistiveTechnology" } For comparison, this is the data for the first ten rows in the source CSV file. $ csvlook --no-header-row \ --no-inference \ --max-rows 10 \ metadata.noheader.csv | a | b | c | d | e | | -- | -------- | ---------- | -------------------- | ------------------------------ | | 10 | 23257138 | Godsy | 2018-08-14T06:47:24Z | AccessibleComputing | | 12 | | | 2018-09-19T12:07:26Z | Anarchism | | 13 | 9784415 | Tom.Reding | 2017-06-05T04:18:18Z | AfghanistanHistory | | 14 | 9784415 | Tom.Reding | 2017-06-05T04:18:23Z | AfghanistanGeography | | 15 | 9784415 | Tom.Reding | 2017-06-05T04:19:42Z | AfghanistanPeople | | 18 | 9784415 | Tom.Reding | 2017-06-05T04:19:45Z | AfghanistanCommunications | | 19 | 9784415 | Tom.Reding | 2017-06-04T21:42:11Z | AfghanistanTransportations | | 20 | 9784415 | Tom.Reding | 2017-06-04T21:43:11Z | AfghanistanMilitary | | 21 | 9784415 | Tom.Reding | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues | | 23 | 9784415 | Tom.Reding | 2017-06-05T04:19:50Z | AssistiveTechnology |