Parquet is a columnar storage format popularized by the Hadoop ecosystem. Through its strong data compression and columnar layout, Parquet addresses both the processing and storage challenges of big data. In this article, we explore what Parquet is, how it works, and why using it as a data format makes sense.

What is Parquet?

Parquet is an open source file format for Hadoop. It both compresses and encodes data, which makes it an efficient storage format for HDFS. You can use Parquet with Hive, Impala, Spark, Pig, and other tools, and easily convert Parquet to other data formats.

How Parquet Works

Columnar Format

Parquet stores binary data in a columnar format. Unlike a traditional row-based format, values from the same column are stored together, in column chunks inside each row group. This may seem subtle, but grouping values by column provides several benefits, including better compression and, for queries that touch only some columns, faster lookups.
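To make the difference between the two layouts concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how Parquet is implemented: the sample records and the helper name to_columns are made up for this example, and real Parquet files are binary.

```python
# Toy illustration only: real Parquet files are binary and far more elaborate.
# The sample records below are hypothetical.
rows = [
    {"id": 1, "country": "US", "amount": 9.5},
    {"id": 2, "country": "US", "amount": 3.0},
    {"id": 3, "country": "DE", "amount": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: the fields of each record sit next to each other.
row_layout = [value for record in rows for value in record.values()]
# Columnar layout: all values of one column sit next to each other.
column_layout = [value for values in columns.values() for value in values]

print(row_layout)     # [1, 'US', 9.5, 2, 'US', 3.0, 3, 'DE', 7.25]
print(column_layout)  # [1, 2, 3, 'US', 'US', 'DE', 9.5, 3.0, 7.25]
```

In the columnar layout, values of the same type and similar content end up adjacent, which is what later makes encoding and compression more effective.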

The Anatomy of a Parquet File

A row group is a logical partition of the data into a set of rows. Each row group consists of column chunks, one per column, holding that column's values for the rows in the group. Column chunks are further divided into data pages, each of which carries a page header.

Each Parquet file has a footer that holds the file's metadata: schema details, object model metadata, and other information about the associated row groups and columns.
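The hierarchy described above (row groups, column chunks, pages, footer) can be sketched as a small in-memory model. This is a simplified sketch under stated assumptions: the class names, the header and footer fields, and the build_file helper are all hypothetical and do not reproduce Parquet's actual on-disk metadata.

```python
from dataclasses import dataclass


@dataclass
class Page:
    header: dict  # hypothetical header contents, e.g. the number of values
    values: list


@dataclass
class ColumnChunk:
    column: str
    pages: list


@dataclass
class RowGroup:
    num_rows: int
    chunks: list


def build_file(columns, rows_per_group, values_per_page):
    """Split column data into row groups -> column chunks -> pages, plus a footer."""
    num_rows = len(next(iter(columns.values())))
    row_groups = []
    for start in range(0, num_rows, rows_per_group):
        stop = min(start + rows_per_group, num_rows)
        chunks = []
        for name, values in columns.items():
            chunk_values = values[start:stop]
            pages = []
            for i in range(0, len(chunk_values), values_per_page):
                page_values = chunk_values[i:i + values_per_page]
                pages.append(Page(header={"num_values": len(page_values)},
                                  values=page_values))
            chunks.append(ColumnChunk(column=name, pages=pages))
        row_groups.append(RowGroup(num_rows=stop - start, chunks=chunks))
    # The footer summarizes the schema and the row groups written before it.
    footer = {
        "schema": list(columns),
        "num_rows": num_rows,
        "row_group_sizes": [group.num_rows for group in row_groups],
    }
    return row_groups, footer


# Hypothetical sample data: 10 rows, 2 columns.
sample = {"id": list(range(10)), "country": ["US"] * 10}
row_groups, footer = build_file(sample, rows_per_group=4, values_per_page=2)

print(footer["row_group_sizes"])               # [4, 4, 2]
print(row_groups[0].chunks[0].pages[0].values)  # [0, 1]
```

A reader can consult the footer first to learn the schema and where each row group lives, then fetch only the column chunks it needs.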

Flexibility

Parquet does not favor one storage format over another. While other serialization frameworks like Avro, Thrift, and Protocol Buffers have their own storage formats, Parquet converts these formats by using its own internal data model. This allows Parquet to play well with applications like Hive, Pig, Avro, etc., since it can use its own converters to map between storage formats.

Why Use Parquet?

Parquet is useful because it addresses two of the biggest problems in big data: processing costs and storage costs. Because values in a column tend to be similar, Parquet's columnar format compresses better than a row-based layout. This reduces both the storage footprint and the network bandwidth required to read large data sets.
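One way to see why a columnar layout helps is run-length encoding, a simple encoding that collapses repeated adjacent values. The sketch below is a toy illustration with hypothetical sample data and a hand-written run_length_encode helper; it is not Parquet's actual encoder, only a demonstration of the effect.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sample data: (country, status) pairs.
rows = [("US", "active")] * 4 + [("DE", "active")] * 2

# Row layout interleaves the two columns, so equal values are never adjacent here.
row_layout = [value for row in rows for value in row]
# Columnar layout keeps each column contiguous, so runs become long.
countries = [row[0] for row in rows]
statuses = [row[1] for row in rows]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(countries) + run_length_encode(statuses)

print(len(row_runs))  # 12 (nothing was collapsed)
print(column_runs)    # [('US', 4), ('DE', 2), ('active', 6)]
```

The same twelve values shrink to three runs once they are grouped by column, while the interleaved row layout gains nothing from the encoding.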

Parquet shines with OLAP operations, especially on wide tables where only a few columns are returned. Writes are more computationally expensive than with a row-based format, but reads benefit considerably.
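The benefit for wide tables can be approximated by counting how many values a reader has to touch to aggregate a single column. This is a simplified model with a hypothetical table and helper functions; it ignores I/O details such as block sizes and caching.

```python
NUM_ROWS = 1_000
NUM_COLUMNS = 50

# Hypothetical wide table: columns "c0" ... "c49", each holding NUM_ROWS integers.
column_store = {
    f"c{c}": [r * c for r in range(NUM_ROWS)] for c in range(NUM_COLUMNS)
}
row_store = [
    [column_store[f"c{c}"][r] for c in range(NUM_COLUMNS)] for r in range(NUM_ROWS)
]


def sum_from_rows(rows, column_index):
    """Model a row store: every row is read in full to extract one field."""
    total = 0
    values_touched = 0
    for row in rows:
        values_touched += len(row)
        total += row[column_index]
    return total, values_touched


def sum_from_columns(columns, name):
    """Model a column store: only the requested column is read."""
    values = columns[name]
    return sum(values), len(values)


print(sum_from_rows(row_store, 3))           # (1498500, 50000)
print(sum_from_columns(column_store, "c3"))  # (1498500, 1000)
```

Both approaches return the same total, but in this model the columnar reader touches one fiftieth of the values, which is the pattern typical OLAP queries follow.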

Conclusion

Parquet is one of the most popular storage formats within the Hadoop ecosystem. By using a columnar format to encode and store data, Parquet saves on both storage and processing costs. Use Parquet when you anticipate heavy OLAP workloads, knowing that it integrates well across the Hadoop ecosystem (Hive, Avro, Impala, Pig, etc.).