Short introduction

This is part three of the Apache Spark gotchas series. Check out the first and the second parts. Today, we'll talk about how Hive fails to read big decimals of small precision stored in the parquet file format.

Problem

Suppose we have the following dataset with some data in it.

scala> foos.printSchema()
root
 |-- number: decimal(10,2) (nullable = true)
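For completeness, here is one way such a dataset could be constructed in the spark-shell; the values are made up for illustration:

import spark.implicits._
import org.apache.spark.sql.functions.col

// Build a one-column dataset and cast it to the decimal(10,2) type from above.
val foos = Seq("123.45", "678.90")
  .toDF("number")
  .select(col("number").cast("decimal(10,2)").as("number"))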

After repartitioning the dataset, saving it to HDFS, and optionally creating a new partition in the Hive table (named Foo in our case), we should have our data stored safe and sound, ready to be read by any external system that supports the parquet file format, right?
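The write step might look something like this; the partition count and the HDFS path are placeholders:

// Repartition and save the dataset as parquet on HDFS.
foos.repartition(4)
  .write
  .mode("overwrite")
  .parquet("hdfs://path/to/foo")

Let's launch Hive and try to read what we've just saved.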

hive (dev)> describe Foo;
OK
number                  decimal(10,2)
hive (dev)> select * from Foo;
OK
Failed with exception java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://path/to/parquet/file/file-c000.snappy.parquet

Oh no, what happened? The schema of our Hive table appears to match the schema of the dataset we saved. It turns out that Hive cannot read "small" big decimals; check out this JIRA issue. Essentially, when you save decimals with a precision less than or equal to 18 to parquet, they are stored using the int32 or int64 physical type (int32 when the precision fits in 9 digits, int64 otherwise). And Hive just cannot deal with this for some reason…
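If you have parquet-tools at hand, you can verify which physical type was picked. Below is a sketch with a placeholder path and approximate output; our decimal(10,2) column needs more than 9 digits, so it lands in int64:

$ parquet-tools schema hdfs://path/to/foo/file-c000.snappy.parquet
message spark_schema {
  optional int64 number (DECIMAL(10,2));
}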

On the other hand, if you try to read the data via Spark SQL, then it should work just fine.
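For example, both of these should return the rows without any decoding errors (the path is a placeholder, and the second call assumes the Foo table is registered in the metastore):

// Read the raw parquet files directly.
spark.read.parquet("hdfs://path/to/foo").show()

// Or read through the Hive metastore.
spark.table("Foo").show()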

The workaround

Either use another data warehousing solution, or give your big decimal a bigger precision so that parquet stores it in a different way. According to this document, a precision of 19 is enough for parquet to store the big decimal using fixed_len_byte_array, which Hive can read without issues.
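As a minimal sketch of the second option, assuming the foos dataset from before, it is enough to cast the column before writing (the path is a placeholder):

import org.apache.spark.sql.functions.col

// Bump the precision from 10 to 19 so parquet uses fixed_len_byte_array.
val fixed = foos.withColumn("number", col("number").cast("decimal(19,2)"))
fixed.write.mode("overwrite").parquet("hdfs://path/to/foo")

Remember to update the Hive table definition to decimal(19,2) as well, so that the table and file schemas stay in sync. Depending on your Spark version, setting spark.sql.parquet.writeLegacyFormat to true should also force decimals to be written as fixed_len_byte_array, if changing the precision is not an option.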

Summary

This concludes part three of the series. Stay tuned for the next part! If you've got an idea you'd like me to write about, just hit me up on Twitter. And if you want to get notified about other interesting posts on big data and distributed systems, follow me on Twitter or subscribe to the mailing list. See ya next time!