We frequently use Spark SQL and EMR to analyze terabytes of JSON request logs. The builtin JSON support in Spark is easy to use and works well for most use cases. For example, this small piece of code will infer the schema of the files and provide a table that can be queried with standard SQL:

import org.apache.spark.sql.SparkSession val session = SparkSession . getOrCreate () val jsonDF = session . read . json ( "s3a://somelogs/2016/12/*" ) jsonDF . createTempView ( "logs" )

However, JSON is not the most efficient format to query. This is especially evident when querying a large dataset, we can easily exhaust the heap of the executor nodes if we don't have a cluster that is sized just right.

Parquet

For datasets that need to be queried often, it would be ideal if they were stored in format like Parquet. Parquet provides several advantages over JSON, some of the notable ones are:

It's a binary format as opposed to JSON which is text. This allows for a more compact storage of the same set of records, which in turn means less memory overhead when parsing the records.

It's a column-oriented format. For querying, this means Spark will only read the relevant columns off of disk that are needed to execute the query, whereas JSON requires every record to be read and parsed regardless.

Parquet allows for the storage of column statistics per row group (by default these are blocks of 128mb). This data can be used by the Parquet reader in conjunction with a Spark SQL optimization called predicate push down. The Parquet reader has knowledge of which rows the query is filtering by and will skip row groups that are not applicable. We can somewhat achieve a similar optimization with JSON if we carefully lay out our directory structure and use clever hadoop globs as input paths, but it's a lot more work and does not handle more complex cases.

When writing a DataFrame as Parquet, Spark will store the frame's schema as metadata at the root of the directory. This lets Spark quickly infer the schema of a Parquet DataFrame by reading a small file; this is in contrast to JSON where we either need to specify the schema upfront or pay the cost of reading the whole dataset.

Writing a Parquet Index

Now that we've established that Parquet is great for querying, let's see how hard it would be to store a copy of our logs as Parquet. We'll choose to order our logs by time since most of our queries are bounded by a time range and that will allow us to make good use of the predicate pushdown optimization. We're going to treat this index as an append only archive of logs, so we'll have to keep this index slightly behind the most recent logs available to ensure that we don't include a partial day's worth of logs.

First we spin up a cluster big enough to be able to parse the entire dataset and then run something like this:

import java.time.LocalDate import java.time.format.DateTimeFormatter.ISO_LOCAL_DATE import org.apache.spark.sql.SaveMode import org.apache.spark.sql.functions.substring // We intentionally keep the index behind current to avoid issues around // lagging source logs, incorrect system clocks, etc. // We index everything that is before the start of yesterday val maxIndexableDate = LocalDate . now (). minusDays ( 1 ). format ( ISO_LOCAL_DATE ) // Our request logs contain a timestamp column that is ISO8601 formatted. // For example: 2016-12-22T15:33:07Z // Here we are extracting the yyyy-mm-dd component of the timestamp and // storing that as new column called "date". There is a date_format // function builtin to Spark but it is significantly slower since it // allocates java.util.Date objects for every row so we're doing a // substring instead val logsWithDate = jsonDF . withColumn ( "date" , substring ( jsonDF ( "timestamp" ), 0 , 10 )) val idexableLogs = logsWithDate . filter ( logsWithDate ( "date" ) < maxIndexableDate ) val numberOfDays = indexableLogs . select ( indexableLogs ( "date" )) . distinct () . count () . toInt // This will end up being one file per day on s3. val logsPartitionedByDay = indexableLogs . repartition ( numberOfDays , indexableLogs ( "date" )) // partitionBy controls the directory structure. Each date gets a // directory. logsPartitionedByDay . write . partitionBy ( "date" ) . mode ( SaveMode . Overwrite ) . parquet ( "s3a://somelogs/by-day/" )

If everything went well, we should see a structure on S3 that looks like this:

.... { lots of folders } ... ./date = 2016-12-25/part-r-00000-f74bba9c-4bf6-83db-c0720a48aca7.snappy.parquet ./date = 2016-12-26/part-r-00001-f74bba9c-4bf6-83db-c0720a48aca7.snappy.parquet ./date = 2016-12-27/part-r-00002-f74bba9c-4bf6-83db-c0720a48aca7.snappy.parquet ./_SUCCESS ./_common_metadata

We have one compressed Parquet file per day and a _common_metadata file where Spark stores the schema. So now we should be able to run queries that filter by time much faster than using the JSON logs directly. For example:

val parquetDF = session . read . parquet ( "s3a://somelogs/by-day/" ) parquetDF . createTempView ( "parquetLogs" ) session . sql ( "SELECT COUNT(DISTINCT(ip)) FROM parquetLogs WHERE date > 2016-12-20" )

This will work initially but we'll soon end up with a stale index. We could easily rebuild the index from scratch occasionally, but since that would require launching a cluster large enough to process the entire JSON dataset the index wouldn't be buying us much. We need a way of appending new data to the index in small batches. We can append to a Parquet DataFrame by using SaveMode.Append when we write a DataFrame, but the tricky part is ensuring we don't duplicate records that are already in the index. We can keep things simple by appending logs partitioned by day, and using the start of the day as a boundary point.

import java.time.temporal.ChronoUnit.DAYS import org.apache.spark.sql.functions.max val parquetIndex = session . read . parquet ( "s3a://somelogs/by-day/" ) // We update the index in day paritioned chunks. So the next day that would // need to be added to the index would be max day that is indexed + 1 val nextDateToIndex = LocalDate . parse ( parquetIndex . select ( max ( parquetIndex ( "date" ))). head (). getString ( 0 ) ). plusDays ( 1 ) // Just as with creating an index, we want to keep the index a day behind // current. We cap our update to yesterday. val maxIndexableDate = LocalDate . now (). minusDays ( 1 ) // Check if the index needs to be updated. if ( nextDateToIndex . isBefore ( maxIndexableDate ) { val numberOfDaysInUpdate = DAYS . between ( nextDateToIndex , maxIndexableDate ). toInt // In practice, the JSON DataFrame should use a date glob path that only // parses the minimum number of records needed for this update. Just // keeping it simple here and assuming we have that in scope. val logsWithDate = jsonDF . withColumn ( "date" , substring ( jsonDF ( "timestamp" ), 0 , 10 )) // Create a DataFrame with the set of logs that are after the highest day // in the index but a day before current. Just like we did when we created // the index, we're making one file per day. val partitionedUpdate = logsWithDate . filter ( logsWithDate ( "date" ) >= nextDateToIndex . format ( ISO_LOCAL_DATE )) . filter ( logsWithDate ( "date" ) < maxIndexableDate . format ( ISO_LOCAL_DATE )) . repartition ( numberOfDaysInUpdate , logsWithDate ( "date" )) // Append the new set of logs to the index partitionedUpdate . write . partitionBy ( "date" ) . mode ( SaveMode . Append ) . parquet ( "s3a://somelogs/by-day/" ) } else { println ( "No update required" ) }

Unified reads

Last but not least, we need a way to read from the index without forcing the user to worry about all these implementation details. The user shouldn't have to be concerned about which days are in the index and which are not; ideally they will be able to write queries as if they were accessing the raw logs directly. Luckily, Spark already handles most of that abstraction for us with DataFrames. All we have to do is union the raw logs DataFrame for recent logs with the indexed DataFrame for archived logs.

val maxIndexedDate = LocalDate . parse ( parquetIndex . select ( max ( parquetIndex ( "date" ))). head (). getString ( 0 )) val boundary = maxIndexedDay . plusDays ( 1 ). format ( ISO_LOCAL_DATE ) // Again, its crucial that the jsonDF uses a glob so it will only parse the // minimum amount of logs. In this case, only the files from maxIndexed date // until today. Omitting that work here. val jsonLogsWithDate = jsonDF . withColumn ( "date" , substring ( jsonDF ( "timestamp" ), 0 , 10 )) val filteredJsonLogs = jsonLogsWithDate . filter ( jsonLogsWithDate ( "date" ) >= boundary ) val filteredIndexedLogs = parquetIndex . filter ( parquetIndex ( "date" ) < boundary ) val mergedFrame = filteredIndexedLogs . union ( filteredJsonLogs )

Now mergedFrame can be queried against the full range of logs and whether the index is being used or not is completely transparent to the user. This is illustrated nicely in the Spark UI.

Spark will still have to parse all the JSON logs at query time since we don't have the same optimizations available as Parquet. However, if the index is kept current and a restrictive path glob is used for the JSON DataFrame, Spark shouldn't have to parse more than a day or two of logs. Of course, if we know our query will only be looking at historical data it would be optimal to query against the Parquet frame directly.

If you're interested in solving these sorts of problems, we'd love to hear from you. We're always hiring.