On our platform teams, we use Amazon’s Elastic MapReduce (EMR) service to help us gather useful metrics from log files. We have processes that capture log files, then compress and push them to Amazon S3 for archiving. This pattern builds up massive amounts of information going back several years and, thanks to EMR, it’s all available to us for data crunching. Initially, we used Python for a lot of the heavy lifting, but over time we came to rely on Go.

In the Beginning

When we first started using EMR, my team wrote the mapper and reducer scripts in Python. We chose Python because it takes minimal setup and code to create a single-file script capable of reading JSON or CSV records from stdin and writing similarly structured data to stdout. Also, since we had already decided to use the boto Python library for starting EMR job flows and managing their output, it made sense to use the same language throughout the project.
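For context, starting a streaming job flow with boto looked roughly like the sketch below. This is a minimal example against boto 2.x; the bucket names, script paths and job flow name are placeholders rather than our actual setup.

# launch_jobflow.py -- a minimal sketch of starting an EMR streaming
# job flow with boto 2.x; all S3 paths and names are placeholders
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # reads AWS credentials from the environment

step = StreamingStep(
    name="log metrics",
    mapper="s3n://example-bucket/scripts/mapper.py",
    reducer="s3n://example-bucket/scripts/reducer.py",
    input="s3n://example-bucket/logs/",
    output="s3n://example-bucket/output/",
)

jobflow_id = conn.run_jobflow(
    name="log metrics job flow",
    log_uri="s3n://example-bucket/jobflow-logs/",
    steps=[step],
)
print "started job flow %s" % jobflow_id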

Python worked well for simple processing. For a while we had decent performance, and since the scripts were very simple, we didn't need help from third-party modules. Eventually we encountered situations where we needed to import an internal module to reuse some business logic, or a third-party module to gain insight into our data, and that meant adding another layer of complexity to the EMR job flow: bootstrap actions.

Bootstrap actions are executed on each node in the cluster once the node is ready for work. For a simple third-party library, a bootstrap action can execute something like sudo apt-get install python-nltk. To load an internal NYT library, we'd have to make it available on S3 and write an install script for the bootstrap action. Both cases are taxing and time-consuming. We wanted a more elegant solution and, since we'd started writing more and more libraries and services with Go, we thought we'd see if it would work for this scenario.
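To make that concrete, here is a rough sketch of wiring a bootstrap action into a boto job flow. It assumes boto 2.x, the S3 paths and names are hypothetical, and the install script is paraphrased in a comment rather than shown in full.

# a sketch of attaching a bootstrap action with boto 2.x; the S3 paths
# and names below are hypothetical
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.connection import EmrConnection

# install.sh would live on S3 and contain something like:
#   sudo apt-get -y install python-nltk
install_nltk = BootstrapAction(
    "install nltk",                               # name shown in the EMR console
    "s3://example-bucket/bootstrap/install.sh",   # script to run on each node
    None,                                         # no arguments for the script
)

conn = EmrConnection()
jobflow_id = conn.run_jobflow(
    name="log metrics job flow",
    log_uri="s3n://example-bucket/jobflow-logs/",
    steps=[],  # the streaming steps from before
    bootstrap_actions=[install_nltk],
)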

Moving to Go

For over a year, my team has been creating back-end services and web APIs with Go. As Dave Cheney stated in his recent talk (Five Things That Make Go Fast) at Gocon, Go is often a superior choice because it offers concurrency, easy deployment and great performance. Additionally, the syntax is clean and has the feel of a dynamic language, but is still statically typed.

Once we had several Go services running in production, we noticed we could reuse some logic from one of those services in a streaming mapper, and it turns out that creating mappers (and reducers) with Go does not take significantly more effort than with Python. A streaming mapper or reducer only needs to read records from stdin and write tab-delimited key/value pairs to stdout; Hadoop sorts the mapper's output by key before it reaches the reducer, which is what lets the reducers below keep a simple running count per key. To demonstrate, I've put together an example of a simple mapper/reducer in Python and the same one in Go:

Python Mapper

#!/usr/bin/python

import sys
import simplejson as json


def main():
    # loop through each line of stdin
    for line in sys.stdin:
        try:
            # parse the incoming json
            j = json.loads(line.strip())

            # initialize output structure
            output = dict()
            # grab an identifier
            output["key"] = j["data"]["key"]
            # and any other useful information from input json
            output["secondary-key"] = j["data"]["another-key"]
            output["first-metric"] = j["data"]["metric"]
            output["second-metric"] = j["data"]["metric-2"]
        except Exception as e:
            sys.stderr.write("unable to read log: %s" % e)
            continue

        try:
            # generate json output
            output_json = json.dumps(output)
            # write the key and json to stdout
            print "%s\t%s" % (output["key"], output_json)
        except Exception as e:
            sys.stderr.write("unable to write mapper output: %s" % e)
            continue


if __name__ == "__main__":
    main()

Go Mapper

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "log"
    "os"
)

func main() {
    var line []byte
    var input logRecord
    var output mapperOutput
    var outputJSON []byte
    var err error

    // loop through each line of stdin
    ls := bufio.NewScanner(os.Stdin)
    for ls.Scan() {
        line = ls.Bytes()

        // parse the incoming json
        if err = json.Unmarshal(line, &input); err != nil {
            log.Print("unable to read log: ", err)
            continue
        }

        // initialize output structure
        output = mapperOutput{
            // grab an identifier
            input.Data.Key,
            // and any other useful information from input json
            input.Data.AnotherKey,
            input.Data.Metric,
            input.Data.AnotherMetric,
        }

        // generate json output
        if outputJSON, err = json.Marshal(output); err != nil {
            log.Print("unable to write mapper output: ", err)
            continue
        }

        // write the key and json to stdout
        fmt.Fprintf(os.Stdout, "%s\t%s\n", output.Key, outputJSON)
    }

    if ls.Err() != nil {
        log.Print("error reading from stdin: ", ls.Err())
        os.Exit(1)
    }
}

type logRecord struct {
    Data struct {
        Key           string `json:"key"`
        AnotherKey    string `json:"another-key"`
        Metric        int64  `json:"metric"`
        AnotherMetric int64  `json:"metric-2"`
    } `json:"data"`
}

type mapperOutput struct {
    Key          string `json:"key"`
    SecondaryKey string `json:"secondary-key"`
    FirstMetric  int64  `json:"first-metric"`
    SecondMetric int64  `json:"second-metric"`
}

Python Reducer

#!/usr/bin/python

import sys
import simplejson as json


def main():
    ongoing_count = {"key": ""}
    # loop through each line of stdin
    for line in sys.stdin:
        try:
            # split line to separate key and value
            key_val = line.split("\t", 1)
            key = key_val[0]
            # parse the incoming json
            data = json.loads(key_val[1])

            # check if incoming key equals ongoing key
            if key == ongoing_count["key"]:
                # increment ongoing metrics
                ongoing_count["first-metric"] += data["first-metric"]
                ongoing_count["second-metric"] += data["second-metric"]
            else:
                # if a new key, emit ongoing counts
                writeOutput(ongoing_count)
                # set ongoing count with current data
                ongoing_count = data
        except Exception as e:
            sys.stderr.write("unable to parse reducer input: %s" % e)
            continue

    # emit the final counts
    writeOutput(ongoing_count)


def writeOutput(ongoing_count):
    if ongoing_count["key"] != "":
        try:
            # generate json output
            output_json = json.dumps(ongoing_count)
        except Exception as e:
            sys.stderr.write("unable to create reducer json: %s" % e)
            return
        # write the key and json to stdout
        print "%s\t%s" % (ongoing_count["key"], output_json)


if __name__ == "__main__":
    main()

Go Reducer

package main

import (
    "bufio"
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "os"
)

var tab = []byte("\t")

func main() {
    var rawInput [][]byte
    var input mapperOutput
    var ongoingCount mapperOutput
    var err error

    // loop through each line of stdin
    ls := bufio.NewScanner(os.Stdin)
    for ls.Scan() {
        // split line to separate key and value
        rawInput = bytes.SplitN(ls.Bytes(), tab, 2)
        if len(rawInput) < 2 {
            log.Print("unable to split reducer input")
            continue
        }

        // parse the incoming json
        if err = json.Unmarshal(rawInput[1], &input); err != nil {
            log.Print("unable to parse reducer input: ", err)
            continue
        }

        // check if incoming key equals ongoing key
        if ongoingCount.Key == input.Key {
            // increment ongoing metrics
            ongoingCount.FirstMetric += input.FirstMetric
            ongoingCount.SecondMetric += input.SecondMetric
        } else {
            // if a new key, emit ongoing counts
            writeOutput(ongoingCount)
            // set ongoing count with current data
            ongoingCount = input
        }
    }

    if ls.Err() != nil {
        log.Print("error reading from stdin: ", ls.Err())
        os.Exit(1)
    }

    // emit the final counts
    writeOutput(ongoingCount)
}

func writeOutput(o mapperOutput) {
    if len(o.Key) == 0 {
        return
    }

    // generate json output
    data, err := json.Marshal(o)
    if err != nil {
        log.Print("unable to marshal reducer output: ", err)
        return
    }

    // write the key and json to stdout
    fmt.Fprintf(os.Stdout, "%s\t%s\n", o.Key, data)
}

type mapperOutput struct {
    Key          string `json:"key"`
    SecondaryKey string `json:"secondary-key"`
    FirstMetric  int64  `json:"first-metric"`
    SecondMetric int64  `json:"second-metric"`
}

There's a small bump in the number of lines in the Go implementation, but the simpler deployment and increased performance Go provides are more than worth the inconvenience.

Since Go programs compile down to a single binary, we can include all the third-party libraries we want, and deployment is just a matter of putting our binary on S3; no bootstrap actions required.
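As an illustration, a deploy step can be as small as the sketch below: build the mapper for the cluster's Linux hosts, then push it to S3 with boto. The bucket and key names are placeholders, not our real layout.

# deploy.py -- a rough sketch of pushing a compiled mapper binary to S3
# with boto 2.x; bucket and key names are placeholders. The binary is
# built for the cluster first, e.g.:
#   GOOS=linux GOARCH=amd64 go build -o mapper
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("example-bucket")

# upload the binary; streaming steps can then reference it directly
key = bucket.new_key("emr/binaries/mapper")
key.set_contents_from_filename("mapper")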

We also get a nice speed boost. I ran an old Python mapper/reducer over the same data as the Go implementation and, after several runs, found that the Go version was about 25 percent faster on average. Both implementations used standard libraries for CSV, JSON and regex (with the exception of Python's simplejson).

As we’ve continued to build out and improve our platform technology, we’ve become more confident and familiar with Go. From daemon services to simple MapReduce scripts, Go has been my team’s first choice for server-side code. It’s enabled us to build performant and reliable services that have been easy to maintain, and the Go community’s enthusiasm along with the speed of quality releases have kept us excited and eager to see what’s next for the language.