We process no small amount of geo-temporal data at Vistar. We are, after all, developing some of the deepest insight into how people move and behave in the real world. In tandem, we've built systems that serve those insights to online clients needing results with sub-millisecond latency. The core of these systems has been written largely in Google's Go, which has proven to be an excellent fit.

In many cases, we need to "play" with the data: make small ad-hoc changes both to explore the data and to refine the classes of questions we're able to ask of it. While Hadoop's Java interface is fast and flexible, the boilerplate involved in simply describing jobs was interfering with experimentation. The streaming interface, and the libraries that wrap it, provides a much lower cost of iteration, but we found ourselves limited by slow task execution and spotty support for bundled libraries.

We wanted something that was fast to write, fast to execute, and, if possible, wouldn't give us any headaches when bundling jobs to send off to Hadoop. Many of our geospatial libraries are written in Go, and Go packages all of a program's dependencies into a single executable binary. Really, we just wanted to run Go on Hadoop.

So, we started running Go on Hadoop.

A Quick Example

I'm not sure you are legally allowed to write text on the internet about MapReduce without referencing the canon of WORD COUNT, so in the name of making no adjustments to my arrest record, let's see how it looks.

Just kidding. All code is terrible.

package main

import (
	"log"
	"strings"

	"github.com/vistarmedia/gossamr"
)

type WordCount struct{}

func (wc *WordCount) Map(p int64, line string, c gossamr.Collector) error {
	for _, word := range strings.Fields(line) {
		c.Collect(strings.ToLower(word), int64(1))
	}
	return nil
}

func (wc *WordCount) Reduce(word string, counts chan int64, c gossamr.Collector) error {
	var sum int64 = 0
	for v := range counts {
		sum += v
	}
	c.Collect(sum, word)
	return nil
}

func main() {
	wordcount := gossamr.NewTask(&WordCount{})
	err := gossamr.Run(wordcount)
	if err != nil {
		log.Fatal(err)
	}
}
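If the Map/Reduce shapes above look opaque, here's the same flow simulated in plain Go, with no Hadoop or gossamr involved. The `pair`, `shuffle`, and `reducePhase` names are ours for illustration only, not part of any library; the point is just to show that Hadoop groups the map output by key before each key's values are handed to Reduce.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pair is one (word, count) emission from the map phase.
type pair struct {
	word  string
	count int64
}

// mapPhase mirrors WordCount.Map: emit (word, 1) per field.
func mapPhase(line string) []pair {
	var out []pair
	for _, word := range strings.Fields(line) {
		out = append(out, pair{strings.ToLower(word), 1})
	}
	return out
}

// shuffle groups pairs by key, as Hadoop does between map and reduce.
func shuffle(pairs []pair) map[string][]int64 {
	grouped := make(map[string][]int64)
	for _, p := range pairs {
		grouped[p.word] = append(grouped[p.word], p.count)
	}
	return grouped
}

// reducePhase mirrors WordCount.Reduce: sum all counts for one word.
func reducePhase(counts []int64) int64 {
	var sum int64
	for _, v := range counts {
		sum += v
	}
	return sum
}

func main() {
	grouped := shuffle(mapPhase("to be or not to be"))
	words := make([]string, 0, len(grouped))
	for w := range grouped {
		words = append(words, w)
	}
	sort.Strings(words)
	for _, w := range words {
		fmt.Printf("%d\t%s\n", reducePhase(grouped[w]), w)
	}
}
```

On a real cluster the shuffle happens across machines and the grouped values arrive on a channel, but the semantics are the same: each word's counts end up together, and Reduce only has to sum them.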

After compiling the binary, we can submit that single file to Hadoop to run as follows. Let's say the binary was named wordcount.

./bin/hadoop jar ./contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input /big/old/chunka/words.txt \
  -output /wordcount \
  -mapper "wordcount -task 0 -phase map" \
  -reducer "wordcount -task 0 -phase reduce" \
  -io typedbytes \
  -file ./wordcount \
  -numReduceTasks 6

And we're on our way.

Okay, okay. The submission process needs a little work. You may be able to follow how we build one task in the code, then specify its parameters directly to the Hadoop streaming interface. Our runner could easily kick off the job instead of making you run it by hand. Or maybe give you the command. Or maybe we're just cruel. I don't know. One thing at a time, kid. One thing at a time.

Check it out

We've packaged what we have as Gossamr. We use it today to process big chunks of data, and hope you might find it useful as well. It's a nascent project, and we're always happy to get pull requests.

If this sort of thing sounds like it could be your kettle of fish, we'd love to hear from you. It's a lot of fun. I guarantee it.