Golang for data: Using Golang to easily extract and process data faster.

This was the title of the lightning talk I gave at the recent Web Summit Meetup in Lisbon. This blog post is a written, slightly changed version of that presentation. So, without further ado, let's Go.

Why Golang for data

Extracting, processing, and analysing data are essential tasks in today's information age. While interpreted languages, such as Python, are commonly used for them, they have their drawbacks, most notably speed when dealing with large amounts of data.

Golang, being a compiled, modern language with easy concurrency, becomes an attractive option for dealing with data.

Libraries

Part of the appeal of using any language for any task is the libraries it offers. So what about Go?

Mike Hall did a good round-up of libraries at the end of 2015, which can be found here.

We can already see a good number of very useful libraries. While some might still be missing, I believe the existing ones already give developers a solid foundation to work with data. And if you do feel something is missing, you can always contribute back to the community by creating a library, growing that list.

Example

So let’s jump into an example and see how easy it is to use Golang for a simple web scraper. Suppose we want to fetch the links for the results of a DuckDuckGo search query.

For this example, I will use the goquery library, which is built on top of the Go project’s net/html library.

In order to fetch a page we simply have to do:

```go
doc, _ := goquery.NewDocument("http://duckduckgo.com/?q=whitesmith")
```

Afterwards we want to find all the <a class="result__url"> HTML tags:

```go
doc.Find("a.result__url")
```

To loop over all the matched tags and print the corresponding href:

```go
doc.Find("a.result__url").Each(func(i int, s *goquery.Selection) {
	link, _ := s.Attr("href")
	fmt.Println(link)
})
```

In the end, we add some error handling and imports, and we are done:

```go
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func scrape(url string) {
	doc, err := goquery.NewDocument(url)
	if err != nil {
		log.Fatal(err)
	}
	doc.Find("a.result__url").Each(func(i int, s *goquery.Selection) {
		link, exists := s.Attr("href")
		if !exists {
			return
		}
		fmt.Println(link)
	})
}

func main() {
	scrape("http://duckduckgo.com/?q=whitesmith")
}
```

Adding concurrency

Now, say that we wanted to fetch multiple pages at the same time. With Golang, it is quite easy to write concurrent code. For the example above, we could change the main function to:

```go
// Note: this needs "sync" added to the imports.
func main() {
	var wg sync.WaitGroup
	wg.Add(3)
	scrapeAndDone := func(url string) {
		defer wg.Done()
		scrape(url)
	}
	go scrapeAndDone("http://duckduckgo.com/?q=whitesmith")
	go scrapeAndDone("http://duckduckgo.com/?q=coimbra")
	go scrapeAndDone("http://duckduckgo.com/?q=adbjesus")
	// Without the WaitGroup, main would return immediately and the
	// goroutines would be killed before they finish.
	wg.Wait()
}
```

And there we go. We can now scrape three different queries at the same time.
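Printing from inside each goroutine works, but often we want to collect the links in one place. One common pattern is to have each scraper send its results over a channel. Here is a minimal, hedged sketch of that idea; the `scrapeLinks` stub and its canned results are made up so the example runs without a network:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// scrapeLinks stands in for the real scraper: in the actual program it
// would fetch the query URL and send each href it finds. Here it emits
// canned results so the sketch is runnable offline.
func scrapeLinks(query string, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 1; i <= 2; i++ {
		out <- fmt.Sprintf("https://example.com/%s/%d", query, i)
	}
}

func main() {
	queries := []string{"whitesmith", "coimbra", "adbjesus"}
	out := make(chan string)

	var wg sync.WaitGroup
	for _, q := range queries {
		wg.Add(1)
		go scrapeLinks(q, out, &wg)
	}

	// Close the channel once every scraper has finished, so the
	// range loop below terminates.
	go func() {
		wg.Wait()
		close(out)
	}()

	var links []string
	for link := range out {
		links = append(links, link)
	}
	sort.Strings(links) // goroutines finish in any order
	fmt.Println(len(links), "links collected")
	fmt.Println(links[0])
}
```

The channel decouples producing links from consuming them, so the caller can store, deduplicate, or filter results instead of just printing.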

It would also be interesting to make the scraper itself concurrent (for example, by analysing multiple nodes at the same time). However, with this library that isn’t easy, if it is possible at all. Had we used the net/html library instead, we would have more control over the scraper and this would be easier. We would, of course, lose the simplicity of goquery, but it’s a trade-off worth considering.

Deployment

Another factor I find very interesting in Go is the ease of deployment. Let’s say we want to run the scraper on a server. We can build a binary for the server’s architecture and run it there without any other dependencies.

To do this we have the GOOS and GOARCH environment variables. If our server were running a 64-bit Linux system, we could simply do:

```shell
export GOARCH=amd64
export GOOS=linux
go build
```

And we have a binary capable of running on the server. We just upload it, and our scraper runs without any further dependencies.

Performance

Performance was one of the main reasons I looked at Go for data, so let’s compare our scraper against a similar Python version that uses the pyquery library:

```python
from pyquery import PyQuery as pq

d = pq(url="http://localhost")
for l in d("a.result__url"):
    print(l.get("href"))
```

For the tests, I ran each scraper 35 times on a locally hosted page containing 240 results, with any JavaScript or CSS removed. The measurements were taken with the GNU time tool.

| Language | Avg. Time (s) | Avg. Max Resident Memory (KBytes) |
|----------|---------------|-----------------------------------|
| Go       | 0.02          | 10273.60                          |
| Python2  | 0.17          | 30345.49                          |

We can see Go achieves both a better time and lower memory usage. While the difference isn’t very significant in this small scenario, it makes Go very interesting at larger scales: a workload that would take 17 days with the Python version would take only 2 days with the Go one.

Conclusion

Golang has been growing a lot in the last couple of years, and some interesting libraries for data already exist. This, together with its performance, easy concurrency, and good error handling, makes it a valuable tool to have in one’s arsenal when dealing with data.

With this said, I invite you to give Go a try. However, make sure to always evaluate which tool is better for each job rather than picking one blindly. While Go is interesting for some use cases, like our example, there are certainly others where R, Julia, Python, or another language would be a better fit.