Google Safe Browsing without The Browser

Posted by Nick Galbreath on March 4, 2012

Note: In 2020 we updated this post to adopt more inclusive language. Going forward, we’ll use “allowlist/blocklist” in our Code as Craft entries.

At Etsy, we are constantly evaluating the security and safety of our members as they use the site. One way we do this is by analyzing user-generated content (UGC) for possible problems. As part of the process we integrate results from the Google Safe Browsing (GSB) service. Typically this is a client-side technology used by web browsers to protect the end user from visiting dangerous websites that might serve malware or be part of a phishing scam.

The Security and Defensive Systems group here at Etsy has flipped this model around. Rather than warn the user when a malicious link is followed, we block the link (or the whole page) from displaying in the first place.
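As a rough illustration of the "block instead of warn" idea, here is a minimal sketch that removes flagged links from a piece of user text before it is displayed. The function `stripFlaggedLinks`, the `[link removed]` placeholder, and the simple URL pattern are all hypothetical and not part of our library; the lookup is passed in as a callable, so any checker can be plugged in.

```php
<?php
// Hypothetical helper, not part of gsb4ugc: drop flagged links from user text.
// $isFlagged is any callable that returns true for a URL that should be blocked.
function stripFlaggedLinks(string $text, callable $isFlagged): string
{
    // Deliberately simple URL pattern for illustration only;
    // real user-generated content needs a stricter parser.
    return preg_replace_callback(
        '~https?://[^\s<>"]+~i',
        function (array $m) use ($isFlagged) {
            return $isFlagged($m[0]) ? '[link removed]' : $m[0];
        },
        $text
    );
}

// Stand-in for a real lookup: a hard-coded blocklist.
$blocked = ['http://malware.testing.google.test/testing/malware/'];
$isFlagged = function (string $url) use ($blocked): bool {
    return in_array($url, $blocked, true);
};

echo stripFlaggedLinks(
    'See http://malware.testing.google.test/testing/malware/ and https://www.etsy.com/',
    $isFlagged
), "\n";
```

In a real pipeline the callable would wrap whatever lookup you have available, and blocking the whole page rather than a single link is a policy decision layered on top.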

There are a few ways to use the Google Safe Browsing service. For lower volume queries, there is a very simple REST API. For high volume, high performance systems, the GSB V2 protocol is more appropriate as it mirrors the entire GSB database locally. It’s designed to scale to an extremely large number of clients while minimizing network traffic. To do so, it uses a complicated protocol involving multiple blocklists and allowlists sent as a series of distributed binary diffs.

While many implementations of the GSB protocols are available, for a variety of reasons they were not appropriate for use in Etsy’s operational environment (e.g. use of autoincrement ids, designed to run under a web server, etc.), so we created our own. We have open sourced our version and made it available in our gsb4ugc git repository. It’s in PHP, but it should be straightforward to port to other languages, as it’s really more of a toolkit than a standalone product.

To use it, you’ll need to assemble a few resources into your own API. First, set up some boilerplate shared by both the GSB updater and the client:

    // Set up a db connection.
    $dbh = new PDO('mysql:host=127.0.0.1;dbname=gsb', 'user', 'password');
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Create storage; works with mysql, sqlite.
    // No auto-increment IDs, so it's safe with master-master replication.
    // Etsy subclasses this and adds StatsD calls. http://etsy.me/dQwVXi
    $storage = new GSB_StoreDB($dbh);

    // Create network access. Pass in your GSB API key. Uses PHP curl.
    $network = new GSB_Request($api);

    // Logger. Subclass to use your logging infrastructure (or not).
    $logger = new GSB_Logger(5);

Then set up a cron job that runs every 30 minutes to start mirroring the GSB database:

    $updater = new GSB_Updater($storage, $network, $logger);
    $updater->downloadData($gsblists, FALSE);

It takes about 24 hours to fully sync up. Once that is done, you can start checking URLs:

    $client = new GSB_Client($storage, $network, $logger);
    $url = "http://malware.testing.google.test/testing/malware/";
    print_r($client->doLookup($url));

This should print something similar to:

    [list_id] => 1
    [add_chunk_num] => 70219
    [host_key] => b2ae8c6f
    [prefix] => 51864045
    [match] => malware.testing.google.test/testing/malware/
    [hash] => 518640453f8b2a5f0d43bc2251....
    [host] => testing.google.test/
    [url] => http://malware.testing.google.test/testing/malware/
    [listname] => goog-malware-shavar
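The fields in that result hint at how lookups work: the `prefix` is the first 8 hex characters (4 bytes) of `hash`. The sketch below assumes, as we understand the GSB V2 protocol, that `hash` is the SHA-256 of the host/path expression shown in `match`, and that `host_key` is the same kind of 4-byte prefix computed over the `host` value. The helper name `gsbPrefix` is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper, not gsb4ugc's API:
// the 4-byte (8 hex character) prefix of a SHA-256 hash.
function gsbPrefix(string $expression): string
{
    return substr(hash('sha256', $expression), 0, 8);
}

// Values taken from the sample lookup result.
$match = 'malware.testing.google.test/testing/malware/';
$host  = 'testing.google.test/';

// Assumption: the full hash is SHA-256 over the host/path expression.
printf("hash     => %s\n", hash('sha256', $match));
printf("prefix   => %s\n", gsbPrefix($match));
printf("host_key => %s\n", gsbPrefix($host));
```

Comparing what this prints against the sample result is a quick way to check those assumptions. Storing only short prefixes locally is what keeps the mirrored database small; a full hash is only needed to confirm a prefix hit.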

More details are in the bin/samples directory of our repository.

We are currently scanning a few types of user-generated content in production. This is done asynchronously from the website so we don’t block the user experience; however, we still care about performance. Almost all performance metrics here at Etsy measure maximum and minimum times, as well as the 90th percentile and mean, and this is no exception. The peak times occur when a network call is required; otherwise, a lookup typically takes 5ms.
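For readers who want to reproduce those four numbers from raw timings, here is a minimal sketch. The function `summarizeTimings` is hypothetical, the sample values are made up, and the nearest-rank definition of the 90th percentile is our choice for the example; metrics systems may interpolate differently.

```php
<?php
// Hypothetical helper: summarize lookup timings given in milliseconds.
function summarizeTimings(array $ms): array
{
    $n = count($ms);
    if ($n === 0) {
        throw new InvalidArgumentException('need at least one sample');
    }
    sort($ms); // sorts the local copy only

    // Nearest-rank 90th percentile: the smallest sample with at least
    // 90% of all samples at or below it. Integer math avoids float rounding.
    $rank = intdiv(90 * $n + 99, 100); // ceil(0.9 * n)

    return [
        'min'  => $ms[0],
        'max'  => $ms[$n - 1],
        'mean' => array_sum($ms) / $n,
        'p90'  => $ms[$rank - 1],
    ];
}

// Made-up samples: mostly local lookups plus one slow lookup
// that needed a network call.
print_r(summarizeTimings([5, 4, 6, 5, 5, 4, 6, 5, 5, 180]));
```

With samples like these, a single slow network call dominates the maximum and pulls up the mean while leaving the 90th percentile close to the typical case, which is why we look at all four.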

Since this is security-related code, another goal of gsb4ugc is testability. The protocol-parsing code is separated out from database and networking code, so it’s very easy to write unit tests. This also helps to explain how the code works. As you see below, we have some more work to do:

In addition to expanding test coverage and improving performance, we’d like to add MAC support, and to use it for more content types on Etsy. We’d also like to add the results from PhishTank for completeness and redundancy. Comments, bug reports, patches and pull requests are all welcome, but if this type of work interests you, consider doing it full time.
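On the testability point: when parsing is kept apart from the database and network code, a test needs nothing but strings. The sketch below is not taken from our repository. It is a hypothetical pure function, `expandChunkRanges`, assuming chunk numbers in the protocol are written as comma-separated ranges such as `1-2,4-5,7`.

```php
<?php
// Hypothetical pure function, not gsb4ugc's API:
// expand "1-2,4-5,7" into [1, 2, 4, 5, 7].
function expandChunkRanges(string $spec): array
{
    $chunks = [];
    foreach (explode(',', $spec) as $part) {
        $part = trim($part);
        if ($part === '') {
            continue; // tolerate an empty spec
        }
        $bounds = explode('-', $part, 2);
        $lo = (int) $bounds[0];
        $hi = isset($bounds[1]) ? (int) $bounds[1] : $lo;
        if ($lo < 1 || $hi < $lo) {
            throw new InvalidArgumentException("bad chunk range: $part");
        }
        for ($i = $lo; $i <= $hi; $i++) {
            $chunks[] = $i;
        }
    }
    return $chunks;
}

// No database or network is involved, so these checks run anywhere.
var_dump(expandChunkRanges('1-2,4-5,7') === [1, 2, 4, 5, 7]);
var_dump(expandChunkRanges('') === []);
```

The same shape works for any parsing step: feed in a captured response as a string and compare the parsed structure against what you expect.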

Now, go forth and browse and consume content safely!