Scumblr leverages a gem called Workflowable (which we are also open sourcing) that allows setting up flexible workflows that can be associated with search results. These workflows can be customized so that different types of results go through different workflow processes depending on how you want to handle them. Workflowable also has a plug-in architecture that allows triggering custom automated actions at each step of the process.
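To make the idea of staged workflows with plug-in actions concrete, here is a minimal sketch in Python. It is not Workflowable's API (Workflowable is a Ruby gem); the `Workflow` class, the stage names, and the `notify_owner` action are all hypothetical and exist only to illustrate the pattern of running registered actions when a result enters a stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: this illustrates the pattern only and is
# not the Workflowable gem's actual interface.
Action = Callable[[dict], None]


@dataclass
class Workflow:
    name: str
    # Maps each stage to the stages a result may move to next.
    transitions: Dict[str, List[str]]
    initial_stage: str
    # Plug-in actions to run whenever a result enters a given stage.
    actions: Dict[str, List[Action]] = field(default_factory=dict)

    def on_enter(self, stage: str, action: Action) -> None:
        """Register an automated action for a stage."""
        self.actions.setdefault(stage, []).append(action)

    def start(self, result: dict) -> None:
        """Attach a result to this workflow at its initial stage."""
        self._enter(result, self.initial_stage)

    def advance(self, result: dict, next_stage: str) -> None:
        """Move a result to next_stage if the transition is allowed."""
        current = result["stage"]
        if next_stage not in self.transitions.get(current, []):
            raise ValueError(f"cannot move from {current!r} to {next_stage!r}")
        self._enter(result, next_stage)

    def _enter(self, result: dict, stage: str) -> None:
        result["stage"] = stage
        for action in self.actions.get(stage, []):
            action(result)


def notify_owner(result: dict) -> None:
    # Stand-in for a real automated action such as sending an email.
    result.setdefault("log", []).append(f"notified owner about {result['url']}")


takedown = Workflow(
    name="takedown",
    transitions={
        "new": ["investigating", "ignored"],
        "investigating": ["reported", "ignored"],
        "reported": [],
        "ignored": [],
    },
    initial_stage="new",
)
takedown.on_enter("reported", notify_owner)

result = {"url": "http://example.com/suspicious"}
takedown.start(result)
takedown.advance(result, "investigating")
takedown.advance(result, "reported")  # runs notify_owner
```

A different type of result could be attached to a second `Workflow` with its own stages and actions, which is the kind of per-result-type customization described above.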

Scumblr also integrates with Sketchy, which allows automatic screenshot generation of identified results to provide a snapshot-in-time of what a given page and result looked like when it was identified.

Architecture

Scumblr makes use of the following components:

Ruby on Rails 4.0.9

Backend database for storing results

Redis + Sidekiq for background tasks

Workflowable for workflow creation and management

Sketchy for screenshot capture

We’re shipping Scumblr with built-in search libraries for seven common services including Google, Twitter, and Facebook.

Getting Started with Scumblr and Workflowable

Scumblr and Workflowable are available now on the Netflix Open Source site. Detailed instructions on setup and configuration are available in the projects’ wiki pages.

Sketchy

One of the features we wanted to see in Scumblr was the ability to collect screenshots and text content from potentially malicious sites, which allows security analysts to preview Scumblr results without the risk of visiting the site directly. We wanted this collection system to be isolated from Scumblr and resilient to sites that may perform malicious actions. We also decided it would be useful to build an API that we could use in applications outside of Scumblr.

Although a variety of tools and frameworks exist for taking screenshots, we discovered a number of edge cases that made taking reliable screenshots difficult: capturing screenshots from AJAX-heavy sites, cut-off images with virtual X drivers, and SSL and compression issues in the PhantomJS driver for Selenium, to name a few. To solve these challenges, we decided to leverage the best available tools and create an API framework that would allow for reliable, scalable, and easy-to-use screenshot and text scraping capabilities. Sketchy to the rescue!

Architecture

At a high level, Sketchy contains the following components:

Python + Flask to serve Sketchy

PhantomJS to take lazy captures of AJAX-heavy sites

Celery to manage jobs + Redis to schedule and store job results

Backend database to store capture records (by leveraging SQLAlchemy)
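The way these components fit together can be sketched with the standard library alone. In the example below, a `queue.Queue` and a worker thread stand in for Celery and Redis, and an in-memory dictionary stands in for the capture database. The `CaptureService` and `CaptureRecord` names, the status strings, and the `fake_fetch` function are assumptions made for illustration and do not reflect Sketchy's actual code.

```python
import queue
import threading
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CaptureRecord:
    # Hypothetical record shape; Sketchy's real schema may differ.
    id: str
    url: str
    status: str = "queued"
    text: Optional[str] = None


class CaptureService:
    """Accepts capture jobs and processes them on a background worker."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # the only code that touches the target site
        self._jobs: "queue.Queue[str]" = queue.Queue()
        self._records: Dict[str, CaptureRecord] = {}
        self._lock = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, url: str) -> str:
        """Store a record, enqueue the job, and return the record id."""
        record = CaptureRecord(id=uuid.uuid4().hex, url=url)
        with self._lock:
            self._records[record.id] = record
        self._jobs.put(record.id)
        return record.id

    def get(self, capture_id: str) -> CaptureRecord:
        with self._lock:
            return self._records[capture_id]

    def wait(self) -> None:
        """Block until every queued job has been processed."""
        self._jobs.join()

    def _run(self) -> None:
        while True:
            record = self.get(self._jobs.get())
            try:
                text, status = self._fetch(record.url), "finished"
            except Exception as exc:
                # A failing site marks its own record; the worker keeps going.
                text, status = str(exc), "failed"
            with self._lock:
                record.text, record.status = text, status
            self._jobs.task_done()


def fake_fetch(url: str) -> str:
    # Stand-in for a real headless-browser capture.
    if "bad" in url:
        raise RuntimeError("render timed out")
    return f"text of {url}"


service = CaptureService(fake_fetch)
ok_id = service.submit("http://example.com")
bad_id = service.submit("http://bad.example.com")
service.wait()
```

Because callers only ever see record ids and stored results, a capture that fails or misbehaves affects its own record rather than the application that requested it.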

Sketchy Overview

At its core, Sketchy provides a scalable, task-based framework to capture screenshots, scrape page text, and save HTML through a simple-to-use API. Captures can be stored locally or in an AWS S3 bucket. Token authentication and callbacks can optionally be configured. Sketchy uses PhantomJS with lazy rendering to ensure AJAX-heavy sites are captured correctly. Sketchy also uses the Celery task management system, which lets users scale Sketchy as needed and manage time-intensive captures of large sites.
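One common way to capture AJAX-heavy pages reliably is to delay the capture until the page has had no outstanding network requests for a short quiet period. The sketch below illustrates that idea in Python against a simulated page. Whether Sketchy's PhantomJS script works exactly this way is an assumption, and `wait_until_quiet`, `FakePage`, and all parameter names are hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple


def wait_until_quiet(
    pending_requests: Callable[[], int],
    quiet_period: float = 0.5,
    timeout: float = 10.0,
    poll_interval: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once no requests have been pending for quiet_period
    seconds, or False if timeout seconds pass first."""
    start = clock()
    quiet_since: Optional[float] = None
    while True:
        now = clock()
        if pending_requests() == 0:
            if quiet_since is None:
                quiet_since = now
            if now - quiet_since >= quiet_period:
                return True
        else:
            # New activity restarts the quiet window.
            quiet_since = None
        if now - start >= timeout:
            return False
        sleep(poll_interval)


class FakePage:
    """Simulated page whose requests are in flight during given intervals."""

    def __init__(self, schedule: List[Tuple[float, float]]):
        self.schedule = schedule
        self.now = 0.0

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds  # advance simulated time instead of blocking

    def pending(self) -> int:
        return sum(1 for start, end in self.schedule if start <= self.now < end)


# The initial load ends at t=1.0, then a late AJAX call runs from 1.25 to 2.0.
page = FakePage([(0.0, 1.0), (1.25, 2.0)])
ready = wait_until_quiet(
    page.pending,
    quiet_period=0.5,
    timeout=10.0,
    poll_interval=0.25,
    clock=page.clock,
    sleep=page.sleep,
)
```

In this simulation the brief pause between the two requests is shorter than the quiet period, so the capture is deferred until after the late AJAX call has finished. The timeout bounds how long a page that never goes quiet can hold up a worker.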

Getting Started with Sketchy

Sketchy is available now on the Netflix Open Source site and setup is straightforward. We’ve also created a Docker setup for Sketchy for interested users. Please visit the Sketchy wiki for documentation on how to get started.

Conclusion

Scumblr and Sketchy are helping the Netflix security team keep an eye on potential threats to our environment every day. We hope that the open source community can find new and interesting uses for the newest additions to the Netflix Open Source Software initiative. Scumblr, Sketchy, and the Workflowable gem are all available on our GitHub site now!

— Andy Hoernecke and Scott Behrens (Netflix Cloud Security Team)