Unlocking New Features in Moz Pro with a Database-Free Architecture

Posted by Devin Ellis on February 16, 2016

Moz Pro is undertaking a comprehensive overhaul of our backend architecture to improve the performance and speed of our application and to unlock significant new features in high demand by our customers. We are abandoning MySQL database storage—our current infrastructure’s decisive bottleneck—in favor of a database-free architecture and an Elixir-driven data indexing model. In this post, I'll discuss our new architecture, the competitive and technical reasons we chose to invest in this project, and some interesting implementation challenges we faced. Update May 20, 2016: Hey Elixir fans! Today we open-sourced the Public Suffix component of our Elixir project described below. You can read about it here.

Overview of Moz Pro

Outgrowing our Database Infrastructure

Limitations of our Legacy Architecture

Scaling

Time to build . Every week, we rebuild a sharded MySQL database for each campaign, which we treat as an immutable, read-only index. Since the table insertions for a given campaign’s shard happen sequentially (attempts to parallelize resulted in deadlocks), shard-building for campaigns with deep histories increased linearly. Thus, five years (260 weeks) of history would take about 22 times longer to build than 12 weeks.

. Every week, we rebuild a sharded MySQL database for each campaign, which we treat as an immutable, read-only index. Since the table insertions for a given campaign’s shard happen sequentially (attempts to parallelize resulted in deadlocks), shard-building for campaigns with deep histories increased linearly. Thus, five years (260 weeks) of history would take about 22 times longer to build than 12 weeks. Storage constraints . In addition to linear increases in build times, the shard size also increases with direct proportion to the length of history.

. In addition to linear increases in build times, the shard size also increases with direct proportion to the length of history. Stress on MySQL. We put a lot of campaign data into the sharded MySQL databases, and we do so for a lot of campaigns. Even with the 12-cycle cap, we operate very close to capacity. Minor disruptions, especially with replication and monthly cycles, could multiply and cause significant backlogs that sometimes impacted our customers.

Moz Pro is the industry-leading digital marketing and SEO suite, offering actionable, timely data on numerous search marketing dimensions, including keyword performance, site crawl, page optimization, search visibility, and competitive metrics. Over the past year, we've focused intently on rankings-related features, the core of the Moz Pro suite. We introduced mobile search engine support, aggregated search visibility , and localized rankings data to keep pace with evolving trends in the search ecosystem and to provide competitive advantages to our customers.As our features grew and we incorporated progressively more data into our toolset, we became increasingly hamstrung when it came to maximizing that data’s potential. That’s because our architecture didn’t scale well and wasn’t performant with large datasets. It didn’t support the features we wanted to offer to complement our novel datasets: selectable date ranges, complete campaign history, flexible in-app data segmentation, and customizable data filtering, to name a few. For context on the technical discussion to follow, let’s unpack what one of those features means: complete campaign history. Under the old architecture, Moz Pro presented historical rankings in weekly and monthly views, with 12 data points each: 12 weeks of rankings history in Moz Pro But many of our current customers have (or will have) several years of rankings history. Heretofore, we’ve been unable to surface more than 12 cycles (weeks or months) of data, or even to selectively display anything other than the most recent 12 cycles. This didn’t serve our customers’ needs, and we wanted to change that so our customers could segment and slice their campaigns’ entire histories as they please.Thankfully our 12-cycle limitation wasn’t due to a limit on data retention: we store rankings history forever in Amazon’s S3 service. The culprit was our database architecture, which limited the depth and flexibility of the data we present in a variety of ways. Among those limiting factors were non-scalability, non-standard use of MySQL, and concurrency limitations in Ruby. I go into more data on each of these below.Our core challenge was scalability, manifesting in several ways:

Late campaign builds in May 2015, when it took us a week to recover from a series of snowballing MySQL backlogs

(Mis)use of MySQL

Resource contention . MySQL became the bottleneck in the data build pipeline whenever new rankings data collections finished. We had to heavily throttle upstream jobs to prevent MySQL crashes.

. MySQL became the bottleneck in the data build pipeline whenever new rankings data collections finished. We had to heavily throttle upstream jobs to prevent MySQL crashes. Long recovery times . When a MySQL server crashed, it could take hours to complete the MySQL boot sequence and reload the many thousands of campaign shards.

. When a MySQL server crashed, it could take hours to complete the MySQL boot sequence and reload the many thousands of campaign shards. Provisioning expense . Our particular configuration required specialized hardware that is significantly more expensive than commodity VMs.

. Our particular configuration required specialized hardware that is significantly more expensive than commodity VMs. Inefficiency. Each shard recursively builds data from (up to) 11 previous cycles, plus the current cycle. With the exception of the latest cycle, that was wasted work already performed in previous builds.

Ruby Concurrency Limitations

Arriving at the New Solution

Prototype: Replacing Databases with CSVs

The CSV columns were untyped because in a CSV, everything is a string.

Multiple files were required for each campaign to express the different dimensions of keywords and search engine results that Moz Pro provides, such as image and location verticals.

The CSV format only supported storing the raw data, but we wanted to index the data in advance so that it was in ready-to-query form.

CSV is an inflexible format. For some datasets it fits perfectly, but for the kinds of multi-dimensional data we deal with beyond simple rankings (such as universal search results), a nested data structure proved more practical.

SQLite: A Possible Middle Path

Very large file sizes.

Hard to ensure data is served from memory. One of the capabilities we sought with this architecture was loading a campaign's data entirely into memory, and having the API serve it from there. This ability would enable predictable, consistent performance, and reduce network calls that could potentially fail. SQLite lazily loads parts of its data file into memory as needed, and there was no easy, performant way to control this.

Elixir: Everything in Memory

Performance Summary

Small data files. Compressed rankings data files consume 63 times less disk space, on average, than their corresponding MySQL rankings tables in a direct comparison of campaigns with exactly 12 cycles of history. We obtain storage savings even for well-established campaigns, which is notable because we’re comparing campaigns with hundreds of data cycles in the new model with just 12 cycles in the legacy model. Rand’s oldest campaign, with 181 weekly cycles, is about 10 times smaller in the new format than its 12-cycle counterpart in MySQL. This reinforces the new model’s value and sustainability—unlimited history and all.

We could buy our way out of any of these problems, but scaling hardware spending linearly isn’t a responsible answer, and doesn’t solve response time problems or address some other core underlying issues (read on!).The way were were using MySQL presented further challenges:Finally, our build pipeline was written in Ruby, which is ill-optimized for concurrency. While parallel I/O is performant enough, parallel computation requires multiple processes, each bearing the overhead of the full application environment (this is the Global Interpreter Lock limitation, explained here ).Our path to an Elixir-driven solution took us down a couple of interesting roads that I’ll briefly recount before diving into the details of the model we ultimately selected.Our interest was initially piqued about the possibility of eschewing a database-driven architecture by an internal prototype using compressed CSV files generated from our rankings history store (thanks, Dudley ). These CSVs were easy to build, compact enough to load into memory on a handful of VMs, and suitable for querying using pandas to serve data via an API. The CSV prototype suggested a wholly-different kind of architecture, but a few limitations of the CSV solution prompted us to keep experimenting:Seeing both the upsides and limitations of the CSV solution, we investigated SQLite as a possible alternative. While still a database, SQLite runs embedded in its host language’s process, avoiding the centralized bottleneck we faced with MySQL. Moreover, SQLite addressed some of the challenges of the CSV solution such as allowing multiple tables, typed data, and indexes. The main issues we faced with SQLite that induced us to keep exploring alternatives were:Our ultimate solution uses Elixir to process rankings information from cold storage in S3 and load it into memory—not entirely dissimilar to the CSV solution. But with Elixir we’re able to index the data up front, resulting in a ready-to-query data structure. We save that data structure to disk, perfectly-preserved, using Erlang’s binary marshaling format. This is what we refer to as our "database-free" solution. The file sizes are small enough that we can load them into memory on a few VM hosts. From there, our API query logic uses typical functional operations (map, filter, group, etc.) to query the data.Virtually all of our concerns with the CSV and SQLite solutions were satisfactorily alleviated by Elixir, and we’ve achieved impressive results:

Sampling of storage requirements in MySQL vs. Elixir. Note the significant savings in Elixir even when we're storing up to 16 times more history.

Impressive query performance. Our new API serves the core rankings datasets 20 times faster than the previous API, with average response times consistently below 50 ms (compared to 800+ ms).

Comparison of average response times (ms) for rankings data requests from new and old APIs

Data in memory . Using the Erlang marshal format allows us to keep a campaign’s dataset completely in memory in an indexed, ready-to-query form.

. Using the Erlang marshal format allows us to keep a campaign’s dataset completely in memory in an indexed, ready-to-query form. Simple deployment . Elixir provides great tooling with Exrm, allowing us to package a release that runs on a stock Ubuntu server without the usual setup or configuration management overhead.

. Elixir provides great tooling with Exrm, allowing us to package a release that runs on a stock Ubuntu server without the usual setup or configuration management overhead. Build speed. Weekly cycles are built into queryable files on average 30 times faster with the new architecture. For the average campaign, that means we’re able to move from the cold storage stage to a finished “shard” in under 25 seconds. For a very large campaign—one with multiple years of history, more keywords than 99% of campaigns, and an order-of-magnitude more keywords than the average campaign—it takes about two minutes. Contrast those times with 13 and 45 minutes, respectively, under the legacy system.

Build time, in minutes, for campaign shards using the MySQL pipeline vs. our Elixir solution

Other Things We Like About Elixir

Optimized for parallel computation . Taking full advantage of all available hardware resources makes data processing efficient and fast. For Moz Pro, this means we can parallelize pure computation in memory, without a database query engine.

. Taking full advantage of all available hardware resources makes data processing efficient and fast. For Moz Pro, this means we can parallelize pure computation in memory, without a database query engine. Robust and reliable . Full parallel, in-memory shard building in a single OS process allows us to atomically write the shard to disk at the end, without worrying about writing intermediate computations to MySQL and Redis. This allows us to dramatically reduce the number of (potentially failing) network calls.

. Full parallel, in-memory shard building in a single OS process allows us to atomically write the shard to disk at the end, without worrying about writing intermediate computations to MySQL and Redis. This allows us to dramatically reduce the number of (potentially failing) network calls. Designed for distributed systems . The Erlang VM and ecosystem (of which Elixir is a part) is designed specifically for distributed systems – more so than any other language platforms we’re aware of.

. The Erlang VM and ecosystem (of which Elixir is a part) is designed specifically for distributed systems – more so than any other language platforms we’re aware of. Extremely good performance . For a high-level, dynamically-typed language, Elixir has solid performance. It fares very well in head to head benchmarks and users often report getting sub-millisecond response times—not something we’ve ever come close to with Ruby.

. For a high-level, dynamically-typed language, Elixir has solid performance. It fares very well in head to head benchmarks and users often report getting sub-millisecond response times—not something we’ve ever come close to with Ruby. Flexible and low-cost code management . Elixir has excellent support for umbrella apps, which allows us to put multiple sub-applications in the same Git repository, manage them in a consistent way, and deploy flexibly.

. Elixir has excellent support for umbrella apps, which allows us to put multiple sub-applications in the same Git repository, manage them in a consistent way, and deploy flexibly. Easy to learn, maintain, and reason about . The Moz Pro platform team, which comprises five developers of varying backgrounds and experience, became conversant in Elixir within a month. It wasn’t long before the team found it to be a more maintainable and performant platform than Ruby or Python because it encourages a functional style and immutable data structures.

. The Moz Pro platform team, which comprises five developers of varying backgrounds and experience, became conversant in Elixir within a month. It wasn’t long before the team found it to be a more maintainable and performant platform than Ruby or Python because it encourages a functional style and immutable data structures. Friendly and supportive community. The Elixir core team is very friendly, welcoming, and approachable; they have consistently given us help whenever we asked. As proponents of Moz's TAGFEE culture, we really appreciate that. In the same spirit, Myron, Katie, and Vadim have begun to contribute to the Elixir language.

Here are some of the things that attracted us to Elixir and made the performance numbers above achievable:

Iterating Toward Replacement

Comments

Overhauling a behemoth architecture like we have at Moz Pro presents obvious logistical challenges. We’ve decided to stay true to agile values and not pursue a big bang strategy. We’re committed to frequent, working iterations, driven by product features. We’ve started by honing in on a stable MVP limited to the rankings dataset, eviscerating the 12-cycle limitation and enabling flexible data segmentation and filtering. Success! 156 weeks of rankings history in Moz Pro We started with rankings, though we could have chosen a less challenging dataset to prototype. Crawl data, for example, while voluminous and complex, has a tidy collection-to-display pipeline that doesn’t require commingling with any other datasets. The value of converting rankings data, however, was so huge from a feature standpoint that we elected to start there. Converting other datasets will be comparatively easy, and we’re confident we’ve worked through most of the migration challenges by starting with the rankings section. We’re very excited to have completed the first leg of this journey, and you can read about the new features we released today over on the Moz Blog . It’s an important milestone for us, and for our subscribers, who get some great new features and major performance improvements in the Moz Pro rankings section. We’ll continue overhauling rankings-related data, including Analyze a Keyword and Page Optimization, and eventually migrate all stored Moz Pro datasets to the new architecture.