By Brendan Neutra and Aimee Barciauskas

Nava’s Scalable Login System (SLS) provides authentication and account management for users on HealthCare.gov. It is a RESTful API service built on Amazon Web Services. The Nava team wanted to see just how far it could scale by running a load test with a billion users, 50 times the 20 million accounts that SLS currently handles.

Summary

The system sustained a throughput of 7,754 transactions per second for an hour, with a 90th-percentile response time of 128ms and zero errors. Serving this load took roughly 9 times the compute of current production: 70 four-core machines (280 cores) versus the 15 two-core machines (30 cores) in production today. Application servers' CPU held at a comfortable 50%.

Running 7,754 transactions per second with a 128ms response time at 90th percentile and zero errors.

The Details

Tools

Nava has developed its own load testing infrastructure based on Apache JMeter, an industry-standard load testing tool, and ruby-jmeter, a Ruby DSL for building JMeter test plans. The tests are written in Ruby from reusable components that simulate SLS client HTTP requests. All components and tests are version controlled on GitHub. The tests were conducted from a distributed load generation “grid” (of only 2 machines!) with a total of 6,000 worker threads.
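As a rough illustration only (this is not the actual SLS test suite, and the endpoint URLs and payloads are hypothetical), a ruby-jmeter plan for this kind of test looks something like the following sketch, which generates a JMX test plan file for JMeter to execute:

```ruby
require 'ruby-jmeter'

# Sketch of a ruby-jmeter test plan. Endpoints and payloads are made up
# for illustration; the real SLS components are not shown here.
test do
  # Each machine in the load generation grid runs its share of the
  # worker threads (e.g. 3,000 per machine on a 2-machine grid).
  threads count: 3000, rampup: 60, duration: 3600 do
    # Reusable request components like these simulate SLS client traffic.
    post name: 'Login', url: 'https://sls.example.com/login',
         raw_body: '{"username":"user1","password":"secret"}'
    get name: 'UserInfo', url: 'https://sls.example.com/users/me'
  end
end.jmx(file: 'sls_load_test.jmx')
```

The generated `.jmx` file can then be distributed to the grid machines and run headlessly with JMeter.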

Architecture

The Test

We wanted to see how far the current SLS architecture could scale. The current system has over 20 million users in the database. The observed peak load during HealthCare.gov’s 2015 Open Enrollment came on December 14: about 150 requests per second. The load test simulated key API requests: registering users, logging in, and fetching user information.
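The scaling targets follow directly from those production figures, multiplied by 50:

```ruby
# Back-of-the-envelope extrapolation used to size the test.
# Figures are from the post; nothing here is SLS code.
current_users    = 20_000_000  # accounts in production
current_peak_rps = 150         # observed Dec 14 peak, requests/sec
scale            = 50

target_users = current_users * scale     # => 1_000_000_000
target_rps   = current_peak_rps * scale  # => 7_500

puts "target database size: #{target_users} users"
puts "target throughput:    #{target_rps} requests/sec"
```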

Extrapolating from the current database size and peak usage, our goal was a database of 1 billion users and a request rate of 7,500 per second (50 times the current number of users and the peak throughput SLS had seen). While populating the database with 1 billion users, Brendan posted updates in Slack:

With the database at 1 billion users, the load test was ready to run. We achieved 7,754 requests per second sustained for one hour with acceptable latency and zero errors.

The most time-consuming part of this exercise was populating the database with 1 billion users. At 2,000 users per second, it still took over a week!
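The raw arithmetic (which ignores batching, restarts, and other real-world overhead that pushed the total past a week) works out to nearly six days of pure insert time:

```ruby
# Rough seeding-time math; overhead not modeled.
users        = 1_000_000_000
rate_per_sec = 2_000

seconds = users / rate_per_sec  # => 500_000
days    = seconds / 86_400.0    # seconds per day

puts format('pure insert time: %.1f days', days)
```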