The Process: How We Tried (And Failed) To Build Our Own Solution

Over the course of a few months, the team evaluated build versus buy in depth. We tried third-party solutions and Amazon Web Services (AWS) offerings that worked, but not well enough for our needs or at our desired price point. Here we will focus on our “build” approaches.

Our initial build solution was a simple Gunicorn/Flask app deployed on Elastic Beanstalk. We went this route because it was:

Simple

Quick and easy to get up and running

In Python, the language of our inference code
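To make the shape of that first solution concrete, here is a minimal sketch of the kind of Flask inference app described above. The route name, payload shape, and the dummy `run_model` function are illustrative assumptions, not our actual inference code.

```python
# Minimal sketch of a Flask inference service (illustrative, not our real code).
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_model(inputs):
    # Placeholder for the real inference code; here we just echo input sizes.
    return [len(x) for x in inputs]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    outputs = run_model(payload.get("inputs", []))
    return jsonify({"outputs": outputs})

# In production this would sit behind Gunicorn, e.g.:
#   gunicorn -w 4 -b 0.0.0.0:8000 app:app
```

The appeal is obvious: a handful of lines gets you an HTTP endpoint around existing Python inference code, which is exactly why it works well in testing and demos.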

We tested this with sample inputs from one of our clients and saw that it performed reasonably well, with decent throughput. We happily went ahead with it as our production solution, where it fell over almost immediately. Under higher loads, and with much larger inputs than in our tests, the service would continually run out of GPU memory. We eventually got it stable, but stability came at the cost of throughput: we had to over-provision the GPU fleet to process requests without hitting timeouts or 503 errors.
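The stability-for-throughput trade described above typically comes down to capping how much inference can run concurrently on each GPU. A hedged sketch of what that looks like as a Gunicorn config (the specific values are assumptions for illustration, not the ones we used):

```python
# gunicorn.conf.py -- illustrative settings for bounding GPU memory use.
workers = 1              # one worker per GPU, so CUDA contexts don't compete
threads = 1              # serialize inference; memory use stays bounded
timeout = 120            # tolerate slow, large inputs instead of killing workers
max_requests = 500       # recycle workers periodically to release leaked memory
max_requests_jitter = 50 # stagger restarts so the fleet never recycles at once
```

With concurrency pinned this low, each instance handles far fewer requests per second, which is why the fleet had to be over-provisioned to absorb peak load.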

A stripped-down high-level view of our architecture

In parallel with stabilizing the Gunicorn/Flask app, we continued evaluating other solutions on throughput, latency, cost, and stability. This was a long and exhausting process in which more time was spent investigating, prototyping, and testing than developing a solution that could get the team unstuck. When we came across MXNet Model Server (MMS), it felt like yet another option to throw on the pile, and we were close to committing fully to improving the Flask app. Still, we tried it out as a proof of concept and compared it with our existing Flask server.