I’ve been doing this whole “Performance Advocacy” thing a long time. So when word got out that I was turning my focus to “The Cloud” I started getting pings from developers who wanted help improving the performance of their cloud-based applications.

One colleague; Let’s call him “Carter” wrote a website called “Bait and Stitch” which helps fishing enthusiasts connect with designers in the area to create personal branded fishing apparel. Over the holiday break, he saw a significant number of users on the app, and got some feedback that the app was pretty slow to get moving the first time they would use it during the day, and after that, it got significantly faster.

Typically, when I hear feedback like this, it means one thing : Cold Boot performance.

When an App Engine instance is spun up for the first time, it has to execute a bunch of work that only occurs during initial boot. If this extra work is too long, then it delays your instance from responding to user traffic (aka, that’s bad).

Typically, Cold-Boot problems on Google App Engine (standard) are very rare, which is what interested me about this particular issue. Carter’s users had started complaining about these performance woes, which meant it was time for he and I to sit down with the code, grab a cup of Tea, and figure out what was going wrong.

Too busy to read? Check out the video above for the summary.

Profiling startup time with Logging

Any time you get feedback on performance, the first step should be to check out what timing information the system gives you, without you having to start digging around too much, or wasting time building an entire profiling harness. Luckily for us App Engine logs information about each request, including how long it took for the request to be handled by your service. A glance at the logs in google cloud console confirmed my suspicion: We’ve got a problem with cold boot time.

The Google Cloud Console will show you logging data for requests that come through your instance.

The top request, highlighted in blue, is my “loading request”. It’s the request that caused an instance to be spun up. Clicking on it confirms this state; the “loading request” flag is set to 1, and a log entry added to indicate that this is a loading request.

Requests which cause instances to be created will have some annotated data in the log.

For our cases, we’re interested in how long our cold-boot time took, which is can be displayed in the log synopsis that’s visible once you click on an item. This is useful to get a general sense of how your loading latency adjusts over time, but it didn’t help me understand part of the request might go to loading time vs request servicing. At least we’ve confirmed what the users are reporting is actually happening, but we don’t know why yet.

Profiling startup time w/ Tracing

Stackdriver Tracing is built into App Engine Standard; Requests get logged, and details about their RPC calls get listed as well. It can easily provide quick insight into where your time is going during a request, To find the startup items, search for elements have latency higher than their neighbors (like the 1400ms request below, which is 10x slower than other requests to the same location).

Selecting it will bring up a trace, where cloud_debugger.DebugletStarted takes up most of the time. This block only executes when a new instance has been spawned, so it’s a clear sign that this was a request that caused a cold-boot to occur.

If you look in the summary view, it’ll tell you the traced time vs untraced time, so that you can see that this warmup took 1209ms (the actual request was 195ms).

You can see in the graph on the left, that there was about 900ms before the socket was created to allow a connection to the instance. This is a rough an indicator of the startup time taken.

Furthermore, tracing breaks out any RPC calls your application makes, so you can see what you might be blocked on.

Tracing shows us some pretty important information: The RPC calls aren’t the problem, and neither is the general purpose startup time of GAE; something is going on inside my startup code that’s causing spikes in performance.

Custom Tracing metrics

Now, Bait-and-Stitch’s loading time is about 1.2 seconds; We’ve ruled out App Engine as a cause, and have made sure that there’s no weird RPC calls being made that’s causing a problem. Sadly we still don’t know what the issue is. To get more insight, we need to profile more of our code.

See, Tracing does a great job of keeping a tab of what RPC function calls take; but it doesn’t track all custom code that occurs between those calls. To do that, we need to leverage The Stackdriver Trace API, that allows us to patch the existing traces that App Engine is generating with our OWN data, so that it’ll all show up in one nice place.

Carter and I were able to write a small utility class which tracked the start-and-end time of various code segments, and would turn those results into a JSON string, on demand

Now, we formatted this JSON string to match exactly what Stackdriver Tracing accepts for its exposed Rest API. Once we got the auth flow set up, and dug around for a few other pieces of information, Stackdriver did all the heavy lifting.

The result, was that any request we send out, we now had custom tracing data for our blocks of code that stack trace wasn’t already monitoring.

Make sure you add Peanut Butter to your applications; it helps giving them a creamier flavor.

Pretty neat, eh?

Sanity check: How fast is GAE startup time?

But let’s step back a second and confirm something: Just how fast is GAE startup time with a hello-word application?

With the tests above, I was able to start getting timing data about how fast GAE startup time is bare-bones. To do that, I wrote a script that killed all the instances, sent out one request, and grabbed the logging time for cold-boot request.

The result of the timings was pretty interesting: App Engine startup time is really fast.

To give a cleaner picture of things, I decided to test a bare-bones app with various types of configurations. The following graph which charts instance type vs startup time for a basic hello-world application:

250ms is pretty fast for F2 instances to boot up. That’s faster than fetching a Javascript file from most CDNs on a 4G connection. Which means GAE by itself is really good at responding to instance creation time.

(Note, if you’re wondering why the B1,B2,B4 instances have significantly slower time, we will talk about that in a future post)

My loading time is at about 1.2 seconds, so obviously, this isn’t GAE’s fault, but something going on in the “Bait and Stitch” code.

Every millisecond counts

With the custom tracing set up, we now had a very clear picture of what was going wrong in the “Bait and Stitch” application’s cold boot time… and.. it wasn’t pretty. But to figure out the problem, we needed to dig a little deeper, which is a subject for a different article. ;)