Performance analysis and tuning is a powerful technique to verify if client focused performance requirements are being met. Performance analysis can be used to characterize where a program is spending its time and supports a scientific approach to testing the efficacy of tuning experiments. This post defines a general approach for performance analysis and tuning using a go HTTP service as an example. Go is particularly well suited for performance analysis and tuning because it provides pprof profiling tool chain in its standard library.

Strategy

Let’s try to establish a loose framework to begin a structured analysis of our program. The only thing we will try and do will be to use data to guide our decisions instead of making changes based on intuition or hunches. In order to do this we will:

Establish which dimensions we’re optimizing for (Requirements)

Create a harness to apply transactional load to the system

Execute a Test— (Generate data)

Observe

Analyze — Are the requirements met?

Tune — Scientific Method — Form A Hypothesis

Execute experiment to test this hypothesis

Simple HTTP Server Architecture

For this blog post we’ll be using a small HTTP golang server. All code for this post can be found at: https://github.com/dm03514/sre-tutorials/tree/master/performance/analysis-methodology-simple-http

The application that will be analyzed is a go HTTP server that queries PostgreSQL on each request. Additionally, there is Prometheus, node_exporter, and Grafana in order to collect and visualize application and system level metrics:

For simplicity this post assumes that in order to scale horizontally (and simplify our calculations) each HTTP service and Postgres Database will be deployed together:

Determine Goals (Dimensions)

This step outlines a specific goal. What are we trying to analyze? How do we know when our efforts are complete?

For this post we’ll pretend that we have clients and that on average our service will be called 10000 requests / second.

The Google SRE Book goes into a lot of depth on how to choose and model these. In the spirit of SRE we’ll model ours as:

Latency — 99% of requests should complete in < 60ms

— 99% of requests should complete in < 60ms Cost — The service should use the minimal amount of money that we think is reasonably possible. In order to do this throughput should be maximized.

— The service should use the minimal amount of money that we think is reasonably possible. In order to do this throughput should be maximized. Capacity Planning — An understanding of how many instances will be required for launch and the general scaling function should be documented. What’s the number of instances we’ll need to meet our expected initial load requirements and achieve n+1 redundancy?

Latency may be require an optimization in addition to analysis but throughput is an analysis. Using SRE SLO process the latency requirement would come from the client and or business as represented by the product owner. What’s really cool about this is that our service may be able to meet this commitment from the outset without any required tuning!

Setup Test Harness

The test harness will apply a fixed amount load to our system. In order to analyze the performance of the HTTP service data will have to be generated.

Transactional Load

This harness just uses Vegeta to make HTTP requests at a configurable rate until stopped:



echo "POST $ make load-test LOAD_TEST_RATE=50echo "POST http://localhost:8080 " | vegeta attack -body tests/fixtures/age_no_match.json -rate=50 -duration=0 | tee results.bin | vegeta report

Observe

During execution an unending transaction load will be applied (load test). In addition to application (requests rate, request latencies) and system level metrics (memory, CPU, IOPS) this time the application will be profiled in order to understand where it is spending its time.

Profiling is a type of measurement and allow us to to see where the program is spending its time. It reports where a program is spending its time. Profiling can be used to identify which functions are being called and how much time the program is spending in each one:

This data can be used during analysis to gain visibility into unnecessary work that the program is doing. Go (pprof) is able to generate profiles, and visualize them as a flame graph, using the standard toolchain. We’ll cover using these to guide tuning decisions later in this post.

Execute/Observe/Analyze

Let’s begin executing experiments. We’ll execute, observe and analyze until we our performance requirements have been invalidated. We’ll choose an arbitrary low amount of load to apply in order to generate the first observations and analysis. Each step that our performance requirements hold true we’ll increase the load by some random-ish scaling factor. Each of the load tests are executed by adjustng the rate of: make load-test LOAD_TEST_RATE=X .

50 requests / second

Notice the top two graphs. The upper left hand shows us that our application is registering that it is processing 50 requests per second (from its point of view) and the upper right reporting the duration of each HTTP request. These together help us observe and analyze if our performance requirements are being met. The red line on the HTTP Request Latency is at the SLO of 60 ms. This shows that we are well below it for our max response times.

On the cost front:

10k requests / second / 50 requests / machine = 200 machines + 1

we can still do much better.

500 Requests / second

Things start to get more interesting when 500 requests per second are applied:

Once again the upper left hand graph shows that the application is registering the intended load. If it weren’t it might identify an issue in the host machine between the load test command and the application server. The latency graph in the upper right graph shows that applying 500 requests / second send the tail up to 25–40ms per HTTP requests. The p99 is still well within the 60ms SLO that was specified above.

On the cost front:

10k requests / second / 500 requests / machine = 20 machines + 1

we can still do much better!

1000 requests / second

This is a great one! The program is logging 1000 requests / second, but on the latency side the SLO has been violated. This can be seen by the P99 line in the upper left hand graph. While the tail p100 max is much higher p99 tracks above the 60ms maximum. It’s time to dive in and profile what the application is actually doing.

Profile

In order to profile we’ll apply 1000 request per second load and then use pprof to sample the stacks in order to get an idea of where our program is spending its time. This is done by enabling the pprof HTTP endpoint and curling a trace while load is being applied:

The trace can then be visualized using:

$ go tool pprof -http=:12345 cpu.1000_reqs_sec_no_optimizations.prof

The flame graph shows where the application is spending its time and how much time time is being spent! From Brendan Gregg’s description:

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth, counting from zero at the [top]. Each rectangle represents a stack frame. The wider a frame is is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its [children]. The colors are usually not significant, picked randomly to differentiate frames.

Analysis — Hypothesis

In order to guide tuning we’ll focus on trying to find unnecessary work. We’ll try and find the largest source of unnecessary work and remove it. Since profiling exposes where exactly the service is spending its time it requires identifying potentially duplicate work, modifying the code to change it, re running the test and observing if the performance is closer to its goal.

Following Bredan Gregg’s description the go pprof flame graph is read from top to bottom. Each row represents a stack frame (function call). The first row is the entry point to the program and is the parent of all other calls (ie all other calls have the first line in their stack). The next row diverges from this:

Hovering over the function name in the flame graph shows the total amount of time it was on the stack during tracing. HTTPServe was on the stack for 65% of the profiles while various go runtime methods runtime.mcall , mstart , gc compose the rest of this. One interesting thing is that 5% of the program is being spent in DNS Lookups:

The only addresses that the program should be resolving are for Postgres. Clicking on FindByAge shows:

This is interesting it shows roughly 3 main sources which add to the latency: Closing/Releasing Connections, Querying Data, and Connecting.Based on this flame graph DNS lookups and connection closing/opening amounts for ~13% of the total service time.

Hypothesis: Reusing connections through pooling will reduce HTTP transaction time allow for higher throughput and lower latency.

Tune the application — Experiment

The program will be updated to try and remove the postgres connection on every request. One way to do this is to use an application level connection pool. This experiment will configure a connection pool using the go sql driver pool configuration options:

db, err := sql.Open("postgres", dbConnectionString)

db.SetMaxOpenConns(8)



if err != nil {

return nil, err

}

Execute/Observe/Analyze

Re-Runing the 1000 test load test shows that the P99 HTTP Request Latency is well within the 60ms SLO!

But on the cost front:

10k requests / second / 1000 requests / machine = 10 machines + 1

Let’s try and do better!

2000 Requests / Second

Doubling the requests show the same. Once again the upper left hand graph shows that the application is receiving 2000 requests per second, and that while the p100 max client latency is above 60ms, the p99 is still within the SLO.

On the cost side:

10k requests / second / 2000 requests / machine = 5 machines + 1

3000 Requests / second

At this point the service is able to achieve a rate of 3000 requests / second with < 60ms p99 latency, with the p100 rising significantly from 100–250ms to 250ms-1000ms between this run and the 2000 requests / second run. While the SLO isn’t being violated and costs are acceptable with:

10k requests / second / 3000 requests / machine = 4 machines + 1

Let’s try one more round of analysis.

Analysis — Hypothesis

Generating and visualizing a profile of the application with 3000 requests / second load being applied shows:

There is still 6% of the FindByAge transaction length due to Dialing Connections!! Establishing a connection pool has increased performance but its still observed that the program is still doing repeated work by continuing to make new DB connections!

Hypothesis: Even though connections are being pooled they are still being recycled and cleaned up resulting in the application having to reconnect. Adjusting the number of connections idle to equal the pool size should help with latency by further minimizing the amount of time the application spends in creating database connections.

Tune the application — Experiment

Let’s try and set the MaxIdleConns equal to the pool size (also described here):

db, err := sql.Open("postgres", dbConnectionString)

db.SetMaxOpenConns(8)

db.SetMaxIdleConns(8) if err != nil {

return nil, err

}

Execute/Observe/Analyze

3000 requests / second

p99 is well below the <60ms! @ 3000 requests per second with a lower p100!

Looking closely at the flame graph below now shows that the connection dial has no longer shows up! Looking closely at pg(*conn).query a whole leg of dialing is no longer there.

Conclusion

Performance analysis is critical for understanding if client expectations and non-functional requirements are being met. Analyzing performance by qualifying observations against client expectations can help in determining what is acceptable performance vs what is unacceptable. Go provides amazing utilities baked into the standard library to make this sort of analysis easy and accessible.

I appreciate you reading this and would love feedback!