Introduction

The Go Performance Tuning guide is a short step-by-step guide for performance optimization of Go applications. It also includes a collection of optimization patterns based on practice and various benchmarks. This guide addresses long-running operational applications that are deployed to production or test environments with real or simulated conditions such as scale and traffic.

When starting any performance tuning effort, it is important to keep in mind that even if an application is optimizable for speed or efficiency, it does not necessarily have to be optimized. It is advisable to optimize application performance only if the improvements can be justified; for example, in terms of response latency or resource efficiency-related benefits for the business.

Efficiency Optimization vs. Latency Optimization

Two different but related aspects of application performance are resource efficiency and operation latency. Efficiency optimization aims to reduce the entire program's resource consumption, while latency optimization focuses on the latency of a particular function (e.g. an HTTP request handler). While resource efficiency improvements will often positively affect latency, latency optimization may intentionally increase resource consumption (e.g. through caching). This is usually well justified when the losses caused by poor latency are higher than the infrastructure costs.

Some parts of the program will naturally consume more resources than others. To improve efficiency, the program's resource consumption hot spots need to be identified. Hot spots represent the code that consumes a significant amount of resources, whether CPU, memory or bandwidth.

In order to improve latency, identifying latency bottlenecks is necessary. Bottlenecks represent the code that consumes a significant amount of time, expressed in CPU or off-CPU time, when processing a single task. Unlike program hot spots, bottlenecks influence latency of a particular operation irrespective of their resource footprint.

Normally, both representations of program performance - hot spots and bottlenecks - are closely related and have to be addressed together.

Algorithm Efficiency

Programs can be designed to perform the same task using a different number of operations, often orders of magnitude different. Imagine a dictionary that does a key scan for every access. Add a hash function, rearrange its data and the scan isn't necessary, turning its time complexity from O(n) to O(1).
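The dictionary example above can be sketched in a few lines. This is only an illustration: the entry type and the function names are hypothetical, and Go's built-in map stands in for the hash-based layout.

```go
package main

import "fmt"

// entry is a key/value pair in the scan-based dictionary (hypothetical type).
type entry struct {
	key   string
	value int
}

// scanLookup walks the entries until the key matches: O(n) per access.
func scanLookup(entries []entry, key string) (int, bool) {
	for _, e := range entries {
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}

// buildIndex rearranges the same data into a hash map once, so that
// later lookups take O(1) time on average.
func buildIndex(entries []entry) map[string]int {
	index := make(map[string]int, len(entries))
	for _, e := range entries {
		index[e.key] = e.value
	}
	return index
}

func main() {
	entries := []entry{{"a", 1}, {"b", 2}, {"c", 3}}

	v, ok := scanLookup(entries, "c")
	fmt.Println(v, ok) // 3 true

	index := buildIndex(entries)
	v, ok = index["c"]
	fmt.Println(v, ok) // 3 true
}
```

Building the index costs O(n) once, so the rearrangement pays off only when the data is looked up repeatedly.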

The study of the computational efficiency of algorithms is part of theoretical computer science. For in-depth coverage of performance-oriented algorithm design, these books can be useful:

The Art of Computer Programming, Donald Knuth

Introduction to Algorithms, Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein

Programming Pearls, Jon Bentley

Algorithm optimizations usually provide the highest performance gains. It is a good idea to first consider algorithm improvements, if applicable, and only afterwards focus on the language-specific optimizations.

Language-specific optimizations focus on the efficient use of the programming language. The Go Performance Patterns section lists many such recommendations. Additionally, many micro-optimizations are automatically taken care of by the compiler.

To locate program hot spots and bottlenecks in production and simulated test environments, low-overhead sampling profilers are the most suitable. We prefer the Instana Go Profiler. It uses Go's built-in pprof to automatically and securely profile remote applications in any environment.

To analyze isolated parts of the program, benchmark profiling can be used. It is described in the Benchmark Profiling with pprof blog post.
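As a rough sketch of what benchmarking an isolated function can look like, the program below runs a standard Go benchmark through testing.Benchmark so that it is runnable on its own. The joinWords function is a hypothetical stand-in for code extracted from a hot spot. In a real project the benchmark would normally live in a _test.go file and be run with `go test -bench=. -cpuprofile=cpu.out`, after which `go tool pprof cpu.out` can inspect the profile.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// joinWords is a hypothetical stand-in for code isolated from a hot spot.
func joinWords(words []string) string {
	return strings.Join(words, " ")
}

// BenchmarkJoinWords has the signature expected by `go test -bench`.
func BenchmarkJoinWords(b *testing.B) {
	words := []string{"go", "performance", "tuning", "guide"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark runs the benchmark outside of `go test`.
	result := testing.Benchmark(BenchmarkJoinWords)
	fmt.Println(result.String(), result.MemString())
}
```

The printed figures (iterations, ns/op, B/op, allocs/op) depend on the machine, so they are best compared between runs on the same hardware.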

Efficiency Optimization Procedure

The following procedure is intended to serve as a reference for efficiency optimization work. The actual procedure may vary greatly depending on the application and the nature of the hot spots.

1. Turn on the Go profiler. See Instana's profiler setup instructions.
2. Locate a CPU, memory and/or blocking time hot spot. See hot spot profiling reference.
3. Analyze the source code at the hot spot location.
4. If an optimization is obvious, jump to step 8.
5. Isolate the hot spot code into a standalone function.
6. Write a benchmark test for the isolated function.
7. Benchmark and optionally profile the function. See Benchmark Profiling with pprof blog post.
8. Optimize the code. See the Go Performance Patterns section.
9. Apply the optimization to the application.
10. Deploy the application.
11. Repeat from step 2.

Latency Optimization Procedure

The following procedure is intended to serve as a reference for latency improvement work. The actual procedure may vary greatly depending on the application and the nature of the bottlenecks.

1. Turn on Go profiling. See Instana AutoProfile setup instructions.
2. Define latency-relevant functions, e.g. HTTP handlers.
3. Locate a top bottleneck for the most important function. See bottleneck profiling reference.
4. Analyze the source code at the bottleneck location.
5. If an optimization is obvious, jump to step 9.
6. Isolate the code into a standalone function.
7. Write a benchmark test for the isolated function.
8. Benchmark and optionally profile the function. See Benchmark Profiling with pprof blog post.
9. Optimize the code. See the Go Performance Patterns section.
10. Apply the optimization to the application.
11. Deploy the application.
12. Repeat from step 3.

Go Performance Patterns

When application performance is a critical requirement, the use of built-in or third-party packages and methods should be considered carefully. The cases when a compiler can optimize code automatically are limited. The Go Performance Patterns are benchmark- and practice-based recommendations for choosing the most efficient package, method or implementation technique.

Some points may not be applicable to a particular program; the actual performance optimization benefits depend almost entirely on the application logic and load.

Parallelize CPU work

When the work can be parallelized without too much synchronization, taking advantage of all available cores can speed up execution almost linearly with the number of physical cores.
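A minimal sketch of this pattern, assuming the work is a simple sum that splits cleanly into chunks. The function name parallelSum is hypothetical. Note that runtime.NumCPU reports logical CPUs, which may exceed the number of physical cores.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum splits nums into one chunk per worker and sums the chunks
// concurrently. Each worker writes only to its own slot in partial, so the
// only synchronization needed is the final WaitGroup.Wait.
func parallelSum(nums []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	partial := make([]int, workers)
	chunk := (len(nums) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(nums) {
			break
		}
		end := start + chunk
		if end > len(nums) {
			end = len(nums)
		}
		wg.Add(1)
		go func(w int, part []int) {
			defer wg.Done()
			sum := 0
			for _, n := range part {
				sum += n
			}
			partial[w] = sum
		}(w, nums[start:end])
	}
	wg.Wait()

	total := 0
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(parallelSum(nums, runtime.NumCPU())) // 500500
}
```

For work this small the goroutine overhead outweighs the gain; the split only pays off when each chunk carries substantial CPU work.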

Make multiple I/O operations asynchronous

Network and file I/O (e.g. a database query) is the most common bottleneck in I/O-bound applications. Running independent I/O operations concurrently instead of one after another can reduce the overall latency of the calling operation. Use sync.WaitGroup to wait for all of the operations to finish.
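The following sketch shows independent calls running concurrently and being joined with sync.WaitGroup. The fetch function is a hypothetical placeholder that only sleeps to simulate an I/O wait; a real application would issue a query or request there.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is a hypothetical stand-in for an independent I/O call, such as a
// database query; it only sleeps to simulate the wait.
func fetch(id int) string {
	time.Sleep(50 * time.Millisecond)
	return fmt.Sprintf("result-%d", id)
}

// fetchAll starts every call in its own goroutine and waits for all of them.
// Each goroutine writes to a distinct slice index, so no mutex is required.
func fetchAll(ids []int) []string {
	results := make([]string, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			results[i] = fetch(id)
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	start := time.Now()
	results := fetchAll([]int{1, 2, 3})
	// The total wait is close to one call's latency rather than the sum of three.
	fmt.Println(results, time.Since(start).Round(10*time.Millisecond))
}
```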

Avoid memory allocation in hot code

Object creation not only requires additional CPU cycles, but will also keep the garbage collector busy. It is a good practice to reuse objects whenever possible, especially in program hot spots. You can use sync.Pool for convenience. See also: Object Creation Benchmark
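One possible use of sync.Pool is reusing byte buffers across calls. The sketch below assumes Go 1.18 or later (for the any alias); the render function and the greeting format are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers; New is called only when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render formats a greeting using a pooled buffer instead of allocating a
// new buffer on every call.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies the bytes, so returning the buffer to the pool is safe
}

func main() {
	fmt.Println(render("gopher")) // hello, gopher
	fmt.Println(render("go"))     // hello, go
}
```

Pooled objects can be dropped by the garbage collector at any time, so the pool is a cache for temporary objects and not a place to keep state.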

Favor lock-free algorithms

Lock-based synchronization often leads to contention between goroutines. Avoiding mutexes whenever possible will have a positive impact on efficiency as well as latency. Lock-free alternatives to some common data structures are available (e.g. circular buffers).
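A full lock-free circular buffer is beyond a short example, so the sketch below shows the simplest case of the same idea: a shared counter updated with sync/atomic instead of a mutex. The countEvents function is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countEvents increments a shared counter from several goroutines using an
// atomic add instead of a mutex-protected increment.
func countEvents(goroutines, perGoroutine int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				atomic.AddInt64(&counter, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter)
}

func main() {
	fmt.Println(countEvents(8, 1000)) // 8000
}
```

The WaitGroup here only joins the goroutines at the end; the increments themselves never block each other on a lock.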

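Use read-only locks

The use of full locks for read-heavy synchronized variables will unnecessarily make reading goroutines wait for each other. Use a read-write lock (sync.RWMutex) and take only the read lock (RLock) when reading, so that readers wait only while a writer holds the lock.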
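In Go, read locks are provided by sync.RWMutex. The sketch below guards a map with one; the Config type and its methods are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a hypothetical read-heavy shared structure.
type Config struct {
	mu     sync.RWMutex
	values map[string]string
}

// NewConfig returns an empty Config ready for use.
func NewConfig() *Config {
	return &Config{values: make(map[string]string)}
}

// Get takes the read lock, so any number of readers can proceed at once.
func (c *Config) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.values[key]
	return v, ok
}

// Set takes the exclusive write lock, blocking readers and other writers.
func (c *Config) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[key] = value
}

func main() {
	cfg := NewConfig()
	cfg.Set("region", "eu-west")

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cfg.Get("region") // concurrent readers do not block each other
		}()
	}
	wg.Wait()

	v, ok := cfg.Get("region")
	fmt.Println(v, ok) // eu-west true
}
```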

Use buffered I/O

Disks operate in blocks of data. Accessing the disk for every byte is inefficient; reading and writing bigger chunks of data greatly improves the speed. See also: File I/O Benchmark
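In Go, the standard bufio package provides buffering on top of any reader or writer. The sketch below writes many short lines through a bufio.Writer and reads them back with a bufio.Scanner; the function names and the temporary file name are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// writeLines writes n short lines through a bufio.Writer, which collects
// them in memory and hands them to the file in larger chunks.
func writeLines(path string, n int) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	for i := 0; i < n; i++ {
		if _, err := fmt.Fprintf(w, "line %d\n", i); err != nil {
			return err
		}
	}
	return w.Flush() // without Flush, data still in the buffer would be lost
}

// countLines reads the file back through a bufio.Scanner, which also
// reads from the file in larger chunks.
func countLines(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	count := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		count++
	}
	return count, sc.Err()
}

// fileRoundTrip writes n lines to a temporary file and counts them again.
func fileRoundTrip(n int) (int, error) {
	dir, err := os.MkdirTemp("", "bufio-example")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "data.txt")
	if err := writeLines(path, n); err != nil {
		return 0, err
	}
	return countLines(path)
}

func main() {
	n, err := fileRoundTrip(1000)
	fmt.Println(n, err) // 1000 <nil>
}
```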

Use strings.Builder or bytes.Buffer instead of the += operator

Go strings are immutable, so every += allocates a new string and copies the existing contents into it. This is inefficient in loops and should be avoided. See also: String Concatenation Benchmark.
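A short comparison of the two approaches, using strings.Builder from the standard library. The function names are hypothetical, and both functions return the same result; they differ only in how many allocations they make.

```go
package main

import (
	"fmt"
	"strings"
)

// concatNaive allocates a new string on every iteration.
func concatNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder appends into one growing buffer; Grow reserves the final
// size up front so the buffer is allocated once.
func concatBuilder(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	var sb strings.Builder
	sb.Grow(total)
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"go", "-", "perf"}
	fmt.Println(concatNaive(parts) == concatBuilder(parts)) // true
	fmt.Println(concatBuilder(parts))                       // go-perf
}
```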

Use compiled regular expressions for repeated matching

It is inefficient to compile the same regular expression before every match. While obvious, it is often overlooked. See also: Regexp Benchmark.
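The difference can be sketched as follows. The order ID pattern and both function names are hypothetical; regexp.MatchString compiles its pattern on every call, while regexp.MustCompile compiles it once.

```go
package main

import (
	"fmt"
	"regexp"
)

// orderID is compiled once at package initialization and then reused.
// The pattern itself is a hypothetical example.
var orderID = regexp.MustCompile(`^ORD-[0-9]{4}$`)

// isOrderIDSlow recompiles the pattern on every call.
func isOrderIDSlow(s string) bool {
	matched, err := regexp.MatchString(`^ORD-[0-9]{4}$`, s)
	return err == nil && matched
}

// isOrderID reuses the precompiled expression.
func isOrderID(s string) bool {
	return orderID.MatchString(s)
}

func main() {
	fmt.Println(isOrderID("ORD-1234"), isOrderIDSlow("ORD-1234")) // true true
	fmt.Println(isOrderID("ORD-12"), isOrderIDSlow("ORD-12"))     // false false
}
```

A compiled *regexp.Regexp is safe for concurrent use, so a single package-level value can serve all goroutines.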

Preallocate slices

Go manages dynamically growing slices intelligently; when the current capacity is reached, append allocates a larger underlying array (roughly doubling the capacity for small slices and growing by a smaller factor for large ones). During re-allocation, the underlying array is copied to a new location. To avoid copying the memory and putting extra load on the garbage collector, preallocate the slice fully whenever possible. See also: Slice Appending Benchmark.
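When the final length is known in advance, make with an explicit capacity avoids the intermediate reallocations. A minimal sketch with hypothetical function names:

```go
package main

import "fmt"

// squaresGrow lets append grow the backing array as needed, which
// reallocates and copies several times for large n.
func squaresGrow(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// squaresPrealloc reserves the full capacity up front, so append never
// has to reallocate.
func squaresPrealloc(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

func main() {
	a, b := squaresGrow(1000), squaresPrealloc(1000)
	fmt.Println(len(a), len(b)) // 1000 1000
	fmt.Println(cap(b))         // 1000
}
```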

Use Protocol Buffers or MessagePack instead of JSON and Gob

JSON and Gob use reflection, which is relatively slow due to the amount of work it does. Gob serialization and deserialization is comparably fast, though, and may be preferred as it does not require type generation. See also: Serialization Benchmark.
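Protocol Buffers and MessagePack require third-party packages and, for Protocol Buffers, generated types, so they are not shown here. The sketch below only illustrates the point about Gob: a plain struct can be encoded and decoded with the standard library and no type generation. The Event type and gobRoundTrip function are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Event is a hypothetical message type; gob encodes its exported fields.
type Event struct {
	ID   int
	Name string
}

// gobRoundTrip encodes e with gob and decodes it back into a new value.
func gobRoundTrip(e Event) (Event, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(e); err != nil {
		return Event{}, err
	}
	var out Event
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := gobRoundTrip(Event{ID: 7, Name: "deploy"})
	fmt.Println(out, err) // {7 deploy} <nil>
}
```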

Use int keys instead of string keys for maps

If the program relies heavily on map access, using int keys instead of string keys may be worthwhile where applicable, since int keys are cheaper to hash and compare. See also: Map Access Benchmark.

See also: