I’m unreasonably keen on making things faster. Here’s a story about making some Go code 100x faster using the amazing tools that you’ll find in the box with the compiler. This isn’t done by making your code weird and unmaintainable, just by finding slow stuff and replacing it with faster stuff.

For this exercise I thought it would be interesting to try to determine the “diameter” of a graph. This is close to stuff I do at work (we use graphs to catch bad people at ravelin.com) and handily takes a long time to calculate.

The diameter is the length of the longest shortest path in the graph. Two points in a graph have a shortest path between them. One pair of points will have a longer shortest path than others. The length of this path is the diameter. Any pair of points will be at most this far apart.

So what’s our initial naive implementation? Well, each of our nodes is identified by a string.

type nodeId string

And each node has a number of adjacent nodes that it’s directly connected to. We’ll keep these in a map, so we can easily add adjacent nodes without risking duplicates.

type node struct {
	id  nodeId
	adj map[nodeId]*node
}

We need to be able to find all our nodes by their id (at least as we build up the structures from the input list of edges), so we’ll use a map for that too.

type nodes map[nodeId]*node

To find the shortest path between two nodes we can use Breadth First Search (BFS). Now, BFS starting at any node will allow us to find the shortest path from that node to every other node, so we just need to run BFS starting at every node and we can find the diameter. Here’s the code to do that. The key elements are somewhere to track when we’ve visited each node and how many steps we’ve taken to reach it, and a queue of nodes to consider next.

The code for this first version is here: https://github.com/philpearl/graphblog/commit/f4742fb1c65a896562052990780fe27b9ce85e3f
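For readers who would rather not follow the link, here is a minimal sketch of what a naive version along these lines could look like. It is not the code from the commit: the helper names get and addEdge, and the use of a map to track depth, are assumptions made for illustration. The queue is the standard container/list.

```go
package main

import (
	"container/list"
	"fmt"
)

type nodeId string

type node struct {
	id  nodeId
	adj map[nodeId]*node
}

type nodes map[nodeId]*node

// get returns the node for id, creating it on first use (hypothetical helper).
func (nd nodes) get(id nodeId) *node {
	n, ok := nd[id]
	if !ok {
		n = &node{id: id, adj: make(map[nodeId]*node)}
		nd[id] = n
	}
	return n
}

// addEdge records an undirected edge between a and b (hypothetical helper).
func (nd nodes) addEdge(a, b nodeId) {
	an, bn := nd.get(a), nd.get(b)
	an.adj[b] = bn
	bn.adj[a] = an
}

// longestShortestPath runs BFS from start and returns the number of steps
// needed to reach the farthest node.
func (nd nodes) longestShortestPath(start nodeId) int {
	depth := map[nodeId]int{start: 0} // also serves as the visited set
	q := list.New()
	q.PushBack(nd[start])
	longest := 0
	for q.Len() > 0 {
		n := q.Remove(q.Front()).(*node)
		d := depth[n.id] + 1
		for id, m := range n.adj {
			if _, seen := depth[id]; seen {
				continue
			}
			depth[id] = d
			if d > longest {
				longest = d
			}
			q.PushBack(m)
		}
	}
	return longest
}

// diameter runs BFS from every node and keeps the largest result.
func (nd nodes) diameter() int {
	best := 0
	for id := range nd {
		if d := nd.longestShortestPath(id); d > best {
			best = d
		}
	}
	return best
}

func main() {
	g := nodes{}
	g.addEdge("a", "b")
	g.addEdge("b", "c")
	g.addEdge("c", "d")
	fmt.Println(g.diameter()) // 3: the path a-b-c-d
}
```

Every BFS run here allocates a fresh map and a fresh list, and every push onto the list allocates an element, which is the kind of cost the rest of this story chips away at.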

If I run a benchmark calling diameter() for a network with 9997 edges, I get the following result.

BenchmarkDiameter/diameter-8 1 38108360293 ns/op 9170172832 B/op 82451888 allocs/op

One iteration takes 38s and allocates 82 million objects, so it’s worth trying to make this faster! I run the benchmark with CPU profiling (go test -bench . -cpuprofile cpu.prof), then run the profiler to create a .svg showing where the time has gone (go tool pprof -svg graphblog.test cpu.prof > cpu1.svg). Here’s the interesting section from that .svg.

In this initial version lots of time is spent in runtime functions associated with maps

Lots of the CPU usage is assigning to and iterating over maps in longestShortestPath(). Quite a lot of time also goes on pushing items onto the list. And if we look around the profile we see more time spent in garbage collection.

So it would be good if we can remove or improve on the maps and the list, or reduce allocations.

My first guess at an improvement is to switch from string node IDs to int32. These should be smaller and easier to hash and compare, so will perhaps make the map accesses quicker. I keep a “symbol table” that maps from the string node names to an int32. The first ID will be 0, and will increase by one for each new node. Here’s the symbol table.
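A minimal sketch of such a symbol table, assuming a plain map from name to ID. The method name getId is illustrative rather than taken from the commit.

```go
package main

import "fmt"

type nodeId int32

// symbolTable maps string node names to dense int32 IDs.
type symbolTable map[string]nodeId

// getId returns the ID for name, handing out the next unused ID the first
// time a name is seen. IDs start at 0 and increase by one.
func (s symbolTable) getId(name string) nodeId {
	id, ok := s[name]
	if !ok {
		id = nodeId(len(s))
		s[name] = id
	}
	return id
}

func main() {
	st := symbolTable{}
	fmt.Println(st.getId("alice"), st.getId("bob"), st.getId("alice")) // 0 1 0
}
```

Using len(s) as the next ID is what keeps the IDs contiguous, as long as nothing is ever deleted from the table.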

We then need to wrap our nodes and symbolTable into a graph.
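One way this wrapping might look is sketched below, with both pieces embedded in a graph struct. The constructor newGraph and the helpers get and addEdge are assumptions for illustration, not necessarily what the commit does.

```go
package main

import "fmt"

type nodeId int32

type symbolTable map[string]nodeId

func (s symbolTable) getId(name string) nodeId {
	id, ok := s[name]
	if !ok {
		id = nodeId(len(s))
		s[name] = id
	}
	return id
}

type node struct {
	id  nodeId
	adj map[nodeId]*node
}

type nodes map[nodeId]*node

// graph bundles the name-to-ID table with the nodes keyed by ID.
type graph struct {
	symbolTable
	nodes
}

func newGraph() *graph {
	return &graph{symbolTable: symbolTable{}, nodes: nodes{}}
}

// get returns the node for id, creating it on first use.
func (g *graph) get(id nodeId) *node {
	n, ok := g.nodes[id]
	if !ok {
		n = &node{id: id, adj: make(map[nodeId]*node)}
		g.nodes[id] = n
	}
	return n
}

// addEdge takes string names from the input and stores an undirected edge
// between the corresponding int32 IDs.
func (g *graph) addEdge(a, b string) {
	an := g.get(g.getId(a))
	bn := g.get(g.getId(b))
	an.adj[bn.id] = bn
	bn.adj[an.id] = an
}

func main() {
	g := newGraph()
	g.addEdge("a", "b")
	g.addEdge("b", "c")
	fmt.Println(len(g.nodes), len(g.nodes[g.getId("b")].adj)) // 3 2
}
```

Strings only appear at the edge of the graph, when building it from the input; everything inside works on the integer IDs.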

Here’s the commit with the complete set of changes. If we run the benchmark again, we see the run is about 30s, a full 8s faster.

BenchmarkDiameter/diameter-8 1 29804044414 ns/op 7311829424 B/op 82451563 allocs/op

If I look at the CPU profile, nothing much has changed and maps are still a big part of the problem. But now my nodeIds are contiguous integers starting at zero. Could I just use a slice to hold my nodes? The nodeId would simply be the index in the slice, so looking up the node data would be very quick. If I have a fixed number of nodes (or a reasonable upper bound) it’s an easy change to make. We can allocate an appropriately sized slice up-front and then we mostly just need to change some types.
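As a rough sketch of that change, assuming the number of nodes is known before the graph is built, the nodes type becomes a slice and lookup becomes indexing. The names newNodes, get and addEdge are illustrative.

```go
package main

import "fmt"

type nodeId int32

type node struct {
	id  nodeId
	adj map[nodeId]*node
}

// nodes is indexed directly by nodeId.
type nodes []node

// newNodes allocates room for n nodes up front.
func newNodes(n int) nodes {
	nd := make(nodes, n)
	for i := range nd {
		nd[i].id = nodeId(i)
		nd[i].adj = make(map[nodeId]*node)
	}
	return nd
}

// get is now an index operation instead of a map lookup.
func (nd nodes) get(id nodeId) *node { return &nd[id] }

func (nd nodes) addEdge(a, b nodeId) {
	an, bn := nd.get(a), nd.get(b)
	an.adj[b] = bn
	bn.adj[a] = an
}

func main() {
	nd := newNodes(3)
	nd.addEdge(0, 1)
	nd.addEdge(1, 2)
	fmt.Println(len(nd.get(1).adj)) // 2
}
```

Because the slice is never grown after newNodes returns, the pointers stored in the adj maps stay valid.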

The benchmarks improve again. We’re down to 14s from the original 38s.

BenchmarkDiameter/diameter-8 1 13927658892 ns/op 5563335728 B/op 81779354 allocs/op

This time the profiling data has changed.

The big costs now are as follows:

the allocations made when pushing onto the list in longestShortestPath,

and allocating the slices used to hold state information for each BFS run,

and there’s still some cost associated with the maps used for adjacent nodes.

Let’s take on the biggest target and replace the list.

We’ll make a list that we can reuse from one BFS run to another. The list needs to support taking nodes from the start and pushing nodes onto the end, and nothing else. It only needs to know how to store pointers to our node structures. And it would be nice if it re-used list elements, and cut down on the number of allocations, so we add an internal list of free elements.

Since this list is simpler than container/list, re-uses its elements where possible and doesn’t use interfaces, it may be a faster replacement even before we try to reuse it between BFS runs.
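A minimal sketch of a list along those lines is shown below: a singly linked queue of node pointers with an internal free list. The type and method names (nodeList, listElt, pushBack, popFront) are assumptions for illustration.

```go
package main

import "fmt"

type node struct{ id int32 }

// listElt is one link in the queue or in the free list.
type listElt struct {
	next *listElt
	n    *node
}

// nodeList is a FIFO queue of *node that recycles its elements.
type nodeList struct {
	head, tail *listElt
	free       *listElt // elements available for reuse
}

// pushBack appends n, taking an element from the free list if there is one.
func (l *nodeList) pushBack(n *node) {
	e := l.free
	if e == nil {
		e = &listElt{}
	} else {
		l.free = e.next
	}
	e.n, e.next = n, nil
	if l.tail == nil {
		l.head = e
	} else {
		l.tail.next = e
	}
	l.tail = e
}

// popFront removes and returns the first node, or nil if the list is empty.
// The element goes back on the free list.
func (l *nodeList) popFront() *node {
	e := l.head
	if e == nil {
		return nil
	}
	l.head = e.next
	if l.head == nil {
		l.tail = nil
	}
	n := e.n
	e.n, e.next = nil, l.free
	l.free = e
	return n
}

func main() {
	var l nodeList
	l.pushBack(&node{id: 1})
	l.pushBack(&node{id: 2})
	fmt.Println(l.popFront().id, l.popFront().id, l.popFront() == nil) // 1 2 true
}
```

Once the free list has warmed up, pushBack stops allocating, and there are no interface conversions on the way in or out.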

The benchmarks do indeed improve. We’re down to about 9.5s from 14s and the original 38s.

BenchmarkDiameter/diameter-8 1 9453521303 ns/op 1762047056 B/op 7737600 allocs/op

The profile data shows we’re still burning time in GC. This is likely on a background CPU so will make little impact on the elapsed time taken to calculate the diameter. But sorting this will reduce the overall impact of this code.

We can reuse the list simply by allocating it within diameter() and passing it into longestShortestPath() as a parameter. This doesn’t make a huge difference to the overall time, but cuts down the allocations immensely.

BenchmarkDiameter/diameter-8 1 9211307204 ns/op 1638428512 B/op 11301 allocs/op

With our re-used fast list, map iteration is now a big factor

Now map iteration is taking the most time. Let’s replace those adjacent node maps with slices. We don’t expect to have hundreds of adjacent nodes per node, so linear searches for duplicates will likely be at least as fast as using maps, if not faster. Also, it appears we just need the adjacent node IDs, not pointers to the nodes, so we’ll change that at the same time. This brings us down to just under 8s.

BenchmarkDiameter/diameter-8 1 7815493067 ns/op 1638427296 B/op 11273 allocs/op
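The slice-based adjacency can be sketched as follows, assuming a small helper (here called addAdj, a made-up name) that does the linear duplicate check before appending.

```go
package main

import "fmt"

type nodeId int32

type node struct {
	adj []nodeId // IDs of adjacent nodes, no duplicates
}

// addAdj appends id unless it is already present. The scan is linear, which
// is fine while each node has only a handful of neighbours.
func (n *node) addAdj(id nodeId) {
	for _, a := range n.adj {
		if a == id {
			return
		}
	}
	n.adj = append(n.adj, id)
}

func main() {
	var n node
	n.addAdj(1)
	n.addAdj(2)
	n.addAdj(1) // duplicate, ignored
	fmt.Println(n.adj) // [1 2]
}
```

Ranging over a slice of int32 is a simple walk over contiguous memory, whereas ranging over a map goes through the runtime's map iterator.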

makeslice and writebarrier are both associated with allocating the BFS state data

The next candidate to improve is the slice of state data we allocate in longestShortestPath() to track the progress of the breadth-first search. Can we reuse it? Well, we can re-use the memory by allocating it in diameter() rather than longestShortestPath(), but we’ll need to reset the contents between runs, so on the face of it we may be doing more work. But let’s have a try.
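The idea can be sketched like this. To keep the example short it swaps the linked list for a plain slice used as the queue, and it assumes a parent value of -1 marks an unvisited node; both are simplifications for illustration rather than the code from the repository.

```go
package main

import "fmt"

type nodeId int32

type node struct{ adj []nodeId }

// bfsNode is the per-node BFS state.
type bfsNode struct {
	parent nodeId // unvisited until BFS reaches the node
	depth  int
}

const unvisited nodeId = -1

// longestShortestPath runs BFS from start using caller-supplied state and
// queue storage, resetting the state first instead of allocating it.
func longestShortestPath(g []node, start nodeId, state []bfsNode, q []nodeId) int {
	for i := range state {
		state[i] = bfsNode{parent: unvisited}
	}
	state[start] = bfsNode{parent: start}
	q = append(q[:0], start)
	longest := 0
	for head := 0; head < len(q); head++ {
		id := q[head]
		d := state[id].depth + 1
		for _, a := range g[id].adj {
			if state[a].parent != unvisited {
				continue
			}
			state[a] = bfsNode{parent: id, depth: d}
			if d > longest {
				longest = d
			}
			q = append(q, a)
		}
	}
	return longest
}

// diameter allocates the state and queue once and shares them across runs.
func diameter(g []node) int {
	state := make([]bfsNode, len(g))
	q := make([]nodeId, 0, len(g))
	best := 0
	for id := range g {
		if d := longestShortestPath(g, nodeId(id), state, q); d > best {
			best = d
		}
	}
	return best
}

func main() {
	// A path 0-1-2-3.
	g := []node{{adj: []nodeId{1}}, {adj: []nodeId{0, 2}}, {adj: []nodeId{1, 3}}, {adj: []nodeId{2}}}
	fmt.Println(diameter(g)) // 3
}
```

The reset loop touches every element on every run, but it writes over memory that is already allocated, so there is nothing new for the allocator or the garbage collector to do.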

Turns out this is a huge win. We’re now down to 1.8s.

BenchmarkDiameter/diameter-8 1 1840585891 ns/op 185584 B/op 1233 allocs/op

Now the profile .svg doesn’t show us much. All the time is attributed to code we have written.

At this point our profile has become uninteresting!

Most of the time is spent in longestShortestPath(). If we look at this closely we can see we’re not really using the parent field in the bfsNode structure. We just use it to indicate whether we’ve visited this node before in the search. We can try removing that field, and using a depth of -1 to indicate the node has not been examined. This takes the time down to 1.5s (note I’ve added -benchtime 10s to my benchmark command line).

BenchmarkDiameter/diameter-8 10 1543700500 ns/op 101552 B/op 1229 allocs/op

If we assume our diameter won’t be greater than 32,000, we can replace our int depth with an int16 (which holds values up to 32,767). This reduces the amount of memory we need to update, and reduces our time to just under 1.4s.

BenchmarkDiameter/diameter-8 10 1389635165 ns/op 40112 B/op 1229 allocs/op
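With the parent field gone and the depth narrowed, the BFS state shrinks to one int16 per node. A sketch under the same assumptions as before (a slice used as the queue, illustrative names) might look like this.

```go
package main

import "fmt"

type nodeId int32

type node struct{ adj []nodeId }

// longestShortestPath runs BFS from start. depth[i] == -1 means node i has
// not been reached yet, so no parent field or visited flag is needed.
// Assumes no shortest path is longer than an int16 can hold.
func longestShortestPath(g []node, start nodeId, depth []int16, q []nodeId) int16 {
	for i := range depth {
		depth[i] = -1
	}
	depth[start] = 0
	q = append(q[:0], start)
	var longest int16
	for head := 0; head < len(q); head++ {
		id := q[head]
		d := depth[id] + 1
		for _, a := range g[id].adj {
			if depth[a] != -1 {
				continue
			}
			depth[a] = d
			if d > longest {
				longest = d
			}
			q = append(q, a)
		}
	}
	return longest
}

func main() {
	// A path 0-1-2.
	g := []node{{adj: []nodeId{1}}, {adj: []nodeId{0, 2}}, {adj: []nodeId{1}}}
	depth := make([]int16, len(g))
	q := make([]nodeId, 0, len(g))
	fmt.Println(longestShortestPath(g, 0, depth, q)) // 2
}
```

The reset loop now writes two bytes per node, which is where the saving comes from.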

Now we’ve got to a point where I’m running out of ideas. One last thing we can do is make this code much more Go. We’re running BFS from every node of a graph that doesn’t change. If we split those nodes up, we can run BFS from different sets of nodes on different CPU cores. This brings the execution time down to 0.3s from the original 38s.

BenchmarkDiameter/diameter-8 5 303704709 ns/op 309097 B/op 8969 allocs/op
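One way to split the work is sketched below: start one goroutine per CPU, give each its own BFS state, and let each take every Nth start node. The striding scheme and the use of sync.WaitGroup are assumptions for illustration; the repository may divide the nodes differently.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

type nodeId int32

type node struct{ adj []nodeId }

// longestShortestPath is BFS from start; depth -1 means "not reached yet".
func longestShortestPath(g []node, start nodeId, depth []int16, q []nodeId) int16 {
	for i := range depth {
		depth[i] = -1
	}
	depth[start] = 0
	q = append(q[:0], start)
	var longest int16
	for head := 0; head < len(q); head++ {
		d := depth[q[head]] + 1
		for _, a := range g[q[head]].adj {
			if depth[a] == -1 {
				depth[a] = d
				if d > longest {
					longest = d
				}
				q = append(q, a)
			}
		}
	}
	return longest
}

// diameter shares the read-only graph between workers. Each worker has its
// own depth slice and queue, and writes only to its own slot in results.
func diameter(g []node) int {
	workers := runtime.NumCPU()
	results := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			depth := make([]int16, len(g))
			q := make([]nodeId, 0, len(g))
			for id := w; id < len(g); id += workers {
				if d := int(longestShortestPath(g, nodeId(id), depth, q)); d > results[w] {
					results[w] = d
				}
			}
		}(w)
	}
	wg.Wait()
	best := 0
	for _, d := range results {
		if d > best {
			best = d
		}
	}
	return best
}

func main() {
	// A path 0-1-2-3.
	g := []node{{adj: []nodeId{1}}, {adj: []nodeId{0, 2}}, {adj: []nodeId{1, 3}}, {adj: []nodeId{2}}}
	fmt.Println(diameter(g)) // 3
}
```

No locking is needed because the graph is never modified during the search and no two goroutines write to the same memory.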

The code for all this is at https://github.com/philpearl/graphblog, with each stage along the journey as a separate commit, and the profiling .svgs added in a final commit.

If you’ve got this far then congratulations! I was getting a bit tired near the end too! If you’ve enjoyed this please hit the recommend heart so I can bask in the glow.

Oh, and I think there’s one final improvement we can make to halve the running time. Let me know what you think this is in the comments.