background

with tens of thousands of web sites using our authoritative dns server, we respond to millions of requests every day. dns attacks are getting more and more widespread these days; dns is a crucial part of our system and we must make sure that we can perform well under high pressure.

dns-flood is a small tool I found capable of producing a huge number of udp requests.

# timeout 20s ./dnsflood example.com 127.0.0.1 -p 2053

monitoring our systems showed that memory usage of our service was growing so fast that we had to stop our service or we would run into OOM errors. this looks like a memory leak; there are different causes for "kind-of" memory leaks and "real" memory leaks in go:

hanging goroutines

incorrect use of defer & finalizer

substring & subslices

global variables

this post has an in-depth explanation of different leak cases.

before jumping to any conclusions, let's do some profiling first.

GODEBUG

different debug facilities can be enabled using GODEBUG environment variable by passing a list of comma separated name=value pairs.

scheduler trace

scheduler trace can provide information about the run-time behavior of goroutines. to enable scheduler trace, run the program with GODEBUG=schedtrace=100; the value is the output period in ms.

$ GODEBUG=schedtrace=100 ./z42 -c config.json

SCHED 2952ms: ... runqueue=3 [26 11 7 18 13 30 6 3 24 25 11 0]

SCHED 3053ms: ... runqueue=3 [0 0 0 0 0 0 0 0 4 0 21 0]

SCHED 3154ms: ... runqueue=0 [0 6 2 4 0 30 0 5 0 11 2 5]

SCHED 3255ms: ... runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0]

SCHED 3355ms: ... runqueue=0 [1 0 0 0 0 0 0 0 0 0 0 0]

SCHED 3456ms: ... runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]

SCHED 3557ms: ... runqueue=0 [13 0 3 0 3 33 2 0 10 8 10 14]

SCHED 3657ms: ...runqueue=3 [14 1 0 5 19 54 9 1 0 1 29 0]

SCHED 3758ms: ... runqueue=0 [67 1 5 0 0 1 0 0 87 4 0 0]

SCHED 3859ms: ... runqueue=6 [0 0 3 6 0 0 0 0 3 2 2 19]

SCHED 3960ms: ... runqueue=0 [0 0 1 0 1 0 0 1 0 1 0 0]

SCHED 4060ms: ... runqueue=5 [4 0 5 0 1 0 0 0 0 0 0 0]

SCHED 4161ms: ... runqueue=0 [0 0 0 0 0 0 0 1 0 0 0 0]

SCHED 4262ms: ... runqueue=4 [0 128 21 172 1 19 8 2 43 5 139 37]

SCHED 4362ms: ... runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]

SCHED 4463ms: ... runqueue=6 [0 28 23 39 4 11 4 11 25 0 25 0]

SCHED 4564ms: ... runqueue=354 [51 45 33 32 15 20 8 7 5 42 6 0]

runqueue is the length of the global queue of runnable goroutines. the numbers in brackets are the lengths of the per-processor queues.

an ideal situation is one where all processors are busy running goroutines, with a reasonable runqueue length evenly distributed between all processors:

SCHED 2449ms: gomaxprocs=12 idleprocs=0 threads=40 spinningthreads=1 idlethreads=1 runqueue=20 [20 20 20 20 20 20 20 20 20 20 20]

looking at our schedtrace output we can see there are time frames when almost all processors are idle. this means we are not utilizing the full cpu power.

garbage collector trace

To enable the garbage collector (GC) trace, run the program with GODEBUG=gctrace=1 environment variable:

$ GODEBUG=gctrace=1 ./z42 -c config1.json

.

.

.

gc 30 @3.727s 1%: 0.066+21+0.093 ms clock, 0.79+128/59/0+1.1 ms cpu, 67->71->45 MB, 76 MB goal, 12 P

gc 31 @3.784s 2%: 0.030+27+0.053 ms clock, 0.36+177/81/7.8+0.63 ms cpu, 79->84->55 MB, 90 MB goal, 12 P

gc 32 @3.858s 3%: 0.026+34+0.024 ms clock, 0.32+234/104/0+0.29 ms cpu, 96->100->65 MB, 110 MB goal, 12 P

gc 33 @3.954s 3%: 0.026+44+0.13 ms clock, 0.32+191/131/57+1.6 ms cpu, 117->123->79 MB, 131 MB goal, 12 P

gc 34 @4.077s 4%: 0.010+53+0.024 ms clock, 0.12+241/159/69+0.29 ms cpu, 142->147->91 MB, 158 MB goal, 12 P

gc 35 @4.228s 5%: 0.017+61+0.12 ms clock, 0.20+296/179/94+1.5 ms cpu, 166->174->105 MB, 182 MB goal, 12 P

gc 36 @4.391s 6%: 0.017+73+0.086 ms clock, 0.21+492/216/4.0+1.0 ms cpu, 191->198->122 MB, 210 MB goal, 12 P

gc 37 @4.590s 7%: 0.049+85+0.095 ms clock, 0.59+618/253/0+1.1 ms cpu, 222->230->140 MB, 244 MB goal, 12 P

.

.

.

as we can see here, in-use memory is increasing, and the amount of time gc needs to do its job is also increasing. this means that we are allocating memory faster than gc can collect it.

more on GODEBUG and some other golang environment variables here.

enabling the profiler

go tool pprof is the tool for analyzing profiling data. there are two ways to set up pprof: either by directly calling runtime/pprof functions like pprof.StartCPUProfile() in your code, or by setting up the net/http/pprof http listener and getting the data from there, which is the way we used. pprof has very little overhead so it is safe to use in production, but the profile endpoints should not be exposed publicly because they may reveal sensitive data.

for the second option all we need to do is import the "net/http/pprof" package:

import (

_ "net/http/pprof"

)

then add an http listener:

go func() {

log.Println(http.ListenAndServe("localhost:6060", nil))

}()

pprof has several default profiles:

allocs: a sampling of all past memory allocations

block: stack traces that led to blocking on synchronization primitives

goroutine: stack traces of all current goroutines

heap: a sampling of memory allocations of live objects

mutex: stack traces of holders of contended mutexes

profile: cpu profile

threadcreate: stack traces that led to the creation of new OS threads

trace: a trace of execution of the current program

note: the trace endpoint, unlike all the other endpoints, produces a trace file, not a pprof profile; you can view it using go tool trace instead of go tool pprof.

now that we have our tools up and running, let's put them to use.

cpu profiler

$ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10

the cpu profiler runs for 30 seconds by default (we can change this with the seconds parameter), collecting a sample 100 times per second, then enters interactive mode. several commands are available; the most common are top, list and web.

use top n to view the top n hottest entries in text format. there are two options to sort the output: -cum for cumulative order and -flat for flat order.

(pprof) top 10 -cum

Showing nodes accounting for 1.50s, 6.19% of 24.23s total

Dropped 347 nodes (cum <= 0.12s)

Showing top 10 nodes out of 186

flat flat% sum% cum cum%

0.03s 0.12% 0.12% 16.7s 69.13% (*Server).serveUDPPacket

0.05s 0.21% 0.33% 15.6s 64.51% (*Server).serveDNS

0 0% 0.33% 14.3s 59.10% (*ServeMux).ServeDNS

0 0% 0.33% 14.2s 58.73% HandlerFunc.ServeDNS

0.01s 0.04% 0.37% 14.2s 58.73% main.handleRequest

0.07s 0.29% 0.66% 13.5s 56.00% (*DnsRequestHandler).HandleRequest

0.99s 4.09% 4.75% 7.56s 31.20% runtime.gentraceback

0.02s 0.08% 4.83% 7.02s 28.97% runtime.systemstack

0.31s 1.28% 6.11% 6.62s 27.32% runtime.mallocgc

0.02s 0.08% 6.19% 6.35s 26.21% (*DnsRequestHandler).FindANAME

(pprof)

use list to investigate a single function:

(pprof) list handleRequest

Total: 24.23s

ROUTINE ======================== main.handleRequest in /home/arash/go/src/arvancloud/redins/redins.go

10ms 14.23s (flat, cum) 58.73% of Total

. . 35: l *handler.RateLimiter

. . 36: configFile string

. . 37:)

. . 38:

. . 39:func handleRequest(w dns.ResponseWriter, r *dns.Msg) {

10ms 610ms 40: context := handler.NewRequestContext(w, r)

. 50ms 41: logger.Default.Debugf("handle request: [%d] %s %s", r.Id, context.RawName(), context.Type())

. . 42:

. . 43: if l.CanHandle(context.IP()) {

. 13.57s 44: h.HandleRequest(context)

. . 45: } else {

. . 46: context.Response(dns.RcodeRefused)

. . 47: }

. . 48:}

. . 49:

(pprof)

the web command generates an SVG graph of hotspots and opens it in a browser:

(pprof) web handleRequest

a lot of time spent in GC functions like runtime.mallocgc often means heavy allocation, which puts extra pressure on the garbage collector and increases latency.

a lot of time spent on synchronization mechanisms like runtime.chansend or runtime.lock could be a sign of contention.

a lot of time spent in syscall.Read/Write means excessive use of io operations.

memory profiler

$ go tool pprof http://localhost:6060/debug/pprof/allocs

by default it shows lifetime allocated memory. we can see the number of allocated objects using -alloc_objects; other useful options are -inuse_objects and -inuse_space for inspecting live memory.

generally, if you want to reduce memory consumption you need to look at -inuse_space, but if you want to improve latency look at -alloc_objects after a sufficient execution time/load.

identifying the bottleneck

it is important to first identify the type of bottleneck (cpu, io, memory) we are dealing with. aside from the profilers, there's another kind of tool available.

go tool trace can show what our goroutines are really doing in detail. to collect a trace sample we need to send an http request to the trace endpoint; the generated file can then be viewed using the trace tool:

$ go tool trace trace.out

2019/12/25 15:30:50 Parsing trace...

2019/12/25 15:30:59 Splitting trace...

2019/12/25 15:31:10 Opening browser. Trace viewer is listening on http://127.0.0.1:42703

go tool trace is a web app that uses the Chrome DevTools protocol and is only compatible with chrome browsers. the main page looks something like this:

trace splits the trace timeline into chunks to make sure your browser can handle it.

there are myriads of data here, which makes it almost indecipherable if we don't know what we are looking for. let's leave it for now.

the next link on the main page is "goroutine analysis", which shows the different kinds of goroutines running in the program during the trace period:

click on the first item with N=441703 and this is what we get:

goroutines analysis

this is very interesting. most goroutines spend almost no time executing; most of their time is spent in Sync block. let's have a closer look at one of them:

goroutine trace diagram

it appears our goroutine is almost always inactive, waiting on an rwlock. from here we can go directly to the block tool; the blocking profile is disabled by default, so we first need to enable it in our code:

runtime.SetBlockProfileRate(1)

now we can get our block samples:

$ go tool pprof http://localhost:6060/debug/pprof/block

(pprof) top

Showing nodes accounting for 16.03wks, 99.75% of 16.07wks total

Dropped 87 nodes (cum <= 0.08wks)

Showing top 10 nodes out of 27

flat flat% sum% cum cum%

10.78wks 67.08% 67.08% 10.78wks 67.08% internal/poll.(*fdMutex).rwlock

5.25wks 32.67% 99.75% 5.25wks 32.67% sync.(*Mutex).Lock

0 0% 99.75% 5.25wks 32.67% arvancloud/redins/handler.(*DnsRequestHandler).Filter

0 0% 99.75% 5.25wks 32.68% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME

0 0% 99.75% 16.04wks 99.81% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest

0 0% 99.75% 10.78wks 67.08% arvancloud/redins/handler.(*DnsRequestHandler).Response

0 0% 99.75% 10.78wks 67.08% arvancloud/redins/handler.(*RequestContext).Response

0 0% 99.75% 5.25wks 32.67% arvancloud/redins/handler.ChooseIp

0 0% 99.75% 16.04wks 99.81% github.com/miekg/dns.(*ServeMux).ServeDNS

0 0% 99.75% 16.04wks 99.81% github.com/miekg/dns.(*Server).serveDNS

(pprof)

here we have two different locks (poll.fdMutex and sync.Mutex) responsible for almost 100% of the blocking. this confirms our guess about lock contention; now we only need to find where it happens:

(pprof) svg lock

this command creates a vector graph of all nodes involved in contention, focused on the lock functions:

block endpoint svg graph

we can get the same result from our goroutine endpoint:

$ go tool pprof http://localhost:6060/debug/pprof/goroutine

and then:

(pprof) top

Showing nodes accounting for 294412, 100% of 294424 total

Dropped 84 nodes (cum <= 1472)

Showing top 10 nodes out of 32

flat flat% sum% cum cum%

294404 100% 100% 294404 100% runtime.gopark

8 0.0027% 100% 294405 100% github.com/miekg/dns.(*Server).serveUDPPacket

0 0% 100% 58257 19.79% arvancloud/redins/handler.(*DnsRequestHandler).Filter

0 0% 100% 58259 19.79% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME

0 0% 100% 293852 99.81% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest

0 0% 100% 235406 79.95% arvancloud/redins/handler.(*DnsRequestHandler).Response

0 0% 100% 235406 79.95% arvancloud/redins/handler.(*RequestContext).Response

0 0% 100% 58140 19.75% arvancloud/redins/handler.ChooseIp

0 0% 100% 293852 99.81% github.com/miekg/dns.(*ServeMux).ServeDNS

0 0% 100% 293900 99.82% github.com/miekg/dns.(*Server).serveDNS

(pprof)

almost all our goroutines are sitting at runtime.gopark; this is the go scheduler putting goroutines to sleep, and a very common cause is lock contention.

(pprof) svg gopark

goroutines endpoint svg graph

here we have our two sources of contention:

UDPConn.WriteMsg()

it looks like all responses end up being written to the same fd (hence the lock), which makes sense because they all share the same source address.

we experimented with different solutions and in the end decided to use multiple listeners to balance the load; this way we let the OS balance incoming requests between different connections and reduce contention.

Rand()

it looks like there's a lock underneath the generic math/rand functions (more on this here). this can easily be fixed using rand.New(), which creates a random generator without the locking wrapper:

rg := rand.New(rand.NewSource(int64(time.Now().Nanosecond())))

this is a little better, but creating a new source every time is expensive. can we do better? in our case we don't really need a random number; we just need an even distribution for our load balancing, and it turns out time.Now().Nanosecond() can do just fine.

now that we have eliminated all the extra blocking, let's see the results:

goroutine analysis

looks better, but most of the time is still spent on sync block. let's take a look at the synchronization blocking profile from the trace UI main page:

sync block time graph

let's take a look at the ccache promote function from pprof's block endpoint:

(pprof) list promote

ROUTINE ======================== github.com/karlseguin/ccache.(*Cache).promote in ...

0 9.66days (flat, cum) 99.70% of Total

. . 155: h.Write([]byte(key))

. . 156: return c.buckets[h.Sum32()&c.bucketMask]

. . 157:}

. . 158:

. . 159:func (c *Cache) promote(item *Item) {

. 9.66days 160: c.promotables <- item

. . 161:}

. . 162:

. . 163:func (c *Cache) worker() {

. . 164: defer close(c.donec)

. . 165:

all ccache.Get() calls end up sending to a single c.promotables channel. cache being a crucial part of our service, we have to consider other options. Dgraph has an excellent article about the state of caches in go; they also have an excellent cache module called ristretto. unfortunately ristretto doesn't support TTL-based eviction yet; we could work around this by using a very large MaxCost and keeping a timeout value in our cache struct (we wish to keep stale data in cache). let's see the result using ristretto:

goroutine analysis

Great!

we managed to reduce maximum goroutine runtime from 5000 ms to 22 ms. still, most of the execution time is split between "sync block" and "scheduler wait". let's see if there's anything we can do about that:

sync block time graph

there's not much more we can do about fdMutex.rwlock, so let's focus on the other one: gcMarkDone, which is responsible for 53% of the block time. this function is part of the go garbage collection process. having it show up as a hot spot is often a sign that we are putting a lot of pressure on gc.

alloc optimization

at this point it may be useful to see how go garbage collection works. go uses a tri-color mark and sweep collector; it keeps track of everything that's allocated, and once the heap reaches twice its size after the previous collection (or whatever ratio GOGC specifies), a new GC cycle starts. mark happens in three phases:

Mark Setup (STW)

Marking (concurrent)

Mark Termination (STW)

Stop The World phases bring the entire execution to a halt. although they're generally very short, tight loops can extend their duration. that's because currently (go v1.13) goroutines are only preemptible at function call points, so a tight loop can cause an arbitrarily long pause, since GC waits for all goroutines to stop.

during marking, gc utilizes about 25% of GOMAXPROCS, but additional goroutines can be forced into mark assist; this happens when a rapidly allocating goroutine outruns the background marker. in order to reduce the latency caused by gc we need to minimize heap usage; there are two things to note:

the number of allocations matters more than their size (e.g. 1000 allocations of a 20-byte struct put much more stress on the heap than a single 20000-byte allocation)

unlike languages like C/C++, not all allocations end up on the heap. the go compiler decides whether a variable escapes to the heap or can be allocated inside the stack frame. unlike heap-allocated variables, stack-allocated variables do not put pressure on gc

for more information on the go memory model and gc design, check this presentation.

for alloc optimization we use a collection of go tools:

cpu profiler to find hot allocations

memory profiler to track allocations

tracer for GC patterns

escape analysis to find why allocations happen

let’s start with cpu profiler:

$ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=20

(pprof) top 20 -cum

Showing nodes accounting for 7.27s, 29.10% of 24.98s total

Dropped 315 nodes (cum <= 0.12s)

Showing top 20 nodes out of 201

flat flat% sum% cum cum%

0 0% 0% 16.42s 65.73% github.com/miekg/dns.(*Server).serveUDPPacket

0.02s 0.08% 0.08% 16.02s 64.13% github.com/miekg/dns.(*Server).serveDNS

0.02s 0.08% 0.16% 13.69s 54.80% github.com/miekg/dns.(*ServeMux).ServeDNS

0.01s 0.04% 0.2% 13.48s 53.96% github.com/miekg/dns.HandlerFunc.ServeDNS

0.02s 0.08% 0.28% 13.47s 53.92% main.handleRequest

0.24s 0.96% 1.24% 12.50s 50.04% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest

0.81s 3.24% 4.48% 6.91s 27.66% runtime.gentraceback

3.82s 15.29% 19.78% 5.48s 21.94% syscall.Syscall

0.02s 0.08% 19.86% 5.44s 21.78% arvancloud/redins/handler.(*DnsRequestHandler).Response

0.06s 0.24% 20.10% 5.25s 21.02% arvancloud/redins/handler.(*RequestContext).Response

0.03s 0.12% 20.22% 4.97s 19.90% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME

0.56s 2.24% 22.46% 4.92s 19.70% runtime.mallocgc

0.07s 0.28% 22.74% 4.90s 19.62% github.com/miekg/dns.(*response).WriteMsg

0.04s 0.16% 22.90% 4.40s 17.61% github.com/miekg/dns.(*response).Write

0.01s 0.04% 22.94% 4.36s 17.45% github.com/miekg/dns.WriteToSessionUDP

1.43s 5.72% 28.66% 4.30s 17.21% runtime.pcvalue

0.01s 0.04% 28.70% 4.15s 16.61% runtime.newstack

0.06s 0.24% 28.94% 4.09s 16.37% runtime.copystack

0.01s 0.04% 28.98% 4.05s 16.21% net.(*UDPConn).WriteMsgUDP

0.03s 0.12% 29.10% 4.04s 16.17% net.(*UDPConn).writeMsg

we are particularly interested in functions linked with mallocgc; this is where mark assist happens:

(pprof) svg mallocgc

we can track allocations using the allocs endpoint; the -alloc_objects option means total allocation counts, and there are other options for in-use memory and allocation space:

$ go tool pprof -alloc_objects http://localhost:6060/debug/pprof/allocs

(pprof) top -cum

Active filters:

show=handler

Showing nodes accounting for 58464353, 59.78% of 97803168 total

Dropped 1 node (cum <= 489015)

Showing top 10 nodes out of 19

flat flat% sum% cum cum%

15401215 15.75% 15.75% 70279955 71.86% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest

2392100 2.45% 18.19% 27198697 27.81% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME

711174 0.73% 18.92% 14936976 15.27% arvancloud/redins/handler.(*DnsRequestHandler).Filter

0 0% 18.92% 14161410 14.48% arvancloud/redins/handler.(*DnsRequestHandler).Response

14161410 14.48% 33.40% 14161410 14.48% arvancloud/redins/handler.(*RequestContext).Response

7284487 7.45% 40.85% 11118401 11.37% arvancloud/redins/handler.NewRequestContext

10439697 10.67% 51.52% 10439697 10.67% arvancloud/redins/handler.reverseZone

0 0% 51.52% 10371430 10.60% arvancloud/redins/handler.(*DnsRequestHandler).FindZone

2680723 2.74% 54.26% 8022046 8.20% arvancloud/redins/handler.(*GeoIp).GetSameCountry

5393547 5.51% 59.78% 5393547 5.51% arvancloud/redins/handler.(*DnsRequestHandler).LoadLocation

from here on we can use list on each function and see if we can reduce memory allocations. let's check a few:

printf-like functions

(pprof) list handleRequest

Total: 97803168

ROUTINE ======================== main.handleRequest in /home/arash/go/src/arvancloud/redins/redins.go

2555943 83954299 (flat, cum) 85.84% of Total

. . 35: l *handler.RateLimiter

. . 36: configFile string

. . 37:)

. . 38:

. . 39:func handleRequest(w dns.ResponseWriter, r *dns.Msg) {

. 11118401 40: context := handler.NewRequestContext(w, r)

2555943 2555943 41: logger.Default.Debugf("handle request: [%d] %s %s", r.Id, context.RawName(), context.Type())

. . 42:

. . 43: if l.CanHandle(context.IP()) {

. 70279955 44: h.HandleRequest(context)

. . 45: } else {

. . 46: context.Response(dns.RcodeRefused)

. . 47: }

. . 48:}

. . 49:

line 41 is particularly interesting: even when debug is disabled there's still memory allocation here. we can use escape analysis to investigate further.

the go escape analysis tool is actually a compiler flag:

$ go build -gcflags '-m'

you can add another -m for more information:

$ go build -gcflags '-m -m'

for a better interface use view-annotated-file.

$ go build -gcflags '-m'

.

.

.

./redins.go:39:20: leaking param: w

./redins.go:39:42: leaking param: r

./redins.go:41:55: r.MsgHdr.Id escapes to heap

./redins.go:41:75: context.RawName() escapes to heap

./redins.go:41:91: context.Request.Type() escapes to heap

./redins.go:41:23: handleRequest []interface {} literal does not escape

./redins.go:219:17: leaking param: path

.

.

.

here, all Debugf parameters escape to the heap. this happens because of the way Debugf is defined:

func (l *EventLogger) Debugf(format string, args ...interface{})

all args parameters are converted to the interface{} type, which always escapes to the heap. we can either remove the debug logs or use a zero-allocation log library like zerolog.

for more information about escape analysis see "allocation efficiency in golang".

string manipulation

(pprof) list reverseZone

Total: 100817064

ROUTINE ======================== arvancloud/redins/handler.reverseZone in /home/arash/go/src/arvancloud/redins/handler/handler.go

6127746 10379086 (flat, cum) 10.29% of Total

. . 396: logger.Default.Warning("log queue is full")

. . 397: }

. . 398:}

. . 399:

. . 400:func reverseZone(zone string) string {

. 4251340 401: x := strings.Split(zone, ".")

. . 402: var y string

. . 403: for i := len(x) - 1; i >= 0; i-- {

6127746 6127746 404: y += x[i] + "."

. . 405: }

. . 406: return y

. . 407:}

. . 408:

. . 409:func (h *DnsRequestHandler) LoadZones() {

(pprof)

since go's strings are immutable, every temporary string causes an allocation. beginning with go 1.10 there is strings.Builder to build strings efficiently.

(pprof) list reverseZone

Total: 93437002

ROUTINE ======================== arvancloud/redins/handler.reverseZone in /home/arash/go/src/arvancloud/redins/handler/handler.go

0 7580611 (flat, cum) 8.11% of Total

. . 396: logger.Default.Warning("log queue is full")

. . 397: }

. . 398:}

. . 399:

. . 400:func reverseZone(zone string) string {

. 3681140 401: x := strings.Split(zone, ".")

. . 402: var sb strings.Builder

. 3899471 403: sb.Grow(len(zone)+1)

. . 404: for i := len(x) - 1; i >= 0; i-- {

. . 405: sb.WriteString(x[i])

. . 406: sb.WriteByte('.')

. . 407: }

. . 408: return sb.String()

since we only use the reversed zone as a lookup key and don't care about its exact value, we can eliminate Split() by simply reversing the whole string.

(pprof) list reverseZone

Total: 89094296

ROUTINE ======================== arvancloud/redins/handler.reverseZone in /home/arash/go/src/arvancloud/redins/handler/handler.go

3801168 3801168 (flat, cum) 4.27% of Total

. . 400:func reverseZone(zone string) []byte {

. . 401: runes := []rune("." + zone)

. . 402: for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {

. . 403: runes[i], runes[j] = runes[j], runes[i]

. . 404: }

3801168 3801168 405: return []byte(string(runes))

. . 406:}

. . 407:

. . 408:func (h *DnsRequestHandler) LoadZones() {

. . 409: h.LastZoneUpdate = time.Now()

. . 410: zones, err := h.Redis.SMembers("redins:zones")

more on string manipulation here.

sync.Pool

(pprof) list GetASN

Total: 69005282

ROUTINE ======================== arvancloud/redins/handler.(*GeoIp).GetASN in /home/arash/go/src/arvancloud/redins/handler/geoip.go

1146897 1146897 (flat, cum) 1.66% of Total

. . 231:func (g *GeoIp) GetASN(ip net.IP) (uint, error) {

1146897 1146897 232: var record struct {

. . 233: AutonomousSystemNumber uint `maxminddb:"autonomous_system_number"`

. . 234: }

. . 235: err := g.ASNDB.Lookup(ip, &record)

. . 236: if err != nil {

. . 237: logger.Default.Errorf("lookup failed : %s", err)

(pprof) list GetGeoLocation

Total: 69005282

ROUTINE ======================== arvancloud/redins/handler.(*GeoIp).GetGeoLocation in /home/arash/go/src/arvancloud/redins/handler/geoip.go

1376298 3604572 (flat, cum) 5.22% of Total

. . 207:

. . 208:func (g *GeoIp) GetGeoLocation(ip net.IP) (latitude float64, longitude float64, country string, err error) {

. . 209: if !g.Enable || g.CountryDB == nil {

. . 210: return

. . 211: }

1376298 1376298 212: var record struct {

. . 213: Location struct {

. . 214: Latitude float64 `maxminddb:"latitude"`

. . 215: LongitudeOffset uintptr `maxminddb:"longitude"`

. . 216: } `maxminddb:"location"`

. . 217: Country struct {

. . 218: ISOCode string `maxminddb:"iso_code"`

. . 219: } `maxminddb:"country"`

. . 220: }

. . 221: // logger.Default.Debugf("ip : %s", ip)

. . 222: if err := g.CountryDB.Lookup(ip, &record); err != nil {

. . 223: logger.Default.Errorf("lookup failed : %s", err)

. . 224: return 0, 0, "", err

. . 225: }

. 2228274 226: _ = g.CountryDB.Decode(record.Location.LongitudeOffset, &longitude)

. . 227: // logger.Default.Debug("lat = ", record.Location.Latitude, " lang = ", longitude, " country = ", record.Country.ISOCode)

. . 228: return record.Location.Latitude, longitude, record.Country.ISOCode, nil

. . 229:}

. . 230:

. . 231:func (g *GeoIp) GetASN(ip net.IP) (uint, error) {

we use maxminddb functions to get geolocation data. these functions take interface{} parameters, which as we saw earlier can cause heap escapes.

we can use sync.Pool to cache allocated but unused items for later reuse:

type MMDBGeoLocation struct {

Coordinations struct {

Latitude float64 `maxminddb:"latitude"`

Longitude float64

LongitudeOffset uintptr `maxminddb:"longitude"`

} `maxminddb:"location"`

Country struct {

ISOCode string `maxminddb:"iso_code"`

} `maxminddb:"country"`

}



type MMDBASN struct {

AutonomousSystemNumber uint `maxminddb:"autonomous_system_number"`

}

func (g *GeoIp) GetGeoLocation(ip net.IP) (*MMDBGeoLocation, error) {

if !g.Enable || g.CountryDB == nil {

return nil, EMMDBNotAvailable

}

var record *MMDBGeoLocation = g.LocationPool.Get().(*MMDBGeoLocation)

logger.Default.Debugf("ip : %s", ip)

if err := g.CountryDB.Lookup(ip, record); err != nil {

logger.Default.Errorf("lookup failed : %s", err)

return nil, err

}

_ = g.CountryDB.Decode(record.Coordinations.LongitudeOffset, &record.Coordinations.Longitude)

logger.Default.Debug("lat = ", record.Coordinations.Latitude, " lang = ", record.Coordinations.Longitude, " country = ", record.Country.ISOCode)

return record, nil

}



func (g *GeoIp) GetASN(ip net.IP) (uint, error) {

var record *MMDBASN = g.AsnPool.Get().(*MMDBASN)

err := g.ASNDB.Lookup(ip, record)

if err != nil {

logger.Default.Errorf("lookup failed : %s", err)

return 0, err

}

logger.Default.Debug("asn = ", record.AutonomousSystemNumber)

return record.AutonomousSystemNumber, nil

}

more on sync.Pool here.

there are many other small optimizations possible, but at this point it seems we have done enough. for more information about memory optimization techniques check "Allocation efficiency in high-performance Go services".

results

to visualize the result of our memory optimizations we use a diagram in go tool trace called "Minimum Mutator Utilization"; here mutator means anything other than gc.

before optimizations:

here we have a window of about 500 milliseconds where we have almost no utilization (gc is consuming all resources), and we never get past 80% utilization in the long run. we want the zero-utilization window to be as small as possible and to reach 100% utilization as fast as possible, something like this:

after optimizations:

conclusion

by using the go tools we managed to optimize our service to handle a high load of requests and better utilize system resources.

you can check our source code on github. here is the unoptimized version and here is the optimized version.

further readings