Avoiding Memory Leak in Golang API

Iman Tumorang (@imantumorang), Software Engineer

You should read this before releasing your Golang API into production. This is a true story from Kurio about how we struggled with every release because we were not doing things the right way.

A few weeks ago, we at Kurio were fixing a weird, undetected bug in our main service. We had tried many ways to debug and fix it. The bug was not in the business logic, since the service had already been running in production for a few weeks. We were always saved by our auto-scaling mechanism, so everything seemed to be running well.

Only later did we figure it out: the problem was in our code, and we were not doing things well.

Architecture

FYI, we use a microservices pattern in our architecture. We have one gateway API — we call it the main API — that serves our users (mobile and web). Its role is similar to an API gateway: it only handles requests from users, calls the required services, and builds the response for the user. This main API is written entirely in Golang. The reason we chose Golang is another story that I won't tell here.

If drawn as a picture, our system looks something like this.

The Problem

We struggled for a long time with our main API, which kept going down and returning slow responses to our mobile apps, sometimes making the API completely inaccessible. Our API dashboard monitor would just turn red — and to be honest, when the dashboard goes red, it is a dangerous thing that causes stress, panic, and craziness for us engineers 😈.

On top of that, our CPU and memory usage kept climbing. When this happened, we would just restart the service manually and wait until it ran again.

Our API response time went up to 86 seconds for a single request.

A graph of our API response time; we restarted manually for safety.

This bug really frustrated us, because we had no log that specifically described it. All we knew was that the response time was very long, and that CPU and memory usage kept increasing. It was like a nightmare.

Phase 1: Using a Customized http.Client

One thing we really, really learned while developing this service: don't trust the default config.

We use a customized http.Client instead of the default one from the http package:

```go
client := http.Client{} // default
```

We add some configuration based on our needs. Because we need to reuse connections, we configure the transport and control the maximum number of idle reusable connections.

```go
keepAliveTimeout := 600 * time.Second
timeout := 2 * time.Second
defaultTransport := &http.Transport{
	Dial: (&net.Dialer{
		KeepAlive: keepAliveTimeout,
	}).Dial,
	MaxIdleConns:        100,
	MaxIdleConnsPerHost: 100,
}
client := &http.Client{
	Transport: defaultTransport,
	Timeout:   timeout,
}
```

This configuration helps us cap the maximum time spent calling another service.

Phase 2: Avoiding Memory Leaks From an Unclosed Response Body

What we learned in this phase: if we want to reuse our connection pool for calls to another service, we must read the response body, and close it.

Because our main API mostly just calls other services, we made a fatal mistake. Our main API is supposed to reuse the available connections from http.Client, so whatever happens, we must read the response body even if we don't need it, and we must also close the response body. Both of these are needed to avoid memory leaks in our server.

We forgot to close the response body in our code, and that can cause a huge disaster in production.

The solution: we close the response body and read it even if we don't need the data.

```go
req, err := http.NewRequest("GET", "http://example.com?q=one", nil)
if err != nil {
	return err
}
resp, err := client.Do(req)
//=================================================
// CLOSE THE RESPONSE BODY
//=================================================
if resp != nil {
	defer resp.Body.Close() // MUST BE CLOSED
}
if err != nil {
	return err
}
//=================================================
// READ THE BODY EVEN IF THE DATA IS NOT NEEDED.
// THIS IS A MUST, TO AVOID A MEMORY LEAK WHEN
// REUSING THE HTTP CONNECTION
//=================================================
_, err = io.Copy(ioutil.Discard, resp.Body) // WE READ THE BODY
if err != nil {
	return err
}
```

We fixed this after reading these great articles: http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/ and http://tleyden.github.io/blog/2016/11/21/tuning-the-go-http-client-library-for-load-testing/

Phase 1 and Phase 2, together with auto-scaling, managed to suppress this bug. To be honest, it did not happen again for about three months, through the end of 2017.

Phase 3: Timeout Control in Golang Channels

After a few months of running well, the bug happened again. In the first week of January 2018, one of the services called by our main API (let's say: our content service) went down. For some reason, it could not be accessed.

So when our content service went down, our main API caught fire again. The API dashboard went red again, API response times got higher and slower, and our CPU and memory usage spiked even with auto-scaling.

Again, we tried to find the root cause. After restarting the content service, everything ran well again.

In that case, we were curious why this happened, because we thought that, since we had already set the timeout deadline in http.Client, this would never happen.

Searching for the potential issue in our code, we found some dangerous code.

To simplify, the code looks more or less like this:

*ps: this function is just an example, but it is similar in pattern to ours

```go
type sampleChannel struct {
	Data *Sample
	Err  error
}

func (u *usecase) GetSample(id int64, someparam string, anotherParam string) ([]*Sample, error) {
	chanSample := make(chan sampleChannel, 3)
	wg := sync.WaitGroup{}
	wg.Add(1)
	go func() {
		defer wg.Done()
		chanSample <- u.getDataFromGoogle(id, anotherParam) // just an example function
	}()
	wg.Add(1)
	go func() {
		defer wg.Done()
		chanSample <- u.getDataFromFacebook(id, anotherParam)
	}()
	wg.Add(1)
	go func() {
		defer wg.Done()
		chanSample <- u.getDataFromTwitter(id, anotherParam)
	}()
	wg.Wait()
	close(chanSample)
	result := make([]*Sample, 0)
	for sampleItem := range chanSample {
		if sampleItem.Err != nil {
			logrus.Error(sampleItem.Err)
		}
		if sampleItem.Data == nil {
			continue
		}
		result = append(result, sampleItem.Data)
	}
	return result, nil
}
```

our code before the fix

Looking at the code above, it seems like nothing is wrong. But this function is the most frequently accessed one, with the heaviest calls in our main API, because it makes three API calls with heavy processing.

To improve this, we took a new approach using timeout control on the channel. With the code style above — using a WaitGroup that waits until all the processing is done — we must wait for every API call to finish before we can process and return the response to the user.

This was one of our big mistakes. This code can cause a huge disaster when one of our services dies, because callers will wait a long time until the dead service recovers. At 5K calls/s, of course, this is a disaster.

First attempt at a solution:

We modified it by adding a timeout, so our users won't wait too long; they will just get an internal server error.

```go
func (u *usecase) GetSample(id int64, someparam string, anotherParam string) ([]*Sample, error) {
	chanSample := make(chan sampleChannel, 3)
	defer close(chanSample)
	go func() {
		chanSample <- u.getDataFromGoogle(id, anotherParam) // just an example function
	}()
	go func() {
		chanSample <- u.getDataFromFacebook(id, anotherParam)
	}()
	go func() {
		chanSample <- u.getDataFromTwitter(id, anotherParam)
	}()
	result := make([]*Sample, 0)
	timeout := time.After(time.Second * 2)
	for loop := 0; loop < 3; loop++ {
		select {
		case sampleItem := <-chanSample:
			if sampleItem.Err != nil {
				logrus.Error(sampleItem.Err)
				continue
			}
			if sampleItem.Data == nil {
				continue
			}
			result = append(result, sampleItem.Data)
		case <-timeout:
			err := fmt.Errorf("Timeout to get sample id: %d. ", id)
			result = make([]*Sample, 0)
			return result, err
		}
	}
	return result, nil
}
```

trying to add a timeout

Phase 4: Timeout Control Using Context

After Phase 3, our problem was still not fully solved. Our main API still consumed a lot of CPU and memory.

This happened because, even though we had already returned an Internal Server Error to our user, our goroutines still existed. What we want is this: once we return the response, all resources are also released, no exceptions — including the goroutines and the API calls still running in the background.

Later, after reading this article:

http://dahernan.github.io/2015/02/04/context-and-cancellation-of-goroutines/

We found an interesting feature in Golang that we had not realized existed: using Context to help with cancellation of goroutines.

Instead of using time.After for the timeout, we moved to context.Context. With this new approach, our service is more reliable.

Then we changed our code structure again by adding a context to the related functions.

```go
func (u *usecase) GetSample(c context.Context, id int64, someparam string, anotherParam string) ([]*Sample, error) {
	if c == nil {
		c = context.Background()
	}
	ctx, cancel := context.WithTimeout(c, time.Second*2)
	defer cancel()

	chanSample := make(chan sampleChannel, 3)
	defer close(chanSample)
	go func() {
		chanSample <- u.getDataFromGoogle(ctx, id, anotherParam) // just an example function
	}()
	go func() {
		chanSample <- u.getDataFromFacebook(ctx, id, anotherParam)
	}()
	go func() {
		chanSample <- u.getDataFromTwitter(ctx, id, anotherParam)
	}()
	result := make([]*Sample, 0)
	for loop := 0; loop < 3; loop++ {
		select {
		case sampleItem := <-chanSample:
			if sampleItem.Err != nil {
				continue
			}
			if sampleItem.Data == nil {
				continue
			}
			result = append(result, sampleItem.Data)
		// ============================================================
		// CATCH IT WHEN THE CONTEXT HAS ALREADY EXCEEDED THE TIMEOUT.
		// TO AVOID INCONSISTENT DATA, WE JUST SEND AN EMPTY ARRAY
		// AND AN ERROR MESSAGE TO THE USER
		// ============================================================
		case <-ctx.Done(): // the signal that the context has exceeded its timeout
			err := fmt.Errorf("Timeout to get sample id: %d. ", id)
			result = make([]*Sample, 0)
			return result, err
		}
	}
	return result, nil
}
```

the final fixed code, using context

So then we used Context for every goroutine call in our code. This helps us release memory and cancel the goroutine calls.

In addition, to make things more controlled and reliable, we also pass the context to our HTTP requests.

```go
func (u *usecase) getDataFromFacebook(ctx context.Context, id int64, param string) sampleChannel {
	req, err := http.NewRequest("GET", "https://facebook.com", nil)
	if err != nil {
		return sampleChannel{
			Err: err,
		}
	}
	// ============================================================
	// THEN WE PASS THE CONTEXT TO OUR REQUEST.
	// THIS FEATURE IS AVAILABLE SINCE GO 1.7
	// ============================================================
	if ctx != nil {
		req = req.WithContext(ctx) // NOTICE THIS: WE USE THE CONTEXT IN OUR HTTP REQUEST
	}
	resp, err := u.httpClient.Do(req)
	if err != nil {
		return sampleChannel{
			Err: err,
		}
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return sampleChannel{
			Err: err,
		}
	}
	sample := new(Sample)
	err = json.Unmarshal(body, &sample)
	if err != nil {
		return sampleChannel{
			Err: err,
		}
	}
	return sampleChannel{
		Err:  nil,
		Data: sample,
	}
}
```

using context on the HTTP request

With all of these settings and timeout controls, our system is safer and more controllable.

Lessons learned:

Never use the default options in production. If you're building a big concurrent application, just never use the default options.

Read a lot, try a lot, fail a lot, and gain a lot.

We learned a lot from this experience; these lessons can only be gained from real cases with real users. And I'm happy to have been part of fixing this bug.

*last updated: 18 January 2018: fixing some of the typos

If you think this was worth reading, give me a 👏 and share it on your Twitter, Facebook, etc., so other people can read it too. If you have a question or anything else, you can leave a response below or just email me.
