Squeezing the most out of the server: Erlang Profiling Written by Facundo Olano , April 07, 2020 Written by, April 07, 2020

NextRoll’s real-time bidding (RTB) platform has been featured several times in this tech blog: we run a fleet of Erlang applications (the bidders) that typically ranges between one and two thousand nodes. As described in past articles, an ongoing goal of the RTB team —as well as a source of interesting technical problems— is to minimize operational costs as much as possible.

An obvious way to reduce costs is to make the system more efficient and this means entering the hazardous land of software optimization. Even for experienced programmers, identifying bottlenecks is a hard enough problem when using the right tools; trying to guess what could make the code run faster will not only waste time but is likely to introduce unnecessary complexity that can cause problems down the line. The cousin of premature optimization is necessary optimization without profiling first

While Erlang is famously known for its concurrency model and fault-tolerant design, one of its biggest strengths is the level of live inspection and tuning it offers, often with little or no setup and runtime cost. In this article, we outline how we leverage those features to profile our system, driving the optimizations that can lead to cost reductions.

10-15 minute read

Infrastructure

An interesting aspect of real-time bidding is that it is fairly low-risk to test in production. Even if the new code is slow or contains errors, the bidders are architected to just send a no-bid response whenever a request can’t be fulfilled.

Taking advantage of this, we incorporate canary deploys to our day-to-day development workflow. In the context of optimizations, this means we can quickly test our performance hypothesis by updating the code and testing with live traffic. We have metrics and dashboards that give feedback on common metrics like timeouts, errors and amount of requests processed, making it obvious when a change is beneficial.

Request-level timers

Bid request processing is the fundamental operation of the bidder application. Any improvement in the amount of time it takes us to send a response to an ad exchange means we can process more requests per server, requiring fewer servers to handle our traffic and ultimately saving us money.

The work involved in a bid request can be broadly divided into a series of tasks such as payload parsing, selection of matching ads, pricing a particular ad, and response encoding. A common practice is to periodically measure the time invested in each of those phases to make sure they don’t degrade over time and to help provide a frame of reference to use when we look for areas of the code that are worthy of our optimization efforts.



A sample of bid request timings per phase.

The most basic way of profiling consists of timing the calls to a given piece of the code (perhaps one that was deemed suspicious by one of the methods described in the next sections):

%% Evaluates Fun() and reports its evaluation time as a histogram %% metric. - spec time_call ( name (), fun (() -> Result )) -> Result . time_call ( Metric , Fun ) -> Start = erlang : system_time ( microsecond ), Result = Fun (), End = erlang : system_time ( microsecond ), Diff = End - Start , update ( histogram , Metric , Diff ), Result .

The above helper is used to wrap calls to the function we want to measure. A canary deploy of the timed code to production will generate average, median and percentile metrics that we can then compare to the overall request time to identify bottlenecks.

recon

Timing request operations can be a very useful technique for understanding the specifics of a request flow but gives us a limited perspective of the entire system. Most of the bid request phases are handled on a single process and some of them involve idle time waiting for external systems. There are many periodic tasks and long-lived support processes in the bidders, and we can benefit from system-wide profiling that looks beyond bid request processing. This is where the Erlang toolbox comes into play.

The first valuable resource is not a piece of software but a little book by Fred Hebert: Erlang in Anger. This guide is the perfect reference because it describes methods for gaining insight and optimizing production systems, backed by real-world experience. The book’s companion library, recon, provides safer, friendlier and more productive interfaces to Erlang’s powerful inspection tools. What follows are some simple examples mostly derived from the book.

%% get node general stats ( bidder_dev @ node ) > recon : node_stats_print ( 1 , 100 ). {[{ process_count , 3061 }, { run_queue , 31 }, { error_logger_queue_len , 0 }, { memory_total , 6141339952 }, { memory_procs , 2623297976 }, { memory_atoms , 967320 }, { memory_bin , 610903880 }, { memory_ets , 2078839480 }], [{ bytes_in , 729538 }, { bytes_out , 2722704 }, { gc_count , 9475 }, { gc_words_reclaimed , 28611695 }, { reductions , 699610880 }, { scheduler_usage ,[{ 1 , 0 . 9840756659394524 }, { 2 , 0 . 9760797698171886 }, { 3 , 0 . 9891411385108696 }, { 4 , 0 . 9817415742856814 }, { 5 , 0 . 985099143110567 }, { 6 , 0 . 9862095293650041 }, { 7 , 0 . 9759060081154223 }, { 8 , 0 . 9926624828687856 }]}]} %% get top 3 memory consumers ( bidder_dev @ node ) > recon : proc_count ( memory , 3 ). [{ < 0 . 1597 . 0 > , 986767172 , [ bidder_banker , { current_function ,{ gen_server , loop , 7 }}, { initial_call ,{ proc_lib , init_p , 5 }}]}, { < 0 . 344 . 0 > , 915466692 , [ bid_guardian , { current_function ,{ gen_server , loop , 7 }}, { initial_call ,{ proc_lib , init_p , 5 }}]}, { < 0 . 15784 . 2618 > , 41052056 , [{ current_function ,{ prim_inet , recv0 , 3 }}, { initial_call ,{ proc_lib , init_p , 5 }}]}] %% get top 3 cpu consumers ( bidder_dev @ node ) > recon : proc_count ( reductions , 3 ). [{ < 0 . 7011 . 0 > , 2464069696 , [{ current_function ,{ exometer_probe , loop , 1 }}, { initial_call ,{ proc_lib , init_p , 3 }}]}, { < 0 . 6730 . 0 > , 2422804659 , [{ current_function ,{ exometer_probe , loop , 1 }}, { initial_call ,{ proc_lib , init_p , 3 }}]}, { < 0 . 6740 . 0 > , 2402511307 , [{ current_function ,{ exometer_probe , loop , 1 }}, { initial_call ,{ proc_lib , init_p , 3 }}]}] %% get top cpu consumers during a time window ( bidder_dev @ node ) > recon : proc_window ( reductions , 3 , 1000 ). [{ < 0 . 20046 . 2631 > , 688443 , [{ current_function ,{ views_bid , process , 1 }}, { initial_call ,{ proc_lib , init_p , 5 }}]}, { < 0 . 29341 . 2629 > , 547335 , [{ current_function ,{ timer , sleep , 1 }}, { initial_call ,{ proc_lib , init_p , 5 }}]}, { < 0 . 7687 . 2632 > , 386187 , [{ current_function ,{ prim_inet , recv0 , 3 }}, { initial_call ,{ proc_lib , init_p , 5 }}]}] %% get process stats, including stacktrace ( bidder_dev @ node ) > recon : info ( < 0 . 1368 . 0 > ). [{ meta ,[{ registered_name , bidder_banker }, { dictionary ,[{ '$initial_call' ,{ bidder_banker , init , 1 }}, { '$ancestors' ,[ bidder_sup , < 0 . 340 . 0 > ]}]}, { group_leader , < 0 . 339 . 0 > }, { status , running }]}, { signals ,[{ links ,[ < 0 . 342 . 0 > ]}, { monitors ,[]}, { monitored_by ,[]}, { trap_exit , true }]}, { location ,[{ initial_call ,{ proc_lib , init_p , 5 }}, { current_stacktrace ,[{ bidder_banker , '-handle_info/2-lc$^0/1-0-' , 1 , [{ file , "/bidder/src/pacing/bidder_banker.erl" }, { line , 337 }]}, { bidder_banker , '-handle_info/2-lc$^0/1-0-' , 1 , [{ file , "/bidder/src/pacing/bidder_banker.erl" }, { line , 337 }]}, { bidder_banker , '-handle_info/2-lc$^0/1-0-' , 1 , [{ file , "/bidder/src/pacing/bidder_banker.erl" }, { line , 337 }]}, { bidder_banker , handle_info , 2 , [{ file , "/bidder/src/pacing/bidder_banker.erl" }, { line , 337 }]}, { gen_server , try_dispatch , 4 , [{ file , "gen_server.erl" },{ line , 637 }]}, { gen_server , handle_msg , 6 , [{ file , "gen_server.erl" },{ line , 711 }]}, { proc_lib , init_p_do_apply , 3 , [{ file , "proc_lib.erl" },{ line , 249 }]}]}]}, { memory_used ,[{ memory , 916065740 }, { message_queue_len , 0 }, { heap_size , 75113 }, { total_heap_size , 114508086 }, { garbage_collection ,[{ max_heap_size , # { error_logger => true , kill => true , size => 0 }}, { min_bin_vheap_size , 46422 }, { min_heap_size , 233 }, { fullsweep_after , 65535 }, { minor_gcs , 3 }]}]}, { work ,[{ reductions , 1961022577 }]}]

As suggested by the book, a good method is to run recon:proc_window repeatedly and try to identify patterns, e.g. a process that frequently ranks among the top CPU consumers. The process id can then be passed to recon:info to get useful information (such as the stacktrace) in order to understand what the process is doing. Using this method we quickly found that a commonly accessed data structure contained ~100kb of debug data which was being copied thousands of times per second.

This approach, though, will tend to highlight long-running busy processes over short-lived ones (that could be called a lot and account for bigger overhead overall). This can be partially overcome by running proc_window repeatedly and aggregating the results by location rather than process id. However, there are better tools to look at aggregated process times.

redbug

Strictly speaking, redbug is not a profiling tool, but it’s so useful for debugging live systems that it deserves a mention in this article. It allows you to safely trace functions from the shell in a very intuitive yet sophisticated way (as opposed to the rougher Erlang built-in tracing modules). It can be very handy to get a quick notion of how functions are being called, how frequently, and what production data looks like:

( bidder_dev @ node ) > redbug : start ( "jiffy:encode/1" ). % 00:11:03 <0.18718.7151>({erlang,apply,2}) % jiffy:encode(1) ( bidder_dev @ node ) > redbug : start ( "jiffy:encode->return" ). % 00:11:03 <0.18718.7151>({erlang,apply,2}) % jiffy:encode([<<"USD">>]) % 00:11:03 <0.18718.7151>({erlang,apply,2}) % jiffy:encode/1 -> <<"[\"USD\"]">> ( bidder_dev @ node ) > redbug : start ( "jiffy:encode([N]) when is_integer(N)->return" ). % 00:12:20 <0.1535.7173>({erlang,apply,2}) % jiffy:encode([300]) % 00:12:20 <0.1535.7173>({erlang,apply,2}) % jiffy : encode / 1 -> << "[300]" >>

Including recon and redbug in an Erlang application release has no cost and can be a real life-saver when diagnosing production issues. These tools promote a flow that’s more powerful than adding prints to the code and feels more natural than step debuggers which wouldn’t be that useful in a highly concurrent world anyway.

Erlang Easy Profiling (eep)

eep allows for a more “traditional” approach to profiling by using Erlang tracing to take a snapshot of the system operation with function call counts, execution times and inter-dependencies.

It requires a bit more effort to use and it is not as safe as the rest of the tools described in the article. It will slow down the system potentially even killing it if used carelessly. Its output file can eat up a lot of disk space (a 100 ms snapshot takes about 300mb for our system). Depending on the nature of the application, it may not make sense to run it directly in production.

Here’s an example tracing session using eep:

( bidder_dev @ node ) > eep : start_file_tracing ( "file_name" ), timer : sleep ( 100 ), eep : stop_tracing ().

Note that we start, sleep and stop tracing in the same line. Don’t rely on the shell being responsive during tracing! You could send a message or call a function in between, as well, to force a certain part of the code to be executed while taking the snapshot. The instructions above will output a file_name.trace file in the release directory. The file then needs to be moved out of the production server and processed on a local Erlang shell:

( local ) > eep : convert_tracing ( "file_name" ). working 38436 msgs ( 38366 msgs / sec ), 0 . 106996 secs ( 9 x slowdown ) 39367 msgs ( 37807 msgs / sec ), 0 . 106996 secs ( 10 x slowdown ) done

This will, in turn, produce a callgrind.out.file_name that can be input to Kcachegrind (Qcachegrind in macOS). Note that by default the tracing will discriminate entries per process id. This would yield a similar situation to what we saw with recon:proc_window . A more interesting view is to merge function calls of all processes, which can be accomplished by stripping the pid:

$ grep -v "^ob=" callgrind.out.file_name > callgrind.out.merged_file_name $ qcachegrind callgrind.out.merged_file_name

QCachegrind presents the snapshot in a very sophisticated UI that can be used to spot the most frequently called functions, where most time is spent, etc.



eep output to QCachegrind.

Since eep is based on Erlang tracing, it will add an overhead to all code and may comparatively misrepresent the work done by Built-in functions (BIFs) and Native-implemented functions (NIFs), so the timings shown in the snapshot need to be taken with a grain of salt. Nevertheless, it is still a great exploratory tool to understand how the different components of the system interact, how dependencies are used, and learn about obscure or suspicious areas that can be hard to spot by just looking at the code.

Note that there are other Erlang profiling libraries (which we haven’t tried yet), that produce callgrind output: erlgrind and looking_glass.

erlang:system_monitor

Yet another way of looking at the application is the erlang:system_monitor/2 BIF. It allows you to set up a process to receive a message every time a certain condition is met. It was particularly helpful for us when examining long garbage collections and schedules of long duration, the latter of which can surface issues with NIFs that would go unnoticed with other methods.

Here’s an example of its use in the shell (based off a snippet from Erlang in Anger):

( bidder_dev @ node ) > Loop = fun F () -> receive { monitor , Pid , Event , Info } -> ReconLocation = recon : info ( Pid , location ), io : format ( "monitor= ~p pid= ~p ~n info= ~p ~n recon= ~p~n " , [ Event , Pid , Info , ReconLocation ]) end , F () end . ( bidder_dev @ node ) > spawn ( fun () -> register ( temp_sys_monitor , self ()), erlang : system_monitor ( self (), [{ long_schedule , 30 }, { long_gc , 30 }]), Loop () end ), timer : sleep ( 10000 ), exit ( whereis ( temp_sys_monitor ), kill ).

This will monitor the system for 10 seconds and output process information to the shell every time there’s a garbage collection or schedule that takes more than 30ms:

monitor = long_gc pid =< 7011 . 2176 . 5384 > info = [{ timeout , 33 }, { old_heap_block_size , 2984878 }, { heap_block_size , 514838 }, { mbuf_size , 0 }, { stack_size , 35 }, { old_heap_size , 1535726 }, { heap_size , 51190 }] recon = { location , [{ initial_call ,{ proc_lib , init_p , 5 }}, { current_stacktrace , [{ bidder_stat , update , 4 , [{ file , "/bidder/src/bidder_stat.erl" }, { line , 202 }]}, { views_bid , bid , 1 , [{ file , "/bidder/src/views/views_bid.erl" }, { line , 79 }]}, { views_bid , bid_request , 1 , [{ file , "/bidder/src/views/views_bid.erl" }, { line , 17 }]}, { erl_stat , time_call , 3 , [{ file , "/bidder/src/bidder_stat.erl" }, { line , 42 }]}, { bidder_web_handler , dispatch , 2 , [{ file , "/bidder/src/bidder_web_handler.erl" }, { line , 211 }]}, { proc_lib , init_p_do_apply , 3 , [{ file , "proc_lib.erl" },{ line , 249 }]}]}]}

Conclusion

This article is by no means an exhaustive list of Erlang diagnosing tools; there’s the observer, eprof, fprof, eflame, eministat and the list goes on. The Erlang documentation itself has a nice efficiency guide with an overview of the built-in profiling modules.

Since we started with this effort, we consistently reduced request times (and operational costs), month over month. To a large extent these gains came thanks to the advanced tools Erlang and its ecosystem have to offer. What’s most interesting is that we achieved this by getting to know our systems better, fixing bugs, and often removing rather than adding specialized code.