Low level details

So what really happens, in more detail. First, you have to compile the Erlang VM with support. Pass the option

--enable-dirty-schedulers

to the configure script. Next your code has to change a bit. If you have a standard NIF, like the one for obtaining random data in the enacl library[x], you have a function with signature:

ERL_NIF_TERM enif_randombytes(ErlNifEnv *, int, ERL_NIF_TERM[])

When you then build the table for the NIF, you write the following:

/* Tie the knot to the Erlang world */

static ErlNifFunc nif_funcs[] = {

... {"randombytes_b", 1, enif_randombytes},

{"randombytes", 1, enif_randombytes, ERL_NIF_DIRTY_JOB_CPU_BOUND}

}; ERL_NIF_INIT(enacl_nif, nif_funcs, enif_crypto_load,

NULL, NULL, NULL);

Which tells the Erlang system that the function randombytes_b is a regular NIF, and the function randombytes is bound to the dirty scheduler for CPU bound jobs.

The way Erlang works with a NIF module is that you can replace functions in a module with the NIF functions by calling erlang:load_nif/2. This reads the above nif_funcs[] array, and then patches the functions such that calling a function in the module will invoke the BEAM operation ‘call_nif’ as the very first instruction. This opcode contains the function pointer to the real C-function.

When the opcode is invoked an environment is set up and passed to the C code. It can then inspect the environment and the incomnig parameters and carry out its work. Once it is done, it simply returns an Erlang term back to the interpreter, which in turn regards that as the return value of the function in question.

When loading a dirty NIF however, the NIF is not directly patched in. Rather, a wrapper function ‘schedule_dirty_io_nif’ or ‘schedule_dirty_cpu_nif’ is patched in. And the NIF itself is wrapped in yet another helper ‘execute_dirty_nif’. So when calling the NIF, you are really calling, say, schedule_dirty_cpu_nif.

The scheduling functions alters the process flags for the currently executing function such that it becomes a dirty (cpu-bound) function. It then bumps all reductions so the process it out of reductions. This forces a re-evaluation of the process queueing. The scheduler now sees the dirty-flag and evacuates the process into the run queue of the dirty schedulers.

Once on the dirty scheduler, the ‘execute_dirty_nif’ wrapper gets called. This wrapper executes the real NIF, clears the flags, and arranges the ‘dirty_nif_finalizer’ to run, which is a function that simply returns the NIF result via a standard trap[0]. The scheduler now requeues and moves the NIF back onto one of the main schedulers, where it will continue execution and return its value.

The execution wrapper can also handle exceptions in the NIF. This executes another finalizer, which raises the exception once it is back on the main scheduler thread. Also, functions may decide to yield the scheduler they are running on in order to give others a chance to run. Like the exception case, the execution wrapper handles this.

Overhead measurement strategy

Now, this whole dirty NIF evacuation back and forth is not a free operation. For a long-running NIF that could take something like 20 milli-seconds or more, it is probably a negligible overhead. But for a NIF which returns in a few instructions, the overhead of moving things around would dominate. Also, when a NIF returns quickly, there is no reason to even consider the dirty scheduler.

The question is, however, how large is this overhead? Luckily, I have a new Illumos machine sitting around to answer this question through DTrace.

The above description suggests a way to grab the overhead of going back and forth. We can use DTrace to dynamically add trace probes into the running Erlang system and measure the actual overhead of going back and forth between schedulers by writing a DTrace script:

Whenever we return from schedule_dirty_cpu_nif, we record a timestamp. And whenever we start to execute the dirty nif through execute_dirty_nif, we linearly quantize the time it took. Likewise, the return path can be measured by recording the time it takes from the dirty nif execution returns and until the dirty nif finalizer runs (which will happen back on a normal scheduler core).

Now, running DTrace on this script will dynamically reach inside the Erlang VM and patch the prologue/epilogue of the targeted functions. Once hit, we will execute the body snippets of the above in order to measure the runtime overhead.

Measurement

Measurements are done on a Intel(r) Core(tm) i7–3720QM CPU @ 2.60GHz with 16 Gigabytes of RAM, running SunOS dev 5.11 omnios-d08e0e5 i86pc i386 i86pc (i.e., OmniOS).

A first naive run where we just run many scheduler operations back-to-back one at a time, on a standard configured Erlang with no special provisioning options results in the following output:

Scheduling overhead (nanos):

value ------------- Distribution ------------- count

1000 | 0

1250 |@@@@@@ 2978

1500 |@@@@@@@@@@@ 5684

1750 |@@@@@@@@@@@@@@@@@@@@ 10171

2000 |@@ 1139

2250 | 11

2500 | 2

...

Which suggests overhead of moving to the scheduler is in the ballpark of 1750 nanoseconds which is around 1.8us. The return graph looks like this:

Return overhead (nanos):

value ------------- Distribution ------------- count

6750 | 0

7000 | 54

7250 |@@@@@@ 3125

7500 |@@@@@@@@@@@@@@@@@@@@@@ 11214

7750 |@@@@@ 2639

8000 |@ 415

8250 | 54

8500 | 24

8750 | 13

9000 | 10

9250 | 8

9500 | 52

9750 | 70

>= 10000 |@@@@@ 2326



So it takes around 7.5us to get back on the main scheduler.

The hypothesis is that the reason it costs nearly 10us to move around is that you are highly likely to move from one CPU core another while this is happening. This results in a lot of contention in caches, TLBs, Locking and so on. In turn, things get expensive because we have to wake up cores, and intercommunicate a lot. Worse, since we only need a single core, we will be moved around on the CPU cores all the time, which blows our cache, destroys our TLB and makes us generally sad.

To test our hypothesis, we exploit that we are running on the raw steel of a machine, and set up some Erlang flag options to the runtime:

ERL_FLAGS='+K true +A 10 +sbt db +sbwt very_long +swt very_low +Mulmbcs 32767 +Mumbcgs 1 +Musmbcs 2047' rebar3 shell

The important flags here are sbt which binds schedulers to cores, and the sbwt/swt setting the wakeup threshold. The +M options makes the system allocate in 2–32 megabyte blocks which really helps if the system supports superpages/hugepages in its VM layer. It also tends to pack data for locality.

This has a huge effect on the runs. Whereas in the above, the overhead is 10us, after setting options, the overhead is in the ballpark of 2.5us to 3.75us[1]:

Scheduling overhead (nanos):

value ------------- Distribution ------------- count

750 | 0

1000 |@@@@@@@@@@@@@@@@@@@@@ 10259

1250 |@@@@ 2217

1500 |@@@@@ 2373

1750 |@@@@@@@@@@ 5104

2000 | 29 Return overhead (nanos):

value ------------- Distribution ------------- count

1000 | 0

1250 |@@@@@ 2412

1500 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 12827

1750 |@@@@@@ 3029

2000 |@@ 1107

2250 |@ 301

Note some important caveats here: We are far from loading the system! We only run at a concurrency level of 1, such that we don’t let some measurements interfere with others. We can’t easily use thread-local-variables since we are measuring work that gets moved between threads, which is why we use the above somewhat simplified approach.

1000ns for a scheduling operation isn’t that high. There are several atomics, a couple of mutex locks, and a DRAM hit will cost 100ns in there. Synchronization has rarely been entirely free, and this is yet another example of it. For a highly optimized loop, 1000ns is an eternity given the good ILP of a modern CPU core. There are easily 2500 insns in there, and this number can be much higher if you use SIMD style vector operations. But this suggests you have a pipeline by which you can feed the system, or that you are not in a synchronization routine. This is quite different.

Dualize Dirty NIF handling

The Erlang documentation suggests using Dirty Schedulers whenever a NIF is about to do lengthy work (1ms or 1000us). But the measurements above suggests another solution: once the overhead of executing a dirty NIF is low enough, it’s execution is hidden in the noise.

For instance, running an enacl crypto_box operation requires you to first to compute an ed25519 curve shared key. These take so much time to compute that the overhead of running on a dirty scheduler is impossible to measure. Hence, it is easier to just always execute it as a dirty operation.

For other operations, say a secret key computation in NaCl/libsodium, encrypting a 4096 byte entry is so fast the movement onto a dirty scheduler dominates. Enacl detects this and simply runs the operation directly on the non-dirty scheduler in order to make many small operations faster.

I’m currently revising the overhead thresholds for the enacl subsystem based on the above measurements. I’m targeting far smaller thresholds now, as the above suggests I can stop worring when the DS overhead is less than 10%, say. So if handling 8192 bytes of data takes 35us, I stop worrying and just run it on a Dirty Scheduler now. As an example, here is the run for one of the operations, secretboxes with a size of 8192 bytes on the above-mentioned Ivy Bridge based CPU:

0 enacl_nif:crypto_secretbox_b/3

value ------------- Distribution ------------- count

15000 | 0

16000 |@@@@@@ 1525

17000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8222

18000 |@ 167

19000 | 3

20000 | 6

21000 | 6

22000 | 2

23000 | 0

24000 | 1

To obtain this, we used Scott L. Fritchie et.al’s work on the Erlang DTrace provider to directly hook onto ‘nif-entry’ and ‘nif-return’ in Erlang and measure how long it took to execute these. It suggests a current overhead of roughly 15%, which is fine I think, provided you know how to configure your Erlang system.