Lately I had too much time and wanted to entertain myself by making a small game. Coincidentally, the js13k 2019 game jam was about to start. It’s themed around games made for popular web browsers that fit in 13 kB (after compressing to a .zip file). Since I am not a fan of JavaScript I decided to explore a topic I had been interested in for a long time: WebAssembly. What’s the state of it, how does it work and what can we expect from it when coding in C++?

State of WebAssembly

I always tend to look into future technologies because I’m not satisfied with what we’ve got these days. WebAssembly promises better performance, portability and, the crucial part, a chance of killing JavaScript (which is the cause of many frequent performance problems in webdev).

As I discovered this month, WebAssembly currently is this bag of cons:

manual memory management

no efficient way to call any functions from HTML5 API

no way to share any data structures (like objects/maps) between JS and WebAssembly

lack of efficient DOM manipulation

the only data types are float32, float64, int32, int64 so with strings you may be on your own

bytecode does not save as much binary size as I would have imagined, compared to JS APIs

Some pros:

it provides much better performance than JS, but it’s not as good as native yet.

specification is still evolving, especially by Mozilla with Rust

Let’s dig into C++.

Emscripten: a tool for building C++ to JS

Emscripten is this magic tool that enables one to compile C++ code for browsers. When it came into existence (around 2011), it would only build to JavaScript code. Huge JS files. Later, the developers of EMCC introduced a new backend with support for WebAssembly. It’s kind of a hybrid because there is some JS, some WASM. But this is too big, like hundreds of additional kilobytes of JS. For the js13k competition it’s a no-brainer to avoid such sizes.

There is also the 3rd mode I was interested in the most. It’s the flag -s SIDE_MODULE=1 . Pure WASM and only WASM, with nooo (generated) JS. Which means no memory management and no support for RTTI.

Manual memory management with C/C++

What??? Yes, no malloc() , free() , new , new[] , delete or delete[] unless you implement those on your own. Handing out memory addresses is not a job for C++, it’s the job of the operating system. Here the actual kind-of-OS is the WebAssembly environment (the web browser in our case), which reserves memory in blocks (pages) of 65536 bytes each. We can reserve as many as we want, using the grow() function on the JS side. By default it’s 1 block.

Here’s memory management in C++:

int* pointer = (int*)1024; *pointer = 123213; // no crash

Feels like a joke? Yep, you can write anywhere you want. Except, you shouldn’t overwrite your stack or the memory reserved for global variables at compile time. And if you do, no one will tell you that you did a wrong thing. No “access violation”. You’re only going to observe weird behaviours caused by overwriting your data with other data in your program.

Fundamentally, to manage memory you have 2 ways or a decent mix of them:

1. Simply specify places in your memory where you store which data. Useful for easily sharing arrays of data between WASM and JS. It’s what I did with vertex/normal/color/index/texcoord buffers.

2. Implement dynamic allocation. Useful for the WASM app’s internal memory.

You may think that allocation is very easy with WebAssembly, since just increasing a memory pointer would be enough. However, if you also want to deallocate memory to reuse the chunks then you need something more sophisticated.
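To make the “just increase a pointer” idea concrete, here is a minimal sketch of a bump allocator. The names ( g_arena , bumpAlloc , bumpReset ) and the fixed 4096-byte arena are my assumptions for illustration; in a real module the arena would be the linear memory itself.

```cpp
#include <cstddef>

// Hypothetical arena standing in for the module's linear memory.
alignas(16) static unsigned char g_arena[4096];
static std::size_t g_top = 0; // offset of the first free byte

// Reserve `size` bytes at a 16-byte aligned offset.
// Returns nullptr when the arena is exhausted.
void *bumpAlloc(std::size_t size) {
    std::size_t start = (g_top + 15) & ~static_cast<std::size_t>(15);
    if (start > sizeof(g_arena) || size > sizeof(g_arena) - start) return nullptr;
    g_top = start + size;
    return g_arena + start;
}

// Individual blocks cannot be freed; the only "free" is resetting everything.
void bumpReset() { g_top = 0; }
```

This is as small as allocation gets, and it also shows the limitation: there is no way to give back a single block, which is why a real malloc() needs more bookkeeping.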

Let’s copy malloc() from somewhere

Emscripten, without the SIDE_MODULE=1 flag, would create JavaScript code with malloc() , free() etc. functions based on emmalloc. It’s a simple but still quite efficient implementation of allocation for single-threaded app. You can search for emscripten_builtin_malloc if you’re interested.

Alternatively, there is an older multi-platform implementation called dlmalloc. Here’s a pull request where it was replaced with emmalloc : https://github.com/emscripten-core/emscripten/pull/6249

They say the mini-malloc is about 1/3rd size of the dlmalloc . I’ve found that dlmalloc took about 15 kB. So I was not interested in it, neither in the emmalloc . I’ve been digging for alternative implementations, like wee_alloc, described to take less than 1 kB and implemented in Rust.

A custom implementation of malloc()

In the end I’ve found a dead simple memalloc which took just a few hundred bytes. Look at memory.cpp to see my implementation. The malloc() function alone took about 240 bytes.
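For a rough idea of what such a dead simple allocator can look like, here is a first-fit sketch with a linked list of block headers. This is not the code from memory.cpp; all names ( BlockHeader , simpleMalloc , simpleFree ) are hypothetical, and a static array stands in for memory that would normally come from sbrk().

```cpp
#include <cstddef>

// Header stored in front of every block; all blocks form a singly linked list.
struct alignas(16) BlockHeader {
    std::size_t size;   // payload size in bytes
    bool free;          // true when the payload can be reused
    BlockHeader *next;  // next block in allocation order
};

alignas(16) static unsigned char g_heap[8192]; // stand-in for sbrk()-provided memory
static std::size_t g_heapTop = 0;
static BlockHeader *g_head = nullptr;
static BlockHeader *g_tail = nullptr;

void *simpleMalloc(std::size_t size) {
    if (size == 0 || size > sizeof(g_heap)) return nullptr;
    size = (size + 15) & ~static_cast<std::size_t>(15); // keep payloads 16-byte aligned
    // First fit: reuse the first freed block that is big enough.
    for (BlockHeader *b = g_head; b != nullptr; b = b->next) {
        if (b->free && b->size >= size) {
            b->free = false;
            return b + 1;
        }
    }
    // Otherwise carve a new block from the end of the heap.
    std::size_t total = sizeof(BlockHeader) + size;
    if (total > sizeof(g_heap) - g_heapTop) return nullptr; // a real version would call sbrk() here
    BlockHeader *b = reinterpret_cast<BlockHeader *>(g_heap + g_heapTop);
    g_heapTop += total;
    b->size = size;
    b->free = false;
    b->next = nullptr;
    if (g_tail) g_tail->next = b; else g_head = b;
    g_tail = b;
    return b + 1;
}

void simpleFree(void *ptr) {
    if (!ptr) return;
    BlockHeader *b = static_cast<BlockHeader *>(ptr) - 1;
    b->free = true; // no splitting or coalescing: the block is reused whole
}
```

The trade-off is deliberate: no splitting and no merging of neighbours keeps the code tiny, at the cost of wasting memory when a small request lands in a big freed block.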

But that’s not all. Remember the memory blocks and the grow() function? When there is not enough memory you need to ask JavaScript for more memory blocks to be used within the WebAssembly environment. Most malloc implementations (like emmalloc or dlmalloc ) use the POSIX/Linux function sbrk(). In the context of WASM we need to provide the sbrk() implementation in JavaScript:

function sbrk(size) {
  let ret = dynamicMemoryBreak
  if (size != 0) {
    // move to the next 16-byte boundary (adds a full 16 bytes when already aligned)
    ret += (16 - dynamicMemoryBreak % 16)
    dynamicMemoryBreak = ret + size;
    if (dynamicMemoryBreak > MEMORY_PAGE_SIZE*allocatedPages - dynamicMemoryOffset) {
      allocatedPages += 1
      // TODO memory.grow(pages)
    } else if (size < 0) {
      // TODO on some condition, memory.drop(pages)
    }
  }
  return ret;
}

I did not need it for my app so I’ve skipped the growing/dropping part. To understand it better see it here: webgl.js, read about WASM memory management here and about sbrk().
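The growing part I skipped boils down to simple page arithmetic. Here is a sketch of it in C++; pagesToGrow and kWasmPageSize are hypothetical names, and the result is what one would hand to the grow() function of the memory object on the JS side.

```cpp
#include <cstdint>

constexpr std::uint32_t kWasmPageSize = 65536; // size of one WebAssembly memory page

// Number of extra pages needed so that every address below `newBreak` is valid,
// given that `allocatedPages` pages are already reserved.
constexpr std::uint32_t pagesToGrow(std::uint32_t newBreak, std::uint32_t allocatedPages) {
    // Round up to whole pages; 64-bit math avoids overflow near 4 GiB.
    std::uint64_t needed = (static_cast<std::uint64_t>(newBreak) + kWasmPageSize - 1) / kWasmPageSize;
    return needed > allocatedPages ? static_cast<std::uint32_t>(needed - allocatedPages) : 0;
}
```

For example, with the default single page, a break at 200000 bytes needs 4 pages in total, so 3 more have to be requested.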

C++ Limitations of Emscripten for SIDE_MODULE=1

I couldn’t use the following things:

STL - some parts were either too big or wouldn’t work (read on)

function pointers - WASM function table would be so messed it couldn’t point at valid address

virtual methods (for overriding in subclasses) - as above

polymorphism - because it needs vtable

constructors - except if it’s a simple struct, otherwise addresses would be invalid

templates - they work but result in huge growth in binary size

They say you could use exceptions under a flag but I haven’t tried. So you can see it’s basically some C-like subset of what’s left.

C++: Classes and Constructors

Even the simplest constructor usually got me a runtime error:

Uncaught (in promise) LinkError: WebAssembly.instantiate(): Import #1 module="env" function="__Znwm" error: function import requires a callable

After long debugging I discovered that __Znam is the mangled name of the array new operator (operator new[]), and __Znwm is the same thing for the plain new operator (operator new).

What’s weird though, you can call methods of object created in place:

LevelRenderSystem system; system.init();

But no luck with:

LevelRenderSystem* system = new LevelRenderSystem(); system->init();

This worked but it’s probably because of some compiler optimisation:

LevelRenderSystem system1; system1.init(); LevelRenderSystem* system2 = &system1; system2->init();

So you end up with functions and pure C-style structs.
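One thing I have not verified under SIDE_MODULE=1, but which avoids the global new operator by construction, is placement new into storage reserved at compile time. A minimal sketch, where Counter , g_counterStorage and createCounter are made-up names and the availability of the <new> header in the target build is an assumption:

```cpp
#include <new> // placement new; assumed to be usable in the target build

struct Counter { // hypothetical class with a non-trivial constructor
    int value;
    explicit Counter(int start) : value(start) {}
    void increment() { ++value; }
};

// Static storage reserved at compile time, so no allocator is involved.
alignas(Counter) static unsigned char g_counterStorage[sizeof(Counter)];

Counter *createCounter(int start) {
    // Runs the constructor inside the preallocated bytes instead of
    // asking the global operator new (the __Znwm import) for memory.
    return new (g_counterStorage) Counter(start);
}
```

Whether the constructor call itself survives this build mode is a separate question; the sketch only removes the dependency on an allocator.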

C++: Function pointers

Function pointers just don’t work. However, I’ve managed to get along with lambdas. Since I had my own implementation of memcpy(), using std::function ended with compilation conflicts.

So I’ve found an alternative, TransientFunction.h, which worked but took too many bytes. It’s because it uses recursive variadic templates. This leads to a type explosion which results in separate bytecode for every usage. It’s actually the same situation with std::function .

C++: Templates

Speaking of templates, those do work, but after implementing a slightly sophisticated definition of an ECS world and seeing another set of hundreds of kilobytes used, I gave up on templates.

Every template class in C++ is instantiated separately for every type it is used with. This generates a new set of methods, which adds a lot of bytes to the final output. For instance, an Array with methods add, set, get and remove generates a separate copy of those methods for every T applied to the Array template.

Usage of the mentioned over-sophisticated class looked like this:

EcsWorld< Components<Transform, Level, Vehicle, Collider>, Systems<UpdateMovement, SimulatePhysics, RenderEntites>, 1000 > world;

The EcsWorld class takes these template arguments:

a class (yes, a class, not a type)

a class (again)

a number (maximum number of entities)

Example like above just for 3 components and 2 systems took 2800 bytes. So big no-no.

Let’s look at a smaller example. One usage of a single template type for the following template method takes approximately 15 bytes:

template <class T>
T &createAt(int index) {
    char *ptr = getPointer(size);
    return *(T *)ptr; // dereference the slot; (T &)ptr would alias the local pointer variable itself
}

If you have enclosed this helper with 4 others (get, create, set, remove) in another helper class (which guarantees zipping all types to this helper) and used it with 15 types (a very small number of Components for a game jam) then… do the math. Hundreds. Just for providing sugar.

So, no, sugar is not for free.
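What is left without the sugar is one set of functions that works on raw bytes and is told the element size at runtime. A minimal sketch of that idea, with hypothetical names ( ByteArray , arrayInit , arrayAdd , arrayGet ) and caller-provided storage:

```cpp
#include <cstddef>

// One non-template container: elements are raw bytes of a fixed size.
struct ByteArray {
    unsigned char *data;     // storage provided by the caller
    std::size_t elementSize; // size of one element in bytes
    std::size_t capacity;    // maximum number of elements
    std::size_t count;       // elements currently stored
};

void arrayInit(ByteArray &a, void *storage, std::size_t elementSize, std::size_t capacity) {
    a.data = static_cast<unsigned char *>(storage);
    a.elementSize = elementSize;
    a.capacity = capacity;
    a.count = 0;
}

// Returns the slot for a new element, or nullptr when the array is full.
void *arrayAdd(ByteArray &a) {
    if (a.count == a.capacity) return nullptr;
    return a.data + a.elementSize * a.count++;
}

// Returns the slot at `index`, or nullptr when the index is out of range.
void *arrayGet(ByteArray &a, std::size_t index) {
    return index < a.count ? a.data + a.elementSize * index : nullptr;
}
```

The caller casts the returned pointer to the component type. It is uglier and less safe than a template, but the functions exist exactly once in the binary no matter how many component types use them.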

C++: memset

This one is very interesting. I haven’t used memset() but it appeared in expected imports. How? Compiler found this pattern:

void myprivatememzero(char* ptr, unsigned int size) {
    for (unsigned int i = 0; i < size; ++i)
        ptr[i] = 0;
}

No matter how I called this function, its contents would be replaced with memset! It’s not happening when it’s just a loop with some other code, though. So I made this a macro:

#define MEMZERO(ptr, size) \
    for (int i = 0; i < size; ++i) \
        ptr[i] = 0;

And memset() is no longer needed by linker.
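Another approach that might keep this as a regular function is writing through a volatile pointer. I have not tested this with Emscripten, so treat it as an assumption about the optimizer: volatile stores are observable side effects, which should stop the loop from being recognized as a memset() pattern. The name zeroBytes is made up.

```cpp
// Zero `size` bytes one at a time. Each store goes through a volatile
// pointer, so the compiler has to emit every write individually instead
// of (as assumed here) collapsing the loop into a memset() call.
void zeroBytes(void *ptr, unsigned int size) {
    volatile unsigned char *p = static_cast<volatile unsigned char *>(ptr);
    for (unsigned int i = 0; i < size; ++i)
        p[i] = 0;
}
```

The price is that the loop can no longer be vectorized or widened, so it is likely slower than a real memset() for big buffers.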

C++: virtual methods and vtable

If you have a virtual method in a class so you could override it in a subclass… you’re out of luck because the vtable for __cxxabiv1::__class_type_info is lacking.

Uncaught (in promise) LinkError: WebAssembly.instantiate(): Import #6 module="env" function="g$__ZTVN10__cxxabiv117__class_type_infoE" error: function import requires a callable

In fact, this is runtime type information. Although WebAssembly includes a function table feature, Emscripten doesn’t make use of this for vtables on this configuration.
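Without vtables and function pointers, the C-style way to get polymorphic behaviour is a type tag plus a switch. A minimal sketch with made-up types ( ShapeKind , Shape , area ):

```cpp
enum class ShapeKind { Circle, Rect }; // hypothetical types for illustration

struct Shape {
    ShapeKind kind;
    float a; // radius for Circle, width for Rect
    float b; // unused for Circle, height for Rect
};

// Dispatch on the tag instead of calling a virtual method.
float area(const Shape &s) {
    switch (s.kind) {
        case ShapeKind::Circle: return 3.14159265f * s.a * s.a;
        case ShapeKind::Rect:   return s.a * s.b;
    }
    return 0.0f; // unreachable for valid tags
}
```

Every new “subclass” means touching each switch, which is annoying in a big codebase but perfectly manageable at game jam scale.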

Caveats between C++ and WASM?

float height = c.height; height *= 0.8;

The compiler won’t tell you what’s wrong here. 0.8 is a double literal, so the multiplication happens in double precision. You’d better write 0.8f, or the Emscripten compiler will dump the full WAT code (WebAssembly text format) of your program with no further information.

Another one. Variable inside a method:

float faceVertices[5 * 3];

Its address turned out to be 0. Adding the static keyword fixes that.

WASM binary: Size optimizations

Avoiding templates is the basics of size optimization. Here are more tricks.

Compare these two small loops:

for (int j = 0; j < 4; ++j) {
    for (int k = 0; k < 3; ++k)
        e.renderNormalBuffer[i + j * 3 + k] = e.currentVertexNormal[k];
}

for (int j = 0; j < 4; ++j) {
    e.renderNormalBuffer[i + j * 3 + 0] = e.currentVertexNormal[0];
    e.renderNormalBuffer[i + j * 3 + 1] = e.currentVertexNormal[1];
    e.renderNormalBuffer[i + j * 3 + 2] = e.currentVertexNormal[2];
}

20 bytes of difference.

I’ve got a bigger one. This baby saved me 402 bytes:

for (int fsize = 3; fsize <= 4; ++fsize) {
    const uint faceSize = fsize + 1; //3 vertices + normal
    const uchar *faces = (uchar *)(fsize == 3 ? model.faces3 : model.faces4);
    const uint faceCount = fsize == 3 ? model.face3Count : model.face4Count;

    for (int fi = 0; fi < faceCount; fi += 1) {
        // collect normal vertex + 3/4 vertices
        for (int ci = 0; ci < faceSize; ++ci) {
            const uint vi = faces[fi * faceSize + ci] * valuesPerVertex;
            for (int k = 0; k < 3; ++k) {
                faceVertices[ci * 3 + k] = ((float)vert[vi + k * 2 + 0] + ((float)vert[vi + k * 2 + 1] / 100.0f)) / s - vAlign[k];
            }
        }

        setCurrentVertexNormal(faceVertices[0], faceVertices[1], faceVertices[2]);

        int vertexIndex;
        for (int k = 3; k < faceSize * 3; k += 3) {
            vertexIndex = vertex(faceVertices[k], faceVertices[k + 1], faceVertices[k + 2]);
        }

        if (fsize == 3) {
            index(vertexIndex - 2);
            index(vertexIndex - 1);
            index(vertexIndex);
        } else if (fsize == 4) {
            index(vertexIndex - 3);
            index(vertexIndex - 2);
            index(vertexIndex - 1);
            index(vertexIndex - 1);
            index(vertexIndex);
            index(vertexIndex - 3);
        }
    }
}

in place of this:

for (int i = 0; i < model.face3Count; i += 1) {
    const uint ni0 = (faces3[i * 4 + 0]) * valuesPerVertex;
    const uint vi0 = (faces3[i * 4 + 1]) * valuesPerVertex;
    const uint vi1 = (faces3[i * 4 + 2]) * valuesPerVertex;
    const uint vi2 = (faces3[i * 4 + 3]) * valuesPerVertex;

    float v1x = ((float)vert[vi0 + vx0] + ((float)vert[vi0 + vx1] / 100.0f)) / s - vAlignX;
    float v1y = ((float)vert[vi0 + vy0] + ((float)vert[vi0 + vy1] / 100.0f)) / s - vAlignY;
    float v1z = ((float)vert[vi0 + vz0] + ((float)vert[vi0 + vz1] / 100.0f)) / s - vAlignZ;
    float v2x = ((float)vert[vi1 + vx0] + ((float)vert[vi1 + vx1] / 100.0f)) / s - vAlignX;
    float v2y = ((float)vert[vi1 + vy0] + ((float)vert[vi1 + vy1] / 100.0f)) / s - vAlignY;
    float v2z = ((float)vert[vi1 + vz0] + ((float)vert[vi1 + vz1] / 100.0f)) / s - vAlignZ;
    float v3x = ((float)vert[vi2 + vx0] + ((float)vert[vi2 + vx1] / 100.0f)) / s - vAlignX;
    float v3y = ((float)vert[vi2 + vy0] + ((float)vert[vi2 + vy1] / 100.0f)) / s - vAlignY;
    float v3z = ((float)vert[vi2 + vz0] + ((float)vert[vi2 + vz1] / 100.0f)) / s - vAlignZ;

    float n1x = ((float)vert[ni0 + vx0] + ((float)vert[ni0 + vx1] / 100.0f)) / s - vAlignX;
    float n1y = ((float)vert[ni0 + vy0] + ((float)vert[ni0 + vy1] / 100.0f)) / s - vAlignY;
    float n1z = ((float)vert[ni0 + vz0] + ((float)vert[ni0 + vz1] / 100.0f)) / s - vAlignZ;

    setCurrentVertexNormal(n1x, n1y, n1z);
    triangle(v1x, v1y, v1z, v2x, v2y, v2z, v3x, v3y, v3z);
}

for (int i = 0; i < model.face4Count; i += 1) {
    const uint vi0 = faces4[i * 5 + 1] * valuesPerVertex;
    const uint vi1 = faces4[i * 5 + 2] * valuesPerVertex;
    const uint vi2 = faces4[i * 5 + 3] * valuesPerVertex;
    const uint vi3 = faces4[i * 5 + 4] * valuesPerVertex;
    const uint ni0 = faces4[i * 5 + 0] * valuesPerVertex;

    float v1x = ((float)vert[vi0 + vx0] + ((float)vert[vi0 + vx1] / 100.0f)) / s - vAlignX;
    float v1y = ((float)vert[vi0 + vy0] + ((float)vert[vi0 + vy1] / 100.0f)) / s - vAlignY;
    float v1z = ((float)vert[vi0 + vz0] + ((float)vert[vi0 + vz1] / 100.0f)) / s - vAlignZ;
    float v2x = ((float)vert[vi1 + vx0] + ((float)vert[vi1 + vx1] / 100.0f)) / s - vAlignX;
    float v2y = ((float)vert[vi1 + vy0] + ((float)vert[vi1 + vy1] / 100.0f)) / s - vAlignY;
    float v2z = ((float)vert[vi1 + vz0] + ((float)vert[vi1 + vz1] / 100.0f)) / s - vAlignZ;
    float v3x = ((float)vert[vi2 + vx0] + ((float)vert[vi2 + vx1] / 100.0f)) / s - vAlignX;
    float v3y = ((float)vert[vi2 + vy0] + ((float)vert[vi2 + vy1] / 100.0f)) / s - vAlignY;
    float v3z = ((float)vert[vi2 + vz0] + ((float)vert[vi2 + vz1] / 100.0f)) / s - vAlignZ;
    float v4x = ((float)vert[vi3 + vx0] + ((float)vert[vi3 + vx1] / 100.0f)) / s - vAlignX;
    float v4y = ((float)vert[vi3 + vy0] + ((float)vert[vi3 + vy1] / 100.0f)) / s - vAlignY;
    float v4z = ((float)vert[vi3 + vz0] + ((float)vert[vi3 + vz1] / 100.0f)) / s - vAlignZ;

    // vertex normal
    float n1x = ((float)vert[ni0 + vx0] + ((float)vert[ni0 + vx1] / 100.0f)) / s - vAlignX;
    float n1y = ((float)vert[ni0 + vy0] + ((float)vert[ni0 + vy1] / 100.0f)) / s - vAlignY;
    float n1z = ((float)vert[ni0 + vz0] + ((float)vert[ni0 + vz1] / 100.0f)) / s - vAlignZ;

    setCurrentVertexNormal(n1x, n1y, n1z);
    quad(v1x, v1y, v1z, v2x, v2y, v2z, v3x, v3y, v3z, v4x, v4y, v4z);
}

Adding loops to reduce the number of repeated index accesses ends up with a smaller size! Sure, in performance terms it may be slower. More loops may mean more branching. However, this was js13k.

The weirdest thing is not wanting global variables. Not because of bad smells but because every global variable’s name is exported for JS. So each variable name is put into the *.wasm file as a string. If you have global state, put everything into a struct and make one global variable of that struct.
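In code, the single-struct approach can look like the sketch below. The fields and function names are hypothetical; the point is only that one symbol name ends up in the binary instead of one per variable.

```cpp
// Instead of several separate globals such as
//   int score; float playerX; bool paused;
// all state lives in one struct with a single global instance.
struct GameState {
    int score;
    float playerX;
    bool paused;
};

GameState g_state = {0, 0.0f, false}; // the only global variable

void addScore(int points) { g_state.score += points; }
int getScore() { return g_state.score; }
```

Field names exist only at compile time, so they can stay as descriptive as you like.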

WebAssembly: conclusions of the experiment

Size optimization was quite a challenge. WASM was not as small as JS would be. My binary (before compression) took more than 19 kB. With my 3D engine (supporting colored/textured triangles, quads, texture generation, cameras, shader lighting), some tools like a dynamic Array and memalloc, a dead simple game architecture (ECS) and some amount of game implementation (time-rewind of a few objects) 19 kB feels like a lot. Too much for this competition to make something very good.

It eats so many bytes for 2 reasons:

1. Although WASM itself is a high-level bytecode that introduces variables and loops, it has to denote every single operation: basic operators between numbers ( + , - etc.), variable access, constant access, index access. If parts of the expressions come from a non-local environment then temporary variables are introduced and the whole thing grows.

2. Lack of very simple APIs: collections (like dynamic arrays), memory management, logging, math functions, WebGL functions, WebGL context, DOM access etc.

I was mostly interested in speeding up web app frameworks like Angular or Elm. Whatever you want to call, you need to import its definition. When you call it the arguments/data needs a conversion. Even a simplest array has to be specially handled when it’s passed as a pointer. You need Float32Array , Int32Array or similar.
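On the C++ side, passing an array to JS comes down to exporting a pointer and a length. A sketch under my own naming ( g_vertexBuffer , getVertexBuffer , getVertexFloatCount , pushVertex are all hypothetical, as is the capacity):

```cpp
constexpr int kMaxVertices = 256; // hypothetical capacity

static float g_vertexBuffer[kMaxVertices * 3];
static int g_vertexCount = 0;

// extern "C" keeps the exported names unmangled for the JS side.
extern "C" {
    float *getVertexBuffer() { return g_vertexBuffer; }
    int getVertexFloatCount() { return g_vertexCount * 3; }
}

// Appends one vertex and returns its index, or -1 when the buffer is full.
int pushVertex(float x, float y, float z) {
    if (g_vertexCount == kMaxVertices) return -1;
    float *v = g_vertexBuffer + g_vertexCount * 3;
    v[0] = x;
    v[1] = y;
    v[2] = z;
    return g_vertexCount++;
}
```

JS receives the pointer as a plain number (a byte offset into linear memory) and can wrap it with new Float32Array(memory.buffer, pointer, count), where memory is the module’s WebAssembly.Memory object. No data is copied, but the view has to be recreated if the memory grows.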

Memory management is not automatic. No garbage collection means it’s an out-of-scope challenge for implementing functional languages, like Elm, in WebAssembly. It also makes converting JavaScript to WebAssembly almost impossible. The easiest approach would be to implement it specifically for Elm’s compiler (or whatever language/framework).

Summary

My game was a design failure. As 95% (rough estimate) of designs are. It looked very cool on paper and was either too dumb or too hard in practice.

However, I gladly explored the topic of WebAssembly. I have implemented a WebGL engine in this experiment and now I understand exactly why HTML5 APIs can’t be used efficiently at this point and what the challenges are for functional languages like Elm. It’s been a good experience I wouldn’t recommend.

Resources: