While performance optimization is often talked and written about, code size improvements are sought after far less often. In this article, we’ll talk about the latter.

Why should we optimize for smaller code size, anyway?

One reason (which is also how I got into this subject) is that in particular scenarios and on particular devices, your code size is actually limited. It might sound unbelievable, but even in this era — when it’s considered legit to embed a full Chrome web engine in your Android application — 10KB here and there might make the difference.

Now, I must admit that after having size optimizations in mind for a while, it actually became quite fun to fight with GCC in an attempt to make certain functions and snippets produce better machine code.

If this foreword has got you interested, or if you’re just wondering what more can be done besides using the helpful GCC flags -Os -ffunction-sections -fdata-sections -Wl,--gc-sections -flto — then I invite you to read onward.

A slight disclaimer: I won’t be diving deep into every subject mentioned here. I’ll try to briefly explain each matter, though, so don’t be intimidated if you don’t understand every bit (like those flags I mentioned above).

I’ll be talking about RISC architectures, where it’s easier to understand which machine code the compiler may produce, and why some results are better than others. This is harder to do on complex architectures like x86, where code size improvements are, in my opinion, less relevant anyway.

Project-wide optimizations

We’ll work on MicroPython. In case you’ve never heard of this project, then shortly:

MicroPython is a lean and efficient implementation of Python 3.

It’s a very cool project, and it’s an interesting target for size optimizations because it’s already damn good at them. You won’t find any low-hanging fruit here: this codebase is routinely optimized for size, with new changes and improvements in every part of it.

When optimizing, you can focus on a module, a function, a small code snippet… Or, you can work on changes that have a project-wide effect — that’s what the focus will be on today.

We’ll be working with the qemu-arm port of MicroPython, since it is an ARM build that doesn’t require any special hardware — it is emulated on QEMU.

Code & Data references

In a big project composed of many files and modules, there are tons of functions and data items involved.

Many ISAs (Instruction Set Architectures), ARM included, have multiple instructions for accessing data. As it turns out, those instructions may account for quite a bit of the total code size.

For example, given this C code:
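The listing itself appears to be missing here; a minimal reconstruction, based on the symbol names in the disassembly that follows (the exact variable name and function bodies are assumed):

```c
/* Reconstructed example: a global variable accessed from two small
 * functions (names taken from the disassembly symbols below). */
int my_global_int;

int mul_by_global(int x)
{
    return x * my_global_int;
}

int mul_by_global_plus_one(int x)
{
    return x * my_global_int + 1;
}
```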

There are multiple ways to turn a global variable access into ARM code. Compilers will usually use the following one:

00000000 <mul_by_global>:
   0:	e59f3008 	ldr	r3, [pc, #8]	; 10 <mul_by_global+0x10>
   4:	e5933000 	ldr	r3, [r3]
   8:	e0000093 	mul	r0, r3, r0
   c:	e12fff1e 	bx	lr
  10:	00000000 	.word	0x00000000

After the function’s epilogue, the compiler has placed a single data word, which contains the address of my_global_int. The function loads that address into a register, then dereferences it. This two-step procedure is required because ARM doesn’t have a general “load word from a 32-bit absolute address into a register” instruction (like the one x86 has), and the compiler doesn’t know where my_global_int will eventually be placed, so it has to prepare for the worst and generate code that will work with any address.

That’s 12 bytes, or the equivalent of 3 opcodes, to load a single word from memory. Not too good.

In Thumb (basically a 16-bit ARM mode) it’s even worse: since most opcodes are 16-bit, and the loaded data word must be aligned to 4 bytes for ldr, then if the literal pool after the function’s end doesn’t fall on a 4-byte boundary, a padding nop must be added! (That’s the nop at address 0x16. It never runs, but it occupies program space.)

0000000c <mul_by_global_plus_one>:
   c:	4b02     	ldr	r3, [pc, #8]	; (18 <mul_by_global_plus_one+0xc>)
   e:	681b     	ldr	r3, [r3, #0]
  10:	4358     	muls	r0, r3
  12:	3001     	adds	r0, #1
  14:	4770     	bx	lr
  16:	46c0     	nop			; (mov r8, r8)
  18:	00000000 	.word	0x00000000

10 bytes are used to load the value. That’s the equivalent of five 16-bit opcodes.

Can we make it any better?

Yes, there are tricks.

One common approach is to place certain items in a different, smaller segment, which allows using shorter instructions when accessing them. This “smaller” section is called sdata or sbss (small-data, small-bss).

But why does it have to be small? Why can’t we place all data there and enjoy the better access instructions? Well, it depends on the platform and architecture, but on many of them, the small data section is implemented by setting one of the general-purpose registers to always point at that section. Items can then be accessed via this register, which evidently results in better instructions. The offset of the data from the register is limited, hence the limit on the section size. For example, in ARM’s ldr the immediate offset is effectively a signed 13-bit value, so the reachable range is about 8192 bytes.
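As a sketch in ARM assembly (the base register choice and offsets are hypothetical):

```asm
@ sdata access via a dedicated base register (say r9): one instruction,
@ no literal pool needed.
ldr     r0, [r9, #16]       @ load my_item via base + small offset

@ versus the generic absolute-address form:
ldr     r3, [pc, #8]        @ load the item's address from a literal pool
ldr     r0, [r3]            @ then dereference it
.word   my_item             @ 4-byte literal holding the address
```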

We’ll see how we can combine the sdata concept with smaller data access instructions.

Which items should be improved?

Every program has data items and functions that are referenced more often than others. If we have to choose which accesses to improve, we’d better pick data items that are referenced many times throughout the code.

A big project like MicroPython surely has such commonly-referenced items. So, how do we find them? Counting references in the source code is a hassle, and not too accurate anyway: in a highly optimized project, there may be a big difference between what we see in the input sources and what’s left in the output binary.

We’ll extract this information from the final binary itself, and use the relocations section for aid.

Relocations, in short, are all the information about what needs to be fixed up in a program when loading it at a different location in memory (usually, prior to execution).

For example, accessing data via an absolute address will require a relocation entry, in case that absolute address changes. We can count those entries, and that will tell us the number of references to an item.

I wrote a helper script, count_references.py, that lists all data items referenced in an ELF, ordered by the number of references they have.
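I won’t reproduce the full script here, but a minimal sketch of the same idea can be built on top of the text output of readelf -r (the parsing regex and function name below are mine, not the actual script’s):

```python
import re
from collections import Counter, defaultdict

# A relocation entry from `readelf -r` looks roughly like:
#   0001c6f4  00012a02 R_ARM_ABS32       00022144   mp_const_none_obj
ENTRY = re.compile(r"^\s*[0-9a-f]+\s+[0-9a-f]+\s+(R_\w+)\s+\S+\s*(\S*)")

def count_references(readelf_r_output):
    """Return (symbol, Counter-of-reloc-types) pairs, ordered by how
    many relocation entries (i.e. references) each symbol has."""
    per_symbol = defaultdict(Counter)
    for line in readelf_r_output.splitlines():
        m = ENTRY.match(line)
        if m:
            reloc_type, symbol = m.groups()
            per_symbol[symbol][reloc_type] += 1
    return sorted(per_symbol.items(),
                  key=lambda item: sum(item[1].values()),
                  reverse=True)
```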

Now, let’s get to business. Grab the MicroPython sources, set up some ARM toolchain and compile the qemu-arm port.



$ git clone https://github.com/micropython/micropython.git
$ # on arch
$ pacman -S arm-none-eabi-gcc
$ # on ubuntu
$ apt install gcc-arm-none-eabi
$ make -C ports/qemu-arm

We get this final ELF:

$ arm-none-eabi-size ports/qemu-arm/build/firmware.elf
   text	   data	    bss	    dec	    hex	filename
 151624	     12	    448	 152084	  25214	ports/qemu-arm/build/firmware.elf

Let’s use the count_references.py script and check for candidates.

$ python count_references.py ports/qemu-arm/build/firmware.elf
$ # what? no output?

Final linking of ELFs may strip away the relocation info (if a program is to be loaded at a known, constant address, why keep the relocation info around?).

You can tell ld to keep this information using --emit-relocs . Add this to the LDFLAGS in ports/qemu-arm/Makefile , then re-link and try count_references.py again.
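In the Makefile this could look something like the following (a hypothetical snippet; whether you need the -Wl, prefix depends on whether linking is driven by ld directly or through gcc):

```make
# ports/qemu-arm/Makefile: keep relocation entries in the final ELF
LDFLAGS += --emit-relocs        # or -Wl,--emit-relocs when linking via gcc
```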

: 1377
    R_ARM_ABS32: 1377
__aeabi_fmul: 199
    R_ARM_THM_CALL: 199
__aeabi_fsub: 139
    R_ARM_THM_CALL: 139
__aeabi_fadd: 130
    R_ARM_THM_CALL: 130
compile_node: 99
    R_ARM_THM_CALL: 95
    R_ARM_THM_JUMP24: 4
.....

The top of the list has a blank symbol name. Those are nameless data items (string literals and other unnamed constants), so grouping by symbol name doesn’t work for them. Skip those.

The next few are split by relocation type. All items with R_ARM_THM_CALL are functions (that’s the relocation type for bl), and while calls can also be optimized, that’s a subject for another post.

So, we’re left with a few leaders having R_ARM_ABS32. This relocation is performed by writing the absolute address of the data item being relocated — it’s the one used by those pesky ldrs we’ve seen (and it has other uses as well, e.g. function pointers in a struct).

Let’s start with mp_const_none_obj. It is a tiny object — just the size of a pointer (4 bytes for us) — and it has almost 100 accesses, many of them of the ldr type I presented earlier. For example:

0001c6ec <mp_builtin_open>:
   1c6ec:	4800     	ldr	r0, [pc, #0]	; (1c6f0 <mp_builtin_open+0x4>)
   1c6ee:	4770     	bx	lr
   1c6f0:	00022144 	.word	0x00022144

This function is a mere return mp_const_none; (mp_const_none is a macro expanding to the address of mp_const_none_obj).

Moving objects around

What if those 4 poor bytes could be shoved somewhere else? Recalling the possible sdata implementation with a dedicated register: that would work, but it seemed like too big of a change. I kept searching for something else.

When thinking about the memory addresses used in your program, it’s a good idea to know how its memory is laid out. The qemu-arm port has a linker script describing the output memory sections (ports/qemu-arm/stm32.ld).

The linker script dictates how functions and data objects are to be placed, and it gives great freedom on other output settings, among them the addresses of output sections.

The script tells the linker that the read-only memory of the program resides at address 0x0 .

MEMORY
{
    ROM : ORIGIN = 0x00000000, LENGTH = 1M
    RAM : ORIGIN = 0x20000000, LENGTH = 128K
}

Hmm.. Interesting. We can exploit this fact.

Most ISAs have instructions that deal with small immediates, without requiring extra loads from memory. In the case of Thumb, there’s a mov rd, #imm instruction, accepting an 8-bit unsigned immediate. An object stored near address 0x0 can be referenced with an 8-bit number, can’t it?

Furthermore, there’s even a cmp rd, #imm with 8-bit immediates, so all comparisons with mp_const_none won’t even require an extra register load. There are many such comparisons.
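A sketch of the difference, in Thumb assembly (the small address 0x40 is an assumption at this point):

```asm
@ If mp_const_none_obj lives at a small address (say 0x40), a comparison
@ needs no extra load at all:
cmp     r0, #0x40           @ 2 bytes, done

@ The generic form needs a register load plus a literal pool entry:
ldr     r3, [pc, #8]        @ 2 bytes: fetch the address from the pool
cmp     r0, r3              @ 2 bytes
.word   mp_const_none_obj   @ 4-byte literal holding the address
```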

So… How do we actually do it?

Easy! Remember the linker script? It was meant for this kind of stuff.

SECTIONS
{
    .text : {
        . = ALIGN(4);
        KEEP(*(.isr_vector))
        *(.text*)
        *(.rodata*)
        . = ALIGN(4);
        _etext = .;
        _sidata = _etext;
    } > ROM

    ....
}

This snippet from stm32.ld defines an output section named .text , that will be placed in the read-only memory. It will take the input sections given in the object files: all .isr_vector sections, then all .text.* sections, and so on, in the order they appeared in the script.

We can have mp_const_none_obj be placed in its own, private section, and that section could be placed first in the output .text !

Well, not first, actually, since the .isr_vector section really has to come before it, for reasons we won’t get into. Luckily, this first section is only 0x40 bytes, leaving us the rest of the 8-bit-reachable space around 0x0.

We’ll add __attribute__((section(".rodata.mp_const_none_obj"))) to the definition of the object in objnone.c , and then add KEEP(*(.rodata.mp_const_none_obj)) in the linker script.

Re-link, and….

Nothing has changed!

$ arm-none-eabi-size ports/qemu-arm/build/firmware.elf
   text	   data	    bss	    dec	    hex	filename
 151624	     12	    448	 152084	  25214	build/firmware.elf

Well, not the size, anyway. That’s exactly the same result! How come? Did the linker disobey our new command?

$ arm-none-eabi-objdump -dz ports/qemu-arm/build/firmware.elf

ports/qemu-arm/build/firmware.elf:     file format elf32-littlearm

Disassembly of section .text:

00000000 <isr_vector>:
   0:	00 00 02 20 95 a5 01 00 69 a5 01 00 69 a5 01 00     ... ....i...i...
  10:	69 a5 01 00 69 a5 01 00 69 a5 01 00 00 00 00 00     i...i...i.......
  20:	00 00 00 00 00 00 00 00 00 00 00 00 69 a5 01 00     ............i...
  30:	69 a5 01 00 00 00 00 00 69 a5 01 00 69 a5 01 00     i.......i...i...

00000040 <mp_const_none_obj>:
  40:	48 21 02 00                                          H!..

No, it didn’t. mp_const_none_obj is placed exactly where we wanted it to be.

Let’s check mp_builtin_open again:

0001c6f0 <mp_builtin_open>:
   1c6f0:	4800     	ldr	r0, [pc, #0]	; (1c6f4 <mp_builtin_open+0x4>)
   1c6f2:	4770     	bx	lr
   1c6f4:	00000040 	.word	0x00000040

What happened, then? A full word is still used to load the poor 0x40 value.

The problem is that the linker is the one who knows mp_const_none_obj will end up at 0x40. The compiler, which is responsible for emitting the instructions, didn’t know that &mp_const_none_obj is an expression that can fit in a mov rd, #imm instruction. So, like I mentioned earlier, it prepares for the worst and emits instructions that will work with any 32-bit value.

The previous paragraph is a bit inaccurate when LTO (Link-Time Optimization) is active. Check out the bottom of this article for a few notes about LTO.

How can you tell the compiler it can use a different instruction? In the case of function calls, for example, many architectures differentiate between short calls and long calls. You can tell the compiler which mode should be used via a global command-line switch, or even at the function level by adding e.g. __attribute__((long_call)) to a function declaration. Across different architectures, short/long calls are implemented using different instructions, with short calls being the better half, requiring fewer instruction bytes (and usually fewer instructions).

I’m not aware of any short access attributes, however.

As a general rule, the compiler will optimize based on whatever it knows at compile time. If we can force the compiler to directly access address 0x40 instead, things might change. It feels hacky, but let’s try the following:

// replace this line in obj.h
#define mp_const_none (MP_OBJ_FROM_PTR(&mp_const_none_obj))

// with this one
#define mp_const_none (MP_OBJ_FROM_PTR((void*)0x40))

Recompile, and…

$ arm-none-eabi-size ports/qemu-arm/build/firmware.elf
   text	   data	    bss	    dec	    hex	filename
 151344	     12	    448	 151804	  250fc	ports/qemu-arm/build/firmware.elf

Woohoo! It worked. The size decreased by about 300 bytes. Let’s move more relevant objects close to our address 0x40: I chose mp_const_true, mp_const_false and mp_type_type. They all scored high in the script’s output.

After undergoing the same procedure with those 3:

$ arm-none-eabi-size ports/qemu-arm/build/firmware.elf
   text	   data	    bss	    dec	    hex	filename
 150512	     12	    448	 150972	  24dbc	ports/qemu-arm/build/firmware.elf

Voila! Total size improvement: about 1KB. Quite cool.

You can also run the tests to make sure nothing was broken.

$ make -C ports/qemu-arm -f Makefile.test test

Afterword

Does it actually help? qemu-arm by itself is an experimental port used for emulation and testing. And it’s just 1KB, isn’t it?

Well, I have tried this trick in a few other ports of MicroPython, but their memory map was too different for it to work out-of-the-box. It might be useful on some other, specific ARM builds. You can see my progress in this pull request on GitHub.

Edit 18.12.2019: About a month after this writing, Damien (the MicroPython project maintainer) took this basic idea even further: by manipulating the way Python objects are represented natively, he let the common constants (such as mp_const_none) be represented with very few, low bits. On most architectures these bits can be loaded with simple, small instructions, since they fit in a byte. Those “numbers” are not real objects anymore (they don’t point to any struct), so this can be applied to all architectures and builds, without imposing anything on the address space layout (unlike the idea presented in this article). A very neat trick!

Sometimes, with a bit of thought, you can greatly improve your output code. Optimization possibilities are endless. Compilers can do most of the obvious, local tricks, but (for now) it’s us who can think of the more complex, broader and more creative changes that improve our programs even further.

Anyway, it sure was fun, and 1KB is a significant decrease in the world of “micro size optimizations” I’m dealing with here.

That’s it! This was my first piece of technical writing (or my first writing at all, really), and I truly enjoyed it. I hope you enjoyed reading it :)

Similar tricks in other projects?

The first example that pops into my mind is the current (or current_thread_info ) pointer in the Linux kernel. current allows quick access to the data structure of the currently executing task/thread. It is used extensively.

Its implementation differs between architectures, but it generally falls into one of two approaches:

1. Dedicate a register that always points to the struct. When the current task changes, the register set changes as well, and so the pointer is updated for the new task.

2. Store the struct at the lower end of the stack; then a simple AND operation that clears the few low bits of the current stack pointer gives you… the current struct. In pseudo-assembly: and r0, sp, 0xffffe000 for a 32-bit kernel with a stack size of 8192.

I like these, because they kill two birds with one stone: implementing a thread-local storage and allowing access to it in the easiest manner possible. 10/10 for creativity with the stack pointer trick.

A note about LTO…

Link-time optimization 101: the standard compilation process takes each source file (or translation unit, if you will), generates the machine code for it into an object file, then lets the linker merge all those object files into the final target without changing any of the machine code.

With LTO, the generation of machine code is delayed until all object files are given to the linker, which then generates machine code for the entire program at once.

What you actually have inside those semi-object files varies between compilers, but the point is: it’s supposedly enough data to allow the linker to perform virtually any optimization the compiler could have conducted itself, had it received the entire program as a single source file.

In short, LTO is awesome.

Nonetheless, I couldn’t get LTO to perform this optimization by itself :( A bit disappointing, to say the least.