ARM-ASM-Tutorial

From the Mikrocontroller.net article collection, with contributions by various authors (see version history).

The ARM processor architecture is widely used in all kinds of industrial applications and also in a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language. Tutorial by Niklas Gürtler. Thread in Forum for feedback and questions.

Contents

1 Introduction 1.1 Why assembly? 1.2 About ARM 1.3 Architecture and processor variants

2 Prerequisites 2.1 Microcontroller selection 2.2 Processor type & documentation 2.3 Debug adapter 2.4 Development Software

3 Setup 3.1 Hardware 3.2 Software 3.2.1 Linux 3.2.2 Windows

4 Writing assembly applications 4.1 First rudimentary program 4.2 Flashing the program 4.3 Starting the debugger 4.4 Using processor registers 4.5 Accessing periphery 4.5.1 Clock Configuration 4.5.2 GPIO Configuration 4.5.3 Writing GPIO pins 4.6 Data processing 4.7 Reading periphery registers 4.8 Jump instructions 4.9 Counting Loops 4.10 Using RAM

5 Memory Management 5.1 Address space 5.2 The Linker 5.3 Linker Scripts 5.3.1 Reserving memory blocks 5.3.2 Defining symbols in linker scripts 5.3.3 Absolute section placement 5.4 Program Structure

6 More assembly techniques 6.1 Instruction set state 6.2 Constants 6.3 The Stack 6.4 Function calls 6.4.1 Using the stack for functions 6.4.2 Calling Convention 6.5 Conditional Execution 6.5.1 Conditions 6.5.2 The IT instruction 6.6 8/16 bit arithmetic 6.7 Alignment 6.8 Offset addressing 6.9 Iterating arrays 6.10 Literal loads 6.10.1 The “mov”-instruction 6.10.2 The “movt” instruction 6.10.3 PC-relative loads 6.11 The SysTick timer 6.12 Exceptions & Interrupts 6.13 Macros 6.14 Weak symbols 6.15 Symbol aliases 6.16 Improved vector table 6.17 .include 6.18 Local Labels 6.19 Initializing RAM 6.20 Peripheral interrupts 6.21 Analysis tools 6.21.1 Disassembler 6.21.2 readelf 6.21.3 nm 6.21.4 addr2line 6.21.5 objcopy 6.22 Interfacing C and C++ code 6.22.1 Environment setup for C and C++ 6.22.2 Calling functions 6.22.3 Accessing global variables 6.23 Clock configuration 6.24 Project template & makefile

Introduction

Why assembly?

Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge of assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize some performance-critical code section.

Sometimes claims are made that ARM processors can’t be programmed in assembly. Therefore, this tutorial will show that this is very well possible by showing how to write entire (small) applications in the ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction of the ARM processor market is necessary.

About ARM

Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors itself, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC.
Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

Architecture and processor variants

A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture.
Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up until ARMv8, the most recent one at the time of writing. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems while ARMv7-A is for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on Wikipedia.

This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.
There is not a single instruction set for ARM processors, but three:

The “A32” instruction set for 32bit ARM architectures, also simply called the “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.

The “A64” instruction set is for the new 64bit ARM processors.

The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones support only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “thumb2”, the second version of the “T32” instruction set.

Prerequisites

First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example codes are easily transferable to your setup, it is recommended to use the same components.

Microcontroller selection

For the microcontroller, an STM32F103C8 or STM32F103RB by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers, with many existing online resources. Several development boards with these controllers are available, for example: Nucleo-F103, “Blue Pill” (search for “stm32f103c8t6” on AliExpress, Ebay or Amazon), Olimexino-STM32, STM32-P103, STM32-H103, STM3210E-EVAL.

Processor type & documentation

First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

The first source of information is the STM32F103RB/C8 datasheet. According to the headline, this is a medium-density device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a Cortex-M3 processor core with 72 MHz.
This document also contains the electrical characteristics and pinouts.

The next important document is the STM32F103 reference manual that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.

The ARM developer website provides information about the Cortex-M3 processor core, particularly the ARM Cortex-M3 Processor Technical Reference Manual. According to chapter 1.5.3, this processor implements the ARMv7-M architecture.

The architecture is documented in the ARMv7M Architecture Reference Manual. Particularly, it contains the complete documentation of the instruction set.

For any serious STM32 development, you should be familiar with all these documents.

Debug adapter

There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money.

Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the ST-Link debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

Development Software

On the software part, several tools are needed for developing microcontroller firmware.
Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for productive development, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the ST-Link Utility can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the GNU Arm Embedded Toolchain is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by OpenOCD, which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

Setup

Follow the instructions in the next chapters to set up your development environment.
Hardware

The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – e.g. for the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

Software

Linux

Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently than the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM. Download the Linux binary tarball from the downloads page and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

    echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):

    sudo apt-get install openocd

After that, log out and back in (or just reboot). In a terminal, type “arm-none-eabi-as -version”. The output should look similar to this:

    $ arm-none-eabi-as -version
    GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
    Copyright (C) 2019 Free Software Foundation, Inc.
    This program is free software; you may redistribute it under the terms of
    the GNU General Public License version 3 or later.
    This program has absolutely no warranty.
    This assembler was configured for a target of `arm-none-eabi'.

Similarly, for “openocd -v”:

    $ openocd -v
    Open On-Chip Debugger 0.10.0
    Licensed under GNU GPL v2
    For bug reports, read
        http://openocd.org/doc/doxygen/bugs.html

If an error message appears, the installation isn’t correct.

Windows

[Figure: Options for installing GCC]

Download the Windows installer from the downloads page and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the gnu-mcu-eclipse downloads page. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[Figures: Opening PC properties; Setting environment variable]

Set the “Path” environment variable to include this path: Right-Click on “This PC”, then “Properties” → “Advanced System Settings” → “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a new command window (Windows Key + R, type “cmd” + Return). Type “arm-none-eabi-as -version”. The output should look similar to this:

    C:\>arm-none-eabi-as -version
    GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
    Copyright (C) 2019 Free Software Foundation, Inc.
    This program is free software; you may redistribute it under the terms of
    the GNU General Public License version 3 or later.
    This program has absolutely no warranty.
    This assembler was configured for a target of `arm-none-eabi'.

Similarly, for “openocd -v”:

    C:\>openocd -v
    GNU MCU Eclipse OpenOCD, 64-bitOpen On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
    Licensed under GNU GPL v2
    For bug reports, read
        http://openocd.org/doc/doxygen/bugs.html

If an error message appears, the installation isn’t correct.

Writing assembly applications

The full source code of the examples in the following chapters can be found on GitHub. The name of the corresponding directory is given after each example code below.

First rudimentary program

After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:

    .syntax unified
    .cpu cortex-m3
    .thumb

    .word 0x20000400
    .word 0x080000ed
    .space 0xe4

    nop     @ Do Nothing
    b .     @ Endless loop

Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the Cortex-M3 as the processor in use, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set.
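To make the difference between directives that emit bytes and those that do not more concrete, here is the same program again as a sketch, with each line annotated with the number of bytes it is expected to contribute to the program image. The sizes given for “nop” and “b .” are an assumption that the assembler selects the 2-byte Thumb encodings for both; for “nop”, this agrees with the program counter values observed in the debugger session further below.

```asm
.syntax unified         @ directive only, emits no bytes
.cpu cortex-m3          @ directive only, emits no bytes
.thumb                  @ directive only, emits no bytes

.word 0x20000400        @ emits 4 bytes
.word 0x080000ed        @ emits 4 bytes
.space 0xe4             @ emits 0xe4 (228) bytes, filled with zeroes

nop                     @ emits 2 bytes (assumed 16-bit encoding)
b .                     @ emits 2 bytes (assumed 16-bit encoding)
```

Under these assumptions, the sizes add up to 4 + 4 + 0xe4 + 2 + 2 = 0xf0 bytes of flash contents.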
We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing “cd <Path to Project Directory>”. Call the assembler like this:

    arm-none-eabi-as -g prog1.S -o prog1.o

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

    arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory.
The linker might output a warning like this:

    arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000

This is not relevant for executing the program without an operating system and can be ignored.

Flashing the program

To download the compiled application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:

    openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"

Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.

Starting the debugger

To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:

    openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg

Then, open a new terminal/command window and start a GDB session:

    arm-none-eabi-gdb prog1.elf

GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:

    target remote :3333

Then, stop the currently running program:

    monitor reset halt

If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:

    load

This will overwrite the previously flashed program (which, in this case, is identical anyway).
After loading the program, reset the controller again:

    monitor reset halt

Now, examine the contents of the CPU registers:

    info reg

The output should look something like

    r0             0x0                 0
    r1             0x0                 0
    r2             0x0                 0
    r3             0x0                 0
    r4             0x0                 0
    r5             0x0                 0
    r6             0x0                 0
    r7             0x0                 0
    r8             0x0                 0
    r9             0x0                 0
    r10            0x0                 0
    r11            0x0                 0
    r12            0x0                 0
    sp             0x0                 0x0
    lr             0x0                 0
    pc             0x8000000           0x8000000 <_stack+133693440>
    xPSR           0x1000000           16777216
    msp            0x20000400          0x20000400
    psp            0x27e3fa34          0x27e3fa34
    primask        0x0                 0
    basepri        0x0                 0
    faultmask      0x0                 0
    control        0x0                 0

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing “stepi”. If you type “info reg” again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you do “stepi” again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended.

You can instruct the processor to run the program continuously, without stopping after each instruction, by typing “continue”. You can interrupt the running program by pressing “Ctrl+C”. Run the commands

    kill
    quit

to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.

Using processor registers

The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing. All calculations (e.g.
addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:

    ldr r0, =123456789

The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any number and any register in the range r0-r13 can be used instead.

The instruction “mov” can be used to copy the contents from one register to another:

    mov r1, r0

This copies r0 to r1. Unlike on some other processor architectures, “mov” cannot be used to access memory, but only the processor registers.

In ARM, 32bit numbers are called “words” and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.

Accessing periphery

To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be directly used for data processing but need to be explicitly read and written before and after any calculations. Not all of them are 32bit; many have only 16bit, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.
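Since the assembler does not know the register names from the manufacturer’s documentation, one possible way to keep the code readable is to define such names yourself with the “.equ” directive of the GNU assembler. The following is only a sketch: the symbol name RCC_APB2ENR is our own choice, picked to match the reference manual, and its address 0x40021018 is the one derived in the “Clock Configuration” chapter below.

```asm
.syntax unified
.cpu cortex-m3
.thumb

@ Assumption: this name is defined by us; the assembler has no built-in
@ knowledge of STM32 periphery registers.
.equ RCC_APB2ENR, 0x40021018

ldr r1, =RCC_APB2ENR    @ Same effect as: ldr r1, =0x40021018
```

A symbol defined this way emits no bytes into the program image by itself; the assembler substitutes the number wherever the name is used.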
The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15 mA (the absolute maximum being 25 mA), e.g. 100 Ω for a 3.3 V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

Clock Configuration

Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because MOSFET-based circuits (virtually all modern ICs) draw most of their power while a clock signal is applied, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR” which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”.
According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000+0x18=0x40021018”.

In short: To enable GPIOA, the value 0x00000004 needs to be written to address 0x40021018. According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:

Load the value 0x00000004 into a register

Load the value 0x40021018 into another register

Store the value from the first register into the memory location specified by the second register.

This last step is performed by the “STR” instruction as follows:

    .syntax unified
    .cpu cortex-m3
    .thumb

    .word 0x20000400
    .word 0x080000ed
    .space 0xe4

    ldr r0, =0x00000004
    ldr r1, =0x40021018
    str r0, [r1]    @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

    b .

The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” are used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.

GPIO Configuration

By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 controllers support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 controllers have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address; these are again described in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800+0x08=0x40010808”, while “GPIOB_ODR” has address “0x40010C00+0x0C=0x40010C0C”.
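The addition of base address and offset does not have to be done by hand, because the GNU assembler can evaluate constant expressions. The sketch below recomputes the two example addresses from the previous paragraph; all symbol names are hypothetical and merely mirror the naming used in the reference manual.

```asm
.syntax unified
.cpu cortex-m3
.thumb

@ Base addresses of two GPIO modules (reference manual, chapter 3.3)
.equ GPIOA_BASE, 0x40010800
.equ GPIOB_BASE, 0x40010C00

@ Register offsets within one GPIO module (reference manual, chapter 9.2)
.equ GPIO_IDR_OFFSET, 0x08
.equ GPIO_ODR_OFFSET, 0x0C

ldr r1, =(GPIOA_BASE + GPIO_IDR_OFFSET)    @ r1 = 0x40010808 (GPIOA_IDR)
ldr r2, =(GPIOB_BASE + GPIO_ODR_OFFSET)    @ r2 = 0x40010C0C (GPIOB_ODR)
```

The sums are computed at assembly time, so the resulting machine code should be the same as if the final addresses had been written out directly.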
Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on. The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register). Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”: ldr r0 , = 0x44444442 ldr r1 , = 0x40010804 str r0 , [ r1 ] @ Set CNF8 : MODE8 in GPIOA_CRH to 2 Writing GPIO pins The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16bits, one for each pin (see chapter 9.2.4). 
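The register values used in this section can be double-checked on a PC before they go into the assembly source. The following Python sketch is only a host-side calculation, not microcontroller code; the function and constant names are invented for illustration. It assumes, as described above, that each 4-bit field is “CNF” (upper two bits) followed by “MODE” (lower two bits), and that all other pins keep the default setting “4”.

```python
def pin_config_value(settings, default=0x4):
    """Build a 32-bit GPIOx_CRL/GPIOx_CRH value.

    settings maps a field index (0..7) to its 4-bit setting (CNF << 2) | MODE.
    For GPIOx_CRH, field 0 belongs to pin 8. Fields that are not listed
    get `default` (0x4, the reset setting mentioned in the text).
    """
    value = 0
    for field in range(8):
        value |= (settings.get(field, default) & 0xF) << (4 * field)
    return value


MODE_OUTPUT_2MHZ = 0b10  # MODE bits for "output, 2 MHz"
CNF_GP_PUSH_PULL = 0b00  # CNF bits for "general purpose push-pull"

pa8 = (CNF_GP_PUSH_PULL << 2) | MODE_OUTPUT_2MHZ  # the value "2" from the text
crh = pin_config_value({0: pa8})                  # PA8 is field 0 of GPIOA_CRH
odr = 1 << 8                                      # bit 8 of GPIOA_ODR belongs to PA8

print(hex(crh), hex(odr))
```

The same helper can be used for “GPIOx_CRL” by passing field indices 0 to 7 for pins 0 to 7.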
To enable the LED, set bit 8 to one: .syntax unified .cpu cortex-m3 .thumb .word 0x20000400 .word 0x080000ed .space 0xe4 ldr r0 , = 0x00000004 ldr r1 , = 0x40021018 str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA ldr r0 , = 0x44444442 ldr r1 , = 0x40010804 str r0 , [ r1 ] @ Set CNF8 : MODE8 in GPIOA_CRH to 2 ldr r0 , = 0x100 ldr r1 , = 0x4001080C str r0 , [ r1 ] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high b . Example name: “SetPin” This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect! Data processing ARM supports many instructions for mathematical operations. For example, addition can be performed as: ldr r0 , = 222 ldr r1 , = 111 add r2 , r0 , r1 This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right. You can also overwrite an input register with the result: add r0 , r0 , r1 This will write the result to r0, overwriting the previous value. This is commonly shortened to add r0 , r1 The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are: sub for subtraction

mul for multiplication

and for bitwise and

orr for bitwise or

eor for bitwise exclusive or (“xor”)

lsl for logical left shift

lsr for logical right shift

Most of these instructions can take not only registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,

add r0, r0, #23

adds 23 to the register r0 and stores the result in r0. This can again be shortened to

add r0, #23

Such immediate arguments can not be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bit in size and also needs some room for the operation code and the register numbers. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.

Reading periphery registers

The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle. Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well.
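The “modify” step of such a cycle boils down to two mask operations, a bitwise or for setting bits and a bitwise and for clearing them. The following Python sketch tries them out on plain integers on a PC; it is not microcontroller code, the helper names are invented for illustration, and the starting register contents are arbitrary example values.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def set_bits(value, bits):
    """Like "orr": force the given bits to 1 and keep all others."""
    return (value | bits) & MASK32


def clear_bits(value, bits):
    """Like "and" with the inverted mask: force the given bits to 0."""
    return value & ~bits & MASK32


def write_field(value, shift, width, new):
    """Replace a bit field: clear it first, then set the new contents."""
    mask = ((1 << width) - 1) << shift
    return set_bits(clear_bits(value, mask), (new << shift) & mask)


# A register in which some other bit (bit 3) is already set keeps that bit
# when bit 2 is switched on.
print(hex(set_bits(0x00000008, 0x4)))

# Writing "2" into the lowest 4-bit field leaves the other seven fields alone.
print(hex(write_field(0x33333333, 0, 4, 0x2)))
```

Simply overwriting the whole register, as the earlier examples did, would have lost the other bits in both cases.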
Starting with the “RCC_APB2ENR” register, you can read it via:

ldr r1, =0x40021018
ldr r0, [r1]

Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register whose address is held in r1 into r0. The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:

orr r0, r0, #4

After that, we can store r0 as before. With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register can not be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR is set to 1). If a 1 is written to any of the bits 16-31, the corresponding pin is set to low. So, the pin can be set to 1 like this:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

So, the modified program is:

.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
Example name: “SetPin2”

Jump instructions

For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]

By pasting this behind the instructions for turning on the LED, it will be turned on and off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning. A simple endless loop was already explained: The “b .” instruction, which just executes itself repeatedly. To have it jump somewhere else, the dot needs to be replaced by the desired target address, for example:

.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104

Example name: “Blink”

The address specified is an absolute address, which is the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the jump target and the location of the “b” instruction, and stores it into the instruction. When jumping backwards, this distance is negative.
When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there. This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones. Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon: BlinkLoop: ldr r1 , = 0x40010810 ldr r0 , = 0x100 str r0 , [ r1 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high ldr r1 , = 0x40010810 ldr r0 , = 0x1000000 str r0 , [ r1 ] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low b BlinkLoop Example name: “Blink2” This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label for the address it represents to calculate the relative jump distance. The assembler actually provides no direct way of directly specifying the relative offset that will be encoded in the instruction, but it can be done like this: b (. + 4 + 42 * 2 ) The resulting instruction will contain “42” as the jump offset. 
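The relation between the address of the “b” instruction, the jump target and the number stored in the instruction can be checked with a few lines of Python on a PC. This is only a sketch of the arithmetic described in the text; the function names and the example address are invented. The last helper additionally assumes the 16-bit unconditional form of “b”, which consists of the five fixed bits 11100 followed by an 11-bit signed offset.

```python
def encode_offset(branch_addr, target_addr):
    """Number stored in the instruction: the processor later computes
    target = (branch_addr + 4) + 2 * offset."""
    distance = target_addr - (branch_addr + 4)
    if distance % 2:
        raise ValueError("jump targets must be at even addresses")
    return distance // 2


def decode_target(branch_addr, offset):
    """What the processor does when it executes the "b" instruction."""
    return branch_addr + 4 + 2 * offset


def encode_b16(offset):
    """16-bit unconditional "b" (assumed layout: 11100 + 11-bit signed offset)."""
    if not -1024 <= offset <= 1023:
        raise ValueError("offset does not fit into 11 bits")
    return 0xE000 | (offset & 0x7FF)


addr = 0x08000110                                  # arbitrary example location of a "b"
print(encode_offset(addr, addr + 4 + 42 * 2))      # "b (. + 4 + 42 * 2)"
print(encode_offset(addr, addr))                   # "b ." jumps to itself
print(hex(encode_b16(encode_offset(addr, addr))))  # machine code of "b ."
```

Because only the distance is stored, the results do not depend on the example address chosen for “addr”.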
As suggested by the syntax of the expression “. + 4 + 42 * 2”, the processor multiplies the encoded number (here 42) by 2 (since instructions can only reside at even memory addresses, it would waste one bit of memory to specify the number directly) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.

Counting Loops

The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way to do that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count down than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

ldr r2, =1000000
subs r2, #1

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”.
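To get a feeling for these checks, the behavior of “subs” can be modeled in a few lines of Python on a PC. This is merely an illustrative model with invented names, not a complete description of the processor: it computes the 32-bit result of a subtraction together with the four checks, and then counts how often “1” has to be subtracted until the zero check succeeds.

```python
MASK32 = 0xFFFFFFFF


def subs(a, b):
    """Model of "subs": returns the 32-bit result and the checks on it."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    checks = {
        "negative": bool(result & 0x80000000),  # bit 31 of the result is set
        "zero": result == 0,
        "carry": a >= b,                        # set if no borrow was needed
        # signed overflow: operands have different signs and the sign of
        # the result differs from the sign of the first operand
        "overflow": bool((a ^ b) & (a ^ result) & 0x80000000),
    }
    return result, checks


def count_down(start):
    """Repeat "subs value, #1" until the zero check succeeds."""
    value, passes = start, 0
    while True:
        value, checks = subs(value, 1)
        passes += 1
        if checks["zero"]:
            return passes


print(count_down(5))
print(subs(0, 1))
```

Starting the count at the desired number of iterations therefore yields exactly that many passes through the loop body.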
The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is not set, i.e. when the result of the last math instruction (with an “s” appended) was not zero. The “beq” instruction is the opposite of that – it only performs a jump if the result was zero. So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet: ldr r2 , = 1000000 delay1: subs r2 , #1 bne delay1 @ Iterate delay loop The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) 
in between the blocks that turn the pins on and off, the blink frequency is lowered sufficiently such that it becomes visible: .syntax unified .cpu cortex-m3 .thumb .word 0x20000400 .word 0x080000ed .space 0xe4 ldr r1 , = 0x40021018 ldr r0 , [ r1 ] orr r0 , r0 , #4 str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA ldr r1 , = 0x40010804 ldr r0 , [ r1 ] and r0 , #0xfffffff0 orr r0 , #2 str r0 , [ r1 ] @ Set CNF8 : MODE8 in GPIOA_CRH to 2 BlinkLoop: ldr r1 , = 0x40010810 ldr r0 , = 0x100 str r0 , [ r1 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high ldr r2 , = 1000000 delay1: subs r2 , #1 bne delay1 @ Iterate delay loop ldr r1 , = 0x40010810 ldr r0 , = 0x1000000 str r0 , [ r1 ] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low ldr r2 , = 1000000 delay2: subs r2 , #1 bne delay2 @ Iterate delay loop b BlinkLoop Example name: “BlinkDelay” You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change before the loop. 
Then, just use them inside the loop: .syntax unified .cpu cortex-m3 .thumb .word 0x20000400 .word 0x080000ed .space 0xe4 ldr r1 , = 0x40021018 ldr r0 , [ r1 ] orr r0 , r0 , #4 str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA ldr r1 , = 0x40010804 ldr r0 , [ r1 ] and r0 , #0xfffffff0 orr r0 , #2 str r0 , [ r1 ] @ Set CNF8 : MODE8 in GPIOA_CRH to 2 ldr r0 , = 0x40010810 @ Load address of GPIOA_BSRR ldr r1 , = 0x100 @ Register value to set pin to high ldr r2 , = 0x1000000 @ Register value to set pin to low ldr r3 , = 1000000 @ Iterations for delay loop BlinkLoop: str r1 , [ r0 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high mov r4 , r3 delay1: subs r4 , #1 bne delay1 @ Iterate delay loop str r2 , [ r0 ] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low mov r4 , r3 delay2: subs r4 , #1 bne delay2 @ Iterate delay loop b BlinkLoop Example name: “BlinkDelay2” Using RAM Until now, all data in the example codes was stored in periphery or processor registers. In all but the most simple programs, larger amounts of data have to be processed for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similar to accessing periphery registers – load the address in a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so before the first read access, some value has to be stored. As the programmer decides what data to place where, they have to keep track which address in memory contains what piece of data. You can use the assembler to help keeping track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. 
To assign names to the blocks, place a label definition (using a colon) right before that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash): .data var1: .space 4 @ Reserve 4 bytes for memory block “ var1 ” var2: .space 1 @ Reserve 1 byte for memory block “ var2 ” .text @ Instructions go here... Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this: .syntax unified .cpu cortex-m3 .thumb .word 0x20000400 .word 0x080000ed .space 0xe4 .data var1: .space 4 @ Reserve 4 bytes for memory block “ var1 ” var2: .space 1 @ Reserve 1 byte for memory block “ var2 ” .text ldr r0 , = var1 @ Get address of var1 ldr r1 , = 0x12345678 str r1 , [ r0 ] @ Store 0x12345678 into memory block “ var1 ” ldr r1 , [ r0 ] @ Read memory block “ var1 ” and r1 , #0xFF @ Set bits 8..31 to zero ldr r0 , = var2 @ Get address of var2 strb r1 , [ r0 ] @ Store a single byte into var2 b . Example name: “RAMVariables” Note the use of “strb” - it works similar to “str”, but only stores a single byte. Since the processor register r1 is of course 32bit in size, only the lower 8 bits are stored, and the rest is ignored. There is still something missing – nowhere in the code is there any address of the RAM. To tell the linker where the RAM is located, pass the option -Tdata=0x20000000 to the arm-none-eabi-ld call to tell the linker that this is the address of the first byte of RAM. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. 
When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly. If you run this program via GDB, you can use the commands x/1xw &var1 and x/1xb &var2 to read the data stored in memory. After this quick introduction a more abstract overview is indicated. Memory Management If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be utilized effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works. Address space In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123. However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. 
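The idea of one address space divided into ranges can be illustrated with a small lookup on a PC. The following Python sketch is a simplification with invented names: it lists only three ranges, assuming an STM32F103RB with 128 KiB of flash at 0x08000000 and 20 KiB of SRAM at 0x20000000, plus the general periphery range starting at 0x40000000. The datasheet remains the authoritative source for the full map.

```python
KIB = 1024

# (name, start address, size in bytes); assumed values for an STM32F103RB
REGIONS = [
    ("flash", 0x08000000, 128 * KIB),
    ("sram", 0x20000000, 20 * KIB),
    ("periphery", 0x40000000, 0x20000000),
]


def region_of(address):
    """Return the name of the range containing the address, or None for a gap."""
    for name, start, size in REGIONS:
        if start <= address < start + size:
            return name
    return None


for address in (0x080000EC, 0x20000000, 0x40021018, 0x10000000):
    print(f"0x{address:08X}: {region_of(address)}")
```

An address for which the lookup returns None lies in one of the gaps between the ranges listed here.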
All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents of those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined before the code, the code actually comes first in address space. What makes all this work is the linker.

The Linker

Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions to programming forgo explaining its workings in detail, but as with any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as utilizing multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

Figure: Translation of native applications using assembler, compiler and linker

You have already used a linker – the command arm-none-eabi-ld calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself.
However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files has no fixed address and can be moved around by the linker, these files are called relocatable. On Unix Systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM, and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t perfectly map the embedded use case. The object (.o) files created by the assembler and linker, and also the final program (usually no ending, but in embedded contexts and also in above example commands, .elf is used) are all in ELF format. The specification of ELF for ARM can be found here, and the generic specification for ELF on which the ARM ELF variant is based can be found here. ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether its contents is executable code or data, is read-only and whether memory should be allocated for it. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses. Another important aspect are symbols. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. 
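The way section-relative symbols turn into absolute addresses, and how placeholders are filled in, can be mimicked with a toy model in Python. This is only a sketch with invented names and data structures; a real linker works on ELF files and handles many more cases. The symbol “SomeLabel” and the section addresses are example values chosen to match the numbers used in the text.

```python
def resolve(symbols, section_addresses):
    """Turn symbols into absolute addresses.

    symbols maps a name to (section, offset); section None means the
    offset already is an absolute address.
    """
    absolute = {}
    for name, (section, offset) in symbols.items():
        base = 0 if section is None else section_addresses[section]
        absolute[name] = base + offset
    return absolute


def fill_placeholders(placeholders, absolute):
    """Replace every placeholder (a symbol name) by the symbol's address."""
    filled = []
    for name in placeholders:
        if name not in absolute:
            raise LookupError(f"undefined reference to `{name}'")
        filled.append(absolute[name])
    return filled


symbols = {
    "var1": (".data", 0),           # label at the start of .data
    "var2": (".data", 4),           # follows the 4 bytes reserved for var1
    "SomeLabel": (".text", 0x130),  # hypothetical label inside .text
}
sections = {".text": 0x08000000, ".data": 0x20000000}

absolute = resolve(symbols, sections)
print({name: hex(addr) for name, addr in absolute.items()})
print([hex(a) for a in fill_placeholders(["var1", "var2"], absolute)])
```

Moving a section only changes the entry in “sections”; all symbols defined relative to it follow automatically.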
For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name was found, it emits this dreaded error:

(.text+0x132): undefined reference to `Foo'

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.

Linker Scripts

A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts usually aren’t created on a per-project basis, but provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction into writing one will follow. The full documentation can be found here.
It’s customary to name a linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

MEMORY {
    FLASH : ORIGIN = 0x8000000, LENGTH = 128K
    SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
    .text : {
        *(.text)
    } >FLASH

    .data (NOLOAD) : {
        *(.data)
    } >SRAM
}

Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using an STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them.
You can however name the output sections arbitrarily, for example: SECTIONS { .FlashText : { *(.text) } >FLASH .RamData (NOLOAD) : { *(.data) } >SRAM } The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyways when running stand-alone. To specify the filename of the linker script, use the “-T” option: arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld The -Tdata and -Ttext aren’t needed anymore, as the addresses are now defined in the linker script. Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM): arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH' arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes Reserving memory blocks Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the beginning of SRAM, so put this before the “*(.data)” command: . = . + 0x400; Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400. 
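How the linker walks through a region, assigns addresses and detects overflows can be imitated with a small Python model on a PC. This is a strongly simplified sketch with invented names; the section size 0x120 is an arbitrary example value, and a real linker also handles alignment and many other details.

```python
KIB = 1024

# name: (origin, length), mirroring the MEMORY block of the linker script
MEMORY = {"FLASH": (0x08000000, 128 * KIB), "SRAM": (0x20000000, 20 * KIB)}


class RegionOverflow(Exception):
    pass


def place(memory, items):
    """Assign addresses in script order.

    items is a list of (region, name, size). A name of None models a
    command like ". = . + size", which only advances the current address.
    """
    current = {region: origin for region, (origin, _) in memory.items()}
    addresses = {}
    for region, name, size in items:
        origin, length = memory[region]
        if name is not None:
            addresses[name] = current[region]
        current[region] += size
        excess = current[region] - (origin + length)
        if excess > 0:
            raise RegionOverflow(f"region `{region}' overflowed by {excess} bytes")
    return addresses


layout = place(MEMORY, [
    ("FLASH", ".text", 0x120),  # example size of the program code
    ("SRAM", None, 0x400),      # ". = . + 0x400;" reserves the stack block
    ("SRAM", ".data", 5),       # var1 (4 bytes) and var2 (1 byte)
])
print({name: hex(addr) for name, addr in layout.items()})
```

In this model, placing something in “FLASH” that is 69244 bytes larger than the region raises an error with the same wording as the linker message quoted above.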
Defining symbols in linker scripts As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file: .word 0x20000400 .word 0x080000ed .space 0xe4 The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte after the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script: .data (NOLOAD) : { . = . + 0x400; _StackEnd = .; *(.data) } >SRAM Example name: “LinkerScriptSymbols” This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol: .word _StackEnd The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. 
The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores. The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that: .syntax unified .cpu cortex-m3 .thumb .word _StackEnd .word Reset_Handler .space 0xe4 .type Reset_Handler , % function Reset_Handler: @ Put code here The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now. Absolute section placement The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. 
This is done by placing the vector table in its own section: .syntax unified .cpu cortex-m3 .thumb .section .VectorTable , "a" .word _StackEnd .word Reset_Handler .space 0xe4 .text .type Reset_Handler , % function Reset_Handler: Example name: “LinkerScriptAbsolutePlacement” The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script: MEMORY { FLASH : ORIGIN = 0x8000000, LENGTH = 128K SRAM : ORIGIN = 0x20000000, LENGTH = 20K } SECTIONS { .VectorTable : { *(.VectorTable) } >FLASH .text : { *(.text) } >FLASH .data (NOLOAD) : { . = . + 0x400; _StackEnd = .; *(.data) } >SRAM } This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code: MEMORY { FLASH : ORIGIN = 0x8000000, LENGTH = 128K SRAM : ORIGIN = 0x20000000, LENGTH = 20K } SECTIONS { .text : { *(.VectorTable) *(.text) } >FLASH .data (NOLOAD) : { . = . + 0x400; _StackEnd = .; *(.data) } >SRAM } Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. 
The modified LED-blinker application now looks like: .syntax unified .cpu cortex-m3 .thumb .section .VectorTable , "a" .word _StackEnd .word Reset_Handler .space 0xe4 .text .type Reset_Handler , % function Reset_Handler: ldr r1 , = 0x40021018 ldr r0 , [ r1 ] orr r0 , r0 , #4 str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA ldr r1 , = 0x40010804 ldr r0 , [ r1 ] and r0 , #0xfffffff0 orr r0 , #2 str r0 , [ r1 ] @ Set CNF8 : MODE8 in GPIOA_CRH to 2 ldr r0 , = 0x40010810 @ Load address of GPIOA_BSRR ldr r1 , = 0x100 @ Register value to set pin to high ldr r2 , = 0x1000000 @ Register value to set pin to low ldr r3 , = 1000000 @ Iterations for delay loop BlinkLoop: str r1 , [ r0 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high mov r4 , r3 delay1: subs r4 , #1 bne delay1 @ Iterate delay loop str r2 , [ r0 ] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low mov r4 , r3 delay2: subs r4 , #1 bne delay2 @ Iterate delay loop b BlinkLoop Program Structure Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”: .syntax unified .cpu cortex-m3 .thumb .section .VectorTable , "a" .word _StackEnd .word Reset_Handler .space 0xe4 Assemble and link this source code with two assembler commands: arm-none-eabi-as -g prog1.S -o prog1.o arm-none-eabi-as -g vectortable.S -o vectortable.o arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld This will result in the dreaded “undefined reference” error. To alleviate this, use the “.global” directive in the main source file “prog1.S”: .syntax unified .cpu cortex-m3 .thumb .type Reset_Handler , % function .global Reset_Handler Reset_Handler: @ Code here ... This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a local symbol for each label, which can’t be used from other source files (same as static in C). 
The symbol is still there in the final program file, though - it can be used for debugging purposes.

More assembly techniques

After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

Instruction set state

As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/A64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective. Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” (“Branch and Exchange”) and “blx” (“Branch with Link and Exchange”) instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

ldr r0, =SomeLabel
bx r0

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g.
happens in C/C++ code when using function pointers or virtual functions. In these cases, you need to make sure that the lowest bit of the address passed to “bx/blx” via a register has the lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive: .type SomeLabel , % function SomeLabel: @ Some code... That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch. As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too: .type Reset_Handler , % function If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example: .cpu cortex-a8 .syntax unified @ Small but slower code here .thumb .type Block1 , % function Block1: ldr r0 , = Block2 bx r0 @ Larger but faster code here .arm .type Block2 , % function Block2: @ ... The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). 
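The effect of “.type” on a stored address can be tried out with a small fragment that keeps a code address in memory, similar to a vector table entry, and jumps through it. This is a sketch under assumptions: the labels “Start”, “Target” and “TargetPointer” are invented for illustration, and “.balign 4” is used here only to keep the stored word 4-byte aligned.

```asm
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Start, %function
Start:
    ldr r0, =TargetPointer  @ r0 = address of the stored pointer
    ldr r1, [r0]            @ r1 = address of Target with the lowest bit set
    bx r1                   @ Indirect jump, processor stays in T32 state

.type Target, %function     @ Without this, the stored address would be even
Target:
    b .                     @ Placeholder: endless loop

    .balign 4               @ Keep the following word 4-byte aligned
TargetPointer:
    .word Target            @ Resolved like a vector table entry
```

Inspecting r1 in the debugger just before the “bx” should show an odd value, i.e. the address of “Target” plus one.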
There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below). Constants The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so called “magic numbers”. As code is typically read many times more than written/modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use names instead of the number directly. The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways: RCC_APB2ENR = 0x40021018 .set GPIOA_CRH , 0x40010804 .equ GPIOA_ODR , 0x4001080C and then use it in place of the number: ldr r1 , = RCC_APB2ENR Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this: .syntax unified .cpu cortex-m3 .thumb RCC_APB2ENR = 0x40021018 RCC_APB2ENR_IOPAEN = 4 GPIOA_CRH = 0x40010804 GPIOA_BSRR = 0x40010810 GPIOx_BSRR_BS8 = 0x100 GPIOx_BSRR_BR8 = 0x1000000 GPIOx_CRx_GP_PP_10MHz = 1 GPIOx_CRx_GP_PP_2MHz = 2 GPIOx_CRx_GP_PP_50MHz = 3 GPIOx_CRx_GP_OD_10MHz = 1 | 4 GPIOx_CRx_GP_OD_2MHz = 2 | 4 GPIOx_CRx_GP_OD_50MHz = 3 | 4 GPIOx_CRx_AF_PP_10MHz = 1 | 8 GPIOx_CRx_AF_PP_2MHz = 2 | 8 GPIOx_CRx_AF_PP_50MHz = 3 | 8 GPIOx_CRx_AF_OD_10MHz = 1 | 4 | 8 GPIOx_CRx_AF_OD_2MHz = 2 | 4 | 8 GPIOx_CRx_AF_OD_50MHz = 3 | 4 | 8 GPIOx_CRx_IN_ANLG = 0 GPIOx_CRx_IN_FLOAT = 4 GPIOx_CRx_IN_PULL = 8 DelayLoopIterations = 1000000 .text .type Reset_Handler , % function .global Reset_Handler Reset_Handler: ldr r1 , = RCC_APB2ENR ldr r0 , [ r1 ] orr r0 , r0 , #RCC_APB2ENR_IOPAEN str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA ldr r1 , = GPIOA_CRH ldr r0 , [ r1 ] and r0 , #0xfffffff0 orr r0 , #GPIOx_CRx_GP_PP_2MHz str r0 , [ r1 ] @ Set CNF8 : 
MODE8 in GPIOA_CRH to 2 ldr r0 , = GPIOA_BSRR @ Load address of GPIOA_BSRR ldr r1 , = GPIOx_BSRR_BS8 @ Register value to set pin to high ldr r2 , = GPIOx_BSRR_BR8 @ Register value to set pin to low ldr r3 , = DelayLoopIterations @ Iterations for delay loop BlinkLoop: str r1 , [ r0 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high mov r4 , r3 delay1: subs r4 , #1 bne delay1 @ Iterate delay loop str r2 , [ r0 ] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low mov r4 , r3 delay2: subs r4 , #1 bne delay2 @ Iterate delay loop b BlinkLoop Example name: “BlinkConstants” This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of periphery registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules. The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate: ldr r1 , = GPIOA_CRH ldr r0 , [ r1 ] and r0 , #0xfffff0ff orr r0 , #(GPIOx_CRx_GP_OD_10MHz<<8) str r0 , [ r1 ] C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler. In the example, or | is used to combine bit values. Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label. 
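Constant expressions can also derive the shift and mask for a pin from its number, so that only one line changes when another pin is used. The following sketch configures PA9 as alternate function push-pull at 50 MHz; the symbols “Pin”, “PinShift” and “PinMask” are hypothetical helpers, not part of the original example, and “bic” (bit clear) is used in place of “and” with an inverted mask.

```asm
.syntax unified
.cpu cortex-m3
.thumb

GPIOA_CRH = 0x40010804
GPIOx_CRx_AF_PP_50MHz = 3 | 8

Pin = 9                         @ Hypothetical helper: pin number within port A
PinShift = (Pin - 8) * 4        @ CRH holds pins 8-15 with 4 bits each, here 4
PinMask = 0xf << PinShift       @ The 4 bits belonging to this pin, here 0xf0

.text
    ldr r1, =GPIOA_CRH
    ldr r0, [r1]
    bic r0, #PinMask                                @ Clear the 4 bits of the pin
    orr r0, #(GPIOx_CRx_AF_PP_50MHz << PinShift)    @ Insert the new mode
    str r0, [r1]
```

All three helper values are computed by the assembler, so the generated instructions are the same as if the numbers 0xf0 and 0xb0 had been written out directly.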
A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register: MyData .req r7 ldr MyData , = 123 add MyData , 3 This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code. The Stack In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”. Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack. On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs: Some operation that writes to r0

Push (save) r0 to the stack

Some operation that overwrites r0

Pop (restore) r0 from the stack

Use the value in r0, which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000
@ ... Some code that overwrites r0 ...
pop { r0 } @ Restore value 1000000
@ Continue using r0 ...

It is also possible to save/restore multiple registers in one go:

ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567
@ ... Some code that overwrites r0 and r1 ...
pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2 ...

It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. In larger applications, many store-restore pairs will be nested:

ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000
@ Inner Code Block:
ldr r0, =123
@ Use r0 ...
push { r0 } @ Save value 123
@ Inner-Inner Code Block that overwrites r0
pop { r0 } @ Restore value 123
@ Continue using r0 ...
pop { r0 } @ Restore value 1000000 into r0
@ Continue using r0 ...

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second “push”. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”. At the beginning of a push-pop pair, the current contents of the stack are irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state.
This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data. When storing a 32bit register value using “push”, the stack pointer is first decremented by 4 before the value is written at the newly calculated address. To restore a value, the 32bit word at the address currently stored in the stack pointer is read from memory, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte after the stack memory area. If the stack is completely full, it contains the address of the very first byte inside the stack memory area. This means that the stack grows downward. Since the stack is empty at program start, the stack pointer needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.
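The decrement-then-store and load-then-increment behaviour can be spelled out with ordinary instructions. The following fragment is only an illustration of what a single-register “push” and “pop” do to the stack pointer and to memory; in real code the dedicated instructions are preferable.

```asm
.syntax unified
.cpu cortex-m3
.thumb

.text
    ldr r0, =1000000

    @ Same effect as "push { r0 }":
    sub sp, #4          @ Decrement the stack pointer by 4 first ...
    str r0, [sp]        @ ... then store r0 at the new top of the stack

    ldr r0, =123        @ Some code that overwrites r0

    @ Same effect as "pop { r0 }":
    ldr r0, [sp]        @ Read the value at the top of the stack ...
    add sp, #4          @ ... then increment the stack pointer by 4
```

The order within each pair matters: data is only ever written at or above the address held in the stack pointer, never below it.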
The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address inside the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the ARM ABI requires the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8. In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the end of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard to find errors. Therefore, the example codes put the stack area at the beginning of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. It is probably easy to find the code location that allocates too much stack memory, and possibly increase the stack size. 
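The “clean” crash can be provoked on purpose to see what it looks like in the debugger. This sketch assumes the linker script and “vectortable.S” from the previous sections, i.e. a stack of 0x400 bytes at the beginning of SRAM; the label “FillStack” is made up for illustration.

```asm
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
    ldr r0, =0          @ r0 counts the completed pushes
FillStack:
    push { r0 }         @ Each push consumes 4 bytes of stack
    adds r0, #1
    b FillStack         @ Deliberately never pops
```

Since 0x400 bytes hold 256 words, 256 pushes fit into the reserved block. The 257th would have to write below 0x20000000 and is expected to fault, so the program should stop there instead of silently corrupting other data.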
Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.

Function calls

Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, that does some work, and that then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as “callee”.

From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the next instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again). This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction.
So, to return from a function, use “bx lr”: .text .type Reset_Handler , % function .global Reset_Handler Reset_Handler: bl EnableClockGPIOA @ Call function to enable GPIOA ' s peripheral clock @ Some more code ... ldr r1 , = GPIOA_CRH ldr r0 , [ r1 ] and r0 , #0xfffffff0 orr r0 , #GPIOx_CRx_GP_PP_2MHz str r0 , [ r1 ] .type EnableClockGPIOA , % function EnableClockGPIOA: ldr r1 , = RCC_APB2ENR ldr r0 , [ r1 ] orr r0 , r0 , #RCC_APB2ENR_IOPAEN str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA bx lr @ Return to caller Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”. When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files. Using the stack for functions In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. 
For example, the blink program could be restructured like this: .syntax unified .cpu cortex-m3 .thumb RCC_APB2ENR = 0x40021018 RCC_APB2ENR_IOPAEN = 4 GPIOA_CRH = 0x40010804 GPIOA_BSRR = 0x40010810 GPIOx_BSRR_BS8 = 0x100 GPIOx_BSRR_BR8 = 0x1000000 GPIOx_CRx_GP_PP_2MHz = 2 DelayLoopIterations = 1000000 .text .type Reset_Handler , % function .global Reset_Handler Reset_Handler: bl EnableClockGPIOA bl ConfigurePA8 ldr r5 , = 5 @ Number of LED flashes. bl Blink b . .type Blink , % function Blink: push { lr } ldr r0 , = GPIOA_BSRR @ Load address of GPIOA_BSRR ldr r1 , = GPIOx_BSRR_BS8 @ Register value to set pin to high ldr r2 , = GPIOx_BSRR_BR8 @ Register value to set pin to low ldr r3 , = DelayLoopIterations @ Iterations for delay loop BlinkLoop: str r1 , [ r0 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high bl Delay str r2 , [ r0 ] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low bl Delay subs r5 , #1 bne BlinkLoop pop { lr } bx lr .type EnableClockGPIOA , % function EnableClockGPIOA: ldr r1 , = RCC_APB2ENR ldr r0 , [ r1 ] orr r0 , r0 , #RCC_APB2ENR_IOPAEN str r0 , [ r1 ] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA bx lr @ Return to caller .type ConfigurePA8 , % function ConfigurePA8: ldr r1 , = GPIOA_CRH ldr r0 , [ r1 ] and r0 , #0xfffffff0 orr r0 , #GPIOx_CRx_GP_PP_2MHz str r0 , [ r1 ] @ Set CNF8 : MODE8 in GPIOA_CRH to 2 bx lr .type Delay , % function Delay: mov r4 , r3 DelayLoop: subs r4 , #1 bne DelayLoop @ Iterate delay loop bx lr Example name: “BlinkFunctions” The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions. The two lines pop { lr } bx lr are actually longer than necessary. 
It is actually possible to directly load the return address from the stack into the program counter, PC:

pop { pc }

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.

Calling Convention

Actually building a large program as shown in the last example is a bad idea. The “Delay” function requires the iteration count (1000000) to reside in r3 and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r2 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that make it impossible to write larger functions that call several sub-functions, or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it should keep, how it should use the stack, and how to pass information back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, which would make it awkward to adapt code that uses some other convention to interrupts. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.

The LR is used as shown for the return address.

When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).

The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.

The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved outside the function (“caller-save”), and the registers r4-r11 are (if at all) saved inside the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just kept in a register (LR) and it might only overwrite the registers r0-r3 and r12, which the caller can make sure to contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as to the stack.

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804
GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
GPIOx_CRx_GP_PP_2MHz = 2
DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
    bl EnableClockGPIOA
    bl ConfigurePA8
    ldr r0, =5
    bl Blink
    b .

.type Blink, %function
Blink:
    push { r4-r7, lr }
    ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
    ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
    ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
    mov r7, r0 @ Number of LED flashes.
BlinkLoop:
    str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
    ldr r0, =DelayLoopIterations @ Iterations for delay loop
    bl Delay
    str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
    ldr r0, =DelayLoopIterations @ Iterations for delay loop
    bl Delay
    subs r7, #1
    bne BlinkLoop
    pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
    ldr r1, =RCC_APB2ENR
    ldr r0, [r1]
    orr r0, r0, #RCC_APB2ENR_IOPAEN
    str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
    bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
    ldr r1, =GPIOA_CRH
    ldr r0, [r1]
    and r0, #0xfffffff0
    orr r0, #GPIOx_CRx_GP_PP_2MHz
    str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
    bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
    subs r0, #1
    bne DelayLoop @ Iterate delay loop
    bx lr

Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.
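The blinker only passes parameters into functions; return values work the same way in the opposite direction. The following sketch is not part of the original example: “SumTo” and “SumBoth” are hypothetical functions invented to show a leaf function returning a result in r0, and a non-leaf function that keeps intermediate data in a callee-saved register across calls.

```asm
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
    ldr r0, =3          @ First parameter
    ldr r1, =4          @ Second parameter
    bl SumBoth          @ Return value in r0: (3+2+1) + (4+3+2+1) = 16
    b .

@ Parameters: r0 = a, r1 = b (both assumed to be at least 1)
@ Returns: SumTo(a) + SumTo(b) in r0
.type SumBoth, %function
SumBoth:
    push { r4, lr }     @ Save the callee-saved register and return address
    mov r4, r1          @ Keep b in a callee-saved register
    bl SumTo            @ r0 = SumTo(a); r1-r3 and r12 may be overwritten
    mov r1, r0          @ r1 = SumTo(a)
    mov r0, r4          @ r0 = b, parameter for the second call
    mov r4, r1          @ Keep SumTo(a) in r4 across the call
    bl SumTo            @ r0 = SumTo(b)
    add r0, r4          @ r0 = SumTo(b) + SumTo(a)
    pop { r4, pc }      @ Restore r4 and return

@ Parameters: r0 = n (assumed to be at least 1)
@ Returns: n + (n-1) + ... + 1 in r0
.type SumTo, %function
SumTo:
    mov r1, #0          @ r1 = running sum, free to overwrite
SumToLoop:
    add r1, r0
    subs r0, #1
    bne SumToLoop
    mov r0, r1          @ Move the result into the return register
    bx lr               @ Leaf function: no stack access needed
```

“SumBoth” pushes two registers, so a stack pointer that was a multiple of 8 on entry is still a multiple of 8 when “SumTo” is called.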
Alternatively, “Blink” could re-load the constants each time they are used in r1/r2, such that only one register (r4) needs to be saved as it is needed to count the number of flashes: .type Blink , % function Blink: push { r4 , lr } mov r4 , r0 BlinkLoop: ldr r1 , = GPIOA_BSRR @ Load address of GPIOA_BSRR ldr r2 , = GPIOx_BSRR_BS8 @ Register value to set pin to high str r2 , [ r1 ] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high ldr r0 , = DelayLoopIterations @ Iterations for delay loop bl Delay ldr r1 , = GPIOA_BSRR @ Load address of GPIOA_BSRR ldr r2 , = GPIOx_BSRR_BR8 @ Register value to set pin to low str r2 , [ r1 ] @