

Author: “No Bugs” Hare (Job Title: Sarcastic Architect; Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek)

Today I will try to address one issue which causes a lot of confusion for those of us who are trying our hand at embedded programming: the differences between “von Neumann” architectures, “Harvard” architectures, and the most confusing one, “Modified Harvard.” In some cases the confusion has gone so far that some highly quoted posts such as [DigitalDIY] directly compare “Modified Harvard architecture” against “RISC architecture”, which is pretty much like comparing apples with beef (not even with oranges).

While the concepts behind von Neumann and Harvard are quite simple, the existing definition of “Modified Harvard” (at least as presented in [WikiModifiedHarvard]) includes two things which are very different from a developer’s point of view. While I hate arguing about terminology, I will try to clarify this confusion, so that we as developers know what “Modified Harvard” can really mean from our perspective.

On Arguing About Terminology

As I’ve mentioned above, I really hate arguing about definitions and terminology in general, as terminology debates are known to cause the most heated flame wars for no reason at all. In medieval times terminology flame wars led to real-world wars and numerous executions of those who preferred the “wrong” definition. In the 21st century, we’re much more civilized and manage to confine ourselves to throwing virtual feces at a virtual opponent; regardless of this indisputable improvement in attitude, the whole line of argument about “which definition is the ‘right’ one” is still perfectly pointless.

As a result, I will not argue whether the current definitions from [WikiModifiedHarvard] are ‘right’ or ‘wrong’. Instead I will take Wiki’s definition/understanding for granted, and will describe what it really means from the developer’s perspective.

First, we’ll start by discussing what hides behind the basic definitions of “von Neumann architecture” and “Harvard architecture”.

Pure von Neumann Architecture

von Neumann Architecture (see also [WikiVonNeumann]) is a really simple one: we have one physical memory, which contains both data and code; we also have one bus from our CPU to our memory, which technically means that we cannot read both data and code at the same time (though from the developer’s perspective, the latter effect is not really observable). On the other hand, we have one address space, so we can have pointers to the data, and pointers to the code; we can also write the code as data, and then execute it as code.

Does it sound like your typical C program (ok, except maybe for the last part)? It should, because it pretty much is – from the developer’s perspective, that is.
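To make the “write the code as data, and then execute it” part a bit more tangible, here is a minimal sketch of a toy von Neumann machine. It is not any real CPU: the opcodes, the 256-byte memory, and the names vn_run() and vn_demo() are all made up for illustration. The only point is that instructions and data live in the same array, so a store can patch an instruction which is executed later.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-byte instructions: opcode followed by one operand. */
enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD_IMM = 2, OP_STORE = 3 };

/* Runs a program held in the single, unified memory; returns the accumulator. */
static uint8_t vn_run(uint8_t mem[256]) {
    uint8_t acc = 0;
    uint8_t pc = 0;
    assert(mem);
    for (;;) {
        uint8_t op = mem[pc];                  /* instruction fetch...            */
        uint8_t arg = mem[(uint8_t)(pc + 1)];  /* ...from the same memory as data */
        pc = (uint8_t)(pc + 2);
        switch (op) {
        case OP_LOAD_IMM: acc = arg; break;
        case OP_ADD_IMM:  acc = (uint8_t)(acc + arg); break;
        case OP_STORE:    mem[arg] = acc; break; /* a data write, possibly into code */
        default:          return acc;            /* OP_HALT or unknown opcode */
        }
    }
}

static uint8_t vn_demo(void) {
    uint8_t mem[256] = {
        OP_LOAD_IMM, 7,   /* address 0: acc = 7                              */
        OP_STORE,    7,   /* address 2: mem[7] = 7, patching the ADD operand */
        OP_LOAD_IMM, 0,   /* address 4: acc = 0                              */
        OP_ADD_IMM,  1,   /* address 6: operand at address 7 is now 7        */
        OP_HALT,     0    /* address 8                                       */
    };
    return vn_run(mem);
}
```

Here vn_demo() returns 7 rather than 1, because the store at address 2 overwrote the operand of the ADD instruction before it was fetched.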

non-Modified Harvard Architecture

The original (non-modified) Harvard architecture is also fairly simple. With a Harvard system, we have our CPU with two RAMs and two buses: one RAM (and an associated bus) being for data only, and another RAM (again, with an associated bus) being for code only. What is more important for us as developers is that there are two address spaces, so with a pure Harvard architecture we cannot have a pointer to the code (or at least cannot read the code as data).

Note that even with Harvard architecture we still can use in-code constants to be read as data: something along the lines of

MOV A, 42

should still be possible in pure Harvard architectures, but what we cannot do is place a constant (such as a null-terminated string) into the code segment, and then pass a pointer to this constant to strcpy(), or to any other function which expects a data pointer.
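The distinction between an in-code constant and a pointer into the code segment can be sketched with another toy model. Again, this is a made-up machine (the opcodes and the names harvard_machine, harvard_run(), and harvard_demo() are hypothetical), with instruction memory and data memory kept in two separate arrays.

```c
#include <assert.h>
#include <stdint.h>

enum { H_HALT = 0, H_LOAD_IMM = 1, H_LOAD_MEM = 2 };

typedef struct {
    const uint8_t *code; /* instruction memory: only the fetch logic reads it */
    uint8_t data[4];     /* data memory: the only memory H_LOAD_MEM can see   */
} harvard_machine;

static uint8_t harvard_run(const harvard_machine *m) {
    uint8_t acc = 0;
    uint8_t pc = 0;
    assert(m && m->code);
    for (;;) {
        uint8_t op = m->code[pc];
        uint8_t arg = m->code[pc + 1];
        pc = (uint8_t)(pc + 2);
        switch (op) {
        case H_LOAD_IMM: acc = arg; break;             /* constant inside the instruction  */
        case H_LOAD_MEM: acc = m->data[arg & 3]; break; /* address is in the DATA space     */
        default:         return acc;
        }
    }
}

/* Like MOV A, 42: the constant travels inside the instruction stream. */
static const uint8_t prog_imm[] = { H_LOAD_IMM, 42, H_HALT, 0 };

/* The byte 42 sits at code address 3, but H_LOAD_MEM 3 reads data address 3. */
static const uint8_t prog_mem[] = { H_LOAD_MEM, 3, H_HALT, 42 };

static uint8_t harvard_demo(const uint8_t *prog) {
    harvard_machine m = { prog, { 0xAA, 0xBB, 0xCC, 0xDD } };
    return harvard_run(&m);
}
```

With prog_imm the accumulator ends up as 42; with prog_mem it ends up as 0xDD, which is whatever the data memory happens to hold at address 3, not the 42 stored at the same numeric address in code memory.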

Up until this point, everything was quite simple and clear. Now let’s head towards the point of confusion.

Almost-Harvard – Access-Instruction-Memory-as-Data

In [WikiModifiedHarvard], “Modified Harvard” architecture is defined rather vaguely, but it lists three distinct possibilities which are clearly named as “Modified Harvard”: Split Cache, Access Instruction Memory as Data, and Read Instructions from Data Memory. Let’s set aside Split Cache for the moment, and consider the last two options.

An Access-Instruction-Memory-as-Data “Modified Harvard” is pretty much the basic non-modified Harvard architecture, but with a special set of instructions which allow reading constants from code memory into CPU registers (as we noted above, accessing in-code constants is in fact possible even without these instructions, but they do make the access simpler and more convenient). One clear example of such an architecture is AVR8. Let’s consider it in more detail (we’ll use C as implemented by the avr-gcc compiler). In avr-gcc, if you write something along the lines of:

#include <string.h>

char* global_s = "IT Hare Rules Forever"; // Hey, it even rhymed ;).

void f(char* buf) {
    strcpy(buf, global_s);
}

it will work. However, “under the hood” it will be implemented (by avr-gcc and its libraries) as follows:

before main() is called, all global constants (strings or otherwise) are loaded into RAM (we’ll see later how it can be done).

global_s is made to point not to the code segment (which resides in Flash memory), but to a loaded-into-RAM copy

after this point, all the code (including any library functions such as strcpy()) will work
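The three steps above can be sketched as a host-side simulation. This is not the actual avr-gcc startup code; the arrays sim_flash and sim_ram and all the sim_* functions are hypothetical stand-ins, meant only to show where the RAM goes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated memories; names and sizes are illustrative, not AVR's real layout. */
static const uint8_t sim_flash[] = "IT Hare Rules Forever"; /* load image of the constants */
static uint8_t sim_ram[64];                                 /* scarce data memory          */

/* Stand-in for a "read one byte from program memory" instruction. */
static uint8_t sim_flash_read(size_t flash_addr) {
    assert(flash_addr < sizeof sim_flash);
    return sim_flash[flash_addr];
}

/* Step 1: before main(), copy every initialized constant from Flash to RAM.
   Returns the number of RAM bytes consumed by the copy. */
static size_t sim_startup_copy(void) {
    size_t i;
    assert(sizeof sim_flash <= sizeof sim_ram);
    for (i = 0; i < sizeof sim_flash; i++)
        sim_ram[i] = sim_flash_read(i);
    return sizeof sim_flash;
}

/* Step 2: the global pointer refers to the RAM copy, not to Flash. */
static const char *sim_global_s(void) {
    return (const char *)sim_ram;
}

/* Step 3: ordinary data-space functions such as strcpy() now just work. */
static void sim_f(char *buf) {
    strcpy(buf, sim_global_s());
}
```

In this sketch sim_startup_copy() reports 22 bytes of RAM spent on a single short string (21 characters plus the terminating NUL).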

This solution does work, but has a tiny molehill-size drawback: it requires copying all constants into RAM. While this drawback would indeed be of minor importance on x86 (or any other desktop/server CPU), on AVR8 the maximum RAM size is 64K (that’s Kilobytes, not Megabytes or Gigabytes), with typical RAM sizes for AVR8 being of the order of 2K. This observation quickly promotes the copy-to-RAM drawback from a molehill-size problem to a mountain-size one: if all you have is 2K of RAM, using a few hundred bytes of it just to store your constants (just because the CPU cannot read these constants directly from Flash) is a horrible waste.

To address this issue, avr-gcc (together with the <avr/pgmspace.h> header of avr-libc) provides a PROGMEM qualifier and a family of pgm_read_byte()-style functions, which can be used as follows:

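#include <avr/pgmspace.h>

// Array form, so that the characters themselves (not merely a pointer to them) are placed in Flash
const char global_s[] PROGMEM = "IT Hare Sucks Big Time";

void f(char* buf) {
    strcpy_P(buf, global_s); // strcpy_PF() is the far-pointer counterpart of strcpy_P()
}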

By declaring global_s with the PROGMEM qualifier, we’re telling the compiler “don’t bother with copying this constant to RAM”. So far so good, but there is a price for it, and this price comes at the point of using our global_s. While technically global_s looks like a normal char* pointer, in practice it is not. In particular, we cannot use *global_s to get the first byte of our global_s string (!); nor can we feed global_s to any of the standard functions such as strcpy().

Implementation-wise, global_s is still a 2-byte pointer (just like any other pointer in AVR8), and the compiler will allow us to write c = *global_s. Semantically, however, global_s belongs to the code address space, while *global_s is performed with a silent assumption that the pointer belongs to the data address space. As a result, *global_s will read something from the data address space which has nothing to do with our global_s string: some pretty arbitrary data which was unfortunate enough to reside at that address in the data address space.
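One possible way to stop the compiler from silently accepting such a mix-up is to wrap addresses from each space in their own struct type. The sketch below is a host-side simulation under my own assumptions: code_space, data_space, flash_ptr, ram_ptr, and the read functions are hypothetical names, not part of avr-libc.

```c
#include <assert.h>
#include <stdint.h>

/* Two simulated address spaces; the same 16-bit address is valid in both. */
static const uint8_t code_space[32] = "IT Hare";  /* stands in for Flash */
static uint8_t data_space[32] = "garbage";        /* stands in for RAM   */

/* Distinct wrapper types: a flash_ptr cannot be dereferenced with '*'
   or passed where a ram_ptr (or a plain char*) is expected. */
typedef struct { uint16_t addr; } flash_ptr;
typedef struct { uint16_t addr; } ram_ptr;

static uint8_t flash_read_byte(flash_ptr p) {
    assert(p.addr < sizeof code_space);
    return code_space[p.addr];
}

static uint8_t ram_read_byte(ram_ptr p) {
    assert(p.addr < sizeof data_space);
    return data_space[p.addr];
}

/* What a plain *global_s effectively does: takes a code-space address
   and reads the data space at the same number. */
static uint8_t misread_as_data(flash_ptr p) {
    ram_ptr wrong = { p.addr }; /* the reinterpretation is now explicit and visible */
    return ram_read_byte(wrong);
}
```

Reading address 0 via flash_read_byte() yields 'I', while misread_as_data() on the very same address yields 'g', the unrelated byte that lives at address 0 of the data space.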

If global_s is defined with the PROGMEM qualifier, then instead of *global_s we should use pgm_read_byte(global_s). And instead of strcpy(buf,global_s) we should use a library function which treats its source argument as a pointer-to-program-memory: strcpy_P(buf,global_s), or its far-pointer counterpart strcpy_PF() when the string may reside beyond the first 64K of Flash.

To make things worse, while there are quite a few library functions which support pointers-to-program-memory, in many cases you will need to write your own functions. To illustrate how it should be done, we’ll write our own my_strcpy_PF():

#include <avr/pgmspace.h>

char* my_strcpy_PF(char* dst, const char* /*PROGMEM*/ src) {
    char* ret = dst; // like strcpy(), return the start of the destination
    for (;;) {
        char c = pgm_read_byte(src); // if not for PROGMEM, we'd have simple *src here
        *dst = c;
        if (!c)
            return ret;
        dst++;
        src++;
    }
}

Note that the point of the example right above is not to encourage you to write your own strcpy(), but to illustrate how to implement your own functions which need to read from PROGMEM, but go beyond whatever is already supported by the AVR library.
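As a further sketch of such a home-grown function, the block below counts occurrences of a character in a program-memory string. The function name progmem_count_char() and the string greeting are mine. The #else branch is an assumption which lets the same source compile on a host with a single address space, where PROGMEM becomes a no-op and pgm_read_byte() becomes a plain dereference.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __AVR__
#include <avr/pgmspace.h>   /* the real PROGMEM and pgm_read_byte() */
#else
#define PROGMEM             /* host build: one address space, nothing to mark */
#define pgm_read_byte(p) (*(const unsigned char *)(p))
#endif

static const char greeting[] PROGMEM = "IT Hare Rules Forever";

/* Counts how many times 'wanted' occurs in a NUL-terminated string
   which resides in program memory. */
static size_t progmem_count_char(const char *src /* PROGMEM */, char wanted) {
    size_t count = 0;
    assert(src);
    for (;; src++) {
        char c = (char)pgm_read_byte(src); /* never *src: src is not a data pointer */
        if (c == 0)
            return count;
        if (c == wanted)
            count++;
    }
}
```

The pattern is the same as in my_strcpy_PF(): every access to the source goes through pgm_read_byte(), and the pointer is otherwise only incremented and compared, never dereferenced directly.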

From the programmer’s perspective, this Access-Instruction-As-Data architecture is pretty much like “programming in pure Harvard architecture”, though with a few things which help a bit to get around Harvard peculiarities (yes, in pure Harvard the code above would be even worse 🙁 ). As a result, I would suggest naming it an “Almost-Harvard” architecture.

Almost-Harvard – Read-Instructions-from-Data-Memory

As described in [WikiModifiedHarvard], it is possible to modify pure Harvard architecture to allow executing code from the data address space (which in turn allows implementing things such as JITs or self-modifying code). Still, access to constants-which-reside-in-code-segment will be complicated. And still, for most of us regular developers it will qualify as “Almost-Harvard”.

Almost-von-Neumann – Split-Cache

Now let’s take a look at the third item which is listed in [WikiModifiedHarvard] as one of “Modified Harvard” architectures – it is split-cache.

Most of the modern CPUs do have separate instruction cache and data cache (usually the difference exists only at L1, and is gone at L2 and above). [WikiModifiedHarvard] states that having split-cache is enough to name the architecture “Modified Harvard”.

As I’ve said at the very beginning of this post, I am not going to argue about definitions, but let’s see where this specific definition leads us as developers. Adding a cache doesn’t change the way we’re programming the CPU; moreover, the original 8086 CPU was a pure von Neumann CPU (it didn’t have any caches at all ;-)), and all the x86 CPUs are fully compatible with the original 8086. So, when moving from 8086 to, say, Core i7, we don’t need to change machine code (even less C code), but according to [WikiModifiedHarvard], we’re jumping from von Neumann architecture to “Modified Harvard” architecture. Ouch.

As noted above (and mentioned in [WikiModifiedHarvard] too), from the developer’s perspective there is zero or almost-zero difference between von Neumann architectures with added Split-Cache, and pure von Neumann architectures. In particular, a simple

#include <string.h>

char* global_s = "IT Hare Doesn't Care";

void f(char* buf) {
    strcpy(buf, global_s);
}

will work without copying-global_s-to-RAM, and without any trickery related to address spaces being different.

As a result, I suggest naming such architectures Almost-von-Neumann architectures.

Almost-von-Neumann: data/instruction cache coherence

If you are writing your-usual-C-programs, the differences between modified-Harvard architectures end here. However, if you’re into stuff such as JITs and self-modifying code, there is a further subtle difference. Almost-von-Neumann architectures can be further divided into Almost-von-Neumann-with-DI-Cache-Coherence and Almost-von-Neumann-without-DI-Cache-Coherence. If you’re dealing with Almost-von-Neumann-without-DI-Cache-Coherence, then if you have written some code to memory and want to execute it, you need to do something special (for example, to invalidate the L1 instruction cache) to make sure that your code will be executed correctly. This behaviour is usually related to the implementation of the split data and instruction caches: if no special measures are taken, the data cache and the instruction cache are not guaranteed to be coherent, causing problems unless this something special is explicitly done by the code. In contrast, for Almost-von-Neumann-with-DI-Cache-Coherence, you don’t need to do anything special even in this quite exotic case (i.e. from the developer’s point of view, Almost-von-Neumann-with-DI-Cache-Coherence behaves exactly like pure von Neumann architecture).
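With GCC or Clang, one way to do this “something special” is the __builtin___clear_cache() builtin. The sketch below assumes one of these compilers; the function name publish_code() is hypothetical, and obtaining executable memory (which is OS-specific) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies freshly generated machine code into 'dst' and makes it visible to
   instruction fetch. For the code to be runnable, 'dst' must already be
   mapped as executable by the caller; that part is outside this sketch. */
static void *publish_code(void *dst, const void *code, size_t len) {
    assert(dst && code);
    memcpy(dst, code, len); /* an ordinary data write, going through the data cache */
    /* GCC/Clang builtin: has no effect on targets which do not need it,
       and performs the required cache maintenance on targets which do. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
    return dst;
}
```

A JIT would call publish_code() after emitting each chunk of code and before jumping into it; on an Almost-von-Neumann-with-DI-Cache-Coherence system the extra call costs nothing, so the same source can serve both kinds of targets.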

I want to re-iterate that unless you’re doing some really unusual things (such as JIT or self-modifying code), you shouldn’t care about data-instruction cache coherence. In other words, as long as you’re writing your usual C/C++/Java/Python/… program – you don’t need to think about cache coherence at all.

Conclusions and Suggestions

As we’ve seen above, the definition of “Modified Harvard” architecture from [WikiModifiedHarvard] is quite confusing, at least from the developer’s point of view: according to that definition, both “systems-which-behave-almost-like-Harvard” and “systems-which-behave-almost-like-von-Neumann” qualify as “Modified Harvard”. Moreover, with such a broad definition of “Modified Harvard” (and as mentioned in [WikiModifiedHarvard] itself), it comes to encompass pretty-much-every-CPU-and-MCU-out-there; but whenever any definition starts to cover everything-in-existence, it becomes perfectly useless. In other words, there is very little point in saying that “this CPU has Modified Harvard architecture”, as with such a broad definition all modern CPUs are modified-Harvard.

As a result, I suggest, whenever it comes to us developers, using the more specific terms “Almost-Harvard” and “Almost-von-Neumann” (which can be further divided with respect to cache coherence, as described above). The key difference from the developer’s perspective is whether the address space of code and data is the same, or different.

If there is one address space for both code and data, it means that there is only one pointer type, which in turn means that you can forget about AVR8-like complications with PROGMEM etc. described above, and write pretty much your usual C code. I propose to name such systems “Almost-von-Neumann” (if necessary, a further clarification may be added whether it is Almost-von-Neumann-with-DI-Cache-Coherence, or Almost-von-Neumann-without-DI-Cache-Coherence).

Below is a summary of some popular architectures, describing their position in this classification.

Classification according to This Article | Architecture Examples
Almost-von-Neumann-with-DI-Cache-Coherence | x86, x64
Almost-von-Neumann-without-DI-Cache-Coherence | ARM
Almost-Harvard | AVR8

Fortunately for us, both x86 and ARM do normally qualify as “Almost-von-Neumann”. Note that even for “Almost-von-Neumann” systems, certain subtle differences between the code segment and the data segment may exist (for example, the speed of reading from the code segment and from the data segment can be different); however, for most programming intents and purposes, you can forget about these differences and think in terms of the pure von Neumann model.

If, on the other hand, there are separate address spaces, then we do need to address the question “there is a pointer, but which address space does it belong to?”, with all the associated AVR8-like complications. I propose to name such systems “Almost-Harvard”.

Dixi. Now it is time to start arguing about terminology 😉 .

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.