If you have some experience in C, then you probably know that a string is just an array of char values. Or it can be represented by a pointer to char, because a pointer and an array are pretty much the same thing, right? That is somewhat true, but the reality is slightly more complex.

Let’s look at a few examples (assume that the source file is always called main.c). Note: I’m using the gcc compiler on an Ubuntu Linux, and you might get different results in a different environment. For a slightly more detailed explanation of the setup, refer to my previous article on C.

#include <stdio.h>

int main() {

char str[3] = "cat";

puts(str);

return 0;

}

> gcc main.c -o out

> ./out

cat�

What has just happened here? Where is that character at the end of the word coming from?

Another example.

int main(void) {

char *str = "cat";

str[0] = 'b';

return 0;

}

> gcc main.c -o out

> ./out

Segmentation fault (core dumped)

A modification of the previous example.

int main(void) {

char str[] = "cat";

str[0] = 'b';

return 0;

}

> gcc main.c -o out

> ./out

We simply replaced a pointer with an array, and it fixed the segmentation fault error. But why?

A third example.

#include <string.h>

int main() {

char *str1 = "cat";

char *str2;

strcpy(str2, str1);

return 0;

}

> gcc main.c -o out

> ./out

Segmentation fault (core dumped)

It might not be hard for an experienced developer to see what mistakes I’ve made in the above snippets. But let’s be frank: if you think of this code in terms of higher-level language constructs (print a string, modify a string, copy a string), they all make total sense, but the output is far from what you would expect.

A few related questions:

Why can’t we declare an array of unknown size inside a function without initializing it, like so: int arr[]; ?

Why can we use arrays of unknown size as function parameters?

Why can’t an array be used as a function return value?

What happens if we compare two strings: "cat" == "cat" ? Is it going to be true or false?

The three concepts that we’ve used (strings, arrays, and pointers) are among the most basic things in C, but if you try doing anything with them without understanding how they’re represented in memory, you’re very likely to get the errors that we’ve just seen (or similar ones). Of course, you can learn a few do’s and don’ts of the language to avoid certain types of mistakes, but you will have a much better understanding if you study the underlying principles instead.

Automatic Variables and the Stack

If a variable of any type is declared inside a function and doesn’t have keywords extern or static , it’s an automatic variable. It means that when execution enters the block where the variable is declared, storage for the variable is automatically allocated, and it’s likewise automatically deallocated when execution of the block ends.
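To make the role of the static keyword concrete, here is a minimal sketch contrasting automatic and static storage. The two function names are invented for illustration and don't appear anywhere else in this article.

```c
/* Hypothetical helpers, not part of the article's examples. */

/* n is automatic: storage is set aside and initialized anew on every call. */
int count_automatic(void) {
    int n = 0;
    n++;
    return n;
}

/* n is static: a single object that keeps its value between calls. */
int count_static(void) {
    static int n = 0;
    n++;
    return n;
}
```

Calling count_automatic twice yields 1 both times, whereas count_static yields 1 and then 2.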

Variable scope (the section of the source code where a variable is visible) and storage duration (how long the corresponding region in memory is guaranteed to retain its value) are related but distinct concepts, to be explored in a future post. For now let’s just focus on automatic storage duration.

In JavaScript, we don’t deal with memory, so if we declare a variable, we never have to worry about how or where it’s stored. Not so in C — it’s very useful (and not too hard) to understand how variable storage is organized.

Automatic allocation is usually implemented using a stack — a special region of memory that holds local variables and return addresses, and may also contain function parameters and return values.

Here’s a detailed example. The program doesn’t do anything useful — we’re only looking at it to understand how memory allocation on the stack works.

void one(void);

void two(void);

int main(void) {

one();

return 0;

}

void one(void) {

long local1 = 1;

two();

}

void two(void) {

long local2 = 2;

}

When the program is compiled and linked on my Linux machine, here’s what it looks like:

00000000004004d6 <main>:

4004d6: push rbp

4004d7: mov rbp,rsp

4004da: call 4004e6 <one>

4004df: mov eax,0x0

4004e4: pop rbp

4004e5: ret

00000000004004e6 <one>:

4004e6: push rbp

4004e7: mov rbp,rsp

4004ea: sub rsp,0x10

4004ee: mov QWORD PTR [rbp-0x8],0x1

4004f6: call 4004fe <two>

4004fb: nop

4004fc: leave

4004fd: ret

00000000004004fe <two>:

4004fe: push rbp

4004ff: mov rbp,rsp

400502: mov QWORD PTR [rbp-0x8],0x2

40050a: nop

40050b: pop rbp

40050c: ret

40050d: nop DWORD PTR [rax]

The above is the abridged output of objdump:

> gcc main.c -o out

> objdump -d -M intel --no-show-raw-insn out

You can read more about objdump and its options on its manual page.

The addresses on the left are the memory addresses where the instructions will reside at execution time. You don’t have to understand all the instructions; we’ll just look at a few lines.

The processor here (x86_64 architecture) has a number of registers: small, very fast storage locations inside the CPU itself. They are much faster to access than main memory and are addressed by name (rbp, rsp, eax, etc.) rather than by numeric address. The general-purpose registers are 8 bytes wide; a name like eax refers to the lower 4 bytes of the 8-byte register rax.

rbp and rsp are two 8-byte registers that have special meaning (they’re called base pointer and stack pointer, respectively). In the course of program execution, the values of these registers change in such a way that they always define the current function’s stack frame.

Also, long is chosen as the variable type in the source because its size on a 64-bit machine is 8 bytes, just like memory addresses and registers like rbp and rsp . We could use any other integer type here (like int , short , or char ), but with long the stack is nicely divided into 8-byte segments, each of them holding one value.

4004e7: mov rbp,rsp : copies the value of rsp to rbp .

Here’s what the stack looks like after the execution of the above line (the stack is not presented in full — only the relevant part):

rbp == 0x7fffffffdc70

rsp == 0x7fffffffdc70

Address | Value |

0x7fffffffdc78 | 0x4004df | <- return address into “main”

0x7fffffffdc70 | 0x7fffffffdc80 | <- rbp and rsp both point here; the value stored at this memory location is the previous value of rbp (0x7fffffffdc80)

We’ve just made rbp and rsp equal. rsp always points to the tip of the stack, which is its smallest address (currently 0x7fffffffdc70), since the stack on x86_64 grows down.

The runtime stack addresses in all the examples were obtained using gdb, the GNU debugger. The explanation of its usage is beyond the scope of this post.

4004ea: sub rsp,0x10 : in higher-level pseudocode it reads rsp = rsp - 16. In order to allocate space on the stack for the local variable, rsp is decreased by 16. rbp stays the same.

rbp == 0x7fffffffdc70

rsp == 0x7fffffffdc60

Address | Value |

0x7fffffffdc78 | 0x4004df |

0x7fffffffdc70 | 0x7fffffffdc80 | <- rbp still points here

0x7fffffffdc68 | cruft |

0x7fffffffdc60 | cruft | <- rsp points here

“Cruft” means that the cell can hold any value, but we’re not interested in it since we’re not planning to read it before writing.

4004ee: mov QWORD PTR [rbp-0x8],0x1 : rbp - 8 in the context of the function one is the location of the variable local1. This line is where the actual long local1 = 1 initialization happens. We could later change the variable’s value: a different value would be written to the same memory location, and it would always be addressed as rbp - 8, because while execution is inside one, the value of rbp doesn’t change.

rbp == 0x7fffffffdc70

rsp == 0x7fffffffdc60

Address | Value |

0x7fffffffdc78 | 0x4004df |

0x7fffffffdc70 | 0x7fffffffdc80 | <- rbp still points here

0x7fffffffdc68 | 1 | <- local1 value

0x7fffffffdc60 | cruft | <- rsp points here

The compiler allocated more space on the stack than was needed for our variable, but that’s due to stack alignment requirements, and we don’t have to worry about it.

4004f6: call 4004fe <two> — we’re entering function two . Two things happen here: execution jumps to 0x4004fe (the address of the first instruction of two ), and the return address is pushed onto the stack.

rbp == 0x7fffffffdc70

rsp == 0x7fffffffdc58

Address | Value |

0x7fffffffdc78 | 0x4004df |

0x7fffffffdc70 | 0x7fffffffdc80 | <- rbp still points here

0x7fffffffdc68 | 1 | <- local1 value

0x7fffffffdc60 | cruft |

0x7fffffffdc58 | 0x4004fb | <- return address into "one"; rsp points here

4004fe: push rbp — pushes the current value of rbp onto the stack

rbp == 0x7fffffffdc70

rsp == 0x7fffffffdc50

Address | Value |

0x7fffffffdc78 | 0x4004df |

0x7fffffffdc70 | 0x7fffffffdc80 | <- rbp still points here

0x7fffffffdc68 | 1 | <- local1 value

0x7fffffffdc60 | cruft |

0x7fffffffdc58 | 0x4004fb | <- return address into "one"

0x7fffffffdc50 | 0x7fffffffdc70 | <- rsp points here

4004ff: mov rbp,rsp : copies the value of rsp to rbp. The same thing happened when we entered one. In this unoptimized build, every function starts with the same pair of instructions (push rbp followed by mov rbp,rsp).

rbp == 0x7fffffffdc50

rsp == 0x7fffffffdc50

Address | Value |

0x7fffffffdc78 | 0x4004df |

0x7fffffffdc70 | 0x7fffffffdc80 |

0x7fffffffdc68 | 1 | <- local1 value

0x7fffffffdc60 | cruft |

0x7fffffffdc58 | 0x4004fb | <- return address into "one"

0x7fffffffdc50 | 0x7fffffffdc70 | <- both rsp and rbp point here

At this point, we can’t access local1 as rbp - 8 anymore, but that’s fine since the variable is out of scope. Remember the source: local1 is local to one, so we can’t refer to it by name while we’re inside two (its storage is still allocated, though, and stays allocated until one returns).
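Out of scope is not the same as deallocated. As a small sketch, here is a variation on the program above in which two receives a pointer to the caller's local variable; the pointer parameter is my addition and is not in the original program.

```c
/* A variation on the article's program; the pointer parameter is an addition. */
static void two(long *callers_local) {
    long local2 = 2;
    /* one's stack frame is still live here, so writing through the pointer is fine */
    *callers_local += local2;
}

long one(void) {
    long local1 = 1;
    two(&local1);      /* pass the address of the automatic variable */
    return local1;     /* 1 + 2 */
}
```

Here one returns 3: the name local1 is not visible inside two, but the memory behind it is still valid for as long as one has not returned.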

What we’ve just seen is automatic memory allocation for local1. In the C code, we simply had to declare it, and it’s up to the compiler to set aside memory for it ( sub rsp, 0x10 ).

When execution of one ends, memory for local1 will be automatically deallocated in like manner: restoring the stack pointer to its earlier, higher value is one of the things that the leave instruction does. “Deallocation” doesn’t mean that the value at that memory cell changes immediately, but it’s considered cruft from now on and can be overwritten at any time.

Pointers

Equipped with knowledge of automatic variables, let’s talk about pointers.

A pointer is a variable that has a fixed size of eight bytes (on a 64-bit machine). We often use pointers for accessing vector data (arrays), but intrinsically they’re scalar values. We can cast an integer constant to a pointer type — that’s a legitimate operation.

void dummy_func(void) {

int *p = (int *) 123;

}

In fact, if we replace the pointer type with `long`, the compiled code will likely be exactly the same:

void dummy_func(void) {

long p = (long) 123;

}

The cast operator is unnecessary in the second example, I just wanted to point out that int * is directly replaceable with long here since each of the two types occupies 8 bytes in memory. And those 8 bytes are automatically allocated, similarly to the long variables in the previous section.

Generally, we don’t use pointers like that though, directly assigning numeric values to them, because if we try dereferencing such a pointer with the * operator, we’ll likely cause a segmentation fault error (because address 123 might be in a protected memory region).
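If you do need to treat an address as a number, the standard library offers the integer type uintptr_t for that purpose. The sketch below assumes your platform provides uintptr_t (it is technically optional, though mainstream compilers have it); the function name is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a pointer survives a round trip
   through an integer type. Assumes the platform provides uintptr_t. */
int pointer_round_trip(void) {
    long value = 42;
    long *p = &value;
    uintptr_t bits = (uintptr_t)(void *)p;  /* the address as a plain number */
    long *q = (long *)(void *)bits;         /* and back to a pointer */
    return q == p && *q == 42;
}
```

Unlike the constant 123, the number stored in bits came from a valid address, so dereferencing the pointer rebuilt from it is safe.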

There are several ways to make pointers useful. For example, we can indirectly reference another variable.

#include <stdio.h>

void dummy_func(void) {

long i = 123;

long *p = &i; // Initialize p with the address of i

printf("%ld\n", *p); // "123" will be printed

}

Two automatic variables are created, and the second one ( p ) holds the address of the first one, so the stack might look something like the following:

Address | Value |

0x7fffffffdc70 | 0x7fffffffdc68 | <- the "p" variable, holding the address of "i"

0x7fffffffdc68 | 123 | <- the "i" variable, holding the value 123

It’s up to the compiler to order automatic variables on the stack — the fact that i comes first in the source doesn’t matter.
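A common practical use of this kind of indirection is letting a function modify variables that belong to its caller. A minimal sketch, with a function name chosen just for this example:

```c
/* Hypothetical helper: exchanges two longs owned by the caller. */
void swap_longs(long *a, long *b) {
    long tmp = *a;   /* read the value a points to */
    *a = *b;         /* overwrite it with the value b points to */
    *b = tmp;
}
```

Given long x = 1, y = 2;, the call swap_longs(&x, &y); leaves x equal to 2 and y equal to 1. The function never sees the names x and y, only their addresses.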

Pointers, as already mentioned, can also be used as arrays.

#include <stdio.h>

#include <stdlib.h>

void dummy_func(void) {

long *p = malloc(10 * sizeof(long)); // Allocate memory for a 10-element array

p[0] = 123;

printf("%ld\n", p[0]); // "123" will be printed

free(p);

}

Again, when long *p is declared, the only memory automatically allocated on the stack is that for the pointer itself. Memory for the 10-element array has to be explicitly allocated by calling malloc and then later freed by calling free . These two functions are provided by the standard library, and they operate on the region of memory known as the heap. Operating systems usually provide low-level primitives for requesting memory from the heap and recycling it, and those are used by malloc and free , respectively.

The allocation on the heap using malloc is also called dynamic allocation, as opposed to automatic (explained above) and static (to be discussed in the section on strings).
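Because dynamically allocated memory is not tied to any stack frame, it can outlive the function that requested it. The sketch below illustrates that, and also checks whether malloc succeeded; the function name and the choice of filling the array with squares are mine.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a heap-allocated array holding 0, 1, 4, 9, ...
   or NULL if the allocation fails. The caller must free() the result. */
long *make_squares(size_t count) {
    long *p = malloc(count * sizeof *p);
    if (p == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        p[i] = (long)(i * i);
    }
    return p;   /* the pointer is copied out; the heap block stays allocated */
}
```

The pointer variable p itself is automatic and disappears when make_squares returns, but the block it pointed to remains valid until someone passes its address to free.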

Note that we’re using array syntax ( p[0] ) with a pointer type variable, instead of the dereferencing operator ( * ). We’ll explore why it’s possible in the next section.

We can also use a pointer to access a string:

#include <stdio.h>

void dummy_func(void) {

const char *p = "cat"; // a string is accessed through a pointer to char

printf("%c\n", p[0]); // "c" will be printed

}

Arrays

An array in C, as the standard puts it, is “a contiguously allocated nonempty set of objects with a particular member object type”.

And where is it allocated? If any variable, including an array, is declared within a function without keywords static or extern , it’s allocated automatically.

So, if we have a function:

void dummy_func(void) {

long arr[3] = {1, 2, 3};

// Use the array

}

then the stack during its execution would look something like the following:

Address | Value |

0x7fffffffdc70 | 3 | <- arr[2]

0x7fffffffdc68 | 2 | <- arr[1]

0x7fffffffdc60 | 1 | <- arr[0]

Remember how in the previous section we used array syntax (p[0]) with a pointer? It works because subscripting arr[1] is equivalent to pointer arithmetic followed by dereferencing ( *(arr + 1) ). In most cases when the compiler sees an array variable in an expression, it treats it as a pointer to the array’s first element (the sizeof and & operators are notable exceptions). That pointer has no dedicated storage, but the compiler knows its value.

In fact, array subscripting (the operation of getting an array element using the square bracket syntax) is by definition equivalent to adding the element index to the array variable and then dereferencing it (something like *(arr + 1) ).
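The equivalence can be checked directly. In this sketch (the function name is invented) every comparison holds, including the odd-looking 1[arr], which compiles because addition is commutative.

```c
/* Hypothetical helper: returns 1 if all spellings name the same element. */
int subscript_forms_agree(void) {
    long arr[3] = {10, 20, 30};
    return arr[1] == *(arr + 1)   /* the definition of subscripting */
        && 1[arr] == *(1 + arr)   /* the operands of + can be swapped */
        && &arr[1] == arr + 1;    /* same address either way */
}
```

Writing 1[arr] in real code is not recommended; it is shown here only because it makes the definition tangible.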

To drive it home, let’s add a pointer to the example:

void dummy_func(void) {

long arr[] = {1, 2, 3}; // We don’t need to specify the size of the array since we’re initializing it with three elements

long *ptr = arr; // We can do it since arr is implicitly substituted with a pointer that holds the address of array’s first element. We’re assigning a pointer value to a pointer variable here, so no error or warning will be generated.

}

In this case, arr[1] , *(arr + 1) , ptr[1] , and *(ptr + 1) all produce the same result.

But do arr[1] and ptr[1] compile to the same thing? Not necessarily. In the first case, the compiler knows the address of the first element of the array (technically, not the absolute address, but its offset from the base pointer), so at runtime we simply have to fetch the desired element.

Not so with a pointer: we first have to load the value it contains (which happens to be the address of the first array element), and only then can we fetch the array element. So with a pointer variable, unoptimized code typically needs one more instruction to obtain an element than with an array variable.

Here’s the stack for the example:

Address | Value |

0x7fffffffdc70 | 3 | <- arr[2] == *(arr + 2) == ptr[2] == *(ptr + 2) == 3

0x7fffffffdc68 | 2 | <- arr[1] == *(arr + 1) == ptr[1] == *(ptr + 1) == 2

0x7fffffffdc60 | 1 | <- arr[0] == *arr == ptr[0] == *ptr == 1

0x7fffffffdc58 | 0x7fffffffdc60 | <- ptr, holds the address of the first array element
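One place where the difference between an array and a pointer is visible from C itself is the sizeof operator, which does not treat an array as a pointer. A minimal sketch, with function names invented for illustration:

```c
#include <stddef.h>

/* Hypothetical helpers for comparing what sizeof reports. */
size_t size_of_array(void) {
    long arr[3] = {1, 2, 3};
    return sizeof arr;    /* the whole array: 3 * sizeof(long) */
}

size_t size_of_pointer(void) {
    long arr[3] = {1, 2, 3};
    long *ptr = arr;
    return sizeof ptr;    /* just the pointer variable */
}
```

On a typical 64-bit Linux system the first function returns 24 and the second returns 8, matching the stack layout above: three 8-byte cells for the array and one for ptr.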

Arrays don’t only produce different assembly code than pointers when compiled; there are also syntactic differences. You can create a pointer from an array, but not vice versa. You also can’t assign to arrays (initialization doesn’t count).

#include <stdlib.h>

void dummy_func(void) {

long *ptr = malloc(3 * sizeof(long));

long arr1[] = ptr; // Wrong: the initializer should contain a number of long values in braces, not a pointer; will cause a compile-time error

long arr2[3];

arr2 = {1, 2, 3}; // Wrong, an array cannot be the left side of an assignment operation (except at initialization); will also cause a compile-time error

}

You can’t initialize an array with the value of a pointer because whenever an array is declared, storage for a certain number of elements has to be allocated, as we discussed previously. So long arr1[] = ptr; doesn’t make sense because it’s not clear how much memory to allocate. Moreover, even if it was written as long arr1[3] = ptr; , it would still be ambiguous. Should we cast ptr to long and store it in the first array element, or should we copy a contiguous chunk of memory that ptr is pointing to into the chunk represented by arr1 ?

The reason you can’t assign anything to an array is that an array variable itself does not occupy any memory (although the elements of the array do). The assignment operator in C is always about storing a value in a memory location that the left-hand side of the assignment evaluates to. But arr2 does not have an associated memory location. So again, either we’d have to change the semantics of the assignment operator if there was an array on the left, or disallow assignment for arrays. Creators of C chose the latter, and it’s not hard to see their reasoning.
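When the goal is to give an existing array new contents, the usual approach is to copy the elements, for instance with memcpy from the standard library. A small sketch (the function name is made up):

```c
#include <string.h>

/* Hypothetical helper: copies one array into another and sums the copy. */
long copy_and_sum(void) {
    long src[3] = {1, 2, 3};
    long dst[3];
    /* dst = src; would be rejected by the compiler */
    memcpy(dst, src, sizeof dst);   /* copy the elements instead */
    return dst[0] + dst[1] + dst[2];
}
```

Note that sizeof dst gives the size of the whole array here, so exactly three long values are copied and the function returns 6.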

A couple more things to look at: using arrays as function parameters and returning arrays from functions.

Here’s a function declaration: void dummy_func(int a[]); . The function accepts one argument, which is an array of int . How do we pass it? In the section about the stack we discussed how storage for local variables is allocated. In fact, arguments are passed in a similar fashion. All arguments in C are passed by value (unlike JavaScript where objects are passed by reference), and so each of them has a concrete representation on the stack (in certain compilers and operating systems registers can be used for argument passing, but for the sake of this discussion let’s assume that the stack is always used). So if the function looked like void dummy_func(long l); , then 8 bytes on the stack at a fixed offset from the base pointer would be occupied by the argument value during the execution of the function.

An array of unknown size ( int a[] ) is different because, unsurprisingly, its size is not known at compile time. To remedy it, C converts the array parameter to a pointer parameter at compile time, so void dummy_func(int a[]); is treated exactly as void dummy_func(int *a); . Since the size of an argument has to be known at compile time, it’s quite natural to pass a pointer around — it has a fixed size of 8 bytes on a 64-bit machine.
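One way to convince yourself that the two spellings really are the same type: declare a function with array syntax and define it with pointer syntax. The compiler accepts this without complaint. The function below is a sketch with an invented name; since the pointer carries no length, the element count is passed separately.

```c
#include <stddef.h>

/* Declared with array syntax... */
long sum_ints(int a[], size_t count);

/* ...and defined with pointer syntax: the two declarations are compatible. */
long sum_ints(int *a, size_t count) {
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        total += a[i];   /* subscripting works on the pointer as usual */
    }
    return total;
}
```

Given int values[] = {1, 2, 3, 4};, the call sum_ints(values, 4) returns 10. Only the address of values[0] is passed; the array itself is not copied.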

Returning arrays from functions is not allowed in C, and this, again, is not a weird exception. On the contrary, it probably reflects an unwillingness to make weird exceptions. Return values, just like arguments, travel in a fixed-size slot (on the stack or in a register), so we can’t just return an array; we’d have to convert the array to a pointer. But if we created the array inside the function and then tried to return it, we would be returning a pointer to an automatic array whose storage duration ends when the function where it was created returns.

Long story short, returning arrays from functions would be awkward in C, so the language doesn’t allow it.

Update: it wouldn’t, in fact, be unrealistic to let functions have array return types, but the array size would always have to be fixed (something like int dummy(void)[3] ). It would only complicate things though — the behavior of return values would then be different from that of parameters (placing the whole array on the stack instead of passing a pointer).
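In practice there are two common ways around this restriction, sketched below with invented names: let the caller own the storage and pass a pointer to it, or wrap a fixed-size array in a struct, since structs (unlike arrays) can be returned by value.

```c
#include <stddef.h>

/* Option 1: the caller owns the storage and passes a pointer to it. */
void fill_sequence(long out[], size_t count) {
    for (size_t i = 0; i < count; i++) {
        out[i] = (long)i + 1;
    }
}

/* Option 2: wrap a fixed-size array in a struct and return the struct. */
struct triple {
    long items[3];
};

struct triple make_triple(void) {
    struct triple t = {{1, 2, 3}};
    return t;   /* the whole struct, array included, is copied to the caller */
}
```

The first option avoids copying and works for any length; the second is convenient for small arrays whose size is known at compile time.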

Strings

There are pointer types and array types in C, but strings, unlike in JavaScript, are not a separate data type. According to the standard, a string “is a contiguous sequence of characters terminated by and including the first null character.”

To put it another way, strings are simply arrays of char whose last element is its first '\0' character. The word “first” is important here — strings can’t have a '\0' in the middle. (Newer versions of C also support multibyte character strings, but for the sake of this conversation we can assume that every character is represented by a single char ).

So if we’re talking about a string representing the word “cat”, it’s four characters: 'c' , 'a' , 't' , '\0' . They are located at consecutive addresses in memory, and the last character is simply the number 0.

It looks something like this:

Address | Value |

0x7fffffffdc73 | '\0' (0) |

0x7fffffffdc72 | 't' (116) |

0x7fffffffdc71 | 'a' (97) |

0x7fffffffdc70 | 'c' (99) |

See how previously we were displaying addresses with a step of 8 bytes, whereas here the step is one byte. The numbers associated with characters are from the ASCII table.

String literals are related to strings, but they are a distinct concept. A string literal is a sequence of characters in a source file (not in memory) enclosed in double quotes, like "cat" .

A string literal does not necessarily represent a string. For instance, "cat\0dog" is a valid string literal, but what it creates in memory is not a string since a string can’t contain null characters in the middle.

Also, char str[] = {'c', 'a', 't', '\0'}; creates a string but doesn’t use a string literal.
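The gap between a literal and a string shows up when you compare sizeof with strlen on the "cat\0dog" literal mentioned above. A small sketch with made-up function names:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers. */
size_t literal_size(void) {
    return sizeof "cat\0dog";    /* every byte of the array, both nulls included */
}

size_t literal_length(void) {
    return strlen("cat\0dog");   /* stops at the first null character */
}
```

The first function returns 8 (seven written characters plus the implicit terminator), while the second returns 3, because as far as string functions are concerned the string ends after "cat".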

When the compiler encounters a string literal of length N, it makes sure that when the program runs, (N+1) bytes of static storage are allocated and filled with the necessary characters before the execution of the program starts. Static storage, as opposed to automatic and dynamic, is allocated at the beginning of execution and is freed after the program terminates. (N+1) bytes are needed to include all the characters of the literal plus the null character at the end.

Then the compiler converts the literal to a pointer to its first character, so char *ptr = "cat"; means that a character sequence will be created somewhere in memory and "cat" will be implicitly converted to a value of type “pointer to char”, which will then be used to initialize the variable ptr . ptr would contain the address of the memory location where character 'c' is stored.

There is one exception to the “convert string literal to pointer” behavior, namely array initialization. Arrays inside functions are automatic variables, so if we do char arr[] = "cat"; , then four bytes are allocated for the array on the stack and filled with the characters of the literal (plus the null character). "cat" is not converted to a pointer in this instance. The result of this initialization will be the same as if we wrote char str[] = {'c', 'a', 't', '\0'}; .
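The two initializations can be put side by side. In this sketch (the function name is mine) the array gets its own four bytes, so modifying it is harmless and leaves the literal in static storage untouched.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the array behaves as an independent copy. */
int array_gets_its_own_copy(void) {
    char arr[] = "cat";          /* four bytes of automatic storage */
    const char *ptr = "cat";     /* pointer into static storage */
    arr[0] = 'b';                /* fine: modifies the array's own copy */
    return sizeof arr == 4
        && strcmp(arr, "bat") == 0
        && strcmp(ptr, "cat") == 0;   /* the literal is untouched */
}
```

Declaring ptr as const char * is a deliberate choice: it makes the compiler reject accidental writes through the pointer.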

We also need to understand why a null character is required to terminate a string. Isn’t it wasteful? Actually, it’s not. A string is just a sequence of bytes in memory; no meta information is stored anywhere. Whenever we wish to display a string in the console, we need to know how many characters to display. That’s what the null character is for: it’s a way to delimit the string in the absence of an explicitly stored length.
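To see how the terminator is used, here is a hand-written sketch of what a length function has to do. It mirrors the idea behind the standard strlen, but the name and implementation are mine and not the library's actual code.

```c
#include <stddef.h>

/* Hypothetical helper: counts characters up to, but not including,
   the first null character. */
size_t count_chars(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

count_chars("cat") returns 3. If the terminator is missing, the loop has no way to know where to stop and keeps reading whatever happens to follow in memory.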

Answers to Questions

We’ve learned enough now to be able to understand what’s happening in the examples from the beginning of the article.

Question 1: why are we getting a weird character in the output?

#include <stdio.h>

int main() {

char str[3] = "cat";

puts(str);

return 0;

}

> gcc main.c -o out

> ./out

cat�

Answer: we’re using an array of insufficient length for storing the string (it should have been four, not three), so the terminating null character is not stored. When the string is printed, memory locations are traversed until a byte with all bits set to zero is encountered. Here that means reading past the end of the array, which is undefined behavior, so what gets printed depends on the compiler and the execution environment.

You should never do the above. Simply use an array of unknown size ( char str[] = "cat"; ): it will be sized to fit the literal, including the terminating null character.

Question 2: why does behavior differ if we initialize an array and a pointer with a string literal and then try to reassign an element of the resultant array? char *str = "cat"; str[0] = 'b'; and char str[] = "cat"; str[0] = 'b'; .

Answer: remember, when there’s a string literal in the code, the compiler allocates static storage for the array of characters that it represents, and the literal is replaced with a pointer to that memory during compilation. So when a pointer variable is initialized ( char *str = "cat"; ), the value might be a read-only memory address (it’s up to the compiler where to place the string). Writing to that address ( str[0] = 'b'; ) is undefined behavior; on a typical Linux system the process is terminated with a segmentation fault. When you initialize an array though, storage for the array is allocated on the stack and the characters of "cat" are copied into it. The stack is writable during program execution, so it works fine.

Question 3: why does string copying fail?

#include <string.h>

int main() {

char *str1 = "cat";

char *str2;

strcpy(str2, str1);

return 0;

}

Answer: the strcpy function copies the string pointed to by the second parameter into the array pointed to by the first parameter. str2 was not initialized with a value, and thus contains cruft. strcpy still happily treats that cruft as an address, and tries to write to it, but that address is likely not in a writable memory region.
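A working version has to give the destination pointer somewhere valid to point before copying. One possible sketch, using dynamic allocation (the function name is invented; a sufficiently large automatic array would work just as well):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of src,
   or NULL if the allocation fails. The caller must free() the result. */
char *copy_string(const char *src) {
    char *dst = malloc(strlen(src) + 1);   /* +1 for the null character */
    if (dst == NULL) {
        return NULL;
    }
    strcpy(dst, src);   /* dst now points to writable memory of sufficient size */
    return dst;
}
```

The + 1 matters: strlen does not count the terminator, but strcpy copies it.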

Question 4: why can’t we declare an array of unknown size inside a function without initializing it, like so: int arr[]; ?

Answer: because an automatic variable must have a known size when it’s declared. When the code is compiled, there are no symbolic names for variables, and all values are rather stored and retrieved by offsets from the beginning or end of the stack frame. If we don’t know a variable’s size, we don’t know offsets for variables following it.

Update: C99 introduced variable-length arrays to the language, and their size is in fact not known at compile time, and it’s neither the stack pointer nor the base pointer that’s used to calculate addresses of the arrays’ elements. However, variable length just means that the length is defined at runtime, not that it might change once it’s defined. And in the case of an unknown-size array, if int arr[]; was allowed, the language would have to allow assignment of arrays (which it doesn’t) and also either disallow further assignments (to prevent changes to the array’s size) or make arrays resizable, which would be rather inefficient on the stack.

Question 5: why can we use arrays of unknown size as function parameters?

Answer: arrays as function parameters are simply converted to pointers, so the size of such a parameter is always the size of a pointer variable (8 bytes in the case of a 64-bit processor architecture).

Question 6: Why can’t an array be used as a function return value?

Answer: see the “Arrays” section.

Question 7: What happens if we compare two strings: "cat" == "cat" ? Is it going to be true or false?

Answer: we first need to understand what we’re comparing here. A string literal is replaced with a pointer, so two pointers are being compared, which has nothing to do with lexicographic comparison of strings (like what the equality operator does in JavaScript). So do the pointers have the same value? The answer is, maybe. The compiler may choose to allocate memory separately for each string literal in the code, or reuse memory for identical literals. This behavior is unspecified, so you shouldn’t rely on it being implemented one way or the other. Basically, never write code where a comparison operator has a string literal as one of its operands, because you have no control over the address where it will be allocated. And if you need to lexicographically compare two strings, use the strcmp function.
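The difference between comparing addresses and comparing contents can be sketched with two separate arrays, whose addresses (unlike those of literals) are guaranteed to differ. Function names are made up for illustration.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if both strings hold the same characters. */
int same_text(const char *a, const char *b) {
    return strcmp(a, b) == 0;   /* strcmp returns 0 for equal strings */
}

/* Two separate arrays with equal contents: different addresses, same text. */
int equal_text_distinct_storage(void) {
    char first[] = "cat";
    char second[] = "cat";
    const char *p = first;
    const char *q = second;
    return p != q && same_text(p, q);
}
```

equal_text_distinct_storage returns 1: the pointers differ, yet strcmp reports the strings as equal.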

Conclusion

Hopefully, it’s clearer now why arrays and pointers function the way they do, and what strings really are. Certain decisions of the language creators have also probably started making sense — the features that you might previously have seen as quirks are actually based on sound principles.

Resources

The C standard: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf

AMD64 Linux ABI: https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf

The Development of the C Language by Dennis Ritchie: https://www.bell-labs.com/usr/dmr/www/chist.html

The C FAQ (specifically its “Arrays and Pointers” section): http://c-faq.com/aryptr/index.html

Eli Bendersky’s excellent articles: https://eli.thegreenplace.net/2009/10/21/are-pointers-and-arrays-equivalent-in-c and https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/
