October, 2005: Managed String Library for C

Robert C. Seacord is Senior Vulnerability Analyst for CERT/CC and author of Secure Coding in C and C++ (Addison-Wesley, 2005). He can be reached at [email protected]

Stringssuch as command-line arguments, environment variables, and console inputare of special concern in secure programming because they account for most of the data exchanged between an end user and a software system. Graphic and web-based applications make extensive use of text input fields and, because of standards like XML, data exchanged between programs is increasingly in string form as well. As a result, weaknesses in string representation, string management, and string manipulation have led to a broad range of software vulnerabilities and exploits.

Many of the vulnerabilities in existing C code result from interactions with standardized library calls that, by today's standards, would no longer be considered secure (strcpy(), for example). Unfortunately, because these functions are standard, they continue to be supported and developers continue to use them, often to detrimental effect.

Common String-Manipulation Errors

Common string-manipulation errors include unbounded string copies, null-termination errors, string truncation, and improper data sanitization [6].

Unbounded string copies occur when data is copied from an unbounded source to a fixed-length character array (for instance, when reading from standard input into a fixed-length buffer using gets()). Reading data from unbounded sources creates an interesting problem for a programmer. Because it is not possible to know beforehand how many characters a user will supply, it is impossible to know whether an allocated array is of sufficient length. A common solution is to statically allocate an array that is much larger than anticipated and assume the length will not be exceeded. While this approach works well with friendly users, malicious users can easily exceed a fixed-length character array. Unbounded string copies are also common using functions such as strcpy() and strcat() that do not take the size of the destination buffer into account.

Another common problem with C-style strings is a failure to properly null terminate. For example, strncpy() is frequently recommended as a secure alternative to strcpy(). However, incorrect use of this function can still result in buffer overflows or null-termination errors, as in:

char buff[10];
strncpy(buff, "1234567890", sizeof(buff));

In this case, the 10-character string copied into buff completely fills the available space, leaving no room for a terminating null character. Null-termination errors are difficult to find and can remain dormant and undetected in deployed code until a particular set of inputs causes a failure.
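One common way to avoid this error is to copy at most one byte less than the buffer size and then write the terminator explicitly. The following is a minimal sketch of that idea; the helper name copy_terminated() is an assumption made for illustration and is not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the article or any library): copy src into
 * dst with strncpy() and always null-terminate the result.
 * Returns 0 if the whole string fit, -1 if it was truncated or the
 * arguments are unusable. */
int copy_terminated(char *dst, size_t dstsize, const char *src)
{
    if (dst == NULL || src == NULL || dstsize == 0) {
        return -1;
    }
    strncpy(dst, src, dstsize - 1);  /* reserve the last byte for the terminator */
    dst[dstsize - 1] = '\0';         /* strncpy() adds none when src fills the count */
    return (strlen(src) < dstsize) ? 0 : -1;
}
```

Called with a 10-byte buffer and the string "1234567890", this sketch stores "123456789" and returns -1, so the caller learns that the data did not fit instead of inheriting an unterminated array.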

String-truncation errors are another example of a problem that has been caused, to some degree, by attempts to prevent buffer overflows. In an effort to eliminate buffer overflows, security experts have recommended the use of functions that restrict the number of bytes copied; for example, strncpy() instead of strcpy(), fgets() instead of gets(), and snprintf() instead of sprintf(). While these alternative functions can be effective in mitigating buffer-overflow vulnerabilities, it is often at the cost of truncating strings that exceed the specified limits [3]. This, of course, results in a loss of data, and in some cases can also lead to software vulnerabilities.
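Truncation by these bounded functions is silent unless the caller checks for it. As a rough illustration, snprintf() returns the number of characters the complete output would have required, so comparing that value against the buffer size reveals truncation. The helper name format_checked() and the greeting format are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a greeting into dst and report truncation.
 * Returns 0 if the output fit, 1 if it was truncated, -1 on an
 * snprintf() error. */
int format_checked(char *dst, size_t dstsize, const char *name)
{
    int n = snprintf(dst, dstsize, "Hello, %s", name);
    if (n < 0) {
        return -1;              /* encoding error reported by snprintf() */
    }
    if ((size_t)n >= dstsize) {
        return 1;               /* complete output needed n + 1 bytes */
    }
    return 0;
}
```

A program that ignores the return value still gets a null-terminated buffer, but it has no way to know that part of the data was discarded.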

Improper data sanitization can also lead to vulnerabilities, particularly when data is passed between different logical units. John Viega and Matt Messier provide an example of an application that inputs an e-mail address from a user and writes the address to a buffer [7]:

sprintf(buff, "/bin/mail %s < /tmp/email", addr);

The buffer is then executed using the system() call. The risk is, of course, that the user enters this string as an e-mail address:

Current or Proposed Solutions

Ideally, a secure string library would mitigate security risks resulting from these common programming errors. This string library should also be implemented with these goals in mind:

Succeed or fail. Managed string library functions should succeed or fail loudly. Copy and concatenation functions, for example, should copy the entire string or fail so that string truncation does not occur. Undefined behavior, from copying overlapping objects, for example, must be eliminated.

Familiar to C programmers. The managed string API should be familiar to C programmers to ease adoption and promote standardization.

Legacy system modernization. The managed string API should provide similar semantics to existing, standard C library functions to simplify legacy-system modernization.

Not all of these objectives are fully compatible. It is not possible, for example, to fully preserve existing APIs and provide an adequate level of protection against buffer overflows and other potential security flaws.

Because errors in string manipulation have long been recognized as a leading source of vulnerabilities in C-language programs, a number of libraries have been developed to prevent these common programming errors. These libraries can be categorized as static or dynamic, depending on how they allocate space.

Libraries that implement a static approach allocate fixed-sized strings, meaning that once the character array has been filled, it is impossible to add data. Examples of the static approach include the ISO/IEC TR 24731 Specification for Safer, More Secure C Library Functions [4] and OpenBSD's strlcpy() and strlcat() [5]. Because the static approach discards excess data, there is always a possibility that actual program data will be lost. Consequently, the resulting string must be fully validated.

String libraries that implement a dynamic approach resize strings as required. Dynamic approaches scale better and do not discard excess data. The major disadvantage is that if inputs are not limited, they can exhaust memory and consequently be used in denial-of-service attacks. Examples of dynamic approaches include the C String Library (SafeStr) from Matt Messier and John Viega [2] and Vstr by James Antill [1].

ISO/IEC TR 24731 Functions

ISO/IEC TR 24731 defines alternative versions of C standard functions that are designed to be safer replacements for existing functions. For example, ISO/IEC TR 24731 defines the strcpy_s(), strcat_s(), strncpy_s(), and strncat_s() functions as replacements for strcpy(), strcat(), strncpy(), and strncat(), respectively.

The ISO/IEC TR 24731 functions were created by Microsoft to help retrofit its existing, legacy code base in response to numerous, well-publicized security incidents over the past decade. These functions were then proposed to the ISO/IEC JTC1/SC22/WG14 international standardization working group for the programming language C for standardization.

The strcpy_s() function, for example, has this signature:

errno_t strcpy_s( char * restrict s1, rsize_t s1max, const char * restrict s2);

The signature is similar to strcpy() but takes an extra argument of type rsize_t that specifies the maximum length of the destination buffer. (Functions that accept parameters of type rsize_t diagnose a constraint violation if the values of those parameters are greater than RSIZE_MAX. Extremely large object sizes are frequently a sign that an object's size was calculated incorrectly. For example, negative numbers appear as very large positive numbers when converted to an unsigned type like size_t. For those reasons, it is sometimes beneficial to restrict the range of object sizes to detect errors. For machines with large address spaces, ISO/IEC TR 24731 recommends that RSIZE_MAX be defined as the smaller of the size of the largest object supported or (SIZE_MAX >> 1), even if this limit is smaller than the size of some legitimate, but very large, objects.) The semantics are also similar. When there are no input validation errors, the strcpy_s() function copies characters from a source string to a destination character array up to and including the terminating null character. The function returns zero on success.

The strcpy_s() function only succeeds when the source string can be fully copied to the destination without overflowing the destination buffer. The following conditions are treated as a constraint violation:

The source or destination pointer is null.

The maximum length of the destination buffer is equal to zero, greater than RSIZE_MAX, or less than or equal to the length of the source string.

When a constraint violation is detected, the destination string is set to the null string and the function returns a nonzero value. In Listing 1, the strcpy_s() function is used to copy src1 to dst1. However, the call to copy src2 to dst2 fails because there is insufficient space available to copy the entire string, which consists of seven characters, to the destination buffer. As a result, r2 is assigned a nonzero value and dst2[0] is set to '\0'.
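The checks described above can be approximated in portable C. The sketch below is not the TR 24731 implementation: the names sketch_strcpy_s() and SKETCH_RSIZE_MAX are assumptions, size_t and int stand in for rsize_t and errno_t, and the overlap check that a real implementation would perform is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed stand-in for RSIZE_MAX, following the (SIZE_MAX >> 1) suggestion. */
#define SKETCH_RSIZE_MAX (SIZE_MAX >> 1)

/* Sketch of the strcpy_s() behavior described in the text.
 * Returns 0 on success and nonzero on a constraint violation. */
int sketch_strcpy_s(char *s1, size_t s1max, const char *s2)
{
    size_t len = 0;

    if (s1 == NULL || s1max == 0 || s1max > SKETCH_RSIZE_MAX) {
        return 1;               /* no destination that is safe to touch */
    }
    if (s2 == NULL) {
        s1[0] = '\0';
        return 1;
    }
    /* Measure the source, but never look further than s1max characters. */
    while (len < s1max && s2[len] != '\0') {
        len++;
    }
    if (len == s1max) {         /* s1max <= length of source: will not fit */
        s1[0] = '\0';
        return 1;
    }
    memmove(s1, s2, len + 1);   /* copy the characters and the terminator */
    return 0;
}
```

With an 8-byte destination, a seven-character source is copied in full; with a 4-byte destination, the same call returns nonzero and leaves an empty string behind rather than a truncated one.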

Users of the ISO/IEC TR 24731 functions are less likely to introduce a security flaw because the size of the destination buffer and the maximum number of characters to append must be specified. ISO/IEC TR 24731 functions also ensure null termination of the destination string.

ISO/IEC TR 24731 functions are still capable of overflowing a buffer if the maximum length of the destination buffer and number of characters to copy are incorrectly specified. As a result, these functions are not especially secure but may be useful in preventive maintenance to reduce the likelihood of vulnerabilities in an existing legacy code base.

strlcpy() and strlcat()

Many UNIX variants, including most BSD implementations and Solaris, offer the strlcpy() and strlcat() functions to copy and concatenate strings in a less error-prone manner. These functions' prototypes are:

size_t strlcpy(char *dst, const char *src, size_t size);
size_t strlcat(char *dst, const char *src, size_t size);

The strlcpy() function copies the null-terminated string from src to dst (up to size - 1 characters, followed by a terminating null). The strlcat() function appends the null-terminated string src to the end of dst (but no more than size characters, including the terminating null, will be in the destination).

To help prevent writing outside the bounds of the array, the strlcpy() and strlcat() functions accept the full size of the destination string as a size parameter. For statically allocated destination buffers, this value is easily computed at compile time using the sizeof() operator.

Both functions guarantee that the destination string is null terminated for all nonzero-length buffers to prevent null-termination errors.

The strlcpy() and strlcat() functions return the total length of the string they tried to create. For strlcpy(), that is simply the length of the source; for strlcat(), it is the length of the destination (before concatenation) plus the length of the source. To check for truncation, programmers need to verify that the return value is less than the size parameter. If the resulting string is truncated, programmers now know the number of bytes needed to store the entire string and may reallocate and recopy.
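The truncation check can be illustrated with a small stand-in for strlcpy(). The function below is a sketch written for this example under the semantics described above, not the OpenBSD source; it is named sketch_strlcpy() (an assumed name) so that it does not collide with a system-provided strlcpy().

```c
#include <stddef.h>
#include <string.h>

/* Sketch of strlcpy() semantics: copy at most size - 1 characters,
 * always terminate when size is nonzero, and return strlen(src). */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);

    if (size != 0) {
        size_t n = (srclen < size) ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;              /* the length the copy tried to create */
}

/* Hypothetical wrapper: returns 1 when src did not fit in dst, else 0. */
int copy_was_truncated(char *dst, size_t dstsize, const char *src)
{
    return sketch_strlcpy(dst, src, dstsize) >= dstsize;
}
```

When copy_was_truncated() reports 1, the value returned by sketch_strlcpy() plus one is the buffer size a caller would need to allocate before recopying.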

Neither strlcpy() nor strlcat() zero-fills its destination string (other than the compulsory null byte to terminate the string). This results in performance close to that of strcpy() and much better than that of strncpy() [5].

The strlcpy() and strlcat() functions are not universally available in the standard libraries of UNIX systems; in particular, they are not available for GNU/Linux. Because they are relatively small functions, however, you can easily include them in your own program's source whenever the underlying system doesn't provide them. It is still possible that the incorrect use of these functions results in a buffer overflow if the specified buffer size is longer than the actual buffer length. Truncation errors are also possible if you fail to verify the results of these functions.

SafeStr

The C String Library (SafeStr) from Messier and Viega provides a rich string-handling library for C that has secure semantics, yet is interoperable with legacy library code in a straightforward manner.

The SafeStr library uses a dynamic approach for C that automatically resizes strings as required. SafeStr accomplishes this by reallocating memory and moving the contents of the string whenever an operation requires that a string grow in size. As a result, buffer overflows should not be possible when using the library.

The SafeStr library is built around the safestr_t type. The safestr_t type is compatible with char * and allows safestr_t structures to be cast as char * and behave as C-style strings. The safestr_t type keeps accounting information (that is, the actual and allocated length) in memory directly preceding the memory referenced by the pointer.
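The general technique of keeping accounting information in front of the character data can be sketched as follows. This is an illustration of the idea only: the header layout and the sketch_ names are assumptions and do not reproduce SafeStr's actual structure or API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed header layout; it sits directly before the characters. */
typedef struct {
    size_t length;     /* characters in use, excluding the terminator */
    size_t allocated;  /* bytes reserved for characters and terminator */
} sketch_header;

/* Allocate header and characters in one block and return a pointer to
 * the characters, which can be passed to functions expecting char *. */
char *sketch_str_create(const char *init)
{
    size_t len = strlen(init);
    sketch_header *h = malloc(sizeof *h + len + 1);

    if (h == NULL) {
        return NULL;
    }
    h->length = len;
    h->allocated = len + 1;
    memcpy(h + 1, init, len + 1);   /* characters start right after the header */
    return (char *)(h + 1);
}

/* Step back over the header to read the stored length. */
size_t sketch_str_length(const char *s)
{
    const sketch_header *h = (const sketch_header *)s - 1;
    return h->length;
}

/* Release the whole block, starting from the header. */
void sketch_str_free(char *s)
{
    if (s != NULL) {
        free((sketch_header *)s - 1);
    }
}
```

The pointer returned by sketch_str_create() works with functions such as strcmp() or printf(), while the library-side functions recover the bookkeeping by stepping back over the header. Passing an ordinary char * that was not created this way to sketch_str_length() would read memory that does not belong to the string, which is one reason such libraries need a way to tell their own strings apart.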

Error handling in SafeStr is performed using XXL, a library that provides both exceptions and asset management for C and C++. The caller is responsible for handling exceptions thrown by SafeStr and XXL. If no exception handler is specified, the default action is to output a message to stderr and call abort(). The dependency on XXL can be an issue because both libraries need to be adopted to support this solution.

Listing 2 is a program that uses SafeStr and XXL. This program allocates two strings and copies one string to the other. The use of XXL provides a convenient mechanism for error checking.

Managed String Library

The managed string library was developed in response to the need for a string library that can improve the quality and security of newly developed C-language programs while eliminating obstacles to widespread adoption and possible standardization.

As the name implies, the managed string library is based on a dynamic approach, in that memory is allocated and reallocated as required. This approach eliminates the possibility of unbounded copies, null-termination errors, and truncation by ensuring there is always adequate space available for the resulting string (including the terminating null character). The one exception is if memory is exhausted, which is treated as an error condition. In this way, the managed string library accomplishes the goal of succeeding or failing loudly.

The managed string library also protects against improper data sanitization by (optionally) ensuring that all characters in a string belong to a predefined set of "safe" characters.

Listing 3 illustrates the structure of the managed string type string_m. The string_m type is a structure consisting of a size and a character pointer. The size contains the size of the memory allocated for the string, not the string length. The character pointer references an array of size characters. For compatibility with existing C-library functions, the string consists of a contiguous sequence of characters terminated by and including the first null character. The str field in the string_m structure points to the initial character in the string. The length of a string is the number of bytes preceding the null character and is always less than the size of the string. This structure was created, in part, to apply software engineering principles of data encapsulation and data hiding. Users of the managed string library should, under no circumstance, directly access the size or str fields of the string_m structure, as this could compromise the integrity of the data structure and the security of the library.
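A structure of this shape, together with an append operation that grows the allocation or fails, might look like the sketch below. The type and function names (sketch_string, sketch_create(), sketch_append(), sketch_destroy()) are assumptions made for illustration; they are not the managed string library's declarations, and the test code reads the fields directly only because it is part of the same sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout based on the description: a size and a character pointer. */
typedef struct {
    size_t size;  /* bytes allocated for str, not the string length */
    char *str;    /* null-terminated contents */
} sketch_string;

/* Create a string from a C-style string. Returns 0 on success. */
int sketch_create(sketch_string *s, const char *cstr)
{
    size_t need;
    char *p;

    if (s == NULL || cstr == NULL) {
        return -1;
    }
    need = strlen(cstr) + 1;
    p = malloc(need);
    if (p == NULL) {
        return -1;              /* memory exhausted: fail, never truncate */
    }
    memcpy(p, cstr, need);
    s->size = need;
    s->str = p;
    return 0;
}

/* Append src to dst, growing dst as required. Returns 0 on success;
 * on failure dst is left unchanged. */
int sketch_append(sketch_string *dst, const sketch_string *src)
{
    size_t dlen, slen, need;

    if (dst == NULL || src == NULL || dst->str == NULL || src->str == NULL) {
        return -1;
    }
    dlen = strlen(dst->str);
    slen = strlen(src->str);
    if (slen > SIZE_MAX - dlen - 1) {
        return -1;              /* required size would overflow size_t */
    }
    need = dlen + slen + 1;
    if (need > dst->size) {
        char *p = realloc(dst->str, need);
        if (p == NULL) {
            return -1;
        }
        dst->str = p;           /* also updates src when dst == src */
        dst->size = need;
    }
    memmove(dst->str + dlen, src->str, slen);  /* safe for self-append */
    dst->str[dlen + slen] = '\0';
    return 0;
}

/* Release the storage and reset the fields. */
void sketch_destroy(sketch_string *s)
{
    if (s != NULL) {
        free(s->str);
        s->str = NULL;
        s->size = 0;
    }
}
```

Because the allocation is resized before any characters are written, the result is either the complete concatenation or an error return with the destination untouched.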

The managed string library handles errors by consistently returning the status code in the function return value. This approach encourages status checking because the user can insert the function call as the expression in an if statement and take appropriate action on failure. The greatest disadvantage of this approach is that it prevents functions from returning any other value. This means that all values (other than the status) returned by a function must be returned as a pass-by-reference parameter, preventing a programmer from nesting function calls. This tradeoff is necessary because nesting function calls can conflict with a programmer's willingness to check status codes.
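The calling convention described here, with the status in the return value and the result in a pass-by-reference parameter, can be sketched in a few lines. The enumeration and the function sketch_strlen() are hypothetical names chosen for this example, not the library's API.

```c
#include <stddef.h>
#include <string.h>

/* Assumed status codes for this sketch. */
typedef enum {
    SKETCH_OK = 0,
    SKETCH_NULL_ARG = 1
} sketch_status;

/* The length comes back through *len because the return value is
 * reserved for the status. *len is written only on success. */
sketch_status sketch_strlen(const char *s, size_t *len)
{
    if (s == NULL || len == NULL) {
        return SKETCH_NULL_ARG;
    }
    *len = strlen(s);
    return SKETCH_OK;
}
```

A caller would typically write if (sketch_strlen(input, &n) != SKETCH_OK) { /* handle the error */ }, which keeps the status check next to the call. The cost is visible too: an expression such as f(g(x)) is no longer possible because g() cannot return its result directly.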

All functions in the managed string library operate on the managed string type string_m with the exception of the string creation function strcreate_m(), which is used to create a managed string from a C-style string or string literal.

Listing 4 shows how strcreate_m() is used to create the string "test". The managed string API also includes the getstr_m() function, which can be used to extract a copy of a C-style string from a managed string. This function is necessary to maintain compatibility with functions that do not operate on managed strings. For example, formatted output functions such as fprintf() (also referenced in Listing 4) are outside the managed string library.

The managed string library also supports the concept of a NULL string and an empty string to allow conversion between managed strings and C-style strings (Listing 5). However, it is necessary to use isnullstr_m() and isemptystr_m() to test for null and empty strings. When converting from a NULL managed string to a C-style string, the getstr_m() function simply returns NULL (no memory is allocated).

While the dynamic approach eliminates the possibility of unbounded copies, null-termination errors, and truncation, there is still the issue of improper data sanitization. The managed string library includes a setcharset() function to help solve this problem and eliminate vulnerabilities resulting from improper data sanitization.

Listing 6 shows how the managed string library supports data sanitization using the setcharset() function. The first call to strcreate_m() creates a managed string containing a "valid" set of characters to be used by the data abstraction. Once set by setcharset(), the managed string API guarantees that newly created strings contain only those characters included in the set of valid characters. This technique is called "white listing" and is the preferred approach for data sanitization [6]. The second call to strcreate_m() in Listing 6 succeeds because it only contains valid characters, while the third call fails (the character "d" is not in the set of valid characters).
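The underlying check can be expressed with the standard strspn() function, which returns the length of the leading run of characters drawn from a given set. The sketch below is only an approximation of the idea; the function name all_chars_allowed() is an assumption and this is not how setcharset() is implemented.

```c
#include <string.h>

/* Hypothetical check: returns 1 if every character of s appears in
 * allowed, 0 otherwise (including when either argument is null). */
int all_chars_allowed(const char *s, const char *allowed)
{
    if (s == NULL || allowed == NULL) {
        return 0;
    }
    /* strspn() stops at the first character outside the set; the string
     * is acceptable only if that stopping point is the terminator. */
    return s[strspn(s, allowed)] == '\0';
}
```

With a valid set of "abc", the string "abcabc" passes and "abcd" is rejected. The same test applied to an e-mail address before it reaches system() would reject any input containing shell metacharacters that are absent from the valid set.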

Conclusion

It is clearly possible to implement an alternate version of the standard C string library using a dynamic allocation approach that fairly closely approximates the syntax and semantics of the existing, standard API. Dynamically allocated memory will never be as efficient as static allocation by the compiler. However, performance should be adequate for most tasks, and programmers can resort to traditional C-style strings in performance-critical code sections. More work is necessary to determine whether this library will be broadly adopted by developers and embraced by the ISO/IEC JTC1/SC22/WG14 international standardization working group. Further empirical studies must also be performed to determine whether the managed string library effectively reduces the number of vulnerabilities introduced in C-language programming.

Acknowledgments

Thanks to my colleagues Dan Plakosh and John Robert, who have assisted with the specification and development of the managed string library. Thanks to Jason Rafail, Robert Mead, and William Fithen for reviewing this paper. Thanks also to Jeff Havrilla, Jeff Carpenter, Rich Pethia, and my team members in the CERT/CC.

References

[1] Antill, James. Vstr Documentation: Overview; http://www.and.org/vstr/.
[2] Messier, Matt and John Viega. "Safe C String Library v1.0.3," January 30, 2005; http://www.zork.org/safestr/.
[3] Wheeler, D. "Secure Programming for Linux and Unix HOWTO: Creating Secure Software," 2003; http://www.dwheeler.com/secure-programs/.
[4] Meyers, Randy. Specification for Safer, More Secure C Library Functions, ISO/IEC TR 24731, June 6, 2004.
[5] Miller, T. C. and T. de Raadt. "strlcpy and strlcat - Consistent, Safe, String Copy and Concatenation," 175-178, Proceedings of the FREENIX Track; http://www.usenix.org/publications/library/proceedings/usenix99/full_papers/millert/millert.pdf.
[6] Seacord, Robert C. Secure Coding in C and C++, Addison-Wesley, 2005. ISBN 0321335724.
[7] Viega, John and Matt Messier. Secure Programming Cookbook for C and C++: Recipes for Cryptography, Authentication, Networking, Input Validation & More, O'Reilly & Associates, 2003. ISBN 0596003943.

CUJ