The prior article in this series explained how the Swift and Clang compilers used llvm::SourceMgr to emit diagnostics for source locations in memory buffers, represented by the class llvm::MemoryBuffer . This article focuses on llvm::MemoryBuffer , the primary abstraction for reading files and streams into memory. Since it's used by Swift, Clang, and LLVM tools like llvm-tblgen , I found it valuable to understand how it works.

Reading a file into memory using C++

The documentation for libLLVMSupport's llvm::MemoryBuffer class says it "provides simple read-only access to a block of memory, and provides simple methods for reading files and standard input into a memory buffer." To better understand how it does that, I tried writing a simple C++ program, called read.cpp , that reads a file – itself, in this case – into memory. For simplicity's sake my program is only meant to operate on Unix systems.

My read.cpp program reads a file into memory by using various system calls. These are requests made to the operating system for things like "open a file and give me its file descriptor," or "read 8 bytes from the file with this file descriptor." Julia Evans has a wonderful comic that explains them further:

My read.cpp program uses four system calls:

- open(2), to get a file descriptor for the file.
- fstat, which returns information about a file descriptor. Specifically, read.cpp uses it to learn the file's size, and allocates memory based on that size.
- read(2), which reads a given number of bytes from a file into a pre-allocated block of memory.
- close(2), to close a file descriptor once I'm done using it.

Once the read.cpp program allocates memory and reads its own source file into that memory, it increments the char * pointer into the memory and prints out the first line of the file:

read.cpp

    #include <cerrno>
    #include <iostream>
    #include <new>
    #include <system_error>

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
      // I'll open this file itself and read it into memory.
      auto FileName = __FILE__;

      // The system call open(2) gets a file descriptor
      // representing the open file.
      int OpenFlags = O_RDONLY;
      int FD = open(FileName, OpenFlags);

      // open(2) returns a -1 if the file could not be opened.
      // In this case, print an error and return.
      if (FD < 0) {
        std::error_code Err(errno, std::generic_category());
        std::cerr << "[ERROR] Could not open file \""
                  << FileName << "\": " << Err.message()
                  << std::endl;
        return 1;
      }

      // Syscall fstat populates the struct stat pointer
      // with information about the given file descriptor,
      // including the file's size in bytes.
      struct stat Stat;
      if (fstat(FD, &Stat) < 0) {
        std::error_code Err(errno, std::generic_category());
        std::cerr << "[ERROR] Could not acquire information "
                  << "on file descriptor \"" << FD
                  << "\": " << Err.message() << std::endl;
        return 1;
      }

      off_t FileSize = Stat.st_size;
      std::cout << "[NOTE] File size: " << FileSize << " bytes"
                << std::endl;

      // Allocate memory in size equal to the number of bytes
      // in the file, plus one byte for a null terminator.
      char *Memory = static_cast<char *>(operator new(
          FileSize + 1, std::nothrow));
      // The nothrow form of operator new returns null on failure.
      if (!Memory) {
        std::cerr << "[ERROR] Could not allocate memory"
                  << std::endl;
        return 1;
      }
      Memory[FileSize] = 0;

      // Use syscall read(2) to read in bytes from the given
      // file descriptor, into the prepared buffer, 16 bytes
      // at a time.
      const ssize_t ChunkSize = 16;
      ssize_t Offset = 0;
      ssize_t ReadBytes = 0;
      do {
        ReadBytes = read(FD, Memory + Offset, ChunkSize);
        if (ReadBytes < 0) {
          std::error_code Err(errno, std::generic_category());
          std::cerr << "[ERROR] Could not read from file "
                       "descriptor \""
                    << FD << "\": " << Err.message()
                    << std::endl;
          operator delete(Memory);
          return 1;
        }
        Offset += ReadBytes;
      } while (ReadBytes != 0);

      // I've now read the file into memory. To demonstrate:
      std::cout << "[NOTE] Here's the first line "
                << "of the file: \"";
      char *Ptr = Memory;
      while (*Ptr != '\n' && *Ptr != '\0') {
        std::cout << *Ptr;
        ++Ptr;
      }
      std::cout << "\"" << std::endl;

      // Once I'm done with the file, I need to release the
      // memory I allocated, otherwise this is a memory leak.
      // Memory from operator new is released with operator delete.
      operator delete(Memory);

      // Finally, I need to close the open file descriptor,
      // using the system call close(2).
      if (close(FD) < 0) {
        std::error_code Err(errno, std::generic_category());
        std::cerr << "[ERROR] Could not close file "
                  << "descriptor \"" << FD << "\":"
                  << Err.message() << std::endl;
        return 1;
      }

      return 0;
    }

I can compile and run this program like so:

    clang++ read.cpp -o my-read-example
    ./my-read-example
    [NOTE] File size: 2820 bytes
    [NOTE] Here's the first line of the file: "#include <cerrno>"

This is a good initial implementation of reading a file into memory in C++. In fact, this is very similar to what the llvm::MemoryBuffer::getFile function does. However, there's room for improvement.

Reading a large file into memory using mmap(2)

Recall that we allocated memory on the heap using operator new , and then used the syscall read(2) to populate that memory with the contents of our file:

read.cpp

    char *Memory = static_cast<char *>(operator new(
        FileSize + 1, std::nothrow));
    Memory[FileSize] = 0;
    ...
    do {
      ReadBytes = read(FD, Memory + Offset, ChunkSize);
      ...
      Offset += ReadBytes;
    } while (ReadBytes != 0);

This allocation would be problematic if we had a huge file to read into memory. A file with a size of 1 gigabyte would result in 1 gigabyte of memory being allocated – that's a lot of RAM!

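Thankfully, the syscall mmap(2) avoids this up-front cost. It maps the file into the process's address space, and the operating system loads pages of the file only as they are accessed. Once again, Julia Evans explains it best with another great comic: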

I can modify the read.cpp program to use mmap(2) when reading from large files:

read.cpp

      #include <fcntl.h>
    + #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main() {
        ...
    +   // For "large" files over 1024 bytes in size, I'll use
    +   // syscall mmap(2).
    +   char *Memory = nullptr;
    +   bool UseMMap = (FileSize > 1024);
    +   if (UseMMap) {
    +     std::cout << "[NOTE] Using mmap" << std::endl;
    +     int ProtectedOptions = PROT_READ;
    +     int Flags = MAP_SHARED;
    +     Memory = static_cast<char *>(mmap(nullptr, FileSize,
    +                                       ProtectedOptions,
    +                                       Flags, FD, 0));
    +     if (Memory == MAP_FAILED) {
    +       std::error_code Err(errno, std::generic_category());
    +       std::cerr
    +           << "[ERROR] Could not mmap file descriptor \""
    +           << FD << "\": " << Err.message() << std::endl;
    +       // Don't fall through and read from an invalid pointer.
    +       return 1;
    +     }
    +   } else {
          ...
          // ...use operator new as before.
        }

        // I've now read the file into memory.
    +   // Note that this works exactly as before, we
    +   // don't have to worry about whether it's an mmap:
        std::cout << "[NOTE] Here's the first line "
                  << "of the file: \"";
        char *Ptr = Memory;
        while (*Ptr != '\n' && *Ptr != '\0') {
          std::cout << *Ptr;
          ++Ptr;
        }
        std::cout << "\"" << std::endl;

    +   if (UseMMap) {
    +     // Once I'm done with the mmap'ed region, I need to
    +     // release it.
    +     munmap(Memory, FileSize);
    +   } else {
          // Once I'm done with the file, I need to release the
          // memory I allocated, otherwise this is a memory leak.
          operator delete(Memory);
    +   }
        ...
        return 0;
      }

Compiling and running this program produces the exact same results as before, with the important distinction that this program can open even very large files, without allocating a ton of memory.

To experiment, you could try adding millions of lines of comments to the bottom of read.cpp . Flip the (FileSize > 1024); conditional to < in order to use operator new , and you'll allocate hundreds of megabytes of memory up front. Then flip it back, to use mmap(2) , and you'll allocate almost no memory.

For the most part, llvm::MemoryBuffer works the same way as the read.cpp program above. It has a few extra bells and whistles, too: it works on both Unix and Windows, it uses a more complex heuristic to decide whether to use mmap(2), and it uses some interesting syscalls and options on platforms that support them. I'll explain these as I write about it in detail below.

The LLVM implementation of read.cpp: llvm::MemoryBuffer::getFileOrSTDIN

Swift and Clang both use the llvm::MemoryBuffer::getFileOrSTDIN static member function to open input file arguments passed to them on the command-line. For example, below is the code in libswiftFrontend that converts the string filenames it was passed on the command-line into llvm::MemoryBuffer objects. The filename is a std::string stored as swift::InputFile::file.

swift/lib/Frontend/Frontend.cpp 315 std :: pair < std :: unique_ptr < llvm :: MemoryBuffer >, 316 std :: unique_ptr < llvm :: MemoryBuffer >> 317 CompilerInstance ::getInputBufferAndModuleDocBufferIfPresent( 318 const InputFile &input) { ... 326 using FileOrError = llvm :: ErrorOr < std :: unique_ptr < llvm :: MemoryBuffer >>; 327 FileOrError inputFileOrErr = llvm :: MemoryBuffer :: getFileOrSTDIN (input. file ()); 328 if (!inputFileOrErr) { 329 Diagnostics . diagnose ( SourceLoc (), diag :: error_open_input_file , input. file (), 330 inputFileOrErr. getError (). message ()); 331 return std :: make_pair ( nullptr , nullptr ); 332 } ... 342 }

As I wrote in the previous article, these llvm::MemoryBuffer objects will then be passed over to the llvm::SourceMgr, which takes ownership of them. The swift::Parser will then interact with llvm::SourceMgr (or more precisely, a wrapper called swift::SourceManager) in order to emit diagnostics at particular locations in the buffer.

The llvm::MemoryBuffer::getFileOrSTDIN function returns either a std::unique_ptr to an llvm::MemoryBuffer for the given file, or an error. This is represented by the llvm::ErrorOr type. (I'll write more about llvm::ErrorOr in the future, but in the meantime you can watch this 5-minute lightning talk from LLVM Developers Meeting 2016 to learn more about them.)
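To get a feel for the "value or error" idea before reading more about llvm::ErrorOr, here is a minimal sketch built on std::variant. It is not LLVM's implementation: the names ValueOrError and openByName are hypothetical, invented for this illustration, and the sketch assumes a C++17 compiler.

```cpp
#include <cassert>
#include <string>
#include <system_error>
#include <utility>
#include <variant>

// Hypothetical stand-in for the idea behind llvm::ErrorOr<T>: holds either
// a T or a std::error_code, never both.
template <typename T>
class ValueOrError {
  std::variant<std::error_code, T> Storage;

public:
  ValueOrError(std::error_code EC) : Storage(EC) {}
  ValueOrError(T Value) : Storage(std::move(Value)) {}

  // True when a value is present, so callers can write `if (!Result)`.
  explicit operator bool() const { return std::holds_alternative<T>(Storage); }

  // Returns the stored error, or a default "no error" code if a value is held.
  std::error_code getError() const {
    return *this ? std::error_code() : std::get<std::error_code>(Storage);
  }

  T &get() {
    assert(*this && "no value present");
    return std::get<T>(Storage);
  }
};

// Hypothetical helper: pretends to open a file, failing for an empty name.
inline ValueOrError<std::string> openByName(const std::string &Name) {
  if (Name.empty())
    return std::make_error_code(std::errc::no_such_file_or_directory);
  return "contents of " + Name;
}
```

A caller checks the result the same way the Swift frontend code above checks inputFileOrErr: test it as a boolean, then read either the value or the error.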

The getFileOrSTDIN function just checks for a file name of "-" and then delegates its logic to either llvm::MemoryBuffer::getSTDIN or getFile . It may optionally be given an int64_t FileSize argument, but if not the default value of -1 signals the function to find out on its own – just as my example read.cpp program above did, by using the fstat system call.

llvm/include/llvm/Support/MemoryBuffer.h 125 /// Open the specified file as a MemoryBuffer, or open stdin if the Filename 126 /// is "-". 127 static ErrorOr < std :: unique_ptr < MemoryBuffer >> 128 getFileOrSTDIN( const Twine &Filename, int64_t FileSize = -1 , 129 bool RequiresNullTerminator = true );

llvm/lib/Support/MemoryBuffer.cpp 143 ErrorOr < std :: unique_ptr < MemoryBuffer >> 144 MemoryBuffer ::getFileOrSTDIN( const Twine &Filename, int64_t FileSize, 145 bool RequiresNullTerminator) { 146 SmallString < 256 > NameBuf; 147 StringRef NameRef = Filename. toStringRef (NameBuf); 148 149 if (NameRef == "-" ) 150 return getSTDIN (); 151 return getFile (Filename, FileSize, RequiresNullTerminator); 152 }

I'll focus on the getFile case for now, which delegates in turn to a function called getFileAux . The getFileAux static function implements some of the logic I implemented in the read.cpp example above: it opens the file in order to obtain a file descriptor, it reads that file, and then it calls close(2) in order to close the file descriptor:

llvm/include/llvm/Support/MemoryBuffer.h 73 /// Open the specified file as a MemoryBuffer, returning a new MemoryBuffer 74 /// if successful, otherwise returning null. If FileSize is specified, this 75 /// means that the client knows that the file exists and that it has the 76 /// specified size. 77 /// 78 /// \param IsVolatile Set to true to indicate that the contents of the file 79 /// can change outside the user's control, e.g. when libclang tries to parse 80 /// while the user is editing/updating the file or if the file is on an NFS. 81 static ErrorOr < std :: unique_ptr < MemoryBuffer >> 82 getFile ( const Twine &Filename, int64_t FileSize = -1 , 83 bool RequiresNullTerminator = true , bool IsVolatile = false );

llvm/lib/Support/MemoryBuffer.cpp 229 ErrorOr < std :: unique_ptr < MemoryBuffer >> 230 MemoryBuffer ::getFile( const Twine &Filename, int64_t FileSize, 231 bool RequiresNullTerminator, bool IsVolatile) { 232 return getFileAux < MemoryBuffer >(Filename, FileSize, FileSize, 0 , 233 RequiresNullTerminator, IsVolatile); 234 } ... 242 template < typename MB> 243 static ErrorOr < std :: unique_ptr < MB >> 244 getFileAux( const Twine &Filename, int64_t FileSize, uint64_t MapSize, 245 uint64_t Offset, bool RequiresNullTerminator, bool IsVolatile) { 246 int FD; 247 std :: error_code EC = sys :: fs :: openFileForRead (Filename, FD); 248 249 if (EC) 250 return EC; 251 252 auto Ret = getOpenFileImpl < MB >(FD, Filename, FileSize, MapSize, Offset, 253 RequiresNullTerminator, IsVolatile); 254 close (FD); 255 return Ret; 256 }

Unlike read.cpp, the getFileAux function does not call the open(2) system call directly in order to obtain an open file descriptor for a given filename. Instead, it uses the llvm::sys::fs::openFileForRead function. This LLVM helper function, unlike open(2), works on both Windows and Unix platforms.

Per-platform implementations of system calls in LLVM

The llvm::sys::fs::openFileForRead function has a single declaration, in the header file FileSystem.h:

llvm/include/llvm/Support/FileSystem.h ... /// @brief Opens the file with the given name in a read-only mode, returning ... /// its open file descriptor. ... /// ... /// @param Name The name of the file to open. ... /// @param ResultFD The location to store the descriptor for the opened file. ... /// @param RealPath If nonnull, extra work is done to determine the real path ... /// of the opened file, and that path is stored in this ... /// location. ... /// @returns errc::success if \a Name has been opened, otherwise a ... /// platform-specific error_code. 822 std :: error_code openFileForRead( const Twine &Name, int &ResultFD, 823 SmallVectorImpl < char > *RealPath = nullptr );

But the LLVM codebase defines two separate implementations of this function: one that's used on Windows platforms, and another that's used on Unix. It accomplishes this using CMake.

I've found that a working knowledge of CMake is a gift that really keeps on giving when it comes to compiler development. If you haven't already, you can read about it more in my articles The Swift Compiler's Build System and Reading and Understanding the CMake in apple/swift.

LLVM's root CMakeLists.txt file appends two directories to its modules path, and then includes one file from each of those directories: llvm/cmake/config-ix.cmake and llvm/cmake/modules/HandleLLVMOptions.cmake . Finally, it configures a header file named config.h.cmake :

llvm/CMakeLists.txt

    set(CMAKE_MODULE_PATH
      ${CMAKE_MODULE_PATH}
      "${CMAKE_CURRENT_SOURCE_DIR}/cmake"
      "${CMAKE_CURRENT_SOURCE_DIR}/cmake/modules"
      )
    ...
    include(config-ix)
    ...
    include(HandleLLVMOptions)
    ...
    configure_file(
      ${LLVM_MAIN_INCLUDE_DIR}/llvm/Config/config.h.cmake
      ${LLVM_INCLUDE_DIR}/llvm/Config/config.h)

The config-ix.cmake file uses the built-in CMake function check_symbol_exists in order to determine which system calls are available in the target build environment. For example, it checks whether pread is available and, if it is, has CMake define a variable named HAVE_PREAD.

Then, in HandleLLVMOptions.cmake, LLVM uses the built-in CMake platform variables, WIN32 and UNIX, to set the CMake variables LLVM_ON_WIN32 and LLVM_ON_UNIX to 1 or 0:

llvm/cmake/modules/HandleLLVMOptions.cmake 108 if ( WIN32 ) ... 114 set (LLVM_ON_WIN32 1 ) 115 set (LLVM_ON_UNIX 0 ) ... 117 else ( WIN32 ) 118 if ( UNIX ) 119 set (LLVM_ON_WIN32 0 ) 120 set (LLVM_ON_UNIX 1 ) ... 129 endif ( WIN32 )

At this point, CMake variables like HAVE_PREAD and LLVM_ON_UNIX would only be visible from within CMake. To make their values visible in C++, the config.h.cmake file is configured via a call to the CMake built-in function configure_file , as shown in a code snippet above. The config.h.cmake file is full of #cmakedefine directives, which configure_file transforms into #define statements for consumption in C++. For example, config.h.cmake contains these #cmakedefine statements…

llvm/include/llvm/Config/config.h.cmake 142 /* Define to 1 if you have the `pread' function. */ 143 #cmakedefine HAVE_PREAD ${HAVE_PREAD} ... 311 /* Define if this is Unixish platform */ 312 #cmakedefine LLVM_ON_UNIX ${LLVM_ON_UNIX} 313 314 /* Define if this is Win32ish platform */ 315 #cmakedefine LLVM_ON_WIN32 ${LLVM_ON_WIN32}

…which on a Unix-like platform, such as macOS, are transformed into these statements, placed in a file in the build directory named include/llvm/Config/config.h :

build/include/llvm/Config/config.h

    /* Define to 1 if you have the `pread' function. */
    #define HAVE_PREAD 1
    ...
    /* Define if this is Unixish platform */
    #define LLVM_ON_UNIX 1

And in llvm/lib/Support/Path.cpp, instead of an implementation of the llvm::sys::fs::openFileForRead function, there's a conditional include based on these definitions:

llvm/lib/Support/Path.cpp 1072 // Include the truly platform-specific parts. 1073 #if defined(LLVM_ON_UNIX) 1074 #include "Unix/Path.inc" 1075 #endif 1076 #if defined(LLVM_ON_WIN32) 1077 #include "Windows/Path.inc" 1078 #endif

It's in the included llvm/lib/Support/Unix/Path.inc file that I can find the actual implementation of llvm::sys::fs::openFileForRead that's used on Unix platforms.
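The "one declaration, one definition per platform" pattern can be sketched in a single file. This is only an illustration: the SKETCH_ON_UNIX and SKETCH_ON_WIN32 macros and the pathSeparator function are hypothetical, and the macros are derived from a compiler-provided macro rather than from CMake, since a self-contained example has no generated config.h to include.

```cpp
#include <string>

// Hypothetical stand-ins for the macros a generated config.h would define.
// Here they come from a compiler-provided macro instead of CMake.
#if defined(_WIN32)
#define SKETCH_ON_WIN32 1
#else
#define SKETCH_ON_UNIX 1
#endif

// One declaration, visible to every caller...
std::string pathSeparator();

// ...and one definition per platform. LLVM keeps each definition in its own
// .inc file and #includes the matching one; this sketch inlines them instead.
#if defined(SKETCH_ON_UNIX)
std::string pathSeparator() { return "/"; }
#endif
#if defined(SKETCH_ON_WIN32)
std::string pathSeparator() { return "\\"; }
#endif
```

Callers only ever see the declaration, so they compile unchanged on either platform.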

Opening a file on Unix

As in the read.cpp example at the beginning of this article, the Unix implementation of the llvm::sys::fs::openFileForRead function uses the system call open(2) in order to open a file and get its file descriptor:

llvm/lib/Support/Unix/Path.inc 719 std :: error_code openFileForRead( const Twine &Name, int &ResultFD, 720 SmallVectorImpl < char > *RealPath) { 721 SmallString < 128 > Storage; 722 StringRef P = Name.toNullTerminatedStringRef(Storage); 723 int OpenFlags = O_RDONLY ; 724 #ifdef O_CLOEXEC 725 OpenFlags |= O_CLOEXEC ; 726 #endif 727 if ((ResultFD = sys :: RetryAfterSignal ( -1 , open , P. begin (), OpenFlags)) < 0 ) 728 return std :: error_code ( errno , std :: generic_category ()); 729 #ifndef O_CLOEXEC 730 int r = fcntl (ResultFD, F_SETFD , FD_CLOEXEC ); 731 ( void )r; 732 assert (r == 0 && "fcntl(F_SETFD, FD_CLOEXEC) failed" ); 733 #endif ... 758 return std :: error_code (); 759 }

The implementation above is long-winded because of two pieces of Unix trivia.

First off, instead of calling open(2) directly, it calls llvm::sys::RetryAfterSignal , which invokes open(2) in a while loop. This loop retries the open(2) call if it fails with an EINTR error code:

llvm/include/llvm/Support/Errno.h 33 template < typename FailT, typename Fun, typename ... Args> 34 inline auto RetryAfterSignal( const FailT &Fail, const Fun &F, 35 const Args &... As) -> decltype (F(As...)) { 36 decltype (F(As...)) Res; 37 do 38 Res = F(As...); 39 while (Res == Fail && errno == EINTR ); 40 return Res; 41 }

I'm not a C++ expert. In case you aren't either, allow me to offer an explanation for the templates being used in the code above.

The RetryAfterSignal function template has three template parameters: FailT, Fun, and the parameter pack Args. These give the types of its function parameters:

- const FailT &Fail, representing the value returned when the function call fails.
- const Fun &F, representing the callable function.
- A function parameter pack const Args &... As, representing the arguments passed to function F.

RetryAfterSignal uses the trailing return type syntax, of the form auto function -> return_type. Its return type is specified as decltype(F(As...)). In other words, the return type is the type of the expression F(As...).

To map this all to the concrete example we were looking at in llvm::sys::fs::openFileForRead, recall that function had the expression sys::RetryAfterSignal(-1, open, P.begin(), OpenFlags). Here -1 is the failure value const FailT &Fail, open is the function value const Fun &F, and P.begin() and OpenFlags are the parameter pack arguments, which are passed on to the open function. The return type is the type returned by open(P.begin(), OpenFlags), which is int.

The llvm::sys::RetryAfterSignal function ignores the EINTR and retries because "blocking" Unix functions like open(2) and read(2) return EINTR whenever they are interrupted by a Unix signal. Interruptions like this can occur for all sorts of reasons, some of which you can read more about here. In these cases, LLVM will simply try again.
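The retry behavior is easy to observe with a fake "system call". The sketch below re-types the helper in the same shape as the LLVM code quoted above, under the lowercase name retryAfterSignal, and pairs it with a hypothetical flakyCall function that pretends to be interrupted twice before succeeding. Neither name comes from LLVM.

```cpp
#include <cerrno>

// Same shape as the LLVM helper quoted above, re-typed so this example is
// self-contained. Calls F(As...) again for as long as it fails with EINTR.
template <typename FailT, typename Fun, typename... Args>
auto retryAfterSignal(const FailT &Fail, const Fun &F, const Args &...As)
    -> decltype(F(As...)) {
  decltype(F(As...)) Res;
  do
    Res = F(As...);
  while (Res == Fail && errno == EINTR);
  return Res;
}

// Hypothetical "system call" that is interrupted on its first two calls:
// it returns -1 with errno set to EINTR twice, and 42 from then on.
inline int flakyCall(int *Attempts) {
  if (++*Attempts <= 2) {
    errno = EINTR;
    return -1;
  }
  return 42;
}
```

Calling retryAfterSignal(-1, flakyCall, &Attempts) with Attempts starting at zero returns 42 and leaves Attempts at 3: two interrupted calls, then one that succeeds.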

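The other quirk in the llvm::sys::fs::openFileForRead implementation is the check for O_CLOEXEC, an open(2) flag that isn't available on every platform (on Linux, it was added in version 2.6.23). This flag has the OS automatically close the file descriptor when the process calls one of the exec functions, so the descriptor doesn't leak into a newly executed program. If the flag is not available, the implementation uses the syscall fcntl to set the equivalent FD_CLOEXEC flag after the file has been opened.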
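Here is a small sketch of that pattern for a Unix system, along with a way to check that the flag was actually set. The function names openCloseOnExec and isCloseOnExec are hypothetical, and error reporting is reduced to a -1 return value to keep the example short.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Opens Path read-only with the close-on-exec flag set, or returns -1.
// A sketch of the pattern only; error reporting is left to the caller.
inline int openCloseOnExec(const char *Path) {
  int Flags = O_RDONLY;
#ifdef O_CLOEXEC
  Flags |= O_CLOEXEC; // Set the flag atomically, as part of open(2).
#endif
  int FD = open(Path, Flags);
  if (FD < 0)
    return -1;
#ifndef O_CLOEXEC
  // Fallback: set the flag with a second call once the descriptor exists.
  if (fcntl(FD, F_SETFD, FD_CLOEXEC) < 0) {
    close(FD);
    return -1;
  }
#endif
  return FD;
}

// Reports whether the close-on-exec flag is set on an open descriptor.
inline bool isCloseOnExec(int FD) {
  int DescriptorFlags = fcntl(FD, F_GETFD);
  return DescriptorFlags >= 0 && (DescriptorFlags & FD_CLOEXEC) != 0;
}
```

The difference between the two branches is timing: with O_CLOEXEC there is no window between opening the file and setting the flag, whereas the fcntl fallback leaves a brief gap in which another thread could fork and exec.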

Reading the file into an llvm::WritableMemoryBuffer

The llvm::sys::fs::openFileForRead function opens a file and returns its file descriptor. Then control is returned back to the getFileAux function, which passes the open descriptor into the getOpenFileImpl static function:

llvm/lib/Support/MemoryBuffer.cpp 242 template < typename MB> 243 static ErrorOr < std :: unique_ptr < MB >> 244 getFileAux( const Twine &Filename, int64_t FileSize, uint64_t MapSize, 245 uint64_t Offset, bool RequiresNullTerminator, bool IsVolatile) { 246 int FD; 247 std :: error_code EC = sys :: fs :: openFileForRead (Filename, FD); 248 249 if (EC) 250 return EC; 251 252 auto Ret = getOpenFileImpl < MB >(FD, Filename, FileSize, MapSize, Offset, 253 RequiresNullTerminator, IsVolatile); 254 close (FD); 255 return Ret; 256 }

The getOpenFileImpl function implements the same logic as the read.cpp example at the beginning of this article. If the file's size was not provided, it finds out how large the file is by calling llvm::sys::fs::status, which on Unix calls fstat. It then decides whether to use mmap(2) or to allocate memory up front using operator new. If it allocates memory, then it uses the system call read(2) (or pread, if HAVE_PREAD is defined) in order to read the bytes of the file into memory:

llvm/lib/Support/MemoryBuffer.cpp 416 template < typename MB> 417 static ErrorOr < std :: unique_ptr < MB >> 418 getOpenFileImpl( int FD, const Twine &Filename, uint64_t FileSize, 419 uint64_t MapSize, int64_t Offset, bool RequiresNullTerminator, 420 bool IsVolatile) { 421 static int PageSize = sys :: Process :: getPageSize (); 422 423 // Default is to map the full file. 424 if (MapSize == uint64_t ( -1 )) { 425 // If we don't know the file size, use fstat to find out. fstat on an open 426 // file descriptor is cheaper than stat on a random path. 427 if (FileSize == uint64_t ( -1 )) { 428 sys :: fs :: file_status Status; 429 std :: error_code EC = sys :: fs :: status (FD, Status); 430 if (EC) 431 return EC; ... 441 FileSize = Status. getSize (); 442 } 443 MapSize = FileSize; 444 } 445 446 if ( shouldUseMmap (FD, FileSize, MapSize, Offset, RequiresNullTerminator, 447 PageSize, IsVolatile)) { 448 std :: error_code EC; 449 std :: unique_ptr < MB > Result( 450 new ( NamedBufferAlloc (Filename)) MemoryBufferMMapFile < MB >( 451 RequiresNullTerminator, FD, MapSize, Offset, EC)); 452 if (!EC) 453 return std :: move (Result); 454 } 455 456 auto Buf = WritableMemoryBuffer :: getNewUninitMemBuffer (MapSize, Filename); 457 if (!Buf) { 458 // Failed to create a buffer. The only way it can fail is if 459 // new(std::nothrow) returns 0. 460 return make_error_code ( errc :: not_enough_memory ); 461 } 462 463 char *BufPtr = Buf. get ()-> getBufferStart (); 464 465 size_t BytesLeft = MapSize; 466 #ifndef HAVE_PREAD 467 if (lseek(FD, Offset, SEEK_SET) == -1 ) 468 return std :: error_code ( errno , std :: generic_category ()); 469 #endif 470 471 while (BytesLeft) { 472 #ifdef HAVE_PREAD 473 ssize_t NumRead = sys :: RetryAfterSignal ( -1 , :: pread , FD, BufPtr, BytesLeft, 474 MapSize - BytesLeft + Offset); 475 #else 476 ssize_t NumRead = sys :: RetryAfterSignal ( -1 , :: read , FD, BufPtr, BytesLeft); 477 #endif 478 if (NumRead == -1 ) { 479 // Error while reading. 
480 return std :: error_code ( errno , std :: generic_category ()); 481 } 482 if (NumRead == 0 ) { 483 memset (BufPtr, 0 , BytesLeft); // zero-initialize rest of the buffer. 484 break ; 485 } 486 BytesLeft -= NumRead; 487 BufPtr += NumRead; 488 } 489 490 return std :: move (Buf); 491 }
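The read loop at the end of that function can be tried out in isolation. The following is a simplified sketch for Unix systems that have pread, not a copy of LLVM's code: the names readFully and readRange are hypothetical, the EINTR retry is written inline rather than through RetryAfterSignal, and a short read at end of file is reported to the caller rather than zero-filled.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Reads up to Size bytes from FD, starting at file offset Offset, into Buf.
// Returns the number of bytes read (less than Size only at end of file), or
// -1 with errno set if pread fails. Retries when interrupted by a signal.
inline ssize_t readFully(int FD, char *Buf, std::size_t Size, off_t Offset) {
  std::size_t Done = 0;
  while (Done < Size) {
    ssize_t NumRead =
        pread(FD, Buf + Done, Size - Done, Offset + static_cast<off_t>(Done));
    if (NumRead < 0) {
      if (errno == EINTR)
        continue; // Interrupted by a signal: try the same read again.
      return -1;
    }
    if (NumRead == 0)
      break; // End of file reached before Size bytes were available.
    Done += static_cast<std::size_t>(NumRead);
  }
  return static_cast<ssize_t>(Done);
}

// Hypothetical convenience wrapper: returns up to Size bytes of the file at
// Path, starting at Offset, or an empty string if the file can't be read.
inline std::string readRange(const char *Path, std::size_t Size, off_t Offset) {
  int FD = open(Path, O_RDONLY);
  if (FD < 0)
    return std::string();
  std::string Result(Size, '\0');
  ssize_t NumRead = readFully(FD, &Result[0], Size, Offset);
  close(FD);
  Result.resize(NumRead < 0 ? 0 : static_cast<std::size_t>(NumRead));
  return Result;
}
```

Because pread takes an explicit offset, it never moves the descriptor's file position, which is why the LLVM code only needs the lseek call when pread is unavailable.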

The functions llvm::sys::Process::getPageSize and llvm::sys::fs::status above use the same CMake tricks as llvm::sys::fs::openFileForRead did in order to include a platform-specific implementation: getPageSize is implemented in llvm/lib/Support/Unix/Process.inc and Windows/Process.inc , and status is implemented in Unix/Path.inc and Windows/Path.inc . On Unix they use system calls getpagesize and fstat in order to get the information they need from the operating system.

The code above instantiates either an llvm::MemoryBufferMMapFile or an llvm::WritableMemoryBuffer based on whether the helper function shouldUseMmap returns true or false. As in the read.cpp example at the beginning of this article, one criterion for that decision is the size of the file: if it's smaller than a page on the system, or smaller than 16 kilobytes, then mmap(2) is not used:

llvm/lib/Support/MemoryBuffer.cpp 308 static bool shouldUseMmap ( int FD, 309 size_t FileSize, 310 size_t MapSize, 311 off_t Offset, 312 bool RequiresNullTerminator, 313 int PageSize, 314 bool IsVolatile) { ... 321 // We don't use mmap for small files because this can severely fragment our 322 // address space. 323 if (MapSize < 4 * 4096 || MapSize < ( unsigned )PageSize) 324 return false ; ... 360 return true ; 361 }
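To make the size threshold concrete, here is the size check on its own, as a hypothetical function named bigEnoughToMap. It covers only the one condition shown in the excerpt above; the real shouldUseMmap weighs other conditions too, which are elided there and not reproduced here.

```cpp
#include <cstddef>

// Hypothetical, simplified version of the size check alone: map only when
// the region is at least 16 KiB long and at least one page long.
inline bool bigEnoughToMap(std::size_t MapSize, std::size_t PageSize) {
  const std::size_t MinMapSize = 4 * 4096; // 16384 bytes
  return MapSize >= MinMapSize && MapSize >= PageSize;
}
```

By this check, the 2820-byte read.cpp from earlier would not be mapped on a system with 4096-byte pages, while a file of 16384 bytes or more would pass the size test.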

Assuming mmap(2) is not used, then the getOpenFileImpl function calls the static function llvm::WritableMemoryBuffer::getNewUninitMemBuffer . This function allocates the buffer memory just as the read.cpp example did, by using operator new . Unlike the read.cpp example program, however, this function not only allocates memory for a buffer to store the file's contents, it also allocates space for an instance of the llvm::MemoryBuffer class, and for the name of the file:

llvm/lib/Support/MemoryBuffer.cpp

    std::unique_ptr<WritableMemoryBuffer>
    WritableMemoryBuffer::getNewUninitMemBuffer(size_t Size, const Twine &BufferName) {
      using MemBuffer = MemoryBufferMem<WritableMemoryBuffer>;
      // Allocate space for the MemoryBuffer, the data and the name. It is important
      // that MemoryBuffer and data are aligned so PointerIntPair works with them.
      ...
      SmallString<256> NameBuf;
      StringRef NameRef = BufferName.toStringRef(NameBuf);
      size_t AlignedStringLen = alignTo(sizeof(MemBuffer) + NameRef.size() + 1, 16);
      size_t RealLen = AlignedStringLen + Size + 1;
      char *Mem = static_cast<char *>(operator new(RealLen, std::nothrow));
      if (!Mem)
        return nullptr;

      // The name is stored after the class itself.
      CopyStringRef(Mem + sizeof(MemBuffer), NameRef);

      // The buffer begins after the name and must be aligned.
      char *Buf = Mem + AlignedStringLen;
      Buf[Size] = 0; // Null terminate buffer.

      auto *Ret = new (Mem) MemBuffer(StringRef(Buf, Size), true);
      return std::unique_ptr<WritableMemoryBuffer>(Ret);
    }

Based on the code above, I can see that the memory that's being allocated here is laid out in three distinct segments:

- The first segment of memory allocated is sized such that an instance of llvm::MemoryBufferMem<llvm::WritableMemoryBuffer> could fit within it. Note that the size is calculated using sizeof(MemBuffer), and then the memory buffer is instantiated by calling new (Mem) MemBuffer(...). As I mentioned in my article on Getting Started with the Swift Frontend: Lexing & Parsing, this is a "placement" new operator call. It doesn't allocate any memory, and instead calls the MemBuffer constructor to construct the instance in the chunk of memory Mem. (You can read more about "placement new" here.)
- The second segment of memory stores the name of the buffer. It's sized using the call to NameRef.size() above, plus one byte for a null terminator, and then the name is copied by calling the static helper function CopyStringRef. The combined size of the first two segments is rounded up to a multiple of 16 so that the segment that follows is aligned.
- Finally comes the rest of the buffer, which is the same size as the file being read into it, plus one byte for a null terminator.
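The three-segment layout can be reproduced in a few dozen lines. The sketch below is an illustration of the technique rather than LLVM's code: the TailBuffer struct and its members are hypothetical, and the rounding is written out by hand instead of using LLVM's alignTo.

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <string>

// Hypothetical header object. The name and the data live after it in the
// same allocation, so they are reached by pointer arithmetic, not members.
struct TailBuffer {
  std::size_t NameLen;
  std::size_t DataLen;

  // Offset of the data: header + name + NUL, rounded up to a multiple of 16.
  static std::size_t dataOffset(std::size_t NameLen) {
    std::size_t Unaligned = sizeof(TailBuffer) + NameLen + 1;
    return (Unaligned + 15) / 16 * 16;
  }

  const char *name() const {
    return reinterpret_cast<const char *>(this) + sizeof(TailBuffer);
  }
  char *data() { return reinterpret_cast<char *>(this) + dataOffset(NameLen); }

  // One allocation holds [TailBuffer][name + NUL][padding][data + NUL].
  static TailBuffer *create(const std::string &Name, std::size_t Size) {
    std::size_t Offset = dataOffset(Name.size());
    void *Mem = ::operator new(Offset + Size + 1, std::nothrow);
    if (!Mem)
      return nullptr;
    // Placement new: construct the header at the start of Mem.
    TailBuffer *Buf = new (Mem) TailBuffer{Name.size(), Size};
    std::memcpy(static_cast<char *>(Mem) + sizeof(TailBuffer), Name.c_str(),
                Name.size() + 1);
    Buf->data()[Size] = '\0'; // Null terminate the data segment.
    return Buf;
  }

  // Release with plain operator delete, because the allocation is larger
  // than sizeof(TailBuffer).
  static void destroy(TailBuffer *Buf) {
    Buf->~TailBuffer();
    ::operator delete(Buf);
  }
};
```

With a 16-byte header and the 8-character name "read.cpp", the data segment of this sketch starts 32 bytes into the allocation on a typical 64-bit system.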

The memory buffer allocated and returned by the llvm::WritableMemoryBuffer::getNewUninitMemBuffer function is an llvm::MemoryBufferMem<llvm::WritableMemoryBuffer> . MemoryBufferMem<T> is defined as a subclass of T . In this case, T is an llvm::WritableMemoryBuffer , which in turn derives from llvm::MemoryBuffer . The constructor of MemoryBufferMem calls through to llvm::MemoryBuffer::init :

llvm/lib/Support/MemoryBuffer.cpp 83 /// MemoryBufferMem - Named MemoryBuffer pointing to a block of memory. 84 template < typename MB> 85 class MemoryBufferMem : public MB { 86 public: 87 MemoryBufferMem( StringRef InputData, bool RequiresNullTerminator) { 88 MemoryBuffer :: init (InputData. begin (), InputData. end (), 89 RequiresNullTerminator); 90 } 91 92 /// Disable sized deallocation for MemoryBufferMem, because it has 93 /// tail-allocated data. 94 void operator delete( void *p) { :: operator delete (p); } ... 104 };

And the llvm::MemoryBuffer::init function simply sets private members pointing to the beginning and end of the buffer:

llvm/include/llvm/Support/MemoryBuffer.h 42 class MemoryBuffer { 43 const char *BufferStart; // Start of the buffer. 44 const char *BufferEnd; // End of the buffer. .. 154 };

llvm/lib/Support/MemoryBuffer.cpp 44 /// init - Initialize this MemoryBuffer as a reference to externally allocated 45 /// memory, memory that we know is already null terminated. 46 void MemoryBuffer ::init( const char *BufStart, const char *BufEnd, 47 bool RequiresNullTerminator) { 48 assert ((!RequiresNullTerminator || BufEnd[ 0 ] == 0 ) && 49 "Buffer is not null terminated!" ); 50 BufferStart = BufStart; 51 BufferEnd = BufEnd; 52 }

In summary, on a Unix system:

- The llvm::MemoryBuffer::getFileOrSTDIN static function checks whether it's been given a filename of "-" and, if it has, calls llvm::MemoryBuffer::getSTDIN. Otherwise, it calls llvm::MemoryBuffer::getFile.
- llvm::MemoryBuffer::getFile calls through to getFileAux.
- getFileAux gets an open file descriptor by calling llvm::sys::fs::openFileForRead, then calls getOpenFileImpl to instantiate a new llvm::MemoryBuffer and read in the contents of the file, and finally calls close(2) in order to close the file descriptor.
- getOpenFileImpl checks the file size and determines whether to use mmap(2).
- If mmap(2) is not used, then getOpenFileImpl allocates memory for an llvm::MemoryBuffer, its name, and its contents. It then reads in the contents of the file using read(2) or pread, depending on what's available on the operating system.

Mapping the file into an llvm::MemoryBufferMMapFile

Recall that getOpenFileImpl instantiates an llvm::MemoryBufferMMapFile if shouldUseMmap returns true:

llvm/lib/Support/MemoryBuffer.cpp 416 template < typename MB> 417 static ErrorOr < std :: unique_ptr < MB >> 418 getOpenFileImpl( int FD, const Twine &Filename, uint64_t FileSize, 419 uint64_t MapSize, int64_t Offset, bool RequiresNullTerminator, 420 bool IsVolatile) { ... 446 if ( shouldUseMmap (FD, FileSize, MapSize, Offset, RequiresNullTerminator, 447 PageSize, IsVolatile)) { 448 std :: error_code EC; 449 std :: unique_ptr < MB > Result( 450 new ( NamedBufferAlloc (Filename)) MemoryBufferMMapFile < MB >( 451 RequiresNullTerminator, FD, MapSize, Offset, EC)); 452 if (!EC) 453 return std :: move (Result); 454 } 455 456 auto Buf = WritableMemoryBuffer :: getNewUninitMemBuffer (MapSize, Filename); ... 490 return std :: move (Buf); 491 }

The llvm::MemoryBufferMMapFile class makes use of the llvm::sys::fs::mapped_file_region class, a wrapper around the mmap(2) and munmap system calls:

llvm/lib/Support/MemoryBuffer.cpp 166 /// \brief Memory maps a file descriptor using sys::fs::mapped_file_region. 167 /// 168 /// This handles converting the offset into a legal offset on the platform. 169 template < typename MB> 170 class MemoryBufferMMapFile : public MB { 171 sys :: fs :: mapped_file_region MFR; ... 185 public: 186 MemoryBufferMMapFile( bool RequiresNullTerminator, int FD, uint64_t Len, 187 uint64_t Offset, std :: error_code &EC) 188 : MFR (FD, MB :: Mapmode , getLegalMapSize (Len, Offset), 189 getLegalMapOffset (Offset), EC) { 190 if (!EC) { 191 const char *Start = getStart (Len, Offset); 192 MemoryBuffer :: init (Start, Start + Len, RequiresNullTerminator); 193 } 194 } ... 208 };

The mapped_file_region constructor calls mapped_file_region::init , which calls mmap(2) . Its destructor calls munmap :

llvm/lib/Support/Unix/Path.inc 597 std :: error_code mapped_file_region::init( int FD, uint64_t Offset, 598 mapmode Mode) { ... 623 Mapping = :: mmap ( nullptr , Size, prot, flags, FD, Offset); 624 if (Mapping == MAP_FAILED ) 625 return std :: error_code ( errno , std :: generic_category ()); 626 return std :: error_code (); 627 } 628 629 mapped_file_region ::mapped_file_region( int fd, mapmode mode, size_t length, 630 uint64_t offset, std :: error_code &ec) 631 : Size (length), Mapping (), FD (fd), Mode (mode) { ... 634 ec = init (fd, offset, mode); 635 if (ec) 636 Mapping = nullptr ; 637 } 638 639 mapped_file_region ::~mapped_file_region() { 640 if (Mapping) 641 :: munmap (Mapping, Size); 642 }

What I learned

Looking into llvm::MemoryBuffer and how LLVM reads source files into memory taught me a lot:

- At build time LLVM's CMake code determines which platform it's being built for. Based on this, it includes Unix- or Windows-specific implementations, such as llvm/lib/Support/Unix/Path.inc or Windows/Path.inc.
- Also at build time LLVM's CMake code determines which system calls are available on the target platform. For example, if pread is available, then getOpenFileImpl will use pread to read the file into an llvm::WritableMemoryBuffer, instead of read(2).
- I can use mmap(2) to access the contents of a very large file without allocating a large amount of memory. LLVM's shouldUseMmap function references the file size, among other characteristics, to determine whether to use pre-allocated memory with llvm::WritableMemoryBuffer, or mmap(2) with llvm::MemoryBufferMMapFile.
- llvm::MemoryBuffer maintains a buffer for the contents of a source file as a "trailing object": a block of memory that is allocated when the class is constructed, but is not a member of the class itself. LLVM uses this trailing object pattern extensively. (It even defines an llvm::TrailingObjects class template, which I plan on writing more about in the future.)

If you enjoyed this article and would like to read more like it, please consider supporting me on Patreon. I wouldn't be able to write these articles were it not for the support I receive.