The Background

The Tracer Project

One of the frustrations I had with existing Python profilers was that there was no easy or efficient means to filter or exclude trace information based on the module name of the code being executed. I tackled this in my tracer project, which allows you to set an environment variable named TRACER_MODULE_NAMES to restrict which modules should be traced, e.g.:

set TRACER_MODULE_NAMES=myproject1;myproject2;myproject3.subproject;numpy;pandas;scipy

If the code being executed is coming from the module myproject3.subproject.foo , then we need to trace it, as that string prefix matches the third entry on our list.

This article details the custom data structure and algorithm I came up with in order to try and solve the prefix matching problem more optimally with a SIMD approach. The resulting StringTable component is used extensively within the tracer project, and as such, must conform to unique constraints such as no use of the C runtime library and allocating all memory through TraceStore-backed allocators. Thus, it's not really something you'd drop in to your current project in its current form. Hopefully, the article still proves to be interesting.

Note: the code samples provided herein are copied directly from the tracer project, which is written in C and assembly, and uses the Pascal-esque Cutler Normal Form style for C. If you're used to the more UNIX-style Kernel Normal Form of C, it's quite like that, except that it's absolutely nothing like that, and all these code samples will probably be very jarring.

Baseline C Implementation

The simplest way of solving this in C is to have an array of C strings (i.e. NULL terminated byte arrays), then for each string, loop through byte by byte and see if it prefix matches the search string.

Baseline (Cutler Normal Form)

Baseline (Kernel Normal Form) // // Declare a set of module names to be used as a string array. // const PCSZ ModuleNames[] = { "myproject1", "myproject2", "myproject3.subproject", "numpy", "pandas", "scipy", NULL, }; // // Define the function pointer typedef. // typedef STRING_TABLE_INDEX (IS_PREFIX_OF_CSTR_IN_ARRAY)( _In_ PCSZ *StringArray, _In_ PCSZ String, _Out_opt_ PSTRING_MATCH Match ); typedef IS_PREFIX_OF_CSTR_IN_ARRAY *PIS_PREFIX_OF_CSTR_IN_ARRAY; // // Forward declaration. // IS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray; _Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfCStrInArray( PCSZ *StringArray, PCSZ String, PSTRING_MATCH Match ) { PCSZ Left; PCSZ Right; PCSZ *Target; ULONG Index = 0; ULONG Count; for (Target = StringArray; *Target != NULL; Target++, Index++) { Count = 0; Left = String; Right = *Target; while (*Left && *Right && *Left++ == *Right++) { Count++; } if (Count > 0 && !*Right) { if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Index; Match->NumberOfMatchedCharacters = (BYTE)Count; Match->String = NULL; } return (STRING_TABLE_INDEX)Index; } } return NO_MATCH_FOUND; } const char *module_names[] = { "myproject1", "myproject2", "myproject3.subproject", "numpy", "pandas", "scipy", 0, }; struct string_match { /* Index of the match. */ unsigned char index; /* Number of characters matched. */ unsigned char number_of_chars_matched; /* Pad out to an 8-byte boundary. */ unsigned short padding[3]; /* Pointer to the string that was matched. */ char *str; }; unsigned char is_prefix_of_c_str_in_array(const char **array, const char *str, struct string_match *match) { char *left, *right, **target; unsigned int c, i = 0; for (target = array; target; target++, i++) { c = 0; left = str; right *target; while (*left && *right && *left++ == *right) { c++; } if (c > 0 && !*right) { if (match) { match->index = i; match->chars_matched = c; match->str = target[i]; } return i; } } return -1; }

Another type of code pattern that the string table attempts to replace is anything that does a lot of if/else if/else if-type string comparisons to look for keywords. For example, in the Quake III source, there's some symbol/string processing logic that looks like this:

// call instructions reset currentArgOffset if ( !strncmp( token, "CALL", 4 ) ) { EmitByte( &segment[CODESEG], OP_CALL ); instructionCount++; currentArgOffset = 0; return; } // arg is converted to a reversed store if ( !strncmp( token, "ARG", 3 ) ) { EmitByte( &segment[CODESEG], OP_ARG ); instructionCount++; if ( 8 + currentArgOffset >= 256 ) { CodeError( "currentArgOffset >= 256" ); return; } EmitByte( &segment[CODESEG], 8 + currentArgOffset ); currentArgOffset += 4; return; } // ret just leaves something on the op stack if ( !strncmp( token, "RET", 3 ) ) { EmitByte( &segment[CODESEG], OP_LEAVE ); instructionCount++; EmitInt( &segment[CODESEG], 8 + currentLocals + currentArgs ); return; } // pop is needed to discard the return value of // a function if ( !strncmp( token, "pop", 3 ) ) { EmitByte( &segment[CODESEG], OP_POP ); instructionCount++; return; } ...

An example of using the string table approach for this problem is discussed in the Other Applications section.

The Proposed Interface

Let's take a look at the interface we're proposing, the IsPrefixOfStringInTable function, that this article is based upon:

// // Our string table index is simply a char, with -1 indicating no match found. // typedef CHAR STRING_TABLE_INDEX; #define NO_MATCH_FOUND -1 typedef STRING_TABLE_INDEX (IS_PREFIX_OF_STRING_IN_TABLE)( _In_ PSTRING_TABLE StringTable, _In_ PSTRING String, _Out_opt_ PSTRING_MATCH StringMatch ); typedef IS_PREFIX_OF_STRING_IN_TABLE *PIS_PREFIX_OF_STRING_IN_TABLE; IS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable; _Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfStringInTable( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match ) /*++ Routine Description: Searches a string table to see if any strings "prefix match" the given search string. That is, whether any string in the table "starts with or is equal to" the search string. Arguments: StringTable - Supplies a pointer to a STRING_TABLE struct. String - Supplies a pointer to a STRING struct that contains the string to search for. Match - Optionally supplies a pointer to a variable that contains the address of a STRING_MATCH structure. This will be populated with additional details about the match if a non-NULL pointer is supplied. Return Value: Index of the prefix match if one was found, NO_MATCH_FOUND if not. --*/

All implementations discussed in this article adhere to that function signature. The STRING_TABLE structure will be discussed shortly.

The STRING_MATCH structure is used to optionally communicate information about the prefix match back to the caller. The index and characters matched fields are often very useful when using the string table for text parsing; see the other applications section below for an example.

The structure is defined as follows:

// // This structure is used to communicate matches back to the caller. // typedef struct _STRING_MATCH { // // Index of the match. // BYTE Index; // // Number of characters matched. // BYTE NumberOfMatchedCharacters; // // Pad out to 8-bytes. // USHORT Padding[3]; // // Pointer to the string that was matched. The underlying buffer will // stay valid for as long as the STRING_TABLE struct persists. // PSTRING String; } STRING_MATCH, *PSTRING_MATCH, **PPSTRING_MATCH; C_ASSERT(sizeof(STRING_MATCH) == 16);

The Test Data

Instead of using some arbitrary Python module names, this article is going to focus on a string table constructed out of a set of 16 strings that represent reserved names of the NTFS file system, at least when it was first released way back in the early 90s.

This list is desirable as it has good distribution of characters, there is a good mix of both short and long entries, plus one oversized one ($INDEX_ALLOCATION, which clocks in at 17 characters), and almost all strings lead with a common character (the dollar sign), preventing a simple first character optimization used by the initial version of the StringTable component I wrote in 2016.

So the scenario we'll be emulating, in this case, is that we've just been passed a filename for creation, and we need to check if it prefix matches any of the reserved names.

Here's the full list of NTFS names we'll be using. We're assuming 8-bit ASCII encoding (no UTF-8) and case sensitive. (If this were actually the NT kernel, we'd need to use wide characters with UTF-16 enconding, and be case-insensitive.)

NTFS Reserved Names

$AttrDef

$BadClus

$Bitmap

$Boot

$Extend

$LogFile

$MftMirr

$Mft

$Secure

$UpCase

$Volume

$Cairo

$INDEX_ALLOCATION

$DATA

????

.

The ordering is important in certain cases. For example, when you have overlapping strings, such as $MftMirr, and $Mft, you should put the longest strings first. They will be matched first, and as our routine terminates upon the first successful prefix match — if a longer string resided after a shorter one, it would never get detected.

Let's review some guiding design requirements and cover some of the design decisions I made, which should help shape your understanding of the implementation.

Requirements and Design Decisions

The STRING struct will be used to capture incoming search strings as well as the representation of any strings registered in the table (or more accurately, in the corresponding StringArray structure associated with the string table.

// // The STRING structure used by the NT kernel. Our STRING_ARRAY structure // relies on an array of these structures. We never pass raw 'char *'s // around, only STRING/PSTRING structs/pointers. // typedef struct _STRING { USHORT Length; USHORT MaximumLength; ULONG Padding; PCHAR Buffer; } STRING, *PSTRING; typedef const STRING *PCSTRING;

The design should optimize for string lengths less than or equal to 16. Lengths greater than 16 are permitted, up to 128 bytes, but they incur more overhead during the prefix lookup.

The design should prioritize the fast-path code where there is no match for a given search string. Being able to terminate the search as early as possible is ideal.

The performance hits taken by unaligned data access are non-nelgible, especially when dealing with XMM/YMM loads. Pay special care to alignment constrants and make sure that everything under our control is aligned on a suitable boundary. (The only thing we can't really control in the real world is the alignment of the incoming search string buffer, which will often be at undesirable alignments like 2, 4, 6, etc. Our test program explicitly aligns the incoming search strings on 32-byte boundaries to avoid the penalties associated with unaligned access.)

The string table is geared toward a single-shot build. Once you've created it with a given string array or used a delimited environment variable, that's it. There are no AddString() or RemoveString() routines. The order you provided the strings in will be the same order the table uses — no re-ordering will be done. Thus, for prefix matching purposes, if two strings share a common prefix, the longer one should go first, as the prefix search routine will check it first.

Only single matches are performed; the first match that qualifies as a prefix match (target string in table had length less than or equal to the search string, and all of its characters matched). There is no support for obtaining multiple matches — if you've constructed your string tables properly (no duplicate or incorrectly-ordered overlapping fields), you shouldn't need to.

So, to summarise, the design guidelines are as follows.

Prioritize fast-path exit in the non-matched case. (I refer to this as negative matching in a lot of places.)

in a lot of places.) Optimize for up to 16 string slots, where each slot has up to 16 characters, ideally. It can have up to 128 in total, however, any bytes outside of the first sixteen live in the string array structure supporting the string table (accessible via pStringArray).

If a slot is longer than 16 characters, optimize for the assumption that it won't be *that* much longer. i.e. assume a string of length 18 bytes is more common than 120 bytes.

The Data Structures

The primary data structure employed by this solution is the STRING_TABLE structure. It is composed of supporting structures: STRING_SLOT, SLOT_INDEX and SLOT_LENGTH, and either embeds or points to the originating STRING_ARRAY structure from which it was created.

Let's review the STRING_TABLE (view on GitHub) structure first and then touch on the supporting structures.

STRING_TABLE

C - Cutler Normal Form

C - Kernel Normal Form

MASM // // The STRING_TABLE struct is an optimized structure for testing whether a // prefix entry for a string is in a table, with the expectation that the // strings being compared will be relatively short (ideally <= 16 characters), // and the table of string prefixes to compare to will be relatively small // (ideally <= 16 strings). // // The overall goal is to be able to prefix match a string with the lowest // possible (amortized) latency. Fixed-size, memory-aligned character arrays, // and SIMD instructions are used to try and achieve this. // typedef struct _STRING_TABLE { // // A slot where each individual element contains a uniquely-identifying // letter, with respect to the other strings in the table, of each string // in an occupied slot. // STRING_SLOT UniqueChars; // // (16 bytes consumed.) // // // For each unique character identified above, the following structure // captures the 0-based index of that character in the underlying string. // This is used as an input to vpshufb to rearrange the search string's // characters such that it can be vpcmpeqb'd against the unique characters // above. // SLOT_INDEX UniqueIndex; // // (32 bytes consumed.) // // // Length of the underlying string in each slot. // SLOT_LENGTHS Lengths; // // (48 bytes consumed, aligned at 16 bytes.) // // // Pointer to the STRING_ARRAY associated with this table, which we own // (we create it and copy the caller's contents at creation time and // deallocate it when we get destroyed). // // N.B. We use pStringArray here instead of StringArray because the // latter is a field name at the end of the struct. // // PSTRING_ARRAY pStringArray; // // (56 bytes consumed, aligned at 8 bytes.) // // // String table flags. // STRING_TABLE_FLAGS Flags; // // (60 bytes consumed, aligned at 4 bytes.) // // // A 16-bit bitmap indicating which slots are occupied. // USHORT OccupiedBitmap; // // A 16-bit bitmap indicating which slots have strings longer than 16 chars. // USHORT ContinuationBitmap; // // (64 bytes consumed, aligned at 64 bytes.) // // // The 16-element array of STRING_SLOT structs. We want this to be aligned // on a 64-byte boundary, and it consumes 256-bytes of memory. // STRING_SLOT Slots[16]; // // (320 bytes consumed, aligned at 64 bytes.) // // // We want the structure size to be a power of 2 such that an even number // can fit into a 4KB page (and reducing the likelihood of crossing page // boundaries, which complicates SIMD boundary handling), so we have an // extra 192-bytes to play with here. The CopyStringArray() routine is // special-cased to allocate the backing STRING_ARRAY structure plus the // accommodating buffers in this space if it can fit. // // (You can test whether or not this occurred by checking the invariant // `StringTable->pStringArray == &StringTable->StringArray`, if this // is true, the array was allocated within this remaining padding space.) // union { STRING_ARRAY StringArray; CHAR Padding[192]; }; } STRING_TABLE, *PSTRING_TABLE, **PPSTRING_TABLE; // // Assert critical size and alignment invariants at compile time. // C_ASSERT(FIELD_OFFSET(STRING_TABLE, UniqueIndex) == 16); C_ASSERT(FIELD_OFFSET(STRING_TABLE, Lengths) == 32); C_ASSERT(FIELD_OFFSET(STRING_TABLE, pStringArray) == 48); C_ASSERT(FIELD_OFFSET(STRING_TABLE, Slots) == 64); C_ASSERT(FIELD_OFFSET(STRING_TABLE, Padding) == 320); C_ASSERT(sizeof(STRING_TABLE) == 512); struct string_table { char unique_chars[16]; unsigned char unique_index[16]; unsigned char slot_lengths[16]; struct string_array *string_array_ptr; struct string_table_flags flags; unsigned short occupied_bitmap; unsigned short continuation_bitmap; char slots[16][16]; union { struct string_array string_array; char padding[184]; } u; }; STRING_TABLE struct UniqueChars CHAR 16 dup (?) UniqueIndex BYTE 16 dup (?) Lengths BYTE 16 dup (?) pStringArray PSTRING_ARRAY ? Flags ULONG ? OccupiedBitmap USHORT ? ContinuationBitmap USHORT ? Slots STRING_SLOT 16 dup ({ }) union StringArray STRING_ARRAY {?} Padding CHAR 192 dup (?) ends STRING_TABLE ends ; ; Assert our critical field offsets and structure size as per the same approach ; taken in StringTable.h. ; .erre (STRING_TABLE.UniqueIndex eq 16), @CatStr(<UnexpectedOffset STRING_TABLE.UniqueIndex: >, %(STRING_TABLE.UniqueIndex)) .erre (STRING_TABLE.Lengths eq 32), @CatStr(<UnexpectedOffset STRING_TABLE.Lengths: >, %(STRING_TABLE.Lengths)) .erre (STRING_TABLE.pStringArray eq 48), @CatStr(<UnexpectedOffset STRING_TABLE.pStringArray: >, %(STRING_TABLE.pStringArray)) .erre (STRING_TABLE.Slots eq 64), @CatStr(<UnexpectedOffset STRING_TABLE.Slots: >, %(STRING_TABLE.Slots)) .erre (STRING_TABLE.Padding eq 320), @CatStr(<UnexpectedOffset STRING_TABLE.Padding: >, %(STRING_TABLE.Padding)) .erre (size STRING_TABLE eq 512), @CatStr(<IncorrectStructSize: STRING_TABLE: >, %(size STRING_TABLE)) PSTRING_TABLE typedef ptr STRING_TABLE ; ; CamelCase typedefs that are nicer to work with in assembly ; than their uppercase counterparts. ; StringTable typedef STRING_TABLE

The following diagram depicts an in-memory representation of the STRING_TABLE structure using our NTFS reserved prefix names. It is created via the CreateStringTable routine, which we feature in the appendix of this article.

In order to improve the uniqueness of the unique characters selected from each string, the strings are sorted by length during string table creation and enumerated in this order whilst identifying unique characters. The rationale behind this is that shorter strings simply have fewer characters to choose from, longer strings have more to choose from. If we identified unique characters in the order they appear in the string table, we may have longer strings preceeding shorter ones, such that toward the end of the table, nothing unique can be extracted from the short ones.

The utility of the string table is maximised by ensuring a unique character is selected from every string, thus, we sort by length first. Note that the uniqueness is actually determined by offset:character pairs, with the offsets becoming the indices stored in the UniqueIndex slot. If you trace through the diagram above, you'll see that the unique character in each slot matches the character in the corresponding string slot, indicated by the underlying index.

Supporting Structures

STRING_ARRAY

typedef struct _Struct_size_bytes_(SizeInQuadwords>>3) _STRING_ARRAY { // // Size of the structure, in quadwords. Why quadwords? It allows us to // keep this size field to a USHORT, which helps with the rest of the // alignment in this struct (we want the STRING Strings[] array to start // on an 8-byte boundary). // // N.B. We can't express the exact field size in the SAL annotation // below, because the array of buffer sizes are inexpressible; // however, we know the maximum length, so we can use the implicit // invariant that the total buffer size can't exceed whatever num // elements * max size is. // _Field_range_(<=, ( sizeof(struct _STRING_ARRAY) + ((NumberOfElements - 1) * sizeof(STRING)) + (MaximumLength * NumberOfElements) ) >> 3) USHORT SizeInQuadwords; // // Number of elements in the array. // USHORT NumberOfElements; // // Minimum and maximum lengths for the String->Length fields. Optional. // USHORT MinimumLength; USHORT MaximumLength; // // A pointer to the STRING_TABLE structure that "owns" us. // struct _STRING_TABLE *StringTable; // // The string array. Number of elements in the array is governed by the // NumberOfElements field above. // STRING Strings[ANYSIZE_ARRAY]; } STRING_ARRAY, *PSTRING_ARRAY, **PPSTRING_ARRAY;

The string array captures a raw array representation of the underlying strings making up the string table. It is either embedded within the padding area at the end of the string table, or a separate allocation is made during string table creation. The main interface to creating a string table is via a STRING_ARRAY structure. The helper functions, CreateStringTableFromDelimitedString and CreateStringTableFromDelimitedEnvironmentVariable simply break down their input into a STRING_ARRAY representation first before calling CreateStringTable

Note: the odd-looking macros _Struct_size_bytes_ and _Field_range_ are SAL Annotations. There's a neat deck called Engineering Better Software at Microsoft which captures some interesting details about SAL, for those wanting to read more. The Code Analysis engine that uses the annotations is built upon the Z3 Theorem Prover, which is a fascinating little project in its own right.

And finally, we're left with the smaller helper structs that we use to encapsulate the various innards of the string table. (I use unions that feature XMMWORD representations (which is a typedef of __m128i, representing an XMM register) as well as underlying byte/character representations as I personally find it makes the resulting C code a bit nicer.)

STRING_SLOT

// // String tables are composed of a 16 element array of 16 byte string "slots", // which represent a unique character (with respect to other strings in the // table) for a string in a given slot index. The STRING_SLOT structure // provides a convenient wrapper around this construct. // typedef union DECLSPEC_ALIGN(16) _STRING_SLOT { XMMWORD CharsXmm; CHAR Char[16]; } STRING_SLOT, *PSTRING_SLOT, **PPSTRING_SLOT; C_ASSERT(sizeof(STRING_SLOT) == 16);

SLOT_INDEX

// // An array of 1 byte unsigned integers used to indicate the 0-based index of // a given unique character in the corresponding string. // typedef union DECLSPEC_ALIGN(16) _SLOT_INDEX { XMMWORD IndexXmm; BYTE Index[16]; } SLOT_INDEX, *PSLOT_INDEX, **PPSLOT_INDEX; C_ASSERT(sizeof(SLOT_INDEX) == 16);

SLOT_LENGTHS

// // A 16 element array of 1 byte unsigned integers, used to capture the length // of each string slot in a single XMM 128-bit register. // typedef union DECLSPEC_ALIGN(16) _SLOT_LENGTHS { XMMWORD SlotsXmm; BYTE Slots[16]; } SLOT_LENGTHS, *PSLOT_LENGTHS, **PPSLOT_LENGTHS; C_ASSERT(sizeof(SLOT_LENGTHS) == 16);

String Table Construction

The CreateSingleStringTable routine is responsible for construction of a new STRING_TABLE. It is here we identify the unique set of characters (and their indices) to store in the first two fields of the string table.

CreateSingleStringTable // // Define private types used by this module. // typedef struct _LENGTH_INDEX_ENTRY { BYTE Length; BYTE Index; } LENGTH_INDEX_ENTRY; typedef LENGTH_INDEX_ENTRY *PLENGTH_INDEX_ENTRY; typedef struct _LENGTH_INDEX_TABLE { LENGTH_INDEX_ENTRY Entry[16]; } LENGTH_INDEX_TABLE; typedef LENGTH_INDEX_TABLE *PLENGTH_INDEX_TABLE; typedef union DECLSPEC_ALIGN(32) _CHARACTER_BITMAP { YMMWORD Ymm; XMMWORD Xmm[2]; LONG Bits[(256 / (4 << 3))]; // 8 } CHARACTER_BITMAP; C_ASSERT(sizeof(CHARACTER_BITMAP) == 32); typedef CHARACTER_BITMAP *PCHARACTER_BITMAP; typedef struct _SLOT_BITMAPS { CHARACTER_BITMAP Bitmap[16]; } SLOT_BITMAPS; typedef SLOT_BITMAPS *PSLOT_BITMAPS; // // Function implementation. // _Use_decl_annotations_ PSTRING_TABLE CreateSingleStringTable( PRTL Rtl, PALLOCATOR StringTableAllocator, PALLOCATOR StringArrayAllocator, PSTRING_ARRAY StringArray, BOOL CopyArray ) /*++ Routine Description: Allocates space for a STRING_TABLE structure using the provided allocators, then initializes it using the provided STRING_ARRAY. If CopyArray is set to TRUE, the routine will copy the string array such that the caller is free to destroy it after the table has been successfully created. If it is set to FALSE and StringArray->StringTable has a non-NULL value, it is assumed that sufficient space has already been allocated for the string table and this pointer will be used to initialize the rest of the structure. DestroyStringTable() must be called against the returned PSTRING_TABLE when the structure is no longer needed in order to ensure resources are released. Arguments: Rtl - Supplies a pointer to an initialized RTL structure. StringTableAllocator - Supplies a pointer to an ALLOCATOR structure which will be used for creating the STRING_TABLE. StringArrayAllocator - Supplies a pointer to an ALLOCATOR structure which may be used to create the STRING_ARRAY if it cannot fit within the padding of the STRING_TABLE structure. This is kept separate from the StringTableAllocator due to the stringent alignment requirements of the string table. StringArray - Supplies a pointer to an initialized STRING_ARRAY structure that contains the STRING structures that are to be added to the table. CopyArray - Supplies a boolean value indicating whether or not the StringArray structure should be deep-copied during creation. This is typically set when the caller wants to be able to free the structure as soon as this call returns (or can't guarantee it will persist past this function's invocation, i.e. if it was stack allocated). Return Value: A pointer to a valid PSTRING_TABLE structure on success, NULL on failure. Call DestroyStringTable() on the returned structure when it is no longer needed in order to ensure resources are cleaned up appropriately. --*/ { BYTE Byte; BYTE Count; BYTE Index; BYTE Length; BYTE NumberOfElements; ULONG HighestBit; ULONG OccupiedMask; PULONG Bits; USHORT OccupiedBitmap; USHORT ContinuationBitmap; PSTRING_TABLE StringTable; PSTRING_ARRAY StringArray; PSTRING String; PSTRING_SLOT Slot; STRING_SLOT UniqueChars; SLOT_INDEX UniqueIndex; SLOT_INDEX LengthIndex; SLOT_LENGTHS Lengths; LENGTH_INDEX_TABLE LengthIndexTable; PCHARACTER_BITMAP Bitmap; SLOT_BITMAPS SlotBitmaps; PLENGTH_INDEX_ENTRY Entry; // // Validate arguments. // if (!ARGUMENT_PRESENT(StringTableAllocator)) { return NULL; } if (!ARGUMENT_PRESENT(StringArrayAllocator)) { return NULL; } if (!ARGUMENT_PRESENT(SourceStringArray)) { return NULL; } if (SourceStringArray->NumberOfElements == 0) { return NULL; } // // Copy the incoming string array if applicable. // if (CopyArray) { StringArray = CopyStringArray( StringTableAllocator, StringArrayAllocator, SourceStringArray, FIELD_OFFSET(STRING_TABLE, StringArray), sizeof(STRING_TABLE), &StringTable ); if (!StringArray) { return NULL; } } else { // // We're not copying the array, so initialize StringArray to point at // the caller's SourceStringArray, and StringTable to point at the // array's StringTable field (which will be non-NULL if sufficient // space has been allocated). // StringArray = SourceStringArray; StringTable = StringArray->StringTable; } // // If StringTable has no value, we've either been called with CopyArray set // to FALSE, or CopyStringArray() wasn't able to allocate sufficient space // for both the table and itself. Either way, we need to allocate space for // the table. // if (!StringTable) { StringTable = (PSTRING_TABLE)( StringTableAllocator->AlignedCalloc( StringTableAllocator->Context, 1, sizeof(STRING_TABLE), STRING_TABLE_ALIGNMENT ) ); if (!StringTable) { return NULL; } } // // Make sure the fields that are sensitive to alignment are, in fact, // aligned correctly. // if (!AssertStringTableFieldAlignment(StringTable)) { DestroyStringTable(StringTableAllocator, StringArrayAllocator, StringTable); return NULL; } // // At this point, we have copied the incoming StringArray if necessary, // and we've allocated sufficient space for the StringTable structure. // Enumerate over all of the strings, set the continuation bit if the // length > 16, set the relevant slot length, set the relevant unique // character entry, then move the first 16-bytes of the string into the // relevant slot via an aligned SSE mov. // // // Initialize pointers and counters, clear stack-based structures. // Slot = StringTable->Slots; String = StringArray->Strings; OccupiedBitmap = 0; ContinuationBitmap = 0; NumberOfElements = (BYTE)StringArray->NumberOfElements; UniqueChars.CharsXmm = _mm_setzero_si128(); UniqueIndex.IndexXmm = _mm_setzero_si128(); LengthIndex.IndexXmm = _mm_setzero_si128(); // // Set all the slot lengths to 0x7f up front instead of defaulting // to zero. This allows for simpler logic when searching for a prefix // string, which involves broadcasting a search string's length to an XMM // register, then doing _mm_cmpgt_epi8() against the lengths array and // the string length. If we left the lengths as 0 for unused slots, they // would get included in the resulting comparison register (i.e. the high // bits would be set to 1), so we'd have to do a subsequent masking of // the result at some point using the OccupiedBitmap. By defaulting the // lengths to 0x7f, we ensure they'll never get included in any cmpgt-type // SIMD matches. (We use 0x7f instead of 0xff because the _mm_cmpgt_epi8() // intrinsic assumes packed signed integers.) // Lengths.SlotsXmm = _mm_set1_epi8(0x7f); ZeroStruct(LengthIndexTable); ZeroStruct(SlotBitmaps); for (Count = 0; Count < NumberOfElements; Count++) { XMMWORD CharsXmm; // // Set the string length for the slot. // Length = Lengths.Slots[Count] = (BYTE)String->Length; // // Set the appropriate bit in the continuation bitmap if the string is // longer than 16 bytes. // if (Length > 16) { ContinuationBitmap |= (Count == 0 ? 1 : 1 << (Count + 1)); } if (Count == 0) { Entry = &LengthIndexTable.Entry[0]; Entry->Index = 0; Entry->Length = Length; } else { // // Perform a linear scan of the length-index table in order to // identify an appropriate insertion point. // for (Index = 0; Index < Count; Index++) { if (Length < LengthIndexTable.Entry[Index].Length) { break; } } if (Index != Count) { // // New entry doesn't go at the end of the table, so shuffle // everything else down. // Rtl->RtlMoveMemory(&LengthIndexTable.Entry[Index + 1], &LengthIndexTable.Entry[Index], (Count - Index) * sizeof(*Entry)); } Entry = &LengthIndexTable.Entry[Index]; Entry->Index = Count; Entry->Length = Length; } // // Copy the first 16-bytes of the string into the relevant slot. We // have taken care to ensure everything is 16-byte aligned by this // stage, so we can use SSE intrinsics here. // CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer); _mm_store_si128(&(*Slot).CharsXmm, CharsXmm); // // Advance our pointers. // ++Slot; ++String; } // // Store the slot lengths. // _mm_store_si128(&(StringTable->Lengths.SlotsXmm), Lengths.SlotsXmm); // // Loop through the strings in order of shortest to longest and construct // the uniquely-identifying character table with corresponding index. // for (Count = 0; Count < NumberOfElements; Count++) { Entry = &LengthIndexTable.Entry[Count]; Length = Entry->Length; Slot = &StringTable->Slots[Entry->Index]; // // Iterate over each character in the slot and find the first one // without a corresponding bit set. // for (Index = 0; Index < Length; Index++) { Bitmap = &SlotBitmaps.Bitmap[Index]; Bits = (PULONG)&Bitmap->Bits[0]; Byte = Slot->Char[Index]; if (!BitTestAndSet(Bits, Byte)) { break; } } UniqueChars.Char[Count] = Byte; UniqueIndex.Index[Count] = Index; LengthIndex.Index[Count] = Entry->Index; } // // Loop through the elements again such that the unique chars are stored // in the order they appear in the table. // for (Count = 0; Count < NumberOfElements; Count++) { for (Index = 0; Index < NumberOfElements; Index++) { if (LengthIndex.Index[Index] == Count) { StringTable->UniqueChars.Char[Count] = UniqueChars.Char[Index]; StringTable->UniqueIndex.Index[Count] = UniqueIndex.Index[Index]; break; } } } // // Generate and store the occupied bitmap. Each bit, from low to high, // corresponds to the index of a slot. When set, the slot is occupied. // When clear, it is not. So, fill bits from the highest bit set down. // HighestBit = (1 << (StringArray->NumberOfElements-1)); OccupiedMask = _blsmsk_u32(HighestBit); StringTable->OccupiedBitmap = (USHORT)OccupiedMask; // // Store the continuation bitmap. // StringTable->ContinuationBitmap = (USHORT)(ContinuationBitmap); // // Wire up the string array to the table. // StringTable->pStringArray = StringArray; // // And we're done, return the table. // return StringTable; }

The Benchmark

The performance comparison graphs in the subsequent sections were generated in Excel, using CSV data output by the creatively-named program StringTable2BenchmarkExe.

Modern CPUs are fast, timing is hard, especially when you're dealing with CPU cycle comparisons. No approach is perfect. Here's what I settled on:

The benchmark utility has #pragma optimize("", off) at the start of the file, which disables global optimizations, even in release (optimized) builds. This prevents the compiler doing clever things with regards to scheduling of the timestamping logic, which affects reported times.

at the start of the file, which disables global optimizations, even in release (optimized) builds. This prevents the compiler doing clever things with regards to scheduling of the timestamping logic, which affects reported times. The benchmark utility pins itself to a single core and sets its thread priority to the highest permissible value at startup. (Turbo is disabled on the computer, such that the frequency is pinned to 3.68GHz.)

The benchmark utility is fed an array of function pointers and test inputs. It iterates over each test input, and then iterates over each function, calling it with the test input and potentially verifying the result (some functions are included for comparison purposes but don't actually produce correct results, and thus, do not have their results verified).

The test input string is copied into a local buffer that is aligned on a 32 byte boundary. This ensures that all test inputs are being compared fairly. (The natural alignment of the buffers varies anywhere from 2 to 512 bytes, unaligned buffers have a significant impact on the timings.)

The function is run once, with the result captured. If verification has been requested, the result is verified. We __debugbreak() immediately if there's a mismatch, which is handy during development.

immediately if there's a mismatch, which is handy during development. NtDelayExecution(TRUE, 1) is called, which results in a sleep of approximately 100 nanoseconds. This is done to force a context switch, such that the thread gets a new scheduling quantum before each function is run.

is called, which results in a sleep of approximately 100 nanoseconds. This is done to force a context switch, such that the thread gets a new scheduling quantum before each function is run. The function is executed 100 times for warmup.

Timings are taken for 1000 iterations of the function using the given test input. The __rdtscp() intrinsic is used (which forces some serialization) to capture the timestamp counter before and after the iterations.

intrinsic is used (which forces some serialization) to capture the timestamp counter before and after the iterations. This process is repeated 100 times. The minimum time observed to perform 1000 iterations (out of 100 attempts) is captured as the function's best time.

Release vs PGO Oddities

All of the times in the graphs come from the profile-guided optimization build of the StringTable component. The PGO build is faster than the normal release build in every case, except one, where it is notably slower.

It's... odd. I haven't investigated it. The following graph depicts the affected function, IsPrefixOfStringInTable_1, and a few other versions for reference, and depicts the performance of the PGO build to the release build on the input strings "$INDEX_ALLOCATION" and "$Bai123456789012".

Only that function is affected, and the problem really only manifests on the two example test strings depicted. As this routine essentially serves as one of the initial baseline implementations, it would be misleading to compare all of our optimized PGO versions to the abnormally-slow baseline implementation. So, the release and PGO timings were blended together into a single CSV, and the Excel PivotTables pick whatever the minimum time is for a given function and test input.

Thus, you're always looking at the PGO timings, except for this outlier case where the release versions are faster.

The Implementations

Round 1

In this section, we'll take a look at the various implementations I experimented with on the first pass, prior to soliciting any feedback. I figured there were a couple of ways I could present this information. First, I could hand-pick what I choose to show and hide, such that a nice rosey picture is presented that makes it seem like I effortlessly arrived at the fastest implementation without much actual effort whatsoever.

Or I could show the gritty reality of how everything actually went down in a chronological fashion, errors and all. And there were definitely some errors! For better or for worse, I've chosen to go down this route, so you'll get to enjoy some pretty tedious tweaks (changing a single line, for example) before the juicy stuff really kicks in.

Additionally, with the benefit of writing this little section introduction retro-actively, iterations 4 and 5 aren't testing what I thought they were initially testing. I've left them in as is; if anything, it demonstrates the importance of only changing one thing at a time, and making sure you're testing what you think you're testing. I'll discuss the errors with those iterations later in the article.

IsPrefixOfCStrInArray

Let's review the baseline implementation again, as that's what we're ultimately comparing ourselves against. This version enumerates the string array (and thus has a slightly different function signature to the STRING_TABLE-based functions) looking for prefix matches. No SIMD instructions are used. The timings captured should be proportional to the location of the test input string in the array. That is, it should take less time to prefix match strings that occur earlier in the array versus those that appear later.

_Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfCStrInArray( PCSZ *StringArray, PCSZ String, PSTRING_MATCH Match ) { PCSZ Left; PCSZ Right; PCSZ *Target; ULONG Index = 0; ULONG Count; for (Target = StringArray; *Target != NULL; Target++, Index++) { Count = 0; Left = String; Right = *Target; while (*Left && *Right && *Left++ == *Right++) { Count++; } if (Count > 0 && !*Right) { if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Index; Match->NumberOfMatchedCharacters = (BYTE)Count; Match->String = NULL; } return (STRING_TABLE_INDEX)Index; } } return NO_MATCH_FOUND; }

IsPrefixOfStringInTable_1

This version is similar to the IsPrefixOfCStrInArray implementation, except it utilizes the slot length information provided by the STRING_ARRAY structure, and conforms to our standard IsPrefixOfStringInTable function signature. It uses no SIMD instructions.

_Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfStringInTable_1( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match ) /*++ Routine Description: Searches a string table to see if any strings "prefix match" the given search string. That is, whether any string in the table "starts with or is equal to" the search string. This routine performs a simple linear scan of the string table looking for a prefix match against each slot. Arguments: StringTable - Supplies a pointer to a STRING_TABLE struct. String - Supplies a pointer to a STRING struct that contains the string to search for. Match - Optionally supplies a pointer to a variable that contains the address of a STRING_MATCH structure. This will be populated with additional details about the match if a non-NULL pointer is supplied. Return Value: Index of the prefix match if one was found, NO_MATCH_FOUND if not. --*/ { BYTE Left; BYTE Right; ULONG Index; ULONG Count; PSTRING_ARRAY StringArray; PSTRING TargetString; //IACA_VC_START(); StringArray = StringTable->pStringArray; if (StringArray->MinimumLength > String->Length) { return NO_MATCH_FOUND; } for (Count = 0; Count < StringArray->NumberOfElements; Count++) { TargetString = &StringArray->Strings[Count]; if (String->Length < TargetString->Length) { continue; } for (Index = 0; Index < TargetString->Length; Index++) { Left = String->Buffer[Index]; Right = TargetString->Buffer[Index]; if (Left != Right) { break; } } if (Index == TargetString->Length) { if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Count; Match->NumberOfMatchedCharacters = (BYTE)Index; Match->String = TargetString; } return (STRING_TABLE_INDEX)Count; } } //IACA_VC_END(); return NO_MATCH_FOUND; }

Here's the performance of these two baseline routines:

That's an interesting result! Even without using any SIMD instructions, version 1, the IsPrefixOfStringInTable_1 routine, is faster (in all but one case) than the baseline IsPrefixOfCStrInArray routine, thanks to a more sophisticated data structure.

(And really, it's not even using the sophisticated parts of the STRING_TABLE ; it's just leveraging the fact that we've captured the lengths of each string in the backing STRING_ARRAY structure, by virtue of the fact that we use the STRING structure to wrap our strings (versus relying on the standard NULL-terminated C string approach).)

IsPrefixOfStringInTable_2

This version is the first of the routines to use SIMD instructions. It is actually based on the prefix matching routine I wrote for the first version of the StringTable component back in 2016. The layout of the STRING_TABLE struct differed in the first version; only the first character of each slot was used to do the initial exclusion (as opposed to the unique character), and lengths were unsigned shorts instead of chars (16 bits instead of 8 bits), so the match bitmap had to be constructed slightly differently.

None of those details really apply to our second attempt at the StringTable component, detailed in this article. Our lengths are 8 bits, and we use unique characters in the initial negative match fast-path. However, the first version used an elaborate AVX2 prefix match routine that was geared toward matching long strings, and attempted to use non-temporal streaming load instructions where possible (which would only make sense for a large number of long strings in a very small set of cache-thrashing scenarios). Compare our simpler implementation, IsPrefixMatch , which we use in version 3 onward, to the far more elaborate (and unncessary) IsPrefixMatchAvx2 :

IsPrefixMatch

IsPrefixMatchAvx2 FORCEINLINE BYTE IsPrefixMatch( _In_ PCSTRING SearchString, _In_ PCSTRING TargetString, _In_ BYTE Offset ) { PBYTE Left; PBYTE Right; BYTE Matched = 0; BYTE Remaining = (SearchString->Length - Offset) + 1; Left = (PBYTE)RtlOffsetToPointer(SearchString->Buffer, Offset); Right = (PBYTE)RtlOffsetToPointer(TargetString->Buffer, Offset); while (--Remaining && *Left++ == *Right++) { Matched++; } Matched += Offset; if (Matched != TargetString->Length) { return NO_MATCH_FOUND; } return Matched; } FORCEINLINE USHORT IsPrefixMatchAvx2( _In_ PCSTRING SearchString, _In_ PCSTRING TargetString, _In_ USHORT Offset ) { USHORT SearchStringRemaining; USHORT TargetStringRemaining; ULONGLONG SearchStringAlignment; ULONGLONG TargetStringAlignment; USHORT CharactersMatched = Offset; LONG Count; LONG Mask; PCHAR SearchBuffer; PCHAR TargetBuffer; STRING_SLOT SearchSlot; XMMWORD SearchXmm; XMMWORD TargetXmm; XMMWORD ResultXmm; YMMWORD SearchYmm; YMMWORD TargetYmm; YMMWORD ResultYmm; SearchStringRemaining = SearchString->Length - Offset; TargetStringRemaining = TargetString->Length - Offset; SearchBuffer = (PCHAR)RtlOffsetToPointer(SearchString->Buffer, Offset); TargetBuffer = (PCHAR)RtlOffsetToPointer(TargetString->Buffer, Offset); // // This routine is only called in the final stage of a prefix match when // we've already verified the slot's corresponding original string length // (referred in this routine as the target string) is less than or equal // to the length of the search string. // // We attempt as many 32-byte comparisons as we can, then as many 16-byte // comparisons as we can, then a final < 16-byte comparison if necessary. // // We use aligned loads if possible, falling back to unaligned if not. // StartYmm: if (SearchStringRemaining >= 32 && TargetStringRemaining >= 32) { // // We have at least 32 bytes to compare for each string. Check the // alignment for each buffer and do an aligned streaming load (non- // temporal hint) if our alignment is at a 32-byte boundary or better; // reverting to an unaligned load when not. // SearchStringAlignment = GetAddressAlignment(SearchBuffer); TargetStringAlignment = GetAddressAlignment(TargetBuffer); if (SearchStringAlignment < 32) { SearchYmm = _mm256_loadu_si256((PYMMWORD)SearchBuffer); } else { SearchYmm = _mm256_stream_load_si256((PYMMWORD)SearchBuffer); } if (TargetStringAlignment < 32) { TargetYmm = _mm256_loadu_si256((PYMMWORD)TargetBuffer); } else { TargetYmm = _mm256_stream_load_si256((PYMMWORD)TargetBuffer); } // // Compare the two vectors. // ResultYmm = _mm256_cmpeq_epi8(SearchYmm, TargetYmm); // // Generate a mask from the result of the comparison. // Mask = _mm256_movemask_epi8(ResultYmm); // // There were at least 32 characters remaining in each string buffer, // thus, every character needs to have matched in order for this search // to continue. If there were less than 32 characters, we can terminate // this prefix search here. (-1 == 0xffffffff == all bits set == all // characters matched.) // if (Mask != -1) { // // Not all characters were matched, terminate the prefix search. // return NO_MATCH_FOUND; } // // All 32 characters were matched. Update counters and pointers // accordingly and jump back to the start of the 32-byte processing. // SearchStringRemaining -= 32; TargetStringRemaining -= 32; CharactersMatched += 32; SearchBuffer += 32; TargetBuffer += 32; goto StartYmm; } // // Intentional follow-on to StartXmm. // StartXmm: // // Update the search string's alignment. // if (SearchStringRemaining >= 16 && TargetStringRemaining >= 16) { // // We have at least 16 bytes to compare for each string. Check the // alignment for each buffer and do an aligned streaming load (non- // temporal hint) if our alignment is at a 16-byte boundary or better; // reverting to an unaligned load when not. // SearchStringAlignment = GetAddressAlignment(SearchBuffer); if (SearchStringAlignment < 16) { SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer); } else { SearchXmm = _mm_stream_load_si128((XMMWORD *)SearchBuffer); } TargetXmm = _mm_stream_load_si128((XMMWORD *)TargetBuffer); // // Compare the two vectors. // ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm); // // Generate a mask from the result of the comparison. // Mask = _mm_movemask_epi8(ResultXmm); // // There were at least 16 characters remaining in each string buffer, // thus, every character needs to have matched in order for this search // to continue. If there were less than 16 characters, we can terminate // this prefix search here. (-1 == 0xffff -> all bits set -> all chars // matched.) // if ((SHORT)Mask != (SHORT)-1) { // // Not all characters were matched, terminate the prefix search. // return NO_MATCH_FOUND; } // // All 16 characters were matched. Update counters and pointers // accordingly and jump back to the start of the 16-byte processing. // SearchStringRemaining -= 16; TargetStringRemaining -= 16; CharactersMatched += 16; SearchBuffer += 16; TargetBuffer += 16; goto StartXmm; } if (TargetStringRemaining == 0) { // // We'll get here if we successfully prefix matched the search string // and all our buffers were aligned (i.e. we don't have a trailing // < 16 bytes comparison to perform). // return CharactersMatched; } // // If we get here, we have less than 16 bytes to compare. Our target // strings are guaranteed to be 16-byte aligned, so we can load them // using an aligned stream load as in the previous cases. // TargetXmm = _mm_stream_load_si128((PXMMWORD)TargetBuffer); // // Loading the remainder of our search string's buffer is a little more // complicated. It could reside within 15 bytes of the end of the page // boundary, which would mean that a 128-bit load would cross a page // boundary. // // At best, the page will belong to our process and we'll take a performance // hit. At worst, we won't own the page, and we'll end up triggering a hard // page fault. // // So, see if the current search buffer address plus 16 bytes crosses a page // boundary. If it does, take the safe but slower approach of a ranged // memcpy (movsb) into a local stack-allocated STRING_SLOT structure. // if (!PointerToOffsetCrossesPageBoundary(SearchBuffer, 16)) { // // No page boundary is crossed, so just do an unaligned 128-bit move // into our Xmm register. (We could do the aligned/unaligned dance // here, but it's the last load we'll be doing (i.e. it's not // potentially on a loop path), so I don't think it's worth the extra // branch cost, although I haven't measured this empirically.) // SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer); } else { // // We cross a page boundary, so only copy the the bytes we need via // __movsb(), then do an aligned stream load into the Xmm register // we'll use in the comparison. // __movsb((PBYTE)&SearchSlot.Char, (PBYTE)SearchBuffer, SearchStringRemaining); SearchXmm = _mm_stream_load_si128(&SearchSlot.CharsXmm); } // // Compare the final vectors. // ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm); // // Generate a mask from the result of the comparison, but mask off (zero // out) high bits from the target string's remaining length. // Mask = _bzhi_u32(_mm_movemask_epi8(ResultXmm), TargetStringRemaining); // // Count how many characters were matched and determine if we were a // successful prefix match or not. // Count = __popcnt(Mask); if ((USHORT)Count == TargetStringRemaining) { // // If we matched the same amount of characters as remaining in the // target string, we've successfully prefix matched the search string. // Return the total number of characters we matched. // CharactersMatched += (USHORT)Count; return CharactersMatched; } // // After all that work, our string match failed at the final stage! Return // to the caller indicating we were unable to make a prefix match. // return NO_MATCH_FOUND; }

The AVX2 routine is overkill, especially considering the emphasis we put on favoring short strings versus longer ones in the requirements section. But we want to put broad statements like that to the test, so let's include it as our first SIMD implementation so that we can see how it stacks up against the simpler versions.

Also note, this is the first time we're seeing the full body of the SIMD-style IsPrefixOfStringInTable implementation. It's commented heavily, and, in general, the core algorithm doesn't fundamentally change across iterations (things are just tweaked slightly), so I'd recommend reading through it thoroughly to build up a mental model of how the matching algorithm works. It's pretty straight forward, and the subsequent iterations will make a lot more sense as they're typically presented as diffs against the previous version first.

_Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfStringInTable_2( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match ) /*++ Routine Description: Searches a string table to see if any strings "prefix match" the given search string. That is, whether any string in the table "starts with or is equal to" the search string. This is our first AVX-optimized version of the routine. Arguments: StringTable - Supplies a pointer to a STRING_TABLE struct. String - Supplies a pointer to a STRING struct that contains the string to search for. Match - Optionally supplies a pointer to a variable that contains the address of a STRING_MATCH structure. This will be populated with additional details about the match if a non-NULL pointer is supplied. Return Value: Index of the prefix match if one was found, NO_MATCH_FOUND if not. --*/ { ULONG Bitmap; ULONG Mask; ULONG Count; ULONG Length; ULONG Index; ULONG Shift = 0; ULONG CharactersMatched; ULONG NumberOfTrailingZeros; ULONG SearchLength; PSTRING TargetString; PSTRING_ARRAY StringArray; STRING_SLOT Slot; STRING_SLOT Search; STRING_SLOT Compare; SLOT_LENGTHS Lengths; XMMWORD LengthXmm; XMMWORD UniqueChar; XMMWORD TableUniqueChars; XMMWORD IncludeSlotsByUniqueChar; XMMWORD IgnoreSlotsByLength; XMMWORD IncludeSlotsByLength; XMMWORD IncludeSlots; const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff); StringArray = StringTable->pStringArray; // // If the minimum length of the string array is greater than the length of // our search string, there can't be a prefix match. // if (StringArray->MinimumLength > String->Length) { goto NoMatch; } // // Unconditionally do the following five operations before checking any of // the results and determining how the search should proceed: // // 1. Load the search string into an Xmm register, and broadcast the // character indicated by the unique character index (relative to // other strings in the table) across a second Xmm register. // // 2. Load the string table's unique character array into an Xmm register. // // 3. Broadcast the search string's length into an XMM register. // // 3. Load the string table's slot lengths array into an XMM register. // // 4. Compare the unique character from step 1 to the string table's unique // character array set up in step 2. The result of this comparison // will produce an XMM register with each byte set to either 0xff if // the unique character was found, or 0x0 if it wasn't. // // 5. Compare the search string's length from step 3 to the string table's // slot length array set up in step 3. This allows us to identify the // slots that have strings that are of lesser or equal length to our // search string. As we're doing a prefix search, we can ignore any // slots longer than our incoming search string. // // We do all five of these operations up front regardless of whether or not // they're strictly necessary. That is, if the unique character isn't in // the unique character array, we don't need to load array lengths -- and // vice versa. However, we assume the benefits afforded by giving the CPU // a bunch of independent things to do unconditionally up-front outweigh // the cost of putting in branches and conditionally loading things if // necessary. // // // Load the first 16-bytes of the search string into an XMM register. // LoadSearchStringIntoXmmRegister(Search, String, SearchLength); // // Broadcast the search string's unique characters according to the string // table's unique character index. // UniqueChar = _mm_shuffle_epi8(Search.CharsXmm, StringTable->UniqueIndex.IndexXmm); // // Load the slot length array into an XMM register. // Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm); // // Load the string table's unique character array into an XMM register. // TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm); // // Broadcast the search string's length into an XMM register. // LengthXmm.m128i_u8[0] = (BYTE)String->Length; LengthXmm = _mm_broadcastb_epi8(LengthXmm); // // Compare the search string's unique character with all of the unique // characters of strings in the table, saving the results into an XMM // register. This comparison will indicate which slots we can ignore // because the characters at a given index don't match. Matched slots // will be 0xff, unmatched slots will be 0x0. // IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars); // // Find all slots that are longer than the incoming string length, as these // are the ones we're going to exclude from any prefix match. // // N.B. Because we default the length of empty slots to 0x7f, they will // handily be included in the ignored set (i.e. their words will also // be set to 0xff), which means they'll also get filtered out when // we invert the mask shortly after. // IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm); // // Invert the result of the comparison; we want 0xff for slots to include // and 0x0 for slots to ignore (it's currently the other way around). We // can achieve this by XOR'ing the result against our all-ones XMM register. // IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm); // // We're now ready to intersect the two XMM registers to determine which // slots should still be included in the comparison (i.e. which slots have // the exact same unique character as the string and a length less than or // equal to the length of the search string). // IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar, IncludeSlotsByLength); // // Generate a mask. // Bitmap = _mm_movemask_epi8(IncludeSlots); if (!Bitmap) { // // No bits were set, so there are no strings in this table starting // with the same character and of a lesser or equal length as the // search string. // goto NoMatch; } // // A popcount against the mask will tell us how many slots we matched, and // thus, need to compare. // Count = __popcnt(Bitmap); do { // // Extract the next index by counting the number of trailing zeros left // in the bitmap and adding the amount we've already shifted by. // NumberOfTrailingZeros = _tzcnt_u32(Bitmap); Index = NumberOfTrailingZeros + Shift; // // Shift the bitmap right, past the zeros and the 1 that was just found, // such that it's positioned correctly for the next loop's tzcnt. Update // the shift count accordingly. // Bitmap >>= (NumberOfTrailingZeros + 1); Shift = Index + 1; // // Load the slot and its length. // Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm); Length = Lengths.Slots[Index]; // // Compare the slot to the search string. // Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm); // // Create a mask of the comparison, then filter out high bits from the // search string's length (which is capped at 16). (This shouldn't be // technically necessary as the string array buffers should have been // calloc'd and zeroed, but optimizing compilers can often ignore the // zeroing request -- which can produce some bizarre results where the // debug build is correct (because the buffers were zeroed) but the // release build fails because the zeroing got ignored and there are // junk bytes past the NULL terminator, which get picked up in our // 128-bit loads.) // Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength); // // Count how many characters matched. // CharactersMatched = __popcnt(Mask); if ((USHORT)CharactersMatched == 16 && Length > 16) { // // The first 16 characters in the string matched against this // slot, and the slot is oversized (longer than 16 characters), // so do a direct comparison between the remaining buffers. // TargetString = &StringTable->pStringArray->Strings[Index]; CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16); if (CharactersMatched == NO_MATCH_FOUND) { // // The prefix match failed, continue our search. // continue; } else { // // We successfully prefix matched the search string against // this slot. The code immediately following us deals with // handling a successful prefix match at the initial slot // level; let's avoid an unnecessary branch and just jump // directly into it. // goto FoundMatch; } } if ((USHORT)CharactersMatched == Length) { FoundMatch: // // This slot is a prefix match. Fill out the Match structure if the // caller provided a non-NULL pointer, then return the index of the // match. // if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Index; Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched; Match->String = &StringTable->pStringArray->Strings[Index]; } return (STRING_TABLE_INDEX)Index; } // // Not enough characters matched, so continue the loop. // } while (--Count); // // If we get here, we didn't find a match. // NoMatch: //IACA_VC_END(); return NO_MATCH_FOUND; }

Let's see how version 2, our first SIMD attempt, performs in comparison to the two baselines.

Eek! Our first SIMD attempt actually has worse prefix matching performance in most cases! The only area where it shows a performance improvement is negative matching.

IsPrefixOfStringInTable_3

For version 3, let's replace the call to IsPrefixMatchAvx2 with our simpler version, IsPrefixMatch :

Diff

Full % diff -u IsPrefixOfStringInTable_2.c IsPrefixOfStringInTable_3.c --- IsPrefixOfStringInTable_2.c 2018-04-15 22:35:55.458773500 -0400 +++ IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400 @@ -18,7 +18,7 @@ _Use_decl_annotations_ STRING_TABLE_INDEX -IsPrefixOfStringInTable_2( +IsPrefixOfStringInTable_3( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match @@ -278,7 +278,7 @@ TargetString = &StringTable->pStringArray->Strings[Index]; - CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16); + CharactersMatched = IsPrefixMatch(String, TargetString, 16); if (CharactersMatched == NO_MATCH_FOUND) { _Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfStringInTable_3( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match ) /*++ Routine Description: Searches a string table to see if any strings "prefix match" the given search string. That is, whether any string in the table "starts with or is equal to" the search string. This is our first AVX-optimized version of the routine. Arguments: StringTable - Supplies a pointer to a STRING_TABLE struct. String - Supplies a pointer to a STRING struct that contains the string to search for. Match - Optionally supplies a pointer to a variable that contains the address of a STRING_MATCH structure. This will be populated with additional details about the match if a non-NULL pointer is supplied. Return Value: Index of the prefix match if one was found, NO_MATCH_FOUND if not. --*/ { ULONG Bitmap; ULONG Mask; ULONG Count; ULONG Length; ULONG Index; ULONG Shift = 0; ULONG CharactersMatched; ULONG NumberOfTrailingZeros; ULONG SearchLength; PSTRING TargetString; PSTRING_ARRAY StringArray; STRING_SLOT Slot; STRING_SLOT Search; STRING_SLOT Compare; SLOT_LENGTHS Lengths; XMMWORD LengthXmm; XMMWORD UniqueChar; XMMWORD TableUniqueChars; XMMWORD IncludeSlotsByUniqueChar; XMMWORD IgnoreSlotsByLength; XMMWORD IncludeSlotsByLength; XMMWORD IncludeSlots; const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff); StringArray = StringTable->pStringArray; // // If the minimum length of the string array is greater than the length of // our search string, there can't be a prefix match. // if (StringArray->MinimumLength > String->Length) { goto NoMatch; } // // Unconditionally do the following five operations before checking any of // the results and determining how the search should proceed: // // 1. Load the search string into an Xmm register, and broadcast the // character indicated by the unique character index (relative to // other strings in the table) across a second Xmm register. // // 2. Load the string table's unique character array into an Xmm register. // // 3. Broadcast the search string's length into an XMM register. // // 3. Load the string table's slot lengths array into an XMM register. // // 4. Compare the unique character from step 1 to the string table's unique // character array set up in step 2. The result of this comparison // will produce an XMM register with each byte set to either 0xff if // the unique character was found, or 0x0 if it wasn't. // // 5. Compare the search string's length from step 3 to the string table's // slot length array set up in step 3. This allows us to identify the // slots that have strings that are of lesser or equal length to our // search string. As we're doing a prefix search, we can ignore any // slots longer than our incoming search string. // // We do all five of these operations up front regardless of whether or not // they're strictly necessary. That is, if the unique character isn't in // the unique character array, we don't need to load array lengths -- and // vice versa. However, we assume the benefits afforded by giving the CPU // a bunch of independent things to do unconditionally up-front outweigh // the cost of putting in branches and conditionally loading things if // necessary. // // // Load the first 16-bytes of the search string into an XMM register. // LoadSearchStringIntoXmmRegister(Search, String, SearchLength); // // Broadcast the search string's unique characters according to the string // table's unique character index. // UniqueChar = _mm_shuffle_epi8(Search.CharsXmm, StringTable->UniqueIndex.IndexXmm); // // Load the slot length array into an XMM register. // Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm); // // Load the string table's unique character array into an XMM register. // TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm); // // Broadcast the search string's length into an XMM register. // LengthXmm.m128i_u8[0] = (BYTE)String->Length; LengthXmm = _mm_broadcastb_epi8(LengthXmm); // // Compare the search string's unique character with all of the unique // characters of strings in the table, saving the results into an XMM // register. This comparison will indicate which slots we can ignore // because the characters at a given index don't match. Matched slots // will be 0xff, unmatched slots will be 0x0. // IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars); // // Find all slots that are longer than the incoming string length, as these // are the ones we're going to exclude from any prefix match. // // N.B. Because we default the length of empty slots to 0x7f, they will // handily be included in the ignored set (i.e. their words will also // be set to 0xff), which means they'll also get filtered out when // we invert the mask shortly after. // IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm); // // Invert the result of the comparison; we want 0xff for slots to include // and 0x0 for slots to ignore (it's currently the other way around). We // can achieve this by XOR'ing the result against our all-ones XMM register. // IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm); // // We're now ready to intersect the two XMM registers to determine which // slots should still be included in the comparison (i.e. which slots have // the exact same unique character as the string and a length less than or // equal to the length of the search string). // IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar, IncludeSlotsByLength); // // Generate a mask. // Bitmap = _mm_movemask_epi8(IncludeSlots); if (!Bitmap) { // // No bits were set, so there are no strings in this table starting // with the same character and of a lesser or equal length as the // search string. // goto NoMatch; } // // A popcount against the mask will tell us how many slots we matched, and // thus, need to compare. // Count = __popcnt(Bitmap); do { // // Extract the next index by counting the number of trailing zeros left // in the bitmap and adding the amount we've already shifted by. // NumberOfTrailingZeros = _tzcnt_u32(Bitmap); Index = NumberOfTrailingZeros + Shift; // // Shift the bitmap right, past the zeros and the 1 that was just found, // such that it's positioned correctly for the next loop's tzcnt. Update // the shift count accordingly. // Bitmap >>= (NumberOfTrailingZeros + 1); Shift = Index + 1; // // Load the slot and its length. // Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm); Length = Lengths.Slots[Index]; // // Compare the slot to the search string. // Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm); // // Create a mask of the comparison, then filter out high bits from the // search string's length (which is capped at 16). (This shouldn't be // technically necessary as the string array buffers should have been // calloc'd and zeroed, but optimizing compilers can often ignore the // zeroing request -- which can produce some bizarre results where the // debug build is correct (because the buffers were zeroed) but the // release build fails because the zeroing got ignored and there are // junk bytes past the NULL terminator, which get picked up in our // 128-bit loads.) // Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength); // // Count how many characters matched. // CharactersMatched = __popcnt(Mask); if ((USHORT)CharactersMatched == 16 && Length > 16) { // // The first 16 characters in the string matched against this // slot, and the slot is oversized (longer than 16 characters), // so do a direct comparison between the remaining buffers. // TargetString = &StringTable->pStringArray->Strings[Index]; CharactersMatched = IsPrefixMatch(String, TargetString, 16); if (CharactersMatched == NO_MATCH_FOUND) { // // The prefix match failed, continue our search. // continue; } else { // // We successfully prefix matched the search string against // this slot. The code immediately following us deals with // handling a successful prefix match at the initial slot // level; let's avoid an unnecessary branch and just jump // directly into it. // goto FoundMatch; } } if ((USHORT)CharactersMatched == Length) { FoundMatch: // // This slot is a prefix match. Fill out the Match structure if the // caller provided a non-NULL pointer, then return the index of the // match. // if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Index; Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched; Match->String = &StringTable->pStringArray->Strings[Index]; } return (STRING_TABLE_INDEX)Index; } // // Not enough characters matched, so continue the loop. // } while (--Count); // // If we get here, we didn't find a match. // NoMatch: //IACA_VC_END(); return NO_MATCH_FOUND; }

Phew! We finally see superior performance across the board. This ends the short lived tenure of version 2, which is demonstrably worse in every case. We'll also omit the IsPrefixOfCStrInArray routine from the graphs for now (for the most part), as it has served its initial baseline purpose.

IsPrefixOfStringInTable_4

When I first wrote the initial string table code, I was playing around with different strategies for loading the initial search string buffer. That resulted in the file StringLoadStoreOperations.h, which defined a bunch of helper macros. I've included them below, but don't spend too much time absorbing them, they're not good practice, and they're all irrelevant anyway as soon as we switch to _mm_loadu_si128() in a few versions. I'm including them because they set the scene for versions 4, 5 and 6.

/*++ VOID LoadSearchStringIntoXmmRegister_SEH( _In_ STRING_SLOT Slot, _In_ PSTRING String, _In_ USHORT LengthVar ); Routine Description: Attempts an aligned 128-bit load of String->Buffer into Slot.CharXmm via the _mm_load_si128() intrinsic. The intrinsic is surrounded in a __try/ __except block that catches EXCEPTION_ACCESS_VIOLATION exceptions. If such an exception is caught, the routine will check to see if the string buffer's address will cross a page boundary if 16-bytes are loaded. If a page boundary would be crossed, a __movsb() intrinsic is used to copy only the bytes specified by String->Length, otherwise, an unaligned 128-bit load is attemped via the _mm_loadu_si128() intrinsic. Arguments: Slot - Supplies the STRING_SLOT local variable name within the calling function that will receive the results of the load operation. String - Supplies the name of the PSTRING variable that is to be loaded into the slot. This will usually be one of the function parameters. LengthVar - Supplies the name of a USHORT local variable that will receive the value of min(String->Length, 16). Return Value: None. --*/ #define LoadSearchStringIntoXmmRegister_SEH(Slot, String, LengthVar) \ LengthVar = min(String->Length, 16); \ TRY_SSE42_ALIGNED { \ Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer); \ } CATCH_EXCEPTION_ACCESS_VIOLATION { \ if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) { \ __movsb(Slot.Char, String->Buffer, LengthVar); \ } else { \ Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer); \ } \ } /*++ VOID LoadSearchStringIntoXmmRegister_AlignmentCheck( _In_ STRING_SLOT Slot, _In_ PSTRING String, _In_ USHORT LengthVar ); Routine Description: This routine checks to see if a page boundary will be crossed if 16-bytes are loaded from the address supplied by String->Buffer. If a page boundary will be crossed, a __movsb() intrinsic is used to only copy String->Length bytes into the given Slot. If no page boundary will be crossed by a 128-bit load, the alignment of the address supplied by String->Buffer is checked. If the alignment isn't at least on a 16-byte boundary, an unaligned load will be issued via the _mm_loadu_si128() intrinsic, otherwise, an _mm_load_si128() will be used. Arguments: Slot - Supplies the STRING_SLOT local variable name within the calling function that will receive the results of the load operation. String - Supplies the name of the PSTRING variable that is to be loaded into the slot. This will usually be one of the function parameters. LengthVar - Supplies the name of a USHORT local variable that will receive the value of min(String->Length, 16). Return Value: None. --*/ #define LoadSearchStringIntoXmmRegister_AlignmentCheck(Slot, String,LengthVar) \ LengthVar = min(String->Length, 16); \ if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) { \ __movsb(Slot.Char, String->Buffer, LengthVar); \ } else if (GetAddressAlignment(String->Buffer) < 16) { \ Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer); \ } else { \ Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer); \ } /*++ VOID LoadSearchStringIntoXmmRegister_AlwaysUnaligned( _In_ STRING_SLOT Slot, _In_ PSTRING String, _In_ USHORT LengthVar ); Routine Description: This routine performs an unaligned 128-bit load of the address supplied by String->Buffer into the given Slot via the _mm_loadu_si128() intrinsic. No checks are done regarding whether or not a page boundary will be crossed. Arguments: Slot - Supplies the STRING_SLOT local variable name within the calling function that will receive the results of the load operation. String - Supplies the name of the PSTRING variable that is to be loaded into the slot. This will usually be one of the function parameters. LengthVar - Supplies the name of a USHORT local variable that will receive the value of min(String->Length, 16). Return Value: None. --*/ #define LoadSearchStringIntoXmmRegister_Unaligned(Slot, String, LengthVar) \ LengthVar = min(String->Length, 16); \ if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) { \ __movsb(Slot.Char, String->Buffer, LengthVar); \ } else if (GetAddressAlignment(String->Buffer) < 16) { \ Slot.CharsXmm = _mm_loadu_si128(String->Buffer); \ } else { \ Slot.CharsXmm = _mm_load_si128(String->Buffer); \ } /*++ VOID LoadSearchStringIntoXmmRegister_AlwaysMovsb( _In_ STRING_SLOT Slot, _In_ PSTRING String, _In_ USHORT LengthVar ); Routine Description: This routine copies min(String->Length, 16) bytes from String->Buffer into the given Slot via the __movsb() intrinsic. The memory referenced by the Slot is not cleared first via SecureZeroMemory(). Arguments: Slot - Supplies the STRING_SLOT local variable name within the calling function that will receive the results of the load operation. String - Supplies the name of the PSTRING variable that is to be loaded into the slot. This will usually be one of the function parameters. LengthVar - Supplies the name of a USHORT local variable that will receive the value of min(String->Length, 16). Return Value: None. --*/ #define LoadSearchStringIntoXmmRegister_AlwaysMovsb(Slot, String, LengthVar) \ LengthVar = min(String->Length, 16); \ __movsb(Slot.Char, String->Buffer, LengthVar);

In our StringTable2.vcxproj file, we have the following:

<PropertyGroup Label="Globals"> ... <LoadSearchStringStrategy>AlwaysMovsb</LoadSearchStringStrategy> <!-- <LoadSearchStringStrategy>SEH</LoadSearchStringStrategy> <LoadSearchStringStrategy>AlignmentCheck</LoadSearchStringStrategy> <LoadSearchStringStrategy>AlwaysUnaligned</LoadSearchStringStrategy> -->

This basically allowed me to toggle which of the strategies I wanted to use to do load the search string into an XMM register. As you can see above, the default is to use the AlwaysMovsb approach*; so, with version 4, let's swap that out for the SEH approach, which wraps the aligned load in a structured exception handler that falls back to __movsb() if the aligned load fails and the pointer plus 16 bytes crosses a page boundary.

[*]: Or was it?

Narrator: it wasn't.

(Note: these little Narrator interjections work best if you imagine they're being read in Ron Howard's voice.)

Diff

Full % diff -u IsPrefixOfStringInTable_4.c IsPrefixOfStringInTable_3.c --- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400 +++ IsPrefixOfStringInTable_4.c 2018-04-15 22:35:55.453274200 -0400 @@ -18,7 +18,7 @@ _Use_decl_annotations_ STRING_TABLE_INDEX -IsPrefixOfStringInTable_3( +IsPrefixOfStringInTable_4( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match @@ -31,7 +31,8 @@ search string. That is, whether any string in the table "starts with or is equal to" the search string. - This is our first AVX-optimized version of the routine. + This routine is a variant of version 3 that uses a structured exception + handler for loading the initial search string. Arguments: @@ -123,7 +124,7 @@ // Load the first 16-bytes of the search string into an XMM register. // - LoadSearchStringIntoXmmRegister(Search, String, SearchLength); + LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength); // // Broadcast the search string's unique characters according to the string _Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfStringInTable_3( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match ) /*++ Routine Description: Searches a string table to see if any strings "prefix match" the given search string. That is, whether any string in the table "starts with or is equal to" the search string. This routine is a variant of version 3 that uses a structured exception handler for loading the initial search string. Arguments: StringTable - Supplies a pointer to a STRING_TABLE struct. String - Supplies a pointer to a STRING struct that contains the string to search for. Match - Optionally supplies a pointer to a variable that contains the address of a STRING_MATCH structure. This will be populated with additional details about the match if a non-NULL pointer is supplied. Return Value: Index of the prefix match if one was found, NO_MATCH_FOUND if not. --*/ { ULONG Bitmap; ULONG Mask; ULONG Count; ULONG Length; ULONG Index; ULONG Shift = 0; ULONG CharactersMatched; ULONG NumberOfTrailingZeros; ULONG SearchLength; PSTRING TargetString; PSTRING_ARRAY StringArray; STRING_SLOT Slot; STRING_SLOT Search; STRING_SLOT Compare; SLOT_LENGTHS Lengths; XMMWORD LengthXmm; XMMWORD UniqueChar; XMMWORD TableUniqueChars; XMMWORD IncludeSlotsByUniqueChar; XMMWORD IgnoreSlotsByLength; XMMWORD IncludeSlotsByLength; XMMWORD IncludeSlots; const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff); StringArray = StringTable->pStringArray; // // If the minimum length of the string array is greater than the length of // our search string, there can't be a prefix match. // if (StringArray->MinimumLength > String->Length) { goto NoMatch; } // // Unconditionally do the following five operations before checking any of // the results and determining how the search should proceed: // // 1. Load the search string into an Xmm register, and broadcast the // character indicated by the unique character index (relative to // other strings in the table) across a second Xmm register. // // 2. Load the string table's unique character array into an Xmm register. // // 3. Broadcast the search string's length into an XMM register. // // 3. Load the string table's slot lengths array into an XMM register. // // 4. Compare the unique character from step 1 to the string table's unique // character array set up in step 2. The result of this comparison // will produce an XMM register with each byte set to either 0xff if // the unique character was found, or 0x0 if it wasn't. // // 5. Compare the search string's length from step 3 to the string table's // slot length array set up in step 3. This allows us to identify the // slots that have strings that are of lesser or equal length to our // search string. As we're doing a prefix search, we can ignore any // slots longer than our incoming search string. // // We do all five of these operations up front regardless of whether or not // they're strictly necessary. That is, if the unique character isn't in // the unique character array, we don't need to load array lengths -- and // vice versa. However, we assume the benefits afforded by giving the CPU // a bunch of independent things to do unconditionally up-front outweigh // the cost of putting in branches and conditionally loading things if // necessary. // // // Load the first 16-bytes of the search string into an XMM register. // LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength); // // Broadcast the search string's unique characters according to the string // table's unique character index. // UniqueChar = _mm_shuffle_epi8(Search.CharsXmm, StringTable->UniqueIndex.IndexXmm); // // Load the slot length array into an XMM register. // Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm); // // Load the string table's unique character array into an XMM register. // TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm); // // Broadcast the search string's length into an XMM register. // LengthXmm.m128i_u8[0] = (BYTE)String->Length; LengthXmm = _mm_broadcastb_epi8(LengthXmm); // // Compare the search string's unique character with all of the unique // characters of strings in the table, saving the results into an XMM // register. This comparison will indicate which slots we can ignore // because the characters at a given index don't match. Matched slots // will be 0xff, unmatched slots will be 0x0. // IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars); // // Find all slots that are longer than the incoming string length, as these // are the ones we're going to exclude from any prefix match. // // N.B. Because we default the length of empty slots to 0x7f, they will // handily be included in the ignored set (i.e. their words will also // be set to 0xff), which means they'll also get filtered out when // we invert the mask shortly after. // IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm); // // Invert the result of the comparison; we want 0xff for slots to include // and 0x0 for slots to ignore (it's currently the other way around). We // can achieve this by XOR'ing the result against our all-ones XMM register. // IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm); // // We're now ready to intersect the two XMM registers to determine which // slots should still be included in the comparison (i.e. which slots have // the exact same unique character as the string and a length less than or // equal to the length of the search string). // IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar, IncludeSlotsByLength); // // Generate a mask. // Bitmap = _mm_movemask_epi8(IncludeSlots); if (!Bitmap) { // // No bits were set, so there are no strings in this table starting // with the same character and of a lesser or equal length as the // search string. // goto NoMatch; } // // A popcount against the mask will tell us how many slots we matched, and // thus, need to compare. // Count = __popcnt(Bitmap); do { // // Extract the next index by counting the number of trailing zeros left // in the bitmap and adding the amount we've already shifted by. // NumberOfTrailingZeros = _tzcnt_u32(Bitmap); Index = NumberOfTrailingZeros + Shift; // // Shift the bitmap right, past the zeros and the 1 that was just found, // such that it's positioned correctly for the next loop's tzcnt. Update // the shift count accordingly. // Bitmap >>= (NumberOfTrailingZeros + 1); Shift = Index + 1; // // Load the slot and its length. // Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm); Length = Lengths.Slots[Index]; // // Compare the slot to the search string. // Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm); // // Create a mask of the comparison, then filter out high bits from the // search string's length (which is capped at 16). (This shouldn't be // technically necessary as the string array buffers should have been // calloc'd and zeroed, but optimizing compilers can often ignore the // zeroing request -- which can produce some bizarre results where the // debug build is correct (because the buffers were zeroed) but the // release build fails because the zeroing got ignored and there are // junk bytes past the NULL terminator, which get picked up in our // 128-bit loads.) // Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength); // // Count how many characters matched. // CharactersMatched = __popcnt(Mask); if ((USHORT)CharactersMatched == 16 && Length > 16) { // // The first 16 characters in the string matched against this // slot, and the slot is oversized (longer than 16 characters), // so do a direct comparison between the remaining buffers. // TargetString = &StringTable->pStringArray->Strings[Index]; CharactersMatched = IsPrefixMatch(String, TargetString, 16); if (CharactersMatched == NO_MATCH_FOUND) { // // The prefix match failed, continue our search. // continue; } else { // // We successfully prefix matched the search string against // this slot. The code immediately following us deals with // handling a successful prefix match at the initial slot // level; let's avoid an unnecessary branch and just jump // directly into it. // goto FoundMatch; } } if ((USHORT)CharactersMatched == Length) { FoundMatch: // // This slot is a prefix match. Fill out the Match structure if the // caller provided a non-NULL pointer, then return the index of the // match. // if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Index; Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched; Match->String = &StringTable->pStringArray->Strings[Index]; } return (STRING_TABLE_INDEX)Index; } // // Not enough characters matched, so continue the loop. // } while (--Count); // // If we get here, we didn't find a match. // NoMatch: //IACA_VC_END(); return NO_MATCH_FOUND; }

Performance of version 4 was slightly worse than 3 in every case:

Version 3 is still in the lead with the AlwaysMovsb -based search string loading approach.

Narrator: except the AlignmentCheck macro was actually active, not the AlwaysMovsb one.

IsPrefixOfStringInTable_5

Version 5 is an interesting one. It's the first time we attempt to validate our claim that it's more efficient to give the CPU a bunch of independent things to do up-front, versus putting more branches in and attempting to terminate as early as possible.

Note: we'll also explicitly use the LoadSearchStringIntoXmmRegister_AlwaysMovsb macro here, instead of LoadSearchStringIntoXmmRegister , just to make it more explicit that we're actually relying on the __movsb() -based string loading routine.

Narrator: can anyone spot the mistake with this logic?

Diff

Full % diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_5.c --- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400 +++ IsPrefixOfStringInTable_5.c 2018-04-15 13:24:52.480972900 -0400 @@ -16,9 +16,13 @@ #include "stdafx.h" +// +// Variant of v3 with early-exits. +// + _Use_decl_annotations_ STRING_TABLE_INDEX -IsPrefixOfStringInTable_3( +IsPrefixOfStringInTable_5( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match @@ -31,7 +35,11 @@ search string. That is, whether any string in the table "starts with or is equal to" the search string. - This is our first AVX-optimized version of the routine. + This routine is a variant of version 3 that uses early exits (i.e. + returning NO_MATCH_FOUND as early as we can). It is designed to evaluate + the assertion we've been making that it's more optimal to give the CPU + to do a bunch of things up front versus doing something, then potentially + branching, doing the next thing, potentially branching, etc. Arguments: @@ -51,6 +59,8 @@ --*/ { ULONG Bitmap; + ULONG CharBitmap; + ULONG LengthBitmap; ULONG Mask; ULONG Count; ULONG Length; @@ -71,7 +81,6 @@ XMMWORD IncludeSlotsByUniqueChar; XMMWORD IgnoreSlotsByLength; XMMWORD IncludeSlotsByLength; - XMMWORD IncludeSlots; const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff); StringArray = StringTable->pStringArray; @@ -123,7 +132,7 @@ // Load the first 16-bytes of the search string into an XMM register. // - LoadSearchStringIntoXmmRegister(Search, String, SearchLength); + LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength); // // Broadcast the search string's unique characters according to the string @@ -133,11 +142,6 @@ UniqueChar = _mm_shuffle_epi8(Search.CharsXmm, StringTable->UniqueIndex.IndexXmm); - // - // Load the slot length array into an XMM register. - // - - Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm); // // Load the string table's unique character array into an XMM register. @@ -146,13 +150,6 @@ TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm); // - // Broadcast the search string's length into an XMM register. - // - - LengthXmm.m128i_u8[0] = (BYTE)String->Length; - LengthXmm = _mm_broadcastb_epi8(LengthXmm); - - // // Compare the search string's unique character with all of the unique // characters of strings in the table, saving the results into an XMM // register. This comparison will indicate which slots we can ignore @@ -162,6 +159,25 @@ IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars); + CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar); + + if (!CharBitmap) { + return NO_MATCH_FOUND; + } + + // + // Load the slot length array into an XMM register. + // + + Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm); + + // + // Broadcast the search string's length into an XMM register. + // + + LengthXmm.m128i_u8[0] = (BYTE)String->Length; + LengthXmm = _mm_broadcastb_epi8(LengthXmm); + // // Find all slots that are longer than the incoming string length, as these // are the ones we're going to exclude from any prefix match. @@ -182,31 +198,16 @@ IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm); - // - // We're now ready to intersect the two XMM registers to determine which - // slots should still be included in the comparison (i.e. which slots have - // the exact same unique character as the string and a length less than or - // equal to the length of the search string). - // - - IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar, - IncludeSlotsByLength); + LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength); - // - // Generate a mask. - // + if (!LengthBitmap) { + return NO_MATCH_FOUND; + } - Bitmap = _mm_movemask_epi8(IncludeSlots); + Bitmap = CharBitmap & LengthBitmap; if (!Bitmap) { - - // - // No bits were set, so there are no strings in this table starting - // with the same character and of a lesser or equal length as the - // search string. - // - - goto NoMatch; + return NO_MATCH_FOUND; } // _Use_decl_annotations_ _Use_decl_annotations_ STRING_TABLE_INDEX IsPrefixOfStringInTable_5( PSTRING_TABLE StringTable, PSTRING String, PSTRING_MATCH Match ) /*++ Routine Description: Searches a string table to see if any strings "prefix match" the given search string. That is, whether any string in the table "starts with or is equal to" the search string. This routine is a variant of version 3 that uses early exits (i.e. returning NO_MATCH_FOUND as early as we can). It is designed to evaluate the assertion we've been making that it's more optimal to give the CPU to do a bunch of things up front versus doing something, then potentially branching, doing the next thing, potentially branching, etc. Arguments: StringTable - Supplies a pointer to a STRING_TABLE struct. String - Supplies a pointer to a STRING struct that contains the string to search for. Match - Optionally supplies a pointer to a variable that contains the address of a STRING_MATCH structure. This will be populated with additional details about the match if a non-NULL pointer is supplied. Return Value: Index of the prefix match if one was found, NO_MATCH_FOUND if not. --*/ { ULONG Bitmap; ULONG CharBitmap; ULONG LengthBitmap; ULONG Mask; ULONG Count; ULONG Length; ULONG Index; ULONG Shift = 0; ULONG CharactersMatched; ULONG NumberOfTrailingZeros; ULONG SearchLength; PSTRING TargetString; PSTRING_ARRAY StringArray; STRING_SLOT Slot; STRING_SLOT Search; STRING_SLOT Compare; SLOT_LENGTHS Lengths; XMMWORD LengthXmm; XMMWORD UniqueChar; XMMWORD TableUniqueChars; XMMWORD IncludeSlotsByUniqueChar; XMMWORD IgnoreSlotsByLength; XMMWORD IncludeSlotsByLength; const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff); StringArray = StringTable->pStringArray; // // If the minimum length of the string array is greater than the length of // our search string, there can't be a prefix match. // if (StringArray->MinimumLength > String->Length) { goto NoMatch; } // // Unconditionally do the following five operations before checking any of // the results and determining how the search should proceed: // // 1. Load the search string into an Xmm register, and broadcast the // character indicated by the unique character index (relative to // other strings in the table) across a second Xmm register. // // 2. Load the string table's unique character array into an Xmm register. // // 3. Broadcast the search string's length into an XMM register. // // 3. Load the string table's slot lengths array into an XMM register. // // 4. Compare the unique character from step 1 to the string table's unique // character array set up in step 2. The result of this comparison // will produce an XMM register with each byte set to either 0xff if // the unique character was found, or 0x0 if it wasn't. // // 5. Compare the search string's length from step 3 to the string table's // slot length array set up in step 3. This allows us to identify the // slots that have strings that are of lesser or equal length to our // search string. As we're doing a prefix search, we can ignore any // slots longer than our incoming search string. // // We do all five of these operations up front regardless of whether or not // they're strictly necessary. That is, if the unique character isn't in // the unique character array, we don't need to load array lengths -- and // vice versa. However, we assume the benefits afforded by giving the CPU // a bunch of independent things to do unconditionally up-front outweigh // the cost of putting in branches and conditionally loading things if // necessary. // // // Load the first 16-bytes of the search string into an XMM register. // LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength); // // Broadcast the search string's unique characters according to the string // table's unique character index. // UniqueChar = _mm_shuffle_epi8(Search.CharsXmm, StringTable->UniqueIndex.IndexXmm); // // Load the string table's unique character array into an XMM register. // TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm); // // Compare the search string's unique character with all of the unique // characters of strings in the table, saving the results into an XMM // register. This comparison will indicate which slots we can ignore // because the characters at a given index don't match. Matched slots // will be 0xff, unmatched slots will be 0x0. // IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars); CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar); if (!CharBitmap) { return NO_MATCH_FOUND; } // // Load the slot length array into an XMM register. // Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm); // // Broadcast the search string's length into an XMM register. // LengthXmm.m128i_u8[0] = (BYTE)String->Length; LengthXmm = _mm_broadcastb_epi8(LengthXmm); // // Find all slots that are longer than the incoming string length, as these // are the ones we're going to exclude from any prefix match. // // N.B. Because we default the length of empty slots to 0x7f, they will // handily be included in the ignored set (i.e. their words will also // be set to 0xff), which means they'll also get filtered out when // we invert the mask shortly after. // IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm); // // Invert the result of the comparison; we want 0xff for slots to include // and 0x0 for slots to ignore (it's currently the other way around). We // can achieve this by XOR'ing the result against our all-ones XMM register. // IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm); LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength); if (!LengthBitmap) { return NO_MATCH_FOUND; } Bitmap = CharBitmap & LengthBitmap; if (!Bitmap) { return NO_MATCH_FOUND; } // // A popcount against the mask will tell us how many slots we matched, and // thus, need to compare. // Count = __popcnt(Bitmap); do { // // Extract the next index by counting the number of trailing zeros left // in the bitmap and adding the amount we've already shifted by. // NumberOfTrailingZeros = _tzcnt_u32(Bitmap); Index = NumberOfTrailingZeros + Shift; // // Shift the bitmap right, past the zeros and the 1 that was just found, // such that it's positioned correctly for the next loop's tzcnt. Update // the shift count accordingly. // Bitmap >>= (NumberOfTrailingZeros + 1); Shift = Index + 1; // // Load the slot and its length. // Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm); Length = Lengths.Slots[Index]; // // Compare the slot to the search string. // Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm); // // Create a mask of the comparison, then filter out high bits from the // search string's length (which is capped at 16). (This shouldn't be // technically necessary as the string array buffers should have been // calloc'd and zeroed, but optimizing compilers can often ignore the // zeroing request -- which can produce some bizarre results where the // debug build is correct (because the buffers were zeroed) but the // release build fails because the zeroing got ignored and there are // junk bytes past the NULL terminator, which get picked up in our // 128-bit loads.) // Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength); // // Count how many characters matched. // CharactersMatched = __popcnt(Mask); if ((USHORT)CharactersMatched == 16 && Length > 16) { // // The first 16 characters in the string matched against this // slot, and the slot is oversized (longer than 16 characters), // so do a direct comparison between the remaining buffers. // TargetString = &StringTable->pStringArray->Strings[Index]; CharactersMatched = IsPrefixMatch(String, TargetString, 16); if (CharactersMatched == NO_MATCH_FOUND) { // // The prefix match failed, continue our search. // continue; } else { // // We successfully prefix matched the search string against // this slot. The code immediately following us deals with // handling a successful prefix match at the initial slot // level; let's avoid an unnecessary branch and just jump // directly into it. // goto FoundMatch; } } if ((USHORT)CharactersMatched == Length) { FoundMatch: // // This slot is a prefix match. Fill out the Match structure if the // caller provided a non-NULL pointer, then return the index of the // match. // if (ARGUMENT_PRESENT(Match)) { Match->Index = (BYTE)Index; Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched; Match->String = &StringTable->pStrin