We’ve been busy rebuilding Thread’s frontend codebase using React, and lately that’s included rebuilding our browse filters.

These filters allow customers to find products more easily by selecting which brands, colours or prices they want to see.

Because some of our filters contain a lot of options (we have over 500 brands available on Thread) we need them to be easily searchable, and sorted correctly.

Sorting

Take the following list of brands:

Let’s see how we can best sort them intuitively for our users.

Sorting these in JavaScript initially seemed straightforward: just call sort() on the brands array, passing a function to compare the two labels, eg:

This appears to work, with the exception of Études , which gets incorrectly placed at the bottom of the list. JavaScript compares strings lexicographically, but with limitations. It actually uses the character’s position in the Unicode table to determine its ordering.

If we look the characters z and é we can see the following:

Character Encoding hex dec (bytes) dec binary z UTF-8 7A 122 122 01111010 é UTF-8 C3 A9 195 169 50089 11000011 10101001

Notice the decimal values for the characters; é has a much higher value than z meaning it will be sorted after z .

Fortunately, JavaScript has String.prototype.localeCompare which meaning we can simply rewrite the above function to be:

Have a read of the docs to see what other arguments the localeCompare function accepts.

Searching

Sorting was quite straightforward to solve, but what about searching?

Our browse filters provide a handy search input that filters large lists as you type, however searching for Cote was not yielding results for Côte et Ciel .

We didn’t think it was reasonable to expect our customers to know we’re using the brand spelling with diacritics, or to be able to easily input the right characters.

So in order to allow customers to find these brands, we need to remove the diacritical marks from the characters before comparing them with the search query.

We can do this in two parts.

First, we need to “decompose” the string so that any characters with diacritical marks are represented by their two-byte surrogate pair. This is a technique used in UTF-8 to represent a single character as a pair of two bytes.

We can do this using the normalize function in JavaScript, and passing the “NFD” form.

Looking at the docs, we can see that “NFD” stands for “Normalization Form Canonical Decomposition”–the important word being “Decomposition”.

We can see this working by comparing the string length before and after decomposition:

This shows that the ô character is now being represented by two bytes rather than one.

Second, now we have separate bytes for the character and the diacritical mark, we can remove the unwanted characters using the String.prototype.replace function:

The replace function takes a range of Unicode points from U+0300 to U+036F , covering any diacritic bytes we might have in the string.

Et voilà! This is a great technique to have in mind if you’re allowing your users to search lists of data that might include diacritics.