I'VE GOT THE BYTE ON MY SIDE 57005 or alive

Curious behavior when de-duplicating a collection in PowerShell

This is a bug/curiosity in PowerShell that I stumbled upon a few years ago, but never wrote up. The behavior hasn’t changed significantly in the intervening versions, so now I’m finally getting around to a quick blog post.

Here’s a chart detailing the runtime of 4 different PowerShell approaches to de-duplicate a collection - i.e. filter an input collection to just its unique elements. Code used for the benchmark can be found here.

Same chart, with log scale so it’s a little easier to distinguish the trend lines:

The solutions being compared are:

$data | select -unique - Built-in; does nothing but de-duplicate the input collection.

$data | group |% Name - Composition of the built-ins group (which groups duplicate items together) and % (which reads the key off of each group).

$data | sort -unique - Built-in; sorts the input collection and additionally de-duplicates it.

$data | hashunique - Custom cmdlet using a simple hashset-based approach; 12 lines, available alongside the benchmark code.
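hashunique itself isn’t reproduced here, but the hashset idea behind it is small enough to sketch as a plain script function (Get-UniqueHashed is a hypothetical name; the real cmdlet is compiled C#):

```powershell
# Sketch of a hashset-based de-duplication filter, approximating the
# hashunique approach described above. Names here are illustrative.
function Get-UniqueHashed {
    begin { $seen = [System.Collections.Generic.HashSet[object]]::new() }
    process {
        # HashSet.Add returns $true only the first time a value is seen,
        # so each distinct item is emitted exactly once, at O(1) expected
        # cost per item - O(n) overall.
        if ($seen.Add($_)) { $_ }
    }
}
```

For example, 3,1,3,2,1 | Get-UniqueHashed emits 3, 1, 2.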



I was surprised to see how poorly select -unique performs. It’s part of the core PowerShell library (presumably authored by experts), implemented as compiled C# code (not script), and its sole purpose is to solve this exact problem.

So why is it the slowest option?!

Specific questions I’m still wondering about -

Why do “select” and “group” have quadratic runtime?

Filtering out duplicates from a collection, or grouping together equivalent items, can be done in \(O(n)\) expected time by using a hash set or hash table, respectively. The trendline above hints, and a quick ILSpy session confirms, that select and group instead rely on an \(O(n^2)\) algorithm that doesn’t use hashing at all.
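To make the grouping half of that claim concrete, here’s a one-pass hashtable sketch that produces group-like output in \(O(n)\) expected time (an illustration of the technique, not how Group-Object is actually implemented; Group-Hashed is a made-up name):

```powershell
# One-pass, hashtable-based grouping: O(n) expected time.
# This illustrates the claim above; it is not Group-Object's code.
function Group-Hashed {
    begin { $groups = [ordered]@{} }
    process {
        # Hashtable lookup/insert is O(1) expected per item.
        if (-not $groups.Contains($_)) {
            $groups[$_] = [System.Collections.Generic.List[object]]::new()
        }
        $groups[$_].Add($_)
    }
    end {
        # Emit one record per distinct key, mirroring group's output shape.
        foreach ($k in $groups.Keys) {
            [pscustomobject]@{ Name = $k; Count = $groups[$k].Count; Group = $groups[$k] }
        }
    }
}
```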

My best guess is that something about PowerShell’s object and equality system precludes efficient, reliable hashing for arbitrary scriptable objects, so the built-ins decline to hash any objects at all. Perhaps everything just falls back to pairwise equality comparisons.

But this doesn’t quite hold up. Here’s a simple example where two items are considered non-unique by select, sort, and group, yet aren’t considered equal:

PS> 'dude' > sweet.txt
PS> $f1 = dir .\sweet.txt
PS> $f2 = dir .\sweet.txt
PS> ($f1, $f2 | select -Unique).Length
1
PS> ($f1, $f2 | sort -Unique).Length
1
PS> ($f1, $f2 | group).Length
1
PS> $f1 -eq $f2
False
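Continuing that session: my guess is that the built-ins are falling back to comparing string representations. I haven’t verified this against the implementation, but it’s at least consistent with these two objects, which stringify identically even though they aren’t -eq:

```powershell
# Both FileInfo objects stringify to the same path, so a comparer that
# falls back to ToString() would treat them as duplicates.
# (A guess at the mechanism, not verified in the source.)
PS> [string]$f1 -eq [string]$f2
True
```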

Practically speaking, when I’m at the terminal and need to de-duplicate a collection, 99% of the time it’s just going to be strings or numbers. It’s a shame this core use case is left completely unoptimized. Going forward I will just keep hashunique in my profile and use that.

Why is “group” faster than “select -unique”?

Setting aside asymptotic complexity, how can select -unique be slower by some constant factor than piping together group |% Name? The latter has to carry around a ton of extra data, and also incurs the cost of an additional pipeline command. This makes no sense to me.
