Challenge your performance intuition with C++ magic squares

This was supposed to be a five-minute interview question. I came up with it on Thursday, then spent a night looking for a good answer. Then the other night. Then the Saturday morning. Then it finally occurred to me that it's more like 500 minutes now and it wouldn't work for an interview.

But it's still a fun problem to fiddle with. And it goes like this.

Here is a magic square:

8 3 4 1 5 9 6 7 2

It has all the numbers from 1 to 9 each occurring exactly once. The sum in every row is 15. The sum in every column is 15. And the sum in both diagonals is 15, too. We can write a square as a 9-character string like this: "834159672".

The problem: propose a function that tells if a 3x3 square written as an ASCII string of digits is magic or not fast.

Here is a benchmark. It has no dependencies, and it's C++11 standard compliant, so it runs in basically every environment.

#include <iostream> #include <chrono> using namespace std; // Here you should implement your // magic checker. It should return // true if the square is magic, // false otherwise. // // 8 1 6 this is // "816357492" = 3 5 7 a magic // 4 9 2 square // bool check_if_magic(const std::string& square) { // add your code here // (or above if you need global stuff) return false; } // this generates all possible combinations // of 1-9 digits that may or may not // form a magic square static string buffer = "000000000"; void generate_or_check(int index_or_check = 8) { if(index_or_check == -1){ if(check_if_magic(buffer)) cout << buffer << " "; return; } for(auto i = 1u; i < 10; ++i){ buffer[index_or_check] = '0' + i; generate_or_check(index_or_check-1); } } // this runs the generator and measures time int main() { auto start = std::chrono::system_clock::now(); generate_or_check(); auto end = std::chrono::system_clock::now(); chrono::duration<double> difference = end - start; cout << difference.count() << "



"; }

I ran it with an empty checker to see how much time the generator takes. Then ran thirteen different solutions three times each, and picked the best time for every solution. Now I know what works and what doesn't.

Now let's play a game. There will be twelve rounds, each round is a match between two solutions. Or rather between their execution times. Using your intuition and best judgment, please estimate their relative performance. Use the sliders below the code samples.

Round 1. Check 15 and check digits vs. check digits and then check 15

The naïve solution would be to check every row, column, and diagonal sum and then assert that all the numbers are there. We don't have to parse actual numbers, we can go with ASCII.

To check that all the numbers are there and each occurs only once, we can fill the array of counters and then check that every number has its counter set to 1.

The question is, what should we do first? On the left, the rows, columns, and diagonals checks come first and then there's a counter check. On the right, the counter check comes first. Let's find out which one if faster.

Now place your bets!

bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != '5'*3) || (sq[3] + sq[4] + sq[5] != '5'*3) || (sq[6] + sq[7] + sq[8] != '5'*3) || (sq[0] + sq[3] + sq[6] != '5'*3) || (sq[1] + sq[4] + sq[7] != '5'*3) || (sq[2] + sq[5] + sq[8] != '5'*3) || (sq[0] + sq[4] + sq[8] != '5'*3) || (sq[2] + sq[4] + sq[6] != '5'*3)) return false; std::array<int, 9> numbers_count{}; for(auto i = 0u; i < 9; ++i) ++numbers_count[sq[i]-'1']; for(auto i = 0u; i < 9; ++i) if(numbers_count[i] != 1) return false; return true; } bool check_if_magic(const std::string& sq) { std::array<int, 9> numbers_count{}; for(auto i = 0u; i < 9; ++i) ++numbers_count[sq[i]-'1']; for(auto i = 0u; i < 9; ++i) if(numbers_count[i] != 1) return false; if ((sq[0] + sq[1] + sq[2] != '5'*3) || (sq[3] + sq[4] + sq[5] != '5'*3) || (sq[6] + sq[7] + sq[8] != '5'*3) || (sq[0] + sq[3] + sq[6] != '5'*3) || (sq[1] + sq[4] + sq[7] != '5'*3) || (sq[2] + sq[5] + sq[8] != '5'*3) || (sq[0] + sq[4] + sq[8] != '5'*3) || (sq[2] + sq[4] + sq[6] != '5'*3)) return false; return true; }

Reveal the truth

Yes, the difference is huge. The thing is, short-circuiting makes every line in 15-checks an independent optimizing heuristic. We only go past the first line when the numbers in the first row sum up to 15. Which only happens in 8.36% of all cases. So the very first line cuts out 91.74% of computation. It's very effective.

Round 2. All checks vs. fewer checks

Not all the checks are that effective though. Sure, the first three are independent but if you have established that all your rows sum up to 15 each that means that the total is 45. Now if we check that the first two columns sum up to 15, the third one has to be 45 - 15*2 = 15. The fifth check is redundant.

And we will gain a bit of performance if we throw it away. But in fact, we can also throw away the third check as well. If we assert that the numbers are unique then the total sum is 45 anyway.

So let's throw away two redundant lines and see what will happen.

bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != '5'*3) || (sq[3] + sq[4] + sq[5] != '5'*3) || (sq[6] + sq[7] + sq[8] != '5'*3) || (sq[0] + sq[3] + sq[6] != '5'*3) || (sq[1] + sq[4] + sq[7] != '5'*3) || (sq[2] + sq[5] + sq[8] != '5'*3) || (sq[0] + sq[4] + sq[8] != '5'*3) || (sq[2] + sq[4] + sq[6] != '5'*3)) return false; std::array<int, 9> numbers_count{}; for(auto i = 0u; i < 9; ++i) ++numbers_count[sq[i]-'1']; for(auto i = 0u; i < 9; ++i) if(numbers_count[i] != 1) return false; return true; } bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != '5'*3) || (sq[3] + sq[4] + sq[5] != '5'*3) // || (sq[6] + sq[7] + sq[8] != '5'*3) || (sq[0] + sq[3] + sq[6] != '5'*3) || (sq[1] + sq[4] + sq[7] != '5'*3) // || (sq[2] + sq[5] + sq[8] != '5'*3) || (sq[0] + sq[4] + sq[8] != '5'*3) || (sq[2] + sq[4] + sq[6] != '5'*3)) return false; std::array<int, 9> numbers_count{}; for(auto i = 0u; i < 9; ++i) ++numbers_count[sq[i]-'1']; for(auto i = 0u; i < 9; ++i) if(numbers_count[i] != 1) return false; return true; }

Reveal the truth

Not the best idea. Yes, we do assert that the numbers are unique but we do this in the most expensive part of the code. The fifth check is redundant because of the first three, but the third — only because of the thing we still have to compute. We retire a line of code but we lose pretty effective heuristic.

Round 3. Naïve solution vs. micro-optimization

The naïve solution is the eight sum-is-15 checks and the digits-are-unique check. The micro-optimization concerns the latter part. To check if the string is a permutation of digits, we want to fill a map of bools and assert that its content is all true. Yes, with a binary map, if there are three equal digits in a square, the value in the map for the digit will be true. But then we wouldn't have enough digits left to fill all nine positions so it works.

Since we are here for the performance, we would use an uint_fast64_t instead of a real map and trigger its bits. The assertion then would be a mere integer comparison. Also, we will start with the loaded map and check it with 0. All for performance sake, right?

bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != '5'*3) || (sq[3] + sq[4] + sq[5] != '5'*3) || (sq[6] + sq[7] + sq[8] != '5'*3) || (sq[0] + sq[3] + sq[6] != '5'*3) || (sq[1] + sq[4] + sq[7] != '5'*3) || (sq[2] + sq[5] + sq[8] != '5'*3) || (sq[0] + sq[4] + sq[8] != '5'*3) || (sq[2] + sq[4] + sq[6] != '5'*3)) return false; std::array<int, 9> numbers_count{}; for(auto i = 0u; i < 9; ++i) ++numbers_count[sq[i]-'1']; for(auto i = 0u; i < 9; ++i) if(numbers_count[i] != 1) return false; return true; } auto c15 = '5' * 3; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != c15) || (sq[3] + sq[4] + sq[5] != c15) || (sq[6] + sq[7] + sq[8] != c15) || (sq[0] + sq[3] + sq[6] != c15) || (sq[1] + sq[4] + sq[7] != c15) || (sq[2] + sq[5] + sq[8] != c15) || (sq[0] + sq[4] + sq[8] != c15) || (sq[2] + sq[4] + sq[6] != c15)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; }

Reveal the truth

The optimization seems legit. It seems sane. Yet apparently, the optimization makes the code slightly slower. How come? The thing is, it doesn't. Generally. The optimization only works as a pessimization with this particular processor, compiler, and compiling options. Here are my measurements for other configurations: ARMv7 (v71), clang 3.5.0, -O2: 24.8 vs 24.4 Intel i7-7700HQ, g++, -O3: 0.89 vs 0.75 Intel i7-4700, MSVC 15.9.11, -O2: 4.0 vs 3.7 So yes, using an integer as a binary map is a good thing. Generally. But sometimes it just isn't.

Round 4. Microoptimized solution vs. the oddity heuristic

The heuristic proposed is meant to end the function as early as possible. It is based on a fact that in a magic square, the corner values are always even and the rest are odd. If one of the numbers fails to pass the oddity test, then there is no point in the full magic check at all.

This heuristic is rather effective. On random numbers, it narrows down our cases with a pretty good 29 to 1 ratio. So only one string in 512 will get to the actual check. Well, technically our numbers are not entirely random, we have more odds than evens, but still.

auto c15 = '5' * 3; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != c15) || (sq[3] + sq[4] + sq[5] != c15) || (sq[6] + sq[7] + sq[8] != c15) || (sq[0] + sq[3] + sq[6] != c15) || (sq[1] + sq[4] + sq[7] != c15) || (sq[2] + sq[5] + sq[8] != c15) || (sq[0] + sq[4] + sq[8] != c15) || (sq[2] + sq[4] + sq[6] != c15)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; } auto c15 = '5' * 3; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if ( (sq[0] & 1) != 0 || (sq[1] & 1) == 0 || (sq[2] & 1) != 0 || (sq[3] & 1) == 0 || (sq[4] & 1) == 0 || (sq[5] & 1) == 0 || (sq[6] & 1) != 0 || (sq[7] & 1) == 0 || (sq[8] & 1) != 0) return false; if ((sq[0] + sq[1] + sq[2] != c15) || (sq[3] + sq[4] + sq[5] != c15) || (sq[6] + sq[7] + sq[8] != c15) || (sq[0] + sq[3] + sq[6] != c15) || (sq[1] + sq[4] + sq[7] != c15) || (sq[2] + sq[5] + sq[8] != c15) || (sq[0] + sq[4] + sq[8] != c15) || (sq[2] + sq[4] + sq[6] != c15)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; }

Reveal the truth

The right is only slightly more than 10% faster. The thing is, although this heuristic is great in narrowing, the computation itself is so light, the cost of heuristic eats up its benefit.

Round 5. Direct solution vs. the central 5 heuristic

We can choose less effective but cheaper heuristic. For instance, we know that all the 3x3 magic squares have 5 in the center. Let's try and exploit this fact.

Also, this makes the main check a little bit cheaper since we don't have to check for the central 5 in a sum anymore.

auto c15 = '5' * 3; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != c15) || (sq[3] + sq[4] + sq[5] != c15) || (sq[6] + sq[7] + sq[8] != c15) || (sq[0] + sq[3] + sq[6] != c15) || (sq[1] + sq[4] + sq[7] != c15) || (sq[2] + sq[5] + sq[8] != c15) || (sq[0] + sq[4] + sq[8] != c15) || (sq[2] + sq[4] + sq[6] != c15)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; } auto magic_number = '5' * 3; auto not_so_magic_number = '5' * 2; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if(sq[4] != '5') return false; if ((sq[0] + sq[1] + sq[2] != magic_number) || (sq[3] + sq[5] != not_so_magic_number) || (sq[6] + sq[7] + sq[8] != magic_number) || (sq[0] + sq[3] + sq[6] != magic_number) || (sq[1] + sq[7] != not_so_magic_number) || (sq[2] + sq[5] + sq[8] != magic_number) || (sq[0] + sq[8] != not_so_magic_number) || (sq[2] + sq[6] != not_so_magic_number)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; }

Reveal the truth

The right is more than 50% faster. But it shouldn't be. Trying to understand why it is, I got it wrong the first time. Luckily, I was corrected. [spoilers!] Learning the value of good benchmarking technique with C++ magic squares is an excellent explanation of why my benchmark sucks. Long story short, branch prediction. When the numbers come in order, this 5-check works in a very branch prediction friendly way. The data have a pattern. It's 26244 not fives, then 6561 fives, then 52488 not fives, and then its 6561 fives again. Of course, it's cheap. But with randomized input, it isn't. It's still cheap, sure, but not enough to pay off own weight. The code on the right loses a bit to the code on the left.

Round 6. Fair magic test vs. precached answers

There are only 8 3x3 magic squares in existence. So do we really have to test the string against the magic definition? Why couldn't we precache the magic values somewhere and simply compare with that?

auto c15 = '5' * 3; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != c15) || (sq[3] + sq[4] + sq[5] != c15) || (sq[6] + sq[7] + sq[8] != c15) || (sq[0] + sq[3] + sq[6] != c15) || (sq[1] + sq[4] + sq[7] != c15) || (sq[2] + sq[5] + sq[8] != c15) || (sq[0] + sq[4] + sq[8] != c15) || (sq[2] + sq[4] + sq[6] != c15)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; } const std::array<std::string, 8> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { for(auto i = 0u; i < 8; ++i) if(sq == all_magic_squares[i]) return true; return false; }

Reveal the truth

The right is more than 10 times slower. Apparently, string comparisons are way heavier than the whole testing thing.

Round 7. String array vs. string set

Well, strings comparisons are relatively slow. But we don't have to do all eight comparisons every time. If we organize our comparison work-flow as a tree-like structure, we could minimize the number of comparisons to somewhere form 1 to 3 per string.

Luckily we don't have to reinvent the tree, we can take the standard set instead.

const std::array<std::string, 8> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { for(auto i = 0u; i < 8; ++i) if(sq == all_magic_squares[i]) return true; return false; } const std::set<std::string> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { return all_magic_squares.find(sq) != all_magic_squares.end()); }

Reveal the truth

The right is considerably better. Well, better than the string array.

Round 8. String set vs. string unordered set

We have another structure that seems promising in terms of performance. The unordered set is known to have constant complexity for the cases with no collisions. Since our data is small, it is unlikely to cause collisions in a hash map. Let's try that.

const std::set<std::string> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { return all_magic_squares.find(sq) != all_magic_squares.end()); } const std::unordered_set<std::string> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { return all_magic_squares.find(sq) != all_magic_squares.end()); }

Reveal the truth

The usual set works better here. Yes, we saved on comparisons, but the hash function for a string turns out to be rather heavy itself.

Round 9. String array vs. string array plus the oddity heuristic

Now, when we have heavier body, maybe we can gain more from the heuristic. Let's try our oddity one here.

const std::array<std::string, 8> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { for(auto i = 0u; i < 8; ++i) if(sq == all_magic_squares[i]) return true; return false; } const std::array<std::string, 8> all_magic_squares = { "816357492", "492357816", "618753294", "294753618", "834159672", "672159834", "438951276", "276951438" }; bool check_if_magic(const std::string& sq) { if ( (sq[0] & 1) != 0 || (sq[1] & 1) == 0 || (sq[2] & 1) != 0 || (sq[3] & 1) == 0 || (sq[4] & 1) == 0 || (sq[5] & 1) == 0 || (sq[6] & 1) != 0 || (sq[7] & 1) == 0 || (sq[8] & 1) != 0) return false; for(auto i = 0u; i < 8; ++i) if(sq == all_magic_squares[i]) return true; return false; }

Reveal the truth

And it works like a charm.

Round 10. Direct solution vs. answers in uint64_t.

Since we know that the center value is always five, we don't have to store it in our answers either. This leaves us with exactly 8 bytes per string that can be cheaply encoded as a single 64-bit value. Why don't we store these values instead of strings?

auto c15 = '5' * 3; uint_fast64_t ideal_char_map = static_cast<uint_fast64_t>(0x1FF) << 49; uint_fast64_t char_map_one = 1u; bool check_if_magic(const std::string& sq) { if ((sq[0] + sq[1] + sq[2] != c15) || (sq[3] + sq[4] + sq[5] != c15) || (sq[6] + sq[7] + sq[8] != c15) || (sq[0] + sq[3] + sq[6] != c15) || (sq[1] + sq[4] + sq[7] != c15) || (sq[2] + sq[5] + sq[8] != c15) || (sq[0] + sq[4] + sq[8] != c15) || (sq[2] + sq[4] + sq[6] != c15)) return false; auto char_map = ideal_char_map; for(auto i = 0u; i < 9; ++i) char_map ^= char_map_one << sq[i]; if (char_map != 0) return false; return true; } const std::array<uint64_t, 8> magic_numbers { 3545515123101087289, 3690191062107239479, 3544956562637535289, 3978984379655991859, 3689073941180135479, 4123101758198592049, 3977867258728887859, 4122543197735040049 }; bool check_if_magic(const std::string& sq) { if(sq[4] != '5') return false; uint64_t magic_number = *(reinterpret_cast<const uint32_t*> (sq.data())); magic_number <<= 32; magic_number += *(reinterpret_cast<const uint32_t*> (sq.data()+5)); for(auto i = 0u; i < 8; ++i) if(magic_number == magic_numbers[i]) return true; return false; }

Reveal the truth

It's fast! In fact, it is as fast as our best choice from the earlier.

Round 11. An array on uint64_t vs. a set of uint64_t.

Let's now use a set instead of an array here. There should be fewer comparisons anyway, but would it help?

const std::array<uint64_t, 8> magic_numbers { 3545515123101087289, 3690191062107239479, 3544956562637535289, 3978984379655991859, 3689073941180135479, 4123101758198592049, 3977867258728887859, 4122543197735040049 }; bool check_if_magic(const std::string& sq) { if(sq[4] != '5') return false; uint64_t magic_number = *(reinterpret_cast<const uint32_t*> (sq.data())); magic_number <<= 32; magic_number += *(reinterpret_cast<const uint32_t*> (sq.data()+5)); for(auto i = 0u; i < 8; ++i) if(magic_number == magic_numbers[i]) return true; return false; } const std::set<uint64_t> magic_numbers { 3545515123101087289, 3690191062107239479, 3544956562637535289, 3978984379655991859, 3689073941180135479, 4123101758198592049, 3977867258728887859, 4122543197735040049 }; bool check_if_magic(const std::string& sq) { if(sq[4] != '5') return false; uint64_t magic_number = *(reinterpret_cast<const uint32_t*> (sq.data())); magic_number <<= 32; magic_number += *(reinterpret_cast<const uint32_t*> (sq.data()+5)); return magic_numbers.find(magic_number) != magic_numbers.end(); }

Reveal the truth

Well, on this data a simple loop works better. Perhaps, it's because we only need to go through 64 bytes and it fits into a single cache line. We lose this compactness with the set.

Round 12. An array on uint64_t vs. an unordered set of uint64_t.

Could we gain some speed from using unordered set?

const std::array<uint64_t, 8> magic_numbers { 3545515123101087289, 3690191062107239479, 3544956562637535289, 3978984379655991859, 3689073941180135479, 4123101758198592049, 3977867258728887859, 4122543197735040049 }; bool check_if_magic(const std::string& sq) { if(sq[4] != '5') return false; uint64_t magic_number = *(reinterpret_cast<const uint32_t*> (sq.data())); magic_number <<= 32; magic_number += *(reinterpret_cast<const uint32_t*> (sq.data()+5)); for(auto i = 0u; i < 8; ++i) if(magic_number == magic_numbers[i]) return true; return false; } const std::unordered_set<uint64_t> magic_numbers { 3545515123101087289, 3690191062107239479, 3544956562637535289, 3978984379655991859, 3689073941180135479, 4123101758198592049, 3977867258728887859, 4122543197735040049 }; bool check_if_magic(const std::string& sq) { if(sq[4] != '5') return false; uint64_t magic_number = *(reinterpret_cast<const uint32_t*> (sq.data())); magic_number <<= 32; magic_number += *(reinterpret_cast<const uint32_t*> (sq.data()+5)); return magic_numbers.find(magic_number) != magic_numbers.end(); }

Reveal the truth

Apparently, not. Unordered sets may work wonders for large and uniformly distributed data. But our loop is just too small to beat.

Congratulations! You scored pixels of error. I'm not sure if it's good or bad, but if you leave every answer untouched as “exactly equal”, there is pixels of error.

Conclusion

Probably, I shouldn't talk for everyone but in my case, the conclusion would have been: “bad measurements beat good intuition nine times out of ten”. Even though I do have a long history of making things fast, this extreme case — small data, cheap body — neglected my experience to naught. String operations seemed to be cheaper, heuristics — more effective and there were also high hopes for the standard containers. Well, at least slightly higher hopes.

I've learned something from this experience. Hope you did too. If so, then no matter what your score is, you are still a winner.

Links

Further reading: Learning the value of good benchmarking technique with C++ magic squares.

Reddit discussions: r/programming; r/cpp

Github with all the experiments.