Last time I explained how Bernard was born and how he became smart enough to solve any Sudoku puzzle in the world. A fine achievement for any program, but one that was difficult for Bernard to show off at the various aristocratic parties he has become accustomed to attending since he acquired overnight fame on Medium.

You see, when guests requested to see a demonstration of his marvellous talent, they would hand him a puzzle from a sheet of newspaper and eagerly await the result, bristling with anticipation. By the time I explained (with as much tact as I could muster) that Bernard was in fact blind and you would need to communicate the board to him as a Python string, the aforementioned guest would be too embarrassed, confused, or simply bored to continue. This did nothing for Bernard’s self-esteem.

As such, I took up the challenge of teaching Bernard to see so that he might more easily express his talent.

Like before, I won’t be linking to the GitHub repository just yet as it contains spoilers for the next part, but I encourage anyone so inclined to follow along with the code presented here.

What Are We Looking For?

Let’s start with a photo of a Sudoku board that we will teach Bernard to understand.

Figure 1: Original image of a Sudoku board

The board is at the centre of the image and it’s really the only bit we’re interested in. The first thing we’ll want to do is separate the wheat from the chaff. The minimum information we need to do that is the four corners of the grid.

A human would trivially be able to identify the 4 corners of the grid and wouldn’t have to think about how they do it. Since humans are too lazy and stupid, their brains handle this automatically for them. Sadly, this isn’t as easy for our technological troubadour and we will need to get a bit creative.

How Computers Can See

Programming with images is quite complex and involves a lot of non-trivial mathematics. Luckily for us, there is an open source project, OpenCV, which has been made especially for these tasks. Luckier still, there are Python bindings for this project which Bernard will understand. Use pip install opencv-python to install the module and its dependencies. The following snippet reads an image from disk:

Snippet 1: Loads the image using OpenCV

If we inspect the type of img we can see that it's actually a Numpy array. Numpy is a Python module optimised for mathematical operations with matrices, and is certainly worth reading up on.

When we look at the content of img it appears at first glance to be somewhat incomprehensible. However, when we use imshow to display the image on the screen it looks fine, so clearly the computer is comprehending it. The structure is actually quite simple: the array is three-dimensional, with the first two dimensions being the height and width of the image, so you can think of it as a 2-D grid of pixels where each pixel is a triple of Blue-Green-Red (BGR) colour channels (OpenCV defaults to BGR, not RGB). Each channel is an integer between 0 and 255, so (0, 0, 0) is pure black and (255, 255, 255) is pure white. If we choose to read the image as greyscale instead (use cv2.IMREAD_GRAYSCALE), the colour information is lost and the array becomes a 2-D matrix where each element is a single integer between 0 and 255, representing the brightness of the pixel.
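To make the structure concrete, here’s a tiny hand-built BGR image; note that the first channel is blue, not red:

```python
import numpy as np

# A 2x2 BGR image built by hand: channel order is Blue-Green-Red
img = np.array([[[255, 0, 0], [0, 0, 255]],
                [[0, 0, 0], [255, 255, 255]]], dtype=np.uint8)

print(img.shape)   # (2, 2, 3): height, width, channels
print(img[0, 0])   # [255 0 0] -> a pure blue pixel, not red
print(img[1, 1])   # [255 255 255] -> pure white
```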

Back to the task at hand: we need a method for identifying the grid in the centre. We’re going to take advantage of the fact that the Sudoku grid will likely be the largest complete feature in the image, assuming the photographer was competent and not attempting some elaborate and unusually cruel deception. The grid in the middle is the only relevant information in the image, so we want the coordinates of its four corners in order to keep the grid and discard the nonsense.

There are functions we can use to try to detect the corners in the image, but before doing that we should process the image to improve the reliability of those operations. At the moment the image is mostly greys, with each pixel having some grey value between 0 and 255. This is quite noisy for the corner-detection operations we will perform. A way to reduce this is by applying binary thresholding. These algorithms reduce an image to pure black and white, based on the contrast. Global thresholding makes the split based on a single threshold measured from the entire image, whereas adaptive thresholding calculates a threshold for each pixel based on the mean value of the surrounding pixels. This is useful when the contrast is uneven across parts of the image, which is pretty likely given the nature of human photographers.

Snippet 2: Demonstrates differences in thresholding algorithms

Figure 2: Binary and adaptive thresholding outcomes

We can also blur the image beforehand to reduce the noise picked up by the thresholding algorithm and dilate the image to increase the thickness of the lines. It’s worth doing anything we can to make this as easy as possible for Bernard; he’ll thank us later.

Snippet 3: Image pre-processing algorithm using Gaussian blurring, adaptive thresholding and dilation.

Figure 3: Shows the steps in the preprocessing method

Compare the adaptive threshold images from Figures 2 and 3 to see the noise reduction when we use blurring: in this case less detail is more. Now that we’ve reduced the complexity of the image considerably while retaining the feature we are interested in, the only bit left is to have Bernard pick out the right shape. As the photographer is trying to take a photo of the Sudoku board, we can assume that it is the largest single feature in the image, so long as the photographer isn’t an ignoramus.

A contour is a curve joining the continuous points along a boundary that share the same intensity. Since we reduced the image to binary tones using thresholding, we can use the findContours function to detect the contours in the image:

Snippet 4: Finds the contours in the image

Figure 4: Contours detected in the image

We can see from Figure 4 that the contours describe the shapes we need incredibly well, and that we only need the external contours to extract the board from the rest of the image. The function cv2.contourArea is handy here, as we can use it to easily find the largest feature in the image. We could also use the Ramer-Douglas-Peucker algorithm to approximate the number of sides of each shape, which would allow us to filter for rectangular objects only. However, during testing I found that this constraint gave false negatives when folds in the page gave the illusion that the grid had 5 sides instead of 4. As such, it was more reliable to do away with it and select the grid based solely on area.

Finally, getting the four corners of the grid is easy with the following logic:

Top left point has the smallest x and smallest y coordinate, so minimise x + y.

Top right point has the largest x and the smallest y coordinate, so maximise x - y.

Bottom right point has the largest x and the largest y coordinate, so maximise x + y.

Bottom left point has the smallest x and the largest y coordinate, so minimise x - y.

Snippet 5: Finds the four most extreme corners from the largest contour
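The corner logic above can be sketched like so (the helper name find_corners is my own):

```python
import numpy as np

def find_corners(contour):
    """Return [top-left, top-right, bottom-right, bottom-left] points of a
    contour using the min/max of x + y and x - y, as described above."""
    pts = contour.reshape(-1, 2)      # OpenCV contours have shape (N, 1, 2)
    sums = pts.sum(axis=1)            # x + y for every point
    diffs = pts[:, 0] - pts[:, 1]     # x - y for every point
    return [pts[np.argmin(sums)],     # top left: minimise x + y
            pts[np.argmax(diffs)],    # top right: maximise x - y
            pts[np.argmax(sums)],     # bottom right: maximise x + y
            pts[np.argmin(diffs)]]    # bottom left: minimise x - y
```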

Figure 5: Detected corners of the grid

A roaring success I think; I dare say I couldn’t have done a better job myself. The next step is to cut out the part of the image we need and throw away the garbage. Notice, though, that the photo was taken at a slight angle, so whilst our points map out the grid, it’s a skewed quadrilateral rather than a perfect square. The first solution I tried for this problem was to berate the photographer and repeatedly hit them on the nose with a newspaper until they provided a better photograph. Sadly, the results of this method were highly inconsistent. Serendipitously, OpenCV has a function, warpPerspective , that can do exactly what we want. It is an implementation of the perspective transform equation, the non-vector form of which is given by:

Equation 1: Non-vector perspective transform function
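For reference, the standard non-vector form of the perspective transform, using the eight constants described below, can be written as:

```latex
X = \frac{a x + b y + c}{g x + h y + 1}, \qquad
Y = \frac{d x + e y + f}{g x + h y + 1}
```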

where X, Y and x, y are the new and old coordinates respectively and the remaining variables are constants that we need to calculate. We can do this by solving simultaneous equations mapping 4 coordinates from the original image onto their new locations in the new image. This is easy to do with the four corners of the grid we obtained earlier: we just arrange the points as a square roughly the same size as the original grid. With a bit of elbow grease, this can be rearranged into a matrix-vector product, and any brave souls can look to Criminisi et al. for inspiration if they want to solve this by hand. I’d advise wearing a harness and appropriate footwear, and I officially bear no responsibility for any injuries or fatalities incurred. For Bernard, we can just tell him to use the warpPerspective function and he'll diligently solve the equation:

Snippet 6: Uses the 4 corner points and perspective transform to crop the grid out of the image.

Figure 6: Grid cropped from the image and warped into a square

This is looking way better: the amount of irrelevant nonsense left in the image is sufficiently decreased, and the warp transformation has made it look like we are viewing the board directly instead of at an angle. The next part is to have Bernard identify the cells of the board. The obvious thing to do would be to divide the grid into 81 evenly sized squares, since we have already warped it into a square. In this case, it pays to be obvious.

Snippet 7: Estimates the grid and square locations based solely on the shape of Sudoku grids
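Dividing the warped square into 81 cells can be sketched as follows (coordinates are (x, y) pairs and the function name is my own):

```python
def infer_cells(side):
    """Split a side x side board into 81 equally sized cells, returned as
    (top-left, bottom-right) coordinate pairs, row by row."""
    step = side / 9
    return [((int(col * step), int(row * step)),
             (int((col + 1) * step), int((row + 1) * step)))
            for row in range(9) for col in range(9)]
```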

Figure 7: The estimated grid layout

Not bad, but not perfect either: a lot of cells have overlapping grid lines and the digits aren’t very well centred. This is all the kind of thing that doesn’t really bother your unconscious brain, but to Bernard, trying really hard to understand what he’s seeing, it can be incredibly confusing. It is therefore our moral duty and obligation to simplify it further. Taking advantage of the fact that Sudoku grids are always simple and can only contain single digits in the middle of each box, we can implement a function that finds the largest connected pixel structure around the centre of each square. This is easier if we use similar pre-processing to when we were looking for the grid corners. If we are fortunate enough to find something, we can pull it out and centre it in a new square of a fixed size, scaling it if we have to.

Snippet 8: Gets the largest blob out of the centre of each grid square

Figure 8: Extracted digit information with the original image for comparison

Easy peasy. Well, no, actually, that wasn’t easy at all: it was an absurd amount of work to get this far, and Bernard still can’t recognise the digits as anything but amorphous blobs. This is the crux of what makes this problem fascinating to me. It was trivial to write a program that could solve any puzzle of any difficulty in less than a second, puzzles that a certain type of human might spend hours on just for fun. However, we’ve had to use a range of mathematical methods, a huge library of pre-programmed functions and a number of optimisation techniques simply to be able to unreliably tell where the boxes are and whether there’s stuff in them. This dichotomy is evidenced throughout life: catching a ball is easy, explaining precisely how you did it is not.

Your brain subconsciously performs image recognition for you via a series of neurons in the visual ventral stream. It does this horrendously quickly and vastly more reliably than the hackjob maths we programmed, and it manages it because the neuron hierarchy is phenomenally complex. When this part of the brain is working properly you don’t notice it at all, but when it doesn’t, you might start mistaking hats for wives. Armed with this knowledge, next time we’re going to go full Wizard of Oz and try to give Bernard a brain.

Acknowledgements

The OpenCV project was vital to this part of the project, a thoroughly excellent set of tools for performing this image processing.

Utkarsh Sinha’s tutorial on AI Shack for image processing methods regarding grid detection and digit extraction.

Cáp Hữu Quân’s blog on solving Sudoku for image processing techniques regarding grid detection.

eatonk’s project for method ideas and Sudoku images.

pajwaker’s project for method ideas and Sudoku images.

Yixin Wang’s Sudoku Solver paper for cell centering using structuring elements. A clever method that we’re not employing here, but interesting nonetheless.

A. Criminisi, I. Reid and A. Zisserman’s online paper A Plane Measuring Device and the companion lecture, Perspective Transform Estimation, from the University of Arizona, for their excellent explanation of perspective transformation.

P.S.

For the lazy among you who have skipped reading or performing the tutorial yourselves or simply work for Blue Peter, here’s a link to the finished solution.