Using Torch code to break simplecaptcha with 92% accuracy

Captcha is used as a common tactic to stop bots from entering a website. Any visitor to a website is presented with a text image containing some letters which can be read by a human and not by automated bots. They are quite frequently used to stop automated password hacking or automated login to websites etc. The following is taken from the wikipedia page[1] of captcha. As captchas are usually used to deter software programs they can be usually very hard to read and human accuracy can be around 93% [2]. It also takes something like 10 secs to read a captcha. As can be seen this takes quite a toll on the user experience.

Breaking Captchas

There are a few approaches to defeating CAPTCHAs: using cheap human labor to recognize them, exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA, and finally improving character recognition software. There are numerous criminal services which solve CAPTCHAs automatically.

Neural networks have been used with great success to defeat CAPTCHAs as they are generally indifferent to both affine and non-linear transformations. As they learn by example rather than through explicit coding, with appropriate tools very limited technical knowledge is required to defeat more complex CAPTCHAs.

The only part where humans still outperform computers is Segmentation, i.e., splitting the image into segments containing a single letter. If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on the segmentation.

We will show that with the increase of computer power(GPU) and advanced neural networks (deep learning), most of the popular captcha softwares can be broken. This has also been in news quite recently [3][4]. Here we demonstrate using a captcha software available for common public usage. This is just to show that most common captcha software are susceptible to this attack and should refrain from using captcha as a system to stop bots.

Deep Machine Learning

Machine learning can be seen as trying to find a function given examples of input and output of that function. So if y=f(x), here x can be the captcha image and y can be text to be read from the image. We are trying to fit a function f using machine learning to read text from the image. The great thing about machine learning is that if we provide the pairs of image and text, the ml is generic enough and tries to figure out the underlying function. Here we use deep neural networks as out machine learning algorithm. So the two main steps are to generate the pairs of images and text which is one of the main tasks in machine learning applications. But we can easily generate the text and image pairs by using the software used by the website. The following excerpt is taken from wikipedia ..

… the algorithm used to create the CAPTCHA must be made public, though it may be covered by a patent. This is done to demonstrate that breaking it requires the solution to a difficult problem in the field of artificial intelligence (AI) rather than just the discovery of the (secret) algorithm, which could be obtained through reverse engineering or other means.

SimpleCaptcha

Here we will try to break Simple Captcha which is a java based captcha software. Some examples of captchas generated are

This library we picked randomly and we can see that the images are quite difficult even for humans. To create the dataset we use the software itself. I have created a dataset of 10000 samples of 5 characters each using all the effects and noises available in the library to make it harder.

Torch

We use torch to train the neural network and VGG based deep neural network architecture. We train on a GPU Nvidia 780 Titan. You can check similar set up for CIFAR dataset at this blog post. The main difference is the criterion, as the output of a image is a sequence of characters which begs us to use RNN. But that is also not necessary here as we use our custom MultiCrossEntropyCriterion and is sufficient to get good enough results. I have committed the code to github repository here. We assume the dataset to be in a folder (named dataset). The shown image size is 50×100. The network can be changed accordingly to fit any other captcha. Be careful to also change the number of classes and text size in file model.lua.

First create the dataset in a folder in the required format as shown. Then edit the parameters in captcha.lua to suit your needs: set validation data size

set batch size to fit the training of network in your GPU

set the number of iterations to run

set learning rate, learning rate decay and momentum

I have not included the training data or code to generate the dataset, but should be simple enough to do. Using this setup and training for a couple of hours I was able to get 92% accuracy on the dataset.

Conclusion

So without much effort and using standard deep learning library and generating datasets automatically and using huge computational power, we can break standard captcha systems in a days work. This shows that reliance on captchas for stopping bots should be stopped. Making captchas harder only makes it harder for humans and not this attack. Using RNN with attention can improve results further. And using custom captcha generator only makes it slightly difficult in generating dataset which can be circumvented by humans (mechanical turk).

Future

So what are the alternatives to captchas. Use recaptcha by google. They have realized that captchas are susceptible to attacks and hence moved away from them. They used advanced machine learning techniques to monitor bot traffic and stop them.

Links

Click to access burszstein_2010_captcha.pdf

Click to access burszstein_2010_captcha.pdf