Introduction

Under the guidance of Prof. Jason Polakis, I had the opportunity to work on a web security project alongside Varshini Sampath and Saumya Solanki. In this work, we showed how audio-based CAPTCHAs, introduced as an accessible alternative for those unable to use the more common visual CAPTCHAs, can be exploited in a way that nullifies their purpose. This research was published in the Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) and was also presented at USENIX ScAINet 2018.

Terminology

CAPTCHA (referred to as “captchas” from here on): A program or system intended to distinguish humans from computers as a way to prevent spam. These often appear as garbled text that is meant to be recognizable by humans but not by computers.

Challenge/Problem: The test presented to users so they can validate themselves as human.

Solver: An automated system built to ‘hack’ the challenge and present itself as a human. This process is referred to as ‘solving’ or ‘breaking’ from here on, and a completed challenge is therefore considered ‘solved’ or ‘broken’.

Accessibility in CAPTCHA systems

Text-based captchas, the most common kind, were already being broken at a rate of 77% back in 2009. As captcha-breaking systems keep improving thanks to advances in computer vision, the captchas themselves are becoming increasingly hard on humans, to the point where many are extremely difficult for people to read.

These captchas, already hard for typical users, present an insurmountable challenge to users with accessibility requirements. A 2011 study showed that three groups of people had trouble with visual captchas: visually impaired users, who constitute 2.6% of the world’s population, users with dyslexia, and users with motor impairments caused by conditions like Parkinson’s.

Thus, audio captchas were introduced specifically to help make computer systems accessible to the visually impaired. The icon shown above triggers an audio challenge, in which the system speaks out a sequence of alphanumeric characters for a human to identify. The caveat with this approach is that audio captchas open new ground for exploitation by attackers. Some researchers argue that automated systems must be able to solve fewer than 1 out of 10,000 captchas (0.01%), while others are more generous and put the limit at 1% to 5%. There is no universally agreed-upon success rate at which a captcha system is considered compromised.
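To make the disagreement over thresholds concrete, here is a minimal sketch that checks a solver's measured success rate against whichever limit you choose. The function and constant names are made up for illustration; only the percentages come from the text above.

```python
# Thresholds quoted above; which one applies is a matter of debate.
STRICT_THRESHOLD = 1 / 10_000       # 0.01%
LENIENT_THRESHOLDS = (0.01, 0.05)   # 1% to 5%


def success_rate(solved: int, attempts: int) -> float:
    """Fraction of challenges an automated solver answered correctly."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return solved / attempts


def is_compromised(solved: int, attempts: int, threshold: float) -> bool:
    """True when the solver's success rate reaches the chosen threshold."""
    return success_rate(solved, attempts) >= threshold
```

A solver that breaks 3 out of 100 challenges would count as a compromise under the strict limit and the 1% limit, but not under the 5% limit.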

It is now extremely hard to devise a good captcha mechanism. Adding noise to thwart automated systems makes challenges harder for genuine users as well. Having a highly secure yet highly usable captcha seems impossible.

While there has been a lot of prior work on breaking text and image captchas, audio captchas have not been studied in as much detail. We built an automated system that solves audio captchas using existing speech recognition services. We found that the latest version of Google’s ReCAPTCHA at the time of publication was very vulnerable to our attack. In the paper, we also discussed how all other audio captchas can be solved by leveraging existing speech recognition services.

Our approach for breaking audio CAPTCHAs

After trying out a myriad of approaches, we manually uploaded captcha audio files to a speech recognition service. We used IBM Watson’s Speech to Text service and were able to transcribe the audio files, with an accuracy of over 90% for ReCAPTCHA. For the other services, where the accuracy was lower, IBM Watson returned a few alternative transcriptions that included the correct one. This indicated that an attack built on such speech-to-text services was feasible.
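One way alternative transcriptions can be put to use is to walk through them and keep the first one that looks like a valid answer. The sketch below is an assumption on my part, not the method from the paper: the function name, the best-first ordering of the alternatives, and the "5 to 10 digits" pattern are all hypothetical.

```python
import re
from typing import Iterable, Optional


def pick_alternative(alternatives: Iterable[str],
                     pattern: str = r"[0-9]{5,10}") -> Optional[str]:
    """Return the first alternative that fits the expected answer format.

    `alternatives` is assumed to be ordered best-first. `pattern` is a
    hypothetical description of a valid answer (here: 5 to 10 digits).
    """
    expected = re.compile(pattern)
    for text in alternatives:
        candidate = text.replace(" ", "")  # "4 7 2 9 1" -> "47291"
        if expected.fullmatch(candidate):
            return candidate
    return None  # no alternative looked like a plausible answer
```

For example, given the alternatives "4 7 to 9 1" and "4 7 2 9 1", the first is rejected because of the stray word and the second is returned as "47291".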

Some speech recognition services we evaluated.

Our goal was to demonstrate how state-of-the-art deep learning systems have significantly lowered the bar for attackers. They can now deploy effective attacks without needing to develop sophisticated ML systems of their own (as was done in prior work against audio captchas). We narrowed our evaluation down to three main voice recognition services: Google’s Cloud Speech API, IBM Watson’s Speech to Text API, and Facebook’s Wit.ai service.

These services offered options such as choosing between different accents of English and specifying keywords to look for in the transcription. We evaluated the following captcha service providers: Google’s ReCAPTCHA (multiple versions), BotDetect, Captchas.net, Securimage, and Telerik, as well as some websites that implement their own captchas: Apple and Microsoft Live.

While evaluating the audio files from all the captcha sources, we found that some services read out digits, some used letters or NATO phonetic alphabet words, and some used a combination of the two. The challenges were between 5 and 10 alphanumeric characters long.

An illustration of how our system works.

(The following numbered list corresponds to the steps in the above diagram)

1. Our system visits a webpage, identifies the captcha element within the page, and extracts the audio challenge.
2. The audio file is preprocessed (removing instructions, format conversion, etc.) and prepared to be sent to the speech recognition service.
3. The audio file is uploaded for transcription with a predetermined configuration. The configuration allows us to fine-tune the transcription (e.g., to only expect numbers) as well as the accent (US, UK, etc.) to use.
4. The transcription of the audio challenge is obtained from the API and passed to the post-processing component. For instance, “Alpha Bravo” gets substituted with “AB”.
5. The final solution is forwarded to the browser automation module, which is responsible for submitting the solution to the web page.
6. The browser automation module inputs the solution while mimicking user behavior, as certain captcha services have deployed extra checks for identifying bots. It also verifies whether the solution was accepted by the captcha service.
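The post-processing step described above can be sketched roughly as follows. This is an illustration rather than our actual code: the lookup tables, the `postprocess` name, and the choice to silently drop unrecognized words are all assumptions.

```python
# NATO phonetic alphabet, including common alternative spellings.
NATO = {
    "alpha": "A", "alfa": "A", "bravo": "B", "charlie": "C", "delta": "D",
    "echo": "E", "foxtrot": "F", "golf": "G", "hotel": "H", "india": "I",
    "juliett": "J", "juliet": "J", "kilo": "K", "lima": "L", "mike": "M",
    "november": "N", "oscar": "O", "papa": "P", "quebec": "Q", "romeo": "R",
    "sierra": "S", "tango": "T", "uniform": "U", "victor": "V",
    "whiskey": "W", "xray": "X", "yankee": "Y", "zulu": "Z",
}
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def postprocess(transcript: str) -> str:
    """Collapse a spoken transcript into the characters to type in."""
    solution = []
    # Lowercase and drop hyphens so "X-ray" matches the "xray" key.
    for word in transcript.lower().replace("-", "").split():
        if word in NATO:
            solution.append(NATO[word])
        elif word in DIGITS:
            solution.append(DIGITS[word])
        elif len(word) == 1 and word.isalnum():
            solution.append(word.upper())  # already a single letter or digit
        # Assumption: any other word is noise and is dropped.
    return "".join(solution)
```

With this, “Alpha Bravo” becomes “AB”, and a mixed transcript such as “seven X-ray 4 charlie” becomes “7X4C”.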

We then logged each result to a CSV file along with a timestamp, and stored the audio file under a name containing the same timestamp. This allowed us to check why certain solutions failed and how we could improve them.
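A logging helper along these lines would be enough to reproduce that bookkeeping. The column names, the timestamp format, and the `log_attempt` function are hypothetical; the only parts taken from our setup are the CSV file and the timestamp-named audio copy.

```python
import csv
import shutil
from datetime import datetime, timezone
from pathlib import Path


def log_attempt(log_path: Path, audio_dir: Path, audio_file: Path,
                service: str, transcript: str, accepted: bool) -> Path:
    """Append one result row and keep a timestamp-named copy of the audio."""
    # Colon-free UTC timestamp so it is safe to use as a file name.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")

    audio_dir.mkdir(parents=True, exist_ok=True)
    saved = audio_dir / f"{stamp}{audio_file.suffix}"
    shutil.copy(audio_file, saved)

    is_new = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:  # write the header only once
            writer.writerow(["timestamp", "service", "transcript",
                             "accepted", "audio_file"])
        writer.writerow([stamp, service, transcript, accepted, saved.name])
    return saved
```

Because the row and the audio copy share the same timestamp, a failed attempt in the CSV can be traced back to the exact audio file that caused it.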

Accessibility problems with audio captchas

Audio challenges have long been the de facto solution for enabling accessibility in tandem with visual captchas. Despite their limitations (they are hard for non-native speakers, younger users, and users with learning or language disabilities), no alternative has been widely deployed.

Research by Chellapilla et al. argues that human-friendly Human Interaction Proofs (HIPs) must approach a human success rate of at least 90%. However, a usability study with six blind participants on Google’s ReCAPTCHA found that they were able to solve only 46% of the audio challenges, taking 65 seconds on average. A study by Bigham et al. with blind high school students found that none of them were able to solve the audio captchas. In a subsequent study with 89 blind users, participants achieved only a 43% success rate in solving 10 popular audio captchas.

The same study by Bigham et al. also found that the screen readers used by blind users speak over the audio captcha while it plays. As users navigate to the answer box, the accessibility software continues reading out the interface, talking over the audio challenge. The audio does not pause while users type their answer. Reviewing an audio captcha is cumbersome, often requiring the user to start again from the beginning. Replaying it also requires users to navigate away from the answer box in order to reach the controls of the audio player.

Conclusion

At the time our research was published, an estimated 200 million captchas were being solved each day, resulting in a cumulative loss of over 500,000 man-hours per day. Research has shown the audio challenges of popular captcha services to be much more difficult than their visual counterparts. Adding further distortion to existing audio captchas would hinder visually impaired users even more in accessing the Web.
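As a quick sanity check on those two estimates, dividing one by the other gives the average time implied per captcha. The variable names below are only for illustration, and since the hours figure is quoted as "over 500,000", the result is a lower bound.

```python
CAPTCHAS_PER_DAY = 200_000_000
HOURS_LOST_PER_DAY = 500_000  # quoted as "over" this, so a lower bound

# Convert the daily hours to seconds and spread them over the daily captchas.
seconds_per_captcha = HOURS_LOST_PER_DAY * 3600 / CAPTCHAS_PER_DAY
print(seconds_per_captcha)  # 9.0
```

That works out to at least 9 seconds per captcha on average, which is far below the 65 seconds blind participants needed for an audio challenge.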

There have been suggestions to kill off captchas altogether. Our study adds fuel to that argument and calls for better security mechanisms to deal with this problem.

Here’s the link to our original research paper. If you enjoyed reading this article, please consider leaving some 👏👏👏 and follow me on Medium!

Note: As of 4th March 2019, W3C has approved WebAuthn as a standard for password-free logins, which will perhaps get rid of captchas altogether 🙌