As one of the most widely used and recognizable captchas on the planet, reCaptcha and its “I am not a robot” verification are the target of regular circumvention attempts. In response, reCaptcha has been updated frequently, with its most recent major version, reCaptcha 3.0, launching toward the end of 2018. For the lulz, I wrote a reCaptcha solver and examined the security updates that have been made. From this, I concluded that:

Requests to solve an audio captcha instead of the visual captcha now result in increased scrutiny by reCaptcha, making previous audio captcha solvers largely nonviable for real-world spambots.

Due to improvements in machine learning and image classification, reCaptcha has given increasingly less value to the image-identifying aspect of captcha solving as well, despite trusting it more than the audio captcha (and the image captcha has been retired as of reCaptcha v3.0).

For a reCaptcha solver to be plausibly used in real-world scenarios (i.e., scenarios that use cheap and/or free proxies), it must be able to farm Google cookies.

These points, as well as an overview of reCaptcha, reCaptcha solving, anti-spam practices, and general implications, are included below.

Context and Earlier Security Research

Earlier Security Research

Here are two of the most useful and notable projects on breaking reCaptcha (and similar captchas), the research and code from which are referenced and expanded on in various places throughout this post.

I'm Not a Human: Breaking the Google Recaptcha

If you want to learn about “click the image” captchas and how to approach building bots to solve them, this presentation from the 2016 Black Hat conference and the accompanying written publication and slides are some of the best places to start:

The presentation covers the ins-and-outs of these types of captchas, an overview of how image classifiers work, and touches on a number of factors that captchas often take into account behind-the-scenes.

unCaptcha: A Low-resource Defeat of reCaptcha's Audio Challenge

A substantial amount of open-source code was made available in mid to late 2017 through unCaptcha, a reCaptcha solving project published by Kevin Bock, Daven Patel, George Hughey, and Dave Levin from the University of Maryland.

Parts of the code used in this post have been taken and/or modified from unCaptcha.

How reCaptcha Works

Point-Based Spam Filters

When discussing spam filters, incorrect statements are made constantly:

“You only have to click the fucking cars if you’re in a private browsing/incognito window.”

“You can’t solve reCaptcha on Tor.”

“reCaptcha always gives a harder captcha if you’re using a proxy.”

However, while these statements each have a very mild amount of truth in them, they are all wrong. Rather, reCaptcha uses a system wherein a number of factors are taken into account to determine a “score” for how spammy or non-spammy the user is. While the exact details are proprietary secrets, the basic concept is simple:

Factors that look suspicious, like using an IP address in a known data center (i.e., a known proxy rather than a residential IP address) increase the spam score.

Factors like having an aged Google search cookie can lower the spam score.

Once a score is calculated, different actions can be taken based on where the user’s score falls on a scale:

Very spammy requests are rejected entirely with this error message: "Your computer or network may be sending automated queries. To protect our users, we can't process your request right now. For more details visit our help page."

Moderately spammy requests are given a large number of image-clicking challenges that deliberately load in slowly to stall users.

Neutral requests are given a small number of image-clicking challenges that load in quickly.

Trustworthy requests are given the green checkmark with no captchas needing to be solved.
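To make the point system concrete, here is a minimal sketch of how signals could add up to a score and how that score could map onto the four outcomes described above. Every signal name, weight, and cutoff in it is made up for illustration; the real factors and values are proprietary and are not known.

```python
# Hypothetical weights: positive values look spammy, negative values look
# trustworthy. None of these names or numbers come from reCaptcha itself.
WEIGHTS = {
    "datacenter_ip": 40,
    "audio_challenge_requested": 30,
    "no_cookies": 15,
    "aged_google_cookie": -35,
}

# (upper bound, action) pairs, checked in order. The bounds are assumptions.
ACTIONS = [
    (0, "green checkmark, no challenge"),
    (30, "a few fast-loading image challenges"),
    (60, "many slow-loading image challenges"),
]
REJECT = "rejected: automated queries error"


def spam_score(signals):
    """Sum the weights of every signal observed on the request."""
    return sum(WEIGHTS[name] for name in signals)


def action_for(score):
    """Map a spam score onto one of the four outcomes."""
    for upper_bound, action in ACTIONS:
        if score <= upper_bound:
            return action
    return REJECT


if __name__ == "__main__":
    # A data center IP is partly offset by an aged cookie.
    print(action_for(spam_score(["datacenter_ip", "aged_google_cookie"])))
    # A data center IP plus an audio request crosses the rejection cutoff.
    print(action_for(spam_score(["datacenter_ip", "audio_challenge_requested"])))
```

The useful takeaway is that no single factor decides the outcome. A bad signal can be offset by a good one, which is why blanket statements like “you can’t solve reCaptcha on Tor” end up being wrong.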



This is the same type of system used for many other spam filters, such as those used to determine whether an email should go to a user’s inbox, spam folder, or be rejected by the server entirely.

reCaptcha Versions

Cracking reCaptcha

Creating a bot that is capable of solving reCaptcha requires three parts:

Code to interact with the reCaptcha user interface to click buttons, code to handle error messages from reCaptcha, and so on.

A solution to solve or bypass the reCaptcha challenges. There are three general possibilities: solving the audio captcha, solving the visual captcha, or creating a cookie farm to prevent the captcha from requiring a challenge to be solved at all.

A setup that meets certain bare minimum standards for non-spamminess, to prevent requests from being rejected outright and to prevent the rest of your code from being dragged down by having too many “spam points” from general incompetence.

Bare Minimum Standards

Considering that spam filters are based on a point system, fucking up basic shit can make the complicated part of your code pointless. To ensure that a bot meets bare minimum standards for not looking like a spambot, these points are, for all intents and purposes, required.

IP Addresses and Proxies

A user’s IP address can be used to infer a large amount of information about an HTTP request.

As discussed in the best proxy guide on the Internet, it is possible to determine whether an IP address is residential or a known proxy. While some requests from known proxies are humans doing regular things, many are spam.

While it is possible to repeatedly load and solve some reCaptcha demo without a proxy, connecting from your home IP address each time, this is unrealistic in terms of creating a reCaptcha solver that would actually be used for webspam purposes. Spambots are generally used to create accounts, post comments, and to otherwise interact with websites in massive bulk. Since these sites will generally log IP addresses, for a reCaptcha solver to be viable in real-world webspam scenarios, it must be able to solve reCaptcha while connected to proxies.

More specifically, it should be able to solve reCaptcha from cheap or free proxies. No webspammer is spending hundreds to thousands of dollars for the highest quality proxies, since that would destroy their profit margins, likely causing them to lose money and making the entire project pointless.

Programmatically Interacting with the reCaptcha UI

Browser Emulation vs. Headless Requests

The reCaptcha developers made the decision to refrain from including an API. Because of this, it’s necessary to write custom code to interact with reCaptcha. When deciding how to do this, the first question is whether to:

Make headless requests, where raw requests are made to the server without actually rendering the page source as it would be seen by a human. Robots can work with raw HTML, CSS, JS, and other code without needing to convert it to a pretty, rendered UI. For most bots, headless requests are ideal because they are substantially less resource intensive and are also generally easier to write code for. However, one limitation of headless requests is that it is relatively easy for websites to check whether a request is being made headless, using processes similar to those used to detect user-agent spoofing (more on user-agent spoofing in this article).

Emulate the browser, where the robot actually opens an instance of Firefox, Chrome, or some other browser as if it were a human.

I heard from a guy who knows a guy who knows a guy who said that it’s possible to solve reCaptcha 2.0 with headless requests. Regardless, this post uses browser emulation via Selenium.

“Mouse Movement” is Irrelevant

If you look up “how does reCaptcha know whether or not you are a robot” (or similar queries), you’ll likely find 800 normie-tier articles about how reCaptcha “looks at unnatural activity” that robots do, but not humans. Generally, the example that is given is “mouse movement.” While this is an easily-understandable example, it’s important to note that it’s fucking wrong.

Actually clicking the image is a trivial task for a robot. In addition to the fact that clicking the reCaptcha checkbox can be done without bothering to simulate any mouse movement at all, here’s an even easier way to verify that mouse movement is clearly not used:

Open some recaptcha demo, like this one. Press tab until you’ve selected the reCaptcha checkbox. Press enter.

Congratulations. You can even unplug your mouse while doing this.

Audio Captchas

As you may or may not know, reCaptcha 2.0 includes an option to request an audio captcha instead of a visual captcha, for accessibility purposes. Previous solutions to reCaptcha, such as unCaptcha, relied on the audio captcha option, which could be solved fairly easily by:

Downloading the audio file

Splitting the file into separate files for each of the individual spoken letters/numbers

Processing each one with multiple voice-to-text programs

Determining the most likely input for each character based on the results (e.g., if five voice-to-text programs heard “seven” and two heard something else, assuming that the answer is “seven” for that character)

An example of this is shown in the video below:

The specific script being used in the video above is Uncaptcha.
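The last step of that list (picking the most likely answer from several voice-to-text results) is just a majority vote. Here is a minimal sketch of that step on its own, assuming the transcriptions have already been collected as plain strings. The function names are mine and are not taken from unCaptcha.

```python
from collections import Counter


def vote_on_character(transcriptions):
    """Return the most common non-empty guess for one spoken character.

    `transcriptions` holds one string per voice-to-text program. Returns
    None when every program came back empty.
    """
    cleaned = [t.strip().lower() for t in transcriptions if t and t.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


def solve_audio(per_segment_transcriptions):
    """Vote separately on each audio segment and return the answers in order."""
    return [vote_on_character(guesses) for guesses in per_segment_transcriptions]


if __name__ == "__main__":
    # Five programs heard "seven" and two heard something else.
    guesses = ["seven"] * 5 + ["heaven", "eleven"]
    print(vote_on_character(guesses))
```

Lowercasing and stripping the guesses first keeps “Seven ” and “seven” from being counted as different answers.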

Audio Captchas are Under Increased Scrutiny

However, while this solution still works for solving the captcha puzzle, choosing the option to request an audio captcha instead of a visual captcha results in increased suspicion from reCaptcha (i.e., adding “spam points”) over the visual option, presumably because either:

The audio captcha has been solved by a number of different open-source solutions and people started exploiting the audio captcha; or

Google does not like blind people.

Because of this, when using most proxies, requesting an audio captcha (rather than an image captcha) will--in almost all cases--result in an outright block from reCaptcha, returning the error:

"Your computer or network may be sending automated queries. To protect our users, we can't process your request right now. For more details visit our help page."

Of course, this depends on the type of proxy being used. Exceptionally clean proxies (i.e., exceptionally expensive proxies) and/or other green flags may allow the audio captcha to be served and solved correctly even with this increased scrutiny, but this additional cost and effort is impractical for any realistic spambot.

The implication of this is that a reliable, modular solution to the visual reCaptcha is needed to bypass reCaptcha tests on a large scale.

The Image Captcha

Overview of Image Classification Algorithms

To solve the image captcha, the easiest approach is to find an existing image classifier and to retrain it for this specific use case. While there are many subtopics and concepts to learn about machine learning and image classification algorithms, the basic concept behind machine learning can be simplified as follows:

Get some problems for your robot to solve.

Create an “answer key.”

Split the problems into two sets: roughly 80% as the “training set” and roughly 20% as the “validation set.”

Repeat the following steps until you are satisfied with the algorithm generated:

Have the robot use the training set and associated answer key to attempt to create an algorithm that can accurately solve this category of problem.

Have the robot use that algorithm to try to solve the problems in the validation set.

If, based on the results from the previous step, the algorithm is better than what was being used in the previous loop, keep it. Otherwise, go back to the old algorithm and create another offshoot.
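As a toy illustration of that loop, the sketch below “learns” a single number (a cutoff) instead of an image classifier. The 80/20 split and the keep-it-only-if-it-validates-better rule are the parts that carry over. The random-nudge search is a stand-in I picked to keep the example short, not how TensorFlow actually trains anything.

```python
import random


def split(problems, train_fraction=0.8, seed=0):
    """Shuffle the problems and split them into training and validation sets."""
    shuffled = problems[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(threshold, examples):
    """Fraction of (value, label) pairs where `value >= threshold` matches label."""
    correct = sum((value >= threshold) == label for value, label in examples)
    return correct / len(examples)


def train(problems, rounds=200, seed=0):
    """Keep the best-validating cutoff found within a fixed number of rounds."""
    rng = random.Random(seed)
    training_set, validation_set = split(problems, seed=seed)
    best = rng.uniform(0, 1)
    best_score = accuracy(best, validation_set)
    for _ in range(rounds):
        # Create an offshoot of the current best "algorithm".
        candidate = best + rng.uniform(-0.1, 0.1)
        # Skip offshoots that fit the training set worse than the current best.
        if accuracy(candidate, training_set) < accuracy(best, training_set):
            continue
        # Keep the offshoot only if it does better on the validation set.
        score = accuracy(candidate, validation_set)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    # The answer key: values of 0.5 and above are labelled True.
    data = [(i / 100, i >= 50) for i in range(100)]
    print(train(data))
```

The validation set is never used to generate candidates, only to judge them, which is the whole reason for holding back that 20%.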

As mentioned in this thread, this video by SethBling and this video by CGP Grey (embedded below) are additional understandable, entry-level explanations of machine learning.

One of the most straightforward methods to get started with machine learning is through Google’s TensorFlow for Poets tutorial, which uses Python and the open-source TensorFlow library to create and train a basic image classifier to identify flowers, which can easily be adjusted to identify other arbitrary images.

Gathering Training Images

To train an image classifier for this particular use case (solving reCaptcha), training images are needed. Fortunately, it’s fairly easy to acquire these with a script that:

Opens a page with reCaptcha on it, like this demo site.

Clicks the image

Downloads the images to a folder/category with the type of challenge (traffic lights, street signs, etcetera)

Hits the “get a new challenge” button to repeat/refresh until reCaptcha throws an error

Upon getting an error or otherwise being unable to download more images, close the instance, cycle to a new proxy, and start from step one.

Repeat until you have somewhere between 1,000 and 10,000 training images for each of the major categories. As of the time of this bot being tested, almost all challenges were for one of these five categories:

“Bus”

“Cars”

“Roads”

“Store Front”

“Street Signs”

While there were other categories like fire hydrants and statues, they were rare enough that they could be skipped with little to no repercussions.
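The collection loop described above can be sketched without any browser code by passing the Selenium parts in as functions. Everything here is hypothetical scaffolding: `fetch_challenges` stands in for “open the demo page through this proxy and keep requesting new challenges,” and it is assumed to raise `RuntimeError` when reCaptcha throws an error.

```python
import os


def category_dir(root, challenge_title):
    """Folder for one challenge type, e.g. 'Street Signs' -> root/street_signs."""
    return os.path.join(root, challenge_title.strip().lower().replace(" ", "_"))


def collect(proxies, fetch_challenges, save_image, root="training-images"):
    """Download challenge images proxy by proxy, sorted by challenge type.

    fetch_challenges(proxy) yields (title, [image_bytes, ...]) pairs and is
    assumed to raise RuntimeError once reCaptcha refuses to serve more.
    save_image(path, image_bytes) writes one image to disk.
    Returns a dict of folder -> number of images saved.
    """
    counts = {}
    for proxy in proxies:
        try:
            for title, images in fetch_challenges(proxy):
                folder = category_dir(root, title)
                for image in images:
                    n = counts.get(folder, 0)
                    save_image(os.path.join(folder, "%05d.jpeg" % n), image)
                    counts[folder] = n + 1
        except RuntimeError:
            # reCaptcha threw an error: drop this session, cycle to the next proxy.
            continue
    return counts
```

Because the function returns a per-folder count, the caller can stop cycling proxies once each major category has reached its target number of images.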

Once a fair number of training images have been downloaded, sort them into “Cars” and “Not Cars” categories (or comparable) to be used as training data for the image classifier. When doing this, I hired the worst fucking VA in the world; a better option would have been Amazon’s Mechanical Turk, which is full of people who do an obscene quantity of microtasks to make modest amounts of money.

Regardless of how they get sorted, once you have the data, you can retrain the classifier for reCaptcha solving.

Cookie Farms

While solving the image captcha is a """relatively""" trivial process (compared to when it was first introduced), there are many cases (particularly for Tor users) where, despite many captchas being solved correctly, the reCaptcha either takes (literally) 3-5 minutes to solve or gives the user the "automated queries GTFO" error after wasting like five minutes of their time. The same is true for reCaptcha solvers working from proxies. While the images are clicked correctly and some captchas will eventually get solved correctly, this is slow as fuck and completely impractical. From the tests I ran, you're looking at roughly one captcha being solved per hour (when running in a single thread). To get around this, it's necessary to reduce the "spam score" further by either:

Spending infinity dollars on the finest proxies; or

Setting up a cookie farm, as discussed in section/chapter 3 of I’m not a human: Breaking the Google reCAPTCHA

Code and Resources

Here is the code used during this project. Note that various dependencies will likely need to be installed to make all of this fucking shit run, most notably Selenium. Originally written in April 2018.

Recaptcha Photos

The sorted training photos used for the image captcha solver are included in this .tar.gz:

The image categories include images from challenges for buses, cars, roads, store fronts, and street signs, sorted into folders that do and do not match the categories.

ris.py

# ris.py
# Probably mostly from unCaptcha (?)
# get_synonyms(), reverse_search(), and DEBUG are referenced below but are not part of this listing.
import requests
import os
import time
import random
import json
import threading
import multiprocessing
import copy
import pickle
from PIL import Image
from os import walk
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def parse_test_file(test_filename):
    return json.loads(open(test_filename, "r").read())


def test_all():
    dirs = list()
    test_root = "images"
    # Only the top-level subdirectories of the test root are searched
    for (dirName, subDir, _) in walk(test_root):
        dirs.extend(subDir)
        break
    for d in dirs:
        full_path = os.path.join(test_root, d)
        try:
            search_directory(full_path)
        except Exception as exc:
            print("test %s failed" % full_path)
            print(exc)


def search_directory(directory, target_keyword=None, width=4):
    f = []
    trues = 0
    threads = []
    oracle = None
    try:
        oracle = parse_test_file(os.path.join(directory, "oracle.json"))
        if target_keyword is None:
            target_keyword = oracle["target_keyword"]
    except IOError as err:
        print("no oracle file found")
    manage_vars = multiprocessing.Manager()
    for (_, _, filenames) in walk(directory):
        f.extend([file for file in filenames if "image" in file or "output" in file])
    ret_vals = manage_vars.dict()
    target_syns = list()
    for targ_key in target_keyword.split():
        target_syns.extend(get_synonyms(targ_key))
        target_syns.append(targ_key)
    print("testing " + directory)
    for img_file in f:
        t = multiprocessing.Process(target=reverse_search2, args=(os.path.join(directory, img_file), img_file, ret_vals, target_syns))
        threads.append(t)
        t.start()
    # Wait for every worker so that no result is missing from ret_vals
    for t in threads:
        t.join()
    print("")
    # print ret_vals
    # print oracle
    if oracle:
        # local testing only
        for img_file in ret_vals.keys():
            # print str(ret_vals[img_file]) + " " + str(oracle[img_file])
            if ret_vals[img_file] == oracle[img_file]:
                trues += 1
        print(" %s correct out of %s" % (str(trues), len(ret_vals)))
        return ret_vals
    else:
        # live testing only
        return get_coor(ret_vals, width)


def reverse_search2(img_file, filename, ret_vals, target_keyword="vehicle"):
    ret_vals[filename] = reverse_search(img_file, target_keyword)


# determines if an image keywords matches the target keyword
# uses the synonyms of the image keyword
def check_image(img_keywords, target_syns, syn_image=False):
    #print ("Checking keywords against: " + target_keyword)
    for k in img_keywords:
        #print(k)
        if syn_image:
            image_syns = get_synonyms(k)
            if image_syns:
                for image_s in image_syns:
                    for target_s in target_syns:
                        # print("- %s" % (target_s))
                        if target_s == image_s:
                            return True
        else:
            for target_s in target_syns:
                # print("- %s" % (target_s))
                if target_s == k:
                    if (DEBUG > 0):
                        print("Found " + target_s + " equal to " + k)
                    return True
    return False


# Maps the sorted image filenames onto 1-indexed (row, column) grid coordinates
def get_coor(click_dict, width=4):
    x = 1
    y = 1
    coor_dict = dict()
    for key in sorted(click_dict.keys()):
        coor_dict[(x, y)] = click_dict[key]
        y += 1
        if y > width:
            x += 1
            y = 1
    return coor_dict


def pprint(matrix):
    s = [[str(e) for e in row] for row in matrix]
    lens = [max(map(len, col)) for col in zip(*s)]
    fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
    table = [fmt.format(*row) for row in s]
    print('\n'.join(table))
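The piece of ris.py that matters most for the image solver is get_coor(), which turns per-image decisions into grid positions to click. Here is a small usage sketch with a Python 3 port of that function. The filenames and the True/False decisions are made up for the example.

```python
def get_coor(click_dict, width=4):
    """Python 3 port of get_coor() from ris.py.

    Walks the filenames in sorted order and assigns each one a 1-indexed
    (row, column) position in a grid that is `width` tiles wide.
    """
    x = 1
    y = 1
    coor_dict = {}
    for key in sorted(click_dict):
        coor_dict[(x, y)] = click_dict[key]
        y += 1
        if y > width:
            x += 1
            y = 1
    return coor_dict


# Hypothetical classifier decisions for a 3x3 challenge: click the diagonal.
decisions = {"image_%d.jpeg" % i: (i in (0, 4, 8)) for i in range(9)}
grid = get_coor(decisions, width=3)
to_click = sorted(coords for coords, click in grid.items() if click)
print(to_click)  # [(1, 1), (2, 2), (3, 3)]
```

One thing to watch: the function relies on plain string sorting, so with ten or more tiles "image_10.jpeg" sorts before "image_2.jpeg". Zero-padding the index in the filename (or sorting numerically) would keep 4x4 grids in the intended order.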

recaptcha_solver.py

# Recaptcha Solver
# recaptcha_solver.py
# Created on 25 April, 2018
# Base code taken from the open source code for Uncaptcha
from selenium import webdriver
from selenium.webdriver.common import action_chains, keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import sys
import os
import time
from time import sleep
from bs4 import BeautifulSoup
import urllib
import urllib2
import pdb
import logging
import random
from random import uniform
import threading
from threading import Timer,Thread,Event
import argparse
from termcolor import colored
import pprint
import Tkinter as tk
from PIL import ImageTk, Image
import traceback

# Custom Modules
import ris

# Import the functions to run reCaptcha images through the image classifier / to get data from the image classifier
# recaptcha_solver/tensorflow-for-poets-2/scripts/label_image.py
classifier_path = os.path.join( os.getcwd(), 'recaptcha_solver/tensorflow-for-poets-2/scripts' )
sys.path.append(classifier_path) # FIXME
import label_image
sys.path.remove(classifier_path)

# ***** ***** ***** ***** FUNCTION DEFINITIONS ***** ***** ***** ***** #

def recaptcha_solver_demo(driver):
    print "Starting the reCAPTCHA solver demo..."
    driver.get("https://patrickhlauke.github.io/recaptcha/")
    #driver.get("https://www.google.com/recaptcha/api2/demo")
    solve_recaptcha(driver, 5)
    print "End of recaptcha_solver_demo() function..."

# Solve all reCaptchas that are currently loaded on the screen of the driver
# returns True if reCaptcha is solved correctly
# Otherwise, returns False
def solve_recaptcha(driver, seconds_to_wait=15):
    if ( click_recaptcha(driver, seconds_to_wait) ):
        wait_for_initial_challenge_to_load(driver)
        try:
            return solve_visual_captcha(driver)
            # Uncomment this and comment out the line above if you want to download training images instead of solve captchas
            #download_training_images(driver)
        except:
            traceback.print_exc()
            print "Error somewhere, lmao"
    return False

def click_recaptcha(driver, seconds_to_wait=15):
    for iii in range(seconds_to_wait):
        try:
            print "Trying to find the recaptcha's iFrame..."
            #recaptcha_iframe = driver.find_element(By.CSS_SELECTOR, "iframe[title=\"recaptcha challenge\"]")
            #recaptcha_iframe = driver.find_element_by_css_selector("#g-recaptcha iframe")
            recaptcha_iframe = driver.find_element_by_css_selector(".g-recaptcha iframe")
            print "Found frame..."
            driver.delete_all_cookies() #Is this even a good choice? (Re: Asian woman)
            print "Cookies deleted..."
            driver.switch_to.frame(recaptcha_iframe)
            print "Switched to the iFrame..."
            driver.delete_all_cookies()
            print "Cookies deleted (again)..."
            print "Trying to find the recaptcha..."
            recaptcha = driver.find_element_by_css_selector("#recaptcha-anchor")
            print "Trying to click the recaptcha..."
            recaptcha.click()
            print "The reCaptcha has been clicked..."
            driver.switch_to.default_content()
            return True
        except:
            driver.switch_to.default_content()
            sys.stdout.flush()
            sys.stdout.write( '\r' + colored("Waiting for a reCaptcha to load. "+str(iii)+" seconds have passed. Will abandon at "+str(seconds_to_wait)+" seconds...", 'cyan') )
            #print colored("Waiting for a reCaptcha to load. "+str(iii)+" seconds have passed. Will wait up to "+str(seconds_to_wait)+" seconds...", 'yellow')
            time.sleep(1)
    print colored("\nError: No reCaptcha was detected on the page after waiting for "+str(seconds_to_wait)+" seconds.", 'red')
    return False

def wait_for_initial_challenge_to_load(driver):
    print "Waiting for challenges to load, probably..."
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[title=\"recaptcha challenge\"]")))
    iframe = driver.find_element(By.CSS_SELECTOR, "iframe[title=\"recaptcha challenge\"]")
    driver.switch_to.frame(iframe)
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "rc-imageselect")))
    print "reCaptcha has loaded..."

############################## CAPTCHA ERRORS ###############################

# Checks to see if reCaptcha has given the "Try again later" error (which disallows attempts to solve a reCaptcha at all)
# "Your computer or network may be sending automated queries. To protect our users, we can't process your request right now. For more details visit our help page"
# FIXME
def check_for_automated_queries_error(driver):
    try:
        if driver.find_element_by_css_selector(".rc-doscaptcha-body-text").is_displayed():
            print colored("Error: Automated queries error message was found. Captcha could not be served...", 'red')
            return True
    except:
        print colored("Automated queries error message not found...", 'green')
    return False

#TODO - Test this
# Check if reCaptcha has timed out
# red border thing
# FIXME -- might need to switch to the correct of the two (?) iFrames (?)
def check_if_recaptcha_timed_out(driver):
    #id='recaptcha-anchor' class="recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox recaptcha-checkbox-expired"
    try:
        # The argument is a CSS selector, so it is looked up with the CSS selector finder
        driver.find_element_by_css_selector("#recaptcha-anchor.recaptcha-checkbox.recaptcha-checkbox-expired")
        print colored("reCaptcha has expired.", 'red')
        return True
    except:
        print colored("reCaptcha has not expired.", 'green')
        return False

# TODO -- Make this way faster
# Fucking big reason why signs times out
def check_if_recaptcha_is_solved(driver):
    print "Checking if reCaptcha is solved..."
    try:
        driver.switch_to.default_content()
        recaptcha_iframe = driver.find_element_by_css_selector(".g-recaptcha iframe")
        driver.switch_to.frame(recaptcha_iframe)
        driver.find_element_by_css_selector("#recaptcha-anchor.recaptcha-checkbox.recaptcha-checkbox-checked")
        #driver.find_element_by_css_class("#recaptcha-anchor.recaptcha-checkbox.recaptcha-checkbox-checked").get_attribute("aria-checked")
        print colored("reCaptcha is solved.", 'green', 'on_yellow')
        driver.switch_to.default_content()
        return True
    except:
        print colored("reCaptcha is not yet solved.", 'red')
        # Switch back into the challenge iFrame so that solving can continue
        driver.switch_to.default_content()
        iframe = driver.find_element(By.XPATH, "/html/body/div/div[4]/iframe")
        driver.switch_to.frame(iframe)
        return False

# Checks for common reasons that the reCaptcha solver may not be able to continue and returns relevant information
def check_if_recaptcha_solver_is_stuck(driver):
    # Check for 'automated queries' rejection
    if ( check_for_automated_queries_error(driver) == True):
        return True
    # FIXME
    if (check_if_recaptcha_timed_out(driver)):
        return True
    print colored("reCaptcha solver is not stuck. Continuing...", 'green')
    return False

############################## VISUAL RECAPTCHA #############################

def solve_visual_captcha(driver):
    return image_recaptcha(driver)

############################## IMAGE RECAPTCHA FUNCTIONS ##############################

TASK_PATH = "recaptcha_solver/captcha-images"

def should_click_image(img, x1, y1, store, threshold=0.95, target="cars"):
    decision = parse_classify_image(img, threshold, target)
    store[(x1,y1)] = decision
    logging.debug(store)
    return decision

def click_tiles(driver, coords, subdir=None):
    # Some recaptchas (generally everything except street signs) will fade out when clicked, after which a new image is loaded in
    # ---> Dynamic -- (.rc-imageselect-tile.rc-imageselect-dynamic-selected)
    # Other recaptchas (almost exclusively street signs) get static check mark and don't fade out
    # ---> Static -- (.rc-imageselect-tile.rc-imageselect-tileselected)
    # These flags are used to determine whether the script should wait for new images to load in or not
    flag_is_static_select = False
    flag_is_dynamic_select = False
    # There are two distinct flags so that it can be determined immediately on the first image click either way, which improves performance
    orig_srcs, new_srcs = {}, {}
    for (x, y) in coords:
        logging.debug("[*] Going to click {} {}".format(x,y))
        tile1 = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(x, y))))
        orig_srcs[(x, y)] = driver.find_element(By.XPATH, "//*[@id=\"rc-imageselect-target\"]/table/tbody/tr[{}]/td[{}]/div/div[1]/img".format(x,y)).get_attribute("src")
        new_srcs[(x, y)] = orig_srcs[(x, y)] # to check if image has changed
        tile1.click()
        if (flag_is_static_select == False and flag_is_dynamic_select == False):
            try:
                driver.find_element_by_css_selector(".rc-imageselect-tile.rc-imageselect-tileselected")
                flag_is_static_select = True
            except:
                flag_is_dynamic_select = True
            #try:
            #    driver.find_element_by_css_selector(".rc-imageselect-tile.rc-imageselect-dynamic-selected")
            #    flag_is_dynamic_select = True
            #except:
            #    print colored("WARNING: How can the recaptcha image be neither of the two possibilities?", 'red', 'on_yellow')
        #wait_between(0.1, 0.5)
        wait_between(0.1, 0.2)
    # TODO -- Check if the images use a checkmark style (rather than fading out and loading in a new image)
    if (flag_is_static_select == True):
        print colored("All images have been checked/selected.", 'blue')
        #pdb.set_trace()
        return None # FIXME - Test this
    else:
        print colored("New images loading in...", 'blue')
    # Set the path for where the images will be downloaded to
    if (subdir != None):
        subdir = os.path.join(TASK_PATH, subdir)
    else:
        subdir = TASK_PATH
    # Start downloading the new images, etc.
    logging.debug("[*] Downloading new inbound image...")
    new_files = {}
    for (x, y) in orig_srcs:
        # If there are connection issues, images may not load in correctly once old images are clicked
        # To prevent the reCaptcha from hanging forever in an infinite loop,
        # throw an error after waiting for over [some amount of time, probably set this to five seconds]
        # TODO -- make this click "verify" instead and move on to the next reCaptcha (?)
        max_seconds_to_wait_raw_value = 7
        max_seconds_to_wait = max_seconds_to_wait_raw_value
        loop_delay = 0.5
        while new_srcs[(x, y)] == orig_srcs[(x, y)]:
            if(max_seconds_to_wait > 0):
                new_srcs[(x, y)] = driver.find_element(By.XPATH, "//*[@id=\"rc-imageselect-target\"]/table/tbody/tr[{}]/td[{}]/div/div[1]/img".format(x,y)).get_attribute("src")
                time.sleep(loop_delay)
                max_seconds_to_wait -= loop_delay
                #print colored("Remaining wait: "+str(max_seconds_to_wait), 'blue', 'on_cyan')
            else:
                e = "ERROR: New reCaptcha image "+""+" could not be found after waiting for "+str(max_seconds_to_wait_raw_value)+" seconds."
                print colored(e, 'red')
                # Strings cannot be raised directly, so wrap the message in an exception
                raise RuntimeError(e)
        #urllib.urlretrieve(new_srcs[(x, y)], "captcha.jpeg")
        #new_path = TASK_PATH+"/new_output{}{}.jpeg".format(x, y)
        #os.system("mv captcha.jpeg "+new_path)
        new_path = subdir+"/new_output{}{}.jpeg".format(x, y)
        urllib.urlretrieve(new_srcs[(x, y)], new_path)
        #os.system("mv captcha.jpeg "+new_path)
        new_files[(x, y)] = (new_path)
    return new_files

def handle_queue(to_solve_queue, coor_dict, threshold=0.95, target="cars"):
    # Classify every queued image in its own thread; results are written into coor_dict
    ts = []
    for (x,y) in to_solve_queue:
        image_file = to_solve_queue[(x, y)]
        t = threading.Thread(target=should_click_image, args=(image_file, x, y, coor_dict, threshold, target))
        ts.append(t)
        t.start()
    for t in ts:
        t.join()

def image_recaptcha(driver):
    # This is here to check specifically for the "automated queries" error on the first click
    print "Checking for 'automated queries' error before continuing..."
    if ( check_for_automated_queries_error(driver) == True):
        return False
    print colored("Attempt to solve the image reCaptcha has started...\n", 'cyan')
    current_captcha = 1
    threshold = 0.02 # The threshold for probability of being a car
    continue_solving = True
    while continue_solving:
        print "Starting reCaptcha #" + str(current_captcha) + "..."
        print colored("Searching for a relatively easy reCaptcha to solve...\n", 'cyan')
        willing_to_solve = False
        while not willing_to_solve:
            target = get_captcha_title(driver)
            t = get_captcha_dimensions_and_payload(driver); max_width = t[0]; max_height = t[1]; payload = t[2]
            print colored( "Target:\t", 'cyan') + colored(target, 'cyan', attrs=['bold']) + \
                colored("\t" + "Width: \t", 'cyan') + colored(max_width, 'cyan', attrs=['bold']) + \
                colored("\t" + "Height:\t", 'cyan') + colored(max_height, 'cyan', attrs=['bold'])
            # If there is no image classifier for the current category, skip the category
            # As of June 14th, 2018, these five categories cover almost all reCaptchas served.
            #if target != "cars" and target != "store front" and target != "bus" and target != "roads":
            #if target != "street signs":
            if target != "street signs" and target != "cars" and target != "store front" and target != "bus" and target != "roads":
                print colored("This reCaptcha has no image classifier. Requesting a new reCaptcha...\n", 'magenta')
                reload_captcha(driver)
                #threshold = 0.01 # Reset the threshold for probability of being a car, since there is a new image set
                #threshold = 0.02 # Reset the threshold for probability of being a car, since there is a new image set
                #threshold = 0.25 # Reset the threshold for probability of being a car, since there is a new image set
                current_captcha = current_captcha + 1
            else:
                print colored("This reCaptcha is acceptable. Attempt to solve reCaptcha will begin...\n\n", 'green')
                if (target =="street signs"):
                    threshold = 0.40
                else:
                    threshold = 0.02
                willing_to_solve = True
            # Consider FIXME-ing -- Random is probably not ideal, but allows for the loop to break
            # when it hits uncommon edge cases while minimizing the performance hit that occurs
            # from running this fucking function.
            # Main need for performance improvements is with the street signs, so restricted to that
            # target category
            if (target != "street signs" or random.randint(1,5) == 5):
                print "Checking for errors before continuing..."
                if ( check_if_recaptcha_solver_is_stuck(driver) ):
                    return False
        print colored("reCaptcha to solve has been chosen...", 'green')
        subdir_name = "recaptcha--"+str(int(time.time()))+"--"+target+"--"+str(max_width)+"x"+str(max_height)
        full_task_path = TASK_PATH + "/" + subdir_name
        download_recaptcha_images(driver, subdir_name)
        t_dir = os.listdir(full_task_path)
        t_dir.sort()
        # build queue of files
        print colored("Creating queue of image files to solve...", 'cyan')
        to_solve_queue = {}
        idx = 0
        for f in [full_task_path+"/"+f for f in t_dir if "output_" in f]:
            y = idx % max_height + 1 # making coordinates 1 indexed to match xpaths
            x = idx // max_width + 1 # integer division gives the row
            #y = idx % 3 + 1 # making coordinates 1 indexed to match xpaths
            #x = idx / 3 + 1
            to_solve_queue[(x, y)] = f
            idx += 1
        logging.debug(to_solve_queue)
        #print colored("Handling/solving of image queue starting...", 'cyan')
        print colored("Actual solving of the reCaptcha images starting...", 'cyan')
        #threshold = threshold - 0.05
        coor_dict = {}
        handle_queue(to_solve_queue, coor_dict, threshold, target) # multithread builds out where to click
        logging.debug(coor_dict)
        #os.system("rm "+full_task_path+"/full_payload.jpeg")
        driver.switch_to.default_content()
        iframe = driver.find_element(By.XPATH, "/html/body/div/div[4]/iframe")
        driver.switch_to.frame(iframe)
        #print colored("Actual solving of the reCaptcha images starting...", 'cyan')
        continue_solving = True
        while continue_solving:
            to_click_tiles = []
            for coords in coor_dict:
                to_click = coor_dict[coords]
                x, y = coords
                body = driver.find_element(By.CSS_SELECTOR, "body").get_attribute('innerHTML').encode("utf8")
                if to_click:
                    to_click_tiles.append((x,y)) # collect all the tiles to click in this round
            new_files = click_tiles(driver, to_click_tiles, subdir_name)
            if (new_files != None):
                handle_queue(new_files, coor_dict, threshold, target)
                # Keep going only while at least one tile still needs to be clicked
                continue_solving = False
                for to_click_tile in coor_dict.values():
                    #print colored("In this loop, lmao, lmao", 'cyan', 'on_white')
                    continue_solving = to_click_tile or continue_solving
            else:
                continue_solving = False
                #pdb.set_trace()
        print colored("The images that appear to match the category of ", 'cyan') + colored(target, 'cyan', attrs=['bold']) + colored(" have been clicked. Clicking the 'verify' button...", 'cyan')
        #wait_between(1.5, 2.5) # Wait for all the images to fully load in before clicking verify. Otherwise, it always gives a "Please Try Again" error for some reason. JK LOL IGNORE THIS
        #pdb.set_trace() # TODO -- add this back in as an optional parameter to human-verify captchas
        driver.find_element(By.ID, "recaptcha-verify-button").click()
        # wait_between(0.2, 0.5)
        wait_between(0.4, 0.6) # Increased to prevent getting stuck on wrong image (consider a less ghetto solution)
        #if driver.find_element_by_class_name("rc-imageselect-incorrect-response").get_attribute("style") != "display: none":
        # FIXME - Ghetto solution
        if (target != "street signs" or random.randint(1,3) == 3):
            if check_if_recaptcha_is_solved(driver) == False:
                print colored("reCaptcha is not yet solved. Continuing with solution...", 'red')
                continue_solving = True
            else:
                #timeout_timer.cancel()
                print colored("Recaptcha should be solved.", 'green', 'on_yellow')
                #time.sleep(10) # FIXME
                return True
        else:
            print colored("Verification step skipped for speed reasons (LMAO). Continuing with solution...", 'yellow')
            continue_solving = True
        # TODO check if captcha changed on hitting verify -- "Please Try Again" instead of "Please select all matching images."
        if (False and target != get_captcha_title(driver) ):
            print "New Captcha was served"
        # FIXME check for error message directly

def download_training_images(driver):
    print colored("Attempt to solve the image reCaptcha has started...

", 'cyan') continue_solving = True while continue_solving: print colored("Searching for a relatively easy reCaptcha to solve...

", 'cyan') willing_to_solve = False while not willing_to_solve: target = get_captcha_title(driver) t = get_captcha_dimensions_and_payload(driver); max_width = t[0]; max_height = t[1]; payload = t[2] print colored( "Target:\t", 'cyan') + colored(target, 'cyan', attrs=['bold']) + \ colored("\t" + "Width: \t", 'cyan') + colored(max_width, 'cyan', attrs=['bold']) + \ colored("\t" + "Height:\t", 'cyan') + colored(max_height, 'cyan', attrs=['bold']) subdir_name = target #"recaptcha--"+str(int(time.time()))+"--"+target+"--"+str(max_width)+"x"+str(max_height) download_recaptcha_images(driver, subdir_name, True) reload_captcha(driver) # ##### IMAGE CAPTCHA UTIL ##### # def reload_captcha(driver): reload_captcha = driver.find_element(By.XPATH, "//*[@id=\"recaptcha-reload-button\"]") try: reload_captcha.click() except Exception as e: print colored("Error clicking the button to reload the captcha -- ({0}): {1}".format(e.errno, e.strerror), 'red') wait_between(0.2, 0.5) def get_captcha_title(driver): body = driver.find_element(By.CSS_SELECTOR, "body").get_attribute('innerHTML').encode("utf8") soup = BeautifulSoup(body, 'html.parser') #table = soup.findAll("div", {"id": "rc-imageselect-target"})[0] target = soup.findAll("div", {"class": "rc-imageselect-desc"}) if not target: # find the target target = soup.findAll("div", {"class": "rc-imageselect-desc-no-canonical"}) target = target[0].findAll("strong")[0].get_text() return target def get_captcha_dimensions_and_payload(driver): body = driver.find_element(By.CSS_SELECTOR, "body").get_attribute('innerHTML').encode("utf8") soup = BeautifulSoup(body, 'html.parser') table = soup.findAll("div", {"id": "rc-imageselect-target"})[0] trs = table.findAll("tr") if (len(trs) > 4): #FIXME - Sort of ghetto max_height = 4 #pdb.set_trace() else: max_height = len(trs) max_width = 0 for tr in trs: imgs = tr.findAll("img") payload = imgs[0]["src"] if len(imgs) > max_width: max_width = len(imgs) return [max_width, max_height, payload] def 
download_recaptcha_images(driver, subdir_name="garbage_heap", rand_file_names=False): t = get_captcha_dimensions_and_payload(driver); max_width = t[0]; max_height = t[1]; payload = t[2] # Pull down catcha to solve and organize directory structure print colored("Creating the directory ", 'cyan') + colored(subdir_name, 'cyan', attrs=['bold']) + colored(" to store the images for the current reCaptcha...", 'cyan') full_task_path = TASK_PATH + "/" + subdir_name print colored("Full path for this task:\t", 'cyan') + colored(full_task_path, 'cyan', attrs=['bold']) if not os.path.exists(full_task_path): os.makedirs(full_task_path) elif (rand_file_names == True): pass else: print colored("Directory already exists. Get better error handling, Jesus Christ.", 'red') return 0 print colored("File download starting...", 'cyan') # TODO -- download directly to the correct directory (might need to be created first) #urllib.urlretrieve(payload, "captcha.jpeg") #os.system("mv captcha.jpeg '"+full_task_path+"/full_payload.jpeg'") # FIXME possible overwriting during multi-threaded operation urllib.urlretrieve(payload, full_task_path+"/full_payload.jpeg") t = "" if (rand_file_names == True): t = subdir_name+"--"+str(int(time.time()))+"--" print colored("Creating distinct images from the main reCaptctha grid...", 'cyan') os.system("convert \""+full_task_path+"/full_payload.jpeg\" -crop "+str(max_width)+"x"+str(max_height)+"@ +repage +adjoin \""+full_task_path+"/"+t+"output_%03d.jpg\"") print colored("The main grid of images has been split into ", 'cyan') + colored(str(max_width*max_height), 'cyan', attrs=['bold']) + colored(" individual images...", 'cyan') # Returns True if the image classifier estimates that 'image' is a car with a confidence level greater than the specified threshold # Otherwise, returns False def parse_classify_image(image="output_008.jpg", threshold=0.95, target="cars"): t = label_image.classify_image(image, target) if ( t[0][0].find("not") == -1): success_label = 
t[0][0] success_chance = t[1][0] else: success_label = t[0][1] success_chance = t[1][1] # FIXME this relies on alphabetical order m = "The image " + image + " has a " + "{0:.2f}%".format( success_chance * 100 ) + " chance of being matching the label "+str(target)+". Threshold: " + "{0:.2f}%".format( threshold * 100 ) if ( success_chance > threshold ): # Leave this debugging statement in, but don't run it for performance reasons, probably #print colored(m, 'green') return True # Leave this debugging statement in, but don't run it for performance reasons, probably #print colored(m, 'red') return False ############################## UTIL FUNCTIONS ############################# def show_image(path): image_window = tk.Tk() img = ImageTk.PhotoImage(Image.open(path)) panel = tk.Label(image_window, image=img) panel.pack(side="bottom", fill="both", expand="yes") image_window.mainloop() # Actually a thing def wait_between(a, b): rand = uniform(a, b) sleep(rand)
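The queue-building loop in the solver maps each cropped tile's index to a 1-indexed (row, column) pair, but it mixes `max_height` and `max_width`, which only lines up on square grids. Below is a minimal sketch of a mapping that also holds for non-square grids. It assumes the tiles are numbered left to right, top to bottom (which is how I understand ImageMagick's `-crop WxH@` to order its output), and `tile_coordinates` is a hypothetical helper name, not something from the original solver.

```python
def tile_coordinates(idx, max_width):
    """Map a 0-based tile index to 1-indexed (row, column) coordinates.

    Assumes row-major ordering: tiles are numbered left to right,
    then top to bottom, with max_width tiles per row.
    """
    row = idx // max_width + 1   # which row the tile falls in
    col = idx % max_width + 1    # position of the tile within that row
    return (row, col)


def build_queue(file_names, max_width):
    """Build a {(row, col): file_name} dict from tile files sorted by name."""
    return {
        tile_coordinates(idx, max_width): name
        for idx, name in enumerate(sorted(file_names))
    }
```

On a 3x3 or 4x4 grid this gives the same coordinates as the original loop; the difference only shows up when the width and height differ.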

interface.py

# interface.py

# Config
DEBUG_MODE = False
MAX_THREADS = 1
DELAY = 10.0

#imports
import selenium
from selenium import webdriver
from selenium.webdriver.common import action_chains, keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import sys
import os
import traceback
import thread
import threading
from threading import Thread
import pdb
import time
import random
import pprint
import math
import urllib
import csv
import sqlite3
from termcolor import colored, cprint
from screeninfo import get_monitors

# ***** Locally Stored Files ***** #
# reCaptcha Solver
recaptcha_path = os.path.join( os.getcwd(), 'recaptcha_solver' )
sys.path.append(recaptcha_path) # FIXME
import recaptcha_solver
sys.path.remove(recaptcha_path)

###########################
####  General Utility  ####
###########################

def import_jquery(driver):
    with open('assets/jquery-3.3.1.min.js', 'r') as jquery_js:
        jquery = jquery_js.read() #read the jquery from a file
        driver.execute_script(jquery) #activate the jquery lib

def clear_cookie_and_session_data():
    # FIXME -- `driver` is not defined in this scope in this listing; it should be passed in
    print "Clearing cookie and session data..."
    driver.delete_all_cookies()
    print "Cookie and session data have been cleared..."

def set_proxy(driver, ip=False, port=False):
    profile = webdriver.FirefoxProfile()
    if (PROXY_TYPE != "NONE"):
        if (ip==False and port==False):
            proxies = []
            # Fill in proxy details
            proxies.append( [ "xxx.xxx.xxx.xxx", 00000 ] )
            proxies.append( [ "xxx.xxx.xxx.xxx", 00000 ] )
            # Connect to a random proxy
            t = random.choice(proxies)
            ip = t[0]; port = t[1]
        profile.set_preference("network.proxy.type", 1)
        profile.set_preference("network.proxy.http", ip )
        profile.set_preference("network.proxy.http_port", port )
        profile.set_preference("network.proxy.ssl", ip )
        profile.set_preference("network.proxy.ssl_port", port )
    profile.set_preference("browser.content.main-window.width", 20)
    profile.set_preference("browser.content.main-window.height", 30)
    profile.update_preferences()
    driver = webdriver.Firefox(profile, executable_path='assets/selenium-drivers/geckodriver')
    profile._create_tempfolder
    return driver

def refresh_full_session(driver, num_instances=None):
    try:
        driver.quit()
    except:
        pass
    driver = set_proxy(driver)
    # Start the session
    driver.get("about:newtab")
    driver.implicitly_wait(10) # seconds
    resize_window(driver, num_instances)
    return driver

def test_recaptcha_solver():
    print "Starting test..."
    driver = None
    for i in range(12000):
        print "Loop "+str(i)+"..."
        driver = refresh_full_session(driver)
        try:
            recaptcha_solver.recaptcha_solver_demo(driver)
        except:
            driver.quit()
            traceback.print_exc()
            print "Unknown error during recaptcha solver test..."
    driver.quit()
    print "End of test..."

# Main
def main_loop(loop_until_all_complete=True):
    choice = lambda: test_recaptcha_solver();
    try:
        while ( choice() != -1):
            pass
    except (KeyboardInterrupt, SystemExit) as e:
        print "Keyboard interrupt or system exit detected. Killing all threads..."
        cleanup_stop_thread()
        sys.exit()
        raise e

main_loop()

Implications

Why reCaptcha 3.0 Changes Nothing

At face value, removing the image-clicking challenges in reCaptcha 3.0 looks like a change made for user convenience, but it appears to have been made mainly because the image captchas have become increasingly less effective over time. By the time reCaptcha 3.0 launched in 2018, reCaptcha 2.0 was already operating almost entirely on the non-challenge anti-spam factors, such as IP address and user cookies.

Without valid cookie history and/or a top-tier proxy, V2 is already extremely slow to solve (by design) and is arguably borderline unsolvable. The classic example is familiar to anyone who attempted to solve reCaptcha on Tor during 2018.

Image classification is a solved problem at this point. It seems like they're removing the already pointless image-clicking part and leaving in the rest.

V2 solvers that are halfway decent at cookie farming should work fine on V3.

Should You Use reCaptcha?

There are three types of spam that websites should be protected from:

Bulk automated spambots that target every possible site they can find;

Manual spam by cheap third-world virtual assistants; and

Custom spam bots built to target your site specifically.

For the average website (i.e., sites that don’t at least have a few million unique users per month):

Bulk automated spam can be mitigated almost entirely by adding a dummy hidden field to your forms and rejecting submissions if that field has any data in it (which sounds like bullshit, but this actually works);

Manual spam is mostly unaffected by captchas; and

Fucking absolutely no one is going to write a custom spam bot for your site until it gets substantially larger.
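The hidden dummy field mentioned above is usually called a honeypot. Here is a minimal, framework-agnostic sketch of the idea, assuming the submitted form arrives as a plain dict. The field name `website` and both function names are placeholders made up for illustration, so adapt them to whatever your stack actually uses.

```python
# Hypothetical field name: pick something a bot would want to fill in.
HONEYPOT_FIELD = "website"


def render_honeypot_input(field_name=HONEYPOT_FIELD):
    """Return HTML for an input that humans never see but naive bots fill in."""
    return (
        '<div style="display:none" aria-hidden="true">'
        '<input type="text" name="{0}" tabindex="-1" autocomplete="off">'
        '</div>'
    ).format(field_name)


def is_probably_spam(form_data, field_name=HONEYPOT_FIELD):
    """Treat any non-blank value in the hidden field as a bot submission."""
    value = form_data.get(field_name) or ""
    return value.strip() != ""
```

In a request handler you would call `is_probably_spam(submitted_fields)` before doing anything else and quietly drop the submission when it returns True. Responding as if the submission succeeded gives a bulk spambot no signal that it was filtered.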

reCaptcha is a substantial inconvenience to users (and also is borderline unusable through Tor) and every inconvenience on your site negatively impacts not only usability, but also conversion rate. While reCaptcha’s use of many factors arguably makes it a good fit for massive sites like circlejerk comment fanclub, expired username land, and Christopher Poole’s anime fan site, for the average use case, reCaptcha is excessive, unnecessary, and intrusive.
