I discovered what I consider to be a fairly serious issue with the reCAPTCHA verification system today, and wanted to share it. I’m fairly sure these facts are not widely known, and they can affect a lot of forum owners and administrators.
I run a forum using Vanilla Forum at gyaan.in – regular readers of this blog will know about it. A couple of months ago, I upgraded the forum to the new, redesigned Vanilla Forum 2.x, which comes with built-in support for registration verification using reCAPTCHA. Up to the 1.x branch, there was no way out of the box to auto-approve registrations; a moderator had to approve each account manually. (This is what gyaan.in used too.) With a function as crucial as user registration, I didn’t want to make modifications only to have to re-apply and re-test them every time I installed an upgrade patch. So when version 2.x came along with baked-in support for reCAPTCHA, I was happy to jump on board and remove the approval process. (A move that, I must admit, was controversial among the gyaan.in community and the moderators.)
Over the past few weeks, I noticed that gyaan.in’s email inbox was filling up with a considerable number of mail delivery failure notifications for the initial email sent right after successful registration. I didn’t give it much thought, as I (incorrectly) believed the first step in the new Vanilla Forum sign-up process was a verification email. It turns out it is not – the system sends an email only after a user has already been verified and registered. Had I known this, the number of mailer-daemon messages would have set off alarm bells much earlier.
Today, one of the members (Shreyans) casually mentioned in a private message (in which he was discussing other technical issues he was facing with the forum) that there seemed to be a lot of users on the board with ‘nude’ or ‘naked’ in their usernames. To my surprise, I found that was indeed the case – and in many instances these accounts shared the same email address too. These were obviously spammer accounts, so I deleted them immediately. But it got me thinking about how they could have gotten through.
reCAPTCHA (now owned by Google) serves CAPTCHA challenges drawn from words scanned during Google’s text digitisation efforts. You might have seen this verification challenge on Facebook too at some point. Two words are shown, and you are told to enter both correctly to pass.
Behind the scenes, reCAPTCHA doesn’t actually know what both words are. One word has been positively identified by OCR and is kept as a ‘control’ word. The second word is one that OCR could not recognise; the user’s input for that word is recorded in a database. Once enough users identify an ‘unknown’ word the same way, the reCAPTCHA system sends the corrected word back to the text digitisation programmes and adds it to the corpus of control words used in the system.
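The scheme described above can be sketched as a toy model. To be clear, every name and threshold here is my own illustration, not reCAPTCHA’s actual implementation:

```python
# Toy model of the two-word scheme: one OCR-verified control word gates
# the challenge, while votes on the unknown word accumulate until a
# consensus reading can be promoted into the control corpus.
# All names and thresholds are illustrative, not reCAPTCHA's.
from collections import Counter

CONSENSUS_THRESHOLD = 3  # hypothetical number of agreeing answers

control_words = {"overlooks"}   # words OCR has already verified
votes = {}                      # unknown word id -> Counter of answers

def submit(control_answer, control_word, unknown_id, unknown_answer):
    """Accept the challenge if the control word matches, and record the
    user's reading of the unknown word as a vote."""
    if control_answer != control_word:
        return False
    tally = votes.setdefault(unknown_id, Counter())
    tally[unknown_answer] += 1
    # Promote the unknown word once enough users agree on a reading.
    answer, count = tally.most_common(1)[0]
    if count >= CONSENSUS_THRESHOLD:
        control_words.add(answer)
    return True
```

Note that only the control word is ever actually checked; whatever the user types for the unknown word is, by definition, accepted.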
A well-known loophole is that it is possible to enter one word incorrectly and still have reCAPTCHA consider the answer valid. What I couldn’t understand was how spambots could get past the control word. So I started playing around with the text I entered as the reCAPTCHA response on Vanilla Forum’s registration page. I found that…
- if the number of characters entered for each word is correct;
- and, the words are entered as correctly as possible, except for one character (i.e., one character out of an entered word was deliberately incorrect)
…then reCAPTCHA would authenticate the entry as correct! This issue is not isolated to the Vanilla Forum implementation of reCAPTCHA either, as you can achieve similar results using the demo form on the official reCAPTCHA website.
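The behaviour I observed is consistent with an acceptance rule along these lines. This is my reconstruction from black-box testing only, not reCAPTCHA’s actual code:

```python
def off_by_one_match(expected, answer):
    """Accept an answer that has the right length and differs from the
    expected word in at most one character position. This mirrors the
    tolerance observed in testing; the real check is undocumented."""
    if len(answer) != len(expected):
        return False
    mismatches = sum(1 for a, b in zip(expected, answer) if a != b)
    return mismatches <= 1

def verify(control_word, word1, word2):
    """Either entered word may match the control word, since the other
    word is unknown to the system and any answer is accepted for it."""
    return (off_by_one_match(control_word, word1)
            or off_by_one_match(control_word, word2))
```

Under this model a spambot needs only the correct length and all but one character of the control word right, which drastically shrinks the space it has to guess.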
I searched around for possible reasons for this and found this entry in the reCAPTCHA wiki:
> On the verification word, reCAPTCHA intentionally allows an “off by one” error depending on how much we trust the user giving the solution. This increases the user experience without impacting security. reCAPTCHA engineers monitor this functionality for abuse.
It seems this is a problem by design. What looks crucial in the equation is the qualifier that the off-by-one error is allowed “depending on how much we trust the user giving the solution”. How exactly is this trust defined? I don’t think blocking by IP address alone can work (can it?), because the verification request is sent by the server using reCAPTCHA, tied to the site’s specific public-private key pair, so the IP address making the request is the server’s rather than the spambot’s. (The verify API does accept a remoteip field for the end user’s address, but that value is only as trustworthy as the site supplying it.) Which means ‘block IP addresses that send large volumes of incorrect inputs’ cannot straightforwardly define this ‘trust’.
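For context, the reCAPTCHA verify step at the time was a plain HTTP POST from the site’s server. A minimal sketch of building and parsing that exchange follows; the endpoint and response format are as documented for the v1 API, while the key and values below are placeholders:

```python
import urllib.parse

# The reCAPTCHA v1 verify endpoint at the time; the private key and
# other values used here are placeholders, not real credentials.
VERIFY_URL = "http://www.google.com/recaptcha/api/verify"

def build_verify_request(private_key, remote_ip, challenge, response):
    """Build the POST body the site server sends to reCAPTCHA. Note the
    remoteip field: the server is expected to forward the client's IP."""
    return urllib.parse.urlencode({
        "privatekey": private_key,
        "remoteip": remote_ip,   # the end user's IP, not the server's
        "challenge": challenge,
        "response": response,
    })

def parse_verify_response(body):
    """The API replies with 'true' or 'false' on the first line and an
    error code (e.g. 'incorrect-captcha-sol') on the second."""
    lines = body.strip().split("\n")
    ok = lines[0] == "true"
    error = None if ok else (lines[1] if len(lines) > 1 else None)
    return ok, error
```

Because the whole exchange is server-to-server, any per-client trust signal depends on sites populating remoteip truthfully.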
Another possible yardstick for measuring ‘trust’ would be allowing one-off errors only for typographically similar characters: ‘i’ / ‘l’, ‘a’ / ‘d’, ‘r’ / ‘n’, and so on. However, I don’t think the system uses this either: in all my attempts, it accepted one-off errors for entirely different-looking characters, such as ‘s’ / ‘w’ and ‘q’ / ‘f’.
reCAPTCHA is undoubtedly the most popular CAPTCHA implementation on the Web these days, which is what makes this such a serious issue. A lot of forums and sites now use it as the de facto choice, partly because it is a small way to pitch in to the noble ideal of text digitisation, and partly because presenting ‘real’ words seems a more elegant solution than randomly generated text.
Unfortunately, from what I have now found through experience, the checks and balances used by reCAPTCHA are simply not good enough: it is letting through at least 10 spambots daily, and that is on a relatively low-traffic website like gyaan.in. Imagine the implications for a much juicier target like Facebook or the countless StackExchange websites, which all use it for human verification.
For now, I am going back to trusting manual moderator approval on my Vanilla Forum site. It seems when it comes to identifying humans, nobody is better at that job than a human.