19 Aug 2008 @ 3:37 PM 
 

‘Of course I’m not wasting time on the web… I’m *transcribing*…’

 

If you’ve signed up for an account within the past several years, or done pretty much anything on the Web that required proving you weren’t a bot, then you’ve probably encountered a reCAPTCHA, a kind of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). Yes, it’s a good way to ensure you’re human, but did you know that you were also helping archivists decipher old texts?

If You Use the Web, You May Have Already Been Enlisted as a Human Scanner
Those anti-bot security forms that slow you down when you’re entering information just might serve a larger purpose - Scientific American

By Adam Hadhazy

You might think that computer scientists would have figured out a way to get computers to decipher those characters. But they haven’t, so instead they’ve figured out a way to harness all that effort you’re making to protect your security. “When you’re reading those squiggly characters, you are doing something that computers cannot,” says Luis von Ahn, a computer scientist at Carnegie Mellon University (C.M.U.) in Pittsburgh.

Von Ahn and colleagues reported last week in the journal Science that Web users have transcribed the equivalent of 160 books a day—that’s more than 440 million words—in the year since researchers kicked off the program. The initiative is similar to “distributed computing” schemes like SETI@home, which take advantage of unused personal computer processing power to sift through signals received from space for those that might be generated by extraterrestrial intelligence or to figure out how proteins fold. But the difference with this system is that people, not processors, do the calculations.

“We are getting people to help us digitize books at the same time they are authenticating themselves as humans,” von Ahn says. “Every time people are typing these [answers] out, they are actually taking old books or newspapers and helping to transcribe them.”

Other large digitization projects, such as the Google Books Project and the Internet Archive, rely on optical character recognition (OCR) software. Basically, computers take a digital image of a book or newspaper page, then try to discern the individual letters, von Ahn says. But he and other C.M.U. researchers estimate that these programs misinterpret or fail to read up to one out of every five words on weathered, yellowed paper or on pages with faded or smeared ink. Such electronically illegible words and texts must then be manually transcribed by human workers at a relatively high cost, he says.

Von Ahn’s team’s method is a twist on the Web site tests known as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), which have been in use since 2000. The twist is to build the test from words in old, weathered books and newspapers that computerized transcribing programs cannot recognize. Much of the raw “fuel” comes courtesy of the Internet Archive project, which transmits words that its OCR software cannot recognize or that do not appear in the dictionary.
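For the curious, here is a rough sketch of that last step: deciding which OCR'd words get handed off to humans. It is only an illustration of the two criteria the article mentions (the OCR could not recognize the word, or the word is not in the dictionary). The `OcrWord` class, the `confidence` score, and the 0.8 threshold are my own assumptions, not anything the Internet Archive or reCAPTCHA is known to use.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str          # the OCR engine's best guess at the word
    confidence: float  # hypothetical 0.0-1.0 score; an assumption for this sketch


def needs_human(word, dictionary, min_confidence=0.8):
    """Return True if the word should be sent to people to transcribe."""
    # Criterion 1: the OCR engine was not confident it recognized the word.
    unrecognized = word.confidence < min_confidence
    # Criterion 2: the OCR's guess does not appear in the dictionary.
    not_in_dictionary = word.text.lower() not in dictionary
    return unrecognized or not_in_dictionary


def select_for_captcha(words, dictionary, min_confidence=0.8):
    """Keep only the words that fail either criterion."""
    return [w for w in words if needs_human(w, dictionary, min_confidence)]


# Toy data: a three-word scan and a tiny dictionary.
dictionary = {"the", "morning", "paper"}
page = [
    OcrWord("the", 0.99),       # confident and in the dictionary: skipped
    OcrWord("mornirig", 0.95),  # confident, but not a dictionary word
    OcrWord("paper", 0.40),     # a dictionary word, but a low-confidence read
]
flagged = select_for_captcha(page, dictionary)
print([w.text for w in flagged])  # prints ['mornirig', 'paper']
```

Either test alone is enough to flag a word, which is why both the misspelled guess and the smudged-but-correct one end up in the pile for humans.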

About 40,000 Web sites now use the service, called reCAPTCHA, which the project’s site offers for free. Facebook was one of its first major patrons.

Von Ahn estimates that at reCAPTCHA’s current rate of transcription (about four million words a day missed by OCR systems), the program does a week’s worth of transcription from 1,500 professional transcribers in a single day. This data is stored on hard drives at C.M.U. and then sent back to the organization that requested the transcription. (The New York Times, for example, has enlisted reCAPTCHA to digitize the newspaper’s archives dating back to 1851.)

Von Ahn acknowledges that the overall cost for reCAPTCHA is still a bit higher than just using OCR for more recently written, more easily scanned texts. He would not say exactly how much, citing nondisclosure agreements with clients using the software.

When the researchers compared how reCAPTCHA and OCR transcribed five Times articles, reCAPTCHA did a significantly better job—99.1 percent accuracy—than OCR of the sort that Google uses for its book project, which came in at 83.5 percent. (Google declined to comment for this story.)
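To put those two percentages side by side, here is a quick back-of-the-envelope calculation. The only inputs are the accuracy figures quoted above; the per-1,000-words framing and the function name are mine.

```python
def errors_per_thousand(accuracy_percent):
    """Wrong words expected per 1,000 words at a given accuracy."""
    return round(1000 * (100 - accuracy_percent) / 100)


recaptcha_errors = errors_per_thousand(99.1)  # reCAPTCHA accuracy from the article
ocr_errors = errors_per_thousand(83.5)        # OCR accuracy from the article

print(recaptcha_errors)  # prints 9
print(ocr_errors)        # prints 165
print(round(ocr_errors / recaptcha_errors, 1))  # prints 18.3
```

So in a 1,000-word article, OCR would get roughly 165 words wrong against about 9 for reCAPTCHA, around 18 times as many.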

But as is the way with most technology, today’s innovation is tomorrow’s VHS tape. Eventually computers will be able to decipher reCAPTCHAs, too. “We’ll get a few good years out of reCAPTCHAs,” says co-author Manuel Blum, a professor of computer science at Carnegie Mellon and key developer of some of the first CAPTCHAs.

OCR will continue to improve as well, Blum says, along with so-called machine learning in general.

Either way, with some 100 million books published prior to the dawn of the digital era, says von Ahn, that “makes for a lot of words.”

Categories: daily rambles
Posted By: Dawn Masuoka
Last Edit: 19 Aug 2008 @ 09:38 PM
