OCR (光学字符识别 in Chinese) on Chinese scans and movie subtitles

Update May 1, 2015: (a9t9) launched its very own free and open-source Online OCR service - try it out and let us know how it compares.

Think English language OCR is hard? Then try Chinese. This is what I did for this review. When I reviewed Online OCR services for English, there were 5 OCR surprises. Now I am back, looking at OCR software for Chinese characters.

Why is Chinese OCR difficult?

Why is Chinese OCR more difficult than, say, English or German document OCR?

Optical character recognition by itself is still hard. It is not a solved problem, at least not for the software available to end-users. The result of our English OCR benchmark have been mixed. But Chinese language OCR takes the challenge to another level. Here is why:

(1) Number of characters: A typical Western alphabet has around 24-30 characters whereas Chinese OCR software has to learn far more that that. To be useful, it needs to know at least the 6,763 simplified Chinese characters the GB–2312 standard. Then add another ~5000 or so traditional versions of characters to the mix. So we have at least 10,000 characters. This also means: Some rare characters may not be recognized simply because they’re not in the database – something that cannot happen in English.

(2) Every new character the software supports is another character that might potentially result in a false positive match, so there is a limit for the sake of accuracy.

(3) In Chinese, a character (or two) resembles an English word. Example: 手机 = Mobile Phone (literally: hand machine). In this example 2 Chinese characters = 11 English characters. So the information density in Chinese texts is much higher. This also means the text size needs to be greater. Typical lower limits for OCR software are 15 pixels for Western languages or 20 pixels for East Asian languages.

OCR Software Benchmark for Chinese

For this review a Chinese OCR benchmark consisting of of six images was created: A document scan of magazine article in three different qualities (300dpi, 100dpi, 75 dpi), a Lumia 535 smartphone image of the article, and screenshot of two movie subtitles (more about the movie subtitles later).

1. High-Quality Chinese Scan (300 dpi)

Test 1: Chinese OCR, 300dpi

OCR Service Result Output (Excerpt) Abby Finereader 100% 在中国，餐厅里的菜通常很特别， Google Docs OCR 100% 在中國,餐斤里的菜通常很特別, OnlineOCR 100% 在中国，餐厅里的菜通常很特别， i2 OCR Good 在中国, 餐厅里的菜通常壕艮特另u, NewOCR 100% 在中国， 餐厅里的菜通常很特别,

The first test uses a high-quality scan of a magazine article. All services that can recognize Chinese characters did well in this test, most with no error (100%) and some with almost no error (good).

Having said that, some services that I reviewed for English language OCR do not support Chinese and are thus not part of this review.

1. Low-Quality Chinese Scan (100 dpi)

Test 2: Chinese OCR, 100dpi Scan

OCR Service Result Output (Excerpt) Abby Finereader Good 在中“，枝厅里的芡通常很持别 Google Docs OCR Good 在中山,餐訂里的菜通常很特別, OnlineOCR Good 在中囚,各厅里的菜通常很特别 i2 OCR Fail 在中山l 鲁汀里的菜通常很牺易ul, NewOCR Poor 在中凹. 稷汀呈的菜逆常很持别l

The 100 dpi scan the text is easily readable for a human, even so it is somewhat blurred. For OCR systems this resolutions is close to limit of the technology can handle. Three competitors barely got the good grade, while one service is unusable.

3. Very low-Quality Chinese Scan (75 dpi)

Test 3: Chinese OCR, 75dpi Scan [ ]

OCR Service Result Output (Excerpt) Abby Finereader Fail (no text) Google Docs OCR Fail (no text) OnlineOCR Fail 往中工肠泞王的共诵常很特别 i2 OCR Fail 仕申重. g厅虫的翼矗薰蟹麟颤」, NewOCR Fail 仨中重. 器焘虫的翼邃篱氰麝蒽上

Humans can still read this text, even so it involves guessing a few characters from the context. Not so our PC – every single OCR software flunked this test.

4. Smartphone image

Test 4: Smartphone camera

OCR Service Result Output (Excerpt) Abby Finereader Good 在中国，说厅里的菜通常很特別， Google Docs OCR Fail OCR not trigged OnlineOCR Good 在中国，偿厅里的菜通常很特别， i2 OCR Fail 在口口五， 餐厅里的粟遇常抒艮持另u, NewOCR Good 在中国, 餐厅里的菜通常很特别,

Using your mobile phone as scanner? Sure, that works. Three services deliver a good conversion result despite the yellowish background and somewhat oblique text. Surprisingly Google OCR fails this test: Google Docs no longer has a dedicated “Start OCR” button and the automatic OCR fails to trigger.

5/6: Chinese movie subtitles

This is not your average OCR task. The challenge here are the backgrounds. Of-the-shelf OCR systems have a very hard time distinguishing text from the background.

Test 5: Movie Subtitle 1

OCR Service Result Output (Excerpt) Abby Finereader Poor 1 ^跳，彳II见上面的字吗 Google Docs OCR Poor 行得現上面的字 OnlineOCR Fail ).ir-iv一目日口 i2 OCR Fail (no text) NewOCR Poor 唰 得见上面的学吗

In Subtitle 1 (from the movie Anchoring Seattle, not that this matters….) the subtitle is white on green. No OCR software can read it ok. But Abbyy, Finereader, Google OCR and NewOCR detect at least a few characters correctly.

Test 6: Movie Subtitle 2

OCR Service Result Output (Excerpt) Abby Finereader Fail (no text) Google Docs OCR Fail (no text) OnlineOCR Fail ．叫口圈团口睡鼠喻戒… i2 OCR Fail (no text) NewOCR Fail 莪问大z 之… > .二_…

In Subtitle 2 (from the movie Good for Nothing Heroes) we have a street scene and the subtitle is white on a greyish background. No OCR software can read the characters, despite the large size characters (screenshot from full-screen Youtube replay of the Chinese movie).

Abbyy Cloud SDK works better than Abbyy FineReader for complex backgrounds. So for test 5 and 6 the Abbyy Cloud SDK was used.

In my February OCR review, Abbyy OCR did a great job reading numbers from a gas meter image, so I am surprised it does not better here. But I used Abby FineReader instead of the Abbyy Cloud SDK for this review, as it is easier to use and showed no significant difference in recognition rates in a pre-test. So I went back to the cloud SDK for the subtitle recognition. And indeed, Abbyy Cloud SDK seems to have a more powerful recognition engine and/or background removal. It recognized Subtitle 1 and 2 at least partly. I would love to know why all OCR services miss the ~ first half of both subtitles completely.

Summary - Best OCR (web) software for Chinese characters

Ranking Score Scan1 Scan2 Scan3 Mobile Sub1 Sub1 Abbyy FineReader 8 ++ + - + 0 - OnlineOCR 7 ++ + - + - - Google Docs OCR 6 ++ + - - 0 - NewOCR 6 ++ 0 - + 0 - i2 OCR 2 + - - - - -

Just like with English Google OCR disappoints and is clumsy to use. It seems to be some kind of black magic if or if not the OCR conversion is triggered. If it fails, no error messages guides the user. And if it works, it works just average. The TOP 3 spots are exactly the same as in the English language OCR test: Abbyy wins the review, and the “no-name” OnlineOCR service comes in second - and is the best free OCR service. Google ranks only 3rd, along with NewOCR. So the surprise of this OCR for Chinese characters review is: No surprises - or in other words: We learned that whatever OCR software did well for English also did well for Chinese.

Chinese language version of this article: 最棒的在线中文OCR软件-评介

References

Reviewed online OCR services:

OCR benchmark - test images: