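Optical Chinese Character Recognition with Tika and Tesseract-OCR (100% recognition rate for Simplified Chinese in the Song typeface), with Java source code, test data, and training-set download links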

Reposted article · 2019-12-26 10:02 · OCR

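OCR (optical character recognition) is an important branch of image processing. Recognizing Chinese is challenging, especially handwriting and cursive script, and it is an important and active research area. Unfortunately, research institutes in China have produced very few training sets with a high recognition rate. I contacted the author of a graduate student's paper at Beijing Language and Culture University that claimed a 90% correct recognition rate. It turned out they had used only 20 Chinese characters with simple strokes (20/6753 ≈ 0.3%, about three in a thousand of the commonly used Simplified characters) and had 20 students each write them out once by hand. It really was a paper written for the sake of a paper, with a very conveniently chosen sample (small and simple).

Stanford University has an engineering project devoted to Chinese character recognition. Research institutes in developed Western countries seem to have more of a research spirit.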

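To improve the recognition rate, the training set is the key!

To improve the recognition rate, the training set is the key!!

To improve the recognition rate, the training set is the key!!!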

To download the training sets (traineddata files), go to:

https://github.com/tesseract-ocr/tessdata

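For Chinese, choose the following four:

chi_sim.traineddata (Simplified; in the tests below, Song typeface at a resolution of 300 dpi or higher is recognized at 100%, and English text and Arabic numerals at over 90%)
chi_sim_vert.traineddata (Simplified, vertical text)
chi_tra.traineddata (Traditional)
chi_tra_vert.traineddata (Traditional, vertical text)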
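Before running any OCR it can help to confirm that the downloaded files actually sit in the tessdata directory. The following is a minimal sketch using only the Java standard library; the class name `TessdataCheck`, the method `missingModels`, and the default directory (taken from the path used in the Java code later in this post) are illustrative assumptions, not part of Tika or Tesseract.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of Tika or Tesseract.
final class TessdataCheck {

    // Language codes of the four Chinese models listed above.
    static final List<String> CHINESE_MODELS =
            Arrays.asList("chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert");

    /** Returns the language codes whose "<code>.traineddata" file is absent from tessdataDir. */
    static List<String> missingModels(Path tessdataDir, List<String> languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages) {
            Path model = tessdataDir.resolve(lang + ".traineddata");
            if (!Files.isRegularFile(model)) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default location; pass another directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0]
                : "C:/Program Files/Tesseract-OCR/tessdata");
        System.out.println("Missing models: " + missingModels(dir, CHINESE_MODELS));
    }
}
```

An empty list means all four models are in place; any code that is printed still needs its `.traineddata` file copied into that directory.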

How to build your own training set

See the official documentation: how to train tesseract

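Testing led to the following conclusions:

Song typeface, white background, no skew, resolution of 300 dpi or higher: 100% recognition rate
English and digits: recognition rate above 90%
Special characters: the recognition rate is not high
When the resolution is too low, the recognition rate drops sharply
With varying background colors, the recognition rate is extremely low
When the font is changed to something like cursive script, the recognition rate drops substantially
Movie subtitles and web page screenshots have a fairly low recognition rate
Scans whose text is too faint or too small are not recognized at all
Improving the recognition rate means building your own training set, which is an enormous amount of manual labor (Simplified Chinese needs at least 6,753 characters, and at least 10,000 once some complex ones are mixed in; every typeface has to be redone, because the task is essentially geometric computation on shapes; few research institutes or open-source projects in China have done this)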
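Since low resolution hurts so much, one commonly tried workaround is to upscale a low-resolution image toward 300 dpi before handing it to the OCR engine. This was not tested in this post, and upscaling cannot restore detail that was never captured, so rescanning at 300 dpi remains the better option when it is possible. The sketch below assumes the source resolution is known to the caller; `DpiUpscale` and `upscaleToTargetDpi` are hypothetical names.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical preprocessing helper; not the method used for the results in this post.
final class DpiUpscale {

    static final int TARGET_DPI = 300;

    /**
     * Scales the image by TARGET_DPI / sourceDpi using bicubic interpolation.
     * Images already at or above the target resolution are returned unchanged.
     */
    static BufferedImage upscaleToTargetDpi(BufferedImage src, int sourceDpi) {
        if (sourceDpi <= 0) {
            throw new IllegalArgumentException("sourceDpi must be positive");
        }
        if (sourceDpi >= TARGET_DPI) {
            return src;
        }
        double factor = (double) TARGET_DPI / sourceDpi;
        int width = (int) Math.round(src.getWidth() * factor);
        int height = (int) Math.round(src.getHeight() * factor);

        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

For a 75 dpi scan the factor is 4, so each side of the image becomes four times longer. The result could be written out with `javax.imageio.ImageIO.write` and then passed to the parser like any other file.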

Java implementation: Tika with Tesseract-OCR

(1) The source code is as follows (it supports recognizing multiple images)

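    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.ocr.TesseractOCRParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.junit.Test; // assumes JUnit 4
    import org.xml.sax.SAXException;

    // The enclosing class declaration is missing from the source; this name is a placeholder.
    public class TesseractOcrTest {

        @Test
        public void testCode() throws IOException, SAXException, TikaException, InterruptedException {
            // Test images to recognize
            List<String> fileNames = new ArrayList<>();
            fileNames.add("chi_eng.png");
            fileNames.add("chi_eng01.png");
            fileNames.add("chi_old.png");
            fileNames.add("chi-scan-75dpi.jpg");
            fileNames.add("chi-scan-100dpi.jpg");
            fileNames.add("chi-scan-300dpi.jpg");
            fileNames.add("chi-smartphone.jpg");
            fileNames.add("chi-subtitle-v1.jpg");
            fileNames.add("english00.png");
            fileNames.add("pdf_shaomiao.png");
            fileNames.add("test.tiff");
            fileNames.add("weather.png");

            // When reposting, please credit the source:
            // https://www.cnblogs.com/NaughtyCat/p/tika-support-Tesseract-OCR-with-source-code-and-test-data.html
            TesseractOCRParser parser = new TesseractOCRParser();
            TesseractOCRConfig config = new TesseractOCRConfig();
            // Use the Simplified Chinese training set
            config.setLanguage("chi_sim");
            // Tesseract installation path
            config.setTesseractPath("C:/Program Files/Tesseract-OCR");
            // Path to the traineddata files
            config.setTessdataPath("C:/Program Files/Tesseract-OCR/tessdata");

            ParseContext context = new ParseContext();
            context.set(TesseractOCRConfig.class, config);
            context.set(TesseractOCRParser.class, parser);

            fileNames.forEach(filename -> {
                // A fresh handler per image keeps each result separate
                BodyContentHandler handler = new BodyContentHandler();
                File file = new File("E:/tika/testData" + File.separator + filename);
                if (file.exists()) {
                    Metadata metadata = new Metadata();
                    try (InputStream stream = new FileInputStream(file)) {
                        parser.parse(stream, handler, metadata, context);
                    } catch (Exception e) {
                        // Report the failure instead of silently swallowing it
                        System.err.println("OCR failed for " + filename + ": " + e);
                    }
                    // Print the recognized text (the original discarded this value)
                    System.out.println(handler.toString());
                }
            });
        }
    }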
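Tika's parser depends on a locally installed Tesseract, so the three config values above correspond roughly to arguments of the `tesseract` command line (`imagename outputbase`, `-l` for the language, `--tessdata-dir` for the model directory). The sketch below only assembles such a command as a list of strings so it can be inspected or handed to `ProcessBuilder`. It is not Tika's internal code, and `TesseractCommand` and `build` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing a roughly equivalent tesseract command line.
final class TesseractCommand {

    /**
     * Builds: <exe> <image> <outputBase> -l <language> [--tessdata-dir <dir>]
     * Pass null for tessdataDir to let Tesseract use its default location.
     */
    static List<String> build(String tesseractExe, String image, String outputBase,
                              String language, String tessdataDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractExe);
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        if (tessdataDir != null) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Paths mirror the ones in the test above; adjust them to the local installation.
        List<String> cmd = build(
                "C:/Program Files/Tesseract-OCR/tesseract",
                "E:/tika/testData/chi-scan-300dpi.jpg",
                "E:/tika/testData/chi-scan-300dpi",
                "chi_sim",
                "C:/Program Files/Tesseract-OCR/tessdata");
        System.out.println(String.join(" ", cmd));
    }
}
```

Running the printed command by hand is a quick way to tell whether a bad result comes from Tesseract itself or from the Tika configuration.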

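Test data (images): description and download links

For a detailed description and the test results, see: https://ocr.space/blog/2015/03/best-ocr-software-for-chinese.html

The test images are available at: https://github.com/A9T9/OCR-Benchmark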

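(2) Original images and results

Based on chi_sim.traineddata, i.e. the Simplified Chinese training set

Figure 1

The conversion result is as follows:

[Conclusion]

300 dpi, recognition rate: 100%

Figure 2

The conversion result is as follows: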

Brief history

Tesseractwes orginally developed at HewlettPackard Laboratones Bristol and
atHewettPackard Co Greeley Colorado beween 1985 and 1994 wthsome
more changes made in 1996 to portto Windows and some C++zing in1998
In2005 Tesseract was open sourced by HP Since 2006 itis developed by Goosgle

Thelatest (LSTM based]j stableversionis4.10, released on July 7.2019.Latest source codes avaable from
master branch on GlHub.Openissues can be foundin ssue racker and Planning iki

Thelatest35 version 5 3.05.02 released onjune 19,2018.Latestsource code for3.055 avaable from
305 branch on GlHHub.There sno development forthisversion,butitcan be used forspecial cases .
see Regression offeatures from 30x

See Release Notes and Change Log formore detas ofthe releases-
Installing Tesseract

You can ettherInstall Tesseractvia prepulltbinary package or pulld iLfrom sourcey
Supported Complersare:

* GCC48 and above
* ang34and above
* MSVC 2015.2017.2019

Othercompllersmightwork butare notofially supportedl
Running Tesseract
Basiccommand line usage:

tesseract inagenane outputbase [-1 ]ang】 [--osn ocrenginenode] [--psn pagesegnode
[configfiles...]

Formore information aboutthe various command line options use esseract --henp or man tesseract .

Examples can befoundin thewiki
For developers

Developers can use Tbtessaract Cor

[Conclusion]
English text, special symbols, and the like are sometimes misrecognized. Recognition rate: > 80%
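The post does not say how the recognition rates were measured. One common way to put a number on a result like the one above is character accuracy: one minus the Levenshtein edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below assumes that definition; `OcrAccuracy` is a hypothetical name, and the reference sentence in `main` is my own reading of the source text behind one OCR line, not something supplied with the test data.

```java
// Hypothetical metric helper; the post's own percentages may have been estimated differently.
final class OcrAccuracy {

    /** Levenshtein distance: minimum number of single-character insertions, deletions, substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Character accuracy in [0, 1], relative to the reference transcription. */
    static double accuracy(String reference, String ocrOutput) {
        if (reference.isEmpty()) {
            return ocrOutput.isEmpty() ? 1.0 : 0.0;
        }
        double acc = 1.0 - (double) editDistance(reference, ocrOutput) / reference.length();
        return Math.max(0.0, acc);
    }

    public static void main(String[] args) {
        String reference = "Since 2006 it is developed by Google."; // assumed ground truth
        String ocr = "Since 2006 itis developed by Goosgle";        // taken from the output above
        System.out.printf("character accuracy: %.3f%n", accuracy(reference, ocr));
    }
}
```

The comparison works on Java `char` values, which is adequate for common Chinese characters because they fall inside the Basic Multilingual Plane.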

Figure 3.

The conversion result is as follows:

E g 氣

Even as Tvanja praised 8e parties Envoyed i 功 i5 7el gzamt7 comgpi 地 08
Qchieveze1 Q 7W7Der- Ofsocial media lsers appeared crilical of er as-
Sesszet 0f 加 e Trip adiistration「5 role 加功 i5 endeavou7
IBM 表示不服 ,Google 不 care。下而讓我們逐字逐句來看他們的論文
吧 , 對於爭論的事情 , 自己下功夫搞清楚。

松貴瑩坊辦少
忠 : https:/ww.cnblogs-com/NaughtyCatpytranslate-of-google-
Quantum-supremacy-article-published-on-nature.html

Quantum supremacy using
a programmable

superconducting
processor

基於可編程的超導處理器實現的量子霸
權

動關盤源 ,https://doorg/10.1038/s41586-019-1666-5
煌收船 2019 樂 7 歷 20 歷
旋准 8 船 2019 樂 9 歷 20 廠
坊終發療 2019 知 10 月 23 廳

Abstract
引言

量子計算機吹牛遢說 , 對於特定的計算任務 , 基於量子處理器的計算
機 , 其速度相較於經典處理器呈指數級增長。根本的挑戰在於構建一

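[Conclusion]
Song typeface, bold, black: recognition rate 100%; slanted, green, etc.: recognition rate 70%

Figure 4 (scanned document).

The conversion result is as follows: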

節 P a
為客戶服務是華為存在的睢一理由” 從公司層面
看 , 為客戶創造價值的主業務流只有一個!

Ipo - nisgniedProductDevelopment

B croeis PaFA 4 辜蒙扁)

Unc - LomdTocash
芸 a npe waa8 2 菅墨

E Ig - ssueToResoliton 林
P L a 顫〉

n i t t

6 P: 01

IP0 主業務流包括 : MW 流程、0R 流程、IPD 流程

D
4 一

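[Conclusion]
In the scanned PDF, only fairly large, fairly bold text is recognized; faintly colored text is not recognized
Recognition rate: about 10%

Figure 5.

The conversion result is as follows: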

大行佳孔當自弼不。

。

巧者勞而春者忱 , 無能者無所必 , 作食而邀
游 , 陸若不系之舟。

。

Chacgyuisdt.

。

124565.

。
12256 dogdogunnn
。
。

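[Conclusion]
Mixed Chinese characters, English, and digits
Recognition rate: 60% to 70%

Figure 6 (weather web page screenshot)

The conversion result is as follows: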

L f

全國 > 囚川 > 尿膳 > 坂區
今奪偉 8-15 天

llc/4rc

208 238 028 058
人 [ [ 92
s
c E E
無 RR 無 RR 無 RR 無 RR

< < < <

[Conclusion]
Background colors (blue, gray, black, orange); text colors (black, white). Recognition rate: under 10%
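Colored backgrounds like these are often handled by converting the image to pure black and white before OCR. This was not tried in this post, so whether it would rescue this screenshot is an open question. The sketch below uses a single global threshold on luminance, which is the simplest variant; `Binarize` is a hypothetical name and the threshold of 128 used in `main` is an arbitrary starting point. White text on a dark background would also need to be inverted, which this sketch does not do.

```java
import java.awt.image.BufferedImage;

// Hypothetical preprocessing helper; a fixed global threshold is a deliberate simplification.
final class Binarize {

    /** Pixels darker than the threshold become black; all others become white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (0..255)
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                dst.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage sample = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        sample.setRGB(0, 0, 0x1E3A8A); // dark blue, treated as ink
        sample.setRGB(1, 0, 0xF5F5F5); // near white, treated as background
        BufferedImage bw = binarize(sample, 128);
        System.out.println(Integer.toHexString(bw.getRGB(0, 0)) + " "
                + Integer.toHexString(bw.getRGB(1, 0)));
    }
}
```

A page with several background colors usually needs a local (adaptive) threshold instead of one global value, so treat this as a starting point only.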

Figure 7.

The conversion result is as follows:

機器人餐廳

cra arenzanmu nnanmes
seeu xraguagpt. ssepumes
人吊 pahs ztpznaapsus anea
an sro an sessuassnet
e ssoangm crmazees aas
iusiaanorg.mmouz rpeae
snreenatesezur eeae t
+ngszensenapenecieme
礦 svapgzanohat

[Conclusion]
75 dpi, recognition rate: about 5%

Figure 8 (movie subtitle screenshot).