Retraining Tesseract for Chinese

This article is a repost. 2017-05-04 11:02. Tags: python/ linux

For Tesseract 4.0 and later, see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining

1. Download and install jTessBoxEditor: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

2. Collect sample images. (They can be generated automatically, here from a text file with text2image.)

text2image --text=training_text.txt --outputbase=cert.normal.exp0 --font=FreeMono --fonts_dir=/usr/share/fonts/truetype/freefont/

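3. Merge the sample images. Run jTessBoxEditor and choose Tools -> Merge TIFF from the menu bar. In the dialog that opens, select the sample images (hold Shift to select several) and merge them into a single file, here cert.normal.exp0.tif. The name follows the pattern [lang].[fontname].exp[num].tif.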

java -jar jTessBoxEditor.jar 

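# Or, with ImageMagick:
mogrify -format tif *.jpg
# cat cannot join TIFF files into one valid image; convert writes them as pages of a single multi-page TIFF
convert *.tif cert.normal.exp0.tif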

4. Generate the box file. Open a command line and run:

tesseract cert.normal.exp0.tif cert.normal.exp0 -l chi_sim -psm 6 batch.nochop makebox

5. Open the box file in jTessBoxEditor and correct the recognition errors before training.
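
When correcting by hand it helps to know what a box file looks like: each line holds one symbol followed by its left, bottom, right and top pixel coordinates and a page number. A rough sketch of a sanity check follows. The function name check_box_file is made up for this example, and it only counts fields and checks that the coordinates are non-negative integers, nothing more.

```shell
#!/bin/sh
# Hypothetical helper: print the line numbers of box-file lines that do not
# look like "<symbol> <left> <bottom> <right> <top> <page>".
check_box_file() {
    awk '
        # Wrong number of fields: report the line and move on.
        NF != 6 { print NR; next }
        {
            # Fields 2..6 must all be non-negative integers.
            for (i = 2; i <= 6; i++)
                if ($i !~ /^[0-9]+$/) { print NR; next }
        }
    ' "$1"
}
```

Running check_box_file cert.normal.exp0.box after editing prints nothing when every line has the expected shape.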

6. Train

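Create a file named font_properties containing the line normal 0 0 0 0 0, which declares the font "normal" as a plain default font. The five flags after the font name are italic, bold, fixed, serif and fraktur.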
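
The same file can be written from the shell. This is a minimal sketch, assuming the font name normal taken from the file name cert.normal.exp0.tif; the helper name write_font_properties is invented for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a one-line font_properties file.
# Field order: fontname italic bold fixed serif fraktur (1 = yes, 0 = no).
write_font_properties() {
    font_name="$1"                     # must match the font part of cert.<font>.exp0.tif
    out_file="${2:-font_properties}"   # defaults to ./font_properties
    printf '%s 0 0 0 0 0\n' "$font_name" > "$out_file"
}
```

Calling write_font_properties normal in the training directory creates the file that the -F font_properties options below expect.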

tesseract cert.normal.exp0.tif cert.normal.exp0 nobatch box.train
unicharset_extractor cert.normal.exp0.box

shapeclustering -F font_properties -U unicharset cert.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset cert.normal.exp0.tr
cntraining cert.normal.exp0.tr

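This produces five files in the directory: unicharset, inttemp, pffmtable, shapetable and normproto. Add the prefix "cert." to each of them, so that unicharset becomes cert.unicharset and so on.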
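
The renaming can be done in one go. The sketch below assumes the five files sit in the current directory unless another one is given; the function name prefix_training_files is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: add "<lang>." in front of the five training outputs.
prefix_training_files() {
    lang="$1"        # language prefix, e.g. cert
    dir="${2:-.}"    # directory holding the training outputs
    files="unicharset inttemp pffmtable shapetable normproto"
    # Check everything first, so a missing file does not leave the
    # directory half renamed.
    for name in $files; do
        [ -f "$dir/$name" ] || { echo "missing: $dir/$name" >&2; return 1; }
    done
    for name in $files; do
        mv "$dir/$name" "$dir/$lang.$name"
    done
}
```

Here prefix_training_files cert renames the files in place and stops with an error message if any of the five is absent.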


Then merge the five files from the command line:

combine_tessdata cert.
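
combine_tessdata writes its result to cert.traineddata. To use it, the file has to be placed in the tessdata directory of the Tesseract installation, whose location varies between systems. A small sketch, with the function name install_traineddata and the directory argument both being assumptions of this example:

```shell
#!/bin/sh
# Hypothetical helper: copy <lang>.traineddata from the current directory
# into a tessdata directory supplied by the caller.
install_traineddata() {
    lang="$1"            # e.g. cert
    tessdata_dir="$2"    # tessdata directory of your installation
    [ -f "$lang.traineddata" ] || { echo "missing: $lang.traineddata" >&2; return 1; }
    [ -d "$tessdata_dir" ] || { echo "not a directory: $tessdata_dir" >&2; return 1; }
    cp "$lang.traineddata" "$tessdata_dir/"
}
```

After install_traineddata cert followed by the tessdata path, the new data can be selected with -l cert, in the same way the box-file step above selects chi_sim.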

References:

1. http://www.cnblogs.com/wzben/p/5930538.html

2. http://blog.csdn.net/yimingsilence/article/details/51353772

3. https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining

4. http://docs.oracle.com/cd/E56344_01/html/E54075/mogrify-1.html (mogrify command manual)

5. http://www.cnblogs.com/robben/p/4315123.html (using the convert and mogrify commands)
