Training a Chinese language data file for Tesseract 3.02
Download the chi_sim.traineddata language data
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files
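0. Preparation
For convenience, name the training image in the format [lang].[fontname].exp[num].tif
where lang is the language name and fontname is the font name.
For example, to train a custom language named mjorcen with a font named normal,
rename the image file to mjorcen.normal.exp0.jpg (this walkthrough uses a JPEG image, so the extension is .jpg rather than .tif; the rest of the name follows the same pattern).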
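The naming convention can be checked before training starts. The following is a minimal sketch for a Unix-like shell; the function name split_training_name is hypothetical and not part of Tesseract. It splits a file name into its language, font name and exp number, and only verifies that the third component looks like exp[num].

```shell
#!/bin/bash
# Split "[lang].[fontname].exp[num].<ext>" into "lang fontname num".
# Hypothetical helper: returns 1 (with a message on stderr) when the
# third dot-separated component is not "exp" followed by digits.
split_training_name() {
    local name lang rest font num
    name=${1##*/}            # drop any leading directory
    lang=${name%%.*}         # text before the first dot
    rest=${name#*.}          # fontname.expN.ext
    font=${rest%%.*}
    rest=${rest#*.}          # expN.ext
    num=${rest%%.*}          # expN
    case $num in
        exp*) num=${num#exp} ;;
        *)    num= ;;
    esac
    case $num in
        ''|*[!0-9]*)
            echo "not a [lang].[fontname].exp[num] name: $name" >&2
            return 1 ;;
    esac
    echo "$lang $font $num"
}

split_training_name mjorcen.normal.exp0.jpg   # prints: mjorcen normal 0
```

The font name printed here is the one that has to appear in font_properties later on.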
Image:
Now start the training:
1. Generate the .box file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Keep the image file and the box file in the same directory.
2. Open the image file in jTessBoxEditor.jar, then correct the box file as needed
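After editing, it can help to sanity-check the box file before moving on. In Tesseract 3.x each box file line has six whitespace-separated fields: the symbol, then left, bottom, right, top (pixel coordinates measured from the bottom-left corner of the image) and the page number. The sketch below assumes a Unix-like shell with awk available; check_box_file is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: report malformed lines in a Tesseract 3.x box file.
# Expected line layout: <symbol> <left> <bottom> <right> <top> <page>
# Exit status is 0 when every line passes, 1 otherwise.
check_box_file() {
    awk '
        { sub(/\r$/, "") }   # tolerate Windows line endings
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad = 1; next
        }
        {
            for (i = 2; i <= 6; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not a number\n", NR, i
                    bad = 1; next
                }
            # left must be smaller than right, bottom smaller than top
            if ($2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0) {
                printf "line %d: empty or inverted box\n", NR
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage: check_box_file mjorcen.normal.exp0.box
```

A box that fails this check usually means a coordinate was mistyped while editing; the check says nothing about whether the symbol itself is correct.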

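3. Generate the .tr file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
4. Generate a unicharset file
unicharset_extractor mjorcen.normal.exp0.box
5. Create a font_properties file
containing the line normal 0 0 0 0 0, which describes a plain default font (the five flags are italic, bold, fixed, serif and fraktur)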
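Each font_properties line has the form <fontname> <italic> <bold> <fixed> <serif> <fraktur>, where every flag is 0 or 1. As a small sketch for a Unix-like shell, the hypothetical helper font_properties_line below builds such a line and rejects anything that is not exactly five 0/1 flags.

```shell
#!/bin/bash
# Hypothetical helper: print one font_properties line.
# Arguments: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
font_properties_line() {
    local name flag
    if [ "$#" -ne 6 ]; then
        echo "usage: font_properties_line NAME ITALIC BOLD FIXED SERIF FRAKTUR" >&2
        return 1
    fi
    name=$1
    shift
    for flag in "$@"; do
        case $flag in
            0|1) ;;
            *) echo "flag must be 0 or 1, got: $flag" >&2; return 1 ;;
        esac
    done
    echo "$name $*"
}

font_properties_line normal 0 0 0 0 0    # prints: normal 0 0 0 0 0
```

Redirecting the output to a file named font_properties gives the file this step asks for. The font name must match the fontname part of the training file name.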
6. Run the following commands
shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
cntraining mjorcen.normal.exp0.tr
The output is as follows:
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0

E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!

E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
7. Prefix the five files unicharset, inttemp, pffmtable, shapetable and normproto in the directory with normal. (for example, unicharset becomes normal.unicharset)
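On a Unix-like shell (an assumption; this walkthrough itself runs on Windows) the renaming can be sketched as follows. The function name prefix_training_files is hypothetical. It refuses to rename anything unless all five files are present, so a failed training step is noticed before combine_tessdata runs.

```shell
#!/bin/bash
# Hypothetical helper: add "<lang>." in front of the five training outputs
# in the current directory. Nothing is renamed if any file is missing.
prefix_training_files() {
    local lang f
    lang=$1
    if [ -z "$lang" ]; then
        echo "usage: prefix_training_files LANG" >&2
        return 1
    fi
    for f in unicharset inttemp pffmtable shapetable normproto; do
        if [ ! -f "$f" ]; then
            echo "missing training output: $f" >&2
            return 1
        fi
    done
    for f in unicharset inttemp pffmtable shapetable normproto; do
        mv -- "$f" "$lang.$f"
    done
}

# Usage (run inside the training directory): prefix_training_files normal
```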
8. Run combine_tessdata normal. (the trailing dot is part of the argument)
9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory
10. Test
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Debug output:
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 14941344
Tesseract Open Source OCR Engine v3.02 with Leptonica

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words

E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.normal.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.

E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0

E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!

E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...

E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3.02 with Leptonica

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 4324296
Tesseract Open Source OCR Engine v3.02 with Leptonica
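Result with the normal language data:
應收: 119
Result with the stock chi_sim language data:
應收= II苜
Script (requires a Java runtime):
The directory structure is as follows:

The script is as follows:
Windows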
@echo off
set "src=%~1"
set "font_name=%~2"
set "desc=%~3"
if not defined src set /p src=" please pass your filename : "
if not defined font_name set /p font_name=" please pass your font_name : "
rem Validate the arguments
if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit /b 1
if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit /b 1
rem Default output base name: the file name without its last four characters (the extension)
if not defined desc set "desc=%src:~0,-4%"
echo desc %desc%
rem Create font_properties if it does not exist in this directory
if exist font_properties (
    echo font_properties exist
) else (
    ECHO %font_name% 0 0 0 0 0 >"font_properties"
)
rem Delete files left over from a previous run
if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset & DEL /Q %font_name%.unicharset
if exist %font_name%.inttemp ECHO DEL %font_name%.inttemp & DEL /Q %font_name%.inttemp
if exist %font_name%.pffmtable ECHO DEL %font_name%.pffmtable & DEL /Q %font_name%.pffmtable
if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL /Q %font_name%.shapetable
if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL /Q %font_name%.normproto
if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL /Q %font_name%.font_properties
rem makebox
tesseract %src% %desc% -l chi_sim batch.nochop makebox
rem Open jTessBoxEditor so the box file can be corrected by hand
java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
ECHO Please change your results , and press any key to continue
pause>nul
tesseract %src% %desc% nobatch box.train
unicharset_extractor %desc%.box
shapeclustering -F font_properties -U unicharset %desc%.tr
mftraining -F font_properties -U unicharset -O unicharset %desc%.tr
cntraining %desc%.tr
rem Rename the generated files with the font name as prefix
if exist unicharset ECHO rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp ECHO rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable ECHO rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto ECHO rename normproto %font_name%.normproto & rename normproto %font_name%.normproto
combine_tessdata %font_name%.
if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties
ECHO press any key to continue
pause>nul
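Invocation:
Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output base name; if argument 3 is omitted, it defaults to the file name with its last four characters (the extension) removed
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
Example: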
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
desc mjorcen.normal.exp0
font_properties exist
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2686128
Tesseract Open Source OCR Engine v3.02 with Leptonica
Please change your results , and press any key to continue
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
rename font_properties normal.font_properties
E:\data\Users\Administrator\Desktop\ocrBuider3>
Linux (from the documentation: http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):
#!/bin/bash
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" >font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz

