Training Chinese Language Data for Tesseract 3.02



Download the chi_sim.traineddata language data
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files

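0. Preparation

For convenience, name the training image [lang].[fontname].exp[num].tif,
where lang is the language name and fontname is the font name.
For example, to train a custom language mjorcen with a font named normal,
rename the image to mjorcen.normal.exp0.jpg (this walkthrough uses a JPG image rather than a TIFF).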
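
The naming convention can be checked mechanically before training starts. What follows is only a minimal Python sketch and not part of Tesseract: the helper name parse_training_name and its regular expression are assumptions made for illustration, and the pattern accepts any extension because this walkthrough uses a .jpg image.

```python
import re

# Hypothetical helper (not a Tesseract API): matches "[lang].[fontname].exp[num].<ext>".
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>[^.]+)$"
)


def parse_training_name(filename):
    """Return (lang, fontname, num, ext); raise ValueError if the name does not fit."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError("not in [lang].[fontname].exp[num].<ext> form: %r" % filename)
    return (
        match.group("lang"),
        match.group("font"),
        int(match.group("num")),
        match.group("ext"),
    )
```

Calling parse_training_name("mjorcen.normal.exp0.jpg") yields the language, font name, page number, and extension as separate values, which is handy when a script has to derive the font name for font_properties.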

 

[Image: the sample training image]

Now train the language data:

1. Generate the .box file

tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox

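Keep the image file and the box file in the same directory.

2. Open the image file in jTessBoxEditor.jar, then correct the box file so that it matches the actual characters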
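
After editing, it can help to sanity-check the box file. Each line of a Tesseract 3.x box file has the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. The Python sketch below is an illustration only: the function name parse_box_line and the sample coordinates in the usage note are hypothetical, not taken from the author's files.

```python
def parse_box_line(line):
    """Parse one box-file line into (symbol, left, bottom, right, top, page).

    Raises ValueError when the line is malformed or the box is empty,
    which usually points at a mistake made while editing.
    """
    # Split from the right so that only the last five fields are numeric.
    parts = line.rstrip("\r\n").rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    if left >= right or bottom >= top:
        raise ValueError("empty or inverted box: %r" % line)
    return symbol, left, bottom, right, top, page
```

For instance, parse_box_line("應 10 20 40 60 0") returns the symbol with its four coordinates and page number, while a line whose left edge is not smaller than its right edge is rejected.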

 

 

3. Generate the .tr file

tesseract  mjorcen.normal.exp0.jpg mjorcen.normal.exp0  nobatch box.train

 

4. Generate the unicharset file
 

unicharset_extractor mjorcen.normal.exp0.box

 

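5. Create a font_properties file


Write the line normal 0 0 0 0 0 into it, which marks normal as a plain default font (the five flags are italic, bold, fixed, serif, and fraktur).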
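
The font_properties line can also be generated rather than typed by hand. This is a small Python sketch under the assumption that the file follows the Tesseract 3.x layout "fontname italic bold fixed serif fraktur"; the function name font_properties_line is made up for this example.

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Build one font_properties entry: the font name followed by five 0/1 flags."""
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(bool(flag))) for flag in flags])
```

With the defaults, font_properties_line("normal") gives the line used in this step; passing fixed=True for a font named ocra gives the line used by the Linux script later in this post.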

 

6. Run the following commands

shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr

mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr

cntraining mjorcen.normal.exp0.tr

The output looks like this:

E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_propertie
s -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple un
ichars = 0

E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U
unicharset -O  unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!

E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr

Reading mjorcen.normal.exp0.tr ...
Clustering ...

Writing normproto ...
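
The three commands of step 6 always take the same arguments, derived from the base name of the .tr file, so they can be assembled in code. The Python sketch below only builds the argument lists shown above; the function names training_commands and run_training are hypothetical, and run_training assumes the Tesseract training tools are on PATH.

```python
import subprocess


def training_commands(base, font_properties="font_properties", unicharset="unicharset"):
    """Return the argv lists for shapeclustering, mftraining and cntraining."""
    tr_file = base + ".tr"
    return [
        ["shapeclustering", "-F", font_properties, "-U", unicharset, tr_file],
        ["mftraining", "-F", font_properties, "-U", unicharset, "-O", unicharset, tr_file],
        ["cntraining", tr_file],
    ]


def run_training(base):
    """Run the three tools in order, stopping at the first failure."""
    for argv in training_commands(base):
        subprocess.check_call(argv)
```

Calling run_training("mjorcen.normal.exp0") in the working directory would execute the same three commands as above and raise an exception as soon as one of them exits with a non-zero status.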

 

7. Prefix the five files unicharset, inttemp, pffmtable, shapetable, and normproto in the directory with normal. (for example, unicharset becomes normal.unicharset)
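
The renaming can be scripted as well. This Python sketch is one possible way to do it, not the author's method: the function name add_prefix is hypothetical, and it silently skips any of the five files that are missing.

```python
import os

# The five training outputs named in this step.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory, prefix):
    """Rename each existing training output to "<prefix>.<name>".

    Returns the new file names; an existing target is overwritten.
    """
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = os.path.join(directory, name)
        if os.path.isfile(src):
            dst = os.path.join(directory, prefix + "." + name)
            os.replace(src, dst)
            renamed.append(os.path.basename(dst))
    return renamed
```

Running add_prefix(".", "normal") in the working directory produces normal.unicharset, normal.inttemp, and so on, ready for combine_tessdata.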

 

8. Run combine_tessdata normal. (the trailing dot is part of the argument)
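
combine_tessdata prints one "Offset for type N is X" line per component, where -1 means the component is absent. The Tesseract 3 training guide asks for types 1, 3, 4, 5, and 13 to be set; treat that list as an assumption to verify against your version. The Python sketch below, with the hypothetical function name missing_components, checks a captured log for them.

```python
import re

# Assumption: these type numbers are the ones the Tesseract 3 training guide
# requires to be present (unicharset, inttemp, pffmtable, normproto, shapetable).
REQUIRED_TYPES = (1, 3, 4, 5, 13)

_OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def missing_components(log_text, required=REQUIRED_TYPES):
    """Return the required type numbers whose offset is -1 or not reported."""
    offsets = {int(t): int(o) for t, o in _OFFSET_RE.findall(log_text)}
    return [t for t in required if offsets.get(t, -1) == -1]
```

An empty list means every required component made it into the .traineddata file; for the log shown in the debug section of this post, types 1, 3, 4, 5, and 13 all carry real offsets.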

 

9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory

 

10. Test

tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal

 

debug:

 

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg
 mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 14941344
Tesseract Open Source OCR Engine v3.02 with Leptonica

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract  mjorcen.normal.exp0.jp
g mjorcen.normal.exp0  nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile:       6
   Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words

E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.norm
al.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.

E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_propertie
s -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple un
ichars = 0

E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U
unicharset -O  unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!

E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr

Reading mjorcen.normal.exp0.tr ...
Clustering ...

Writing normproto ...

E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg
 mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3.02 with Leptonica

E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg
 mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 4324296
Tesseract Open Source OCR Engine v3.02 with Leptonica

 

Result with the normal data:

應收: 119

Result with the stock Chinese data (chi_sim):

應收= II苜
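
To put a number on the difference between the two results, one option is the character-level edit distance to the expected text. The Python sketch below is a plain Levenshtein implementation; it assumes that the output of the normal data is the correct reading of the image, which the post implies but does not state.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                   # delete ch_a
                cur[j - 1] + 1,                # insert ch_b
                prev[j - 1] + (ch_a != ch_b),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


# Assumed ground truth: the text recognised with the custom "normal" data.
expected = "應收: 119"
stock_output = "應收= II苜"
print(edit_distance(expected, stock_output))
```

A distance of 0 means the output matches the expected text exactly; the larger the number, the more characters were misrecognised.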

 

 

 

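Script (requires a Java environment):

Directory layout: the script runs from the working directory and expects jTessBoxEditor/jTessBoxEditor.jar beneath it.

The script:

Windows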

 

@echo off
rem Usage: run.bat <image file> <font name> [output base name]

set "src=%~1"
set "font_name=%~2"
set "desc=%~3"

rem Prompt for any required argument that was not passed
if not defined src set /p src=" please pass your filename : "
if not defined font_name set /p font_name=" please pass your font_name : "

rem Validate the arguments
if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit /b 1
if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit /b 1

rem Default output base name: the file name minus a 4-character extension such as .jpg
if not defined desc set "desc=%src:~0,-4%"
echo desc %desc%

rem Create font_properties and write the font entry if the file does not exist yet
if exist font_properties (
    echo font_properties exist
) else (
    >font_properties echo %font_name% 0 0 0 0 0
)

rem Delete the output of any previous run
if exist %font_name%.unicharset echo DEL %font_name%.unicharset & del /q %font_name%.unicharset
if exist %font_name%.inttemp echo DEL %font_name%.inttemp & del /q %font_name%.inttemp
if exist %font_name%.pffmtable echo DEL %font_name%.pffmtable & del /q %font_name%.pffmtable
if exist %font_name%.shapetable echo DEL %font_name%.shapetable & del /q %font_name%.shapetable
if exist %font_name%.normproto echo DEL %font_name%.normproto & del /q %font_name%.normproto
if exist %font_name%.font_properties echo DEL %font_name%.font_properties & del /q %font_name%.font_properties

rem makebox
tesseract %src% %desc% -l chi_sim batch.nochop makebox

rem Correct the box file in jTessBoxEditor, then press a key to continue
java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
echo Please change your results , and press any key to continue
pause>nul

rem Train
tesseract %src% %desc% nobatch box.train
unicharset_extractor %desc%.box
shapeclustering -F font_properties -U unicharset %desc%.tr
mftraining -F font_properties -U unicharset -O unicharset %desc%.tr
cntraining %desc%.tr

rem Prefix the new files with the font name
if exist unicharset echo rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp echo rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable echo rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable echo rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto echo rename normproto %font_name%.normproto & rename normproto %font_name%.normproto

combine_tessdata %font_name%.

if exist font_properties echo rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties

echo press any key to continue
pause>nul
 

 

 

 

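Usage:

Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output base name; if omitted, it defaults to the file name without its extension.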

E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
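
The batch script derives the default output name by cutting the last four characters off the file name, which only works for three-letter extensions. The Python sketch below contrasts that slicing with os.path.splitext; both function names are hypothetical and exist only for this comparison.

```python
import os


def default_base_by_slicing(filename):
    """Mimic the batch expression %src:~0,-4%: drop the last four characters."""
    return filename[:-4]


def default_base_by_splitext(filename):
    """Drop whatever extension follows the last dot, regardless of its length."""
    return os.path.splitext(filename)[0]
```

For mjorcen.normal.exp0.jpg both approaches agree, but for a name ending in .tiff the slicing version leaves a trailing dot behind, so passing the third argument explicitly is the safer choice for such files.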

Example run:

E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg n
ormal
desc mjorcen.normal.exp0
 font_properties exist
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2686128
Tesseract Open Source OCR Engine v3.02 with Leptonica
Please change your results , and press any key to continue
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile:       6
   Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple un
ichars = 0
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ...

Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
rename font_properties normal.font_properties
E:\data\Users\Administrator\Desktop\ocrBuider3>
 

 

Linux (from the documentation: http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):

#!/bin/bash 
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" >font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz

 

 

 

 

 

