[Original] Tesseract-OCR 3.02 Training Notes

Reposted article; original dated 2013-04-15 10:10

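Goal: recognize the email address on http://www.computrabajo.com.mx/bt-ofrd-human1985-207292.htm

Official documentation: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

The official English documentation is long; these notes record only the key steps.

Important: the Tesseract version used for training must be the same as the version used at run time.

Preparation:

1. Install Tesseract

2. Download the image, save it locally, and convert it to TIF format.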
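The commands in the following steps expect the image to be named [lang].[fontname].exp[num].tif. A minimal sketch of building that name is shown here; the helper name training_basename and the file name email.png are hypothetical, and the commented-out conversion line assumes ImageMagick is installed.

```shell
# Build the training base name: <lang>.<fontname>.exp<num>
training_basename() {
    # $1 = language code, $2 = font name, $3 = sample number
    printf '%s.%s.exp%s\n' "$1" "$2" "$3"
}

base=$(training_basename eng timesitalic 0)
echo "$base.tif"    # eng.timesitalic.exp0.tif

# Assumption: ImageMagick is available and the download was saved as email.png.
# convert email.png "$base.tif"
```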

Make Box Files

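1. tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox

2. Open the generated box file in a text editor and correct it until it matches the text in the image.

PS: the jTessBoxEditor tool can be used for this step instead.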
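After hand-editing, it is easy to leave a malformed line in the box file. The sketch below assumes the 3.0x box layout of one character followed by left, bottom, right, top and page number on each line; check_box_file is a hypothetical helper, not part of Tesseract.

```shell
# Report lines that do not look like "<char> <left> <bottom> <right> <top> <page>".
check_box_file() {
    # $1 = path to a .box file; returns non-zero if any line is malformed
    awk '
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        {
            # fields 2..6 must be non-negative integers
            for (i = 2; i <= 6; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not a number\n", NR, i
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demonstration on a throwaway file with two made-up boxes.
tmp=$(mktemp)
printf '%s\n' 'a 10 12 22 30 0' '@ 24 10 40 32 0' > "$tmp"
check_box_file "$tmp" && echo "box file looks well-formed"
rm -f "$tmp"
```

This only checks the shape of each line; whether the characters match the image still has to be verified by eye or in jTessBoxEditor.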

Run Tesseract for Training

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] nobatch box.train

This step produces a .tr file.

Compute the Character Set

unicharset_extractor lang.fontname.exp0.box

This step produces a unicharset file.

font_properties (new in 3.01)

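Use Notepad to create a new file named font_properties, with content in the format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>

For example: timesitalic 1 0 0 1 0

Note for this step: <fontname> must match the [fontname] given in the command of the "Run Tesseract for Training" step; if none was given there, the font name is UnknownFont

For example: UnknownFont 0 0 0 0 0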
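Because the font name has to match the one in the image file name, one option is to derive it from that name instead of typing it twice. This is only a sketch: fontname_from_tif and write_font_properties are hypothetical helpers, and the demonstration writes into a temporary directory rather than the training directory.

```shell
# Extract <fontname> from a name of the form <lang>.<fontname>.exp<num>.tif
fontname_from_tif() {
    name=${1##*/}      # drop any directory part
    name=${name#*.}    # drop the leading "<lang>."
    printf '%s\n' "${name%%.*}"
}

# Write a single-font font_properties file.
write_font_properties() {
    # $1 = output file, $2 = fontname, $3..$7 = italic bold fixed serif fraktur (0 or 1)
    printf '%s %s %s %s %s %s\n' "$2" "$3" "$4" "$5" "$6" "$7" > "$1"
}

dir=$(mktemp -d)
font=$(fontname_from_tif eng.timesitalic.exp0.tif)
write_font_properties "$dir/font_properties" "$font" 1 0 0 1 0
cat "$dir/font_properties"    # timesitalic 1 0 0 1 0
rm -rf "$dir"
```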

Clustering

Three commands:

shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr

mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr

cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr
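The three commands can be wrapped so that they run in order and stop at the first failure. The sketch below is an assumption about how one might script it: run and cluster are hypothetical helpers, the flags are the ones listed above, and with DRY_RUN=1 the commands are only printed, which makes it possible to review them before anything is executed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the three clustering tools over the same set of .tr files.
cluster() {
    # $1 = language code; remaining arguments = .tr files produced by box.train
    lang=$1
    shift
    run shapeclustering -F font_properties -U unicharset "$@" &&
    run mftraining -F font_properties -U unicharset -O "$lang.unicharset" "$@" &&
    run cntraining "$@"
}

# Dry run: only shows what would be executed.
DRY_RUN=1
cluster eng eng.timesitalic.exp0.tr
```

Setting DRY_RUN=0 and calling cluster from the directory that holds font_properties, unicharset and the .tr files would execute the tools for real.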

Putting it all together

combine_tessdata lang.

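Note: the files produced by the Clustering step must be renamed. When I first started training, I did not read the most important passage of the official documentation carefully: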

That is all there is to it! All you need to do now is collect together all (shapetable, normproto, inttemp, pffmtable) the files and rename them with a lang. prefix, where lang is the 3-letter code for your language taken from http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes …

Only later, after reading 邊城駱駝's blog post on CSDN (http://blog.csdn.net/marvinhong/article/details/8459591), did it finally make sense.
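The renaming described in the quoted passage can be done with a short loop. This is a minimal sketch: prefix_training_files is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory instead of real clustering output.

```shell
# Rename shapetable, normproto, inttemp and pffmtable to <lang>.<name>.
prefix_training_files() {
    # $1 = language code, $2 = directory holding the clustering output
    for f in shapetable normproto inttemp pffmtable; do
        if [ -f "$2/$f" ]; then
            mv "$2/$f" "$2/$1.$f"
        else
            echo "missing: $2/$f" >&2
        fi
    done
}

# Demonstration with placeholder files.
dir=$(mktemp -d)
touch "$dir/shapetable" "$dir/normproto" "$dir/inttemp" "$dir/pffmtable"
prefix_training_files eng "$dir"
ls "$dir"
rm -rf "$dir"
```

The rename has to happen before combine_tessdata is run, because that command picks up the files by their lang. prefix.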

Testing

tesseract image.tif output -l lang
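The command above writes the recognized text to output.txt. Since the goal is an email address, the text can then be filtered with a regular expression. The sketch below is an assumption about how one might do this: extract_emails is a hypothetical helper, the pattern is a deliberately loose approximation of an email address, and the sample text stands in for real OCR output.

```shell
# Print every email-like token found in a text file.
extract_emails() {
    # $1 = text file written by tesseract (for example output.txt)
    grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$1"
}

# Demonstration on made-up text standing in for output.txt.
sample=$(mktemp)
printf '%s\n' 'Contact: user@example.com' > "$sample"
extract_emails "$sample"    # user@example.com
rm -f "$sample"
```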

Finally, a few useful links:

tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting

VietOCR | Free Graphics software downloads at SourceForge.net

tesseractdotnet - tesseract-ocr .net - Google Project Hosting
