Tesseract Installation and Training (Image Recognition)

Reposted article. Originally published 2017-07-29 21:30. Tag: Python

Source code: https://github.com/tesseract-ocr/tesseract

Environment: Windows 10

Installed version: tesseract-ocr-setup-3.02.02.exe

Basic usage:

tesseract number.jpg result -l eng -psm 7
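
The same command can be driven from a script. The following is a minimal sketch, assuming the tesseract executable is on PATH; the helper names build_tesseract_cmd and run_ocr are hypothetical and not part of any library. Page segmentation mode 7 treats the image as a single line of text. The single-dash -psm spelling matches the 3.x release used here; Tesseract 4 and later spell it --psm.

```python
import shutil
import subprocess


def build_tesseract_cmd(image, output_base, lang="eng", psm=7):
    """Return the argument list for the basic command shown above.

    Hypothetical helper; "-psm" is the Tesseract 3.x spelling of the flag.
    """
    return ["tesseract", image, output_base, "-l", lang, "-psm", str(psm)]


def run_ocr(image, output_base, lang="eng", psm=7):
    """Run tesseract if it is installed; the text lands in <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image, output_base, lang, psm), check=True)
    return output_base + ".txt"


# Only print the command here; call run_ocr() to actually execute it.
print(" ".join(build_tesseract_cmd("number.jpg", "result")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.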

Training

Download jTessBoxEditor; the tool requires a Java VM to run.
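1. Merge the images
Convert the images to be recognized to TIFF format and merge them into a single file.
Click Tools -> Merge TIFF, select all the images, and save the result as LAN.new.exp0.tif
2. Generate the box file
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
3. Correct the text (the TIFF and box files must be in the same directory)
Load the box file in jTessBoxEditor, fix the entries under Box Editor -> Box Coordinates, and then save the box file.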
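
For reference, a Tesseract 3.x box file is plain text with one symbol per line: the character, then the left, bottom, right, and top pixel coordinates (origin at the bottom-left corner of the image), then the page number. The sketch below parses and rewrites such lines, which can help when many corrections have to be made in bulk rather than by hand in the editor. The names Box, parse_box_lines, and format_box are hypothetical, and the sample coordinates are made up for illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines) -> List[Box]:
    """Parse box-file lines of the form '<char> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *nums = parts
        boxes.append(Box(char, *map(int, nums)))
    return boxes


def format_box(box: Box) -> str:
    """Turn a Box back into one box-file line."""
    return " ".join(str(field) for field in box)


# Made-up sample data: two symbols on page 0.
sample = ["3 12 8 30 40 0", "7 34 8 52 40 0"]
boxes = parse_box_lines(sample)
# Correct a misrecognized character, as one would do in the editor.
boxes[1] = boxes[1]._replace(char="1")
print([format_box(b) for b in boxes])
```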

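1. tesseract image.MyFont.exp0.tif image.MyFont.exp0 -l chi_sim batch.nochop makebox
This step generates the file image.MyFont.exp0.box.
Put the TIFF file and the box file in the same directory, open the TIFF file with jTessBoxEditor.jar, and correct the box file as needed.
2. tesseract image.MyFont.exp0.tif image.MyFont.exp0 nobatch box.train
This step generates the file image.MyFont.exp0.tr.
3. unicharset_extractor image.MyFont.exp0.box
This step generates a unicharset file.
4. Create a new file named font_properties
Its content is MyFont 0 0 0 0 0, where the five zeros (italic, bold, fixed, serif, fraktur) describe a default regular font.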
5. Run the following commands
shapeclustering -F font_properties -U unicharset image.MyFont.exp0.tr
mftraining -F font_properties -U unicharset -O image.unicharset image.MyFont.exp0.tr
cntraining image.MyFont.exp0.tr
6. Add the prefix "image." to these five files in the directory: unicharset, inttemp, pffmtable, shapetable, normproto
7. Run combine_tessdata image.
Then copy image.traineddata into the tessdata directory.
8. Analyze an image with the new language data
tesseract test.tif output -l image
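
The command-line steps above follow one naming pattern, [lang].[fontname].exp[num], so they can be generated from three values. The following is a minimal sketch that only builds the argument lists in order; it does not run anything. The function name training_commands and its parameters are hypothetical. The manual steps (correcting the box file, creating font_properties, and renaming the intermediate files) still have to happen between the commands as described in the list.

```python
def training_commands(lang="image", font="MyFont", num=0, seed_lang="chi_sim"):
    """Return the Tesseract 3.x training commands from the steps above, in order."""
    base = f"{lang}.{font}.exp{num}"
    tif, tr, box = base + ".tif", base + ".tr", base + ".box"
    return [
        # Step 1: create the initial box file using an existing language.
        ["tesseract", tif, base, "-l", seed_lang, "batch.nochop", "makebox"],
        # Step 2: produce the .tr feature file from the corrected box file.
        ["tesseract", tif, base, "nobatch", "box.train"],
        # Step 3: extract the character set.
        ["unicharset_extractor", box],
        # Step 5: clustering and training.
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr],
        ["cntraining", tr],
        # Step 7: pack the prefixed files into <lang>.traineddata.
        ["combine_tessdata", f"{lang}."],
    ]


commands = training_commands()
for cmd in commands:
    print(" ".join(cmd))
```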

Batch script

echo Create the font_properties file in this directory before running this batch file
echo Run Tesseract for Training..
echo This step generates the file image.MyFont.exp0.tr
tesseract image.MyFont.exp0.tif image.MyFont.exp0 nobatch box.train  

echo This step generates a unicharset file
unicharset_extractor.exe image.MyFont.exp0.box
rem Create the font_properties file
echo MyFont 0 0 0 0 0 > font_properties
shapeclustering -F font_properties -U unicharset image.MyFont.exp0.tr  
mftraining -F font_properties -U unicharset -O image.unicharset image.MyFont.exp0.tr
  
echo Clustering..  
cntraining.exe image.MyFont.exp0.tr  
  
echo Rename files
rename normproto image.normproto  
rename inttemp image.inttemp  
rename pffmtable image.pffmtable  
rename shapetable image.shapetable   
  
echo Combine files into tessdata..
combine_tessdata.exe image.
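
The file-handling parts of the batch script (writing font_properties and prefixing the intermediate files) can also be done portably from Python. The following is a minimal sketch; the function names and the INTERMEDIATES tuple are hypothetical, and the demo runs against empty placeholder files in a temporary directory rather than real training output. A file is left alone when the prefixed name already exists, which is the case for image.unicharset because mftraining writes it directly via -O.

```python
import tempfile
from pathlib import Path

# The five files that combine_tessdata expects to carry the language prefix.
INTERMEDIATES = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def write_font_properties(workdir, font="MyFont"):
    """Write '<font> 0 0 0 0 0' (italic, bold, fixed, serif, fraktur all off)."""
    path = Path(workdir) / "font_properties"
    path.write_text(f"{font} 0 0 0 0 0\n")
    return path


def prefix_intermediates(workdir, lang="image"):
    """Rename each intermediate file to '<lang>.<name>'; return the new names."""
    renamed = []
    for name in INTERMEDIATES:
        src = Path(workdir) / name
        dst = Path(workdir) / f"{lang}.{name}"
        # Skip missing files and never overwrite an existing prefixed file.
        if src.exists() and not dst.exists():
            src.rename(dst)
            renamed.append(dst.name)
    return renamed


# Demo with empty placeholder files standing in for real training output.
with tempfile.TemporaryDirectory() as tmp:
    for name in INTERMEDIATES:
        (Path(tmp) / name).write_bytes(b"")
    demo_font_line = write_font_properties(tmp).read_text()
    demo_renamed = prefix_intermediates(tmp)
    print(demo_renamed)
```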

# -*- coding:utf-8 -*-
import pytesseract
from PIL import Image
import requests
import os
# CAPTCHA recognition
# Download CAPTCHA images; num is the number of images to download

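def code_down(num):
    # Placeholder: the CAPTCHA URL is not given in full.
    imgurl = 'https/CImages'
    for i in range(num):
        data = requests.get(imgurl)
        name = str(i)
        download_img(data.content, name)


def download_img(imgdata, name):
    # Create the target directory if it does not exist yet.
    os.makedirs('./code_img', exist_ok=True)
    with open('./code_img/' + name + '.jpg', 'wb') as f:
        f.write(imgdata)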

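def code_ocr(num):
    # Run OCR on the greyscale images produced by img_grey().
    for i in range(num):
        image = Image.open('./grey/' + "grey" + str(i) + '.jpg')
        # -psm 7 treats the image as a single text line (Tesseract 4+ uses --psm 7).
        code = pytesseract.image_to_string(image, config='-psm 7')
        print("index:%d code:%s" % (i, code))


def img_grey(num):
    # Convert the downloaded images to greyscale.
    os.makedirs('./grey', exist_ok=True)
    for i in range(num):
        image = Image.open('./code_img/' + str(i) + '.jpg')
        grey = image.convert('L')
        grey.save("./grey/grey" + str(i) + '.jpg')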

# Batch-download the images
# code_down(100)
# Recognize the images
code_ocr(20)
# Convert the images to greyscale (run this before code_ocr)
# img_grey(86)
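
The convert('L') call in the script reduces each RGB pixel to a single brightness value. Pillow documents this as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The following pure-Python sketch applies that formula to a tiny made-up image to show what the greyscale step does. The function names luma and to_grey are hypothetical, and Pillow's internal rounding may differ slightly from the integer division used here.

```python
def luma(r, g, b):
    """Weighted brightness of one RGB pixel (ITU-R 601-2 weights)."""
    return (r * 299 + g * 587 + b * 114) // 1000


def to_grey(pixels):
    """Convert a nested list of (r, g, b) tuples to a nested list of grey levels."""
    return [[luma(*px) for px in row] for row in pixels]


# Made-up 2x2 image: white, black, pure red, pure blue.
tiny = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
grey = to_grey(tiny)
print(grey)
```

Green carries the largest weight and blue the smallest, so blue text on a dark background ends up with low contrast after conversion, which can make OCR harder.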
