The pytesseract Module in Python: Implementing OCR

Reposted article. 2021-08-25 19:05. Category: Automated Testing (PC applications) / Python

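When automating tests for PC desktop applications, there are cases where a control on the UI cannot be located, yet we still want the text shown on screen. In those cases we can take a screenshot and extract the text from the image. Is there a Python tool for OCR? There is, and it is called pytesseract. Its official description is quoted below; let's get to know it and try it out.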

Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

Installation

1. First, download and install the Tesseract installer from https://digi.bib.uni-mannheim.de/tesseract/

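2. After installation, add the Tesseract install directory (the one containing the tesseract executable) to the system PATH environment variable.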

3. Install the required Python libraries. In practice, installing pytesseract on its own raised an error, so install it together with pillow.

pip install pillow
pip install pytesseract

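4. Download the language data for the language you need to recognize from https://github.com/tesseract-ocr/tessdata . Taking Chinese as an example, download chi_sim.traineddata and place it in the tessdata directory under the Tesseract-OCR install directory.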
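
If editing the system environment variables in step 2 is not possible (for example on a locked-down test machine), pytesseract can also be pointed at the executable directly through its pytesseract.pytesseract.tesseract_cmd attribute. The following is a minimal sketch of that approach. The helper names locate_tesseract and configure_pytesseract are hypothetical and were introduced here for illustration; they are not part of pytesseract.

```python
import os
import shutil


def locate_tesseract(extra_dirs=()):
    """Return the path of a tesseract executable, or None if none is found.

    The directories in extra_dirs are checked first, then PATH.
    """
    for directory in extra_dirs:
        for name in ("tesseract.exe", "tesseract"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    # shutil.which searches the directories listed in PATH.
    return shutil.which("tesseract")


def configure_pytesseract(extra_dirs=()):
    """Point pytesseract at the executable found by locate_tesseract."""
    # Imported lazily so locate_tesseract can be used without pytesseract.
    import pytesseract

    path = locate_tesseract(extra_dirs)
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract([r"C:\Program Files\Tesseract-OCR"]) once before the first OCR call should be enough. The directory shown is only an assumed example and has to be replaced with the actual install location.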

Usage

As an example, suppose we want to extract the three characters "酌三巡" from an image.

Usage is very simple: just call the pytesseract.image_to_string() method.

from PIL import Image
import pytesseract

# Open the image and run OCR with the Simplified Chinese language data.
img = Image.open("demo.png")
ocr_text = pytesseract.image_to_string(img, lang="chi_sim")
print("Extracted text:", ocr_text)
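
Two things can go wrong with a call like the one above: the requested language data may not be installed, and the returned string typically carries extra whitespace such as a trailing newline or spaces between Chinese characters. The sketch below shows one possible way to guard against both. The names missing_languages, clean_ocr_text and ocr_image are hypothetical helpers introduced here, and pytesseract.get_languages is assumed to be available, which requires a reasonably recent pytesseract release.

```python
def missing_languages(lang, available):
    """Return the codes in a 'chi_sim+eng'-style string that are not installed."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in available]


def clean_ocr_text(raw):
    """Remove all whitespace, including a trailing newline or form feed.

    This suits short CJK strings; it would also delete the spaces
    between English words, so it is not meant for Latin-script text.
    """
    return "".join(raw.split())


def ocr_image(path, lang="chi_sim"):
    """OCR an image file after checking that the language data is installed."""
    # Imported lazily so the two pure helpers above work on their own.
    from PIL import Image
    import pytesseract

    absent = missing_languages(lang, pytesseract.get_languages(config=""))
    if absent:
        raise RuntimeError("missing traineddata for: " + ", ".join(absent))
    with Image.open(path) as img:
        return clean_ocr_text(pytesseract.image_to_string(img, lang=lang))
```

With these helpers, ocr_image("demo.png") would replace the three OCR lines of the example, and a missing chi_sim.traineddata file is reported by name instead of surfacing as a Tesseract error.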

Output: [screenshot of the run result, not included]

References

https://github.com/madmaze/pytesseract
https://github.com/tesseract-ocr/tesseract
