Building a Tool on Top of the Tencent Cloud OCR Service

Reposted article. 2022-04-10 20:22. Tags: cloud / python


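Preface

My Tencent Cloud account still had a little money left, just enough to buy the text recognition service, and since I read a lot of PDFs I figured I could use the service to convert images to text conveniently. I bought general printed-text recognition, which can recognize both images and PDF files. Note that a local image must be converted to Base64 before it is sent, and that PDF recognition handles only one page per request.

This post records the code I wrote on top of the Tencent Cloud OCR service and the problems I ran into along the way.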

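Installing the SDK

I use Python 3.6. To use the Tencent Cloud OCR service, first install the Tencent Cloud SDK in your local environment. For installation instructions, see: Python - SDK Center - Tencent Cloud (tencent.com)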

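Calling the API

Reading the API documentation

Once the SDK is installed, all that remains is to call the relevant interface. See: Text Recognition API Overview - Server API Documentation - Documentation Center - Tencent Cloud (tencent.com)

Because my main need is recognizing PDFs and screenshots of them, I bought GeneralBasicOCR (general printed-text recognition). Tencent lets you debug calls in API Explorer - Cloud API - Console (tencent.com), which is quite convenient.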

The general printed-text recognition API

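General printed-text recognition supports the following main parameters:

Parameter	Required	Type	Description
Action	Yes	String	Common parameter. For this API the value is GeneralBasicOCR.
Version	Yes	String	Common parameter. For this API the value is 2018-11-19.
Region	Yes	String	Common parameter; see the list of regions supported by the product. This API supports only: ap-beijing, ap-guangzhou, ap-hongkong, ap-seoul, ap-shanghai, ap-singapore, na-toronto
ImageBase64	No	String	Base64 value of the image/PDF. The Base64-encoded image/PDF must not exceed 7M; a resolution of 600*800 or higher is recommended; PNG, JPG, JPEG, BMP and PDF are supported. One of ImageUrl and ImageBase64 must be provided; if both are provided, only ImageUrl is used.
ImageUrl	No	String	URL of the image/PDF. The Base64-encoded image/PDF must not exceed 7M; a resolution of 600*800 or higher is recommended; PNG, JPG, JPEG, BMP and PDF are supported. URLs of images stored on Tencent Cloud offer faster and more stable downloads, so storing images on Tencent Cloud is recommended. URLs outside Tencent Cloud storage may be slower and less stable.
Scene	No	String	Reserved field.
LanguageType	No	String	Recognition language. Automatic language detection is supported, and a language can also be chosen explicitly. The default is mixed Chinese and English (zh); every language supports recognition of text mixed with English. Possible values: zh: mixed Chinese and English; zh_rare: English, digits, rare Chinese characters, Traditional Chinese characters, special symbols, etc.; auto: automatic; mix: mixed languages; jap: Japanese; kor: Korean; spa: Spanish; fre: French; ger: German; por: Portuguese; vie: Vietnamese; may: Malay; rus: Russian; ita: Italian; hol: Dutch; swe: Swedish; fin: Finnish; dan: Danish; nor: Norwegian; hun: Hungarian; tha: Thai; hi: Hindi; ara: Arabic
IsPdf	No	Boolean	Whether to enable PDF recognition. The default is false. When enabled, both images and PDFs can be recognized.
PdfPageNumber	No	Integer	Page number of the PDF page to recognize. Only single-page recognition is supported. Effective only when the uploaded file is a PDF and IsPdf is true. The default is 1.
IsWords	No	Boolean	Whether to return per-character information. Off by default.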

Given my actual needs, I mainly use the ImageBase64, ImageUrl, IsPdf and PdfPageNumber parameters.
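
To make the rules in the table concrete, here is a minimal sketch of how those four parameters could be assembled into a request dictionary. The helper name `build_ocr_params` and the reading of "7M" as 7 * 1024 * 1024 characters are assumptions for illustration; neither comes from the SDK or the API documentation.

```python
import base64
import json

# Limit from the parameter table: the Base64-encoded payload must not exceed 7M.
# Treating "7M" as 7 * 1024 * 1024 characters is an assumption.
MAX_BASE64_LEN = 7 * 1024 * 1024


def build_ocr_params(image_bytes=None, image_url=None, is_pdf=False, page=1):
    """Build the parameter dict for GeneralBasicOCR (hypothetical helper)."""
    if image_bytes is None and image_url is None:
        raise ValueError("Provide either image_bytes or image_url.")

    params = {}
    if image_url is not None:
        # When both are supplied the API only uses ImageUrl, so send just that.
        params["ImageUrl"] = image_url
    else:
        encoded = base64.b64encode(image_bytes).decode("utf-8")
        if len(encoded) > MAX_BASE64_LEN:
            raise ValueError("Base64 payload exceeds the 7M limit.")
        params["ImageBase64"] = encoded

    if is_pdf:
        # PdfPageNumber only takes effect together with IsPdf=True.
        params["IsPdf"] = True
        params["PdfPageNumber"] = page
    return params


if __name__ == "__main__":
    print(json.dumps(build_ocr_params(image_bytes=b"demo", is_pdf=True, page=2)))
```

The resulting dictionary is the same kind of object that the code later in this post serializes with `json.dumps` before handing it to the request object.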

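Code

I mainly use the built-in Python modules argparse, base64 and json, along with the third-party pyperclip package for clipboard access.

I wanted the tool to be convenient to use from the command line, but there are several distinct cases to handle, so I use argparse to cover them. And because a local image has to be sent as Base64, the base64 module is used to encode it.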

main.py

# -*- coding: UTF-8 -*-
# Reference: https://cloud.tencent.com/document/product/866/33515
# Author: Zhangyifei, 2022-04-10

import pyperclip

from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException

from ocrtool import MyOcrTool, ReqObj
from parse_args import parse_args


if __name__ == '__main__':

    try:

        # Instantiate the OCR tool
        my_ocr_tool = MyOcrTool()
        client = my_ocr_tool.client
        req = ReqObj()

        # Parse the command-line arguments
        args = parse_args()

        if args.local:
            if args.isPdf:
                req.req_local_img(args.local, args.page)
            else:
                req.req_local_img(args.local)
        elif args.url:
            if args.isPdf:
                req.req_url_img(args.url, args.page)
            else:
                req.req_url_img(args.url)

        # Send the request and read the response
        resp = client.GeneralBasicOCR(req)

        # Concatenate the detected lines, adding a newline after each one if requested
        sep = '\n' if args.newline else ''
        ans = ''.join(i.DetectedText + sep for i in resp.TextDetections)

        print(ans)

        if args.clip:
            pyperclip.copy(ans)

    except TencentCloudSDKException as err:
        print(err)

parse_args.py

import argparse
import sys

def parse_args():
    # Define the command-line arguments
    parser = argparse.ArgumentParser(description='OCR options')
    parser.add_argument('-u', '--url', type=str, required=False, help='URL of the image')
    parser.add_argument('-l', '--local', type=str, required=False, help='Path to a local image')
    parser.add_argument('-p', '--isPdf', required=False, action='store_true', help='Treat the input as a PDF')
    parser.add_argument('-n', '--page', type=int, required=False, help='Which PDF page to recognize')
    parser.add_argument('-s', '--newline', required=False, action='store_true', help='Put each recognized line on its own line')
    parser.add_argument('-c', '--clip', required=False, action='store_true', help='Copy the result to the clipboard')

    # Print the help text when no command-line arguments are given
    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    # Parse the command-line arguments
    args = parser.parse_args()

    # --isPdf depends on --page
    if args.isPdf and not args.page:
        parser.print_help()
        parser.error('The --isPdf argument requires the --page argument.')

    # Exactly one of --url and --local must be given
    if args.url and args.local:
        parser.error('Specify only one of --url and --local.')
    if not args.url and not args.local:
        parser.error('One of --url or --local is required.')

    return args

ocrtool.py

# -*- coding: UTF-8 -*-
# Reference: https://cloud.tencent.com/document/product/866/33515
# Author: Zhangyifei, 2022-04-10

import base64
import json

from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.ocr.v20181119 import ocr_client, models


def image_to_base64(file_path):
    """
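    Read a file (image or PDF) and return its contents as a Base64 string.
    :param file_path: path to the image or PDF file
    :return: the Base64-encoded contents as a str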
    """
    with open(file_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read())
    return str(encoded_string, 'UTF-8')


class MyOcrTool(object):
    def __init__(self):
        # Reference: https://cloud.tencent.com/document/product/866/33515
        self.region = "ap-guangzhou"
        # Replace the placeholders with your own SecretId and SecretKey
        self.cred = credential.Credential("xxx", "xxx")
        self.httpProfile = HttpProfile()
        self.httpProfile.endpoint = "ocr.tencentcloudapi.com"
        self.clientProfile = ClientProfile()
        self.clientProfile.httpProfile = self.httpProfile
        self.client = ocr_client.OcrClient(self.cred, self.region, self.clientProfile)
        self.params = {}


class ReqObj(models.GeneralBasicOCRRequest):

    def __init__(self):
        models.GeneralBasicOCRRequest.__init__(self)

    def update_req_params(self, params):
        # Load the params dict into this request object
        self.from_json_string(json.dumps(params))

    def req_local_img(self, file_path, page=None):
        # Build a request for a local image (or PDF) file
        imagebase64 = image_to_base64(file_path)
        
        # page and isPdf are linked: if page is given, the file is treated as a PDF
        if not page:
            params = {
                "ImageBase64": imagebase64,
            }
            self.update_req_params(params)
        else:
            params = {
                "ImageBase64": imagebase64,
                "IsPdf": True,
                "PdfPageNumber": page
            }
            self.update_req_params(params)

    def req_url_img(self, url_path, page=None):
        # Build a request for an image (or PDF) at a URL

        # page and isPdf are linked: if page is given, the file is treated as a PDF
        if not page:
            params = {
                "ImageUrl": url_path
            }
            self.update_req_params(params)
        else:
            params = {
                "ImageUrl": url_path,
                "IsPdf": True,
                "PdfPageNumber": page
            }
            self.update_req_params(params)

Results

Local image


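Use -l (or --local) to process a local image, and -s (or --newline) to put a line break after each recognized line of the image. Without -s, the output has no line breaks. Use -c (or --clip) to copy the output to the clipboard, so that it can be pasted directly as text.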

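Local PDF

Use -p (or --isPdf) to indicate that the file is a PDF. In that case remember to pass -n (or --page) to say which page to recognize; otherwise the tool reports an error.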

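PS: To recognize and output an entire PDF, the calls could be wrapped in another function. I have no such need at the moment, so this post does not cover it.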
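
As a rough sketch of what such a wrapper might look like: since the API recognizes one page per request, it would loop over the page numbers and collect the text of each page. The names `ocr_whole_pdf` and `recognize_page` are hypothetical, and the page count is assumed to be known to the caller. In real use, `recognize_page` would build a request for the given page, send it, and return the detected lines; here a stand-in is used so the sketch runs without credentials.

```python
def ocr_whole_pdf(recognize_page, page_count):
    """Recognize pages 1..page_count and return the text of each page in order.

    recognize_page is a hypothetical callable: it takes a 1-based page number
    and returns the list of recognized lines for that page.
    """
    pages = []
    for page in range(1, page_count + 1):
        lines = recognize_page(page)  # one request per page
        pages.append("\n".join(lines))
    return pages


if __name__ == "__main__":
    # Stand-in recognizer so the sketch runs without network access.
    def fake_recognize(page):
        return ["page %d, line 1" % page, "page %d, line 2" % page]

    for number, text in enumerate(ocr_whole_pdf(fake_recognize, 2), start=1):
        print("--- page %d ---" % number)
        print(text)
```

Passing the recognizer in as a callable keeps the loop independent of the SDK, which also makes it easy to test.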

Image from a URL

Take the following image as an example:

Its URL is: https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fww2.sinaimg.cn%2Fmw690%2F001SRYirly1h0czgvocbqj60uj0u043f02.jpg&refer=http%3A%2F%2Fwww.sina.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1652184476&t=f35c29a812ee6a9a8e8d8fd582e0b60f

Copy the image URL and pass it with -u (or --url) to process the URL; use -s (or --newline) to put a line break after each recognized line of the image.

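Problems encountered

Problem 1: How can arguments in the `argparse` module be made to depend on each other?

Check the parsed arguments with an if statement and raise an error when the dependency is not satisfied.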

    # --isPdf depends on --page
    if args.isPdf and not args.page:
        parser.print_help()
        parser.error('The --isPdf argument requires the --page argument.')
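
For the related "only one of --url and --local" rule, argparse also offers `add_mutually_exclusive_group`, which enforces the constraint without a hand-written check. The sketch below is an alternative to the approach above, not the code used in this post; the function names `build_parser` and `parse` are illustrative. The dependency between --isPdf and --page still needs a manual check.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="OCR options (illustrative sketch)")
    # Exactly one of --url / --local must be given; argparse enforces this itself.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-u", "--url", type=str)
    source.add_argument("-l", "--local", type=str)
    parser.add_argument("-p", "--isPdf", action="store_true")
    parser.add_argument("-n", "--page", type=int)
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # A dependency between two arguments still needs an explicit check.
    if args.isPdf and args.page is None:
        parser.error("--isPdf requires --page")
    return args


if __name__ == "__main__":
    print(parse(["--local", "scan.pdf", "--isPdf", "--page", "3"]))
```

`parser.error` prints the usage line plus the message and exits with status 2, so a violated constraint behaves like any other argparse error.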

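Problem 2: When an `argparse` `parser` argument has type bool, why is the value treated as `True` even when `False` is passed on the command line?

Values passed on the command line arrive as strings, and converting any non-empty string to bool, including "False", gives True. A fix is described in "argparse bool python - CSDN". My solution is to use the action parameter instead for every option that would otherwise have type bool.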

parser.add_argument('-c', '--clip', required=False, action='store_true', help='Copy the result to the clipboard')
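
The difference between the two styles can be seen in a small standalone example. The option names `--bad` and `--good` are made up for the demonstration.

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: type=bool calls bool() on the raw string, and any non-empty string is truthy.
parser.add_argument("--bad", type=bool, default=False)
# Flag style: the option's presence means True, its absence means False.
parser.add_argument("--good", action="store_true")

args = parser.parse_args(["--bad", "False"])
print(args.bad)   # True, because bool("False") is True
print(args.good)  # False, because --good was not passed
```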

Problem 3: How to convert bytes to str:

str(encoded_string, 'UTF-8')
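
For reference, `bytes.decode` gives the same result as the two-argument `str` call. A short check, using an arbitrary sample value:

```python
import base64

encoded = base64.b64encode(b"hello")   # b64encode returns bytes
as_text_1 = str(encoded, "UTF-8")      # the form shown above
as_text_2 = encoded.decode("utf-8")    # equivalent method call

print(type(encoded).__name__, as_text_1, as_text_1 == as_text_2)
```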

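Next steps

Wrap the calls in a function that processes an entire PDF and writes the result to a document (or Excel file).
Deploy a web server so that OCR can be run from a web page.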
