Extracting Table Data with the Tencent Cloud OCR API and Generating Excel Files

Reposted article, dated 2019-06-14 17:29. Tag: table text extraction

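This article describes how to use the Tencent Cloud table OCR API to extract table data from images and write it to Excel files. It covers three topics: calling the Tencent Cloud API, processing JSON data, and generating Excel files.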

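Background

At work, electronic and paper documents fly around everywhere, shuttling between user terminals. Sometimes paper data has to be digitized, which often takes considerable manual effort and adds to the workload. The invention of OCR (optical character recognition) has solved this problem to some extent. Text recognition is now quite mature, and familiar software such as QQ can recognize text. Few tools, however, support structured table recognition, and most of those that do charge a fee, which many of us are not yet in the habit of paying.

Given this, this article uses the table text extraction API provided by Tencent Cloud, together with Python, to extract table text in batches, which avoids manual data entry and lightens the workload.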

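Tools and Python packages

Tencent API

Large Chinese internet companies such as Alibaba, Baidu, and Tencent all offer cloud services. This article uses Tencent Cloud because its API documentation is fairly detailed and can be put to use after a single read. Better still, it offers an online testing tool, so the results can be tried out with almost no code.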

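Python packages
- pandas: the standard data analysis package, used here to assemble the recognized cells into a two-dimensional table.
- os: interacts with the operating system, for example listing the files in the working directory or changing the working directory.
- json: parses JSON data and converts strings and other data to and from JSON.
- base64: base64-encodes the image, as the API requires.
- xlwings: interacts with Excel; it can replace VBA for most tasks and is easy to learn.
- tencentcloud: the Tencent Cloud SDK for Python, which exposes many services worth exploring.
- re: the regular expression module, used here to strip spaces from strings.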
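
As a rough illustration of how the `base64` and `json` packages fit together here: the API expects the image as a base64 string inside a JSON request body. The sketch below shows one way to build that body. The helper name `build_payload` is hypothetical, and the `ImageBase64` field name is taken from the script later in this article.

```python
import base64
import json


def build_payload(img_bytes: bytes) -> str:
    """Return a JSON request body carrying the image as base64 text."""
    # b64encode returns bytes; decode to str so json can serialize it
    img_b64 = base64.b64encode(img_bytes).decode("utf-8")
    # json.dumps handles quoting, which manual string concatenation does not
    return json.dumps({"ImageBase64": img_b64})
```

In practice the bytes would come from reading the picture in binary mode, for example `open(path, "rb").read()`.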

Preparation

Register a Tencent Cloud account.

Create an API key in the console to obtain the SecretId and SecretKey.

Prepare a few reasonably clear screenshots of tables.
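
Once the key pair exists, one option is to keep it out of the source code and read it from environment variables instead. The following is a minimal sketch under that assumption: the helper `load_credentials` is hypothetical, and the variable names `TENCENTCLOUD_SECRET_ID` and `TENCENTCLOUD_SECRET_KEY` are a naming choice made for this example, not something the original article prescribes.

```python
import os


def load_credentials(id_var="TENCENTCLOUD_SECRET_ID",
                     key_var="TENCENTCLOUD_SECRET_KEY"):
    """Read the SecretId/SecretKey pair from environment variables."""
    secret_id = os.environ.get(id_var)
    secret_key = os.environ.get(key_var)
    # Fail early with a clear message instead of sending an empty key to the API
    if not secret_id or not secret_key:
        raise RuntimeError(f"Set {id_var} and {key_var} before running the script")
    return secret_id, secret_key
```

The returned pair can then be passed wherever the script expects the SecretId and SecretKey.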

Code implementation

# General-purpose packages
import pandas as pd
import os
import json
import re
import base64
import xlwings as xw
# Tencent Cloud SDK
from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.ocr.v20181119 import ocr_client, models

# Recognize the table in one picture and write it to an Excel file
def excelFromPictures(picture,SecretId,SecretKey):
    try:
        with open(picture,"rb") as f:
            img_data = f.read()
        img_base64 = base64.b64encode(img_data)
        cred = credential.Credential(SecretId, SecretKey)  # SecretId and SecretKey are issued by Tencent Cloud
        httpProfile = HttpProfile()
        httpProfile.endpoint = "ocr.tencentcloudapi.com"

        clientProfile = ClientProfile()
        clientProfile.httpProfile = httpProfile
        client = ocr_client.OcrClient(cred, "ap-shanghai", clientProfile)

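        req = models.TableOCRRequest()
        # Build the request body with json.dumps rather than string concatenation
        params = json.dumps({"ImageBase64": img_base64.decode("utf-8")})
        req.from_json_string(params)
        resp = client.TableOCR(req)
        # print(resp.to_json_string())

    except TencentCloudSDKException as err:
        # Without a response there is nothing to export, so stop here
        print(err)
        return

    # Parse the recognized data from the JSON response
    result1 = json.loads(resp.to_json_string())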

    rowIndex = []
    colIndex = []
    content = []

    for item in result1['TextDetections']:
        rowIndex.append(item['RowTl'])
        colIndex.append(item['ColTl'])
        content.append(item['Text'])

    # Export to Excel
    # Option 1: pandas ExcelWriter
    rowIndex = pd.Series(rowIndex)
    colIndex = pd.Series(colIndex)

    index = sorted(rowIndex.unique())
    columns = sorted(colIndex.unique())

    data = pd.DataFrame(index=index, columns=columns)
    for i in range(len(rowIndex)):
        data.loc[rowIndex[i], colIndex[i]] = content[i].replace(" ", "")

    # Name the workbook after the picture, e.g. a.png -> ../tables/a.xlsx
    out_path = "../tables/" + os.path.splitext(picture)[0] + ".xlsx"
    # The context manager saves and closes the file (writer.save() no longer exists in recent pandas)
    with pd.ExcelWriter(out_path, engine='xlsxwriter') as writer:
        data.to_excel(writer, sheet_name='Sheet1', index=False, header=False)

    # Option 2: xlwings
    # wb = xw.Book()
    # sht = wb.sheets('Sheet1')
    # for i in range(len(rowIndex)):
    #     sht[rowIndex[i],colIndex[i]].value = re.sub(" ",'',content[i])
    # wb.save("../tables/" + re.match(".*\.",f.name).group() + "xlsx")
    # wb.close()



# Create the output folder next to ./pictures/ if it does not exist yet
os.makedirs("./tables/", exist_ok=True)

os.chdir("./pictures/")
pictures = os.listdir()
for pic in pictures:
    excelFromPictures(pic, "YourID", "YourKey")
    print("Finished processing " + pic + ".")
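
The core of the script is the step that places each recognized cell at its row and column position. To make that step easier to follow without calling the API, here is the same reshaping sketched in plain Python rather than pandas. The function name `detections_to_grid` and the sample data are made up for illustration; the keys `RowTl`, `ColTl`, and `Text` are the ones the script reads from `TextDetections`.

```python
def detections_to_grid(detections):
    """Place each detected cell at its (RowTl, ColTl) position in a 2-D list."""
    rows = sorted({d["RowTl"] for d in detections})
    cols = sorted({d["ColTl"] for d in detections})
    # Map raw indices to compact positions so gaps do not create empty rows
    row_pos = {r: i for i, r in enumerate(rows)}
    col_pos = {c: j for j, c in enumerate(cols)}
    grid = [[None] * len(cols) for _ in rows]
    for d in detections:
        # Strip spaces from the text, as the script does before writing a cell
        grid[row_pos[d["RowTl"]]][col_pos[d["ColTl"]]] = d["Text"].replace(" ", "")
    return grid


# Made-up sample in the shape the script reads from 'TextDetections'
sample = [
    {"RowTl": 0, "ColTl": 0, "Text": "Name"},
    {"RowTl": 0, "ColTl": 1, "Text": "Sco re"},
    {"RowTl": 2, "ColTl": 1, "Text": "90"},
]
grid = detections_to_grid(sample)
```

Cells with no detection stay `None`, which corresponds to the empty cells in the exported workbook.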
