Reading PDF files with Python

Reposted article, 2019-03-08 16:38.

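Introduction to pdfplumber

pdfplumber is a library for extracting information from PDF files. It can report detailed information about every text character, rectangle, and line, and it can also extract tables and debug the extraction visually.

Documentation: https://github.com/jsvine/pdfplumber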

Installing pdfplumber

Install it with pip. On the command line, run:

pip install pdfplumber

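Visual debugging additionally requires ImageMagick.
Pdfplumber GitHub: https://github.com/jsvine/pdfplumber
ImageMagick installation guide:
http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-windows
(The official site does not list 6.x builds; they can be downloaded from https://imagemagick.org/download/binaries/)

(Note: ImageMagick raised errors when I first tried to use it. According to a write-up I found online, the 6.x release should be installed, because 7.x fails. That is why the 6.x link is given above.)

If to_image() raises DelegateException when rendering a page image, install the 32-bit build of GhostScript. (Be sure to download the 32-bit version, even if Windows and Python are both 64-bit.)
GhostScript: https://www.ghostscript.com/download/gsdnld.html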

Basic usage

import pdfplumber

with pdfplumber.open("path/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])  # dict describing the first character on that page

A pdfplumber.PDF object exposes two main properties, .metadata and .pages.
metadata is a dictionary of information about the PDF.
pages is a list with one entry per page.

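Each pdfplumber.Page instance has several main properties.
page_number: the page number (starting at 1)
width: the page width
height: the page height
.objects/.chars/.lines/.rects: the objects found on the page. Each of .chars, .lines and .rects is a list holding one dictionary per object of that kind (characters, lines, rectangles), with its position and other details; .objects gathers the same dictionaries grouped by object type.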
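
The properties listed above can be combined into a quick overview of a document. The following is a minimal sketch: describe_pdf is a hypothetical helper name, and the stand-in objects (fake_page, fake_pdf) carry made-up values so the code runs without a PDF on disk.

```python
from types import SimpleNamespace


def describe_pdf(pdf):
    """Summarise a PDF object using only .metadata, .pages and the page properties."""
    return {
        "metadata": dict(pdf.metadata),
        "page_count": len(pdf.pages),
        "pages": [
            {
                "page_number": page.page_number,
                "width": page.width,
                "height": page.height,
                "n_chars": len(page.chars),
                "n_lines": len(page.lines),
                "n_rects": len(page.rects),
            }
            for page in pdf.pages
        ],
    }


# Stand-in objects with hypothetical values, mimicking the attributes used above.
fake_page = SimpleNamespace(
    page_number=1, width=595, height=842,
    chars=[{"text": "A", "x0": 72, "top": 72}], lines=[], rects=[],
)
fake_pdf = SimpleNamespace(metadata={"Producer": "example"}, pages=[fake_page])

summary = describe_pdf(fake_pdf)
print(summary["page_count"], summary["pages"][0]["n_chars"])

# With a real file (the path is a placeholder):
# import pdfplumber
# with pdfplumber.open("path/file.pdf") as pdf:
#     print(describe_pdf(pdf))
```

Counting the entries in .chars, .lines and .rects is a cheap way to tell whether a page has a text layer and ruled lines before attempting text or table extraction.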

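Common methods

extract_text(): extracts the text of the page, collating all of the page's character objects into a single string
extract_words(): returns all the words on the page together with their related information
extract_tables(): extracts the tables on the page
to_image(): used for visual debugging; returns an instance of the PageImage class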
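
To illustrate what the per-word information from extract_words() is good for, here is a small sketch that regroups words into text lines by their vertical position. It assumes each word dictionary carries the "text", "x0" and "top" keys; words_to_lines and the sample data are hypothetical, not part of pdfplumber.

```python
def words_to_lines(words, y_tolerance=3):
    """Group extract_words()-style dicts into text lines by their "top" coordinate."""
    lines = []  # each entry: [top of the first word in the line, [words...]]
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)  # close enough vertically: same line
        else:
            lines.append([word["top"], [word]])  # start a new line
    # Within each line, order the words from left to right.
    return [
        " ".join(w["text"] for w in sorted(group, key=lambda w: w["x0"]))
        for _, group in lines
    ]


# Hand-written sample in the shape of extract_words() output (hypothetical values).
sample_words = [
    {"text": "world", "x0": 60, "top": 10.5},
    {"text": "Hello", "x0": 10, "top": 10},
    {"text": "Next", "x0": 10, "top": 30},
]
grouped = words_to_lines(sample_words)
print(grouped)

# With a real page: words_to_lines(page.extract_words())
```

The y_tolerance default of 3 is an arbitrary choice for this sketch; it plays the same role as the tolerances in the table settings, deciding how far apart two words may sit and still count as one line.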

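Common parameters

table_settings

Table-extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell borders. The method is highly customizable through the table_settings argument. The possible settings and their defaults are: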

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}

Table-extraction strategies

Options accepted by vertical_strategy and horizontal_strategy

`"lines"`	Use the page's graphical lines — including the sides of rectangle objects — as the borders of potential table-cells.
`"lines_strict"`	Use the page's graphical lines — but not the sides of rectangle objects — as the borders of potential table-cells.
`"text"`	For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words.
`"explicit"`	Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`.
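
Because a misspelled strategy name is easy to overlook, it can help to build the settings dictionary through a small checker. The sketch below is only an illustration: make_table_settings and VALID_STRATEGIES are hypothetical names, and the helper passes along only the keys that are overridden rather than the full list of defaults.

```python
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def make_table_settings(vertical="lines", horizontal="lines", **overrides):
    """Return a table_settings dict after checking the two strategy names."""
    for name, value in (
        ("vertical_strategy", vertical),
        ("horizontal_strategy", horizontal),
    ):
        if value not in VALID_STRATEGIES:
            raise ValueError(
                f"{name} must be one of {sorted(VALID_STRATEGIES)}, got {value!r}"
            )
    settings = {"vertical_strategy": vertical, "horizontal_strategy": horizontal}
    settings.update(overrides)  # e.g. intersection_tolerance=20
    return settings


settings = make_table_settings(vertical="text", intersection_tolerance=20)
print(settings)

# Usage with a page object:
# tables = page.extract_tables(table_settings=settings)
```

The "text" strategy is usually worth trying for the direction in which a table has no ruled lines, while "lines" remains the safer choice wherever the borders are actually drawn.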

Worked examples

Reading text

import pdfplumber

with pdfplumber.open("E:\\600aaa_2.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # total number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # All text on the current page, including text inside tables
        print(page.extract_text())
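
When the text is needed as one string rather than printed page by page, the loop above can be wrapped in a helper. This is a sketch under one assumption: for a page without a text layer (a scan, for example), extract_text() may return None or an empty string depending on the pdfplumber version, so both cases are skipped. collect_text and the stub pages are hypothetical.

```python
from types import SimpleNamespace


def collect_text(pages, separator="\n\n"):
    """Concatenate extract_text() results, skipping pages that yield no text."""
    chunks = []
    for page in pages:
        text = page.extract_text()
        if text:  # skips both None and ""
            chunks.append(text.strip())
    return separator.join(chunks)


# Stub pages that only implement extract_text(), so the sketch runs without a PDF.
stub_pages = [
    SimpleNamespace(extract_text=lambda: "Page one\n"),
    SimpleNamespace(extract_text=lambda: None),
    SimpleNamespace(extract_text=lambda: "Page three"),
]
full_text = collect_text(stub_pages)
print(full_text)

# With a real file: collect_text(pdf.pages) inside the "with pdfplumber.open(...)" block
```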

Reading tables

import re

import pdfplumber

table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    # How close orthogonal edges must be to count as intersecting
    "intersection_tolerance": 20,
}

with pdfplumber.open("E:\\600aaa_1.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # total number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)

        for pdf_table in page.extract_tables(table_settings=table_settings):
            # print(pdf_table)
            for row in pdf_table:
                # Strip all whitespace, including line breaks, from each cell
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
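
Each table returned by extract_tables() is a list of rows, and each row is a list of cell strings (or None for empty cells). A possible next step, sketched below, is to turn such a table into a list of dictionaries keyed by the header row. The sketch assumes the first row is the header; table_to_records and the sample table are hypothetical.

```python
import re


def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts."""

    def clean(cell):
        # Same cleaning as in the example above: drop all whitespace; None becomes "".
        return re.sub(r"\s+", "", cell) if cell is not None else ""

    if not table:
        return []
    # Assumption: the first row holds the column names; blanks get a placeholder.
    header = [clean(cell) or f"col{i}" for i, cell in enumerate(table[0])]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]


# Hand-written sample in the shape of one extract_tables() result.
sample_table = [
    ["Name", "Net\nprofit", None],
    ["A Co", "1 200", "x"],
    ["B", None, "y"],
]
records = table_to_records(sample_table)
print(records)
```

Removing every whitespace character suits Chinese text, where spaces inside a cell are usually line-wrap residue. For text in languages that separate words with spaces, collapsing runs of whitespace into a single space would be the gentler choice.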

Partly based on: https://blog.csdn.net/Elaine_jm/article/details/84841233
