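Exporting Specific Sections of PDF Documents to TXT

Requirements

Export the content of one specific section from each of more than 3,000 PDF documents to TXT format (the PDF text is selectable and can be copied).

Solution

Exporting the PDF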

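I looked into how to work with PDF documents in Python. Three libraries do most of the work: PyPDF2, pdfminer and pdfplumber.

PyPDF2 is a pure-Python PDF library. It can read document information (title, author and so on) and write, split and merge PDF documents, and it can also add watermarks and encrypt or decrypt PDFs.
pdfplumber is a module built on pdfminer.six. It processes a PDF page by page and can get the page text, extract tables and so on.
pdfminer has a steeper learning curve, but in complicated cases you end up needing it anyway. Among the open-source modules it probably has the most complete PDF support.

Judging from the examples online, pdfminer is the most widely used, so I copied some code I had written earlier and renamed a few variables: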

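import os

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


# Parse a PDF file and append its text to <file name>.txt
def parse(pdf_path):
    # Open in binary read mode; the base directory is machine-specific
    with open(os.path.join(r'C:\Users\Desktop', pdf_path), 'rb') as pdf_file:
        # Create a PDF parser from the file object
        pdf_parser = PDFParser(pdf_file)
        # Create the PDF document object
        pdf_doc = PDFDocument(pdf_parser)
        # Skip documents that do not allow text extraction
        if pdf_doc.is_extractable:
            # Resource manager for shared resources
            pdf_rm = PDFResourceManager()
            # Layout parameters and the device that aggregates each page's layout
            pdf_lap = LAParams()
            pdf_pa = PDFPageAggregator(pdf_rm, laparams=pdf_lap)
            # Interpreter that renders pages onto the device
            interpreter = PDFPageInterpreter(pdf_rm, pdf_pa)
            # Open the output file once instead of once per text box
            with open(os.path.basename(pdf_path) + '.txt', 'a', encoding='utf-8') as f:
                # Process one page at a time
                for page in PDFPage.create_pages(pdf_doc):
                    interpreter.process_page(page)
                    # Receive the LTPage object for this page
                    layout = pdf_pa.get_result()
                    for x in layout:
                        if isinstance(x, LTTextBoxHorizontal):  # horizontal text box
                            f.write(x.get_text())
                            f.write('\n')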

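Running it showed that it is very slow: a single page takes a long time. Exporting everything and trimming afterwards is therefore not an option; the target pages have to be located first and only those exported. The target pages can be found either from the table of contents, or by scanning while exporting and stopping as soon as the required content has been exported. Finally, with thousands of documents, starting over after the computer shuts down halfway through would be painful, so the export state also needs to be saved.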
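
The resumable-export idea can be sketched as follows. This is a minimal illustration, not the script's actual code: the file name `export_state.json`, the helpers `load_state`, `save_state` and `run`, and the `export_one` callback are all hypothetical. The state is written to a temporary file and then renamed, so an interruption during the write should not leave a half-written state file behind.

```python
import json
import os
import tempfile


def load_state(state_path):
    """Return the saved state dict, or an empty dict if none exists yet."""
    if not os.path.exists(state_path):
        return {}
    with open(state_path, 'r', encoding='utf-8') as f:
        return json.load(f)


def save_state(state, state_path):
    """Write the state to a temporary file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump(state, f, ensure_ascii=False)
    os.replace(tmp_path, state_path)


def run(paths, export_one, state_path='export_state.json'):
    """Export every path not yet marked as done; return how many were exported."""
    state = load_state(state_path)
    exported = 0
    for path in paths:
        if state.get(path, {}).get('is_export'):
            continue  # finished in an earlier run
        export_one(path)  # hypothetical callback that does the real work
        state[path] = {'is_export': True}
        save_state(state, state_path)  # persist after every document
        exported += 1
    return exported
```

Saving after every document costs one small file write per PDF, which is negligible next to the time spent parsing a page.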

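Fortunately, every document has a table of contents, so we can parse it to get the target pages.

Getting the target pages from the table of contents

Searching for how to get specific pages and the table of contents of a PDF in Python showed that PyPDF2 is the usual tool, so I looked into it. According to its official documentation it can read the outline and return each entry together with its page number.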

for index, file_path in enumerate(files_list):
    start_page_number = 0  # first page of the section
    is_get_page_number_range = False

    info = update_file_info(file_path=file_path)
    with open(file_path, 'rb') as pdf_file:  # open the PDF in binary mode
        pdf = PdfFileReader(pdf_file)  # load the PDF document
        if pdf.isEncrypted:
            pdf.decrypt('')  # decrypt with an empty password
        end_page_number = pdf.getNumPages()  # total number of pages
        info = update_file_info(info, page_count=end_page_number)  # store the page count
        pdf_directory = pdf.getOutlines()  # read the outline

        is_have_start_page_number = False
        for destination in pdf_directory:
            if isinstance(destination, dict):  # nested lists (sub-entries) are skipped
                if is_have_start_page_number:
                    # the entry after the match marks the end of the section
                    end_page_number = pdf.getDestinationPageNumber(destination)
                    is_get_page_number_range = True
                    break

                title = destination.get('/Title')
                if key_word in str(title):
                    # found the keyword in the outline
                    start_page_number = pdf.getDestinationPageNumber(destination)
                    is_have_start_page_number = True
                    continue
    if is_get_page_number_range:
        info = update_file_info(info, start_page_number=start_page_number, end_page_number=end_page_number,
                                is_have_directory=True)
        res = "獲取頁碼成功"  # "page range found"
    else:
        info = update_file_info(info, is_have_directory=False)
        res = "獲取頁碼失敗"  # "page range not found"
    # progress line: "scan progress : N%, file : <name>"
    print("掃描進度 : {:.2f}%, 文件 : {}".format(index / len(files_list) * 100, os.path.basename(file_path)), res, ':',
          '[', start_page_number, ',', end_page_number, ']', end=end)

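The two key calls are getOutlines(), which returns the outline objects, and getDestinationPageNumber(destination), which returns the zero-based page index that an outline entry points to. These are the PyPDF2 1.x names; PyPDF2 3.0 removed them in favor of PdfReader, reader.outline and reader.get_destination_page_number().

This finds the target page range. Documents whose table of contents cannot be clicked in a PDF viewer's bookmark panel (that is, documents without an embedded outline) cannot be resolved this way and need a different approach.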
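
The range lookup can be separated from PyPDF2 and expressed over plain (title, page index) pairs. The sketch below is an illustration under assumptions: `find_section_range` is a hypothetical helper, and the pairs are assumed to have been collected from the top-level entries returned by getOutlines() together with getDestinationPageNumber(). Unlike the loop above, it also covers the case where the matching entry is the last one, by falling back to the total page count.

```python
from typing import List, Optional, Tuple


def find_section_range(entries: List[Tuple[str, int]], key_word: str,
                       page_count: int) -> Optional[Tuple[int, int]]:
    """
    Return (start, end) page indexes for the first entry whose title
    contains key_word, or None if no title matches. `end` is exclusive:
    it is the page of the next entry, or page_count for the last entry.
    """
    for position, (title, start) in enumerate(entries):
        if key_word in title:
            if position + 1 < len(entries):
                end = entries[position + 1][1]
            else:
                end = page_count  # section runs to the end of the document
            return start, end
    return None


# Made-up outline for illustration only
outline = [('第三節 公司業務概要', 10), ('第四節 經營情況討論與分析', 18), ('第五節 重要事項', 38)]
print(find_section_range(outline, '經營情況討論與分析', 200))  # prints: (18, 38)
```

Keeping the function free of PDF objects makes it easy to check against hand-written outlines before running it on thousands of files.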

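Export

I first tried exporting the text with PyPDF2, but the output was garbled. It looked like some Unicode encoding and I could not find a way to convert it. People online say the library is old and handles Chinese poorly, and that it is usually paired with pdfplumber. pdfplumber appeared to have OCR capability (I have not verified this; it is built on pdfminer.six, which extracts embedded text), and installing it requires an image library. After a long time failing to get that installed I gave up on pdfplumber. Since I did not know how to read the outline with pdfminer, the only option was to use the two libraries together.

First, PyPDF2 scans the outlines, which is very fast, and the resulting information is saved to a JSON file. pdfminer then extracts the text of the corresponding pages. Documents without a navigable outline are analyzed page by page: if the section is found, only the required part is saved; if not, the TXT export of the whole document is saved.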
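
The page-by-page fallback does not depend on how the page text is obtained, so its logic can be tried out on plain strings. The following is a simplified sketch, not the script's code: `extract_section` is a hypothetical helper, the patterns are supplied by the caller, and each element of `pages` is assumed to be the text of one page. It mirrors the behaviour described above: collect from the page where the section heading appears up to and including the page where the next heading appears, and fall back to the full text if the section is never found.

```python
import re
from typing import Iterable, Optional, Tuple


def extract_section(pages: Iterable[str], start_pattern: str,
                    end_pattern: str) -> Tuple[str, Optional[int], Optional[int]]:
    """
    Return (text, start_index, end_index). Pages are consumed lazily, so
    scanning stops as soon as the page containing the next heading is seen.
    """
    collected = []      # pages that belong to the section
    everything = []     # all pages read so far, used as the fallback
    start_index = None
    end_index = None
    for index, page_text in enumerate(pages):
        everything.append(page_text)
        if start_index is None:
            if re.search(start_pattern, page_text):
                start_index = index
                collected.append(page_text)
            continue
        collected.append(page_text)  # the page with the next heading is kept too
        if re.search(end_pattern, page_text):
            end_index = index
            break
    if start_index is None:
        return ''.join(everything), None, None  # section not found: keep it all
    return ''.join(collected), start_index, end_index


# Made-up page texts for illustration only
pages = ['cover', 'Section 4 Discussion\nbody A', 'body B', 'Section 5 Matters', 'never read']
text, start, end = extract_section(pages, r'Section 4', r'Section \d')
print(start, end)  # prints: 1 3
```

Because the input is any iterable, a generator that yields page text from pdfminer could be passed in, and pages after the end of the section would never be parsed.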

with open(path, 'rb') as pdf_file:  # open the PDF in binary mode
    is_have_target_page = info.get('is_have_directory')
    start_page_number = 0
    end_page_number = 0
    page_count = info.get('page_count')
    if is_have_target_page:
        start_page_number = info.get('start_page_number')
        if start_page_number is None:
            start_page_number = 0
            is_have_target_page = False
        end_page_number = info.get('end_page_number')
        if end_page_number is None:
            is_have_target_page = False
            end_page_number = info.get('page_count')
    else:
        is_have_target_page = False

    pdf_parse = PDFParser(pdf_file)
    pdf_doc = PDFDocument(pdf_parse)
    if pdf_doc.is_extractable:
        pdf_rm = PDFResourceManager(caching=True)
        pdf_lap = LAParams()
        pdf_pa = PDFPageAggregator(pdf_rm, laparams=pdf_lap)
        pdf_pi = PDFPageInterpreter(pdf_rm, pdf_pa)

        if is_have_target_page:

            # zero-based indexes of the pages to read; the end page is excluded
            page_set = set(range(start_page_number, end_page_number))

            pdf_page = PDFPage.get_pages(pdf_file, pagenos=page_set, password=b'', caching=True)
            print('讀取文本->>>')
            for index, page in enumerate(pdf_page):
                print("部分 : 當前文檔進度 : {}/{}".format(index, len(page_set)), end=end)
                pdf_pi.process_page(page)
                layout = pdf_pa.get_result()

                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):  # horizontal text box: collect its text
                        text += x.get_text() + '\n'
                        # print(x.get_text())
        else:
            pdf_page = PDFPage.create_pages(pdf_doc)
            print('讀取文本->>>')
            is_find_start_page = False
            text_cache = ""
            for index, page in enumerate(pdf_page):
                print("掃描 : 當前文檔進度 : {}/{}, 找到起始位置 : {}".format(index, page_count, is_find_start_page),
                      end=end)
                pdf_pi.process_page(page)
                layout = pdf_pa.get_result()

                page_text = ''
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):  # horizontal text box: collect its text
                        page_text += x.get_text() + '\n'
                        # print(x.get_text())

                text_cache += page_text

                if re.search(r'第.節\s*經營情況討論與分析\s*一', page_text):  # found the start of the section
                    text += page_text  # start saving from this page
                    is_find_start_page = True
                    info = update_file_info(info, start_page_number=index)
                    continue

                if is_find_start_page:
                    text += page_text
                    if re.search(r'第.節\s*.*\s*一', page_text):  # found the next section
                        info = update_file_info(info, end_page_number=index)
                        break
            if text == '':
                text = text_cache

Saving

Saving is simple: create a new file and write the text into it.

def save_text_file(file_name, txt):
    """
    Save the text to the output directory under the current working
    directory, overwriting any existing file
    UTF-8 encoded
    :param file_name: file name
    :param txt: file content
    :return: None
    """
    if not file_name.endswith('.txt'):
        file_name += '.txt'  # add the missing extension

    file_path = os.path.join(os.getcwd(), 'output')
    if not os.path.exists(file_path):
        os.mkdir(file_path)  # create the directory

    with open(os.path.join(file_path, file_name), 'w', encoding='utf-8') as txt_file:
        txt_file.write(txt)  # write the file

The complete script

# coding:utf-8
# @Time : 2021/11/5 11:37 
# @Author : minuy
# @File : pdf_to_txt.py
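# @Version : v1.1 changed the search regex, added a file-name suffix, removed the date suffix, fixed nothing being saved when the scan finds no match, fixed the first page being lost when scanning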
import os
import json
import re

from PyPDF2 import PdfFileReader

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# line terminator used for progress output
end = '\n'


def dispose(root_path, name_suffix=None, is_recover=False, load_cache=True, cache_info_path='pdf_cache.json'):
    """
    Process the PDF files
    :param name_suffix: suffix appended to each output file name
    :param root_path: root directory to process
    :param is_recover: whether to overwrite files that were already exported
    :param load_cache: whether to use the cache (skips rescanning)
    :param cache_info_path: where the cache is stored
    :return: None
    """
    if load_cache:
        if not os.path.exists(cache_info_path):
            load_cache = False
            print('沒有找到緩存數據......')  # "no cache data found"

    if load_cache:
        # cached per-document information
        pdf_info = load_cache_info(cache_info_path)
    else:
        files = get_files_list(root_path)
        print('開始掃描文檔.......')  # "scanning documents"
        # outline keyword: "Discussion and Analysis of Operations"
        pdf_info = scan_pdf_directory(files, '經營情況討論與分析')
        save_cache_info(pdf_info, cache_info_path)

    files = []
    for key, val in pdf_info.items():
        files.append(val.get('file_path'))

    if name_suffix is None:
        name_suffix = ""

    count = 0
    print('開始提取數據.......')  # "extracting data"
    for index, path in enumerate(files):
        # progress line: "progress = N%, file = <name>"
        print("處理進度 = {:.2f}%, 文件 = {}".format(index / len(pdf_info) * 100, os.path.basename(path)), end=end)

        info = pdf_info.get(str(index))
        if info.get('is_export') and (not is_recover):
            continue  # already exported and not overwriting: go to the next file
        text, info = parse_pdf(info)  # extract
        text_file_name = info.get('stock_code') + '_' + str(name_suffix)
        save_text_file(text_file_name, text)  # save

        info = update_file_info(info, text_length=len(text), is_export=True, output_length=len(text),
                                text_file_name=text_file_name)
        pdf_info.update({str(index): info})  # update the cached record
        save_cache_info(pdf_info, cache_info_path)  # persist after every file so a rerun can resume
        count += 1
        if info.get('is_have_directory'):
            res = '有'  # "yes"
        else:
            res = '無'  # "no"
        # "saved file, name, length, has outline, files processed in this run"
        print('> 已保存文件，文件名：{}，長度：{}，目錄：{}，本次運行處理文件個數：{}'
              .format(text_file_name, len(text), res, count))


def save_cache_info(pdf_info, cache_info_path='pdf_cache.json'):
    """
    Save the processing information
    :param pdf_info: processing information
    :param cache_info_path: where the cache is stored
    :return: None
    """
    with open(cache_info_path, 'w') as f:
        json_str = json.dumps(pdf_info)
        f.write(json_str)


def load_cache_info(info_path):
    """
    Load the cached processing information
    :param info_path: location of the cache file
    :return: cached PDF information
    """
    with open(info_path, 'r') as f:
        json_str = f.read()
        pdf_info_cache = json.loads(json_str)
    return pdf_info_cache


def parse_pdf(info: dict):
    """
    Parse a PDF document
    :param info: document information
    :return: text content, document information (stock code, date, start page, end page)
    """
    path = info.get('file_path')
    if path is None:
        raise ValueError('file path is missing')

    file = os.path.basename(path)  # file name
    stock_code = re.search(r'\d{6}', file).group(0)  # parse the stock code
    file_date = re.search(r'\d{4}-\d{1,2}-\d{1,2}', file).group(0)  # parse the date
    info = update_file_info(info, stock_code=stock_code, date=file_date)  # update the record

    text = ''  # accumulated text
    with open(path, 'rb') as pdf_file:  # open the PDF in binary mode
        is_have_target_page = info.get('is_have_directory')
        start_page_number = 0
        end_page_number = 0
        page_count = info.get('page_count')
        if is_have_target_page:
            start_page_number = info.get('start_page_number')
            if start_page_number is None:
                start_page_number = 0
                is_have_target_page = False
            end_page_number = info.get('end_page_number')
            if end_page_number is None:
                is_have_target_page = False
                end_page_number = info.get('page_count')
        else:
            is_have_target_page = False

        pdf_parse = PDFParser(pdf_file)
        pdf_doc = PDFDocument(pdf_parse)
        if pdf_doc.is_extractable:
            pdf_rm = PDFResourceManager(caching=True)
            pdf_lap = LAParams()
            pdf_pa = PDFPageAggregator(pdf_rm, laparams=pdf_lap)
            pdf_pi = PDFPageInterpreter(pdf_rm, pdf_pa)

            if is_have_target_page:

                # zero-based indexes of the pages to read; the end page is excluded
                page_set = set(range(start_page_number, end_page_number))

                pdf_page = PDFPage.get_pages(pdf_file, pagenos=page_set, password=b'', caching=True)
                print('讀取文本->>>')
                for index, page in enumerate(pdf_page):
                    print("部分 : 當前文檔進度 : {}/{}".format(index, len(page_set)), end=end)
                    pdf_pi.process_page(page)
                    layout = pdf_pa.get_result()

                    for x in layout:
                        if isinstance(x, LTTextBoxHorizontal):  # horizontal text box: collect its text
                            text += x.get_text() + '\n'
                            # print(x.get_text())
            else:
                pdf_page = PDFPage.create_pages(pdf_doc)
                print('讀取文本->>>')
                is_find_start_page = False
                text_cache = ""
                for index, page in enumerate(pdf_page):
                    print("掃描 : 當前文檔進度 : {}/{}, 找到起始位置 : {}".format(index, page_count, is_find_start_page),
                          end=end)
                    pdf_pi.process_page(page)
                    layout = pdf_pa.get_result()

                    page_text = ''
                    for x in layout:
                        if isinstance(x, LTTextBoxHorizontal):  # horizontal text box: collect its text
                            page_text += x.get_text() + '\n'
                            # print(x.get_text())

                    text_cache += page_text

                    if re.search(r'第.節\s*經營情況討論與分析\s*一', page_text):  # found the start of the section
                        text += page_text  # start saving from this page
                        is_find_start_page = True
                        info = update_file_info(info, start_page_number=index)
                        continue

                    if is_find_start_page:
                        text += page_text
                        if re.search(r'第.節\s*.*\s*一', page_text):  # found the next section
                            info = update_file_info(info, end_page_number=index)
                            break
                if text == '':
                    text = text_cache
    return text, info


def save_text_file(file_name, txt):
    """
    Save the text to the output directory under the current working
    directory, overwriting any existing file
    UTF-8 encoded
    :param file_name: file name
    :param txt: file content
    :return: None
    """
    if not file_name.endswith('.txt'):
        file_name += '.txt'  # add the missing extension

    file_path = os.path.join(os.getcwd(), 'output')
    if not os.path.exists(file_path):
        os.mkdir(file_path)  # create the directory

    with open(os.path.join(file_path, file_name), 'w', encoding='utf-8') as txt_file:
        txt_file.write(txt)  # write the file


def scan_pdf_directory(files_list, key_word):
    """
    Scan the outline of each PDF and record the total page count, whether
    a usable outline exists and, if so, the start and end pages
    key_word is matched against the outline titles;
    if nothing matches, the range covers the whole document
    :param files_list: list of files to scan
    :param key_word: keyword to look for in the outline
    :return: dict in which each value describes one document, keyed by a unique ID
    """
    pdf_info_dict = {}
    for index, file_path in enumerate(files_list):
        start_page_number = 0  # first page of the section
        is_get_page_number_range = False

        info = update_file_info(file_path=file_path)
        with open(file_path, 'rb') as pdf_file:  # open the PDF in binary mode
            pdf = PdfFileReader(pdf_file)  # load the PDF document
            if pdf.isEncrypted:
                pdf.decrypt('')  # decrypt with an empty password
            end_page_number = pdf.getNumPages()  # total number of pages
            info = update_file_info(info, page_count=end_page_number)  # store the page count
            pdf_directory = pdf.getOutlines()  # read the outline

            is_have_start_page_number = False
            for destination in pdf_directory:
                if isinstance(destination, dict):  # nested lists (sub-entries) are skipped
                    if is_have_start_page_number:
                        # the entry after the match marks the end of the section
                        end_page_number = pdf.getDestinationPageNumber(destination)
                        is_get_page_number_range = True
                        break

                    title = destination.get('/Title')
                    if key_word in str(title):
                        # found the keyword in the outline
                        start_page_number = pdf.getDestinationPageNumber(destination)
                        is_have_start_page_number = True
                        continue
        if is_get_page_number_range:
            info = update_file_info(info, start_page_number=start_page_number, end_page_number=end_page_number,
                                    is_have_directory=True)
            res = "獲取頁碼成功"  # "page range found"
        else:
            info = update_file_info(info, is_have_directory=False)
            res = "獲取頁碼失敗"  # "page range not found"
        # progress line: "scan progress : N%, file : <name>"
        print("掃描進度 : {:.2f}%, 文件 : {}".format(index / len(files_list) * 100, os.path.basename(file_path)), res, ':',
              '[', start_page_number, ',', end_page_number, ']', end=end)

        pdf_info_dict.update({str(index): info})
    return pdf_info_dict


def update_file_info(info=None, file_path=None, start_page_number=None, end_page_number=None, page_count=None,
                     output_length=None,
                     is_have_directory=None, is_export=None, stock_code=None, date=None, text_file_name=None,
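                     text_length=None):
    """
    Update the entries of the info dict; if info is None a new dict is created
    :param text_length: length of the exported text
    :param page_count: total number of pages
    :param stock_code: stock code
    :param date: date
    :param text_file_name: name of the corresponding text file
    :param info: the dict to update
    :param file_path: new file path
    :param start_page_number: new start page
    :param end_page_number: new end page
    :param output_length: output length
    :param is_have_directory: whether an outline exists
    :param is_export: whether the document has been exported
    :return: the updated info
    """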
    if info is None:
        info = {
            'file_path': None,
            'start_page_number': None,
            'end_page_number': None,
            'output_length': None,
            'is_have_directory': None,
            'is_export': None,
            'stock_code': None,
            'date': None,
            'text_file_name': None,
            'page_count': None,
            'text_length': None
        }

    if not isinstance(info, dict):
        raise ValueError("info must be None or a dict")

    updates = {
        'file_path': file_path,
        'start_page_number': start_page_number,
        'end_page_number': end_page_number,
        'output_length': output_length,
        'is_have_directory': is_have_directory,
        'is_export': is_export,
        'stock_code': stock_code,
        'date': date,
        'text_file_name': text_file_name,
        'page_count': page_count,
        'text_length': text_length,
    }
    for key, value in updates.items():
        # Compare with None so that legitimate falsy values (page 0, False) are stored
        if value is not None:
            info[key] = value

    return info


def get_files_list(path):
    """
    Collect the paths of all PDF files under the given path and its subdirectories
    :param path: root path to search
    :return: list of PDF file paths
    """
    files_list = []
    for root, dirs, files in os.walk(path):  # walk the directory tree
        for file in files:  # iterate over the files
            file_path = os.path.join(root, file)  # build the full path
            if file_path.endswith(".pdf"):  # keep PDF files only
                files_list.append(file_path)  # add to the list
    return files_list


if __name__ == '__main__':
    # root directory to scan, file-name suffix, overwrite existing output, use cached information
    dispose(r'D:\Project\pdf_ouput', 2019, True, False)

Output

D:\Project\pdf_ouput\venv\Scripts\python.exe D:/Project/pdf_ouput/pdf_to_txt.py
開始掃描文檔.......
掃描進度 : 0.00%, 文件 : 000045深紡織A：深紡織A2019年年度報告_2020-03-14.pdf 獲取頁碼失敗 : [ 0 , 182 ]
掃描進度 : 33.33%, 文件 : 002030達安基因：達安基因2019年年度報告_2020-04-30.pdf 獲取頁碼成功 : [ 18 , 38 ]
掃描進度 : 66.67%, 文件 : 102030達安基因：達安基因2019年年度報告_2020-04-21.pdf 獲取頁碼失敗 : [ 0 , 283 ]
開始提取數據.......
處理進度 = 0.00%, 文件 = 000045深紡織A：深紡織A2019年年度報告_2020-03-14.pdf
讀取文本->>>
掃描 : 當前文檔進度 : 0/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 1/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 2/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 3/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 4/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 5/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 6/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 7/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 8/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 9/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 10/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 11/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 12/182, 找到起始位置 : False
掃描 : 當前文檔進度 : 13/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 14/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 15/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 16/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 17/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 18/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 19/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 20/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 21/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 22/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 23/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 24/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 25/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 26/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 27/182, 找到起始位置 : True
掃描 : 當前文檔進度 : 28/182, 找到起始位置 : True
> 已保存文件，文件名：000045_2019，長度：21424，目錄：無，本次運行處理文件個數：1
處理進度 = 33.33%, 文件 = 002030達安基因：達安基因2019年年度報告_2020-04-30.pdf
讀取文本->>>
部分 : 當前文檔進度 : 0/20
部分 : 當前文檔進度 : 1/20
部分 : 當前文檔進度 : 2/20
部分 : 當前文檔進度 : 3/20
部分 : 當前文檔進度 : 4/20
部分 : 當前文檔進度 : 5/20
部分 : 當前文檔進度 : 6/20
部分 : 當前文檔進度 : 7/20
部分 : 當前文檔進度 : 8/20
部分 : 當前文檔進度 : 9/20
部分 : 當前文檔進度 : 10/20
部分 : 當前文檔進度 : 11/20
部分 : 當前文檔進度 : 12/20
部分 : 當前文檔進度 : 13/20
部分 : 當前文檔進度 : 14/20
部分 : 當前文檔進度 : 15/20
部分 : 當前文檔進度 : 16/20
部分 : 當前文檔進度 : 17/20
部分 : 當前文檔進度 : 18/20
部分 : 當前文檔進度 : 19/20
> 已保存文件，文件名：002030_2019，長度：17705，目錄：有，本次運行處理文件個數：2
處理進度 = 66.67%, 文件 = 102030達安基因：達安基因2019年年度報告_2020-04-21.pdf
讀取文本->>>
掃描 : 當前文檔進度 : 0/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 1/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 2/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 3/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 4/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 5/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 6/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 7/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 8/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 9/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 10/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 11/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 12/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 13/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 14/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 15/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 16/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 17/283, 找到起始位置 : False
掃描 : 當前文檔進度 : 18/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 19/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 20/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 21/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 22/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 23/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 24/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 25/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 26/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 27/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 28/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 29/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 30/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 31/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 32/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 33/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 34/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 35/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 36/283, 找到起始位置 : True
掃描 : 當前文檔進度 : 37/283, 找到起始位置 : True
> 已保存文件，文件名：102030_2019，長度：18753，目錄：無，本次運行處理文件個數：3

Process finished with exit code 0

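Done.

The speed-up is obvious. However, the fallback should not scan page by page from the start; it should first scan the table of contents printed at the front of the document to get the target page. I will see whether the need comes up again and improve it then.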
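
One possible shape for that improvement is sketched below. It assumes, and this is only an assumption about the layout, that the printed table of contents has lines made of a section title, dot leaders or spaces, and a page number. `find_printed_page` is a hypothetical helper and the sample text is made up. The printed number may also differ from the PDF page index by a fixed offset, which would have to be worked out separately.

```python
import re
from typing import Optional


def find_printed_page(toc_text: str, key_word: str) -> Optional[int]:
    """
    Return the page number printed next to the first table-of-contents
    line that contains key_word, or None if there is no such line.
    """
    for line in toc_text.splitlines():
        if key_word not in line:
            continue
        # Title, then dot leaders or whitespace, then the trailing page number
        match = re.search(r'[.\s]+(\d+)\s*$', line)
        if match:
            return int(match.group(1))
    return None


# Made-up table-of-contents text for illustration only
toc = (
    '第三節 公司業務概要 ........ 10\n'
    '第四節 經營情況討論與分析 ........ 14\n'
    '第五節 重要事項 ........ 30\n'
)
print(find_printed_page(toc, '經營情況討論與分析'))  # prints: 14
```

With this, only the first few pages would need to be parsed before jumping to the target range.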

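Summary

Python can export PDF documents to TXT, HTML, tables, XML, images and more. PyPDF2 is mainly used for reading the outline and for splitting and merging. The functions used here are getNumPages() for the total page count, getOutlines() for the outline, and getDestinationPageNumber(destination) for the page an outline entry points to. pdfminer is very powerful, although so far I only know how to export text with it. Its main calls here are PDFPage.create_pages(pdf_doc), which yields all pages, and PDFPage.get_pages(pdf_file, pagenos=page_set), which yields the pages in the given set. pdfplumber, which I did not manage to install, seemed to be able to recognize text in images, but I have not verified that.

The rest, such as scanning the root directory for all PDF documents and reading, saving and updating the configuration, is mostly basic Python and program logic. Regular expressions are a great tool too.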
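
As a small illustration of the regular-expression part, the stock code and date can be pulled out of a file name as sketched below. `parse_file_name` is a hypothetical helper. It uses the same two patterns as the script but returns None for a missing field instead of raising when a file name does not match.

```python
import os
import re
from typing import Optional, Tuple


def parse_file_name(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (stock_code, date) parsed from the file name; None if absent."""
    name = os.path.basename(path)
    code = re.search(r'\d{6}', name)                  # first run of six digits
    date = re.search(r'\d{4}-\d{1,2}-\d{1,2}', name)  # for example 2020-03-14
    return (code.group(0) if code else None,
            date.group(0) if date else None)


print(parse_file_name('000045深紡織A：深紡織A2019年年度報告_2020-03-14.pdf'))
# prints: ('000045', '2020-03-14')
```

A caller can then skip or log files whose names do not follow the expected pattern rather than stopping the whole batch.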

