Three Ways to Parse PDFs with Python

Reposted article. 2017-03-27 14:17. Tags: pdf2jpg / PythonMagick / PyPDF2 / ImageMagick / Python

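I was recently researching travel information about New Zealand, so a friend studying there volunteered to collect a pile of travel brochures from various petrol stations, scan them, and send them to me.

However, his high-resolution scans all ended up in a single PDF, in the wrong order, so I decided to pull each image out of the PDF and convert it to JPG for easier viewing.

The free Adobe Reader X does let you copy the images out one at a time and paste them into mspaint, but such a tedious process could not satisfy my burning programmer's soul.

With little spare time, I first searched online and found a pile of flashy small utilities. Either the description alone showed that the tool did not fit the task, or the free version was limited to converting only the first page.

So I decided to write a Python script to extract the images in a batch, and chose PdfMiner first.

PdfMiner demo: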

#!/usr/bin/env python2
# -*- encoding: utf-8 -*-
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator
import urllib2
from cStringIO import StringIO

def Pdf2Txt(DataIO, Save_path):
    # Create a PDF parser for the input stream
    parser = PDFParser(DataIO)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager that stores shared resources
    rsrcmgr = PDFResourceManager()
    # Set the layout analysis parameters
    laparams = LAParams()
    # Create a PDF device that aggregates each page into a layout
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Append the extracted text to the output file
    with open(Save_path, 'a') as f:
        # Process every page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Fetch the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                try:
                    if isinstance(x, LTTextBoxHorizontal):
                        f.write(x.get_text().encode('utf-8') + '\n')
                except Exception:
                    print("Failed!")

# convert online pdf
'''
url = "pdf url"
data = urllib2.urlopen(urllib2.Request(url)).read()
DataIO = StringIO(data)
Pdf2Txt(DataIO, r'C:\workspace\python\converter\resource\b2.txt')
'''
# convert local pdf
with open(r'C:\workspace\python\converter\resource\text.pdf', 'rb') as pdf_file:
    DataIO = StringIO(pdf_file.read())
    Pdf2Txt(DataIO, r'C:\workspace\python\converter\resource\b3.txt')

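After trying it, I found that PdfMiner, used together with StringIO, is better suited to extracting the text content of a PDF. That did not match my needs, so I promptly switched.

Next I found PythonMagick. A demo showed that it could export the images I needed, but PythonMagick has no method for getting the number of pages in a PDF, so I also brought in PyPDF2, whose PdfFileReader provides a getNumPages() method that returns the page count.

PythonMagick demo: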

import os
import PythonMagick
from PyPDF2 import PdfFileReader

C_RESOURCE_FILE = r'C:\workspace\python\converter\resource'
C_PDFNAME = r'6p.pdf'
C_JPGNAME = r'6p%s.jpg'

pdf_path = os.path.join(C_RESOURCE_FILE, C_PDFNAME)

# Count the pages with PyPDF2
with open(pdf_path, 'rb') as input_stream:
    pdf_input = PdfFileReader(input_stream, strict=False)     # error 1
    page_count = pdf_input.getNumPages()

img = PythonMagick.Image()    # empty object first
img.density('300')            # set the density for reading (DPI); must be as a string

for i in range(page_count):
    try:
        img.read(pdf_path + ('[%s]' % i))     # read the PDF one page at a time
        imgCustRes = PythonMagick.Image(img)  # make a copy
        imgCustRes.sample('x1600')            # resize to a height of 1600 pixels
        imgCustRes.write(os.path.join(C_RESOURCE_FILE, C_JPGNAME % i))
    except Exception as e:
        print(e)

print('done')
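
The page-by-page read in the demo relies on appending a zero-based page index in square brackets to the PDF path. As a rough illustration only, the sketch below shows how the read specifier and the output name pair up for each page, without needing PythonMagick installed. The helper name page_jobs is hypothetical, and the directory and file names simply reuse the demo's values.

```python
import os

def page_jobs(resource_dir, pdf_name, jpg_pattern, page_count):
    """Return (read_spec, output_path) pairs, one per page.

    read_spec is the PDF path followed by a zero-based page index in
    square brackets, the form the demo passes to img.read().
    """
    pdf_path = os.path.join(resource_dir, pdf_name)
    jobs = []
    for i in range(page_count):
        read_spec = '%s[%d]' % (pdf_path, i)
        output_path = os.path.join(resource_dir, jpg_pattern % i)
        jobs.append((read_spec, output_path))
    return jobs

if __name__ == '__main__':
    # Hypothetical relative directory; the demo uses an absolute Windows path.
    for spec, out in page_jobs('resource', '6p.pdf', '6p%s.jpg', 3):
        print('%s -> %s' % (spec, out))
```

Keeping the name construction separate from the conversion makes it easy to check the generated file names before a long batch run.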

When running the PythonMagick demo, I hit the error marked as error 1:

PyPDF2.utils.PdfReadError: Multiple definitions in dictionary at byte 0x4717c2 for key /Info

A search showed that turning off strict mode, that is, PdfFileReader(input_stream, strict=False), resolves it.
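
If you would rather keep strict parsing for well-formed files and relax it only when needed, one possible pattern is to try strict mode first and fall back on failure. The sketch below is an assumption-laden illustration, not part of the original script: count_pages and FakeReader are hypothetical names, the reader is passed in as a callable so the example runs without PyPDF2, and the broad except clause stands in for the library's PdfReadError.

```python
import io

def count_pages(stream, reader_factory):
    """Return (page_count, used_strict) for a seekable PDF stream.

    reader_factory is any callable with the shape
    reader_factory(stream, strict=...) whose result has getNumPages(),
    which is how the demo uses PyPDF2's PdfFileReader.
    """
    try:
        reader = reader_factory(stream, strict=True)
        return reader.getNumPages(), True
    except Exception as exc:  # assumption: broad catch; narrow it in real use
        print('strict parse failed (%s); retrying with strict=False' % exc)
        stream.seek(0)  # rewind before handing the stream to a new reader
        reader = reader_factory(stream, strict=False)
        return reader.getNumPages(), False

class FakeReader(object):
    """Stand-in reader that always fails in strict mode (illustration only)."""
    def __init__(self, stream, strict=True):
        if strict:
            raise ValueError('Multiple definitions in dictionary')

    def getNumPages(self):
        return 6

if __name__ == '__main__':
    print(count_pages(io.BytesIO(b'%PDF-'), FakeReader))
```

With the PyPDF2 version used in this article, the equivalent call would presumably be count_pages(input_stream, PdfFileReader).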

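The packages used in this article are as follows:

PythonMagick can be downloaded as a whl file from the mirror provided by lfd.edu. For example, I use Python 2.7 on 64-bit Windows, so the matching download is PythonMagick-0.9.10-cp27-none-win_amd64.whl.

To install it, open cmd, change to the directory containing the whl file, and run: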

pip install PythonMagick-0.9.10-cp27-none-win_amd64.whl

PyPDF2 can be installed directly with pip.

pip install PyPDF2

PdfMiner can be found by searching GitHub: it is the top-ranked result for that keyword, the one with about 2k stars.
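
After installing, a quick way to confirm that the three packages are importable is a small check like the one below. This is only a sketch: the helper name missing_modules is hypothetical, and the import names PythonMagick, PyPDF2, and pdfminer are the ones used by the demos in this article.

```python
import importlib

def missing_modules(names):
    """Return the names from `names` that cannot be imported here."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == '__main__':
    # Import names of the three packages used in this article.
    wanted = ['PythonMagick', 'PyPDF2', 'pdfminer']
    print('missing: %s' % missing_modules(wanted))
```

An empty list means every package was found by the interpreter that ran the check, which matters when several Python versions are installed side by side.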

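While searching, I also found another approach: converting from the command line with ImageMagick. This requires installing both ImageMagick and GhostScript; for setup details, see the Reference list at the end.

In cmd, change to the directory containing the PDF and run: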

magick convert 6p.pdf 6p.jpg

This method automatically converts the PDF to JPG page by page.
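
The same command can also be driven from a script, which is handy when several PDFs need converting. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the magick executable is assumed to be on PATH, and the optional -density argument (ImageMagick's setting for the rasterization DPI, placed before the input file) is an addition that the original command does not use.

```python
import subprocess

def build_convert_command(pdf_name, jpg_name, density=None):
    """Build the argument list for `magick convert`."""
    command = ['magick', 'convert']
    if density is not None:
        # -density sets the DPI used to read the PDF, so it precedes the input
        command += ['-density', str(density)]
    command += [pdf_name, jpg_name]
    return command

def convert_pdf(pdf_name, jpg_name, cwd=None, density=None):
    """Run the conversion in directory `cwd`.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    subprocess.check_call(build_convert_command(pdf_name, jpg_name, density), cwd=cwd)

if __name__ == '__main__':
    # Only print the command here; call convert_pdf() to actually run it.
    print(' '.join(build_convert_command('6p.pdf', '6p.jpg', density=300)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.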

Reference：

Python使用PDFMiner解析PDF (Parsing PDFs with PDFMiner in Python)
PdfReadError: Multiple definitions in dictionary at byte 0x30b for key /Type #244
Convert PDF to IMAGE with perl/pythjon
Unofficial Windows Binaries for Python Extension Packages - PythonMagick
PDF to JPG Conversion with Python (for Windows)
