Introduction to the pdfminer API: scraping PDFs from the web

Reposted article. 2016-04-29 11:36. Tags: crawler / machine learning / Python

Installation: pip install pdfminer (on Python 3, the maintained fork is installed with pip install pdfminer.six)

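Scraping data is the first stage of a data analysis project. Some of the data is published as PDF files (sometimes encrypted), which have to be parsed after downloading; pdfminer is the tool used here.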
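
Before anything can be parsed, the file has to be downloaded. A minimal sketch of that step, assuming Python 3 and only the standard library; the names looks_like_pdf and download_pdf are hypothetical helpers, not part of pdfminer:

```python
import urllib.request

PDF_MAGIC = b'%PDF-'


def looks_like_pdf(data):
    """Return True if the bytes carry the PDF header near the start."""
    # Readers commonly tolerate a little junk before the header,
    # so search the first 1024 bytes rather than only offset 0.
    return PDF_MAGIC in data[:1024]


def download_pdf(url, dest_path, timeout=30):
    """Download url to dest_path; raise ValueError if it is not a PDF."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        raise ValueError('response does not look like a PDF: %s' % url)
    with open(dest_path, 'wb') as f:
        f.write(data)
    return dest_path
```

Checking the header before saving catches the common case where a crawler receives an HTML error or login page instead of the PDF it asked for.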

First, what is pdfminer?

Here is the introduction from the official documentation:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

We will learn how to use it mainly through two examples.

Example 1:

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file whose filename is output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract a text from an encrypted PDF file)

Options:

-o filename
    Specifies the output file name. By default, it prints the extracted contents to stdout in text format.

-p pageno[,pageno,...]
    Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all the pages.

-c codec
    Specifies the output codec.

-t type
    Specifies the output format. The following formats are currently supported.

        text : TEXT format. (Default)
        html : HTML format. Not recommended for extraction purposes because the markup is messy.
        xml : XML format. Provides the most information.
        tag : "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (See §10.7 "Tagged PDF"). 

-I image_directory
    Specifies the output directory for image extraction. Currently only JPEG images are supported.

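-M char_margin
    Layout-analysis parameter. Two text chunks that are closer together than char_margin are considered contiguous and are grouped into one.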
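
When the conversion is driven from a crawler script rather than typed by hand, the options above can be assembled programmatically. A rough sketch, assuming Python 3 and that pdf2txt.py is on PATH and executable; build_pdf2txt_command and run_pdf2txt are hypothetical helper names, and only the flags shown in this article (-o, -p, -c, -t, -P) are used:

```python
import subprocess


def build_pdf2txt_command(pdf_path, output=None, pages=None, codec=None,
                          fmt=None, password=None):
    """Assemble an argument list for pdf2txt.py from the options above."""
    cmd = ['pdf2txt.py']
    if output:
        cmd += ['-o', output]
    if pages:
        # -p takes comma-separated page numbers that start at one
        cmd += ['-p', ','.join(str(p) for p in pages)]
    if codec:
        cmd += ['-c', codec]
    if fmt:
        cmd += ['-t', fmt]
    if password:
        cmd += ['-P', password]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(build_pdf2txt_command(pdf_path, **options))
```

Passing the arguments as a list instead of a single shell string avoids quoting problems with file names that contain spaces.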

Example 2:

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)

Options:

-a
    Instructs to dump all the objects. By default, it only prints the document trailer (like a header).

-i objno,objno, ...
    Specifies PDF object IDs to display. Comma-separated IDs, or multiple -i options are accepted.

-p pageno,pageno, ...
    Specifies the page number to be extracted. Comma-separated page numbers, or multiple -p options are accepted. Note that page numbers start at one, not zero.

-r (raw)
-b (binary)
-t (text)
    Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified.

    With -r option, the "raw" stream contents are dumped without decompression. With -b option, the decompressed contents are dumped as a binary blob. With -t option, the decompressed contents are dumped in a text format, similar to repr() manner. When -r or -b option is given, no stream header is displayed for the ease of saving it to a file.

-T
    Shows the table of contents.
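
Both tools take page numbers that start at one, while Python sequences are indexed from zero. A small sketch of the conversion, in case the same page selection is reused in your own script; parse_page_numbers is a hypothetical helper, and the choice of a zero-based set as the result is an assumption about what the calling code wants:

```python
def parse_page_numbers(spec):
    """Convert a '-p' style string such as '1,3,5' to zero-based indices."""
    pagenos = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as '1,,3'
        number = int(part)
        if number < 1:
            raise ValueError('page numbers start at one, got %d' % number)
        pagenos.add(number - 1)
    return pagenos
```

Rejecting 0 explicitly keeps an off-by-one mistake from silently selecting the last page through negative indexing.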

Writing your own PDF parsing script:

# -*- coding: utf-8 -*-
import io

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# Open the input PDF in binary mode and the output file once, as UTF-8 text.
# The extracted content is plain text even though the file is named a.html.
with open('PDF/1202268749.pdf', 'rb') as fp, \
        io.open('a.html', 'a', encoding='utf-8') as out:
    # Create a PDF parser bound to the file object
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the document allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a resource manager that stores shared resources
    rsrcmgr = PDFResourceManager()
    # Set the layout analysis parameters (defaults here)
    laparams = LAParams()
    # Create a device that aggregates each page into layout objects
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            # Keep only horizontal text boxes
            if isinstance(x, LTTextBoxHorizontal):
                out.write(x.get_text() + u'\n')
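
The layout objects returned by device.get_result() also carry positions, which is what the official introduction means by "the exact location of text in a page". A minimal sketch that uses them to order text boxes top-to-bottom and then left-to-right; text_boxes_in_reading_order is a hypothetical helper, and it assumes only that each text object has a bbox attribute of the form (x0, y0, x1, y1) and a get_text() method. It is duck-typed rather than restricted to LTTextBoxHorizontal, so vertical text boxes are included as well:

```python
def text_boxes_in_reading_order(layout):
    """Return (bbox, text) pairs sorted top-to-bottom, then left-to-right."""
    boxes = []
    for obj in layout:
        # Text containers expose get_text(); figures, lines and
        # rectangles do not, so they are skipped here.
        if hasattr(obj, 'get_text') and hasattr(obj, 'bbox'):
            boxes.append((tuple(obj.bbox), obj.get_text()))
    # PDF coordinates have their origin at the bottom-left of the page,
    # so a larger y1 (top edge) means the box is higher on the page.
    boxes.sort(key=lambda item: (-item[0][3], item[0][0]))
    return boxes
```

In the loop above it would be called as text_boxes_in_reading_order(layout) for each page. This simple sort is only a starting point: multi-column pages need the columns to be separated first.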

References:

pdfminer official site: http://www.unixuser.org/~euske/python/pdfminer/index.html

http://www.cnblogs.com/RoundGirl/p/4979267.html
