Extracting Text from a PDF with PDFMiner

This article is a repost. 2017-05-08 20:05. Tags: text file / PDF / PDFMiner / extraction / garbled Chinese text / Python

1. Download and install PDFMiner

Download PDFMiner from https://pypi.python.org/pypi/pdfminer/:

wget https://pypi.python.org/packages/57/4f/e1df0437858188d2d36466a7bb89aa024d252bd0b7e3ba90cbc567c6c0b8/pdfminer-20140328.tar.gz#md5=dfe3eb1b7b7017ab514aad6751a7c2ea

Unpack the archive and install it:

tar -zxvf pdfminer-20140328.tar.gz
cd pdfminer-20140328/
make cmap  # build the CMap data to prevent garbled Chinese; without it, Chinese text comes out as a mass of (CID:xxx) placeholders
sudo python setup.py install
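
As a quick sanity check after the install step, one could confirm that the package can be imported. This is only a minimal sketch: the helper name `pdfminer_version` is hypothetical, and the code assumes the package exposes a `__version__` attribute, falling back to 'unknown' if it does not.

```python
import importlib


def pdfminer_version():
    """Return pdfminer's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module('pdfminer')
    except ImportError:
        return None
    # Assumption: the package defines __version__; use a placeholder if not.
    return getattr(module, '__version__', 'unknown')


if __name__ == '__main__':
    version = pdfminer_version()
    if version is None:
        print('pdfminer is not installed for this interpreter')
    else:
        print('pdfminer version: %s' % version)
```

Run it with the same `python` interpreter that was used for `setup.py install`, since the package is installed per interpreter.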

2. Extract the text

from cStringIO import StringIO  # cStringIO exists only in Python 2
import sys

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_2_text(path):
    """Return the text of the PDF at `path` as a UTF-8 encoded string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(path, 'rb') as fp:
            # An empty set of page numbers means "process every page".
            for page in PDFPage.get_pages(fp, set()):
                interpreter.process_page(page)
        text = retstr.getvalue()
    finally:
        # Release the converter and the buffer even if parsing fails.
        device.close()
        retstr.close()
    return text


if __name__ == '__main__':
    text = convert_pdf_2_text(sys.argv[1])
    # Write the extracted text and close the output file explicitly.
    with open('real?.txt', 'wb') as out:
        out.write(text)
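
The `set()` passed to `PDFPage.get_pages` above selects pages: an empty set means all pages, and a non-empty set restricts extraction to those zero-based page indexes. The sketch below shows one way to build such a set from a human-friendly, one-based specification. The names `parse_page_numbers` and `iter_selected_pages` and the `'1,3-5'` syntax are hypothetical and not part of PDFMiner.

```python
def parse_page_numbers(spec):
    """Turn a 1-based spec such as '1,3-5' into a set of 0-based page indexes."""
    pagenos = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            # Inclusive range, e.g. '3-5' covers pages 3, 4 and 5.
            first, last = part.split('-', 1)
            pagenos.update(range(int(first) - 1, int(last)))
        else:
            pagenos.add(int(part) - 1)
    return pagenos


def iter_selected_pages(fp, spec):
    """Yield only the pages named in `spec` from a PDF opened in binary mode."""
    # Imported here so the parser above can be used without pdfminer installed.
    from pdfminer.pdfpage import PDFPage
    for page in PDFPage.get_pages(fp, parse_page_numbers(spec)):
        yield page
```

Note that an empty specification produces an empty set, which `get_pages` treats as "all pages", so callers that want no pages at all need to handle that case themselves.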

3. Test results
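
One simple way to check an extraction result is to count the CID placeholders mentioned in step 1, since a large number of them suggests the CMap data was not built. The following is a minimal sketch under that assumption: the helper name `count_cid_placeholders` is hypothetical, and the pattern assumes placeholders look like `(cid:123)`, matched case-insensitively.

```python
import re

# Assumption: unmapped glyphs appear in the output as "(cid:<number>)".
CID_PATTERN = re.compile(r'\(cid:\d+\)', re.IGNORECASE)


def count_cid_placeholders(text):
    """Return how many CID placeholders occur in the extracted text."""
    return len(CID_PATTERN.findall(text))


if __name__ == '__main__':
    sample = 'Title (cid:1234)(cid:87) body text'
    print(count_cid_placeholders(sample))
```

A count of zero on a Chinese PDF is a reasonable sign that `make cmap` took effect. A high count points back to the install step.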

[1] http://www.unixuser.org/~euske/python/pdfminer/#source

[2] https://www.zhihu.com/question/31586273
