Converting PDF to TXT with Python (images are not processed)

Reposted from the original article. Published 2014-07-11 12:18. Category: Python

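The previous article introduced a simple Python crawler that downloads documents from web pages. Most of the downloaded documents are doc or pdf files, though, which still limits data processing considerably, so converting doc/pdf to txt becomes important. After going through a lot of material, I found that converting doc to txt on Linux is genuinely difficult, so I decided to start with converting pdf to txt.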

A senior labmate recommended PDFMiner for the job. I tried it out and the results are good, so I am sharing them here.

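About PDFMiner: "PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data." Interested readers can find the details on the official site. pdf2txt.py, a small tool that ships with PDFMiner, converts a PDF to txt while still preserving the layout of the PDF, which is great!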
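As a rough illustration of using the bundled tool directly, the sketch below builds the command line for pdf2txt.py and runs it from Python. It assumes pdf2txt.py is installed and on PATH. The helper names build_pdf2txt_command and run_pdf2txt, and the file names file.pdf and file.txt, are hypothetical and only serve the example.

```python
import subprocess


def build_pdf2txt_command(pdf_path, txt_path):
    """Build the argument list for PDFMiner's pdf2txt.py script."""
    # -o names the output file; the input PDF is the positional argument.
    return ["pdf2txt.py", "-o", txt_path, pdf_path]


def run_pdf2txt(pdf_path, txt_path):
    """Run pdf2txt.py and return its exit code.

    Assumption: the script is on PATH; otherwise subprocess raises OSError.
    """
    return subprocess.call(build_pdf2txt_command(pdf_path, txt_path))


if __name__ == "__main__":
    # Placeholder file names, shown only to illustrate the command layout.
    print(" ".join(build_pdf2txt_command("file.pdf", "file.txt")))
```

Keeping the command construction separate from the call makes it easy to inspect the command before anything is executed.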

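Reading the source of pdf2txt.py shows the concrete steps involved. To be able to process PDF files at scale later on, only the part that converts PDF to txt is extracted here. The implementation is as follows: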

# -*- coding: utf-8 -*-
#-----------------------------------------------------
#   Purpose:     convert PDF to TXT (images are not processed)
#   Author:      chenbjin
#   Date:        2014-07-11
#   Language:    Python 2.7.6
#   Environment: Linux (Ubuntu)
#                PDFMiner 20140328 (must be installed)
#   Usage:       python pdf2txt.py file.pdf
#-----------------------------------------------------

import sys
from pdfminer.pdfinterp import PDFResourceManager,PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
# main
def main(argv):
    # Output file name; only a single document is handled here,
    # so only argv[1] is used
    outfile = argv[1] + '.txt'
    args = [argv[1]]

    debug = 0
    pagenos = set()       # empty set: process all pages
    password = ''
    maxpages = 0          # 0: no page limit
    rotation = 0
    codec = 'utf-8'       # output encoding
    caching = True
    imagewriter = None    # images are not extracted
    laparams = LAParams()
    # Debug level for the interpreter classes
    PDFResourceManager.debug = debug
    PDFPageInterpreter.debug = debug

    rsrcmgr = PDFResourceManager(caching=caching)
    # open() replaces the Python 2-only file() builtin; binary mode
    # because TextConverter writes text already encoded with `codec`
    outfp = open(outfile, 'wb')
    # PDF conversion
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                           imagewriter=imagewriter)

    for fname in args:
        fp = open(fname, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process the content of every page in the document
        for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            page.rotate = (page.rotate + rotation) % 360
            interpreter.process_page(page)
        fp.close()
    device.close()
    outfp.close()
    return

if __name__ == '__main__':
    main(sys.argv)
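
The script above handles one file per run. For the large-scale processing mentioned earlier, one possible approach is to walk a directory tree and hand each PDF to a conversion function. The sketch below is only an illustration under assumptions: txt_name_for, find_pdfs and convert_all are hypothetical helpers, and convert_one stands for any callable taking (pdf_path, txt_path), for example a wrapper around the conversion steps in main. Unlike the script, which appends '.txt' to the full input name (file.pdf.txt), this sketch replaces the extension (file.txt).

```python
import os


def txt_name_for(pdf_path):
    """Return the output name with the extension replaced: 'a/b.pdf' -> 'a/b.txt'."""
    root, _ext = os.path.splitext(pdf_path)
    return root + ".txt"


def find_pdfs(top):
    """Yield the path of every .pdf file found under the directory `top`."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def convert_all(top, convert_one):
    """Call convert_one(pdf_path, txt_path) for each PDF under `top`.

    Returns a list of (pdf_path, error message) pairs for files that failed,
    so that one unreadable PDF does not stop the whole batch.
    """
    failed = []
    for pdf_path in find_pdfs(top):
        try:
            convert_one(pdf_path, txt_name_for(pdf_path))
        except Exception as exc:
            failed.append((pdf_path, str(exc)))
    return failed
```

Collecting failures instead of raising is a design choice for batch runs: encrypted or damaged PDFs can then be listed and inspected afterwards.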

The next step is to try converting the images in a PDF as well; see http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/ for an introduction.

References:

1. PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/
