Reading the contents of a Word .docx file with python-docx

Reposted article. Originally published 2017-12-07 22:26. Category: Python

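This is my first blog post and I wasn't sure what to write about, so I'm recording the problems I run into while learning Python so that I can look them up later. I'm a beginner and the writing is rough, so if you spot any mistakes, please point them out!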

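Chinese text encoding is always a headache. I wanted to read the contents of a Word file with Python, but open() kept raising errors. (A .docx file is a ZIP archive of XML parts rather than plain text, so reading it with open() in text mode will not give you the document text, whatever encoding is chosen.) A quick search turned up python-docx, a module built specifically for reading .docx files. It only reads .docx files, not .doc files, and it is very convenient to use.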
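To see why open() struggles here, it helps to look at what a .docx actually is: a ZIP archive whose main text lives in an XML part named word/document.xml. The sketch below uses only the standard library. The helper names build_toy_docx and paragraph_texts are hypothetical, and the archive it builds is a stripped-down stand-in that Word itself would not open; it is only meant to illustrate the layout, not to replace python-docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_toy_docx(paragraphs):
    """Return the bytes of a toy .docx-like archive (hypothetical helper).

    Only word/document.xml is written, so this is not a valid Word file;
    it exists purely to show the ZIP-plus-XML layout.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml.encode("utf-8"))
    return buffer.getvalue()


def paragraph_texts(docx_bytes):
    """List the text of each <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        texts.append("".join(node.text or "" for node in runs))
    return texts


data = build_toy_docx(["你好，世界", "second paragraph"])
print(data[:2])               # a ZIP archive starts with the bytes b"PK"
print(paragraph_texts(data))  # text recovered from the XML part
```

The Chinese text survives because the XML part is decoded as UTF-8 after it has been unpacked from the archive; decoding the raw archive bytes directly is what goes wrong with a plain open().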

Install python-docx:

pip install python-docx

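(Note: the command is not pip install docx! The docx package also installs, but importing it always failed for me because the exceptions module was missing. exceptions is a module that exists only in Python 2.)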

Now python-docx can be used to read the text of a Word document.

The code is as follows:

from docx import Document

path = "C:\\Users\\Administrator\\Desktop\\word.docx"
document = Document(path)
# Print the text of each body paragraph, in document order
for paragraph in document.paragraphs:
    print(paragraph.text)

Run it and the text is printed.

I also tried using docx to read a .doc file.

The code is as follows:

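import os
import docx

for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        stem = filename[:-4]  # file name without the ".doc" extension
        print(stem)
        # Try to open a file with the same base name but a .docx extension
        doc = docx.Document(stem + ".docx")
        for para in doc.paragraphs:
            print(para.text)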

The result was an error: docx.opc.exceptions.PackageNotFoundError: Package not found. The .doc file still could not be read.

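To quote the first commenter: "Changing the extension does not change how the file is encoded, so the text still cannot be read. The .doc file has to be saved as a .docx file first, and then python-docx can read its contents."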
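The commenter's point can be checked from Python by looking at the file's content instead of its name. The sketch below is a standard-library helper of my own (sniff_word_format is a hypothetical name, not part of python-docx). It relies on two facts: a legacy .doc is an OLE2 compound file that begins with a fixed 8-byte signature, and a .docx is a ZIP archive. A result of "zip" only means the container is right; it does not prove the file is a valid Word document.

```python
import os
import tempfile
import zipfile

# First 8 bytes of an OLE2 compound file, the container used by legacy .doc
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess the container format from file content, ignoring the extension.

    Returns "zip" (what a .docx is), "ole2" (what a legacy .doc is),
    or "unknown". Hypothetical helper, not part of python-docx.
    """
    with open(path, "rb") as handle:
        head = handle.read(len(OLE2_MAGIC))
    if head == OLE2_MAGIC:
        return "ole2"
    if zipfile.is_zipfile(path):
        return "zip"
    return "unknown"


with tempfile.TemporaryDirectory() as folder:
    # A fake .doc payload that has merely been given a .docx extension
    renamed = os.path.join(folder, "renamed.docx")
    with open(renamed, "wb") as handle:
        handle.write(OLE2_MAGIC + b"\x00" * 504)

    # A real ZIP archive, which is the container a .docx uses
    genuine = os.path.join(folder, "genuine.docx")
    with zipfile.ZipFile(genuine, "w") as archive:
        archive.writestr("word/document.xml", "<doc/>")

    print(sniff_word_format(renamed))  # ole2
    print(sniff_word_format(genuine))  # zip
```

A check like this could run before calling docx.Document, so that a renamed .doc is reported as needing conversion (for example via Save As in Word) instead of surfacing as PackageNotFoundError.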

# Document also has methods for adding headings, page breaks, paragraphs, pictures, sections, and more, as described below
  |  add_heading(self, text='', level=1)
  |      Return a heading paragraph newly added to the end of the document,
  |      containing *text* and having its paragraph style determined by
  |      *level*. If *level* is 0, the style is set to `Title`. If *level* is
  |      1 (or omitted), `Heading 1` is used. Otherwise the style is set to
  |      `Heading {level}`. Raises |ValueError| if *level* is outside the
  |      range 0-9.
  |  
  |  add_page_break(self)
  |      Return a paragraph newly added to the end of the document and
  |      containing only a page break.
  |  
  |  add_paragraph(self, text='', style=None)
  |      Return a paragraph newly added to the end of the document, populated
  |      with *text* and having paragraph style *style*. *text* can contain
  |      tab (``\t``) characters, which are converted to the appropriate XML
  |      form for a tab. *text* can also include newline (``\n``) or carriage
  |      return (``\r``) characters, each of which is converted to a line
  |      break.
  |  
  |  add_picture(self, image_path_or_stream, width=None, height=None)
  |      Return a new picture shape added in its own paragraph at the end of
  |      the document. The picture contains the image at
  |      *image_path_or_stream*, scaled based on *width* and *height*. If
  |      neither width nor height is specified, the picture appears at its
  |      native size. If only one is specified, it is used to compute
  |      a scaling factor that is then applied to the unspecified dimension,
  |      preserving the aspect ratio of the image. The native size of the
  |      picture is calculated using the dots-per-inch (dpi) value specified
  |      in the image file, defaulting to 72 dpi if no value is specified, as
  |      is often the case.
  |  
  |  add_section(self, start_type=2)
  |      Return a |Section| object representing a new section added at the end
  |      of the document. The optional *start_type* argument must be a member
  |      of the :ref:`WdSectionStart` enumeration, and defaults to
  |      ``WD_SECTION.NEW_PAGE`` if not provided.
  |  
  |  add_table(self, rows, cols, style=None)
  |      Add a table having row and column counts of *rows* and *cols*
  |      respectively and table style of *style*. *style* may be a paragraph
  |      style object or a paragraph style name. If *style* is |None|, the
  |      table inherits the default table style of the document.
  |  
  |  save(self, path_or_stream)
  |      Save this document to *path_or_stream*, which can be either a path to
  |      a filesystem location (a string) or a file-like object.

docx has many other features that I am still learning. See the official documentation for details: https://python-docx.readthedocs.io/en/latest/user/quickstart.html
