Reading the contents of word.docx with python-docx


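This is my first blog post and I wasn't sure what to write about, so I'm recording the problems I run into while learning Python so that I can look them up later. I'm a beginner and the writing is rough; if you spot any mistakes, please point them out!

Chinese encoding problems are always a headache. I wanted to read the contents of a Word file with Python, but open() kept raising errors. A quick search turned up python-docx, a module made specifically for .docx files (it can read only .docx files, not .doc files), and it is very convenient to use.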

Install python-docx:

pip install python-docx

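(Note: this is not pip install docx! The docx package can also be installed, but it always fails with an error about a missing exceptions module and cannot be imported.)

With that done, python-docx can be used to read the text of a Word file.

The code is as follows: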

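from docx import Document

# Path to the .docx file (backslashes are escaped because this is a normal string)
path = "C:\\Users\\Administrator\\Desktop\\word.docx"
document = Document(path)

# Print the text of every paragraph in the document body
for paragraph in document.paragraphs:
    print(paragraph.text)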

Running it prints the text.

I also tried using docx to read the text of a .doc file.

The code is as follows:

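import os
import docx

# For every .doc file in the current directory, try to open the file that has
# the same base name but a .docx extension
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        print(filename[:-4])
        doc = docx.Document(filename[:-4] + ".docx")
        for para in doc.paragraphs:
            print(para.text)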

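The result was an error: docx.opc.exceptions.PackageNotFoundError: Package not found. The .doc file still cannot be recognized.

Quoting the first commenter: "Changing the extension does not change how the file is encoded, so the text cannot be read. The .doc file has to be saved as a .docx file first, and then python-docx can read its contents."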
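The reason renaming does not help is that a real .docx file is a ZIP package, while an old .doc file is a different binary format. A rough way to check this before handing a file to python-docx is sketched below, using only the standard library. The function name looks_like_docx is my own, and the check is deliberately loose: it only confirms a ZIP archive that contains [Content_Types].xml, so other Office formats such as .xlsx would pass it too.

```python
import zipfile


def looks_like_docx(path):
    """Return True if the file is a ZIP archive containing [Content_Types].xml.

    Hypothetical helper: a .doc file that was merely renamed to .docx is not
    a ZIP archive, so it fails this check regardless of its extension.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as package:
        return "[Content_Types].xml" in package.namelist()
```

Files that fail the check still need to be opened in Word (or another converter) and saved as .docx, as the comment says.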

# Document also has methods for adding headings, page breaks, paragraphs, pictures, sections, and more, described below
 |  add_heading(self, text='', level=1)
 |      Return a heading paragraph newly added to the end of the document,
 |      containing *text* and having its paragraph style determined by
 |      *level*. If *level* is 0, the style is set to `Title`. If *level* is
 |      1 (or omitted), `Heading 1` is used. Otherwise the style is set to
 |      `Heading {level}`. Raises |ValueError| if *level* is outside the
 |      range 0-9.
 |
 |  add_page_break(self)
 |      Return a paragraph newly added to the end of the document and
 |      containing only a page break.
 |
 |  add_paragraph(self, text='', style=None)
 |      Return a paragraph newly added to the end of the document, populated
 |      with *text* and having paragraph style *style*. *text* can contain
 |      tab (``\t``) characters, which are converted to the appropriate XML
 |      form for a tab. *text* can also include newline (``\n``) or carriage
 |      return (``\r``) characters, each of which is converted to a line
 |      break.
 |
 |  add_picture(self, image_path_or_stream, width=None, height=None)
 |      Return a new picture shape added in its own paragraph at the end of
 |      the document. The picture contains the image at
 |      *image_path_or_stream*, scaled based on *width* and *height*. If
 |      neither width nor height is specified, the picture appears at its
 |      native size. If only one is specified, it is used to compute
 |      a scaling factor that is then applied to the unspecified dimension,
 |      preserving the aspect ratio of the image. The native size of the
 |      picture is calculated using the dots-per-inch (dpi) value specified
 |      in the image file, defaulting to 72 dpi if no value is specified, as
 |      is often the case.
 |
 |  add_section(self, start_type=2)
 |      Return a |Section| object representing a new section added at the end
 |      of the document. The optional *start_type* argument must be a member
 |      of the :ref:`WdSectionStart` enumeration, and defaults to
 |      ``WD_SECTION.NEW_PAGE`` if not provided.
 |
 |  add_table(self, rows, cols, style=None)
 |      Add a table having row and column counts of *rows* and *cols*
 |      respectively and table style of *style*. *style* may be a paragraph
 |      style object or a paragraph style name. If *style* is |None|, the
 |      table inherits the default table style of the document.
 |
 |  save(self, path_or_stream)
 |      Save this document to *path_or_stream*, which can be either a path to
 |      a filesystem location (a string) or a file-like object.

 

docx has many other features that I am still learning; for details, see the official documentation: https://python-docx.readthedocs.io/en/latest/user/quickstart.html

