Reading and Writing Word Documents with the python-docx Module

Reposted article. 2017-08-05 15:38. Tags: python, Python library collection, docx

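At work you may need to read a Word document of several hundred pages and extract information from it; product API documentation, for example, is often delivered in Word format. Processing hundreds of pages by hand is practically impossible, so you need a library and a script. The python-docx library covered in this article meets that need.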

Official python-docx documentation

Installation

pip install python-docx

Writing a docx file

Example code:

# coding: utf-8
# Write a Word document (Python 3)
from docx import Document
from docx.shared import Inches

def main():
    # Create the document object
    document = Document()

    # Set the document title. Python 3 strings are Unicode, so Chinese text
    # needs no special handling ('我的一個新文檔' means "My new document")
    document.add_heading('我的一個新文檔', 0)
    
    # Add a paragraph to the document, with bold and italic runs
    p = document.add_paragraph('This is a paragraph having some ')
    p.add_run('bold ').bold = True
    p.add_run('and some ')
    p.add_run('italic.').italic = True

    # Add a level-1 heading
    document.add_heading('Level-1 heading, level = 1', level=1)
    # Styles are looked up by name, so the names contain spaces
    document.add_paragraph('Intense quote', style='Intense Quote')

    # Add an unordered (bulleted) list item
    document.add_paragraph('first item in unordered list', style='List Bullet')

    # Add an ordered (numbered) list
    document.add_paragraph('first item in ordered list', style='List Number')
    document.add_paragraph('second item in ordered list', style='List Number')
    document.add_paragraph('third item in ordered list', style='List Number')

    # Add a picture with a given width (the image file must already exist)
    document.add_picture('e:/docs/pic.png', width=Inches(1.25))
    
    # Add a table: 1 row, 3 columns
    table = document.add_table(rows=1, cols=3)
    # Get the list of cells in the first row
    hdr_cells = table.rows[0].cells
    # Assign a value to each cell
    # Note: every value must be a string
    hdr_cells[0].text = 'Name'
    hdr_cells[1].text = 'Age'
    hdr_cells[2].text = 'Tel'
    # Append a row to the table
    new_cells = table.add_row().cells
    new_cells[0].text = 'Tom'
    new_cells[1].text = '19'
    new_cells[2].text = '12345678'

    # Add a page break
    document.add_page_break()

    # Add a paragraph on the new page
    p = document.add_paragraph('This is a paragraph in new page.')

    # Save the document
    document.save('e:/docs/demo1.docx')

if __name__ == '__main__':
    main()

Running the code above creates a demo1.docx file under 'e:/docs/'. [Screenshot of the generated demo1.docx]

Reading a docx file

Example code:

# coding: utf-8
# Read an existing Word document (Python 3)
from docx import Document

def main():
    # Open the existing document
    document = Document('e:/docs/demo2.docx')

    # Get the list of all paragraphs in the document
    ps = document.paragraphs
    # The two paragraph attributes used here are text and style
    ps_detail = [(x.text, x.style.name) for x in ps]

    # Get the list of all tables in the document
    tables = document.tables

    # Open the output file once; mode 'w' discards any previous content
    with open('out.tmp', 'w', encoding='utf-8') as fout:
        # Write each paragraph's text and style name
        for text, style_name in ps_detail:
            fout.write(text + '\t' + style_name + '\n\n')

        # Walk every table and write its cells, one table row per line
        for table in tables:
            for row in table.rows:
                for cell in row.cells:
                    fout.write(cell.text + '\t')
                fout.write('\n')

if __name__ == '__main__':
    main()

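Suppose there is a demo2.docx document under 'e:/docs/'. [Screenshot of demo2.docx]

After the script above runs, it writes the extracted text to out.tmp. [Screenshot of out.tmp]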

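Notes

If a paragraph contains hyperlinks, the paragraph object (in the python-docx version used here) does not return the hyperlink text. The hyperlinks have to be converted to plain text first: select the entire content of the Word document and press Ctrl+Shift+F9.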

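Problems encountered

A problem when packaging with pyinstaller

After using the pyinstaller tool (for usage, see: "How to use the Python packaging tool pyinstaller") to package a script that uses python-docx into an exe, double-clicking the generated exe fails with this error: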

docx.opc.exceptions.PackageNotFoundError: Package not found at 'C:\Users\ADMINI~1.PC-\AppData\Local\Temp\_MEI49~1\docx\templates\default.docx'

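A search on Stack Overflow showed that others had hit a similar problem (question: "cx_freeze and docx - problems when freezing"). After trying it out, the second answer to that question resolved the issue: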

I had the same problem and managed to get around it by doing the following. First, I located the default.docx file in the site-packages. Then, I copied it in the same directory as my .py file. I also start the .docx file with Document() which has a docx=... flag, to which I assigned the value: os.path.join(os.getcwd(), 'default.docx') and now it looks like doc = Document(docx=os.path.join(os.getcwd(), 'default.docx')). The final step was to include the file in the freezing process. Et voilà! So far I have no problem.

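The steps are roughly as follows:

1. Locate the file named default.docx in the python-docx installation directory. I found it with a global search in Everything, a powerful file search tool; on my machine it is at: E:\code\env\.env\Lib\site-packages\docx\templates
2. Copy default.docx into the directory that contains my .py script.
3. Change how the script creates the Document object, from:

document = Document()

to:

import os
document = Document(docx=os.path.join(os.getcwd(), 'default.docx'))

4. Package the script into an exe again with pyinstaller.
5. Copy default.docx into the same directory as the generated exe and run the exe again. It ran without the earlier error, so the problem was solved.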
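One caveat with the steps above: os.getcwd() is the current working directory, which matches the exe's folder when the exe is double-clicked but not necessarily when it is started from elsewhere. The sketch below, offered as one possible variation rather than the method from the answer, resolves default.docx relative to the executable and also locates the installed template without a file search tool. The helper names app_dir, template_path and installed_template are hypothetical; sys.frozen is the attribute PyInstaller sets in bundled applications.

```python
import importlib.util
import os
import sys


def app_dir():
    """Folder of the exe when frozen, otherwise folder of the running script."""
    if getattr(sys, 'frozen', False):
        return os.path.dirname(os.path.abspath(sys.executable))
    return os.path.dirname(os.path.abspath(sys.argv[0]))


def template_path(name='default.docx'):
    """Path of the template copied next to the exe or script."""
    return os.path.join(app_dir(), name)


def installed_template():
    """Path where the installed docx package keeps default.docx,
    or None if no docx package is found."""
    spec = importlib.util.find_spec('docx')
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = list(spec.submodule_search_locations)[0]
    return os.path.join(package_dir, 'templates', 'default.docx')


print(template_path())
print(installed_template())
```

With these helpers the document would be created as document = Document(docx=template_path()), and installed_template() gives the source file to copy in step 1.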
