Reading doc and docx documents with python-docx



API: http://python-docx.readthedocs.io/en/latest/#api-documentation

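1. Converting doc to docx

python-docx opens only .docx files, so a .doc file has to be converted first. The conversion below drives Microsoft Word through COM, so it needs Windows with Word installed.

On Python 3.8, win32com comes from the pypiwin32 package: pip install pypiwin32

from win32com import client as wc

# path and name are placeholders: the folder and the file name without extension.
# An absolute path is the safest choice here.
word = wc.Dispatch("Word.Application")
doc = word.Documents.Open(path + name + ".doc")
doc.SaveAs(path + name + ".docx", 12)   # 12 is the docx format (wdFormatXMLDocument)
doc.Close()
word.Quit()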
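
Word resolves relative paths against its own working directory, which is usually not the script's, so it can help to compute absolute source and target paths before calling Open and SaveAs. The following is a minimal sketch that uses only the standard library. The helper name docx_target is hypothetical, and it only computes paths without touching Word.

```python
import os
from pathlib import Path


def docx_target(doc_path):
    """Return (absolute .doc path, absolute .docx path) as strings.

    Hypothetical helper: it computes paths only and never starts Word.
    """
    src = Path(os.path.abspath(doc_path))    # make the path absolute
    if src.suffix.lower() != ".doc":
        raise ValueError(f"not a .doc file: {src}")
    dst = src.with_suffix(".docx")           # same folder and name, new extension
    return str(src), str(dst)


src, dst = docx_target("report.doc")
print(src)
print(dst)
# With Word available, the two strings would be passed on like this:
# doc = word.Documents.Open(src)
# doc.SaveAs(dst, 12)
```

The check on the suffix keeps an already converted .docx file from being passed to the converter by accident.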

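2. Reading paragraphs

from docx import Document

docStr = Document(docName)   # open the document; docName is the path to a .docx file
for paragraph in docStr.paragraphs:
    parStr = paragraph.text   # text of the paragraph

Inside the loop, the following attributes are useful:
--> paragraph.style.name == 'Heading 1'   the paragraph is a level-1 heading
--> paragraph.paragraph_format.alignment == 1   the paragraph is centered (1 is WD_ALIGN_PARAGRAPH.CENTER); None means the alignment is inherited from the style
--> paragraph.style.next_paragraph_style.paragraph_format.alignment == 1   the style applied to a new paragraph inserted after this one is centered
--> paragraph.style.font.color   the font color defined by the paragraph's style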

3. Reading tables

numTables = docStr.tables
for table in numTables:
    # number of rows and columns
    row_count = len(table.rows)
    col_count = len(table.columns)
    for i in range(row_count):
        row = table.rows[i].cells
        # content of row i, column j: row[j].text

Or:

for table in docStr.tables:
    row_count = len(table.rows)
    col_count = len(table.columns)
    for i in range(row_count):
        for j in range(col_count):
            print(table.cell(i, j).text)

4. Reading by style

Reading headings (doc is an opened Document object)

for p in doc.paragraphs:
    if p.style.name == 'Heading 1':
        print(p.text)

To match headings of any level:

import re
for p in doc.paragraphs:
    if re.match(r"^Heading \d+$", p.style.name):
        print(p.text)
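
If the heading level itself is needed, for example to rebuild an outline, the same regular expression can capture the number. A minimal sketch follows; the function name heading_level and the sample style names are made up for illustration, and it assumes the built-in English style names ("Heading 1", "Heading 2", and so on).

```python
import re

# Capture the level number from a built-in heading style name.
HEADING_RE = re.compile(r"^Heading (\d+)$")


def heading_level(style_name):
    """Return the heading level as an int, or None for non-heading styles."""
    match = HEADING_RE.match(style_name)
    return int(match.group(1)) if match else None


names = ["Title", "Heading 1", "Normal", "Heading 2", "Heading 10", "Heading 1 Char"]
levels = [heading_level(name) for name in names]
print(levels)
```

The `$` anchor matters: without it, a character style such as "Heading 1 Char" would be mistaken for a heading.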

Reading body text

for p in doc.paragraphs:
    if p.style.name == 'Normal':
        print(p.text)

Listing the paragraph styles available in the document

from docx.enum.style import WD_STYLE_TYPE
for s in doc.styles:
    if s.type == WD_STYLE_TYPE.PARAGRAPH:
        print(s.name)

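5. Getting text formatting information

A paragraph object contains smaller run objects, and it is the run objects that actually hold the paragraph's text.
The paragraph.text property also collects its text from the run objects:

Source of the paragraph.text property:

def text(self):
    text = ''
    for run in self.runs:
        text += run.text
    return text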

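The font, size, underline and other formatting of the text are all stored on the run objects:

# get the list of run objects of a paragraph (par0 is a paragraph object)
runs = par0.runs
print(runs)
# get a single run object
run_0 = runs[0]
print(run_0.text)   # text of the run object
# printed result:
# 堅持因地制宜,差異化打造特色小鎮,

[Figure: document, paragraph and run objects]

Getting the formatting information: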

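# Get the formatting information.
# None means the setting is not set on the run and is inherited from the style.
print('Font name:', run_0.font.name)
# Font name: 宋體
print('Font size:', run_0.font.size)
# Font size: 152400   (a length in EMU; 152400 EMU is 12 pt)
print('Bold:', run_0.font.bold)
# Bold: None
print('Italic:', run_0.font.italic)
# Italic: True
print('Font color:', run_0.font.color.rgb)
# Font color: FF0000
print('Highlight:', run_0.font.highlight_color)
# Highlight: YELLOW (7)
print('Underline:', run_0.font.underline)
# Underline: True
print('Strikethrough:', run_0.font.strike)
# Strikethrough: None
print('Double strikethrough:', run_0.font.double_strike)
# Double strikethrough: None
print('Subscript:', run_0.font.subscript)
# Subscript: None
print('Superscript:', run_0.font.superscript)
# Superscript: None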
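
The font size printed above, 152400, is a length in English Metric Units (EMU), which is how python-docx stores lengths. The arithmetic behind the conversion is shown in this standalone sketch; the function names are made up for illustration, and in practice the Length object returned by font.size also offers the same conversion through its pt property.

```python
# EMU (English Metric Units) conversion factors.
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700      # 914400 EMU per inch / 72 pt per inch


def emu_to_pt(emu):
    """Convert a length in EMU to points."""
    return emu / EMU_PER_PT


def emu_to_cm(emu):
    """Convert a length in EMU to centimetres."""
    return emu / EMU_PER_CM


size_pt = emu_to_pt(152400)
print(size_pt)    # 12.0
```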

6. Setting the first-line indent

from docx.shared import Inches, Pt
par2 = doc.add_paragraph('paragraph text')
# left indent, 0.5 inch
par2.paragraph_format.left_indent = Inches(0.5)
# right indent, 20 pt
par2.paragraph_format.right_indent = Pt(20)
# first-line indent, 1 inch
par2.paragraph_format.first_line_indent = Inches(1)
Checking the first-line indent and its unit

from docx import Document

myDocument = Document('2020年建交集團3月分析報告.docx')

for paragraph in myDocument.paragraphs:
    # a length in EMU, or None when the indent is inherited from the style
    print(paragraph.paragraph_format.first_line_indent)
    # list the attributes and methods available on the paragraph object
    print(dir(paragraph))

References:
https://www.e1yu.com/9798.html
https://blog.csdn.net/weixin_45903952/article/details/106200213
https://blog.csdn.net/xtfge0915/article/details/83479922
https://blog.csdn.net/zhouz92/article/details/107028727?utm_medium=distribute.pc_aggpage_search_result.none-task-blog-2allfirst_rank_v2~rank_v25-18-107028727.nonecase&utm_term=python%20%E6%AE%B5%E8%90%BD%E8%AE%BE%E7%BD%AE&spm=1000.2123.3001.4430
https://www.w3xue.com/exp/article/20207/92918.html

