Getting hyperlinks and their text from a docx file with Python

Reposted article. 2021-04-17 17:48

Code

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT
import re

# Hyperlink targets, keyed by the numeric part of the rId
target = {}
# Hyperlink display texts, in document order
text = []
d = Document("material/測試.docx")

# Collect the hyperlink targets
rels = d.part.rels
for rel in rels.values():
    if rel.reltype == RT.HYPERLINK:
        # An rId looks like "rId7"; keep only the number so the sort is numeric
        rid = int(rel.rId[3:])
        target[rid] = rel.target_ref
# Sort by rId
target = [(k, target[k]) for k in sorted(target.keys())]

# Collect the hyperlink texts
for p in d.paragraphs:
    # Serialize the paragraph element to an XML string
    xml_str = str(p._p.xml)
    # Find the parts wrapped in <w:hyperlink> tags
    hl_list = re.findall(r'<w:hyperlink[\S\s]*?</w:hyperlink>', xml_str)
    for hyperlink in hl_list:
        # Find the parts wrapped in <w:t> tags
        wt_list = re.findall(r'<w:t[\S\s]*?</w:t>', hyperlink)
        temp = ''
        for wt in wt_list:
            # Strip the tags, keeping only the text
            wt_content = re.sub(r'<[\S\s]*?>', '', wt)
            temp += wt_content
        text.append(temp)

# Print the results (each entry of target is a (rId number, URL) tuple)
# for i in range(len(target)):
#     print(text[i] + " " + target[i][1])

Result

File to be processed

Output

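Understanding the docx file structure

① Change the docx file's extension to zip

② Unzip the zip file

Open the extracted folder in VS Code and look at the two key files, document.xml and document.xml.rels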
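
The same two files can also be read without renaming or unzipping anything, because a docx file is itself a zip archive. The following is a minimal sketch using only the standard library; the helper name `read_key_parts` is made up for this illustration, and the part paths are the usual locations in a Word package.

```python
import zipfile

# Usual locations of the two parts inside a Word package
DOCUMENT_PART = "word/document.xml"
RELS_PART = "word/_rels/document.xml.rels"


def read_key_parts(docx_file):
    """Return the raw bytes of document.xml and document.xml.rels.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper, not part of python-docx.)
    """
    with zipfile.ZipFile(docx_file) as package:
        return {name: package.read(name) for name in (DOCUMENT_PART, RELS_PART)}
```

For example, `read_key_parts("material/測試.docx")[RELS_PART]` would give the relationships XML as bytes, ready to be printed or parsed.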

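Reading them shows that

a hyperlink's target URL (in document.xml.rels) and its text (in document.xml) are linked through an id attribute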
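
To make the relationship side of that link concrete, here is a small sketch that parses a relationships part and maps each hyperlink id to its URL. The sample XML is hand-written for illustration and is not taken from the test file used in this article; `hyperlink_targets` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of relationship parts, and the relationship type used for hyperlinks
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
HYPERLINK_TYPE = OFFICE_REL + "/hyperlink"

# Hand-written sample of a document.xml.rels part (illustrative only)
SAMPLE_RELS = f"""<Relationships xmlns="{PKG_REL_NS}">
  <Relationship Id="rId1" Type="{OFFICE_REL}/styles" Target="styles.xml"/>
  <Relationship Id="rId5" Type="{HYPERLINK_TYPE}"
                Target="https://example.com/a" TargetMode="External"/>
</Relationships>"""


def hyperlink_targets(rels_xml):
    """Map relationship Id -> Target for hyperlink relationships only."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{PKG_REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }


targets = hyperlink_targets(SAMPLE_RELS)
```

In document.xml, the matching `<w:hyperlink>` element carries the same id in its `r:id` attribute, which is what ties the text to the URL.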

Getting the hyperlinks

The following code prints the attributes of each hyperlink relationship in document.xml.rels

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

d = Document("material/測試.docx")
rels = d.part.rels
for rel in rels.values():
    if rel.reltype == RT.HYPERLINK:
        # Dump every attribute stored on the relationship object
        print('\n'.join(['%s:%s' % item for item in rel.__dict__.items()]))
        # print("Hyperlink URL:", rel.target_ref)

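The relationships are not necessarily returned in the order in which the link texts appear in the document,

so they still need to be sorted by rId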

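from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

# Hyperlink targets, keyed by the numeric part of the rId
target = {}
# Hyperlink display texts
text = []
d = Document("material/測試.docx")

# Collect the hyperlink targets
rels = d.part.rels
for rel in rels.values():
    if rel.reltype == RT.HYPERLINK:
        # An rId looks like "rId7"; keep only the number so the sort is numeric
        rid = int(rel.rId[3:])
        target[rid] = rel.target_ref
# Sort by rId
target = [(k, target[k]) for k in sorted(target.keys())]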
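
The code above strips the "rId" prefix and converts the rest to an integer before sorting. A short sketch of why that conversion matters, assuming ids of the form "rId<number>" (the id values below are made up for illustration):

```python
def rid_number(rid):
    """Return the numeric part of an id such as 'rId12'.

    Assumes the id has the form 'rId<number>'.
    """
    return int(rid[3:])


rids = ["rId10", "rId2", "rId1"]

# Plain string sorting compares character by character, so "rId10" lands before "rId2"
as_strings = sorted(rids)

# Sorting on the numeric part gives the intended order
as_numbers = sorted(rids, key=rid_number)
```

Here `as_strings` is `['rId1', 'rId10', 'rId2']` while `as_numbers` is `['rId1', 'rId2', 'rId10']`.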

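Getting the hyperlink text

Although the hyperlink's target and text are known to be linked by id,

I could not find an API for following that link, so I used the approach below instead:

serialize the document XML to a string and extract the text inside the hyperlink tags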

from docx import Document
import re

# Hyperlink display texts, in document order
text = []
d = Document("material/測試.docx")

# Collect the hyperlink texts
for p in d.paragraphs:
    # Serialize the paragraph element to an XML string
    xml_str = str(p._p.xml)
    # Find the parts wrapped in <w:hyperlink> tags
    hl_list = re.findall(r'<w:hyperlink[\S\s]*?</w:hyperlink>', xml_str)
    for hyperlink in hl_list:
        # Find the parts wrapped in <w:t> tags
        wt_list = re.findall(r'<w:t[\S\s]*?</w:t>', hyperlink)
        temp = ''
        for wt in wt_list:
            # Strip the tags, keeping only the text
            wt_content = re.sub(r'<[\S\s]*?>', '', wt)
            temp += wt_content
        text.append(temp)
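
As a possible alternative to the regular expressions, the paragraph XML can be parsed so that each hyperlink's text is read together with its `r:id`, which allows a direct lookup of the URL instead of relying on sort order. This is only a sketch under stated assumptions: the sample paragraph and the `targets` mapping are hand-written for illustration, `hyperlinks_in` is a hypothetical helper, and it uses the standard library parser rather than anything specific to python-docx.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace and the relationships namespace used for r:id
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hand-written sample paragraph (illustrative only)
SAMPLE_PARAGRAPH = f"""<w:p xmlns:w="{W_NS}" xmlns:r="{R_NS}">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:hyperlink r:id="rId5">
    <w:r><w:t xml:space="preserve">the </w:t></w:r>
    <w:r><w:t>docs</w:t></w:r>
  </w:hyperlink>
</w:p>"""


def hyperlinks_in(paragraph_xml):
    """Return (r:id, text) pairs for each w:hyperlink, in document order."""
    root = ET.fromstring(paragraph_xml)
    pairs = []
    for link in root.iter(f"{{{W_NS}}}hyperlink"):
        # A link's text may be split across several runs, so join every w:t
        link_text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        pairs.append((link.get(f"{{{R_NS}}}id"), link_text))
    return pairs


# Assumed id -> URL mapping, e.g. built from the hyperlink relationships
targets = {"rId5": "https://example.com/docs"}
result = [(link_text, targets.get(rid)) for rid, link_text in hyperlinks_in(SAMPLE_PARAGRAPH)]
```

Links that point to a bookmark inside the document have no `r:id`, so in this sketch their id and URL would come back as `None`.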

References:

How to extract hyperlinks from a docx in Python: https://blog.csdn.net/s1162276945/article/details/102919305

Python: sorting a dict: https://blog.csdn.net/sinat_20177327/article/details/82319473
