Parsing text and images with python-docx

Reposted article, dated 2020-06-06 22:15.

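python-docx is a library for parsing .docx documents. It makes it easy to extract the text of each paragraph, but it offers no straightforward way to handle images and tables. The following snippet, found online, extracts all images from a .docx file in bulk, but it does not preserve their order relative to the text: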

    from os.path import basename, join

    from docx import Document

    file_name = "D:/2.docx"
    output_dir = "D:/docx_pro/output/"  # must already exist

    doc = Document(file_name)
    for shape in doc.inline_shapes:
        # Relationship ID of the embedded picture, e.g. "rId7"
        content_id = shape._inline.graphic.graphicData.pic.blipFill.blip.embed
        part = doc.part.related_parts[content_id]
        if not part.content_type.startswith("image"):
            continue
        img_name = basename(part.partname)
        print(content_id, part.content_type, img_name)
        with open(join(output_dir, img_name), "wb") as fp:
            fp.write(part.blob)

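The code above iterates over the inline_shapes property of the Document object to find every inline picture, maps each one to the corresponding image part in the document, and writes it out. In fact, the ID of each inline image also appears in the runs of the document's paragraphs, so a regular expression applied to each run's element.xml property can extract the ID. That makes it possible to save text and images in their original order. The modified code is: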

    import re

    from docx import Document

    file_name = "D:/2.docx"
    doc = Document(file_name)

    # Relationship IDs usually look like "rId7"; this regex assumes that form.
    pattern = re.compile(r"rId\d+")

    content = []  # one list per paragraph, items kept in document order
    for paragraph in doc.paragraphs:
        items = []
        for run in paragraph.runs:
            if run.text != "":
                items.append(run.text)
                continue
            # A run without text may hold an inline picture; look for its ID.
            match = pattern.search(run.element.xml)
            if match is None:
                continue  # empty run with no relationship reference
            try:
                part = doc.part.related_parts[match.group(0)]
            except KeyError as e:
                print(e)
                continue
            if not part.content_type.startswith("image"):
                continue
            items.append(part.blob)  # raw image bytes
        content.append(items)
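The regular expression in the code above assumes IDs of the form rId followed by digits and only looks at the first match in a run. As a possible alternative, the sketch below parses the run's XML with the standard library and reads the r:embed attribute of every a:blip element. The helper name embedded_image_ids and the sample XML are hypothetical and heavily simplified; in real use the input would be the string from run.element.xml.

```python
import xml.etree.ElementTree as ET

# DrawingML and relationship namespaces used in .docx files.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def embedded_image_ids(run_xml):
    """Return the r:embed IDs of all <a:blip> elements in a run's XML.

    Hypothetical helper, not part of python-docx. `run_xml` is expected
    to be the string obtained from `run.element.xml`.
    """
    root = ET.fromstring(run_xml)
    ids = []
    for blip in root.iter("{%s}blip" % A_NS):
        rid = blip.get("{%s}embed" % R_NS)
        if rid is not None:
            ids.append(rid)
    return ids


# Simplified stand-in for a run that contains one inline picture.
sample = (
    '<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:a="%s" xmlns:r="%s">'
    '<w:drawing><a:blip r:embed="rId7"/></w:drawing>'
    "</w:r>" % (A_NS, R_NS)
)
print(embedded_image_ids(sample))  # prints ['rId7']
```

Each returned ID can then be looked up in doc.part.related_parts exactly as in the code above.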

This yields an ordered list of the document's content, which can then be saved to MongoDB.
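The original does not show how the list is stored. One way it might be shaped before insertion is sketched below, assuming the list holds str items for text and bytes items for images, as built above. The function name to_mongo_document and all field names ("source", "paragraphs", "type" and so on) are illustrative assumptions, not the author's schema.

```python
def to_mongo_document(source, paragraphs):
    """Wrap the per-paragraph lists in a dict that could be inserted.

    Field names are illustrative only. Text items are str, image items
    are raw bytes.
    """
    doc = {"source": source, "paragraphs": []}
    for index, items in enumerate(paragraphs):
        entries = []
        for item in items:
            if isinstance(item, bytes):
                entries.append({"type": "image", "data": item})
            else:
                entries.append({"type": "text", "value": item})
        doc["paragraphs"].append({"index": index, "items": entries})
    return doc


# Tiny made-up input: one paragraph with text and image bytes, one empty.
example = to_mongo_document("2.docx", [["Hello ", b"\x89PNG"], []])
print(example["paragraphs"][0]["items"][1]["type"])  # prints image
```

With pymongo, a dict like this can be passed to a collection's insert_one method, and bytes values are stored as BSON binary data. A single MongoDB document is limited to 16 MB, so documents with many or large images may be better served by GridFS.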


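Converting .docx documents and saving them to a database this way raises several problems:

1. The original formatting is lost, although this is not an issue for lightly formatted documents.
2. Some of the images are in formats such as WMF and EMF, which web pages cannot display. This is also why some images fail to show up when the mammoth library converts .docx documents. There are two possible approaches:
   - Convert the WMF images to PNG or a similar format, but this loses the layout and editability of formulas.
   - Convert or recognize the WMF images as LaTeX. MathType can do this, but not for EMF. Mathpix offers an image recognition API, but charges once usage exceeds the quota. There are also im2latex libraries online; the one I tried performed worse than Mathpix. This is left as future work.
3. Table data is lost. I tried to solve this with attributes of the Document object but have not found a suitable one so far. The mammoth library can render table blocks, although the styling has problems, so I plan to try implementing this based on its source code later.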
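On the table problem, python-docx does expose a tables property on the Document object, separate from paragraphs, so it may cover part of this need even though it does not by itself record where each table sits relative to the text. The sketch below reads cell text through the tables, rows, cells and text attributes. The helper names are hypothetical, and stand-in objects replace a real document so the example runs without a .docx file.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Return a table's cell text as a list of rows (lists of str)."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def all_tables(doc):
    """Return every table listed in `doc.tables` as rows of text."""
    return [table_to_rows(table) for table in doc.tables]


def _fake_row(*texts):
    # Stand-in for a python-docx table row; for illustration only.
    return SimpleNamespace(cells=[SimpleNamespace(text=t) for t in texts])


# Stand-in document so the sketch runs on its own. With python-docx,
# pass Document("D:/2.docx") to all_tables() instead.
fake_doc = SimpleNamespace(
    tables=[SimpleNamespace(rows=[_fake_row("name", "qty"), _fake_row("bolt", "4")])]
)
print(all_tables(fake_doc))  # prints [[['name', 'qty'], ['bolt', '4']]]
```

The resulting rows are plain lists of strings, so they can be stored next to the paragraph list. Merged cells and nested tables would need extra handling that this sketch does not attempt.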
