Reading the contents of a Word document with Python

Reposted article, 2021-03-09 10:12

1. Convert the Word document to HTML, then use BeautifulSoup from bs4 to extract the content you need from the HTML

Step 1: Install bs4 and pydocx, then import them

pip install bs4
pip install pydocx

# PyDocX reads the Word document and converts it to HTML
from pydocx import PyDocX

from bs4 import BeautifulSoup  # parses the HTML into a navigable object

Step 2: Read the Word document and parse the result

html = PyDocX.to_html("C:\\Users\\Administrator\\Desktop\\test.docx")
# First argument: the HTML to parse; 'html.parser' names the parser to use
soup = BeautifulSoup(html, 'html.parser')
# prettify() returns the HTML as an indented string; print it to inspect the structure
# print(soup.prettify())
# CSS selector: <span> elements directly under <h2> whose style is exactly text-indent:1.25em
title_list = soup.select("h2>span[style='text-indent:1.25em']")
# Find every <span> whose class attribute is pydocx-left.
# The keyword form is class_="pydocx-left"; the trailing underscore is needed because class is a Python keyword.
content_list = soup.find_all('span', attrs={"class": "pydocx-left"})
print(len(content_list))
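To try the two lookups without a .docx file, here is a minimal sketch that runs them against an inline HTML string. The markup in sample_html is a made-up stand-in, not actual PyDocX output, so the tags and attributes in your own documents may differ; the names sample_html, titles and contents are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for converter output; real PyDocX output may differ.
sample_html = """
<html><body>
<h2><span style="text-indent:1.25em">Chapter One</span></h2>
<p><span class="pydocx-left">First paragraph.</span></p>
<p><span class="pydocx-left">Second paragraph.</span></p>
<h2><span style="text-indent:1.25em">Chapter Two</span></h2>
<p><span class="pydocx-left">Third paragraph.</span></p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector: <span> elements that are direct children of <h2>
# and whose style attribute equals the given string exactly.
title_spans = soup.select("h2 > span[style='text-indent:1.25em']")
titles = [span.get_text(strip=True) for span in title_spans]

# find_all() filters by tag name and attributes; class_ avoids the Python keyword.
content_spans = soup.find_all("span", class_="pydocx-left")
contents = [span.get_text(strip=True) for span in content_spans]

print(titles)
print(contents)
```

Both calls return Tag objects, so get_text() is still needed to obtain plain strings. The attribute selector matches the style value exactly, so if the converter emits extra declarations or different spacing, the selector has to be adjusted.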

2. Read the Word document as text, paragraph by paragraph, and use each paragraph's style to pick out the content

Step 1: Install python-docx, then import it

pip install python-docx

# Import
from docx import Document

Step 2: Read the contents of the Word document

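title = ""
content = ""
titleArr = []
document = Document("C:\\Users\\Administrator\\Desktop\\test.docx")
# Get all paragraphs
all_paragraphs = document.paragraphs
for paragraph in all_paragraphs:
    if paragraph.style.name == 'Normal':
        # Body text: append it to the current section's content
        content = content + paragraph.text + '\n'
    else:
        # Any other style is treated as a heading: save the finished section, then start a new one
        if content != '':
            titleArr.append({"title": title, "content": content})
            content = ""
        title = paragraph.text
# Save the last section, which no later heading would otherwise flush
if content != '':
    titleArr.append({"title": title, "content": content})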
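The grouping rule can also be separated from file reading so it can be checked without a .docx file. The sketch below assumes each paragraph is reduced to a (style_name, text) pair; the function name group_sections and the sample data are hypothetical, not part of python-docx.

```python
def group_sections(paragraphs):
    """Group (style_name, text) pairs into {"title", "content"} dicts.

    Paragraphs styled 'Normal' are body text; any other style is
    treated as a heading that starts a new section.
    """
    sections = []
    title = ""
    content = ""
    for style_name, text in paragraphs:
        if style_name == "Normal":
            content += text + "\n"
        else:
            # A heading closes the section collected so far, if it has any body text.
            if content:
                sections.append({"title": title, "content": content})
                content = ""
            title = text
    # Flush the final section; no later heading will do it.
    if content:
        sections.append({"title": title, "content": content})
    return sections


# Made-up sample data standing in for a real document's paragraphs.
sample = [
    ("Heading 1", "Introduction"),
    ("Normal", "Hello."),
    ("Normal", "World."),
    ("Heading 1", "Details"),
    ("Normal", "More text."),
]
sections = group_sections(sample)
print(sections)
```

With python-docx, the pairs can be produced from a real file with a generator such as ((p.style.name, p.text) for p in Document(path).paragraphs). Note that a heading followed directly by another heading yields no section for the first one, because only sections with body text are saved.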
