Using a general-purpose content extraction library in Python

Reposted article, 2022-04-06 02:36. Category: web crawlers

import requests
from bluextracter import Extractor

if __name__ == '__main__':
    extractor = Extractor()  # instantiate the extractor class
    url = 'https://m.huicaiba.com/ask/5426118.html'
    resp = requests.get(url, timeout=10)
    resp.encoding = 'utf-8'  # set the page encoding manually
    source = resp.text
    extractor.extract(url, source)
    # print('Score:', extractor.score)  # score
    # print('Title:', extractor.title)  # title
    # print('Text-to-link ratio:', extractor.link_text_ratio)
    # print('Image count:', extractor.img_count)
    # print('Text character count:', extractor.text_count)
    #
    # print('Plain text content:', extractor.clean_text)  # plain text content
    print('HTML content:', extractor.format_text)  # content formatted with HTML tags
    # top_node = extractor.top_node  # the original HTML, as an element
    # Assumed imports for the two lines below: from lxml import etree; from html import unescape
    # cc = etree.tostring(top_node, encoding='utf-8').decode('utf-8')
    # print(unescape(cc))
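
The last three commented-out lines serialize the extracted node back to an HTML string and then unescape it. The following is a minimal sketch of that step only, not the library's own code. It assumes the node behaves like a standard element tree node, and it uses the standard library's `xml.etree.ElementTree` as a stand-in for the `etree` module the snippet relies on (presumably `lxml.etree`). The helper name `node_to_html` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import unescape


def node_to_html(node: ET.Element) -> str:
    """Serialize an element to a string, then turn entities back into characters."""
    # tostring() returns bytes when an encoding is given, so decode to str.
    raw = ET.tostring(node, encoding="utf-8").decode("utf-8")
    # Serialization escapes characters such as "&" as "&amp;"; unescape() reverses that.
    return unescape(raw)


if __name__ == "__main__":
    # Hypothetical stand-in for extractor.top_node.
    top_node = ET.fromstring("<div><p>Tom &amp; Jerry</p></div>")
    print(node_to_html(top_node))
```

Note that the unescaped string is meant for display or logging. If the text contains characters such as `&` or `<`, the result is no longer strictly valid markup, so keep the raw serialized string if it needs to be parsed again.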
