Fixing Chinese Encoding Problems in Web Scrapers

Reposted article, 2019-04-18 13:52, tagged: crawler

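# Fixing garbled Chinese text in a scraper
# 1. Set the encoding on the whole response before reading .text
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
# Option 1 does not always work (for example, when the page is GBK rather than UTF-8);
# in that case, repair only the strings you need
# 2. Re-encode just the extracted text
title = li.xpath('./a/h3/text()')[0].encode('iso-8859-1').decode('gbk')
# 3. Re-encode the entire page text
page_text = requests.get(url=url).text.encode('iso-8859-1').decode('utf-8')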
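
Options 2 and 3 rely on the same round trip: the garbled string is turned back into the raw bytes the server sent, and those bytes are then decoded with the encoding the page really uses. The sketch below reproduces this without any network access. It assumes the garbling came from a wrong ISO-8859-1 decode (which typically happens when the server sends no charset and requests falls back to ISO-8859-1); the helper name `fix_mojibake` and the sample string are made up for illustration.

```python
def fix_mojibake(text, wrong='iso-8859-1', right='gbk'):
    """Undo a wrong decode: recover the raw bytes, then decode them correctly.

    Hypothetical helper, not part of requests or lxml.
    """
    return text.encode(wrong).decode(right)


original = '中文标题'                 # sample text, assumed for the demo
raw = original.encode('gbk')          # bytes as a GBK page would send them
garbled = raw.decode('iso-8859-1')    # what .text looks like after the wrong decode
repaired = fix_mojibake(garbled)      # same round trip as option 2

print(repr(garbled))
print(repaired)
```

The round trip is lossless only because ISO-8859-1 maps every byte value to exactly one character. If the wrong decode used a different codec, or used error replacement, the original bytes may not be recoverable this way.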

import requests
from lxml import etree

url = 'https://www.xxx.com.cn'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

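# No encoding is set here, so .text may come back garbled;
# the extracted strings are repaired one by one in the loop below
page_text = requests.get(url=url, headers=headers).text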


tree = etree.HTML(page_text)

li_list = tree.xpath('//div[@id="auto-channel-lazyload-article"]/ul/li')
for li in li_list:
    try:
        title = li.xpath('./a/h3/text()')[0].encode('iso-8859-1').decode('gbk')
        a_url = f"https:{li.xpath('./a/@href')[0]}"
        img_src = f"https:{li.xpath('./a/div/img/@src')[0]}"
        desc = li.xpath('./a/p/text()')[0].encode('iso-8859-1').decode('gbk')
    except IndexError:
        # Skip list items that lack any of the expected nodes
        continue
    print(title)
    print(a_url)
    print(img_src)
    print(desc)
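
When it is not known in advance whether a site serves UTF-8 or GBK, an alternative to repairing strings after the fact is to decode the raw bytes yourself (in requests these are available as `response.content`). The following is only a sketch under assumptions: the helper name `decode_page` and the candidate list are hypothetical, and trying UTF-8 before GB18030 (a superset of GBK) is a heuristic, not a guarantee.

```python
def decode_page(raw, candidates=('utf-8', 'gb18030')):
    """Return the first strict decode that succeeds.

    Hypothetical helper; `raw` is the undecoded body, e.g. response.content.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the last candidate and
    # mark undecodable bytes instead of raising.
    return raw.decode(candidates[-1], errors='replace')


# Simulated page bodies (assumed sample data, no network needed)
gbk_body = '汽车新闻'.encode('gbk')
utf8_body = '汽车新闻'.encode('utf-8')

print(decode_page(gbk_body))
print(decode_page(utf8_body))
```

UTF-8 is tried first because GBK-encoded Chinese text rarely forms valid UTF-8, whereas almost any byte sequence decodes under GB18030 without error, so the reverse order would hide mistakes.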
