UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b ....

Reposted article. 2020-04-23 19:03. Category: crawler / backend language: Python

The full error message is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

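When crawling with the pycurl package, the page returned by the crawler has to be decoded before it is written to a file or printed. The code is as follows: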

    import gzip
    import pycurl
    import re
    try:
        from io import BytesIO
    except ImportError:
        from StringIO import StringIO as BytesIO

    headersOfSend=[
    #.....
    "Accept-Encoding: gzip, deflate",
    #......
    "Upgrade-Insecure-Requests: 1",
    "User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
    ]

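    headers = {}

    def header_function(header_line):
        # Not shown in the original post; a minimal callback that records
        # response headers (pycurl passes each header line as bytes).
        line = header_line.decode('iso-8859-1')
        if ':' in line:
            name, value = line.split(':', 1)
            headers[name.strip().lower()] = value.strip()

    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, 'http://xxxxxxxx.com')
    c.setopt(c.WRITEFUNCTION, buffer.write)
    # Send the custom request headers.
    c.setopt(pycurl.HTTPHEADER, headersOfSend)
    # Set our header function.
    c.setopt(c.HEADERFUNCTION, header_function)
    c.perform()
    c.close()
    body = buffer.getvalue()
    print(type(body))
    print(str(body, encoding='utf-8'))  # this line raises the error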

Cause

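The request headers include Accept-Encoding, which tells the server that the client can accept compressed data. The server therefore compresses the response before sending it back, so the response body arrives as compressed data. A browser decompresses the body locally once it has been received.
The error occurs because the program never decompresses the body: it passes the raw gzip bytes straight to the UTF-8 decoder. A gzip stream begins with the bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, hence the failure at position 1. Removing the Accept-Encoding line from the request headers makes the problem go away.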
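
The failure can be reproduced offline, without pycurl or a server. The sketch below is an illustration, not part of the original crawler: `sample_html` is made-up content standing in for a real page. It gzip-compresses the sample, feeds the compressed bytes straight to the UTF-8 decoder, and then repeats the decode after decompressing.

```python
import gzip

# Hypothetical page content, used only for this demonstration.
sample_html = "<html><body>hello</body></html>"

# Roughly what a server sends when it honors "Accept-Encoding: gzip".
compressed = gzip.compress(sample_html.encode("utf-8"))

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
magic = compressed[:2]
print(magic.hex())

try:
    # Decoding the compressed bytes directly fails on the second byte.
    compressed.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Decompressing first makes the decode succeed.
restored = gzip.decompress(compressed).decode("utf-8")
print(restored == sample_html)
```

The first byte, 0x1f, is a valid ASCII control character, which is why the decoder only gives up at position 1.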

Solutions

Method 1: remove (or comment out) this line in the request headers

    # "Accept-Encoding: gzip, deflate",

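Method 2: decompress, then decode

Keep the field in the request headers and decompress the response with the gzip package before decoding it.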

Code 1

    body = buffer.getvalue()
    res = gzip.decompress(body).decode("utf-8")
    print(res)
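
Because the request advertises both gzip and deflate, the server may choose either one, or send the body uncompressed, and `gzip.decompress` only handles the first case. A minimal sketch of a more defensive variant of Code 1 follows. The `decode_body` helper and its `content_encoding` parameter are hypothetical names, not part of pycurl; the value is assumed to be taken from the response's Content-Encoding header, for example as collected by a header callback.

```python
import gzip
import zlib


def decode_body(body, content_encoding="", charset="utf-8"):
    """Decompress `body` according to Content-Encoding, then decode to str.

    Hypothetical helper: `content_encoding` is assumed to be the value of
    the response's Content-Encoding header (empty if the header is absent).
    """
    content_encoding = content_encoding.strip().lower()
    if content_encoding == "gzip" or body[:2] == b"\x1f\x8b":
        # Declared as gzip, or the gzip magic bytes are present.
        body = gzip.decompress(body)
    elif content_encoding == "deflate":
        try:
            # The HTTP "deflate" coding is defined as zlib-wrapped data...
            body = zlib.decompress(body)
        except zlib.error:
            # ...but some servers send a raw deflate stream instead.
            body = zlib.decompress(body, -zlib.MAX_WBITS)
    # Anything else is treated as uncompressed.
    return body.decode(charset)


# Usage with made-up sample data in place of buffer.getvalue().
page = "<p>héllo</p>"
raw = page.encode("utf-8")
from_gzip = decode_body(gzip.compress(raw), "gzip")
from_deflate = decode_body(zlib.compress(raw), "deflate")
from_plain = decode_body(raw)
print(from_gzip == from_deflate == from_plain == page)
```

Checking the magic bytes as well as the header is a design choice for robustness when the header is missing; whether that is desirable depends on the servers being crawled.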

Code 2

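    body = buffer.getvalue()
    buff = BytesIO(body)
    f = gzip.GzipFile(fileobj=buff)
    # The original snippet does not show where `encoding` comes from;
    # UTF-8 is assumed here.
    encoding = 'utf-8'
    # Decode using the encoding we figured out.
    htmls = f.read().decode(encoding)
    print(type(htmls))
    print(dir(htmls))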

Code 3

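    # Rewind the buffer so GzipFile reads from the start.
    buffer.seek(0, 0)
    f = gzip.GzipFile(fileobj=buffer)
    # `encoding` is not defined in the original snippet; UTF-8 is assumed here.
    encoding = 'utf-8'
    # Decode using the encoding we figured out.
    htmls = f.read().decode(encoding)
    print(htmls)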
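
Codes 2 and 3 decode with an `encoding` value. One way such a value could be derived, sketched here under the assumption that the server declares a charset in its Content-Type response header, is a small regular-expression lookup. The function name `charset_from_content_type` and the UTF-8 fallback are illustrative choices, not something the original code specifies.

```python
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper; `default` is an assumed fallback for responses
    that do not declare a charset.
    """
    match = re.search(
        r"charset\s*=\s*[\"']?([\w.:-]+)", content_type, re.IGNORECASE
    )
    if match:
        return match.group(1).lower()
    return default


# Example header values (made up for illustration).
encoding = charset_from_content_type("text/html; charset=GBK")
fallback = charset_from_content_type("text/html")
print(encoding, fallback)
```

The header value itself would have to be captured during the request, for instance in the function passed to `HEADERFUNCTION`.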
