Garbled text when reading Chinese web pages in Python 3

Reposted article, 2015-02-02 11:01, Python

When reading a web page in Python 3, Chinese text can come out garbled. Opening the file directly raises an error:

Traceback (most recent call last):
  File "E:/Source_Code/python34/HTMLParser_in_3.py", line 81, in <module>
    context = f.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 175: illegal multibyte sequence

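I then found that opening the file in binary mode ('rb') avoids the error, but later processing fails because bytes and str are incompatible types. Forcing a conversion with a direct cast does not help either: subsequent processing then finds nothing.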
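A minimal sketch of why the direct cast finds nothing, assuming the cast in question is a plain str() call on the bytes object (the original does not show the exact code). The sample text and variable names are illustrative only.

```python
# Stand-in for data read from a file opened with 'rb'
raw = "寒冷\n".encode("utf-8")

# Mixing bytes and str in one operation raises TypeError
mix_error = None
try:
    "寒冷" in raw
except TypeError as exc:
    mix_error = exc

# str() without an encoding argument returns the repr of the bytes
# object (text that starts with b'), not the decoded characters
as_text = str(raw)

# So a search for the Chinese text in the "converted" string fails
found = "寒冷" in as_text
print(as_text, found)
```

The cast succeeds without an error, which is what makes the empty results confusing: the string holds escape sequences such as \xe5 rather than the characters themselves.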

The decode method changed in Python 3: since every str is already Unicode text, str no longer has a decode method.

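After searching online, I found that the bytes object returned by a binary-mode read has a decode method that converts it into a str that can be processed.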
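As a rough illustration of the conversion, the sketch below decodes bytes with an explicitly named encoding. The sample string and the choice of UTF-8 and GBK are assumptions made for the example; the encoding passed to decode has to match the one the page was actually written in.

```python
# Bytes as they would arrive from a binary-mode read of a UTF-8 page
utf8_raw = "中文".encode("utf-8")

# bytes -> str with decode, str -> bytes with encode
text = utf8_raw.decode("utf-8")
round_trip = text.encode("utf-8")

# str has encode but no decode in Python 3
str_has_decode = hasattr(text, "decode")

# Decoding with the wrong codec fails: GBK bytes are not valid UTF-8 here
gbk_raw = "中文".encode("gbk")
wrong_codec_error = None
try:
    gbk_raw.decode("utf-8")
except UnicodeDecodeError as exc:
    wrong_codec_error = exc

print(text, str_has_decode, wrong_codec_error)
```

Calling decode() with no argument uses UTF-8, so naming the encoding explicitly matters mainly for pages saved in GBK or GB2312.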

/tmp/ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f1 = open("unicode.txt", 'r').read()
>>> print(f1)
寒冷

>>> f2 = open("unicode.txt", 'rb').read() # open in binary mode
>>> print(f2)
b'\xe5\xaf\x92\xe5\x86\xb7\n'
>>> f2.decode()
'寒冷\n'
>>> f1.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>>
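The session above ran on Linux, where text mode happened to decode the file correctly. On a system whose default codec is GBK, one option not covered in the original is to pass the encoding to open() directly. The sketch below assumes the file is UTF-8 and writes its own temporary copy of unicode.txt so that it runs anywhere.

```python
import os
import tempfile

# Create a UTF-8 sample file in a temporary directory (illustrative path)
path = os.path.join(tempfile.mkdtemp(), "unicode.txt")
with open(path, "wb") as f:
    f.write("寒冷\n".encode("utf-8"))

# Text mode with an explicit encoding, independent of the system default
with open(path, "r", encoding="utf-8") as f:
    text_mode = f.read()

# Binary mode followed by decode, as in the session above
with open(path, "rb") as f:
    binary_mode = f.read().decode("utf-8")

print(text_mode == binary_mode)
```

Both routes give the same str. The binary route is useful when the encoding is only known after inspecting the content.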
