BeautifulSoup: "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER."

Reposted article. 2016-11-19 10:50. Category: Network Programming

BeautifulSoup is a great tool.

I recently ran into a problem with it on Python 3.3:

soup = BeautifulSoup(urllib.request.urlopen(url_path), "html.parser")

soup.findAll("a", {"href": re.compile('^http|^/')})

This produces the warning:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

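In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode is to replace some characters with the special Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit (the encoding-detection component of BeautifulSoup) needs to do this, it sets the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This tells you that the Unicode representation is not an exact representation of the original: some data was lost. If a document contains �, but .contains_replacement_characters is False, you know that the � was there in the original and does not stand in for missing data.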
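The replacement behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not BeautifulSoup's actual implementation: the helper name decode_with_report and the sample bytes are made up for illustration, and the returned flag only plays a role similar to .contains_replacement_characters.

```python
def decode_with_report(data: bytes, encoding: str = "utf-8"):
    """Decode bytes and report whether any character had to be replaced.

    Hypothetical helper for illustration; returns (text, lossy).
    """
    try:
        # Strict decoding: succeeds only if every byte is valid.
        return data.decode(encoding), False
    except UnicodeDecodeError:
        # Lossy decoding: each undecodable byte becomes U+FFFD.
        return data.decode(encoding, errors="replace"), True


# A mostly UTF-8 document with one stray ISO-8859-1 byte (0xE9 is "é"
# in ISO-8859-1, but is not valid on its own in UTF-8).
mixed = b"<p>caf\xe9</p>"
text, lossy = decode_with_report(mixed)

# A document that genuinely contains U+FFFD decodes without any loss.
genuine = "<p>\ufffd</p>".encode("utf-8")
text2, lossy2 = decode_with_report(genuine)
```

In the first case the flag is True and the text contains U+FFFD where the stray byte was. In the second case the flag is False even though the text contains U+FFFD, which is the distinction the paragraph above draws.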

Solution: specify the encoding explicitly with from_encoding:

soup = BeautifulSoup(urllib.request.urlopen(url_path), "html.parser", from_encoding="iso-8859-1")
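One caveat with this fix: ISO-8859-1 maps every possible byte to a character, so decoding never fails and the warning disappears, but if the page is really in another encoding the text comes out as mojibake. A rough sketch of a more cautious approach, using only the standard library, is to try a strict decode first and fall back to ISO-8859-1 only when that fails. The function name pick_encoding and its default candidate list are assumptions for illustration; the result could then be passed as from_encoding.

```python
def pick_encoding(data: bytes, candidates=("utf-8",)) -> str:
    """Return the first candidate encoding that decodes `data` strictly.

    Hypothetical helper. Falls back to ISO-8859-1, which accepts every
    byte and therefore never raises, at the risk of mis-decoded text.
    """
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "iso-8859-1"


# ISO-8859-1 can decode all 256 byte values, so it never fails.
all_bytes = bytes(range(256)).decode("iso-8859-1")

# But forcing it on UTF-8 data silently garbles non-ASCII characters.
garbled = "é".encode("utf-8").decode("iso-8859-1")

utf8_choice = pick_encoding("café".encode("utf-8"))
fallback_choice = pick_encoding(b"caf\xe9")
```

With this sketch, valid UTF-8 input keeps its real encoding, and only input that cannot be decoded strictly ends up with the ISO-8859-1 fallback used in the solution above.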
