Problem: using BeautifulSoup in Python 3 fails with UnicodeDecodeError: 'gbk' codec …….

Reposted article, 2017-02-21 21:18

I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3.

Some very simple code for opening a local file kept failing:

    from bs4 import BeautifulSoup

    file = open('index.html')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)

It raises the following error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

Doesn't BeautifulSoup claim to handle all kinds of encodings? Why does decoding still fail?

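A lot of searching about BeautifulSoup turned up nothing. Then I noticed that if the code is rewritten as follows, the traceback points at the file.read() line rather than at BeautifulSoup:

    from bs4 import BeautifulSoup

    file = open('index.html')
    str1 = file.read()  # the error is raised on this line!
    soup = BeautifulSoup(str1, 'lxml')
    print(soup)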

So that's it: the problem lies in reading the file, not in BeautifulSoup's parsing.
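The failure can be reproduced without BeautifulSoup at all. The sketch below rests on assumptions about the original setup: it supposes index.html was saved as UTF-16 little-endian with a byte order mark (which would explain byte 0xff at position 0) and that the system default encoding was GBK, as is typical on Chinese-locale Windows. The temporary file and its contents are made up for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical page content; the original index.html is not available.
html = "<html><body><p>你好</p></body></html>"

# Write the file as UTF-16 LE with an explicit byte order mark (bytes FF FE).
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

with open(path, "rb") as f:
    first_bytes = f.read(2)

# Passing encoding="gbk" explicitly mimics what open(path) does implicitly
# when the locale's preferred encoding is GBK.
try:
    with open(path, encoding="gbk") as f:
        f.read()
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The same bytes decode cleanly once the real encoding is named.
with open(path, encoding="utf-16") as f:
    decoded = f.read()

os.remove(path)
print(first_bytes)
print(error)
```

Because the exception comes from the text-mode read itself, no choice of parser or BeautifulSoup option can avoid it.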

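So why does reading the file fail? open() without an encoding argument decodes text with the locale's default encoding, which here is GBK. A leading byte 0xff is consistent with a UTF-16 little-endian byte order mark (FF FE), so the file is most likely UTF-16. The fix is to name the encoding explicitly, again in four lines of code:

    from bs4 import BeautifulSoup

    file = open('index.html', 'r', encoding='utf-16-le')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)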

Then soup.get_text() returns the text inside the tags.
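When the encoding of a file is not known in advance, one option is to look at its byte order mark before opening it in text mode. The following is a minimal sketch, not part of the original post: `sniff_bom_encoding` is a hypothetical helper, the `utf-8` fallback is an assumption, and the demo file is invented. It also shows a side effect of `utf-16-le`: that codec leaves the BOM in the text as a leading U+FEFF character, whereas `utf-16` consumes it.

```python
import codecs
import os
import tempfile

# Checked in this order because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(path, default="utf-8"):
    """Return a codec name based on the file's byte order mark, else `default`."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# Hypothetical demo file: UTF-16 LE with a BOM.
html = "<html><body><p>你好</p></body></html>"
fd, path = tempfile.mkstemp(suffix=".html")
os.close(fd)
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

encoding = sniff_bom_encoding(path)
with open(path, encoding=encoding) as f:
    text = f.read()       # BOM consumed by the "utf-16" codec

with open(path, encoding="utf-16-le") as f:
    text_le = f.read()    # BOM kept as a leading U+FEFF character

os.remove(path)
print(encoding, text == html, repr(text_le[0]))
```

The string in `text` can then be passed to BeautifulSoup as in the fix above. Another route is to open the file in binary mode (`open('index.html', 'rb')`) and hand the bytes to BeautifulSoup, which then runs its own encoding detection instead of relying on the locale default.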

Other notes

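If the file mixes several encodings and decoding raises errors, the errors can be ignored as follows (untested; content is assumed to be the raw bytes of the file):

    soup = BeautifulSoup(content.decode('utf-8', 'ignore'), 'lxml')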
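To see what the `'ignore'` error handler actually does, here is a small sketch using only the standard library. The `content` bytes are invented for illustration: mostly UTF-8, with one GBK-encoded character spliced in to simulate a mixed-encoding file.

```python
# Hypothetical input: UTF-8 text with a stray GBK-encoded character in the middle.
content = "abc".encode("utf-8") + "中".encode("gbk") + "def".encode("utf-8")

# The default handler ("strict") raises on the first invalid byte.
try:
    content.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# "ignore" silently drops bytes that cannot be decoded.
ignored = content.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD so the damage stays visible.
replaced = content.decode("utf-8", "replace")

print(strict_error)
print(ignored)
print(replaced)
```

Note that `'ignore'` loses data without any trace, so `'replace'` may be the safer choice when it matters whether something was dropped.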
