What urllib.request.urlopen(...).read() returns when reading a web page in Python 3

This article is a repost. 2016-05-04 21:58. Tags: urllib.request.urlopen, python3, web scraping, crawler, garbled text

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Link to fetch:
#<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《論電影的七個元素》——關於我對電…</a>

import urllib.request

# The anchor tag that contains the article URL
str0 = r' <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《論電影的七個元素》——關於我對電…</a>'
title0 = str0.find(r'<a title')
print(title0)
href0 = str0.find(r'href')
print(href0)
html0 = str0.find(r'html')
print(html0)
# Skip the 6 characters of 'href="' and keep everything up to and including 'html'
url = str0[href0 + 6:html0 + 4]
print(url)
# read() returns bytes; decode them as UTF-8 to get text that displays properly in Python
content = urllib.request.urlopen(url).read()

print(type(content))  # bytes at this point
print(content.decode('utf-8'))  # decode to str so the text displays properly
print(type(content.decode('utf-8')))  # type after decoding: str
filename = url[-20:]
# The data is still bytes, so the file must be opened in binary mode ('wb')
with open(filename, 'wb') as f:
    f.write(content)

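I am new to Python and am using Python 3.5. While following a tutorial, I fetched an article from a Sina blog. When I used the return value of urllib.request.urlopen(url).read(), I found that content is of type bytes. Printed without conversion, it shows up as a bytes literal (b'...' full of \x.. escape sequences) instead of readable text.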
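A minimal sketch of that difference, runnable without network access. The variable `content` below is a stand-in built with `str.encode`; it is an assumption for illustration and not the actual page returned by `urlopen(url).read()`.

```python
# Stand-in for the bytes that urlopen(url).read() would return
# (hypothetical sample text, no network access needed).
text = "《論電影的七個元素》"
content = text.encode("utf-8")

print(type(content))   # <class 'bytes'>
print(content)         # shows a b'...' literal with \x.. escapes

# Decoding turns the bytes back into readable text.
decoded = content.decode("utf-8")
print(type(decoded))   # <class 'str'>
print(decoded)         # 《論電影的七個元素》
```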

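The solution is to decode content from UTF-8 into a str before printing it. When saving to a file, open the file in 'wb' mode so that the bytes are written as they are.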
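The two ways of saving the data can be sketched as follows. The file names and the sample bytes are hypothetical, and a temporary directory is used so nothing is left on disk. Writing the raw bytes in 'wb' mode and writing the decoded text in 'w' mode with an explicit encoding should produce the same file contents for this sample.

```python
import os
import tempfile

# Stand-in for the bytes returned by urlopen(url).read() (hypothetical sample).
content = "《論電影的七個元素》".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    # Option 1: write the bytes unchanged in binary mode.
    raw_path = os.path.join(tmp, "page_raw.html")
    with open(raw_path, "wb") as f:
        f.write(content)

    # Option 2: decode first, then write text with an explicit encoding.
    text_path = os.path.join(tmp, "page_text.html")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(content.decode("utf-8"))

    # Read both files back as bytes to compare them.
    with open(raw_path, "rb") as f:
        raw_bytes = f.read()
    with open(text_path, "rb") as f:
        text_bytes = f.read()

same = raw_bytes == text_bytes
print(same)
```

Binary mode is the simpler choice when the page is only being saved, because it does not require knowing the encoding at all.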

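When the page is opened with Inspect Element in Chrome, the head section declares the charset as utf-8, yet what the Python program reads is of type bytes, which puzzled me. The two do not actually conflict: utf-8 is the encoding of the data, while bytes is the Python type that read() always returns. The declared charset only says how to decode those bytes into a str.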
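Instead of hard-coding 'utf-8', the charset can often be taken from the Content-Type HTTP header, assuming the server sends one. The sketch below parses a header value with the standard library's `email.message.Message`; the helper names `pick_charset` and `decode_body` are hypothetical, and the header strings are made-up samples. Note that this reads the HTTP header, not the meta tag inside the page's head.

```python
from email.message import Message


def pick_charset(content_type_header, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(body, content_type_header):
    """Decode response bytes using the charset from the header."""
    return body.decode(pick_charset(content_type_header))


# With a real response, the same method is available on the headers object:
#     charset = response.headers.get_content_charset() or "utf-8"
print(pick_charset("text/html; charset=UTF-8"))
print(pick_charset("text/html"))
print(decode_body("電影".encode("gbk"), "text/html; charset=gbk"))
```

Falling back to a default is a design choice made here for illustration; a page whose real encoding differs from both the header and the default would still fail to decode.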
