Python 3: the format of web page data read by urllib.request.urlopen().read()

Reposted article, 2016-05-04 21:58. Tags: urllib.request.urlopen, python3, web scraping, crawler, garbled text

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Sample anchor tag: <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《论电影的七个元素》——关于我对电…</a>

import urllib.request

str0 = r' <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《论电影的七个元素》——关于我对电…</a>'
title0 = str0.find(r'<a title')  # index where the anchor tag starts
print(title0)
href0 = str0.find(r'href')  # index of the href attribute
print(href0)
html0 = str0.find(r'html')  # index of 'html' in the URL's .html suffix
print(html0)
url = str0[href0 + 6:html0 + 4]  # skip 'href="' (6 characters), keep through 'html' (4 characters)
print(url)

# read() always returns bytes; they must be decoded (this page uses UTF-8) to display as text
with urllib.request.urlopen(url) as response:
    content = response.read()

print(type(content))  # bytes at this point
text = content.decode('utf-8')  # decode the bytes into a str
print(text)
print(type(text))  # str after decoding
filename = url[-20:]
with open(filename, 'wb') as f:  # content is bytes, so the file must be opened in binary mode 'wb'
    f.write(content)

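I am new to Python and am using Python 3.5. Following a tutorial, I wrote code to fetch an article from Sina Blog. When working with the return value of urllib.request.urlopen(url).read(), I found that content is of type bytes. Printing it in Python without converting it first produces unreadable output: the Chinese text shows up as escaped byte sequences.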
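The difference between the two types can be reproduced without any network access. The following is a minimal sketch; the names `raw` and `text` and the sample string are illustrative assumptions, with `raw` standing in for what read() would return.

```python
# Stand-in for the bytes that read() returns: a UTF-8 encoded Chinese string.
raw = "《论电影的七个元素》".encode("utf-8")

print(type(raw))   # the bytes type
print(raw)         # printed as a bytes literal with \x escapes, not readable text

# Decoding with the matching encoding turns the bytes back into a str.
text = raw.decode("utf-8")
print(type(text))  # the str type
print(text)        # readable Chinese text
```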

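The solution is to decode content as UTF-8 into a str before printing it, and to open the output file in binary mode 'wb' when writing the bytes to a file.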
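As a rough sketch of that solution, the helper below writes the raw bytes unchanged and returns the decoded text. The function name `save_page`, the sample bytes, and the temporary path are assumptions made for illustration; in the real script `content` would come from urlopen(url).read().

```python
import os
import tempfile


def save_page(content: bytes, path: str, encoding: str = "utf-8") -> str:
    """Write the raw bytes to path unchanged and return them decoded as text."""
    # Binary mode: the bytes go to disk exactly as they were received.
    with open(path, "wb") as f:
        f.write(content)
    # errors="replace" keeps a stray invalid byte from raising UnicodeDecodeError.
    return content.decode(encoding, errors="replace")


# Hypothetical sample standing in for a downloaded page.
sample = "<title>示例</title>".encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "sample_page.html")
text = save_page(sample, path)
print(text)
```

Writing in 'wb' mode avoids a second encode step and preserves the page exactly as served. Opening the file in text mode with an explicit encoding and writing the decoded str would be an alternative.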

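When I opened the page with Inspect Element in Google Chrome, the head section showed the charset as utf-8, yet the data actually read by the Python program was of type bytes, which puzzled me. The two do not conflict: utf-8 is the character encoding the page declares, while bytes is the Python type of the raw, still-encoded data that read() always returns. The declared charset only tells you which codec to pass to decode().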
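Rather than hard-coding 'utf-8', the declared charset can be looked up before decoding. The sketch below is one possible approach, not the original author's method. The helper name `detect_encoding`, the simplified regular expression, and the sample headers and body are assumptions. With a real response, `response.headers` from urlopen() provides the same `get_content_charset()` method used here.

```python
import re
from email.message import Message

# Simplified pattern for a charset declared in a <meta> tag (an assumption,
# not a full HTML parser).
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_encoding(headers: Message, body: bytes, default: str = "utf-8") -> str:
    """Pick an encoding from the Content-Type header, then the <meta> tag."""
    charset = headers.get_content_charset()  # None if the header has no charset
    if charset:
        return charset
    match = META_CHARSET.search(body[:2048])  # the declaration sits near the top
    if match:
        return match.group(1).decode("ascii").lower()
    return default


# Hypothetical stand-ins for response.headers and response.read().
headers = Message()
headers["Content-Type"] = "text/html"
body = '<html><head><meta charset="utf-8"><title>示例</title></head></html>'.encode("utf-8")

encoding = detect_encoding(headers, body)
text = body.decode(encoding)
print(encoding)
print(text)
```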
