1
Scraping JD.com product information
Code:
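import requests

# url = "https://item.jd.com/2967929.html"
url = "https://item.jd.com/100011585270.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an exception for 4xx/5xx status codes
    r.encoding = r.apparent_encoding  # use the encoding guessed from the body
    print(r.text[:1000])
except requests.RequestException:
    print("Scraping failed")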
Run result 1:
<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F100011585270.html'</script>
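Run result 2:
Some product information was retrieved, but it was incomplete. Result 2 appeared only once and was not saved in time.
At first I suspected that result 1 appeared because I was not logged in, but it still appeared after logging in, so that explanation is ruled out.
Because result 2 did show up once by chance, I suspect either a network problem or that the crawler is being blocked.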
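Result 1 was printed without reaching the except branch, so raise_for_status() did not treat it as an error: the body is just a small script that redirects to the JD login page. A minimal sketch for telling that case apart from a real product page, assuming the redirect always has the form shown in result 1. The name is_login_redirect is a hypothetical helper, not part of requests.

```python
import re

# Matches a script that sends the browser to the JD login page, as seen in
# result 1. The exact pattern is an assumption based on that single output.
LOGIN_REDIRECT = re.compile(
    r"window\.location\.href\s*=\s*['\"]https?://passport\.jd\.com/"
)


def is_login_redirect(html: str) -> bool:
    """Return True if the page body is a redirect to the JD login page."""
    return LOGIN_REDIRECT.search(html) is not None


# The body from result 1, split across two string literals for readability.
result_1 = (
    "<script>window.location.href='https://passport.jd.com/uc/login?"
    "ReturnUrl=http%3A%2F%2Fitem.jd.com%2F100011585270.html'</script>"
)
print(is_login_redirect(result_1))
print(is_login_redirect("<html><head><title>product</title></head></html>"))
```

Calling is_login_redirect(r.text) after raise_for_status() would let the script report the redirect instead of printing it as if it were the product page.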
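I wanted to try changing the request headers to simulate a browser, but I have to build a fractal snowflake in Scratch right now, so this is on hold for the moment.
The notes above stop at the first video of Unit 3, Week 1 of "Python Web Crawling and Information Extraction" (Song Tian's MOOC).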
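The header change mentioned above could be sketched as follows. It has not been tried in these notes, so whether it gets past the login redirect is unknown. The User-Agent string and the helper name fetch_as_browser are illustrative assumptions.

```python
import requests

# A browser-like User-Agent; the exact string is an illustrative assumption.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_as_browser(url, session=None, timeout=10):
    """Fetch url with a browser-like User-Agent and return the decoded text."""
    # Reuse the caller's session if one is given, otherwise create a new one.
    session = session or requests.Session()
    r = session.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    r.raise_for_status()  # raise an exception for 4xx/5xx status codes
    r.encoding = r.apparent_encoding
    return r.text
```

A call such as fetch_as_browser("https://item.jd.com/100011585270.html")[:1000], wrapped in the same try/except as the original script, would show whether the response still contains the login redirect.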
2
I replaced the URL with a Taobao link: https://item.taobao.com/item.htm?spm=a1z0d.6639537.1997196601.24.77b47484qxHVRi&id=620107543829
Scrape result:
<html><!-- cph -->
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta charset="gbk"/>
<meta name="format-detection" content="telephone=no, address=no">
<link rel="dns-prefetch" href="//g.alicdn.com">
<link rel="dns-prefetch" href="//gtms01.alicdn.com">
<link rel="dns-prefetch" href="//gtms02.alicdn.com">
<link rel="dns-prefetch" href="//gtms03.alicdn.com">
<link rel="dns-prefetch" href="//gtms04.alicdn.com">
<link rel="dns-prefetch" href="//gd1.alicdn.com">
<link rel="dns-prefetch" href="//gd2.alicdn.com">
<link rel="dns-prefetch" href="//gd3.alicdn.com">
<link rel="dns-prefetch" href="//gd4.alicdn.com">
<link rel="amphtml" hreflang="zh-Hans" href="https://www.taobao.com/list/item-amp/620107543829.htm"/>
<link rel="alternate" hreflang="zh-Hant" href="https://world.taobao.com/item/620107543829.htm" />
<meta name="refer
>>> import requests
>>> r = requests.get("https://item.taobao.com/item.htm?spm=a1z0d.6639537.1997196601.24.77b47484qxHVRi&id=620107543829")
>>> r.encoding
'gb18030'
>>> r.apparent_encoding
'GB2312'
>>> r.encoding = r.apparent_encoding
>>> r.text[:1000]
'\r\n\r\n\r\n<!doctype html>\n<html><!-- cph -->\n <head>\n <meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<meta charset="gbk"/>\n<meta name="format-detection" content="telephone=no, address=no">\n<link rel="dns-prefetch" href="//g.alicdn.com">\n<link rel="dns-prefetch" href="//gtms01.alicdn.com">\n<link rel="dns-prefetch" href="//gtms02.alicdn.com">\n<link rel="dns-prefetch" href="//gtms03.alicdn.com">\n<link rel="dns-prefetch" href="//gtms04.alicdn.com">\n<link rel="dns-prefetch" href="//gd1.alicdn.com">\n<link rel="dns-prefetch" href="//gd2.alicdn.com">\n<link rel="dns-prefetch" href="//gd3.alicdn.com">\n<link rel="dns-prefetch" href="//gd4.alicdn.com">\n\n<link rel="canonical" href="https://item.taobao.com/item.htm?id=620107543829" />\n<link rel="amphtml" hreflang="zh-Hans" href="https://www.taobao.com/list/item-amp/620107543829.htm"/>\n<link rel="alternate" hreflang="zh-Hant" href="https://world.taobao.com/item/620107543829.htm" />\n\n<meta name="renderer" content="webkit"/>\n<meta name="refer'
>>> r.encoding
'GB2312'
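The session above involves three different encodings: r.encoding ('gb18030', which requests takes from the HTTP headers), the guessed r.apparent_encoding ('GB2312'), and the page's own <meta charset="gbk"/>. GB2312 is the smallest of the three character sets, so overwriting r.encoding with it can turn characters outside GB2312 into replacement characters. A small offline sketch of that effect, using the sample string "京東" as an assumption (it is not taken from the page):

```python
# Bytes as a GBK page would send them. "東" is a traditional character
# that GBK and GB18030 can represent but GB2312 cannot.
raw = "京東".encode("gbk")

# Decode with the narrower codec. errors="replace" mirrors how requests
# builds r.text from r.encoding.
narrow = raw.decode("gb2312", errors="replace")

# Decoding with either of the broader codecs recovers the original text.
wide_gbk = raw.decode("gbk")
wide_gb18030 = raw.decode("gb18030")

print("\ufffd" in narrow)
print(wide_gbk == "京東", wide_gb18030 == "京東")
```

The first 1000 characters shown above are plain ASCII, so they decode the same either way. For the rest of the page, keeping the original 'gb18030' may be the safer choice.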