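Some web pages return an error when you try to crawl them:
urllib.error.HTTPError: HTTP Error 403: Forbidden
This happens because the site checks what kind of client is connecting. To get past the check, disguise the request as a browser by setting the User-Agent header.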
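
For context, here is a minimal sketch of what urllib announces about itself when no User-Agent is set. It only inspects the default opener and sends nothing over the network; the variable name `default_agent` is chosen here for illustration.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers urllib attaches to every request by default.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]

# The value has the form "Python-urllib/<version>", which a site can
# easily tell apart from a real browser.
print(default_agent)
```
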
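To find a browser's User-Agent, open the page in a browser ---> F12 ---> Network ---> refresh.
Then select any request in the list; the User-Agent appears among its request headers: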
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
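import urllib.request  # URL handling module from the standard library


def openUrl(url):
    # Send a browser-like User-Agent so the site does not answer with 403.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        # No explicit 'Host' entry: urllib derives it from the URL, and a
        # hard-coded 'jandan.net' would be wrong for any other site.
    }
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)  # send the request
    html = response.read()  # read the raw bytes
    html = html.decode("utf-8")  # decode the bytes to text
    print(html)  # print the page source


if __name__ == "__main__":
    url = "http://jandan.net/ooxx/"  # alternative: 'http://www.douban.com/'
    openUrl(url)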
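
The script above stops with a traceback if the site still refuses the request. Below is a hedged variation that reports the status code instead. The name `fetch_html`, the `opener` parameter, and the UTF-8 fallback are assumptions made for this sketch and are not part of the original code.

```python
import urllib.error
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
)


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the decoded page, or None if the server answers with an HTTP error.

    `opener` is a hypothetical hook that lets the function be tried
    without network access; by default it is urllib.request.urlopen.
    """
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(req) as response:
            # Prefer the charset declared by the server; fall back to
            # UTF-8 when none is declared (an assumption of this sketch).
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset)
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (e.g. 403), err.reason the text.
        print("HTTP Error %d: %s" % (err.code, err.reason))
        return None
```

Calling `fetch_html("http://jandan.net/ooxx/")` performs a real request, so the result depends on how the site responds at that moment.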
