Response
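- r.status_code # HTTP status code of the response; 200 means the request succeeded
- r.text # response body decoded to a string using r.encoding
- r.content # response body as raw bytes
- r.encoding # encoding guessed from the HTTP headers
- r.apparent_encoding # encoding inferred by analyzing the response content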
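The difference between r.content, r.text, and the two encoding attributes is easiest to see on raw bytes. The sketch below uses only the standard library and no network access; `raw`, `decode_body`, and the sample string are hypothetical stand-ins for illustration. It assumes a UTF-8 page whose encoding was guessed as ISO-8859-1, which can happen when the headers do not declare a charset.

```python
# Stand-in for r.content: the raw bytes of a response body (hypothetical sample).
raw = "豆瓣短評".encode("utf-8")


def decode_body(content: bytes, encoding: str) -> str:
    """Decode raw bytes roughly the way r.text does with r.encoding set to `encoding`."""
    return content.decode(encoding, errors="replace")


good = decode_body(raw, "utf-8")       # matches the real encoding of the bytes
bad = decode_body(raw, "ISO-8859-1")   # wrong guess: no error is raised, but the text is garbled

print(good)
print(bad == good)  # False
```

With a real response, a common remedy is to set `r.encoding = r.apparent_encoding` before reading `r.text`.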
xpath
Tutorial: https://zhuanlan.zhihu.com/p/25572729
Auto-generating a path
- F12 + select the content to scrape + right-click Copy --> Copy XPath
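To see how a copied path picks out one element, here is a minimal offline sketch. It swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra packages; ElementTree supports only a subset of XPath, so the path starts with `.` and the text is read from `.text` rather than through a `text()` step. The `SAMPLE` markup and the `comment_text` helper are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the comment list on a page.
SAMPLE = """
<html><body>
  <div id="comments">
    <ul>
      <li><div>avatar</div><div><p><span>first comment</span></p></div></li>
      <li><div>avatar</div><div><p><span>second comment</span></p></div></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())


def comment_text(root, n):
    """Return the text of the n-th comment (1-based), or None if it is missing."""
    # Same shape as a path copied from the browser, with li[n] filled in.
    path = f".//*[@id='comments']/ul[1]/li[{n}]/div[2]/p/span"
    node = root.find(path)
    return node.text if node is not None else None


print(comment_text(root, 1))
print(comment_text(root, 2))
```

Real pages are rarely well-formed XML, which is one reason the notes use lxml's `etree.HTML` for actual scraping.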
Simple crawler template
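import requests
from lxml import etree


def getHtmlText(url, header):
    """Fetch the page and return the text of the first 10 comments."""
    files = []
    r = requests.get(url=url, headers=header)
    s = etree.HTML(r.text)
    for i in range(10):
        # XPath copied from the browser; li[i + 1] selects the (i + 1)-th comment.
        # extend() accumulates results instead of overwriting them on each pass.
        files.extend(s.xpath('//*[@id="comments"]/ul[1]/li[' + str(i + 1) + ']/div[2]/p/span/text()'))
    return files


def saveText(files):
    """Write the comments to discuss.text, one per line."""
    with open("discuss.text", "w", encoding="utf-8") as f:
        for i in files:
            f.write(i + "\n")


if __name__ == '__main__':
    url = "https://book.douban.com/subject/34876107/comments/"
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
    # Fetch once and reuse the result for both printing and saving.
    files = getHtmlText(url, header)
    print(files)
    saveText(files)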
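A quick way to check what was saved is to write one comment per line and read the file back. The sketch below is standard-library only and writes to a temporary directory; `save_lines`, `load_lines`, and the sample strings are hypothetical names for illustration, not part of the template.

```python
import os
import tempfile


def save_lines(path, items):
    """Write each item on its own line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(item.strip() + "\n")


def load_lines(path):
    """Read the file back as a list of non-empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


comments = ["first comment", "  second comment\n"]  # hypothetical sample data
path = os.path.join(tempfile.mkdtemp(), "discuss.text")
save_lines(path, comments)
print(load_lines(path))
```

A comment that itself contains line breaks would come back as several lines, so this check assumes single-line comments.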
