1. Basic steps for using requests:
- Specify the URL
- Send the request
- Extract the data from the response object
- Persist the data to storage
```python
import requests

# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. Send the request
response = requests.get(url=url)
# 3. Extract the page text from the response object
page_text = response.text
# 4. Persist the page to disk
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
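The response object carries more than `.text`. A minimal sketch of the other commonly used response attributes (all standard requests API, shown against the same URL):

```python
import requests

response = requests.get(url='https://www.sogou.com/')
print(response.status_code)              # HTTP status code, e.g. 200
print(response.encoding)                 # encoding requests inferred from the headers
print(response.headers['Content-Type'])  # response headers behave like a dict
page_bytes = response.content            # raw bytes, for images and other binary data
page_text = response.text                # text decoded with response.encoding
```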
2. Crawling Sogou search results for a given keyword
```python
import requests

url = 'https://www.sogou.com/web'
wd = input('Please enter a search keyword: ')
# The keyword is passed as the `query` parameter of the GET request
param = {
    'query': wd
}

page_text = requests.get(url=url, params=param).text
filename = wd + '.html'
with open(filename, 'w', encoding='utf-8') as f1:
    f1.write(page_text)
```
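Sites often inspect the User-Agent header and refuse requests that look like scripts, which is why the later examples all send a browser UA. A minimal sketch of the same search with UA masquerading (the UA value is just a sample browser signature):

```python
import requests

url = 'https://www.sogou.com/web'
# Pretend to be a regular browser; any real browser UA string works here
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
param = {'query': 'python'}
page_text = requests.get(url=url, params=param, headers=headers).text
```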
3. Ajax requests
Use a packet-capture tool to find the parameters the request carries. For example, to fetch paginated data: when you click "next page" the site sends an Ajax request whose URL stays the same while only the parameters change. So we define the request parameters in `param`, dynamically specifying the page number and the number of records per page, and the Ajax request returns a JSON object. We store the ID of every record on the page, build `new_url`, and request the detail information for each ID.
```python
import requests

url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}
# The page number and page size are ordinary request parameters
param = {
    'on': 'true',
    'page': 1,
    'pageSize': '15',
    'productName': '',
    'conditionType': '1',
    'applyname': '',
    'applysn': '',
}
# The Ajax endpoint returns JSON; collect the ID of every record on the page
id_list = []
json_object = requests.post(url=url, headers=headers, data=param).json()
print(json_object['list'])
for i in json_object['list']:
    id_list.append(i['ID'])

# Use each ID to request the detail data from a second Ajax endpoint
new_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
filename = 'yaojians.text'
with open(filename, 'w', encoding='utf-8') as f:
    for id in id_list:
        param = {
            'id': id
        }
        content_json = requests.post(url=new_url, data=param, headers=headers).json()
        f.write(str(content_json) + '\n')
```
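Because the page number is just one field of `param`, crawling several pages is only a loop over that field. A hedged sketch, assuming the endpoint above is still reachable and with `pages` (a name introduced here) standing in for however many pages you want:

```python
import requests

url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}

id_list = []
pages = 3  # assumed: how many pages to fetch
for page in range(1, pages + 1):
    # Only the page number changes between requests
    param = {
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }
    json_object = requests.post(url=url, headers=headers, data=param).json()
    for record in json_object['list']:
        id_list.append(record['ID'])
```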
4. Scraping data with BeautifulSoup
Parsing with bs4:
pip install bs4
pip install lxml
Parsing workflow (a minimal sketch follows this list):
1. Load the source code to be parsed into a bs object
2. Call the bs object's methods or attributes to locate the target tags in the source
3. Extract the text or attribute values held by the located tags
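A minimal sketch of the three steps on an inline HTML snippet (the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="book-mulu"><ul><li><a href="/book/1.html">Chapter 1</a></li></ul></div>'
# 1. Load the source code into a bs object
soup = BeautifulSoup(html, 'lxml')
# 2. Locate the target tag with the object's methods
a_tag = soup.select('.book-mulu > ul > li > a')[0]
# 3. Extract the text and attribute values from the located tag
print(a_tag.string)   # Chapter 1
print(a_tag['href'])  # /book/1.html
```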
```python
import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
res = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(res, 'lxml')
# Locate every chapter link in the table of contents
a_tags_list = soup.select('.book-mulu > ul > li > a')
filename = 'sanguo.txt'
with open(filename, 'w', encoding='utf-8') as fp:
    for a_tag in a_tags_list:
        title = a_tag.string
        detail_url = 'http://www.shicimingju.com' + a_tag['href']
        # Fetch each chapter page and extract its body text
        detail_content = requests.get(url=detail_url, headers=headers).text
        soup = BeautifulSoup(detail_content, 'lxml')
        detail_text = soup.find('div', class_='chapter_content').text
        fp.write(title + '\n' + detail_text)
        print(title, 'downloaded')
print('over')
```
5. Scraping images with a simple regular expression
```python
import os
import re
import requests

url = 'https://www.qiushibaike.com/pic/page/%d/?s=5170552'
start_page = int(input('Enter the start page: '))
end_page = int(input('Enter the end page: '))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}
# Create the output directory once, before the loop, so mkdir does not fail on page two
if not os.path.exists('qiutu'):
    os.mkdir('qiutu')
for page in range(start_page, end_page + 1):
    new_url = url % page
    response = requests.get(url=new_url, headers=headers).text
    # Extract every image URL on the current page
    images_url = re.findall('<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>', response, re.S)
    for image_url in images_url:
        detail_url = 'http:' + image_url
        # Download the binary stream of the current image
        content = requests.get(url=detail_url, headers=headers).content
        # Use the last segment of the URL path as the file name
        image_name = image_url.split('/')[-1]
        with open('./qiutu/' + image_name, 'wb') as f1:
            f1.write(content)
print('over')
```
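The `re.S` flag is what lets the non-greedy `.*?` span line breaks; without it, `.` stops at every newline and the pattern misses tags that wrap across lines. A minimal sketch on a made-up two-line snippet:

```python
import re

html = '<div class="thumb">\n<img src="//pic.example.com/a.jpg" alt="demo">\n</div>'
pattern = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
# Without re.S, '.' does not match '\n', so nothing is found
print(re.findall(pattern, html))        # []
# With re.S, '.' matches newlines too, so the capture succeeds
print(re.findall(pattern, html, re.S))  # ['//pic.example.com/a.jpg']
```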