The workflow of data crawling
Specify the URL
Send the request with the requests module
Extract the data from the response
Parse the data (regex parsing, bs4 parsing, or XPath parsing)
Persist the results to storage (a minimal end-to-end sketch follows)
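A minimal sketch of the five steps put together (the URL here is a placeholder, not a real target):

import requests
from bs4 import BeautifulSoup

url = 'http://example.com/page.html'                       # 1. specify the URL (placeholder)
headers = {'User-Agent': 'Mozilla/5.0'}
page_text = requests.get(url=url, headers=headers).text    # 2./3. send the request, take the response data
soup = BeautifulSoup(page_text, 'lxml')                    # 4. parse (bs4 here; regex or xpath work as well)
title = soup.title.string if soup.title else ''
with open('result.txt', 'w', encoding='utf-8') as fp:      # 5. persist to disk
    fp.write(title or '')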
I. bs4 (BeautifulSoup)
1. Installation
1. pip install bs4 (a stub package on PyPI that pulls in the real package, beautifulsoup4)
2. pip install lxml
2. How parsing works
1. Load the source code to be parsed into a bs object
2. Call the bs object's methods or attributes to locate the target tags in the source
3. Extract the text or attribute values held by the located tags (a minimal sketch follows)
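A minimal sketch of the three steps on an inline snippet (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

# 1. load the source to be parsed into a bs object
soup = BeautifulSoup('<div><a href="/x">link</a></div>', 'lxml')
# 2. locate the target tag through the object's methods/attributes
a = soup.find('a')
# 3. extract the text and attribute values from the located tag
print(a.string, a['href'])   # -> link /x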
3. Basic usage
Usage flow:
- Import: from bs4 import BeautifulSoup
- Usage: convert an HTML document into a BeautifulSoup object, then look up the target nodes through the object's methods and attributes
  (1) From a local file:
      - soup = BeautifulSoup(open('local file'), 'lxml')
  (2) From network data:
      - soup = BeautifulSoup('string or bytes content', 'lxml')
  (3) Printing the soup object displays the content of the HTML document
Basics:
(1) Look up by tag name
    - soup.a   finds only the first matching tag
(2) Get attributes
    - soup.a.attrs          all attributes and values of the a tag, returned as a dict
    - soup.a.attrs['href']  get the href attribute
    - soup.a['href']        shorthand for the same thing
(3) Get content
    - soup.a.string
    - soup.a.text
    - soup.a.get_text()
    [Note] If the tag contains mixed content (text plus nested tags), string returns None, while the other two still return the text content
(4) find: the first matching tag
    - soup.find('a')
    - soup.find('a', title="xxx")
    - soup.find('a', alt="xxx")
    - soup.find('a', class_="xxx")
    - soup.find('a', id="xxx")
(5) find_all: all matching tags
    - soup.find_all('a')
    - soup.find_all(['a','b'])     all a and b tags
    - soup.find_all('a', limit=2)  only the first two matches
(6) CSS selectors: select
    - soup.select('#feng')
    - common selectors: tag (a), class (.), id (#), and hierarchy selectors
    - hierarchy selectors:
        div .dudu #lala .meme .xixi   any number of levels below
        div > p > a > .lala           exactly one level below
    [Note] select always returns a list; index into it to get a specific element
A short demo of these lookups follows.
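A short sketch of the lookups above against a made-up snippet (the tags, classes, and ids here are illustrative only):

from bs4 import BeautifulSoup

html = '<div id="feng"><a class="du" href="/a1">one</a><a class="du" href="/a2">two <b>bold</b></a></div>'
soup = BeautifulSoup(html, 'lxml')

print(soup.a['href'])                     # first match only -> /a1
print(soup.find('a', class_='du').text)   # -> one
print(len(soup.find_all('a', limit=2)))   # -> 2
print(soup.select('#feng > a')[0].text)   # select returns a list -> one
print(soup.find_all('a')[1].string)       # None: text mixed with a nested <b> tag
print(soup.find_all('a')[1].text)         # -> two bold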
Goal: use bs4 to crawl every chapter of the novel Romance of the Three Kingdoms from the shicimingju site and save them to local disk: http://www.shicimingju.com/book/sanguoyanyi.html
import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# fetch the table of contents and pull out every chapter link
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')

fp = open('sanguo.txt', 'w', encoding='utf-8')
for a in a_list:
    title = a.string
    detail_url = 'http://www.shicimingju.com' + a['href']
    # fetch each chapter page and extract the chapter text
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    soup = BeautifulSoup(detail_page_text, 'lxml')
    content = soup.find('div', class_='chapter_content').text
    fp.write(title + '\n' + content)
    print(title, 'downloaded')
print('over')
fp.close()
II. XPath parsing
1. Installation
pip install lxml
2. How parsing works
Get the page source data
Instantiate an etree object and load the page source into it
Call the object's xpath method to locate the target tags
Note: the xpath method must be paired with an XPath expression to locate tags and capture content (a minimal sketch follows)
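A minimal sketch of these steps, with the fetched page source stubbed out as an inline string:

from lxml import etree

page_text = '<div class="song"><p>hello</p></div>'    # stands in for a page fetched with requests
tree = etree.HTML(page_text)                          # instantiate an etree object from the source
print(tree.xpath('//div[@class="song"]/p/text()'))    # locate via an XPath expression -> ['hello']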
3. Basic usage
Common XPath expressions
Attribute matching:
    # find the div tag whose class attribute is "song"
    //div[@class="song"]
Hierarchy & index matching:
    # find the a tag directly under the second li of the ul that is a direct child of the div with class "tang"
    //div[@class="tang"]/ul/li[2]/a
Logical operators:
    # find the a tag whose href attribute is empty and whose class attribute is "du"
    //a[@href="" and @class="du"]
Fuzzy matching:
    //div[contains(@class, "ng")]
    //div[starts-with(@class, "ta")]
Taking text:
    # /text() returns the text directly inside a tag
    # //text() returns the text inside a tag and inside all of its descendants
    //div[@class="song"]/p[1]/text()
    //div[@class="tang"]//text()
Taking attributes:
    //div[@class="tang"]//li[2]/a/@href
(These expressions are exercised against the test page below.)
Test page data
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>測試bs4</title>
</head>
<body>
    <div>
        <p>百里守約</p>
    </div>
    <div class="song">
        <p>李清照</p>
        <p>王安石</p>
        <p>蘇軾</p>
        <p>柳宗元</p>
        <a href="http://www.song.com/" title="趙匡胤" target="_self">
            <span>this is span</span>
            宋朝是最強大的王朝,不是軍隊的強大,而是經濟很強大,國民都很有錢</a>
        <a href="" class="du">總為浮雲能蔽日,長安不見使人愁</a>
        <img src="http://www.baidu.com/meinv.jpg" alt="" />
    </div>
    <div class="tang">
        <ul>
            <li><a href="http://www.baidu.com" title="qing">清明時節雨紛紛,路上行人欲斷魂,借問酒家何處有,牧童遙指杏花村</a></li>
            <li><a href="http://www.163.com" title="qin">秦時明月漢時關,萬里長征人未還,但使龍城飛將在,不教胡馬度陰山</a></li>
            <li><a href="http://www.126.com" alt="qi">岐王宅里尋常見,崔九堂前幾度聞,正是江南好風景,落花時節又逢君</a></li>
            <li><a href="http://www.sina.com" class="du">杜甫</a></li>
            <li><a href="http://www.dudu.com" class="du">杜牧</a></li>
            <li><b>杜小月</b></li>
            <li><i>度蜜月</i></li>
            <li><a href="http://www.haha.com" id="feng">鳳凰台上鳳凰游,鳳去台空江自流,吳宮花草埋幽徑,晉代衣冠成古丘</a></li>
        </ul>
    </div>
</body>
</html>
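A short demo of the cheatsheet expressions, assuming the test page above has been saved locally as test.html:

from lxml import etree

with open('test.html', 'r', encoding='utf-8') as f:
    page_text = f.read()
tree = etree.HTML(page_text)

print(tree.xpath('//div[@class="song"]/p[1]/text()'))        # attribute + index -> ['李清照']
print(tree.xpath('//div[@class="tang"]/ul/li[2]/a/text()'))  # hierarchy & index -> the second poem
print(tree.xpath('//a[@href="" and @class="du"]/text()'))    # logical operators
print(tree.xpath('//div[@class="tang"]//li[2]/a/@href'))     # taking an attribute -> ['http://www.163.com']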
Usage in code
1. Install: pip install lxml
2. Import: from lxml import etree
3. Convert the HTML or XML document into an etree object, then call the object's methods to look up the target nodes:
   3.1 Local file:
       tree = etree.parse(filename)
       tree.xpath("xpath expression")
   3.2 Network data:
       tree = etree.HTML(page source string)
       tree.xpath("xpath expression")
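One caveat worth flagging (not from the original notes): etree.parse defaults to a strict XML parser, which rejects HTML that is not well-formed XML; for local HTML files, passing etree.HTMLParser() explicitly is the more forgiving pattern:

from lxml import etree

# parse a local HTML file leniently instead of as strict XML
tree = etree.parse('test.html', etree.HTMLParser())   # test.html: the test page above
print(tree.xpath('//title/text()'))                   # -> ['測試bs4']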
Install an XPath plugin in the browser to verify XPath expressions: expressions can be executed directly inside the plugin.
To install, drag the XPath plugin onto Chrome's extensions page (More tools > Extensions).
Toggle the plugin on and off with Ctrl + Shift + X.
Examples:
1. Scrape second-hand housing listings from 58.com
import requests
from lxml import etree

url = 'https://bj.58.com/shahe/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0047-e4e6-f587-683307ca570e&ClickID=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# each listing is one li under the house-list-wrap ul
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
fp = open('58.csv', 'w', encoding='utf-8')
for li in li_list:
    # a leading './' makes the expression relative to the current li node
    title = li.xpath('./div[2]/h2/a/text()')[0]
    price = li.xpath('./div[3]//text()')
    price = ''.join(price)
    fp.write(title + ":" + price + '\n')
fp.close()
print('over')
2. Scrape image data: http://pic.netbian.com/4kmeinv/
import os
import urllib.request
import requests
from lxml import etree

url = 'http://pic.netbian.com/4kmeinv/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# response.encoding = 'utf-8'
if not os.path.exists('./imgs'):
    os.mkdir('./imgs')
page_text = response.text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')
for li in li_list:
    img_name = li.xpath('./a/b/text()')[0]
    # fix garbled Chinese: requests guessed ISO-8859-1 but the page is gbk-encoded
    img_name = img_name.encode('iso-8859-1').decode('gbk')
    img_url = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    img_path = './imgs/' + img_name + '.jpg'
    urllib.request.urlretrieve(url=img_url, filename=img_path)
    print(img_path, 'downloaded!')
print('over!!!')
3. Download the images from jandan.net (http://jandan.net/ooxx). The image URLs are obfuscated as an anti-scraping measure: each URL is stored base64-encoded in a span.img-hash element and must be decoded before it can be downloaded.
import base64
import urllib.request
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# each image URL is stored base64-encoded inside a span.img-hash element
img_hash_list = tree.xpath('//span[@class="img-hash"]/text()')
for img_hash in img_hash_list:
    # decode the hash to recover the protocol-relative image URL
    img_url = 'http:' + base64.b64decode(img_hash).decode()
    img_name = img_url.split('/')[-1]
    urllib.request.urlretrieve(url=img_url, filename=img_name)
4. Crawl the free résumé templates from sc.chinaz.com
import random
import requests
from lxml import etree

headers = {
    # drop the connection as soon as each request finishes, releasing pooled connections promptly
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free_%d.html'
for page in range(1, 4):
    # the first page has no page number in its URL
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = url % page
    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]
        detail_page = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page)
        download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        # pick one download mirror at random
        download_url = random.choice(download_list)
        data = requests.get(url=download_url, headers=headers).content
        fileName = name + '.rar'
        with open(fileName, 'wb') as fp:
            fp.write(data)
        print(fileName, 'downloaded')
5. Parse all the city names (https://www.aqistudy.cn/historydata/)
import requests
from lxml import etree

headers = {
    # drop the connection as soon as each request finishes, releasing pooled connections promptly
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# '|' unions two expressions so one call collects li nodes from both sections of the page
li_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
for li in li_list:
    city_name = li.xpath('./a/text()')[0]
    print(city_name)
