Method:
1. Under a single job category, the results span multiple pages, so first crawl each page's link (s_url in the code below);
2. On each page (page_x), scrape and save the links of the 15 individual job postings it lists (list_url in the code);
3. Open each posting link and extract the desired fields, such as title, content, and salary;
4. Save the information and write it out to a CSV file (see the sketch after this list).
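Step 4 mentions CSV output, but the code below actually writes raw text lines separated by "*" rulers. A minimal sketch of writing real CSV rows instead, using Python's standard csv module (the sample row and the column names here are made up for illustration):

import csv

# Hypothetical scraped rows: (title, salary, content) for each posting
rows = [("Java Developer", "15k-25k", "Job description ...")]

with open("cn-blog.csv", "a+", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "salary", "content"])  # column header; write once
    writer.writerows(rows)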
Code:
from lxml import etree
import requests
import time

# Target listing page to crawl
url = "https://www.lagou.com/zhaopin/Java/?labelWords=label"
# Request headers that mimic a real browser, which helps get past some anti-scraping checks
head = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

res = requests.get(url, headers=head).content.decode("utf-8")
tree = etree.HTML(res)
# Collect the pagination links on this page
s_url = tree.xpath("//div[@class='pager_container']/a[position()>2 and position()<7]/@href")
print('s_url=', s_url)

# Visit page1, page2, ... in turn
for x in s_url:
    res = requests.get(x, headers=head).content.decode("utf-8")
    tree = etree.HTML(res)
    print('x==', x)
    # Collect the links of the 15 job postings on the current page (XPath position() is 1-based)
    list_url = tree.xpath("//div[@class='s_position_list ']/ul/li[position()>=1 and position()<=15]/div/div[1]/div/a/@href")
    print('list_url=', list_url)
    # Visit each posting and pull out the title, description, and salary
    for y in list_url:
        r01 = requests.get(y, headers=head).content.decode("utf-8")
        html01 = etree.HTML(r01)
        print('y==', y)
        title = html01.xpath("string(//div[@class='job-name'])")
        print('title===', title)
        content = html01.xpath("string(//div[@class='job-detail'])")
        print('content===', content)
        salary = html01.xpath("string(/html/body/div[5]/div/div[1]/dd/h3/span[1])")
        print('salary===', salary)
        # Sleep between requests so the site is less likely to flag us; a random delay works even better
        time.sleep(5)
        # Append the scraped fields to the output file
        with open("cn-blog.csv", "a+", encoding="utf-8") as file:
            file.write(title + "\n")
            file.write(content + "\n")
            file.write(salary + "\n")
            file.write("*" * 50 + "\n")
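The sleep comment above suggests a random delay; a minimal sketch of one way to do that with the standard random module (the 3-to-8-second range is an arbitrary choice, not from the original post):

import random
import time

# A random 3-8 second pause between requests instead of a fixed 5 seconds,
# so the request timing looks less mechanical
time.sleep(random.uniform(3, 8))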
Summary:
1. Set the head info and sleep between requests so the site is less likely to flag the crawler (it still blocks some requests, but most of the data gets through); a random sleep works best, as sketched right after the code above;
2. To grab a range of elements under the same parent, use an XPath positional predicate of the form [position()>x and position()<y]; a small runnable example follows.
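To illustrate point 2, a self-contained example of the positional predicate. Note that XPath position() counts from 1, which is also why the listing XPath in the code above uses position()>=1 and position()<=15 to take the first 15 items (the HTML snippet here is made up for the demo):

from lxml import etree

html = etree.HTML("<ul><li>a</li><li>b</li><li>c</li><li>d</li><li>e</li></ul>")
# position() is 1-based, so this keeps items 1..3
print(html.xpath("//li[position()>=1 and position()<=3]/text()"))  # ['a', 'b', 'c']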