Lagou Project (1): Project Workflow + Data Extraction


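Notes:

   1) This is for personal study only. If anything here causes offence, let me know and I will take it down promptly.
   2) I do not want to mislead anyone, so corrections are welcome.
   3) Companion video for this article: http://www.bilibili.com/video/BV1aC4y1a7nR?share_medium=android&share_source=copy_link&bbid=XY1C2901EE0D25CCEC5E23A673F2026B36BEF&ts=1592703866866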

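Goals:

   1. Scrape programming-language job listings from Lagou and collect four fields: 1) salary, 2) city, 3) years of experience, 4) education requirement;
   2. Save these four fields to MySQL;
   3. Visualize the four fields;
   4. Finally, polish the result as a web page using pyecharts + bootstrap.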

Skills used:

   1. Python networking basics (requests, XPath syntax, etc.);
   2. MySQL + pymysql basics;
   3. pyecharts basics;
   4. bootstrap basics.

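Project workflow and logic:

   Overall approach: first scrape one category of information and visualize it. Walking through the whole pipeline once matters most; extend it afterwards.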

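[Figure: flowchart]

1. Open the following page:

[Figure: results page]

                              -------> Refresh and find the request URL: <--------

[Figure: request URL]

                              -------> Analyze the request and its parameters: <--------

[Figure: request details]

                       -------> The URL is a POST request, so parameters must be submitted; scroll down: <-------

[Figure: request parameters]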
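
The form fields submitted with the POST request can be assembled in a small helper. This is a minimal sketch: the field names `first`, `pn`, and `kd` come from the code in this post, while the helper name `build_form_data` and the rule that `first` is "true" only on page 1 are assumptions, not confirmed behaviour of the site.

```python
def build_form_data(page, keyword="python"):
    """Return the form fields for one results page.

    Assumption: `first` is "true" only for the first page. The code in
    this post only ever sends page 1, so this rule is a guess.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "first": "true" if page == 1 else "false",
        "pn": page,
        "kd": keyword,
    }


if __name__ == "__main__":
    for page in (1, 2):
        print(build_form_data(page))
```

Keeping the fields in one place makes it easier to change the keyword or page number without editing the request code.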

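2. Handling the anti-scraping measures

1. The steps above deal with ------> Lagou's AJAX request mechanism
2. The timestamp hidden in the cookies ------> use a session to keep the conversation alive ----- and refresh the cookies in real time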
 
import requests

# Fetch fresh cookies by visiting the search page first.
# start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
def cookieRequest(start_url):
    # `headers` is the module-level dict defined in the main block
    r = requests.Session()
    r.get(url=start_url, headers=headers, timeout=3)
    return r.cookies

 

 

3. Building the workflow

1. The main block:
import time

import requests

if __name__ == '__main__':
    # Initial URL, used to obtain the cookies
    start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
    # URL of the AJAX request to imitate
    post_url = "https://www.lagou.com/jobs/positionAjax.json?"
    # Request headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    }
    # Dynamic cookies
    cookies = cookieRequest(start_url)
    time.sleep(1)
    # Page number to request
    page = 1
    # Exception handling
    try:
        data = {
            "first": "true",
            "pn": page,
            "kd": "python",
        }
        textInformation(post_url, data, cookies)
        time.sleep(7)
        print('------------Page %s scraped successfully, moving on to the next page--------------' % page)
    except requests.exceptions.ConnectionError:
        print("Connection refused")
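
The main block requests page 1 only, although its log message mentions moving on to the next page. One way to extend it to several pages is sketched below. The function name `crawl_pages`, its parameters, and the `first` rule are assumptions for illustration; the two callables are passed in so the loop itself can be tried without touching the network.

```python
import time


def crawl_pages(get_cookies, fetch_page, last_page, delay=7.0):
    """Fetch pages 1..last_page, asking for fresh cookies before each request.

    get_cookies() and fetch_page(form_data, cookies) are supplied by the
    caller. Returns the list of page numbers that were fetched.
    """
    succeeded = []
    for page in range(1, last_page + 1):
        form_data = {
            "first": "true" if page == 1 else "false",  # assumption
            "pn": page,
            "kd": "python",
        }
        try:
            fetch_page(form_data, get_cookies())
        except OSError as exc:
            # requests' exceptions derive from IOError/OSError.
            # Stop at the first failure rather than keep hitting the site.
            print("page %s failed: %s" % (page, exc))
            break
        succeeded.append(page)
        if page < last_page:
            time.sleep(delay)  # pause between pages
    return succeeded


if __name__ == "__main__":
    pages = crawl_pages(
        lambda: {"demo": "cookie"},
        lambda data, cookies: print("fetched page", data["pn"]),
        last_page=3,
        delay=0,
    )
    print(pages)
```

With the functions from this post, the call might look like `crawl_pages(lambda: cookieRequest(start_url), lambda data, cookies: textInformation(post_url, data, cookies), last_page=5)`.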

 

2. The list page function
 
import json

def textInformation(post_url, data, cookies):
    response = requests.post(post_url, headers=headers, data=data, cookies=cookies, timeout=3).text
    div1 = json.loads(response)
    # Job listings on this page
    position_data = div1["content"]["positionResult"]["result"]
    n = 1
    for result in position_data:
        infor = {
            "positionName": result["positionName"],

            "companyFullName": result["companyFullName"],
            "companySize": result["companySize"],
            "industryField": result["industryField"],
            "financeStage": result["financeStage"],

            "firstType": result["firstType"],
            "secondType": result["secondType"],
            "thirdType": result["thirdType"],

            "positionLables": result["positionLables"],

            "createTime": result["createTime"],

            "city": result["city"],
            "district": result["district"],
            "businessZones": result["businessZones"],

            "salary": result["salary"],
            "workYear": result["workYear"],
            "jobNature": result["jobNature"],
            "education": result["education"],

            "positionAdvantage": result["positionAdvantage"]
        }

        print(infor)
        time.sleep(5)
        print('----------Write #%s-------' % n)
        n += 1
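
The function above indexes every key directly, so one record with a missing key would raise `KeyError` and stop the loop. A more tolerant variant is sketched here. The helper name `extract_fields` and the shortened field list are assumptions for illustration, and the sample record is made up rather than scraped.

```python
# A subset of the keys used in this post, chosen to match the four target fields.
FIELDS = ("positionName", "companyFullName", "city", "salary", "workYear", "education")


def extract_fields(record, fields=FIELDS, default=None):
    """Copy the selected keys out of one job record, tolerating missing keys."""
    return {name: record.get(name, default) for name in fields}


if __name__ == "__main__":
    # Made-up record for illustration only; not real scraped data.
    sample = {"positionName": "Python Developer", "city": "Beijing", "salary": "15k-25k"}
    print(extract_fields(sample))
```

Missing keys come back as `None`, which can be written to MySQL as NULL instead of aborting the page.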

 

 
3. Getting the show_id for each category separately (used by the detail page):

https://www.lagou.com/jobs/4254613.html?show=0977e2e185564709bebd04fe72a34c9f

show_id = []
def getShowId(post_url, headers, cookies):
    data = {
        "first": "true",
        "pn": 1,
        "kd": "python",
    }
    response = requests.post(post_url, headers=headers, data=data, cookies=cookies).text
    div1 = json.loads(response)
    # Job listings on this page
    position_data = div1["content"]["positionResult"]["result"]
    # show_id used by the detail pages
    position_show_id = div1['content']['showId']
    show_id.append(position_show_id)
    # return position_show_id
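
Once the show_id is known, the detail-page URLs for a page of results can be built from the same JSON response. This is a sketch under one assumption: that each job record carries its id under the key `positionId`, which the code in this post does not show. The helper name `build_detail_urls` is also hypothetical. The sample values are the ids from the example URL above.

```python
DETAIL_URL = "https://www.lagou.com/jobs/{}.html?show={}"


def build_detail_urls(payload):
    """Return one detail-page URL per job in a decoded positionAjax response.

    Assumption: each job record stores its id under "positionId".
    """
    content = payload["content"]
    show_id = content["showId"]
    jobs = content["positionResult"]["result"]
    return [DETAIL_URL.format(job["positionId"], show_id) for job in jobs]


if __name__ == "__main__":
    sample = {
        "content": {
            "showId": "0977e2e185564709bebd04fe72a34c9f",
            "positionResult": {"result": [{"positionId": 4254613}]},
        }
    }
    print(build_detail_urls(sample))
```

Returning the URLs avoids the module-level `show_id` list and keeps each page's show_id paired with its own jobs.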

 

4. Detail page information
from lxml import etree

def detailinformation(detail_id, show_id):
    get_url = "https://www.lagou.com/jobs/{}.html?show={}".format(detail_id, show_id)
    # time.sleep(2)
    # Fetch the detail page
    response = requests.get(get_url, headers=headers, timeout=5).text
    # print(response)
    html = etree.HTML(response)
    div1 = html.xpath("//div[@class='job-detail']/p/text()")
    # Job description: clean the data by removing non-breaking spaces
    position_list = [i.replace(u'\xa0', u'') for i in div1]
    # print(position_list)
    return position_list
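
The cleaning step above removes non-breaking spaces but keeps blank or whitespace-only lines. A slightly stricter cleaner is sketched below; the helper name `clean_lines` is hypothetical and the sample strings are made up.

```python
def clean_lines(raw_lines):
    """Remove non-breaking spaces and surrounding whitespace, then drop empty lines."""
    cleaned = (line.replace("\xa0", "").strip() for line in raw_lines)
    return [line for line in cleaned if line]


if __name__ == "__main__":
    # Made-up strings standing in for the XPath text() results.
    print(clean_lines(["\xa01. Python basics\xa0", "   ", "\n2. MySQL"]))
```

Dropping the empty entries before saving keeps blank rows out of the stored job description.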

 

The complete code is on GitHub:

  https://github.com/xbhog/studyProject

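4. Unresolved issues / things to improve

  1. When saving detail pages to MySQL, some records come back empty, possibly because of network jitter or requests being sent too frequently

  2. No multithreading

  3. No scrapy framework

  4. Not written with classes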
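
For the first issue, empty detail pages, one possible mitigation is to retry with a growing pause. This is only a sketch of that idea and has not been tested against the site; the helper name `fetch_with_retry` and its parameters are assumptions.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=2.0):
    """Call fetch() until it returns a non-empty result or attempts run out.

    The wait doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    Returns an empty list if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            result = fetch()
        except OSError:
            # requests' exceptions derive from IOError/OSError
            result = None
        if result:
            return result
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)
    return []


if __name__ == "__main__":
    outcomes = iter([[], [], ["job description"]])
    print(fetch_with_retry(lambda: next(outcomes), base_delay=0))
```

With the function from this post, the call might look like `fetch_with_retry(lambda: detailinformation(detail_id, show_id))`.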

    ------> Next installment <---------

Data storage: ---- storage environment: Ubuntu

  1. MySQL storage

  2. CSV storage

Data storage article: https://www.cnblogs.com/xbhog/p/13141128.html

