Tools used: Chrome, Eclipse, Python 3 (Anaconda3)
Modules: requests, lxml, csv, time
I. Data collection
1. Define the target --- scrape second-hand housing listings for the Chongqing area (unit price, total price, layout, floor area, and so on)
1) Open the target site in Chrome and locate the data fields to be scraped
2) Press F12 on that page, find the target data and copy its XPath value; the result is shown in Figure 1-2-2
Grab the data for a few more listings and you will see that only the number inside li[?] differs between their XPaths; each page holds 60 listings in total, so the highest index is li[60] (see the sketch after the figures).
Figure 1-2-1 / Figure 1-2-2 (Chrome DevTools screenshots, not reproduced here)
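As a concrete sketch of that observation (assuming the page structure shown in the screenshots is still current), you can ask lxml for every li under the list and use a relative XPath for the part that never changes, instead of hard-coding li[1] through li[60]:

import requests
from lxml import etree

# Minimal sketch: fetch page 1 and walk every listing <li>; the selector
# mirrors the absolute XPath copied from Chrome, with the li[?] index removed.
html = requests.get('https://chongqing.anjuke.com/sale/p1',
                    headers={'user-agent': 'Mozilla/5.0'}).text
sel = etree.HTML(html)
for li in sel.xpath('//*[@id="houselist-mod-new"]/li'):
    model = li.xpath('./div[2]/div[2]/span[1]/text()')    # relative to this <li>
    print(model[0].strip() if model else 'N/A')

This also avoids the IndexError you would hit with a fixed 1..60 loop on a page that holds fewer than 60 listings.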
2. Analyze the page URLs
As in step 2) above, the request URL can be seen under the Network tab; paging through a few result pages shows that the URLs differ only in the number after p.
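To make the pattern concrete (matching the URL the script builds later):

# Page n lives at https://chongqing.anjuke.com/sale/p<n>:
#   https://chongqing.anjuke.com/sale/p1
#   https://chongqing.anjuke.com/sale/p2
#   https://chongqing.anjuke.com/sale/p3
urls = ['https://chongqing.anjuke.com/sale/p' + str(page) for page in range(1, 4)]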
II. Python implementation
1. Straight to the code ---
#!/usr/bin/env python
# -*- coding: utf8 -*-

'''
Created on 2018-11-24
@author: perilong
'''
import requests
from lxml import etree
import time
import csv


'''
Function: spider
Purpose:  fetch the target site and return the page source as text
Param:    url -- the target URL
'''
def spider(url):
    try:
        header = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
        response = requests.get(url=url, headers=header)
        return response.text
    except requests.RequestException:
        print('failed to spider the target site, please check if the url is correct or the connection is available!')


'''
Function: spider_detail
Purpose:  parse the HTML source and extract the fields of each listing
Param:    url -- the target URL
'''
def spider_detail(url):
    response_text = spider(url)
    if response_text is None:    # the request failed, skip this page
        return
    sel = etree.HTML(response_text)
    for house_num in range(1, 61):    # each page holds at most 60 listings: li[1] .. li[60]
        house_model = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[1]/text()'
                                % house_num)[0].strip()
        house_area = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[2]/text()'
                               % house_num)[0].strip()
        house_floor = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[3]/text()'
                                % house_num)[0].strip()
        house_year = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[4]/text()'
                               % house_num)[0].strip()
        house_location = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[3]/span/text()'
                                   % house_num)[0].strip()
        house_price = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[3]/span[2]/text()'
                                % house_num)[0].strip()
        house_total = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[3]/span[1]/strong/text()'
                                % house_num)[0].strip()
        house_connection = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[5]/text()'
                                     % house_num)[0].strip()

        # clean the raw strings: the location text holds "district - garden"
        # on its second line, and price/year/area carry their units
        house_district = house_location.split('\n')[1].split('-')[0].strip()
        house_garden = house_location.split('\n')[1].split('-')[1].strip()
        house_price = house_price.strip('元/m²')
        house_year = house_year.strip('年建造')
        house_area = house_area.strip('m²')

        house_data = [house_model, house_area, house_floor, house_year,
                      house_price, house_total, house_district, house_garden, house_connection]
        save_csv(house_data)

        print('house_model: ', house_model)
        print('house_area: ', house_area)
        print('house_floor: ', house_floor)
        print('house_year: ', house_year)
        print('house_garden: ', house_garden)
        print('house_price: ', house_price)
        print('house_total: ', house_total)
        print('house_district: ', house_district)
        print('house_connection: ', house_connection)
        print('========================================')


'''
Function: save_csv
Purpose:  append one row of data to the csv file
Param:    house_data -- the list of fields extracted for one listing
'''
def save_csv(house_data):
    try:
        with open('D:/spider_data/QFange/chongqing.csv', 'a', encoding='utf-8-sig', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(house_data)
    except OSError:
        print('write csv error!')


'''
Function: get_all_urls
Purpose:  generate all the page urls as a generator
Param:    page_number -- total number of pages to crawl
Returns:  a generator yielding one url per page
'''
def get_all_urls(page_number):
    if isinstance(page_number, int) and page_number > 0:    # guard against bad input
        for page in range(1, page_number + 1):
            url = 'https://chongqing.anjuke.com/sale/p' + str(page)
            yield url
    else:
        print('page_number is incorrect!')


# write the csv header row first
save_csv(['house_model', 'house_area', 'house_floor', 'house_year',
          'house_price', 'house_total', 'house_district', 'house_garden', 'house_connection'])

for url in get_all_urls(50):
    try:
        time.sleep(20)    # throttle: wait 20s between pages to avoid being blocked
        spider_detail(url)
    except Exception:
        print('An error occurred while spidering Chongqing house prices!')
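A possible hardening of spider(), sketched here under my own assumptions (the helper name, retry count, and timeout are not part of the original script): add a timeout and a status-code check so a blocked or slow page fails fast instead of hanging the crawl.

import time
import requests

def spider_with_retry(url, retries=3, timeout=10):    # hypothetical helper, not in the original
    header = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=header, timeout=timeout)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass    # network error: fall through and retry
        time.sleep(5)    # back off briefly before the next attempt
    print('failed to fetch %s after %d attempts' % (url, retries))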
2. Scrape results:
Thinking back to 'hello world', this is almost enough to bring a tear to the eye...
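Since the result screenshot is not reproduced here, a quick sanity check on the output file (assuming the same path used in save_csv) looks like this:

import csv

with open('D:/spider_data/QFange/chongqing.csv', encoding='utf-8-sig') as f:
    rows = list(csv.reader(f))

print('header:', rows[0])                      # the column names written first
print('listings collected:', len(rows) - 1)    # every other row is one listing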