Scraping Lianjia Second-Hand Housing Data with Python: Chongqing

Reposted article. 2019-05-29 23:08. Tags: web scraping / python

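I have been studying data analysis recently and wanted a dataset to practice on, so I decided to use Python to scrape Lianjia's second-hand housing listings for the Chongqing area.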

The Lianjia listing page looks like this:

The scraping code is as follows:

import csv
import time

import requests
from bs4 import BeautifulSoup
def parse_one_page(url):
    """Fetch one listing page and append one CSV row per listing."""
    headers = {
        'user-agent': 'Mozilla/5.0'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    results = soup.find_all(class_="clear LOGCLICKDATA")

    for item in results:
        output = []
        # Take the district from the URL (third segment from the end)
        output.append(url.split('/')[-3])

        # Layout, area, orientation and similar details. The elevator field
        # is sometimes missing; that is easy to fix during data cleaning.
        info1 = item.find('div', 'houseInfo').text.replace(' ', '').split('|')
        output.extend(info1)

        # Total price
        output.append(item.find('div', 'totalPrice').text)

        # Year built: the four characters before '年'; a single space if absent
        info2 = item.find('div', 'positionInfo').text.replace(' ', '')
        pos = info2.find('年')
        if pos != -1:
            output.append(info2[pos-4:pos])
        else:
            output.append(' ')

        # Unit price
        output.append(item.find('div', 'unitPrice').text)
        #print(output)
        write_to_file(output)

def write_to_file(content):
    # newline='' prevents blank lines between rows in the CSV output
    with open('data.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        #writer.writerow(['Region', 'Garden', 'Layout', 'Area', 'Direction', 'Renovation', 'Elevator', 'Price', 'Year', 'PerPrice'])
        writer.writerow(content)
        
def main(offset):
    regions = ['jiangbei', 'yubei', 'nanan', 'banan', 'shapingba', 'jiulongpo', 'yuzhong', 'dadukou', 'jiangjing', 'fuling',
             'wanzhou', 'hechuang', 'bishan', 'changshou1', 'tongliang', 'beibei']
    for region in regions:
        for i in range(1, offset):
            url = 'https://cq.lianjia.com/ershoufang/' + region + '/pg'+ str(i) + '/'
            parse_one_page(url)
            time.sleep(1)
        print('{} has been written.'.format(region))

main(101)
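
The loop in main requests every page from 1 to offset - 1 for each district, even when a district has fewer pages. A minimal sketch of stopping early is given here. It assumes a variant of parse_one_page that returns the parsed rows instead of writing them, and it assumes that a page past the end yields no listing elements, which has not been verified against the site. The names crawl_region, fetch_rows and fake_fetch are hypothetical.

```python
from typing import Callable, List


def crawl_region(region: str,
                 fetch_rows: Callable[[str], List[list]],
                 max_pages: int = 100) -> List[list]:
    """Collect rows page by page, stopping at the first page with no rows."""
    rows = []
    for page in range(1, max_pages + 1):
        url = 'https://cq.lianjia.com/ershoufang/{}/pg{}/'.format(region, page)
        page_rows = fetch_rows(url)
        if not page_rows:
            # Assumption: an empty result means the district has no more pages.
            break
        rows.extend(page_rows)
    return rows


def fake_fetch(url: str) -> List[list]:
    """Offline stand-in for a real request: pretends the district has 3 pages."""
    page = int(url.rstrip('/').rsplit('pg', 1)[1])
    return [[url.split('/')[-3], 'row']] if page <= 3 else []


demo_rows = crawl_region('dadukou', fake_fetch)
print(len(demo_rows))
```

Passing the fetch function as a parameter keeps the loop testable without network access; in a real run it would be replaced by a function that performs the request and parsing.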

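Lianjia shows at most 100 pages of results, so we scrape the first 100 pages for each district. Some districts have fewer than 100 pages, but that does no harm. The scraped result is shown below (the data has already been lightly processed: the problem rows are in the elevator column and the community-name column, and they can be fixed by sorting and then shifting the affected cells as a block; missing years are handled later):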
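
The manual cell shifting described above could also be avoided at scrape time by padding the houseInfo fields to a fixed width before writing the row. The following is only a sketch: it assumes that the missing value is always the trailing elevator field, which may not hold for every listing, and the helper name pad_house_info and the sample strings are invented for illustration. The column names are taken from the commented-out header row in write_to_file.

```python
HOUSE_INFO_COLUMNS = ['Garden', 'Layout', 'Area', 'Direction',
                      'Renovation', 'Elevator']


def pad_house_info(fields, width=len(HOUSE_INFO_COLUMNS)):
    """Right-pad a short field list with '' so every row has `width` values.

    Lists that are already `width` long (or longer) are returned unchanged
    apart from whitespace stripping.
    """
    fields = [f.strip() for f in fields]
    return fields + [''] * (width - len(fields))


# Invented sample strings in the 'a|b|c' shape that houseInfo is split on.
complete = pad_house_info('小区A|2室1厅|75平米|南|精装|有电梯'.split('|'))
short = pad_house_info('小区B|3室2厅|98平米|南北|简装'.split('|'))

print(len(complete), len(short))
print(repr(short[-1]))
```

With every row padded to the same width, the Price, Year and PerPrice values always land in the same CSV columns.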

Next, we use an Excel pivot table to take a quick look at the record counts:

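The table shows that 33,225 records were scraped in total. Many values are missing in the Elevator column, and because the scraper wrote a space in place of an empty value, the Year column also has missing entries. With the data in hand, the analysis can begin.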
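
The same count of missing values can be obtained in Python instead of a pivot table. The sketch below uses only the standard library and treats a whitespace-only cell as missing, which matches the space placeholder written for Year. The function name count_missing and the two sample rows are invented for illustration; the column names come from the commented-out header row in write_to_file, and the file is assumed to have no header line.

```python
import csv
import io

COLUMNS = ['Region', 'Garden', 'Layout', 'Area', 'Direction',
           'Renovation', 'Elevator', 'Price', 'Year', 'PerPrice']


def count_missing(csv_file, columns=COLUMNS):
    """Return (row count, {column: number of empty or whitespace-only cells})."""
    missing = {name: 0 for name in columns}
    total = 0
    for row in csv.reader(csv_file):
        total += 1
        for name, value in zip(columns, row):
            if not value.strip():
                missing[name] += 1
    return total, missing


# Two invented rows; the second lacks Elevator and has a space for Year.
sample = io.StringIO(
    'yubei,A,2室1厅,75平米,南,精装,有电梯,100万,2010,13333\n'
    'yubei,B,3室2厅,98平米,南北,简装,,150万, ,15306\n'
)
total, missing = count_missing(sample)
print(total, missing['Elevator'], missing['Year'])
```

For the real output, the StringIO object would be replaced by open('data.csv', newline='') with whatever encoding the file was written in.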

Reference:

[1] https://germey.gitbooks.io/python3webspider/content/
