Scraping Anjuke Data with Python

Reposted article, 2020-05-21 16:55. Tags: web scraping / Anjuke / BeautifulSoup / Selenium

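Scraping data from Anjuke
Data source: https://beijing.anjuke.com/community/
Fields scraped: city, community name, address, completion date, house price, month-over-month change, URL
Method: Python with the Selenium and BeautifulSoup libraries
Development tool: PyCharm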

Complete code:

# BeautifulSoup parses the HTML of each page
from bs4 import BeautifulSoup
# webdriver opens a real browser and drives it the way a person would
from selenium import webdriver
# Service points Selenium 4 at the chromedriver executable
from selenium.webdriver.chrome.service import Service
# pandas builds the DataFrame
import pandas as pd

# Location of chromedriver (raw string, so the backslashes are not read as escapes)
chrome_driver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service=Service(executable_path=chrome_driver))
# Page to scrape (the URL ends in "p" because the page number is appended when paging)
request_url = 'https://beijing.anjuke.com/community/p'

# Clean a string by removing newlines and spaces
def format_str(text):
    return text.replace('\n', '').replace(' ', '')

# Columns of the final table
columns = ['city', 'communityName', 'address', 'completionDate', 'housePrice', 'thanLastMonth', 'networkAddress']
# Rows are collected in a list first, because DataFrame.append was removed in pandas 2.0
rows = []
try:
    # Fetch three pages of data
    for i in range(3):
        # The URL changes from page to page: p1, p2, p3
        url = request_url + str(i + 1)
        # Open the URL in the Chrome instance
        driver.get(url)
        # HTML of the rendered page
        html = driver.page_source
        # Parse the HTML (it is already a str, so from_encoding is not needed)
        soup = BeautifulSoup(html, 'html.parser')

        # City name, written into every row of this page
        city = soup.find('span', class_='tit').em.text.strip()
        # All listing entries; each one holds the details of one community
        house_list = soup.find_all('div', class_='li-itemmod')
        for house in house_list:
            info = house.find('div', class_='li-info')
            # Community name
            communityName = info.h3.a.text.strip()
            # Address
            address = info.address.text.strip()
            # Completion date
            completionDate = info.find('p', class_='date').text.strip()
            # li-side holds two sibling p tags, so fetch both as a list
            price_tags = house.find('div', class_='li-side').find_all('p')
            # Unit price
            housePrice = price_tags[0].text.strip()
            # Month-over-month change
            thanLastMonth = price_tags[1].text.strip()
            # URL of the detail page
            networkAddress = house.a.get('href').strip()

            # One cleaned row per community (the city is the same for the whole page)
            rows.append({
                'city': format_str(city),
                'communityName': format_str(communityName),
                'address': format_str(address),
                'completionDate': format_str(completionDate),
                'housePrice': format_str(housePrice),
                'thanLastMonth': format_str(thanLastMonth),
                'networkAddress': format_str(networkAddress),
            })
finally:
    # Close the browser even if a page fails to load or parse
    driver.quit()

# Build the DataFrame and write it out as CSV
# index=False leaves out the row index; encoding='utf_8_sig' avoids garbled Chinese text
houses = pd.DataFrame(rows, columns=columns)
houses.to_csv('anjukeBeijing.csv', index=False, encoding='utf_8_sig')
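
The per-listing extraction in the loop above can be tried without a browser. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is hypothetical markup that only mimics the class names used in the code above (li-itemmod, li-info, li-side, date), and the real Anjuke pages may be structured differently. The helpers tag_text and parse_house are illustrative names, not part of the original script. The sketch also returns an empty string when a tag is missing, instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used by the scraper.
SAMPLE_HTML = """
<div class="li-itemmod">
  <a href="https://example.com/community/view/1"></a>
  <div class="li-info">
    <h3><a>
      SampleGarden
    </a></h3>
    <address> Chaoyang - ExampleRoad </address>
    <p class="date">2005</p>
  </div>
  <div class="li-side">
    <p><strong>50000</strong> yuan/m2</p>
    <p>+0.5%</p>
  </div>
</div>
<div class="li-itemmod">
  <a href="https://example.com/community/view/2"></a>
  <div class="li-info">
    <h3><a>BareCourt</a></h3>
    <address>Haidian</address>
  </div>
  <div class="li-side"><p>NoPriceYet</p></div>
</div>
"""


def format_str(text):
    """Remove newlines and ASCII spaces, as in the article."""
    return text.replace('\n', '').replace(' ', '')


def tag_text(tag):
    """Return the cleaned text of a tag, or '' when the tag is missing."""
    return format_str(tag.get_text()) if tag is not None else ''


def parse_house(house):
    """Extract one row from a listing entry (li-info is assumed to be present)."""
    info = house.find('div', class_='li-info')
    side = house.find('div', class_='li-side')
    # Collect the sibling p tags first, then index into the list with a length check
    price_tags = side.find_all('p') if side is not None else []
    link = house.find('a')
    return {
        'communityName': tag_text(info.find('h3')),
        'address': tag_text(info.find('address')),
        'completionDate': tag_text(info.find('p', class_='date')),
        'housePrice': tag_text(price_tags[0]) if len(price_tags) > 0 else '',
        'thanLastMonth': tag_text(price_tags[1]) if len(price_tags) > 1 else '',
        'networkAddress': link.get('href', '').strip() if link is not None else '',
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_house(h) for h in soup.find_all('div', class_='li-itemmod')]
for row in rows:
    print(row)
```

The second sample entry has no date and only one p tag under li-side, so its completionDate and thanLastMonth come back empty rather than stopping the run.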

Scraping results:

Notes:

1. chromedriver must sit at the path given to the driver in the code; here that is the Application directory of the Chrome installation.

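2. The page URL changes when paging through the results; incrementing the number at the end of the URL by 1 (p1, p2, p3) moves to the next page.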

3. For string cleanup, removing newlines and ASCII spaces is enough for this data.

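4. For the house price and the month-over-month change, first collect the two sibling p tags under li-side as a list, then pick each value by index; this is less error-prone than trying to select each p tag separately.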

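5. When writing the DataFrame to CSV, pass index=False and encoding='utf_8_sig'. The utf_8_sig encoding writes a UTF-8 byte order mark at the start of the file, which lets programs such as Excel detect the encoding and display Chinese text correctly; index=False leaves out the row index column.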
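
The effect described in note 5 can be checked with the standard library alone. This is a small sketch, not the article's code: it writes the same hypothetical sample row twice with the csv module, once as utf_8 and once as utf_8_sig, and looks at the first three bytes of each file. The file names, the sample row, and the helper names write_csv and starts_with_bom are made up for illustration. pandas hands the encoding name to the same Python codecs, so to_csv is expected to behave the same way.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical sample row with Chinese text
rows = [{'city': '北京', 'communityName': '示例小区'}]


def write_csv(path, encoding):
    """Write the sample rows to path using the given text encoding."""
    with open(path, 'w', newline='', encoding=encoding) as f:
        writer = csv.DictWriter(f, fieldnames=['city', 'communityName'])
        writer.writeheader()
        writer.writerows(rows)


def starts_with_bom(path):
    """Return True when the file begins with the UTF-8 byte order mark."""
    with open(path, 'rb') as f:
        return f.read(3) == codecs.BOM_UTF8


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'plain.csv')
    with_bom = os.path.join(tmp, 'with_bom.csv')
    write_csv(plain, 'utf_8')
    write_csv(with_bom, 'utf_8_sig')
    plain_has_bom = starts_with_bom(plain)
    sig_has_bom = starts_with_bom(with_bom)
    # Reading with utf_8_sig strips the mark again, so the header is clean
    with open(with_bom, encoding='utf_8_sig') as f:
        header = f.readline().strip()

print(plain_has_bom, sig_has_bom, header)
```

Only the utf_8_sig file starts with the mark. Files without it are still valid UTF-8, but a spreadsheet program may guess a different encoding when opening them.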
