Scraping Anjuke Data with Python

Reposted article, 2020-05-21 16:55. Tags: web scraping / Anjuke / BeautifulSoup / Selenium

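Goal: scrape community listings from Anjuke
Data source: https://beijing.anjuke.com/community/
Fields scraped: city, community name, address, completion date, house price, month-over-month change, URL
Approach: Python with the Selenium and BeautifulSoup libraries
Development tool: PyCharm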

Full code:

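# BeautifulSoup parses the HTML of each page
from bs4 import BeautifulSoup
# webdriver launches a real browser and drives it the way a user would
from selenium import webdriver
# Service tells Selenium 4 where the chromedriver executable lives
from selenium.webdriver.chrome.service import Service
# pandas provides the DataFrame that collects the results
import pandas as pd

# Location of chromedriver (raw string, so the backslashes are not treated as escapes)
chrome_driver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service=Service(executable_path=chrome_driver))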
# Base URL of the listing pages (the page number is appended after the trailing "p")
request_url = 'https://beijing.anjuke.com/community/p'

# Clean a scraped string by removing newlines and spaces
def format_str(text):
    return text.replace('\n', '').replace(' ', '')
# Build an empty DataFrame whose columns define the layout of the output table
houses = pd.DataFrame(
    columns=['city', 'communityName', 'address', 'completionDate', 'housePrice', 'thanLastMonth', 'networkAddress'])
# Fetch the first 3 pages
for i in range(3):
    # The page number is part of the URL: p1, p2, p3
    url = request_url + str(i + 1)
    # Open the page in the Chrome instance
    driver.get(url)
    # Grab the rendered HTML of the page
    html = driver.page_source
    # Parse the HTML with BeautifulSoup (html is already a str, so no from_encoding is needed)
    soup = BeautifulSoup(html, 'html.parser')

    # Read the city name, which is written into every row later
    city = soup.find('span', class_='tit').em.text.strip()
    # Find every listing block; each one holds all the fields for a single community
    house_list = soup.find_all('div', class_='li-itemmod')
    # Extract the wanted fields from each listing
    for house in house_list:
        # Empty dict that will hold the fields of this listing
        temp = {}
        # Community name
        communityName = house.find('div', class_='li-info').h3.a.text.strip()
        # Address
        address = house.find('div', class_='li-info').address.text.strip()
        # Completion date
        completionDate = house.find('div', class_='li-info').find('p', class_='date').text.strip()
        # li-side contains two sibling p tags, so collect both and index into the result
        price_tags = house.find('div', class_='li-side').find_all('p')
        # House price (unit price)
        housePrice = price_tags[0].text.strip()
        # Month-over-month change
        thanLastMonth = price_tags[1].text.strip()
        # URL of the listing's detail page
        networkAddress = house.a.get('href').strip()

        # Store the values in the dict (the city is the same for every row on a page, so it was read once per page)
        temp['city'] = format_str(city)
        temp['communityName'] = format_str(communityName)
        temp['address'] = format_str(address)
        temp['completionDate'] = format_str(completionDate)
        temp['housePrice'] = format_str(housePrice)
        temp['thanLastMonth'] = format_str(thanLastMonth)
        temp['networkAddress'] = format_str(networkAddress)
        # Add the finished row to the DataFrame (DataFrame.append was removed in pandas 2.0)
        houses.loc[len(houses)] = [temp[column] for column in houses.columns]

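# Close the browser now that every page has been read
driver.quit()

# Write the DataFrame to a CSV file
# index=False omits the row index; encoding='utf_8_sig' writes a BOM so that Excel displays the Chinese text correctly
houses.to_csv('anjukeBeijing.csv', index=False, encoding='utf_8_sig')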
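
The pure-Python parts of the script above (building the page URLs, cleaning whitespace, and picking the two values out of the li-side paragraphs) can be checked without launching a browser. The following is a minimal sketch, not the author's code: the helper names build_page_urls, clean_text, and split_price_and_change are hypothetical, the sample strings are made up, and clean_text swaps the two replace calls for str.split, which also drops tabs and full-width spaces.

```python
def build_page_urls(base_url, pages):
    """Return the listing URLs for pages 1..pages (base_url ends with 'p')."""
    return [base_url + str(page) for page in range(1, pages + 1)]


def clean_text(text):
    """Remove every whitespace character, including tabs and full-width spaces."""
    return "".join(text.split())


def split_price_and_change(texts):
    """Return (price, change) from the texts of the li-side paragraphs.

    Missing entries become empty strings instead of raising IndexError.
    """
    padded = list(texts[:2]) + ["", ""]
    return clean_text(padded[0]), clean_text(padded[1])


# Same base URL as in the script; three pages as in the original loop
urls = build_page_urls("https://beijing.anjuke.com/community/p", 3)
print(urls)

# Made-up sample strings, used only to exercise the helpers
print(clean_text("  example \n text\t"))
print(split_price_and_change([" price \n", " change "]))
print(split_price_and_change(["price only"]))
```

Padding the list before indexing is one possible way to keep a listing with a missing paragraph from aborting the whole run; whether an empty string is the right fallback depends on how the CSV is used afterwards.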

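Scraping result: the output is written to anjukeBeijing.csv, with one row per community and one column per scraped field.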

Notes:

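1. Here chromedriver.exe sits in the Application directory of the Chrome installation. Any location works as long as the path passed to Selenium points to it, and the chromedriver version has to match the installed Chrome version.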

2. The URL changes from page to page, so paging is done by incrementing the number at the end of the URL by 1.

3. For cleaning the strings, removing newlines and ASCII spaces is sufficient for these fields.

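4. For the house price and the month-over-month change, first collect both p tags into a list and then index into it; this avoids bugs caused by picking up the wrong tag.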

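5. When writing the DataFrame to CSV, pass encoding='utf_8_sig' to avoid garbled Chinese text in the CSV file (it writes a byte order mark that Excel uses to detect UTF-8); index=False leaves out the row index column.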
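
What the utf_8_sig encoding in note 5 does can be seen with the standard library alone. This is a minimal sketch that assumes a made-up sample row and uses the csv module in place of pandas: the encoded output starts with the three-byte UTF-8 byte order mark, while plain UTF-8 output does not.

```python
import codecs
import csv
import io

# Made-up sample rows, used only for illustration
rows = [["city", "communityName"], ["北京", "示例小区"]]

# Write the rows as CSV text first
text_buffer = io.StringIO()
csv.writer(text_buffer).writerows(rows)
csv_text = text_buffer.getvalue()

# Encode the same text without and with the byte order mark
plain_bytes = csv_text.encode("utf-8")
bom_bytes = csv_text.encode("utf_8_sig")

# Only the utf_8_sig output begins with the BOM
print(bom_bytes.startswith(codecs.BOM_UTF8))
print(plain_bytes.startswith(codecs.BOM_UTF8))
# The two outputs differ only by the length of the BOM
print(len(bom_bytes) - len(plain_bytes))
```

Apart from the leading mark the two byte strings are identical, so a reader that decodes with utf_8_sig gets the original text back unchanged.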
