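Data Analysis Based on the Airbnb Dataset
Link: https://pan.baidu.com/s/1Tz0e9WowqGQ6gam4LhWC3g
Extraction code: nqtq
Development environment: PyCharm
Foreword: There are many ways to analyze data, and this article is for reference only. Python does not require a semicolon at the end of a statement; wherever semicolons appear in the code, they are purely a personal habit.
Airbnb Dataset Analysis --- Price Factor Analysis Based on the calendar Dataset
1) Import the required packages and read the calendar file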
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

calendar = pd.read_csv('madrid-airbnb-data/calendar.csv')
print(calendar.head())
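2) Display the data (PyCharm truncates the console output, so not every column is shown; inspect the full table yourself)
3) Convert price and date to suitable data types
# Strip the "$" sign and the thousands separator, then convert to float
calendar['price'] = calendar['price'].str.replace(r'[$,]', "", regex=True).astype(np.float32)
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(r'[$,]', "", regex=True).astype(np.float32)
# Parse the date strings into datetime values
calendar['date'] = pd.to_datetime(calendar['date'])
print(calendar['date'].head())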
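To make the price cleaning step concrete, here is a minimal sketch on a few made-up strings. The sample values are hypothetical and only assume the same "$1,234.00" style as the price column.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values in the same "$1,234.00" style as the price column
raw_prices = pd.Series(["$70.00", "$1,250.00", "$17.00"])

# The character class [$,] matches either a dollar sign or a comma,
# so both are removed before the string is converted to a float
clean_prices = raw_prices.str.replace(r"[$,]", "", regex=True).astype(np.float32)

print(clean_prices.tolist())
```

Without removing the comma first, the conversion of a value such as "1,250.00" to a float would fail.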
4) Add weekday and month attributes; note that weekdays are numbered from 0 (Monday)
calendar['weekday'] = calendar['date'].dt.weekday
calendar['month'] = calendar['date'].dt.month
print(calendar['month'].head())
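The weekday numbering is easy to misread, so the following small sketch checks it on three hand-picked dates (the dates are illustrative and not taken from the dataset).

```python
import pandas as pd

# 2019-09-16 was a Monday, 2019-09-21 a Saturday, 2019-09-22 a Sunday
dates = pd.Series(pd.to_datetime(["2019-09-16", "2019-09-21", "2019-09-22"]))

weekdays = dates.dt.weekday      # Monday is 0, Sunday is 6
months = dates.dt.month          # January is 1, December is 12
day_names = dates.dt.day_name()  # readable names for cross-checking

print(list(zip(day_names, weekdays, months)))
```

So in the weekday bar chart, the bars labelled 5 and 6 correspond to Saturday and Sunday.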
5) Use a bar chart to show the relationship between month and price: group the prices by month and take the mean price of each month
month_price = calendar.groupby('month')['price'].mean()
sns.barplot(x=month_price.index, y=month_price.values)
plt.show()
6) Use a bar chart to show the relationship between day of the week and price
weekday_price = calendar.groupby('weekday')['price'].mean()
sns.barplot(x=weekday_price.index, y=weekday_price.values)
plt.show()
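As a rough illustration of what the two bar charts are fed, this sketch runs the same groupby-and-mean pattern on a tiny invented table. The prices are made up; only the column names follow the code above.

```python
import pandas as pd

# Toy rows: two Mondays (0), two Saturdays (5), one Sunday (6)
toy = pd.DataFrame({
    "weekday": [0, 0, 5, 5, 6],
    "price": [50.0, 70.0, 90.0, 110.0, 80.0],
})

# One mean price per weekday; the index holds the group keys
weekday_price = toy.groupby("weekday")["price"].mean()

print(weekday_price)
```

The index of the result becomes the x-axis of the bar chart and the values become the bar heights.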
7) Plot a histogram of the prices below 300
sns.displot(calendar[calendar['price'] < 300]['price'], kde=True)
plt.show()
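Airbnb Dataset Analysis --- Preprocessing the listings_detailed Listing Data
Note: one entry in this dataset is slightly malformed, which causes a message to be displayed when the code runs. It does not affect the steps below.
1) Inspect the columns of the dataset
listings_detailed = pd.read_csv('madrid-airbnb-data/listings_detailed.csv')
print(listings_detailed.columns.values.tolist())
# There are many columns, so they are not listed here; inspect them yourself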
2) Convert price to a numeric type and add a minimum cost field
listings_detailed['price'] = listings_detailed['price'].str.replace(r'[$,]', "", regex=True).astype(np.float32)
listings_detailed['cleaning_fee'] = listings_detailed['cleaning_fee'].str.replace(r'[$,]', "", regex=True).astype(np.float32)
# Treat a missing cleaning fee as 0
listings_detailed['cleaning_fee'] = listings_detailed['cleaning_fee'].fillna(0)
# Minimum cost: (price + cleaning fee) * minimum number of nights
listings_detailed['minimum_cost'] = (listings_detailed['price'] + listings_detailed['cleaning_fee']) * listings_detailed['minimum_nights']
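The minimum cost arithmetic can be checked by hand on two invented listings. The numbers below are hypothetical; the column names mirror the ones used above.

```python
import pandas as pd

# Two made-up listings; the second one has no cleaning fee recorded
listings = pd.DataFrame({
    "price": [60.0, 100.0],
    "cleaning_fee": [20.0, None],
    "minimum_nights": [2, 3],
})

# A missing cleaning fee counts as 0
listings["cleaning_fee"] = listings["cleaning_fee"].fillna(0)

# Same formula as above: (price + cleaning fee) * minimum nights
listings["minimum_cost"] = (listings["price"] + listings["cleaning_fee"]) * listings["minimum_nights"]

print(listings["minimum_cost"].tolist())
```

This formula applies the cleaning fee to every night. If the fee is charged once per stay, `price * minimum_nights + cleaning_fee` would be the variant to use instead.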
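3) Add an amenity count field, plus one more new column that is explained in the code comment
# Drop the first and last characters (the enclosing brackets), then count the comma-separated items
listings_detailed['n_amenities'] = listings_detailed['amenities'].str[1:-1].str.split(",").apply(len)
# Classify each listing by the number of guests it accommodates:
# Single (1), Couple (2), Family (3-4), Group (5 to 99)
listings_detailed['accommodates_type'] = pd.cut(
    listings_detailed['accommodates'], bins=[1, 2, 3, 5, 100], right=False, include_lowest=True,
    labels=['Single', 'Couple', 'Family', 'Group'])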
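Because `right=False` makes every bin closed on the left and open on the right, the bin edges are easy to get wrong. The sketch below applies the same edges to a few hand-picked guest counts (the counts are illustrative, and the first label is spelled "Single" here).

```python
import pandas as pd

# Hypothetical guest capacities
accommodates = pd.Series([1, 2, 3, 4, 5, 16])

# Bins are [1, 2), [2, 3), [3, 5), [5, 100)
accommodates_type = pd.cut(
    accommodates,
    bins=[1, 2, 3, 5, 100],
    right=False,
    labels=["Single", "Couple", "Family", "Group"],
)

print(accommodates_type.tolist())
```

A capacity of 3 or 4 lands in "Family", and 5 is already a "Group".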
4) Look at the neighbourhood group each listing belongs to and its review score
listings_detailed['neighbourhood_group_cleansed']
listings_detailed['review_scores_rating']
5) Collect the fields needed in the following steps, mainly those related to the listing price
listings_detailed_df = listings_detailed[['id', 'host_id', 'listing_url', 'room_type', 'neighbourhood_group_cleansed', 'price',
                                          'cleaning_fee', 'n_amenities', 'amenities', 'accommodates_type', 'minimum_cost', 'minimum_nights']]
Airbnb Dataset Analysis --- Room Type and Neighbourhood Analysis
1) Room type
# Compare the room types
room_type_counts = listings_detailed_df['room_type'].value_counts()
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# Pie chart on the left
axes[0].pie(room_type_counts.values, autopct='%.2f%%', labels=room_type_counts.index)
# Bar chart on the right
sns.barplot(x=room_type_counts.index, y=room_type_counts.values, ax=axes[1])
# Adjust the layout so that the two charts do not overlap
plt.tight_layout()
plt.show()
2) Neighbourhood analysis
neighbourhood_counts = listings_detailed_df['neighbourhood_group_cleansed'].value_counts()
sns.barplot(y=neighbourhood_counts.index, x=neighbourhood_counts.values, orient='h')
plt.show()
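Airbnb Dataset Analysis --- Room Type versus Neighbourhood
1) Compute the share of each room type within every neighbourhood; look at the data first, then plot
# Share of each room type within a neighbourhood:
# - group by neighbourhood_group_cleansed and room_type and count the rows
# - unstack: pivot room_type from the row index into columns
# - fillna(0): replace NaN with 0
# - divide each row by its sum (Series division is element-wise) to get proportions
# - sort by the 'Entire home/apt' share
neighbour_room_type = listings_detailed_df.groupby(['neighbourhood_group_cleansed', 'room_type']) \
    .size() \
    .unstack('room_type') \
    .fillna(0) \
    .apply(lambda row: row / row.sum(), axis=1) \
    .sort_values('Entire home/apt', ascending=True)
print(neighbour_room_type.head())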
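The chained call is dense, so here is the same pipeline on six invented listings. The neighbourhood names and counts are toy data, not taken from the dataset.

```python
import pandas as pd

# Toy listings: Centro has 3 entire homes and 1 private room, Retiro has 2 private rooms
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Centro", "Centro", "Centro", "Centro", "Retiro", "Retiro"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Private room"],
})

shares = (
    toy.groupby(["neighbourhood_group_cleansed", "room_type"])
    .size()                                       # count per (neighbourhood, room type)
    .unstack("room_type")                         # room types become columns
    .fillna(0)                                    # missing combinations count as 0
    .apply(lambda row: row / row.sum(), axis=1)   # each row now sums to 1
    .sort_values("Entire home/apt", ascending=True)
)

print(shares)
```

Retiro has no entire homes in this toy table, so its share is 0 after `fillna(0)` and it sorts first.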
2) Plot
columns = neighbour_room_type.columns
index = neighbour_room_type.index
plt.figure(figsize=(10, 5))
# `left` is the starting position of each bar segment; it starts at 0 and
# grows by the width of every segment already drawn
left = pd.Series(0.0, index=index)
for column in columns:
    plt.barh(index, neighbour_room_type[column], left=left)
    # Build a new Series instead of using += so the DataFrame is not modified
    left = left + neighbour_room_type[column]
plt.legend(columns)
plt.show()
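The starting positions of the stacked segments can also be computed in one step, without plotting. This is only a sketch on invented proportions: each segment starts where the previous ones end, which is the row-wise cumulative sum minus the segment's own width.

```python
import pandas as pd

# Invented room type shares per neighbourhood; each row sums to 1
shares = pd.DataFrame(
    {
        "Entire home/apt": [0.5, 0.25],
        "Private room": [0.25, 0.5],
        "Shared room": [0.25, 0.25],
    },
    index=["Centro", "Retiro"],
)

# Left edge of every segment: running total up to, but not including, that column
left_offsets = shares.cumsum(axis=1) - shares

print(left_offsets)
```

Each column could then be drawn with `plt.barh(shares.index, shares[column], left=left_offsets[column])`.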
Airbnb Dataset Analysis --- Number of Listings per Host
1) Distribution of the number of listings per host
host_number = listings_detailed_df.groupby('host_id').size()
sns.displot(data=host_number[host_number < 10], kde=True)
plt.show()
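2) Show the share of hosts by number of listings owned, in the groups 1, 2, 3-4 and 5+, i.e. the bins [1,2), [2,3), [3,5) and [5,inf)
# The top edge is np.inf so that hosts with 100 or more listings still count as '5+'
host_number_bins = pd.cut(host_number, bins=[1, 2, 3, 5, np.inf], right=False, include_lowest=True,
                          labels=['1', '2', '3-4', '5+']).value_counts()
plt.pie(host_number_bins, autopct='%.2f%%', labels=host_number_bins.index)
plt.show()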
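The choice of the top bin edge matters for the '5+' group. In this sketch on invented host sizes, a finite top edge of 100 silently drops a host with 150 listings, while an open-ended edge keeps it.

```python
import pandas as pd

# Hypothetical number of listings per host; the last host is a very large one
host_number = pd.Series([1, 1, 1, 2, 3, 4, 7, 150])
labels = ["1", "2", "3-4", "5+"]

# Finite top edge: 150 is outside [5, 100) and becomes NaN
capped = pd.cut(host_number, bins=[1, 2, 3, 5, 100], right=False, labels=labels)
counts_capped = capped.value_counts()

# Open-ended top edge: every host with 5 or more listings is kept
open_ended = pd.cut(host_number, bins=[1, 2, 3, 5, float("inf")], right=False, labels=labels)
counts_open = open_ended.value_counts()

print(counts_capped["5+"], counts_open["5+"])
```

`value_counts()` ignores NaN by default, so the dropped host would not show up in the pie chart at all.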
Airbnb Dataset Analysis --- Number of Reviews over Time
# Load the dataset and parse date as a datetime column
reviews = pd.read_csv('madrid-airbnb-data/reviews_detailed.csv', parse_dates=['date'])
# Add year and month columns
reviews['year'] = reviews['date'].dt.year
reviews['month'] = reviews['date'].dt.month
# Group by year to see how many reviews each year has
n_reviews_year = reviews.groupby('year').size()
sns.barplot(x=n_reviews_year.index, y=n_reviews_year.values)
plt.show()
# Group by month to see how many reviews each calendar month has
n_reviews_month = reviews.groupby('month').size()
sns.barplot(x=n_reviews_month.index, y=n_reviews_month.values)
plt.show()
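A small in-memory CSV shows what the two counts contain. The rows are invented, and the `listing_id` column is only an assumption about the file layout.

```python
import io
import pandas as pd

# Invented review rows spanning two years
csv_text = """listing_id,date
1,2018-12-30
1,2019-01-05
2,2019-01-20
2,2019-02-11
"""

# parse_dates converts the date column while reading
reviews = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
reviews["year"] = reviews["date"].dt.year
reviews["month"] = reviews["date"].dt.month

n_reviews_year = reviews.groupby("year").size()
n_reviews_month = reviews.groupby("month").size()

print(n_reviews_year.to_dict(), n_reviews_month.to_dict())
```

Grouping by month alone pools the same calendar month from all years, so it shows seasonality rather than growth over time.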
Airbnb Dataset Analysis --- Number of Reviews by Year and Month Combined
# Rows are years, columns are months, values are review counts
year_month_reviews = reviews.groupby(['year', 'month']).size().unstack('month').fillna(0)
# Draw one (month, review count) line per year
fig, ax = plt.subplots(figsize=(20, 10))
for index in year_month_reviews.index:
    series = year_month_reviews.loc[index]
    sns.lineplot(x=series.index, y=series.values, ax=ax)
ax.legend(labels=year_month_reviews.index)
ax.grid()
# Show every month on the x-axis
_ = ax.set_xticks(list(range(1, 13)))
plt.show()
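The year-by-month table is easier to picture on a handful of rows. In this sketch on invented data, the optional `reindex` step is an addition for illustration: it guarantees all twelve month columns even when some months never occur.

```python
import pandas as pd

# Invented (year, month) pairs, one row per review
reviews = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2019],
    "month": [11, 12, 1, 1, 12],
})

# Rows are years, columns are months that occur at least once
year_month_reviews = reviews.groupby(["year", "month"]).size().unstack("month").fillna(0)

# Optional: force columns 1..12 so every year has a value for every month
full = year_month_reviews.reindex(columns=range(1, 13), fill_value=0)

print(year_month_reviews)
print(full.shape)
```

Without the `reindex`, a month with no reviews in any year has no column, so the plotted line would simply skip it.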
Airbnb Dataset Analysis --- Listing Price Analysis
1) Select the factors related to price, store them, and convert them to a suitable format
# Use the listings dataset to predict listing prices
# Select the fields related to price
from sklearn.preprocessing import StandardScaler

ml_listings = listings_detailed[listings_detailed['price'] < 300][[
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_group_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'cleaning_fee',
    'minimum_nights',
    'maximum_nights',
    'availability_90',
    'number_of_reviews',
    # 'review_scores_rating',
    'is_business_travel_ready',
    'n_amenities',
    'price'
]]
# Drop the rows that contain missing values
ml_listings = ml_listings.dropna(axis=0)
# Separate the features from the target
features = ml_listings.drop(columns=['price'])
targets = ml_listings['price']
# One-hot encode the categorical columns so that all features are numeric
disperse_columns = [
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_group_cleansed',
    'property_type',
    'room_type',
    'is_business_travel_ready'
]
disperse_features = features[disperse_columns]
disperse_features = pd.get_dummies(disperse_features)
# Standardize the continuous columns (their scales do not differ much, so this has little effect on the result)
continuouse_features = features.drop(columns=disperse_columns)
scaler = StandardScaler()
continuouse_features = scaler.fit_transform(continuouse_features)
# Combine the encoded and the standardized features
feature_array = np.hstack([disperse_features, continuouse_features])
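To show what the combined feature matrix looks like, here is a sketch on three invented listings. To keep it dependency-free, the standardization is written out with NumPy instead of `StandardScaler`; it follows that class's default behaviour (subtract the column mean, divide by the population standard deviation). The toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Three made-up listings with one categorical and two continuous features
features = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "accommodates": [4, 2, 1],
    "bedrooms": [2.0, 1.0, 1.0],
})
disperse_columns = ["room_type"]

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(features[disperse_columns]).astype(float)

# Standardization: zero mean and unit variance per column
continuous = features.drop(columns=disperse_columns).to_numpy(dtype=float)
standardized = (continuous - continuous.mean(axis=0)) / continuous.std(axis=0)

# Put the encoded and the standardized columns side by side
feature_array = np.hstack([one_hot.to_numpy(), standardized])

print(one_hot.columns.tolist())
print(feature_array.shape)
```

One categorical column with two categories turns into two columns, so the three original features become four numeric columns.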
2) Train a random forest and check the mean absolute error and the R2 score
from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
# R2 score: the closer to 1, the better. A random forest regressor is used here to get a better fit and a higher R2 score
from sklearn.ensemble import RandomForestRegressor

# Split into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(feature_array, targets, test_size=0.25)
regression = RandomForestRegressor()
# Fit on the training set and predict on the test set
regression.fit(X_train, y_train)
y_predict = regression.predict(X_test)
# Print the mean absolute error and the R2 score
print("Mean absolute error:", mean_absolute_error(y_test, y_predict))
print("R2 score:", r2_score(y_test, y_predict))
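For readers unfamiliar with the two metrics, the sketch below computes them by hand with NumPy on four invented prices. The formulas are the standard definitions that `mean_absolute_error` and `r2_score` implement; the numbers are made up.

```python
import numpy as np

# Invented true prices and predictions
y_test = np.array([50.0, 80.0, 100.0, 170.0])
y_predict = np.array([60.0, 70.0, 110.0, 160.0])

# Mean absolute error: average size of the prediction error
mae = np.mean(np.abs(y_test - y_predict))

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_test - y_predict) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, round(r2, 4))
```

Every prediction here is off by 10, so the mean absolute error is 10, and R2 is 1 - 400/7800, roughly 0.95. A model that always predicted the mean price would score 0.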
With linear regression, both the mean absolute error and the R2 score come out worse than with the random forest.
Airbnb Dataset Analysis --- Review Count Prediction and Visualization
# Predict the number of reviews
ym_reviews = reviews.groupby(['year', 'month']).size().reset_index().rename(columns={0: 'count'})
# Features and target
features = ym_reviews[['year', 'month']]
targets = ym_reviews['count']
# Split into training and test sets to check how well the model generalizes
# X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.3)
# regression = RandomForestRegressor(n_estimators=100)
# regression.fit(X_train, y_train)
# y_predict = regression.predict(X_test)
# print("Mean absolute error:", mean_absolute_error(y_test, y_predict))
# print("R2 score:", r2_score(y_test, y_predict))
regression = RandomForestRegressor(n_estimators=100)
regression.fit(features, targets)
# Predict the last three months of 2019
future = pd.DataFrame({'year': [2019, 2019, 2019], 'month': [10, 11, 12]})
y_predict = regression.predict(future)
# Visualize the prediction together with the observed counts
predict_reviews = pd.DataFrame([[2019, 10 + index, x] for index, x in enumerate(y_predict)], columns=['year', 'month', 'count'])
final_reviews = pd.concat([ym_reviews, predict_reviews]).reset_index()
years = final_reviews['year'].unique()
fig, ax = plt.subplots(figsize=(10, 5))
for year in years:
    df = final_reviews[final_reviews['year'] == year]
    sns.lineplot(x='month', y='count', data=df, ax=ax)
ax.legend(labels=list(years))
ax.grid()
_ = ax.set_xticks(list(range(1, 13)))
plt.show()