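Data Analysis Based on the Airbnb Dataset
Link: https://pan.baidu.com/s/1Tz0e9WowqGQ6gam4LhWC3g
Extraction code: nqtq
Development environment: PyCharm
Preface: Data can be analyzed in many different ways, so this article is for reference only. Python does not require semicolons at the end of statements; using them here is purely a personal habit.
Airbnb dataset analysis --- price factor analysis based on the calendar dataset
1) Import the required packages and read the calendar file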
import pandas as pd;
import numpy as np;
import matplotlib.pyplot as plt;
import seaborn as sns;
calendar = pd.read_csv('madrid-airbnb-data/calendar.csv');
print(calendar.head())
2) Display the data (PyCharm does not show every column in full; inspect them yourself if needed)
3) Convert the data types of price and date
calendar['price'] = calendar['price'].str.replace(r'[$,]', "", regex = True).astype(np.float32);
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(r'[$,]', "", regex = True).astype(np.float32);
calendar['date'] = pd.to_datetime(calendar['date']);
print(calendar['date'].head())
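To see what the conversion does, here is a minimal sketch on two made-up rows. The sample values are assumptions for illustration, not rows taken from calendar.csv.

```python
import pandas as pd

# Hypothetical rows in the same raw text format as the price and date columns
sample = pd.DataFrame({
    "date": ["2019-09-19", "2019-09-20"],
    "price": ["$70.00", "$1,250.00"],
})

# Remove the dollar sign and the thousands separator, then convert to float
sample["price"] = sample["price"].str.replace(r"[$,]", "", regex=True).astype("float32")
# Parse the date strings into datetime values
sample["date"] = pd.to_datetime(sample["date"])

print(sample.dtypes)
print(sample)
```

Without removing the comma first, a value such as "$1,250.00" could not be converted to a number.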
4) Add weekday and month attributes; note that weekdays are numbered from 0 (Monday)
calendar['weekday'] = calendar['date'].dt.weekday;
calendar['month'] = calendar['date'].dt.month;
print(calendar['month'].head())
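The numbering can be checked on two known dates. This is a small sketch with dates chosen for illustration: 2019-09-16 was a Monday and 2019-09-22 was a Sunday.

```python
import pandas as pd

# Hypothetical dates: a Monday and the following Sunday
dates = pd.Series(pd.to_datetime(["2019-09-16", "2019-09-22"]))

weekday = dates.dt.weekday  # Monday = 0, ..., Sunday = 6
month = dates.dt.month      # January = 1, ..., December = 12

print(weekday.tolist())
print(month.tolist())
```

Months, unlike weekdays, are numbered from 1.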
5) Use a bar chart to show the relationship between month and price: group the prices by month and compute the mean price for each month
month_price = calendar.groupby('month')['price'].mean();
sns.barplot(x = month_price.index, y = month_price.values);
plt.show()
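The grouping step can be tried on a tiny table first. The numbers below are invented for illustration; only the column names follow the text.

```python
import pandas as pd

# Hypothetical prices for two months
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "price": [100.0, 120.0, 90.0, 60.0, 30.0],
})

# One mean price per month; the index of the result holds the month numbers
month_price = toy.groupby("month")["price"].mean()
print(month_price)
```

The result is a Series, which is why its index and values can be passed to the bar chart as x and y.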
6) Use a bar chart to show the relationship between day of the week and price
weekday_price = calendar.groupby('weekday')['price'].mean();
sns.barplot(x = weekday_price.index, y = weekday_price.values);
plt.show();
7) View the histogram of prices below 300
sns.displot(calendar[calendar['price'] < 300]['price'], kde = True);
plt.show();
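Airbnb dataset analysis --- preprocessing the listings_detailed listing data
Note: one spot in this dataset is slightly problematic, so a message is displayed when the file is read; it does not affect the steps below.
1) Inspect the columns of the dataset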
listings_detailed = pd.read_csv('madrid-airbnb-data/listings_detailed.csv');
print(listings_detailed.columns.values.tolist());
# There are many columns, so they are not shown here; inspect them yourself
2) Convert the type of price and add a minimum-cost field
listings_detailed['price'] = listings_detailed['price'].str.replace(r'[$,]', "", regex = True).astype(np.float32);
listings_detailed['cleaning_fee'] = listings_detailed['cleaning_fee'].str.replace(r'[$,]', "", regex = True).astype(np.float32);
# Treat a missing cleaning fee as 0; assigning the result back avoids an in-place fillna on a column selection
listings_detailed['cleaning_fee'] = listings_detailed['cleaning_fee'].fillna(0);
# How the minimum cost is calculated
listings_detailed['minimum_cost'] = (listings_detailed['price'] + listings_detailed['cleaning_fee']) * listings_detailed['minimum_nights'];
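As a quick check of the formula above, the following sketch applies it to two invented listings, one of which has no cleaning fee recorded. The values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical listings; the second has a missing cleaning fee
toy = pd.DataFrame({
    "price": [50.0, 80.0],
    "cleaning_fee": [20.0, None],
    "minimum_nights": [2, 3],
})

# A missing fee counts as 0, otherwise the sum below would be NaN
toy["cleaning_fee"] = toy["cleaning_fee"].fillna(0)
# Same formula as in the text: (price + cleaning fee) * minimum nights
toy["minimum_cost"] = (toy["price"] + toy["cleaning_fee"]) * toy["minimum_nights"]
print(toy)
```

The first listing comes to (50 + 20) * 2 = 140 and the second to (80 + 0) * 3 = 240.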
3) Add an amenity-count field, then add one more new column, which is explained in the code comment
listings_detailed['n_amenities'] = listings_detailed['amenities'].str[1:-1].str.split(",").apply(len);
# Based on how many guests a listing accommodates, add a new column for its type:
# Single (1), Couple (2), Family (3-4), Group (5 or more)
listings_detailed['accommodates_type'] = pd.cut(listings_detailed['accommodates'], bins = [1,2,3,5,100], right=False, include_lowest=True,
                                                labels=['Single', 'Couple', 'Family', 'Group']);
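With right=False each bin includes its left edge and excludes its right edge, so the bins are [1,2), [2,3), [3,5) and [5,100). A minimal sketch with made-up guest counts and hypothetical labels shows where the boundary values land:

```python
import pandas as pd

# Hypothetical guest counts, including the bin edges 2, 3 and 5
accommodates = pd.Series([1, 2, 3, 4, 5, 12])

# Labels here are illustrative names for the four half-open bins
accommodates_type = pd.cut(
    accommodates,
    bins=[1, 2, 3, 5, 100],
    right=False,
    labels=["1 guest", "2 guests", "3-4 guests", "5+ guests"],
)
print(accommodates_type.tolist())
```

A listing for exactly 5 guests falls into the last bin, not the third, because 5 is the excluded right edge of [3,5).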
4) View which neighbourhood group each listing belongs to, and its review score
print(listings_detailed['neighbourhood_group_cleansed']);
print(listings_detailed['review_scores_rating'])
5) Collect the fields needed in the following steps, mainly those related to the listing price
listings_detailed_df = listings_detailed[['id','host_id', 'listing_url', 'room_type', 'neighbourhood_group_cleansed','price',
'cleaning_fee', 'n_amenities', 'amenities','accommodates_type', 'minimum_cost', 'minimum_nights']]
Airbnb dataset analysis --- room type and neighbourhood analysis
1) Room type
# Compare room types
room_type_counts = listings_detailed_df['room_type'].value_counts();
fig, axes = plt.subplots(1,2, figsize= (10,5));
# Pie chart
axes[0].pie(room_type_counts.values, autopct = '%.2f%%', labels = room_type_counts.index);
# Bar chart, drawn explicitly on the second subplot
sns.barplot(x = room_type_counts.index, y = room_type_counts.values, ax = axes[1]);
# Adjust the layout so the two charts do not overlap
plt.tight_layout();
plt.show();
2) Neighbourhood analysis
neighbourhood_counts = listings_detailed_df['neighbourhood_group_cleansed'].value_counts();
sns.barplot(y = neighbourhood_counts.index, x = neighbourhood_counts.values, orient = 'h');
plt.show();
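Airbnb dataset analysis --- comparing room types across neighbourhoods
1) Share of each room type within each neighbourhood: inspect the data first, then plot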
# Share of each room type within each neighbourhood group
# Group by neighbourhood_group_cleansed and room_type
# unstack moves room_type out of the index into columns
# fillna(0) replaces NaN with 0
# Compute proportions: row is a Series, and dividing a Series is applied to each value separately
# Sort by Entire home/apt
neighbour_room_type = listings_detailed_df.groupby(['neighbourhood_group_cleansed', 'room_type']) \
    .size() \
    .unstack('room_type') \
    .fillna(0) \
    .apply(lambda row: row / row.sum(), axis=1) \
    .sort_values('Entire home/apt', ascending=True);
print(neighbour_room_type.head())
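The chain is easier to follow on a small table. The sketch below uses six invented listings, and swaps the row-wise lambda for DataFrame.div, which should give the same proportions in vectorized form. The district names are only sample values.

```python
import pandas as pd

# Hypothetical listings: three entire homes and one private room in one group,
# two private rooms in the other
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Centro"] * 4 + ["Retiro"] * 2,
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 3,
})

# Count listings per (group, room type); missing combinations become 0
counts = (
    toy.groupby(["neighbourhood_group_cleansed", "room_type"])
    .size()
    .unstack("room_type")
    .fillna(0)
)
# Divide every row by its own total so that each row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
shares = shares.sort_values("Entire home/apt", ascending=True)
print(shares)
```

The group without any entire homes gets a share of 0 for that column thanks to fillna(0), and therefore sorts first.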
2) Plot
# left sets the starting position of each bar segment
columns = neighbour_room_type.columns;
index = neighbour_room_type.index;
plt.figure(figsize=(10,5))
plt.barh(index, neighbour_room_type[columns[0]]);
# Copy first so that the in-place += below does not modify neighbour_room_type
left = neighbour_room_type[columns[0]].copy();
plt.barh(index, neighbour_room_type[columns[1]], left = left);
left += neighbour_room_type[columns[1]];
plt.barh(index, neighbour_room_type[columns[2]], left = left);
left += neighbour_room_type[columns[2]];
plt.barh(index, neighbour_room_type[columns[3]], left = left);
plt.legend(columns);
plt.show();
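The running left value is simply the sum of all segments already drawn in that row. As a sketch on an invented share table, the starting positions for every column can be computed in one step as the cumulative sum minus the segment itself:

```python
import pandas as pd

# Hypothetical shares; each row sums to 1
shares = pd.DataFrame(
    {
        "Entire home/apt": [0.5, 0.75],
        "Private room": [0.25, 0.25],
        "Shared room": [0.25, 0.0],
    },
    index=["Retiro", "Centro"],
)

# Start of each segment = total width of the segments to its left
left = shares.cumsum(axis=1) - shares
print(left)
```

Each column of this table could then be passed as the left argument of plt.barh for the matching column of shares, which also works when the number of room types is not exactly four.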
Airbnb dataset analysis --- number of listings per host
1) Number of listings per host
host_number = listings_detailed_df.groupby('host_id').size();
sns.displot(data = host_number[host_number < 10], kde = True);
plt.show();
2) View the proportion of hosts by number of listings: 1, 2, 3-4, 5+, i.e. the bins [1,2), [2,3), [3,5), [5,100)
# Hosts with 100 or more listings fall outside the last bin and are not counted
host_number_bins = pd.cut(host_number,bins=[1,2,3,5,100],right=False, include_lowest=True, labels=['1', '2', '3-4', '5+']).value_counts();
plt.pie(host_number_bins,autopct='%.2f%%', labels=host_number_bins.index);
plt.show();
Airbnb dataset analysis --- review count over time
# Load the dataset and parse date as a datetime type
reviews = pd.read_csv('madrid-airbnb-data/reviews_detailed.csv', parse_dates=['date']);
# Add year and month
reviews['year'] = reviews['date'].dt.year;
reviews['month'] = reviews['date'].dt.month;
# Group by year and by month to see how many reviews each year/month has
n_reviews_year = reviews.groupby('year').size();
sns.barplot(x = n_reviews_year.index, y = n_reviews_year.values);
plt.show();
n_reviews_month = reviews.groupby('month').size();
sns.barplot(x = n_reviews_month.index, y = n_reviews_month.values);
plt.show();
Airbnb dataset analysis --- combined analysis of review count and time
year_month_reviews = reviews.groupby(['year', 'month']).size().unstack('month').fillna(0);
# Draw one (month vs. review count) line per year
fig, ax = plt.subplots(figsize=(20,10));
for index in year_month_reviews.index:
    series = year_month_reviews.loc[index];
    sns.lineplot(x = series.index, y = series.values, ax = ax);
ax.legend(labels = year_month_reviews.index);
ax.grid();
# Show every month on the x-axis
_ = ax.set_xticks(list(range(1,13)))
plt.show();
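The shape of year_month_reviews is easier to picture on a few rows. This sketch uses four invented review dates and builds the same year-by-month table:

```python
import pandas as pd

# Hypothetical review dates
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-03-02", "2019-01-11"]),
})
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

# Rows are years, columns are months, cells are review counts (0 where there were none)
table = toy.groupby(["year", "month"]).size().unstack("month").fillna(0)
print(table)
```

Each row of this table is one year, which is why the loop above takes one row at a time and draws it as a separate line.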
Airbnb dataset analysis --- listing price analysis
1) Pick out the price-related factors, store them, and convert their formats as needed
# Use the listings dataset to predict listing prices
# Extract the price-related fields
from sklearn.preprocessing import StandardScaler;
ml_listings = listings_detailed[listings_detailed['price'] < 300][[
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_group_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'cleaning_fee',
    'minimum_nights',
    'maximum_nights',
    'availability_90',
    'number_of_reviews',
    # 'review_scores_rating',
    'is_business_travel_ready',
    'n_amenities',
    'price'
]]
# Drop rows with missing values (assigning back avoids an in-place dropna on a slice)
ml_listings = ml_listings.dropna(axis=0);
# Split into features and target
features = ml_listings.drop(columns=['price']);
targets = ml_listings['price'];
# One-hot encode the discrete values so that all features are numeric before predicting the target
disperse_columns = [
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_group_cleansed',
    'property_type',
    'room_type',
    'is_business_travel_ready'
]
disperse_features = features[disperse_columns];
disperse_features = pd.get_dummies(disperse_features);
# Standardize the continuous values; their ranges do not differ much, so this has little effect on the result
continuouse_features = features.drop(columns = disperse_columns);
scaler = StandardScaler();
continuouse_features = scaler.fit_transform(continuouse_features);
# Combine the features
feature_array = np.hstack([disperse_features, continuouse_features]);
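To make the two transformations concrete, here is a small sketch on three invented listings. It standardizes by hand with NumPy instead of StandardScaler; with default settings StandardScaler also subtracts the column mean and divides by the population standard deviation, so the result should match.

```python
import numpy as np
import pandas as pd

# Hypothetical listings with one discrete and one continuous feature
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "accommodates": [1.0, 4.0, 2.0],
})

# One-hot encoding: one 0/1 column per category
discrete = pd.get_dummies(toy[["room_type"]]).astype(float)

# Standardization: zero mean and unit (population) standard deviation per column
continuous = toy[["accommodates"]].to_numpy()
continuous = (continuous - continuous.mean(axis=0)) / continuous.std(axis=0)

# Place the encoded and the standardized columns side by side
feature_array = np.hstack([discrete.to_numpy(), continuous])
print(discrete.columns.tolist())
print(feature_array.shape)
```

The two room types become two columns, so the combined array has three columns for the three listings.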
2) Use a random forest and check the mean absolute error and the R² score
from sklearn.model_selection import train_test_split;
# from sklearn.linear_model import LinearRegression;
from sklearn.metrics import mean_absolute_error, r2_score;
# R² score: the closer to 1 the better; to raise it and get a better fit, use random forest regression
from sklearn.ensemble import RandomForestRegressor;
# Split into training and test sets
X_train, X_test,y_train,y_test = train_test_split(feature_array, targets,test_size=0.25);
regression = RandomForestRegressor();
# Fit the model and predict on the test set
regression.fit(X_train, y_train);
y_predict = regression.predict(X_test);
# Check the mean absolute error and the R² score
print("Mean absolute error:",mean_absolute_error(y_test, y_predict));
print("R2 score:" , r2_score(y_test, y_predict));
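For reference, both metrics can be computed by hand. The sketch below uses three invented prices and predictions, and follows the usual definitions: the mean of the absolute errors, and 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true prices and predictions
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 200.0])

# Mean absolute error: average size of the prediction error
mae = np.mean(np.abs(y_true - y_pred))

# R² score: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

A model that always predicts the mean price gets an R² of 0, and a perfect model gets 1, which is why values closer to 1 indicate a better fit.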
With linear regression, both the mean absolute error and the R² score are worse than with the random forest.
Airbnb dataset analysis --- review count prediction and visualization
# Predict the number of reviews
ym_reviews = reviews.groupby(['year', 'month']).size().reset_index().rename(columns={0:'count'});
# Get the features and the target
features = ym_reviews[['year', 'month']];
targets = ym_reviews['count'];
# Split into training and test sets to check how well the model generalizes
# X_train, X_test,y_train,y_test = train_test_split(features, targets,test_size=0.3);
# regression = RandomForestRegressor(n_estimators=100);
# regression.fit(X_train, y_train);
# y_predict = regression.predict(X_test);
# print("Mean absolute error:",mean_absolute_error(y_test, y_predict));
# print("R2 score:" , r2_score(y_test, y_predict));
regression = RandomForestRegressor(n_estimators=100);
regression.fit(features,targets);
# Predicted results
y_predict = regression.predict([
    [2019,10],
    [2019,11],
    [2019,12],
])
# Visualize the prediction
predict_reviews = pd.DataFrame([[2019, 10 + index, x] for index, x in enumerate(y_predict)], columns=['year', 'month', 'count']);
final_reviews = pd.concat([ym_reviews, predict_reviews]).reset_index();
years = final_reviews['year'].unique();
fig, ax = plt.subplots(figsize=(10,5));
for year in years:
    df = final_reviews[final_reviews['year'] == year];
    sns.lineplot(x = 'month', y = 'count', data = df, ax = ax);
# Label the lines with the years that were actually plotted
ax.legend(labels = years);
ax.grid();
_ = ax.set_xticks(list(range(1,13)))
plt.show();