Data Analysis Based on the Airbnb Dataset



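Link: https://pan.baidu.com/s/1Tz0e9WowqGQ6gam4LhWC3g

Extraction code: nqtq

Development environment: PyCharm

A note up front: there are many ways to analyze a dataset, and this article is for reference only. Semicolons are optional in Python; the trailing semicolons in the code below are purely a personal habit.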

Airbnb dataset analysis --- price factor analysis based on the calendar dataset

1) Import the required packages and read the calendar file

import pandas as pd;
import numpy as np;
import matplotlib.pyplot as plt;
import seaborn as sns;
calendar = pd.read_csv('madrid-airbnb-data/calendar.csv');
print(calendar.head())

2) Display the data (because this was run in PyCharm, not all columns are shown in full; inspect them yourself)

3) Convert the data types of price and date

calendar['price'] = calendar['price'].str.replace(r'[$,]', "", regex = True).astype(np.float32);
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(r'[$,]', "", regex = True).astype(np.float32);

calendar['date'] = pd.to_datetime(calendar['date']);
print(calendar['date'].head())
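
As a quick check of the cleaning step, here is a minimal sketch on a hand-made sample. The three rows are invented for illustration and only mimic the `$1,250.00` style of the raw price column; `raw` is a hypothetical name.

```python
import pandas as pd

# Hypothetical sample that mimics the raw calendar.csv price format
raw = pd.DataFrame({
    "date": ["2019-09-19", "2019-09-20", "2019-09-21"],
    "price": ["$70.00", "$1,250.00", "$35.50"],
})

# Remove the dollar sign and thousands separator, then convert to float
raw["price"] = raw["price"].str.replace(r"[$,]", "", regex=True).astype("float64")
# Parse the date strings into datetime values
raw["date"] = pd.to_datetime(raw["date"])

print(raw.dtypes)
print(raw["price"].tolist())
```

Inside a character class, `$` is a literal dollar sign, so the pattern `[$,]` needs no escaping.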

4) Add weekday and month attributes; note that weekdays are numbered from 0 (Monday = 0)

calendar['weekday'] = calendar['date'].dt.weekday;
calendar['month'] = calendar['date'].dt.month;
print(calendar['month'].head())
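
To make the weekday numbering concrete, here is a small sketch on three hand-picked dates (chosen only for illustration). It pairs each `dt.weekday` number with the name returned by `dt.day_name()`.

```python
import pandas as pd

# Three illustrative dates: a Monday, a Saturday and a Sunday
dates = pd.Series(pd.to_datetime(["2019-09-16", "2019-09-21", "2019-09-22"]))

weekday_numbers = dates.dt.weekday.tolist()   # Monday is 0, Sunday is 6
weekday_names = dates.dt.day_name().tolist()  # English day names by default

print(list(zip(weekday_numbers, weekday_names)))
```

When reading the weekday bar chart later on, bars 5 and 6 are therefore Saturday and Sunday.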

5) Use a bar chart to show the relationship between month and price: group the prices by month and take the mean price of each month

month_price = calendar.groupby('month')['price'].mean();
sns.barplot(x = month_price.index, y = month_price.values);
plt.show()

6) Use a bar chart to show the relationship between day of the week and price

weekday_price = calendar.groupby('weekday')['price'].mean();
sns.barplot(x = weekday_price.index, y = weekday_price.values);
plt.show();

7) Plot a histogram of prices below 300

sns.displot(calendar[calendar['price'] < 300]['price'], kde = True);
plt.show();

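Airbnb dataset analysis --- preprocessing the listings_detailed data

Note: one part of this dataset is slightly problematic, so a message is displayed when the file is read; it does not affect the steps below.

1) Inspect the columns of the dataset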
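
The message itself is not reproduced in this post. If it is pandas' `DtypeWarning` about a column with mixed types (an assumption), one common way to avoid it is to declare that column's type when reading. The sketch below uses an in-memory CSV; the column name `zipcode` and its values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; "zipcode" stands in for a column that mixes
# numeric-looking and text values (an assumption, not taken from the dataset)
csv_text = "id,zipcode\n1,28001\n2,28-002\n"

# Declaring the column as str makes pandas skip type inference for it
sample = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(sample["zipcode"].tolist())
```

Passing `low_memory=False` to `pd.read_csv` is another option, at the cost of reading the whole file in one pass.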

listings_detailed = pd.read_csv('madrid-airbnb-data/listings_detailed.csv');     
print(listings_detailed.columns.values.tolist());  
# There are many columns, so they are not shown here; inspect them yourself

2) Convert price to a numeric type and add a minimum-cost field

listings_detailed['price'] = listings_detailed['price'].str.replace(r'[$,]', "", regex = True).astype(np.float32);                  
listings_detailed['cleaning_fee'] = listings_detailed['cleaning_fee'].str.replace(r'[$,]', "", regex = True).astype(np.float32);    
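# Treat a missing cleaning fee as 0; assigning the result back avoids in-place modification of a column selection
listings_detailed['cleaning_fee'] = listings_detailed['cleaning_fee'].fillna(0);

# Minimum cost = (nightly price + cleaning fee) * minimum number of nights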
listings_detailed['minimum_cost'] = (listings_detailed['price'] + listings_detailed['cleaning_fee']) * listings_detailed['minimum_nights'];  
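
A worked example may help with the minimum-cost formula. The two rows below are made up. The first new column follows the formula used in this article, which counts the cleaning fee for every night; the second column is an alternative under the assumption that the fee is charged once per stay. `minimum_cost_once` is a hypothetical name.

```python
import pandas as pd

# Made-up listings for illustration
listing = pd.DataFrame({
    "price": [50.0, 80.0],
    "cleaning_fee": [20.0, 0.0],
    "minimum_nights": [2, 3],
})

# Formula used in this article: the cleaning fee is counted for every night
listing["minimum_cost"] = (listing["price"] + listing["cleaning_fee"]) * listing["minimum_nights"]

# Alternative (assumption): the cleaning fee is charged once per stay
listing["minimum_cost_once"] = listing["price"] * listing["minimum_nights"] + listing["cleaning_fee"]

print(listing)
```

The two definitions agree whenever the cleaning fee is 0 or the minimum stay is one night.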

3) Add an amenities-count field, then add one more column, which is explained in the code comments

listings_detailed['n_amenities'] = listings_detailed['amenities'].str[1:-1].str.split(",").apply(len);
# Add a column that classifies each listing by the number of guests it accommodates:
# Single (1), Couple (2), Family (3-4), Group (5 or more)
listings_detailed['accommodates_type'] = pd.cut(listings_detailed['accommodates'], bins = [1,2,3,5,100], right=False, include_lowest=True,
labels=['Single', 'Couple', 'Family', 'Group']);
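
The sketch below runs both derived columns on four invented rows, to show how `right=False` makes each bin include its left edge and exclude its right edge. The sample values are assumptions for illustration.

```python
import pandas as pd

# Invented rows; the amenities strings mimic the "{a,b,c}" format
sample = pd.DataFrame({
    "amenities": ["{TV,Wifi,Kitchen}", "{Wifi}", "{TV,Wifi}", "{}"],
    "accommodates": [1, 2, 4, 6],
})

# Strip the braces, split on commas and count the pieces
sample["n_amenities"] = sample["amenities"].str[1:-1].str.split(",").apply(len)

# Bins are [1,2), [2,3), [3,5), [5,100) because right=False
sample["accommodates_type"] = pd.cut(
    sample["accommodates"],
    bins=[1, 2, 3, 5, 100],
    right=False,
    labels=["Single", "Couple", "Family", "Group"],
)

print(sample)
```

Note that an empty `{}` still counts as 1, because splitting an empty string yields a list with one empty element.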

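4) See which neighbourhood group each listing belongs to, and its review score

print(listings_detailed['neighbourhood_group_cleansed'].head());
print(listings_detailed['review_scores_rating'].head())

5) Collect the fields needed in the following steps, mainly those related to the listing price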

listings_detailed_df = listings_detailed[['id','host_id', 'listing_url', 'room_type', 'neighbourhood_group_cleansed','price',
'cleaning_fee', 'n_amenities', 'amenities','accommodates_type', 'minimum_cost', 'minimum_nights']]

Airbnb dataset analysis --- room type and neighbourhood analysis

1) Room type

# Compare room types
room_type_counts = listings_detailed_df['room_type'].value_counts();
fig, axes = plt.subplots(1,2, figsize= (10,5));
# Pie chart on the left axes
axes[0].pie(room_type_counts.values, autopct = '%.2f%%', labels = room_type_counts.index);
# Bar chart on the right axes
sns.barplot(x = room_type_counts.index, y = room_type_counts.values, ax = axes[1]);
# Adjust the layout so the two charts do not overlap
plt.tight_layout();
plt.show();

2) Neighbourhood analysis

neighbourhood_counts = listings_detailed_df['neighbourhood_group_cleansed'].value_counts();
sns.barplot(y = neighbourhood_counts.index, x = neighbourhood_counts.values, orient = 'h');
plt.show();

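Airbnb dataset analysis --- room type versus neighbourhood comparison

1) Share of each room type within each neighbourhood group: inspect the data first, then plot

# Share of each room type within each neighbourhood group
# Group by neighbourhood_group_cleansed and room_type
# unstack pivots the room_type level into columns
# fillna(0) replaces NaN with 0
# Compute proportions: row is a Series, and dividing a Series by a scalar divides each value
# Sort by the Entire home/apt share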
neighbour_room_type = listings_detailed_df.groupby(['neighbourhood_group_cleansed', 'room_type']) \
    .size() \
    .unstack('room_type') \
    .fillna(0) \
    .apply(lambda row: row / row.sum(), axis=1) \
    .sort_values('Entire home/apt', ascending=True);

print(neighbour_room_type.head())
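
The same row-normalized table can also be sketched with `pd.crosstab` and `normalize="index"`. The six rows below are made up for illustration; this is an alternative formulation, not the code used above.

```python
import pandas as pd

# Made-up listings: 4 in one neighbourhood group, 2 in another
sample = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Centro", "Centro", "Centro", "Centro", "Retiro", "Retiro"],
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room",
        "Entire home/apt", "Private room",
    ],
})

# normalize="index" divides every row by its own total, so each row sums to 1
shares = pd.crosstab(
    sample["neighbourhood_group_cleansed"],
    sample["room_type"],
    normalize="index",
).sort_values("Entire home/apt", ascending=True)

print(shares)
```

Combinations that never occur are filled with 0 by `crosstab`, so no separate `fillna(0)` is needed in this variant.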

2) Plot

# Draw a stacked horizontal bar chart; left tracks where each segment starts
columns = neighbour_room_type.columns;
index = neighbour_room_type.index;
plt.figure(figsize=(10,5))
left = np.zeros(len(index));
for column in columns:
    plt.barh(index, neighbour_room_type[column], left = left);
    # Accumulate into a new array so neighbour_room_type itself is not modified
    left = left + neighbour_room_type[column].values;
plt.legend(columns);
plt.show();
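
The starting position of every segment in a stacked bar is the sum of the segments before it in the same row. A minimal sketch of that calculation, on an invented share table, without any plotting:

```python
import pandas as pd

# Invented row-normalized shares (each row sums to 1)
shares = pd.DataFrame(
    {
        "Entire home/apt": [0.5, 0.75],
        "Private room": [0.25, 0.25],
        "Shared room": [0.25, 0.0],
    },
    index=["Retiro", "Centro"],
)

# Running total up to and including each column, minus the column itself,
# gives the left edge of each segment
left_edges = shares.cumsum(axis=1) - shares

print(left_edges)
```

Each column of `left_edges` could then be passed as the `left` argument of `plt.barh` for the matching column of `shares`.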

Airbnb dataset analysis --- number of listings per host

1) Distribution of the number of listings per host

host_number = listings_detailed_df.groupby('host_id').size();
sns.displot(data = host_number[host_number < 10], kde = True);
plt.show();

 

2) Share of hosts by number of listings, binned as 1, 2, 3-4 and 5+, i.e. [1,2), [2,3), [3,5), [5,∞)

# The open-ended last bin keeps hosts with 100 or more listings in the 5+ group
host_number_bins = pd.cut(host_number,bins=[1,2,3,5,np.inf],right=False, include_lowest=True, labels=['1', '2', '3-4', '5+']).value_counts();
plt.pie(host_number_bins,autopct='%.2f%%', labels=host_number_bins.index);
plt.show();

Airbnb dataset analysis --- number of reviews over time

# Load the dataset and parse date as a datetime column
reviews = pd.read_csv('madrid-airbnb-data/reviews_detailed.csv', parse_dates=['date']);
# Add year and month columns
reviews['year'] = reviews['date'].dt.year;
reviews['month'] = reviews['date'].dt.month;
# Group by year, and then by month, to count the reviews in each
n_reviews_year = reviews.groupby('year').size();
sns.barplot(x = n_reviews_year.index, y = n_reviews_year.values);
plt.show();

n_reviews_month = reviews.groupby('month').size();
sns.barplot(x = n_reviews_month.index, y = n_reviews_month.values);
plt.show();

 

Airbnb dataset analysis --- combined analysis of review counts by year and month

year_month_reviews = reviews.groupby(['year', 'month']).size().unstack('month').fillna(0);
# Draw one month-versus-review-count line per year
fig, ax = plt.subplots(figsize=(20,10));
for year in year_month_reviews.index:
    series = year_month_reviews.loc[year];
    # Label each line with its year so the legend matches the lines
    sns.lineplot(x = series.index, y = series.values, ax = ax, label = str(year));
ax.legend();
ax.grid();
# Show every month on the x-axis
_ = ax.set_xticks(list(range(1,13)))
plt.show();
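
To see what the year-by-month table looks like before plotting, here is a small sketch on four invented review dates. `reviews_sample` and `table` are hypothetical names.

```python
import pandas as pd

# Four invented review dates
reviews_sample = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-03-02", "2019-01-11"]),
})
reviews_sample["year"] = reviews_sample["date"].dt.year
reviews_sample["month"] = reviews_sample["date"].dt.month

# Rows are years, columns are months, cells are review counts;
# year/month pairs with no reviews become 0 instead of NaN
table = reviews_sample.groupby(["year", "month"]).size().unstack("month").fillna(0)

print(table)
```

Only months that occur somewhere in the data become columns, which is why the plot sets the x-axis ticks to 1 through 12 explicitly.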

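Airbnb dataset analysis --- listing price analysis

1) Select the factors related to price and convert them to suitable formats

# Use the listings dataset to predict listing prices
# Extract the price-related fields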
from sklearn.preprocessing import StandardScaler;
ml_listings = listings_detailed[listings_detailed['price'] < 300][[
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_group_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'cleaning_fee',
    'minimum_nights',
    'maximum_nights',
    'availability_90',
    'number_of_reviews',
    # 'review_scores_rating',
    'is_business_travel_ready',
    'n_amenities',
    'price'
]]

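# Drop rows with missing values (assigning the result back avoids modifying a slice in place)
ml_listings = ml_listings.dropna(axis=0);

# Split into features and target
features = ml_listings.drop(columns=['price']);
targets = ml_listings['price'];

# One-hot encode the categorical columns so that every feature is numeric before predicting the target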
disperse_columns = [
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_group_cleansed',
    'property_type',
    'room_type',
    'is_business_travel_ready'
]
disperse_features = features[disperse_columns];
disperse_features = pd.get_dummies(disperse_features);
# Standardize the continuous features; their ranges do not differ much, so this has little effect on the result
continuous_features = features.drop(columns = disperse_columns);
scaler = StandardScaler();
continuous_features = scaler.fit_transform(continuous_features);

# Combine the two groups of features
feature_array = np.hstack([disperse_features, continuous_features]);
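
The one-hot encoding and standardization can also be bundled into a single scikit-learn `ColumnTransformer`. The sketch below is an alternative to the `get_dummies` plus `np.hstack` approach above, not the code this article uses, and its four rows are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented listings with one categorical and two continuous columns
sample = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
    "accommodates": [4, 2, 1, 1],
    "bedrooms": [2.0, 1.0, 1.0, 1.0],
})
categorical = ["room_type"]
continuous = ["accommodates", "bedrooms"]

preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), continuous),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

# Three one-hot columns followed by two standardized columns
feature_array = preprocess.fit_transform(sample)
print(feature_array.shape)
```

One possible benefit of this form is that the transformer can be fitted on the training split only and then applied to the test split with `transform`, so the scaling statistics never see the test data.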

2) Use a random forest and check the mean absolute error and the R² score

from sklearn.model_selection import train_test_split;
# from sklearn.linear_model import LinearRegression;
from sklearn.metrics import mean_absolute_error, r2_score;
# R² score: the closer to 1 the better; a random forest regressor is used to get a better fit
from sklearn.ensemble import RandomForestRegressor;

# Split into training and test sets
X_train, X_test,y_train,y_test = train_test_split(feature_array, targets,test_size=0.25);
regression = RandomForestRegressor();
# Train the model, then predict on the test set
regression.fit(X_train, y_train);
y_predict = regression.predict(X_test);

# Check the mean absolute error and the R² score
print("Mean absolute error:",mean_absolute_error(y_test, y_predict));
print("R2 score:" , r2_score(y_test, y_predict));
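
For readers unfamiliar with the two metrics, the sketch below computes them by hand with NumPy on four invented price pairs. It is meant to mirror what `mean_absolute_error` and `r2_score` calculate.

```python
import numpy as np

# Invented true and predicted prices
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# Mean absolute error: average size of the prediction errors
mae = np.mean(np.abs(y_true - y_pred))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

Here the errors are 10, 10, 10 and 20, so the mean absolute error is 12.5, and R² is 1 - 700/12500 = 0.944.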

If linear regression is used instead, both the mean absolute error and the R² score are worse than with the random forest.
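
The sketch below shows one way such a comparison could be run. It uses a synthetic, deliberately non-linear target rather than the Airbnb features, so its numbers say nothing about the dataset; they only illustrate why a random forest can beat a straight-line fit. All names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a V-shaped target that a straight line cannot follow
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 1))
y = 100 * np.abs(X[:, 0] - 0.5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
]

# Fit each model on the same split and record (MAE, R²) on the test set
scores = {}
for name, model in models:
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    scores[name] = (mean_absolute_error(y_test, y_predict), r2_score(y_test, y_predict))

print(scores)
```

Using the same split and fixed `random_state` values for both models keeps the comparison repeatable.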

Airbnb dataset analysis --- review count prediction and visualization

# Predict the number of reviews
ym_reviews = reviews.groupby(['year', 'month']).size().reset_index().rename(columns={0:'count'});
# Get the features and the target
features = ym_reviews[['year', 'month']];
targets = ym_reviews['count'];

# Split into training and test sets to check how well the model generalizes
# X_train, X_test,y_train,y_test = train_test_split(features, targets,test_size=0.3);
# regression = RandomForestRegressor(n_estimators=100);
# regression.fit(X_train, y_train);
# y_predict = regression.predict(X_test);
# print("Mean absolute error:",mean_absolute_error(y_test, y_predict));
# print("R2 score:" , r2_score(y_test, y_predict));

regression = RandomForestRegressor(n_estimators=100);
regression.fit(features,targets);

# Predict the last three months of 2019; a DataFrame with the training column names avoids scikit-learn's feature-name warning
y_predict = regression.predict(pd.DataFrame({'year': [2019, 2019, 2019], 'month': [10, 11, 12]}));

# Visualize the predictions
predict_reviews = pd.DataFrame([[2019, 10 + index, x] for index, x in enumerate(y_predict)], columns=['year', 'month', 'count']);
final_reviews = pd.concat([ym_reviews, predict_reviews]).reset_index(drop=True);
years = final_reviews['year'].unique();
fig, ax = plt.subplots(figsize=(10,5));
for year in years:
    df = final_reviews[final_reviews['year'] == year];
    # Label each line with its year so the legend matches the lines
    sns.lineplot(x = 'month', y = 'count', data = df, ax = ax, label = str(year));

ax.legend();
ax.grid();
_ = ax.set_xticks(list(range(1,13)))
plt.show();

