New York City Airbnb (II) Model Building (with one-hot encoding and IQR)

Continuing the topic from the previous post, today we will build models on the New York City Airbnb data. Let's get started:

1. Split into training and test sets at a 7:3 ratio

library(dplyr)   # mutate, sample_frac, anti_join, filter

airbnb <- airbnb %>% mutate(id = row_number())                                   # add a row id so we can anti-join later
airbnb_train <- airbnb %>% sample_frac(.7) %>% filter(price > 0)                 # 70% training sample, dropping zero prices
airbnb_test  <- anti_join(airbnb, airbnb_train, by = 'id') %>% filter(price > 0) # everything not sampled into training
nrow(airbnb_train) + nrow(airbnb_test) == nrow(airbnb %>% filter(price > 0))     # sanity check: the two sets partition the data
2. Since neighbourhood (along with neighbourhood_group and room_type) is a string variable, we use the dummy-variable technique here to convert it into dummy variables, also known as one-hot encoding.

DummyTable <- model.matrix( ~ neighbourhood + neighbourhood_group + room_type, data = air)   # air: the cleaned Airbnb data frame from the previous post
new <- cbind(air, DummyTable[,-1])   # drop the intercept column and bind the dummies to the original data
new <- new[,-c(2,3,6)]               # drop the original string columns that are now encoded

The first few rows of the encoded data frame new look like this (the wide output is wrapped by the console):
  id latitude longitude price minimum_nights number_of_reviews reviews_per_month
1  1 40.64749 -73.97237   149              1                 9              0.21
2  2 40.75362 -73.98377   225              1                45              0.38
3  3 40.80902 -73.94190   150              3                 0              0.00
4  4 40.68514 -73.95976    89              1               270              4.64
  calculated_host_listings_count availability_365 last_year allyear lowavial
1                              6              365      2018       1        0
2                              2              355      2019       1        0
3                              1              365         0       1        0
4                              1              194      2019       0        0
  noavail neighbourhoodArden Heights neighbourhoodArrochar neighbourhoodArverne
1       0                          0                     0                    0
2       0                          0                     0                    0
3       0                          0                     0                    0
4       0                          0                     0                    0
  neighbourhoodAstoria neighbourhoodBath Beach neighbourhoodBattery Park City
1                    0                       0                              0
2                    0                       0                              0
3                    0                       0                              0
4                    0                       0                              0
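As a quick aside on what model.matrix is doing here, a minimal toy example (hypothetical data, not from the Airbnb set):

# Toy illustration: model.matrix() expands a factor into 0/1 dummy columns,
# dropping the first level as the reference category.
toy <- data.frame(room_type = factor(c("Entire home/apt", "Private room", "Shared room")))
model.matrix(~ room_type, data = toy)
#   (Intercept) room_typePrivate room room_typeShared room
# 1           1                     0                    0
# 2           1                     1                    0
# 3           1                     0                    1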


3. Split the dummy-encoded dataset into a 7:3 training and test set as well

new <- new %>% mutate(id = row_number())
new_train <- new %>% sample_frac(.7) %>% filter(price > 0)
new_test  <- anti_join(new, new_train, by = 'id') %>% filter(price > 0)   # rows not sampled into the training set
nrow(new_train) + nrow(new_test) == nrow(new %>% filter(price > 0))       # sanity check

4. IQR: for a large dataset with an uneven price distribution, we remove outlier samples with the IQR. Concretely, the code below keeps only listings whose price lies strictly between the first and third quartiles.
(I had already fitted a model on the data without this filtering and the results were poor; they are reported at the end of the post.)

#IQR way: keep prices strictly between Q1 and Q3
IQR_train <- new_train %>% filter(price < quantile(new_train$price, 0.75) & price > quantile(new_train$price, 0.25))
IQR_test  <- new_test  %>% filter(price < quantile(new_test$price, 0.75)  & price > quantile(new_test$price, 0.25))
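For reference, a more common IQR convention keeps everything within 1.5 × IQR of the quartiles instead of only the middle 50% of prices; a minimal sketch of that variant (not what this post uses) would be:

# Sketch only: classic Tukey fences, keeping prices within 1.5 * IQR of the quartiles
q1  <- quantile(new_train$price, 0.25)
q3  <- quantile(new_train$price, 0.75)
iqr <- q3 - q1
tukey_train <- new_train %>% filter(price > q1 - 1.5 * iqr, price < q3 + 1.5 * iqr)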

5. Model building: XGBoost
We first convert the data into xgb.DMatrix form, run cross-validation to find the best number of rounds, and then predict on the test set.
library(xgboost)

train = xgb.DMatrix(data = as.matrix(IQR_train[,-c(1,4)]),   # drop the id and price columns from the feature matrix
                    label = IQR_train$price)
test  = xgb.DMatrix(data = as.matrix(IQR_test[,-c(1,4)]),
                    label = IQR_test$price)
xgb.params = list(
  colsample_bytree = 0.5,    # fraction of columns sampled per tree
  subsample = 0.5,           # fraction of rows sampled per tree
  booster = "gbtree",
  max_depth = 2,
  eta = 0.03,                # learning rate
  eval_metric = "rmse",
  objective = "reg:linear",  # newer xgboost versions call this "reg:squarederror"
  gamma = 0)
cv.model = xgb.cv(
  params = xgb.params,
  data = train,
  nfold = 5,
  nrounds=200,
  early_stopping_rounds = 30,
  print_every_n = 20)

best.nrounds = cv.model$best_iteration
best.nrounds

#Prediction
xgb.model = xgb.train(params = xgb.params,
                      data = train,
                      nrounds = best.nrounds)
xgb_y = predict(xgb.model, test)

#XgbRMSE
xgbrmse <- sqrt(mean((xgb_y - IQR_test$price)^2) )

#XgbMAPE
xgbMape <- 100*mean(abs((IQR_test$price - xgb_y)/IQR_test$price))
xgbMape
> best.nrounds
[1] 200
> xgbrmse
[1] 19.72948
> xgbMape
[1] 14.48792
6. Comparison
As mentioned in the note above, I had also built a model on the dataset without removing outliers; it scored an RMSE of 146 and a MAPE of 144. After the IQR filtering, the overall error drops dramatically (RMSE 19.73, MAPE 14.49), which makes it very clear that preprocessing matters far more than the modeling step itself: the MAPE alone fell by more than 100 percentage points.

7. Importance
                 Feature       Gain      Cover  Frequency
1: room_typePrivate room 0.33927444 0.01548010 0.01784440
2:             longitude 0.18453317 0.17776407 0.19386153
3:              latitude 0.16234362 0.19825210 0.26352605
4:      availability_365 0.06352782 0.04418581 0.09407566
5:     reviews_per_month 0.05030044 0.07258317 0.09207709
6:     number_of_reviews 0.04692579 0.08021791 0.10206995
We can see that the Private room level of room_type is the single most important variable, followed by the geographic coordinates (longitude and latitude), and then availability_365. The EDA from the previous post already hinted at much of this: there, Private Room had the highest prices and Manhattan was the most expensive area, so a good EDA can anticipate a fair amount of what the model will find. Note also that the Private room feature itself comes from the one-hot encoding step; building a model properly takes far more work than simply throwing the raw variables in.
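For completeness, the importance table above can be reproduced with xgb.importance (a sketch; the original post does not show this step):

# Sketch: compute feature importance from the trained model and plot the top features
importance <- xgb.importance(feature_names = colnames(IQR_train[,-c(1,4)]), model = xgb.model)
head(importance)
xgb.plot.importance(importance, top_n = 10)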

## Let's take a look at the fit plot


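The plot itself is not reproduced here; a minimal sketch of a predicted-vs-actual fit plot from these predictions (assuming ggplot2 is available) could look like this:

library(ggplot2)

# Sketch: predicted vs. actual price on the IQR-filtered test set
fit_df <- data.frame(actual = IQR_test$price, predicted = xgb_y)
ggplot(fit_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, colour = "red") +   # perfect-fit reference line
  labs(x = "Actual price", y = "Predicted price", title = "XGBoost fit on the test set")
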
8. Other directions
host_ID was left out this time; perhaps it can be explored with text mining in the future. The Availability features engineered earlier did not show a noticeable effect in the importance ranking, so there is still more to dig into there. Finally, I did not log-transform price before modeling; that is the first direction to keep working on, as sketched below.
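A minimal sketch of that log-price direction (an assumption about future work, not something done in this post): train on log1p(price) and invert the transform on the predictions.

# Sketch: model log(1 + price) instead of price, then back-transform the predictions
train_log <- xgb.DMatrix(data = as.matrix(IQR_train[,-c(1,4)]), label = log1p(IQR_train$price))
xgb.model.log <- xgb.train(params = xgb.params, data = train_log, nrounds = best.nrounds)
xgb_y_log <- expm1(predict(xgb.model.log, test))    # back to the original price scale
sqrt(mean((xgb_y_log - IQR_test$price)^2))          # RMSE on the price scale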
