New York City Airbnb (II) Model Building (with one-hot encoding and IQR)

Continuing the topic from the previous post, today we will build models on the New York City Airbnb data. Let's get started:

1. Split into training and test sets at a 7:3 ratio

library(dplyr)   # mutate, sample_frac, anti_join, filter

airbnb <- airbnb %>% mutate(id = row_number())                                   # add a row id so we can anti-join later
airbnb_train <- airbnb %>% sample_frac(.7) %>% filter(price > 0)                 # 70% training sample, dropping zero prices
airbnb_test  <- anti_join(airbnb, airbnb_train, by = 'id') %>% filter(price > 0) # everything not sampled into training
nrow(airbnb_train) + nrow(airbnb_test) == nrow(airbnb %>% filter(price > 0))     # sanity check: the two sets partition the data
2. Since neighbourhood (along with neighbourhood_group and room_type) is a string variable, we use the dummy-variable technique here to convert it into dummy variables, also known as one-hot encoding.

DummyTable <- model.matrix( ~ neighbourhood + neighbourhood_group + room_type, data = air)   # air: the cleaned Airbnb data frame from the previous post
new <- cbind(air, DummyTable[,-1])   # drop the intercept column and bind the dummies to the original data
new <- new[,-c(2,3,6)]               # drop the original string columns that are now encoded

The first few rows of the encoded data frame new look like this (the wide output is wrapped by the console):
  id latitude longitude price minimum_nights number_of_reviews reviews_per_month
1  1 40.64749 -73.97237   149              1                 9              0.21
2  2 40.75362 -73.98377   225              1                45              0.38
3  3 40.80902 -73.94190   150              3                 0              0.00
4  4 40.68514 -73.95976    89              1               270              4.64
  calculated_host_listings_count availability_365 last_year allyear lowavial
1                              6              365      2018       1        0
2                              2              355      2019       1        0
3                              1              365         0       1        0
4                              1              194      2019       0        0
  noavail neighbourhoodArden Heights neighbourhoodArrochar neighbourhoodArverne
1       0                          0                     0                    0
2       0                          0                     0                    0
3       0                          0                     0                    0
4       0                          0                     0                    0
  neighbourhoodAstoria neighbourhoodBath Beach neighbourhoodBattery Park City
1                    0                       0                              0
2                    0                       0                              0
3                    0                       0                              0
4                    0                       0                              0
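As a quick aside on what model.matrix is doing here, a minimal toy example (hypothetical data, not from the Airbnb set):

# Toy illustration: model.matrix() expands a factor into 0/1 dummy columns,
# dropping the first level as the reference category.
toy <- data.frame(room_type = factor(c("Entire home/apt", "Private room", "Shared room")))
model.matrix(~ room_type, data = toy)
#   (Intercept) room_typePrivate room room_typeShared room
# 1           1                     0                    0
# 2           1                     1                    0
# 3           1                     0                    1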


3. Split the dummy-encoded dataset into a 7:3 training and test set as well

new <- new %>% mutate(id = row_number())
new_train <- new %>% sample_frac(.7) %>% filter(price > 0)
new_test  <- anti_join(new, new_train, by = 'id') %>% filter(price > 0)   # rows not sampled into the training set
nrow(new_train) + nrow(new_test) == nrow(new %>% filter(price > 0))       # sanity check

4. IQR: for a large dataset with an uneven price distribution, we remove outlier samples with the IQR. Concretely, the code below keeps only listings whose price lies strictly between the first and third quartiles.
(I had already fitted a model on the data without this filtering and the results were poor; they are reported at the end of the post.)

#IQR way: keep prices strictly between Q1 and Q3
IQR_train <- new_train %>% filter(price < quantile(new_train$price, 0.75) & price > quantile(new_train$price, 0.25))
IQR_test  <- new_test  %>% filter(price < quantile(new_test$price, 0.75)  & price > quantile(new_test$price, 0.25))
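For reference, a more common IQR convention keeps everything within 1.5 × IQR of the quartiles instead of only the middle 50% of prices; a minimal sketch of that variant (not what this post uses) would be:

# Sketch only: classic Tukey fences, keeping prices within 1.5 * IQR of the quartiles
q1  <- quantile(new_train$price, 0.25)
q3  <- quantile(new_train$price, 0.75)
iqr <- q3 - q1
tukey_train <- new_train %>% filter(price > q1 - 1.5 * iqr, price < q3 + 1.5 * iqr)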

5. Model building: XGBoost
We first convert the data into xgb.DMatrix form, run cross-validation to find the best number of rounds, and then predict on the test set.
library(xgboost)

train = xgb.DMatrix(data = as.matrix(IQR_train[,-c(1,4)]),   # drop the id and price columns from the feature matrix
                    label = IQR_train$price)
test  = xgb.DMatrix(data = as.matrix(IQR_test[,-c(1,4)]),
                    label = IQR_test$price)
xgb.params = list(
  colsample_bytree = 0.5,    # fraction of columns sampled per tree
  subsample = 0.5,           # fraction of rows sampled per tree
  booster = "gbtree",
  max_depth = 2,
  eta = 0.03,                # learning rate
  eval_metric = "rmse",
  objective = "reg:linear",  # newer xgboost versions call this "reg:squarederror"
  gamma = 0)
cv.model = xgb.cv(
  params = xgb.params,
  data = train,
  nfold = 5,
  nrounds=200,
  early_stopping_rounds = 30,
  print_every_n = 20)

best.nrounds = cv.model$best_iteration
best.nrounds

#Prediction
xgb.model = xgb.train(params = xgb.params,
                      data = train,
                      nrounds = best.nrounds)
xgb_y = predict(xgb.model, test)

#XgbRMSE
xgbrmse <- sqrt(mean((xgb_y - IQR_test$price)^2) )

#XgbMAPE
xgbMape <- 100*mean(abs((IQR_test$price - xgb_y)/IQR_test$price))
xgbMape
> best.nrounds
[1] 200
> xgbrmse
[1] 19.72948
> xgbMape
[1] 14.48792
6. Comparison
As mentioned in the note above, I had also built a model on the dataset without removing outliers; it scored an RMSE of 146 and a MAPE of 144. After the IQR filtering, the overall error drops dramatically (RMSE 19.73, MAPE 14.49), which makes it very clear that preprocessing matters far more than the modeling step itself: the MAPE alone fell by more than 100 percentage points.

7. Importance
                 Feature       Gain      Cover  Frequency
1: room_typePrivate room 0.33927444 0.01548010 0.01784440
2:             longitude 0.18453317 0.17776407 0.19386153
3:              latitude 0.16234362 0.19825210 0.26352605
4:      availability_365 0.06352782 0.04418581 0.09407566
5:     reviews_per_month 0.05030044 0.07258317 0.09207709
6:     number_of_reviews 0.04692579 0.08021791 0.10206995
We can see that the Private room level of room_type is the single most important variable, followed by the geographic coordinates (longitude and latitude), and then availability_365. The EDA from the previous post already hinted at much of this: there, Private Room had the highest prices and Manhattan was the most expensive area, so a good EDA can anticipate a fair amount of what the model will find. Note also that the Private room feature itself comes from the one-hot encoding step; building a model properly takes far more work than simply throwing the raw variables in.
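For completeness, the importance table above can be reproduced with xgb.importance (a sketch; the original post does not show this step):

# Sketch: compute feature importance from the trained model and plot the top features
importance <- xgb.importance(feature_names = colnames(IQR_train[,-c(1,4)]), model = xgb.model)
head(importance)
xgb.plot.importance(importance, top_n = 10)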

## Let's take a look at the fit plot


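The plot itself is not reproduced here; a minimal sketch of a predicted-vs-actual fit plot from these predictions (assuming ggplot2 is available) could look like this:

library(ggplot2)

# Sketch: predicted vs. actual price on the IQR-filtered test set
fit_df <- data.frame(actual = IQR_test$price, predicted = xgb_y)
ggplot(fit_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, colour = "red") +   # perfect-fit reference line
  labs(x = "Actual price", y = "Predicted price", title = "XGBoost fit on the test set")
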
8. Other directions
host_ID was left out this time; perhaps it can be explored with text mining in the future. The Availability features engineered earlier did not show a noticeable effect in the importance ranking, so there is still more to dig into there. Finally, I did not log-transform price before modeling; that is the first direction to keep working on, as sketched below.
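A minimal sketch of that log-price direction (an assumption about future work, not something done in this post): train on log1p(price) and invert the transform on the predictions.

# Sketch: model log(1 + price) instead of price, then back-transform the predictions
train_log <- xgb.DMatrix(data = as.matrix(IQR_train[,-c(1,4)]), label = log1p(IQR_train$price))
xgb.model.log <- xgb.train(params = xgb.params, data = train_log, nrounds = best.nrounds)
xgb_y_log <- expm1(predict(xgb.model.log, test))    # back to the original price scale
sqrt(mean((xgb_y_log - IQR_test$price)^2))          # RMSE on the price scale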
