Tree models
Partly it's because work has kept me absorbed;
partly because many of my evenings have been hijacked by Netflix..... XD
1. Decision Tree
2. Random Forest, built by bagging decision trees
3. XGBoost
This time the data is the diabetes dataset from mlbench; we'll explore how much each factor contributes to diabetes. Without further ado, let's get started!
library(mlbench)
library(dplyr)    # for %>% and filter(), used below
library(ggplot2)  # for the plots, used below
data(PimaIndiansDiabetes)
diabetes <- PimaIndiansDiabetes
set.seed(22)
train.index <- sample(x = 1:nrow(diabetes), size = ceiling(0.8 * nrow(diabetes)))
train <- diabetes[train.index, ]
test <- diabetes[-train.index, ]
# Split into training and test sets
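As a quick sanity check (my own addition, not part of the original workflow), the split sizes can be verified by hand; the 153-row test set matches the confusion matrices later on:

```r
# Sanity check on the 80/20 split: PimaIndiansDiabetes has 768 rows,
# so ceiling(0.8 * 768) gives the training size and the rest is the test set
n <- 768
n_train <- ceiling(0.8 * n)   # 615
n_test <- n - n_train         # 153
c(train = n_train, test = n_test)
```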
> summary(diabetes)
    pregnant         glucose         pressure         triceps
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    insulin           mass          pedigree           age
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
    diabetes
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.349
 3rd Qu.:1.000
 Max.   :1.000
There are no NA values, hooray!
Next, let's look at the descriptive statistics for the "diabetic" and "non-diabetic" groups.
1. Diabetic group
got <- diabetes %>%
  filter(diabetes == 1) %>%
  summary()
got
pregnant glucose pressure triceps
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.750 1st Qu.:119.0 1st Qu.: 66.00 1st Qu.: 0.00
Median : 4.000 Median :140.0 Median : 74.00 Median :27.00
Mean : 4.866 Mean :141.3 Mean : 70.82 Mean :22.16
3rd Qu.: 8.000 3rd Qu.:167.0 3rd Qu.: 82.00 3rd Qu.:36.00
Max. :17.000 Max. :199.0 Max. :114.00 Max. :99.00
insulin mass pedigree age diabetes
Min. : 0.0 Min. : 0.00 Min. :0.0880 Min. :21.00 Min. :1
1st Qu.: 0.0 1st Qu.:30.80 1st Qu.:0.2625 1st Qu.:28.00 1st Qu.:1
Median : 0.0 Median :34.25 Median :0.4490 Median :36.00 Median :1
Mean :100.3 Mean :35.14 Mean :0.5505 Mean :37.07 Mean :1
3rd Qu.:167.2 3rd Qu.:38.77 3rd Qu.:0.7280 3rd Qu.:44.00 3rd Qu.:1
Max. :846.0 Max. :67.10 Max. :2.4200 Max. :70.00 Max. :1
2. Non-diabetic group
none <- diabetes %>%
  filter(diabetes == 0) %>%
  summary()
none
pregnant glucose pressure triceps
Min. : 0.000 Min. : 0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.000 1st Qu.: 93 1st Qu.: 62.00 1st Qu.: 0.00
Median : 2.000 Median :107 Median : 70.00 Median :21.00
Mean : 3.298 Mean :110 Mean : 68.18 Mean :19.66
3rd Qu.: 5.000 3rd Qu.:125 3rd Qu.: 78.00 3rd Qu.:31.00
Max. :13.000 Max. :197 Max. :122.00 Max. :60.00
insulin mass pedigree age diabetes
Min. : 0.00 Min. : 0.00 Min. :0.0780 Min. :21.00 Min. :0
1st Qu.: 0.00 1st Qu.:25.40 1st Qu.:0.2298 1st Qu.:23.00 1st Qu.:0
Median : 39.00 Median :30.05 Median :0.3360 Median :27.00 Median :0
Mean : 68.79 Mean :30.30 Mean :0.4297 Mean :31.19 Mean :0
3rd Qu.:105.00 3rd Qu.:35.30 3rd Qu.:0.5617 3rd Qu.:37.00 3rd Qu.:0
Max. :744.00 Max. :57.30 Max. :2.3290 Max. :81.00 Max. :0
ggplot(PimaIndiansDiabetes, aes(glucose, insulin)) +
geom_point(aes(color = diabetes))
ggplot(data = PimaIndiansDiabetes) +
geom_bar(mapping = aes(x = pregnant, fill = factor(diabetes)))
ggplot(data = PimaIndiansDiabetes) +
geom_bar(mapping = aes(x = pressure, fill = factor(diabetes)))
# Plotting code
Pressure shows relatively little pattern, but for pregnant and insulin the proportion of diabetics rises as the values increase; glucose is the most striking of all, so glucose should be a very important variable!
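To back the visual impression with a number (my own sketch, using only base R), we can compare the group medians of glucose directly; per the group summaries above they should come out to 107 (neg) and 140 (pos):

```r
library(mlbench)
data(PimaIndiansDiabetes)

# Median glucose per outcome group confirms the separation seen in the plots
tapply(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$diabetes, median)
# neg: 107, pos: 140
```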
1. Decision Tree
library(rpart)
library(rpart.plot)
tree.model <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)
rpart.plot(tree.model)
Once the tree is drawn, we again find that the first splitting variable is glucose. Feeding the entire dataset into the decision tree does raise overfitting concerns, but for descriptive analysis a decision tree is indeed a very convenient way to surface the main variables; for actual prediction, we still need other tools.
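To address the overfitting worry, here is a sketch (my own addition, reusing the same seed and split as earlier) that fits the tree on the training set only and scores it on the held-out test set:

```r
library(mlbench)
library(rpart)
data(PimaIndiansDiabetes)
diabetes <- PimaIndiansDiabetes

# Same 80/20 split as before
set.seed(22)
train.index <- sample(x = 1:nrow(diabetes), size = ceiling(0.8 * nrow(diabetes)))
train <- diabetes[train.index, ]
test <- diabetes[-train.index, ]

# Fit on the training set only, then measure held-out accuracy
tree.fit <- rpart(diabetes ~ ., data = train)
tree.pred <- predict(tree.fit, newdata = test, type = "class")
mean(tree.pred == test$diabetes)
```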
2. RandomForest
library(randomForest)
library(ggthemes)  # for theme_few() used below
rf_model <- randomForest(diabetes ~ ., data = train)
rf.fitted = predict(rf_model)
print(rf_model)
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variables') +
coord_flip() +
theme_few()
Here we simply feed the variables into the function and visualize the importance scores.
We can see that glucose is the most important variable,
followed by mass and age. Beyond making predictions, random forest also doubles as a variable-selection tool!
RFprediction <- predict(rf_model, newdata = test)  # this step is implied by the output below
RFresult <- test %>% cbind(RFprediction)
table(RFresult$RFprediction, RFresult$diabetes)

        0   1
  neg  84  20
  pos  15  34
We can see that the RF prediction accuracy is about 77.12%.
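The 77.12% figure comes straight from the confusion matrix; as a small helper (my own sketch), the accuracy can be computed from the table itself:

```r
# Accuracy from the RF confusion matrix reported above:
# correct predictions are on the diagonal (84 + 34) out of 153 test rows
conf <- matrix(c(84, 15, 20, 34), nrow = 2,
               dimnames = list(pred = c("neg", "pos"), truth = c("0", "1")))
accuracy <- sum(diag(conf)) / sum(conf)
round(100 * accuracy, 2)  # 77.12
```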
3. XGBoost
The weapon of choice in recent Kaggle competitions, so let's put it to the test. Before using it, note that the data frame needs to be converted!
library(xgboost)
train$diabetes <- ifelse(train$diabetes == "pos", 1, 0)
test$diabetes <- ifelse(test$diabetes == "pos", 1, 0)
dtrain <- xgb.DMatrix(data = as.matrix(train[, 1:8]), label = train$diabetes)
dtest <- xgb.DMatrix(data = as.matrix(test[, 1:8]), label = test$diabetes)
xg.model <- xgboost(data = dtrain,                 # train matrix
                    eval.metric = 'logloss',       # model minimizes log loss
                    objective = "binary:logistic", # binary classification
                    # tuning parameters
                    max.depth = 8,             # vary btwn 3-15
                    eta = 0.1,                 # vary btwn 0.1-0.3
                    nthread = 5,               # increase this to improve speed
                    subsample = 1,             # vary btwn 0.8-1
                    colsample_bytree = 0.5,    # vary btwn 0.3-0.8
                    lambda = 0.5,              # vary btwn 0-3
                    alpha = 0.5,               # vary btwn 0-3
                    min_child_weight = 3,      # vary btwn 1-10
                    nround = 30                # vary btwn 100-3000 based on max.depth, eta, subsample and colsample
)
xg_prediction <- predict(xg.model, dtest)
xg_prediction <- ifelse(xg_prediction >= 0.55, 1, 0)
xg_diabetes <- as.numeric(as.character(xg_prediction))
result <- test %>% cbind(xg_diabetes)
table(result$diabetes, result$xg_diabetes)

     0  1
  0 85 14
  1 25 29
Oops, the accuracy here is not higher than with RF (74.5% < 77.12%).
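One likely culprit is the hand-picked hyperparameters. Rather than hard-coding nround = 30, cross-validation can choose the number of boosting rounds; here is a standalone sketch (my own addition, and the parameter values are illustrative, not tuned):

```r
library(mlbench)
library(xgboost)
data(PimaIndiansDiabetes)
d <- PimaIndiansDiabetes
d$diabetes <- ifelse(d$diabetes == "pos", 1, 0)
dall <- xgb.DMatrix(data = as.matrix(d[, 1:8]), label = d$diabetes)

# 5-fold CV with early stopping picks the number of boosting rounds
set.seed(22)
cv <- xgb.cv(data = dall,
             objective = "binary:logistic",
             eval_metric = "logloss",
             max_depth = 8, eta = 0.1,
             nrounds = 200, nfold = 5,
             early_stopping_rounds = 10,
             verbose = 0)
cv$best_iteration  # use this as nround when refitting on the training set
```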
4. Conclusion
This post introduced three tree models in one go. To go further, there is still plenty of room to work on variable selection and tuning; for example, we could try binning variables such as age and pregnancies and examining each segment, which should push the accuracy up considerably. On the modeling side, Kaggle competitions centered on classification almost always bring in
ensembling: in this post, since Random Forest scored higher, we would give RF a larger weight and then combine it with the XGBoost results, a perfect illustration of "three cobblers beat one Zhuge Liang" (many heads are better than one). There will be much more to say about ensembling later; there is still a lot to do!
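The weighting idea above can be sketched as a tiny soft-voting helper (entirely my own illustration; the weights and probabilities are made up):

```r
# Weighted soft vote: give the stronger model (RF here) a larger weight
soft_vote <- function(p_rf, p_xgb, w_rf = 0.6) {
  p <- w_rf * p_rf + (1 - w_rf) * p_xgb   # blended probability of class 1
  ifelse(p >= 0.5, 1, 0)                  # threshold into a class label
}

# Toy predicted probabilities for three observations
soft_vote(c(0.9, 0.5, 0.2), c(0.7, 0.6, 0.1))  # → 1 1 0
```

In a real pipeline, w_rf would be chosen on a validation set rather than fixed at 0.6.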
If you spot a problem or have a suggestion, feel free to leave a comment!
Note:
I'm also attaching an article whose explanation I found excellent:
https://towardsdatascience.com/basic-ensemble-learning-random-forest-adaboost-gradient-boosting-step-by-step-explained-95d49d1e2725