Tree models




Hi everyone, it has been quite a while since my last post.
For one thing, work has kept me fully absorbed;
for another, many of my evenings have been hijacked by Netflix..... XD

In this post we will apply several tree models to the same dataset, namely:
1. Decision tree
2. Random forest, an ensemble built by bagging decision trees
3. XGBoost

The data is the PimaIndiansDiabetes dataset from the mlbench package, and we will explore how much each factor contributes to diabetes. Without further ado, let's get started!


library(mlbench)
library(dplyr)     # for %>%, filter() and mutate() used below
library(ggplot2)   # for the EDA plots below
data(PimaIndiansDiabetes)
diabetes <- PimaIndiansDiabetes
set.seed(22)
train.index <- sample(x = 1:nrow(diabetes), size = ceiling(0.8 * nrow(diabetes)))

train = diabetes[train.index, ]
test = diabetes[-train.index, ]
# split into training and test sets
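Before going further, an optional sanity check (a minimal sketch) that the random split keeps the class proportions roughly similar in both sets:

prop.table(table(train$diabetes))   # share of neg/pos in the training set
prop.table(table(test$diabetes))    # share of neg/pos in the test set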
Next, let's take a look at the data as a whole.

> summary(diabetes)
    pregnant         glucose         pressure         triceps     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    insulin           mass          pedigree           age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
 diabetes 
 neg:500  
 pos:268  
No NA values, hooray!
Next, let's look at the descriptive statistics of two groups:
those who developed diabetes and those who did not.

1. Diabetic group

got <- diabetes %>%
  filter(diabetes == "pos") %>%   # the response is a factor with levels neg/pos
  summary()
got
    pregnant         glucose         pressure         triceps     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.750   1st Qu.:119.0   1st Qu.: 66.00   1st Qu.: 0.00  
 Median : 4.000   Median :140.0   Median : 74.00   Median :27.00  
 Mean   : 4.866   Mean   :141.3   Mean   : 70.82   Mean   :22.16  
 3rd Qu.: 8.000   3rd Qu.:167.0   3rd Qu.: 82.00   3rd Qu.:36.00  
 Max.   :17.000   Max.   :199.0   Max.   :114.00   Max.   :99.00  
    insulin           mass          pedigree           age        diabetes 
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0880   Min.   :21.00   neg:  0  
 1st Qu.:  0.0   1st Qu.:30.80   1st Qu.:0.2625   1st Qu.:28.00   pos:268  
 Median :  0.0   Median :34.25   Median :0.4490   Median :36.00            
 Mean   :100.3   Mean   :35.14   Mean   :0.5505   Mean   :37.07            
 3rd Qu.:167.2   3rd Qu.:38.77   3rd Qu.:0.7280   3rd Qu.:44.00            
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :70.00            
2. Non-diabetic group

none <- diabetes %>%
  filter(diabetes == "neg") %>%
  summary()
none
    pregnant         glucose      pressure         triceps     
 Min.   : 0.000   Min.   :  0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 93   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 2.000   Median :107   Median : 70.00   Median :21.00  
 Mean   : 3.298   Mean   :110   Mean   : 68.18   Mean   :19.66  
 3rd Qu.: 5.000   3rd Qu.:125   3rd Qu.: 78.00   3rd Qu.:31.00  
 Max.   :13.000   Max.   :197   Max.   :122.00   Max.   :60.00  
    insulin           mass          pedigree           age        diabetes 
 Min.   :  0.00   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:500  
 1st Qu.:  0.00   1st Qu.:25.40   1st Qu.:0.2298   1st Qu.:23.00   pos:  0  
 Median : 39.00   Median :30.05   Median :0.3360   Median :27.00            
 Mean   : 68.79   Mean   :30.30   Mean   :0.4297   Mean   :31.19            
 3rd Qu.:105.00   3rd Qu.:35.30   3rd Qu.:0.5617   3rd Qu.:37.00            
 Max.   :744.00   Max.   :57.30   Max.   :2.3290   Max.   :81.00            

A quick read of the two summaries: the diabetic group is higher on most variables, especially glucose, where the group means differ substantially. To explore further, we can move straight into the EDA!

ggplot(PimaIndiansDiabetes, aes(glucose, insulin)) +
  geom_point(aes(color = diabetes))

ggplot(data = PimaIndiansDiabetes) + 
  geom_bar(mapping = aes(x = pregnant, fill = factor(diabetes)))

ggplot(data = PimaIndiansDiabetes) + 
  geom_bar(mapping = aes(x = pressure, fill = factor(diabetes)))

# plotting code




pressure shows a relatively weak pattern, but for pregnant and insulin the proportion of positive cases grows as the values increase; glucose is the most pronounced of all, so glucose should be a crucial variable!
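To make the glucose gap concrete, here is a small optional sketch overlaying the two groups' glucose densities (same ggplot2 setup as above):

ggplot(diabetes, aes(x = glucose, fill = diabetes)) +
  geom_density(alpha = 0.4)   # overlapping density curves for neg vs. pos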


1. Decision Tree

library(rpart)
library(rpart.plot)
tree.model <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)
rpart.plot(tree.model)
Once the decision tree is drawn, we again find that the first splitting variable is glucose. Feeding the entire dataset into a decision tree does raise overfitting concerns; still, for descriptive analysis a decision tree is indeed a convenient tool for identifying the main variables. For prediction, we still need other tools.
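If overfitting worries us, one standard remedy is cost-complexity pruning. A minimal sketch using rpart's built-in cross-validation table (cptable):

# pick the cp value with the lowest cross-validated error, then prune
best.cp <- tree.model$cptable[which.min(tree.model$cptable[, "xerror"]), "CP"]
pruned.model <- prune(tree.model, cp = best.cp)
rpart.plot(pruned.model)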


2. RandomForest

library(randomForest)
library(ggthemes)   # for theme_few() used in the importance plot below
rf_model <- randomForest(diabetes ~ ., data = train)
rf.fitted = predict(rf_model)   # out-of-bag predictions on the training set
print(rf_model)

importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))


rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))


ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()
Here we simply feed the variables into the function and visualize the importance scores.
We can see that glucose is the most influential variable,
followed by mass and age. Besides prediction, random forest also serves as a variable-screening tool!


RFprediction <- predict(rf_model, newdata = test)   # predicted classes for the test set
RFresult <- test %>%
  cbind(RFprediction)
RFresult
> table(RFresult$RFprediction, RFresult$diabetes)
       neg pos
  neg   84  20
  pos   15  34
We can see that the RF prediction accuracy is about 77.12% ((84 + 34) / 153).
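Instead of computing this by hand, the accuracy can also be read off directly; a one-line sketch:

mean(RFresult$RFprediction == RFresult$diabetes)   # fraction of correct predictions, ~0.7712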

3. XGBoost
The star of many recent Kaggle competitions; let's put it to the test together. Before using it, note the data-frame conversion that is required!

library(xgboost)

train$diabetes <- ifelse(train$diabetes == "pos", 1, 0)
test$diabetes <- ifelse(test$diabetes == "pos", 1, 0)

dtrain = xgb.DMatrix(data = as.matrix(train[, 1:8]),
                     label = train$diabetes)
dtest = xgb.DMatrix(data = as.matrix(test[, 1:8]),
                    label = test$diabetes)

xg.model <- xgboost(data = dtrain,                  # training DMatrix
                    eval_metric = 'logloss',        # evaluation metric: logistic loss
                    objective = "binary:logistic",  # binary classification
                    # tuning parameters
                    max_depth = 8,           # vary between 3-15
                    eta = 0.1,               # vary between 0.1-0.3
                    nthread = 5,             # increase this to improve speed
                    subsample = 1,           # vary between 0.8-1
                    colsample_bytree = 0.5,  # vary between 0.3-0.8
                    lambda = 0.5,            # vary between 0-3
                    alpha = 0.5,             # vary between 0-3
                    min_child_weight = 3,    # vary between 1-10
                    nrounds = 30             # vary between 100-3000 based on max_depth, eta, subsample and colsample_bytree
)
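As with random forest, we can also inspect feature importance for the trained booster; a small optional sketch using xgboost's built-in helpers:

imp <- xgb.importance(feature_names = colnames(train[, 1:8]), model = xg.model)
xgb.plot.importance(imp)   # gain-based importance, analogous to the RF plot above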
xg_prediction <- predict(xg.model, dtest)            # predicted probabilities
xg_diabetes <- ifelse(xg_prediction >= 0.55, 1, 0)   # threshold at 0.55
result <- test %>%
  cbind(xg_diabetes)
table(result$diabetes, result$xg_diabetes)
    0  1
  0 85 14
  1 25 29


Oops, the accuracy here is not higher than with RF (74.5% < 77.12%).
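One likely culprit is the fixed nrounds. A hedged sketch of choosing the number of rounds with xgb.cv and early stopping instead (the parameter values are simply carried over from above, not tuned):

cv <- xgb.cv(data = dtrain,
             params = list(objective = "binary:logistic",
                           eval_metric = "logloss",
                           max_depth = 8, eta = 0.1),
             nrounds = 300,                # upper bound on boosting rounds
             nfold = 5,                    # 5-fold cross-validation
             early_stopping_rounds = 20,   # stop once logloss stops improving
             verbose = 0)
cv$best_iteration                          # suggested nrounds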



4. Conclusion
This post introduced three tree models in one go. If we wanted to be more thorough, there is still plenty of room to maneuver in variable screening and tuning; for example, we could try binning variables such as age and pregnant and examining the segments, and with careful handling the accuracy should climb quite a bit further. On the modeling side, in classification-oriented Kaggle competitions ensembling is always part of the recipe. In this post, for instance, since random forest achieved the higher accuracy, we could give RF a larger weight and combine its output with XGBoost's, nicely illustrating the idea that three cobblers together beat one Zhuge Liang (several weak learners beat a single strong one); a minimal sketch of this blending idea follows below. More on ensembling another time; there is still much, much to do!
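A minimal sketch of that blending idea (the 0.6/0.4 weights are an arbitrary assumption, not tuned):

# average the two models' positive-class probabilities, weighting RF higher
# because it was the more accurate model here
rf_prob  <- predict(rf_model, newdata = test, type = "prob")[, "pos"]
xgb_prob <- predict(xg.model, dtest)
blend <- 0.6 * rf_prob + 0.4 * xgb_prob
blend_pred <- ifelse(blend >= 0.5, 1, 0)
mean(blend_pred == test$diabetes)   # accuracy of the blended predictions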

If you spot any problems or have suggestions, feel free to leave a comment!


Note:
An article with an explanation I found excellent:
https://towardsdatascience.com/basic-ensemble-learning-random-forest-adaboost-gradient-boosting-step-by-step-explained-95d49d1e2725
