Tree models
Partly it's because work has kept me absorbed;
partly because many of my evenings have been hijacked by Netflix..... XD
1. Decision Tree
2. Random Forest, built by bagging decision trees
3. XGBoost
This time the data is the diabetes dataset from mlbench; we'll explore how much each factor contributes to diabetes. Without further ado, let's get started!
library(mlbench)
library(dplyr)    # for %>% and filter(), used below
library(ggplot2)  # for the plots, used below
data(PimaIndiansDiabetes)
diabetes <- PimaIndiansDiabetes
set.seed(22)
train.index <- sample(x = 1:nrow(diabetes), size = ceiling(0.8 * nrow(diabetes)))
train <- diabetes[train.index, ]
test <- diabetes[-train.index, ]
# Split into training and test sets
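As a quick sanity check (my own addition, not part of the original workflow), the split sizes can be verified by hand; the 153-row test set matches the confusion matrices later on:

```r
# Sanity check on the 80/20 split: PimaIndiansDiabetes has 768 rows,
# so ceiling(0.8 * 768) gives the training size and the rest is the test set
n <- 768
n_train <- ceiling(0.8 * n)   # 615
n_test <- n - n_train         # 153
c(train = n_train, test = n_test)
```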
> summary(diabetes)
    pregnant         glucose         pressure         triceps
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    insulin           mass          pedigree           age
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
    diabetes
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.349
 3rd Qu.:1.000
 Max.   :1.000
There are no NA values, hooray!
Next, let's look at the descriptive statistics for the "diabetic" and "non-diabetic" groups.
1. Diabetic group
got <- diabetes %>%
  filter(diabetes == 1) %>%
  summary()
got
pregnant glucose pressure triceps
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.750 1st Qu.:119.0 1st Qu.: 66.00 1st Qu.: 0.00
Median : 4.000 Median :140.0 Median : 74.00 Median :27.00
Mean : 4.866 Mean :141.3 Mean : 70.82 Mean :22.16
3rd Qu.: 8.000 3rd Qu.:167.0 3rd Qu.: 82.00 3rd Qu.:36.00
Max. :17.000 Max. :199.0 Max. :114.00 Max. :99.00
insulin mass pedigree age diabetes
Min. : 0.0 Min. : 0.00 Min. :0.0880 Min. :21.00 Min. :1
1st Qu.: 0.0 1st Qu.:30.80 1st Qu.:0.2625 1st Qu.:28.00 1st Qu.:1
Median : 0.0 Median :34.25 Median :0.4490 Median :36.00 Median :1
Mean :100.3 Mean :35.14 Mean :0.5505 Mean :37.07 Mean :1
3rd Qu.:167.2 3rd Qu.:38.77 3rd Qu.:0.7280 3rd Qu.:44.00 3rd Qu.:1
Max. :846.0 Max. :67.10 Max. :2.4200 Max. :70.00 Max. :1
2. Non-diabetic group
none <- diabetes %>%
  filter(diabetes == 0) %>%
  summary()
none
pregnant glucose pressure triceps
Min. : 0.000 Min. : 0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.000 1st Qu.: 93 1st Qu.: 62.00 1st Qu.: 0.00
Median : 2.000 Median :107 Median : 70.00 Median :21.00
Mean : 3.298 Mean :110 Mean : 68.18 Mean :19.66
3rd Qu.: 5.000 3rd Qu.:125 3rd Qu.: 78.00 3rd Qu.:31.00
Max. :13.000 Max. :197 Max. :122.00 Max. :60.00
insulin mass pedigree age diabetes
Min. : 0.00 Min. : 0.00 Min. :0.0780 Min. :21.00 Min. :0
1st Qu.: 0.00 1st Qu.:25.40 1st Qu.:0.2298 1st Qu.:23.00 1st Qu.:0
Median : 39.00 Median :30.05 Median :0.3360 Median :27.00 Median :0
Mean : 68.79 Mean :30.30 Mean :0.4297 Mean :31.19 Mean :0
3rd Qu.:105.00 3rd Qu.:35.30 3rd Qu.:0.5617 3rd Qu.:37.00 3rd Qu.:0
Max. :744.00 Max. :57.30 Max. :2.3290 Max. :81.00 Max. :0
ggplot(PimaIndiansDiabetes, aes(glucose, insulin)) +
geom_point(aes(color = diabetes))
ggplot(data = PimaIndiansDiabetes) +
geom_bar(mapping = aes(x = pregnant, fill = factor(diabetes)))
ggplot(data = PimaIndiansDiabetes) +
geom_bar(mapping = aes(x = pressure, fill = factor(diabetes)))
# Plotting code
Pressure shows relatively little pattern, but for pregnant and insulin the proportion of diabetics rises as the values increase; glucose is the most striking of all, so glucose should be a very important variable!
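To back the visual impression with a number (my own sketch, using only base R), we can compare the group medians of glucose directly; per the group summaries above they should come out to 107 (neg) and 140 (pos):

```r
library(mlbench)
data(PimaIndiansDiabetes)

# Median glucose per outcome group confirms the separation seen in the plots
tapply(PimaIndiansDiabetes$glucose, PimaIndiansDiabetes$diabetes, median)
# neg: 107, pos: 140
```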
1. Decision Tree
library(rpart)
library(rpart.plot)
tree.model <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)
rpart.plot(tree.model)
Once the tree is drawn, we again find that the first splitting variable is glucose. Feeding the entire dataset into the decision tree does raise overfitting concerns, but for descriptive analysis a decision tree is indeed a very convenient way to surface the main variables; for actual prediction, we still need other tools.
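To address the overfitting worry, here is a sketch (my own addition, reusing the same seed and split as earlier) that fits the tree on the training set only and scores it on the held-out test set:

```r
library(mlbench)
library(rpart)
data(PimaIndiansDiabetes)
diabetes <- PimaIndiansDiabetes

# Same 80/20 split as before
set.seed(22)
train.index <- sample(x = 1:nrow(diabetes), size = ceiling(0.8 * nrow(diabetes)))
train <- diabetes[train.index, ]
test <- diabetes[-train.index, ]

# Fit on the training set only, then measure held-out accuracy
tree.fit <- rpart(diabetes ~ ., data = train)
tree.pred <- predict(tree.fit, newdata = test, type = "class")
mean(tree.pred == test$diabetes)
```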
2. RandomForest
library(randomForest)
library(ggthemes)  # for theme_few() used below
rf_model <- randomForest(diabetes ~ ., data = train)
rf.fitted = predict(rf_model)
print(rf_model)
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variables') +
coord_flip() +
theme_few()
Here we simply feed the variables into the function and visualize the importance scores.
We can see that glucose is the most important variable,
followed by mass and age. Beyond making predictions, random forest also doubles as a variable-selection tool!
RFprediction <- predict(rf_model, newdata = test)  # this step is implied by the output below
RFresult <- test %>% cbind(RFprediction)
table(RFresult$RFprediction, RFresult$diabetes)

        0   1
  neg  84  20
  pos  15  34
We can see that the RF prediction accuracy is about 77.12%.
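The 77.12% figure comes straight from the confusion matrix; as a small helper (my own sketch), the accuracy can be computed from the table itself:

```r
# Accuracy from the RF confusion matrix reported above:
# correct predictions are on the diagonal (84 + 34) out of 153 test rows
conf <- matrix(c(84, 15, 20, 34), nrow = 2,
               dimnames = list(pred = c("neg", "pos"), truth = c("0", "1")))
accuracy <- sum(diag(conf)) / sum(conf)
round(100 * accuracy, 2)  # 77.12
```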
3. XGBoost
The weapon of choice in recent Kaggle competitions, so let's put it to the test. Before using it, note that the data frame needs to be converted!
library(xgboost)
train$diabetes <- ifelse(train$diabetes == "pos", 1, 0)
test$diabetes <- ifelse(test$diabetes == "pos", 1, 0)
dtrain <- xgb.DMatrix(data = as.matrix(train[, 1:8]), label = train$diabetes)
dtest <- xgb.DMatrix(data = as.matrix(test[, 1:8]), label = test$diabetes)
xg.model <- xgboost(data = dtrain,                 # train matrix
                    eval.metric = 'logloss',       # model minimizes log loss
                    objective = "binary:logistic", # binary classification
                    # tuning parameters
                    max.depth = 8,             # vary btwn 3-15
                    eta = 0.1,                 # vary btwn 0.1-0.3
                    nthread = 5,               # increase this to improve speed
                    subsample = 1,             # vary btwn 0.8-1
                    colsample_bytree = 0.5,    # vary btwn 0.3-0.8
                    lambda = 0.5,              # vary btwn 0-3
                    alpha = 0.5,               # vary btwn 0-3
                    min_child_weight = 3,      # vary btwn 1-10
                    nround = 30                # vary btwn 100-3000 based on max.depth, eta, subsample and colsample
)
xg_prediction <- predict(xg.model, dtest)
xg_prediction <- ifelse(xg_prediction >= 0.55, 1, 0)
xg_diabetes <- as.numeric(as.character(xg_prediction))
result <- test %>% cbind(xg_diabetes)
table(result$diabetes, result$xg_diabetes)

     0  1
  0 85 14
  1 25 29
Oops, the accuracy here is not higher than with RF (74.5% < 77.12%).
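One likely culprit is the hand-picked hyperparameters. Rather than hard-coding nround = 30, cross-validation can choose the number of boosting rounds; here is a standalone sketch (my own addition, and the parameter values are illustrative, not tuned):

```r
library(mlbench)
library(xgboost)
data(PimaIndiansDiabetes)
d <- PimaIndiansDiabetes
d$diabetes <- ifelse(d$diabetes == "pos", 1, 0)
dall <- xgb.DMatrix(data = as.matrix(d[, 1:8]), label = d$diabetes)

# 5-fold CV with early stopping picks the number of boosting rounds
set.seed(22)
cv <- xgb.cv(data = dall,
             objective = "binary:logistic",
             eval_metric = "logloss",
             max_depth = 8, eta = 0.1,
             nrounds = 200, nfold = 5,
             early_stopping_rounds = 10,
             verbose = 0)
cv$best_iteration  # use this as nround when refitting on the training set
```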
4. Conclusion
This post introduced three tree models in one go. To go further, there is still plenty of room to work on variable selection and tuning; for example, we could try binning variables such as age and pregnancies and examining each segment, which should push the accuracy up considerably. On the modeling side, Kaggle competitions centered on classification almost always bring in
ensembling: in this post, since Random Forest scored higher, we would give RF a larger weight and then combine it with the XGBoost results, a perfect illustration of "three cobblers beat one Zhuge Liang" (many heads are better than one). There will be much more to say about ensembling later; there is still a lot to do!
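The weighting idea above can be sketched as a tiny soft-voting helper (entirely my own illustration; the weights and probabilities are made up):

```r
# Weighted soft vote: give the stronger model (RF here) a larger weight
soft_vote <- function(p_rf, p_xgb, w_rf = 0.6) {
  p <- w_rf * p_rf + (1 - w_rf) * p_xgb   # blended probability of class 1
  ifelse(p >= 0.5, 1, 0)                  # threshold into a class label
}

# Toy predicted probabilities for three observations
soft_vote(c(0.9, 0.5, 0.2), c(0.7, 0.6, 0.1))  # → 1 1 0
```

In a real pipeline, w_rf would be chosen on a validation set rather than fixed at 0.6.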
If you spot a problem or have a suggestion, feel free to leave a comment!
Note:
I'm also attaching an article whose explanation I found excellent:
https://towardsdatascience.com/basic-ensemble-learning-random-forest-adaboost-gradient-boosting-step-by-step-explained-95d49d1e2725