Kaggle Titanic Survival Prediction: Top 6%

Topic: Kaggle Titanic survival prediction (top 6%) and a short share of my R learning journey

Audience: beginners interested in machine learning

My background: MS in Business Administration from a national university, and almost a year of experience with R~


<Preface>

I was originally a social-science student who knew only SPSS plus the Process add-on. One day I glimpsed R installed on a professor's desktop, got curious, fed it to Google, and discovered just how much bigger the world was, and so began my self-taught journey into R.....
After getting started, I used R to redo the analyses I already knew from SPSS, then gradually wandered into machine learning territory, which made it clear I was a very small fish indeed. Having made peace with my own smallness, I've been studying hard ever since.....
After a stretch of theory and coding practice, the next step was to tackle whatever open data I could reach. Kaggle's Titanic survival challenge is a great fit for beginners, and there aren't many top-10% write-ups that work through Titanic in R QAQ, so I decided to produce this post.


<Titanic>


The analysis proceeds in these main steps:
A. Initial EDA on the full dataset
Before analyzing any data, the most basic question is: how well do we understand what the data "looks like"? Beyond basic descriptive statistics, visualizing it lets us grasp the relationships between variables at a glance, and it becomes an important reference for the feature engineering later on.

B. Tidying the data
Just as important as EDA: cleaning up NA (missing) values, wrangling messy strings, and knocking the data frame into shape.

Steps A and B are the Data Understanding and Data Preparation stages of the data mining process. In practice, data work spends 80-plus percent of its time on these two steps (sometimes far more!!!!!!); it really is a major undertaking.

C. Choosing the model
(cheat-sheet figure from: https://zi.media/@yidianzixun/post/npCNEk)


   
This analysis trains a model on the training data and then makes a binary survived-or-not prediction on the remaining testing data, i.e., supervised learning, so the candidates at the time were random forest, logistic regression, and SVM, all of which had been tried on the Kaggle boards. Random forest splits on the Gini impurity function (essentially a measure of information content), which is procedurally much simpler than SVM's separating-hyperplane approach, and logistic regression is not as comfortable with categorical features as random forest is. So RF got the nod! (A quick sketch of Gini impurity follows the cheat-sheet steps below.)
     
Or just read it off the cheat sheet:
1. it's not dimensionality reduction  2. it's not unsupervised learning  3. so we arrive at predicting a category, and from there both the left and right branches are workable XD
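As an aside, here is a toy sketch of the Gini impurity mentioned above (the proportions are illustrative, not from the Titanic data):
# Gini impurity of a node: G = 1 - sum(p_k^2), where p_k are the class proportions.
# A pure node scores 0; a 50/50 binary node scores 0.5 (the maximum for two classes).
gini <- function(p) 1 - sum(p^2)
gini(c(0.5, 0.5))  # 0.50 -> completely mixed
gini(c(0.9, 0.1))  # 0.18 -> much purer; splits that move nodes this way carry information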

D. Modeling and prediction
Actually build the model, evaluate the results, do feature engineering, and finally predict.

--------------------------------- Analysis start ----------------------------
First get the packages and data ready:
# load plyr before dplyr so plyr doesn't mask dplyr's verbs
library('plyr')         # revalue()
library('ggplot2')      # plotting
library('ggthemes')     # extra themes (theme_few)
library('scales')       # dollar_format()
library('dplyr')        # data manipulation
library('randomForest') # the random forest model
library('corrplot')     # correlation plot
library('tidyverse')    # readr (read_csv), stringr (str_sub), etc.
train <- read_csv('train.csv')
test  <- read_csv('test.csv')
full  <- bind_rows(train, test)  # stack train + test for joint cleaning and feature work
Now let's start exploring the data with some EDA~
str(full)
summary(full)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1309 obs. of  12 variables:
 $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : num  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  NA "C85" NA "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
PassengerId      Survived          Pclass     
 Min.   :   1   Min.   :0.0000   Min.   :1.000  
 1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000  
 Median : 655   Median :0.0000   Median :3.000  
 Mean   : 655   Mean   :0.3838   Mean   :2.295  
 3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000  
 Max.   :1309   Max.   :1.0000   Max.   :3.000  
                NA's   :418                     
     Name               Sex                 Age       
 Length:1309        Length:1309        Min.   : 0.17  
 Class :character   Class :character   1st Qu.:21.00  
 Mode  :character   Mode  :character   Median :28.00  
                                       Mean   :29.88  
                                       3rd Qu.:39.00  
                                       Max.   :80.00  
                                       NA's   :263    
     SibSp            Parch          Ticket         
 Min.   :0.0000   Min.   :0.000   Length:1309       
 1st Qu.:0.0000   1st Qu.:0.000   Class :character  
 Median :0.0000   Median :0.000   Mode  :character  
 Mean   :0.4989   Mean   :0.385                     
 3rd Qu.:1.0000   3rd Qu.:0.000                     
 Max.   :8.0000   Max.   :9.000                     
                                                    
      Fare            Cabin             Embarked        
 Min.   :  0.000   Length:1309        Length:1309       
 1st Qu.:  7.896   Class :character   Class :character  
 Median : 14.454   Mode  :character   Mode  :character  
 Mean   : 33.295                                        
 3rd Qu.: 31.275                                        
 Max.   :512.329                                        
 NA's   :1                                        
Knowing the overall structure and each variable's type tells us which conversions later steps (visualization, other statistics) will need; for example, Survived will be converted to a factor.
The summary gives a rough picture of the data and of how many missing values there are (Cabin is still a character variable here, so its NA count is not shown).

NA counts by variable:
Survived: 418 (these are exactly the testing data)
Age: 263
Fare: 1
Cabin: too many


(1) Age and Survived
ggplot(full[1:891,], aes(Age,fill = factor(Survived)))+
  geom_histogram(bins = 40)+
  theme_few()+
  xlab("Age")+
  scale_fill_discrete(name = "Survived")+
  ggtitle("Age / Survived")
Notice that the youngest passengers, and to some extent the oldest, survived at somewhat higher rates.

(2) Sex and Survived

ggplot(full[1:891,], aes(Sex,fill = factor(Survived)))+
  geom_bar(stat = "count",position = "dodge")+
  theme_bw()+
  xlab("sex")+
  ylab("count")+
  scale_fill_discrete(name = "survived")+
  ggtitle("Sex / Survived")


Here the male-female difference in survival is unmistakable, so Sex will clearly be an important feature..... (sad for the men)

(3) Sex + Age and Survived

ggplot(full[1:891,], aes(Age, fill = factor(Survived))) + 
  geom_histogram(bins = 30)+
  theme_few()+
  xlab("Age")+
  ylab("count")+
  facet_grid(.~Sex)+
  scale_fill_discrete(name = "survived")+
  ggtitle("Age/Sex/Survived")


Little girls and elderly women survived at high rates, while young men, middle-aged men, and older men fared tragically............
This also shows that Sex and Age are both important features for Survived. The male panel makes it especially clear that survival differs a lot across age ranges, so binning Age into groups is an option worth considering, as sketched below.
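For instance, a minimal binning with cut() (the cut points and the AgeGroup column name are just illustrative; later in the post I settle on a simpler Child/Adult split instead):
full$AgeGroup <- cut(full$Age, breaks = c(0, 12, 18, 60, Inf),
                     labels = c("Child", "Teen", "Adult", "Senior"))  # NAs in Age stay NA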

(4) Pclass and Survived

ggplot(full[1:891,], aes(Pclass,fill = factor(Survived)))+
  geom_bar(stat = "count",position = "dodge")+
  theme_bw()+
  xlab("Pclass")+
  ylab("count")+
  scale_fill_discrete(name = "survived")+
  ggtitle("Pclass / Survived")
Pclass 1's survival rate is above 50%, Pclass 2's is close to 50%, and Pclass 3's is under a third.....
So cabin class is another important feature.

(5) Pclass / Sex and Survived

ggplot(full[1:891,], aes(Pclass,fill = factor(Survived)))+
  geom_bar(stat = "count")+
  theme_bw()+
  xlab("Pclass")+
  facet_grid(.~Sex)+
  ylab("count")+
  scale_fill_discrete(name = "survived")+
  ggtitle("Pclass vs Sex vs Survived")


Nearly all the women in Pclass 1 survived, likewise Pclass 2, while Pclass 3 was about half and half. Among the men, Pclass 1 had the highest survival rate and Pclass 2 the lowest.

(6) Pclass / Fare

ggplot(full[1:891,], aes(x = Fare, y = Pclass)) + 
    geom_jitter(aes(colour = factor(Survived))) + 
    theme_few()+
    theme(legend.title = element_blank())+
    labs(x = "Fare", y = "Pclass", title = "Fare vs Pclass")+
    scale_x_continuous(name="Fare", limits=c(0, 280), breaks=c(0, 40, 80, 120, 160, 200, 240, 280))
So what does a different Pclass actually mean?
Comparing fare against survival makes it clear: Class 1 has the most expensive tickets and high survival; Class 3 has the lowest survival, with fares about the same as Class 2. Alas, the world was never fair: pay enough to stay in Class 1 and your odds of survival rise sharply. Being female helped, and spending more helped. Class 1 was basically the VIP section!


(7) Turning Name into Title
full$Title <- gsub("(.*, )|(\\..*)","",full$Name)  
table(full$Sex, full$Title)  
Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss
  female    0   0   0    1   1        0    1     0      0  260
  male      1   4   1    0   7        1    0     2     61    0
        
         Mlle Mme  Mr Mrs  Ms Rev Sir the Countess
  female    2   1   0 197   2   0   0            1
  male      0   0 757   0   0   8   1            0
The string passed to gsub is a regular expression: "(.*, )" strips everything up through the comma and space, and "(\\..*)" strips everything from the period onward, leaving the honorific as Title; we then cross-tabulate it against Sex.
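Applied to the first name in the data, for example:
gsub("(.*, )|(\\..*)", "", "Braund, Mr. Owen Harris")
# [1] "Mr"   <- "(.*, )" removed "Braund, "; "(\\..*)" removed ". Owen Harris"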

officer <- c('Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev')
royalty <- c('Dona', 'Lady', 'the Countess','Sir', 'Jonkheer')
full$Title[full$Title == 'Mlle']        <- 'Miss' 
full$Title[full$Title == 'Ms']          <- 'Miss'
full$Title[full$Title == 'Mme']         <- 'Mrs' 
full$Title[full$Title %in% royalty]  <- 'Royalty'
full$Title[full$Title %in% officer]  <- 'Officer'

table(full$Sex, full$Title)
           Master Miss  Mr Mrs Officer Royalty
  female      0  264   0 198       1       3
  male       61    0 757   0      23       2
The rarer titles are then consolidated into Royalty, Officer, and so on, alongside Master, Miss, Mr, and Mrs.

full$Surname <- sapply(full$Name,  
                       function(x) strsplit(x, split = '[,.]')[[1]][1])
Then extract each passenger's surname from Name as a new variable.

Next, let's pull Title out and look at the Survived breakdown:
ggplot(full[1:891,], aes(Title,fill = factor(Survived)))+
  geom_bar(stat = "count")+
  theme_bw()+
  xlab("Title")+
  ylab("count")+
  scale_fill_discrete(name = "survived")+
  ggtitle("title vs Survived")
Survival rates differ clearly from one title to another, so Title also made my list of candidate features.

(8) Family size
SibSp (siblings + spouse) + Parch (parents + children) + me = family size

full$Fsize <- full$SibSp + full$Parch + 1
ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
  geom_bar(stat='count', position='dodge') +
  scale_x_continuous(breaks=c(1:11)) +
  xlab('Family Size') +
  ylab("Count") +
  theme_few()+
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Family Size vs Survived")

Embarked was later plotted in the same way, so I won't dwell on it; a sketch is below.
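(A sketch of that plot, following the same pattern as the Sex bar chart above:)
ggplot(full[1:891,], aes(Embarked, fill = factor(Survived)))+
  geom_bar(stat = "count", position = "dodge")+
  theme_bw()+
  xlab("Embarked")+
  ylab("count")+
  scale_fill_discrete(name = "Survived")+
  ggtitle("Embarked / Survived")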


B. Data wrangling

First, let's see how many NA values there are:
sapply(full,function(x) sum(is.na(x)))
 PassengerId    Survived      Pclass        Name         Sex 
          0         418           0           0           0 
        Age       SibSp       Parch      Ticket        Fare 
        263           0           0           0           1 
      Cabin    Embarked       Title     Surname       Fsize 
       1014           2           0           0           0 
sapply collects the results of function(x) into a vector. Survived has 418 missing values (exactly the test data); next come the 263 NAs in Age.
Here we take the rows whose Age is not missing and fit a decision tree on them:
library(rpart)
library(rpart.plot)  # needed for rpart.plot() below
age.model <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + Fsize, data=full[!is.na(full$Age), ], method='anova')
Then plot the fitted tree and use the model to predict the missing ages:
rpart.plot(age.model)
full$Age[is.na(full$Age)] <- predict(age.model, full[is.na(full$Age), ])

That takes care of Age's NA values. Next up is Fare, with a single missing value:
which(is.na(full$Fare))
full_1044 <- full[full$PassengerId == 1044,]
full_1044
   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin Embarked Title Surname
        <int>    <int>  <int> <chr> <chr> <dbl> <int> <int> <chr>  <dbl> <chr> <chr>    <chr> <chr>  
1        1044       NA      3 Stor~ male   60.5     0     0 3701      NA NA    S        Mr    Storey 
# ... with 1 more variable: Fsize <dbl>
From the earlier EDA we know Pclass and Fare are strongly related, so let's look at the fare distribution when Pclass = 3:
ggplot(full[full$Pclass == '3', ], 
       aes(x = Fare)) +
  geom_density(fill = 'lightgrey', alpha=0.4) + 
  geom_vline(aes(xintercept=median(Fare, na.rm=T)),
             colour='darkred', linetype='dashed', lwd=1) +
  xlab('Fare') +
  ggtitle("Pclass = 3")+
  ylab("Density") +
  scale_x_continuous(labels=dollar_format())

full$Fare[1044] <- median(full[full$Pclass == '3', ]$Fare, na.rm = TRUE)
# fill the NA with the Pclass-3 median fare
Having plotted the fare distribution for Pclass 3, we take its median and use it to fill the missing value.
Next, the missing values in Embarked and Cabin:
which(is.na(full$Embarked))
[1]  62 830
Inspect passengers 62 and 830 (subsetting by PassengerId, as with 1044 above):
full_62 <- full[full$PassengerId == 62,]
full_62
# A tibble: 1 x 15
  PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin Embarked Title Surname
        <int>    <int>  <int> <chr> <chr> <dbl> <int> <int> <chr>  <dbl> <chr> <chr>    <chr> <chr>  
1          62        1      1 Icar~ fema~  38.0     0     0 113572  80.0 B28   NA       Miss  Icard  
# ... with 1 more variable: Fsize <dbl>
full_830 <- full[full$PassengerId == 830,]
full_830
# A tibble: 1 x 15
  PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin Embarked Title Surname
        <int>    <int>  <int> <chr> <chr> <dbl> <int> <int> <chr>  <dbl> <chr> <chr>    <chr> <chr>  
1         830        1      1 Ston~ fema~  62.0     0     0 113572  80.0 B28   NA       Mrs   Stone  
# ... with 1 more variable: Fsize <dbl>
Both paid a fare of 80, so let's look at the relationship between Embarked and Fare:

ggplot(full[!is.na(full$Embarked),],aes(x=Embarked, y=Fare, fill=factor(Pclass))) +
  geom_boxplot() + 
  geom_hline(aes(yintercept=80), color='red', linetype='dashed', lwd=2) +
  xlab('embarked') +
  ylab('fare') +
  ggtitle('embarked vs fare')
Both missing values look most likely to be 'C'. Cabin, on the other hand, is missing far too often, so we just assign a default:

full$Embarked[c(62,830)] <- "C"

full$Cabin <- as.factor(sapply(full$Cabin, function(x) ifelse(is.na(x), 'X', str_sub(x, start = 1, end = 1))))  # keep only the deck letter; 'X' marks unknown
Before this top-6% result I had actually submitted to Kaggle once before, scoring top 16%. After reading many great kernels I did some feature engineering on a few variables, which is what comes next:

Here, ages under 18 are labeled Child and 18 and over Adult;
Family Size is likewise binned further, into Alone, Small, and Big.
My first submission only adjusted Family Size; this time Age gets adjusted as well.

full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'
table(full$Child, full$Survived)
full$FsizeD[full$Fsize == 1] <- 'Alone'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'Small'
full$FsizeD[full$Fsize > 4] <- 'Big'
table(full$FsizeD, full$Survived)
With all that groundwork done, let's look at how the variables relate before building the model!

corr_data <- full[1:891,]
## to run a correlation analysis, the variables must be
## recoded and converted to numeric type
corr_data$Embarked <- revalue(corr_data$Embarked, 
                              c("S" = 1, "Q" = 2, "C" = 3))
corr_data$Sex <- revalue(corr_data$Sex, 
                         c("male" = 1, "female" = 2))
corr_data$Title <- revalue(corr_data$Title, 
                           c("Mr" = 1, "Master" = 2,"Officer" = 3, 
                             "Mrs" = 4,"Royalty" = 5,"Miss" = 6))
corr_data$FsizeD <- revalue(corr_data$FsizeD, 
                            c("Small" = 1, "Alone" = 2, "Big" = 3))
corr_data$Child <- revalue(corr_data$Child, 
                           c("Adult" = 1, "Child" = 2))
corr_data$FsizeD <- as.numeric(corr_data$FsizeD)
corr_data$Child <- as.numeric(corr_data$Child)
corr_data$Sex <- as.numeric(corr_data$Sex)
corr_data$Embarked <- as.numeric(corr_data$Embarked)
corr_data$Title <- as.numeric(corr_data$Title)
corr_data$Pclass <- as.numeric(corr_data$Pclass)
corr_data$Survived <- as.numeric(corr_data$Survived)

corr_data <-corr_data[,c("Survived", "Pclass", "Sex", 
                         "FsizeD", "Fare", "Embarked","Title","Child")]

str(corr_data)
mcorr_data <- cor(corr_data)
# corrplot spares us from matrix-wrangling the correlations and layering up a ggplot by hand
corrplot(mcorr_data,method="circle")

D. Modeling

full$Child  <- factor(full$Child)
full$Sex  <- factor(full$Sex)
full$Embarked  <- factor(full$Embarked)
full$Title  <- factor(full$Title)
full$Pclass  <- factor(full$Pclass)
full$FsizeD  <- factor(full$FsizeD)
full1 <- full[,-9]        # drop Ticket
full_mod <- full1[,-10]   # drop Cabin
train <- full_mod[1:891,]
test <- full_mod[892:1309,]
library(randomForest)
set.seed(110)
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + 
                           FsizeD + Child, data = train)
rf.fitted = predict(rf_model)
print(rf_model)
# confusion matrix
Call:
 randomForest(formula = factor(Survived) ~ Pclass + Sex + Fare +      Embarked + Title + FsizeD + Child, data = train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 16.95%
Confusion matrix:
    0   1 class.error
0 502  47   0.0856102
1 104 238   0.3040936
(There will be more on the confusion matrix in a future post.)
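As a quick sanity check, the numbers above can be recovered by hand from the confusion matrix:
(47 + 104) / (502 + 47 + 104 + 238)  # 151/891 = 0.1695, the 16.95% OOB error rate
47 / (502 + 47)                      # 0.0856, class-0 error: few deaths misclassified
104 / (104 + 238)                    # 0.3041, class-1 error: ~30% of survivors missed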


Finally, predict on the test data:
prediction <- predict(rf_model, test)
solution <- data.frame(Survived = prediction, PassengerID = test$PassengerId)
write.csv(solution, file = 'rf_model_sol.csv', row.names = F)
# write out the predictions
After submitting, my score jumped from 0.799 to 0.81339 and my rank climbed nearly 1000 places (from 1566).
That says the feature adjustments really worked.


# Importance
With the analysis done, let's look back at how important each feature was, using the importance() function.
Features are ranked by mean decrease in Gini impurity (Gini impurity measures uncertainty: the more information a split provides, the smaller it gets), so the features whose splits reduce it most are the most important; finally we plot the ranking~
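For reference, the standard definitions behind that measure (my paraphrase, not from the original kernels): with $p_k$ the class proportions in node $t$,

$$G(t) = 1 - \sum_k p_k^2, \qquad \Delta G(s,t) = G(t) - \frac{n_L}{n_t} G(t_L) - \frac{n_R}{n_t} G(t_R),$$

and a variable's MeanDecreaseGini is the sum of $\Delta G$ over every split made on that variable, averaged over the trees in the forest.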

importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))


rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))


ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()




The value of an analysis often lies in surfacing what first impressions can't, giving you that "what, really?!" moment. Of all the model's features, the most important turned out to be Title, with Sex not even in first place!
That underlines how much feature engineering and selection matter: hyperparameter tuning hasn't even entered the picture yet, and the ranking has already moved up.
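To give a flavor of what that tuning step might look like, here is a minimal sketch using randomForest::tuneRF to search mtry by OOB error (the predictors vector and starting values are illustrative, not something I actually ran):
set.seed(110)
predictors <- c("Pclass", "Sex", "Fare", "Embarked", "Title", "FsizeD", "Child")
tuneRF(x = as.data.frame(train[, predictors]), y = factor(train$Survived),
       mtryStart = 2, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)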


Takeaways:

It's been almost a year since I started with R. At first I tried porting the analyses I already knew from SPSS, and promptly found that reading files was a problem, installing packages was a problem, and analyses refusing to run on sequences I'd generated myself took ages to debug (data frame issues). Let me straight-up recommend Hadley Wickham's R for Data Science: the fundamentals it covers (plotting, data wrangling) are practical and effective, very beginner-friendly, and it still holds up as a reference once you're past that stage.
Online resources such as R-bloggers and Stack Overflow (I've yet to hit a problem without a solution there, it's uncanny), or paid online courses, will all teach you plenty of hands-on technique.

Later, as I ran into more ML algorithms, I realized my math fell a long way short, so I went back to reviewing linear algebra and then started grinding through Hsuan-Tien Lin's Machine Learning Foundations and Machine Learning Techniques courses (stuck again, because of the math); at the same time I started Hung-yi Lee's machine learning course (worth a try if your math background is on the weaker side). There really is a great deal left to study......
Besides sharpening my skills, I'll be sharing more study notes on this blog.

To sum it all up:
I'm still very green.

Keep at it!

References:

1. Titanic Survival Prediction in R, Kaggle rank 220
   https://zhuanlan.zhihu.com/p/29130905

2. Titanic Survival Prediction, Kaggle rank 321 (top 4%)
   https://www.getit01.com/p2018013033255887/

3. R for Data Science
   https://r4ds.had.co.nz/

4. R and Business Analytics (R語言和商業分析)
   https://goo.gl/ARitYt
