Keyword Extraction in Text Mining: TF-IDF, TextRank, RAKE

Keyword Extraction

This time we extend into text mining: we will try to extract keywords from text data. The main goal is to pick out the words that best represent a document.
Common application scenarios include:
1. Extracting search keywords for search engine optimization (SEO)

The screenshot above is from a blog I visit often; the "hot key words" panel in its bottom-right corner is exactly this kind of extracted-keyword application.

2. Improving the customer experience
Customers leave feedback in reviews after actually using a product. If we first extract keywords from that content, customer service can anticipate what customers need and prepare one step ahead.

How to do it? (1) Bag-of-Words representation

How is it actually computed? A simple example:
Doc1. 我愛豬腳麵線
Doc2. 你愛麻油雞
Doc3. 我愛你,你愛我
Doc4. 你我愛麵線

From these four documents we can count every distinct word and its frequency, and that alone is enough to turn the vocabulary into numbers!
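Here is a minimal R sketch of that counting; the segmentation of the toy documents is written out by hand, and the object and column names are purely for illustration:

#Bag-of-Words on the toy corpus
library(tidyverse)

# The four toy documents, segmented by hand purely for illustration
docs <- tibble(
  doc  = c("Doc1", "Doc2", "Doc3", "Doc4"),
  text = c("我 愛 豬腳 麵線",
           "你 愛 麻油雞",
           "我 愛 你 你 愛 我",
           "你 我 愛 麵線")
)

# One row per (document, word) with its count: the bag-of-words representation
bow <- docs %>%
  separate_rows(text, sep = " ") %>%
  count(doc, word = text)

# Spread into a document-term table: documents as rows, words as columns
bow %>% pivot_wider(names_from = word, values_from = n, values_fill = 0)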



Hands-on practice: the PTT Movie board


1. Load the data & clean it (stop words / customized words) & word-frequency analysis

#library and reading data
library(tidyverse)
library(jiebaR)
library(widyr)
library(tidytext)
library(tm)
library(igraph)
library(ggnetwork)
library(textrank)

Movie <- read_csv("movie板 電影心得.csv")
head(Movie, 5)
Movie$label <- ifelse(Movie$label==0,"好評","差評") 
Movie$label <- as.factor(Movie$label)

Here we can see the full PTT articles. The field we analyze is content, but it contains far too many useless symbols and words that do not help the analysis (e.g. 雷文, 防雷, 不負責任, etc.), so they must be cleaned up before the analysis.



Use gsub() to substitute away the extra symbols, and at the same time set up stop words to filter out terms that carry no real meaning!

#remove symbol and stopwords
my.symbols <- c("《", "》", "【", "】", "|", "(",")",
                "®", "\n", "?", "@", "#", "?", "!", "!","~",":",":",",","。","//")
# wrap the symbols in a character class so regex metacharacters
# such as "(", ")", "|" and "?" are matched literally
Movie$content <- gsub(
  paste0("[", paste(my.symbols, collapse = ""), "]"),
  "",
  Movie$content
)

my_stop_words <- c( "電影", "雷文", "防雷線","雷文防雷資訊頁", "我是防雷線", "主文分隔線", "一個", "他們", "可以", "因為",   "自己", "我們", "沒有", "就是", "最後", "覺得", "一個", "不是",   "什麼", "可以", "沒有", "真的", "這部", "就是", "覺得",  "只是", "其實", "所以", "這個", "還是", "雖然",  "一樣", "只是", "完全", "所以", "知道", "這樣", "然後", "還是",  "可能", "甚至", "看到", "喜歡", "開始", "com", "如果", "感覺",  "應該", "不過", "已經", "還有", "出來", "之後", "不會", "這麼",   "但是", "時候", "這種", "根本", "非常", "怎麼", "到底", "一下", "此片", "這些", "很多", "一部", 
                    "來說", "一直", "兩句", "不計",  "分為", "藉此", "還會", "做到","整部","劇情","主角","觀眾","故事","角色","導演","社會","人生")

Movie$content <- gsub(
  paste(my_stop_words, collapse = "|"),
  " ",
  Movie$content)

Remove English letters and digits.
#Remove English & Number
Movie$content <- gsub("[A-Za-z]", "", Movie$content) # "[A-z]" would also match [ \ ] ^ _ ` so spell out both ranges
Movie$content <- removeNumbers(as.character(Movie$content))
head(Movie[3])


jiebaR sometimes cannot segment particular terms correctly. For example, "蜘蛛人" easily gets split into "蜘蛛" and "人", so we also need to tell the program about these specific terms.
(I did not add 彭彭 and 丁滿 at first, which made "彭丁滿" keep showing up in the later analysis.)

Then segment the content into tokens separated by spaces.
#customized word
library(jiebaR)
wk <- worker(stop_word = jiebaR::STOPPATH)
customized_terms <- c("劉德華", "奉俊昊", "玩具總動員", "神隱少女", "湯婆婆","蜘蛛人","復四","漫威","美麗華","彭彭","丁滿",
                      "擊殺數", "助攻數", "娃娃鬼", "輔助鬼", "感情線", "小情小愛",  "心得文", "新手上路", "蜘蛛粉", "哭爆", "靠譜歌王", "披頭四", "愛情戲",  "鋼鐵人", "駭客任務 ", "片商", "寄生上流", "電流大戰", "從前有個好來塢",  "超級英雄")
new_user_word(wk, customized_terms)

Movie.description <- tibble(
  label = Movie$label, # label
  title = Movie$article_title, # title
  content = sapply(
    as.character(Movie$content), # vector to be segmented
    function(char) segment(char, wk) %>% paste(collapse = " ") #Segment
  )
)
head(Movie.description, 5)


Next, convert to the tidy text format, where every row is a single word rather than a whole article.
Each meaningful word becomes one row, i.e. a token.
Once we have the tokens, we can use summarise() to count how often each token appears, and then visualize the result!

#tidytext
library(tidytext)
tok99 <- function(t) str_split(t, "[ ]{1,}") # split on one or more spaces
tidy_content <- Movie.description %>%
  unnest_tokens(word, content, token = tok99)
tidy_content <- tidy_content[nchar(tidy_content$word) > 1, ] # keep only tokens longer than one character
head(tidy_content, 5)

content <- tidy_content %>%
  group_by(label, word) %>%
  summarise(word_frequency = n()) # word frequency per label

# visualize word frequency
content %>%
  group_by(label) %>%
  top_n(10, word_frequency) %>%
  arrange(word_frequency) %>%
  ggplot(aes(word, word_frequency, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, ncol = 2, scales = "free") +
  coord_flip() +
  theme_bw() +
  theme(text = element_text(family = "黑體-繁 中黑", size = 12),
        axis.text.x = element_text(angle = 60, hjust = 1))

The word-frequency visualization is out, and we can make some rough judgments: 蜘蛛人 (Spider-Man) seems to get mixed reviews, and the positive reviews contain many words related to 寄生上流 (Parasite), such as 金家 and 社長. The negative reviews do not give me many clues, although the villains in this period's films seem to have left people unsatisfied.


2. TF-IDF

The same example:
Doc1. 我愛豬腳麵線
Doc2. 你愛麻油雞
Doc3. 我愛你,你愛我
Doc4. 你我愛麵線

TF stands for Term Frequency:

tf(t, d) = (number of times term t appears in document d) / (total number of words in d)

Put simply, it is the share of a document's words that the term accounts for.
In Doc1, the TF of "麵線" is 1/4.
In Doc2, the TF of "麻油雞" is 1/3.

IDF stands for Inverse Document Frequency:

idf(t) = log( total number of documents / number of documents containing t )

If importance were judged purely by raw term frequency, connectives and pronouns such as "你", "我", "他", "和", "以及" would inevitably dominate. Putting the number of documents containing the term in the denominator lets IDF greatly reduce this kind of error.
For example (using base-10 logarithms):
IDF of "麵線": log(4/2) = 0.30
IDF of "我": log(4/3) = 0.12
IDF of "麻油雞": log(4/1) = 0.60

The final TF-IDF score is simply the product of the two; the words with the highest scores are the extracted keywords. For example, the TF-IDF of "麵線" in Doc1 is 1/4 × 0.30 ≈ 0.075.
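As a sanity check, here is a minimal sketch that pushes the same toy corpus through tidytext's bind_tf_idf(). Note that bind_tf_idf() uses the natural logarithm, so idf("麵線") comes out as log(4/2) ≈ 0.69 rather than the base-10 value 0.30 above; the object names are only for illustration.

#tf-idf on the toy corpus
library(tidyverse)
library(tidytext)

# Toy corpus again, hand-segmented, counted as one row per (document, word)
toy_counts <- tibble(
  doc  = c("Doc1", "Doc2", "Doc3", "Doc4"),
  text = c("我 愛 豬腳 麵線",
           "你 愛 麻油雞",
           "我 愛 你 你 愛 我",
           "你 我 愛 麵線")
) %>%
  separate_rows(text, sep = " ") %>%
  count(doc, word = text)

# bind_tf_idf(term, document, n) appends tf, idf and tf_idf columns
toy_counts %>%
  bind_tf_idf(word, doc, n) %>%
  filter(word %in% c("麵線", "麻油雞", "我")) %>%
  arrange(word, doc)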

tidytext already provides a bind_tf_idf() function that computes this directly. Note that here each label (好評 / 差評) is treated as a "document", so there are only two documents in total:
#tf-idf
content.tfidf <- content %>%
  bind_tf_idf(word, label, word_frequency)

head(content.tfidf)

# A tibble: 6 x 6
# Groups:   label [1]
  label word   word_frequency         tf   idf     tf_idf
  <fct> <chr>           <int>      <dbl> <dbl>      <dbl>
1 好評  关乎                4 0.0000140  0.693 0.00000971
2 好評  内心                4 0.0000140  0.693 0.00000971
3 好評  内容                5 0.0000175  0.693 0.0000121 
4 好評  冴子                2 0.00000700 0.693 0.00000485
5 好評  冴羽                6 0.0000210  0.693 0.0000146 
6 好評  卢旺达              1 0.00000350 0.693 0.00000243


#visualize tf-idf
content.tfidf %>%
  group_by(label) %>%
  top_n(10, tf_idf) %>% # select top 10 keywords for each label
  arrange(desc(tf_idf)) %>% # order by tf-idf
  ggplot(aes(word, tf_idf, fill = label)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~label, ncol = 2, scales = "free") +
  coord_flip() +
  theme_bw() +
  theme(text=element_text(family="黑體-繁 中黑", size=14),
        axis.text.x = element_text(angle = 60, hjust = 1))

On the positive side, 寄生上流 (Parasite) really does produce many keywords (金家 / 基澤 / 基宇 / 金爸 / 补家). On the negative side, 機甲 (mecha) and 法醫 (forensic examiner) take up quite a lot. Given how TF-IDF works, we can guess that a few threads discussed these keywords very intensively; looking back at the raw data, there is indeed one thread that spends a lot of space trashing 環太平洋2 (Pacific Rim: Uprising) as a terrible film, and the crude words come from a thread full of dirty jokes about 城市獵人 (City Hunter).

3. Co-occurrence

As the name suggests, this simply counts how many times keywords appear together.
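Before running it on the PTT data, here is a minimal sketch of widyr's pairwise_count() on the toy corpus, which counts, for every word pair, how many documents contain both words (the object names are illustrative only):

#Co-occurrence on the toy corpus
library(tidyverse)
library(widyr)

# Toy corpus, hand-segmented into one row per (document, word)
toy <- tibble(
  doc  = c("Doc1", "Doc2", "Doc3", "Doc4"),
  text = c("我 愛 豬腳 麵線",
           "你 愛 麻油雞",
           "我 愛 你 你 愛 我",
           "你 我 愛 麵線")
) %>%
  separate_rows(text, sep = " ")

# Count, for each pair of words, the documents in which they appear together
toy %>% pairwise_count(text, doc, sort = TRUE, upper = FALSE)
# e.g. 愛 and 麵線 appear together in two documents (Doc1 and Doc4)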

# Co-occurrence
total.count <- tidy_content %>%
  group_by(word) %>% 
  summarise(total_count = n())

top15.words <- total.count %>%
  subset(nchar(word)>1) %>%
  arrange(desc(total_count)) %>%
  top_n(15)

temp <- tidy_content %>% subset(word %in% top15.words$word)

library(widyr)
description.cooc <- temp %>% 
  pairwise_count(word, title, sort = TRUE, upper = FALSE)
head(description.cooc)
# Co-occurrence heatmap visualization
ggplot(data = description.cooc, aes(item1, item2, fill = n)) +
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "white", high = "red",name="Co-occurence") +
  theme_bw() +
  theme(text=element_text(family="黑體-繁 中黑", size=14),
        axis.text.x = element_text(angle = 60, hjust = 1))
# Co-occurrence network
set.seed(2)
description.cooc %>%
  graph_from_data_frame() %>%
  ggnetwork() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(aes(size = n, alpha = n), color = "cyan4") +
  geom_nodes(color = "black", size = 8) +
  geom_nodelabel_repel(aes(label = vertex.names),
                       color = 'grey50',
                       box.padding = unit(1, "lines"),
                       family = "黑體-繁 中黑") +
  theme_bw() +
  theme(text=element_text(family="黑體-繁 中黑", size=16))  
I cannot see anything particularly informative here.


4. TextRank

A word's importance = a weighted average of the importance of the words it co-occurs with:

WS(Vi) = (1 - d) + d * Σ over neighbours Vj of [ w(Vj, Vi) / Σ over Vk of w(Vj, Vk) ] * WS(Vj)

where d is the damping factor (typically 0.85) and w(Vj, Vi) is the co-occurrence weight of the edge between Vj and Vi.

The same example:
Doc1. 我愛豬腳麵線
Doc2. 你愛麻油雞
Doc3. 我愛你,你愛我
Doc4. 你我愛麵線

WS(麵線) = (1 - d) + d * (weighted sum over its neighbouring words)

WS(麵線) = (1 - d) + d * [ 1 (number of times 愛 is adjacent to 麵線) / (4 + 3 + 1 + 1) (愛's adjacency weight with the other words) * WS(愛)
                          + 1 / 1 * WS(豬腳)
                          + ... ]

The WS scores can only be pinned down by iterating repeatedly until they converge.
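As a minimal sketch of that iteration, the toy adjacency graph can be scored with igraph's page_rank(), which performs the same kind of damped, weighted iteration that TextRank is built on. The edge weights below are adjacency counts I tallied by hand from Doc1-Doc4 and are purely for illustration:

#TextRank-style scoring on the toy graph
library(igraph)

# Hand-counted adjacencies across Doc1-Doc4; weight = how many times the
# two words appear next to each other
toy_edges <- data.frame(
  from   = c("我", "愛",   "豬腳", "你", "愛",     "你", "你", "愛"),
  to     = c("愛", "豬腳", "麵線", "愛", "麻油雞", "你", "我", "麵線"),
  weight = c(4,     1,      1,      3,    1,        1,    1,    1)
)

g <- graph_from_data_frame(toy_edges, directed = FALSE)

# page_rank() iterates the damped scores until they converge (d = 0.85)
sort(page_rank(g, damping = 0.85, weights = E(g)$weight)$vector, decreasing = TRUE)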


# textrank
textrank.model <- textrank_keywords(
  tidy_content$word, p = 1/3, ngram_max = 2
)
summary(textrank.model)
head(textrank.model$pagerank$vector)
head(textrank.model$keywords)
label.text <- unique(tidy_content$label)

description.textrank <- NULL

for(labeltext in unique(tidy_content$label)){
  # filter channel descriptions
  text.data <- tidy_content %>% filter(label == labeltext)
  # train text rank model
  textrank.model <- textrank_keywords(
    text.data$word, p = 1/3, ngram_max = 2
  )
  # consolidate textrank results
  description.textrank <- bind_rows(
    description.textrank,
    data.frame(
      label = labeltext,
      word = names(textrank.model$pagerank$vector),
      text_rank = textrank.model$pagerank$vector)
  )
}
#visualise textrank
description.textrank %>%
  arrange(desc(text_rank)) %>%
  group_by(label) %>% 
  top_n(10, text_rank) %>% 
  ungroup() %>%
  ggplot(aes(word, text_rank, fill = label)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Text Rank") +
  facet_wrap(~label, ncol = 2, scales = "free") +
  coord_flip() +
  theme_bw() +
  theme(text=element_text(family="黑體-繁 中黑", size=14),
        axis.text.x = element_text(angle = 60, hjust = 1))

Looking over the results, 寄生上流 is once again picked out of the positive reviews, and 蜘蛛人 still gets mixed reviews, but it looks like the negative reviews of 蜘蛛人 were mostly about the villain's weak performance.

5. The RAKE algorithm

RAKE finds keywords from word frequency and degree (see the sketch after this example):
Frequency: the number of documents word A appears in
Degree: the number of links word A has to other words
Taking "麵線" from the earlier example:
Frequency of 麵線: 2
Degree of 麵線: 4
So the RAKE score of 麵線 is 4/2 = 2.
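A minimal sketch of that degree / frequency calculation on the toy corpus (object names are illustrative only):

#RAKE score on the toy corpus
library(tidyverse)

# Toy corpus: one row per (document, distinct word)
toy <- tibble(
  doc  = c("Doc1", "Doc2", "Doc3", "Doc4"),
  text = c("我 愛 豬腳 麵線",
           "你 愛 麻油雞",
           "我 愛 你 你 愛 我",
           "你 我 愛 麵線")
) %>%
  separate_rows(text, sep = " ") %>%
  distinct(doc, word = text)

# Frequency: number of documents each word appears in
freq <- toy %>% count(word, name = "frequency")

# Degree: number of distinct other words each word shares a document with
degree <- toy %>%
  inner_join(toy, by = "doc") %>%          # all word pairs within a document
  filter(word.x != word.y) %>%
  distinct(word = word.x, other = word.y) %>%
  count(word, name = "degree")

inner_join(freq, degree, by = "word") %>%
  mutate(rake = degree / frequency)        # 麵線: 4 / 2 = 2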

#RAKE
tidy_content$label <- as.factor(tidy_content$label)

description.rake <- NULL
for(labeltext in unique(tidy_content$label)){
  text.data <- tidy_content %>% filter(label == labeltext)
  description.cooc <- text.data %>%
    pairwise_count(word, title, sort = TRUE, upper = FALSE)
  
  description.cooc <- description.cooc %>% spread(item2, n)
  row.names(description.cooc) <- description.cooc$item1
  description.cooc <- description.cooc %>% select(-item1)
  
  description.wf <- text.data %>%
    group_by(word) %>%
    summarise(word_frequency = length(unique(title)))
  
  description.rake <- bind_rows(
    description.rake,
    data.frame(
      label = labeltext,
      word = colnames(description.cooc),
      degree =  colSums(!is.na(description.cooc))) %>%
      inner_join(description.wf)
  )
}
  
description.rake <- description.rake %>%
  mutate(rake = degree / word_frequency)

description.rake %>%
  group_by(label) %>%
  top_n(10, rake) %>% 
  ggplot(aes(reorder(word, rake), rake, fill = label)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "RAKE") +
  facet_wrap(~label, ncol = 2, scales = "free") +
  coord_flip() +
  theme_bw() +
  theme(text=element_text(family="黑體-繁 中黑", size=14),
        axis.text.x = element_text(angle = 60, hjust = 1))
Among the words picked out this time, on the positive side 從前有個好萊塢 (Once Upon a Time in Hollywood) contributes quite a few of its characters; on the negative side I still cannot read much out of it.



Summary:

This time we extracted keywords with three different keyword-extraction algorithms and got different results. TF-IDF and RAKE seem to reveal the clearest information, but as always no algorithm is inherently better or worse; different situations will give different results. And congratulations to 寄生上流 (Parasite) on its outstanding showing at the Oscars.




