Keyword Extraction in Text Mining: TF-IDF, TextRank, RAKE
Keyword Extraction
This time we extend into text mining and try keyword extraction from text data; the main goal is to pick out the words that best represent a document.
Common application scenarios include:
1. Extracting search keywords to optimise for search engines (SEO)
The image above is a blog the author visits often; the "hot key words" box in the lower-right corner is an application of extracted keywords.
2. Improving the customer experience
Customers leave opinions in reviews based on their actual experience. If we first extract keywords from that content, the customer-service side can anticipate what customers need and prepare one step ahead.
How to do it? (1) Bag of Words representation
How is this calculated in detail?
A simple example:
Doc1. 我愛豬腳麵線
Doc2. 你愛麻油雞
Doc3. 我愛你,你愛我
Doc4. 你我愛麵線
From these four documents we can count every distinct word and its frequency, which already achieves the goal of quantifying the vocabulary; a quick sketch of this step follows below.
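Assuming the sentences above have already been segmented (words separated by spaces), a minimal sketch of that counting step might look like this; the toy object names are made up purely for illustration:
#Bag-of-words sketch on the toy documents
library(tidyverse)
toy <- tibble(
  doc  = c("Doc1", "Doc2", "Doc3", "Doc4"),
  text = c("我 愛 豬腳 麵線", "你 愛 麻油雞", "我 愛 你 你 愛 我", "你 我 愛 麵線")
)
toy %>%
  separate_rows(text, sep = " ") %>%                                # one word per row
  count(doc, word = text) %>%                                       # word frequency per document
  pivot_wider(names_from = word, values_from = n, values_fill = 0)  # document-term matrix
Each row of the result is a document and each column a word, with the cell holding that word's count.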
Hands-on practice with the PTT Movie board
1. Read the data & clean it (stop words / customised words) & word-frequency analysis
#library and reading data
library(tidyverse)
library(jiebaR)
library(widyr)
library(tidytext)
library(tm)
library(igraph)
library(ggnetwork)
library(textrank)
Movie <- read_csv("movie板 電影心得.csv")
head(Movie, 5)
Movie$label <- ifelse(Movie$label==0,"好評","差評")
Movie$label <- as.factor(Movie$label)
We can see the full PTT posts; the field we analyse is content, but it contains far too many meaningless symbols and words that do not help the analysis (e.g. 雷文, 防雷, 不負責任, and so on), so they have to be dealt with before analysing.
Use gsub to substitute away the extra symbols, and set up stop words to filter out terms that carry no real meaning.
#remove symbols and stop words
# (match each symbol literally with fixed = TRUE, since ( ) | ? ! are regex metacharacters)
my.symbols <- c("《", "》", "【", "】", "|", "(",")",
"®", "\n", "?", "@", "#", "?", "!", "!","~",":",":",",","。","//")
for (s in my.symbols) {
  Movie$content <- gsub(s, "", Movie$content, fixed = TRUE)
}
my_stop_words <- c( "電影", "雷文", "防雷線","雷文防雷資訊頁", "我是防雷線", "主文分隔線", "一個", "他們", "可以", "因為", "自己", "我們", "沒有", "就是", "最後", "覺得", "一個", "不是", "什麼", "可以", "沒有", "真的", "這部", "就是", "覺得", "只是", "其實", "所以", "這個", "還是", "雖然", "一樣", "只是", "完全", "所以", "知道", "這樣", "然後", "還是", "可能", "甚至", "看到", "喜歡", "開始", "com", "如果", "感覺", "應該", "不過", "已經", "還有", "出來", "之後", "不會", "這麼", "但是", "時候", "這種", "根本", "非常", "怎麼", "到底", "一下", "此片", "這些", "很多", "一部",
"來說", "一直", "兩句", "不計", "分為", "藉此", "還會", "做到","整部","劇情","主角","觀眾","故事","角色","導演","社會","人生")
Movie$content <- gsub(
paste(my_stop_words, collapse = "|"),
" ",
Movie$content)
Remove English letters and digits
#Remove English letters & numbers
Movie$content <- gsub("[A-Za-z]", "", Movie$content) # "[A-z]" would also match symbols such as [ \ ] ^ _ `
Movie$content <- removeNumbers(as.character(Movie$content))
head(Movie[3])
jiebaR sometimes cannot segment specific terms correctly; for example, "蜘蛛人" easily gets split into "蜘蛛" and "人", so we also need to tell the program about such specific terms.
(The author initially did not add 彭彭 and 丁滿, so "彭丁滿" kept appearing in the later analysis.)
Then segment the content into words, separated by spaces.
#customized word
library(jiebaR)
wk <- worker(stop_word = jiebaR::STOPPATH)
customized_terms <- c("劉德華", "奉俊昊", "玩具總動員", "神隱少女", "湯婆婆","蜘蛛人","復四","漫威","美麗華","彭彭","丁滿",
"擊殺數", "助攻數", "娃娃鬼", "輔助鬼", "感情線", "小情小愛", "心得文", "新手上路", "蜘蛛粉", "哭爆", "靠譜歌王", "披頭四", "愛情戲", "鋼鐵人", "駭客任務 ", "片商", "寄生上流", "電流大戰", "從前有個好來塢", "超級英雄")
new_user_word(wk, customized_terms)
Movie.description <- tibble(
label = Movie$label, # label
title = Movie$article_title, # title
content = sapply(
as.character(Movie$content), # vector to be segmented
function(char) segment(char, wk) %>% paste(collapse = " ") #Segment
)
)
head(Movie.description, 5)
Convert to the tidy text format, in which each row is a single word rather than a whole post.
Each meaningful word that becomes its own row is what we call a token.
Once we have the tokens, we can use summarise to count how often each token appears, and then visualise the result.
#tidytext
library(tidytext)
tok99 <- function(t) str_split(t,"[ ]{1,}") # split on every run of one or more spaces
tidy_content <- Movie.description %>%
unnest_tokens(word, content, token=tok99)
tidy_content <- tidy_content[nchar(tidy_content$word) > 1, ] # keep only words longer than one character
head(tidy_content, 5)
content <-tidy_content %>%
group_by(label,word) %>%
summarise(word_frequency = n()) # word frequency per label and word
#visualize-frequency
content %>%
group_by(label) %>%
top_n(10, word_frequency) %>%
arrange(word_frequency) %>%
ggplot(aes(word, word_frequency, fill = label)) +
geom_col(show.legend = FALSE) +
facet_wrap(~label, ncol = 2, scales = "free") +
coord_flip() +
theme_bw() +
theme(text=element_text(family="黑體-繁 中黑", size = 12),
axis.text.x = element_text(angle = 60, hjust = 1))
The word-frequency visualisation gives us a rough first impression: 蜘蛛人 (Spider-Man) seems to draw mixed reviews; the positive reviews contain more terms related to 寄生上流 (Parasite), such as 金家 and 社長; from the negative reviews I cannot get many clues, though the villains in this period's films apparently left people unsatisfied.
2. TF-IDF
The same example again:
Doc1. 我愛豬腳麵線
Doc2. 你愛麻油雞
Doc3. 我愛你,你愛我
Doc4. 你我愛麵線
TF stands for Term Frequency.
Simply put, it is the proportion of document j's words made up by term i.
In Doc1, the TF of "麵線" is 1/4.
In Doc2, the TF of "麻油雞" is 1/3.
IDF stands for Inverse Document Frequency.
If we judged importance by raw frequency alone, pronouns and connectives such as 你, 我, 他, 和, 以及 would inevitably dominate; by putting the number of documents that contain the term in the denominator, IDF greatly reduces that kind of error.
For example, the IDF of 麵線 is log(4/2) = 0.3 (log base 10 here);
the IDF of 我 is log(4/3) = 0.12;
the IDF of 麻油雞 is log(4/1) = 0.6.
TF-IDF is simply the product of the two; the terms with the highest scores are the extracted keywords.
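Before moving to the real data, here is a minimal sketch that reproduces the hand calculation above on the four toy documents. It uses log base 10 to match those numbers (tidytext's bind_tf_idf, used below, takes the natural log instead), and the toy object names are made up for illustration:
#TF-IDF by hand on the toy documents
library(tidyverse)
toy <- tibble(
  doc  = c("Doc1", "Doc2", "Doc3", "Doc4"),
  text = c("我 愛 豬腳 麵線", "你 愛 麻油雞", "我 愛 你 你 愛 我", "你 我 愛 麵線")
) %>%
  separate_rows(text, sep = " ") %>%
  count(doc, word = text)
n_docs <- n_distinct(toy$doc)
toy %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                        # share of the document's words
  group_by(word) %>%
  mutate(idf    = log10(n_docs / n_distinct(doc)),   # rarer across documents -> larger idf
         tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))
# e.g. 麵線 in Doc1: tf = 1/4, idf = log10(4/2) ≈ 0.3, tf_idf ≈ 0.075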
tidytext in R already provides the bind_tf_idf function to compute this directly:
#tf-idf
content.tfidf <- content %>%
bind_tf_idf(word, label, word_frequency)
head(content.tfidf)
# A tibble: 6 x 6
# Groups: label [1]
label word word_frequency tf idf tf_idf
<fct> <chr> <int> <dbl> <dbl> <dbl>
1 好評 关乎 4 0.0000140 0.693 0.00000971
2 好評 内心 4 0.0000140 0.693 0.00000971
3 好評 内容 5 0.0000175 0.693 0.0000121
4 好評 冴子 2 0.00000700 0.693 0.00000485
5 好評 冴羽 6 0.0000210 0.693 0.0000146
6 好評 卢旺达 1 0.00000350 0.693 0.00000243
#visualize tf-idf
content.tfidf %>%
group_by(label) %>%
top_n(10, tf_idf) %>% # select top 10 keywords for each label
arrange(desc(tf_idf)) %>% # order by tf-idf
ggplot(aes(word, tf_idf, fill = label)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~label, ncol = 2, scales = "free") +
coord_flip() +
theme_bw() +
theme(text=element_text(family="黑體-繁 中黑", size=14),
axis.text.x = element_text(angle = 60, hjust = 1))
Among the positive reviews, 寄生上流 (Parasite) really does yield many keywords (金家 / 基澤 / 基宇 / 金爸 / 补家). Among the negative reviews, 機甲 (mecha) and 法醫 (coroner) take up quite a bit; given how TF-IDF selects terms, we can guess that a few threads discussed those keywords very intensively. Checking the original data, there is indeed a thread that spends a great deal of space trashing 環太平洋2 (Pacific Rim: Uprising) as a terrible film, and the crude terms come from a thread arguing over the dirty jokes in 城市獵人 (City Hunter).
3. Co-Occurrence
As the name suggests, this simply counts how many times keywords appear together (a toy illustration of pairwise_count() follows right after the code below).
#Co-Occurrence
total.count <- tidy_content %>%
group_by(word) %>%
summarise(total_count = n())
top15.words <- total.count %>%
subset(nchar(word)>1) %>%
arrange(desc(total_count)) %>%
top_n(15)
temp <- tidy_content %>% subset(word %in% top15.words$word)
library(widyr)
description.cooc <- temp %>%
pairwise_count(word, title, sort = TRUE, upper = FALSE)
head(description.cooc)
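As promised above, here is a minimal sketch of what pairwise_count() is doing, run on the four toy documents (one row per distinct word per document; the toy_tokens name is made up for illustration):
#pairwise_count sketch on the toy documents
library(tidyverse)
library(widyr)
toy_tokens <- tibble(
  doc  = rep(c("Doc1", "Doc2", "Doc3", "Doc4"), times = c(4, 3, 3, 4)),
  word = c("我", "愛", "豬腳", "麵線",
           "你", "愛", "麻油雞",
           "我", "愛", "你",
           "你", "我", "愛", "麵線")
)
toy_tokens %>%
  pairwise_count(word, doc, sort = TRUE, upper = FALSE)
# e.g. 我 and 愛 appear together in 3 documents, 愛 and 麵線 in 2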
#Co-Occurrence heatmap visualisation
ggplot(data = description.cooc, aes(item1, item2, fill = n)) +
geom_tile(color = "white")+
scale_fill_gradient2(low = "white", high = "red",name="Co-occurence") +
theme_bw() +
theme(text=element_text(family="黑體-繁 中黑", size=14),
axis.text.x = element_text(angle = 60, hjust = 1))
#Co-Occurrence network
set.seed(2)
description.cooc %>%
graph_from_data_frame() %>%
ggnetwork() %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(aes(size = n, alpha = n), color = "cyan4") +
geom_nodes(color = "black", size = 8) +
geom_nodelabel_repel(aes(label = vertex.names),
color = 'grey50',
box.padding = unit(1, "lines"),
family = "黑體-繁 中黑") +
theme_bw() +
theme(text=element_text(family="黑體-繁 中黑", size=16))
The author cannot read much distinctive information out of this one.
4. Text Rank
Word importance = the weighted average of the importance of the words it co-occurs with.
The same example again:
Doc1. 我愛豬腳麵線
Doc2. 你愛麻油雞
Doc3. 我愛你,你愛我
Doc4. 你我愛麵線
WS(麵線) = (1 - d) + d × Σ_j [ w(j, 麵線) / Σ_k w(j, k) ] × WS(j), where the sum runs over every word j adjacent to 麵線 and w(·,·) is the adjacency (co-occurrence) count.
For example, the term for 愛 is 1 (its adjacency count with 麵線) divided by 4 + 3 + 1 + 1 (愛's total adjacency weight with all other words), multiplied by WS(愛); the term for 豬腳 is 1/1 × WS(豬腳); and so on.
The WS scores cannot be computed in one pass; they are obtained by iterating this update until the values converge (a sketch of the iteration follows below).
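To make the iteration concrete, here is a rough sketch of the update rule on a tiny hand-made word graph. The adjacency weights W and the damping factor d = 0.85 are illustrative assumptions, not values taken from the data:
#TextRank-style iteration on a toy word graph
words <- c("我", "你", "愛", "麵線")
W <- matrix(c(0, 1, 3, 1,
              1, 0, 3, 1,
              3, 3, 0, 1,
              1, 1, 1, 0),
            nrow = 4, byrow = TRUE, dimnames = list(words, words)) # illustrative co-occurrence weights
d     <- 0.85                               # damping factor
out_w <- colSums(W)                         # total edge weight attached to each word
ws    <- setNames(rep(1, 4), words)         # start every word with score 1
for (i in 1:30) {                           # iterate until the scores stabilise
  ws <- (1 - d) + d * drop(W %*% (ws / out_w))
}
round(sort(ws, decreasing = TRUE), 3)       # 愛, the best-connected word, ends up highest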
#textrank
textrank.model <- textrank_keywords(
tidy_content$word, p = 1/3, ngram_max = 2
)
summary(textrank.model)
head(textrank.model$pagerank$vector)
head(textrank.model$keywords)
label.text <- unique(tidy_content$label)
description.textrank <- NULL
for(labeltext in unique(tidy_content$label)){
# keep only the posts with this label
text.data <- tidy_content %>% filter(label == labeltext)
# train text rank model
textrank.model <- textrank_keywords(
text.data$word, p = 1/3, ngram_max = 2
)
# consolidate textrank results
description.textrank <- bind_rows(
description.textrank,
data.frame(
label = labeltext,
word = names(textrank.model$pagerank$vector),
text_rank = textrank.model$pagerank$vector)
)
}
#visualise textrank
description.textrank %>%
arrange(desc(text_rank)) %>%
group_by(label) %>%
top_n(10, text_rank) %>%
ungroup() %>%
ggplot(aes(word, text_rank, fill = label)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "Text Rank") +
facet_wrap(~label, ncol = 2, scales = "free") +
coord_flip() +
theme_bw() +
theme(text=element_text(family="黑體-繁 中黑", size=14),
axis.text.x = element_text(angle = 60, hjust = 1))
Looking over the results, 寄生上流 is once again picked out of the positive reviews, and 蜘蛛人 still gets mixed reviews, but it looks as though the negative reviews of 蜘蛛人 mostly stem from the villain's weak performance.
5. The RAKE Algorithm
RAKE finds keywords using word frequency and links:
Frequency: the number of documents in which word A appears
Degree: the number of links word A has with other words
Take 麵線 from the earlier example:
Frequency of 麵線: 2
Degree: 4
So the RAKE score of 麵線 is 4 / 2 = 2 (worked out in the sketch right below).
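A minimal sketch of that degree / frequency calculation on the toy documents, roughly mirroring the approach used in the code below (the toy_tokens name is made up for illustration):
#RAKE score sketch on the toy documents
library(tidyverse)
library(widyr)
toy_tokens <- tibble(
  doc  = rep(c("Doc1", "Doc2", "Doc3", "Doc4"), times = c(4, 3, 3, 4)),
  word = c("我", "愛", "豬腳", "麵線",
           "你", "愛", "麻油雞",
           "我", "愛", "你",
           "你", "我", "愛", "麵線")
)
toy_degree <- toy_tokens %>%
  pairwise_count(word, doc, upper = TRUE) %>%     # keep both directions of every pair
  group_by(word = item1) %>%
  summarise(degree = n_distinct(item2))           # how many distinct words it co-occurs with
toy_freq <- toy_tokens %>%
  group_by(word) %>%
  summarise(frequency = n_distinct(doc))          # how many documents contain the word
inner_join(toy_degree, toy_freq, by = "word") %>%
  mutate(rake = degree / frequency)               # 麵線: 4 / 2 = 2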
#RAKE
tidy_content$label <- as.factor(tidy_content$label)
rm(text.data) # clear the helper object left over from the TextRank loop
description.rake <- NULL
for(labeltext in unique(tidy_content$label)){
text.data <- tidy_content %>% filter(label == labeltext)
description.cooc <- text.data %>%
pairwise_count(word, title, sort = TRUE, upper = FALSE)
description.cooc <- description.cooc %>% spread(item2, n)
row.names(description.cooc) <- description.cooc$item1
description.cooc <- description.cooc %>% select(-item1)
description.wf <- text.data %>%
group_by(word) %>%
summarise(word_frequency = length(unique(title)))
description.rake <- bind_rows(
description.rake,
data.frame(
label = labeltext,
word = colnames(description.cooc),
degree = colSums(!is.na(description.cooc))) %>%
inner_join(description.wf)
)
}
description.rake <- description.rake %>%
mutate(rake = degree / word_frequency)
description.rake %>%
group_by(label) %>%
top_n(10, rake) %>%
ggplot(aes(reorder(word, rake), rake, fill = label)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "RAKE") +
facet_wrap(~label, ncol = 2, scales = "free") +
coord_flip() +
theme_bw() +
theme(text=element_text(family="黑體-繁 中黑", size=14),
axis.text.x = element_text(angle = 60, hjust = 1))
This time, on the positive side, 從前有個好萊塢 (Once Upon a Time in Hollywood) contributes quite a few character names; on the negative side I still cannot read much out of it.
Summary:
We extracted keywords with three different keyword-extraction algorithms and they surfaced different terms. Overall, TF-IDF and RAKE seem to reveal the clearest information. As always, no algorithm is inherently better than another; different situations favour different methods. And congratulations to 寄生上流 (Parasite) on its outstanding showing at the Oscars.