Chapter 10 Text-mining
10.1 Power with words
Numbers are great… but words literally tell a story. Analysing text (e.g. books, tweets, survey responses) quantitatively is naturally challenging, but there are a few tricks that can simplify the process.
This chapter outlines the process for reading in text data and running some simple analyses. The notes and code loosely follow the fabulous book Text Mining with R by Julia Silge and David Robinson.
First up, let’s load some packages.
library(ggplot2)
library(dplyr)
library(tidyverse)
library(tidytext)
library(textdata)
10.2 Frequency analysis
There’s an online repository called Project Gutenberg which catalogues texts that are no longer under copyright (mostly because copyright expires over time). These can be downloaded with the R package gutenbergr.
It just so happens that The Bible is on this list. Let’s check out the most frequent words.
library(gutenbergr)
bible <- gutenberg_download(30)
bible_tidy <- bible %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#Find the most common words
bible_tidy %>%
  count(word, sort=TRUE)
## # A tibble: 12,595 x 2
##    word       n
##    <chr>  <int>
##  1 lord    7830
##  2 thou    5474
##  3 thy     4600
##  4 god     4446
##  5 ye      3982
##  6 thee    3827
##  7 001     2783
##  8 002     2721
##  9 israel  2565
## 10 003     2560
## # ... with 12,585 more rows
Somewhat unsurprisingly, “lord” wins by a country mile.
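Tokens such as 001 and 002 in the output above look like verse or chapter numbering rather than real words. The sketch below runs the same tokenise, remove stop words and count steps on a tiny toy input, with one extra filter that drops purely numeric tokens. The object names toy and toy_counts, and the numeric prefixes on the two lines, are made up for illustration; the same filter() line could be tried on bible_tidy.

```r
library(dplyr)
library(tidytext)

# Toy input (hypothetical): one row per line of text, with a made-up
# numeric prefix standing in for the verse numbering
toy <- tibble(text = c(
  "001 In the beginning God created the heaven and the earth",
  "002 And God said, Let there be light"
))

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  filter(!grepl("^[0-9]+$", word)) %>%     # drop tokens made only of digits
  count(word, sort = TRUE)                 # most frequent words first

toy_counts
```

The regular expression only removes tokens that are entirely digits, so words that merely contain a digit are kept.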
10.3 Sentiment analysis
Just like a frequency analysis, we can do a ‘vibe’ analysis (i.e. the sentiment of a text) by matching each word against a sentiment lexicon, a bit like a thesaurus lookup. The tidytext package gives access to lexicons that record the general sentiment of words (i.e. the emotion associated with each word); some, such as NRC, are fetched through the textdata package.
Let’s see the count of words most associated with ‘joy’ in the Bible.
#Download sentiment list
nrcjoy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
#Join bible words with sentiment list
bible_tidy %>%
  inner_join(nrcjoy) %>%
  count(word, sort=TRUE)
## # A tibble: 258 x 2
##    word         n
##    <chr>    <int>
##  1 god       4446
##  2 art        494
##  3 peace      429
##  4 found      402
##  5 glory      402
##  6 daughter   324
##  7 pray       313
##  8 love       310
##  9 blessed    302
## 10 mighty     284
## # ... with 248 more rows
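To see what the join is doing, here is a minimal sketch using a tiny hand-made lexicon instead of the real NRC data. The objects toy_words, toy_lexicon, toy_joy and toy_result are hypothetical and exist only for this illustration; the real lexicon is far larger and can list the same word under several sentiments.

```r
library(dplyr)

# Hypothetical tidy text: one word per row
toy_words <- tibble(word = c("peace", "war", "love", "peace", "stone"))

# Hand-made mini lexicon (not the NRC data)
toy_lexicon <- tibble(
  word      = c("peace", "love", "war"),
  sentiment = c("joy", "joy", "fear")
)

# Keep only the words tagged as "joy"
toy_joy <- toy_lexicon %>%
  filter(sentiment == "joy")

toy_result <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%   # keeps rows whose word is in the joy list
  count(word, sort = TRUE)

toy_result
```

Because the match is made word by word, every occurrence of a listed word counts towards the sentiment regardless of context, which is worth keeping in mind when a very frequent word such as “god” tops the list.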