Chapter 10 Text-mining
10.1 Power with words
Numbers are great… but words literally tell a story. Analysing text (e.g. books, tweets, survey responses) quantitatively is naturally challenging; however, there are a few tricks that simplify the process.
This chapter outlines how to import text data and run some simple analyses. The notes and code loosely follow the fabulous book Text Mining with R by Julia Silge and David Robinson.
First up, let’s load some packages.
library(ggplot2)
library(dplyr)
library(tidyverse)
library(tidytext)
library(textdata)
10.2 Frequency analysis
There’s an online repository called Project Gutenberg, which catalogues texts that are out of copyright (mostly because copyright expires over time). These texts can be downloaded with the R package gutenbergr.
It just so happens that the Bible is in this collection. Let’s check out the most frequent words.
library(gutenbergr)
bible <- gutenberg_download(30)
# Split the text into one word per row, then drop common stop words
bible_tidy <- bible %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Find the most common words
bible_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 12,595 x 2
##    word       n
##    <chr>  <int>
##  1 lord    7830
##  2 thou    5474
##  3 thy     4600
##  4 god     4446
##  5 ye      3982
##  6 thee    3827
##  7 001     2783
##  8 002     2721
##  9 israel  2565
## 10 003     2560
## # ... with 12,585 more rows
Somewhat unsurprisingly, “lord” wins it by a country mile.
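The numeric entries in the table (001, 002, 003) are not words. They look like chapter and verse markers from the source file that survived tokenisation. The sketch below shows one way to drop them, using two verse-style lines typed in by hand so it runs without a download. The names toy and toy_tidy and the "001:001" marker format are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Hand-typed stand-in for the downloaded text (the "001:001" markers are an assumption)
toy <- tibble(text = c(
  "001:001 In the beginning God created the heaven and the earth.",
  "001:002 And the earth was without form, and void."
))

toy_tidy <- toy %>%
  unnest_tokens(word, text) %>%            # one lowercase token per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the" and "and"
  filter(!grepl("^[0-9]+$", word))         # drop tokens made up only of digits

toy_tidy %>%
  count(word, sort = TRUE)
```

Adding the same filter() line to the bible_tidy pipeline should remove the numeric rows from the frequency table.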
10.3 Sentiment analysis
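Just like a frequency analysis, we can do a ‘vibe’ analysis (i.e. measure the sentiment of a text) using a simple dictionary-matching technique. The tidytext package gives access to sentiment lexicons through get_sentiments(). A lexicon is a word list that tags each word with a general sentiment (e.g. positive or negative) or an emotion (e.g. joy or fear). The NRC lexicon used here is fetched via the textdata package, which asks for confirmation the first time it downloads the lexicon.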
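To make the matching step concrete, here is a minimal sketch with a made-up four-word lexicon. The objects toy_lexicon, toy_words and toy_joy are hypothetical and stand in for the real lexicon and the tokenised text. The inner join keeps only the words found in both tables.

```r
library(dplyr)

# Hypothetical mini-lexicon: one row per word and sentiment pair
toy_lexicon <- tibble(
  word      = c("peace", "love", "glory", "war"),
  sentiment = c("joy",   "joy",  "joy",   "fear")
)

# Hypothetical tokenised text: one word per row
toy_words <- tibble(word = c("peace", "love", "stone", "peace", "war"))

toy_joy <- toy_words %>%
  inner_join(filter(toy_lexicon, sentiment == "joy"), by = "word") %>%  # keep joy words only
  count(word, sort = TRUE)                                              # tally each match

toy_joy
```

Here "stone" is dropped because it is not in the lexicon, and "war" is dropped because it is not tagged as joy.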
Let’s count the words most associated with ‘joy’ in the Bible.
# Load the NRC lexicon and keep only the words tagged as joy
nrcjoy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
# Join the Bible words with the joy word list
bible_tidy %>%
  inner_join(nrcjoy, by = "word") %>%
  count(word, sort = TRUE)
## # A tibble: 258 x 2
##    word         n
##    <chr>    <int>
##  1 god       4446
##  2 art        494
##  3 peace      429
##  4 found      402
##  5 glory      402
##  6 daughter   324
##  7 pray       313
##  8 love       310
##  9 blessed    302
## 10 mighty     284
## # ... with 248 more rows
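One caveat: a lexicon matches spellings, not meanings. In this translation "art" is probably mostly the archaic verb (as in "thou art") rather than the noun the lexicon has in mind, so its count is likely inflated. A minimal sketch of one workaround is to remove such words with a custom exclusion list. The list custom_stop below is hypothetical, and the counts are copied by hand from the first rows of the table above.

```r
library(dplyr)

# First five rows of the joy table, typed in by hand
joy_counts <- tibble(
  word = c("god", "art", "peace", "found", "glory"),
  n    = c(4446, 494, 429, 402, 402)
)

# Hypothetical, text-specific list of words to exclude
custom_stop <- tibble(word = c("art"))

joy_clean <- joy_counts %>%
  anti_join(custom_stop, by = "word")   # keep rows whose word is not in custom_stop

joy_clean
```

Which words belong on such a list is a judgment call that depends on the text being analysed.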