Chapter 10 Text-mining

10.1 Power with words

Numbers are great… but words literally tell a story. Analysing text (e.g. books, tweets, survey responses) quantitatively is naturally challenging. However, there are a few tricks which can simplify the process.

This chapter outlines the process for reading in text data and running some simple analyses. The notes and code loosely follow the fabulous book Text Mining with R by Julia Silge and David Robinson.

First up, let’s load some packages.

library(ggplot2)
library(dplyr)
library(tidyverse)
library(tidytext)
library(textdata)

10.2 Frequency analysis

There’s an online repository called Project Gutenberg which catalogues texts that are no longer under copyright (mostly because copyright expires over time). These can be downloaded with the R package gutenbergr.
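Each text in the catalogue has a numeric ID, which is what the download function expects. As a rough sketch (not the only way to do it), the metadata that ships with gutenbergr can be searched with gutenberg_works(). The search pattern "Bible" and the name bible_matches are just illustrative choices here, and the rows returned will depend on the version of the package installed.

```r
library(gutenbergr)
library(dplyr)
library(stringr)

# Search the bundled catalogue metadata for titles containing "Bible".
# "Bible" is an illustrative search term; swap in any title fragment.
bible_matches <- gutenberg_works(str_detect(title, "Bible")) %>%
  select(gutenberg_id, title, author)

# Each row carries the ID that can be passed to gutenberg_download()
bible_matches
```

The gutenberg_id column holds the number to pass to gutenberg_download().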

It just so happens that The Bible is on this list. Let’s check out the most frequent words.

library(gutenbergr)

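# Download the text with Project Gutenberg ID 30
bible <- gutenberg_download(30)

# Split the text into one word per row, then drop common stop words
bible_tidy <- bible %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")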

# Find the most common words
bible_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 12,595 x 2
##    word       n
##    <chr>  <int>
##  1 lord    7830
##  2 thou    5474
##  3 thy     4600
##  4 god     4446
##  5 ye      3982
##  6 thee    3827
##  7 001     2783
##  8 002     2721
##  9 israel  2565
## 10 003     2560
## # ... with 12,585 more rows

Somewhat unsurprisingly, “lord” wins it by a country mile.
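The top ten also contains entries like 001, 002 and 003. These look like chapter and verse numbers rather than real words. Here is a minimal sketch of one way to drop them, run on two made-up lines so that it works without downloading anything. The toy text (including the "01:001:001" reference format) and the names toy and toy_words are assumptions for illustration only.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy stand-in for the downloaded text; the verse reference format is assumed
toy <- tibble(text = c(
  "01:001:001 In the beginning God created the heaven and the earth.",
  "01:001:002 And the earth was without form, and void."
))

toy_words <- toy %>%
  unnest_tokens(word, text) %>%               # one lower-case word per row
  anti_join(stop_words, by = "word") %>%      # drop common stop words
  filter(!str_detect(word, "^[0-9]+$"))       # drop tokens made only of digits

toy_words %>%
  count(word, sort = TRUE)
```

The same filter() line could be added to the bible_tidy pipeline before counting.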

10.3 Sentiment analysis

Just like a frequency analysis, we can do a ‘vibe’ analysis (i.e. the sentiment of a text) by matching each word against a lexicon. The tidytext package gives access to lexicons which tag words with a general sentiment (e.g. an emotion associated with that word).

Let’s count the words most associated with ‘joy’ in the Bible.

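# Get the NRC lexicon (fetched via the textdata package on first use)
# and keep only the words tagged with "joy"
nrcjoy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

# Join bible words with sentiment list
bible_tidy %>%
  inner_join(nrcjoy, by = "word") %>%
  count(word, sort = TRUE)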
## # A tibble: 258 x 2
##    word         n
##    <chr>    <int>
##  1 god       4446
##  2 art        494
##  3 peace      429
##  4 found      402
##  5 glory      402
##  6 daughter   324
##  7 pray       313
##  8 love       310
##  9 blessed    302
## 10 mighty     284
## # ... with 248 more rows
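The same join works with other lexicons. As a hedged sketch, the example below swaps in the "bing" lexicon, which tags words as simply positive or negative and is bundled with tidytext, so nothing needs to be downloaded. Note this is a different lexicon from the NRC one used above. The two toy sentences and the names toy and toy_sentiment are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Toy text standing in for a real document (hypothetical example sentences)
toy <- tibble(text = c(
  "Peace and love bring great joy",
  "War and hate bring sorrow and pain"
))

toy_sentiment <- toy %>%
  unnest_tokens(word, text) %>%                       # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>% # keep words found in the lexicon
  count(sentiment, sort = TRUE)                       # tally positive vs negative

toy_sentiment
```

Words that do not appear in the lexicon are dropped by inner_join(), so the counts only reflect words the lexicon knows about.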