Chapter 8 Text mining
Numbers are great… but words literally tell a story. Analysing text (e.g. books, tweets, survey responses) quantitatively is naturally challenging - however there are a few tricks which can simplify the process.
This chapter outlines the process for inputting text data, and running some simple analysis. The notes and code loosely follow the fabulous book Text Mining with R by Julia Silge and David Robinson.
First up, let’s load some packages.
8.1 Frequency analysis
There’s an online repository called Project Gutenberg which catalogues texts whose copyright has expired.
It just so happens that The Bible is on this list. Let’s check out the most frequent words.
library(tidyverse)
library(tidytext)
# URL of the raw text file
bible_url <- "https://raw.githubusercontent.com/charlescoverdale/casualdabbler2e/main/data/bible.txt"
# Read the text file directly from the URL
bible <- read_lines(bible_url)
# Convert to a tibble
bible_df <- tibble(text = bible)
# Tokenize words and remove stop words
bible_tidy <- bible_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Find and display the most common words
common_words <- bible_tidy %>%
  count(word, sort = TRUE) %>%
  head(20) # Show the top 20 words
print(common_words)
## # A tibble: 20 × 2
## word n
## <chr> <int>
## 1 lord 7830
## 2 thou 5474
## 3 thy 4600
## 4 god 4446
## 5 ye 3983
## 6 thee 3827
## 7 1 2830
## 8 2 2724
## 9 3 2570
## 10 israel 2565
## 11 4 2476
## 12 son 2370
## 13 7 2351
## 14 5 2308
## 15 6 2297
## 16 hath 2264
## 17 king 2264
## 18 9 2210
## 19 8 2193
## 20 people 2142
Somewhat unsurprisingly - “lord” wins it by a country mile.
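The bare digits in the top 20 (1 to 9) are most likely chapter and verse numbers from the source file rather than words in the text. Here is a minimal sketch of how they could be filtered out before counting. It runs on two made-up lines so that it works without a download; `sample_lines` is an assumption for illustration, not the real file.

```r
library(tidyverse)
library(tidytext)

# Hypothetical sample lines standing in for the real text (an assumption)
sample_lines <- tibble(text = c(
  "1 And the LORD God said unto the king",
  "2 And the people of Israel said unto the LORD"
))

sample_counts <- sample_lines %>%
  unnest_tokens(word, text) %>%               # one lowercase word per row
  filter(!str_detect(word, "^[0-9]+$")) %>%   # drop tokens made only of digits
  anti_join(stop_words, by = "word") %>%      # drop common stop words
  count(word, sort = TRUE)

print(sample_counts)
```

The same `filter()` line could be added to the `bible_tidy` pipeline above, straight after `unnest_tokens()`.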
8.2 Sentiment analysis
Just like a frequency analysis, we can do a ‘vibe’ analysis (i.e. measure the sentiment of a text) by matching each word against a lexicon that tags words with the sentiments they usually carry.
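The tidytext package gives access to several such lexicons via get_sentiments(), each recording the general sentiment of words (e.g. the emotion usually associated with that word). The NRC lexicon used below is not bundled with tidytext; it is downloaded through the textdata package the first time it is requested.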
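To make the matching step concrete, here is a minimal sketch using a tiny hand-made lexicon. Both `toy_lexicon` and `toy_text` are hypothetical and exist only for illustration; real lexicons are far larger. A token is kept only if it appears in the lexicon, which is all an `inner_join()` on the `word` column does.

```r
library(tidyverse)
library(tidytext)

# Hypothetical three-word lexicon (an assumption, not a real sentiment lexicon)
toy_lexicon <- tibble(
  word      = c("peace", "love", "fear"),
  sentiment = c("joy", "joy", "fear")
)

# Hypothetical text to score
toy_text <- tibble(text = "Love and peace, not fear. Peace be with you.")

toy_matches <- toy_text %>%
  unnest_tokens(word, text) %>%              # tokenise into lowercase words
  inner_join(toy_lexicon, by = "word") %>%   # keep only words found in the lexicon
  count(sentiment, word, sort = TRUE)        # tally matches per sentiment and word

print(toy_matches)
```

Words that are missing from the lexicon ("and", "not", "be", and so on) simply drop out of the result, so the counts only ever reflect the vocabulary the lexicon knows about.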
Let’s see the count of words most associated with ‘joy’ in the bible.
# Tokenize words (note: stop words are not removed here)
bible_tidy <- bible_df %>%
  unnest_tokens(word, text) %>%
  mutate(word = tolower(word)) # unnest_tokens() lowercases by default; kept as a safeguard
# Get NRC lexicon (requires the textdata package) & filter for "joy"
nrcjoy <- tidytext::get_sentiments("nrc") %>%
  filter(sentiment == "joy")
# Join words with NRC joy sentiment list & count occurrences
bible_joy_words <- bible_tidy %>%
  inner_join(nrcjoy, by = "word") %>%
  count(word, sort = TRUE)
# View top joyful words
print(bible_joy_words)
## # A tibble: 264 × 2
## word n
## <chr> <int>
## 1 god 4446
## 2 good 720
## 3 art 494
## 4 peace 429
## 5 found 404
## 6 glory 402
## 7 daughter 324
## 8 pray 313
## 9 love 310
## 10 blessed 302
## # ℹ 254 more rows
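Lexicon matching ignores context, so some of these hits are probably artefacts. In this translation "art" is usually the archaic verb (as in "thou art") rather than the noun the lexicon has in mind. Below is a minimal sketch of one way to drop such words with a custom exclusion list. The first four rows of the output above are re-entered by hand so the example runs on its own, and `custom_exclusions` is a hypothetical list, not part of any package.

```r
library(tidyverse)

# First four rows of the printed output above, re-entered by hand
joy_top <- tibble(
  word = c("god", "good", "art", "peace"),
  n    = c(4446, 720, 494, 429)
)

# Hypothetical exclusion list (an assumption): words whose lexicon sense
# probably differs from how they are used in this text
custom_exclusions <- tibble(word = c("art"))

joy_filtered <- joy_top %>%
  anti_join(custom_exclusions, by = "word")   # remove the excluded words

print(joy_filtered)
```

In the full pipeline the same `anti_join()` could be applied to `bible_joy_words`; which words belong on the list is a judgement call for the analyst.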