Chapter 8 Text mining

Numbers are great… but words literally tell a story. Analysing text (e.g. books, tweets, survey responses) quantitatively is naturally challenging, but a few tricks can simplify the process.

This chapter outlines the process for reading in text data and running some simple analyses. The notes and code loosely follow the fabulous book Text Mining with R by Julia Silge and David Robinson.

First up, let’s load some packages.

library(tidyverse)  # attaches ggplot2, dplyr, readr, stringr and friends
library(tidytext)   # tokenising and sentiment helpers
library(textdata)   # downloads lexicons such as NRC

8.1 Frequency analysis

There’s an online repository called Project Gutenberg which catalogues texts whose copyright has expired.

It just so happens that the Bible is on this list. Let’s check out the most frequent words.

library(tidyverse)
library(tidytext)

# URL of the raw text file
bible_url <- "https://raw.githubusercontent.com/charlescoverdale/casualdabbler2e/main/data/bible.txt"

# Read the text file directly from the URL
bible <- read_lines(bible_url)

# Convert to a tibble
bible_df <- tibble(text = bible)

# Tokenize words and remove stop words
bible_tidy <- bible_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Find and display the most common words
common_words <- bible_tidy %>%
  count(word, sort = TRUE) %>%
  head(20)  # Show the top 20 words

print(common_words)
## # A tibble: 20 × 2
##    word       n
##    <chr>  <int>
##  1 lord    7830
##  2 thou    5474
##  3 thy     4600
##  4 god     4446
##  5 ye      3983
##  6 thee    3827
##  7 1       2830
##  8 2       2724
##  9 3       2570
## 10 israel  2565
## 11 4       2476
## 12 son     2370
## 13 7       2351
## 14 5       2308
## 15 6       2297
## 16 hath    2264
## 17 king    2264
## 18 9       2210
## 19 8       2193
## 20 people  2142

Somewhat unsurprisingly, “lord” wins by a country mile. The bare numerals in the list are most likely chapter and verse markers rather than words.
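The numeric tokens in the table above can be filtered out before counting. Here is a minimal sketch on a two-line toy input (illustrative only, not the real file), assuming the stringr package is available (it ships with the tidyverse). The names toy_df and toy_counts are made up for this example.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy stand-in for bible_df; these two lines are illustrative only
toy_df <- tibble(text = c("1 In the beginning God created the heaven",
                          "2 And the earth was without form"))

toy_counts <- toy_df %>%
  unnest_tokens(word, text) %>%                # one lowercase token per row
  anti_join(stop_words, by = "word") %>%       # drop common stop words
  filter(!str_detect(word, "^[0-9]+$")) %>%    # drop tokens made up only of digits
  count(word, sort = TRUE)

print(toy_counts)
```

The same filter() line could be added to the bible_tidy pipeline, directly after the anti_join() step.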

8.2 Sentiment analysis

As with frequency analysis, we can do a ‘vibe’ analysis (i.e. measure the sentiment of a text) by matching each word against a lexicon: a lookup table that tags words with sentiments.

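The tidytext package gives access to several such lexicons through get_sentiments(). Each one records the general sentiment of a word (e.g. an emotion associated with that word). The NRC lexicon used here is downloaded via the textdata package the first time it is requested.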
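To make the matching step concrete, here is a minimal sketch using a hand-made three-word lexicon. The lexicon and the text are hypothetical, made up for illustration; they are not taken from NRC or from the Bible file.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini lexicon, made up for illustration (not the NRC lexicon)
toy_lexicon <- tibble(
  word      = c("peace", "love", "war"),
  sentiment = c("joy", "joy", "fear")
)

# Hypothetical input text
toy_text <- tibble(text = c("Peace and love to all", "love conquers war"))

# Keep only the lexicon entries tagged "joy"
toy_joy_lexicon <- toy_lexicon %>%
  filter(sentiment == "joy")

toy_joy <- toy_text %>%
  unnest_tokens(word, text) %>%               # one lowercase word per row
  inner_join(toy_joy_lexicon, by = "word") %>% # keep only words found in the joy list
  count(word, sort = TRUE)

print(toy_joy)
```

Words missing from the lexicon simply drop out of the inner join, so the result only ever contains words the lexicon knows about.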

Let’s see the count of words most associated with ‘joy’ in the Bible.

# Tokenize words
bible_tidy <- bible_df %>%
  unnest_tokens(word, text) %>%
  mutate(word = tolower(word))  # Safeguard only: unnest_tokens() already lowercases by default

# Get NRC lexicon & filter for "joy"
nrcjoy <- tidytext::get_sentiments("nrc") %>%
  filter(sentiment == "joy")

# Join words with NRC joy sentiment list & count occurrences
bible_joy_words <- bible_tidy %>%
  inner_join(nrcjoy, by = "word") %>%
  count(word, sort = TRUE)

# View top joyful words
print(bible_joy_words)
## # A tibble: 264 × 2
##    word         n
##    <chr>    <int>
##  1 god       4446
##  2 good       720
##  3 art        494
##  4 peace      429
##  5 found      404
##  6 glory      402
##  7 daughter   324
##  8 pray       313
##  9 love       310
## 10 blessed    302
## # ℹ 254 more rows
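Some matches in the table above may be artefacts of archaic English: “art” most likely appears as the verb in “thou art” rather than in the sense the lexicon has in mind. One way to handle this is a custom exclusion list applied before the join. Here is a minimal sketch on toy data. The names custom_exclude, toy_words and toy_joy_lexicon, and their contents, are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical tokens and joy lexicon, made up for illustration
toy_words <- tibble(word = c("thou", "art", "blessed", "art", "peace"))
toy_joy_lexicon <- tibble(word = c("art", "blessed", "peace"), sentiment = "joy")

# Words to ignore even though the lexicon tags them (an assumption; adjust to taste)
custom_exclude <- tibble(word = c("art"))

toy_joy_clean <- toy_words %>%
  anti_join(custom_exclude, by = "word") %>%    # drop excluded words first
  inner_join(toy_joy_lexicon, by = "word") %>%  # then match against the lexicon
  count(word, sort = TRUE)

print(toy_joy_clean)
```

In the real pipeline, the anti_join() step would sit between unnest_tokens() and inner_join().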