Chapter 5 Election data in R

Elections tend to create fascinating data sets. They are spatial in nature, comparable over time (i.e. the number of electorates roughly stays the same) - and more importantly they are consequential for all Australians.

Australia’s compulsory voting system is a remarkable feature of our Federation. Every three-ish years we all turn out at over 7,000 polling booths in our local schools, churches, and community centres to cast a ballot and pick up an obligatory election day sausage. The byproduct is a fascinating longitudinal and spatial data set.

The following code explores different R packages, election data sets, and statistical processes aimed at exploring and modelling federal elections in Australia.

One word of warning: I use the terms electorates, divisions, and seats interchangeably throughout this chapter.

5.1 Getting started

Let’s load up some packages.

#Load packages
library(ggparliament)
library(eechidna)
library(dplyr)
library(ggplot2)
library(readxl)
library(tidyr)
library(tidyverse)
library(purrr)
library(knitr)
library(broom)
library(absmapsdata)
library(sf)
library(tmap)
library(rmarkdown)
library(bookdown)

Some phenomenal Australian economists and statisticians have put together a handy election package called eechidna. It includes three main data sets for the most recent Australian federal election (2019).

  • fp19: first preference votes for candidates at each electorate

  • tpp19: two party preferred votes for candidates at each electorate

  • tcp19: two candidate preferred votes for candidates at each electorate

They’ve also gone to the trouble of aggregating some census data to the electorate level. This can be found in the abs2016 data set.

data(fp19)
data(tpp19)
data(tcp19)
data(abs2016)

# Browse the data as interactive tables
# (for a static preview, try head(tpp19) %>% kable("simple"))
DT::datatable(tpp19)
DT::datatable(tcp19)

5.2 Working with election maps

As noted in the introduction, elections are spatial in nature.

Not only does geography largely determine policy decisions, we see that many electorates vote for the same party (or even the same candidate) for decades. How electorate boundaries are drawn is a long story (see here, here, and here).

The summary version is the AEC carves up the population by state and territory, uses a wacky formula to decide how many seats each state and territory should be allocated, then draws maps to try and get a roughly equal number of people in each electorate. Oh… and did I mention for reasons that aren’t worth explaining that Tasmania has to have at least 5 seats? Our Federation is a funny thing. Anyhow, at time of writing this is how the breakdown of seats looks.

State/Territory               Number of members of the House of Representatives
----------------------------  -------------------------------------------------
New South Wales               47
Victoria                      39
Queensland                    30
Western Australia             15
South Australia               10
Tasmania                      5
Australian Capital Territory  3
Northern Territory            2*
TOTAL                         151

Note: The NT doesn’t have the population to justify its second seat. The AEC was scheduled to abolish it after the 2019 election, but Parliament intervened in late 2020 and a bill was passed to make sure both seats were kept (keeping the national total at 151).
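For the curious, the formula for the states is less wacky than it sounds. Under section 24 of the Constitution, a quota is found by dividing the combined population of the six states by twice the number of state senators, and each state’s population is then divided by that quota, with one extra seat when the remainder is more than half a quota. Below is a minimal sketch of that calculation. The function name `seat_entitlement` and the population figures are my own assumptions (round made-up numbers, not official statistics), and the territories follow separate rules that are not shown.

```r
# Sketch of the state seat entitlement calculation (section 24 of the Constitution).
# Populations below are hypothetical round numbers, NOT official figures.
seat_entitlement <- function(pop, n_state_senators = 72, min_seats = 5) {
  quota <- sum(pop) / (2 * n_state_senators)
  raw <- pop / quota
  # One extra member when the remainder is greater than half a quota
  seats <- floor(raw) + as.integer(raw - floor(raw) > 0.5)
  # Original states are guaranteed a minimum number of seats
  pmax(seats, min_seats)
}

state_pop <- c(NSW = 8000000, VIC = 6500000, QLD = 5000000,
               WA = 2600000, SA = 1700000, TAS = 500000)
seat_entitlement(state_pop)
```

With these made-up numbers Tasmania’s raw entitlement is only about 3 seats, and the minimum lifts it to 5.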

How much do these 151 electorates vary in size?

Massively. Durack in Western Australia (1.63 million square kilometres) is by far the largest, and the smallest is Grayndler in New South Wales (32 square kilometres).

Let’s make a map to make things easier.

CED_map <- ced2018 %>%
           ggplot()+
           geom_sf()+
           labs(title="Electoral divisions in Australia",
               subtitle = "It turns out we divide the country in very non-standard blocks",
               caption = "Data: Australian Bureau of Statistics 2016",
               x="",
               y="") + 
           theme_minimal() +
            theme(axis.ticks.x = element_blank(),axis.text.x = element_blank())+
            theme(axis.ticks.y = element_blank(),axis.text.y = element_blank())+
            theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
            theme(legend.position = "right")+
            theme(plot.title=element_text(face="bold",size=12))+
            theme(plot.subtitle=element_text(size=11))+
            theme(plot.caption=element_text(size=8))


CED_map_remove_6 <- ced2018 %>%
                    dplyr::filter(!ced_code_2018 %in% c(506,701,404,511,321,317)) %>%   
                    ggplot()+
           geom_sf()+
           labs(title="145 electoral divisions in Australia",
               subtitle = "Turns out removing the largest 6 electorates makes a difference",
               caption = "Data: Australian Bureau of Statistics 2016",
               x="",
               y="") + 
           theme_minimal() +
            theme(axis.ticks.x = element_blank(),axis.text.x = element_blank())+
            theme(axis.ticks.y = element_blank(),axis.text.y = element_blank())+
            theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
            theme(legend.position = "right")+
            theme(plot.title=element_text(face="bold",size=12))+
            theme(plot.subtitle=element_text(size=11))+
            theme(plot.caption=element_text(size=8))

CED_map
CED_map_remove_6

Next let’s look at what party/candidate is currently the sitting member for each electorate. To do this on a map we’re going to need to join our tcp19 data and the ced2018 spatial data.

In the first data set, the electorate column is called ‘DivisionNm’ and in the second ‘ced_name_2018’.

We see the data in our DivisionNm variable is in UPPERCASE while our ced_name_2018 variable is in Titlecase. Let’s change the first variable to Titlecase. We can then make the column names the same, and run our left_join function.

#Pull in the electorate shapefiles from the absmapsdata package
electorates <- ced2018

#Make the DivisionNm Titlecase
tcp19$DivisionNm <- str_to_title(tcp19$DivisionNm)

tcp19_edit <- tcp19 %>% distinct() %>% filter(Elected == "Y")

#Make the column names the same
electorates <- dplyr::rename(electorates, DivisionNm = ced_name_2018)

ced_map_data <- left_join(tcp19_edit, electorates, by = "DivisionNm")

ced_map_data <- as.data.frame(ced_map_data)

head(ced_map_data)
str(ced_map_data)
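One caveat on the title-case join above: division names with internal capitals (McEwen, for example) may not survive a trip through title case, so some rows can silently fail to match. Here is a minimal sketch of a more defensive approach, using toy data frames in place of tcp19 and ced2018. The `join_key` column and the toy rows are assumptions for illustration. The idea is to match on an upper-cased key on both sides and use anti_join to list anything left unmatched.

```r
library(dplyr)

# Toy stand-ins for the results table and the map table
results <- tibble(DivisionNm = c("RICHMOND", "MCEWEN", "DURACK"),
                  PartyAb    = c("ALP", "ALP", "LP"))
shapes  <- tibble(ced_name_2018 = c("Richmond", "McEwen", "Grayndler"))

# Build the same upper-case key on both sides
results_keyed <- results %>% mutate(join_key = toupper(DivisionNm))
shapes_keyed  <- shapes  %>% mutate(join_key = toupper(ced_name_2018))

joined <- left_join(results_keyed, shapes_keyed, by = "join_key")

# Rows in the results that found no matching shape
unmatched <- anti_join(results_keyed, shapes_keyed, by = "join_key")
unmatched$DivisionNm
```

Running the anti_join before plotting is a cheap way to confirm that every elected member has a shape to be drawn on.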

ggplot()+
  geom_sf(data=ced_map_data,aes(geometry = geometry,fill=PartyAb)) +
  theme_minimal() +
            theme(axis.ticks.x = element_blank(),axis.text.x = element_blank())+
            theme(axis.ticks.y = element_blank(),axis.text.y = element_blank())+
            theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
            theme(legend.position = "right")+
            theme(plot.title=element_text(face="bold",size=12))+
            theme(plot.subtitle=element_text(size=11))+
            theme(plot.caption=element_text(size=8))+
            # fill (not colour) is the mapped aesthetic, so the manual scale must be a fill scale
            scale_fill_manual("PartyAb", values=c("LP" ="#80b1d3", 
                                                 "NP" = "#006400",
                                                 "ALP"= "#fb8072", 
                                                 "GVIC" = "#33a02c", 
                                                 "XEN" = "#beaed4", 
                                                 "ON" = "#fdc086", 
                                                 "KAP" = "#ffff99", 
                                                 "IND" = "grey25"))

5.3 Answering election questions

Let’s start by answering a simple question: who won the election? For this we’ll need to use the two-candidate preferred data set (to make sure we capture all the minor parties that won seats).

who_won <- tcp19 %>% 
  filter(Elected == "Y") %>% 
  group_by(PartyNm) %>% 
  tally() %>% 
  arrange(desc(n)) 

# inspect
who_won %>% kable("simple")
PartyNm                            n
-------------------------------  ---
AUSTRALIAN LABOR PARTY            68
LIBERAL PARTY                     67
NATIONAL PARTY                    10
INDEPENDENT                        3
CENTRE ALLIANCE                    1
KATTER’S AUSTRALIAN PARTY (KAP)    1
THE GREENS                         1

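Next up let’s see which candidates won with the smallest percentage of first preference votes. Note that the rest of this section uses the 2016 data sets (fp16 and tcp16), which also ship with eechidna.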

who_won_least_votes_prop <- fp16 %>% 
  filter(Elected == "Y") %>% 
  arrange(Percent) %>% 
  mutate(candidate_full_name = paste0(GivenNm, " ", Surname, " (", CandidateID, ")")) %>% 
  dplyr::select(candidate_full_name, PartyNm, DivisionNm, Percent)

who_won_least_votes_prop %>% head %>% kable("simple")
candidate_full_name      PartyNm                 DivisionNm       Percent
-----------------------  ----------------------  ---------------  -------
MICHAEL DANBY (28267)    AUSTRALIAN LABOR PARTY  MELBOURNE PORTS    27.00
CATHY O’TOOLE (28997)    AUSTRALIAN LABOR PARTY  HERBERT            30.45
JUSTINE ELLIOT (28987)   AUSTRALIAN LABOR PARTY  RICHMOND           31.05
TERRI BUTLER (28921)     AUSTRALIAN LABOR PARTY  GRIFFITH           33.18
STEVE GEORGANAS (29071)  AUSTRALIAN LABOR PARTY  HINDMARSH          34.02
CATHY MCGOWAN (23288)    INDEPENDENT             INDI               34.76

This is really something. Most of these appear to be seats in which the ALP relies heavily on preference flows from the Greens or independents to win. The electorate I grew up in (Richmond) is on the list, so let’s look at how its votes were allocated.

Richmond_fp <- fp16 %>% 
  filter(DivisionNm == "RICHMOND") %>% 
  arrange(-Percent) %>% 
  mutate(candidate_full_name = paste0(GivenNm, " ", Surname, " (", CandidateID, ")")) %>% 
  dplyr::select(candidate_full_name, PartyNm, DivisionNm, Percent, OrdinaryVotes)

Richmond_fp %>% knitr::kable("simple")
candidate_full_name        PartyNm                     DivisionNm  Percent  OrdinaryVotes
-------------------------  --------------------------  ----------  -------  -------------
MATTHEW FRASER (29295)     NATIONAL PARTY              RICHMOND      37.61          37006
JUSTINE ELLIOT (28987)     AUSTRALIAN LABOR PARTY      RICHMOND      31.05          30551
DAWN WALKER (28783)        THE GREENS                  RICHMOND      20.44          20108
NEIL GORDON SMITH (28349)  ONE NATION                  RICHMOND       6.26           6160
ANGELA POLLARD (29290)     ANIMAL JUSTICE PARTY        RICHMOND       3.14           3089
RUSSELL KILARNEY (28785)   CHRISTIAN DEMOCRATIC PARTY  RICHMOND       1.51           1484

Sure enough - the Greens certainly helped get the ALP across the line.

It would be a mistake to read these as the most marginal seats. Imagine the ALP wins 30% of first preferences and the Greens win 30%: assuming traditional preference flows, that adds up to roughly 60% after preferences, a pretty safe 10 points above the 50% needed to win. So let’s investigate which seats really are the most marginal.

who_won_smallest_margin <- tcp16 %>% 
  filter(Elected == "Y") %>% 
  mutate(percent_margin = 2*(Percent - 50), vote_margin = round(percent_margin * OrdinaryVotes / Percent)) %>% 
  arrange(Percent) %>% 
  mutate(candidate_full_name = paste0(GivenNm, " ", Surname, " (", CandidateID, ")")) %>% 
  dplyr::select(candidate_full_name, PartyNm, DivisionNm, Percent, OrdinaryVotes, percent_margin, vote_margin)

# have a look
who_won_smallest_margin %>%
 head %>%
 knitr::kable("simple")
candidate_full_name      PartyNm                 DivisionNm   Percent  OrdinaryVotes  percent_margin  vote_margin
-----------------------  ----------------------  -----------  -------  -------------  --------------  -----------
CATHY O’TOOLE (28997)    AUSTRALIAN LABOR PARTY  HERBERT        50.02          44187            0.04           35
STEVE GEORGANAS (29071)  AUSTRALIAN LABOR PARTY  HINDMARSH      50.58          49586            1.16         1137
MICHELLE LANDRY (28034)  LIBERAL PARTY           CAPRICORNIA    50.63          44633            1.26         1111
BERT VAN MANEN (28039)   LIBERAL PARTY           FORDE          50.63          42486            1.26         1057
ANNE ALY (28727)         AUSTRALIAN LABOR PARTY  COWAN          50.68          41301            1.36         1108
ANN SUDMALIS (28668)     LIBERAL PARTY           GILMORE        50.73          52336            1.46         1506

Crikey. We see Cathy O’Toole got in with a 0.04% margin (just 35 votes!)
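To make the margin arithmetic concrete: in a two-candidate contest the loser holds 100 minus Percent of the vote, so the gap between winner and loser is 2 × (Percent − 50) percentage points. The vote margin scales that gap by the implied total of ordinary votes (the winner’s votes divided by their share). Here is a small sketch re-deriving the first two rows of the table. The helper name `tcp_margin` is hypothetical, and it assumes OrdinaryVotes is the winner’s own count, as the formula in the code above implies.

```r
# Recompute the margins for Herbert and Hindmarsh from the table above
tcp_margin <- function(percent, winner_votes) {
  percent_margin <- 2 * (percent - 50)              # winner share minus loser share
  total_votes    <- winner_votes / (percent / 100)  # implied total ordinary votes
  vote_margin    <- round(percent_margin / 100 * total_votes)
  data.frame(percent_margin = round(percent_margin, 2), vote_margin = vote_margin)
}

tcp_margin(percent = c(50.02, 50.58), winner_votes = c(44187, 49586))
```

This is algebraically the same as `percent_margin * OrdinaryVotes / Percent`, just split into steps so each quantity has a name.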

While we’re at it we better do the opposite and see who romped it by the largest margin.

who_won_largest_margin <- tcp16 %>% 
  filter(Elected == "Y") %>% 
  mutate(percent_margin = 2*(Percent - 50), vote_margin = round(percent_margin * OrdinaryVotes / Percent)) %>% 
  arrange(desc(Percent)) %>% 
  mutate(candidate_full_name = paste0(GivenNm, " ", Surname, " (", CandidateID, ")")) %>% 
  dplyr::select(candidate_full_name, PartyNm, DivisionNm, Percent, OrdinaryVotes, percent_margin, vote_margin)

# Look at the data
 who_won_largest_margin %>%
 head %>%
 knitr::kable("simple")
candidate_full_name       PartyNm                 DivisionNm  Percent  OrdinaryVotes  percent_margin  vote_margin
------------------------  ----------------------  ----------  -------  -------------  --------------  -----------
ANDREW BROAD (28415)      NATIONAL PARTY          MALLEE        71.32          62383           42.64        37297
PAUL FLETCHER (28565)     LIBERAL PARTY           BRADFIELD     71.04          66513           42.08        39398
JULIE BISHOP (28746)      LIBERAL PARTY           CURTIN        70.70          60631           41.40        35504
SUSSAN LEY (28699)        LIBERAL PARTY           FARRER        70.53          68114           41.06        39653
JASON CLARE (28931)       AUSTRALIAN LABOR PARTY  BLAXLAND      69.48          55507           38.96        31125
BRENDAN O’CONNOR (28274)  AUSTRALIAN LABOR PARTY  GORTON        69.45          68135           38.90        38163

Wowza. That’s really something. Some candidates won seats with a margin of around 40 percentage points, scooping up 70% of the two candidate preferred vote in the process!

# Seats won by party and state (2016 data); note this overwrites the earlier who_won
who_won <- tcp16 %>% 
  filter(Elected == "Y") %>% 
  group_by(PartyNm, StateAb) %>% 
  tally() %>% 
  arrange(desc(n)) 

who_won_by_state <- spread(who_won,StateAb, n) %>% arrange(desc(NSW))

#View data set
who_won_by_state %>% 
knitr::kable("simple")
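As a side note, spread() still works but has been superseded in tidyr by pivot_wider(). Here is a sketch of the same reshape on a toy tally (the counts are invented for illustration), with values_fill set so that party and state combinations with no seats show 0 rather than NA.

```r
library(dplyr)
library(tidyr)

# Toy tally in the same long shape as who_won (counts are made up)
seat_tally <- tibble(
  PartyNm = c("AUSTRALIAN LABOR PARTY", "AUSTRALIAN LABOR PARTY",
              "LIBERAL PARTY", "THE GREENS"),
  StateAb = c("NSW", "VIC", "NSW", "VIC"),
  n       = c(24, 18, 16, 1)
)

# One row per party, one column per state
seats_wide <- seat_tally %>%
  pivot_wider(names_from = StateAb, values_from = n, values_fill = 0) %>%
  arrange(desc(NSW))

seats_wide
```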

5.5 Mapping booths

The AEC maintains a handy spreadsheet of booth locations for recent federal elections. You can search for your local booth location (probably a school, church, or community centre) in the table below.

What do these booths look like on a map? Let’s reuse the CED map above and plot a point for each booth location.

            # `booths` is assumed to be a data frame of AEC polling places with Longitude and Latitude columns
            ggplot()+
            geom_sf(data=ced2018)+
            geom_point(data=booths, aes(x=Longitude, y=Latitude), 
                       colour="purple", size=1, alpha=0.3, inherit.aes=FALSE) +
            labs(title="Polling booths in Australia",
               subtitle = " ",
               caption = "Data: Australian Electoral Commission 2016",
               x="",
               y="") + 
            theme_minimal() +
            theme(axis.ticks.x = element_blank(),axis.text.x = element_blank())+
            theme(axis.ticks.y = element_blank(),axis.text.y = element_blank())+
            theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
            theme(legend.position = "right")+
            theme(plot.title=element_text(face="bold",size=12))+
            theme(plot.subtitle=element_text(size=11))+
            theme(plot.caption=element_text(size=8)) +
            xlim(c(112,157)) + ylim(c(-44,-11))
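Note that setting xlim() and ylim() will typically drop any booths whose coordinates fall outside the plotting window or are missing. Here is a quick sketch for counting them first, using a toy `booths_toy` data frame with the Longitude and Latitude columns the plot expects. The rows and the PollingPlaceNm column are invented for illustration.

```r
library(dplyr)

# Toy booths table; the real AEC data is assumed to have the same two coordinate columns
booths_toy <- tibble(
  PollingPlaceNm = c("Booth A", "Booth B", "Booth C", "Booth D"),
  Longitude = c(151.2, 115.9, NA, 0.1),
  Latitude  = c(-33.9, -31.9, NA, 51.5)
)

# Flag booths that sit inside the same window used by xlim() and ylim()
booths_checked <- booths_toy %>%
  mutate(in_window = !is.na(Longitude) & !is.na(Latitude) &
           between(Longitude, 112, 157) & between(Latitude, -44, -11))

count(booths_checked, in_window)
```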

5.6 Exploring booth level data

Figuring out where a candidate’s votes come from within an electorate is fundamental to developing a campaign strategy. Even in small electorates (e.g. Wentworth), there are pockets of right leaning and left leaning districts. Once you factor in preference flows - this multi-variate calculus becomes important to winning or maintaining a seat.

In the eechidna package, election results are provided at the resolution of polling place. They must be downloaded using the functions firstpref_pollingbooth_download, twoparty_pollingbooth_download or twocand_pollingbooth_download (depending on the vote type).

The two files need to be merged to be useful for analysis. Both have a unique ID for the polling place that can be used to match the records. The two party preferred vote, a measure of preference between only the Australian Labor Party (ALP) and the Liberal/National Coalition (LNP), is downloaded using twoparty_pollingbooth_download. The preferred party is the one with the higher percentage, and we use this to colour the points indicating polling places.

We see that within some big rural electorates (e.g. in Western NSW), there are pockets of ALP preference despite the seat going to the LNP. Note that this data set is on a tpp basis - so we can’t see the booths that were won by minor parties (although it would be fascinating).
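The merge-and-colour logic described above can be sketched on toy data. The column names here (PollingPlaceID, ALP_percent, LNP_percent) are assumptions chosen for illustration and may not match what the eechidna download functions actually return.

```r
library(dplyr)

# Toy polling place locations (hypothetical rows)
places <- tibble(PollingPlaceID = c(1, 2, 3),
                 Longitude = c(149.1, 147.3, 150.9),
                 Latitude  = c(-33.3, -31.2, -34.4))

# Toy two party preferred results for the same booths
tpp_toy <- tibble(PollingPlaceID = c(1, 2, 3),
                  ALP_percent = c(44.0, 58.5, 61.2),
                  LNP_percent = c(56.0, 41.5, 38.8))

# Match on the shared polling place ID, then label each booth by the higher share
booth_pref <- places %>%
  inner_join(tpp_toy, by = "PollingPlaceID") %>%
  mutate(preferred = if_else(ALP_percent > LNP_percent, "ALP", "LNP"))

booth_pref
```

The `preferred` column is what would be mapped to point colour when plotting booths over the electorate boundaries.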

The two candidate preferred vote (downloaded with twocand_pollingbooth_download) is a measure of preference between the two candidates who received the most votes through the division of preferences, where the winner has the higher percentage.

5.7 Donkeys, dicks, and other informalities

We’re about to go off the deep end into a certain type of election data.

In the 2016 Australian Federal Election, 720,915 people (5.05% of all votes cast) voted informally. Of these, over half (377,585) had ‘no clear first preference,’ meaning their vote did not contribute to the campaign of any candidate.
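As a quick check on the “over half” claim, using only the two counts quoted above (the variable names are my own):

```r
informal_total <- 720915  # informal votes, 2016 federal election
no_clear_first <- 377585  # informal votes with no clear first preference

# Share of informal votes with no clear first preference, as a percentage
share_no_first <- 100 * no_clear_first / informal_total
round(share_no_first, 1)
```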

I’ll be honest, informal votes absolutely fascinate me. Not only are there 8 types of informal votes (you can read all about the Australian Electoral Commission’s analysis here), but the rate of informal voting varies a tremendous amount by electorate.

Broadly, we can think of informal votes in two main buckets.

  1. Protest votes

  2. Stuff-ups

If we want to get particular about it, I like to subcategorise these buckets into:

  1. Protest votes (i.e. a person that thinks they are voting against):

    • the democratic system,

    • their local selection of candidates on the ballot, or

    • the two most likely candidates for PM.

  2. Stuff ups (people who):

    • filled in the form wrong but a clear preference was still made

    • stuffed up the form entirely and it didn’t contribute towards the tally for any candidate

This is the good bit:

The AEC works tirelessly to reduce stuff-ups on ballot papers (clear instructions and UI etc), but there isn’t much of a solution for protest votes. What’s interesting is you can track the ‘vibe’ of how consequential an election is by the proportion of protest votes.

Let’s pull some informal voting data from the AEC website.