Chapter 12 Geocoding in R
12.1 About geocoding
Geocoding allows us to find coordinates for a data point (or vice versa). For instance, if we have the street address of an asset/object we can use geocoding to find the latitude and longitude. Similarly, if we have the lat and lon, we can find the street address. This allows us to turn ‘boring’ address data into a spatial data point.
This makes geocoding immensely useful when trying to find the ‘catchments’ around a particular asset/object. It’s also handy in visualising location based data on a map.
To find the lat and long for an address - we have to find someone with a massive database of addresses and coordinates. Unsurprisingly, Google (through their Google Maps team) has this data.
We can ping Google’s data base through an API (they charge a ‘freemium’ model) to find the data we need.
Further reading
12.2 Introducing the dataset
Approved speed camera locations in Victoria, Australia are publicly available through the government website. We can download the dataset as a spreadsheet.
Note: this dataset does not have street numbers - only street names. This makes intuitive sense (as most speed cameras aren’t placed outside residential houses, but rather along main roads). However, it will cause some niche issues in the geocoding process.
12.3 Data cleaning
First, we start by installing and loading up some packages.
#Loads the required required packages
::p_load(ggmap, tmaptools, RCurl, jsonlite, tidyverse, leaflet, writexl, readr, readxl, sf, mapview, webshot,rgdal, tinytex)
pacman
#Make sure the HTML works
::install_phantomjs() webshot
Next, we import the data as a CSV.
We can see that the street and suburb names are in different columns. To run a geocode, we want a single field of address data. Let’s combine the street name and the suburb name into a single field.
We know all these addresses are in Australia, so we can add the word ‘Australia’ to the end of each entry in the newly created address field to help the geocoder find the location.
We think all these addresses are in the state of Victoria… however let’s not at the state in at this stage (we can see why later).
#Read in spreadsheet
<- 'https://raw.githubusercontent.com/charlescoverdale/bookdata/main/mobile-camera-locations-june-2022.csv'
url
#Note: We need to import as a csv rather than xlsx for url functionality
<- read.csv(url, header=TRUE)
camera_address
#Fix the column labels
<- janitor::row_to_names(camera_address,1)
camera_address
#Convert the suburb column to title case
$SUBURB <- str_to_title(camera_address$SUBURB)
camera_address
#Concatenate the two fields into a single address field
$ADDRESS <- paste(camera_address$LOCATION,
camera_address$SUBURB,
camera_addresssep=", ")
#Add in Australia to the address field just to idiot proof the df
$ADDRESS <- paste(camera_address$ADDRESS, ", Australia", sep="")
camera_address
#Preview the data
head(camera_address)
## LOCATION SUBURB Reason Code Audit Date
## 2 Abey Road Cobblebank BCD May-22
## 3 Adam Street Golden Square ACD Apr-22
## 4 Agar Road Coronet Bay BC T
## 5 Airport Drive Melbourne Airport ABCD Apr-22
## 6 Aitken Boulevard Roxburgh Park ABC Apr-22
## 7 Aitken Boulevard Craigieburn ABCD Apr-22
## ADDRESS
## 2 Abey Road, Cobblebank, Australia
## 3 Adam Street, Golden Square, Australia
## 4 Agar Road, Coronet Bay, Australia
## 5 Airport Drive, Melbourne Airport, Australia
## 6 Aitken Boulevard, Roxburgh Park, Australia
## 7 Aitken Boulevard, Craigieburn, Australia
12.4 Running the geocode
Now that we have a useful address field, we’re ready to run the geocode.
We can geocode the lats and lons using the Google API through ggmap
package.
We must register a API key (by creating an account in the google developer suite).
You can ping the API 2,500 times a day. Lucky for us this dataset is only 1,800 rows long!
Uncomment the chunk below (using Ctrl+Shift+C) and enter your unique key from google to run the code.
# #Input the google API key
# register_google(key = "PASTE YOUR UNIQUE KEY HERE")
#
# #Run the geocode function from ggmap package
# camera_ggmap <- ggmap::geocode(location = camera_address$ADDRESS,
# output = "more",
# source = "google")
#
# #We'll bind the newly created address columns to the original df
# camera_ggmap <- cbind(camera_address, camera_ggmap)
#
# #Print the results
# head(camera_ggmap)
#
# #Write the data to a df
# readr::write_csv(camera_ggmap,"C:/Data/camera_geocoded.csv")
We’ll load up the output this chunk generates and continue.
<- 'https://raw.githubusercontent.com/charlescoverdale/bookdata/main/camera_geocoded.csv'
url
#Note: We need to import as a csv rather than xlsx for url functionality
<- read.csv(url, header=TRUE) camera_ggmap
12.5 From geocode to shapefile
The raw dataset our geocode produced looks good! Although… it could probably use some cleaning.
Let’s rename this dataset so we don’t loose the original df (and therefore have to run the google query again). This is simply a best practice step to build in some redundancy.
<- camera_ggmap camera_geocoded
#Rename the API generated address field
<- camera_geocoded %>% dplyr::rename(address_long=address)
camera_geocoded_clean
#Select only the columns we need
<- camera_geocoded_clean %>%
camera_geocoded_clean ::select("LOCATION",
dplyr"SUBURB",
"ADDRESS",
"lon",
"lat",
"address_long"
)
#Make all the column names the same format
<- janitor::clean_names(camera_geocoded_clean)
camera_geocoded_clean
#We still need to convert address_long to title case
$address_long <- str_to_title(camera_geocoded_clean$address_long) camera_geocoded_clean
If we want to go one step further, we can create spatial points from this list of coordinates. This is a good step for eyeballing the data. We see most of it is in Victoria (as expected!)… but it has picked up a couple of points in Sydney and WA.
These are worth investigating seperately for correction or exclusion.
#Convert data frame to sf object
<- sf::st_as_sf(x = camera_geocoded_clean,
camera_points coords = c("lon", "lat"),
crs = "+proj=longlat
+datum=WGS84
+ellps=WGS84
+towgs84=0,0,0")
#Plot an interactive map
mapview(camera_points)
We can export these points as a shapefile using the rgdal
package.
#Write the points to a shapefile for use in QGIS
#sf::st_write(camera_points, "C:/Data/camera_points.shp")
Let’s now have a look at our edge case datapoints.
There’s a couple of ways to do this… but one of the simplest is to extract the postcode as a separate field. We can then simply sort by postcodes that do not start with the number 3.
$postcode <- str_sub(camera_points$address_long,
camera_pointsstart= -16,
end= -12)
$postcode <- as.numeric(camera_points$postcode)
camera_points
<- camera_points %>% filter (postcode <3000 | postcode >= 4000)
outliers
print(outliers)
## Simple feature collection with 7 features and 5 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 115.9555 ymin: -33.90941 xmax: 152.7034 ymax: -25.53849
## CRS: +proj=longlat
## +datum=WGS84
## +ellps=WGS84
## +towgs84=0,0,0
## location suburb
## 1 Epping Road Epping
## 2 High Street Epping
## 3 Kensington Road Kensington
## 4 Maryborough Road Ascot
## 5 Maryborough-Ballarat Road Talbot
## 6 Mooroopna-Murchison Road Murchison
## 7 Plumpton Road Plumpton
## address
## 1 Epping Road, Epping, Australia
## 2 High Street, Epping, Australia
## 3 Kensington Road, Kensington, Australia
## 4 Maryborough Road, Ascot, Australia
## 5 Maryborough-Ballarat Road, Talbot, Australia
## 6 Mooroopna-Murchison Road, Murchison, Australia
## 7 Plumpton Road, Plumpton, Australia
## address_long geometry
## 1 Epping Rd, Epping Nsw 2121, Australia POINT (151.0906 -33.77076)
## 2 High St, Epping Nsw 2121, Australia POINT (151.0836 -33.77611)
## 3 Kensington Rd, Kensington Nsw 2033, Australia POINT (151.2211 -33.90941)
## 4 Maryborough Qld 4650, Australia POINT (152.7034 -25.53849)
## 5 Maryborough Qld 4650, Australia POINT (152.7034 -25.53849)
## 6 Murchison Wa 6630, Australia POINT (115.9555 -26.89259)
## 7 Plumpton Rd, Plumpton Nsw 2761, Australia POINT (150.8439 -33.74823)
## postcode
## 1 2121
## 2 2121
## 3 2033
## 4 4650
## 5 4650
## 6 6630
## 7 2761
Easy enough. We see there’s 7 rows that don’t have a postcode starting in a 3000 postcode.
Four of these are in NSW, two in QLD, and one in WA. It looks like they are real (e.g. the streets and suburbs do exist in that state)… so for now let’s just exclude them from our dataset.
<- camera_points %>% filter (postcode>"3000" & postcode<"4000")
camera_points_vic
#Plot an interactive map
mapview(camera_points_vic)
#Write the points to a shapefile for use in QGIS
#sf::st_write(camera_points_vic, "C:/Data/camera_points_vic.shp")
There we go! A clean, geocoded dataset of speed camera locations in Victoria.