Remembrance of Things Past: or, choropleth maps with R & plotly

Nov 14, 2017 R plotly choropleth

In my experience, revisiting old projects can be intellectually embarrassing, especially when one’s fluency with a particular tool or language is evolving quickly. Elements of code that seemed especially elegant in the past can appear clunky and counter-intuitive through fresh eyes. The logic behind an entire workflow may suddenly seem as if it were the product of feverish sleep deprivation. Beyond that, the analysis itself can come across as overly tedious, too general, or unsophisticated. One can learn a lot in two years, and that new knowledge can make previous work appear sub-standard.

Almost exactly two years ago, I put together a project using some storm event data from the National Oceanic and Atmospheric Administration (NOAA). I had been using R for about three years, and it seemed a good time to flex my coding muscles and see what I could accomplish with data outside of Excel and Tableau (my primary analytical and visualization tools up until then). The resulting analysis, written in R Markdown, can be found on RPubs; the accompanying Tableau dashboard, on Tableau Public. I am revisiting this project now in order to identify its particular perils and pitfalls and to see what I have learned (and unlearned) since November 15th, 2015.

My initial reaction to the post on RPubs, even before reading one actual word, is “wow, that’s a lot of code. What was I trying to do?” The short introduction states that I had plans to create a choropleth map that illustrates the “frequency of severe weather and its economic impact in the continental United States.” That’s a lofty goal for a single visualization. It definitely fits into the “kitchen-sink” style of dashboard construction. Just give the user a massive data set along with a bunch of filters and parameters and voila, a data analysis. While this may be an acceptable exploratory dashboard, suitable for someone who just wants to poke around a given data set and get an idea of what is available, it does not serve a specific analytical purpose. I began the project without a question and therefore the analysis yields no answers.

An exploratory dashboard certainly has its uses. A business may utilize a similar dashboard of sales data in order to view trends, identify patterns, or get a high-level view of the company’s sales pipeline. A proper analysis, on the other hand, should answer a question. In particular, a personal project that is meant to appeal to a wider audience should yield some analytical insight beyond, “hey, have a look at this data over here.” To maintain someone’s interest, progress towards a compelling conclusion is helpful. This primary oversight derails a train that might otherwise have ended up somewhere interesting.

As a direct result of my lack of focus on a particular question, I spent a lot of time writing code that yielded few benefits. Standardizing the event type descriptions (e.g. strong winds, high winds, thunderstorm winds) into more general categories (high wind) makes for an easier-to-navigate dashboard with fewer filter options, but it also obscures detail that someone with an actual question might find useful. If an analyst is interested specifically in winds caused by thunderstorms, then my attempt at standardization renders the dashboard essentially useless.
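
For context, the standardization I am describing amounted to a simple recode along these lines. This is a rough sketch with dplyr and stringr, not the original code; the wind and flood groupings (and the helper name standardize_event) are purely illustrative.

library(tidyverse)

#Collapse specific event types into broader categories (illustrative sketch only)
standardize_event <- function(event_type) {
  case_when(
    str_detect(event_type, regex("wind", ignore_case = TRUE)) ~ "High Wind",
    str_detect(event_type, regex("flood", ignore_case = TRUE)) ~ "Flood",
    TRUE ~ str_to_title(event_type)
  )
}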

The rest of the code is a bit sloppy at times, but altogether better than I had expected. The conversion of the character strings that represent property and crop damage to numbers could be quite a bit tighter (a possible approach is sketched below), and the function used to standardize the county names could lean more heavily on the map/apply family of functions, but otherwise no extreme changes are necessary. Thankfully, it’s slightly less embarrassing than I thought it would be.
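
Here is a minimal sketch of what a tighter conversion might look like, assuming the damage values follow the usual "25.00K" / "1.5M" / "2B" pattern. The K/M/B suffix handling and the helper name parse_damage are assumptions for illustration, not a reproduction of the original code.

library(tidyverse)

#Convert damage strings like "25.00K", "1.5M", or "2B" into numeric dollar amounts (sketch)
parse_damage <- function(x) {
  multiplier <- case_when(
    str_detect(x, regex("K$", ignore_case = TRUE)) ~ 1e3,
    str_detect(x, regex("M$", ignore_case = TRUE)) ~ 1e6,
    str_detect(x, regex("B$", ignore_case = TRUE)) ~ 1e9,
    TRUE ~ 1
  )
  as.numeric(str_replace(x, regex("[KMB]$", ignore_case = TRUE), "")) * multiplier
}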

Now that the perils and pitfalls have been noted, it should be relatively straightforward to turn this data into something with actual analytical value. All I need is a good question…

Has the frequency of storm events in the U.S. changed over the past 20 years?

Climate change, though sometimes highly contentious, is an important political and environmental issue. This data set, used properly, can help illustrate the changing patterns of storm activity in the United States over the past 20 years. I do not assert that any of the forthcoming findings will bear heavily on the issue (I am not adhering to any rigorous statistical methods), but they may help to clarify a small piece of a complicated puzzle.

First, let’s load the required libraries.

library(rvest)
library(tidyverse)
library(magrittr)
library(lubridate)
library(plotly)
library(maps)
library(scales)

Now, download the data from the NOAA (see ‘url’ below for the exact source of the data files). As the NOAA’s data collection procedures changed in 1996, only data from that point onwards will be used in this analysis.

url <- ("https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/")

#Create a vector of file paths that can be passed to the map_df() function.
file_paths <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(header = FALSE) %>%
  extract2(1) %>%
  mutate(path = paste0(url, X2),
         #Create a 'year' variable that can be used as a filter.
         year = str_sub(str_extract(X2, "d[0-9]{4}"), 2, 5)) %>%
  filter(str_detect(X2, "details"), year >= 1996 & year <= year(Sys.Date() - years(1))) %>%
  pull(path)

#Combine all csv files into a single data frame. All variables are set to "character" to avoid import errors.
master_dat <- map_df(file_paths, read_csv, col_types = cols(.default = "c"))
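
At this point, a quick sanity check confirms that the files were combined correctly and shows which columns are available. This is just a sketch; the exact dimensions will depend on when the files are downloaded.

#Confirm the import and inspect the available columns
dim(master_dat)
glimpse(master_dat)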

To examine the change in the frequency of storm events, the raw data is split into two time periods: the first half and the second half of each state’s record. Average annual storm events are then calculated for each time period. Additionally, the data is aggregated at the state level to facilitate mapping. Some variable class transformations and formatting are applied as well.

#This function converts numeric values to a percent format with one decimal place.
percent_func <- function(x, digits = 1, format = 'f') {
  paste0(formatC(100 * x, format = format, digits = digits), "%")
}
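#Example usage (sketch): percent_func(0.214) should return "21.4%"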

#Transform raw data into a presentation table that can be used to generate the choropleth.
dat <- master_dat %>% 
  select(YEAR, STATE, EVENT_ID) %>%
  #Remove regions like "Atlantic Coast" that can't be mapped easily
  filter(STATE %in% toupper(state.name)) %>%
  mutate(YEAR = as.numeric(YEAR)) %>%
  group_by(STATE, YEAR) %>%
  summarise(STORM_COUNT = n()) %>%
  ungroup() %>%
  group_by(STATE) %>%
  #Create a variable that chronologically separates the data into two halves
  mutate(FLAG = ifelse(YEAR %in% sort(unique(YEAR))[1:round(length(unique(YEAR))/2,0)],
                       'avg_annual_storms1','avg_annual_storms2')) %>%
  group_by(STATE, FLAG) %>%
  summarise(avg_storm_count = sum(STORM_COUNT)/n_distinct(YEAR)) %>%
  spread(FLAG, avg_storm_count, fill = 0) %>%
  summarise_all(sum) %>%
  ungroup() %>%
  mutate(pct_delta = avg_annual_storms2/avg_annual_storms1 - 1, pct_delta_disp = percent_func(pct_delta),
         state_name = state.name[match(STATE, toupper(state.name))],
         region = tolower(state_name),
         #This value will be used as a pop-up on the map, showing the change in storm frequency
         hover = paste(paste0(state_name, ':'), pct_delta_disp, sep = ' '))
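
Before mapping, it can be useful to glance at the states with the largest swings. A quick sketch using the columns created above:

#List the states with the largest increases in average annual storm events
dat %>%
  arrange(desc(pct_delta)) %>%
  select(state_name, avg_annual_storms1, avg_annual_storms2, pct_delta_disp)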

The plotly library has its own set of functions that can be used to build interactive graphics. It also has a function that will turn a ggplot2 object into a plotly object. As I am more familiar with the ggplot2 syntax, I chose to build the map with ggplot2 and then convert to a plotly chart with ggplotly(). Somewhat surprisingly, it worked exactly as it was supposed to on the first try. Huzzah plotly.

#Acquire map of US from 'maps' library
us <- map_data("state")

#Create a dual layer ggplot map (state borders & fill are separate layers)
gg <- ggplot() + 
  geom_map(data = us, map = us, aes(x = long, y = lat, map_id = region), fill = "#ffffff", 
           color = "#ffffff", size = 0.15) +
  geom_map(data = dat, map = us, aes(fill = pct_delta, map_id = region, text = hover), 
           color = "#ffffff", size = 0.15) +
  #Create color gradient where positive numbers are shades of blue and negative numbers are red
  #Limits are set via the min and max values of the 'pct_delta' field
  scale_fill_gradientn(colours = c("red", "yellow", "white", "lightblue", "darkblue"), 
                       values = scales::rescale(c(min(dat$pct_delta), -.01, 0, .01, max(dat$pct_delta))), 
                       space = "Lab", na.value = "grey50", guide = "colourbar", name = '% change') + 
  #The ggplotly() function creates weirdly squished maps unless the coord_fixed value is set to about 1.3
  coord_fixed(1.3) + 
  #Get rid of extraneous marks
  labs(x = NULL, y = NULL) +
  theme(panel.border = element_blank(), panel.background = element_blank(), axis.ticks = element_blank(), 
        axis.text = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#Convert ggplot object to interactive plotly graphic
ggp <- ggplotly(gg, tooltip = c("text"), width = 600, height = 800)
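
To keep a standalone copy of the interactive map, the widget can also be written out to an HTML file. A sketch, assuming the htmlwidgets package is installed (the file name is arbitrary):

#Optionally save the interactive map as a self-contained HTML file
library(htmlwidgets)
saveWidget(ggp, "storm_map.html", selfcontained = TRUE)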

Average annual storm events in the United States: 1996-2006 vs 2007-2017

On average, there were 49,518 annual storm events in the United States from 1996 through 2006. From 2007 through 2017, that number jumped to 60,098, an increase of about 21.4%. See the state-by-state breakdown below.
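
For reference, national figures like those above can be reproduced directly from master_dat with a quick aggregation. This is only a sketch; the exact numbers depend on when the files were downloaded and on whether non-state regions are included.

#Average annual storm events nationwide, split at the midpoint of the period
master_dat %>%
  mutate(YEAR = as.numeric(YEAR),
         PERIOD = ifelse(YEAR <= 2006, "first half", "second half")) %>%
  group_by(PERIOD) %>%
  summarise(avg_annual_storms = n() / n_distinct(YEAR))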

Assorted observations:

  • A good analysis starts with a good question.
  • The coastal areas where I would expect an increase in storm activity (e.g. snow storms in Maine & hurricanes in Florida) were the few places that actually saw a decrease in activity.
  • What happened in Wyoming? Storm activity more than doubled there, and surrounding states experienced much less of an increase.
  • I just went back and improved something that I made a long time ago. Life affords few such opportunities. Code is one of them.