One man's function is another man's function.

Oct 8, 2018 R shiny tidyverse

Data is rarely clean enough for use in modeling, reporting, or dashboarding. Null data, human error, and the presence of multiple accurate values can make it difficult to build a good data product or perform a useful analysis. A customer table, for instance, may have multiple correct entires for phone number (mobile, office, home, etc.). It could also have missing and inaccurate values for address if the customer has more than one residence, purchases an item as a digital gift that is delivered via email, or uses different mailing and billing addresses.

In some cases, this is fine. Data is messy. In other cases however, it may suit the analyst to homogenize some variables by populating them with the most commonly occurring variable for a given ID. Even though some values will be lost through homogenization, one may not need every address value that the customer has ever entered. An analysis or model may require a single address value for each customer ID. In these cases, the Data Homogenization Machine can help (links to the app and github repo are at the bottom of this post).

This project is largely inspired by the presentation that Adam Black, research analyst at Maine Medical Center, gave at the Greater Portland useR Group Meetup on September 20, 2018. His work can be found here. The function he developed is used to clean up tables on large databases. As such, he was required to implement some advanced programming techinques with the rlang package, and run pure dplyr code on the database itself. I thought it would be a fun project to genericize the function a bit, avoid running the code directly on the database so that I can use other tidyverse packages like tidyr and purrr, and turn the function into a Shiny application. Of course, there are drawbacks to this approach. If your dataset does not fit in memory, the app will not work. But, if the data does fit in memory, and you do want to use this function on database tables, it is easy enough to import the data locally via a SQL query and then write the homogenized table back to the database with the copy_to() function from dbplyr. Otherwise, it works quite nicely with text files.

A quick overview of dashboard & application development in R

R provides a few different packages for developing interactive dashboards and applications. Flexdashboard is probably the simplest option. Flexdashboards are basically just static RMarkdown documents that can incorporate easy to use templates, html widgets, and interactive charts and maps like those made with plotly or leaflet (interactive elements are rendered client-side in javascript). Shinydashboard is more full-featured. Code is written in R, dashboard layouts are simple to construct, some additional widgets are available (e.g. value boxes), but the application must be hosted on a server that can execute the code (or run locally). Pure Shiny frees the developer from dashboard layout templates and permits free-form, interactive, application development. As with Shinydashboard, these applications require a server to process the R code. Though some knowledge of HTML and CSS is helpful, it is not a requirement.

For this project, I chose to use straight Shiny. It’s not as simple as the other two options, but it’s flexibility is a powerful asset (plus, it had been a while since I used shiny and my skills needed some polishing)

Shiny app development: the UI side

Thanks to the integrated ‘tidyverse’ library, only a few packages are necessary for this application.

library(tidyverse)
library(shiny)
library(shinythemes)
library(rlang)

Most of the shiny applications I’ve developed, as well as most I’ve encountered in the wild, use the fluidPage() layout. These applications automatically size to the browser window and even look pretty good on a mobile device. The fixedPage() layout is exactly that - fixed. Use this layout option if you want precise control over the inidivudal elements of your dashboard or application. Once you’ve decided on a layout option, it’s easy to add elements like sidebars, headers, and tabs. The following code adds a header and sidebar to the app. In addition, I’ve used a theme from the shinythemes package to pretty things up a bit (without having to write a bunch of CSS).

ui <- fluidPage(theme = shinytheme('cerulean'),
  
    tags$head(
        tags$style(HTML('h1 {background-color: #357FAA; text-align: center; color: #FFFFFF;}'))
    ),

    headerPanel('Data Homogenization Machine',
                windowTitle = 'Data Homogenization Machine'),
  
    br(),
  
    sidebarLayout(
    
        sidebarPanel(
            
            tags$h5('This app will homogenize one or more variables for each value in a given ID column.'),
            
            tags$p('Upload a delimited text file, indiciate if it has a header, choose the delimiter, 
              and then enter the ID variable name or position as well as the names or positions 
              of the variables to homogenize. Once submitted, the variables listed will be populated 
              with the most frequently occuring value for each ID. The homogenized table is then 
              printed to the screen and can be downloaded as a csv file.'),
            
            tags$a(href="https://github.com/davidldenton/homogenize",
                   tags$strong("Additional instructions & git repo", style = "color:purple")),

The sidebar also contains the user input controls. Uploading a file, selecting the delimiter, and listing the column names for homogenization all occur here.

            tags$h3('Input controls'),

            # Choose text file to upload
            fileInput(inputId = 'in_file',
                      label = 'Choose text file:',
                      accept = c('text/csv','text/comma-separated-values,text/plain','.csv'),
                      buttonLabel = 'Browse'),

            # Indicate if the file has a header or not
            radioButtons(inputId = 'header_check',
                         label = NULL,
                         choices = c('File with header', 'File without header')),
            
            # Choose delimiter
            selectInput(inputId = 'delimiter',
                        label = 'Delimiter:',
                        choices = c(',', '|', 'tab'),
                        selected = ','),
            
            # Input name or position of ID column
            textInput(inputId = 'id_var',
                      label = HTML('ID column name or numeric position'),
                      value = ''),
            
            # Input list of variables to homogenize
            textInput(inputId = 'variables',
                      label = HTML('List of variable names or numeric positions <br/>(comma-separated)'),
                      value = ''),
            
            # Submit file for homogenization process
            actionButton(inputId = 'submit',
                         label = 'Submit',
                         icon('upload'))
      
        ),

The last UI element, a main panel, is added so that the user can preview the homogenized output and download the results.

        mainPanel(
            
            tags$h3('Homogenized data'),
            
            # Print homogenized data frame
            tableOutput('data_table'),
            
            # Download homogenized data
            downloadButton('download', 'Download')
      
        )
    )
)

Shiny app development: the server side

The UI code determines the location, formatting, and appearance of the the individual objects in your shiny application. To define what constitutes those objects and how they will react to changing user inputs, we need to implement the server() function. This is where the heavy-lifting occurs. The homogenization function is defined here and the newly clean data is printed to the screen.

server <- function(input, output){

    # Define function to homogenize variables (populate column with the most frequently occurring value for each ID)
    homogenize_vars <- function(df, id_var, ...){
        
        # capture name of ID variable as a quosure
        id_var <- enquo(id_var)
        # convert ID quosure to a string
        id_var_name <- quo_name(id_var)
        # capture names of variables to be homogenized (as a quosure)
        ns_vars <- quos(...)
        # convert quosure containing variable names to strings
        ns_var_names <- flatten_chr(map(ns_vars, quo_name))
        # create a character vector of all column names input by the user
        all_var_names <- c(id_var_name, ns_var_names)

        replacement_values <- df %>%
            # select only variables that were provided as user inputs
            select_at(.vars = vars(one_of(all_var_names))) %>%
            # group by ID variable (!! to unquote)
            group_by(!!id_var) %>%
            # for the variables to be homogenized, replace NA values with '0'
            mutate_at(.vars = vars(one_of(ns_var_names)),
                      .funs = funs(ifelse(is.na(.) | . == '', '0', .))) %>%
            # gather variable and values into name/value pairs
            gather(var_name, var_value, ns_var_names) %>%
            group_by(!!id_var, var_name, var_value) %>%
            # count the number of occurences for each value by ID and variable name
            summarise(n = n()) %>%
            group_by(!!id_var, var_name) %>%
            # select most frequently occuring value for each ID/variable pair (NA's appear as '0')
            summarise(row_count = n(),
                first_value = first(var_value, order_by = desc(n)),
                second_value = nth(var_value, 2, order_by = desc(n)),
                replacement_value = case_when(
                  first_value != '0' ~ first_value,
                  !is.na(second_value) ~ second_value,
                  TRUE ~ '0')) %>%
            group_by(!!id_var) %>%
            select(!!id_var, var_name, replacement_value) %>%
            spread(var_name, replacement_value)

        df <- df %>%
            # rename variables to be homogenized in original table
            rename_at(.vars = vars(one_of(ns_var_names)),
                      .funs = funs(paste0(., '_tmp'))) %>%
            # join to data frame containing new values
            left_join(replacement_values, by = id_var_name) %>%
            # drop original, non-homogenized, variables
            select(-contains('_tmp'))
    }

If you reviewed Adam Black’s version of this function (here), you’ll notice I rely heavily on tidyr functions like gather() and spread(). This allows me to simplify the code so that it is easier to understand for more novice R users. Mr. Black, who needs this code to run directly on a database, relies on some advanced rlang functionality, coupled with dplyr. Neither approach works better than the other. The design, in both cases, is a direct response to the specific use case. I do not need a function to homogenize tables directly on a database. I needed something fast that works on a variety of data sources. This particular application focuses solely on text files. It could be easily modified to work with a database connection (assuming the user has enough RAM), excel files, or even fixed width files (yes, people still use those).

Once the data has been submitted and the function defined, a few helper variables are created, and the function is applied to the data frame. As I only want to the data to be processed after the user selects “submit”, the individual reactive elements are wrapped in isolate(), and the entire output is dropped into the observeEvent() function. This prevents the output table from updating every time a user changes one of the input fields.

# Print homogenized data frame to screen after submit
    observeEvent(input$submit, {
        output$data_table <- renderTable({
            
        # Return null if no file is uploaded
        in_file <- input$in_file
        isolate(if(is.null(in_file)) return(NULL))
        
        # Create boolean object to indicate the presence (or not) of a header in the file
        header_bool <- isolate(
            if_else(input$header_check == 'File with header', TRUE, FALSE))
        
        # Read text file and save as 'download_data'
        download_data <- isolate(
            if(input$delimiter == 'tab'){
                read_tsv(in_file$datapath,
                         col_names = header_bool,
                         col_types = cols(.default = "c"))
            }
            else{
                read_delim(in_file$datapath,
                           delim = input$delimiter,
                           col_names = header_bool,
                           col_types = cols(.default = "c"))
            }
        )
        
        # Create vector of variables names that require homogenization and remove any whitespace
        vars_to_homogenize <- isolate(
            if(header_bool){
                strsplit(input$variables, split = ',') %>%
                flatten_chr() %>%
                str_remove_all(pattern = ' ')
                
            }
            else{
                strsplit(input$variables, split = ',') %>%
                flatten_chr() %>%
                str_remove_all(pattern = ' ') %>%
                map(function(x) paste0('X', x)) %>%
                flatten_chr()
                
            }
        )
        
        # Identify ID column
        id_col <- isolate(
            if(header_bool){
                input$id_var
            }
            else{
                paste0('X', input$id_var)
            }
        )

Once the arguments to the homogenize_vars() function have been defined, they are collected in a list and converted to symbols. The list of arguments is then passed to the function and the new, homogenized “download_data” object is created. Note that “download_data” is created using the “<<-” assignment operator, making it available in the global environment outside of the renderTable() function. There is probably a simpler rlang solution to this (e.g. using quosures to capture the environment as well as well as the expression), but this code is pretty straightforward and I didn’t want to go too far down the quotation rabbit hole. I’d love to see some alternate solutions in the comments, however.

        # Arguments to pass to homogenize_vars() function
        args <- isolate(syms(c('download_data',
                               id_col,
                               vars_to_homogenize)))
        
        # Create data frame with homogenized data and print the first 25 rows
        download_data <<- head(do.call(homogenize_vars, args), 25)

Apply some simple formatting to the output table.

        # Define table formatting
        }, striped = TRUE, align = 'c')

The homogenized data frame can now be downloaded by the user. The downloadHandler() function makes this possible. Simply supply a filename, the path to a yet-to-exist temp file, and the content (download_data). With the click of a button, the homogenized data is now available to the user as “clean_data.csv.” The seemingly extra parentheses and curly brackets are the ends of the observeEvent() and server() functions above.

# Create file of homogenized data for download
        output$download <- downloadHandler(
            filename = 'clean_data.csv',
            content = function(file){
                write.csv(download_data, file, row.names = FALSE, na = '')
            }
        )
    })
}

Finally, tie it all together and a shiny app is born.

# Create Shiny app
shinyApp(ui = ui, server = server)

Thank you to Adam Black for providing the inspiration to build a fun data product. I’ve been trying to come up with an interesting blog post for months. Lack of time and ideas had me stymied. All it took was one good presentation and my brain was racing. I can’t emphasize enough how valuable it can be to join a meetup of other R users. The social and knowledge-sharing aspects are great, but sometimes its hard to find inspiration to work on a project that isn’t for one’s job. Adam’s presentation provided the analytical push I needed. And the snowball is still rolling, collecting other ideas as it makes it’s way downhill.

I encourage anyone interested in R (novices through experts) to join us for the next Greater Portland (Maine) useR Group. Details can be found on our Meetup site.