The Comprehensive Guide To R Programming

3.4 Lollipop Charts

The lollipop chart is a hybrid between a bar chart and a Cleveland dot plot. It typically places a categorical variable on the y-axis and measures it against a second (continuous) variable on the x-axis. As in the Cleveland dot plot, the emphasis is on the dot, which draws the reader's attention to the specific x-axis value achieved by each category. The line is a minimalistic way to tie each category to its point without drawing much attention to the line itself. A lollipop chart works well for comparing many categories because it helps the reader align categories with points while minimizing the amount of ink on the graphic.

3.4.1 Overview

This section introduces the basics of the lollipop chart and compares it to bar charts and dot plots. I also show how to go from a basic lollipop chart to a more refined, publication-worthy graphic. To reproduce the code throughout this tutorial you will need to load the following packages. Note that I use the development version of ggplot2, which offers some nice title, subtitle, and caption options that I cover in the last section (these options have since shipped in the CRAN release, ggplot2 2.2.0 and later). You can download the development version with this line of code: devtools::install_github("hadley/ggplot2")

library(dplyr)          # for data manipulation
library(tidyr)          # for data tidying
library(ggplot2)        # for generating the visualizations

In addition, throughout the tutorial I illustrate the graphics with the midwest data set provided in the ggplot2 package.

head(midwest)
# A tibble: 6 x 28
    PID county    state  area poptotal popdensity popwhite popblack
  <int> <chr>     <chr> <dbl>    <int>      <dbl>    <int>    <int>
1   561 ADAMS     IL    0.052    66090      1271.    63917     1702
2   562 ALEXANDER IL    0.014    10626       759      7054     3496
3   563 BOND      IL    0.022    14991       681.    14477      429
4   564 BOONE     IL    0.017    30806      1812.    29344      127
5   565 BROWN     IL    0.018     5836       324.     5264      547
6   566 BUREAU    IL    0.05     35688       714.    35157       50
# ... with 20 more variables: popamerindian <int>, popasian <int>,
#   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
#   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
#   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
#   percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>

3.4.2 Basic Lollipop Chart

Most readers would have little trouble understanding a basic bar chart, dot plot, or lollipop chart. Suppose we want to view the top 25 counties in Ohio by percentage of college-educated adults. This takes a little data manipulation: I order the counties by percent college educated (percollege) and then make the county variable a factor with its levels in that order, which lets us order the bars and dots in the following charts appropriately.

ohio_top25 <- midwest %>%
        filter(state == "OH") %>%
        select(county, percollege) %>%
        arrange(desc(percollege)) %>%
        top_n(25, percollege) %>%          # rank explicitly by percollege
        arrange(percollege) %>%
        mutate(county = factor(county, levels = .$county))
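
The factor step is what controls the axis order, so it may help to see it in isolation. The following is a minimal sketch in base R using a made-up three-row data frame (the names toy, category, and value are illustrative and not part of the midwest data): by default factor() sorts its levels alphabetically, whereas passing levels = in the desired order preserves that order, and ggplot2 draws discrete axes in level order.

```r
# Toy data: three hypothetical categories and their values (not from midwest)
toy <- data.frame(
        category = c("B", "A", "C"),
        value    = c(30, 20, 10),
        stringsAsFactors = FALSE
)

# Default behaviour: levels are sorted alphabetically
default_levels <- levels(factor(toy$category))

# Sort the rows by value, then use the sorted names as the factor levels
toy <- toy[order(toy$value), ]
toy$category <- factor(toy$category, levels = toy$category)
ordered_levels <- levels(toy$category)

print(default_levels)   # "A" "B" "C"
print(ordered_levels)   # "C" "A" "B"
```

With the second ordering, the category with the smallest value becomes the first level and is therefore drawn closest to the origin of a discrete axis.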

We could view the data as a horizontal bar chart…

# bar chart
ggplot(ohio_top25, aes(county, percollege)) +
        geom_bar(stat = "identity") +
        coord_flip()

as a dot plot…

# dot plot
ggplot(ohio_top25, aes(percollege, county)) +
        geom_point()

or as a lollipop chart. In the lollipop chart we use geom_segment to plot the lines, and we explicitly state that each line starts at x = 0 and extends to the percollege value with xend = percollege. Setting y = county and yend = county keeps each line horizontal at its county's position on the y-axis.

# lollipop chart
ggplot(ohio_top25, aes(percollege, county)) +
        geom_segment(aes(x = 0, y = county, xend = percollege, yend = county), color = "grey50") +
        geom_point()
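
Because the same geom_segment plus geom_point pairing recurs throughout this section, one option is to wrap it in a small function. The sketch below is only an illustration: lollipop_chart, its arguments, and the toy data frame are hypothetical names and not part of ggplot2, and it assumes ggplot2 3.0.0 or later with a version of rlang that supports the {{ }} operator inside aes().

```r
library(ggplot2)

# Hypothetical helper (not a ggplot2 function): horizontal lollipop chart.
#   data     - a data frame
#   value    - unquoted name of the continuous column (x-axis)
#   category - unquoted name of the categorical column (y-axis)
#   start    - x position where every line begins (0 by default)
lollipop_chart <- function(data, value, category, start = 0) {
        ggplot(data, aes({{ value }}, {{ category }})) +
                geom_segment(aes(x = start, xend = {{ value }},
                                 y = {{ category }}, yend = {{ category }}),
                             color = "grey50") +
                geom_point()
}

# Made-up data for illustration only
toy <- data.frame(category = c("A", "B", "C"), value = c(3, 7, 5))

p <- lollipop_chart(toy, value, category)
# print(p) draws the chart
```

Passing a different start value (for example, a group average) moves the origin of every line, which is the idea used in the next subsection.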

3.4.3 Comparing Multiple Points of Information

Consider the case where we want to compare counties in Ohio to see how they differ from the state average. Here the emphasis shifts to the state average, so we compute that value (in the code below, the unweighted mean of the county percentages) and test whether each county is above or below it.

ohio <- midwest %>%
        filter(state == "OH") %>%
        select(county, percollege) %>%
        arrange(percollege) %>%
        mutate(Avg = mean(percollege, na.rm = TRUE),
               Above = percollege > Avg,
               county = factor(county, levels = .$county))

head(ohio)
# A tibble: 6 x 4
  county percollege   Avg Above
  <fct>       <dbl> <dbl> <lgl>
1 VINTON       7.91  16.9 FALSE
2 ADAMS        8.74  16.9 FALSE
3 NOBLE        8.85  16.9 FALSE
4 HOLMES       9.33  16.9 FALSE
5 PERRY       10.1   16.9 FALSE
6 MONROE      10.5   16.9 FALSE

We can now incorporate this data into the graphic by setting x = Avg within geom_segment, so that each line starts at the state average, and then coloring the counties by whether they are above or below average.

ggplot(ohio, aes(percollege, county, color = Above)) +
        geom_segment(aes(x = Avg, y = county, xend = percollege, yend = county), color = "grey50") +
        geom_point()

Another comparison is the top 10 counties for each of the midwest states in our data set. This requires additional manipulation because some county names appear in more than one state, so I combine county and state into a single identifier (county_st).

top10 <- midwest %>%
        select(state, county, percollege) %>%
        group_by(state) %>%
        arrange(desc(percollege)) %>%
        top_n(10, percollege) %>%          # top 10 within each state
        arrange(percollege) %>%
        unite(county_st, county, state, remove = FALSE) %>%
        mutate(county_st = factor(county_st, levels = .$county_st))

head(top10)
# A tibble: 6 x 4
# Groups:   state [2]
  county_st    state county    percollege
  <fct>        <chr> <chr>          <dbl>
1 WARRICK_IN   IN    WARRICK         23.8
2 HENDRICKS_IN IN    HENDRICKS       24.2
3 PORTER_IN    IN    PORTER          24.5
4 ST JOSEPH_IN IN    ST JOSEPH       24.6
5 SUMMIT_OH    OH    SUMMIT          24.7
6 CUYAHOGA_OH  OH    CUYAHOGA        25.1
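
To see why county alone is not a unique identifier, a quick check like the following may help. It is a sketch that assumes dplyr 0.8.0 or later (for the name argument of count()); shared_names and n_states are names I chose for illustration.

```r
library(dplyr)
library(ggplot2)   # provides the midwest data set

# Count how many states each county name appears in and keep the repeats
shared_names <- midwest %>%
        distinct(state, county) %>%
        count(county, name = "n_states") %>%
        filter(n_states > 1) %>%
        arrange(desc(n_states))

head(shared_names)
```

Any name listed here would collapse into a single factor level if county were used on its own, which is why the county and state columns are united first.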

We can now plot our data and facet by state to get small multiples representing the top 10 counties for each state. Here I abbreviate the names for brevity.

ggplot(top10, aes(percollege, county_st)) +
        geom_segment(aes(x = 0, y = county_st, xend = percollege, yend = county_st), color = "grey50") +
        geom_point() +
        scale_y_discrete(labels = abbreviate) +
        facet_wrap(~ state, scales = "free_y")

3.4.4 Adding Value Markers

Depending on the number of categories (i.e. counties) you are trying to graphically display, and the range of the x-axis, it can be helpful to add value markers to the points to clarify the difference between the points.

OH_top10 <- midwest %>%
        select(state, county, percollege) %>%
        filter(state == "OH") %>%
        arrange(desc(percollege)) %>%
        top_n(10, percollege) %>%          # rank explicitly by percollege
        arrange(percollege) %>%
        mutate(county = factor(county, levels = .$county))

ggplot(OH_top10, aes(percollege, county, label = round(percollege, 1))) +
        geom_segment(aes(x = 0, y = county, xend = percollege, yend = county), color = "grey50") +
        geom_point() +
        geom_text(nudge_x = 1.5)

Alternatively, you can enlarge the dots to include the labelling inside of them.

ggplot(OH_top10, aes(percollege, county, label = paste0(round(percollege, 0), "%"))) +
        geom_segment(aes(x = 0, y = county, xend = percollege, yend = county), color = "grey50") +
        geom_point(size = 7) +
        geom_text(color = "white", size = 2)
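
One detail worth knowing about the labels above: a label built with round() is a number, so when it is converted to text any trailing zero is dropped (25.0 is shown as 25). If you want every marker to show the same number of decimals, formatting the label as a string with sprintf() is one option. The values below are made up for illustration; plain, fixed, and pct are illustrative names.

```r
# Made-up values for illustration only
vals <- c(25, 24.96, 7.913)

# Numeric rounding: trailing zeros disappear when converted to text
plain <- as.character(round(vals, 1))

# String formatting: always one decimal place
fixed <- sprintf("%.1f", vals)

# Whole percentages with a percent sign ("%%" is a literal "%")
pct <- sprintf("%.0f%%", vals)

print(plain)   # "25"   "25"   "7.9"
print(fixed)   # "25.0" "25.0" "7.9"
print(pct)     # "25%"  "25%"  "8%"
```

Either formatted vector could be supplied to the label aesthetic, for example label = sprintf("%.1f", percollege).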

3.4.5 Finishing Touches

Now let’s take one of these outputs and create a nice, publication-worthy graphic. Here I take the earlier graphic that compares the percent of college educated adults in each Ohio county to the state average and refine it a little. I use geom_segment and annotate to build a simpler, more appealing legend, add a title, subtitle, and caption, and then make some basic theme adjustments.

ggplot(ohio, aes(percollege/100, county, color = Above)) +
        geom_segment(aes(x = Avg/100, y = county, xend = percollege/100, yend = county), color = "grey50") +
        geom_point() +
        annotate("text", x = .25, y = "ALLEN", label = "Above Average", color = "#00BFC4", size = 3, hjust = -0.1, vjust = .75) +
        annotate("text", x = .25, y = "FULTON", label = "Below Average", color = "#F8766D", size = 3, hjust = -0.1, vjust = -.1) +
        geom_segment(aes(x = .25, xend = .25 , y = "ASHLAND", yend = "DEFIANCE"),
                     arrow = arrow(length = unit(0.2,"cm")), color = "#00BFC4") +
        geom_segment(aes(x = .25, xend = .25 , y = "KNOX", yend = "PUTNAM"),
                     arrow = arrow(length = unit(0.2,"cm")), color = "#F8766D") +
        scale_x_continuous(labels = scales::percent, expand = c(0, 0), limits = c(.07, .33)) +
        labs(title = "College Educated Adults in Ohio Counties",
             subtitle = "The average percent of college educated adults in Ohio is 16.89%. Franklin, Greene, Geauga, and \nDelaware counties lead Ohio with over 30% of their adults being college educated while Vinton, \nAdams, Noble, and Holmes trail with less than 10% of their adults being college educated.",
             caption = "U.S. Census Bureau: 2000 Census") +
        theme_minimal() +
        theme(axis.title = element_blank(),
              panel.grid.minor = element_blank(),
              legend.position = "none",
              text = element_text(family = "Georgia"),
              axis.text.y = element_text(size = 8),
              plot.title = element_text(size = 20, margin = margin(b = 10), hjust = 0),
              plot.subtitle = element_text(size = 12, color = "darkslategrey", margin = margin(b = 25, l = -25)),
              plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
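
Since the subtitle quotes specific figures, it can be worth recomputing them from the data before publishing. The following is a small sketch of such a check; ohio_pct, state_avg, over_30, and under_10 are names I chose for illustration, and the average here is the unweighted mean of the county percentages, matching the Avg column computed earlier.

```r
library(dplyr)
library(ggplot2)   # provides the midwest data set

ohio_pct <- midwest %>%
        filter(state == "OH") %>%
        select(county, percollege)

# Unweighted mean of the county-level percentages
state_avg <- mean(ohio_pct$percollege, na.rm = TRUE)

# Counties above 30% and below 10%, for comparison with the subtitle text
over_30  <- ohio_pct %>% filter(percollege > 30) %>% arrange(desc(percollege))
under_10 <- ohio_pct %>% filter(percollege < 10) %>% arrange(percollege)

round(state_avg, 2)
over_30
under_10
```

If the data or the thresholds change, rerunning this check shows whether the hard-coded subtitle text still matches.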

3.4.6 Wrapping Up

Lollipop charts are a nice alternative to bar charts and dot plots. When visualizing data across many categories they, much like dot plots, keep the graphic minimal while still tying each value clearly to its category.