resources

Here are a number of helpful resources for digging deeper into data visualization with {ggplot2}:

UO courses & boot camp modules

Reference guides

Color palettes and themes

Cheat sheets

load packages

if (!require(pacman)) {
  install.packages("pacman")
}
pacman::p_load("tidyverse", "here", "sysfonts", install = TRUE)

load the data file

Modify the path as needed to point to where your week4_data.csv file is downloaded. Note that my path is a little different; for you, the file should be in the data/ directory in the same folder as this script.

Import the csv file and assign it to the variable data

data = read_csv(here("static", "labs", "data", "week4_data.csv"))

tidy the data using code from week 4

In week 4, we did most of these steps independently. Here, we’ll use %>% to thread the processes together without assigning any intermediate variables.
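As a minimal sketch of what threading with %>% means, the toy example below runs the same three steps with and without intermediate variables. The scores data frame and its columns are made up for illustration and are not part of the survey data.

```r
library(dplyr)

# hypothetical toy data, not the survey data
scores = tibble(id = c(1, 2, 3, 4),
                finished = c(1, 1, 0, 1),
                score = c(3, 5, 4, 6))

# with intermediate variables
scores_finished = filter(scores, finished == 1)
scores_selected = select(scores_finished, id, score)
result_steps = mutate(scores_selected, score_centered = score - mean(score))

# the same steps threaded together with %>%
result_piped = scores %>%
  filter(finished == 1) %>%
  select(id, score) %>%
  mutate(score_centered = score - mean(score))

# both approaches give the same data frame
all.equal(result_steps, result_piped)
```

The piped version reads top to bottom and leaves no intermediate objects in the environment.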

The key variables we’ll be working with today are as follows:

behavior_voting: voted in 2020; single dichotomous item, yes or no
CE_attitudes: civic engagement attitudes; 8-item scale with range 1-7
CE_checklist: checklist of civic engagement activities; 17 dichotomous items, yes or no
pol_eff: political efficacy scale; 4-item scale with range 1-4
reasons_yes: checklist of reasons why people voted in 2020

data_tidy = data %>%
  # filter out test and incomplete responses
  filter(!DistributionChannel == "preview") %>%
  filter(Finished == 1 & consent == 1) %>%
  # select a subset of variables 
  select(ResponseId, behavior_voting, reasons_yes,
         starts_with("CE_attitudes"), contains("checklist"), contains("pol_eff")) %>%
  # convert to long format
  pivot_longer(cols = -c(ResponseId, behavior_voting, reasons_yes), names_to = "scale_name") %>%
  # extract item number from scale_name
  extract(col = "scale_name", into = c("scale_name", "item"),
          regex = "(CE_attitudes|CE_checklist|pol_eff)_([0-9]+)") %>%
  # convert responses to numeric and recode response values
  mutate(value = as.numeric(value),
         value = ifelse(test = scale_name == "CE_checklist" & value == 2,
                        yes = 0,
                        no = value),
         behavior_voting = recode(behavior_voting,
                                  "1" = "yes",
                                  "2" = "no")) %>%
  # calculate scale means or sums for each participant
  group_by(ResponseId, scale_name) %>%
  mutate(summarized_value = ifelse(test = scale_name == "CE_checklist", 
                                   yes = sum(value, na.rm = TRUE),
                                   no = mean(value, na.rm = TRUE))) %>%
  # remove the item and value columns
  select(-item, -value) %>%
  # remove repeated observations (rows)
  unique()

Let’s take a look at the tidied data frame.

We can see that now, instead of multiple items per scale, we have a summary (either a mean or sum) stat in the summarized_value column. Now, each participant has one value per scale.

data_tidy
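The mutate(), select(), and unique() steps above collapse the items to one row per participant and scale. The sketch below shows the same idea on a hypothetical items data frame, using summarize() as one possible alternative. Note that, unlike the code above, summarize() keeps only the grouping columns and the new summary column.

```r
library(dplyr)

# hypothetical long-format item data: 2 participants, 2 scales, 2 items each
items = tibble(ResponseId = rep(c("R_1", "R_2"), each = 4),
               scale_name = rep(c("CE_checklist", "CE_checklist", "pol_eff", "pol_eff"), times = 2),
               value = c(1, 0, 3, 4, 1, 1, 2, NA))

# sum the checklist items and average the items on the other scale
items_summary = items %>%
  group_by(ResponseId, scale_name) %>%
  summarize(summarized_value = ifelse(test = first(scale_name) == "CE_checklist",
                                      yes = sum(value, na.rm = TRUE),
                                      no = mean(value, na.rm = TRUE)),
            .groups = "drop")

items_summary
```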

check distributions

A good first step when working with data is to visualize the distribution of the variables you’re working with. This can help identify outliers or unexpected values.

Let’s look at the distributions of CE_attitudes, CE_checklist, and pol_eff

histograms

Let’s create some histograms using geom_histogram()

initial plot

data_tidy %>%
  ggplot(aes(x = summarized_value)) +
  geom_histogram()

fill

Use fill = scale_name to separate the scales

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram()

position

Let’s make them non-overlapping by using position = "dodge"

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram(position = "dodge")

facet

Let’s separate the scales into 3 separate subplots using facet_grid(~scale_name)

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram() +
  facet_grid(~scale_name)

bins

Let’s change the number of “bins” in the histogram using bins = 10

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram(bins = 10) +
  facet_grid(~scale_name)
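Instead of setting the number of bins, geom_histogram() can also take a bin width through the binwidth argument. The sketch below uses a hypothetical toy data frame and layer_data() to inspect the bins that ggplot computed.

```r
library(ggplot2)

# hypothetical toy values, not the survey data
toy = data.frame(summarized_value = c(1.5, 2, 2.25, 3, 3.5, 3.5, 4, 4.75, 5, 6.5))

p = ggplot(toy, aes(x = summarized_value)) +
  geom_histogram(binwidth = 0.5)

# each row is one bin; xmin and xmax are the bin edges
bins_computed = layer_data(p)
bins_computed[, c("xmin", "xmax", "count")]
```

Setting binwidth can be easier to interpret than bins when the scale has meaningful units, since every bar then covers the same known range.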

density plots

Rather than plotting a histogram of counts per bin, we’ll look at the density using geom_density()

initial plot

data_tidy %>%
  ggplot(aes(x = summarized_value)) +
  geom_density()

fill

Use fill = scale_name to separate the scales

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density()

alpha

Change the opacity of the fill color using alpha = .5

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5)

facet

Separate into subplots using facet_grid(~scale_name)

Allow the scale range to differ by specifying scales = "free"

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free")

legend position

Remove the redundant fill legend using legend.position = "none"

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free") +
  theme(legend.position = "none")

density plot by a grouping variable

Now let’s see if the distributions differ for people who voted or didn’t vote in 2020.

Because we’re plotting each scale separately using facet_grid(~scale_name), we can use fill to plot each level of behavior_voting separately for each scale.

initial plot

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = behavior_voting)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free") +
  theme(legend.position = "bottom")

color

Remove the black outlines by specifying color = NA

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = behavior_voting)) +
  geom_density(alpha = .5, color = NA) +
  facet_grid(~scale_name, scales = "free") +
  theme(legend.position = "top")

summary plots

Now that we’ve gotten a sense of the distribution, let’s look at the average scale scores as a function of voting behavior.

We’ll use stat_summary() to do this.

bar

initial plot

data_tidy %>%
  ggplot(aes(x = behavior_voting, summarized_value)) +
  stat_summary(fun = "mean", geom = "bar") +
  facet_grid(~scale_name)

fill

Let’s add some color to distinguish the groups using fill = behavior_voting

data_tidy %>%
  ggplot(aes(x = behavior_voting, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar") +
  facet_grid(~scale_name) +
  theme(legend.position = "top")

changing x

Let’s reduce the redundancy by specifying x = scale_name rather than using facet_grid()

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar") +
  theme(legend.position = "top")

position

Separate the bars using position = "dodge" to push them apart

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  theme(legend.position = "top")

stat_summary

Visualize uncertainty around the means by adding a new stat_summary() layer

Visualize the standard error with an error bar with fun.data = "mean_se"

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", position = "dodge") +
  theme(legend.position = "top")
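To make the error bars concrete, the sketch below computes the mean plus and minus one standard error by hand for a hypothetical vector of scores and compares it with mean_se(), the ggplot2 function named in fun.data.

```r
library(ggplot2)

# hypothetical scores for one group
x = c(2, 4, 4, 5, 7, 8)

# standard error of the mean computed by hand
se = sd(x) / sqrt(length(x))
by_hand = data.frame(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)

# the summary function referenced by fun.data = "mean_se"
from_ggplot = mean_se(x)

# the lower and upper limits match
all.equal(by_hand$ymin, from_ggplot$ymin)
all.equal(by_hand$ymax, from_ggplot$ymax)
```

Any function that returns y, ymin, and ymax in this form can be supplied to fun.data.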

95% CI

Use the 95% confidence interval instead of SE using fun.data = "mean_cl_boot"

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", position = "dodge") +
  theme(legend.position = "top")
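Note that mean_cl_boot relies on the {Hmisc} package, so it needs to be installed for the plot above to render. As a rough sketch of what a bootstrapped 95% confidence interval is, the base R example below resamples a hypothetical vector of scores and takes the middle 95% of the resampled means. It illustrates the general percentile approach and is not meant to reproduce the exact output of mean_cl_boot.

```r
set.seed(123)

# hypothetical scores for one group
x = c(2, 4, 4, 5, 7, 8, 3, 6, 5, 4)

# resample with replacement many times and store each resampled mean
boot_means = replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))

# the middle 95% of the resampled means is the percentile bootstrap interval
ci = quantile(boot_means, probs = c(.025, .975))
c(mean = mean(x), ci)
```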

width

Change the width of the error bars using width = 0

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", position = "dodge",
               width = 0) +
  theme(legend.position = "top")

position

Change the position argument to position_dodge(.9) in both layers so that the error bars sit in the middle of the bars (.9 matches the default bar width)

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = position_dodge(.9)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", position = position_dodge(.9),
               width = 0) +
  theme(legend.position = "top")

point range

Instead of using bars, let’s visualize the means and uncertainty around them using a point range with geom = "pointrange"

initial plot

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange") +
  theme(legend.position = "top")

color

Use color instead of fill this time

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange") +
  theme(legend.position = "top")

position

Separate the values using position = position_dodge(.25)

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange", 
               position = position_dodge(.25)) +
  theme(legend.position = "top")

line

Add a line connecting the means by voting behavior group by adding a stat_summary() layer

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(aes(group = behavior_voting), fun = "mean", geom = "line") +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange", 
               position = position_dodge(.25)) +
  theme(legend.position = "top")

position

Line things up by changing the line position to match the pointrange position

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(aes(group = behavior_voting), fun = "mean", geom = "line", 
               position = position_dodge(.25)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange", 
               position = position_dodge(.25)) +
  theme(legend.position = "top")

relationships between continuous variables

Next let’s visualize the relationship between two continuous variables using geom_point() and geom_smooth()

scatter plots

initial plot

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist)) +
  geom_point()
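The spread() call converts the data from long to wide format so that each scale has its own column to map to x and y. The tidyr documentation lists spread() as superseded by pivot_wider(), so here is a sketch of the same reshaping with pivot_wider() on a hypothetical toy data frame.

```r
library(dplyr)
library(tidyr)

# hypothetical long-format summary data
toy_long = tibble(ResponseId = rep(c("R_1", "R_2"), each = 3),
                  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), times = 2),
                  summarized_value = c(5.5, 9, 3, 4.25, 6, 2.5))

# one row per participant, one column per scale
toy_wide = toy_long %>%
  pivot_wider(names_from = scale_name, values_from = summarized_value)

toy_wide
```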

trend line

Add a trend line using geom_smooth()

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist)) +
  geom_point() +
  geom_smooth()

linear trend

Add a linear trend line by specifying method = "lm"

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")

relationship by a grouping variable

As we did previously, let’s see if this relationship differs for people who did and didn’t vote

Do this using shape

initial plot

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, shape = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")

color

Use color instead of shape as the aesthetic

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, color = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")

fill

Match the color of the error bands by adding a fill aesthetic

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, color = behavior_voting,
             fill = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")

putting it all together

Now that we’ve gotten our feet wet, we’ll focus on creating a publication-ready plot to communicate common reasons for voting that Penn students endorse.

First, we’ll need to tidy the data and join the text of the reasons from the survey to the data.

join and tidy data

Check the format of the reasons_yes variable

data_tidy %>%
  # select relevant variables
  ungroup() %>%
  select(ResponseId, reasons_yes)

To replace the numbers with text, we first need to wrangle the data into the long format so that each number selected has its own row.

To do this, we’ll do some somewhat complex transformations using strsplit(), which creates a list for each value selected, and unnest() to convert the lists back to a dataframe.

Then, once the data is in the long format, we can join the text with the corresponding numbers using left_join()

reason_text = read_csv(here("static", "labs", "data", "week4_data_reasons.csv"))

data_reasons = data_tidy %>%
  # select relevant variables
  ungroup() %>%
  select(ResponseId, reasons_yes) %>%
  # split the selected responses and convert to a single row per response
  mutate(reasons_yes = strsplit(gsub("[][\"]", "", reasons_yes), ",")) %>%
  unnest(reasons_yes) %>%
  # convert to numeric to facilitate joining
  mutate(reasons_yes = as.numeric(reasons_yes)) %>%
  # join with text 
  left_join(., reason_text, by = "reasons_yes") %>%
  # remove missing responses and "other" responses
  filter(!is.na(text) & !text == "Other") %>%
  # get unique responses
  unique()
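The split, unnest, and join steps can be easier to follow on a small example. The sketch below uses two hypothetical data frames (toy_responses and toy_text, with made-up reason text) to show how one string of selected numbers becomes one row per selection with the matching text attached.

```r
library(dplyr)
library(tidyr)

# hypothetical responses: each string lists the numbers a participant selected
toy_responses = tibble(ResponseId = c("R_1", "R_2"),
                       reasons_yes = c("1,3", "2"))

# hypothetical lookup table mapping numbers to text
toy_text = tibble(reasons_yes = c(1, 2, 3),
                  text = c("It is my duty", "To express my voice", "Other"))

toy_reasons = toy_responses %>%
  # split each string into a list of the selected numbers
  mutate(reasons_yes = strsplit(reasons_yes, ",")) %>%
  # give each selected number its own row
  unnest(reasons_yes) %>%
  # convert to numeric so the join keys have the same type
  mutate(reasons_yes = as.numeric(reasons_yes)) %>%
  # attach the text for each number
  left_join(toy_text, by = "reasons_yes")

toy_reasons
```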

create plot

Create a bar plot and fancify it

initial plot

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar()

flip

Flip the axis using coord_flip()

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar() +
  coord_flip()

text

Add a count number using geom_text()

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10)

label

Remove the flipped x label using x = "" and add a space between the scale and the flipped y label by starting the label with a newline, "\n"

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount")

reorder

Reorder the bars based on the count by summarizing the number of responses per text

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n()) %>%
  ggplot(aes(x = reorder(text, n_responses))) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount")
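To see what reorder() is doing to the axis, the base R sketch below compares the default alphabetical ordering with the ordering produced by reorder(). The text and n_responses vectors here are hypothetical stand-ins for the columns used above.

```r
# hypothetical reasons, one element per response
text = c("duty", "duty", "duty", "voice", "future", "future")
n_responses = c(3, 3, 3, 1, 2, 2)

# by default, factor levels are alphabetical
levels(factor(text))

# reorder() sorts the levels by n_responses (using the mean within each level)
levels(reorder(text, n_responses))
```

Because ggplot draws discrete axes in level order, changing the level order is what changes the order of the bars.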

fill

Create grouping categories for different types of reasons and use this variable as the fill aesthetic

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount")
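Nested ifelse() calls work, but they can be hard to read. As one possible alternative, the sketch below expresses the same pattern matching with case_when() from {dplyr}. The reason text in toy_reasons is made up for illustration and is not the survey wording.

```r
library(dplyr)

# hypothetical reason text, for illustration only
toy_reasons = tibble(text = c("I have a stake in the outcome",
                              "It is my civic duty",
                              "To make my voice heard",
                              "To help my community",
                              "To prove people wrong"))

# conditions are checked in order; the first match determines the category
toy_categories = toy_reasons %>%
  mutate(category = case_when(
    grepl("stake|consequences|financially|future", text) ~ "consequences",
    grepl("bigger|social|community|world|family", text) ~ "prosociality",
    grepl("participate|duty|adult|right", text) ~ "responsibility",
    grepl("advocate|voice|express", text) ~ "agency",
    TRUE ~ "rebellion/control"))

toy_categories
```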

legend position

Move the legend to the bottom of the plot using legend.position = "bottom"

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme(legend.position = "bottom")

color

Change the color using scale_fill_brewer()

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = 2)

color

Change the color by creating the variable palette with hex values

Manually change the color palette using scale_fill_manual()

Change the scale name with the name = "category" argument

palette = c("#1985a1", "#e64626", "#ffb800", "#4c5c68", "#dcdcdd")
data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme(legend.position = "bottom") +
  scale_fill_manual(name = "category", values = palette)
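With an unnamed vector, scale_fill_manual() assigns the colors to the categories in order. If you want each category tied to a specific color regardless of which categories appear in the data, you can name the values. The sketch below assumes the five category labels created above and pairs them with the hex values in alphabetical order; the toy data frame is hypothetical.

```r
library(ggplot2)

# name each hex value so the category-to-color mapping is explicit
palette_named = c("agency" = "#1985a1",
                  "consequences" = "#e64626",
                  "prosociality" = "#ffb800",
                  "rebellion/control" = "#4c5c68",
                  "responsibility" = "#dcdcdd")

# hypothetical toy data containing only two of the categories
toy = data.frame(category = c("responsibility", "agency", "agency"))

p = ggplot(toy, aes(x = category, fill = category)) +
  geom_bar() +
  scale_fill_manual(name = "category", values = palette_named)

# the fill color assigned to each bar
unique(layer_data(p)$fill)
```

Here "responsibility" keeps its own color even though three of the five categories are missing from the toy data.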

theme

Change the theme with theme_minimal(). Note that this must come before any theme() layers; otherwise it will override them.

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_manual(name = "category", values = palette)

title

Add a title by adding a title argument to labs()

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount", title = "Reasons for voting in the 2020 election endorsed by Penn students") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_manual(name = "category", values = palette)

title position

Adjust the title position using plot.title.position = "plot"

# save the plot as plot
(plot = data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount", title = "Reasons for voting in the 2020 election endorsed by Penn students") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title.position = "plot") +
  scale_fill_manual(name = "category", values = palette))

font

Now that we’ve got a near final plot, let’s learn how to change the font using the {sysfonts} package

# check what font files are installed on your machine
font_files()

# add a font from google
font_add_google("Roboto")

Let’s update the font to Helvetica Neue size 14 and convert all grey text to black

plot +
  theme(legend.position = "bottom",
        axis.text = element_text(color = "black"),
        text = element_text(size = 14, family = "HelveticaNeue"))

resize

Specify the figure width and height in the chunk options.

We’ll assign this variable as final_plot to save it

(final_plot = plot +
  theme(legend.position = "bottom",
        axis.text = element_text(color = "black"),
        text = element_text(size = 14, family = "HelveticaNeue")))

save the plot

Now that we’ve got our beautiful plot, let’s save it as a png with ggsave()

ggsave(filename = "~/Desktop/plot.png", plot = final_plot, width = 12, height = 6)

assignment

other ways to visualize distributions

Check out the types of distribution plots available on R Graph Gallery.

Try modifying the following code to use a different distribution geom (e.g. geom_boxplot or geom_violin)

data_tidy %>%
  ggplot(aes(x = scale_name, y = summarized_value))

Add the data points to the plot you made by adding a layer with geom_point()

Spread out the data points using geom_jitter() instead of geom_point()

Decrease the opacity of the points by adding an alpha argument to geom_jitter()

scatter plots

Visualize the relationship between CE_attitudes and pol_eff using geom_point() and geom_smooth() as we did earlier

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = pol_eff))

Look at this relationship as a function of behavior_voting using the color aesthetic

Move the legend to the top of the plot

aesthetics

Change the colors on the plot you just made.

You can generate your own palettes using coolors.co or get some inspiration from this collection of color palettes in R

Install the {ggthemes} package and choose one of the available themes to add as a layer

pacman::p_load("ggthemes", install = TRUE)

Week 5: Data visualization

Dani Cosme

2021-04-09
