load packages

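# install pacman if it is not already available, then load the packages
if (!require(pacman)) {
  install.packages("pacman")
}
pacman::p_load("tidyverse", "here", "sysfonts", install = TRUE)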

load the data file

Modify the path as needed to point to where your week4_data.csv file is downloaded. Note that my path is a little different; for you, the file should be in the data/ directory in the same folder as this script.

Import the csv file and assign it to the variable data

data = read_csv(here("static", "labs", "data", "week4_data.csv"))
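As a rough sketch of how here() assembles a path: it joins its arguments onto the project root, so a file in a data/ folder would be addressed as below. The data/ location comes from the note above; whether the file actually exists there depends on your own setup.

```r
library(here)

# build the path to the csv, assuming it sits in data/ under the project root
csv_path = here("data", "week4_data.csv")

# confirm the file is where we expect before trying to read it
file.exists(csv_path)
```

If file.exists() returns FALSE, adjust the arguments to here() until the path matches where the file was saved.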

tidy the data using code from week 4

In week 4, we did most of these steps independently. Here, we’ll use %>% to thread the processes together without assigning any intermediate variables.

The key variables we’ll be working with today are as follows:

  • behavior_voting: voted in 2020; single dichotomous item, yes or no
  • CE_attitudes: civic engagement attitudes; 8-item scale with range 1-7
  • CE_checklist: checklist of civic engagement activities; 17 dichotomous items, yes or no
  • pol_eff: political efficacy scale; 4-item scale with range 1-4
  • reasons_yes: checklist of reasons why people voted in 2020
data_tidy = data %>%
  # filter out test and incomplete responses
  filter(!DistributionChannel == "preview") %>%
  filter(Finished == 1 & consent == 1) %>%
  # select a subset of variables 
  select(ResponseId, behavior_voting, reasons_yes,
         starts_with("CE_attitudes"), contains("checklist"), contains("pol_eff")) %>%
  # convert to long format
  pivot_longer(cols = -c(ResponseId, behavior_voting, reasons_yes), names_to = "scale_name") %>%
  # extract item number from scale_name
  extract(col = "scale_name", into = c("scale_name", "item"),
          regex = "(CE_attitudes|CE_checklist|pol_eff)_([0-9]+)") %>%
  # convert responses to numeric and recode response values
  mutate(value = as.numeric(value),
         value = ifelse(test = scale_name == "CE_checklist" & value == 2,
                        yes = 0,
                        no = value),
         behavior_voting = recode(behavior_voting,
                                  "1" = "yes",
                                  "2" = "no")) %>%
  # calculate scale means or sums for each participant
  group_by(ResponseId, scale_name) %>%
  mutate(summarized_value = ifelse(test = scale_name == "CE_checklist", 
                                   yes = sum(value, na.rm = TRUE),
                                   no = mean(value, na.rm = TRUE))) %>%
  # remove the item and value columns
  select(-item, -value) %>%
  # remove repeated observations (rows)
  unique()

Let’s take a look at the tidied data frame.

We can see that now, instead of multiple items per scale, we have a summary (either a mean or sum) stat in the summarized_value column. Now, each participant has one value per scale.
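To see what these steps do on a small scale, here is a minimal sketch using made-up data (the toy values and the names toy and toy_tidy are illustrative, not from the survey). It uses summarize() as a shortcut for the mutate(), select(), and de-duplication steps above.

```r
library(dplyr)
library(tidyr)

# made-up wide data: two participants, two checklist items, two efficacy items
toy = tibble(ResponseId = c("R1", "R2"),
             CE_checklist_1 = c(1, 2),
             CE_checklist_2 = c(1, 1),
             pol_eff_1 = c(4, 2),
             pol_eff_2 = c(2, 1))

toy_tidy = toy %>%
  # one row per participant per item
  pivot_longer(cols = -ResponseId, names_to = "scale_name") %>%
  # split e.g. "pol_eff_1" into the scale name and the item number
  extract(col = "scale_name", into = c("scale_name", "item"),
          regex = "(CE_checklist|pol_eff)_([0-9]+)") %>%
  # recode checklist "no" responses from 2 to 0
  mutate(value = ifelse(scale_name == "CE_checklist" & value == 2, 0, value)) %>%
  # sum the checklist items and average the other scale
  group_by(ResponseId, scale_name) %>%
  summarize(summarized_value = ifelse(first(scale_name) == "CE_checklist",
                                      sum(value, na.rm = TRUE),
                                      mean(value, na.rm = TRUE)),
            .groups = "drop")

toy_tidy
```

Each participant ends up with one row per scale: a sum for the checklist and a mean for the efficacy items.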


check distributions

A good first step when working with data is to visualize the distribution of the variables you’re working with. This can help identify outliers or if there are unexpected values.
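As a small illustration of why this helps, the sketch below uses made-up scores on a 1-7 scale where one value (70) stands in for a data entry error. The data and object names are hypothetical.

```r
library(ggplot2)

# made-up scores on a 1-7 scale; the 70 stands in for a data entry error
toy = data.frame(score = c(2, 3, 3, 4, 4, 4, 5, 5, 6, 70))

# a quick numeric check of the observed range
range(toy$score)

# the histogram shows one bar sitting far away from the rest
p = ggplot(toy, aes(x = score)) +
  geom_histogram(bins = 10)
p

# values outside the possible range of the scale
out_of_range = toy$score[toy$score < 1 | toy$score > 7]
out_of_range
```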

Let’s look at the distributions of CE_attitudes, CE_checklist, and pol_eff


Let’s create some histograms using geom_histogram()

initial plot

data_tidy %>%
  ggplot(aes(x = summarized_value)) +
  geom_histogram()


Use fill = scale_name to separate the scales

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram()


Let’s make them non-overlapping by using position = "dodge"

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram(position = "dodge")


Let’s separate the scales into 3 separate subplots using facet_grid(~scale_name)

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram() +
  facet_grid(~scale_name)


Let’s change the number of “bins” in the histogram using bins = 10

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_histogram(bins = 10) +
  facet_grid(~scale_name)

density plots

Rather than plotting a histogram of counts per bin, we’ll look at the density using geom_density()

initial plot

data_tidy %>%
  ggplot(aes(x = summarized_value)) +
  geom_density()


Use fill = scale_name to separate the scales

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density()


Change the opacity of the fill color using alpha = .5

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5)


Separate into subplots using facet_grid(~scale_name)

Allow the axis ranges to differ across subplots by specifying scales = "free"

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free")

legend position

Remove the redundant fill legend using legend.position = "none"

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free") +
  theme(legend.position = "none")
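Here is a self-contained sketch of the same density plot on made-up data, in case you want to experiment without the survey file. The toy scores are randomly generated and only mimic two scales with different ranges.

```r
library(ggplot2)

# made-up scores for two scales with different ranges (1-7 and 1-4)
set.seed(1)
toy = data.frame(scale_name = rep(c("CE_attitudes", "pol_eff"), each = 50),
                 summarized_value = c(runif(50, min = 1, max = 7),
                                      runif(50, min = 1, max = 4)))

p = ggplot(toy, aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5) +
  # one panel per scale, each with its own axis ranges
  facet_grid(~scale_name, scales = "free") +
  # the panel labels already name the scales, so drop the legend
  theme(legend.position = "none")
p
```

Try removing scales = "free" to see how a shared x-axis squeezes the scale with the narrower range.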

density plot by a grouping variable

Now let’s see if the distributions differ for people who voted or didn’t vote in 2020.

Because we’re plotting each scale separately using facet_grid(~scale_name), we can use fill to plot each level of behavior_voting separately for each scale.

initial plot

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = behavior_voting)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free") +
  theme(legend.position = "bottom")


Remove the black outlines by specifying color = NA

data_tidy %>%
  ggplot(aes(x = summarized_value, fill = behavior_voting)) +
  geom_density(alpha = .5, color = NA) +
  facet_grid(~scale_name, scales = "free") +
  theme(legend.position = "top")

summary plots

Now that we’ve gotten a sense of the distribution, let’s look at the average scale scores as a function of voting behavior.

We’ll use stat_summary() to do this.


initial plot

data_tidy %>%
  ggplot(aes(x = behavior_voting, summarized_value)) +
  stat_summary(fun = "mean", geom = "bar") +
  facet_grid(~scale_name)


Let’s add some color to distinguish the groups using fill = behavior_voting

data_tidy %>%
  ggplot(aes(x = behavior_voting, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar") +
  facet_grid(~scale_name) +
  theme(legend.position = "top")

changing x

Let’s reduce the redundancy by specifying x = scale_name rather than using facet_grid()

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar") +
  theme(legend.position = "top")


Separate the bars using position = "dodge" to push them apart

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  theme(legend.position = "top")


Visualize uncertainty around the means by adding a new stat_summary() layer

Visualize the standard error with an error bar using fun.data = "mean_se"

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", position = "dodge") +
  theme(legend.position = "top")

95% CI

Use the bootstrapped 95% confidence interval instead of the SE using fun.data = "mean_cl_boot" (this function relies on the {Hmisc} package being installed)

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", position = "dodge") +
  theme(legend.position = "top")


Change the width of the error bars using width = 0

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", position = "dodge",
               width = 0) +
  theme(legend.position = "top")


Change the position argument so that the error bars are in the middle of the bars

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = position_dodge(.9)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", position = position_dodge(.9),
               width = 0) +
  theme(legend.position = "top")
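The difference between the two stat_summary() arguments can be sketched on made-up data: fun reduces the y values to a single number per group, while fun.data returns y, ymin, and ymax. This sketch swaps in "mean_se" for "mean_cl_boot" so that it runs without {Hmisc}; the toy values are hypothetical.

```r
library(ggplot2)

# made-up scores: two scales crossed with voting behavior
toy = data.frame(scale_name = rep(c("CE_attitudes", "pol_eff"), each = 6),
                 behavior_voting = rep(rep(c("no", "yes"), each = 3), times = 2),
                 summarized_value = c(3, 4, 5,  5, 6, 7,  1, 2, 3,  2, 3, 4))

p = ggplot(toy, aes(x = scale_name, y = summarized_value, fill = behavior_voting)) +
  # bar height = group mean (fun summarizes y to a single value)
  stat_summary(fun = "mean", geom = "bar", position = position_dodge(.9)) +
  # error bar = mean +/- 1 standard error (fun.data returns y, ymin, ymax)
  stat_summary(fun.data = "mean_se", geom = "errorbar",
               position = position_dodge(.9), width = 0) +
  theme(legend.position = "top")
p

# the values ggplot computed for each layer
bars = layer_data(p, 1)
errs = layer_data(p, 2)
```

Inspecting bars and errs with layer_data() is a handy way to confirm that the plotted summaries are what you expect.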

point range

Instead of using bars, let’s visualize the means and uncertainty around them using a point range with geom = "pointrange"

initial plot

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, fill = behavior_voting)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange") +
  theme(legend.position = "top")


Use color instead of fill this time

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange") +
  theme(legend.position = "top")


Separate the values using position = position_dodge(.25)

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange", 
               position = position_dodge(.25)) +
  theme(legend.position = "top")


Add a line connecting the means by voting behavior group by adding a stat_summary() layer

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(aes(group = behavior_voting), fun = "mean", geom = "line") +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange", 
               position = position_dodge(.25)) +
  theme(legend.position = "top")


Line things up by changing the line position to match the pointrange position

data_tidy %>%
  ggplot(aes(x = scale_name, summarized_value, color = behavior_voting)) +
  stat_summary(aes(group = behavior_voting), fun = "mean", geom = "line", 
               position = position_dodge(.25)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange", 
               position = position_dodge(.25)) +
  theme(legend.position = "top")
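One way to keep the two layers aligned is to store the position in a variable and reuse it, as in this sketch on made-up data. The name dodge and the toy values are illustrative, and "mean_se" again stands in for "mean_cl_boot" to avoid the {Hmisc} dependency.

```r
library(ggplot2)

# made-up scores: two scales crossed with voting behavior
toy = data.frame(scale_name = rep(c("CE_attitudes", "pol_eff"), each = 6),
                 behavior_voting = rep(rep(c("no", "yes"), each = 3), times = 2),
                 summarized_value = c(3, 4, 5,  5, 6, 7,  1, 2, 3,  2, 3, 4))

# define the dodge once so both layers are shifted by the same amount
dodge = position_dodge(.25)

p = ggplot(toy, aes(x = scale_name, y = summarized_value, color = behavior_voting)) +
  stat_summary(aes(group = behavior_voting), fun = "mean", geom = "line",
               position = dodge) +
  stat_summary(fun.data = "mean_se", geom = "pointrange", position = dodge) +
  theme(legend.position = "top")
p

# x positions after dodging, for the line layer and the pointrange layer
line_x = sort(layer_data(p, 1)$x)
point_x = sort(layer_data(p, 2)$x)
```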

relationships between continuous variables

Next let’s visualize the relationship between two continuous variables using geom_point() and geom_smooth()

scatter plots

initial plot

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist)) +
  geom_point()

trend line

Add a trend line using geom_smooth()

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist)) +
  geom_point() +
  geom_smooth()

linear trend

Add a linear trend line by specifying method = "lm"

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")

relationship by a grouping variable

As we did previously, let’s see if this relationship differs for people who did and didn’t vote

Do this using shape

initial plot

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, shape = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")


Use color instead of shape as the aesthetic

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, color = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")


Match the color of the error bands by adding a fill aesthetic

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, color = behavior_voting,
             fill = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "top")
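The spread() step in these plots reshapes the long data so that each scale gets its own column, which is what lets us map one scale to x and another to y. A minimal sketch on made-up data (toy_long and toy_wide are hypothetical names):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# made-up long data: one row per participant per scale
toy_long = tibble(ResponseId = rep(paste0("R", 1:6), each = 2),
                  behavior_voting = rep(c("no", "yes"), each = 6),
                  scale_name = rep(c("CE_attitudes", "CE_checklist"), times = 6),
                  summarized_value = c(2, 3,  4, 6,  6, 8,  3, 7,  5, 10,  7, 14))

# one row per participant, one column per scale
toy_wide = toy_long %>%
  spread(scale_name, summarized_value)

p = toy_wide %>%
  ggplot(aes(x = CE_attitudes, y = CE_checklist, color = behavior_voting,
             fill = behavior_voting)) +
  geom_point() +
  # one linear fit (with its error band) per voting group
  geom_smooth(method = "lm") +
  theme(legend.position = "top")
p
```

In current versions of {tidyr}, pivot_wider(names_from = scale_name, values_from = summarized_value) does the same reshaping as spread().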

putting it all together

Now that we’ve gotten our feet wet, we’ll focus on creating a publication-ready plot to communicate the common reasons for voting that Penn students endorse.

First, we’ll need to tidy the data and join the text of the reasons from the survey to the data.

join and tidy data

Check the format of the reasons_yes variable

data_tidy %>%
  # select relevant variables
  ungroup() %>%
  select(ResponseId, reasons_yes)

To replace the numbers with text, we first need to wrangle the data so that it is in the long format and each selected number has its own row.

To do this, we’ll do some fairly complex transformations using strsplit(), which creates a list for each value selected, and unnest() to convert the lists back to a data frame.

Then, once the data is in the long format, we can join the text with the corresponding numbers using left_join()
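Before running this on the survey data, here is a minimal sketch of the split, unnest, and join steps on made-up data. The toy responses are plain comma-separated strings, so the gsub() cleanup used on the real data is left out; all names and values here are hypothetical.

```r
library(dplyr)
library(tidyr)

# made-up responses: each participant selected one or more numbered reasons
toy = tibble(ResponseId = c("R1", "R2"),
             reasons_yes = c("1,3", "2"))

# made-up lookup table linking each number to its text
toy_text = tibble(reasons_yes = c(1, 2, 3),
                  text = c("It is my duty", "To express my voice", "Other"))

toy_reasons = toy %>%
  # split "1,3" into a list column holding c("1", "3")
  mutate(reasons_yes = strsplit(reasons_yes, ",")) %>%
  # give each selected number its own row
  unnest(reasons_yes) %>%
  # convert to numeric so the key types match for the join
  mutate(reasons_yes = as.numeric(reasons_yes)) %>%
  # attach the text for each number
  left_join(toy_text, by = "reasons_yes")

toy_reasons
```

Participant R1 selected two reasons and so ends up with two rows, each paired with its text.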

reason_text = read_csv(here("static", "labs", "data", "week4_data_reasons.csv"))

data_reasons = data_tidy %>%
  # select relevant variables
  ungroup() %>%
  select(ResponseId, reasons_yes) %>%
  # split the selected responses and convert to a single row per response
  mutate(reasons_yes = strsplit(gsub("[][\"]", "", reasons_yes), ",")) %>%
  unnest(reasons_yes) %>%
  # convert to numeric to facilitate joining
  mutate(reasons_yes = as.numeric(reasons_yes)) %>%
  # join with text 
  left_join(., reason_text, by = "reasons_yes") %>%
  # remove missing responses and "other" responses
  filter(!is.na(text) & !text == "Other") %>%
  # get unique responses
  unique()

create plot

Create a bar plot and fancify it

initial plot

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar()


Flip the axis using coord_flip()

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar() +
  coord_flip()


Add a count number using geom_text()

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10)


Remove the flipped x label using x = "" and add a space between the scale and the flipped y label by starting the label with "\n"

data_reasons %>%
  ggplot(aes(x = text)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount")


Reorder the bars based on the count by summarizing the number of responses per text

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n()) %>%
  ggplot(aes(x = reorder(text, n_responses))) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount")
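To see how the reordering works, here is a sketch on made-up data with three reasons endorsed 1, 2, and 3 times. The toy labels are hypothetical, and after_stat(count) is the current spelling of stat(count) in recent versions of {ggplot2}.

```r
library(dplyr)
library(ggplot2)

# made-up responses: "duty" once, "voice" twice, "future" three times
toy = tibble(text = c("duty", "voice", "voice", "future", "future", "future"))

p = toy %>%
  group_by(text) %>%
  # number of rows for each reason, repeated on every row of that reason
  mutate(n_responses = n()) %>%
  # reorder() sorts the levels of text by n_responses
  ggplot(aes(x = reorder(text, n_responses))) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = .2) +
  labs(x = "", y = "\ncount")
p

# bar counts in axis order, from the first level to the last
bar_counts = layer_data(p, 1)$count
```

After coord_flip(), the first level is drawn at the bottom, so the most endorsed reason ends up at the top of the plot.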


Create grouping categories for different types of reasons and use this variable as the fill aesthetic

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount")
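The nested ifelse() calls can be hard to read. As one possible alternative, the same keyword matching can be sketched with case_when(), which checks the conditions in order and uses the first one that matches. The reason texts below are made up for illustration and are not the survey wording.

```r
library(dplyr)

# made-up reason texts, one per category
toy = tibble(text = c("It is my civic duty",
                      "I want to express my voice",
                      "The outcome affects my future",
                      "To be part of my community",
                      "To defy expectations"))

toy_categories = toy %>%
  mutate(category = case_when(
    grepl("stake|consequences|financially|future", text) ~ "consequences",
    grepl("bigger|social|community|world|family", text) ~ "prosociality",
    grepl("participate|duty|adult|right", text) ~ "responsibility",
    grepl("advocate|voice|express", text) ~ "agency",
    # anything that matched none of the patterns above
    TRUE ~ "rebellion/control"))

toy_categories
```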

legend position

Move the legend to the bottom of the plot using legend.position = "bottom"

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme(legend.position = "bottom")


Change the color using scale_fill_brewer()

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = 2)


Change the colors by creating the variable palette with HEX values

Manually change the color palette using scale_fill_manual()

Change the scale name with the name = "category" argument

palette = c("#1985a1", "#e64626", "#ffb800", "#4c5c68", "#dcdcdd")
data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme(legend.position = "bottom") +
  scale_fill_manual(name = "category", values = palette)


Change the theme with theme_minimal(). Note that this must come before any theme() layers; otherwise it will override them.

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_manual(name = "category", values = palette)


Add a title by adding a title argument to labs()

data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount", title = "Reasons for voting in the 2020 election endorsed by Penn students") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_manual(name = "category", values = palette)

title position

Adjust the title position using plot.title.position = "plot"

# save the plot as plot
(plot = data_reasons %>%
  group_by(text) %>%
  mutate(n_responses = n(),
         category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
                 ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
                 ifelse(grepl("participate|duty|adult|right", text), "responsibility", 
                 ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
  ggplot(aes(x = reorder(text, n_responses), fill = category)) +
  geom_bar() +
  coord_flip() +
  geom_text(aes(label = stat(count)), stat = "count", nudge_y = 10) +
  labs(x = "", y = "\ncount", title = "Reasons for voting in the 2020 election endorsed by Penn students") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title.position = "plot") +
  scale_fill_manual(name = "category", values = palette))


Now that we’ve got a near final plot, let’s learn how to change the font using the {sysfonts} package

# check what fonts are installed on your machine
font_files()
# add a font from google

Let’s update the font to Helvetica Neue size 14 and convert all grey text to black

plot +
  theme(legend.position = "bottom",
        axis.text = element_text(color = "black"),
        text = element_text(size = 14, family = "HelveticaNeue"))


Specify the figure width and height in the chunk options.

We’ll assign this variable as final_plot to save it

(final_plot = plot +
  theme(legend.position = "bottom",
        axis.text = element_text(color = "black"),
        text = element_text(size = 14, family = "HelveticaNeue")))

save the plot

Now that we’ve got our beautiful plot, let’s save it as a png with ggsave()

ggsave(final_plot, filename = "~/Desktop/plot.png", width = 12, height = 6)


other ways to visualize distributions

Check out the types of distribution plots available on R Graph Gallery.

Try modifying the following code to use a different distribution geom (e.g. geom_boxplot or geom_violin)

data_tidy %>%
  ggplot(aes(x = scale_name, y = summarized_value))

Add the data points to the plot you made by adding a layer with geom_point()

Spread out the data points using geom_jitter() instead of geom_point()

Decrease the opacity of the points by adding an alpha argument to geom_jitter()

scatter plots

Visualize the relationship between CE_attitudes and pol_eff using geom_point() and geom_smooth() as we did earlier

data_tidy %>%
  spread(scale_name, summarized_value) %>%
  ggplot(aes(x = CE_attitudes, y = pol_eff))

Look at this relationship as a function of behavior_voting using the color aesthetic

Move the legend to the top of the plot


Change the colors on the plot you just made.

You can generate your own palettes or get some inspiration from this collection of color palettes in R

Install the {ggthemes} package and choose one of the available themes to add as a layer

pacman::p_load("ggthemes", install = TRUE)