Here are a number of helpful resources to dig deeper into data visualization with {ggplot2}:
UO courses & boot camp modules
Reference guides
Color palettes and themes
Cheat sheets
Modify the path as needed to point to where your week4_data.csv file is downloaded. Note that my path is a little different, but for you, it should be in the data/ directory in the same folder as this script.
Import the csv file and assign it to the variable data
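As a minimal sketch of the import step, the example below writes a tiny stand-in CSV to a temporary file and reads it back with read_csv(). The file contents and the csv_path variable are hypothetical; in the lab itself, the path would point at your copy of week4_data.csv instead.

```r
library(readr)

# Hypothetical stand-in for week4_data.csv, written to a temporary file
csv_path = tempfile(fileext = ".csv")
writeLines(c("ResponseId,behavior_voting", "R_1,1", "R_2,2"), csv_path)

# Import the csv file and assign it to the variable data
data = read_csv(csv_path, show_col_types = FALSE)
data
```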
In week 4, we did most of these steps independently. Here, we’ll use %>% to thread the processes together without assigning any intermediate variables.
The key variables we’ll be working with today are as follows:
behavior_voting: voted in 2020; single dichotomous item, yes or no
CE_attitudes: civic engagement attitudes; 8-item scale with range 1-7
CE_checklist: checklist of civic engagement activities; 17 dichotomous items, yes or no
pol_eff: political efficacy scale; 4-item scale with range 1-4
reasons_yes: checklist of reasons why people voted in 2020
data_tidy = data %>%
  # filter out test and incomplete responses
  filter(!DistributionChannel == "preview") %>%
  filter(Finished == 1 & consent == 1) %>%
  # select a subset of variables
  select(ResponseId, behavior_voting, reasons_yes,
         starts_with("CE_attitudes"), contains("checklist"), contains("pol_eff")) %>%
  # convert to long format
  pivot_longer(cols = -c(ResponseId, behavior_voting, reasons_yes), names_to = "scale_name") %>%
  # extract item number from scale_name
  extract(col = "scale_name", into = c("scale_name", "item"),
          regex = "(CE_attitudes|CE_checklist|pol_eff)_([0-9]+)") %>%
  # convert responses to numeric and recode response values
  mutate(value = as.numeric(value),
         value = ifelse(test = scale_name == "CE_checklist" & value == 2,
                        yes = 0,
                        no = value),
         behavior_voting = recode(behavior_voting,
                                  "1" = "yes",
                                  "2" = "no")) %>%
  # calculate scale means or sums for each participant
  group_by(ResponseId, scale_name) %>%
  mutate(summarized_value = ifelse(test = scale_name == "CE_checklist",
                                   yes = sum(value, na.rm = TRUE),
                                   no = mean(value, na.rm = TRUE))) %>%
  # remove the item and value columns
  select(-item, -value) %>%
  # remove repeated observations (rows)
  unique()
Let’s take a look at the tidied data frame.
We can see that now, instead of multiple items per scale, we have a summary statistic (either a mean or a sum) in the summarized_value column. Each participant now has one value per scale.
A good first step when working with data is to visualize the distribution of the variables you’re working with. This can help identify outliers or unexpected values.
Let’s look at the distributions of CE_attitudes, CE_checklist, and pol_eff
Let’s create some histograms using geom_histogram()
Use fill = scale_name to separate the scales
Let’s make them non-overlapping by using position = "dodge"
Let’s separate the scales into 3 separate subplots using facet_grid(~scale_name)
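The histogram steps above can be sketched as follows. Because the lab data isn't bundled here, the example builds a small made-up data frame, toy, with the same columns as data_tidy; its values and the choice of bins = 20 are assumptions for illustration only.

```r
library(ggplot2)

# Toy stand-in for data_tidy (made-up values, one row per participant per scale)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# Histogram colored by scale, with non-overlapping bars and one subplot per scale
hist_plot = ggplot(toy, aes(x = summarized_value, fill = scale_name)) +
  geom_histogram(position = "dodge", bins = 20) +
  facet_grid(~scale_name)
hist_plot
```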
Rather than plotting a histogram of counts per bin, we’ll look at the density using geom_density()
Use fill = scale_name to separate the scales
Change the opacity of the fill color using alpha = .5
Separate into subplots using facet_grid(~scale_name)
Allow the scale range to differ by specifying scales = "free"
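Here is a sketch of the density version of the plot. As before, toy is a made-up stand-in for data_tidy, so the shapes of the curves mean nothing; the point is how alpha and the free scales change the display.

```r
library(ggplot2)

# Toy stand-in for data_tidy (made-up values)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# Semi-transparent density per scale; each subplot gets its own axis ranges
density_plot = ggplot(toy, aes(x = summarized_value, fill = scale_name)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free")
density_plot
```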
Now let’s see if the distributions differ for people who voted or didn’t vote in 2020.
Because we’re plotting each scale separately using facet_grid(~scale_name), we can use fill to plot each level of behavior_voting separately for each scale.
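A sketch of this comparison, again on a made-up toy data frame standing in for data_tidy: the only change from the previous density plot is that fill is now mapped to behavior_voting.

```r
library(ggplot2)

# Toy stand-in for data_tidy (made-up values)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# One density per voting group within each scale's subplot
voting_density = ggplot(toy, aes(x = summarized_value, fill = behavior_voting)) +
  geom_density(alpha = .5) +
  facet_grid(~scale_name, scales = "free")
voting_density
```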
Now that we’ve gotten a sense of the distribution, let’s look at the average scale scores as a function of voting behavior.
We’ll use stat_summary() to do this.
Let’s add some color to distinguish the groups using fill = behavior_voting
Let’s reduce the redundancy by specifying x = scale_name rather than using facet_grid()
Separate the bars using position = "dodge" to push them apart
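Putting these steps together might look like the sketch below. It uses a made-up toy data frame in place of data_tidy, and draws one bar per scale and voting group, with bar height equal to the group mean.

```r
library(ggplot2)

# Toy stand-in for data_tidy (made-up values)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# Mean of summarized_value for each scale and voting group, shown as dodged bars
mean_bars = ggplot(toy, aes(x = scale_name, y = summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge")
mean_bars
```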
Visualize uncertainty around the means by adding a new stat_summary() layer
Visualize the standard error with an error bar with fun.data = "mean_se"
Use the 95% confidence interval instead of SE using fun.data = "mean_cl_boot"
Change the width of the error bars using width = 0
Change the position variable so that the error bars are in the middle of the bars
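The error bar steps can be sketched as below, on a made-up toy data frame standing in for data_tidy. The sketch uses "mean_se" because it ships with {ggplot2}; "mean_cl_boot" additionally needs the {Hmisc} package installed. The dodge width of .9 is an assumption chosen to match the default bar width, which is what centers each error bar on its bar.

```r
library(ggplot2)

# Toy stand-in for data_tidy (made-up values)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# Bars for the means, plus a second stat_summary() layer for mean +/- 1 SE
bars_with_se = ggplot(toy, aes(x = scale_name, y = summarized_value, fill = behavior_voting)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = 0,
               position = position_dodge(width = .9))
bars_with_se
```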
Instead of using bars, let’s visualize the means and uncertainty around them using a point range with geom = "pointrange"
Use color instead of fill this time
Separate the values using position = position_dodge(.25)
Add a line connecting the means by voting behavior group by adding a stat_summary() layer
Line things up by changing the line position to match the pointrange position
data_tidy %>%
  ggplot(aes(x = scale_name, y = summarized_value, color = behavior_voting)) +
  stat_summary(aes(group = behavior_voting), fun = "mean", geom = "line",
               position = position_dodge(.25)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange",
               position = position_dodge(.25)) +
  theme(legend.position = "top")
Next, let’s visualize the relationship between two continuous variables using geom_point() and geom_smooth()
Add a trend line using geom_smooth()
As we did previously, let’s see if this relationship differs for people who did and didn’t vote
Do this using shape
Use color instead of shape as the aesthetic
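A sketch of the scatter plot is shown below. Since data_tidy is in long format, the example first spreads a made-up toy data frame to one column per scale with pivot_wider(); that reshaping step, the choice of CE_attitudes and pol_eff as the two variables, and method = "lm" for a straight trend line are all assumptions for illustration.

```r
library(ggplot2)
library(tidyr)

# Toy stand-in for data_tidy (made-up values)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# One row per participant, one column per scale
toy_wide = pivot_wider(toy, names_from = scale_name, values_from = summarized_value)

# Points plus a linear trend line, colored by voting behavior
scatter_plot = ggplot(toy_wide, aes(x = CE_attitudes, y = pol_eff, color = behavior_voting)) +
  geom_point() +
  geom_smooth(method = "lm")
scatter_plot
```

Swapping color = behavior_voting for shape = behavior_voting gives the shape-based version of the same plot.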
Now that we’ve gotten our feet wet, we’ll focus on creating a publication-ready plot to communicate common reasons for voting that Penn students endorse.
First, we’ll need to tidy the data and join the text of the reasons from the survey to the data.
Check the format of the reasons_yes variable
To replace the numbers with text, we first need to wrangle the data into the long format so that each number selected has its own row.
To do this, we’ll do some somewhat complex transformations using strsplit(), which splits each response into a list of the values selected, and unnest() to convert the lists back to a dataframe.
Then, once the data is in the long format, we can join the text with the corresponding numbers using left_join().
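To see what each step does, here is a minimal sketch on two made-up responses. The bracketed, quoted format of reasons_yes and the contents of the lookup table are assumptions for illustration; the real values come from the survey export and week4_data_reasons.csv.

```r
library(dplyr)
library(tidyr)

# Hypothetical responses: each cell holds all selected reason numbers as one string
toy_reasons = tibble(
  ResponseId = c("R_1", "R_2"),
  reasons_yes = c("[\"1\",\"3\"]", "[\"2\"]")
)

# Hypothetical lookup table mapping reason numbers to their text
toy_text = tibble(reasons_yes = c(1, 2, 3),
                  text = c("Reason A", "Reason B", "Reason C"))

toy_long = toy_reasons %>%
  # strip brackets and quotes, then split on commas into a list-column
  mutate(reasons_yes = strsplit(gsub("[][\"]", "", reasons_yes), ",")) %>%
  # one row per selected reason
  unnest(reasons_yes) %>%
  # convert to numeric so the join keys have the same type
  mutate(reasons_yes = as.numeric(reasons_yes)) %>%
  left_join(toy_text, by = "reasons_yes")
toy_long
```

Participant R_1 selected two reasons, so they end up with two rows after unnest(), while R_2 has one.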
reason_text = read_csv(here("static", "labs", "data", "week4_data_reasons.csv"))
data_reasons = data_tidy %>%
  # select relevant variables
  ungroup() %>%
  select(ResponseId, reasons_yes) %>%
  # split the selected responses and convert to a single row per response
  mutate(reasons_yes = strsplit(gsub("[][\"]", "", reasons_yes), ",")) %>%
  unnest(reasons_yes) %>%
  # convert to numeric to facilitate joining
  mutate(reasons_yes = as.numeric(reasons_yes)) %>%
  # join with text
  left_join(., reason_text, by = "reasons_yes") %>%
  # remove missing responses and "other" responses
  filter(!is.na(text) & !text == "Other") %>%
  # get unique responses
  unique()
Create a bar plot and fancify it
Flip the axis using coord_flip()
Add a count number using geom_text()
Remove the flipped x label using "" and add a space between the scale and the flipped y label using "\n"
Reorder the bars based on the count by summarizing the number of responses per text
Create grouping categories for different types of reasons and use this variable as the fill aesthetic
data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount")
Move the legend to the bottom of the plot using legend.position = "bottom"
data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount") +
theme(legend.position = "bottom")
Change the color using scale_fill_brewer()
data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount") +
theme(legend.position = "bottom") +
scale_fill_brewer(palette = 2)
Change the colors by creating the variable palette with HEX values
Manually change the color palette using scale_fill_manual()
Change the scale name with the name = "category" argument
palette = c("#1985a1", "#e64626", "#ffb800", "#4c5c68", "#dcdcdd")
data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount") +
theme(legend.position = "bottom") +
scale_fill_manual(name = "category", values = palette)
Change the theme with theme_minimal(). Note that this must come before any theme() layers, or it will override them.
data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount") +
theme_minimal() +
theme(legend.position = "bottom") +
scale_fill_manual(name = "category", values = palette)
Add a title by adding a title argument to labs()
data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount", title = "Reasons for voting in the 2020 election endorsed by Penn students") +
theme_minimal() +
theme(legend.position = "bottom") +
scale_fill_manual(name = "category", values = palette)
Adjust the title position using plot.title.position = "plot"
# save the plot as plot
(plot = data_reasons %>%
group_by(text) %>%
mutate(n_responses = n(),
category = ifelse(grepl("stake|consequences|financially|future", text), "consequences",
ifelse(grepl("bigger|social|community|world|family", text), "prosociality",
ifelse(grepl("participate|duty|adult|right", text), "responsibility",
ifelse(grepl("advocate|voice|express", text), "agency", "rebellion/control"))))) %>%
ggplot(aes(x = reorder(text, n_responses), fill = category)) +
geom_bar() +
coord_flip() +
geom_text(aes(label = after_stat(count)), stat = "count", nudge_y = 10) +
labs(x = "", y = "\ncount", title = "Reasons for voting in the 2020 election endorsed by Penn students") +
theme_minimal() +
theme(legend.position = "bottom",
plot.title.position = "plot") +
scale_fill_manual(name = "category", values = palette))
Now that we’ve got a near-final plot, let’s learn how to change the font using the {sysfonts} package
Let’s update the font to Helvetica Neue size 14 and convert all grey text to black
Specify the figure width and height in the chunk options.
We’ll assign this variable as final_plot to save it
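The theme changes can be sketched as below. Font availability differs between machines, so this example deliberately swaps in the generic "sans" family rather than registering Helvetica Neue through {sysfonts}; the toy_plot object and its data are also made up and stand in for the plot built above. In an R Markdown document, the figure size would be set with the fig.width and fig.height chunk options.

```r
library(ggplot2)

# Hypothetical stand-in for the bar plot built above
toy_counts = data.frame(
  text = c("Reason A", "Reason A", "Reason A", "Reason B", "Reason B", "Reason C")
)
toy_plot = ggplot(toy_counts, aes(x = text)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

# Set the font family and size, and turn the grey axis text black.
# "sans" is an assumption; replace it with a family registered on your system.
final_plot = toy_plot +
  theme(text = element_text(family = "sans", size = 14, color = "black"),
        axis.text = element_text(color = "black"))
final_plot
```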
Check out the types of distribution plots available on R Graph Gallery.
Try modifying the following code to use a different distribution geom (e.g. geom_boxplot or geom_violin)
Add the data points to the plot you made by adding a layer with geom_point()
Spread out the data points using geom_jitter() instead of geom_point()
Decrease the opacity of the points by adding an alpha argument to geom_jitter()
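One possible answer to these steps is sketched below on a made-up toy data frame standing in for data_tidy. The choice of geom_violin(), the x and y mappings, the jitter width, and the alpha value are all assumptions; any of the distribution geoms would work in the same way.

```r
library(ggplot2)

# Toy stand-in for data_tidy (made-up values)
set.seed(1)
n = 60
toy = data.frame(
  ResponseId = rep(paste0("R_", seq_len(n)), times = 3),
  behavior_voting = rep(rep(c("yes", "no"), length.out = n), times = 3),
  scale_name = rep(c("CE_attitudes", "CE_checklist", "pol_eff"), each = n),
  summarized_value = c(runif(n, 1, 7), sample(0:17, n, replace = TRUE), runif(n, 1, 4))
)

# Violin per voting group, with the raw data points jittered on top
violin_plot = ggplot(toy, aes(x = behavior_voting, y = summarized_value)) +
  geom_violin() +
  geom_jitter(width = .1, alpha = .3) +
  facet_grid(~scale_name, scales = "free")
violin_plot
```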
Visualize the relationship between CE_attitudes and pol_eff using geom_point() and geom_smooth() as we did earlier
Look at this relationship as a function of behavior_voting using the color aesthetic
Move the legend to the top of the plot
Change the colors on the plot you just made.
You can generate your own palettes using coolors.co or get some inspiration from this collection of color palettes in R
Install the {ggthemes} package and choose one of the available themes to add as a layer