Visualizing Data with ggplot2
- Install the Tidyverse collection of packages if you have not done so already (see the previous module)
- Have a look at the documentation for ggplot2,
- Familiarize yourself with the
ggplot2
cheatseet - Start a new quarto project called “modules” and generate a quarto document named “dataviz.qmd” inside of it so that you can code along with me
- Download the data for the bar chart here. Click the little downnload icon that says “Download raw file” when you hover over it and then save it as
dem_summary.csv
in your project folder - Download the data for the line chart here and save as
dem_waves_ctrs.csv
in your project folder
If you have installed the Tidyverse, then you should already have the packages for this model, including ggplot2
. You can go ahead and load ggplot2
along with readr
and dplyr
.
Note that you could also load these three packages by running library(tidyverse)
. However, it is good to be intentional about which packages we are loading as we are learning them.
Overview
Last week we learned how to gather and wrangle data. This week we are going to start visualizing it with the ggplot2
. We will learn how to make bar charts, histograms, line charts and scatter plots.
Along the way we are going to be talking about the “grammar of graphics” that ggplot2
is based on. The “gg” in ggplot stands for “grammar of graphics.” The grammar of graphics is a layered approach to constructing graphs based on a book by Leland Wilkinson.
The idea is that each visualization you make is going to contain cerain elements. You will start with some data. Then you will incorporate some “aesthetics” which you can think of as the dimensions of the visualization (x-axis, y-axis and color, size or shapes for additional dimensions). Next you identify a geometric obejct that you want to use such as a bar, a line or a point. From there you can customize various elements of the plot like the title and axis scales and labels.
Bar charts
Let’s get started with our first visualization–a basic bar chart. Bar charts are good for comparing data across cases. Our aim here is going to be to summarize levels of democracy across different regions like we did in the last lesson, but this time we will illustrate the differences with a chart.
We will start by loading the data dem_summary.csv
file that you downloaded in the prework section. Next we will do our first ggplot()
call. The ggplot()
function takes two arguments: data
and mapping
. data
refers to the data frame that includes the variables we want to visualize and mapping
refers to the aesthetics mappings for the visualization. The aesthetics mappings are themselves presented in a quoting function aes()
that defines the x and y values of the plot along with other aesthetic values like fill
, color
and linetype
. We will focus on x and y values here and return to these additional aesthetic values later.
After our ggplot()
call, we can add a series of additional functions to define our visualization following a +
sign. The most important group are the geoms which will define the basic type of plot we want to make. In this case, we are calling geom_col()
for our histogram and specifying that the fill color should be “steelblue.”
From there we will further customize our visualization with the labs()
function to provide a title, axis labels and a caption.
dem_summary <- read_csv("data/dem_summary.csv")
ggplot(dem_summary, aes(x = region, y = polyarchy)) + # ggplot call
geom_col(fill = "steelblue") + # we use geom_col() for a a bar chart
labs(
x = "Region",
y = "Avg. Polyarchy Score",
title = "Democracy by region, 1990 - present",
caption = "Source: V-Dem Institute"
)
This looks pretty good but frequently we would want the bars of our bar chart to be sorted in order of the values being displayed. Let’s go ahead and add the reorder() function to our aes()
call so that we are reordering the bars based on descending values of the average polyarchy score.
Line charts
Now let’s create a line chart. Line charts are usually the best option when we want to illustrate trends in our data. For this visualization, we will try to illustrate Samuel Huntington’s waves of democracy by showing how countries representing each of the three waves. The U.S. represents the first wave, Japan the second wave starting with the allied victory in WWII, and Portugal represents the first country to transition in the third wave.
Next, let’s import the dem_waves_ctrs
CSV file from our project folder and save it in an object of the same name.
dem_waves_ctrs <- read_csv("data/dem_waves_ctrs.csv")
This importing the data from a CSV file represents a slightly different workflow than was used in the line chart video where I downloaded and wrangled the data from the V-Dem API. We will come back to wrangling data later on in the course.
Next, we are going to do our ggplot()
call. For the aesthetics mapping, we will put the year on the x-axis and the polyarchy score on the y-axis. We will also specify color
in the aes()
call so that we can color the lines by region.
To get a line chart, we have to specify geom_line()
. Then within the geom_line()
function we will set the linewidth
equal to `1’ so that the lines are a bit more visible.
Finally, we will add a labs()
call as with the previous visualizations. But in addition to title, axis labels and a caption, we will also add color = "Country"
to change the label of the legend to “Country” with a capital “C.”
Let’s run that, save it as an object and then call the object to view the visualization.
# in this ggplot() call, we add a third dimension for line color
ggplot(dem_waves_ctrs, aes(x = year, y = polyarchy, color = country)) +
geom_line(linewidth = 1) + # our geom is a line with a width of 1
labs(
x = "Year",
y = "Polyarchy Score",
title = 'Democracy in countries representing three different "waves"',
caption = "Source: V-Dem Institute",
color = "Country" # make title of legend to upper case
)