- Understand what ggplot2 is and why it’s used
- Be able to think conceptually in the framework of the “grammar of graphics”
- Learn the basic syntax for creating different plots using ggplot2
ggplot2 is an R package. It’s one of the most downloaded packages in the R universe, and is widely regarded as the gold standard for data visualization in R. It’s powerful and flexible, and lets you create many types of visualizations, from maps to bare-bones figures for academic publications to complex, paneled charts with labeling.
Because the syntax is so different from “base” R, it can give the impression of having a somewhat steep learning curve. But in reality, because the principles are so conceptually simple, learning is fairly fast. Generally those who choose to learn it stick with it; that is, once you go gg, you don’t go back.
Note: we will refer heavily to this online guide about
The name & concept
To demonstrate the ideas in this section, draw a rough plot on a whiteboard as you step through each layer.
The gg in ggplot stands for “grammar of graphics”, with “grammar” meaning “the fundamental principles or rules of an art or science” (Wickham 2010). Just as all languages share common principles of grammar and syntax, so too do the many forms of data visualization. The basic idea is that all graphs can be described using a layered grammar: all graphs represent a dataset using the same layers of visual order.
Plots are made of layers. Think of how you draw a plot from scratch:
First, you get a piece of paper – a canvas.
Second, you draw the x axis and y axis: each direction on your canvas represents the range of a set of data. This establishes a landscape of coordinates.
Third, the data need to be placed somewhere in that landscape. You map the data to the coordinates.
Fourth, when you actually draw the data at their prescribed locations on the plot, you have to decide how to do so. You use geometric objects – like points, lines, and bars – and other aesthetic attributes – like colors, line thicknesses, and dot size.
Fifth, you add labels – such as axis titles, an overhead title, or a legend – to help the viewer understand the plot.
You now have a basic plot. But sometimes you will add additional layers:
Sixth, you may add statistical summaries – such as regression lines or standard error bars.
Seventh, you may decide to do an overhaul and split your plot into several facets, in which subgroups of the data are plotted separately to produce a multi-panel plot.
Finally, in the last layer, you may decide to stylize the entire plot to fit a visual theme, such as the trademark styles of publications like The Economist or The New York Times.
When you produce a plot with ggplot, you will mirror this same process step-by-step. This is why you will often see the process underlying ggplot described using a graphic like this:
Note: If you want to learn more about the theory, the most well-known “grammar of graphics” was written in 2005 and laid out some abstract principles for describing statistical graphics (Wilkinson 2005).
Let’s learn by doing. First, install and load ggplot2 and associated packages.
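A minimal sketch of that setup step. Which "associated packages" the lesson has in mind is an assumption here: dplyr is included only because the summary code later in this lesson uses `%>%`, `group_by()`, and `summarise()`.

```r
# Install once per machine (uncomment if the packages are not yet installed)
# install.packages("ggplot2")
# install.packages("dplyr")

# Load at the start of every session
library(ggplot2)  # plotting
library(dplyr)    # assumed companion package, used for summaries later on
```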
We will use the titanic dataset, the manifest of Titanic passengers with details such as age, passenger class, fare paid, and whether or not they survived.
At this point it may be useful to emphasize that the name of the package is ggplot2, but the name of the function is just ggplot().
Say you want to explore the relationship between passengers’ age and the fare they paid to travel aboard the Titanic.
- Set up our canvas. If we just type ggplot() without anything in the parentheses, the function will just return a blank piece of paper.
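As a quick illustration of the blank canvas (the variable name `canvas` is just for this sketch):

```r
library(ggplot2)

# An empty call creates a plot object with no data, axes, or layers
canvas <- ggplot()
canvas  # draws an empty panel
```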
- Draw the axes and set up our landscape of coordinates (the second and third steps above). To do so, we need to feed ggplot() some data and tell it which columns should be mapped onto the axes.
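A sketch of what this step might look like. The small `titanic` data frame below is a made-up stand-in so the example runs on its own; only its column names (Age, Fare, Sex) follow the text, and the real dataset should be used in its place.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

# Supply the data and map columns to the x and y axes
p_axes <- ggplot(data = titanic, aes(x = Age, y = Fare))
p_axes  # axes and grid appear, but no data are drawn yet
```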
That code looks a bit clunky, we know. The aes() input, which is short for aesthetics, is actually a function. Everything included in its parentheses will be used to map your data to the plot’s aesthetic attributes. So far we have simply said that Age should be mapped to the x axis and that Fare should be mapped to y.
But let’s say we also want to color-code the points on our plot according to male/female. To do so, we will add specifications to this aes() call.
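One way this could look, assuming the sex column is named Sex as in the rest of the lesson. The data frame is again a made-up stand-in for the real titanic data.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

# Add a third mapping: color is determined by the Sex column
p_mapped <- ggplot(data = titanic, aes(x = Age, y = Fare, color = Sex))
p_mapped
```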
Your plot is still blank, but in the background ggplot() is all set up to make your plot. Since this ggplot() call is the basis of everything that will happen next – it contains the data and the way you want to map it to attributes of your plot – let’s save it to a variable for easy recall. We’ll use p for “plot”.
Note that you don’t need to write out titanic$Fare. You’ve told ggplot() that your data is titanic, so it knows to look inside that dataframe for those columns.
- Map our data to geometric shapes. In this case, a scatter plot of points:
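A sketch of this step, using a made-up stand-in for the titanic data so that it runs on its own. The base plot is saved as `p`, and a point layer is added to it.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

# Base plot: data plus aesthetic mappings
p <- ggplot(data = titanic, aes(x = Age, y = Fare, color = Sex))

# Add a layer that draws one point per row
p_points <- p + geom_point()
p_points
```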
Note the use of a plus sign, +. You are adding layers to your plot.
- Add some more labels. You see that ggplot() has automatically added axis titles and a legend, but we can add some more using the labs() function. Let’s add an overhead title, a sub-title, and a caption.
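A sketch of the labeling step. The label text is a placeholder chosen for this example, and the data frame is a made-up stand-in for the real titanic data.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

p_labeled <- ggplot(data = titanic, aes(x = Age, y = Fare, color = Sex)) +
  geom_point() +
  labs(
    title    = "Fare paid by passenger age",       # placeholder text
    subtitle = "Each point is one passenger",      # placeholder text
    caption  = "Illustrative data, not the real manifest"
  )
p_labeled
```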
- Add a statistical summary, like a smoothed regression line.
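One possible version of this step. The choice of `method = "lm"` (a straight-line fit) is an assumption made so the example works on a tiny made-up data frame; calling geom_smooth() with no arguments lets ggplot2 pick a smoother itself.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

# Because color = Sex is set in aes(), the smoother is fit per sex
p_smooth <- ggplot(data = titanic, aes(x = Age, y = Fare, color = Sex)) +
  geom_point() +
  geom_smooth(method = "lm")
p_smooth
```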
ggplot() automatically produced a different regression line for each sex. That’s nice, but now our plot is getting pretty cluttered.
- Clean up the look by using facets: a separate plot for each sex.
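A sketch of faceting with facet_wrap(), again on a made-up stand-in for the titanic data. The formula `~ Sex` asks for one panel per value of Sex.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

p_facets <- ggplot(data = titanic, aes(x = Age, y = Fare, color = Sex)) +
  geom_point() +
  geom_smooth(method = "lm") +  # assumed smoother, as a straight-line fit
  facet_wrap(~ Sex)             # one panel per sex
p_facets
```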
- Finally, let’s stylize the entire plot with a different theme. You can find theme options in the `
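As an illustration, the sketch below applies theme_minimal(), one of the themes that ships with ggplot2 itself (theme_bw() and theme_classic() are others). Which theme the lesson originally used is not known, so this choice is an assumption, and the data frame is a made-up stand-in.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

p_themed <- ggplot(data = titanic, aes(x = Age, y = Fare, color = Sex)) +
  geom_point() +
  facet_wrap(~ Sex) +
  theme_minimal()  # swaps the default grey background for a lighter style
p_themed
```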
In a bar plot, your data are mapped to bars instead of points. And, instead of showing every data point, you are summarizing the data in some way – i.e., displaying a statistic. That statistic is usually just a count of the number of data points in each subgroup.
Let’s make a bar plot that compares the number of men and women on the Titanic:
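A sketch of such a count-based bar plot, on a made-up stand-in for the titanic data. Writing `stat = "count"` explicitly is optional, since counting is the default for geom_bar().

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Sex  = c("male", "female", "female", "female", "male", "male", "female", "male")
)

# Only x is mapped; bar heights come from counting rows in each group
p_bar <- ggplot(data = titanic, aes(x = Sex)) +
  geom_bar(stat = "count")
p_bar
```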
Note that, for the aes() call, we only provided the x axis attribute (Sex). Then, in the geom_bar() call, we told ggplot() which statistic should be represented by the bars.
But you are allowed to explicitly set the bars’ heights (i.e., the y dimension) to represent a different statistic. Let’s say we wanted each bar to represent the mean age of men and women:
```r
# First, determine the mean age of each sex
mean_age_males <- mean(titanic$Age[titanic$Sex == 'male'], na.rm = TRUE)
mean_age_females <- mean(titanic$Age[titanic$Sex == 'female'], na.rm = TRUE)

# Make a new dataframe with this summary data
titanic_age <- data.frame(Sex = c('male', 'female'),
                          mean_age = c(mean_age_males, mean_age_females))

# Plot it
ggplot(data = titanic_age, aes(x = Sex, y = mean_age)) +
  geom_bar(stat = 'identity')
```
In this case, we are explicitly defining the y axis in the aes() call, and telling geom_bar() to just use the values we specified in aes(). That’s what 'identity' means: you are telling ggplot() to just use what you already gave it.
You can specify other aesthetic attributes, unrelated to the data, within the geom_bar() call.
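A sketch of fixed (non-data) attributes on the bars. The specific fill color and transparency are arbitrary choices for illustration; the data frame is a made-up stand-in, and the summary is built with base R's aggregate() for brevity.

```r
library(ggplot2)

# Hypothetical stand-in for the titanic data (values are made up)
titanic <- data.frame(
  Age = c(22, 38, 26, 35, 54, 2, 27, 14),
  Sex = c("male", "female", "female", "female", "male", "male", "female", "male")
)

# Mean age per sex, as one row per group
titanic_age <- aggregate(Age ~ Sex, data = titanic, FUN = mean)
names(titanic_age)[names(titanic_age) == "Age"] <- "mean_age"

# fill and alpha sit outside aes(), so they apply to every bar equally
p_fixed <- ggplot(data = titanic_age, aes(x = Sex, y = mean_age)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7)
p_fixed
```

Attributes placed inside aes() vary with the data; attributes placed directly in the geom call are constants.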
Now add better labels:
You can add another variable to your bar plot as follows. Let’s say you want to see the average age in each sex, grouped by who survived and who didn’t:
```r
# First, produce your summary dataframe using some dplyr magic:
titanic_em <- titanic %>%
  group_by(Sex, Survived) %>%
  summarise(mean_age = mean(Age, na.rm = TRUE))
titanic_em <- titanic_em %>%
  mutate(Survived = ifelse(Survived == 1, 'Survived', 'Dead'))

# Check it out
titanic_em
# A tibble: 4 × 3
# Groups:   Sex [2]
#   Sex    Survived mean_age
#   <chr>  <chr>       <dbl>
# 1 female Dead         25.0
# 2 female Survived     28.8
# 3 male   Dead         31.6
# 4 male   Survived     27.3

# Now plot it
ggplot(data = titanic_em, aes(x = Sex, y = mean_age, fill = Survived)) +
  geom_bar(stat = 'identity')
```
Rather than stack the bars, you can place them side by side:
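A sketch using `position = "dodge"`. To keep the example self-contained, the summary table is typed in by hand from the rounded values printed earlier in this lesson rather than recomputed from the full dataset.

```r
library(ggplot2)

# Summary values copied (rounded) from the printed tibble above
titanic_em <- data.frame(
  Sex      = c("female", "female", "male", "male"),
  Survived = c("Dead", "Survived", "Dead", "Survived"),
  mean_age = c(25.0, 28.8, 31.6, 27.3)
)

# position = "dodge" places the bars of each group next to each other
p_dodge <- ggplot(data = titanic_em, aes(x = Sex, y = mean_age, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge")
p_dodge
```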
If you don’t love these default colors (even if they are colorblind-friendly), you can manually define the colors for each group of bars:
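One way to do this is with scale_fill_manual(). The two colors below are arbitrary picks for illustration, and the summary table is typed in from the rounded values printed earlier in this lesson.

```r
library(ggplot2)

# Summary values copied (rounded) from the printed tibble above
titanic_em <- data.frame(
  Sex      = c("female", "female", "male", "male"),
  Survived = c("Dead", "Survived", "Dead", "Survived"),
  mean_age = c(25.0, 28.8, 31.6, 27.3)
)

# A named vector ties each group to a specific fill color
p_colors <- ggplot(data = titanic_em, aes(x = Sex, y = mean_age, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c(Dead = "grey40", Survived = "darkorange"))
p_colors
```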
More Titanic plots
1. Make a scatterplot similar to what you did above, but this time color-code by class instead of sex.
2. Notice that ggplot() automatically uses a continuous color scale for Pclass, since it has numeric values. To force ggplot() to consider Pclass as categories (1st class, 2nd class, 3rd class), replace Pclass with factor(Pclass). Did the style of your color scale change?
3. Modify the title, subtitle, and caption to be more descriptive.
4. Produce a bar plot that compares the number of passengers in each class.
5. Make your bar plot as ugly as possible!
6. Now make it as beautiful as possible, including a concise but informative title, subtitle, and caption.
Download the dataset on baby names given to newborns in the USA:
7. Create a line chart showing the number of girls named Mary over time.
8. Change the color of the line to blue.
9. Add a fitting title to the plot.
10. Create a bar chart showing the number of girls named Emma, Olivia, Ava, Sophia, and Emily in 2010.
11. Change the x label to “Names” and the y label to “Total”. (Hint: check out the labs() help page.)
12. Change the color of the bars to grey and make them more translucent.
13. Create a bar chart showing the number of people named Emma, Olivia, Ava, Sophia, and Emily in 2010, colored by sex.
14. Create a beautiful chart showing your name over time.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.
Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). Berlin, Heidelberg: Springer-Verlag.