Module 18 Dataframe wrangling
- Understand the importance of tidy dataframes
- Understand what the
tidyverseis and why it is awesome
- Feel comfortable working with dataframes using
Data scientists largely work in data frames and do things to data. This is what the package
dplyr is optimized for. It consists of a series of “verbs” which cover 95% of what you need to do for most basic data processing tasks.
dplyr package contains a set of
verbs: things you do to dataframes. Those verbs are:
%>% is a “pipe”. It is a way to write code without so many parentheses. For example, what if I want to find the square root of the sum of the first six elements of a sequence of 10 to 20 by 2?
Here’s what that command would look like in base
Pretty overwhelming, and pretty easy to make errors in writing it out.
But the above could also be written a simpler way:
When you see the
%>% pipe symbol, think of the word “then”.
The above code could be read aloud like so: “First, get a sequence of every second number between 10 and 20. Then, take the first six values. Then, sum those samples together. Then, take the square root of that sum.”
%>% pipe framework, your code turns from a nonlinear series of parentheses and brackets to a linear progression of steps, which is a closer fit to how we tend to think about working with data. Instead of working from the inside of a command outward, we thinking linearly: take the data, then do things with it, then do more things with it, etc.
Here’s another example:
… could also be written as:
To practice the
dplyr verbs, let’s make a small dataframe named
filter() function is used to subset a dataframe, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of
TRUE for all conditions.
You can also filter according to multiple conditions. Here are three ways to acheive the same thing:
Note that when a condition evaluates to
NA, its row will be dropped. This differ from the base subsetting works with
[ ... ].
Arrange means putting things in order. That is,
arrange() orders the rows of a data frame by the values of selected columns.
To reverse the order, use
You can also arrange by multiple levels:
Select only certain variables in a data frame, making the dataframe skinnier (fewer columns).
As you select columns, you can rename them like so:
You can also select a set of columns using the
rename() changes the names of individual variables.
This verb takes the syntax
<new_name> = <old_name> syntax.
mutate() adds new variables and preserves existing ones.
New variables overwrite existing variables of the same name.
You can call
mutate() multiple times in the same pipe:
You can also remove variables can be removed by setting their value to NULL.
A similar function,
transmute(), adds new variables and drops existing ones, kind of like a combination of
Most data operations are done on groups defined by variables. The function
group_by() takes an existing table and converts it into a grouped one where operations are performed “by group”.
people %>% group_by(sex) %>% mutate(average_age_for_sex = mean(age)) %>% mutate(diff_from_avg_for_sex = age - average_age_for_sex) # A tibble: 4 × 5 # Groups: sex  who sex age average_age_for_sex diff_from_avg_for_sex <chr> <chr> <dbl> <dbl> <dbl> 1 Joe Male 35 34 1 2 Ben Male 33 34 -1 3 Xing Female 32 33 -1 4 Coloma Female 34 33 1
Note that a similar verb,
ungroup(), removes grouping.
summarize() creates an entirely new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
Note the use of the function,
n(). This simple function counts up the number of records in each group.
To illustrate these
dplyr verbs and re-energize the room, ask everyone to stand. Tell the students that they represent a dataframe called
people. Now, write a
dplyr command into your R Console and ask them to act out the command. After each command, give them time to move around to act it out. If they move around too slowly, egg them on: “Come on, you all are like the slowest computer ever!”
people %>% arrange(shoe_size)
people %>% arrange(shoe_size) %>% filter(sex == "female")
people %>% arrange(hair_length)
people %>% arrange(desc(hair_length))
people %>% group_by(sex) %>% arrange(hair_length))
people %>% arrange(country_of_birth, shirt_color, desc(shoe_size))
Answer these questions using the new
dplyr verbs you just learned:
Baby names over time
1. Run the below code to load a dataset about baby names given in the USA since the 1800’s.
2. Check out the first and last six rows of
3. What are the names of the variables in this dataset?
4. How many rows are in this dataset?
5. What is the earliest year in this dataset?
6. Create a dataframe named
turn_of_century, which contain data only for the year 1900.
7. Create a dataframe named
boys, containing only boys.
8. Create a dataframe named
moms_gen. This should be females born in the year of birth of your mom.
n, in ascending order (i.e., with the least popular name at top). Look at the result; what is the least popular name among women the year your mom was born?
10. Reverse the order and save the result into an object named
11. Create an object named
boys2k. This should be all males born in the year 2000.
boys2k from most to least popular. What was the most popular boys name in 2000?
13. What percentage of boys were named
Joseph in 2000?
14. Were there more Jims or Matthews in 2020?
15. Create an object named
tot_names_by_year, which contains the total counts for boy and girl names assigned in each year of the dataset. You should have four columns:
16. How many people were born with your name in 2020?
17. Was your name more prevalent in 2020 than it was in the year you were born?
18. What if you account for the changing overall population size? In other words, is the proportional prevalence of your name greater in 2020 or your birth year?
19. In which year was your name the most prevalent?
20. Create a basic plot of the proportional prevalence of your name since the earliest year of this dataset.
21. Update this plot with lines for your parent’s names and your siblings names, if you have any.
22. Format that plot so that it is gorgeous and well-labelled.
23. Screenshot it and email it to your family.
After completing the exercises here, it is worthwhile devoting time to the Review modules entitled, “A
dplyr mystery”, “A
dplyr survey”, and “Global health and
ggplot”. Once students become comfortable with working with
dplyr, they will be ready to work independently on projects, using the modules in the Deep
R section for references.