Module 18 Dataframe wrangling

Learning goals

  • Understand the importance of tidy dataframes
  • Understand what the tidyverse is and why it is awesome
  • Feel comfortable working with dataframes using dplyr functions.

The dplyr package

Data scientists largely work in data frames and do things to data. This is what the package dplyr is optimized for. It consists of a series of “verbs” which cover 95% of what you need to do for most basic data processing tasks.

The dplyr package contains a set of verbs: things you do to dataframes. Those verbs are:

  • filter()
  • arrange()
  • select()
  • rename()
  • distinct()
  • mutate()
  • summarise()

The %>% pipe

%>% is a “pipe”. It is a way to write code without so many parentheses. For example, what if I want to find the square root of the sum of the first six elements of a sequence of 10 to 20 by 2?

Here’s what that command would look like in base R:

Pretty overwhelming, and pretty easy to make errors in writing it out.

But the above could also be written a simpler way:

When you see the %>% pipe symbol, think of the word “then”.

The above code could be read aloud like so: “First, get a sequence of every second number between 10 and 20. Then, take the first six values. Then, sum those samples together. Then, take the square root of that sum.”

Using the %>% pipe framework, your code turns from a nonlinear series of parentheses and brackets to a linear progression of steps, which is a closer fit to how we tend to think about working with data. Instead of working from the inside of a command outward, we thinking linearly: take the data, then do things with it, then do more things with it, etc.

Here’s another example:

… could also be written as:

dplyr verbs

To practice the dplyr verbs, let’s make a small dataframe named people:

filter()

The filter() function is used to subset a dataframe, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.

You can also filter according to multiple conditions. Here are three ways to acheive the same thing:

Note that when a condition evaluates to NA, its row will be dropped. This differ from the base subsetting works with [ ... ].

select()

Select only certain variables in a data frame, making the dataframe skinnier (fewer columns).

As you select columns, you can rename them like so:

You can also select a set of columns using the : notation:

rename()

The function rename() changes the names of individual variables.

This verb takes the syntax <new_name> = <old_name> syntax.

summarize()

summarize() or summarize() creates an entirely new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

Note the use of the function, n(). This simple function counts up the number of records in each group.

Instructor tip:

To illustrate these dplyr verbs and re-energize the room, ask everyone to stand. Tell the students that they represent a dataframe called people. Now, write a dplyr command into your R Console and ask them to act out the command. After each command, give them time to move around to act it out. If they move around too slowly, egg them on: “Come on, you all are like the slowest computer ever!”

people %>% arrange(shoe_size)
people %>% arrange(shoe_size) %>% filter(sex == "female")
people %>% arrange(hair_length)
people %>% arrange(desc(hair_length))
people %>% group_by(sex) %>% arrange(hair_length))
people %>% arrange(country_of_birth, shirt_color, desc(shoe_size))

etc.

Exercises

Answer these questions using the new dplyr verbs you just learned:

Baby names over time

1. Run the below code to load a dataset about baby names given in the USA since the 1800’s.

2. Check out the first and last six rows of bn.

3. What are the names of the variables in this dataset?

4. How many rows are in this dataset?

5. What is the earliest year in this dataset?

6. Create a dataframe named turn_of_century, which contain data only for the year 1900.

7. Create a dataframe named boys, containing only boys.

8. Create a dataframe named moms_gen. This should be females born in the year of birth of your mom.

9. Order moms_gen by n, in ascending order (i.e., with the least popular name at top). Look at the result; what is the least popular name among women the year your mom was born?

10. Reverse the order and save the result into an object named moms_gen_ordered.

11. Create an object named boys2k. This should be all males born in the year 2000.

12. Arrange boys2k from most to least popular. What was the most popular boys name in 2000?

13. What percentage of boys were named Joseph in 2000?

14. Were there more Jims or Matthews in 2020?

15. Create an object named tot_names_by_year, which contains the total counts for boy and girl names assigned in each year of the dataset. You should have four columns: year, boys, girls, and tot.

16. How many people were born with your name in 2020?

17. Was your name more prevalent in 2020 than it was in the year you were born?

18. What if you account for the changing overall population size? In other words, is the proportional prevalence of your name greater in 2020 or your birth year?

19. In which year was your name the most prevalent?

20. Create a basic plot of the proportional prevalence of your name since the earliest year of this dataset.

21. Update this plot with lines for your parent’s names and your siblings names, if you have any.

22. Format that plot so that it is gorgeous and well-labelled.

23. Screenshot it and email it to your family.

Instructor tip:

After completing the exercises here, it is worthwhile devoting time to the Review modules entitled, “A dplyr mystery”, “A dplyr survey”, and “Global health and ggplot. Once students become comfortable with working with dplyr, they will be ready to work independently on projects, using the modules in the Deep R section for references.