Module 18 Dataframe wrangling

Learning goals

  • Understand the importance of tidy dataframes
  • Understand what the tidyverse is and why it is awesome
  • Feel comfortable working with dataframes using dplyr functions.

The dplyr package

Data scientists largely work in data frames and do things to data. This is what the package dplyr is optimized for. It consists of a series of “verbs” which cover 95% of what you need to do for most basic data processing tasks.

install.packages('dplyr') # if you haven't yet
library(dplyr)

The dplyr package contains a set of verbs: things you do to dataframes. Those verbs are:

  • filter()
  • arrange()
  • select()
  • rename()
  • distinct()
  • mutate()
  • summarise()

The %>% pipe

%>% is a “pipe”. It is a way to write code without so many parentheses. For example, what if I want to find the square root of the sum of the first six elements of a sequence of 10 to 20 by 2?

Here’s what that command would look like in base R:

sqrt(sum(head(seq(10, 20, 2))))
[1] 9.486833

Pretty overwhelming, and pretty easy to make errors in writing it out.

But the above could also be written a simpler way:

seq(10, 20, 2) %>% head %>% sum %>% sqrt
[1] 9.486833

When you see the %>% pipe symbol, think of the word “then”.

The above code could be read aloud like so: “First, get a sequence of every second number between 10 and 20. Then, take the first six values. Then, sum those samples together. Then, take the square root of that sum.”

Using the %>% pipe framework, your code turns from a nonlinear series of parentheses and brackets to a linear progression of steps, which is a closer fit to how we tend to think about working with data. Instead of working from the inside of a command outward, we thinking linearly: take the data, then do things with it, then do more things with it, etc.

Here’s another example:

mean(sd(1:100))
[1] 29.01149

… could also be written as:

1:100 %>% sd %>% mean
[1] 29.01149

dplyr verbs

To practice the dplyr verbs, let’s make a small dataframe named people:

people <- data.frame(who = c('Joe', 'Ben', 'Xing', 'Coloma'),
                    sex = c('Male', 'Male', 'Female', 'Female'),
                    age = c(35, 33, 32, 34))
people
     who    sex age
1    Joe   Male  35
2    Ben   Male  33
3   Xing Female  32
4 Coloma Female  34

filter()

The filter() function is used to subset a dataframe, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.

people %>% filter(sex == 'Male')
  who  sex age
1 Joe Male  35
2 Ben Male  33
people %>% filter(sex == 'Female')
     who    sex age
1   Xing Female  32
2 Coloma Female  34

You can also filter according to multiple conditions. Here are three ways to acheive the same thing:

people %>% filter(sex == 'Female' & age < 33)
   who    sex age
1 Xing Female  32
people %>% filter(sex == 'Female', age < 33)
   who    sex age
1 Xing Female  32
people %>% filter(sex == 'Female') %>% filter(age < 33)
   who    sex age
1 Xing Female  32

Note that when a condition evaluates to NA, its row will be dropped. This differ from the base subsetting works with [ ... ].

arrange()

Arrange means putting things in order. That is, arrange() orders the rows of a data frame by the values of selected columns.

people %>% arrange(age)
     who    sex age
1   Xing Female  32
2    Ben   Male  33
3 Coloma Female  34
4    Joe   Male  35
people %>% arrange(sex)
     who    sex age
1   Xing Female  32
2 Coloma Female  34
3    Joe   Male  35
4    Ben   Male  33
people %>% arrange(who)
     who    sex age
1    Ben   Male  33
2 Coloma Female  34
3    Joe   Male  35
4   Xing Female  32

To reverse the order, use desc():

people %>% arrange(desc(age))
     who    sex age
1    Joe   Male  35
2 Coloma Female  34
3    Ben   Male  33
4   Xing Female  32

You can also arrange by multiple levels:

people %>% arrange(sex, age)
     who    sex age
1   Xing Female  32
2 Coloma Female  34
3    Ben   Male  33
4    Joe   Male  35

select()

Select only certain variables in a data frame, making the dataframe skinnier (fewer columns).

people %>% select(age)
  age
1  35
2  33
3  32
4  34
people %>% select(sex, age)
     sex age
1   Male  35
2   Male  33
3 Female  32
4 Female  34

As you select columns, you can rename them like so:

people %>% select(sex, years = age)
     sex years
1   Male    35
2   Male    33
3 Female    32
4 Female    34

You can also select a set of columns using the : notation:

people %>% select(who:sex)
     who    sex
1    Joe   Male
2    Ben   Male
3   Xing Female
4 Coloma Female

rename()

The function rename() changes the names of individual variables.

This verb takes the syntax <new_name> = <old_name> syntax.

people %>% rename(gender = sex, years = age, first_name = who)
  first_name gender years
1        Joe   Male    35
2        Ben   Male    33
3       Xing Female    32
4     Coloma Female    34

mutate()

The function mutate() adds new variables and preserves existing ones.

New variables overwrite existing variables of the same name.

people %>% mutate(agein2020 = age - 1)
     who    sex age agein2020
1    Joe   Male  35        34
2    Ben   Male  33        32
3   Xing Female  32        31
4 Coloma Female  34        33
people %>% mutate(is_male = sex == 'Male')
     who    sex age is_male
1    Joe   Male  35    TRUE
2    Ben   Male  33    TRUE
3   Xing Female  32   FALSE
4 Coloma Female  34   FALSE
people %>% mutate(average_age = mean(age))
     who    sex age average_age
1    Joe   Male  35        33.5
2    Ben   Male  33        33.5
3   Xing Female  32        33.5
4 Coloma Female  34        33.5

You can call mutate() multiple times in the same pipe:

people %>% mutate(average_age = mean(age)) %>%
  mutate(diff_from_avg = age - average_age)
     who    sex age average_age diff_from_avg
1    Joe   Male  35        33.5           1.5
2    Ben   Male  33        33.5          -0.5
3   Xing Female  32        33.5          -1.5
4 Coloma Female  34        33.5           0.5

You can also remove variables can be removed by setting their value to NULL.

people %>% mutate(age = NULL)
     who    sex
1    Joe   Male
2    Ben   Male
3   Xing Female
4 Coloma Female

A similar function, transmute(), adds new variables and drops existing ones, kind of like a combination of select() and mutate().

people %>% transmute(average_age = mean(age))
  average_age
1        33.5
2        33.5
3        33.5
4        33.5

group_by()

Most data operations are done on groups defined by variables. The function group_by() takes an existing table and converts it into a grouped one where operations are performed “by group”.

people %>% 
  group_by(sex) %>%
  mutate(average_age_for_sex = mean(age)) 
# A tibble: 4 × 4
# Groups:   sex [2]
  who    sex      age average_age_for_sex
  <chr>  <chr>  <dbl>               <dbl>
1 Joe    Male      35                  34
2 Ben    Male      33                  34
3 Xing   Female    32                  33
4 Coloma Female    34                  33
people %>% 
  group_by(sex) %>%
  mutate(average_age_for_sex = mean(age)) %>%
  mutate(diff_from_avg_for_sex = age - average_age_for_sex)
# A tibble: 4 × 5
# Groups:   sex [2]
  who    sex      age average_age_for_sex diff_from_avg_for_sex
  <chr>  <chr>  <dbl>               <dbl>                 <dbl>
1 Joe    Male      35                  34                     1
2 Ben    Male      33                  34                    -1
3 Xing   Female    32                  33                    -1
4 Coloma Female    34                  33                     1

Note that a similar verb, ungroup(), removes grouping.

summarize()

summarize() or summarize() creates an entirely new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

people %>% 
  summarize(average_age = mean(age))
  average_age
1        33.5
people %>% 
  summarize(average_age = mean(age),
            standard_dev_of_age = sd(age),
            oldest_age = max(age),
            youngest_age = min(age))
  average_age standard_dev_of_age oldest_age youngest_age
1        33.5            1.290994         35           32
people %>% 
  group_by(sex) %>%
  summarise(avg_age = mean(age),
            oldest_age = max(age),
            total_years = sum(age)) 
# A tibble: 2 × 4
  sex    avg_age oldest_age total_years
  <chr>    <dbl>      <dbl>       <dbl>
1 Female      33         34          66
2 Male        34         35          68
people %>%
  group_by(sex) %>%
  summarise(sample_size = n())
# A tibble: 2 × 2
  sex    sample_size
  <chr>        <int>
1 Female           2
2 Male             2

Note the use of the function, n(). This simple function counts up the number of records in each group.

Instructor tip:

To illustrate these dplyr verbs and re-energize the room, ask everyone to stand. Tell the students that they represent a dataframe called people. Now, write a dplyr command into your R Console and ask them to act out the command. After each command, give them time to move around to act it out. If they move around too slowly, egg them on: “Come on, you all are like the slowest computer ever!”

people %>% arrange(shoe_size)
people %>% arrange(shoe_size) %>% filter(sex == "female")
people %>% arrange(hair_length)
people %>% arrange(desc(hair_length))
people %>% group_by(sex) %>% arrange(hair_length))
people %>% arrange(country_of_birth, shirt_color, desc(shoe_size))

etc.

Exercises

Answer these questions using the new dplyr verbs you just learned:

Baby names over time

1. Run the below code to load a dataset about baby names given in the USA since the 1800’s.

library(dplyr)
library(babynames)
bn <- babynames

2. Check out the first and last six rows of bn.

3. What are the names of the variables in this dataset?

4. How many rows are in this dataset?

5. What is the earliest year in this dataset?

6. Create a dataframe named turn_of_century, which contain data only for the year 1900.

7. Create a dataframe named boys, containing only boys.

8. Create a dataframe named moms_gen. This should be females born in the year of birth of your mom.

9. Order moms_gen by n, in ascending order (i.e., with the least popular name at top). Look at the result; what is the least popular name among women the year your mom was born?

10. Reverse the order and save the result into an object named moms_gen_ordered.

11. Create an object named boys2k. This should be all males born in the year 2000.

12. Arrange boys2k from most to least popular. What was the most popular boys name in 2000?

13. What percentage of boys were named Joseph in 2000?

14. Were there more Jims or Matthews in 2020?

15. Create an object named tot_names_by_year, which contains the total counts for boy and girl names assigned in each year of the dataset. You should have four columns: year, boys, girls, and tot.

16. How many people were born with your name in 2020?

17. Was your name more prevalent in 2020 than it was in the year you were born?

18. What if you account for the changing overall population size? In other words, is the proportional prevalence of your name greater in 2020 or your birth year?

19. In which year was your name the most prevalent?

20. Create a basic plot of the proportional prevalence of your name since the earliest year of this dataset.

21. Update this plot with lines for your parent’s names and your siblings names, if you have any.

22. Format that plot so that it is gorgeous and well-labelled.

23. Screenshot it and email it to your family.

Instructor tip:

After completing the exercises here, it is worthwhile devoting time to the Review modules entitled, “A dplyr mystery”, “A dplyr survey”, and “Global health and ggplot. Once students become comfortable with working with dplyr, they will be ready to work independently on projects, using the modules in the Deep R section for references.