Module 13 Dataframes

Learning goal

  • Practice exploring, summarizing, and filtering dataframes

 

A vector is the most basic data structure in R, and the other structures are built out of vectors. But, as a data scientist, the most common data structure you will be working with – by far – is a dataframe.

A dataframe, essentially, is a spreadsheet: a dataset with rows and columns, in which each column represents is a vector of the same class of data.

Here is what a dataframe looks like:

# Using one of R's built-in datasets
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In this dataframe, each row pertains to a unique iris plant. The columns contain related information about each individual plant.

Here’s another data.frame, built from scratch, which shows that dataframes are just a group of vectors:

x <- 25:29
y <- 55:59
df <- data.frame(x,y)
df
   x  y
1 25 55
2 26 56
3 27 57
4 28 58
5 29 59

In this command, we used the data.frame() function to combine two vectors into a dataframe with two columns named x and y. R then saved this result in a new variable named df. When we call df, R shows us the dataframe.

The great thing about dataframes is that they allow you to relate different data types to each other.

df <- data.frame(name=c("Ben","Joe","Eric"),
                 height=c(75,73,80))
df
  name height
1  Ben     75
2  Joe     73
3 Eric     80

This dataframe has one column of class character and another of class numeric.

Subsetting & exploring dataframes

To explore dataframes, let’s use a dataset on fuel mileage for all cars sold from 1985 to 2014.

# need to install first install.packages('fueleconomy')
library(fueleconomy)
data(vehicles)
head(vehicles)
     id  make       model year           class           trans
1 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
2 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars    Manual 5-spd
3 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
4 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
5 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars    Manual 5-spd
6 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
              drive cyl displ    fuel hwy cty
1 Front-Wheel Drive   4   2.2 Regular  26  20
2 Front-Wheel Drive   4   2.2 Regular  28  22
3 Front-Wheel Drive   6   3.0 Regular  26  18
4 Front-Wheel Drive   4   2.3 Regular  27  19
5 Front-Wheel Drive   4   2.3 Regular  29  21
6 Front-Wheel Drive   6   3.0 Regular  26  17

To look at this dataframe in full, you call display it in a separate tab within RStudio using the View() function:

View(vehicles)

A dataframe has rows of data organized into columns. In this dataframe, each row pertains to a single vehicle make/model – i.e., a single observation. Each column pertains to a single type of data. Columns are named in the header of the dataframe.

All the same useful exploration and subsetting functions that applied to vectors now apply to dataframes. In addition to those functions you already know, let’s add some new functions to your inventory of useful functions.

Exploration

head() and tail() summarize the beginning and end of the object:

head(vehicles) 
     id  make       model year           class           trans
1 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
2 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars    Manual 5-spd
3 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
4 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
5 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars    Manual 5-spd
6 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
              drive cyl displ    fuel hwy cty
1 Front-Wheel Drive   4   2.2 Regular  26  20
2 Front-Wheel Drive   4   2.2 Regular  28  22
3 Front-Wheel Drive   6   3.0 Regular  26  18
4 Front-Wheel Drive   4   2.3 Regular  27  19
5 Front-Wheel Drive   4   2.3 Regular  29  21
6 Front-Wheel Drive   6   3.0 Regular  26  17

tail(vehicles) 
         id make             model year            class        trans
33437 28868 Yugo GV Plus/GV/Cabrio 1990 Minicompact Cars Manual 4-spd
33438  6635 Yugo GV Plus/GV/Cabrio 1990  Subcompact Cars Manual 5-spd
33439  3157 Yugo            GV/GVX 1987  Subcompact Cars Manual 4-spd
33440  5497 Yugo            GV/GVX 1989  Subcompact Cars Manual 4-spd
33441  5498 Yugo            GV/GVX 1989  Subcompact Cars Manual 5-spd
33442  1745 Yugo       Gy/yugo GVX 1986 Minicompact Cars Manual 4-spd
                  drive cyl displ    fuel hwy cty
33437 Front-Wheel Drive   4   1.3 Regular  27  21
33438 Front-Wheel Drive   4   1.3 Regular  28  23
33439 Front-Wheel Drive   4   1.1 Regular  29  24
33440 Front-Wheel Drive   4   1.1 Regular  29  24
33441 Front-Wheel Drive   4   1.3 Regular  28  23
33442 Front-Wheel Drive   4   1.1 Regular  29  22

names() tells you the column names:

names(vehicles) 
 [1] "id"    "make"  "model" "year"  "class" "trans" "drive" "cyl"   "displ"
[10] "fuel"  "hwy"   "cty"  

nrow(), ncol(), and dim() tell you about the dimensions of your dataframe:

nrow(vehicles) 
[1] 33442
ncol(vehicles) 
[1] 12
dim(vehicles) 
[1] 33442    12

Note that length() does not work the same on dataframes as it does with vectors. In dataframes, length() is the equivalent of ncol(); it will not give you the number of rows in a dataset.

Importantly, you can use is.na() to ask whether columns or rows contain NAs:

# Check for NAs

# Which rows in the `hwy` column have NA's?
which(is.na(vehicles$hwy))
integer(0)

# (No NAs in that column!)

# What about rows in the `cyl` column?
which(is.na(vehicles$cyl))
 [1]  1232  1233  2347  3246  3247  3248  6115  6116  6533  7783  7784  8472
[13] 10613 10614 11696 11697 12411 12412 12413 12928 12929 12934 12935 12944
[25] 16429 16430 21070 23472 23473 23474 24485 24486 24487 24488 24489 26150
[37] 28628 28704 28705 28706 28707 28708 28709 29314 29315 30023 30024 30025
[49] 30026 30027 30028 31063 31064 31065 31066 31067 31068 31069

# (lots of NAs in that column!)

Subsetting

Recall that dataframes are filtered by row and/or column using this format: dataframe[rows,columns]. To get the third element of the second column, for example, you type dataframe[3,2].

vehicles[3,2]
[1] "Acura"

Note that the comma is necessary even if you do not want to specify columns. If you try to type this …

vehicles[3]

R will assume you are asking for the third column, not the third row.

To filter a dataframe to multiple values, you can specify vectors for the row and column

vehicles[1:3,11:12] # can use colons
  hwy cty
1  26  20
2  28  22
3  26  18
vehicles[1:3,c(1,11:12)]  # can use c()
     id hwy cty
1 13309  26  20
2 13310  28  22
3 13311  26  18

Columns can also be called according to their names. Use the $ sign to specify a column.

vehicles$hwy[1:5]
[1] 26 28 26 27 29

Note that when you use a $, you will not need to use a comma within your brackets. If you try to run this …

vehicles$hwy[1:5,]

R will throw a fit.

Also recall that you can use logical tests, which return boolean values TRUE or FALSE, to filter dataframes to rows that meet certain conditions. For example, to filter to only the rows for cars with better than 100 mpg, you can use this syntax:

# Build your logical test
verdicts <- vehicles$hwy > 100

# Subset with booleans
vehicles[verdicts,2:3]
           make    model
6533  Chevrolet Spark EV
10613      Fiat     500e
10614      Fiat     500e
16429     Honda   Fit EV
16430     Honda   Fit EV
24487    Nissan     Leaf
24488    Nissan     Leaf
24489    Nissan     Leaf
28628     Scion    iQ EV

Or you can write all this in a single line, to be more efficient:

vehicles[ vehicles$hwy > 100  , 2:3]
           make    model
6533  Chevrolet Spark EV
10613      Fiat     500e
10614      Fiat     500e
16429     Honda   Fit EV
16430     Honda   Fit EV
24487    Nissan     Leaf
24488    Nissan     Leaf
24489    Nissan     Leaf
28628     Scion    iQ EV

Recall that the logical test is returning a bunch of TRUE’s and FALSE’s, one for each row of vehicles. Only the TRUE rows will be returned.

Summarizing

The same summary functions that you have used for vectors work for the columns in dataframes, since each column is also a vector. Check it out:

min(vehicles$hwy)
[1] 9
max(vehicles$hwy)
[1] 109
mean(vehicles$cty)
[1] 17.491
sd(vehicles$cty)
[1] 5.582174
str(vehicles$make)
 chr [1:33442] "Acura" "Acura" "Acura" "Acura" "Acura" "Acura" "Acura" ...
class(vehicles$hwy)
[1] "numeric"

You can also use the summary() function, which provides summary statistics for each column in your dataframe:

summary(vehicles)
       id            make              model                year     
 Min.   :    1   Length:33442       Length:33442       Min.   :1984  
 1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
 Median :16724   Mode  :character   Mode  :character   Median :1999  
 Mean   :17038                                         Mean   :1999  
 3rd Qu.:25265                                         3rd Qu.:2008  
 Max.   :34932                                         Max.   :2015  
                                                                     
    class              trans              drive                cyl        
 Length:33442       Length:33442       Length:33442       Min.   : 2.000  
 Class :character   Class :character   Class :character   1st Qu.: 4.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
                                                          Mean   : 5.772  
                                                          3rd Qu.: 6.000  
                                                          Max.   :16.000  
                                                          NA's   :58      
     displ           fuel                hwy              cty        
 Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
 1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
 Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
 Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
 3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
 Max.   :8.400                      Max.   :109.00   Max.   :138.00  
 NA's   :57                                                          

The function unique() returns unique values within a column:

unique(vehicles$fuel)
 [1] "Regular"                     "Premium"                    
 [3] "Diesel"                      "Premium or E85"             
 [5] "Electricity"                 "Gasoline or E85"            
 [7] "Premium Gas or Electricity"  "Gasoline or natural gas"    
 [9] "CNG"                         "Midgrade"                   
[11] "Regular Gas and Electricity" "Gasoline or propane"        
[13] "Premium and Electricity"    

Finally, the order() function helps you sort a dataframe according to the values in one of its columns.

# Sort dataframe by highway mileage
# Only keep certain columns
vehicles_sorted <- vehicles[order(vehicles$hwy), 
                            c(2,3,4,10:12)] 
head(vehicles_sorted)
              make                  model year    fuel hwy cty
397   Aston Martin                Lagonda 1985 Regular   9   7
398   Aston Martin                Lagonda 1985 Regular   9   7
406   Aston Martin Saloon/Vantage/Volante 1985 Regular   9   7
408   Aston Martin Saloon/Vantage/Volante 1985 Regular   9   7
27725  Rolls-Royce               Camargue 1987 Regular   9   7
27726  Rolls-Royce            Continental 1987 Regular   9   7

Reverse the order by wrapping rev() around the order() call:

vehicles_sorted <- vehicles[rev(order(vehicles$hwy)),
                            c(2,3,4,10:12)]
head(vehicles_sorted)
           make    model year        fuel hwy cty
6533  Chevrolet Spark EV 2014 Electricity 109 128
10614      Fiat     500e 2014 Electricity 108 122
10613      Fiat     500e 2013 Electricity 108 122
28628     Scion    iQ EV 2013 Electricity 105 138
16430     Honda   Fit EV 2014 Electricity 105 132
16429     Honda   Fit EV 2013 Electricity 105 132

Creating dataframes

As shown above, to create a new dataframe, use the data.frame() function.

             car mgp_hwy mpg_city
100 Acura Legend      23       15
101 Acura Legend      22       17
102 Acura Legend      23       16
103 Acura Legend      21       16
104 Acura Legend      22       17
105 Acura Legend      23       16
106 Acura Legend      24       16

Note how the columns were named in the data.frame() call, and that each column is separated by a comma.

You can also stage an empty dataframe, which sounds useless but will become very useful as you start working with for loops and other higher-order R tools.

df <- data.frame()
df
data frame with 0 columns and 0 rows

To coerce an object into a format that R interprets as a dataframe, use as.dataframe():

df <- as.data.frame(vehicles)
df[1:4,1:4]
     id  make       model year
1 13309 Acura 2.2CL/3.0CL 1997
2 13310 Acura 2.2CL/3.0CL 1997
3 13311 Acura 2.2CL/3.0CL 1997
4 14038 Acura 2.3CL/3.0CL 1998

Modifying dataframes

Combining dataframes

To bind multiple dataframes together by row, use rbind():

# Build up a dataframe
df1 <- data.frame(name=c("Ben","Joe","Eric","Isabelle"),
                  instrument=c("Nose harp","Concertina","Ukelele","Drums"))
df1
      name instrument
1      Ben  Nose harp
2      Joe Concertina
3     Eric    Ukelele
4 Isabelle      Drums
# Build up a second dataframe
df2 <- data.frame(name=c("Matthew"),
                  instrument=c("Washboard"))
# Combine those dataframes together
rbind(df1,df2)
      name instrument
1      Ben  Nose harp
2      Joe Concertina
3     Eric    Ukelele
4 Isabelle      Drums
5  Matthew  Washboard

Note that to be combined, two dataframes have to have the exact same number of columns and the exact same column names.

The only exception to this is adding a dataframe with content an empty dataframe. That can work, and that will be helpful in the Deep R modules ahead.

df <- data.frame() # stage empty dataframe

df1 <- data.frame(name=c("Ben","Joe","Eric","Isabelle"),
                  instrument=c("Nose harp","Concertina","Ukelele","Drums"))

df <- rbind(df,df1)

df
      name instrument
1      Ben  Nose harp
2      Joe Concertina
3     Eric    Ukelele
4 Isabelle      Drums

You can also bind multiple dataframes together by column, using cbind():

df1 <- data.frame(name=c("Ben","Joe","Eric","Isabelle"),
                  instrument=c("Nose harp","Concertina","Ukelele","Drums"))

df <- data.frame(age=c(33,35,35,20), home=c("Canada","Spain","USA","USA"))

df <- cbind(df,df1)

df
  age   home     name instrument
1  33 Canada      Ben  Nose harp
2  35  Spain      Joe Concertina
3  35    USA     Eric    Ukelele
4  20    USA Isabelle      Drums

Note that to be combined, two dataframes have to have the exact same number of rows and the exact same column names.

Adding columns

To create a new column for a dataframe, use the $ symbol and provide the name of the new column:

df$x_factor <- c(3,20,60,40)

df
  age   home     name instrument x_factor
1  33 Canada      Ben  Nose harp        3
2  35  Spain      Joe Concertina       20
3  35    USA     Eric    Ukelele       60
4  20    USA Isabelle      Drums       40

Altering values

To alter certain values in the dataframe, you can assign new values to a subset of your dataframe.

Here are four ways to do the same thing: upating Isabelle’s X-factor:

Option 1: Subsetting a single column

df$x_factor[4] <- 70

Option 2: Subsetting both rows and columns

df[4,5] <- 70

Option 3: Subsetting a column based on a logical test

df$x_factor[df$name == 'Isabelle'] <- 70

Option 3: Subsetting row and columns using logical tests

df[df$name == 'Isabelle', names(df) == 'x_factor'] <- 70
df
  age   home     name instrument x_factor
1  33 Canada      Ben  Nose harp        3
2  35  Spain      Joe Concertina       20
3  35    USA     Eric    Ukelele       60
4  20    USA Isabelle      Drums       70

Exercises

Reading for errors

What is wrong with these commands? Why will each of them throw an error if you run them, and how can you fix them?

1. vehicles[1,15,]

2. vecihles[1:5,]

3. vehicles$hwy[15,]

4. vehicles[1:5,1:13]

 

Subsetting and filtering

5. Subset one field according to a logical test: With no more than two lines of code, get the number of Honda cars in the vehicles dataset.

6. Subset one field according to a logical test for a different field: In a single line of code, show the mileages of all the Toyotas in the dataset.

7. Subset a dataframe to a single subgroup: In a single line of code, determine how many differet car makes/models were produced in 1995.

8. Get the mean value for a subgroup of data: What is the average city mileage for Subaru cars in the dataset?

9. Subset a dataframe to only data from between two values: According to this dataset, how many different car makes/models have been produced with highway mileages between 30 and 40 mpg?

10. Subset by removing NAs: Create a new version of the vehicles dataframe that does not have any NAs in the trans column.

 

Creating dataframes

11. Create a vector called people of 5 peoples names from the class.

12. Show with code how many people are in your vector

13. Create another vector called height which is the number of centimeters tall each of those 5 people are.

14. Combine these two vectors into a data frame.

Now let’s create a new object named animals. This is going to be a dataframe with 4 different columns: species, weight (in kg), color, veg (whether or not the animal is a vegetarian / herbivore).

15. Come up with five species to add to your dataframe and list them in a vector named species.

16. Make the other vectors with details about those species in the correct order.

17. Combine these vectors into a dataframe named animals.

 

Altering dataframes

18. Add a column to your animals dataframe named rank, which ranks each animal from your least favorite (0) to your most favorite (5).

19. Now write code to manually switch the ranking for your top two favorite animals.

20. What is the mean weight of the herbivorous animals that you listed, if any?

21. What is the mean weight of the omnivorous/carnivorous animals that you listed?