Module 13 Dataframes
- Practice exploring, summarizing, and filtering dataframes
A vector is the most basic data structure in
R, and the other structures are built out of vectors. But, as a data scientist, the most common data structure you will be working with – by far – is a dataframe.
A dataframe, essentially, is a spreadsheet: a dataset with rows and columns, in which each column represents is a vector of the same class of data.
Here is what a dataframe looks like:
In this dataframe, each row pertains to a unique iris plant. The columns contain related information about each individual plant.
Here’s another data.frame, built from scratch, which shows that dataframes are just a group of vectors:
In this command, we used the
data.frame() function to combine two vectors into a dataframe with two columns named
R then saved this result in a new variable named
df. When we call
R shows us the dataframe.
The great thing about dataframes is that they allow you to relate different data types to each other.
This dataframe has one column of class
character and another of class
Subsetting & exploring dataframes
To explore dataframes, let’s use a dataset on fuel mileage for all cars sold from 1985 to 2014.
# need to install first install.packages('fueleconomy') library(fueleconomy) data(vehicles) head(vehicles) id make model year class trans 1 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd 2 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars Manual 5-spd 3 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd 4 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd 5 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars Manual 5-spd 6 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd drive cyl displ fuel hwy cty 1 Front-Wheel Drive 4 2.2 Regular 26 20 2 Front-Wheel Drive 4 2.2 Regular 28 22 3 Front-Wheel Drive 6 3.0 Regular 26 18 4 Front-Wheel Drive 4 2.3 Regular 27 19 5 Front-Wheel Drive 4 2.3 Regular 29 21 6 Front-Wheel Drive 6 3.0 Regular 26 17
To look at this dataframe in full, you call display it in a separate tab within
RStudio using the
A dataframe has rows of data organized into columns. In this dataframe, each row pertains to a single vehicle make/model – i.e., a single observation. Each column pertains to a single type of data. Columns are named in the header of the dataframe.
All the same useful exploration and subsetting functions that applied to vectors now apply to dataframes. In addition to those functions you already know, let’s add some new functions to your inventory of useful functions.
tail() summarize the beginning and end of the object:
head(vehicles) id make model year class trans 1 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd 2 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars Manual 5-spd 3 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd 4 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd 5 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars Manual 5-spd 6 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd drive cyl displ fuel hwy cty 1 Front-Wheel Drive 4 2.2 Regular 26 20 2 Front-Wheel Drive 4 2.2 Regular 28 22 3 Front-Wheel Drive 6 3.0 Regular 26 18 4 Front-Wheel Drive 4 2.3 Regular 27 19 5 Front-Wheel Drive 4 2.3 Regular 29 21 6 Front-Wheel Drive 6 3.0 Regular 26 17 tail(vehicles) id make model year class trans 33437 28868 Yugo GV Plus/GV/Cabrio 1990 Minicompact Cars Manual 4-spd 33438 6635 Yugo GV Plus/GV/Cabrio 1990 Subcompact Cars Manual 5-spd 33439 3157 Yugo GV/GVX 1987 Subcompact Cars Manual 4-spd 33440 5497 Yugo GV/GVX 1989 Subcompact Cars Manual 4-spd 33441 5498 Yugo GV/GVX 1989 Subcompact Cars Manual 5-spd 33442 1745 Yugo Gy/yugo GVX 1986 Minicompact Cars Manual 4-spd drive cyl displ fuel hwy cty 33437 Front-Wheel Drive 4 1.3 Regular 27 21 33438 Front-Wheel Drive 4 1.3 Regular 28 23 33439 Front-Wheel Drive 4 1.1 Regular 29 24 33440 Front-Wheel Drive 4 1.1 Regular 29 24 33441 Front-Wheel Drive 4 1.3 Regular 28 23 33442 Front-Wheel Drive 4 1.1 Regular 29 22
names() tells you the column names:
dim() tell you about the dimensions of your dataframe:
length() does not work the same on dataframes as it does with vectors. In dataframes,
length() is the equivalent of
ncol(); it will not give you the number of rows in a dataset.
Importantly, you can use
is.na() to ask whether columns or rows contain
# Check for NAs # Which rows in the `hwy` column have NA's? which(is.na(vehicles$hwy)) integer(0) # (No NAs in that column!) # What about rows in the `cyl` column? which(is.na(vehicles$cyl))  1232 1233 2347 3246 3247 3248 6115 6116 6533 7783 7784 8472  10613 10614 11696 11697 12411 12412 12413 12928 12929 12934 12935 12944  16429 16430 21070 23472 23473 23474 24485 24486 24487 24488 24489 26150  28628 28704 28705 28706 28707 28708 28709 29314 29315 30023 30024 30025  30026 30027 30028 31063 31064 31065 31066 31067 31068 31069 # (lots of NAs in that column!)
Recall that dataframes are filtered by row and/or column using this format:
dataframe[rows,columns]. To get the third element of the second column, for example, you type
Note that the comma is necessary even if you do not want to specify columns. If you try to type this …
R will assume you are asking for the third column, not the third row.
To filter a dataframe to multiple values, you can specify vectors for the
Columns can also be called according to their names. Use the
$ sign to specify a column.
Note that when you use a
$, you will not need to use a comma within your brackets. If you try to run this …
R will throw a fit.
Also recall that you can use logical tests, which return boolean values
FALSE, to filter dataframes to rows that meet certain conditions. For example, to filter to only the rows for cars with better than 100 mpg, you can use this syntax:
Or you can write all this in a single line, to be more efficient:
Recall that the logical test is returning a bunch of
FALSE’s, one for each row of
vehicles. Only the
TRUE rows will be returned.
The same summary functions that you have used for vectors work for the columns in dataframes, since each column is also a vector. Check it out:
You can also use the
summary() function, which provides summary statistics for each column in your dataframe:
summary(vehicles) id make model year Min. : 1 Length:33442 Length:33442 Min. :1984 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991 Median :16724 Mode :character Mode :character Median :1999 Mean :17038 Mean :1999 3rd Qu.:25265 3rd Qu.:2008 Max. :34932 Max. :2015 class trans drive cyl Length:33442 Length:33442 Length:33442 Min. : 2.000 Class :character Class :character Class :character 1st Qu.: 4.000 Mode :character Mode :character Mode :character Median : 6.000 Mean : 5.772 3rd Qu.: 6.000 Max. :16.000 NA's :58 displ fuel hwy cty Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00 Median :3.000 Mode :character Median : 23.00 Median : 17.00 Mean :3.353 Mean : 23.55 Mean : 17.49 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00 Max. :8.400 Max. :109.00 Max. :138.00 NA's :57
unique() returns unique values within a column:
order() function helps you sort a dataframe according to the values in one of its columns.
# Sort dataframe by highway mileage # Only keep certain columns vehicles_sorted <- vehicles[order(vehicles$hwy), c(2,3,4,10:12)] head(vehicles_sorted) make model year fuel hwy cty 397 Aston Martin Lagonda 1985 Regular 9 7 398 Aston Martin Lagonda 1985 Regular 9 7 406 Aston Martin Saloon/Vantage/Volante 1985 Regular 9 7 408 Aston Martin Saloon/Vantage/Volante 1985 Regular 9 7 27725 Rolls-Royce Camargue 1987 Regular 9 7 27726 Rolls-Royce Continental 1987 Regular 9 7
Reverse the order by wrapping
rev() around the
vehicles_sorted <- vehicles[rev(order(vehicles$hwy)), c(2,3,4,10:12)] head(vehicles_sorted) make model year fuel hwy cty 6533 Chevrolet Spark EV 2014 Electricity 109 128 10614 Fiat 500e 2014 Electricity 108 122 10613 Fiat 500e 2013 Electricity 108 122 28628 Scion iQ EV 2013 Electricity 105 138 16430 Honda Fit EV 2014 Electricity 105 132 16429 Honda Fit EV 2013 Electricity 105 132
As shown above, to create a new dataframe, use the
car mgp_hwy mpg_city 100 Acura Legend 23 15 101 Acura Legend 22 17 102 Acura Legend 23 16 103 Acura Legend 21 16 104 Acura Legend 22 17 105 Acura Legend 23 16 106 Acura Legend 24 16
Note how the columns were named in the
data.frame() call, and that each column is separated by a comma.
You can also stage an empty dataframe, which sounds useless but will become very useful as you start working with
for loops and other higher-order
To coerce an object into a format that
R interprets as a dataframe, use
To bind multiple dataframes together by row, use
Note that to be combined, two dataframes have to have the exact same number of columns and the exact same column names.
The only exception to this is adding a dataframe with content an empty dataframe. That can work, and that will be helpful in the Deep
R modules ahead.
You can also bind multiple dataframes together by column, using
df1 <- data.frame(name=c("Ben","Joe","Eric","Isabelle"), instrument=c("Nose harp","Concertina","Ukelele","Drums")) df <- data.frame(age=c(33,35,35,20), home=c("Canada","Spain","USA","USA")) df <- cbind(df,df1) df age home name instrument 1 33 Canada Ben Nose harp 2 35 Spain Joe Concertina 3 35 USA Eric Ukelele 4 20 USA Isabelle Drums
Note that to be combined, two dataframes have to have the exact same number of rows and the exact same column names.
To create a new column for a dataframe, use the
$ symbol and provide the name of the new column:
To alter certain values in the dataframe, you can assign new values to a subset of your dataframe.
Here are four ways to do the same thing: upating Isabelle’s X-factor:
Option 1: Subsetting a single column
Option 2: Subsetting both rows and columns
Option 3: Subsetting a column based on a logical test
Option 3: Subsetting row and columns using logical tests
Reading for errors
What is wrong with these commands? Why will each of them throw an error if you run them, and how can you fix them?
Subsetting and filtering
5. Subset one field according to a logical test: With no more than two lines of code, get the number of Honda cars in the
6. Subset one field according to a logical test for a different field: In a single line of code, show the mileages of all the Toyotas in the dataset.
7. Subset a dataframe to a single subgroup: In a single line of code, determine how many differet car makes/models were produced in 1995.
8. Get the mean value for a subgroup of data: What is the average city mileage for Subaru cars in the dataset?
9. Subset a dataframe to only data from between two values: According to this dataset, how many different car makes/models have been produced with highway mileages between 30 and 40 mpg?
10. Subset by removing
NAs: Create a new version of the
vehicles dataframe that does not have any
NAs in the
11. Create a vector called
people of 5 peoples names from the class.
12. Show with code how many people are in your vector
13. Create another vector called
height which is the number of centimeters tall each of those 5 people are.
14. Combine these two vectors into a data frame.
Now let’s create a new object named
animals. This is going to be a dataframe with 4 different columns:
weight (in kg),
veg (whether or not the animal is a vegetarian / herbivore).
15. Come up with five species to add to your dataframe and list them in a vector named
16. Make the other vectors with details about those species in the correct order.
17. Combine these vectors into a dataframe named
18. Add a column to your
animals dataframe named
rank, which ranks each animal from your least favorite (0) to your most favorite (5).
19. Now write code to manually switch the ranking for your top two favorite animals.
20. What is the mean weight of the herbivorous animals that you listed, if any?
21. What is the mean weight of the omnivorous/carnivorous animals that you listed?