# Module 12 Subsetting & filtering

#### Learning goals

• Understand how to subset / filter data

You have been introduced to subsetting and filtering briefly in previous modules, but the concept is so important that we want to devote an entire module to practicing it.

## Subsetting with indices

You have already learned that certain elements of a vector can be called by specifying an index:

``````x <- 55:65

# Call x without subsetting
x
[1] 55 56 57 58 59 60 61 62 63 64 65``````
``````# Now call only the third element of x
x[3]
[1] 57``````

Remember: brackets indicate that you don’t want everything from a vector; you just want certain elements. Think of them as saying, ‘I want `x`, but not all of it.’

You can also subset an object by calling multiple indices:

``````# Now call the third, fourth, and fifth element of x
x[c(3,4,5)]
[1] 57 58 59``````
``````# Another way of doing the same thing:
x[3:5]
[1] 57 58 59``````
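An index vector can also be stored in a variable before it goes inside the brackets, and R returns the elements in the order the indices are listed. Here is a minimal sketch of that idea; the names `keep` and `picked` are hypothetical and not part of this module:

```r
x <- 55:65

# Store the indices first (hypothetical helper vector)
keep <- c(5, 3)

# Subset x with the stored indices; the order of the indices is respected
picked <- x[keep]
picked
# [1] 59 57
```

Storing the indices separately makes it easy to reuse the same selection on another vector of the same length.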

## Subsetting with booleans

You can also subset objects with ‘booleans’. This will eventually be your most common way of filtering data, by far.

Recall that boolean / logical data have two possible values: `TRUE` or `FALSE`. For example:

``````# Store Joe's age
joes_age <- 35

# Set the cutoff for old age
old_age <- 36

# Ask whether Joe is old
joes_age >= old_age
[1] FALSE``````

Recall also that you can calculate whether a condition is `TRUE` or `FALSE` on multiple elements of a vector. For example:

``````# Build a vector of multiple ages
ages <- c(10, 20, 30, 40, 50, 60)

# Set the cutoff for old age
old_age <- 36

# Ask which ages are considered old
ages >= old_age
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE``````

Boolean vectors are super useful for subsetting. Think of ‘subsetting’ as keeping only those elements of a vector for which a condition is `TRUE`.

``````x <- 55:59

# Call x without subsetting
x
[1] 55 56 57 58 59``````
``````# Now subset to the second, third, and fourth element
x[c(FALSE, TRUE, TRUE, TRUE, FALSE)]
[1] 56 57 58``````

That command returned the elements for which the subsetting vector was `TRUE`.

This is equivalent to…

``````x[2:4]
[1] 56 57 58``````

You can also get the same result using a logical test, since logical tests return boolean values:

``````# Develop your logical test: ask which values of x are in the vector 56:58
x %in% c(56,57,58)
[1] FALSE  TRUE  TRUE  TRUE FALSE

# Now plug that test into the subsetting brackets
x[ x %in% c(56,57,58) ]
[1] 56 57 58``````
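The same pattern works with the `ages` example from earlier in this module: build the logical test, then place it inside the brackets. The sketch below also uses base R's `which()` to show which positions hold a `TRUE`, which ties boolean subsetting back to index subsetting. The names `is_old`, `old_ages`, and `old_positions` are hypothetical, chosen only for illustration:

```r
ages <- c(10, 20, 30, 40, 50, 60)
old_age <- 36

# Logical test: one TRUE or FALSE per element of ages
is_old <- ages >= old_age

# Keep only the elements where the test is TRUE
old_ages <- ages[is_old]
old_ages
# [1] 40 50 60

# which() returns the positions of the TRUE values
old_positions <- which(is_old)
old_positions
# [1] 4 5 6
```

Subsetting with `old_positions` gives the same result as subsetting with `is_old`.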

This method gets really useful when you are working with bigger datasets, such as this one:

``````# Make a large dataset of random numbers
y <- sample(1:1000,size=100)
length(y)
[1] 100``````
``````range(y)
[1]  17 992``````

With a dataset like this, you can use a boolean filter to figure out how many values are greater than, say, 90.

First, develop your logical test, which will tell you whether each value in the vector is greater than 90:

``````# Develop your logical test
y > 90
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE FALSE  TRUE``````

Now, to get the values corresponding to each `TRUE` in this vector, plug your logical test into your subsetting brackets.

``````y[y > 90]
 [1] 989 367 630 140 403 732 745 544 547 521 804 921 953 222 420 805 374 685 148
[20] 157 448 908 590 193 428 569 397 390 389 515 317 642 276 238 848 801 160 761
[39] 597 785 578 343 952 786 510 759 909 992 278 435 142 131 691 899 886 716 820
[58] 789 877 838 764 594 955 816 100  97 699 514 985 648 120 208 978 688 607 690
[77] 558 460 320 470 925 901 885 220 387 610 103 146 735 937 527 776``````

Here’s another way you can do the same thing:

``````# Save the result of your logical test in a new vector
verdicts <- y > 90

# Use that vector to subset y
y[verdicts]
 [1] 989 367 630 140 403 732 745 544 547 521 804 921 953 222 420 805 374 685 148
[20] 157 448 908 590 193 428 569 397 390 389 515 317 642 276 238 848 801 160 761
[39] 597 785 578 343 952 786 510 759 909 992 278 435 142 131 691 899 886 716 820
[58] 789 877 838 764 594 955 816 100  97 699 514 985 648 120 208 978 688 607 690
[77] 558 460 320 470 925 901 885 220 387 610 103 146 735 937 527 776``````
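The filter returns the values themselves. To answer the question of how many values pass the test, one option is to count the `TRUE`s directly, since R treats `TRUE` as 1 and `FALSE` as 0 when summing. The sketch below assumes a `set.seed()` call so the random draw is repeatable (the seed value is arbitrary, so the numbers will differ from those printed above); `n_true` and `n_kept` are hypothetical names:

```r
# Assumption: fix the random seed so the example is repeatable
set.seed(1)
y <- sample(1:1000, size = 100)

# One TRUE or FALSE per element of y
verdicts <- y > 90

# Count the TRUE values directly
n_true <- sum(verdicts)

# Or count the elements that survive the filter
n_kept <- length(y[verdicts])

# Both approaches give the same count
n_true == n_kept
# [1] TRUE
```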

You can combine two logical tests, too. For example, what if you want all elements between 70 and 90?

``````verdicts <- y > 70 & y < 90
y[verdicts]
[1] 87 83 79 74``````
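Because `y` is random, it can help to try a combined test on a small vector where the answer is easy to check by eye. The sketch below reuses the `ages` vector from earlier in this module. It also shows the `|` (or) operator, which this module does not otherwise cover; `middle` and `extremes` are hypothetical names:

```r
ages <- c(10, 20, 30, 40, 50, 60)

# Both conditions must be TRUE: older than 20 AND younger than 60
middle <- ages > 20 & ages < 60
ages[middle]
# [1] 30 40 50

# At least one condition must be TRUE: younger than 20 OR older than 50
extremes <- ages < 20 | ages > 50
ages[extremes]
# [1] 10 60
```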

### Review assignment

1. Create a vector named `nummies` of all numbers from 1 to 100

2. Create another vector named `little_nummies` which consists of all those numbers which are less than or equal to 30

3. Create a boolean vector named `these_are_big` which indicates whether each element of `nummies` is greater than or equal to 70

4. Use `these_are_big` to subset `nummies` into a vector named `big_nummies`

5. Create a new vector named `these_are_not_that_big` which indicates whether each element of `nummies` is greater than 30 and less than 70. You’ll need to use the `&` symbol.

6. Create a new vector named `meh_nummies` which consists of all `nummies` which are greater than 30 and less than 70.

7. How many numbers are greater than 30 and less than 70?

8. What is the sum of all those numbers in `meh_nummies`?