Module 39 Working with text
- Learn to apply the most common R tools for working with text.
If you do not learn how to edit and transform text-based fields within datasets, you will quickly get stuck in R. Think of dates, GPS coordinates, user IDs, group names, plot labels, etc. All of these forms of data can contain non-numeric text. Becoming comfortable working with text in R is essential.
Here we will present the most common functions for working with text. Remember that R has a special object class for text, known as the character class, and that character objects are often referred to as strings.
Most of these functions come pre-installed in R. However, several of the tools we will show here (as well as many other useful tools that we will not detail here) come from the stringr package. Go ahead and install stringr and load it using library(stringr).
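A minimal sketch of the install-and-load step. The install line is commented out here on the assumption that you only need to run it once per machine:

```r
# Install stringr once per machine (uncomment to run):
# install.packages("stringr")

# Load the package at the start of each R session
library(stringr)
```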
Most common tools
paste() combines two or more strings together into a single object:
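For instance, with three illustrative strings (these inputs are placeholders, not from the survey data):

```r
# Combine three strings into one; paste() puts a space between them
combined <- paste("Working", "with", "text")
combined
# [1] "Working with text"
```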
paste() assumes each object should be separated by a blank space, " ".
paste0() assumes no space between objects, so the same input comes back as one unbroken string.
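A short sketch of the difference, using illustrative strings:

```r
# paste0() joins the strings with nothing in between
squashed <- paste0("Working", "with", "text")
squashed
# [1] "Workingwithtext"
```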
To replicate the original output with paste0(), you manually add blank spaces like this:
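One way to do that, again with illustrative strings, is to put the spaces inside the strings themselves:

```r
# Trailing spaces are typed into the first two strings by hand
spaced <- paste0("Working ", "with ", "text")
spaced
# [1] "Working with text"
```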
You can also use paste() with its collapse argument to collapse multiple objects into a single string.
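A small sketch of collapsing, assuming a hypothetical character vector named words:

```r
# A vector of three separate strings (illustrative values)
words <- c("Working", "with", "text")

# collapse = " " glues the elements together with a space
sentence <- paste(words, collapse = " ")
sentence
# [1] "Working with text"

# Any separator can be used
dashed <- paste(words, collapse = "-")
dashed
# [1] "Working-with-text"
```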
tolower() and toupper() force all text in a string to lower case or upper case, respectively:
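For example, with an illustrative label:

```r
label <- "Plot Label A"

lower <- tolower(label)
lower
# [1] "plot label a"

upper <- toupper(label)
upper
# [1] "PLOT LABEL A"
```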
nchar() returns the number of characters within a string:
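A quick sketch with two illustrative strings. Note that spaces count as characters:

```r
# nchar() works on each element of a character vector
lengths <- nchar(c("plot", "user ID"))
lengths
# [1] 4 7
```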
substr() trims a string according to a start and end character position:
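As a sketch, assuming a date stored as text in year-month-day form (an illustrative value):

```r
date_text <- "2021-05-04"

# Characters 1 through 4 hold the year
year <- substr(date_text, 1, 4)
year
# [1] "2021"

# Characters 6 through 7 hold the month
month <- substr(date_text, 6, 7)
month
# [1] "05"
```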
grep() returns the positions of the elements in a character vector that contain a given pattern (or the elements themselves, if you set value = TRUE):
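A minimal sketch, assuming a hypothetical vector named fruits:

```r
fruits <- c("apple", "banana", "grape", "cherry")

# By default, grep() returns the positions of the matching elements
hits <- grep("ap", fruits)
hits
# [1] 1 3

# With value = TRUE, it returns the matching elements themselves
matches <- grep("ap", fruits, value = TRUE)
matches
# [1] "apple" "grape"
```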
gsub() replaces a given pattern with another in a character vector.
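For example, swapping the dashes in some illustrative date strings for slashes:

```r
dates <- c("2020-01-15", "2021-05-04")

# Replace every "-" with "/" in each element
slashed <- gsub("-", "/", dates)
slashed
# [1] "2020/01/15" "2021/05/04"
```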
stringr::str_pad(): standardizes the lengths of strings by “padding” them (e.g., with zeroes):
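A short sketch, assuming a hypothetical vector of ID numbers stored as text:

```r
library(stringr)

ids <- c("7", "42", "365")

# Pad on the left with zeroes until every string is 3 characters wide
padded <- str_pad(ids, width = 3, side = "left", pad = "0")
padded
# [1] "007" "042" "365"
```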
stringr::str_split(): splits a string into several strings at each occurrence of a specified character.
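As a sketch, here is a split on the dash character. The contents of the dates vector are an assumption made for illustration:

```r
library(stringr)

# Illustrative dates in year-month-day form
dates <- c("2020-01-15", "2021-05-04")

# Split each date wherever a "-" occurs
parts <- str_split(dates, "-")
parts
# [[1]]
# [1] "2020" "01"   "15"
#
# [[2]]
# [1] "2021" "05"   "04"
```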
This function returns a list with one element for each element of the original vector. A common need is to retrieve one item from each element of this list. For example, let’s say you are trying to retrieve the month from each element in the dates vector. The list structure makes this tricky.
Here’s the way to do it by drawing upon the apply() family of functions:
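One possible approach uses sapply(), a member of that family. The dates vector below is an assumption made for illustration, and the month is assumed to be the second piece of each date:

```r
library(stringr)

# Illustrative dates in year-month-day form
dates <- c("2020-01-15", "2021-05-04")
parts <- str_split(dates, "-")

# Pull the second piece (the month) out of each list element
months <- sapply(parts, function(x) x[2])
months
# [1] "01" "05"
```

sapply() applies the function to each list element and simplifies the result back into an ordinary character vector.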
as.character(): converts a non-character object into a character string.
This can be particularly useful when trying to resolve problems caused by factors. One common issue occurs when R mistakes a set of numbers for a factor. Using as.character() can set things right:
If you try to convert straight to numeric, it does not work:
So convert to character first:
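A minimal sketch of the problem and the fix, assuming a hypothetical factor named ages:

```r
# Numbers that R has stored as a factor (illustrative values)
ages <- factor(c("10", "20", "20", "35"))

# Converting straight to numeric returns the internal level codes,
# not the original numbers
wrong <- as.numeric(ages)
wrong
# [1] 1 2 2 3

# Converting to character first recovers the real values
right <- as.numeric(as.character(ages))
right
# [1] 10 20 20 35
```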
To practice these tools, we will play with the results of a recent survey. View the raw results here.
Read the survey into R as follows:
To make this spreadsheet easier to work with, let’s rename the columns. Currently, the columns are:
 "Timestamp"  "What is your sex?"  "How old are you (in years)?"  "How many siblings do you have?"  "Does or did your dad ever have a mustache during your life?"  "Have you ever had a mustache?"  "Joe's mustache is:"  "My eyesight is"  "How tall are you in centimeters?"  "What is your shoe size (US)?"  "What is your birthday?"  "What matters more: money or love?"  "How good do you think you are at rock-paper-scissors?"  "How many more pandemics do you think you'll see in your life?"  "What are cooler: cats or dogs?"  "first_name"  "last_name"
Rename them like so:
1. Create a new column named full_name that combines the first and last name of each respondent.
2. Simplify the sex column so that m (lowercase) stands for males, f (lowercase) stands for females, and p (lowercase) stands for ‘Prefer not to say’.
3. Modify the column money_or_love such that the first letter is always capitalized.
4. How many characters is each response in the column
5. How many responses in the column eyesight have 30 characters or more?
6. Modify the column money_or_love such that all responses are twenty characters or less.
7. Remove the ‘s’ from the responses in the column
8. In the column joe_mus_is, replace ‘Deeply captivating’ with just ‘captivating’.
9. How many respondents have the last name ‘Brew’?
10. How many respondents were born in May?
11. Filter the survey to only those respondents born in 2000.
12. Replace “both” in the money_or_love variable with “Money & Love”.
13. Get only the first character
14. How many total characters are in the column
15. How many characters did Joe Brew write for the
16. How many people in the data were born on the 4th day of the month?
17. Create a new variable called month_born that has only the month from the
18. Do the same thing for
19. Filter the data set to those who were born in 2001 and prefer money over love.