Module 39 Working with text

Learning goal

  • Learn to apply the most common R tools for working with text.


If you do not learn how to edit and transform text-based fields within datasets, you will quickly get stuck in R. Think of dates, GPS coordinates, user IDs, group names, plot labels, etc. All of these forms of data can contain non-numeric text. Becoming comfortable working with text in R is an essential part of your R toolbag.

Here we will present the most common functions for working with text. Remember that R has a special object class for text, known as the character class, and that character objects are often referred to as strings.

Most of these functions come pre-installed in R. However, several of the tools we will show here (as well as many other useful tools that we will not detail here) come from the stringr package. Go ahead and install stringr and load it using library().


Most common tools

paste() & paste0()

paste() and paste0() combine two or more strings together into a single object:

i <- 10
n <- 96
file_name <- "this_file.csv"

paste(i,"of",n,": Processing",file_name,". . . ")
[1] "10 of 96 : Processing this_file.csv . . . "

Notice that paste() assumes each object is separated by a blank space, " ". paste0() assumes no space between objects. Here’s the same input but with paste0() instead.

paste0(i,"of",n,": Processing",file_name,". . . ") 
[1] "10of96: Processingthis_file.csv. . . "

To replicate the original output with paste0(), you manually add blank spaces like this:

paste0(i," of ",n,": Processing ",file_name," . . . ") 
[1] "10 of 96: Processing this_file.csv . . . "

You can also use paste() to collapse the elements of a vector into a single string, using the collapse argument.

x <- 1:10
paste(x, collapse = "-")
[1] "1-2-3-4-5-6-7-8-9-10"
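
The difference between sep and collapse is easy to mix up, so here is a minimal sketch. The vectors plots and ids are made up for illustration. sep controls what goes between the arguments within each pasted string, while collapse joins the resulting strings into one.

```r
# Hypothetical example vectors (not from the survey data)
plots <- c("A", "B", "C")
ids <- 1:3

# sep joins the arguments element by element: the result has three strings
labels <- paste(plots, ids, sep = "_")
labels
# [1] "A_1" "B_2" "C_3"

# collapse then joins those three strings into a single string
one_string <- paste(labels, collapse = ", ")
one_string
# [1] "A_1, B_2, C_3"
```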

tolower() & toupper()

tolower() and toupper() force all text in a string to lower case or upper case, respectively:

x <- "That Tree Is Far Away."
tolower(x)
[1] "that tree is far away."
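
For completeness, here is a short sketch of toupper(), along with one situation where forcing a case helps: comparing text that was typed inconsistently. The answers vector is a made-up example.

```r
x <- "That Tree Is Far Away."
toupper(x)
# [1] "THAT TREE IS FAR AWAY."

# Hypothetical free-text answers with inconsistent capitalization
answers <- c("Yes", "YES", "yes", "No")

# Force a single case before comparing, so all spellings of "yes" match
answers_clean <- tolower(answers)
n_yes <- sum(answers_clean == "yes")
n_yes
# [1] 3
```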


nchar() returns the number of characters within a string:

x <- "That Tree Is Far Away."
nchar(x)
[1] 22


substr() trims a string according to a start and end character position:

dates <- c("2021-03-01","2021-03-02","2021-03-03")
substr(dates,1,4) # years
[1] "2021" "2021" "2021"
substr(dates,6,7) # months
[1] "03" "03" "03"
substr(dates,9,10) # days
[1] "01" "02" "03"
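
Note that substr() always returns character strings, even when the pieces look like numbers. A minimal sketch of turning the extracted days into actual numbers, and of using nchar() to check that every string has the width that the fixed positions assume:

```r
dates <- c("2021-03-01", "2021-03-02", "2021-03-03")

# Fixed positions only work if every string has the same layout,
# so check that all dates are exactly 10 characters wide
all(nchar(dates) == 10)
# [1] TRUE

# substr() returns text; convert it if you need to do arithmetic
days <- as.numeric(substr(dates, 9, 10))
days
# [1] 1 2 3
```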


grep() returns the positions (indices) of the elements in a character vector that contain a given pattern. You can then use those indices to subset the vector:

years <- 1900:1999

# Which elements correspond to the 1980s?
eighties <- grep("198",years)
years[eighties]
 [1] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
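
As a brief sketch of two close relatives: grep() with value = TRUE returns the matching elements themselves (as character strings), and grepl() returns TRUE or FALSE for every element, which is handy for counting or filtering.

```r
years <- 1900:1999

# Positions of the matching elements
idx <- grep("198", years)
idx
# [1] 81 82 83 84 85 86 87 88 89 90

# The matching elements themselves, returned as character strings
vals <- grep("198", years, value = TRUE)

# One TRUE/FALSE per element; summing counts the matches
is_80s <- grepl("198", years)
sum(is_80s)
# [1] 10
```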


gsub() replaces a given pattern with another in a character vector.

dates <- c("2021-03-01","2021-03-02","2021-03-03")
gsub("-","/",dates)
[1] "2021/03/01" "2021/03/02" "2021/03/03"
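
Two details are worth a quick sketch. First, sub() replaces only the first match in each string, while gsub() replaces all of them. Second, the pattern is treated as a regular expression by default, so characters such as "." have special meaning unless you set fixed = TRUE. The file_name value is reused from the paste() example purely for illustration.

```r
dates <- c("2021-03-01", "2021-03-02", "2021-03-03")

# sub() stops after the first match in each string
first_only <- sub("-", "/", dates)
first_only
# [1] "2021/03-01" "2021/03-02" "2021/03-03"

file_name <- "this_file.csv"

# As a regular expression, "." matches ANY character,
# so every character gets replaced
as_regex <- gsub(".", "_", file_name)

# fixed = TRUE treats the pattern as literal text
as_literal <- gsub(".", "_", file_name, fixed = TRUE)
as_literal
# [1] "this_file_csv"
```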


stringr::str_pad(): standardizes the lengths of strings by “padding” them (e.g., with zeroes):

days <- as.character(1:15)
days
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
stringr::str_pad(days, width = 2, side = "left", pad = "0")
 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15"
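
One reason padding matters is sorting: text is sorted character by character, so "10" comes before "2". The sketch below illustrates this. Note that it uses sprintf() as a base R stand-in for stringr::str_pad(), only so that it runs without any extra packages.

```r
days <- as.character(1:15)

# Unpadded text sorts character by character, not by numeric value
sort(days)[1:3]
# [1] "1"  "10" "11"

# Zero-padded to width 2 (sprintf() used here instead of str_pad())
padded <- sprintf("%02d", 1:15)

# Padded strings sort in the same order as the numbers they represent
identical(sort(padded), padded)
# [1] TRUE
```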


stringr::str_split(): split a string into several strings at the occurrence of a specified character.

dates <- c("2021-03-01","2021-03-02","2021-03-03")
splits <- stringr::str_split(dates,"-")
splits
[[1]]
[1] "2021" "03"   "01"  

[[2]]
[1] "2021" "03"   "02"  

[[3]]
[1] "2021" "03"   "03"  

This function returns a list with one element for each element in the original vector. A common need is to retrieve one item from each of the elements in this list. For example, let’s say you are trying to retrieve the month from each element in the dates vector. The list structure makes this tricky to retrieve.

Here’s the way to do it by drawing upon the apply() family of functions:

sapply(splits, "[[", 1) # years
[1] "2021" "2021" "2021"

sapply(splits, "[[", 2) # months
[1] "03" "03" "03"

sapply(splits, "[[", 3) # days
[1] "01" "02" "03"
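
Two variations on the same idea, offered as a sketch rather than the one right way. vapply() works like sapply() but makes you declare what each result should look like, so it fails loudly if a string did not split as expected. Alternatively, the pieces can be bound into a matrix with one column per part. To keep the sketch free of package dependencies, the list is built with base R's strsplit(), which here produces the same list as stringr::str_split().

```r
dates <- c("2021-03-01", "2021-03-02", "2021-03-03")

# Base R stand-in for stringr::str_split(dates, "-")
splits <- strsplit(dates, "-", fixed = TRUE)

# vapply(): like sapply(), but each result must be a single character string
months <- vapply(splits, `[[`, character(1), 2)
months
# [1] "03" "03" "03"

# Or stack the pieces into a matrix: one row per date, one column per part
parts <- do.call(rbind, splits)
colnames(parts) <- c("year", "month", "day")
parts[, "day"]
# [1] "01" "02" "03"
```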


as.character(): converts a non-character object into a character string.

x <- c(1,2,3)
as.character(x)
[1] "1" "2" "3"
x <- as.factor(c("group1","group2","group3"))
as.character(x)
[1] "group1" "group2" "group3"

This can be particularly useful when trying to resolve problems caused by factors. One common issue occurs when R mistakes a set of numbers for a factor. Using as.character() can set things right:

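x <- as.factor(c(18,19,20))
x
[1] 18 19 20
Levels: 18 19 20

If you try to convert straight to numeric, it does not work; you get the factor's internal level codes instead of the original values:

as.numeric(x)
[1] 1 2 3

So convert to character first:

as.numeric(as.character(x))
[1] 18 19 20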
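
To see why the direct conversion goes wrong, it can help to look at what a factor stores: a set of text labels (the levels) plus an integer code for each element. The sketch below uses a slightly longer made-up vector so that the codes and the values are easier to tell apart, and also shows an equivalent conversion that goes through levels().

```r
# Hypothetical ages stored as a factor, with one repeated value
x <- as.factor(c(18, 19, 20, 19))

levels(x)      # the labels, stored as text
# [1] "18" "19" "20"

as.numeric(x)  # the internal codes, not the ages
# [1] 1 2 3 2

# Converting via character recovers the original values
via_char <- as.numeric(as.character(x))
via_char
# [1] 18 19 20 19

# Equivalent: convert the levels once, then index them by the codes
via_levels <- as.numeric(levels(x))[x]
```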


To practice these tools, we will play with the results of a recent survey. View the raw results here.

Read the survey into R as follows:

library(gsheet) # provides gsheet2tbl()
survey <- gsheet2tbl('')

To make this spreadsheet easier to work with, let’s rename the columns. Currently, the columns are:

 [1] "Timestamp"                                                    
 [2] "What is your sex?"                                            
 [3] "How old are you (in years)?"                                  
 [4] "How many siblings do you have?"                               
 [5] "Does or did your dad ever have a mustache during your life?"  
 [6] "Have you ever had a mustache?"                                
 [7] "Joe's mustache is:"                                           
 [8] "My eyesight is"                                               
 [9] "How tall are you in centimeters?"                             
[10] "What is your shoe size (US)?"                                 
[11] "What is your birthday?"                                       
[12] "What matters more: money or love?"                            
[13] "How good do you think you are at rock-paper-scissors?"        
[14] "How many more pandemics do you think you'll see in your life?"
[15] "What are cooler: cats or dogs?"                               
[16] "first_name"                                                   
[17] "last_name"                                                    

Rename them like so:

names(survey) <- c('time', 'sex', 'age','sib', 'dad_mus', 
                   'person_mus', 'joe_mus_is', 'eyesight', 
                   'height', 'shoe_size', 'bday', 'money_or_love', 
                   'rps_skill', 'num_pan', 'cats_dogs', 
                   'first_name', 'last_name')

1. Create a new column named full_name that combines the first and last name of each respondent.

2. Simplify the sex column so that m (lowercase) stands for males, f (lowercase) stands for females, and p (lowercase) stands for ‘Prefer not to say’.

3. Modify the column money_or_love such that the first letter is always capitalized.

4. How many characters is each response in the column eyesight?

5. How many responses in the column eyesight have 30 characters or more?

6. Modify the column money_or_love such that all responses are twenty characters or fewer.

7. Remove the ‘s’ from the responses in the column cats_dogs.

8. In the column joe_mus_is, replace ‘Deeply captivating’ with just ‘captivating.’

9. How many respondents have the last name ‘Brew’?

10. How many respondents were born in May?

11. Filter the survey only to respondents born in 2000.

12. Replace “both” in the money_or_love variable with “Money & Love”.

13. Get only the first character of the dad_mus variable.

14. How many total characters are in the column eyesight?

15. How many characters did Joe Brew write for the eyesight question?

16. How many people in the data were born on the 4th day of the month?

17. Create a new variable called month_born that has only the month from the bday variable.

18. Do the same thing for year.

19. Filter the data set to those who were born in 2001 and prefer money over love.