# Module 41 Writing functions

#### Learning goals

• Be able to write your own functions
• Be able to use functions to make your work more efficient, effective, and organized

## First steps

You’ve already used dozens of functions during your learning in `R` so far. As you start applying `R` to your own projects, you will inevitably encounter a puzzle that could be solved by a custom function you write yourself. This module shows you how.

As explained in the Calling Functions module, most functions have three key components:

1. one or more inputs,
2. a process that is applied to those inputs, and
3. an output of the result.

When you define your own custom function, these are the three pieces you must be sure to include.

Here is a basic example:

``````
my_function <- function(x){
  y <- 1.3*x + 10
  return(y)
}
``````

``````
my_function(x=2) # example 1
[1] 12.6
my_function(x=4) # example 2
[1] 15.2
``````

Let’s break this down.

• `my_function` is the name you are giving your function. It is the command you will use to call your function.
• The `function()` command is what you use to define a function.
• `x` is the variable you are using to represent your input.
• `y <- 1.3*x + 10` is the process that you are applying to your input.
• `return(y)` is the command you use to define what the function’s output will be.

Note that you are not required to write out `x=2` in full when you are calling your function. Just providing `2` can also work:

``````
my_function(2)
[1] 12.6
``````
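As one more illustration of the three components, here is a minimal sketch of a second function. The name `fahrenheit_to_celsius` and the conversion it performs are illustrative choices, not part of this module's own examples.

```r
# Hypothetical example: convert a temperature from Fahrenheit to Celsius
fahrenheit_to_celsius <- function(f){
  celsius <- (f - 32) * 5 / 9   # the process applied to the input
  return(celsius)               # the output
}

fahrenheit_to_celsius(f = 212)  # boiling point of water
fahrenheit_to_celsius(32)       # freezing point; naming the input is optional
```

The two calls should return 100 and 0, respectively.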

### Exercise 1

Define your own basic function and run it to make sure it works.

## Next steps

### Multiple inputs

You can define a function with multiple inputs. Just separate each input with a comma.

To demonstrate this, let’s modify the function above to allow you to define any straight-line (linear) equation you wish, with slope `a` and intercept `b`:

``````
my_function <- function(x,a,b){
  y <- a*x + b
  return(y)
}
``````

``````
my_function(x=2,a=1.3,b=10) # example 1
[1] 12.6
my_function(x=4,a=5,b=100) # example 2
[1] 120
``````

Note that you do not need to write out the name of each input, as long as you provide inputs in the correct order.

``````
my_function(2, 1.3, 10) # example 1
[1] 12.6
my_function(4, 5, 100) # example 2
[1] 120
``````

But note that it is usually best practice to name each input in your function call, to prevent the possibility of any confusion or mistakes. Also, when you name each input you can provide inputs in whatever order you wish:

``````
my_function(x=2, a=1.3, b=10)
[1] 12.6
my_function(a=1.3, b=10, x=2) # different input order, same output value
[1] 12.6
``````
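To make the risk of unnamed inputs concrete, here is a small sketch that reuses the three-input `my_function` from above. The object names `intended`, `swapped`, and `named` are only for illustration.

```r
my_function <- function(x, a, b){
  y <- a*x + b
  return(y)
}

# Intended call: x = 2, a = 1.3, b = 10
intended <- my_function(2, 1.3, 10)

# Slope and intercept accidentally swapped: R reads this as a = 10, b = 1.3
swapped <- my_function(2, 10, 1.3)

# Naming the inputs makes their order irrelevant
named <- my_function(b = 10, a = 1.3, x = 2)

intended  # 1.3*2 + 10
swapped   # 10*2 + 1.3
named     # 1.3*2 + 10
```

R raises no error for the swapped call; it simply returns a different number (21.3 rather than 12.6), which is why naming inputs is the safer habit.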

### Providing defaults for inputs

Just as `R`’s base functions include default values for some inputs (think `na.rm=FALSE` for `mean()` and `sd()`), you can define defaults in your own functions.

This version of `my_function` includes default values for inputs `a` and `b`.

``````
my_function <- function(x,a=1.3,b=10){
  y <- a*x + b
  return(y)
}
``````

When you provide default values, you no longer need to specify those inputs in your function call:

``````
my_function(x=2)
[1] 12.6
``````
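Defaults are only fallbacks: any input you name in the call replaces its default. A minimal sketch, using the same `my_function` definition with defaults as above:

```r
my_function <- function(x, a = 1.3, b = 10){
  y <- a*x + b
  return(y)
}

my_function(x = 2)                # uses both defaults: 1.3*2 + 10
my_function(x = 2, b = 0)         # overrides b only:   1.3*2 + 0
my_function(x = 2, a = 5, b = 1)  # overrides both:     5*2 + 1
```

These calls should return 12.6, 2.6, and 11.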

Plots can be included in the function commands just as in any other context:

``````
my_function <- function(x,a=1.3,b=10){
  y <- a*x + b
  plot(y ~ x, type="b")
  return(y)
}
``````

``````
my_input <- 1:20
my_function(x=my_input)
``````

``````
 [1] 11.3 12.6 13.9 15.2 16.5 17.8 19.1 20.4 21.7 23.0 24.3 25.6 26.9 28.2 29.5
[16] 30.8 32.1 33.4 34.7 36.0
``````

Adding plots to functions can be super useful if you want to make multiple plots with the same formatting specifications. Rather than retyping the same long plot commands multiple times, just write a single function and call the function as many times as you wish.

Let’s add some fancy formatting to our plot. Note that we will modify the name of the function to make it more descriptive and helpful. The `lm` in `plot_my_lm` stands for linear model, which is what is being defined with the `y=ax+b` equation.

``````
plot_my_lm <- function(x,a=1.3,b=10,plot_only=TRUE){

  # Process
  y <- a*x + b

  # Plot
  par(mar=c(4.2,4.2,3,.5)) # set plot margins
  plot(y ~ x, type="o", axes=FALSE, ann=FALSE, pch=16, col="firebrick",
       xlim=c(-20,20), ylim=c(-20,20)) # define basic plot
  title(main=paste("y =",a,"x +",b)) # print a dynamic main title
  title(xlab="x",ylab="y")  # print axis labels
  axis(1) # print the X axis
  axis(2,las=2) # print the Y axis and turn its labels right-side-up
  abline(h=0,v=0,col="grey70") # add grey lines indicating x=0 and y=0

  # Return
  if(plot_only==FALSE){
    return(y)
  }
}
``````

Note that we added a parameter, `plot_only`. When it is set to `TRUE` (the default), the function draws the plot but does not return any numbers. Set it to `FALSE` to get the `y` values back as well.
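To see the effect of `plot_only`, consider this minimal sketch. It uses a stripped-down stand-in for the function above (the name `plot_line` and the objects `quiet` and `values` are hypothetical) so that the return behavior is easy to inspect.

```r
# Hypothetical, simplified stand-in for plot_my_lm()
plot_line <- function(x, a = 1.3, b = 10, plot_only = TRUE){
  y <- a*x + b
  plot(y ~ x, type = "o")
  if(plot_only == FALSE){
    return(y)        # hand the numbers back only when asked
  }
  invisible(NULL)    # otherwise return nothing visible
}

quiet  <- plot_line(x = 1:3)                     # plot only
values <- plot_line(x = 1:3, plot_only = FALSE)  # plot and numbers

is.null(quiet)  # TRUE: nothing was returned
values          # the three y values
```

Here `values` should hold 11.3, 12.6, and 13.9, while `quiet` is `NULL`.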

Now let’s call this fancy function a bunch of times:

``````
my_input <- -20:20 # define a common x input value

par(mfrow=c(3,2)) # stage a multi-paned plot
plot_my_lm(x=my_input,a=2,b=15)
plot_my_lm(x=my_input,a=1,b=10)
plot_my_lm(x=my_input,a=.5,b=5)
plot_my_lm(x=my_input,a=0,b=0)
plot_my_lm(x=my_input,a=-1,b=-5)
plot_my_lm(x=my_input,a=-2,b=15)
``````

Think about how many lines of code would have been needed to write out all of these fancy plots if you did not use a custom function! Think about how cluttered and dizzying your code would look! And think about how many opportunities for errors and inconsistencies there would have been! That is the advantage of writing your own functions: it makes your work more efficient, more organized, and less prone to errors.

Another major advantage of this approach comes into play when you decide you want to tweak the formatting of your plot. Rather than going through each `plot(...)` command and modifying the inputs in each one, when you write a custom plotting function you just have to make those changes once. Again, using a custom function saves you time and removes the possibility of inconsistencies or mistakes in the plots you are creating.

### Exercises

Modify the most recent version of `plot_my_lm` above such that you can specify the color for the plotted line as an input in the function. Then reproduce the multi-paned plot using a different color in each plot. (R’s built-in `colors()` function lists the color names you can use.)

## Sourcing functions

As you advance in your coding, you will likely be writing multiple custom functions within a single `R` script. It is usually useful to group these functions into the same section of code near the top of your script.

But for even better script organization and simplification, you should source your functions from a separate `R` script. This means placing your function code in a separate `R` script and calling that file from the script in which you are carrying out your analyses. In addition to simplifying your analysis script, keeping your functions in a separate file allows them to be shared or sourced from any number of other scripts, which further organizes and simplifies your project’s code and increases the reproducibility of your work.

Here is how sourcing functions can work:

1. Open a new `R` script. Save it as `functions.R` in the same working directory as the script you are using to work through this module.

2. Copy and paste the `plot_my_lm()` function into your `functions.R` script. Save that script to ensure your code is safe.

3. Now remove the code defining `plot_my_lm()` from your module `R` script.

4. In its place, type this command:

``source("functions.R")``

This command tells `R` to run the code in `functions.R` and store the objects and outputs from it in its active memory. You can now call `plot_my_lm()` from your module script.
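The following self-contained sketch mimics those steps without requiring you to create files by hand. It writes a tiny functions file to a temporary folder and then sources it. The file location and the function `add_ten` are assumptions made for the demonstration; in your own project you would simply call `source("functions.R")`.

```r
# Write a small functions file to a temporary location
# (a stand-in for the functions.R script you would normally save yourself)
functions_file <- file.path(tempdir(), "functions.R")
writeLines(c(
  "add_ten <- function(x){",
  "  return(x + 10)",
  "}"
), functions_file)

# Run the code in that file; add_ten() now exists in the current session
source(functions_file)

add_ten(5)
```

After sourcing, `add_ten(5)` should return 15, even though the function was never defined in the script that calls it.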

### Review exercises

Carry out the above instructions to ensure that you know how to source a function from a separate `R` script.

#### Exercises with baby names

In this exercise, you will investigate annual trends in the prevalence of six names for babies born in the United States.

1. Decide upon five names of interest to you, in addition to your own. Create a vector of these six names.

2. Install and load the package `babynames`, which records how many babies were given each name in the United States in each year from 1880 to 2017, according to the Social Security Administration.

3. Make an object named `bn` like this: `bn <- babynames::babynames`.

4. Write a function that takes any name and plots its proportional prevalence from 1880 to 2017. Format the plot beautifully. Provide the name as the title of the plot.

5. Modify your function so that it takes multiple names (instead of just one) and generates a multi-pane plot, one for each of the names passed to the function. Then, test that function on the object you created in number one.

6. Create a function called `first_letter`. This should take any character vector as an argument and return only the first letter of each element. You will need to use the `substr` function.

7. Use your `first_letter` function to make a new variable in the babynames dataset called `fl`. This should be the first letter of all names.

8. What was the most popular first letter of boys’ names in 1900?

9. What was the least popular first letter of girls’ names in 2017?

10. Make a function called `letter_plot`. This should take two arguments, `letter` and `m_or_f`, and then create a plot showing how popular that first letter has been over time for the given sex.

11. Make a function called `letter_compare`. This should take two arguments: `y` (the year) and `gender` (sex). This should make a plot of the popularity of each letter being used as the first letter for a name, for the sex in question, for the year provided.

#### Exercises with trees data

12. Define an object named `trees` using the built-in `trees` dataset: `trees <- trees` (weird, right?)

13. How many rows are there?

14. How many columns are there?

15. “Girth” is the same thing as “circumference”. Make a function named `girth_to_diameter`. It should do exactly what it says.

16. Create a new variable called `diameter`. Use your new function to populate it.

17. Create a scatterplot showing the association between `diameter` and `Volume`.

18. Create a histogram of `Height`.

19. Create a function named `diameter_to_area`. It should do what it says it does. Create a new variable named `area` using this function. This should be the area of a cross-sectional cut of the tree.

20. Create a dataset named `oranges` by reading in the built-in `Orange` dataset like this: `oranges <- Orange`.

21. Create a plot showing circumference as a function of age, faceted by tree number.

22. Create a function called `plot_tree`. This should take only one argument, `tree_number`, and generate a plot of that tree’s growth over time.

23. Create a function called `circumference_to_radius`. It should do what it says. Use it to create a new variable in the `oranges` data named `radius`.

24. Create a function called `double_it`. It should double its input. Use it on `radius` to create a variable named `diameter`.

25. Assume that the measurements of the trees are in inches, and that the age of the trees is in days. Create (a) a function for converting inches to centimeters and (b) a function for converting days to weeks. Create new variables in the data, using these functions, named `circumference_cm` and `age_weeks`.

26. Plot the association between age in weeks and circumference in centimeters. Facet by tree number.

27. Do trees get bigger as they get older?

#### Exercises with `dplyr` and `ggplot`

``````
library(gapminder)
library(dplyr)
library(ggplot2)
gm <- gapminder
``````

Previously we analyzed and explored the dataset by using dplyr and ggplot. A lot of our analysis used the same code, just applied to different variables and aspects of the data.

``````
# plot the GDP per capita for China over time
china_gdp <- gm %>% filter(country == 'China')
ggplot(china_gdp, aes(year, gdpPercap)) +
  geom_line() +
  labs(x = 'Year', y = 'GDP per capita', title = "China GDP per capita over time")

# now India
india_gdp <- gm %>% filter(country == 'India')
ggplot(india_gdp, aes(year, gdpPercap)) +
  geom_line() +
  labs(x = 'Year', y = 'GDP per capita', title = "India GDP per capita over time")
``````

If we want to do this for every country, we will be reusing the same code over and over again. Let’s write a function:

``````
plot_gdp <- function(country_name){
  plot_data <- gm %>% filter(country == country_name)
  ggplot(plot_data, aes(year, gdpPercap)) +
    geom_line() +
    labs(x = 'Year', y = 'GDP per capita', title = paste0(country_name, ' GDP per capita over time'))
}
plot_gdp(country_name = 'China')
plot_gdp(country_name = 'India')
plot_gdp(country_name = 'Angola')
``````

Let’s take it a step further and add a plotting variable:

``````
plot_gdp <- function(country_name, plot_var){
  plot_data <- gm %>% filter(country == country_name)
  ggplot(plot_data, aes_string('year', plot_var)) +
    geom_line() +
    labs(x = 'Year', y = plot_var, title = paste0(country_name,' ', plot_var ,' over time'))
}
plot_gdp(country_name = 'China', plot_var = 'pop')
plot_gdp(country_name = 'India', plot_var = 'lifeExp')
plot_gdp(country_name = 'Angola',plot_var = 'gdpPercap')
``````
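In `ggplot2` version 3.0.0 and later, `aes_string()` is marked as deprecated, and the documented alternative is to index the `.data` pronoun with a string inside `aes()`. The sketch below shows that pattern on a small made-up data frame, so it runs without `gapminder`. The names `toy` and `plot_toy`, and the numbers in `toy`, are hypothetical and purely illustrative.

```r
library(ggplot2)

# Made-up data standing in for one country's rows (values are illustrative only)
toy <- data.frame(
  year    = c(2000, 2005, 2010),
  pop     = c(10, 12, 15),
  lifeExp = c(60, 62, 65)
)

# plot_var is a column name given as a string, as in plot_gdp() above
plot_toy <- function(plot_var){
  ggplot(toy, aes(year, .data[[plot_var]])) +
    geom_line() +
    labs(x = 'Year', y = plot_var, title = paste0(plot_var, ' over time'))
}

p <- plot_toy(plot_var = 'pop')
p
```

The function is called exactly as before, with the column name in quotes; only the way the string is handed to `aes()` changes.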

Now keep going:

28. Create a function that filters by continent and year and creates a bar plot of the population of every country in that continent.

29. Add a color argument to the function, called `color_bars`, that fills the bars with that color.

30. Add a title argument to the function, called `plot_title`, that combines the name and year into the title of the plot.

31. Add another argument that specifies the numerical variable to plot on the y axis (up until now it was just population). Hint: use `aes_string()`.

32. Add another argument called `plot_type` with a default value of `"bar"`. Use conditional logic (`if` and `else` statements) to create a bar chart if `plot_type == "bar"`, and a point plot otherwise.

33. Create your own function that filters the data in some way and makes a plot. The function should have at least 5 arguments.