# Module 16 Base plots

#### Learning goals

• Make basic plots in `R`
• Basic adjustments to plot formatting

To learn how to plot, let’s first create a dataset to work with:

``````country <- c("USA","Tanzania","Japan","Ctr. Africa Rep.","China","Norway","India")
lifespan <- c(79,65,84,53,77,82,69)
gdp <- c(55335,2875,38674,623,13102,84500,6807)``````

These data come from this publicly available database that compares health and economic indices across countries in 2011.

The `lifespan` column presents the average life expectancy for each country.

The `gdp` column presents the average GDP per capita within that country, which is a common index for the income and wealth of average citizens.

Let’s see if there is a relationship between life expectancy and income.

## Create a basic plot

The simplest way to make a basic plot in `R` is to use its built-in `plot()` function:

``plot(lifespan ~ gdp)`` This syntax is saying this: plot column `lifespan` as a function of `gdp`. The symbol `~` denotes “as a function of”. This frames `lifespan` as a dependent variable (y axis) that is affected by the independent variable (x axis), which in this case is `gdp`.

Note that `R` uses the variable names you provided as the x- and y-axes. You can adjust these labels however you wish (see formatting section below).

You can also produce this exact same plot using the following syntax:

``plot(y=lifespan, x=gdp)``

Choose whichever one is most intuitive to you.

## Most common types of plots

The plot above is a scatter plot, and is one of the most common types of plots in data science.

You can turn this into a line plot by adding a parameter to the `plot()` function:

``plot(lifespan ~ gdp, type="l")`` What a mess! Rather than connecting these values in the order you might expect, `R` connects them in the order that they are listed in their source vectors. This is why line plots tend to be more useful in scenarios such as time series, which are inherently ordered.

Another common plot is the bar plot, which uses a different `R` function:

``barplot(height=lifespan,names.arg=country)`` In this command, the parameter `height` determines the height of the bars, and `names.arg` provides the labels to place beneath each bar.

There are many more plot types out there, but let’s stop here for now.

## Basic plot formatting

You can adjust the default formatting of plots by adding other inputs to your `plot()` command. To understand all the parameters you can adjust, bring up the help page for this function:

``?plot``

If multiple help page options are returned, select the Generiz X-Y Plotting page from the `base` package. This is the plot function that comes built-in to `R`.

Here we demonstrate just a few of the most common formatting adjustments you are likely to use:

Set plot range using `xlim` (for the x axis) and `ylim` (for the y axis):

``plot(lifespan ~ gdp,xlim=c(0,60000),ylim=c(40,100))`` In this command, you are defining axis limits using a 2-element vector (i.e., `c(min,max)`).

Note that it can be easier to read your code if you put each input on a new line, like this:

``````plot(lifespan ~ gdp,
xlim=c(0,60000),
ylim=c(40,100))``````

Make sure each input line within the function ends with a comma, otherwise you `R` will get confused and throw an error.

Set dot type using the input `pch`:

``plot(lifespan ~ gdp,pch=16)`` Set dot color using the input `col` (the default is `col="black"`)

``plot(lifespan ~ gdp,col="red")`` Here is a great resource for color names in `R`.

Set dot size using the input `cex` (the default is `cex=1`):

``plot(lifespan ~ gdp,cex=3)`` Set axis labels using the inputs `xlab` and `ylab`:

``plot(lifespan ~ gdp, xlab="Gross Domestic Product (GDP)",ylab="Life Expectancy")`` Set axis number size using the input `cex.axis` (the default is `cex.axis=1`):

``plot(lifespan ~ gdp,cex.axis=.5)`` Set axis label size using the input `cex.label` (the default is `cex.lab=1`):

``plot(lifespan ~ gdp,cex.lab=.5)`` Set plot margins using the function `par(mar=c())` before you call `plot()`:

``````par(mar=c(5,5,0.5,0.5))
plot(lifespan ~ gdp)`````` In this command, the four numbers in the vector used to define `mar` correspond to the margin for the bottom, left, top, and right sides of the plot, respectively.

Create a multi-pane plot using the function `par(mfrow=c())` before you call `plot()`:

``````par(mfrow=c(1,2))
plot(lifespan ~ gdp,col="red",pch=16)
plot(lifespan ~ gdp,col="blue",pch=16)`````` In this command, the two numbers in the vector used to define `mfrow` correspond to the number of rows and columns, respectively, on the entire plot. In this case, you have 1 row of plots with two columns.

Note that you will need to reset the number of panes when you are done with your multi-pane plot!

``par(mfrow=c(1,1))``

Plot dots and lines at once using the input `type`:

``````par(mfrow=c(1,2))
plot(lifespan ~ gdp,type="b")
plot(lifespan ~ gdp,type="o")`````` ``par(mfrow=c(1,1))``

Note the two slightly different formats here.

Use a logarithmic scale for one or of your axes using the input `log`

``plot(lifespan ~ gdp,log="x")`` ## Plotting with data frames

So far in this tutorial we have been using vectors to produce plots. This is nice for learning, but does not represent the real world very well. You will almost always be producing plots using dataframes.

Let’s turn these vectors into a dataframe:

``````df <- data.frame(country,lifespan,gdp)
df
country lifespan   gdp
1              USA       79 55335
2         Tanzania       65  2875
3            Japan       84 38674
4 Ctr. Africa Rep.       53   623
5            China       77 13102
6           Norway       82 84500
7            India       69  6807``````

To plot data within a dataframe, your `plot()` syntax changes slightly:

``plot(lifespan ~ gdp, data=df)`` This syntax is saying this: using the dataframe named `df` as a source, plot column `lifespan` as a function of column `gdp`. The symbol `~` denotes “as a function of”. This frames `lifespan` as a dependent variable (y axis) that is affected by the independent variable (x axis), which in this case is `gdp`.

Another way to write this command is as follows:

``plot(df\$lifespan ~ df\$gdp)``

In this command, as you learned in the dataframes module, the `\$` symbol is saying, “give me the column in `df` named `lifespan`”. It is a handy way of referring to a column within a dataframe by name.

#### Exercises

The economics of health

1. Using the data above, produce a bar plot that shows the GDP for each country.

2. Use the `df` dataframe you built above to produce a bar plot that shows life expectancy for each country.

3. Use the `df` dataframe to produce a jumbled line plot of life expectancy as a function of GDP. Reference the `plot()` documentation to figure out how to change the thickness of the line.

Shoe ~ height correlation

4. Create a vector of the names of 5 people who are sitting near you. Call it `people`.

5. Create a vector of those same 5 people’s shoe sizes (in the same order!). Call it `shoe_size`.

6. Create another vector of those same 5 people’s height. Call it `height`.

7. Create another vector of those same 5 people’s sex. Call it `sex`.

8. Make a scatterplot of height and shoe size. Is there a correlation?

9. Look up help for `boxplot`. Make a `boxplot`.

10. Make a dataframe named `df`. It should contain columns of all the vectors created so far.

11. Make a side by side `boxplot` with shoe sizes of males vs females.

12. Try to find documentation for making a “histogram” (hint: use text autocomplete).

13. Make a histogram of people’s height.