Module 44 Matrices & lists

Learning goals

  • What R lists are, how to work with them, and when they are useful
  • What R matrices are, how to work with them, and when they are useful

 

As a data scientist, you will work with dataframes most of the time. But there are occasions when you will need other complex data structures – lists and matrices – to get a job done.

Here we show you what these data structures are like, how to work with them, and when to use them.

Lists

Think of lists as complicated vectors. A vector is a set of single values that all share one type; a list is a set of objects of any kind, including vectors, dataframes, and even other lists. In fact, a better analogy may be that lists are like shopping carts. You can put a lot of different things in there.

To see what we mean, stage an empty list:

x <- list()

Now add a simple vector to it:

x$vector <- 1:10

Now add a dataframe to it.

x$dataframe <- data.frame(name=c("Ben","Joe","Eric"),
                 height.inches=c(75,73,80))

Now add a new list to it.

x$list <- list()

Now add a vector to that new list:

x$list$vector <- 10:20

Okay! Nice shopping spree. Let’s see what we have:

x
$vector
 [1]  1  2  3  4  5  6  7  8  9 10

$dataframe
  name height.inches
1  Ben            75
2  Joe            73
3 Eric            80

$list
$list$vector
 [1] 10 11 12 13 14 15 16 17 18 19 20

This is a list: a lot of complicated data structures, all contained in a single variable.

As you saw above, you can access the items in your list using the same dollar sign, $, that we use to access columns in a dataframe:

x$vector
 [1]  1  2  3  4  5  6  7  8  9 10

You can do the same with lists within your list:

x$list$vector
 [1] 10 11 12 13 14 15 16 17 18 19 20

Alternatively, you can subset lists using double brackets, [[ ]].

x[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10
x[[3]]
$vector
 [1] 10 11 12 13 14 15 16 17 18 19 20
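Single brackets also work on lists, but they behave differently from double brackets. The sketch below uses a small stand-in list (its element names are only for illustration) to show the difference: double brackets pull out the element itself, while single brackets return a shorter list that still wraps the element.

```r
# A small stand-in list (names are illustrative)
x <- list(vector = 1:10,
          letters = c("a", "b", "c"))

# Double brackets return the element itself
inner <- x[[1]]
class(inner)    # "integer"

# Single brackets return a smaller list that still wraps the element
wrapped <- x[1]
class(wrapped)  # "list"

# Double brackets also accept a name, just like $
by_name <- x[["vector"]]
identical(inner, by_name)  # TRUE
```

In practice, use double brackets (or $) when you want to work with the contents, and single brackets when you want to keep a subset of the list as a list.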

Finally, you can create a list from scratch like so:

y <- list("vector1"=1:10,
          "vector2"=11:20)
y
$vector1
 [1]  1  2  3  4  5  6  7  8  9 10

$vector2
 [1] 11 12 13 14 15 16 17 18 19 20

Use cases for lists

Common use cases for lists include:

  • Keeping track of a bunch of related dataframes (i.e., by storing each dataframe as an element in a list).

  • Handling the output of R functions that return lists. For example, when you split up a vector of strings using stringr::str_split(), the output is a list that is the same length as your original vector.

  • Returning complex outputs from your own custom functions. Since functions can return only a single object, you can stuff a bunch of different objects into a list and return the list.
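The last two use cases can be sketched in a few lines. To keep the example free of extra packages, it uses base R's strsplit() in place of stringr::str_split(); both return a list with one element per input string. The helper summarize_heights() is hypothetical, written only to illustrate returning several objects at once.

```r
# strsplit() returns a list with one element per input string
fruits <- c("apple,banana", "cherry")
pieces <- strsplit(fruits, split = ",")
length(pieces)  # 2, the same length as fruits
pieces[[1]]     # "apple" "banana"

# A hypothetical helper that bundles several results into one list
summarize_heights <- function(heights) {
  list(n = length(heights),
       mean = mean(heights),
       range = range(heights))
}

result <- summarize_heights(c(75, 73, 80))
result$mean     # 76
result$range    # 73 80
```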

Matrices

Matrices are like dataframes, except that they can only contain a single data type. Dataframes can have a column with text and another with numbers, but a matrix will only handle one type.

Here’s a simple dataframe containing two classes of data:

df <- data.frame(name=c("Ben","Joe","Eric"),
                 height.inches=c(75,73,80))
df
  name height.inches
1  Ben            75
2  Joe            73
3 Eric            80

When you coerce this dataframe into a matrix (using the function as.matrix()), the numeric data get coerced into text:

mdf <- as.matrix(df)
mdf
     name   height.inches
[1,] "Ben"  "75"         
[2,] "Joe"  "73"         
[3,] "Eric" "80"         

Other than that, you can treat a variable of class matrix similarly to a dataframe.

Subsetting is the same: matrix[rows,columns]

mdf[2,]
         name height.inches 
        "Joe"          "73" 
mdf[,2]
[1] "75" "73" "80"
mdf[2,2]
height.inches 
         "73" 

To build a matrix from scratch, use the matrix() function.

mx <- matrix(data=1:12, 
             nrow=4,
             ncol=3)
mx
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

The data input takes a vector of data and sorts it into a matrix with rows and columns. It starts “laying down” your data in the first column, then wraps to the second column, etc.
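If you would rather fill the matrix one row at a time, matrix() has a byrow input for that. A minimal sketch comparing the two fill orders (the names by_col and by_row are only for illustration):

```r
# Default: fill column by column
by_col <- matrix(data = 1:12, nrow = 4, ncol = 3)

# byrow = TRUE: fill row by row instead
by_row <- matrix(data = 1:12, nrow = 4, ncol = 3, byrow = TRUE)

by_col[1, ]  # 1 5 9
by_row[1, ]  # 1 2 3
```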

You can also define names for the rows and columns in a matrix:

mx <- matrix(data=1:12, 
             nrow=4,
             ncol=3,
             dimnames=list(c("row1","row2","row3","row4"),
                           c("col1","col2","col3")))
mx
     col1 col2 col3
row1    1    5    9
row2    2    6   10
row3    3    7   11
row4    4    8   12

The dimnames input takes a list with two vectors: the first contains row names, the second contains column names.

Note, however, that the $ operator does not work on a matrix. mx$col1, for example, will throw an error. To subset by name, put the name inside the brackets instead, as in mx[,"col1"].
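Bracket indexing does accept names, even though $ does not. A minimal sketch of name-based subsetting, rebuilding the named matrix so the example runs on its own (col1 and cell are illustrative variable names):

```r
mx <- matrix(data = 1:12, nrow = 4, ncol = 3,
             dimnames = list(c("row1", "row2", "row3", "row4"),
                             c("col1", "col2", "col3")))

# Brackets accept names in either position
col1 <- mx[, "col1"]        # named vector holding 1 2 3 4
cell <- mx["row2", "col3"]  # 10

# rownames() and colnames() retrieve the names themselves
colnames(mx)                # "col1" "col2" "col3"
```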

One more tool worth knowing for matrices is the function diag(), which returns the values that fall along the matrix’s diagonal ([1,1], [2,2], [3,3], etc.).

mx
     col1 col2 col3
row1    1    5    9
row2    2    6   10
row3    3    7   11
row4    4    8   12

diag(mx)
[1]  1  6 11

The diag() function comes in handy in many use cases for matrices in R (see next section).
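Two more behaviors of diag() are worth a quick look. The sketch below, which uses small made-up matrices, shows that diag() stops at the shorter dimension of a non-square matrix (which is why the 4 x 3 matrix above returned only three values), and that diag() given a single number builds an identity matrix, a common ingredient in matrix algebra.

```r
# For a non-square matrix, diag() stops at the shorter dimension
wide <- matrix(data = 1:6, nrow = 2, ncol = 3)
diag(wide)         # 1 4

# Given a single number, diag() builds an identity matrix of that size
id <- diag(2)

# %*% is matrix multiplication; multiplying by the identity changes nothing
m <- matrix(data = 1:4, nrow = 2, ncol = 2)
product <- m %*% id
all(product == m)  # TRUE
```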

Use cases for matrices

Common use cases for matrices include:

  • Matrix algebra applications (duh), such as life history tables in biology.

  • Using certain packages whose inputs require matrix objects. Matrices are particularly common in analyses of social networks.

  • Producing images (after all, an image is just a matrix in which each value is a pixel color).

To practice the last use case, let’s build up a simple matrix using a random number generator:

mx <- matrix(data=round(rnorm(100,50,10)),
             nrow=10,
             ncol=10)
mx
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   64   32   50   52   41   37   60   42   40    46
 [2,]   62   35   61   51   42   41   34   40   46    55
 [3,]   43   37   53   50   55   48   42   52   54    51
 [4,]   18   32   72   31   42   50   52   38   50    36
 [5,]   58   59   47   55   47   48   49   37   54    63
 [6,]   71   42   47   53   52   30   48   52   45    41
 [7,]   60   49   47   44   55   53   36   54   57    51
 [8,]   26   38   50   46   41   43   56   45   53    67
 [9,]   55   63   43   46   63   56   54   32   65    45
[10,]   64   41   75   56   52   57   42   44   53    59
heatmap(mx,Rowv=NA,Colv=NA)

To see a real-world example of a matrix in action, here is a dataset containing rates of social associations, scaled between 0 and 0.5, among humpback whales in the fjords of British Columbia, Canada: humpback-sociality.rds.

sociality <- readRDS("./data/humpback-sociality.rds")

Let’s look at the first 5 rows and columns of this dataset:

sociality[1:5,1:5]
           id1        id2        id3        id4        id5
id1 0.50000000 0.02758621 0.00000000 0.00000000 0.00000000
id2 0.02758621 0.50000000 0.02424242 0.00000000 0.02439024
id3 0.00000000 0.02424242 0.50000000 0.00000000 0.00000000
id4 0.00000000 0.00000000 0.00000000 0.50000000 0.02631579
id5 0.00000000 0.02439024 0.00000000 0.02631579 0.50000000

Notice that this matrix is square. That is, the number of rows equals the number of columns. It is an N x N matrix.

Also note that the row names and column names are the same. Each element in the matrix is the rate of association between the row’s whale ID and the column’s whale ID. This means that all of the values along this matrix’s diagonal will be 0.5, which is the max association rate in this example:

diag(sociality)[1:10]
 id1  id2  id3  id4  id5  id6  id7  id8  id9 id10 
 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5 

Because the rows and columns list the same whales, and the association rate between whale A and whale B is the same as the rate between B and A, this matrix is also symmetric: all the data in the bottom half of the matrix (i.e., below the diagonal) are the mirror image of the data in the top half.

Look again at the first few rows and columns:

sociality[1:3,1:3]
           id1        id2        id3
id1 0.50000000 0.02758621 0.00000000
id2 0.02758621 0.50000000 0.02424242
id3 0.00000000 0.02424242 0.50000000
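You do not have to check the mirror image by eye. As a minimal sketch, the code below uses a small stand-in matrix (its values are made up, not taken from the whale dataset) to confirm both properties: square shape, and equality with its own transpose, t().

```r
# A small stand-in matrix (values are made up)
m <- matrix(data = c(0.5, 0.1, 0.0,
                     0.1, 0.5, 0.2,
                     0.0, 0.2, 0.5),
            nrow = 3, ncol = 3)

nrow(m) == ncol(m)  # TRUE: the matrix is square
isSymmetric(m)      # TRUE: the matrix equals its transpose
all(m == t(m))      # TRUE: the same check done by hand
```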

If we don’t like the fact that the diagonal has large values (after all, it doesn’t make much sense to quantify how much an individual associates with itself), we can use the diag() function to replace those diagonal values with NA:

diag(sociality) <- NA
sociality[1:5,1:5]
           id1        id2        id3        id4        id5
id1         NA 0.02758621 0.00000000 0.00000000 0.00000000
id2 0.02758621         NA 0.02424242 0.00000000 0.02439024
id3 0.00000000 0.02424242         NA 0.00000000 0.00000000
id4 0.00000000 0.00000000 0.00000000         NA 0.02631579
id5 0.00000000 0.02439024 0.00000000 0.02631579         NA

Now that we’ve blanked out the diagonal, a heatmap of this dataset will show us which whales are involved in the strongest social associations:

heatmap(sociality,Rowv=NA,Colv=NA)
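A heatmap shows the pattern, but you may also want the exact pair. One possible approach, sketched here on a small made-up matrix rather than the whale data, is to find the largest value with max() and then locate it with which(). The names m, strongest, and hits are only for illustration.

```r
ids <- c("id1", "id2", "id3")
m <- matrix(data = c(NA, 0.1, 0.0,
                     0.1, NA, 0.2,
                     0.0, 0.2, NA),
            nrow = 3, ncol = 3,
            dimnames = list(ids, ids))

# Largest off-diagonal value, ignoring the NAs on the diagonal
strongest <- max(m, na.rm = TRUE)

# Row and column positions where that value occurs
# (a symmetric matrix reports each pair twice, once per triangle)
hits <- which(m == strongest, arr.ind = TRUE)
hits
```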

Review exercise

Task 1

Write a function that uses for loop techniques to convert any matrix, such as the sociality matrix above, into a dataframe. This dataframe must have a row for each value in the matrix, with three columns: row_name, col_name, and data.

The output of your function must be a list with two elements, raw_matrix and df (which contains your new dataframe).

Demonstrate that your function works using the sociality dataset above.

Task 2

Then, write a function that reverses your work: this new function will take the dataframe output of your first function and convert it back into a matrix. The output of this function will also be a list with two elements, raw_df and matrix (which contains your new matrix).

In this function, include an input option giving you the choice of setting the diagonal in your matrix to NA.

Demonstrate that the matrix output of your second function is the same as the original sociality dataset, and demonstrate that the diagonal input works as well.