Learning Objectives

• Learn about the Split-Apply-Combine strategy is and how it applies to data
• Describe the difference between wide vs. long table formats and how to convert between them
• Introduce `dplyr` and `tidyr` functions to clean and wrangle data for analysis

7.1 Introduction

The data we get to work with are rarely, if ever, in the format we need to do our analyses. It’s often the case that one package requires data in one format, while another package requires the data to be in another format. To be efficient analysts, we should have good tools for reformatting data for our needs so we can do our actual work like making plots and fitting models. The `dplyr` and `tidyr` R packages provide a fairly complete and extremely powerful set of functions for us to do this reformatting quickly and learning these tools well will greatly increase your efficiency as an analyst.

Analyses take many shapes, but they often conform to what is known as the Split-Apply-Combine strategy. This strategy follows a usual set of steps:

• Split: Split the data into logical groups (e.g., area, stock, year)
• Apply: Calculate some summary statistic on each group (e.g. mean total length by year)
• Combine: Combine the groups back together into a single table

As shown above, our original table is split into groups by `year`, we calculate the mean length for each group, and finally combine the per-year means into a single table.

`dplyr` provides a fast and powerful way to express this. Let’s look at a simple example of how this is done:

Assuming our length data is already loaded in a `data.frame` called `length_data`:

year length_cm
1991 5.673318
1991 3.081224
1991 4.592696
1992 4.381523
1992 5.597777
1992 4.900052
1992 4.139282
1992 5.422823
1992 5.905247
1992 5.098922

We can do this calculation using `dplyr` like this:

``````length_data %>%
group_by(year) %>%
summarize(mean_length_cm = mean(length_cm))``````

Another exceedingly common thing we need to do is “reshape” our data. Let’s look at an example table that is in what we will call “wide” format:

site 1990 1991 1993
gold 100 118 112
lake 100 118 112
dredge 100 118 112

You are probably quite familiar with data in the above format, where values of the variable being observed are spread out across columns (Here: columns for each year). Another way of describing this is that there is more than one measurement per row. This wide format works well for data entry and sometimes works well for analysis but we quickly outgrow it when using R. For example, how would you fit a model with year as a predictor variable? In an ideal world, we’d be able to just run:

``lm(length ~ year)``

But this won’t work on our wide data because `lm()` needs `length` and `year` to be columns in our table.

Or how would we make a separate plot for each year? We could call `plot()` one time for each year but this is tedious if we have many years of data and hard to maintain as we add more years of data to our data set.

The `tidyr` package allows us to quickly switch between wide format and long format using the `pivot_longer()` function:

``````site_data %>%
pivot_longer(-site, names_to = "year", values_to = "length")``````
site year length
gold 1990 101
lake 1990 104
dredge 1990 144
dredge 1993 145

In this lesson we’re going to walk through the functions you’ll most commonly use from the `dplyr` and `tidyr` packages:

Common `dplyr` functions
Function name Description
`mutate()` Creates modify and deletes columns
`group_by()` Groups data by one or more variables
`summarise()` Summaries each group down to one row
`select()` Keep or drop columns using their names
`filter()` Keeps rows that matches conditions
`arrange()` order rows using columns variable
`rename()` Rename a column
Common `tidyr` functions
Function name Description
`pivot_longer()` transforms data from a wide to a long format
`pivot_wider()` transforms data from a long to a wide format
`unite()` Unite multiple columns into one by pasting strings together
`separate()` Separate a character column into multiple columns with a regular expression or numeric locations

7.2 Data Cleaning Basics

To demonstrate, we’ll be working with a tidied up version of a data set from Alaska Department of Fish & Game containing commercial catch data from 1878-1997. The data set and reference to the original source can be found at its public archive.

Setup

First, open a new RMarkdown document. Delete everything below the setup chunk, and add a library chunk that calls `dplyr`, `tidyr`, and `readr`

``````library(dplyr)
library(tidyr)

You may have noticed the following warning messages pop up when you ran your library chunk.

``````Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

filter, lag

The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union``````

These are important warnings. They are letting you know that certain functions from the `stats` and `base` packages (which are loaded by default when you start R) are masked by different functions with the same name in the `dplyr` package. It turns out, the order that you load the packages in matters. Since we loaded `dplyr` after `stats`, R will assume that if you call `filter()`, you mean the `dplyr` version unless you specify otherwise.

Being specific about which version of `filter()`, for example, you call is easy. To explicitly call a function by its unambiguous name, you use the syntax `package_name::function_name(...)`. So, if I wanted to call the `stats` version of `filter()` in this Rmarkdown document, I would use the syntax `stats::filter(...)`.

Exercise

Warnings are important, but we might not want them in our final document. After you have read the packages in, adjust the chunk settings in your library chunk to suppress warnings and messages.

Now that we have introduce some data wrangling functions, let’s get the data that we are going to use for this lesson.

Setup
1. Find the data file `byerlySalmonByRegion.csv`. Right click the “Download” button and select “Copy Link Address”

2. Paste the copied URL into the `read_csv()` function

The code chunk you use to read in the data should look something like this:

``catch_original <- read_csv("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1")``

Note for Windows users: Keep in mind, if you want to replicate this workflow in your local computer you also need to use the `url()` function here with the argument `method = "libcurl"`.

It would look like this:

``catch_original <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1", method = "libcurl"))``

This data set is relatively clean and easy to interpret as-is. While it may be clean, it’s in a shape that makes it hard to use for some types of analyses so we’ll want to fix that first.

Exercise

Before we get too much further, spend a minute or two outlining your RMarkdown document so that it includes the following sections and steps:

• Data Sources
• Explore data
• Clean and Reshape data
• Remove unnecessary columns
• Check column typing
• Reshape data

7.3 Explore data

Similar to what we did in our Intro to Rmd lesson, it is good practice to skim through the data you just read in. This is important to make sure the data is read as you were expecting and also it allows you to get familiar with the data.

``````## Prints the column names of my data frame
colnames(catch_original)

## First 6 lines of the data frame

## Summary of each column of data
summary(catch_original)

## Prints unique values in a column (in this case Date)
unique(catch_original\$Year)

## Opens data frame in its own tab to see each row and column of the data
View(catch_original)``````

7.4 About the pipe (`%>%`) operator

Before we jump into learning `tidyr` and `dplyr`, we first need to explain the `%>%`.

Both the `tidyr` and the `dplyr` packages use the pipe operator (`%>%`), which may look unfamiliar. The pipe is a powerful way to efficiently chain together operations. The pipe will take the output of a previous statement, and use it as the input to the next statement.

Say you want to both `filter()` out rows of a data set, and `select()` certain columns.

``````df_filtered <- filter(df, ...)
df_selected <- select(df_filtered, ...)``````

You can write:

``````df_cleaned <- df %>%
filter(...) %>%
select(...)``````

If you think of the assignment operator (`<-`) as reading like “gets”, then the pipe operator would read like “then”.

So you might think of the above chunk being translated as:

The cleaned data frame gets the original data, and then a filter (of the original data), and then a select (of the filtered data).

The benefits to using pipes are that you don’t have to keep track of (or overwrite) intermediate data frames. The drawbacks are that it can be more difficult to explain the reasoning behind each step, especially when many operations are chained together. It is good to strike a balance between writing efficient code (chaining operations), while ensuring that you are still clearly explaining, both to your future self and others, what you are doing and why you are doing it.

Quick Tip

RStudio has a keyboard shortcut for `%>%`

• Windows: `Ctrl` + `Shift` + `M`
• Mac: `cmd` + `shift` + `M`

7.5 Selecting or removing columns using `select()`

The first issue is the extra columns `All` and `notesRegCode`. Let’s select only the columns we want, and assign this to a variable called `catch_data`.

``````catch_data <- catch_original %>%
select(Region, Year, Chinook, Sockeye, Coho, Pink, Chum)

``````# A tibble: 6 × 7
Region  Year Chinook Sockeye  Coho  Pink  Chum
<chr>  <dbl> <chr>     <dbl> <dbl> <dbl> <dbl>
1 SSE     1886 0             5     0     0     0
2 SSE     1887 0           155     0     0     0
3 SSE     1888 0           224    16     0     0
4 SSE     1889 0           182    11    92     0
5 SSE     1890 0           251    42     0     0
6 SSE     1891 0           274    24     0     0``````

Much better!

`select()` also allows you to say which columns you don’t want, by passing unquoted column names preceded by minus (`-`) signs:

``````catch_data <- catch_original %>%
select(-All,-notesRegCode)``````

7.6 Quality Check

Now that we have the data we are interested in using, we should do a little quality check to see that it seems as expected. One nice way of doing this is the `glimpse()` function.

``dplyr::glimpse(catch_data)``
``````Rows: 1,708
Columns: 7
\$ Region  <chr> "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE",…
\$ Year    <dbl> 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 18…
\$ Chinook <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "3", "4", "5", "9…
\$ Sockeye <dbl> 5, 155, 224, 182, 251, 274, 207, 189, 253, 408, 989, 791, 708,…
\$ Coho    <dbl> 0, 0, 16, 11, 42, 24, 11, 1, 5, 8, 192, 161, 132, 139, 84, 107…
\$ Pink    <dbl> 0, 0, 0, 92, 0, 0, 8, 187, 529, 606, 996, 2218, 673, 1545, 204…
\$ Chum    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 1, 2, 0, 0, 0, 102, 343…``````
Exercise

Notice the output of the `glimpse()` function call. Does anything seem amiss with this data set that might warrant fixing?

Answer: The Chinook catch data are `character` class. Let’s fix it using the function `mutate()` before moving on.

7.7 Changing column content using `mutate()`

We can use the `mutate()` function to change a column, or to create a new column. First Let’s try to just convert the Chinook catch values to `numeric` type using the `as.numeric()` function, and overwrite the old Chinook column.

``````catch_clean <- catch_data %>%
mutate(Chinook = as.numeric(Chinook))``````
``Warning in mask\$eval_all_mutate(quo): NAs introduced by coercion``
``head(catch_clean)``
``````# A tibble: 6 × 7
Region  Year Chinook Sockeye  Coho  Pink  Chum
<chr>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1 SSE     1886       0       5     0     0     0
2 SSE     1887       0     155     0     0     0
3 SSE     1888       0     224    16     0     0
4 SSE     1889       0     182    11    92     0
5 SSE     1890       0     251    42     0     0
6 SSE     1891       0     274    24     0     0``````

We get a warning `NAs introduced by coercion` which is R telling us that it couldn’t convert every value to an integer and, for those values it couldn’t convert, it put an `NA` in its place. This is behavior we commonly experience when cleaning data sets and it’s important to have the skills to deal with it when it comes up.

To investigate, let’s isolate the issue. We can find out which values are `NA`s with a combination of `is.na()` and `which()`, and save that to a variable called `i`.

``````i <- which(is.na(catch_clean\$Chinook))
i``````
``[1] 401``

It looks like there is only one problem row, lets have a look at it in the original data.

``catch_data[i,]``
``````# A tibble: 1 × 7
Region  Year Chinook Sockeye  Coho  Pink  Chum
<chr>  <dbl> <chr>     <dbl> <dbl> <dbl> <dbl>
1 GSE     1955 I            66     0     0     1``````

Well that’s odd: The value in `catch_thousands` is `I`. It turns out that this data set is from a PDF which was automatically converted into a `csv` and this value of `I` is actually a 1.

Let’s fix it by incorporating the `if_else()` function to our `mutate()` call, which will change the value of the `Chinook` column to 1 if the value is equal to `I`, then will use `as.numeric()` to turn the character representations of numbers into numeric typed values.

``````catch_clean <- catch_data %>%
mutate(Chinook = if_else(condition = Chinook == "I",
true = "1",
false = Chinook),
Chinook = as.integer(Chinook))

##check
catch_clean[i, ]``````
``````# A tibble: 1 × 7
Region  Year Chinook Sockeye  Coho  Pink  Chum
<chr>  <dbl>   <int>   <dbl> <dbl> <dbl> <dbl>
1 GSE     1955       1      66     0     0     1``````

7.8 Changing shape using `pivot_longer()` and `pivot_wider()`

The next issue is that the data are in a wide format and, we want the data in a long format instead. `pivot_longer()` from the `tidyr` package helps us do just this conversion. How does did work, if you do not remember all the arguments that go into this function you can always call the `help` page by typing `?pivot_longer` in the console.

``````catch_long <- catch_clean %>%
#pivot longer all columns except Region and Year
pivot_longer(
cols = -c(Region, Year),
names_to = "species",
values_to = "catch"
)

``````# A tibble: 6 × 4
Region  Year species catch
<chr>  <dbl> <chr>   <dbl>
1 SSE     1886 Chinook     0
2 SSE     1886 Sockeye     5
3 SSE     1886 Coho        0
4 SSE     1886 Pink        0
5 SSE     1886 Chum        0
6 SSE     1887 Chinook     0``````

The syntax we used above for `pivot_longer()` might be a bit confusing so let’s walk though it.

The first argument to `pivot_longer` is the columns over which we are pivoting. You can select these by listing either the names of the columns you do want to pivot, or in this case, the names of the columns you are not pivoting over. The `names_to` argument takes the name of the column that you are creating from the column names you are pivoting over. The `values_to` argument takes the name of the column that you are creating from the values in the columns you are pivoting over.

The opposite of `pivot_longer()`, `pivot_wider()`, works in a similar declarative fashion:

``````catch_wide <- catch_long %>%
pivot_wider(names_from = species,
values_from = catch)

``````# A tibble: 6 × 7
Region  Year Chinook Sockeye  Coho  Pink  Chum
<chr>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1 SSE     1886       0       5     0     0     0
2 SSE     1887       0     155     0     0     0
3 SSE     1888       0     224    16     0     0
4 SSE     1889       0     182    11    92     0
5 SSE     1890       0     251    42     0     0
6 SSE     1891       0     274    24     0     0``````

Same than we did above we can pull up the documentation of the function to remind ourselves what goes in which argument. Type `?pivot_wider` in the console.

7.9 Renaming columns with `rename()`

If you scan through the data, you may notice the values in the `catch` column are very small (these are supposed to be annual catches). If we look at the metadata we can see that the `catch` column is in thousands of fish so let’s convert it before moving on.

Let’s first rename the `catch` column to be called `catch_thousands`:

``````catch_long <- catch_long %>%
rename(catch_thousands = catch)

``````# A tibble: 6 × 4
Region  Year species catch_thousands
<chr>  <dbl> <chr>             <dbl>
1 SSE     1886 Chinook               0
2 SSE     1886 Sockeye               5
3 SSE     1886 Coho                  0
4 SSE     1886 Pink                  0
5 SSE     1886 Chum                  0
6 SSE     1887 Chinook               0``````
`names()` versus `rename()`

Many people use the base R function `names()` to rename columns, often in combination with column indexing that relies on columns being in a particular order. Column indexing is often also used to select columns instead of the `select()` function from `dplyr`. Although these methods both work just fine, they do have one major drawback: in most implementations they rely on you knowing exactly the column order your data is in.

To illustrate why your knowledge of column order isn’t reliable enough for these operations, considering the following scenario:

Your colleague emails you letting you know that she has an updated version of the conductivity-temperature-depth data from this year’s research cruise, and sends it along. Excited, you re-run your scripts that use this data for your phytoplankton research. You run the script and suddenly all of your numbers seem off. You spend hours trying to figure out what is going on.

Unbeknownst to you, your colleagues bought a new sensor this year that measures dissolved oxygen. Because of the new variables in the data set, the column order is different. Your script which previously renamed the fourth column, `SAL_PSU` to `salinity` now renames the fourth column, `O2_MGpL` to `salinity`. No wonder your results looked so weird, good thing you caught it!

If you had written your code so that it doesn’t rely on column order, but instead renames columns using the `rename()` function, the code would have run just fine (assuming the name of the original salinity column didn’t change, in which case the code would have thrown an error in an obvious way). This is an example of a defensive coding strategy, where you try to anticipate issues before they arise, and write your code in such a way as to keep the issues from happening.

7.10 Adding columns using `mutate()`

Now let’s use `mutate()` again to create a new column called `catch` with units of fish (instead of thousands of fish).

``````catch_long <- catch_long %>%
mutate(catch = catch_thousands * 1000)

Now let’s remove the `catch_thousands` column for now since we don’t need it. Note that here we have added to the expression we wrote above by adding another function call (mutate) to our expression. This takes advantage of the pipe operator by grouping together a similar set of statements, which all aim to clean up the `catch_clean` data frame.

``````catch_long <- catch_long %>%
mutate(catch = catch_thousands * 1000) %>%
select(-catch_thousands)

``````# A tibble: 6 × 4
Region  Year species catch
<chr>  <dbl> <chr>   <dbl>
1 SSE     1886 Chinook     0
2 SSE     1886 Sockeye  5000
3 SSE     1886 Coho        0
4 SSE     1886 Pink        0
5 SSE     1886 Chum        0
6 SSE     1887 Chinook     0``````

We’re now ready to start analyzing the data.

7.11 Summary statistics using `group_by()` and `summarize()`

As we outlined in the Introduction, `dplyr` lets us employ the Split-Apply-Combine strategy and this is exemplified through the use of the `group_by()` and `summarize()` functions:

``````mean_region <- catch_long %>%
group_by(Region) %>%
summarize(catch_mean = mean(catch))

``````# A tibble: 6 × 2
Region catch_mean
<chr>       <dbl>
1 ALU        40384.
2 BER        16373.
3 BRB      2709796.
4 CHG       315487.
5 CKI       683571.
6 COP       179223.``````

Another common use of `group_by()` followed by `summarize()` is to count the number of rows in each group. We have to use a special function from `dplyr`, `n()`.

``````n_region <- catch_long %>%
group_by(Region) %>%
summarize(n = n())

``````# A tibble: 6 × 2
Region     n
<chr>  <int>
1 ALU      435
2 BER      510
3 BRB      570
4 CHG      550
5 CKI      525
6 COP      470``````

Quick Tip If you are finding that you are reaching for this combination of `group_by()`, `summarize()` and `n()` a lot, there is a helpful `dplyr` function `count()` that accomplishes this in one function!

Exercise
• Find another grouping and statistic to calculate for each group
• Find out if you can group by multiple variables
``````## one example can be
catch_year_sp <- catch_long %>%
group_by(Year, species) %>%
summarize(total_year = sum(catch, na.rm = T))``````

7.12 Filtering rows using `filter()`

`filter()` is the verb we use to filter our `data.frame` to rows matching some condition. It’s similar to `subset()` from base R.

Let’s go back to our original `data.frame` and do some `filter()`ing:

``````sse_catch <- catch_long %>%
filter(Region == "SSE")

``````# A tibble: 6 × 4
Region  Year species catch
<chr>  <dbl> <chr>   <dbl>
1 SSE     1886 Chinook     0
2 SSE     1886 Sockeye  5000
3 SSE     1886 Coho        0
4 SSE     1886 Pink        0
5 SSE     1886 Chum        0
6 SSE     1887 Chinook     0``````
Exercise
• Filter to just catches of over one million fish
• Filter to just Chinook from the SSE region
``````## Catches over a million fish
catch_million <- catch_long %>%
filter(catch > 1000000)

## Chinook from SSE data
chinook_see <- catch_long %>%
filter(Region == "SSE",
species == "Chinook")

## OR
chinook_see <- catch_long %>%
filter(Region == "SSE" & species == "Chinook")``````

7.13 Sorting your data using `arrange()`

`arrange()` is how we sort the rows of a `data.frame`. Two common case to use `arrange()` are:

• To calculate a cumulative sum (with `cumsum()`) so row order matters
• To display a table (like in an `.Rmd` document) in sorted order

Let’s re-calculate mean catch by region, and then `arrange()` the output by mean catch:

``````mean_region <- catch_long %>%
group_by(Region) %>%
summarize(mean_catch = mean(catch)) %>%
arrange(mean_catch)

``````# A tibble: 6 × 2
Region mean_catch
<chr>       <dbl>
1 BER        16373.
2 KTZ        18836.
3 ALU        40384.
4 NRS        51503.
5 KSK        67642.
6 YUK        68646.``````

The default sorting order of `arrange()` is to sort in ascending order. To reverse the sort order, wrap the column name inside the `desc()` function:

``````mean_region <- catch_long %>%
group_by(Region) %>%
summarize(mean_catch = mean(catch)) %>%
arrange(desc(mean_catch))

``````# A tibble: 6 × 2
Region mean_catch
<chr>       <dbl>
1 SSE      3184661.
2 BRB      2709796.
3 NSE      1825021.
4 KOD      1528350
5 PWS      1419237.
6 SOP      1110942.``````

7.14 Splitting a column using `separate()` and `unite()`

`separate()` and its complement, `unite()` allow us to easily split a single column into numerous (or numerous into a single).

This can come in really handle when we need to split a column into two pieces by a consistent separator (like a dash).

Let’s make a new `tibble` with fake data to illustrate this. Here we have a set of site identification codes. with information about the island where the site is (the first 3 letters) and a site number (the 3 numbers). If we want to group and summarize by island, we need a column with just the island information.

``````sites_df <- tibble(site = c("HAW-101",
"HAW-103",
"OAH-320",
"OAH-219",
"MAI-039"))

sites_df %>%
separate(site, c("island", "site_number"), "-")``````
``````# A tibble: 5 × 2
island site_number
<chr>  <chr>
1 HAW    101
2 HAW    103
3 OAH    320
4 OAH    219
5 MAI    039        ``````
Exercise

Split the `city` column in the data frame `cities_df` into `city` and `state_code` columns

``````## create `cities_df`
cities_df <- data.frame(city = c("Juneau AK",
"Sitka AK",
"Anchorage AK"))``````
``````colnames(cities_df)

cities_clean <- cities_df %>%
separate(city, c("city", "state_code"), " ")``````

`unite()` does just the reverse of `separate()`. If we have a data.frame that contains columns for year, month, and day, we might want to unite these into a single date column.

``````dates_df <- data.frame(
year = c("1930",
"1930",
"1930"),
month = c("12",
"12",
"12"),
day = c("14",
"15",
"16")
)

dates_df %>%
unite(date, year, month, day, sep = "-")``````
``````        date
1 1930-12-14
2 1930-12-15
3 1930-12-16``````

7.15 Now, all together!

We just ran through the various things we can do with `dplyr` and `tidyr` but if you’re wondering how this might look in a real analysis. Let’s look at that now:

``````catch_original <- read_csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1",
method = "libcurl"))

method = "libcurl")) %>%
select(code, mgmtArea)

mean_region <- catch_original %>%
select(-All, -notesRegCode) %>%
mutate(Chinook = ifelse(Chinook == "I", 1, Chinook)) %>%
mutate(Chinook = as.numeric(Chinook)) %>%
pivot_longer(-c(Region, Year),
names_to = "species",
values_to = "catch") %>%
mutate(catch = catch*1000) %>%
group_by(Region) %>%
summarize(mean_catch = mean(catch))

``````# A tibble: 6 × 2
Region mean_catch
<chr>       <dbl>
1 ALU        40384.
2 BER        16373.
3 BRB      2709796.
4 CHG       315487.
5 CKI       683571.
6 COP       179223.``````

We have completed our lesson on Cleaning and Wrangling data. Before we break, let’s practice our Github workflow.

Steps
1. Save the `.Rmd` you have been working on for this lesson.
2. Knit the R Markdown file. This is a way to test everything in your code is working.
3. `Stage > Commit > Pull > Push`