length_data %>%
group_by(year) %>%
summarize(mean_length_cm = mean(length_cm))
Learning Objectives
- Learn about the Split-Apply-Combine strategy is and how it applies to data
- Describe the difference between wide vs. long table formats and how to convert between them
- Introduce
dplyr
andtidyr
functions to clean and wrangle data for analysis
7.1 Introduction
The data we get to work with are rarely, if ever, in the format we need to do our analyses. It’s often the case that one package requires data in one format, while another package requires the data to be in another format. To be efficient analysts, we should have good tools for reformatting data for our needs so we can do our actual work like making plots and fitting models. The dplyr
and tidyr
R packages provide a fairly complete and extremely powerful set of functions for us to do this reformatting quickly and learning these tools well will greatly increase your efficiency as an analyst.
Analyses take many shapes, but they often conform to what is known as the Split-Apply-Combine strategy. This strategy follows a usual set of steps:
- Split: Split the data into logical groups (e.g., area, stock, year)
- Apply: Calculate some summary statistic on each group (e.g. mean total length by year)
- Combine: Combine the groups back together into a single table
As shown above, our original table is split into groups by year
, we calculate the mean length for each group, and finally combine the per-year means into a single table.
dplyr
provides a fast and powerful way to express this. Let’s look at a simple example of how this is done:
Assuming our length data is already loaded in a data.frame
called length_data
:
year | length_cm |
---|---|
1991 | 5.673318 |
1991 | 3.081224 |
1991 | 4.592696 |
1992 | 4.381523 |
1992 | 5.597777 |
1992 | 4.900052 |
1992 | 4.139282 |
1992 | 5.422823 |
1992 | 5.905247 |
1992 | 5.098922 |
We can do this calculation using dplyr
like this:
Another exceedingly common thing we need to do is “reshape” our data. Let’s look at an example table that is in what we will call “wide” format:
site | 1990 | 1991 | … | 1993 |
---|---|---|---|---|
gold | 100 | 118 | … | 112 |
lake | 100 | 118 | … | 112 |
… | … | … | … | … |
dredge | 100 | 118 | … | 112 |
You are probably quite familiar with data in the above format, where values of the variable being observed are spread out across columns (Here: columns for each year). Another way of describing this is that there is more than one measurement per row. This wide format works well for data entry and sometimes works well for analysis but we quickly outgrow it when using R. For example, how would you fit a model with year as a predictor variable? In an ideal world, we’d be able to just run:
lm(length ~ year)
But this won’t work on our wide data because lm()
needs length
and year
to be columns in our table.
Or how would we make a separate plot for each year? We could call plot()
one time for each year but this is tedious if we have many years of data and hard to maintain as we add more years of data to our data set.
The tidyr
package allows us to quickly switch between wide format and long format using the pivot_longer()
function:
site_data %>%
pivot_longer(-site, names_to = "year", values_to = "length")
site | year | length |
---|---|---|
gold | 1990 | 101 |
lake | 1990 | 104 |
dredge | 1990 | 144 |
… | … | … |
dredge | 1993 | 145 |
In this lesson we’re going to walk through the functions you’ll most commonly use from the dplyr
and tidyr
packages:
Function name | Description |
---|---|
mutate() |
Creates modify and deletes columns |
group_by() |
Groups data by one or more variables |
summarise() |
Summaries each group down to one row |
select() |
Keep or drop columns using their names |
filter() |
Keeps rows that matches conditions |
arrange() |
order rows using columns variable |
rename() |
Rename a column |
Function name | Description |
---|---|
pivot_longer() |
transforms data from a wide to a long format |
pivot_wider() |
transforms data from a long to a wide format |
unite() |
Unite multiple columns into one by pasting strings together |
separate() |
Separate a character column into multiple columns with a regular expression or numeric locations |
7.2 Data Cleaning Basics
To demonstrate, we’ll be working with a tidied up version of a data set from Alaska Department of Fish & Game containing commercial catch data from 1878-1997. The data set and reference to the original source can be found at its public archive.
Now that we have introduce some data wrangling functions, let’s get the data that we are going to use for this lesson.
This data set is relatively clean and easy to interpret as-is. While it may be clean, it’s in a shape that makes it hard to use for some types of analyses so we’ll want to fix that first.
7.3 Explore data
Similar to what we did in our Intro to Rmd lesson, it is good practice to skim through the data you just read in. This is important to make sure the data is read as you were expecting and also it allows you to get familiar with the data.
## Prints the column names of my data frame
colnames(catch_original)
## First 6 lines of the data frame
head(catch_original)
## Summary of each column of data
summary(catch_original)
## Prints unique values in a column (in this case Date)
unique(catch_original$Year)
## Opens data frame in its own tab to see each row and column of the data
View(catch_original)
7.4 About the pipe (%>%
) operator
Before we jump into learning tidyr
and dplyr
, we first need to explain the %>%
.
Both the tidyr
and the dplyr
packages use the pipe operator (%>%
), which may look unfamiliar. The pipe is a powerful way to efficiently chain together operations. The pipe will take the output of a previous statement, and use it as the input to the next statement.
Say you want to both filter()
out rows of a data set, and select()
certain columns.
Instead of writing:
You can write:
If you think of the assignment operator (<-
) as reading like “gets”, then the pipe operator would read like “then”.
So you might think of the above chunk being translated as:
The cleaned data frame gets the original data, and then a filter (of the original data), and then a select (of the filtered data).
The benefits to using pipes are that you don’t have to keep track of (or overwrite) intermediate data frames. The drawbacks are that it can be more difficult to explain the reasoning behind each step, especially when many operations are chained together. It is good to strike a balance between writing efficient code (chaining operations), while ensuring that you are still clearly explaining, both to your future self and others, what you are doing and why you are doing it.
7.5 Selecting or removing columns using select()
The first issue is the extra columns All
and notesRegCode
. Let’s select only the columns we want, and assign this to a variable called catch_data
.
catch_data <- catch_original %>%
select(Region, Year, Chinook, Sockeye, Coho, Pink, Chum)
head(catch_data)
# A tibble: 6 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 SSE 1886 0 5 0 0 0
2 SSE 1887 0 155 0 0 0
3 SSE 1888 0 224 16 0 0
4 SSE 1889 0 182 11 92 0
5 SSE 1890 0 251 42 0 0
6 SSE 1891 0 274 24 0 0
Much better!
select()
also allows you to say which columns you don’t want, by passing unquoted column names preceded by minus (-
) signs:
7.6 Quality Check
Now that we have the data we are interested in using, we should do a little quality check to see that it seems as expected. One nice way of doing this is the glimpse()
function.
dplyr::glimpse(catch_data)
Rows: 1,708
Columns: 7
$ Region <chr> "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE",…
$ Year <dbl> 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 18…
$ Chinook <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "3", "4", "5", "9…
$ Sockeye <dbl> 5, 155, 224, 182, 251, 274, 207, 189, 253, 408, 989, 791, 708,…
$ Coho <dbl> 0, 0, 16, 11, 42, 24, 11, 1, 5, 8, 192, 161, 132, 139, 84, 107…
$ Pink <dbl> 0, 0, 0, 92, 0, 0, 8, 187, 529, 606, 996, 2218, 673, 1545, 204…
$ Chum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 1, 2, 0, 0, 0, 102, 343…
7.7 Changing column content using mutate()
We can use the mutate()
function to change a column, or to create a new column. First Let’s try to just convert the Chinook catch values to numeric
type using the as.numeric()
function, and overwrite the old Chinook column.
catch_clean <- catch_data %>%
mutate(Chinook = as.numeric(Chinook))
Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
head(catch_clean)
# A tibble: 6 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SSE 1886 0 5 0 0 0
2 SSE 1887 0 155 0 0 0
3 SSE 1888 0 224 16 0 0
4 SSE 1889 0 182 11 92 0
5 SSE 1890 0 251 42 0 0
6 SSE 1891 0 274 24 0 0
We get a warning NAs introduced by coercion
which is R telling us that it couldn’t convert every value to an integer and, for those values it couldn’t convert, it put an NA
in its place. This is behavior we commonly experience when cleaning data sets and it’s important to have the skills to deal with it when it comes up.
To investigate, let’s isolate the issue. We can find out which values are NA
s with a combination of is.na()
and which()
, and save that to a variable called i
.
It looks like there is only one problem row, lets have a look at it in the original data.
catch_data[i,]
# A tibble: 1 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 GSE 1955 I 66 0 0 1
Well that’s odd: The value in catch_thousands
is I
. It turns out that this data set is from a PDF which was automatically converted into a csv
and this value of I
is actually a 1.
Let’s fix it by incorporating the if_else()
function to our mutate()
call, which will change the value of the Chinook
column to 1 if the value is equal to I
, then will use as.numeric()
to turn the character representations of numbers into numeric typed values.
catch_clean <- catch_data %>%
mutate(Chinook = if_else(condition = Chinook == "I",
true = "1",
false = Chinook),
Chinook = as.integer(Chinook))
##check
catch_clean[i, ]
# A tibble: 1 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 GSE 1955 1 66 0 0 1
7.8 Changing shape using pivot_longer()
and pivot_wider()
The next issue is that the data are in a wide format and, we want the data in a long format instead. pivot_longer()
from the tidyr
package helps us do just this conversion. How does did work, if you do not remember all the arguments that go into this function you can always call the help
page by typing ?pivot_longer
in the console.
catch_long <- catch_clean %>%
#pivot longer all columns except Region and Year
pivot_longer(
cols = -c(Region, Year),
names_to = "species",
values_to = "catch"
)
head(catch_long)
# A tibble: 6 × 4
Region Year species catch
<chr> <dbl> <chr> <dbl>
1 SSE 1886 Chinook 0
2 SSE 1886 Sockeye 5
3 SSE 1886 Coho 0
4 SSE 1886 Pink 0
5 SSE 1886 Chum 0
6 SSE 1887 Chinook 0
The syntax we used above for pivot_longer()
might be a bit confusing so let’s walk though it.
The first argument to pivot_longer
is the columns over which we are pivoting. You can select these by listing either the names of the columns you do want to pivot, or in this case, the names of the columns you are not pivoting over. The names_to
argument takes the name of the column that you are creating from the column names you are pivoting over. The values_to
argument takes the name of the column that you are creating from the values in the columns you are pivoting over.
The opposite of pivot_longer()
, pivot_wider()
, works in a similar declarative fashion:
catch_wide <- catch_long %>%
pivot_wider(names_from = species,
values_from = catch)
head(catch_wide)
# A tibble: 6 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SSE 1886 0 5 0 0 0
2 SSE 1887 0 155 0 0 0
3 SSE 1888 0 224 16 0 0
4 SSE 1889 0 182 11 92 0
5 SSE 1890 0 251 42 0 0
6 SSE 1891 0 274 24 0 0
Same than we did above we can pull up the documentation of the function to remind ourselves what goes in which argument. Type ?pivot_wider
in the console.
7.9 Renaming columns with rename()
If you scan through the data, you may notice the values in the catch
column are very small (these are supposed to be annual catches). If we look at the metadata we can see that the catch
column is in thousands of fish so let’s convert it before moving on.
Let’s first rename the catch
column to be called catch_thousands
:
7.10 Adding columns using mutate()
Now let’s use mutate()
again to create a new column called catch
with units of fish (instead of thousands of fish).
Now let’s remove the catch_thousands
column for now since we don’t need it. Note that here we have added to the expression we wrote above by adding another function call (mutate) to our expression. This takes advantage of the pipe operator by grouping together a similar set of statements, which all aim to clean up the catch_clean
data frame.
catch_long <- catch_long %>%
mutate(catch = catch_thousands * 1000) %>%
select(-catch_thousands)
head(catch_long)
# A tibble: 6 × 4
Region Year species catch
<chr> <dbl> <chr> <dbl>
1 SSE 1886 Chinook 0
2 SSE 1886 Sockeye 5000
3 SSE 1886 Coho 0
4 SSE 1886 Pink 0
5 SSE 1886 Chum 0
6 SSE 1887 Chinook 0
We’re now ready to start analyzing the data.
7.11 Summary statistics using group_by()
and summarize()
As we outlined in the Introduction, dplyr
lets us employ the Split-Apply-Combine strategy and this is exemplified through the use of the group_by()
and summarize()
functions:
mean_region <- catch_long %>%
group_by(Region) %>%
summarize(catch_mean = mean(catch))
head(mean_region)
# A tibble: 6 × 2
Region catch_mean
<chr> <dbl>
1 ALU 40384.
2 BER 16373.
3 BRB 2709796.
4 CHG 315487.
5 CKI 683571.
6 COP 179223.
Another common use of group_by()
followed by summarize()
is to count the number of rows in each group. We have to use a special function from dplyr
, n()
.
# A tibble: 6 × 2
Region n
<chr> <int>
1 ALU 435
2 BER 510
3 BRB 570
4 CHG 550
5 CKI 525
6 COP 470
Quick Tip If you are finding that you are reaching for this combination of group_by()
, summarize()
and n()
a lot, there is a helpful dplyr
function count()
that accomplishes this in one function!
7.12 Filtering rows using filter()
filter()
is the verb we use to filter our data.frame
to rows matching some condition. It’s similar to subset()
from base R.
Let’s go back to our original data.frame
and do some filter()
ing:
# A tibble: 6 × 4
Region Year species catch
<chr> <dbl> <chr> <dbl>
1 SSE 1886 Chinook 0
2 SSE 1886 Sockeye 5000
3 SSE 1886 Coho 0
4 SSE 1886 Pink 0
5 SSE 1886 Chum 0
6 SSE 1887 Chinook 0
7.13 Sorting your data using arrange()
arrange()
is how we sort the rows of a data.frame
. Two common case to use arrange()
are:
- To calculate a cumulative sum (with
cumsum()
) so row order matters - To display a table (like in an
.Rmd
document) in sorted order
Let’s re-calculate mean catch by region, and then arrange()
the output by mean catch:
mean_region <- catch_long %>%
group_by(Region) %>%
summarize(mean_catch = mean(catch)) %>%
arrange(mean_catch)
head(mean_region)
# A tibble: 6 × 2
Region mean_catch
<chr> <dbl>
1 BER 16373.
2 KTZ 18836.
3 ALU 40384.
4 NRS 51503.
5 KSK 67642.
6 YUK 68646.
The default sorting order of arrange()
is to sort in ascending order. To reverse the sort order, wrap the column name inside the desc()
function:
7.14 Splitting a column using separate()
and unite()
separate()
and its complement, unite()
allow us to easily split a single column into numerous (or numerous into a single).
This can come in really handle when we need to split a column into two pieces by a consistent separator (like a dash).
Let’s make a new tibble
with fake data to illustrate this. Here we have a set of site identification codes. with information about the island where the site is (the first 3 letters) and a site number (the 3 numbers). If we want to group and summarize by island, we need a column with just the island information.
sites_df <- tibble(site = c("HAW-101",
"HAW-103",
"OAH-320",
"OAH-219",
"MAI-039"))
sites_df %>%
separate(site, c("island", "site_number"), "-")
# A tibble: 5 × 2
island site_number
<chr> <chr>
1 HAW 101
2 HAW 103
3 OAH 320
4 OAH 219
5 MAI 039
unite()
does just the reverse of separate()
. If we have a data.frame that contains columns for year, month, and day, we might want to unite these into a single date column.
7.15 Now, all together!
We just ran through the various things we can do with dplyr
and tidyr
but if you’re wondering how this might look in a real analysis. Let’s look at that now:
catch_original <- read_csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1",
method = "libcurl"))
region_defs <- read_csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.303.1",
method = "libcurl")) %>%
select(code, mgmtArea)
mean_region <- catch_original %>%
select(-All, -notesRegCode) %>%
mutate(Chinook = ifelse(Chinook == "I", 1, Chinook)) %>%
mutate(Chinook = as.numeric(Chinook)) %>%
pivot_longer(-c(Region, Year),
names_to = "species",
values_to = "catch") %>%
mutate(catch = catch*1000) %>%
group_by(Region) %>%
summarize(mean_catch = mean(catch))
head(mean_region)
# A tibble: 6 × 2
Region mean_catch
<chr> <dbl>
1 ALU 40384.
2 BER 16373.
3 BRB 2709796.
4 CHG 315487.
5 CKI 683571.
6 COP 179223.
We have completed our lesson on Cleaning and Wrangling data. Before we break, let’s practice our Github workflow.