Learning Objectives
- Practice using common cleaning and wrangling functions
- Practice joining two data frames
- Practice git and GitHub workflow
About the data
These exercises will be using data on abundance, size, and trap counts (fishing pressure) of California spiny lobster (Panulirus interruptus) and were collected along the mainland coast of the Santa Barbara Channel by Santa Barbara Coastal LTER researchers (LTER, Reed, and Miller 2022).
14.1 Setup
-
Create a new repository on GitHub. Use the following settings:
Add a brief description for your new repository. For example: R practice session cleaning and wrangling data during Delta Science Program Synthesis and Training session 1.
Keep the repository public.
Initialize the repository with a
README
file and an R.gitignore
template.
Clone the repository to a new project in RStudio.
-
Create a new Quarto file in RStudio. Follow these steps:
Add a title to the new Quarto file, for example: “Exercise: Explore, Clean, and Wrangle Data”. Add your name to the
author
field. Press “Create”.Delete the default text in the new Quarto file.
Using level 2 headers, create an outline for this exercise. Include headers for the following sections: About the data, Setup, Read and explore data, Exercise 1, Exercise 2 , … Exercise 6.
Save this file with a meaningful name, eg.
exercise-clean-wrangle.qmd
.
After saving the file,
stage
,commit
, write a commit message,pull
, andpush
this file to the remote repository (on GitHub).
For this exercise we are going to use the SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus) data. Navigate to this link and briefly explore the data package.
Under the “About the data” section in the
.qmd
file, write a short description of the data, including a link to the data and the access date.Under the “Setup” section, load the following libraries in a new code chunk.
-
Read in the data.
Create a new code chunk under the “Read and explore data” header.
Navigate to the data package site and copy the the URL to access the Time-series of lobster abundance and size data. To copy the URL: hover over the Download button –> right click –> “Copy Link Address”.
Read in the data from the URL using the
read_csv
function and store it aslobster_abundance
.Read in the Time-series of lobster trap buoy counts data as
lobster_traps
by repeating setps (b) and (c).
# Read in data
lobster_abundance <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.77.8&entityid=f32823fba432f58f66c06b589b7efac6")
lobster_traps <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.77.8&entityid=66dd61c75bda17c23a3bce458c56ed84")
Look at each data frame. Take a minute to explore their data structure, find out which data types are in the data frame, or use a function to get a high-level summary of the data.
Use the
Git
workflow:Stage > Commit > Pull > Push
.
14.2 Convert missing values using mutate()
and na_if()
lobster_abundance
The variable SIZE_MM
uses -99999 as the code for missing values (see metadata). This has the potential to cause conflicts with our analyses. Modify the data following these steps:
14.3 filter()
practice
lobster_abundance
Create a subset with the data for the lobsters at Arroyo Quemado (AQUE
) that have a carapace length greater than 70 mm.
lobster_traps
Create a subset with the traps’ information at all sites where abundance data is not NA
. Note that you first have to identify which are these sites.
HINT: use %in%
.
Git
Don’t forget the Git
workflow! After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage > Commit > Pull > Push
.
14.4 Calculate totals by site and year
lobster_abundance
and traps_subset
For each data frame, calculate the total count and total traps by site and year (i.e. total traps for every combination of site and year). Store these summary statistics in separate data frames.
HINT: use group_by()
and summarize()
.
Do you notice anything not right in the outcome data frame? We’ll get to it in exercise 7.
14.5 Joining two data frames
total_abundance
and total_traps
Use one of the join_
functions to get an output data frame with the following columns: SITE
, YEAR
, total_lobsters
, total_traps
.
Discuss with your neighbor how the output data frame varies when you do a left_join()
or a full_join()
. What happens when you do an inner_join()
?
14.6 Adding a new column
The sites IVEE
and NAPL
are marine protected areas (MPAs). Read the documentation about the case_when()
function and use it to add this designation to your data set.
HINT: Notice you will have to create a new column with the MPA designation. What function have you previously used to create new columns?
Git
Don’t forget the Git
workflow! After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage > Commit > Pull > Push
.
14.7 Bonus
What would you do to fix the issues with the values in the total_traps
column? Find the root of the issue, modify the dataset to solve it, and discuss with your neighbor where in your script you would include this step.