Practice using common cleaning and wrangling functions
Practice joining two data frames
Practice git and GitHub workflow
About the data
These exercises will be using data on abundance, size, and trap counts (fishing pressure) of California spiny lobster (Panulirus interruptus) and were collected along the mainland coast of the Santa Barbara Channel by Santa Barbara Coastal LTER researchers (LTER, Reed, and Miller 2022).
14.1 Setup
GitHub & R setup
Create a new repository on GitHub. Use the following settings:
Add a brief description for your new repository. For example: R practice session cleaning and wrangling data during Delta Science Program Synthesis and Training session 1.
Keep the repository public.
Initialize the repository with a README file and an R .gitignore template.
Clone the repository to a new project in RStudio.
Create a new Quarto file in RStudio. Follow these steps:
Add a title to the new Quarto file, for example: “Exercise: Explore, Clean, and Wrangle Data”. Add your name to the author field. Press “Create”.
Delete the default text in the new Quarto file.
Using level 2 headers, create an outline for this exercise. Include headers for the following sections: About the data, Setup, Read and explore data, Exercise 1, Exercise 2 , … Exercise 6.
Save this file with a meaningful name, eg. exercise-clean-wrangle.qmd.
After saving the file, stage, commit, write a commit message, pull, and push this file to the remote repository (on GitHub).
Create a new code chunk under the “Read and explore data” header.
Navigate to the data package site and copy the the URL to access the Time-series of lobster abundance and size data. To copy the URL: hover over the Download button –> right click –> “Copy Link Address”.
Read in the data from the URL using the read_csv function and store it as lobster_abundance.
Read in the Time-series of lobster trap buoy counts data as lobster_traps by repeating setps (b) and (c).
# Read in datalobster_abundance<-read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.77.8&entityid=f32823fba432f58f66c06b589b7efac6")lobster_traps<-read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.77.8&entityid=66dd61c75bda17c23a3bce458c56ed84")
Look at each data frame. Take a minute to explore their data structure, find out which data types are in the data frame, or use a function to get a high-level summary of the data.
Use the Git workflow: Stage > Commit > Pull > Push.
14.2 Convert missing values using mutate() and na_if()
Exercise 1: lobster_abundance
The variable SIZE_MM uses -99999 as the code for missing values (see metadata). This has the potential to cause conflicts with our analyses. Modify the data following these steps:
Verify the SIZE_MM variable contains -99999 values using unique().
Convert every -99999 value to an NA value using mutate() and na_if(). Look up the help page to see how to use na_if().
Create a subset with the traps’ information at all sites where abundance data is not NA. Note that you first have to identify which are these sites.
HINT: use %in%.
Answer
## Create a vector with unique sites in lobster_abundanceabundance_sites<-unique(lobster_abundance$SITE)## Filter sites in vector abovetraps_subset<-lobster_traps%>%filter(SITE%in%abundance_sites)
Save your work and use Git
Don’t forget the Git workflow! After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage > Commit > Pull > Push.
14.4 Calculate totals by site and year
Exercise 4: lobster_abundance and traps_subset
For each data frame, calculate the total count and total traps by site and year (i.e. total traps for every combination of site and year). Store these summary statistics in separate data frames.
Use one of the join_ functions to get an output data frame with the following columns: SITE, YEAR, total_lobsters, total_traps.
Discuss with your neighbor how the output data frame varies when you do a left_join() or a full_join(). What happens when you do an inner_join()?
Answer
abundance_traps<-total_abundance%>%left_join(total_traps, by =c("SITE", "YEAR"))## Orabundance_traps<-total_abundance%>%full_join(total_traps, by =c("SITE", "YEAR"))## Orabundance_traps<-total_abundance%>%inner_join(total_traps, by =c("SITE", "YEAR"))
14.6 Adding a new column
Exercise 6
The sites IVEE and NAPL are marine protected areas (MPAs). Read the documentation about the case_when() function and use it to add this designation to your data set.
HINT: Notice you will have to create a new column with the MPA designation. What function have you previously used to create new columns?
Don’t forget the Git workflow! After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage > Commit > Pull > Push.
14.7 Bonus
Exercise 7
What would you do to fix the issues with the values in the total_traps column? Find the root of the issue, modify the dataset to solve it, and discuss with your neighbor where in your script you would include this step.
Answer
# Replace -99999 values for NAs at the beginning of the script, # similar to what we did in question 1 but for lobster_traps data frame. # Then re run all the other steps.lobster_traps<-lobster_traps%>%mutate(TRAPS =na_if(TRAPS, -99999))
LTER, Santa Barbara Coastal, Daniel C Reed, and Robert J Miller. 2022. “SBC LTER: Reef: Abundance, Size and Fishing Effort for California Spiny Lobster (Panulirus Interruptus), Ongoing Since 2012.” Environmental Data Initiative. https://doi.org/10.6073/PASTA/25AA371650A671BAFAD64DD25A39EE18.