library(rvest)
library(readr)
library(dplyr)
library(janitor)
Learning Objectives
- Integrate knowledge on writing functions in R
- Understand how functions can be used when cleaning data
About the data
For this practice session we will use data on shorebird breeding ecology collected in Utqiaġvik, Alaska, 2003-2018 by Richard Lanctot and Sarah Saalfeld. This data is publicly available at the Arctic Data Center.
One of the features of this dataset is that it has many files with similar formatting, most of which contain the column species, which holds the Bird Banding Laboratory species codes. These four-letter codes aren’t very intuitive to most people, so the main goal for this session is to write a function that can be used on any file in this dataset that contains a species code.
- Make sure you’re in the right project (training_{USERNAME}) and use the Git workflow by Pulling to check for any changes in the remote repository (aka the repository on GitHub).
- Create a new Quarto Document.
- Title it “R Practice: Functions”.
- Save the file and name it “r-practice-functions”.
- Insert a setup R chunk and load the necessary libraries. Note that here we introduce a new package called rvest. This package enables easy scraping and handling of information from websites.
- Load the species table using the following code. This code scrapes a table from a url and uses some cleaning and wrangling functions to get the table into our Environment in the format we want.
webpage <- rvest::read_html("https://www.pwrc.usgs.gov/BBL/Bander_Portal/login/speclist.php")

tbls <- rvest::html_nodes(webpage, "table") %>%
  rvest::html_table(fill = TRUE)

species <- tbls[[1]] %>%
  janitor::clean_names() %>%
  select(alpha_code, common_name) %>%
  mutate(alpha_code = tolower(alpha_code))

head(species, 3)
Note: After running the chunk above you should have an object of class data.frame named species in your global environment. Explore this data frame! What are the column names?
- Obtain data from the Arctic Data Center: Utqiaġvik shorebird breeding ecology study, Utqiaġvik. Make sure to click “Show more items in this data set” and then download the following files:
Utqiagvik_predator_surveys.csv
Utqiagvik_nest_data.csv
Utqiagvik_egg_measurements.csv
Note: It’s up to you how you download and load the data! You can either use the download links (obtained by right-clicking the “Download” button and selecting “Copy Link Address” for each data entity) or manually download the data and then upload the files to the RStudio server.
- Use the Git workflow. After you’ve set up your project and uploaded your data go through the workflow:
Stage (add) -> Commit -> Pull -> Push
Exercise
17.1 Question 1
Read in each data file and store the data frames as nest_data, predator_survey, and egg_measures, respectively. After reading the data, explore it in a new chunk or in the console using any function we have used during the lessons (e.g. colnames(), glimpse()).
Answer
## When reading from a file in the data folder of your Rproj
nest_data <- read_csv("data/Utqiagvik_nest_data.csv")

predator_survey <- read_csv("data/Utqiagvik_predator_surveys.csv")

egg_measures <- read_csv("data/Utqiagvik_egg_measurements.csv")

## When reading using the url
nest_data <- read_csv("https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A982bd2fc-4edf-4da7-96ef-0d11b853102d")

predator_survey <- read_csv("https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A9ffec04c-7e2d-41dd-9e88-b6c2e8c4375e")

egg_measures <- read_csv("https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A4b219711-2282-420a-b1d6-1893fe4a74a6")

## Exploring the data (these functions can also be used to explore nest_data & egg_measures)
colnames(predator_survey)
glimpse(predator_survey)
unique(predator_survey$species)
summary(predator_survey)
17.2 Question 2
Before thinking about how to write a function, first discuss what you are trying to achieve and how you would get there. Write and run the code that combines the species data frame with predator_survey so that the resulting data frame has both the species code and the common name.
Hint: joins
Answer
predator_comm_names <- left_join(predator_survey,
                                 species,
                                 by = c("species" = "alpha_code"))
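To see what this join does on a small scale, here is a minimal sketch using two made-up tables. The names toy_survey, toy_codes, and toy_joined, as well as the values in them (including the code "zzzz"), are hypothetical and only stand in for predator_survey and species. The point is that left_join() keeps every row of the first table and fills common_name with NA where a code has no match.

```r
library(dplyr)

# Hypothetical toy tables (not the real data), used only to illustrate the join
toy_survey <- tibble(
  species = c("amgp", "dunl", "zzzz"),
  count   = c(2, 5, 1)
)

toy_codes <- tibble(
  alpha_code  = c("amgp", "dunl"),
  common_name = c("American Golden-Plover", "Dunlin")
)

# Keep every survey row; codes without a match get NA in common_name
toy_joined <- left_join(toy_survey, toy_codes,
                        by = c("species" = "alpha_code"))

toy_joined
```

The unmatched code stays in the result, which is why a later step filters on common_name.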
17.3 Question 3
How can you generalize the code from the previous question and make it into a function?
The idea is that you can use this function on any data frame that has a column named species with the Bird Banding Laboratory species code.
Answer
assign_species_name <- function(df, species){
  return_df <- left_join(df, species, by = c("species" = "alpha_code"))
  return(return_df)
}
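As a rough illustration of why the function form is useful, the sketch below applies it to two differently shaped tables. The function body is repeated from the answer so the block runs on its own; toy_codes, toy_nests, and toy_eggs are hypothetical stand-ins with made-up values, not the real files.

```r
library(dplyr)

# Same function as in the answer, repeated so this block is self-contained
assign_species_name <- function(df, species){
  return_df <- left_join(df, species, by = c("species" = "alpha_code"))
  return(return_df)
}

# Hypothetical lookup table and two toy data frames with different columns
toy_codes <- tibble(
  alpha_code  = c("amgp", "dunl"),
  common_name = c("American Golden-Plover", "Dunlin")
)

toy_nests <- tibble(
  nestID  = c("n1", "n2"),
  species = c("dunl", "amgp")
)

toy_eggs <- tibble(
  nestID  = c("n1", "n1", "n2"),
  species = c("dunl", "dunl", "amgp"),
  length  = c(38.1, 37.5, 47.9)
)

# One function call per table; only the species column has to be present
nests_named <- assign_species_name(toy_nests, toy_codes)
eggs_named  <- assign_species_name(toy_eggs, toy_codes)
```

Both results keep their original columns and gain common_name, so the same call should carry over to any file in the dataset that has a species column.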
17.4 Question 4
Place the cursor inside your function. In the top menu, go to Code > Insert Roxygen skeleton. Document the parameters and the return value, and write one example.
Answer
#' Title
#'
#' @param df A data frame containing BBL species codes in column `species`
#' @param species A data frame defining BBL species codes with columns `alpha_code` and `common_name`
#'
#' @return A data frame with original data df, plus the common name of species
#' @export
#'
#' @examples `*provide an example*`
assign_species_name <- function(df, species){
  return_df <- left_join(df, species, by = c("species" = "alpha_code"))
  return(return_df)
}
17.5 Question 5
Create clean versions of the three data frames by applying the function you created, removing columns that you think are not necessary (aka selecting the ones you want to keep), and filtering out NA values.
Answer
## This is one solution.
predator_clean <- assign_species_name(predator_survey, species) %>%
  select(year, site, date, common_name, count) %>%
  filter(!is.na(common_name))

nest_location_clean <- assign_species_name(nest_data, species) %>%
  select(year, site, nestID, common_name, lat_corrected, long_corrected) %>%
  filter(!is.na(common_name))

eggs_clean <- assign_species_name(egg_measures, species) %>%
  select(year, site, nestID, common_name, length, width) %>%
  filter(!is.na(common_name))
Congrats! Now you have clean data sets ready for analysis.
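Since filtering on common_name silently drops rows, it can be worth checking which species codes failed to match before relying on the cleaned tables. The sketch below shows one possible check; toy_joined is a hypothetical stand-in for the output of the join step, with made-up values.

```r
library(dplyr)

# Hypothetical joined table standing in for the output of the join step
toy_joined <- tibble(
  year        = c(2003, 2003, 2004),
  species     = c("dunl", "zzzz", "amgp"),
  common_name = c("Dunlin", NA, "American Golden-Plover"),
  count       = c(4, 1, 2)
)

# Codes that did not match the lookup table, with the number of rows affected
unmatched <- toy_joined %>%
  filter(is.na(common_name)) %>%
  count(species, name = "n_rows")

# The cleaning step itself: keep selected columns, drop unmatched rows
toy_clean <- toy_joined %>%
  select(year, common_name, count) %>%
  filter(!is.na(common_name))

unmatched
```

If the unmatched table is longer than expected, the lookup table or the codes in the data may need a closer look before the rows are discarded.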
17.6 Optional Challenge
For a little extra challenge, try to incorporate an if statement that looks for NA values in the common name field you are adding. What other conditionals might you include to make your function smarter?
Answer
#' Function to add common name to data.frame according to the BBL list of species codes
#' @param df A data frame containing BBL species codes in column `species`
#' @param species A data frame defining BBL species codes with columns `alpha_code` and `common_name`
#' @return A data frame with original data df, plus the common name of species
assign_species_name <- function(df, species){
  if (!("alpha_code" %in% names(species)) |
      !("species" %in% names(df)) |
      !("common_name" %in% names(species))){
    stop("Tables appear to be formatted incorrectly.")
  }

  return_df <- left_join(df, species, by = c("species" = "alpha_code"))

  if (nrow(return_df) > nrow(df)){
    warning("Joined table has more rows than original table. Check species table for duplicated code values.")
  }

  if (length(which(is.na(return_df$common_name))) > 0){
    x <- length(which(is.na(return_df$common_name)))
    warning(paste("Common name has", x, "rows containing NA"))
  }

  return(return_df)
}
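One further conditional you might consider is a check for duplicated codes in the lookup table itself, since duplicates are what make a left join return more rows than it started with. The sketch below is one possible way to do that. The helper find_duplicate_codes() and the table toy_codes are hypothetical names introduced here for illustration and are not part of the lesson's answer.

```r
library(dplyr)

# Hypothetical lookup table in which one code appears twice
toy_codes <- tibble(
  alpha_code  = c("amgp", "dunl", "dunl"),
  common_name = c("American Golden-Plover", "Dunlin", "Dunlin")
)

# Hypothetical helper: return the codes that occur more than once
find_duplicate_codes <- function(species){
  species %>%
    count(alpha_code, name = "n") %>%
    filter(n > 1) %>%
    pull(alpha_code)
}

dup_codes <- find_duplicate_codes(toy_codes)

# Report the offending codes before any join is attempted
if (length(dup_codes) > 0){
  message("Duplicated codes: ", paste(dup_codes, collapse = ", "))
}
```

Running a check like this before the join points at the specific codes to fix, whereas the row-count warning only signals that something is duplicated.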