Learning Objectives
- Practice joining tables together
- Practice identifying primary and foreign keys
- Practice using common cleaning and wrangling functions
About the data
These exercises will be using bird survey data collected from the central Arizona-Phoenix metropolitan area by Arizona State University researchers [@warren2021].
9.1 Exercise: Practice Joins
9.1.1 Read in the data
Answer
# read in data using download links
bird_observations <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-cap.256.10&entityid=53edaa7a0e083013d9bf20322db1780e")
surveys <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-cap.256.10&entityid=b2466fa5cb5ed7ee1ea91398fc291c59")
sites <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-cap.256.10&entityid=81bf72420e69077097fb0790dcdc63a6")
taxalist <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-cap.256.10&entityid=58f863b7e3066e68536a9cacdc7bd58e")
# read in data from the data directory after manually downloading data
bird_observations <- read_csv("data/52_pp52_birds_1.csv")
surveys <- read_csv("data/52_pp52_surveys_1.csv")
sites <- read_csv("data/52_pp52_sites_1.csv")
taxalist <- read_csv("data/52_pp52_taxalist_1.csv")
9.1.2 Get familiar with the data
Answer
# view data in a spreadsheet-style viewer
View(bird_observations)
# returns dimensions of the dataframe by number of rows and number of cols
dim(bird_observations)
# returns the top six rows of the dataframe
head(bird_observations)
# returns all the columns and some info about the cols
glimpse(bird_observations)
# similar to glimpse but returns some summary statistics about the cols
summary(bird_observations)
# returns column names
names(bird_observations)
# returns unique values in a column. In this case we can see all the different bird species IDs
unique(bird_observations$species_id)
Answer:
-
bird_observations
: Primary key is a combination ofsurvey_id
,site_id
, andspecies_id
. The foreign key isspecies_id
. -
taxalist
: Primary key isspecies_id
and does not have a foreign key.
9.1.3 Create a subset of bird_observations
9.1.4 Use left_join()
to merge birds_subset
with the tables sites
and taxalist
Answer:
I expect to see all columns and all observations from birds_subset
and from sites
, I expect to see the columns park_code
, park_district
, park-name
, point_code
, point_location
and park_acreage
and only observations for NU-C
and LI-W
because those are the only site_id
values in birds_subset
and in a left join only the observations matching the left table (in this case, birds_subset
is the left table) will be kept.
9.1.5 Use full_join()
to merge birds_subset
and sites
tables
Answer:
I expect to see all columns and all observations from birds_subset
and all columns and all observations from sites
to be merged into one data frame because in a full join everything is kept.
9.1.6 Use inner_join()
to merge birds_subset
and taxalist
data
Answer:
I expect to only have data merge together based on species_id
and since there is only BHCO
and RWBL
in birds_subset
then I will only retain data related to those two species. I will also expect to see the columns from taxalist
: common_name
and asu_itis
to be merged into the joined table.
Answer
# syntax using pipe
birds_inner <- birds_subset %>%
inner_join(y = taxalist, by = "species_id")
# syntax without pipe
birds_inner <- inner_join(x = birds_subset, y = taxalist, by = "species_id" )
Answer:
You have the same resulting data set regardless of using inner_join()
or left_join()
to merge bird_subset
and taxalist
. The reasons for this are:
inner_join()
keeps only the rows (observations) that have a matching key across both data sets - here,species_id
is our key, and the only rows that match across both data sets are those wherespecies_id
equalsBHCO
orRWBL
left_join()
keeps all rows from the left table (in our case, the left table isbirds_subset
) and merges on data with matching keys (species_id
) on the right (here, the right table istaxalist
). Because our left data set (birds_subset
) only containsspecies_id
s equal toBHCO
orRWBL
, only rows with those species will be kept from the right data set (taxalist
)
You’d expect the outcome to differ from an inner_join()
if birds_subset
contained an observation with a species_id
that was not found in taxalist
. If there was an observation of a species_id
in birds_subset
that was not in taxalist
, then that observation would be kept, and NA
s would be assigned to the common_name
and asu_itis
columns for that observations
9.2 Exercise: Practice Wrangling & Joining Data
9.2.1 Wrangle bird_observations
data and merge the data with all the other tables (sites
, surveys
, and taxalist
)
9.2.2 Explore observer
data and fix the values within this column so that all values are in the same format
Answer
unique(bird_obs_subset$observer)
9.2.3 Bonus: Use a new package lubridate
to wrangle the date data and find the total number of birds by park and month
Hint: How do you learn about a new function or package?
Answer
bird_obs_subset <- bird_obs_subset %>%
mutate(survey_month = lubridate::month(survey_date)) %>%
mutate(survey_month = as.factor(survey_month)) %>%
dplyr::relocate(survey_month, .after = survey_date) %>%
filter(park_name %in% c("Lindo", "Orme", "Palomino", "Sonrisa")) %>%
group_by(park_name, survey_month) %>%
summarize(tot_bird_count_month = n())
Take a look at your final data frame. Does it give you the outcome you expected? Is it informative? How would you improve this wrangling process?