4 Session 4: Model Analysis
library(daymetr)
library(dplyr)
library(ggplot2)
theme_set(theme_bw())
options(digits = 3)
4.1 Learning Objectives
Today you will learn to:
- Use visual and statistical approaches to compare model output to data
- Investigate the relationships between model inputs and outputs using sensitivity analysis and variance decomposition
4.2 Comparing models to data
Daymet is a gridded weather data product: it provides daily weather statistics for North America on a 1 km grid. The primary input to Daymet is data from weather stations; values for each grid cell are imputed using a number of statistical methods.
Today we are going to use this gridded product as our ‘model,’ and we are going to compare the output of this model to data from a weather station in Maricopa, AZ.
Let’s download the daily weather data from Daymet:
mac_daymet_list <- download_daymet(site = "Maricopa Agricultural Center",
                                   lat = 33.07,
                                   lon = -111.97,
                                   start = 2020,
                                   end = 2020,
                                   internal = TRUE)
## Downloading DAYMET data for: Maricopa Agricultural Center at 33.07/-111.97 latitude/longitude !
## Done !
mac_daymet <- mac_daymet_list$data %>%
  mutate(tmin = tmin..deg.c., tmax = tmax..deg.c., tmean = (tmin + tmax) / 2) %>%
  select(doy = yday, tmean, tmax, tmin)
Before we look at the ground truth, let’s visualize these outputs and assess whether they are plausible.
ggplot(data = mac_daymet, aes(doy, tmean)) +
geom_point() +
geom_line()
Now let’s get some ground truth: data measured at a weather station, to compare against our model.
The AZMet data format is described at https://cals.arizona.edu/azmet/raw2003.htm. On that page you can see that these are the columns we are interested in:
col | variable |
---|---|
2 | day of year |
4 | max temperature |
5 | min temperature |
6 | mean temperature |
We can read in the data and rename the columns of interest in the following steps:
mac_azmet <- read.csv('https://cals.arizona.edu/azmet/data/0621rd.txt', header = FALSE) %>%
  select(doy = V2, tmean = V6, tmax = V4, tmin = V5)
Now, let’s join the data frames. We will use suffixes to identify model (Daymet) and observed station (AZMet) data.
comparison <- mac_daymet %>%
  left_join(mac_azmet, by = 'doy', suffix = c('_model', '_obs')) %>%
  filter(!is.na(tmean_obs))
head(comparison)
## doy tmean_model tmax_model tmin_model tmean_obs tmax_obs
## 1 1 9.04 16.1 1.98 7.7 18.7
## 2 2 9.97 17.0 2.94 7.5 17.9
## 3 3 10.09 18.0 2.16 7.3 18.7
## 4 4 11.22 19.9 2.57 8.7 21.1
## 5 5 11.44 20.3 2.58 10.4 21.4
## 6 6 12.25 21.3 3.23 9.4 20.7
## tmin_obs
## 1 -1.1
## 2 -1.9
## 3 -1.6
## 4 -1.1
## 5 1.0
## 6 -0.5
Just looking at the table, what do you notice about the observed and modeled temperatures?
- looks like the tmax observed is generally higher and tmin is generally lower!
- the station observed five days below freezing in the first six days of the year!
Now, let’s see how the model and data compare:
ggplot(data = comparison) +
geom_line(aes(doy, tmean_model)) +
geom_line(aes(doy, tmin_model), color = 'grey') +
geom_line(aes(doy, tmax_model), color = 'grey') +
geom_line(aes(doy, tmean_obs), color = 'red') +
geom_line(aes(doy, tmin_obs), color = 'pink') +
geom_line(aes(doy, tmax_obs), color = 'pink')
ggplot(data = comparison) +
geom_point(aes(doy, tmean_model - tmean_obs))
ggplot(data = comparison) +
geom_point(aes(tmean_model, tmean_obs)) +
geom_abline(aes(intercept = 0, slope = 1)) +
ylim(0, 40) +
xlim(0, 40)
4.2.1 Statistical tests of model performance
We are going to talk about a set of key metrics for comparing models to data:
slope, RMSE, \(R^2\), bias, and the SD ratio.
4.2.1.1 Slope
The slope of the regression of observed on predicted: \(Y_{obs}= bY_{mod}+\epsilon\). Does slope \(b=1\)?
mod <- comparison$tmean_model
obs <- comparison$tmean_obs
reg <- lm(obs ~ mod)

plot(mod, obs, xlab = 'model tmean', ylab = 'obs tmean')
abline(0, 1)
abline(coef(reg), lty = 2)
summary(reg)
##
## Call:
## lm(formula = obs ~ mod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.24 -2.85 -0.34 2.43 10.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1546 0.7042 3.06 0.0024 **
## mod 0.8514 0.0263 32.39 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.95 on 309 degrees of freedom
## Multiple R-squared: 0.772, Adjusted R-squared: 0.772
## F-statistic: 1.05e+03 on 1 and 309 DF, p-value: <2e-16
Does the confidence interval of the regression include the 1:1 line (intercept 0, slope 1)? Why is the p-value so low?
coef(reg)
## (Intercept) mod
## 2.155 0.851
confint(reg)
## 2.5 % 97.5 %
## (Intercept) 0.769 3.540
## mod 0.800 0.903
Is the slope 1:1?
What about for tmin and tmax?
summary(lm(tmin_obs ~ tmin_model, data = comparison))
summary(lm(tmax_obs ~ tmax_model, data = comparison))
4.2.1.2 RMSE: Root Mean Square Error
This is the most common metric of model error: it measures the typical magnitude of the residuals (for an unbiased model, it equals the standard deviation of the residuals).
\[RMSE=\sqrt{\frac{1}{n}\sum{(Y_{model}-Y_{obs})^2}}\]
RMSE <- sqrt(mean((mod - obs)^2))
RMSE
## [1] 4.45
Sometimes you want a “normalized” estimate of model error, so that the magnitude of the error is expressed relative to the scale of the observations. There are many ways to do this. A common approach is to divide by the mean of the observations: \(NRMSE=RMSE/\bar{Y}_{obs}\) is analogous to the coefficient of variation \(CV=\frac{\sigma}{\bar{x}}\).
## here we normalize by the standard deviation of the observations
NRMSE <- RMSE / sd(obs)
Other approaches to normalization (compared in the short example after this list):
- \(RMSE/sd(obs)\)
- \(RMSE/\bar{Y_{obs}}\)
- \(RMSE/IQR\)
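For the tmean comparison above, each normalization is one line of base R (IQR() is the interquartile range from the stats package):
RMSE / sd(obs)    # normalized by the standard deviation of observations
RMSE / mean(obs)  # normalized by the mean of observations
RMSE / IQR(obs)   # normalized by the interquartile range
The choice of denominator changes the scale of the score but not the ranking of models.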
4.2.1.3 \(R^2\) Correlation Coefficient
When comparing models to data, it is common to calculate \(R^2\) in terms of deviation from the 1:1 line rather than from a fitted regression line:
\[R^2=1-\frac{\sum{(Y_{mod}-Y_{obs})^2}}{\sum{(Y_{obs}-\bar{Y}_{obs})^2}}\]
cor(mod, obs)
## [1] 0.879
cor.test(mod, obs)
##
## Pearson's product-moment correlation
##
## data: mod and obs
## t = 32, df = 309, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.851 0.902
## sample estimates:
## cor
## 0.879
4.2.1.4 Bias
Does the model over- or under-predict, on average?
\[Bias=\frac{1}{n}\sum{(Y_{model}-Y_{obs})}\]
mean(mod - obs)
## [1] 1.62
# normalized
mean(mod - obs)/mean(obs)
## [1] 0.068
It is also useful to visualize and compute statistics on these differences.
hist(mod-obs)
What fraction of the time is the model above the observation?
sum(mod > obs) / length(obs)
## [1] 0.685
4.2.1.5 Ratio of variances
Does the model capture the observed variance? Often a model will have lower variance than the observations … why?
\[\frac{SD_{Y_{obs}}}{SD_{Y_{model}}}\]
sd(obs) / sd(mod)
## [1] 0.969
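Putting these together, here is a minimal sketch of a helper that collects the scores above for any pair of model predictions and observations (the function name model_metrics is just an illustration):
# Collect the comparison metrics defined above for a model/observation pair
model_metrics <- function(mod, obs) {
  reg <- lm(obs ~ mod)
  c(slope    = unname(coef(reg)[2]),       # regression slope; 1 is ideal
    RMSE     = sqrt(mean((mod - obs)^2)),  # typical error magnitude
    R2       = 1 - sum((mod - obs)^2) / sum((obs - mean(obs))^2),  # vs 1:1 line
    bias     = mean(mod - obs),            # mean over/under prediction
    sd_ratio = sd(obs) / sd(mod))          # ratio of variability
}
model_metrics(mod, obs)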
4.2.1.6 Taylor Diagram - three metrics in two dimensions
A Taylor Diagram allows you to plot three metrics of model performance - the RMSE, the correlation, and the SD ratio - in two dimensions, taking advantage of the mathematical relationships among these scores.
It is common to plot multiple points representing different models or model parameterizations, or different years or locations at which the model is evaluated. Plotting normalized values allows comparison across the variables being predicted.
library(plotrix)
taylor.diagram(ref=obs, model = mod, normalize = TRUE, ref.sd = TRUE)
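Additional points can be added to the same diagram with add = TRUE. For example, a sketch adding the tmin and tmax comparisons to the normalized diagram above (the colors are arbitrary choices):
taylor.diagram(ref = comparison$tmin_obs, model = comparison$tmin_model,
               normalize = TRUE, add = TRUE, col = 'blue')
taylor.diagram(ref = comparison$tmax_obs, model = comparison$tmax_model,
               normalize = TRUE, add = TRUE, col = 'darkgreen')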
4.3 Simulation Models
Yesterday I introduced the concept and application of simulation models, with examples related to simulating plant growth. In a regression model, it is often possible to see how each parameter relates to the model output (e.g. slope, intercept).
For a complex model, it is often not possible to understand the high-level dynamics of the system by analyzing the mathematics. Another approach is to treat the model as a ‘black box’: an experimental system that we can perturb and then observe how it responds.
Let’s look at this in more detail.
4.3.1 Example: A leaf level model of photosynthesis
Based on the coupled C4 photosynthesis - stomatal conductance model developed by Collatz et al., with the Ball-Berry conductance model:
G. Collatz, M. Ribas-Carbo, J. Berry (1992). Coupled photosynthesis-stomatal conductance model for leaves of C4 plants. Australian Journal of Plant Physiology 19:519–538.
Water transpiration is coupled to photosynthesis through stomatal conductance, since plants need to regulate water loss as they uptake \(CO_2\):
\[g_s = m\frac{A_n h_s}{c_a}p + b\]
Where \(g_s\) is stomatal conductance, \(A_n\) is net photosynthesis, \(h_s\) is relative humidity, \(c_a\) is the \(CO_2\) concentration at the leaf surface, \(p\) is atmospheric pressure, and \(m\) and \(b\) are the slope and intercept of the Ball-Berry model (the parameters b1 and b0 below).
and photosynthesis is
\[A_n=\min(A_c, A_L)-R_d\]
where the Rubisco-limited rate is
\[A_c=V_m\left[\frac{c_i-\Gamma}{c_i+K_c(1+O_2/K_o)}\right]\]
and the light (RuBP regeneration) limited rate is, to a first approximation, proportional to the quantum flux:
\[A_L=\alpha Q_p\]
This is a non-linear model with key plant physiological traits:
Parameter | Description |
---|---|
Vmax | maximum carboxylation of Rubisco according to the Collatz model. |
alpha | alpha parameter according to the Collatz model. Initial slope of the response to Irradiance. |
kparm | k parameter according to the Collatz model. Initial slope of the response to CO2. |
theta | theta parameter according to the Collatz model. Curvature for light response. |
beta | beta parameter according to the Collatz model. Curvature for response to CO2. |
Rd | Rd parameter according to the Collatz model. Dark respiration. |
b0 | intercept for the Ball-Berry stomatal conductance model. |
b1 | slope for the Ball-Berry stomatal conductance model. |
The rate of photosynthesis is determined by environmental factors:
Parameter | Description |
---|---|
Tl | temperature of the leaf (Celsius). |
RH | relative humidity (as a fraction, i.e. 0-1). |
Qp | quantum flux (direct light), (micro mol m-2 s-1). |
Catm | Atmospheric CO2 in ppm (or micromol/mol). |
In the end:
\[[Gs, A, C_i]=f(T, RH, Light, CO_2, v_{max}, \alpha, k, \theta, R_d, b_0, b_1)\]
4.3.2 Let’s run this model!
First, let’s load some meteorological data:
library(BioCro) #devtools::install_github('ebimodeling/biocro')
library(lubridate)
library(ggplot2)
data("weather04")
time <- ymd('2004-01-01') + days(weather04$doy - 1) + hours(weather04$hour)
par  <- weather04$solarR
rh   <- weather04$RH
temp <- weather04$DailyTemp.C
Now, estimate the photosynthesis rate at each time step. This isn’t the entire model - just what would be happening to the leaf of a plant with C4 photosynthesis in full sunlight.
A <- c4photo(Qp = par, Tl = temp, RH = rh)$Assim
pairs(data.frame(A, par, temp, rh))
## whoa, this is weird! What is going on?
plot(temp, rh)
This is an artifact of a scaling algorithm that interpolates daily weather to hourly values; see also https://rpubs.com/dlebauer/metdriver_comparisons.
ggplot()+
geom_line(aes(time, A)) +
scale_x_datetime(limits = ymd_h(c('2004-05-01 0', '2004-06-01 23')))
## Warning: Removed 7992 row(s) containing missing values
## (geom_path).
ggplot()+
geom_line(aes(time, rh)) +
scale_x_datetime(limits = ymd_h(c('2004-05-01 0', '2004-06-01 23')))
## Warning: Removed 7992 row(s) containing missing values
## (geom_path).
Question: is f(mean(X)) = mean(f(X))?
testQp <- 11:20 * 100
testTl <- 21:30
testRH <- 21:30 / 50
A1 <- c4photo(Qp = mean(testQp),
              Tl = mean(testTl),
              RH = mean(testRH))
A2 <- lapply(c4photo(Qp = testQp, Tl = testTl, RH = testRH), mean)

dplyr::bind_rows(A1 = A1, A2 = A2)
## # A tibble: 2 × 4
## Gs Assim Ci GrossAssim
## <dbl> <dbl> <dbl> <dbl>
## 1 239. 31.3 171. 32.1
## 2 243. 31.4 173. 32.2
Why are these different?
For non-linear functions, see Jensen’s Inequality (Wikipedia).
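A two-line illustration with the convex function \(f(x)=x^2\):
x <- c(1, 2, 9)
mean(x)^2  # f(mean(x)) = 16
mean(x^2)  # mean(f(x)) = 28.7: for convex f, mean(f(x)) >= f(mean(x))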
This means: be careful when and how you use averages - everywhere! Spatial and temporal downscaling is how crop modelers deal with lower-resolution atmospheric model forecasts. For example, the most recent IPCC 100-year climate projections were generated on ~100 km grids (Taylor et al. 2012), so one data point may simultaneously represent a month that is \(60^o\)F and foggy in San Francisco and \(100^o\)F and dry in Davis, CA. At the same time, crop models need to run on local hourly data while also capturing the uncertainty represented by within- and across-model variability.
4.3.2.1 Model sensitivity
Next, we are going to look at the sensitivity of the model to each of the weather inputs, holding the others at their means:
meanQp <- mean(par)
meanTl <- mean(temp)
meanRH <- mean(rh)
plot(1:100/100, c4photo(Qp = rep(meanQp, 100),
                        Tl = rep(meanTl, 100),
                        RH = 1:100/100)$Assim,
     type = 'l', ylab = 'Assim', xlab = 'RH')
plot(1:100/4, c4photo(Qp = rep(meanQp, 100),
                      Tl = 1:100/4,
                      RH = rep(meanRH, 100))$Assim,
     type = 'l', ylab = 'Assim', xlab = 'leaf temperature (C)')
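For completeness, the same exercise for light. Here I assume a Qp range of 0-2000 micro mol m-2 s-1, which spans typical direct sunlight:
plot(1:100 * 20, c4photo(Qp = 1:100 * 20,  # 20 to 2000 (assumed range)
                         Tl = rep(meanTl, 100),
                         RH = rep(meanRH, 100))$Assim,
     type = 'l', ylab = 'Assim', xlab = 'Qp')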
4.3.2.2 Monte Carlo Error propagation
Given model parameters represented as random variables, e.g.
\[V_{cmax}\sim N(45,2)\\ R_d\sim N(1, 0.1)\\ b_1 \sim N(4,1)\]
what is the expected uncertainty in model predictions of photosynthesis (A = assimilation)?
Let’s take some samples from these distributions:
set.seed(100)
n <- 1000
vmax <- rnorm(n, 45, 2)
Rd   <- rnorm(n, 1, 0.10)
b1   <- rnorm(n, 4, 1)
Now we can take a look at them
x <- 25:75
ggplot() +
  geom_histogram(aes(vmax, y = ..density..)) +
  geom_line(aes(x, dnorm(x, 45, 2)))
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
x <- 1:200/100
ggplot() +
  geom_histogram(aes(Rd, y = ..density..)) +
  geom_line(aes(x, dnorm(x, 1, 0.1)))
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Next, we are going to propagate the parameter variance to see how much of an effect it has on the model output variance:
### sample given time series of met
A <- matrix(nrow = length(time), ncol = 1000)
for(i in 1:1000){
  A[, i] <- c4photo(Qp = par, Tl = temp, RH = rh,
                    vmax = vmax[i], Rd = Rd[i], b1 = b1[i])$Assim
}
# take a look at the matrix ..
# image(A, xlab = 'time', ylab = 'sample')
## shows an annual cycle of photosynthesis
median <- which.min(abs(quantile(colMeans(A), 0.500) - colMeans(A)))
ucl    <- which.min(abs(quantile(colMeans(A), 0.975) - colMeans(A)))
lcl    <- which.min(abs(quantile(colMeans(A), 0.025) - colMeans(A)))
ggplot() +
# geom_smooth(aes(time, A))+
geom_line(aes(time, A[,median])) +
geom_line(aes(time, y = A[,lcl]), linetype = 2) +
geom_line(aes(time, y = A[,ucl]), linetype = 2) +
scale_x_datetime(limits = ymd_h(c('2004-05-01 0', '2004-05-07 23')))
## Warning: Removed 8592 row(s) containing missing values
## (geom_path).
## Warning: Removed 8592 row(s) containing missing values
## (geom_path).
## Warning: Removed 8592 row(s) containing missing values
## (geom_path).
4.3.3 Variance Decomposition - which of these parameters are most important?
The response of interest is the total assimilation:
a_total <- colMeans(A)
summary(aov(a_total ~ vmax + Rd + b1))
## Df Sum Sq Mean Sq F value Pr(>F)
## vmax 1 22.66 22.66 8593 <2e-16 ***
## Rd 1 2.55 2.55 966 <2e-16 ***
## b1 1 0.40 0.40 153 <2e-16 ***
## Residuals 996 2.63 0.00
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
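To express these sums of squares as fractions of the total variance explained by each parameter, one option is a short sketch using the broom package (assuming it is installed):
library(broom)
tidy(aov(a_total ~ vmax + Rd + b1)) %>%
  mutate(frac_var = sumsq / sum(sumsq))  # fraction of total variance per term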
4.3.4 Propagate met: use met as a variable, sampling over midday summer variation
To answer the question: if I were to step outside and look at the plants in the summer, how fast would they be growing?
This time, instead of sampling from a parameterized distribution, I am going to sample over the middle of the day (hours 10:00-16:00) on days in the middle of the year (approximately summer, days of year 120-240).
### sample over met variability
A2 <- Gs <- Ci <- Qp <- Tl <- RH <- vector(length = 1000)
for(i in 1:1000){
  day  <- sample(120:240, size = 1)
  hour <- sample(10:16, size = 1)
  j    <- day * 24 + hour
  Qp[i] <- par[j]
  Tl[i] <- temp[j]
  RH[i] <- rh[j]
  res   <- c4photo(Qp = Qp[i], Tl = Tl[i], RH = RH[i],
                   vmax = vmax[i], Rd = Rd[i], b1 = b1[i])
  A2[i] <- res$Assim
  Gs[i] <- res$Gs
  Ci[i] <- res$Ci
}
hist(A2)
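To put a number on the opening question, the median and a 95% interval of midday summer assimilation:
quantile(A2, c(0.025, 0.5, 0.975))  # 2.5%, median, 97.5%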
The pairs plot is a visual equivalent of a sensitivity analysis (here A2, Gs, and Ci are the response variables):
pairs(data.frame(A2, Gs, Ci, vmax, Rd, b1, Qp, Tl, RH), pch = '.')
The linear model is essentially a sensitivity analysis: each coefficient estimates the slope of the effect of one input on the output of the model.
summary(lm(A2 ~ vmax + Rd + b1 + Qp + Tl + RH))
##
## Call:
## lm(formula = A2 ~ vmax + Rd + b1 + Qp + Tl + RH)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.33 -1.35 0.71 1.61 6.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.61e+01 2.42e+00 -10.78 < 2e-16 ***
## vmax 2.94e-01 3.94e-02 7.47 1.8e-13 ***
## Rd 5.76e-02 8.29e-01 0.07 0.94
## b1 3.67e-01 7.75e-02 4.73 2.6e-06 ***
## Qp 1.05e-02 1.93e-04 54.54 < 2e-16 ***
## Tl 1.08e+00 1.86e-02 57.99 < 2e-16 ***
## RH 9.66e-01 1.28e+00 0.76 0.45
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 993 degrees of freedom
## Multiple R-squared: 0.895, Adjusted R-squared: 0.894
## F-statistic: 1.41e+03 on 6 and 993 DF, p-value: <2e-16
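Note that these raw slopes are in different units (per micro mol of photons, per degree C, and so on), so they are not directly comparable. A sketch of one remedy: standardize all variables first, so each slope is in standard deviations of A2 per standard deviation of the input:
# Standardize inputs and output so the slopes are directly comparable
scaled <- as.data.frame(scale(data.frame(A2, vmax, Rd, b1, Qp, Tl, RH)))
summary(lm(A2 ~ vmax + Rd + b1 + Qp + Tl + RH, data = scaled))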
The analysis of variance partitions the variance: how much of the total variance in A2 is contributed by each of the following parameters? (Recall that the domain we sampled for the met variables, midday in summer, is much smaller than the full domain over which the physiological parameters were sampled.)
What would happen if we used a whole year of meteorological data instead of only the midday summer hours that we used?
summary(aov(A2 ~ vmax + Rd + b1 + Qp + Tl + RH))
## Df Sum Sq Mean Sq F value Pr(>F)
## vmax 1 932 932 142.26 <2e-16 ***
## Rd 1 21 21 3.21 0.074 .
## b1 1 33 33 5.01 0.025 *
## Qp 1 32153 32153 4906.97 <2e-16 ***
## Tl 1 22380 22380 3415.44 <2e-16 ***
## RH 1 4 4 0.57 0.450
## Residuals 993 6507 7
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(aov(Gs ~ vmax + Rd + b1 + Qp + Tl + RH))
tidy(aov(Ci ~ vmax + Rd + b1 + Qp + Tl + RH))
4.3.5 References
Taylor, K.E., R.J. Stouffer, and G.A. Meehl (2012). An overview of CMIP5 and the experiment design. Bull. Amer. Meteor. Soc. 93:485-498. doi:10.1175/BAMS-D-11-00094.1. http://journals.ametsoc.org/doi/pdf/10.1175/BAMS-D-11-00094.1
Humphries, S. and S.P. Long (1995). WIMOVAC: a software package for modeling the dynamics of plant leaf and canopy photosynthesis. Computer Applications in the Biosciences 11(4):361-371.
Miguez, F.E., X.G. Zhu, S. Humphries, G.A. Bollero, and S.P. Long (2009). A semimechanistic model predicting the growth and production of the bioenergy crop Miscanthus × giganteus: description, parameterization and validation. Global Change Biology Bioenergy 1:282-296.
Wang, D., D. Jaiswal, D.S. LeBauer, T.M. Wertin, G.A. Bollero, A.D.B. Leakey, and S.P. Long (2015). A physiological and biophysical model of coppice willow (Salix spp.) production yields for the contiguous USA in current and future climate scenarios. Plant, Cell & Environment 38(9):1850-1865.
Thornton, M.M., R. Shrestha, Y. Wei, P.E. Thornton, S. Kao, and B.E. Wilson (2020). Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 4. ORNL DAAC, Oak Ridge, Tennessee, USA. https://doi.org/10.3334/ORNLDAAC/1840