Applied Bayesian Analyses in R

Part 2

Sven De Maeyer

Your Turn

  • Open MarathonData.RData
  • Estimate two Bayesian models
  • Model 1: only an intercept
  • Model 2: introduce the effects of km4week and sp4week on MarathonTimeM
  • Make plots with the plot() function
  • What do we learn?

MarathonTimes_Mod1 <- brm(                        
  MarathonTimeM ~ 1, # We only model an intercept 
  data = MarathonData,                         
  backend = "cmdstanr",  
  cores = 4,
  seed = 1975                          
)

MarathonTimes_Mod2 <- brm(                        
  MarathonTimeM ~ km4week + sp4week, 
  data = MarathonData,                         
  backend = "cmdstanr",  
  cores = 4,
  seed = 1975                          
)

Model comparison with loo cross-validation

\(\sim\) AIC or BIC in Frequentist statistics

\(\widehat{elpd}\): “expected log predictive density” (a higher \(\widehat{elpd}\) implies better model fit, without being sensitive to overfitting!)
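For a dataset with \(n\) observations, the LOO estimate sums the log predictive density of each observation under the model fitted to the remaining observations (Vehtari, Gelman & Gabry, 2017):

\[\widehat{elpd}_{loo} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i})\]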

loo_Mod1 <- loo(MarathonTimes_Mod1)
loo_Mod2 <- loo(MarathonTimes_Mod2)

Comparison<- 
  loo_compare(
    loo_Mod1, 
    loo_Mod2
    )

print(Comparison, simplify = FALSE)

Model comparison with loo cross-validation

                   elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo  looic se_looic
MarathonTimes_Mod2       0.0     0.0   -356.3        12.0   7.2      4.4  712.6     24.0
MarathonTimes_Mod1     -39.9    12.6   -396.2         5.3   1.7      0.3  792.3     10.6

Model 2 comes out on top: its \(\widehat{elpd}\) is 39.9 points higher than that of Model 1, a difference that is large relative to its standard error of 12.6.

WAMBS checklist

When to Worry and How to Avoid the Misuse of Bayesian Statistics

by Laurent Smeets and Rens van de Schoot

Before estimating the model:

  1. Do you understand the priors?

After estimation before inspecting results:

  1. Does the trace-plot exhibit convergence?
  2. Does convergence remain after doubling the number of iterations?
  3. Does the posterior distribution histogram have enough information?
  4. Do the chains exhibit a strong degree of autocorrelation?
  5. Do the posterior distributions make substantive sense?

Understanding the exact influence of the priors

  1. Do different specifications of the multivariate variance priors influence the results?
  2. Is there a notable effect of the prior when compared with non-informative priors?
  3. Are the results stable from a sensitivity analysis?
  4. Is the Bayesian way of interpreting and reporting model results used?

WAMBS Template to use

  • Dropbox

  • File called WAMBS_workflow_MarathonData.qmd (quarto document)

  • Create your own project and project folder

  • Copy the template and rename it

  • We will go through the different parts in the slide show

  • You can apply/adapt the code in the template

Preparations for applying it to the Marathon model

Packages needed:

library(here)
library(tidyverse)
library(brms)
library(bayesplot)
library(ggmcmc)
library(patchwork)
library(priorsense)

Preparations for applying it to the Marathon model

Load the dataset and the model:

load(
  file = here("Presentations", "MarathonData.RData")
)

MarathonTimes_Mod2 <-
  readRDS(file = 
            here("Presentations",
              "Output",
              "MarathonTimes_Mod2.RDS")
          )

Focus on the priors before estimation

Remember: priors come in many disguises

Uninformative/Weakly informative

When objectivity is crucial and you want to let the data speak for itself…

Informative

When including substantive prior information is crucial, based on e.g.:

  • previously collected data
  • results from former research/analyses
  • data of another source
  • theoretical considerations
  • elicitation

brms defaults

  • Weakly informative priors

  • If the dataset is big, the impact of the priors is minimal

  • But, always better to know what you are doing!

  • Complex models might run into convergence issues \(\rightarrow\) specifying more informative priors might help!

So, how to deviate from the defaults?

Check priors used by brms

Function: get_prior( )

Remember our model 2 for Marathon Times:

\[\begin{aligned} & \text{MarathonTimeM}_i \sim N(\mu_i,\sigma_e)\\ & \mu_i = \beta_0 + \beta_1*\text{km4week}_i + \beta_2*\text{sp4week}_i \end{aligned}\]

get_prior(
  MarathonTimeM ~ 1 + km4week + sp4week, 
  data = MarathonData
)

Check priors used by brms

  • prior: type of prior distribution
  • class: parameter class (with b being the population-level effects)
  • coef: name of the coefficient within the parameter class
  • group: grouping factor for group-level parameters (when using mixed effects models)
  • resp: name of the response variable when using multivariate models
  • lb & ub: lower and upper bound for parameter restriction

Visualizing priors

The best way to make sense of the priors used is visualizing them!

There are many options.

Over to the WAMBS template (see Dropbox)!

There we demonstrate the use of ggplot2, metRology, ggtext and patchwork to visualize the priors.

Visualizing priors

library(metRology)
library(ggplot2)
library(ggtext)
library(patchwork)

# Setting a plotting theme
theme_set(theme_linedraw() +
            theme(text = element_text(family = "Times", size = 8),
                  panel.grid = element_blank(),
                  plot.title = element_markdown())
)

# Generate the plot for the prior of the Intercept (mu)
Prior_mu <- ggplot( ) +
  stat_function(
    fun = dt.scaled,    # We use the dt.scaled function of metRology
    args = list(df = 3, mean = 199.2, sd = 24.9), # df, location and scale of the prior
    xlim = c(120,300)
  ) +
  scale_y_continuous(name = "density") +
  labs(title = "Prior for the intercept",
       subtitle = "student_t(3,199.2,24.9)")

# Generate the plot for the prior of the residual standard deviation (sigma)
Prior_sigma <- ggplot( ) +
  stat_function(
    fun = dt.scaled,    # We use the dt.scaled function of metRology
    args = list(df = 3, mean = 0, sd = 24.9), # df, location and scale of the prior
    xlim = c(0,6)
  ) +
  scale_y_continuous(name = "density") +
  labs(title = "Prior for the residual standard deviation",
       subtitle = "student_t(3,0,24.9)")

# Generate the plot for the prior of the effects of independent variables
Prior_betas <- ggplot( ) +
  stat_function(
    fun = dnorm,    # We use the normal distribution
    args = list(mean = 0, sd = 10), # mean and sd of the normal prior
    xlim = c(-20,20)
  ) +
  scale_y_continuous(name = "density") +
  labs(title = "Prior for the effects of independent variables",
       subtitle = "N(0,10)")

Prior_mu + Prior_sigma + Prior_betas +
  plot_layout(ncol = 3)

Visualizing priors

Probability density plots for the different priors used in the example model

Your Turn

  • Your data and model
  • What are the priors set by brms?
  • Can you come up with custom priors for certain parameters?
  • Try to build a rationale/argumentation for them
  • Visualize the custom or default priors

DO NOT HESITATE TO ASK FOR GUIDANCE HERE

Tip

Consider re-scaling your (in)dependent variables if it is hard to make sense of parameters a priori. E.g., standardizing variables enables you to think in effect sizes.
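A minimal sketch of such a re-scaling (the z_ variable names are only illustrative; tidyverse is loaded above):

# Standardize the outcome and the predictors, so priors can be set
# on a standardized (effect-size-like) scale
MarathonData_z <- MarathonData %>%
  mutate(
    z_MarathonTimeM = (MarathonTimeM - mean(MarathonTimeM)) / sd(MarathonTimeM),
    z_km4week       = (km4week - mean(km4week)) / sd(km4week),
    z_sp4week       = (sp4week - mean(sp4week)) / sd(sp4week)
  )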

Setting custom priors in brms


Setting custom priors can be done with the set_prior( ) command


E.g., change the priors for the betas (the effects of km4week and sp4week):


Custom_priors <- 
  c(
    set_prior(
      "normal(0,10)", 
      class = "b", 
      coef = "km4week"),
    set_prior(
      "normal(0,10)", 
      class = "b", 
      coef = "sp4week")
    )

Prior Predictive Check


Did you set sensible priors?


  • Simulate data based on the model and the priors


  • Visualize the simulated data and compare with real data


  • Check whether the plot shows impossible simulated datasets (e.g., negative marathon times)

Prior Predictive Check in brms


Step 1: Fit the model with custom priors, using the option sample_prior = "only"


Fit_Model_priors <- 
  brm(
    MarathonTimeM ~ 1 + km4week + sp4week, 
    data = MarathonData,
    prior = Custom_priors,
    backend = "cmdstanr",
    cores = 4,
    sample_prior = "only"
    )
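To verify which priors ended up in the fitted object (our custom priors for the betas, the brms defaults for the other parameters), prior_summary( ) lists them:

prior_summary(Fit_Model_priors)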

Prior Predictive Check in brms


Step 2: visualize the data with the pp_check( ) function


set.seed(1975)

pp_check(
  Fit_Model_priors, 
  ndraws = 300) # number of simulated datasets you wish for

Prior Predictive Check in brms

Check some summary statistics

  • How are summary statistics of simulated datasets (e.g., median, min, max, …) distributed over the datasets?

  • How does that compare to our real data?

  • Use type = "stat" argument within pp_check()

pp_check(Fit_Model_priors, 
         type = "stat", 
         stat = "median")

Check some summary statistics

Your Turn

  • Your data and model

  • Perform a prior predictive check

  • If necessary re-think your priors and check again

Focus on convergence of the model (before interpreting the model!)

Does the trace-plot exhibit convergence?


Create custom trace plots (aka caterpillar plots) with the ggs( ) function from the ggmcmc package

Model_chains <- ggs(MarathonTimes_Mod2)

Model_chains %>%
  filter(Parameter %in% c(
          "b_Intercept", 
          "b_km4week", 
          "b_sp4week", 
          "sigma"
          )
  ) %>%
  ggplot(aes(
    x   = Iteration,
    y   = value, 
    col = as.factor(Chain)))+
  geom_line() +
  facet_grid(Parameter ~ .,
             scales = 'free_y',
             switch = 'y') +
  labs(title = "Caterpillar Plots for the parameters",
       col   = "Chains")
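A shorter alternative (a sketch using bayesplot, which is already loaded) produces comparable trace plots:

mcmc_trace(
  as.array(MarathonTimes_Mod2),
  regex_pars = "^b_|sigma"  # intercept, betas and sigma
  )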

Does the trace-plot exhibit convergence?

Caterpillar plots for the parameters in the model

Does convergence remain after doubling the number of iterations?


Re-fit the model with more iterations


Check trace-plots again


Warning

First consider whether you need to do this! If you have a complex model that already took a long time to run, this check will take at least twice as long…
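If you do run this check, a minimal sketch is to re-fit the same model with update( ), doubling the number of iterations (the _double object name is only illustrative), and then redraw the trace plots as above:

MarathonTimes_Mod2_double <- 
  update(
    MarathonTimes_Mod2,
    iter = 4000,  # double the brms default of 2000 iterations per chain
    cores = 4,
    seed = 1975
    )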

Your Turn

  • Your data and model
  • Do the first checks on the model convergence

R-hat statistics

Sampling of parameters is done by:

  • multiple chains
  • multiple iterations within chains

If the variance between chains is large \(\rightarrow\) NO CONVERGENCE

R-hat (\(\widehat{R}\)): compares the between-chain and within-chain estimates for the model parameters
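In its classical form (Gelman et al., 2013), with \(W\) the average within-chain variance, \(B\) the between-chain variance and \(N\) the number of iterations per chain:

\[\widehat{R} = \sqrt{\frac{\frac{N-1}{N}W + \frac{1}{N}B}{W}}\]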

R-hat statistics

  • \(\widehat{R}\) < 1.015 for each parameter estimate

  • at least 4 chains are recommended

  • Effective Sample Size (ESS) > 400 to rely on \(\widehat{R}\)
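Both criteria can also be checked numerically; a sketch using the posterior package (a dependency of brms):

library(posterior)

# R-hat, bulk-ESS and tail-ESS for every parameter in the model
summarise_draws(
  as_draws_df(MarathonTimes_Mod2),
  default_convergence_measures()
  )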

R-hat in brms

mcmc_rhat() function from the bayesplot package

mcmc_rhat(rhat(MarathonTimes_Mod2), 
          size = 3
          )+ 
  yaxis_text(hjust = 1)  # to print parameter names

R-hat in brms

Your Turn

  • Your data and model

  • Check the R-hat statistics

Autocorrelation

  • Sampled parameter values are not independent!

  • So there is autocorrelation

  • But you do not want autocorrelation to have too much impact

  • 2 approaches to check this:
    – ratio of the effective sample size to the total sample size
    – plot the degree of autocorrelation

Ratio effective sample size / total sample size

  • Should be higher than 0.1 (Gelman et al., 2013)

  • Visualize making use of the mcmc_neff( ) function from bayesplot

mcmc_neff(
  neff_ratio(MarathonTimes_Mod2)
  ) + 
  yaxis_text(hjust = 1)  # to print parameter names

Ratio effective sample size / total sample size

Plot degree of autocorrelation

  • Visualize making use of the mcmc_acf( ) function
mcmc_acf(
  as.array(MarathonTimes_Mod2), 
  regex_pars = "^b_") # to plot only the parameters starting with b_ (our beta's)

Plot degree of autocorrelation

Your Turn

  • Your data and model

  • Check the autocorrelation

Rank order plots

  • an additional way to assess the convergence of the MCMC algorithm

  • if the algorithm converged, the rank histograms of all chains should look similar (close to uniform)

mcmc_rank_hist(
  MarathonTimes_Mod2, 
  regex_pars = "^b_" # only the intercept and beta's
  ) 

Rank order plots

Your Turn

  • Your data and model

  • Check the rank order plots

Focus on the Posterior

Does the posterior distribution histogram have enough information?

  • Histogram of posterior for each parameter

  • It should have a clear peak and smoothly sloping tails

Plotting the posterior distribution histogram


Step 1: create a new object with ‘draws’ based on the final model


posterior_PD <- as_draws_df(MarathonTimes_Mod2)

Plotting the posterior distribution histogram


Step 2: create histogram making use of that object


post_intercept <- 
  posterior_PD %>%
  select(b_Intercept) %>%
  ggplot(aes(x = b_Intercept)) +
  geom_histogram() +
  ggtitle("Intercept") 

post_km4week <- 
  posterior_PD %>%
  select(b_km4week) %>%
  ggplot(aes(x = b_km4week)) +
  geom_histogram() +
  ggtitle("Beta km4week") 

post_sp4week <- 
  posterior_PD %>%
  select(b_sp4week) %>%
  ggplot(aes(x = b_sp4week)) +
  geom_histogram() +
  ggtitle("Beta sp4week") 

Plotting the posterior distribution histogram


Step 3: print the plot making use of patchwork's workflow to combine plots

post_intercept + post_km4week + post_sp4week +
  plot_layout(ncol = 3)

Plotting the posterior distribution histogram

Posterior Predictive Check

  • Generate data based on the posterior probability distribution

  • Create plot of distribution of y-values in these simulated datasets

  • Overlay with distribution of observed data

Using pp_check() again, now with our fitted model:

pp_check(MarathonTimes_Mod2, 
         ndraws = 100)

Posterior Predictive Check

Posterior Predictive Check

  • We can also focus on some summary statistics (as we did with the prior predictive checks)
pp_check(MarathonTimes_Mod2, 
         ndraws = 300,
         type = "stat",
         stat = "median")

Posterior Predictive Check

Your Turn

  • Your data and model

  • Focus on the posterior and do some checks!

Prior sensitivity analyses

Why prior sensitivity analyses?

  • Often we rely on ‘arbitrarily’ chosen (default) weakly informative priors

  • What is the influence of the prior (and the likelihood) on our results?

  • You could set new priors ad hoc, re-run the analyses and compare the results (a lot of work, without strict systematic guidelines)

  • Semi-automated checks can be done with the priorsense package

Using the priorsense package

Recently, a package dedicated to prior sensitivity analyses was launched

# install.packages("remotes")
remotes::install_github("n-kall/priorsense")

Key idea: power-scaling (of both the prior and the likelihood), i.e. raising the prior or the likelihood to a power \(\alpha\) close to 1 and checking how much the posterior changes

background reading:

YouTube talk:

Basic table with indices

A first check is done by using the powerscale_sensitivity( ) function

  • the prior column contains info on the sensitivity to the prior (should be lower than 0.05)

  • the likelihood column contains info on the sensitivity to the likelihood (which we want to be high: ‘let our data speak’)

  • the diagnosis column is a verbalization of a potential problem (- if none)

powerscale_sensitivity(MarathonTimes_Mod2)

Basic table with indices

Sensitivity based on cjs_dist:
# A tibble: 4 × 4
  variable       prior likelihood diagnosis
  <chr>          <dbl>      <dbl> <chr>    
1 b_Intercept 0.000858     0.0856 -        
2 b_km4week   0.000515     0.0807 -        
3 b_sp4week   0.000372     0.0837 -        
4 sigma       0.00574      0.152  -        

Visualization of prior sensitivity

powerscale_plot_dens(
  powerscale_sequence(
    MarathonTimes_Mod2
    ),
  variables = c(
      "b_Intercept",
      "b_km4week",
      "b_sp4week"
    )
  )

Visualization of prior sensitivity

Visualization of prior sensitivity

powerscale_plot_quantities(
  powerscale_sequence(
    MarathonTimes_Mod2
    ),
  variables = c(
      "b_km4week"
      )
  )

Visualization of prior sensitivity

Your Turn

  • Your data and model

  • Check the prior sensitivity of your results