Applied Bayesian Analyses in R

Part 1

Sven De Maeyer

Introduction

Welcome

Let’s get to know each other first!

  • what’s your name?
  • a 1/2-minute pitch on your research
  • any experience with Bayesian inference?
  • what do you hope to learn?

Outline

Part 1: Introduction to Bayesian inference and some basics in brms

Part 2: Checking the model with the WAMBS checklist and how to do it in R

Part 3: Reporting and interpreting Bayesian models

Practical stuff

All material is on a shared Dropbox folder called ABAR_Turku23

You can find:

  • Folder Presentations

  • Subfolders for each part of the workshop

  • With files called Slides_Part1.html (presentation you can open) and Slides_Part1.qmd (source quarto file)

  • The html slides can be exported to pdf in your browser (use the menu button at the bottom right)

Statistical Inference

Why statistical inference?

We want to learn more about a population but we have sample data!

Therefore, we have uncertainty

Why statistical inference? an example

What is the effect of the average number of kilometers run per week on the marathon time?

We can estimate the effect by using a regression model based on a sample:

\[\text{MarathonTime}_i = \beta_0 + \beta_1 * \text{n_km}_i + \epsilon_i\]

Of course, we are not interested in the value of \(\beta_1\) in the sample; we want to learn more about \(\beta_1\) in the population!

Frequentist statistical inference (with data…)

Maximum likelihood estimation:


Call:
lm(formula = MarathonTimeM ~ km4week, data = MarathonData)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.601 -12.469  -0.536  12.816  38.670 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 199.14483    1.93856 102.728  < 2e-16 ***
km4week      -0.50907    0.07233  -7.038 4.68e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.08 on 85 degrees of freedom
Multiple R-squared:  0.3682,    Adjusted R-squared:  0.3608 
F-statistic: 49.53 on 1 and 85 DF,  p-value: 4.675e-10

Frequentist statistical inference (interpretation …)

Sample: for each additional kilometer per week, the marathon time is approximately half a minute faster

Population:

  • the probability of observing an estimate of |0.509| (or a more extreme one) if the effect is zero in the population is lower than 0.05
  • if we drew many similar samples from that population, 95% of the intervals constructed this way would contain the true effect (here: -0.6462 to -0.3718)

==> So: within that interval, -0.6 is treated as just as plausible as -0.4 …
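A minimal sketch of how that 95% confidence interval could be reproduced with confint() (assuming the lm() fit from the previous slide is stored in an object, here hypothetically called Mod_lm):

# Refit the model shown above and extract the 95% confidence
# interval for the slope (Mod_lm is a hypothetical object name)
Mod_lm <- lm(MarathonTimeM ~ km4week, data = MarathonData)
confint(Mod_lm, "km4week", level = 0.95)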

Bayesian statistical inference

Frequentist compared to Bayesian inference



Let’s explore a tutorial App by J. Kruschke (2019)

https://iupbsapps.shinyapps.io/KruschkeFreqAndBayesApp/



He has also written a nice tutorial, linked to this App:

https://jkkweb.sitehost.iu.edu/KruschkeFreqAndBayesAppTutorial.html

Advantages & Disadvantages of Bayesian analyses

Advantages:

  • Natural approach to express uncertainty
  • Ability to incorporate prior knowledge
  • Increased model flexibility
  • Full posterior distribution of the parameters
  • Natural propagation of uncertainty

Disadvantages:

  • Slow speed of model estimation
  • Some reviewers don’t understand you (“give me the p-value”)

Bayesian Inference

Bayes’ Theorem

\[ P(\theta|data) = \frac{P(data|\theta)P(\theta)}{P(data)} \] with

  • \(P(data|\theta)\) : the likelihood of the data given our model \(\theta\)
  • \(P(\theta)\) : our prior belief about model parameters
  • \(P(\theta|data)\): the posterior probability about model parameters

Likelihood

\(P(data|\theta)\) : the likelihood of the data given our model \(\theta\)

This part is about our MODEL and the parameters (aka the unknowns) in that model

Example. A model on “time needed to run a marathon” could be a normal distribution: \(y_i \sim N(\mu, \sigma)\)

For a certain average (\(\mu\)) and standard deviation (\(\sigma\)) we can calculate the probability (density) of observing a marathon time of, for instance, 240 minutes
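A minimal sketch in R (the values \(\mu\) = 200 and \(\sigma\) = 20 are only chosen here for illustration):

# Density of observing a marathon time of 240 minutes under a
# normal model with (illustrative) mu = 200 and sigma = 20
dnorm(240, mean = 200, sd = 20)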

Prior

Expression of our prior knowledge (belief) on the parameter values

Example. A model on “time needed to run a marathon” could be a normal distribution: \(y_i \sim N(\mu, \sigma)\)

e.g., How probable is an average marathon time of 120 minutes (\(P(\mu=120)\)) and how probable is a standard deviation of 40 minutes (\(P(\sigma=40)\))?
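As a sketch, such prior beliefs can be evaluated with the corresponding density functions, here using the N(210, 30) prior for \(\mu\) and the Uniform(1, 40) prior for \(\sigma\) that appear later in this part:

# Prior density of an average marathon time of 120 minutes
# under a N(210, 30) prior for mu
dnorm(120, mean = 210, sd = 30)

# Prior density of a standard deviation of 40 minutes
# under a Uniform(1, 40) prior for sigma
dunif(40, min = 1, max = 40)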

Prior as a distribution

How probable are different values for the average marathon time in a population?

Types of priors

Uninformative/Weakly informative

When objectivity is crucial and you want to let the data speak for themselves…

Informative

When including significant information is crucial

  • previously collected data
  • results from former research/analyses
  • data of another source
  • theoretical considerations
  • elicitation

Types of priors (visually)

Two potential priors for the mean marathon time in the population

Bayes’ theorem in stats



“For Bayesians, the data are treated as fixed and the parameters vary. […] Bayes’ rule tells us that to calculate the posterior probability distribution we must combine a likelihood with a prior probability distribution over parameter values.”
(Lambert, 2018, p.88)

Let’s apply this idea

Our Model for marathon times

\(y_i \sim N(\mu, \sigma)\)

Let’s apply this idea

Our Priors related to mu and sigma:

# Plot the two priors next to each other
par(mfrow = c(1, 2))
curve(dnorm(x, 210, 30), from = 120, to = 300, xlab = "mu", main = "Prior for mu")
curve(dunif(x, 1, 40), from = -5, to = 50, xlab = "sigma", main = "Prior for sigma")

Let’s apply this idea

Our data:

 [1] 185 193 240 245 155 234 189 196 206 263

Let’s apply this idea

What is the posterior probability of mu = 180 combined with a sigma = 10?

# Calculate the likelihood of the data
# given these parameter values of
# mu = 180 ; sigma = 10
# remember MT contains our data
Likelihood <- 
  sum(
      dnorm(
        MT ,
        mean=180 ,
        sd=10)
      )
Likelihood
[1] 0.09315515
# Calculate our Prior belief

Prior <- dnorm(180, 210 , 30 ) * dunif(10 , 1 , 40 )

Prior
[1] 0.0002068126
# Calculate posterior as product 
# of Likelihood and Prior

Product <- Likelihood * Prior

Product
[1] 1.926566e-05
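Note that for independent observations the likelihood of the whole sample is usually defined as the product of the individual densities; a minimal sketch of that product form, computed on the log scale for numerical stability:

# Likelihood as the product of the individual densities,
# computed via the sum of log-densities for numerical stability
logLik_MT <- sum(dnorm(MT, mean = 180, sd = 10, log = TRUE))
exp(logLik_MT)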

Grid approximation

We calculated the posterior probability of a combination of 2 parameters

We could repeat this systematically for the following values:

# sample some values for mu and sigma
mu.list <- seq(from = 135, 
               to = 300, 
               length.out=200)
sigma.list <- seq(from = 1, 
                  to = 30, 
                  length.out = 200)

i.e., we create a grid of possible parameter value combinations and approximate the distribution of posterior probabilities
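A minimal sketch of such a grid approximation, combining the likelihood with the priors used above (object names are illustrative):

# Build the grid of all (mu, sigma) combinations
grid <- expand.grid(mu = mu.list, sigma = sigma.list)

# Log-posterior (up to a constant) = log-likelihood + log-priors
grid$log_post <- sapply(seq_len(nrow(grid)), function(i) {
  sum(dnorm(MT, mean = grid$mu[i], sd = grid$sigma[i], log = TRUE)) +
    dnorm(grid$mu[i], mean = 210, sd = 30, log = TRUE) +
    dunif(grid$sigma[i], min = 1, max = 40, log = TRUE)
})

# Rescale to relative posterior probabilities
grid$prob <- exp(grid$log_post - max(grid$log_post))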

Grid approximation applied

Sampling parameter values

  • Instead of a fixed grid of combinations of parameter values,

  • sample pairs/triplets/… of parameter values

  • reconstruct the posterior probability distribution

Why brms?

Imagine

A ‘simple’ linear model


\[\begin{aligned} & MT_{i} \sim N(\mu_i,\sigma_{e})\\ & \mu_i = \beta_0 + \beta_1*\text{sp4week}_{i} + \beta_2*\text{km4week}_{i} + \\ & \beta_3*\text{Age}_{i} + \beta_4*\text{Gender}_{i} + \beta_5*\text{CrossTraining}_{i} \\ \end{aligned}\]


So you can get a REALLY LARGE number of parameters!

Markov Chain Monte Carlo - Why?

Complex models \(\rightarrow\) large number of parameters \(\rightarrow\) exponentially large number of combinations!


The posterior becomes intractable with grid approximation


We will approximate the ‘joint posterior’ by ‘smart’ sampling


Samples of combinations of parameter values are drawn


BUT: samples will not be random!

MCMC - demonstrated



The following link brings you to an interactive tool that lets you grasp the basic idea behind MCMC sampling:


https://chi-feng.github.io/mcmc-demo/app.html#HamiltonianMC,standard


Software


  • different dedicated software/packages are available: JAGS / BUGS / Stan

  • the most powerful is Stan: its Hamiltonian Monte Carlo algorithm makes it the best choice at the moment

  • Stan is a probabilistic programming language that uses C++

Example of Stan code

// generated with brms 2.14.4
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
}
model {
  // likelihood including all constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including all constants
  target += normal_lpdf(Intercept | 500,50);
  target += student_t_lpdf(sigma | 3, 0, 68.8)
    - 1 * student_t_lccdf(0 | 3, 0, 68.8);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
}

How brms works

The basics of brms

brms syntax

Very similar to lm or lme4, and in line with the typical R style of writing up a model …

lme4

Model <- lmer(
  y ~ x1 + x2 + (1|Group),
  data = Data,

  ...
)

brms

Model <- brm(
  y ~ x1 + x2 + (1|Group),
  data = Data,
  family = "gaussian",
  backend = "cmdstanr",
  
  ...
)

Notice:

  • backend = "cmdstanr" indicates the way we want to interact with Stan and C++

Let’s retake the example on running

The simplest model looked like:

\[ MT_i \sim N(\mu,\sigma_e) \]

In brms this model is:

MT <- c(185, 193, 240, 245, 155, 234, 189, 196, 206, 263)

# First make a dataset from our MT vector
DataMT <- data.frame(MT)

Mod_MT1 <- brm(
  MT ~ 1, # We only model an intercept
  data = DataMT,
  backend = "cmdstanr",
  seed = 1975
)
  1. create a dataset with potential marathon times
  2. change it to a data frame
  3. the brm() command runs the model
  4. define the model
  5. define the dataset to use
  6. how brms and Stan communicate
  7. make the estimation reproducible

🏃 Try it yourself and run the model … (don’t forget to load the necessary packages: brms & tidyverse)

Basic output brms

summary(Mod_MT1)

Basic output brms

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: MT ~ 1 
   Data: DataMT (Number of observations: 10) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   209.58     11.18   187.37   231.78 1.00     2386     1935

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.21      9.51    23.26    58.13 1.00     2487     2329

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

About chains and iterations

Markov Chain Monte Carlo

  • A chain of samples of parameter values
  • It is advised to run multiple chains
  • Each sample of parameter values is called an iteration
  • The first X iterations are used to tune the sampler and are not used for interpreting the results: the burn-in (warmup) iterations

brms defaults to 4 chains, each with 2000 iterations, of which 1000 are used as burn-in (warmup)

Changing brms defaults

Mod_MT1 <- brm(
  MT ~ 1,
  data = DataMT,
  backend = "cmdstanr",
  chains = 5,      # number of chains
  iter = 6000,     # total number of iterations per chain
  warmup = 1500,   # burn-in iterations per chain
  cores = 4,       # number of cores for parallel sampling
  seed = 1975      # make the estimation reproducible
)

Good old plot() function

plot(Mod_MT1)

Good old plot() function

Your Turn

  • Open MarathonData.RData
  • Estimate your first Bayesian Models
  • Model1: only an intercept
  • Model2: introduce the effects of km4week and sp4week on MarathonTimeM
  • Change the brms defaults
  • Make plots with the plot() function
  • What do we learn?

About convergence

  • \(\widehat R\) < 1.015 for each parameter estimate

  • at least 4 chains are recommended

  • Effective Sample Size (ESS) > 400 to rely on \(\widehat R\)
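A quick sketch of how these diagnostics can be inspected, assuming the Mod_MT1 object fitted earlier:

# Rhat per parameter (should be below 1.015)
rhat(Mod_MT1)

# Bulk_ESS and Tail_ESS are reported in the summary
summary(Mod_MT1)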

But is it a good model?



Two complementary procedures:

  • posterior-predictive check

  • compare models with leave-one-out cross-validation

Posterior-predictive check

A visual check that can be performed with pp_check() from brms

MarathonTimes_Mod2 <-
  readRDS(file = 
            here("Presentations",
              "Output",
              "MarathonTimes_Mod2.RDS")
          )

pp_check(MarathonTimes_Mod2) + theme_minimal()

Posterior-predictive check

Model comparison with loo cross-validation

\(\sim\) AIC or BIC in Frequentist statistics

\(\widehat{elpd}\): “expected log predictive density” (a higher \(\widehat{elpd}\) implies better model fit without being sensitive to overfitting!)

loo_Mod1 <- loo(MarathonTimes_Mod1)
loo_Mod2 <- loo(MarathonTimes_Mod2)

Comparison <- 
  loo_compare(
    loo_Mod1, 
    loo_Mod2
    )

print(Comparison, simplify = F)

Model comparison with loo cross-validation

                   elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
MarathonTimes_Mod2    0.0      0.0    -356.3     12.0      7.2     4.4   712.6   24.0
MarathonTimes_Mod1  -39.9     12.6    -396.2      5.3      1.7     0.3   792.3   10.6

Your Turn

  • Time to switch to your dataset
  • Think of two alternative models for a certain variable
  • Be naive: assume normality and keep it simple
  • Estimate the two models
  • Compare the models
  • Interpret the best fitting model
  • Check the fit with pp_check()

Family time

brms defaults to family = gaussian(link = "identity")

Mod_MT1 <- brm(                        
  MT ~ 1, 
  family = gaussian(link = "identity"),
  data = DataMT,   
  backend = "cmdstanr",
  chains = 5,
  iter = 6000,
  warmup = 1500,
  cores = 4,
  seed = 1975           
)

Family time


But many alternatives are available!


The default family types known in R (e.g., binomial(link = "logit"), Gamma(link = "inverse"), poisson(link = "log"), …)


see help(family)


ModX <- brm(                        
  Y ~ 1, 
  family = binomial(link = "logit"),
  ...
)

Family time


And even more!


brmsfamily(...) has a lot of possible models


see help(brmsfamily)


ModX <- brm(                        
  Y ~ 1, 
  family = brmsfamily(skew_normal, 
                      link = "identity", 
                      link_sigma = "log", 
                      link_alpha = "identity"),
  ...
)

Mixed effects models


brms can estimate models for more complex designs like multilevel models or more generally mixed effects models


Random intercepts…

ModX <- brm(                        
  Y ~ 1 + x + 
    (1 | Groupvariable), 
  ...
)

Random intercepts and slopes…

ModX <- brm(                        
  Y ~ 1 + x + 
    (1 + x | Groupvariable), 
  ...
)

Homework

Time to think about your data and research question

  • Define a simple model
  • What are the parameters in that model?
  • What are your prior beliefs about each of the parameters?
  • Minimum and maximum values assumed for each parameter?
  • Do all potential parameter values have a similar probability?
  • Bring along your notes to the next session!