Applied Bayesian Analyses in R

Part 1

Sven De Maeyer

Introduction

Welcome

Let’s get to know each other first!

  • what’s your name?
  • a 1/2-minute pitch on your research
  • any experience with Bayesian inference?
  • what do you hope to learn?

Outline

Part 1: Introduction to Bayesian inference and some basics in brms

Part 2: Checking the model with the WAMBS checklist and how to do it in R

Part 3: Reporting and interpreting Bayesian models

Practical stuff

All material is on a shared Dropbox folder called ABAR_Turku23

You can find:

  • Folder Presentations

  • Subfolders for each part of the workshop

  • With files called Slides_Part1.html (presentation you can open) and Slides_Part1.qmd (source quarto file)

  • The html slides can be exported to pdf in your browser (use the menu button at the bottom right)

Statistical Inference

Why statistical inference?

We want to learn more about a population but we have sample data!

Therefore, we have uncertainty

Why statistical inference? an example

What is the effect of the average number of kilometers run per week on the marathon time?

We can estimate the effect by using a regression model based on a sample:

\[\text{MarathonTime}_i = \beta_0 + \beta_1 * \text{n_km}_i + \epsilon_i\]

Of course, we are not interested in the value of \(\beta_1\) in the sample; we want to learn more about \(\beta_1\) in the population!

Frequentist statistical inference (with data…)

Maximum likelihood estimation:


Call:
lm(formula = MarathonTimeM ~ km4week, data = MarathonData)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.601 -12.469  -0.536  12.816  38.670 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 199.14483    1.93856 102.728  < 2e-16 ***
km4week      -0.50907    0.07233  -7.038 4.68e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.08 on 85 degrees of freedom
Multiple R-squared:  0.3682,    Adjusted R-squared:  0.3608 
F-statistic: 49.53 on 1 and 85 DF,  p-value: 4.675e-10

Frequentist statistical inference (interpretation …)

Sample: for each additional kilometer per week, the marathon time is approximately half a minute faster

Population:

  • the probability of observing an estimate of |0.509| (or a more extreme one) if the effect is zero in the population is lower than 0.05
  • if we drew many similar samples from that population, 95% of the intervals constructed this way would contain the true effect (here: -0.6462 to -0.3718)

==> So: within that interval, -0.6 is treated as just as plausible as -0.4 …
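A minimal sketch of how that 95% confidence interval could be reproduced with confint() (assuming the lm() fit from the previous slide is stored in an object, here hypothetically called Mod_lm):

# Refit the model shown above and extract the 95% confidence
# interval for the slope (Mod_lm is a hypothetical object name)
Mod_lm <- lm(MarathonTimeM ~ km4week, data = MarathonData)
confint(Mod_lm, "km4week", level = 0.95)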

Bayesian statistical inference

Frequentist compared to Bayesian inference



Let’s explore a tutorial App by J. Kruschke (2019)

https://iupbsapps.shinyapps.io/KruschkeFreqAndBayesApp/



He has also written a nice tutorial, linked to this App:

https://jkkweb.sitehost.iu.edu/KruschkeFreqAndBayesAppTutorial.html

Advantages & Disadvantages of Bayesian analyses

Advantages:

  • Natural approach to express uncertainty
  • Ability to incorporate prior knowledge
  • Increased model flexibility
  • Full posterior distribution of the parameters
  • Natural propagation of uncertainty

Disadvantages:

  • Slow speed of model estimation
  • Some reviewers don’t understand you (“give me the p-value”)

Bayesian Inference

Bayes’ Theorem

\[ P(\theta|data) = \frac{P(data|\theta)P(\theta)}{P(data)} \] with

  • \(P(data|\theta)\) : the likelihood of the data given our model \(\theta\)
  • \(P(\theta)\) : our prior belief about model parameters
  • \(P(\theta|data)\): the posterior probability about model parameters

Likelihood

\(P(data|\theta)\) : the likelihood of the data given our model \(\theta\)

This part is about our MODEL and the parameters (aka the unknowns) in that model

Example. A model on “time needed to run a marathon” could be a normal distribution: \(y_i \sim N(\mu, \sigma)\)

For a certain average (\(\mu\)) and standard deviation (\(\sigma\)) we can calculate the probability (density) of observing a marathon time of, for instance, 240 minutes
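A minimal sketch in R (the values \(\mu\) = 200 and \(\sigma\) = 20 are only chosen here for illustration):

# Density of observing a marathon time of 240 minutes under a
# normal model with (illustrative) mu = 200 and sigma = 20
dnorm(240, mean = 200, sd = 20)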

Prior

Expression of our prior knowledge (belief) on the parameter values

Example. A model on “time needed to run a marathon” could be a normal distribution: \(y_i \sim N(\mu, \sigma)\)

e.g., How probable is an average marathon time of 120 minutes (\(P(\mu=120)\)) and how probable is a standard deviation of 40 minutes (\(P(\sigma=40)\))?
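As a sketch, such prior beliefs can be evaluated with the corresponding density functions, here using the N(210, 30) prior for \(\mu\) and the Uniform(1, 40) prior for \(\sigma\) that appear later in this part:

# Prior density of an average marathon time of 120 minutes
# under a N(210, 30) prior for mu
dnorm(120, mean = 210, sd = 30)

# Prior density of a standard deviation of 40 minutes
# under a Uniform(1, 40) prior for sigma
dunif(40, min = 1, max = 40)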

Prior as a distribution

How probable are different values for the average marathon time in a population?

Types of priors

Uninformative/Weakly informative

When objectivity is crucial and you want to let the data speak for themselves…

Informative

When including significant information is crucial

  • previously collected data
  • results from former research/analyses
  • data of another source
  • theoretical considerations
  • elicitation

Types of priors (visually)

Two potential priors for the mean marathon time in the population

Bayes’ theorem in stats



“For Bayesians, the data are treated as fixed and the parameters vary. […] Bayes’ rule tells us that to calculate the posterior probability distribution we must combine a likelihood with a prior probability distribution over parameter values.”
(Lambert, 2018, p.88)

Let’s apply this idea

Our Model for marathon times

\(y_i \sim N(\mu, \sigma)\)

Let’s apply this idea

Our Priors related to mu and sigma:

# Plot the two priors next to each other
par(mfrow = c(1, 2))
curve(dnorm(x, 210, 30), from = 120, to = 300, xlab = "mu", main = "Prior for mu")
curve(dunif(x, 1, 40), from = -5, to = 50, xlab = "sigma", main = "Prior for sigma")

Let’s apply this idea

Our data:

 [1] 185 193 240 245 155 234 189 196 206 263

Let’s apply this idea

What is the posterior probability of mu = 180 combined with a sigma = 10?

# Calculate the likelihood of the data
# given these parameter values of
# mu = 180 ; sigma = 10
# remember MT contains our data
Likelihood <- 
  sum(
      dnorm(
        MT ,
        mean=180 ,
        sd=10)
      )
Likelihood
[1] 0.09315515
# Calculate our Prior belief

Prior <- dnorm(180, 210 , 30 ) * dunif(10 , 1 , 40 )

Prior
[1] 0.0002068126
# Calculate posterior as product 
# of Likelihood and Prior

Product <- Likelihood * Prior

Product
[1] 1.926566e-05
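Note that for independent observations the likelihood of the whole sample is usually defined as the product of the individual densities; a minimal sketch of that product form, computed on the log scale for numerical stability:

# Likelihood as the product of the individual densities,
# computed via the sum of log-densities for numerical stability
logLik_MT <- sum(dnorm(MT, mean = 180, sd = 10, log = TRUE))
exp(logLik_MT)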

Grid approximation

We calculated the posterior probability of a combination of 2 parameters

We could repeat this systematically for the following values:

# sample some values for mu and sigma
mu.list <- seq(from = 135, 
               to = 300, 
               length.out=200)
sigma.list <- seq(from = 1, 
                  to = 30, 
                  length.out = 200)

i.e., we create a grid of possible parameter value combinations and approximate the distribution of posterior probabilities
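A minimal sketch of such a grid approximation, combining the likelihood with the priors used above (object names are illustrative):

# Build the grid of all (mu, sigma) combinations
grid <- expand.grid(mu = mu.list, sigma = sigma.list)

# Log-posterior (up to a constant) = log-likelihood + log-priors
grid$log_post <- sapply(seq_len(nrow(grid)), function(i) {
  sum(dnorm(MT, mean = grid$mu[i], sd = grid$sigma[i], log = TRUE)) +
    dnorm(grid$mu[i], mean = 210, sd = 30, log = TRUE) +
    dunif(grid$sigma[i], min = 1, max = 40, log = TRUE)
})

# Rescale to relative posterior probabilities
grid$prob <- exp(grid$log_post - max(grid$log_post))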

Grid approximation applied

Sampling parameter values

  • Instead of a fixed grid of combinations of parameter values,

  • sample pairs/triplets/… of parameter values

  • reconstruct the posterior probability distribution

Why brms?

Imagine

A ‘simple’ linear model


\[\begin{aligned} & MT_{i} \sim N(\mu_i,\sigma_{e})\\ & \mu_i = \beta_0 + \beta_1*\text{sp4week}_{i} + \beta_2*\text{km4week}_{i} + \\ & \beta_3*\text{Age}_{i} + \beta_4*\text{Gender}_{i} + \beta_5*\text{CrossTraining}_{i} \\ \end{aligned}\]


So you can get a REALLY LARGE number of parameters!

Markov Chain Monte Carlo - Why?

Complex models \(\rightarrow\) large number of parameters \(\rightarrow\) exponentially large number of combinations!


The posterior becomes intractable with grid approximation


We will approximate the ‘joint posterior’ by ‘smart’ sampling


Samples of combinations of parameter values are drawn


BUT: samples will not be random!

MCMC - demonstrated



The following link brings you to an interactive tool that lets you grasp the basic idea behind MCMC sampling:


https://chi-feng.github.io/mcmc-demo/app.html#HamiltonianMC,standard


Software


  • different dedicated software/packages are available: JAGS / BUGS / Stan

  • the most powerful is Stan: its Hamiltonian Monte Carlo algorithm makes it the best choice at the moment

  • Stan is a probabilistic programming language that uses C++

Example of Stan code

// generated with brms 2.14.4
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // residual SD
}
transformed parameters {
}
model {
  // likelihood including all constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including all constants
  target += normal_lpdf(Intercept | 500,50);
  target += student_t_lpdf(sigma | 3, 0, 68.8)
    - 1 * student_t_lccdf(0 | 3, 0, 68.8);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
}

How brms works

The basics of brms

brms syntax

Very similar to lm or lme4, and in line with the typical R style of writing up a model …

lme4

Model <- lmer(
  y ~ x1 + x2 + (1|Group),
  data = Data,

  ...
)

brms

Model <- brm(
  y ~ x1 + x2 + (1|Group),
  data = Data,
  family = "gaussian",
  backend = "cmdstanr",
  
  ...
)

Notice:

  • backend = "cmdstanr" indicates the way we want to interact with Stan and C++

Let’s retake the example on running

The simplest model looked like:

\[ MT_i \sim N(\mu,\sigma_e) \]

In brms this model is:

MT <- c(185, 193, 240, 245, 155, 234, 189, 196, 206, 263)

# First make a dataset from our MT vector
DataMT <- data.frame(MT)

Mod_MT1 <- brm(
  MT ~ 1, # We only model an intercept
  data = DataMT,
  backend = "cmdstanr",
  seed = 1975
)
  1. create a dataset with potential marathon times
  2. change it to a data frame
  3. the brm() command runs the model
  4. define the model
  5. define the dataset to use
  6. how brms and Stan communicate
  7. make the estimation reproducible

🏃 Try it yourself and run the model … (don’t forget to load the necessary packages: brms & tidyverse)

Basic output brms

summary(Mod_MT1)

Basic output brms

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: MT ~ 1 
   Data: DataMT (Number of observations: 10) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   209.58     11.18   187.37   231.78 1.00     2386     1935

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.21      9.51    23.26    58.13 1.00     2487     2329

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

About chains and iterations

Markov Chain Monte Carlo

  • A chain of samples of parameter values
  • It is advised to run multiple chains
  • Each sample of parameter values is called an iteration
  • The first X iterations are used to tune the sampler and are not used for interpreting the results: the burn-in (warmup) iterations

brms defaults to 4 chains, each with 2000 iterations, of which 1000 are used as burn-in (warmup)

Changing brms defaults

Mod_MT1 <- brm(
  MT ~ 1,
  data = DataMT,
  backend = "cmdstanr",
  chains = 5,      # number of chains
  iter = 6000,     # total number of iterations per chain
  warmup = 1500,   # burn-in iterations per chain
  cores = 4,       # number of cores for parallel sampling
  seed = 1975      # make the estimation reproducible
)

Good old plot() function

plot(Mod_MT1)

Good old plot() function

Your Turn

  • Open MarathonData.RData
  • Estimate your first Bayesian Models
  • Model1: only an intercept
  • Model2: introduce the effects of km4week and sp4week on MarathonTimeM
  • Change the brms defaults
  • Make plots with the plot() function
  • What do we learn?

About convergence

  • \(\widehat R\) < 1.015 for each parameter estimate

  • at least 4 chains are recommended

  • Effective Sample Size (ESS) > 400 to rely on \(\widehat R\)
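A quick sketch of how these diagnostics can be inspected, assuming the Mod_MT1 object fitted earlier:

# Rhat per parameter (should be below 1.015)
rhat(Mod_MT1)

# Bulk_ESS and Tail_ESS are reported in the summary
summary(Mod_MT1)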

But is it a good model?



Two complementary procedures:

  • posterior-predictive check

  • compare models with leave-one-out cross-validation

Posterior-predictive check

A visual check that can be performed with pp_check() from brms

MarathonTimes_Mod2 <-
  readRDS(file = 
            here("Presentations",
              "Output",
              "MarathonTimes_Mod2.RDS")
          )

pp_check(MarathonTimes_Mod2) + theme_minimal()

Posterior-predictive check

Model comparison with loo cross-validation

\(\sim\) AIC or BIC in Frequentist statistics

\(\widehat{elpd}\): “expected log predictive density” (a higher \(\widehat{elpd}\) implies better model fit without being sensitive to overfitting!)

loo_Mod1 <- loo(MarathonTimes_Mod1)
loo_Mod2 <- loo(MarathonTimes_Mod2)

Comparison <- 
  loo_compare(
    loo_Mod1, 
    loo_Mod2
    )

print(Comparison, simplify = F)

Model comparison with loo cross-validation

                   elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
MarathonTimes_Mod2    0.0      0.0    -356.3     12.0      7.2     4.4   712.6   24.0
MarathonTimes_Mod1  -39.9     12.6    -396.2      5.3      1.7     0.3   792.3   10.6

Your Turn

  • Time to switch to your dataset
  • Think of two alternative models for a certain variable
  • Be naive: assume normality and keep it simple
  • Estimate the two models
  • Compare the models
  • Interpret the best fitting model
  • Check the fit with pp_check()

Family time

brms defaults to family = gaussian(link = "identity")

Mod_MT1 <- brm(                        
  MT ~ 1, 
  family = gaussian(link = "identity"),
  data = DataMT,   
  backend = "cmdstanr",
  chains = 5,
  iter = 6000,
  warmup = 1500,
  cores = 4,
  seed = 1975           
)

Family time


But many alternatives are available!


The default family types known in R (e.g., binomial(link = "logit"), Gamma(link = "inverse"), poisson(link = "log"), …)


see help(family)


ModX <- brm(                        
  Y ~ 1, 
  family = binomial(link = "logit"),
  ...
)

Family time


And even more!


brmsfamily(...) has a lot of possible models


see help(brmsfamily)


ModX <- brm(                        
  Y ~ 1, 
  family = brmsfamily(skew_normal, 
                      link = "identity", 
                      link_sigma = "log", 
                      link_alpha = "identity"),
  ...
)

Mixed effects models


brms can estimate models for more complex designs like multilevel models or more generally mixed effects models


Random intercepts…

ModX <- brm(                        
  Y ~ 1 + x + 
    (1 | Groupvariable), 
  ...
)

Random intercepts and slopes…

ModX <- brm(                        
  Y ~ 1 + x + 
    (1 + x | Groupvariable), 
  ...
)

Homework

Time to think about your data and research question

  • Define a simple model
  • What are the parameters in that model?
  • What are your prior beliefs about each of the parameters?
  • Minimum and maximum values assumed for each parameter?
  • Do all potential parameter values have a similar probability?
  • Bring along your notes to the next session!