Introduction to Bayesian Statistics with

class: center, middle, inverse, title-slide

.title[
# Introduction to Bayesian Statistics with <img src="img/Rlogo.png" width="200" />    
]
.author[
### Jason Thomas
]
.institute[
### R Working Group
]
.date[
### April 18th, 2025
]

---

# Introduction

* Today we will learn about the Bayesian approach to statistical inference

+ 🤨 ummmm...why, exactly?

* Two main approaches to statistical inference: frequentist vs. Bayesian

+ why one and not the other?
  
  + learning Bayes by example

* [https://github.com/buckipr/R_Working_Group/bayes](https://github.com/buckipr/R_Working_Group/bayes/intro)
 (slides -- coming soon to a repo near you)

---
# In which there are two kinds

Frequentist inference...

* the world introduced in Stats 101, where model parameters are
fixed, but unknown (so we need to estimate them with data)
  
* uncertainty is assessed with respect to repeated resampling
  of the data &rarr; sampling distributions & asymptotics
  
  + are we there yet?
  
  + difficulties with interpretation (e.g., confidence intervals)
  
  + out-of-sample predictions?

---
class: inverse, middle, center
background-image: url(img/asymptotia.png)
background-size: contain

source: RJ Little (2006) Calibrated Bayes. American Statistician

---
class: inverse, center, middle

# Bayesian Framework

---
class: wide-slide
# Welcome to Bayes-ity

.left-column[ <img src="img/Thomas_Bayes.gif">]

.right-column[
* We have been around the block a few times and one might say we even no a thing or two
]

---
class: wide-slide
# Welcome to Bayes-ity

.left-column[ <img src="img/Thomas_Bayes.gif">]

.right-column[
* We have been around the block a few     times and one might say we even no a thing or two

+ example: if you flip a penny, there is a good chance that it lands on "heads" or H

]

---
class: wide-slide
# Welcome to Bayes-ity

.left-column[ <img src="img/Thomas_Bayes.gif">]

.right-column[
* We have been around the block a few times and one might say we even no a thing or two

+ example: if you flip a penny, there is a good chance that it lands on "heads" or H
  
* Our friend comes along and gives us a penny that they found on Mars

+ what is going to happen if you flip it?
 (not a rhetorical question)

]

---
class: wide-slide
# Welcome to Bayes-ity

.left-column[ <img src="img/Thomas_Bayes.gif">]

.right-column[
* Suppose you flip it once and get H.  You do a few more experiments and get
9 T in a row.
  
  + What is the probability that you get H on the next toss? Pr(H) = ?

+ That next toss lands on H, and the subsequent 91 tosses turn up T
  (yay, more data).
  
  + What will happen on the 101st toss, i.e., now, what do you think Pr(H) is?

]

---
class: slide-font-25
# Bayesics

This process of flipping the Martian coin illustrates the Bayesian framework

* stating your original belief about the chance of heads (H)

* gathering some data (e.g., coin tosses) & choosing a model

* **updating** your belief after seeing the data (what do you expect to
happen on the next experiment/observation?)

* (and repeat)

---
class: slide-font-25
# Bayesics

This process of flipping the Martian coin illustrates the Bayesian framework

* stating your original belief about the chance of heads (H)

+ .red[*prior* distribution for the probability of H, Pr(H)]
  
* gathering some data (e.g., coin tosses) & choosing a model

+ .red[*likelihood*] (what frequentists primarily rely on)
  + .red[conditional prob of data given parameters]
  + .red[for a given probability of H, how likely are these data: HTTTTTTTTT]
  
* **updating** your belief after seeing the data (what do you expect to
happen on the next experiment/observation?)

+ .red[*posterior* distribution (used for inference)]
  
* (and repeat) -- use updated beliefs for the next data set

---
# Slightly more formally

* *Prior distribution:* Pr(Heads)

* *Likelihood:* Pr(Data | Heads)
  
* *Posterior* (using Bayes' Theroem)

$$
\mbox{Pr(Heads|Data)} = \frac{\mbox{Pr(Data | Heads) Pr(Heads)}}{\mbox{Pr(Data)}}
$$
Since Pr(Data) (in the denominator) does not involve Pr(Heads), it is
just a scaling constant that can be ignored, so we typically work with

$$
\mbox{Pr(Heads|Data)} \propto \mbox{Pr(Data | Heads) Pr(Heads)}
$$

(posterior is proportional to likelihood * prior)

---
# Bayes' Theorem

Quick detour on probabilities

* `$Pr(A)$`: marginal probability of event *A* (possible events: {*A*, *B*})

+ example:
  &nbsp;&nbsp;
  `$Pr (Heads) = \frac{1}{2} \;\;\;\;, \;\;\;\;Pr [Tails] = \frac{1}{2}$`
  
* `$Pr( D | E )$`: conditional prob of event *D* given that event *E* occurred

+ example:
    &nbsp;&nbsp;&nbsp;&nbsp;
    `$Pr($` 2nd card = ♠️ &nbsp;|&nbsp; 1st card = ♦️)

conditional probability of drawing a spade from a deck of cards given that 
you pulled a diamond on your first draw

---
# Bayes' Theorem

Quick detour on probabilities

* `$Pr(D, E)$`: joint prob of both events `$D$` and `$E$` occurring

+ example:
  &nbsp;&nbsp;&nbsp;&nbsp;
  `$Pr($` 🎲 &nbsp;=&nbsp; 4 &nbsp;&nbsp;,&nbsp;&nbsp; 🎲 &nbsp;>&nbsp; 4)

* Relationships among probabilities:
  
  + `$Pr(A, B) \;=\; Pr(A | B) * Pr(B)$`  &nbsp;&nbsp; &rArr; &nbsp;&nbsp;
  `$\frac{Pr(A,B)}{Pr(B)} \;=\; Pr(A | B)$`
  
  + Bayes' Theorem:
  
  $$ \frac{Pr(B | A) Pr(A)}{Pr(B)} \;=\; \frac{Pr(A, B)}{Pr(B)} \;=\; Pr(A | B) $$

---
# Toy example

* Let's tackle this Martian coin with a few examples, starting
with a silly version with hope that it gives us some insight.

* Set up
  + Data: TTTT
  + Model: *binomial* distribution (for 0/1 or yes/no outcomes)
  + What is `$\pi$` (probability of H)?

* A few details about this model for 0/1 outcomes.
  + the model is not interested in the order, so HTH is the same as HHT
  + but we need to *count* the number of ways we can get one T and
  two H (to figure out how likely this outcome is)

---
# Binomial distribution

* The binomial has one parameter, we will call it `$\pi$`, and some data
  + `$n$` - number of coin flips (trails)
  + `$x$` - number of heads/H (successes)
  + `$\pi$` = Pr(H)

* Formula for the binomial distribution (probability of getting `$x$` heads/H
when you flip the coin `$n$` times) is

`$$Pr(x) = {n \choose x} \pi^x (1 - \pi)^{n - x}$$`

---
class: slide-font-25
# Side note: combinatorics

* `${n \choose x}$`  &mdash; "n choose x" shows up in the binomial distribution
and helps us count

+ In `$n$` coin flips, how many ways can we get `$x$` number of heads?
  
  + `${n \choose x} \;\;=\;\; \frac{n!}{(n-x)! x!}$`
  
  + example

`$${10 \choose 10} \;\;=\;\; \frac{10!}{(10-10)! 10!} \;\;=\;\; \frac{10!}{(1) 10!} \;\;=\;\; 1$$`
The is only 1 way to get 10 H from 10 coin flips:

+ the `$1^{st}$` flip = H, the `$2^{nd}$` flip = H, ..., AND the `$10^{th}$` flip = H.

---
# Side note: more combinatorics

With 10 flips, how many ways can we get only one H?

``` r
> choose(10, 1)
```

```
## [1] 10
```

10 ways to get only 1 H
  + only the first flip = H, all other 9 = T
  + only the second flip = H
  + ...
  + only the tenth flip = H

---
# Back to the binomial

Recall are data: TTTT

``` r
> # Pr(H = 0)
> dbinom(x = 0,     # number of H
+        size = 4,  # number of coin flips
+        prob = .5) # pi = .5
```

```
## [1] 0.0625
```

---
# Bayes
* Bayesian
  + silly prior (only 2 values are possible)
    - Pr(\pi = 0.2) = 0.8
    - Pr(\pi = 0.5) = 0.2
  + posterior?

---
class: slide-font-25
# Simple Bayesian example

Let's model that Martian penny.

* What is your prior belief before seeing the Martian penny in action?

+ I bet it acts like an Earth penny, since we are standing in Columbus. So
  `$Pr(\mbox{Heads})$` is about `$\frac{1}{2}$` with very little wiggle room.

- strong prior belief

+ No idea!  Any value between 0 and 1 is equally likely.

--
  
    - flat/diffuse/uninformative prior belief or uninformative

--
    
  + 0.5 is more likely than other values, but the Martians may have altered
  the coin so not too sure

- weakly informative prior

---
# Introducing the Beta dist

Since the probability of heads is bounded between 0 & 1, a Beta distribution
is a popular choice for a prior distribution.

* The Beta dist. has two parameters: `$\alpha > 0$` , and `$\beta > 0$`

* The mean of the Beta is `$\frac{\alpha}{\alpha + \beta}$`

* In R, we can use `dbeta()` to explore the Beta dist. and `rbeta()` to make
random draws from a Beta

+ `dbeta(x, shape1, shape2)`
  
  + `$\alpha$` is shape1
  
  + `$\beta$` is shape2

---
# Pick your prior

``` r
> p <- seq(0, 1, by = 0.001)
> plot(p, dbeta(p, shape1 = 1, shape2 = 1), type = "l", lwd = 3, ylim = c(0, 4), main = "Prior for Pr(H)",
+ ylab = "density")
> points(p, dbeta(p, shape1 = 2, shape2 = 1), col = "blue", type = "l")
> points(p, dbeta(p, shape1 = 1, shape2 = 2), col = "red", type = "l")
> points(p, dbeta(p, shape1 = 8, shape2 = 8), col = "darkgreen", type = "l")
> points(p, dbeta(p, shape1 = 0.2, shape2 = 0.2), col = "bisque4", type = "l")
> text(0.15, 4, labels = expression(paste(alpha, " = 1, ", beta, " = 1")))
> text(0.15, 3.7, labels = expression(paste(alpha, " = 2, ", beta, " = 1")), col = "blue")
> text(0.15, 3.4, labels = expression(paste(alpha, " = 1, ", beta, " = 2")), col = "red")
> text(0.75, 4, labels = expression(paste(alpha, " = 8, ", beta, " = 8")), col = "darkgreen")
> text(0.75, 3.7, labels = expression(paste(alpha, " = 0.2, ", beta, " = 0.2")), col = "bisque4")
```

---
class: slide-font-25
# Update our penny prior

How to get prob of data?

* A Martian coin flip gives us Heads or Tails, and n flips can be modeled
with the binomial distribution

* To get the posterior distribution for the probability of heads, we use
Bayes' Theorem for `$\theta = Pr(H)$`

$$ Pr(\theta | data) \;\;=\;\;\frac{Pr(\mbox{data} | \theta) \;\;Pr(\theta)}{Pr(data)}$$

`$$\;\;=\;\;\frac{\mbox{Binomial}(\mbox{data} \;|\; p = \theta) \;\;\mbox{Beta}(\theta \;|\; \alpha, \beta)}{Pr(\mbox{data})}$$`
`$$\;\;\propto\;\;\mbox{Binomial}(\mbox{data} \;|\; p = \theta) \;\;\mbox{Beta}(\theta \;|\; \alpha, \beta)$$`

(posterior is proportional to likelihood * prior, since Pr(data) does not depend on `$\theta$`)

---
class: slide-font-25
# Side note: gamma function `$\gamma ( x )$`

(For the mathematically curious) when working with the beta and binomial
distributions we will come across

* `$\gamma(x)$` &mdash; the gamma function appears in the beta distribution

+ `$\gamma(x) = \int_0^{\infty} t^{x-1} \exp(-t) dt$`
  
  + `$\gamma(x) \;\;=\;\; (x-1)! \;\;=\;\; (x-1) * (x-2) * \cdots * 1$`

``` r
> gamma(4)  ## 3 * 2 * 1
```

```
## [1] 6
```

``` r
> gamma(4) == factorial(3)
```

```
## [1] TRUE
```

---
class: slide-font-25
# Posterior penny

Posterior distribution for our parameter `$\theta$` is

`$$\;\;\propto\;\;
{n \choose x} \theta^{x} (1 - \theta)^{n - x} \;\;*\;\;
\frac{\gamma(\alpha + \beta)}{\gamma(\alpha)\gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$`

`$$\;\;=\;\;
C_1 \;*\; \color{red}{\theta^{x} (1 - \theta)^{n - x}}\;\;*\;\;
C_2 \;*\; \color{red}{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}$$`

`$$\;\;=\;\; C_3 \;*\; \theta^{x + \alpha - 1} (1 - \theta)^{n - x + \beta - 1}$$`
--

This is a Beta `$(\theta \;|\; \alpha^* \; \beta^*)$`

* `$\alpha^* = x + \alpha$`
* `$\beta^* = n - x + \beta$`
* our prior is like a previous data set with `$(\alpha - 1)$` heads
and `$(\beta - 1)$` tails, when updating a beta proir with a binomial likelihood
  + (see .red[*kernels*] on the 2nd line of equations)

---
# Posterior penny &mdash; with numbers

Let's try that again with actual data and a specific prior...

* n = 10 coin flips & x = 3 heads

* flat prior Beta ( `$\alpha$` = 1, `$\beta$` = 1)

* we want to draw inferences about `$\theta$` the probability of the coin
landing on heads (H)

`$$\;\;\propto\;\;
{10 \choose 3} \theta^{3} (1 - \theta)^{10 - 3} \;\;*\;\;
\frac{\gamma(1 + 1)}{\gamma(1)\gamma(1)} \theta^{1 - 1} (1 - \theta)^{1 - 1}$$`

`$$\;\;=\;\;
C_1 \;*\; \theta^{3} (1 - \theta)^{7} \;\;*\;\;
C_2 \;*\; 1 * 1$$`

`$$\;\;=\;\; C_3 \;*\; \theta^{4 - 1} (1 - \theta)^{8 - 1}$$`

So our posterior distribution for `$\theta$` is
Beta `$(\alpha^* \,=\, 4, \; \beta^* \,=\, 8)$`

---
# Posterior & Prior

``` r
> p <- seq(0, 1, by = 0.001)
> plot(p, dbeta(p, shape1 = 4, shape2 = 8), type = "l", lwd = 3, ylim = c(0, 4), main = expression(paste("posterior: Beta(",
+ alpha, " = 4, ", beta, " = 8), ", " prior: Beta( ", alpha, " = 1, ", beta, " = 1)")),
+ ylab = "density", bty = "n")
> points(p, dbeta(p, shape1 = 1, shape2 = 1), col = "red", type = "l", lwd = 2)
> abline(v = 0.5, lty = 2, col = "red") ## prior mean
> abline(v = 0.3, lty = 2, col = "seagreen") ## data mean
> abline(v = 4/(4 + 8), lty = 2) ## posterior mean
> text(0.65, 2, labels = "prior mean = .5", col = "red")
> text(0.15, 4, labels = "data mean = .3", col = "seagreen")
> text(0.5, 3, labels = "posterior mean = 4/12")
```

---
class: slide-font-25
# Posterior penny &mdash; conjugate prior

This is a special example where we use a *conjugate* prior

* *conjugacy* the prior is the type of distribution as the posterior
  
* parameters are different, but both are beta (in the coin example)
  
* depends on likelihood distribution: e.g., beta is a conjugate prior for
the binomial likelihood

* in this example, our prior beliefs can be described as having seen
 `$(\alpha - 1)$` prior heads and `$(\beta - 1)$` prior tails

+ prior  `$(\alpha^* \,=\, 1, \; \beta^* \,=\, 1)$` -- suggesting we haven't
  previously seen anything, and simply put equal weight on the chances of H & T
  + likelihood: 10 flips, 3 H, 7 T
  + posterior  `$(\alpha^* \,=\, (3 + 1), \; \beta^* \,=\, (7 + 1))$`

---
# Posterior & Different Prior

``` r
> p <- seq(0, 1, by = 0.001)
> plot(p, dbeta(p, shape1 = 5, shape2 = 8), type = "l", lwd = 3, ylim = c(0, 4), main = expression(paste("posterior: Beta(",
+ alpha, " = 5, ", beta, " = 8), ", " prior: Beta( ", alpha, " = 2, ", beta, " = 1)")),
+ ylab = "density", bty = "n")
> points(p, dbeta(p, shape1 = 2, shape2 = 1), col = "red", type = "l", lwd = 2)
> points(p, dbeta(p, shape1 = 4, shape2 = 8), col = "blue", type = "l", lty = 2, lwd = 2)
> abline(v = 2/3, lty = 2, col = "red") ## prior mean
> abline(v = 0.3, lty = 2, col = "seagreen") ## data mean
> abline(v = 5/(5 + 8), lty = 2) ## posterior mean
> text(0.15, 3.4, labels = "previous posterior", col = "blue")
> text(0.15, 3.1, labels = expression(paste("Beta(", alpha, " = 4, ", beta, " = 8)")),
+ col = "blue")
```

note the difference between the posterior mean & mode (& median?)

---
# Posterior & Different Prior #2

``` r
> p <- seq(0, 1, by = 0.001)
> plot(p, dbeta(p, shape1 = 9, shape2 = 8), type = "l", lwd = 3, ylim = c(0, 4), main = expression(paste("posterior: Beta(",
+ alpha, " = 9, ", beta, " = 8), ", " prior: Beta( ", alpha, " = 6, ", beta, " = 1)")),
+ ylab = "density", bty = "n")
> points(p, dbeta(p, shape1 = 6, shape2 = 1), col = "red", type = "l", lwd = 2)
> points(p, dbeta(p, shape1 = 4, shape2 = 8), col = "blue", type = "l", lty = 2, lwd = 2)
> abline(v = 6/7, lty = 2, col = "red")
> abline(v = 0.3, lty = 2, col = "seagreen")
> abline(v = 9/(9 + 8), lty = 2)
> text(0.15, 3.4, labels = "previous posterior", col = "blue")
> text(0.15, 3.1, labels = expression(paste("Beta(", alpha, " = 4, ", beta, " = 8)")),
+ col = "blue")
```

---
class: slide-font-25
# Conjugate priors (cont.)

With conjugate priors, deriving the posterior is convenient & relatively easy...
--
and inference is also easy with R functions:

``` r
> alpha = 4
> beta = 8
> qbeta(0.5, alpha, beta) ## median
## [1] 0.3238045
> qbeta(c(0.25, 0.75), alpha, beta) ## IQR
## [1] 0.2364026 0.4204710
> alpha/(alpha + beta) ## mean
## [1] 0.3333333
> pbeta(0.5, alpha, beta) ## Pr(theta < .50)
## [1] 0.8867188
> p <- seq(0, 1, by = 0.001)
> index_mode <- which.max(dbeta(seq(0, 1, by = 0.001), alpha, beta))
> p[index_mode] ## mode
## [1] 0.3
> plot(p, dbeta(p, alpha, beta), type = "l", main = expression(paste("posterior: Beta(",
+ alpha, " = 4, ", beta, " = 8)")), ylab = "density", bty = "n")
> abline(v = qbeta(0.5, alpha, beta))
> abline(v = alpha/(alpha + beta), col = "red")
> abline(v = p[index_mode], col = "blue")
```

![](intro_bayes_files/figure-html/inference-1.png)

---
# Inference through sampling

We can also tell a very similar story with sampling

``` r
> alpha = 4
> beta = 8
> posterior_sample <- rbeta(30, alpha, beta)
> mean(posterior_sample)
## [1] 0.3564887
> median(posterior_sample)
## [1] 0.3347518
> mean(posterior_sample < 0.5)
## [1] 0.8333333
> hist(posterior_sample, freq = FALSE, xlim = c(0, 1), ylim = c(0, 4))
> points(p, dbeta(p, alpha, beta), type = "l")
```

![](intro_bayes_files/figure-html/sampling-1.png)

---
# Inference through sampling (cont.)

Now with a bigger posterior sample

``` r
> alpha = 4
> beta = 8
> posterior_sample <- rbeta(2000, alpha, beta)
> mean(posterior_sample)
## [1] 0.3342336
> median(posterior_sample)
## [1] 0.3249278
> mean(posterior_sample < 0.5)
## [1] 0.884
> hist(posterior_sample, freq = FALSE, xlim = c(0, 1), ylim = c(0, 4))
> points(p, dbeta(p, alpha, beta), type = "l")
```

![](intro_bayes_files/figure-html/sampling2-1.png)

---
class: slide-font-25
# Inference through sampling (cont.)

* Can you calculate a 95% *credible interval*?

+ There is a 95% probability that the parameter is within the 95% credible interval

- Pr(lower bound < `$\theta$` < upper bound) = 95%
 - `$\theta$` is the probability of H in the previous example

+ note the probabilistic interpretation (and not the awkward word salad of
  describing a confidence interval)

* In many situations (e.g., non-conjugate priors), we do not know the exact
form of the posterior distribution

+ but if we can sample from the posterior, then we can still carry out
  inference

---
# Why Bayes??

* In most well-behaved situations (e.g., regression model, large sample size, reasonable priors)
Bayesian inference is going to be very similar to the frequentist approach

* When trouble arises (e.g., small n), Bayesian results can give much more sensible
results

* The probabilistic interpretations from the Bayesian approach are extremely appealing

* Bayesian approach allows you to carry out inference when frequentist cannot
(e.g., multiple imputation for missing data, model uncertainty, comparing non-nested models)

---
# Why NOT Bayes??

* It is hard! Not all problems are as easy as flipping coins

* Sampling methods can help with the hard math.  But now that coding is hard
and it takes a long time!

* Well, the coding used to be hard.  Now we have Stan!

+ can still take a long time

---
# Stan

* Language for Bayesian analysis with applicable to a wide range of models

* Produced C++ code to speed things up

* Companion packages to make life easy

+ [rstanarm](https://mc-stan.org/rstanarm/) allows for a seamless transition
  from R to Stan