class: center, middle, inverse, title-slide .title[ # Introduction to
Bayesian Statistics with
] .author[ ### Jason Thomas ] .institute[ ### R Working Group ] .date[ ### April 18th, 2025 ] --- # Introduction * Today we will learn about the Bayesian approach to statistical inference + 🤨 ummmm...why, exactly? * Two main approaches to statistical inference: frequentist vs. Bayesian + why one and not the other? + learning Bayes by example * [https://github.com/buckipr/R_Working_Group/bayes](https://github.com/buckipr/R_Working_Group/bayes/intro) <br/> (slides -- coming soon to a repo near you) --- # In which there are two kinds Frequentist inference... * the world introduced in Stats 101, where model parameters are fixed, but unknown (so we need to estimate them with data) * uncertainty is assessed with respect to repeated resampling of the data → sampling distributions & asymptotics + are we there yet? + difficulties with interpretation (e.g., confidence intervals) + out-of-sample predictions? --- class: inverse, middle, center background-image: url(img/asymptotia.png) background-size: contain <p style="color:red; font-weight:bold; font-size:20px; text-align: left; position: absolute; bottom: 20px;"> source: RJ Little (2006) Calibrated Bayes. American Statistician</p> --- class: inverse, center, middle # Bayesian Framework --- class: wide-slide # Welcome to Bayes-ity .left-column[<br/><br/><img src="img/Thomas_Bayes.gif">] .right-column[ * We have been around the block a few times and one might say we even no a thing or two ] --- class: wide-slide # Welcome to Bayes-ity .left-column[<br/><br/><img src="img/Thomas_Bayes.gif">] .right-column[ * We have been around the block a few times and one might say we even no a thing or two + example: if you flip a penny, there is a good chance that it lands on "heads" or H ] --- class: wide-slide # Welcome to Bayes-ity .left-column[<br/><br/><img src="img/Thomas_Bayes.gif">] .right-column[ * We have been around the block a few times and one might say we even no a thing or two + example: if you flip a penny, there is a good chance that it lands on "heads" or H * Our friend comes along and gives us a penny that they found on Mars + what is going to happen if you flip it? <br> (not a rhetorical question) ] --- class: wide-slide # Welcome to Bayes-ity .left-column[<br/><br/><img src="img/Thomas_Bayes.gif">] .right-column[ * Suppose you flip it once and get H. You do a few more experiments and get 9 T in a row. + What is the probability that you get H on the next toss? Pr(H) = ? + That next toss lands on H, and the subsequent 91 tosses turn up T (yay, more data). + What will happen on the 101st toss, i.e., now, what do you think Pr(H) is? ] --- class: slide-font-25 # Bayesics This process of flipping the Martian coin illustrates the Bayesian framework * stating your original belief about the chance of heads (H) <br/> -- * gathering some data (e.g., coin tosses) & choosing a model <br/> <br/> <br/> -- * **updating** your belief after seeing the data (what do you expect to happen on the next experiment/observation?) <br/> -- * (and repeat) --- class: slide-font-25 # Bayesics This process of flipping the Martian coin illustrates the Bayesian framework * stating your original belief about the chance of heads (H) + .red[*prior* distribution for the probability of H, Pr(H)] * gathering some data (e.g., coin tosses) & choosing a model + .red[*likelihood*] (what frequentists primarily rely on) + .red[conditional prob of data given parameters] + .red[for a given probability of H, how likely are these data: HTTTTTTTTT] * **updating** your belief after seeing the data (what do you expect to happen on the next experiment/observation?) + .red[*posterior* distribution (used for inference)] * (and repeat) -- use updated beliefs for the next data set --- # Slightly more formally * *Prior distribution:* Pr(Heads) * *Likelihood:* Pr(Data | Heads) * *Posterior* (using Bayes' Theroem) $$ \mbox{Pr(Heads|Data)} = \frac{\mbox{Pr(Data | Heads) Pr(Heads)}}{\mbox{Pr(Data)}} $$ Since Pr(Data) (in the denominator) does not involve Pr(Heads), it is just a scaling constant that can be ignored, so we typically work with $$ \mbox{Pr(Heads|Data)} \propto \mbox{Pr(Data | Heads) Pr(Heads)} $$ (posterior is proportional to likelihood * prior) --- # Bayes' Theorem Quick detour on probabilities * `\(Pr(A)\)`: marginal probability of event *A* (possible events: {*A*, *B*}) + example: `\(Pr (Heads) = \frac{1}{2} \;\;\;\;, \;\;\;\;Pr [Tails] = \frac{1}{2}\)` * `\(Pr( D | E )\)`: conditional prob of event *D* given that event *E* occurred + example: `\(Pr(\)` 2nd card = ♠️ | 1st card = ♦️) conditional probability of drawing a spade from a deck of cards given that you pulled a diamond on your first draw --- # Bayes' Theorem Quick detour on probabilities * `\(Pr(D, E)\)`: joint prob of both events `\(D\)` and `\(E\)` occurring + example: `\(Pr(\)` 🎲 = 4 , 🎲 > 4) * Relationships among probabilities: + `\(Pr(A, B) \;=\; Pr(A | B) * Pr(B)\)` ⇒ `\(\frac{Pr(A,B)}{Pr(B)} \;=\; Pr(A | B)\)` + Bayes' Theorem: $$ \frac{Pr(B | A) Pr(A)}{Pr(B)} \;=\; \frac{Pr(A, B)}{Pr(B)} \;=\; Pr(A | B) $$ --- # Toy example * Let's tackle this Martian coin with a few examples, starting with a silly version with hope that it gives us some insight. * Set up + Data: TTTT + Model: *binomial* distribution (for 0/1 or yes/no outcomes) + What is `\(\pi\)` (probability of H)? * A few details about this model for 0/1 outcomes. + the model is not interested in the order, so HTH is the same as HHT + but we need to *count* the number of ways we can get one T and two H (to figure out how likely this outcome is) --- # Binomial distribution * The binomial has one parameter, we will call it `\(\pi\)`, and some data + `\(n\)` - number of coin flips (trails) + `\(x\)` - number of heads/H (successes) + `\(\pi\)` = Pr(H) * Formula for the binomial distribution (probability of getting `\(x\)` heads/H when you flip the coin `\(n\)` times) is `$$Pr(x) = {n \choose x} \pi^x (1 - \pi)^{n - x}$$` --- class: slide-font-25 # Side note: combinatorics * `\({n \choose x}\)` — "n choose x" shows up in the binomial distribution and helps us count + In `\(n\)` coin flips, how many ways can we get `\(x\)` number of heads? + `\({n \choose x} \;\;=\;\; \frac{n!}{(n-x)! x!}\)` + example `$${10 \choose 10} \;\;=\;\; \frac{10!}{(10-10)! 10!} \;\;=\;\; \frac{10!}{(1) 10!} \;\;=\;\; 1$$` The is only 1 way to get 10 H from 10 coin flips: + the `\(1^{st}\)` flip = H, the `\(2^{nd}\)` flip = H, ..., AND the `\(10^{th}\)` flip = H. --- # Side note: more combinatorics With 10 flips, how many ways can we get only one H? -- ``` r > choose(10, 1) ``` ``` ## [1] 10 ``` -- 10 ways to get only 1 H + only the first flip = H, all other 9 = T + only the second flip = H + ... + only the tenth flip = H --- # Back to the binomial Recall are data: TTTT ``` r > # Pr(H = 0) > dbinom(x = 0, # number of H + size = 4, # number of coin flips + prob = .5) # pi = .5 ``` ``` ## [1] 0.0625 ``` --- # Bayes * Bayesian + silly prior (only 2 values are possible) - Pr(\pi = 0.2) = 0.8 - Pr(\pi = 0.5) = 0.2 + posterior? --- class: slide-font-25 # Simple Bayesian example Let's model that Martian penny. * What is your prior belief before seeing the Martian penny in action? -- + I bet it acts like an Earth penny, since we are standing in Columbus. So `\(Pr(\mbox{Heads})\)` is about `\(\frac{1}{2}\)` with very little wiggle room. -- - strong prior belief -- + No idea! Any value between 0 and 1 is equally likely. -- - flat/diffuse/uninformative prior belief or uninformative -- + 0.5 is more likely than other values, but the Martians may have altered the coin so not too sure -- - weakly informative prior --- # Introducing the Beta dist Since the probability of heads is bounded between 0 & 1, a Beta distribution is a popular choice for a prior distribution. -- * The Beta dist. has two parameters: `\(\alpha > 0\)` , and `\(\beta > 0\)` -- * The mean of the Beta is `\(\frac{\alpha}{\alpha + \beta}\)` -- * In R, we can use `dbeta()` to explore the Beta dist. and `rbeta()` to make random draws from a Beta -- + `dbeta(x, shape1, shape2)` + `\(\alpha\)` is shape1 + `\(\beta\)` is shape2 --- # Pick your prior ``` r > p <- seq(0, 1, by = 0.001) > plot(p, dbeta(p, shape1 = 1, shape2 = 1), type = "l", lwd = 3, ylim = c(0, 4), main = "Prior for Pr(H)", + ylab = "density") > points(p, dbeta(p, shape1 = 2, shape2 = 1), col = "blue", type = "l") > points(p, dbeta(p, shape1 = 1, shape2 = 2), col = "red", type = "l") > points(p, dbeta(p, shape1 = 8, shape2 = 8), col = "darkgreen", type = "l") > points(p, dbeta(p, shape1 = 0.2, shape2 = 0.2), col = "bisque4", type = "l") > text(0.15, 4, labels = expression(paste(alpha, " = 1, ", beta, " = 1"))) > text(0.15, 3.7, labels = expression(paste(alpha, " = 2, ", beta, " = 1")), col = "blue") > text(0.15, 3.4, labels = expression(paste(alpha, " = 1, ", beta, " = 2")), col = "red") > text(0.75, 4, labels = expression(paste(alpha, " = 8, ", beta, " = 8")), col = "darkgreen") > text(0.75, 3.7, labels = expression(paste(alpha, " = 0.2, ", beta, " = 0.2")), col = "bisque4") ``` <img src="intro_bayes_files/figure-html/beta_dist-1.png" style="display: block; margin: auto;" /> --- class: slide-font-25 # Update our penny prior How to get prob of data? * A Martian coin flip gives us Heads or Tails, and n flips can be modeled with the binomial distribution -- * To get the posterior distribution for the probability of heads, we use Bayes' Theorem for `\(\theta = Pr(H)\)` $$ Pr(\theta | data) \;\;=\;\;\frac{Pr(\mbox{data} | \theta) \;\;Pr(\theta)}{Pr(data)}$$ `$$\;\;=\;\;\frac{\mbox{Binomial}(\mbox{data} \;|\; p = \theta) \;\;\mbox{Beta}(\theta \;|\; \alpha, \beta)}{Pr(\mbox{data})}$$` `$$\;\;\propto\;\;\mbox{Binomial}(\mbox{data} \;|\; p = \theta) \;\;\mbox{Beta}(\theta \;|\; \alpha, \beta)$$` (posterior is proportional to likelihood * prior, since Pr(data) does not depend on `\(\theta\)`) --- class: slide-font-25 # Side note: gamma function `\(\gamma ( x )\)` (For the mathematically curious) when working with the beta and binomial distributions we will come across * `\(\gamma(x)\)` — the gamma function appears in the beta distribution + `\(\gamma(x) = \int_0^{\infty} t^{x-1} \exp(-t) dt\)` + `\(\gamma(x) \;\;=\;\; (x-1)! \;\;=\;\; (x-1) * (x-2) * \cdots * 1\)` ``` r > gamma(4) ## 3 * 2 * 1 ``` ``` ## [1] 6 ``` ``` r > gamma(4) == factorial(3) ``` ``` ## [1] TRUE ``` --- class: slide-font-25 # Posterior penny Posterior distribution for our parameter `\(\theta\)` is `$$\;\;\propto\;\; {n \choose x} \theta^{x} (1 - \theta)^{n - x} \;\;*\;\; \frac{\gamma(\alpha + \beta)}{\gamma(\alpha)\gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$` -- `$$\;\;=\;\; C_1 \;*\; \color{red}{\theta^{x} (1 - \theta)^{n - x}}\;\;*\;\; C_2 \;*\; \color{red}{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}$$` -- `$$\;\;=\;\; C_3 \;*\; \theta^{x + \alpha - 1} (1 - \theta)^{n - x + \beta - 1}$$` -- This is a Beta `\((\theta \;|\; \alpha^* \; \beta^*)\)` * `\(\alpha^* = x + \alpha\)` * `\(\beta^* = n - x + \beta\)` * our prior is like a previous data set with `\((\alpha - 1)\)` heads and `\((\beta - 1)\)` tails, when updating a beta proir with a binomial likelihood + (see .red[*kernels*] on the 2nd line of equations) --- # Posterior penny — with numbers Let's try that again with actual data and a specific prior... * n = 10 coin flips & x = 3 heads * flat prior Beta ( `\(\alpha\)` = 1, `\(\beta\)` = 1) * we want to draw inferences about `\(\theta\)` the probability of the coin landing on heads (H) `$$\;\;\propto\;\; {10 \choose 3} \theta^{3} (1 - \theta)^{10 - 3} \;\;*\;\; \frac{\gamma(1 + 1)}{\gamma(1)\gamma(1)} \theta^{1 - 1} (1 - \theta)^{1 - 1}$$` `$$\;\;=\;\; C_1 \;*\; \theta^{3} (1 - \theta)^{7} \;\;*\;\; C_2 \;*\; 1 * 1$$` `$$\;\;=\;\; C_3 \;*\; \theta^{4 - 1} (1 - \theta)^{8 - 1}$$` So our posterior distribution for `\(\theta\)` is Beta `\((\alpha^* \,=\, 4, \; \beta^* \,=\, 8)\)` --- # Posterior & Prior ``` r > p <- seq(0, 1, by = 0.001) > plot(p, dbeta(p, shape1 = 4, shape2 = 8), type = "l", lwd = 3, ylim = c(0, 4), main = expression(paste("posterior: Beta(", + alpha, " = 4, ", beta, " = 8), ", " prior: Beta( ", alpha, " = 1, ", beta, " = 1)")), + ylab = "density", bty = "n") > points(p, dbeta(p, shape1 = 1, shape2 = 1), col = "red", type = "l", lwd = 2) > abline(v = 0.5, lty = 2, col = "red") ## prior mean > abline(v = 0.3, lty = 2, col = "seagreen") ## data mean > abline(v = 4/(4 + 8), lty = 2) ## posterior mean > text(0.65, 2, labels = "prior mean = .5", col = "red") > text(0.15, 4, labels = "data mean = .3", col = "seagreen") > text(0.5, 3, labels = "posterior mean = 4/12") ``` <img src="intro_bayes_files/figure-html/penny_post-1.png" style="display: block; margin: auto;" /> --- class: slide-font-25 # Posterior penny — conjugate prior This is a special example where we use a *conjugate* prior * *conjugacy* the prior is the type of distribution as the posterior * parameters are different, but both are beta (in the coin example) * depends on likelihood distribution: e.g., beta is a conjugate prior for the binomial likelihood * in this example, our prior beliefs can be described as having seen `\((\alpha - 1)\)` prior heads and `\((\beta - 1)\)` prior tails + prior `\((\alpha^* \,=\, 1, \; \beta^* \,=\, 1)\)` -- suggesting we haven't previously seen anything, and simply put equal weight on the chances of H & T + likelihood: 10 flips, 3 H, 7 T + posterior `\((\alpha^* \,=\, (3 + 1), \; \beta^* \,=\, (7 + 1))\)` --- # Posterior & Different Prior ``` r > p <- seq(0, 1, by = 0.001) > plot(p, dbeta(p, shape1 = 5, shape2 = 8), type = "l", lwd = 3, ylim = c(0, 4), main = expression(paste("posterior: Beta(", + alpha, " = 5, ", beta, " = 8), ", " prior: Beta( ", alpha, " = 2, ", beta, " = 1)")), + ylab = "density", bty = "n") > points(p, dbeta(p, shape1 = 2, shape2 = 1), col = "red", type = "l", lwd = 2) > points(p, dbeta(p, shape1 = 4, shape2 = 8), col = "blue", type = "l", lty = 2, lwd = 2) > abline(v = 2/3, lty = 2, col = "red") ## prior mean > abline(v = 0.3, lty = 2, col = "seagreen") ## data mean > abline(v = 5/(5 + 8), lty = 2) ## posterior mean > text(0.15, 3.4, labels = "previous posterior", col = "blue") > text(0.15, 3.1, labels = expression(paste("Beta(", alpha, " = 4, ", beta, " = 8)")), + col = "blue") ``` <img src="intro_bayes_files/figure-html/penny_post_2-1.png" style="display: block; margin: auto;" /> note the difference between the posterior mean & mode (& median?) --- # Posterior & Different Prior #2 ``` r > p <- seq(0, 1, by = 0.001) > plot(p, dbeta(p, shape1 = 9, shape2 = 8), type = "l", lwd = 3, ylim = c(0, 4), main = expression(paste("posterior: Beta(", + alpha, " = 9, ", beta, " = 8), ", " prior: Beta( ", alpha, " = 6, ", beta, " = 1)")), + ylab = "density", bty = "n") > points(p, dbeta(p, shape1 = 6, shape2 = 1), col = "red", type = "l", lwd = 2) > points(p, dbeta(p, shape1 = 4, shape2 = 8), col = "blue", type = "l", lty = 2, lwd = 2) > abline(v = 6/7, lty = 2, col = "red") > abline(v = 0.3, lty = 2, col = "seagreen") > abline(v = 9/(9 + 8), lty = 2) > text(0.15, 3.4, labels = "previous posterior", col = "blue") > text(0.15, 3.1, labels = expression(paste("Beta(", alpha, " = 4, ", beta, " = 8)")), + col = "blue") ``` <img src="intro_bayes_files/figure-html/penny_post_3-1.png" style="display: block; margin: auto;" /> --- class: slide-font-25 # Conjugate priors (cont.) With conjugate priors, deriving the posterior is convenient & relatively easy... -- and inference is also easy with R functions: ``` r > alpha = 4 > beta = 8 > qbeta(0.5, alpha, beta) ## median ## [1] 0.3238045 > qbeta(c(0.25, 0.75), alpha, beta) ## IQR ## [1] 0.2364026 0.4204710 > alpha/(alpha + beta) ## mean ## [1] 0.3333333 > pbeta(0.5, alpha, beta) ## Pr(theta < .50) ## [1] 0.8867188 > p <- seq(0, 1, by = 0.001) > index_mode <- which.max(dbeta(seq(0, 1, by = 0.001), alpha, beta)) > p[index_mode] ## mode ## [1] 0.3 > plot(p, dbeta(p, alpha, beta), type = "l", main = expression(paste("posterior: Beta(", + alpha, " = 4, ", beta, " = 8)")), ylab = "density", bty = "n") > abline(v = qbeta(0.5, alpha, beta)) > abline(v = alpha/(alpha + beta), col = "red") > abline(v = p[index_mode], col = "blue") ``` <!-- --> --- # Inference through sampling We can also tell a very similar story with sampling ``` r > alpha = 4 > beta = 8 > posterior_sample <- rbeta(30, alpha, beta) > mean(posterior_sample) ## [1] 0.3564887 > median(posterior_sample) ## [1] 0.3347518 > mean(posterior_sample < 0.5) ## [1] 0.8333333 > hist(posterior_sample, freq = FALSE, xlim = c(0, 1), ylim = c(0, 4)) > points(p, dbeta(p, alpha, beta), type = "l") ``` <!-- --> --- # Inference through sampling (cont.) Now with a bigger posterior sample ``` r > alpha = 4 > beta = 8 > posterior_sample <- rbeta(2000, alpha, beta) > mean(posterior_sample) ## [1] 0.3342336 > median(posterior_sample) ## [1] 0.3249278 > mean(posterior_sample < 0.5) ## [1] 0.884 > hist(posterior_sample, freq = FALSE, xlim = c(0, 1), ylim = c(0, 4)) > points(p, dbeta(p, alpha, beta), type = "l") ``` <!-- --> --- class: slide-font-25 # Inference through sampling (cont.) * Can you calculate a 95% *credible interval*? + There is a 95% probability that the parameter is within the 95% credible interval - Pr(lower bound < `\(\theta\)` < upper bound) = 95% - `\(\theta\)` is the probability of H in the previous example + note the probabilistic interpretation (and not the awkward word salad of describing a confidence interval) * In many situations (e.g., non-conjugate priors), we do not know the exact form of the posterior distribution + but if we can sample from the posterior, then we can still carry out inference --- # Why Bayes?? * In most well-behaved situations (e.g., regression model, large sample size, reasonable priors) Bayesian inference is going to be very similar to the frequentist approach -- * When trouble arises (e.g., small n), Bayesian results can give much more sensible results -- * The probabilistic interpretations from the Bayesian approach are extremely appealing -- * Bayesian approach allows you to carry out inference when frequentist cannot (e.g., multiple imputation for missing data, model uncertainty, comparing non-nested models) --- # Why NOT Bayes?? * It is hard! Not all problems are as easy as flipping coins -- * Sampling methods can help with the hard math. But now that coding is hard and it takes a long time! -- * Well, the coding used to be hard. Now we have Stan! -- + can still take a long time --- # Stan * Language for Bayesian analysis with applicable to a wide range of models * Produced C++ code to speed things up * Companion packages to make life easy + [rstanarm](https://mc-stan.org/rstanarm/) allows for a seamless transition from R to Stan