class: center, middle, inverse, title-slide .title[ # Maximum Likelihood
with
] .author[ ###
Jason Thomas
R Working Group ] .date[ ###
November 22th, 2024 ] --- # Introduction This workshop introduces **maximum likelihood estimation** and illustrates many of the concepts and key ideas using <img src="img/Rlogo.png" width="64"> * What is MLE? * Bit of calculus...with pictures 😎 (sorry, this will be next time) + How do we find a *maximum*? + Um, wait, the maximum of what exactly? * Brute force + No time for calculus, I gotta graduate! + I do have access to a computer, so how do I find a max with a fast CPU? --- # Regression: model (sorta) Let's start with a familiar story... * Dependent variable (y) is associated with independent variables (x's). For one observation in our data set, we have $$ y_i \,=\, \beta_1 X_1 \,+\, \beta_2 X_2 \,+\, \cdots \,+\, \beta_p X_p \,+\, \varepsilon_i $$ * We observe `\(y\)` and the `\(x\)`'s (❤️❤️❤️ our beloved data ❤️❤️❤️) + the `\(i\)` subscript just means a generic observation in our data,<br> so `\(i\)` can go from 1 to N (# number of observations) * The **parameters** in our model are the `\(\beta\)`'s + wait, is `\(\varepsilon_i\)` a parameter? --- # Regression: goals This model can teach us a thing or two. * But first we gotta put in a little work to find values for the `\(\beta\)`'s that make good predictions for our data + nice! our model is wrong, but at least it will try to look like the real world (😘 data) * Are the `\(\beta\)`'s different from 0? 🤩 Is this a "good" model, or at least not as bad as my other model? * To make some progress we need to make an assumption about how our data are distributed. + actually, it might be better to think about our are *errors* are distributed --- # Regression: choosing the `\(\beta\)`'s * The universe is pretty fond of the normal distribution, so let's start there. + the universe also makes it hard to draw pictures in more than 2 dimensions, so let's focus on a simple model for now $$ y_i \,=\, \alpha + \beta X \,+\, \varepsilon_i $$ * We assume that `\(\varepsilon_i \sim \mbox{Normal}(\mu = 0, \sigma = 1.8)\)` + the `\(\sim\)` just means that the errors have a normal distribution + `\(\mu\)` is the mean of the normal distribution, and `\(\sigma\)` is the standard deviation --- class: slide-font-25 # Regression: <img src="img/Rlogo.png" width="64"> and the normal <img src="img/Rlogo.png" width="64"> makes it easy to work with the normal distribution ``` r x <- seq(-8, 8, by = .01) plot(x, dnorm(x, mean = 0, sd = 1.8)) # dnorm --> "density" or pdf ``` <img src="mle_files/figure-html/unnamed-chunk-1-1.png" width="45%" style="display: block; margin: auto;" /> --- # Quick detour: functions <img src="img/Rlogo.png" width="64"> includes a lot of useful functions, but we can create our own! ``` r *my_function <- function(x, y) { z <- x + (2*y) return(z) } my_function(3, 4) ``` ``` ## [1] 11 ``` ``` r 3 + 2*4 ``` ``` ## [1] 11 ``` --- # Regression: behind the curtain Formula for the normal distribution: `$$f(y | X, \mu, \sigma^2) \;\;=\;\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left\{ \frac{-(y - \mu)^2}{2\sigma^2} \right\}$$` ``` r norm_pdf <- function(x, mu, sigma2) { * y <- (1/sqrt(2*pi*sigma2)) * exp( -((x - mu)^2) / (2*sigma2) ) return(y) } norm_pdf(x = 0.4, mu = 0, sigma2 = 1.8^2) ``` ``` ## [1] 0.2162291 ``` ``` r dnorm( x = 0.4, mean = 0, sd = 1.8 ) ``` ``` ## [1] 0.2162291 ``` --- # Regression: model * So what is our model really trying to say? * To find out, we can use `rnorm()`, which generates random draws from a normal distribution (just like our model assumes) ``` r sample_size <- 85; mu <- 0; sig <- 1.8 alpha <- 3; beta <- 0.7 *x <- rnorm(n = sample_size, # sample randomly from normal dist. mean = -0.3, sd = 12) y <- alpha + beta*x ``` Right? --- # Regression: model Nope! We forgot the assumption about our **errors** following a <br> *normal distribution* ``` r sample_size <- 85; mu <- 0; sig <- 1.8 alpha <- 3; beta <- 0.7 x <- rnorm(n = sample_size, # sample randomly from normal dist. mean = -0.3, sd = 12) *y <- alpha + beta*x + rnorm(n = sample_size, mu, sig) ``` --- # Regression: model ``` r plot(x, y) # data ``` <img src="mle_files/figure-html/unnamed-chunk-6-1.png" width="50%" style="display: block; margin: auto;" /> --- # Regression: sanity check Let's check our errors and see if they have the right distribution ``` r mod1 <- lm(y ~ x) mu; mean(mod1$resid) ``` ``` ## [1] 0 ``` ``` ## [1] -5.26208e-17 ``` ``` r sig; sd(mod1$resid) ``` ``` ## [1] 1.8 ``` ``` ## [1] 1.745053 ``` --- # Regression: sanity check 2 ``` r hist(mod1$resid, probability = TRUE, xlim=c(-6,6), ylim=c(0, .3)) x_range <- seq(-6, 6, by = 0.01) *lines(x_range, dnorm(x_range, mu, sig), col = "red", lwd = 2) ``` <img src="mle_files/figure-html/unnamed-chunk-8-1.png" width="40%" style="display: block; margin: auto;" /> --- # Density ⇨ Likelihood Yeah, yeah, but I'm here to learn about **MLE**! * OK, so remember the density function for the normal distribution? This was in our `norm_pdf()` function and looks like `$$f(y | X, \mu, \sigma^2) \;\;=\;\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{\frac{-(y - \mu)^2}{2\sigma^2}\right\}$$` * This is our *density* function + with regression, `\(\mu = \alpha + \beta * X\)` * Quick note: density is NOT the same as probability + they are related (just ask if you want the details) --- # Density ⇨ Likelihood * With **maximum likelihood**, we flip what is known and what is given. + how likely is a value of the parameter, *given* the observed data + instead of `\(f(y | \beta, X)\)`, we use `\(f(\beta | y, X)\)` * After the swap, we search for values of `\(\alpha\)`, `\(\beta\)`, and `\(\sigma\)` that give us the largest **likelihood**, `\(\mathcal{L}\)` `\begin{eqnarray} \mathcal{L}(\alpha, \beta, \sigma \,|\, y, X) &=& \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left\{ \frac{-(y - (\alpha + \beta*X))^2}{2\sigma^2} \right\} \end{eqnarray}` (recall that `\(\mu = \alpha + \beta * X)\)` --- # Likelihood function ``` r like1 <- function(alpha, beta, sigma2, x, y) { mu <- alpha + beta * x dens <- (1/sqrt(2*pi*sigma2)) * exp( -((x - mu)^2) / (2*sigma2) ) return(dens) } like1(alpha = 1, beta = 1, sigma = 1, x = 10, y = 8) ``` ``` ## [1] 0.2419707 ``` ``` r like1(alpha = .5, beta = .5, sigma = .5, x = 10, y = 8) ``` ``` ## [1] 9.056529e-10 ``` Aha! Our first guess is better than our second guess. <br><br> But what complications do we still have to conquer? --- # Obstacles * *First*, we've got more than 1 observation, so we need to set up our likelihood function to include all 58 pairs of (x, y). + (as we'll see this leads to another math and computer issue) * *Second*, it might take a LONG time to just randomly choose values for the parameters and see if it increases the likelihood. + we need an efficient way to find the maximum of this function + the set of parameters that accomplish this are the <br> **maximum likelihood estimates** --- # Multiple observations * To get over the first hurdle, we need to throw another assumption on the existing pile. * If we assume our observations are *independent* and *identically* distributed (i.e., they all have the same distribution), then we are back in the game! + sometimes you will see this as the *iid* assumption * When two observations are independent, then the probability of both occurring is just the probabilities multiplied together + Pr(Heads on 2 coin flips) = <br> Pr(first flip = Heads) * Pr(2nd flip = Heads) --- class: slide-font-28 # Multiple observations (cont.) With the *iid* assumption, our likelihood function is now... `\begin{eqnarray} \mathcal{L}(\alpha, \beta, \sigma \,|\, \boldsymbol{y}, \boldsymbol{X}) &=& \mbox{Normal}(y_1, x_1, \alpha, \beta, \sigma) * \\ && \mbox{Normal}(y_2, x_2, \alpha, \beta, \sigma) * \\ && \;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \vdots \\ && \mbox{Normal}(y_{58}, x_{58}, \alpha, \beta, \sigma) \\ \\ \mbox{which we can abreviate as} \\ \\ \mathcal{L}(\alpha, \beta, \sigma \,|\, \boldsymbol{y}, \boldsymbol{X}) &=& \prod\limits_{i=1}^N \mbox{Normal}\,(\, \alpha, \beta, \,\sigma \;|\; X_i \,,\, y_i \,) \end{eqnarray}` *notation*: I'm using bold to indicate a vector: `\(\boldsymbol{y} = (y_1, y_2, \ldots, y_{58})\)` --- # Log likelihood * If you try this out with more than a few observations, you will quickly discover that likelihood gets really small really fast! + our computer will not be able to keep track of such small numbers * We deal with this issue by working with the `\(\log()\)` of the likelihood + this has the added benefit of making the math easier + `\(\log(A * B) \;=\; \log(A) + \log(B)\)` * So the log likelihood of N observations is simply `\begin{eqnarray} \log\{\mathcal{L}(\alpha, \beta, \sigma \,|\, \boldsymbol{y}, \boldsymbol{X}) \} &=& \sum\limits_{i=1}^N \log\{\mbox{Normal}\,(\, \alpha, \beta, \,\sigma \;|\; X_i \,,\, y_i \,)\} \end{eqnarray}` --- # Exercises 1. Write down the log likelihood function 1. Create an R function that returns the log likelihood for multiple observations 1. Using your function, set `\(\alpha\)` and `\(\sigma\)` to fixed values and plot values of the log likelihood for different values of `\(\beta\)` --- # Maximization Now let's get over our second hurdle... * Maximize the log likelihood function + `\(\ell(\alpha, \beta, \sigma \,|\, \boldsymbol{y}, \boldsymbol{X}) \;\;=\;\; \log\{\mathcal{L}(\alpha, \beta, \sigma \,|\, \boldsymbol{y}, \boldsymbol{X})\)` * The `optim()` function is the trick here. The arguments are + `par` -- a vector of initial values of the parameters that you want to find the "best" values + `fn` -- the function whose maximum you want to find + this tool actually *minimizes* the function, so your function should return the negative log likelihood --- # Maximization (cont.) ``` r max_like_est <- optim(par = c(1, 1, 1), nlog_lik, x=x, y=y) max_like_est$par # MLE ``` ``` ## [1] 3.2139735 0.7168371 1.7344635 ``` ``` r max_like_est$value # negative log likelihood at MLE ``` ``` ## [1] 167.4335 ``` ``` r logLik(mod1) ``` ``` ## 'log Lik.' -167.4335 (df=3) ``` --- # Standard errors ``` r mle2 <- optim(par = c(1, 1, 1), nlog_lik, x=x, y=y, * hessian = TRUE) mle2$par ``` ``` ## [1] 3.2139735 0.7168371 1.7344635 ``` ``` r sqrt(diag(solve(mle2$hessian))) ``` ``` ## [1] 0.20078444 0.01909337 0.13299322 ``` ``` r coef(summary(mod1)) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.2138430 0.20322361 15.81432 9.055952e-27 ## x 0.7168491 0.01932532 37.09377 1.939099e-53 ``` --- # Exercises * `optim` does 100 iterations by default. Can you find the log likelihood value and parameter estimates after only 25 iterations? * Does the algorithm used to find the minimum make a difference? * Can you enforce a constraint so that `\(\beta\)` must stay below a value of 1? --- # Binomial Model for coin flips * Another example: coin flips + tails, heads, heads, heads (n = 4) + `\(p =\)` Probability(coin flip = heads) + table for `\(\hat{p}\)` & likelihood function + log likelihood & explicit solution