12  MLE

Recall that in this part of the text, we tackle the problem of constructing models (or updating beliefs) from data. Because data collection is itself a source of uncertainty, we investigate how to appropriately propagate that uncertainty to the inferences (that is, the answers to our questions) that we draw from the models. Communicating the uncertainty associated with our answers will be key.

The goal of estimation is to find a probability model that describes the collected data so that, ultimately, we can answer questions about a larger population.

We noted that there are several ways to use the data to estimate a probability model.

Roadmap:
✓ Eyeball
✓ Kernel density estimation (nonparametric)
✓ Method of moments (parametric)
Today: Maximum likelihood (parametric)
Next: Bayesian posterior (parametric)

12.1 Key question

\[ \text{Suppose } X \sim F_X \text{ where } F_X \text{ is unknown.}\\ \text{How might one estimate } F_X \text{ from } \underbrace{\boldsymbol{X}_N}_{\text{data}} ?\]

12.1.1 Example

data <- c(
-1.26,  4.05,  0.37, -1.43,  3.33, -0.14,  4.81,  0.44,  1.62,
 4.75,  3.13,  4.56,  0.20,  5.00,  0.47, -0.48,  4.14,  3.62,
 0.70,  3.92,  0.84,  0.38, -0.87,  1.05, -3.69,  0.24,  1.47,
-5.37,  3.24, -6.76, -3.20,  3.84,  2.57,  0.05, -0.35, -2.77,
 5.84,  2.98,  1.06, -0.20,  6.82,  4.52, -1.31,  0.74, -1.86,
 0.17,  0.25,  1.19, -0.11, -0.22
)

12.1.1.1 ECDF

The empirical CDF (ECDF) is a nonparametric estimate of the CDF that jumps by 1/N at each observation. It isn’t smooth.

fhat <- ecdf(data)
plot(fhat, main = parse(text="ECDF~hat(F)[X]"))

12.1.1.2 The histogram

The histogram is a nonparametric estimate of the PDF. Again, it isn’t smooth.

h1 <- hist(data, main = "", xlab = "X")

12.1.1.3 Kernel density estimate of CDF

The kernel density estimate provides a smooth, nonparametric estimate of both the CDF and the PDF.

library(magrittr)  # for the %>% pipe

# Smoothed ECDF: an average of normal CDFs centered at each data point
ecdfstar <- function(t, data, smooth){
  outer(t, data, function(a, b){ pnorm(a, b, smooth) }) %>% rowMeans
}
plot(ecdf(data), main = "")
curve(ecdfstar(x, data, smooth = .7), add = TRUE, lwd = 3, col = "blue")

12.1.1.4 Kernel density estimate of PDF

# Smoothed PDF: an average of normal densities centered at each data point
epdfstar <- function(t, data, smooth){
  outer(t, data, function(a, b){ dnorm(a, b, smooth) }) %>% rowMeans
}
hist(data, freq = FALSE, main = "", ylim=c(0,.2), breaks=15)
curve(epdfstar(x, data, smooth = .7), add = TRUE, lwd = 3, col = "blue")

# Perhaps easier code

d1 <- density(data)
hist(data, freq = FALSE, breaks=15)
lines(d1, lwd = 3, col = "blue")

12.1.1.5 This chapter

In this chapter, you will learn about maximum likelihood estimation (MLE), which is another way to estimate a distribution (both CDF and PDF) from data.

MLE is a very commonly used method.

12.2 The big idea behind MLE

ANSWER: Maximum likelihood. The big idea is to choose the parameter value under which the observed data would have been most probable.
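
As a toy illustration (not from the lecture; the numbers are made up): suppose a coin with unknown heads-probability p lands heads 8 times in 10 flips. The observed data are more probable under p = 0.8 than under p = 0.5, so maximum likelihood prefers p = 0.8; in fact 8/10 is the MLE.

# Toy sketch: likelihood of 8 heads in 10 flips under two candidate values of p
dbinom(8, size = 10, prob = 0.5)  # ~ 0.044
dbinom(8, size = 10, prob = 0.8)  # ~ 0.302; higher, and 8/10 is the MLE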

12.3 Maximum likelihood details

If the distribution of \(X\) is known, then \[ f_X(x |\ \theta) \] is the density function or the relative likelihood function for a single outcome.

When the observations of the sample are independent, the density function for a sample, \(\boldsymbol{X}_N\) is \[f_{\boldsymbol{X}_N}(\boldsymbol{x}|\ \theta) = \prod_{i=1}^N f_X(x_i|\ \theta)\]

In the usual probability setting, \(\theta\) is known and the outcome \(\boldsymbol{x}\) varies. In the estimation setting the roles reverse: the data \(\boldsymbol{x}\) are known and \(\theta\) is unknown. Viewing the same expression as a function of \(\theta\) gives

\[L(\theta|\ \boldsymbol{x}) = f_{\boldsymbol{X}_N}(\boldsymbol{x}|\ \theta)\]

This is the likelihood function. The maximum likelihood estimate is the value of \(\theta\) that maximizes this function. One can use visualization tools, mathematics (calculus or linear algebra), or computational optimization algorithms to find a solution.
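
As a minimal sketch of the full recipe (the exponential model and simulated data here are illustrative assumptions, not part of the examples below): for \(X_i \sim \text{Exponential}(\lambda)\), calculus gives the closed-form MLE \(\hat{\lambda} = 1/\bar{x}\), and a numerical optimizer recovers the same answer.

set.seed(1)
x <- rexp(100, rate = 2)  # simulated data; the true rate is 2

# Log likelihood of the sample as a function of the rate
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))
optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
1 / mean(x)  # closed-form MLE; agrees with the numerical optimum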

12.4 Example 2: Salary data

12.5 Example 3: Adult female height

12.5.1 Example 4: BMI

Hmisc::getHdata(nhgh)
bmi <- nhgh$bmi

For the sake of discussion, suppose the gamma scale parameter is known to be 1.7, so that only the shape parameter must be estimated. (We’ll get to the multi-parameter setting later.)

# Gamma log likelihood as a function of the shape parameter
l <- function(shape, scale = 1.7, bmi){
  outer(shape, bmi, function(a, b)
    dgamma(
        x = b
      , shape = a
      , scale = scale  # use the scale argument rather than a fixed value
      , log = TRUE
    )
  ) %>% rowSums
}
curve(l(x, 1.7, bmi), 0, 45, ylab = "", main = "log Likelihood", xlab = "shape")

Note that the likelihood function is a function of the unknown parameter.

Strategy: Estimate the unknown parameter that maximizes the log likelihood.
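
Working on the log scale is convenient: the logarithm turns the product over observations into a sum and, being increasing, does not change the location of the maximum.

\[\ell(\theta|\ \boldsymbol{x}) = \log L(\theta|\ \boldsymbol{x}) = \sum_{i=1}^N \log f_X(x_i|\ \theta)\]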

Why does maximizing the likelihood make sense? Because the likelihood measures how probable the observed data are under each candidate parameter value, so the maximizing value is the one most consistent with the data we actually observed.

curve(l(x, 1.7, bmi), 0, 45, ylab = "", main = "log Likelihood", xlab = "shape")
xxx <- seq(15, 22, by = 0.01)
# Poor man's maximization: evaluate on a fine grid and take the argmax
mle <- xxx[which.max(l(xxx, 1.7, bmi))]
abline(v = mle, col = "blue", lwd = 3)

In practice, we will use the optim, optimize, or mle (from the stats4 package) functions for maximization.
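
For instance, a one-line alternative to the grid search above (a sketch assuming the l function and bmi vector already defined):

# Numerical maximization of the gamma log likelihood over the shape parameter
optimize(function(s) l(s, 1.7, bmi), interval = c(1, 45), maximum = TRUE)$maximum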

Now consider MLE with the normal distribution.

require(stats4)

# Negative log likelihood for a normal model; mle() minimizes this
nLL <- function(mean, sd){
  fs <- dnorm(
        x = bmi
      , mean = mean
      , sd = sd
      , log = TRUE
    )
  -sum(fs)
}
fit <- mle(
    nLL
  , start = list(mean = 1, sd = 1)
  , method = "L-BFGS-B"   # box-constrained optimizer
  , lower = c(0, 0.01)    # bounds: mean >= 0, sd >= 0.01
)
par(mfrow = c(1,2)); plot(profile(fit), absVal = FALSE)
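
The profiles can also be converted into likelihood-based confidence intervals for both parameters:

confint(fit)  # profile-likelihood confidence intervals for mean and sd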

par(mfrow = c(1,2))
plot(ecdf(bmi), main = "")
curve(
    pnorm(x, coef(fit)[1], coef(fit)[2])
  , add = TRUE
  , col = "blue"
  , lwd = 3
)
hist(bmi, freq = FALSE, main = "")
curve(
    dnorm(x, coef(fit)[1], coef(fit)[2])
  , add = TRUE
  , col = "blue"
  , lwd = 3
)

One can also use the lm function to fit the normal distribution, which is often easier: in the intercept-only model bmi ~ 1, the estimated intercept is the sample mean (the MLE of the mean), and the residual standard error is nearly the MLE of the standard deviation (it divides by \(N-1\) rather than \(N\)).

lm(bmi ~ 1) %>% summary

Call:
lm(formula = bmi ~ 1)

Residuals:
Body Mass Index [kg/m^2] 
    Min      1Q  Median      3Q     Max 
-15.142  -4.892  -1.032   3.558  56.548 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.32174    0.08431   335.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.95 on 6794 degrees of freedom
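
As a quick check (a sketch reusing the fit and bmi objects from above), the two approaches essentially agree once the residual standard error is rescaled from the \(N-1\) to the \(N\) denominator:

m1 <- lm(bmi ~ 1)
coef(m1)   # intercept = sample mean; compare with coef(fit)["mean"]
n <- sum(!is.na(bmi))
sigma(m1) * sqrt((n - 1) / n)  # ~ MLE of the sd; compare with coef(fit)["sd"]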

12.6 Hands-on

Estimate the distribution of arm length using the method of maximum likelihood. Generate plots of the estimated CDF and PDF.

Hmisc::getHdata(nhgh)
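
A possible skeleton to start from (the column name arml is an assumption; check names(nhgh) for the arm-length variable):

arml <- nhgh$arml          # assumed arm-length column; verify with names(nhgh)
arml <- arml[!is.na(arml)]
# Choose a model, write the (negative) log likelihood, maximize it,
# then overlay the fitted CDF and PDF on the ECDF and histogram,
# as in the BMI example.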