NAME: Please add your name here

NOTE: The .rmd version of the file is available here: (link)

Instructions

Please prepare responses/solutions for the following questions. On the day of the exam, you will be given a new set of questions. You may use the solutions you’ve prepared here during the exam.

During the exam, you will also be permitted to access the internet for publicly available content. You will not be allowed to communicate with anyone via the internet or any other means during the exam. This includes, but is not limited to:

No messaging, emailing, or using social media to contact others.
No posting questions or seeking answers on forums, chat rooms, chat bots (including large language models like ChatGPT), or any collaborative platforms.
No sharing or discussing exam content with peers through any online or electronic medium.

You may NOT discuss any aspect of the exam or prep questions with anyone other than the instructor or TA. You may NOT share code or documents.

Submission instructions

Within your course repo, create a folder called final-exam
Within the folder, create the script file exam-prep.rmd with your solutions. Create a rendered report in .pdf output.
Add, commit, and push to your repo on github.com.

Questions

Exam questions are organized into sections corresponding to the learning outcomes of the course.

compare and contrast different definitions of probability, illustrating differences with simple examples

See questions from previous exams

express the rules of probability verbally, mathematically, and computationally

See questions from previous exams

illustrate the rules of probability with examples

See questions from previous exams

using long-run proportion definition of probability, derive the univariate rules of probability

See questions from previous exams

organize/express bivariate random variables in cross tables

See questions from previous exams

define joint, conditional, and marginal probabilities

See questions from previous exams

identify joint, conditional, and marginal probabilities in cross tables

See questions from previous exams

identify when a research question calls for a joint, conditional, or marginal probability

See questions from previous exams

describe the connection between conditional probabilities and prediction

See questions from previous exams

derive Bayes rule from cross tables

See questions from previous exams

apply Bayes rule to answer research questions

See questions from previous exams

apply cross table framework to the special case of binary outcomes with special attention to Sensitivity, Specificity, Positive predictive value, Negative predictive value, Prevalence, Incidence

See questions from previous exams

define/describe confounding variables, including Simpson’s paradox, DAGs, causal pathway

See questions from previous exams

describe approaches for avoiding or addressing confounding, including stratification and randomization

See questions from previous exams

list various data types (nominal, ordinal, interval, ratio, discrete, continuous)

match each data type with probability models that may describe it

See HW

discuss the degree to which models describe the underlying data

tease apart model fit and model utility

express probability models mathematically, computationally, and graphically (PMF/PDF, CMF/CDF, quantile function, histogram/eCDF)

Q. Suppose the yearly hospital charges (in thousands of dollars) for a randomly selected UVA student follow a mixture distribution.

For 60% of students, the hospital charges will be $0. For the remaining 40% of students, the hospital charges are a random variable described by a gamma distribution with shape = 2 and scale = 2. (Again, in thousands of dollars.)

The following function mimics the hospital charge distribution. It generates draws of the random variable. Use the function to generate an expression for the CDF and quantile functions of the random variable.

rhc <- function(n){ rgamma(n,shape=2,scale=2)*rbinom(n,1,.4) }
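
One possible way to approach this, offered as a sketch rather than the intended solution: estimate the CDF and quantile function empirically from a large number of draws, then compare them with the closed-form expressions implied by the mixture (a point mass of 0.6 at zero plus 0.4 times a gamma CDF). The names draws, Fhat, Qhat, phc, and qhc are introduced here for illustration and are not part of the original.

```r
rhc <- function(n){ rgamma(n, shape = 2, scale = 2) * rbinom(n, 1, .4) }

set.seed(1)
draws <- rhc(1e5)

# Empirical versions built directly from the simulated draws
Fhat <- ecdf(draws)                                    # empirical CDF
Qhat <- function(p) quantile(draws, probs = p, names = FALSE)

# Analytic versions implied by the mixture (illustrative names)
# CDF: 0 below zero, then 0.6 + 0.4 * gamma CDF
phc <- function(x) ifelse(x < 0, 0, 0.6 + 0.4 * pgamma(x, shape = 2, scale = 2))
# Quantile: 0 for p <= 0.6, otherwise invert the gamma part
qhc <- function(p) {
  u <- pmin(pmax(p - 0.6, 0) / 0.4, 1)
  ifelse(p <= 0.6, 0, qgamma(u, shape = 2, scale = 2))
}

# Compare empirical and analytic values at a few points
cbind(x = c(0, 2, 4, 8), empirical = Fhat(c(0, 2, 4, 8)), analytic = phc(c(0, 2, 4, 8)))
```

Plotting Fhat with curve(phc(x), add = TRUE) overlaid is a quick visual check that the two agree.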

employ probability models (computationally and analytically) to answer research questions

Q. Let X be earnings (in thousands of dollars) in the first year after graduation from UVA with an undergraduate degree. If X is normal with \(\mu = 60\) and \(\sigma = 10\), what level of earnings is the 90th percentile?
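
A percentile question like this maps onto the quantile function of the model. The sketch below assumes the question asks for the 90th percentile of the stated normal distribution and uses qnorm; the names q90 and sim_q90 are illustrative only.

```r
# Analytic answer: quantile function of Normal(mean = 60, sd = 10)
q90 <- qnorm(0.90, mean = 60, sd = 10)
q90

# Sanity check: the CDF at that value should return 0.90
pnorm(q90, mean = 60, sd = 10)

# Computational check by simulation
set.seed(1)
sim_q90 <- quantile(rnorm(1e5, mean = 60, sd = 10), probs = 0.90, names = FALSE)
sim_q90
```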

explain and implement different approaches for fitting probability models from data (Maximum likelihood, Bayesian posterior)

Q. Suppose an upcoming election for UVA student body president is between two candidates. In a survey of 30 students, 20 voiced support for candidate A. Use the Desmos calculator (link) to fit a probability model with Bayesian methods for the election, specifically the probability that candidate A is preferred by the student body. Report the 95% credible interval. (In this calculator, \(H_1\) is the number of supporters for candidate A and \(T_1\) is the number of supporters for candidate B.)
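
The same kind of fit can be cross-checked in R. The sketch below assumes a uniform Beta(1, 1) prior, which the question itself does not specify; under that assumption the posterior for the support probability is Beta(1 + 20, 1 + 10). It reports an equal-tailed interval, which may differ slightly from whatever interval type the calculator uses. The variable names are illustrative.

```r
h <- 20          # supporters of candidate A
t <- 10          # supporters of candidate B
a0 <- 1; b0 <- 1 # ASSUMPTION: uniform Beta(1, 1) prior

# Conjugate update: posterior is Beta(a0 + h, b0 + t)
a1 <- a0 + h
b1 <- b0 + t

# Equal-tailed 95% credible interval
ci <- qbeta(c(0.025, 0.975), a1, b1)
ci

# Posterior probability that candidate A has majority support
1 - pbeta(0.5, a1, b1)
```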

Q. Suppose an upcoming election for UVA student body president is between two candidates. In a survey of 30 students, 20 voiced support for candidate A. Use the Desmos calculator (link) to fit a probability model with Maximum Likelihood for the election, specifically the probability that candidate A is preferred by the student body. Report the 1/20 support interval. (In this calculator, \(n\) is the total number of respondents and \(h\) is the number that voice support for candidate A.)
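
A rough R counterpart to the calculator, under one assumption worth stating: the 1/20 support interval is taken here to be the set of values of p whose likelihood is at least 1/20 of the maximum likelihood. If the course defines the interval differently, the endpoints will differ. The helper names rel_lik and f are illustrative.

```r
n <- 30   # respondents
h <- 20   # supporters of candidate A

# Maximum likelihood estimate for a binomial proportion
phat <- h / n

# Likelihood relative to its maximum (computed on the log scale for stability)
rel_lik <- function(p) exp(dbinom(h, n, p, log = TRUE) - dbinom(h, n, phat, log = TRUE))

# Endpoints: where the relative likelihood crosses 1/20 on each side of phat
f <- function(p) rel_lik(p) - 1/20
lower <- uniroot(f, c(1e-6, phat), tol = 1e-10)$root
upper <- uniroot(f, c(phat, 1 - 1e-6), tol = 1e-10)$root

c(lower = lower, mle = phat, upper = upper)
```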

explore how to communicate uncertainty when constructing models and answering research questions (support intervals, credible intervals)

Q. Repeat the election analysis performed above with additional data. In a survey of 100 students, 60 students voiced support for candidate A. Compare the interval estimates based on the larger dataset to those generated from the smaller dataset. Comment on which analysis you find more persuasive and explain why.

propagate uncertainty in simulations, visualize the uncertainty inherent in fitting probability models from data

Q. Going back to the election question, suppose that the support for candidate A was known to be \(p=0.55\). In an election in which 100 students vote, what is the probability that 51 or more votes will be cast for candidate A?
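
With p known, the number of votes for candidate A can be modeled as Binomial(100, 0.55), assuming independent voters. A minimal sketch of the analytic and simulated calculation follows; p_known and p_sim are illustrative names.

```r
# Analytic: P(X >= 51) = 1 - P(X <= 50) for X ~ Binomial(100, 0.55)
p_known <- 1 - pbinom(50, size = 100, prob = 0.55)
p_known

# Equivalent direct sum over the upper tail
sum(dbinom(51:100, size = 100, prob = 0.55))

# Simulation check
set.seed(1)
p_sim <- mean(rbinom(1e5, size = 100, prob = 0.55) >= 51)
p_sim
```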

Q. Now suppose the probability is unknown and is estimated from data. The following shows the distribution for \(P\)(Votes>50) when estimated from data using a uniform prior and a survey of 30 students with 20 voicing support for the candidate. Add a line to show the solution when \(p\) is known. Comment on the uncertainty when \(p\) is estimated from data.

pis <- rbeta(10000, 21, 11)
hist(1-pbinom(50,100,pis), freq=FALSE, breaks=100)
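
One way to overlay the known-p answer on this histogram is abline. The sketch below repeats the code above with a seed added for reproducibility, and assumes the known value is \(p=0.55\) from the previous question; p_win and p_known are illustrative names.

```r
set.seed(1)

# Posterior draws of p under a uniform prior and 20 of 30 in support
pis <- rbeta(10000, 21, 11)

# P(Votes > 50) for each posterior draw of p
p_win <- 1 - pbinom(50, 100, pis)

# P(Votes > 50) when p = 0.55 is treated as known
p_known <- 1 - pbinom(50, 100, 0.55)

hist(p_win, freq = FALSE, breaks = 100)
abline(v = p_known, col = "red", lwd = 2)   # vertical line at the known-p answer
```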

explore the trade-offs of model complexity and generalizability

Q. Consider the following estimates of the PDF for infant birthweight. Both are poorly fitting estimates. Explain the concepts of overfitting and underfitting in the context of the birthweight data.

d1 <- MASS::birthwt
hist(d1$bwt, breaks=20, freq=FALSE, xlim = c(0,6000), ylim = c(0,0.0007), main = "Birthweight (grams)", xlab = "")
lines(density(d1$bwt, adjust = 1/5), lwd = 3, col = "navy")
lines(density(d1$bwt, adjust=5), lwd = 3, col = "red")

Q. Explain the concept of generalizability in the context of the birthweight data.

select prior distributions which reflect personal belief (informative vs weakly informative priors)

implement bayesian updating

use probability models to build simulations of complex real world processes to answer research questions

Q. The Monty Hall problem is based on a classic game show. Contestants on the show were shown three doors. Behind one randomly selected door was a sportscar; behind the other doors were goats.

At the start of the game, contestants would select a door, say door A. Then, the host would open either door B or C to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.

In this problem, consider a modified version of the Monty Hall problem in which the number of doors is variable. Rather than 3 doors, consider a game with 4 or 5 or 50 doors. In the modified version of the game, a contestant would select an initial door, say door A. Then, the host would open one of the remaining doors to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.

Consider two strategies:

Always stay with the first door selected.
Always switch to another unopened door (chosen at random when more than one remains).

A. The function game below plays a single game of Monty Hall. The function returns a vector of length two: the first element is the prize under strategy 1 and the second element is the prize under strategy 2. The function has a single input parameter, N, which is the number of doors in the game.

Use the game function to estimate the probability that both strategies simultaneously result in a goat. Let N=4. (Note the word simultaneously. This means that in the same game, both strategies resulted in a goat.)

require(magrittr)

require(dplyr)

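game <- function(N){
  if(N<3) stop("Must have at least 3 doors")
  # Randomly place one car and N-1 goats behind the N doors
  prize <- sample(c(rep("goat",N-1),"car"), N)
  # Contestant's first guess
  guess <- sample(1:N,1)
  game <- data.frame(door = 1:N, prize = prize, stringsAsFactors = FALSE) %>% 
    # Flag the door the contestant picked first
    mutate(first_guess = case_when(
      door == guess ~ 1
      , TRUE ~ 0
    )) %>% 
    # The host may reveal any door that is neither the first guess nor the car
    mutate(potential_reveal = case_when(
        first_guess == 1 ~ 0
      , prize == "car" ~ 0
      , TRUE ~ 1
    )) %>% 
    # At most 2 doors are ineligible, so rank 3 always lands on an eligible door;
    # ties are broken at random, which picks the revealed door at random
    mutate(reveal = 1*(rank(potential_reveal, ties.method = "random") == 3)) %>% 
    # The contestant may switch to any door that is neither the first guess nor revealed
    mutate(potential_switch = case_when(
      first_guess == 1 ~ 0
      , reveal == 1 ~ 0
      , TRUE ~ 1
    )) %>% 
    # Exactly 2 doors are ineligible, so rank 3 is a randomly chosen eligible door
    mutate(switch = 1*(rank(potential_switch, ties.method = "random") == 3))
  # Prize under strategy 1 (stay), then prize under strategy 2 (switch)
  c(game$prize[game$first_guess == 1], game$prize[game$switch == 1])
}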
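
A joint probability such as "both strategies yield a goat in the same game" can be estimated by replaying the game many times and recording the proportion of games in which both outcomes occur together. The helper below is a sketch: est_both_goat is a hypothetical name, and it takes the game function as an argument so that it works with the game function defined above once dplyr is loaded.

```r
# Hypothetical helper: estimate P(stay = goat AND switch = goat) by simulation.
# `play` is any function of N returning c(prize_if_stay, prize_if_switch).
est_both_goat <- function(play, N = 4, reps = 10000) {
  # Each replicate returns a length-2 character vector, so `results`
  # is a 2 x reps matrix: row 1 = stay prize, row 2 = switch prize
  results <- replicate(reps, play(N))
  # Proportion of games where BOTH strategies end with a goat
  mean(results[1, ] == "goat" & results[2, ] == "goat")
}
```

With the function above defined, a call such as set.seed(1); est_both_goat(game, N = 4) gives the estimate. As a sanity check by hand for N = 4: both strategies give a goat only when the first guess is a goat (probability 3/4) and the random switch lands on the remaining goat rather than the car (probability 1/2), so the estimate should be near 3/8.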