NOTE: The .rmd version of the file is available here: (link)
Please prepare responses/solutions for the following questions. On the day of the exam, you will be given a new set of questions. You will use the solutions you’ve prepared for this exam during the exam.
During the exam, you will also be permitted to access the internet for publicly available content. You will not be allowed to communicate with anyone via the internet or any other means during the exam. This includes, but is not limited to:
You may NOT discuss any aspect of the exam or prep questions with anyone other than the instructor or TA. You may NOT share code or documents.
final-exam
exam-prep.rmd
with your solutions. Create a rendered report in .pdf output.
Exam questions are organized into sections corresponding to the learning outcomes of the course.
compare and contrast different definitions of probability, illustrating differences with simple examples
See questions from previous exams
express the rules of probability verbally, mathematically, and computationally
See questions from previous exams
illustrate the rules of probability with examples
See questions from previous exams
using long-run proportion definition of probability, derive the univariate rules of probability
See questions from previous exams
organize/express bivariate random variables in cross tables
See questions from previous exams
define joint, conditional, and marginal probabilities
See questions from previous exams
identify joint, conditional, and marginal probabilities in cross tables
See questions from previous exams
identify when a research question calls for a joint, conditional, or marginal probability
See questions from previous exams
describe the connection between conditional probabilities and prediction
See questions from previous exams
derive Bayes rule from cross tables
See questions from previous exams
apply Bayes rule to answer research questions
See questions from previous exams
apply cross table framework to the special case of binary outcomes with special attention to Sensitivity, Specificity, Positive predictive value, Negative predictive value, Prevalence, Incidence
See questions from previous exams
define/describe confounding variables, including Simpson’s paradox, DAGs, causal pathway
See questions from previous exams
describe approaches for avoiding or addressing confounding, including stratification and randomization
See questions from previous exams
list various data types (nominal, ordinal, interval, ratio, discrete, continuous)
match each data type with probability models that may describe it
See HW
discuss the degree to which models describe the underlying data
tease apart model fit and model utility
express probability models mathematically, computationally, and graphically (PMF/PDF, CMF/CDF, quantile function, histogram/eCDF)
Q. Suppose the yearly hospital charges (in thousands of dollars) for a randomly selected UVA student follow a mixture distribution.
For 60% of students, the hospital charges will be $0. For the remaining 40% of students, the hospital charges are a random variable described by a gamma distribution with shape = 2 and scale = 2. (Again, in thousands of dollars.)
The following function mimics the hospital charge distribution. It generates draws of the random variable. Use the function to generate an expression for the CDF and quantile functions of the random variable.
rhc <- function(n){ rgamma(n,shape=2,scale=2)*rbinom(n,1,.4) }
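One possible starting point, sketched under the assumption that a large simulated sample is an acceptable stand-in for the exact distribution: the base R function ecdf() turns the draws into an empirical CDF, and quantile() inverts it. The names draws, Fhat, Qhat, and Fexact are illustrative and do not appear in the question. The closed form used for comparison follows from the mixture description above (a point mass of 0.6 at zero plus 0.4 times the gamma CDF).

```r
set.seed(1)  # arbitrary seed, only for reproducibility

rhc <- function(n){ rgamma(n, shape = 2, scale = 2) * rbinom(n, 1, .4) }

# Simulate many draws and build empirical versions of the CDF and quantile function
draws <- rhc(1e5)
Fhat  <- ecdf(draws)                                             # empirical CDF
Qhat  <- function(p) quantile(draws, probs = p, names = FALSE)   # empirical quantile function

# Closed-form CDF implied by the mixture: mass 0.6 at zero, gamma tail otherwise
Fexact <- function(x) ifelse(x < 0, 0, 0.6 + 0.4 * pgamma(x, shape = 2, scale = 2))

# Compare the two CDFs at a few points, then evaluate a quantile
cbind(x = c(0, 2, 5, 10), empirical = Fhat(c(0, 2, 5, 10)), exact = Fexact(c(0, 2, 5, 10)))
Qhat(0.9)
```

The empirical and closed-form values should agree to within simulation error; increasing the number of draws tightens the agreement.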
employ probability models (computationally and analytically) to answer research questions
Q. Consider earnings X (in thousands of dollars) the first year after graduation from UVA with an undergraduate degree. If X is normal with \(\mu = 60\) and \(\sigma = 10\), what level of earnings represents the 90th percentile?
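A minimal sketch of the computational route, assuming the question asks for the earnings level with 90% of graduates below it (that reading is an assumption). The base R function qnorm() is the normal quantile function; the variable names are illustrative.

```r
mu    <- 60   # mean earnings, in thousands of dollars
sigma <- 10   # standard deviation, in thousands of dollars

# Quantile function of the normal distribution evaluated at 0.90
q90 <- qnorm(0.90, mean = mu, sd = sigma)
q90

# Sanity check: the CDF at that value should return 0.90
pnorm(q90, mean = mu, sd = sigma)
```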
explain and implement different approaches for fitting probability models from data (Maximum likelihood, Bayesian posterior)
Q. Suppose an upcoming election for UVA student body president is between two candidates. In a survey of 30 students, 20 voiced support for candidate A. Use the Desmos calculator (link) to fit a probability model with Bayesian methods for the election, specifically the probability that candidate A is preferred by the student body. Report the 95% credible interval. (In this calculator, \(H_1\) is the number of supporters for candidate A and \(T_1\) is the number of supporters for candidate B.)
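The same kind of calculation can be cross-checked in R. The sketch below assumes a uniform Beta(1, 1) prior and an equal-tailed interval; the Desmos calculator may use a different prior or interval type, so treat the result as a comparison point rather than the expected answer. Variable names are illustrative.

```r
# Assumed prior: Beta(1, 1), i.e. uniform on [0, 1]
a_prior <- 1
b_prior <- 1

supporters_A <- 20   # H_1 in the calculator
supporters_B <- 10   # T_1 in the calculator

# Conjugate beta-binomial update: add the counts to the prior parameters
a_post <- a_prior + supporters_A
b_post <- b_prior + supporters_B

# Equal-tailed 95% credible interval from the posterior quantile function
ci95 <- qbeta(c(0.025, 0.975), a_post, b_post)
ci95
```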
Q. Suppose an upcoming election for UVA student body president is between two candidates. In a survey of 30 students, 20 voiced support for candidate A. Use the Desmos calculator (link) to fit a probability model with Maximum Likelihood for the election, specifically the probability that candidate A is preferred by the student body. Report the 1/20 support interval. (In this calculator, \(n\) is the total number of respondents, \(h\) is the number that voice support for candidate A.)
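A rough R counterpart to the calculator, assuming the 1/20 support interval is defined as the set of values of p whose likelihood is at least 1/20 of the maximum likelihood (confirm this definition against the course notes). The grid search and the names rel_lik, grid, and support_interval are illustrative choices.

```r
n <- 30   # total number of respondents
h <- 20   # number voicing support for candidate A

# Maximum likelihood estimate for a binomial proportion
p_hat <- h / n

# Likelihood relative to its maximum
rel_lik <- function(p) dbinom(h, n, p) / dbinom(h, n, p_hat)

# Grid search: keep the values of p with relative likelihood >= 1/20
grid <- seq(0.001, 0.999, by = 0.001)
support_interval <- range(grid[rel_lik(grid) >= 1/20])

p_hat
support_interval
```

A finer grid, or uniroot() applied on each side of p_hat, would give the endpoints to more decimal places.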
explore how to communicate uncertainty when constructing models and answering research questions (support intervals, credible intervals)
Q. Repeat the election analysis performed above with additional data. In a survey of 100 students, 60 students voiced support for candidate A. Compare the interval estimates based on the larger dataset to those generated from the smaller dataset. Comment on which analysis you find more persuasive and explain why.
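One way to put the two analyses side by side in R, assuming a uniform Beta(1, 1) prior, equal-tailed 95% credible intervals, and that the 100-student survey is analyzed as a dataset on its own (all three are assumptions). The helper ci_width is hypothetical.

```r
# Hypothetical helper: width of the equal-tailed credible interval
# under a uniform Beta(1, 1) prior
ci_width <- function(successes, failures, level = 0.95) {
  alpha <- 1 - level
  diff(qbeta(c(alpha / 2, 1 - alpha / 2), 1 + successes, 1 + failures))
}

width_small <- ci_width(20, 10)   # 30 students, 20 for candidate A
width_large <- ci_width(60, 40)   # 100 students, 60 for candidate A

c(small = width_small, large = width_large)
```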
propagate uncertainty in simulations, visualize the uncertainty inherent in fitting probability models from data
Q. Going back to the election question, suppose that the support for candidate A was known to be \(p=0.55\). In an election in which 100 students vote, what is the probability that 51 or more votes will be cast for candidate A?
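A short sketch of the binomial calculation, assuming the 100 votes are independent with a common probability 0.55 of going to candidate A. The variable names are illustrative.

```r
n_votes <- 100
p       <- 0.55

# P(X >= 51) = 1 - P(X <= 50) for X ~ Binomial(100, 0.55)
p_win <- 1 - pbinom(50, n_votes, p)
p_win

# Equivalent check: sum the PMF over 51, ..., 100
sum(dbinom(51:100, n_votes, p))
```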
Q. Now suppose the probability is unknown, and is estimated from data. The following shows the distribution for \(P\)(Votes>50) when estimated from data using a uniform prior and a survey of 30 students with 20 voicing support for the candidate. Add a line to show the solution when \(p\) is known. Comment on the uncertainty when \(p\) is estimated from data.
pis <- rbeta(10000, 21, 11)
hist(1-pbinom(50,100,pis), freq=FALSE, breaks=100)
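One way to overlay the known-p value on the histogram is abline() with a vertical line. This is a sketch only: the seed, the color, and the names p_win_draws and p_known are illustrative choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Posterior draws of p under a uniform prior after 20 of 30 students voiced support
pis <- rbeta(10000, 21, 11)

# P(Votes > 50) for each posterior draw, and for the known value p = 0.55
p_win_draws <- 1 - pbinom(50, 100, pis)
p_known     <- 1 - pbinom(50, 100, 0.55)

hist(p_win_draws, freq = FALSE, breaks = 100)
abline(v = p_known, col = "red", lwd = 2)   # vertical line at the known-p solution
```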
explore the trade-offs of model complexity and generalizability
Q. Consider the following estimates of the PDF for infant birthweight. Both are poorly fitting estimates. Explain the concepts of overfitting and underfitting in the context of the birthweight data.
d1 <- MASS::birthwt
hist(d1$bwt, breaks=20, freq=FALSE, xlim = c(0,6000), ylim = c(0,0.0007), main = "Birthweight (grams)", xlab = "")
lines(density(d1$bwt, adjust = 1/5), lwd = 3, col = "navy")
lines(density(d1$bwt, adjust=5), lwd = 3, col = "red")
Q. Explain the concept of generalizability in the context of the birthweight data.
select prior distributions which reflect personal belief (informative vs weakly informative priors)
implement Bayesian updating
use probability models to build simulations of complex real world processes to answer research questions
Q: The Monty Hall problem is based on a classic game show. Contestants on the show were shown three doors. Behind one randomly selected door was a sports car; behind the other doors were goats.
At the start of the game, contestants would select a door, say door A. Then, the host would open either door B or C to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.
In this problem, consider a modified version of the Monty Hall problem in which the number of doors is variable. Rather than 3 doors, consider a game with 4 or 5 or 50 doors. In the modified version of the game, a contestant would select an initial door, say door A. Then, the host would open one of the remaining doors to reveal a goat. At that point in the game, the host would ask the contestant if she would like to change her door selection. Once a contestant decided to stay or change, the host would open the chosen door to reveal the game prize, either a goat or a car.
Consider two strategies:
A. The function game below plays a single game of Monty Hall. The function returns a vector of length two: the first element is the prize under strategy 1 and the second element is the prize under strategy 2. The function has a single input parameter, N, which is the number of doors in the game.
Use the game function to estimate the probability that both strategies simultaneously result in a goat. Let N=4. (Note the word simultaneously. This means that in the same game, both strategies resulted in a goat.)
require(magrittr)
require(dplyr)
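game <- function(N){
  if(N<3) stop("Must have at least 3 doors")
  # Randomly place one car and N-1 goats behind the N doors
  prize <- sample(c(rep("goat",N-1),"car"), N)
  # Contestant's initial door choice
  guess <- sample(1:N,1)
  game <- data.frame(door = 1:N, prize = prize, stringsAsFactors = FALSE) %>%
    # Flag the contestant's first guess
    mutate(first_guess = case_when(
      door == guess ~ 1
      , TRUE ~ 0
    )) %>%
    # The host may reveal any goat door other than the first guess
    mutate(potential_reveal = case_when(
      first_guess == 1 ~ 0
      , prize == "car" ~ 0
      , TRUE ~ 1
    )) %>%
    # At most two doors are ineligible, so the door ranked 3rd (ties broken at random) is an eligible one
    mutate(reveal = 1*(rank(potential_reveal, ties.method = "random") == 3)) %>%
    # The contestant may switch to any door that is neither the first guess nor the revealed door
    mutate(potential_switch = case_when(
      first_guess == 1 ~ 0
      , reveal == 1 ~ 0
      , TRUE ~ 1
    )) %>%
    # Exactly two doors are ineligible, so rank 3 selects one eligible door at random
    mutate(switch = 1*(rank(potential_switch, ties.method = "random") == 3))
  # Prizes behind the first-guess door and behind the switch door
  c(game$prize[game$first_guess == 1], game$prize[game$switch == 1])
}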
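A simulation estimate of this kind can be sketched with replicate(). The helper both_goat_prob below is hypothetical (it is not part of the exam materials); it accepts any function with the same interface as game, that is, one that takes N and returns a character vector of length two. The number of repetitions is an arbitrary choice.

```r
# Hypothetical helper: estimate P(both strategies yield a goat in the same game).
# `play` is any function of N returning c(prize_strategy_1, prize_strategy_2).
both_goat_prob <- function(play, N, reps = 10000) {
  # replicate() returns a 2 x reps character matrix, one column per simulated game
  results <- replicate(reps, play(N))
  # Proportion of games in which both elements are "goat"
  mean(results[1, ] == "goat" & results[2, ] == "goat")
}

# Usage with the game function defined above (requires dplyr and magrittr):
# set.seed(1)
# both_goat_prob(game, N = 4)
```

Passing the game function as an argument keeps the helper independent of how a single game is simulated, so the same code can be reused for other values of N.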