set.seed(26702340)
sample(c("Y","D"), 1)
[1] "Y"
Here we provide solutions for the questions involving R.
Learning objective: Use the tools of data scientists
Learning objective: Implement best programming/coding practices
List three reasons why the following error might happen:
> abc(Surv(futime, death) ~ trt + flt3, data=myeloid)
Error in abc(Surv(futime, death) ~ trt + flt3, data = myeloid) :
could not find function "abc"
When you are done with the exam, please render this reproducible report as a pdf document.
Please explain the advantages of relative file paths over absolute file paths.
Learning objective: Compare and contrast different definitions of probability, illustrating differences with simple examples
Learning objective: Express the rules of probability verbally, mathematically, and computationally.
Machines A, B, and C generate widgets with a defect rate of 0.1, 0.01, and 0.001, respectively. Machine A generates twice as many widgets as B, and machine B generates twice as many widgets as machine C.
Please complete the cross table below based on the information above.
Hints:
A | B | C | ||
---|---|---|---|---|
Defective | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
Functional | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
If you did not know if a widget was functional or defective, what would be the probability that the widget is from machine A? Write the solution using mathematical notation, as in P(???) = ???.
How would you update the probability from the previous problem if you knew the widget was defective? Write the solution using mathematical notation, as in P(???) = ???.
Learning objective: Use probability models to build simulations of complex real world processes to answer research questions
The world series determines the championship team of major leage baseball. The tournament is a first-to-4-wins
competition. This means the tournament could end after 4, 5, 6, or 7 games.
This year, the two teams in the world series are the Los Angeles Dodgers (D) and the New York Yankees (Y). If we suppose that the two teams are equally matched, then the probability of winning for either team is 0.5. The outcome of a single game can be simulated with the sample command:
set.seed(26702340)
sample(c("Y","D"), 1)
[1] "Y"
Modify the code to simulate 7 games.
sample(c("Y","D"), 7, replace=TRUE)
[1] "D" "D" "Y" "Y" "Y" "Y" "Y"
Write a function which determines who won the world series and which game they won it. The input of the function can be the output of your code from the previous problem, a vector of 7 Ys or Ds. The output of your function should be a tuple: The first element is the team which won and the second element is total number of games in the series.
For example, for this sequence
"Y" "D" "D" "D" "D" "Y" "Y"
the output would be (D, 5) because the fourth win for the Dodgers occured in game 5.
<- function(x){
ws_outcome <- table(x)
t1 <- names(t1)
teams <- which(t1 >= 4)
winning_team_idx <- teams[winning_team_idx]
winning_team <- which(cumsum(x==winning_team)==4)[1]
winning_game data.frame(winning_team = winning_team, winning_game = winning_game)
}
# Check to see if the function works
<- sample(c("Y","D"), 7, replace=TRUE)
x ws_outcome(x)
winning_team winning_game
1 D 6
Use the replicate command to generate many outcomes of the world series. Estimate the proportion of times the series ends in 4, 5, 6, and 7 games. Generate a plot of the results.
<- function(x) sample(c("Y","D"), 7, replace=TRUE) |> ws_outcome()
one_series <- replicate(5000, one_series(), simplify = FALSE)
out
# Stack all the elements of the list into a dataframe
<- do.call("rbind", out)
sim_results
# Take a peak at the output
head(sim_results)
winning_team winning_game
1 Y 5
2 D 6
3 D 6
4 Y 5
5 Y 4
6 Y 5
# Summary table (Cell and margin probabilities)
# Note we could have also used the descr::CrossTable command if we also wanted conditional probs
<- table(sim_results) |> proportions() |> addmargins()
t1 t1
winning_game
winning_team 4 5 6 7 Sum
D 0.0620 0.1260 0.1590 0.1518 0.4988
Y 0.0570 0.1318 0.1584 0.1540 0.5012
Sum 0.1190 0.2578 0.3174 0.3058 1.0000
# Barplot
barplot(t1["Sum",1:4], ylim = c(0,max(t1[,1:4])*1.1))
box()
# Line plot
plot(4:7, t1["Sum",1:4], ylim = c(0,max(t1[,1:4])*1.1), pch = 16, lwd = 3, ylab = "Probability", xlab = "Games", axes = FALSE)
axis(1, 4:7)
axis(2)
box()
Learning objective: apply cross table framework to the special case of binary outcomes
Consider a cancer screening device. Supposing that \(P(\text{Cancer}) = p\), \(P(+|\text{cancer}) = tp\), and \(P(-|\text{not cancer}) = tn\), complete the following table with expressions of all joint, marginal, and conditional probabilities.
Cancer | Not Cancer | ||
---|---|---|---|
Test = + | |||
Test = - | |||
If \(p\) = 0.01, \(tp\) = 0.9, and \(tn\) = 0.8, what is the positive predictive value, \(P(\text{Cancer}|+)\)?
Create a figure, with cancer incidence, \(P(\text{Cancer})\), on the x-axis, going from 0 to 1 and the positive predictive value, \(P(\text{Cancer}|+)\), on the y-axis. Show how the positive predictive value changes as the incidence changes. Generate the figure under the assumption that the sensitivity, \(P(+|\text{Cancer})\), is 0.9 and the specificity, \(P(-|\text{Not cancer})\), is 0.95.
From 4.1,
\[ \begin{align*} P(\text{Cancer}|+) & = \frac{P(+|\text{Cancer})P(\text{Cancer})}{P(+|\text{Cancer})P(\text{Cancer}) + P(+|\text{Not cancer})P(\text{Not cancer})}\\ & = \frac{P(+|\text{Cancer})P(\text{Cancer})}{P(+|\text{Cancer})P(\text{Cancer}) + [1-P(-|\text{Not Cancer})][1-P(\text{Cancer})]}\\ & = \frac{\text{tp}\times \text{p}}{\text{tp}\times \text{p} + (1-\text{tn}) \times (1-p)}\\ & = \frac{1}{1 + \frac{(1-\text{tn})}{\text{tp}} \frac{(1-p)}{p}}\\ \end{align*} \]
<- function(p, tp, tn) tp * p / (tp*p + (1-tn)*(1-p))
ppv
# ppoints is a fast way to get a sequence of points between 0 and 1
# could also use seq(0,1,length=100)
<- ppoints(100)
ppp <- ppv(ppp, 0.9, 0.95)
yyy
plot(ppp, yyy, xlab = "P(cancer)", ylab = "Positive predictive value", type = "l", lwd = 3)
Learning objective: define/describe confounding variables, Simpson’s paradox, DAGs, and the causal pathway
The following function generates data from a cohort of individuals who were diagnosed with disease X. In the dataset, there is vaccination status, disease severity, and disease duration.
Generate 1000 draws from the function and create the cross table for vaccination status and disease duration. Calculate a summary of the effect of vaccination by calculating the difference is conditional probabilities:
\[ \Delta = P(\text{Short recovery}|\text{vaccinated}) - P(\text{Short recovery}|\text{unvaccinated}) \]
<- function(R){
vax_data <- rbinom(R,1,.5)
vs <- rbinom(R,1,.25*(vs==1)+.75*(vs==0))
ds <- rbinom(R,1,.7*(ds==1)+.5*(ds==0))
rt data.frame(
vaccination_status = c("vaccinated","unvaccinated")[vs+1]
disease_severity = c("mild","severe")[ds+1]
, recovery_time = c("short","long")[rt+1]
,
)
}
set.seed(34598763)
<- vax_data(1000)
d1
suppressPackageStartupMessages(require(descr))
suppressPackageStartupMessages(require(pander))
CrossTable(d1$vaccination_status, d1$recovery_time, prop.chisq = FALSE, dnn = c("Vax status", "Recovery")) |> pander(digits=1, split.tables=Inf)
Vax status |
Recovery long |
short |
Total |
---|---|---|---|
unvaccinated N Row(%) Column(%) Total(%) |
269 52.0% 46.9% 26.9% |
248 48.0% 58.1% 24.8% |
517 51.7% |
vaccinated N Row(%) Column(%) Total(%) |
304 62.9% 53.1% 30.4% |
179 37.1% 41.9% 17.9% |
483 48.3% |
Total |
573 57.3% |
427 42.7% |
1000 |
# Pick off the probs from the table
370600 - .479691 .
[1] -0.109091
suppressPackageStartupMessages(require(dplyr))
<- d1 |>
p_s_g_v filter(vaccination_status == "vaccinated") |>
mutate(outcome = recovery_time == "short") |>
pull(outcome) |>
mean()
<- d1 |>
p_s_g_u filter(vaccination_status == "unvaccinated") |>
mutate(outcome = recovery_time == "short") |>
pull(outcome) |>
mean()
- p_s_g_u p_s_g_v
[1] -0.1090901
Stratify the table by disease severity. As in the previous problem, calculate the same treatment effect in each strata.
<- d1 |> filter(disease_severity == "mild")
d1mild
CrossTable(d1mild$vaccination_status, d1mild$recovery_time, prop.chisq = FALSE, dnn = c("Vax status", "Recovery")) |> pander(digits=1, split.tables=Inf)
Vax status |
Recovery long |
short |
Total |
---|---|---|---|
unvaccinated N Row(%) Column(%) Total(%) |
199 50.0% 78.0% 38.0% |
199 50.0% 74.3% 38.0% |
398 76.1% |
vaccinated N Row(%) Column(%) Total(%) |
56 44.8% 22.0% 10.7% |
69 55.2% 25.7% 13.2% |
125 23.9% |
Total |
255 48.8% |
268 51.2% |
523 |
<- d1 |> filter(disease_severity == "severe")
d1severe
CrossTable(d1severe$vaccination_status, d1severe$recovery_time, prop.chisq = FALSE, dnn = c("Vax status", "Recovery")) |> pander(digits=1, split.tables=Inf)
Vax status |
Recovery long |
short |
Total |
---|---|---|---|
unvaccinated N Row(%) Column(%) Total(%) |
70 58.8% 22.0% 14.7% |
49 41.2% 30.8% 10.3% |
119 24.9% |
vaccinated N Row(%) Column(%) Total(%) |
248 69.3% 78.0% 52.0% |
110 30.7% 69.2% 23.1% |
358 75.1% |
Total |
318 66.7% |
159 33.3% |
477 |
# pull values from table
c(Mild = .552 - .500, severe = .307 - .412)
Mild severe
0.052 -0.105
<- d1 |>
p_s_g_v_m filter(vaccination_status == "vaccinated") |>
filter(disease_severity == "mild") |>
mutate(outcome = recovery_time == "short") |>
pull(outcome) |>
mean()
<- d1 |>
p_s_g_u_m filter(disease_severity == "mild") |>
filter(vaccination_status == "unvaccinated") |>
mutate(outcome = recovery_time == "short") |>
pull(outcome) |>
mean()
c(mild = p_s_g_v_m - p_s_g_u_m)
mild
0.052
<- d1 |>
p_s_g_v_s filter(vaccination_status == "vaccinated") |>
filter(disease_severity == "severe") |>
mutate(outcome = recovery_time == "short") |>
pull(outcome) |>
mean()
<- d1 |>
p_s_g_u_s filter(disease_severity == "severe") |>
filter(vaccination_status == "unvaccinated") |>
mutate(outcome = recovery_time == "short") |>
pull(outcome) |>
mean()
c(severe = p_s_g_v_s - p_s_g_u_s)
severe
-0.1045021
Based on the summary of the treatment effect that you observe in the combined and stratified tables, does vaccination help shorten recovery?
Which measure of treatment effect is most persuasive? The combined estimate or the stratified estimates? Which estimate should you rely on? Explain why, creating a DAG to represent relationship between the variables.
```{mermaid}
flowchart LR
A(Vax status) --> B(Recovery)
A --> C(Symptom severity) --> B```
flowchart LR A(Vax status) --> B(Recovery) A --> C(Symptom severity) --> B