Exam 2 Prep

Here we provide solutions for the questions involving R.

Section 1. Tools of the data scientists

Learning objective: Use the tools of data scientists

Learning objective: Implement best programming/coding practices

1.1

List three reasons why the following error might happen:

> abc(Surv(futime, death) ~ trt + flt3, data=myeloid)
Error in abc(Surv(futime, death) ~ trt + flt3, data = myeloid) : 
  could not find function "abc"

1.2

When you are done with the exam, please render this reproducible report as a pdf document.

1.3

Please explain the advantages of relative file paths over absolute file paths.

Section 2. Probability

Learning objective: Compare and contrast different definitions of probability, illustrating differences with simple examples

Learning objective: Express the rules of probability verbally, mathematically, and computationally.

2.1

Machines A, B, and C generate widgets with a defect rate of 0.1, 0.01, and 0.001, respectively. Machine A generates twice as many widgets as B, and machine B generates twice as many widgets as machine C.

Please complete the cross table below based on the information above.

Hints:

  • P(A) = 2P(B) = 4P(B), which means P(A) = 4/7, P(B) = 2/7 and P(C) = 1/7.
A B C
Defective
→ cell
→ row
→ col
Functional
→ cell
→ row
→ col

2.2

If you did not know if a widget was functional or defective, what would be the probability that the widget is from machine A? Write the solution using mathematical notation, as in P(???) = ???.

2.3

How would you update the probability from the previous problem if you knew the widget was defective? Write the solution using mathematical notation, as in P(???) = ???.

Section 3. Simulation

Learning objective: Use probability models to build simulations of complex real world processes to answer research questions

3.1

The world series determines the championship team of major leage baseball. The tournament is a first-to-4-wins competition. This means the tournament could end after 4, 5, 6, or 7 games.

This year, the two teams in the world series are the Los Angeles Dodgers (D) and the New York Yankees (Y). If we suppose that the two teams are equally matched, then the probability of winning for either team is 0.5. The outcome of a single game can be simulated with the sample command:

set.seed(26702340)
sample(c("Y","D"), 1)
[1] "Y"

Modify the code to simulate 7 games.

Solution

sample(c("Y","D"), 7, replace=TRUE)
[1] "D" "D" "Y" "Y" "Y" "Y" "Y"

3.2

Write a function which determines who won the world series and which game they won it. The input of the function can be the output of your code from the previous problem, a vector of 7 Ys or Ds. The output of your function should be a tuple: The first element is the team which won and the second element is total number of games in the series.

For example, for this sequence

"Y" "D" "D" "D" "D" "Y" "Y"

the output would be (D, 5) because the fourth win for the Dodgers occured in game 5.

Solution

ws_outcome <- function(x){
    t1 <- table(x)
    teams <- names(t1)
    winning_team_idx <- which(t1 >= 4)
    winning_team <- teams[winning_team_idx]
    winning_game <- which(cumsum(x==winning_team)==4)[1]
    data.frame(winning_team = winning_team, winning_game = winning_game)
}

# Check to see if the function works
x <- sample(c("Y","D"), 7, replace=TRUE)
ws_outcome(x)
  winning_team winning_game
1            D            6

3.3

Use the replicate command to generate many outcomes of the world series. Estimate the proportion of times the series ends in 4, 5, 6, and 7 games. Generate a plot of the results.

Solution

one_series <- function(x) sample(c("Y","D"), 7, replace=TRUE) |> ws_outcome()
out <- replicate(5000, one_series(), simplify = FALSE)

# Stack all the elements of the list into a dataframe
sim_results <- do.call("rbind", out)

# Take a peak at the output
head(sim_results)
  winning_team winning_game
1            Y            5
2            D            6
3            D            6
4            Y            5
5            Y            4
6            Y            5
# Summary table (Cell and margin probabilities)
# Note we could have also used the descr::CrossTable command if we also wanted conditional probs
t1 <- table(sim_results) |> proportions() |> addmargins()
t1
            winning_game
winning_team      4      5      6      7    Sum
         D   0.0620 0.1260 0.1590 0.1518 0.4988
         Y   0.0570 0.1318 0.1584 0.1540 0.5012
         Sum 0.1190 0.2578 0.3174 0.3058 1.0000
# Barplot
barplot(t1["Sum",1:4], ylim = c(0,max(t1[,1:4])*1.1))
box()

# Line plot
plot(4:7, t1["Sum",1:4], ylim =  c(0,max(t1[,1:4])*1.1), pch = 16, lwd = 3, ylab = "Probability", xlab = "Games", axes = FALSE)
axis(1, 4:7)
axis(2)
box()

Section 4. Diagnostics

Learning objective: apply cross table framework to the special case of binary outcomes

4.1

Consider a cancer screening device. Supposing that \(P(\text{Cancer}) = p\), \(P(+|\text{cancer}) = tp\), and \(P(-|\text{not cancer}) = tn\), complete the following table with expressions of all joint, marginal, and conditional probabilities.

Cancer Not Cancer
Test = +
Test = -

4.2

If \(p\) = 0.01, \(tp\) = 0.9, and \(tn\) = 0.8, what is the positive predictive value, \(P(\text{Cancer}|+)\)?

4.3

Create a figure, with cancer incidence, \(P(\text{Cancer})\), on the x-axis, going from 0 to 1 and the positive predictive value, \(P(\text{Cancer}|+)\), on the y-axis. Show how the positive predictive value changes as the incidence changes. Generate the figure under the assumption that the sensitivity, \(P(+|\text{Cancer})\), is 0.9 and the specificity, \(P(-|\text{Not cancer})\), is 0.95.

Solution

From 4.1,

\[ \begin{align*} P(\text{Cancer}|+) & = \frac{P(+|\text{Cancer})P(\text{Cancer})}{P(+|\text{Cancer})P(\text{Cancer}) + P(+|\text{Not cancer})P(\text{Not cancer})}\\ & = \frac{P(+|\text{Cancer})P(\text{Cancer})}{P(+|\text{Cancer})P(\text{Cancer}) + [1-P(-|\text{Not Cancer})][1-P(\text{Cancer})]}\\ & = \frac{\text{tp}\times \text{p}}{\text{tp}\times \text{p} + (1-\text{tn}) \times (1-p)}\\ & = \frac{1}{1 + \frac{(1-\text{tn})}{\text{tp}} \frac{(1-p)}{p}}\\ \end{align*} \]

ppv <- function(p, tp, tn) tp * p / (tp*p + (1-tn)*(1-p))

# ppoints is a fast way to get a sequence of points between 0 and 1
# could also use seq(0,1,length=100)
ppp <- ppoints(100) 
yyy <- ppv(ppp, 0.9, 0.95)

plot(ppp, yyy, xlab = "P(cancer)", ylab = "Positive predictive value", type = "l", lwd = 3)

Section 5. Confounding vs Causal Pathway

Learning objective: define/describe confounding variables, Simpson’s paradox, DAGs, and the causal pathway

5.1

The following function generates data from a cohort of individuals who were diagnosed with disease X. In the dataset, there is vaccination status, disease severity, and disease duration.

Generate 1000 draws from the function and create the cross table for vaccination status and disease duration. Calculate a summary of the effect of vaccination by calculating the difference is conditional probabilities:

\[ \Delta = P(\text{Short recovery}|\text{vaccinated}) - P(\text{Short recovery}|\text{unvaccinated}) \]

Solution A

vax_data <- function(R){
    vs <- rbinom(R,1,.5)
    ds <- rbinom(R,1,.25*(vs==1)+.75*(vs==0))
    rt <- rbinom(R,1,.7*(ds==1)+.5*(ds==0))
    data.frame(
        vaccination_status = c("vaccinated","unvaccinated")[vs+1]
      , disease_severity = c("mild","severe")[ds+1]
      , recovery_time = c("short","long")[rt+1]
    )
}

set.seed(34598763)
d1 <- vax_data(1000)

suppressPackageStartupMessages(require(descr))
suppressPackageStartupMessages(require(pander))
CrossTable(d1$vaccination_status, d1$recovery_time, prop.chisq = FALSE, dnn = c("Vax status", "Recovery")) |> pander(digits=1, split.tables=Inf)
 
Vax status
Recovery
long
 
short
 
Total
unvaccinated
N
Row(%)
Column(%)
Total(%)
 
269
52.0%
46.9%
26.9%
 
248
48.0%
58.1%
24.8%
 
517
51.7%

vaccinated
N
Row(%)
Column(%)
Total(%)
 
304
62.9%
53.1%
30.4%
 
179
37.1%
41.9%
17.9%
 
483
48.3%

Total
573
57.3%
427
42.7%
1000
# Pick off the probs from the table
.370600 - .479691
[1] -0.109091

Solution B

suppressPackageStartupMessages(require(dplyr))
p_s_g_v <- d1 |> 
  filter(vaccination_status == "vaccinated") |> 
  mutate(outcome = recovery_time == "short") |> 
  pull(outcome) |> 
  mean()

p_s_g_u <- d1 |> 
  filter(vaccination_status == "unvaccinated") |> 
  mutate(outcome = recovery_time == "short") |> 
  pull(outcome) |> 
  mean()

p_s_g_v - p_s_g_u
[1] -0.1090901

5.2

Stratify the table by disease severity. As in the previous problem, calculate the same treatment effect in each strata.

Solution A

d1mild <- d1 |> filter(disease_severity == "mild")

CrossTable(d1mild$vaccination_status, d1mild$recovery_time, prop.chisq = FALSE, dnn = c("Vax status", "Recovery")) |> pander(digits=1, split.tables=Inf)
 
Vax status
Recovery
long
 
short
 
Total
unvaccinated
N
Row(%)
Column(%)
Total(%)
 
199
50.0%
78.0%
38.0%
 
199
50.0%
74.3%
38.0%
 
398
76.1%

vaccinated
N
Row(%)
Column(%)
Total(%)
 
56
44.8%
22.0%
10.7%
 
69
55.2%
25.7%
13.2%
 
125
23.9%

Total
255
48.8%
268
51.2%
523
d1severe <- d1 |> filter(disease_severity == "severe")

CrossTable(d1severe$vaccination_status, d1severe$recovery_time, prop.chisq = FALSE, dnn = c("Vax status", "Recovery")) |> pander(digits=1, split.tables=Inf)
 
Vax status
Recovery
long
 
short
 
Total
unvaccinated
N
Row(%)
Column(%)
Total(%)
 
70
58.8%
22.0%
14.7%
 
49
41.2%
30.8%
10.3%
 
119
24.9%

vaccinated
N
Row(%)
Column(%)
Total(%)
 
248
69.3%
78.0%
52.0%
 
110
30.7%
69.2%
23.1%
 
358
75.1%

Total
318
66.7%
159
33.3%
477
# pull values from table
c(Mild = .552 - .500, severe = .307 - .412)
  Mild severe 
 0.052 -0.105 

Solution B

p_s_g_v_m <- d1 |> 
  filter(vaccination_status == "vaccinated") |> 
  filter(disease_severity == "mild") |> 
  mutate(outcome = recovery_time == "short") |> 
  pull(outcome) |> 
  mean()

p_s_g_u_m <- d1 |> 
  filter(disease_severity == "mild") |> 
  filter(vaccination_status == "unvaccinated") |> 
  mutate(outcome = recovery_time == "short") |> 
  pull(outcome) |> 
  mean()

c(mild = p_s_g_v_m - p_s_g_u_m)
 mild 
0.052 
p_s_g_v_s <- d1 |> 
  filter(vaccination_status == "vaccinated") |> 
  filter(disease_severity == "severe") |> 
  mutate(outcome = recovery_time == "short") |> 
  pull(outcome) |> 
  mean()

p_s_g_u_s <- d1 |> 
  filter(disease_severity == "severe") |> 
  filter(vaccination_status == "unvaccinated") |> 
  mutate(outcome = recovery_time == "short") |> 
  pull(outcome) |> 
  mean()

c(severe = p_s_g_v_s - p_s_g_u_s)
    severe 
-0.1045021 

5.3

Based on the summary of the treatment effect that you observe in the combined and stratified tables, does vaccination help shorten recovery?

5.4

Which measure of treatment effect is most persuasive? The combined estimate or the stratified estimates? Which estimate should you rely on? Explain why, creating a DAG to represent relationship between the variables.

```{mermaid}
flowchart LR
  A(Vax status) --> B(Recovery)
  A --> C(Symptom severity) --> B
```

flowchart LR
  A(Vax status) --> B(Recovery)
  A --> C(Symptom severity) --> B