Exam 2

Section 1. Tools of the data scientists

Learning objective: Use the tools of data scientists

Learning objective: Implement best programming/coding practices

1.1

When you are done with the exam, render this report as a PDF document. For full credit, identify each question (the prompt and the entire response) when you submit the document via GradeScope.

1.2

The following is a schematic of a project folder, with subfolders and files.

project
|
|---code
|       script.qmd
|
|---data
|       survey-responses.csv
|
|---docs

Supposing the code subfolder is the designated working directory, write the command to be included in the script.qmd file that reads the survey-responses.csv data while avoiding absolute file paths.

Solution

survey <- read.csv("../data/survey-responses.csv")
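One way to check that the relative path resolves as intended is to rebuild the folder schematic in a temporary directory. The sketch below is illustrative only: the temporary location and the two-row CSV contents are assumptions, not part of the exam.

```r
# Rebuild the schematic under a temporary directory (location is an assumption)
root <- file.path(tempdir(), "project")
dir.create(file.path(root, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), showWarnings = FALSE)
dir.create(file.path(root, "docs"), showWarnings = FALSE)

# Made-up survey contents, only so there is something to read
write.csv(data.frame(id = 1:2, answer = c("yes", "no")),
          file.path(root, "data", "survey-responses.csv"),
          row.names = FALSE)

# Treat the code subfolder as the working directory, as in the prompt
old_wd <- setwd(file.path(root, "code"))

# ".." climbs from code/ up to project/, then the path descends into data/
survey <- read.csv("../data/survey-responses.csv")

setwd(old_wd) # restore the original working directory
```

Because the path starts from the working directory rather than the drive root, the same line keeps working when the whole project folder is moved or shared.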

1.3

The iris dataset contains flower measurement data. Here, we add some fake data regarding the flower color.

d1 <- iris
set.seed(232235099) # Do not change the seed
d1$color <- sample(c("Blue","Violet","White"), nrow(d1), replace=TRUE)

Fix the mistake in the following code chunk. (Keep #| error: true. That is not the mistake.)

install.packages("descr")
CrossTable(d1$color, d1$Species)

Solution

The command install.packages should not be used in the analysis script. Once a package is installed to the library, it doesn’t need to be reinstalled. Rather, the package needs to be loaded so that its functions are in memory and available for use in the script.
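A corrected version of the chunk might look like the following sketch. It assumes descr was installed once beforehand, outside the script; the requireNamespace() guard and the base-R fallback are additions for illustration and are not part of the exam's solution.

```r
d1 <- iris
set.seed(232235099) # Do not change the seed
d1$color <- sample(c("Blue", "Violet", "White"), nrow(d1), replace = TRUE)

# Load the package instead of installing it.
# Assumption: descr was installed once beforehand, outside this script.
if (requireNamespace("descr", quietly = TRUE)) {
  library(descr)
  CrossTable(d1$color, d1$Species)
} else {
  # Fallback so the sketch still runs where descr is unavailable
  print(addmargins(table(d1$color, d1$Species)))
}
```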

1.4

Explain the difference between the library command and the install.packages command.

Solution

The command install.packages finds the package online and installs it to the local machine. The command library imports the package for use in the script.

Section 2. Probability

Learning objective: Compare and contrast different definitions of probability, illustrating differences with simple examples

Learning objective: Express the rules of probability verbally, mathematically, and computationally.

2.1

Consider a survey of political opinions. Since 1979, the public opinion polling company Gallup Inc. has asked respondents the following:

In general, are you satisfied or dissatisfied with the way things are going in the United States at this time?

The three possible responses are satisfied, dissatisfied, and no opinion.

Another survey question is:

Would you say that you are financially better off now than you were a year ago, or are you financially worse off now?

The four possible responses to this question are better off, worse off, same, or no opinion.

A cross table of the responses to the two survey questions might look like this:

satisfied dissatisfied no opinion
better off
→ cell
→ row
→ col
worse off
→ cell
→ row
→ col
same
→ cell
→ row
→ col
No opinion
→ cell
→ row
→ col

Part A Solution

One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.

satisfied dissatisfied no opinion
better off
→ cell X O O
→ row
→ col
worse off
→ cell O O O
→ row
→ col
same
→ cell O O O
→ row
→ col
No opinion
→ cell O O O
→ row
→ col

Part B Solution

One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.

satisfied dissatisfied no opinion
better off
→ cell
→ row
→ col
worse off
→ cell
→ row O X O
→ col
same
→ cell
→ row
→ col
No opinion
→ cell
→ row
→ col

Part C Solution

One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.

satisfied dissatisfied no opinion
better off
→ cell
→ row
→ col O
worse off
→ cell
→ row
→ col O
same
→ cell
→ row
→ col X
No opinion
→ cell
→ row
→ col O

Part D Solution

One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.

satisfied dissatisfied no opinion
better off
→ cell
→ row
→ col
worse off
→ cell
→ row
→ col
same
→ cell
→ row
→ col
No opinion
→ cell
→ row
→ col
X O O

Part E Solution

One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.

satisfied dissatisfied no opinion
better off
→ cell O
→ row
→ col
worse off
→ cell O
→ row
→ col
same
→ cell X
→ row
→ col
No opinion
→ cell O
→ row
→ col
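The sum-to-1 rules in this question can also be checked computationally. The sketch below uses hypothetical counts, invented only for illustration, and the base-R function proportions() to build the cell, row-conditional, and column-conditional probabilities.

```r
# Hypothetical counts, invented only to illustrate the sum-to-1 rules
counts <- matrix(
  c(120,  60, 10,
     40, 150, 15,
     90, 110, 20,
      5,  10, 30),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    finances = c("better off", "worse off", "same", "no opinion"),
    mood     = c("satisfied", "dissatisfied", "no opinion")
  )
)

cell     <- proportions(counts)             # cell probabilities
row_cond <- proportions(counts, margin = 1) # conditional on the row
col_cond <- proportions(counts, margin = 2) # conditional on the column

sum(cell)          # all cell probabilities together sum to 1
rowSums(row_cond)  # row conditionals sum to 1 within each row
colSums(col_cond)  # column conditionals sum to 1 within each column
sum(rowSums(cell)) # the row margins sum to 1
sum(colSums(cell)) # the column margins sum to 1
```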

2.2

Suppose that everyone had an opinion on the two questions, so that the No opinion column and row can be ignored. The table might look something like this:

satisfied dissatisfied
better off
→ cell
→ row
→ col
worse off
→ cell
→ row
→ col
same
→ cell
→ row
→ col

Also suppose the following probabilities were known:

  • Among those who are better off, the proportion of respondents who are satisfied with the way things are going is 0.6.
  • Among those who are worse off, the proportion of respondents who are dissatisfied with the way things are going is 0.7.
  • Among those who are the same, financially, the proportion of respondents who are satisfied with the way things are going is 0.5.
  • Overall, 0.2 report being better off, 0.5 report being about the same, and 0.3 report being worse off.

Among those who are satisfied with the way things are going, what is the proportion who are worse off?

Solution

Let’s first add the information provided to the table.

              satisfied                                  dissatisfied
better off                                                               0.2
→ cell
→ row         0.6
→ col
worse off                                                                0.3
→ cell
→ row                                                    0.7
→ col
same                                                                     0.5
→ cell
→ row         0.5
→ col

Now, let’s identify the probability we need to calculate. It is marked with a \(\star\).

              satisfied                                  dissatisfied
better off                                                               0.2
→ cell
→ row         0.6
→ col
worse off                                                                0.3
→ cell
→ row                                                    0.7
→ col         \(\star\)
same                                                                     0.5
→ cell
→ row         0.5
→ col

Now, let’s identify the elements needed to calculate \(\star\). It is a column conditional probability, so we will need the column margin (\(\blacksquare\)) and the cell probabilities of the same column (\(\blacktriangle\)).

              satisfied                                  dissatisfied
better off                                                               0.2
→ cell        \(\blacktriangle\)
→ row         0.6
→ col
worse off                                                                0.3
→ cell        \(\blacktriangle\)
→ row                                                    0.7
→ col         \(\star\)
same                                                                     0.5
→ cell        \(\blacktriangle\)
→ row         0.5
→ col
              \(\blacksquare\)

Now, we compute. Start with \(\blacktriangle\).

              satisfied                                  dissatisfied
better off                                                               0.2
→ cell        \(\blacktriangle = 0.2 \times 0.6\)
→ row         0.6
→ col
worse off                                                                0.3
→ cell        \(\blacktriangle = (1-0.7) \times 0.3\)
→ row         1-0.7                                      0.7
→ col         \(\star\)
same                                                                     0.5
→ cell        \(\blacktriangle = 0.5 \times 0.5\)
→ row         0.5
→ col
              \(\blacksquare\)

Now \(\blacksquare\).

\[ \blacksquare = 0.2 \times 0.6 + (1-0.7) \times 0.3 + 0.5 \times 0.5 = 0.46 \]

              satisfied                                  dissatisfied
better off                                                               0.2
→ cell        \(\blacktriangle = 0.2 \times 0.6\)
→ row         0.6
→ col
worse off                                                                0.3
→ cell        \(\blacktriangle = (1-0.7) \times 0.3\)
→ row         1-0.7                                      0.7
→ col         \(\star\)
same                                                                     0.5
→ cell        \(\blacktriangle = 0.5 \times 0.5\)
→ row         0.5
→ col
              \(\blacksquare = 0.46\)

Finally \(\star\).

\[ \blacksquare = 0.2 \times 0.6 + (1-0.7) \times 0.3 + 0.5 \times 0.5 = 0.46 \]

\[ \star = \frac{(1-0.7) \times 0.3}{\blacksquare} = \frac{0.09}{0.46} = 0.1957 \]

              satisfied                                  dissatisfied
better off                                                               0.2
→ cell        \(\blacktriangle = 0.2 \times 0.6\)
→ row         0.6
→ col
worse off                                                                0.3
→ cell        \(\blacktriangle = (1-0.7) \times 0.3\)
→ row         1-0.7                                      0.7
→ col         \(\star = 0.1957\)
same                                                                     0.5
→ cell        \(\blacktriangle = 0.5 \times 0.5\)
→ row         0.5
→ col
              \(\blacksquare = 0.46\)
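The same table calculation can be reproduced in a few lines of R. This is a minimal sketch; the variable names are chosen here for illustration, and only the probabilities stated in the question are used.

```r
# Row margins: P(better off), P(worse off), P(same)
row_margin <- c(better = 0.2, worse = 0.3, same = 0.5)

# Row conditionals: P(satisfied | row); for "worse off" it is 1 - 0.7
p_sat_given_row <- c(better = 0.6, worse = 1 - 0.7, same = 0.5)

# Cell probabilities in the satisfied column (the triangles)
cell_sat <- row_margin * p_sat_given_row

# Column margin for satisfied (the square)
col_margin_sat <- sum(cell_sat)

# Column conditional P(worse off | satisfied) (the star)
star <- cell_sat[["worse"]] / col_margin_sat
round(c(col_margin = col_margin_sat, star = star), 4)
```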

Section 3. Simulation

Learning objective: Use probability models to build simulations of complex real world processes to answer research questions

3.1

The World Series determines the championship team of Major League Baseball. The tournament is a first-to-4-wins competition. This means the tournament could end after 4, 5, 6, or 7 games.

This year, the two teams in the World Series are the Los Angeles Dodgers (D) and the New York Yankees (Y). If we suppose that the two teams are equally matched, then the probability of winning for either team is 0.5. The outcome of a single game can be simulated with the sample command:

sample(c("Y","D"), 1)
[1] "D"

However, the teams may not be equally matched. It is possible that the probability that the Dodgers win a head-to-head matchup is 0.55. Consequently, the probability that the Yankees win a head-to-head matchup would be 1-0.55 = 0.45. Look at the help documentation of sample to determine how to incorporate the belief that the Dodgers’ probability of winning a head-to-head match-up is 0.55. Write the command below.

Solution

sample(c("Y","D"), 1, prob = c(0.45, 0.55))
[1] "Y"

Now, as you did before, modify the code to simulate 7 games instead of just 1. This time, also incorporate the belief that the Dodgers’ probability of winning an individual game is 0.55.

sample(c("Y","D"), 7, prob = c(0.45, 0.55), replace=TRUE)
[1] "D" "Y" "D" "D" "D" "Y" "D"
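A quick sanity check on the prob argument is to simulate many games and compare the observed frequencies with 0.45 and 0.55. The seed and the number of games below are arbitrary choices for this sketch, not values from the exam.

```r
set.seed(1) # arbitrary seed for this sketch, not the exam's seed

# prob is matched to the outcomes by position: "Y" gets 0.45, "D" gets 0.55
games <- sample(c("Y", "D"), 100000, prob = c(0.45, 0.55), replace = TRUE)

# Long-run frequencies should land close to the stated probabilities
proportions(table(games))
```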

3.2

The following function ws_outcome takes an input vector of games and translates it to a dataframe in which the first column is the winning team and the second column is how many games were played.

flowchart LR
  A(Y,D,D,Y,D,D,Y) ---> |ws_outcome| B(D,6)

ws_outcome <- function(x){
    t1 <- table(x)
    teams <- names(t1)
    winning_team_idx <- which(t1 >= 4)
    winning_team <- teams[winning_team_idx]
    winning_game <- which(cumsum(x==winning_team)==4)[1]
    data.frame(winning_team = winning_team, winning_game = winning_game)
}
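As a usage sketch, the series shown in the flowchart can be passed to the function directly. The definition is repeated here only so that the block runs on its own.

```r
# ws_outcome copied from above so this sketch is self-contained
ws_outcome <- function(x){
    t1 <- table(x)
    teams <- names(t1)
    winning_team_idx <- which(t1 >= 4)
    winning_team <- teams[winning_team_idx]
    winning_game <- which(cumsum(x==winning_team)==4)[1]
    data.frame(winning_team = winning_team, winning_game = winning_game)
}

# The series from the flowchart: D collects its 4th win in game 6
res <- ws_outcome(c("Y","D","D","Y","D","D","Y"))
res
```

Note that the input always has 7 entries; the function reports the game in which the fourth win occurred and ignores any games simulated after that point.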

The ws_outcome function can be combined with your solution to 3.1 to simulate many outcomes of the World Series when the Dodgers’ probability of winning a head-to-head match-up is 0.55. Generate 5000 outcomes with the replicate command. Use set.seed(298235). As you did in the prep questions (link), create a summary table with cell and margin probabilities.

HINT: This was the table from the prep questions. Your output should look similar.

            winning_game
winning_team      4      5      6      7    Sum
         D   0.0620 0.1260 0.1590 0.1518 0.4988
         Y   0.0570 0.1318 0.1584 0.1540 0.5012
         Sum 0.1190 0.2578 0.3174 0.3058 1.0000

Solution

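set.seed(298235) # Seed specified in the prompt

# Simulate one 7-game series and summarize it with ws_outcome()
one_series <- function() {
    sample(c("Y","D"), 7, prob = c(0.45, 0.55), replace=TRUE) |> ws_outcome()
}
out <- replicate(5000, one_series(), simplify = FALSE)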

# Stack all the elements of the list into a dataframe
sim_results <- do.call("rbind", out)

# Table
t1 <- table(sim_results) |> proportions() |> addmargins()
t1
            winning_game
winning_team      4      5      6      7    Sum
         D   0.0896 0.1620 0.1814 0.1710 0.6040
         Y   0.0420 0.0876 0.1304 0.1360 0.3960
         Sum 0.1316 0.2496 0.3118 0.3070 1.0000
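The simulated table can be compared against exact probabilities. The sketch below is an optional check that is not part of the exam's solution: it assumes independent games with a constant win probability of 0.55 for the Dodgers, and uses the fact that the series winner must win game g and exactly 3 of the preceding g - 1 games.

```r
p <- 0.55 # assumed probability that D wins any single game
g <- 4:7  # possible series lengths

# The winner takes game g and exactly 3 of the previous g - 1 games
exact <- rbind(
  D = choose(g - 1, 3) * p^4 * (1 - p)^(g - 4),
  Y = choose(g - 1, 3) * (1 - p)^4 * p^(g - 4)
)
colnames(exact) <- g

# Add margins by hand so the layout mirrors the simulated table
exact <- cbind(exact, Sum = rowSums(exact))
exact <- rbind(exact, Sum = colSums(exact))
round(exact, 4)
```

With 5000 replicates, the simulated proportions should sit near these values, with differences attributable to simulation error.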

Section 4. Diagnostics

Learning objective: Apply cross table framework to the special case of binary outcomes

4.1

An audit of an email filtering system resulted in a dataset of 10000 emails, each manually verified as spam or not spam. In addition to the type of email, the dataset indicates if the filter sent the email to the inbox or the junk folder.

The following command reads the dataset into memory. From the data, generate an estimate of the positive predictive value and the negative predictive value of the spam filter by calculating sample proportions.

HINT: Sensitivity in this example = \(P(\text{Junk folder} | \text{Spam})\)

Solution

\[ \begin{align*} \text{PPV} &= P(\text{Type = Spam} | \text{Folder = Junk} ) \\ \text{NPV} &= P(\text{Type = Not spam} | \text{Folder = Inbox} ) \end{align*} \]

These are quantities that are readily available from the cross table.

d1 <- readRDS(url("https://tgstewart.cloud/spam-data.RDS"))
descr::CrossTable(d1$Type, d1$Folder)
   Cell Contents 
|-------------------------|
|                       N | 
| Chi-square contribution | 
|           N / Row Total | 
|           N / Col Total | 
|         N / Table Total | 
|-------------------------|

=======================================
            d1$Folder
d1$Type        Inbox       Junk   Total
---------------------------------------
Not spam        7589        422    8011
             292.899   1030.042        
               0.947      0.053   0.801
               0.975      0.191        
               0.759      0.042        
---------------------------------------
Spam             197       1792    1989
            1179.696   4148.649        
               0.099      0.901   0.199
               0.025      0.809        
               0.020      0.179        
---------------------------------------
Total           7786       2214   10000
               0.779      0.221        
=======================================
PPV = 0.809
NPV = 0.975
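The same estimates can be computed directly from the counts rather than read off the printed table. The sketch below re-enters the four counts from the CrossTable output by hand; the object names are chosen here for illustration.

```r
# Counts copied from the cross table above (rows: Type, columns: Folder)
counts <- matrix(
  c(7589, 197,   # Inbox column: Not spam, Spam
     422, 1792), # Junk column:  Not spam, Spam
  nrow = 2,
  dimnames = list(Type = c("Not spam", "Spam"),
                  Folder = c("Inbox", "Junk"))
)

# Column conditionals: P(Type | Folder)
col_cond <- proportions(counts, margin = 2)
PPV <- col_cond["Spam", "Junk"]      # P(Spam | Junk)
NPV <- col_cond["Not spam", "Inbox"] # P(Not spam | Inbox)

# Row conditionals: P(Folder | Type)
row_cond <- proportions(counts, margin = 1)
Sen <- row_cond["Spam", "Junk"]      # P(Junk | Spam)
Spe <- row_cond["Not spam", "Inbox"] # P(Inbox | Not spam)

round(c(PPV = PPV, NPV = NPV, Sensitivity = Sen, Specificity = Spe), 3)
```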

4.2

Using the sensitivity and specificity observed in the data, calculate the negative predictive value of the spam filter if the incidence of spam were to increase to 0.35. As before, you may show your work by inserting an image of your handwritten solution. You are also welcome to show your work directly in the report.

Solution

From the table above,

Sensitivity = 0.901
Specificity = 0.947

To calculate NPV from sensitivity and specificity, we can use the table method or Bayes' rule.

\[ \begin{align*} \text{NPV} & = P(\text{Not spam} | \text{Inbox}) \\ & = \frac{P(\text{Not spam}\ \&\ \text{Inbox})}{P(\text{Inbox})}\\ & = \frac{P(\text{Inbox} | \text{Not spam}) P(\text{Not spam})}{P(\text{Inbox} | \text{Not spam}) P(\text{Not spam}) + P(\text{Inbox} | \text{Spam}) P(\text{Spam})} \\ & = \frac{P(\text{Inbox} | \text{Not spam}) P(\text{Not spam})}{P(\text{Inbox} | \text{Not spam})[1-P(\text{Spam})] + [1- P(\text{Junk} | \text{Spam})] P(\text{Spam})} \\ & = \frac{\text{Specificity} \times (1-\text{Incidence})}{\text{Specificity} \times (1-\text{Incidence}) + (1-\text{Sensitivity}) \times \text{Incidence}} \\ \end{align*} \]

Spe <- 0.947
Sen <- 0.901
p <- 0.35
c(NPV = Spe * (1-p) / ( Spe * (1-p) + (1-Sen) * p ))
      NPV 
0.9467087 
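The final line of the derivation can be wrapped in a small function to see how NPV responds to incidence. The function name npv and the grid of incidence values are illustrative choices, not part of the exam.

```r
# NPV from sensitivity, specificity, and incidence (Bayes' rule, as derived above)
npv <- function(sen, spe, incidence) {
  spe * (1 - incidence) / (spe * (1 - incidence) + (1 - sen) * incidence)
}

# The requested scenario: incidence of spam rises to 0.35
npv_035 <- npv(sen = 0.901, spe = 0.947, incidence = 0.35)
npv_035

# A few incidence values, including the observed 0.199, for comparison
incidence_grid <- c(0.199, 0.35, 0.5)
round(npv(sen = 0.901, spe = 0.947, incidence = incidence_grid), 3)
```

At the observed incidence of 0.199, the function returns roughly the NPV of 0.975 estimated in 4.1, which is a useful consistency check; NPV falls as spam becomes more common.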

Section 5. Confounding vs Causal Pathway

Learning objective: Define/describe confounding variables, Simpson’s paradox, DAGs, and the causal pathway

5.1

Describe an example in which an association between two variables might be impacted by a third, confounding variable. Also describe a situation in which a third variable might be an intermediate outcome. Explain how you would measure the association differently in the two scenarios.

Solution

Many examples work. See figures below.

When there is a confounder, one should calculate an adjusted or stratified measure of association. If the third variable is an intermediate outcome, stratification would be inappropriate; the pooled measure of association should be calculated instead.
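The difference between a pooled and a stratified measure can be made concrete with a small numerical sketch. The counts below are illustrative only, arranged to match a treatment, recovery, and disease severity example in which severity acts as a confounder.

```r
# Illustrative counts (not real exam data): recoveries out of n patients
d <- data.frame(
  severity  = c("mild", "mild", "severe", "severe"),
  treatment = c("A", "B", "A", "B"),
  recovered = c(81, 234, 192, 55),
  n         = c(87, 270, 263, 80)
)

# Stratified measure: recovery rate by treatment within each severity level
d$rate <- d$recovered / d$n
d

# Pooled measure: collapse over severity before computing the rates
pooled <- aggregate(cbind(recovered, n) ~ treatment, data = d, FUN = sum)
pooled$rate <- pooled$recovered / pooled$n
pooled
```

In these counts treatment A has the higher recovery rate within each severity stratum yet the lower pooled rate, which is Simpson's paradox. If severity were instead an intermediate outcome on the causal pathway, the pooled rates would be the relevant comparison.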

5.2

The following code generates DAGs. Tweak the labels in the code so that the DAGs match your examples.

flowchart TB
 subgraph Confounding
  A(Treatment) --> B(Recovery)
  C(Disease severity) --> B
  C --> A
  end

flowchart TB
 subgraph Confounding
  A(Parent's marriage status) --> B(College graduation)
  C(Income) --> B
  C --> A
  end

flowchart TB
  subgraph Intermediate
  D(Vaccination status) --> E(Speed of recovery)
  D --> F(Disease severity)
  F --> E
  end

flowchart TB
  subgraph Intermediate
  D(Preeclampsia) --> E(Cerebral palsy)
  D --> F(Gestational age at delivery)
  F --> E
  end