d1 <- iris
set.seed(232235099) # Do not change the seed
d1$color <- sample(c("Blue","Violet","White"), nrow(d1), replace=TRUE)
Exam 2
Section 1. Tools of the data scientists
Learning objective: Use the tools of data scientists
Learning objective: Implement best programming/coding practices
1.1
When you are done with the exam, render this report as a pdf document. For full credit, identify each question (the prompt and the entire response) when you submit the document via GradeScope.
1.2
The following is a schematic of a project folder, with subfolders and files.
project
|
|---code
| script.qmd
|
|---data
| survey-responses.csv
|
|---docs
Supposing the code subfolder is the designated working directory, write the command to be included in the script.qmd file which will read the survey-responses.csv data, avoiding absolute file paths.
Solution
survey <- read.csv("../data/survey-responses.csv")
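The same relative path can also be assembled from its components. The following is a minimal sketch, assuming the working directory is the code subfolder; the name `survey_path` and the `file.exists` guard are introduced here for illustration only.

```r
# Build the relative path: up one level from project/code, then into data.
survey_path <- file.path("..", "data", "survey-responses.csv")
survey_path

# Read the file only if it is found relative to the current working directory.
if (file.exists(survey_path)) {
  survey <- read.csv(survey_path)
}
```

Because the path is relative, the script keeps working when the whole project folder is moved or shared.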
1.3
The iris dataset contains flower measurement data. Here, we add some fake data regarding the flower color.
Fix the mistake in the following code chunk. (Keep #| error: true. That is not the mistake.)
install.packages("descr")
CrossTable(d1$color, d1$Species)
Solution
The command install.packages should not be used in the analysis script. Once a package is installed to the library, it doesn’t need to be reinstalled. Rather, the package needs to be loaded (for example, with library(descr)) so that its functions are in memory and available for use in the script.
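One possible corrected chunk is sketched below. It recreates the data from the top of the report so that it runs on its own. The `requireNamespace` guard is an addition for illustration, not part of the exam solution: it skips the call when descr is not installed instead of raising an error.

```r
# Recreate the data used in this question.
d1 <- iris
set.seed(232235099)
d1$color <- sample(c("Blue", "Violet", "White"), nrow(d1), replace = TRUE)

# Load (rather than install) the package, then call its function.
if (requireNamespace("descr", quietly = TRUE)) {
  library(descr)
  CrossTable(d1$color, d1$Species)
}
```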
1.4
Explain the difference between the library command and the install.packages command.
Solution
The command install.packages finds the package online and installs it to the local machine. The command library imports the package for use in the script.
Section 2. Probability
Learning objective: Compare and contrast different definitions of probability, illustrating differences with simple examples
Learning objective: Express the rules of probability verbally, mathematically, and computationally.
2.1
Consider a survey of political opinions. Since 1979, the public opinion polling company Gallup Inc has asked respondents the following:
In general, are you satisfied or dissatisfied with the way things are going in the United States at this time?
The three possible responses are satisfied, dissatisfied, and no opinion.
Another survey question is:
Would you say that you are financially better off now than you were a year ago, or are you financially worse off now?
The four possible responses to this question are better off, worse off, same, or no opinion.
A cross table of the responses to the two survey questions might look like this:
satisfied | dissatisfied | no opinion | ||
---|---|---|---|---|
better off | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
worse off | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
same | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
No opinion | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
Part A Solution
One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.
satisfied | dissatisfied | no opinion | ||
---|---|---|---|---|
better off | ||||
→ cell | X | O | O | |
→ row | ||||
→ col | ||||
worse off | ||||
→ cell | O | O | O | |
→ row | ||||
→ col | ||||
same | ||||
→ cell | O | O | O | |
→ row | ||||
→ col | ||||
No opinion | ||||
→ cell | O | O | O | |
→ row | ||||
→ col | ||||
Part B Solution
One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.
satisfied | dissatisfied | no opinion | ||
---|---|---|---|---|
better off | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
worse off | ||||
→ cell | ||||
→ row | O | X | O | |
→ col | ||||
same | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
No opinion | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
Part C Solution
One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.
satisfied | dissatisfied | no opinion | ||
---|---|---|---|---|
better off | ||||
→ cell | ||||
→ row | ||||
→ col | O | |||
worse off | ||||
→ cell | ||||
→ row | ||||
→ col | O | |||
same | ||||
→ cell | ||||
→ row | ||||
→ col | X | |||
No opinion | ||||
→ cell | ||||
→ row | ||||
→ col | O | |||
Part D Solution
One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.
satisfied | dissatisfied | no opinion | ||
---|---|---|---|---|
better off | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
worse off | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
same | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
No opinion | ||||
→ cell | ||||
→ row | ||||
→ col | ||||
X | O | O |
Part E Solution
One of the probabilities in the table is marked with an X. Mark with the letter O all the probabilities which, including X, must sum to 1.
satisfied | dissatisfied | no opinion | ||
---|---|---|---|---|
better off | ||||
→ cell | O | |||
→ row | ||||
→ col | ||||
worse off | ||||
→ cell | O | |||
→ row | ||||
→ col | ||||
same | ||||
→ cell | X | |||
→ row | ||||
→ col | ||||
No opinion | ||||
→ cell | O | |||
→ row | ||||
→ col | ||||
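The sum-to-1 rules for cell, row, and column probabilities can be checked computationally. The sketch below uses counts that are invented purely for illustration (they are not Gallup data), and the object names are hypothetical.

```r
# Hypothetical counts, invented for illustration only.
counts <- matrix(
  c(30, 10, 2,
    15, 45, 3,
    25, 35, 5,
     1,  2, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    finances = c("better off", "worse off", "same", "no opinion"),
    satisfaction = c("satisfied", "dissatisfied", "no opinion")
  )
)

cell     <- proportions(counts)              # joint (cell) probabilities
row_cond <- proportions(counts, margin = 1)  # conditional on the row
col_cond <- proportions(counts, margin = 2)  # conditional on the column

sum(cell)            # all cell probabilities together sum to 1
rowSums(row_cond)    # row probabilities sum to 1 within each row
colSums(col_cond)    # column probabilities sum to 1 within each column
sum(rowSums(cell))   # row margins sum to 1
sum(colSums(cell))   # column margins sum to 1
```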
2.2
Suppose that everyone had an opinion on the two questions, so that the No opinion column and row can be ignored. The table might look something like this:
satisfied | dissatisfied | ||
---|---|---|---|
better off | |||
→ cell | |||
→ row | |||
→ col | |||
worse off | |||
→ cell | |||
→ row | |||
→ col | |||
same | |||
→ cell | |||
→ row | |||
→ col | |||
Also suppose the following probabilities were known:
- Among those who are better off, the proportion of respondents who are satisfied with the way things are going is 0.6.
- Among those who are worse off, the proportion of respondents who are dissatisfied with the way things are going is 0.7.
- Among those who are the same, financially, the proportion of respondents who are satisfied with the way things are going is 0.5.
- Overall, 0.2 report being better off, 0.5 report being about the same, and 0.3 report being worse off.
Among those who are satisfied with the way things are going, what is the proportion who are worse off?
Solution
Let’s first add the information provided to the table.
satisfied | dissatisfied | ||
---|---|---|---|
better off | 0.2 | ||
→ cell | |||
→ row | 0.6 | ||
→ col | |||
worse off | 0.3 | ||
→ cell | |||
→ row | 0.7 | ||
→ col | |||
same | 0.5 | ||
→ cell | |||
→ row | 0.5 | ||
→ col | |||
Now, let’s identify the probability we need to calculate. It is marked with a \(\star\).
satisfied | dissatisfied | ||
---|---|---|---|
better off | 0.2 | ||
→ cell | |||
→ row | 0.6 | ||
→ col | |||
worse off | 0.3 | ||
→ cell | |||
→ row | 0.7 | ||
→ col | \(\star\) | ||
same | 0.5 | ||
→ cell | |||
→ row | 0.5 | ||
→ col | |||
Now, let’s identify the elements needed to calculate \(\star\). It is a column conditional probability, so we will need the column margin (\(\blacksquare\)) and the cell probabilities of the same column (\(\blacktriangle\)).
satisfied | dissatisfied | ||
---|---|---|---|
better off | 0.2 | ||
→ cell | \(\blacktriangle\) | ||
→ row | 0.6 | ||
→ col | |||
worse off | 0.3 | ||
→ cell | \(\blacktriangle\) | ||
→ row | 0.7 | ||
→ col | \(\star\) | ||
same | 0.5 | ||
→ cell | \(\blacktriangle\) | ||
→ row | 0.5 | ||
→ col | |||
\(\blacksquare\) |
Now, we compute. Start with \(\blacktriangle\).
satisfied | dissatisfied | ||
---|---|---|---|
better off | 0.2 | ||
→ cell | \(\blacktriangle = 0.2 \times 0.6\) | ||
→ row | 0.6 | ||
→ col | |||
worse off | 0.3 | ||
→ cell | \(\blacktriangle = (1-0.7) \times 0.3\) | ||
→ row | 1-0.7 | 0.7 | |
→ col | \(\star\) | ||
same | 0.5 | ||
→ cell | \(\blacktriangle = 0.5 \times 0.5\) | ||
→ row | 0.5 | ||
→ col | |||
\(\blacksquare\) |
Now \(\blacksquare\).
\[ \blacksquare = 0.2 \times 0.6 + (1-0.7) \times 0.3 + 0.5 \times 0.5 = 0.46 \]
satisfied | dissatisfied | ||
---|---|---|---|
better off | 0.2 | ||
→ cell | \(\blacktriangle = 0.2 \times 0.6\) | ||
→ row | 0.6 | ||
→ col | |||
worse off | 0.3 | ||
→ cell | \(\blacktriangle = (1-0.7) \times 0.3\) | ||
→ row | 1-0.7 | 0.7 | |
→ col | \(\star\) | ||
same | 0.5 | ||
→ cell | \(\blacktriangle = 0.5 \times 0.5\) | ||
→ row | 0.5 | ||
→ col | |||
\(\blacksquare = 0.46\) |
Finally \(\star\).
\[ \blacksquare = 0.2 \times 0.6 + (1-0.7) \times 0.3 + 0.5 \times 0.5 = 0.46 \]
\[ \star = \frac{(1-0.7) \times 0.3}{\blacksquare} = \frac{0.09}{0.46} = 0.1957 \]
satisfied | dissatisfied | ||
---|---|---|---|
better off | 0.2 | ||
→ cell | \(\blacktriangle = 0.2 \times 0.6\) | ||
→ row | 0.6 | ||
→ col | |||
worse off | 0.3 | ||
→ cell | \(\blacktriangle = (1-0.7) \times 0.3\) | ||
→ row | 1-0.7 | 0.7 | |
→ col | \(\star = 0.1957\) | ||
same | 0.5 | ||
→ cell | \(\blacktriangle = 0.5 \times 0.5\) | ||
→ row | 0.5 | ||
→ col | |||
\(\blacksquare = 0.46\) |
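The table method above can also be carried out computationally. This is a minimal sketch using only the probabilities given in the question; the vector names are chosen here for illustration.

```r
# Row margins: P(better off), P(worse off), P(same)
p_fin <- c(better = 0.2, worse = 0.3, same = 0.5)

# Row conditionals: P(satisfied | finances); for worse off this is 1 - 0.7
p_sat_given_fin <- c(better = 0.6, worse = 1 - 0.7, same = 0.5)

cell_sat <- p_fin * p_sat_given_fin   # cell probabilities in the satisfied column
p_sat <- sum(cell_sat)                # column margin P(satisfied)
p_fin_given_sat <- cell_sat / p_sat   # column conditionals P(finances | satisfied)

p_sat
round(p_fin_given_sat["worse"], 4)    # the requested probability
```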
Section 3. Simulation
Learning objective: Use probability models to build simulations of complex real world processes to answer research questions
3.1
The World Series determines the championship team of Major League Baseball. The tournament is a first-to-4-wins competition. This means the tournament could end after 4, 5, 6, or 7 games.
This year, the two teams in the World Series are the Los Angeles Dodgers (D) and the New York Yankees (Y). If we suppose that the two teams are equally matched, then the probability of winning for either team is 0.5. The outcome of a single game can be simulated with the sample command:
sample(c("Y","D"), 1)
[1] "D"
However, the teams may not be equally matched. It is possible that the probability that the Dodgers win a head-to-head matchup is 0.55. Consequently, the probability that the Yankees win a head-to-head matchup would be 1-0.55 = 0.45. Look at the help documentation of sample to determine how to incorporate the belief that the Dodgers’ probability of winning a head-to-head match-up is 0.55. Write the command below.
Solution
sample(c("Y","D"), 1, prob = c(0.45, 0.55))
[1] "Y"
Now, as you did before, modify the code to simulate 7 games instead of just 1. This time, also incorporate the belief that the Dodgers’ probability of winning an individual game is 0.55.
sample(c("Y","D"), 7, prob = c(0.45, 0.55), replace=TRUE)
[1] "D" "Y" "D" "D" "D" "Y" "D"
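A quick way to confirm that the prob argument is lined up with the right team is to simulate many games and look at the long-run proportion of Dodgers wins. The sketch below is an added check, not part of the exam solution; the seed and the number of games are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, used here only for reproducibility

# prob is matched to the outcomes by position: "Y" gets 0.45, "D" gets 0.55.
games <- sample(c("Y", "D"), 100000, prob = c(0.45, 0.55), replace = TRUE)

mean(games == "D")  # should be close to 0.55
```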
3.2
The following function ws_outcome takes an input vector of games and translates it to a dataframe in which the first column is the winning team and the second column is how many games were played.
flowchart LR A(Y,D,D,Y,D,D,Y) ---> |ws_outcome| B(D,6)
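ws_outcome <- function(x){
  t1 <- table(x)                       # wins per team across the 7 simulated games
  teams <- names(t1)
  winning_team_idx <- which(t1 >= 4)   # only one team can reach 4 wins in 7 games
  winning_team <- teams[winning_team_idx]
  # The series ends at the game in which the winner records its 4th win
  winning_game <- which(cumsum(x==winning_team)==4)[1]
  data.frame(winning_team = winning_team, winning_game = winning_game)
}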
The ws_outcome function can be combined with your solution to 3.1 to simulate many outcomes of the World Series when the Dodgers’ probability of winning a head-to-head match-up is 0.55. Generate 5000 outcomes with the replicate command. Use set.seed(298235). As you did in the prep questions (link), create a summary table with cell and margin probabilities.
HINT: This was the table from the prep questions. Your output should look similar.
winning_game
winning_team 4 5 6 7 Sum
D 0.0620 0.1260 0.1590 0.1518 0.4988
Y 0.0570 0.1318 0.1584 0.1540 0.5012
Sum 0.1190 0.2578 0.3174 0.3058 1.0000
Solution
one_series <- function(x) sample(c("Y","D"), 7, prob = c(0.45, 0.55), replace=TRUE) |> ws_outcome()
out <- replicate(5000, one_series(), simplify = FALSE)

# Stack all the elements of the list into a dataframe
sim_results <- do.call("rbind", out)

# Table
t1 <- table(sim_results) |> proportions() |> addmargins()
t1
winning_game
winning_team 4 5 6 7 Sum
D 0.0896 0.1620 0.1814 0.1710 0.6040
Y 0.0420 0.0876 0.1304 0.1360 0.3960
Sum 0.1316 0.2496 0.3118 0.3070 1.0000
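As a sanity check on the simulation, the cell probabilities can also be computed exactly. This sketch is an addition, not part of the exam solution. It assumes independent games with a constant Dodgers win probability of 0.55, and uses the fact that the series winner must win the final game and exactly 3 of the games before it.

```r
p <- 0.55   # P(Dodgers win a single game), from the question
g <- 4:7    # possible series lengths

# Winner takes game g and exactly 3 of the previous g - 1 games.
d_wins <- choose(g - 1, 3) * p^4 * (1 - p)^(g - 4)
y_wins <- choose(g - 1, 3) * (1 - p)^4 * p^(g - 4)

exact <- rbind(D = d_wins, Y = y_wins)
colnames(exact) <- g

# Add the row and column margins by hand.
exact <- cbind(exact, Sum = rowSums(exact))
exact <- rbind(exact, Sum = colSums(exact))
round(exact, 4)
```

Under these assumptions the exact probability that the Dodgers win the series is about 0.608, which the simulated margin of 0.6040 approximates.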
Section 4. Diagnostics
Learning objective: apply cross table framework to the special case of binary outcomes
4.1
An audit of an email filtering system resulted in a dataset of 10000 emails, each manually verified as spam or not spam. In addition to the type of email, the dataset indicates if the filter sent the email to the inbox or the junk folder.
The following command reads the dataset into memory. From the data, generate an estimate of the positive predictive value and the negative predictive value of the spam filter by calculating sample proportions.
HINT: Sensitivity in this example = \(P(\text{Junk folder} | \text{Spam})\)
Solution
\[ \begin{align*} \text{PPV} &= P(\text{Type = Spam} | \text{Folder = Junk} ) \\ \text{NPV} &= P(\text{Type = Not spam} | \text{Folder = Inbox} ) \end{align*} \]
These are quantities that are readily available from the cross table.
d1 <- readRDS(url("https://tgstewart.cloud/spam-data.RDS"))
descr::CrossTable(d1$Type, d1$Folder)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
=======================================
d1$Folder
d1$Type Inbox Junk Total
---------------------------------------
Not spam 7589 422 8011
292.899 1030.042
0.947 0.053 0.801
0.975 0.191
0.759 0.042
---------------------------------------
Spam 197 1792 1989
1179.696 4148.649
0.099 0.901 0.199
0.025 0.809
0.020 0.179
---------------------------------------
Total 7786 2214 10000
0.779 0.221
=======================================
PPV = 0.809
NPV = 0.975
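The same quantities can be recovered with base R from the four counts alone. The sketch below copies the counts from the CrossTable output rather than downloading the dataset; the object names are chosen here for illustration.

```r
# Counts copied from the CrossTable output above.
counts <- matrix(
  c(7589,  422,
     197, 1792),
  nrow = 2, byrow = TRUE,
  dimnames = list(Type = c("Not spam", "Spam"), Folder = c("Inbox", "Junk"))
)

col_prop <- proportions(counts, margin = 2)  # condition on Folder
row_prop <- proportions(counts, margin = 1)  # condition on Type

diagnostics <- c(
  PPV         = col_prop["Spam", "Junk"],
  NPV         = col_prop["Not spam", "Inbox"],
  Sensitivity = row_prop["Spam", "Junk"],
  Specificity = row_prop["Not spam", "Inbox"]
)
round(diagnostics, 3)
```

PPV and NPV condition on the folder (column proportions), while sensitivity and specificity condition on the true email type (row proportions).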
4.2
Using the sensitivity and specificity observed in the data, calculate the negative predictive value of the spam filter if the incidence of spam were to increase to 0.35. As before, you may show your work by inserting an image of your handwritten solution. You are also welcome to show your work directly in the report.
Solution
From the table above,
Sensitivity = 0.901
Specificity = 0.947
To calculate NPV from sensitivity and specificity, we can use the table method or Bayes rule.
\[ \begin{align*} \text{NPV} & = P(\text{Not spam} | \text{Inbox}) \\ & = \frac{P(\text{Not spam}\ \&\ \text{Inbox})}{P(\text{Inbox})}\\ & = \frac{P(\text{Inbox} | \text{Not spam}) P(\text{Not spam})}{P(\text{Inbox} | \text{Not spam}) P(\text{Not spam}) + P(\text{Inbox} | \text{Spam}) P(\text{Spam})} \\ & = \frac{P(\text{Inbox} | \text{Not spam}) P(\text{Not spam})}{P(\text{Inbox} | \text{Not spam})[1-P(\text{Spam})] + [1- P(\text{Junk} | \text{Spam})] P(\text{Spam})} \\ & = \frac{\text{Specificity} \times (1-\text{Incidence})}{\text{Specificity} \times (1-\text{Incidence}) + (1-\text{Sensitivity}) \times \text{Incidence}} \\ \end{align*} \]
Spe <- 0.947
Sen <- 0.901
p <- 0.35
c(NPV = Spe * (1-p) / ( Spe * (1-p) + (1-Sen) * p ))
NPV
0.9467087
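Wrapping the Bayes rule expression in a function makes it easy to see how NPV changes with incidence. The function name `npv` and its argument names are hypothetical, introduced here for illustration; the formula is the last line of the derivation above.

```r
# NPV as a function of sensitivity, specificity, and incidence (Bayes rule).
npv <- function(sen, spe, incidence) {
  spe * (1 - incidence) / (spe * (1 - incidence) + (1 - sen) * incidence)
}

# Incidence of 0.35, as asked in the question.
npv(sen = 0.901, spe = 0.947, incidence = 0.35)

# Incidence observed in the audit data (0.199); this should be close to the
# NPV of 0.975 read directly from the cross table.
npv(sen = 0.901, spe = 0.947, incidence = 0.199)
```

Holding sensitivity and specificity fixed, a higher incidence of spam lowers the NPV, because a larger share of the inbox is then made up of missed spam.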
Section 5. Confounding vs Causal Pathway
Learning objective: define/describe confounding variables, Simpson’s paradox, DAGs, and the causal pathway
5.1
Describe an example in which an association between two variables might be affected by a third, confounding variable. Also describe a situation in which a third variable might be an intermediate outcome. Explain how you would measure the association differently in the two scenarios.
Solution
Many examples work. See figures below.
When there is a confounder, one should calculate an adjusted or stratified measure of association. If the third variable is an intermediate outcome, stratification would be inappropriate. The pooled measure of association should be calculated instead.
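The difference between a stratified and a pooled measure can be illustrated with a small numerical sketch. The counts below are made up for illustration only: disease severity is assumed to affect both who gets treated and who recovers.

```r
# Made-up counts for illustration; severity confounds treatment and recovery.
d <- data.frame(
  severity  = c("mild", "mild", "severe", "severe"),
  treated   = c(TRUE, FALSE, TRUE, FALSE),
  n         = c(20, 80, 80, 20),
  recovered = c(18, 64, 40, 6)
)

# Stratified: recovery proportion within each severity level.
d$prop <- d$recovered / d$n
d

# Pooled: ignore severity and add the counts across strata.
pooled <- aggregate(cbind(recovered, n) ~ treated, data = d, FUN = sum)
pooled$prop <- pooled$recovered / pooled$n
pooled
```

In this sketch the treated do better within each severity stratum (0.90 vs 0.80 and 0.50 vs 0.30) but worse when pooled (0.58 vs 0.70), a version of Simpson's paradox. This is why the stratified measure is the appropriate one when the third variable is a confounder.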
5.2
The following code generates DAGs. Tweak the labels in the code so that the DAGs match your examples.
flowchart TB subgraph Confounding A(Treatment) --> B(Recovery) C(Disease severity) --> B C --> A end
flowchart TB subgraph Confounding A(Parent's marriage status) --> B(College graduation) C(Income) --> B C --> A end
flowchart TB subgraph Intermediate D(Vaccination status) --> E(Speed of recovery) D --> F(Disease severity) F --> E end
flowchart TB subgraph Intermediate D(Preeclampsia) --> E(Cerebral palsy) D --> F(Gestational age at delivery) F --> E end