import pandas as pd
d1 = pd.read_csv("https://tgstewart.cloud/spam-data.csv")
Exam 2 Prep
Instructions. Create a reproducible report of your answers using either Quarto or a Jupyter notebook.
Submission instructions.
- Render your report as HTML, then print to PDF.
- Upload the PDF report to Gradescope via Canvas (link TBD).
Part 1: Tools of the data scientist
1.1 Write out the output of the following Python code.
numbers = range(14)
result = [x**2 for x in numbers if x % 3 == 0]
print(result)
1.2 Explain the advantages of a virtual environment when developing Python code.
Part 2: Probability
2.1 Consider two coins, one fair (\(P(H)=.5\)) and one weighted (\(P(H)=.25\)).
Suppose you pick a coin at random, flip it 3 times, and record the number of heads.
(a) Create a figure of the PMF of the outcome when p = 0.5.
(b) Create a figure of the PMF of the outcome when p = 0.25.
(c) Calculate P(3 heads).
(d) If you observe 3 heads, what is the probability that you selected the fair coin?
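The pieces of problem 2.1 can be sketched with the binomial PMF and Bayes' rule. The following is a minimal sketch, not a model answer: the helper name `binomial_pmf` and the variable names are illustrative, and it assumes each coin is selected with probability 0.5. The PMF lists could be passed to any plotting library for parts (a) and (b).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_flips = 3
outcomes = range(n_flips + 1)

# PMF of the number of heads for each coin
pmf_fair = [binomial_pmf(k, n_flips, 0.5) for k in outcomes]
pmf_weighted = [binomial_pmf(k, n_flips, 0.25) for k in outcomes]

# Law of total probability, assuming each coin is picked with probability 0.5
p_three_heads = 0.5 * pmf_fair[3] + 0.5 * pmf_weighted[3]

# Bayes' rule: P(fair | 3 heads)
p_fair_given_three = 0.5 * pmf_fair[3] / p_three_heads

print(p_three_heads, p_fair_given_three)
```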
2.3 Suppose a diagnostic for a disease has a sensitivity of 0.95 and a specificity of 0.9.
(a) Calculate the PPV when the prevalence is 0.1.
(b) Create a figure in Python that shows the PPV of the diagnostic for all possible prevalences.
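One way to approach problem 2.3 is to write the PPV as a function of sensitivity, specificity, and prevalence using Bayes' rule, then evaluate it on a grid of prevalences. This is a sketch only; the function name `ppv` and the grid of 101 points are illustrative choices, and the resulting lists can be handed to a plotting library to draw the curve.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Part (a): a single prevalence
ppv_at_10_percent = ppv(0.95, 0.9, 0.1)

# Part (b): values for a curve over prevalences from 0 to 1
prevalences = [i / 100 for i in range(101)]
ppv_curve = [ppv(0.95, 0.9, p) for p in prevalences]

print(round(ppv_at_10_percent, 4))
```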
2.4 An audit of an email filtering system resulted in a dataset of 10000 emails, each manually verified as spam or not spam. In addition to the type of email, the dataset indicates if the filter sent the email to the inbox or the junk folder.
(a) The read_csv command at the top of this document reads the dataset into memory as d1. From the data, estimate the positive predictive value and the negative predictive value of the spam filter by calculating sample proportions.
HINT: Sensitivity in this example = P(Junk folder|Spam)
(b) Using the sensitivity and specificity observed in the data, calculate the negative predictive value of the spam filter if the incidence of spam were to increase to 0.35.
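For part (b), the negative predictive value at a new prevalence follows from Bayes' rule once sensitivity and specificity are known. The sketch below uses placeholder values for sensitivity and specificity; they are hypothetical and are not taken from the spam dataset, so substitute the values you estimate from the data.

```python
def npv(sensitivity, specificity, prevalence):
    """Negative predictive value P(not spam | sent to inbox) via Bayes' rule."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

# Hypothetical placeholder values, NOT estimates from the dataset
example_sensitivity = 0.90
example_specificity = 0.95

npv_at_new_prevalence = npv(example_sensitivity, example_specificity, 0.35)
print(npv_at_new_prevalence)
```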
2.5 Consider a survey of political opinions. Since 1979, the public opinion polling company, Gallup Inc, has asked respondents the following:
In general, are you satisfied or dissatisfied with the way things are going in the United States at this time?
The three possible responses are satisfied, dissatisfied, and no opinion.
Another survey question is:
Would you say that you are financially better off now than you were a year ago, or are you financially worse off now?
The four possible responses to this question are better off, worse off, same, or no opinion.
For the purposes of this problem, ignore the no opinion response for both survey questions.
Also suppose the following probabilities were known:
- Among those who are better off, the proportion of respondents who are satisfied with the way things are going is 0.6.
- Among those who are worse off, the proportion of respondents who are dissatisfied with the way things are going is 0.7.
- Among those who are the same financially, the proportion of respondents who are satisfied with the way things are going is 0.5.
- Overall, 0.2 report being better off, 0.5 report being about the same, and 0.3 report being worse off.
(a) Using this information, are General Outlook and Financial Wellbeing independent?
(b) If not, how are the outcomes associated? (Positive, Negative)
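A check for independence compares each conditional probability of being satisfied with the marginal probability, which comes from the law of total probability. The sketch below uses the probabilities stated in the problem; the dictionary names are illustrative, and it assumes that, with "no opinion" ignored, P(satisfied | worse off) = 1 - 0.7.

```python
# Marginal distribution of financial wellbeing (from the problem statement)
p_financial = {"better": 0.2, "same": 0.5, "worse": 0.3}

# P(satisfied | financial group); the "worse" entry is the complement of 0.7
p_sat_given = {"better": 0.6, "same": 0.5, "worse": 1 - 0.7}

# Law of total probability: marginal P(satisfied)
p_sat = sum(p_sat_given[g] * p_financial[g] for g in p_financial)

# Independence requires every conditional to equal the marginal
independent = all(abs(p_sat_given[g] - p_sat) < 1e-9 for g in p_financial)

print(p_sat, independent)
```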
Part 3: Simulation
3.1 The World Series determines the championship team of Major League Baseball. The tournament is a first-to-4-wins competition. This means the tournament could end after 4, 5, 6, or 7 games.
This year, the two teams in the World Series are the Los Angeles Dodgers (D) and the Toronto Blue Jays (B). If we suppose that the two teams are equally matched, then the probability of winning for either team is 0.5. The outcome of a single game can be simulated with the command:
import numpy as np
np.random.binomial(n=1, p=0.5)
Use a list comprehension to simulate the outcome of 7 games. (This can be accomplished without a list comprehension, but we are having you practice list comprehensions.)
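As a sketch of the pattern for 3.1, the single-game command can be repeated inside a list comprehension. The seed value below is an arbitrary assumption added only to make the draw repeatable, and `games` is an illustrative name.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, assumed here only for repeatability

# One Bernoulli(0.5) draw per game, repeated for 7 games
games = [int(np.random.binomial(n=1, p=0.5)) for _ in range(7)]
print(games)
```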
3.2 Write a function that determines who won the World Series and in which game they won it. The output of your function should be a tuple: the first element is the team that won, and the second element is the total number of games in the series.
For example, for this sequence
"1" "0" "0" "0" "0" "1" "1"
the output would be (0, 5) because the fourth win for the Dodgers occurred in game 5.
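One possible shape for such a function is to keep a running win count per team and stop as soon as either count reaches 4. This is a sketch under the coding implied by the example (0 for the Dodgers, 1 for the Blue Jays); the name `series_winner` and the `wins_needed` parameter are illustrative.

```python
def series_winner(games, wins_needed=4):
    """Return (winning team, game number in which the series ended)."""
    wins = {0: 0, 1: 0}
    for game_number, outcome in enumerate(games, start=1):
        wins[outcome] += 1
        if wins[outcome] == wins_needed:
            return outcome, game_number
    raise ValueError("no team reached the required number of wins")

# The example sequence from the problem statement
print(series_winner([1, 0, 0, 0, 0, 1, 1]))  # → (0, 5)
```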
3.3 Generate many outcomes of the World Series. Estimate the proportion of times the series ends in 4, 5, 6, and 7 games. Generate a plot of the results.
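A simulation along these lines draws many 7-game sequences, records the game in which each series ends, and tabulates the proportions. The sketch below is one possible approach; the seed, the number of simulated series, and the helper name `series_length` are assumptions chosen for illustration. The `proportions` dictionary can then be drawn as a bar chart with a plotting library.

```python
from collections import Counter

import numpy as np

def series_length(games, wins_needed=4):
    """Game number at which one team first reaches wins_needed wins."""
    wins = [0, 0]
    for game_number, outcome in enumerate(games, start=1):
        wins[outcome] += 1
        if wins[outcome] == wins_needed:
            return game_number
    raise ValueError("no team reached the required number of wins")

np.random.seed(2025)  # arbitrary seed, assumed for repeatability
n_series = 20000      # arbitrary number of simulated series

# Each row is one simulated series of 7 games
sims = np.random.binomial(n=1, p=0.5, size=(n_series, 7)).tolist()

lengths = Counter(series_length(row) for row in sims)
proportions = {g: lengths[g] / n_series for g in (4, 5, 6, 7)}
print(proportions)
```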
3.4 The following function generates data from a cohort of individuals who were diagnosed with disease X. In the dataset, there is vaccination status, disease severity, and disease duration.
Using seed = 20251028, generate 1000 draws from the function. Calculate a summary of the effect of vaccination by calculating the difference in conditional probabilities:
\[ \Delta = P(\text{Short recovery}|\text{vaccinated}) - P(\text{Short recovery}|\text{unvaccinated}) \]
import numpy as np
import pandas as pd

def vax_data(R, seed=None):
    if seed is not None:
        np.random.seed(seed)
    vs = np.random.binomial(1, 0.5, R)
    ds = np.random.binomial(1, 0.25 * (vs == 1) + 0.75 * (vs == 0))
    rt = np.random.binomial(1, 0.7 * (ds == 1) + 0.5 * (ds == 0))
    df = pd.DataFrame({
        "vaccination_status": np.where(vs == 1, "vaccinated", "unvaccinated"),
        "disease_severity": np.where(ds == 1, "mild", "severe"),
        "recovery_time": np.where(rt == 1, "short", "long")
    })
    return df
3.5 Calculate the treatment effect as before, but this time calculate it separately for the mild and severe populations.
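The difference in conditional probabilities can be computed from group-wise sample proportions, once for the whole cohort and once within each severity stratum. The sketch below uses a tiny hand-made DataFrame with the same column names as the output of vax_data; the eight rows are made up for illustration and are not draws from the function, and `short_recovery_difference` is a hypothetical helper name.

```python
import pandas as pd

def short_recovery_difference(df):
    """P(short | vaccinated) - P(short | unvaccinated), from sample proportions."""
    is_short = df["recovery_time"] == "short"
    p_short = is_short.groupby(df["vaccination_status"]).mean()
    return p_short["vaccinated"] - p_short["unvaccinated"]

# Made-up toy data with the same columns as vax_data's output
toy = pd.DataFrame({
    "vaccination_status": ["vaccinated"] * 4 + ["unvaccinated"] * 4,
    "disease_severity": ["mild", "mild", "severe", "severe"] * 2,
    "recovery_time": ["short", "short", "short", "long",
                      "short", "long", "long", "long"],
})

# Overall difference, then the same summary within each severity stratum
overall = short_recovery_difference(toy)
by_severity = {
    severity: short_recovery_difference(group)
    for severity, group in toy.groupby("disease_severity")
}
print(overall, by_severity)
```

Replacing `toy` with the DataFrame returned by vax_data applies the same summaries to the simulated cohort.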