# Simulate R widgets: machine of origin and defect status
widgets <- function(R){
  # Machines A, B, C produce widgets in a 4:2:1 ratio
  machine <- sample(c("A","B","C"), R, replace = TRUE, prob = c(4,2,1)/7)
  # Defect rates: A = 0.1, B = 0.01, C = 0.001
  defect <- rbinom(R, 1, 0.1*(machine=="A") + 0.01*(machine=="B") + 0.001*(machine=="C"))
  data.frame(machine = machine, defect = defect)
}
# Create a large dataset of widgets
set.seed(23405)
d1 <- widgets(100000)
HW 5
Question 1
Machines A, B, and C generate widgets with a defect rate of 0.1, 0.01, and 0.001, respectively. Machine A generates twice as many widgets as B, and machine B generates twice as many widgets as machine C. Please complete the cross table below based on the information above.
Hints:

* What type of probability is the defect rate? Where does this probability appear in the table?
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | | | | |
| → row | | | | |
| → col | | | | |
| Functional | | | | |
| → cell | | | | |
| → row | | | | |
| → col | | | | |
Solution
The defect rate for a specific machine is a conditional probability. We could express it as
\[ \begin{align*} P(\text{defective}|A) &= 0.1 \\ P(\text{defective}|B) &= 0.01 \\ P(\text{defective}|C) &= 0.001 \\ \end{align*} \]
With the cross-table oriented with machine on columns, the defect rate is a column conditional probability. We can place these values in the table first.
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | | | | |
| → row | | | | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | | | | |
| → row | | | | |
| → col | | | | |
The next information to add to the table comes from the production information. Note that this information refers to the marginal probabilities \(P(A)\), \(P(B)\), and \(P(C)\). The information is provided as a system of equations.
\[ \begin{align*} P(A) &= 2P(B) \\ P(B) &= 2P(C) \end{align*} \]
There is a third important equation to incorporate.
\[ P(A) + P(B) + P(C) = 1 \]
With three unknowns and three equations, we can solve the system. Substitution is one way to solve. Here we replace \(P(A)\) and \(P(B)\) with \(4P(C)\) and \(2P(C)\), respectively.
\[ 4P(C) + 2P(C) + P(C) = 1 \rightarrow 7P(C) = 1 \rightarrow P(C) = 1/7 \rightarrow P(B) = 2/7 \rightarrow P(A) = 4/7 \]
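As a quick sanity check, the same system of three equations can be solved numerically with R's `solve()`. This is just a sketch; the matrix layout and object names below are my own, not part of the assignment.

```r
# Rows encode: P(A) - 2P(B) = 0,  P(B) - 2P(C) = 0,  P(A) + P(B) + P(C) = 1
coef <- rbind(c(1, -2,  0),
              c(0,  1, -2),
              c(1,  1,  1))
rhs <- c(0, 0, 1)
p <- solve(coef, rhs)   # marginal probabilities P(A), P(B), P(C)
round(p, 4)             # 4/7, 2/7, 1/7 = 0.5714, 0.2857, 0.1429
```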
The marginal probabilities can be added to the table.
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | | | | |
| → row | | | | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | | | | |
| → row | | | | |
| → col | | | | |
| | 4/7 | 2/7 | 1/7 | |
Now we use the properties of probabilities to fill in the rest of the table, starting with the fact that the defect rate plus the functional rate must sum to 1.
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | | | | |
| → row | | | | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | | | | |
| → row | | | | |
| → col | 0.9 | 0.99 | 0.999 | |
| | 4/7 | 2/7 | 1/7 | |
The next rule of probability to remember is
\[ P(\text{Defective}, \text{Machine A}) = P(\text{Defective} \mid \text{Machine A}) \, P(\text{Machine A}), \]
a rule which can be generalized to all the cell probabilities.
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | 0.1 * 4/7 | 0.01 * 2/7 | 0.001 * 1/7 | |
| → row | | | | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | 0.9 * 4/7 | 0.99 * 2/7 | 0.999 * 1/7 | |
| → row | | | | |
| → col | 0.9 | 0.99 | 0.999 | |
| | 4/7 | 2/7 | 1/7 | |
Now we can use the cell probabilities to calculate the row margins.
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | 0.1 * 4/7 | 0.01 * 2/7 | 0.001 * 1/7 | 0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7 |
| → row | | | | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | 0.9 * 4/7 | 0.99 * 2/7 | 0.999 * 1/7 | 0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7 |
| → row | | | | |
| → col | 0.9 | 0.99 | 0.999 | |
| | 4/7 | 2/7 | 1/7 | |
And finally, we can calculate the conditional row probabilities using the rule
\[ \frac{P(\text{Defective}, \text{Machine A})}{P(\text{Defective})} = P(\text{Machine A} | \text{Defective}). \]
| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | 0.1 * 4/7 | 0.01 * 2/7 | 0.001 * 1/7 | 0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7 |
| → row | (0.1 * 4/7)/(0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7) | (0.01 * 2/7)/(0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7) | (0.001 * 1/7)/(0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7) | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | 0.9 * 4/7 | 0.99 * 2/7 | 0.999 * 1/7 | 0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7 |
| → row | (0.9 * 4/7)/(0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7) | (0.99 * 2/7)/(0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7) | (0.999 * 1/7)/(0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7) | |
| → col | 0.9 | 0.99 | 0.999 | |
| | 4/7 | 2/7 | 1/7 | |
Evaluating these expressions gives the completed table (rounded).

| | A | B | C | |
|---|---|---|---|---|
| Defective | | | | |
| → cell | 0.0571 | 0.0029 | 0.00014 | 0.0601 |
| → row | 0.9501 | 0.0475 | 0.0024 | |
| → col | 0.1 | 0.01 | 0.001 | |
| Functional | | | | |
| → cell | 0.5143 | 0.2829 | 0.1427 | 0.9399 |
| → row | 0.5472 | 0.301 | 0.1518 | |
| → col | 0.9 | 0.99 | 0.999 | |
| | 0.5714 | 0.2857 | 0.1429 | |
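The completed table can also be reproduced in a few lines of R. This is a sketch with my own object names (`p_machine`, `cells`, and so on): the multiplication rule gives the cells, and `sweep()` gives the row conditionals.

```r
p_machine <- c(A = 4, B = 2, C = 1) / 7       # column margins P(machine)
def_rate  <- c(A = 0.1, B = 0.01, C = 0.001)  # P(defective | machine)

# Multiplication rule: P(status, machine) = P(status | machine) * P(machine)
cells <- rbind(Defective  = def_rate * p_machine,
               Functional = (1 - def_rate) * p_machine)

p_status <- rowSums(cells)                    # row margins P(status)
row_cond <- sweep(cells, 1, p_status, "/")    # P(machine | status)

round(cells, 4)     # matches the "cell" entries in the table
round(row_cond, 4)  # matches the "row" entries in the table
```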
Question 2
If you did not know if a widget was functional or defective, what would be the probability that the widget is from machine A? Write the solution using mathematical notation, as in P(???) = ???.
Solution
\(P(\text{Machine A})\) represents the probability that a randomly chosen widget came from machine A. It ignores the widget's defective/functional status.
\[ P(\text{Machine A}) = 4/7 \]
Question 3
How would you update the probability from 2 if you knew the widget was defective? Write the solution using mathematical notation, as in P(???) = ???.
Solution
If we knew the widget was defective, the updated probability would be \(P(\text{Machine A}|\text{Defective})\).
\[ P(\text{Machine A}|\text{Defective}) = \frac{0.1 * 4/7}{0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7} \]
\(P(\text{Machine A}|\text{Defective}) \approx 0.9501188\)
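The same Bayes-rule calculation can be scripted. A small sketch (variable names are mine, not from the assignment):

```r
p_machine <- c(A = 4, B = 2, C = 1) / 7       # P(machine)
def_rate  <- c(A = 0.1, B = 0.01, C = 0.001)  # P(defective | machine)

p_defective <- sum(def_rate * p_machine)      # total probability of a defect
p_A_given_defective <- def_rate["A"] * p_machine["A"] / p_defective
unname(p_A_given_defective)                   # about 0.9501
```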
Question 4
The following function simulates the widget problem from 2.2. Create the cross table from question 1 empirically, using the function to mimic the manufacture of many widgets and generating the resulting cross table. Be sure to use `set.seed`.
Solution
Here is a peek at what the dataset looks like.
head(d1)
machine defect
1 A 1
2 A 0
3 B 0
4 A 0
5 A 0
6 A 0
There are lots of ways to create a cross table. Here is one way.
suppressPackageStartupMessages(require(pander))
suppressPackageStartupMessages(require(descr))
CrossTable(d1$defect, d1$machine, prop.chisq = FALSE) |>
pander(split.table=Inf)
| d1$defect / d1$machine | A | B | C | Total |
|---|---|---|---|---|
| 0 (N, Row %, Column %, Total %) | 51517, 54.8491%, 89.8981%, 51.517% | 28133, 29.9526%, 99.0285%, 28.133% | 14275, 15.1983%, 99.9300%, 14.275% | 93925, 93.9250% |
| 1 (N, Row %, Column %, Total %) | 5789, 95.2922%, 10.1019%, 5.789% | 276, 4.5432%, 0.9715%, 0.276% | 10, 0.1646%, 0.0700%, 0.010% | 6075, 6.0750% |
| Total | 57306, 57.306% | 28409, 28.409% | 14285, 14.285% | 100000 |
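If you would rather avoid extra packages, base R produces essentially the same empirical table. This is a sketch; the `widgets()` function is restated from the top of the assignment so the chunk stands alone.

```r
# Same simulation as defined at the top of the assignment
widgets <- function(R) {
  machine <- sample(c("A", "B", "C"), R, replace = TRUE, prob = c(4, 2, 1) / 7)
  defect  <- rbinom(R, 1, 0.1 * (machine == "A") +
                          0.01 * (machine == "B") +
                          0.001 * (machine == "C"))
  data.frame(machine = machine, defect = defect)
}

set.seed(23405)
d1 <- widgets(100000)

# Joint proportions with margins, base R only
tab <- table(d1$defect, d1$machine) |> proportions() |> addmargins()
round(tab, 4)
```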
Question 5
What is simulation error for P(machine A & defective)?
Solution
Simulation error is
\[ \underbrace{P(\text{Machine A}, \text{Defective})}_{\text{from sim (Q4)}} - \underbrace{P(\text{Machine A}, \text{Defective})}_{\text{from table (Q1)}} \]
t1 <- table(d1) |> proportions()
t1
defect
machine 0 1
A 0.51517 0.05789
B 0.28133 0.00276
C 0.14275 0.00010
p_sim  <- t1["A","1"]
p_math <- 0.1 * 4/7
p_sim - p_math
[1] 0.0007471429
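For a sense of scale, this error is about what Monte Carlo theory predicts: the standard error of an estimated proportion \(p\) based on \(R\) independent draws is \(\sqrt{p(1-p)/R}\). A quick sketch:

```r
p <- 0.1 * 4/7               # true P(machine A & defective)
R <- 100000                  # number of simulated widgets
se <- sqrt(p * (1 - p) / R)  # Monte Carlo standard error of the estimate
se                           # about 0.00073, the same order as the observed error
```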
Question 6
Here is the data from the in-class survey about soft drinks which you will use in the following problems. For the moment, we will exclude responses with missing values. (But missing data is something we will touch on in the future.)
suppressPackageStartupMessages(require(dplyr))
d1 <- read.csv("https://tgstewart.cloud/soda.csv") |>
  filter(!is.na(cola) & !is.na(sugar))
CrossTable(
    d1$sugar
  , d1$cola
  , prop.chisq = FALSE
  ) |>
  pander(split.table = Inf)
| d1$sugar / d1$cola | Cola (Coke, Pepsi, etc.) | Something else | Total |
|---|---|---|---|
| Regular (full sugar) (N, Row %, Column %, Total %) | 10, 25.0000%, 50.0000%, 15.1515% | 30, 75.0000%, 65.2174%, 45.4545% | 40, 60.6061% |
| Zero sugar or diet (N, Row %, Column %, Total %) | 10, 38.4615%, 50.0000%, 15.1515% | 16, 61.5385%, 34.7826%, 24.2424% | 26, 39.3939% |
| Total | 20, 30.303% | 46, 69.697% | 66 |
If the proportions in the contingency table represented probabilities, would drink preference and sugar preference be independent?
Solution
If drink preference and sugar preferences were independent, then the following would be true:
\[ P(\text{Regular}, \text{Cola}) = P(\text{Regular})P(\text{Cola}) \]
Let’s calculate the difference.
t1 <- table(d1) |> proportions() |> addmargins()
t1
sugar
cola Regular (full sugar) Zero sugar or diet Sum
Cola (Coke, Pepsi, etc.) 0.1515152 0.1515152 0.3030303
Something else 0.4545455 0.2424242 0.6969697
Sum 0.6060606 0.3939394 1.0000000
p_reg_cola <- t1["Cola (Coke, Pepsi, etc.)", "Regular (full sugar)"]
p_reg      <- t1["Sum", "Regular (full sugar)"]
p_cola     <- t1["Cola (Coke, Pepsi, etc.)", "Sum"]
p_reg_cola
[1] 0.1515152
p_reg * p_cola
[1] 0.1836547
p_reg_cola - p_reg * p_cola
[1] -0.03213958
The joint proportion differs from the product of the margins, so drink preference and sugar preference would not be independent.
Question 7
Is a preference for Cola positively or negatively or not associated with a preference for regular sugar?
Solution
One way to answer this is to ask: if we look just at cola drinkers, do we observe more or fewer regular-sugar drinkers than in the general population? If more, then cola and regular sugar are positively associated; if fewer, they are negatively associated.
\[ P(\text{sugar}|\text{cola}) \text{ vs } P(\text{sugar}) \]
# We could also get these quantities from the table in Q6
p_sugar_given_cola <- t1["Cola (Coke, Pepsi, etc.)","Regular (full sugar)"] /
  t1["Cola (Coke, Pepsi, etc.)","Sum"]
p_sugar <- t1["Sum","Regular (full sugar)"]
p_sugar_given_cola
[1] 0.5
p_sugar
[1] 0.6060606
We see fewer regular-sugar drinkers among cola drinkers (0.5) than in the general population (about 0.61). Therefore, a preference for cola and a preference for regular sugar are negatively associated.
Question 8
Consider the population of individuals who suffer from migraine headaches.
| | Fewer than 3 migraines per month | 3 or more migraines per month |
|---|---|---|
| Less toxic treatment | .4 | .1 |
| More toxic treatment | .3 | .2 |
What can you conclude from this data about the effectiveness of the more toxic treatment? (Hint: Section 4.7)
Solution
In Section 4.7, we learn about the difference between association and causation. In this table, \(P(\text{fewer than 3} \mid \text{less toxic}) = .4/.5 = .8\) while \(P(\text{fewer than 3} \mid \text{more toxic}) = .3/.5 = .6\), so the less toxic treatment appears to perform better. Even so, it isn't clear that there is a causal relationship: it may well be that individuals with more severe migraines choose the more toxic treatment, which by itself would produce this association.