HW 5

Question 1

Machines A, B, and C generate widgets with a defect rate of 0.1, 0.01, and 0.001, respectively. Machine A generates twice as many widgets as B, and machine B generates twice as many widgets as machine C. Please complete the cross table below based on the information above.

Hints: * What type of probability is the defect rate? Where does this probability appear in the table?

A B C
Defective
→ cell
→ row
→ col
Functional
→ cell
→ row
→ col

Solution

The defect rate for a specific machine is a conditional probability. We could express it as

\[ \begin{align*} P(\text{defective}|A) &= 0.1 \\ P(\text{defective}|B) &= 0.01 \\ P(\text{defective}|C) &= 0.001 \\ \end{align*} \]

With the cross-table oriented with machine on columns, the defect rate is a column conditional probability. We can place these values in the table first.

A B C
Defective
→ cell
→ row
→ col 0.1 0.01 0.001
Functional
→ cell
→ row
→ col

The next information to add to the table comes from the production information. Note that this information refers to the marginal probabilities \(P(A)\), \(P(B)\), and \(P(C)\). The information is provided as a system of equations.

\[ \begin{align} P(A) = 2P(B)\\ P(B) = 2P(C) \end{align} \]

There is a third important equation to incorporate.

\[ P(A) + P(B) + P(C) = 1 \]

With three unknown and three equations, we can solve the system. Substitution is one way to solve. Here we replace \(P(A)\) and \(P(B)\) with \(4P(C)\) and \(2P(C)\), respectively.

\[ 4P(C) + 2P(C) + P(C) = 1 \rightarrow 7P(C) = 1 \rightarrow P(C) = 1/7 \rightarrow P(B) = 2/7 \rightarrow P(A) = 4/7 \]

The marginal probabilities can be added to the table.

A B C
Defective
→ cell
→ row
→ col 0.1 0.01 0.001
Functional
→ cell
→ row
→ col
4/7 2/7 1/7

Now we use the properties of probabilities to fill in the rest of the table, starting with the fact that the defect rate plus the functional rate must sum to 1.

A B C
Defective
→ cell
→ row
→ col 0.1 0.01 0.001
Functional
→ cell
→ row
→ col 0.99 0.99 0.999
4/7 2/7 1/7

The next rule of probability to remember is

\[ P(\text{Defective}, \text{Machine A}) = P(\text{Defective}, | \text{Machine A}) P(\text{Machine A}), \]

a rule which can be generalized to all the cell probabilities.

A B C
Defective
→ cell 0.1 * 4/7 0.01 * 2/7 0.001 * 1/7
→ row
→ col 0.1 0.01 0.001
Functional
→ cell 0.9 * 4/7 0.99 * 2/7 0.999 * 1/7
→ row
→ col 0.9 0.99 0.999
4/7 2/7 1/7

Now we can use the cell probabilities to calculate the row margins.

A B C
Defective
→ cell 0.1 * 4/7 0.01 * 2/7 0.001 * 1/7 0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7
→ row
→ col 0.1 0.01 0.001
Functional
→ cell 0.9 * 4/7 0.99 * 2/7 0.999 * 1/7 0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7
→ row
→ col 0.9 0.99 0.999
4/7 2/7 1/7

And finally, we can calculate the conditional row probabilities using the rule

\[ \frac{P(\text{Defective}, \text{Machine A})}{P(\text{Defective})} = P(\text{Machine A} | \text{Defective}). \]

A B C
Defective
→ cell 0.1 * 4/7 0.01 * 2/7 0.001 * 1/7 0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7
→ row (0.1 * 4/7)/(0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7) (0.01 * 2/7)/(0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7) (0.001 * 1/7)/(0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7)
→ col 0.1 0.01 0.001
Functional
→ cell 0.9 * 4/7 0.99 * 2/7 0.999 * 1/7 0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7
→ row (0.9 * 4/7)/(0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7) (0.99 * 2/7)/(0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7) (0.999 * 1/7)/(0.9 * 4/7 + 0.99 * 2/7 + 0.999 * 1/7)
→ col 0.9 0.99 0.999
4/7 2/7 1/7
A B C
Defective
→ cell 0.0571 0.0029 0.00014 0.0601
→ row 0.9501 0.0475 0.0024
→ col 0.1 0.01 0.001
Functional
→ cell 0.5143 0.2829 0.1427 0.9399
→ row 0.5472 0.301 0.1518
→ col 0.9 0.99 0.999
0.5714 0.2857 0.1429

Question 2

If you did not know if a widget was functional or defective, what would be the probability that the widget is from machine A? Write the solution using mathematical notation, as in P(???) = ???.

Solution

\(P(\text{Machine A})\) represent the probability of a random widget being from machine A. It ignores anything about the defective/functional status.

\[ P(\text{Machine A}) = 4/7 \]

Question 3

How would you update the probability from 2 if you knew the widget was defective? Write the solution using mathematical notation, as in P(???) = ???.

Solution

If one knew the widget was defective, our updated probability would be \(P(\text{Machine A}|\text{Defective})\).

\[ P(\text{Machine A}|\text{Defective}) = \frac{0.1 * 4/7}{0.1 * 4/7 + 0.01 * 2/7 + 0.001 * 1/7} \]

\(P(\text{Machine A}|\text{Defective})\) = 0.9501188

Question 4

The following function simulates the widget problem from 2.2. Create the cross table from question 1 empirically, using the function to mimic the manufacture of many widgets and generating the resulting cross table. Be sure to use set.seed.

Solution

widgets <- function(R){
    machine <- sample(c("A","B","C"), R, replace=TRUE,prob=c(4,2,1)/7)
    defect <- rbinom(R,1,0.1*(machine=="A")+0.01*(machine=="B")+0.001*(machine=="C"))
    data.frame(machine=machine, defect=defect)
}

# Create a large dataset of widgets
set.seed(23405)
d1 <- widgets(100000)

Here is a peak at what the dataset looks like.

head(d1)
  machine defect
1       A      1
2       A      0
3       B      0
4       A      0
5       A      0
6       A      0

There are lots of ways to create a cross table. Here is one way.

suppressPackageStartupMessages(require(pander))
suppressPackageStartupMessages(require(descr))

CrossTable(d1$defect, d1$machine, prop.chisq = FALSE) |>
  pander(split.table=Inf)
 
d1$defect
d1$machine
A
 
B
 
C
 
Total
0
N
Row(%)
Column(%)
Total(%)
 
51517
54.8491%
89.8981%
51.517%
 
28133
29.9526%
99.0285%
28.133%
 
14275
15.1983%
99.9300%
14.275%
 
93925
93.9250%

1
N
Row(%)
Column(%)
Total(%)
 
5789
95.2922%
10.1019%
5.789%
 
276
4.5432%
0.9715%
0.276%
 
10
0.1646%
0.0700%
0.010%
 
6075
6.0750%

Total
57306
57.306%
28409
28.409%
14285
14.285%
100000

Question 5

What is simulation error for P(machine A & defective)?

Solution

Simulation error is

\[ \underbrace{P(\text{Machine A}, \text{Defective})}_{\text{from sim (Q4)}} - \underbrace{P(\text{Machine A}, \text{Defective})}_{\text{from table (Q1)}} \]

t1 <- table(d1) |> proportions()
t1
       defect
machine       0       1
      A 0.51517 0.05789
      B 0.28133 0.00276
      C 0.14275 0.00010
p_sim <- t1["A","1"]
p_math <- 0.1 * 4/7
p_sim - p_math
[1] 0.0007471429

Question 6

Here is the data from the in-class survey about soft drinks which you will use in the following problems. For the moment, we will exclude reponses with missing values. (But, missing data is something we will touch on in the future.)

suppressPackageStartupMessages(require(dplyr))

d1 <- read.csv("https://tgstewart.cloud/soda.csv") |>
    filter(!is.na(cola) & !is.na(sugar))
CrossTable(
        d1$sugar
    , d1$cola
    , prop.chisq = FALSE
) |>
pander(split.table=Inf)
 
d1$sugar
d1$cola
Cola (Coke, Pepsi, etc.)
 
Something else
 
Total
Regular (full sugar)
N
Row(%)
Column(%)
Total(%)
 
10
25.0000%
50.0000%
15.1515%
 
30
75.0000%
65.2174%
45.4545%
 
40
60.6061%

Zero sugar or diet
N
Row(%)
Column(%)
Total(%)
 
10
38.4615%
50.0000%
15.1515%
 
16
61.5385%
34.7826%
24.2424%
 
26
39.3939%

Total
20
30.303%
46
69.697%
66

If the proportions in the contingency table represented probabilities, would drink preference and sugar preference be independent?

Solution

If drink preference and sugar preferences were independent, then the following would be true:

\[ P(\text{Regular}, \text{Cola}) = P(\text{Regular})P(\text{Cola}) \]

Let’s calculate the difference.

t1 <- table(d1) |> proportions() |> addmargins()
t1
                          sugar
cola                       Regular (full sugar) Zero sugar or diet       Sum
  Cola (Coke, Pepsi, etc.)            0.1515152          0.1515152 0.3030303
  Something else                      0.4545455          0.2424242 0.6969697
  Sum                                 0.6060606          0.3939394 1.0000000
p_reg_cola <- t1["Cola (Coke, Pepsi, etc.)", "Regular (full sugar)"]
p_reg <- t1["Sum", "Regular (full sugar)"]
p_cola <- t1["Cola (Coke, Pepsi, etc.)", "Sum"]

p_reg_cola
[1] 0.1515152
p_reg*p_cola
[1] 0.1836547
p_reg_cola - p_reg*p_cola
[1] -0.03213958

Question 7

Is a preference for Cola positively or negatively or not associated with a preference for regular sugar?

Solution

One way to solve this is to ask: If I look just a Cola drinkers, would I observe more or less sugar drinkers than the general population? If it is more, then cola and sugar is positively associated. If it is less, then they are negatively associated.

\[ P(\text{sugar}|\text{cola}) \text{ vs } P(\text{sugar}) \]

# We could also get these quantities from the table in Q6
p_sugar_given_cola <- t1["Cola (Coke, Pepsi, etc.)","Regular (full sugar)"]/t1["Cola (Coke, Pepsi, etc.)","Sum"]
p_sugar <- t1["Sum","Regular (full sugar)"]

p_sugar_given_cola
[1] 0.5
p_sugar
[1] 0.6060606

We see fewer sugar drinkers among the cola drinkers than we see in the general population. Therefore, sugar and cola are negatively associated.

Question 8

Consider the population of individuals who suffer from migraine headaches.

Fewer than 3 migraines per month 3 or more migraines per month
Less toxic treatment .4 .1
More toxic treatment .3 .2

What can you conclude from this data about the effectiveness of the more toxic treatment? (Hint: Section 4.7)

Solution

In Section 4.7, we learn about the difference between association and causation. Even if the data show a positive treatment effect for the less toxic treatment, it isn’t clear that there is a causal relationship. It may well be that individuals with more severe migraines choose the more potent treatment.