Computational Probability
Course overview
This book is intended to complement instruction in Computational Probability, providing worked examples in R.
The big ideas
Probability is a tool for organizing beliefs in an uncertain world. It is a powerful framework for updating our beliefs as we observe new information. Consequently, it is the basis of both prediction tools and scientific inference. Computational algorithms and computer simulation are tools which assist in calculating probabilities, forming predictions, and measuring evidence for scientific hypotheses.
The goal of this course is to teach you the concepts of probability in two languages. First, we want to teach you probability using the language of mathematics. Second, we want to teach you probability using the language of simulation and computation. Our hope is that you will be able to express concepts both ways and solve problems both ways. For example, the mean can be expressed as an integral of the density function or as the sample mean of many draws from the underlying distribution.
\[ \underbrace{\int xf(x)\, dx}_{\text{mathematics}} \approx \underbrace{\texttt{runif(N) |> Finv() |> mean()}\phantom{\int}}_{\text{computation}} \]
As you become facile with both languages, you will begin to see cases in which probability questions can be answered easily with mathematical solutions and cases in which computational approaches are faster and easier.
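As a small preview, here is one way both languages might look in R for the mean of an exponential distribution with rate 2 (the distribution is our choice, purely for illustration):
# Mathematics: the mean as the integral of x times the density f(x)
integrate(function(x) x * dexp(x, rate = 2), lower = 0, upper = Inf)

# Computation: the mean as the average of many draws, generated by
# pushing uniform random numbers through the quantile (inverse CDF) function
runif(100000) |> qexp(rate = 2) |> mean()
Both calculations return approximately 1/2, the mean of this distribution.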
How the course is organized
The course is organized into two parts. In the first part, we develop the concepts of uncertainty, focusing on mathematical and computational expressions of probability. We derive several rules of probability which govern how we organize beliefs and how we manipulate expressions of probability. Ultimately, the goal of developing a language for uncertainty is so that we can describe the world around us and make better predictions and decisions.
Because "mathematical and computational expressions of probability" is a mouthful, we use the words probability model or model as shorthand. Part 2 is all about different probability models for univariate (and a few bivariate) outcomes. It is in this part that we develop concepts like density function, quantile function, discrete, continuous, and mixture. It is also the part that introduces families of distributions like the binomial, negative binomial, normal, and Weibull.
In part three, we tackle the problem of constructing models (or updating beliefs) from data. Because data collection is also a source of uncertainty, we investigate how to appropriately propagate that uncertainty to the inferences (that is, the answers to our questions) that we draw from the models. Communicating the uncertainty associated with our answers will be key.
Learning objectives
Probability is a framework for organizing beliefs; it is not a statement of what your beliefs should be.
Compare and contrast different definitions of probability, illustrating differences with simple examples
- Long-run proportion
- Personal beliefs
- Combination of beliefs and data
Express the rules of probability verbally, mathematically, and computationally
- AND, OR, complement, total probability
- simulation error (relative and absolute)
Illustrate the rules of probability with examples
Organize/express bivariate random variables in cross tables
Define joint, conditional, and marginal probabilities
Identify joint, conditional, and marginal probabilities in cross tables
Identify when a research question calls for a joint, conditional, or marginal probability
Describe the connection between conditional probabilities and prediction
Derive Bayes' rule from cross tables
Apply Bayes' rule to answer research questions
Determine if joint outcomes are independent
Calculate a measure of association between joint outcomes
Apply cross table framework to the special case of binary outcomes
- Sensitivity
- Specificity
- Positive predictive value
- Negative predictive value
- Prevalence
- Incidence
Define/describe confounding variables
- Simpson’s paradox
- DAGs
- Causal pathway
List approaches for avoiding confounding
- Stratification
- Randomization
Probability models are a powerful framework for describing and simplifying real world phenomena as a means of answering research questions.
List various data types
- Nominal
- Ordinal
- Interval
- Ratio
Match each data type with probability models that may describe it
- Bernoulli
- binomial
- negative binomial
- Poisson
- Gaussian
- gamma
- mixture
Discuss the degree to which models describe the underlying data
Tease apart model fit and model utility
Express probability models mathematically, computationally, and graphically
- PMF/PDF
- CMF/CDF
- quantile function
- histogram/eCDF
Employ probability models (computationally and analytically) to answer research questions
Explain and implement different approaches for fitting probability models from data
- Tuning
- Method of Moments
- Maximum likelihood
- Bayesian posterior
- kernel density estimation
Visualize the uncertainty inherent in fitting probability models from data
- sampling distribution
- posterior distribution
- bootstrap distribution
Explore how to communicate uncertainty when constructing models and answering research questions
- confidence intervals
- support intervals
- credible intervals
- bootstrap intervals
Propagate uncertainty in simulations
Explore the trade-offs of model complexity and generalizability
Probability is a framework for coherently updating beliefs based on new information and data.
Select prior distributions which reflect personal belief
- informative vs weakly informative priors
Implement Bayesian updating
Manipulate the posterior distribution to answer research questions
Probability models can be expressed and applied mathematically and computationally.
Use probability models to build simulations of complex real world processes to answer research questions
A word of encouragement and caution as you begin
Computer scripts and simulations are tools. Like any tool, data scientists use computer programs to accomplish specific tasks. As a future data scientist, it is important that you understand the ultimate goal/task; unfortunately, the simple act of turning on your computer will not endow you with understanding of your computational task. If you cannot describe the steps of your task before you open VS Code, it is unlikely that you will be able to describe the steps of your task after you open VS Code. When you find yourself embarking on a new task, we encourage you to set aside the computer and describe how you might solve a problem in the physical world. Once the steps of your task in the physical world are clear, the steps of your task will be clearer in the digital world.
Example: Birthday problem
A classic probability puzzle is known as the birthday problem.
In a class of 35 (or N) individuals, what is the probability that at least two individuals share a birthday?
If you were tasked with solving this puzzle using simulation in the physical world, how might you do it? There are lots of possible approaches, but here is one way:
1. Fill a big bucket with 365 ping pong balls, each labeled with a birthday.
2. Create a class birthday roster.
   A. Scramble the ping pong balls and then select one at random.
   B. Record the birthday, and return the ping pong ball to the bucket.
   C. Repeat steps A and B until you have a roster of 35 birthdays.
3. Check the completed roster for any duplicate birthdays. If there are, record a 1; otherwise record a 0.
4. Repeat steps 2 and 3 many times.
5. Calculate the probability of a shared birthday as the proportion of birthday rosters marked with a 1 (in step 3).
Now that we have a clear idea of how we might accomplish this task in the physical world, even though it would be tremendously time consuming, we can mimic the process in code.
The code that we write may mimic the process used in the physical world. Here we write the function create_roster, which matches the process described above. The input n to the create_roster command is the size of the roster.
create_roster <- function(n){
  sample(1:365, n, replace = TRUE)
}
In the example above, the size of the class is 35, so we can create a roster of birthdays of length 35 with the following command.
create_roster(35)
[1] 210 148 184 245 261 51 349 37 134 17 190 16 256 149 105 310 317 310 248
[20] 362 189 107 181 357 305 208 219 195 148 264 301 29 265 31 45
The output is a list of birthdays. Rather than a month-day format, like "April 3", birthdays are reported as day of the year.
Birthday | Number
---|---
Jan 1 | 1
Jan 2 | 2
\(\vdots\) | \(\vdots\)
Dec 30 | 364
Dec 31 | 365
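If you would rather see calendar dates, day-of-the-year numbers can be converted with base R's as.Date (a small aside of ours, using the non-leap year 2023 as a reference):
# Convert day-of-year numbers to calendar dates (non-leap reference year)
as.Date(c(1, 2, 364, 365) - 1, origin = "2023-01-01")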
The check_roster function will take the list of birthdays and determine if any are duplicated. There are many ways to accomplish this task with code; this is just one way. It uses the duplicated command, which takes a vector and returns a vector of the same size. The elements of the output vector are TRUE or FALSE, indicating whether the value at the corresponding position of the input vector has already appeared earlier in the vector. The any command returns TRUE if any elements of the input vector are TRUE.
check_roster <- function(roster){
  roster |> duplicated() |> any()
}
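To see these two commands in action, consider a small made-up vector with one repeated value:
x <- c(10, 20, 10, 30)
duplicated(x)       # FALSE FALSE TRUE FALSE: the second 10 is flagged
any(duplicated(x))  # TRUE: at least one element is a repeat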
Combining these two commands with the pipe operator will create a roster and immediately check if there are any duplicated birthdays.
create_roster(35) |> check_roster()
[1] TRUE
We can repeat these two steps many times with a for loop or a replicate command. The following command repeats the steps 10 times.
replicate(10, create_roster(35) |> check_roster())
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
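If you prefer loops, the same repetition could be written with a for loop (an equivalent sketch, not the approach we use below):
# The same 10 repetitions written as a for loop
results <- logical(10)
for (i in 1:10) {
  results[i] <- create_roster(35) |> check_roster()
}
results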
The final step is to calculate the proportion of TRUEs. There are many ways to accomplish this step. We could tabulate the number of TRUEs and FALSEs and then divide by the number of replicates. Or, because TRUEs and FALSEs can be treated like 1s and 0s, we can calculate the proportion by calculating the average. We’ll bump up the number of replicates to 10,000 in this calculation.
out <- replicate(10000, create_roster(35) |> check_roster())
mean(out)
[1] 0.8187
According to the simulation, 81.9% of replicates resulted in at least one birthday that was duplicated.
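For comparison, the tabulation approach mentioned above arrives at the same answer:
# Count the TRUEs and FALSEs, then divide by the number of replicates
table(out) / length(out)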
You can compare the simulation solution to the mathematical solution. In a later module, you will learn how to derive the mathematical solution; for now, we offer it free of charge. The right-hand side of the expression reexpresses the quantity in a format that matches the R code that we will write to calculate the quantity.
\[ 1-\frac{\frac{365!}{(365-n)!}}{365^n} = 1-\exp \left[ \log (365!) - \log\left[(365-n)!\right] - n\log(365) \right] \]
Here we create a function bp which will return the birthday problem probability for any class of size n.
bp <- function(n) 1 - exp( lfactorial(365) - lfactorial(365-n) - n*log(365) )
bp(35)
[1] 0.8143832
Note how close our simulation estimate is to the mathematical solution.
mean(out) - bp(35)
[1] 0.004316761
This difference is an important quantity called the simulation error, which will be discussed in greater detail in a later module.
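As a preview of that discussion, one way to anticipate the typical size of the simulation error is the standard error of a sample proportion, a formula we state here without derivation:
# Approximate standard error of a proportion estimated from 10,000 replicates
p <- bp(35)
sqrt(p * (1 - p) / 10000)
The result is roughly 0.004, which is comparable to the difference we observed above.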