Computational Probability
Course overview
This book is intended to complement instruction in Computational Probability, providing worked examples in R.
The big ideas
Probability is a tool for organizing beliefs in an uncertain world. It is a powerful framework for updating our beliefs as we observe new information. Consequently, it is the basis of both prediction tools and scientific inference. Computational algorithms and computer simulation are tools which assist in calculating probabilities, forming predictions, and measuring evidence for scientific hypotheses.
The goal of this course is to teach you the concepts of probability in two languages. First, we want to teach you probability using the language of mathematics. Second, we want to teach you probability using the language of simulation and computation. Our hope is that you will be able to express concepts both ways and solve problems both ways. For example, the mean can be expressed as an integral of the density function or as the sample mean of many draws from the underlying distribution.
\[ \underbrace{\int xf(x)\, dx}_{\text{mathematics}} \approx \underbrace{\texttt{runif(N) |> Finv() |> mean()}\phantom{\int}}_{\text{computation}} \]
As you become facile with both languages, you will begin to see cases in which probability questions can be answered easily with mathematical solutions and cases in which computational approaches are faster and easier.
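As a small preview, here is one way both languages might look in R for the mean of an exponential distribution with rate 2 (the distribution is our choice, purely for illustration):
# Mathematics: the mean as the integral of x times the density f(x)
integrate(function(x) x * dexp(x, rate = 2), lower = 0, upper = Inf)

# Computation: the mean as the average of many draws, generated by
# pushing uniform random numbers through the quantile (inverse CDF) function
runif(100000) |> qexp(rate = 2) |> mean()
Both calculations return approximately 1/2, the mean of this distribution.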
How the course is organized
The course is organized into two parts. In the first part, we develop the concepts of uncertainty, focusing on mathematical and computational expressions of probability. We derive several rules of probability which govern how we organize beliefs and how we manipulate expressions of probability. Ultimately, the goal of developing a language for uncertainty is so that we can describe the world around us and make better predictions and decisions.
Because "mathematical and computational expressions of probability" is a mouthful, we use the words probability model or model as shorthand. Part 2 is all about different probability models for univariate (and a few bivariate) outcomes. It is in this part that we develop concepts like density function, quantile function, discrete, continuous, and mixture. It is also the part that introduces families of distributions like the binomial, negative binomial, normal, and Weibull.
In part three, we tackle the problem of constructing models (or updating beliefs) from data. Because data collection is also a source of uncertainty, we investigate how to appropriately propagate that uncertainty to the inferences (that is, the answers to our questions) that we draw from the models. Communicating the uncertainty associated with our answers will be key.
Learning objectives
Probability is a framework for organizing beliefs; it is not a statement of what your beliefs should be.
Compare and contrast different definitions of probability, illustrating differences with simple examples
- Long-run proportion
- Personal beliefs
- Combination of beliefs and data
Express the rules of probability verbally, mathematically, and computationally
- AND, OR, complement, total probability
- simulation error (relative and absolute)
Illustrate the rules of probability with examples
Organize/express bivariate random variables in cross tables
Define joint, conditional, and marginal probabilities
Identify joint, conditional, and marginal probabilities in cross tables
Identify when a research question calls for a joint, conditional, or marginal probability
Describe the connection between conditional probabilities and prediction
Derive Bayes' rule from cross tables
Apply Bayes' rule to answer research questions
Determine if joint outcomes are independent
Calculate a measure of association between joint outcomes
Apply cross table framework to the special case of binary outcomes
- Sensitivity
- Specificity
- Positive predictive value
- Negative predictive value
- Prevalence
- Incidence
Define/describe confounding variables
- Simpson’s paradox
- DAGs
- Causal pathway
List approaches for avoiding confounding
- Stratification
- Randomization
Probability models are a powerful framework for describing and simplifying real world phenomena as a means of answering research questions.
List various data types
- Nominal
- Ordinal
- Interval
- Ratio
Match each data type with probability models that may describe it
- Bernoulli
- binomial
- negative binomial
- Poisson
- Gaussian
- gamma
- mixture
Discuss the degree to which models describe the underlying data
Tease apart model fit and model utility
Express probability models mathematically, computationally, and graphically
- PMF/PDF
- CMF/CDF
- quantile function
- histogram/eCDF
Employ probability models (computationally and analytically) to answer research questions
Explain and implement different approaches for fitting probability models from data
- Tuning
- Method of Moments
- Maximum likelihood
- Bayesian posterior
- kernel density estimation
Visualize the uncertainty inherent in fitting probability models from data
- sampling distribution
- posterior distribution
- bootstrap distribution
Explore how to communicate uncertainty when constructing models and answering research questions
- confidence intervals
- support intervals
- credible intervals
- bootstrap intervals
Propagate uncertainty in simulations
Explore the trade-offs of model complexity and generalizability
Probability is a framework for coherently updating beliefs based on new information and data.
Select prior distributions which reflect personal belief
- informative vs weakly informative priors
Implement Bayesian updating
Manipulate the posterior distribution to answer research questions
Probability models can be expressed and applied mathematically and computationally.
Use probability models to build simulations of complex real world processes to answer research questions
A word of encouragement and caution as you begin
Computer scripts and simulations are tools. Like any tool, data scientists use computer programs to accomplish specific tasks. As a future data scientist, it is important that you understand the ultimate goal/task; unfortunately, the simple act of turning on your computer will not endow you with understanding of your computational task. If you cannot describe the steps of your task before you open VS Code, it is unlikely that you will be able to describe the steps of your task after you open VS Code. When you find yourself embarking on a new task, we encourage you to set aside the computer and describe how you might solve a problem in the physical world. Once the steps of your task in the physical world are clear, the steps of your task will be clearer in the digital world.
Example: Birthday problem
A classic probability puzzle is known as the birthday problem.
In a class of 35 (or N) individuals, what is the probability that at least two individuals share a birthday?
If you were tasked with solving this puzzle using simulation in the physical world, how might you do it? There are lots of possible approaches, but here is one way:
1. Fill a big bucket with 365 ping pong balls, each labeled with a birthday.
2. Create a class birthday roster.
   A. Scramble the ping pong balls and then select one at random.
   B. Record the birthday, and return the ping pong ball to the bucket.
   C. Repeat steps A and B until you have a roster of 35 birthdays.
3. Check the completed roster for any duplicate birthdays. If there are, record a 1; otherwise record a 0.
4. Repeat steps 2 and 3 many times.
5. Calculate the probability of a shared birthday as the proportion of birthday rosters marked with a 1 (in step 3).
Now that we have a clear idea of how we might accomplish this task in the physical world, even though it would be tremendously time consuming, we can mimic the process in code.
The code that we write may mimic the process used in the physical world. Here we write the function create_roster, which matches the process described above. The input n to the create_roster command is the size of the roster.
create_roster <- function(n){
  sample(1:365, n, replace = TRUE)
}
In the example above, the size of the class is 35, so we can create a roster of birthdays of length 35 with the following command.
create_roster(35)
[1] 210 148 184 245 261 51 349 37 134 17 190 16 256 149 105 310 317 310 248
[20] 362 189 107 181 357 305 208 219 195 148 264 301 29 265 31 45
The output is a list of birthdays. Rather than a month-day format, like "April 3", birthdays are reported as day of the year.
Birthday | Number
---|---
Jan 1 | 1
Jan 2 | 2
\(\vdots\) | \(\vdots\)
Dec 30 | 364
Dec 31 | 365
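If you would rather see calendar dates, day-of-the-year numbers can be converted with base R's as.Date (a small aside of ours, using the non-leap year 2023 as a reference):
# Convert day-of-year numbers to calendar dates (non-leap reference year)
as.Date(c(1, 2, 364, 365) - 1, origin = "2023-01-01")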
The check_roster function will take the list of birthdays and determine if any are duplicated. There are many ways to accomplish this task with code; this is just one way. It uses the duplicated command, which takes a vector and returns a vector of the same size. The elements of the output vector are TRUE or FALSE, indicating whether the value at the corresponding position of the input vector has already appeared earlier in the vector. The any command returns TRUE if any elements of the input vector are TRUE.
check_roster <- function(roster){
  roster |> duplicated() |> any()
}
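To see these two commands in action, consider a small made-up vector with one repeated value:
x <- c(10, 20, 10, 30)
duplicated(x)       # FALSE FALSE TRUE FALSE: the second 10 is flagged
any(duplicated(x))  # TRUE: at least one element is a repeat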
Combining these two commands with the pipe operator will create a roster and immediately check if there are any duplicated birthdays.
create_roster(35) |> check_roster()
[1] TRUE
We can repeat these two steps many times with a for loop or a replicate command. The following command repeats the steps 10 times.
replicate(10, create_roster(35) |> check_roster())
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
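If you prefer loops, the same repetition could be written with a for loop (an equivalent sketch, not the approach we use below):
# The same 10 repetitions written as a for loop
results <- logical(10)
for (i in 1:10) {
  results[i] <- create_roster(35) |> check_roster()
}
results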
The final step is to calculate the proportion of TRUEs. There are many ways to accomplish this step. We could tabulate the number of TRUEs and FALSEs and then divide by the number of replicates. Or, because TRUEs and FALSEs can be treated like 1s and 0s, we can calculate the proportion by calculating the average. We’ll bump up the number of replicates to 10,000 in this calculation.
out <- replicate(10000, create_roster(35) |> check_roster())
mean(out)
[1] 0.8187
According to the simulation, 81.9% of replicates resulted in at least one birthday that was duplicated.
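For comparison, the tabulation approach mentioned above arrives at the same answer:
# Count the TRUEs and FALSEs, then divide by the number of replicates
table(out) / length(out)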
You can compare the simulation solution to the mathematical solution. In a later module, you will learn how to derive the mathematical solution; for now, we offer it free of charge. The right-hand side of the expression reexpresses the quantity in a format that matches the R code that we will write to calculate the quantity.
\[ 1-\frac{\frac{365!}{(365-n)!}}{365^n} = 1-\exp \left[ \log (365!) - \log\left[(365-n)!\right] - n\log(365) \right] \]
Here we create a function bp which will return the birthday problem probability for any class of size n.
bp <- function(n) 1 - exp( lfactorial(365) - lfactorial(365-n) - n*log(365) )
bp(35)
[1] 0.8143832
Note how close our simulation estimate is to the mathematical solution.
mean(out) - bp(35)
[1] 0.004316761
This difference is an important quantity called the simulation error, which will be discussed in greater detail in a later module.
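As a preview of that discussion, one way to anticipate the typical size of the simulation error is the standard error of a sample proportion, a formula we state here without derivation:
# Approximate standard error of a proportion estimated from 10,000 replicates
p <- bp(35)
sqrt(p * (1 - p) / 10000)
The result is roughly 0.004, which is comparable to the difference we observed above.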