**Monte Carlo Simulation** is repeated execution of a procedure — usually by computer — in order to understand the characteristics of the procedure.

What is the probability that at least one birthday is shared in a class of 30 individuals?

Simulation generates approximate answers because we make simplifications!

For example, we generally do not account for power outages and Super Bowl wins.

Hopefully the error due to simplification is small.

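    library(dplyr)  # provides %>%, summarize(), mutate(), n(), n_distinct()

    # Simulate one class: draw a birthday for each student.
    # Day 366 (leap day) is given one quarter the weight of the other days.
    generate_class <- function(class_size){
      birthdays <- sample(1:366, class_size, replace = TRUE, prob = c(rep(1, 365), .25))
      data.frame(student_id = 1:class_size, birthday = birthdays)
    }

    # TRUE if at least two students in the class share a birthday
    check_birthday <- function(class){
      class_summary <- class %>%
        summarize(n = n(), n_bday = n_distinct(birthday)) %>%
        mutate(shared = n > n_bday)
      class_summary[["shared"]]
    }

    # Repeat the procedure 10,000 times; the mean estimates the probability
    replicates <- replicate(10000, 30 %>% generate_class %>% check_birthday)
    mean(replicates)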
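As a rough check on the simulation above, the textbook version of the problem has a closed form. The sketch below assumes 365 equally likely birthdays and no leap day, which is a further simplification of the simulated model, so the two answers should be close but need not agree exactly. The names `class_size`, `p_no_match`, and `p_shared_exact` are illustrative.

```r
# Assumption: 365 equally likely birthdays, no leap day (a simplification
# of the simulation above, which gives day 366 a weight of 0.25).
class_size <- 30

# P(no shared birthday) = (365/365) * (364/365) * ... * (336/365)
p_no_match <- prod((365 - 0:(class_size - 1)) / 365)

# P(at least one shared birthday), about 0.706 for a class of 30
p_shared_exact <- 1 - p_no_match
p_shared_exact
```

If the simulated proportion lands far from this value, that points to a bug or to a simplification that matters more than expected.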

The code itself is not important on the first day.

What is the distribution of the largest **run** in a sequence of 100 flips of a fair coin?

Simulation generates approximate answers even when there are no simplifications!

The number of replicates is finite.
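A minimal sketch of the coin-flip question, assuming that a run means a stretch of consecutive identical outcomes (heads or tails). Nothing is simplified here, yet the answer is still approximate because only `n_reps` sequences are generated. The function name `longest_run`, the seed, and the threshold of 7 are illustrative choices, not part of the original notes.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Longest run of identical outcomes in one sequence of fair coin flips
longest_run <- function(n_flips = 100) {
  flips <- sample(c("H", "T"), n_flips, replace = TRUE)
  max(rle(flips)$lengths)  # rle() gives the length of each run
}

# Approximate the distribution with a finite number of replicates
n_reps <- 10000
runs <- replicate(n_reps, longest_run(100))

# Estimated distribution of the longest run
run_dist <- prop.table(table(runs))
round(run_dist, 3)

# Monte Carlo standard error of one estimated probability,
# here P(longest run >= 7)
p_hat <- mean(runs >= 7)
mc_se <- sqrt(p_hat * (1 - p_hat) / n_reps)
c(estimate = p_hat, mc_se = mc_se)
```

The standard error shrinks at the rate of one over the square root of `n_reps`, so quadrupling the number of replicates roughly halves the simulation error.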

Key properties of a procedure are often called **operating characteristics**. Generally, one wants to know the **distribution** of an operating characteristic over repeated executions of the study.

There are lots of everyday operating characteristics.

Stop thinking like a data scientist and think about items you use each day.

What are operating characteristics related to your car, cell phone, air conditioning?

Operating characteristics are the currency by which we evaluate and compare data science procedures.

A data scientist claims to have developed a tool to identify college freshmen who are highly likely to join the armed forces. What operating characteristics would you like to know about the tool?

A data scientist develops an algorithm for estimating the probability that a credit card transaction is fraudulent. What operating characteristics are important?

Operating characteristics are premised on the classic “long-run” interpretation of probabilistic events. As such, they can be simulated by simply repeating the planned procedure and observing how often some event happens.
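To make the long-run idea concrete, here is a small sketch that repeats one planned procedure (a one-sample t-test on 20 normal observations) and records how often it rejects the null hypothesis. The procedure, the sample size, the effect size of 0.5, and the name `one_study` are all assumptions chosen for illustration.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# One execution of the planned procedure: draw data, run the test,
# and record whether the null hypothesis (mean = 0) was rejected.
one_study <- function(n = 20, true_mean = 0, alpha = 0.05) {
  x <- rnorm(n, mean = true_mean, sd = 1)
  t.test(x, mu = 0)$p.value < alpha
}

# Long-run frequency of rejection when the null is true (type I error rate)
type1 <- mean(replicate(5000, one_study(true_mean = 0)))

# Long-run frequency of rejection when the true mean is 0.5 (power)
power <- mean(replicate(5000, one_study(true_mean = 0.5)))

c(type1_error = type1, power = power)
```

Both numbers are operating characteristics of the same procedure. Each is estimated as the proportion of repeated executions in which the event of interest occurred.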

DOES IT WORK?

So you want to perform a simulation study? Start with the tgs axioms of computing:

**The act of turning on the computer does not magically endow you with understanding of your task.** If you do not know how you will perform an analysis or simulation before you turn on your computer, you will not know how to do it afterwards either.

**Use modular/functional programming.** Functional programming means that you identify and write short, single-purpose functions for each distinct task in your program. (Examples below.) This allows you to develop your code in a systematic way, and it provides a natural method for debugging: you simply need to verify that each sub-function works as expected.

In a setting where one is trying to understand a **random process**, identify the **input parameters** and the **desired characteristic**.

In a typical data analysis setting, there are population parameters that one hopes to estimate by collecting and analyzing data. The population parameters are unknown, and the accuracy of the conclusions is unknown.

In a simulation setting, the researcher sets the population parameters and then generates data using those parameters. After completing the analysis, the researcher can evaluate the accuracy of the conclusions. By repeating this process many times, the researcher can estimate the operating characteristics for **the specific set of population parameters used to simulate the data**.

Simulations are essentially computational proofs on a case-by-case basis (e.g., this result happens for this parameter set, that result happens for that parameter set). It is up to you to make the connection between different settings, notice patterns, and make a case for general patterns of behavior under certain circumstances. Simulations lack the natural generalizability of a mathematical proof, so the results don't necessarily hold for all sets of parameter values.
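The case-by-case nature can be seen by running the same simulation over a small grid of parameter sets. The sketch below estimates the coverage of the usual 95% Wald interval for a proportion. The choice of procedure, the grid values, and the name `wald_coverage` are assumptions made for illustration.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

# Coverage of the 95% Wald interval for a proportion, for one parameter set
wald_coverage <- function(p, n, n_reps = 5000) {
  x <- rbinom(n_reps, size = n, prob = p)  # one simulated study per replicate
  p_hat <- x / n
  half_width <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
  mean(p_hat - half_width <= p & p <= p_hat + half_width)
}

# Each row is one parameter set and gets its own simulation
settings <- expand.grid(p = c(0.05, 0.25, 0.50), n = c(20, 100))
settings$coverage <- mapply(wald_coverage, settings$p, settings$n)
settings
```

Coverage close to 0.95 in one row says nothing certain about the other rows. With a small sample and a proportion near zero, the estimated coverage can fall well below the nominal level, which is the kind of pattern one has to notice and argue for across settings.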

The framework described above suggests how one may write modular code to perform the simulation. One can write a function to perform each of the primary tasks. For example: