What is probability?
What is probability?
What do we mean by probability in the following examples?
What do we mean by probability in the following examples?
What do we mean by probability in the following examples?
Key elements:
generate_class <- function(class_size){ ??? ??? } check_birthday <- function(class){ ??? ??? } replicates <- replicate(???, ???) mean(replicates)
The long-run proportion is only one of the features of the repeatable process.
The long-run proportion is only one of the features of the repeatable process.
Key features of a procedure are often called operating characteristics.
What is the distribution of the largest run in a sequence of 100 flips of a fair coin?
Key properties of a procedure are often called operating characteristics. Generally, one wants to know the distribution of an operating characteristic over repeated executions of the study.
Operating characteristics are the currency by which we evaluate and compare data science procedures.
A data scientist claims to have developed a tool to identify college freshman that are highly likely to join the armed forces. What operating characteristics would you like to know about the tool?
A data scientist develops an algorithm for estimating the probability that a credit card transaction is fraudulent or not. What operating characteristics are important?
Operating characteristics are premised on the classic “long-run” interpretation of probabilistic events. As such, they can be simulated by simply repeating the planned procedure and observing how often some event happens.
In this class, we are going to use both tools.
So you want to perform a simulation study? Start with the tgs axioms of computing:
The act of turning on the computer does not magically endow you with understanding of your task. If you do not know how you will perform an analysis or simulation before you turn on your computer, you will not know how to do it afterwards either.
Use modular/functional programming. Functional programming means that you identify and write short, single purpose functions for each distinct task in your program. (Examples below.) This will allow you to develop your code in a systematic way, and it will provide a natural method for debugging your code. You will simply need to verify that the different sub-functions are working as expected.
In a setting where one is trying to understand a random process, identify the input parameters and the desired characteristic.
In a typical data analysis setting, there are population parameters that one hopes to estimate by collecting and analyzing data. The population parameters are unknown, and the accuracy of the conclusions is unknown.
In a simulation setting, the researcher sets the population parameters then generates data using the parameters. After completing the analysis, the researcher can then evaluate the accuracy of the conclusions. By repeating this process several times, the researcher can estimate the operating characteristics for the specific set of population parameters used to simulate the data.
How might we change this statement so that it is applicable to prediction?
In a simulation setting, the researcher sets the population parameters then generates data using the parameters. After completing the analysis, the researcher can then evaluate the accuracy of the conclusions. By repeating this process several times, the researcher can estimate the operating characteristics for the specific set of population parameters used to simulate the data.
How might we change this statement so that it is applicable to prediction?
In a simulation setting, the researcher sets the population parameters then generates data using the parameters. After completing the analysis, the researcher can then evaluate the accuracy of the
conclusionspredictions. By repeating this process several times, the researcher can estimate the operating characteristics (usually call measures of model performance) for the specific set of population parameters used to simulate the data.
Simulations are essentially computational proofs on a case-by-case basis (e.g., this result happens for this parameter set, that result happens for that parameter set). It is up to you make the conneciton between different settings, notice patterns, and make a case for general patterns of behavior under certain circumstances. Simulations lack the natural generalizability of a mathematical proof, so the results don’t necessary hold for all sets of parameter values.
The framework described above suggests how one may write modular code to perform the simulation. One can write a function to perform each of the primary tasks. For example:
How might we use this as a template for the birthday question?
How might we use this as a template for the birthday question?