DS 2026 Midterm 1 Prep Questions

NOTE: The .rmd version of the file is available here: (link)

Instructions

Please prepare responses/solutions for the following questions. On the day of the exam, you will be given a new set of questions. During the exam, you may use the solutions you have prepared for these prep questions.

During the exam, you will also be permitted to access the internet for publicly available content. You will not be allowed to communicate with anyone via the internet or any other means during the exam. This includes, but is not limited to:

No messaging, emailing, or using social media to contact others.
No posting questions or seeking answers on forums, chat rooms, or any collaborative platforms.
No sharing or discussing exam content with peers through any online or electronic medium.

You may NOT discuss any aspect of the exam or prep questions with anyone other than the instructor or TA. You may NOT share code or documents.

Submission instructions

You will submit the midterm to Gradescope as a PDF reproducible report.
When you submit, you will indicate the location of each question in the document.
As this is a practice exam, you do not need to submit your responses.

Questions

Exam questions are organized into sections corresponding to the learning outcomes of the course.

Section 1. Tools of data scientists

Learning objective: Use the tools of data scientists, including

R

IDE (RStudio or VS Code)

Git

GitHub

Markdown

R Markdown

Rivanna

Learning objective: Implement best practices, including

Reproducible reports

Working directory

Relative file paths

1.1 What is a reproducible report?

1.2 What is the working directory?

1.3 What is the difference between Markdown and R Markdown?

1.4 What is the difference between Git and GitHub?

1.5 What is the difference between R and RStudio/VS Code?

1.6 The following is a schematic of a project folder, with subfolders and files.

project
│
├───code
│       script.rmd
│
├───data
│       survey-responses.csv
│
└───docs

Supposing the code subfolder is the designated working directory, write the command to include in script.rmd that reads the survey-responses.csv data without using an absolute file path.

1.7 What does it mean to render a markdown file?

1.8 The following error was generated in R. How might the error be resolved?

Error in loadNamespace(x) : there is no package called ‘data.table’

1.9 What is a package library?

1.10 What would happen if the command install.packages("data.table") was included in an .R or .Rmd script? Is there a better practice?

1.11 The following error was generated in R. How might the error be resolved?

Error in rcs() : could not find function "rcs"

1.12 List the possible output formats from an R Markdown file.

Section 2. Probability

Learning objective: Compare and contrast different definitions of probability, illustrating differences with simple examples

Learning objective: Express the rules of probability verbally, mathematically, and computationally, including

AND, OR, complement, total probability

simulation error (relative and absolute)
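
The rules named in this objective can be expressed computationally. The following is a minimal sketch, not a course solution: it assumes a hypothetical example (one roll of a fair six-sided die) and hypothetical events A and B, and the helper `p()` is introduced here only for illustration.

```r
# Hypothetical example: one roll of a fair six-sided die.
outcomes <- 1:6
A <- outcomes %% 2 == 0   # event A: the roll is even
B <- outcomes > 4         # event B: the roll is 5 or 6

# With equally likely outcomes, a probability is the share of outcomes in the event.
p <- function(event) mean(event)

p_and <- p(A & B)              # AND: P(A and B)
p_or  <- p(A) + p(B) - p_and   # OR: P(A) + P(B) - P(A and B)
p_not <- 1 - p(A)              # complement: P(not A)

# Total probability: P(A) = P(A | B) P(B) + P(A | not B) P(not B)
p_A_given_B    <- p(A & B) / p(B)
p_A_given_notB <- p(A & !B) / p(!B)
p_total <- p_A_given_B * p(B) + p_A_given_notB * p(!B)

# The rule-based results should agree with direct enumeration.
all.equal(p_or, p(A | B))
all.equal(p_total, p(A))
```

Enumerating a small sample space like this makes it easy to confirm that each rule gives the same number as counting outcomes directly.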

2.1 What is probability? Please illustrate with examples of different types of probability.

2.2 Suppose the following table represents operating system and computer type preferences of UVA students.

Product    MacOS    Windows    Total
Laptop                         .8
Desktop             .15
Total      .6

Please complete the table by calculating the missing cell and marginal probabilities.

2.3 What is the probability that a randomly selected UVA student has a preference for a MacOS laptop?

2.4 Among laptop users, what is the probability of randomly selecting a student with a preference for Windows?

2.5 If you did not know the operating system preference, what would be the probability that a randomly selected student prefers a desktop?

2.6 Among MacOS users, what is the probability of randomly selecting a student with a preference for a desktop?

2.7 One of the big ideas of this course is that probability is a framework for coherently updating beliefs based on new information and data. Please explain how the solution in 2.6 represents an update of beliefs from the solution in 2.5.
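
One way to set up the comparison symbolically is sketched below. The letters D (desktop preference), M (MacOS preference), and W (Windows preference) are shorthand introduced here for the events in the table from 2.2; they are not notation taken from the course.

```latex
\begin{align*}
P(D) &= P(D \cap M) + P(D \cap W)
  && \text{(belief before the operating system is known)} \\
P(D \mid M) &= \frac{P(D \cap M)}{P(M)}
  && \text{(belief after learning the student prefers MacOS)}
\end{align*}
```

The first line is the marginal probability asked for in 2.5, and the second is the conditional probability asked for in 2.6.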

Section 3. Simulation

Learning objective: Use probability models to build simulations of complex real world processes to answer research questions

3.1 What is simulation error?

3.2 In class and homework, you solved the birthday problem. The script we developed in class to solve the problem is below. Please change the code to answer a slightly different version of the birthday problem:

In a class of size 37, what is the probability that at least 3 individuals will share a birthday?

generate_class <- function(class_size){
  sample(1:365, class_size, replace=TRUE)
}

check_birthday <- function(class){
  class |> 
    duplicated() |> 
    any()
}

set.seed(230583)
R <- 10000
replicates <- replicate(R, generate_class(55) |> check_birthday())
mean(replicates)

## [1] 0.9875
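
The estimate printed above is a simulation result, so it carries simulation error. As a hedged sketch, the class script can be compared against the closed-form answer returned by `pbirthday()` from base R's stats package. This assumes 365 equally likely birthdays and takes absolute error as |estimate - exact| and relative error as that difference divided by the exact value; the course may define these terms slightly differently.

```r
# Functions copied from the class script above.
generate_class <- function(class_size){
  sample(1:365, class_size, replace=TRUE)
}

check_birthday <- function(class){
  class |>
    duplicated() |>
    any()
}

set.seed(230583)
R <- 10000
estimate <- mean(replicate(R, generate_class(55) |> check_birthday()))

# Closed-form probability that at least 2 of 55 people share a birthday.
exact <- pbirthday(55, classes = 365, coincident = 2)

# Assumed definitions of simulation error (see the note above).
absolute_error <- abs(estimate - exact)
relative_error <- absolute_error / exact

c(estimate = estimate, exact = exact,
  absolute = absolute_error, relative = relative_error)
```

Increasing `R` should tend to shrink both errors, at the cost of a longer run time.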

3.3 The code below comes from HW 4, which simulated the original birthday problem. Please explain the version 2 code from HW 4. Why does it give a solution to the original birthday problem?

# Version 2 from HW 4
first_duplicate <- function(){
  sample(1:365, 366, replace=TRUE) |>
    duplicated() |>
    which() |>
    min()
}

R <- 10000
fd1 <- replicate(R, first_duplicate())
plot(ecdf(fd1), main = "Probability of shared birthday", xlab = "Class size")

3.4 Generate a figure with class size on the x-axis, and the probability of a triply-shared birthday on the y-axis.