Introduction

This Lab Exercise will focus on modeling and simulation using probability as a key feature of the models. Monte Carlo simulation is a classic example of this kind of modeling. There will be a variety of problems to choose from, and for each, a small program must be written in either R or Python to simulate and analyze the system described by the problem.

Submission

The program and analysis are to be written in either an R-Markdown or Jupyter notebook, with both the source file (.Rmd/.ipynb) and a resulting output (html/pdf/doc) submitted to the drop-box by the submission deadline.

Due Date

This Lab Exercise is due on Sunday, April 29th at 11:59PM.

Grading

This Lab Exercise is worth 20 points, or about 10% of the total grade for this class for graduate students and about 12.5% for undergraduates.

There will be a number of problems of differing levels of difficulty. Level 1 problems are the easiest and are worth 6 points each. Level 2 problems require more effort and are worth 12 points each. Level 3 problems are the most challenging and are worth 18 points each. You should choose the problems you want to work on for a total of 18 points. The remaining 2 points evaluate the overall lab submission (e.g. readability, explaining your challenges and learning, etc.).

Problem Selection

You may choose to do 3 Level 1 problems:

Problems      Score
Level 1 #1    6
Level 1 #2    6
Level 1 #3    6
Overall       2
Total Score   20

Or you might choose to do 1 Level 2 and 1 Level 1:

Problems      Score
Level 2 #1    12
Level 1 #2    6
Overall       2
Total Score   20

Or you may choose to do 1 Level 3:

Problems      Score
Level 3 #2    18
Overall       2
Total Score   20

Also, you can opt to do an additional problem in case you don’t achieve full points on your chosen problems and want to bolster your chances of earning full points.
For example, if you chose the Level 3 #1 problem but found it too difficult to complete, instead of entirely scrapping the effort, you might submit and document what you were able to accomplish, then do one of the Level 1 problems to make up the missing points.

You can do as many additional problems as you choose, but the total overall points for the lab will remain capped at 20.


Level 1 Problems

There are 5 Level 1 problems. If you do only Level 1 problems, you must choose to complete 3 of these for full points.

Level 1 #1: Random Numbers and Probability Distributions

This problem is to help develop an understanding of randomness, sampling, and probability. Choose 3 of the probability distributions we’ve discussed in class. For each of them:

  • draw at least 2 sets of random samples from the distribution
    • try varying N (maybe one with a small N and one with a large N)
    • try varying the parameters (e.g. mean, standard deviation, rate, min, max, mode, shape, etc.)
    • compare the parameters of the sampled data with the source distribution

Then, for one of the sets of samples created above:
  • plot a histogram
  • superimpose a line plot of the same distribution fitted to that sampled data (e.g. take the mean and sd of the sampled data, and draw a normal curve based on those parameters)
  • show a line of what the ideal distribution should look like (e.g. use the mean and sd used to generate the data)
  • try varying the number of bins in the histogram
  • use colors or some other method to distinguish what’s on the plot
  • optional: can you get 2 plots side-by-side?

Then write a short paragraph about how changing things like N, the parameters, and bin-size altered the results, and if anything surprised you.

Here is a raw code example (in R) that is a good start for the normal distribution, but would not get full points (it is missing some requirements, and doesn’t “look nice”):

# draw some random numbers from the normal distribution
N <- 1000
x <- rnorm(N)

#plot a density histogram
hist(x, freq = FALSE)                       # freq=FALSE to get a density plot (not frequency)

# get the x & y values of a normal curve fitted to these random numbers
x_fit <- seq(min(x), max(x), length=N * 10)   # at 10 * number of points for smooth curve
y_fit <- dnorm(x_fit, mean=mean(x), sd=sd(x)) # density values from normal distribution for x_fit
lines(x_fit, y_fit)

A roughly equivalent set of Python code for the above R code is:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as pl

N = 1000
x = np.random.normal(size=N)

x_fit = np.linspace(min(x), max(x), N * 10)           # 10 * number of points for a smooth curve
y_fit = stats.norm.pdf(x_fit, np.mean(x), np.std(x))  # density values fitted to the sampled data

pl.hist(x, density=True)   # density=True to get a density plot (the old normed= argument was removed)
pl.plot(x_fit, y_fit)      # draw the fitted curve on top of the histogram
pl.show()

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 1 #2: Approximating the Binomial Distribution

The binomial distribution (https://en.wikipedia.org/wiki/Binomial_distribution) can answer the question, “In 10 flips of a fair coin, what’s the probability of getting 3 heads?”

Setup:

R and Python (with the SciPy library) already have functions for the binomial distribution (dbinom and scipy.stats.binom.pmf, respectively), and here we can see a table that shows the probabilities of getting so many heads in 10 flips of a fair coin.

Total_Flips <- 10
Heads <- 0:Total_Flips
Probability <- dbinom(x=Heads, size=Total_Flips, prob=0.5)
print(data.frame(Heads, Probability), row.names = FALSE)
##  Heads  Probability
##      0 0.0009765625
##      1 0.0097656250
##      2 0.0439453125
##      3 0.1171875000
##      4 0.2050781250
##      5 0.2460937500
##      6 0.2050781250
##      7 0.1171875000
##      8 0.0439453125
##      9 0.0097656250
##     10 0.0009765625

And analogous Python code:

import numpy as np
import scipy.stats as stats
import pandas as pd

Total_Flips = 10
Heads = np.arange(Total_Flips + 1)
Probability = stats.binom.pmf(Heads, Total_Flips, 0.5)

df = pd.DataFrame(data=Probability, columns=["Probability"])
df.index.name = "Heads"
print(df)

Task:

Pretending R or Python doesn’t know the binomial distribution, write a Monte Carlo simulation to estimate the values shown in the table above.

Now, re-run your simulation to show what happens if the coin is unfair and has a 65% chance of showing heads. Your result should be something like:

Total_Flips <- 10
Heads <- 0:Total_Flips
Probability <- dbinom(x=Heads, size=Total_Flips, prob=0.65)
print(data.frame(Heads, Probability))
##    Heads  Probability
## 1      0 2.758547e-05
## 2      1 5.123017e-04
## 3      2 4.281378e-03
## 4      3 2.120302e-02
## 5      4 6.890980e-02
## 6      5 1.535704e-01
## 7      6 2.376685e-01
## 8      7 2.522196e-01
## 9      8 1.756530e-01
## 10     9 7.249169e-02
## 11    10 1.346274e-02
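As a starting point, here is a bare-bones Python sketch of the kind of simulation loop intended (the function name, run count, and seed are just illustrative choices, and by itself this would not earn full points):

```python
import random

def estimate_binomial(total_flips, p_heads, runs):
    """Estimate P(k heads in total_flips flips) by simulating many runs."""
    counts = [0] * (total_flips + 1)
    for _ in range(runs):
        # one run: flip the coin total_flips times and count the heads
        heads = sum(1 for _ in range(total_flips) if random.random() < p_heads)
        counts[heads] += 1
    return [c / runs for c in counts]      # tallies -> estimated probabilities

random.seed(1)                             # fixed seed so the run is repeatable
est = estimate_binomial(10, 0.5, 100_000)
for k, p in enumerate(est):
    print(k, round(p, 4))
```

Compare the printed estimates against the dbinom / binom.pmf table above; increasing runs should tighten the match.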

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 1 #3: Approximating the Geometric Distribution

The geometric distribution (https://en.wikipedia.org/wiki/Geometric_distribution) can answer the question, “If I flip a fair coin until I get a Head, what’s the probability of getting 3 Tails?”

Setup:

R can easily do this with the dgeom function:

Num_Tails_until_Head <- 0:10
Probability <- dgeom(Num_Tails_until_Head, 0.5)
print(data.frame(Num_Tails_until_Head, Probability), row.names=FALSE)
##  Num_Tails_until_Head  Probability
##                     0 0.5000000000
##                     1 0.2500000000
##                     2 0.1250000000
##                     3 0.0625000000
##                     4 0.0312500000
##                     5 0.0156250000
##                     6 0.0078125000
##                     7 0.0039062500
##                     8 0.0019531250
##                     9 0.0009765625
##                    10 0.0004882812

Task:

Pretending R or Python doesn’t know the geometric distribution (but can flip a coin), write a Monte Carlo simulation to estimate the values shown in the table above.

And you might find this bit of Python code helpful. It will flip a fair coin until a head is flipped and report the number of tails.

import numpy as np

def coin_flip ():
    """ does a 50/50 coin flip, 0 = Heads, 1 = Tails"""
    return(np.random.randint(low=0,high=2))

tail_count = 0

flip = coin_flip()

while flip == 0:
    print(flip)        
    tail_count += 1 
    flip = coin_flip()

print('Number of tails until a head was rolled: ', tail_count)

If the code sample above represents a single “run” of a simulation, how many runs do you have to simulate in order to get an answer that is “close” to the values generated by the provided function? Did this take a long time?
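Building on that helper, one possible way to tally many runs into probability estimates looks like this (the run count, seed, and names are my own choices, not requirements):

```python
import random

def tails_before_first_head(p_head=0.5):
    """One run: flip until a head appears; return the number of tails seen."""
    tails = 0
    while random.random() >= p_head:   # this flip came up Tails
        tails += 1
    return tails

random.seed(2)
runs = 200_000
counts = {}
for _ in range(runs):
    t = tails_before_first_head()
    counts[t] = counts.get(t, 0) + 1   # tally how often each tail-count occurred

est = [counts.get(k, 0) / runs for k in range(11)]
for k, p in enumerate(est):
    print(k, round(p, 4))
```

The tail of the distribution (k = 9, 10, ...) is where many more runs are needed, since those outcomes are rare.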

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 1 #4: Estimating Pi via Monte Carlo Methods

Use a Monte Carlo simulation to estimate the value of Pi.

How many runs are necessary to get a “good” result?

Does it matter if you use “< 1” or “<= 1”? Try modifying your program to verify your answer.

Include the following plots:

  • a plot of the points used in estimating Pi, and superimpose a line-plot of the unit-circle boundary
  • a plot of the estimated value vs. the number of points (you should expect it to vary a lot when the number of points is low, but to “settle down” as the number of points grows)
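One common approach (not the only valid one) is to sample points uniformly in the unit square and count the fraction that lands inside the quarter circle. A minimal Python sketch, without the required plots:

```python
import random

def estimate_pi(n_points):
    """Fraction of random points inside the quarter circle approximates pi/4."""
    inside = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1:     # try "< 1" here too, per the question above
            inside += 1
    return 4 * inside / n_points

random.seed(3)                     # illustrative seed for repeatability
pi_hat = estimate_pi(200_000)
print(pi_hat)
```

Saving the (x, y) points and the running estimate as you go gives you the data for both required plots.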

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 1 #5: Estimate the Area Under a Curve

Choose a polynomial or trigonometric function (maybe one you can integrate “by-hand” and evaluate), use monte-carlo simulation to estimate the area under its curve between two points of your choosing. For example, you might choose \(f(x) = x^2\) between 1 and 2, so that your simulation will estimate \(\int_{1}^{2} x^2 dx\).

Include the following plots:

  • a plot of the points used in estimating your integration, and superimpose a line-plot of the function
  • a plot of the estimated value vs. the number of points (you should expect it to vary a lot when the number of points is low, but to “settle down” as the number of points grows)
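One possible method (among several Monte Carlo integration schemes) is “hit-or-miss”: sample points in a bounding box and count the fraction that falls under the curve. A minimal Python sketch using the \(x^2\) example, without the required plots:

```python
import random

def hit_or_miss_area(f, a, b, y_max, n_points):
    """Estimate the area under f on [a, b], given a box of height y_max."""
    hits = 0
    for _ in range(n_points):
        x = random.uniform(a, b)
        y = random.uniform(0, y_max)
        if y <= f(x):              # the point fell under the curve
            hits += 1
    return (b - a) * y_max * hits / n_points   # box area * hit fraction

random.seed(4)                     # illustrative seed for repeatability
area = hit_or_miss_area(lambda x: x * x, 1, 2, 4, 200_000)
print(area)                        # the exact answer is 7/3
```

Note the box height y_max must be at least the function's maximum on [a, b] (here \(2^2 = 4\)), or the estimate will be biased low.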

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.


Level 2 Problems

There are 9 Level 2 problems. If you do only one Level 2 problem, you must also do at least one Level 1 problem for full points. You can also choose to do two Level 2 problems, with the full 18 points spread across the two problems (with the best effort representing a majority of the score).

Level 2 #1: Birthday Problem

For some number of people, what’s the probability that at least two of them have the same birthday? This is popularly known as the birthday problem: https://en.wikipedia.org/wiki/Birthday_problem

For this problem, you can make some simplifying assumptions, such as the number of days in the year being fixed at 365 and that the probability of being born on any given day is uniform. With these assumptions, of course, the problem can be solved relatively easily using probability theory. But for this exercise, be sure to use a Monte Carlo simulation method! (though it’s fine to validate your work by comparing it to a probability-theory-based solution).

Write a monte-carlo simulation that given some number of people, will estimate the probability that at least 2 of those people will have the same birthday.

Use this simulation to build a table of values for the number of people from 2 to around 30 people, and build a graph of this, similar to the one shown in the Wikipedia article.
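A bare-bones Python sketch of one such simulation (the trial count, seed, and names are illustrative; the full table and graph are left to you):

```python
import random

def prob_shared_birthday(n_people, trials):
    """Estimate P(at least two of n_people share a birthday) by simulation."""
    shared = 0
    for _ in range(trials):
        birthdays = [random.randrange(365) for _ in range(n_people)]
        if len(set(birthdays)) < n_people:   # a duplicate collapsed the set
            shared += 1
    return shared / trials

random.seed(5)
results = {n: prob_shared_birthday(n, 20_000) for n in (10, 23, 30)}
for n, p in results.items():
    print(n, round(p, 3))
```

Running this for every n from 2 to 30 gives you the data for the Wikipedia-style plot.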

A non-Monte-Carlo method

Here’s an example of getting the probability values that does NOT use Monte Carlo simulation. The value for each n is calculated as: \[p_n = 1 - \left(\frac{364}{365}\right)^{\frac{n(n-1)}{2}} \] This R code calculates the values using this formula from 2 to 10 in two different ways. The for-loop uses rbind to append the results of each calculation to a dataframe. The sapply method builds two sets of columns (N and P) and uses cbind to bind those two columns into a dataframe.

# Simple Exponentation Method (e.g. based on probability theory)

df <- data.frame(N = integer(), P = double())   # create empty dataframe with named columns

for (n in 2:10) {
  p <- 1 - (364 / 365) ^ (n * (n - 1) / 2)
  df <- rbind(df, data.frame(N = n, P = p))  # accumulate the results as we go along
}
print(df, row.names = FALSE)
  N           P
  2 0.002739726
  3 0.008196680
  4 0.016326175
  5 0.027061942
  6 0.040317031
  7 0.055984983
  8 0.073941255
  9 0.094044865
 10 0.116140237
# I'm packing a lot into this, using sapply (instead of a for-loop)
N <- 2:10
df <- data.frame(cbind(N, P=sapply(2:10, function(n) round(100 * (1 - (364 / 365) ^ (n * (n - 1) / 2)), 2))))
print(df, row.names = FALSE)
  N     P
  2  0.27
  3  0.82
  4  1.63
  5  2.71
  6  4.03
  7  5.60
  8  7.39
  9  9.40
 10 11.61

Questions:

  1. How many iterations should your simulation run so that you feel confident in your answer?
  2. How many people need to be in the room before you have more than a 50% chance of two people having the same birthday?
  3. How many people do you need to be almost certain to have at least 2 people with the same birthday?
  4. Did you discover anything else that was interesting?

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #2: The Monty Hall Problem

I first learned about the Monty Hall problem while reading Mlodinow’s book, “The Drunkard’s Walk” (I highly recommend this book to help build your intuitions about probability). He described how the problem, drawn from the game show, appeared in a famous newspaper column, and that many people, even people well-schooled in math and statistics, get it wrong. http://marilynvossavant.com/game-show-problem/

Setup

The setup of the game is this: a contestant in a game show picks from 3 doors. One of them has a very desirable prize and the other 2 have worthless prizes. At this point, the contestant has a 1/3 chance of winning the desirable prize. After the contestant picks one, Monty then opens one of the doors that does NOT have the desirable prize - and offers the contestant the opportunity to switch their chosen door.

The question: should the contestant switch doors?

The KEY to the answer is that Monty is using his knowledge of what’s behind the doors, which affects the outcomes. The other key is that the probabilities of winning are only meaningful when looking at many iterations of the game. The probabilities say to switch doors, but that doesn’t mean it will be the right move in every game. That’s why I’ve made this one of the lab exercise problems (it’s relatively easy to see through simulation)!

Task:

Build a simulation that can simulate a round of the game. Then run a bunch of simulation runs where sometimes the contestant switches and sometimes they don’t - or run a bunch of simulations and measure if they should have switched or not. What do you find? Should a contestant switch doors?
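One possible shape for that simulation, sketched in Python (the deterministic way Monty picks between two goat doors below is a simplification that does not bias the win rate; the trial count and seed are arbitrary):

```python
import random

def win_rate(switch, trials):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        pick = random.randrange(3)
        # Monty opens a door that is neither the pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        if switch:
            # switch to the one remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        if pick == prize:
            wins += 1
    return wins / trials

random.seed(6)
stay = win_rate(False, 50_000)
swap = win_rate(True, 50_000)
print("stay:", stay, "switch:", swap)
```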

If the results you get don’t feel intuitive, don’t feel bad; even Paul Erdos (a brilliant and famous mathematician) struggled with it until he saw it play out in simulation: https://www.wired.com/2014/11/monty-hall-erdos-limited-minds/

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #3: German Tank Problem

World War II analysts developed a method to estimate the number of tanks Germany was producing based on the serial numbers of captured tanks. Write a monte-carlo simulation to explore how good that solution was by varying the number of tanks produced as well as the presumably randomly found (sampled) serial numbers.

Setup:

Here’s an example from Wikipedia:

> Suppose \(k = 4\) tanks with serial numbers 19, 40, 42 and 60 are captured. The maximal observed serial number, \(m = 60\). The unknown total number of tanks is called \(N\).

According to Wikipedia, a frequentist approach says the number of tanks can be estimated with:

\[N \approx m + \frac{m}{k} - 1 = 74 \]

and a Bayesian approach giving:

\[ \Pr(N=n) = \begin{cases} 0 &\text{if } n < m \\ \frac {k - 1}{k}\frac{\binom{m - 1}{k - 1}}{\binom n k} &\text{if } n \ge m, \end{cases} \]

Which can be used to estimate \(N\):

\[N \approx \mu \pm \sigma = 88.5 \pm 50.22\]

where

\(\mu = (m - 1)\frac{k - 1}{k - 2}\)

and \(\sigma = \sqrt{\frac{(k-1)(m-1)(m-k+1)}{(k-3)(k-2)^2}}\)

Task:

For this task, build a simulation system that will generate a sequence of serial numbers of German tanks (it’s probably easiest to just make them 1, 2, …, N). Then sample (without replacement) some small number of captured tanks. Then try out the frequentist and Bayesian estimates to see how close the estimates are to the actual number of tanks. Repeat your simulation many times, maybe with different populations of tanks and different sets of captured tanks. What conclusions do you draw about the task of estimating the number of tanks?
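A minimal Python sketch of the frequentist half of this task (the tank count, sample size, and repetition count are arbitrary choices, and the Bayesian estimate is left to you):

```python
import random

def frequentist_estimate(serials):
    """N ~ m + m/k - 1, where m is the max serial and k the sample size."""
    m, k = max(serials), len(serials)
    return m + m / k - 1

# Wikipedia's example: 4 captured tanks, max serial 60
print(frequentist_estimate([19, 40, 42, 60]))   # 74.0

# repeat with a known "true" number of tanks to see how close it gets
random.seed(7)
N, k = 300, 4
estimates = [frequentist_estimate(random.sample(range(1, N + 1), k))
             for _ in range(2_000)]
mean_est = sum(estimates) / len(estimates)
print(mean_est)
```

Looking at the spread of the individual estimates (not just the mean) is what tells you how reliable a single captured sample really is.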

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #4: Getting Your Feet Wet With Discrete Systems Simulation (The Pool Problem)

This problem is a light look into working on a Discrete Systems Simulation (DSS) problem. It involves people arriving at a pool according to a certain probability distribution, with the duration of their stay also set by a probability distribution. You’ll then analyze things like the average number of visitors in a day, the maximum occupancy of the pool in a day, etc.

Setup:

The swimming pool is open from 9 AM to 5 PM. The time from one person’s arrival to the next can be described by the exponential distribution with a mean interarrival time of 10 minutes (i.e. on average, 1 arrival every 10 minutes). Once a person arrives at the pool, the duration of their stay can also be described by the exponential distribution, with a mean of 30 minutes.

Task:

Build a simulation for a day at the pool. Running it just once, what is the maximum number of people in the pool at any time in the day? Draw a plot of the number of people in the pool throughout the day.

Now run the simulation for a large number of days. Looking at the maximum number of people in the pool, what is its min, mean, and max over all days?

Note 1: one approach would be to set up a data-frame where each row represents a person/party’s visit to the pool, and store in it (by columns) when they arrive, how long they stay, and when they leave. Then you could query the dataframe after each simulation run to get the desired statistics.

Note 2: it might be worthwhile to look at the Level 2 SSSQ problem for inspiration.
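For orientation, here is a bare-bones Python sketch of one day at the pool using plain lists rather than a dataframe (an equally valid choice); the day count and seed are illustrative:

```python
import random

def simulate_pool_day(open_minutes=480, mean_gap=10, mean_stay=30):
    """One 9AM-5PM day: exponential interarrival gaps and stay durations (minutes)."""
    arrivals, departures = [], []
    t = random.expovariate(1 / mean_gap)
    while t < open_minutes:
        arrivals.append(t)
        departures.append(t + random.expovariate(1 / mean_stay))
        t += random.expovariate(1 / mean_gap)
    # sweep through arrival (+1) and departure (-1) events in time order
    events = sorted([(a, +1) for a in arrivals] + [(d, -1) for d in departures])
    occupancy = max_occupancy = 0
    for _, change in events:
        occupancy += change
        max_occupancy = max(max_occupancy, occupancy)
    return len(arrivals), max_occupancy

random.seed(8)
daily = [simulate_pool_day() for _ in range(500)]
max_occ = [m for _, m in daily]
print(min(max_occ), sum(max_occ) / len(max_occ), max(max_occ))
```

The event sweep is the piece that generalizes: the same arrival/departure lists can drive the occupancy-over-time plot asked for above.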

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #5: Single-Server Single-Queue (SSSQ) Problem

This is also a light look at a Discrete Systems Simulation kind of problem (unlike the Pool Problem above, this problem features a queue).

Imagine you go to Salt & Straw and there is only one clerk, and there is also one single line for the customers to queue up in. This is a classic Single-Server, Single-Queue situation.

For this problem, we’ll be using a simple example from M. H. MacDougall’s book Simulating Computer Systems, Techniques and Tools (1987). For example, below is his figure showing a SSSQ.

Setup:

The following images is made of extracts from the text that explain the problem.

MacDougall, Explaining the SSSQ (page 13-20)

See [MacDougall, Explaining the SSSQ (page 13-20)] for a detailed explanation (the figure may float to somewhere within this problem description)

Task:

Implement an SSSQ as described by MacDougall. To help in this task, the following image shows source code in the computer language, C, which you can use as a baseline for developing your R or Python code.

MacDougall, C Code for the System Model (page 20)

Be sure to:

  • use more descriptive variable names to make your code easier to read and understand
  • implement a graphic similar to MacDougall’s for showing the size of the queue over time (something similar to the below is sufficient)
  • try experimenting with another set of values for Ta and Ts and show how that changes the system behavior

Note: be careful when translating the C expentl function (which generates a random variate from the exponential distribution) to R or Python. Try sampling from your chosen language’s version of the exponential sampler with the given parameters to make sure you get results that make sense. The rate may be expressed in something like \(minutes\) or \(1/minutes\).
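For example, in Python, a quick sanity check of the parameterization might look like this (the mean value here is a hypothetical stand-in, not MacDougall's):

```python
import random

# Python's random.expovariate takes a RATE (1/mean), while e.g.
# numpy.random.exponential takes a MEAN (the scale), an easy mix-up.
random.seed(9)
mean_interarrival = 200.0     # hypothetical mean interarrival time
samples = [random.expovariate(1 / mean_interarrival) for _ in range(50_000)]
avg = sum(samples) / len(samples)
print(avg)                    # should land near 200, not near 1/200
```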

Plot generated by this simulation in R (by me):


Note: this problem would have a Level 3 difficulty, but since working source code in a language similar to R and Python has been provided (and really, C doesn’t look that different in structure!), it’s a Level 2.

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #6: Estimate Poisson Distribution using Exponential Distribution

The Poisson distribution can answer the question, “Given a period of time and an average rate of independent arrivals (lambda), how many arrivals will there be?” As an example, if a hair salon has an average of 2.5 customers arriving every hour, how many will arrive in the next hour?

The exponential distribution is closely related to the Poisson distribution, since given the same rate (lambda), it can answer the question, “when will the next event occur?” For the hair salon, that equates to, “if the average number of arrivals is 2.5 per hour, when will the next customer walk through the door?”

The following R code uses the dpois function to get the probability of k events in a given time period where the rate is 2.5 per time period:

rate <- 2.5   # lambda

x=0:8

px = dpois(x, lambda = rate)

plot(x, px, type="h", xlab="Number of events k", ylab="Probability of k events", ylim=c(0,0.5), pty="s", main="Poisson distribution \n Probability of events for lambda = 2.5")

Pretending that R or Python doesn’t have a function to give you Poisson probabilities (as in the Level 1 problems), use Monte Carlo simulation to build the chart above (as well as show the data table). For each run of the simulation, you’ll probably want to repeatedly sample from the exponential distribution to get arrival times, adding the results until the sum exceeds the time period in question.
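A bare-bones Python sketch of that accumulate-exponentials loop (run count, seed, and names are illustrative; the chart and comparison tables are left to you):

```python
import random

def poisson_probs(rate, period, runs, k_max):
    """Estimate P(k arrivals in one period) by summing exponential gaps."""
    counts = [0] * (k_max + 1)
    for _ in range(runs):
        t, k = 0.0, 0
        while True:
            t += random.expovariate(rate)   # time until the next arrival
            if t > period:
                break
            k += 1                          # this arrival fell inside the period
        if k <= k_max:
            counts[k] += 1
    return [c / runs for c in counts]

random.seed(10)
est = poisson_probs(rate=2.5, period=1.0, runs=100_000, k_max=8)
for k, p in enumerate(est):
    print(k, round(p, 4))
```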

Develop tables for different values of Lambda and compare them to the actual function that’s available from R or Python.

How many runs/iterations do you need to get numbers that are close (for your definition of close) to the actual values?

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #7: Exploring the Central Limit Theorem

One of the most important theorems in statistics is the Central Limit Theorem. It generally says that regardless of the population’s true distribution, if you make repeated independent samplings from that population, and then take the means of those samplings, as the number of samplings goes up, the distribution of those means will tend towards a normal distribution.

Or to quote Wikipedia (https://en.wikipedia.org/wiki/Central_limit_theorem):

In probability theory, the central limit theorem (CLT) establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

Build a simulation to test the ideas behind the CLT. The process might look like:

  1. pick any distribution that’s not already normal (e.g. exponential, binomial, something bimodal, etc)
  2. create a large population (10,000s? more?) - store in an array or dataframe
  3. take a small (uniform) random sample from that population and compute and store the mean
  4. repeat 3 a “bunch” of times.
  5. analyze your collection of means from those samples
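The steps above can be sketched in Python like this (the population size, sample size, and repetition counts are arbitrary choices, and the analysis step here is just two summary statistics):

```python
import random
import statistics

random.seed(11)

# 1-2: a decidedly non-normal (right-skewed) population
population = [random.expovariate(1.0) for _ in range(50_000)]

# 3-4: repeatedly take small uniform random samples and record their means
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(5_000)]

# 5: the means cluster tightly and symmetrically around the population mean
print(statistics.mean(population), statistics.mean(sample_means),
      statistics.stdev(sample_means))
```

A histogram of sample_means next to a histogram of population makes the contrast vivid; the spread of the means should shrink roughly like \(\sigma/\sqrt{n}\) as the sample size n grows.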

Does it matter how large the individual samples are? Does it matter how many sample means you collect? If you have the statistical knowledge, do you have a way to show how close to “normal” your collection of sample means appear to be?

It might be worth checking out this page: http://www.ltcconline.net/greenl/java/Statistics/clt/cltsimulation.html

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #8: “Chutes and Ladders” Simulation

Chutes and Ladders is a children’s game popularized by Milton Bradley (https://en.wikipedia.org/wiki/Snakes_and_Ladders). It’s a great candidate for simulation because while the game has randomness (it’s driven by the roll of a die or a spinner with values 1 to 6), there aren’t actually any decisions to be made - it’s purely a game of chance.

The goal of the game is to reach the 100th cell (exactly) by rolling the dice to determine how far you advance. If you land at the base of a ladder, you proceed to the cell at the top of that ladder. If you land at the top of a slide (chute), you proceed back down to the base of the slide. You only win by landing exactly on cell 100.

Build a simulation for Chutes and Ladders, playing many, many games, maybe with different numbers of players (typically 2 to 4) and collect some information about what happens. What is the longest game? Shortest game? What’s the fastest route through the board (fewest turns)?

Wikipedia says this:

Any version of Snakes and Ladders can be represented exactly as an absorbing Markov chain, since from any square the odds of moving to any other square are fixed and independent of any previous game history.[4] The Milton Bradley version of Chutes and Ladders has 100 squares, with 19 chutes and ladders. A player will need an average of 39.6 spins to move from the starting point, which is off the board, to square 100. A two-player game is expected to end in 47.76 moves with a 50.9% chance of winning for the first player.

Do you get similar results?

This R Code provides data for the locations of the chutes and ladders:

ladders <- cbind(c(1, 4, 9, 21, 28, 36, 51, 71, 80), 
                c(38, 14, 31, 42, 84, 44, 67, 91, 100))

chutes  <- cbind(c(98, 95, 93, 87, 62, 64, 56, 49, 48, 16),
                 c(78, 75, 73, 24, 19, 60, 53, 11, 26, 6))

# this shows the ladders, with the first going from cell 1 to cell 38
print(ladders)
##       [,1] [,2]
##  [1,]    1   38
##  [2,]    4   14
##  [3,]    9   31
##  [4,]   21   42
##  [5,]   28   84
##  [6,]   36   44
##  [7,]   51   67
##  [8,]   71   91
##  [9,]   80  100

Equivalent code in Python:

import pandas as pd


ladders = pd.DataFrame({'From':[1, 4, 9, 21, 28, 36, 51, 71, 80], 
                       'To':[38, 14, 31, 42, 84, 44, 67, 91, 100]})

chutes = pd.DataFrame({'From':[98, 95, 93, 87, 62, 64, 56, 49, 48, 16],
                        'To':[78, 75, 73, 24, 19, 60, 53, 11, 26, 6]})

# this shows the ladders with first ladder going from 1 to 38
print(ladders)
   
   From   To
0     1   38
1     4   14
2     9   31
3    21   42
4    28   84
5    36   44
6    51   67
7    71   91
8    80  100
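A bare-bones Python sketch of a single-player game using the chute/ladder data above (the exact-landing rule is implemented as “a roll that would pass 100 is a wasted turn”; game count and seed are illustrative):

```python
import random

# ladder bottoms -> tops and chute tops -> bottoms, matching the data above
JUMPS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100,
         98: 78, 95: 75, 93: 73, 87: 24, 62: 19, 64: 60, 56: 53, 49: 11, 48: 26, 16: 6}

def turns_to_finish():
    """Play one solo game and return the number of turns taken."""
    pos, turns = 0, 0
    while pos != 100:
        turns += 1
        roll = random.randint(1, 6)
        if pos + roll <= 100:                      # overshooting 100 wastes the turn
            pos = JUMPS.get(pos + roll, pos + roll)
    return turns

random.seed(12)
games = [turns_to_finish() for _ in range(20_000)]
print(min(games), sum(games) / len(games), max(games))
```

The average printed here can be compared against the 39.6-spin figure quoted above; extending this to 2-4 alternating players gives you the game-length and first-player-advantage statistics.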

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 2 #9: Insurance Loss Estimation

An insurance company has 1000 drivers. They have determined that each driver has an independent risk of getting into a crash of 0.01 (per year). They have also determined that the average cost of each crash is $50,000, with a standard deviation of $10,000. Using Monte Carlo simulation, estimate the costs the insurance company can expect in a single year. What is the worst case? Best case? The average value? What was the range of crashes you saw in your simulations?

How does the estimate change if the average cost of each crash is described by an exponential distribution with mean = $50,000 rather than a normal distribution?

It might be tempting to simply say: \(cost = 0.01 * 1000 * 50000 = 500000\). Do your simulation results match this? Do the simulated results provide you anything more/different? How many simulation runs were needed to get stable results?
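A minimal Python sketch of the normal-cost version (truncating negative sampled costs at zero is my choice here; you should note and justify your own; run count and seed are illustrative):

```python
import random

def yearly_cost(n_drivers=1000, p_crash=0.01, mean_cost=50_000, sd_cost=10_000):
    # how many of the drivers crashed this year
    crashes = sum(1 for _ in range(n_drivers) if random.random() < p_crash)
    # truncate at zero so a crash never has a negative cost
    return sum(max(0.0, random.gauss(mean_cost, sd_cost)) for _ in range(crashes))

random.seed(13)
years = [yearly_cost() for _ in range(2_000)]
print(min(years), sum(years) / len(years), max(years))
```

For the exponential variant, random.expovariate(1 / 50_000) would be the analogous cost sampler; comparing the two tails is where the interesting differences show up.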

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.


Level 3 Problems

There are 4 Level 3 problems. If you choose to do one of these, this is the only problem you have to solve for this exercise. However, if you cannot complete the problem, don’t throw away the effort. Document what you were able to accomplish and try to explain what the challenge was. Then choose one of the easier problems from either Level 1 or Level 2 and finish your lab exercise with that to help ensure full points. Contact me if you have any questions about whether you’ve completed enough of one of these challenging problems or not.

Level 3 #1: A More Challenging Pool Problem

Do everything in the Level 2 Pool problem, and then build on that work to address the following challenges:

Challenge 1: change the duration of a person’s stay at the pool from the exponential distribution to a normal distribution with a mean of 30 minutes and standard deviation of 10 minutes. Consider how to handle negative results and explain your decision.

Challenge 2: each “arrival” can be a party of 1 to 4 people, with p(1,2,3,4) = (.4, .3, .2, .1). How does that change your results?

Challenge 3: from your large number of simulations, take 80% of the maximum occupancy. Now when a person/party arrives, if the current occupancy is above that number, they have a 50/50 chance of turning away and not going into the pool. How does that change your results?

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 3 #2: Lemonade Stand Supply Chain Problem

Looking back, my first exposure to simulation was a game on my Commodore 64 called “Lemonade Stand” (https://en.wikipedia.org/wiki/Lemonade_Stand). In the game you start with some “starter money” and buy supplies to start running a lemonade stand in your neighborhood during the summer break from school. Each week you would look at the weather forecast and decide (given the funds available) how much raw material to purchase for the week. The goal was to end the summer with no remaining supplies and as many sales as possible.

In this problem, you’ll be considering a simplified version of the game where you only have to stock one type of material (cans of lemonade). There also won’t be a weather forecast - the customer demand will be drawn from the same distribution each week.

Given the starter-code provided here, use monte-carlo simulation to generate many runs of simulated demand for lemonade to evaluate your material ordering strategy in your neighborhood Lemonade Stand.

Once you have that working, see if you can modify the function that handles ordering to create a different ordering policy. Re-run your simulation to see if your ordering policy does a better job than the “dumb” one I provided.
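Since the starter code isn't reproduced in this handout, the sketch below is only a stand-in showing the overall shape in Python; the starting cash, prices, and uniform weekly demand are hypothetical placeholders, not the provided code:

```python
import random

def simulate_summer(order_policy, weeks=12, price=1.00, can_cost=0.40):
    """Stand-in for the starter code: one summer of weekly ordering and sales."""
    cash, stock = 20.0, 0          # hypothetical starter money and empty inventory
    for _ in range(weeks):
        order = order_policy(cash, stock, can_cost)
        cash -= order * can_cost
        stock += order
        demand = random.randint(20, 60)   # hypothetical weekly demand distribution
        sold = min(stock, demand)         # can't sell more than you stocked
        cash += sold * price
        stock -= sold
    return cash, stock

def dumb_policy(cash, stock, can_cost):
    # "dumb" baseline: spend all available cash on cans every week
    return int(cash // can_cost)

random.seed(15)
final_cash, leftover = simulate_summer(dumb_policy)
print(final_cash, leftover)
```

Running many summers per policy and comparing the distributions of final cash (and leftover stock, which the game penalizes) is the evaluation the problem asks for.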

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 3 #3: Dragon Dietary

This problem is an example of using random-search (or monte-carlo search) to try and solve an optimization problem. It was inspired by a question submitted to Reddit.

Setup:

You are a “dragon biologist” and in your research you are trying to determine what the dragons in your area are eating, and in what proportions. You know they like to eat peasants, trolls, and unicorns. Each of these food sources has a distinct composition of three fatty acids, and by looking at the fatty-acid composition of your dragon tissue samples, you can determine in what ratio the dragon has been eating those food sources.

You know that the food sources have the following composition of the fatty acids FA1, FA2, and FA3:

fa_data <- data.frame(FoodSource=c("Unicorn", "Peasant", "Troll"), 
                      FA1=c(0.1, 0.45, .2), 
                      FA2=c(0.65, 0.25, 0.45), 
                      FA3=c(0.25, 0.3, 0.3))

pander(fa_data)
| FoodSource | FA1  | FA2  | FA3  |
|------------|------|------|------|
| Unicorn    | 0.1  | 0.65 | 0.25 |
| Peasant    | 0.45 | 0.25 | 0.3  |
| Troll      | 0.2  | 0.45 | 0.3  |

And you have captured a dragon and its tissue sample shows the following composition of FA1, FA2, and FA3:

dragon_sample <- data.frame(FA1=0.25, FA2=0.50, FA3=0.25)

pander(dragon_sample)
| FA1  | FA2 | FA3  |
|------|-----|------|
| 0.25 | 0.5 | 0.25 |

Task:

Given the fatty-acid composition of each of the food sources and of the captured dragon, develop a Monte Carlo method that generates potential peasants:trolls:unicorns diet ratios and evaluates how well each matches the dragon’s sample. It should keep trying different ratios until it can no longer improve its result. How many runs does this take?
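One possible sketch of such a random search, in Python (the lab accepts either R or Python). For brevity this version uses a fixed number of draws rather than the no-improvement stopping rule the task asks for, and all names are illustrative:

```python
import random

# Fatty-acid composition of each food source (from fa_data above)
FA = {"Unicorn": (0.10, 0.65, 0.25),
      "Peasant": (0.45, 0.25, 0.30),
      "Troll":   (0.20, 0.45, 0.30)}
DRAGON = (0.25, 0.50, 0.25)  # the dragon_sample tissue composition

def random_ratio():
    """Draw a random diet ratio (unicorn, peasant, troll) summing to 1."""
    w = [random.random() for _ in range(3)]
    s = sum(w)
    return [x / s for x in w]

def error(ratio):
    """Sum of squared differences between the diet's predicted
    fatty-acid profile and the dragon's observed profile."""
    sources = list(FA.values())
    pred = [sum(r * src[i] for r, src in zip(ratio, sources))
            for i in range(3)]
    return sum((p - d) ** 2 for p, d in zip(pred, DRAGON))

def random_search(n=20000, seed=1):
    """Propose n random ratios, keeping the best one seen so far."""
    random.seed(seed)
    best, best_err = None, float("inf")
    for _ in range(n):
        r = random_ratio()
        e = error(r)
        if e < best_err:
            best, best_err = r, e
    return best, best_err
```

Note that a pure-unicorn diet, for instance, predicts exactly the Unicorn row (0.1, 0.65, 0.25), giving an error of 0.15² + 0.15² + 0² = 0.045; the search should find ratios that do considerably better than that.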

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 3 #4: Rummikub “First Set” Problem

Consider the game Rummikub (https://en.wikipedia.org/wiki/Rummikub). This problem was inspired by a recent game of Rummikub where I had to draw more than 11 tiles before I was able to play a set.

A really bad Rummikub hand!

Setup:

A player initially draws 14 tiles from a pool of 106 tiles (2 × 13 × 4 colors = 104 number tiles, plus 2 wild-card tiles).

Before they can make any other plays (e.g. adding one of their tiles to an existing sequence), a player must “play a set” totaling at least 30 points using tiles only from their hand. For example, one might play a group of three 10-tiles (in different colors), or a run of blue 5-6-7 tiles along with a group of four 3-tiles (in different colors).
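As a quick arithmetic check of the 30-point rule, in Python (which the lab also permits); the helper is illustrative:

```python
def meld_points(meld):
    """Point value of a meld, given as a list of tile numbers.
    Colors determine whether a meld is valid, but not its score."""
    return sum(meld)

# the two examples from the text:
three_tens = [10, 10, 10]    # a group of three 10-tiles -> 30 points
blue_run = [5, 6, 7]         # a run of blue 5-6-7       -> 18 points
four_threes = [3, 3, 3, 3]   # a group of four 3-tiles   -> 12 points
```

Both examples total exactly 30, the minimum needed for the opening play.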

This R code might be a good way to represent the pool of tiles and a player’s initial hand.

# borrowing the last buildDeck from: 
# http://stackoverflow.com/questions/24685024
#       /creating-a-deck-of-cards-in-r-without-using-while-and-double-for-loop
buildTilePool <- function(){
  # simplified pool, lacking wildcards
  tile_colors <- c("R", "B", "Y", "G") # Y instead of O for Orange, G(reen) instead of B(lue) and B(lack)
  tile_numbers <- 1:13
  Pool <- paste(rep(tile_colors, each = 13), tile_numbers, sep = "-")
  Pool <- rep(Pool, 2)  # double the size of the pool
  return(Pool)
}

Tile_Pool <- buildTilePool()

#note, this builds a hand, but does not remove the tiles from the pool (maybe use a set operation?)
Player_Hand <- sample(size=14, x=Tile_Pool, replace=FALSE)

print(sort(Player_Hand))
##  [1] "B-6"  "B-7"  "G-11" "G-12" "G-13" "G-3"  "G-9"  "R-12" "R-6"  "R-9" 
## [11] "Y-10" "Y-2"  "Y-6"  "Y-9"

But it might be better to create a data-frame with columns for value/number, color, player-held (easier to sort, filter, search?).
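For comparison, here is a sketch of the same bookkeeping in Python (the lab accepts either language), representing tiles as (color, number) tuples and actually removing dealt tiles from the remaining pool; the names are illustrative:

```python
import itertools
import random

def build_tile_pool():
    """The 104 numbered tiles: two copies of each color/number
    combination (wildcards omitted, as in the simplified R pool)."""
    colors = ["R", "B", "Y", "G"]
    return [(c, n) for c, n in itertools.product(colors, range(1, 14))] * 2

def deal_hand(pool, size=14, seed=None):
    """Shuffle a copy of the pool and split off a hand, so the dealt
    tiles really are removed from the pool that remains for drawing."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    return shuffled[:size], shuffled[size:]
```

Tuples sort and compare naturally, which makes later filtering for runs (same color, consecutive numbers) and groups (same number, different colors) straightforward.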

Task:

Write a simulation in which the player keeps drawing tiles until they hold a set totaling at least 30 points and can play.

Use this simulation to answer these questions:

  * how often (or with what probability) can a player play on the initial draw of 14 tiles?
  * how many times must a player draw, on average, before they are able to play a set?
  * what’s the highest number of times a player can draw before they finally get a playable set?
  * does it matter if you factor in the draws of the other players, and if so, does it matter how many players there are?

Huge Caveat!

This is in Level 3 not because the simulation itself is hard, but because the really hard part is taking a hand of tiles and figuring out whether it contains some combination of groups of 3 or more tiles and/or runs of 3 or more tiles that totals 30 or more points. Remember that in any single evaluation, if you use a tile in one group or run, it can’t be used in another (e.g. sample without replacement from the hand).

I’m not a computer scientist, but my inexperienced guess is that this sub-problem is similar to the Knapsack Problem, which makes it computationally hard to solve efficiently in every case. But with only 14 or so of the 104 tiles in hand, it may still be quite tractable!

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.

Level 3 #5: Stock-Picker Performance and the “Hot Hand Fallacy”

The stock market is difficult to predict, but some people seem to do really well. Is it worth paying money for the advice of people with a good track record?

Task:

  1. Build a simulation of a stock’s price, perhaps using a random walk, recording whether the stock ends each year up or down.
  2. Build a stock-picker who merely uses a coin flip to predict whether the stock will go up or down.
  3. Build a population of stock-pickers (hundreds or thousands).
  4. Simulate your population of stock-pickers over a long period of time (maybe 20 years?).
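The steps above can be sketched in Python (the lab accepts either language). Here the stock’s yearly outcome is folded into a single fair coin flip per year, so each pick is correct with probability 1/2; all names are illustrative:

```python
import random

def simulate_pickers(n_pickers=1000, n_years=20, seed=42):
    """Each picker calls each year by coin flip; since the simulated
    market year is also a coin flip, a pick is correct with
    probability 1/2.  Returns each picker's number of correct
    calls out of n_years."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_years))
            for _ in range(n_pickers)]

def at_least_half_right(scores, n_years=20):
    """How many pickers were right at least half the time?"""
    return sum(s >= n_years / 2 for s in scores)
```

With enough pickers, a few will compile impressive streaks by luck alone, which is the point of the exercise.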

Analyze these results: look at things like how many pickers picked correctly at least half the time. Did any of them pick correctly for the whole 20 years? Knowing the method of this stock-picker with an amazing track record, would you pay them to give you advice on stocks?

For the overall problem, write a paragraph or more about what you found challenging and/or what you learned while working on this problem.