ISYE6501 Office Hours

Week 9, Thursday
Agenda
• Homework 9 Solution Review
• Homework 10 Preview
• Up Next
HW9 Solution Review – Design of Experiments
Q12.2 – To determine the value of 10 different yes/no features to the
market value of a house (large yard, solar roof, etc.), a real estate agent
plans to survey 50 potential buyers, showing a fictitious house with
different combinations of features. To reduce the survey size, the
agent wants to show just 16 fictitious houses. Use R’s FrF2 function (in
the FrF2 package) to find a fractional factorial design for this
experiment: what set of features should each of the 16 fictitious
houses have?
• Many different possible solutions due to randomness from FrF2, but
as a rule, solutions should have 16 reasonable combinations of the 10
features
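For intuition about what a 16-run design for 10 two-level features looks like, here is a minimal Python sketch. It is not FrF2 and will not reproduce FrF2's output: it builds a full factorial on four base features (A to D) and defines the other six as products of pairs of base features. The generator choices in `GENERATORS` and the feature letters are illustrative assumptions, not the ones FrF2 would select.

```python
from itertools import product

BASE = "ABCD"  # four base features: full 2^4 factorial = 16 runs
# Assumed generators for illustration only: each extra feature is the
# product of two base features (FrF2 may choose different generators).
GENERATORS = {"E": "AB", "F": "AC", "G": "BC", "H": "AD", "J": "BD", "K": "CD"}


def build_design():
    """Return 16 runs, each a dict mapping feature letter to -1 (no) or +1 (yes)."""
    runs = []
    for levels in product((-1, 1), repeat=len(BASE)):
        run = dict(zip(BASE, levels))
        for name, parents in GENERATORS.items():
            value = 1
            for parent in parents:
                value *= run[parent]
            run[name] = value
        runs.append(run)
    return runs


design = build_design()
print("run " + " ".join(f"{name:>3}" for name in design[0]))
for i, run in enumerate(design, start=1):
    print(f"{i:>3} " + " ".join("yes" if v == 1 else " no" for v in run.values()))
```

Each printed row is one fictitious house. Every feature appears in exactly 8 of the 16 houses, and every pair of feature columns is orthogonal, which is the balance property a fractional factorial design is meant to provide.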
HW9 Solution Review – Probability Distributions
• Binomial – in a clinical trial, out of n total patients, the number of patients
successfully cured by the drug/treatment might follow the binomial distribution
• Geometric – consultant flies from New York to Atlanta every Monday morning,
arriving at 9am for a 10:30am weekly meeting. The number of weeks the
consultant will be on time before the first time a flight is delayed long enough to
miss the meeting might follow the geometric distribution
• Poisson – the number of babies that will be born in the US tomorrow
might follow the Poisson distribution
• Exponential – the time between births of babies in the US tomorrow might follow
the Exponential distribution
• Weibull – in training for track events, the length of time an athlete runs before
having to stop might follow the Weibull distribution (expect that k<1 for weaker
athletes and increases for stronger athletes)
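The link between the Poisson and exponential examples above can be checked with a small simulation: if the gaps between events are exponential with rate λ, the number of events in one time unit is Poisson with mean λ. The sketch below also draws from a geometric distribution by counting on-time weeks before the first delay. The rate of 4 events per unit time and the 10% delay probability are made-up values for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable


def poisson_count(rate, horizon=1.0):
    """Count events in [0, horizon) when gaps between events are exponential."""
    t = rng.expovariate(rate)
    count = 0
    while t < horizon:
        count += 1
        t += rng.expovariate(rate)
    return count


def weeks_until_first_delay(p_delay):
    """Geometric draw: number of on-time weeks before the first delayed flight."""
    weeks = 0
    while rng.random() >= p_delay:
        weeks += 1
    return weeks


rate = 4.0      # assumed events per unit time
p_delay = 0.1   # assumed weekly probability of missing the meeting

counts = [poisson_count(rate) for _ in range(5000)]
weeks = [weeks_until_first_delay(p_delay) for _ in range(5000)]

poisson_mean = sum(counts) / len(counts)    # theory: rate
geometric_mean = sum(weeks) / len(weeks)    # theory: (1 - p_delay) / p_delay
print(f"mean events per unit time: {poisson_mean:.2f}")
print(f"mean on-time weeks before a delay: {geometric_mean:.2f}")
```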
HW9 Solution Review – Simulation
Q13.2 – Simulate a simplified airport security system at a busy airport. How many
ID/boarding-pass checkers and baggage scanners are needed to keep the average wait time
below 15 minutes?
• Passengers arrive according to a Poisson distribution with λ = 5 per minute (i.e., mean interarrival time 1/λ =
0.2 minutes) to the ID/boarding-pass check queue, where there are several servers who each have
exponential service time with mean 0.75 minutes (use […] if using SimPy)
• After that, the passengers are assigned to the shortest of the several personal-check queues, where
they go through the personal scanner (time is uniformly distributed between 0.5 minutes and 1 minute)

• Optimal solutions
• Arena: 4 ID checkers, 4 baggage scanners
• Average wait: 3.76 minutes
• Average total time: 5.25 minutes
• SimPy: 37 ID checkers, 37 baggage scanners
• Average wait: 8 minutes
• Note: adding / subtracting ID checkers / scanners doesn’t monotonically decrease/increase
wait and service times!
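The two-stage system described above can be sketched in plain Python without SimPy or Arena. This is a simplified illustration under stated assumptions, not the reference solution: the function name `simulate`, the 5000-passenger run length, and the seed are hypothetical choices, and the results are not expected to match the Arena or SimPy numbers on the slide.

```python
import random
from collections import deque


def simulate(n_checkers, n_scanners, n_passengers=5000,
             arrival_rate=5.0, mean_check_time=0.75, seed=1):
    """Return the average total wait (minutes) across both queues."""
    rng = random.Random(seed)

    # Stage 1: one FIFO queue feeding several ID/boarding-pass checkers.
    t = 0.0
    checker_free = [0.0] * n_checkers  # time at which each checker is next free
    stage1 = []                        # (time leaving ID check, wait at ID check)
    for _ in range(n_passengers):
        t += rng.expovariate(arrival_rate)  # exponential interarrival gap
        k = min(range(n_checkers), key=checker_free.__getitem__)
        start = max(t, checker_free[k])
        checker_free[k] = start + rng.expovariate(1.0 / mean_check_time)
        stage1.append((checker_free[k], start - t))

    # Stage 2: each scanner has its own line; passengers join the shortest.
    stage1.sort()  # handle passengers in the order they reach the scanners
    lines = [deque() for _ in range(n_scanners)]  # pending departure times
    total_wait = 0.0
    for t2, wait1 in stage1:
        for line in lines:
            while line and line[0] <= t2:  # drop passengers already done
                line.popleft()
        line = min(lines, key=len)
        start = max(t2, line[-1]) if line else t2
        line.append(start + rng.uniform(0.5, 1.0))  # uniform scan time
        total_wait += wait1 + (start - t2)
    return total_wait / n_passengers


for servers in (4, 5, 8):
    print(servers, "checkers/scanners:", round(simulate(servers, servers), 2), "min")
```

Because the run is finite and random, it is worth repeating each configuration with several seeds before comparing configurations.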
HW10 Preview – Missing Data / Imputation
• Q14.1 – The breast cancer data set (breast-cancer-wisconsin.data.txt) from
the UCI Machine Learning Repository has missing values.
1. Use mean/mode imputation to impute values for the missing data
2. Use regression to impute values for the missing data
3. Use regression with perturbation to impute values for the missing data
4. (Optional) Compare the results and quality of the classification models built with:
   a. Data sets from questions 1, 2, and 3
   b. Data that remains after data points with missing values are removed
   c. Data set when a binary variable is introduced to indicate missing values

• Description at
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
• Techniques used in this homework aren’t only applicable to
“missing” data; can also be used for bad data or outlier data
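As a rough illustration of mean/mode imputation and of the binary missing-value indicator mentioned in the optional comparison, here is a short Python sketch on a made-up column. The values are invented for the example and are not taken from the breast cancer data set; `None` stands in for a missing entry.

```python
from statistics import mean, mode

# Made-up column with two missing entries (None); not real data.
column = [1, 10, 2, None, 1, 1, None, 3]

observed = [v for v in column if v is not None]
mode_value = mode(observed)  # most frequent observed value
mean_value = mean(observed)  # average of the observed values

# Replace each missing entry with the chosen summary statistic.
imputed_mode = [mode_value if v is None else v for v in column]
imputed_mean = [mean_value if v is None else v for v in column]

# Binary indicator: 1 where the original value was missing, else 0.
was_missing = [int(v is None) for v in column]

print("mode-imputed:", imputed_mode)
print("mean-imputed:", imputed_mean)
print("indicator:   ", was_missing)
```

The indicator column can be kept alongside the imputed column so that a downstream model can learn whether missingness itself carries information.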
HW10 Preview – Missing Data / Imputation
• Q14.1 Tips
• First question – should you impute? Proper EDA should give you an idea
• Mean or mode imputation is okay for part 1 (recall, R has no built-in function
for the statistical mode; mode() returns an object's storage mode, so you’ll have
to write your own. Perhaps you’ve seen it in a previous homework?)
• Regressing for the unknown value sounds simple, but it is not
entirely straightforward. Which values should you use as your predictors?
• When performing regression + perturbation, use rnorm() to add normally
distributed random noise to your regression predictions. What would appropriate
values for mean and standard deviation be?
• Predictions from regression and regression + perturbation aren’t mathematically
bounded above or below, so imputed values can fall outside the range of the
observed data… is this an issue?
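The regression and regression + perturbation steps can be sketched in Python as follows. This is one possible approach under stated assumptions, not the official solution: the rows are made-up (predictor, target) pairs, the noise uses mean 0 and the sample standard deviation of the residuals, and out-of-range predictions are rounded and clamped to an assumed valid range of 1 to 10. Other choices for the noise scale and for handling the bounds are defensible.

```python
import random
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up (predictor, target) rows; None marks a missing target value.
rows = [(1, 1), (2, 1), (4, 5), (8, 10), (10, 10), (3, None), (9, None)]

# Fit only on rows where the target is observed.
complete = [(x, y) for x, y in rows if y is not None]
xs, ys = zip(*complete)
slope, intercept = fit_line(xs, ys)

# Assumed noise scale: sample SD of the residuals (a rough choice).
residuals = [y - (intercept + slope * x) for x, y in complete]
noise_sd = stdev(residuals)
rng = random.Random(0)


def impute(x, perturb):
    """Predict the missing target, optionally adding mean-zero normal noise."""
    guess = intercept + slope * x
    if perturb:
        guess += rng.gauss(0.0, noise_sd)
    return min(10, max(1, round(guess)))  # clamp to the assumed 1-10 range


regression_only = [(x, impute(x, False) if y is None else y) for x, y in rows]
perturbed = [(x, impute(x, True) if y is None else y) for x, y in rows]
print("regression:        ", regression_only)
print("with perturbation: ", perturbed)
```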
Up Next…
• HW9 peer reviews due Sunday/Monday, HW10 due
Wednesday/Thursday
