
Problem 1

Part A:​ We throw two dice. Each one is a normal die with six equally-weighted sides.

What is the probability that the sum of the two numbers is less than 7?

The probability is 15/36 = 5/12, or about 0.42 (42%).

Sums of the two dice (rows: first die, columns: second die):

      1   2   3   4   5   6
  1   2   3   4   5   6   7
  2   3   4   5   6   7   8
  3   4   5   6   7   8   9
  4   5   6   7   8   9  10
  5   6   7   8   9  10  11
  6   7   8   9  10  11  12

The outcomes whose sums are less than 7 (highlighted in the original table) account for 15 of the 36 equally likely outcomes. Therefore 15/36 of the outcomes result in a sum that is less than 7.
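As a quick check on the count, all 36 outcomes can be enumerated directly. This is a minimal Python sketch for illustration only; the variable names are not part of the assignment.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) outcomes for two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Keep only the outcomes whose sum is strictly less than 7.
favorable = [pair for pair in outcomes if sum(pair) < 7]

# Exact probability as a reduced fraction.
p_sum_below_7 = Fraction(len(favorable), len(outcomes))
print(len(favorable), len(outcomes), p_sum_below_7)  # 15 36 5/12
```

Using `Fraction` keeps the result exact, so the reduction from 15/36 to 5/12 is visible without rounding.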

Part B:​ Suppose that the overall probability of living to age 70 is 0.62 and that the

overall probability of living to age 80 is 0.23. If a person reaches their 70th birthday,

what is the probability that they will live to age 80?

P(living to 70) = 0.62    P(living to 80) = 0.23

P(80|70) = P(80 ∩ 70)/P(70)   ← definition of conditional probability

P(70) = 0.62

P(80 ∩ 70) = P(80)   Anyone who lives to age 80 has already lived to age 70, so the event "lives to 80" is a subset of the event "lives to 70", and P(80 ∩ 70) reduces to P(80).

The probability of living to age 80, given that a person reaches their 70th birthday, is

P(80|70) = 0.23/0.62 = 0.371, or about 37%
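The same calculation can be written out as a short sketch. The variable names are illustrative; the only inputs are the two probabilities given in the problem.

```python
p_70 = 0.62  # P(living to 70), given in the problem
p_80 = 0.23  # P(living to 80), given in the problem

# Living to 80 implies living to 70, so P(80 and 70) equals P(80).
p_80_and_70 = p_80

# Definition of conditional probability: P(80 | 70) = P(80 and 70) / P(70).
p_80_given_70 = p_80_and_70 / p_70
print(round(p_80_given_70, 3))  # 0.371
```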

Part C:​ At the Iron Bank, 62% of customers have checking accounts, 24% have savings

accounts, and 17% have both checking and savings accounts. Of the Iron Bank customers

who hold checking accounts, what percentage also have a savings account?
P(C) = 0.62    P(S) = 0.24    P(C,S) = 0.17

P(S|C) = P(C,S)/P(C)   ← by definition of conditional probability

       = 0.17/0.62 = 0.274

Of those who hold checking accounts, 0.274 (0.17/0.62), or 27.4%, also have a savings account.

The diagram shows the share of customers holding each type of account, the share holding both (the overlap of the circles), and the share holding neither (the region outside the circles). P(C,S) is the overlap; dividing it by the total share of checking-account holders gives the share who have a savings account given that they have a checking account.

Part D:​ With the same numbers from Part C, what is the probability that an Iron Bank

customer has neither a checking account nor a savings account?

The probability of having neither a savings account nor a checking account is 0.31, or 31%. This is the region outside both circles of the diagram: subtract the probabilities of the three regions inside the circles (checking only 0.45, savings only 0.07, both 0.17, which sum to 0.69) from 1, the total probability of all possibilities.

Part E:​ Again with the same numbers from Part C, what proportion of Iron Bank

customers have a savings account but not a checking account?

The probability that a customer has a savings account but not a checking account is 0.07, or 7%. This is the part of the savings circle in the diagram that does not intersect the checking circle: P(S) - P(C,S) = 0.24 - 0.17 = 0.07.
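The Venn-diagram bookkeeping behind Parts C through E can be sketched in a few lines. This is only an illustration of the region arithmetic, with hypothetical variable names; the three inputs are the percentages given in Part C.

```python
p_checking = 0.62  # P(C), given
p_savings = 0.24   # P(S), given
p_both = 0.17      # P(C,S), given

# Split the two circles into three non-overlapping regions.
checking_only = p_checking - p_both
savings_only = p_savings - p_both            # Part E
inside_circles = checking_only + savings_only + p_both

# Part C: share of checking holders who also hold savings.
p_savings_given_checking = p_both / p_checking

# Part D: everything outside both circles.
p_neither = 1 - inside_circles

print(round(p_savings_given_checking, 3))  # 0.274
print(round(p_neither, 2))                 # 0.31
print(round(savings_only, 2))              # 0.07
```

Splitting into non-overlapping regions first avoids double-counting the customers who hold both accounts.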
Problem 2

Return to the data on playlists from a music streaming service contained in plays_top50.csv.

Using the walkthrough on plays_top50.R as a starting point, please answer the following

questions.

Part A:​ Using the R functions xtabs and prop.table, make a 2x2 table of conditional

probabilities, where the bottom right entry in your table shows the P(plays Bob Dylan |

plays the Beatles). These variables are called bob.dylan and the.beatles in this data set.

              the.beatles
bob.dylan     0       1
        0     0.958   0.806
        1     0.042   0.194

The probability of playing Bob Dylan given playing the Beatles is P(plays Bob Dylan | plays the Beatles) = 0.194.

I built this table in RStudio by tabulating the raw listener counts, converting the counts to probabilities conditioned on the.beatles, and rounding to 3 decimals.
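One way to confirm that the table is conditioned on the.beatles (and not on bob.dylan) is to check that each the.beatles column sums to 1. The sketch below does this with the four reported values; it is written in Python purely as an arithmetic check and is not the R code used to build the table.

```python
# Reported conditional probabilities, keyed by (bob.dylan, the.beatles).
# Each value is P(bob.dylan = row | the.beatles = column).
cond = {
    (0, 0): 0.958, (0, 1): 0.806,
    (1, 0): 0.042, (1, 1): 0.194,
}

# Conditioning on the.beatles means each the.beatles column sums to 1.
column_sums = {
    beatles: round(cond[(0, beatles)] + cond[(1, beatles)], 3)
    for beatles in (0, 1)
}
print(column_sums)  # {0: 1.0, 1: 1.0}

# Bottom-right entry: P(plays Bob Dylan | plays the Beatles).
p_dylan_given_beatles = cond[(1, 1)]
```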

Part B:​ Are the events “plays Coldplay” and “plays Radiohead” independent? Why or

why not? Provide numerical evidence to support your answer.

For the two events to be independent, the following must hold:

P(Coldplay) * P(Radiohead) = P(Coldplay, Radiohead)

              radiohead
coldplay      0       1       Sum
       0      0.716   0.126   0.842
       1      0.104   0.055   0.159
     Sum      0.820   0.181   1.001

According to the table, P(Coldplay, Radiohead) = 0.055. This does not equal P(Coldplay) * P(Radiohead) = 0.159 * 0.181, which comes out to 0.029. (The grand total shows as 1.001 rather than 1, presumably because each entry is rounded to 3 decimals.)

This means the events “plays Coldplay” and “plays Radiohead” are not independent, since P(Coldplay) * P(Radiohead) =/= P(Coldplay, Radiohead).
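The independence check can be reproduced from the three table entries it uses. In this sketch the tolerance of 0.002 is an assumption chosen to allow for the 3-decimal rounding of the table; it is not part of the assignment.

```python
p_coldplay = 0.159   # row margin for coldplay = 1
p_radiohead = 0.181  # column margin for radiohead = 1
p_both = 0.055       # joint entry for coldplay = 1, radiohead = 1

# Under independence the joint probability would equal this product.
p_if_independent = p_coldplay * p_radiohead

# Assumed tolerance to absorb rounding error in the 3-decimal table.
tolerance = 0.002
looks_independent = abs(p_both - p_if_independent) <= tolerance

# Equivalent view: compare P(Coldplay | Radiohead) with P(Coldplay).
p_coldplay_given_radiohead = p_both / p_radiohead

print(round(p_if_independent, 3))            # 0.029
print(round(p_coldplay_given_radiohead, 3))  # 0.304
print(looks_independent)                     # False
```

The conditional view makes the dependence easier to read: Radiohead listeners play Coldplay at roughly twice the overall rate (0.304 versus 0.159).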
Problem 3

Visitors to your website are asked to answer a single “yes or no” survey question before

they get access to the content on the page. Among all of the visitors to this page, there are

two categories: Random Clicker (RC) and Truthful Clicker (TC). There are two possible

answers to the survey: yes and no. Random clickers would click either one with equal

probability. You are also given the information that the overall fraction of random

clickers is 0.3.

After a trial period, you get the following survey results: 65% said Yes and 35% said No.

What is your estimate for P(yes | TC), the probability that someone answers “yes” to the

survey question, given that they were a truthful clicker?

I started by filling in the given information on a probability tree. Random clickers choose either answer with equal probability, so each answer has probability 0.5 for them. We don't know the answer probabilities for truthful clickers, but we can still multiply down the branches, using n = P(yes | TC) as the unknown, and solve with algebra:

0.15 + 0.7 - 0.7n = 0.35, n ≈ 0.714 ← using the probability of “no” in the tree and the 35% survey result

0.15 + 0.7n = 0.65, n ≈ 0.714 ← using the probability of “yes”; either equation gives the same n

Here 0.15 = 0.3 × 0.5 is the probability of being a random clicker and giving a particular answer, and 0.7 = 1 - 0.3 is the fraction of truthful clickers. Either equation reduces to 0.7n = 0.5, so n = 0.5/0.7 ≈ 0.714. Note that 0.5 is the joint probability P(yes, TC) = 0.7n, not the conditional probability. The probability that someone answers “yes” given that they are a truthful clicker is therefore

P(yes|TC) ≈ 0.714, or about 71%
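The algebra is the law of total probability, P(yes) = P(RC) × P(yes | RC) + P(TC) × P(yes | TC), rearranged for the unknown P(yes | TC). A minimal sketch under that reading of the tree, with illustrative variable names:

```python
p_rc = 0.3             # fraction of random clickers, given
p_tc = 1 - p_rc        # fraction of truthful clickers
p_yes_given_rc = 0.5   # random clickers pick yes/no with equal probability
p_yes = 0.65           # overall share of "yes" answers in the survey

# Rearranged law of total probability, solved for P(yes | TC).
p_yes_given_tc = (p_yes - p_rc * p_yes_given_rc) / p_tc

# Joint probability of being a truthful clicker AND answering yes.
p_yes_and_tc = p_tc * p_yes_given_tc

# Consistency check using the "no" branch of the tree.
p_no = p_rc * (1 - p_yes_given_rc) + p_tc * (1 - p_yes_given_tc)

print(round(p_yes_given_tc, 3))  # 0.714
print(round(p_yes_and_tc, 2))    # 0.5
print(round(p_no, 2))            # 0.35
```

The conditional probability (about 0.714) and the joint probability (0.5) answer different questions; the problem asks for the conditional one.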
Problem 4

Suppose that you work in the analytics office for one of the state’s largest health-insurance
companies. Your first assignment is to study the cost-effectiveness of instituting a universal
test for a disease called SOS. Your firm is thinking about making the test free and universal for
all 10 million of its clients in Texas. The tests themselves are a significant expense. Yet if
caught early, before the onset of its worst symptoms, SOS can be treated much more
cost-effectively. This could potentially save your firm a large amount of money down the road.
You are charged with understanding the properties of the test.
We know that SOS afflicts roughly 1 Texan out of every 1000, and let’s further assume that
your 10 million clients are a representative sample of all Texans (so you can expect this base
rate for Texas to hold for your clients as well). No medical test is perfect, but this one is
reasonably accurate: it gives a positive result for 95% of people who have SOS, and a negative
result for 99% of people who do not have SOS.
In light of these numbers, what is the posterior probability that a patient has SOS, given that
they test positive for the disease?

P(V|+) = P(V,+)/P(+)   ← the conditional probability formula for having the disease (V) given a positive test (+)

The probability tree is filled in with the information given in the problem, and I multiplied down each branch to get the probability of each outcome.

Using the formula for conditional probability, we can plug in the values from the tree:

P(V|+) = P(V,+)/P(+)

       = 0.00095/(0.00095 + 0.00999) = 0.00095/0.01094 ≈ 0.087

Here 0.00095 = 0.001 × 0.95 is the probability of having SOS and testing positive, and 0.00999 = 0.999 × 0.01 is the probability of not having SOS and testing positive.

Therefore the posterior probability that a patient has SOS, given that they test positive for the disease, is about 0.087, or 8.7%.
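The posterior follows from Bayes' rule applied to the three numbers in the problem statement. The sketch below uses the terms sensitivity and specificity for the 95% and 99% figures; those labels and the variable names are my own, not the assignment's.

```python
prevalence = 0.001   # 1 in 1000 clients has SOS, given
sensitivity = 0.95   # P(+ | SOS), given
specificity = 0.99   # P(- | no SOS), given
n_clients = 10_000_000

# Multiply down the two branches of the tree that end in a positive test.
p_true_positive = prevalence * sensitivity
p_false_positive = (1 - prevalence) * (1 - specificity)

# Bayes' rule: P(SOS | +) = P(SOS, +) / P(+).
posterior = p_true_positive / (p_true_positive + p_false_positive)

# Expected counts if all 10 million clients were tested.
expected_true_positives = round(n_clients * p_true_positive)
expected_false_positives = round(n_clients * p_false_positive)

print(round(posterior, 3))       # 0.087
print(expected_true_positives)   # 9500
print(expected_false_positives)  # 99900
```

The expected counts show why the posterior is so low: false positives outnumber true positives by more than ten to one, because almost everyone tested does not have SOS.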
Problem 5

Download the s550.csv data set from the course Canvas site. This file contains a sample
of 1501 Mercedes S-Class S550 cars offered for sale in the United States via cars.com.
There are several variables in the data set, but for this problem the three relevant
variables are mileage, price, and year.
Using the ggplot function within the ggplot2 R library, make a scatterplot of price (y)
versus mileage (x), faceted by year. (That is, there should be a separate panel for each
year.) Format your figure so that it has 2 rows of 4 panels each, so that it looks wider than
it is tall.

The scatter plot above shows price (y) versus mileage (x), faceted by year, made with the functions ggplot, geom_point, and facet_wrap in RStudio.


Problem 6

Bike-sharing systems are a new generation of traditional bike rentals where the whole
process from rental to return is automatic. There are thousands of municipal bike-sharing
systems around the world (e.g. Citi bikes in NYC or “Boris bikes” in London), and they
have attracted a great deal of interest because of their important role in traffic,
environmental, and health issues—especially in the wake of the COVID-19 pandemic,
when ridership levels on public-transit systems have plummeted.

These bike-sharing systems also generate a tremendous amount of data, with time of
travel, departure, and arrival position recorded for every trip. This feature turns the bike
sharing system into a virtual sensor network that can be used for sensing mobility
patterns across a city.

Bike-sharing rental demand is highly correlated with environmental and seasonal variables
like weather conditions, day of week, time of year, hour of the day, and so on. In this
problem, you’ll look at some of these demand-driving factors using the bikeshare.csv
data from the course Canvas page. This data set contains a two-year historical log (2011
and 2012) from the Capital Bikeshare system in Washington D.C. The raw data is
publicly available at http://capitalbikeshare.com/system-data. These data have been
aggregated on an hourly and daily basis and then merged with weather and seasonal data.

The variables in this data set are as follows:


• instant: unique record identifier for each row
• dteday: date
• season: season (1:spring, 2:summer, 3:fall, 4:winter)
• yr: year (0: 2011, 1:2012)
• mnth: month (1 to 12)
• hr: hour (0 to 23)
• holiday: whether the day is holiday or not
• weekday: day of the week (1 = Sunday)
• workingday: if day is neither weekend nor holiday is 1, otherwise is 0.
• weathersit: a weather situation code with the following values
– 1: Clear, Few clouds, Partly cloudy, Partly cloudy
– 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
– 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain +
Scattered clouds
– 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
• temp: Normalized temperature in Celsius. The actual values are divided by 41 (max)
• total: count of total bike rentals that hour, including both casual and registered users
Your task in this problem is to prepare three figures.

Plot A​: a line graph showing average bike rentals (total) by hour of the day (hr).

The graph above shows the average bike rentals according to the hour of the day. The x
axis represents the hour of the day while the y axis represents the average bike rentals
corresponding to those hours.

The peaks around 8am and 5pm are expected: these are the times just before and after school
or a typical work day, when people are out riding bikes. The valleys at hours such as 4am and
midnight are also expected, because people are typically asleep or at home.

Take-home Lesson: From the plot, we can observe that riders are typically more active
during the day and are especially active during regular commuting times such as 8am and
5pm. Activity is low during the very early and very late hours of the day, when individuals
are typically at home or sleeping.
Plot B:​ a faceted line graph showing average bike rentals by hour of the day, faceted
according to whether it is a working day (workingday).

The graph above shows the average bike rentals according to the hour of the day
depending on if the day is a working day or not. The x axis represents the hour of the day
while the y axis represents the average bike rentals corresponding to those hours. Panel 0
represents the non-working days and Panel 1 represents the working days.

On working days, 8am and 5pm are the most popular hours for commuting or being out, since
they correspond to the typical times before and after school or work. On non-working days,
mid-afternoon would be expected to be the most popular time to be out, which is what Panel 0
shows: most bike rentals occur between 12pm and 4pm, presumably because people are not at
work or school on those days and have free time in the afternoon.

Take-home Lesson: From the plot, we can observe that riders are typically active in the
early morning and evenings on working days and are especially active during most of the
afternoon on non-working days.
Plot C:​ a faceted bar plot showing average ridership during the 8 AM hour by weather
situation code (weathersit), faceted according to whether it is a working day or not.

The graph above shows the average bike rentals (y axis) during the hour of 8am. This is
according to the weather situation displayed on the x axis (numbers correspond to different
weather conditions). Furthermore, the data is separated by whether it is a working day in
which Panel 0 represents the non-working days and Panel 1 represents the working days.

At 8am on both working days and non-working days, average bike rentals are higher when the
weather is clearer and more ‘ideal’ for being outdoors, and lower when the weather is snowy
or stormy. Although there are fewer rides in the morning on non-working days, likely because
people do not have to be up as early for work, the trend across weather codes is similar in
both panels.

Take-home Lesson: From the plot, we can observe that riders are typically more active
in the morning when the weather is clearer or ‘ideal’ (not stormy or snowy), and that more
severe weather results in fewer rentals. Additionally, we can expect fewer rentals on the
mornings of non-working days, since fewer people are up and commuting to school or work.
