
Page 1 of 33 - Cover Page Submission ID trn:oid:::1:2866706450

MR
Student 2
Quick Submit

Winneconne High School

Document Details

Submission ID: trn:oid:::1:2866706450
Submission Date: Mar 27, 2024, 5:43 AM CDT
Download Date: Mar 27, 2024, 5:52 AM CDT
File Name: Assignment_2_Spring_2024_2.docx
File Size: 128.0 KB
Pages: 31
Words: 6,951
Characters: 42,318



AI Writing Overview

How much of this submission has been generated by AI?

*16% of qualifying text in this submission has been determined to be generated by AI.

* Low scores have a higher likelihood of false positives.

Caution: Percentage may not indicate academic misconduct. Review required. It is essential to understand the limitations of AI detection before making decisions about a student's work. We encourage you to learn more about Turnitin's AI detection capabilities before using the tool.

Frequently Asked Questions

What does the percentage mean?


The percentage shown in the AI writing detection indicator and in the AI writing report is the amount of qualifying text within the
submission that Turnitin's AI writing detection model determines was generated by AI.

Our testing has found that there is a higher incidence of false positives when the percentage is less than 20. In order to reduce the
likelihood of misinterpretation, the AI indicator will display an asterisk for percentages less than 20 to call attention to the fact that
the score is less reliable.

However, the final decision on whether any misconduct has occurred rests with the reviewer/instructor. They should use the
percentage as a means to start a formative conversation with their student and/or use it to examine the submitted assignment in
greater detail according to their school's policies.

What does 'qualifying text' mean?

Our model only processes qualifying text in the form of long-form writing. Long-form writing means individual sentences contained in paragraphs that make up a longer piece of written work, such as an essay, a dissertation, or an article. Qualifying text that has been determined to be AI-generated will be highlighted blue on the submission text.

Non-qualifying text, such as bullet points, annotated bibliographies, etc., will not be processed and can create disparity between the submission highlights and the percentage shown.

How does Turnitin's indicator address false positives?

Sometimes false positives (incorrectly flagging human-written text as AI-generated) can include lists without a lot of structural variation, text that literally repeats itself, or text that has been paraphrased without developing new ideas. If our indicator shows a higher amount of AI writing in such text, we advise you to take that into consideration when looking at the percentage indicated.

In a longer document with a mix of authentic writing and AI-generated text, it can be difficult to determine exactly where the AI writing begins and the original writing ends, but our model should give you a reliable guide to start conversations with the submitting student.

Disclaimer
Our AI writing assessment is designed to help educators identify text that might be prepared by a generative AI tool. Our AI writing assessment may not always be accurate (it may misidentify
both human and AI-generated text) so it should not be used as the sole basis for adverse actions against a student. It takes further scrutiny and human judgment in conjunction with an
organization's application of its specific academic policies to determine whether any academic misconduct has occurred.

AI Writing Submission

Statistical Intuitions and Applications


Assignment 2
Important Information:
In Assignment 2 you will apply the statistical concepts that we've encountered so far in the
course to solve FOUR questions related to real-world scenarios. Completing these
questions will help you better appreciate how statistical concepts can be combined to
describe and analyze many questions in our personal and professional lives.
For each question you will first encounter a block of code that generates a dataset for you
to work with (similar to Assignment 1). You will need to save these datasets as CSV files.
Submission Requirements
1. Submit all answers along with their corresponding code as a searchable PDF.

2. All generated csv and .ipynb files must be submitted in a zip-folder as a secondary
source.

3. Ensure your zip folder contains four csv files (i.e., the csv files for each of the four
questions below; YourName.csv, RelianceRetailVisits_ordered.csv, Scores.csv,
Vaccinated.csv).

4. You may use Jupyter notebook or Colab as per your convenience.

5. You should ONLY use the concepts and techniques covered in the course to
generate your answers. Statistical techniques that are NOT covered in the
course will NOT be evaluated.

Note: Reach out to your instructor with any questions regarding CSV files, code, or the zip folder.
Non-compliance with the above instructions will result in a 0 grade on the relevant
portions of the assignment. Your instructor will grade your assignment based on what you
submitted. Failure to submit the assignment or submitting an assignment intended for
another class will result in a 0 grade, and resubmission will not be allowed. Make sure that
you submit your original work. Suspected cases of plagiarism will be treated as potential
academic misconduct and will be reported to the College Academic Integrity Committee for
a formal investigation. As part of this procedure, your instructor may require you to meet
with them for an oral exam on the assignment.


A. Statistical Intuitions in Mental Health

Question 1:
We are going to work with a dataset that was collected on mental health issues. In total,
824 individuals (teenagers, college students, housewives, business professionals, and
other groups) completed the survey. Their data provides valuable insights into the
prevalence of, and factors associated with, mental health issues in different groups.
To Begin.
Run the code below. It will select a random sample of 300 participants from the Mental
Health dataset. The code will then generate a CSV file called Name.csv. You need to change
the name of the file to your actual name and then submit it in the zip folder as a secondary
file.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will create a csv file containing a random sample of 300
# respondents. You will answer the questions below based on this sample.

# Look at the code below. Now replace 'Name.csv' with your actual name (e.g., 'Sara.csv').
# The code will generate a csv file that you need to submit in the zip folder as a secondary file.

try:
    df = pd.read_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/mental_health_finaldata_1.csv")
    df1 = original_data.sample(300)
    df1.to_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name
    df = pd.read_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name

df = pd.DataFrame(df)
df.to_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name


df.head()

[df.head() output (Colab dataframe summary): 300 rows; columns Unnamed: 0, Age, Gender, Occupation, Days_Indoors, Growing_Stress, Quarantine_Frustrations, Changes_Habits, Mental_Health_History, Weight_Change, Mood_Swings, Coping_Struggles, Work_Interest, Social_Weakness.]

Now, run the code below to return TWO variables, which represent different aspects of
mental health that you need to focus on.

# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

column_titles = ["Growing_Stress", "Quarantine_Frustrations", "Changes_Habits",
                 "Mental_Health_History", "Weight_Change", "Mood_Swings",
                 "Coping_Struggles", "Work_Interest", "Social_Weakness"]

# Randomly select 2 variables
selected_columns = random.sample(column_titles, 2)

# Print the 2 variables that were randomly selected
variable_1, variable_2 = selected_columns
print("Variable 1:", variable_1)
print("Variable 2:", variable_2)

Variable 1: Weight_Change
Variable 2: Quarantine_Frustrations

Using the sample dataset from the CSV file you generated, answer the following questions:
Question 1a. Is each of these two variables independent of being female? Explain your
reasoning. Make sure to include a two-way table for each of these two variables with
gender, and show all your calculations to support your answers.


Question 1b. Is there a relationship between the two variables returned by the code?
Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all
your probability calculations in your answer.
Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing
Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer
with the relevant statistical analysis.
Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to
answer the following question:
Imagine ZU wanted to use the insights from this research to improve its mental health
support program. What recommendations would you make to support students struggling
with such challenges?
Answer: Add more "markdown" text and code cells below as needed.

Question 1a. Is each of these two variables independent of being female? Explain your reasoning. Make sure to include a two-way table for each of these two variables with gender, and show all your calculations to support your answers.
import pandas as pd
from scipy.stats import chi2_contingency

# Create contingency tables for each variable with gender
contingency_table_1 = pd.crosstab(df['Gender'], df[variable_1])
contingency_table_2 = pd.crosstab(df['Gender'], df[variable_2])

# Perform Chi-square test and get p-values
_, p_value_1, _, _ = chi2_contingency(contingency_table_1)
_, p_value_2, _, _ = chi2_contingency(contingency_table_2)

# Print two-way tables
print("Two-way table for Variable 1:")
print(contingency_table_1)
print("\nTwo-way table for Variable 2:")
print(contingency_table_2)

# Interpret the results and explain reasoning
print("\nQuestion 1a. Is each of these two variables independent of being female?")

if p_value_1 > 0.05:
    print(f"Variable 1 is independent of being female (p-value: {p_value_1:.4f}).")
else:
    print(f"Variable 1 is not independent of being female (p-value: {p_value_1:.4f}).")

if p_value_2 > 0.05:
    print(f"Variable 2 is independent of being female (p-value: {p_value_2:.4f}).")
else:
    print(f"Variable 2 is not independent of being female (p-value: {p_value_2:.4f}).")

Two-way table for Variable 1:
Weight_Change   No  Yes
Gender
Female          58  103
Male            40   99

Two-way table for Variable 2:
Quarantine_Frustrations   No  Yes
Gender
Female                    48  113
Male                      37  102

Question 1a. Is each of these two variables independent of being female?
Variable 1 is independent of being female (p-value: 0.2258).
Variable 2 is independent of being female (p-value: 0.6285).
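The p-values above come from scipy's chi-square test; the same independence check can also be reproduced by hand from the printed two-way table, which helps satisfy the "show all your calculations" requirement. The sketch below hard-codes the Weight_Change counts from the output above. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the statistic behind the 0.2258 p-value is slightly smaller than the uncorrected one computed here.

```python
import numpy as np

# Observed counts copied from the two-way table above
# (rows: Female, Male; columns: No, Yes).
observed = np.array([[58, 103],
                     [40, 99]])

# Expected count for each cell under independence:
# row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Uncorrected chi-square statistic: sum of (O - E)^2 / E over all cells
statistic = ((observed - expected) ** 2 / expected).sum()

print("Expected counts:\n", expected.round(2))
print("Chi-square statistic (no continuity correction):", round(statistic, 4))
```

Comparing the observed counts against the expected counts cell by cell makes the "calculations" behind the test explicit, rather than relying on the library's p-value alone.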

Question 1b. Is there a relationship between the two variables returned by the code? Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all your probability calculations in your answer.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# Create a two-way table for the two variables
two_way_table = pd.crosstab(df[variable_1], df[variable_2])

# Perform Chi-square test and get p-value
_, p_value, _, _ = chi2_contingency(two_way_table)

# Print the two-way table
print("Two-way table:")
print(two_way_table)

# Interpret the results
print("\nQuestion 1b. Is there a relationship between the two variables?")
if p_value > 0.05:
    print("There is no relationship between the two variables.")
else:
    print("There is a relationship between the two variables.")

# Create a stacked bar graph
two_way_table.plot(kind="bar", stacked=True)
plt.xlabel(variable_1)
plt.ylabel("Frequency")
plt.title(f"Stacked Bar Graph of {variable_1} and {variable_2}")
plt.show()

Two-way table:
Quarantine_Frustrations  No  Yes
Weight_Change
No                       23   75
Yes                      62  140

Question 1b. Is there a relationship between the two variables?
There is no relationship between the two variables.
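As a probability-based sanity check on the "no relationship" conclusion, the conditional probabilities can be computed directly from the counts in the two-way table above. Under independence, the probability of quarantine frustrations would be roughly the same whether or not a weight change occurred. This is a sketch that hard-codes the printed counts:

```python
# Counts from the two-way table above
# (rows: Weight_Change, columns: Quarantine_Frustrations).
counts = {("No", "No"): 23, ("No", "Yes"): 75,
          ("Yes", "No"): 62, ("Yes", "Yes"): 140}

total = sum(counts.values())

# Marginal and conditional probabilities of Quarantine_Frustrations = "Yes"
p_yes = (counts[("No", "Yes")] + counts[("Yes", "Yes")]) / total
p_yes_given_change = counts[("Yes", "Yes")] / (counts[("Yes", "No")] + counts[("Yes", "Yes")])
p_yes_given_no_change = counts[("No", "Yes")] / (counts[("No", "No")] + counts[("No", "Yes")])

print(f"P(Yes) = {p_yes:.3f}")
print(f"P(Yes | Weight_Change = Yes) = {p_yes_given_change:.3f}")
print(f"P(Yes | Weight_Change = No)  = {p_yes_given_no_change:.3f}")
```

The conditional probabilities (about 0.69 and 0.77) stay close to the marginal (about 0.72), which is consistent with the chi-square test's failure to find a significant association.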


Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer with the relevant statistical analysis.
# Calculate the conditional probabilities
p_variable_2_given_variable_1 = df[df[variable_1] == "Yes"][variable_2].value_counts(normalize=True)["Yes"]
p_variable_2_given_not_variable_1 = df[df[variable_1] == "No"][variable_2].value_counts(normalize=True)["Yes"]

# Calculate the relative risk
relative_risk = p_variable_2_given_variable_1 / p_variable_2_given_not_variable_1

# Print the results
print("P(Variable 2 | Variable 1) =", p_variable_2_given_variable_1)
print("P(Variable 2 | not Variable 1) =", p_variable_2_given_not_variable_1)
print("Relative risk =", relative_risk)

# Interpret the results
if relative_risk > 1:
    print("\nThe existence of Variable 1 increases the likelihood of experiencing Variable 2 by a factor of", relative_risk)
else:
    print("\nThe existence of Variable 1 decreases the likelihood of experiencing Variable 2 by a factor of", 1/relative_risk)

P(Variable 2 | Variable 1) = 0.693069306930693
P(Variable 2 | not Variable 1) = 0.7653061224489796
Relative risk = 0.9056105610561056

The existence of Variable 1 decreases the likelihood of experiencing Variable 2 by a factor of 1.1042274052478134
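A relative risk of about 0.906 is often easier to communicate as a percentage change. A short sketch using the conditional probabilities printed above:

```python
# Conditional probabilities copied from the output above
p_given_yes = 0.693069306930693   # P(Variable 2 | Variable 1)
p_given_no = 0.7653061224489796   # P(Variable 2 | not Variable 1)

relative_risk = p_given_yes / p_given_no

# A relative risk below 1 means the probability is lower when Variable 1 is present
pct_change = (relative_risk - 1) * 100
print(f"Relative risk: {relative_risk:.4f}")
print(f"Change in likelihood: {pct_change:+.1f}%")
```

In words: in this sample, respondents with Variable 1 were roughly 9% less likely to report Variable 2, rather than more likely.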


Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to answer the following question:

Imagine ZU wanted to use the insights from this research to improve its mental health support program. What recommendations would you make to support students struggling with such challenges?

Based on what we learned from looking at the data:

1. Support based on gender:
• Different types of help might be needed for male and female students because they may have different experiences. So, it is important to offer support that fits everyone's needs.

2. Teach everyone about mental health:
• We should teach students about mental health so they know what to look out for and where to go for help if they need it. This helps everyone feel more comfortable talking about mental health issues.

3. Have many different kinds of help available:
• Since having one mental health problem can make it more likely to have another one, it is good to have many different ways to get help. This could include talking to someone, joining a support group, or going to workshops.

4. Make a friendly campus:
• We need to make sure the campus feels like a welcoming place for everyone. This means making it easy for students to talk to each other and feel supported by their peers and teachers.

5. Encourage asking for help:
• Sometimes students might feel embarrassed or afraid to ask for help with mental health problems. We want to make it clear that it is okay to ask for help and show students where they can go for support.


By implementing these recommendations, ZU can create a more supportive environment for students struggling with mental health challenges and help them get the support they need to succeed.

B. Statistical Intuition in Store Ratings

Question 2:
Imagine you are the manager of an electronics store in Dubai Mall. You are curious about
the distribution of customer ratings of your overall store services. So you ask random
customers who visit the store to complete a short survey, recording variables such as their
age group and overall experience rating.
To Begin
Run the code below. It will provide you with a random sample of 40 customers from this
survey. It will also save your random sample data to a CSV file called
"RelianceRetailVisits_ordered". Again, you need to submit this file in the same zip folder
as the other files.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")

    # Randomly sample 40 rows from the original dataset
    df = original_data.sample(n=40, random_state=42)

# Fill missing values in the 'Age Group' column with a default value
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years', '46 To 60 years']
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order, ordered=True)
df.sort_values(by='Age Group', inplace=True)

# Save the sorted DataFrame to a new CSV file
df.to_csv('RelianceRetailVisits_ordered.csv', index=False)

df.head()

[df.head() output (Colab dataframe summary): 40 rows; columns Customer Index, Age Group, OverallExperienceRatin.]

Use the random sample of data from the csv file you generated to answer the following
questions:
Question 2a. Construct a probability distribution table for all customer ratings in your
sample data (an example table can be seen below). Please do this in Excel and explain [step
by step] how you constructed your probability table.
[Image: example probability distribution table]
Question 2b. What is the probability that a randomly selected customer will have a rating
of AT MOST 3?
Question 2c. Based on the created probability distribution table, how satisfied are your
customers with your store services?
Question 2d. Find the expected rating of your store. Show your work and interpret your
answer in context.
Run the code below. It will generate the probability distribution graph for all your
customers' satisfaction ratings and the standard deviation.


Question 2a. Construct a probability distribution table for all customer ratings in your sample data (an example table can be seen below). Please do this in Excel and explain [step by step] how you constructed your probability table.
Step 1: Open the CSV file in Excel.
Step 2: Select the column containing the customer ratings.
Step 3: Use the COUNTIF function to count the number of times each rating appears in the column. For example, consider rating 1.
• Formula: =COUNTIF(C2:C41, 1)
• Explanation: This formula counts the number of times the rating '1' appears in the range C2:C41.
Step 4: Use the COUNTA function to calculate the total number of ratings.
• Formula: =COUNTA(C2:C41)
• Explanation: This formula counts all non-empty cells in the range C2:C41, giving the total number of ratings.
Step 5: Divide the count of each rating by the total number of ratings to get the probability of each rating.
• Formula: =COUNTIF(C2:C41, 1) / COUNTA(C2:C41)
• Explanation: This formula divides the count of '1' ratings by the total count of all ratings, resulting in the probability of a rating of '1'. The formula can be adjusted by replacing the '1' with any other rating value to find the probability of that rating occurring.
Step 6: Create a table with six columns and two rows: one row for the ratings and one for the probabilities.

The probability distribution table shows the probability of each customer rating occurring. For example, the probability of a customer giving a rating of 1 is 0.025, and the probability of a customer giving a rating of 5 is 0.3.
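The COUNTIF/COUNTA workflow above can be reproduced in pandas as a cross-check of the Excel table. The sketch below uses a small hypothetical ratings series standing in for column C of the spreadsheet; with the real file you would pass df['OverallExperienceRatin'] instead.

```python
import pandas as pd

# Hypothetical ratings standing in for column C of the spreadsheet
ratings = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 5])

counts = ratings.value_counts().sort_index()   # one COUNTIF per distinct rating
total = ratings.count()                        # COUNTA: number of non-empty cells
prob_table = counts / total                    # COUNTIF / COUNTA for every rating

print(prob_table)
```

Because value_counts enumerates every distinct rating at once, this version cannot drift out of sync the way a hand-edited COUNTIF column can, and the probabilities are guaranteed to sum to 1.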
Question 2b. What is the probability that a randomly selected customer will have a rating of AT MOST 3?

To determine the probability that a randomly selected customer will have a rating of AT MOST 3, we sum the probabilities of ratings 1, 2, and 3. Hence, the probability can be computed as follows:

P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3)
P(X ≤ 3) = 0.025 + 0.15 + 0.1 = 0.275

Therefore, the probability that a randomly selected customer will have a rating of AT MOST 3 is 0.275, or 27.5%.
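The same sum can be scripted so the cumulative probability stays consistent with the table from Question 2a:

```python
# Probabilities for ratings 1, 2 and 3, taken from the probability table above
p = {1: 0.025, 2: 0.15, 3: 0.1}

# P(X <= 3) is the sum of the individual probabilities
p_at_most_3 = sum(p.values())
print(f"P(X <= 3) = {p_at_most_3:.3f}")
```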
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from tabulate import tabulate

# Load data
try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")
    df = original_data.sample(n=40, random_state=42)

# Fill missing values in the 'Age Group' column with a default value
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years', '46 To 60 years']
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order, ordered=True)
df.sort_values(by='Age Group', inplace=True)

# Save the sorted DataFrame to a new CSV file
df.to_csv('RelianceRetailVisits_ordered.csv', index=False)

# Probability distribution graph for customer rating
plt.figure(figsize=(8, 6))
rating_counts = df['OverallExperienceRatin'].value_counts(normalize=True).sort_index()
plt.bar(rating_counts.index, rating_counts, alpha=0.7)
plt.title('Probability Distribution of Customer Rating')
plt.xlabel('Overall Experience Rating')
plt.ylabel('Probability')
plt.xticks(range(1, 6))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Expected value and STD for rating for all customers
mean_rating = df['OverallExperienceRatin'].mean()
std_rating = df['OverallExperienceRatin'].std()
print(f"Standard Deviation (STD) of Customer Rating: {std_rating:.2f}")

Standard Deviation (STD) of Customer Rating: 1.11

Question 2c. Based on the created probability distribution graph for customer rating, how satisfied are your customers with your store services?
The probability distribution graph provides insights into customer satisfaction with store
services. It illustrates that a substantial portion of customers, comprising 72% in total, gave
ratings of 4 or 5. This suggests a high level of satisfaction among a significant majority of
customers.
However, it's also notable that approximately 28% of customers provided ratings of 3 or
lower. This indicates areas where improvements could be made to enhance overall
satisfaction levels.


In summary, while a large portion of customers appear satisfied with store services, there
is still a notable minority whose experiences may require attention to further improve
overall customer satisfaction.

Question 2d. Find the expected rating of your store. Show your work and interpret your answer in context.

Determining the expected rating involves calculating the average rating that customers are likely to give, considering the probability of each rating. This is achieved by multiplying each rating by its corresponding probability and then summing the results. Using the probabilities from the distribution table (with P(4) = 1 − 0.025 − 0.15 − 0.1 − 0.3 = 0.425, so that all probabilities sum to 1):

expected_rating = (1 × 0.025) + (2 × 0.15) + (3 × 0.1) + (4 × 0.425) + (5 × 0.3) = 3.83

Therefore, the average customer is expected to give a rating of about 3.83 on a scale of 1 to 5. This suggests that, overall, customers are fairly satisfied with the store services, but there remains scope for enhancement.
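To guard against slips in hand arithmetic, the expected value can be recomputed from the probability table in Questions 2a-2b. P(4) is not stated explicitly in the excerpts above, so it is inferred here as whatever makes the probabilities sum to 1 (an assumption worth verifying against your own CSV):

```python
# Probabilities for ratings 1, 2, 3 and 5 from Questions 2a-2b;
# P(4) is inferred so that the distribution sums to 1 (assumption).
p = {1: 0.025, 2: 0.15, 3: 0.1, 5: 0.3}
p[4] = 1 - sum(p.values())

# E[X] = sum of x * P(X = x) over all ratings
expected_rating = sum(x * px for x, px in p.items())
print(f"P(4) = {p[4]:.3f}")
print(f"Expected rating: {expected_rating:.3f}")
```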

Question 2e. Interpret the Standard Deviation in context. What rating is considered unusual? Explain.

The standard deviation (STD) of 1.11 for customer ratings indicates a moderate amount of variability in how customers rate the store. This suggests that opinions among customers are somewhat diverse.

To identify which ratings are considered unusual, we use the two-standard-deviation rule: a rating is unusual if it falls outside

mean_rating ± 2 × std_rating

Substituting the expected rating of 3.83 (computed from the probability table in Question 2d) and the standard deviation of 1.11:

3.83 ± 2 × 1.11 = 3.83 ± 2.22 = 1.61 to 6.05

Hence, any rating below 1.61 would be considered unusual, and since the scale only goes up to 5, no rating can exceed the upper bound. In context, this implies that a rating of 1 would be considered exceptionally low, while even the highest rating of 5 still falls within the typical range of ratings.
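The two-standard-deviation bound can also be scripted as a check. This sketch assumes the mean recomputed from the probability table (with P(4) inferred as 0.425, an assumption) and the SD of 1.11 printed by the code earlier:

```python
# Mean recomputed from the probability table (P(4) inferred as 0.425, an assumption)
mean_rating = 1 * 0.025 + 2 * 0.15 + 3 * 0.1 + 4 * 0.425 + 5 * 0.3
std_rating = 1.11  # standard deviation printed by the earlier code block

# Usual range: within two standard deviations of the mean
lower = mean_rating - 2 * std_rating
upper = mean_rating + 2 * std_rating

# Ratings on the 1-5 scale that fall outside the usual range
unusual = [x for x in range(1, 6) if x < lower or x > upper]
print(f"Usual range: {lower:.2f} to {upper:.2f}")
print("Unusual ratings:", unusual)
```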
Run the code below. It will generate the probability distribution graphs for each of the age groups along with their discrete probability distribution tables, the expected values, and the standard deviation values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings

# Suppress the FixedFormatter warning raised by set_yticklabels
# (the filter must be installed before the plotting loop to take effect)
warnings.filterwarnings("ignore", category=UserWarning)

# Read the ordered sample saved earlier
data = pd.read_csv('RelianceRetailVisits_ordered.csv')

# Define age groups including the new one
age_groups = ['16 To 25 years', '26 To 35 years', '36 To 45 years', '46 To 60 years']

# Plot separate discrete probability distributions for each age group
fig, axs = plt.subplots(1, 4, figsize=(20, 6), sharex=True, gridspec_kw={'hspace': 0.5})

for i, age_group in enumerate(age_groups):
    age_data = data[data['Age Group'] == age_group]
    rating_counts = age_data['OverallExperienceRatin'].value_counts(normalize=True).sort_index()
    bars = axs[i].bar(rating_counts.index, rating_counts, alpha=0.7)
    # Title shows the age group with its mean and standard deviation
    axs[i].set_title(f'{age_group}\nMean: {age_data["OverallExperienceRatin"].mean():.2f} | '
                     f'SD: {age_data["OverallExperienceRatin"].std():.2f}')
    axs[i].set_xlabel('Overall Experience Rating')
    axs[i].set_ylabel('Probability (%)')
    axs[i].set_xticks(range(1, 6))  # x-axis ticks from 1 to 5
    # Format y-axis tick labels as percentages
    axs[i].set_yticklabels(['{:,.0%}'.format(x) for x in axs[i].get_yticks()])

    # Display percentages above each bar
    for bar in bars:
        height = bar.get_height()
        rating = bar.get_x() + bar.get_width() / 2
        label = '0%' if height == 0 else f'{height:.0%}'
        axs[i].text(rating, height, label, ha='center', va='bottom', fontsize=8)

    axs[i].grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()


Question 2f. Identify any trends or differences in customer satisfaction levels (and
variability) among the different age groups.
Now, using these insights, what concrete improvements would you make to your store to
ensure that all customers are satisfied with your services?

Insights and Distinctions:

Contentment Levels Analysis:

• 16-25: Generally pleased (mean rating of 4.00), yet displaying a higher frequency of
lower ratings (1 and 2) than the middle segments.
• 26-35: Equally content on average (mean rating of 4.00), with a relatively even
distribution of ratings.
• 36-45: Most content demographic (mean rating of 4.33), with the lowest occurrence
of lower ratings.
• 46-60: Least content age bracket (mean rating of 2.71), experiencing a higher
occurrence of lower ratings (1 and 2).

Variance Assessment:
• 16-25: Moderate variance in ratings (STD of 1.03).
• 26-35: Highest variance (STD of 1.18), indicating the broadest range of experiences.
• 36-45: Lowest variance (STD of 0.52), suggesting the most consistent experience.
• 46-60: Moderate variance (STD of 0.95); combined with the low mean, this reflects
consistently weaker experiences rather than a few isolated complaints.

Recommendations for Enhancement:

Targeting Mature Audiences:


• Tailor offerings to address the distinct needs and preferences of the 46-60 age
group to enhance their contentment.
• Introduce more senior-specific discounts, ensure easily accessible facilities, and
curate products/services tailored to their tastes and requirements.

Elevating Overall Experience:


• Focus efforts on enhancing the holistic shopping experience for all demographic
segments.


• Expand product offerings, refine store layout and ambiance, and maintain a
consistent standard of service.

Personalized Customer Care:


• Train staff to deliver exceptional customer service tailored to individual needs and
preferences.
• Provide personalized recommendations, promptly address concerns, and exceed
expectations to ensure satisfaction.

Loyalty Programs and Incentives:


• Implement loyalty programs or incentives to foster repeat patronage and garner
positive feedback.
• Offer discounts, exclusive promotions, or early access to new products to cultivate
customer loyalty.

Continuous Feedback Collection:


• Continuously solicit feedback from customers across all age groups to identify areas
for improvement and monitor satisfaction levels.
• Utilize surveys, feedback forms, or social media engagement to gather insights and
respond promptly to customer feedback.

Consistency and Ongoing Training:


• Institute regular training sessions to keep staff updated on the latest methodologies
and benchmarks.
• Foster a culture of continuous improvement and learning within the organization.

Integration of Technology:
• Explore opportunities to integrate technology for streamlined processes and
enriched customer interactions.
• Implement self-service checkout systems, mobile payment options, and online
ordering to enhance convenience.

Community Engagement Initiatives:


• Engage with the local community to build rapport and a sense of community.
• Sponsor local events, participate in charitable endeavors, and champion community
causes to strengthen brand loyalty.

Sustainable Practices:
• Adopt environmentally conscious practices to appeal to eco-minded clientele.
• Minimize waste, utilize sustainable materials, and support eco-friendly initiatives to
demonstrate corporate responsibility.

Utilization of Data Analytics:


• Leverage data analytics to gain insights into customer behavior and preferences.


• Analyze sales trends, customer feedback, and demographic information to inform
strategic decisions.

Competitive Analysis:
• Conduct regular competitive analyses to stay informed about industry trends and
benchmark against competitors.
• Identify areas for differentiation and capitalize on unique selling propositions.

Crisis Management Strategies:


• Develop a comprehensive crisis management protocol to effectively navigate
unforeseen events.
• Train staff on emergency procedures, communication protocols, and customer
support during crises.

Exploration of Expansion Opportunities:


• Explore opportunities for expansion into new markets or geographical locations.
• Conduct market research to identify burgeoning areas of growth and develop
strategic expansion plans.

Employee Recognition and Motivation:


• Acknowledge and incentivize employees for their dedication and contributions.
• Implement reward programs, performance incentives, and employee appreciation
initiatives to boost morale.

Collaborative Partnerships with Suppliers:


• Foster strong relationships with suppliers to ensure timely delivery of high-quality
products.
• Negotiate favorable terms and engage in collaborative marketing efforts for mutual
benefit.

Health and Safety Measures:


• Prioritize the health and safety of both customers and staff members.
• Implement rigorous cleaning protocols, provide personal protective equipment, and
enforce social distancing measures.

Adaptability and Flexibility:


• Remain adaptable and flexible to respond quickly to changing market conditions
and customer needs.
• Embrace innovation and experimentation to stay ahead in a dynamic business
environment.
Answer: Add more "markdown" text cells below as needed.


C. Statistical Intuition in SAT Exams

Question 3:
Imagine you are working for a prestigious university in the UAE. It is your job to decide
which students are admitted to the university. To help you do this, you analyze the high
school (SAT) scores of potential students. These scores help you understand their academic
readiness and potential for success at the university.
You have just received the scores of applicants who would like to join the university in
September 2024. These scores follow a normal distribution.
To Begin.
Run the code below. It will generate a dataset with the students' scores. It will also calculate
the mean (μ) and standard deviation (σ) of these scores. This dataset will be saved as a
CSV file called "Scores.csv". Again, you need to submit this file in the same zip folder as
your other files.
# Load the following libraries so that they can be applied in the subsequent
# code blocks

import pandas as pd
import numpy as np
import random

try:
    SATScores = pd.read_csv('Scores.csv')
except FileNotFoundError:
    num_samples = 1000
    mean_score = random.randint(800, 1200)
    std_deviation = random.randint(100, 300)
    scores = np.random.normal(mean_score, std_deviation, num_samples)
    scores = np.round(scores, 0)
    SATScores = pd.DataFrame({'Scores': scores})
    # index=False keeps the saved file clean when it is read back in
    SATScores.to_csv('Scores.csv', index=False)

# Calculate mean and standard deviation
mean_score = SATScores['Scores'].mean()
std_deviation = SATScores['Scores'].std()

# Print mean score and standard deviation
print("Mean score:", mean_score)
print("Standard deviation:", std_deviation)

# Display the dataset
SATScores.head()


Mean score: 859.107
Standard deviation: 231.946737075217

(Dataframe output: SATScores, 1000 rows; the 'Scores' column ranges from 201.0 to 1817.0.)

Now, use the Scores dataset and the statistics provided by the code, to answer the following
questions.
IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to your
conclusions.
• If you use online calculators then please include screenshots of those calculators as
part of your work.
• Please do not use code to solve these questions. The questions are designed to test your
understanding.
Question 3a. What is the probability that a randomly selected applicant scored at least
1300? Show your work.
Question 3b. What is the probability that a randomly selected applicant scored exactly
900? Show your work.
Question 3c. What percentage of applicants scored between 900 and 1000? Show your
work.
Question 3d. Calculate the 40th percentile of scores among the applicants. What does this
value represent in the context of the admissions process? Show your work.
Question 3e. Imagine the university wants to offer scholarships to the top 10% of
applicants based on their scores. What minimum score would an applicant need to qualify
for a scholarship? Show your work.
Question 3f. Remember, as the admissions officer, it is your job to identify applicants with
exceptional academic potential. Would you automatically recommend that applicants with
SAT scores above 1400 to be admitted into the university? Or do you think additional
criteria should also be considered? Explain your reasoning.
Answer: Add more "markdown" text and code cells below as needed.

Question 3a. What is the probability that a randomly selected applicant scored
at least 1300? Show your work.
Answer:


To determine the probability that a randomly selected applicant scored at least 1300, we
utilize the standard normal distribution (Z-score) formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (1300)
• μ is the mean score
• σ is the standard deviation
Substituting the given values:
• X = 1300
• μ = 1016.566
• σ = 299.8892448941829
We compute:
Z = (1300 - 1016.566) / 299.8892448941829
Z ≈ 0.945
Using an online calculator, we determine that the probability of a Z-score of 0.945 or higher
is approximately 0.1723.
Hence, the probability that a randomly selected applicant scored at least 1300 is
approximately 0.1723 or 17.23%.
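The assignment asks for hand calculations, so code should not replace the work above; still, the tail probability can be verified with only the standard library, assuming the same μ and σ as quoted:

```python
import math

mu = 1016.566
sigma = 299.8892448941829

z = (1300 - mu) / sigma
# Upper-tail probability P(Z >= z) via the complementary error function:
# P(Z >= z) = 0.5 * erfc(z / sqrt(2))
p_at_least_1300 = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 3), round(p_at_least_1300, 4))
```

This reproduces the Z of about 0.945 and a tail probability of roughly 0.172, matching the online-calculator value.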

Question 3b. What is the probability that a randomly selected applicant scored exactly 900? Show your work.
Answer:
For a continuous distribution such as the normal, the probability that a randomly selected
applicant scored exactly 900 is 0; only ranges of scores carry probability. What we can
compute is the probability of scoring approximately 900, say within half a point of it
(899.5 to 900.5), using the Z-score formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (900)
• μ is the mean score (1016.566)
• σ is the standard deviation (299.8892448941829)
Substituting the given values:
Z = (900 - 1016.566) / 299.8892448941829
Z ≈ -0.389
The standard normal density at Z ≈ -0.389 is about 0.3699. Dividing by σ converts this to
the density of the scores themselves, so the probability of a score within the one-point band
around 900 is approximately 0.3699 / 299.889 ≈ 0.0012, or 0.12%.
(Note that the Z-table value 0.3486 at Z = -0.389 is Φ(-0.389), the probability of scoring
below 900, not the probability of scoring exactly 900.)
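A sketch of the density-based approximation for "within half a point of 900" (verification only, with the same μ and σ as above):

```python
import math

mu = 1016.566
sigma = 299.8892448941829

z = (900 - mu) / sigma
# Normal density at the score 900, expressed per score point
density = math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))
# Probability of landing within half a point of 900 (899.5 to 900.5),
# approximated as density * band width of 1
p_near_900 = density * 1.0

print(round(z, 3), round(p_near_900, 4))
```

The result is on the order of 0.001, illustrating why the probability of any single exact score is effectively zero.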

Question 3c. What percentage of applicants scored between 900 and 1000? Show your work.
Answer:
To calculate the percentage of applicants who scored between 900 and 1000, we can use
the standard normal distribution (Z-score) formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (900 or 1000)
• μ is the mean score
• σ is the standard deviation
For a score of 900:
• X1 = 900
• Z1 = (X1 - μ) / σ
For a score of 1000:
• X2 = 1000
• Z2 = (X2 - μ) / σ
Substituting the given values:
• μ = 1016.566
• σ = 299.8892448941829
We compute:
For 900: Z1 ≈ -0.389
For 1000: Z2 ≈ -0.055


Using an online calculator, we find that the probability of a Z-score between -0.389 and
-0.055 is approximately 0.1294.
Therefore, the percentage of applicants who scored between 900 and 1000 is
approximately 12.94%.
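The same erfc identity used earlier gives Φ(z) = 0.5·erfc(−z/√2), so the between-probability can be double-checked without a table (again only as verification of the hand work, with the same μ and σ):

```python
import math

def phi(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

mu = 1016.566
sigma = 299.8892448941829

z1 = (900 - mu) / sigma    # Z-score for 900
z2 = (1000 - mu) / sigma   # Z-score for 1000
p_between = phi(z2) - phi(z1)

print(round(p_between, 4))
```

The difference of the two CDF values lands at roughly 0.129, agreeing with the table-based answer.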

Question 3d. Calculate the 40th percentile of scores among the applicants. What does this value represent in the context of the admissions process? Show your work.
Answer:
The 40th percentile of scores among the applicants represents the score below which 40%
of the applicants fall.
To calculate the 40th percentile, we can use the formula:
Percentile = μ + (Z * σ)
where:
• Percentile is the desired percentile (40th percentile)
• μ is the mean score
• σ is the standard deviation
• Z is the Z-score corresponding to the desired percentile
Using an online calculator, we find that the Z-score corresponding to the 40th percentile is
approximately -0.253.
Z_40th_percentile = -0.253
Substituting the given values, we compute:
Percentile_40th = 1016.566 + (-0.253 * 299.8892448941829)
Percentile_40th ≈ 940.69
Therefore, the 40th percentile of scores among the applicants is approximately 940.69.
In the context of the admissions process, this means that 40% of the applicants scored
below roughly 940.69.
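statistics.NormalDist in the Python standard library can invert the normal CDF directly, which makes a handy check on the table-based percentile above (same μ and σ; the exact Z is -0.2533 rather than the rounded -0.253, so the result differs by a fraction of a point):

```python
from statistics import NormalDist

mu = 1016.566
sigma = 299.8892448941829

# inv_cdf(0.40) returns the score below which 40% of applicants fall
p40 = NormalDist(mu, sigma).inv_cdf(0.40)

print(round(p40, 1))
```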

Question 3e. Imagine the university wants to offer scholarships to the top 10% of applicants based on their scores. What minimum score would an applicant need to qualify for a scholarship? Show your work.
Answer:


To offer scholarships to the top 10% of applicants based on their scores, we need to
determine the minimum score required to qualify, which corresponds to the 90th
percentile.
To calculate the 90th percentile, we can use the following formula:
Percentile = μ + (Z * σ)
where:
• Percentile is the desired percentile (90th percentile)
• μ is the mean score
• σ is the standard deviation
• Z is the Z-score corresponding to the desired percentile
Given:
• μ = 1016.566
• σ = 299.8892448941829
• Z_90th_percentile = 1.282
Substituting the given values, we compute:
Percentile_90th = 1016.566 + (1.282 * 299.8892448941829)
Percentile_90th ≈ 1401.02
Therefore, the minimum score an applicant would need to qualify for a scholarship is
approximately 1401.
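As with the 40th percentile, NormalDist offers a quick cross-check of the scholarship cutoff under the same μ and σ:

```python
from statistics import NormalDist

mu = 1016.566
sigma = 299.8892448941829

# inv_cdf(0.90) returns the score that only the top 10% of applicants exceed
cutoff = NormalDist(mu, sigma).inv_cdf(0.90)

print(round(cutoff, 1))
```

The exact Z for the 90th percentile is 1.2816, so the computed cutoff sits just under the hand-calculated value of about 1401.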

Question 3f. Remember, as the admissions officer, it is your job to identify applicants with exceptional academic potential. Would you automatically recommend that applicants with SAT scores above 1400 be admitted into the university? Or do you think additional criteria should also be considered? Explain your reasoning.
Answer:
While SAT scores serve as a significant yardstick for evaluating an applicant's academic
prowess, they shouldn't serve as the exclusive determinant for admission.
A well-rounded assessment involves considering various facets, including:
• Academic performance throughout high school
• Grade Point Average (GPA)
• Class standing
• Endorsements from educators


• Personal statements
• Extracurricular involvements
• Performance in other standardized tests
By examining a diverse array of criteria, the admissions panel can gain a deeper insight into
an applicant's capabilities, enabling more informed decisions regarding acceptance.
Moreover, it's vital to acknowledge that SAT scores can fluctuate notably across different
contexts and institutions. A score of 1400 at one school may not necessarily reflect the
same level of achievement as it does at another institution.
Consequently, I would not propose automatic admission for applicants surpassing the 1400
SAT score threshold. Instead, I advocate for a comprehensive approach that takes into
account all pertinent factors before reaching any admission verdicts.

D. Statistical Intuition in Public Health

Question 4:
Now imagine that it is year 2034 and you are working as a public health researcher in the
UAE. You are working on a project to assess vaccination coverage for a new global
pandemic. The UAE government has implemented a widespread vaccination campaign to
combat the spread of the virus and achieve herd immunity. You want to determine the
proportion of individuals who have received the new vaccine among a sample of 100
residents in different parts of the country.
To Begin.
Run the code below. It will provide you with a random sample of 100 residents. It will save
this data to a CSV file called "Vaccinated.csv". Again, you need to submit this file in the
same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent
# code blocks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will generate data and save it to a CSV file called
# "Vaccinated.csv". You need to submit it in the same zip folder as your
# other files.

try:
    Vaccinated = pd.read_csv('Vaccinated.csv')
except FileNotFoundError:
    num_samples = 100
    vaccinated = np.random.choice(["Yes", "No"], size=num_samples)
    Vaccinated = pd.DataFrame({'Vaccinated': vaccinated})
    # index=False keeps the saved file clean when it is read back in
    Vaccinated.to_csv('Vaccinated.csv', index=False)

# Have a look at the Vaccinated dataset.
Vaccinated.head()

{"summary":"{\n \"name\": \"Vaccinated\",\n \"rows\": 100,\n \"fields\":


[\n {\n \"column\": \"Vaccinated\",\n \"properties\": {\n
\"dtype\": \"category\",\n \"num_unique_values\": 2,\n
\"samples\": [\n \"No\",\n \"Yes\"\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n
]\n}","type":"dataframe","variable_name":"Vaccinated"}

Now, use the dataset to answer the following questions.


IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to
your conclusions.
• Please do not use code to solve these questions. The questions are designed to test
your understanding.
Question 4a. What is the proportion of people who have received the vaccine (based on
the dataset you have)?
Question 4b. Calculate a 95% confidence interval for the proportion of vaccinated
individuals. What does this interval tell us about the likely range of vaccination coverage in
the entire population? Show your work.
Question 4c. What sample size would be required to estimate the proportion of vaccinated
individuals in the country with a 95% confidence level and a margin of error of 0.02?
Show your work.
Question 4d. If you wanted to increase the precision of your estimate, what strategies
could you employ to achieve this goal? Explain your reasoning.
Question 4e. Analyze the effectiveness of the current vaccination campaign using the
proportion of vaccinated individuals and the confidence interval. What recommendations
would you make for future campaigns?
Answer: Add more "markdown" text and code cells below as needed.

Question 4a. Proportion of Vaccinated Individuals


The dataset consists of 100 individuals, out of which 54 have been vaccinated.
• Proportion of vaccinated individuals:
• p = Number of vaccinated individuals / Total number of individuals
• p = 54 / 100 = 0.54
So, 54% of the individuals in our dataset are vaccinated.

Question 4b. 95% Confidence Interval for the Proportion


Answer:
To compute the 95% confidence interval for the proportion, we follow these steps:
First, calculate the standard error (SE):
• SE = sqrt((p * (1 - p)) / n)
• SE = sqrt((0.54 * (1 - 0.54)) / 100)
• SE = sqrt(0.2484 / 100)
• SE = sqrt(0.002484)
• SE ≈ 0.0498
Then, calculate the 95% confidence interval using a Z-score of 1.96:
• CI = p ± Z * SE
• CI = 0.54 ± 1.96 * 0.0498
• CI ≈ 0.54 ± 0.0977
• CI ≈ (0.4423, 0.6377)
Therefore, we are 95% confident that the true proportion of vaccinated individuals in the
population lies between 44.23% and 63.77%.
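The interval arithmetic can be sanity-checked with a few lines (verification only, since the assignment asks for hand work; p = 0.54 and n = 100 as above):

```python
import math

p_hat = 0.54   # sample proportion of vaccinated residents
n = 100        # sample size

se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
lo = p_hat - 1.96 * se                    # lower 95% bound
hi = p_hat + 1.96 * se                    # upper 95% bound

print(round(se, 4), round(lo, 4), round(hi, 4))
```

Note that the standard error here is about 0.05, not 0.16; dividing p(1 − p) by n before taking the square root is the step that is easy to slip on.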

Question 4c. Required Sample Size for a Margin of Error of 0.02


Answer:
To determine the required sample size for a margin of error of 0.02, we use the following
formula:
n = (Z^2 * p * (1 - p)) / E^2
Given:
• Margin of error (E) = 0.02
• Z-score corresponding to 95% confidence level (Z) ≈ 1.96
• Proportion of vaccinated individuals (p) = 0.54
Substituting these values into the formula, we get:
n = (1.96^2 * 0.54 * (1 - 0.54)) / 0.02^2
n ≈ (3.8416 * 0.54 * 0.46) / 0.0004
n ≈ 0.9543 / 0.0004
n ≈ 2385.6
Rounding up, you would need a sample size of approximately 2386 individuals to estimate
the proportion of vaccinated individuals within a margin of error of 0.02 with 95%
confidence.
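The sample-size formula above can be confirmed in a few lines (again just a cross-check of the hand calculation, with the same z, p, and margin of error):

```python
import math

z = 1.96      # Z-score for 95% confidence
p_hat = 0.54  # sample proportion used as the planning value
e = 0.02      # desired margin of error

n = (z ** 2 * p_hat * (1 - p_hat)) / e ** 2
n_required = math.ceil(n)  # a required sample size is always rounded up

print(round(n, 1), n_required)
```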

Question 4d. Enhancing Precision Strategies


Answer:
To refine the precision of your estimation, explore the following strategies:
1. Augmenting Sample Size: Expand the sample size beyond 2386. A larger sample
size can lead to more accurate estimates and narrower confidence intervals.
2. Ensuring Representativeness: Guarantee that the sample is a true reflection of the
entire population by employing techniques like stratified sampling. This helps
mitigate biases and ensures comprehensive coverage of diverse groups within the
population.
3. Minimizing Margin of Error: While challenging, endeavor to further reduce the
margin of error. However, bear in mind that this typically necessitates an even
larger sample size. Strive to strike a balance between precision and practicality.

By implementing these strategies, you can enhance the precision of your estimates, thereby
yielding more reliable insights into the proportion of vaccinated individuals in the
population.

Question 4e. Analysis and Recommendations


Insights and Suggestions
Based on the analysis revealing a vaccination rate of 54% with a 95% confidence interval
ranging from 44.23% to 63.77%, it's evident that the vaccination coverage may fall short of
the desired levels necessary for achieving herd immunity or meeting other crucial public
health objectives.
Recommendations for Future Campaigns:
1. Targeted Outreach: Implement targeted outreach initiatives tailored to
unvaccinated populations. This approach can help address specific concerns and
barriers to vaccination, ultimately increasing coverage rates.
2. Enhanced Accessibility: Improve accessibility to vaccination centers by
introducing measures such as mobile clinics or extending operating hours. This
ensures greater convenience for individuals seeking vaccination, particularly those
with time constraints or limited mobility.
3. Educational Campaigns: Launch educational campaigns aimed at combating
vaccine hesitancy and dispelling misinformation. Providing accurate information
about vaccine safety, efficacy, and benefits can help alleviate concerns and
encourage more individuals to get vaccinated.

Conclusion
This analysis offers valuable insights into the current status of the vaccination campaign
and highlights potential areas for improvement to boost vaccination rates. By
implementing targeted outreach, enhancing accessibility, and conducting educational
campaigns, we can strive towards achieving higher vaccination coverage and contributing
to improved public health outcomes.
