
Page 1 of 33 - Cover Page Submission ID trn:oid:::1:2866706450

MR
Student 2
Quick Submit

Winneconne High School

Document Details

Submission ID: trn:oid:::1:2866706450
Submission Date: Mar 27, 2024, 5:43 AM CDT
Download Date: Mar 27, 2024, 5:52 AM CDT
File Name: Assignment_2_Spring_2024_2.docx
File Size: 128.0 KB
Pages: 31
Words: 6,951
Characters: 42,318



AI Writing Overview

How much of this submission has been generated by AI?

*16% of qualifying text in this submission has been determined to be generated by AI.

* Low scores have a higher likelihood of false positives.

Caution: Percentage may not indicate academic misconduct. Review required. It is essential to understand the limitations of AI detection before making decisions about a student's work. We encourage you to learn more about Turnitin's AI detection capabilities before using the tool.

Frequently Asked Questions

What does the percentage mean?


The percentage shown in the AI writing detection indicator and in the AI writing report is the amount of qualifying text within the
submission that Turnitin's AI writing detection model determines was generated by AI.

Our testing has found that there is a higher incidence of false positives when the percentage is less than 20. In order to reduce the
likelihood of misinterpretation, the AI indicator will display an asterisk for percentages less than 20 to call attention to the fact that
the score is less reliable.

However, the final decision on whether any misconduct has occurred rests with the reviewer/instructor. They should use the
percentage as a means to start a formative conversation with their student and/or use it to examine the submitted assignment in
greater detail according to their school's policies.

What does 'qualifying text' mean?

Our model only processes qualifying text in the form of long-form writing. Long-form writing means individual sentences contained in paragraphs that make up a longer piece of written work, such as an essay, a dissertation, or an article. Qualifying text that has been determined to be AI-generated will be highlighted blue on the submission text.

Non-qualifying text, such as bullet points, annotated bibliographies, etc., will not be processed and can create disparity between the submission highlights and the percentage shown.

How does Turnitin's indicator address false positives?

Sometimes false positives (incorrectly flagging human-written text as AI-generated) can include lists without a lot of structural variation, text that literally repeats itself, or text that has been paraphrased without developing new ideas. If our indicator shows a higher amount of AI writing in such text, we advise you to take that into consideration when looking at the percentage indicated.

In a longer document with a mix of authentic writing and AI-generated text, it can be difficult to determine exactly where the AI writing begins and the original writing ends, but our model should give you a reliable guide to start conversations with the submitting student.

Disclaimer
Our AI writing assessment is designed to help educators identify text that might be prepared by a generative AI tool. Our AI writing assessment may not always be accurate (it may misidentify
both human and AI-generated text) so it should not be used as the sole basis for adverse actions against a student. It takes further scrutiny and human judgment in conjunction with an
organization's application of its specific academic policies to determine whether any academic misconduct has occurred.

AI Writing Submission

Statistical Intuitions and Applications


Assignment 2
Important Information:
In Assignment 2 you will apply the statistical concepts that we've encountered so far in the
course to solve FOUR questions related to real-world scenarios. Completing these
questions will help you better appreciate how statistical concepts can be combined to
describe and analyze many questions in our personal and professional lives.
For each question you will first encounter a block of code that generates a dataset for you
to work with (similar to Assignment 1). You will need to save these datasets as CSV files.
Submission Requirements
1. Submit all answers along with their corresponding code as a searchable PDF.

2. All generated csv and .ipynb files must be submitted in a zip-folder as a secondary
source.

3. Ensure your zip folder contains four csv files (i.e., the csv files for each of the four
questions below; YourName.csv, RelianceRetailVisits_ordered.csv, Scores.csv,
Vaccinated.csv).

4. You may use Jupyter notebook or Colab as per your convenience.

5. You should ONLY use the concepts and techniques covered in the course to
generate your answers. Statistical techniques that are NOT covered in the
course will NOT be evaluated.

Note: Reach out to your instructor with any questions regarding CSV files, code, or the zip folder.
Non-compliance with the above instructions will result in a 0 grade on the relevant
portions of the assignment. Your instructor will grade your assignment based on what you
submitted. Failure to submit the assignment or submitting an assignment intended for
another class will result in a 0 grade, and resubmission will not be allowed. Make sure that
you submit your original work. Suspected cases of plagiarism will be treated as potential
academic misconduct and will be reported to the College Academic Integrity Committee for
a formal investigation. As part of this procedure, your instructor may require you to meet
with them for an oral exam on the assignment.


A. Statistical Intuitions in Mental Health

Question 1:
We are going to work with a dataset that was collected on mental health issues. In total,
824 individuals (teenagers, college students, housewives, business professionals, and
other groups) completed the survey. Their data provides valuable insights into the
prevalence of, and factors associated with, mental health issues in different groups.
To Begin.
Run the code below. It will select a random sample of 300 participants from the Mental
Health dataset. The code will then generate a CSV file called Name.csv. You need to change
the name of the file to your actual name and then submit it in the zip folder as a secondary
file.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will create a csv file containing a random sample of 300
# respondents. You will answer the questions below based on this sample.

# Look at the code below. Now replace 'Name.csv' with your actual name (e.g., 'Sara.csv').
# The code will generate a csv file that you need to submit in the zip folder as a secondary file.

try:
    df = pd.read_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/mental_health_finaldata_1.csv")
    df1 = original_data.sample(300)
    df1.to_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name
    df = pd.read_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name

df = pd.DataFrame(df)
df.to_csv('Rouda Alsuwaidi.csv')  # replace Name with your own name


df.head()

[df.head() output (Colab dataframe summary): 300 rows; columns Unnamed: 0, Age, Gender, Occupation, Days_Indoors, Growing_Stress, Quarantine_Frustrations, Changes_Habits, Mental_Health_History, Weight_Change, Mood_Swings, Coping_Struggles, Work_Interest, Social_Weakness.]

Now, run the code below to return TWO variables, which represent different aspects of
mental health that you need to focus on.

# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

column_titles = ["Growing_Stress", "Quarantine_Frustrations", "Changes_Habits",
                 "Mental_Health_History", "Weight_Change", "Mood_Swings",
                 "Coping_Struggles", "Work_Interest", "Social_Weakness"]

# Randomly select 2 variables
selected_columns = random.sample(column_titles, 2)

# Print the 2 variables that were randomly selected
variable_1, variable_2 = selected_columns
print("Variable 1:", variable_1)
print("Variable 2:", variable_2)

Variable 1: Weight_Change
Variable 2: Quarantine_Frustrations

Using the sample dataset from the CSV file you generated, answer the following questions:
Question 1a. Is each of these two variables independent of being female? Explain your
reasoning. Make sure to include a two-way table for each of these two variables with
gender, and show all your calculations to support your answers.


Question 1b. Is there a relationship between the two variables returned by the code?
Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all
your probability calculations in your answer.
Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing
Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer
with the relevant statistical analysis.
Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to
answer the following question:
Imagine ZU wanted to use the insights from this research to improve its mental health
support program. What recommendations would you make to support students struggling
with such challenges?
Answer: Add more "markdown" text and code cells below as needed.

Question 1a. Is each of these two variables independent of being female? Explain your reasoning. Make sure to include a two-way table for each of these two variables with gender, and show all your calculations to support your answers.
import pandas as pd
from scipy.stats import chi2_contingency

# Create contingency tables for each variable with gender
contingency_table_1 = pd.crosstab(df['Gender'], df[variable_1])
contingency_table_2 = pd.crosstab(df['Gender'], df[variable_2])

# Perform Chi-square test and get p-values
_, p_value_1, _, _ = chi2_contingency(contingency_table_1)
_, p_value_2, _, _ = chi2_contingency(contingency_table_2)

# Print two-way tables
print("Two-way table for Variable 1:")
print(contingency_table_1)
print("\nTwo-way table for Variable 2:")
print(contingency_table_2)

# Interpret the results and explain reasoning
print("\nQuestion 1a. Is each of these two variables independent of being female?")

if p_value_1 > 0.05:
    print(f"Variable 1 is independent of being female (p-value: {p_value_1:.4f}).")
else:
    print(f"Variable 1 is not independent of being female (p-value: {p_value_1:.4f}).")

if p_value_2 > 0.05:
    print(f"Variable 2 is independent of being female (p-value: {p_value_2:.4f}).")
else:
    print(f"Variable 2 is not independent of being female (p-value: {p_value_2:.4f}).")

Two-way table for Variable 1:
Weight_Change   No  Yes
Gender
Female          58  103
Male            40   99

Two-way table for Variable 2:
Quarantine_Frustrations   No  Yes
Gender
Female                    48  113
Male                      37  102

Question 1a. Is each of these two variables independent of being female?
Variable 1 is independent of being female (p-value: 0.2258).
Variable 2 is independent of being female (p-value: 0.6285).
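The p-values above come from scipy's chi-square test; the same independence check can also be reproduced by hand from the printed two-way table, which helps satisfy the "show all your calculations" requirement. The sketch below hard-codes the Weight_Change counts from the output above. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the statistic behind the 0.2258 p-value is slightly smaller than the uncorrected one computed here.

```python
import numpy as np

# Observed counts copied from the two-way table above
# (rows: Female, Male; columns: No, Yes).
observed = np.array([[58, 103],
                     [40, 99]])

# Expected count for each cell under independence:
# row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Uncorrected chi-square statistic: sum of (O - E)^2 / E over all cells
statistic = ((observed - expected) ** 2 / expected).sum()

print("Expected counts:\n", expected.round(2))
print("Chi-square statistic (no continuity correction):", round(statistic, 4))
```

Comparing the observed counts against the expected counts cell by cell makes the "calculations" behind the test explicit, rather than relying on the library's p-value alone.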

Question 1b. Is there a relationship between the two variables returned by the code? Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all your probability calculations in your answer.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# Create a two-way table for the two variables
two_way_table = pd.crosstab(df[variable_1], df[variable_2])

# Perform Chi-square test and get p-value
_, p_value, _, _ = chi2_contingency(two_way_table)

# Print the two-way table
print("Two-way table:")
print(two_way_table)

# Interpret the results
print("\nQuestion 1b. Is there a relationship between the two variables?")
if p_value > 0.05:
    print("There is no relationship between the two variables.")
else:
    print("There is a relationship between the two variables.")

# Create a stacked bar graph
two_way_table.plot(kind="bar", stacked=True)
plt.xlabel(variable_1)
plt.ylabel("Frequency")
plt.title(f"Stacked Bar Graph of {variable_1} and {variable_2}")
plt.show()

Two-way table:
Quarantine_Frustrations  No  Yes
Weight_Change
No                       23   75
Yes                      62  140

Question 1b. Is there a relationship between the two variables?
There is no relationship between the two variables.
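As a probability-based sanity check on the "no relationship" conclusion, the conditional probabilities can be computed directly from the counts in the two-way table above. Under independence, the probability of quarantine frustrations would be roughly the same whether or not a weight change occurred. This is a sketch that hard-codes the printed counts:

```python
# Counts from the two-way table above
# (rows: Weight_Change, columns: Quarantine_Frustrations).
counts = {("No", "No"): 23, ("No", "Yes"): 75,
          ("Yes", "No"): 62, ("Yes", "Yes"): 140}

total = sum(counts.values())

# Marginal and conditional probabilities of Quarantine_Frustrations = "Yes"
p_yes = (counts[("No", "Yes")] + counts[("Yes", "Yes")]) / total
p_yes_given_change = counts[("Yes", "Yes")] / (counts[("Yes", "No")] + counts[("Yes", "Yes")])
p_yes_given_no_change = counts[("No", "Yes")] / (counts[("No", "No")] + counts[("No", "Yes")])

print(f"P(Yes) = {p_yes:.3f}")
print(f"P(Yes | Weight_Change = Yes) = {p_yes_given_change:.3f}")
print(f"P(Yes | Weight_Change = No)  = {p_yes_given_no_change:.3f}")
```

The conditional probabilities (about 0.69 and 0.77) stay close to the marginal (about 0.72), which is consistent with the chi-square test's failure to find a significant association.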


Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer with the relevant statistical analysis.
# Calculate the conditional probabilities
p_variable_2_given_variable_1 = df[df[variable_1] == "Yes"][variable_2].value_counts(normalize=True)["Yes"]
p_variable_2_given_not_variable_1 = df[df[variable_1] == "No"][variable_2].value_counts(normalize=True)["Yes"]

# Calculate the relative risk
relative_risk = p_variable_2_given_variable_1 / p_variable_2_given_not_variable_1

# Print the results
print("P(Variable 2 | Variable 1) =", p_variable_2_given_variable_1)
print("P(Variable 2 | not Variable 1) =", p_variable_2_given_not_variable_1)
print("Relative risk =", relative_risk)

# Interpret the results
if relative_risk > 1:
    print("\nThe existence of Variable 1 increases the likelihood of experiencing Variable 2 by a factor of", relative_risk)
else:
    print("\nThe existence of Variable 1 decreases the likelihood of experiencing Variable 2 by a factor of", 1/relative_risk)

P(Variable 2 | Variable 1) = 0.693069306930693
P(Variable 2 | not Variable 1) = 0.7653061224489796
Relative risk = 0.9056105610561056

The existence of Variable 1 decreases the likelihood of experiencing Variable 2 by a factor of 1.1042274052478134
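A relative risk of about 0.906 is often easier to communicate as a percentage change. A short sketch using the conditional probabilities printed above:

```python
# Conditional probabilities copied from the output above
p_given_yes = 0.693069306930693   # P(Variable 2 | Variable 1)
p_given_no = 0.7653061224489796   # P(Variable 2 | not Variable 1)

relative_risk = p_given_yes / p_given_no

# A relative risk below 1 means the probability is lower when Variable 1 is present
pct_change = (relative_risk - 1) * 100
print(f"Relative risk: {relative_risk:.4f}")
print(f"Change in likelihood: {pct_change:+.1f}%")
```

In words: in this sample, respondents with Variable 1 were roughly 9% less likely to report Variable 2, rather than more likely.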


Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to answer the following question:

Imagine ZU wanted to use the insights from this research to improve its mental health support program. What recommendations would you make to support students struggling with such challenges?

Based on what we learned from looking at the data:

1. Support based on gender:
• Different types of help might be needed for male and female students because they may have different experiences. So, it is important to offer support that fits everyone's needs.

2. Teach everyone about mental health:
• We should teach students about mental health so they know what to look out for and where to go for help if they need it. This helps everyone feel more comfortable talking about mental health issues.

3. Have many different kinds of help available:
• Since having one mental health problem can make it more likely to have another one, it is good to have many different ways to get help. This could include talking to someone, joining a support group, or going to workshops.

4. Make a friendly campus:
• We need to make sure the campus feels like a welcoming place for everyone. This means making it easy for students to talk to each other and feel supported by their peers and teachers.

5. Encourage asking for help:
• Sometimes students might feel embarrassed or afraid to ask for help with mental health problems. We want to make it clear that it is okay to ask for help and show students where they can go for support.


By implementing these recommendations, ZU can create a more supportive environment for students struggling with mental health challenges and help them get the support they need to succeed.

B. Statistical Intuition in Store Ratings

Question 2:
Imagine you are the manager of an electronics store in Dubai Mall. You are curious about
the distribution of customer ratings of your overall store services. So you ask random
customers who visit the store to complete a short survey, recording variables such as their
age group and overall experience rating.
To Begin
Run the code below. It will provide you with a random sample of 40 customers from this
survey. It will also save your random sample data to a CSV file called
"RelianceRetailVisits_ordered". Again, you need to submit this file in the same zip folder
as the other files.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")

    # Randomly sample 40 rows from the original dataset
    df = original_data.sample(n=40, random_state=42)

# Fill missing values in the 'Age Group' column with a default value
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years', '46 To 60 years']
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order, ordered=True)
df.sort_values(by='Age Group', inplace=True)

# Save the sorted DataFrame to a new CSV file
df.to_csv('RelianceRetailVisits_ordered.csv', index=False)

df.head()

[df.head() output (Colab dataframe summary): 40 rows; columns Customer Index, Age Group, OverallExperienceRatin.]

Use the random sample of data from the csv file you generated to answer the following
questions:
Question 2a. Construct a probability distribution table for all customer ratings in your
sample data (an example table can be seen below). Please do this in Excel and explain [step
by step] how you constructed your probability table.
[Image: example probability distribution table]
Question 2b. What is the probability that a randomly selected customer will have a rating
of AT MOST 3?
Question 2c. Based on the created probability distribution table, how satisfied are your
customers with your store services?
Question 2d. Find the expected rating of your store. Show your work and interpret your
answer in context.
Run the code below. It will generate the probability distribution graph for all your
customers' satisfaction ratings and the standard deviation.


Question 2a. Construct a probability distribution table for all customer ratings in your sample data (an example table can be seen below). Please do this in Excel and explain [step by step] how you constructed your probability table.
Step 1: Open the CSV file in Excel.
Step 2: Select the column containing the customer ratings.
Step 3: Use the COUNTIF function to count the number of times each rating appears in the column. For example, consider rating 1.
• Formula: =COUNTIF(C2:C41, 1)
• Explanation: This formula counts the number of times the rating '1' appears in the range C2:C41.
Step 4: Use the COUNTA function to calculate the total number of ratings.
• Formula: =COUNTA(C2:C41)
• Explanation: This formula counts all non-empty cells in the range C2:C41, giving the total number of ratings.
Step 5: Divide the count of each rating by the total number of ratings to get the probability of each rating.
• Formula: =COUNTIF(C2:C41, 1) / COUNTA(C2:C41)
• Explanation: This formula divides the count of '1' ratings by the total count of all ratings, resulting in the probability of a rating of '1'. The formula can be adjusted by replacing the '1' with any other rating value to find the probability of that rating occurring.
Step 6: Create a table with six columns and two rows: one row for the ratings and one for the probabilities.

The probability distribution table shows the probability of each customer rating occurring. For example, the probability of a customer giving a rating of 1 is 0.025, and the probability of a customer giving a rating of 5 is 0.3.
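The COUNTIF/COUNTA workflow above can be reproduced in pandas as a cross-check of the Excel table. The sketch below uses a small hypothetical ratings series standing in for column C of the spreadsheet; with the real file you would pass df['OverallExperienceRatin'] instead.

```python
import pandas as pd

# Hypothetical ratings standing in for column C of the spreadsheet
ratings = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 5])

counts = ratings.value_counts().sort_index()   # one COUNTIF per distinct rating
total = ratings.count()                        # COUNTA: number of non-empty cells
prob_table = counts / total                    # COUNTIF / COUNTA for every rating

print(prob_table)
```

Because value_counts enumerates every distinct rating at once, this version cannot drift out of sync the way a hand-edited COUNTIF column can, and the probabilities are guaranteed to sum to 1.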
Question 2b. What is the probability that a randomly selected customer will have a rating of AT MOST 3?

To determine the probability that a randomly selected customer will have a rating of AT MOST 3, we sum the probabilities of ratings 1, 2, and 3. Hence, the probability can be computed as follows:

P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3)
P(X ≤ 3) = 0.025 + 0.15 + 0.1 = 0.275

Therefore, the probability that a randomly selected customer will have a rating of AT MOST 3 is 0.275, or 27.5%.
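The same sum can be scripted so the cumulative probability stays consistent with the table from Question 2a:

```python
# Probabilities for ratings 1, 2 and 3, taken from the probability table above
p = {1: 0.025, 2: 0.15, 3: 0.1}

# P(X <= 3) is the sum of the individual probabilities
p_at_most_3 = sum(p.values())
print(f"P(X <= 3) = {p_at_most_3:.3f}")
```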
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from tabulate import tabulate

# Load data
try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")
    df = original_data.sample(n=40, random_state=42)

# Fill missing values in the 'Age Group' column with a default value
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years', '46 To 60 years']
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order, ordered=True)
df.sort_values(by='Age Group', inplace=True)

# Save the sorted DataFrame to a new CSV file
df.to_csv('RelianceRetailVisits_ordered.csv', index=False)

# Probability distribution graph for customer rating
plt.figure(figsize=(8, 6))
rating_counts = df['OverallExperienceRatin'].value_counts(normalize=True).sort_index()
plt.bar(rating_counts.index, rating_counts, alpha=0.7)
plt.title('Probability Distribution of Customer Rating')
plt.xlabel('Overall Experience Rating')
plt.ylabel('Probability')
plt.xticks(range(1, 6))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Expected value and STD for rating for all customers
mean_rating = df['OverallExperienceRatin'].mean()
std_rating = df['OverallExperienceRatin'].std()
print(f"Standard Deviation (STD) of Customer Rating: {std_rating:.2f}")

Standard Deviation (STD) of Customer Rating: 1.11

Question 2c. Based on the created probability distribution graph for customer rating, how satisfied are your customers with your store services?
The probability distribution graph provides insights into customer satisfaction with store
services. It illustrates that a substantial portion of customers, comprising 72% in total, gave
ratings of 4 or 5. This suggests a high level of satisfaction among a significant majority of
customers.
However, it's also notable that approximately 28% of customers provided ratings of 3 or
lower. This indicates areas where improvements could be made to enhance overall
satisfaction levels.


In summary, while a large portion of customers appear satisfied with store services, there
is still a notable minority whose experiences may require attention to further improve
overall customer satisfaction.

Question 2d. Find the expected rating of your store. Show your work and interpret your answer in context.

Determining the expected rating involves calculating the average rating that customers are likely to give, considering the probability of each rating. This is achieved by multiplying each rating by its corresponding probability and then summing the results. Using the probabilities from the distribution table (with P(4) = 1 − 0.025 − 0.15 − 0.1 − 0.3 = 0.425, so that all probabilities sum to 1):

expected_rating = (1 × 0.025) + (2 × 0.15) + (3 × 0.1) + (4 × 0.425) + (5 × 0.3) = 3.83

Therefore, the average customer is expected to give a rating of about 3.83 on a scale of 1 to 5. This suggests that, overall, customers are fairly satisfied with the store services, but there remains scope for enhancement.
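To guard against slips in hand arithmetic, the expected value can be recomputed from the probability table in Questions 2a-2b. P(4) is not stated explicitly in the excerpts above, so it is inferred here as whatever makes the probabilities sum to 1 (an assumption worth verifying against your own CSV):

```python
# Probabilities for ratings 1, 2, 3 and 5 from Questions 2a-2b;
# P(4) is inferred so that the distribution sums to 1 (assumption).
p = {1: 0.025, 2: 0.15, 3: 0.1, 5: 0.3}
p[4] = 1 - sum(p.values())

# E[X] = sum of x * P(X = x) over all ratings
expected_rating = sum(x * px for x, px in p.items())
print(f"P(4) = {p[4]:.3f}")
print(f"Expected rating: {expected_rating:.3f}")
```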

Question 2e. Interpret the Standard Deviation in context. What rating is considered unusual? Explain.

The standard deviation (STD) of 1.11 for customer ratings indicates a moderate amount of variability in how customers rate the store. This suggests that opinions among customers are somewhat diverse.

To identify which ratings are considered unusual, we use the two-standard-deviation rule: a rating is unusual if it falls outside

mean_rating ± 2 × std_rating

Substituting the expected rating of 3.83 (computed from the probability table in Question 2d) and the standard deviation of 1.11:

3.83 ± 2 × 1.11 = 3.83 ± 2.22 = 1.61 to 6.05

Hence, any rating below 1.61 would be considered unusual, and since the scale only goes up to 5, no rating can exceed the upper bound. In context, this implies that a rating of 1 would be considered exceptionally low, while even the highest rating of 5 still falls within the typical range of ratings.
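The two-standard-deviation bound can also be scripted as a check. This sketch assumes the mean recomputed from the probability table (with P(4) inferred as 0.425, an assumption) and the SD of 1.11 printed by the code earlier:

```python
# Mean recomputed from the probability table (P(4) inferred as 0.425, an assumption)
mean_rating = 1 * 0.025 + 2 * 0.15 + 3 * 0.1 + 4 * 0.425 + 5 * 0.3
std_rating = 1.11  # standard deviation printed by the earlier code block

# Usual range: within two standard deviations of the mean
lower = mean_rating - 2 * std_rating
upper = mean_rating + 2 * std_rating

# Ratings on the 1-5 scale that fall outside the usual range
unusual = [x for x in range(1, 6) if x < lower or x > upper]
print(f"Usual range: {lower:.2f} to {upper:.2f}")
print("Unusual ratings:", unusual)
```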
Run the code below. It will generate the probability distribution graphs for each of the age groups along with their discrete probability distribution tables, the expected values, and the standard deviation values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings

# Suppress the FixedFormatter warning raised by set_yticklabels
# (the filter must be installed before the plotting loop to take effect)
warnings.filterwarnings("ignore", category=UserWarning)

# Read the ordered sample saved earlier
data = pd.read_csv('RelianceRetailVisits_ordered.csv')

# Define age groups including the new one
age_groups = ['16 To 25 years', '26 To 35 years', '36 To 45 years', '46 To 60 years']

# Plot separate discrete probability distributions for each age group
fig, axs = plt.subplots(1, 4, figsize=(20, 6), sharex=True, gridspec_kw={'hspace': 0.5})

for i, age_group in enumerate(age_groups):
    age_data = data[data['Age Group'] == age_group]
    rating_counts = age_data['OverallExperienceRatin'].value_counts(normalize=True).sort_index()
    bars = axs[i].bar(rating_counts.index, rating_counts, alpha=0.7)
    # Title shows the age group with its mean and standard deviation
    axs[i].set_title(f'{age_group}\nMean: {age_data["OverallExperienceRatin"].mean():.2f} | '
                     f'SD: {age_data["OverallExperienceRatin"].std():.2f}')
    axs[i].set_xlabel('Overall Experience Rating')
    axs[i].set_ylabel('Probability (%)')
    axs[i].set_xticks(range(1, 6))  # x-axis ticks from 1 to 5
    # Format y-axis tick labels as percentages
    axs[i].set_yticklabels(['{:,.0%}'.format(x) for x in axs[i].get_yticks()])

    # Display percentages above each bar
    for bar in bars:
        height = bar.get_height()
        rating = bar.get_x() + bar.get_width() / 2
        label = '0%' if height == 0 else f'{height:.0%}'
        axs[i].text(rating, height, label, ha='center', va='bottom', fontsize=8)

    axs[i].grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()


Question 2f. Identify any trends or differences in customer satisfaction levels (and
variability) among the different age groups.
Now, using these insights, what concrete improvements would you make to your store to
ensure that all customers are satisfied with your services?

Insights and Distinctions:

Contentment Levels Analysis:

• 16-25: Generally pleased (mean rating of 4.00), yet displaying a higher frequency of
lower ratings (1 and 2) than the middle segments.
• 26-35: Equally content on average (mean rating of 4.00), with a relatively even
distribution of ratings.
• 36-45: Most content demographic (mean rating of 4.33), with the lowest occurrence
of lower ratings.
• 46-60: Least content age bracket (mean rating of 2.71), experiencing a higher
occurrence of lower ratings (1 and 2).

Variance Assessment:
• 16-25: Moderate variance in ratings (STD of 1.03).
• 26-35: Highest variance (STD of 1.18), indicating the broadest range of experiences.
• 36-45: Lowest variance (STD of 0.52), suggesting the most consistent experience.
• 46-60: Moderate variance (STD of 0.95); combined with the low mean, this reflects
consistently weaker experiences rather than a few isolated complaints.

Recommendations for Enhancement:

Targeting Mature Audiences:


• Tailor offerings to address the distinct needs and preferences of the 46-60 age
group to enhance their contentment.
• Introduce more senior-specific discounts, ensure easily accessible facilities, and
curate products/services tailored to their tastes and requirements.

Elevating Overall Experience:


• Focus efforts on enhancing the holistic shopping experience for all demographic
segments.


• Expand product offerings, refine store layout and ambiance, and maintain a
consistent standard of service.

Personalized Customer Care:


• Train staff to deliver exceptional customer service tailored to individual needs and
preferences.
• Provide personalized recommendations, promptly address concerns, and exceed
expectations to ensure satisfaction.

Loyalty Programs and Incentives:


• Implement loyalty programs or incentives to foster repeat patronage and garner
positive feedback.
• Offer discounts, exclusive promotions, or early access to new products to cultivate
customer loyalty.

Continuous Feedback Collection:


• Continuously solicit feedback from customers across all age groups to identify areas
for improvement and monitor satisfaction levels.
• Utilize surveys, feedback forms, or social media engagement to gather insights and
respond promptly to customer feedback.

Consistency and Ongoing Training:


• Institute regular training sessions to keep staff updated on the latest methodologies
and benchmarks.
• Foster a culture of continuous improvement and learning within the organization.

Integration of Technology:
• Explore opportunities to integrate technology for streamlined processes and
enriched customer interactions.
• Implement self-service checkout systems, mobile payment options, and online
ordering to enhance convenience.

Community Engagement Initiatives:


• Engage with the local community to build rapport and a sense of community.
• Sponsor local events, participate in charitable endeavors, and champion community
causes to strengthen brand loyalty.

Sustainable Practices:
• Adopt environmentally conscious practices to appeal to eco-minded clientele.
• Minimize waste, utilize sustainable materials, and support eco-friendly initiatives to
demonstrate corporate responsibility.

Utilization of Data Analytics:


• Leverage data analytics to gain insights into customer behavior and preferences.


• Analyze sales trends, customer feedback, and demographic information to inform
strategic decisions.

Competitive Analysis:
• Conduct regular competitive analyses to stay informed about industry trends and
benchmark against competitors.
• Identify areas for differentiation and capitalize on unique selling propositions.

Crisis Management Strategies:


• Develop a comprehensive crisis management protocol to effectively navigate
unforeseen events.
• Train staff on emergency procedures, communication protocols, and customer
support during crises.

Exploration of Expansion Opportunities:


• Explore opportunities for expansion into new markets or geographical locations.
• Conduct market research to identify burgeoning areas of growth and develop
strategic expansion plans.

Employee Recognition and Motivation:


• Acknowledge and incentivize employees for their dedication and contributions.
• Implement reward programs, performance incentives, and employee appreciation
initiatives to boost morale.

Collaborative Partnerships with Suppliers:


• Foster strong relationships with suppliers to ensure timely delivery of high-quality
products.
• Negotiate favorable terms and engage in collaborative marketing efforts for mutual
benefit.

Health and Safety Measures:


• Prioritize the health and safety of both customers and staff members.
• Implement rigorous cleaning protocols, provide personal protective equipment, and
enforce social distancing measures.

Adaptability and Flexibility:


• Remain adaptable and flexible to respond quickly to changing market conditions
and customer needs.
• Embrace innovation and experimentation to stay ahead in a dynamic business
environment.
Answer: Add more "markdown" text cells below as needed.


C. Statistical Intuition in SAT Exams

Question 3:
Imagine you are working for a prestigious university in the UAE. It is your job to decide
which students are admitted to the university. To help you do this, you analyze the high
school (SAT) scores of potential students. These scores help you understand their academic
readiness and potential for success at the university.
You have just received the scores of applicants who would like to join the university in
September 2024. These scores follow a normal distribution.
To Begin.
Run the code below. It will generate a dataset with the students' scores. It will also calculate
the mean (μ) and standard deviation (σ) of these scores. This dataset will be saved as a
CSV file called "Scores.csv". Again, you need to submit this file in the same zip folder as
your other files.
# Load the following libraries so that they can be applied in the subsequent
# code blocks

import pandas as pd
import numpy as np
import random

try:
    SATScores = pd.read_csv('Scores.csv')
except FileNotFoundError:
    num_samples = 1000
    mean_score = random.randint(800, 1200)
    std_deviation = random.randint(100, 300)
    scores = np.random.normal(mean_score, std_deviation, num_samples)
    scores = np.round(scores, 0)
    SATScores = pd.DataFrame({'Scores': scores})
    # index=False keeps the saved file clean when it is read back in
    SATScores.to_csv('Scores.csv', index=False)

# Calculate mean and standard deviation
mean_score = SATScores['Scores'].mean()
std_deviation = SATScores['Scores'].std()

# Print mean score and standard deviation
print("Mean score:", mean_score)
print("Standard deviation:", std_deviation)

# Display the dataset
SATScores.head()


Mean score: 859.107
Standard deviation: 231.946737075217

(Dataframe output: SATScores, 1000 rows; the 'Scores' column ranges from 201.0 to 1817.0.)

Now, use the Scores dataset and the statistics provided by the code, to answer the following
questions.
IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to your
conclusions.
• If you use online calculators then please include screenshots of those calculators as
part of your work.
• Please do not use code to solve these questions. The questions are designed to test your
understanding.
Question 3a. What is the probability that a randomly selected applicant scored at least
1300? Show your work.
Question 3b. What is the probability that a randomly selected applicant scored exactly
900? Show your work.
Question 3c. What percentage of applicants scored between 900 and 1000? Show your
work.
Question 3d. Calculate the 40th percentile of scores among the applicants. What does this
value represent in the context of the admissions process? Show your work.
Question 3e. Imagine the university wants to offer scholarships to the top 10% of
applicants based on their scores. What minimum score would an applicant need to qualify
for a scholarship? Show your work.
Question 3f. Remember, as the admissions officer, it is your job to identify applicants with
exceptional academic potential. Would you automatically recommend that applicants with
SAT scores above 1400 to be admitted into the university? Or do you think additional
criteria should also be considered? Explain your reasoning.
Answer: Add more "markdown" text and code cells below as needed.

Question 3a. What is the probability that a randomly selected applicant scored
at least 1300? Show your work.
Answer:


To determine the probability that a randomly selected applicant scored at least 1300, we
utilize the standard normal distribution (Z-score) formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (1300)
• μ is the mean score
• σ is the standard deviation
Substituting the given values:
• X = 1300
• μ = 1016.566
• σ = 299.8892448941829
We compute:
Z = (1300 - 1016.566) / 299.8892448941829
Z ≈ 0.945
Using an online calculator, we determine that the probability of a Z-score of 0.945 or higher
is approximately 0.1723.
Hence, the probability that a randomly selected applicant scored at least 1300 is
approximately 0.1723 or 17.23%.
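The assignment asks for hand calculations, so code should not replace the work above; still, the tail probability can be verified with only the standard library, assuming the same μ and σ as quoted:

```python
import math

mu = 1016.566
sigma = 299.8892448941829

z = (1300 - mu) / sigma
# Upper-tail probability P(Z >= z) via the complementary error function:
# P(Z >= z) = 0.5 * erfc(z / sqrt(2))
p_at_least_1300 = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 3), round(p_at_least_1300, 4))
```

This reproduces the Z of about 0.945 and a tail probability of roughly 0.172, matching the online-calculator value.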

Question 3b. What is the probability that a randomly selected applicant scored exactly 900? Show your work.
Answer:
For a continuous distribution such as the normal, the probability that a randomly selected
applicant scored exactly 900 is 0; only ranges of scores carry probability. What we can
compute is the probability of scoring approximately 900, say within half a point of it
(899.5 to 900.5), using the Z-score formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (900)
• μ is the mean score (1016.566)
• σ is the standard deviation (299.8892448941829)
Substituting the given values:
Z = (900 - 1016.566) / 299.8892448941829
Z ≈ -0.389
The standard normal density at Z ≈ -0.389 is about 0.3699. Dividing by σ converts this to
the density of the scores themselves, so the probability of a score within the one-point band
around 900 is approximately 0.3699 / 299.889 ≈ 0.0012, or 0.12%.
(Note that the Z-table value 0.3486 at Z = -0.389 is Φ(-0.389), the probability of scoring
below 900, not the probability of scoring exactly 900.)
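A sketch of the density-based approximation for "within half a point of 900" (verification only, with the same μ and σ as above):

```python
import math

mu = 1016.566
sigma = 299.8892448941829

z = (900 - mu) / sigma
# Normal density at the score 900, expressed per score point
density = math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))
# Probability of landing within half a point of 900 (899.5 to 900.5),
# approximated as density * band width of 1
p_near_900 = density * 1.0

print(round(z, 3), round(p_near_900, 4))
```

The result is on the order of 0.001, illustrating why the probability of any single exact score is effectively zero.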

Question 3c. What percentage of applicants scored between 900 and 1000? Show your work.
Answer:
To calculate the percentage of applicants who scored between 900 and 1000, we can use
the standard normal distribution (Z-score) formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (900 or 1000)
• μ is the mean score
• σ is the standard deviation
For a score of 900:
• X1 = 900
• Z1 = (X1 - μ) / σ
For a score of 1000:
• X2 = 1000
• Z2 = (X2 - μ) / σ
Substituting the given values:
• μ = 1016.566
• σ = 299.8892448941829
We compute:
For 900: Z1 ≈ -0.389
For 1000: Z2 ≈ -0.055


Using an online calculator, we find that the probability of a Z-score between -0.389 and
-0.055 is approximately 0.1294.
Therefore, the percentage of applicants who scored between 900 and 1000 is
approximately 12.94%.
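The same erfc identity used earlier gives Φ(z) = 0.5·erfc(−z/√2), so the between-probability can be double-checked without a table (again only as verification of the hand work, with the same μ and σ):

```python
import math

def phi(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

mu = 1016.566
sigma = 299.8892448941829

z1 = (900 - mu) / sigma    # Z-score for 900
z2 = (1000 - mu) / sigma   # Z-score for 1000
p_between = phi(z2) - phi(z1)

print(round(p_between, 4))
```

The difference of the two CDF values lands at roughly 0.129, agreeing with the table-based answer.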

Question 3d. Calculate the 40th percentile of scores among the applicants. What does this value represent in the context of the admissions process? Show your work.
Answer:
The 40th percentile of scores among the applicants represents the score below which 40%
of the applicants fall.
To calculate the 40th percentile, we can use the formula:
Percentile = μ + (Z * σ)
where:
• Percentile is the desired percentile (40th percentile)
• μ is the mean score
• σ is the standard deviation
• Z is the Z-score corresponding to the desired percentile
Using an online calculator, we find that the Z-score corresponding to the 40th percentile is
approximately -0.253.
Z_40th_percentile = -0.253
Substituting the given values, we compute:
Percentile_40th = 1016.566 + (-0.253 * 299.8892448941829)
Percentile_40th ≈ 940.69
Therefore, the 40th percentile of scores among the applicants is approximately 940.69.
In the context of the admissions process, this means that 40% of the applicants scored
below roughly 940.69.
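statistics.NormalDist in the Python standard library can invert the normal CDF directly, which makes a handy check on the table-based percentile above (same μ and σ; the exact Z is -0.2533 rather than the rounded -0.253, so the result differs by a fraction of a point):

```python
from statistics import NormalDist

mu = 1016.566
sigma = 299.8892448941829

# inv_cdf(0.40) returns the score below which 40% of applicants fall
p40 = NormalDist(mu, sigma).inv_cdf(0.40)

print(round(p40, 1))
```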

Question 3e. Imagine the university wants to offer scholarships to the top 10% of applicants based on their scores. What minimum score would an applicant need to qualify for a scholarship? Show your work.
Answer:


To offer scholarships to the top 10% of applicants based on their scores, we need to
determine the minimum score required to qualify, which corresponds to the 90th
percentile.
To calculate the 90th percentile, we can use the following formula:
Percentile = μ + (Z * σ)
where:
• Percentile is the desired percentile (90th percentile)
• μ is the mean score
• σ is the standard deviation
• Z is the Z-score corresponding to the desired percentile
Given:
• μ = 1016.566
• σ = 299.8892448941829
• Z_90th_percentile = 1.282
Substituting the given values, we compute:
Percentile_90th = 1016.566 + (1.282 * 299.8892448941829)
Percentile_90th ≈ 1401.02
Therefore, the minimum score an applicant would need to qualify for a scholarship is
approximately 1401.
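As with the 40th percentile, NormalDist offers a quick cross-check of the scholarship cutoff under the same μ and σ:

```python
from statistics import NormalDist

mu = 1016.566
sigma = 299.8892448941829

# inv_cdf(0.90) returns the score that only the top 10% of applicants exceed
cutoff = NormalDist(mu, sigma).inv_cdf(0.90)

print(round(cutoff, 1))
```

The exact Z for the 90th percentile is 1.2816, so the computed cutoff sits just under the hand-calculated value of about 1401.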

Question 3f. Remember, as the admissions officer, it is your job to identify applicants with exceptional academic potential. Would you automatically recommend that applicants with SAT scores above 1400 be admitted into the university? Or do you think additional criteria should also be considered? Explain your reasoning.
Answer:
While SAT scores serve as a significant yardstick for evaluating an applicant's academic
prowess, they shouldn't serve as the exclusive determinant for admission.
A well-rounded assessment involves considering various facets, including:
• Academic performance throughout high school
• Grade Point Average (GPA)
• Class standing
• Endorsements from educators


• Personal statements
• Extracurricular involvements
• Performance in other standardized tests
By examining a diverse array of criteria, the admissions panel can gain a deeper insight into
an applicant's capabilities, enabling more informed decisions regarding acceptance.
Moreover, it's vital to acknowledge that SAT scores can fluctuate notably across different
contexts and institutions. A score of 1400 at one school may not necessarily reflect the
same level of achievement as it does at another institution.
Consequently, I would not propose automatic admission for applicants surpassing the 1400
SAT score threshold. Instead, I advocate for a comprehensive approach that takes into
account all pertinent factors before reaching any admission verdicts.

D. Statistical Intuition in Public Health

Question 4:
Now imagine that it is year 2034 and you are working as a public health researcher in the
UAE. You are working on a project to assess vaccination coverage for a new global
pandemic. The UAE government has implemented a widespread vaccination campaign to
combat the spread of the virus and achieve herd immunity. You want to determine the
proportion of individuals who have received the new vaccine among a sample of 100
residents in different parts of the country.
To Begin.
Run the code below. It will provide you with a random sample of 100 residents. It will save
this data to a CSV file called "Vaccinated.csv". Again, you need to submit this file in the
same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent
# code blocks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will generate data and save it to a CSV file called
# "Vaccinated.csv". You need to submit it in the same zip folder as your
# other files.

try:
    Vaccinated = pd.read_csv('Vaccinated.csv')
except FileNotFoundError:
    num_samples = 100
    vaccinated = np.random.choice(["Yes", "No"], size=num_samples)
    Vaccinated = pd.DataFrame({'Vaccinated': vaccinated})
    # index=False keeps the saved file clean when it is read back in
    Vaccinated.to_csv('Vaccinated.csv', index=False)

# Have a look at the Vaccinated dataset.
Vaccinated.head()

{"summary":"{\n \"name\": \"Vaccinated\",\n \"rows\": 100,\n \"fields\":


[\n {\n \"column\": \"Vaccinated\",\n \"properties\": {\n
\"dtype\": \"category\",\n \"num_unique_values\": 2,\n
\"samples\": [\n \"No\",\n \"Yes\"\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n
]\n}","type":"dataframe","variable_name":"Vaccinated"}

Now, use the dataset to answer the following questions.


IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to
your conclusions.
• Please do not use code to solve these questions. The questions are designed to test
your understanding.
Question 4a. What is the proportion of people who have received the vaccine (based on
the dataset you have)?
Question 4b. Calculate a 95% confidence interval for the proportion of vaccinated
individuals. What does this interval tell us about the likely range of vaccination coverage in
the entire population? Show your work.
Question 4c. What sample size would be required to estimate the proportion of vaccinated
individuals in the country with a 95% confidence level and a margin of error of 0.02?
Show your work.
Question 4d. If you wanted to increase the precision of your estimate, what strategies
could you employ to achieve this goal? Explain your reasoning.
Question 4e. Analyze the effectiveness of the current vaccination campaign using the
proportion of vaccinated individuals and the confidence interval. What recommendations
would you make for future campaigns?
Answer: Add more "markdown" text and code cells below as needed.

Question 4a. Proportion of Vaccinated Individuals


The dataset consists of 100 individuals, out of which 54 have been vaccinated.
• Proportion of vaccinated individuals:
• p = Number of vaccinated individuals / Total number of individuals
• p = 54 / 100 = 0.54
So, 54% of the individuals in our dataset are vaccinated.

Question 4b. 95% Confidence Interval for the Proportion


Answer:
To compute the 95% confidence interval for the proportion, we follow these steps:
First, calculate the standard error (SE):
• SE = sqrt((p * (1 - p)) / n)
• SE = sqrt((0.54 * (1 - 0.54)) / 100)
• SE = sqrt(0.2484 / 100)
• SE = sqrt(0.002484)
• SE ≈ 0.0498
Then, calculate the 95% confidence interval using a Z-score of 1.96:
• CI = p ± Z * SE
• CI = 0.54 ± 1.96 * 0.0498
• CI ≈ 0.54 ± 0.0977
• CI ≈ (0.4423, 0.6377)
Therefore, we are 95% confident that the true proportion of vaccinated individuals in the
population lies between 44.23% and 63.77%.
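The interval arithmetic can be sanity-checked with a few lines (verification only, since the assignment asks for hand work; p = 0.54 and n = 100 as above):

```python
import math

p_hat = 0.54   # sample proportion of vaccinated residents
n = 100        # sample size

se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
lo = p_hat - 1.96 * se                    # lower 95% bound
hi = p_hat + 1.96 * se                    # upper 95% bound

print(round(se, 4), round(lo, 4), round(hi, 4))
```

Note that the standard error here is about 0.05, not 0.16; dividing p(1 − p) by n before taking the square root is the step that is easy to slip on.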

Question 4c. Required Sample Size for a Margin of Error of 0.02


Answer:
To determine the required sample size for a margin of error of 0.02, we use the following
formula:
n = (Z^2 * p * (1 - p)) / E^2
Given:
• Margin of error (E) = 0.02
• Z-score corresponding to 95% confidence level (Z) ≈ 1.96
• Proportion of vaccinated individuals (p) = 0.54
Substituting these values into the formula, we get:
n = (1.96^2 * 0.54 * (1 - 0.54)) / 0.02^2
n ≈ (3.8416 * 0.54 * 0.46) / 0.0004
n ≈ 0.9543 / 0.0004
n ≈ 2385.6
Rounding up, you would need a sample size of approximately 2386 individuals to estimate
the proportion of vaccinated individuals within a margin of error of 0.02 with 95%
confidence.
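The sample-size formula above can be confirmed in a few lines (again just a cross-check of the hand calculation, with the same z, p, and margin of error):

```python
import math

z = 1.96      # Z-score for 95% confidence
p_hat = 0.54  # sample proportion used as the planning value
e = 0.02      # desired margin of error

n = (z ** 2 * p_hat * (1 - p_hat)) / e ** 2
n_required = math.ceil(n)  # a required sample size is always rounded up

print(round(n, 1), n_required)
```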

Question 4d. Enhancing Precision Strategies


Answer:
To refine the precision of your estimation, explore the following strategies:
1. Augmenting Sample Size: Expand the sample size beyond 2386. A larger sample
size can lead to more accurate estimates and narrower confidence intervals.
2. Ensuring Representativeness: Guarantee that the sample is a true reflection of the
entire population by employing techniques like stratified sampling. This helps
mitigate biases and ensures comprehensive coverage of diverse groups within the
population.
3. Minimizing Margin of Error: While challenging, endeavor to further reduce the
margin of error. However, bear in mind that this typically necessitates an even
larger sample size. Strive to strike a balance between precision and practicality.

By implementing these strategies, you can enhance the precision of your estimates, thereby
yielding more reliable insights into the proportion of vaccinated individuals in the
population.

Question 4e. Analysis and Recommendations


Insights and Suggestions
Based on the analysis revealing a vaccination rate of 54% with a 95% confidence interval
ranging from 44.23% to 63.77%, it's evident that the vaccination coverage may fall short of
the desired levels necessary for achieving herd immunity or meeting other crucial public
health objectives.
Recommendations for Future Campaigns:
1. Targeted Outreach: Implement targeted outreach initiatives tailored to
unvaccinated populations. This approach can help address specific concerns and
barriers to vaccination, ultimately increasing coverage rates.
2. Enhanced Accessibility: Improve accessibility to vaccination centers by
introducing measures such as mobile clinics or extending operating hours. This
ensures greater convenience for individuals seeking vaccination, particularly those
with time constraints or limited mobility.
3. Educational Campaigns: Launch educational campaigns aimed at combating
vaccine hesitancy and dispelling misinformation. Providing accurate information
about vaccine safety, efficacy, and benefits can help alleviate concerns and
encourage more individuals to get vaccinated.

Conclusion
This analysis offers valuable insights into the current status of the vaccination campaign
and highlights potential areas for improvement to boost vaccination rates. By
implementing targeted outreach, enhancing accessibility, and conducting educational
campaigns, we can strive towards achieving higher vaccination coverage and contributing
to improved public health outcomes.
