Professional Documents
Culture Documents
MR
Student 2
Quick Submit
Quick Submit
Document Details
Submission ID
trn:oid:::1:2866706450 31 Pages
Download Date
File Name
Assignment_2_Spring_2024_2.docx
File Size
128.0 KB
*16%
Caution: Percentage may not indicate academic misconduct. Review required.
Our testing has found that there is a higher incidence of false positives when the percentage is less than 20. In order to reduce the
likelihood of misinterpretation, the AI indicator will display an asterisk for percentages less than 20 to call attention to the fact that
the score is less reliable.
However, the final decision on whether any misconduct has occurred rests with the reviewer/instructor. They should use the
percentage as a means to start a formative conversation with their student and/or use it to examine the submitted assignment in
greater detail according to their school's policies.
Non-qualifying text, such as bullet points, annotated bibliographies, etc., will not be processed and can create disparity between the submission highlights and the
percentage shown.
In a longer document with a mix of authentic writing and AI generated text, it can be difficult to exactly determine where the AI writing begins and original writing
ends, but our model should give you a reliable guide to start conversations with the submitting student.
Disclaimer
Our AI writing assessment is designed to help educators identify text that might be prepared by a generative AI tool. Our AI writing assessment may not always be accurate (it may misidentify
both human and AI-generated text) so it should not be used as the sole basis for adverse actions against a student. It takes further scrutiny and human judgment in conjunction with an
organization's application of its specific academic policies to determine whether any academic misconduct has occurred.
2. All generated csv and .ipynb files must be submitted in a zip-folder as a secondary
source.
3. Ensure your zip folder contains four csv files (i.e., the csv files for each of the four
questions below; YourName.csv, RelianceRetailVisits_ordered.csv, Scores.csv,
Vaccinated.csv).
5. You should ONLY use the concepts and techniques covered in the course to
generate your answers. Statistical techniques that are NOT covered in the
course will NOT be evaluated.
Note: Reach out to your instructor for any question regarding csv files, codes, or the zip-
folder.
Non-compliance with the above instructions will result in a 0 grade on the relevant
portions of the assignment. Your instructor will grade your assignment based on what you
submitted. Failure to submit the assignment or submitting an assignment intended for
another class will result in a 0 grade, and resubmission will not be allowed. Make sure that
you submit your original work. Suspected cases of plagiarism will be treated as potential
academic misconduct and will be reported to the College Academic Integrity Committee for
a formal investigation. As part of this procedure, your instructor may require you to meet
with them for an oral exam on the assignment.
Question 1:
We are going to work with a dataset that was collected on mental health issues. In total,
824 individuals (teenagers, college students, housewives, businesses professionals, and
other groups) completed the survey. Their data provides valuable insights into the
prevalence, and factors associated with, mental health issues in different groups.
To Begin.
Run the code below. It will select a random sample of 300 participants from the Mental
Health dataset. The code will then generate a CSV file called Name.csv. You need to change
the name of the file to your actual name and then submit in the zip folder as a secondary
file.
# Load the following libraries so that they can be applied in the subsequent
code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats
# Run this code. It will create a csv file containing a random sample of 300
respondents. You will answer the questions below based on this sample.
# Look at the code below. Now replace 'Name.csv' with your actual name (e.g.,
'Sara.csv'). The code will generate a csv file that you need to submit in the
zip folder as secondary file.
try:
df = pd.read_csv('Rouda Alsuwaidi.csv') # replace Name with your
own name
except FileNotFoundError:
original_data =
pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-
2024/main/mental_health_finaldata_1.csv")
df1=original_data.sample(300)
df1.to_csv('Rouda Alsuwaidi.csv') # replace Name with
your own name
df = pd.read_csv('Rouda Alsuwaidi.csv') # replace Name with
your own name
df = pd.DataFrame(df)
df.to_csv('Rouda Alsuwaidi.csv') # replace Name with
your own name
df.head()
Now, Run the code below to return TWO variables which represent different aspects of
mental health that you need to focus on.
# Load the following libraries so that they can be applied in the subsequent
code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats
import random
Variable 1: Weight_Change
Variable 2: Quarantine_Frustrations
Using the sample dataset from the CSV file you generated answer the following questions:
Question 1a. Is each of these two variables independent of being female? Explain your
reasoning. Make sure to include a two-way table for each of these two variables with
gender, and show all your calculations to support your answers.
Question 1b. Is there a relationship between the two variables returned by the code?
Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all
your probability calculations in your answer.
Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing
Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer
with the relevant statistical analysis.
Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to
answer the following question:
Imagine ZU wanted to use the insights from this research to improve its mental health
support program. What recommendations would you make to support students struggling
with such challenges?
Answer: Add more "markdown" text and code cells below as needed.
else:
print(f"Variable 1 is not independent of being female (p-value:
{p_value_1:.4f}).")
Two-way table:
Quarantine_Frustrations No Yes
Weight_Change
No 23 75
Yes 62 140
Question 1d. Look back at your answers to Questions 1a-c. Now use
what you learned to answer the following question:
Question 2:
Imagine you are the manager of an Electronic store in Dubai mall. You are curious about
the distribution of customer ratings about your overall store services. So you ask random
customers who visit the store to complete a short survey, recording variables such as their
age group, and overall experience rating.
To Begin
Run the code below. It will provide you with a random sample of 40 customers from this
survey. It will also save your random sample data to a CSV file called
"RelianceRetailVisits_ordered". Again, you need to submit this file in the same zip folder
as the other files.
# Load the following libraries so that they can be applied in the subsequent
code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats
try:
df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
original_data =
pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-
2024/main/RelianceRetailVisits-1.csv")
# Fill missing values for '46 To 60 years' age group with default values or
remove NaN rows
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)
# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years',
df.head()
Use the random sample of data from the csv file you generated to answer the following
questions:
Question 2a. Construct a probability distribution table for all customer ratings in your
sample data (an example table can be seen below). Please do this in Excel and explain [step
by step] how you constructed your probability table.
Screenshot 2024-02-25 at 6.38.29 PM.png
Question 2b. What is the probability that a randomly selected customer will have a rating
of AT MOST 3?
Question 2c. Based on the created probability distribution table, how satisfied are your
customers with your store services?
Question 2d. Find the expected rating of your store. Show your work and interpret your
answer in context.
Run the code below. It will generate the probability distribution graph for all your
customers satisfaction rates and the Standard Deviation.
P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3) Put the probabilities value in the this equation
P(X ≤ 3) = 0.025 + 0.15 + 0.1 P(X ≤ 3) = 0.275
Therefore, the probability that a randomly selected customer will have a rating of AT MOST
3 is 0.275 or 27.5%.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from tabulate import tabulate
# Load data
try:
df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
original_data =
pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-
2024/main/RelianceRetailVisits-1.csv")
df = original_data.sample(n=40, random_state=42)
# Fill missing values for '46 To 60 years' age group with default values or
remove NaN rows
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)
# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years',
'46 To 60 years']
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order,
ordered=True)
df.sort_values(by='Age Group', inplace=True)
std_rating = df['OverallExperienceRatin'].std()
print(f"Standard Deviation (STD) of Customer Rating: {std_rating:.2f}")
print()
In summary, while a large portion of customers appear satisfied with store services, there
is still a notable minority whose experiences may require attention to further improve
overall customer satisfaction.
Question 2d. Find the expected rating of your store. Show your work
and interpret your answer in context.
Determining the expected rating involves calculating the average rating that customers are
likely to give, considering the probability of each rating. This is achieved by multiplying
each rating by its corresponding probability and then summing up the results.
In this case, the expected rating is:
unusual_rating=3.15±2×1.11
= 3.15 ± 2.22
= 3.15±2.22
= 0.93 to 4.37
Hence, any rating below 0.93 or above 4.37 would be considered unusual.
In context, this implies that a rating of 1 or 2 would be considered exceptionally low, while
a rating of 4 or 5 would be considered exceptionally high compared to the typical range of
ratings.
Run the code below. It will generate the probability distribution graphs for each of the age
groups along with their discrete probability distribution tables, the Expectd values, and the
Standard Deviation values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
plt.tight_layout()
plt.show()
Question 2f. Identify any trends or differences in customer satisfaction levels (and
variability) among the different age groups.
Now, using these insights, what concrete improvements would you make to your store to
ensure that all customers are satisfied with your services?
Variance Assessment:
• 16-25: Moderate variance in ratings (STD of 1.03).
• 26-35: Highest variance (STD of 1.18), indicating a broader range of experiences.
• 36-45: Lowest variance (STD of 0.52), suggesting a more consistent experience.
• 46-60: High variance (STD of 0.95), reflecting significant diversity in experiences.
• Expand product offerings, refine store layout and ambiance, and maintain a
consistent standard of service.
Integration of Technology:
• Explore opportunities to integrate technology for streamlined processes and
enriched customer interactions.
• Implement self-service checkout systems, mobile payment options, and online
ordering to enhance convenience.
Sustainable Practices:
• Adopt environmentally conscious practices to appeal to eco-minded clientele.
• Minimize waste, utilize sustainable materials, and support eco-friendly initiatives to
demonstrate corporate responsibility.
Competitive Analysis:
• Conduct regular competitive analyses to stay informed about industry trends and
benchmark against competitors.
• Identify areas for differentiation and capitalize on unique selling propositions.
Question 3:
Imagine you are working for a prestigious university in the UAE. It is your job to decide
which students are admitted to the university. To help you do this, you analyze the high
school (SAT) scores of potential students. These scores help you understand their academic
readiness and potential for success at the university.
You have just received the scores of applicants who would like to join the university in
September 2024. These scores follow a normal distribution.
To Begin.
Run the code below. It will generate a dataset with the students scores. It will also calculate
the mean (μ) and standard deviation (σ) of these scores. This dataset will be saved as a
CSV file called "Scores.csv". Again, you need to submit this file in the same zip folder as
your other files.
# Load the following libraries so that they can be applied in the subsequent
code blocks
import pandas as pd
import numpy as np
import random
try:
SATScores = pd.read_csv('Scores.csv')
except FileNotFoundError:
num_samples = 1000
mean_score = random.randint(800, 1200)
std_deviation = random.randint(100, 300)
scores = np.random.normal(mean_score, std_deviation, num_samples)
scores = np.round(scores, 0)
SATScores = pd.DataFrame({'Scores': scores})
SATScores.to_csv('Scores.csv')
Now, use the Scores dataset and the statistics provided by the code, to answer the following
questions.
IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to your
conclusions.
• If you use online calculators then please include screenshots of those calculators as
part of your work.
• Please do not use code to solve these questions. The questions are designed to test your
understanding.
Question 3a. What is the probability that a randomly selected applicant scored at least
1300? Show your work.
Question 3b. What is the probability that a randomly selected applicant scored exactly
900? Show your work.
Question 3c. What percentage of applicants scored between 900 and 1000? Show your
work.
Question 3d. Calculate the 40th percentile of scores among the applicants. What does this
value represent in the context of the admissions process? Show your work.
Question 3e. Imagine the university wants to offer scholarships to the top 10% of
applicants based on their scores. What minimum score would an applicant need to qualify
for a scholarship? Show your work.
Question 3f. Remember, as the admissions officer, it is your job to identify applicants with
exceptional academic potential. Would you automatically recommend that applicants with
SAT scores above 1400 to be admitted into the university? Or do you think additional
criteria should also be considered? Explain your reasoning.
Answer: Add more "markdown" text and code cells below as needed.
Question 3a. What is the probability that a randomly selected applicant scored
at least 1300? Show your work.
Answer:
To determine the probability that a randomly selected applicant scored at least 1300, we
utilize the standard normal distribution (Z-score) formula:
Z = (X - μ) / σ
where:
• Z is the Z-score
• X is the score of interest (1300)
• μ is the mean score
• σ is the standard deviation
Substituting the given values:
• X = 1300
• μ = 1016.566
• σ = 299.8892448941829
We compute:
Z = (1300 - 1016.566) / 299.8892448941829 Z ≈ 0.945
Using an online calculator, we determine that the probability of a Z-score of 0.945 or higher
is approximately 0.1723.
Hence, the probability that a randomly selected applicant scored at least 1300 is
approximately 0.1723 or 17.23%.
• σ = 299.8892448941829
We compute:
Z = (900 - 1016.566) / 299.8892448941829 Z ≈ -0.389
Using an online calculator, we find that the probability of a Z-score of -0.389 is
approximately 0.3486.
Therefore, the probability that a randomly selected applicant scored exactly 900 is
approximately 0.3486 or 34.86%.
Using an online calculator, we find that the probability of a Z-score between -0.389 and -
0.055 is approximately 0.1294.
Therefore, the percentage of applicants who scored between 900 and 1000 is
approximately 12.94%.
To offer scholarships to the top 10% of applicants based on their scores, we need to
determine the minimum score required to qualify, which corresponds to the 90th
percentile.
To calculate the 90th percentile, we can use the following formula:
Percentile = μ + (Z * σ)
where:
• Percentile is the desired percentile (90th percentile)
• μ is the mean score
• σ is the standard deviation
• Z is the Z-score corresponding to the desired percentile
Given:
• μ = 1016.566
• σ = 299.8892448941829
• Z_90th_percentile = 1.282
Substituting the given values, we compute:
Percentile_90th = 1016.566 + (1.282 * 299.8892448941829)
Percentile_90th ≈ 1400.678
Therefore, the minimum score an applicant would need to qualify for a scholarship is
approximately 1400.678.
• Personal statements
• Extracurricular involvements
• Performance in other standardized tests
By examining a diverse array of criteria, the admissions panel can gain a deeper insight into
an applicant's capabilities, enabling more informed decisions regarding acceptance.
Moreover, it's vital to acknowledge that SAT scores can fluctuate notably across different
contexts and institutions. A score of 1400 at one school may not necessarily reflect the
same level of achievement as it does at another institution.
Consequently, I would not propose automatic admission for applicants surpassing the 1400
SAT score threshold. Instead, I advocate for a comprehensive approach that takes into
account all pertinent factors before reaching any admission verdicts.
Question 4:
Now imagine that it is year 2034 and you are working as a public health researcher in the
UAE. You are working on a project to assess vaccination coverage for a new global
pandemic. The UAE government has implemented a widespread vaccination campaign to
combat the spread of the virus and achieve herd immunity. You want to determine the
proportion of individuals who have received the new vaccine among a sample of 100
residents in different parts of the country.
To Begin.
Run the code below. It will provide you with a random sample of 100 residents. It will save
this data to a CSV file called "Vaccinated.csv". Again, you need to submit this file in the
same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent
code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats
# Run this code. It will generate data and save it to a CSV file called
"Vaccinated.csv". You need to submit it in the same zip folder as your other
files.
try:
Vaccinated = pd.read_csv('Vaccinated.csv')
except FileNotFoundError:
num_samples = 100
vaccinated = np.random.choice(["Yes", "No"], size=num_samples)
Vaccinated = pd.DataFrame({'Vaccinated': vaccinated})
Vaccinated.to_csv('Vaccinated.csv')
By implementing these strategies, you can enhance the precision of your estimates, thereby
yielding more reliable insights into the proportion of vaccinated individuals in the
population.
Conclusion
This analysis offers valuable insights into the current status of the vaccination campaign
and highlights potential areas for improvement to boost vaccination rates. By
implementing targeted outreach, enhancing accessibility, and conducting educational
campaigns, we can strive towards achieving higher vaccination coverage and contributing
to improved public health outcomes.