
Hypothesis Testing: Final Project Report

Authors: Sourodip Ghosh, Aniruddha Kulkarni, Bhumi Kakade

Abstract
This report analyzes COVID-19 patient data and tests whether the number of deceased males
above 50 differs from the total number of deceased females. Through data preparation,
visualization, hypothesis testing, and power analysis, we aim to provide a thorough
understanding of the relationships within the dataset.

1. Introduction

1.1 Background
The COVID-19 pandemic has made it important to understand the demographics and
characteristics of affected individuals. This project examines gender-specific mortality,
focusing on individuals above 50 years of age.

1.2 Research Question


The primary question driving this analysis is whether there is a significant difference in the
number of deceased males above 50 years old compared to the total number of deceased females.

2. Data Preparation

2.1 Dataset Overview


The initial dataset includes many features related to COVID-19 patients. To focus on the
pertinent information, we selected four columns: 'SEX', 'PATIENT_TYPE', 'DATE_DIED',
and 'AGE'.

```python
# Code for data preparation
# ...
```
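
The report does not include its preparation code, so the following is only a minimal sketch of the column selection described above, using the standard library. The inline rows, the extra 'PNEUMONIA' column, and the helper name `select_columns` are invented for illustration and are not taken from the actual dataset.

```python
import csv
import io

# Toy rows invented for illustration only; the real dataset is not reproduced here.
RAW_CSV = """SEX,PATIENT_TYPE,DATE_DIED,AGE,PNEUMONIA
1,1,9999-99-99,34,2
2,2,03/05/2020,67,1
1,2,12/06/2020,71,1
"""

# The four columns named in the report.
KEEP = ["SEX", "PATIENT_TYPE", "DATE_DIED", "AGE"]


def select_columns(csv_text, columns):
    """Return a list of dicts containing only the requested columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: row[col] for col in columns} for row in reader]


records = select_columns(RAW_CSV, KEEP)
print(records[0])
```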

2.2 Cleaning and Transformation


To ensure the accuracy of our analysis, we performed cleaning tasks, handling missing values
and transforming data types. The 'DATE_DIED' column was converted to a binary variable,
indicating whether the patient died or not.
```python
# Code for data cleaning and transformation
# ...
```
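
As a hedged sketch of the transformation described above: the code below assumes that survivors are marked with a placeholder date string, and that rows with a missing or non-numeric age are dropped. Both the placeholder value and the dropping rule are assumptions, as is the helper name `clean_record`.

```python
# Assumption: survivors carry this placeholder instead of a real date of death.
SURVIVOR_PLACEHOLDER = "9999-99-99"


def clean_record(record):
    """Return a cleaned copy with integer AGE and a binary DIED flag.

    Returns None when AGE or DATE_DIED is missing or AGE is not numeric.
    """
    age_text = (record.get("AGE") or "").strip()
    date_text = (record.get("DATE_DIED") or "").strip()
    if not age_text.isdigit() or not date_text:
        return None  # treat missing or non-numeric values as unusable
    cleaned = dict(record)
    cleaned["AGE"] = int(age_text)
    cleaned["DIED"] = 0 if date_text == SURVIVOR_PLACEHOLDER else 1
    return cleaned


# Toy rows invented for illustration.
rows = [
    {"SEX": "1", "AGE": "34", "DATE_DIED": "9999-99-99"},
    {"SEX": "2", "AGE": "67", "DATE_DIED": "03/05/2020"},
    {"SEX": "2", "AGE": "", "DATE_DIED": "9999-99-99"},
]
cleaned = [c for c in map(clean_record, rows) if c is not None]
```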

3. Exploratory Data Analysis

3.1 Descriptive Statistics


Before delving into hypothesis testing, we conducted exploratory data analysis to understand the
distribution of age and gender in the dataset.

```python
# Code for descriptive statistics and visualizations
# ...
```
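
A minimal sketch of the kind of summary this step might produce for the 'AGE' column, using Python's `statistics` module. The ages listed are toy values, not figures from the dataset, and `describe` is a hypothetical helper.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


ages = [34, 67, 71, 45, 58]  # toy values for illustration
summary = describe(ages)
```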

3.2 Gender Distribution


Visualizing the gender distribution revealed an almost equal representation of males and females
in the dataset.

```python
# Code and visualization for gender distribution
# ...
```
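
The gender split can be tabulated with a simple counter before plotting. This sketch assumes 'SEX' is coded 1 for female and 2 for male; that coding is an assumption, and the input codes are toy values.

```python
from collections import Counter

# Assumption: SEX is coded 1 = female, 2 = male.
SEX_LABELS = {1: "female", 2: "male"}


def gender_shares(sex_codes):
    """Return the share of each gender label among the given codes."""
    counts = Counter(SEX_LABELS[code] for code in sex_codes)
    total = sum(counts.values())
    return {label: counts[label] / total for label in SEX_LABELS.values()}


shares = gender_shares([1, 2, 2, 1, 1, 2])  # toy codes
```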

4. Comparison of Males Above 50 and Total Females

4.1 Overview
To address the primary research question, we compared the count of deceased males above 50
years old with the total count of deceased females.

4.2 Full Dataset Comparison


In the entire dataset, there were 21,984 deceased males above 50 and 49,540 deceased females.

```python
# Code for the comparison and visualization
# ...
```
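
The two counts being compared can be computed as in the sketch below. It assumes cleaned records with a binary 'DIED' flag, a labeled 'SEX' field, and an integer 'AGE', and it reads "above 50" as strictly greater than 50. The records and the helper name `count_groups` are illustrative only.

```python
def count_groups(records, age_threshold=50):
    """Count deceased males older than the threshold and all deceased females."""
    males_over = sum(
        1
        for r in records
        if r["DIED"] == 1 and r["SEX"] == "male" and r["AGE"] > age_threshold
    )
    females = sum(1 for r in records if r["DIED"] == 1 and r["SEX"] == "female")
    return males_over, females


# Toy records invented for illustration.
records = [
    {"SEX": "male", "AGE": 67, "DIED": 1},
    {"SEX": "male", "AGE": 50, "DIED": 1},   # exactly 50, so not "above 50"
    {"SEX": "male", "AGE": 72, "DIED": 0},
    {"SEX": "female", "AGE": 30, "DIED": 1},
    {"SEX": "female", "AGE": 81, "DIED": 1},
    {"SEX": "female", "AGE": 44, "DIED": 0},
]
males_over_50, deceased_females = count_groups(records)
```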

4.3 Visual Representation
Visualizing the comparison further illustrates the relationship between deceased males above 50
and total deceased females.

```python
# Code and visualization for the comparison
# ...
```

*Insert visualizations here (Bar chart comparing males above 50 and total females).*

4.4 10% Sample Comparison


To ensure the robustness of our findings, we performed a comparison using a 10% sample of the
dataset.

```python
# Code for the 10% sample comparison and visualization
# ...
```
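
One way to draw a reproducible 10% sample is shown below. The fixed seed, the rounding rule, and the helper name `sample_fraction` are assumptions; the report does not say how its sample was drawn.

```python
import random


def sample_fraction(records, fraction=0.10, seed=42):
    """Return a simple random sample (without replacement) of the given fraction."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


population = list(range(1000))  # stand-in for the cleaned records
subset = sample_fraction(population)
```

The same counting step used for the full dataset can then be applied to `subset`.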

4.5 Interpretation
The comparison results indicate that, even in the 10% sample, the number of deceased males
above 50 remains notably lower than the total count of deceased females. This observation is
consistent with the overall dataset, but a raw comparison of counts cannot show whether the
difference is statistically significant, so a formal hypothesis test is needed.

5. Hypothesis Testing

5.1 Formulation of Hypotheses


The null hypothesis (H0) posits that deceased males above 50 are at least as numerous as
deceased females, while the alternative hypothesis (H1) asserts that they are fewer.

```python
# Code for formulating hypotheses
# ...
```

5.2 Z-Test for Proportion Difference


We conducted a Z-test to compare the proportions of deceased males above 50 and total deceased
females.
```python
# Code for Z-test and visualization of rejection regions
# ...
```
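
A pooled two-proportion Z-test can be sketched as follows. The report does not state the group sizes behind its proportions or the order of subtraction, so the inputs here are hypothetical and the upper-tailed p-value is only one possible convention.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, se) for a pooled two-proportion Z-test.

    x1, x2 are success counts; n1, n2 are group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, se


def upper_tail_p_value(z):
    """P(Z >= z) under the standard normal distribution."""
    return 1 - NormalDist().cdf(z)


# Hypothetical counts, not the report's data.
z, se = two_proportion_z(60, 100, 40, 100)
p = upper_tail_p_value(z)
```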

*Insert visualizations here (Graph illustrating acceptance and rejection regions).*

5.3 Significance Levels


Confidence levels were set at 80%, 90%, 95%, and 99%, corresponding to significance levels
(α) of 0.20, 0.10, 0.05, and 0.01. The Z-statistic was calculated and compared with the
critical value at each level.

```python
# Code for significance levels and critical values
# ...
```
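
The critical values can be obtained from the inverse normal CDF. This sketch assumes a one-sided test, since the hypotheses are directional; pass `two_sided=True` for the two-sided case. The Z-statistic used is the one reported in Section 5.4.

```python
from statistics import NormalDist

Z_REPORTED = 66.3959  # Z-statistic quoted in this report


def critical_values(confidence_levels, two_sided=False):
    """Map each confidence level to its standard normal critical value."""
    std_normal = NormalDist()
    result = {}
    for conf in confidence_levels:
        alpha = 1 - conf
        tail = alpha / 2 if two_sided else alpha
        result[conf] = std_normal.inv_cdf(1 - tail)
    return result


crit = critical_values([0.80, 0.90, 0.95, 0.99])
# H0 is rejected at a level when the statistic exceeds that level's critical value.
rejects = {conf: Z_REPORTED > z_crit for conf, z_crit in crit.items()}
```

Under the one-sided assumption the critical values are roughly 0.84, 1.28, 1.64, and 2.33, all far below the reported statistic.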

*Insert visualizations here (Graphs showing rejection regions and critical values).*

5.4 Conclusion of Hypothesis Testing


The Z-statistic obtained was remarkably high (66.3959), leading to the rejection of the null
hypothesis at all significance levels. The p-value was close to zero, providing strong evidence
against the null hypothesis.

6. Power Analysis

6.1 Understanding Power Analysis


Power analysis is a critical aspect of hypothesis testing: power is the probability that a test
rejects the null hypothesis when a true effect exists. We conducted power analysis at several
significance levels to further evaluate the robustness of our findings.

6.2 Effect Size Calculation


The effect size was calculated based on the Z-statistic and the standard error of the pooled
sample proportion.

```python
# Code for effect size calculation
# ...
```
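
One reading of this description is that the raw effect (the difference in proportions) is recovered as the Z-statistic multiplied by the standard error. The sketch below shows that reading alongside Cohen's h, a commonly used standardized effect size for two proportions. Whether the report used either formula is an assumption, and the function names are hypothetical.

```python
from math import asin, sqrt


def raw_effect_from_z(z, se):
    """Recover the difference in proportions from z = (p1 - p2) / se."""
    return z * se


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# Hypothetical inputs for illustration.
difference = raw_effect_from_z(2.0, 0.05)
h = cohens_h(0.6, 0.4)
```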

6.3 Power Analysis Results
Power analysis was performed at confidence levels of 80%, 90%, 95%, and 99% (α = 0.20, 0.10,
0.05, and 0.01).

```python
# Code for power analysis and calculations
# ...
```
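
For a one-sided Z-test, power can be approximated as 1 − Φ(z_crit − effect / SE), where Φ is the standard normal CDF. The sketch below evaluates this at the four α values; the effect and standard error are hypothetical placeholders, not the report's values, and the one-sided form is an assumption.

```python
from statistics import NormalDist


def z_test_power(effect, se, alpha):
    """Power of an upper-tailed Z-test for a true difference `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - effect / se)


alphas = [0.20, 0.10, 0.05, 0.01]
# Hypothetical effect and standard error, chosen only to show the trend.
powers = [z_test_power(effect=0.01, se=0.02, alpha=a) for a in alphas]
```

With the effect and standard error held fixed, power falls as α shrinks, because a stricter threshold makes it harder to reject the null hypothesis.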

6.4 Interpretation of Power Analysis


The results indicate varying levels of power for different significance levels, shedding light on
the test's ability to detect a true effect.

*Insert a table or visualizations here (Table or graphs illustrating power at different significance
levels).*

6.5 Implications of Power Analysis


- At the 80% confidence level (α = 0.20), the power of the test is 20.43%. This implies a
relatively low chance of detecting a true effect if it exists.
- As the confidence level increases (that is, as α decreases), the power of the test decreases,
emphasizing the need for cautious interpretation.

7. Conclusion

7.1 Summary of Findings


In summary, our analysis reveals a significant difference between the number of deceased males
above 50 and total deceased females. The Z-test provided compelling evidence against the null
hypothesis at various significance levels.

7.2 Practical Significance


While the statistical significance is evident, it's crucial to consider the practical significance of
the findings. The effect size and power analysis suggest limitations in the test's sensitivity.

7.3 Recommendations
- Future studies may benefit from a larger sample size to enhance the power of the test.
- Additional factors influencing mortality rates, such as comorbidities and healthcare access,
should be considered for a more comprehensive analysis.

8. Future Work

8.1 Further Exploration


This analysis provides a foundational understanding, but further exploration could involve a
more granular examination of demographic and health-related variables.

8.2 Longitudinal Analysis


A longitudinal analysis could reveal trends over time, considering the evolving nature of the
COVID-19 pandemic.

