
PROJECT STA 108

NUMBER OF REPORTED CASES AND TOTAL DEATH CAUSED BY CHOLERA IN MALAYSIA FROM YEAR 1971 TO 2000

NAME : 1) NUR DANIA BINTI AMAN SHAH (2018881976)
       2) JASMIN SYAFIKAH BINTI JAMAL ASRI (2018406686)
       3) NOORFATIHAH BINTI HANIPIAH (2018202014)

GROUP : AS1204_M

DISTRIBUTED TO : SIR ZULKIFLI BIN MOHD GHAZALI

Table of Contents

CHAPTER 1: INTRODUCTION
  1.1 Background of Study
  1.2 Objectives of Study
  1.3 Significance of Study
CHAPTER 2: METHODOLOGY
  2.1 Data Description
    2.1.1 Population
  2.2 Graphical Technique
  2.3 Numerical Technique
    2.3.1 Measure Of Central Tendency
    2.3.2 Mean
    2.3.3 Median
    2.3.4 Mode
  2.4 Measure Of Location
    2.4.1 The first and third quartiles
  2.5 MEASURES OF DISPERSION
    2.5.6 SAMPLE STANDARD DEVIATION
  2.6 MEASURE OF SKEWNESS
  2.7 BOX-and-WHISKER PLOT
  2.8 PEARSON COEFFICIENT OF SKEWNESS
  2.9 CORRELATION
    2.9.1 Characteristics of the correlation coefficient
    Strength of the Correlation Coefficient
    2.9.2 Regression
    2.9.3 Fitting a Straight Line
    2.9.4 Coefficient of Determination
CHAPTER 3: RESULTS AND INTERPRETATION
  3.1 Data Representation
    3.2.1 Histogram
    3.3.1 Scatter Plot
CHAPTER 4: CONCLUSION
  4.1 Report Summary
REFERENCES
APPENDIX

CHAPTER 1: INTRODUCTION

1.1 Background of Study

Cholera is an illness caused by infection of the intestine with the toxigenic bacterium Vibrio
cholerae. The deadly effects of the disease result from a toxin that the bacteria produce in the
small intestine. The toxin causes the body to secrete an enormous amount of water, leading to
diarrhea and a rapid loss of fluids and salts. In Malaysia, 21,535 cases of cholera were reported
from 1971 to 2000, but only 388 deaths were recorded over the same period. This study was
undertaken to analyse the relationship between the number of reported cases and the total
deaths caused by cholera in Malaysia.

In this study, the number of reported cases is the independent variable and the total deaths
caused by cholera in Malaysia is the dependent (response) variable, because the total deaths
depend on the number of reported cases. The data show a positive correlation of 0.7432. This
value suggests a moderate relationship between the number of reported cases and the total
deaths caused by cholera in Malaysia from 1971 to 2000: the higher the number of reported
cases, the higher the total deaths.

1.2 Objectives of Study

The objectives of this study are:

1) To determine the relationship between the number of reported cases and total deaths
caused by cholera in Malaysia.
2) To identify the types of graph that are suitable for the data.
3) To find the values of the mean, standard deviation and interquartile range.
4) To determine the correlation and regression of the data.

1.3 Significance of Study

The data for this study are easy to access because they are already available on the World
Health Organisation (WHO) website. Using them saves time and money, since we do not need
to collect the data ourselves, and secondary data of this kind are much cheaper and quicker to
obtain than primary data. The data are also considered reliable, since they were compiled by
expert researchers from other countries.

1.4 Limitation of Study

One limitation of this study is that we cannot ask questions to verify the accuracy of the data,
because the data were taken as published from the World Health Organisation (WHO) website.
In addition, the data were collected by other researchers for their own purposes, so they may
not match our objectives exactly.

CHAPTER 2: METHODOLOGY

2.1 Data Description

2.1.1 Population
The population in this study is the number of reported cases and total deaths caused by
cholera from 1971 to 2000 in all countries of the world.

2.1.2 Samples
The sample in this study is the number of reported cases and total deaths caused by cholera
in Malaysia from 1971 to 2000.

2.1.3 Data collection method
No data collection method was used in this study because the data are secondary data that
were already available.

2.1.4 Sampling Technique
No sampling technique was used in this study because the data are secondary data that were
already available.

2.1.5 Variables
The variables in this study are the number of reported cases and the total deaths caused by
cholera in Malaysia from 1971 to 2000, with 30 observations for each variable. In statistics,
quantitative variables are either discrete or continuous. A continuous variable takes values
obtained by measurement and can assume any value within an interval, whereas a discrete
variable takes countable values. Both variables in this study are counts, so they are quantitative
discrete variables.

2.1.6 Measurement scale
There are four types of measurement scale in statistics: nominal, ordinal, interval and ratio. The
measurement scale used in this study is ratio. A ratio scale is an ordered scale in which the
differences between measurements are meaningful and which has a true zero point. In our data,
the total deaths of zero in 1974, 1994, 1996 and 1999 mean that no deaths caused by cholera
were recorded in those years. The interval scale is similar to the ratio scale except that it has no
true zero point. The nominal scale was not chosen because our data are quantitative, whereas
nominal data are qualitative. The ordinal scale was not chosen because no survey was carried
out for these secondary data, so there are no ranked responses.

2.2 Graphical Technique

Because the data in this study were organised into a grouped frequency distribution, the
histogram was chosen. As shown in Figure 1, the height of each bar represents the frequency
of the class. The histogram uses the class frequency as the y-axis and the class boundaries as
the x-axis.

Figure 1

Figure 2

Figure 3 below shows the scatter diagram. A scatter diagram shows the nature of the
relationship between two variables, the dependent variable and the independent variable. It can
also be used to judge how close the relationship between the two variables is. The possible
patterns are positive correlation, negative correlation, no correlation, curvilinear correlation and
perfect positive correlation. A positive correlation is identified when the dependent variable
(y-axis) increases as the independent variable (x-axis) increases. A negative correlation is
identified when the two variables move in opposite directions: as the independent variable
(x-axis) increases, the dependent variable (y-axis) decreases.

Figure 3

Based on Figure 3 above, the scatter diagram shows a positive correlation: as the independent
variable on the x-axis (number of reported cases) increases, the dependent variable on the
y-axis (total deaths) also increases.

2.3 Numerical Technique

2.3.1 Measure Of Central Tendency

A measure of central tendency, commonly called an average, is a single value located at the
centre of a data set that can be taken as a summary value for that data set. The three averages
most often used as measures of central tendency are the mean, median and mode, and each
can be calculated for grouped or ungrouped data. Ungrouped data are not presented in a
frequency table or frequency distribution, while grouped data are tabulated in a frequency table
or frequency distribution.

2.3.2 Mean
The mean is the average of the data. It is the sum of all the observations divided by the number
of observations. It can be calculated for both grouped and ungrouped data.

Ungrouped data:

$$\bar{x} = \frac{\sum x}{n}$$

Grouped data:

$$\bar{x} = \frac{\sum fx}{n}$$

where $f$ is the class frequency, $x$ is the class midpoint and $n = \sum f$.
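As a minimal sketch, the ungrouped formula can be applied in Python to the two data series listed in Table 1. The grouped part uses classes of width 500 starting at 0, which are an assumption made only for illustration and are not necessarily the classes used for the histograms in this report.

```python
# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

# Ungrouped mean: sum of the observations divided by their number.
mean_cases = sum(cases) / len(cases)
mean_deaths = sum(deaths) / len(deaths)

# Grouped mean with hypothetical classes [0, 500), [500, 1000), ... (an assumption).
width = 500
lower_bounds = [0, 500, 1000, 1500, 2000]
freqs = [sum(1 for v in cases if lo <= v < lo + width) for lo in lower_bounds]
midpoints = [lo + width / 2 for lo in lower_bounds]
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

print(round(mean_cases, 2), round(mean_deaths, 2), grouped_mean)
```

The ungrouped results agree with the means reported in the appendix (717.83 and 12.93). The grouped mean differs slightly because every observation is replaced by its class midpoint.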

2.3.3 Median
The median is the middle value of the data when the observations are arranged in ascending
order. It is interpreted as follows: 50% of the observations have a value less than the median
and the other 50% have a value greater than the median.

Ungrouped Data

Steps to calculate it:

i. Arrange the data in ascending order.

ii. Find the position of the median.

iii. Find the value of the median.

For the special case:

1. Construct a table that includes the cumulative frequency.

2. Find the position of the median, $\frac{n+1}{2}$.

3. Locate this position in the cumulative frequency column.

4. The value of the median is the corresponding value in column x.

Grouped Data

Steps to calculate it:

i. Construct a table that includes the cumulative frequency, class boundaries and position.

ii. Find the position of the median, $\frac{n+1}{2}$.

iii. Locate this position in the cumulative frequency column to find the median class.

iv. Use the formula:

$$\tilde{x} = L_m + \left[ \frac{\frac{\sum f}{2} - \sum f_{m-1}}{f_m} \right] \cdot c$$

Where,

$n = \sum f$ = sample size

$L_m$ = lower boundary of the median class

$\sum f_{m-1}$ = cumulative frequency before the median class

$f_m$ = frequency of the median class

$c$ = median class size
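The steps above can be sketched in Python. The ungrouped medians use the standard library, and the grouped function interpolates inside the median class as in the formula. The classes of width 500 are hypothetical and chosen only for illustration, and the sketch locates the median class with $\sum f / 2$, the same quantity that appears in the formula.

```python
from statistics import median

# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

# Ungrouped median: with n = 30 it is the average of the 15th and 16th ordered values.
median_cases = median(cases)
median_deaths = median(deaths)


def grouped_median(lower_bounds, freqs, width):
    """Median of grouped data: L_m + ((sum f)/2 - F_before) / f_m * c."""
    half = sum(freqs) / 2
    cum_before = 0  # cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_before + f >= half:
            return lower + (half - cum_before) / f * width
        cum_before += f
    raise ValueError("empty frequency table")


# Hypothetical classes [0, 500), [500, 1000), ... (an assumption).
lower_bounds = [0, 500, 1000, 1500, 2000]
freqs = [sum(1 for v in cases if lo <= v < lo + 500) for lo in lower_bounds]
grouped_median_cases = grouped_median(lower_bounds, freqs, 500)

print(median_cases, median_deaths, grouped_median_cases)
```

The ungrouped medians match the values reported later in the box plots (488.00 and 9.00).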

2.3.4 Mode
The mode is the value that occurs most frequently in the data. It can be found for both
ungrouped and grouped data. For ungrouped data:

i. Arrange the data in ascending order.

ii. Find the mode, the value that occurs most frequently in the data set.

iii. For categorical data, the mode is the category with the highest frequency.

iv. For quantitative data, the mode can also be read from the histogram as the class interval
with the highest frequency.

There is also a special case for the mode, for which the method is:

i. Find the highest frequency.

ii. Find the corresponding value of the mode in column x.
For grouped data:

Steps to calculate it:

i. Construct a table that includes the cumulative frequency and class boundaries.

ii. Find the highest frequency to identify the modal class.

iii. Use the formula:

$$\hat{x} = L_{mo} + \left[ \frac{\Delta_1}{\Delta_1 + \Delta_2} \right] \cdot c$$

where,

$L_{mo}$ = lower boundary of the modal class

$\Delta_1$ = modal class frequency – frequency of the class before the modal class

$\Delta_2$ = modal class frequency – frequency of the class after the modal class

$c$ = modal class size
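A short Python sketch of both cases follows. `statistics.multimode` returns every value that shares the highest frequency. The grouped function applies the formula above to hypothetical classes of width 500, and treating a missing neighbouring class as having frequency 0 is also an assumption of this sketch.

```python
from statistics import multimode

# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

# Ungrouped mode: the most frequent value(s).
modes_deaths = multimode(deaths)
# Every case count occurs once, so all 30 values tie; take the smallest.
smallest_mode_cases = min(multimode(cases))


def grouped_mode(lower_bounds, freqs, width):
    """Mode of grouped data: L_mo + d1 / (d1 + d2) * c."""
    i = freqs.index(max(freqs))                           # modal class
    before = freqs[i - 1] if i > 0 else 0                 # assumption: 0 if no class before
    after = freqs[i + 1] if i < len(freqs) - 1 else 0     # assumption: 0 if no class after
    d1 = freqs[i] - before
    d2 = freqs[i] - after
    return lower_bounds[i] + d1 / (d1 + d2) * width


# Hypothetical classes [0, 500), [500, 1000), ... (an assumption).
lower_bounds = [0, 500, 1000, 1500, 2000]
freqs = [sum(1 for v in cases if lo <= v < lo + 500) for lo in lower_bounds]
grouped_mode_cases = grouped_mode(lower_bounds, freqs, 500)

print(modes_deaths, smallest_mode_cases, round(grouped_mode_cases, 2))
```

The ungrouped results agree with the modes listed in the appendix (0 for total death, and 52 as the smallest of several modes for reported cases).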

2.3.5 Relationship between mean, median and mode

The data distribution is skewed to the left (negatively skewed) if mode > median > mean (or
simply mean < median or mean < mode).

The data distribution is skewed to the right (positively skewed) if mode < median < mean (or
simply mean > median or mean > mode).

The data distribution is symmetrical if mode = median = mean.

2.4 Measure Of Location

A measure of location describes the position of a value within a data set. The measure used
here is the quartile, which is calculated differently for ungrouped and grouped data. Quartiles
are an extension of the median and are the most widely used measures of non-central location.
They divide the area under the frequency curve into four equal parts.

Ungrouped Data
There are three quartiles:

First Quartile / Lower Quartile ($Q_1$) - 25% of the data are less than the first quartile value
and 75% of the data are more than the first quartile value.

$$Q_1 = \frac{n+1}{4}\text{th observation}$$

Second Quartile / Median ($Q_2$) - 50% of the data are less than the second quartile value and
50% of the data are more than the second quartile value.

$$Q_2 = \frac{2(n+1)}{4}\text{th observation}$$

Third Quartile / Upper Quartile ($Q_3$) - 75% of the data are less than the third quartile value
and 25% of the data are more than the third quartile value.

$$Q_3 = \frac{3(n+1)}{4}\text{th observation}$$
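The position rule can be sketched in Python as follows. When the position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values, which is an assumption about how fractional positions are handled. The helper name `quartile` is ours.

```python
# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) at position k(n + 1)/4 of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4          # 1-based position, may be fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    # Interpolate between the whole-th and (whole + 1)-th ordered values.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


q1_cases, q3_cases = quartile(cases, 1), quartile(cases, 3)
q1_deaths, q3_deaths = quartile(deaths, 1), quartile(deaths, 3)
iqr_cases = q3_cases - q1_cases
iqr_deaths = q3_deaths - q1_deaths

print(q1_cases, q3_cases, iqr_cases)
print(q1_deaths, q3_deaths, iqr_deaths)
```

For n = 30 the first quartile sits at position 7.75 and the third at position 23.25. The results reproduce the quartiles quoted with the box plots in Chapter 3 (215.50 and 1202.0 for reported cases, 1.75 and 17.25 for total death).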

Grouped Data

For grouped data, the positions measured are the first and third quartiles, $Q_1$ and $Q_3$.
They can be calculated from the frequency distribution table or read from the ogive.

2.4.1 The first and third quartiles

Method 1: Using Formula

Step 1: Obtain the cumulative frequencies.
Step 2: Find the positions of the first and third quartiles using $\frac{n}{4}$ and $\frac{3n}{4}$,
then refer to the cumulative frequency column to identify the classes in which these positions
lie. The values of $Q_1$ and $Q_3$ lie within these classes.
Step 3: Find the first and third quartiles as follows:

$$Q_1 = L_{Q_1} + \left[ \frac{\frac{n}{4} - f_{Q_1 - 1}}{f_{Q_1}} \right] \times C_{Q_1}$$

where
$n$ = number of observations
$L_{Q_1}$ = lower boundary of the first quartile class
$f_{Q_1 - 1}$ = cumulative frequency before the first quartile class
$f_{Q_1}$ = frequency of the first quartile class
$C_{Q_1}$ = first quartile class size

$$Q_3 = L_{Q_3} + \left[ \frac{\frac{3n}{4} - f_{Q_3 - 1}}{f_{Q_3}} \right] \times C_{Q_3}$$

where
$n$ = number of observations
$L_{Q_3}$ = lower boundary of the third quartile class
$f_{Q_3 - 1}$ = cumulative frequency before the third quartile class
$f_{Q_3}$ = frequency of the third quartile class
$C_{Q_3}$ = third quartile class size

2.5 Measure Of Dispersion

A measure of dispersion is used to understand the spread or variability of a set of data about
the mean. It gives additional information for judging the reliability of the measure of central
tendency and helps in comparing the dispersion present in different samples. The measures of
dispersion discussed in this section are the range, variance and standard deviation.

2.5.1 Range
The simplest measure of dispersion is the range, which is the difference between the largest
and the smallest values in the data.

For ungrouped data:

Range = Largest data value – Smallest data value

For grouped data:

Range = Upper class boundary of the last class – Lower class boundary of the first class

2.5.2 Variance And Standard Deviation


The sample variance is the sum of the squared differences between each data value and the
mean, divided by the sample size minus one. The standard deviation is the square root of the
variance and measures the amount of variation or dispersion in a set of values. Both the
variance and the standard deviation have formulas for grouped and ungrouped data, and for a
population and a sample:

2.5.3 POPULATION VARIANCE

Ungrouped data

$$\sigma^2 = \frac{1}{N} \sum (X - \mu)^2$$

Where,

$\sigma^2$ = population variance
$X$ = observation
$N$ = total number of observations in the population
$\sum$ = sum of all values
$\mu$ = population mean

Grouped data

$$\sigma^2 = \frac{1}{N} \sum fx^2 - \left( \frac{\sum fx}{N} \right)^2$$

where $f$ is the class frequency, $x$ is the class midpoint and $N$ is the total number of
observations in the population.

2.5.4 POPULATION STANDARD DEVIATION

Ungrouped data

$$\sigma = \sqrt{\frac{1}{N} \sum (X - \mu)^2}$$

$\sigma$ = population standard deviation
$N$ = total number of observations in the population
$X$ = observation
$\mu$ = population mean

Grouped data

$$\sigma = \sqrt{\frac{1}{N} \sum fx^2 - \left( \frac{\sum fx}{N} \right)^2}$$

where $f$ is the class frequency and $x$ is the class midpoint.

2.5.5 SAMPLE VARIANCE

Ungrouped data

$$S^2 = \frac{1}{n-1} \left( \sum x^2 - \frac{(\sum x)^2}{n} \right)$$

$x$ = observation or value
$n$ = number of observations in the sample
$\sum x^2$ = sum of all the squares of observations

Grouped data

$$S^2 = \frac{1}{n-1} \left( \sum fx^2 - \frac{(\sum fx)^2}{n} \right)$$

2.5.6 SAMPLE STANDARD DEVIATION

Ungrouped data

$$S = \sqrt{\frac{1}{n-1} \left( \sum x^2 - \frac{(\sum x)^2}{n} \right)}$$

Grouped data

$$S = \sqrt{\frac{1}{n-1} \left( \sum fx^2 - \frac{(\sum fx)^2}{n} \right)}$$
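As an illustration, the ungrouped sample formulas can be evaluated in Python on the data from Table 1. The function names are ours. The range is computed alongside because it needs only the largest and smallest values.

```python
from math import sqrt

# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]


def sample_variance(values):
    """S^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def sample_std(values):
    """S is the square root of the sample variance."""
    return sqrt(sample_variance(values))


range_cases = max(cases) - min(cases)
range_deaths = max(deaths) - min(deaths)
var_cases, sd_cases = sample_variance(cases), sample_std(cases)
var_deaths, sd_deaths = sample_variance(deaths), sample_std(deaths)

print(range_cases, round(var_cases, 3), round(sd_cases, 3))
print(range_deaths, round(var_deaths, 3), round(sd_deaths, 3))
```

The variances and standard deviations agree with the values in the appendix (435357.730 and 659.816 for reported cases, 212.547 and 14.579 for total death).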

2.6 MEASURE OF SKEWNESS


A distribution that is not symmetrical is called a skewed distribution, and it can be either
positively or negatively skewed. In a skewed distribution the mean, median and mode have
different values and one tail is longer than the other.

Negatively skewed distribution:

If the frequency curve has a longer tail to the left, the distribution is known as a negatively
skewed distribution and Mean < Median < Mode.

Positively skewed distribution:

If the frequency curve has a longer tail to the right, the distribution is known as a positively
skewed distribution and Mean > Median > Mode.

2.7 BOX-and-WHISKER PLOT

The box-and-whisker plot is a useful graphical method for representing data using five values:
the minimum, the first quartile, the median, the third quartile and the maximum. It shows the
shape of the data distribution and can also reveal whether there are any outliers in the data.
Figure 4 below shows box-and-whisker plots for various types of distribution.

Figure 4

Based on the figure above, the first plot shows a symmetrical (normal) distribution, where the
right and left whiskers are the same length. The second plot shows a positively skewed
distribution (skewed to the right), where the right whisker is longer than the left whisker. The
last plot shows a negatively skewed distribution (skewed to the left), where the left whisker is
longer than the right whisker.
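To show how outliers might be flagged, the sketch below uses the common 1.5 × IQR convention: any value beyond the fences Q1 − 1.5·IQR and Q3 + 1.5·IQR is treated as an outlier. This rule is an assumption and is not stated in this report. The quartiles are computed with the (n + 1)/4 position rule, so the fences may differ slightly from those drawn by the statistical software used for the figures.

```python
# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]


def quartile(values, k):
    """k-th quartile at position k(n + 1)/4, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4
    whole = min(max(int(position), 1), n - 1)
    return ordered[whole - 1] + (position - whole) * (ordered[whole] - ordered[whole - 1])


def find_outliers(values):
    """Values outside the 1.5 x IQR fences (a convention assumed here)."""
    q1, q3 = quartile(values, 1), quartile(values, 3)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return sorted(v for v in values if v < lower_fence or v > upper_fence)


outliers_cases = find_outliers(cases)
outliers_deaths = find_outliers(deaths)
print(outliers_cases, outliers_deaths)
```

Under this convention no yearly case count is flagged, while the 64 deaths recorded in 1978 lie above the upper fence of 40.5.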

2.8 PEARSON COEFFICIENT OF SKEWNESS


The Pearson coefficient of skewness is interpreted in one of three ways:

i. If skewness = 0, the distribution is symmetrical.

ii. If skewness > 0, the distribution is skewed to the right.

iii. If skewness < 0, the distribution is skewed to the left.
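The report does not state which formula for the coefficient is intended. One common choice is Pearson's second coefficient of skewness, 3(mean − median)/s, and the sketch below assumes that form. The skewness values in the appendix (1.105 and 1.863) appear to come from a different, moment-based statistic, so the numbers are not expected to match, only the sign.

```python
from statistics import mean, median, stdev

# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]


def pearson_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


skew_cases = pearson_skewness(cases)
skew_deaths = pearson_skewness(deaths)
print(round(skew_cases, 3), round(skew_deaths, 3))
```

Both coefficients are positive, which is consistent with the right-skewed histograms described in Chapter 3.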

2.9 CORRELATION

Correlation analysis is used to analyse the relationship between two variables, that is, to
measure how closely two data series are related. In particular, the correlation coefficient
measures the direction and extent of the linear association between two variables. There are
several types of correlation coefficient, including the Pearson product moment correlation
coefficient, which is normally denoted by r. Pearson's correlation coefficient tells us two things
about the relationship between two quantitative variables: the sign of r (- or +) identifies the
direction of the relationship, and the magnitude of r describes its strength. The value of r lies
between -1.0 and 1.0.

The mathematical formula for Pearson's correlation coefficient r is

$$r = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\left[ \sum x_i^2 - \frac{(\sum x_i)^2}{n} \right] \left[ \sum y_i^2 - \frac{(\sum y_i)^2}{n} \right]}}$$

r = Correlation coefficient

n = number of observations

x = independent variable

y = dependent variable
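A minimal Python sketch of this formula, applied to the data from Table 1 with the number of reported cases as x and total death as y, is given below. The function name `pearson_r` is ours.

```python
from math import sqrt

# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]


def pearson_r(x, y):
    """Pearson correlation coefficient from the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(a * a for a in x) - sum_x ** 2 / n
    s_yy = sum(b * b for b in y) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(cases, deaths)
print(round(r, 3))
```

The result rounds to 0.743, the correlation reported in Chapter 3.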

2.9.1 Characteristics of the correlation coefficient

The value of r always satisfies -1 ≤ r ≤ 1.

A value of r greater than 0 indicates a positive linear association between the two variables.

A value of r less than 0 indicates a negative linear association between the two variables.

A value of r equal to 0 indicates no linear relation between the two variables.

Strength of the Correlation Coefficient

|r| = 1 : Perfect Correlation

0.8 ≤ |r| < 1 : Strong Correlation

0.5 < |r| < 0.8 : Moderate Correlation

0 < |r| ≤ 0.5 : Weak Correlation

|r| = 0 : No Correlation
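The scale above can be written as a small Python helper. The sketch treats the bands as non-overlapping, so a value of exactly 1 is perfect and a value of exactly 0 is no correlation, which is our reading of the list.

```python
def correlation_strength(r):
    """Classify r using the strength scale listed above."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("r must lie between -1 and 1")
    if magnitude == 1:
        return "perfect"
    if magnitude >= 0.8:
        return "strong"
    if magnitude > 0.5:
        return "moderate"
    if magnitude > 0:
        return "weak"
    return "none"


print(correlation_strength(0.743))
```

With this scale, the value r = 0.743 found in this study falls in the moderate band.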

2.9.2 Regression

The basic regression model consists of one independent variable and one dependent variable.
The relationship between these two variables is studied as follows:

1. Collect the data and construct a scatter plot. The purpose of the scatter plot, as indicated
previously, is to determine the nature of the relationship; the possibilities include a positive
linear relationship, a negative linear relationship, a curvilinear relationship, or no discernible
relationship.

2. Compute the value of the correlation coefficient and test the significance of the relationship.

3. If the value of the correlation coefficient is significant, determine the equation of the
regression line, which is the data's line of best fit. (Note: Determining the regression line when
r is not significant and then making predictions using the regression line are meaningless.)
The purpose of the regression line is to enable the researcher to see the trend and to make
predictions on the basis of the data.

The simple linear model can be stated as follows:

$$y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Where,

$y_i$ = the value of the response variable in the ith trial

$\beta_0$ and $\beta_1$ = regression coefficients or parameters

$X_i$ = a known constant, the value of the independent variable in the ith trial

$\varepsilon_i$ = a random error with mean $E(\varepsilon_i) = 0$ and variance $V(\varepsilon_i) = \sigma^2$

2.9.3 Fitting a Straight Line


Several lines can be drawn on the graph near the points, so the line of best fit must be
identified. Best fit means that the sum of the squares of the vertical distances from each point
to the line is at a minimum. The line of best fit is needed because the values of the dependent
variable y will be predicted from the values of the independent variable x. Hence, the closer the
points are to the line, the better the fit and the prediction will be.

The prediction regression line is expressed as $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$
are estimates of $\beta_0$ and $\beta_1$ respectively. $\beta_1$ is the slope of the regression
line and indicates the change in the mean of Y per unit increase in X. The parameter $\beta_0$
is the Y intercept of the regression line, the mean of Y when X is equal to zero. The method of
ordinary least squares is used to find good estimators of the regression parameters $\beta_0$
and $\beta_1$. The formulas for the ordinary least squares method are:

$$b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
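The least squares formulas can be sketched in Python on the data from Table 1, with the number of reported cases as x and total death as y. The function name and the example input of 1000 cases are ours, chosen for illustration.

```python
# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(a * a for a in x) - sum_x ** 2 / n
    b1 = s_xy / s_xx                      # slope
    b0 = sum_y / n - b1 * sum_x / n       # intercept: y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(cases, deaths)
predicted_at_1000 = b0 + b1 * 1000        # predicted deaths for a hypothetical 1000 cases
print(round(b0, 3), round(b1, 3), round(predicted_at_1000, 1))
```

The estimates round to the regression equation reported in Chapter 3, ŷ = 1.146 + 0.016 x.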

2.9.4 Coefficient of Determination


The coefficient of determination is the ratio of the explained variation to the total variation,
and it is normally denoted by $R^2$. In other words, the value of $R^2$ tells how much of the
variability in Y can be explained by its relationship with X. For the simple linear regression line
of y on x, the coefficient of determination is the square of the correlation coefficient, r. We can
therefore state that:

$$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = r^2$$
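As a sketch of this ratio, the Python code below fits the least squares line to the data from Table 1 and then divides the explained variation, the sum of squared deviations of the fitted values from the mean of y, by the total variation, the sum of squared deviations of the observed values from the mean of y. The variable names are ours.

```python
# Yearly values for Malaysia, 1971-2000, copied from Table 1.
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

n = len(cases)
x_bar = sum(cases) / n
y_bar = sum(deaths) / n

# Least squares estimates of the slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(cases, deaths))
s_xx = sum((x - x_bar) ** 2 for x in cases)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Explained variation uses the fitted values; total variation uses the observed values.
fitted = [b0 + b1 * x for x in cases]
explained = sum((y_hat - y_bar) ** 2 for y_hat in fitted)
total = sum((y - y_bar) ** 2 for y in deaths)
r_squared = explained / total

print(round(r_squared, 3))
```

The ratio rounds to 0.552, which is also the square of r = 0.743.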

2.9.5 Regression equation line


$$b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

CHAPTER 3: RESULTS AND INTERPRETATION

3.1 Data Representation

Table 1 : Number of Reported Cases and Total Death Caused by Cholera in Malaysia From Year 1971 To 2000

Year    Number of Reported Cases    Total Death
1971    53                          1
1972    864                         11
1973    369                         17
1974    349                         0
1975    110                         8
1976    246                         4
1977    444                         12
1978    1635                        64
1979    502                         10
1980    97                          7
1981    469                         14
1982    516                         17
1983    2195                        38
1984    67                          1
1985    52                          2
1986    55                          2
1987    1168                        18
1988    1324                        32
1989    393                         14
1990    2071                        38
1991    506                         6
1992    474                         8
1993    995                         13
1994    534                         0
1995    2209                        27
1996    1486                        0
1997    389                         4
1998    1304                        19
1999    535                         0
2000    124                         1

3.2 DESCRIPTIVE STATISTICS ANALYSIS

3.2.1 Histogram

Figure 5

The histogram in Figure 5 shows the distribution of the number of reported cases of cholera in
Malaysia over the 30 years of observation from 1971 to 2000. Based on the histogram, the
highest yearly number of reported cases is above 2000 and the lowest is about 50. The
distribution is skewed to the right. The mean and standard deviation are 717.83 and 659.816
respectively.

3.2.2 Histogram

Figure 6

The histogram in Figure 6 shows the distribution of total deaths caused by cholera in Malaysia
from 1971 to 2000. Based on the histogram, the highest yearly number of deaths is above 60
and the lowest is 0. The distribution is skewed to the right. The mean and standard deviation
are 12.93 and 14.579 respectively.

3.2.3 Box Plot

Figure 7

Based on the box plot in Figure 7, the median number of reported cases of cholera from 1971
to 2000 is 488.00. The interquartile range is about 987 reported cases, which means that in the
middle 50% of years Malaysia had between 215.50 and 1202.0 reported cases.

3.2.4 Box Plot

Figure 8

Based on the box plot in Figure 8, the median total death caused by cholera from 1971 to 2000
is 9.00. The interquartile range is about 16 deaths, which means that in the middle 50% of
years Malaysia had between 1.75 and 17.25 deaths.

3.2.5 Descriptive

Figure 9

From the table above, the minimum and maximum values for the number of reported cases are
52 and 2209 respectively. The mean and standard deviation for the number of reported cases
are 717.83 and 659.816. The minimum value of total deaths caused by cholera in Malaysia is 0
while the maximum value is 64. Lastly, the mean and standard deviation for total deaths are
12.93 and 14.579.

3.3 CORRELATION AND REGRESSION

3.3.1 Scatter Plot

Figure 10

This scatter plot suggests a positive correlation between the number of reported cases and
total deaths caused by cholera in Malaysia from 1971 to 2000.

3.3.2 Correlation

Figure 11

The value of r = 0.743 suggests a moderate positive correlation between the number of
reported cases and total deaths caused by cholera in Malaysia from 1971 to 2000. That is, the
higher the number of reported cases, the higher the total deaths due to this disease.

3.3.3 Regression

Figure 12

Figure 13

The coefficient of determination, R2 = 0.552, means that 55.2% of the variability in total deaths
can be explained by the number of reported cases. The remaining 44.8% of the variability in
total deaths is unexplained.

Figure 14

Figure 15

The regression equation is ŷ = 1.146 + 0.016 x. The value of b1 = 0.016 means that for every
additional reported case, the predicted total deaths increase by 0.016.

3.3.4 Fitting A Straight Line

Figure 16

Interpret the slope:

If the number of reported cases increases by 1, the predicted total deaths increase by 0.016.

Interpret the intercept:

If the number of reported cases is 0, the predicted total deaths is 1.146.

CHAPTER 4: CONCLUSION

4.1 Report Summary

From this study, it can be concluded that the relationship between the number of reported
cases and total deaths caused by cholera in Malaysia is a positive correlation. The graph that is
suitable for this data is the histogram. For total deaths, the mean is 12.93, the standard
deviation is 14.579 and the interquartile range is about 16. The correlation is 0.743, which
suggests a moderate relationship between the number of reported cases and total deaths
caused by cholera in Malaysia from 1971 to 2000. The regression equation is ŷ = 1.146 +
0.016 x. The value of b1 = 0.016 means that for every additional reported case, the predicted
total deaths increase by 0.016.

REFERENCES

1. Standard deviation. (2020, June 7). Retrieved from
https://en.m.wikipedia.org/wiki/Standard_deviation

2. Number of reported cases of cholera. (n.d.). Retrieved June 11, 2020, from
https://www.who.int/data/gho/data/indicators/indicator-details/GHO/number-of-reported-cases-of-cholera

APPENDIX

FREQUENCIES VARIABLES=Number_Of_Reported_Cases Total_Death
/ORDER=ANALYSIS.

Frequencies

Statistics
                 Number_Of_Reported_Cases   Total_Death
N    Valid                             30            30
     Missing                            0             0

Frequency Table

Number_Of_Reported_Cases
Value Frequency Percent Valid Percent Cumulative Percent
Valid 52 1 3.3 3.3 3.3
53 1 3.3 3.3 6.7
55 1 3.3 3.3 10.0
67 1 3.3 3.3 13.3
97 1 3.3 3.3 16.7
110 1 3.3 3.3 20.0
124 1 3.3 3.3 23.3
246 1 3.3 3.3 26.7
349 1 3.3 3.3 30.0
369 1 3.3 3.3 33.3
389 1 3.3 3.3 36.7
393 1 3.3 3.3 40.0
444 1 3.3 3.3 43.3
469 1 3.3 3.3 46.7
474 1 3.3 3.3 50.0
502 1 3.3 3.3 53.3
506 1 3.3 3.3 56.7
516 1 3.3 3.3 60.0

534 1 3.3 3.3 63.3
535 1 3.3 3.3 66.7
864 1 3.3 3.3 70.0
995 1 3.3 3.3 73.3
1168 1 3.3 3.3 76.7
1304 1 3.3 3.3 80.0
1324 1 3.3 3.3 83.3
1486 1 3.3 3.3 86.7
1635 1 3.3 3.3 90.0
2071 1 3.3 3.3 93.3
2195 1 3.3 3.3 96.7
2209 1 3.3 3.3 100.0
Total 30 100.0 100.0

Total_Death
Value Frequency Percent Valid Percent Cumulative Percent
Valid 0 4 13.3 13.3 13.3
1 3 10.0 10.0 23.3
2 2 6.7 6.7 30.0
4 2 6.7 6.7 36.7
6 1 3.3 3.3 40.0
7 1 3.3 3.3 43.3
8 2 6.7 6.7 50.0
10 1 3.3 3.3 53.3
11 1 3.3 3.3 56.7
12 1 3.3 3.3 60.0
13 1 3.3 3.3 63.3
14 2 6.7 6.7 70.0
17 2 6.7 6.7 76.7
18 1 3.3 3.3 80.0
19 1 3.3 3.3 83.3
27 1 3.3 3.3 86.7
32 1 3.3 3.3 90.0
38 2 6.7 6.7 96.7
64 1 3.3 3.3 100.0
Total 30 100.0 100.0
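
The summary statistics for Total_Death can be recomputed directly from this frequency table. The sketch below is an independent check in Python rather than SPSS; the dictionary simply transcribes the value and frequency columns above.

```python
import statistics

# Total_Death frequency table from the SPSS output: value -> frequency
death_freq = {0: 4, 1: 3, 2: 2, 4: 2, 6: 1, 7: 1, 8: 2, 10: 1, 11: 1, 12: 1,
              13: 1, 14: 2, 17: 2, 18: 1, 19: 1, 27: 1, 32: 1, 38: 2, 64: 1}

# Expand the table back into the 30 yearly observations
deaths = [value for value, count in death_freq.items() for _ in range(count)]

mean = statistics.mean(deaths)
median = statistics.median(deaths)
mode = statistics.mode(deaths)
std_dev = statistics.stdev(deaths)   # sample standard deviation (divides by n - 1)

print(f"n = {len(deaths)}, mean = {mean:.2f}, median = {median}, "
      f"mode = {mode}, std dev = {std_dev:.3f}")
```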

FREQUENCIES VARIABLES=Number_Of_Reported_Cases Total_Death
/FORMAT=NOTABLE
/NTILES=4
/STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM MEAN MEDIAN MODE SKEWNESS SESKEW
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.

Frequencies

Statistics
                          Number_Of_Reported_Cases   Total_Death
N              Valid                            30            30
               Missing                           0             0
Mean                                        717.83         12.93
Median                                      488.00          9.00
Mode                                        52 (a)             0
Std. Deviation                             659.816        14.579
Variance                                435357.730       212.547
Skewness                                     1.105         1.863
Std. Error of Skewness                        .427          .427
Range                                         2157            64
Minimum                                         52             0
Maximum                                       2209            64
Percentiles    25                           215.50          1.75
               50                           488.00          9.00
               75                          1202.00         17.25
a. Multiple modes exist. The smallest value is shown.
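
The quartiles of Total_Death in the table above can be reproduced from the frequency table. This is a sketch that assumes the percentile rule based on position (n + 1)p, which Python's `statistics.quantiles` calls the "exclusive" method; that it matches the SPSS default is inferred from the identical results, not taken from the report.

```python
import statistics

# Total_Death frequency table from the SPSS output: value -> frequency
death_freq = {0: 4, 1: 3, 2: 2, 4: 2, 6: 1, 7: 1, 8: 2, 10: 1, 11: 1, 12: 1,
              13: 1, 14: 2, 17: 2, 18: 1, 19: 1, 27: 1, 32: 1, 38: 2, 64: 1}
deaths = [value for value, count in death_freq.items() for _ in range(count)]

# Quartiles using positions (n + 1) * p with linear interpolation
q1, q2, q3 = statistics.quantiles(deaths, n=4, method="exclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The interquartile range computed this way is 15.5; the value 16 quoted elsewhere in the report is consistent with the same figure displayed without decimals, though that is an assumption about the output formatting.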

Histogram

* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Number_Of_Reported_Cases
MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Number_Of_Reported_Cases=col(source(s),
name("Number_Of_Reported_Cases"))
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("Number_Of_Reported_Cases"))
GUIDE: text.title(label("1-D Boxplot of Number_Of_Reported_Cases"))
ELEMENT: schema(position(bin.quantile.letter(Number_Of_Reported_Cases)),
label(id))
END GPL.

GGraph

* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Total_Death MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Total_Death=col(source(s), name("Total_Death"))
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("Total_Death"))
GUIDE: text.title(label("1-D Boxplot of Total_Death"))
ELEMENT: schema(position(bin.quantile.letter(Total_Death)), label(id))
END GPL.

GGraph

DESCRIPTIVES VARIABLES=Number_Of_Reported_Cases Total_Death
/STATISTICS=MEAN STDDEV MIN MAX.

Descriptives

Descriptive Statistics
                            N   Minimum   Maximum     Mean   Std. Deviation
Number_Of_Reported_Cases   30        52      2209   717.83          659.816
Total_Death                30         0        64    12.93           14.579
Valid N (listwise)         30

* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Number_Of_Reported_Cases
Total_Death MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
/FITLINE TOTAL=NO.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Number_Of_Reported_Cases=col(source(s),
name("Number_Of_Reported_Cases"))
DATA: Total_Death=col(source(s), name("Total_Death"))
GUIDE: axis(dim(1), label("Number_Of_Reported_Cases"))
GUIDE: axis(dim(2), label("Total_Death"))
GUIDE: text.title(label("Simple Scatter of Total_Death by
Number_Of_Reported_Cases"))
ELEMENT: point(position(Number_Of_Reported_Cases*Total_Death))
END GPL.

GGraph

CORRELATIONS
/VARIABLES=Number_Of_Reported_Cases Total_Death
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE.

Correlations

Correlations
                                                 Number_Of_Reported_Cases   Total_Death
Number_Of_Reported_Cases   Pearson Correlation                          1        .743**
                           Sig. (2-tailed)                                         .000
                           N                                           30            30
Total_Death                Pearson Correlation                     .743**             1
                           Sig. (2-tailed)                           .000
                           N                                           30            30
**. Correlation is significant at the 0.01 level (2-tailed).

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Total_Death
/METHOD=ENTER Number_Of_Reported_Cases.

Regression

Variables Entered/Removed (a)
Model   Variables Entered              Variables Removed   Method
1       Number_Of_Reported_Cases (b)   .                   Enter
a. Dependent Variable: Total_Death
b. All requested variables entered.

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .743 (a)   .552       .536                9.927
a. Predictors: (Constant), Number_Of_Reported_Cases

ANOVA (a)
Model            Sum of Squares   df   Mean Square        F   Sig.
1   Regression         3404.534    1      3404.534   34.547   .000 (b)
    Residual           2759.333   28        98.548
    Total              6163.867   29
a. Dependent Variable: Total_Death
b. Predictors: (Constant), Number_Of_Reported_Cases
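
The mean squares, the F statistic and the standard error of the estimate all follow from the sums of squares in the ANOVA table. A minimal sketch of that arithmetic, using only the figures printed above:

```python
import math

ss_regression = 3404.534   # sum of squares explained by the model
ss_residual = 2759.333     # sum of squares left unexplained
df_regression = 1          # one predictor
df_residual = 28           # n - 2 with n = 30

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"MS residual = {ms_residual:.3f}")
print(f"F = {f_stat:.3f}")
print(f"Std. error of the estimate = {std_error_estimate:.3f}")
```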

Coefficients (a)
                                 Unstandardized Coefficients   Standardized Coefficients
Model                            B        Std. Error           Beta                        t       Sig.
1   (Constant)                   1.146    2.703                                            .424    .675
    Number_Of_Reported_Cases     .016     .003                 .743                        5.878   .000
a. Dependent Variable: Total_Death

* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Number_Of_Reported_Cases
Total_Death MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
/FITLINE TOTAL=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Number_Of_Reported_Cases=col(source(s),
name("Number_Of_Reported_Cases"))
DATA: Total_Death=col(source(s), name("Total_Death"))
GUIDE: axis(dim(1), label("Number_Of_Reported_Cases"))
GUIDE: axis(dim(2), label("Total_Death"))
GUIDE: text.title(label("Simple Scatter with Fit Line of Total_Death by ",
"Number_Of_Reported_Cases"))
ELEMENT: point(position(Number_Of_Reported_Cases*Total_Death))
END GPL.

GGraph

EXAMINE VARIABLES=Number_Of_Reported_Cases Total_Death
/PLOT BOXPLOT STEMLEAF
/COMPARE GROUPS
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.

Explore
Notes
Output Created: 13-JUN-2020 16:21:30
Comments:
Input
  Data: C:\Users\User\Documents\sta 108 dania.spv.sav
  Active Dataset: DataSet1
  Filter: <none>
  Weight: <none>
  Split File: <none>
  N of Rows in Working Data File: 30
Missing Value Handling
  Definition of Missing: User-defined missing values for dependent variables are treated as missing.
  Cases Used: Statistics are based on cases with no missing values for any dependent variable or factor used.
Syntax
  EXAMINE VARIABLES=Number_Of_Reported_Cases Total_Death
  /PLOT BOXPLOT STEMLEAF
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
Resources
  Processor Time: 00:00:01.36
  Elapsed Time: 00:00:01.50

[DataSet1] C:\Users\User\Documents\sta 108 dania.spv.sav

Case Processing Summary
                                              Cases
                            Valid            Missing          Total
                            N    Percent     N    Percent     N    Percent
Number_Of_Reported_Cases    30   100.0%      0    0.0%        30   100.0%
Total_Death                 30   100.0%      0    0.0%        30   100.0%

Descriptives

                                                                          Statistic   Std. Error
Number_Of_Reported_Cases   Mean                                              717.83      120.465
                           95% Confidence Interval for Mean, Lower Bound     471.45
                           95% Confidence Interval for Mean, Upper Bound     964.21
                           5% Trimmed Mean                                   672.22
                           Median                                            488.00
                           Variance                                      435357.730
                           Std. Deviation                                   659.816
                           Minimum                                               52
                           Maximum                                             2209
                           Range                                               2157
                           Interquartile Range                                  987
                           Skewness                                           1.105         .427
                           Kurtosis                                            .180         .833
Total_Death                Mean                                               12.93        2.662
                           95% Confidence Interval for Mean, Lower Bound       7.49
                           95% Confidence Interval for Mean, Upper Bound      18.38
                           5% Trimmed Mean                                    11.30
                           Median                                              9.00
                           Variance                                         212.547
                           Std. Deviation                                    14.579
                           Minimum                                                0
                           Maximum                                               64
                           Range                                                 64
                           Interquartile Range                                   16
                           Skewness                                           1.863         .427
                           Kurtosis                                           4.159         .833
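
The standard error and 95% confidence interval for the mean of Total_Death can be checked by hand. The sketch below assumes a two-sided critical value of about 2.045 from a t table with 29 degrees of freedom; that constant is not part of the SPSS output and is hard-coded here for illustration.

```python
import math

n = 30
mean = 12.93       # mean of Total_Death from the output
std_dev = 14.579   # sample standard deviation from the output
t_crit = 2.045     # assumed t critical value, 95% two-sided, df = 29

std_error = std_dev / math.sqrt(n)   # standard error of the mean
margin = t_crit * std_error
lower, upper = mean - margin, mean + margin

print(f"Std. error = {std_error:.3f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

Small differences in the second decimal place relative to the SPSS bounds come from using the rounded mean and critical value.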

Number_Of_Reported_Cases

Number_Of_Reported_Cases Stem-and-Leaf Plot

Frequency    Stem &  Leaf
   15.00        0 .  000001123333444
    7.00        0 .  5555589
    4.00        1 .  1334
    1.00        1 .  6
    3.00        2 .  012

Stem width: 1000
Each leaf: 1 case(s)

Total_Death

Total_Death Stem-and-Leaf Plot

Frequency    Stem &  Leaf
   11.00        0 .  00001112244
    4.00        0 .  6788
    6.00        1 .  012344
    4.00        1 .  7789
     .00        2 .
    1.00        2 .  7
    1.00        3 .  2
    2.00        3 .  88
    1.00 Extremes    (>=64)

Stem width: 10
Each leaf: 1 case(s)
