
PROJECT STA 108
NUMBER OF REPORTED CASES AND TOTAL DEATH CAUSED BY CHOLERA IN MALAYSIA FROM YEAR 1971 TO 2000

NAME : 1) NUR DANIA BINTI AMAN SHAH (2018881976)
2) JASMIN SYAFIKAH BINTI JAMAL ASRI (2018406686)
3) NOORFATIHAH BINTI HANIPIAH (2018202014)
GROUP : AS1204_M
DISTRIBUTED TO : SIR ZULKIFLI BIN MOHD GHAZALI
Table of Contents
CHAPTER 1: INTRODUCTION
1.1 Background of Study
1.2 Objectives of Study
1.3 Significance of Study
CHAPTER 2: METHODOLOGY
2.1 Data Description
2.1.1 Population
2.2 Graphical Technique
2.3 Numerical Technique
2.3.1 Measure Of Central Tendency
2.3.2 Mean
2.3.3 Median
2.3.4 Mode
2.4 Measure Of Location
2.4.1 The first and third quartiles
2.5 MEASURES OF DISPERSION
2.5.6 SAMPLE STANDARD DEVIATION
2.6 MEASURE OF SKEWNESS
2.7 BOX-and-WHISKER PLOT
2.8 PEARSON COEFFICIENT OF SKEWNESS
2.9 CORRELATION
2.9.1 Characteristics of the correlation coefficient
Strength of the Correlation Coefficient
2.9.2 Regression
2.9.3 Fitting a Straight Line
2.9.4 Coefficient of Determination
CHAPTER 3: RESULTS AND INTERPRETATION
3.1 Data Representation
3.2.1 Histogram
3.3.1 Scatter Plot
CHAPTER 4: CONCLUSION
4.1 Report Summary
REFERENCES
APPENDIX
CHAPTER 1: INTRODUCTION
1.1 Background of Study
Cholera is an illness caused by infection of the intestine with the toxigenic bacterium Vibrio
cholerae. The deadly effects of the disease result from a toxin that the bacteria produce in the
small intestine. The toxin causes the body to secrete enormous amounts of water, leading to
diarrhea and a rapid loss of fluids and salts. In Malaysia, 21,535 cases were reported from 1971
to 2000, but only 388 deaths were attributed to cholera over the same period. This study was
undertaken to analyse the relationship between the number of reported cases and the total deaths
caused by cholera in Malaysia.
In this study, the number of reported cases is the independent variable, while the total deaths
caused by cholera in Malaysia is the dependent (response) variable, because the total deaths
depend on the number of reported cases. The data show a positive correlation of 0.7432. This value
suggests a moderate relationship between the number of reported cases and the total deaths caused
by cholera in Malaysia from 1971 to 2000: the higher the number of reported cases, the higher the
total deaths.
1.2 Objectives of Study
The objectives of this study are:
1) To determine the relationship between the number of reported cases and the total deaths.
2) To identify the types of graph that are suitable for the data.
3) To find the values of the mean, standard deviation and interquartile range.
4) To determine the correlation and regression of the data.
1.3 Significance of Study
The data for this study are easy to access since they are already available on the World Health
Organisation (WHO) website. Using them saves time and money because we do not need to collect
the data on our own. Secondary data of this kind are much cheaper than primary data and can be
obtained rapidly. The data can also be considered reliable, since they were compiled by expert
researchers.
1.4 Limitation of Study
The limitation of this study is that the accuracy of the data cannot be checked by questioning
the people who collected them, since the data were taken as published on the World Health
Organisation (WHO) website. In addition, the data were collected by other researchers for their
own purposes, so they may not match the objectives of this study exactly.
CHAPTER 2: METHODOLOGY
2.1 Data Description
2.1.1 Population
The population used in this study is the number of reported cases and total deaths caused by
cholera from 1971 to 2000 in all countries of the world.
2.1.2 Samples
The sample used in this study is the number of reported cases and total deaths caused by
cholera in Malaysia from 1971 to 2000.
2.1.3 Data collection method
No data collection method was used in this study because the data are secondary data that were
already available.
2.1.4 Sampling Technique
No sampling technique was used in this study because the data are secondary data that were
already available.
2.1.5 Variables
The variables used in this study are the number of reported cases and the total deaths caused
by cholera in Malaysia from 1971 to 2000, with 30 observations for each variable. In statistics,
a quantitative variable is either discrete or continuous. A continuous variable can take any
value within an interval and is obtained by measurement, whereas a discrete variable takes
countable values. Both variables in this study are discrete because they are counts of cases and
deaths.
2.1.6 Measurement scale
There are four measurement scales in statistics: nominal, ordinal, interval and ratio. The
measurement scale used in this study is ratio. A ratio scale is an ordered scale in which the
differences between measurements are meaningful and which has a true zero point. In our data,
the total deaths caused by cholera is zero in 1974, 1994, 1996 and 1999, which means that no
deaths from cholera were reported in those years. The interval scale is like the ratio scale
except that it has no true zero point. The nominal scale was not chosen because our data are
quantitative, whereas nominal data are qualitative. The ordinal scale was not chosen because the
data are counts rather than ranked categories.
2.2 Graphical Technique
Because the data in this study can be grouped into a frequency distribution, the histogram was
chosen. As shown in Figure 1, the height of each bar represents the frequency of the class. The
histogram uses the class frequency on the y-axis and the class boundaries on the x-axis.
Figure 1
Figure 2
Figure 3 below shows the scatter diagram. A scatter diagram shows the nature of the relationship
between two quantitative variables, the dependent variable and the independent variable. It can
also be used to describe the type of correlation and to identify how close the relationship
between the two variables is. The possible types are positive correlation, negative correlation,
no correlation, curvilinear correlation and perfect positive correlation. In a positive
correlation, the dependent variable (y-axis) and the independent variable (x-axis) move in the
same direction: as the x-values increase, the y-values also increase. In a negative correlation,
the two variables move in opposite directions: as the independent variable (x-axis) increases,
the dependent variable (y-axis) decreases.
Figure 3
Based on Figure 3 above, the scatter diagram shows a positive correlation. There is a positive
relationship between the two variables: as the independent variable on the x-axis (number of
reported cases) increases, the dependent variable on the y-axis (total deaths) also increases.
2.3 Numerical Technique
2.3.1 Measure Of Central Tendency

A measure of central tendency is what is most often called an average in statistics. It is a
single value located at the centre of a data set that can be taken as a summary value for that
data set. Three types of average are often used as measures of central tendency: the mean, the
median and the mode. Each can be computed for grouped or ungrouped data. Ungrouped data are not
given in the form of a frequency table or frequency distribution, while grouped data are
tabulated in a frequency table or frequency distribution.
2.3.2 Mean
The mean is the average of the data. It is the sum of all the observations divided by the
number of observations. It can be calculated for both grouped and ungrouped data.
Ungrouped data:
\[ \bar{x} = \frac{\sum x}{n} \]
Grouped data:
\[ \bar{x} = \frac{\sum fx}{n} \]
where \( f \) is the class frequency, \( x \) is the class midpoint and \( n = \sum f \).
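As a minimal sketch of the two mean formulas above, the Python functions below compute the mean for ungrouped data and for grouped data. The function names and the small data sets are hypothetical illustrations, not values from this study.

```python
def mean_ungrouped(values):
    """Ungrouped mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def mean_grouped(midpoints, frequencies):
    """Grouped mean: sum(f * x) / n, where x is the class midpoint and n = sum(f)."""
    n = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / n


# Hypothetical illustration data (not taken from the report).
sample = [2, 4, 4, 6, 9]
print(mean_ungrouped(sample))                # 25 / 5 = 5.0
print(mean_grouped([5, 15, 25], [2, 5, 3]))  # (10 + 75 + 75) / 10 = 16.0
```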
2.3.3 Median
The median is the middle value of the data when they are arranged in ascending order. It is
interpreted as follows: 50% of the observations have a value less than the median, while the
other 50% have a value greater than the median.
Ungrouped Data
Steps to calculate it:
i. Arrange the data in ascending order.
ii. Find the position of the median, \( \frac{n+1}{2} \).
iii. Find the value of the median.
For the special case of ungrouped data given in a frequency table:
1. Construct a table that includes the cumulative frequency.
2. Find the position of the median, \( \frac{n+1}{2} \).
3. Locate this position in the cumulative frequency column.
4. The value of the median is in column x.
Grouped Data
Steps to calculate it:
i. Construct a table that includes the cumulative frequency, class boundaries and position.
ii. Find the position of the median, \( \frac{n+1}{2} \).
iii. Locate this position in the cumulative frequency column to find the median class.
iv. Use the formula:
\[ \tilde{x} = L_m + \left[ \frac{\frac{\sum f}{2} - \sum f_{m-1}}{f_m} \right] \times C \]
where
\( n \) = sample size (\( n = \sum f \))
\( L_m \) = lower boundary of the median class
\( \sum f_{m-1} \) = cumulative frequency before the median class
\( f_m \) = frequency of the median class
\( C \) = median class size
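The grouped-data formula can be sketched in Python as follows. This is only an illustration: the function name and the frequency table are hypothetical, the classes are assumed to have equal width, and the median class is located with the \( \sum f / 2 \) term that appears in the formula.

```python
def grouped_median(lower_boundaries, frequencies, class_width):
    """Grouped median: L_m + ((sum_f / 2 - F_before) / f_m) * C."""
    half = sum(frequencies) / 2
    cumulative = 0  # cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        # The median class is the first class whose cumulative frequency reaches sum_f / 2.
        if f > 0 and cumulative + f >= half:
            return lower + (half - cumulative) / f * class_width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive count")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with frequencies 2, 5, 8, 5.
print(grouped_median([0, 10, 20, 30], [2, 5, 8, 5], 10))  # 20 + (10 - 7) / 8 * 10 = 23.75
```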
2.3.4 Mode
The mode is the value that occurs most frequently in the data. It can be found for both
ungrouped and grouped data. For ungrouped data:
i. Arrange the data in ascending order.
ii. Find the mode, the value that occurs most frequently in the data set.
iii. For categorical data, the mode is the category with the highest frequency.
iv. For quantitative data, the mode can also be read from the histogram as the class interval
with the highest frequency.
For the special case of ungrouped data given in a frequency table, the method is:
i. Find the highest frequency.
ii. Read the mode from column x.
For grouped data:
Steps to calculate it:
i. Construct a table that includes the cumulative frequency and class boundaries.
ii. Find the highest frequency to identify the modal class.
iii. Use the formula:
\[ \hat{x} = L_{mo} + \left[ \frac{\Delta_1}{\Delta_1 + \Delta_2} \right] \times C \]
where
\( L_{mo} \) = lower boundary of the modal class
\( \Delta_1 \) = modal class frequency - frequency of the class before the modal class
\( \Delta_2 \) = modal class frequency - frequency of the class after the modal class
\( C \) = modal class size
2.3.5 Relationship between mean, median and mode
The data distribution is skewed to the left (negatively skewed) if mode > median > mean (or
simply mean < median or mean < mode).
The data distribution is skewed to the right (positively skewed) if mode < median < mean (or
simply mean > median or mean > mode).
The data distribution is symmetrical (as in a normal distribution) if mode = median = mean.
2.4 Measure Of Location
Measures of location include the quartiles, which can be found for ungrouped and grouped data.
Quartiles describe the position of a value within a large set of numerical data. They are an
extension of the median and are the most commonly used measures of non-central location. The
quartiles divide the area under the frequency curve into four equal parts.
Ungrouped Data
There are three quartiles:
First Quartile / Lower Quartile (\( Q_1 \)) - 25% of the data are less than the first quartile
and 75% of the data are more than the first quartile. Its position is
\[ Q_1 = \frac{n+1}{4}\text{th observation} \]
Second Quartile / Median (\( Q_2 \)) - 50% of the data are less than the second quartile and
50% of the data are more than the second quartile. Its position is
\[ Q_2 = \frac{2(n+1)}{4}\text{th observation} \]
Third Quartile / Upper Quartile (\( Q_3 \)) - 75% of the data are less than the third quartile
and 25% of the data are more than the third quartile. Its position is
\[ Q_3 = \frac{3(n+1)}{4}\text{th observation} \]
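The position rule above can be sketched in Python as follows, using the yearly number of reported cases from Table 1. The function name is hypothetical, and the linear interpolation between neighbouring observations when the position is not a whole number is an assumption on our part. Under that assumption the results agree with the percentiles in the SPSS output in the Appendix.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the k(n + 1)/4 position rule."""
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    n = len(data)
    position = k * (n + 1) / 4      # 1-based position in the sorted data
    whole = int(position)           # observation just below the position
    fraction = position - whole
    if whole < 1:
        return data[0]
    if whole >= n:
        return data[-1]
    below, above = data[whole - 1], data[whole]
    # Assumption: interpolate linearly between the two neighbouring observations.
    return below + fraction * (above - below)


# Number of reported cases, 1971 to 2000 (Table 1).
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]

print(quartile(cases, 1), quartile(cases, 2), quartile(cases, 3))  # 215.5 488.0 1202.0
```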
Grouped Data
For grouped data, the positions measured are the first and third quartiles, \( Q_1 \) and
\( Q_3 \). They can be calculated from the frequency distribution table or read from the ogive.
2.4.1 The first and third quartiles
Method 1: Using Formula
Step 1: Obtain the cumulative frequencies and the positions of the data.
Step 2: Identify the first and third quartile classes. Obtain the locations of the first and
third quartiles using \( \frac{n}{4} \) and \( \frac{3n}{4} \), then refer to the cumulative
frequency column to determine the classes in which these locations lie. The values of
\( Q_1 \) and \( Q_3 \) lie within these classes.
Step 3: Find the first and third quartiles as follows:
\[ Q_1 = L_{Q_1} + \left[ \frac{\frac{n}{4} - \sum f_{Q_1 - 1}}{f_{Q_1}} \right] \times C_{Q_1} \]
where
\( n \) = number of observations
\( L_{Q_1} \) = lower boundary of the first quartile class
\( \sum f_{Q_1 - 1} \) = cumulative frequency before the first quartile class
\( f_{Q_1} \) = frequency of the first quartile class
\( C_{Q_1} \) = first quartile class size
\[ Q_3 = L_{Q_3} + \left[ \frac{\frac{3n}{4} - \sum f_{Q_3 - 1}}{f_{Q_3}} \right] \times C_{Q_3} \]
where
\( n \) = number of observations
\( L_{Q_3} \) = lower boundary of the third quartile class
\( \sum f_{Q_3 - 1} \) = cumulative frequency before the third quartile class
\( f_{Q_3} \) = frequency of the third quartile class
\( C_{Q_3} \) = third quartile class size
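The two grouped quartile formulas share one pattern, which the Python sketch below captures with a single parameter k. The function name and frequency table are hypothetical, and equal class widths are assumed.

```python
def grouped_quartile(lower_boundaries, frequencies, class_width, k):
    """k-th grouped quartile: L_Q + ((k * n / 4 - F_before) / f_Q) * C, for k = 1 or 3."""
    target = k * sum(frequencies) / 4   # n/4 for Q1, 3n/4 for Q3
    cumulative = 0                      # cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        # The quartile class is the first class whose cumulative frequency reaches the target.
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * class_width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive count")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with frequencies 2, 5, 8, 5 (n = 20).
q1 = grouped_quartile([0, 10, 20, 30], [2, 5, 8, 5], 10, 1)  # 10 + (5 - 2) / 5 * 10
q3 = grouped_quartile([0, 10, 20, 30], [2, 5, 8, 5], 10, 3)  # 20 + (15 - 7) / 8 * 10
print(q1, q3)
```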
2.5 Measure Of Dispersion
Measures of dispersion help us understand the spread or variability of a set of data about the
mean. They give additional information for judging the reliability of the measure of central
tendency and help in comparing the dispersion present in various samples. The measures of
dispersion discussed in this topic are the range, variance and standard deviation.
2.5.1 Range
In statistics, the simplest measure of dispersion is the range, which is the difference between
the largest and the smallest values of the data. The range of the data distribution can be
obtained from these two values alone.
For ungrouped data:
Range = Largest data value - Smallest data value
For grouped data:
Range = Upper class boundary of the last class - Lower class boundary of the first class
2.5.2 Variance And Standard Deviation

The sample variance is the sum of the squared differences between each data value and the mean,
divided by the sample size minus one. The standard deviation is the square root of the variance
and measures the amount of variation or dispersion in a set of values. Both the variance and
the standard deviation have formulas for grouped and ungrouped data, and for a population and a
sample.
2.5.3 POPULATION VARIANCE

Ungrouped data
\[ \sigma^2 = \frac{1}{N} \sum (X - \mu)^2 \]
where
\( \sigma^2 \) = population variance
\( X \) = observation
\( N \) = total number of observations in the population
\( \sum \) = sum of all values
\( \mu \) = population mean
Grouped data
\[ \sigma^2 = \frac{1}{N} \sum fx^2 - \left( \frac{\sum fx}{N} \right)^2 \]
where
\( \sigma^2 \) = population variance
\( f \) = class frequency
\( x \) = class midpoint
\( N \) = total number of observations in the population
\( \sum \) = sum of all values
2.5.4 POPULATION STANDARD DEVIATION
Ungrouped data
\[ \sigma = \sqrt{ \frac{1}{N} \sum (X - \mu)^2 } \]
where
\( \sigma \) = population standard deviation
\( N \) = total number of observations in the population
\( X \) = observation
\( \mu \) = population mean
Grouped data
\[ \sigma = \sqrt{ \frac{1}{N} \sum fx^2 - \left( \frac{\sum fx}{N} \right)^2 } \]
where
\( \sigma \) = population standard deviation
\( N \) = total number of observations in the population
\( f \) = class frequency
\( x \) = class midpoint
2.5.5 SAMPLE VARIANCE
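Ungrouped data
\[ S^2 = \frac{1}{n-1} \left( \sum x^2 - \frac{(\sum x)^2}{n} \right) \]
where
\( x \) = observation or value
\( n \) = number of observations in the sample
\( \sum x^2 \) = sum of the squares of all observations
Grouped data
\[ S^2 = \frac{1}{n-1} \left( \sum fx^2 - \frac{(\sum fx)^2}{n} \right) \]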
2.5.6 SAMPLE STANDARD DEVIATION

Ungrouped data
\[ S = \sqrt{ \frac{1}{n-1} \left( \sum x^2 - \frac{(\sum x)^2}{n} \right) } \]
Grouped data
\[ S = \sqrt{ \frac{1}{n-1} \left( \sum fx^2 - \frac{(\sum fx)^2}{n} \right) } \]
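As an illustration of the ungrouped sample formulas, the Python sketch below applies them to the total deaths in Table 1. The function names are hypothetical. The results should be close to the variance (212.547) and standard deviation (14.579) in the SPSS output in the Appendix.

```python
import math


def sample_variance(values):
    """Sample variance: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Total deaths, 1971 to 2000 (Table 1).
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

print(round(sample_variance(deaths), 3), round(sample_std(deaths), 3))
```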
2.6 MEASURE OF SKEWNESS

A distribution that is not symmetrical is called a skewed distribution, and its skewness can be
either positive or negative. In a skewed distribution the mean, median and mode have different
values, and one tail is longer than the other.
Negatively skewed distribution:
If the frequency curve has a longer tail to the left, the distribution is known as a negatively
skewed distribution and Mean < Median < Mode.
Positively skewed distribution:
If the frequency curve has a longer tail to the right, the distribution is known as a positively
skewed distribution and Mean > Median > Mode.
2.7 BOX-and-WHISKER PLOT
The box-and-whisker plot is a useful graphical method that represents data using the minimum,
maximum, first quartile, third quartile and median. It shows the shape of the data distribution
and can reveal whether there are any outliers in the data. The figure below shows
box-and-whisker plots for various types of distribution.
Figure 4
Based on the figure above, the first plot shows a symmetrical (normal) distribution, where the
right and left whiskers are the same length. The second plot shows a positively skewed
distribution (skewed to the right), where the right whisker is longer than the left whisker.
The last plot shows a negatively skewed distribution (skewed to the left), where the left
whisker is longer than the right whisker.
2.8 PEARSON COEFFICIENT OF SKEWNESS

The value of the skewness statistic is interpreted in three ways:
i. If skewness = 0, the distribution is symmetrical.
ii. If skewness > 0, the distribution is skewed to the right.
iii. If skewness < 0, the distribution is skewed to the left.
2.9 CORRELATION
Correlation analysis is used to analyse the relationship between two variables by measuring how
closely the two data series are related. In particular, the correlation coefficient measures
the direction and the extent of linear association between two variables. There are several
types of correlation coefficient, including the Pearson product moment correlation coefficient,
which is usually denoted by r. Pearson's correlation coefficient tells us two things about the
relationship between two quantitative variables: the sign (- or +) of r identifies the direction
of the relationship, and the magnitude of r describes its strength. The value of r lies between
-1.0 and 1.0.
The mathematical formula for Pearson's correlation coefficient r is
\[ r = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\left[ \sum x_i^2 - \frac{(\sum x_i)^2}{n} \right] \left[ \sum y_i^2 - \frac{(\sum y_i)^2}{n} \right]}} \]
where
r = correlation coefficient
n = number of observations
x = independent variable
y = dependent variable
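The formula for r can be sketched in Python as below, applied to the data in Table 1. The function name is hypothetical. The result should be close to the value r = 0.743 reported in Chapter 3.

```python
import math


def pearson_r(x, y):
    """Pearson's r using the computational (sums of products) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


# Number of reported cases (x) and total deaths (y), 1971 to 2000 (Table 1).
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

r = pearson_r(cases, deaths)
print(round(r, 3))  # approximately 0.743
```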
2.9.1 Characteristics of the correlation coefficient
The value of r always satisfies -1 ≤ r ≤ 1.
A value of r greater than 0 indicates a positive linear association between the two variables.
A value of r less than 0 indicates a negative linear association between the two variables.
A value of r equal to 0 indicates no linear relation between the two variables.
Strength of the Correlation Coefficient
|r| = 1 : Perfect Correlation
0.8 ≤ |r| < 1 : Strong Correlation
0.5 < |r| < 0.8 : Moderate Correlation
0 < |r| ≤ 0.5 : Weak Correlation
|r| = 0 : No Correlation
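The strength categories above can be written as a small Python helper. This is a sketch with a hypothetical function name, and it uses exactly the cut-off values listed in this section.

```python
def correlation_strength(r):
    """Classify |r| using the cut-offs listed in this section."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size == 1:
        return "perfect"
    if size >= 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    if size > 0:
        return "weak"
    return "none"


print(correlation_strength(0.743))  # moderate
```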
2.9.2 Regression
The basic regression model consists of one independent variable and one dependent variable. The
steps to study the relationship between the two variables are:
1. Collect the data and construct a scatter plot. The purpose of the scatter plot, as indicated
previously, is to determine the nature of the relationship. The possibilities include a positive
linear relationship, a negative linear relationship, a curvilinear relationship, or no
discernible relationship.
2. Compute the value of the correlation coefficient and test the significance of the
relationship.
3. If the value of the correlation coefficient is significant, determine the equation of the
regression line, which is the line of best fit for the data. (Note: Determining the regression
line when r is not significant and then making predictions using the regression line are
meaningless.) The purpose of the regression line is to enable the researcher to see the trend
and to make predictions on the basis of the data.
The simple linear model can be stated as follows:
\[ y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
where
\( y_i \) is the value of the response variable in the ith trial
\( \beta_0 \) and \( \beta_1 \) are regression coefficients or parameters
\( X_i \) is a known constant, the value of the independent variable in the ith trial
\( \varepsilon_i \) is a random error with mean \( E(\varepsilon_i) = 0 \) and variance \( V(\varepsilon_i) = \sigma^2 \)
2.9.3 Fitting a Straight Line

Several lines can be drawn on the graph near the points, so the line of best fit must be found.
Best fit means that the sum of the squares of the vertical distances from each point to the line
is at a minimum. The line of best fit is needed because the values of y, the dependent variable,
will be predicted from the values of x, the independent variable. Hence, the closer the points
are to the line, the better the fit and the prediction will be.
The fitted regression line is expressed as \( \hat{y} = b_0 + b_1 x \), where \( b_0 \) and
\( b_1 \) are estimates of \( \beta_0 \) and \( \beta_1 \) respectively. \( \beta_1 \) is the
slope of the regression line and indicates the change in the mean of Y per unit increase in X.
The parameter \( \beta_0 \) is the Y intercept of the regression line, the value of Y when X is
equal to zero. The method of ordinary least squares is used to find good estimators of the
regression parameters \( \beta_0 \) and \( \beta_1 \). The formulas for the ordinary least
squares method are:
\[ b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
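A minimal Python sketch of the least squares formulas, applied to the data in Table 1, is shown below. The function name is hypothetical. The estimates should be close to the regression equation ŷ = 1.146 + 0.016x reported in Chapter 3.

```python
def least_squares(x, y):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n   # b0 = mean(y) - b1 * mean(x)
    return b0, b1


# Number of reported cases (x) and total deaths (y), 1971 to 2000 (Table 1).
cases = [53, 864, 369, 349, 110, 246, 444, 1635, 502, 97,
         469, 516, 2195, 67, 52, 55, 1168, 1324, 393, 2071,
         506, 474, 995, 534, 2209, 1486, 389, 1304, 535, 124]
deaths = [1, 11, 17, 0, 8, 4, 12, 64, 10, 7,
          14, 17, 38, 1, 2, 2, 18, 32, 14, 38,
          6, 8, 13, 0, 27, 0, 4, 19, 0, 1]

b0, b1 = least_squares(cases, deaths)
print(round(b0, 3), round(b1, 3))
```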
2.9.4 Coefficient of Determination

The coefficient of determination is the ratio of the explained variation to the total variation
and is usually denoted by \( R^2 \). In other words, the value of \( R^2 \) tells how much of
the variability in Y can be explained by its relationship with X. For the simple linear
regression line of y on x, the coefficient of determination is the square of the correlation
coefficient, r. Because of this, we can state that:
\[ R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = r^2 \]
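To illustrate that the ratio of explained to total variation equals the square of r for a simple linear regression, the Python sketch below computes both quantities. The function name and the five data points are hypothetical and are not taken from this study.

```python
def r_squared_two_ways(x, y):
    """Return (explained / total variation, r squared) for a simple linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)   # total variation
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    # Explained variation: squared deviations of the fitted values from the mean of y.
    explained = sum((b0 + b1 * a - mean_y) ** 2 for a in x)
    r = s_xy / (s_xx * s_yy) ** 0.5
    return explained / s_yy, r ** 2


# Hypothetical illustration data (not taken from the report).
ratio, r_sq = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(ratio, 3), round(r_sq, 3))  # both approximately 0.6
```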
2.9.5 Regression equation line

\[ b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
CHAPTER 3: RESULTS AND INTERPRETATION
3.1 Data Representation
Table 1: Number of Reported Cases and Total Death Caused by Cholera in Malaysia From Year 1971 To 2000
Year | Number of Reported Cases | Total Death
1971 | 53 | 1
1972 | 864 | 11
1973 | 369 | 17
1974 | 349 | 0
1975 | 110 | 8
1976 | 246 | 4
1977 | 444 | 12
1978 | 1635 | 64
1979 | 502 | 10
1980 | 97 | 7
1981 | 469 | 14
1982 | 516 | 17
1983 | 2195 | 38
1984 | 67 | 1
1985 | 52 | 2
1986 | 55 | 2
1987 | 1168 | 18
1988 | 1324 | 32
1989 | 393 | 14
1990 | 2071 | 38
1991 | 506 | 6
1992 | 474 | 8
1993 | 995 | 13
1994 | 534 | 0
1995 | 2209 | 27
1996 | 1486 | 0
1997 | 389 | 4
1998 | 1304 | 19
1999 | 535 | 0
2000 | 124 | 1
3.2 DESCRIPTIVE STATISTICS ANALYSIS
3.2.1 Histogram
Figure 5
The histogram in Figure 5 shows the number of reported cases of cholera in Malaysia over the 30
years of observation from 1971 to 2000. All values are positive. Based on the histogram, the
highest number of reported cases is above 2000 and the lowest is about 50. The distribution is
skewed to the right. The mean and standard deviation are 717.83 and 659.816 respectively.
3.2.2 Histogram
Figure 6
The histogram in Figure 6 shows the total deaths caused by cholera in Malaysia from 1971 to
2000. Based on the histogram, the highest number of deaths reported is above 60 and the lowest
is 0. The distribution is skewed to the right. The mean and standard deviation are 12.93 and
14.579 respectively.
3.2.3 Box Plot
Figure 7
Based on the box plot in Figure 7, the median number of reported cases of cholera from 1971 to
2000 is 488.00. The interquartile range is about 987 reported cases, which means that the middle
50% of the yearly observations in Malaysia lie between 215.50 and 1202.00 reported cases.
3.2.4 Box Plot
Figure 8
Based on the box plot in Figure 8, the median of the total deaths caused by cholera from 1971 to
2000 is 9.00. The interquartile range is about 16 deaths, which means that the middle 50% of
the yearly observations in Malaysia lie between 1.75 and 17.25 deaths.
3.2.5 Descriptive
Figure 9
From the table above, the minimum and maximum values for the number of reported cases are 52
and 2209 respectively. The mean and standard deviation of the number of reported cases are
717.83 and 659.816. The minimum value of total deaths caused by cholera in Malaysia is 0, while
the maximum value is 64. Lastly, the mean and standard deviation of total deaths are 12.93 and
14.579.
3.3 CORRELATION AND REGRESSION
3.3.1 Scatter Plot
Figure 10
This scatter plot suggests a positive correlation between the number of reported cases and the
total deaths caused by cholera in Malaysia from 1971 to 2000.
3.3.2 Correlation
Figure 11
The value of r = 0.743 suggests a moderate positive correlation between the number of reported
cases and the total deaths caused by cholera in Malaysia from 1971 to 2000. That is, the higher
the number of reported cases, the higher the total deaths due to this disease.
3.3.3 Regression
Figure 12
Figure 13
The coefficient of determination, R² = 0.552, means that 55.2% of the variability in total
deaths can be explained by the number of reported cases. The remaining 44.8% of the variability
in total deaths is unexplained.
Figure 14
Figure 15
The regression equation is ŷ = 1.146 + 0.016x. The slope b1 = 0.016 means that for each
additional reported case, the predicted total deaths increase by 0.016.
3.3.4 Fitting A Straight Line
Figure 16
Interpret the slope:
If the number of reported cases increases by 1, the predicted total deaths increase by 0.016.
Interpret the intercept:
If the number of reported cases is 0, the predicted total deaths is 1.146.
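As a small sketch of how the fitted line is used for prediction, the Python function below evaluates ŷ = 1.146 + 0.016x. The function name and the example input of 1000 reported cases are hypothetical. The coefficients are the rounded values reported above.

```python
def predict_deaths(reported_cases, b0=1.146, b1=0.016):
    """Predicted total deaths from the fitted line y_hat = b0 + b1 * x."""
    return b0 + b1 * reported_cases


print(predict_deaths(0))               # 1.146, the intercept
print(round(predict_deaths(1000), 3))  # 1.146 + 0.016 * 1000 = 17.146
```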
CHAPTER 4: CONCLUSION
4.1 Report Summary
From this study, it can be concluded that the relationship between the number of reported cases
and the total deaths caused by cholera in Malaysia shows a positive correlation. The graph
suitable for these data is the histogram. For total deaths, the mean is 12.93, the standard
deviation is 14.579 and the interquartile range is about 16. The correlation coefficient is
0.743, which suggests a moderate relationship between the number of reported cases and the
total deaths caused by cholera in Malaysia from 1971 to 2000. The regression equation is
ŷ = 1.146 + 0.016x. The slope b1 = 0.016 means that for each additional reported case, the
predicted total deaths increase by 0.016.
REFERENCES
1. Standard deviation. (2020, June 7). Retrieved from
https://en.m.wikipedia.org/wiki/Standard_deviation
2. Number of reported cases of cholera. (n.d.). Retrieved June 11, 2020, from
https://www.who.int/data/gho/data/indicators/indicator-details/GHO/number-of-reported-cases-of-cholera
APPENDIX
FREQUENCIES VARIABLES=Number_Of_Reported_Cases Total_Death
  /ORDER=ANALYSIS.
Frequencies
Statistics
             Number_Of_Reported_Cases   Total_Death
N  Valid     30                         30
   Missing   0                          0
Frequency Table
Number_Of_Reported_Cases
Value Frequency Percent Valid Percent Cumulative Percent
Valid 52 1 3.3 3.3 3.3
53 1 3.3 3.3 6.7
55 1 3.3 3.3 10.0
67 1 3.3 3.3 13.3
97 1 3.3 3.3 16.7
110 1 3.3 3.3 20.0
124 1 3.3 3.3 23.3
246 1 3.3 3.3 26.7
349 1 3.3 3.3 30.0
369 1 3.3 3.3 33.3
389 1 3.3 3.3 36.7
393 1 3.3 3.3 40.0
444 1 3.3 3.3 43.3
469 1 3.3 3.3 46.7
474 1 3.3 3.3 50.0
502 1 3.3 3.3 53.3
506 1 3.3 3.3 56.7
516 1 3.3 3.3 60.0
534 1 3.3 3.3 63.3
535 1 3.3 3.3 66.7
864 1 3.3 3.3 70.0
995 1 3.3 3.3 73.3
1168 1 3.3 3.3 76.7
1304 1 3.3 3.3 80.0
1324 1 3.3 3.3 83.3
1486 1 3.3 3.3 86.7
1635 1 3.3 3.3 90.0
2071 1 3.3 3.3 93.3
2195 1 3.3 3.3 96.7
2209 1 3.3 3.3 100.0
Total 30 100.0 100.0
Total_Death
Value Frequency Percent Valid Percent Cumulative Percent
Valid 0 4 13.3 13.3 13.3
1 3 10.0 10.0 23.3
2 2 6.7 6.7 30.0
4 2 6.7 6.7 36.7
6 1 3.3 3.3 40.0
7 1 3.3 3.3 43.3
8 2 6.7 6.7 50.0
10 1 3.3 3.3 53.3
11 1 3.3 3.3 56.7
12 1 3.3 3.3 60.0
13 1 3.3 3.3 63.3
14 2 6.7 6.7 70.0
17 2 6.7 6.7 76.7
18 1 3.3 3.3 80.0
19 1 3.3 3.3 83.3
27 1 3.3 3.3 86.7
32 1 3.3 3.3 90.0
38 2 6.7 6.7 96.7
64 1 3.3 3.3 100.0
Total 30 100.0 100.0
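The Total_Death frequency table above is enough to recompute the basic descriptive statistics by hand. The sketch below is not part of the original SPSS session; it expands the frequency table into the 30 yearly observations and uses Python's standard `statistics` module. The variable names are assumptions chosen for illustration.

```python
import statistics

# Total_Death values and their frequencies, copied from the frequency table.
death_freq = {
    0: 4, 1: 3, 2: 2, 4: 2, 6: 1, 7: 1, 8: 2, 10: 1, 11: 1, 12: 1,
    13: 1, 14: 2, 17: 2, 18: 1, 19: 1, 27: 1, 32: 1, 38: 2, 64: 1,
}

# Expand the table into the 30 individual observations.
deaths = [value for value, count in death_freq.items() for _ in range(count)]

mean = statistics.mean(deaths)
median = statistics.median(deaths)
std_dev = statistics.stdev(deaths)  # sample standard deviation (divides by n - 1)

print(len(deaths), round(mean, 2), median, round(std_dev, 3))
```

The results should reproduce the mean (12.93), median (9.00) and standard deviation (14.579) that SPSS reports for Total_Death.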
FREQUENCIES VARIABLES=Number_Of_Reported_Cases Total_Death
/FORMAT=NOTABLE
/NTILES=4
/STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM MEAN MEDIAN MODE SKEWNESS
SESKEW
/HISTOGRAM NORMAL
/ORDER=ANALYSIS.
Frequencies
Statistics
                          Number_Of_Reported_Cases   Total_Death
N            Valid        30                         30
             Missing      0                          0
Mean                      717.83                     12.93
Median                    488.00                     9.00
Mode                      52ᵃ                        0
Std. Deviation            659.816                    14.579
Variance                  435357.730                 212.547
Skewness                  1.105                      1.863
Std. Error of Skewness    .427                       .427
Range                     2157                       64
Minimum                   52                         0
Maximum                   2209                       64
Percentiles  25           215.50                     1.75
             50           488.00                     9.00
             75           1202.00                    17.25
a. Multiple modes exist. The smallest value is shown
Histogram
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Number_Of_Reported_Cases
MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Number_Of_Reported_Cases=col(source(s),
name("Number_Of_Reported_Cases"))
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("Number_Of_Reported_Cases"))
GUIDE: text.title(label("1-D Boxplot of Number_Of_Reported_Cases"))
ELEMENT: schema(position(bin.quantile.letter(Number_Of_Reported_Cases)),
label(id))
END GPL.
GGraph
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Total_Death MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Total_Death=col(source(s), name("Total_Death"))
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("Total_Death"))
GUIDE: text.title(label("1-D Boxplot of Total_Death"))
ELEMENT: schema(position(bin.quantile.letter(Total_Death)), label(id))
END GPL.
GGraph
DESCRIPTIVES VARIABLES=Number_Of_Reported_Cases Total_Death
/STATISTICS=MEAN STDDEV MIN MAX.
Descriptives
Descriptive Statistics
                           N    Minimum   Maximum   Mean     Std. Deviation
Number_Of_Reported_Cases   30   52        2209      717.83   659.816
Total_Death                30   0         64        12.93    14.579
Valid N (listwise)         30
* Chart Builder.
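GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Number_Of_Reported_Cases
Total_Death MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
/FITLINE TOTAL=NO.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Number_Of_Reported_Cases=col(source(s),
name("Number_Of_Reported_Cases"))
DATA: Total_Death=col(source(s), name("Total_Death"))
GUIDE: axis(dim(1), label("Number_Of_Reported_Cases"))
GUIDE: axis(dim(2), label("Total_Death"))
GUIDE: text.title(label("Simple Scatter of Total_Death by Number_Of_Reported_Cases"))
ELEMENT: point(position(Number_Of_Reported_Cases*Total_Death))
END GPL.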
GGraph
CORRELATIONS
/VARIABLES=Number_Of_Reported_Cases Total_Death
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE.
Correlations
Correlations
                                                 Number_Of_Reported_Cases   Total_Death
Number_Of_Reported_Cases   Pearson Correlation   1                          .743**
                           Sig. (2-tailed)                                  .000
                           N                     30                         30
Total_Death                Pearson Correlation   .743**                     1
                           Sig. (2-tailed)       .000
                           N                     30                         30
**. Correlation is significant at the 0.01 level (2-tailed).
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Total_Death
/METHOD=ENTER Number_Of_Reported_Cases.
Regression
Variables Entered/Removedᵃ
Model   Variables Entered           Variables Removed   Method
1       Number_Of_Reported_Casesᵇ   .                   Enter
a. Dependent Variable: Total_Death
b. All requested variables entered.
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .743ᵃ   .552       .536                9.927
a. Predictors: (Constant), Number_Of_Reported_Cases
ANOVAᵃ
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   3404.534         1    3404.534      34.547   .000ᵇ
    Residual     2759.333         28   98.548
    Total        6163.867         29
a. Dependent Variable: Total_Death
b. Predictors: (Constant), Number_Of_Reported_Cases
Coefficientsᵃ
                               Unstandardized Coefficients   Standardized Coefficients
Model                          B       Std. Error            Beta                        t       Sig.
1   (Constant)                 1.146   2.703                                             .424    .675
    Number_Of_Reported_Cases   .016    .003                  .743                        5.878   .000
a. Dependent Variable: Total_Death
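The Model Summary figures can be recovered from the ANOVA table alone. The sketch below is not part of the original SPSS session; it copies the sums of squares and degrees of freedom from the ANOVA table and applies the standard definitions of F, R Square, Adjusted R Square and the standard error of the estimate. The variable names are assumptions chosen for illustration.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 3404.534
ss_residual = 2759.333
ss_total = 6163.867
df_regression = 1
df_residual = 28

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual   # F statistic
r_squared = ss_regression / ss_total   # R Square

# Adjusted R Square penalises R Square for the number of predictors.
df_total = df_regression + df_residual
adj_r_squared = 1 - (1 - r_squared) * df_total / df_residual

std_error_estimate = math.sqrt(ms_residual)  # Std. Error of the Estimate

print(round(f_stat, 3), round(r_squared, 3),
      round(adj_r_squared, 3), round(std_error_estimate, 3))
```

Rounded to three decimals, these should match the F, R Square, Adjusted R Square and Std. Error of the Estimate values in the SPSS output.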
* Chart Builder.
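GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Number_Of_Reported_Cases
Total_Death MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
/FITLINE TOTAL=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Number_Of_Reported_Cases=col(source(s),
name("Number_Of_Reported_Cases"))
DATA: Total_Death=col(source(s), name("Total_Death"))
GUIDE: axis(dim(1), label("Number_Of_Reported_Cases"))
GUIDE: axis(dim(2), label("Total_Death"))
GUIDE: text.title(label("Simple Scatter with Fit Line of Total_Death by ",
"Number_Of_Reported_Cases"))
ELEMENT: point(position(Number_Of_Reported_Cases*Total_Death))
END GPL.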
GGraph
EXAMINE VARIABLES=Number_Of_Reported_Cases Total_Death
/PLOT BOXPLOT STEMLEAF
/COMPARE GROUPS
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
Explore
Notes
Output Created: 13-JUN-2020 16:21:30
Comments:
Input
  Data: C:\Users\User\Documents\sta 108 dania.spv.sav
  Active Dataset: DataSet1
  Filter: <none>
  Weight: <none>
  Split File: <none>
  N of Rows in Working Data File: 30
Missing Value Handling
  Definition of Missing: User-defined missing values for dependent variables are treated as missing.
  Cases Used: Statistics are based on cases with no missing values for any dependent variable or factor used.
Syntax
  EXAMINE VARIABLES=Number_Of_Reported_Cases Total_Death
  /PLOT BOXPLOT STEMLEAF
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
Resources
  Processor Time: 00:00:01.36
  Elapsed Time: 00:00:01.50
[DataSet1] C:\Users\User\Documents\sta 108 dania.spv.sav
Case Processing Summary
                           Cases
                           Valid          Missing        Total
                           N    Percent   N    Percent   N    Percent
Number_Of_Reported_Cases   30   100.0%    0    0.0%      30   100.0%
Total_Death                30   100.0%    0    0.0%      30   100.0%
Descriptives
                                                                     Statistic    Std. Error
Number_Of_Reported_Cases   Mean                                      717.83       120.465
                           95% Confidence Interval   Lower Bound     471.45
                           for Mean                  Upper Bound     964.21
                           5% Trimmed Mean                           672.22
                           Median                                    488.00
                           Variance                                  435357.730
                           Std. Deviation                            659.816
                           Minimum                                   52
                           Maximum                                   2209
                           Range                                     2157
                           Interquartile Range                       987
                           Skewness                                  1.105        .427
                           Kurtosis                                  .180         .833
Total_Death                Mean                                      12.93        2.662
                           95% Confidence Interval   Lower Bound     7.49
                           for Mean                  Upper Bound     18.38
                           5% Trimmed Mean                           11.30
                           Median                                    9.00
                           Variance                                  212.547
                           Std. Deviation                            14.579
                           Minimum                                   0
                           Maximum                                   64
                           Range                                     64
                           Interquartile Range                       16
                           Skewness                                  1.863        .427
                           Kurtosis                                  4.159        .833
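The standard error and the 95% confidence interval for the Total_Death mean follow from the sample size and the standard deviation. The sketch below is not part of the original SPSS session. It assumes the critical value t = 2.045 taken from a t table (two-tailed, 95% confidence, 29 degrees of freedom) and uses the unrounded mean 388 / 30, where 388 is the sum of Total_Death obtained from the frequency table; the variable names are illustrative.

```python
import math

n = 30
mean = 388 / 30      # sum of Total_Death over n; displayed as 12.93 by SPSS
std_dev = 14.579     # Total_Death standard deviation from the table
t_critical = 2.045   # assumed t-table value: 95% confidence, df = n - 1 = 29

# Standard error of the mean.
std_error = std_dev / math.sqrt(n)

# Margin of error and the two interval bounds.
margin = t_critical * std_error
lower = mean - margin
upper = mean + margin

print(round(std_error, 3), round(lower, 2), round(upper, 2))
```

The printed values should agree with the Std. Error (2.662) and the Lower Bound and Upper Bound (7.49 and 18.38) shown for Total_Death.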
Number_Of_Reported_Cases
Number_Of_Reported_Cases Stem-and-Leaf Plot
Frequency   Stem &  Leaf
15.00       0 .  000001123333444
7.00        0 .  5555589
4.00        1 .  1334
1.00        1 .  6
3.00        2 .  012
Stem width: 1000
Each leaf: 1 case(s)
Total_Death
Total_Death Stem-and-Leaf Plot
Frequency   Stem &  Leaf
11.00       0 .  00001112244
4.00        0 .  6788
6.00        1 .  012344
4.00        1 .  7789
.00         2 .
1.00        2 .  7
1.00        3 .  2
2.00        3 .  88
1.00 Extremes    (>=64)
Stem width: 10
Each leaf: 1 case(s)
