
Introduction to Statistics

(Stat 2161)

Chapter 1

Introduction

Chapter Goals
After completing this chapter, you will be able to:
• Explain the reasons for studying statistics
• Explain the difference between descriptive and inferential statistics
• Describe the applications, uses and limitations of statistics
• Identify types of variables and scales of measurement
1.1: Definition and Classification of Statistics

• Statistics is defined in two senses: plural and singular.
• Statistics in the plural sense: a collection of numerical facts.
  Examples:
  – Data on the human population of a region.
  – Infants' birth weights at a public hospital in three consecutive months.
  – Number of man-hours lost in industry in specific years.
• Statistics in the singular sense: the science concerned with collecting, organizing, presenting, analyzing and interpreting numerical data in order to make decisions on the basis of such analysis.
Definition and . . .

• A population is the complete set of things (usually people, objects, transactions, or events) that have a specified property in common.
• A sample is a part (subset) of a population.
• Parameters are descriptive measures computed from direct measurements on all population elements.
• A statistic (as distinct from the field of statistics) is a numerical quantity computed from sample data, e.g., the sample mean, median, or maximum.
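The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The population values, the sample size of 4 and the random seed are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

# Hypothetical population: weights (kg) of all 10 members of a small club.
population = [58, 61, 63, 64, 65, 66, 67, 68, 68, 70]

# Parameter: computed from every element of the population.
population_mean = mean(population)

# Statistic: computed from a sample (a part of the population).
rng = random.Random(0)              # fixed seed so the sketch is repeatable
sample = rng.sample(population, 4)  # simple random sample without replacement
sample_mean = mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The parameter is a fixed number for a given population, while the statistic generally changes from one sample to another.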
Why Statistics?
• To effectively manage data and undertake research.
• To facilitate communication.
• To make valid decisions based on a part of the subjects taken from a large population.
• To monitor and evaluate activities performed at different institutions.

Data → Statistical tools → Information
Classification of Statistics
o Descriptive statistics: mainly concerned with the methods and techniques used in the collection, organization, presentation, and analysis of a set of data, without making any conclusions or inferences.
o Inferential statistics: deals with methods of inferring or drawing conclusions about the characteristics of a population based on the results of a sample.
Descriptive Statistics

• Collect data
  – e.g., Survey
• Present data
  – e.g., Tables and graphs
• Summarize data
  – e.g., Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Inferential Statistics
• Estimation
  – e.g., Estimate the population mean weight using the sample mean weight
• Hypothesis testing
  – e.g., Test the claim that the population mean weight is 65 kg

Inference is the process of drawing conclusions or making decisions about a population based on sample results.
1.2 Stages in Statistical Investigation
➢ There are five basic stages in any statistical investigation.
1. Collection of Data: the process of collecting observations (measurements, survey responses, etc.).
2. Organization of Data: the arrangement of data in a suitable form. It comprises editing, classifying and tabulating.
3. Presentation of Data: the process of displaying data in a precise manner using tables, graphs and diagrams.
4. Analysis of Data: the process of systematically applying statistical and/or logical techniques to describe, illustrate, and evaluate data.
5. Interpretation of Data: drawing conclusions from the analyzed data, which involves generalizing characteristics from the sample to the population.
1.3: Application, Uses and Limitations of Statistics

• Applications
• Statistics is applied in almost all areas of research, such as:
  – Industry: control charts and inspection plans.
  – Commerce: demand and supply.
  – Agriculture: mean comparison (ANOVA).
  – Economics: index numbers, time series and estimation.
  – Education: formulation of policies to start new courses.
  – Planning: data related to production and consumption.
  – Medicine: testing the efficacy of a new drug.
  – Modern applications, for example, software engineering.
1.3: Application, Uses …

• Uses of Statistics
  – Condenses and summarizes complex data
  – Facilitates comparison of data
  – Helps to measure variability in data
  – Used to establish relationships between variables
  – Helps in predicting future trends
  – Helpful in formulating and testing hypotheses and in developing new theories
Limitations

1. Statistics is not suitable for directly studying qualitative phenomena.
2. Statistics does not study individual cases.
3. Statistical laws are not exact; they hold only on the AVERAGE.
4. Statistics may be easily misused.
5. Statistics is only one of the methods of studying a problem.
1.4 Types of Variables and Measurement Scales
• Variable: a characteristic which takes on different values.
• Value: a specific amount or category that a variable can take.
Types of Variables
o Qualitative Variables
  – Attributes, categories
  – Examples: male/female, registered to vote/not, ethnicity, eye color, etc.
o Quantitative Variables
  – A discrete variable can assume only a countable number of values (counts, "how many").
  – A continuous variable can take on any value along an interval (measurements, "how much").
Scales of Measurement
• Ratio scale (quantitative variable): differences between measurements are meaningful and a true zero exists.
• Interval scale (quantitative variable): differences between measurements are meaningful, but there is no true zero.
• Ordinal scale (qualitative variable): ordered categories (rankings, order, or scaling, but no exact difference).
• Nominal scale (qualitative variable): categories with no ordering or direction.
Example
• Nominal: marital status, eye color, gender, race
• Ordinal: stage of disease, severity of pain, level of satisfaction
• Interval: temperature (in °C or °F)
• Ratio: distance, length, time until death, weight
Chapter Two

Methods of Data Collection and Presentation
Chapter Goals
After completing this chapter, you are expected to:
• Explain why we collect data
• Identify sources of data
• Describe the various methods of data collection
• Create and interpret diagrams to describe categorical
variables:
– frequency distribution, bar chart, pie chart
• Create and interpret graphs to describe numerical
variables:
– frequency distribution, histogram, ogive, stem-and-leaf
plot
2.1 Methods of Data Collection
• Why do we collect data?
  – To answer questions,
  – To make decisions, and
  – To gain a deeper understanding of some phenomena.
• Example
  – Does lowering the speed limit reduce the number of fatal traffic accidents?
  – What fraction of students in a college belong to blood group O?
• Data: a plural noun (the singular form is datum) meaning a set of known or given facts.
• Data can be collected using a survey or an experiment.
2.1.1 Sources of Data
• Primary
– Data generated by the immediate user(s) of the data.
– Survey, experimental and observational research are
most popular.
– Tend to require more time and expense than secondary
data.
• Secondary
  – Data gathered from another source for a similar or different purpose.
  – Internal sources: within the researcher's organization.
  – External sources: governmental, trade, commercial and internet sources.
Sources of Data . . .
• Example: If the average CGPA of students at a university is required, the data can be obtained from the registrar's office of that university.
• Uses of Secondary Data
  – Secondary data save time and cost compared to primary data.
  – They are less subject to intentional bias.
  – Secondary data are the only option for otherwise inaccessible information.
• Drawback of Secondary Data
  – They may not fit all the requirements of the study.
2.1.2 Types of Data

Data
• Categorical (Qualitative): defined categories or groups
  – Examples: data on marital status, cause of death, eye color
• Numerical (Quantitative)
  – Discrete. Examples: data on number of patients, frequency of cough at night, number of missing teeth
  – Continuous (measured characteristics). Examples: data on weight, blood sugar level, survival time
Methods of Data Collection
• Various methods are used, depending on the nature of the investigation and the resources available.
1. Direct Observation: The investigator observes the behavior of the subjects/individuals under study.
   – Though costly, it is arguably a good method, as it reduces the chance of incorrect data.
2. Enumeration: A selected group of respondents is asked a set of questions from a schedule by well-trained enumerators.
   – Could be time consuming if the coverage area is wide.
3. Direct Personal Interview: This is perhaps best suited to situations where the problems are not completely understood.
   – It is also recommended when the information collected is of a confidential nature.
4. Telephone Interview: Questions are prepared and then put to the respondents via telephone calls.
   – Recommended if a respondent cannot easily be reached other than by telephone.
5. Indirect Oral Interview: The researcher contacts third parties, called witnesses, who are capable of supplying the necessary information.
   – Recommended if the information is of a complex nature or the informants are not inclined to respond.
6. Mailed Enquiry Method: Letters with a set of questions are sent to the respondents and the responses are collected afterwards.
   – Recommended if the survey covers a large area and the respondents are widely scattered.
7. Old Records: A researcher uses data collected by others and stored in some form, such as books, newspapers, almanacs (handbooks) or even unpublished sources.
2.2 Methods of Data Presentation
• Data in raw form are usually not easy to use for decision making.
• Data can be summarized using
  – Tables
  – Diagrams
  – Graphs
  – Statistical quantities such as the mean, standard deviation, etc.
• The type of diagram/graph to use depends on the variable being summarized.
Data Presentation . . .
Data Displays
• Categorical data
  – Frequency tables of counts or percentages
  – Bar or column charts
  – Pie chart
• Quantitative data
  – Frequency tables
  – Histograms
  – Frequency polygon
  – Ogive
  – Stem-and-leaf plot
2.2.1 Frequency Distributions
Key Terms
• Class: the categories or ranges within which the data fall.
• Frequency: the number of observations in each class.
• Class relative frequency: the class frequency divided by the total number of observations in the data set.
• Class limits: the lowest and highest values of each class, i.e., the lower class limit (LCL) and the upper class limit (UCL).
• Class mark: the midpoint of each class.
• Class boundaries: the values that fall midway between the UCL of one class and the LCL of the next higher class.
  Let d = LCL of the 2nd class – UCL of the 1st class.
  Then the lower class boundary is LCB = LCL – ½ × d and the upper class boundary is UCB = UCL + ½ × d.
• Class width: the difference between the lower and upper class boundaries of the same class.
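The key terms above can be turned into a short calculation. The following Python sketch derives class boundaries, class marks and class width from a list of class limits; the function name `class_summary` and the sample limits are illustrative assumptions, not part of the course material.

```python
def class_summary(limits):
    """limits: list of (LCL, UCL) pairs in increasing order (at least two classes)."""
    # d = LCL of the 2nd class - UCL of the 1st class
    d = limits[1][0] - limits[0][1]
    rows = []
    for lcl, ucl in limits:
        lcb = lcl - d / 2          # lower class boundary
        ucb = ucl + d / 2          # upper class boundary
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcb, ucb),
            "mark": (lcl + ucl) / 2,   # midpoint of the class
            "width": ucb - lcb,        # difference between the boundaries
        })
    return rows

# Illustrative class limits for whole-number data.
for row in class_summary([(1, 10), (11, 20), (21, 30)]):
    print(row)
```

For the limits 1 – 10 and 11 – 20, d = 1, so the first class has boundaries 0.5 and 10.5, class mark 5.5 and width 10.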
Frequency Distributions . . .
• Example

Class Limits   Class Boundaries   Class Mark   Frequency   Relative Frequency
1 – 10         0.5 – 10.5         5.5          12          12/100
11 – 20        10.5 – 20.5        15.5         10          10/100
…              …                  …            …           …
81 – 90        80.5 – 90.5        85.5         6           6/100
Total                                          100         1.00

Class Width = 10
Frequency Distribution …
• Types of frequency distribution (FD):
  – Qualitative
  – Quantitative: ungrouped or grouped
• Frequency distribution: a table that presents data in classes and shows the number of observations in each class.
• Qualitative FD: a frequency distribution in which the data presented are nominal or ordinal.
• Ungrouped FD: a frequency distribution in which each distinct value in the dataset forms its own class.
• Grouped FD: a frequency distribution in which several values are grouped into one class.
Frequency Distributions . . .
• Categorical Frequency Distribution
  – The categorical frequency distribution is used for data that can be placed in specific categories, such as nominal- or ordinal-level data.
  – Its major components are class, tally and frequency (or proportion). Percentages can also be used.
  – Form of a categorical frequency distribution:

    Class   Tally   Frequency   Percent
Frequency Distributions . . .
• Example: Data on smoking status by gender for a sample of 20 health workers at Jimma Hospital, 1986 E.C., are given below. Construct a categorical frequency distribution.

Observation      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Gender           M F M M F F F M M M  F  F  F  F  M  F  M  F  M  M
Smoking status   Y N N Y N N Y N N N  N  N  N  Y  Y  Y  N  N  Y  Y

Characteristics    Tally           Frequency
Gender
  Male             //// ////       10
  Female           //// ////       10
Smoking status
  No               //// //// //    12
  Yes              //// ///        8
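As a cross-check on the tally, the same categorical frequency distribution can be computed with a short Python sketch. The helper name `categorical_fd` is an assumption made for illustration; the data are the 20 observations from the example.

```python
from collections import Counter

# The 20 observations from the example, in order.
gender = "M F M M F F F M M M F F F F M F M F M M".split()
smoking = "Y N N Y N N Y N N N N N N Y Y Y N N Y Y".split()

def categorical_fd(values):
    """Return {class: (frequency, percent)} for categorical data."""
    counts = Counter(values)
    n = len(values)
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}

print(categorical_fd(gender))
print(categorical_fd(smoking))
```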
Frequency Distributions . . .
• Ungrouped Frequency Distribution
  – A distribution that uses the individual data values along with their frequencies.
  – Often constructed for a small set of data on a discrete variable (when the data are numerical) and when the range of the data is small.
  – For a large mass of data an ungrouped frequency distribution becomes cumbersome, so a grouped frequency distribution is used instead.
  – The major components of this type of frequency distribution are class, tally, frequency, and cumulative frequency (less than/more than).
  – Cumulative frequency is used to determine the number of observations that lie above (or below) a particular value in a data set.
Frequency Distributions . . .
Example: The ages in years of 20 women who attended health education at Jimma Health Center in 1986 are given below. Construct an ungrouped frequency distribution.
30 25 23 41 39 27 41 24 32 29 29 35 31 36 33 36 42 35 37 41

Age (xj)        23  24  25  27  29  30  31  32  33  35  36  37  39  41   42
Tally           /   /   /   /   //  /   /   /   /   //  //  /   /   ///  /
Frequency (f)   1   1   1   1   2   1   1   1   1   2   2   1   1   3    1
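The ungrouped table can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and the less than cumulative frequency column is added because it is listed among the components of an ungrouped distribution.

```python
from collections import Counter
from itertools import accumulate

# Ages of the 20 women from the example.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

counts = Counter(ages)
values = sorted(counts)                   # each distinct age is its own class
freqs = [counts[v] for v in values]       # frequency of each age
less_than_cf = list(accumulate(freqs))    # running total of frequencies

# Print age, tally, frequency and less than cumulative frequency.
for v, f, cf in zip(values, freqs, less_than_cf):
    print(f"{v:>3}  {'/' * f:<4} {f:>2} {cf:>3}")
```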
Steps in the Construction of Grouped FD
1. Find the difference between the largest and smallest values in the raw data and denote it by R (the range).
2. Set the number of classes (K); usually between 5 and 20, or use Sturges' rule: K = 1 + 3.322 log10(n).
3. Estimate the class width W = R/K; round the estimate to a convenient value.
4. Determine the LCL of the first class by selecting a convenient number that is <= the lowest data value. Then add the class width to it to get the lower class limit of the second class. Keep adding until the desired number of classes is reached.
5. Obtain the upper class limit of the first class by subtracting one unit of measurement from the lower class limit of the second class:
   5.1. If the observations are whole numbers (e.g., 12, 23, 78), subtract 1.
   5.2. If the observations have one decimal place (e.g., 1.2, 2.3, 7.8), subtract 0.1.
   5.3. If the observations have two decimal places (e.g., 1.32, 2.35, 7.84), subtract 0.01.
   The remaining upper class limits follow by adding the class width.
6. Count the number of observations falling in each class and record the frequencies against the corresponding classes.
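Steps 1 to 5 can be sketched in Python for whole-number data. Several choices here are assumptions made only for illustration: the function name `grouped_class_limits`, rounding K to the nearest integer, rounding W up to the next whole number, and taking the smallest observation as the first LCL. The data are the 60 observations used in the worked example of this section.

```python
import math

def grouped_class_limits(data):
    """Return (R, K, W, class limits) for whole-number data."""
    r = max(data) - min(data)                       # step 1: range R
    k = round(1 + 3.322 * math.log10(len(data)))    # step 2: Sturges' rule
    w = max(1, math.ceil(r / k))                    # step 3: width, rounded up
    limits = []
    lcl = min(data)                                 # step 4: first LCL
    # Keep adding classes until K classes exist and the largest value is covered.
    while len(limits) < k or limits[-1][1] < max(data):
        limits.append((lcl, lcl + w - 1))           # step 5.1: UCL = next LCL - 1
        lcl += w
    return r, k, w, limits

data = [30, 40, 41, 33, 70, 51, 37, 10, 31, 21, 60, 44, 63, 72, 23, 37, 65, 14,
        25, 28, 64, 39, 17, 74, 53, 34, 51, 27, 43, 45, 33, 16, 23, 68, 47, 32,
        36, 19, 48, 49, 67, 60, 45, 54, 44, 30, 15, 38, 22, 46, 61, 25, 29, 55,
        48, 49, 35, 13, 37, 36]

r, k, w, limits = grouped_class_limits(data)
print(r, k, w)
print(limits)
```

In practice the "convenient value" in steps 3 and 4 is a judgment call, so a different rounding rule may be preferable for other data sets.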
Relative and Cumulative FD
• Relative frequency table: a table showing the relative frequency of each class.
  – Relative frequency can be expressed as a percentage.
• Cumulative frequency (cf): the sum of the frequencies preceding or succeeding a class k, including the frequency of class k itself.
  – The cumulative relative frequency expresses the same information as a percentage, obtained by multiplying the cf by 100%/n.
• Less than cf counts the number of observations less than or equal to the upper class boundary of a class.
• More than cf counts the number of observations greater than the lower class boundary of a class.
Example
• Consider the following data

30 40 41 33 70 51 37 10 31 21 60 44 63 72 23 37 65 14
25 28 64 39 17 74 53 34 51 27 43 45 33 16 23 68 47 32
36 19 48 49 67 60 45 54 44 30 15 38 22 46 61 25 29 55
48 49 35 13 37 36
• Prepare i) absolute frequency distribution;
ii) relative frequency distribution;
iii) less than and more than cumulative
frequency distributions.

Example …
R = 74 – 10 = 64, n = 60
Using Sturges' Rule:
K = 1 + 3.322 log10(60) = 1 + 3.322(1.778151) = 6.9070 ≈ 7
W = 64/7 = 9.14 ≈ 10

1st Class: 10 – 19   f(1): 7
2nd Class: 20 – 29   f(2): 9
3rd Class: 30 – 39   f(3): 15
. . .
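Example …
(RF = relative frequency, LCF = less than cumulative frequency, MCF = more than cumulative frequency)

Class     Frequency   RF      LCF   MCF
10 – 19   7           0.117   7     60
20 – 29   9           0.150   16    53
30 – 39   15          0.250   31    44
40 – 49   13          0.217   44    29
50 – 59   5           0.083   49    16
60 – 69   8           0.133   57    11
70 – 79   3           0.050   60    3
Total     60          1.00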
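The frequency, relative frequency and cumulative frequency columns can be checked with the following Python sketch. It takes the class limits 10 – 19, …, 70 – 79 as given; the variable names are illustrative.

```python
from itertools import accumulate

data = [30, 40, 41, 33, 70, 51, 37, 10, 31, 21, 60, 44, 63, 72, 23, 37, 65, 14,
        25, 28, 64, 39, 17, 74, 53, 34, 51, 27, 43, 45, 33, 16, 23, 68, 47, 32,
        36, 19, 48, 49, 67, 60, 45, 54, 44, 30, 15, 38, 22, 46, 61, 25, 29, 55,
        48, 49, 35, 13, 37, 36]
limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]

n = len(data)
# Absolute frequency: number of observations inside each pair of class limits.
freq = [sum(lo <= x <= hi for x in data) for lo, hi in limits]
# Relative frequency: class frequency divided by n.
rel = [f / n for f in freq]
# Less than cf: running total from the first class downwards.
lcf = list(accumulate(freq))
# More than cf: running total from the last class upwards.
mcf = list(accumulate(reversed(freq)))[::-1]

for (lo, hi), f, rf, l, m in zip(limits, freq, rel, lcf, mcf):
    print(f"{lo}-{hi}  {f:>2}  {rf:.3f}  {l:>2}  {m:>2}")
```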
Example: The ages in years of 20 women who attended health education at Jimma Health Center in 1986 are given below. Construct a grouped frequency distribution.

30 25 23 41 39 27 41 24 32 29 29 35 31 36 33 36 42 35 37 41
n = 20
Solution:
Exercise
1. Given below are raw data on the ages of 40 employees of an organization. Construct a frequency distribution including the class boundaries, class marks, relative frequencies, and the less than and more than cumulative frequencies.
62 58 53 27 30 31 26 34 49 47 48 41 50 61 40 47 41 43 50 45
43 32 37 31 35 38 29 65 58 43 44 41 37 27 62 65 36 42 63 50
Solution:
2.2.2 Diagrammatic Presentation of Data
• Diagrammatic presentation includes the bar chart, pie diagram and stem-and-leaf plot.
• Bar charts are the simplest and most widely used diagrams for data presentation.
• Bar charts display absolute or relative frequency distributions for categorical variables.

Types of bar chart: Simple, Multiple, Subdivided, Two-Way, Broken
Simple Bar Chart
• A simple bar chart consists of a number of rectangles arranged either horizontally or vertically.
• Horizontal bar chart: the X-axis represents the frequencies while the Y-axis represents the categories.
• Vertical bar chart: the Y-axis represents the frequencies while the X-axis represents the categories.
• A simple bar chart is useful for one-dimensional comparison only.
• Example
  – Represent the data given in the following table using vertical and horizontal bar charts.
Year   No. of students
2000   3005
2001   3567
2002   3800
2003   4300
2004   3650
2005   5000
Two Way Bar Chart
• To represent data having both negative and positive
values.
• Example
Year            1990     1991     1992     1993
Net Migration   50,000   -5,000   20,000   40,000
Multiple Bar Chart
• To make comparison between two or more variables.
• Example: A number of accounting firms were audited, and
classified according to size status (I [large], II [medium] and
III [small]) and the degree to which income-changing
accounting practices were used in preparing clients' tax
returns.
         Degree of Change
Size     No changes   Some changes   Total
Large    23           36             59
Medium   52           61             113
Small    22           21             43
Total    97           118            215

Subdivided Bar Chart
• To show and compare the breakup of one variable into
several components.
Year             2000   2001   2002   2003   2004
No. of females   800    824    856    768    900
No. of males     1389   2450   1245   1655   1445
Total            2189   3274   2101   2423   2345
Broken Bar Chart
• Used to represent data having wide variations in value.
• One observation may be much larger than the others.
• If we use a scale proportional to the value (frequency), it will be almost impossible to see the bars for the small values.
• Example
  – Represent the data given below using a suitable chart.

Year    1990   1991   1992   1993    1994   1995
Value   899    543    787    35323   121    234

[Figure: the same data drawn as a simple bar chart and as a broken bar chart]
Pie Diagram
• Pie Chart
  – A pie chart is a circular diagram in which the area of each sector of the circle represents a component.
  – To construct a pie chart (sector diagram), draw a circle (which measures 360°).
  – The angle of each component is calculated by the formula
    $\text{Angle of sector} = \frac{\text{Component part}}{\text{Total}} \times 360^\circ$
  – These angles are marked in the circle by means of a protractor to show the different components.
  – The sectors are usually arranged anti-clockwise.


Pie Diagram
• Pie diagrams are useful for displaying the relative frequency distribution of a categorical variable.

University        Addis Ababa   Gondar   Jimma   Total
No. of students   8000          6000     6000    20000

Addis Ababa = [8000/20000] × 360° = 144°
Gondar = [6000/20000] × 360° = 108°
Jimma = [6000/20000] × 360° = 108°
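The sector angles follow directly from the formula Angle = (Component part / Total) × 360°. A minimal Python sketch, with the illustrative function name `sector_angles`, applied to the student numbers above:

```python
# Number of students per university, as in the table above.
students = {"Addis Ababa": 8000, "Gondar": 6000, "Jimma": 6000}

def sector_angles(components):
    """Angle of each sector in degrees: 360 * component / total."""
    total = sum(components.values())
    return {name: 360 * part / total for name, part in components.items()}

angles = sector_angles(students)
for name, angle in angles.items():
    print(f"{name}: {angle} degrees")
```

The angles should always add up to 360°, which is a useful check before drawing the chart.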
Stem and Leaf Plot
• A stem and leaf plot is a special table where each data value is split into a stem (the first digit or digits) and a leaf (usually the last digit).
Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
                    Stem   Leaf
◼ 21 is shown as    2      1
◼ 38 is shown as    3      8

• Completed stem-and-leaf plot:
  Stem   Leaf
  2      1 4 4 6 7 7
  3      0 2 8
  4      1
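For two-digit whole numbers, splitting each value into stem and leaf is an integer division by 10. The sketch below uses the illustrative function name `stem_and_leaf` and assumes non-negative whole-number data whose last digit is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 38 -> stem 3, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```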
Stem and . . .
• Give a stem-and-leaf plot for the following data.
  3.584, 3.615, 3.586, 3.712, 3.823, 3.616, 3.580, 3.888, 3.617, 3.584, 3.882, 3.912, 3.91, 3.712, 3.580, 3.917

  Stem   Leaf
  3.58   0 0 4 4 6
  3.61   5 6 7
  3.71   2 2
  3.82   3
  3.88   2 8
  3.91   0 2 7
  Key: 3.58|4 represents 3.584
2.2.4 Graphical Presentation of Data

• Graphs include the histogram, frequency polygon and ogive.
• A histogram is a set of adjacent rectangles whose areas are proportional to the class frequencies.
• A histogram depicts the frequency distribution of a quantitative variable.
• The x-axis shows the classes (each bar is one class width wide) and the y-axis shows the frequency.
Histogram Example

Daily High Temperature   Frequency
10 but less than 20      3
20 but less than 30      6
30 but less than 40      5
40 but less than 50      4
50 but less than 60      2

[Figure: Histogram of daily high temperature. x-axis: Temperature in Degrees; y-axis: Frequency. There are no gaps between bars.]
Frequency Polygon
• This is a line graph of class frequencies plotted against
class marks.
• End points must be joined to the x-axis (y = 0) at mid
points of empty classes: one before the first class and the
other after the last class.
• They serve the same purpose as histograms, but are
especially helpful for comparing sets of data.
• Example
• 1. Represent the following data using a frequency polygon.
Class 14.5-24.5 24.5-34.5 34.5-44.5 44.5-54.5 54.5-64.5
Frequency 3 4 8 6 7
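The points to plot can be worked out before drawing. The Python sketch below computes the class marks for this example and adds the two closing points at the midpoints of the empty classes on either side; it assumes all classes have equal width, and the variable names are illustrative.

```python
# Class boundaries and frequencies from the example.
classes = [(14.5, 24.5), (24.5, 34.5), (34.5, 44.5), (44.5, 54.5), (54.5, 64.5)]
freqs = [3, 4, 8, 6, 7]

# Class mark = midpoint of each class.
marks = [(lo + hi) / 2 for lo, hi in classes]
width = classes[0][1] - classes[0][0]   # common class width (assumed equal)

# Close the polygon on the x-axis at the marks of the empty neighbouring classes.
xs = [marks[0] - width] + marks + [marks[-1] + width]
ys = [0] + freqs + [0]
points = list(zip(xs, ys))
print(points)
```

Joining these points in order with straight lines gives the frequency polygon.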

Frequency Polygon . . .
• 2. The following frequency distributions refer to the scores of 28 students on two tests. Plot frequency polygons for the two datasets.
Score    0-5   5-10   10-15   15-20   20-25
Test 1   3     4      8       6       7
Test 2   1     2      5       12      8
Ogive
o The ogive is a line plot of the cumulative frequency or the cumulative relative frequency.
o The x-axis shows the class boundaries and the y-axis shows either the less than or the more than cumulative frequency.
o Example

Price in Birr   Frequency   Less than cumulative frequency   More than cumulative frequency
10-20           2           2                                26
20-30           3           5                                24
30-40           6           11                               21
40-50           8           19                               15
50-60           5           24                               7
60-70           2           26                               2
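The coordinates of both ogives can be derived from the frequencies alone. In the sketch below the less than ogive is started at zero at the first boundary and the more than ogive is ended at zero at the last boundary. That is a common convention and is assumed here, not taken from the table.

```python
from itertools import accumulate

boundaries = [10, 20, 30, 40, 50, 60, 70]   # class boundaries (price in Birr)
freqs = [2, 3, 6, 8, 5, 2]

# Number of observations below each boundary (0 below the first boundary).
less_than = [0] + list(accumulate(freqs))
# Number of observations at or above each boundary.
more_than = [sum(freqs) - cf for cf in less_than]

less_than_points = list(zip(boundaries, less_than))
more_than_points = list(zip(boundaries, more_than))
print(less_than_points)
print(more_than_points)
```

The less than values at the upper boundaries 20, …, 70 and the more than values at the lower boundaries 10, …, 60 match the two cumulative columns of the table.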
