WHAT IS STATISTICS
Statistics is defined as the science of
collecting, organizing, presenting,
analyzing, and interpreting data to
assist in making more effective
decisions.
OR
A collection of numerical information is
called statistics.
Dr. Iftikhar Hussain Adil
Broadly defined, it is the science,
technology and art of extracting
information from observational data,
with an emphasis on solving real world
problems.
It is a logic and methodology for the
measurement of uncertainty and for
examination of the consequences of that
uncertainty in the planning and
interpretation of experimentation and
observation.
TYPES OF STATISTICS
Statistical Methods: Descriptive Statistics and Inferential Statistics
DESCRIPTIVE STATISTICS
Methods of organizing, summarizing,
and presenting data in an informative
way.
INFERENTIAL STATISTICS
The methods used to determine
something about a population on the
basis of a sample.
Inferential Statistics
Drawing conclusions about a population
that extends beyond your
dataset or sample is known as
inferential statistics.
Population versus Sample
A population is the complete set of all
items that interests an investigator.
The population size, N, can be very large or
even infinite.
e.g. All the registered voters of Pakistan
All the students at NUST
A sample is an observed subset of the
population values, with sample size
given by n.
Sampling Techniques
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Possible strata: (Male and female strata, Resident
and nonresident strata, White, Black, Hispanic, and Asian
strata, Protestant, Catholic, Jewish, Muslim, etc., strata)
Clustered Sampling
Sample of Convenience
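The first three techniques in the list can be sketched in a few lines. This is only an illustration under stated assumptions: the population of 20 student IDs and the resident/nonresident strata are hypothetical, and the definitions in the comments are the usual textbook ones rather than wording taken from the slides.

```python
import random

# Hypothetical population: 20 student IDs, each tagged with a stratum.
population = [(i, "resident" if i % 4 else "nonresident") for i in range(1, 21)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simple random sampling: every unit has the same chance of selection.
simple = rng.sample(population, 5)

# Systematic sampling: random start, then every k-th unit.
k = len(population) // 5
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: draw a separate random sample within each stratum.
strata = {}
for unit in population:
    strata.setdefault(unit[1], []).append(unit)
stratified = [u for group in strata.values() for u in rng.sample(group, 2)]

print(simple, systematic, stratified, sep="\n")
```

Cluster sampling would instead pick whole groups at random and survey the units inside them.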
Parameter and Statistic
A parameter is a specific characteristic of a
population. A statistic is a specific
characteristic of a sample.
e.g. NBS surveyed its students to determine the
average daily expense. From a sample of 80
students, the average expense was computed to be
Rs. 133.
What is population?
What is sample?
What is parameter?
What is statistic?
Is Rs.133 a parameter or statistic?
Types of Variables
Variable: A characteristic of an item or
individual that will be analyzed
using statistics.
e.g. Gender; party affiliation of registered
voters; household (HH) income of citizens who
live in a specific geographic area; publishing
category (hardcover, trade paperback,
mass-market paperback, textbook) of
a book; number of televisions in a household.
Example (Types of variables)
Reg #   Gender   Age    FA/FSC or equivalent   Family Members
1       M        18.2   67                     4
2       F        19     70                     3
3       M        20     80                     5
4       F        19.4   85                     6
5       F        20.6   73                     3
6       M        21     76                     4
7       F        20.3   67                     5
8       F        19.8   89                     4
Types of Variables
Categorical Variables
A categorical variable is a variable that can take
on one of a limited, and usually fixed, number
of possible values.
The values of these variables are selected from an
established list of categories,
e.g. Male/Female, Pass/Fail, SA/A/D/SD (strongly agree to strongly disagree).
Numerical variables
The values of these variables involve a counted or
measured value.
Discrete Variables: The values of these
variables are counts.
e.g. Number of people living in a household (HH)
Continuous Variables: These variables
have continuous values, and any value
can theoretically occur, limited only by
the precision of the measuring
process. e.g. Time to complete a task,
air pressure in a tyre.
Levels of Measurement
Levels of measurement often dictate
the calculations that can be done to
summarize and present the data. They
also determine the statistical tests
that should be performed.
e.g. Balls in a bag are of different colors,
such as brown, yellow, blue, green,
orange, or red (color is a nominal-level variable).
Types of Levels of Measurement
Ratio Level Data: When a scale
consists not only of equidistant
points but also has a meaningful zero
point, we refer to it as a ratio scale.
Ratio scales are the most sophisticated of
scales since they incorporate all the
characteristics of nominal, ordinal, and
interval scales. e.g. Income data
Properties of Ratio Level
Equal differences in the characteristic are
represented by equal differences in the
numbers assigned to the classifications.
Can be added or subtracted, i.e. $X_1 + X_2$ or $X_1 - X_2$ is possible
Can be multiplied or divided, i.e. $X_1 \times X_2$ or $X_1 / X_2$ is possible
Can be ordered, i.e. $X_1 < X_2$ or $X_1 > X_2$
Meaningful zero point
Interval Scale: An interval scale supports differences ($x_2 - x_1$)
and ordering ($x_2 > x_1$ or $x_1 > x_2$), but not ratios.
e.g. 100° is not twice as warm as 50°
(no meaningful zero point, no ratio, but $x_2 - x_1$ and $x_2 > x_1$ or $x_1 > x_2$)
Ordinal Scale: When items are classified
according to whether they have more or less of a characteristic, the
scale used is referred to as an ordinal scale. This
scale is common in marketing, satisfaction, and
attitudinal research. e.g. Excellent, very good,
good, fair, poor (no zero point, no equal gap,
no ratio, just comparison)
Nominal Scale: A discrete classification
of data, in which data are neither
measured nor ordered but subjects
are merely allocated to distinct
categories, for example male/female, or
married/unmarried/widowed/
separated (no ratio, no zero point,
no equal gap, and no comparison)
Example
A sample of customers in a specialty ice
cream store was asked a series of
questions.
What is your favorite flavor of ice cream?
How many times do you eat ice cream?
Do you have children under the age of ten
living in your home?
Have you tried our latest ice cream
flavor?
Self Review 1-1
Chicago-based Market Facts asked a sample of
1,960 consumers to try a newly developed
chicken dinner by Boston Market. Of the 1,960
sampled, 1,176 said they would purchase the
dinner if it is marketed.
(a) What could Market Facts report to Boston
Market regarding acceptance of the chicken
dinner in the population?
(b) Is this an example of descriptive statistics
or inferential statistics? Explain.
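As a starting point for part (a), the sample proportion can be computed directly from the two figures in the exercise. The variable names below are illustrative only.

```python
sample_size = 1960      # consumers who tried the dinner
would_purchase = 1176   # of those, the number who said they would buy it

# The sample proportion is the statistic used to estimate population acceptance.
sample_proportion = would_purchase / sample_size
print(f"{sample_proportion:.0%} of the sample would purchase the dinner")
```

Reporting this figure as an estimate for all consumers is a statement about the population made on the basis of a sample.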
DESCRIPTIVE STATISTICS
FREQUENCY DISTRIBUTION
A grouping of data into mutually
exclusive classes showing the number
of observations in each. The raw data
are more easily interpreted if
organized into a frequency distribution.
A frequency distribution helps answer questions such as:
What is the maximum of the data?
What is the minimum of the data?
Where do the data cluster?
What is the typical price of a vehicle?
Step 1: Decide on the number of
classes.
Step 2: Determine the class interval
or width.
Step 3: Set the individual class limits
Step 4: Tally the vehicle selling prices
into the classes.
Step 5: Count the number of items in each
class.
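Class frequency: The number of
observations in each class.
Class midpoint: The point halfway between the lower limits of two consecutive classes.
Class interval: The difference between the lower limits of two consecutive classes.
Relative frequency: The class frequency divided by the total number of observations.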
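The five steps can be sketched on the FA/FSC marks from the eight-student example table. Two choices here are assumptions that the slides do not state: the number of classes follows the common rule of thumb "smallest k with 2^k >= n", and the first class starts at the minimum value.

```python
import math

marks = [67, 70, 80, 85, 73, 76, 67, 89]  # FA/FSC column of the example table
n = len(marks)

# Step 1: number of classes k, assuming the 2**k >= n rule of thumb.
k = 1
while 2 ** k < n:
    k += 1

# Step 2: class width, rounded up to a whole number.
width = math.ceil((max(marks) - min(marks)) / k)

# Step 3: class limits [lower, upper), starting at the minimum (an assumption).
lower = min(marks)
classes = [(lower + i * width, lower + (i + 1) * width) for i in range(k)]

# Steps 4 and 5: tally and count the observations in each class.
frequency = [sum(lo <= x < hi for x in marks) for lo, hi in classes]
relative = [f / n for f in frequency]
midpoints = [(lo + hi) / 2 for lo, hi in classes]

for (lo, hi), f, r in zip(classes, frequency, relative):
    print(f"{lo} up to {hi}: frequency {f}, relative frequency {r:.2f}")
```

In practice the width and the first lower limit are usually rounded to convenient numbers, so the classes chosen by hand may differ from this mechanical version.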
Self Review 2.2
Barry Bonds of the San Francisco Giants
established a new single season home run
record by hitting 73 home runs during the
2001 Major League Baseball season. The
longest of these home runs traveled 488 feet
and the shortest 320 feet. You need to
construct a frequency distribution of these
home run lengths.
(a) How many classes would you use?
(b) What class interval would you suggest?
(c) What actual classes would you suggest?
Exercise Page 31
1. A set of data consists of 38 observations. How many
classes would you recommend for the frequency
distribution?
2. A set of data consists of 45 observations between $0
and $29. What size would you recommend for the class
interval?
3. A set of data consists of 230 observations between
$235 and $567. What class interval would you
recommend?
4. A set of data contains 53 observations. The lowest
value is 42 and the largest is 129. The data are to be
organized into a frequency distribution.
a. How many classes would you suggest?
b. What would you suggest as the lower limit of the first
class?
5. Wachesaw Manufacturing, Inc. produced the following
number of units over the last 16 days: 27, 27, 27, 28, 27,
25, 25, 28, 26, 28, 26, 28, 31, 30, 26, 26.
The information is to be organized into a frequency
distribution.
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What lower limit would you recommend for the first
class?
d. Organize the information into a frequency distribution
and determine the relative frequency distribution.
e. Comment on the shape of the distribution.
HISTOGRAM
A graph in which the classes are
marked on the horizontal axis and the
class frequencies on the vertical axis.
The class frequencies are represented
by the heights of the bars, and the
bars are drawn adjacent to each
other.
Frequency Polygon
It consists of line segments
connecting the points formed by the
intersections of the class midpoints
and the class frequencies.
cumulative frequency distribution
cumulative frequency polygon
[Figures: frequency polygons and a cumulative frequency polygon]
Pareto Diagram
A Pareto diagram is a bar chart that
displays the frequency of defect
causes, with the bars ordered from most to least frequent.
Line Graphs
Bar Charts
A bar chart can be used to depict any of
the levels of measurement: nominal,
ordinal, interval, or ratio.
The level of education is an ordinal
scale variable and is reported on the
horizontal axis
Difference between Histogram and
Bar Chart
In a histogram, the horizontal axis refers
to a ratio scale variable, such as vehicle selling
price. This is a continuous variable; hence
there is no space between the bars.
Another difference between a bar chart
and a histogram is the vertical scale. In a
histogram the vertical axis is the
frequency or number of observations. In a
bar chart the vertical scale refers to an
amount.
DESCRIPTIVE STATISTICS
Measures of Location
Measures of Variability
Measure of Relative Position
Measure of Shape
Measures of Location
POPULATION MEAN:
For raw data, that is, data that has not
been grouped in a frequency
distribution, the population mean is
the sum of all the values in the
population divided by the number of
values in the population.
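Or, in symbols,
$\mu = \frac{\sum X}{N}$
where $\mu$ is the population mean, $N$ is the number of values in the population, and $\sum X$ is the sum of those values.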
The Sample Mean:
For raw data, that is, ungrouped data,
the mean is the sum of all the
sampled values divided by the total
number of sampled values
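or, in symbols,
$\bar{x} = \frac{\sum x}{n}$
where $\bar{x}$ is the sample mean, $n$ is the number of sampled values, and $\sum x$ is the sum of those values.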
Examples: To obtain grade A, Ben must
achieve an average of at least 80 percent in
five tests. If his average mark for the first
four tests is 78, what is the lowest mark he
can get in his fifth test and still obtain grade A?
The speeds, to the nearest mile per hour, of 120
vehicles passing a checkpoint were recorded
and grouped into the table below. Estimate the
mean of this distribution.
Speed (mph)       21-25   26-30   31-35   36-45   46-60
No. of vehicles   22      48      25      16      9
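Both examples reduce to short arithmetic, sketched here. The class limits are read as 21-25, 26-30, 31-35, 36-45, and 46-60 mph, and the grouped mean uses class midpoints as stand-ins for the individual speeds, which is the usual estimate for grouped data.

```python
# Ben: five tests must average at least 80; the first four average 78.
required_total = 5 * 80
current_total = 4 * 78
lowest_fifth_mark = required_total - current_total

# Grouped speeds: class limits (mph) and number of vehicles from the table.
classes = [(21, 25), (26, 30), (31, 35), (36, 45), (46, 60)]
vehicles = [22, 48, 25, 16, 9]

# Each class is represented by its midpoint, weighted by its frequency.
midpoints = [(lo + hi) / 2 for lo, hi in classes]
grouped_mean = sum(f * m for f, m in zip(vehicles, midpoints)) / sum(vehicles)

print(lowest_fifth_mark, round(grouped_mean, 2))
```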
Properties of Mean
1. Every set of interval- or ratio-level
data has a mean.
2. All the values are included in
computing the mean.
3. The mean is unique.
4. The sum of the deviations of each
value from the mean will always be
zero.
The Weighted Mean
The weighted mean is a special case
of the arithmetic mean. It occurs
when there are several observations
of the same value.
Example: A candidate obtained the
following results at NBS:
Quizzes   Mid   Assignments   Final
92%       95%   90%           65%
The regulations state that quizzes
carry a weight of 15%, assignments
10%, mid 25%, and final 50%. What is
the candidate's final percentage?
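The weighted mean is $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$. A minimal sketch for the NBS example follows; the dictionary keys are illustrative names only.

```python
# Scores and weights from the NBS example.
scores = {"quizzes": 92, "mid": 95, "assignments": 90, "final": 65}
weights = {"quizzes": 0.15, "mid": 0.25, "assignments": 0.10, "final": 0.50}

# Weighted mean: sum of weight * score, divided by the sum of the weights.
weighted_mean = sum(scores[c] * weights[c] for c in scores) / sum(weights.values())
print(round(weighted_mean, 2))
```

Because the final carries half the weight, the result sits well below the simple average of the four scores.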
The Median:
The midpoint of the values after they
have been ordered from the smallest
to the largest, or the largest to the
smallest.
Properties of Median
The median is unique.
It is not affected by extremely large
or small values.
It can be computed for ratio-level,
interval-level, and ordinal-level data.
MODE: The value of the observation
that appears most frequently.
Properties of Mode
It is a robust measure.
Some data sets have no mode
or more than one mode.
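The three measures of location can be compared on the FA/FSC marks from the example table, using Python's standard `statistics` module. The last line also checks the property that deviations from the mean sum to zero.

```python
import statistics

marks = [67, 70, 80, 85, 73, 76, 67, 89]  # FA/FSC column of the example table

mean = statistics.mean(marks)
median = statistics.median(marks)    # average of the two middle values for even n
modes = statistics.multimode(marks)  # list of all most frequent values

# Property of the mean: deviations from it sum to zero.
sum_of_deviations = sum(x - mean for x in marks)

print(mean, median, modes, sum_of_deviations)
```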
Geometric Mean
The geometric mean is useful in
finding the average of percentages,
ratios, indexes, or growth rates.
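For n positive values the geometric mean is $GM = \sqrt[n]{x_1 x_2 \cdots x_n}$. The sketch below uses hypothetical raises of 5% and 15%, which are not from the slides, expressed as growth factors.

```python
import statistics

# Hypothetical data: raises of 5% and 15%, written as growth factors.
growth_factors = [1.05, 1.15]

gm = statistics.geometric_mean(growth_factors)
average_rate = gm - 1  # average percentage increase per period

print(f"geometric mean factor {gm:.5f}, average raise {average_rate:.2%}")
```

The geometric mean comes out slightly below the arithmetic mean of 10%, because it accounts for compounding.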
Measures of Variability
Why Study Dispersion
1. An average alone may not be representative when
the data have a large spread.
2. A second reason for studying the dispersion in
a set of data is to compare the spread in two
or more distributions.
A small value for a measure of dispersion
indicates that the data are clustered closely,
say, around the arithmetic mean. The mean is
therefore considered representative of the
data. Conversely, a large measure of
dispersion indicates that the mean is not
reliable.
Range
The range is based on the largest and
the smallest values in the data set. It
is the difference between the largest and
smallest values.
Range = Largest value - Smallest value
MEAN DEVIATION
The arithmetic mean of the absolute
values of the deviations from the
arithmetic mean.
Advantages and Drawback
of Mean Deviation
It uses all the values in the
computation.
It is easy to understand.
It uses absolute values, which are
difficult to work with,
so this measure is not frequently
used.
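The range and the mean deviation, $MD = \frac{\sum |x - \bar{x}|}{n}$, can be computed side by side on the FA/FSC marks from the example table.

```python
marks = [67, 70, 80, 85, 73, 76, 67, 89]  # FA/FSC column of the example table

# Range: largest value minus smallest value.
data_range = max(marks) - min(marks)

# Mean deviation: average absolute distance from the arithmetic mean.
mean = sum(marks) / len(marks)
mean_deviation = sum(abs(x - mean) for x in marks) / len(marks)

print(data_range, mean_deviation)
```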
VARIANCE: The arithmetic mean of the
squared deviations from the mean.
STANDARD DEVIATION: The square
root of the variance.
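Population Variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ the population size.
Sample Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and $n$ the sample size.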
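The difference between the two variances is the divisor: N for a population, n - 1 for a sample. The sketch below treats the eight FA/FSC marks first as a whole population and then as a sample, using the standard `statistics` module.

```python
import statistics

marks = [67, 70, 80, 85, 73, 76, 67, 89]  # FA/FSC column of the example table

population_variance = statistics.pvariance(marks)  # divides by N
sample_variance = statistics.variance(marks)       # divides by n - 1
population_sd = statistics.pstdev(marks)           # square root of pvariance
sample_sd = statistics.stdev(marks)                # square root of variance

print(population_variance, sample_variance, population_sd, sample_sd)
```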
CHEBYSHEV'S THEOREM
For any set of observations (sample
or population), the proportion of the
values that lie within k standard
deviations of the mean is at least
$1 - 1/k^2$,
where k is any constant greater than
1.
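The bound is easy to tabulate. The function name below is hypothetical and the sketch simply evaluates $1 - 1/k^2$.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%}")
```

For k = 2 the bound is 75%, which holds for any distribution and is therefore weaker than the figure a bell-shaped distribution would give.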
EMPIRICAL RULE
For a symmetrical, bell-shaped
frequency distribution, approximately 68
percent of the observations will lie
within plus and minus one standard
deviation of the mean; about 95 percent
of the observations will lie within plus
and minus two standard deviations of
the mean; and practically all (99.7
percent) will lie within plus and minus
three standard deviations of the mean.
Quartiles, Deciles, and
Percentiles
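A percentile (or centile) is the value
of a variable below which a certain
percent of observations fall.
$L_p = (n+1)\frac{P}{100}$
where $L_p$ is the location of the desired percentile in the ordered data, $n$ is the number of observations, and $P$ is the percentile.
e.g. 91, 75, 61, 101, 43, 104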
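The location formula can be applied to the six values above. This sketch assumes linear interpolation between neighbouring ordered values when the location is not a whole number, which is a common convention but is not spelled out in the slides. The function name is hypothetical.

```python
def percentile(values, p):
    """Value at percentile p, using the location L_p = (n + 1) * p / 100."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * p / 100
    # Clamp so that extreme percentiles stay inside the data.
    location = min(max(location, 1), n)
    whole = int(location)
    fraction = location - whole
    if whole == n:
        return float(data[-1])
    # Interpolate between the whole-th and (whole + 1)-th ordered values.
    return data[whole - 1] + fraction * (data[whole] - data[whole - 1])


data = [91, 75, 61, 101, 43, 104]
q1, median, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, median, q3)
```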
Box Plots
A box plot is a graphical display,
based on quartiles, that helps us
picture a set of data.
To construct a box plot, we need
only five statistics: the minimum
value, $Q_1$ (the first quartile), the
median, $Q_3$ (the third quartile), and
the maximum value.
Outlier: An outlier is a value that is
inconsistent with the rest of the data.
Interquartile Range:
The interquartile range is the
distance between the first and the
third quartile.
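The five statistics behind a box plot, and the interquartile range, can be sketched with `statistics.quantiles`, whose `"exclusive"` method uses the (n + 1) location convention. The 1.5 × IQR fences used to flag outliers are a widely used rule of thumb and an assumption here, since the slides do not give a threshold.

```python
import statistics

data = [91, 75, 61, 101, 43, 104]

# Quartiles via the (n + 1) location method.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1

# Assumed rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary, iqr, outliers)
```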
Skewness
Symmetric: In a symmetric set of
observations the mean and median are equal
and the data values are evenly spread around
these values. The data values below the mean
and median are a mirror image of those above.
Positively Skewed: A set of values is
skewed to the right or positively skewed if
there is a single peak and the values extend
much further to the right of the peak than to
the left of the peak. In this case the mean is
larger than the median.
Negatively Skewed: In a negatively
skewed distribution there is a single
peak but the observations extend
further to the left, in the negative
direction, than to the right. In negatively
skewed distribution the mean is smaller
than the median.
Bimodal: A bimodal distribution will have
two or more peaks. This is often the
case when the values are from two
populations.
How to Assess Skewness with
the help of Boxplot
Symmetric
The distance from Min to $Q_2$ = $Q_2$ to Max
The distance from Min to $Q_1$ = $Q_3$ to Max
The distance from $Q_1$ to $Q_2$ = $Q_2$ to $Q_3$
Right Skewed
The distance from $Q_2$ to Max > Min to $Q_2$
The distance from $Q_3$ to Max > Min to $Q_1$
The distance from $Q_2$ to $Q_3$ > $Q_1$ to $Q_2$
Left Skewed
The distance from Min to $Q_2$ > $Q_2$ to Max
The distance from Min to $Q_1$ > $Q_3$ to Max
The distance from $Q_1$ to $Q_2$ > $Q_2$ to $Q_3$
Skewness
Measures of Skewness
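One common measure, not named in the slides and offered here only as an assumption about what this heading covers, is Pearson's coefficient of skewness, $sk = \frac{3(\bar{x} - \text{median})}{s}$. Its sign matches the mean-versus-median comparison described earlier: positive when the mean exceeds the median, negative when it is smaller.

```python
import statistics

marks = [67, 70, 80, 85, 73, 76, 67, 89]  # FA/FSC column of the example table

mean = statistics.mean(marks)
median = statistics.median(marks)
s = statistics.stdev(marks)  # sample standard deviation

# Pearson's coefficient of skewness (an assumed choice of measure).
sk = 3 * (mean - median) / s
print(round(sk, 3))
```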
Univariate Vs Bivariate
Scatter Diagram
A graph used to show the relationship
between two variables is called a scatter
diagram.
CONTINGENCY TABLE
A table used to classify observations
according to two identifiable
characteristics.
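A contingency table can be built by counting pairs of characteristics. The sketch below cross-classifies the eight students of the example table by gender and number of family members; the choice of these two columns is only for illustration.

```python
from collections import Counter

# (gender, family members) pairs from the eight-student example table.
students = [("M", 4), ("F", 3), ("M", 5), ("F", 6),
            ("F", 3), ("M", 4), ("F", 5), ("F", 4)]

counts = Counter(students)  # number of students in each (gender, size) cell
genders = sorted({g for g, _ in students})
sizes = sorted({s for _, s in students})

# Print one row per gender and one column per family size.
print("Gender " + " ".join(f"{s:>3}" for s in sizes))
for g in genders:
    print(f"{g:<6} " + " ".join(f"{counts[(g, s)]:>3}" for s in sizes))
```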
Stem and Leaf Plot