
3.MU-Introduction To Statistics and Data Analysis-Talk - (04.07.2025)

The document provides an overview of statistics and data analysis, covering key concepts such as statistical terminologies, problem-solving methodologies, and descriptive versus inferential statistics. It outlines the objectives of understanding statistics, the importance of data collection methods, and the role of technology in data analysis. Additionally, it emphasizes the need for statistical knowledge in decision-making and presents various sampling methods for data collection.

Uploaded by

M.TAYYAB KHAN
Copyright
© © All Rights Reserved

Statistics and Data Analysis

CONTENT

1.1 Statistical Terminologies


1.2 Statistical Problem-Solving Methodology
1.3 Review of Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
Game of Darts
1.4 Exploratory Data Analysis
1.4.1 Stem and Leaf Plot
1.4.2 Boxplots
OBJECTIVE
• By the end of this presentation, you should be able to:
• Define the meaning of statistics, population, sample, parameter, statistic,
descriptive statistics and inferential statistics.
• Understand and explain why a knowledge of statistics is needed
• Outline the 6 basic steps in the statistical problem solving methodology.
• Identify various methods of obtaining samples.
• Discuss the role of computers and data analysis software in statistical work.
• Summarize data using measures of central tendency, such as the mean, median,
mode, and midrange.
• Describe data using measures of variation, such as the range, variance, and
standard deviation.
• Identify the position of a data value in a data set, using various measures of
position, such as percentiles, deciles, and quartiles.
• Analyze accuracy and precision of data using game of darts.
• Draw and interpret a stem and leaf plot.
• Draw and interpret a boxplot.
1.1 OVERVIEW
• Define the meaning of statistics, population, sample, parameter, statistic, descriptive statistics and inferential statistics.

• Understand and explain why a knowledge of statistics is needed.
What is Statistics?
Most people become familiar with probability and statistics through radio,
television, newspapers, and magazines. For example, the following statements
were found in newspapers:
• Tens of thousands of teachers in Pakistan have chosen State Life Policy as their trusted bank.
• The death rate from lung cancer was 10 times higher for smokers compared to nonsmokers.
• The average cost of a wedding is nearly Rs. 500,000.
• In USA, the median salary for men with a bachelor’s degree is $49,982, while the median
salary for women with a bachelor’s degree is $35,408.
• Globally, an estimated 500,000 children under the age of 15 live with Type 1 diabetes.
• Women who eat fish once a week are 29% less likely to develop heart disease.
Statistics:
• is the science of conducting studies to collect, organize, summarize, analyze, present, interpret and draw conclusions from data.

Data:
• any values (observations or measurements) that have been collected.
The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a small sample chosen from it.

• Population: the complete collection of measurements, outcomes, objects or individuals under study.
  • Tangible population: always finite; after an item is sampled, the population size decreases by 1. The total number of members is fixed and could be listed.
  • Conceptual population: consists of all the values that might possibly have been observed and has an unlimited number of members.
• Parameter: a number that describes a population characteristic.
• Sample: a subset of a population, containing the objects or outcomes that are actually observed.
• Statistic: a number that describes a sample characteristic.
EXERCISE 1.1

1. The freshman class has 317 students and an IQ pre-test is given to all of
them in their first week. The dean of admission collected data on 27 of
them and found their mean score on the IQ pre-test was 51. The mean
for the entire freshman class was therefore estimated to be
approximately 51 on this test. A subsequent computer analysis of all
freshmen showed the true mean to be 52.
Based on the above problem,
a) What is the population?
b) Is the population tangible or conceptual?
c) What is the sample?
d) What number is a parameter?
e) What number is a statistic?
Statistics
• Statistics is defined as the study of collecting, presenting and analyzing information and drawing conclusions from it with the aid of mathematics and computers. The most important activity is perhaps the development of methods by which inferences can be made from data. This area includes both the estimation of population parameters and tests of hypotheses based on probability theory. In many instances, a sample is obtained from some collection of objects or items and, on the basis of this particular set of data, we wish to make sensible comments about the nature of the larger collection, called the population. Statistics is an essential tool for business, planning, market research, forecasting and finding patterns inherent in data.
Statistics
• With the increased use of technology, data handling skills have become highly sought after, and a knowledge of the subject is useful in understanding the statistical information we are presented with every day. Today a new discipline has emerged, termed "Data Science". Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. A data scientist, typically working as part of a team, creates programming code and combines it with statistical knowledge to create insights from data.
Descriptive & Inferential Statistics
Descriptive statistics
• Consists of the collection, organization, classification, summarization, and presentation of data obtained from the sample.
• Used to describe the characteristics of the sample.
• Used to determine whether the sample represents the target population by comparing the sample statistic and the population parameter.

Inferential statistics
• Consists of generalizing from samples to populations, performing estimations and hypothesis testing, determining relationships among variables, and making predictions.
• Used to describe, infer, estimate and approximate the characteristics of the target population.
• Used when we want to draw a conclusion from the data obtained from the sample.
EXERCISE 1.1
2. In each of these statements, tell whether descriptive or inferential
statistics have been used.
a) Tens of thousands of parents in Malaysia have chosen StemLife as their
trusted stem cell bank. (Descriptive)
b) The death rate from lung cancer was 10 times higher for smokers compared
to nonsmokers. (Inferential)
c) The average cost of a wedding is nearly RM10,000.
d) In USA, the median salary for men with a bachelor’s degree is $49,982,
while the median salary for women with a bachelor’s degree is
$35,408.
e) Globally, an estimated 500,000 children under the age of 15 live with
Type 1 diabetes.
f) A researcher claims that a new drug will reduce the number of heart
attacks in men over 70 years of age.
An overview of descriptive statistics and
statistical inference
START
1. Gathering of data
2. Classification, summarization, and processing of data
3. Presentation and communication of summarized information
4. Is the information from a sample?
   • Yes (statistical inference): use sample information to make inferences about the population, then draw conclusions about the population characteristic (parameter) under study.
   • No (descriptive statistics): use census data to analyze the population characteristic under study.
STOP
Need for Statistics

• It is a fact that you need a knowledge of statistics to help you:

1. Describe and understand numerical relationships between variables
• There are a lot of data in this world, so we need to identify the right variables.

2. Make better decisions
• Statistical methods allow people to make better decisions in the face of uncertainty.
Describing relationship between variables

1. A management consultant wants to compare a client’s investment return for this year with related figures from last year. He summarizes masses of revenue and cost data from both periods and, based on his findings, presents his recommendations to his client.

2. A college admission director needs to find an effective way of selecting student applicants. He designs a statistical study to see if there is a significant relationship between SPM results and the GPA achieved by freshmen at his school. If there is a strong relationship, a high SPM result will become an important criterion for acceptance.
Aiding in Decision Making

1. Suppose that the manager of “Big-Wig Executive Hair Stylist”, Alvin Tang, has advertised that 90% of the firm’s customers are satisfied with the company’s services. If Pamela, a consumer activist, feels that this is an exaggerated statement that might require legal action, she can use statistical inference techniques to decide whether or not to sue Alvin.

2. Students and professional people can also use the knowledge gained from studying statistics to become better consumers and citizens. For example, they can make intelligent decisions about what products to purchase based on consumer studies, about government spending based on utilization studies, and so on.
1.2: STATISTICAL METHODOLOGY
IN SOLVING PROBLEMS
• Outline the 6 basic steps in the statistical problem solving methodology.

• Identify various methods of obtaining samples.

• Discuss the role of computers and data analysis software in statistical work.
STATISTICAL PROBLEM
SOLVING METHODOLOGY
6 Basic Steps

1. Identifying the problem or opportunity


2. Deciding on the method of data collection
3. Collecting the data
4. Classifying and summarizing the data
5. Presenting and analyzing the data
6. Making the decision
STEP 1
Identifying the problem or opportunity

• Must clearly understand & correctly define the objective/goal of the study
• If not, time & effort are wasted
• Is the goal to study some population?
• Is it to impose some treatment on the group & then test the response?
• Can the study goal be achieved through simple counts or measurements of the group?
• Must an experiment be performed on the group?
• If samples are needed, how large should they be and how should they be taken? The larger the better (more than 30).
Characteristics of Sample Size
• The larger the sample, the smaller the magnitude of sampling errors.

• Survey studies need large samples because returning the survey is voluntary.

• A large sample is easy to divide into subgroups.

• In mail surveys the response rate may be as low as 20%-30%, so a larger sample is required.

• Subject availability and cost factors are legitimate considerations in determining appropriate sample size.
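The effect of a low mail response rate can be sketched with a little arithmetic. This is only an illustration: the target of 100 completed responses and the helper name `questionnaires_to_mail` are assumptions, not figures from the slides.

```python
def questionnaires_to_mail(target_responses, response_rate_percent):
    """Smallest number of mailings expected to yield the target number of responses."""
    # Ceiling division in integer arithmetic avoids floating-point rounding issues.
    return -(-target_responses * 100 // response_rate_percent)


# Assumed target: 100 completed questionnaires at 20% and 30% response rates.
for rate in (20, 30):
    print(rate, questionnaires_to_mail(100, rate))
# 20 500
# 30 334
```

Under these assumptions, a 20% response rate means mailing five times as many questionnaires as the number of responses wanted.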
STEP 2
Deciding on the Method of Data Collection

• Data must be gathered that are accurate, as complete as possible & relevant to the problem

• Data can be obtained in 3 ways

1. Data that are made available by others (internal, external, primary or secondary data)

2. Data resulting from an experiment (experimental study)

3. Data collected in an observational study (observation, survey, questionnaire, interview)
STEP 3
Collecting the data
A. Nonprobability data samples
• A nonprobability sample is one in which the judgment of the experimenter, the method by which the data are collected, or other factors could affect the results of the sample
• 3 basic methods: judgment samples, voluntary samples and convenience samples

B. Probability data samples
• A probability sample is one in which the chance of selection of each item in the population is known before the sample is picked
• 4 basic methods: random, systematic, stratified, and cluster


A) Nonprobability Data Samples

1. Judgment samples
• Based on the opinion of one or more experts
• Ex: A political campaign manager intuitively picks certain voting districts as reliable places to measure the public opinion of his candidate

2. Voluntary samples
• Questions are posed to the public over radio or TV, and people respond voluntarily (by phone or SMS)

3. Convenience samples
• Take an ‘easy sample’ (most conveniently available)
• Ex: A surveyor stands in one location & asks passersby the survey questions
B) Probability Data Samples

1. Random samples
• Selected using chance methods or random methods

• Example:
• A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list of all 5,000 students, numbers it from 1 to 5,000, randomly selects 100 of those numbers and invites the corresponding students to participate in the study.
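The random-sample example can be sketched in a few lines. This is a minimal illustration, assuming the students are identified only by their list numbers; the fixed seed is an assumption made so that the draw is repeatable.

```python
import random

POPULATION_SIZE = 5000   # students on the list, numbered 1..5000
SAMPLE_SIZE = 100        # students to invite

rng = random.Random(42)  # fixed seed (an assumption) so the draw is repeatable
student_numbers = range(1, POPULATION_SIZE + 1)

# random.sample draws without replacement, so no student is invited twice.
simple_random_sample = sorted(rng.sample(student_numbers, SAMPLE_SIZE))
print(simple_random_sample[:10])
```

Every student number has the same chance of being drawn, which is what makes this a probability sample.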
B) Probability Data Samples

2. Systematic samples
• Number each subject of the population, then select every kth subject.

• Example:
• A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list of all 5,000 students, numbers it from 1 to 5,000 and randomly picks one of the first 50 students (5000/100 = 50) on the list. If the number picked is 30, then the 30th student on the list is invited first. She then invites every 50th name on the list after this random start (the 80th student, the 130th student, etc.) to produce a sample of 100 students to participate in the study.
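The systematic procedure in this example can be sketched as follows. The function name `systematic_sample` and the fixed seed are illustrative assumptions; the interval k = 50 and the start of 30 come from the example.

```python
import random

POPULATION_SIZE = 5000
SAMPLE_SIZE = 100
k = POPULATION_SIZE // SAMPLE_SIZE  # sampling interval: 5000 / 100 = 50


def systematic_sample(start, interval, population_size):
    """Return every `interval`-th list number, beginning at `start` (1-based)."""
    return list(range(start, population_size + 1, interval))


# In practice the start is picked at random among the first k students.
random_start = random.Random(7).randint(1, k)
invited = systematic_sample(random_start, k, POPULATION_SIZE)

# The case in the text: a start of 30 gives the 30th, 80th, 130th, ... students.
print(systematic_sample(30, k, POPULATION_SIZE)[:3])  # [30, 80, 130]
```

Any start between 1 and 50 yields exactly 100 students, because 5,000 is an exact multiple of the interval.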
B) Probability Data Samples

3. Stratified samples

• Divide the population into groups according to some characteristic that is important to the study, then sample from each group
• Example:
• A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. Assume that, because of different lifestyles, the level of physical fitness is different between male and female students. To account for this variation in lifestyle, the population of students can easily be stratified into male and female students. Then she can use either the random method or the systematic method to select the participants. For example, she can use a random sample to choose 50 male students and the systematic method to choose another 50 female students, or the other way round.
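A stratified draw along these lines might look like the sketch below. The sampling frame is hypothetical: the slides do not say how many students are male or female, so an even split is assumed purely for illustration, and a random draw is used inside both strata.

```python
import random

rng = random.Random(1)  # fixed seed (an assumption) for a repeatable draw

# Hypothetical frame of (student number, sex); the even split is an assumption.
students = [(i, "male" if i % 2 else "female") for i in range(1, 5001)]


def stratified_sample(frame, key, per_stratum, rng):
    """Group the frame into strata by `key`, then draw randomly within each one."""
    strata = {}
    for unit in frame:
        strata.setdefault(key(unit), []).append(unit)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


chosen = stratified_sample(students, key=lambda s: s[1], per_stratum=50, rng=rng)
print({name: len(members) for name, members in chosen.items()})
# {'male': 50, 'female': 50}
```

Sampling within each stratum guarantees that both groups appear in the sample in the intended numbers, which a single random draw from the whole list would not.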
B) Probability Data Samples

4. Cluster samples
• Divide the population into sections/clusters, then randomly select some of those clusters and choose all members from the selected clusters
• Using cluster sampling can reduce cost and time.

• Example:
• A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample to take a physical fitness test. Assume that, because of different lifestyles, the level of physical fitness is different between freshmen, sophomores, juniors and seniors. To account for this variation in lifestyle, the population of students can easily be clustered into freshmen, sophomores, juniors and seniors. Then she can choose any one cluster, such as freshmen, and take all the freshmen as the participants.
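The cluster procedure can be sketched as below. The cluster sizes are hypothetical (the slides give only the total of 5,000 students), and `cluster_sample` is an illustrative helper name.

```python
import random

rng = random.Random(3)  # fixed seed (an assumption) for a repeatable draw

# Hypothetical cluster sizes that add up to the 5,000 students in the example.
cluster_sizes = {"freshmen": 1500, "sophomores": 1300, "juniors": 1200, "seniors": 1000}

# Assign consecutive student numbers to each cluster.
clusters = {}
next_number = 1
for name, size in cluster_sizes.items():
    clusters[name] = list(range(next_number, next_number + size))
    next_number += size


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every member of each picked cluster."""
    selected = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in selected}


picked = cluster_sample(clusters, n_clusters=1, rng=rng)
for name, members in picked.items():
    print(name, len(members))
```

Unlike stratified sampling, only the selected clusters contribute participants, and every member of a selected cluster is included.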
EXERCISE 1.2
1. In each of these statements, identify the type of sample obtained.

a) A quality engineer wants to inspect rolls of wallpaper in order to obtain information on the rate at which flaws in the printing are occurring. She decides to draw a sample of 50 rolls of wallpaper from a day’s production. Each hour for 5 hours, she takes the 10 most recently produced rolls and counts the number of flaws on each.

b) Suppose a researcher has a list of 1000 registered voters in a community and he wants to pick a probability sample of 50. He uses a random number table to pick one of the first 20 voters (1000/50 = 20) on the list. The table gives him the number 16, so he selects the 16th voter on the list as the first selection. Then he picks every 20th name after this random start (the 36th voter, the 56th voter, etc.) to produce a sample.

c) A researcher wants to survey students in 100 homerooms in secondary schools in a large school district. The researcher randomly selects 10 schools from all the secondary schools in the district. Then, from a list of homerooms in the 10 schools, the researcher randomly selects 100 homerooms.
EXERCISE 1.2
1. In each of these statements, identify the type of sample obtained.

d) In consumer surveys of large cities, the procedure is to divide a map of the city into small blocks, each containing a cluster of households. A number of clusters are selected for the sample, and all the households in a selected cluster are surveyed.
• This approach can reduce cost and time. Less energy and money are expended if an interviewer stays within a specific area rather than traveling across stretches of the city.

e) Suppose our population is a university student body. We want to estimate the average annual expenditures of a college student for non-school items. Assume we know that, because of different lifestyles, juniors and seniors spend more than freshmen and sophomores, but there are fewer students in the upper classes than in the lower classes because of some dropout factor. To account for this variation in lifestyle and group size, the population of students can easily be divided into freshmen, sophomores, juniors and seniors. A sample can be taken from each group and each result weighted to provide an overall estimate of average non-school expenditures.
STEP 4
Classifying and Summarizing the Data

• Organize or group the facts/sample raw data for study and investigation

• Classifying: identifying items with like characteristics & arranging them into groups or classes.
• Ex: Production data (product make, location, production process,…)
• Data can be classified as Qualitative (categorical/attributes) data and Quantitative (numerical) data.

• Summarization
• Graphical & descriptive statistics (tables, charts, measures of central tendency, measures of variation, measures of position)
Data Classification
DATA TYPES

• Qualitative: Nominal, Ordinal, Binary
• Quantitative: Continuous, Discrete
Data Classification
• Data are the values that variables can assume
• A variable is a characteristic or attribute that can assume different values.
• Variables whose values are determined by chance are called random variables

Variables can be classified:
• As quantitative and qualitative
• By how they are categorized, counted or measured (level of measurement of data)
Types of Data

Qualitative (categorical/attributes)
1* Data that refer only to name classification (done using numbers)
2* Can be placed into distinct categories according to some characteristic or attribute.
• Nominal data (can't be ranked): gender, race, citizenship, etc. Use codes (1, 2,…)
• Ordinal data (can be ranked): feeling (dislike – like), color (dark – bright), etc. Likert scale

Quantitative (numerical)
1* Data that represent counts or measurements (can be counted or measured)
2* Are numerical in nature and can be ordered or ranked.
• Discrete variables: assume values that can be counted and are finite. Ex: number of “something”
• Continuous variables:
1. Can assume all values between any two specific values & are obtained by measuring
2. Have boundaries and must be rounded because of the limits of the measuring device
Ex: weight, age, salary, height, temperature, etc.
EXERCISE 1.2
2. The Lemon Marketing Corporation has asked you for information about the car you
drive. For each question, identify each of the types of data requested as either
attribute data or numeric data. When numeric data is requested, identify the
variable as discrete or continuous.

a) What is the weight of your car?


b) In what city was your car made?
c) How many people can be seated in your car?
d) What’s the distance traveled from your home to your school?
e) What’s the color of your car?
f) How many cars are in your household?
g) What’s the length of your car?
h) What’s the normal operating temperature (in degree Fahrenheit) of your car’s
engine?
i) What gas mileage (miles per gallon) do you get in city driving?
j) Who made your car?
k) How many cylinders are there in your car’s engine?
l) How many miles have you put on your car’s current set of tyres?
Level of Measurements of Data
• Nominal-level data: classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.
• Ordinal-level data: classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
• Interval-level data: ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
• Ratio-level data: possesses all the characteristics of interval measurement, and there exists a true zero.
EXERCISE 1.2
3. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.
Industry Number of injuries
Railroad 4520
Intercity bus 5100
Subway 6850
Trucking 7144
Airline 9950

a) What are the variables under study?
b) Categorize each variable as qualitative or quantitative.
c) Categorize each quantitative variable as discrete or continuous.
d) Identify the level of measurement for each variable.
STEP 5
Presenting and Analyzing the data

• Summarize & analyze the information given by
• graphical statistics (graphs and charts)
• descriptive statistics (refer to topic 1.3)
• Identify the relationships in the information

• Make any relevant statistical inferences
• hypothesis testing
• confidence interval
• ANOVA
• control charts, …
Types of Graph & Chart

• The purpose of graphs in statistics is to convey the data to the viewer in pictorial form.
• Graphs are useful in getting the audience’s attention in a publication or a presentation.
Distribution Shapes for Histogram
• Bell-shaped
• Has a single peak & tapers off at either end
• Approximately symmetric: it is roughly the same on both sides of a line running through the center

• Uniform
• Basically flat/rectangular

• J-shaped
• Has a few data values on the left side & increases as one moves to the right

• Reverse J-shaped
• Opposite of J-shaped
• Has a few data values on the right side & increases as one moves to the left

• Right skewed
• The peak is to the left
• The data values taper off to the right

• Left skewed
• The peak is to the right
• The data values taper off to the left

• Bimodal
• Has 2 peaks at the same height

• U-shaped
• The shape is a U
STEP 6
Making the decision

• The researchers can make a list of all the options and decisions which can achieve the objective and goal of the research, weigh the options and choose the one which represents the ‘best’ solution to the problem.
• The correctness of this choice depends on analytical skill and the quality of the information.
Statistical Problem Solving Methodology

START
1. Identify the problem or opportunity
2. Gather available internal and external facts relevant to the problem
3. Are available facts sufficient?
   • No: gather new data from populations and samples using instruments, interviews, questionnaires, etc.
   • Yes: continue to the next step
4. Classify, summarize, and process data using tables, charts, and numerical descriptive measures
5. Present and communicate summarized information in the form of tables, charts and descriptive measures
6. Is the information from a sample?
   • Yes: use sample information to (1) estimate the value of a parameter and (2) test assumptions about the parameter; then interpret the results, draw conclusions, and make decisions
   • No: use census information to evaluate alternative courses of action and make decisions
STOP
Role of the Computer in Statistics

Two software tools commonly used for data analysis:

1. Spreadsheets
• Microsoft Excel & Lotus 1-2-3

2. Statistical Packages
• MINITAB, SAS, STATA, SPSS, R and SPlus
Data Analysis Application in EXCEL
• Graphs and charts
• Formulas
• Add-in: Analysis ToolPak (Data Analysis)
1.3: REVIEW OF DESCRIPTIVE STATISTICS
• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.

• Describe data using measures of variation, such as the range, variance, and standard deviation.

• Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.
Summary Statistics (Data Description)
• Statistical methods can be used to summarize data.

• Measures of average are also called measures of central tendency and include the
mean, median, mode, and midrange.

• Measures that determine the spread of data values are called measures of variation
or measures of dispersion and include the range, variance, and standard deviation.

• Measures of position tell where a specific data value falls within the data set or its
relative position in comparison with other data values. The most common measures
of position are percentiles, deciles, and quartiles.

• The measures of central tendency, variation, and position are part of what is called
traditional statistics. This type of data is typically used to confirm conjectures about
the data.
TIPS: INSERT & CLEAR DATA by using Scientific Calculator

• Casio fx-570MS
• Insert data: MODE SD, data, M+
• Shift 1
• Shift 2
• Clear data: Shift CLR 1

• Casio fx-570W
• Insert data: MODE SD, data, M+
• Shift 1
• Shift 2
• Shift 3
• Shift 4
• Clear data: Shift AC/ON =

ROUNDING RULE: The value of a statistic/parameter should be rounded to one more decimal place than occurs in the raw data
1.3.1 Measures of Central Tendency

Mean

the sum of the values divided by the total number of values.

Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, N = population size

Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, n = sample size

Example: 9 2 1 4 3 3 7 5 8 6, $\mu = \bar{x} = 4.8$
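The worked example can be checked with a couple of lines of Python. This is just a sketch; the same arithmetic applies whether the ten values are treated as a population or as a sample.

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

# Sum of the values divided by the number of values.
mean = sum(data) / len(data)
print(mean)  # 4.8
```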
Properties of Mean

• The mean is computed by using all the values of the data.

• The mean varies less than the median or mode when samples are taken from
the same population and all three measures are computed for these samples.

• The mean is used in computing other statistics, such as variance.

• The mean for the data set is unique, and not necessarily one of the data
values.

• The mean cannot be computed for an open-ended frequency distribution.

• The mean is affected by extremely high or low values and may not be the
appropriate average to use in these situations
1.3.1 Measures of Central Tendency

Median
the middle number of n ordered data (smallest to largest)

If n is odd: $MD = x_{\frac{n+1}{2}}$

If n is even: $MD = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$

Example (n odd): 9 2 1 3 3 7 5 8 6, MD = 5
Example (n even): 9 2 1 4 3 3 7 5 8 6, MD = 4.5
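The two cases of the median rule can be sketched as a small function. The name `median_of` is illustrative; note that Python lists are indexed from 0, so the 1-based positions in the formulas shift by one.

```python
def median_of(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (n + 1) / 2, 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n / 2 and n / 2 + 1


print(median_of([9, 2, 1, 3, 3, 7, 5, 8, 6]))     # 5
print(median_of([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))  # 4.5
```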
Properties of Median

• The median is used when one must find the center or middle value of a
data set.

• The median is used when one must determine whether the data values
fall into the upper half or lower half of the distribution.

• The median is used to find the average of an open-ended distribution.

• The median is affected less than the mean by extremely high or extremely low values.
• Try this data: 19 2 1 4 3 3 7 5 8 6
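The suggested data set differs from the earlier example only in its first value (19 instead of 9). A quick comparison using Python's standard `statistics` module shows how each measure reacts to that one extreme value:

```python
import statistics

original = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
with_extreme = [19, 2, 1, 4, 3, 3, 7, 5, 8, 6]

for label, values in (("original", original), ("with extreme value", with_extreme)):
    print(label, statistics.mean(values), statistics.median(values))
# original 4.8 4.5
# with extreme value 5.8 4.5
```

The mean moves from 4.8 to 5.8 while the median stays at 4.5.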
1.3.1 Measures of Central Tendency

Mode
the most commonly occurring value in a data series

• The mode is used when the most typical case is desired.

• The mode is the easiest average to compute.

• The mode can be used when the data are nominal, such as religious
preference, gender, or political affiliation.

• The mode is not always unique. A data set can have more than one mode,
or the mode may not exist for a data set.

Example: 9 2 1 4 3 3 7 5 8 6 Mode = 3
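Because the mode may be unique, multiple, or absent, one way to sketch it is a function that returns a list. The helper name `modes` and the convention of returning an empty list when no value repeats are assumptions made for this illustration; the second and third data sets are made up to show those two cases.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))  # [3]
print(modes([1, 1, 2, 2, 3]))                 # [1, 2]
print(modes([1, 2, 3]))                       # []
```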
1.3.1 Measures of Central Tendency

Midrange
is a rough estimate of the middle & also a very rough estimate of the average, and can be affected by one extremely high or low value.

$MR = \frac{\text{lowest value} + \text{highest value}}{2}$

Example: 9 2 1 4 3 3 7 5 8 6
MR = 5
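A short sketch of the midrange, including its sensitivity to a single extreme value. The second data set replaces the 9 with 19 for illustration.

```python
def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2


print(midrange([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))   # 5.0
print(midrange([19, 2, 1, 4, 3, 3, 7, 5, 8, 6]))  # 10.0
```

Changing one value doubles the midrange here, which is why it is described as only a very rough estimate.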
Types of Distribution

Symmetric

Positively skewed or right-skewed Negatively skewed or left-skewed


EXERCISE 1.3.1
1. Determine the type of distribution of the following data
• 9 2 1 4 3 3 7 5 8 6

• 123 195 138 115 179 119 148 147 180 146 179 189 175 108 193
114 179 147 108 128 164 174 128 159 193 204 125 133 115 168
123 183 116 182 174 102 123 99 161 162 155 202 110 132

• Mean = Mode = Median = 11

• Mean = 25, Mode = 13, Median = 17

• Mean = 5, Mode = 73, Median = 17


1.3.2 Measures of Variation / Dispersion

• Used when the measure of central tendency alone is not informative or not needed (ex: the means are the same for two sets of data)
• If the means of x and y are the same and the dispersion of x < the dispersion of y, then population x is better (more consistent) than population y

• Measures the variability that exists in a data set

• Helps to form a judgment about how well the average value depicts the data

• Shows the extent of the scatter so that steps may be taken to control the existing variation
1.3.2 Measures of Variation / Dispersion

Range

is the difference between the highest value and the lowest value in a data set.
The symbol R is used for the range.

R = highest value - lowest value

Example: 9 2 1 4 3 3 7 5 8 6
R = 9 - 1 = 8
1.3.2 Measures of Variation / Dispersion

Variance
is the average of the squares of the distance each value is from the mean.

Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, N = population size

Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, n = sample size

Example: 9 2 1 4 3 3 7 5 8 6
$\sigma^2 = 6.4$
$s^2 = 7.1$
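The two variance formulas differ only in the divisor (N versus n - 1). A minimal sketch that reproduces the example values; the variable names are illustrative:

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
n = len(data)
mean = sum(data) / n

# Sum of squared distances from the mean (63.6 for this data set).
sum_of_squares = sum((x - mean) ** 2 for x in data)

population_variance = sum_of_squares / n    # divide by N
sample_variance = sum_of_squares / (n - 1)  # divide by n - 1

print(round(population_variance, 1), round(sample_variance, 1))  # 6.4 7.1
```

The standard library functions `statistics.pvariance` and `statistics.variance` compute the same two quantities.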
1.3.2 Measures of Variation / Dispersion

Standard Deviation
is the square root of the variance

Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, N = population size

Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, n = sample size

Example: 9 2 1 4 3 3 7 5 8 6
$\sigma = 2.5$
$s = 2.7$
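The example values can be reproduced with the standard `statistics` module, where `pstdev` is the population form and `stdev` is the sample form:

```python
import statistics

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

sigma = statistics.pstdev(data)  # square root of the population variance
s = statistics.stdev(data)       # square root of the sample variance

print(round(sigma, 1), round(s, 1))  # 2.5 2.7
```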
Properties of Variance &
Standard Deviation
• Variances and standard deviations can be used to determine the spread of
the data. If the variance or standard deviation is large, the data are more
dispersed. The information is useful in comparing two or more data sets to
determine which is more variable.

• The measures of variance and standard deviation are used to determine the consistency of a variable.

• The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.

• The variance and standard deviation are used quite often in inferential statistics.

• The standard deviation is used to estimate the amount of spread in the population from which the sample was drawn.
EXERCISE 1.3.2
1. Which of the following sets of data is more dispersed?
• A: 9 2 1 4 3 3 7 5 8 6
B: 3 4 5 4 3 3 8 6 7 5

• Group 1:
123 195 138 115 179 119 148 147 180 146 179 189 175 108
193 114 179 147 108 128 164 174

Group 2
128 159 193 204 125 133 115 168 123 183 116 182 174 102
123 99 161 162 155 202 110 132
Accuracy and Precision

• Accuracy is how close measurements are to the "true" value. Measuring devices or techniques can easily be inaccurate and lead to false measurements, and no matter how accurate a device may be, there is still a tolerance for error. No measurement is perfect. Accuracy must be accounted for in your results.

• Precision is how consistent your results are for the same phenomenon over several measurements, or how repeatable a device's (like a spring's) performance can be made. Precision, as a measure of variation, must be accounted for in your calculations and results.
Game of Darts

• Picture A is very accurate (close to the mark), but not very precise, since the darts are spread out everywhere.
• Picture B shows an example of precision without accuracy (very consistent, but not near the mark).
• Picture C shows both inaccuracy and imprecision.
• Picture D shows both accuracy and precision.

[Figure: dartboards for Pictures A to D]
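
One possible way to put numbers on the darts analogy is to treat accuracy as the offset of the average reading from the true value, and precision as the sample standard deviation of the readings. The sketch below assumes that interpretation; the readings, the true value of 10.0, and the function name `accuracy_and_precision` are all hypothetical and are not taken from the slides.

```python
import statistics


def accuracy_and_precision(readings, true_value):
    """Return (bias, spread) for repeated measurements.

    bias   = mean reading - true value (small magnitude = accurate)
    spread = sample standard deviation (small = precise)
    """
    bias = statistics.mean(readings) - true_value
    spread = statistics.stdev(readings)
    return bias, spread


TRUE_VALUE = 10.0  # hypothetical "mark"

# Hypothetical readings: centred on the mark but scattered (like Picture A)
scattered = [8.0, 12.0, 9.0, 11.0]
# Hypothetical readings: tightly grouped but off the mark (like Picture B)
off_target = [12.1, 12.0, 11.9, 12.0]

for name, readings in [("scattered", scattered), ("off_target", off_target)]:
    bias, spread = accuracy_and_precision(readings, TRUE_VALUE)
    print(f"{name}: bias={bias:.2f}, spread={spread:.2f}")
```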
1.3.3 Measures of Position
Describing the position of the data value (increasing order)

Measure        Split data into     Formula
Percentiles    100 equal parts     $P_i = x_{in/100} = x_c$
Deciles        10 equal parts      $D_i = x_{in/10} = x_c$
Quartiles      4 equal parts       $Q_i = x_{in/4} = x_c$

TIPS: If c is not a whole number, round it up to the next whole number. If c is a whole number, then use $(x_c + x_{c+1})/2$.

Example: 9 2 1 4 3 3 7 5 8 6
$P_{15} = 2$, $D_3 = 3$, $Q_2 = 4.5$
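
The position rule in the TIPS can be sketched as a single Python function. The name `position_value` and its parameters are illustrative assumptions: `i` is the index of the measure and `parts` is 100, 10 or 4 for percentiles, deciles or quartiles. Note that statistical software often uses different interpolation rules, so library results may differ from this slide convention.

```python
def position_value(data, i, parts):
    """Value at position c = i*n/parts, following the slide's rule.

    If c is not a whole number, round it up and take x_c.
    If c is a whole number, take the average of x_c and x_(c+1).
    Positions are 1-based in the ordered data.
    """
    if not 0 < i < parts:
        raise ValueError("i must satisfy 0 < i < parts")
    xs = sorted(data)
    n = len(xs)
    c, remainder = divmod(i * n, parts)  # exact integer arithmetic
    if remainder:
        # c is not whole: the rounded-up 1-based position c + 1 is index c
        return xs[c]
    # c is whole: average x_c and x_(c+1), i.e. indices c - 1 and c
    return (xs[c - 1] + xs[c]) / 2


# Example data set from the slide
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(position_value(data, 15, 100))  # P15 -> 2
print(position_value(data, 3, 10))    # D3  -> 3.0
print(position_value(data, 2, 4))     # Q2  -> 4.5
```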
EXERCISE 1.3.3
1. Given 9 2 1 4 3 7 5 4 6.

a) What percentile is the value of 8?
b) Find the value corresponding to the 4th decile.
c) Find the value corresponding to the 3rd quartile.

2. Given 9 22 11 14 13 3 7 15 18 16

a) Find the value corresponding to the 20th percentile.
b) What percentile is the value of 20?
c) Find the value corresponding to the 7th decile.

TIPS: The percentile corresponding to a given value of x is computed by:

$$\text{Percentile} = \frac{(\text{number of values below } x) + 0.5}{\text{total number of values}} \times 100\%$$
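
The percentile formula in the TIPS translates almost directly into code. The sketch below uses the earlier example data set rather than the exercise data; the function name `percentile_of` is an illustrative choice.

```python
def percentile_of(data, x):
    """Percentile of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)


# Example data set used earlier in the slides
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(percentile_of(data, 7))  # 75.0 (seven values lie below 7)
```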
Outliers
• An outlier is an extremely high or an extremely low data value when compared
with the rest of the data values.

• Outliers can be the result of measurement or observational error.

• When a distribution is normal or bell-shaped, data values that are beyond three
standard deviations of the mean can be considered suspected outliers.

Example: 9 22 11 14 13 3 7 15 18 16
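
The three-standard-deviation rule can be checked on the example data with a short sketch. It assumes the sample standard deviation is used and that the bell-shape condition is acceptable for the data; the function name `suspected_outliers` is illustrative.

```python
import statistics


def suspected_outliers(data, k=3):
    """Values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * s]


# Example data set from the slide
data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]
print(suspected_outliers(data))  # [] : no value is beyond 3 standard deviations
```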
EXERCISE 1.3.3

3. Given 19 2 1 4 3 7 5 4 6. Find the outliers.

4. Given 19 6 2 11 4 3 7 7 5 8 6 21 12
Find the outliers.
1.4: EXPLORATORY DATA ANALYSIS

• Draw and interpret a stem and leaf plot.

• Draw and interpret a boxplot.


Exploratory Data Analysis
• The purpose of exploratory data analysis is to examine data in order to
find out what information can be discovered. For example:
• Are there any gaps in the data?
• Can any patterns be discerned?
Stem-and-Leaf Plots
• A stem-and-leaf plot is a data plot that uses part of a data value as the stem (the leading digit) and part of the data value as the leaf (the trailing digit) to form groups or classes.

• It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphic form.

• Sometimes we can construct a back-to-back (mixture) plot, with leaves on both sides of a shared stem.

• Arranging the data in order is not essential, but the leaves must be arranged in order.

[Figure: stem-and-leaf layouts, "Stem | Leaf" and back-to-back "Leaf | Stem | Leaf"]
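
A minimal sketch of a text stem-and-leaf plot, assuming non-negative whole numbers with the tens as the stem and the units as the leaf. The function name `stem_and_leaf` is illustrative, and the data set is the one used in the outlier example.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return the rows of a stem-and-leaf plot (stem = tens, leaf = units)."""
    groups = defaultdict(list)
    for value in sorted(data):  # sorting keeps the leaves in order
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest so gaps stay visible
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]
for row in stem_and_leaf(data):
    print(row)
# 0 | 3 7 9
# 1 | 1 3 4 5 6 8
# 2 | 2
```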
EXERCISE 1.4

1. An insurance company researcher conducted a survey on the number of car thefts in a large city for a period of 30 days last summer. The raw data are shown below. Construct a stem and leaf plot.

52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
EXERCISE 1.4
2. The data shown represent the percentage of unemployed males and females in 1995 for a sample of countries of the world. Using the whole numbers as stems and the decimals as leaves, construct a back-to-back (mixture) stem and leaf plot and compare the distribution of the two groups.

Females                Males

8.0 3.7 8.6 5.0        8.8 1.9 5.6 4.6
7.0 3.3 8.6 3.2        1.5 6.6 5.6 0.3
8.8 6.8 9.2 5.9        2.2 5.6 3.1 5.9
7.2 4.6 5.6 5.3        9.8 8.7 6.0 5.2
7.7 8.0 8.7 0.5        4.4 9.6 6.6 6.0
6.5 3.4 3.0 9.4        4.6 3.1 4.1 7.7
Boxplots
• Boxplots are graphical representations of a five-number summary of a data set and its outliers. The five-number summary consists of:
• The lowest value of data set (minimum)
• Q1 (or 25th percentile)
• The median (or 50th percentile)
• Q3 (or 75th percentile)
• The highest value of data set (maximum)
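
The five-number summary can be sketched as follows, assuming the quartiles are found with the position rule from Section 1.3.3 (round c up if it is not whole, otherwise average x_c and x_(c+1)). The function names are illustrative, and other quartile conventions would give slightly different values.

```python
def quartile(data, i):
    """Q_i using the slide's position rule with c = i*n/4 (i = 1, 2, 3)."""
    xs = sorted(data)
    c, remainder = divmod(i * len(xs), 4)
    if remainder:
        return xs[c]                # c not whole: round up, take x_c
    return (xs[c - 1] + xs[c]) / 2  # c whole: average x_c and x_(c+1)


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    return (min(data), quartile(data, 1), quartile(data, 2),
            quartile(data, 3), max(data))


# Example data set used earlier in the slides
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(five_number_summary(data))  # (1, 3, 4.5, 7, 9)
```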
STEPS to Construct a Boxplot
• STEP 1: Arrange the data
• STEP 2: Find the median
• STEP 3: Find Q1 and Q3
• STEP 4: Find outliers
    • Points lying more than 1.5 times the interquartile range above Q3 or below Q1:
      $x < Q_1 - 1.5(Q_3 - Q_1)$ or $x > Q_3 + 1.5(Q_3 - Q_1)$

• STEP 5: Draw a scale for the data on the x axis.
• STEP 6: Locate the lowest value, Q1, the median, Q3, the highest value and outliers on the scale.
• STEP 7: Draw a box around Q1 and Q3, draw a vertical line through the median, and connect the upper and lower values.

Example: Refer to EXERCISE 1.4 (3)
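
STEP 3 and STEP 4 can be sketched in Python as below, again assuming the quartile position rule from Section 1.3.3. The function names are illustrative, and the value 25 appended to the earlier example data is hypothetical, added only so that the rule flags something.

```python
def quartile(data, i):
    """Q_i using the slide's position rule with c = i*n/4 (i = 1, 2, 3)."""
    xs = sorted(data)
    c, remainder = divmod(i * len(xs), 4)
    if remainder:
        return xs[c]                # c not whole: round up, take x_c
    return (xs[c - 1] + xs[c]) / 2  # c whole: average x_c and x_(c+1)


def boxplot_outliers(data):
    """Return (lower fence, upper fence, outliers) using the 1.5 * IQR rule."""
    q1, q3 = quartile(data, 1), quartile(data, 3)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers


# Earlier example data plus one hypothetical extreme value (25)
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6, 25]
print(boxplot_outliers(data))  # (-4.5, 15.5, [25])
```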


Summarize the Boxplots

EXTRA INFO:
1. If the boxplots for two or more data sets are graphed on the
same axis, the distributions can be compared.
2. To compare the averages, use the location of the medians.
EXERCISE 1.4
3. Plot a boxplot for the following data. Then describe the data.

a) 9 22 11 14 13 3 7 15 18 16

b) 19 2 1 7 5 8 6

4. A dietician is interested in comparing the sodium content of real cheese with the sodium content of a cheese substitute. The data for two random samples are shown. Compare the distributions using boxplots.

Real Cheese: 310, 420, 45, 40, 220, 240, 180, 90

Cheese Substitute: 270, 180, 250, 290, 130, 260, 340, 310
Conclusion

• The applications of statistics are many and varied. People encounter them in everyday life, such as in reading newspapers or magazines, listening to the radio, or watching television.
• By combining all of the descriptive statistics techniques discussed in this presentation together, the student is now able to collect, organize, summarize and present data.
