3.MU-Introduction To Statistics and Data Analysis-Talk - (04.07.2025)
CONTENT
Population
The complete collection of measurements, outcomes, objects or individuals under study
• Tangible: Always finite; each time an item is sampled, the population size decreases by 1. The total number of members is fixed & could be listed.
• Conceptual: Consists of all the values that might possibly have been observed & has an unlimited number of members.
Parameter
A number that describes a population characteristic
Sample
A subset of a population, containing the objects or outcomes that are actually observed
Statistic
A number that describes a sample characteristic
EXERCISE 1.1
1. The freshman class has 317 students and an IQ pre-test is given to all of
them in their first week. The dean of admission collected data on 27 of
them and found their mean score on the IQ pre-test was 51. The mean
for the entire freshman class was therefore estimated to be
approximately 51 on this test. A subsequent computer analysis of all
freshmen showed the true mean to be 52.
Based on the above problem,
a) What is the population?
b) Is the population tangible or conceptual?
c) What is the sample?
d) What number is a parameter?
e) What number is a statistic?
Statistics
• Statistics is defined as the study of collecting, presenting,
and analyzing information and drawing conclusions with the
aid of mathematics and computers. The most important
activity is perhaps the development of methods by which
inferences can be made from data. This area includes both
the estimation of population parameters and tests of
hypotheses based on probability theory. In many
instances, a sample is obtained from some collection of
objects or items and, on the basis of this particular set of
data, we wish to make sensible comments about the
nature of the larger collection, called the population. Statistics is an
essential tool for business, planning, market research,
forecasting and finding patterns inherent in data.
Statistics
• With the increased use of technology, data handling skills
have become highly sought after, and a knowledge of the
subject is useful in understanding the statistical
information we are presented with every day. Today a
new discipline has emerged, termed "Data
Science". Data science is a "concept to unify statistics,
data analysis, informatics, and their related methods" in
order to "understand and analyze actual phenomena"
with data. It uses techniques and theories drawn from
many fields within the context of mathematics, statistics,
computer science, information science, and domain
knowledge. A data scientist, often working as part of a team,
creates programming code and combines it with
statistical knowledge to create insights from data.
Descriptive & Inferential Statistics
Descriptive statistics
• Consists of the collection, organization, classification, summarization, and presentation of data obtained from the sample.
• Used to describe the characteristics of the sample.
• Used to determine whether the sample represents the target population by comparing the sample statistic and the population parameter.
Inferential statistics
• Consists of generalizing from samples to populations, performing estimations and hypothesis testing, determining relationships among variables, and making predictions.
• Used to describe, infer, estimate, or approximate the characteristics of the target population.
• Used when we want to draw a conclusion from the data obtained from the sample.
EXERCISE 1.1
2. In each of these statements, tell whether descriptive or inferential
statistics have been used.
a) Tens of thousands of parents in Malaysia have chosen StemLife as their
trusted stem cell bank. (Descriptive)
b) The death rate from lung cancer was 10 times as high for smokers as
for nonsmokers. (Inferential)
c) The average cost of a wedding is nearly RM10,000.
d) In the USA, the median salary for men with a bachelor’s degree is $49,982,
while the median salary for women with a bachelor’s degree is
$35,408.
e) Globally, an estimated 500,000 children under the age of 15 live with
Type 1 diabetes.
f) A researcher claims that a new drug will reduce the number of heart
attacks in men over 70 years of age.
An overview of descriptive statistics and
statistical inference
START
1. Gathering of data
2. Classification, summarization, and processing of data
3. Presentation and communication of summarized information
4. Is the information from a sample?
• Yes (statistical inference): Use sample information to make inferences about the population, then draw conclusions about the population characteristic (parameter) under study.
• No (descriptive statistics): Use census data to analyze the population characteristic under study.
STOP
Need for Statistics
• Is it to impose some treatment on the group & then test the response?
• If samples are needed, how large should they be and how should they be taken? The
larger the better (more than 30)
Characteristics of Sample Size
• The larger the sample, the smaller the magnitude of sampling
errors.
B) Probability Data Samples
• A probability sample is one in which the chance of selection of each item in the
population is known before the sample is picked
1. Judgment samples
2. Voluntary samples
3. Convenience samples
1. Random samples
• Selected using chance method or random methods
• Example:
• A lecturer wants to study the physical fitness levels of
students at her university. There are 5,000 students
enrolled at the university, and she wants to draw a
sample of size 100 to take a physical fitness test. She
obtains a list of all 5,000 students, numbers them from 1 to
5,000, randomly selects 100 of those numbers, and invites
the corresponding students to participate in the
study.
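The random draw in this example can be sketched in Python. This is only an illustration: the seed value and the variable names are assumptions, not part of the lecture.

```python
import random

POPULATION_SIZE = 5000  # students on the list, numbered 1..5000
SAMPLE_SIZE = 100       # students to invite

# Fixed seed (an assumption) so the same draw can be reproduced.
rng = random.Random(2025)

# Draw 100 distinct student numbers; every student has the same chance.
student_numbers = range(1, POPULATION_SIZE + 1)
invited = sorted(rng.sample(student_numbers, SAMPLE_SIZE))

print(invited[:10])  # first few invited student numbers
```

Because `random.sample` draws without replacement, no student is invited twice.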
B) Probability Data Samples
2. Systematic samples
• Number each subject of the population and select every kth
subject.
• Example:
• A lecturer wants to study the physical fitness levels of students
at her university. There are 5,000 students enrolled at the
university, and she wants to draw a sample of size 100 to take a
physical fitness test. She obtains a list of all 5,000 students,
numbers them from 1 to 5,000 and randomly picks one of the first
50 students (5000/100 = 50) on the list. If the picked number is 30,
then the 30th student in the list should be invited first. Then she
should invite every 50th name on the list after this
random start (the 80th student, the 130th student, etc.) to
produce a sample of 100 students to participate in the study.
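A minimal Python sketch of the systematic selection described above. The function name `systematic_sample` and its parameters are hypothetical, chosen for illustration only.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Return the list positions chosen by systematic sampling.

    k is the sampling interval; the start is a random position
    among the first k subjects unless one is supplied.
    """
    k = population_size // sample_size
    if start is None:
        start = random.randint(1, k)
    return [start + i * k for i in range(sample_size)]

# Reproduce the example: 5,000 students, sample of 100, random start 30.
invited = systematic_sample(5000, 100, start=30)
print(invited[:3])   # [30, 80, 130]
print(invited[-1])   # 4980
```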
B) Probability Data Samples
3. Stratified samples
4. Cluster samples
• Divide the population into sections/clusters, then randomly
select some of those clusters and choose all members
from the selected clusters
• Using cluster sampling can reduce cost and time.
• Example:
• A lecturer wants to study the physical fitness levels of students at her
university. There are 5,000 students enrolled at the university, and she
wants to draw a sample to take a physical fitness test. Assume that,
because of different lifestyles, the level of physical fitness is different
between freshman, sophomore, junior and senior students. To
account for this variation in lifestyle, the population of students can
easily be clustered into freshman, sophomore, junior and senior
students. She can then randomly choose one cluster, such as freshmen, and
take all the freshman students as participants.
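The cluster selection can be sketched in Python as follows. The roster here is made up purely for illustration (each class year is assumed to hold 1,250 students), and the seed is an arbitrary choice.

```python
import random

rng = random.Random(7)  # arbitrary seed so the run is repeatable

class_years = ["freshman", "sophomore", "junior", "senior"]

# Hypothetical roster: student number -> class year (invented data).
roster = {number: class_years[number % 4] for number in range(1, 5001)}

# Step 1: randomly select one cluster.
chosen_cluster = rng.choice(class_years)

# Step 2: take every member of the selected cluster.
participants = [n for n, year in roster.items() if year == chosen_cluster]

print(chosen_cluster, len(participants))
```

Unlike the random and systematic samples, the sample size here is not fixed in advance: it equals the size of whichever cluster is drawn.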
EXERCISE 1.2
1. In each of these statements, identify the type of sample obtained.
• Summarization
• Graphical & descriptive statistics (tables, charts, measures of central
tendency, measures of variation, measures of position)
Data Classification
DATA TYPES
• Qualitative: Nominal, Ordinal, Binary
• Quantitative: Continuous, Discrete
Data Classification
• Data are the values that variables can assume
• A variable is a characteristic or attribute that can assume different values.
• Variables whose values are determined by chance are called random
variables
Variables can be
classified
Examples
EXERCISE 1.2
3. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.
Industry        Number of injuries
Railroad        4520
Intercity bus   5100
Subway          6850
Trucking        7144
Airline         9950
Bimodal: Has 2 peaks at the same height
U-Shaped: The shape is U
STEP 6
Making the decision
Statistical
questionnaire, etc
Yes
Solving
Present and communicate
Is information from a sample?
• Yes: Use sample information to
1. Estimate value of parameter
2. Test assumptions about parameter
Then interpret the results, draw conclusions, and make decisions
• No: Use census information to evaluate alternative courses of action and make decisions
STOP
Role of the Computer in Statistics
1. Spreadsheets
• Microsoft Excel & Lotus 1-2-3
2. Statistical Packages
• MINITAB, SAS, STATA, SPSS, R and S-PLUS
Data Analysis Application in EXCEL
• Graph and chart
• Formulas
• Add-in: Analysis ToolPak, Data Analysis
1.3: REVIEW OF DESCRIPTIVE STATISTICS
• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
• Measures of average are also called measures of central tendency and include the
mean, median, mode, and midrange.
• Measures that determine the spread of data values are called measures of variation
or measures of dispersion and include the range, variance, and standard deviation.
• Measures of position tell where a specific data value falls within the data set or its
relative position in comparison with other data values. The most common measures
of position are percentiles, deciles, and quartiles.
• The measures of central tendency, variation, and position are part of what is called
traditional statistics. This type of statistics is typically used to confirm conjectures about
the data.
TIPS: INSERT & CLEAR DATA by using Scientific Calculator
• Casio fx-570W
  • Insert data
    • MODE SD data M+
    • Shift 1
    • Shift 2
    • Shift 3
    • Shift 4
  • Clear data
    • Shift AC/ON =
• Casio fx-570MS
  • Insert data
    • MODE SD data M+
    • Shift 1
    • Shift 2
  • Clear data
    • Shift CLR 1
Mean
Population: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, N = population size
Sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, n = sample size
Example: 9 2 1 4 3 3 7 5 8 6, $\bar{x} = 4.8$
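The example value can be checked with a short Python sketch that applies the sample mean formula directly (sum of the values divided by n) and compares it with the standard library's `statistics.mean`. The variable names are illustrative.

```python
from statistics import mean

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

# Sample mean: sum of the observations divided by the sample size n.
x_bar = sum(data) / len(data)

print(x_bar)        # 4.8
print(mean(data))   # same result from the standard library
```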
Properties of Mean
• The mean varies less than the median or mode when samples are taken from
the same population and all three measures are computed for these samples.
• The mean for the data set is unique, and not necessarily one of the data
values.
• The mean is affected by extremely high or low values and may not be the
appropriate average to use in these situations
1.3.1 Measures of Central Tendency
Median
the middle number of n ordered data (smallest to largest)
If n is odd: $MD = x_{\frac{n+1}{2}}$
If n is even: $MD = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
Example (n odd): 9 2 1 3 3 7 5 8 6, MD = 5
Example (n even): 9 2 1 4 3 3 7 5 8 6, MD = 4.5
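A small Python sketch of the positional rule for the median: sort the data, then take the middle value when n is odd or the average of the two middle values when n is even. The function name `median_by_position` is hypothetical.

```python
def median_by_position(values):
    """Median of a non-empty list using the odd/even position rule."""
    ordered = sorted(values)          # smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value (converted to a 0-based index).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_by_position([9, 2, 1, 3, 3, 7, 5, 8, 6]))      # 5
print(median_by_position([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))   # 4.5
```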
Properties of Median
• The median is used when one must find the center or middle value of a
data set.
• The median is used when one must determine whether the data values
fall into the upper half or lower half of the distribution.
Mode
the most commonly occurring value in a data series
• The mode can be used when the data are nominal, such as religious
preference, gender, or political affiliation.
• The mode is not always unique. A data set can have more than one mode,
or the mode may not exist for a data set.
Example: 9 2 1 4 3 3 7 5 8 6 Mode = 3
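The three cases (one mode, more than one mode, no mode) can be illustrated with a Python sketch. The helper `modes` is hypothetical, and treating "every value occurs once" as "no mode" is the convention assumed here.

```python
from collections import Counter

def modes(values):
    """Return all modes of a non-empty data set, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))   # [3]
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]  (two modes)
print(modes([1, 2, 3]))                        # []      (no mode)
```

The same function works for nominal data, for example a list of category labels, because it only counts occurrences.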
1.3.1 Measures of Central Tendency
Midrange
is a rough estimate of the middle & also a very rough
estimate of the average and can be affected by one
extremely high or low value.
Example: 9 2 1 4 3 3 7 5 8 6
MR = 5
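The slide does not show the formula; the sketch below assumes the usual definition, midrange = (lowest value + highest value) / 2, which reproduces MR = 5 for the example. The extra value 100 is added only to illustrate the sensitivity to one extreme value.

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

# Midrange: halfway between the lowest and the highest value.
midrange = (min(data) + max(data)) / 2
print(midrange)   # 5.0

# One extremely high value (hypothetical) shifts the midrange a lot.
with_extreme = data + [100]
midrange_extreme = (min(with_extreme) + max(with_extreme)) / 2
print(midrange_extreme)   # 50.5
```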
Types of Distribution
Symmetric
• 123 195 138 115 179 119 148 147 180 146 179 189 175 108 193
114 179 147 108 128 164 174 128 159 193 204 125 133 115 168
123 183 116 182 174 102 123 99 161 162 155 202 110 132
Range
Example: 9 2 1 4 3 3 7 5 8 6
R=8
1.3.2 Measures of Variation / Dispersion
Variance
is the average of the squares of the distance each value is from the mean.
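Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, N = population size
Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, n = sample size
Example: 9 2 1 4 3 3 7 5 8 6
$\sigma^2 \approx 6.4$
$s^2 \approx 7.1$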
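Both example values can be reproduced with a Python sketch that computes the sum of squared distances from the mean and divides by N (population) or n - 1 (sample). The cross-check uses `pvariance` and `variance` from the standard `statistics` module; the variable names are illustrative.

```python
from statistics import pvariance, variance

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
n = len(data)
centre = sum(data) / n                      # mean = 4.8

# Sum of the squared distances of each value from the mean.
sum_sq = sum((x - centre) ** 2 for x in data)

population_variance = sum_sq / n            # divide by N
sample_variance = sum_sq / (n - 1)          # divide by n - 1

print(round(population_variance, 1))        # 6.4
print(round(sample_variance, 1))            # 7.1

# Same results from the standard library.
print(round(pvariance(data), 1), round(variance(data), 1))
```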
1.3.2 Measures of Variation / Dispersion
Standard Deviation
is the square root of the variance
Example: 9 2 1 4 3 3 7 5 8 6
$\sigma \approx 2.5$
$s \approx 2.7$
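As a quick check of the example, the standard library functions `pstdev` (population) and `stdev` (sample) can be compared with the square roots of the corresponding variances. This is a sketch for verification only.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

# Standard deviation = square root of the variance.
sigma = math.sqrt(pvariance(data))   # population
s = math.sqrt(variance(data))        # sample

print(round(sigma, 1), round(s, 1))                  # 2.5 2.7
print(round(pstdev(data), 1), round(stdev(data), 1)) # same values
```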
Properties of Variance &
Standard Deviation
• Variances and standard deviations can be used to determine the spread of
the data. If the variance or standard deviation is large, the data are more
dispersed. The information is useful in comparing two or more data sets to
determine which is more variable.
• The variance and standard deviation are used to determine the number of
data values that fall within a specified interval in a distribution.
• The variance and standard deviation are used quite often in inferential
statistics.
• Group 1:
123 195 138 115 179 119 148 147 180 146 179 189 175 108
193 114 179 147 108 128 164 174
Group 2
128 159 193 204 125 133 115 168 123 183 116 182 174 102
123 99 161 162 155 202 110 132
Accuracy and Precision
• Precision is how consistent your results are for the same phenomenon over
several measurements, or how repeatable a device's (like a spring's)
performance can be made.
Precision, as a measure of variation, must be accounted for in your
calculations and results
Game of Darts
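$P_i = x_c$ with $c = \frac{i \cdot n}{100}$, $D_i = x_c$ with $c = \frac{i \cdot n}{10}$, $Q_i = x_c$ with $c = \frac{i \cdot n}{4}$
If c is a whole number, then use
Example: 9 2 1 4 3 3 7 5 8 6, $P_{15} = 2$, $D_3 = 3$, $Q_2 = 4.5$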
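The rule for locating the value is incomplete on the slide. The Python sketch below assumes a common textbook convention: if c is a whole number, average the c-th and (c+1)-th ordered values; otherwise round c up and take that value. This assumption reproduces all three example answers, but it is not confirmed by the slide. The function name `position_measure` is hypothetical.

```python
def position_measure(values, i, parts):
    """i-th percentile (parts=100), decile (parts=10) or quartile (parts=4).

    Assumed rule: c = i * n / parts; whole c -> average of x_c and x_(c+1),
    otherwise round c up to the next whole number and use x_c.
    """
    if not 0 < i < parts:
        raise ValueError("i must lie strictly between 0 and parts")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(i * n, parts)
    if remainder == 0:
        # c is a whole number: average the c-th and (c+1)-th values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # c is not whole: round up, then convert to a 0-based index.
    return ordered[whole]

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(position_measure(data, 15, 100))  # P15 -> 2
print(position_measure(data, 3, 10))    # D3  -> 3.0
print(position_measure(data, 2, 4))     # Q2  -> 4.5
```

Other conventions (for example, interpolation between neighbouring values) give slightly different answers for small data sets.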
EXERCISE 1.3.3
1. Given 9 2 1 4 3 7 5 4 6 .
2. Given 9 22 11 14 13 3 7 15 18 16
• When a distribution is normal or bell-shaped, data values that are beyond three
standard deviations of the mean can be considered suspected outliers.
Example: 9 22 11 14 13 3 7 15 18 16
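A minimal Python sketch of the three-standard-deviation check for the example data. It assumes the sample standard deviation s is used (the slide does not say which), and the variable names are illustrative.

```python
from statistics import mean, stdev

data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

centre = mean(data)      # 12.8
spread = stdev(data)     # sample standard deviation (assumed choice)

# Limits three standard deviations below and above the mean.
low = centre - 3 * spread
high = centre + 3 * spread

suspected_outliers = [x for x in data if x < low or x > high]
print(suspected_outliers)   # [] : no value lies beyond the limits
```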
EXERCISE 1.3.3
4. Given 19 6 2 11 4 3 7 7 5 8 6 21 12
Find outliers
1.4: EXPLORATORY DATA ANALYSIS
52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
EXERCISE 1.4
2. The data shown represents the percentage of unemployed males and
females in 1995 for a sample of countries of the world. Using the whole
numbers as stems and the decimals as leaves, construct a back-to-
back (mixture) stem and leaf plot and compare the distribution of the
two groups.
Females Males
EXTRA INFO:
1. If the boxplots for two or more data sets are graphed on the
same axis, the distributions can be compared.
2. To compare the averages, use the location of the medians.
EXERCISE 1.4
3. Plot a boxplot for the following data. Then describe the data.
a) 9 22 11 14 13 3 7 15 18 16
b) 19 2 1 7 5 8 6
Cheese Substitute 270, 180, 250, 290, 130, 260, 340, 310
Conclusion