
ME-5101

Engineering Analysis & Statistics
Lect. # 3
Introduction to Statistics Terms
& Data Types
Dr. Nazeer Ahmad Anjum
Mechanical Engineering Program
Engineering University Taxila
Statistics Terms 2
• PRESENTATION OF DATA: This refers to the
organization of data into tables, graphs or charts, so
that logical and statistical conclusions can be
derived from the collected measurements. Data
may be presented by three methods:
– Textual
– Tabular
– Graphical

• Descriptive statistics
– Tabular
– Graphical
– Numerical
3/7/2020
Statistics Terms 3
• Population: the set of all elements of
interest. A large group of data or a large
number of measurements is called a
population.
• The population parameters for mean and
standard deviation are denoted by “μ” and
“σ” respectively.
• The value of a population parameter is always
constant. That is, for any population data set,
there is only one value of “μ” and one of “σ”.

Statistics Terms 4
• Sample: A subset of data taken from some
larger population or process is a sample.
• Random sample: If each item in the
population has an equal opportunity of being
selected, the sample is called a random sample. This
definition applies to both infinite and
finite populations. The observations in a random
sample of size n are treated as independently and
identically distributed.
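As a rough illustration of the population/sample distinction, the sketch below draws one random sample from a made-up population. The population values (1,000 simulated shaft diameters), the seed and the sample size are assumptions chosen only for the example.

```python
import random
import statistics

# Hypothetical population: 1,000 shaft diameters in cm (simulated, not real data).
rng = random.Random(42)
population = [rng.gauss(3.0, 0.009) for _ in range(1000)]

# Population parameters: a single fixed value each for this population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Random sample of size n: every item has the same chance of being selected.
n = 30
sample = rng.sample(population, n)

# Sample statistics: these change whenever a different sample is drawn.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)

print(f"population: mu = {mu:.4f}, sigma = {sigma:.4f}")
print(f"sample:     x_bar = {x_bar:.4f}, s = {s:.4f}")
```

Re-running with a different seed changes x_bar and s for the new sample, while mu and sigma stay fixed for any one population.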

Qualitative and Quantitative 5
• Data can be further classified as being
qualitative or quantitative.
• The statistical analysis that is
appropriate depends on whether the
data for the variable are qualitative or
quantitative.
• In general, there are more alternatives
for statistical analysis when the data are
quantitative.

Qualitative Data 6

• Qualitative data are labels or names
used to identify an attribute of each
element.
• Qualitative data use either the nominal
or ordinal scale of measurement.
• Qualitative data can be either numeric
or nonnumeric.
• The statistical analyses available for
qualitative data are rather limited.

Quantitative Data 7
• Quantitative data indicate either how
many or how much.
– Quantitative data that measure how
many are discrete.
– Quantitative data that measure how
much are continuous because there is
no separation between the possible
values for the data.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are
meaningful only with quantitative data.
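A small sketch of why the data type decides the analysis: categories can be counted, while quantities can be averaged. The inspection records below (field names and values) are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical inspection records (invented values, for illustration only).
records = [
    {"grade": "A", "defects": 0, "diameter_cm": 3.01},
    {"grade": "B", "defects": 2, "diameter_cm": 2.99},
    {"grade": "A", "defects": 1, "diameter_cm": 3.00},
    {"grade": "C", "defects": 0, "diameter_cm": 3.02},
]

# Qualitative data (labels): counting per category is meaningful, averaging is not.
grade_counts = Counter(r["grade"] for r in records)

# Quantitative data: "how many" (discrete) and "how much" (continuous).
mean_defects = mean(r["defects"] for r in records)
mean_diameter = mean(r["diameter_cm"] for r in records)

print(grade_counts)
print(mean_defects, round(mean_diameter, 3))
```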
Cross-Sectional Data 8
Also known as a transverse study or prevalence study:
• Data collected by observing many subjects/fields (such as
individuals, firms, countries, or regions) at the same point
in time. It provides a snapshot of a population at a certain
time, allowing conclusions about phenomena across a
wide population to be drawn.
• An example of a cross-sectional study would be a
medical study looking at the prevalence of breast
cancer in a population.
Example: data detailing the number of Electricity &
Gas Connections, issued during a financial year.
Cross-Sectional Data 9
• In short, cross-sectional data are data on one or more
variables collected at a single point in time. Such data
do not have a meaningful sequence.
• Examples: Sales of 30 companies.
• Productivity of each sales division.
• Firms' production and productivity for a
particular product in a particular session,
such as ACs, fridges, etc.

Time Series Data 10
• Time series data are data that have been collected
over a period of time on one or more variables.
• The data have an associated frequency of
observation, that is, of collection of data
points.
• The frequency is simply a measure of the interval
over, or the regularity with which, the data are
collected or recorded.
• Examples: Student enrolment in higher learning
institutions over the past 20 years, a firm’s
quarterly sales over the past five years, etc.

Time Series Data 11
• Time series data are a set of observations collected at
usually discrete and equally spaced time intervals over
several time periods.
– Examples: weather monitoring data, energy analysis data,
fatigue test data, etc.

Data Sources 12
• Existing Sources
– Data needed for a particular application might
already exist within a firm. Detailed information
is often kept on customers, suppliers, and
employees for example.
– Substantial amounts of business and
economic data are available from
organizations that specialize in collecting and
maintaining data.
Data Sources 13

• Existing Sources
– Government agencies are another
important source of data.

– Data are also available from a variety of
industry associations and special
interest organizations.

Data Sources 14
• Internet
– The Internet has become an important source
of data.
– Most government agencies and institutions, such as
the Small Industries Corporation, HEC, PIA,
educational institutions and research
organizations, make their data available
through a web site.
– More and more companies are creating web
sites and providing public access to them.
– A number of companies now specialize in
making information available over the
Internet.
Data Sources 15
• Statistical Studies
– Statistical studies can be classified as
either experimental or observational.
– In experimental studies the variables of
interest are first identified. Then one or
more factors are controlled so that data can
be obtained about how the factors influence
the variables.
– In observational (non-experimental) studies
no attempt is made to control or influence
the variables of interest; an example is a
survey.
Data Acquisition & Considerations 16
• Time Requirement
– Searching for information can be time consuming.
– Information might no longer be useful by the time it is
available.
• Cost of Acquisition
– Organizations often charge for information even
when it is not their primary business activity.
• Data Errors
– Using any data that happen to be available or that
were acquired with little care can lead to poor and
misleading information.

Descriptive Statistics 17
• Descriptive statistics are the tabular,
graphical, and numerical methods
used to summarize data.
• Statistics: Statistics is the science of
problem-solving in the presence of
variability.

Scientific, Quality & Productivity 18
The term Scientific suggests a process of objective
investigation that ensures that valid conclusions can
be drawn from an experimental study.
Scientific investigations are important not only in the
academic laboratories of research universities but
also in the engineering laboratories of industrial
manufacturers.
Quality and Productivity are characteristic goals of
industrial processes, which are expected to result in
goods and services that are highly sought by
consumers and that yield profits for the firms that
supply them.
19
Role of Statistics in Experimentation
Statistics is a scientific discipline devoted to the
drawing of valid inferences from experimental or
observational data.
The study of variation, including the construction of
experimental designs and the development of
models which describe variation, characterizes
research activities in the field of statistics.

20
Role of Statistics in Experimentation
Project Planning Phase
• What is to be measured?
• How large is the likely variation?
• What are the influential factors?
Experimental Design Phase
• Control known sources of variation
• Allow estimation of the size of the uncontrolled variation
• Permit an investigation of suitable models
Statistical Analysis Phase
• Make inferences on design factors
• Guide subsequent designs
• Suggest more appropriate models
Statistical Terms 21
Population: A statistical population consists of all possible
items or units possessing one or more common
characteristics under specified experimental or observational
conditions.
Process: A process is a repeatable series of actions that
results in an observable characteristic or measurement.
Variable: A property or characteristic on which information is
obtained in an experiment.
Observation: The collection of information in an experiment,
or actual values obtained on variables in an experiment.
Response Variable: Any outcome or result of an experiment.
Factors: Controllable experimental variables that can
influence the observed values of response variables.
Sample: A sample is a group of observations taken from a
population or a process.
Statistical Terms 23
Simple Random Sample: In an experimental setting, a
simple random sample of size n is obtained when items are
selected from a fixed population or a process in such a
manner that every group of items of size n has an equal
chance of being selected as the sample.
Parameters and Statistics: A parameter is a numerical
characteristic of a population or a process.
A statistic is a numerical characteristic that is computed from
a sample of observations.
Distribution: A tabular, graphical, or theoretical description
of the values of a variable using some measure of how
frequently they occur in a population, a process, or a sample.
Sampling Distribution. A sampling distribution is a
theoretical model that describes the probability of obtaining
the possible values of a sample statistic.
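The idea of a sampling distribution can be approximated by simulation: draw many simple random samples and record the statistic each time. The sketch below assumes a made-up finite population (the integers 1 to 100), a sample size of 10 and 2,000 repetitions. None of these come from the lecture.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical finite population: the integers 1..100.
population = list(range(1, 101))
mu = statistics.fmean(population)  # parameter: one fixed value for this population

n = 10       # sample size
reps = 2000  # number of repeated samples

# Each simple random sample yields one value of the statistic x_bar.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]

# The collection of x_bar values approximates the sampling distribution of the mean.
print(f"parameter mu      = {mu}")
print(f"mean of x_bar     = {statistics.fmean(sample_means):.2f}")
print(f"std dev of x_bar  = {statistics.stdev(sample_means):.2f}")
print(f"std dev of items  = {statistics.pstdev(population):.2f}")
```

The sample means cluster around the parameter and vary less than the individual items do, which is what makes the sample mean useful as an estimate.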
Statistical Terms 24
Mathematical Model: A model is termed mathematical if it is
derived from theoretical or mechanistic considerations that
represent exact, error-free assumed relationships among the
variables.
K_IC = γ S √a
where K_IC is the critical stress intensity factor, S is the fracture
strength, a is the size of the flaw that caused the fracture, and
γ is a constant relating to the flaw geometry.
Statistical Model: A model is termed statistical if it is
derived from data that are subject to various types of
specification, observation, experimental, and/or measurement
errors.
K_IC = γ S √a + e
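The difference between the two kinds of model can be sketched numerically. The example assumes the square-root form K_IC = γ S √a for the mathematical model and, as a further assumption not stated in the lecture, a normally distributed error term e. All numerical values are hypothetical.

```python
import math
import random

def k_ic_mathematical(gamma: float, s: float, a: float) -> float:
    """Exact, error-free relationship: K_IC = gamma * S * sqrt(a)."""
    return gamma * s * math.sqrt(a)

def k_ic_statistical(gamma: float, s: float, a: float,
                     error_sd: float, rng: random.Random) -> float:
    """Same relationship plus a random error e (assumed here to be N(0, error_sd^2))."""
    return k_ic_mathematical(gamma, s, a) + rng.gauss(0.0, error_sd)

# Hypothetical values chosen only for illustration (units left unspecified).
gamma, s, a = 1.0, 200.0, 0.04
rng = random.Random(1)

exact = k_ic_mathematical(gamma, s, a)
observed = [k_ic_statistical(gamma, s, a, error_sd=2.0, rng=rng) for _ in range(5)]

print(f"mathematical model: {exact:.2f}")
print("statistical model: ", [round(v, 2) for v in observed])
```

The mathematical model returns the same value on every call, whereas repeated "measurements" from the statistical model scatter around it.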

25
Frequency Distributions & Histograms
• Construct intervals, ordinarily equally spaced, which cover
the range of the data values.
• Count the number of observations in each of the intervals.
If desirable, form proportions or percentages of counts in
each interval.
• Clearly label all columns in tables and both axes on
histograms, including any units of measurement, and
indicate the sample or population size.
• For histograms, plot bars whose
(a) widths correspond to the measurement intervals,
(b) heights are (proportional to) the counts for each
interval (e.g., heights can be counts, proportions, or
percentages).

26
Frequency Distributions & Histograms

Mean (μ = 35.4) and standard deviation (σ = 2.65)


Example: Hudson Auto Repair 27
The manager of Hudson Auto would like to have a
better understanding of the cost of parts used in the
engine tune-ups performed in the shop. He examines
50 customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are listed below.

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

Example: Hudson Auto Repair 28
• Tabular Summary (Frequencies and
Percent Frequencies)
Parts Cost ($)    Frequency    Percent Frequency
50-59                 2                4
60-69                13               26
70-79                16               32
80-89                 7               14
90-99                 7               14
100-109               5               10
Total                50              100
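The tabular summary can be reproduced from the 50 invoice values with a few lines of code. This is a minimal sketch: the variable names are illustrative, and the class intervals are the width-10 classes used in the table.

```python
from collections import Counter

# Parts costs ($) from the 50 tune-up invoices listed in the example.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

# Assign each cost to the lower limit of its class: 50-59, 60-69, ..., 100-109.
counts = Counter((c // 10) * 10 for c in costs)
n = len(costs)

print(f"{'Parts Cost ($)':<16}{'Frequency':>10}{'Percent':>10}")
for lower in sorted(counts):
    label = f"{lower}-{lower + 9}"
    freq = counts[lower]
    print(f"{label:<16}{freq:>10}{100 * freq / n:>10.0f}")
print(f"{'Total':<16}{n:>10}{100:>10}")
```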
Example: Hudson Auto Repair 29
• Graphical Summary (Histogram)
[Figure: histogram of Parts Cost ($), 50 to 110, on the horizontal axis against Frequency, 0 to 18, on the vertical axis]
Is it a normal distribution, positively skewed, or negatively skewed?
Example: Hudson Auto Repair 30
• Numerical Descriptive Statistics
– The most common numerical descriptive
statistic is the average (or mean).
– Hudson’s average cost of parts, based on
the 50 tune-ups studied, is $79 (found by
summing the 50 cost values and then
dividing by 50).
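The average can be checked directly from the invoice data. A minimal sketch, with illustrative variable names:

```python
from statistics import mean

# Parts costs ($) from the 50 tune-up invoices listed in the example.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

total = sum(costs)     # sum of the 50 cost values
average = mean(costs)  # total divided by the number of values

print(f"sum = {total}, n = {len(costs)}, mean = {average:.2f}")
```

The unrounded mean is 3949 / 50 = 78.98, which is reported as $79.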

Statistical Inference 31
Inference: a conclusion reached on the basis
of evidence and reasoning.

Statistical inference is the process of using
data obtained from a small group of elements
(the sample) to make estimates and test
hypotheses about the characteristics of a larger
group of elements (the population).

Example: Hudson Auto Repair 32
• Process of Statistical Inference
1. Population consists of all tune-ups. Average cost of parts is
unknown.
2. A sample of 50 engine tune-ups is examined.
3. The sample data provide a sample average cost of
$79 per tune-up.
4. The value of the sample average is used to make an
estimate of the population average.
Standard Normal Distribution 33
This distribution facilitates easy calculation of the area between
any two points under the curve. Its mean is 0 and its variance is
1. Suppose X is a continuous random variable that has a normal
distribution N(μ, σ²). Then the random variable z = (X - μ)/σ
follows the standard normal distribution, as shown in the figure below,
denoted by z ~ N(0, 1).
The horizontal axis of this curve is represented by z.
The center point (mean) is labelled as 0.

Standard Normal Distribution 34
The Department of Transportation (DOT) was interested in
evaluating the safety performance of motorcycle helmets
manufactured in Pakistan. A total of 264 helmets were
obtained from the major manufacturers and supplied to an
independent research testing firm where impact penetration
and chin retention tests were performed on the helmets in
accordance with DOT standards.
a) What is the population of interest?
b) What is the sample?
c) Is the population finite or infinite?
d) What inferences can be made about the population based
on the tested samples?

Standard Normal Distribution 35
The standard normal distribution table is given in
Appendix A.1. This table gives the areas under
the standard normal curve between z = 0 and the
values of z from 0.00 to 3.09. Since the total area
under the curve is 1.00 and the curve is
symmetric about the mean, the area on each
side of the mean is 0.5.
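The table entries can also be computed rather than looked up. The sketch below uses the error function from the Python standard library to obtain the area between 0 and z. The function names are illustrative and are not part of the lecture.

```python
import math

def phi(z: float) -> float:
    """Cumulative standard normal probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def table_area(z: float) -> float:
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return phi(z) - 0.5

for z in (0.50, 1.00, 1.96, 3.09):
    print(f"z = {z:.2f}   area(0 to z) = {table_area(z):.4f}")
```

Because the curve is symmetric, the area between -z and 0 equals the area between 0 and z, so a table that starts at z = 0 is sufficient.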

Standard Normal Distribution 37
To compute the area under the curve between z = 0 (the mean, x = μ)
and any point x, we compute the value of
z = (x - μ) / σ
Corresponding to this z value, we obtain the area from
Table A.1.
Problem # 1
The diameter of shafts manufactured is normally
distributed with a mean of 3.0 cm and a standard
deviation of 0.009 cm. The shafts that are with 2.98 cm or
less diameter are scrapped and shafts with diameter
more than 3.02 cm are reworked. Determine the
percentage of shafts scrapped and percentage of
rework.
Standard Normal Distribution 38
Mean (μ) = 3.0 cm
Standard deviation (σ) = 0.009 cm
Upper limit, above which shafts are reworked (U or x1) = 3.02 cm
Lower limit, at or below which shafts are scrapped (L or x2) = 2.98 cm
Now let us determine the z values corresponding to U and L.

Standard Normal Distribution 39
z = (x - μ) / σ
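One way to work Problem # 1 numerically is sketched below. It standardizes the two limits and evaluates the normal tail areas with the standard-library error function instead of Table A.1, so the result may differ slightly from a table lookup at z rounded to 2.22. The helper name phi is illustrative.

```python
import math

def phi(z: float) -> float:
    """Cumulative standard normal probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 3.0, 0.009      # mean and standard deviation (cm), from the problem
lower, upper = 2.98, 3.02   # scrap and rework limits (cm), from the problem

z_lower = (lower - mu) / sigma  # about -2.22
z_upper = (upper - mu) / sigma  # about +2.22

scrap = phi(z_lower)          # P(X <= 2.98): fraction scrapped
rework = 1.0 - phi(z_upper)   # P(X > 3.02): fraction reworked

print(f"z_L = {z_lower:.2f}, z_U = {z_upper:.2f}")
print(f"scrapped: {100 * scrap:.2f} %   reworked: {100 * rework:.2f} %")
```

Both limits lie the same distance from the mean, so the two percentages are equal, each roughly 1.3 %.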

T-Distribution 41
A theoretical probability distribution that is similar
to a normal distribution.
The t-distribution is used to estimate probabilities
based on incomplete data or small samples.
It differs from a normal distribution in that it has an
additional parameter called degrees of freedom.
Degrees of freedom are the number of independent values
that are free to vary in the calculation of a statistic.

T-Distribution 42
The t-distribution, also known as Student’s t-distribution,
is a probability distribution that is used to estimate
population parameters when the sample size is small
and/or when the population variance is unknown.
It is similar to the normal distribution in some respects. The
t-distribution is also symmetric about the mean.
It is somewhat flatter than the normal curve.
As the sample size increases, the t-distribution approaches
the normal distribution.
The shape of the t-distribution curve depends on the
number of degrees of freedom.
The degrees of freedom for the t-distribution are the sample size
minus one.
The standard deviation of the t-distribution is always greater
than one (it is defined for more than two degrees of freedom).
The t-distribution has only one parameter, the
degrees of freedom.
T-Distribution 43
Suppose X1, X2, ..., Xn is a random sample from an N(μ, σ²)
distribution. The sample mean X̄ and sample variance S² computed
from this sample are independent, and the random variable
t = (X̄ - μ) / (S / √n)
has a t-distribution with n - 1 degrees of freedom.

T-Distribution & Z-Distribution 44
You must use the t-distribution table when the population
standard deviation (σ) is not known and the sample size
is small (n<30).
There are more scores in the tails in a t-distribution.
There are more scores in the center in a normal
distribution.
General Correct Rule:
If σ is not known, then using t-distribution is correct.
If σ is known, then using the normal distribution is
correct.
As the degrees of freedom increase, the t distribution
approaches the Standard Normal Distribution.
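The convergence of the t-distribution to the standard normal can be checked numerically. The sketch below evaluates the standard textbook density formulas at the center (0) and in the tail (3) for a few degrees of freedom. The function names and the chosen evaluation points are illustrative.

```python
import math

def t_pdf(t: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def normal_pdf(z: float) -> float:
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

print(f"{'df':>8}{'pdf at 0':>12}{'pdf at 3':>12}")
for df in (2, 5, 14, 30, 100):
    print(f"{df:>8}{t_pdf(0.0, df):>12.4f}{t_pdf(3.0, df):>12.4f}")
print(f"{'normal':>8}{normal_pdf(0.0):>12.4f}{normal_pdf(3.0):>12.4f}")
```

For small degrees of freedom the t density is lower at the center and higher in the tails than the normal density, and the gap narrows as the degrees of freedom grow.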

T-Distribution & Z-Distribution 45
Example: The CEO of a light bulb manufacturing company
claims that an average light bulb lasts 300 days. A
researcher randomly selects 15 bulbs for testing. The
sampled bulbs last an average of 290 days, with a standard
deviation of 50 days. If the CEO’s claim were true, what is
the probability that 15 randomly selected bulbs would have
an average life of no more than 290 days?
• The traditional approach requires you to compute the t
statistic, based on data presented in the problem
description.
• The first thing we need to do is to compute the t statistic,
based on the following equation:
t = (x̄ - μ) / (S / √n)
• where x̄ is the sample mean, μ is the population mean, S
is the standard deviation of the sample, and n is the sample
size.
T-Distribution & Z-Distribution 46
t = (x̄ - μ) / (S / √n)

T-Distribution & Z-Distribution 47
t = (x̄ - μ) / (S / √n) = (290 - 300) / (50 / √15) = -0.7745966

• Since we will work with the raw data, we select “Sample
mean” from the Random Variable dropdown box.
• The degrees of freedom = (n - 1) = 15 - 1 = 14.
• μ, Population mean = 300.
• x̄, Sample mean = 290.
• S, Sample Standard Deviation = 50.
• The cumulative probability: P(t ≤ -0.7746) = 1.00 - P(t ≤ 0.7746) = (1.00 - 0.774) = 0.226.
• Hence, if the true bulb life were 300 days, there is a 22.6%
chance that the average bulb life for 15 randomly selected
bulbs would be less than or equal to 290 days.