Unit - 1 PDF


Applied Statistics

Course Code:- 1150MA201


Unit-1 – Role of Statistics in Engineering,
Data Description and Representation

Contents
 Engineering Method and Statistical Thinking
 Collecting Engineering Data
• Basic Principles
• Retrospective Study
• Observational Study
• Designed Experiments
• Observing Processes over Time
 Mechanistic and Empirical Models
 Collection of Data
 Classification and Tabulation of Data
 Stem and Leaf Diagram
 Frequency Distribution and Histogram
 Box Plots
 Time Sequence Plots
 Probability Plots
Course outcome

Identify the role that statistics can play in the
engineering problem-solving process, discuss the
different methods that engineers use to collect data,
and construct and interpret visual data displays.
The Engineering Method and Statistical
Thinking

 An engineer is someone who solves problems of interest
to society with the efficient application of scientific
principles by:
• Refining existing products
• Designing new products or processes
The Creative Process in Engineering Method
Statistics Supports The Creative Process

 The field of statistics deals with the collection,
presentation, analysis, and use of data to:
• Make decisions
• Solve problems
• Design products and processes
 It is the science of learning information from data.
Statistics Supports The Creative Process
– Cont.

• Statistical techniques are useful for describing and
understanding variability.
• By variability, we mean successive observations of a
system or phenomenon do not produce exactly the
same result.
• Statistics gives us a framework for describing this
variability and for learning about potential sources of
variability.
Collecting Engineering Data

Three basic methods for collecting data:


 A retrospective study using historical data
- Data collected in the past for other purposes.
 An observational study
- Data, presently collected, by a passive observer.
 A designed experiment
- Data collected in response to process input
changes.
Observing Processes over Time

 Data are collected over time.


 It is usually very helpful to plot the data versus time in a time series
plot.
 Phenomena that might affect the system or process often become
more visible in a time-oriented plot and the concept of stability can
be better judged.
Mechanistic and Empirical Models

• A mechanistic model is built from our underlying knowledge
of the basic physical mechanism that relates several
variables.
Example: Ohm’s Law
Current = voltage/resistance
I = E/R
• Measured current rarely matches this value exactly, so a more
realistic model adds a random error term ε:
I = E/R + ε
• The form of the function is known.
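
The role of the error term in I = E/R + ε can be illustrated with a small simulation. This is only a sketch: the voltage, resistance, noise level, and the assumption that the error is normally distributed are hypothetical choices, not values from the course material.

```python
import random
import statistics

def measure_current(voltage, resistance, noise_sd, rng):
    """One simulated reading: the mechanistic value E/R plus a random error."""
    return voltage / resistance + rng.gauss(0.0, noise_sd)

rng = random.Random(1)      # fixed seed so the run is repeatable
E, R = 12.0, 4.0            # hypothetical voltage (volts) and resistance (ohms)
readings = [measure_current(E, R, 0.05, rng) for _ in range(200)]

print("model value E/R :", E / R)
print("mean of readings:", round(statistics.mean(readings), 3))
print("std of readings :", round(statistics.stdev(readings), 3))
```

Individual readings differ from one another, yet their average stays close to E/R, which is the sense in which the form of the function is known while the observations still vary.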
Mechanistic and Empirical Models

An empirical model is built from our engineering and
scientific knowledge of the phenomenon, but is not
directly developed from our theoretical or first-principles
understanding of the underlying mechanism.
The form of the function is not known a priori.
Mechanistic and Empirical Models

Example:-
• We are interested in the number average molecular weight (Mn) of a
polymer. We know that Mn is related to the viscosity of the material (V),
and that it also depends on the amount of catalyst (C) and the temperature (T) in
the polymerization reactor when the material is manufactured. The
relationship between Mn and these variables is
Mn = f(V, C, T)
where the form of the function f is unknown.
• We estimate the model from experimental data, for example in a
first-order (linear) form such as
Mn = b0 + b1 V + b2 C + b3 T + e
where the b’s are unknown parameters and e is a random error term.
Collection of Data

 Collection of data is the first important aspect of a
statistical survey.
 Data – Information which can be expressed in numbers.
 Two sources of data – Primary & Secondary.
 Primary data – data collected by the investigator personally.
 Secondary data – data collected by someone else and
then used by the investigator.
Difference between Primary and Secondary
Data

 Primary data is original data collected by the
investigator, while secondary data already exists
and is not original.
 Primary data is always collected for a specific purpose,
while secondary data has already been collected for
some other purpose.
 Primary data is more expensive to collect, whereas
secondary data is less expensive.
Methods / Sources of Collection of Primary
Data

 Direct Personal Interview


– Data is personally collected by the interviewer.
 Indirect Oral Investigation
– Data is collected from third parties who have
information about subject of enquiry.
 Information from correspondents
– Data is collected from agents appointed
in the area of investigation.
Methods / Sources of Collection of Primary
Data -Cont

 Mailed questionnaire
– Data is collected through questionnaire
[list of questions] mailed to the informant.
 Questionnaire filled by enumerators
– Data is collected by trained enumerators
who fill questionnaires.
 Telephonic interviews
– Data is collected through an interview conducted over
the telephone between the interviewer and the informant.
Difference Between Census & Sampling
Method

 Census Method
• Every unit of the population is studied
• More reliable and accurate results
• Expensive method
• Suitable when the population is heterogeneous
 Sampling Method
• Only a few units of the population are studied
• Less reliable and less accurate results
• Less expensive method
• Suitable when the population is homogeneous
Advantages and Disadvantages -
Mailed Questionnaire Method:

Advantages
 Less expensive
 Only method to reach remote areas
 Informants are not influenced by an interviewer

Disadvantages
 Long response time
 Cannot be used by illiterates
 Doubts regarding questions cannot be cleared
Advantages and Disadvantages -
Personal Interview Method

Advantages
 Highest response rate
 Allows all types of questions
 Allows clearing doubts regarding questions

Disadvantages
 Most expensive
 Informants can be influenced
 Takes more time
Advantages and Disadvantages -
Telephonic Interview Method:

Advantages
 Relatively low cost
 Relatively high response rate
 Less influence on informants

Disadvantages
 Limited use
 Reactions cannot be watched
 Respondents can be influenced
Methods / Sources of Collection of Secondary
Data

Published Source
- Government publications, Semi-government
publications etc.
Unpublished Source
- Census of India [They are collected by the
organizations for their own record]
Classification of Data

 Classification is the process of arranging data into sequences
and groups according to their common characteristics or
separating them into different but related parts.
(or)
 The process of grouping a large number of individual facts and
observations on the basis of similarity among the items is
called classification.
Objectives – Classification of Data

 It condenses the mass of data in an easily assimilable form.


 It eliminates unnecessary details.
 It facilitates comparison and highlights the significant aspect
of data.
 It enables one to get a mental picture of the information
and helps in drawing inferences.
 It helps in the statistical treatment of the information
collected.
Types of Classification

 Chronological classification:
- In chronological classification the collected data are
arranged according to the order of time expressed in
years, months, weeks, etc.,
Eg:- The estimates of birth rates in India during 1970 – 76 are

Year        1970  1971  1972  1973  1974  1975  1976
Birth rate  36.7  35.9  45.8  32.6  45.6  34.8  36.7
Types of Classification –Cont.

 Geographical classification:
- The data are classified according to geographical
region or place.
Eg:- The production of paddy in different states in Iraq,
production of wheat in different countries etc.

Country                      America  China  Denmark  France  Iraq
Yield of wheat (in kg/acre)     1924    893      225     439   862
Types of Classification –Cont.

 Qualitative classification:
- Data are classified on the basis of some attribute or
quality such as sex, literacy, religion, employment etc.
Such attributes cannot be measured on a
scale.
Eg:- If the population is to be classified with respect to one attribute,
say sex, then we can classify it into two groups, namely
males and females. Similarly, it can also be classified as
‘married’ or ‘single’ on the basis of another attribute, ‘marital
status’.
Types of Classification –Cont.

 Quantitative classification:
- It refers to the classification of data according to
some characteristics that can be measured such as
height, weight, etc.,
Eg:- A group of children may be classified according to weight

Weight (in kg)  No. of children
5 – 10          50
10 – 15         200
15 – 20         260
Tabulation of Data

 Tabulation is the process of summarizing classified or grouped
data in the form of a table so that it is easily understood and an
investigator is quickly able to locate the desired information.
 A table is a systematic arrangement of classified data in columns
and rows.
 A statistical table makes it possible for the investigator to present
a huge mass of data in a detailed and orderly form.
 It facilitates comparison and often reveals certain patterns in
data
Main Parts of a Table

• Title of the table – It is a brief explanation of contents of the table.


• Table Number – It is given to be used for the reference.
• Captions – A word or phrase that explains the contents of a column
of the table.
• Stubs – They explain the contents of the rows of the table.
• Body of the table – The most important part of the table, as it contains the data.
• Head Note – It is inserted to supplement the information given in the
title.
• Source Note – It refers to the source from which the information has been taken.
• Foot Note – It is used for pointing out exceptions to the data.
Format of Table

Table Number __________
Title ___________
[Head Note]

Stub caption | Sub head                  | Sub head                  | Total [Rows]
             | Column head | Column head | Column head | Column head |
Stub entries | Body of the table                                     |
Total [Columns]

Source Note:
Foot Note:
Basic Principles of Tabulation

 Tables should be clear, concise & adequately titled.


 Every table should be distinctly numbered for easy reference.
 Column headings & row headings of the table should be clear &
brief.
 Units of measurement should be specified at appropriate places.
 Explanatory footnotes concerning the table should be placed at
appropriate places.
 Source of information of data should be clearly indicated.
Basic Principles of Tabulation – cont.

 The columns & rows should be clearly separated with dark lines.
 Demarcation should also be made between data of one class and
that of another.
 Comparable data should be put side by side.
 The figures in percentage should be approximated before
tabulation.
 Figures, symbols etc. should be properly aligned
and adequately spaced to improve readability.
 Abbreviations should be avoided.
Representation of Data

 Stem and leaf diagram


 Frequency Distribution
 Bar Diagram
 Histogram
 Box plot
 Time sequence plot
 Probability plot
Representation of Data

 Stem and leaf diagram:-


 A stem-and-leaf plot is a graphical summary used to describe a set of
observations (as symmetric, skewed, etc.).
 Each observation is displayed on the graph and should have at least
two digits.
 Split each observation (at the same point) into a stem (one or more of
the leading digit(s)) and a leaf (remaining digits).
 Select the split point so that there are 5–20 total stems. List the stems in a
column to the left, and write each leaf in the corresponding stem row.
Problem-1

1. Use the data in the table to make a stem-and-leaf plot.

Data:
61 64 67
72 74 76 79
83 84 88

Step 1: Group the data by tens digits.
Step 2: Order the data from least to greatest.
Step 3: List the tens digits of the data in order
from least to greatest. Write these in
the “stems” column.
Step 4: For each tens digit, record the ones digits
of each data value in order from least to
greatest. Write these in the “leaves” column.

Problem-1 - cont.

Step 5: Title the graph and add a key.

Ans:-
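
Steps 1 to 5 can be sketched in a few lines of Python. The function name `stem_and_leaf` is a hypothetical helper introduced only for illustration, and the data are the ten values listed in Problem-1.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each tens-digit stem to its ones-digit leaves, in increasing order."""
    plot = defaultdict(list)
    for v in sorted(values):          # Step 2: order the data
        plot[v // 10].append(v % 10)  # Steps 1, 3, 4: stem = tens, leaf = ones
    return dict(plot)

data = [61, 64, 67, 72, 74, 76, 79, 83, 84, 88]
plot = stem_and_leaf(data)

print("Stem | Leaves")
for stem in sorted(plot):
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in plot[stem])}")
print("Key: 6 | 1 means 61")          # Step 5: add a key
```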
Problem-2

 The following are the numbers of text messages sent last week by
the cellular phone users on one floor of a college dormitory.
Display the data in a stem-and-leaf plot. What can you conclude?

155 159 144 129 105 145 126 116 130 114 122 112 112 142 126 118 118
108 122 121 109 140 126 119 113 117 118 109 109 119 139 139 122 78
133 126 123 145 121 134 124 119 132 133 124 129 112 126 148 147
Problem-2 – Cont.

Ans:-

Interpretation :- From the display, you can conclude that more than 50% of the
cellular phone users sent between 110 and 130 text messages.
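
The interpretation can be checked directly from the 50 observations. The sketch below prints one row per stem (hundreds and tens digits), including empty stems so that gaps remain visible, and then counts the users in stems 11 and 12, that is 110 to 129 messages. The variable names are illustrative only.

```python
from collections import Counter

messages = [
    155, 159, 144, 129, 105, 145, 126, 116, 130, 114, 122, 112, 112, 142, 126, 118, 118,
    108, 122, 121, 109, 140, 126, 119, 113, 117, 118, 109, 109, 119, 139, 139, 122, 78,
    133, 126, 123, 145, 121, 134, 124, 119, 132, 133, 124, 129, 112, 126, 148, 147,
]

# Number of observations on each stem (value with the ones digit removed).
stem_counts = Counter(m // 10 for m in messages)

for stem in range(min(stem_counts), max(stem_counts) + 1):
    leaves = sorted(m % 10 for m in messages if m // 10 == stem)
    print(f"{stem:>3} | {' '.join(map(str, leaves))}")

in_range = sum(1 for m in messages if 110 <= m < 130)
share = in_range / len(messages)
print(f"{in_range} of {len(messages)} users ({share:.0%}) sent 110-129 messages")
```

The empty stems between 7 and 10 also make the single low value, 78, stand out as a possible outlier.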
Frequency distribution

 A frequency distribution is a tabular method for summarizing continuous or
discrete numerical data or categorical data.
 Partition the measurement axis into 5–20 (usually equal) reasonable
subintervals called classes, or class intervals. Thus, each observation falls
into exactly one class.
 Record, or tally, the number of observations in each class, called the
frequency of each class.
 Compute the proportion of observations in each class, called the relative
frequency.
 Compute the proportion of observations in each class and all preceding
classes, called the cumulative relative frequency.
Problem-1

 Construct a frequency distribution for the following Ticket Data


Ticket data: Forty random speeding tickets were selected from
the court house records in Columbia County. The
speed indicated on each ticket is given in the table
below.

58 72 64 65 67 92 55 51 69 73 64 59 65 55 75 56
89 60 84 68 74 67 55 68 74 43 67 71 72 66 62 63
83 64 51 63 49 78 65 75
Problem-1-cont.
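
One possible frequency distribution for the ticket data is sketched below. The choice of six classes of width 10 starting at 40 is an assumption made for illustration; other reasonable class choices would give a different table.

```python
speeds = [
    58, 72, 64, 65, 67, 92, 55, 51, 69, 73, 64, 59, 65, 55, 75, 56,
    89, 60, 84, 68, 74, 67, 55, 68, 74, 43, 67, 71, 72, 66, 62, 63,
    83, 64, 51, 63, 49, 78, 65, 75,
]
LOWER, WIDTH, N_CLASSES = 40, 10, 6   # assumed classes: 40-49, 50-59, ..., 90-99

# Tally the number of observations in each class.
freq = [0] * N_CLASSES
for s in speeds:
    freq[(s - LOWER) // WIDTH] += 1

n = len(speeds)
cumulative = 0
rows = []   # (lower limit, upper limit, frequency, relative, cumulative relative)
print("Class   Freq  Rel.freq  Cum.rel.freq")
for k, f in enumerate(freq):
    cumulative += f
    lo = LOWER + k * WIDTH
    rows.append((lo, lo + WIDTH - 1, f, f / n, cumulative / n))
    print(f"{lo}-{lo + WIDTH - 1}  {f:>5}  {f / n:>8.3f}  {cumulative / n:>12.3f}")
```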
Histogram

 A histogram is a graphical representation of a frequency
distribution.
 A (relative) frequency histogram is a plot of (relative) frequency
versus class interval.
 Rectangles are constructed over each class with height
proportional (usually equal) to the class (relative) frequency.
 A frequency and relative frequency histogram have the same
shape, but different scales on the vertical axis.
Histogram-Model

Step 1: Choose an appropriate scale and
interval.
Step 2: Draw a bar for the number of
students in each interval. The bars should
touch but not overlap.
Step 3: Title the graph and label the axes.

[Figure: histogram “Number of Pages Read per Student Last Weekend”; horizontal axis: Number of Pages (1-10, 11-20, 21-30, 31-40); vertical axis: Students]
Problem-1

 Construct a frequency histogram for the following Ticket Data


Ticket data: Forty random speeding tickets were selected from
the court house records in Columbia County. The
speed indicated on each ticket is given in the table
below.

58 72 64 65 67 92 55 51 69 73 64 59 65 55 75 56
89 60 84 68 74 67 55 68 74 43 67 71 72 66 62 63
83 64 51 63 49 78 65 75
Problem-1- Cont.
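
A rough text version of the frequency histogram can be produced without any plotting library. The class boundaries 40, 50, ..., 100 are an assumed choice, and each `#` stands for one ticket.

```python
speeds = [
    58, 72, 64, 65, 67, 92, 55, 51, 69, 73, 64, 59, 65, 55, 75, 56,
    89, 60, 84, 68, 74, 67, 55, 68, 74, 43, 67, 71, 72, 66, 62, 63,
    83, 64, 51, 63, 49, 78, 65, 75,
]
edges = list(range(40, 101, 10))   # assumed class boundaries 40, 50, ..., 100

# Bar height for each class [lo, hi) is the class frequency.
counts = [sum(lo <= s < hi for s in speeds) for lo, hi in zip(edges, edges[1:])]

print("Speeds on 40 tickets")
for (lo, hi), c in zip(zip(edges, edges[1:]), counts):
    print(f"{lo:>3}-{hi - 1:<3} | {'#' * c} {c}")
```

Dividing each count by 40 would give the relative frequency histogram, which has the same shape on a different vertical scale.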
Frequency polygons

 A frequency polygon is a line plot of points with x
coordinate being class midpoint and y coordinate
being class frequency.
 Often the graph extends to an additional empty class
on both ends.
 The relative frequency may be used in place of
frequency.
Problem-1

 Construct a frequency polygon for the following Ticket Data


Ticket data: Forty random speeding tickets were selected from
the court house records in Columbia County. The
speed indicated on each ticket is given in the table
below.

58 72 64 65 67 92 55 51 69 73 64 59 65 55 75 56
89 60 84 68 74 67 55 68 74 43 67 71 72 66 62 63
83 64 51 63 49 78 65 75
Problem-1- Cont.

Ans:-
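
The points of a frequency polygon for the ticket data can be computed as below, assuming classes with boundaries 40, 50, ..., 100 (so the midpoints are 45, 55, ..., 95) and one empty class added at each end. Joining the printed points with straight lines gives the polygon.

```python
speeds = [
    58, 72, 64, 65, 67, 92, 55, 51, 69, 73, 64, 59, 65, 55, 75, 56,
    89, 60, 84, 68, 74, 67, 55, 68, 74, 43, 67, 71, 72, 66, 62, 63,
    83, 64, 51, 63, 49, 78, 65, 75,
]
LOWER, WIDTH, N_CLASSES = 40, 10, 6   # assumed class boundaries 40, 50, ..., 100

freq = [0] * N_CLASSES
for s in speeds:
    freq[(s - LOWER) // WIDTH] += 1

# Add an empty class before the first and after the last class.
padded = [0] + freq + [0]
midpoints = [LOWER - WIDTH / 2 + k * WIDTH for k in range(N_CLASSES + 2)]
points = list(zip(midpoints, padded))

for x, y in points:
    print(f"midpoint {x:5.1f}  frequency {y}")
```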
Ogive (or) Cumulative frequency polygon

 An ogive, or cumulative frequency polygon, is a plot of
cumulative frequency versus the upper class limit.
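
As an illustration, the ogive points for the speeding-ticket data used in the earlier problems can be computed as follows. The class boundaries 40, 50, ..., 100 are an assumed choice, and the starting point at the lowest boundary with cumulative frequency 0 is a common plotting convention rather than part of the definition above.

```python
from itertools import accumulate

speeds = [
    58, 72, 64, 65, 67, 92, 55, 51, 69, 73, 64, 59, 65, 55, 75, 56,
    89, 60, 84, 68, 74, 67, 55, 68, 74, 43, 67, 71, 72, 66, 62, 63,
    83, 64, 51, 63, 49, 78, 65, 75,
]
edges = list(range(40, 101, 10))   # assumed class boundaries

freq = [sum(lo <= s < hi for s in speeds) for lo, hi in zip(edges, edges[1:])]
cum_freq = list(accumulate(freq))  # running total of the class frequencies

# Pair each upper class boundary with the cumulative frequency up to it.
ogive = [(edges[0], 0)] + list(zip(edges[1:], cum_freq))
for upper, cf in ogive:
    print(f"upper boundary {upper:>3}  cumulative frequency {cf}")
```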
Box and Whisker Plots

 A box-and-whisker plot uses a number line to show the distribution
of a set of data.
 Box plots are useful for comparing two or more sets of data like
heights of boys and girls in a class.
 A boxplot is a graphical display of the five-number summary.
 To make a box-and-whisker plot, first divide the data into four
equal parts using quartiles. The median, or middle quartile,
divides the data into a lower half and an upper half. The median
of the lower half is the lower quartile, and the median of the
upper half is the upper quartile.
Quartiles

 Quartiles split the data into four parts. For ungrouped data, arrange
the observations in order from smallest to largest.
 The second quartile is the median: Q2 = x̃.
 If n is even: The first quartile, Q1, is the median of the smallest n/2
observations; and the third quartile, Q3, is the median of the largest
n/2 observations.
 If n is odd: The first quartile, Q1, is the median of the smallest (n +1)/2
observations; and the third quartile, Q3, is the median of the largest
(n+1)/2 observations.
Quartiles

 For grouped data:-


 L1 = the lower boundary of the class containing Q1.
 L3 = the lower boundary of the class containing Q3.
 f1 = the frequency of the class containing the first quartile.
 f3 = the frequency of the class containing the third quartile.
 CF1 = cumulative frequency for classes below the one containing Q1.
 CF3 = cumulative frequency for classes below the one containing Q3.
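
With these symbols, grouped-data quartiles are usually obtained by linear interpolation inside the quartile class. The standard form is sketched below; n (the total number of observations) and h (the class width) are symbols assumed here, since they are not defined in the list above.

```latex
Q_1 = L_1 + \frac{\tfrac{n}{4} - CF_1}{f_1}\, h,
\qquad
Q_3 = L_3 + \frac{\tfrac{3n}{4} - CF_3}{f_3}\, h
```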
Problem-1

 Use the data to make a box-and-whisker plot.

73 67 75 81 67 75 85 69
Problem-1-Cont.

 Step 1: Order the data from least to greatest. Then find the least
and greatest values, the median, and the lower and upper
quartiles.
 Step 2: Draw a number line. Above the number line, plot points
for each value in Step 1.
 Step 3: Draw a box from the lower to the upper quartile. Inside
the box, draw a vertical line through the median. Then draw the
“whiskers” from the box to the least and greatest values.
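
The values needed in Step 1 can be computed with a short script. It is a sketch that follows the median-of-halves quartile rule for ungrouped data described earlier; `five_number_summary` is a hypothetical helper name, and other software may use slightly different quartile conventions.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    x = sorted(values)
    n = len(x)
    half = (n + 1) // 2   # n/2 values when n is even, (n + 1)/2 when n is odd
    return (
        x[0],
        statistics.median(x[:half]),    # Q1: median of the smallest half
        statistics.median(x),           # Q2: median of all the data
        statistics.median(x[-half:]),   # Q3: median of the largest half
        x[-1],
    )

data = [73, 67, 75, 81, 67, 75, 85, 69]
summary = five_number_summary(data)
print(dict(zip(["min", "Q1", "median", "Q3", "max"], summary)))
```

The box then runs from Q1 to Q3 with a line at the median, and the whiskers extend to the least and greatest values.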
Problem-1-Cont.
Problem-1 –Cont.
Problem-2:
Comparing Box-and-Whisker Plots
Problem-2 – Cont.
Ans:-
Time Sequence Plots

 A time series measures the same phenomenon at equal intervals of
time.
Components of time series

 Trend : underlying long-term movement


 Cycle : medium-term cyclical movements about the
trend
 Seasonal (S) : factors that occur one or more times per year.
Stable in size and direction from year to year.
 Irregular (I) : residual after other components have been
removed. Should exhibit no pattern.
 We combine the trend and the cycle to form trend cycle (C),
but refer to this as the “trend”.
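
One simple way to separate the trend from the other components is a centered moving average taken over one full seasonal period. This technique is not named in the slides and is shown only as a sketch; the quarterly series below is hypothetical data constructed for illustration.

```python
def centered_moving_average(series, period=4):
    """Estimate the trend-cycle with a centered moving average (even period)."""
    half = period // 2
    trend = {}
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # The two end values get half weight so the window stays centered on t.
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend

sales = [12, 20, 16, 8, 14, 22, 18, 10, 16, 24, 20, 12]  # hypothetical quarterly data
trend = centered_moving_average(sales)

for t, value in trend.items():
    print(f"t={t:>2}  observed={sales[t]:>3}  trend={value:6.2f}  "
          f"observed-trend={sales[t] - value:6.2f}")
```

The difference between the observed value and the trend contains the seasonal and irregular components; in this constructed series it repeats every four quarters.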
[Figures: Time Series – Trend, Time Series – Seasonal, and Time Series – Irregular graphs]
Probability Plot

 It is a graphical technique for assessing whether or not a data
set follows a given distribution such as the normal or Weibull.
 The data are plotted against a theoretical distribution in such a way
that the points should form approximately a straight line if the data
follow that distribution.
 It is a graphical method for determining whether sample data
conform to a hypothesized distribution based on a subjective
visual examination of the data.
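
The coordinates of a normal probability plot can be sketched as follows. The plotting position (i - 0.5)/n is one common convention and is an assumption here, and the sample is simply the eight values from the box plot problem, reused for illustration.

```python
from statistics import NormalDist

def normal_plot_points(sample):
    """Pair each ordered observation with a standard normal quantile."""
    x = sorted(sample)
    n = len(x)
    # Plotting position (i - 0.5)/n for the i-th smallest observation.
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, x))

sample = [73, 67, 75, 81, 67, 75, 85, 69]
points = normal_plot_points(sample)

for z, x in points:
    print(f"normal quantile {z:6.3f}   observation {x}")
```

Plotting the observations against the normal quantiles and judging by eye whether the points fall near a straight line is the subjective visual examination described above.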
