Chapter - 1
1.1 Definition of Statistics and Classification of Statistics
A. Definition of Statistics
Statistics can be defined in two senses: plural (as Statistical Data) and singular (as Statistical
Methods).
Plural sense: Statistics are collections of facts (figures). This meaning of the word is widely used
when reference is made to facts and figures on sales, employment or unemployment, accidents,
weather, deaths, education, etc. E.g.: Sales Statistics, Labor Statistics, Employment Statistics, etc.
In this sense the word Statistics simply means data. But not all numerical data are statistics.
Singular sense: Statistics is the science that deals with the methods of data collection,
organization, presentation, analysis and interpretation of data. It refers to the subject area that is
concerned with extracting relevant information from available data with the aim of making sound
decisions. According to this meaning, statistics is concerned with the development and
application of methods and techniques for collecting, organizing, presenting, analyzing and
interpreting statistical data.
B. Classification of Statistics
Based on the scope of the decision, statistics can be classified into two branches: Descriptive and
Inferential Statistics.
Descriptive Statistics refers to the procedures used to organize and summarize masses of data.
It is concerned with describing or summarizing the most important features of the data. It deals
only with the characteristics of the collected data, without attempting to infer (conclude)
anything that goes beyond the data themselves. The methodology of descriptive
statistics includes the methods of organizing (classification, tabulation, Frequency Distributions)
and presenting (Graphical and Diagrammatic Presentation) data and calculations of certain
indicators of data like Measures of Central Tendency and Measures of Dispersion (Variation)
which summarize some important features of the data.
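As a hedged illustration of the indicators named above, the following Python sketch computes two measures of central tendency and two measures of dispersion using only the standard library. The data values are hypothetical and were chosen only for illustration.

```python
import statistics

# Hypothetical sample: compressive strength (MPa) of 8 concrete cubes
strengths = [24.5, 26.0, 25.2, 27.8, 24.9, 26.4, 25.7, 26.3]

mean_strength = statistics.mean(strengths)      # measure of central tendency
median_strength = statistics.median(strengths)  # measure of central tendency
data_range = max(strengths) - min(strengths)    # measure of dispersion
std_dev = statistics.stdev(strengths)           # sample standard deviation

print(f"mean = {mean_strength:.2f}, median = {median_strength:.2f}")
print(f"range = {data_range:.2f}, standard deviation = {std_dev:.2f}")
```

These four numbers describe only the eight observations themselves; nothing is inferred about other cubes, which is what keeps the analysis descriptive.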
Inferential (Inductive) Statistics includes the methods used to find out something about a
population, based on a sample. It is concerned with drawing statistically valid conclusions
about the characteristics of the population based on information obtained from a sample. In this
form of statistical analysis, descriptive statistics is linked with probability theory in order to
generalize the results of the sample to the population. Performing hypothesis testing, determining
relationships between variables and making predictions are also inferential statistics.
Introduction to probability and Statistics for Civil Engineering
1. Collection of Data: This is the first stage in any statistical investigation and involves the
process of obtaining (gathering) a set of related measurements or counts to meet
predetermined objectives. The data collected may be primary data (data collected directly by
the investigator) or it may be secondary data (data obtained from intermediate sources such
as newspapers, journals, official records, etc.).
2. Organization of Data: It is usually not possible to derive any conclusion about the main
features of the data from direct inspection of the observations. The second purpose of
statistics is describing the properties of the data in a summary form. This stage of statistical
investigation helps to have a clear understanding of the information gathered and includes
editing (correcting), classifying and tabulating the collected data in a systematic manner.
Thus the first step in the organization of data is editing. It means correcting (adjusting)
omissions, inconsistencies, irrelevant answers and wrong computations in the collected data.
The second step of the organization of data is classification, that is, arranging the collected
data according to some common characteristics. The last step of the organization of data is
presenting the classified data in tabular form, using rows and columns (tabulation).
3. Presentation of Data: The purpose of data presentation is to give an overview of what the data
actually look like, and to facilitate statistical analysis. Data presentation can be done using
graphs and diagrams, which are easy to remember and facilitate comparison.
4. Analysis of Data: The analysis of data is the extraction of summarized and comprehensive
numerical description in order to reach conclusions or provide answers to a problem. The
problem may require simple or sophisticated mathematical expressions.
5. Interpretation of Data: This is the last stage of statistical investigation. Interpretation
involves drawing conclusions from the data collected and analyzed in order to make decisions.
Population: A population is the totality of things, objects, people, etc. about which information
is being sought.
Sample: A sample is a subset or part of a population selected to draw conclusions about the
population.
Census survey: It is the process of examining the entire population. It is the total count of the
population.
Parameter: It is a population measurement used to describe the population. Example: population
mean and population standard deviation.
Statistic: It is a measure used to describe the sample. It is a value computed from the sample.
Sampling frame: A list of people, items or units from which the sample is taken.
Data: A collection of related facts and figures from which conclusions may be drawn.
Variable: A certain characteristic which changes from object to object and from time to time.
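The relationship between these terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated, and its size (200), the sample size (20) and the seed are arbitrary choices that do not come from the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: daily traffic counts on 200 road sections
population = [random.randint(500, 1500) for _ in range(200)]

# Parameter: a value computed from the entire population (a census)
population_mean = statistics.mean(population)

# Sample: a subset of the population, here 20 units drawn without replacement
sample = random.sample(population, 20)

# Statistic: a value computed from the sample only
sample_mean = statistics.mean(sample)

print(f"parameter (population mean) = {population_mean:.1f}")
print(f"statistic (sample mean)     = {sample_mean:.1f}")
```

The two means are usually close but not equal; in practice only the statistic is available, and it is used to estimate the unknown parameter.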
For Decision Making: statistics helps to enhance the power of decision making in the
face of uncertainty by providing sufficient information.
Reliability Engineering: the study of the ability of a system or component to perform
its required functions under stated conditions for a specified period of time.
The application of probability theory, which includes mathematical tools for dealing with
large populations, to the field of mechanics, which is concerned with the motion of
particles or objects when subjected to a force.
The field of statistics deals with the collection, presentation, analysis, and use of data to
make decisions, solve problems, and design products and processes. It is the
science of learning from data.
1. Design of Experiments (DOE) uses statistical techniques to test and construct models of
engineering components and systems.
2. Quality control and process control use statistics as a tool to manage conformance to
specifications of manufacturing processes and their products.
3. Time and methods engineering uses statistics to study repetitive operations in manufacturing
in order to set standards and find optimum (in some sense) manufacturing procedures.
4. Reliability engineering uses statistics to measure the ability of a system to perform its
intended function (for its intended time) and has tools for improving performance.
5. Probabilistic design applies probability and statistics to product and system design.
6. Every structural design, safety factor, hydrological analysis and mechanical analysis, and
even the choice of materials, is based on statistics. The results of the analysis are projected
to other conditions, together with the probability that those conditions occur together
(for example, earthquake, wind and maximum load, or peak flow and rain).
7. Condenses and summarizes masses of data and presents facts in numerical and definite form
8. Facilitates comparison: statistical devices such as averages, percentages, ratios, etc. are used
for this purpose.
9. Formulating and testing hypotheses
10. Forecasting: Statistical methods help in studying past data and predicting future trends.
Limitations of Statistics
It cannot deal with a single observation; rather, it deals with aggregates of facts.
Statistical methods are not applicable to qualitative characteristics; they deal with quantitative
characteristics.
Statistical results are true on average, i.e. for the majority of cases. Laws of statistics are not
universally true like the laws of physics, chemistry and mathematics.
Ex: Classify each of the following as Qualitative and Quantitative and if it is quantitative classify
as Discrete and Continuous.
Based on the numbers on the shirts, it is not possible to judge whether Mr. B plays better. But by
using the test scores, it is possible to judge that Mr. B did better in the exam. Also, it is not
possible to find a meaningful average shirt number, because the numbers on the shirts are simply
codes, but it is possible to obtain the average test score.
Nominal Scales of variables are those qualitative variables which show the category of
individuals. They reflect classification into categories (names of groups) where there is no
particular order or qualitative difference between the labels. Numbers may be assigned to the
variables simply for coding purposes. It is not possible to compare individuals based on the
numbers assigned to them. The only mathematical operation permissible on these variables is
counting.
These variables
Have mutually exclusive (non-overlapping) and exhaustive categories.
No ranking or order between (among) the values of the variable.
Example: Gender, Religion, ID No, Ethnicity, Color
Ordinal Scales of variables are those qualitative variables whose values can be ordered
and ranked. Ranking and counting are the only mathematical operations that can be done on the
values of the variables. But there is no precise difference between the values (categories) of
the variable.
Eg: Academic qualifications (B.Sc., M.Sc., Ph.D), Strength (very weak, weak, strong, very
strong), Health status (very sick, sick, cured)
Interval Scales of variables are those quantitative variables for which a value of zero does
not indicate absence of the characteristic, i.e. there is no true zero. Zero indicates
a low value rather than nothing. There is a precise difference between the units of measurement (levels).
Eg: temperature; 0 °C does not mean there is no temperature, only that it is cold.
Ratio Scales of variables are those quantitative variables for which a value of zero indicates
absence of the characteristic, i.e. there is a true zero.
Eg: Height, Weight, Income, Amount of yield, Expenditure, Consumption.
All mathematical operations can be performed on the values of the variables.
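The practical difference between the interval and ratio scales can be checked numerically. The sketch below is an illustration, not part of the original notes: it compares two Celsius readings (interval scale) with the same readings on the kelvin scale, which has a true zero. The kelvin conversion and the two temperatures are additions made here for the example.

```python
def celsius_to_kelvin(t_celsius):
    """Convert a Celsius reading (interval scale) to kelvin (ratio scale)."""
    return t_celsius + 273.15

t1_c, t2_c = 10.0, 20.0  # two hypothetical temperature readings in °C

# Differences are meaningful on an interval scale and agree on both scales
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# The ratio of Celsius readings is not meaningful: 20 °C is not "twice as hot"
naive_ratio = t2_c / t1_c

# On the kelvin scale the ratio is meaningful because zero means absence
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(f"difference: {diff_c:.1f} °C = {diff_k:.1f} K")
print(f"naive ratio = {naive_ratio:.2f}, ratio in kelvin = {true_ratio:.4f}")
```

The difference of 10 degrees is the same on both scales, while the ratio changes from 2 to about 1.04 once a true zero is used.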
Based on the source, data can be classified into two: Primary Data and Secondary Data.
In primary data collection, you collect the data yourself using methods such as interviews,
observations, laboratory experiments and questionnaires. The key point here is that the data you
collect is unique to you and your research and, until you publish, no one else has access to it.
There are many methods of collecting primary data and the main methods include:
Questionnaire: It is a popular means of collecting data, but is difficult to design and often
requires many rewrites before an acceptable questionnaire is produced.
Observation: It involves recording the behavioral patterns of people, objects and events in a
systematic manner.
Diaries: A diary is a way of gathering information about the way individuals spend their time
on professional activities. Diaries in this sense are not records of engagements or personal
journals of thought! Diaries can record either quantitative or qualitative data, and in management
research can provide information about work patterns and activities.
Laboratory experiment: Conducting laboratory experiments in fields such as the chemical and
biological sciences.
NGOs, etc.) or for some other purpose than the one currently being considered, or often a
combination of the two.
Some of the sources of secondary data are government documents, official statistics, technical
reports, scholarly journals, trade journals, review articles, reference books, research institutes,
universities, hospitals, libraries, library search engines, computerized databases and the World
Wide Web.
So far you know how to collect data. What do we do with the collected data next? The collected
data, also known as raw data, are always in an unorganized form and need to be organized and
presented in a meaningful and readily comprehensible form in order to facilitate further
statistical analysis. This chapter introduces tabular and graphical methods commonly used to
summarize both qualitative and quantitative data. Tabular and graphical summaries of data can
be found in annual reports, newspaper articles and research studies. Everyone is exposed to
these types of presentations, so it is important to understand how they are prepared and how they
should be interpreted. Modern statistical software packages provide extensive capabilities for
summarizing data and preparing graphical presentations.
Class: is a description of a group of similar numbers in a data set.
Frequency: is the number of times a variable value is repeated.
Class frequency: the number of observations belonging to a certain class.
There are three types of frequency distributions: categorical, ungrouped (discrete or frequency
array) and grouped (continuous) frequency distributions.
1. Categorical FD: an FD in which the data are qualitative, i.e. either nominal or ordinal. Each
category of the variable represents a single class and the number of times each category repeats
represents the frequency of that class (category).
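A categorical FD can be built by simply counting how often each category occurs. The sketch below uses `collections.Counter` from the Python standard library; the ten responses are hypothetical.

```python
from collections import Counter

# Hypothetical responses to a marital-status question (a nominal variable)
responses = ["Single", "Married", "Single", "Divorced", "Married",
             "Single", "Married", "Married", "Single", "Single"]

# Each category is one class; its count is the class frequency
frequency = Counter(responses)

for category, count in frequency.most_common():
    print(f"{category:<10}{count}")
```

Counting is the only arithmetic used here, which is consistent with what is permissible for nominal data.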
Grouped (Continuous) FD: - A FD of numerical data in which several values of a variable are
grouped into one class. The number of observations belonging to the class is the frequency of the
class.
Class Limits: The lowest and highest values that can be included in a class are called Class
Limits. The lowest values are called Lower Class Limits (LCL) and the highest values are called
Upper Class Limits (UCL).
Class Boundaries: the class limits adjusted so that there is no gap between the UCL of one class
and the LCL of the next class. The lowest values are called Lower Class Boundaries (LCB) and the
highest values are called Upper Class Boundaries (UCB).
Class Width (Class Size): the difference between the UCB and LCB of a class. It is also the
difference between the lower limits of two consecutive classes, or the difference between the
upper limits of two consecutive classes.
Class Mark (Class Midpoint): the point halfway between the class limits or the class boundaries.
Relative frequency: the ratio of class frequency to the total frequency (total number of
observations).
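The definitions of class width, class mark and relative frequency can be checked with a small sketch. The class limits and frequencies below are hypothetical, and the gap `U` between the UCL of one class and the LCL of the next is assumed to be 1.

```python
# Hypothetical grouped FD: (LCL, UCL) pairs and their class frequencies
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]
frequencies = [4, 10, 8, 3]
U = 1  # assumed gap between the UCL of one class and the LCL of the next

total = sum(frequencies)  # total number of observations
rows = []
for (lcl, ucl), f in zip(class_limits, frequencies):
    lcb, ucb = lcl - U / 2, ucl + U / 2  # class boundaries close the gap
    width = ucb - lcb                    # class width = UCB - LCB
    mark = (lcl + ucl) / 2               # class mark (midpoint)
    rel_freq = f / total                 # relative frequency
    rows.append((width, mark, rel_freq))
    print(f"{lcl}-{ucl}: width={width}, mark={mark}, rel. freq={rel_freq:.2f}")
```

The width (10) also equals the difference between consecutive lower limits (20 - 10), and the relative frequencies add up to 1.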
Cumulative frequency: the sum of frequencies (total number of observations) below or above
a certain value.
Less than Cumulative Frequency: the total number of values of a variable below a certain
UCB.
More than Cumulative Frequency: the total number of values of a variable above a certain
LCB.
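Both kinds of cumulative frequency are running totals of the class frequencies, taken from opposite ends of the table. A minimal sketch with hypothetical frequencies and boundaries:

```python
from itertools import accumulate

# Hypothetical grouped FD with four classes
lower_boundaries = [9.5, 19.5, 29.5, 39.5]
upper_boundaries = [19.5, 29.5, 39.5, 49.5]
frequencies = [4, 10, 8, 3]

# Less than CF: running total from the first class, read against each UCB
less_than_cf = list(accumulate(frequencies))

# More than CF: running total from the last class, read against each LCB
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

for ucb, cf in zip(upper_boundaries, less_than_cf):
    print(f"less than {ucb}: {cf}")
for lcb, cf in zip(lower_boundaries, more_than_cf):
    print(f"more than {lcb}: {cf}")
```

The last less-than value and the first more-than value both equal the total number of observations (25 here).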
6. Put the smallest value of the data set as the LCL of the first class. To obtain the LCL of
the second class add the class width W to the LCL of the first class. Continue adding
until you get K classes.
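Let $X$ be the smallest observation
$\mathrm{LCL}_1 = X$
$\mathrm{LCL}_i = \mathrm{LCL}_{i-1} + W$ for $i = 2, 3, \dots, K$.
7. Obtain the UCLs of the FD by adding W-U to the corresponding LCLs.
$\mathrm{UCL}_i = \mathrm{LCL}_i + (W - U)$ for $i = 1, 2, \dots, K$.
8. Generate the class boundaries.
$\mathrm{LCB}_i = \mathrm{LCL}_i - \tfrac{1}{2}U$ and $\mathrm{UCB}_i = \mathrm{UCL}_i + \tfrac{1}{2}U$ for $i = 1, 2, \dots, K$.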
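Steps 6 to 8 can be sketched as a short Python function. The function name and the input values (smallest observation 3, class width W = 5, number of classes K = 7, and U = 1) are assumptions made for illustration; the steps that determine K and W are not reproduced in this section.

```python
def class_limits_and_boundaries(x_min, W, K, U):
    """Return (LCLs, UCLs, LCBs, UCBs) for K classes of width W (hypothetical helper)."""
    # Step 6: LCL_1 = x_min, LCL_i = LCL_(i-1) + W
    lcl = [x_min + i * W for i in range(K)]
    # Step 7: UCL_i = LCL_i + (W - U)
    ucl = [low + (W - U) for low in lcl]
    # Step 8: LCB_i = LCL_i - U/2 and UCB_i = UCL_i + U/2
    lcb = [low - U / 2 for low in lcl]
    ucb = [high + U / 2 for high in ucl]
    return lcl, ucl, lcb, ucb


# Assumed inputs, for illustration only
lcl, ucl, lcb, ucb = class_limits_and_boundaries(x_min=3, W=5, K=7, U=1)

for row in zip(lcl, ucl, lcb, ucb):
    print("limits {}-{}, boundaries {}-{}".format(*row))
```

With these inputs the first class is 3-7 with boundaries 2.5-7.5, and each upper boundary coincides with the next lower boundary, so no gaps remain.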
16 21 26 24 11 17 25 26 13 27 24 26 3 27 23 24 15 22 22 12 22 29 18 22 28 25 7
17 22 28 19 23 23 22 3 19 13 31 23 28 24 9 20 33 30 23 20 8 21 24
Solution
Exercise: In a survey, the ages of 44 women at marriage were reported as follows. Construct the
appropriate FD for these data.
24 25 27 26 22 23 24 25 24 23 26 28 24 25 23 24 25 25 25 22 27 28
27 24 25 24 25 28 26 25 24 28 24 25 25 24 25 24 26 27 27 25 28 26
1. Histogram: A graph in which the classes are marked on the X axis (horizontal axis) and
the frequencies are marked along the Y axis (vertical axis).
The height of each bar represents the class frequencies and the width
of the bar represents the class width.
The bars are drawn adjacent to each other.
2. Frequency Polygon: A graph that consists of line segments connecting the
points where the class marks and the frequencies intersect.
It can be constructed from a histogram by joining the mid-points of the tops of the
bars.
3. Frequency curve: a smooth freehand curve drawn through the frequency polygon.
Diagrams
1. Bar Diagram: It is the simplest and most commonly used diagrammatic
representation of a frequency distribution. It is appropriate for presenting Qualitative Data
(nominal/ordinal). It uses a series of separated and equally spaced bars in which the
width of the bars is constant and the height of each bar corresponds to the frequency of the
category.
1.1 Simple Bar Diagram: is a diagram in which categories of a variable are
marked on the X axis and the frequencies of the categories are marked on the Y
axis.
It is applicable for discrete variables, that is, for data given according to some
period, places and timings. These periods and timings are represented on the
base line (X-axis) at regular interval and the corresponding frequencies are
represented on the Y-axis.
The width of the rectangle represents nothing (it is meaningless), but it
should be equal for all rectangles.
Each rectangle is separated by an equal space.
It can also represent some magnitude (on the Y axis) over time, space,
groups, etc. (on the X axis).
Example1:
[Figure: simple bar diagram of frequency (Y axis, 0 to 100) against marital status (X axis: Single, Married, Divorced).]
Example2:
1.2 Component Bar Diagram: is used when there is a desire to show how a total or
aggregate is divided into its component parts. The bars represent the total value of
a variable, with each total broken into its component parts, and different colors
are used for identification. In this type of diagram, a bar is subdivided into
parts in proportion to the size of each subdivision. These subdivided rectangles
are shaded differently by lines, dots and colors so that the components are easy to
compare.
Sometimes the volumes of different attributes may be greatly different. To
make meaningful comparisons, the components of the attributes are reduced
to percentages. In that case each attribute will have 100 as its maximum
volume. This sort of component bar diagram is known as a percentage bar
diagram.
Each rectangle represents total value of a variable and is broken into its
component parts.
Example
Marital Status Male Female Total
Single 90 10 100
Married 30 40 70
Divorced 1 29 30
[Figure: component bar diagram with bars for Male, Female and Total (X axis), each subdivided into Single, Married and Divorced; vertical scale 0 to 250.]
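The heights needed for a percentage bar diagram can be computed from the counts in the example table. The sketch below assumes one bar per gender, as in the figure, and reduces each component to a percentage of its bar total; the variable names are illustrative.

```python
# Counts from the example table: rows are marital status, columns are gender
counts = {
    "Single":   {"Male": 90, "Female": 10},
    "Married":  {"Male": 30, "Female": 40},
    "Divorced": {"Male": 1,  "Female": 29},
}

percentages = {}
for gender in ("Male", "Female"):
    # Total height of the bar for this gender before conversion
    column_total = sum(row[gender] for row in counts.values())
    # Each component as a percentage of the bar total
    percentages[gender] = {
        status: 100 * row[gender] / column_total
        for status, row in counts.items()
    }

for gender, parts in percentages.items():
    summary = ", ".join(f"{s} {p:.1f}%" for s, p in parts.items())
    print(f"{gender}: {summary}")
```

After conversion every bar has a total height of 100, so the composition of the Male and Female bars can be compared directly even though the group sizes (121 and 79) differ.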
1.3 Multiple Bars Diagram: used to display data on more than one variable. In
the multiple bars diagram two or more sets of inter-related data are interpreted.
Example:
Year Coffee Butter Sugar Total
1997 120 127 75
1998 25 98 87
1999 100 120 75
2000 198 98 60
[Figure: multiple bars diagram with series Coffee, Butter, Sugar and Total at four time points (time1 to time4); vertical scale 0 to 400.]
Pie chart: A pie chart is popularly used in practice to show the percentage breakdown of data. It
is a circle representing the total, cut into sectors (slices) in proportion to the size of the
parts that make up the total, so that it shows the proportional sizes of the different data groups.
Example:
[Figure: pie chart with sectors Single, Married and Divorced.]
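The size of each sector follows from the rule that the full circle (360 degrees) represents the total. The sketch below assumes the chart is drawn from the Total column of the marital-status table given earlier (100, 70 and 30); the original figure is not available to confirm this.

```python
# Totals per category, taken from the Total column of the marital-status table
totals = {"Single": 100, "Married": 70, "Divorced": 30}

grand_total = sum(totals.values())

# Sector angle = (category total / grand total) * 360 degrees
angles = {name: 360 * value / grand_total for name, value in totals.items()}
# Percentage share of each category
shares = {name: 100 * value / grand_total for name, value in totals.items()}

for name in totals:
    print(f"{name:<9}{shares[name]:5.1f}%  {angles[name]:6.1f} degrees")
```

The angles add up to 360 degrees and the percentages to 100, which is a quick check on the arithmetic before drawing the chart.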
Histogram
A histogram is a special type of bar graph in which the horizontal scale represents classes
of data values and the vertical scale represents frequencies. The heights of the bars
correspond to the frequency values, and the bars are drawn adjacent to each other (without gaps).
We can construct a histogram after we have first completed a frequency distribution table
for a data set. The x axis is reserved for the class boundaries.
Example 2.4: Consider the following set of data and construct the frequency distribution.
11 ,29, 6, 33, 14, 21, 18, 17, 22, 38, 31, 22, 27, 19, 22, 23, 26, 39, 34, 27
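One possible grouping of these data is sketched below, together with a rough text histogram. The class width W = 7, the number of classes K = 5 and a unit of measurement of 1 are assumptions made here for illustration; other reasonable choices would give a different table.

```python
data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]

W, K = 7, 5         # assumed class width and number of classes
lowest = min(data)  # LCL of the first class

table = []
for i in range(K):
    lcl = lowest + i * W
    ucl = lcl + W - 1  # unit of measurement assumed to be 1
    freq = sum(lcl <= x <= ucl for x in data)  # observations in this class
    table.append((lcl, ucl, freq))
    # One star per observation gives a sideways text histogram
    print(f"{lcl:>2}-{ucl:<2} {'*' * freq} ({freq})")
```

Under these assumptions the classes are 6-12, 13-19, 20-26, 27-33 and 34-40, and the class frequencies add up to the 20 observations.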
A relative frequency histogram has the same shape and horizontal (x-axis) scale as a histogram, but the
vertical (y-axis) scale is marked with relative frequencies instead of actual frequencies.
Frequency Polygon
A frequency polygon uses line segments connected to points located directly above the class midpoint
values. The heights of the points correspond to the class frequencies, and the line segments are
extended to the left and right so that the graph begins and ends on the horizontal axis, at the
positions where the previous and next class midpoints would be located.
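The coordinates of a frequency polygon, including the two closing points on the horizontal axis, can be generated as in the sketch below. The class midpoints and frequencies are hypothetical, and equal class widths are assumed.

```python
# Hypothetical class midpoints (class marks) and class frequencies
midpoints = [9, 16, 23, 30, 37]
frequencies = [2, 4, 6, 5, 3]

# With equal class widths, consecutive midpoints are one class width apart
step = midpoints[1] - midpoints[0]

# Close the polygon on the horizontal axis one step before the first
# midpoint and one step after the last midpoint
points = (
    [(midpoints[0] - step, 0)]
    + list(zip(midpoints, frequencies))
    + [(midpoints[-1] + step, 0)]
)

for x, y in points:
    print(f"({x}, {y})")
```

Joining these points in order with straight line segments gives the polygon, which starts at (2, 0) and ends at (44, 0).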
An Ogive (pronounced as “oh-jive”) is a line that depicts cumulative frequencies, just as the
cumulative frequency distribution lists cumulative frequencies. Note that the Ogive uses class
boundaries along the horizontal scale, and the graph begins with the lower boundary of the first class
and ends with the upper boundary of the last class. An Ogive is useful for determining the number of
values below some particular value. There are two types of Ogive, namely the less than Ogive and the
more than Ogive. The difference is that the less than Ogive uses the less than cumulative frequency and
the more than Ogive uses the more than cumulative frequency on the vertical axis.
[Figure: Ogive for Example 2.4 above.]
Pictograph
A pictograph is a way of representing statistical data using symbolic figures to match the
frequencies of different kinds of data. It is a visual presentation of data using icons, pictures,
symbols, etc., in place of or in addition to common graph elements (bars, lines, points). Pictographs
use relative sizes or repetitions of the same icon, picture, or symbol to show comparison.
A pictograph is also called a pictogram, pictorial chart, pictorial graph, or picture graph.
A stem-and-leaf diagram, also called a stem-and-leaf plot, is a diagram that quickly summarizes
data while maintaining the individual data points. In such a diagram, the "stem" is a column of
the unique elements of data after removing the last digit. The final digits ("leaves") of each
column are then placed in a row next to the appropriate column and sorted in numerical order.
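The construction just described can be sketched in Python. The example reuses the 20 observations of Example 2.4 from earlier in this chapter and assumes non-negative integers whose last digit is the leaf; the formatting of the output is an illustrative choice.

```python
from collections import defaultdict

data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]

stems = defaultdict(list)
for value in sorted(data):         # sorting first keeps the leaves in order
    stem, leaf = divmod(value, 10)  # stem: value without its last digit
    stems[stem].append(leaf)        # leaf: the last digit

lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
```

Each row lists one stem followed by its leaves, so the individual values (for example 2 | 9 is 29) can still be read back from the diagram.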