
Pilot Survey

Once the questionnaire is ready, it is advisable to conduct a try-out with a small group, which is known
as Pilot Survey or Pre-Testing of the questionnaire. The pilot survey helps in providing a preliminary
idea about the survey.
Non-Response Errors
Non-response occurs if an interviewer is unable to contact a person listed in the sample or a person
from the sample refuses to respond. In this case, the sample observation may not be representative.
Sampling Bias
Sampling bias occurs when the sampling plan is such that some members of the target population could
not possibly be included in the sample.
The Census has been conducted regularly every ten years since 1881. The first Census after
Independence was held in 1951.

Some of the major agencies at the national level are the Census of India, National Sample Survey
Organisation (NSSO), Central Statistical Organisation (CSO), Registrar General of India (RGI),
Directorate General of Commercial Intelligence and Statistics (DGCIS), Labour Bureau, etc.
Data can be grouped according to time. Such a classification is known as a Chronological
Classification. The variable ‘population’ is a Time Series, as it depicts a series of values for different years.
In Spatial Classification the data are classified with reference to geographical locations such as
countries, states, cities, districts, etc.
Country     Value
America      1925
Brazil        127
China         893
Denmark       225
France        439
India         862

Sometimes you come across characteristics that cannot be expressed quantitatively. Such characteristics
are called Qualities or Attributes. For example, nationality, literacy, religion, gender, marital status, etc.
They cannot be measured.
Under the Exclusive Method, we form classes in such a way that the lower limit of a class coincides with
the upper class limit of the previous class. This method of classification is most suitable for data on a
continuous variable.
In comparison to the exclusive method, the Inclusive Method does not exclude the upper class limit in a
class interval: it includes the upper class limit in the class. Thus both class limits are parts of the class interval.
In textual presentation, data are described within the text. When the quantity of data is not too large,
this form of presentation is more suitable. For example: in a bandh call given on 08 September 2005
protesting the hike in prices of petrol and diesel, 5 petrol pumps were found open and 17 were closed,
whereas 2 schools were closed and the remaining 9 schools were found open in a town of Bihar.
In a tabular presentation, data are presented in rows (read horizontally) and columns (read vertically).
Classification used in tabulation is of four kinds:
• Qualitative
• Quantitative
• Temporal and
• Spatial
When classification is done according to qualitative characteristics like social status, physical status,
nationality, etc., it is called qualitative classification.
In quantitative classification, the data are classified on the basis of characteristics which are quantitative
in nature. In other words, these characteristics can be measured quantitatively. For example, age, height,
production, income, etc. are quantitative characteristics. Classes are formed by assigning limits, called
class limits, for the values of the characteristic under consideration. If data are classified by age in years,
for instance, the classifying characteristic is quantifiable.
Temporal classification: In this classification time becomes the classifying variable and data are
categorised according to time. Time may be in hours, days, weeks, months, years, etc.
When classification is done in such a way that place becomes the classifying variable, it is called spatial
classification. The place may be a village/town, block, district, state, country, etc.
Subscripted numbers like 1.2, 3.1, etc. are also in use for identifying a table according to its location.
At the top of each column in a table, a column designation is given to explain the figures of the column.
This is called a caption or column heading.
Like a caption or column heading, each row of the table has to be given a heading. The designations of
the rows are also called stubs or stub items, and the complete left column is known as the stub column.
A brief description of the row headings may also be given at the left-hand top of the table.
Footnote is the last part of the table. It explains the specific feature of the data content of the table
which is not self-explanatory and has not been explained earlier.
There are various kinds of diagrams in common use. Amongst them the important ones are the
following:
(i) Geometric diagram
(ii) Frequency diagram
(iii) Arithmetic line graph
Geometric Diagram
Bar diagram and pie diagram come in the category of geometric diagrams for presentation of data. The
bar diagrams are of three types: simple, multiple and component bar diagrams.
Different types of data may require different modes of diagrammatical representation. Bar diagrams are
suitable both for frequency type and non-frequency type variables and attributes. Discrete variables like
family size, spots on a dice, grades in an examination, etc. and attributes such
as gender, religion, caste, country, etc. can be represented by bar diagrams. Bar diagrams are more
convenient for non-frequency data such as income-expenditure profiles, exports/imports over the
years, etc.
Bar diagrams can have different forms such as multiple bar diagram and component bar diagram. A
component bar diagram shows the bar and its sub-divisions into two or more components. For example,
the bar might show the total population of children in the age-group of 6–14 years. The components
show the proportion of those who are enrolled and those who are not.

Multiple bar diagrams are used for comparing two or more sets of data, for example income and
expenditure or import and export for different years, marks obtained in different subjects in different
classes, etc.
Bars (also called columns) are usually used in time series data.
Component Bar Diagrams: Component bar diagrams or charts (Fig. 4.3), also called sub-diagrams, are
very useful in comparing the sizes of different component parts (the elements or parts which a thing is
made up of) and also for throwing light on the relationship among these integral parts. For example,
sales proceeds from different products, expenditure pattern in a typical Indian family (components
being food, rent, medicine, education, power, etc.), budget outlay for receipts and expenditures,
components of labour force, population etc. Component bar diagrams are usually shaded or coloured
suitably.

A pie diagram is also a component diagram, but unlike a component bar diagram, it is a circle whose
area is proportionally divided among the components it represents (Fig. 4.4). Pie charts are usually not
drawn with absolute values of a category. The values of each category are first expressed as a
percentage of the total value of all the categories.
Frequency Diagram
Data in the form of grouped frequency distributions are generally represented by frequency diagrams
like the histogram, frequency polygon, frequency curve and ogive.
A histogram is a two-dimensional diagram.
The frequency polygon is the most common method of presenting a grouped frequency distribution.
The frequency curve is obtained by drawing a smooth freehand curve passing through the points of the
frequency polygon as closely as possible. It may not necessarily pass through all the points of the
frequency polygon, but it passes through them as closely as possible.
Ogive is also called the cumulative frequency curve. As there are two types of cumulative frequencies,
less-than type and more-than type, there are accordingly two ogives for any grouped frequency
distribution. Here, in place of the simple frequencies used in a frequency polygon, cumulative
frequencies are plotted along the y-axis against class limits of the frequency distribution. For the
less-than ogive the cumulative frequencies are plotted against the respective upper limits of the class
intervals, whereas for the more-than ogive the cumulative frequencies are plotted against the respective
lower limits of the class intervals. An interesting feature of the two ogives together is that their
intersection point gives the median of the frequency distribution (Fig. 4.8(b)). As the shapes of the two
ogives suggest, the less-than ogive is never decreasing and the more-than ogive is never increasing.
Arithmetic Line Graph
An arithmetic line graph is also called a time series graph and is a method of diagrammatic presentation
of data. In it, time (hour, day/date, week, month, year, etc.) is plotted along the x-axis and the value of
the variable (time series data) along the y-axis. The line graph obtained by joining these plotted points is
called an arithmetic line graph (time series graph). It helps in understanding the trend, periodicity, etc.
in long-term time series data.

Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data
in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics
do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions
regarding any hypotheses we might have made. They are simply a way to describe our data.
Descriptive statistics are very important because if we simply presented our raw data it would be hard
to visualize what the data was showing, especially if there was a lot of it. Descriptive statistics therefore
enables us to present the data in a more meaningful way, which allows simpler interpretation of the
data. For example, if we had the results of 100 pieces of students' coursework, we may be interested in
the overall performance of those students. We would also be interested in the distribution or spread of
the marks. Descriptive statistics allow us to do this. How to properly describe data through statistics and
graphs is an important topic and discussed in other Laerd Statistics guides. Typically, there are two
general types of statistic that are used to describe data:
• Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. In this case, the frequency distribution is simply the distribution
and pattern of marks scored by the 100 students from the lowest to the highest. We can
describe this central position using a number of statistics, including the mode, median, and
mean. You can learn more in our guide: Measures of Central Tendency.
• Measures of spread: these are ways of summarizing a group of data by describing how spread
out the scores are. For example, the mean score of our 100 students may be 65 out of 100.
However, not all students will have scored 65 marks. Rather, their scores will be spread out.
Some will be lower and others higher. Measures of spread help us to summarize how spread out
these scores are. To describe this spread, a number of statistics are available to us, including the
range, quartiles, absolute deviation, variance and standard deviation.
When we use descriptive statistics it is useful to summarize our group of data using a combination of
tabulated description (i.e., tables), graphical description (i.e., graphs and charts) and statistical
commentary (i.e., a discussion of the results).
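As a quick illustration of the two groups of measures above, here is a minimal Python sketch using the
standard library; the coursework marks are hypothetical.

```python
# Hypothetical coursework marks for a small group of students.
import statistics

marks = [45, 52, 65, 65, 70, 71, 74, 80, 88, 93]

# Measures of central tendency
print("mean:  ", statistics.mean(marks))
print("median:", statistics.median(marks))
print("mode:  ", statistics.mode(marks))

# Measures of spread
print("range: ", max(marks) - min(marks))
print("variance (sample):", statistics.variance(marks))
print("std dev (sample): ", statistics.stdev(marks))
```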
Inferential Statistics
We have seen that descriptive statistics provide information about our immediate group of data. For
example, we could calculate the mean and standard deviation of the exam marks for the 100 students
and this could provide valuable information about this group of 100 students. Any group of data like
this, which includes all the data you are interested in, is called a population. A population can be small
or large, as long as it includes all the data you are interested in. For example, if you were only interested
in the exam marks of 100 students, the 100 students would represent your population. Descriptive
statistics are applied to populations, and the properties of populations, like the mean or standard
deviation, are called parameters as they represent the whole population (i.e., everybody you are
interested in).
Often, however, you do not have access to the whole population you are interested in investigating, but
only a limited number of data instead. For example, you might be interested in the exam marks of all
students in the UK. It is not feasible to measure all exam marks of all students in the whole of the UK so
you have to measure a smaller sample of students (e.g., 100 students), which are used to represent the
larger population of all UK students. Properties of samples, such as the mean or standard deviation, are
not called parameters, but statistics. Inferential statistics are techniques that allow us to use these
samples to make generalizations about the populations from which the samples were drawn. It is,
therefore, important that the sample accurately represents the population. The process of achieving this
is called sampling (sampling strategies are discussed in detail in the section, Sampling Strategy, on our
sister site). Inferential statistics arise out of the fact that sampling naturally incurs sampling error and
thus a sample is not expected to perfectly represent the population. The methods of inferential statistics
are (1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
Nominal
Let’s start with the easiest one to understand.  Nominal scales are used for labeling variables, without
any quantitative value.  “Nominal” scales could simply be called “labels.”  Here are some examples,
below.  Notice that all of these scales are mutually exclusive (no overlap) and none of them have any
numerical significance.  A good way to remember all of this is that “nominal” sounds a lot like “name”
and nominal scales are kind of like “names” or labels.

Examples of Nominal Scales (shown as images in the original).
Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called “dichotomous.”  If
you are a student, you can use that to impress your teacher.
Bonus Note #2: Other sub-types of nominal data are “nominal with order” (like “cold, warm, hot, very
hot”) and nominal without order (like “male/female”).
Ordinal
With ordinal scales, the order of the values is what’s important and significant, but the differences
between each one is not really known.  Take a look at the example below.  In each case, we know that a
#4 is better than a #3 or #2, but we don’t know–and cannot quantify–how much better it is.  For
example, is the difference between “OK” and “Unhappy” the same as the difference between “Very
Happy” and “Happy?”  We can’t say.
Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort,
etc.
“Ordinal” is easy to remember because it sounds like “order”, and that’s the key to remember with
“ordinal scales”–it is the order that matters, but that’s all you really get from these.
Advanced note: The best way to determine central tendency on a set of ordinal data is to use the mode or
median; a purist will tell you that the mean cannot be defined from an ordinal set.

Example of Ordinal Scales (shown as an image in the original).

Interval
Interval scales are numeric scales in which we know both the order and the exact differences between
the values.  The classic example of an interval scale is Celsius temperature because the difference
between each value is the same.  For example, the difference between 60 and 50 degrees is a
measurable 10 degrees, as is the difference between 80 and 70 degrees.
Interval scales are nice because the realm of statistical analysis on these data sets opens up.  For
example, central tendency can be measured by mode, median, or mean; standard deviation can also be
calculated.
Here’s the problem with interval scales: they don’t have a “true zero.”  For example, there is no such
thing as “no temperature,” at least not with Celsius.  In the case of interval scales, zero doesn’t mean the
absence of value, but is actually another number used on the scale, like 0 degrees Celsius.  Negative
numbers also have meaning.
Ratio
Ratio scales are the ultimate nirvana when it comes to data measurement scales because they tell us
about the order, they tell us the exact value between units, AND they also have an absolute zero.

Parametric statistics require that data be interval or ratio. If the data are nominal or ordinal,
nonparametric statistics must be used.

Raw data, or data that have not been summarized in any way, are sometimes referred to as ungrouped
data.

Data that have been organized into a frequency distribution are called grouped data.

One particularly useful tool for grouping data is the frequency distribution, which is a summary of data
presented in the form of class intervals and frequencies.

When constructing a frequency distribution, the business researcher should first determine the range of
the raw data. The range often is defined as the difference between the largest and smallest numbers.

The second step in constructing a frequency distribution is to determine how many classes it will
contain.
After selecting the number of classes, the business researcher must determine the width of the class
interval. An approximation of the class width can be calculated by dividing the range by the number of
classes.

The midpoint of each class interval is called the class midpoint and is sometimes referred to as the class
mark. It is the value halfway across the class interval and can be calculated as the average of the two
class endpoints. For example, in the distribution of Table 2.2, the midpoint of the class interval 3–under 5
is 4, or (3 + 5)/2. The class midpoint is important because it becomes the representative value for each
class in most grouped statistics calculations.

Relative frequency is the proportion of the total frequency that is in any given class interval in a
frequency distribution. Relative frequency is the individual class frequency divided by the total
frequency.

The cumulative frequency is a running total of frequencies through the classes of a frequency
distribution.
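The construction steps above can be sketched in Python; the data values and the choice of five classes
are hypothetical.

```python
# Build a frequency distribution: range, class width, midpoints,
# relative frequencies and a running (cumulative) total.
import math

data = [12, 17, 8, 15, 13, 21, 9, 14, 18, 25, 11, 16, 19, 22, 7]
num_classes = 5

lo, hi = min(data), max(data)
rng = hi - lo                          # range = largest - smallest
width = math.ceil(rng / num_classes)   # approximate class width, rounded up

total = len(data)
cumulative = 0
for i in range(num_classes):
    start = lo + i * width
    end = start + width
    freq = sum(1 for x in data if start <= x < end)
    cumulative += freq
    midpoint = (start + end) / 2        # class mark = average of the endpoints
    print(f"{start}-under {end}: f={freq}, midpoint={midpoint}, "
          f"rel={freq/total:.2f}, cum={cumulative}")
```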

Quantitative data graphs: there are five types of quantitative data graphs: (1) histogram, (2) frequency
polygon, (3) ogive, (4) dot plot, and (5) stem-and-leaf plot.

What is a Histogram?


A histogram is used to summarize discrete or continuous data. In other words, it provides a visual
interpretation of numerical data by showing the number of data points that fall within a specified range
of values (called “bins”). It is similar to a vertical bar graph. However, a histogram, unlike a vertical bar
graph, shows no gaps between the bars.
Parts of a Histogram
1. The title: The title describes the information included in the histogram.
2. X-axis: The X-axis shows the intervals of values into which the measurements fall.
3. Y-axis: The Y-axis shows the number of times that the values occurred within the intervals set
by the X-axis.
4. The bars: The height of the bar shows the number of times that the values occurred within the
interval, while the width of the bar shows the interval that is covered. For a histogram with
equal bins, the width should be the same across all bars.
Importance of a Histogram
Creating a histogram provides a visual representation of data distribution. Histograms can display a
large amount of data and the frequency of the data values. The median and distribution of the data can
be determined by a histogram. In addition, it can show any outliers or gaps in the data.
Distributions of a Histogram
A normal distribution: In a normal distribution, points on one side of the average are as likely to occur
as on the other side of the average.
A bimodal distribution: In a bimodal distribution, there are two peaks. In a bimodal distribution, the
data should be separated and analyzed as separate normal distributions.
A right-skewed distribution: A right-skewed distribution is also called a positively skewed distribution.
In a right-skewed distribution, a large number of data values occur on the left side with a fewer
number of data values on the right side. A right-skewed distribution usually occurs when the data has a
range boundary on the left-hand side of the histogram. For example, a boundary of 0.
A left-skewed distribution: A left-skewed distribution is also called a negatively skewed distribution. In
a left-skewed distribution, a large number of data values occur on the right side with a fewer number of
data values on the left side. A left-skewed distribution usually occurs when the data has a range
boundary on the right-hand side of the histogram. For example, a boundary such as 100.
A random distribution: A random distribution lacks an apparent pattern and has several peaks. In a
random distribution histogram, it can be the case that different data properties were combined.
Therefore, the data should be separated and analyzed separately.

Example of a Histogram
Jeff is the branch manager at a local bank. Recently, Jeff’s been receiving customer feedback saying that
the wait times for a client to be served by a customer service representative are too long. Jeff decides to
observe and write down the time spent by each customer on waiting. Here are his findings from
observing and writing down the wait times spent by 20 customers (the raw data table appears in the
original source). The corresponding histogram with 5-second bins (5-second intervals) would look as
follows:
We can see that:
• There are 3 customers waiting between 1 and 35 seconds
• There are 5 customers waiting between 35.1 and 40 seconds
• There are 5 customers waiting between 40.1 and 45 seconds
• There are 5 customers waiting between 45.1 and 50 seconds
• There are 2 customers waiting between 50.1 and 55 seconds

Jeff can conclude that the majority of customers wait between 35.1 and 50 seconds.
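A minimal sketch of the binning behind Jeff's histogram; since the original data table is not reproduced
here, the 20 wait times below are hypothetical stand-ins.

```python
# Tally hypothetical wait times (seconds) into 5-second bins and print
# a text histogram: one '#' per customer in each bin.
wait_times = [32, 34, 35, 36, 37, 38, 39, 40, 41, 42,
              43, 44, 45, 46, 47, 48, 49, 50, 52, 54]

for start in range(30, 60, 5):                      # bins 30-35, 35-40, ...
    end = start + 5
    count = sum(1 for t in wait_times if start < t <= end)
    print(f"{start:>2}-{end}: {'#' * count}  ({count})")
```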
Uses of Frequency Polygon
A frequency polygon is a graphical form of representation of data. It is used to depict the shape of the
data and to depict trends. It is usually drawn with the help of a histogram but can be drawn without it
as well. A histogram is a series of rectangular bars with no space between them and is used to represent
frequency distributions.

Steps to Draw a Frequency Polygon


• Mark the class intervals for each class on the horizontal axis. We will plot the frequency on the
vertical axis.
• Calculate the class mark for each class interval. The formula for the class mark is:
Class mark = (Upper limit + Lower limit) / 2
• Mark all the class marks on the horizontal axis. The class mark is also known as the mid-value of
every class.
• Corresponding to each class mark, plot the frequency as given to you. The height always depicts
the frequency. Make sure that the frequency is plotted against the class mark and not the upper
or lower limit of any class.
• Join all the plotted points using line segments. The curve obtained will be kinked.
• This resulting curve is called the frequency polygon.
Note that the above method is used to draw a frequency polygon without drawing a histogram. You can
also draw a histogram first by drawing rectangular bars against the given class intervals. After this, you
must join the midpoints of the bars to obtain the frequency polygon. Remember that the bars will have
no spaces between them in a histogram.
Solved Example for You
Question: Construct a frequency polygon using the data given below:
Test Scores   Frequency
49.5–59.5          5
59.5–69.5         10
69.5–79.5         30
79.5–89.5         40
89.5–99.5         15

Answer: We first need to calculate the cumulative frequency from the frequency given.
Test Scores   Frequency   Cumulative Frequency
49.5–59.5          5               5
59.5–69.5         10              15
69.5–79.5         30              45
79.5–89.5         40              85
89.5–99.5         15             100

We now start by plotting the class marks such as 54.5, 64.5, 74.5 and so on till 94.5. Note that we will
also plot the previous and next class marks to start and end the polygon, i.e. we plot 44.5 and 104.5 as
well.
Then, the frequencies corresponding to the class marks are plotted against each class mark. This makes
sense, as the frequencies for class marks 44.5 and 104.5 are zero, so the polygon touches the x-axis at
both ends. These plot points are used only to give a closed shape to the polygon. (The finished polygon
appears as a figure in the original.)
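For readers who want to reproduce the plot, here is a short sketch for this example, assuming matplotlib
is available; the zero-frequency class marks 44.5 and 104.5 close the shape on the x-axis.

```python
# Frequency polygon for the solved example: frequencies plotted at
# class marks, with zero-frequency end points to close the polygon.
import matplotlib.pyplot as plt

class_marks = [44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5]
frequencies = [0,    5,    10,   30,   40,   15,   0]

plt.plot(class_marks, frequencies, marker="o")
plt.xlabel("Test score (class mark)")
plt.ylabel("Frequency")
plt.title("Frequency polygon")
plt.show()
```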

What is an Ogive Graph?


An ogive (oh-jive), sometimes called a cumulative frequency polygon, is a type of frequency polygon
that shows cumulative frequencies. In other words, the cumulative percents are added on the graph
from left to right.
An ogive graph plots cumulative frequency on the y-axis and class boundaries along the x-axis. It’s
very similar to a histogram, only instead of rectangles, an ogive has a single point marking where the
top right of the rectangle would be. It is usually easier to create this kind of graph from a frequency
table.
How to Draw an Ogive Graph
Example question: Draw an Ogive graph for the following set of data:
02, 07, 16, 21, 31, 03, 08, 17, 21, 55
03, 13, 18, 22, 55, 04, 14, 19, 25, 57
06, 15, 20, 29, 58.
Step 1: Make a relative frequency table from the data. The first column has the class limits, the second
column has the frequency (the count) and the third column has the relative frequency (class frequency /
total number of items):

If you aren’t sure how to create your class limits (also called bins), see: What is a Bin in Statistics?
Step 2: Add a fourth column and cumulate (add up) the frequencies in column 2, going down from top
to bottom. For example, the second entry is the sum of the first row and the second row in the frequency
column (5 + 5 = 10), and the third entry is the sum of the first, second, and third rows in the frequency
column (5 + 5 + 6 = 16):

Step 3: Add a fifth column and cumulate the relative frequencies from column 3. If you do this step
correctly, your values should add up to 100% (or 1 as a decimal):

Step 4: Draw an x-y graph with percent cumulative relative frequency on the y-axis (from 0 to 100%,
or as a decimal, 0 to 1). Mark the x-axis with the class boundaries.
Step 5: Plot your points. Note: Each point should be plotted on the upper limit of the class boundary. For
example, if your first class boundary is 0 to 10, the point should be plotted at 10.
Step 6: Connect the dots with straight lines. The ogive is one continuous line, made up of several smaller
lines that connect pairs of dots, moving from left to right.
The finished graph for this sample data appears as a figure in the original.
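Here is a Python sketch of Steps 2 through 6 for the data above; the class width of 10 is an assumption
for illustration, and matplotlib is assumed to be available.

```python
# Cumulate frequencies, convert to cumulative relative frequency, and
# plot each point at the upper class boundary (Step 5's rule).
import matplotlib.pyplot as plt

data = [2, 7, 16, 21, 31, 3, 8, 17, 21, 55,
        3, 13, 18, 22, 55, 4, 14, 19, 25, 57,
        6, 15, 20, 29, 58]

total = len(data)
cum = 0
xs, ys = [0], [0.0]                    # the ogive starts at zero
for lower in range(0, 60, 10):         # assumed class boundaries 0, 10, ..., 60
    upper = lower + 10
    cum += sum(1 for v in data if lower < v <= upper)
    xs.append(upper)
    ys.append(cum / total)             # cumulative relative frequency

plt.plot(xs, ys, marker="o")
plt.xlabel("Class boundary")
plt.ylabel("Cumulative relative frequency")
plt.show()
```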
Qualitative Data Graphs: In contrast to quantitative data graphs that are plotted along a numerical scale,
qualitative graphs are plotted using non-numerical categories. In this section, we will examine three
types of qualitative data graphs: (1) pie charts, (2) bar charts, and (3) Pareto charts.
Imagine you survey your friends to find the kind of movie they like best:
Table: Favorite Type of Movie

Comedy Action Romance Drama SciFi

4 5 6 1 4

You can show the data by this Pie Chart:

It is a really good way to show relative sizes: it is easy to see which movie types are most liked, and
which are least liked, at a glance.
You can create graphs like that using our Data Graphs (Bar, Line and Pie) page.
Or you can make them yourself ...
How to Make Them Yourself
First, put your data into a table (like above), then add up all the values to get a total:
Table: Favorite Type of Movie

          Comedy   Action   Romance   Drama   SciFi   TOTAL
Value        4        5        6         1       4      20

Next, divide each value by the total and multiply by 100 to get a percent:

          Comedy   Action   Romance   Drama   SciFi   TOTAL
Value        4        5        6         1       4      20
Percent    4/20     5/20     6/20      1/20    4/20
          = 20%    = 25%    = 30%     = 5%    = 20%    100%

 
Now to figure out how many degrees for each "pie slice" (correctly called a sector).
A full circle has 360 degrees, so we do this calculation:

          Comedy       Action       Romance      Drama        SciFi        TOTAL
Value        4            5            6            1            4           20
Percent     20%          25%          30%          5%           20%         100%
Degrees   4/20 × 360°  5/20 × 360°  6/20 × 360°  1/20 × 360°  4/20 × 360°   360°
          = 72°        = 90°        = 108°       = 18°        = 72°
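The same percent and degree calculations in a short Python sketch:

```python
# Each category's value is divided by the total, then scaled to
# 100% (percent) and to 360 degrees (pie-slice angle).
movies = {"Comedy": 4, "Action": 5, "Romance": 6, "Drama": 1, "SciFi": 4}
total = sum(movies.values())   # 20

for genre, count in movies.items():
    percent = count / total * 100
    degrees = count / total * 360
    print(f"{genre:8} {percent:5.1f}%  {degrees:6.1f}°")
```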

  A Bar Graph (also called Bar Chart) is a graphical display of data using bars of different heights.
Imagine you just did a survey of your friends to find which kind of movie they liked best:
Table: Favorite Type of Movie

Comedy Action Romance Drama SciFi

4 5 6 1 4

We can show that on a bar graph like this:

It is a really good way to show relative sizes: we can see which types of movie are most liked, and which
are least liked, at a glance.
We can use bar graphs to show the relative sizes of many things, such as what type of car people have,
how many customers a shop has on different days and so on.

Example: Nicest Fruit


A survey of 145 people asked them "Which is the nicest fruit?":
Fruit: Apple Orange Banana Kiwifruit Blueberry Grapes

People: 35 30 10 25 40 5

And here is the bar graph:

That group of people think Blueberries are the nicest.


Bar Graphs can also be Horizontal, like this:

Example: Student Grades


In a recent test, this many students got these grades:
Grade: A B C D

Students: 4 12 10 2
And here is the bar graph:

You can create graphs like that using our Data Graphs (Bar, Line and Pie) page.
 
Histograms vs Bar Graphs Bar Graphs are good when your data is in categories (such as "Comedy",
"Drama", etc).
But when you have continuous data (such as a person's height) then use a Histogram.
It is best to leave gaps between the bars of a Bar Graph, so it doesn't look like a Histogram.
Line Graph: a graph that shows information that is connected in some way (such as change over time).
You are learning facts about dogs, and each day you do a short test to see how good you are. These are
the results:
Table: Facts I got Correct
Day 1 Day 2 Day 3 Day 4
3 4 12 15
And here is the same data as a Line Graph:

Example: Apples Sold


Here is a pictograph of how many apples were sold at the local shop over 4 months:

Note that each picture of an apple means 10 apples (and the half-apple picture means 5 apples).
So the pictograph is showing:
 In January 10 apples were sold
 In February 40 apples were sold
 In March 25 apples were sold
 In April 20 apples were sold
It is a fun and interesting way to show data.
But it is not very accurate: in the example above we can't show just 1 apple sold, or 2 apples sold etc.
A third type of qualitative data graph is a Pareto chart, which could be viewed as a particular
application of the bar graph. An important concept and movement in business is total quality
management (see Chapter 18). One of the important aspects of total quality management is the constant
search for causes of problems in products and processes. A graphical technique for displaying problem
causes is Pareto analysis. Pareto analysis is a quantitative tallying of the number and types of defects
that occur with a product or service. Analysts use this tally to produce a vertical bar chart that displays
the most common types of defects, ranked in order of occurrence from left to right. The bar chart is
called a Pareto chart. Pareto charts were named after an Italian economist, Vilfredo Pareto, who
observed more than 100 years ago that most of Italy's wealth was controlled by a few families who were
the major drivers behind the Italian economy. Quality expert J. M. Juran applied this notion to the
quality field by observing that poor quality can often be addressed by attacking a few major causes that
result in most of the problems. A Pareto chart enables quality management decision makers to separate
the most important defects from trivial defects, which helps them to set priorities for needed quality
improvement work.
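A minimal sketch of the Pareto analysis just described: tally defect types (the counts below are
hypothetical), rank them from most to least frequent, and track each type's cumulative share.

```python
# Rank hypothetical defect counts in descending order and report the
# running share of all defects, as a Pareto chart would display them.
defects = {"loose screw": 43, "cracked casing": 12,
           "bad solder joint": 27, "misaligned label": 5}

total = sum(defects.values())
running = 0
for cause, count in sorted(defects.items(), key=lambda kv: kv[1], reverse=True):
    running += count
    print(f"{cause:18} {count:3}  cumulative {running / total:5.1%}")
```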

Measures of Central Tendency: Ungrouped Data
One type of measure that is used to describe a set of data is the measure of central tendency. Measures
of central tendency yield information about the center, or middle part, of a group of numbers.
Percentiles: the 87th percentile is a value such that at least 87% of the data are below the value and no
more than 13% are above the value.
This section focuses on seven measures of variability for ungrouped data: range, interquartile range,
mean absolute deviation, variance, standard deviation, z scores, and coefficient of variation.
1. Range: the difference between the largest and the smallest values.
2. Interquartile range: the range of values between the first and third quartiles. Essentially, it is the
range of the middle 50% of the data, determined by computing Q3 − Q1. The interquartile range is
especially useful in situations where data users are more interested in values toward the middle and
less interested in extremes.
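A small sketch of the range and interquartile range using numpy (assumed available); note that several
conventions exist for computing quartiles, so other software may report slightly different Q1 and Q3.

```python
# Range and interquartile range for a small hypothetical data set.
import numpy as np

data = np.array([7, 11, 12, 14, 15, 16, 18, 19, 22, 25])

rng = data.max() - data.min()            # largest minus smallest
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
print("range:", rng)
print("IQR:  ", q3 - q1)                 # spread of the middle 50%
```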
Three other measures of variability are the variance, the standard deviation, and the mean absolute
deviation. They are obtained through similar processes and are, therefore, presented together. These
measures are not meaningful unless the data are at least interval-level data.
Two ways of applying the standard deviation are the empirical rule and Chebyshev’s theorem.

The empirical rule is an important rule of thumb that is used to state the approximate percentage of
values that lie within a given number of standard deviations from the mean of a set of data, if the data
are normally distributed. The empirical rule is used only for three numbers of standard deviations: 1, 2,
and 3.
If a set of data is normally distributed, or bell shaped, approximately 68% of the data values are within
one standard deviation of the mean, 95% are within two standard deviations, and almost 100% are
within three standard deviations.
Chebyshev’s theorem states that, regardless of the shape of the distribution, at least 1 − 1/k² of the
values will fall within k standard deviations of the mean (for k > 1).
Both the sample variance and sample standard deviation use n − 1 in the denominator instead of n. One
of the properties of a good estimator is being unbiased. Whereas using n in the denominator of the
sample variance makes it a biased estimator, using n − 1 allows it to be an unbiased estimator, which is
a desirable property in inferential statistics.
• Both variance and standard deviation are always non-negative.
• If all the observations in a data set are identical, then the standard deviation and variance will
be zero.
z Scores
A z score represents the number of standard deviations a value (x) is above or below the mean of a set
of numbers when the data are normally distributed. Using z scores allows translation of a value’s raw
distance from the mean into units of standard deviations.
If a z score is negative, the raw value (x) is below the mean. If the z score is positive, the raw value (x)
is above the mean.
For example, for a data set that is normally distributed with a mean of 50 and a standard deviation of
10, suppose a statistician wants to determine the z score for a value of 70.
This value (x = 70) is 20 units above the mean, so the z value is z = (70 − 50)/10 = +2.00.

This z score signifies that the raw score of 70 is two standard deviations above the mean. How is this z
score interpreted? The empirical rule states that 95% of all values are within two standard deviations of
the mean if the data are approximately normally distributed. Figure 3.7 shows that because the value of
70 is two standard deviations above the mean (z = +2.00), 95% of the values are between 70 and the
value x = 30 that is two standard deviations below the mean, or z = (30 − 50)/10 = −2.00. Because 5%
of the values are outside the range of two standard deviations from the mean and the normal
distribution is symmetrical, 2½% (half of the 5%) are below the value of 30. Thus 97½% of the values
are below the value of 70. Because a z score is the number of standard deviations an individual data
value is from the mean, the empirical rule can be restated in terms of z scores.

Between z = −1.00 and z = +1.00 are approximately 68% of the values.
Between z = −2.00 and z = +2.00 are approximately 95% of the values.
Between z = −3.00 and z = +3.00 are approximately 99.7% of the values.
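The z-score computation from the example above, as a short Python sketch:

```python
# z score: a value's distance from the mean in standard-deviation units.
def z_score(x: float, mean: float, std: float) -> float:
    return (x - mean) / std

print(z_score(70, 50, 10))   # 2.0: two standard deviations above the mean
print(z_score(30, 50, 10))   # -2.0: two standard deviations below the mean
```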

The coefficient of variation is a statistic that is the ratio of the standard deviation to the mean, expressed
as a percentage, and is denoted CV. The coefficient of variation essentially is a relative comparison of a
standard deviation to its mean. The coefficient of variation can be useful in comparing standard
deviations that have been computed from data with different means. Suppose five weeks of average
prices for stock A are 57, 68, 64, 71, and 62. To compute a coefficient of variation for these prices, first
determine the mean and standard deviation: mean = 64.40 and standard deviation = 4.84. The
coefficient of variation is CV = (4.84/64.40)(100) = 7.5%: the standard deviation is 7.5% of the mean.
Sometimes financial investors use the coefficient of variation or the standard deviation or both as
measures of risk. Imagine a stock with a price that never changes. An investor bears no risk of losing
money from the price going down because no variability occurs in the price.

Suppose, in contrast, that the price of the stock fluctuates wildly. An investor who buys at a low price
and sells for a high price can make a nice profit. However, if the price drops below what the investor
buys it for, the stock owner is subject to a potential loss. The greater the variability is, the more the
potential for loss. Hence, investors use measures of variability such as standard deviation or coefficient
of variation to determine the risk of a stock. What does the coefficient of variation tell us about the risk
of a stock that the standard deviation does not? Suppose the average prices for a second stock, B, over
these same five weeks are 12, 17, 8, 15, and 13. The mean for stock B is 13.00 with a standard deviation
of 3.03. The coefficient of variation for stock B is CV = (3.03/13.00)(100) = 23.3%: the standard
deviation for stock B is 23.3% of the mean.
With the standard deviation as the measure of risk, stock A is more risky over this period of time
because it has a larger standard deviation. However, the average price of stock A is almost five times as
much as that of stock B. Relative to the amount invested in stock A, the standard deviation of $4.84 may
not represent as much risk as the standard deviation of $3.03 for stock B, which has an average price of
only $13.00. The coefficient of variation reveals the risk of a stock in terms of the size of the standard
deviation relative to the size of the mean (in percentage). Stock B has a coefficient of variation that is
nearly three times as much as the coefficient of variation for stock A. Using the coefficient of variation
as a measure of risk indicates that stock B is riskier. The choice of whether to use a coefficient of
variation or raw standard deviations to compare multiple standard deviations is a matter of preference.
The coefficient of variation also provides an optional method of interpreting the value of a standard
deviation.
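A sketch reproducing the stock comparison above; using the population standard deviation (pstdev, the
n-denominator form) matches the 4.84 and 3.03 quoted in the text.

```python
# Coefficient of variation: standard deviation as a percentage of the mean.
import statistics

def coefficient_of_variation(prices):
    return statistics.pstdev(prices) / statistics.mean(prices) * 100

stock_a = [57, 68, 64, 71, 62]
stock_b = [12, 17, 8, 15, 13]
print(f"CV(A) = {coefficient_of_variation(stock_a):.1f}%")  # about 7.5%
print(f"CV(B) = {coefficient_of_variation(stock_b):.1f}%")  # about 23.3%
```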
Grouped data do not provide information about individual values. Hence, measures of central tendency
(mean, median, mode) and variability for grouped data must be computed differently from those for
ungrouped or raw data.

Two measures of variability for grouped data are presented here: the variance and the standard
deviation. Again, the standard deviation is the square root of the variance. Both measures have original
and computational formulas.

Measures of shape are tools that can be used to describe the shape of a distribution of data. In this
section, we examine two measures of shape: skewness and kurtosis.
A distribution of data in which the right half is a mirror image of the left half is said to be symmetrical.
One example of a symmetrical distribution is the normal distribution, or bell curve.
Skewness is when a distribution is asymmetrical or lacks symmetry.

Skewness and the Relationship of the Mean, Median, and Mode

The concept of skewness helps in understanding the relationship of the mean, median, and mode. In a
unimodal distribution (a distribution with a single peak or mode) that is skewed, the mode is the apex
(high point) of the curve and the median is the middle value. The mean tends to be located toward the
tail of the distribution, because the mean is particularly affected by the extreme values. A bell-shaped or
normal distribution with the mean, median, and mode all at the center of the distribution has no
skewness. Statistician Karl Pearson is credited with developing at least two coefficients of skewness that
can be used to determine the degree of skewness in a distribution. We present one of these coefficients
here, referred to as a Pearsonian coefficient of skewness. This coefficient compares the mean and
median in light of the magnitude of the standard deviation. Note that if the distribution is symmetrical,
the mean and median are the same value, and hence the coefficient of skewness is equal to zero. If the
value of Sk is positive, the distribution is positively skewed; if the value of Sk is negative, the
distribution is negatively skewed. The greater the magnitude of Sk, the more skewed is the distribution.
If the mean is greater than the median, the distribution is positively skewed.
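The text does not reproduce the coefficient itself; a commonly used form of the Pearsonian coefficient of
skewness is Sk = 3(mean − median)/standard deviation, sketched below with hypothetical data.

```python
# Pearsonian coefficient of skewness: Sk = 3 * (mean - median) / std.
import statistics

def pearson_skewness(data):
    mean = statistics.mean(data)
    median = statistics.median(data)
    return 3 * (mean - median) / statistics.pstdev(data)

sample = [2, 3, 3, 4, 4, 4, 5, 9, 15]   # hypothetical, right-tailed data
print(pearson_skewness(sample))          # positive => positively skewed
```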

Kurtosis describes the amount of peakedness of a distribution. Distributions that are high and thin are
referred to as leptokurtic distributions. Distributions that are flat and spread out are referred to as
platykurtic distributions. Between these two types are distributions that are more "normal" in shape,
referred to as mesokurtic distributions.
Box-and-Whisker Plots
The box-and-whisker plot is determined from five specific numbers.
1. The median (Q2)
2. The lower quartile (Q1)
3. The upper quartile (Q3)
4. The smallest value in the distribution
5. The largest value in the distribution

Probability

The three general methods of assigning probabilities are (1) the classical method, (2) the relative
frequency of occurrence method, and (3) subjective probabilities.
Classical Method of Assigning Probabilities
When we assign probabilities using the classical method, the probability of an individual event
occurring is determined as the ratio of the number of items in a population containing the event (ne) to
the total number of items in the population (N). That is, P(E) = ne/N. For example, if a company has
200 workers and 70 are female, the probability of randomly selecting a female from this company is
70/200 = .35.
Suppose, in a particular plant, three machines make a given product. Machine A always produces 40%
of the total number of this product. Ten percent of the items produced by machine A are defective. If the
finished products are well mixed with regard to which machine produced them, and if one of these
products is randomly selected, the classical method of assigning probabilities tells us that the probability
that the part was produced by machine A and is defective is .04. This probability can be determined
even before the part is sampled because, with the classical method, the probabilities can be determined
a priori; that is, they can be determined prior to the experiment.
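Written out as a worked equation: P(A and defective) = P(A) × P(defective | A) = (.40)(.10) = .04.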

Probability values can be converted to percentages by multiplying by 100. Meteorologists often report
weather probabilities in percentage form. For example, when they forecast a 60% chance of rain for
tomorrow, they are saying that the probability of rain tomorrow is .60.
Relative Frequency of Occurrence
The relative frequency of occurrence method of assigning probabilities is based on cumulated historical
data. With this method, the probability of an event occurring is equal to the number of times the event
has occurred in the past divided by the total number of opportunities for the event to have occurred.
The subjective method of assigning probability is based on the feelings or insights of the person
determining the probability. Subjective probability comes from the person’s intuition or
reasoning. Subjective probability also can be a potentially useful way of tapping a person's experience,
knowledge, and insight and using them to forecast the occurrence of some event.
An experiment is a process that produces outcomes.
An event is an outcome of an experiment; the experiment defines the possibilities of the event. Events
are denoted by uppercase letters; italic capital letters (e.g., A and E1, E2, . . .) represent the general or
abstract case, and roman capital letters (e.g., H and T for heads and tails) denote specific things and
people.

Events that cannot be decomposed or broken down into other events are called elementary events.
Elementary events are denoted by lowercase letters (e.g., e1, e2, e3, . . .). Suppose the experiment is to
roll a die. The elementary events for this experiment are to roll a 1 or roll a 2 or roll a 3, and so on.
Rolling an even number is an event, but it is not an elementary event, because the even number can be
broken down further into the events 2, 4, and 6.
A sample space is a complete roster or listing of all elementary events for an experiment. Table 4.1 is
the sample space for the roll of a pair of dice. The sample space for the roll of a single die is {1, 2, 3, 4, 5,
6}. The sample space can aid in finding probabilities. Suppose an experiment is to roll a pair of dice.
What is the probability that the dice will sum to 7? An examination of the sample space shown in
Table 4.1 reveals that there are six outcomes in which the dice sum to 7—{(1,6), (2,5), (3,4), (4,3), (5,2),
(6,1)}—out of the 36 possible elementary events in the sample space. Using this information, we can
conclude that the probability of rolling a pair of dice that sum to 7 is 6/36, or .1667.
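The 6/36 result can be checked by enumerating the sample space directly; a minimal Python sketch:

```python
# Enumerate all 36 elementary events for a pair of dice and count
# the outcomes that sum to 7.
sample_space = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
sevens = [pair for pair in sample_space if sum(pair) == 7]

print(len(sample_space))                 # 36 elementary events
print(sevens)                            # (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)
print(len(sevens) / len(sample_space))   # 0.1666...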
Unions and Intersections
Set notation, the use of braces to group numbers, is used as a symbolic tool for unions and intersections
in this chapter. The union of X, Y is formed by combining elements from each of the sets and is denoted
X ∪ Y. An element qualifies for the union of X, Y if it is in either X or Y or in both X and Y. The union
expression X ∪ Y can be translated to "X or Y."
An intersection is denoted X ∩ Y. To qualify for the intersection, an element must be in both X and Y.
The intersection contains the elements common to both sets. Thus the intersection symbol, ∩, is often
read as "and." The intersection of X, Y is referred to as "X and Y."
Two or more events are mutually exclusive events if the occurrence of one event precludes the
occurrence of the other event(s). This characteristic means that mutually exclusive events cannot occur
simultaneously and therefore can have no intersection.

Two or more events are independent events if the occurrence or nonoccurrence of one of the events
does not affect the occurrence or nonoccurrence of the other event(s). In symbols, X and Y are
independent if P(X | Y) = P(X) and P(Y | X) = P(Y). Here P(X | Y) denotes the probability of X
occurring given that Y has occurred. If X and Y are independent, then the probability of X occurring
given that Y has occurred is just the probability of X occurring. Knowledge that Y has occurred does
not impact the probability of X occurring, because X and Y are independent.

A list of collectively exhaustive events contains all possible elementary events for an experiment. Thus,
all sample spaces are collectively exhaustive lists. The list of possible outcomes for the tossing of a pair
of dice contained in Table 4.1 is a collectively exhaustive list. The sample space for an experiment can be
described as a list of events that are mutually exclusive and collectively exhaustive: sample space events
do not overlap or intersect, and the list is complete.

Complementary events: The complement of event A is denoted A′, pronounced "not A." All the
elementary events of an experiment not in A comprise its complement. For example, if, in rolling one
die, event A is getting an even number, the complement of A is getting an odd number. If event A is
getting a 5 on the roll of a die, the complement of A is getting a 1, 2, 3, 4, or 6.
P(A′) = 1 − P(A)
Suppose 32% of the employees of a company have a college degree. If an employee is randomly selected
from the company, the probability that the person does not have a college degree is 1 - .32 = .68.
Suppose 42% of all parts produced in a plant are molded by machine A and 31% are molded by
machine B. If a part is randomly selected, the probability that it was molded by neither machine A nor
machine B is 1 - .73 = .27. (Assume that a part is only molded on one machine.)
The mn Counting Rule
Suppose a customer decides to buy a certain brand of new car. Options for the car include two different
engines, five different paint colors, and three interior packages. If each of these options is available with
each of the others, how many different cars could the customer choose from? To determine this number,
we can use the mn counting rule. Using the mn counting rule, we can determine that the automobile
customer has (2)(5)(3) = 30 different car combinations of engines, paint colors, and interiors available.

Sampling from a Population with Replacement


The size of the population, N, is 6, the six sides of the die. We are sampling three dice rolls, n = 3. The
sample space contains (N)n = (6)3 = 216 outcomes.
Suppose in a lottery six numbers are drawn from the digits 0 through 9, with replacement (digits can be
reused). How many different groupings of six numbers can be drawn? N is the population of 10
numbers (0 through 9) and n is the sample size, six numbers

(N)n = (10)6 = 1,000,000

Combinations: Sampling from a Population Without Replacement


This situation does not allow sampling with replacement because three different lawyers will be selected
to go. This problem is solved by using combinations: N = 16 and n = 3, so
NCn = 16C3 = 16! / (3! 13!) = 560
A total of 560 combinations of three lawyers could be chosen to represent the firm.
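The three counting results above, checked with Python's standard library:

```python
# mn rule, sampling with replacement, and combinations without replacement.
import math

print(2 * 5 * 3)          # mn rule: 30 engine/colour/interior combinations
print(10 ** 6)            # (N)^n with replacement: 1,000,000 six-digit draws
print(math.comb(16, 3))   # 16C3 = 16! / (3! * 13!) = 560
```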

Conditional Probability
The conditional probability of an event B is the probability that the event will occur given the
knowledge that an event A has already occurred. This probability is written P(B|A), notation for
the probability of B given A. In the case where events A and B are independent (where event A has no
effect on the probability of event B), the conditional probability of event B given event A is simply the
probability of event B, that is P(B).
If events A and B are not independent, then the probability of the intersection of A and B (the probability
that both events occur) is defined by 
P(A and B) = P(A)P(B|A).
From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):
P(B|A) = P(A and B) / P(A)
Note: This expression is only valid when P(A) is greater than 0.

Examples
In a card game, suppose a player needs to draw two cards of the same suit in order to win. Of the 52
cards, there are 13 cards in each suit. Suppose first the player draws a heart. Now the player wishes to
draw a second heart. Since one heart has already been chosen, there are now 12 hearts remaining in a
deck of 51 cards. So the conditional probability P(Draw second heart|First card a heart) = 12/51.

Suppose an individual applying to a college determines that he has an 80% chance of being accepted,
and he knows that dormitory housing will only be provided for 60% of all of the accepted students. The
chance of the student being accepted and receiving dormitory housing is defined by
P(Accepted and Dormitory Housing) = P(Dormitory Housing|Accepted)P(Accepted)  = (0.60)*(0.80) =
0.48.
To calculate the probability of the intersection of more than two events, the conditional probabilities
of all of the preceding events must be considered. In the case of three events, A, B, and C, the probability
of the intersection P(A and B and C) = P(A)P(B|A)P(C|A and B).

Consider the college applicant who has determined that he has 0.80 probability of acceptance and that
only 60% of the accepted students will receive dormitory housing. Of the accepted students who receive
dormitory housing, 80% will have at least one roommate. The probability of being
accepted and receiving dormitory housing and having no roommates is calculated by: 
P(Accepted and Dormitory Housing and No Roommates) = P(Accepted)P(Dormitory Housing |
Accepted)P(No Roommates | Dormitory Housing and Accepted) = (0.80)*(0.60)*(0.20) = 0.096. The
student has about a 10% chance of receiving a single room at the college.

Another important method for calculating conditional probabilities is given by Bayes's formula. The
formula is based on the expression P(B) = P(B|A)P(A) + P(B|Ac)P(Ac), which simply states that the
probability of event B is the sum of the conditional probabilities of event B given that event A has or has
not occurred. For independent events A and B, this is equal to P(B)P(A) + P(B)P(Ac) = P(B)(P(A) + P(Ac)) =
P(B)(1) = P(B), since the probability of an event and its complement must always sum to 1. Bayes's
formula is defined as follows:
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)]

Example
Suppose a voter poll is taken in three states. In state A, 50% of voters support the liberal candidate, in
state B, 60% of the voters support the liberal candidate, and in state C, 35% of the voters support the
liberal candidate. Of the total population of the three states, 40% live in state A, 25% live in state B, and
35% live in state C. Given that a voter supports the liberal candidate, what is the probability that she
lives in state B?

By Bayes's formula,
P(Voter lives in state B|Voter supports liberal candidate) =
P(Voter supports liberal candidate|Voter lives in state B)P(Voter lives in state B)/
(P(Voter supports lib. cand.|Voter lives in state A)P(Voter lives in state A) +
P(Voter supports lib. cand.|Voter lives in state B)P(Voter lives in state B) +
P(Voter supports lib. cand.|Voter lives in state C)P(Voter lives in state C))
= (0.60)*(0.25)/((0.50)*(0.40) + (0.60)*(0.25) + (0.35)*(0.35))
= (0.15)/(0.20 + 0.15 + 0.1225) = 0.15/0.4725 = 0.3175.

The probability that the voter lives in state B is approximately 0.32.
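The voter-poll calculation as a short Python sketch:

```python
# Bayes' formula for the voter poll: P(state B | supports candidate).
priors = {"A": 0.40, "B": 0.25, "C": 0.35}    # P(lives in state)
support = {"A": 0.50, "B": 0.60, "C": 0.35}   # P(supports | state)

# Total probability that a randomly chosen voter supports the candidate.
p_support = sum(priors[s] * support[s] for s in priors)

# P(B | supports) = P(supports | B) * P(B) / P(supports)
posterior_b = support["B"] * priors["B"] / p_support
print(round(posterior_b, 4))   # 0.3175
```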

Discrete Distributions
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a
random phenomenon. There are two types of random variables, discrete and continuous.
A discrete random variable is one which may take on only a countable number of distinct values such
as 0, 1, 2, 3, 4, …. Discrete random variables are usually (but not necessarily) counts. If a random
variable can take only a finite number of distinct values, then it must be discrete.
A continuous random variable is one which takes an infinite number of possible values. Continuous
random variables are usually measurements. Examples include height, weight, the amount of sugar in
an orange, the time required to run a mile.
Three discrete distributions are presented:
1. binomial distribution
2. Poisson distribution
3. hypergeometric distribution
Six continuous distributions are presented:
1. uniform distribution
2. normal distribution
3. exponential distribution
4. t distribution
5. chi-square distribution
6. F distribution
The histogram is probably the most common graphical way to depict a discrete distribution.
Discovery of the normal curve of errors is generally credited to the mathematician and astronomer Karl
Gauss (1777–1855); thus the normal distribution is sometimes referred to as the Gaussian distribution
or the normal curve of error. To a lesser extent, some credit has been given to Pierre-Simon de Laplace
(1749–1827) for discovering the normal distribution. However, many people now believe that Abraham
de Moivre (1667–1754), a French mathematician, first understood the normal distribution. De Moivre
determined that the binomial distribution approached the normal distribution as a limit. De Moivre
worked with remarkable accuracy: his published table values for the normal curve are only a few
ten-thousandths off the values of currently published tables.
The normal distribution exhibits the following characteristics.
■ It is a continuous distribution.
■ It is a symmetrical distribution about its mean.
■ It is asymptotic to the horizontal axis.
■ It is unimodal.
■ It is a family of curves.
■ Area under the curve is 1.

In theory, the normal distribution is asymptotic to the horizontal axis. That is, it does not touch the
x-axis, and it goes forever in each direction. The normal distribution is referred to as the bell-shaped
curve. It is unimodal in that values mound up in only one portion of the graph—the center of the curve.
The normal distribution actually is a family of curves. Every unique value of the mean and every unique
value of the standard deviation result in a different normal curve. In addition, the total area under any
normal distribution is 1. The area under the curve yields the probabilities, so the total of all probabilities
for a normal distribution is 1. Because the distribution is symmetric, the area of the distribution on each
side of the mean is 0.5.

A mechanism was developed by which all normal distributions can be converted into a single
distribution: the z distribution. This process yields the standardized normal distribution (or curve). The
conversion formula for any x value of a given normal distribution is z = (x − μ)/σ. The z distribution is
a normal distribution with a mean of 0 and a standard deviation of 1.
exponential distribution is continuous anddescribes a probability distribution of the times between
random occurrences. The followingare the characteristics of the exponential distribution.
■It is a continuous distribution.
■It is a family of distributions.
■It is skewed to the right.
■The x values range from zero to infinity.
■Its apex is always at x = 0.
■The curve steadily decreases as x gets larger.
The exponential probability distribution is determined by its probability density function,
f(x) = λe^(–λx) for x ≥ 0, where λ is the average number of occurrences per unit interval.
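As an illustration, here is a minimal Python sketch of these formulas; the rate λ = 2 occurrences per minute is a hypothetical value:

import math

def exponential_pdf(x, lam):
    # Density f(x) = lam * e^(-lam * x) for x >= 0.
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def prob_longer_than(x, lam):
    # P(X > x) = e^(-lam * x): the probability that the gap until the
    # next random occurrence is longer than x.
    return math.exp(-lam * x)

# Hypothetical example: occurrences arrive at a rate of lam = 2 per minute.
print(round(prob_longer_than(1, 2), 4))  # 0.1353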
SAMPLING

A sampling frame is the source material or device from which a sample is drawn. It is a list of all those within a
population who can be sampled, and may include individuals, households or institutions.

Sampling Frame vs. Sample Space


A sampling frame is a list of things that you draw a sample from. A sample space is a list of all possible outcomes for an experiment. For
example, you might have a sampling frame of names of people in a certain town for a survey you’re going to be conducting on family size.
The sample space is all possible outcomes from your survey: 1 person, 2 people, 3 people…10 or more.
A sampling frame is a list of all the items in your population. It's a complete list of everyone or everything you want to study. The difference
between a population and a sampling frame is that the population is general and the frame is specific. For example, the population could be
“People who live in Jacksonville, Florida.” The frame would name ALL of those people, from Adrian Abba to Felicity Zappa. A couple more
examples:
Population: People in STAT101.
Sampling Frame: Adrian, Anna, Bob, Billy, Howie, Jess, Jin, Kate, Kaley, Lin, Manuel, Norah, Paul, Roger, Stu, Tim, Vanessa, Yasmin.

In random sampling, every unit of the population has the same probability of being selected into the
sample. In nonrandom sampling, not every unit of the population has the same probability of being
selected into the sample. Members of nonrandom samples are not selected by chance.
There are many sampling techniques, which are grouped into two categories:
 Probability Sampling
 Non-Probability Sampling
The difference between the two lies in whether the sample selection is based on randomization or not.
With randomization, every element gets an equal chance to be picked and to be part of the sample for
the study.
Probability Sampling
This Sampling technique uses randomization to make sure that every element of the population gets an
equal chance to be part of the selected sample. It’s alternatively known as random sampling.
Simple Random Sampling
Stratified sampling
Systematic sampling
Cluster Sampling
Multi stage Sampling
Simple Random Sampling: Every element has an equal chance of being selected as part of the sample.
It is used when we don't have any kind of prior information about the target population.
For example: random selection of 20 students from a class of 50 students. Each student has an equal
chance of getting selected; the probability that any given student is included in the sample of 20 is 20/50.
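A minimal Python sketch of simple random sampling (roll numbers 1 to 50 are hypothetical):

import random

roll_numbers = list(range(1, 51))          # hypothetical class of 50 students
sample = random.sample(roll_numbers, 20)   # every student is equally likely to be chosen
print(sorted(sample))
# Each student's chance of ending up in the 20-student sample is 20/50 = 0.4.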
Stratified Sampling
This technique divides the elements of the population into small subgroups (strata) based on
similarity, in such a way that the elements within a group are homogeneous while the subgroups are
heterogeneous relative to one another. The elements are then randomly selected from each of these
strata. We need to have prior information about the population to create the subgroups.
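A minimal Python sketch of stratified sampling; the urban/rural strata and group sizes are hypothetical:

import random

# Hypothetical population in which every element carries a stratum label.
population = [("urban", i) for i in range(60)] + [("rural", i) for i in range(40)]

def stratified_sample(pop, per_stratum):
    strata = {}
    for label, item in pop:
        strata.setdefault(label, []).append(item)
    # Randomly select the requested number of elements from each stratum.
    return {label: random.sample(items, per_stratum) for label, items in strata.items()}

print(stratified_sample(population, 5))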

Cluster Sampling
Our entire population is divided into clusters or sections and then the clusters are randomly selected. All
the elements of the cluster are used for sampling. Clusters are identified using details such as age, sex,
location etc.
Cluster sampling can be done in following ways:
· Single Stage Cluster Sampling
An entire cluster is selected randomly for sampling.
· Two Stage Cluster Sampling
Here we first randomly select clusters, and then from those selected clusters we randomly select
elements for sampling.
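A minimal Python sketch of two-stage cluster sampling; the school clusters are hypothetical:

import random

# Hypothetical population organised into four clusters (e.g., schools).
clusters = {name: [f"{name}-{i}" for i in range(30)]
            for name in ["school_A", "school_B", "school_C", "school_D"]}

picked = random.sample(list(clusters), 2)    # stage 1: pick clusters at random
sample = {name: random.sample(clusters[name], 5) for name in picked}  # stage 2: pick elements
print(sample)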



Systematic Sampling
Here the selection of elements is systematic and not random, except for the first element. Elements of a
sample are chosen at regular intervals of the population. All the elements are first put together in a
sequence, where each element has an equal chance of being selected.
For a sample of size n, we divide our population of size N into subgroups of k elements each.
We select our first element randomly from the first subgroup of k elements.
To select the other elements of the sample, we proceed as follows:
The number of elements in each subgroup is k, i.e., N/n.
So if our first element is n1, then
the second element is n1 + k, i.e., n2,
the third element is n2 + k, i.e., n3, and so on.
Taking an example of N = 20, n = 5:
The number of elements in each subgroup is N/n, i.e., 20/5 = 4 = k.
Now, randomly select the first element from the first subgroup.
If we select n1 = 3, then
n2 = n1 + k = 3 + 4 = 7
n3 = n2 + k = 7 + 4 = 11
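The same procedure in a minimal Python sketch:

import random

N, n = 20, 5
k = N // n                       # subgroup size: 20 / 5 = 4
first = random.randint(1, k)     # random start within the first subgroup
sample = [first + i * k for i in range(n)]
print(sample)                    # with first = 3 this gives [3, 7, 11, 15, 19]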

Multi-Stage Sampling
It is a combination of one or more of the methods described above.
The population is divided into multiple clusters, and then these clusters are further divided and grouped
into various subgroups (strata) based on similarity. One or more clusters can be randomly selected from
each stratum. This process continues until the clusters can't be divided any further. For example, a
country can be divided into states, cities, and urban and rural areas, and all the areas with similar
characteristics can be merged together to form a stratum.
Non-Probability Sampling
It does not rely on randomization. This technique is more reliant on the researcher's ability to select
elements for a sample. The outcome of the sampling might be biased, and it is difficult for every element
of the population to have an equal chance of being part of the sample. This type of sampling is also
known as non-random sampling.
Convenience Sampling
Purposive Sampling
Quota Sampling
Referral /Snowball Sampling
Convenience Sampling
Here the samples are selected based on availability. This method is used when obtaining a sample by
other means is difficult or costly, so samples are selected for convenience.
For example: Researchers prefer this during the initial stages of survey research, as it’s quick and easy to
deliver results.
Purposive Sampling
This is based on the intention or purpose of the study. Only those elements which best suit the purpose
of our study will be selected from the population.
For example: if we want to understand the thought process of people who are interested in pursuing a
master's degree, then the selection criterion would be "Are you interested in a master's in ..?"
All the people who respond with a "No" will be excluded from our sample.
Quota Sampling
This type of sampling depends on some pre-set standard. It selects a representative sample from the
population: the proportion of a characteristic or trait in the sample should be the same as in the
population. Elements are selected until the exact proportions of certain types of data are obtained, or
until sufficient data in the different categories have been collected.
For example: If our population has 45% females and 55% males then our sample should reflect the same
percentage of males and females.
Referral /Snowball Sampling
This technique is used in situations where the population is completely unknown or rare.
We therefore take the help of the first element we select from the population and ask that person to
recommend other elements who fit the description of the sample needed.
This referral technique goes on, increasing the size of the sample like a snowball.
For example: It’s used in situations of highly sensitive topics like HIV Aids where people will not openly
discuss and participate in surveys to share information about HIV Aids.
Not all the victims will respond to the questions asked so researchers can contact people they know or
volunteers to get in touch with the victims and collect information
Panel sampling
 Method of first selecting a group of participants through a random sampling method and then
asking that group for the same information again several times over a period of time.
 Therefore, each participant is given the same survey or interview at two or more time points; each
period of data collection is called a "wave".
 This sampling methodology is often chosen for large-scale or nation-wide studies in order to
gauge changes in the population with regard to any number of variables, from chronic illness to
job stress to weekly food expenditures.
 Panel sampling can also be used to inform researchers about within-person health changes due
to age or help explain changes in continuous dependent variables such as spousal interaction.
 There have been several proposed methods of analyzing panel sample data, including growth
curves.
QUOTA SAMPLING
 The population is first segmented into mutually exclusive sub-groups, just as in stratified sampling.
 Then judgment is used to select subjects or units from each segment based on a specified
proportion.
 For example, an interviewer may be told to sample 200 females and 300 males between the age
of 45 and 60.
 It is this second step which makes the technique one of non-probability sampling.
 In quota sampling the selection of the sample is non-random.
 For example, interviewers might be tempted to interview those who look most helpful. The
problem is that these samples may be biased because not everyone gets a chance of selection.
This non-random element is its greatest weakness, and quota versus probability sampling has
been a matter of controversy for many years.
MULTI-PHASE SAMPLING
 Part of the information is collected from the whole sample and part from a subsample.
 In a TB survey: MT in all cases – Phase I
 X-ray of the chest in MT +ve cases – Phase II
 Sputum examination in X-ray +ve cases – Phase III
A survey by such a procedure is less costly, less laborious and more purposeful.
Although strata and clusters are both non-overlapping subsets of the population, they differ in
several ways. All strata are represented in the sample, but only a subset of clusters is in the
sample. With stratified sampling, the best survey results occur when elements within strata are
internally homogeneous. However, with cluster sampling, the best results occur when elements
within clusters are internally heterogeneous.

Sampling error occurs when the sample is not representative of the population. When random sampling
techniques are used to select elements for the sample, sampling error occurs by chance. Many times the
statistic computed on the sample is not an accurate estimate of the population parameter because the
sample was not representative of the population. This result is caused by sampling error.
All errors other than sampling errors are nonsampling errors. The many possible nonsampling errors
include missing data, recording errors, input processing errors, and analysis
errors. Other nonsampling errors result from the measurement instrument, such as errors of unclear
definitions, defective questionnaires, and poorly conceived concepts. Improper definition of the frame is
a nonsampling error.


A lurking variable is a variable that is not included as an explanatory or response variable in the analysis but can
affect the interpretation of relationships between variables. A lurking variable can falsely identify a strong relationship
between variables or it can hide the true relationship. For example, a research scientist studies the effect of diet and
exercise on a person's blood pressure. Lurking variables that also affect blood pressure are whether a person smokes
and stress levels.

Hence, all the other variables that could affect the dependent variable to change must be controlled.
These other variables are called extraneous or confounding variables.

Standard Error
The standard error (SE) of a statistic is the approximate standard deviation of a
statistical sample population. The standard error is a statistical term that measures the accuracy
with which a sample distribution represents a population by using standard deviation. In
statistics, a sample mean deviates from the actual mean of a population; this deviation is the
standard error of the mean.

The term "standard error" is used to refer to the standard deviation of various sample statistics,
such as the mean or median. For example, the "standard error of the mean" refers to the standard
deviation of the distribution of sample means taken from a population. The smaller the standard
error, the more representative the sample will be of the overall population
The relationship between the standard error and the standard deviation is such that, for a given
sample size, the standard error equals the standard deviation divided by the square root of the
sample size. The standard error is thus inversely proportional to the square root of the sample size;
the larger the sample size, the smaller the standard error, because the statistic will approach the
actual value.

Once you've calculated the mean of a sample, you should let people know how close your sample mean is
likely to be to the parametric mean. One way to do this is with the standard error of the mean. If you take
many random samples from a population, the standard error of the mean is the standard deviation of the
different sample means. About two-thirds (68.3%) of the sample means would be within one standard
error of the parametric mean, 95.4% would be within two standard errors, and almost all (99.7%) would be
within three standard errors.
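This behaviour is easy to check by simulation. A minimal Python sketch, using a hypothetical population with mean 50 and standard deviation 10:

import random
import statistics

random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]   # hypothetical population

# Draw many samples of size 25 and record each sample mean.
sample_means = [statistics.mean(random.sample(population, 25)) for _ in range(2_000)]

# The standard deviation of the sample means approximates sigma / sqrt(n).
print(round(statistics.stdev(sample_means), 2))   # close to 10 / sqrt(25) = 2.0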

What Is Variance?
Variance (σ²) in statistics is a measurement of the spread between numbers in a data set. That is, it
measures how far each number in the set is from the mean and therefore from every other number in the
set.
Notably, when calculating a sample variance to estimate a population variance, the denominator
of the variance equation becomes N - 1 so that the estimation is unbiased and does not
underestimate the population variance.
Variance measures variability from the average or mean. To investors, variability is volatility, and volatility
is a measure of risk. Therefore, the variance statistic can help determine the risk an investor assumes
when purchasing a specific security.
A large variance indicates that numbers in the set are far from the mean and from each other, while a
small variance indicates the opposite.
Variance can never be negative. A variance value of zero indicates that all values within a set of numbers
are identical.
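A minimal Python sketch of the two denominators, using a small hypothetical data set:

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical data, mean = 5
print(statistics.pvariance(data))   # population variance: divides by N, gives 4
print(statistics.variance(data))    # sample variance: divides by N - 1, gives about 4.571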
Law of Large Numbers
The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets
closer to the average of the whole population.
Normal Distribution in Statistics
The normal distribution is the most important probability distribution in statistics because it fits many natural
phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution.
It is also known as the Gaussian distribution and the bell curve.

The normal distribution is a probability function that describes how the values of a variable are distributed. It is a
symmetric distribution where most of the observations cluster around the central peak and the probabilities for values
further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are
similarly unlikely.

In this blog post, you’ll learn how to use the normal distribution, its parameters, and how to calculate Z-scores to
standardize your data and find probabilities.

Example of Normally Distributed Data: Heights


Height data are normally distributed. The distribution in this example fits real data that I collected from 14-year-old
girls during a study.

As you can see, the distribution of heights follows the typical pattern for all normal distributions. Most girls are close to
the average (1.512 meters). Small differences between an individual’s height and the mean occur more frequently
than substantial deviations from the mean. The standard deviation is 0.0741m, which indicates the typical distance
that individual girls tend to fall from mean height.

The distribution is symmetric. The number of girls shorter than average equals the number of girls taller than average.
In both tails of the distribution, extremely short girls occur as infrequently as extremely tall girls.

Parameters of the Normal Distribution


As with any probability distribution, the parameters for the normal distribution define its shape and probabilities
entirely. The normal distribution has two parameters, the mean and standard deviation. The normal distribution does
not have just one form. Instead, the shape changes based on the parameter values, as shown in the graphs below.

Mean

The mean is the central tendency of the distribution. It defines the location of the peak for normal distributions. Most
values cluster around the mean. On a graph, changing the mean shifts the entire curve left or right on the X-axis.

Standard deviation

The standard deviation is a measure of variability. It defines the width of the normal distribution. The standard
deviation determines how far away from the mean the values tend to fall. It represents the typical distance between
the observations and the average.

On a graph, changing the standard deviation either tightens or spreads out the width of the distribution along the X-
axis. Larger standard deviations produce distributions that are more spread out.
When you have narrow distributions, the probabilities are higher that values won’t fall far from the mean. As you
increase the spread of the distribution, the likelihood that observations will be further away from the mean also
increases.

Population parameters versus sample estimates

The mean and standard deviation are parameter values that apply to entire populations. For the normal
distribution, statisticians signify the parameters by using the Greek symbol μ (mu) for the population mean and σ
(sigma) for the population standard deviation.

Unfortunately, population parameters are usually unknown because it’s generally impossible to measure an entire
population. However, you can use random samples to calculate estimates of these
parameters. Statisticians represent sample estimates of these parameters using x̅ for the sample mean and s for the
sample standard deviation.


Common Properties for All Forms of the Normal Distribution
Despite the different shapes, all forms of the normal distribution have the following characteristic properties.

o They’re all symmetric. The normal distribution cannot model skewed distributions.


o The mean, median, and mode are all equal.
o Half of the population is less than the mean and half is greater than the mean.
o The Empirical Rule allows you to determine the proportion of values that fall within certain distances from
the mean. More on this below!
While the normal distribution is essential in statistics, it is just one of many probability distributions, and it does not fit
all populations. To learn how to determine whether the normal distribution provides the best fit to your sample data,
read my posts about How to Identify the Distribution of Your Data and Assessing Normality: Histograms vs. Normal
Probability Plots.

The Empirical Rule for the Normal Distribution


When you have normally distributed data, the standard deviation becomes particularly valuable. You can use it to
determine the proportion of the values that fall within a specified number of standard deviations from the mean. For
example, in a normal distribution, 68% of the observations fall within +/- 1 standard deviation from the mean. This
property is part of the Empirical Rule, which describes the percentage of the data that fall within specific numbers of
standard deviations from the mean for bell-shaped curves.

Mean +/- standard deviations    Percentage of data contained
+/- 1                           68%
+/- 2                           95%
+/- 3                           99.7%

Let’s look at a pizza delivery example. Assume that a pizza restaurant has a mean delivery time of 30 minutes and a
standard deviation of 5 minutes. Using the Empirical Rule, we can determine that 68% of the delivery times are
between 25-35 minutes (30 +/- 5), 95% are between 20-40 minutes (30 +/- 2*5), and 99.7% are between 15-45
minutes (30 +/-3*5). The chart below illustrates this property graphically.
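The same figures can be verified with Python's standard library (the delivery-time parameters come from the example above):

from statistics import NormalDist

delivery = NormalDist(mu=30, sigma=5)   # pizza delivery times from the example

for k in (1, 2, 3):
    lo, hi = 30 - k * 5, 30 + k * 5
    pct = delivery.cdf(hi) - delivery.cdf(lo)
    print(f"within {k} SD ({lo}-{hi} min): {pct:.1%}")
# within 1 SD (25-35 min): 68.3%
# within 2 SD (20-40 min): 95.4%
# within 3 SD (15-45 min): 99.7%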
Standard Normal Distribution and Standard Scores
As we’ve seen above, the normal distribution has many different shapes depending on the parameter values.
However, the standard normal distribution is a special case of the normal distribution where the mean is zero and the
standard deviation is 1. This distribution is also known as the Z-distribution.

A value on the standard normal distribution is known as a standard score or a Z-score. A standard score represents
the number of standard deviations above or below the mean that a specific observation falls. For example, a standard
score of 1.5 indicates that the observation is 1.5 standard deviations above the mean. On the other hand, a negative
score represents a value below the average. The mean has a Z-score of 0.
Suppose you weigh an apple and it weighs 110 grams. There’s no way to tell from the weight alone how this apple
compares to other apples. However, as you’ll see, after you calculate its Z-score, you know where it falls relative to
other apples.

Standardization: How to Calculate Z-scores


Standard scores are a great way to understand where a specific observation falls relative to the entire distribution.
They also allow you to take observations drawn from normally distributed populations that have different means and
standard deviations and place them on a standard scale. This standard scale enables you to compare observations
that would otherwise be difficult.

This process is called standardization, and it allows you to compare observations and calculate probabilities across
different populations. In other words, it permits you to compare apples to oranges. Isn’t statistics great!

To standardize your data, you need to convert the raw measurements into Z-scores.

To calculate the standard score for an observation, take the raw measurement, subtract the mean, and divide by the
standard deviation. Mathematically, the formula for that process is the following:

z = (X – μ) / σ

X represents the raw value of the measurement of interest. Mu and sigma represent the parameters for the
population from which the observation was drawn.

After you standardize your data, you can place them within the standard normal distribution. In this manner,
standardization allows you to compare different types of observations based on where each observation falls within
its own distribution.
Example of Using Standard Scores to Make an Apples to
Oranges Comparison
Suppose we literally want to compare apples to oranges. Specifically, let’s compare their weights. Imagine that we
have an apple that weighs 110 grams and an orange that weighs 100 grams.

If we compare the raw values, it’s easy to see that the apple weighs more than the orange. However, let’s compare
their standard scores. To do this, we’ll need to know the properties of the weight distributions for apples and oranges.
Assume that the weights of apples and oranges follow a normal distribution with the following parameter values:

                             Apples    Oranges
Mean weight (grams)            100        140
Standard deviation (grams)      15         25

Now we’ll calculate the Z-scores:

o Apple = (110-100) / 15 = 0.667


o Orange = (100-140) / 25 = -1.6
The Z-score for the apple (0.667) is positive, which means that our apple weighs more than the average apple. It’s
not an extreme value by any means, but it is above average for apples. On the other hand, the orange has a fairly
negative Z-score (-1.6). It's pretty far below the mean weight for oranges. I've placed these Z-values in the standard
normal distribution below.
While our apple weighs more than our orange, we are comparing a somewhat heavier than average apple to a
downright puny orange! Using Z-scores, we’ve learned how each fruit fits within its own distribution and how they
compare to each other.
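A minimal Python sketch of the two calculations above:

def z_score(x, mu, sigma):
    # Standard score: distance from the mean in standard deviations.
    return (x - mu) / sigma

print(round(z_score(110, 100, 15), 3))   # apple:   0.667
print(round(z_score(100, 140, 25), 3))   # orange: -1.6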

Finding Areas Under the Curve of a Normal Distribution


The normal distribution is a probability distribution. As with any probability distribution, the proportion of the area that
falls under the curve between two points on a probability distribution plot indicates the probability that a value will fall
within that interval. To learn more about this property, read my post about Understanding Probability Distributions.

Typically, I use statistical software to find areas under the curve. However, when you’re working with the normal
distribution and convert values to standard scores, you can calculate areas by looking up Z-scores in a Standard
Normal Distribution Table.

Because there are an infinite number of different normal distributions, publishers can’t print a table for each
distribution. However, you can transform the values from any normal distribution into Z-scores, and then use a table
of standard scores to calculate probabilities.

Using a Table of Z-scores

Let’s take the Z-score for our apple (0.667) and use it to determine its weight percentile. A percentile is the proportion
of a population that falls below a specific value. Consequently, to determine the percentile, we need to find the area
that corresponds to the range of Z-scores that are less than 0.667. In the portion of the table below, the closest Z-
score to ours is 0.65, which we’ll use.
The trick with these tables is to use the values in conjunction with the properties of the normal distribution to calculate
the probability that you need. The table value indicates that the area of the curve between -0.65 and +0.65 is 48.43%.
However, that’s not what we want to know. We want the area that is less than a Z-score of 0.65.

We know that the two halves of the normal distribution are mirror images of each other. So, if the area for the interval
from -0.65 and +0.65 is 48.43%, then the range from 0 to +0.65 must be half of that: 48.43/2 = 24.215%. Additionally,
we know that the area for all scores less than zero is half (50%) of the distribution.

Therefore, the area for all scores up to 0.65 = 50% + 24.215% = 74.215%

Our apple is at approximately the 74th percentile.

Below is a probability distribution plot produced by statistical software that shows the same percentile along with a
graphical representation of the corresponding area under the curve. The value is slightly different because we used a
Z-score of 0.65 from the table while the software uses the more precise value of 0.667.
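This lookup can be reproduced with Python's standard library, which uses the more precise Z-score:

from statistics import NormalDist

# Percentile for the apple: the area below its Z-score of 0.667.
print(f"{NormalDist().cdf(0.667):.3%}")   # about 74.8%, vs. 74.215% from the table's 0.65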

Treatment

In an experiment, the factor (also called an independent variable) is an explanatory variable manipulated by the
experimenter. Each factor has two or more levels, i.e., different values of the factor. Combinations of factor levels
are called treatments. The table below shows independent variables, factors, levels, and treatments for a
hypothetical experiment.

                          Vitamin C
                  0 mg          250 mg         500 mg
Vitamin E  0 mg   Treatment 1   Treatment 2    Treatment 3
         400 mg   Treatment 4   Treatment 5    Treatment 6

In this hypothetical experiment, the researcher is studying the possible effects of Vitamin C and Vitamin E on
health. There are two factors - dosage of Vitamin C and dosage of Vitamin E. The Vitamin C factor has three levels
- 0 mg per day, 250 mg per day, and 500 mg per day. The Vitamin E factor has 2 levels - 0 mg per day and 400
mg per day. The experiment has six treatments. Treatment 1 is 0 mg of E and 0 mg of C, Treatment 2 is 0 mg of E
and 250 mg of C, and so on.

The placebo effect is the fact that some patients' health improves after taking what they believe is an effective drug
but which is in fact only a placebo.

The placebo effect can be understood only if we acknowledge the unity of mind and body.
Blinding is a procedure in which one or more parties in a trial are kept unaware of which
treatment arms participants have been assigned to, in other words, which treatment was received. Blinding is an
important aspect of any trial, and is done in order to avoid and prevent conscious or unconscious bias in the design
and execution of a clinical trial.
What is Research Ethics?


Research ethics provides guidelines for the responsible conduct of research. In addition, it educates and monitors
scientists conducting research to ensure a high ethical standard. The following is a general summary of some
ethical principles:

Honesty:

Honestly report data, results, methods and procedures, and publication status. Do not fabricate, falsify, or
misrepresent data.

Objectivity:

Strive to avoid bias in experimental design, data analysis, data interpretation, peer review, personnel decisions,
grant writing, expert testimony, and other aspects of research.

Integrity:

Keep your promises and agreements; act with sincerity; strive for consistency of thought and action.

Carefulness:

Avoid careless errors and negligence; carefully and critically examine your own work and the work of your peers.
Keep good records of research activities.

Openness:

Share data, results, ideas, tools, resources. Be open to criticism and new ideas.

Respect for Intellectual Property:

Honor patents, copyrights, and other forms of intellectual property. Do not use unpublished data, methods, or
results without permission. Give credit where credit is due. Never plagiarize.
Confidentiality:

Protect confidential communications, such as papers or grants submitted for publication, personnel records, trade
or military secrets, and patient records.

Responsible Publication:

Publish in order to advance research and scholarship, not to advance just your own career. Avoid wasteful and
duplicative publication.

Responsible Mentoring:

Help to educate, mentor, and advise students. Promote their welfare and allow them to make their own decisions.

Respect for Colleagues:

Respect your colleagues and treat them fairly.

Social Responsibility:

Strive to promote social good and prevent or mitigate social harms through research, public education, and
advocacy.

Non-Discrimination:

Avoid discrimination against colleagues or students on the basis of sex, race, ethnicity, or other factors that are not
related to their scientific competence and integrity.

Competence:

Maintain and improve your own professional competence and expertise through lifelong education and learning;
take steps to promote competence in science as a whole.

Legality:

Know and obey relevant laws and institutional and governmental policies.

Animal Care:

Show proper respect and care for animals when using them in research. Do not conduct unnecessary or poorly
designed animal experiments.

Human Subjects Protection:

When conducting research on human subjects, minimize harms and risks and maximize benefits; respect human
dignity, privacy, and autonomy.

If the numbers come from a census of the entire population and not a sample, when we calculate the average of
the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are
from a sample rather than a population, when we calculate the average of the squared deviations, we divide by
n – 1, one less than the number of items in the sample.
The standard deviation (SD) measures the amount of variability, or dispersion, of a set of data about its mean,
while the standard error of the mean (SEM) measures how far the sample mean of the data is likely to be
from the true population mean.

Notice that instead of dividing by n = 20, the calculation divides by n – 1 = 20 – 1 = 19 because the data are a
sample.

For the sample variance, we divide by the sample size minus one (n – 1). Why not divide by n? The answer has to do
with the population variance. The sample variance is an estimate of the population variance. Based on the theoretical
mathematics that lies behind these calculations, dividing by (n – 1) gives a better estimate of the population variance.

A Random Variable is a set of possible values from a random experiment.

You would use x̄ to estimate the population mean and s to estimate the population standard deviation. The sample
mean, x̄, is the point estimate for the population mean, μ. The sample standard deviation, s, is the point estimate
for the population standard deviation, σ.

A confidence interval is another type of estimate but, instead of being just one number, it is an interval of numbers. The
interval of numbers is a range of values calculated from a given set of sample data. The confidence interval is likely to
include an unknown population parameter.
Suppose, for the iTunes example, we do not know the population mean μ, but we do know that the population standard
deviation is σ = 1 and our sample size is 100. Then, by the central limit theorem, the standard deviation of the sample
mean is
σ/√n = 1/√100 = 0.1.
The empirical rule, which applies to bell-shaped distributions, says that in approximately 95% of the samples, the sample
mean, x̄, will be within two standard deviations of the population mean μ. For our iTunes example, two standard
deviations is (2)(0.1) = 0.2. The sample mean x̄ is likely to be within 0.2 units of μ.
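A minimal Python sketch of this interval; the sample mean of 2.0 songs is a hypothetical value:

import math

sigma, n = 1, 100
se = sigma / math.sqrt(n)              # 0.1, by the central limit theorem
x_bar = 2.0                            # hypothetical sample mean
print(x_bar - 2 * se, x_bar + 2 * se)  # about 95% of such intervals capture mu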
