
ENGINEERING DATA ANALYSIS

EDA 11 – for Bachelor of Science in Civil Engineering, 2nd year level

“This course is designed for undergraduate students with emphasis on problem solving related to
societal issues that engineers are called upon to solve.”

Prepared by:

Engr. Maybelle Del Rosario – Balicante


EDA 11 - ENGINEERING DATA ANALYSIS

INTRODUCTION
THE BEGINNINGS OF STATISTICS
The history of statistics can be said to start around 1749, although, over time, there have
been changes to the interpretation of the word. In early times, its meaning was
restricted to information about states. In modern terms, “statistics” means both sets of
collected information, as in national accounts and temperature records, and analytical work
which requires statistical inference.

STATISTICS – the science and the art of interpreting data drawn from facts
and information.
– a science which deals with the methods of gathering, presenting,
analyzing, and interpreting data.

DATA GATHERING - involves getting information through interviews,
questionnaires, experiments, testing, and other methods.
DATA PRESENTATION - deals with translating information into numerical or
quantitative data using tabular or graphical form.
DATA ANALYSIS - the resolution of information into simpler elements by the
application of statistical principles, the choice of which
depends upon the nature or purpose of the problem at hand.
DATA INTERPRETATION - comes after data analysis; includes relating the findings
to existing theories and earlier studies in that area.

 TYPES OF STATISTICS
 Descriptive - summarizes data numerically and/or graphically. The
results are then interpreted using measures such as the mean and standard deviation.
 Inferential - generalizes findings from a sample of a population to draw
conclusions about the nature of the whole population.

STATISTICAL DATA – A sequence of observations made on a set of objects included in the
sample drawn from a population is known as statistical data.
– Data can be defined as the quantitative or qualitative values of a
variable (e.g. numbers, images, words, figures, facts, or ideas).
– It is the lowest unit of information from which other measurements and
analyses can be done.


DATA
 QUANTITATIVE
 Continuous Data (Variable)
 Discrete Data (Discontinuous)
 QUALITATIVE
 Attribute Data (Nominal Data, Ordinal Data)
 Open Data

DATA is one of the most important and vital aspects of any research study.

 TYPES OF DATA
1. Quantitative – measures of values or counts, expressed as numbers.
 Continuous Data (Variable) – data that can take the form of
decimals or continuous values of varying degrees of precision (e.g.
height, weight)
 Discrete Data (Discontinuous) – data whose values cannot take the
form of decimals. (e.g. family size, enrolment size)

2. Qualitative – data that approximates and characterizes; non-numerical in
nature. Collected through methods of observation, one-on-one
interviews, and similar methods.
 Attribute Data – data that can be counted for recording and analysis.
Nominal Data – data defined by an operation which allows
making statements only of equality or difference. (e.g. gender,
race, religion, political affiliation)
Ordinal Data – data defined by an operation whereby
members of a particular group are ranked. (e.g. awareness, IQ)
 Open Data – data that depends on the sample and is not given a
specific value from a possible set of responses or answers.


DATA GATHERING
LESSON 1.0
The first step in any investigation is the collection of data.
The data may be collected for the whole population or for a sample only. It is mostly
collected on a sample basis.
Collection of data is a very difficult job. The investigator is a well-trained person who
collects the statistical data. The respondents are the persons from whom the information is
collected.

DATA COLLECTION
UN-GROUPED DATA - data which have not been arranged in any
systematic order; also called raw data.
GROUPED DATA - data presented in the form of a frequency
distribution.

 FACTORS to be Considered Before Collection of Data:


 Object and scope of the enquiry
 Sources of information
 Quantitative expression
 Techniques of data collection
 Unit of collection

METHODS OF DATA COLLECTION - The selection of a method for collecting
information must balance several concerns, including:
 Available resources
 Analysis and reporting
 Relevance to the study
 The skill of the evaluator
 TYPES OF SOURCES OF DATA

SOURCES OF DATA
 EXTERNAL SOURCES
 Primary Data
 Secondary Data
 INTERNAL SOURCES


1. External Sources – when information is collected from outside agencies, it is
called an external source of data. Such data are either primary or
secondary. This type of information can be collected by the census or sampling
method by conducting a survey.
 Primary Data – first-hand information collected, compiled, and
published by an organization for some specific purpose. They are the
most original data and have not yet undergone any statistical
treatment. Their validity is greater than that of secondary data.

METHODS OF COLLECTING PRIMARY DATA:
 Direct Personal Investigation
 Indirect Oral Investigation
 Investigation through Questionnaire
 Investigation through Local Reporters
 Investigation through Mailed Questionnaire

 Secondary Data – the second-hand information which has already
been collected by some organization for some purpose and is
available for the present study. Secondary data are not pure in
character and have undergone some statistical


treatment at least once. Secondary data may be available in
published or unpublished form. When it is not possible to collect the
data by the primary method, the investigator goes for the secondary
method. These data were collected for some purpose other than the
problem at hand.

SOURCES OF SECONDARY DATA:
 PUBLISHED SOURCES
 International
 Government
 Municipal Corporation
 Institutional/Commercial
 UNPUBLISHED SOURCES

Difference between PRIMARY and SECONDARY DATA

PRIMARY DATA SECONDARY DATA
- Real-time data - Past data
- Sure about the sources of data - Not sure about the sources of data
- Helps to give results/findings - Helps in refining the problem
- Costly and time-consuming process - Cheap and less time-consuming process
- Avoids bias in response data - Cannot know whether the data are biased
- More flexible - Less flexible

 Internal Sources – many institutions and departments have information
about their regular functions for their own internal purposes. When that
information is used in a survey, it is called an internal source of data.


PLANNING AND CONDUCTING SURVEYS


SURVEY RESEARCH – means collecting information about a group of people by asking
them questions and analyzing the results.

 USES OF SURVEY RESEARCH


1. Social Research – investigating the experiences and characteristics of
different social groups.
2. Market Research – finding out what customers think about products,
services and companies.
3. Health – collecting data from patients about symptoms and treatments.
4. Politics – measuring public opinion about parties and policies.
5. Psychology – researching personality traits, preferences and behaviors.

 STEPS on Planning a Survey Research:

 Goal Setting – establish a clear picture of the study/survey.
 Background Reading – before planning, it is wise to carry out a library search
of the relevant background of the study.
 Early Planning – the success of data collection requires careful preparation.
 Create an Appropriate Method of Data Collection – assess the relevant type of
obtaining data with respect to the study/survey.
 Gather Information – formal implementation of the data gathering.
 Analyze the Results – evaluate and measure the collected data.
 Draw Conclusions – these must answer the established goal.

 STEPS on Conducting a Survey Research:


 Determine who will participate in the research.
 Decide the type of survey (mail, online, or in person).
 Design the survey questionnaire and layout.
 Distribute the survey.
 Analyze the responses.
 Write up the results.


PLANNING AND CONDUCTING EXPERIMENTS


EXPERIMENTAL RESEARCH – commonly used in sciences such as sociology, psychology,
physics, chemistry, biology, and medicine.

 CRITERIA OF USE
1. There is a time priority in a causal effect.
2. There is consistency in a causal relationship.
3. The magnitude of the correlation is great.

 STEPS on Conducting Experimental Research:


 Recognition and Statement of the problem.
 Choice of factors, levels and ranges.
 Selection of the response variable.
 Choice of design.
 Conducting the experiment.
 Statistical Analysis.
 Drawing conclusions and making recommendations.


PROBABILITY
LESSON 2.0
Probability is the branch of mathematics that studies the possible outcomes of given
events together with the outcomes’ relative likelihoods and distributions. In common
usage, the word “probability” means the chance that a particular event (or set of
events) will occur, expressed on a linear scale from 0 (impossibility) to 1 (certainty), also
expressed as a percentage between 0 and 100%. The analysis of events governed by
probability is called statistics.

EXPERIMENTAL PROBABILITY - also known as Empirical Probability; based on actual
experiments and adequate recordings of the happening of events.
THEORETICAL PROBABILITY - used to find the probability of an event by reasoning,
without conducting experiments.

 TWO APPROACHES TO STUDY PROBABILITY


1. Experimental Probability – to determine the occurrence of any event, a
series of actual experiments are conducted. Experiments which do not have
a fixed result are known as random experiments. The outcome of such
experiments is uncertain. Random experiments are repeated multiple times
to determine their likelihood. An experiment is repeated a fixed number of
times and each repetition is known as a trial. Mathematically, the formula for
the experimental probability is defined by:
 Probability of Event P(E) = Number of times an event occurs/Total
number of trials.
2. Theoretical Probability – theoretical probability does not require any
experiments to conduct. Instead of that, we should know about the situation
to find the probability of an event occurring. Mathematically, the theoretical
probability is described as the number of favorable outcomes divided by the
number of possible outcomes.
 Probability of Event P(E) = Number of favorable outcomes/Number
of possible outcomes.
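The two formulas above can be compared directly in a short simulation. This is a minimal sketch, not part of the module: it estimates the experimental probability of rolling a 4 with a fair die and compares it with the theoretical value of 1/6.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

trials = 60_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 4)

experimental = hits / trials  # number of times the event occurs / total trials
theoretical = 1 / 6           # favorable outcomes / possible outcomes

print(round(experimental, 3), round(theoretical, 3))
```

With more trials, the experimental value tends to approach the theoretical one, which is the point of the two-approach comparison.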


SAMPLE SPACE AND RELATIONSHIPS AMONG EVENTS - Rolling an ordinary


six-sided die is a familiar example of a random experiment, an action for which all possible
outcomes can be listed, but for which the actual outcome on any given trial of the
experiment cannot be predicted with certainty. In such a situation we wish to assign to each
outcome, such as rolling a two, a number, called the probability of the outcome, that
indicates how likely it is that the outcome will occur. Similarly, we would like to assign a
probability to any event, or collection of outcomes, such as rolling an even number, which
indicates how likely it is that the event will occur if the experiment is performed.

 EXAMPLES:
1. Construct a sample space for the experiment that consists of tossing a single
coin.
Solution: The outcomes could be labeled h for heads and t for tails. Then
the sample space is the set: S = { h,t }
2. Construct a sample space for the experiment that consists of rolling a single
die. Find the events that correspond to the phrases “an even number is
rolled” and “a number greater than two is rolled.”
Solution: The outcomes could be labeled according to the number of
dots on the top face of the die. Then the sample space is the set S = { 1, 2,
3, 4, 5, 6 }
The outcomes that are even are 2, 4, and 6, so the event that
corresponds to the phrase “an even number is rolled” is the set { 2, 4, 6 },
which it is natural to denote by the letter E. We write E = { 2, 4, 6 }.
Similarly the event that corresponds to the phrase “a number
greater than two is rolled” is the set T={ 3, 4, 5, 6 }, which we have
denoted T.
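The die example above translates directly into set operations. A quick sketch, using the same names S, E, and T as the text:

```python
# Sample space for rolling a single die, and the two events from the example.
S = {1, 2, 3, 4, 5, 6}
E = {x for x in S if x % 2 == 0}   # "an even number is rolled"
T = {x for x in S if x > 2}        # "a number greater than two is rolled"

print(sorted(E), sorted(T))  # [2, 4, 6] [3, 4, 5, 6]
```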

A graphical representation of a sample space and events is a Venn diagram. In


general, the sample space S is represented by a rectangle, outcomes by points within the
rectangle, and events by ovals that enclose the outcomes that compose them.

A device that can be helpful in identifying all possible outcomes of a random


experiment, particularly one that can be viewed as proceeding in stages, is what is called a
tree diagram. It is described in the following example.


 EXAMPLE:
1. Construct a sample space that describes all three-child families according to
the genders of the children with respect to birth order.
Solution: Two of the outcomes are “two boys then a girl,” which we
might denote bbg, and “a girl then two boys,” which we would denote
gbb.
Clearly there are many outcomes, and when we try to list all of
them it could be difficult to be sure that we have found them all unless
we proceed systematically. The tree diagram gives a systematic
approach.

The diagram was constructed as follows. There are two possibilities for the first
child, boy or girl, so we draw two line segments coming out of a starting point, one ending
in a b for “boy” and the other ending in a g for “girl.” For each of these two possibilities for
the first child there are two possibilities for the second child, “boy” or “girl,” so from each of
the b and g we draw two line segments, one segment ending in a b and one in a g. For each
of the four ending points now in the diagram there are two possibilities for the third child,
so we repeat the process once more.

The line segments are called branches of the tree. The right ending point of each
branch is called a node. The nodes on the extreme right are the final nodes; to each one
there corresponds an outcome, as shown in the figure.

From the tree it is easy to read off the eight outcomes of the experiment, so the
sample space is, reading from the top to the bottom of the final nodes in the tree,

S = { bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg }
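The tree-diagram enumeration can be sketched in code: each child is 'b' or 'g', and itertools.product walks the branches in the same first/second/third order as the tree.

```python
from itertools import product

# Enumerate all three-child families by gender, in birth order.
S = [''.join(outcome) for outcome in product('bg', repeat=3)]

print(S)  # ['bbb', 'bbg', 'bgb', 'bgg', 'gbb', 'gbg', 'ggb', 'ggg']
```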


COUNTING RULE USEFUL IN PROBABILITY


Multiplication Rules - If an operation can be performed in n1 ways, and if for each of these
a second operation can be performed in n2 ways, and for each of the first two a third
operation can be performed in n3 ways, and so forth, then the sequence of k operations can
be performed in n1 · n2 · · · nk ways.
 EXAMPLES:
1. The design for a website is to use one of the four colors, a font from among
three, and three different positions for an image. Calculate the number of
web designs possible.
Solution: From the multiplication rule, 4 × 3 × 3 = 36 web designs are
possible.

Permutation relates to the act of arranging all the members of a set into some sequence or
order. nPr = n!/(n − r)!

 EXAMPLES:
1. Critical Miss, PSU's Tabletop Gaming Club, has 15 members this term. How
many ways can a slate of 3 officers consisting of a president, vice-president,
and treasurer be chosen?
Solution: In this case, repeats are not allowed since we don’t want the
same member to hold more than one position. The order matters, since if
you pick person 1 for president, person 2 for vice-president, and person 3
for treasurer, you would have different members in those positions than
if you picked person 2 for president, person 1 for vice-president, and
person 3 for treasurer. This is a permutation problem with n = 15 and r =
3.

15P3=15!/(15−3)!=15!/12!=2730

There are 2,730 ways to elect these three positions.

In general, if you were selecting items that involve rank, a position title,
1st, 2nd, or 3rd place or prize, etc. then the order in which the items are
arranged is important and you would use permutation.
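The two counting rules so far can be checked in a few lines. This sketch recomputes the web-design count with the multiplication rule and the officer-slate count with math.perm, then confirms the latter by brute-force enumeration:

```python
import math
from itertools import permutations

designs = 4 * 3 * 3            # multiplication rule: colors x fonts x positions

n, r = 15, 3
formula = math.perm(n, r)      # n!/(n - r)! = 15!/12! = 2730
brute = sum(1 for _ in permutations(range(n), r))  # count ordered slates directly

print(designs, formula, brute)  # 36 2730 2730
```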

 Types of PERMUTATION:
1. Permutation of Distinct Elements - The number of permutations of n different
elements is n!, where n! = n(n − 1)(n − 2)· · · 3 · 2 · 1.
2. Permutation of Subsets - The number of permutations of n distinct objects taken r
at a time is nPr = n!/(n − r)!.
3. Circular Permutation - The number of permutations of n objects arranged in a
circle is (n − 1)!.


4. Permutation of Similar Objects - The number of permutations of n = n1 + n2 + · · · +


nr objects of which n1 are of one type, n2 are of a second type, . . ., and nr are of an
rth type is n! / n1!n2! · · · nr!

Partitions – The number of ways of partitioning a set of n objects into r cells with n1
elements in the first cell, n2 elements in the second, and so forth, is
(n; n1, n2, …, nr) = n! / (n1! n2! · · · nr!)

 EXAMPLES:
1. In how many ways can 7 graduate students be assigned to 1 triple and 2
double hotel rooms during a conference?
Solution: The total number of possible partitions would be

(7; 3, 2, 2) = 7! / (3! 2! 2!) = 210
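The partition count n!/(n1! n2! ··· nr!) is a multinomial coefficient, which is easy to compute directly. A small sketch, applied to the hotel-room example:

```python
from math import factorial, prod

def multinomial(*cells):
    # Number of ways to partition sum(cells) objects into cells of the given sizes:
    # n! / (n1! * n2! * ... * nr!). Integer division is exact here.
    return factorial(sum(cells)) // prod(factorial(c) for c in cells)

print(multinomial(3, 2, 2))  # 210
```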

Combination is a way of selecting items from a collection, such that (unlike permutations)
the order of selection does not matter. nCr = n! / [r!(n − r)!]

 EXAMPLES:
1. Critical Miss, PSU's Tabletop Gaming Club, has 15 members this term. They
need to select 3 members to have keys to the game office. How many ways
can the 3 members be chosen?
Solution: In this case, repeats are not allowed, because we don’t want
one person to have more than one key. The order in which the keys are
handed out does not matter. This is a combination problem with n = 15
and r = 3.

15C3 = 15!/(3!(15−3)!) = 15!/(3!⋅12!) = 455

There are 455 ways to hand out the three keys.

We can use these counting rules in finding probabilities. For instance, the
probability of winning the lottery can be found using these counting
rules.
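As a sketch of that remark: math.comb reproduces the key-holder count, and the same function gives a lottery probability. The 6-of-42 game used here is a hypothetical example, not one named in the module.

```python
import math

keys = math.comb(15, 3)          # 15!/(3! * 12!) = 455
tickets = math.comb(42, 6)       # possible 6-number combinations in a 6/42 game
p_jackpot = 1 / tickets          # chance a single ticket matches all 6 numbers

print(keys, tickets)  # 455 5245786
```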


RULES OF PROBABILITY
 GENERAL PROBABILITY RULES
1. The probability of an impossible event is zero; the probability of a certain
event is one. Therefore, for any event A, the range of possible probabilities
is: 0 ≤ P(A) ≤ 1.
2. For S the sample space of all possibilities, P(S) = 1. That is, the sum of all the
probabilities for all possible events is equal to one. For example, if you have to
belong to one of three designated political parties,
then the sum of P(R), P(D), and P(I) is equal to one.
3. For any event A, P(Ac) = 1 - P(A). It follows then that P(A) = 1 - P(Ac)
4. This is the probability that either one or both events occur.
 If two events, say A and B, are mutually exclusive - that is A and B
have no outcomes in common - then P(A or B) = P(A) + P(B)
 If two events are NOT mutually exclusive, then P(A or B) = P(A) + P(B)
- P(A and B)
5. This is the probability that both events occur.
 P(A and B) = P(A) • P(B|A) or P(B) • P(A|B). Note: this straight line
symbol, |, does not mean divide! This symbol means "conditional" or
"given". For instance, P(A|B) means the probability that event A
occurs given event B has occurred.
 If A and B are independent - neither event influences or affects the
probability that the other event occurs - then P(A and B) = P(A)*P(B).
This particular rule extends to more than two independent events.
For example, P(A and B and C) = P(A)*P(B)*P(C)

6. Conditional probability: P(A|B) = P(A and B) / P(B), or P(B|A) = P(A and B) / P(A).
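A small illustration of the complement, addition, and conditional rules using equally likely outcomes on a die. The events A and B here are chosen only for the demo; they are not from the module.

```python
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # "an even number"
B = {4, 5, 6}        # "greater than three"

def P(event):
    # Classical probability: favorable outcomes / possible outcomes.
    return len(event) / len(S)

p_complement = 1 - P(A)              # Rule 3: P(A^c) = 1 - P(A)
p_union = P(A) + P(B) - P(A & B)     # Rule 4: P(A or B) for non-exclusive events
p_conditional = P(A & B) / P(B)      # P(A|B) = P(A and B) / P(B)

print(p_complement, p_union, p_conditional)
```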


LESSON 3.0
DISCRETE PROBABILITY DISTRIBUTION
Probability distribution as defined is an assignment of probabilities to the values of the
random variable. When we speak of random variable, it represents a quantitative
(numerical) variable that is measured or observed in an experiment. Therefore, you can
actually find a probability associated with that variable.
A Discrete Probability Distribution is a probability distribution that depicts the occurrence
of discrete (individually countable) outcomes, such as 1, 2, 3, yes, no, true, or false. The
binomial distribution, for example, is a discrete distribution that evaluates the probability of
a "yes" or "no" outcome occurring over a given number of trials, given the event's
probability in each trial—such as flipping a coin one hundred times and having the outcome
be "heads". It also counts occurrences that have countable or finite outcomes.

 TYPES OF DISCRETE PROBABILITY DISTRIBUTIONS


1. Binomial Probability Distribution – is one in which there is only a probability
of two outcomes. In this distribution, data are collected in one of two forms
after repetitive trials and classified into either success or failure. It generally
has a finite set of just two possible outcomes, such as zero or one. For
instance, flipping a coin gives you the list {Heads, Tails}.
2. Bernoulli Probability Distribution - are similar to binomial distributions
because there are two possible outcomes. One trial is conducted, so the
outcomes in a Bernoulli distribution are labeled as either a zero or one. A one
indicates success, and a zero means failure—one trial is called a Bernoulli
trial.
3. Multinomial Probability Distribution - occur when there is a probability of
more than two outcomes with multiple counts. For instance, say you have a
covered bowl with one green, one red, and one yellow marble. For your test,
you record the number of times you randomly choose each of the marbles
for your sample.
4. Poisson Distribution - is a discrete distribution that counts the frequency of
occurrences as integers, whose list {0, 1, 2, ...} can be infinite. For instance,
say you have a covered bowl with one red and one green marble, and your
chosen period is two minutes. Your test is to record whether you pick the
green or red marble, with the green indicating success. After each test, you
place the marble back in the bowl and record the results.

Discrete Probability Distributions are graphs of the outcomes of test results that are
finite, such as a value of 1, 2, 3, true, false, success, or failure. Investors use discrete
probability distributions to estimate the chances that a particular investing outcome is
more or less likely to happen. Armed with that information, they can choose a hedging
strategy that matches the probabilities found in their analysis.


RANDOM VARIABLES AND THEIR PROBABILITY DISTRIBUTIONS


Suppose that to each point of a sample space we assign a number. We then have a function
defined on the sample space. This function is called a Random Variable (or stochastic
variable) or more precisely a random function (stochastic function). It is usually denoted by
a capital letter such as X or Y. In general, a random variable has some specified physical,
geometrical, or other significance.
A Random Variable that takes on a finite or countably infinite number of values is called a
discrete random variable while one which takes on a noncountably infinite number of values
is called a nondiscrete random variable.

 EXAMPLE:
1. Suppose that a coin is tossed twice so that the sample space is S = {HH, HT,
TH, TT}. Let X represent the number of heads that can come up. With each
sample point we can associate a number for X as shown in table below. Thus,
for example, in the case of HH (i.e., 2 heads), X = 2 while for TH (1 head), X =
1. It follows that X is a random variable.

Sample Point HH HT TH TT
X 2 1 1 0

It should be noted that many other random variables could also be


defined on this sample space, for example, the square of the number of
heads or the number of heads minus the number of tails.
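The two-coin example is just a mapping from sample points to numbers, which a dictionary makes explicit. A minimal sketch:

```python
# Map each sample point of the two-coin toss to X = number of heads.
sample_points = ['HH', 'HT', 'TH', 'TT']
X = {s: s.count('H') for s in sample_points}

print(X)  # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
```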

PROBABILITY DISTRIBUTION FUNCTIONS


The Probability Density Function (PDF) is a function that describes the probability of a
continuous random variable taking on a certain value. It is a mathematical function that
describes the probability that a random variable will fall within a certain range of values.
The Probability Mass Function (PMF) (or frequency function) is a function that describes
the probability of a discrete random variable taking on a certain value. It is a mathematical
function that describes the probability that a random variable will take on a specific value
rather than falling within a range of values. Probability Mass Functions can be represented
numerically with a table, graphically with a histogram, or analytically with a formula.
The Cumulative Distribution Function (CDF) is a function that describes the probability
that a random variable (continuous or discrete) will take on a value less than or equal to a
certain value. It is a mathematical function that describes the probability that a random
variable will fall within a certain range of values, up to and including a specific value.


 EXAMPLE for PDF:


1. If the probability density of a random variable is given by f(x) = K(1 − x²) for
0 < x < 1, and f(x) = 0 otherwise, find the value of K.
Solution: ∫ from −∞ to ∞ of f(x) dx = 1
∫ from 0 to 1 of K(1 − x²) dx = 1
K[x − x³/3] from 0 to 1 = 1
K(1 − 1/3) = 1
K = 3/2
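The value K = 3/2 can be checked numerically. This sketch (not part of the module) approximates the integral of f(x) = K(1 − x²) on (0, 1) with a midpoint Riemann sum and confirms it is close to 1:

```python
K = 3 / 2
n = 100_000
h = 1 / n  # width of each subinterval

# Midpoint Riemann sum of K*(1 - x^2) over (0, 1).
total = sum(K * (1 - ((i + 0.5) * h) ** 2) * h for i in range(n))

print(round(total, 6))  # 1.0
```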

 EXAMPLE for PMF to CDF:


1. Toss three coins. Let X represent the number of heads that can come up.
Solution: Random variable = X = Number of heads

X= 0 1 2 3
f(x) = 1/8 3/8 3/8 1/8
f(x) is the Probability Mass Function.

X = xi 0 1 2 3
f(xi) = P(X = xi) 1/8 3/8 3/8 1/8
0 if x < 0
1/8 if 0 ≤ x < 1
F(x) = 4/8 if 1 ≤ x < 2 *where F(x) is the Cumulative
7/8 if 2 ≤ x < 3 Distribution Function
8/8 if 3 ≤ x < ∞
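The CDF values above are just running sums of the PMF. A short sketch of the three-coin example, using exact fractions so 1/8 + 3/8 + 3/8 + 1/8 accumulates without rounding:

```python
from fractions import Fraction
from itertools import accumulate

# PMF of X = number of heads in three coin tosses, for X = 0, 1, 2, 3.
pmf = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
cdf = list(accumulate(pmf))  # running sums give F(0), F(1), F(2), F(3)

print(cdf)  # [Fraction(1, 8), Fraction(1, 2), Fraction(7, 8), Fraction(1, 1)]
```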

 EXAMPLE for CDF to PMF:


1. 0 if x < -2
F(x) = 0.2 if -2 ≤ x < 0
0.7 if 0 ≤ x < 2
1 if 2 ≤ x
Solution:
f(x1) = f(-2) = 0.2
f(x1) + f(x2) = f(-2) + f(0) = 0.7
f(0) = 0.7-0.2
f(0) = 0.5
f(x1) + f(x2) + f(x3) = f(-2) + f(0) + f(2) = 1
f(2) = 1 – 0.7
f(2) = 0.3
X= -2 0 2
f(x) = 0.2 0.5 0.3


DISCRETE PROBABILITY DISTRIBUTION FUNCTIONS - A discrete random


variable assumes each of its values with a certain probability. The probability distribution of
a random variable X is a description of the probabilities associated with the possible values
of X. For a discrete random variable, the distribution is often specified by just a list of the
possible values along with the probability of each. In some cases, it is convenient to express
the probability in terms of a formula called the probability mass function.
 For a discrete random variable X with possible values x1, x2, . . . , xn, a probability
mass function is a function f(x) such that:
1. f(xi) ≥ 0
2. ∑ f(xi) = 1
3. f(xi) = P[X = xi]

CUMULATIVE DISTRIBUTION FUNCTIONS


An alternate method for describing a random variable’s probability distribution is with
cumulative probabilities such as P[X ≤ x]. Furthermore, cumulative probabilities can be
used to find the probability mass function of a discrete random variable. It is denoted as
F(x), then the cumulative distribution function of a discrete random variable X is

F(x) = P[X ≤ x] = ∑ over xi ≤ x of P[X = xi]

 The cumulative distribution function F(x) of a discrete random variable satisfies


the following properties:
1. F(x) = ∑ over xi ≤ x of f(xi)
2. 0 ≤ F(x) ≤ 1
3. If a ≤ b then F(a) ≤ F(b)

 EXAMPLE:
1. Determine the probability mass function of X from the cumulative
distribution function:
F(x) = 0, x < −2
0.2, −2 ≤ x < 0
0.7, 0 ≤ x < 2
1, 2 ≤ x
Solution: The domain of the probability mass function consists of the included
endpoints of each interval, x = −2, 0, 2. The value of f(x) at each x is
determined by f(xi) = F(xi) − F(xi−1) for i = 2, 3, and f(x1) is taken
to be equal to F(x1).
f(x1) = f(−2) = F(x1) = F(−2) = 0.2
f(x2) = f(0) = F(x2) − F(x1) = F(0) − F(−2) = 0.7 − 0.2 = 0.5
f(x3) = f(2) = F(x3) − F(x2) = F(2) − F(0) = 1 − 0.7 = 0.3
Therefore, f(x) = 0.2, x = −2
0.5, x = 0
0.3, x = 2


EXPECTED VALUES OF RANDOM VARIABLE


The mean or expected value of the discrete random variable X with probability mass
function f(x), denoted as µX or E[X], is

µX = E[X] = ∑ over x of x f(x)

 EXAMPLE:
1. A salesperson for a medical device company has two appointments on a
given day. At the first appointment, he believes that he has a 70% chance to
make the deal, from which he can earn $1000 commission if successful. On
the other hand, he thinks he only has a 40% chance to make the deal at the
second appointment, from which, if successful, he can make $1500. What is
his expected commission based on his own probability belief? Assume that
the appointment results are independent of each other.
Solution: Let Y denote the total commission of the salesperson in the two
appointments. The table below summarizes his total commission and
the associated probabilities (in parentheses).

y $2500 (0.28) $1000 (0.42) $1500 (0.12) $0 (0.18)

His expected commission is µ = E[Y] = 2500(0.28) + 1000(0.42) +
1500(0.12) + 0(0.18) = 1300.
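The expected-value sum can be sketched in a few lines. The outcome probabilities come from the independence assumption in the problem (0.7 and 0.4 chances of closing the two deals):

```python
# Total commission y mapped to its probability: both deals, first only,
# second only, neither.
outcomes = {
    2500: 0.7 * 0.4,   # 0.28
    1000: 0.7 * 0.6,   # 0.42
    1500: 0.3 * 0.4,   # 0.12
    0:    0.3 * 0.6,   # 0.18
}

expected = sum(y * p for y, p in outcomes.items())  # E[Y] = sum of y * f(y)
print(expected)  # ≈ 1300
```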

THE BINOMIAL DISTRIBUTION


Many processes can be thought of as consisting of a sequence of Bernoulli trials, such as,
for example, the repeated tossing of a coin or the repeated examination of objects to
determine whether or not they are defective. In such cases, a random variable of interest is
the number of successes obtained within a fixed number of trials n, where a success is
defined in an appropriate manner. Such a random variable is called a Binomial Random
Variable. If the binomial random variable X is the number x of trials that result in a success
in a Bernoulli process having n trials, the probability mass function of X is

f(x) = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, 2, . . . , n.

The binomial random variable is probably the most important of all discrete probability
distributions. Its probability distribution is called a binomial distribution. The mean µ and
variance σ of the binomial random variable X with parameters n and p, the number of
trials and the probability of a success, respectively, are

µ = E[X] = np

σ² = V[X] = np(1 − p)


 EXAMPLE:
1. Each sample of water has a 10% chance of containing a particular organic
pollutant. Assume that the samples are independent with regard to the
presence of the pollutant. a) Find the probability that in the next 18 samples,
exactly 2 contain the pollutant. b) Find the probability that 3 to 5 of the 20
samples contain the pollutant. c) Find the mean and standard deviation of
the number of pollutants in 16 samples.
Solution:
a) Let X be the number of samples that contain the pollutant in the next
18 samples analyzed. X is a binomial random variable with p = 0.1 and n
= 18. Therefore, P[X = 2] = C(18, 2)(0.1)^2(0.9)^16 = 0.2835.
b) Here n = 20. The required probability is P[3 ≤ X ≤ 5].
P[3 ≤ X ≤ 5] = P[X = 3] + P[X = 4] + P[X = 5]
= C(20, 3)(0.1)^3(0.9)^17 + C(20, 4)(0.1)^4(0.9)^16 + C(20, 5)(0.1)^5(0.9)^15 = 0.3118
c) Here n = 16. µ = np = 16(0.1) = 1.6
σ = √σ² = √(np(1 − p)) = √(16(0.1)(0.9)) = 1.2
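All three parts can be reproduced with the binomial pmf f(x) = C(n, x) p^x (1 − p)^(n−x). A sketch:

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    # Binomial pmf: C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

part_a = binom_pmf(2, 18, 0.1)                            # P[X = 2], n = 18
part_b = sum(binom_pmf(x, 20, 0.1) for x in range(3, 6))  # P[3 <= X <= 5], n = 20
mean = 16 * 0.1                                           # np, n = 16
sd = sqrt(16 * 0.1 * 0.9)                                 # sqrt(np(1 - p))

print(round(part_a, 4), round(part_b, 4))  # 0.2835 0.3118
```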

THE POISSON DISTRIBUTION


The number X of outcomes occurring during a Poisson experiment is called a Poisson
random variable, and its probability distribution is called the Poisson distribution.
The probability mass function of the Poisson random variable X, representing the number
of outcomes occurring in a given time interval or specified region denoted by t, is

f(x) = e^(−λt) (λt)^x / x!, x = 0, 1, 2, . . .

where λ is the average number of outcomes per unit time, distance, area, or volume.

 EXAMPLE:
1. Ten is the average number of oil tankers arriving each day at a certain port.
The facilities at the port can handle at most 15 tankers per day. a) What is the
probability of finding 8 oil tankers on a given day? b) What is the probability
that on a given day tankers have to be turned away?
Solution:
a) We are given λ = 10 (oil tankers per day), so we take t = 1 (day).
f(x = 8) = e^(−10) (10)^8 / 8! = 0.1126
b) Tankers will be turned away if the number of tankers exceeds the
port’s capacity of 15. Thus, the probability we seek is P[X > 15].
P[X > 15] = 1 − P[X ≤ 15]
= 1 − ∑ from x = 0 to 15 of e^(−10) (10)^x / x! = 1 − 0.9513 = 0.0487
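Both answers follow from the Poisson pmf f(x) = e^(−λt)(λt)^x / x!. A sketch of the tanker example:

```python
from math import exp, factorial

def poisson_pmf(x, lam_t):
    # Poisson pmf: e^(-lambda*t) * (lambda*t)^x / x!
    return exp(-lam_t) * lam_t**x / factorial(x)

p_eight = poisson_pmf(8, 10)                                    # part a
p_turned_away = 1 - sum(poisson_pmf(x, 10) for x in range(16))  # part b: P[X > 15]

print(round(p_eight, 4), round(p_turned_away, 4))  # 0.1126 0.0487
```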


LESSON 4.0
CONTINUOUS PROBABILITY DISTRIBUTION
Physical quantities such as time, length, area, temperature, pressure, load, intensity, etc.,
when they need to be described probabilistically, are modeled by continuous random
variables.

 TYPES OF CONTINUOUS PROBABILITY DISTRIBUTIONS


1. Normal Probability Distribution – sometimes called the bell curve (or De
Moivre distribution), this distribution occurs naturally in many situations.
The bell curve is symmetrical: half of the data fall to the left of the mean and
half fall to the right.
2. Exponential Probability Distribution - (also called the negative exponential
distribution) is a probability distribution that describes time between events in a
Poisson process.

RANDOM VARIABLES AND THEIR PROBABILITY DISTRIBUTIONS


A continuous random variable is a function whose range is an interval of real numbers.
When a sample space has an infinite number of sample points, the associated random
variable is continuous with its values distributed over one or more intervals on the real
number line.

 The function f(x) is a probability density function of the continuous random
variable X defined over the set of real numbers if:
1. f(x) ≥ 0 for all x
2. ∫_{−∞}^{∞} f(x) dx = 1
3. P[a ≤ X ≤ b] = ∫_a^b f(x) dx

 EXAMPLE:
1. Consider the function f(x) = (1/3)x² for −1 < x < 2, and f(x) = 0 elsewhere. a) Show that it is a
probability density function of some continuous random variable X. b)
Determine P[0 < X < 1].

21
EDA 11 - ENGINEERING DATA ANALYSIS

Solution:
a) We show that Properties (1) and (2) are satisfied. 1) Clearly, (1/3)x² ≥ 0
for all real numbers x. 2) We must show that
∫_{−∞}^{∞} f(x) dx = ∫_{−1}^{2} (1/3)x² dx = [x³/9]_{−1}^{2} = (8 + 1)/9 = 1
b) P[0 < X < 1] = ∫_0^1 (1/3)x² dx = 1/9

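Both properties can be confirmed numerically (a sketch using a simple midpoint-rule integrator from the standard library; the function names are ours):

```python
def f(x):
    # the density (1/3)x^2 on (-1, 2), 0 elsewhere
    return x**2 / 3 if -1 < x < 2 else 0.0

def integrate(g, a, b, n=100_000):
    # midpoint-rule numerical integration of g over [a, b]
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, -1, 2)   # should be 1 (Property 2)
p_0_1 = integrate(f, 0, 1)    # P[0 < X < 1] = 1/9

print(round(total, 4), round(p_0_1, 4))  # 1.0 0.1111
```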
If X is a continuous random variable with probability density function f(x), the cumulative
distribution function F(x) is defined as

F(x) = P[X ≤ x] = ∫_{−∞}^{x} f(t) dt

 The cumulative distribution function has the following properties:


1. lim_{x→−∞} F(x) = 0
2. lim_{x→∞} F(x) = 1
3. P[a < X ≤ b] = F(b) − F(a)

 EXAMPLE:
1. Suppose that for some continuous random variable X, F(x) = (x⁴ − 1)/80
for 1 ≤ x ≤ 3. a) What is the probability that X assumes a value between 1.2
and 2.6? b) Find the density function and use it to compute P[1.2 < X < 2.6].
Solution:
a) We apply Property 3 to compute P[1.2 < X < 2.6].
P[1.2 < X < 2.6] = F(2.6) − F(1.2)
= (2.6⁴ − 1)/80 − (1.2⁴ − 1)/80 = 0.5453
b) f(x) = F′(x) = (1/80)(4x³) = x³/20, for 1 ≤ x ≤ 3
P[1.2 < X < 2.6] = ∫_{1.2}^{2.6} (x³/20) dx = 0.5453

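A quick check that the CDF route and the density route agree (a sketch; the midpoint-rule integral stands in for the exact antiderivative):

```python
def F(x):
    # cumulative distribution function F(x) = (x^4 - 1)/80 on [1, 3]
    return (x**4 - 1) / 80

def f(x):
    # density f(x) = F'(x) = x^3/20
    return x**3 / 20

# a) via the CDF
p_cdf = F(2.6) - F(1.2)

# b) via the density, using a midpoint-rule integral
n = 100_000
h = (2.6 - 1.2) / n
p_pdf = sum(f(1.2 + (i + 0.5) * h) for i in range(n)) * h

print(round(p_cdf, 4), round(p_pdf, 4))  # 0.5453 0.5453
```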
22
EDA 11 - ENGINEERING DATA ANALYSIS

EXPECTED VALUES OF CONTINUOUS RANDOM VARIABLES


Let X be a continuous random variable with density f(x). The mean or expected value of X,
denoted µ_X or E[X], and the variance of X, denoted σ²_X or V[X], are defined as
µ_X = E[X] = ∫_{−∞}^{∞} x f(x) dx
σ²_X = V[X] = ∫_{−∞}^{∞} (x − µ_X)² f(x) dx

 EXAMPLE:
1. Consider the function f(x) = (1/3)x² for −1 < x < 2, and f(x) = 0 elsewhere. Compute the mean and
variance of the random variable.
Solution:
µ = ∫_{−1}^{2} x · (1/3)x² dx = (1/3) ∫_{−1}^{2} x³ dx = 5/4
σ² = ∫_{−1}^{2} x² · (1/3)x² dx − µ² = 51/80

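The mean and variance above can be reproduced numerically (a sketch; integrate is our own midpoint-rule helper):

```python
def f(x):
    return x**2 / 3  # density on (-1, 2)

def integrate(g, a=-1.0, b=2.0, n=100_000):
    # midpoint-rule numerical integration over the support
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

mu = integrate(lambda x: x * f(x))               # E[X] = 5/4
var = integrate(lambda x: x**2 * f(x)) - mu**2   # V[X] = 51/80

print(round(mu, 4), round(var, 4))  # 1.25 0.6375
```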
NORMAL DISTRIBUTION
Undoubtedly, the most widely used model for a continuous measurement is a normal
random variable and its distribution, normal distribution, is the most important
continuous probability distribution.

 EXAMPLE:
1. The time X until recharge for a battery in a laptop computer under common
conditions is normally distributed with µ = 260 minutes and σ = 50 minutes.
Find the probability that a fully charged laptop lasts a) anywhere from 3 to 4
hours; b) less than 270 minutes; c) longer than 300 minutes.
Solution: The density of X is (1/(50√(2π))) e^(−(x−260)²/(2·50²)).
a) 3 to 4 hours is 180 to 240 minutes, so
P[180 < X < 240] = ∫_{180}^{240} (1/(50√(2π))) e^(−(x−260)²/(2·50²)) dx = 0.2898
b) P[X < 270] = ∫_{−∞}^{270} (1/(50√(2π))) e^(−(x−260)²/(2·50²)) dx = 0.5793
c) P[X > 300] = ∫_{300}^{∞} (1/(50√(2π))) e^(−(x−260)²/(2·50²)) dx = 0.2119

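These normal probabilities can be evaluated without tables by building the standard normal CDF from the error function (a sketch; Phi is our own helper built on math.erf):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 260, 50
z = lambda x: (x - mu) / sigma

p_a = Phi(z(240)) - Phi(z(180))  # a) 3 to 4 hours = 180 to 240 minutes
p_b = Phi(z(270))                # b) less than 270 minutes
p_c = 1 - Phi(z(300))            # c) longer than 300 minutes

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))  # 0.2898 0.5793 0.2119
```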

EXPONENTIAL DISTRIBUTION
The random variable X, the distance between successive events from a Poisson process
with mean number of events λ > 0 per unit distance, is an exponential random variable
with parameter λ. The probability density function of X is
f(x) = λe^(−λx), x > 0,  with  µ = 1/λ  and  σ² = 1/λ²

 EXAMPLE:
1. The lifetime of a mechanical assembly in a vibration test is exponentially
distributed with a mean of 400 hours. a) What is the probability that an
assembly on test fails in less than 100 hours? b) What is the probability that
an assembly operates for more than 500 hours before failure?
Solution:
a) Let X be the time before a mechanical assembly in a vibration test
fails, measured in hours. The mean lifetime is µ = 400 hours, so λ = 1/400.
P[X < 100] = F(100) = 1 − e^(−100/400) = 0.2212
b) P[X > 500] = e^(−500/400) = 0.2865

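Both exponential probabilities follow directly from the CDF F(x) = 1 − e^(−λx) (a quick stdlib check):

```python
from math import exp

lam = 1 / 400  # rate λ, from mean lifetime 1/λ = 400 hours

p_fail_early = 1 - exp(-lam * 100)  # a) P[X < 100] = F(100)
p_survive = exp(-lam * 500)         # b) P[X > 500] = 1 - F(500)

print(round(p_fail_early, 4), round(p_survive, 4))  # 0.2212 0.2865
```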

LESSON 5.0  JOINT PROBABILITY DISTRIBUTION
It is often useful to have more than one random variable defined in a random experiment.
For example, the continuous random variable X can denote the length of one dimension of
an injection-molded part, and the continuous random variable Y might denote the length of
another dimension. We might be interested in probabilities that can be expressed in terms
of both X and Y.
In general, if X and Y are two random variables, the probability distribution that defines
their simultaneous behavior is called a Joint Probability Distribution. In this chapter, we
investigate some important properties of these joint distributions.

TWO RANDOM VARIABLES


For simplicity, we begin by considering random experiments in which only two random
variables are studied.

JOINT PROBABILITY FUNCTION - If X and Y are discrete random variables, the


joint probability distribution of X and Y is a description of the set of points (x, y) in the range
of (X, Y) along with the probability of each point. Also, P[X = x and Y = y] is usually written
as P[X = x, Y = y]. The joint probability distribution of two random variables is sometimes
referred to as the bivariate probability distribution or bivariate distribution of the
random variables. One way to describe the joint probability distribution of two discrete
random variables is through a joint probability mass function f(x, y).
 The joint probability mass function of the discrete random variables X and Y,
denoted f(x, y), satisfies:
1. 𝑓(𝑥, 𝑦) ≥ 0
2. ∑_x ∑_y f(x, y) = 1
3. 𝑓 (𝑥, 𝑦) = 𝑃[𝑋 = 𝑥, 𝑌 = 𝑦]

 EXAMPLE:
1. Two ballpoint pens are selected at random from a box that contains 3 blue
pens, 2 red pens, and 3 green pens. Let X be the number of blue pens
selected and Y be the number of red pens selected. a) Find the joint
probability mass function f(x,y). b) Find the probability of selecting at least
one green pen at random.
Solution:
a) Clearly, x = 0, 1, 2 and y = 0, 1, 2 with the restriction 0 ≤ x + y ≤ 2 since
two ballpoint pens are selected. The possible pairs of values (x, y) are (0,
0), (0, 1), (0, 2), (1, 0), (1, 1) and (2, 0). f(1, 0) represents the probability
that 1 blue ballpen and no red ballpen are selected. The probability mass
function is


f(x, y) = C(3, x) C(2, y) C(3, 2 − x − y) / C(8, 2),  x = 0, 1, 2;  y = 0, 1, 2;  0 ≤ x + y ≤ 2

To check that the function f(x, y) is correct, verify Properties 1 and 2: every entry is
nonnegative, and the probabilities over all mass points (x, y) sum to 1, as shown in the
table below.

            y = 0    y = 1    y = 2    Total
  x = 0      3/28     3/14     1/28     5/14
  x = 1      9/28     3/14      0      15/28
  x = 2      3/28      0        0       3/28
  Total     15/28     3/7      1/28      1
b) Let G1 and G2 be the events of selecting exactly 1 and exactly 2 green pens.
P[G1 ∪ G2] = P[G1] + P[G2]
= [f(1, 0) + f(0, 1)] + f(0, 0)
= 9/28 + 6/28 + 3/28 = 9/14

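The table and the answer to (b) can be regenerated from the hypergeometric-style pmf (a sketch; f is our own helper):

```python
from math import comb

def f(x, y):
    # joint pmf: choose x of 3 blue, y of 2 red, the rest green, out of C(8,2) pairs
    g = 2 - x - y  # green pens selected
    if g < 0:
        return 0.0
    return comb(3, x) * comb(2, y) * comb(3, g) / comb(8, 2)

total = sum(f(x, y) for x in range(3) for y in range(3))

# at least one green pen means x + y < 2
p_green = f(0, 0) + f(1, 0) + f(0, 1)

print(round(total, 6), round(p_green, 4))  # 1.0 0.6429
```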
 The joint probability density function of the continuous random variables X and
Y, denoted f(x, y), satisfies:
1. f(x, y) ≥ 0
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
3. P[(X, Y) ∈ R] = ∬_R f(x, y) dx dy for any region R in the xy-plane.

 EXAMPLE:
1. Let X and Y be continuous random variables with joint density function
f(x, y) = (1/36)x²y, −1 ≤ x ≤ 2, 1 ≤ y ≤ 5. a) Calculate
P[X ≥ 0, 1 ≤ Y < 4]. b) Calculate P[X ≥ 1, X ≤ Y < 4]. c) Calculate
P[−1 < X < (1/2)(Y − 1), 1 ≤ Y ≤ 5].
Solution:
a) The probability we seek is P[0 ≤ X ≤ 2, 1 ≤ Y < 4] = ∫_0^2 ∫_1^4 (1/36)x²y dy dx
We first compute the inner integral, leaving the variable x ‘untouched’
since the variable of integration is y: ∫_1^4 (1/36)x²y dy = (5/24)x²
We now compute the outer integral: ∫_0^2 (5/24)x² dx = 5/9


b) P[X ≥ 1, X ≤ Y < 4] = P[1 ≤ X ≤ 2, X ≤ Y < 4]
= ∫_1^2 ∫_x^4 (1/36)x²y dy dx = ∫_1^2 ( ∫_x^4 (1/36)x²y dy ) dx
= ∫_1^2 (1/36)x²(8 − x²/2) dx = ∫_1^2 [(2/9)x² − (1/72)x⁴] dx
= 467/1080
c) Inserting the limits of integration, we realize that the double integral
cannot be evaluated in the order written, because the limit on x depends on y:
∫_{−1}^{(1/2)(y−1)} ∫_1^5 (1/36)x²y dy dx
So we switch the limits of integration and the differentials. The order of
integration is with respect to x first, then with respect to y:
∫_1^5 ∫_{−1}^{(1/2)(y−1)} (1/36)x²y dx dy = ∫_1^5 ( ∫_{−1}^{(1/2)(y−1)} (1/36)x²y dx ) dy
= ∫_1^5 (1/108) y [(1/8)(y − 1)³ + 1] dy
= ∫_1^5 [(1/864) y(y − 1)³ + (1/108) y] dy = 19/45

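All three double integrals can be approximated numerically (a sketch; double_integral is our own midpoint-rule helper, and for part (c) the region is rewritten in dy dx order with a y-lower limit of max(1, 2x + 1)):

```python
def f(x, y):
    return x * x * y / 36  # joint density on -1 ≤ x ≤ 2, 1 ≤ y ≤ 5

def double_integral(xa, xb, ylo, yhi, n=400):
    # midpoint rule; the y-limits may depend on x for non-rectangular regions
    hx = (xb - xa) / n
    total = 0.0
    for i in range(n):
        x = xa + (i + 0.5) * hx
        a, b = ylo(x), yhi(x)
        hy = (b - a) / n
        total += sum(f(x, a + (j + 0.5) * hy) for j in range(n)) * hy * hx
    return total

p_a = double_integral(0, 2, lambda x: 1.0, lambda x: 4.0)  # 5/9
p_b = double_integral(1, 2, lambda x: x, lambda x: 4.0)    # 467/1080
# part c: -1 < x < (y-1)/2, 1 ≤ y ≤ 5, expressed as y > max(1, 2x+1)
p_c = double_integral(-1, 2, lambda x: max(1.0, 2 * x + 1), lambda x: 5.0)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))  # 0.5556 0.4324 0.4222
```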
MARGINAL PROBABILITY DISTRIBUTION - If more than one random variable is


defined in a random experiment, it is important to distinguish between the joint probability
distribution of X and Y and the probability distribution of each variable individually. The
individual probability distribution of a random variable is referred to as its Marginal
Probability Distribution.
 If f(x, y) is the joint probability mass function of the discrete random variables X
and Y, the marginal probability mass functions of the random variables are:
1. f_X(x) = ∑_y f(x, y)
2. f_Y(y) = ∑_x f(x, y)
 If f(x, y) is the joint probability density function of the continuous random
variables X and Y, the marginal probability density functions of the random variables
are:
1. f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
2. f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx

 EXAMPLE:
1. a) Obtain the marginal probability density function of X for the joint
density f(x, y) = (1/36)x²y, −1 ≤ x ≤ 2, 1 ≤ y ≤ 5, of the previous example,
and b) verify Property 2 of a probability density function.


Solution:
a) f_X(x) = ∫_1^5 (1/36)x²y dy = (1/36)x² [y²/2]_1^5 = (1/3)x², −1 ≤ x ≤ 2
b) ∫_{−1}^{2} (1/3)x² dx = [x³/9]_{−1}^{2} = 1

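As a sanity check, the closed-form marginal f_X(x) = x²/3 can be recovered by integrating the joint density over y numerically (a sketch; the helper names are ours):

```python
def joint(x, y):
    return x * x * y / 36  # joint density on -1 ≤ x ≤ 2, 1 ≤ y ≤ 5

def marginal_x(x, n=1000):
    # midpoint-rule integral of the joint density over y from 1 to 5
    h = 4 / n
    return sum(joint(x, 1 + (i + 0.5) * h) for i in range(n)) * h

# compare with the closed form f_X(x) = x^2/3 at a few points
for x in (-0.5, 0.5, 1.0, 1.5):
    print(round(marginal_x(x), 6), round(x * x / 3, 6))
```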
CONDITIONAL PROBABILITY DISTRIBUTION - When two random variables are


defined in a random experiment, knowledge of one can change the probabilities that we
associate with the values of the other. Consequently, the random variables X and Y are
expected to be dependent. Knowledge of the value obtained for Y changes the probabilities
associated with the values of X.

 Let X and Y be continuous random variables. The conditional probability density
function of Y given X = x, denoted P[Y | X = x], is:
P[Y | X = x] = f_{Y|x}(y) = f(x, y) / f_X(x),  f_X(x) > 0
 Because f_{Y|x}(y) is a density function, the following properties are satisfied:
1. f_{Y|x}(y) ≥ 0
2. ∫_{−∞}^{∞} f_{Y|x}(y) dy = 1
3. P[a < Y < b | X = x] = ∫_a^b f_{Y|x}(y) dy

 EXAMPLE:
1. Two ballpoint pens are selected at random from a box that contains 3 blue
pens, 2 red pens, and 3 green pens. Let X be the number of blue pens
selected and Y be the number of red pens selected. Compute 𝑃[𝑋 = 1|𝑌 =
0] and 𝑃[𝑋 = 1|𝑌 = 1].
Solution:
P[X = 1 | Y = 0] = P[X = 1, Y = 0]/P[Y = 0] = f(1, 0)/f_Y(0) = (9/28)/(15/28) = 3/5
P[X = 1 | Y = 1] = P[X = 1, Y = 1]/P[Y = 1] = f(1, 1)/f_Y(1) = (3/14)/(3/7) = 1/2

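The two conditional probabilities can be computed directly from the joint pmf and the marginal of Y (a sketch; f and f_Y are our own helpers):

```python
from math import comb

def f(x, y):
    # joint pmf from the ballpoint-pen example
    g = 2 - x - y
    return comb(3, x) * comb(2, y) * comb(3, g) / comb(8, 2) if g >= 0 else 0.0

def f_Y(y):
    # marginal pmf of Y: sum the joint pmf over x
    return sum(f(x, y) for x in range(3))

p1 = f(1, 0) / f_Y(0)  # P[X = 1 | Y = 0] = 3/5
p2 = f(1, 1) / f_Y(1)  # P[X = 1 | Y = 1] = 1/2

print(round(p1, 4), round(p2, 4))  # 0.6 0.5
```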
MORE THAN TWO RANDOM VARIABLES - More than two random variables can
be defined in a random experiment. Results for multiple random variables are
straightforward extensions of those for two random variables. A summary for the
continuous random variables is provided here. For the discrete case, simply replace the
integral with a summation.


 A probability density function f(x1, x2, . . . , xn) for the continuous random
variables X1, X2, . . . , Xn has the following properties:
1. f(x₁, x₂, . . . , xₙ) ≥ 0
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x₁, x₂, . . . , xₙ) dx₁ dx₂ ··· dxₙ = 1
3. For any region R of n-dimensional space,
P[(X₁, X₂, . . . , Xₙ) ∈ R] = ∫···∫_R f(x₁, x₂, . . . , xₙ) dx₁ dx₂ ··· dxₙ

 EXAMPLE:
1. Suppose that X1, X2, and X3 represent the thickness in micrometers of a
substrate, an active layer, and a coating layer of a chemical product,
respectively. Assume that the random variables are independent and
normally distributed with µ1 = 10000, µ2 = 1000, µ3 = 80, σ1 = 250, σ2 = 20, and
σ3 = 4, respectively. The specifications for the thickness of the substrate,
active layer, and coating layer are 9200 < x₁ < 10800, 950 < x₂ < 1050, and 75 <
x₃ < 85, respectively. a) What proportion of chemical products meets all
thickness specifications? b) Which one of the three thicknesses has the least
probability of meeting specifications?
Solution:
a) The requested probability is
𝑃[9200 < 𝑋 < 10800, 950 < 𝑋 < 1050, 75 < 𝑋 < 85]
Because the random variables are independent,
𝑃[9200 < 𝑋 < 10800, 950 < 𝑋 < 1050, 75 < 𝑋 < 85]
= 𝑃[9200 < 𝑋 < 10800]𝑃[950 < 𝑋 < 1050]𝑃[75 < 𝑋 < 85]
= P[−3.2 < Z < 3.2] P[−2.5 < Z < 2.5] P[−1.25 < Z < 1.25]
after standardizing each variable. From Appendix A-1, the desired probability is
(0.998626)(0.987581)(0.788701) = 0.777836
b) The thickness of the coating layer has the least probability of meeting
specifications. Consequently, a priority should be to reduce variability in
this part of the process.

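The three standardized probabilities and their product can be checked with an erf-based normal CDF (a sketch; Phi and p_within are our own helpers):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_within(lo, hi, mu, sigma):
    # P[lo < X < hi] for a normal random variable, after standardizing
    return Phi((hi - mu) / sigma) - Phi((lo - mu) / sigma)

p1 = p_within(9200, 10800, 10000, 250)  # substrate
p2 = p_within(950, 1050, 1000, 20)      # active layer
p3 = p_within(75, 85, 80, 4)            # coating layer

print(round(p1 * p2 * p3, 4))  # 0.7778
print(min(p1, p2, p3) == p3)   # True: the coating layer is least likely to conform
```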

LINEAR FUNCTIONS OF RANDOM VARIABLES


A random variable is sometimes defined as a function of one or more random variables. For
example, if the random variables X1 and X2 denote the length and width, respectively, of a
manufactured part, Y = 2X1 + 2X2 is a random variable that represents the perimeter of the
part.
Given random variables X₁, X₂, …, Xₖ and constants a₁, a₂, …, aₖ,
Y = a₁X₁ + a₂X₂ + ··· + aₖXₖ
is a Linear Combination of X₁, X₂, …, Xₖ.
Let X₁, X₂, …, Xₙ be independent and identically distributed random variables with E[Xᵢ] = µ
and V[Xᵢ] = σ². Define the random variable X̄ by X̄ = (X₁ + X₂ + ··· + Xₙ)/n.

 The expected value and variance of X̄ are:

1. E[X̄] = µ
2. V[X̄] = σ²/n

 EXAMPLE:
1. An automated filling machine fills soft-drink cans. The mean fill volume is
12.1 fluid ounces, and the standard deviation is 0.1 oz. Assume that the fill
volumes of the cans are independent, normal random variables. What is the
probability that the average volume of 10 cans selected from this process is
less than 12 oz?
Solution: Let X₁, X₂, …, X₁₀ denote the fill volumes of the 10 cans. The
average fill volume X̄ is a normal random variable with
E[X̄] = 12.1  and  V[X̄] = (0.1)²/10 = 0.001
Consequently, P[X̄ < 12] = P[Z < (12 − 12.1)/√0.001] = P[Z < −3.16] = 0.000789

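The same calculation in code (a sketch; Phi is our own erf-based normal CDF):

```python
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 12.1, 0.1, 10
se = sigma / sqrt(n)   # standard deviation of the sample mean
z = (12 - mu) / se     # ≈ -3.16
p = Phi(z)             # P[X̄ < 12]

print(round(z, 2), round(p, 4))  # -3.16 0.0008
```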
 If X₁, X₂, …, Xₙ are independent normal random variables with E[Xᵢ] = µᵢ and
V[Xᵢ] = σᵢ², then Y = a₁X₁ + a₂X₂ + ··· + aₙXₙ is a normal random variable with
mean and variance
E[Y] = a₁µ₁ + a₂µ₂ + ··· + aₙµₙ
V[Y] = a₁²σ₁² + a₂²σ₂² + ··· + aₙ²σₙ²


 EXAMPLE:
1. Let the random variables X1 and X2 denote the length and width,
respectively, of a manufactured part. Assume that X1 is normally distributed
with E[X1] = 2 cm and standard deviation 0.1 cm and that X2 is normal with
E[X2] = 5 cm and standard deviation 0.2 cm. Also assume that X1 and X2 are
independent. Determine the probability that the perimeter exceeds 14.5 cm.
Solution: Let Y = 2X₁ + 2X₂ be the perimeter of the manufactured part.
Y is normally distributed with
µ_Y = E[Y] = 2(2) + 2(5) = 14
σ²_Y = V[Y] = 4(0.1)² + 4(0.2)² = 0.2
Thus, P[Y > 14.5] = P[Z > (14.5 − 14)/√0.2] = P[Z > 1.118] = 0.131776

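The perimeter probability can be reproduced from the linear-combination formulas (a sketch; Phi is our own erf-based normal CDF):

```python
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Y = 2*X1 + 2*X2, the perimeter of the part
mu_y = 2 * 2 + 2 * 5             # E[Y] = 14
var_y = 4 * 0.1**2 + 4 * 0.2**2  # V[Y] = 0.2

p = 1 - Phi((14.5 - mu_y) / sqrt(var_y))  # P[Y > 14.5]
print(round(p, 4))  # 0.1318
```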
GENERAL FUNCTION OF RANDOM VARIABLES


In many situations in statistics, it is necessary to derive the probability distribution of a
function of one or more random variables.
Suppose that X is a discrete random variable with probability distribution fX(x). Let Y = h(X)
be a function of X that defines a one-to-one transformation between the values of X and Y
and that we wish to find the probability distribution of Y. By a one-to-one transformation,
we mean that each value x is related to one and only one value of y = h(x) and that each
value of y is related to one and only one value of x, say x = u(y) where u(y) is found by
solving y = h(x) for x in terms of y.
Now the random variable Y takes on the value y when X takes on the value u(y). Therefore,
the probability distribution of Y is f_Y(y) = P[Y = y] = P[X = u(y)] = f_X(u(y)).
We now consider the situation in which the random variables are continuous. Let Y = h(X)
with X continuous and the transformation one to one. The equation y = h(x) can be solved
for x in terms of y, say x = u(y). The probability distribution of Y is f_Y(y) = f_X(u(y))|J|,
where J = u′(y) is called the Jacobian of the transformation, and the vertical bars
denote absolute value.


LESSON 6.0  SAMPLING DISTRIBUTIONS AND POINT ESTIMATION OF PARAMETERS

POINT ESTIMATION
In statistical inference, the term Parameter is used to denote a quantity θ (Greek theta),
say, that is a property of an unknown probability distribution. For example, it may be the
mean, variance, or a particular quantile of the probability distribution. Parameters are
unknown, and one of the goals of statistical inference is to estimate them.
Parameters can be thought of as representing a quantity of interest about a general
population. In earlier chapters, probability calculations were made based on given values of
the parameters of the probability distributions, but in practice the parameters are unknown
since the probability distribution that characterizes observations from the population is
unknown. An experimenter’s goal is to find out as much as possible about these parameters
since they provide an understanding of the underlying probability distribution that
characterizes the population.

The primary purpose in taking a random sample is to obtain information about the
unknown
population parameters. Suppose, for example, that we wish to reach a conclusion about the
proportion of people in a locality who prefer a particular brand of soft drink. Let p represent
the unknown value of this proportion. It is impractical to question every individual in the
population to determine the true value of p. To make an inference regarding the true
proportion p, a more reasonable procedure would be to select a random sample (of an
appropriate size) and use the observed proportion 𝑝̂ of people in this sample favoring the
brand of soft drink.
The sample proportion, 𝑝̂ , is computed by dividing the number of individuals in the sample
who prefer the brand of soft drink by the total sample size n. Thus, 𝑝̂ is a function of the
observed values in the random sample. Because many random samples are possible from a
population, the value of 𝑝̂ will vary from sample to sample. That is, 𝑝̂ is a random variable.
Such a random variable is called a statistic.

Estimation is a procedure by which the information contained within a sample is used to


investigate properties of the population from which the sample is drawn. In particular, a
point estimate of an unknown parameter θ is a statistic θˆ that is in some sense a “best
guess” of the value of θ. Notice that a caret or “hat” placed over a parameter signifies a
statistic used as an estimate of the parameter.
For a given data set x₁, x₂, . . . , xₙ, the sample mean and sample variance take the observed
values
x̄ = (x₁ + x₂ + ··· + xₙ)/n  and  s² = ∑ᵢ (xᵢ − x̄)² / (n − 1)

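These two point estimates are easy to compute by hand or in code (a sketch on a small hypothetical data set; the standard library's statistics module uses the same n − 1 divisor for the sample variance):

```python
import statistics

# a small illustrative data set (hypothetical measurements)
data = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar)**2 for x in data) / (n - 1)

print(round(xbar, 4), round(s2, 4))  # 13.0 0.2286
# the stdlib functions agree with the formulas
assert abs(xbar - statistics.mean(data)) < 1e-12
assert abs(s2 - statistics.variance(data)) < 1e-12
```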

SAMPLING DISTRIBUTION AND THE CENTRAL LIMIT THEOREM


The distribution of a random sample is called a Sampling Distribution. Theoretically the
distribution of 𝑋 can be found. In general, we would suspect that the distribution of 𝑋
depends on the density f(x) from which the random sample was selected, and indeed it
does. Two characteristics of the distribution of 𝑋, its mean and variance, do not depend on
the density f(x) per se but depend only on two characteristics of the density f(x).

SAMPLE MEAN
Let X₁, X₂, …, Xₙ be a random sample of size n from a population with density function f(x),
mean µ and variance σ². The random variable X̄, defined by X̄ = (X₁ + X₂ + ··· + Xₙ)/n,
is called the sample mean; its mean and variance are E[X̄] = µ and V[X̄] = σ²/n.

 EXAMPLE:
1. An electronics company manufactures resistors that have a mean resistance
of 100 ohms and a standard deviation of 10 ohms. The resistance follows a
normal distribution. Find the probability that a random sample of 25
resistors will have an average resistance of less than 95 ohms.
Solution: According to the reproductive property, the sampling
distribution of X̄ is normal with E[X̄] = µ = 100 and V[X̄] = σ²/n = 10²/25 = 4. Thus,
P[X̄ < 95] = P[(X̄ − E[X̄])/√(V[X̄]) < (95 − 100)/√4] = P[Z < −2.5] = 0.0062
If the distribution of resistance is normal with mean 100 ohms and
standard deviation of 10 ohms, finding a random sample of resistors with
a sample mean less than 95 ohms is a rare event. If this actually happens,
it casts doubt as to whether the true mean is really 100 ohms or if the
true standard deviation is really 10 ohms.

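The resistor calculation in code (a sketch; Phi is our own erf-based normal CDF):

```python
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 100, 10, 25
se = sigma / sqrt(n)     # standard deviation of the sample mean = 2
p = Phi((95 - mu) / se)  # P[X̄ < 95] = P[Z < -2.5]

print(round(p, 4))  # 0.0062
```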
SAMPLE PROPORTION
Consider the random sample X₁, X₂, …, Xₙ from the population X. We can write the estimate
as
p̂ = x̄ = x/n
where x represents the number of trials resulting in a success among the n trials.
The Standard Error of the estimate of the proportion p is σ_p̂ = √(p(1 − p)/n).


 EXAMPLE:
1. Suppose that the probability p that a vaccine provokes a serious adverse
reaction is unknown. The vaccine is administered to n = 500,000 head of
cattle, and x₀ = 372 are observed to suffer the reaction. a) Find the point
estimate of p. b) Calculate the standard error of the estimate.
Solution:
a) p̂ = 372/500000 = 0.000744
b) σ_p̂ = √(0.000744(1 − 0.000744)/500000) = 0.3856 × 10⁻⁴

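The point estimate and its standard error in code (a quick stdlib check):

```python
from math import sqrt

n, x0 = 500_000, 372
p_hat = x0 / n
se = sqrt(p_hat * (1 - p_hat) / n)

print(round(p_hat, 6), round(se * 1e4, 4))  # 0.000744 0.3856  (i.e. 0.3856 × 10⁻⁴)
```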
SAMPLE VARIANCE
The statistic S2 is an unbiased point estimator of the population variance σ2, and the
numerical value s2 computed from the sample data is called the point estimate of σ2. The
sample variance can be thought of as an estimate of the variance σ2 of the unknown
underlying probability distribution of the observations in the data set. It provides an
indication of the variability in the sample in the same way that the variance σ2 provides an
indication of the variability of a probability distribution.

The function Γ(v) = ∫_0^∞ x^(v−1) e^(−x) dx for v > 0 is called the Gamma Function.

 The gamma function has the following properties:


1. Γ(1) = 1
2. Γ(v) = (v − 1)Γ(v − 1)
3. Γ(n) = (n − 1)! for positive integer n
4. Γ(1/2) = √π

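All four properties can be verified with the standard library's math.gamma:

```python
from math import gamma, sqrt, pi, factorial, isclose

assert isclose(gamma(1), 1)                   # Property 1: Γ(1) = 1
assert isclose(gamma(4.5), 3.5 * gamma(3.5))  # Property 2: Γ(v) = (v-1)Γ(v-1)
assert isclose(gamma(6), factorial(5))        # Property 3: Γ(n) = (n-1)!
assert isclose(gamma(0.5), sqrt(pi))          # Property 4: Γ(1/2) = √π
print("all gamma-function properties check out")
```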
CENTRAL LIMIT THEOREM


Let f(x) be a probability function with mean µ and variance σ². Let X̄ be the mean of a
random sample of size n from a population with distribution f(x); then
X̄ ~ N(µ, σ²/n)
for sufficiently large n.

The Central Limit Theorem tells us that X̄ is approximately, or asymptotically, distributed
as a normal distribution with mean µ and variance σ²/n.
The astonishing thing about the theorem is that nothing is said about the form of
the original probability function. Whatever the distribution function, the sample mean X̄
will have approximately the normal distribution for large samples.
The importance of the theorem, as far as practical applications are concerned, is the fact
that the mean X̄ of a random sample from any distribution with mean µ and variance σ² is
approximately distributed as a normal random variable with mean µ and variance σ²/n.


When is the sample size large enough so that the Central Limit Theorem can be assumed to
apply? The answer depends on how close the underlying distribution is to the normal. A
general rule is that the approximation is adequate as long as n ≥ 30, although the
approximation is often good for much smaller values of n, particularly if the distribution of
the random variables Xi has a probability density function with a shape reasonably similar
to the normal bell-shaped curve. In most cases encountered in practice, this guideline is
very conservative, and the Central Limit Theorem will apply for sample sizes much smaller
than 30.
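The theorem is easy to see in a seeded simulation (a sketch; the Uniform(0, 1) population, with µ = 0.5 and σ² = 1/12, is our own hypothetical choice):

```python
import random

random.seed(42)  # reproducible demonstration

# population: Uniform(0, 1), which is flat rather than bell-shaped
mu, sigma2 = 0.5, 1 / 12
n, reps = 30, 20_000

# draw many samples of size n and record each sample mean
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

emp_mean = sum(means) / reps
emp_var = sum((m - emp_mean)**2 for m in means) / reps

# the sample means cluster around µ with variance close to σ²/n
print(round(emp_mean, 3), round(emp_var, 5), round(sigma2 / n, 5))
```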

GENERAL CONCEPTS OF POINT ESTIMATION


This section considers two basic criteria for determining good point estimates of a particular parameter,
namely, Unbiased Estimates and Minimum Variance Estimates. These criteria help us
decide which statistics to use as point estimates. In general, when there is more than one
obvious point estimate for a parameter, these criteria can be used to compare the possible
choices of point estimate.

UNBIASED ESTIMATOR
If X₁, X₂, …, Xₙ is a random sample from a population with density function f(x), the
statistic θ̂ = h(X₁, X₂, …, Xₙ) is called a Point Estimator of the unknown parameter θ.
After the sample x₁, x₂, …, xₙ has been selected, the point estimator θ̂ takes on a single
numerical value θ̂ = h(x₁, x₂, …, xₙ) called the point estimate of θ.
The point estimator θ̂ is an Unbiased Estimator of the unknown parameter θ if E[θ̂] = θ.
If the estimator is not unbiased, the quantity bias(θ̂) = E[θ̂] − θ is called the Bias of the
estimator.

VARIANCE OF A POINT ESTIMATOR


We saw in the previous section that a parameter may have more than one estimator. The
property of unbiasedness alone cannot be relied on to select an estimator. A method to
select among unbiased estimators is needed.
If all unbiased estimators of 𝜃 are considered, the one with the smallest variance is called
the Minimum Variance Unbiased Estimator (MVUE).

 EXAMPLE:
1. Compute the variance of the two unbiased estimators of 𝜇 using a random
sample of size 𝑛.
µ̂₁ = (1/n)(X₁ + X₂ + ··· + Xₙ)
µ̂₂ = (1/(n + 2))(2X₁ + X₂ + ··· + Xₙ₋₁ + 2Xₙ)


Solution:
We apply the equation
V[a₁X₁ + a₂X₂ + ··· + aₙXₙ] = a₁²V[X₁] + a₂²V[X₂] + ··· + aₙ²V[Xₙ]

V[µ̂₁] = V[(1/n)(X₁ + X₂ + ··· + Xₙ)] = (1/n²)(V[X₁] + V[X₂] + ··· + V[Xₙ])
= (1/n²)(nσ²) = σ²/n

V[µ̂₂] = V[(1/(n + 2))(2X₁ + X₂ + ··· + Xₙ₋₁ + 2Xₙ)]
= (1/(n + 2)²)(4V[X₁] + V[X₂] + ··· + V[Xₙ₋₁] + 4V[Xₙ])
= (1/(n + 2)²)(4σ² + (n − 2)σ² + 4σ²) = ((n + 6)/(n + 2)²) σ²

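Comparing the two variance formulas numerically shows that the ordinary sample mean wins for every n > 2 (a sketch; v1 and v2 are our own helper names):

```python
def v1(n, sigma2=1.0):
    # variance of the ordinary sample mean: σ²/n
    return sigma2 / n

def v2(n, sigma2=1.0):
    # variance of the weighted estimator (2X1 + X2 + ... + X_{n-1} + 2Xn)/(n+2)
    return (n + 6) * sigma2 / (n + 2)**2

for n in (5, 10, 30, 100):
    assert v2(n) > v1(n)  # µ̂₁ has the smaller variance for n > 2

print(round(v1(10), 4), round(v2(10), 4))  # 0.1 0.1111
```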
STANDARD ERROR
When the numerical value or point estimate of a parameter is reported, it is usually
desirable to give some idea of the precision of estimation. The measure of precision usually
employed is the Standard Error of the estimator that has been used.
The standard error of an estimator θ̂ is its standard deviation, given by se(θ̂) = σ_θ̂ = √(V[θ̂]).

MEAN SQUARED ERROR OF AN ESTIMATOR


Sometimes it is necessary to use a biased estimator. In such cases, the mean squared error
of the estimator can be important.
The Mean Squared Error of an estimator θ̂ is the quantity MSE(θ̂) = E[(θ̂ − θ)²].
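The decomposition MSE(θ̂) = V[θ̂] + bias(θ̂)² can be illustrated with a seeded simulation (a sketch; the shrinkage estimator 0.8·X̄ and the Normal(2, 1) population are our own hypothetical choices, not from the text):

```python
import random

random.seed(7)

theta = 2.0        # true mean of the hypothetical Normal(2, 1) population
n, reps = 5, 40_000

def biased_estimator(sample):
    # a deliberately biased estimator: shrink the sample mean toward 0
    return 0.8 * sum(sample) / len(sample)

errors = []
for _ in range(reps):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    errors.append((biased_estimator(sample) - theta)**2)

mse = sum(errors) / reps
# theory: MSE = V[θ̂] + bias² = 0.64*(1/5) + (0.8*2 - 2)² = 0.128 + 0.16 = 0.288
print(round(mse, 3), "vs theoretical 0.288")
```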
