
Statistics

• Definition
• Statistics is the science of conducting studies to:
• 1- collect,
• 2- organize,
• 3- summarize,
• 4- analyze, and
• 5- draw conclusions from data.
Terminology
• A variable is a characteristic or attribute that
can assume different values.
• Data are the values (measurements or
observations) that a variable can assume.
• A random variable is a variable whose values
are determined by chance.
• A data set is a collection of data values.
Types of statistics

• There are two branches in statistics


• Descriptive statistics consists of the collection,
organization, summarization, and presentation
of data.
• Inferential statistics consists of generalizing
from samples to populations, performing
estimations and hypothesis tests, determining
relationships among variables, and making
predictions.
Terminology
• A population includes all subjects that are being
studied. (Because of expense, time, population
size, medical concerns, etc., it is often not
possible to study the whole population.)
• A sample is a group of subjects selected from
a population.
• A hypothesis test is a decision-making process
for evaluating claims about a population based
on information obtained from samples.
Types of Data
[Diagram: data are divided into quantitative data and qualitative data.]
Quantitative Data

Quantitative data can be further divided into discrete and
continuous types.
Levels of Measurement
The scale determines the amount of information
contained in the data. The scale indicates the data
summarization and statistical analyses that are
most appropriate.
[Diagram: levels of measurement, from lowest to highest: nominal, ordinal, interval, ratio.]
Summary - Levels of Measurement

• Nominal - categories only
• Ordinal - categories with some order
• Interval - differences but no natural
starting point
• Ratio - differences and a natural starting
point
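Scales of Measurement
[Diagram: data are either qualitative or quantitative. Qualitative data may be numerical or nonnumerical, and in both cases are measured on a nominal or ordinal scale. Quantitative data are numerical and are measured on an interval or ratio scale.]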

Summary of Levels of Measurement

Level of      Put data in   Arrange data   Subtract data   Determine if one data value
measurement   categories    in order       values          is a multiple of another
Nominal       Yes           No             No              No
Ordinal       Yes           Yes            No              No
Interval      Yes           Yes            Yes             No
Ratio         Yes           Yes            Yes             Yes
Methods of Data Collection
In an observational study, a researcher observes and
measures characteristics of interest of part of a
population.
In an experiment, a treatment is applied to part of a
population, and responses are observed.
A simulation is the use of a mathematical or physical
model to reproduce the conditions of a situation or
process.
A survey is an investigation of one or more characteristics
of a population.
A census is a measurement of an entire population.

A sampling is a measurement of part of a population.


Sampling
• Sampling is the process of selecting a sample
from a group or population so that it can serve as the
basis for estimating and predicting
outcomes for the population and for uncovering
unknown information about it.
• Advantages of sampling: it saves cost and human
resources.
• Sampling is also the only option when a test is destructive.
SAMPLING TECHNIQUES
• The choice of sampling technique often depends on
the objectives of the research.
• Probability Sampling:
• This sampling technique selects the sample
using random
methods.
• Non-probability Sampling:
• This sampling technique is not based on
random selection. Some elements of the
population have no chance of selection.
Types of Sampling
Probability Sampling
• Simple Random Sampling
• Stratified Random Sampling
• Cluster Sampling
• Systematic Sampling
Non-probability Sampling
• Representative Sampling (can
be stratified random or quota sampling)
• Convenience or Haphazard Sampling
Summary
• Sampling is the process whereby some
elements (individuals) in the population are
selected for a research study.
• The population consists of all individuals
with a particular characteristic that is of
interest to the researchers. If data are
obtained from all members of the population,
then we have a census; if data are obtained
from some members of the population, then
we have a sample.
© 2011 Pearson Education, Inc
• Two main techniques of sampling: probability and non-
probability.
• Probability sampling is based on random selection while
non-probability sampling is not based on random
selection.
• Probability sampling consists of : random sampling,
stratified sampling, systematic sampling and cluster
sampling.
• Non-probability sampling consists of quota sampling,
purposive sampling and convenience sampling.
• In cluster sampling, the unit of sampling does not refer to
an individual entity but a group of entities.
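The four probability sampling techniques listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population (the integers 1 to 100), the two strata, the ten clusters, and all function names are hypothetical and not part of the lecture.

```python
import random

def simple_random_sample(population, k, rng):
    """Simple random sampling: every subset of size k is equally likely."""
    return rng.sample(population, k)

def systematic_sample(population, k, rng):
    """Systematic sampling: random start, then every step-th element."""
    step = len(population) // k
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(k)]

def stratified_sample(strata, k_per_stratum, rng):
    """Stratified sampling: a simple random sample inside each stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: choose whole clusters at random, keep all their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)                      # fixed seed so the run is repeatable
population = list(range(1, 101))            # hypothetical population
strata = {"low": population[:50], "high": population[50:]}
clusters = {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_s = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
```

Note that in the cluster sample the sampling unit is the cluster, not the individual, which matches the last bullet of the summary.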
Measure of central tendency

• In statistics we use either a population or
a sample.
• Measures computed from the values in a
population are called parameters; measures
computed from a sample are called statistics.
• The mean, the median, and the mode are used
to measure central tendency.
Properties and uses of central
tendency
• The mean: 1) All values of the data are used. 2) The mean is used in the
calculation of other statistics such as the variance. 3) The mean of a
data set is unique, and it is not necessarily a member of the data set. 4) It
cannot be computed for a data set with open-ended classes. 5) The
mean is affected by outliers.
• The median: 1) It is used to find the centre of the data set. 2) It is
used to tell whether a data value falls in the upper half or the lower
half. 3) It can be used for open-ended distributions. 4) It is less affected by
outliers than the mean.
• The mode: 1) It is used when we are interested in the most typical case. 2) It
is the easiest average to compute. 3) It can be used with nominal data.
4) The mode is not always unique (there may be more than one mode, or no mode).
• The midrange: 1) It is easy to compute. 2) It gives the midpoint of the data.
3) It uses only the smallest and largest values, so it is affected by outliers.
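As a small illustration of the four measures, the sketch below uses Python's standard `statistics` module on a made-up data set (the numbers are hypothetical, chosen only so the results are easy to check by hand).

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]                  # hypothetical data set

mean = statistics.mean(scores)                # sum of values / number of values
median = statistics.median(scores)            # average of the two middle values
modes = statistics.multimode(scores)          # list of the most frequent values
midrange = (min(scores) + max(scores)) / 2    # halfway between the extremes

# The mode need not be unique: this data set has two modes.
two_modes = statistics.multimode([1, 1, 2, 2])
```

Here the mean is 5, the median is 4, the only mode is 3, and the midrange is 6, so the four "centres" of the same data can all differ.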
When the Mean is not used
• Although the mean is the most commonly used measure
of central tendency, there are situations where the mean
does not provide a good, representative value, and there
are situations where you cannot compute a mean at all.
• When a distribution contains a few extreme scores (or is
very skewed), the mean will be pulled toward the
extremes (displaced toward the tail). In this case, the
mean will not provide a "central" value.
• On a nominal scale it is impossible to compute a mean, and
when data are measured on an ordinal scale (ranks), it is
usually inappropriate to compute a mean.

Shape of the Distribution
• Symmetrical (mean is about equal to median)
• Skewed
– Negatively: mean < median
– Positively (example: income): mean > median
Frequency distribution shape
• The three most common shapes are:
1- positively skewed, 2- symmetric, and 3- negatively
skewed.
• 1- In a positively skewed (right-skewed) distribution,
the majority of the data values fall to the left of
the mean and cluster at the lower end of the
distribution, with the tail to the right. The mean is to
the right of the median and the mode is to the left of the
median. E.g. an exam on which most of the students
score poorly, or the income of the population
in some countries.
• 2- In a symmetric distribution the data are evenly distributed around the
mean. Moreover, when the distribution is unimodal, the
mean, the mode, and the median coincide
at the centre of the distribution.
• 3- In a negatively skewed (left-skewed) distribution, the
majority of the data values fall to the right of the mean and
cluster at the upper end of the distribution, with the
tail to the left. The mean is to the left of the median and the
mode is to the right of the median. E.g. an exam on which most of
the students score highly.
• When the distribution is
extremely skewed, the mean is pulled toward the
tail, and the majority of the data will be greater or less
than the mean depending on the direction of the skew. In this case the median is used.
[Figure: central tendency and distribution shape, comparing a normal distribution with skewed distributions.]
Outliers

• Why are the mean and median so different in our earlier
example about home values? It is because there is one
price that is extremely different from the rest of the
data. In statistics, we call such extreme values outliers.
• The mean is affected by the presence of an outlier;
however, the median is not.
• A statistic that is not affected by outliers is called
resistant. We say that the median is a resistant
measure of center, and the mean is not resistant. In a
sense, the median is able to resist the pull of a faraway
value, but the mean is drawn to such values. It cannot
resist the influence of outlier values.
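The effect can be checked numerically. The home prices below are hypothetical (the data of the earlier example are not reproduced here); the sketch only shows how one extreme value moves the mean far more than the median.

```python
import statistics

# Hypothetical home prices in thousands; not the lecture's original data.
prices = [200, 210, 220, 230, 240]
with_outlier = prices + [2000]                  # add one extreme price

mean_before = statistics.mean(prices)
median_before = statistics.median(prices)
mean_after = statistics.mean(with_outlier)      # pulled strongly toward 2000
median_after = statistics.median(with_outlier)  # barely moves
```

With these numbers the mean jumps from 220 to about 517, while the median only moves from 220 to 225.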
Lesson Summary

• When examining a set of data, we use descriptive statistics
to provide information about where the data are centred.
• The mode is a measure of the most frequently occurring
number in a data set and is most useful for categorical data
and data measured at the nominal level.
• The mean and median are two of the most commonly used
measures of center.
• The mean, or average, is the sum of the data points divided
by the total number of data points in the set.
• In a data set that is a sample from a population, the
sample mean is denoted x̄. When the entire population is
involved, the population mean is μ.
• The median is the numeric middle of a data set. If
there are an odd number of numbers, this middle value
is easy to find. If there is an even number of data
values, however, the median is the mean of the middle
two values.
• The median is resistant, i.e., it is not affected by the
presence of outliers.
• An outlier is a number that has an extreme value when
compared with most of the data. The mean is not
resistant, and therefore the median tends to be a more
appropriate measure of center to use in examples that
contain outliers.
• For a nominal variable, the mode is the only measure
that can be used.
• For ordinal variables, the mode and the median may be
used. The median provides more information (taking
into account the ranking of categories.)
• For interval-ratio variables, the mode, median, and
mean may all be calculated. The mean provides the
most information about the distribution, but the
median is preferred if the distribution is skewed.
Introduction
• Dispersion (variability, scatter, or spread)
characterizes how
stretched or squeezed the data are.
• Dispersion refers to the variation of the items
around an average.
• A measure of dispersion is a non-negative real number
that equals zero if all the data are the same and
increases as the data become more diverse.
Measures of Variability
• Quartiles
• Range
• Interquartile range
• Variance
• Standard deviation
• Coefficient of variation
• All of these measures are appropriate for
measurement data only.
The location of quartiles

When there are n values in an ordered data set:

lower quartile = $\frac{n+1}{4}$th value

median = $\frac{n+1}{2}$th value

upper quartile = $\frac{3(n+1)}{4}$th value

interquartile range = upper quartile – lower quartile


The Five-Number Summary

• The five-number summary is a numerical
description of a data set comprising the
following measures (in order):
• minimum value,
• lower quartile,
• median,
• upper quartile,
• maximum value.
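A minimal sketch of the five-number summary, locating the quartiles at the (n+1)/4, (n+1)/2 and 3(n+1)/4 positions of the ordered data. Two things are assumptions of this sketch rather than part of the slides: fractional positions are handled by linear interpolation between neighbouring values, and the function names and data are made up for illustration.

```python
def value_at_position(ordered, pos):
    """Return the pos-th value (1-based) of an ordered list.

    Fractional positions are linearly interpolated (an assumption of
    this sketch); positions outside the data are clamped to the ends.
    """
    lower = int(pos)              # whole part of the position
    frac = pos - lower            # fractional part
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    med = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, med, q3, ordered[-1]

data = [7, 1, 3, 9, 5, 11, 13]        # hypothetical data, n = 7
summary = five_number_summary(data)   # quartile positions are 2, 4 and 6
iqr = summary[3] - summary[1]         # upper quartile - lower quartile
```

For this data set the ordered values are 1, 3, 5, 7, 9, 11, 13, so the summary is 1, 3, 7, 11, 13 and the interquartile range is 8. Other textbooks and software use slightly different quartile conventions, so results on small data sets may differ.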
Box plot
• A box plot summarizes data using the median, upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.
• The box spans the lower quartile, median, and upper quartile.
• The whiskers extend from the box to the maximum and minimum.
The Mean and Standard Deviation as
Descriptive Statistics
• If you are given numerical values for the mean
and the standard deviation, you should be
able to construct a visual image (or a sketch)
of the distribution of scores.
• As a general rule, about 70% of the scores will
be within one standard deviation of the mean,
and about 95% of the scores will be within a
distance of two standard deviations of the
mean.
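This rule of thumb holds for roughly normal (bell-shaped) distributions, where the one-standard-deviation figure is closer to 68%. The sketch below checks it by simulation; the mean of 100, the standard deviation of 15, the sample size, and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(42)                                  # fixed seed for repeatability
scores = [rng.gauss(100, 15) for _ in range(20_000)]     # simulated normal scores

m = statistics.fmean(scores)
s = statistics.pstdev(scores)

# Proportion of scores within one and two standard deviations of the mean.
within_1 = sum(abs(x - m) <= s for x in scores) / len(scores)
within_2 = sum(abs(x - m) <= 2 * s for x in scores) / len(scores)
```

For skewed or heavy-tailed data the proportions can be quite different, so the rule should only be used to sketch distributions that are approximately normal.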

Choosing Appropriate
Measure of Variability
• If data are symmetric, with no serious outliers,
use range and standard deviation.
• If data are skewed, and/or have serious
outliers, use IQR.
• If comparing variation across two data sets,
use coefficient of variation.
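The coefficient of variation is the standard deviation expressed as a percentage of the mean, which makes it unit-free and therefore usable across data sets measured in different units. A minimal sketch, assuming the sample standard deviation is used and with hypothetical height and weight data:

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

heights_cm = [160, 165, 170, 175, 180]    # hypothetical data
weights_kg = [50, 60, 70, 80, 90]         # hypothetical data

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
```

Here the weights vary more than the heights relative to their means (about 22.6% against about 4.7%), a comparison the raw standard deviations in cm and kg could not support directly.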
Summary
• For parametric (normally distributed,
symmetrical) data, the mean and SD are the
appropriate measures of central tendency and
variability of the data.
• For non-parametric data, the median is the
appropriate central tendency measure and the
IQR is the appropriate measure of the
variability of the data.
Introduction
• Probability is the measure of the likelihood
that an event will occur. It is an important part of
statistics and is the basis of inferential
statistics, in which we make decisions about a population
from a sample.
• It is used to make decisions in the face of uncertainty.
• For example, a weather forecaster may predict
that there is an 80% chance of rain tomorrow.
Calculating probability

• Probability is a numerical measure of the likelihood that a specific
event will occur.
• PROBABILITY RULES
• Rule 1. The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1.
• Rule 2. If S is the sample space in a probability model, then P(S) = 1.
• Rule 3. Two events A and B are disjoint if they have no outcomes
in common and so can never occur together. If A and B are disjoint, then
P(A or B) = P(A) + P(B). This is the addition rule for disjoint events.
• Rule 4. The complement of any event A is the event that A does
not occur, written as $A^c$. The complement rule states that
$P(A^c) = 1 - P(A)$.
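The four rules can be verified on a small model. The sketch below assumes one roll of a fair die with equally likely outcomes; the events A and B and the helper `prob` are hypothetical names chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}        # one roll of a fair die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {1, 2}                               # the roll is 1 or 2
B = {5, 6}                               # the roll is 5 or 6; disjoint from A

rule_1 = 0 <= prob(A) <= 1
rule_2 = prob(sample_space) == 1
rule_3 = prob(A | B) == prob(A) + prob(B)          # valid because A & B is empty
rule_4 = prob(sample_space - A) == 1 - prob(A)     # complement rule
```

Using `Fraction` keeps the probabilities exact (1/3, 2/3 and so on) instead of rounding them to decimals.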
Definitions of probability
• In its original, popular meaning, "statistics" referred to state records (births, deaths, etc.).
• Definition of probability
• (Frequentist) An event's probability is the proportion of times that we expect the
event to occur if the experiment were repeated a large number of times. Probability
is defined as the hypothetical number towards which the relative frequency tends
when a random experiment is repeated infinitely many times.
• (Subjectivist) Subjective probability is an individual's degree of belief in the
occurrence of an event.
• (Classical) Probability is the number of cases favourable to a
particular event divided by the number of all possible cases.
• The classical definition of probability: if there are m outcomes in a sample space
(universal set), and all are equally likely to be the result of an experimental
measurement, then the probability of observing an event (a subset) that contains s
outcomes is given by $s/m$.
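The classical and frequentist definitions can be compared by simulation. This sketch assumes a fair six-sided die and the event "the roll is even"; the trial counts and the seed are arbitrary illustrative choices.

```python
import random

rng = random.Random(1)                   # fixed seed for repeatability

def relative_frequency(n_trials):
    """Proportion of simulated die rolls that come up even."""
    hits = sum(rng.randint(1, 6) % 2 == 0 for _ in range(n_trials))
    return hits / n_trials

classical = 3 / 6                        # s/m: 3 favourable outcomes out of 6
small_run = relative_frequency(10)       # can be far from 0.5
large_run = relative_frequency(100_000)  # settles close to 0.5
```

As the number of trials grows, the relative frequency tends toward the classical value of 0.5, which is the idea behind the frequentist definition.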
• Event: each possible type of occurrence or outcome.
• Probability: the chance that an uncertain event will occur (always between 0 and 1).
• If an event is impossible, it has a probability of 0. If it's an absolute certainty, then the probability is 1. A lot of the time, you'll be dealing with probabilities somewhere in between.
Sample space
• A probability experiment is a chance process that leads
to well-defined results called outcomes.
• An outcome is the result of a single trial of a probability
experiment.
• A sample space is the set of all possible outcomes of a
probability experiment.

Experiment          Sample space
Toss a coin         Head, tail
Roll a die          1, 2, 3, 4, 5, 6
Answer a question   True, false
Toss two coins      HH, TT, HT, TH
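Sample spaces like these can be enumerated mechanically. The sketch below builds them with `itertools.product`; the variable names are illustrative only.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Tossing two coins: every ordered pair of single-coin outcomes.
two_coins = ["".join(pair) for pair in product(coin, repeat=2)]

# Rolling two dice: every ordered pair of faces.
two_dice = list(product(die, repeat=2))
```

Each level of a tree diagram multiplies the number of branches, which is why two coins give 2 × 2 = 4 outcomes and two dice give 6 × 6 = 36.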
Draw the Venn and tree diagrams for
the experiment of tossing a coin twice.
Simple and Compound Events

• Definition
• Event: An event is a collection of one or more of the
outcomes of an experiment.
• An event can be either simple (elementary) or compound.
• Simple event: An event that includes one and only
one of the (final) outcomes of an experiment is
called a simple event and is usually denoted by Ei (E1, E2, E3, ...).
Example
• The die toss:
• Simple events: E1 = 1, E2 = 2, E3 = 3, E4 = 4, E5 = 5, E6 = 6
• Sample space: S = {E1, E2, E3, E4, E5, E6}
[Venn diagram: the sample space S containing the six points E1 to E6.]
2. Independent versus Dependent
Events
• Two events, A and B, are independent if the
occurrence of one does not affect the
probability of the occurrence of the other. If
events A and B are not independent, then
they are said to be dependent. Therefore,
two events are dependent if the outcome of
the first affects the probability of the second.
• Mutually exclusive events cannot occur simultaneously.
The complement of event A consists of all outcomes in
which event A does not occur.
• Complementary events are always mutually exclusive,
but mutually exclusive events are not necessarily
complementary.
• Given an experiment involving rolling two dice, the
event of the dice dots having a sum of six and the event
of the dice dots having a sum of eight are mutually
exclusive. In that same experiment, the event of the
dice dots having an even sum is the complement of the
event of the dice dots having an odd sum.
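The two-dice example can be checked by listing all 36 outcomes. In this sketch the set names are illustrative, and the dice are assumed fair and distinguishable.

```python
from itertools import product

rolls = set(product(range(1, 7), repeat=2))         # all 36 ordered outcomes

sum_six = {r for r in rolls if sum(r) == 6}
sum_eight = {r for r in rolls if sum(r) == 8}
even_sum = {r for r in rolls if sum(r) % 2 == 0}
odd_sum = {r for r in rolls if sum(r) % 2 == 1}

# Mutually exclusive: the events share no outcome.
exclusive = sum_six.isdisjoint(sum_eight)

# Complementary: mutually exclusive AND together they cover the sample space.
complementary = even_sum.isdisjoint(odd_sum) and (even_sum | odd_sum) == rolls
six_eight_complementary = (sum_six | sum_eight) == rolls
```

The sums of six and eight are mutually exclusive but not complementary, because most outcomes (a sum of seven, for example) belong to neither event.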
And / Or in probability
• "And" means at the same time. For example, if you
are asked to draw a queen and a heart from a deck, it means
the queen of hearts, which has probability 1/52. "Or" has two
meanings. If you are asked for a queen or a heart, you
count the 4 queens and the 13 hearts, so there are
4 + 13 − 1 = 16 possibilities. As another example, if you are asked
for a queen or a king, there are 4 + 4 = 8 possibilities.
• In the first example the queen of hearts is counted twice,
so we subtract one (inclusive or). In the second
example both events cannot happen at the same
time (exclusive or).
The word Or
• In probability and statistics, the word “or” is
usually used as an “inclusive or” rather than
an “exclusive or.” For instance, there are three
ways for “Event A or B” to occur.
– A occurs and B does not occur
– B occurs and A does not occur
– A and B both occur
Compare “A and B” to “A or B”
The compound event “A and B” means that A and B
both occur in the same trial. Use the multiplication rule
to find P(A and B).

The compound event “A or B” means either A can
occur without B, B can occur without A, or both A and B
can occur. Use the addition rule to find P(A or B).

[Venn diagrams: “A and B” is the overlap of the two circles; “A or B” is the whole area covered by either circle.]
The Addition Rule
The probability that one or the other of two events will
occur is: P(A or B) = P(A) + P(B) – P(A and B)

A card is drawn from a deck. Find the probability that
it is a king or it is red.
A = the card is a king; B = the card is red.

P(A) = 4/52 and P(B) = 26/52,
but P(A and B) = 2/52.
P(A or B) = 4/52 + 26/52 – 2/52
= 28/52 ≈ 0.538
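The king-or-red calculation can be confirmed by building the deck and counting. The sketch assumes a standard 52-card deck; the rank and suit labels and the helper `p` are illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))                    # 52 (rank, suit) pairs

kings = {card for card in deck if card[0] == "K"}
reds = {card for card in deck if card[1] in ("hearts", "diamonds")}

def p(event):
    """Probability of drawing a card in `event` from the full deck."""
    return Fraction(len(event), len(deck))

# Addition rule: subtract the overlap (the two red kings) counted twice.
by_rule = p(kings) + p(reds) - p(kings & reds)
by_counting = p(kings | reds)                        # count the union directly
```

Both routes give 28/52, which `Fraction` reduces to 7/13, or about 0.538.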
Probability rule

• Rule 1
• The probability of any event E is a number (either a fraction or a
decimal) between and including 0 and 1 (0 ≤ P(E) ≤ 1). By Rule 1, a
probability cannot be negative or greater than one.
• Rule 2
• If an event cannot occur, its probability is zero. When a single die is
rolled, the probability of getting a 9 is 0.
• Rule 3
• If an event is certain, P(E) = 1. When a single die is rolled,
the probability of getting a number less than 7 is 1.
• Thus, when the probability of an event is near zero the event is unlikely to
occur, and when it is greater than 0.5 the event is more likely than not.
• Rule 4
• The sum of the probabilities of all outcomes in the sample space is 1.
Correlation
A correlation is a relationship between two variables. The data can be represented by
the ordered pairs (x, y), where x is the independent (or explanatory) variable and y is
the dependent (or response) variable. A correlation describes the relationship without
allowing us to infer a causal relationship.

A scatter plot can be used to determine whether a
linear (straight line) correlation exists between two
variables.

Example:

x    1    2    3    4    5
y   –4   –2   –1    0    2

[Scatter plot of the example data, with x on the horizontal axis and y on the vertical axis.]
• What to study
• 1- Are two or more variables related?
• 2- If yes, what is the strength of the
relationship?
• To answer questions 1 and 2 we use the
correlation coefficient.
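The slides do not spell out the formula, so the sketch below assumes the usual Pearson correlation coefficient, r = Sxy / sqrt(Sxx · Syy), and applies it to the scatter plot example data (x = 1, 2, 3, 4, 5 and y = −4, −2, −1, 0, 2). The function name is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [-4, -2, -1, 0, 2]
r = pearson_r(x, y)     # 14 / sqrt(10 * 20), roughly 0.99
```

A value this close to +1 indicates a strong positive linear relationship, which agrees with the upward trend of the plotted points.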
Linear Correlation
[Four scatter plots:
Negative linear correlation: as x increases, y tends to decrease.
Positive linear correlation: as x increases, y tends to increase.
No correlation.
Nonlinear correlation.]
Linear Correlation
[Four scatter plots with their correlation coefficients:
r = −0.91: strong negative correlation
r = 0.88: strong positive correlation
r = 0.42: weak positive correlation
r = 0.07: nonlinear correlation]
Correlation coefficient (r)
• When r is greater than +0.50, there is a strong positive relationship, or a high degree of relationship, between the two variables. r = +1 is a perfect positive correlation.
• When the correlation coefficient approaches r = −1.00 (or is less than r = −0.50), there is a strong negative relationship.
• As a rule of thumb, do not read much into a correlation until r is greater than 0.30 or less than −0.30.
Correlation and Causation
A strong correlation between two variables may provide clues about possible cause-
effect relationships. However, some statisticians claim a strong correlation never
implies a cause-effect relationship.

If there is a significant correlation between two variables, you should consider the
following possibilities.

1. Is there a direct cause-and-effect relationship between the variables?
Does x cause y?
2. Is there a reverse cause-and-effect relationship between the variables?
Does y cause x?
3. Is it possible that the relationship between the variables can be caused by a
third variable or by a combination of several other variables?
4. Is it possible that the relationship between two variables may be a
coincidence?
• Correlational research may allow a researcher to
develop ideas about potential cause-effect
relationships between variables. For
verification the researcher may conduct a
controlled experiment and determine whether
the cause-effect hunch about the two
variables has some support. Indeed, after a
controlled experiment, a researcher may claim
a cause-effect relationship between two
variables.
Revision: correlation

• In studying the relationship between two variables:
• 1- Collect the data for the two variables under
study.
• 2- Draw the scatter plot and determine the nature
of the relationship.
• 3- Calculate the correlation coefficient.
• 4- Test the significance of the relationship (if
significant, go on to regression; otherwise stop).
• 5- Find the regression equation.
Regression
• In the previous lecture we studied the correlation
between two variables and determined the
correlation coefficient. The correlation
coefficient is then used to test the significance of the
correlation. If the correlation is significant, the
next step is to determine the equation of the
regression line. If the correlation is not
significant, determining the regression line
is meaningless.
• The regression line is the line of best fit for the data.
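The slides do not give the formula for the line of best fit, so this sketch assumes the ordinary least-squares line ŷ = a + bx, with slope b = Sxy / Sxx and intercept a = ȳ − b·x̄. It reuses the scatter plot example data (x = 1 to 5, y = −4, −2, −1, 0, 2); the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x     # the line passes through (mean_x, mean_y)
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [-4, -2, -1, 0, 2]
slope, intercept = least_squares_line(x, y)

def predict(x_value):
    """Predicted y-value on the fitted line."""
    return intercept + slope * x_value
```

For these data the fitted line is approximately ŷ = 1.4x − 5.2, so each unit increase in x is associated with a rise of about 1.4 in the predicted y.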
Variation About a Regression Line
The total variation about a regression line is the sum of the squares of
the differences between the y-value of each ordered pair and the mean
of y.
Total variation = $\sum (y_i - \bar{y})^2$

The explained variation is the sum of the squares of the differences
between each predicted y-value and the mean of y.
Explained variation = $\sum (\hat{y}_i - \bar{y})^2$

The unexplained variation is the sum of the squares of the differences
between the y-value of each ordered pair and each corresponding
predicted y-value.
Unexplained variation = $\sum (y_i - \hat{y}_i)^2$

Total variation = Explained variation + Unexplained variation
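The decomposition can be checked numerically. This sketch assumes the regression line is fitted by ordinary least squares and reuses the scatter plot example data (x = 1 to 5, y = −4, −2, −1, 0, 2); all variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]
y = [-4, -2, -1, 0, 2]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept (assumed fitting method).
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * xi for xi in x]        # predicted y-values

total = sum((yi - mean_y) ** 2 for yi in y)                     # sum of (y_i - mean)^2
explained = sum((yh - mean_y) ** 2 for yh in y_hat)             # sum of (y_hat_i - mean)^2
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # sum of (y_i - y_hat_i)^2
```

For these data the total variation is 20, of which about 19.6 is explained by the line and about 0.4 is left unexplained, so the two parts add up to the total as the identity states.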


Variation About a Regression Line
To find the total variation, you must first calculate the
total deviation, the explained deviation, and the
unexplained deviation.
Total deviation = $y_i - \bar{y}$
Explained deviation = $\hat{y}_i - \bar{y}$
Unexplained deviation = $y_i - \hat{y}_i$
[Figure: a regression line with a data point $(x_i, y_i)$, the corresponding predicted point $(x_i, \hat{y}_i)$ on the line, and the mean $\bar{y}$, showing the total deviation split into the explained and unexplained deviations.]
