
STATISTICS

SUBJECT – MANAGEMENT
SUBJECT CODE – 17
UNIT – VII


CONTENTS

Chapters Titles
1 Fundamentals of Probability

2 Probability Distribution

3 Sampling

4 Hypothesis Test

5 Correlation & Regression

6 T Test, Z Test, F Test, & Chi Square

7 Data Analysis & Management Information System

CHAPTER 1
FUNDAMENTALS OF PROBABILITY
Probability:
The term probability refers to the chance of happening or not happening of an event.
In any statement when we use the word chance it means that there is an element of
uncertainty in that statement. A numerical measure of uncertainty is provided by the theory
of probability.
In the words of Morris Hamburg, “probability measures provide the decision maker in business and in government with the means for quantifying the uncertainties which affect his choice of appropriate action.”

Origin:
The theory of probability has its origin in games of chance related to gambling, such as drawing cards from a pack or throwing a die. Jerome Cardan (1501 – 1576), an Italian mathematician, was the first man to write a book on the subject, entitled “Book on Games of Chance”, which was published after his death in 1663. The foundation of the theory of probability was laid by the French mathematicians Blaise Pascal (1623 – 1662) and Pierre de Fermat.

Terminology:
In order to understand the meaning and concept of probability, we must know
various terms in this context.

Random Experiment:
An experiment is called a random experiment if, when conducted repeatedly under essentially homogeneous conditions, the result is not unique, i.e., it does not give the same result every time. The result may be any one of the various possible outcomes.

Sample Space:
The set of all possible outcomes of an experiment is called the sample space of that
experiment and is usually denoted by S. Every outcome (element) of the sample space is
called sample point.
Some random experiments of sample space:
(i) If an unbiased coin is tossed randomly, then there are two possible outcomes for this experiment, viz., head (H) or tail (T) up. Then the sample space is
S = {H, T}
(ii) When two coins are thrown simultaneously, the sample space is

S = {(H, H), (H, T), (T, H), (T, T)}


(iii) When three coins are thrown simultaneously, then the sample space consists of 2 x 2 x 2, i.e., 8 sample points as shown below:
S = {(H,H,H), (H,H,T), (H,T,H), (T,H,H), (H,T,T), (T,H,T), (T,T,H), (T,T,T)}
(iv) When a die is thrown randomly, then the sample space is
S = {1, 2, 3, 4, 5, 6}

(v) When two dice are thrown simultaneously and the sum of the points is noted,
then the sample space is:
S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
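The sample spaces above can also be enumerated mechanically; the following is a minimal Python sketch added for illustration, using only the standard library:

```python
# Illustrative sketch: enumerating the sample spaces (i)-(v) above.
from itertools import product

coin = ["H", "T"]

one_coin = set(coin)                        # (i)  {H, T}
two_coins = set(product(coin, repeat=2))    # (ii) 4 sample points
three_coins = set(product(coin, repeat=3))  # (iii) 2 x 2 x 2 = 8 sample points
one_die = set(range(1, 7))                  # (iv) {1, ..., 6}
two_dice_sums = {a + b for a, b in product(range(1, 7), repeat=2)}  # (v)

print(len(three_coins))       # 8
print(sorted(two_dice_sums))  # [2, 3, ..., 12]
```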

Trial and Event:
Performing a random experiment is called a ‘trial’, and its outcome or outcomes are termed ‘events’. For instance, tossing a coin would be called a trial and the result (falling head or tail upward) an event.

Types of Events:
Exhaustive Cases:
The total number of possible outcomes of a random experiment is called the exhaustive cases for the experiment. For instance, in a toss of a single coin, we can get head or tail. Hence the exhaustive number of cases is 2, because between themselves they exhaust all possible outcomes of the random experiment.

Favorable cases or Events:


The number of outcomes which result in the happening of a desired event are called favourable cases. For instance, in drawing a card from a pack of 52 cards, the number of cases favourable to getting a club is 13 and to getting an ace is 4.

Mutually Exclusive Events:


Two or more events are said to be mutually exclusive if the happening of any one of them
excludes the happening of all others in the same experiment. Thus mutually exclusive events
are those events, the occurrence of which prevents the possibility of the other to occur.

Symbolically, a set of events E1, E2, ….., En is mutually exclusive if Ei ∩ Ej = ∅ (i ≠ j). This means the intersection of any two events is a null set (∅).

For example, let a die be thrown once. The event E1 of getting an even number is
E1 = {2, 4, 6}
The event E2 of getting an odd number is
E2 = {1, 3, 5}

Since E1 ∩ E2 = ∅, the two events are mutually exclusive.

Equally likely events:


The outcomes are said to be equally likely if none of them is expected to occur in preference to the others. For instance, head and tail are equally likely events in tossing an unbiased coin.
When an unbiased die is thrown once, we may get 1 or 2 or 3 or 4 or 5 or 6. These six events are equally likely.

Independent Events.:
Events are said to be independent if the occurrence of one does not affect the outcome of any of the others. For instance, the result of the first toss of a coin does not affect the result of successive tosses at all.

Dependent Events:
If the occurrence of one event affects the happening of the other event, then they are said to be dependent events. For instance, the probability of drawing a king from a pack of 52 cards is 4/52, i.e., 1/13. If this card is not replaced before the second draw, the probability of getting a king again is 3/51, as there are now only 51 cards left and they contain only 3 kings.

[5]
Compound Events:
Two events are said to be compound when their occurrences are related to each other. For example, a die is thrown once. The sample space S is
S = {1, 2, 3, 4, 5, 6}

Let one event be E1, that is, of getting an even digit uppermost,
i.e., E1 = {2, 4, 6}

Let the other event be E2, that is, that of getting a number greater than 4,
i.e., E2 = {5, 6}

The event of getting an even number and a number greater than 4 is
E = {6}
Clearly a compound event is an intersection of two or more events. In the present case
E = E1 ∩ E2

Complementary Events:
If E is any subset of the sample space, then its complement, denoted by Ē (read as E-bar), contains all the elements of the sample space that are not part of E. If S denotes the sample space, then
Ē = S – E
= all sample elements not in E

Expressions of Probability:
Probability will always be a number between 0 and 1. If an event is certain to happen
its probability would be 1 and if it is certain that the event would not take place, then the
probability of its happening is zero.
The general rule of the happening of an event is that if an event can happen in m ways and fail to happen in n ways, then the probability (P) of the happening of the event is given by
P = m / (m + n)
or
P = (number of favourable ways) / (total number of ways)

Probabilities can be expressed as ratios, fractions or percentages. For instance, the probability of getting a head in a toss of a coin can be expressed as ½, 0.5 or 50%.
Odds in Favour and Odds Against An Event:
If, as a result of an experiment, a of the outcomes are favourable to an event E and b of the outcomes are against it, then the odds in favour of E are a : b, and the odds against E are b : a.

Example:
The odds against an event are 2 : 5. Find the probability of its happening.

Solution:
Odds against the event E are b : a, i.e., 2 : 5, so a = 5 and b = 2.

P(Ē) = b / (a + b) = 2 / 7
P(E) + P(Ē) = 1
P(E) = 1 – P(Ē)
= 1 – 2/7 = 5/7
Hence, the probability of happening of the event is 5/7.
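A small helper (illustrative only) makes this odds-to-probability conversion concrete:

```python
# Sketch: converting odds to probability, as in the worked example above.
def prob_from_odds_against(b: int, a: int) -> float:
    """Odds against an event are b : a, so P(event) = a / (a + b)."""
    return a / (a + b)

print(prob_from_odds_against(2, 5))  # 5/7 ≈ 0.714
```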
Approaches of Probability:
1. Classical Approach:
 This approach of defining probability is based on the assumption that all possible outcomes (finite in number) of an experiment are mutually exclusive and equally likely.
 If a random experiment is repeated a finite number of times, out of which ‘a’ outcomes are in favour of event A, ‘b’ outcomes are not in favour of event A, and all these possible outcomes are mutually exclusive, collectively exhaustive and equally likely, then the probability of occurrence of event A is defined as:
P(A) = a / (a + b) = (number of outcomes favourable to A) / (total number of possible outcomes)

 Since the probability of occurrence of an event is based on prior knowledge of the process involved, this approach is often called the a priori (original) approach or classical approach. This approach implies that there is no need to perform random experiments to find the probability of occurrence of an event. Also, no experimental data are required for computation of probability.
2. Relative Frequency Approach:
 If outcomes or events of a random experiment are not equally likely, or it is not known whether they are equally likely, then the classical approach is not suitable for determining the probability of a random event. For example, in cases like (i) whether a number greater than 3 will appear when a die is rolled, or (ii) whether a lot of 100 items will contain 10 defective items, it is not possible to predict the occurrence of an outcome in advance without repetitive trials of the experiment.
 This approach of computing probability states that when a random experiment is repeated a large number of times under identical conditions, where trials are independent of each other, the desired event may occur some proportion (relative frequency) of the time. Thus, the probability of an event can be approximated by recording the relative frequency with which such an event has occurred over a finite number of repetitions of the experiment under identical conditions.
 Since the probability of an event is determined through repetitive empirical observations of experimental outcomes, it is also known as empirical probability.
A few situations to which this approach can be applied (and which can be simulated, as in the sketch below) are as follows:
1) Observing how often you win the lottery when buying tickets regularly.
2) Observing whether or not a certain traffic signal is red when you cross it.
3) Observing births and noting how often the baby is a female.
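A minimal simulation sketch of the relative-frequency idea, assuming a fair-coin experiment as in the earlier examples:

```python
# Sketch of the relative-frequency (empirical) approach: repeat a random
# experiment many times and approximate P(event) by its relative frequency.
import random

def empirical_probability(trial, event, n=100_000):
    hits = sum(event(trial()) for _ in range(n))
    return hits / n

toss = lambda: random.choice("HT")
print(empirical_probability(toss, lambda outcome: outcome == "H"))  # close to 0.5
```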

3. Subjective Approach:
The subjective approach of calculating probability is based on the degree of belief, conviction and experience concerning the likelihood of occurrence of a random event. It is a way to quantify an individual’s beliefs, assessments and judgement about a random phenomenon.
The probability assigned for the occurrence of an event may be based on just a guess or on having some idea about the relative frequency of past occurrences of the event. This approach must be used when either sufficient data are not available or sources of information giving different results are not known.
………………..

CHAPTER 2
PROBABILITY DISTRIBUTION
Probability Theory is the branch of mathematics concerned with probability, the
analysis of random phenomena.
Probability is a way of assigning every "event" a value between zero and one, with the requirement that the event made up of all possible results (in our example, the event {1, 2, 3, 4, 5, 6}) be assigned a value of one.
The central objects of probability theory are random variables, stochastic processes, and
events.
If an individual coin toss or the roll of a die is considered to be a random event, then when repeated many times the sequence of random events will exhibit certain patterns, which can be studied and predicted. Two representative mathematical results describing such patterns are:
1. Law of large numbers
2. The central limit theorem
The mathematical theory of probability has its roots in attempts to analyze games of
chance by Gerolamo Cardano in the sixteenth century.
Probability theory mainly considered discrete events, and its methods were mainly
combinatorial.

1. Probability Theory deals with events that occur in countable sample spaces.
Examples: Throwing dice, experiments with decks of cards, random walk, and tossing coins.

2. Classical Definition: Initially the probability of an event was defined as the number of cases favourable to the event over the number of total outcomes possible in an equiprobable sample space: see the classical definition of probability.
For example, if the event is "occurrence of an even number when a die is rolled", the probability is given by 3/6 = 1/2, since 3 faces out of the 6 have even numbers and each face has the same probability of appearing.

3. Modern Definition: The modern definition starts with a finite or countable set called the
sample space, which relates to the set of all possible outcomes in classical sense.

4. Probability Distribution assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference.
As probability theory is used in quite diverse applications, terminology is not uniform and sometimes confusing. The following terms are used for non-cumulative probability distribution functions:
Probability Mass, Probability Mass Function, p.m.f.: for discrete random variables.
Categorical Distribution: for discrete random variables with a finite set of values.
Probability Density, Probability Density Function, p.d.f.: most often reserved for continuous random variables.
The following terms are somewhat ambiguous as they can refer to non-cumulative or cumulative distributions, depending on authors' preferences:
 Probability Distribution Function: continuous or discrete, non-cumulative or cumulative.
 Probability Function: even more ambiguous, can mean any of the above or other things.
Finally,
 Probability Distribution: sometimes the same as probability distribution function, but usually refers to the more complete assignment of probabilities to all measurable subsets of outcomes, not just to specific outcomes or ranges of outcomes.

5. Properties of Probability Distribution:


i. The probability distribution of the sum of two independent random variables is the convolution of each of their distributions (see the sketch after this list).
ii. Probability distributions are not a vector space, since they are not closed under linear combinations (these do not preserve non-negativity or total integral 1), but they are closed under convex combination, thus forming a convex subset of the space of functions (or measures).
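Property (i) can be checked numerically. The sketch below, assuming NumPy is available, convolves the pmf of one fair die with itself to obtain the distribution of the sum of two dice:

```python
# Sketch of property (i): the pmf of the sum of two independent random
# variables is the convolution of their pmfs. Two fair dice give the
# familiar triangular distribution on 2..12.
import numpy as np

die_pmf = np.full(6, 1 / 6)              # P(X = 1..6) for one fair die
sum_pmf = np.convolve(die_pmf, die_pmf)  # P(X1 + X2 = 2..12)

for total, p in zip(range(2, 13), sum_pmf):
    print(total, round(p, 4))            # e.g. P(sum = 7) = 6/36 ≈ 0.1667
```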

6. Types of Probability Distribution:


Probability distributions can be divided into different types:
I) Related to real-valued quantities that grow linearly (e.g., errors, offsets)
Normal Distribution (Gaussian distribution), for a single such quantity; the most
common continuous distribution
II) Related to positive real-valued quantities that grow exponentially (e.g. prices,
incomes, populations)
Log-Normal Distribution, for a single such quantity whose log is
normally distributed
Pareto Distribution, for a single such quantity whose log is
exponentially distributed; the prototypical power law distribution
III) Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region.
Discrete Uniform Distribution, for a finite set of values (e.g. the outcome of a
fair die)
Continuous Uniform Distribution, for continuously distributed values
IV) Related to Bernoulli trials (yes/no events, with a given probability).
A. Basic Distributions:
Bernoulli Distribution, for the outcome of a single Bernoulli trial (e.g.
success/failure, yes/no)
Binomial Distribution, for the number of "positive occurrences" (e.g. successes,
yes votes, etc.) given a fixed total number of independent occurrences
Negative Binomial Distribution, for binomial-type observations but where the
quantity of interest is the number of failures before a given number of successes
occurs
Geometric Distribution, for binomial-type observations but where the quantity
of interest is the number of failures before the first success; a special case of the
negative binomial distribution
V) Related to Sampling Schemes over a Finite Population:
Hypergeometric Distribution, for the number of "positive occurrences" (e.g.
successes, yes votes, etc.) given a fixed number of total occurrences, using
sampling without replacement
Beta-Binomial Distribution, for the number of "positive occurrences" (e.g.
successes, yes votes, etc.) given a fixed number of total occurrences, sampling
using a Polya urn scheme (in some sense, the "opposite" of sampling without
replacement)
VI) Related to categorical outcomes (events with K possible outcomes, with a given
probability for each outcome).

Categorical Distribution, for a single categorical outcome (e.g. yes/no/maybe in
a survey); a generalization of the Bernoulli distribution
Multinomial Distribution, for the number of each type of categorical outcome,
given a fixed number of total outcomes; a generalization of the binomial
distribution
Multivariate Hypergeometric Distribution, similar to the multinomial distribution, but using sampling without replacement; a generalization of the hypergeometric distribution

VII) Related to events in a Poisson process (events that occur independently with a
given rate)
I. Poisson Distribution, for the number of occurrences of a Poisson-type event in a
given period of time
II. Exponential Distribution, for the time before the next Poisson-type event
occurs
III. Gamma Distribution, for the time before the next k Poisson-type events occur

VIII) Related to the absolute values of vectors with normally distributed components
Rayleigh Distribution, for the distribution of vector magnitudes with Gaussian
distributed orthogonal components. Rayleigh distributions are found in RF signals
with Gaussian real and imaginary components.
Rice Distribution, a generalization of the Rayleigh distributions for where there is a
stationary background signal component. Found in Rician fading of radio signals due
to multipath propagation and in MR images with noise corruption on non-zero NMR
signals.

IX) Related to normally distributed quantities operated with sum of squares (for hypothesis testing)
Chi-Squared Distribution, the distribution of a sum of squared standard normal
variables; useful e.g. for inference regarding the sample variance of normally
distributed samples (see chi-squared test)
Student's ‘T’ Distribution, the distribution of the ratio of a standard normal variable
and the square root of a scaled chi squared variable; useful for inference regarding
the mean of normally distributed samples with unknown variance (see Student's t-
test)
F-Distribution, the distribution of the ratio of two scaled chi squared variables;
useful e.g. for inferences that involve comparing variances or involving R-squared
(the squared correlation coefficient)
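For readers who want to experiment, the sketch below (assuming SciPy is installed) evaluates a few of the distributions listed above; the parameter values are arbitrary examples:

```python
# Sketch: evaluating some of the distributions above with scipy.stats.
from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.5))      # Binomial: P(3 successes in 10 trials)
print(stats.geom.pmf(4, p=0.2))             # Geometric: first success on trial 4
print(stats.poisson.pmf(2, mu=3))           # Poisson: 2 events at rate 3
print(stats.expon.cdf(1.0, scale=1 / 3))    # Exponential: waiting time, rate 3
print(stats.norm.pdf(0.0, loc=0, scale=1))  # Normal density at the mean
print(stats.chi2.sf(3.84, df=1))            # Chi-squared tail area ≈ 0.05
```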

…………………..

CHAPTER 3

SAMPLING

Sampling
It is concerned with the selection of a subset of individuals from within the population
to estimate characteristics of the whole population. Each observation measures one or more
properties (such as weight, location, color) of observable bodies distinguished as
independent objects or individuals.

The sampling process comprises several stages:


i. Defining the population of concern
ii. Specifying a sampling frame, a set of items or events possible to measure
iii. Specifying a sampling method for selecting items or events from the frame
iv. Determining the sample size
v. Implementing the sampling plan
vi. Sampling and data collecting
vii. Data which can be selected

I. Population Definition:
 A population can be defined as including all people or items with the characteristic one wishes to understand.
 Since there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population.
 A population often consists of physical objects.
 Sampling theory treats the observed population as a sample from a larger 'super-population'.

II. Sampling Frame:
 In the ideal case, it is possible to identify and measure every single item in the population and to include any one of them in our sample.
 A sampling frame has the property that we can identify every single element and include any of them in our sample.
III. Sampling Methods:
Within any of the types of frame identified above, a variety of sampling methods can be
employed, individually or in combination. Factors commonly influencing the choice
between these designs include:
 Nature and quality of the frame
 Availability of auxiliary information about units on the frame
 Accuracy requirements, and the need to measure accuracy
 Whether detailed analysis of the sample is expected
 Cost/operational concerns

Methods of Sampling:
There are two methods of sampling:
 Probability Sampling
 Non-Probability Sampling

Probability Sampling :
 In probability sampling, every unit in the population has a chance (greater than zero) of being selected in the sample.
 This probability can be accurately determined.
 The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection.
 When every element in the population has the same probability of selection, this is known as an 'equal probability of selection' (EPS) design.
Example: we want to estimate the total income of adults living in a given street. We
visit each household in that street, identify all adults living there, and randomly select
one adult from each household. (For example, we can allocate each person a random
number, generated from a uniform distribution between 0 and 1, and select the
person with the highest number in each household). We then interview the selected
person and find their income.
People living on their own are certain to be selected, so we simply add their income
to our estimate of the total. But a person living in a household of two adults has only a
one-in-two chance of selection. To reflect this, when we come to such a household, we
would count the selected person's income twice towards the total. (The person who is
selected from that household can be loosely viewed as also representing the person who
isn't selected.)
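The household example can be put in code. The incomes below are hypothetical; the point is the inverse-probability weighting, where each selected person's income is multiplied by the household size:

```python
# Sketch of the street-income example: each adult's selection probability is
# 1/(household size), so the selected person's income is weighted by the
# household size (an inverse-probability style estimate of the total).
import random

households = [[30_000], [25_000, 40_000], [20_000, 35_000, 50_000]]  # hypothetical

estimate = 0.0
for adults in households:
    income = random.choice(adults)    # select one adult per household
    estimate += income * len(adults)  # weight = 1 / selection probability

print(estimate)  # unbiased (in expectation) estimate of total adult income
```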

Types of Probability Sampling:


1) Simple Random Sampling,
2) Systematic Sampling,
3) Stratified Sampling,
4) Cluster or Multistage Sampling.
a. Simple Random Sampling:
 All the elements of the frame are given an equal probability.
 Any given pair of elements has the same chance of selection as any other such pair
 (and similarly for triples, and so on).
 SRS can be vulnerable to sampling error.
 SRS may also be cumbersome and tedious when sampling from an unusually large
target population.

b. Systematic Sampling:
Systematic Sampling relies on arranging the study population according to
some ordering scheme and then selecting elements at regular intervals through that
ordered list.
 Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards.
 In this case, k = (population size / sample size).
 The starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the k-th element in the list.
 As long as the starting point is randomized, systematic sampling is a type of probability sampling.
For example, suppose we wish to sample people from a long street that starts in a poor area (house No. 1) and ends in an expensive district (house No. 1000). A simple random selection of addresses from this street could easily end up with too many from the high end and too few from the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th street number along the street ensures that the sample is spread evenly along the length of the street, representing all of these districts. (Note that if we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.)
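A minimal sketch of systematic sampling over the street example (house numbers and k follow the text; the implementation details are illustrative):

```python
# Sketch of systematic sampling: a random start within the first k houses,
# then every k-th house along the street.
import random

population = list(range(1, 1001))  # house numbers 1..1000
k = 10                             # k = population size / sample size
start = random.randint(0, k - 1)   # randomized start avoids low-end bias
sample = population[start::k]      # 100 houses spread evenly along the street

print(sample[:5])
```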

 Systematic Sampling is especially vulnerable to periodicities in the list. If periodicity


is present and the period is a multiple or factor of the interval used, the sample is
especially likely to be unrepresentative of the overall population, making the scheme less
accurate than simple random sampling.

 Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy.

Stratified Sampling:
i. Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata."
ii. Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.
iii. Independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample.
iv. The stratified sampling method can lead to more efficient statistical estimates.
v. Each stratum is treated as an independent population.
vi. There are, however, some potential drawbacks to using stratified sampling.
vii. Sample selection is costly and complex.
viii. It complicates the design, potentially reducing the utility of the strata.
ix. Stratified sampling can require a larger sample than other methods.

A stratified sampling approach (sketched in the code below) is most effective when three conditions are met:
i. Variability within strata is minimized
ii. Variability between strata is maximized
iii. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
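A minimal sketch of proportionate stratified sampling; the strata and sizes are hypothetical:

```python
# Sketch of stratified sampling: draw a simple random sample independently
# from each stratum, in proportion to the stratum's share of the population.
import random

strata = {
    "urban": list(range(600)),  # 60% of a population of 1000 (hypothetical)
    "rural": list(range(400)),  # 40%
}
sample_size = 50

sample = {
    name: random.sample(units, round(sample_size * len(units) / 1000))
    for name, units in strata.items()
}
print({name: len(s) for name, s in sample.items()})  # {'urban': 30, 'rural': 20}
```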
Cluster Sampling:
 Sampling is often clustered by geography, or by time periods.
 Clusters can be chosen from a cluster-level frame, with an element-level frame created only for the selected clusters.
 Cluster sampling is commonly implemented as multistage sampling, a complex form of cluster sampling in which two or more levels of units are embedded one in the other.
 The 1st stage consists of constructing the clusters that will be used to sample from.
 In the 2nd stage, a sample of primary units is randomly selected from each cluster. In each of those selected clusters, additional samples of units are selected, and so on.

……………….






CHAPTER 4

HYPOTHESIS TEST
Hypothesis Test
It is a method of statistical inference using data from a scientific study.
A result is called statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level.
The phrase "test of significance" was coined by statistician Ronald Fisher.
These tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance.
The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favour of the alternative hypothesis.
In the Neyman–Pearson framework (see below), the process of distinguishing between the null and alternative hypothesis is aided by identifying two conceptual types of errors (Type I and Type II):
                                       H0 is true (truly not guilty)   H1 is true (truly guilty)
Accept null hypothesis (acquittal)     Right decision                  Wrong decision (Type II error)
Reject null hypothesis (conviction)    Wrong decision (Type I error)   Right decision

Definition of important terms used:
 Statistical Hypothesis: A statement about the parameters describing a population (not a sample).
 Statistic: A value calculated from a sample, often to summarize the sample for comparison purposes.
 Simple Hypothesis: Any hypothesis which specifies the population distribution completely.
 Composite Hypothesis: Any hypothesis which does not specify the population distribution completely.
 Null Hypothesis (H0): A simple hypothesis associated with a contradiction to a theory one would like to prove.
 Alternative Hypothesis (H1): A hypothesis (often composite) associated with a theory one would like to prove.
 Statistical Test: A procedure whose inputs are samples and whose result is a hypothesis.
 Region of Acceptance: The set of values of the test statistic for which we fail to reject the null hypothesis.
 Region of Rejection / Critical Region: The set of values of the test statistic for which the null hypothesis is rejected.
 Critical Value: The threshold value delimiting the regions of acceptance and rejection for the test statistic.



 Power of a Test (1 − β): The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate, β. Power is termed sensitivity in biostatistics.
 Size / Significance Level of a Test (α): For a simple hypothesis, this is the test's probability of incorrectly rejecting the null hypothesis (the false positive rate). For a composite hypothesis, it is the upper bound of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate, (1 − α), is termed specificity in biostatistics.
 p-value: The probability, assuming the null hypothesis is true, of observing a result at
least as extreme as the test statistic.
 Statistical Significance Test: An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version, which is now part of statistical hypothesis testing.
 Conservative Test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
 Exact Test: A test in which the significance level or critical value can be computed exactly, i.e., without any approximation.
A statistical hypothesis test compares a test statistic against a critical value. The test statistic used is based on optimality: for a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power).
Uniformly Most Powerful Test (UMP): A test with the greatest power for all
values of the parameter(s) being tested, contained in the alternative hypothesis.
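The terms above fit together as in the following sketch (assuming SciPy is available; the observed test statistic is a hypothetical value):

```python
# Sketch: a two-sided decision rule for a standard-normal test statistic,
# using the terms defined above (test statistic, p-value, critical value,
# significance level).
from scipy import stats

alpha = 0.05  # significance level (size of the test)
z = 2.1       # observed test statistic (hypothetical value)

p_value = 2 * stats.norm.sf(abs(z))             # P(|Z| >= |z|) under H0
critical_value = stats.norm.ppf(1 - alpha / 2)  # rejection region: |z| > 1.96

print(round(p_value, 4), p_value < alpha)  # 0.0357 True -> reject H0
```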




…………………..






















CHAPTER 5

CORRELATION & REGRESSION

Correlation:
 It is a statistical technique that can show whether and how strongly pairs of variables
are related.

 It is a statistical measure that indicates the extent to which two or more variables
fluctuate together.

 Correlation can be positive or negative.

 A positive correlation indicates the extent to which those variables increase or
decrease in parallel.

 A negative correlation indicates the extent to which one variable increases as the
other decreases.

 The main result of a correlation is called the correlation coefficient (or "r").

 It ranges from -1.0 to +1.0.

 The closer r is to +1 or -1, the more closely the two variables are related.

 If r is close to 0, it means there is no relationship between the variables.

 If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller.

While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (or r squared) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point: an r of .5 means 25% of the variation is related (.5 squared = .25), and an r of .7 means 49% of the variance is related (.7 squared = .49).

Types of Correlation:
1. Positive correlation occurs when an increase in one variable increases the value
in another.

2. Negative correlation occurs when an increase in one variable decreases the value
of another.

3. No correlation occurs when there is no linear dependency between the variables.

4. Perfect correlation occurs when there is a functional dependency between the variables.

5. Strong Correlation: A correlation is stronger the closer the points are located to one
another on the line.

6. Weak Correlation: A correlation is weaker the farther apart the points are located
to one another on the line.

Calculating the Correlation:


The formula for the correlation coefficient is:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
where x̄ and ȳ are the means of the two variables.
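A short check of this formula in code, assuming NumPy and hypothetical data:

```python
# Sketch: computing r from the formula above and checking it against
# numpy.corrcoef. The data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))  # both ≈ 0.999
```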

…………………

CHAPTER 6

‘t’ Test, Z-Test, F-Test & Chi-Square Test

The ‘t’ test is also known as Student's test or Student's ‘t’ test.

The test statistic is t = (x̄ − μ) / (σ / √n), where x̄ is the sample mean of the data, n is the sample size, and σ is the population standard deviation of the data.
The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name).
Gosset had been hired due to Claude Guinness's policy of recruiting the best
graduates from Oxford and Cambridge to apply biochemistry and statistics to
Guinness's industrial processes.
The Student t-test work was submitted to and accepted in the journal Biometrika, the journal that Karl Pearson had co-founded and of which he was the Editor-in-Chief; the article was published in 1908.
 Company policy at Guinness forbade its chemists from publishing their findings, so Gosset published his mathematical work under the pseudonym "Student".

Uses:
Among the most frequently used t-tests are:
 A one-sample location test of whether the mean of a population has a value specified in a null hypothesis.
 A two-sample location test of the null hypothesis that the means of two populations are equal. All such tests are usually called Student's t-tests.
 The name Student's t-test should only be used if the variances of the two populations are also assumed to be equal.
 If the variances of the two populations are not assumed to be equal, the test used is sometimes called Welch's t-test. These are also known as "unpaired" or "independent samples" t-tests.

Assumptions:

Most t-test statistics have the form t = Z/s. The assumptions underlying a t-test are that:
 Z follows a standard normal distribution under the null hypothesis;
 s² follows a χ² distribution with p degrees of freedom under the null hypothesis, where p is a positive constant;
 Z and s are independent.

Types of ‘t’ test:


 One sample ‘t’ test
 Two sample ‘t’ test

One Sample ‘t’ test:


When only one sample is collected from the population.
In testing the null hypothesis that the population mean is equal to a specified value μ0, one uses the statistic
t = (x̄ − μ0) / (s / √n)
where:
x̄ = the sample mean,
s = the sample standard deviation of the sample,
and n = the sample size.

Two sample ‘t’ test: Two samples can be Independent or Dependent samples.

Independent (unpaired) Samples: The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. Randomization is not essential here.
This form of the test is only used when both:
 the two sample sizes (that is, the number, n, of participants of each group) are equal; and
 it can be assumed that the two distributions have the same variance.

The independent samples t statistic is calculated as:
t = (x̄1 − x̄2) / ( s_{x1x2} · √(2/n) )
where
s_{x1x2} = √( (s_{x1}² + s_{x2}²) / 2 )
Here, s_{x1x2} is the grand standard deviation (or pooled standard deviation), 1 = group one, 2 = group two, n is the number of participants in each group, and s_{x1}² and s_{x2}² are the unbiased estimators of the variances of the two samples.

Paired Samples:
Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one
group of units that has been tested twice (a "repeated measures" t-test).

The test statistic is t = (X̄D − μ0) / (sD / √n), where X̄D and sD are the average and standard deviation of the differences between all pairs. The constant μ0 is non-zero if you want to test whether the average of the difference is significantly different from μ0. The degrees of freedom used are n − 1, where n is the number of pairs.

 Each subject is tested twice. Because half of the sample now depends on the other half, the paired version of Student's t-test has only n/2 − 1 degrees of freedom (with n being the total number of observations).
 A paired samples t-test based on a "matched-pairs sample" results from an unpaired sample that is subsequently used to form a paired sample, by using additional variables that were measured along with the variable of interest.
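Assuming SciPy is available, the tests discussed in this chapter can be run as below; the two samples are hypothetical:

```python
# Sketch: the t-test variants discussed above, via scipy.stats. Each call
# returns the t statistic and the two-sided p-value.
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.8, 4.7, 5.0, 4.6, 4.9]

print(stats.ttest_1samp(a, popmean=5.0))       # one-sample t-test
print(stats.ttest_ind(a, b, equal_var=True))   # independent (pooled) t-test
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's t-test
print(stats.ttest_rel(a, b))                   # paired t-test
```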

…………………..

CHAPTER 7
DATA ANALYSIS AND MANAGEMENT INFORMATION SYSTEM

Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data.
 It converts data into information and knowledge, and
 It explores the relationships between variables.

Understanding of the data analysis procedures will help you to


 appreciate the meaning of the scientific method, hypotheses testing and statistical
significance in relation to research questions

 realise the importance of good research design when investigating research questions

 have knowledge of a range of inferential statistics and their applicability and
limitations in the context of your research

 be able to devise, implement and report accurately a small quantitative research
project

 be capable of identifying the data analysis procedures relevant to your research
project

 show an understanding of the strengths and limitations of the selected quantitative
and/or qualitative research project

 demonstrate the ability to use word processing, project planning and statistical
computer packages in the context of a quantitative research project and report

 be adept at working effectively alone or with others to solve a research question/problem quantitatively.

Considerations / Issues in Data Analysis:

There are a number of issues that researchers should be cognizant of with respect to data analysis. These include:
 Having the necessary skills to analyze
 Concurrently selecting data collection methods and appropriate analysis
 Drawing unbiased inference
 Inappropriate subgroup analysis
 Following acceptable norms for disciplines
 Determining statistical significance
 Lack of clearly defined and objective outcome measurements
 Providing honest and accurate analysis
 Manner of presenting data
 Environmental/contextual issues
 Data recording method
 Partitioning ‘text’ when analyzing qualitative data
 Training of staff conducting analyses
 Reliability and Validity
 Extent of analysis

There are numerous ways in which data analysis procedures are broadly classified; the quantitative methods outlined below illustrate this.

Quantitative Data Analysis:


a. Statistical Methods:
Multiple Regression - This statistical procedure is used to estimate the equation with
the best fit for explaining how the value of a dependent variable changes as the values of
a number of independent variables shift. A simple market research example is the
estimation of the best fit for advertising by looking at how sales revenue (the dependent
variable) changes in relation to expenditures on advertising, placement of ads, and timing
of ads.
Discriminant Analysis - This statistical technique is used for the classification of people, products, or other tangibles into two or more categories. Market research can make use of discriminant analysis in a number of ways. One simple example is to distinguish what advertising channels are most effective for different types of products.
Factor Analysis - This statistical method is used to determine which are the strongest underlying dimensions of a larger set of variables that are inter-correlated. Where many variables are correlated, factor analysis identifies which relations are strongest. A market researcher who wants to know what combination of variables or factors is most appealing to a particular type of consumer can use factor analysis to reduce the data down to the few variables that matter most.
Cluster Analysis - This statistical procedure is used to separate objects into a specific
number of groups that are mutually exclusive but that are also relatively homogeneous in
constitution. This process is similar to what occurs in market segmentation where the
market researcher is interested in the similarities that facilitate grouping consumers into
segments and is also interested in the attributes that make the market segments distinct.
Conjoint Analysis - This statistical method is used to unpack the preferences of
consumers with regard to different marketing offers. Two dimensions are of interest to
the market researcher in conjoint analysis: (1) The inferred utility functions of each
attribute, and (2) the relative importance of the preferred attributes to the consumers.
Multidimensional Scaling - This category represents a constellation of techniques used to produce perceptual maps of competing brands or products. For instance, in multidimensional scaling, brands are shown in a space of attributes in which distance between the brands represents dissimilarity.
between the brands represents dissimilarity. An example of multidimensional scaling in
market research would show the manufacturers of single serving coffee in the form of "K-
cups." The different K-cup brands would be arrayed in the multidimensional space by
attributes such as strength of roast, number of flavored and specialty versions,
distribution channels, and packaging options.
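As an illustration of the multiple-regression idea above, here is a least-squares sketch using NumPy; all data are hypothetical stand-ins for the advertising example:

```python
# Sketch: least-squares fit of sales ~ ad spend + placement + timing,
# echoing the multiple-regression example above. Data are hypothetical.
import numpy as np

spend  = np.array([10.0, 12.0, 15.0, 18.0, 20.0])   # ad expenditure
place  = np.array([1.0, 2.0, 2.0, 3.0, 3.0])        # placement score
timing = np.array([0.0, 1.0, 0.0, 1.0, 1.0])        # timing indicator
sales  = np.array([100.0, 118.0, 132.0, 158.0, 170.0])  # dependent variable

X = np.column_stack([np.ones_like(spend), spend, place, timing])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)  # best-fit coefficients

print(coef)  # intercept plus one coefficient per independent variable
```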

Primary or Secondary Sources:


1. What is a Primary Source?
 A document or record containing first-hand information or original data on a topic
 A work created at the time of an event or by a person who directly experienced an
event
 The data are fresh, i.e., collected first-hand.
Some examples include: interviews, diaries, letters, journals, original hand-written
manuscripts, newspaper and magazine clippings, government documents, etc.

2. What is a Secondary Source?


 Any published or unpublished work that is one step removed from the original
source, usually describing, summarizing, analyzing, evaluating, derived from, or
 based on primary source materials
 A source that is one step removed from the original event or experience
 A source that provides criticism or interpretation of a primary source
Some examples include: textbooks, review articles, biographies, historical films, music
and art, articles about people and events from the past
Methods of Primary Data Collection:
1. Observation Method
2. Interview Method
3. Questionnaire Methods
4. Schedule Methods

Other Methods:
a. Mechanical Devices
b. Projective Techniques
c. Depth Interviews

Observation Method:
Information is sought by way of the investigator's own direct observation, without asking the respondents.
 It is the most commonly used method.
 It is generally used in studies relating to behavioural sciences.
 It is systematically planned and recorded.
 Its reliability depends on checks and controls.
 Subjective bias is eliminated.
 The information obtained under this method relates to what is currently happening.
 It is independent of respondents' willingness to respond.
 It is an expensive method.
 The information provided by this method is very limited.

………………………………

SAMPLE QUESTIONS
1. The level of significance is the probability of committing:
(a) Type I error (b) Type II error (c) standard error (d) probable error
Ans. A
(Dec. 2008, Paper-II)

2. A researcher wants to test the significance of the differences of the average performance of more than two sample groups drawn from a normally distributed population. Which one of the following hypothesis tests is appropriate?
(1) Chi- square test (2) F – test (3) z- test (4) t- test
Ans. 2
(July 2016, Paper-II)

3. When a researcher wants to test whether two samples can be regarded as drawn from normal populations having the same variance by using the variance ratio, which one of the following tests of hypothesis is appropriate?
(1) Z- test (2) t- test (3) Chi- square test (4) F- test
Ans. 4
(July 2016, Paper-III)

4. The collective set of tools and techniques used to develop a quality assurance system when business processes show variations is known as
(A) Quality Assurance Process
(B) Quality Management System
(C) Statistical Quality Control
(D) Statistical Process Control

Ans. D
(Dec. 2014, Paper-III)

