
GEO 314 Lecture notes

LECTURE 1
STATISTICS
Statistics is the science of data. It is the science of organizing, describing and analyzing quantitative data. It also
refers to the indices which are derived from data through statistical procedures e.g. mean, standard deviation,
correlation coefficient etc.

Importance of Statistics in Geography


1. For interpretation of findings and making of predictions e.g. population growth, forest degradation, climate
change or economic development.
2. Geography deals with inference (deductions) which implies use of a set of methods to enable conclusions to be
made about a population from a subset of its data or sample.
3. We use statistics to derive mathematically, the nature of relationships between variables. E.g. the use of statistics
to analyze the relationship between rainfall and crop yield.
4. Enables geographers to conduct investigation of changes of phenomena in time e.g. soil erosion changes with
time.
5. To summarize volumes of data and make valid generalizations, conclusions, and inferences

The role of Statistical Methods in Geographical Studies


1. Techniques enable geographers to handle large quantities of numerical figures by summarizing them in a form
that can be easily described and explained.
2. Statistical methods make it possible for geographers to establish precisely the nature, form and degree of the
relationship between spatially covariant phenomena such as the increase of cost of travel with increase in
distance.
3. The techniques minimize subjective judgment and increase objectivity and precision in explaining the spatial
patterns of geographical distributions and relationships.
4. They help geographers to relate geographical studies to other scientific subjects using statistical methods.
5. Statistical methods play an important role in connecting geographical models, theories and concepts with the
physical and human environment in which one lives.

Basic Concepts of Research Involving Statistical Analyses


Population/universe: is the entire set of objects and events or groups of people which is the subject of the research
and about which the research aims to determine some characteristics. It is the aggregate of all that conforms to a
given specification. Examples: all standard eight pupils in a country, all indigenous trees in a National Park, all
diabetes patients in a province, etc.
Population can be finite (such as number of students in a class) or infinite (e.g. number of stars in the sky).
It is often impractical to generalize results to the absolute population/universe. It is also often impractical to select a
representative sample from the target population because:
i. It may be difficult to identify individual members.
ii. The members of a given population may be large in number, or scattered over a large geographical area.
iii. It might be costly in terms of money and resources for locating a representative sample.
Due to these reasons, researchers therefore draw samples from an accessible population which is normally defined
and manageable. This however creates a high likelihood of losing the generalizability of the results and therefore
researchers must try to demonstrate that the accessible population is in itself comparable to the target population in
characteristics that appear most relevant to the study. Examples of accessible population are: All standard eight
pupils in a district/ division, all diabetic patients in Nairobi Hospital.
Variable: is the characteristic that is being measured. It assumes different values among the subjects. Examples:
height, weight, number of houses, time, etc. There is no limit to the number of variables one may have in any
statistical analysis. Variables can be discrete or continuous.
Discrete variables: are variables that have only a countable number of distinct possible values. That
is, they can assume only a finite number of values, or as many values as there are integers, and have no
intermediary values.
Some variables, such as the number of children in a family, the number of car accidents on a certain
road on different days, or the number of students taking a basic statistics course, are the results of
counting and thus these are discrete variables. Typically, a discrete variable is a variable whose
possible values are some or all of the ordinary counting numbers like 0, 1, 2, 3, . . . .
Continuous variables:
Quantities such as length, weight, or temperature can in principle be measured arbitrarily accurately.
There is no indivisible unit. Weight may be measured to the nearest gram, but it could be measured
more accurately, say to the tenth of a gram. What we mean is that for any two measurements, no
matter how close together they are, a third measurement can always be found which lies between the
first two if a more precise measurement is taken.
Dependent and Independent Variables:
Independent variables are those that influence the situation, while dependent variables are those that
fluctuate in response to changes in the independent variables. Fluctuations of the independent
variable are exogenous to the experiment.
Parameters: are specific values or qualities that relate to the population. Like variables, a parameter is a
characteristic that is measurable and assumes different values in the population. The difference is that a parameter
relates to a population characteristic, while a variable relates to a characteristic of a sample drawn from the population.
Sampling: is the process of selecting a number of individuals for a study in such a way that the individuals selected
represent the large group from which they were selected.
Theory: is a set of concepts or constructs and the interrelations that are assumed to exist among those concepts. It
provides the basis for establishing the hypothesis to be tested in the study. A sampling theory is the study of the
relationship between a population and the sample drawn from it.
A sample: is a small part chosen from a large population and is thought to be a representative of the large
population
-It is a subset of the whole population which is actually investigated and whose characteristic would be generalized
to the entire population.
A sample could constitute every 5th car passing through a point, every 10th house along a street, or 100 industries
selected from a list of industries in a town. In this sample; a car, a house, or an industry constitute an element or
unit of analysis.
Element is the individual entity/smallest sampling unit about which measures are to be obtained.
Measurement is the conversion of characteristics of a phenomenon into symbols (numbers), i.e. 'quantifiable
properties'. The measurement process is directed at an individual entity (the smallest sampling unit) about
which measures are to be obtained.
Sample Statistics/Statistics are the corresponding values or quantities that relate to the sample. Statistics are thus
estimates of population parameters.
Data are all the information a researcher gathers for his/her study. It can be: Primary data, Secondary data,
quantitative, or qualitative.
Data come from making observations either on a single variable or simultaneously on two or more variables. Data
can also be categorized as:
i. Univariate data: observations on a single variable
ii. Bivariate data: observations on two variables e.g. (x; y) =(height, weight) of a student
iii. Multivariate data: observations on more than two variables e.g. (x; y; z) = (height, weight, gender) of a
student
A sampling Frame
It is a list of the entire population from which items can be selected to form a sample, i.e. from which a sample is
drawn. Public records such as telephone directories, lists of tax payers, lists of farmers, students, churches, game
reserves, etc. form different types of sampling frames. It is required that all the elements of the population have an
equal chance of being included in the sample. When the chance of each element of a population being included in a
sample is known, the sampling is called probability or random sampling.
Hypothesis: - A hypothesis is a researcher’s prediction regarding an outcome of a study. It states possible
differences, relationships, or causes between two or more variables or concepts. Hypotheses are derived from
existing theories, previous research, personal observations or experiences.
The whole research study revolves around hypothesis and therefore we will dedicate lessons 6, 7 and 8 discussing
hypotheses, including the various techniques involved in hypothesis testing.

Objectives: - An objective is any kind of desired end or condition. In research, it refers to the specific aspects of the phenomenon
under study that the researcher desires to bring out at the end of the research study.

Steps in Statistical Approach to a Problem


1. Data collection
2. Data organization
3. Data analyses
4. Data interpretation
5. Prediction; if required

LECTURE 2
SCALES/LEVELS OF MEASUREMENT
There are FOUR levels of measurement used in any form of geographic research. These are:
1. Nominal Scale
Also referred to as nominal classification, this involves categorization without numerical evaluation. The scale is
used only when we classify observations into mutually exclusive categories with no overlap. Members of the
population are classified according to their similarities. It is the weakest level of measurement, classifying objects
by simply naming different categories; for example sex, which is classified into male and female; or region, tribe,
climate, race, cultural group, towns and farms, etc.

2. Ordinal Scale
This is a scale of measurement that gives the order of magnitude of the data although absolute values are not
known. We assign rank order to a set of qualitative expressions such as good, average, poor, heavier than, longer
than, etc. We also use this scale for broader concepts, e.g. stability, development, power, innovation, etc. It gives
neither the size of the intervals between the scores nor the ratio by which one unit is higher than another. The
intervals between successive points on the ordinal scale should be distinguished on the basis of some criterion such
as size or distance and then assigned ranks. For example, towns in Kenya can be assigned rank orders according to
their population sizes as follows:
Town Rank
Nairobi 1
Mombasa 2
Kisumu 3
Nakuru 4
Characteristics of Ordinal Scale
1) It is asymmetric and transitive: if X > Y, then Y is not greater than X (asymmetry); and if A is larger than B
and B is larger than C, then A is larger than C (transitivity).
2) It presumes that the variable measured has an underlying continuous distribution even though that distribution
is not actually measured (e.g. soil categories).
3) It treats clusters of units as if they are ties. A study which classifies regions as low rainfall and high rainfall, or
arid and humid, is an example. It thus assumes that all parts within a region receive an equal amount of rainfall
and that the variation exists between the regions; the more ties the measurement produces, the less
sensitive it is.
4) The FOUR basic arithmetic operations (+, -, ×, and ÷) cannot meaningfully be performed on the ranked data;
only comparisons of order are valid.

3. Interval Scales
In an interval scale, not only are the observations ordered but the magnitude of the difference separating any two
observations along the measurement scale is also known. It measures the distance from one value to the next. For
example:

Town      Rank   Population
Nairobi    1     4m
Mombasa    2     2m
Kisumu     3     1m
Nakuru     4     0.5m

4. Ratio Scale
Ratio scale is the highest level of measurement, possessing all the properties of the nominal, ordinal and interval
scales. It represents a further refinement in quantitative measurement as it has a defined, non-arbitrary
(true) zero point, a characteristic that permits certain comparisons between scale values that are not possible with
the interval scale. It is an interval scale with the additional property that its zero position indicates the absence of
the quantity being measured. In this scale, a variable can be expressed as a proportion of the total. Most ordinary
counting belongs to this category.
Under this scale, addition, subtraction, multiplication, division, ratios, square roots, etc. can all be performed.
For example, if a hill is 1000 metres high, it can be said that the hill is twice as high as one that is 500 metres high;
but we cannot say that a place at 20°C is twice as hot as a place at 10°C. The physical sciences are better equipped
when it comes to the ratio scale. Another example of a ratio scale is the amount of money you have in your pocket
right now (25 cents, 55 cents, etc.). Money is measured on a ratio scale because, in addition to having the properties
of an interval scale, it has a true zero point: if you have zero money, this implies the absence of money. Since
money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as
someone with 25 cents (or that Bill Gates has a million times more money than you do).
Application: weight, mass, velocity, etc. For example, if vehicle A moves at 90 km/hr and B at 30 km/hr, then A is
three times as fast as B, so the ratio is 3:1.

Summary properties of the four scales of measurement


Property                        Nominal   Ordinal   Interval   Ratio
Categorization/Classification      √         √         √         √
Ordering                           ×         √         √         √
Distance                           ×         ×         √         √
Non-arbitrary Zero Point           ×         ×         ×         √
Overlap                            ×         ×         ×         ×

Statistics applicable to different measurement scale


Scale      Characteristics                                        Appropriate Statistical Tests
Nominal    Subjects classified under a common characteristic      Measures of central tendency: mode
Ordinal    Subjects classified in order; numerals reflect         Measures of central tendency: mode, median
           increasing amounts of the attribute but not at
           equal intervals
Interval   Numerals reflect increasing amounts of the attribute   Measures of central tendency: mode, mean, median
           with equal intervals; a true zero point does not       Measures of dispersion: range, STDEV, variance
           exist (the zero is arbitrary)
Ratio      Numerals reflect increasing amounts of the attribute   Measures of central tendency: mode, mean, median
           with equal intervals; a true zero point exists         Measures of dispersion: range, STDEV, variance

********************************************
SAMPLING
Sampling is the use of a subset of the population to represent the whole (total population). The total population is the
total collection of units, elements or individuals that one is interested in analyzing.
Sample:

The group of units selected from a larger group (the population) that is studied with the hope of drawing valid conclusions
about the larger group. An ideal sample is therefore one whose characteristics correspond to, or reflect, those of the
original population. To ensure representativeness, the sample may be either random or stratified depending upon
the conceptualized population and the sampling objective. If the sample is representative of the population,
important conclusions about the population can often be inferred from the analysis of the sample.
Randomness
Randomness is an important concept that is used to justify much of what we say in research. A random selection
gives each individual, or item a calculable chance of being selected or included in the final sample and should not
be confused with a haphazard selection. Randomness is closely connected with the concept of chance and
probability and implies a lack of predictability.
Reason for Sampling:
1. Samples can be studied more quickly than a population, because the sample is small while the population,
if studied in its entirety, is large.
2. A study of a sample is less expensive than studying an entire population, since only a small portion of the
population is studied (especially if the study has a lengthy follow-up time).
3. A study of the entire population (a census) is impossible in most situations. Sometimes the process of
studying destroys or depletes the items under study.
4. Sample results can be more accurate than results based on the whole population, because resources can be
concentrated on more careful measurement procedures that improve accuracy.
5. If the samples are properly selected, probability methods can be used to estimate the errors in the resulting
statistics.

Sample size
An appropriate sample size is required for validity. If the sample size is too small, it will not yield valid results and
the results will be questionable. An appropriate sample size produces accurate results, while a sample size that is
too large wastes money and time.

Sampling can be done either probabilistically or non-probabilistically:

a) Probability sampling:
The best way to ensure that a sample will lead to reliable and valid inferences is to use a probability sample.
A probability sampling method is any method of sampling that uses some form of random selection. In order to have a
random method, you must set up some process or procedure that ensures that the different units in your population
have a calculable probability, or chance, of being selected.

Most common type of probability sampling:


1. Simple random sampling
2. Stratified random sampling
3. Systematic random sampling
4. Cluster sampling
5. Multistage sampling

1. Simple random sampling


In simple random sampling we select a sample such that each individual is chosen randomly in a manner that each
member of the population has an equal chance of being included in the sample.
Every possible sample of a given size has the same chance of selection; that is, each member of the population is
equally likely to be chosen at any stage in the sampling process.
The recommended way to select a simple random sample is to use a table of random numbers.
Steps;
1) Compile a sample frame
2) Decide on the sample size
3) Randomly point at any number from the table of random numbers (Appendix) and include the element
assigned that number in the sample.

4) Progress through the table in any direction, downwards or upwards. For example, if our sample frame had 12000
elements and you first point at 0555, then decide to proceed down the column, and the next two numbers
are 84638 and 7527: 84638 is ignored because it lies outside the range 1-12000, while 7527 falls within the
range and is selected.
5) Once done with one column, move on top of the next column on the right. All the cases within the required
range are selected until the required sample size is achieved.
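A minimal sketch of these steps in Python. Here random.sample plays the role of the random-number table, and the sampling frame of 12,000 element IDs is hypothetical:

```python
import random

random.seed(42)  # fix the seed so the draw is reproducible

# Hypothetical sampling frame: element IDs 1..12000
frame = list(range(1, 12001))
sample_size = 500

# random.sample draws without replacement, so every element of the
# frame has an equal chance of inclusion: a simple random sample
sample = random.sample(frame, sample_size)
print(sample[:10])  # the first ten selected element IDs
```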

2. Stratified random sampling


Usually the total population is heterogeneous and so there often exist factors which divide up the population into
homogeneous subpopulations (groups/strata). This has to be accounted for when we select a sample from the
population to ensure the selected sample represents the population without omission of any minority. This is
achieved through a stratified sampling method. A stratified sample is obtained by taking samples from each stratum
or sub population.
For example, suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists
of Ayrshire, Friesian, Galloway and Jersey cows. It is possible to subdivide the cows into four subgroups
according to cow type, then select a sample with a proportion drawn from each of the four groups (see the sketch below).
Steps:
1) Identify the population and compile the sample frame
2) Determine the sample size
3) Define the criterion for stratification
4) List the population according to the defined strata or subgroups.
5) Determine the required sample size and the appropriate representation in each stratum.
6) Select, using random numbers, an appropriate number of subjects for each stratum.
Advantage: it ensures the inclusion in the sample of subgroups which would otherwise be omitted entirely by other
sampling methods because of their small numbers in the population.
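A sketch of proportional stratified sampling for a hypothetical herd; the stratum sizes below are illustrative, not from the text:

```python
import random

random.seed(1)

# Hypothetical strata: cow type -> list of animal IDs
herd = {
    "Ayrshire": [f"A{i}" for i in range(40)],
    "Friesian": [f"F{i}" for i in range(80)],
    "Galloway": [f"G{i}" for i in range(20)],
    "Jersey":   [f"J{i}" for i in range(60)],
}

total = sum(len(cows) for cows in herd.values())  # 200 animals
sample_size = 50

sample = []
for cow_type, cows in herd.items():
    # allocate each stratum a share of the sample proportional to its
    # share of the population (rounding may need a small adjustment in
    # general; with these stratum sizes the shares are exact)
    n_stratum = round(sample_size * len(cows) / total)
    sample.extend(random.sample(cows, n_stratum))

print(len(sample))  # 50: 10 Ayrshire, 20 Friesian, 5 Galloway, 15 Jersey
```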

3. Interval/Systematic random sampling


Systematic sampling, which is sometimes called interval sampling, means that there is a gap, or interval, between
each of the selected elements. This technique requires that the first item is selected at random as a starting point, and
thereafter selection is done at a given predefined interval. If the researcher wants to select a fixed-size sample, then it is
necessary to know the whole population size from which the sample is being selected and then calculate the
appropriate sampling interval as follows:
Sampling interval is the distance between the cases that are selected for the sample.

Sampling interval = (Total population (N))/ (Sample size (n))


For example, if a systematic sample of 500 children is to be selected from a population of 10,000 children, then the
sampling interval will be:
K = N/n = 10000/500 = 20
Thus a systematic random sample is one in which every kth item is selected.
Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame. For example, it
is not appropriate for selecting months of the year in studying the frequency of some climatic conditions, because
some conditions occur more often at certain times of the year.
Steps:
1. Get the sampling frame/list the total population
2. Determine the sample size
3. Determine the sampling interval by dividing the total population by the sample size
4. Blindly select the starting point from the table of random numbers,
5. Repeat the procedure by picking every subsequent kth number (where k is the sampling interval) until
the desired sample size is obtained
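A minimal sketch of this procedure for the 10,000-children example:

```python
import random

random.seed(7)

N = 10_000           # total population size
n = 500              # desired sample size
k = N // n           # sampling interval: 10000 / 500 = 20

frame = list(range(N))               # hypothetical frame of element IDs

start = random.randrange(k)          # random start within the first interval
sample = frame[start::k]             # then every k-th element thereafter

print(k, len(sample))                # 20 500
```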
4. Cluster sampling
Cluster sampling is used when natural groupings are evident: the entire population is divided into subgroups or
clusters. Elements within a cluster are often heterogeneous, but the clusters themselves are homogeneous with
respect to one another; hence, preferably, each cluster should be a small representation of the total population.
Cluster sampling can be conducted in the form of area sampling, or with respect to geographical demarcation.

The main difference between cluster sampling and stratified sampling is that in cluster sampling the cluster is
treated as the sampling unit, so analysis is done on the population of clusters, whereas in stratified sampling the
analysis is done on elements within each stratum. Examples: microclimates of a region, soils, water quality.

Advantage
1. Cluster sampling is cheap,
2. Easy and is particularly useful when it is difficult or costly to develop a sampling frame or when the
population elements are widely dispersed geographically.
Disadvantage
Cluster sampling increases sampling error since elements in a cluster are likely to be similar in many aspects.

5. Multistage sampling
This is a complex form of cluster sampling in which, after clustering, instead of using all the elements contained in the
selected clusters, we randomly select elements from each cluster. Constructing the clusters is the first stage, and
selecting which elements within each cluster to use is the second stage. Example: sampling 1 province out of 8, then
sampling a county within it, then a sub-county.

b) Non probability sampling:


Non-probability sampling is any sampling method where some elements of the population have no chance of
selection (they are sometimes referred to as 'out of coverage' or 'undercovered'), or where the probability of selection
cannot be accurately determined. It involves the selection of elements based on assumptions regarding the
population of interest, which form the criteria of selection.

The difference between non-probability and probability sampling is that non-probability sampling does not involve
random selection while probability sampling does.
Main types of non-probability sampling are:
1. Convenient/Opportunity/Accidental sampling
2. Purposive/Judgmental sampling
3. Quota Sampling
4. Snowball Sampling

1. Convenient/Opportunity/Accidental sampling
Convenience sampling is also called volunteer or grab sampling. It is a sampling method in which the sample is
drawn from that part of the population which is close at hand, i.e. a sampling unit is selected because it is readily
available, easy to reach and convenient. With this sampling technique one cannot make generalizations about the
entire population. The technique is of great use during pilot testing.
Sometimes the sample is accessed through contacts or gatekeepers.

2. Purposive/Judgmental sampling
Judgmental/purposive sampling starts with a purpose in mind and the sample is thus selected to include people of
interest and exclude those who do not suit the purpose. It involves selecting a group of people because they have
particular traits that the researchers want to study. For example, judgmental sampling is employed while studying
the behavior of consumers of a particular product or service, in a market research.

3. Quota Sampling
In quota sampling we first segment the population into mutually exclusive segments, just as in stratified
sampling, but in this case judgment is used to select the subjects or units from each segment based on some
specified proportion. This technique may be biased since not everyone gets a chance of selection; this lack
of randomness is its greatest weakness. It is widely used in opinion polls and market research,
in which case the interviewers involved are given a quota of subjects with specified characteristics whom they
attempt to reach.
The researcher decides how many of each category are selected.
For example,

1. An interviewer might be told to go out and select 20 male smokers and 20 female smokers so that they
could be interviewed about their health and smoking behavior.
2. An interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This
means that the researcher can put a demand on who they want to sample (targeting).
 The first step in non-probability quota sampling is to divide the population into exclusive subgroups.
 Then, the researcher must identify the proportions of these subgroups in the population; this same
proportion will be applied in the sampling process.
 Finally, the researcher selects subjects from the various subgroups while taking into consideration the
proportions noted in the previous step.
 The final step ensures that the sample is representative of the entire population. It also allows the researcher
to study traits and characteristics that are noted for each subgroup.

4. Snowball Sampling
This is a sampling design where existing study subjects recruit future subjects from among their acquaintances.
It may be extremely biased, since sample members are not selected from a sampling frame and the process is
respondent-driven. It can, however, allow the researcher to make estimates about the social network connecting a
hidden population.
This sampling method involves two main steps
a) Identify a few key individuals
b) Ask these individuals to volunteer to distribute the questionnaires to people they know who fit the criteria of
the desired sample, or
c) Ask them to suggest acquaintances who meet the selection criteria, to whom the researcher then administers
an interview

SAMPLING AND NON-SAMPLING ERRORS


Inaccuracies in a sample are of two types: systematic bias and sampling error.
Systematic Bias
Systematic bias results from errors in the sampling procedures such as:
1) Inappropriate sampling frame
2) Natural bias in the reporting data
3) Non respondents
4) Bias in the instruments of collection
Systematic bias cannot be reduced or eliminated by increasing the sample size.
Sampling error
These are random errors that result from a range of non-determinable factors such as
1) Heterogeneity of the population
2) The small sample size
Sampling error can be reduced or eliminated by:
1) Increasing the sample size
2) Lowering the confidence level

LECTURE 3 - 5
DESCRIPTIVE AND INFERENTIAL STATISTICS
Most geographical studies are quantitative in nature and can be divided into the following main functions
i. Descriptive statistics: - Are indices that describe a given sample. They help express a set of data,
composed of individual values that vary from one another to a greater or lesser extent, as a small set of
summary figures.
Descriptive statistics give information that describes the data in some manner, i.e. consist mainly of
methods for organizing and summarizing information.
For example, suppose a pet shop sells cats, dogs, birds and fish. If 100 pets are sold and 40 out of the 100
were dogs, then one description of the data on the pets sold would be that 40% were dogs.

This same pet shop may conduct a study on the number of fish sold each day for one month and determine
that an average of 10 fish were sold each day. The average is an example of descriptive statistics.
Some other measurements in descriptive statistics answer questions such as:
- How widely dispersed is this data?
- Are there a lot of different values? or
- Are many of the values the same?
- What value is in the middle of this data?
- Where does a particular data value stand with respect to the other values in the data set?
Examples of descriptive techniques are: measures of central tendency (mean, mode, median), measures of
dispersion (range, standard deviation, and variance), distributions (e.g. percentages, frequencies) and
relationships (e.g. correlations).

A graphical representation of data is another method of descriptive statistics. Examples of this visual
representation are histograms, bar graphs and pie charts, etc. Using these methods, the data is described by
compiling it into a graph, table or other visual representation.
This provides a quick method to make comparisons between different data sets and to spot the smallest and
largest values and trends or changes over a period of time. If the pet shop owner wanted to know what type
of pet was purchased most in the summer, a graph might be a good medium to compare the number of each
type of pet sold and the months of the year.

ii. Inferential Statistics: - Most geographers deal with data obtained from samples, and it is assumed that the
sample is representative of the total population. Inferential statistics therefore enable the geographer, within
certain defined limits, to make statements about the characteristics of a population based on the data
collected from a sample.
Inferential statistics consist of methods for drawing, and measuring the reliability of, conclusions about a
population based on information obtained from a sample of the population.
We can therefore say, Descriptive statistics uses the data to provide descriptions of the population, either through
numerical calculations or graphs or tables while inferential statistics makes inferences and predictions about a
population based on a sample of data taken from the population in question. One of the most commonly used
inferential techniques is hypothesis testing. We will cover various techniques of hypothesis testing in detail later in
this course.

DESCRIPTIVE STATISTICS
There are two methods of describing data; graphical and numerical. Some of the basic graphical methods are
histograms, frequency polygons, ogives while Numerical methods include frequency distributions, measures of
central tendencies, and measures of variability and dispersion. Basically all data arrays and tabulations fall under
numerical methods.

a. FREQUENCY DISTRIBUTIONS
Statistical techniques can be used to process a mass of figures relating to a single variable, e.g. kilometers, so that
some significant meaning can be extracted from them. To do this, the first step is to build up a distribution of
frequencies of the occurrence of the figures; that means, for example, combining the distances into relatively
smaller number of categories/classes and examine the number of kilometers falling into each class.
Reasons for constructing Frequency Distributions include:
1. To facilitate the analysis of data.
2. To estimate frequencies of the unknown population distribution from the distribution of sample data and
3. To facilitate the computation of various statistical measures

Discrete and Continuous Frequency Distributions


a) Discrete (or) Ungrouped frequency distribution:
In this form of distribution, the frequency refers to discrete value. Here:
 The data are presented in a way that exact measurements of units are clearly indicated.
 There are definite differences between the variables of different groups of items.

 Each class is distinct and separate from the other class.
 Non-continuity from one class to another class exists.
Examples of Discrete Data are such facts like the number of rooms in a house, the number of companies registered
in a country, the number of children in a family, etc.

b) Continuous frequency distribution:


This form of distribution refers to groups of values. It becomes necessary in the case of variables which
can take any fractional value, so that an exact listing of each distinct measurement is not possible. Hence such a
variable is presented in the form of a continuous (grouped) frequency distribution.

We can also say that Discrete class is data that increases in jumps or whole numbers e.g. number of children in a
family can be; 0,1,2,4, and cannot be 2.5, 4.5, 0.2. Continuous data on the other hand is data that increases
continuously such as kilometers traveled; can be 60.8, 900.6, etc.

Key terms:
A frequency is the number of times a given datum occurs in a data set
Raw Data: - This is data that has been collected but not yet re-organized or re-arranged. The first step in making raw
data more meaningful is to re-list the figures in order of size so that they run from the lowest to the highest, i.e.
into an array. An array can be simplified further by listing each repeated figure only once with the number of times it
occurs written alongside it. This is called an ungrouped frequency distribution. The sum of the frequencies (f) must
equal the total number of items making up the raw data.
Class Limit: - Are the extreme boundaries of a class such as 401 – 450. Care must be taken when defining class
limits to avoid overlapping of classes or wide gaps between classes.
Class Interval: - Is the width of the class i.e. the difference between class limits.
Percentages: Is a proportion of a subgroup to the total group or sample. It ranges from 0% to 100%. Percentages
are important especially if there is a need to compare groups that differ in size.
Lower class limit of a class is the smallest value within the class.
Upper class limit of a class is the largest value within the class.
Class midpoint is found by adding a class’s lower class limit and upper class limit and dividing the result by 2.
Class boundaries are the numbers which separate classes. They are equally spaced halfway between neighboring
class limits.
Class width is the difference between two class boundaries (or corresponding class limits).
Grouping the data involves grouping of the arrays into classes after choosing an appropriate class limit. A group
of such classes together with their frequencies is called a grouped frequency distribution. Grouping helps us to
combine scores into smaller categories. It can be necessitated by:
 When scores are distributed in such a way that certain scores are not obtained by any subject
 When the samples are very large
 When information sought is sensitive; e.g. annual income queries in a questionnaire.

Absolute and relative distributions


A relative frequency is the fraction of times an answer occurs.
Relative Frequency = (Number of successful trials) / (Total number of trials)

Example 1
Your team has won 9 games from a total of 12 games played:
- the Frequency of winning is 9
- the Relative Frequency of winning is 9/12 = 75%

Example 2: Travel Survey
92 people were asked how they got to work:
- 35 used a car
- 42 took public transport
- 8 rode a bicycle
- 7 walked
The Relative Frequencies (to 2 decimal places) are:
- Car: 35/92 = 0.38
- Public Transport: 42/92 = 0.46
- Bicycle: 8/92 = 0.09
- Walking: 7/92 = 0.08
0.38 + 0.46 + 0.09 + 0.08 = 1.01
All the Relative Frequencies add up to 1 (except for any rounding error).

Example 3: class exercise
What kind of music do you like?
- 10 people like pop music
- 27 people like dance
- 17 people like hip-hop / rap
- 20 people like rhumba
Calculate the relative frequencies.
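The travel-survey relative frequencies in Example 2 can be reproduced in a few lines of Python (a minimal sketch using the counts from the example):

```python
counts = {"Car": 35, "Public Transport": 42, "Bicycle": 8, "Walking": 7}
total = sum(counts.values())  # 92 respondents

for mode, f in counts.items():
    # relative frequency = frequency / total number of trials
    print(f"{mode}: {f}/{total} = {f / total:.2f}")

# The rounded values sum to 1.01; the exact fractions sum to 1.
```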

Cumulative Frequency Distribution


In addition to the number of cases in each class, an indication of the number of cases that lie below or above
a given value is often necessary. The data in a cumulative frequency distribution can be converted into
percentages in another column.
Cumulative frequency table: displays the total number of observations less than or equal to the category. A
cumulative relative frequency table displays the aggregate proportion (or percent) of observations less than or
equal to the category.
Construct a Cumulative Frequency Table and Cumulative Relative Frequency Table for the following data:
Twenty students were asked how many hours they worked per day. Their responses, in hours, are listed below:
5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3
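A minimal sketch of the construction for the hours-worked data: tally the frequencies, then accumulate the counts and the proportions to obtain the cumulative and cumulative relative frequency columns:

```python
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

freq = Counter(hours)   # frequency of each distinct value
n = len(hours)          # 20 students

cum = 0
print("hours  f  cum_f  cum_rel_f")
for value in sorted(freq):
    cum += freq[value]  # running total of observations <= value
    print(f"{value:5d} {freq[value]:3d} {cum:5d} {cum / n:9.2f}")
```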

Graphical Representation of Frequency Tables


Choice of a Graph
i. Should depend on the data to be represented, e.g. for discrete variables, bar charts are better
ii. Frequency polygons are advantageous because several polygons can be superimposed with less crossing of
lines. This makes it easy to make comparisons. Frequency polygons also give a much better conception of the
trend of the distribution than histograms.
iii. If sample sizes are different, it is important to change frequencies to relative frequencies, that is, in terms of
proportions or percentages where frequencies are expressed as proportions of 1 or 100 respectively.
iv. Histograms are easier to understand especially when only one distribution is being presented.

a. Histograms
These are graphs of a frequency distribution constructed on the basis of a horizontal axis with a continuous scale
running from one extreme end of the distribution to the other. For each class in the distribution, a vertical rectangle
is drawn with its base on the horizontal axis extending from one class limit to the other limit and its area or height is
proportionate to the frequencies in the class. The vertical axis is labeled ‘frequency’. The difference between a
histogram and a bar chart is that in bar charts, spaces are left between the bars to signify a lack of continuity or
flow between the categories.

A relative frequency histogram compares each class interval to the Relative frequency (decimal or %). A relative
frequency histogram has the same shape and the same horizontal scale as the corresponding frequency histogram.
The difference is that the vertical scale measures the relative frequencies (measured as a percentage), not
frequencies.

b. Frequency Polygon
Procedure for construction:
1. Construct a histogram or if not required, construct a table of grouped frequency distribution.
2. Mark the mid-point of the top of each rectangle in the histogram or the mid-point of the class limits.
3. Join the mid-points with straight lines or a smooth curve.
If the mid-points of the bars/rectangles are connected with a smooth curve, the result is a frequency curve.
The curve should begin and end at the baseline/x-axis.
c. Ogives
An ogive, sometimes called a cumulative line graph, is a line that connects points showing the cumulative
percentage of observations below the upper limit of each class in a cumulative frequency distribution. It is the curve
obtained when the cumulative frequencies of a distribution are graphed. Procedure:

1. Compute the cumulative frequencies of the distribution.
2. Prepare a graph with the cumulative frequencies on the vertical axis and the class limits or the mid points on the
horizontal.
3. Plot the starting point at zero on the vertical scale and at the lower class limit of the 1st class
4. Plot cumulative frequencies on the graph at the upper class limit of the classes to which they refer.
5. Join all the points.

Example:
The following rainfall data was collected at a meteorological station over a period of time
23 22 23 24 21 20 19 28
22 19 22 22 22 26 15 19
24 21 20 22 26 25 10 22
22 17 25 23 19 21 22 21
22 23 22 24 20 26 23 23

Construct an absolute, relative and cumulative frequency distribution table for the above data set.
Draw an absolute frequency distribution histogram and polygon for the above data set.
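A sketch of the tallying step in Python, assuming classes of width 5 (the class scheme is an illustrative choice, not the only valid one):

```python
from collections import Counter

rain = [23, 22, 23, 24, 21, 20, 19, 28,
        22, 19, 22, 22, 22, 26, 15, 19,
        24, 21, 20, 22, 26, 25, 10, 22,
        22, 17, 25, 23, 19, 21, 22, 21,
        22, 23, 22, 24, 20, 26, 23, 23]

width = 5
# map each value to the lower limit of its class: 10-14, 15-19, 20-24, 25-29
classes = Counter((x // width) * width for x in rain)

n = len(rain)
cum = 0
for lower in sorted(classes):
    f = classes[lower]
    cum += f
    print(f"{lower}-{lower + width - 1}: f={f}, rel={f / n:.3f}, cum={cum}")
```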

STATISTICAL DESCRIPTIONS
MEASURES OF CENTRAL TENDENCY
Uses:
a. To provide a summary and a consistent description of sets of data
b. Means are importantly used in comparisons e.g. school scores in national examinations
c. Quickly condense a large amount of data

ARITHMETIC MEAN (x̄)


This is the sum of the data set values divided by the number of observed cases (n).
Formula:
x̄ = Σx / n
For grouped data,
a. Using the mid-point.
Formula:
x̄ = Σfx / Σf = Σfx / n
Where: x̄ is the mean, n is the number of observations (Σf), f is the frequency, x is the mid-point of the
class limits and Σ is the summation notation.
Example:
Steps:
1) Obtain the mid-point of each class
2) Multiply each mid-point by its respective frequency to obtain fx
3) Sum f and fx to obtain n and fx respectively.
Class Frequency (f) Mid-point (x) fx
401 - 420 12 410.5 4926
421 - 440 27 430.5 11623.5
441 - 460 34 450.5 15317
461 - 480 24 470.5 11292
481 - 500 15 490.5 7357.5
501 - 520 8 510.5 4084
 120 54600


Therefore x = fx/f = 54600/120 = 455

b. Using the Assumed Mean

x̄ = Ax + (Σfd)/n
Where: Ax is the assumed mean and d = x - Ax. (If step deviations d = (x - Ax)/i are used instead, the
correction term becomes (Σfd/n) × i, where i is the class interval.)


Steps:
1) Choose one of the mid-points to represent an arbitrary or assumed mean. The best is the mid-point that lies in
the middle of the range and opposite to a highest frequency
2) Assign the value 0 in the column d to the row with the highest frequency value.
3) Calculate d by subtracting the assumed mean from x (midpoints) for each class
4) Multiply f by d to obtain column fd
5) Sum column fd to obtain fd

Example:
Class       Frequency (f)   Mid-point (x)   fx        d = x - Ax   fd
401 - 420    12              410.5           4926      -40          -480
421 - 440    27              430.5           11623.5   -20          -540
441 - 460    34              450.5           15317       0             0
461 - 480    24              470.5           11292      20           480
481 - 500    15              490.5           7357.5     40           600
501 - 520     8              510.5           4084       60           480
Σ           120                              54600                   540

Now x = Ax + (fd)/n = 450.5 + (540)/120


.x = 450.5 + 4.5 = 50
Exercise
Class Frequency (f)
41 - 50 7
51 - 60 22
61 - 70 29
71 - 80 19
81 - 90 10
91 - 100 3

Advantages
1. Fast and easy to calculate
2. Important for use in further analysis
3. Takes care of all the observations
Disadvantages
1. Sensitive to extreme values
Arithmetic average is extremely sensitive to extreme values. Imagine a data set of 4, 5, 6, 7, and 8,578. The
sum of the five numbers is 8,600 and the mean is 1,720 – which doesn’t tell us anything useful about the level of
the individual numbers. Therefore, arithmetic average is not the best measure to use with data sets containing a few
extreme values or with more dispersed (volatile) data sets in general. Median can be a better alternative in such
cases.
2. Not suitable for time series type of data
The arithmetic mean is best for measuring central tendency when you are working with data sets of independent
values taken at one point in time. For data that evolve over time, such as growth rates, the arithmetic mean can be
misleading and other averages (e.g. the geometric mean) may be more appropriate.

3. Works only when all values are equally important
The arithmetic mean treats all the individual observations equally. In practice, however, observations often differ
in importance. For example, if you have a portfolio of stocks, it is highly unlikely that all stocks will have the same
weight and therefore the same impact on the total performance of the portfolio. Calculating the average performance
of a total portfolio or a basket of stocks is a typical case where the arithmetic mean is not suitable and it is
better to use a weighted average instead.

WEIGHTED MEAN
The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average),
except that instead of each of the data points contributing equally to the final average, some data points contribute
more than others.
To find the weighted mean:
- Multiply the numbers in your data set by the weights.
- Add the results up.

Sample problem 1
For the data set 1, 3, 5, 7, 10 with equal weights (1/5 for each number), the math to find the weighted mean would
be:
1(1/5) + 3(1/5) + 5(1/5) + 7(1/5) + 10(1/5) = 5.2

Sample problem 2
You take three 100-point exams in your statistics class and score 80, 80 and 95. The last exam is much easier than
the first two, so your professor has given it less weight. The weights for the three exams are:
- Exam 1: 40 % of your grade. (Note: 40% as a decimal is .4.)
- Exam 2: 40 % of your grade.
- Exam 3: 20 % of your grade.
What is your final weighted average for the class?
Multiply the numbers in your data set by the weights (The percent weight given to each exam is called a weighting
factor).
.4(80) = 32 or 80*40%
.4(80) = 32
.2(95) = 19
Add the numbers up. 32 + 32 + 19 = 83.

Sample problem 3:
Given two school classes, one with 20 students, and one with 30 students, the grades in each class on a test were:
Morning class = 62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81, 82, 83, 84, 86, 89, 93, 98
Afternoon class = 81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89, 90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95,
96, 97, 98, 99
The straight average for the morning class is 80 and the straight average of the afternoon class is 90. The straight
average of 80 and 90 is 85, which is the mean of the two class means. However, this does not account for the
difference in the number of students in each class (20 versus 30); hence the value of 85 does not reflect the average
student grade (independent of class). The average student grade can be obtained by averaging all the grades,
without regard to classes (add all the grades up and divide by the total number of students):
(1600 + 2700)/50 = 86
This is equivalent to weighting the class means by class size: (20 × 80 + 30 × 90)/50 = 86.

Weighted Mean Formula


In some cases the weights might not add up to 1. In those cases, you’ll need to use the weighted mean formula. The
only difference between the formula and the steps above is that you divide by the sum of all the weights.

In simple terms, the formula can be written as:

Weighted mean = Σwx / Σw


Σ = the sum of (in other words…add them up!).


w = the weights.
x = the value.
In other words: multiply each weight w by its matching value x, sum that all up, and divide by the sum of weights.
To use the formula:
- Multiply the numbers in your data set by the weights.
- Add the numbers in Step 1 up. Set this number aside for a moment.
- Add up all of the weights.
- Divide the numbers you found in Step 2 by the number you found in Step 3.
In the sample grades problem above, all of the weights add up to 1 (.4 + .4 + .2) so you would divide your answer
(83) by 1:
83 / 1 = 83.

However, let’s say your weights added up to 1.2 instead of 1. You’d divide 83 by 1.2 to get:
83 / 1.2 = 69.17.

Sample problem 4: Alex usually works 7 days a week, but sometimes just 1, 2, or 5 days.
Alex worked:
 on 2 weeks: 1 day each week
 on 14 weeks: 2 days each week
 on 8 weeks: 5 days each week
 on 32 weeks: 7 days each week
What is the mean number of days Alex works per week?
If we use "Weeks" as the weighting:
Weeks × Days = 2 × 1 + 14 × 2 + 8 × 5 + 32 × 7
= 2 + 28 + 40 + 224 = 294
Also add up the weeks:
Weeks = 2 + 14 + 8 + 32 = 56
Divide: Mean = 294/56 = 5.25

It is, however, easier if we first present our data in a table as below:

Weight (w)   Days (x)   wx
 2            1           2
14            2          28
 8            5          40
32            7         224
Σw = 56                 Σwx = 294

Divide Σwx by Σw:
Mean = 294/56 = 5.25
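A minimal weighted-mean helper in Python, checked against the two worked examples above:

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Alex's working week: the weights are the numbers of weeks
print(weighted_mean([1, 2, 5, 7], [2, 14, 8, 32]))   # 5.25

# Exam grades whose weights already sum to 1
print(weighted_mean([80, 80, 95], [0.4, 0.4, 0.2]))  # 83.0
```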

Exercise
1. The numbers 1, 2, 3 and 4 have weights 0.1, 0.2, 0.3 and 0.4 respectively. What is the weighted mean?
2. The numbers 1, 2, 3, 4, 5 and 6 have weights 0.5, 0.1, 0.1, 0.1, 0.1 and 0.1 respectively.
What is the weighted mean?
3. In Bobby's school, math grades for the year are calculated from assignments, tests and a final exam.
Assignments count 30%, tests 20%, and the final exam 50%.
If Bobby has an assignment grade of 85, a test grade of 72, and an exam of 61, what is Bobby's overall
grade?
4. Cat wants to buy a new car, and decides on the following rating system:
- Appearance 10%

- Reliability 40%
- Miles per gallon 20%
- Comfort 30%
The Coyota car gets 5 (out of 10) for Appearance, 9 for Reliability, 7 for Miles per gallon and 6 for Comfort.
The Fonda car gets 6 (out of 10) for Appearance, 10 for Reliability, 8 for Miles per gallon and 4 for Comfort.
The Hadillac car gets 4 (out of 10) for Appearance, 7 for Reliability, 3 for Miles per gallon and 9 for Comfort.
The Tord car gets 7 (out of 10) for Appearance, 6 for Reliability, 9 for Miles per gallon and 3 for Comfort.

Which car is best?


5. In a gymnastics competition the overall score is calculated from points for execution and difficulty,
calculated with weights of 80% and 20% respectively.
Grace scored 9.8 for execution and had an overall score of 9.5. What was her score for difficulty?

Weakness
The weighted mean can be easily influenced by outliers in your data. If you have very high or very low values in
your data set, the weighted mean may not be a good statistic to rely on.

MEDIAN
The median is the middle value of the scores, i.e. the mid-point that separates the upper 50% of the values from the
lower 50%. To obtain the median, arrange the scores in ascending order; the middle value is the value above and
below which an equal number of observations occur. If the total number of observations is odd, then the median
will be one of the observed values. If the number of observations is even, then the median will be mid-way between
the two middle values.
Formula (position of the median): Mdn = (n + 1)/2
Example: 3, 3, 1, 1, 4, 4, 5, 2, 2, 2
Arranged: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5
Median = (2 + 3)/2 = 2.5
Strengths: - it is not affected by the values of extreme items in the distribution
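The ungrouped example can be checked with Python's standard library:

```python
import statistics

scores = [3, 3, 1, 1, 4, 4, 5, 2, 2, 2]
print(sorted(scores))             # [1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
print(statistics.median(scores))  # 2.5, the average of the two middle values
```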

Median for Grouped Data

Mdn = L + i × (n/2 − fb)/fc

Where: - fb is the sum of all the frequencies (the cumulative frequency) below the median class
- fc is the frequency of the class containing the median
- L is the lower boundary of the class in which the median occurs
- i is the interval or group width
Example 1:
Class Frequency (f)
401 - 420 12
421 - 440 27
441 - 460 34
461 - 480 24
481 - 500 15
501 - 520 8
 120
The median lies in the class 441 - 460 (cumulative frequencies: 12, 39, 73, ...), so L = 440.5, fb = 39, fc = 34
and i = 20:
Mdn = 440.5 + 20 × (120/2 − 39)/34
= 440.5 + 12.35
= 452.85

Example 2:
Class Frequency (f)
926 – 975 1
876 – 925 1
826 – 875 3
726 – 775 3
676 – 725 6
626 – 675 10
576 – 625 1
526 – 575 1

Here n = 26, so n/2 = 13. Counting from the bottom, the cumulative frequencies are 1, 2, 12, 18, ..., so the
median class is 676 – 725, with fb = 12, fc = 6, L = 675.5 and i = 50:
Mdn = 675.5 + 50 × (13 − 12)/6
≈ 683.8
MODE
Mode is the single value that occurs most frequently in the distribution or the midpoint of the class with the highest
frequency. The peak or point of the greatest concentration in the distribution can statistically be calculated by:
Mo = 3Mdn − 2x̄;
That is, the mode can be estimated once the mean and the median of the data set are known.
Mode is best applicable where a distribution is much skewed. For example; consider the following monthly income
for five individuals; 800, 900, 850, 750, 5000.
The mean would be = (800 + 900 + 850 + 750 + 5000)/5
= 1660
This figure (1660) although an accurate statement of the mean, is not typical of the group as a whole because it’s
affected by the income of 5000. The median value is 850 and this is typically more representative of the group than
the mean of 1660, therefore if a distribution is very much skewed i.e. it contains more occurrences at one extreme
than the other, the median or mode is more likely to be a representative of these scores than the mean.

Modal Value for Grouped Data

Mo = L + ((Fm − Fa)/(2Fm − Fa − Fb)) × c

Where: L = Lower limit of the modal class
Fm = Frequency of the modal class
Fa = Frequency of the class above the modal class
Fb = Frequency of the class below the modal class
c = Class interval
Example:
The data below relates to rainfall amounts for station K for the period 1971 to 2000.
Class Frequency (f)
926 – 975 1
876 – 925 1
826 – 875 3
726 – 775 3
676 – 725 6
626 – 675 10
576 – 625 1
526 – 575 1


Solution:
Modal class = 626 – 675 (the class with the highest frequency, Fm = 10). From the table, Fa = 6 (676 – 725)
and Fb = 1 (576 – 625), with L = 625.5 and c = 50:
Mo = 625.5 + ((10 − 6)/(2×10 − 6 − 1)) × 50
= 625.5 + 15.4
≈ 640.9
Advantages
1) Very quick and easy to determine
2) Is an actual value of the data
3) Not affected by extreme scores

Disadvantages
1) Sometimes not very informative (e.g. cigarettes smoked in a day)
2) Can change dramatically from sample to sample
3) Might be more than one (which is more representative?)

MEASURES OF VARIABILITY
Measures of variability show how the scores differ amongst themselves in magnitude. It is a standard way of
describing variability or dispersion of scores.
Variability is the distribution of scores around a particular central score or value, i.e. the mean (in most statistics). It
is therefore the dispersion of scores around the mean of a distribution.

Purpose of measures of variability


1) Helps the researcher to see how spread out the scores or measures for each variable are.
2) Provides us with the indices which we can use to further describe a distribution or scores.

Types of Measures of Variability.


1. The Range
The range is the difference between the highest score and the lowest score in a distribution. It is determined by
subtracting the lowest score from the highest score.
Range = Highest Value – Lowest Value
Disadvantages of Range as a measure of variability
1. It is influenced by the extreme values
2. It is generally not a satisfactory measure of variability because it utilizes only a fraction of the
available information concerning variation in the data.
3. It reveals nothing about the dispersion of the observations between the two extremes.

2. The Quartile Deviation (QD)


This is calculated by arranging a list of scores in ascending or descending order and dividing it into four equal
parts, each part containing 25% of all the figures in the data set. The three dividing points are called quartiles.
The quartile deviation is half the difference between the upper quartile (Q3) and the lower quartile (Q1) in a
distribution, and Q3 – Q1 is referred to as the inter-quartile range.
QD = (Q3 – Q1)/2
The Quartile Deviation (QD) is standardized by the Quartile Deviation Coefficient (QDC):

QDC = ((Q3 – Q1)/2) / ((Q3 + Q1)/2) = (Q3 – Q1)/(Q3 + Q1)


You may also describe Quartiles as:


Quartiles divide the lower and upper sub-samples into two parts:
– 1st Quartile: Q1 = median of the lower sub-sample, also called the lower fourth
– 2nd Quartile: Q2 = median of the entire sample
– 3rd Quartile: Q3 = median of the upper sub-sample, also called the upper fourth
– Inter Quartile Range: IQR = Q3 - Q1 (also called fourth spread)

Example 1: 241, 521, 421, 250, 300, 365, 840, 958, 241
Example 2: 5, 8, 13, 74, 85, 88, 90, 91, 92, 92, 93, 94, 94, 94, 95, 95, 95, 96, 96, 98, 99, 101, 103, 106,
113.
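A minimal sketch for Example 1 using the standard library; note that several quartile conventions exist, so other software (or hand methods) may give slightly different cut points:

```python
import statistics

data = [241, 521, 421, 250, 300, 365, 840, 958, 241]

q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartiles
iqr = q3 - q1                                  # inter-quartile range
qd = iqr / 2                                   # quartile deviation
qdc = (q3 - q1) / (q3 + q1)                    # QD coefficient

# With the default (exclusive) method: 245.5, 365.0, 680.5,
# so IQR = 435.0 and QD = 217.5
print(q1, q2, q3, iqr, qd, round(qdc, 3))
```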

For Continuous/Grouped Series


We apply the same formula as in the median for grouped data, only that we now work with the quartile positions
(n/4 and 3n/4):

Q1 = LQ1 + ((n/4 − m1)/f1) × c
Q3 = LQ3 + ((3n/4 − m3)/f3) × c
Where:
LQ1 = lower limit of the first quartile class
f1 = frequency of the first quartile class
c = class interval
m1 = c.f. preceding the first quartile class
LQ3 = 1ower limit of the 3rd quartile class
f3 = frequency of the 3rd quartile class
m3 = c.f. preceding the 3rd quartile class

Example 1
3. The Mean Deviation
Is the average amount by which individual scores deviate from the mean. It is calculated by first finding how
much each value differs from the mean, summing the absolute values of these differences, and then dividing by
the number of observations.

Example:
x        x − x̄    (x − x̄)²
5        −5        25
7        −3         9
8        −2         4
12        2         4
18        8        64
Σx = 50   Σ|x − x̄| = 20   Σ(x − x̄)² = 106

Thus, Mean Deviation = Σ|x − x̄|/n = 20/5 = 4 (the signed deviations sum to zero, so their absolute values are
used; the squared deviations are needed for the variance below).

Standard Deviation

STDEV is the extent to which scores in a distribution deviate from the mean. It involves subtracting the mean
from each score to obtain the deviations.
The formula for the variance in a population is:

σ² = Σ(x − μ)²/N

where μ is the mean and N is the number of scores.

When the variance is computed from a sample, the most common formula is:

s² = Σ(x − M)²/(n − 1)

where M is the mean of the sample. Dividing by n instead of n − 1 gives a biased estimate of σ², whereas
dividing by n − 1 gives an unbiased estimate. Since samples are usually used to estimate parameters, s² is the
most commonly used measure of variance.
Calculating the variance is an important part of many statistical applications and analyses. It is the first step in
calculating the standard deviation.
Example:
x        x − x̄    (x − x̄)²
5        −5        25
7        −3         9
8        −2         4
12        2         4
18        8        64
Σx = 50   Σ(x − x̄)² = 106

σ² = 106/5 = 21.2 (the population formula, dividing by N, is used here)

Sample variance is the sum of squared deviations divided by the degrees of freedom (n – 1); population variance, as in the example above, divides by N.

For grouped data, we calculate variance by the formula:

σ² = ∑f(x – x̄)²/∑f
If the variance is small then the scores are close together while a large variance implies the scores are more spread
out.
STDEV is obtained by taking the square root of the variance. A large STDEV implies a large deviation from the
mean, i.e. a greater variability, while a small STDEV denotes less variability of scores in the distribution.
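
The following short Python sketch (illustrative only) computes both the population and the unbiased sample versions of the variance and STDEV for the example data:

```python
import math

x = [5, 7, 8, 12, 18]
n = len(x)
mean = sum(x) / n
ss = sum((v - mean) ** 2 for v in x)   # sum of squared deviations = 106

pop_var = ss / n            # population variance: 106/5 = 21.2
samp_var = ss / (n - 1)     # unbiased sample variance: 106/4 = 26.5
print(pop_var, math.sqrt(pop_var), samp_var, math.sqrt(samp_var))
```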

For grouped data, the formula would be:

σ = √[∑f(x – x̄)²/∑f]

Properties of STDEV
1) Takes into account all scores and responds to the exact position of every score relative to the mean of the
distribution, i.e., if the score is shifted further from the mean, the STDEV increases.
2) It is sensitive to extreme scores.

Exercise
1. Students at Kibabii University sat for an examination at the end of the semester. Their results were grouped
as shown:
Class Frequency
50-54 2
55-59 3
60-64 6
65-69 9
70-74 12
75-79 15
80-84 10
85-89 8
90-94 6
95-99 4
i. Calculate the mean, mode and the median.
ii. Calculate the quartile deviation of the data
iii. Calculate the Standard Deviation and Variance (see the sketch below for a starting point)
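
A possible starting point for the grouped-data parts of this exercise is the sketch below (illustrative only; it uses class midpoints to stand in for the observations, as in the grouped formulas above):

```python
# Grouped mean, variance and standard deviation via class midpoints.
classes = [(50, 54, 2), (55, 59, 3), (60, 64, 6), (65, 69, 9), (70, 74, 12),
           (75, 79, 15), (80, 84, 10), (85, 89, 8), (90, 94, 6), (95, 99, 4)]

n = sum(f for _, _, f in classes)
mean = sum(((lo + hi) / 2) * f for lo, hi, f in classes) / n
var = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes) / n
print(n, round(mean, 2), round(var, 2), round(var ** 0.5, 2))
```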

Coefficient of Variation
C.V. is the standard deviation expressed as a percentage of the mean; that is, it is the ratio between the standard deviation of a sample and its mean. C.V. is used when there is need to compare the variability of two or more distributions.

The coefficient of variation is usually expressed in percentages:

C.V. = (STDEV/Mean) × 100

It allows us to compare the dispersions of two different distributions provided their means are positive. The distribution with the greater coefficient of variation has the greater dispersion.
Example 1
One distribution has x̄ = 140 and σ = 28.28 and another has x̄ = 150 and σ = 24. Which of the two has the greater dispersion?

C.V.1 = (28.28/140) × 100 ≈ 20.2%
C.V.2 = (24/150) × 100 = 16%

The first distribution has the higher dispersion.

Example 2
The summary statistics below relate to rainfall performance across Kajiado County (1961 - 2011). Calculate the Coefficient of Variation and comment on your answer.

Inter-Annual and spatial rainfall variability levels between the three stations
N.D.O. Met. Station Mashuru Met. Station M.S.W. Met. Station
Mean(mm) 830.3 671.1 449.3
STDEV(mm) 202.3 174.9 94.0
C.V 0.24 0.26 0.21

Seasonal variability levels between the three stations

Season       N.D.O. Met. Station       Mashuru Met. Station      M.S.W. Met. Station
             Long Rains  Short Rains   Long Rains  Short Rains   Long Rains  Short Rains
Mean (mm)    412.53      208.18        322.51      203.31        73.25       35.63
STDEV (mm)   139.26      102.54        128.03      101.65        28.23       15.82
C.V.
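
To fill in the C.V. row above, divide each STDEV by the corresponding mean; a short sketch (illustrative only) is:

```python
# C.V. = STDEV / mean for each station and season.
seasons = {
    "N.D.O. long": (412.53, 139.26), "N.D.O. short": (208.18, 102.54),
    "Mashuru long": (322.51, 128.03), "Mashuru short": (203.31, 101.65),
    "M.S.W. long": (73.25, 28.23), "M.S.W. short": (35.63, 15.82),
}
for name, (mean, sd) in seasons.items():
    print(name, round(sd / mean, 2))
```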

LECTURE 6
PROBABILITY DISTRIBUTIONS
A distribution is used by statisticians as a standard reference or model against which to compare other distributions. In this course, we will discuss four types of probability distributions, namely: the Binomial Distribution, the Poisson Distribution, the Exponential Distribution and the Normal Distribution.

NORMAL DISTRIBUTION
This is the most important and frequently used continuous probability distribution because it fits many types of problems well. It is essential in inferential statistics because it describes probabilistically the link between a statistic and a parameter (i.e. between a sample and the population from which it is drawn). A normal distribution curve takes a bell-shaped form with the variables clustered around a central value, i.e. the mean, and tailing off symmetrically on each side.


Features of a normal distribution curve include:


1. Mean, median and mode take the same value, i.e. mean = median = mode.
2. The curve is asymptotic to the x-axis
3. It is unimodal, that is; it has a single mode
Skewness
Skewness is the extent to which a frequency curve is asymmetrical. A distribution is said to be
symmetrical if the peak is at the centre of the histogram and the slopes on each side are virtually equal to
each other. If the peak lies to the left or right of the centre, then it is skewed. The further the peak lies
from the centre of the histogram, the more the distribution is said to be skewed.
Skewness is measured in terms of the direction and the degree of that skew. The direction of the skew
depends upon the relationship of the peak to the centre of the histogram. A skew will be positive when the
peak lies to the left and tails off to the right and negative if the peak lies to the right of the centre and tails
off to the left.

Nature of Skewness
Skewness can be positive or negative or zero.
1. When the values of mean, median and mode are equal, there is no skewness.
2. When mean > median > mode, skewness will be positive.
3. When mean < median < mode, skewness will be negative.


Why do we care? One application is testing for normality: many statistical inferences require that a distribution be normal or nearly normal. A normal distribution has skewness and excess kurtosis of 0, so if your distribution is close to those values then it is probably close to normal.

The Degree of Skewness


The degree of Skewness can be mathematically determined by calculating the Pearson’s Coefficient of
Skewness (PCS).

PCS = 3(Mean – Median)/STDEV

A positive value shows a positive skew while a negative value shows a negative skew and the higher the
coefficient, the greater the skew.

Skewness Index
The skewness index indicates whether and how strongly a distribution is skewed. It is obtained by the formula:

Skewness = ∑(x – x̄)³/(n × STDEV³)
If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the
distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left,
meaning that the left tail is longer.

If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer (1979), a classic text, suggests this rule of thumb:

• If skewness is less than −1 or greater than +1, the distribution is highly skewed.
• If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
• If skewness is between −½ and +½, the distribution is approximately symmetric.


Kurtosis
Kurtosis is the peakedness of a distribution. It is a measure of the degree to which the frequency distribution is concentrated around the frequency peak.
As skewness involves the third moment of the distribution, kurtosis involves the fourth moment.
(In the following, ≈ means approximately equal to.)

Interpreting Kurtosis

The reference standard is a normal distribution, which has a kurtosis of 3. In token of this, often the
excess kurtosis is presented: excess kurtosis is simply kurtosis−3. For example, the “kurtosis” reported
by Excel is actually the excess kurtosis.

 A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with
kurtosis ≈3 (excess ≈0) is called mesokurtic.
 A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic. Compared to a normal
distribution, its tails are shorter and thinner, and often its central peak is lower and broader.
 A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal
distribution, its tails are longer and fatter, and often its central peak is higher and sharper.

We measure kurtosis by the formula:

Kurtosis = ∑(x – x̄)⁴/(n × STDEV⁴), and excess kurtosis = kurtosis – 3

There are three types of peaks in a frequency distribution: leptokurtic, mesokurtic and platykurtic, as described above.

If a negative excess kurtosis value is obtained, the distribution is platykurtic: flatter than normal, with the data spread more evenly and the tails thinner than those of a normal curve. A positive excess kurtosis indicates a leptokurtic distribution, with a sharper peak and heavier tails, while a perfect normal distribution would yield an excess kurtosis of zero.
Example
The data below relates to a natural science experiment: 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6
i. Calculate the Pearson’s Coefficient of Skewness
ii. Calculate the Kurtosis
For further practice, calculate the same measures for the data: 1, 2, 1, 4, 4, 4, 4, 4, 1, 10, 11. A computational sketch follows below.
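
A sketch for checking the calculations (illustrative only; it uses the population standard deviation, consistent with the formulas above):

```python
# Pearson's Coefficient of Skewness and kurtosis for the first data set.
import statistics as st

x = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6]
n = len(x)
mean, med = st.mean(x), st.median(x)
sd = st.pstdev(x)                       # population standard deviation

pcs = 3 * (mean - med) / sd             # Pearson's Coefficient of Skewness (≈ -0.41)
kurt = sum((v - mean) ** 4 for v in x) / (n * sd ** 4)
print(round(pcs, 2), round(kurt, 2), round(kurt - 3, 2))   # kurtosis and excess kurtosis
```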


LECTURE 7
INFERENTIAL STATISTICS
Inferential statistics differ from descriptive statistics in that they are explicitly designed to test hypotheses. Sir Ronald A. Fisher, one of the most prominent statisticians in history, established the basic guidelines for significance testing. He held that a statistical result may be considered significant if the probability of its occurring by chance is 5% or less.

We must also understand three related statistical concepts: sampling distribution, standard error, and
confidence interval. A sampling distribution is the theoretical distribution of an infinite number of samples
from the population of interest in your study. However, because a sample is never identical to the
population, every sample always has some inherent level of error, called the standard error. If this
standard error is small, then statistical estimates derived from the sample (such as sample mean) are
reasonably good estimates of the population. The precision of our sample estimates is defined in terms of a
confidence interval (CI). A 95% CI is defined as a range of plus or minus two standard errors around the mean estimate, as derived from different samples in a sampling distribution. Hence, when we say that our observed sample estimate has a 95% CI, what we mean is that we are confident that 95% of the time the population parameter lies within two standard errors of our observed sample estimate. Jointly, the p-value and the CI give us a good idea of the probability of our result and how close it is to the corresponding population parameter.

STANDARD ERROR AND CONFIDENCE INTERVAL


Standard error is the approximate standard deviation of a statistical sample population. It is a statistical term that measures the accuracy with which a sample represents a population. In statistics, a sample mean deviates from the actual mean of the population; this deviation is the standard error.
The standard error is considered part of descriptive statistics. It represents the standard deviation of the mean within
a dataset. This serves as a measure of variation for random variables, providing a measurement for the spread. The
smaller the spread, the more accurate the dataset.

Standard Error and Population Sampling


When a population is sampled, the mean is generally calculated. The standard error captures the variation between the calculated sample mean and a population mean that is considered known, or accepted as accurate. This helps compensate for any incidental inaccuracies related to the gathering of the sample.

The standard Error


The standard error (S.E) is the standard deviation of the sampling distribution of a statistic. It gives an estimate of
the degree to which the sample mean varies from the population mean. Its uses include:
i) S.E helps in testing whether the difference between observed and expected frequencies could arise due to
chance.
ii) It gives an idea about the reliability and precision of a sample. We say; the smaller the SE, the greater the
uniformity of the sampling distribution and thus the greater the reliability of the sample. Conversely, the greater
the SE, the greater the difference between the observed and expected frequencies thus revealing a greater
unreliability of the sample.

iii) SE enables us to specify the limits (maximum and minimum) within which the parameters of the population are
expected to lie within a specific degree of confidence; i.e. Confidence interval

We assess data in terms of the standard error of the mean and the standard error of the difference in means.

The standard error of the mean = STDEV/√sample size = STDEV/√n

Example 1: The weights of a random sample of 11 three-year old children were taken in a village. The sample mean
was 16kg and the standard deviation of the sample was 2kg. What was the SE?
SE = 2/√11 = 0.6kg
At 95% confidence interval the weights will be 16 ± (2 × 0.6) = 14.8 to 17.2kg.
This means that we are approximately 95% certain that the mean weight of all three-year old children in the population lies between 14.8 and 17.2kgs.
Increasing the sample size in the above example will increase the reliability of the calculation. E.g. with a sample size of 20 children instead of 11, the SE would have been:
SE = 2/√20 = 0.45kg; therefore, the 95% confidence interval for the mean weight would have been 15.1 to 16.9kg.
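
The same computation in a short Python sketch (illustrative only):

```python
# Standard error and approximate 95% confidence interval (mean ± 2SE).
import math

mean, sd = 16, 2
for n in (11, 20):
    se = sd / math.sqrt(n)
    print(n, round(se, 2), round(mean - 2 * se, 1), round(mean + 2 * se, 1))
```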

Example 2
A 2 3 3 4 4 4 4 5 5 6
B 1 1 2 4 12

Example2:
The following weights were measured from 3 year old children. Calculate the standard error and the range to which
the weights fall at 95% confidence interval.
13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 20.

Example 3: Testing Hypothesis


A machine fills packets with spice which are supposed to have a mean weight of 40g. A random sample of 36 packets is taken and the mean weight is found to be 42.4g with a standard deviation of 6g. It is required to conduct a significance test at the 5% level.
Solution
Ho, mean = 40g
H1, mean ≠ 40g
For a 95% confidence limit/5% level of significance (two-tailed), the test statistic must lie within ±1.96.
The Standard Error = SE = 6/√36 = 1
Z = (x̄ – µ)/SE = (42.4 – 40)/1 = 2.4
Since 2.4 > 1.96, we reject Ho and conclude that the mean weight differs significantly from 40g.
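
The test in Example 3 can be verified with a minimal sketch (illustrative only):

```python
# One-sample z-test for the spice packets.
import math

mu, mean, sd, n = 40, 42.4, 6, 36
se = sd / math.sqrt(n)     # 6/√36 = 1
z = (mean - mu) / se       # 2.4
print(z, abs(z) > 1.96)    # True, so Ho is rejected at the 5% level
```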



HYPOTHESIS
A hypothesis is a researcher’s prediction regarding an outcome of a study. It states possible differences,
relationships, or causes between two or more variables or concepts. They are derived from/based on existing
theories, previous researches, personal observation or experiences.
The whole study revolves around hypothesis.

Geography is a science that deals with the concept of space and spatial relationships, which along with time and composition of matter comprise the three major parameters of concern to all sciences: 1. Space, 2. Time, and 3. Composition of matter. Geography seeks to explain how the subsystems of the physical environment are organized on the earth's surface, how humans distribute themselves over the earth, and their spatial relationships to physical features and to other human beings. There are normally two groups of factors in geographical comparisons.
1. Those which operate consistently and from which predictions can be made. E.g. the driver of land use in
North Eastern and Central Provinces is rainfall which can be predicted.
2. Those which are irregular (random):- these are factors which are purely dependent on chance: for example
when a sample is taken from a population, the sample statistics are not necessarily the same as population
parameters and as such, the answers on population characteristics from the sample are tentative. In regard
to this, the degree of confidence is introduced.

Purpose of Hypothesis
The purpose and importance of the null hypothesis and alternative hypothesis are that:
1. They provide an approximate description of the phenomena.
2. Provide the researcher or an investigator with a relational statement that is directly tested in a research study.
3. To behave as a working instrument of the theory.
4. To show whether or not the hypothesis is supported, a judgement separated from the investigator's own values and decisions.
5. They provide direction – they bridge the gap between the problem and the evidence needed for its solution.
6. It ensures collection of the evidence necessary to answer the questions posed in the statement of the problem.
7. It enables the investigator to assess the information he or she has collected from the standpoint of both relevance
and organization.
8. Sensitizes the investigator to certain aspects of the situations that are relevant regarding the problem at hand.
9. They permit the researcher to understand the problem with greater clarity and use the data collected to find
solutions to the problem.
10. They guide collection of data and provide the structure for their meaningful interpretation in relation to the
problem under investigation.
11. They form the framework for the ultimate conclusion as solution, i.e. they provide the framework for reporting
the inferences of the study. Researchers usually base their conclusions on results of hypothesis tested.

Qualities/Characteristics of a Good Hypothesis


A strong review of the literature or of existing theories leads to good hypotheses. In formulating a good hypothesis, a researcher should generate as many ideas as possible, and examine each statement critically before stating the hypothesis. Formulation of hypotheses is done after review of the literature but before data collection.
Good hypothesis should have the following characteristics:
1. Must state clearly and briefly the expected relationship between the variables under study.
2. Must be based on sound rationale derived from theory, previous researches, or professional experience.
3. Must be consistent with common sense or generally accepted truth.
4. Must be testable i.e. data can be collected to support or not support the hypothesis
5. Must be related to empirical (realistic, practical, experiential) phenomena: words like 'ought', 'bad' and 'should' reflect more of a judgment and therefore should be avoided.
6. They should be testable within a reasonable time, e.g. 'deforestation leads to global warming/climate change' would take a very long time to test.
7. The variables stated in the hypothesis must be consistent with the purpose statement and the objectives.
8. Must be simple and concise.
9. Must be stated in such a way that its implications can be deduced to validate or refute the situation.

Examples of a good Hypothesis


1. High mathematics anxiety influences student’s performance in a statistical test at University X.
2. There is a positive relationship between level of education and income among civil servants in Kenya.
3. Amount of seasonal rainfall and fertilizer type used influence the yield of maize in Kitale
NB. The use of value-laden, biased, or subjective hypothesis should be avoided. For example: The study will show
that students from urban schools perform better in national examinations compared to students from rural schools.
(Perform better is value-laden)

Types of Hypothesis
(1) Null Hypotheses and (2) Alternative Hypotheses (which may be non-directional or directional). The significance of a directional hypothesis is tested using a one-tailed t-test while that of a non-directional hypothesis is tested using a two-tailed test.
(3) Directional and (4) Non-directional hypotheses
Null Hypotheses - Is sometimes referred to as statistical hypotheses.
It is a negative proposition: It always states that no real relationship or difference exists and that any relationship
between two variables or difference between two groups is merely due to chance or error.

It is formulated to be rejected, if so, then alternative hypothesis is adopted.

In statistical testing, the alternative hypothesis cannot be tested directly. Rather, it is tested indirectly by rejecting
the null hypotheses with a certain level of probability. Statistical testing is always probabilistic, because we are
never sure if our inferences, based on sample data, apply to the population, since our sample never equals the
population. The probability that a statistical inference is caused by pure chance is called the p-value. The p-value is compared with the significance level (α), which represents the maximum level of risk that we are willing to take that our inference is incorrect. For most statistical analyses, α is set to 0.05. A p-value less than α = 0.05 indicates that we have enough statistical evidence to reject the null hypothesis, and thereby indirectly accept the alternative hypothesis. If p > 0.05, then we do not have adequate statistical evidence to reject the null hypothesis or accept the alternative hypothesis.

Hypothesis Testing
Hypothesis testing involves collection of any data/information that may support or fail to support the stated
hypothesis. The purpose of hypothesis testing is to make judgment between sample statistics and hypothesized
population parameter. The idea is to obtain a statistical inference about population parameter from sample statistics.

Errors in Hypothesis Testing


Type I error: made when a hypothesis is rejected when it should have been accepted.
Type II error: made when a hypothesis is accepted when it should have been rejected.
In both Type I and Type II errors, an error in decision or judgement has occurred.
These errors should be minimized. One remedy is to increase the sample size; where this is not possible, introduce a significance level or rejection level. Note that the higher the significance level, the higher the possibility of rejecting a null hypothesis when it is true.

Steps to Hypothesis Testing


Hypothesis testing is used to establish whether the differences exhibited by random samples can be inferred to the
populations from which the samples originated.

General Assumptions
1) Population is normally distributed
2) The sampling was randomly conducted
3) Mutually exclusive comparison samples
4) Data characteristics match statistical technique
For interval / ratio data use: T-tests, Pearson correlation, ANOVA, OLS regression
For nominal / ordinal data use: Difference of proportions, chi square and related measures of association, logistic
regression
Steps:
1. Formulate the Hypothesis (Null and Alternative)
Null Hypothesis (Ho): There is no difference between the variables under study.
Alternative Hypothesis (H1): There is a difference between the variables.
Note: The alternative hypothesis will indicate whether a 1-tailed or a 2-tailed test is utilized to reject the null
hypothesis.
2. Decide on the rejection level or significance level.
-This determines how different the parameters and/or statistics must be before the null hypothesis can be
rejected. This "region of rejection" is based on alpha (α), the error associated with the confidence level.
The point of rejection is known as the critical value.
-The significance level is usually associated with the normal distribution. It is normally set at 5% (0.05). The
rejection level is a measure of how strong the evidence must be before Ho is rejected.
3. Select and carry out an appropriate statistical test to determine the probability (p) that the problem data could
have occurred by chance under Ho.
4. Decide on the result for the null hypothesis. If the calculated p is less than the chosen alpha (α) level (equivalently, if the calculated test statistic exceeds the critical value), then Ho is rejected at that α level of significance. However, if the reverse is true, then Ho cannot be rejected. This does not mean that Ho is correct; it means that the evidence is not strong enough to reject it.


Selection of a statistical test


The choice of statistical test depends on three factors.
1. Characteristics of the data in terms of measurements
2. The power of the test
3. Assumptions that can be made about the population from which the data is derived.

LECTURE 8 & 9

PARAMETRIC TESTS
Under parametric testing techniques, we shall discuss one test: the Student t-test.
Student t-test
The Student t-test is a parametric test of the difference between two samples. It makes use of the t-distribution, which is based on small samples; as the sample size increases, the t-distribution tends towards the normal distribution. It is useful in determining the significance of the difference between two groups measured at an interval scale. It is used for independent as well as paired or matched samples. It is also used to determine the significance of correlation coefficients in regression analysis. It compares the means of two variables.
Before a t-test is applied, two assumptions must be made:
• The background populations of the samples are normally distributed
• The standard deviations of the populations are equal
The t-test involves lengthy calculations and as such is usually applied to small samples.

The t-test is called a parametric test because your data must come from populations that are normally distributed
and use interval measurement. The t-test is used to answer this question: is there any difference between the
means of the two populations from which our data are random samples? The t-test is also called a test of inference
because we are trying to discover if populations are different by studying samples from the populations, i.e., what
we find to be true about our samples we will assume to be true about the population.

T-Test Assumptions:
1. The first assumption concerns the scale of measurement: the assumption for a t-test is that the scale of
measurement applied to the data collected follows a continuous or ordinal scale.
2. The second assumption is regarding simple random sample. The Assumption is that the data is collected
from a representative, randomly selected portion of the total population.
3. The third assumption is the data, when plotted, results in a normal distribution, bell-shaped distribution
curve.
4. The fourth assumption is that a reasonably large sample size is used for the test. A larger sample size means
the distribution of results should approach a normal bell-shaped curve.
5. The final assumption is the homogeneity of variance. Homogeneous, or equal, variance exists when the
standard deviations of samples are approximately equal.
For a one-sample test (whether one- or two-tailed):

t = (x̄ – µ)/Sx̄

where Sx̄ = s/√n is the standard error of the mean.

Example 1
The specimens of copper wires drawn from a large lot have the following breaking strengths (in kg weight):
57.8, 57.2, 57.0, 56.8, 57.2, 57.8, 57.0, 57.2, 59.6, 54.4
Test (using Student’s t-statistic) whether the mean breaking strength of the lot may be taken to be
57.8 kg weight (Test at 5 per cent level of significance). Verify the inference so drawn by using
Sandler’s A-statistic as well.
Solution: Taking the null hypothesis that the population mean is equal to the hypothesized mean of 57.8 kg, we can write:
Ho: µ = 57.8 kg; H1: µ ≠ 57.8 kg

As the sample size is small (since n = 10) and the population standard deviation is not known, we
shall use t-test assuming normal population and shall work out the test statistic t as under:

S/no.   xi      (xi – x̄)   (xi – x̄)²
1       57.8     0.6        0.36
2       57.2     0.0        0.00
3       57.0    -0.2        0.04
4       56.8    -0.4        0.16
5       57.2     0.0        0.00
6       57.8     0.6        0.36
7       57.0    -0.2        0.04
8       57.2     0.0        0.00
9       59.6     2.4        5.76
10      54.4    -2.8        7.84
∑       572                14.56

x̄ = 572/10 = 57.2
s = √[∑(xi – x̄)²/(n – 1)] = √(14.56/9) ≈ 1.27
t = (x̄ – µ)/(s/√n) = (57.2 – 57.8)/(1.27/√10) = -0.6/0.402 ≈ -1.49
Degrees of freedom = n – 1 = 9. The table value of t at the 5% level (two-tailed) for 9 d.f. is 2.262. Since |-1.49| < 2.262, Ho cannot be rejected: the mean breaking strength of the lot may be taken to be 57.8 kg.

On the other hand, if we are comparing two sample means, we pool the standard deviations and standard errors to yield the formula below:

t = (x̄1 – x̄2) / √(S1²/n1 + S2²/n2)
Where;
x1¯ = Mean of first set of values
x2¯ = Mean of second set of values
S1 = Standard deviation of first set of values
S2 = Standard deviation of second set of values

n1 = Total number of values in first set
n2 = Total number of values in second set

From the above formula, we can further pool the two standard deviations in the denominator:

Sp² = [(n1 – 1)S1² + (n2 – 1)S2²]/(n1 + n2 – 2), giving t = (x̄1 – x̄2)/[Sp√(1/n1 + 1/n2)]
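
As an illustration of the pooled formula, the sketch below applies it to the summary statistics of Example 2 that follows. Note that it assumes n = 18 observations per district (the 18-year period), which is our reading of the exercise, not something stated explicitly:

```python
# Pooled two-sample t-test from summary statistics (n assumed to be 18 each).
import math

n1, mean1, s1 = 18, 325, 20   # Kitui
n2, mean2, s2 = 18, 421, 11   # Mwingi

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(t, 2), df)   # compare |t| with the critical value for df at the 0.05 level
```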

Example2
A study on the distribution of hardwood in two districts over a period of 18 years yielded the following summary
statistics.
Kitui Mwingi
Mean 325 421
SD 20 11
Formulate a suitable Ho and test it at 0.05 significance level.

Example3
The following average slope measurements in degrees were made on different rock types in the same area.
Limestone Grit stone
(Degrees) (Degrees)
32.1 17.8
29.4 15.8
33 12.5
27.3 15.5
19 15.1
14.4 12.2
21.1 13.1
25.5 10.6
9.1 9.3
10.5 5.5
10.5
11
14.2
Question: Use the t-test for the difference between means to assess the validity of the statement that the slopes on the two rock types differ.

Example 4
The data below show the results of quality control exercise obtained during a nutritional survey.
Weight (kg)
Child No. Observer A Observer B
1 18.6 17.7
2 17.1 14.5
3 14.3 12.4
4 23.2 20.7
5 18.4 16.8
6 14.9 14.4

7 16.6 14.1
8 14.8 17.1
9 21.5 21.2
10 24.6 21.9
11 17.4 16.6
12 15.7 13.6
13 16.1 14.5
14 12.9 11.2
15 12.3 16
16 19.4 20.4
17 19.3 17.5
18 24.8 22.2
19 14.3 15.1
20 13.4 10.9
Example 5
A firm ordered sacks of chemicals with a nominal weight of 50kg. A random sample of 8 sacks was taken and it
was found that the sample mean was 49.2kg with a STDEV of 1.6kg. The firm wishes to test whether the mean
weight of the sample of sacks is significantly less than the nominal weight at 5% level of significance.

Chi-Square Test of Independence


A Chi-Square test is a non-parametric technique: it makes no assumptions about the parameters of the population. The technique needs a reasonably large sample so that the observed proportion in each mutually exclusive category approximates the population proportion. That is, it is used for comparing data in which individual observations are assigned to categories and the number in each category is counted (frequency). It is designed to evaluate whether the difference between observed and expected frequencies (expected frequencies in most statistics meaning the average/mean) under a set of theoretical assumptions is statistically significant. It works by testing a distribution actually observed in the field against some other distribution determined by the hypothesis.

Applications of χ²
1. Testing the significance of association between two attributes
Chi-Square (χ²) is a statistical technique which attempts to establish the relationship between two variables both of
which are categorical in nature. For example, we may want to test the hypothesis that there is a relationship
between gender and road accidents caused by drivers. The variable ‘gender’ is categorized as male and female
while the variable "number of accidents" is categorized as 'none', 'few', and 'many'. The chi-square technique is
therefore applied to counts occurring in two or more mutually exclusive categories. It compares the proportion
observed in each category with what would be expected under the assumption of independence between the two
variables. If the observed frequency greatly departs from what is expected, then we reject the null hypothesis that
the two variables are independent of each other. We would then conclude that one variable is related to the other.
The technique yields one value which should be equal to or greater than zero.
2. As a test of independence
It is used to determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female) and voting preference
(Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether
gender is related to voting preference.

3. As a test of goodness of fit


This is test to show how well distribution of observed data fits the assumed theoretical distribution such as
normal distribution. It is used to determine whether sample data are consistent with a hypothesized distribution.

For example, suppose a company printed baseball cards. It claimed that 30% of its cards were rookies; 60%,
veterans; and 10%, All-Stars. We could gather a random sample of baseball cards and use a chi-square goodness of
fit test to see whether our sample distribution differed significantly from the distribution claimed by the company.
The sample problem at the end of the lesson considers this example.

4. Testing population variance


This is the testing of population variance through confidence intervals
5. Testing homogeneity
This is a test of whether different samples come from the same universe.

****
To determine the significance of our test, we compare the obtained chi-square value with a critical or table value. If the obtained value is greater than the critical value, we reject the null hypothesis. If one is using a computer program, the computer will give the χ² value and also the actual probability of the computed χ² value. In this case, one does not need the table to determine if a chi-square value is significant. If the probability of the computed chi-square value is less than the level of significance set, then the null hypothesis should be rejected and we conclude that the two variables are not independent of each other, and vice versa.

The data have to meet the requirements below:


1. The data must be in form of frequencies, counted in each of a number of categories.
2. The total number of observed frequencies must be greater than 20.
3. The expected frequency under Ho in any one category must not normally be less than 5.
4. The observations must be independent.

Formula:

χ² = ∑[(O – E)²/E]

Where:
.O – Observed frequency
.E – Expected frequency
.n – Number of categories
.Degrees of Freedom = n – 1

NB:
-χ² is a measure of the aggregate difference between observed and expected frequencies under Ho, such that the greater its value, the less likely it is that Ho is correct. So if the calculated value is greater than the critical value at a given significance level, then Ho is rejected.
-The degrees of freedom (DF) is the number of expected frequencies that are free to vary. The value of χ² depends on the degrees of freedom.

Example 1:
In an experiment, a single die was rolled 60 times. The results are represented in the table below; each face is expected to appear 10 times on average.
Test Observed (O) Expected (E)
1 15 10
2 7 10
3 4 10
4 11 10

5 6 10
6 17 10

Df = n-1 = 6-1 = 5
χ² = ∑[(O – E)²/E]

Test   Observed (O)   Expected (E)   O – E   (O – E)²   (O – E)²/E
1      15             10              5       25         2.5
2       7             10             -3        9         0.9
3       4             10             -6       36         3.6
4      11             10              1        1         0.1
5       6             10             -4       16         1.6
6      17             10              7       49         4.9
∑                                                       13.6

χ² = 13.6
Testing at the probability level of 0.05 gives a critical value of 11.1. Since 13.6 is greater than 11.1, we reject the
null hypothesis.
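
The whole calculation can be reproduced with a few lines of Python (illustrative only):

```python
# Chi-square goodness-of-fit for the dice experiment.
observed = [15, 7, 4, 11, 6, 17]
expected = 10
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(chi2)   # 13.6 > 11.07 (critical value, 5 df, 0.05), so Ho is rejected
```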

Example 2:
To determine whether in an area, the choice of a site for construction of residential houses depends on altitude. The
following observations were made.
Altitude(m) Observation(O)
100-150 13
151-200 21
201-250 20
251-300 30

Solution
(1) Formulate Ho and H 1
Ho – The choice of construction site for residential houses is independent of altitude
H1 - There is a significant relationship between choice of construction site for residential houses and altitude.
(2) Calculate the degree of freedom (DF) and decide on the rejection level
. DF = n-1 = 4-1 = 3
Rejection level in most cases is set at 0.05 (95% confidence limit).
(3) Calculate Chi-square (χ²)
Formula
χ² = ∑[(O – E)²/E]

Altitude(m)   Observation(O)   Expected (E)   O – E   (O – E)²   (O – E)²/E
100-150       13               21             -8       64         3.05
151-200       21               21              0        0         0
201-250       20               21             -1        1         0.05
251-300       30               21              9       81         3.86
∑             84                                                  6.96
*Expected frequency in most statistics means the average/mean; that is, in this case, if the houses were evenly distributed, then each cell would have 21 houses.

Since 6.96 is not greater than 7.82 (the critical value for 3 df at the 0.05 level), there is no adequate evidence to reject the null hypothesis. The choice of construction sites for residential houses in the area is independent of altitude.

The χ² test is also used for more than one variable.


Example 3
To determine whether in an area, the choice of a site for construction of residential houses depends on altitude and
direction. The following observations were made.
Altitude(m) Observed(West) Observed (East)
100-150 13 6
151-200 21 12
201-250 20 9
251-300 30 13

Ho – The choice of construction site for residential houses is independent of altitude and direction
H1 - There is a significant relationship between choice of construction site for residential houses, altitude and
direction.
In this case, the expected frequencies are calculated by multiplying the sum of the row (∑row) by the sum of the column (∑column), then dividing by the grand total (N):

Expected Frequency = (∑Row × ∑Column)/N
The degree of freedom, DF = (r-1) (c-1)
.df = (2-1) (4-1) = 3

Altitude(m)   Observed(West)   Observed(East)   ∑row   Expected(W)   Expected(E)   West (O – E)²/E   East (O – E)²/E
100-150       13               6                19     12.9          6.1           0.00              0.003
151-200       21               12               33     22.4          10.6          0.08              0.170
201-250       20               9                29     19.6          9.4           0.01              0.013
251-300       30               13               43     29.1          13.9          0.03              0.060
∑             84               40               124                                0.12              0.246

To obtain the calculated value, add all the (O – E)²/E contributions: χ² = 0.12 + 0.246 ≈ 0.37
Critical value at 0.05 level is 7.82 which is greater than 0.37. We have no adequate evidence to reject the null
hypothesis. The choice of construction site for residential houses is independent of altitude and direction.

Example 4: You wish to evaluate the association between a person's sex and their attitudes toward school spending
on athletic programs. A random sample of adults in your school district produced the following table (counts).
Female Male Row Total
Spend more money 15 25 40
Spend the same 5 15 20
Spend less money 35 10 45
Column Total 55 50 105

Assumptions
Independent random sampling
Nominal/Ordinal level data
No more than 20% of the cells have an expected frequency less than 5
No empty cells

State the Hypothesis


Ho: There is no association between a person's sex and their attitudes toward spending on athletic
programs.
Ha: There is an association between a person's sex and their attitudes toward spending on athletic
programs.

Set the Rejection Criteria


Determine degrees of freedom DF= (# of rows - 1) (# of columns - 1)
DF= (3 - 1) (2 - 1) or DF=2
Establish the confidence level (.05, .01, etc.); Alpha = .05
Based on the chi-square distribution table, the critical value = 5.991

Compute the Test Statistic

χ² = ∑[(Fo – Fe)²/Fe]

Where
Fo = observed frequency
Fe = expected frequency for each cell
Fe = (frequency for the column) × (frequency for the row)/N

Frequency Observed (Data that is actually collected from the field)


              Female   Male   ∑Row
Spend more    15       25     40
Spend same     5       15     20
Spend less    35       10     45
∑Column       55       50    105

Expected Frequency = (∑Row × ∑Column)/N

              Female                  Male                    ∑Row
Spend more    55×40/105 = 20.952      50×40/105 = 19.048      40
Spend same    55×20/105 = 10.476      50×20/105 = 9.524       20
Spend less    55×45/105 = 23.571      50×45/105 = 21.429      45
∑Column       55                      50                     105
Chi-square Calculations
Female (O - E)2/E Male (O - E)2/E
Spend more (15-20.952)2/20.952 (25-19.048)2/19.048
Spend same (5-10.476)2/10.476 (15-9.524)2/9.524
Spend less (35-23.571)2/23.571 (10-21.429)2/21.429

Chi-square
              Female   Male
Spend more    1.691    1.860
Spend same    2.862    3.149
Spend less    5.542    6.096
Total χ² = 21.200

Decide Results of Null Hypothesis


Since the chi-square test statistic 21.2 exceeds the critical value of 5.991, you may conclude there is a
statistically significant association between a person's sex and their attitudes toward spending on athletic
programs. As is apparent in the contingency table, males are more likely to support spending than females.
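
The full test of independence above can be reproduced with the sketch below (illustrative only):

```python
# Chi-square test of independence for the sex-versus-spending table.
table = [[15, 25],   # spend more: female, male
         [5, 15],    # spend the same
         [35, 10]]   # spend less

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n   # (row total × column total)/N
        chi2 += (obs - exp) ** 2 / exp

df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi2, 1), df)   # 21.2 with 2 df; critical value at 0.05 is 5.991
```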

Exercise 1
The following results show students’ performance relative to height.
Height/Performance   High   Medium   Middle   Low
Above average 14 11 10 5
Average 10 16 16 14
Below Average 3 14 7 10
QN: Formulate a suitable hypothesis and test it using an appropriate technique at 0.05 significance level

Exercise 2
The table below shows the frequencies of a random sample obtained on attitudes of residents surrounding Mau
Forest Reserve regarding conservation and resettlement of forest inhabitants.
Attitude Male Female
Strongly Support 12 14
Support 16 15
Do not Support 14 16
Strongly reject 8 11

QN: Formulate a suitable Ho and choose an appropriate test of statistics to test it at 0.05 significance level.

Exercise 3
The data below are frequencies of annually treated cases of 3 water-borne diseases in Mavoko Municipality for a period of 4 years.
Year Bilharzias Diarrhea Typhoid
1999 442 454 67
2000 327 509 142
2001 706 375 139
2002 375 236 45

QN: Formulate an appropriate Ho and validate it at 0.05 significance level


Fill in the tables below:
1. State the hypotheses: null and alternative.
2. Set the rejection criteria, i.e. calculate the degrees of freedom:
Degrees of Freedom = (number of rows – 1) (number of columns – 1)

3. Compute chi-square (χ²) using the formula below:

χ² = ∑[(Fo – Fe)²/Fe]

where Fo is the observed frequency (O) and Fe the expected frequency (E).

First, in the table below, obtain the sum for rows, the sum for columns and the grand total (N) by filling the
blank/bold cells. This will help you in calculating the Expected Frequency (Fe) or (E).

Year      Bilharzias   Diarrhea   Typhoid   ∑Row
1999      442          454        67        b
2000      327          509        142
2001      706          375        139
2002      375          236        45
∑Column   a                                 N

Second, use the formula

Expected Frequency = (∑row × ∑column)/N

to calculate the expected frequencies in the table below. An example is given for the expected bilharzia cases in 1999.

Year Bilharzias (E) Diarrhea (E) Typhoid (E)


1999 .b*a/N = y
2000
2001
2002

4. Replace the calculated expected values and the observed values in the formula. Perform this for every
disease, every year, and fill in the table below.

Year Bilharzias (O - E)2/E Diarrhea (O - E)2/E Typhoid (O - E)2/E


1999 (442 – y)2/y
2000
2001
2002
Total

Add the calculated totals for the three diseases to obtain the sum, which is the chi-square.
5. Read the Chi-Square table at the 95% level against the degrees of freedom calculated in step 2.
6. Compare the calculated chi-square with the one obtained from the table reading and decide whether to reject the null hypothesis as appropriate.

LECTURE 10 & 11
REGRESSION AND CORRELATION ANALYSIS
These are techniques for studying how the variations in one series are related to variations in another series. Regression and correlation exist at two levels of complexity:
 Linear Regression and Correlation and
 Multi-Linear Regression and Correlation.
NB: We should however remember that there are otherwise non-linear correlations too.
The two techniques are usually considered together and can appear to be one and the same thing: if two variables are strongly related (correlation exists), then it is easy to imagine that when one changes, the other changes as a consequence. We thus apply two types of coefficients (regression coefficients and correlation coefficients), which result from the statistical procedures below.


Regression addresses itself to the rate of change of one variable in relation to change in another. Correlation, on the other hand, addresses itself to the relationship between variables and the strength of these relationships, i.e. how the independent variable alters the dependent variable.
Correlation therefore measures the degree of relationship between the variables while regression analysis shows
how the variables are related.
Regression and correlation analysis thus determines the nature and the strength of relationship between two
variables.

REGRESSION ANALYSIS
Regression is the process of predicting one variable from another variable. Multiple regression describes the process by which several variables are used to predict another.
We can therefore use regression analysis to estimate/forecast/predict values of a variable given another well
correlated variable. In other words, the technique of regression analysis is used to:
1. Determine the statistical relationship between two or more variables and
2. To make prediction of one variable on the basis of the other.

Assumptions in Regression Analysis


i. That there is actual relationship between dependent and independent variables
ii. That the values of the dependent variable are random but the values of independent variable are fixed
quantities without error and are chosen by the experimenter.
iii. That there is clear indication of direction of the relationship.
iv. That the conditions are the same when the regression model is being used especially when using to make
prediction/forecast/extrapolation
v. That the analysis be used to predict values within the range for which it is valid.

Simple Linear Regression Model


In this, a single variable is used to predict another variable on assumption that a linear relationship exists between
the two variables; in which the variable to be predicted is called the dependent variable while the value upon which
prediction is based is called independent. We define this relationship by the expression: Y = a + bx.
The simple linear regression model for the linear regression line is stated as: yi = a + bxi + ei
Where: yi is the dependent variable
.xi is the independent variable
.ei is the unpredictable random element
.a represents the y intercept
.b is the constant indicating the slope of the regression line

REGRESSION ANALYSIS
Simple Linear Regression Model
In this analysis, a single variable is used to predict another variable on assumption that there exists a relationship
defined by y = a + bx.
Where:
.y – Dependent variable
.x – Independent variable
.a – y intercept
.b – A constant indicating the slope of the regression line (amount of change/gradient)


SCATTER DIAGRAMS METHOD


Scatter diagrams are used generally to represent two given sets of data graphically and compare them. The independent variable is usually plotted on the X axis and the dependent variable on the Y axis.
Uses
1. From a scatter diagram, we can tell whether there is any correlation between the two sets of data.
2. A scatter plot is a useful summary of a set of bivariate data (data in two variables), and is usually drawn before one works out a fitted regression line or a linear correlation coefficient.
3. It also gives a good visual picture of the type of relationship between the two variables.
4. It aids the interpretation of the regression model or the correlation coefficient.

A line is then drawn through the scatter plots to determine the intercept and the slope of the line y = a + bx. The shortcoming of this method is that different people drawing a line through the same plots will produce lines that deviate in trend.

CORRELATION ANALYSIS
Coefficient of Correlation (r)
Coefficient of correlation is a technique used to explain how well one variable is described by another. It ranges between +1 and -1. For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.
Some of the techniques for obtaining the r-value include: the Least Squares Method, using the Simple Regression Coefficient, and Karl Pearson's method.
In our case, we shall look at Karl Pearson's method (Pearson's Product Moment Correlation), which is calculated as below:

r = (n∑xy – ∑x∑y) / √{[n∑x² – (∑x)²][n∑y² – (∑y)²]}

Coefficient of Determination (r2)


This is the measure of the degree of linear association or correlation between two variables, dependent and
independent.


With a and b taken from the fitted regression line, it is calculated as:

r² = (a∑y + b∑xy – nȳ²)/(∑y² – nȳ²)

Where:
.r² – coefficient of determination
.a – y intercept
.b – slope of the best fitting estimation line
.x – value of the independent variable
.y – value of the dependent variable
.ȳ – the mean of the observed values of y

Interpreting r²
• The coefficient of determination can take values from 0 to 1.
• A value of 1 implies all data points in the scatter diagram fall exactly on the regression line: perfect correlation.
• A value of 0 occurs only when X tells us nothing about Y, that is, there is no relationship between the X and Y variables.
• Values between 0 and 1 indicate the goodness of fit.
• r² also indicates the amount of the variation in the dependent variable (y) that is explained by the independent variable (x), e.g. if r² = 0.967, then variation in the independent variable explains 96.7% of the variation in the dependent variable; i.e. it explains most of the variation because it is close to unity.

Correlation and Regression Analysis Sample exercises


Example1
x 2 5 4 6 3
y 6 10 7 9 8
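
For Example 1, the least-squares line y = a + bx can be obtained with the sketch below (illustrative only; it uses the standard least-squares formulas for b and a, which the notes apply implicitly):

```python
# Least-squares fit of y = a + bx for Example 1.
x = [2, 5, 4, 6, 3]
y = [6, 10, 7, 9, 8]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope: 0.8
a = sy / n - b * sx / n                          # intercept: 4.8
print(a, b)                                      # fitted line: y = 4.8 + 0.8x
```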

Example 2
S/N 1 2 3 4 5 6 7 8 9 10
Height (x) 52 62 58 48 55 60 56 53 50 62
Weight (y) 40 55 53 40 44 50 51 45 44 55

Example 3
A local milkshake shop keeps track of the number of milkshakes sold relative to the temperature on each day. Below are the figures of their sales and the temperature for the last 12 days. Comment on the relationship.

Milk Shakes Sales Vs Temperature


Temperature (oC) 14.2 16.4 11.9 15.2 18.5 22.1 19.4 25.1 23.4 18.1 22.6 17.2
Milk Shake Sales $215 $325 $185 $332 $406 $522 $412 $614 $544 $421 $445 $408

Example 4
The following summary relates to a medical experiment attempting to establish the relation between age and blood sugar.
Subject 1 2 3 4 5 6
Age 43 21 25 42 57 59
Glucose level 99 65 79 75 87 81
Determine the coefficient of correlation (r) and the coefficient of determination (r²).
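
A sketch applying Karl Pearson's formula to Example 4 (illustrative only):

```python
# Pearson's product-moment correlation for age vs glucose level.
import math

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
n = len(age)

sx, sy = sum(age), sum(glucose)
sxy = sum(a * b for a, b in zip(age, glucose))
sxx = sum(a * a for a in age)
syy = sum(b * b for b in glucose)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 2), round(r**2, 2))   # r ≈ 0.53, r² ≈ 0.28
```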
