LECTURE 1
STATISTICS
Statistics is the science of data. It is the science of organizing, describing and analyzing quantitative data. It also
refers to the indices which are derived from data through statistical procedures e.g. mean, standard deviation,
correlation coefficient etc.
LECTURE 2
SCALES/LEVELS OF MEASUREMENT
There are FOUR levels of measurement used in any form of geographic research. These are:
1. Nominal Scale
Also referred to as nominal classification, this scale involves categorization without numerical evaluation. It is used when we classify observations into mutually exclusive categories with no overlap. Members of the population are classified according to their similarities. It is the weakest level of measurement, classifying objects simply by naming different categories; for example sex, which is classified into male and female; region, tribe, climate, race, cultural groups, towns and farms, etc.
2. Ordinal Scale
This is a scale of measurement that gives the order of magnitude of the data, although the absolute values are not known. We assign rank order to a set of qualitative expressions such as good, average, poor, heavier than, longer than, etc. We also use this scale for broader concepts, e.g. stability, development, power, innovation, etc. It gives neither the size of the intervals between the scores nor the ratio by which one unit is higher than another. The intervals between successive points on the ordinal scale are distinguished on the basis of some criterion such as size or distance and then assigned ranks. For example, towns in Kenya can be assigned rank orders according to their population sizes as follows:
Town Rank
Nairobi 1
Mombasa 2
Kisumu 3
Nakuru 4
Characteristics of Ordinal Scale
1) It is asymmetric and transitive: asymmetry means that if X > Y, then Y cannot be greater than X; transitivity applies to successive members in a sequence, so if A is larger than B and B is larger than C, then A is larger than C.
2) It presumes that the variable measured has an underlying continuous distribution even though that distribution is not actually measured (e.g. soil categories).
3) It treats clusters of units as ties. A study which classifies regions as low rainfall and high rainfall, or arid and humid, is an example: it assumes that all parts within a region receive an equal amount of rainfall and that variation exists only between regions. The more ties a measurement produces, the less sensitive it is.
4) None of the four basic arithmetic operations (+, -, ×, and ÷) can be meaningfully performed on the ranked data; the ranks convey order only.
3. Interval Scales
In the interval scale, not only are the observations ordered but the magnitude of the difference separating any two observations along the measurement scale is known. It measures the distance from one value to the next. For example:
Town Rank Population
Nairobi 1 4m
4. Ratio Scale
Ratio scale is the highest level of measurement, with all the properties of the nominal, ordinal and interval scales belonging to it. It represents a further refinement in quantitative measurement, as it has a defined or non-arbitrary (true) zero point, a characteristic that permits certain comparisons between scale values that are not possible with the interval scale. It is an interval scale with the additional property that its zero position indicates the absence of the quantity being measured. On this scale a variable can be expressed as a proportion of a total, and most ordinary counting belongs to this category.
Under this scale, addition, subtraction, multiplication, division, ratios, square roots, etc. can all be performed.
For example, if a hill is 1000 metres high, it can be said that the hill is twice as high as one that is 500 metres high; but we cannot say that a place at 20 °C is twice as hot as a place at 10 °C, because the Celsius scale is only an interval scale. The physical sciences are better equipped when it comes to the ratio scale. Another example of a ratio scale is the amount of money you have in your pocket
right now (25 cents, 55 cents, etc.). Money is measured on a ratio scale because, in addition to having the properties
of an interval scale, it has a true zero point: if you have zero money, this implies the absence of money. Since
money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as
someone with 25 cents (or that Bill Gates has a million times more money than you do).
Application: weight, mass, velocity, etc. For example, if vehicle A moves at 90 km/h and B at 30 km/h, then A is 3 times as fast as B, so the ratio is 3:1.
********************************************
SAMPLING
Sampling is the use of a subset of the population to represent the whole (total population). The total population is the total collection of units, elements or individuals that one is interested in analyzing.
Sample: the subset of the population actually selected for study.
Sample size
An appropriate sample size is required for validity: if the sample is too small, the results will be questionable and will not support valid conclusions, while a sample that is too large wastes money and time. An appropriate sample size produces accurate results.
a) Probability sampling:
The best way to ensure that a sample will lead to reliable and valid inferences is to use a probability sample. A probability sampling method is any method of sampling that uses some form of random selection. In order to have a random method, you must set up some process or procedure that ensures that the different units in your population have a calculable probability (chance) of being selected.
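As an illustration of random selection (a sketch only; the frame of 500 numbered households is hypothetical, and Python is used here purely for demonstration), a simple random sample gives every unit the same calculable chance of selection:

```python
import random

# Hypothetical sampling frame: 500 numbered households (illustrative only).
population = list(range(1, 501))

random.seed(42)  # fixed seed so the draw can be reproduced

# Simple random sample of 30 units: each unit has the same calculable
# chance of selection (30/500), drawn without replacement.
sample = random.sample(population, k=30)

print(len(sample))       # 30
print(len(set(sample)))  # 30 (no unit selected twice)
```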
Advantages (of cluster sampling)
1. It is cheap and easy.
2. It is particularly useful when it is difficult or costly to develop a sampling frame, or when the population elements are widely dispersed geographically.
Disadvantage
Cluster sampling increases sampling error, since elements in a cluster are likely to be similar in many respects.
5. Multistage sampling
This is a complex form of cluster sampling: after clustering, instead of using all the elements contained in the selected clusters, we randomly select elements from each cluster. Constructing the clusters is the first stage, and selecting which elements within each cluster to use is the second stage. Example: sampling 1 province out of 8, then sampling a county within it, then a sub-county.
The difference between nonprobability and probability sampling is that nonprobability sampling does not involve
random selection and probability sampling does.
Main types of non-probability sampling are:
1. Convenient/Opportunity/Accidental sampling
2. Purposive/Judgmental sampling
3. Quota Sampling
4. Snowball Sampling
1. Convenient/Opportunity/Accidental sampling
Convenience sampling is also called volunteer or grab sampling. It is a sampling method in which the sample is drawn from the part of the population that is close to hand, i.e. a sampling unit is selected because it is readily available, easy to reach and convenient. One cannot generalize from this sampling technique to the entire population, but it is of great use during pilot testing.
Sometimes the sample is accessed through contacts or gatekeepers.
2. Purposive/Judgmental sampling
Judgmental/purposive sampling starts with a purpose in mind and the sample is thus selected to include people of
interest and exclude those who do not suit the purpose. It involves selecting a group of people because they have
particular traits that the researchers want to study. For example, judgmental sampling is employed while studying
the behavior of consumers of a particular product or service, in a market research.
3. Quota Sampling
In quota sampling we first segment the population into mutually exclusive segments, just as in stratified sampling, but judgment is then used to select the subjects or units from each segment in some specified proportion. This technique may be biased, since not everyone gets a chance of selection; this lack of randomness is its greatest weakness. It is widely used in opinion polls and market research, where interviewers are given a quota of subjects with specified characteristics whom they attempt to reach.
The researcher decides how many of each category are selected. For example, an interviewer might be instructed to interview 20 adult men, 20 adult women, 10 teenage boys and 10 teenage girls.
4. Snowball Sampling
This is a sampling design in which existing study subjects recruit future subjects from among their acquaintances. It may be extremely biased, since sample members are not selected from a sampling frame and it is respondent-driven. It can, however, allow the researcher to make estimates about the social network connecting a hidden population.
This sampling method involves two main steps:
a) Identify a few key individuals.
b) Ask these individuals to volunteer to distribute the questionnaires to people they know who fit the criteria of the desired sample, or to suggest acquaintances who meet the selection criteria, to whom the researcher then administers an interview.
LECTURE 3 - 5
DESCRIPTIVE AND INFERENTIAL STATISTICS
Most geographical studies are quantitative in nature, and statistics can be divided into the following main functions:
i. Descriptive statistics: - These are indices that describe a given sample. They help in expressing a set of data, composed of individuals that vary from one another to a greater or lesser extent, as a set of summary measures.
Descriptive statistics give information that describes the data in some manner, i.e. consist mainly of
methods for organizing and summarizing information.
For example, suppose a pet shop sells cats, dogs, birds and fish. If 100 pets are sold and 40 out of the 100
were dogs, then one description of the data on the pets sold would be that 40% were dogs.
A graphical representation of data is another method of descriptive statistics. Examples of this visual
representation are histograms, bar graphs and pie charts, etc. Using these methods, the data is described by
compiling it into a graph, table or other visual representation.
This provides a quick method to make comparisons between different data sets and to spot the smallest and
largest values and trends or changes over a period of time. If the pet shop owner wanted to know what type
of pet was purchased most in the summer, a graph might be a good medium to compare the number of each
type of pet sold and the months of the year.
ii. Inferential Statistics: - Most geographers deal with data obtained from samples, and it is assumed that the sample is representative of the total population. Inferential statistics therefore enable the geographer, within certain defined limits, to make statements about the characteristics of a population based on the data collected from a sample.
Inferential statistics consist of methods for drawing, and measuring the reliability of, conclusions about a population based on information obtained from a sample of that population.
We can therefore say, Descriptive statistics uses the data to provide descriptions of the population, either through
numerical calculations or graphs or tables while inferential statistics makes inferences and predictions about a
population based on a sample of data taken from the population in question. One of the most commonly used inferential techniques is hypothesis testing. We will cover various techniques of hypothesis testing in detail later in this course.
DESCRIPTIVE STATISTICS
There are two methods of describing data: graphical and numerical. Some of the basic graphical methods are histograms, frequency polygons and ogives, while numerical methods include frequency distributions, measures of central tendency, and measures of variability and dispersion. Basically, all data arrays and tabulations fall under numerical methods.
a. FREQUENCY DISTRIBUTIONS
Statistical techniques can be used to process a mass of figures relating to a single variable, e.g. kilometres, so that some significant meaning can be extracted from them. To do this, the first step is to build up a distribution of the frequencies with which the figures occur; that means, for example, combining the distances into a relatively small number of categories/classes and examining the number of observations falling into each class.
Reasons for constructing Frequency Distributions include:
1. To facilitate the analysis of data.
2. To estimate frequencies of the unknown population distribution from the distribution of sample data and
3. To facilitate the computation of various statistical measures
We can also say that discrete data is data that increases in jumps or whole numbers, e.g. the number of children in a family can be 0, 1, 2 or 4, and cannot be 2.5, 4.5 or 0.2. Continuous data, on the other hand, is data that increases continuously, such as kilometres travelled, which can be 60.8, 900.6, etc.
Key terms:
A frequency is the number of times a given datum occurs in a data set
Raw Data: - This is data that has been collected but not yet re-organized or re-arranged. The first step in making raw data more meaningful is to list the figures in order of size so that they run from the lowest to the highest, i.e. into an array. An array can be simplified further by listing each repeated figure only once, with the number of times it occurs written alongside it. This is called an ungrouped frequency distribution. The sum of the frequencies (f) must equal the total number of items making up the raw data.
Class Limits: - These are the extreme boundaries of a class, such as 401 – 450. Care must be taken when defining class limits to avoid overlapping classes or wide gaps between classes.
Class Interval: - Is the width of the class i.e. the difference between class limits.
Percentage: - This is the proportion of a subgroup relative to the total group or sample; it ranges from 0% to 100%. Percentages are important especially when there is a need to compare groups that differ in size.
Lower class limit of a class is the smallest value within the class.
Upper class limit of a class is the largest value within the class.
Class midpoint is found by adding a class’s lower class limit and upper class limit and dividing the result by 2.
Class boundaries are the numbers which separate classes. They are equally spaced halfway between neighboring
class limits.
Class width is the difference between two class boundaries (or corresponding class limits).
Grouping the data involves organizing the array into classes after choosing appropriate class limits. A group of such classes together with their frequencies is called a grouped frequency distribution. Grouping helps us to combine scores into a smaller number of categories. It can be necessitated:
- When scores are distributed in such a way that certain scores are not obtained by any subject
- When the samples are very large
- When the information sought is sensitive, e.g. annual income queries in a questionnaire
All the relative frequencies add up to 1 (except for any rounding error).
a. Histograms
These are graphs of a frequency distribution constructed on a horizontal axis with a continuous scale running from one extreme end of the distribution to the other. For each class in the distribution, a vertical rectangle is drawn with its base on the horizontal axis extending from one class limit to the other, and with its area or height proportional to the frequency of the class. The vertical axis is labelled 'frequency'. The difference between a histogram and a bar chart is that in bar charts, spaces are left between the bars to signify a lack of continuity or flow between the categories.
A relative frequency histogram compares each class interval to the Relative frequency (decimal or %). A relative
frequency histogram has the same shape and the same horizontal scale as the corresponding frequency histogram.
The difference is that the vertical scale measures the relative frequencies (measured as a percentage), not
frequencies.
b. Frequency Polygon
Procedure for construction:
1. Construct a histogram or if not required, construct a table of grouped frequency distribution.
2. Mark the mid-point of the top of each rectangle in the histogram or the mid-point of the class limits.
3. Join the mid-points with straight lines or a smooth curve.
If the mid-points of the bars/rectangles are connected with a smooth curve, the result is a frequency curve.
The curve should begin and end at the baseline (x-axis).
c. Ogives
An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative percentage of observations below the upper limit of each class in a cumulative frequency distribution. It is the curve obtained when the cumulative frequencies of a distribution are graphed. Procedure: plot the cumulative frequency against the upper class boundary of each class and join the points with a smooth curve.
Example:
The following rainfall data was collected at a meteorological station over a period of time
23 22 23 24 21 20 19 28
22 19 22 22 22 26 15 19
24 21 20 22 26 25 10 22
22 17 25 23 19 21 22 21
22 23 22 24 20 26 23 23
Construct an absolute, relative and cumulative frequency distribution table for the above data set.
Draw an absolute frequency distribution histogram and polygon for the above data set.
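The frequency-table part of this exercise can be checked with a short script (Python is used here purely as an illustration, not as part of the exercise):

```python
from collections import Counter

# The 40 rainfall readings from the table above.
rainfall = [23, 22, 23, 24, 21, 20, 19, 28,
            22, 19, 22, 22, 22, 26, 15, 19,
            24, 21, 20, 22, 26, 25, 10, 22,
            22, 17, 25, 23, 19, 21, 22, 21,
            22, 23, 22, 24, 20, 26, 23, 23]

n = len(rainfall)         # 40 observations
freq = Counter(rainfall)  # absolute frequency of each value

cumulative = 0
print(f"{'Value':>5} {'f':>3} {'Rel. f':>7} {'Cum. f':>7}")
for value in sorted(freq):
    f = freq[value]
    cumulative += f
    print(f"{value:>5} {f:>3} {f / n:>7.3f} {cumulative:>7}")
```

The last cumulative frequency must equal the number of observations, and the relative frequencies must sum to 1.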
STATISTICAL DESCRIPTIONS
MEASURES OF CENTRAL TENDENCY
Uses:
a. To provide a summary and a consistent description of sets of data
b. Means are importantly used in comparisons e.g. school scores in national examinations
c. Quickly condense a large amount of data
The mean of grouped data is obtained as x̄ = Σfx / n
Where: x̄ is the mean, n is the number of observations, f is the frequency, x is the mid-point of the class limits and Σ is the summation notation.
Example:
Steps:
1) Obtain the mid-point of each class.
2) Multiply each mid-point by its respective frequency to obtain fx.
3) Sum f and fx to obtain n and Σfx respectively.
4) Divide Σfx by n to obtain the mean.
Class Frequency (f) Mid-point (x) fx
401 - 420 12 410.5 4926
421 - 440 27 430.5 11623.5
441 - 460 34 450.5 15317
461 - 480 24 470.5 11292
481 - 500 15 490.5 7357.5
501 - 520 8 510.5 4084
Total 120 54600
Mean = Σfx / n = 54600 / 120 = 455
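The same computation can be sketched in a few lines (an illustrative check in Python, not part of the original procedure):

```python
# Grouped mean from class mid-points, reproducing the table above.
classes = [(401, 420, 12), (421, 440, 27), (441, 460, 34),
           (461, 480, 24), (481, 500, 15), (501, 520, 8)]

n = sum(f for lo, hi, f in classes)                       # 120
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)  # 54600.0

mean = sum_fx / n
print(mean)  # 455.0
```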
Example (shortcut/assumed-mean method, using assumed mean A = 450.5):
Class Frequency (f) Mid-point (x) fx .d = x - A .fd
401 - 420 12 410.5 4926 -40 -480
421 - 440 27 430.5 11623.5 -20 -540
441 - 460 34 450.5 15317 0 0
461 - 480 24 470.5 11292 20 480
481 - 500 15 490.5 7357.5 40 600
501 - 520 8 510.5 4084 60 480
Total 120 54600 540
Mean = A + Σfd / n = 450.5 + 540/120 = 450.5 + 4.5 = 455
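The assumed-mean shortcut can also be verified in a few lines (Python used as an illustration):

```python
# Assumed-mean (shortcut) method: deviations d = x - A, with A = 450.5,
# the mid-point of the 441 - 460 class.
A = 450.5
classes = [(410.5, 12), (430.5, 27), (450.5, 34),
           (470.5, 24), (490.5, 15), (510.5, 8)]  # (mid-point, frequency)

n = sum(f for x, f in classes)                 # 120
sum_fd = sum(f * (x - A) for x, f in classes)  # 540.0

mean = A + sum_fd / n
print(mean)  # 455.0, the same answer as the direct method
```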
Advantages
1. Fast and easy to calculate
2. Important for use in further analysis
3. Takes care of all the observations
Disadvantages
1. Sensitive to extreme values
Arithmetic average is extremely sensitive to extreme values. Imagine a data set of 4, 5, 6, 7, and 8,578. The
sum of the five numbers is 8,600 and the mean is 1,720 – which doesn’t tell us anything useful about the level of
the individual numbers. Therefore, arithmetic average is not the best measure to use with data sets containing a few
extreme values or with more dispersed (volatile) data sets in general. Median can be a better alternative in such
cases.
2. Not suitable for time-series data
The arithmetic mean is best for measuring central tendency when you are working with sets of independent values taken at one point in time; for values that accumulate over time, such as growth rates, other averages (e.g. the geometric mean) are more appropriate.
WEIGHTED MEAN
The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average),
except that instead of each of the data points contributing equally to the final average, some data points contribute
more than others.
To find the weighted mean:
- Multiply the numbers in your data set by their weights.
- Add the results up.
- Divide by the sum of the weights (when the weights already sum to 1, this step leaves the result unchanged).
Sample problem 1
Find the weighted mean of the numbers 1, 3, 5, 7 and 10 with equal weights (1/5 for each number):
1(×1/5) + 3(×1/5) + 5(×1/5) + 7(×1/5) + 10(×1/5) = 5.2
Sample problem 2
You take three 100-point exams in your statistics class and score 80, 80 and 95. The last exam is much easier than
the first two, so your professor has given it less weight. The weights for the three exams are:
- Exam 1: 40 % of your grade. (Note: 40% as a decimal is .4.)
- Exam 2: 40 % of your grade.
- Exam 3: 20 % of your grade.
What is your final weighted average for the class?
Multiply the numbers in your data set by the weights (The percent weight given to each exam is called a weighting
factor).
.4(80) = 32 or 80*40%
.4(80) = 32
.2(95) = 19
Add the numbers up. 32 + 32 + 19 = 83.
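The exam calculation can be written compactly (Python used as an illustration; dividing by the sum of the weights also covers the case where the weights do not add up to 1):

```python
# Weighted course grade: each exam score times its weighting factor.
scores = [80, 80, 95]
weights = [0.4, 0.4, 0.2]  # weighting factors for the three exams

weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(round(weighted_mean, 2))  # 83.0
```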
Sample problem 3:
Given two school classes, one with 20 students, and one with 30 students, the grades in each class on a test were:
Morning class = 62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81, 82, 83, 84, 86, 89, 93, 98
Afternoon class = 81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89, 90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95,
96, 97, 98, 99
The straight average for the morning class is 80 and the straight average of the afternoon class is 90. The straight average of 80 and 90 is 85, which is the mean of the two class means. However, this does not account for the difference in the number of students in each class (20 versus 30); hence the value of 85 does not reflect the average student grade (independent of class). The average student grade can be obtained by averaging all the grades, without regard to classes (add all the grades up and divide by the total number of students): (1600 + 2700) / 50 = 4300 / 50 = 86. Equivalently, weight each class mean by its class size: (20 × 80 + 30 × 90) / 50 = 86.
Note on Sample problem 2: if the weights had not summed to 1, you would divide by their sum. For instance, had the weights added up to 1.2 instead of 1, you would divide 83 by 1.2: 83 / 1.2 ≈ 69.17.
Sample problem 4: Alex usually works 7 days a week, but sometimes just 1, 2, or 5 days.
Alex worked:
on 2 weeks: 1 day each week
on 14 weeks: 2 days each week
on 8 weeks: 5 days each week
on 32 weeks: 7 days each week
What is the mean number of days Alex works per week?
If we use "Weeks" as the weighting:
Weeks × Days = 2 × 1 + 14 × 2 + 8 × 5 + 32 × 7
= 2 + 28 + 40 + 224 = 294
Also add up the weeks:
Weeks = 2 + 14 + 8 + 32 = 56
Divide: Mean = 294/56 = 5.25
Exercise
1. The numbers 1, 2, 3 and 4 have weights 0.1, 0.2, 0.3 and 0.4 respectively. What is the weighted mean?
2. The numbers 1, 2, 3, 4, 5 and 6 have weights 0.5, 0.1, 0.1, 0.1, 0.1 and 0.1 respectively.
What is the weighted mean?
3. In Bobby's school, math grades for the year are calculated from assignments, tests and a final exam.
Assignments count 30%, tests 20%, and the final exam 50%.
If Bobby has an assignment grade of 85, a test grade of 72, and an exam of 61, what is Bobby's overall
grade?
4. Cat wants to buy a new car, and decides on the following rating system:
- Appearance 10%
Weakness
The weighted mean can be easily influenced by outliers in your data. If you have very high or very low values in
your data set, the weighted mean may not be a good statistic to rely on.
MEDIAN
Median is the middle value of the scores, i.e. the mid-point that separates the upper 50% of the values from the lower 50%. To obtain the median, arrange the scores or values in ascending order; the middle value is the value above or below which an equal number of observations occur. If the total number of observations is odd, then the median will be one of the observed values. If the number of observations is even, then the median will be mid-way between the two middle values.
Formula (position of the median in an ordered array): Mdn position = (n + 1) / 2
Example: 3, 3, 1, 1, 4, 4, 5, 2, 2, 2
Arranged: 1, 1, 2, 2, 2, 3, 3, 4, 4, 5
Median = (2 + 3) / 2 = 2.5
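The worked example can be confirmed with the standard library (Python used as an illustration):

```python
from statistics import median

scores = [3, 3, 1, 1, 4, 4, 5, 2, 2, 2]

print(sorted(scores))  # [1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
print(median(scores))  # 2.5 (even n: mean of the two middle values, 2 and 3)
```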
Strengths: - it is not affected by the values of extreme items in the distribution
For grouped data, the median is given by: Mdn = Le + i × (n/2 − fb) / fc
Where: -fb is the sum of all the frequencies (cumulative frequency) below the median class
-fc is the frequency of the class containing the median
-Le is the lower limit of the class in which the median occurs
-.i is the interval or class width
Example 1:
Class Frequency (f)
401 - 420 12
421 - 440 27
441 - 460 34
461 - 480 24
481 - 500 15
501 - 520 8
120
Mdn = 440.5 + 20 × (120/2 – 39)/34
= 440.5 + 20 × (0.618)
= 440.5 + 12.35
= 452.85
(The cumulative frequency below the median class 441 – 460 is 12 + 27 = 39.)
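The grouped-median formula can be sketched as code (Python as an illustration). Note that the cumulative frequency below the median class 441 – 460 is 12 + 27 = 39:

```python
# Grouped median: Mdn = Le + i * (n/2 - fb) / fc, using the table above.
Le = 440.5  # lower boundary of the median class (441 - 460)
n = 120     # total frequency
fb = 39     # cumulative frequency below the median class (12 + 27)
fc = 34     # frequency of the median class
i = 20      # class width

mdn = Le + i * (n / 2 - fb) / fc
print(round(mdn, 2))  # 452.85
```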
MODE
Mode is the single value that occurs most frequently in the distribution or the midpoint of the class with the highest
frequency. The peak or point of the greatest concentration in the distribution can statistically be calculated by:
Mo = 3Mdn – 2x̄;
that is, the mode can be estimated once the mean and the median of the data set are known.
Mode is best applicable where a distribution is much skewed. For example; consider the following monthly income
for five individuals; 800, 900, 850, 750, 5000.
The mean would be (800 + 900 + 850 + 750 + 5000) / 5 = 1660.
This figure (1660), although an accurate statement of the mean, is not typical of the group as a whole because it is affected by the income of 5000. The median value is 850, and this is typically more representative of the group than the mean of 1660. Therefore, if a distribution is very much skewed, i.e. it contains more occurrences at one extreme than the other, the median or mode is more likely to be representative of the scores than the mean.
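The effect of the extreme income on the mean can be verified directly (Python used as an illustration):

```python
from statistics import mean, median

incomes = [800, 900, 850, 750, 5000]

print(mean(incomes))    # 1660 (pulled up by the single income of 5000)
print(median(incomes))  # 850 (more typical of the group)
```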
Solution:
Modal class = 625 – 675
= 625 + 30
= 655
Advantages
1) Very quick and easy to determine
2) Is an actual value of the data
3) Not affected by extreme scores
Disadvantages
1) Sometimes not very informative (e.g. cigarettes smoked in a day)
2) Can change dramatically from sample to sample
3) Might be more than one (which is more representative?)
MEASURES OF VARIABILITY
Measures of variability show how the scores differ amongst themselves in magnitude; they provide a standard way of describing the dispersion of scores.
Variability is the spread of scores around a particular central score or value, usually the mean; it is therefore the dispersion of scores around the mean of a distribution.
Example 1: 241, 521, 421, 250, 300, 365, 840, 958, 241
Example 2: 5, 8, 13, 74, 85, 88, 90, 91, 92, 92, 93, 94, 94, 94, 95, 95, 95, 96, 96, 98, 99, 101, 103, 106,
113.
The quartiles of grouped data are given by:
Q1 = LQ1 + c × (n/4 − m1) / f1
Q3 = LQ3 + c × (3n/4 − m3) / f3
and the quartile deviation is QD = (Q3 − Q1) / 2
Where:
LQ1 = lower limit of the first quartile class
f1 = frequency of the first quartile class
c = class interval
m1 = c.f. preceding the first quartile class
LQ3 = lower limit of the 3rd quartile class
f3 = frequency of the 3rd quartile class
m3 = c.f. preceding the 3rd quartile class
Example 1
3. The Mean Deviation
This is the average amount by which individual scores deviate from the mean. It is calculated by first finding out how much each value differs from the mean, summing the absolute values of these differences, and then dividing by the number of observations.
Example:
x (x – x̄) |x – x̄| (x – x̄)²
5 -5 5 25
7 -3 3 9
8 -2 2 4
12 2 2 4
18 8 8 64
Σx = 50; Σ|x – x̄| = 20; Σ(x – x̄)² = 106
x̄ = 50/5 = 10; Mean Deviation = 20/5 = 4
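The mean deviation can be checked in a few lines (Python used as an illustration):

```python
x = [5, 7, 8, 12, 18]
n = len(x)
x_bar = sum(x) / n  # 50 / 5 = 10.0

# Mean deviation: average of the absolute deviations from the mean.
mean_deviation = sum(abs(v - x_bar) for v in x) / n
print(mean_deviation)  # 4.0, i.e. (5 + 3 + 2 + 2 + 8) / 5
```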
Standard Deviation
For a population, the variance is σ² = Σ(x − μ)² / N. When the variance is computed from a sample, the formula s² = Σ(x − M)² / (n − 1) (where M is the mean of the sample) can be used: dividing by n gives a biased estimate of σ², while dividing by n − 1 gives an unbiased estimate of σ². Since samples are usually used to estimate parameters, s² is the most commonly used measure of variance.
Calculating the variance is an important part of many statistical applications and analyses. It is the first step in
calculating the standard deviation.
Example:
x (x – x̄) (x – x̄)²
5 -5 25
7 -3 9
8 -2 4
12 2 4
18 8 64
Σx = 50; Σ(x – x̄)² = 106
σ² = 106/5 = 21.2
If the variance is small then the scores are close together while a large variance implies the scores are more spread
out.
STDEV is obtained by taking the square root of the variance. A large STDEV implies a large deviation from the
mean, i.e. a greater variability, while a small STDEV denotes less variability of scores in the distribution.
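Both variance and STDEV for the example above are available in the standard library (Python used as an illustration; note the population vs. sample versions):

```python
from statistics import pstdev, pvariance, stdev, variance

x = [5, 7, 8, 12, 18]

print(pvariance(x))         # 21.2 (population variance: divide by n)
print(variance(x))          # 26.5 (sample variance: divide by n - 1)
print(round(pstdev(x), 3))  # 4.604 (population STDEV = sqrt of 21.2)
print(round(stdev(x), 3))   # 5.148 (sample STDEV = sqrt of 26.5)
```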
Properties of STDEV
1) It takes into account all scores and responds to the exact position of every score relative to the mean of the distribution, i.e. if a score shifts further from the mean, the STDEV increases.
2) It is sensitive to extreme scores.
Exercise
1. Students at Kibabii University sat for an examination at the end of the semester. Their results were grouped
as shown:
Class Frequency
50-54 2
55-59 3
60-64 6
65-69 9
70-74 12
75-79 15
80-84 10
85-89 8
90-94 6
95-99 4
i. Calculate the mean, mode and the median.
ii. Calculate the quartile deviation of the data
iii. Calculate the standard Deviation and Variance
Coefficient of Variation
C.V. is the standard deviation expressed as a percentage (or fraction) of the mean; that is, it is the ratio between the standard deviation of a sample and its mean. C.V. is used when there is a need to compare the variability of two or more distributions.
It allows us to compare the dispersions of two different distributions provided their means are positive. The greater the coefficient of variation, the greater the dispersion.
Example 1
One distribution has x̄ = 140 and σ = 28.28 and another has x̄ = 150 and σ = 24. Which of the two has the greater dispersion?
C.V.(1) = 28.28/140 ≈ 0.202 and C.V.(2) = 24/150 = 0.16, so the first distribution has the greater dispersion.
Example 2
Below are summary statistics relating to rainfall performance across Kajiado County (1961 - 2011). Calculate the coefficient of variation for each station and comment on your answer.
Inter-annual and spatial rainfall variability levels between the three stations:
            N.D.O. Met. Station   Mashuru Met. Station   M.S.W. Met. Station
Mean (mm)   830.3                 671.1                  449.3
STDEV (mm)  202.3                 174.9                  94.0
C.V.        0.24                  0.26                   0.21
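The C.V. row of the table can be reproduced in a few lines (Python used as an illustration):

```python
# C.V. = STDEV / mean for the three Kajiado County stations.
stations = {"N.D.O.": (830.3, 202.3),
            "Mashuru": (671.1, 174.9),
            "M.S.W.": (449.3, 94.0)}

cv = {name: round(sd / m, 2) for name, (m, sd) in stations.items()}
for name, value in cv.items():
    print(f"{name}: C.V. = {value}")
```

Mashuru shows the greatest relative variability (0.26) even though N.D.O. has the largest absolute STDEV.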
LECTURE 6
PROBABILITY DISTRIBUTIONS
A distribution is used by statisticians as a standard reference or a model by which to compare all other
distributions. In this course, we will discuss four types of probability distributions, namely:
Binomial Distribution, Poisson distribution, Exponential Distribution and Normal Distribution
NORMAL DISTRIBUTION
This is the most important and most frequently used continuous probability distribution because it fits many types of problems well. It is essential in inferential statistics because it describes probabilistically the link between a statistic and a parameter (i.e. between a sample and the population from which it is drawn). A normal distribution curve takes a bell-shaped form, with the variables clustered around a central value, i.e. the mean, and tailing off symmetrically on each side.
Nature of Skewness
Skewness can be positive or negative or zero.
1. When the values of mean, median and mode are equal, there is no skewness.
2. When mean > median > mode, skewness will be positive.
3. When mean < median < mode, skewness will be negative.
Why do we care? One application is testing for normality: many statistical inferences require that a distribution be normal or nearly normal. A normal distribution has skewness and excess kurtosis of 0, so if your distribution is close to those values then it is probably close to normal.
Pearson's Coefficient of Skewness:
PCS = 3(Mean - Median) / STDEV
A positive value shows a positive skew while a negative value shows a negative skew and the higher the
coefficient, the greater the skew.
Skewness Index
The skewness index measures whether, and how strongly, a distribution is skewed. It is obtained from the third moment of the distribution:
skewness = Σ(x − x̄)³ / (n·s³), where s is the standard deviation.
If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the
distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left,
meaning that the left tail is longer.
If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Bulmer (1979), a classic text, suggests this rule of thumb:
If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
If skewness is between −½ and +½, the distribution is approximately symmetric.
Kurtosis
Kurtosis is the peakedness of a distribution. It is a measure of the degree to which the frequency
distribution is concentrated around the frequency peak.
As skewness involves the third moment of the distribution, kurtosis involves the fourth moment.
≈ means approximately equal to
Interpreting Kurtosis
The reference standard is a normal distribution, which has a kurtosis of 3. For this reason, the excess kurtosis is often reported instead: excess kurtosis is simply kurtosis − 3. For example, the "kurtosis" reported by Excel is actually the excess kurtosis.
A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with
kurtosis ≈3 (excess ≈0) is called mesokurtic.
A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic. Compared to a normal
distribution, its tails are shorter and thinner, and often its central peak is lower and broader.
A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal
distribution, its tails are longer and fatter, and often its central peak is higher and sharper.
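The moment-based measures above can be computed directly from raw data. The short Python sketch below uses a small hypothetical data set to illustrate skewness (the third moment) and excess kurtosis (the fourth moment minus 3):

```python
# Moment-based skewness and excess kurtosis, computed from scratch.
# A minimal sketch; the sample data below are hypothetical.
def moments(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second moment (variance)
    m3 = sum((x - mean) ** 3 for x in data) / n   # third moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth moment
    skewness = m3 / m2 ** 1.5        # third moment / s^3
    kurtosis = m4 / m2 ** 2          # fourth moment / s^4 (normal = 3)
    return skewness, kurtosis - 3    # return skewness and EXCESS kurtosis

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 12]   # long right tail: right-skewed
skew, excess = moments(data)
print(round(skew, 2), round(excess, 2))
```

For this right-skewed sample the skewness comes out well above +1 (highly skewed by Bulmer's rule of thumb) and the excess kurtosis is positive (leptokurtic).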
LECTURE 7
INFERENTIAL STATISTICS
Inferential statistics differ from descriptive statistics in that they are explicitly designed to test hypotheses.
Sir Ronald A. Fisher, one of the most prominent statisticians in history, established the basic guidelines for
significance testing. He held that a statistical result may be considered significant if it can be shown that the
probability of it having occurred by chance is 5% or less.
We must also understand three related statistical concepts: sampling distribution, standard error, and
confidence interval. A sampling distribution is the theoretical distribution of an infinite number of samples
from the population of interest in your study. However, because a sample is never identical to the
population, every sample always has some inherent level of error, called the standard error. If this
standard error is small, then statistical estimates derived from the sample (such as sample mean) are
reasonably good estimates of the population. The precision of our sample estimates is defined in terms of a
confidence interval (CI). A 95% CI is defined as a range of plus or minus two standard deviations of the
mean estimate, as derived from different samples in a sampling distribution. Hence, when we say that our
observed sample estimate has a CI of 95%, what we mean is that we are confident that 95% of the time, the
population parameter is within two standard deviations of our observed sample estimate. Jointly, the p-
value and the CI give us a good idea of the probability of our result and how close it is to the
corresponding population parameter.
We assess data in terms of the standard error of the mean (SE = s/√n) and the standard error of the difference between means.
Example 1: The weights of a random sample of 11 three-year-old children were taken in a village. The sample mean
was 16 kg and the standard deviation of the sample was 2 kg. What was the SE?
SE = 2/√11 = 0.6kg
At 95% confidence interval the weights will be 16 ± (2 × 0.6) = 14.8 to 17.2kg.
This means that we are approximately 95% certain that the mean weight of all three-year-old children in the
population lies between 14.8 and 17.2 kg.
Increasing the sample size in the above example will increase the reliability of the calculation. For example, with a
sample size of 20 children instead of 11, the SE would have been:
SE = 2/√20 = 0.45 kg; therefore, the 95% confidence interval for the mean weight would have been 15.1 to 16.9 kg.
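The standard error and confidence interval calculations from Example 1 can be sketched as follows:

```python
import math

# Example 1: n = 11 three-year-old children, mean 16 kg, SD 2 kg.
mean, s, n = 16, 2, 11
se = s / math.sqrt(n)                        # standard error of the mean
lower, upper = mean - 2 * se, mean + 2 * se  # 95% CI (mean ± 2 SE)
print(round(se, 2), round(lower, 1), round(upper, 1))   # ≈ 0.6, 14.8, 17.2

# With n = 20 the SE shrinks, so the interval narrows:
se20 = s / math.sqrt(20)
print(round(mean - 2 * se20, 1), round(mean + 2 * se20, 1))   # ≈ 15.1, 16.9
```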
Example 2
A: 2 3 3 4 4 4 4 5 5 6
B: 1 1 2 4 12
Example 3:
The following weights were measured from 3 year old children. Calculate the standard error and the range to which
the weights fall at 95% confidence interval.
13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 20.
HYPOTHESIS
A hypothesis is a researcher’s prediction regarding the outcome of a study. It states possible differences,
relationships, or causes between two or more variables or concepts. Hypotheses are derived from existing
theories, previous research, personal observations or experiences.
The whole study revolves around the hypothesis.
Geography is a science that deals with the concepts of space and spatial relationships, which, along with time and
the composition of matter, comprise the three major parameters of concern to all sciences: (1) space, (2) time, and
(3) composition of matter. Geography seeks to explain how the subsystems of the physical environment are
organized on the earth’s surface, how humans distribute themselves over the earth, and how they relate spatially to
physical features and to other human beings. There are normally two groups of factors in geographical comparisons.
1. Those which operate consistently and from which predictions can be made. E.g. the driver of land use in
North Eastern and Central Provinces is rainfall which can be predicted.
2. Those which are irregular (random): these are factors which depend purely on chance. For example, when a
sample is taken from a population, the sample statistics are not necessarily the same as the population
parameters, and as such, answers about population characteristics drawn from the sample are tentative. In
this regard, the degree of confidence is introduced.
Types of Hypothesis
(1) Null hypotheses and (2) Alternative hypotheses (alternative non-directional hypotheses and alternative
directional hypotheses). The significance of a directional hypothesis is tested using a one-tailed t-test while
that of a non-directional hypothesis is tested using a two-tailed test.
(3) Directional and (4) Non-directional hypotheses
Null Hypothesis - Sometimes referred to as the statistical hypothesis.
It is a negative proposition: it always states that no real relationship or difference exists and that any relationship
between two variables or difference between two groups is merely due to chance or error.
Hypothesis Testing
Hypothesis testing involves the collection of data/information that may support or fail to support the stated
hypothesis. The purpose of hypothesis testing is to make a judgment about the difference between sample statistics
and hypothesized population parameters. The idea is to obtain a statistical inference about population parameters
from sample statistics.
General Assumptions
1) Population is normally distributed
2) The sampling was randomly conducted
3) Mutually exclusive comparison samples
4) Data characteristics match statistical technique
For interval / ratio data use: T-tests, Pearson correlation, ANOVA, OLS regression
For nominal / ordinal data use: Difference of proportions, chi square and related measures of association, logistic
regression
Steps:
1. Formulate the Hypothesis (Null and Alternative)
Null Hypothesis (Ho): There is no difference between the variables under study.
Alternative Hypothesis (H1): There is a difference between the variables.
Note: The alternative hypothesis will indicate whether a 1-tailed or a 2-tailed test is utilized to reject the null
hypothesis.
2. Decide on the rejection level or significance level.
-This determines how different the parameters and/or statistics must be before the null hypothesis can be
rejected. This "region of rejection" is based on alpha (α), the error associated with the confidence level.
The point of rejection is known as the critical value.
-The significance level is usually associated with the normal distribution. It is normally set at 5% (0.05). The
rejection level is a measure of how strong the evidence must be before Ho is rejected.
3. Select and carry out an appropriate statistical test to determine the probability (p) that the observed data could
have occurred by chance under Ho.
4. Decide on the fate of the null hypothesis. If the calculated p is less than the chosen alpha (α), or
equivalently if the calculated test statistic exceeds the tabulated critical value, then Ho is rejected at that
level of significance. However, if the reverse is true, then Ho cannot be rejected. This does not mean that
Ho is correct; it means that the evidence is not strong enough to reject it.
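The decision rule in step 4 can be sketched as a simple comparison of the computed p-value with the chosen alpha:

```python
# Minimal sketch of the decision rule: reject Ho when p < alpha.
def decide(p_value, alpha=0.05):
    if p_value < alpha:
        return "reject Ho"
    # Failing to reject does not prove Ho correct; the evidence is
    # simply not strong enough to reject it.
    return "fail to reject Ho"

print(decide(0.03))   # reject Ho
print(decide(0.20))   # fail to reject Ho
```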
LECTURE 8 & 9
PARAMETRIC TESTS
Under parametric testing techniques, we shall discuss one test: the Student's t-test.
Student t-test
The Student's t-test is a parametric test of the difference between two samples. It makes use of the t-distribution,
which is based on small samples; as the sample size increases, the t-distribution tends towards the normal
distribution. It is useful in determining the significance of the difference between two groups measured at an
interval scale. It is used for independent as well as paired or matched samples. It is also used to determine the
significance of correlation coefficients in regression analysis. It compares the means of two variables.
Before a t-test is applied, two assumptions must be made:
- The background populations of the samples are normally distributed
- The standard deviations of the populations are equal
The t-test involves lengthy calculations and as such is typically applied to small samples.
The t-test is called a parametric test because your data must come from populations that are normally distributed
and use interval measurement. The t-test is used to answer this question: Is there any difference between the
means of the two populations of which our data is a random sample? The t-test is also called a test of inference
because we are trying to discover if populations are different by studying samples from the populations, i.e., what
we find to be true about our samples we will assume to be true about the population.
T-Test Assumptions:
1. The first assumption is concerned with the scale of measurement. The assumption for a t-test is that the
scale of measurement applied to the data collected follows a continuous (interval or ratio) scale.
2. The second assumption is regarding simple random sample. The Assumption is that the data is collected
from a representative, randomly selected portion of the total population.
3. The third assumption is the data, when plotted, results in a normal distribution, bell-shaped distribution
curve.
4. The fourth assumption is that a reasonably large sample size is used for the test. A larger sample size means
the distribution of results should approach a normal bell-shaped curve.
5. The final assumption is the homogeneity of variance. Homogeneous, or equal, variance exists when the
standard deviations of samples are approximately equal.
For a ONE-TAILED test:
t = (x̄ - µ) / Sx̄, where Sx̄ = s/√n is the standard error of the mean
The specimens of copper wires drawn from a large lot have the following breaking strengths (in kg):
As the sample size is small (since n = 10) and the population standard deviation is not known, we
shall use t-test assuming normal population and shall work out the test statistic t as under:
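Since the breaking-strength values themselves are not reproduced here, the sketch below works the one-sample t statistic on ten hypothetical readings against a hypothetical claimed mean of 578 kg:

```python
import math

# One-sample t-test sketch; the data and mu0 below are hypothetical.
data = [578, 572, 570, 568, 572, 578, 570, 572, 596, 544]  # kg
mu0 = 578                                 # hypothesized population mean
n = len(data)
mean = sum(data) / n
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample SD
t = (mean - mu0) / (s / math.sqrt(n))     # t = (x-bar - mu) / (s/sqrt(n))
print(round(t, 2))   # ≈ -1.49 with df = n - 1 = 9
```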
On the other hand, if we are comparing two sample means, then we pool the standard deviations and
standard errors to yield the formula below:
t = (x̄1 - x̄2) / √(S1²/n1 + S2²/n2)
Where:
x̄1 = Mean of the first set of values
x̄2 = Mean of the second set of values
S1 = Standard deviation of the first set of values
S2 = Standard deviation of the second set of values
n1, n2 = Sizes of the two samples
From the above formula, we can further pool the two standard deviations in the denominator.
Example2
A study on the distribution of hardwood in two districts over a period of 18 years yielded the following summary
statistics.
Kitui Mwingi
Mean 325 421
SD 20 11
Formulate a suitable Ho and test it at 0.05 significance level.
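The test for Example 2 can be sketched from the summary statistics alone, assuming n = 18 observations (one per year) in each district; the sample size is inferred from the 18-year study period:

```python
import math

# Two-sample t from summary statistics (n = 18 per district is assumed).
x1, s1, n1 = 325, 20, 18   # Kitui
x2, s2, n2 = 421, 11, 18   # Mwingi
t = (x1 - x2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
df = n1 + n2 - 2           # 34 degrees of freedom
print(round(t, 2), df)     # ≈ -17.84, 34
# |t| far exceeds the two-tailed 0.05 critical value (~2.03), so Ho
# (no difference in mean hardwood distribution) is rejected.
```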
Example3
The following average slope measurements in degrees were made on different rock types in the same area.
Limestone Grit stone
(Degrees) (Degrees)
32.1 17.8
29.4 15.8
33 12.5
27.3 15.5
19 15.1
14.4 12.2
21.1 13.1
25.5 10.6
9.1 9.3
10.5 5.5
10.5
11
14.2
Question: Use a t-test for the difference between means to assess the validity of the statement that the slopes on the
two rock types differ.
Example 4
The data below show the results of a quality control exercise conducted during a nutritional survey.
Weight (kg)
Child No. Observer A Observer B
1 18.6 17.7
2 17.1 14.5
3 14.3 12.4
4 23.2 20.7
5 18.4 16.8
6 14.9 14.4
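Because the same children were weighed by both observers, a paired t-test on the differences is appropriate. A sketch:

```python
import math

# Paired t-test on the quality-control data: Observer A vs. Observer B.
a = [18.6, 17.1, 14.3, 23.2, 18.4, 14.9]   # Observer A weights (kg)
b = [17.7, 14.5, 12.4, 20.7, 16.8, 14.4]   # Observer B weights (kg)
d = [ai - bi for ai, bi in zip(a, b)]      # paired differences
n = len(d)
d_bar = sum(d) / n                         # mean difference
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
t = d_bar / (s_d / math.sqrt(n))
print(round(t, 2))   # ≈ 4.83; df = 5, two-tailed 0.05 critical ≈ 2.57
```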
Applications of χ²
1. Testing the significance of association between two attributes
Chi-square (χ²) is a statistical technique which attempts to establish a relationship between two variables, both of
which are categorical in nature. For example, we may want to test the hypothesis that there is a relationship
between gender and road accidents caused by drivers. The variable ‘gender’ is categorized as male and female
while the variable ‘number of accidents’ is categorized as ‘none’, ‘few’, and ‘many’. The chi-square technique
therefore applies to counts occurring in two or more mutually exclusive categories. It compares the proportion
observed in each category with what would be expected under the assumption of independence between the two
variables. If the observed frequency greatly departs from what is expected, then we reject the null hypothesis that
the two variables are independent of each other. We would then conclude that one variable is related to the other.
The technique yields one value which should be equal to or greater than zero.
2. As a test of independence
It is used to determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female) and voting preference
(Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether
gender is related to voting preference.
To determine the significance of our test, we compare the obtained chi-square value with a critical or table value. If
the obtained value is greater than the critical value, we reject the null hypothesis. If one is using a computer
program, the computer will give the χ² value and also the actual probability of the computed χ² value. In this case,
one does not need the table to determine if a chi-square value is significant. If the probability of the computed chi-
square value is less than the level of significance set, then the null hypothesis should be rejected and we conclude
that the two variables are not independent of each other, and vice versa.
Formula:
χ² = Σ (O - E)² / E
Where:
O – Observed frequency
E – Expected frequency
n – Number of categories
Degrees of freedom = n - 1
NB:
-χ² is a measure of the aggregate difference between observed and expected frequencies under Ho, such that the
greater its value, the less likely it is that Ho is correct. So if the calculated value is greater than the critical value at
a given significance level, then Ho is rejected.
-The degrees of freedom (DF) is the number of categories whose expected frequencies are free to vary. The value
of χ² depends on the degrees of freedom.
Example 1:
In an experiment of rolling a single die, each face is expected to appear 10 times on average. The observed results
are represented in the table below.
Face Observed (O) Expected (E)
1 15 10
2 7 10
3 4 10
4 11 10
Df = n-1 = 6-1 = 5
χ² = Σ (O - E)² / E
Example 2:
To determine whether, in an area, the choice of a site for construction of residential houses depends on altitude, the
following observations were made.
Altitude(m) Observation(O)
100-150 13
151-200 21
201-250 20
251-300 30
Solution
(1) Formulate Ho and H1
Ho – The choice of construction site for residential houses is independent of altitude
H1 - There is a significant relationship between choice of construction site for residential houses and altitude.
(2) Calculate the degree of freedom (DF) and decide on the rejection level
. DF = n-1 = 4-1 = 3
Rejection level in most cases is set at 0.05 (95% confidence limit).
(3) Calculate chi-square (χ²)
Formula:
χ² = Σ (O - E)² / E
Since 6.96 is not greater than 7.82, there is no adequate evidence to reject the null hypothesis. The choice of
construction sites for residential houses in the area is independent of altitude.
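The chi-square calculation for this example can be sketched as a goodness-of-fit test against equal expected frequencies:

```python
# Chi-square goodness of fit: equal expected frequencies under Ho.
observed = [13, 21, 20, 30]               # counts per altitude class
expected = sum(observed) / len(observed)  # 84/4 = 21 under Ho
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 2))   # ≈ 6.95 (the notes round to 6.96); critical 7.82 at df = 3
```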
Example 3:
Ho – The choice of construction site for residential houses is independent of altitude and direction
H1 – There is a significant relationship between choice of construction site for residential houses, altitude and
direction.
In this case, the expected frequencies are calculated by multiplying the sum of the row (row total) by the sum of the
column (column total) and then dividing by the grand total (N):
Expected frequency = (Row total × Column total) / N
The degree of freedom, DF = (r-1) (c-1)
.df = (2-1) (4-1) = 3
To obtain the calculated value, add all the (O - E)²/E terms, thus χ² = 0.37.
The critical value at the 0.05 level is 7.82, which is greater than 0.37. We have no adequate evidence to reject the
null hypothesis. The choice of construction site for residential houses is independent of altitude and direction.
Example 4: You wish to evaluate the association between a person's sex and their attitudes toward school spending
on athletic programs. A random sample of adults in your school district produced the following table (counts).
Female Male Row Total
Spend more money 15 25 40
Spend the same 5 15 20
Spend less money 35 10 45
Column Total 55 50 105
The test statistic is χ² = Σ (Fo - Fe)² / Fe
Where:
Fo = observed frequency
Fe = expected frequency for each cell
Fe = (frequency for the column) × (frequency for the row) / N
Chi-square contributions ((Fo - Fe)²/Fe) per cell:
Female Male
Spend more 1.691 1.860
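The full calculation for Example 4, extending the two cell contributions shown above to the whole table, can be sketched as:

```python
# Chi-square test of independence for the sex vs. spending table.
table = [[15, 25],   # spend more money   (female, male)
         [5, 15],    # spend the same
         [35, 10]]   # spend less money
n = sum(sum(row) for row in table)                  # grand total, 105
row_tot = [sum(row) for row in table]               # [40, 20, 45]
col_tot = [sum(col) for col in zip(*table)]         # [55, 50]
chi2 = 0.0
for i, row in enumerate(table):
    for j, fo in enumerate(row):
        fe = row_tot[i] * col_tot[j] / n            # expected frequency
        chi2 += (fo - fe) ** 2 / fe
df = (len(table) - 1) * (len(table[0]) - 1)         # (3-1)(2-1) = 2
print(round(chi2, 2), df)   # ≈ 21.2, 2; critical at 0.05 is 5.99, so reject Ho
```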
Exercise 1
The following results show students’ performance relative to height.
Height/Performance High Medium Middle-Low Low
Above average 14 11 10 5
Average 10 16 16 14
Below Average 3 14 7 10
QN: Formulate a suitable hypothesis and test it using an appropriate technique at 0.05 significance level
Exercise 2
The table below shows the frequencies of a random sample obtained on attitudes of residents surrounding Mau
Forest Reserve regarding conservation and resettlement of forest inhabitants.
Attitude Male Female
Strongly Support 12 14
Support 16 15
Do not Support 14 16
Strongly reject 8 11
QN: Formulate a suitable Ho and choose an appropriate test of statistics to test it at 0.05 significance level.
Exercise 3
The data below are frequencies of annually treated cases of 3 water-borne diseases in Mavoko Municipality for a
period of 4 years.
Year Bilharzias Diarrhea Typhoid
1999 442 454 67
2000 327 509 142
2001 706 375 139
2002 375 236 45
4. Replace the calculated (expected) values above and the observed values in the formula. Perform this for every
disease case and every year, and fill in the table below.
Add the calculated totals for the three disease cases to obtain the sum, which is the chi-square.
5. Use the chi-square tables at the 95% level against the degrees of freedom calculated in procedure 2.
6. Compare the calculated chi-square with the one obtained from the table reading and decide whether to reject
the null hypothesis as appropriate.
LECTURE 10 & 11
REGRESSION AND CORRELATION ANALYSIS
These are techniques of studying how the variations in one series are related to variations in another series.
Regression and correlation exist at two levels of complexity:
Linear Regression and Correlation, and
Multi-Linear Regression and Correlation.
NB: We should however remember that there are otherwise non-linear correlations too.
The two techniques are usually considered together because they appear to be one and the same thing: if two
variables are strongly related (correlation exists), then it is easy to imagine that when one changes, the other
changes as a consequence. We thus apply two types of coefficients (regression coefficients and correlation
coefficients) which are obtained using statistics.
Regression addresses itself to the rate of change of one variable in relation to change in another. Correlation, on
the other hand, addresses itself to the relationship between variables and the strength of that relationship, i.e. the
independent variable altering the dependent variable.
Correlation therefore measures the degree of relationship between the variables while regression analysis shows
how the variables are related.
Regression and correlation analysis thus determines the nature and the strength of relationship between two
variables.
REGRESSION ANALYSIS
Regression is the process of predicting one variable from another variable. Multiple regression describes the
process by which several variables are used to predict another.
We can therefore use regression analysis to estimate/forecast/predict values of a variable given another well
correlated variable. In other words, the technique of regression analysis is used to:
1. Determine the statistical relationship between two or more variables and
2. To make prediction of one variable on the basis of the other.
Simple Linear Regression Model
In this analysis, a single variable is used to predict another variable on assumption that there exists a relationship
defined by y = a + bx.
Where:
y – Dependent variable
x – Independent variable
a – y-intercept
b – A constant indicating the slope of the regression line (amount of change/gradient)
A line is then drawn through the scatter plot to determine the intercept and the slope of the line y = a + bx. The
shortcoming of this method is that different people drawing a line through the same plots by eye will produce lines
that deviate in trend.
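The least squares method removes this subjectivity by computing a and b directly from the data. A minimal sketch (the data points below are hypothetical):

```python
# Least-squares estimates of a and b in y = a + bx, avoiding the
# eyeballing problem. The (x, y) data are hypothetical.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
b = num / den            # slope
a = y_bar - b * x_bar    # y-intercept
print(round(a, 2), round(b, 2))   # ≈ 0.05 and 1.99
```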
CORRELATION ANALYSIS
Coefficient of Correlation (r)
Coefficient of correlation is a technique used in explaining how well one variable is described by another. It ranges
between +1 and -1. For example, a correlation of r = 0.9 suggests a strong, positive association between two
variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero
suggests no linear association between two continuous variables.
Some of the techniques for obtaining the r-value include: the coefficient of correlation by the least squares method,
use of the simple regression coefficient, and Karl Pearson’s method.
In our case, we shall look at Karl Pearson’s method (Pearson’s product moment correlation), which is
calculated as below:
r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² × Σ(y - ȳ)²]
The coefficient of determination can in turn be computed from the regression estimates as:
r² = (aΣy + bΣxy - nȳ²) / (Σy² - nȳ²)
Where:
r² – coefficient of determination
a – y-intercept
b – slope of the best-fitting estimation line
x – value of the independent variable
y – value of the dependent variable
ȳ – the mean of the observed values of y
Interpreting r2
The coefficient of determination can take values from 0 to 1.
A value of 1 implies all data points in the scatter diagram fall exactly on the regression line: perfect correlation.
A value of 0 occurs only when X tells us nothing about Y, that is, there is no relationship between the X and Y
variables.
Values between 0 and 1 show the goodness of fit.
r² also indicates the amount of variation in the dependent variable (y) that is explained by the independent
variable (x), e.g. if r² = 0.967, then variation in the independent variable explains 96.7% of the variation in the
dependent variable; i.e. it explains most of the variation because it is close to unity.
Example 2
S/N 1 2 3 4 5 6 7 8 9 10
Height (x) 52 62 58 48 55 60 56 53 50 62
Weight (y) 40 55 53 40 44 50 51 45 44 55
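Pearson's product moment correlation for the height/weight data in Example 2 can be sketched as:

```python
import math

# Pearson's product-moment correlation for the height/weight data above.
x = [52, 62, 58, 48, 55, 60, 56, 53, 50, 62]   # height
y = [40, 55, 53, 40, 44, 50, 51, 45, 44, 55]   # weight
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))   # ≈ 0.91: a strong positive correlation
```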
Example 3
A local milk-shake shop keeps track of the number of milk shakes sold relative to the temperature on each day.
Below are the figures of their sales and the temperature for the last 12 days. Comment on the relationship.
Example 4
The following summary relates to a medical experiment attempting to establish the relation between age and blood sugar.
Subject 1 2 3 4 5 6
Age 43 21 25 42 57 59
Glucose level 99 65 79 75 87 81
Determine the coefficient of correlation (r) and the coefficient of determination (r²).
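A sketch of the calculation for Example 4:

```python
import math

# r and r^2 for the age vs. blood-glucose data above.
age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
n = len(age)
x_bar, y_bar = sum(age) / n, sum(glucose) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, glucose))
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in glucose)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2), round(r * r, 2))   # ≈ 0.53 and 0.28
```

Here r² is well below unity, so age explains only a modest share of the variation in glucose level.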