
2021/08/25

Lecture 4

GIS220:
Descriptive statistics

Prof Gregory Breetzke


greg.breetzke@up.ac.za
Room 1-19, Geography

Lecture overview

• What are descriptive statistics?


• Types of descriptive statistics
– Univariate
– Bivariate
• Examples


Descriptive statistics

• Provide an initial entry point

• Some research questions can be answered satisfactorily
using descriptive statistics

Types of descriptive statistics

• Univariate and bivariate statistics


– U: mean, mode, range, standard deviation
– B: correlation coefficient


Types of descriptive statistics

UNIVARIATE

• Measures of central tendency


– Mean
– Mode
– Median
• Measures of dispersion
– Range
– Interquartile range
– Variance
– Standard deviation

The mean

• The mean is a measure of central value


– What most people mean by “average”
– Sum of a set of numbers divided by the number
of numbers in the set


The median
• Middlemost or most central item in the set of
ordered numbers; it separates the distribution
into two equal halves
• If n is odd, the median is the middle value of the ordered sequence
– if X = [1,2,4,6,9,10,12,14,17]
– then 9 is the median
• If n is even, the median is the average of the two middle values
– if X= [1,2,4,6,9,10,11,12,14,17]
– then 9.5 is the median; i.e., (9+10)/2
• Median is not affected by extreme values
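
The odd and even cases above can be checked with a minimal Python sketch using the standard library's `statistics` module. The two data sets are the ones on the slide; the extreme value 1700 is an assumed illustration, not part of the lecture.

```python
from statistics import mean, median

odd_set = [1, 2, 4, 6, 9, 10, 12, 14, 17]        # n = 9 (odd)
even_set = [1, 2, 4, 6, 9, 10, 11, 12, 14, 17]   # n = 10 (even)

median_odd = median(odd_set)     # middle value of the ordered set
median_even = median(even_set)   # average of the two middle values

# Assumed illustration: swap the largest value for an extreme one.
with_outlier = odd_set[:-1] + [1700]
median_outlier = median(with_outlier)   # same as median_odd
mean_before = mean(odd_set)
mean_after = mean(with_outlier)         # pulled upwards by the extreme value

print(median_odd, median_even, median_outlier)
print(mean_before, mean_after)
```

The median stays at 9 after the swap, while the mean moves a long way, which is why the median is preferred when a few extreme scores are present.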

The mode
• The mode is the most frequently occurring
number in a distribution
– if X = [1,2,4,7,7,7,8,10,12,14,17]
– then 7 is the mode
• Easy to see in a simple frequency distribution
• Possible to have no modes or more than one
mode
– bimodal and multimodal
• Don’t have to be exactly equal frequency
– major mode, minor mode
• Mode is not affected by extreme values
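
As a rough sketch, the mode can be found with Python's `statistics` module. The first data set is the one on the slide; `bimodal` and `no_repeats` are hypothetical data sets added only to show the other cases.

```python
from statistics import mode, multimode

x = [1, 2, 4, 7, 7, 7, 8, 10, 12, 14, 17]   # data set from the slide
single_mode = mode(x)                        # most frequently occurring value

# Hypothetical data sets (not from the lecture).
bimodal = [1, 2, 2, 2, 5, 8, 8, 8, 9]
no_repeats = [3, 5, 8]

both_modes = multimode(bimodal)    # all values tied for the highest frequency
all_tied = multimode(no_repeats)   # no value repeats, so every value is returned

print(single_mode, both_modes, all_tied)
```

Note that `multimode` only reports values with exactly equal top frequency, so it would not pick up a minor mode whose frequency is lower than the major mode's; a frequency table or histogram is needed for that.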


When to use what…?


• The mean is a great measure, but there are times when its
use is inappropriate or impossible
– You have nominal data: Mode
– The distribution is bimodal: Mode
– You have ordinal data: Median or mode
– There are a few extreme scores: Median

Dispersion
• Dispersion
– How tightly clustered or how
variable the values are in a data
set
• Example
– Data set 1: [0,25,50,75,100]
– Data set 2: [48,49,50,51,52]
– Both have a mean of 50, but data
set 1 clearly has greater variability than data set 2


Range
• The difference between the maximum and
minimum values in a set
• Example
– Data set 1: [1,25,50,75,100]; R: 100-1 = 99
– Data set 2: [48,49,50,51,52]; R: 52-48 = 4
– The range ignores how data are distributed and
only takes the extreme scores into account

• $\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$
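
A minimal sketch of the range calculation, using the two data sets from this slide. The helper name `value_range` and the third data set are assumptions added for illustration: the third set has the same extremes as data set 1 but a very different middle.

```python
def value_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

data_set_1 = [1, 25, 50, 75, 100]
data_set_2 = [48, 49, 50, 51, 52]
# Hypothetical set: same extremes as data set 1, different middle values.
data_set_3 = [1, 50, 50, 50, 100]

range_1 = value_range(data_set_1)
range_2 = value_range(data_set_2)
range_3 = value_range(data_set_3)

print(range_1, range_2, range_3)
```

Data sets 1 and 3 have the same range even though their values are distributed differently, which illustrates why the range alone says little about how the data are spread between the extremes.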

Quartiles
• Split ordered data into four quarters

– Q1 = first quartile (25th percentile)
– Q2 = second quartile = median (50th percentile)
– Q3 = third quartile (75th percentile)


Interquartile range (IQR)


• Difference between third and first quartiles
– Interquartile Range = Q3-Q1

• Spread in middle 50%

• Not affected by extreme values

• The IQR measures how spread out the middle 50% of the data points
in a set are around the median

• The higher the IQR, the more spread out the data points

• The smaller the IQR, the more bunched up the data points are
around the median

• It is best used with other measurements, such as the median and
total range, to build a complete picture of how tightly a data set
clusters around its centre.

Example

• Given the set of values: 27, 18, 19, 12, 15, 1, 2, 6, 5, 9, 7,
find the…
– Mean
– Median
– Range
– Interquartile range
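
One way to check an answer to this exercise is the sketch below, which uses Python's `statistics` module. Quartile conventions differ between textbooks and software, so the choice of `method="exclusive"` is an assumption, not necessarily the convention used in this course.

```python
from statistics import mean, median, quantiles

values = [27, 18, 19, 12, 15, 1, 2, 6, 5, 9, 7]

example_mean = mean(values)
example_median = median(values)
example_range = max(values) - min(values)

# Assumption: the "exclusive" method, which places each quartile at
# position p * (n + 1) in the ordered data.
q1, q2, q3 = quantiles(values, n=4, method="exclusive")
example_iqr = q3 - q1

print(example_mean, example_median, example_range)
print(q1, q3, example_iqr)
```

With this convention Q1 = 5 and Q3 = 18. Passing `method="inclusive"` instead interpolates between neighbouring values and gives Q1 = 5.5 and Q3 = 16.5, so it is worth stating which quartile rule was used when reporting an IQR.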


Standard deviation
• Let X = [3, 4, 5, 6, 7]
– $\bar{X} = 5$
– $(X - \bar{X}) = [-2, -1, 0, 1, 2]$
• Subtract $\bar{X}$ from each number in X
– $(X - \bar{X})^2 = [4, 1, 0, 1, 4]$
• Squared deviations from the mean
– $\sum (X - \bar{X})^2 = 10$
• Sum of squared deviations from the mean (SS)
– $\sum (X - \bar{X})^2 / (n - 1) = 10/4 = 2.5$
• Average squared deviation from the mean
– $\sqrt{\sum (X - \bar{X})^2 / (n - 1)} = \sqrt{2.5} = 1.58$
• Square root of the average squared deviation
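
The steps on this slide can be retraced in a short Python sketch. It assumes the sample formula, dividing by n - 1 = 4, and compares the result with `statistics.stdev`, which uses the same divisor.

```python
from math import sqrt
from statistics import stdev

x = [3, 4, 5, 6, 7]
n = len(x)

x_bar = sum(x) / n                        # mean of X
deviations = [xi - x_bar for xi in x]     # subtract the mean from each value
squared = [d ** 2 for d in deviations]    # squared deviations
ss = sum(squared)                         # sum of squared deviations (SS)
variance = ss / (n - 1)                   # average squared deviation
sd = sqrt(variance)                       # standard deviation

print(x_bar, ss, variance, round(sd, 2))
print(stdev(x))
```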

Standard deviation
• Most South African employers issue raises based on a
percentage of salary
• Why do supervisors think the fairest raise is a
percentage raise?
• Answer:
1) Because higher-paid people get the most money.
2) The easiest thing to do is to raise everyone’s salary by a fixed
percentage.
• If your budget went up by 5%, salaries can go up by 5%.
• The problem is that the flat percentage raise gives
unequal increases in reward


Standard deviation
• Acme Toilet Cleaning Services
• Salary Pool: R200,000

Incomes:
• President: R100K; Manager: R50K; Secretary: R40K; and
Toilet Cleaner: R10K
• Mean: R50K
• Range: R90K
• Variance: R1,050,000,000
• Standard Deviation: R32.4K
– These can be considered “measures of inequality”
• Now, let’s apply a 5% raise

Standard deviation
• After a 5% raise, the pool of money increases by R10K to
R210,000

• Incomes:
– President: R105K; Manager: R52.5K; Secretary: R42K; and Toilet Cleaner:
R10.5K
– Mean: R52.5K – went up by 5%
– Range: R94.5K – went up by 5%
– Variance: R1,157,625,000
– Standard Deviation: R34K – went up by 5%

• The flat percentage raise increased inequality. The top earner got
50% of the new money. The bottom earner got 5% of the new money.
Measures of inequality went up by 5%.
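
The Acme figures can be reproduced with the sketch below. The variance and standard deviation on the slides match the population formulas (dividing by n rather than n - 1); that is inferred from the numbers, so `pvariance` and `pstdev` are used here as an assumption. The helper `summarise` is a hypothetical name.

```python
from statistics import mean, pstdev, pvariance

# Salaries in rand, from the Acme example.
salaries = {"President": 100_000, "Manager": 50_000,
            "Secretary": 40_000, "Toilet Cleaner": 10_000}

def summarise(incomes):
    """Return mean, range, population variance and population SD."""
    return {
        "mean": mean(incomes),
        "range": max(incomes) - min(incomes),
        "variance": pvariance(incomes),
        "sd": pstdev(incomes),
    }

before = summarise(list(salaries.values()))

# Integer arithmetic keeps the 5% raise exact.
raised = {job: pay * 105 // 100 for job, pay in salaries.items()}
after = summarise(list(raised.values()))

# Share of the new money received by each person.
new_money = sum(raised.values()) - sum(salaries.values())
share = {job: (raised[job] - salaries[job]) / new_money for job in salaries}

print(before)
print(after)
print(share)
```

The mean, range and standard deviation each scale by 1.05. The variance is in squared units, so it scales by 1.05² = 1.1025, an increase of 10.25%.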


Skew
• Skewness is a measure of the asymmetry of the
probability distribution
• Roughly speaking, a distribution has positive skew
(right-skewed) if the right (higher value) tail is
longer and a negative skew (left-skewed) if the left
(lower value) tail is longer (confusing the two is a
common error)
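
To make the sign convention concrete, here is a minimal sketch of one common skewness measure, the moment-based coefficient m3 / m2^1.5. The slide does not specify a formula, so this choice is an assumption, and the three data sets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets (not from the lecture).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # long tail towards high values
left_skewed = [-v for v in right_skewed]     # mirror image: long tail towards low values
symmetric = [1, 2, 3, 4, 5]

skew_right = skewness(right_skewed)   # positive
skew_left = skewness(left_skewed)     # negative
skew_sym = skewness(symmetric)        # zero

print(skew_right, skew_left, skew_sym)
```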

Skew


Kurtosis

• A high kurtosis distribution has a sharper "peak" and fatter "tails",
while a low kurtosis distribution has a more rounded peak with
wider "shoulders".


Frequency distributions
• Symmetrical distribution
– Approximately equal numbers of observations above and
below the middle
• Skewed distribution
– One side is more spread out than the other, like a tail
– Direction of the skew
• Positive or negative (right or left)
• Side with the fewer scores
• Side that looks like a tail

Symmetrical vs. skewed distributions


Types of descriptive statistics

BIVARIATE

• Correlation
– Linear pattern of relationship between one variable (x) and
another variable (y): an association between two variables
• Relative position of one variable correlates with relative
distribution of another variable
• Warning:
– No proof of causality
– Cannot assume x causes y

Scatterplots and correlation


• A scatter plot (or scatter diagram) is used to show
the relationship between two variables
– Scatter diagram plots pairs of bivariate observations (x, y)
on the X-Y plane
– Y is called the dependent variable
– X is called an independent variable
• Correlation analysis is used to measure strength of
the association (linear relationship) between two
variables
– Only concerned with strength of the
relationship
– No causal effect is implied


Types of correlation
• Positive correlation
– High values of X tend to be associated with high values of Y.
– As X increases, Y increases
• Negative correlation
– High values of X tend to be associated with low values of Y.
– As X increases, Y decreases
• No correlation
– No consistent tendency for values on Y to increase or
decrease as X increases
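
The three cases above can be illustrated with a small sketch. The lecture refers only to "the correlation coefficient"; Pearson's r is assumed here, and the function name `pearson_r` and all data values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (not from the lecture).
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 5, 4, 5])    # Y tends to rise with X
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # Y falls as X rises
r_none = pearson_r(x, [2, 1, 3, 1, 2])        # no consistent tendency

print(round(r_positive, 2), round(r_negative, 2), round(r_none, 2))
```

The sign of r gives the direction of the linear relationship and its magnitude gives the strength. As the slides warn, a value close to +1 or -1 still says nothing about whether X causes Y.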


Applications

Individual vs Group (Neighbourhood)


What type of relationship?


Scatterplot: Video Games and Alcohol Consumption
[Figure: scatterplot. X-axis: Average Hours of Video Games Per Week (0 to 25); Y-axis: Average Number of Alcoholic Drinks Per Week (0 to 20)]

What type of relationship?


Scatterplot: Video Games and Test Score
[Figure: scatterplot. X-axis: Average Hours of Video Games Per Week (0 to 20); Y-axis: Exam Score (0 to 100)]


Each point represents something or some PLACE!!


Practical 1

Date: Thursday 26th August 1130-1430 (Posted on Thursday)

Location: Remotely or on-campus (Brown & Orange & Red IT labs)

Assistance: Thursdays 1130-1420 and Thursdays 14:00-16:00 by appointment via Doodle

Due: Thursday 9th September at 1130 (upload on ClickUp)

Task: Sampling exercise and gaining familiarity with GeoDa and ArcPro

Software: Excel, GeoDa and ArcPro
