Introduction To Statistics Lecture 7

Introduction to statistics
Arinaitwe Irene, PhD
04/12/2024
Statistics
A body of concepts and methods which deal with the collection, organization,
summarizing, presentation, analysis and interpretation of data for the purpose of
assisting in drawing valid conclusions and making more effective decisions.
Statistics is divided into two broad branches, that is:
• Descriptive [Basic] statistics
• Inferential [Inductive] statistics
Descriptive statistics
• Analysis of data that helps describe, show or summarize data in a meaningful way.
• Does not, however, allow us to draw conclusions beyond the data.
• A way to describe our data.
• Enables us to present the data in a more meaningful way, which allows simpler interpretation of the data.
Example
For example, if we had the results of 100 pieces of students' coursework, we may be
interested in the overall performance of those students. We would also be interested
in the distribution or spread of the marks.
Descriptive statistics allow us to do this.
Descriptive statistics
There are two general types of statistic that are used to describe data:
• Measures of central tendency
• Measures of spread
Scales
• A scale represents a composite measure of a variable.
• It is a series of items arranged according to value for the purpose of quantification.
• It provides a range of values that correspond to different characteristics, or amounts of a characteristic, exhibited in observing a concept.
• Scales come in four different levels: Nominal, Ordinal, Interval, and Ratio.
Nominal
• Nominal variables are variables that have two or more categories, but which do not have an intrinsic order.
• For example, a real estate agent could classify their types of property into distinct categories such as houses, condos, co-ops or bungalows.
• So "type of property" is a nominal variable with 4 categories called houses, condos, co-ops and bungalows.
Ordinal
• Ordinal variables are variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked.
• For example, satisfaction could be rated as low, medium or high.
Interval
• Interval variables are variables whose central characteristic is that they can be measured along a range and have a numerical value (for example, temperature measured in degrees Celsius or Fahrenheit).
• The intervals between values are equal, e.g., the difference between 20°C and 30°C is the same as the difference between 30°C and 40°C.
Ratio
• Class task
• Example ???
Ratio
• Ratio variables are interval variables, but with the added condition
that 0 (zero) of the measurement indicates that there is none of that
variable.
• Examples of ratio variables include height, mass, and distance.
Measures of Central Tendency
• A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
• Put another way, these are ways of describing the central position of a frequency distribution for a group of data.
• We can describe this central position using several statistics, including the mode, median, and mean.
Mean
• The mean (or average) is the most popular and well-known measure of
central tendency.
• It can be used with both discrete and continuous data, although its use
is most often with continuous data.
• An important property of the mean is that it includes every value in
your data set as part of the calculation.
When the mean is not appropriate (example 1)
The mean has one main disadvantage: it is particularly susceptible to the influence of
outliers. These are values that are unusual compared to the rest of the data set by
being especially small or large in numerical value.
Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data
suggests that this mean value might not be the best way to accurately reflect the
typical salary of a worker, as most workers have salaries in the $12k to $18k range.
The mean is being skewed by the two large salaries.
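The effect of the two large salaries can be checked numerically. The sketch below uses Python's standard `statistics` module on the ten salaries from the table; the variable names are illustrative, and recomputing the mean without the two largest values is only a demonstration, not a recommendation to discard data.

```python
from statistics import mean, median

# Salaries of the ten staff from the table, in thousands of dollars.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# The mean uses every value, so the 90k and 95k salaries pull it upward.
mean_salary = mean(salaries_k)

# The median is the middle of the sorted values and barely notices them.
median_salary = median(salaries_k)

# Illustration only: the mean of the eight salaries in the 12k to 18k range.
mean_without_outliers = mean(sorted(salaries_k)[:-2])

print(f"mean = {mean_salary}k")
print(f"median = {median_salary}k")
print(f"mean without the two largest salaries = {mean_without_outliers}k")
```

The mean with all ten values sits well above every salary in the 12k to 18k range, while the median stays inside it.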
When the mean is not appropriate (example 2)
• The median is preferred over the mean (or mode) when the data is skewed.
• If we consider the normal distribution, as this is the most frequently assessed in statistics, when the data is perfectly normal, the mean, median and mode are identical.
• However, as the data becomes skewed the mean loses its ability to provide the best central location for the data because the skewed data is dragging it away from the typical value.
• The median best retains this position and is not as strongly influenced by the skewed values, and is therefore more realistic than the mean.
The Median
• The median is the number in the middle when the numbers in a set of data are arranged in ascending or descending order.
• Computing the median when the number of scores in the set is odd involves the following steps:
1. Order the scores in numerical order from lowest to highest
2. Count the number of scores
3. Select the middle score as the median
• If the number of scores in the set is even, then:
– First perform steps 1 and 2 above,
– Then find the mean of the 2 middle scores as the median
Median Example
• If the set of scores is {15, 20, 35, 45, 50, 56, 67}, then the median is 45.
• If our set of scores is {15, 20, 21, 20, 36, 15, 25, 15}, then before computing the median we need to order these values in ascending order, getting {15,15,15,20,20,21,25,36}.
• There are 8 scores, and scores #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20.
• If the two middle scores had different values, you would have to interpolate, that is, take the mean of the two, to determine the median.
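The odd and even cases can be sketched in a few lines of Python. The function name `median_of` is hypothetical, chosen for this illustration; it follows the steps listed for the median (order, count, then pick or average the middle).

```python
def median_of(scores):
    """Median by the slide's steps: order, count, take the middle."""
    ordered = sorted(scores)   # step 1: order from lowest to highest
    n = len(ordered)           # step 2: count the scores
    if n == 0:
        raise ValueError("the median of an empty set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle score is the median.
        return ordered[mid]
    # Even count: the mean of the two middle scores is the median.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median_of([15, 20, 35, 45, 50, 56, 67])
even_median = median_of([15, 20, 21, 20, 36, 15, 25, 15])
print(odd_median, even_median)
```

When the two middle scores differ, as in {1, 2, 3, 4}, the same function returns their mean, 2.5.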
The Mode
• The mode is the value that occurs most frequently in a set of data.
• In this data set {15,15,15,20,20,21,25,36} the value 15 occurs three times and is the mode.
• {14, 25, 23, 67, 25, 78, 65, 45, 25, 18, 20, 89, 25, 90}
• In the above set of scores, the mode is 25
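Counting how often each value occurs gives the mode directly. The sketch below uses `collections.Counter` from the Python standard library; the helper name `modes_of` is an assumption for this example, and it returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter

def modes_of(scores):
    """Return every value that occurs with the highest frequency.

    Assumes a non-empty collection of scores.
    """
    counts = Counter(scores)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

first_modes = modes_of([15, 15, 15, 20, 20, 21, 25, 36])
second_modes = modes_of([14, 25, 23, 67, 25, 78, 65, 45, 25, 18, 20, 89, 25, 90])
print(first_modes, second_modes)
```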
Skewed Distributions and the Mean and Median
• We often test whether our data is normally distributed because this is a common assumption underlying many statistical tests. An example of a normally distributed set of data is presented.
• When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency.
Skewed data
• When our data is skewed, as with a right-skewed data set, we find that the mean is being dragged in the direction of the skew.
• In these situations, the median is generally considered to be the best representative of the central location of the data.
• The more skewed the distribution, the greater the difference between the median and mean, and the greater the emphasis that should be placed on using the median as opposed to the mean.
When to use the mean, median and mode
Type of Variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median
Measures of spread
• These are ways of summarizing a group of data by describing how spread out the scores are.
• For example, the mean score of our 100 students may be 65 out of 100. However, not all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower and others higher.
• Measures of spread include the range, quartiles, absolute deviation, variance and standard deviation.
Range
• The range: The highest value minus the lowest value.
• In our example data set {15,15,15,20,20,21,25,36}, the high value is 36 and the
low is 15.
• So the range is 36 - 15 = 21
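As a minimal sketch, the same calculation in Python uses the built-in `max` and `min` functions (the variable names are illustrative):

```python
scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Range: the highest value minus the lowest value.
data_range = max(scores) - min(scores)
print(data_range)
```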
04/12/2024 BIT 1205 23

Range
• However, the range only provides information about the maximum and minimum values and does not say anything about the values in between.
• A commonly used measure of dispersion is the standard deviation, which is simply the square root of the variance.
• The variance is defined as the sum of the squared distances of each term in the distribution from the mean, divided by the number of terms in the distribution (or by the number of terms minus 1 when the data are a sample, as in the worked example that follows).
• Squaring the difference makes each term positive so that values above the mean do not cancel values below the mean.

Standard Deviation
• The standard deviation is a more accurate and detailed estimate of dispersion than the range, because an outlier can greatly exaggerate the range.
• In our example data set, i.e. {15,15,15,20,20,21,25,36}, the single outlier value of 36 stands apart from the rest of the values.
• The standard deviation shows the relation that the set of scores has to the mean of the sample.

Standard Deviation (s)
• s is the most commonly used measure of variability
• Standard deviation is the average amount that each of the individual scores varies from the mean of a set of scores.
• Hence the larger the s, the more variability/dispersion in the set of scores
• If all scores in a set are identical, then there is no variability and s = 0.
• $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Where $x_i$ is the individual score, $\bar{x}$ is the mean of all the scores, and $n$ is the number of observations.
Example
• Set of scores: 15,20,21,20,36,15,25,15
• Computation of SD: First find the distance between each value and the mean.
• The mean is 167 / 8 = 20.875, so the differences from the mean are:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875
• Note: The values below the mean have negative discrepancies and values
above it have positive ones.
Example cont’d
• Next, we square each discrepancy:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625
• Take these "squares" and sum them to get the Sum of Squares (SS) value.
The sum is 350.875. We then divide this sum by the number of scores minus
1.
• Here, the result is 350.875 / 7 = 50.125. This value is known as the variance.
Example cont’d
• To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier).
• This would be SQRT(50.125) = 7.079901129253.
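The whole worked example can be reproduced step by step. The sketch below mirrors the slides (deviations, squares, sum of squares, division by n - 1, square root) and then cross-checks the result against `statistics.stdev` from the Python standard library, which also uses the n - 1 denominator. Variable names are illustrative.

```python
import math
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(scores)

# Distance of each score from the mean.
mean_score = sum(scores) / n
deviations = [x - mean_score for x in scores]

# Square each deviation and add them up: the Sum of Squares (SS).
sum_of_squares = sum(d ** 2 for d in deviations)

# Sample variance: SS divided by the number of scores minus 1.
variance = sum_of_squares / (n - 1)

# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

print(mean_score, sum_of_squares, variance, sd)

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(sd, statistics.stdev(scores))
```

Dividing by `n` instead of `n - 1` would give the population variance, which `statistics.pvariance` computes.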

Formula for Standard Deviation
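As a sketch, the sample standard deviation can be written out with the quantities from the worked example (sum of squares 350.875, n = 8), assuming the n - 1 form used there:

```latex
\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
  = \sqrt{\frac{350.875}{8 - 1}}
  = \sqrt{50.125}
  \approx 7.08
\]
```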

Variance & Standard deviation cont’d
• The variance and the standard deviation give us a numerical
measure of the scatter of a data set.
• These measures are useful for making comparisons between data sets that go beyond simple visual impressions.

Data Type v. Statistics Used
Data Type   Statistics Used
Nominal     Frequency, percentages, modes
Ordinal     Frequency, percentages, modes, median, range, percentile, ranking
Interval    Frequency, percentages, modes, median, range, percentile, ranking, average, variance, SD, t-tests, ANOVAs, Pearson's r, regression
Ratio       Frequency, percentages, modes, median, range, percentile, ranking, average, variance, SD, t-tests, ratios, ANOVAs, Pearson's r, regression