
Introduction to statistics
Arinaitwe Irene, PhD

04/12/2024
Statistics

A body of concepts and methods that deal with the collection, organization,
summarizing, presentation, analysis and interpretation of data, for the purpose
of assisting in drawing valid conclusions and making more effective decisions.

Statistics is divided into two broad branches:
• Descriptive [Basic] statistics
• Inferential [Inductive] statistics

Descriptive statistics

• Analysis of data that helps describe, show or summarize data in a meaningful way.
• Does not, however, allow us to draw conclusions beyond the data.
• A way to describe our data.
• Enables us to present the data in a more meaningful way, which
allows simpler interpretation of the data.

Example

For example, if we had the results of 100 pieces of students'
coursework, we may be interested in the overall performance of
those students. We would also be interested in the distribution or
spread of the marks.

Descriptive statistics allow us to do this.

Descriptive statistics

There are two general types of statistic that are used to describe data:
• Measures of central tendency
• Measures of spread

Scales

• Represents a composite measure of a variable
• Series of items arranged according to value for the purpose of
quantification
• Provides a range of values that correspond to different characteristics or
amounts of a characteristic exhibited in observing a concept
• Scales come in four different levels: Nominal, Ordinal, Interval, and Ratio

Nominal

• Nominal variables are variables that have two or
more categories, but which do not have an
intrinsic order.
• For example, a real estate agent could classify
their types of property into distinct categories
such as houses, condos, co-ops or bungalows.
• So "type of property" is a nominal variable with
4 categories called houses, condos, co-ops and
bungalows.
Ordinal
• Ordinal variables are variables that have
two or more categories, just like nominal
variables, except that the categories can also be
ordered or ranked.

Interval

• Interval variables are variables whose central characteristic is
that they can be measured along a range and have a numerical
value (for example, temperature measured in degrees Celsius or
Fahrenheit).
• Equal intervals represent equal differences, e.g., the difference between
20°C and 30°C is the same as the difference between 30°C and 40°C.

Ratio
• Class task

• Example ???

Ratio

• Ratio variables are interval variables, but with the added condition
that 0 (zero) of the measurement indicates that there is none of that
variable.
• Examples of ratio variables include height, mass, and distance.
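
The difference between the interval and ratio levels can be illustrated with temperature. The following is a minimal sketch, not part of the original slides: the two temperatures and the helper `celsius_to_kelvin` are illustrative assumptions. Celsius has an arbitrary zero (interval scale), while Kelvin has a true zero (ratio scale).

```python
# Two hypothetical temperatures in degrees Celsius (interval scale:
# 0 degrees Celsius does not mean "no temperature").
cool_c, warm_c = 20.0, 40.0


def celsius_to_kelvin(c: float) -> float:
    """Convert Celsius to Kelvin, a ratio scale with a true zero."""
    return c + 273.15


# Differences are meaningful on an interval scale and survive the unit change.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are only meaningful on a ratio scale.
naive_ratio = warm_c / cool_c  # 2.0, yet 40 C is not "twice as hot" as 20 C
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(diff_c, round(diff_k, 2))
print(naive_ratio, round(kelvin_ratio, 3))
```

The difference is the same in both units, but the ratio changes once the scale has a true zero, which is why statements such as "twice as much" are reserved for ratio variables like height, mass and distance.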

Measures of Central Tendency

• A measure of central tendency is a single value that attempts to describe
a set of data by identifying the central position within that set of data.
• Put another way, these are ways of describing the central position of a
frequency distribution for a group of data.
• We can describe this central position using several statistics, including
the mode, median, and mean.
Mean

• The mean (or average) is the most popular and well-known measure of
central tendency.
• It can be used with both discrete and continuous data, although its use
is most often with continuous data.
• An important property of the mean is that it includes every value in
your data set as part of the calculation.

When the mean is not appropriate (example 1)

The mean has one main disadvantage: it is
particularly susceptible to the influence of
outliers. These are values that are unusual
compared to the rest of the data set by
being especially small or large in numerical
value.

Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k

The mean salary for these ten staff is
$30.7k. However, inspecting the raw data
suggests that this mean value might not be
the best way to accurately reflect the typical
salary of a worker, as most workers have
salaries in the $12k to $18k range. The mean
is being skewed by the two large salaries.
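
The effect of the two large salaries can be checked with a short calculation. This is a minimal sketch using Python's standard `statistics` module; the variable names and the cut-off of 50 used to separate the outliers are illustrative assumptions, not part of the original slides.

```python
from statistics import mean, median

# Salaries in thousands of dollars, copied from the salary data above.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_all = mean(salaries_k)      # pulled upward by the two large salaries
median_all = median(salaries_k)  # middle of the ordered salaries

# Hypothetical comparison: the same staff without the two outliers.
typical = [s for s in salaries_k if s < 50]
mean_typical = mean(typical)

print(f"mean (all staff):        {mean_all}k")
print(f"median (all staff):      {median_all}k")
print(f"mean (without outliers): {mean_typical}k")
```

The mean of all ten salaries is 30.7k, while the median (15.5k) and the mean of the eight typical salaries (15.25k) both fall inside the $12k to $18k range where most workers sit.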

When the mean is not appropriate (example 2)

• The median is preferred over the mean (or mode) when the data is
skewed.
• Consider the normal distribution, as this is the most
frequently assessed in statistics: when the data is perfectly
normal, the mean, median and mode are identical.
• However, as the data becomes skewed, the mean loses its
ability to provide the best central location for the data
because the skewed values drag it away from the
typical value.
• The median best retains this position and is not as strongly
influenced by the skewed values, so it is more realistic
than the mean.

The Median

• The median is the number in the middle when the numbers in a
set of data are arranged in ascending or descending order.

• Computing the median when the number of scores in the set is odd
involves the following steps:
1. Order the scores in numerical order from lowest to highest
2. Count the number of scores
3. Select the middle score as the median

• If the number of scores in the set is even, then:
– First perform steps 1 and 2 above,
– Then take the mean of the 2 middle scores as the median


Median Example

• If the set of scores is {15, 20, 35, 45, 50, 56, 67}, then the
median is 45.

• If our set of scores is {15, 20, 21, 20, 36, 15, 25, 15}, then before
computing the median we need to order these values in
ascending order, getting {15, 15, 15, 20, 20, 21, 25, 36}.

• There are 8 scores, and scores #4 and #5 represent the
halfway point. Since both of these scores are 20, the
median is 20.

• If the two middle scores had different values, you would
interpolate between them, that is, take their mean, to determine the median.
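
The median procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `median_of` is hypothetical, and the third call uses made-up scores to show the case where the two middle scores differ.

```python
def median_of(scores):
    """Median by the steps above: order, count, then pick or average the middle."""
    ordered = sorted(scores)   # step 1: order from lowest to highest
    n = len(ordered)           # step 2: count the scores
    if n == 0:
        raise ValueError("the median of an empty set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]    # odd count: the single middle score
    # even count: the mean of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([15, 20, 35, 45, 50, 56, 67]))      # 45
print(median_of([15, 20, 21, 20, 36, 15, 25, 15]))  # 20.0
print(median_of([10, 20, 30, 40]))                  # 25.0 (middle scores differ)
```

Python's standard library offers `statistics.median`, which follows the same rule; the hand-written version is shown only to mirror the steps on the slide.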

The Mode

• The mode is the value that occurs most frequently in a
set of data.

• In the data set {15, 15, 15, 20, 20, 21, 25, 36}, the value
15 occurs three times and is the mode.

• {14, 25, 23, 67, 25, 78, 65, 45, 25, 18, 20, 89,
25, 90}

• In the above set of scores, the mode is 25
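
Finding the mode amounts to counting how often each value occurs. A minimal sketch using `collections.Counter` from the Python standard library follows; the function name `modes_of` is a hypothetical helper, and returning a list is a design choice made here so that ties (more than one most frequent value) remain visible.

```python
from collections import Counter


def modes_of(scores):
    """Return (most frequent values, their count); several values mean a tie."""
    counts = Counter(scores)    # value -> number of occurrences
    top = max(counts.values())  # the highest frequency in the data
    return sorted(v for v, c in counts.items() if c == top), top


print(modes_of([15, 15, 15, 20, 20, 21, 25, 36]))
print(modes_of([14, 25, 23, 67, 25, 78, 65, 45, 25, 18, 20, 89, 25, 90]))
```

For the two data sets on the slide this reports 15 (three occurrences) and 25 (four occurrences). An empty list raises `ValueError`, since a mode is not defined without data.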

Skewed Distributions and the Mean and Median

• We often test whether our data is
normally distributed because this is a
common assumption underlying many
statistical tests. An example of a normally
distributed set of data is presented.

• When you have a normally distributed
sample you can legitimately use either
the mean or the median as your
measure of central tendency.

Skewed data

• When our data is skewed, as with a
right-skewed data set, we find that the
mean is being dragged in the direction of
the skew.

• In these situations,
the median is generally considered to be the
best representative of the central location of
the data. The more skewed the distribution,
the greater the difference between the
median and mean, and the greater emphasis
should be placed on using the median as
opposed to the mean.
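
The way the mean drifts away from the median as skew increases can be seen on a small made-up data set. This is a minimal sketch: all three data sets and the helper `gap` are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical, symmetric scores: mean and median coincide.
base = [10, 11, 12, 12, 13, 13, 13, 14, 14, 15, 16]

# The same scores with a progressively longer right tail.
mild_skew = base + [30]
strong_skew = base + [30, 60, 90]


def gap(data):
    """Mean minus median; positive when the mean is dragged to the right."""
    return mean(data) - median(data)


for name, data in [("symmetric", base), ("mild", mild_skew), ("strong", strong_skew)]:
    print(f"{name:9s} mean={mean(data):.2f} median={median(data):.2f} gap={gap(data):.2f}")
```

The median barely moves as large values are added, while the mean follows the tail, so the gap between the two grows with the skew.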

When to use the mean, median and mode

Type of Variable               Best measure of central tendency

Nominal                        Mode

Ordinal                        Median

Interval/Ratio (not skewed)    Mean

Interval/Ratio (skewed)        Median

Measures of spread

• These are ways of summarizing a group of data by describing how spread out the
scores are.
• For example, the mean score of our 100 students may be 65 out of 100.
However, not all students will have scored 65 marks. Rather, their scores will
be spread out. Some will be lower and others higher.
• Measures of spread include the range, quartiles, absolute
deviation, variance and standard deviation.

Range

• The range: The highest value minus the lowest value.

• In our example data set {15,15,15,20,20,21,25,36}, the high value is 36 and the
low is 15.

• So the range is 36 - 15 = 21
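
The range calculation is a one-liner in Python. In this minimal sketch the second data set is a hypothetical one, added only to show that two very different sets of scores can share the same range.

```python
# The example data set from the slide.
scores = [15, 15, 15, 20, 20, 21, 25, 36]
value_range = max(scores) - min(scores)  # highest value minus lowest value

# Hypothetical scores with the same extremes but most values near the top.
other_scores = [15, 35, 35, 35, 36, 36, 36, 36]
other_range = max(other_scores) - min(other_scores)

print(value_range, other_range)  # 21 21
```

Both sets have a range of 21 even though their scores are distributed very differently, because only the maximum and minimum enter the calculation.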

04/12/2024 BIT 1205 23


Range

• However, the range only provides information about the maximum and minimum values and does not say
anything about the values in between.

• A commonly used measure of dispersion is the standard deviation, which is simply the square root of the
variance.

• The variance is defined as the sum of the squared distances of each term in the distribution from the mean,
divided by the number of terms in the distribution (for a sample, as in the worked example that follows, the divisor is the number of terms minus 1).

• Squaring the difference makes each term positive so that values above the mean do not cancel values below the
mean.
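
Two divisors are in common use for the variance: the number of terms n (population variance) and n - 1 (sample variance). A minimal sketch using Python's standard `statistics` module shows both on the example scores used in this deck; the variable names are illustrative.

```python
from statistics import pvariance, variance

scores = [15, 20, 21, 20, 36, 15, 25, 15]

population_var = pvariance(scores)  # sum of squared deviations divided by n
sample_var = variance(scores)       # sum of squared deviations divided by n - 1

print(population_var)  # 43.859375
print(sample_var)      # 50.125
```

Dividing by n - 1 always gives the larger of the two values, and the difference shrinks as the number of scores grows.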



Standard Deviation

• Is a more accurate and detailed estimate of dispersion than the range,
because an outlier can greatly exaggerate the range.

• In our example data set, {15, 15, 15, 20, 20, 21, 25, 36}, the single
outlier value of 36 stands apart from the rest of the values.

• The standard deviation shows the relation that the set of scores has
to the mean of the sample.



Standard Deviation (s)

• s is the most commonly used measure of variability
• Standard deviation is the average amount that each of the
individual scores varies from the mean of a set of scores.
• Hence the larger the s, the more variability/dispersion in the
set of scores
• If all scores in a set are identical, then there is no variability and
s = 0.
• $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Where $x_i$ is the individual score, $\bar{x}$ is the mean of all the scores,
and $n$ is the number of observations.
Example

• Set of scores: 15,20,21,20,36,15,25,15

• Computation of SD: First find the distance between each value and the mean.

• The mean is 167 / 8 = 20.875. So, the differences from the mean are:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

• Note: The values below the mean have negative discrepancies and values
above it have positive ones.
Example cont’d
• Next, we square each discrepancy:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625

• Take these "squares" and sum them to get the Sum of Squares (SS) value.
The sum is 350.875. We then divide this sum by the number of scores minus
1.

• Here, the result is 350.875 / 7 = 50.125. This value is known as the variance.
Example cont’d

• To get the standard deviation, we take the square root of the
variance (remember that we squared the deviations earlier).

• This would be SQRT(50.125) = 7.079901129253.
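
The whole worked example can be reproduced step by step. This is a minimal sketch in plain Python; the variable names are illustrative, and the divisor n - 1 follows the worked example.

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(scores)

mean = sum(scores) / n                   # 20.875
deviations = [x - mean for x in scores]  # distance of each score from the mean
squared = [d * d for d in deviations]    # squaring removes the signs
sum_of_squares = sum(squared)            # 350.875
variance = sum_of_squares / (n - 1)      # 350.875 / 7 = 50.125
std_dev = math.sqrt(variance)            # square root of the variance

print(mean, sum_of_squares, variance)
print(round(std_dev, 4))                 # 7.0799
```

The standard library call `statistics.stdev(scores)` uses the same n - 1 divisor and can be used to cross-check the hand calculation.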



Formula for Standard Deviation



Variance & Standard deviation cont’d
• The variance and the standard deviation give us a numerical
measure of the scatter of a data set.

• These measures are useful for making comparisons between data
sets that go beyond simple visual impressions.



Data Type v. Statistics Used

Data Type    Statistics Used

Nominal      Frequency, percentages, modes

Ordinal      Frequency, percentages, modes, median, range, percentile, ranking

Interval     Frequency, percentages, modes, median, range, percentile, ranking, average,
             variance, SD, t-tests, ANOVAs, Pearson Rs, regression

Ratio        Frequency, percentages, modes, median, range, percentile, ranking, average,
             variance, SD, t-tests, ratios, ANOVAs, Pearson Rs, regression
