Data Mining

By:
Muhammad Haleem
Statistical Descriptions of Data
Measuring the Central Tendency
Motivation
Central tendency characteristics
To better understand what a typical value of the data looks like:
measures of central tendency summarize the data with a single
representative value, which gives an idea of where our data are centered.
Mean, median and mode are examples of measures of central
tendency.
Data dispersion characteristics
To better understand the variation in our data:
measures of data dispersion show how far our data items fall
from the central value.
Examples include the max, min, range, variance, standard deviation
and the detection of outliers.
Measuring the Central Tendency
Suppose that we have some attribute X, like salary, which
has been recorded for a set of objects.
Let $x_1, x_2, \ldots, x_N$ be the set of N observed values or
observations for X.
These values may also be referred to as the data set
(for X).
Measures of central tendency indicate where the middle or
center of the data lies. They include the
midrange, mean, median and mode.
Range and Midrange

Range and measures of central tendency
(mean, median and mode) are values that
summarize a set of data. They are useful
when analyzing data.
Range -
the difference between the greatest
and the smallest values in a data set
Daily High Temperatures (for any given date) Over the Last Decade
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47

To find the range of the daily high temperatures,
subtract the least value from the greatest value.
The range gives the largest difference between any
two values in a series.

59° - 13° = 46°
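The range calculation can be sketched in a few lines of Python. This is only an illustration using the temperature values from the table; the variable names are chosen for this example.

```python
# Daily high temperatures from the table, 1993-2002.
temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

greatest = max(temps)
least = min(temps)

# Range = greatest value minus least value.
data_range = greatest - least

print(f"{greatest} - {least} = {data_range}")  # 59 - 13 = 46
```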


Midrange
The midrange can also be used to
assess the central tendency of a
numeric data set. It is the average of
the largest and smallest values in the
set.

$$\text{Midrange} = \frac{\max + \min}{2}$$
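As a small illustration, the midrange of the temperature data used in these slides can be computed as follows. The slides do not work this value out themselves, so treat it as an added example rather than part of the original material.

```python
# Daily high temperatures from the slides, 1993-2002.
temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

# Midrange = average of the largest and smallest values.
midrange = (max(temps) + min(temps)) / 2

print(midrange)  # 36.0
```

Note how the single low reading of 13° pulls the midrange well below most of the data, since the midrange depends only on the two extreme values.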
Mean or Arithmetic Mean
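The mean of set X is

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

Sometimes, each value $x_i$ in a set may be associated
with a weight $w_i$. The weights reflect the significance
of the values they are attached to. The result is called the weighted
mean. In this case, we can compute it as:

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$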
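A minimal sketch of the weighted mean, assuming the usual definition (the sum of each value times its weight, divided by the sum of the weights). The function name `weighted_mean` and the sample scores and weights are hypothetical and serve only as illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: three scores, the second counting twice as much.
scores = [80, 90, 70]
weights = [1, 2, 1]

print(weighted_mean(scores, weights))    # 82.5
print(weighted_mean(scores, [1, 1, 1]))  # 80.0, same as the plain mean
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean.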
Mean -
(or average) the sum of a set of data
divided by the number of data values
Daily High Temperatures (for any given date) Over the Last Decade
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47

To find the mean, find the sum of the data
and divide it by the number of data values.
59+50+49+13+40+46+50+53+58+47=465
465÷10=46.5
The mean daily high temperature over the last
decade is 46.5°, or approximately 47°.
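The same calculation can be sketched in Python, both by hand and with the standard library's `statistics.mean`. The variable names are illustrative.

```python
from statistics import mean

# Daily high temperatures from the table, 1993-2002.
temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

total = sum(temps)           # sum of the data
count = len(temps)           # number of data values
average = total / count      # mean = sum / count

print(total, count, average)  # 465 10 46.5

# The standard library gives the same result.
assert average == mean(temps)
```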
Median:
For skewed data, a better measure of the
center of data is the median, which is the
middle value in a set of ordered data values.
It is the value that separates the higher half
of a data set from the lower half.
If N is odd, then the median is the middle
value of the ordered set. If N is even, then the
median is taken as the average of the two
middlemost values.
Median -
the middle value of an ordered data set
Daily High Temperatures (for any given date) Over the Last Decade
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47

To find the median, place all the data in
numerical order, then find the middle number. If
there are two middle numbers, find the mean (or
average) of the two middle numbers.
13 40 46 47 49 50 50 53 58 59
49+50=99
99÷2=49.5
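The procedure above (sort, then take the middle value or the average of the two middle values) can be sketched as follows. The helper name `median_of` is made up for this example; Python's `statistics.median` follows the same rule and is used here as a cross-check.

```python
from statistics import median


def median_of(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # N odd: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # N even: mean of the two middle values


temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

print(sorted(temps))      # [13, 40, 46, 47, 49, 50, 50, 53, 58, 59]
print(median_of(temps))   # 49.5
assert median_of(temps) == median(temps)
```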
Mode -
the most common value in a data
set
Daily High Temperatures (for any given date) Over the Last Decade
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47

To find the mode, find the most common value.
It helps to place data in numerical order to find
the mode.
13 40 46 47 49 50 50 53 58 59
Here the mode is 50°, which appears twice.
If no value appears more often than any other,
then there is no mode.
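A minimal sketch of finding the mode by counting occurrences. The helper name `modes` is hypothetical. It follows the convention stated in these slides, returning an empty list when no value appears more often than another, and it returns several values when there is a tie for most common.

```python
from collections import Counter


def modes(values):
    """Return the most common value(s), or [] if all values are equally common."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Slide convention: if every distinct value occurs equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

print(modes(temps))            # [50]
print(modes([1, 2, 3]))        # []  (no mode)
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (two modes)
```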
Outliers
Sometimes there are extreme values that are
separated from the rest of the data. These
extreme values are called outliers. Outliers affect
the mean.
Daily High Temperatures (for any given date) Over the Last Decade

1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
59 50 49 13 40 46 50 53 58 47

The daily high temperature in 1996 is the outlier.


Mean
59+50+49+13+40+46+50+53+58+47=465
465÷10=46.5
Because outliers can affect the mean, the median
may be a better measure of central tendency. You
might consider the median to best represent the
expected temperature.

Median
13 40 46 47 49 50 50 53 58 59
49+50=99
99÷2=49.5
Sometimes the mode is more helpful when
analyzing data. If you were trying to determine
what clothes to wear for a day trip, you might
base your decision on the mode temperature
because the mode temperature is the
temperature which occurred most often.
13° 40° 46° 47° 49° 50° 50° 53° 58° 59°
Dropping the outlier may help when determining
the mean.

59+50+49+13+40+46+50+53+58+47=465
465÷10=46.5°

40+46+47+49+50+50+53+58+59=452
452÷9=50.2°

When the 13° outlier is dropped, the mean
daily high temperature increases by about 3.7° to
50.2°, which is closer to both the median of
49.5° and the mode of 50°.
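The effect of the outlier can be checked with a short sketch that compares the mean and median with and without the 13° reading. The variable names are illustrative, and the median of the trimmed data is an added comparison that the slides do not compute.

```python
from statistics import mean, median

temps = [59, 50, 49, 13, 40, 46, 50, 53, 58, 47]

# Drop the 13-degree outlier.
trimmed = [t for t in temps if t != 13]

mean_all = mean(temps)
mean_trimmed = round(mean(trimmed), 1)
shift = round(mean(trimmed) - mean(temps), 1)

print(mean_all)         # 46.5
print(mean_trimmed)     # 50.2
print(shift)            # 3.7

# The median moves far less than the mean when the outlier is removed.
print(median(temps))    # 49.5
print(median(trimmed))  # 50
```

The mean shifts by several degrees while the median barely changes, which is why the median is often preferred when outliers are present.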
You Try It!
Amna’s scores in various subjects for first year
are 93, 79, 88, 77, 92, 88, 80, 34, 84, 88.
Calculate the range, mean, median, and mode.
Then make and explain a prediction for next
year’s scores.
Range: 59 Mean: 80.3
Median: 86 Mode: 88
Predictions will vary. For example: Amna will score an
estimated average of 85 on her tests.
This was determined by removing the outlying score
of 34 and recalculating the mean.
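The answers to this exercise can be verified with Python's `statistics` module. This is only a checking sketch; the variable names are chosen for illustration.

```python
from statistics import mean, median, mode

scores = [93, 79, 88, 77, 92, 88, 80, 34, 84, 88]

score_range = max(scores) - min(scores)
score_mean = mean(scores)
score_median = median(scores)
score_mode = mode(scores)

print(score_range, score_mean, score_median, score_mode)  # 59 80.3 86.0 88

# Recalculate the mean after removing the outlying score of 34.
without_outlier = [s for s in scores if s != 34]
predicted = round(mean(without_outlier), 1)
print(predicted)  # 85.4, roughly the 85 used in the prediction
```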
When the mean, median and mode of a data set are
close to each other, the data set is roughly symmetric,
as a normally distributed data set would be.

Conversely, in a skewed data set the mean, median
and mode lie farther apart from each other.
