Measuring Central Tendency in Data Mining

Muchake Brian
Phone: 0701178573
Email: bmuchake@gmail.com, bmuchake@cis.mak.ac.ug
Do not Keep Company With Worthless People

Psalms 26:11
Introduction to Measure of Central Tendency
 A measure of central tendency is a single value that describes the way in which a
group of data clusters around a central value. It is a way to describe the center of a
data set.
 A measure of central tendency is a summary statistic that represents the center point
or typical value of a dataset.
 A measure of central tendency can be calculated for either a finite set of values or for a
theoretical distribution, such as the normal distribution.
 In summary, a measure of central tendency (also referred to as a measure of center
or central location) is a summary measure that attempts to describe a whole set of
data with a single value that represents the middle or center of its distribution.
Measures of Central Tendency
 There are three main measures of central tendency: the mode, the median and the
mean. Each of these measures gives a different indication of the typical or
central value in the distribution.
Measures of Central Tendency-Mean [Cont’d]
 The mean is the most common and most effective numerical measure of the “center” of a
set of data. At times it is referred to as the arithmetic mean, and it can be computed for a
sample or for a whole population.
Arithmetic Mean
 Arithmetic mean refers to the average of a set of numerical values, as calculated by
adding them together and dividing by the number of terms in the set.
 The arithmetic mean is the simplest and most widely used measure of a mean, or
average. It simply involves taking the sum of a group of numbers, then dividing that
sum by the count of the numbers used in the series.
 For example, take 34, 44, 56 and 78. The sum is 212. The arithmetic mean is 212
divided by four, or 53.
 The formula for the arithmetic mean of N observations x_1, x_2, ..., x_N is:
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
 Here the symbol ∑ (sigma) stands for summation. The i = 1 below the sigma gives the starting
index, and the N above the sigma gives the index of the last value. The divisor N is the
number of elements in the set.
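The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the original slides. It reuses the numbers from the example above.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Numbers from the worked example: the sum is 212 and there are 4 values.
print(arithmetic_mean([34, 44, 56, 78]))  # 53.0
```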
Weighted Mean
 The weighted mean, also called the weighted average, is a mean where some values contribute more than others.
 The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of
average), except that instead of each of the data points contributing equally to the final average, some
data points contribute more than others.
 The weighted mean is a type of mean that is calculated by multiplying the weight (or probability)
associated with a particular event or outcome with its associated quantitative outcome and then summing
all the products together.
 It is very useful when calculating a theoretically expected outcome where each outcome shows a different
probability of occurring, which is the key feature that distinguishes the weighted mean from the arithmetic
mean.
 It is important to note that when the weights are probabilities, the outcomes must be
mutually exclusive (i.e., no two outcomes can occur at the same time) and the
probabilities must add up to 100%. Weights that do not add up to 100% are normalized by
dividing by their sum, as in the formula below.
 Sometimes each value in a set is associated with a weight. The weights reflect the
significance, importance, or occurrence frequency attached to their respective values.
 The weighted mean is calculated by multiplying each value by its weight, summing the
products, and dividing by the sum of the weights:
\[ \bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} \]
 Where
• ∑ denotes the sum
• w_i is the weight of the i-th value and
• x_i is the i-th value
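As a rough illustration, the weighted mean can be sketched in Python as below. The function name `weighted_mean` and the scores and weights are made up for this example; dividing by the sum of the weights means the weights do not have to add up to 1 beforehand.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical data: three scores, the last one counting twice as much.
scores = [70, 80, 90]
weights = [1, 1, 2]
print(weighted_mean(scores, weights))  # 82.5
```

With equal weights the result reduces to the ordinary arithmetic mean.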
Trimmed Mean
 A truncated mean or trimmed mean is a statistical measure of central tendency, much like the mean
and median. It involves the calculation of the mean after discarding given parts of a probability
distribution or sample at the high and low end, and typically discarding an equal amount of both.
 The number of points to be discarded is usually given as a percentage of the total number of points,
but may also be given as a fixed number of points.
 A trimmed mean (similar to an adjusted mean) is a method of averaging that removes
a small designated percentage of the largest and smallest values before calculating
the mean.
 After removing the specified outlier observations, the trimmed mean is found using a
standard arithmetic averaging formula. The use of a trimmed mean helps eliminate the
influence of outliers or data points on the tails that may unfairly affect the traditional
mean.
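 The trimmed mean can be calculated using the following formula:
\[ \bar{x}_{trim} = \frac{1}{N - 2g}\sum_{i=g+1}^{N-g} x_{(i)} \]
 Where N is the sample size, g is the number of values trimmed from each end, and
x_(1) ≤ x_(2) ≤ ... ≤ x_(N) are the values sorted in ascending order.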
 The trimmed mean is the mean obtained after cutting off values at the high and low
extremes.
 For example, we can sort the values and remove the top and bottom 2% before
computing the mean.
 We should avoid trimming too large a portion (such as 20%) at both ends as this can
result in the loss of valuable information.
Example
 Find the 20% trimmed mean for the number set {8, 3, 7, 1, 3, 9}.
 First calculate the trimmed count (g), where g is the number of values to be trimmed
from each end of the sorted set.
 g = Floor(Trimmed Mean Percent × Sample Size) = Floor(0.2 × 6) = Floor(1.2), so the
trimmed count g = 1.
 Write the given set of numbers {8, 3, 7, 1, 3, 9} in ascending order: 1, 3, 3, 7, 8, 9.
 As the trimmed count is 1, we remove one number from the beginning and one from the
end. Removing the first number (1) and the last number (9) leaves 3, 3, 7, 8. The
trimmed mean can now be computed as:
\[ \frac{3 + 3 + 7 + 8}{4} = \frac{21}{4} = 5.25 \]
• The Trimmed Mean of the given numbers is 5.25.
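The worked example can be reproduced with a short Python sketch. The function name `trimmed_mean` is an assumption made for this illustration; it follows the same steps as the example (compute g with floor, sort, drop g values from each end, average the rest).

```python
import math


def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) values from each end."""
    n = len(values)
    g = math.floor(proportion * n)  # trimmed count per end
    if 2 * g >= n:
        raise ValueError("trimming would remove every value")
    kept = sorted(values)[g:n - g]
    return sum(kept) / len(kept)


# Data from the example: g = floor(0.2 * 6) = 1, so 1 and 9 are dropped.
print(trimmed_mean([8, 3, 7, 1, 3, 9], 0.2))  # 5.25
```

A proportion of 0 removes nothing and gives the ordinary arithmetic mean.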

 Note:
 There are other types of means that can be used in various branches of mathematics. These include the
harmonic mean, geometric mean, arithmetic-geometric mean, root-mean-square mean, and
Heronian mean.
 In your free time, investigate each of them and their applications.
Limitations of the mean
 The mean cannot be calculated for categorical data, as the values cannot be summed.
 As the mean includes every value in the distribution the mean is influenced by outliers
and skewed distributions.
 Even a small number of extreme values can corrupt the mean.
 The mean is less reliable for small data sets, where a single extreme value carries more weight.
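The effect of one extreme value can be seen in a small Python sketch using the standard `statistics` module. The salary figures are invented for illustration. The median (covered in the next section) is printed alongside the mean for comparison.

```python
import statistics

# Hypothetical salaries in thousands; the second list adds one extreme value.
salaries = [30, 32, 35, 36, 38]
with_outlier = salaries + [500]

print(statistics.mean(salaries))        # 34.2
print(statistics.mean(with_outlier))    # roughly 111.83
print(statistics.median(salaries))      # 35
print(statistics.median(with_outlier))  # 35.5
```

One added value more than triples the mean, while the median moves by only half a unit.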
Measures of Central Tendency-Median
 The median is a positional average and refers to the middle value in a distribution.
 It divides the series into two halves: the items are first arranged in ascending or descending
order of magnitude, and the middle value is then located. It is denoted by the symbol M.
 The median is the middle number in a sorted, ascending or descending, list of numbers and
can be more descriptive of that data set than the average.
 The median is the value separating the higher half from the lower half of a data sample, a
population, or a probability distribution.
 For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3,
3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the
sample.
Measures of Central Tendency-Median [Cont’d]
 For a continuous probability distribution, the median is the value such that a number is
equally likely to fall above or below it.
 The median is less affected by outliers and skewed data than the mean, and is usually
the preferred measure of central tendency when the distribution is skewed (asymmetric).
 Finding the median in sets of data with an odd or even number of values: suppose that
a given data set of N values is sorted in numerical order. The median is the middle
value if N is odd, or the average of the two middle values otherwise.
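The odd and even cases can be sketched in Python as follows. The function name `median` is chosen for this illustration (Python's standard library offers `statistics.median` for the same purpose). The first call uses the data set from the earlier example; the second drops the largest value to show the even case.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of two


print(median([1, 3, 3, 6, 7, 8, 9]))  # 6
print(median([1, 3, 3, 6, 7, 8]))     # 4.5
```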
Limitation
 The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
Measures of Central Tendency-Mode
 The mode refers to the most frequently occurring value in the data set. In other words, the modal value
has the highest frequency associated with it. It is denoted by the symbol Mo or Mode.
 For example, if the most commonly occurring value in a distribution of ages is 54, the mode of that distribution is 54 years.
 The mode has an advantage over the median and the mean as it can be found for both numerical and
categorical (non-numerical) data.
Measures of Central Tendency-Mode [Cont’d]
 Mode is the most frequently occurring data element in the data set.
 Note that a data set may not necessarily have a mode. If each data element
occurs only once, there is no mode.
 On the other hand, if two or three data elements occur the same number of times,
and more often than any other element, there are two or three modes.
 Unlike the median and mean, the mode is about the frequency of occurrence. There
can be more than one mode or no mode at all; it all depends on the data set itself.
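The cases described above (one mode, several modes, no mode) can be sketched in Python with `collections.Counter`. The function name `modes` and the sample data are assumptions made for this illustration. Following the convention stated above, the function reports no mode when every value occurs exactly once.

```python
from collections import Counter


def modes(values):
    """Return every value that ties for the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode, per the convention above
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 2, 3]))             # [2]
print(modes([1, 1, 2, 2, 3]))          # [1, 2]
print(modes([1, 2, 3]))                # []
print(modes(["red", "blue", "red"]))   # ['red']
```

The last call shows that the mode also works for categorical (non-numerical) data.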
Unimodal
 In mathematics, unimodality means possessing a unique mode. More generally,
unimodality means there is only a single highest value, somehow defined, of some
mathematical object.
 A unimodal distribution is a distribution with one clear peak or most frequent value.
The values increase at first, rising to a single peak, and then decrease.
Bimodal
 A data set is bimodal if it has two modes. This means that there is not a single data
value that occurs with the highest frequency. Instead, there are two data values that tie
for having the highest frequency.
 Picture a distribution as a mountain: the peak marks the mode, the most frequent value
you will see in a random sample. Now imagine two mountains next to each other. This is a
bimodal distribution: it has two modes, the two most frequently observed values in a
sample. A commonly cited example is the height of human beings.
Multimodal
 A multimodal distribution is a probability distribution with two or more
modes.
 The term "multimodal" is also used in a different sense in Artificial Intelligence
research. There, modality refers to the way in which something happens or is experienced,
and a research problem is characterized as multimodal when it includes multiple such
modalities. In order for Artificial Intelligence to make progress in understanding the
world around us, it needs to be able to interpret such multimodal signals together.
 For example, images are usually associated with tags and text explanations, and texts
contain images to express the main idea of an article more clearly. Different
modalities are characterized by very different statistical properties.
Measures of Central Tendency-Mid Range
 In statistics, the mid-range or mid-extreme of a set of statistical data values is the arithmetic mean of
the maximum and minimum values in a data set, defined as:
\[ M = \frac{\max x + \min x}{2} \]
 The mid-range is the midpoint of the range; as such, it is a measure of central tendency.
 The mid-range is rarely used in practical statistical analysis, as it lacks efficiency as an estimator for
most distributions of interest, because it ignores all intermediate points, and lacks robustness, as
outliers change it significantly. Indeed, it is one of the least efficient and least robust statistics.
 However, it finds some use in special cases: it is the maximally efficient estimator for the center of a
uniform distribution, trimmed mid-ranges address robustness, and as an L-estimator, it is simple to
understand and compute.

Measures of Central Tendency-Mid Range [Cont’d]
• It is also useful to know what number is mid-way between the least value and the
greatest value of the data set. This number is called the midrange. To find the
midrange, add together the least and greatest values and divide by two, or in other
words, find the mean of the least and greatest values.
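A minimal Python sketch of the midrange follows. The function name `midrange` and the sample data are chosen here for illustration. The second call replaces the largest value with an extreme one to show how strongly a single outlier moves the result.

```python
def midrange(values):
    """Mean of the smallest and largest values."""
    if not values:
        raise ValueError("the midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2


print(midrange([1, 3, 3, 6, 7, 8, 9]))    # 5.0
print(midrange([1, 3, 3, 6, 7, 8, 900]))  # 450.5
```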
Measures of Central Tendency-Problem Solved
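Suppose that the values for a given set of data are grouped into intervals. The intervals
and corresponding frequencies are as follows:
[Table not reproduced: six age intervals (in years) with frequencies 200, 450, 300, 1500, 700 and 44, in that order; the fourth interval is 20-50.]
Calculate the approximate median value for the data.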

Measures of Central Tendency-Problem Solved [Cont’d]
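The median of grouped data is approximated with the formula:
\[ \text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]
where L_1 is the lower boundary of the median interval, N is the total number of values,
(∑ freq)_l is the sum of the frequencies of all intervals below the median interval,
freq_median is the frequency of the median interval, and width is the width of the median interval.

• First calculate N, which is the sum of the frequency values. In our example,
N = 200+450+300+1500+700+44 = 3194.
• Next locate the median interval, the interval that contains the (N/2)-th value. Here
N/2 = 1597, and the cumulative frequencies 200, 650, 950 and 2450 first reach 1597 in the
20-50 interval. This means the lower boundary is L_1 = 20 and the upper boundary is 50.
• Next is the sum of the frequencies below the median interval. Based on our
data set, where the median interval is 20-50, this is 200+450+300 = 950.
• Next is the frequency of the median interval, which in our case is 1500.
• Next is the width. Here we subtract the lower boundary from the upper
boundary (upper boundary - lower boundary): 50 - 20 = 30.
• Now substitute the values in the formula:
\[ \text{median} = 20 + \left( \frac{1597 - 950}{1500} \right) \times 30 \]
• Apply BODMAS: 647 / 1500 ≈ 0.4313, then 0.4313 × 30 ≈ 12.94, giving a median of
approximately 20 + 12.94 = 32.94.
• The answer is given in years, the unit of the interval column for age.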
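The substitution steps above can be sketched in Python. The function name `grouped_median` and its parameter names are assumptions made for this illustration. The values plugged in are the ones stated in the worked example: lower boundary 20, upper boundary 50, frequencies below the median interval summing to 950, and a median-interval frequency of 1500.

```python
def grouped_median(lower_bound, width, freq_below, freq_median, total):
    """Approximate median of grouped data by interpolating in the median interval."""
    return lower_bound + ((total / 2 - freq_below) / freq_median) * width


frequencies = [200, 450, 300, 1500, 700, 44]
N = sum(frequencies)                    # 3194
freq_below = 200 + 450 + 300            # 950, intervals below 20-50
width = 50 - 20                         # upper boundary minus lower boundary

result = grouped_median(20, width, freq_below, 1500, N)
print(round(result, 2))  # 32.94
```

The result is an approximation: it assumes the values are spread evenly within the median interval.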
