Uploaded by Mandar Gadre

Where do the Data Summary Metrics like mean, median, and mode come from?

© All Rights Reserved


Summarizing the Data with a Metric

Discrepancy and Error

Estimating the Summary Metrics: Minimizing the Error

Arithmetic Mean, Median, and Mode

Geometric Mean, Harmonic Mean and Mid-Range

Breakdown Points of the Arithmetic Mean and the Median

-- Mandar Gadre

(July 2013)

A Sample Dataset

e.g. the median age of all Indian citizens.

e.g. estimated median, calculated from the ages of, say 1 million citizens.

If every data-point has the same value of the property, there is only one value, no need to summarize!

For non-identical values of the property (almost everywhere in the real world), we look for a Summary Statistic such as a mean.

But how do we choose the summary statistic, S?

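Given the data-points $x_i$, $i = 1$ to $N$, and a summary statistic $S$, we measure the discrepancy of each data-point from this Summary Statistic.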

We take s as the candidate summary statistic.

You could define your own way of calculating discrepancy!

Three common ways

1. Comparison

Is the individual reading the same as the candidate?

$e_i = 1$ if $x_i \neq s$

$e_i = 0$ if $x_i = s$

2. Absolute difference from the candidate

$e_i = |x_i - s|$

3. Square of the difference from the candidate

$e_i = (x_i - s)^2$

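Discrepancy measures how far the individual data-points $x_i$ are from the Candidate Summary Statistic $s$.

The total error $E$, the sum of the discrepancies $e_i$ over all data-points, for the three types of discrepancies would be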

1. Comparison with the Candidate

$E = \sum_i e_i$, where $e_i = 1$ if $x_i \neq s$ and $e_i = 0$ if $x_i = s$

2. Absolute difference from the Candidate

$E = \sum_i |x_i - s|$

3. Square of the difference from the Candidate

$E = \sum_i (x_i - s)^2$
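The three error definitions can be written down directly. The following is a minimal sketch; the function names and the sample readings are illustrative assumptions, not part of the original slides.

```python
def comparison_error(data, s):
    """Error type 1: count the data-points that differ from the candidate s."""
    return sum(1 for x in data if x != s)


def absolute_error(data, s):
    """Error type 2: sum of absolute differences from the candidate s."""
    return sum(abs(x - s) for x in data)


def squared_error(data, s):
    """Error type 3: sum of squared differences from the candidate s."""
    return sum((x - s) ** 2 for x in data)


# Hypothetical readings, chosen only for illustration.
readings = [2, 3, 3, 5, 12]
candidate = 3

print(comparison_error(readings, candidate))  # 3  (2, 5 and 12 differ from 3)
print(absolute_error(readings, candidate))    # 12 (1 + 0 + 0 + 2 + 9)
print(squared_error(readings, candidate))     # 86 (1 + 0 + 0 + 4 + 81)
```

Note how the single far-away reading (12) contributes 9 to the absolute error but 81 to the squared error.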

The Summary Statistic $S$ is the value of the candidate $s$ for which the Error $E$ is minimized.

We have given special names for the three types of Summary Statistics

arising from the three types of discrepancies:

S, such that the error $E = \sum_i e_i$ (where $e_i = 1$ if $x_i \neq s$; $e_i = 0$ if $x_i = s$) is minimized.

It turns out that the value of s that occurs the most frequently will

minimize this error. There may be one or more such values.

We call this the Mode.

e.g. if we want to sell single-size men's t-shirts, we can get them made with the size equal to the mode.
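As a rough check of this claim, the sketch below searches over the observed values for the candidates with the smallest comparison error and compares the result with Python's `statistics.multimode`. The t-shirt sizes and the helper name `minimizers_of_comparison_error` are made up for illustration.

```python
from statistics import multimode


def comparison_error(data, s):
    """Count the data-points that differ from the candidate s."""
    return sum(1 for x in data if x != s)


def minimizers_of_comparison_error(data):
    """Return every observed value that attains the smallest comparison error."""
    candidates = sorted(set(data))
    errors = {s: comparison_error(data, s) for s in candidates}
    smallest = min(errors.values())
    return [s for s in candidates if errors[s] == smallest]


# Hypothetical t-shirt sizes of seven customers.
sizes = ["M", "L", "M", "S", "L", "M", "XL"]

print(minimizers_of_comparison_error(sizes))  # ['M']
print(multimode(sizes))                       # ['M']

# With a tie, more than one value minimizes the error.
print(minimizers_of_comparison_error([1, 1, 2, 2, 3]))  # [1, 2]
```

The search is restricted to observed values because any unobserved candidate differs from every data-point and so has the largest possible error, $N$.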

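Calculating S, such that the error $E = \sum_i |x_i - s|$ is minimized.

This is similar to the absolute value function and the derivative is the signum function!

Making the derivative zero requires $\sum_i \operatorname{sgn}(x_i - s) = 0$, i.e. as many data-points above $s$ as below it, which happens at the middle reading when all the data-points are arranged in increasing order.

We call this the Median.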
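A small numerical check of this argument, as a sketch only: the income figures are invented and the helper names are assumptions. It evaluates the absolute error at each observed value and confirms that the minimizer is the middle reading, where the signs of the differences balance out.

```python
from statistics import median


def absolute_error(data, s):
    """Sum of absolute differences from the candidate s."""
    return sum(abs(x - s) for x in data)


def sign_sum(data, s):
    """Number of data-points above s minus number below s."""
    return sum((x > s) - (x < s) for x in data)


# Hypothetical household incomes with one severe outlier.
incomes = [12, 15, 18, 20, 400]

best = min(incomes, key=lambda s: absolute_error(incomes, s))

print(best)                     # 18
print(median(incomes))          # 18
print(sign_sum(incomes, best))  # 0: two readings above, two below
```

With an even number of readings, every value between the two middle readings gives the same minimal error; by convention the median is then taken as their average.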

If we want to summarize income-per-household in a huge country like India (a data-set with severe outliers that should not be given extra weight), we will use the median.

If we want to summarize the height of children in a kindergarten class, we may use the mean: the data is approximately normally distributed and most likely there aren't any extreme outliers (though in such cases the median and the mode are not any worse summary statistics to use).

Mode is used for summarizing categorical/nominal data.

Mean is used when the effect of extreme values should be captured in the summary.

Median is used for datasets with extreme outliers which need not be given any higher weightage.

The outliers sway the arithmetic mean much more than they sway the median, because the arithmetic mean minimizes the square of the distance, which penalizes far-away data-points much more heavily than the absolute distance does.

S, such that the error $E = \sum_i (x_i - s)^2$ is minimized.

Making the derivative zero gives us

$\sum_i 2(x_i - S) = 0$ or $N S = \sum_i x_i$

or

$S = \frac{1}{N} \sum_i x_i$

We call this the Arithmetic Mean.
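The result of this derivation can be sanity-checked numerically. The sketch below uses invented heights (in cm) and assumed helper names; it confirms that the derivative term vanishes at $S = (\sum_i x_i)/N$ and that moving the candidate away from it in either direction increases the squared error.

```python
def squared_error(data, s):
    """Sum of squared differences from the candidate s."""
    return sum((x - s) ** 2 for x in data)


def derivative_term(data, s):
    """The quantity sum_i 2 * (x_i - s), which is set to zero in the derivation."""
    return sum(2 * (x - s) for x in data)


# Hypothetical heights of five children, in cm.
heights = [100, 104, 98, 102, 96]

S = sum(heights) / len(heights)

print(S)                            # 100.0
print(derivative_term(heights, S))  # 0.0
print(squared_error(heights, S))    # 40.0
print(squared_error(heights, 99))   # 45
print(squared_error(heights, 101))  # 45
```

In this example the error grows by $N (s - S)^2 = 5$ when the candidate moves one unit away from $S$, which is the usual parabola centred on the mean.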

Geometric Mean:

Defined only for a dataset with all positive numbers, it is the Nth root of the product of all N data-points.

$G = \left( \prod_i x_i \right)^{1/N}$

It is used while summarizing/aggregating data with different categories and scales involved, e.g. rating companies on various metrics taken together.

Or where the data-points show compounding behavior, e.g. summarizing the performance of a stock over the past N years.
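To illustrate the compounding case, here is a minimal sketch with invented yearly growth factors for a stock (1.10 means +10%). It computes $G$ from the definition and with `statistics.geometric_mean`, and checks that compounding $G$ for N years reproduces the actual overall growth.

```python
import math
from statistics import geometric_mean

# Hypothetical yearly growth factors of a stock: +10%, -10%, +25%.
growth = [1.10, 0.90, 1.25]
n = len(growth)

overall = math.prod(growth)        # total growth over the three years
G = overall ** (1 / n)             # Nth root of the product
arithmetic = sum(growth) / n

print(G)
print(geometric_mean(growth))      # agrees with G up to rounding
print(G ** n, overall)             # compounding G for n years gives the overall growth
print(arithmetic ** n > overall)   # True: the arithmetic mean overstates the growth
```

The last line shows why the arithmetic mean is the wrong summary here: compounding it over the same period predicts more growth than actually occurred.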

Harmonic Mean:

Defined only for a dataset with all non-zero numbers (in practice, positive numbers), it is the reciprocal of the arithmetic mean of the reciprocals of $x_i$.

$H = \dfrac{1}{\frac{1}{N} \sum_i \frac{1}{x_i}}$

It is used while summarizing rates, e.g. the average speed of aircraft between numerous Mumbai-London trips (each covering the same distance); or the average rate (in ml/min) at which a blood donor fills a bag over multiple visits.
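A short sketch of why the harmonic mean suits rates: for trips of equal length, it equals total distance divided by total time. The distance and speeds below are invented for illustration.

```python
from statistics import harmonic_mean

# Hypothetical figures: two trips over the same distance at different speeds.
distance_km = 7200            # assumed one-way distance, same for both trips
speeds_kmh = [800, 900]       # assumed average speed on each trip

H = len(speeds_kmh) / sum(1 / v for v in speeds_kmh)

total_distance = distance_km * len(speeds_kmh)
total_time = sum(distance_km / v for v in speeds_kmh)
true_average_speed = total_distance / total_time

print(H)                            # about 847.06 km/h
print(harmonic_mean(speeds_kmh))    # same value from the standard library
print(true_average_speed)           # matches H
print(sum(speeds_kmh) / len(speeds_kmh))  # 850.0, slightly too high
```

The arithmetic mean (850 km/h) overstates the average speed because more time is spent flying at the slower speed.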

Mid-Range

Defined as the arithmetic mean of the maximum and minimum data-points.

This is one of the least efficient (since it ignores all the data-points except for min and max) and the least robust (since it only depends on the extreme data-points and will be swayed if they are extreme outliers) statistics.

It is used in process control, e.g. where the process is tightly controlled and the outliers are already handled/trimmed out.
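The sensitivity of the mid-range to a single extreme reading can be seen in a few lines. This is a sketch with invented measurements; `mid_range` is an assumed helper name, since Python's standard library has no such function.

```python
from statistics import median


def mid_range(data):
    """Arithmetic mean of the minimum and maximum data-points."""
    return (min(data) + max(data)) / 2


# Hypothetical measurements from a tightly controlled process.
controlled = [10, 11, 12, 13, 14]
# The same measurements with one reading replaced by an outlier.
with_outlier = [10, 11, 12, 13, 1000]

print(mid_range(controlled))    # 12.0
print(mid_range(with_outlier))  # 505.0
print(median(with_outlier))     # 12
```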

A robust statistic is not overly sensitive to which sample we choose from the population, or to the presence of contaminated/bad/incorrect data in that sample. Breakdown Point represents the degree of robustness of a statistic.

Breakdown Point is the largest proportion of contaminated data-points (e.g. an arbitrarily large data-point) a statistic can handle before yielding an absurd result (e.g. an arbitrarily large statistic).

Since the arithmetic mean depends on all the values and is swayed by changing even one value among N, its Breakdown Point is 0.

The median is among the most robust statistics, with its Breakdown Point at 50%, the highest possible.

(If more than 50% of the data is contaminated, a statistic cannot be defined anyway since there is no way to distinguish between the actual underlying distribution and the contaminated one.)
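The two breakdown points can be illustrated with a small experiment. This is a sketch under stated assumptions: the data, the contaminating value `BAD`, and the helper `contaminate` are all invented for illustration.

```python
from statistics import mean, median

BAD = 10**9  # an arbitrarily large contaminating value (assumption)


def contaminate(data, k, bad_value=BAD):
    """Return a copy of data with its last k data-points replaced by bad_value."""
    return data[: len(data) - k] + [bad_value] * k


# Hypothetical clean data: the integers 1 to 11 (N = 11, mean = median = 6).
clean = list(range(1, 12))

for k in (0, 1, 5, 6):
    sample = contaminate(clean, k)
    print(k, mean(sample), median(sample))

# k = 1: the mean is already absurdly large, the median is still 6.
# k = 5: fewer than half the points are bad, the median is still 6.
# k = 6: more than half the points are bad, the median becomes BAD.
```

A single contaminated point out of N is enough to ruin the mean, and that fraction 1/N shrinks to 0 as N grows, whereas the median survives until more than half of the data-points are contaminated.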

Robust Statistics
