
Summarizing and Visualizing a Data Set

Arun Kumar, Ravindra Gokhale, and Vinaysingh Chawan


Indian Institute of Management Indore

Term II, 2011

Rubik's Cube Data Set


Rubik's Cube puzzle-solving contests are held in different countries every year. This data set has the record minimum solving time for thirty-three countries.

Type of Data

Quantitative Data: Data for which arithmetic operations make sense. Ex: Age, Salary, Length.

Categorical Data: Data obtained by putting individuals in different categories. Ex: Gender, States of a country.

Visualization

Quantitative Data: Histogram, Stem-Leaf plot, Box plot

Categorical Data: Pie Chart, Bar chart

*Discuss Rubik's Cube data

Interpreting a Histogram

Shape: symmetric, skewed, unimodal, bimodal

Center: mean, median

Spread: range, standard deviation, inter-quartile range

Measure of the central tendency of a data set

Mean: If we have a data set $x_1, \ldots, x_n$, then the mean of the data set is $\bar{x} = \frac{x_1 + \cdots + x_n}{n}$.

Notation: $\bar{x}$

Mean: Example

Calculate the mean of 0,5,1,1,3.
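
The example above can be checked with a minimal Python sketch. The variable names `data` and `x_bar` are illustrative choices, not part of the slides.

```python
from statistics import mean

data = [0, 5, 1, 1, 3]

# Mean = sum of the observations divided by the number of observations.
x_bar = sum(data) / len(data)
print(x_bar)  # 2.0

# The standard library computes the same value.
assert x_bar == mean(data)
```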

Measure of the Central Tendency of a Data Set

Median: Middle number in a sorted data set. When the number of observations (sample size) is even, there are two middle numbers. In that case, we take the average of the two middle numbers to obtain the median.

Notation: $\tilde{x}$

Median: Example 1

Calculate the median of 0,5,1,1,3.

Median: Example 2

Calculate the median of 3,2,5,6,4,4,3,5.
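
Both median examples can be worked through with a short sketch that follows the definition directly: sort, then take the middle value or the average of the two middle values. The function name `median_of` is a hypothetical helper introduced here for illustration.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

m1 = median_of([0, 5, 1, 1, 3])           # sorted: 0 1 1 3 5
m2 = median_of([3, 2, 5, 6, 4, 4, 3, 5])  # sorted: 2 3 3 4 4 5 5 6
print(m1)  # 1
print(m2)  # 4.0
```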

Measure of the Central Tendency of a Data Set

Mode: Observation in the data set with the largest frequency. Note that we can have more than one mode for a data set.

Mode: Example

Calculate the mode of 0,5,1,1,3.
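
A small sketch of the mode calculation, returning every value that attains the largest frequency so that data sets with more than one mode are handled. The helper `modes` and the second data set are illustrative assumptions, not from the slides.

```python
from collections import Counter

def modes(values):
    """Return all observations that occur with the largest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([0, 5, 1, 1, 3]))  # [1]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (hypothetical data with two modes)
```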

Effect of an Outlier

Calculate mean, median, and mode of 0,5,1,1,3,100.
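
One way to see the effect is to compute the three measures with and without the value 100. This is a sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median, multimode

original = [0, 5, 1, 1, 3]
with_outlier = original + [100]

# Compare each measure of center before and after adding the outlier.
for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(label, round(mean(data), 2), median(data), multimode(data))
```

The mean moves from 2 to about 18.33, the median only from 1 to 2, and the mode stays at 1, which suggests that the mean is the measure most sensitive to the outlier.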

Unit of Mean, Median, and Mode

The mean, median, and mode have the same unit as the data.

Identifying Relation Between Mean and Median from Histogram

Symmetric:

Left skewed:

Identifying Relation Between Mean and Median from Histogram

Right skewed:
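
A commonly stated rule of thumb is that the mean and median are close for a symmetric distribution, the mean falls below the median for a left-skewed one, and the mean exceeds the median for a right-skewed one. The sketch below illustrates this on three small hypothetical data sets (chosen for illustration, not taken from the slides).

```python
from statistics import mean, median

# Hypothetical illustrative data sets, not taken from the slides.
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "left skewed": [1, 47, 48, 49, 50],
    "right skewed": [1, 2, 3, 4, 50],
}

for shape, data in samples.items():
    print(f"{shape}: mean = {mean(data)}, median = {median(data)}")
```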

Measure of the Spread of a Data Set

Range: max - min

Ex: Find the range of 0,5,1,1,3. Range = 5 - 0 = 5.

Measure of the Spread of a Data Set

Variance:

$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Standard deviation:

$\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Variance and Standard Deviation: Example


Calculate variance and standard deviation of 3,3,3,3,3.

Calculate variance and standard deviation of 1,2,3,4,5.
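
The two examples can be worked through with the sketch below. It assumes the sample convention, dividing the sum of squared deviations by n - 1; if the course divides by n instead, the second variance would be 2 rather than 2.5. The helper names are illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed convention)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation is the square root of the variance."""
    return sqrt(sample_variance(values))

for data in ([3, 3, 3, 3, 3], [1, 2, 3, 4, 5]):
    print(data, sample_variance(data), round(sample_sd(data), 4))
# [3, 3, 3, 3, 3] 0.0 0.0
# [1, 2, 3, 4, 5] 2.5 1.5811
```

When all observations are equal there is no spread, so both measures are zero.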

Unit of Variance and Standard Deviation

Standard deviation has the same unit as the data, but the unit of variance is the square of the unit of the data.

Standard Deviation

Standard deviation is always greater than or equal to zero.

Is Standard Deviation Affected by Outliers?

Calculate standard deviation for the data 3,3,3,3,100.
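
A quick comparison with the earlier data set 3,3,3,3,3 shows how much a single extreme value changes the result. This sketch uses `statistics.stdev`, which divides by n - 1; that convention is an assumption about the course's definition.

```python
from statistics import stdev

without_outlier = [3, 3, 3, 3, 3]
with_outlier = [3, 3, 3, 3, 100]

# stdev uses the sample (n - 1) convention.
sd_without = stdev(without_outlier)
sd_with = stdev(with_outlier)
print(sd_without, round(sd_with, 2))  # 0.0 43.38
```

Replacing one observation with 100 moves the standard deviation from 0 to roughly 43.4, so the standard deviation is strongly affected by outliers.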

Is Standard Deviation Always a Good Measure of the Spread of a Data Set?

Quartiles

First quartile: 25th percentile

Notation: Q1

Quartiles

Third quartile: 75th percentile

Notation: Q3

Exercise

Find the first and third quartiles of 8,7,1,4,6,6,4,5,7,6,3,0.
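
Several conventions exist for computing quartiles, and they can give slightly different answers. The sketch below assumes one common textbook convention: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half, with the overall median left out of both halves when the sample size is odd. The function `quartiles` is a hypothetical helper.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention).

    Needs at least two observations.
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]            # observations below the median position
    upper = s[len(s) - half:]   # observations above the median position
    return median(lower), median(upper)

data = [8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0]
q1, q3 = quartiles(data)  # sorted: 0 1 3 4 4 5 | 6 6 6 7 7 8
print(q1, q3)  # 3.5 6.5
```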

Quartiles

Median is the second quartile (Q2).

Measure of the Spread of a Data Set

Interquartile Range (IQR): Q3 - Q1

*IQR is a robust measure of spread. IQR is not affected much by skewness or outliers.

Exercise

Find IQR of 8,7,1,4,6,6,4,5,7,6,3,0.
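
A sketch of the IQR calculation for this exercise, assuming quartiles are taken as the medians of the lower and upper halves of the sorted data (other quartile conventions may give a slightly different IQR). The function name `iqr` is illustrative.

```python
from statistics import median

def iqr(values):
    """Interquartile range Q3 - Q1, with quartiles as medians of the two halves."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[len(s) - half:])
    return q3 - q1

data = [8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0]
spread = iqr(data)
print(spread)  # 3.0
```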

Five Number Summary

Minimum
First quartile
Median
Third quartile
Maximum

Boxplot

*We will create a box plot for the Rubik's Cube data set.

Interpreting a Box Plot

Shape:

Outliers: Any observation not in the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] is considered an outlier (informal rule).
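
The informal rule can be sketched in code and applied to the earlier data set 0,5,1,1,3,100. The quartile convention (medians of the lower and upper halves) and the function name `find_outliers` are assumptions made for this illustration.

```python
from statistics import median

def find_outliers(values, k=1.5):
    """Flag observations outside [Q1 - k * IQR, Q3 + k * IQR]."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[len(s) - half:])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [x for x in s if x < low or x > high]

low, high, flagged = find_outliers([0, 5, 1, 1, 3, 100])
print(low, high, flagged)  # -5.0 11.0 [100]
```

Here Q1 = 1 and Q3 = 5, so IQR = 4 and the range is [-5, 11]; only the value 100 falls outside it.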

Why Do We Need Box Plot?

To compare two or more data sets.

To visualize summary statistics.

Example

*Degree of Reading Power Test Data

Categorical Data Visualization

*Bar Chart

*Pie Chart

Show billionaires data.