# “Summarizing and Visualizing a Data Set”

Arun Kumar, Ravindra Gokhale, and Vinaysingh Chawan
Indian Institute of Management Indore

Term II, 2011

Rubik’s Cube Data Set
A Rubik’s cube puzzle-solving contest is held in different countries every year. This data set contains the record (minimum) solving time for thirty-three countries.

Type of Data

Quantitative Data: Data for which arithmetic operations make sense. Ex: Age, Salary, Length.

Categorical Data: Data obtained by putting individuals in different categories. Ex: Gender, States of a country.

Visualization
Quantitative Data: Histogram, Stem-Leaf plot, Box plot
Categorical Data: Bar chart, Pie Chart
*Discuss Rubik’s cube data.

Interpreting a Histogram
Shape: symmetric, skewed, unimodal, bimodal
Center: mean, median
Spread: range, inter-quartile range, standard deviation

Measure of the Central Tendency of a Data Set
Mean: If we have a data set $x_1, \ldots, x_n$, then the mean of the data set is $\frac{x_1 + \cdots + x_n}{n}$.
Notation: $\bar{x}$

Mean: Example
Calculate the mean of 0, 1, 1, 3, 5.
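The mean example can be checked with a few lines of Python. This is a minimal sketch, assuming the example data are the five values 0, 1, 1, 3, 5 (their order on the slide is an assumption); the variable names are illustrative only.

```python
from statistics import mean

# Example data, assumed to be the five values on the slide
data = [0, 1, 1, 3, 5]

# Mean = sum of the observations divided by the number of observations
x_bar = sum(data) / len(data)
print(x_bar)  # 2.0

# The standard library gives the same value
assert x_bar == mean(data)
```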

Measure of the Central Tendency of a Data Set
Median: Middle number in a sorted data set. When the number of observations (sample size) is even, there are two middle numbers. In that case, we take the average of the two middle numbers to obtain the median.
Notation: $\tilde{x}$

Median: Example 1
Calculate the median of 0, 1, 1, 3, 5.

Median: Example 2
Calculate the median of 3, 4, 6, 2, 3, 5, 4, 5.
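The two median cases (odd and even sample size) can be sketched as follows. The helper name `median_of` is hypothetical, and the two data sets are assumed readings of Example 1 (five values) and Example 2 (eight values); only the multiset of values matters for the result.

```python
def median_of(values):
    """Return the middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd n: single middle number
    return (s[mid - 1] + s[mid]) / 2  # even n: average of the two middle numbers

odd_example = [0, 1, 1, 3, 5]            # assumed data for Example 1
even_example = [3, 4, 6, 2, 3, 5, 4, 5]  # assumed data for Example 2

print(median_of(odd_example))   # 1
print(median_of(even_example))  # 4.0
```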

Measure of the Central Tendency of a Data Set
Mode: Observation in the data set with the largest frequency. Note that a data set can have more than one mode.

Mode: Example
Calculate the mode of 0, 1, 1, 3, 5.
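A short sketch of the mode, using `statistics.multimode` (Python 3.8+), which returns every value that attains the largest frequency. The first data set is the assumed slide example; the second, `two_modes`, is a hypothetical data set added only to show that more than one mode is possible.

```python
from statistics import multimode

data = [0, 1, 1, 3, 5]        # assumed slide example
two_modes = [1, 1, 2, 2, 3]   # hypothetical data with two modes

print(multimode(data))       # [1]
print(multimode(two_modes))  # [1, 2]
```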

Effect of an Outlier
Calculate the mean, median, and mode of 0, 1, 1, 3, 5, 100.
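The effect of the outlier can be seen by computing the three measures with and without the value 100. This is a sketch under the assumption that the exercise data are 0, 1, 1, 3, 5 plus the outlier 100; the variable names are illustrative.

```python
from statistics import mean, median, multimode

original = [0, 1, 1, 3, 5]       # assumed data without the outlier
with_outlier = original + [100]  # same data plus the outlier 100

for label, data in (("without outlier", original), ("with outlier", with_outlier)):
    # Report mean (rounded), median, and all modes for each data set
    print(label, round(mean(data), 2), median(data), multimode(data))
```

Under these assumptions the mean moves from 2 to about 18.33, the median only from 1 to 2, and the mode stays at 1.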

Unit of Mean, Median, and Mode
Mean, median, and mode have the same unit as the data.

Identifying Relation Between Mean and Median from Histogram
Symmetric: mean and median are approximately equal.
Left skewed: mean is typically less than the median.
Right skewed: mean is typically greater than the median.

Measure of the Spread of a Data Set
Range: max − min
Ex: For 0, 1, 1, 3, 5, find the range. Range = 5 − 0 = 5.

Measure of the Spread of a Data Set
Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$

Variance and Standard Deviation: Example
Calculate variance and standard deviation of 3, 3, 3, 3, 3.
Calculate variance and standard deviation of 1, 2, 3, 4, 5.
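The variance and standard deviation formulas with divisor n can be sketched directly. The helper names `variance` and `std_dev` are hypothetical, and the two data sets are an assumed reading of the example (3, 3, 3, 3, 3 and 1, 2, 3, 4, 5).

```python
import math

def variance(values):
    """Variance with divisor n: average squared deviation from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / n

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

constant = [3, 3, 3, 3, 3]    # assumed first data set: no spread at all
increasing = [1, 2, 3, 4, 5]  # assumed second data set

print(variance(constant), std_dev(constant))  # 0.0 0.0
print(variance(increasing))                   # 2.0
print(round(std_dev(increasing), 4))          # 1.4142
```

In the standard library, `statistics.pvariance` and `statistics.pstdev` use the same divisor n, whereas `statistics.variance` and `statistics.stdev` divide by n − 1.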

Unit of Variance and Standard Deviation
Standard deviation has the same unit as the data, but the unit of variance is the square of the unit of the data.

Standard Deviation
Standard deviation is always greater than or equal to zero.

Does Standard Deviation Get Affected by Outliers?
Calculate standard deviation for the data 3, 3, 3, 3, 100.
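A quick check of how a single outlier changes the standard deviation, using `statistics.pstdev` (divisor n). The data set 3, 3, 3, 3, 100 is an assumed reading of the exercise; the comparison set 3, 3, 3, 3, 3 is included here only for contrast.

```python
from statistics import pstdev

constant = [3, 3, 3, 3, 3]        # comparison data with no spread
with_outlier = [3, 3, 3, 3, 100]  # assumed exercise data

sd_constant = pstdev(constant)
sd_outlier = pstdev(with_outlier)

print(sd_constant)           # 0.0
print(round(sd_outlier, 2))  # 38.8
```

Replacing one observation by 100 moves the standard deviation from 0 to about 38.8, so the standard deviation is strongly affected by outliers.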

Is Standard Deviation Always a Good Measure of the Spread of a Data Set?

Quartiles
First quartile: 25th percentile. Notation: $Q_1$
Third quartile: 75th percentile. Notation: $Q_3$

Exercise
Find the first and third quartiles of 8, 7, 4, 1, 0, 7, 3, 6, 4, 6, 6, 5.

Quartiles
Median is the second quartile ($Q_2$).

Measure of the Spread of a Data Set
Inter Quartile Range (IQR): $Q_3 - Q_1$
*IQR is a robust measure of spread. IQR does not get affected much by skewness or outliers.

Exercise
Find the IQR of 8, 7, 4, 1, 0, 7, 3, 6, 4, 6, 6, 5.
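The quartiles and the IQR can be sketched as below. Two assumptions are made: the exercise data are read as twelve single-digit values, and quartiles are computed as the medians of the lower and upper halves of the sorted data. The slides do not state which quartile convention is used, and other conventions (for example the interpolation methods in `statistics.quantiles`) can give slightly different values. The helper name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (needs at least 2 values)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]           # lower half of the sorted data
    upper = s[len(s) - half:]  # upper half of the sorted data
    return median(lower), median(s), median(upper)

# Assumed reading of the exercise data
data = [8, 7, 4, 1, 0, 7, 3, 6, 4, 6, 6, 5]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3)  # 3.5 5.5 6.5
print(iqr)         # 3.0
```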

Five Number Summary
Minimum
First quartile
Median
Third quartile
Maximum

Boxplot
*We will create a box plot for the Rubik’s cube data set.

Interpreting a Box Plot
Shape:
Outliers: Any observation not in the range $[Q_1 - 1.5 \times \mathrm{IQR},\ Q_3 + 1.5 \times \mathrm{IQR}]$ is considered an outlier (Informal Rule).
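The five number summary and the informal 1.5 × IQR rule can be sketched together, since a box plot is drawn from exactly these quantities. The function names are hypothetical, the quartiles use the median-of-halves convention (an assumption; the slides do not fix a convention), and the data set is the assumed outlier example 0, 1, 1, 3, 5, 100.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[len(s) - half:])
    return s[0], q1, median(s), q3, s[-1]

def outliers(values):
    """Return observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (informal rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

data = [0, 1, 1, 3, 5, 100]  # assumed example data with one extreme value

print(five_number_summary(data))  # (0, 1, 2.0, 5, 100)
print(outliers(data))             # [100]
```

Here the IQR is 4, so the range is [−5, 11] and only 100 falls outside it.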

Why Do We Need a Box Plot?
Visualization of summary statistics.
To compare two or more data sets.

Example
*Degree of Reading Power Test Data

Categorical Data Visualization
*Bar Chart
*Pie Chart
Show billionaires data.