
CONDENSATION OF DATA

Applied Statistics and Computing Lab Indian School of Business


Learning goals
• Understanding a possible approach to data analysis
• Studying three data representation techniques:
  – Stem and leaf plot
  – Frequency table
  – Dot plot

Data Analysis
• Exploratory
  – Cleaning
  – Summarization
  – Exploration of salient features
    • Location
    • Variability (spread)
    • Concentration
      – Shape
      – Skewness
      – Tail information
• Inferential

Dataset
• The percentage of employees involved in a certain ‘worker involvement in decision making’ program, in 30 companies:
  (5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57, 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)
• Arranged in ascending order:
  (5, 32, 33, 34, 35, 37, 38, 42, 42, 43, 44, 45, 46, 47, 47, 48, 48, 49, 49, 50, 51, 52, 52, 53, 55, 56, 57, 58, 63, 78)

0 | 5
1 |
2 |
3 | 234578
4 | 223456778899
5 | 012235678
6 | 3
7 | 8

Data taken from Aczel A., Sounderpandian J., Complete Business Statistics
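
The display above can be produced in R with the base function stem(). This is a minimal sketch: the vector name pct is an assumption made for illustration, and the stems R chooses depend on the scale argument, so the printed layout may differ from the slide.

```r
# Percentage of employees in the program, 30 companies (values from the slide)
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

pct_sorted <- sort(pct)   # arrange in ascending order
print(pct_sorted)

stem(pct)                 # prints a stem and leaf display to the console
```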


Stem and leaf plot
• One of the most basic and easiest methods of visualizing data in its original form
• A stem and leaf plot displays the actual values of all the data points
• Each value is split into a stem and a leaf, separated by ‘|’, with the stem on the left side and the leaf on the right side of the vertical line
• Which part of the number qualifies as a stem and which part as a leaf is determined dataset by dataset
• For example, for data consisting of 2-digit values, the digit at the ten’s place may be the stem and the digit at the unit’s place the leaf, as in our previous diagram
• The leaves generally consist of the last (unit) digit of a number and the other digits are considered the stem
• The numbers can sometimes be rounded to a particular number of digits, with the last digit taken as the leaf
• A common format applies to all the values of a dataset
• All the stems must be listed, irrespective of whether any leaf follows or not
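
The stem/leaf split described above can be sketched by hand in R for the 2-digit company data: integer division by 10 gives the stem and the remainder gives the leaf. The names pct, stems, leaves and rows are illustrative assumptions; this is not how stem() is implemented internally.

```r
# Percentage data for the 30 companies (values from the Dataset slide)
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

stems  <- pct %/% 10   # digit at the ten's place
leaves <- pct %% 10    # digit at the unit's place

# Explicit factor levels keep every stem, even those with no leaves (1 and 2)
stem_f <- factor(stems, levels = min(stems):max(stems))
groups <- split(leaves, stem_f)

# One string of sorted leaves per stem; an empty stem gives ""
rows <- vapply(groups, function(l) paste(sort(l), collapse = ""), character(1))

cat(paste(names(rows), "|", rows), sep = "\n")
```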

Example
• GPA of 50 students in the first semester exam for their second course in Quantitative methods
• The GPA range is 0-10
• The numbers have 7 digits after the decimal point
• Converted to a format with 1 digit after the decimal point

Stem and leaf plot (contd.)
The decimal point is at the |

 0 | 3446        Represents 4 values: 0.3, 0.4, 0.4, 0.6
 1 | 1145677
 2 | 22224599    Represents 8 values: 2.2, 2.2, 2.2, 2.2, 2.4, 2.5, 2.9, 2.9
 3 | 1344488
 4 | 3556
 5 | 23578899
 6 | 4           Represents the only value with 6 at its unit’s place: 6.4
 7 | 14789
 8 | 13
 9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | 6

• For negative values, a –ve sign is put in front of the stem
• A stem and leaf plot is a powerful tool for studying a dataset
• Gives an idea about the distribution of values: their spread and density
• Useful in detecting unusual values and the value occurring with the highest frequency
• Easy to read and understand
• Not very informative if there are too few or too many values
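
The rounding step from the Example slide can be sketched as follows. The GPA values here are simulated with runif() purely for illustration (the original 50 GPAs are not reproduced in these slides), and the names gpa_raw and gpa are assumptions. The scale argument of stem() controls how many stems are printed.

```r
set.seed(1)                             # reproducible illustrative data
gpa_raw <- round(runif(50, 0, 10), 7)   # hypothetical GPAs with 7 decimal digits

gpa <- round(gpa_raw, 1)                # keep 1 digit after the decimal point

stem(gpa)              # default layout
stem(gpa, scale = 2)   # a larger scale spreads the values over more stems
```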

Frequency table
• A table listing the frequency counts for each value of a variable
• A useful tool that gives a basic idea about the data at a quick glance
• Very easy to construct and mostly self-explanatory
• Can accommodate many types of data, whether categorical or numerical. Both types of numerical data, discrete and continuous, can be represented in a frequency table

Cars dataset
• Consists of data on 804 used cars in the USA
• Data is collected on 12 features, such as the price, make and model of the car, the number of cylinders, number of doors etc.
• Collected from the Kelley Blue Book

Frequency table (contd.)
• For Cars data, let us take a look at various frequency tables:

Car make     Frequency of each make
Buick        80
Cadillac     80
Chevrolet    320
Pontiac      150
SAAB         114
Saturn       60

No. of cylinders   Frequency of cars with corresponding no. of cylinders
4                  394
6                  310
8                  100

Price      Frequency
8638.93    1
8769       1
8870.95    1
9041.91    1
9220.83    1
9482.22    1
9506.05    1
9563.79    1
9654.06    1
9665.85    1
9720.98    1
…          …

There are 798 unique prices!
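
A minimal sketch of how such tables come out of table() in R. The Cars data itself is not included here, so the vectors make and cyl are rebuilt from the counts shown above with rep(); they are stand-ins for the real columns, and their names are assumptions.

```r
# Rebuild the two categorical/discrete variables from the published counts
make <- rep(c("Buick", "Cadillac", "Chevrolet", "Pontiac", "SAAB", "Saturn"),
            times = c(80, 80, 320, 150, 114, 60))
cyl  <- rep(c(4, 6, 8), times = c(394, 310, 100))

make_freq <- table(make)   # frequency of each make
cyl_freq  <- table(cyl)    # frequency of each number of cylinders

print(make_freq)
print(cyl_freq)
```

For a continuous variable such as price, the same call returns one row per distinct value, which is why the price table above is so long.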

Frequency table (contd.)
• Is there a better way of tabulating the prices?
• What if we split the prices into bands and calculate the frequencies?
• Would such a table be useful?
• The prices of cars range from $8639 to $70760

Frequency table for class intervals
Price range          Number of cars
[$8000, $13000)      135
[$13000, $18000)     265
[$18000, $23000)     150
[$23000, $28000)     75
[$28000, $33000)     76
[$33000, $38000)     45
[$38000, $43000)     33
[$43000, $48000)     11
[$48000, $53000)     5
[$53000, $58000)     2
[$58000, $63000)     1
[$63000, $68000)     3
[$68000, $73000)     3
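
Price bands like these can be built in R with cut(). The sketch below uses the class limits from the table, but the price vector is a handful of illustrative values (the lowest price from the earlier table, the approximate highest price, and two made-up values near a class limit), not the real Cars data.

```r
# Class limits: 14 limits give 13 classes of width $5000
breaks <- seq(8000, 73000, by = 5000)

# Illustrative prices only; 13000 and 17999.99 are hypothetical
price <- c(8638.93, 13000, 17999.99, 70760)

# right = FALSE gives intervals closed on the left, open on the right: [a, b)
# dig.lab = 5 keeps labels such as [13000,18000) out of scientific notation
band <- cut(price, breaks = breaks, right = FALSE, dig.lab = 5)

print(table(band))
```

Note that 13000 falls in the second class, not the first, because each class excludes its upper limit.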

Determining class intervals
• Each band of prices, or group of values of a variable, is referred to as a ‘class’ or a ‘class interval’
• The number of class intervals and the size of each interval are best determined by the researcher or analyst, who has prior knowledge of the behaviour of the variable
• Classes must be determined keeping the range of values in mind
• Very few, wide class intervals may not be very informative, as most of the information gets hidden inside the large intervals
• Too many small intervals may capture a detailed picture, but such a table will be sparse and its sheer length may take away its usefulness
• As far as possible, having class intervals of equal width makes the table easier to understand
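
The trade-off between few wide classes and many narrow ones can be seen on the 30-company percentage data from the Dataset slide. This is a sketch; the widths 40, 10 and 2 are arbitrary choices made for illustration.

```r
# Percentage data for the 30 companies (values from the Dataset slide)
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Same data, three class widths; all intervals are of the form [a, b)
wide   <- table(cut(pct, breaks = seq(0, 80, by = 40), right = FALSE))
medium <- table(cut(pct, breaks = seq(0, 80, by = 10), right = FALSE))
narrow <- table(cut(pct, breaks = seq(0, 80, by = 2),  right = FALSE))

print(wide)     # 2 classes: almost all detail is hidden
print(medium)   # 8 classes: the shape of the data is visible
length(narrow)  # 40 classes: a long table
sum(narrow == 0)  # number of empty classes in the narrow table
```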

Determining class intervals (contd.)
• The class limits, i.e. the highest and lowest values of a class interval, must be chosen carefully
• Classes must be determined such that no value of the dataset can belong to more than one class interval
• Two types of brackets are used: closed [ ] or open ( )
• A class interval can have one open and one closed bracket
• Closed bracket => include the number on that side of the interval
• Open bracket => all numbers up to or starting from, but excluding, the number on that side of the interval

Interval   Meaning
[1,3]      Includes every number from 1 to 3, including the limits, e.g. 1, 1.3, 1.8, 2.24, 2.6, 2.98, 2.999999, 3
[1,3)      Includes every number starting from 1 and reaching up to but not including 3, e.g. 1, 1.01, 1.3, 1.78, 2.4, 2.9, 2.99, 2.999, 2.9999, 2.99999 (There can be as many 9s after the decimal point)
(1,3]      Includes every number starting after 1 (but not 1) and reaching up to and including 3, e.g. 1.000000000001, 1.0000001, 1.1, 1.24, 1.7, 2.3, 2.69, 2.99, 3 (There can be as many zeroes after the decimal point but the last digit must be a 1)
(1,3)      Includes every number in between 1 and 3, excluding 1 and 3, e.g. 1.0000000000000000001, 1.15, 1.6, 1.92, 2.3, 2.89, 2.99999999999999999999999

• For discrete data, limits of class intervals can easily be determined in a non-overlapping manner
• For continuous data, the same limit value can appear in adjacent classes, e.g. [1,3) and [3,5); the brackets decide which single class that value belongs to
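
The bracket conventions can be checked in R with cut(), whose right and include.lowest arguments control which limits are included. A minimal sketch with illustrative variable names; base cut() has no direct setting for an interval open at both ends, so (1,3) is not shown.

```r
v <- c(1, 2, 3)

# [1,3): 1 is included, 3 is excluded (3 becomes NA, i.e. outside the class)
closed_left <- cut(v, breaks = c(1, 3), right = FALSE)

# (1,3]: 1 is excluded, 3 is included
closed_right <- cut(v, breaks = c(1, 3), right = TRUE)

# [1,3]: both limits included
both_closed <- cut(v, breaks = c(1, 3), include.lowest = TRUE)

print(is.na(closed_left))
print(is.na(closed_right))
print(is.na(both_closed))

# Adjacent classes [1,3) and [3,5) share the limit 3,
# but the value 3 belongs only to the second class
adj <- cut(3, breaks = c(1, 3, 5), right = FALSE)
print(adj)
```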

Dot plot
• A simple tool to depict the frequencies of values in a dataset
• The X-axis denotes the value and the corresponding frequency is denoted on the Y-axis
• Gives an idea about the distribution of values
• Indicates the intervals within which the variable may not take any values
• The value with the highest frequency is easily determined
• To create a dot plot in R, the variable has to be numeric
• In case of a categorical variable or a variable with class intervals, an equivalent variable assigning a numeric value to each category or class must be created
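
A dot plot of this kind can be sketched in base R with stripchart(), used here as a package-free alternative to dots() from TeachingDemos; it is a swapped-in technique, not the function these slides use. The second part shows one way to turn class intervals into the numeric codes mentioned above. Variable names are assumptions.

```r
# Percentage data for the 30 companies (values from the Dataset slide)
pct <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
         50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# method = "stack" piles repeated values on top of each other as dots
stripchart(pct, method = "stack", pch = 16,
           xlab = "Percentage of employees involved")

# A variable with class intervals must first be given numeric codes
band      <- cut(pct, breaks = seq(0, 80, by = 10), right = FALSE)
band_code <- as.integer(band)   # 1 for [0,10), 2 for [10,20), and so on

stripchart(band_code, method = "stack", pch = 16,
           xlab = "Class interval (numeric code)")
```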

Comparison
Discrete data
– Stem and leaf plot: √
– Frequency table: √
– Dot plot: √

Continuous data
– Stem and leaf plot: √
– Frequency table: Constructing class intervals can be useful
– Dot plot: Need to create class intervals

Categorical data
– Stem and leaf plot: ×
– Frequency table: √
– Dot plot: √

Advantages
– Stem and leaf plot: Depicts actual values; can detect unusual observations
– Frequency table: Most informative with large data
– Dot plot: Best depiction if there are many values but only a few of them have a high frequency

Disadvantages
– Stem and leaf plot: Not very informative for a large dataset
– Gives less information than a stem and leaf plot

Height is in cms.

Height (in cms.)   Frequency
147.2              1
149.5              1
149.9              1
151.1              1
…                  …
198.1              1

Out of the 507 total data points, 147 have unique height values. Clearly, for this continuous data we need to make class intervals!

Height (in cms.)   Frequency
[146, 152)         5
[152, 158)         31
[158, 164)         92
[164, 170)         101
[170, 176)         118
[176, 182)         92
[182, 188)         40
[188, 194)         26
[194, 200)         2

In this case a stem plot is not at all a good idea. Most importantly, for this variable, we do not need to know the exact values. Knowing the range within which they lie might be sufficient.


Conclusion
• All three representations are easy to construct
• Important tools for getting a feel of the data!
• Must use the appropriate representation based on the characteristics of the data
• Helpful in determining the further course of data analysis

R-codes
Function             R-code
Stem and leaf plot   stem(variable_name)
                     Note: ‘scale’ is an important parameter to explore in R’s stem function
Frequency table      table(variable_name)
Dot plot             install.packages("TeachingDemos")
                     library(TeachingDemos)
                     dots(variable_name)

Here variable_name stands for the name of the variable in your own data. R is case-sensitive, so the function is install.packages, with a lower-case i.

Thank you
