
This tutorial on Data condensation is prepared by the Applied Statistics and Computing lab at the Indian School of Business, Hyderabad. It is a part of the module on Descriptive Statistics, prepared by us.


Applied Statistics and Computing Lab Indian School of Business


Learning goals

• Understanding a possible approach to data analysis
• Studying three data representation techniques:
  – Stem and leaf plot
  – Frequency table
  – Dot plot


Data Analysis

• Exploratory
  – Cleaning
  – Summarization
  – Exploration of salient features
    • Location
    • Variability (spread)
    • Concentration
      – Shape
      – Skewness
      – Tail information
• Inferential


Dataset

• The percentage of employees involved in a certain ‘worker involvement in decision making’ program, in 30 companies:
  (5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57, 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)
• Arranged in ascending order:
  (5, 32, 33, 34, 35, 37, 38, 42, 42, 43, 44, 45, 46, 47, 47, 48, 48, 49, 49, 50, 51, 52, 52, 53, 55, 56, 57, 58, 63, 78)

0 | 5
1 |
2 |
3 | 234578
4 | 223456778899
5 | 012235678
6 | 3
7 | 8

Data taken from Aczel A., Sounderpandian J., Complete Business Statistics
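The ordering and the display above can be reproduced with a short R sketch. The variable names are our own choice, and the exact layout printed by `stem` may differ from the slide depending on its `scale` argument.

```r
# Percentage of employees involved in the program, 30 companies (values from the slide)
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Arrange the values in ascending order
sorted_involvement <- sort(involvement)
print(sorted_involvement)

# Base R stem and leaf display of the same values
stem(involvement)
```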


**Stem and leaf plot**

• The most basic and easiest method of visualizing data in its original form
• A stem and leaf plot displays the actual values of all the data points
• Each value is split into a stem and a leaf, separated by ‘|’, with the stem on the left and the leaf on the right of the vertical line
• Which part of a number qualifies as the stem and which as the leaf is decided dataset by dataset
• For example, for data consisting of 2-digit values, the tens digit may be taken as the stem and the units digit as the leaf, as in the previous diagram
• The leaves generally consist of the last (units) digit of a number, and the remaining digits form the stem
• The numbers can sometimes be rounded to a particular number of digits, with the last digit taken as the leaf
• A common format applies to all the values of a dataset
• All the stems must be listed, irrespective of whether any leaf follows or not
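The splitting rule can be sketched by hand in R for the 2-digit worker involvement data. This is only an illustration of the idea, not how `stem` works internally; the names `stems`, `leaves` and `leaf_strings` are our own.

```r
# Worker involvement percentages (values from the dataset slide), sorted
x <- sort(c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
            50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38))

stems  <- x %/% 10   # tens digit becomes the stem
leaves <- x %% 10    # units digit becomes the leaf

# A factor with levels 0..7 keeps the stems that have no leaves (1 and 2)
stem_factor  <- factor(stems, levels = 0:max(stems))
leaf_strings <- tapply(leaves, stem_factor, paste, collapse = "")
leaf_strings[is.na(leaf_strings)] <- ""   # empty stems are still listed

cat(paste0(names(leaf_strings), " | ", leaf_strings), sep = "\n")
```

Using a factor with all levels is what enforces the rule that every stem is listed, even when no leaf follows it.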


Example

• GPA of 50 students in the first semester exam for their second course in Quantitative Methods
• The GPA range is 0-10
• The numbers have 7 digits after the decimal point
• Converted into a ‘1 digit after the decimal point’ format
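The conversion step might look like the sketch below. The GPA values here are hypothetical (the raw 7-decimal data is not given in the slides) and serve only to show the rounding.

```r
# Hypothetical GPA values with 7 digits after the decimal point (not the slide's data)
gpa_raw <- c(0.3412876, 2.2431187, 2.9478213, 5.7812349, 6.4123457, 7.8890124)

# Keep one digit after the decimal point; that digit becomes the leaf
gpa <- round(gpa_raw, 1)
print(gpa)

# Stem and leaf display of the rounded values
stem(gpa)
```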


**Stem and leaf plot (contd.)**

The decimal point is at the |

0 | 3446
1 | 1145677
2 | 22224599
3 | 1344488
4 | 3556
5 | 23578899
6 | 4
7 | 14789
8 | 13
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | 6

Stem 0 represents 4 values: 0.3, 0.4, 0.4, 0.6
Stem 2 represents 8 values: 2.2, 2.2, 2.2, 2.2, 2.4, 2.5, 2.9, 2.9
Stem 6 represents the only value with 6 as its integer part: 6.4

• For negative values, a minus sign is put in front of the stem
• A stem and leaf plot is a powerful tool for studying a dataset
• Gives an idea about the distribution of values: their spread and density
• Useful in detecting unusual values and the value occurring with the highest frequency
• Easy to read and understand
• Not very informative if there are too few or too many values


Frequency table

• A table listing the frequency counts for each value of a variable
• A useful tool that gives a basic idea about the data at a quick glance
• Very easy to construct and mostly self-explanatory
• Can accommodate many types of data, whether categorical or numerical. Both types of numerical data, discrete and continuous, can be represented in a frequency table


Cars dataset

• Consists of data on 804 used cars in the USA
• Data is collected on 12 features, such as the price, make and model of the car, the number of cylinders, the number of doors, etc.
• Collected from the Kelley Blue Book


**Frequency table (contd.)**

• For Cars data, let us take a look at various frequency tables:

| Car make | Frequency of each make |
|---|---|
| Buick | 80 |
| Cadillac | 80 |
| Chevrolet | 320 |
| Pontiac | 150 |
| SAAB | 114 |
| Saturn | 60 |

| No. of cylinders | Frequency of cars with corresponding no. of cylinders |
|---|---|
| 4 | 394 |
| 6 | 310 |
| 8 | 100 |

| Price | Frequency |
|---|---|
| 8638.93 | 1 |
| 8769 | 1 |
| 8870.95 | 1 |
| 9041.91 | 1 |
| 9220.83 | 1 |
| 9482.22 | 1 |
| 9506.05 | 1 |
| 9563.79 | 1 |
| 9654.06 | 1 |
| 9665.85 | 1 |
| 9720.98 | 1 |
| … | … |

**There are 798 unique prices!**
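A minimal sketch of how such tables come out of `table` in R. The raw Cars file is not included in the slides, so the `make` and `cylinders` vectors below are rebuilt from the published counts purely for illustration.

```r
# Rebuild one entry per car from the frequencies shown in the tables above
make_counts <- c(Buick = 80, Cadillac = 80, Chevrolet = 320,
                 Pontiac = 150, SAAB = 114, Saturn = 60)
make <- rep(names(make_counts), times = make_counts)

# Frequency table of a categorical variable
make_table <- table(make)
print(make_table)

# Frequency table of a discrete numerical variable
cylinders <- rep(c(4, 6, 8), times = c(394, 310, 100))
cylinder_table <- table(cylinders)
print(cylinder_table)

# For a continuous variable such as price, nearly every value is its own row.
# Here: the first few prices listed on the slide.
prices <- c(8638.93, 8769, 8870.95, 9041.91, 9220.83, 9482.22)
n_unique_prices <- length(unique(prices))
print(n_unique_prices)
```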


**Frequency table (contd.)**

• Is there a better way of tabulating the prices?
• What if we split the prices into bands and calculate the frequencies?
• Would such a table be useful?
• The prices of cars range from $8639 to $70760


**Frequency table for class intervals**

| Price range | Number of cars |
|---|---|
| [$8000, $13000) | 135 |
| [$13000, $18000) | 265 |
| [$18000, $23000) | 150 |
| [$23000, $28000) | 75 |
| [$28000, $33000) | 76 |
| [$33000, $38000) | 45 |
| [$38000, $43000) | 33 |
| [$43000, $48000) | 11 |
| [$48000, $53000) | 5 |
| [$53000, $58000) | 2 |
| [$58000, $63000) | 1 |
| [$63000, $68000) | 3 |
| [$68000, $73000) | 3 |
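One way to build such a banded table in R is `cut` followed by `table`. The sketch below uses the same 13 bands, but the `prices` vector is a small illustrative mix (a few prices listed earlier plus made-up values), so its counts do not match the table above.

```r
# Illustrative prices: the first five are from the slide, the rest are hypothetical
prices <- c(8638.93, 8769, 8870.95, 9041.91, 9220.83,
            13000, 17999.99, 18000, 42500, 70760)

# Class limits: $8000 to $73000 in steps of $5000 gives 13 intervals
breaks <- seq(8000, 73000, by = 5000)

# right = FALSE makes each interval closed on the left and open on the right: [a, b)
price_class <- cut(prices, breaks = breaks, right = FALSE, dig.lab = 5)

price_table <- table(price_class)
print(price_table)
```

Note that 13000 and 18000 each fall in the band that starts at that value, not in the band that ends there.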


**Determining class intervals**

• Each band of prices, or group of values of a variable, is referred to as a ‘class’ or a ‘class interval’
• The number of class intervals and the size of each interval are best determined by the researcher or analyst, who has prior knowledge of the behaviour of the variable
• Classes must be determined keeping the range of values in mind
• A few wide class intervals may not be very informative, as most of the information gets hidden within the large intervals
• Too many small intervals may capture a detailed picture, but such a table will be sparse and its sheer length may take away its usefulness
• As far as possible, use class intervals of equal width; this makes the table easier to understand
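The trade-off between a few wide classes and many narrow ones can be seen on the worker involvement data from earlier. The widths 40, 10 and 2 below are arbitrary choices for illustration, not a recommendation.

```r
# Worker involvement percentages (values from the dataset slide)
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Two very wide classes: almost all detail is hidden
wide <- table(cut(involvement, breaks = seq(0, 80, by = 40), right = FALSE))
print(wide)

# Eight classes of width 10: the shape of the data is visible
medium <- table(cut(involvement, breaks = seq(0, 80, by = 10), right = FALSE))
print(medium)

# Forty classes of width 2: a long, sparse table
narrow <- table(cut(involvement, breaks = seq(0, 80, by = 2), right = FALSE))
empty_classes <- sum(narrow == 0)
print(empty_classes)
```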


**Determining class intervals (contd.)**

• The class limits, i.e. the lowest and highest values of a class interval, must be chosen carefully
• Classes must be determined such that no value of the dataset can belong to more than one class interval
• Two types of brackets are used: closed [ ] and open ( )
• A class interval can have one open and one closed bracket
• Closed bracket => the number on that side of the interval is included
• Open bracket => all numbers up to, or starting from, but excluding the number on that side of the interval

| Interval | Meaning |
|---|---|
| [1,3] | Includes every number from 1 to 3, including the limits, e.g. 1, 1.3, 1.8, 2.24, 2.6, 2.98, 2.999999, 3 |
| [1,3) | Includes every number starting from 1 and reaching up to but not including 3, e.g. 1, 1.01, 1.3, 1.78, 2.4, 2.9, 2.99, 2.999, 2.9999, 2.99999 (there can be as many 9s after the decimal point as we like) |
| (1,3] | Includes every number starting after 1 (but not 1) and reaching up to and including 3, e.g. 1.000000000001, 1.0000001, 1.1, 1.24, 1.7, 2.3, 2.69, 2.99, 3 (there can be as many zeroes after the decimal point as we like, as long as the number is greater than 1) |
| (1,3) | Includes every number in between 1 and 3, excluding 1 and 3, e.g. 1.0000000000000000001, 1.15, 1.6, 1.92, 2.3, 2.89, 2.99999999999999999999999 |

• For discrete data, limits of class intervals can be easily determined in a non-overlapping manner
• For continuous data, the same limit value can appear as the boundary of two adjacent classes, e.g. [1,3) and [3,5); the brackets decide which class contains it
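The four bracket conventions correspond to simple comparisons, sketched below in R. The test values in `x` are our own; the `right` argument of `cut` switches between the [a, b) and (a, b] conventions.

```r
x <- c(1, 1.3, 2.999, 3)

in_closed       <- x >= 1 & x <= 3   # [1,3]: both limits included
in_left_closed  <- x >= 1 & x <  3   # [1,3): 1 included, 3 excluded
in_right_closed <- x >  1 & x <= 3   # (1,3]: 1 excluded, 3 included
in_open         <- x >  1 & x <  3   # (1,3): both limits excluded

# Adjacent classes share the limit 3, but it belongs to exactly one of them
left_closed_class  <- cut(3, breaks = c(1, 3, 5), right = FALSE)  # classes [1,3), [3,5)
right_closed_class <- cut(3, breaks = c(1, 3, 5), right = TRUE)   # classes (1,3], (3,5]
print(as.character(left_closed_class))
print(as.character(right_closed_class))
```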


Dot plot

• A simple tool to depict the frequencies of values in a dataset
• The X-axis denotes the value and the corresponding frequency is denoted on the Y-axis
• Gives an idea about the distribution of values
• Indicates the intervals within which the variable does not take any values
• The value with the highest frequency is easily determined
• To create a dot plot in R, the variable has to be numeric
• In case of a categorical variable or a variable with class intervals, an equivalent variable assigning a numeric value to each category or class must be created
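As a rough sketch, a stacked dot plot can also be drawn with base R's `stripchart`; this is a substitute chosen for illustration, not the `dots` function from TeachingDemos that the R-codes slide uses. The short `make` vector is hypothetical and only shows how categories can be given numeric codes.

```r
# Worker involvement percentages (values from the dataset slide)
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Frequencies that the dots will represent
freq <- table(involvement)

# One dot per observation; repeated values are stacked on top of each other
stripchart(involvement, method = "stack", pch = 16, offset = 0.5,
           xlab = "Percentage of employees involved", main = "Dot plot")

# A categorical variable needs an equivalent numeric variable first
make <- c("Buick", "SAAB", "Buick", "Saturn")   # hypothetical short example
make_code <- as.integer(factor(make))           # one integer code per category
print(make_code)
```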


Comparison

| | Stem and leaf plot | Frequency table | Dot plot |
|---|---|---|---|
| Discrete data | √ | √ | √ |
| Continuous data | √ | Constructing class intervals can be useful | Need to create class intervals |
| Categorical data | × | √ | √ |
| Advantages | Depicts actual values; can detect unusual observations | Most informative with large data | Best depiction if there are many values but only a few of them have a high frequency |

Disadvantages

• Not very informative for a large dataset (stem and leaf plot)
• Gives less information than a stem and leaf plot


Height is in cms.

| Height (in cms.) | Frequency |
|---|---|
| 147.2 | 1 |
| 149.5 | 1 |
| 149.9 | 1 |
| 151.1 | 1 |
| … | … |
| 198.1 | 1 |

Out of the 507 total data points, 147 have unique height values. Clearly, for this continuous data we need to make class intervals!

| Height (in cms.) | Frequency |
|---|---|
| [146, 152) | 5 |
| [152, 158) | 31 |
| [158, 164) | 92 |
| [164, 170) | 101 |
| [170, 176) | 118 |
| [176, 182) | 92 |
| [182, 188) | 40 |
| [188, 194) | 26 |
| [194, 200) | 2 |

In this case a stem and leaf plot is not at all a good idea. Most importantly, for this variable, we do not need to know the exact values. Knowing the range within which they lie might be sufficient.


Conclusion

• Easy to construct
• Important tools for getting a feel for the data!
• Must use the appropriate representation based on the characteristics of the data
• Helpful in determining the further course of data analysis


R-codes

| Functions | R-code |
|---|---|
| Stem and leaf plot | stem(variable_name) |
| Frequency table | table(variable_name) |
| Dot plot | install.packages("TeachingDemos"); library(TeachingDemos); dots(variable_name) |

Note: ‘scale’ is an important parameter to explore in R’s stem function. R is case-sensitive, so the installer must be written install.packages, with a lowercase ‘i’.
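A short sketch of these calls on the worker involvement data, including the effect of `scale`. The `dots` call is guarded so the script still runs if the TeachingDemos package is not installed; that guard is our addition, not part of the slides.

```r
# Worker involvement percentages (values from the dataset slide)
involvement <- c(5, 32, 53, 35, 42, 43, 52, 45, 46, 44, 37, 48, 58, 49, 57,
                 50, 47, 78, 34, 51, 42, 52, 47, 33, 55, 56, 49, 48, 63, 38)

# Default display, then a longer one; scale controls the plot length
stem(involvement)
stem(involvement, scale = 2)

# Frequency count for each distinct value
involvement_table <- table(involvement)
print(involvement_table)

# Dot plot from TeachingDemos, drawn only if the package is available
if (requireNamespace("TeachingDemos", quietly = TRUE)) {
  TeachingDemos::dots(involvement)
}
```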


Thank you
