

Applied Statistics and Computing Lab Indian School of Business



Learning Goals
• Why we use graphs
• What are the various types of graphs for presenting numerical data
• Which graph to use in which scenario
• Graphical distortion



Why use Graphs: An example
• A private insurance firm is interested in marketing its insurance products in region A. To target customers precisely, it needs to know the age distribution.
• Questions:
  In which age group does the highest number of people lie?
  It needs to divide the population into 4 different age groups, to sell 4 different products.
• It has the following data:

Data on ages of 507 people
23,21,23,26,22,27,29,37,55,53,21,19,20,18,32,20,28,19,23,33,40,28,24,36,23,29,3 4,31,34,42,45,23,46,26,30,25,20,37,24,36,28,29,23,23,25,24,37,42,30,28,29,39,26, 20,21,20,19,20,40,25,45,28,21,22,19,24,24,20,29,27,27,40,43,22,22,21,24,23,23,4 5,20,25,25,33,21,23,20,34,20,41,25,32,24,65,28,25,38,23,22,20,35,34,67,38,33,26, 25,52,21,32,43,24,28,62,45,40,21,23,30,20,28,41,32,26,37,38,27,23,50,25,23,43,3 3,22,26,37,32,23,37,23,27,23,27,24,21,25,23,23,46,34,25,29,45,44,35,55,25,31,19, 45,34,19,20,29,33,37,21,23,51,31,27,27,37,25,37,33,25,29,25,20,25,28,24,31,25,2 7,23,20,28,40,21,62,44,49,34,25,29,19,20,20,26,19,36,34,24,27,23,20,28,40,21,62, 44,49,34,25,29,19,20,20,26,19,36,34,24,27,22,22,48,21,27,33,34,54,25,35,22,21,4 1,23,19,29,27,36,21,20,20,24,35,33,25,45,55,49,30,28,25,23,26,21,26,32,32,32,35, 19,26,22,23,25,38,30,43,60,32,26,23,24,21,28,25,20,64,39,27,32,23,24,23,29,44,2 0,24,42,27,43,37,20,47,45,20,28,21,37,27,26,22,21,62,27,27,22,22,52,42,30,19,19, 19,24,21,36,32,52,26,56,30,23,21,44,37,51,38,23,44,26,23,20,44,25,18,22,35,24,2 5,23,22,24,26,26,28,34,24,33,46,51,25,19,35,19,19,20,41,33,44,19,29,35,33,22,33, 44,29,46,19,30,26,20,32,20,27,22,40,42,29,31,22,29,36,37,25,46,25,43,43,24,24,1 9,46,29,26,32,29,34,26,34,22,25,41,38,21,34,37,56,28,35,29,22,22,24,36,40,40,37, 23,34,20,23,40,20,30,32,30,21,39,37,22,39,49,24,20,40,24,39,32,24,22,20,27,21,2 6,28,26,18,30,22,30,18,52,25,28,42,23,41,32,22,24,25,27,24,27,31,35,21,36,20,23, 19,25,31,32,40,41,36,43,34,26,29,23,45,33,29,29,45,48,19,38,26,48,22,32,44,44,1 9,32,30,

What can you infer from the data? Practically nothing! How long before you come up with answers?
• Probably the first thing you do is count the observations for each age.
• Note down each age along with the corresponding number of observations.
• That makes a frequency table for you!
• Frequency table: just like in categorical data, a frequency table for discrete numerical data lists each possible value (either individually or grouped into intervals), the associated frequency and sometimes the corresponding relative frequency.
• Note: Age is, in theory, a continuous variable, as it can assume any value. But here the variable is age in whole years, which is discrete.
• But there are 44 distinct values in your data!
• Hence a frequency table with 44 rows and one frequency column.



So, list individually or group?
• List individually or group into intervals?
  In the ages data, there are 44 distinct values. If we list them individually, we have a table with 44 rows, which is cumbersome to interpret.
  The insurance company is interested in selling 4 different products catering to the needs of 4 different age groups, so it is interested in 4 age categories.

• In general, depending on the need and the size of the data, decide whether to group or not (for discrete data).
• For continuous data it is necessary to group.
• How to make groups? Find the max and the min. Choose a suitable class width = (max - min)/(desired number of classes). If the result is a decimal, round it up to the next integer; if it is already an integer, also take the next integer.
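The class-width rule above can be sketched in R as follows. This is only an illustration: `ages` holds the first 20 values of the listing (not the full 507), and the names `n_classes`, `raw_width` and `class_width` are introduced here for the example.

```r
# First 20 ages from the data listing, used only as a small sample
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
          21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

n_classes <- 4                                      # desired number of classes
raw_width <- (max(ages) - min(ages)) / n_classes    # (55 - 18) / 4 = 9.25

# "Next integer" rule: a decimal is rounded up, an exact integer moves up by one
class_width <- floor(raw_width) + 1
class_width
```

Reading the minimum (18) and maximum (67) off the full listing, the same rule gives (67 - 18)/4 = 12.25, hence a width of 13, which agrees with classes such as 17-29 that cover 13 whole years each.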

Construction of Frequency table
Class Interval    Frequency
17-29             298
30-42             142
43-55             56
56-68             10

Do we have the answers in a minute from this table?
• The age group 17-29 has the maximum number of people
• We also have the exact number of people in each age group
• This same data can be represented pictorially in a number of ways!
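A grouped frequency table of this kind could be built in R roughly as below. The sketch uses only the first 20 ages of the listing, so its counts differ from the slide's table; the break points and the object names are choices made for this example.

```r
# First 20 ages from the data listing, used only as a small sample
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
          21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

# cut() uses right-closed intervals by default, so (16,29] holds ages 17 to 29
breaks <- c(16, 29, 42, 55, 68)
labels <- c("17-29", "30-42", "43-55", "56-68")

groups   <- cut(ages, breaks = breaks, labels = labels)
freq     <- table(groups)        # frequency of each class
rel_freq <- prop.table(freq)     # relative frequency of each class

# Class with the highest frequency
modal_class <- names(freq)[which.max(freq)]

cbind(Frequency = as.vector(freq), Relative = round(as.vector(rel_freq), 2))
```

Running the same lines on the full `age` vector should reproduce a table in the layout of the slide.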


Types of Graph
• Graphs for presenting numerical data:
  Bar chart (for discrete variable)
  Histogram
  Frequency Polygon
  Ogive
  Line Diagram



Bar Chart (Numerical Data)
• Graph of the frequency distribution
• Similar to the bar chart for categorical data
• Each frequency or relative frequency is represented by a rectangle centered over the corresponding value (or range of values for grouped data)
• Area of the rectangle is proportional to the corresponding frequency or relative frequency
• We could name the groups group 1, group 2, group 3 and group 4 and plot the corresponding frequencies, exactly as in the case of categorical data (Exercise)
• Conceptually, hence, there is no difference between the two
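One possible way to attempt the exercise is a plain `barplot()` of the grouped frequencies. The frequencies below are copied from the frequency table slide; the object names and axis labels are assumptions made for this sketch.

```r
# Class labels and frequencies copied from the frequency table slide
class_labels <- c("17-29", "30-42", "43-55", "56-68")
frequency    <- c(298, 142, 56, 10)

# Relative frequencies, in case the bars should show shares instead of counts
relative_frequency <- frequency / sum(frequency)

# barplot() returns the x positions of the bar centres
bar_mids <- barplot(frequency,
                    names.arg = class_labels,
                    xlab = "Age group",
                    ylab = "Frequency",
                    main = "Bar chart of grouped ages")
```

Replacing `frequency` with `relative_frequency` in the call gives the relative-frequency version of the same chart.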

Histogram (for continuous numerical data)

• Graph of the frequency distribution of continuous data
• Suppose we are given the ages of 507 people in continuous form (now age is not reported in whole years; it can take any value on the real line)
• We draw a histogram instead of a bar chart
• Similar to the bar chart for numerical data, except that there are no gaps between the bars
• If the class intervals are equal, the height of each rectangle represents the class frequency, so that the area of the histogram is proportional to the total frequency
• If the class intervals are not equal, the height represents the frequency density (= class frequency / class width), so that the area of each rectangle equals its class frequency. If relative frequencies are used in place of frequencies, the total area enclosed by the histogram = 1
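The area statements for unequal class intervals can be checked numerically. The sketch below uses the first 20 ages of the listing and an arbitrary set of unequal breaks chosen for illustration; `hist(..., plot = FALSE)` only computes the counts and densities without drawing.

```r
# First 20 ages from the data listing, used only as a small sample
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
          21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

# Unequal class widths (5, 5, 10, 20), chosen only for this illustration
breaks <- c(17, 22, 27, 37, 57)
h <- hist(ages, breaks = breaks, right = TRUE, plot = FALSE)

widths       <- diff(h$breaks)
freq_density <- h$counts / widths      # class frequency / class width

# Heights = frequency density: total area equals the total frequency
area_frequency <- sum(freq_density * widths)

# Heights = relative frequency / class width (h$density): total area equals 1
area_relative <- sum(h$density * widths)

c(area_frequency = area_frequency, area_relative = area_relative)
```

With unequal breaks, `plot(h)` draws the density version by default, which is the form whose total area is 1.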


• Maximum concentration is in the age group 20-25
• Gives an idea about the shape of the distribution: for example, we can say that the distribution of ages is not symmetric; it is highly right skewed (see module on Skewness and Kurtosis)
• Extent of spread or variation (see module on Dispersion)
• What is bin width? Bin width refers to the length of each class interval.
• How to choose bin width? Well, R chose a bin width for you! The default number of bins in R is given by Sturges' rule
• Some other thumb rules: Doane's formula, Rice rule, Scott's rule, Freedman-Diaconis rule
• All you need to do is specify the option in breaks="" in R (see histogram in R-code slide); hist() accepts "Sturges", "Scott" and "FD" by name
• For more details on these rules-
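As a rough illustration of the rules that base R supports by name, the functions `nclass.Sturges()`, `nclass.scott()` and `nclass.FD()` return the number of classes each rule suggests. The sample below is the first 20 ages of the listing, so the numbers are not those of the full data.

```r
# First 20 ages from the data listing, used only as a small sample
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
          21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

k_sturges <- nclass.Sturges(ages)   # ceiling(log2(n) + 1)
k_scott   <- nclass.scott(ages)     # based on the standard deviation
k_fd      <- nclass.FD(ages)        # based on the interquartile range

c(Sturges = k_sturges, Scott = k_scott, FD = k_fd)

# The same rules can be requested by name through the breaks argument
h_fd <- hist(ages, breaks = "FD", plot = FALSE)
```

Note that `hist()` treats the suggested number of classes as a hint and then picks "pretty" break points, so the number of bars actually drawn can differ slightly from the rule's value.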



How to Choose optimal bin width?
• Lots of research is going on regarding the optimal bin width
• For all practical purposes, you can just rely on the default bin width option in R!
• You can also specify your own bin width in R, for example a bin width of 5
• Just make sure that the number of class intervals is neither too small nor too large (generally between 5 and 20). We show how to specify various bin widths in R so that you can choose the one that best suits your purpose
• Histogram with too small bin width (Problem: shows too much individual data and does not allow the underlying pattern, i.e. the frequency distribution of the data, to be easily seen)
• Histogram with too large bin width (Problem: the bins are too large and do not convey the properties of the distribution)
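The "between 5 and 20 classes" guideline can be checked before drawing anything. The sketch below counts the classes produced by the three bin widths used on the R-code slide (2, 5 and 25) over the range 17 to 67; the variable names are introduced here for illustration.

```r
# Range covered by the bins on the R-code slide
data_min <- 17
data_max <- 67

# Candidate bin widths: too small, reasonable, too large
widths <- c(2, 5, 25)

# Number of classes = number of break points minus one
n_classes <- sapply(widths, function(w) length(seq(data_min, data_max, by = w)) - 1)

# Guideline from the slide: generally between 5 and 20 classes
within_guideline <- n_classes >= 5 & n_classes <= 20

data.frame(width = widths, classes = n_classes, ok = within_guideline)
```

Only the width of 5 falls inside the guideline here, which is consistent with the other two being shown as the "too small" and "too large" examples.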




Frequency Polygon (for representing continuous data)


• A frequency polygon is formed by plotting the frequency of each class against its midpoint and joining the points by straight lines
• To get a closed polygon, we take two additional classes, one at each end, that have zero frequencies (the midpoints corresponding to these classes thus have zero frequencies)
• Basically, if superimposed on a histogram, it joins the midpoints of the tops of the rectangular bars by straight line segments
• We draw the frequency polygon for the ages data over the histogram itself

• But is there any additional information you can derive from a frequency polygon, over and above what the histogram gives?
• Not really! In fact, the histogram gives more information: while it shows the entire class intervals, a frequency polygon only shows the midpoints. To appreciate this fully, look at a frequency polygon without the corresponding histogram.


• In the construction we have made a simplification by drawing the class frequency corresponding to the midpoint of the class interval, thereby losing more information
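The construction described above (class midpoints, plus one zero-frequency class at each end) can also be sketched from the object that `hist()` returns. The sample is the first 20 ages of the listing and the breaks are chosen for this example only.

```r
# First 20 ages from the data listing, used only as a small sample
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
          21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

width <- 10
h <- hist(ages, breaks = seq(17, 57, by = width), plot = FALSE)

# Midpoints of the classes, with one extra zero-frequency class on each side
x <- c(h$mids[1] - width, h$mids, h$mids[length(h$mids)] + width)
y <- c(0, h$counts, 0)

# Histogram first, then the polygon on top of it
plot(h, xlim = c(10, 65), main = "Histogram with frequency polygon", xlab = "age")
lines(x, y, type = "b", lwd = 2)
```

Using `h$mids` and `h$counts` keeps the polygon tied to whatever breaks the histogram used, so the two cannot drift apart if the breaks are changed.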

Why use Frequency Polygon?
• For comparing two sets of data, the corresponding frequency polygons can be drawn on the same graph
• Drawing two histograms on the same diagram for comparison purposes is confusing
• The insurance company is looking at the profitability of investing in two regions: region A and region B
• The region with a higher proportion of 50-plus population demands more insurance
• The ages.both.regions.csv data gives the ages of a random sample of 507 people in both region A and region B
• Draw two histograms on the same diagram and try to compare

Why use Frequency polygon
Q. What can you infer? Practically nothing, right!



Why Frequency Polygon (Contd)
Draw two frequency polygons on the same diagram and compare.

• What can you infer? Can you infer better?
• Which region should have the higher insurance demand?

Ogives- Cumulative Frequency Curves
• Now suppose the insurance company wants answers to more particular questions:
  In region A, how many people are 50 years or more?
  In region A, how many people are 20 years or less?
  In region A, how many people are 60 years or more?
  (Similar questions for region B)
• It wants to design separate products for the age groups 20 or less, 20-50, and 50 and above, and a few additional schemes for 60-plus people
• Clearly, it needs to know the cumulative frequency for each age group!

Ogives for Region A

• A cumulative frequency curve or ogive is obtained by plotting cumulative, rather than individual, class frequencies. There may be two types of ogives:
  A curve showing the number of observations equal to or greater than the lower class limit of each corresponding class, referred to as the "more than type" ogive
  A curve showing the number of observations equal to or less than the upper class limit of each corresponding class, referred to as the "less than type" ogive

• Each successive point is joined by line segments to give the ogive

Ogives- For Region A

• The black plot gives the less than type ogive
• The purple plot gives the more than type ogive
• From the diagram, the insurance company readily has the answers. From the less than type ogive, observe that there are 114 people aged 22 or less, and around 100 people aged 20 or less
• 491 people are aged 52 or less, so slightly fewer than that are aged 50 or less
• From the more than type ogive, we infer that there are very few people above 60, something around 5
• Q. Draw the ogives for Region B and try to answer the above questions. Compare with the age distribution for region A
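Reading a value off an ogive between two class limits amounts to linear interpolation along the line segment. The sketch below builds both types of cumulative frequency for the first 20 ages of the listing and uses `approx()` to read the less than type ogive at age 50; the class limits and object names are chosen for this example, so the numbers are not those quoted for Region A.

```r
# First 20 ages from the data listing, used only as a small sample
ages <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
          21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

class_ends <- seq(17, 57, by = 10)
freq <- as.vector(table(cut(ages, breaks = class_ends)))

# Less than type: observations up to each upper class limit (0 at the first limit)
less_than <- c(0, cumsum(freq))

# More than type: observations above each lower class limit (0 at the last limit)
more_than <- rev(c(0, cumsum(rev(freq))))

# Reading the less than type ogive at age 50 by linear interpolation
at_50 <- approx(class_ends, less_than, xout = 50)$y

plot(class_ends, less_than, type = "b", ylim = c(0, length(ages)),
     main = "Ogives", xlab = "Class limits", ylab = "Cumulative frequency")
lines(class_ends, more_than, type = "b", col = "violet")
```

At every class limit the two cumulative frequencies add up to the total number of observations, which is a quick consistency check on a pair of ogives.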

Line Diagram
• The ages data is an example of cross-section data.
• Use any of the above diagrams depending on the nature of the cross-section data.
• But what if we are given time series data, that is, a series of observations corresponding to each time point?
• For example, consider the following data:

Year    Households with computer (%)
1985    8.2
1990    15
1994    22.8
1995    24.1
1998    36.6
1999    42.1
2000    51

• How to represent this graphically?
• Need to represent each value corresponding to each given year

Source: Falling Through The Net: Toward Digital Inclusion (U.S. Department of Commerce, October 2000)



Line Chart


• Plot years on the horizontal axis and mark the values corresponding to each year on the vertical axis
• Join the points by line segments. We have our line graph ready!
• Think: Can we construct a histogram, ogive or bar chart with this data? Why or why not?
• A line diagram is meant for representing chronological data. It exhibits the relationship of the variable with time.

Line Chart: Inferences

• Shows an increasing trend over the years: from 1985 to 2000, the percentage of households with computers rises consistently
• From under 10% in 1985 it has crossed 50% in 2000, an increase of more than 500% relative to the 1985 level (from 8.2% to 51%)
• Useful for analysing the time trend, that is, the long-term movement of time series data
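The two inferences can be verified directly from the table on the Line Diagram slide. The sketch below re-enters the seven data points by hand; the object names are introduced here for illustration.

```r
# Values copied from the "Households with computer" table
year <- c(1985, 1990, 1994, 1995, 1998, 1999, 2000)
pct  <- c(8.2, 15, 22.8, 24.1, 36.6, 42.1, 51)

# Is every observation higher than the previous one?
always_rising <- all(diff(pct) > 0)

# Percentage increase from the first to the last observation
pct_increase <- (pct[length(pct)] - pct[1]) / pct[1] * 100

always_rising
round(pct_increase)   # about 522

plot(year, pct, type = "b", col = "blue",
     xlab = "Year", ylab = "Percentage of Households with Computer",
     main = "Line Chart")
```

Because the years are unevenly spaced, plotting against the actual `year` values (rather than against 1, 2, 3, ...) keeps the slope of each segment meaningful.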

Graphical Distortion of Data
• As much as graphs can be used to summarize and represent various aspects of data succinctly, they can also be used to distort data
• The first kind of distortion is inadequate representation of data. Consider the following line graph showing the population above the poverty line of a hypothetical country A
[Figure: line chart "People above poverty line", 1990 to 2010; vertical axis from 0 to 1200]

Seeing this graph, we conclude that poverty has been falling in this country as the number of people above poverty line is rising.


Graphical Distortion: Continued
• But this graph used inadequate information. This is the table from which the graph has been produced:
[Figure: line chart "Relative share of people above poverty line", 1990 to 2010; vertical axis from 0 to 0.8]

• Draw a line chart showing the relative share of people above the poverty line
• We see that the relative share of people above the poverty line is actually declining, and thus the relative share of people below the poverty line is actually rising
• Our earlier conclusion, based on a representation of inadequate data, was fallacious
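The effect can be reproduced with a few made-up numbers. Everything in the sketch below is HYPOTHETICAL and invented for illustration only (the table behind the slides' charts is not available here): the count of people above the poverty line rises every period, while their share of a faster-growing population falls.

```r
# HYPOTHETICAL numbers, invented for illustration only (not from the slides)
year       <- c(1990, 1995, 2000, 2005, 2010)
above_line <- c(200, 400, 600, 800, 1000)     # people above the poverty line
population <- c(300, 700, 1200, 1900, 2800)   # total population

share_above <- above_line / population

count_rising  <- all(diff(above_line) > 0)    # absolute count goes up
share_falling <- all(diff(share_above) < 0)   # relative share goes down

# Side-by-side charts: same data, opposite impressions
op <- par(mfrow = c(1, 2))
plot(year, above_line, type = "b", ylim = c(0, 1200),
     xlab = "Year", ylab = "People above poverty line")
plot(year, share_above, type = "b", ylim = c(0, 1),
     xlab = "Year", ylab = "Relative share above poverty line")
par(op)
```

Plotting the share alongside the count is one simple guard against this kind of misreading whenever the underlying population is changing.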


Graphical Distortion of data Contd..
• The above is just an example. There might be numerous ways in which data can be misrepresented
• For example, one common misuse is distortion with scale
• With the explosion of data visualization techniques and sophisticated displays like 3-D charts, data distortion can be easier to achieve
• For more information read chart.htm

R Codes
Histogram
data=read.csv('ages.continuous.csv',header=TRUE,sep=',')
View(data)
age=data$age
max(age)
colors=c("red", "bisque", "darkslategray", "violet", "orange", "blue", "pink", "cyan","brown","cornsilk")
# hist draws a histogram; right=TRUE means right-closed, left-open intervals
hist(age,right=TRUE,col=colors)
# To specify bin widths on your own
bins=seq(17,67,by=5)
hist(age,right=TRUE,breaks=bins,col=colors)
# Example of histogram with too small bin width
bins=seq(17,67,by=2)
hist(age,right=TRUE,breaks=bins,col=colors)
# Example of histogram with too large bin width
bins=seq(17,67,by=25)
hist(age,right=TRUE,breaks=bins,col=colors)
# Drawing a frequency polygon over a histogram
bins=seq(17,67,by=10)
hist(age,right=TRUE,breaks=bins,col=colors,xlim=c(10,75)) # draw the histogram
# class midpoints 22,...,62 plus one zero-frequency class at each end (12 and 72)
lines(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(age,bins))),0),lwd=2) # draw the frequency polygon



R Codes
Frequency Polygon
# ages.both.regions.csv is the file named on the "Why use Frequency Polygon?" slide
data=read.csv('ages.both.regions.csv',header=TRUE,sep=',')
RegionA.age=data$RegionA.age
RegionB.age=data$RegionB.age
max(RegionA.age)
min(RegionA.age)
max(RegionB.age)
min(RegionB.age)
bins.A=seq(17,67,by=10)
bins.B=seq(15,75,by=10)
# To draw two frequency polygons on the same graph
plot(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(RegionA.age,bins.A))),0),type="b",main="Frequency distribution of age",xlab="age",ylab="frequency",xlim=c(10,80),ylim=c(0,270))
lines(c(10,seq(20,70,by=10),80),c(0,as.vector(table(cut(RegionB.age,bins.B))),0),lwd=2,col="violet")

Line Chart
data=read.csv('Households with computer.csv',header=TRUE,sep=',')
# the column name after data$ is missing in the source
household.comp=data$
Year=data$Year
# dummy points outside the plotting range, used only to set up empty axes
x=c(0,0,0,0,0)
y=c(0,0,0,0,0)
plot(x,y,xlab="Year",ylab="Percentage of Households with Computer",type="b",xlim=c(1985,2000),ylim=c(5,65))
lines(Year,household.comp,type="b",col="blue")
title("Line Chart")



R Codes
Ogives
# data is assumed to hold the Region A ages in its first column
min(data)
max(data)
NumberOfClasses = 10
ClassInterval = (67 - 17)/NumberOfClasses
ClassInterval
ClassEnds = seq(17,67,5)
classes=cut(data[,1], breaks=ClassEnds)
FrequencyDistribution = table(classes)
CumulativeFrequencies = c(cumsum(FrequencyDistribution))
cbind(CumulativeFrequencies)
# Less than type Ogive
plot(ClassEnds,c(0,as.vector(CumulativeFrequencies)),type="b",xlim=c(10,70),ylim=c(0,700),main="Ogives",xlab="ClassIntervals",ylab="Cumulative Frequency of Age")
# More than type Ogive
cbind(FrequencyDistribution)
Frequency=as.vector(FrequencyDistribution)
cbind(as.vector(FrequencyDistribution))
More.than.cum.freq=cumsum(rev(Frequency))
# class limits in descending order, matching the reversed cumulative frequencies
Upper.limit=rev(ClassEnds)
lines(Upper.limit,c(0,More.than.cum.freq),type="b",col="violet")



Thank you
