You are on page 1of 23

Visualization-Basic Graphs 2-

Histograms & Box Plots


Mohamed A. Nour

Data
data(cars)

R Library Used
Basic R, ggplot2

Histograms and Box Plots


In this section, you will learn common techniques that data scientists and data analysts
use to generate Histograms and Box Plots for basic visualtization and data exploration

Histograms
Histograms display the distribution of a continuous variable by dividing the range of
scores into a specified number of bins on the x-axis and displaying the frequency of
scores in each bin on the y-axis.
Code - Histograms
par(mfrow=c(1,2)) # To display 2 graphs in one row
hist(mtcars$mpg)

hist(mtcars$mpg,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Colored histogram with 12 bins")
hist(mtcars$mpg,
freq=FALSE,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Histogram, rug plot, density curve")
rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col="blue", lwd=2)

x <- mtcars$mpg
h<-hist(x,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Histogram with normal curve and box")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
box()

Box plots
A box-and-whiskers plot describes the distribution of a continuous variable by plotting its
five-number summary.
Using parallel box plots to compare groups
Box plots can be created for individual variables or for variables by group.
Code - Box plots
boxplot(mpg~cyl,data=mtcars,
main="Car Milage Data",
xlab="Number of Cylinders",
ylab="Miles Per Gallon")

Notched box plots


Box plots are very versatile. By adding notch=TRUE, you get notched box plots. If two
boxes’ notches don’t overlap, there’s strong evidence that their medians differ.
Code - Notched box plots
boxplot(mpg~cyl,data=mtcars,
notch=TRUE,
varwidth=TRUE,
col="red",
main="Car Mileage Data",
xlab="Number of Cylinders",
ylab="Miles Per Gallon")

notes
Bar plots needn’t be based on counts or frequencies. You can create bar plots that
represent means, medians, standard deviations, and so forth by using the aggregate
function and passing the results to the barplot() function.
Box plots for two crossed factors
Finally, you can produce box plots for more than one grouping factor.
Code - Box plots for two crossed factors
mtcars$cyl.f <- factor(mtcars$cyl,
levels=c(4,6,8),
labels=c("4","6","8"))
mtcars$am.f <- factor(mtcars$am,
levels=c(0,1),
labels=c("auto","standard"))

boxplot(mpg ~ am.f *cyl.f,


data=mtcars,
varwidth=TRUE,
col=c("gold", "darkgreen"),
main="MPG Distribution by Auto Type",
xlab="Auto Type")
Understanding the Data Using
Histogram and Boxplot With
Example
Learn how to extract the maximum information from a Histogram and
Boxplot

Rashida Nasrin Sucky

Follow

Jun 10 · 5 min read

Understanding the data does not mean getting the mean, median,
standard deviation only. Lots of time it is important to learn the
variability or spread or distribution of the data. Both histogram
and boxplot are good for providing a lot of extra information about
a dataset that helps with the understanding of the data.

Histogram

A histogram takes only one variable from the dataset and shows
the frequency of each occurrence. I will use a simple dataset to
learn how histogram helps to understand a dataset. I will use
python to make the plots. Let’s import the dataset:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Cartwheeldata.csv")
df.head()

This dataset shows cartwheel data. Assume, people in an office


decided to go on a Cartwheel distance competition in a picnic. And
the dataset above shows the results. Let’s understand the data.

1. Make a histogram of ‘Age’.


sns.distplot(df['Age'], kde =False).set_title("Histogram of age")
From the picture above, it is clear that most people are below 30.
Only one person is 39and one is 54. The distribution is right-
skewed.

2. Make a distribution of ‘CWDistance’.


sns.distplot(df["CWDistance"], kde=False).set_title("Histogram of
CWDistance")
Such a nice stair! It is hard to say which range has the most
frequency.
3. Sometimes plotting two distribution together gives a good
understanding. Plot ‘Height’ and ‘CWDistance’ in the same figure.
sns.distplot(df["Height"], kde=False)
sns.distplot(df["CWDistance"], kde=False).set_title("Histogram of
height and score")
We cannot say that there is a relationship between Height and
CWDistance from this picture.

Now see, what kind of information we can extract from boxplots.

Boxplot

A boxplot shows the distribution of the data with more detailed


information. It shows the outliers more clearly, maximum,
minimum, quartile(Q1), third quartile(Q3), interquartile
range(IQR), and median. You can calculate the middle 50% from
the IQR. Here is the picture:
It also gives you the information about the skewness of the data,
how tightly closed the data is and the spread of the data.

Let’s see some examples using our Cartwheel data.

1. Make a boxplot of the ‘Score’.


sns.boxplot(df["Score"])
It is important to pay attention to the minimum as Q1–
1.5* IQR and maximum as Q3 + 1.5*IQR. They are not
literally the minimum and maximum of the data. This
indicates a reasonable range.

If your data has values less than Q1–1.5* IQR


and more than Q3 + 1.5*IQR, those data can be
called outliers or probably you should check the
data collection method one more time.

In this particular picture, the reasonable minimum value should


be, 4–1.5*4 which means -2. Our minimum value should not be
less than -2. At the same time, the estimated maximum should be
8 + 1.5*4 or 14. So, the maximum value of the data should be more
than 14. Luckily the minimum and maximum score as per the
dataset is 2 and 10 respectively. That means, there are no outliers
and the data collection process was also good. From the picture
above, the distribution also looks normal.

From this plot, we can say,

a. The distribution is normal.

b. Median is 6

c. The minimum score is 2


d. The maximum score is 10

e. The first quartile (first 25%) is at 4

f. The third quartile (75%) is at 8

g. Middle 50% of the data ranges from 4 to 8.

h. The interquartile range is 4.

2. It can be helpful to plot two variables in the same boxplot to


understand how one affects the other. Plot CWDistance and
‘Glasses’ in the same plot to see if glasses have any effect on
CWDistance.
sns.boxplot(x = df["CWDistance"], y = df["Glasses"])
People with no glasses have a higher median than the people with
glasses. The overall range for the people with no glasses is lower
but the IQR has higher values. From the picture above, IQR is
ranging from 72 to 94. But for the people with glasses overall
range of CWDistance is higher but IQR ranges from 66 to 90
which is less than the people with no glasses.

3. The histograms of CWDistance for the people with glasses and


no glasses separately may give more understanding.
g = sns.FacetGrid(df, row = "Glasses")
g = g.map(plt.hist, "CWDistance")
If you see, from this picture, the maximum frequency for the
people with glasses is at the beginning of the CWDistance. More
study is required to make an inference about the effect of glasses
on CWDistance. Constructing a confidence interval may help.
Bar Chart

Figure 4: Example of a bar chart.


Smith, V. (1996). Male and Female Occupations From the 1881 Census of England and
Wales.
CC BY-SA 4.0

Best used for:

 Reading numbers accurately from the chart


 Comparing individual bars to each other
 Exploring overall trends across categories
 Note, see histograms for continuous data.;

Data

 One (or more) text (categorical) variable


 One numerical variable

Strength;

 Familiar to most audiences


 High perceptual accuracy because they use alignment and length
Best practices

 Use horizontal bars to enhance label readability


 Consider sorting bars by length when alphabetical order is not important
 Axis must start at zero

Pie Chart

Figure 6: Example of a pie chart.


Sylvanmoon. (2016). Pie-Chart.
CC0

Best used for

 When separate categories add up to a meaningful whole


Data

 One text (categorical) variable


 One numerical variable

Strength

 Familiar to most audiences

Best practices
Pie charts are very difficult for humans to interpret accurately due to angle and area.  To
improve perception:

 Categories MUST add up to meaningful whole


 Only use when significant differences exist in category size
 Sort by size
 Ideally, limit your categories to two: the data point you wish to highlight and everything
else.
 Avoid 3D versions as they cause distortion.

Line Chart

Figure 7: Example of a line chart


Kazov. (2015). Compound Interest (English).
CC BY-SA 4.0
Best used for

 Visualizing continuous and temporal data

Data

 One date (categorical) variable on the x-axis (horizontal)


 One or more numerical variable on the y-axis (vertical)

Strength

 Familiar to most audiences


 Clearer choice over grouped and stacked bar charts when multiple categories are
needed

Best practices

 Use colour to differentiate categories


 Too many lines creates difficulty in readability
 Place labels to improve readability
 Add small dots to clarify measurement locations
 Axis must start at zero

You might also like