Visualization - Hist and Box

Visualization-Basic Graphs 2-
Histograms & Box Plots

Mohamed A. Nour
Data
data(mtcars)
R Libraries Used
Base R, ggplot2
Histograms and Box Plots

In this section, you will learn common techniques that data scientists and data analysts
use to generate histograms and box plots for basic visualization and data exploration.
Histograms
Histograms display the distribution of a continuous variable by dividing the range of
scores into a specified number of bins on the x-axis and displaying the frequency of
scores in each bin on the y-axis.
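The binning step described above can be sketched in a few lines of Python (the language used for the seaborn examples later in this document). This is only an illustration of the idea, not how hist() is implemented: the function name histogram_counts, the sample values, and the choice of 4 equal-width bins are all assumptions made for the example.

```python
# Sketch: count how many values fall into each equal-width bin.
# histogram_counts, the sample values and the bin count are illustrative only.
def histogram_counts(values, n_bins):
    """Return (bin_edges, counts) for n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts

mpg_sample = [21.0, 22.8, 18.7, 14.3, 24.4, 32.4, 30.4, 15.5, 19.2, 10.4]
edges, counts = histogram_counts(mpg_sample, 4)

# Text rendering: one row per bin, one '#' per observation.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * c}")
```

Each row of the printout corresponds to one bar of the histogram: the bin limits sit where the x-axis would be and the number of marks is the frequency.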
Code - Histograms
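par(mfrow=c(1,2)) # Display 2 graphs side by side in one row
# 1. Default histogram
hist(mtcars$mpg)
# 2. Colored histogram with about 12 bins
hist(mtcars$mpg,
     breaks=12,
     col="red",
     xlab="Miles Per Gallon",
     main="Colored histogram with 12 bins")
# 3. Density-scaled histogram with a rug plot and a kernel density curve
hist(mtcars$mpg,
     freq=FALSE,
     breaks=12,
     col="red",
     main="Histogram, rug plot, density curve")
rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col="blue", lwd=2)
# 4. Frequency histogram with a superimposed normal curve and a frame
x <- mtcars$mpg
h <- hist(x,
          breaks=12,
          col="red",
          main="Histogram with normal curve and box")
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
# Rescale the density to counts: bin width * number of observations
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
box()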
Box plots
A box-and-whiskers plot describes the distribution of a continuous variable by plotting its
five-number summary.
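The five-number summary consists of the minimum, lower quartile, median, upper quartile, and maximum. A minimal Python sketch of computing it is shown below. The function name five_number_summary and the sample values are assumptions for illustration, and quartile conventions differ between tools, so the quartiles from statistics.quantiles may not match R's output exactly on other data.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Made-up sample, chosen so the quartiles fall on actual data points.
sample = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(five_number_summary(sample))
```

In a box plot, Q1 and Q3 form the edges of the box, the median is the line inside it, and the whiskers reach toward the minimum and maximum.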
Using parallel box plots to compare groups
Box plots can be created for individual variables or for variables by group.
Code - Box plots
boxplot(mpg~cyl, data=mtcars,
        main="Car Mileage Data",
        xlab="Number of Cylinders",
        ylab="Miles Per Gallon")
Notched box plots

Box plots are very versatile. By adding notch=TRUE, you get notched box plots. If two
boxes’ notches don’t overlap, there’s strong evidence that their medians differ.
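To make the overlap rule concrete, here is a hedged Python sketch. R's documentation for boxplot.stats states that notches extend to +/-1.58 IQR/sqrt(n) around the median; the sketch applies that formula but uses quartiles from statistics.quantiles rather than R's hinges, so the numbers can differ slightly from R's. The function names and both groups of values are made up for the example.

```python
import math
import statistics

def notch_interval(values):
    """Approximate notch limits: median +/- 1.58 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True when the notch intervals of the two groups intersect."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical mileage values for two groups of cars.
group_a = [21, 22, 23, 24, 26, 27, 30, 32, 33]
group_b = [13, 14, 15, 15, 16, 17, 18, 19, 19]

print(notch_interval(group_a))
print(notch_interval(group_b))
print("overlap:", notches_overlap(group_a, group_b))
```

When the function reports no overlap, the notched plot would show two boxes whose notches clear each other, which is the visual cue described above.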
Code - Notched box plots
boxplot(mpg~cyl, data=mtcars,
        notch=TRUE,
        varwidth=TRUE,
        col="red",
        main="Car Mileage Data",
        xlab="Number of Cylinders",
        ylab="Miles Per Gallon")
Notes
Bar plots needn’t be based on counts or frequencies. You can create bar plots that
represent means, medians, standard deviations, and so forth by using the aggregate()
function and passing the results to the barplot() function.
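The aggregate-then-plot idea can be sketched in Python as well. This is not the R aggregate()/barplot() workflow itself, only the same two steps written with the standard library: compute one summary value per group, then draw one bar per group. The function name group_means and the (cylinders, mpg) pairs are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def group_means(pairs):
    """Aggregate (group, value) pairs into {group: mean of values}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: mean(vals) for key, vals in sorted(groups.items())}

# Hypothetical (cylinders, mpg) records.
records = [(4, 22.8), (4, 24.4), (6, 21.0), (6, 19.2), (8, 18.7), (8, 14.3)]
means = group_means(records)

# Text bar plot: bar length is the rounded group mean, not a count.
for cyl, avg in means.items():
    print(f"{cyl} cyl | {'#' * round(avg)} {avg:.1f}")
```

Swapping mean for median or stdev from the statistics module gives bars for those summaries instead.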
Box plots for two crossed factors
Finally, you can produce box plots for more than one grouping factor.
Code - Box plots for two crossed factors
mtcars$cyl.f <- factor(mtcars$cyl,
                       levels=c(4,6,8),
                       labels=c("4","6","8"))
mtcars$am.f <- factor(mtcars$am,
                      levels=c(0,1),
                      labels=c("auto","standard"))
boxplot(mpg ~ am.f * cyl.f,
        data=mtcars,
        varwidth=TRUE,
        col=c("gold", "darkgreen"),
        main="MPG Distribution by Auto Type",
        xlab="Auto Type")
Understanding the Data Using
Histogram and Boxplot With
Example
Learn how to extract the maximum information from a Histogram and
Boxplot
Rashida Nasrin Sucky
Jun 10 · 5 min read
Understanding the data does not mean getting only the mean, median,
and standard deviation. Often it is important to learn the
variability, spread, or distribution of the data. Both the histogram
and the boxplot provide a lot of extra information about
a dataset that helps with understanding the data.
Histogram
A histogram takes only one variable from the dataset and shows
the frequency of each range of values. I will use a simple dataset to
show how a histogram helps in understanding a dataset. I will use
Python to make the plots. Let’s import the dataset:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Cartwheeldata.csv")
df.head()
This dataset shows cartwheel data. Assume that people in an office
decided to hold a cartwheel distance competition at a picnic, and
the dataset above shows the results. Let’s understand the data.
1. Make a histogram of ‘Age’.
sns.histplot(df['Age']).set_title("Histogram of age")
From the picture above, it is clear that most people are below 30.
Only one person is 39 and one is 54. The distribution is
right-skewed.
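One quick numerical check for right skew is to compare the mean with the median: a few large values pull the mean upward while the median barely moves. The sketch below uses made-up ages that mimic the description above (most below 30, one 39, one 54); they are not the values in Cartwheeldata.csv, and the comparison is a rule of thumb rather than a formal test.

```python
from statistics import mean, median

# Hypothetical ages: most below 30, plus one 39 and one 54.
ages = [22, 23, 24, 25, 26, 26, 27, 28, 29, 39, 54]

age_mean = mean(ages)
age_median = median(ages)
print(f"mean = {age_mean:.2f}, median = {age_median}")

# Heuristic: a mean noticeably above the median suggests a long right tail.
looks_right_skewed = age_mean > age_median
print("right-skewed (heuristic):", looks_right_skewed)
```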
2. Make a histogram of ‘CWDistance’.
sns.histplot(df["CWDistance"]).set_title("Histogram of CWDistance")
Such a nice staircase! It is hard to say which range has the highest
frequency.
3. Sometimes plotting two distributions together gives a good
understanding. Plot ‘Height’ and ‘CWDistance’ in the same figure.
sns.histplot(df["Height"])
sns.histplot(df["CWDistance"]).set_title("Histogram of height and score")
We cannot say that there is a relationship between Height and
CWDistance from this picture.
Now let’s see what kind of information we can extract from boxplots.
Boxplot
A boxplot shows the distribution of the data in more detail.
It shows the outliers more clearly, along with the maximum,
minimum, first quartile (Q1), third quartile (Q3), interquartile
range (IQR), and median. The IQR spans the middle 50% of
the data. Here is the picture:
It also gives you information about the skewness of the data,
how tightly grouped the data is, and the spread of the data.
Let’s see some examples using our Cartwheel data.
1. Make a boxplot of the ‘Score’.
sns.boxplot(x=df["Score"])
It is important to note that the boxplot's "minimum" is
Q1 - 1.5*IQR and its "maximum" is Q3 + 1.5*IQR. They are not
literally the minimum and maximum of the data. They
indicate a reasonable range.
If your data has values less than Q1 - 1.5*IQR
or more than Q3 + 1.5*IQR, those values can be
called outliers, or you should probably check the
data collection method one more time.
In this particular picture, the reasonable minimum value is
4 - 1.5*4, which is -2. Our minimum value should not be
less than -2. Likewise, the reasonable maximum is
8 + 1.5*4, or 14. So the maximum value of the data should not be more
than 14. The actual minimum and maximum scores in the
dataset are 2 and 10 respectively. That means there are no outliers
and the data collection process was also good. From the picture
above, the distribution also looks normal.
From this plot, we can say,
a. The distribution is normal.
b. Median is 6
c. The minimum score is 2
d. The maximum score is 10
e. The first quartile (first 25%) is at 4
f. The third quartile (75%) is at 8
g. Middle 50% of the data ranges from 4 to 8.
h. The interquartile range is 4.
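The arithmetic behind the reasonable range can be written as a small Python sketch. It uses the quartiles read from the plot (Q1 = 4, Q3 = 8); the function names tukey_fences and find_outliers are made up for this example, and the second list of scores is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Quartiles read from the Score boxplot above.
lower, upper = tukey_fences(4, 8)
print(lower, upper)  # -2.0 14.0

# Scores between 2 and 10 all sit inside the fences.
print(find_outliers([2, 4, 6, 8, 10], 4, 8))  # []

# A hypothetical score of 15 would be flagged.
print(find_outliers([2, 6, 15], 4, 8))  # [15]
```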
2. It can be helpful to plot two variables in the same boxplot to
understand how one affects the other. Plot ‘CWDistance’ and
‘Glasses’ in the same plot to see if glasses have any effect on
CWDistance.
sns.boxplot(x = df["CWDistance"], y = df["Glasses"])
People without glasses have a higher median than people with
glasses. Their overall range is narrower, but their IQR covers
higher values: from the picture above, it runs from 72 to 94.
For people with glasses, the overall range of CWDistance is
wider, but the IQR runs from 66 to 90, which is lower than
for people without glasses.
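The same comparison can be made numerically by computing the median, quartiles, and range for each group. The sketch below is only an illustration: the distances are made-up values chosen so that the quartiles match the ones read from the plot (72 to 94 and 66 to 90); they are not the real rows of the dataset, and the function name summarize is an assumption.

```python
import statistics

def summarize(values):
    """Return median, Q1, Q3 and overall range for one group."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": med, "q1": q1, "q3": q3,
            "range": max(values) - min(values)}

# Hypothetical CWDistance values keyed by the 'Glasses' column (N = no, Y = yes).
distances = {
    "N": [63, 72, 85, 94, 101],
    "Y": [56, 66, 79, 90, 115],
}

summary = {group: summarize(vals) for group, vals in distances.items()}
for group, stats in summary.items():
    print(group, stats)
```

With these invented numbers the no-glasses group has the higher median and the narrower overall range, which mirrors what the boxplot shows.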
3. Separate histograms of CWDistance for people with and
without glasses may give more understanding.
g = sns.FacetGrid(df, row = "Glasses")
g = g.map(plt.hist, "CWDistance")
In this picture, the highest frequency for the
people with glasses is at the low end of CWDistance. More
study is required to make an inference about the effect of glasses
on CWDistance. Constructing a confidence interval may help.
Bar Chart
Figure 4: Example of a bar chart.

Smith, V. (1996). Male and Female Occupations From the 1881 Census of England and
Wales.
CC BY-SA 4.0
Best used for
• Reading numbers accurately from the chart
• Comparing individual bars to each other
• Exploring overall trends across categories
• Note: see histograms for continuous data.
Data
• One (or more) text (categorical) variable
• One numerical variable
Strength
• Familiar to most audiences
• High perceptual accuracy because they use alignment and length
Best practices
• Use horizontal bars to enhance label readability
• Consider sorting bars by length when alphabetical order is not important
• Axis must start at zero
Pie Chart
Figure 6: Example of a pie chart.

Sylvanmoon. (2016). Pie-Chart.
CC0
Best used for
• When separate categories add up to a meaningful whole
Data
• One text (categorical) variable
• One numerical variable
Strength
• Familiar to most audiences
Best practices
Pie charts are very difficult for humans to interpret accurately because they rely on
judging angle and area. To improve perception:
• Categories MUST add up to a meaningful whole
• Only use when significant differences exist in category size
• Sort by size
• Ideally, limit your categories to two: the data point you wish to highlight and everything
else.
• Avoid 3D versions as they cause distortion.
Line Chart
Figure 7: Example of a line chart

Kazov. (2015). Compound Interest (English).
CC BY-SA 4.0
Best used for
• Visualizing continuous and temporal data
Data
• One date (categorical) variable on the x-axis (horizontal)
• One or more numerical variables on the y-axis (vertical)
Strength
• Familiar to most audiences
• Clearer choice over grouped and stacked bar charts when multiple categories are
needed
Best practices
• Use colour to differentiate categories
• Too many lines reduce readability
• Place labels to improve readability
• Add small dots to clarify measurement locations
• Axis must start at zero
