
Data Mining and Warehousing Assignment-1

By: Vanshika Garg


179302161

Introduction to Boxplots
A boxplot shows how the values in a data set are distributed. It uses three quartiles
(the first quartile, the median and the third quartile) to split the data into four equal
parts, and it displays the minimum, maximum, median, first quartile and third quartile of
the data set. This gives a good indication of the variability, or dispersion, of the data,
that is, how spread out the values are. Drawing one boxplot per data set also makes it
easy to compare distributions across data sets. Although boxplots
may seem primitive in comparison to a histogram or density plot, they have the
advantage of taking up less space, which is useful when comparing distributions
between many groups or datasets.
A boxplot is a standardized way of displaying the dataset based on a five-number
summary: the minimum, the maximum, the sample median, and the first and third
quartiles.

• Lower Extreme – the smallest value in a given dataset.
• Upper Extreme – the highest value in a given dataset.
• Median value – the middle number in the sorted set.
• Lower Quartile (Q1) – the value below which the lowest 25% of the data lie.
• Upper Quartile (Q3) – the value above which the highest 25% of the data lie.
• "Whiskers" – the lines that extend from the box. They show how far the data
extend beyond the lower and upper quartiles.
• The "interquartile range", abbreviated "IQR", is the width of the box in the
box-and-whisker plot. That is, IQR = Q3 – Q1. The IQR can be used as a
measure of how spread out the values are.
• Statistics assumes that your values are clustered around some central value.
The IQR tells how spread out the "middle" values are; it can also be used to
tell when some of the other values are "too far" from the central value. These
"too far away" points are called "outliers", because they "lie outside" the range
in which we expect them.
• An outlier is any value that lies more than one and a half times the IQR (the
length of the box) from either end of the box, that is, below Q1 – 1.5 × IQR or
above Q3 + 1.5 × IQR.
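
The summary values described in the list above can be worked out before any plotting is done. The following is a minimal sketch using only Python's standard library. The helper names five_number_summary and find_outliers are made up for this illustration, and the "inclusive" quartile method is an assumption, since several quartile conventions exist and they can give slightly different results. The data is the value1 list used in the plotting examples of this assignment.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, Q1, median, Q3 and maximum of a list of numbers."""
    # method="inclusive" interpolates linearly between data points;
    # other quartile conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def find_outliers(values):
    """Return the values lying more than 1.5 * IQR beyond either end of the box."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]  # IQR = Q3 - Q1
    lower_limit = summary["q1"] - 1.5 * iqr
    upper_limit = summary["q3"] + 1.5 * iqr
    return sorted(v for v in values if v < lower_limit or v > upper_limit)


value1 = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32,
          98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

summary = five_number_summary(value1)
print(summary)                # min 24, Q1 60.5, median 71.5, Q3 79.0, max 98
print(find_outliers(value1))  # [24, 32]
```

Here IQR = 79.0 – 60.5 = 18.5, so the limits are 60.5 – 27.75 = 32.75 and 79.0 + 27.75 = 106.75. The values 24 and 32 fall below the lower limit and are therefore reported as outliers.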
OPERATIONS ON BOX PLOT:

➢ CREATING BOX PLOT

Four lists of values are collected into a list named box_plot_data, which is then
plotted with the boxplot() function.
import matplotlib.pyplot as plt

# Four data sets with 20 values each
value1 = [82,76,24,40,67,62,75,78,71,32,98,89,78,67,72,82,87,66,56,52]
value2 = [62,5,91,25,36,32,96,95,3,90,95,32,27,55,100,15,71,11,37,21]
value3 = [23,89,12,78,72,89,25,69,68,86,19,49,15,16,16,75,65,31,25,52]
value4 = [59,73,70,16,81,61,88,98,10,87,29,72,16,23,72,88,78,99,75,30]
# One box is drawn for each inner list
box_plot_data = [value1,value2,value3,value4]
plt.boxplot(box_plot_data)
plt.show()
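
To check what the plot above actually draws, the numbers behind a box can be inspected without opening a figure. This is a small sketch that relies on matplotlib's helper matplotlib.cbook.boxplot_stats, which computes the statistics used for drawing. Only value1 is used here to keep the output short, and the default whisker setting (1.5 times the IQR) is assumed.

```python
from matplotlib import cbook

value1 = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32,
          98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

# boxplot_stats returns one dictionary of statistics per data set
stats = cbook.boxplot_stats(value1)[0]

print("Q1:", stats["q1"], "median:", stats["med"], "Q3:", stats["q3"])
print("whiskers end at:", stats["whislo"], "and", stats["whishi"])
# "fliers" holds the points that are drawn separately as outliers
print("outlier points:", sorted(int(v) for v in stats["fliers"]))
```

With the default settings the whiskers stop at the most extreme values that lie within 1.5 × IQR of the box. For value1 the lower whisker therefore ends at 40, and the values 24 and 32 are drawn as separate outlier points.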
➢ CREATING BOX PLOT WITH FILLS AND LABELS:

The boxplot() function takes the data to be plotted as its first argument. The
keyword argument patch_artist=True fills the boxes with color, and the labels
argument supplies the label shown for each box.

import matplotlib.pyplot as plt

value1 = [82,76,24,40,67,62,75,78,71,32,98,89,78,67,72,82,87,66,56,52]
value2 = [62,5,91,25,36,32,96,95,3,90,95,32,27,55,100,15,71,11,37,21]
value3 = [23,89,12,78,72,89,25,69,68,86,19,49,15,16,16,75,65,31,25,52]
value4 = [59,73,70,16,81,61,88,98,10,87,29,72,16,23,72,88,78,99,75,30]
box_plot_data = [value1,value2,value3,value4]
# patch_artist=True fills the boxes; labels names each box on the axis
# (Matplotlib 3.9 and later call this parameter tick_labels)
plt.boxplot(box_plot_data,patch_artist=True,
            labels=['course1','course2','course3','course4'])
plt.show()

➢ HORIZONTAL BOX PLOT IN PYTHON WITH DIFFERENT COLORS:

The boxplot() function takes the argument vert=0, which draws the box plot
horizontally.

The colors list holds four different colors, which are applied one by one to the
four boxes of the boxplot with the patch.set_facecolor() function.

import matplotlib.pyplot as plt

value1 = [82,76,24,40,67,62,75,78,71,32,98,89,78,67,72,82,87,66,56,52]
value2 = [62,5,91,25,36,32,96,95,3,90,95,32,27,55,100,15,71,11,37,21]
value3 = [23,89,12,78,72,89,25,69,68,86,19,49,15,16,16,75,65,31,25,52]
value4 = [59,73,70,16,81,61,88,98,10,87,29,72,16,23,72,88,78,99,75,30]
box_plot_data = [value1,value2,value3,value4]
# vert=0 draws the boxes horizontally; boxplot() returns a dictionary of the drawn parts
box = plt.boxplot(box_plot_data,vert=0,patch_artist=True,
                  labels=['course1','course2','course3','course4'])
colors = ['cyan', 'lightblue', 'lightgreen', 'tan']
# Pair each box with one color and fill it
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)
plt.show()
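
The dictionary returned by boxplot() holds more than the boxes. As a hedged sketch of the same idea as the loop above, the median and whisker lines can be restyled through the 'medians' and 'whiskers' entries. The colors, line widths and the output file name boxplot_styled.png are arbitrary choices for this illustration, only two of the data sets are used, and the Agg backend line is an assumption so that the sketch also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove to open a window
import matplotlib.pyplot as plt

value1 = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32,
          98, 89, 78, 67, 72, 82, 87, 66, 56, 52]
value2 = [62, 5, 91, 25, 36, 32, 96, 95, 3, 90,
          95, 32, 27, 55, 100, 15, 71, 11, 37, 21]
box_plot_data = [value1, value2]

fig, ax = plt.subplots()
box = ax.boxplot(box_plot_data, patch_artist=True)

# Fill each box with its own color, as in the example above
for patch, color in zip(box['boxes'], ['cyan', 'lightblue']):
    patch.set_facecolor(color)

# Each box has one median line and two whisker lines
for median in box['medians']:
    median.set_color('black')
    median.set_linewidth(2)
for whisker in box['whiskers']:
    whisker.set_linestyle('--')

# Name the boxes on the x-axis
ax.set_xticklabels(['course1', 'course2'])
fig.savefig('boxplot_styled.png')  # hypothetical output file name
```

Other entries of the returned dictionary, such as 'caps' and 'fliers', can be styled in the same way.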
