You are on page 1of 19

1

Lahore Garrison University


MSDS-111 Statistical and Mathematical
Methods
for Data Science
Week-2 Semester-
Spring-2024
Prepared by:

Dr.Quratulain Rana

Assistant Professor
2
Learning Objectives

After studying this lecture, the student will be able to:


 Plot a histogram for a dataset.
 Tell whether the histogram is skewed or not, and in which direction.
 Construct a stem & leaf plot for one or several datasets.
 Plot a box plot for one or several datasets.
 Interpret a box plot.
 Use histograms, summaries and box plots to investigate datasets.

Dr. Quratulain Rana


3
Data Visualization

Following are the number of simple and advanced ways to visualize data:
 histograms,
 stem-and-leaf plots,
 boxplots,
 time plots, and
 scatter plots.

Dr. Quratulain Rana


4
Histogram

A histogram shows the shape of data, checks for homogeneity, and suggests possible outliers. To construct a
histogram, we split the range of data into equal intervals, “bins,” and count how many observations fall into
each bin.
• A frequency histogram consists of columns, one for each bin, whose height is determined by the number of
observations in the bin.
• A relative frequency histogram has the same shape but a different vertical scale. Its column heights
represent the proportion of all data that appeared in each bin.
Example 6:The sample of CPU times(data set Example:1) stretches from 9 to 139 seconds. Choosing
intervals [0,14], [14,28], [28,42], . . . as bins, Draw a
i. Frequency Histogram
ii. Relative Frequency Histogram

Dr. Quratulain Rana


5
Histogram

Dr. Quratulain Rana


6
Supporting meterial

Dr. Quratulain Rana


7
Supporting meterial

Dr. Quratulain Rana


8
Supporting meterial

Dr. Quratulain Rana


9
Stem and leaf plot

Stem-and-leaf plots are similar to histograms although they carry more information. Namely, they
also show how the data are distributed within columns.

Dr. Quratulain Rana


10
Stem and leaf plot

Example: 7 Using data given in example:1, Make a stem and leaf plot

Solution

Dr. Quratulain Rana


11
Stem and leaf plot

Example: 8
Sometimes stem-and-leaf plots are used to compare two samples. For this purpose, one can put two
leaves on the same stem. Consider, for example, samples of round-trip transit times (known as pings)
received from two locations (data set Pings).

Dr. Quratulain Rana Contd:


12
Stem and leaf plot

Dr. Quratulain Rana


13
Box plot

The main descriptive statistics of a sample can be represented graphically by a boxplot. To construct a
boxplot, we draw a box between the first and the third quartiles, a line inside a box for a median, and
extend whiskers to the smallest and the largest observations, thus representing a so-called five-point
summary:

Example:8 The mean and five-point summary of CPU times were found in previous Examples as:

Make a stem and leaf plot


Dr. Quratulain Rana
14
Box plot

A boxplot is drawn. The mean is depicted with a “+,” and the right whisker extends till the second largest observation
X = 89 because X = 139 is a suspected outlier (depicted with a little circle).
From this boxplot, one can conclude:
•– The distribution of CPU times is right skewed because (1) the mean exceeds the median, and (2) the right half of
the box is larger than the left half.
•– Each half of a box and each whisker represents approximately 25% of the population. For example, we expect
about 25% of all CPU times to fall between 42.5 and 59 seconds.
Dr. Quratulain Rana
15
Practice Question:1

The numbers of blocked intrusion attempts on each day during the first two weeks of the month were:
56, 47, 49, 37, 38, 60, 50, 43, 43, 59, 50, 56, 54, 58
After the change of firewall settings, the numbers of blocked intrusions during the next 20
days were:
53, 21, 32, 49, 45, 38, 44, 33, 32, 43, 53, 46, 36, 48, 39, 35, 37, 36, 39, 45.
Comparing the number of blocked intrusions before and after the change,
(a) construct side-by-side stem-and-leaf plots;
(b) compute the five-point summaries and construct box plots;
(c) comment on your findings.

Dr. Quratulain Rana


16
Practice Question:2

A network provider investigates the load of its network. The number of concurrent users is recorded at
fifty locations (thousands of people),

(a) Compute the sample mean, variance, and standard deviation of the number of concurrent users.
(b) Compute the five-point summary and construct a boxplot.
(c) Compute the interquartile range. Are there any outliers?
(d) It is reported that the number of concurrent users follows approximately Normal distribution. Does the
histogram support this claim?

Dr. Quratulain Rana


17
Practice Question:3

Consider three data sets:

a) For each data set, draw a histogram and determine whether the distribution is right- skewed, left-
skewed, or symmetric.
b) Compute sample means and sample medians. Do they support your findings about skewness and
symmetry? How?
Dr. Quratulain Rana
18
Practice Question:4

The following data set represents the number of new computer accounts registered during ten consecutive
days.
43, 37, 50, 51, 58, 105, 52, 45, 45, 10.
a) Compute the mean, median, quartiles, and standard deviation.
b) Check for outliers using the 1.5(IQR) rule.
c) Delete the detected outliers and compute the mean, median, quartiles, and standard deviation again.
d) Make a conclusion about the effect of outliers on basic descriptive statistics.

Dr. Quratulain Rana


19
References

 Probability and Statistics for Computer Scientists, 2nd Edition, Michael Baron
 Linear Algebra and Its Applications, 5th Edition, David C. Lay and Steven R. Lay
 Introduction to Linear Algebra, 5th Edition, Gilbert Strang
 Probability for Computer Scientists, online Edition, David Forsyth.

You might also like