Professional Documents
Culture Documents
Ms Data Science S, 24 (WEEK# 2)
Ms Data Science S, 24 (WEEK# 2)
Dr.Quratulain Rana
Assistant Professor
2
Learning Objectives
Following are the number of simple and advanced ways to visualize data:
histograms,
stem-and-leaf plots,
boxplots,
time plots, and
scatter plots.
A histogram shows the shape of data, checks for homogeneity, and suggests possible outliers. To construct a
histogram, we split the range of data into equal intervals, “bins,” and count how many observations fall into
each bin.
• A frequency histogram consists of columns, one for each bin, whose height is determined by the number of
observations in the bin.
• A relative frequency histogram has the same shape but a different vertical scale. Its column heights
represent the proportion of all data that appeared in each bin.
Example 6:The sample of CPU times(data set Example:1) stretches from 9 to 139 seconds. Choosing
intervals [0,14], [14,28], [28,42], . . . as bins, Draw a
i. Frequency Histogram
ii. Relative Frequency Histogram
Stem-and-leaf plots are similar to histograms although they carry more information. Namely, they
also show how the data are distributed within columns.
Example: 7 Using data given in example:1, Make a stem and leaf plot
Solution
Example: 8
Sometimes stem-and-leaf plots are used to compare two samples. For this purpose, one can put two
leaves on the same stem. Consider, for example, samples of round-trip transit times (known as pings)
received from two locations (data set Pings).
The main descriptive statistics of a sample can be represented graphically by a boxplot. To construct a
boxplot, we draw a box between the first and the third quartiles, a line inside a box for a median, and
extend whiskers to the smallest and the largest observations, thus representing a so-called five-point
summary:
Example:8 The mean and five-point summary of CPU times were found in previous Examples as:
A boxplot is drawn. The mean is depicted with a “+,” and the right whisker extends till the second largest observation
X = 89 because X = 139 is a suspected outlier (depicted with a little circle).
From this boxplot, one can conclude:
•– The distribution of CPU times is right skewed because (1) the mean exceeds the median, and (2) the right half of
the box is larger than the left half.
•– Each half of a box and each whisker represents approximately 25% of the population. For example, we expect
about 25% of all CPU times to fall between 42.5 and 59 seconds.
Dr. Quratulain Rana
15
Practice Question:1
The numbers of blocked intrusion attempts on each day during the first two weeks of the month were:
56, 47, 49, 37, 38, 60, 50, 43, 43, 59, 50, 56, 54, 58
After the change of firewall settings, the numbers of blocked intrusions during the next 20
days were:
53, 21, 32, 49, 45, 38, 44, 33, 32, 43, 53, 46, 36, 48, 39, 35, 37, 36, 39, 45.
Comparing the number of blocked intrusions before and after the change,
(a) construct side-by-side stem-and-leaf plots;
(b) compute the five-point summaries and construct box plots;
(c) comment on your findings.
A network provider investigates the load of its network. The number of concurrent users is recorded at
fifty locations (thousands of people),
(a) Compute the sample mean, variance, and standard deviation of the number of concurrent users.
(b) Compute the five-point summary and construct a boxplot.
(c) Compute the interquartile range. Are there any outliers?
(d) It is reported that the number of concurrent users follows approximately Normal distribution. Does the
histogram support this claim?
a) For each data set, draw a histogram and determine whether the distribution is right- skewed, left-
skewed, or symmetric.
b) Compute sample means and sample medians. Do they support your findings about skewness and
symmetry? How?
Dr. Quratulain Rana
18
Practice Question:4
The following data set represents the number of new computer accounts registered during ten consecutive
days.
43, 37, 50, 51, 58, 105, 52, 45, 45, 10.
a) Compute the mean, median, quartiles, and standard deviation.
b) Check for outliers using the 1.5(IQR) rule.
c) Delete the detected outliers and compute the mean, median, quartiles, and standard deviation again.
d) Make a conclusion about the effect of outliers on basic descriptive statistics.
Probability and Statistics for Computer Scientists, 2nd Edition, Michael Baron
Linear Algebra and Its Applications, 5th Edition, David C. Lay and Steven R. Lay
Introduction to Linear Algebra, 5th Edition, Gilbert Strang
Probability for Computer Scientists, online Edition, David Forsyth.