
Introduction to Data Science

Introduction
Preet Kanwal
Assistant Professor
Department of CSE
PESU Bangalore
Talks about Data Science

Data Science is an emerging area of work.

Universities are offering master's courses in data science.

Without Data Science, a business will be treated as a dead man walking: without a heart, a soul and a mind.

Prof. Preet Kanwal


Talks about Data Science
Harvard Business Review dubbed it "The Sexiest Job of the 21st Century".

Data scientists are regarded as being "as rare as unicorns".

The demand for Data Scientists has grown by 350% in the past
five years, and is predicted to continue to rise sharply.



Applications of Data Science
1. Internet Search
2. Entertainment
3. Recommender Systems
4. Image Recognition
5. Price Comparison Websites
6. Airline Route Planning

Applications of Data Science
*****Almost everywhere*****

Government: What proportion of the population of India is BPL (Below Poverty Line)?

Marketing: Which products are best for upselling?

Health Care: Smoking and death rates (1958): death rates were found to be higher among regular cigarette smokers than among men who never smoked.

Human Resources: Which employees are most likely to leave? How are employees performing? How should employee bonuses be decided?


The next generation of scientific discovery and technological innovation will be data-driven. We live in an exponential world.


Is it the Same as Big Data?

Data Science vs. Big Data

The key word in "Data Science" is not Data, it is Science.


Data Science

"Data Science" is an interdisciplinary field that deals with:

What data represents
How to extract knowledge or insights from data in various forms
How it can be turned into a valuable resource

Data Science builds on concepts in:
Mathematics
Computer Science
Statistics


Two ways to approach Data Science
1) Top-Down Approach
2) Bottom-Up Approach


Textbooks
For Concepts:
Statistics for Engineers and Scientists by William Navidi, McGraw Hill Education, 3rd Edition.

For Applications:
Data Science from Scratch by Joel Grus, O'Reilly, 1st Edition.

Statistics?
The technique of:
Collecting,
Analyzing, and
Drawing conclusions from data.

Everything that deals with data belongs to the domain of statistics. Examples:
1) Calculating the average length of downtimes of a computer.
2) Recording the number of persons attending a seminar.
3) Evaluating the effectiveness of products.
4) Predicting the reliability of a rocket.
5) Studying the vibrations of airplane wings.


Origin of Statistics
Can be traced to two areas:
1) Games of chance (probability)
2) Political science


Types of Statistics



Data?

Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things.


Data Classification


Quantitative vs Qualitative Data
Quantitative, or numerical, data is numerical information (numbers).
"Quantitative is about Quantity."
Quantitative data can be Discrete or Continuous.

Qualitative, or categorical, data is descriptive information (it describes something).


Discrete Data - Counted
Discrete data can only take certain values.

Discrete data appear as isolated points on the number line.

Continuous Data - Measured
Continuous data can take any value (within a range).


Census or Sample
A Census is when we collect data for every member of the group (the whole "population").
A Sample is when we collect data just for selected members of the group.

Example: 120 people in your local football club.

Census: Ask everyone (all 120) what their age is. Accurate. Hard to do.
Sample: Choose the people that are there this afternoon. Not as accurate, but may be good enough. A lot easier.

Population is the entire collection of objects.

Sample is a subset of the population, containing the objects that are actually observed.


Simple Random Sampling
A simple random sample (SRS) is not guaranteed to reflect the population perfectly.

SRSs always differ in some ways from each other; occasionally a sample is substantially different from the population.

Two different samples from the same population will vary from each other as well.

This phenomenon is known as sampling variation.
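Sampling variation can be seen in a small simulation. The sketch below is illustrative only: the population (the integers 1 to 1000), the sample size, the number of samples and the seed are all assumptions chosen for the demonstration, not values from the slides.

```python
import random


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw several simple random samples and return the mean of each."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    means = []
    for _ in range(num_samples):
        # rng.sample draws without replacement, so every subset of the
        # requested size is equally likely: a simple random sample.
        sample = rng.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return means


population = list(range(1, 1001))  # hypothetical population: 1..1000
population_mean = sum(population) / len(population)
means = sample_means(population, sample_size=10, num_samples=5)

print("population mean:", population_mean)
print("sample means:   ", means)
```

Each sample mean lands somewhere near the population mean, but the samples do not agree with the population or with each other. That disagreement is the sampling variation.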


Question

Is a simple random sample guaranteed to reflect exactly the population from which it was drawn?


Question
A coin is tossed twice and comes up heads both times.
Someone says, "There's something wrong with this coin. A coin is supposed to come up heads only half of the time, not every time."

a) Is it reasonable to conclude that something is wrong with the coin?

b) If the coin comes up heads 100 times in a row, would it be reasonable to conclude that something is wrong with the coin?


A coin is tossed twice and comes up heads both times.
Someone says, "There's something wrong with this coin. A coin is supposed to come up heads only half of the time, not every time."

a) Is it reasonable to conclude that something is wrong with the coin?
No. This could well be sampling variation.

b) If the coin comes up heads 100 times in a row, would it be reasonable to conclude that something is wrong with the coin?
Yes. It is virtually impossible for sampling variation to be this large.
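The two answers can be backed by a quick calculation. The sketch below assumes a fair coin and independent tosses; the function name is made up for illustration.

```python
def prob_all_heads(num_tosses, p_heads=0.5):
    """Probability that every one of num_tosses independent tosses is heads."""
    return p_heads ** num_tosses


two = prob_all_heads(2)
hundred = prob_all_heads(100)

print(two)      # 0.25: a fair coin does this one time in four
print(hundred)  # on the order of 1e-30: effectively never for a fair coin
```

Two heads in a row is unremarkable for a fair coin, while 100 in a row is so improbable that a faulty coin is the far more reasonable explanation.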


Summary Statistics
Help make important features of a sample stand out.

Measures of Central Value:
Sample Mean (or the average)
Sample Median (middle number): divides the sample in half
Sample Mode (most frequently occurring value)

Measures of Spread:
Sample Variance (the average of the squared differences from the mean)
Sample Standard Deviation (using the standard deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small)
Range
Quartiles
IQR (InterQuartile Range)
Percentiles

Median

The median is another measure of centre, like the mean.

Order the n data points from smallest to largest. Then:

If n is odd, the sample median is the number in position (n + 1)/2.

If n is even, the sample median is the average of the numbers in positions n/2 and n/2 + 1.
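The odd and even cases above translate directly into code. This is a minimal sketch; the function name is illustrative, and the positions in the rule are 1-based while Python lists are 0-based, hence the "- 1" adjustments.

```python
def sample_median(values):
    """Median using the odd/even position rule (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # odd n: the value in position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: the average of the values in positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(sample_median([3, 1, 2]))     # 2
print(sample_median([4, 1, 3, 2]))  # 2.5
```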


Median Application
An advantage of the median is that it is not influenced as much by an outlier.

Hence, the median is the preferred measure of centre when the data contains outliers.

Example:

When incomes are reported, a typical approach is to report the median income.

This is done because the mean income is skewed by a small number of people with very high incomes (think about the incomes of Bill Gates and Oprah).

Variance and Standard Deviation:
total = 0
for each sample_item in sample:
    diff = sample_item - sample_mean
    sq_diff = square(diff)   // squared because diff could be negative
    total = total + sq_diff

Variance = total / (n - 1)

Unit of Variance = squared units of the data.

Standard deviation = sq_root(Variance)
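The pseudocode above can be written as runnable Python. This is a minimal sketch: the function names and the small data set are illustrative assumptions, not taken from the slides.

```python
import math


def sample_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(sample) / n
    total = sum((x - mean) ** 2 for x in sample)
    return total / (n - 1)


def sample_std_dev(sample):
    """Square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5
print(sample_variance(data))      # 32 / 7
print(sample_std_dev(data))
```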


Sample Variance - Why Square the Differences?

1. Add up the differences from the mean: the negatives cancel the positives:
(4 + 4 - 4 - 4) / 4 = 0

2. Use absolute values:
(|4| + |4| + |-4| + |-4|) / 4 = (4 + 4 + 4 + 4) / 4 = 4

That looks good, but what about this case:

(|7| + |1| + |-6| + |-2|) / 4 = (7 + 1 + 6 + 2) / 4 = 4

Oh no! It also gives a value of 4, even though the differences are more spread out.

3. So let us try squaring each difference (and taking the square root at the end):

sq_root((4^2 + 4^2 + 4^2 + 4^2) / 4) = sq_root(64 / 4) = 4
sq_root((7^2 + 1^2 + 6^2 + 2^2) / 4) = sq_root(90 / 4) = 4.74...

That is nice! The standard deviation is bigger when the differences are more spread out, just what we want.

Why divide by n - 1 instead of n? (Bessel's Correction)

The standard deviation is used to estimate the amount of spread in the population from which the sample was drawn.

Ideally we should compute deviations from the population mean instead of deviations from the sample mean.

Since the population mean is unknown, the sample mean is used in its place.

Mathematical fact: the sum of squared deviations around the sample mean is never larger, and is usually smaller, than the sum of squared deviations around the population mean.

Hence dividing by n - 1 rather than n provides exactly the right correction.
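The effect of the correction can be checked exactly on a toy case. The sketch below rests on stated assumptions: a hypothetical population {1, 2, 3}, samples of size 2 drawn with replacement, and exact fractions instead of floating point. It averages the variance estimate over every possible sample, once dividing by n and once by n - 1.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    mean = Fraction(sum(values), len(values))
    total = sum((Fraction(v) - mean) ** 2 for v in values)
    return total / divisor


population = [1, 2, 3]  # hypothetical population
n = 2                   # sample size

# Population variance: deviations from the true mean, divided by N.
population_variance = variance(population, len(population))

# Every ordered sample of size n drawn with replacement (9 in total).
samples = list(product(population, repeat=n))

avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance)  # 2/3
print(avg_div_n)            # 1/3: dividing by n underestimates on average
print(avg_div_n_minus_1)    # 2/3: dividing by n - 1 matches on average
```

In this toy case, dividing by n gives half the true variance on average, while dividing by n - 1 recovers it exactly.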


Points to Ponder:

1. If a constant is added to or subtracted from each sample item, then:
a) The sample mean increases or decreases by the same constant.
b) The sample variance and standard deviation are unaffected.

2. If each sample item is multiplied or divided by a constant, then:
a) The sample mean is multiplied or divided by the same constant.
b) The standard deviation is multiplied or divided by the absolute value of that constant.


Range = largest_value - smallest_value


Quartiles

Quartiles divide the sample into quarters. A sample has three quartiles.

1. The first quartile (Q1) is the median of the lower half of the data; it lies at position 0.25(n + 1).

2. The median is the second quartile (Q2); it lies at position 0.5(n + 1).

3. The third quartile (Q3) is the median of the upper half of the data; it lies at position 0.75(n + 1).


IQR (InterQuartile Range)

IQR = third_quartile - first_quartile

25% of the data is less than the first quartile.

75% of the data is less than the third quartile.

50% of the data lies between the first and third quartiles.

The IQR is the distance needed to span the middle half of the data.

Percentile

The pth percentile of a sample, for a number p between 0 and 100, divides the sample so that, as nearly as possible, p% of the sample values are less than the pth percentile and (100 - p)% are greater.

The computation of the location of the pth percentile is analogous to what we did for the quartiles.


To Find Percentiles

Order the n sample values from smallest to largest.

Compute the quantity (p/100)(n + 1), where n is the sample size.

If this quantity is an integer, the sample value in this position is the pth percentile.

Otherwise, average the two sample values on either side.
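The steps above can be sketched as a short function. The function name is illustrative, and the handling of positions that fall outside the sample (raising an error) is an assumption, since the slides do not cover that case. Note that other tools may use different percentile conventions and give slightly different answers.

```python
def percentile(values, p):
    """pth percentile using the (p/100)(n + 1) position rule from the slide."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100  # 1-based position in the ordered sample
    if position < 1 or position > n:
        # Assumption: the slide rule does not say what to do here.
        raise ValueError("the position falls outside the sample")
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = int(position)  # 1-based position just below `position`
    # Average the values on either side (1-based lower and lower + 1).
    return (ordered[lower - 1] + ordered[lower]) / 2


data = [2, 3, 5, 6, 7, 9, 9, 11, 12, 15]
print(percentile(data, 25))  # position 2.75 -> (3 + 5) / 2
print(percentile(data, 75))  # position 8.25 -> (11 + 12) / 2
```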


Points to Ponder:

The first quartile is the 25th percentile.

The median is the 50th percentile.

The third quartile is the 75th percentile.

Prof. Preet Kanwal


Question

Suppose we have the following data:

2, 3, 5, 6, 7, 9, 9, 11, 12, 15

1. What is the mean of these data?
2. What is the median?
3. What is the mode?
4. What is the standard deviation?
5. What is the first quartile?
6. What is the third quartile?
7. What is the interquartile range?


Pos:  1   2   3   4   5   6   7   8   9   10
Val:  2   3   5   6   7   9   9   11  12  15

Mean = 79 / 10 = 7.9

Median = average of the items in positions 10/2 and 10/2 + 1 = average of the items in positions 5 and 6 = (7 + 9) / 2 = 8

Mode = 9

Standard deviation = 4.09

First quartile: position 0.25(11) = 2.75, so average the items in positions 2 and 3 = (3 + 5) / 2 = 4

Third quartile: position 0.75(11) = 8.25, so average the items in positions 8 and 9 = (11 + 12) / 2 = 11.5

Interquartile range = 11.5 - 4 = 7.5


Question

Suppose we have the following data:

2, 3, 5, 6, 7, 9, 9, 11, 12, 15

1. Add 1 to each item in the data and answer the following:
a) What is the median?
b) What is the mode?
c) What is the standard deviation?
d) What is the first quartile?
e) What is the third quartile?
f) What is the interquartile range?


Question
Suppose we have the following data:
2, 3, 5, 6, 7, 9, 9, 11, 12, 15

2. Multiply each item in the data by 2 and answer the following:
a) What is the median?
b) What is the mode?
c) What is the standard deviation?
d) What is the first quartile?
e) What is the third quartile?
f) What is the interquartile range?


Points to Ponder:
1) If a constant is added to or subtracted from each item in the sample, then:
a) The mean increases or decreases by the same constant.
b) The median increases or decreases by the same constant.
c) The standard deviation remains the same.
d) Q1 increases or decreases by the same constant.
e) Q3 increases or decreases by the same constant.
f) The IQR remains the same.
2) If each item in the sample is multiplied or divided by a (positive) constant, then:
a) The mean is multiplied or divided by the same constant.
b) The median is multiplied or divided by the same constant.
c) The standard deviation is multiplied or divided by the same constant.
d) Q1 is multiplied or divided by the same constant.
e) Q3 is multiplied or divided by the same constant.
f) The IQR is multiplied or divided by the same constant.
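These rules can be checked numerically on the data from the preceding questions. The sketch below uses Python's standard `statistics` module for the mean, median and sample standard deviation; quartiles are left out because the module's quantile conventions differ from the position rule used in these slides. The `summary` helper is a made-up name.

```python
import statistics

data = [2, 3, 5, 6, 7, 9, 9, 11, 12, 15]
shifted = [x + 1 for x in data]  # add 1 to each item
scaled = [x * 2 for x in data]   # multiply each item by 2


def summary(values):
    """Mean, median and sample standard deviation (n - 1 divisor)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


for label, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(label, summary(values))
```

Shifting moves the mean and median by 1 and leaves the standard deviation alone; scaling doubles all three.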


Example of a data set where the standard deviation is larger than the mean:
S = {0, 0, 1, 15, 20} (mean = 7.2, standard deviation = 9.58)

The standard deviation can never be negative.

The lowest possible value of the standard deviation is 0, which occurs when all the items are equal.


Question
Consider the following data set (representing scores
in an examination)
67, 44, 60, 31, 15, 81, 77, 70, 84, 95, 91

1)Find mean
2)Find median
3)Find variance
4)Find standard deviation
5)Find range
6)Find first quartile
7)Find third quartile
8)Find IQR
9) Find 60th percentile

Prof. Preet Kanwal


Pos:  1   2   3   4   5   6   7   8   9   10  11
Val:  15  31  44  60  67  70  77  81  84  91  95

1. Mean = 715 / 11 = 65
2. Median = item in position (11 + 1) / 2 = item in position 6 = 70
3. Variance = 6488 / 10 = 648.8
4. Standard deviation = sq_root(Variance) = sq_root(648.8) = 25.47
5. Range = highest value - lowest value = 95 - 15 = 80
6. First quartile: position 0.25(12) = 3, so Q1 = item in position 3 = 44
7. Third quartile: position 0.75(12) = 9, so Q3 = item in position 9 = 84
8. IQR = third quartile - first quartile = 84 - 44 = 40
9. 60th percentile: position 0.60(12) = 7.2, so average the items in positions 7 and 8 = (77 + 81) / 2 = 79


True/False
1. For any list of numbers, half of them will be below the mean?

2. Is the sample mean always the most frequently occurring value?

3. Is the sample mean always equal to one of the values in the sample?

4. Is the sample median always equal to one of the values in the sample?

5. Is it possible for standard deviation of a list of numbers to be equal to 0?

6. Is it possible for standard deviation to be greater than mean?

Prof. Preet Kanwal


True/False
1. For any list of numbers, half of them will be below the mean? (False)
2. Is the sample mean always the most frequently occurring value? (False)
3. Is the sample mean always equal to one of the values in the sample? (False)
4. Is the sample median always equal to one of the values in the sample? (False)
5. Is it possible for the standard deviation of a list of numbers to be equal to 0? (True)
6. Is it possible for the standard deviation to be greater than the mean? (True)


Question

A vendor converts the weights on the packages from pounds to kilograms (1 kg = 2.2 lb).

a) How does this affect the mean weight of the packages?

b) How does this affect the standard deviation of the weights?

Prof. Preet Kanwal


A vendor converts the weights on the packages from pounds to kilograms (1 kg = 2.2 lb).

a) How does this affect the mean weight of the packages?
The mean will be divided by 2.2.

b) How does this affect the standard deviation of the weights?
The standard deviation will be divided by 2.2.


Question
The vendor begins using heavier packaging, which
increases the weight of each package by 50g.

a) How does this affect the mean weight of the packages?

b) How does this affect the standard deviation of the weights?

Prof. Preet Kanwal


The vendor begins using heavier packaging, which
increases the weight of each package by 50g.

a) How does this affect the mean weight of the packages?


The mean will increase by 50 g.

b) How does this affect the standard deviation of the weights?


The standard deviation will be unchanged.

Prof. Preet Kanwal


Question
The smallest number on a list is changed from 12.9 to 1.29.

a) Is it possible to determine by how much:


I. Mean changes ?
II.Median changes?
III.Standard deviation changes?

b) Is it possible to determine whether the median changes or


not if there are only two numbers on the list?

Prof. Preet Kanwal


The smallest number on a list is changed from 12.9 to 1.29.

a) Is it possible to determine by how much:

I. The mean changes?
It is not possible to tell by how much the mean changes, because the sample size is not known.
II. The median changes?
If there are more than two numbers on the list, the median is unchanged.
III. The standard deviation changes?
It is not possible to tell by how much the standard deviation changes, both because the sample size is unknown and because the original standard deviation is unknown.

b) Is it possible to determine whether the median changes or not if there are only two numbers on the list?
Yes, it is possible. With x denoting the other number on the list:
change_in_median = old_median - new_median = (12.9 + x)/2 - (1.29 + x)/2 = (12.9 - 1.29)/2 = 5.805

Outliers
Outliers are points that are much larger or smaller than the rest of the
sample points.

Outliers may be data entry errors or they may be points that really are
different from the rest.

Outliers should not be deleted without considerable thought; sometimes calculations and analyses will be done with and without outliers and then compared.

Example: in the sample 1, 2, 3, 4, 5, 200, the value 200 is an outlier.

Identify the outliers, if any!

Trimmed Mean or Truncated Mean

The trimmed mean is a statistical measure of central tendency, much like the mean and median.

It is an averaging method designed to reduce the effects of statistical outliers. It helps eliminate the influence of data points on the tails that may unfairly affect the traditional mean.

A trimmed mean eliminates the extreme observations by removing observations from each end of the ordered sample.

It involves calculating the mean after discarding given parts of a sample at the beginning and the end of the ordered data, typically discarding an equal amount at both ends.

09/08/16 Prof. Preet Kanwal


Facts
For most statistical applications, 5 to 25 percent of the ends are discarded.
Under normality, the best possible amount of trimming is zero.
However, in practical applications, there is no guarantee that the observed data are symmetric.
This method is best suited for data with large, erratic deviations or extremely skewed distributions.
The median is the extreme case of a trimmed mean.


Example
A mean trimmed by 20% has the largest 20% of the values removed, and the smallest 20% of the values removed.


Problem 1: Find the 10% trimmed mean of:

2, 4, 6, 7, 11, 21, 81, 90, 105, 121


Solution Problem 1
n = 10, p = 0.10, so k = np = 1, which is an integer. Trim exactly one observation at each end.
Thus trim off 2 and 121. We are left with R = n - 2k = 10 - 2 = 8 observations (4, 6, 7, 11, 21, 81, 90, 105).
10% trimmed mean = (4 + 6 + 7 + 11 + 21 + 81 + 90 + 105) / 8 = 325 / 8 = 40.625


Problem 2: Find the 15% trimmed mean of:

2, 4, 6, 7, 11, 21, 81, 90, 105, 121


Solution Problem 2
n = 10, p = 0.15, so k = np = 1.5, which is not an integer, so round the value to the nearest whole number: k = 2.

Thus trim off 2, 4 and 105, 121. We are left with R = n - 2k = 10 - 2(2) = 6 observations (6, 7, 11, 21, 81, 90).

15% trimmed mean = (6 + 7 + 11 + 21 + 81 + 90) / 6 = 216 / 6 = 36
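The trimming procedure used in the two problems can be sketched in Python. Two choices below are assumptions made for the sketch: the trimming amount is passed as a whole-number percentage so that k can be computed with integer arithmetic, and a value of np ending in .5 is rounded up (as in Problem 2, where 1.5 becomes 2). The function name is illustrative.

```python
def trimmed_mean(values, percent):
    """Mean after dropping k = round(n * percent / 100) values from each end.

    `percent` is an integer percentage (10 means 10%). Halves round up,
    computed with integers to avoid floating-point surprises.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = (2 * n * percent + 100) // 200  # n * percent / 100, rounded half up
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


data = [2, 4, 6, 7, 11, 21, 81, 90, 105, 121]
print(trimmed_mean(data, 10))  # drops 2 and 121
print(trimmed_mean(data, 15))  # drops 2, 4, 105 and 121
```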


Usefulness

Useful estimator: less sensitive to outliers than the mean.

Still a reasonable estimate of central tendency or mean for many statistical models, which makes it a robust estimator.


Application - Olympic Judging
Judges rate skaters on a 0-to-10 scale (using any real number from 0 to 10). Note that here we do not support anonymous judges (secret ballots); we want all scores to be public.

The top K and bottom K scores for each skater are discarded (where K is some pre-agreed constant).

The skater's score is the mean of her un-discarded judge scores.

The highest-scoring skater wins the gold.

"Outlier" judges are discarded, so there is no "dictator" judge. Unfortunately, the "trimmed mean" system can thereby discard some important honest judgements. (Not all outliers represent corrupt judges.) That is sad!


Points to Ponder:

A numerical summary of a sample is called a statistic.

A numerical summary of a population is called a parameter.

Statistics are often used to estimate parameters.


Data Visualization
Preet Kanwal
Assistant Professor
Department of CSE
PESU Bangalore
Outline
1. What is Data Visualization?
2. Histogram.
3. Bar Chart.
4. Pie Chart.
5. Line Chart.

Prof. Preet Kanwal


Making predictions is not enough!

Effective data scientists know how to explain and interpret their results, and communicate findings accurately to stakeholders to inform business decisions.

Visualization is the field of research in computer science that studies effective communication of quantitative results by linking perception, cognition, and algorithms to exploit the enormous bandwidth of the human visual cortex.

In this course you will learn to recognize, design, and use effective visualizations.


What is Data Visualization?
A form of visual communication.
Helps people understand the significance of data by placing it in a visual context.
Patterns and trends are difficult, if not impossible, to detect by looking at textual data.
They are much easier to recognize with data visualization software.

Communicate information clearly using:
Plots of various types
Infographics
Maps, schematic diagrams (e.g. train routes)


Graphical Summaries
1. Histogram
2. Bar Chart
3. Pie Chart
4. Line Chart
5. Box plot
6. Scatter plot

Prof. Preet Kanwal


Histogram
Gives an idea of the shape of a sample, indicating regions where sample points are concentrated and regions where they are sparse.
For a single variable:
- A histogram (visual plot of a frequency distribution) is very helpful.
- The choice of the number of bins is important: a histogram can be made to look different by altering this.


Histogram Construction
Construct a frequency table:
- Divide the sample into groups, called class intervals or bins.
- Frequency = number of data points that fall into each class interval.
- Relative frequency of a class interval = frequency / total number of data points (the proportion of data points that fall into that interval).
- The sum of all relative frequencies = 1.
- If the class intervals have the same width, draw a rectangle for each class, whose height is equal to the frequency or relative frequency.
- The data axis (x-axis) is marked with the lower class limits.
- Frequency scales (y-axis) always start at zero.
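The frequency table behind a histogram can be built with a few lines of Python. This is a sketch under stated assumptions: equal-width class intervals that include their lower limit and exclude their upper limit, with the starting point, width and number of bins chosen by hand. The function name is illustrative, and the data are the examination scores used in an earlier question.

```python
def frequency_table(data, start, width, num_bins):
    """Return one (lower, upper, frequency, relative_frequency) row per bin.

    Bins are [lower, upper): the lower class limit is included.
    """
    counts = [0] * num_bins
    for x in data:
        index = int((x - start) // width)
        if not 0 <= index < num_bins:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[index] += 1
    total = len(data)
    return [
        (start + i * width, start + (i + 1) * width, count, count / total)
        for i, count in enumerate(counts)
    ]


scores = [67, 44, 60, 31, 15, 81, 77, 70, 84, 95, 91]
table = frequency_table(scores, start=0, width=20, num_bins=5)
for lower, upper, freq, rel in table:
    print(f"{lower:>3} to {upper:<3} {freq:>2}  {rel:.3f}")
```

Changing `width` or `num_bins` changes the table, and therefore the shape of the histogram drawn from it, which is why the choice of bins matters.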


Problem 1: Construct a histogram for the following data set:

Problem 1 Solution:


What to look for in a histogram
- Is the distribution symmetric or skewed?
- How many peaks does the histogram have, and where are they located? (Is it unimodal, bimodal or multimodal?)
- Are there any unusual characteristics?
- What is the maximum data value as shown on the histogram? (What is the largest value on the data axis?)
- What is the minimum data value as shown on the histogram? (What is the smallest value on the data axis?)
- Does the histogram have any gaps, and if so, where are they located? (Gaps are empty classes with bars on both sides.)
- Does the histogram have any extreme values, and if so, where are they located? (An extreme value is a bar with a large gap, two or more classes, between it and the other bars.)


Symmetry and Skewness

A histogram with a long left-hand tail is said to be skewed to the left, or negatively skewed.

A histogram is symmetric if its right half is a mirror image of its left half.

A histogram with a long right-hand tail is said to be skewed to the right, or positively skewed.


Modes

Answer the questions for the histogram shown below.


minimum = 0.

maximum = 500.

skewed right.

one peak at 25.

3 gaps:
a) between 100 and 150.
b) between 200 and 250.
c) between 300 and 400.

Since the last gap is twice as large, values between 400 and 500 are
extreme.

Prof. Preet Kanwal


Problem 2: Construct a histogram for the following data set:

Problem 2 Solution:


Problem 3
Ammonium concentrations were measured at a total of 349 alluvial wells
in the state of Iowa. The mean concentration was 0.27, median was 0.10
and standard deviation was 0.40.

If a histogram of these 349 measurements were drawn, would it be:


a) skewed to the right?
b) skewed to the left?
c) Approximately symmetric?
d) Undetermined?

Prof. Preet Kanwal


Problem 3 solution
Ammonium concentrations were measured at a total of 349 alluvial wells
in the state of Iowa. The mean concentration was 0.27, median was 0.10
and standard deviation was 0.40.

If a histogram of these 349 measurements were drawn, would it be:


a) skewed to the right?
b) skewed to the left?
c) Approximately symmetric?
d) Undetermined?

Answer: a) skewed to the right, since the mean is greater than the median.

Also note that half the values are between 0 and 0.10, so the left-hand tail is very short.


Problem 4
For the given data :

Answer the following:

The histogram drawn will be:


a) skewed to the left?
b) skewed to the right?
c) approximately symmetric?

Prof. Preet Kanwal


Problem 4 solution
For the given data :

Answer the following:

The histogram drawn will be:


a) skewed to the left?
b) skewed to the right?
c) approximately symmetric?

Answer: a) skewed to the left.

Reason: The 85th percentile is much closer to the median (50th percentile) than the 15th percentile is. Therefore the histogram is likely to have a longer left-hand tail than right-hand tail.

Problem 5 - Iris Data Set

Number of data points: 150

Attributes:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
species: {setosa, versicolour, virginica}

1. For the entire dataset, print out the
minimum, mean, median and maximum
values for each attribute.
2. Generate a histogram for each numeric
attribute for the dataset.(ndivs = 10).
3. What are the distinguishing characteristics
of each of the histograms generated?

Prof. Preet Kanwal


Problem 5 Solution:


Homework - Iris Data Set
1. Generate histograms for each combination
of species and attribute. There should be
12 histograms (3 species x 4 attributes).
2. Is it possible to use the results from the
previous task to determine the species of
a plant based on the values of the
attributes? If so, explain how.

Prof. Preet Kanwal


Homework - Expected Solution


Bar Chart
A bar chart is NOT a histogram

Histograms
Are used for quantitative data.
Show distributions of variables.
Display frequencies of ranges (intervals, bins).
Will appear different for different bin sizes.

Bar Charts
Are used for categorical data.
Show frequencies associated with categories.

Prof. Preet Kanwal


Problem 1: Dormant data set
Has 60 values.

Eruptions are categorized as Long and Short.

Prof. Preet Kanwal


Problem 2: Iris data set
Has 150 values.

Species are categorized as setosa, versicolour, virginica.


Problem 3: Glass data set
214 samples of glass

Various glass types, coded as:

Building_windows_float_processed (BWF).
Building_windows_non_float_processed (BWNF).
Vehicle_windows_float_processed (VWF).
Vehicle_windows_non_float_processed (VWNF) (none in this database).
Containers (Containers).
Tableware (Tableware).
Headlamps (Headlamps).


Pie Charts
Circular graph that shows the relative contribution of various categories to an overall total.

Useful for data classified into nominal categories.

Show percentage or proportional data.

Generally good for 6 or fewer categories.

If there are too many categories, the chart is difficult to interpret.

Pie slices may be exploded for better effectiveness.

Criticism: comparison by angle is less accurate than comparison by length.


You'd think that pie charts would be so easy to understand, but there's always someone who doesn't get it.


Comparison by length is easier


Line Charts
Display continuous data over time.
Ideal for showing trends over time.
Useful for depicting the relationship between variables.
The x-axis generally represents the independent variable (could be time or categorical data).
The y-axis is used to show the dependent variable(s) (value axis).
The x and y axes are marked at evenly spaced intervals.


Problem 1 - Oil price data set
Crude oil price data from 1/1/2014 to 9/5/2016.
Plot the oil price.
Note that there are missing values.
Dealing with missing values:
The code replaces each missing value by nan (not a number).
matplotlib skips nan values when drawing the line.

What can you say about the trend from this plot?
What is the projection for this year?
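The course code for this problem is not reproduced in the slides, so the following is only a sketch of the missing-value step described above. The function name and the raw values are hypothetical; the idea is that any entry that cannot be read as a number becomes nan, so the series keeps its length and its alignment with the dates.

```python
import math


def parse_prices(raw_values):
    """Convert raw text entries to floats, using nan for missing values."""
    prices = []
    for raw in raw_values:
        try:
            prices.append(float(raw))
        except (TypeError, ValueError):
            # Empty strings, placeholders such as "." and None end up here.
            prices.append(math.nan)
    return prices


raw = ["98.17", "", "97.49", ".", "95.14"]  # hypothetical raw column
prices = parse_prices(raw)
valid = [p for p in prices if not math.isnan(p)]

print(prices)
print(len(valid), "valid values out of", len(prices))
```

A list prepared this way can be passed to a plotting call as it is; the line simply has gaps where the nan entries sit.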


Box Plots (Box and whiskers plot)

Good way to visually represent a distribution.
Good way to summarize large amounts of data.
Displays the range and distribution of data along a number line.
Consists of:
Whiskers representing the extent of the distribution.
Outliers.


Five things needed to construct Box Plots

1) Arrange the data from least to greatest.
2) Find the median.
3) Find the first quartile (median of the lower half of the data).
4) Find the third quartile (median of the upper half of the data).
5) Find the lower and upper extremes.
IQR = Q3 - Q1
Lower_extreme = Q1 - 1.5 * IQR
Upper_extreme = Q3 + 1.5 * IQR


Drawing a Box Plot

1) Plot points for the five values above the number line.
2) Draw a vertical line at the median, Q1 and Q3.
3) Form a box by connecting the vertical lines at Q1 and Q3; the line at Q2 (the median) lies inside the box.
4) Draw the whiskers from the box to the extremes.

Note:
The whiskers do not necessarily extend to the minimum and maximum of the sample, but to the smallest and largest values inside a "reasonable" distance from the end of the box.


Note on whisker length
Before truncation, the whisker length is the same on both sides, because the same constant (1.5 * IQR) is added to Q3 and subtracted from Q1.

The whisker length is then truncated based on the following conditions:

1) if min(data) > lower_extreme, set lower_extreme = min(data)

2) if upper_extreme > max(data), set upper_extreme = max(data)


Problem 1

Construct a box-and-whisker plot for the given data.

Pos:  1     2     3     4     5     6     7     8
Val:  10.2  14.1  14.4  14.4  14.4  14.5  14.5  14.6

Pos:  9     10    11    12    13    14    15
Val:  14.7  14.7  14.7  14.9  15.1  15.9  16.4

Solution
1) Sample size = 15

2) Calculate the quartiles.
a) Q1 is the value of the 4th data point (position 0.25(16) = 4): Q1 = 14.4
b) Q2 is the value of the 8th data point (position 0.5(16) = 8): Q2 = 14.6
c) Q3 is the value of the 12th data point (position 0.75(16) = 12): Q3 = 14.9

3) Check for outliers. Find the interquartile range and the extremes.
a) IQR = 14.9 - 14.4 = 0.5
b) Upper extreme: Q3 + 1.5 * IQR = 14.9 + 0.75 = 15.65
c) Lower extreme: Q1 - 1.5 * IQR = 14.4 - 0.75 = 13.65

The values 10.2, 15.9 and 16.4 fall outside these extremes, so they are outliers.
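The calculation above can be sketched in Python. The helper names are illustrative. Quartiles use the (n + 1) position rule from these slides, and, as an assumption matching the earlier note on whiskers, the whiskers are drawn to the smallest and largest data values that lie inside the extremes. Plotting libraries may use different quartile conventions and so report slightly different numbers.

```python
def value_at(ordered, position):
    """Value at a 1-based position; average the neighbours if fractional."""
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = int(position)
    return (ordered[lower - 1] + ordered[lower]) / 2


def box_plot_summary(data):
    """Quartiles, IQR, 1.5 * IQR extremes, whisker ends and outliers."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 3:
        raise ValueError("at least three values are needed")
    q1 = value_at(ordered, 0.25 * (n + 1))
    median = value_at(ordered, 0.5 * (n + 1))
    q3 = value_at(ordered, 0.75 * (n + 1))
    iqr = q3 - q1
    lower_extreme = q1 - 1.5 * iqr
    upper_extreme = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_extreme <= x <= upper_extreme]
    outliers = [x for x in ordered if x < lower_extreme or x > upper_extreme]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_extreme": lower_extreme, "upper_extreme": upper_extreme,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]
result = box_plot_summary(data)
for name, value in result.items():
    print(name, value)
```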


Analyzing a Box Plot
1) Provides information about the distribution of data in the quartiles.
Shorter distance means data is bunched together.
Longer distance means data is spread out.

2) We can easily identify outliers.

3) Indicator of centrality, spread and symmetry. For example,
a distribution with a positive skew would have a longer whisker in the
positive direction than in the negative direction.
A larger mean than median would also indicate a positive skew.

4) Box plots are good at portraying extreme values and are especially good at
showing differences between distributions.
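
The mean-versus-median remark in the list above can be tried out on small samples. The sketch below uses a hypothetical helper, `skew_direction`, and made-up samples. Comparing mean and median is only a rule of thumb, not a formal skewness measure.

```python
from statistics import mean, median


def skew_direction(data):
    """Rough skew indicator based on comparing the mean with the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive"   # long tail towards larger values
    if m < med:
        return "negative"   # long tail towards smaller values
    return "none"


# Made-up samples for illustration only.
print(skew_direction([1, 2, 3, 4, 20]))     # one large value pulls the mean up
print(skew_direction([1, 17, 18, 19, 20]))  # one small value pulls the mean down
print(skew_direction([1, 2, 3, 4, 5]))      # symmetric sample
```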



Comparative boxplots

A means of comparing many samples at once, in a way that
would be impossible for the histogram.

Various attributes of the samples can be compared at a glance.

Obvious differences are immediately apparent. Data which will
not lend itself to standard analysis can be identified.



Example:



Observations made:

Samples A and B appear to have similar centres, which exceed
those of C and D.

Sample B appears to have larger variability than the other
three samples.

Samples A, B and C are reasonably symmetric, but sample D
is skewed to the right.

There are no obvious outliers in any of the samples.


Problem 2

Problem 3

Problem 4

Problem 5

Problem 6

Problem 7

Problem 8

Problem 9

Problem 10

Problem 11 Match the following


Multivariate Data

Data consisting of
One variable only - univariate data
Two variables - bivariate data
More than two variables - multivariate data

How can one visualize multivariate data using two
dimensions?

Most real-world data is multivariate.


Problem 1 Construct a Scatter Plot


Solution : Problem 1


Analyzing a Scatter Plot
The relationship between two variables is called their correlation.

Correlation is a statistical measure that indicates the extent to which two or
more variables fluctuate together.

The closer the plotted data points come to forming a straight line, the
higher the correlation between the two variables, or the stronger the relationship.

A positive correlation indicates the extent to which those variables increase or
decrease in parallel. A perfect positive correlation is given the value of 1.

A negative correlation indicates the extent to which one variable increases as
the other decreases. A perfect negative correlation is given the value of -1.

If there is absolutely no correlation present the value given is 0.



The closer the number is to 1 or -1, the stronger the correlation, or the stronger
the relationship between the variables.

The closer the number is to 0, the weaker the correlation.

So something that correlates fairly well in a positive direction might have
a value of 0.67, whereas something with a weak negative correlation
might have the value -0.21.
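
The slides do not name a specific coefficient. A common choice with exactly this range of -1 to 1 is Pearson's correlation coefficient, so the sketch below assumes that measure. The function name `pearson_r` and the sample values are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]   # deviations from the mean of x
    dy = [y - mean_y for y in ys]   # deviations from the mean of y
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up points lying exactly on a rising and on a falling straight line.
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfect positive relationship
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfect negative relationship
```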



Example : Correlation
1) Perfect positive correlation: The total amount of money spent on tickets at
the movie theatre versus the number of people who go (assuming every ticket
costs the same).

2) Negative correlation: The time a car has been travelling (at constant speed)
versus its remaining distance from the destination.

3) Strong but not perfect positive correlation: The number of hours students
spent studying for an exam versus the grade received.

This won't be a perfect correlation because two people could spend the same
amount of time studying and get different grades.

But in general the rule will hold true that as the amount of time studying
increases so does the grade received.



Correlation is Not Causation

When the fluctuation of one variable reliably predicts a similar
fluctuation in another variable, there's often a tendency to think
that the change in one causes the change in the other.

However, correlation does not imply causation: a correlation
between two variables does not mean that one causes the other.

There may be an unknown factor that influences both variables
similarly.
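
A small simulation can make the "unknown factor" point concrete. In this sketch, which uses made-up random data and assumes Pearson's coefficient as the correlation measure, a hidden variable drives both `x` and `y`. Neither influences the other, yet they come out strongly correlated. All names and parameters are illustrative.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient (compact helper for this sketch)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

# The hidden factor influences both observed variables.
hidden = [rng.gauss(0, 1) for _ in range(500)]
x = [h + rng.gauss(0, 0.3) for h in hidden]   # x depends only on hidden + noise
y = [h + rng.gauss(0, 0.3) for h in hidden]   # y depends only on hidden + noise

r_xy = pearson_r(x, y)

# Once the hidden factor is subtracted, only independent noise remains.
x_noise = [a - h for a, h in zip(x, hidden)]
y_noise = [b - h for b, h in zip(y, hidden)]
r_noise = pearson_r(x_noise, y_noise)

print("correlation of x and y:", round(r_xy, 2))
print("correlation after removing the hidden factor:", round(r_noise, 2))
```

The first value should be close to 1 and the second close to 0, even though `x` never appears in the construction of `y`.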



Example 1: Correlation is Not Causation

A few years ago a survey of employees found a strong positive
correlation between "Studying an external course" and "Sick Days".

Does this mean:

1) Studying makes them sick?

2) Sick people study a lot?

3) Or did they lie about being sick to study more?

Without further research we can't be sure!



Example 2: Correlation is Not Causation

A number of studies report a positive correlation between the
amount of television children watch and the likelihood that
they will become bullies.

However, the studies only report a correlation, not causation.

It is possible that some other factor, such as a lack of parental
supervision, is the influential factor.



Problem 2:

The graphs to be shown do not have perfect correlations.
Which graph would have a correlation of:
1) 0?
2) 0.7?
3) -0.7?
4) 0.3?
5) -0.3?

The first graph seems to have a pretty strong positive correlation,
so it would have a value of about 0.7. You can see that the band of
data points that is angled upward is relatively thin, so for a given
value of one variable there is little variation in the other.


The data points of the second graph are much more spread out, although
they definitely follow a downward pattern. Therefore, it would be a
good guess to say that this is roughly a -0.3 correlation.


The third graph also has a negative correlation, but the data points
are much tighter, indicating a higher correlation. Therefore, this would
probably have a value of about -0.7.


The fourth graph does not seem to have a correlation at all. There
is no pattern to where the data points lie. They do not seem to
go in any particular direction. Therefore, this data has a
correlation value of about 0.


The last graph appears to have a positive correlation, although
the data points are not very close together. This graph would
probably have a value of 0.3.


Thank you!