
MODULE 1

How much data does Google handle?

- Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.

- Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.

- The average MapReduce job ran across approximately 400 machines in September 2007, crunching approximately 11,000 machine years in a single month.

Statistical Analysis

Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data, as well as of making decisions based on such analysis.

Types of Statistics

• Descriptive statistics consists of different techniques for organizing, displaying, and describing data by using labels, graphs, and summary measures.

• Inferential statistics consists of methods that use sample results to help make decisions or predictions about a population.

Why do we summarize data?

The main objective of summarizing data is to easily identify the characteristics of the data to be processed.

Summarizing your data will enable you to identify the typical values of your data and uncover the odd values present in your dataset.

What is raw data?

Raw data pertains to the collected data before it is processed or ranked.

Quantitative raw data

Suppose you are tasked to gather the ages (in years) of 50 students in a university.

Categorical raw data

Another example is gathering the student status of the same 50 students.

Underlying your dataset

A data set is a collection of observations on one or more variables (e.g., a collection of data from Excel, a database, or rows used in data mining).

A variable is a characteristic under study that assumes different values for different elements (e.g., age, name).

An observation or measurement pertains to a value of a variable (e.g., an age of 10 or 20).

Basic Terms

In the above table, the variables are the customer ID, first name, last name, address, and age. The data set consists of all the transactions from customer 1 to 5. Each transaction can also be called a measurement or observation.

What are the different types of variables available?
Quantitative Variables

It pertains to a variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data.

Quantitative variables are divided into two types: discrete variables and continuous variables.

Discrete Variables

A discrete variable is a variable whose values are countable. In other words, a discrete variable can assume only certain values (e.g., age in whole years, number of siblings; it pertains to specific values).

Continuous Variables

These are data that can take on any value in an interval. Also called float, interval, or numeric.

Qualitative Variables

A qualitative or categorical variable pertains to a variable whose values cannot be measured numerically.

Ordinal Variables

Categorical data that has an explicit ordering. Also called an ordered factor.

Binary Variables

A special case of categorical data with just two categories (0/1, True or False). Also called dichotomous, logical, or indicator.

Qualitative vs Quantitative

Exploratory Data Analysis (EDA)

It refers to the critical process of performing initial investigations on the data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations.

Measures of Central Tendencies

A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data are located (i.e., their central tendency).

Arithmetic Mean

The arithmetic mean is calculated by adding together the values in the sample. The sum is then divided by the number of items in the sample (the replication).

Median

The median is the middle value, taken when you arrange your numbers in order (rank).

When calculating the median, one must remember the following:

• Arrange the data values of the given data set in increasing order (from smallest to largest).

• Find the value that divides the ranked data set into two equal parts.
Mode

It is the most frequent value in a sample. It is calculated by working out how many there are of each value in your sample. The one with the highest frequency is the mode. It is possible to get tied frequencies, in which case you report both values.

Types of Modes

1. Unimodal – the distribution is said to be unimodal if it has only one value with the highest number of occurrences.

2. Bimodal – the data set contains two modes.

3. Multimodal – the distribution has more than two modes.
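To make these measures concrete, here is a minimal Python sketch (standard library only) that computes the mean, median, and mode of the small score sample used later in the mode discussion:

```python
import statistics

# Small sample of scores (same values as the mode example in these notes)
scores = [77, 82, 74, 81, 79, 84, 74, 78]

mean = statistics.mean(scores)        # sum of the values / number of values
median = statistics.median(scores)    # middle value of the ranked data
modes = statistics.multimode(scores)  # most frequent value(s); reports ties

print(f"mean    = {mean:.2f}")
print(f"median  = {median}")
print(f"mode(s) = {modes}")
```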

Measures of Dispersion

A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

Range

The difference between the largest and the smallest value in a data set.

range = largest value − smallest value

Standard Deviation

The standard deviation is used when the data are normally distributed. You can think of it as a sort of "average deviation" from the mean.

The general formula for calculating the sample standard deviation looks like the following:

s = √( Σ(xᵢ − x̄)² / (n − 1) )

Interpreting Standard Deviation

• A small standard deviation means that the values in a statistical data set are close to the mean of the data set, on average.

• A large standard deviation means that the values in the data set are farther away from the mean, on average.

• The standard deviation can never be a negative number.

• The smallest possible value for the standard deviation is 0.

• The standard deviation is affected by outliers (extremely low or extremely high numbers in the data set). That's because the standard deviation is based on the distance from the mean.

• The standard deviation has the same units as the original data.

Exploring the Data Distribution

Two ways to explore a distribution:

1. Graphically – using frequency histograms or tally plots to draw a picture of the sample shape.

2. Shape statistics – such as skewness and kurtosis.

Histogram

A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.

Understanding Distribution

• If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal.

• If a histogram and frequency distribution are skewed to the right, it means that the value of the mean is the largest among the three and the mode has the lowest value.

• If a histogram and frequency distribution are skewed to the left, the value of the mean is the smallest and the mode has the largest value.

Shape Statistics

1. Skewness – a measure of how central the average is in the distribution.

2. Kurtosis – a measure of how pointed (peaked) the distribution is.

Skewness

The skewness of a sample is a measure of how central the average is in relation to the overall spread of values.

Positively Skewed

A positive value indicates that there is a long "tail" of more positive (larger) values to the right, so the mean is pulled above the median.

Negatively Skewed

A negative value indicates that there is a long "tail" of more negative (smaller) values to the left, so the mean is pulled below the median.

Boxplot

A plot introduced by Tukey as a quick way to visualize the distribution of data.
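As a rough sketch of how these shape statistics can be computed in practice (assuming NumPy and SciPy are available; the sample values below are invented purely for illustration):

```python
import numpy as np
from scipy import stats

# Invented right-skewed sample: many small values, a few large ones
data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 14, 21])

print("range    =", data.max() - data.min())
print("std dev  =", data.std(ddof=1))      # sample standard deviation (n - 1)
print("skewness =", stats.skew(data))      # > 0 suggests a long right tail
print("kurtosis =", stats.kurtosis(data))  # how peaked/heavy-tailed the sample is

# The same information can be viewed graphically, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.hist(data, bins=5); plt.figure(); plt.boxplot(data); plt.show()
```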

STATISTICAL ANALYSIS

Consider this scenario: A new advertisement for Tom and Jerry's rash guard introduced in May of last year resulted in a 30% increase in rash guard sales for the following three months. Thus, the advertisement was effective.

Problem: In reality, rash guard sales typically increase in summer months. Thus, there is a "history effect".
What is a null hypothesis?

The null hypothesis is usually denoted by H0, and the alternative hypothesis is denoted by H1. The null hypothesis is a statement that is assumed to be true at the beginning of an analysis.

Suppose you are trying a court case: initially, the person being questioned is presumed not guilty. This initial verdict is the null hypothesis. The contrary of this claim is the alternative hypothesis.

Claim: Coke claims that each bottle of their soft drink is filled with 8 oz of soda all the time.

H0: each bottle is filled with 8 oz of soda. Let μ be the average soda content: H0: μ = 8 oz

H1: each bottle is filled with less than 8 oz of soda. Let μ be the average soda content: H1: μ < 8 oz

What is statistics?

Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data.

• Statistics relies upon the calculation of numbers.

• Statistics relies upon how the numbers are chosen and how the statistics are interpreted.

Types of Statistics

• Descriptive Statistics - describing and summarizing data sets using pictures and statistical quantities.

• Inferential Statistics - analyzing data sets and drawing conclusions from them.

• Probability - the study of chance events governed by rules (or laws).

What is hypothesis testing?

Hypothesis testing is a statistical procedure for testing whether chance is a plausible explanation of an experimental finding.

Student's T-test

In statistics, Student's t-test is a method developed by William Sealy Gosset used in testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown.

Conducting Student's T-test

Scenario

You want to know if there's a difference between the heights of crops from two fields.

H0: There's no statistically significant difference between the samples.

Identify Critical Value

If the t value < critical value, do not reject the null hypothesis.

If the t value > critical value, reject the null hypothesis.
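A minimal sketch of the crop-height scenario using SciPy's independent two-sample t-test (SciPy is assumed to be installed, and the height values below are invented for illustration):

```python
from scipy import stats

# Invented crop heights (cm) from two fields
field_a = [20.1, 21.4, 19.8, 22.0, 20.6, 21.1, 19.9, 20.8]
field_b = [22.4, 23.0, 21.8, 22.9, 23.5, 22.2, 23.1, 22.7]

# H0: there is no statistically significant difference between the samples
t_stat, p_value = stats.ttest_ind(field_a, field_b)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:            # 0.05 level of significance (alpha)
    print("Reject H0: the mean heights differ.")
else:
    print("Do not reject H0.")
```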

Significance Levels (Alpha)

A significance level, also known as alpha or α, is


an evidentiary standard that a researcher sets
before the study. It defines how strongly the
sample evidence must contradict the null
hypothesis before you can reject the null
hypothesis for the entire population.

Confidence Intervals

The table shows the equivalent confidence


levels and levels of significance.

𝐻0:There’s no statistically significant difference


between the samples.

Identify p value

If the p value > significance level (alpha), do not reject the null hypothesis.

If the p value < significance level (alpha), reject the null hypothesis.

degrees of freedom = n1 + n2 − 2

• df = 16 + 16 − 2 = 30

Since the p-value for the t-test is 0.026, which is less than the 0.05 level of significance, we reject the null hypothesis.

Assumptions in T-test

• independent observations

• normally distributed data for each group

• equal variances for each group

What is ANOVA?

Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following:

• a continuous dependent variable, or response variable

• a discrete independent variable, also called a predictor or explanatory variable

One-way ANOVA

Use analysis of variance to test for differences between population means.

Research Questions for One-way ANOVA

• Do accountants, on average, earn more than teachers?

• Do people treated with one of two new drugs have higher average T-cell counts than people in the control group?

• Do people spend different amounts depending on which type of credit card they have?

Assumptions in ANOVA

• Observations are independent.

• Errors are normally distributed.

• All groups have equal response variances.
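A sketch of a one-way ANOVA with SciPy, using invented monthly spending amounts for three hypothetical credit-card groups (mirroring the research questions above; scipy is assumed to be available):

```python
from scipy import stats

# Invented monthly spending for three credit-card types
card_a = [310, 295, 330, 305, 320, 298]
card_b = [410, 395, 420, 405, 388, 415]
card_c = [305, 315, 300, 325, 310, 308]

# H0: all group means are equal
f_stat, p_value = stats.f_oneway(card_a, card_b, card_c)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: at least one group mean differs.")
else:
    print("Do not reject H0.")
```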
Correlation

What is correlation?

• Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors, and between predictors and a target variable.

• Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.

Correlation Coefficient

• A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).

• A scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.

Assumptions in Correlation

• The correlation coefficient measures the extent to which two variables are associated with one another.

• When high values of v1 go with high values of v2, v1 and v2 are positively associated.

• When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.

• The correlation coefficient is a standardized metric, so it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).

• 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.

Types of Correlation
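To illustrate the three cases, here is a small sketch (NumPy assumed; data generated, not real) that computes Pearson correlation coefficients for a positive, a negative, and a near-zero relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

y_pos = 2 * x + rng.normal(scale=0.5, size=200)    # high x goes with high y
y_neg = -2 * x + rng.normal(scale=0.5, size=200)   # high x goes with low y
y_none = rng.normal(size=200)                      # unrelated to x

print("positive:", np.corrcoef(x, y_pos)[0, 1])    # close to +1
print("negative:", np.corrcoef(x, y_neg)[0, 1])    # close to -1
print("none:    ", np.corrcoef(x, y_none)[0, 1])   # close to 0
```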
SUPPLEMENTARY MATERIAL 1

Every data analysis requires data. Data can be in different forms such as images, text, videos, etc., and are usually gathered from different data sources. Organizations store data in different ways, some through data warehouses, traditional RDBMS, or even through the cloud. With the voluminous amount of data that an organization processes each day, the dilemma of how to start data analysis emerges.

How do we start performing an analysis?

First and foremost, know your data.

To understand your organization's data, there are numerous techniques that can be used. In Module 1, the most common techniques will be identified.

What is raw data?

Raw data pertains to the collected data before it is processed or ranked. Suppose you are tasked to gather the ages (in years) of 50 students in a university.

The table below is an example of quantitative raw data.

Another example is gathering the student status of the same 50 students. Now we will have an example of categorical raw data, which is presented in the table below.

The examples above can also be called ungrouped data. An ungrouped data set contains information on each member of a sample or population individually.

Raw data can be summarized through charts, dashboards, tables, and numbers. The most common way to describe raw data is through frequency distributions.
A frequency distribution shows how the frequencies are distributed over various categories.

Below is a frequency table that summarizes the survey conducted by Gallup Poll about Worries About Not Having Enough Money to Pay Normal Monthly Bills.

A frequency distribution of a qualitative variable enumerates all the categories and the number of instances that belong to each category.

Let's transform the following responses into a frequency table in order to interpret the data better.

It is easier to understand through a frequency table, isn't it? In order to study data, in this module we will be using statistics.

Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data, as well as of making decisions based on such analysis.

Since statistics is a broad body of knowledge, it is divided into two areas: descriptive statistics and inferential statistics.

What is descriptive statistics?

Descriptive statistics consists of different techniques for organizing, displaying, and describing data by using labels, graphs, and summary measures.

What is inferential statistics?

It consists of methods that use sample results to help make decisions or predictions about a population.

Basic terms

A variable is a characteristic under study that assumes different values for different elements.

Observation or measurement pertains to a value of a variable.

A data set is a collection of observations on one or more variables.

In the above table, the variables are the customer ID, first name, last name, address, and age. The data set consists of all the transactions from customer 1 to 5. Each transaction can also be called a measurement or observation.

Types of Variables

1. Quantitative Variables

It pertains to a variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data.

Examples: income, height, gross sales, price of a home, number of cars owned.

Quantitative variables are divided into two types: discrete variables and continuous variables.

• A discrete variable is a variable whose values are countable. In other words, a discrete variable can assume only certain values.

• Continuous variables are variables that can assume any numerical value in an interval.

2. Qualitative or Categorical Variables

It pertains to variables whose values cannot be measured numerically.
Another way to present data is through a pie chart. A pie chart is a circle divided into portions that represent frequencies or percentages of a population.
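As a sketch of how raw categorical responses could be turned into a frequency table and then into a bar or pie chart (pandas and matplotlib assumed to be installed; the responses below are invented):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented survey responses (categorical raw data)
responses = ["Worried", "Not worried", "Worried", "Worried",
             "Not worried", "No opinion", "Worried", "Not worried"]

freq = pd.Series(responses).value_counts()   # frequency distribution
print(freq)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
freq.plot.bar(ax=ax1, title="Bar graph")     # bar heights = category frequencies
freq.plot.pie(ax=ax2, title="Pie chart")     # slices = shares of the total
plt.tight_layout()
plt.show()
```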

Population Vs Sample

A population consists of all elements – individuals, items, or objects – whose characteristics are being studied.

A sample is a portion of the population selected


for study.

Visual example of Population vs Sample


Let's proceed to the different ways to group and study data.

Aside from using the frequency table, we can also make use of the different graphs that are commonly used to visually present data.

The first one is a bar graph. A bar graph is made of bars whose heights represent the frequencies of respective categories. One type of bar graph is called a Pareto chart.

A Pareto chart is a bar graph wherein the bars are arranged based on their heights, in descending order (largest to smallest).

To graph grouped data, we can use the following methods:

Grouped data can be presented using histograms. Histograms can be drawn for a frequency distribution. A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percentages are marked on the vertical axis.

The above histogram shows the percentage distribution of annual car insurance premiums in 50 states. The data used to make this distribution and histogram are based on estimates made by insure.com.

Understanding Frequency Distribution Curve

Knowing the meaning of each curve in a histogram would be helpful to interpret a dataset. A histogram can be:

1. Symmetric

2. Skewed

3. Uniform or rectangular

A symmetric histogram is the type of histogram in which both sides are equal.
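A short sketch of drawing a histogram for grouped (binned) data with matplotlib (library assumed; the premium values below are invented and are not the insure.com figures referenced above):

```python
import matplotlib.pyplot as plt

# Invented annual premiums; not the insure.com data mentioned above
premiums = [980, 1020, 1100, 1150, 1200, 1230, 1260, 1300, 1340,
            1380, 1420, 1450, 1500, 1580, 1650, 1720, 1900, 2100]

# Classes (bins) go on the horizontal axis, frequencies on the vertical axis
plt.hist(premiums, bins=[900, 1100, 1300, 1500, 1700, 1900, 2100], edgecolor="black")
plt.xlabel("Annual premium")
plt.ylabel("Frequency")
plt.title("Histogram of grouped data")
plt.show()
```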
If a histogram doesn't have equal sides, it is said to be skewed.

• A skewed-to-the-right histogram has a longer tail on the right side.

• A skewed-to-the-left histogram has a longer tail on the left.

If the histogram has equal values, then it would be considered a uniform or rectangular histogram.

SUPPLEMENTARY MATERIAL 2

Examples of statistical analysis application in real life:

1. Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to the airline industry and to help guitarists make beautiful music.

2. Researchers keep children healthy by using statistics to analyze data from the production of viral vaccines, which ensures consistency and safety.

3. Communication companies use statistics to optimize network resources, improve service and reduce customer churn by gaining greater insight into subscriber requirements.

4. Government agencies around the world rely on statistics for a clear understanding of their countries, their businesses and their people.

Understanding the measures of central tendency:

Measures of central tendency are useful in identifying the middle value of histograms and frequency distributions. The methods used to calculate the measures of central tendency can determine the typical values found in your data.

Measures of Central Tendency

1. Mean
2. Median
3. Mode

Measures of Dispersion

1. Range
2. Variance
3. Standard Deviation

What is mean?

Mean is also called the arithmetic mean, which pertains to the sum of all the values divided by the number of items added.

What is median?

The median is the middle value, taken when you arrange your numbers in order (rank). This measure of the average does not depend on the shape of the data.

When calculating the median, one must remember the following:

1. Arrange the data values of the given data set in increasing order (from smallest to largest).

2. Find the value that divides the ranked data set into two equal parts.

Example: Find the median for 2014 compensation:

Step 1: Arrange the values in increasing order.

16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1

Step 2: Identify the center of the data set.

16.2 16.9 19.3 19.3 19.6 21.0 22.2 22.5 28.7 33.7 42.1

For this example, the middle value is 21.0; therefore it is the median of the data values.

There are instances where the number of data values is even; in this case, the two middle numbers are added and divided by two.

See the example below: The following data describes the cell phone minutes used last month by 12 randomly selected customers:

230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721

Here, we need to arrange the data values first. It will give us:

160 184 201 230 263 326 380 397 510 721 2053 3864

If we observe our data, we have no single central value; to compute the median, we need to identify the values that divide the data into two equal parts. In our case we have 326 and 380.

We can only have one median per data set, so the median would be calculated using:

median = (326 + 380) / 2 = 353

What is mode?

Mode is the value that occurs most frequently in the given dataset.

Let's identify the most frequent values in the following example:

77 82 74 81 79 84 74 78

In this data set, the value 74 appears twice. Therefore, our mode is 74.

When a dataset has only one value that repeats the most, the distribution is called unimodal. If the data in the distribution has two values that repeat the most, it is called bimodal. If there are more than two modes in a dataset, it is said to be multimodal.
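The calculations above can be checked with a few lines of Python (standard library only), using the same cell-phone-minutes and score values:

```python
import statistics

minutes = [230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721]
scores = [77, 82, 74, 81, 79, 84, 74, 78]

# With an even number of values, the two middle numbers are averaged
print(statistics.median(minutes))   # (326 + 380) / 2 = 353.0

# The most frequent value in the sample
print(statistics.mode(scores))      # 74
```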
Relationships among mean, median, and mode:

• If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal.

• If a histogram and frequency distribution are skewed to the right, it means that the value of the mean is the largest among the three and the mode has the lowest value. If the mean is the largest, it means that the dataset is sensitive to outliers that occur on the right.

• If a histogram and frequency distribution are skewed to the left, the value of the mean is the smallest and the mode has the largest value. In this scenario, the left tail contains the outliers.

Measures of Dispersion

If an analyst wants to know how dispersed a dataset is, the methods of calculating the measures of dispersion can be used. Measures of dispersion can help in determining how spread out the data values are.

What is range?

Range is the simplest method to compute when measuring the dispersion of data. The range can be obtained by subtracting the smallest value in the dataset from the largest value.

What is standard deviation?

This method is the most commonly used measure of dispersion. The value of the standard deviation tells how closely the values of a data set are clustered around the mean.

The things one must remember when dealing with standard deviation are the following:

• A lower value of standard deviation indicates that the data are spread over a smaller range around the mean.

• A larger value of standard deviation indicates that the values of the data set are spread over a relatively larger range around the mean.

• The standard deviation can be obtained by taking the positive square root of the variance.
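Here is a minimal sketch that computes the three measures of dispersion listed earlier (range, variance, and standard deviation), using only the standard library and the 2014 compensation values from the earlier median example:

```python
import statistics

data = [16.2, 16.9, 19.3, 19.3, 19.6, 21.0, 22.2, 22.5, 28.7, 33.7, 42.1]

data_range = max(data) - min(data)    # largest value minus smallest value
variance = statistics.variance(data)  # sample variance
std_dev = statistics.stdev(data)      # positive square root of the variance

print(f"range    = {data_range:.1f}")
print(f"variance = {variance:.2f}")
print(f"std dev  = {std_dev:.2f}")
```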
FORMATIVES 11. If the histogram is skewed right, the mean is
greater than the median.
1. If a histogram and frequency distribution are
skewed to the left, the value of the means is the - True
largest.
12. Identify whether the statement below is an
- False example of

2. Which of the following is an example of a) positive correlation


categorical raw data?
b) negative correlation
- Collected subjects offered
c) no correlation
3. You conduct a survey where the respondents
The more one eats, the less hunger one will
could choose from good, better, best, excellent.
have.
What type of variable should contain this type
of data? - Positive correlation
- Interval 14. Boxplot is a plot in which the x-axis is the
value of one variable, and the y-axis the value of
4. Student height is a categorical variable.
another.
- True
- False
5. Quantitative variable pertains to the variables
15.0 indicates no correlation, but be aware that
that values couldn’t be measured.
random arrangement of data will produce both
- False positive and negative values for the correlation
coefficient just by chance

- True
6. The data below is bimodal.
16. The alternative hypothesis or research
Data: 77 82 74 81 79 84 74 82
hypothesis Ha represents an alternative claim
- True about the value of the parameter.

8. What is the median of the following data? - True

17. Identify whether the statement below is an


example of

a) positive correlation

b) negative correlation

c) no correlation

As one increases in age, often one’s agility


- 24
decreases.
9. People’s age is an example of continuous
- Negative correlation
variable
18. Find the median: 15, 5, 9, 18, 22, 25, 5
- True
- 12
10. It is a portion of the population selected for
study. (Use lowercase for your answer) 19. Observe the histogram below. Based on it,
how many students were greater than or equal
- Sample
to 60 inches tall?
7. Customer gender is an example of ordinal
data.

- False

8. If the distribution has more than two


modes, it is called multimodal.

- True

9. It is a measure of how central the average


is in relation to the overall spread of values
- 11
- Skewness
20. The height of three volleyball players is 10. A special case of categorical with just
210 cm, 220 cm and 191 cm. What is their two categories is called logical.
average height?
- True
- 207
11. Identify whether the statement
below is a null hypothesis or alternative
20/20 hypothesis:
1. The standard deviation can never be a Contrary to popular belief, people can
negative number see through walls.
- True - alternative hypothesis

2. It refers to the critical process of 12. The height of three volleyball players is
performing initial investigations on the data 210 cm, 220 cm and 191 cm. What is their
so as to discover patterns, to spot anomalies, average height?
to test hypothesis and to check assumptions
- 207
with the help of summary statistics and
graphical representations 13. 0 indicates no correlation, but be aware
that random arrangements of data will
- Exploratory Data Analysis
produce both positive and negative values
3. A variable is a characteristic under study for the correlation coefficient just by chance
that assures different values for different
- True
elements.
14. Inferential statistics is about describing
- True
and summarizing data sets using pictures
4. The data that is collected before being and statistical quantities
processed is called statistical data.
- False
- False
15. Identify whether the statement
5. The smallest possible value for the below is an example of
standard deviation is 1.
a) positive correlation
- False b) negative correlation
6. Observation or measurement pertains c) no correlation
to a value of a variable.  
- True The more you exercise your muscles,
the stronger they get.
- positive correlation

16. Inferential Statistics is about


analyzing data sets and drawing
conclusions from them
- True
17. A residual is the difference between
the observed value of the response and
the predicted value of the response
variable.
- True
18. The smallest possible value for the
standard deviation is 0.
- True
19. When conducting two-tailed T-test,
the data is normally distributed.
- True
20. When x and y are positively
correlated, as the value of x increases,
the value of y tends to increase as well.
- True
MODULE 2

Examples of data mining in everyday life: Instagram's algorithm, weather forecasts, Singapore's supply and demand gap on a typical day, Microsoft's voice recognition, and Gmail's spam filter.

Data Mining

It is defined as a process used to extract usable insights from a larger set of any raw data.

It implies analysing data patterns in large batches of data using one or more software tools.

Key Features of Data Mining

• Automatic pattern predictions based on trend and behavior analysis.

• Prediction based on likely outcomes.

• Creation of decision-oriented information.

• Focus on large data sets and databases for analysis.

• Clustering based on finding and visually documented groups of facts not previously known.
Key Terms

• A data set may have attribute variables and


target variable(s).

• The values of the attribute variables are used


to determine the values of the target
variable(s).

• Attribute variables and target variables may also be called independent variables and dependent variables, respectively, to reflect that the values of the target variables depend on the values of the attribute variables.

Data Patterns learned in data mining

• Classification and prediction patterns

• Cluster and association patterns

• Data reduction patterns

• Outliers and anomaly patterns

• Sequential and temporal patterns

Data-mining approaches can be separated into two categories.

Two major approaches in data mining

1. Supervised learning – the desired output is known.

2. Unsupervised learning – it is used against data that has no historical labels.

Data Analytics Lifecycle

• Big data analysis differs from traditional data


analysis primarily due to the volume, velocity,
and variety characteristics of data being
processed.

What is big data?

Normal data comes from your traditional relational database system.

Big data comes from different data sources. Data are usually in petabytes or terabytes.

Characteristics of Big Data

1. Volume - big data starts as low as 1 terabyte and has no upper limit.

2. Velocity - big data enters an average system at velocities ranging from 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second.

3. Variety - big data is composed of unstructured, semi-structured, and structured datasets.

How can you ensure the success of an analytics project?

PEOPLE

First, you need the right people.

Key Roles for a successful Analytics Project

• Business User – understands the domain area

• Project Sponsor – provides requirements (monetary, equipment, etc.)

• Project Manager – ensures meeting objectives

• Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data

• Database Administrator (DBA) – creates the DB environment

• Data Engineer – provides technical skills, assists data management and extraction, supports the analytic sandbox

• Data Scientist – provides analytic techniques, data and modeling

Different Analytical Life Cycles

Data Life Cycle according to SAS

1st Phase: Ask

Whether you're a data scientist developing a new model to reduce churn, or a business executive wanting to improve the customer experience, this phase defines what your business needs to know.

2nd Phase: Prepare

This phase is both critical to success and frustratingly time-consuming. You have data sitting in databases, on desktops or in Hadoop, plus you want to capture live-streaming data.

3rd Phase: Explore

In this phase, you'll search for relationships, trends and patterns to gain a deeper understanding of your data.

You'll also develop and test hypotheses through rapid prototyping in an iterative process.

4th Phase: Model

This is the phase where you code your own models using R or Python or an interactive predictive software like SAS.
Organizations must be able to identify the appropriate algorithm to analyze the available data.

5th Phase: Implement

After validating the accuracy of the model generated during the previous phase, it must be implemented in the organization.

Models are usually deployed by automating common manual tasks.

6th Phase: Act

In this phase, we enable two types of decisions: operational decisions that are automated, and strategic decisions where individuals make a long-term impact.

7th Phase: Evaluate

Organizations must always monitor the performance of the model generated.

When the performance starts to degrade below the acceptance level, the model can be recalibrated or replaced with a new model.

8th Phase: Ask again

The marketplace changes. Your business changes. And that's why your analytics process occasionally needs to change.

What is KDD?

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods.

It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization.

Steps of the KDD Process

Understanding each step

1. Developing an understanding of

• the application domain

• the relevant prior knowledge

• the goals of the end-user

2. Creating a target data set: selecting a data set or focusing on a subset of variables, or data samples, on which discovery is to be performed.

3. Data cleaning and preprocessing

• Removal of noise or outliers

• Collecting necessary information to model or account for noise

• Strategies for handling missing data fields

• Accounting for time sequence and known changes

4. Data reduction and projection

• Finding useful features to represent the data depending on the goal of the task

5. Choosing the data mining task

• Deciding whether the goal of the KDD process is classification, regression, clustering, etc.

6. Choosing the data mining algorithm

• Selecting method(s) to be used for searching for patterns in the data

• Deciding which models and parameters may be appropriate

• Matching a particular data mining method with the overall criteria of the KDD process

7. Data mining

• Searching for patterns of interest in a particular representational form, or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.

8. Interpreting mined patterns.

9. Consolidating discovered knowledge.

What is CRISP-DM?

The CRISP-DM methodology is described in terms of a hierarchical process model, consisting of sets of tasks described at four levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance.

1st Phase: Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

2nd Phase: Data Understanding

The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information.

3rd Phase: Data Preparation

The data preparation phase covers all activities needed to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data.

Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.

4th Phase: Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.

Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, going back to the data preparation phase is often necessary.

5th Phase: Evaluation

Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives.

A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

6th Phase: Deployment

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

In many cases, it is the customer, not the data analyst, who carries out the deployment steps.

SUBTOPIC 1 INTRODUCTION TO DATA MINING

What is Data Mining

Data mining aims at discovering useful data patterns from massive amounts of data.

Artificial Intelligence

AI is the broad science of mimicking human


abilities.

AI is the science of training machines to perform


human tasks.

Machine Learning

Arthur Samuel (1959). Machine Learning: field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell (1998). Well-posed Learning Problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

• Classifying emails as spam or not spam. (T)

• Watching you label emails as spam or not spam. (E)

• The number (or fraction) of emails correctly classified as spam/not spam. (P)

• None of the above - this is not a machine learning problem.

Aspects of Business Analytics

Impact of Analytics
Data Mining Trends

• Ongoing research to ensure that analysts have access to modern techniques which are robust and scalable

• Innovative computational implementations of existing analytical methods

• Creative applications of existing methods to solve new and different problems

• Integration of methods from multiple disciplines to provide targeted solutions

Standard Examples of Machine Learning

• Smart tagging

• Product Recommendations

• Priority Filtering

• Spam Filtering

• Text Processing

• Speech Recognition

• Face Recognition

The role of big data in data mining

Specific Applications

• Attrition/churn prediction

• Propensity to buy/avail of a product or service

• Cross-sell or up-sell probability

• Next-best offer

• Time-to-event modeling

• Fraud detection

• Revenue/profit predictions

• Probability of default in credit risk assessment

Major Types of Data Patterns

Classification and Prediction

Allow us to classify or predict values of target variables from values of attribute variables.

Cluster and Association

Cluster patterns give groups of similar data records such that data records in one group are similar but have larger differences from data records in another group.

Association patterns are established based on co-occurrences of items in data records.

Data Reduction Patterns

Data reduction patterns look for a small number of variables that can be used to represent a data set with a much larger number of variables.

Outliers and Anomaly Patterns

Outliers and anomalies are data points that differ largely from the norm of the data.

Sequential and Temporal Patterns

Sequential and temporal patterns reveal patterns in a sequence of data points.

If the sequence is defined by the time over which the data points are observed, we call the sequence of data points a time series.
Data mining can be performed on the following data types:

• Relational databases
• Data warehouses
• Advanced DB and information repositories
• Object-oriented and object-relational databases
• Transactional and spatial databases
• Heterogeneous and legacy databases
• Multimedia and streaming databases
• Text databases
• Text mining and web mining

Data Mining Techniques

- Classification
- Clustering
- Regression
- Outlier detection
- Sequential patterns
- Prediction
- Association rules

Challenges

• Skilled experts are needed to formulate the data mining queries.

• Overfitting: due to a small training database, a model may not fit future states.

• Data mining needs large databases which sometimes are difficult to manage.

• Business practices may need to be modified to determine how to use the information uncovered.

• The data mining techniques are not always accurate, and so they can cause serious consequences in certain conditions.

• If the data set is not diverse, data mining results may not be accurate.

• Integrating information needed from heterogeneous databases and global information systems can be complex.

Advantages of data mining

• The data mining technique helps companies to get knowledge-based information.

• Data mining helps organizations to make profitable adjustments in operation and production.

• Data mining is a cost-effective and efficient solution compared to other statistical data applications.

• Data mining helps with the decision-making process.

• It facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.

• It can be implemented in new systems as well as existing platforms.

• It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.

Disadvantages of data mining

• There are chances that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.

• Much data mining analytics software is difficult to operate and requires advance training to work on.

• Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.

Industries that utilize data mining

❑ Communications
❑ Insurance
❑ Education
❑ Manufacturing
❑ Banking
❑ Retail
❑ Service providers
❑ E-commerce
❑ Supermarkets
❑ Crime
❑ Bioinformatics

SUBTOPIC 2 SUPERVISED AND UNSUPERVISED LEARNING

SUPERVISED LEARNING

• the desired output is known

• also known as predictive modeling

• uses patterns to predict the values of the label on additional unlabeled data

• used in applications where historical data predicts likely future events

Example Problem: House price prediction

Use of data in Supervised Learning
We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into partitions:

Training dataset: Consists of the data used to build the candidate models.

Test dataset: The data set to which the final model should be applied to estimate this model's effectiveness when applied to data that have not been used to build or select the model.

• If there is only one dataset, it may be partitioned into training and test sets.

• The basic assumption is that the training and test sets are produced by independent sampling from an infinite population.

Supervised Learning example techniques

Classification

Classification Tree

• Partition a data set of observations into increasingly smaller and more homogeneous subsets.

• At each iteration, a subset of observations is split into two new subsets based on the values of a single variable.

• A series of questions successively narrows down observations into smaller and smaller groups of decreasing impurity.

Weather Conditions for Playing Tennis
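A sketch of the ideas above using scikit-learn (assumed to be installed): the data set is split into training and test partitions, a classification tree is fit on the training part, and accuracy is estimated on the held-out test part. The play-tennis-style data below is invented for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Invented weather-style features: [outlook (0=sunny,1=overcast,2=rain), humidity, windy]
X = [[0, 85, 0], [0, 90, 1], [1, 78, 0], [2, 96, 0], [2, 80, 0], [2, 70, 1],
     [1, 65, 1], [0, 95, 0], [0, 70, 0], [2, 80, 0], [0, 70, 1], [1, 90, 1],
     [1, 75, 0], [2, 91, 1]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]   # play tennis? 1 = yes, 0 = no

# Hold out part of the data so the model is evaluated on observations it never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```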
Logistic Regression

Attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables.

K-Nearest Neighbor

To classify an outcome, the training set is searched for the observation that is "most like" it. This is an example of "instance-based" learning. It is "rote learning", the simplest form of learning.

• When k-NN is used as a classification method, a new observation is classified as Class 1 if the percentage of its k nearest neighbors in Class 1 is greater than or equal to a specified cut-off value (e.g. 0.5).

• When k-NN is used as a prediction method, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors.
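A compact scikit-learn sketch of the two classifiers just described (library assumed; the two-feature toy data is generated, not real):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Two synthetic classes in two dimensions
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Logistic regression: models P(y = 1) as a function of the explanatory variables
logit = LogisticRegression().fit(X, y)

# k-NN: classifies a new point by majority vote of its k nearest neighbours (Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

new_point = [[1.5, 1.5]]
print("logistic prediction:", logit.predict(new_point)[0])
print("k-NN prediction:    ", knn.predict(new_point)[0])
```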

UNSUPERVISED LEARNING

• used against data that has no historical labels

• the goal is to explore the data and find some structure within

• there is no right or wrong answer

Clustering

A definition of clustering could be "the process of organising objects into groups whose members are similar in some way".
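As an unsupervised example, here is a small k-means clustering sketch with scikit-learn (assumed available); the points are generated so that three groups exist, but no labels are given to the algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three unlabeled blobs of points
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(5, 0.5, (30, 2)),
               rng.normal([0, 5], 0.5, (30, 2))])

# Group similar observations together without any target variable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centres:\n", kmeans.cluster_centers_)
```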


Unsupervised Learning example techniques

Self-organizing maps

Nearest-neighbor mapping

k-nearest neighbors (k-NN): This method can be used either to classify an outcome category or to predict a continuous outcome.

• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.

Software Demo (Weka)

What is Weka?

Weka is tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.

Logistic Regression in Weka

• Open the diabetes dataset.
• Click the Classify tab.
• Choose Classifier: functions > Logistic.
• Use Test Options: Use Training Set.
• Press Start.

Trees in Weka

• Open the weather dataset.
• Click the Classify tab.
• Choose Classifier: trees > J48.
• Use Test Options: Use Training Set.
• Press Start.
• Right-click the result in the Result List for options.
• Choose Visualize tree.

Ordinal data can also be called as ordered factor.

- False

In this phase you'll also develop and test hypotheses through rapid prototyping in an iterative process.

- Explore

It is a comprehensive data mining methodology and process model that provides anyone – from novice to data mining experts – with a complete blueprint for conducting a data mining project.

- CRISP-DM

CRISP-DM stands for ____________________________.

- Cross-industry standard process for data mining

It is a role that is responsible for collecting,


analyzing, and interpreting large amount of
data.
- Data Scientist

It is the aspect of business analytics that


finds patterns in unstructured data like social
media or survey tools which could uncover
insights about consumer sentiment
- Data Mining Random forest is an example of supervised
learning.
One of the benefits of data mining is
overfitting. - True

- False Unsupervised methods help you to find


features which can be useful for
It is the phase of CRISP-DM where analysts
categorization.
review the steps executed.
- True
- Evaluation
All data is labeled and the algorithms learn
A key objective is to determine if there is
to predict the output from the input data.
some important business issue that has not
This statement pertains to ___________. (Use
been sufficiently considered.
lowercase for your answer)
- True
- Unsupervised Learning
Identify the third step in CRISP-DM.
- Data Preparation 17/20

In this approach in data mining, input This role provides the funding when doing
variables and output variables will be given. analytical project.

- Supervised Learning - Project Sponsor

It is a role that is responsible for collecting,


Self-organizing maps are example of analyzing, and interpreting large amount of
supervised learning. data.
- True - Data Scientist

Logistic regression is classified as supervised Based on the analytical life cycle defined by
learning. SAS, this phase has two types of decisions:
operational and strategic.
- True
- Act
Association rule algorithm is an example of
what approach in data mining? Interpreting mined patterns concludes the
KDD process.
- Unsupervised Learning
- True
Weka is tried and tested open source
machine learning software that can be It is the aspect of business analytics that
accessed through a graphical user interface, finds patterns in unstructured data like social
standard terminal applications, or a Java API. media or survey tools which could uncover
insights about consumer sentiment.
- True
- Text analytics
Unsupervised machine learning finds all kind
of unknown patterns in data This is the first step in KDD process.
- True - Data Selection

All data is unlabeled and the algorithms learn In this phase, you’ll search for relationships,
to inherent structure from the input data. trends and patterns to gain a deeper
This statement pertains to ___________. understanding of your data.
- Supervised Learning - Explore
Which of the following is not included in the analytical life cycle defined by SAS?

- Integration

Which of the following is not considered a data mining technique?

- Kurtosis

These are values that lie away from the bulk of the data.

- Outliers

CRISP-DM stands for ____________________________.

- Cross-industry standard process for data mining

Supposed your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the performance measure P in this setting?

- The number of email correctly classified as spam/not spam

All data is labeled and the algorithms learn to predict the output from the input data. This statement pertains to ___________. (Use lowercase for your answer)

- supervised

Self-organizing maps are example of supervised learning.

- False

It makes the computation of multi-layer


neural network feasible.
- Deep Learning

Unsupervised methods help you to find


features which can be useful for
categorization.
- False

Logistic regression is classified as supervised


learning.
- True

Supposed you want to train a machine to


help you predict how long it will take you to
drive home from your workplace. What type
of data mining approach should you use?
- Supervised learning

One of the challenges of data mining is it can be implemented in new systems as well as existing platforms.

- False

One of the advantages of data mining is that there are chances of companies may sell useful information of their customers to other companies for money.

- False

MODULE 3 INTRODUCTION TO REGRESSION ANALYSIS

What is REGRESSION ANALYSIS?

A technique of studying the dependence of one variable (called the dependent variable) on one or more variables (called explanatory variables), with a view to estimating or predicting the average value of the dependent variable in terms of the known or fixed values of the independent variables.

Parameters

• Dependent variable or response: the variable being predicted.

• Independent variables or predictor variables: the variables being used to predict the value of the dependent variable.

• Simple regression: a regression analysis involving one independent variable and one dependent variable.

• In statistical notation:

y = dependent variable
x = independent variable

Types of Regression Analysis

- Linear Regression
- Multiple Linear Regression
- Nonlinear Regression

When do you use regression?

Reasons to use regression

• Estimate the relationship that exists, on average, between the dependent variable and the explanatory variable.

• Determine the effect of each of the explanatory variables on the dependent variable, controlling for the effects of all other explanatory variables.

• Predict the value of the dependent variable for a given value of the explanatory variable.
Who uses regression?

• Data analysts
• Market researchers
• Professors
• Data scientists

Advantages of using regression

• It indicates the strength of impact of multiple independent variables on a dependent variable.

• It indicates the significant relationships between the dependent variable and the independent variable.

Understanding the regression model based on the concept of slope

• The mathematics of the slope is similar to the regression model:

Y = mx + b

• When using the slope-intercept formula, we focus on the two constants (numbers) m and b.

• m describes the slope or steepness of the line, whereas

• b represents the y-intercept, or the point where the graph crosses the y-axis.

Regression Model

• The situation using the regression model is analogous to that of the interviewers, except instead of using interviewers, predictions are made by performing a linear transformation of the predictor variable.

• The prediction takes the form y = ax + b, where a and b are parameters in the regression model.

Regression Pitfalls

Overfitting

It occurs when the accuracy of the provisional model is not as high on the test set as it is on the training set, often because the provisional model is overfitting the training set.

Missing Values

Missing data has the potential to adversely affect a regression analysis by reducing the total usable sample size.

Excluding Important Predictor Variables

The linear association between two variables ignoring other relevant variables can differ both in magnitude and direction from the association that controls for other relevant variables.

Power and sample size

In small datasets, a lack of observations can lead to poorly estimated models with large standard errors.

Extrapolation

It refers to estimates and predictions of the target variable made using the regression equation with values of the predictor variable outside of the range of the values of x in the data set.

SUBTOPIC 1 OVERVIEW OF REGRESSION

Linear Regression

It is a model that tests the relationship between a dependent variable and a single independent variable. It can also be described using the following expression:

y = a + bX + ϵ

Where

• y – dependent variable
• X – independent (explanatory) variable
• a – intercept
• b – slope
• ϵ – residual (error)

Linear model assumptions

1. The dependent and independent variables show a linear relationship between the slope and the intercept.

2. The independent variable is not random.

3. The mean value of the residual (error) is zero.

4. The variance of the residual (error) is constant across all observations.

5. The value of the residual (error) is not correlated across observations.

6. The residual (error) values follow the normal distribution.

Example problem:

If you want to know the strength of the relationship between House Price and Square Feet, you can use regression.

Identify the dependent and independent variables.

Since we want to predict house price based on square feet, house price is our dependent variable.

House Price Data

Step 1

Identify the dependent and independent variables.

For this example, we are trying to predict the house price. Therefore, it is the dependent variable.
Step 2

Run regression analysis on the data using any system that offers statistical analysis. For this example, we can use Microsoft Excel with the help of the Data Analysis ToolPak.

Step 3

Analyze the results. Take note of the values for the coefficients.

Step 4

Substitute the values of the coefficients into the formula mentioned previously:

house price = 98.25 + 0.1098(sq. ft.)

98.25 is the value of the Intercept and 0.1098 is the value of X Variable 1 from the results provided in Step 3.

Step 5

Insert the value for sq. ft. to predict the house price.

house price = 98.25 + 0.1098(sq. ft.)
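To reproduce this kind of fit outside Excel, here is a sketch using scipy.stats.linregress (SciPy assumed to be installed). The square-footage/price pairs below are invented, so they will not give exactly the 98.25 and 0.1098 coefficients above.

```python
from scipy import stats

# Invented house data: square feet (x) and price in $1000s (y)
sq_ft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

result = stats.linregress(sq_ft, price)   # least squares fit of price on sq_ft

print(f"intercept = {result.intercept:.2f}")
print(f"slope     = {result.slope:.4f}")

# Predict the price of a house with 2000 sq. ft.
print("predicted:", result.intercept + result.slope * 2000)
```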
Multiple Linear

Multiple linear regression analysis is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. The mathematical representation of multiple linear regression is:

y = a + bX1 + cX2 + dX3 + ϵ

Interpretation

• Interpretation of slope coefficient βj: it represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable xj, holding the values of all other independent variables in the model constant.

• The multiple regression equation describes how the mean value of the dependent variable is related to the independent variables.

Multiple Linear Regression Model

• Estimated multiple regression equation: ŷ = b0 + b1x1 + b2x2 + … + bqxq

• The least squares method is used to develop the estimated multiple regression equation.

The Estimation Process for Multiple Regression
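A sketch of estimating a multiple regression equation with scikit-learn (assumed installed); the three predictors and the target below are generated to mirror the y = a + bX1 + cX2 + dX3 + ϵ form and are not real data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 3))    # three independent variables X1, X2, X3
y = 5 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

model = LinearRegression().fit(X, y)   # least squares estimates of a, b, c, d

print("intercept (a):", round(model.intercept_, 2))
print("slopes (b, c, d):", np.round(model.coef_, 2))
```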

Nonlinear Regression

Nonlinear regression is a regression in which


the dependent or criterion variables are
modeled as a non-linear function of model
parameters and one or more independent
variables.

Nonlinear regression can be modeled with several equations similar to the one below:

Regression Analysis

Regression is a technique used for forecasting, time series modeling, and finding the causal effect between variables.

Why use regression?

1) Prediction of a target variable (forecasting).

2) Modeling the relationships between the


dependent variable and the explanatory
variable.

3) Testing hypotheses.
SUBTOPIC 2 LINEAR REGRESSION

Simple Linear Regression

The whole process of linear regression is based on the fact that there exists a relation between the independent variables and the dependent variable.

Simple Linear Regression Model

y = β0 + β1x + ε

Parameters: the characteristics of the population, β0 and β1

Random variable: the error term, ε

- The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.

Possible Linear Regression Graphs

The Estimation Process in Simple Linear Regression

Simple Linear Regression

Regression equation: The equation that


describes how the expected value of y, denoted
E(y), is related to x.

Regression equation for simple linear


regression:

E(y|x) = β0 + β1x

• E(y|x) = expected value of y for a given value


of x

• β0 = y-intercept of the regression line

• β1 = slope

The graph of the simple linear regression


Least Square Method
equation is a straight line
Least squares method: A procedure for using
sample data to find the estimated regression
Simple Linear Regression Model equation.

Estimated simple linear regression equation: • Here, we will determine the values of b0 and
b1 .

• Interpretation of b0 and b1 :
• The slope b1 is the estimated change in the
mean of the dependent variable y that is • = predicted value of the dependent
associated with a one unit increase in the variable for the ith observation
independent variable x.
• n = total number of observations
• The y-intercept b0 is the estimated value of the
dependent variable y when the independent ith residual: The error made using the
variable x is equal to 0. regression model to estimate the mean value
of the dependent variable for the ith
observation.
Miles Traveled and Travel Time (in Hours) for
Ten Butler Trucking Company Driving • Denoted as
Assignments
• Hence,

• Least squares estimates of the regression


parameters:

Slope equation

Scatter Chart of Miles Traveled and Travel


Time in Hours for Sample of Ten Butler y-intercept equation
Trucking Company Driving Assignments

• = value of the independent variable for


the ith observation

• = value of the dependent variable for the


ith observation

• = mean value for the independent


variable

Least Squares Method • = mean value for the dependent variable


Least squares method equation: • n = total number of observations

For the Butler Trucking Company data in


Table 4.1:

• y𝑖 = observed value of the dependent variable for the ith observation

• Estimated slope of b1 = 0.0678

• y-intercept of b0 = 1.2739

• The estimated simple linear regression model: ŷ = 1.2739 + 0.0678x
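The Butler Trucking table itself is not reproduced in these notes, so the sketch below applies the slope and y-intercept equations above in Python to the standard Butler Trucking example values (assumed here to be the same data as Table 4.1); the code reproduces the b1 and b0 quoted above:

# Least squares estimates for simple linear regression:
#   b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Miles traveled (x) and travel time in hours (y) for the ten driving assignments
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

b0, b1 = least_squares(miles, hours)
print(f"estimated model: y-hat = {b0:.4f} + {b1:.4f} * x")   # 1.2739 + 0.0678x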
• Experimental region: The range of values of
the independent variables in the data used to
estimate the model.

• The regression model is valid only over this


region.

• Extrapolation: Prediction of the value of the


dependent variable outside the experimental
region.

• It is risky Multiple Linear

Predicted Travel Time in Hours and the • It is a statistical technique that uses several
Residuals for Ten Butler Trucking Company independent variables to predict the dependent
Driving Assignments variable.

• It is the extension of linear regression.

Regression Question 1

Could the explanatory variables such as number


of cigarettes, cholesterol level, and exercise
predict the Coronary Heart Disease (CHD)
mortality?

Identifying variables
Scatter Chart of Miles Traveled and Travel
Time in Hours for Butler Trucking Company
Driving Assignments with Regression Line
Superimposed

Research Question 2

Does the number of years of psychological


study and number of years of counseling
experience predict clinical psychologists’
Linear Regression in Excel
effectiveness in treating mental illness?

Identifying Variables
Difference Correlation Results

Linear Regression only deals with one


independent variable.

Multiple linear Regression with more than one


independent variable.

Value of R square

Case Study

You want to know the effect of violence, stress,


social support on internalizing behavior.

Details about the study

• Participants were children aging from 8 to 12 Test for overall significance


• Lived in high-violence areas, USA
• Shows if there is a linear relationship between
• Hypothesis: violence and stress lead to all of the X variables taken together and Y
internalizing behavior, whereas social support
• Hypothesis:
would reduce internalizing behavior.

Parameters

• Predictors Results of Significance


• Degree of witnessing violence

• Measure of life stress

• Measure of social support

• Outcome

• Internalizing behavior (i.e. depression,


anxiety)
Substitute the values in the linear equation

In order to test the hypothesis, you can use the


following formula:

Interpretations

• Slopes for Witness and Stress are positive, but


slope for Social Support is negative

• If you had subjects with identical stress and


social support, a one unit increase in Witness
would produce .038 unit increase in
internalizing symptoms

• If Witness = 20, Stress = 5 and SocSupport 35,


then we would predict that the internalizing
symptoms would be .012.
and one or more independent variables by
estimating probabilities using a logistic function,
which is the cumulative logistic distribution.
SUBTOPIC 3 LOGISTIC REGRESSION
• There are other variants of logistic regression
Predicting Categorical Outcome that focus on modeling a categorical variable
with three or more levels, say X, Y, and Z and a
few others.

Concept of Probability

Unlike linear regression, the logistic regression model is concerned with the log of the odds ratio, or the probability of the event happening. Everything starts with the concept of probability.

Logistic Regression

It is a statistical technique used to develop predictive models with categorical dependent variables having dichotomous or binary outcomes.

Similar to linear regression, the logistic regression models the relationship between the dependent variable and one or more independent variables.

Graph of Linear Regression

Graph of Logistic Regression

Example Scenario

Let's say that the probability of success of some event is 0.8. Then the probability of failure is 1 – 0.8 = 0.2. The odds of success are defined as the ratio of the probability of success over the probability of failure.

Logistic Regression Equation

In our example, the odds of success are 0.8/0.2 = 4. That is to say that the odds of success are 4 to 1. If the probability of success is 0.5, that is, a 50-50 percent chance, then the odds of success are 1 to 1.

To predict the probability of the event to happen, we can further solve the preceding equation as follows:
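The equation itself appears as an image in the original slides. As a small numerical sketch of the same ideas (probability to odds to log-odds, and the logistic function that maps a linear combination back to a probability; the intercept and slope below are hypothetical):

import math

# Probability of success from the example
p = 0.8
odds = p / (1 - p)            # 0.8 / 0.2 = 4, i.e. odds of 4 to 1
log_odds = math.log(odds)     # the "log of odds" (logit) that logistic regression models
print(f"odds = {odds:.1f}, log-odds = {log_odds:.3f}")

# Going the other way: the logistic (sigmoid) function turns a linear
# combination a + b*X back into a probability between 0 and 1.
def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and slope, for illustration only
a, b = -2.0, 0.5
for x in [0, 2, 4, 6, 8]:
    print(f"x = {x}: predicted probability = {logistic(a + b * x):.3f}")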

Maximum-likelihood estimation

It is a method of estimating the parameters of a


Logistic Regression Assumptions statistical model with given data. The method of
• Logistic regression measures the relationship maximum likelihood selects the set of values of
between the categorical dependent variable the model parameters that maximizes the
likelihood function, that is, it maximizes the
“agreement” of the selected model with the • The significance of the estimate can be
observed data. determined if the probability of the event
happening by chance is less than 5%.

How do we assess the goodness of fit or accuracy of the model?

Two ways to validate model accuracy

Building Logistic Regression Model

You can perform logistic regression using R by using the glm() function. The family = "binomial" argument tells R to use the glm() function to fit a logistic regression model. (The glm() function can fit other models too; we'll look into this later.)
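A rough Python counterpart of that R call, shown only as a sketch with a made-up binary data set and the statsmodels library (the module itself works in R with glm()):

import numpy as np
import statsmodels.api as sm

# Made-up data: one predictor x and a binary outcome y (e.g. good/bad quality)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Add an intercept column, then fit a binomial GLM (logistic regression),
# mirroring glm(y ~ x, family = "binomial") in R.
X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()

print(model.summary())     # coefficients, standard errors, z values, p-values
print(model.predict(X))    # predicted probabilities for the training data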

Coefficients results

What is confusion matrix?

A confusion matrix is a table used to analyze the


performance of a model (classification). Each
column of the matrix represents the instances
in a predicted class while each row represents
Interpreting Results the instances in an actual class or vice versa.

• A positive estimate indicates that, for every What is ROC Curve?


unit increase of the respective independent The Receiver Operating Characteristic (ROC)
variable, there is a corresponding increase in curve is a standard technique to summarize
the log of odds ratio and the other way for a classification model performance over a range
negative estimate. of trade-offs between true positive (TP) and
• Along with the independent variables, we false positive (FP) error rates. The ROC curve is
also see 'Intercept'. Intercept is the log of odds a plot of sensitivity (the ability of the model to
of the event (Good or Bad Quality) when we predict an event correctly) versus 1-specificity
have all the categorical predictors having a for the possible cutoff classification probability
value as 0. values.

• We can see the standard error, z value, and p- Validate model using confusion matrix
value along with an asterisk indication to easily Let’s assume that we acquire the following
identify significance. results after running logistic regression to our
• We then determine whether the estimate is sample data set:
truly far away from 0. If the standard error of
the estimate is small, then relatively small
values of the estimate can reject the null
hypothesis.
Columns in confusion matrix
• If the standard error is large, then the
estimate should also be large enough to reject • True Positive (TP): When it is predicted as
the null hypothesis. TRUE and is actually TRUE

Testing the significance • False Positive (FP): When it is predicted as


TRUE and is actually FALSE
• To test the significance, we use the 'Wald Z
Statistic' to measure how many standard • True Negative (TN): When it is predicted as
deviations the estimate is away from 0. FALSE and is actually FALSE
• False Negative (FN): When it is predicted as FALSE and is actually TRUE
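A small sketch (hypothetical counts, not the module's sample data set) showing how the usual metrics are computed from these four counts:

# Hypothetical confusion-matrix counts
TP, FP, TN, FN = 40, 10, 35, 15

accuracy = (TP + TN) / (TP + TN + FP + FN)   # overall proportion predicted correctly
precision = TP / (TP + FP)                   # of predicted positives, how many are real
recall = TP / (TP + FN)                      # sensitivity / true positive rate
specificity = TN / (TN + FP)                 # true negative rate

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, specificity={specificity:.2f}")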
Substitute the values

The preceding confusion matrix is overlaid with the nomenclature and is showcased here:

Confusion Matrix Formula

An exhaustive list of metrics that are usually computed from the confusion matrix to aid in interpreting the goodness of fit for the classification model are as follows:

Interpreting ROC Curve

Interpreting the ROC curve is again straightforward. The ROC curve visually helps us understand how our model compares with a random prediction. A random prediction will always have a 50% chance of predicting correctly; by comparing against this baseline, we can understand how much better our model is. The diagonal line indicates the accuracy of random predictions, and the lift from the diagonal line towards the upper left corner indicates how much improvement our model has in comparison to random predictions. Models having a higher lift from the diagonal are considered to be more accurate models.

FORMATIVE 25/30

It is a model that tests the relationship between a dependent variable and a single independent variable.

- Linear regression

Supposed you want to train a machine to help you predict how long it will take you to drive home from your workplace using regression. What type of data mining approach should you use?

- Supervised Learning

In small datasets, a lack of observations can lead to poorly estimated models with large standard errors.

- True

Multiple linear regression is classified as supervised learning.

- True

It is computed as the ratio of the two odds.

- Odds ratio

Multilinear regression is a regression in which the dependent or criterion variables are modeled as a non-linear function of model parameters and one or more independent variables.

- False

Regression is a technique used for forecasting, time series modeling and finding the casual effect between the variables.

- False

It is used to compare two nested generalized linear models.

- Likelihood ratio test

The lower is the AUC value, the worse is the model predictive accuracy.

- True
Which of the following is not a reason when - Explanatory variable
to use regression?
- When you aim to know the products
The formula below describes
that are usually bought together.

It is a regression in which the dependent or


criterion variables are modeled as a non-
linear function of model parameters and one
- Multiple linear regression
or more independent variables.
Regression is a technique used for
- Nonlinear Regression
forecasting, time series modeling, and
It is essentially similar to the simple linear finding the casual effect between the
model, with the exception that multiple variables.
independent variables are used in the model.
- False
- Multiple Regression
It is the tool pack in Microsoft Excel that can
There should not be any multicollinearity be downloaded to perform linear regression.
between the independent variables in the
- Data Analysis Tool Pack
model, and all independent variables should
be independent to each other A group of middle school students wants to
know if they can use height to predict age.
- True
The explanatory variable is height.
If a predictor variable x is found to be highly
- True
significant we would conclude that:
Multiple linear regression can be expressed
- a change in x causes a change in y
(changes in x are associated to changes
in y) as 
Overfitted data can significantly lose the - False
predictive ability due to an erratic response
to noise whereas underfitted will lack the The _____________ are defined as the ratio of
accuracy to account for the variability in the probability of success over the
response in its entirety. probability of failure. (Use lowercase for
your answer).
- True
- odds ratio
Given the results for multiple linear
regression below. Identify the estimated Multiple linear regression is classified as
regression equation. unsupervised learning.

- False

The formula above is used to model


nonlinear regression.

- exam score = 5.56*hours + - False


prep_exams*-0.60 + 67.67
It is a model that tests the relationship
It is the variable being manipulated by between a dependent variable and a single
researchers. independent variable.
- Linear Regression Logistic regression is classified as
supervised learning.
When it is predicted as TRUE and is actually
TRUE. -True

- True Positive If a predictor variable x is found to be highly


significant we would conclude that:
If a predictor variable x is found to be highly
significant we would conclude that: a change -changes in x are associated to changes in
y
in y causes a change in x.
It is the variable being predicted.
- False (changes in x are associated to
changes in y) -Dependent variable
When it is predicted as FALSE and is actually In logistic regression, it is that the target
FALSE. variable must be discrete and mostly binary
or dichotomous.
- False Negative
-True
Response variable is the variable being
manipulated by researchers. It is the tool pack in Microsoft Excel that can
be downloaded to perform linear regression.
- False
-Data Analysis Tool Pak
It is essentially similar to the simple linear
FORMATIVE 25/30 model, with the exception that multiple
independent variables are used in the
A linear regression analysis was model.
conducted to predict the company sales.
Below are the results. Identify the linear -Multiple regression
regression equation assuming that x is
It pertains to the accuracy of the provisional
the parameter for advertising.
model is not as high on the test set as it is
on the training set. (Use UPPERCASE for
your answer)
-OVERFITTING
Extrapolation of the range of values of the
independent variables in the data used to
estimate the model.
-True
In logistic regression, it is that the target
variable must be discrete and mostly binary
- sales = 23.42x + intercept
or dichotomous.
-True
The higher is the AUC value, the better is
It is the tool pack in Microsoft Excel that can
the model predictive accuracy.
be downloaded to perform linear regression.
-True
-Data Analysis Tool Pack
The explanatory variable is the variable
Multiple linear regression is classified as
being predicted.
unsupervised learning.
-False
-False
A researcher believes that the origin of the -True Negative
beans used to make a cup of coffee affects
hyperactivity. He wants to compare coffee Logistic regression measures the
from three different regions: Africa, South relationship between the ____________
America, and Mexico. What is the response dependent variable and one or more
variable on this study? independent variables. (Use lowercase
for your answer)
-Hyperactivity level
categorical
The graph of the simple linear regression -categorical
A group of middle school students wants to
equation is a straight line
know if they can use height to predict age.
-True The explanatory variable is height.
-True
Given the results for multiple linear
regression below. Identify the estimated Which of the following evaluation metrics
regression equation. can’t be applied in the case of logistic
regression output to compare with the
target?
-Mean-Squared-Error

-exam score = 5.56*hours + prep_exams*-


0.60 + 67.67

There should not be any multicollinearity


between the independent variables in the
model, and all independent variables should
be independent to each other
-True
It is essentially similar to the simple linear
Logistic regression is classified as one of model, with the exception that multiple
the examples of unsupervised learning. independent variables are used in the
model.
-False
-Multiple linear regression
It is essentially similar to the simple linear Which of the following methods do we use
model, with the exception that multiple to best fit the data in Logistic Regression?
independent variables are used in the -Maximum Likelihood
model.
What type of relationship is shown in the
-Multiple Regression graph below?

The graph of the simple linear regression


equation is a straight line.
-True
Multiple linear regression can be expressed
as y=a+bX+ ϵ

-False
When it is predicted as FALSE and is
actually FALSE.
2. Overfitted data can significantly lose
the predictive ability due to an erratic
response to noise whereas
underfitted will lack the accuracy to
account for the variability in
response in its entirety.
 True

3. A group of middle school students


wants to know if they can use height
to predict age. What is the response
variable in this study
 Height
 The type of school
 Age
(https://online.stat.psu.edu/stat200/le
sson/1/1.1/1.1.2#:~:text=A%20group
%20of%20middle%20school,use
%20height%20to%20predict
%20age.&text=The%20students
%20want%20to%20use,the
%20response%20variable%20is
%20age.)
-Undefined
 The number of students

A regression line is used for all of the 4. Multilinear regression is a regression


following except one. Which one is not a in which the dependent or criterion
valid use of a regression line? variables are modeled as a non-
linear function of model parameters
-to determine if a change in X causes a and one or more independent
change in Y. variables.
 False

5. It is the variable being manipulated


by researchers.
 Explanatory Variable
 Factor
 Response Variable
 Dependent Variable

6. Multiple linear regression is


24/30 classified as supervised learning.
 True
1. Regression is a famous supervised
learning technique.
 True
7. Research question: Do fourth
graders tend to be taller than the
third graders?
This is an observational study. The America, and Mexico. What is the
researcher wants to use grade level to explanatory variable in this
explain differences in height. What is the study?
response variable in this study?
 The taste of the coffee
 Gender
 The cups of coffee taken
 Age
 Origin of the coffee
 Grade level
 Hyperactivity level
 Height

13. 1The value of the residual (error) is


8. If p value for model fit is less than not correlated across all
0.5, then signify that our full observations.
model fits significantly better than  True
our reduced model.
14. There should not be any
 False multicollinearity between the
independent variables in the model,
9. It is used to compare two nested and all independent variables should
generalized linear models. be independent to each other
 Odds ratio  False
 Hosmer-lemeshow test
 Likelihood ratio test
 ROC Curve 15. It is the variable being predicted.
 Explanatory variable
 Dependent variable
10. In linear models, the dependent
and independent variables show  Independent variable
a linear relationship between the  Factor
slope and the intercept
16. The graph of the simple linear
 True regression equation is a straight line
 True
11. Supposed you want to train a
machine to help you predict how
17. It is the tool pack in Microsoft Excel
long it will take you to drive home
that can be downloaded to perform
from your workplace. What type
linear regression.
of data mining approach should
 Data Analysis Tool Pack
you use
 Test Mining Tool Pack
 Unsupervised Learning  Regression Tool Pack
 Data Mining Tool Pack
 Supervised learning
18. It is a point that lies far away from
the rest.
12. A researcher believes that the  Outliers
origin of the beans used to make  Dummy variables
a cup of coffee affects  Residual
hyperactivity. He wants to
compare coffee from three 19. The formula below describes
different regions: Africa, South
y=a+bX_1+cX_2+ 〖dX〗_3+ ϵ
 Linear Regression  Nonlinear regression
 Multiple Linear Regression
 Nonlinear Regression
26. Logistic regression is classified as
supervised learning.
20. Logistic regression is classified as
 True
one of the examples of unsupervised
learning.
 False
27. In the equation

21. It is a table used to analyze the


y=a+bX+ ϵy=a+bX+ ϵ
performance of a model A and b are considered as independent
(classification). (Use lowercase) variables.
 confusion matrix
 False

22. The family = ______________


command tells R to use the glm 28. Shown below is a scatterplot of Y
function to fit a logistic regression versus X. Which of the following is
model. (Use lowercase for your the approximate value for R2?
answer to supply the code)  99.5%
 logit regression  50%
23. It is a model that tests the  25%
relationship between a dependent  0%
variable and a single independent
variable.
 Linear Regression
 Least squared method
 Multiple regression
 Nonlinear regression

24. y=a+bX_1+cX_2+ 〖dX〗_
3+ ϵy=a+bX_1+cX_2+ 〖dX〗_3+
ϵ
The formula above is used to model
nonlinear regression.
 False

29. Nonlinear is the extension of linear


25. It is a regression in which the regression
dependent or criterion variables  False
are modeled as a non-linear
30. A group of middle school students
function of model parameters and
wants to know if they can use height to
one or more independent
predict age. The response variable in
variables.
this study is age.
 Linear Regression  True
 Least squared method
 Multiple regression
ADDITIONAL FORMATIVE independent variables are used in the model.
Multiple Regression
It is essentially similar to the simple
The value of the residual (error) is not
linear model, with the exception that
correlated across all observations. True
multiple independent variables are used
in the model. Multiple regression Given the results for multiple linear
regression below. Predict the exam score if a
Which of the following is not a reason student spent hours studying and had 2 prep
when to use regression? When you aim exams taken..
to know the products that are usually
bought together.
It pertains to the accuracy of the
105.39
provisional model is not as high on the
test set as it is on the training set. (Use Logistic regression is classified as supervised
UPPERCASE for your answer) learning. True
OVERFITTING In logistic regression, it is that the target
variable must be discrete and mostly binary
It is used to explain the variability in the
or dichotomous. True
response variable. response variable
It can be utilized to assess the strength of
It is used to measure the binary or the relationship between variables and for
dichotomous classifier performance modeling the future relationship between
visually and Area Under Curve (AUC) is them Regression Analysis
used to quantify the model performance. If p value for model fit is less than 0.5, then
ROC Curve signify that our full model fits significantly
better than our reduced model. False
Missing data has the potential to
adversely affect a regression analysis by Factor is the variable being predicted. False
reducing the total usable sample size. It is a model that tests the relationship
True between a dependent variable and a
single independent variable. Linear
In regression, the value of the residual Regression
(error) is one. True False
What type of relationship is shown in the
The value of the residual (error) is not graph below? Negative Linear Relationship
correlated across all observations. True
It is the variable of primary interest.
response variable
If p value for model fit is less than 0.5, then
signify that our full model fits significantly
better than our reduced model. False
It is used to explain the variability in the
response variable parameter
The value of the residual (error) is one false

The value of the residual (error) is correlated


across all observations. False It is essentially similar to the simple linear
model, with the exception that multiple
It is essentially similar to the simple linear independent variables are used in the model.
model, with the exception that multiple Multiple linear regression
Logistic regression is used to predict It is used to develop the estimated multiple
continuous target variable. True regression. Multiple Regression
Response variable is the variable being The graph for a nonlinear relationship is
manipulated by researcher. False often a straight line false

Logistic regression is classified as It is essentially similar to the simple linear


supervised learning. True model, with the exception that multiple
independent variables are used in the model.
A researcher believes that the origin of Multiple Regression.
the beans used to make a cup of coffee
 In linear regression, it is that the target
affects hyperactivity. He wants to
variable must be discrete and mostly binary
compare coffee from three different or dichotomous. True
regions: Africa, South America, and
Mexico. What is the explanatory variable What type of relationship is shown in the
in this study? Origin of the coffee graph above? No relationship

The lower is the AUC value, the worse is the The value of the residual (error) is not
model predictive accuracy. True correlated across all observations. True

Regression is a famous supervised learning Extrapolation of the range of values of the


technique. True independent variables in the data used to
estimate the model. True
In regression, the value of the residual
(error) is one. False A group of middle school students wants to
know if they can use height to predict age.
How will you express the equation of a What is the explanatory variable in this
regression analysis where you aim to predict study? height
the value of y based on x. (Use lowercase
and avoid so much space) y=b0+b1*x1 Supposed you want to train a machine to
help you predict how long it will take you to
Regression is a technique used for drive home from your workplace. What type
forecasting, time series modeling, and of data mining approach should you use?
finding the causal effect between the Supervised Learning
variables. False
Multiple linear regression is classified as
Supposed you want to train a machine to unsupervised learning. False
help you predict how long it will take you to
drive home from your workplace using It is a point that lies far away from the rest.
Outliers
regression. What type of data mining
approach should you use? Supervised When it is predicted as TRUE and is actually
Learning TRUE. True Positive
Multilinear regression is a regression in Two variables are correlated if there is a
which the dependent or criterion variables linear association between them. If not, the
are modeled as a non-linear function of variables are uncorrelated. True
model parameters and one or more
independent variables. True Multiple linear regression is classified as
unsupervised learning. False
Multilinear regression is a regression in
which the dependent or criterion variables What type of relationship is shown in the
are modeled as a non-linear function of graph below? Negative Linear Relationship
model parameters and one or more
independent variables. False
If a predictor variable x is found to be highly
significant we would conclude that: changes
in x are associated to changes in y
you to drive home from your workplace
using regression. What type of data
mining approach should you use?
Supervised Learning

In the equation y=a+bX+ ϵ What is


considered as a dependent parameter/s?
y

the graph for a nonlinear relationship is


often a straight line. True
In linear models, the dependent and
Logistic regression is used to predict independent variables show a linear
continuous target variable. True relationship between the slope and the
intercept true
You predicted negative and it’s false. False
Negative The formula below describes
In ROC Curve, models having a higher lift
y=a+bX_1+cX_2+ 〖dX〗_3+ ϵy=
from the diagonals are considered to be
more accurate models. True a+bX_1+cX_2+ 〖dX〗_3+ ϵ
Multiple linear regression
y=a+bX_1+cX_2+ 〖dX〗_3+ ϵy= A group of middle school students wants
a+bX_1+cX_2+ 〖dX〗_3+ ϵ to know if they can use height to predict
age. What is the response variable in this
The formula above is used to model study? Age
nonlinear regression. False
It is a regression in which the dependent or
Logistic regression is classified as criterion variables are modeled as a non-
supervised learning. True linear function of model parameters and one
or more independent variables. Nonlinear
It is essentially similar to the simple Regression
linear model, with the exception that
multiple independent variables are used Slope is the point that lies far away from
in the model. Multiple regression the rest. True
Nonlinear is the extension of linear
The lower is the AUC value, the worse is
regression. False
the model predictive accuracy. True
In logistic regression, it is that the target
Multiple linear regression is classified as
variable must be discrete and mostly binary
supervised learning. True
or dichotomous. True
It is the variable of primary interest.
Parameter Given the results for multiple linear
regression below. Predict the exam score
A group of middle school students wants to if a student spent
know if they can use height to predict age.
What is the response variable in this study? 10 hours in studying and had 4 prep
Age exams taken.  
If p value for model fit is less than 0.5, then
signify that our full model fits significantly
better than our reduced model. False

Supposed you want to train a machine to 120.98


help you predict how long it will take
It can be utilized to assess the strength of Which of the following methods do we use
the relationship between variables and for to best fit the data in Logistic Regression?
modeling the future relationship between Maximum Likelihood
them. Regression Analysis
When it is predicted as TRUE and is actually
Logistic regression is classified as one of TRUE. True Positive
the examples of unsupervised learning.
False Residual is a point that lies far away from
the rest true
Research question: Do fourth graders
tend to be taller than third graders? You predicted negative and it’s false. False
Negative.
This is an observational study. The
How do we assess the goodness of fit or
researcher wants to use grade level to
accuracy of the model in logistic regression?
explain differences in height. What is the
(Use lowercase for your answer) roc curve
explanatory variable on this study? grade
level The graph for a nonlinear relationship is
often a straight line. True
The family = ______________ command tells
R to use the glm function to fit a logistic How will you express the equation of a
regression model. (Use lowercase for your regression analysis where you aim to predict
answer to supply the code) glm() the value of y based on x. Y=a+bX
What type of relationship is shown in the
It is used to explain the variability in the
graph below? No relationship
response variable parameter

Research question: Do fourth graders


tend to be taller than the third graders?
This is an observational study. The
researcher wants to use grade level to
explain differences in height. What is the
response variable in this study? Height
It can be utilized to assess the strength
of the relationship between variables
and for modeling the future relationship
between them. Regression Analysis
It is used to develop the estimated
It is a regression in which the dependent or multiple regression. Multiple Regression
criterion variables are modeled as a non-
It refers to estimates and predictions of
linear function of model parameters and one
or more independent variables. Nonlinear the target variable made using the
Regression regression equation with values of the
predictor variable outside of the range of
Which of the following evaluation metrics the values of x in the data set. (Use
can’t be applied in the case of logistic UPPERCASE for your answer)
regression output to compare with the
PREDICTION
target? Mean-Squared-Error
Nonlinear is the extension of linear The formula below describes
regression. False
y= (b_1+ b_2 x_1+b_3 x_2+b_4
x_3)/(1+ b_5 x_1+b_6 x_2+ b_7
x_3 )
Nonlinear regression option is true for such a case? odds will
be 1
A researcher believes that the origin of the It is essentially similar to the simple
beans used to make a cup of coffee affects linear model, with the exception that
hyperactivity. He wants to compare coffee multiple independent variables are used
from three different regions: Africa, South
in the model. Multiple linear regression
America, and Mexico. What is the response
variable in this study? Hyperactivity level It is a regression in which the dependent
The graph of the simple linear regression or criterion variables are modeled as a
equation is a straight line true non-linear function of model parameters
and one or more independent variables
In the equation Multiple Regression

y=a+bX+ ϵy=a+bX+ ϵ Shown below is a scatterplot of Y versus


X. Which of the following is the
What is considered as the parameter/s? approximate value for R2? 99.5%
a,b

In logistic regression, it is that the target


variable must be discrete and mostly
binary or dichotomous. True
It is used to explain the variability in the
response variable parameter
Logistic regression is classified as one of
the examples of unsupervised learning.
False
Regression is a technique used for
forecasting, time series modeling, and
finding the casual effect between the
variables. True
If a predictor variable x is found to be
There should not be any multicollinearity highly significant we would conclude
between the independent variables in that: a change in y causes a change in x.
the model, and all independent variables false
should be independent to each other
Regression is a famous supervised
true
learning technique. True
It is the range of values of the
exam score = 5.56*hours + prep_exams*-0.60
independent variables in the data used
+ 67.67
to estimate the model. Experimental region
It is a statistical technique that uses several
The value of the residual (error) is independent variables to predict the
correlated across all observations. False dependent variable. Multiple linear
Two variables are correlated if there is a regression
linear association between them. If not, It is the tool pack in Microsoft Excel that can
the variables are uncorrelated. True be downloaded to perform linear regression.
Data Analysis Tool Pack
Suppose you have been given a fair coin
and you want to find out the odds of Compute for the accuracy of the model
getting heads. Which of the following depicted in the confusion matrix below;
TP = 20 
TN = 27
FP = 18
FN = 25
52%

It is considered as the log of odds of the MODULE 4 TIME SERIES AND FORECASTING
event. (Use lowercase for your answer)
intercept Time series analysis was introduced by Box and
Jenkins (1976) to model and analyze time series
Supposed your model was able to predict data with autocorrelation.
that a certain student fails the certification
exam and she actually is not. False Negative Time series data consist of data observations
Two variables are correlated if there is a over time.
linear association between them. If not, the
Application of Time Series
variables are uncorrelated. True
• Predicting stock prices

• Airline fares

• Labor force size

• Unemployment data

• Natural gas price

Jollibee: Forecasted Daily Closing Price

U.S Regular Gasoline Price Prediction

Key Ideas in Forecasting


• The objective of time series analysis is to
uncover a pattern in the time series and then
extrapolate the pattern into the future.

• The forecast is based solely on past values of


the variable and/or on past forecast errors.

• Modern data-collection technologies have


enabled individuals, businesses, and
Sample Scenario: Gasoline Sales Time Series
government agencies to collect vast amounts of
Plot after obtaining the Contract with the
data that may be used for causal forecasting.
Vermont State Police
• Time series: A sequence of observations on a
variable measured at successive points in time
or over successive periods of time.

• The measurements may be taken every


hour, day, week, month, year, or any other
regular interval. The pattern of the data is an
important to understand the series’ past
behavior.

• If the behavior of the times series data of


the past is expected to continue in the
Time Series Patterns
future, we can use it to guide us in selecting
an appropriate forecasting method. • Trend pattern: Gradual shifts or movements
to relatively higher or lower values over a
Sample Scenario: Gasoline Sales Time Series
longer period of time.

• A trend is usually the result of long-term


factors such as population increases or
decreases, shifting demographic
characteristics of the population, improving
technology, changes in the competitive
landscape, and/or changes in consumer
preferences.

Sample Scenario: Gasoline Sales Time Series Bicycle Sales Time Series
Plot

Bicycle Sales Time Series Plot

Sample Scenario: Gasoline Sales Time Series


after obtaining the Contract with the Vermont
State Police
Cholesterol Drug Revenue Time Series
Quality Smartphone Sales Time Series Plot

Cholesterol Drug Revenue Time Series Plot

Forecasting Methodologies

 Moving Average
 Exponential Smoothing

SUBTOPIC 1 OVERVIEW OF TIME SERIES


ANALYSIS
Time Series Patterns
• Time series data is a sequence of observations
• Seasonal pattern: Recurring patterns over collected from a process with equally spaced
successive periods of time. periods of time.

• Example: A manufacturer of swimming • It establish relation between “cause” and


pools expects low sales activity in the fall and “effect”
winter months, with peak sales in the spring
• One variable is “Time” which is considered as
and summer months to occur every year.
the independent variable and the second is
• Time series plot not only exhibits a seasonal “Data” also known as the dependent variable.
pattern over a one-year period but also for
Application of Time Series in Real-Life
less than one year in duration.
Forecast Population
• Example: daily traffic volume shows
within-the-day “seasonal” behavior

Umbrella Sales Time Series


Germany Population from 2010-2019

Understanding the time series data

Time series data is expected to have two


variables: time and data

Benefits of Time Series Analysis

Through Time Series Analysis, businessmen can


predict about the changes in economy.
Furthermore, it could also help in determining
the following:

• Stock Market analysis

• Risk analysis and evaluation

• Census analysis

• Budgetary analysis

• Inventory studies
What are other examples of time series data?
• Sales Forecasting
1. Daily data on sales

2. Monthly inventory
Components of Time Series
3. Daily customers
Secular Trend
4. Monthly interest rates, cost
The increase or decrease in the movements of a
5. Monthly unemployment rates time series is called secular trend. A time series
6. Weekly measures of money supply data may show upward trend or downward
trend for a period of years and this may be due
7. Daily closing prices of stock indexes, and to factors like:
soon.
• Increase of population
Plot time series data
• Change in technological progress
• Large scale shift in consumers demands

Seasonal variation

It is a short-term fluctuation in a time series


which occur periodically in a year.

Examples: Forecast Error

• More woolen clothes are sold in winter than Measures to determine how well a particular
in the season of summer forecasting method is able to reproduce the
time series data that are already available
• Each year more ice creams are sold in summer
and very little in Winter season

Cyclical Variations Forecast Error: Difference between the actual


and the forecasted values for period t.
These are recurrent upward or downward
movements in a time series but the period of
cycle is greater than a year.

Mean Forecast Error: Mean or average of the


forecast errors.

Irregular Variations Forecast Error Cont.

These are fluctuations in the time series that • Measures to determine how well a particular
are short in duration, erratic in nature and forecasting method is able to reproduce the
follow no regularity in the occurrence pattern. time series data that are already available.

Forecasting Methodologies

 Moving Average
 Exponential Smoothing

Application available to perform TSA

 Tableau
 Excel
 SAS 9.4
 SAP Analytics

Forecast Accuracy

Mean Absolute Error (MAE): Measure of forecast accuracy that avoids the problem of positive and negative forecast errors offsetting one another.

Mean Squared Error (MSE): Measure that avoids the problem of positive and negative errors offsetting each other; it is obtained by computing the average of the squared forecast errors.
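A short sketch (made-up actual and forecast values, not the module's data) of the measures just defined — the per-period forecast error, MAE, and MSE, plus the RMSE used later in the exponential smoothing examples:

# Made-up actual values and forecasts for the same periods
actual = [20, 25, 23, 28, 30]
forecast = [22, 24, 25, 27, 29]

errors = [a - f for a, f in zip(actual, forecast)]    # forecast error per period
mae = sum(abs(e) for e in errors) / len(errors)       # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)       # mean squared error
rmse = mse ** 0.5                                     # root mean squared error

print("errors:", errors)
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")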
Computing Forecasts and Measures of Forecast
Accuracy using the most recent Value as the
Forecast for the next Period.
Computing Forecasts and Measures of Forecast Business Forecasting
Accuracy using the Average of all the Historical
Airline Passengers 1990-2004
Data as the Forecast for the next Period

Univariate Time Series Models The airline data illustrate seasonality, because
Univariate time series models are models used there are seasonal patterns with respect to
when the dependent variable is a single time when people fly. For example, more people fly
series. in August than in February. The data illustrate
trend, because the number of passengers is
Multivariate Time Series Models increasing from year to year. The data illustrate
the effect of events, because the events of
Multivariate time series models are used when
September 11, 2001, had a significant negative
there are multiple dependent variables. In
impact on airline travel.
addition to depending on their own past values,
each series may depend on past and present Consider the above data restricted to years
values of the other series. 1994 through 1997. This restricted period
allows you to focus on trend and seasonality
Modeling U.S. gross domestic product, inflation,
without having to worry about the effects of
and unemployment together as endogenous
external events. To validate data preparation
variables is an example of a multivariate time
and to formulate forecast models for a time
series model.
series, a time plot of the data is produced.
Why use time series analysis?
Model Diagnostic Statistics
Advantages

• Reliability – time series uses collected


historical data over a period of time.

• Seasonal patterns – since TSA deals with


periodic data, it is easy to predict seasonal
pattern.

• Estimation of trends – the graph of TSA makes


it easy to visualize the increase or decrease in
sales, production, etc.

• Growth – through depicting patterns, TSA When the forecast value is derived using a
helps in measuring the financial growth. linear regression model, the square of Pearson’s
correlation coefficient is equivalent to the
popular 2 R (R-Square) statistic. In practice,
Supplementary Pearson’s correlation and average error are
seldom used as primary accuracy diagnostics. rarely able to carry out a simulated
These measures will be used in this work to retrospective study because of data limitations.
motivate the accuracy measures that are
Rules of Thumb
commonly used.
 At least four time points are required
R-Square is never appropriate for evaluating
for every parameter to be estimated in
time series models. Some sources redefine R-
a model.
Square so that it is more appropriate for time
 Anything above the minimum series
series modeling. The above definition can
length can be used to create a holdout
actually lead to negative values if the model is
sample.
nonlinear, which is a contradiction for a squared
quantity based on non-complex numbers. The  Holdout samples should rarely contain
Time Series Forecasting System provides some over 25% of the series.
alternative R-Square measures.

Simulating a Retrospective Study Summary of Data Used for Forecast Model


1. Divide the time series data into two Building
segments. The fit sample is used to derive a Fit Sample
forecast model. The holdout sample will be
used to evaluate forecast accuracy.  Used to estimate model parameters for
accuracy evaluation
2. Derive a set of candidate models.  Used to forecast values in holdout
3. Calculate the chosen model accuracy statistic sample
by forecasting the holdout sample. Holdout Sample
4. Pick the model with the best accuracy  Used to evaluate model accuracy
statistic.  Simulates retrospective study
Before obtaining the data, as stated above, you Full = Fit + Holdout data is used to fit
must select one of the accuracy statistics based deployment model
on the business problem. In marketing and
inventory management, MAPE is commonly For model deployment or for generating
chosen. RMSE is commonly used in all areas of forecasts for decision making, the chosen model
business. R-Square is also very common, but as is refit to the entire data set.
you have seen, R-Square should never be used
Three Environments for Forecasting Using SAS
to evaluate forecast models.
Software*
The simulated retrospective study simulates
 The SAS Windowing Environment
what would have been experienced if the
 The Time Series Forecasting System
forecast model had been implemented on the
 SAS Enterprise Guide
date of the first series value in the holdout
sample. Additional Supplementary

The use of a simulated retrospective study to A time series is simply a series of data points
pick a model is known as honest assessment. ordered in time. In a time series, time is often
Unfortunately, honest assessment is not the independent variable and the goal is usually
presented in many modern forecasting to make a forecast for the future.
textbooks. Because many textbook data sets are
Autocorrelation
small, honest assessment is often not possible.
In your work, you may discover that you are
Informally, autocorrelation is the similarity factor. Of course, this is useful if you notice
between observations as a function of the time seasonality in your time series.
lag between them.
Seasonal autoregressive integrated moving
Seasonality average model (SARIMA)

Seasonality refers to periodic fluctuations. For SARIMA is actually the combination of simpler


example, electricity consumption is high during models to make a complex model that can
the day and low during night, or online sales model time series exhibiting non-stationary
increase during Christmas before slowing down properties and seasonality.
again.

Stationarity

Stationarity is an important characteristic of


time series. A time series is said to be stationary
if its statistical properties do not change over
time. In other words, it has constant mean and
variance, and covariance is independent of time.

Modelling time series

There are many ways to model a time series in SUBTOPIC 2 MOVING AVERAGE
order to make predictions. Here, I will present: Concept of averaging methods
 moving average • If a time series is generated by a constant
 exponential smoothing process subject to random error, then mean is a
 ARIMA useful statistic and can be used as a forecast for
the next period.
Moving average
• Averaging methods are suitable for stationary
The moving average model is probably the most
time series data where the series is in
naive approach to time series modelling. This
equilibrium around a constant value ( the
model simply states that the next observation is
underlying mean) with a constant variance over
the mean of all past observations.
time.
Exponential smoothing
What is moving average?
Exponential smoothing uses a similar logic to
• It is a technique that calculates the overall
moving average, but this time, a
trend in sales volume from historical data of the
different decreasing weight  is assigned to each
company.
observations. In other words, less importance  is
given to observations as we move further from • This technique is well-known when
the present. forecasting short-term trends.
Double exponential smoothing

Double exponential smoothing is used when


there is a trend in the time series. In that case,
we use this technique, which is simply a Sample scenario
recursive use of exponential smoothing twice.

Triple exponential smoothing

This method extends double exponential


smoothing, by adding a seasonal smoothing
• When new data becomes available , the
forecast for time t+2 is the new mean including
the previously observed data plus this new
observation.

• This method is appropriate when there is no


noticeable trend or seasonality.

Assumptions on Moving Average

• The moving average for time period t is the


mean of the “k” most recent observations.

• The constant number k is specified at the


outset.

• The smaller the number k, the more weight is


given to recent periods.

• The greater the number k, the less weight is


given to more recent periods.

• A large k is desirable when there are wide,


infrequent fluctuations in the series.

• A small k is most desirable when there are


sudden shifts in the level of series.

• For quarterly data, a four-quarter moving


Applying Moving average average, MA(4), eliminates or averages out
The mean sales for the first five years (2003- seasonal effects.
2007) is calculated by finding the mean from Reminders
the first five years.
• For monthly data, a 12-month moving
average, MA(12), eliminate or averages out
seasonal effect.

• Equal weights are assigned to each


• Compute for the second subset of 5 years
observation used in the average.
(2004-2008) 𝑎𝑣𝑒 𝑜𝑓 𝑠𝑢𝑏𝑠𝑒𝑡2 = $6𝑀 + $5𝑀 +
$8𝑀 + $9𝑀 + $5𝑀 5 𝑎𝑣𝑒 𝑜𝑓 𝑠𝑢𝑏𝑠𝑒𝑡 2 = $6.6𝑀 • Each new data point is included in the average
as it becomes available, and the oldest data
• Get the average of the third subset (2005-
point is discarded.
2009) 𝑎𝑣𝑒 𝑜𝑓 𝑠𝑢𝑏𝑠𝑒𝑡3 = $5𝑀 + $8𝑀 + $9𝑀 +
$5𝑀 + $4𝑀 5 𝑎𝑣𝑒 𝑜𝑓 𝑠𝑢𝑏𝑠𝑒𝑡3 = $6.2𝑀 Moving average formula

• Continue calculating each five-year average A moving average of order k, MA(k) is the value
until you reach 2009-2013. This gives you a of k consecutive observations
series of points that you can plot a chart for
moving averages.

Averaging methods

• The Mean

• Uses the average of all the historical data as K is the number of terms in the moving average.
the forecast The moving average model does not handle
trend or seasonality very well although it can do
better than the total mean.
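A sketch of the five-year moving average computed above. Only the 2004–2009 sales values (in $M) are visible in these notes, so the list below is limited to those years; the earlier and later years would simply extend it:

def moving_average(values, k):
    """Return the mean of each window of k consecutive observations."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]

# Annual sales in $M for 2004-2009, as used in the subset calculations above
sales = [6, 5, 8, 9, 5, 4]

print(moving_average(sales, k=5))   # [6.6, 6.2] -> matches subset2 and subset3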

Example: Weekly Department Store Sales

The weekly sales figures (in millions of dollars) Results


presented in the following table are used by a RMSE = 0.63
major department store to determine the need
for temporary sales personnel.

 A simple moving average (SMA) is the


simplest type of moving average.

Graph of Stores weekly sales

SUBTOPIC 3 EXPONENTIAL SMOOTHING

• This method provides an exponentially


weighted moving average of all previously
observed values.

• Appropriate for data with no predictable


upward or downward trend.

• The aim is to estimate the current level and


Using moving average use it as a forecast of future value.
• Use a three-week moving average (k=3) for Simple Exponential Smoothing Method
the department store sales to forecast for the
week 24 and 26. Formally, the exponential smoothing equation
is

• The forecast error is


Where:

= 𝒇𝒐𝒓𝒆𝒄𝒂𝒔𝒕 𝒇𝒐𝒓 𝒕𝒉𝒆 𝒏𝒆𝒙𝒕 𝒑𝒆𝒓𝒊𝒐d

Forecasted value
= 𝒔𝒎𝒐𝒐𝒕𝒉𝒊𝒏𝒈 𝒄𝒐𝒏𝒔𝒕𝒂𝒏𝒕
• The forecast for the week 26 is
• 𝛼 can not be equal to 0 or 1.
= 𝒐𝒃𝒔𝒆𝒓𝒗𝒆𝒅 𝒗𝒂𝒍𝒖𝒆 𝒐𝒇 𝒔𝒆𝒓𝒊𝒆𝒔 𝒊𝒏
𝒑𝒆𝒓𝒊𝒐𝒅 t • If stable predictions with smoothed random
variation is desired then a small value of 𝛼 is
desire.
= 𝒐𝒍𝒅 𝒇𝒐𝒓𝒆𝒄𝒂𝒔𝒕 𝒇𝒐𝒓 𝒑𝒆𝒓𝒊𝒐𝒅 t
• If a rapid response to a real change in the
The forecast is 𝑭𝒕+𝟏 based on weighting the pattern of observations is desired, a large value
most recent observation 𝒚𝒕 with a weight 𝜶 and of 𝛼 is appropriate.
weighting the most recent forecast Ft with a
• To estimate 𝛼, Forecasts are computed for 𝛼
weight of 1- 𝜶.
equal to .1, .2, .3, …, .9 and the sum of squared
• The implication of exponential smoothing can forecast error is computed for each.
be better seen if the previous equation is
• The value of 𝛼 with the smallest RMSE is
expanded by replacing Ft with its components
chosen for use in producing the future
as follows:
forecasts.

Simple Exponential Smoothing Method

• To start the algorithm, we need F1 because

• If this substitution process is repeated by


replacing Ft-1 by its components, Ft-2 by its
components, and so on the result is:
• Since F1 is not known, we can

• Set the first estimate equal to the first


• Therefore, Ft+1 is the weighted moving observation.
average of all past observations.
• Use the average of the first five or six
The following table shows the weights assigned observations for the initial smoothed value.
to past observations for = 0.2, 0.4, 0.6, 0.8, 0.9
Holt’s Exponential Smoothing

• Holt’s two parameter exponential smoothing


method is an extension of simple exponential
smoothing.

• It adds a growth factor (or trend factor) to the


• The exponential smoothing equation smoothing equation as a way of adjusting for
rewritten in the following form elucidate the the trend.
role of weighting factor 𝛼. • Three equations and two smoothing constants
are used in the model.

• The exponentially smoothed series or


current level estimate.
• Exponential smoothing forecast is the old
forecast plus an adjustment for the error that
occurred in the last forecast.

Assumptions • The trend estimate.

• The value of smoothing constant 𝛼 must be


between 0 and 1
• Forecast m periods into the future. Examination of the plot

• A non-stationary time series data.

• Seasonal variation seems to exist.

• Sales for the first and fourth quarter are larger


Variables in Holt’s ES
than other quarters.
• Lt = Estimate of the level of the series at time t

•  = smoothing constant for the data.

• yt = new observation or actual value of series


in period t.

•  = smoothing constant for trend estimate

• bt = estimate of the slope of the series at time


t

• m = periods to be forecast into the future.
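A compact sketch of Holt's two-parameter method using the variables defined above. The data are made up, and the two smoothing constants are written here as alpha (level) and beta (trend), matching the .3 and .1 values used in the Acme example below:

def holt_forecast(y, alpha, beta, m=1):
    """Holt's linear trend method: return the m-period-ahead forecast
    made after the last observation."""
    level = y[0]          # L1: initial level set to the first observation
    trend = 0.0           # b1: initial trend set to zero
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + trend)   # level update
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend update
    return level + m * trend                                       # forecast m ahead

# Made-up quarterly series with an upward trend
series = [500, 520, 540, 565, 585, 610]
print(holt_forecast(series, alpha=0.3, beta=0.1, m=1))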

Example: Quarterly sales of saws for Acme tool


Holt’s Exponential Smoothing
company
• The weight  and  can be selected
• The plot of the Acme data shows that there
subjectively or by minimizing a measure of
might be trending in the data therefore we will
forecast error such as RMSE.
try Holt’s model to produce forecasts.
• Large weights result in more rapid changes in
• We need two initial values
the component.
• The first smoothed value for L1
• Small weights result in less rapid changes.
• The initial trend value b1.
Example: Quarterly sales of saws for Acme tool
company • We will use the first observation for the
estimate of the smoothed value L1, and the
• The following table shows the sales of saws
initial trend value b1 = 0.
for the Acme tool Company.
• We will use 𝛼 = .3 and 𝛼 =.1.
• These are quarterly sales From 1994 through
2000.
• RMSE for this application is: 𝛼 = .3 and 𝛼 = .1 Compute for the MSE
RMSE = 155.5
Actual
• The plot also showed the possibility of
seasonal variation that needs to be 20
investigated.
FORMATIVES

Compute for the MSE:
Actual = 20, abs(20 – 30)^2 = 100

Compute for the MSE: -9, 85.5, 81, 9
Actual = 81, abs(81 – 90)^2 = 81

MSE stands for the mean standard error. False

A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences. True

What does the graph below illustrate? an increasing trend only

The reproduction of crops is highly dependent on this component in time series. Seasonal variations

Which of the following describes an unpredictable, rare event that appears in the time series? Irregular variations

Mean absolute error is the mean or average of the forecast errors. False

88.67 (two answers? 88.67)

Moving average is suitable for dealing with a time series that has short-term trends. True

Qualitative methods can be used when past information about the variable being forecast is available. False

Compute for the MAD based on the following values: 3
Error | |Error|
5 | 5
-3 | 3
4 | 4

Trend projection is an example of a time series. True

Compute for the forecasted error: 3 (Actual = 39)

The formula below describes MAD

It pertains to the gradual shifts or movements to relatively higher or lower values over a longer period of time. Trend Pattern
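For the accuracy-measure items above, the following small Python sketch shows how MAD, MSE, and MAPE are computed from forecast errors; the actual/forecast pairs are hypothetical and only echo the style of the exercises.

```python
actual   = [20, 81, 39]        # hypothetical actual values
forecast = [30, 90, 36]        # hypothetical forecasts

errors = [a - f for a, f in zip(actual, forecast)]

mad  = sum(abs(e) for e in errors) / len(errors)        # mean absolute deviation
mse  = sum(e ** 2 for e in errors) / len(errors)        # mean squared error
mape = sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors) * 100

print(errors)            # [-10, -9, 3]
print(round(mad, 2))     # 7.33
print(round(mse, 2))     # 63.33
print(round(mape, 2))    # 22.93 (percent)
```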
Another measure is mean standard error (MSE), which is a measure of the average size of the prediction errors in squared units.
True
It is a model that tests the relationship between a dependent variable and a single independent variable. Linear Regression

The moving average method uses a weighted average of the observed value. False

Which of the following describes the overall tendency of a time series? Trend

Its objective is to uncover a pattern in the time series and then extrapolate the pattern into the future. Time Series Analysis

If a time series is generated by a constant process subject to random error, then median is a useful statistic. True

The value of α with the smallest RMSE is chosen for use in producing future forecasts. True

The forecast Ft+1 is based on weighting the most recent observation yt with a weight alpha and weighting the most recent forecast Ft with a weight of 1-alpha. True

Supply the missing values given for the attribute salary (16000, 12000, 17500, 29000). Choices: 19,275; 18,625; 18,650; 19,525

One measure of forecasting accuracy is the mean accuracy deviation (MAD). As the name suggests, it is a measure of the average size of the prediction errors. To estimate the change in Y for a one-unit change in X. False

If elections are held every 6 years in a country, then the existing government may allow wages to increase the year before the election to make the people more likely to vote for them. The statement is an example of cyclical variation. True

Trend series a sequence of observations on a variable measured at successive points in time or over successive periods of time. True

Based on the given values below, what is the value of F3 if the value of the alpha is 0.2? 40 (F1 = F2 ≈ 39; F3 = 0.2(44) + 0.8(39.99))

22/30

The classical multiplicative time series model indicates that the forecast is the product of which of the following terms?

- trend, cyclical, and irregular components

Each year, more sweaters are sold in winter and very few in the summer season. This is an example of a ______________.

- Seasonal variations

Which of the following statements is correct regarding moving averages?

- The choice of the number of periods impacts the performance of the moving average forecast

It is a measure that avoids the problem of positive and negative errors offsetting each other and is obtained by computing the average of the squared forecast errors.

- Mean Absolute Error

The method of moving averages is used for which of the following purposes?

- smooth the time series

When exponential smoothing is used as a forecasting method, which of the following is used as the forecast for the next period?

- smoothed value in the current time period

Which of the following research scenarios would time-series analysis be best for?

- Measuring the time it takes for something to happen based on a given number of variables

Which of the following is a valid weight for exponential smoothing?

- 0.5

Which of the following indicates the purpose for using the least squares method on time series data?

- identifying the trend variations

Compute for the Mean Absolute Deviation:

- 4

An autoregressive forecast includes which of the following terms?

- All of the above

Which of the following indicates a guideline for selecting a forecast model?

- All of the above

A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.

- True

To assess the adequacy of a forecasting model, which of the following methods is used?

- Mean absolute deviation

Which of the following indicates the calculation of an index number?

- (Pi/Pbase)100

How many degrees of freedom are used for the t test to determine the significance of the highest order autoregressive term?

- n – 2p – 1
- Trend Pattern

It is a measure of forecast accuracy that


avoids the problem of positive and negative
forecast errors offsetting one another.
- Mean Absolute Error

Compute for the MSE

- 100

Spike is the predictable change in something


based on the season.

- False

It pertains to the difference between the


actual and the forecasted values for period t

- Forecast Error

A time series data usually has two variables


namely transaction and item. Deviation and error are used
- False interchangeably in time series

- False

Which of the following statements best


describes the cyclical component?

- It represents periodic fluctuations that


recur within the business cycle

Compute for the 3 Moving Average based


on the table below: Compute for the MSE based on the
following values:

-
The formula below describes

- MAPE (no screenshot provided)

Compute for the MAD based on the


following value:
- 16.67?

MAPE stands for mean absolute predicted


error

- False

Compute for the MAD:


Exponential Smoothing is an obvious extension of the moving average method

- True (False)
- 6
Given the following values for age, what is
It is the variable being manipulated by the problem with the data?
researchers.

- Explanatory variable

MAPE stands for mean absolute predicted


error

- False

Moving average is suitable for dealing with a


time series that has short-term trends.

- True

A trend is usually the result of long-term


factors such as population increases or
decreases, shifting demographic
characteristics of the population, improving
technology, changes in the competitive
landscape, and/or changes in consumer
preferences

- True

Trend series a sequence of observations on


a variable measured at successive points in
time or over successive periods of time

- True

Data Inconsistency?
A heterogenous data set is a data set whose
data records have the same target value. A time series data usually has two variables
namely transaction and item.
- False
- False
The sales of the mini electric fans that
Louise sells varies every season. This is an
example of a seasonal effect to a time series.
24/20
- True Regression is a famous supervised learning
technique.
One measure of forecasting accuracy is the
- True
mean accuracy deviation (MAD). As the
name suggests, it is a measure of the Forecasting methods can be classified as
average size of the prediction errors. To qualitative and quantitative
estimate the change in Y for a one-unit
change in X. - True

- False MSE stands for the mean standard error.


- False

The classical multiplicative time series model indicates that the forecast is the product of which of the following terms?

- trend, cyclical, and irregular components

It pertains to the gradual shifts or movements to relatively higher or lower values over a longer period of time.

- Trend Pattern
The MAD for the following values is 63.67.
Compute for the MAPE:

- -7.69

It is a measure of forecast accuracy that


avoids the problem of positive and negative
forecast errors offsetting one another

- Mean Absolute Error

- False

A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.

- True

Time series is an example of unsupervised learning.

- False


stationary time series data
These are fluctuations in the time series that
are short in duration, erratic in nature - True

- Irregular variations The formula below describes

What component affects the time series


analysis when a sudden volcanic eruption
happens?

- Irregular variations

What does the graph below illustrate?


- MAPE
an increasing trend only
The choice of the number of periods impacts
the performance of the moving average
The seasonal component represents periodic forecast.
fluctuations that recur within the business
- True
cycle.
It is the variable being manipulated by
- False researchers.
MSE stands for ________.
- Explanatory variable
- Mean Squared Error

The formula below describes


- True

Exponential Smoothing is an obvious


extension of the five- moving average
method

- False

Averaging methods are suitable for


stationary time series data
- MSE - True
The equation below illustrates how to Supply the missing values given for the
compute the mean absolute deviation. attribute salary.

- False

Averaging methods are suitable for


nonstationary time series data.

- False

Which of the following describes an


unpredictable, rare event that appears in the
time series?

- Irregular variations
- 18,650
Trend series a sequence of observations on a variable measured at successive points in time or over successive periods of time

- True

With moving average the idea is that the most recent observations will usually provide the best guide as to the future, so we want a weighting scheme that has decreasing weights as the observations get older (exponential smoothing)

False

Exponential Smoothing is an obvious extension of the moving average method.

- False

The value of α with the smallest RMSE is


chosen for use in producing future forecasts.

- True

The forecast is Ft+1 based on weighting the


most recent observation yt with a weight alpha and weighting the most recent
alpha and weighting the most recent ASSOCIATION ANALYSIS
forecast Ft with a weight of 1-alpha .
What are patterns?
Patterns are set of items, subsequences,
or substructures that occur frequently
together (or strongly correlated) in a
data set Patterns represent intrinsic and
importance properties of datasets

When do you use Pattern Discovery?

❑ What products were often purchased


together?

❑ What are the subsequent purchases


after buying an iPad?

❑ What code segments likely contain copy-and-paste bugs?

❑ What word sequences likely form phrases in this corpus?

Concept of Market Basket Analysis

Market basket analysis is like an imaginary basket used by retailers to check the combination of two or more items that the customers are likely to buy.

“Two-thirds of what we buy in the supermarket we had no intention of buying,” - Paco Underhill, author of Why We Buy: The Science of Shopping

Association Rule is the foundation of several recommender systems

Amazon

Possible actions based on Association rule

Barbie Doll -> Candy

1. Pull them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls
4. Package Barbie + candy + poorly selling item
5. Raise the price on one, and lower it on the other
6. Offer Barbie accessories for proofs of purchase
7. Do not advertise candy and Barbie together
8. Offer candies in the shape of a Barbie doll

Process of Rule Selection

Generate all rules that meet specified support & confidence

• Find frequent item sets (those with sufficient


support – see above)

• From these item sets, generate rules with


sufficient confidence

Rule Interpretation

Lift ratio shows how effective the rule is in


finding consequents (useful if finding particular
consequents is important)
Spotify
Confidence shows the rate at which
consequents will be found (useful in learning
costs of promotion)

Support measures overall impact


Rule sets (XLMiner)

Challenges

• Random data can generate apparently interesting association rules

• The more rules you produce, the greater this danger

• Rules based on large numbers of records are less subject to this danger

SUBTOPIC 1 PATTERN DISCOVERY

What is the essence of data mining?

“...the discovery of interesting, unexpected, or valuable structures in large data sets.” - David Hand

“If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun.” - Herb Edelstein

What is pattern discovery?

According to SAS, it is one of the broad categories of analytical methods associated with data mining. It is a process of uncovering patterns from massive data sets.

Pattern Discovery Caution

• Poor data quality (the data is not enough or has no real pattern, and a pattern is being forced)

• Opportunity

• Interventions (irregular variations)

• Separability (different data sets; the databases are not synced)

• Obviousness (trivial or logical rules, e.g., an order of chicken automatically comes with a coke)

• Non-stationarity (inconsistent sales)

Applications of Pattern Discovery

• Data reduction

• Novelty detection (e.g., loyalty cards)

• Profiling (purchasing score, the range of items a customer buys, whether employed or a student)

• Market basket analysis (which two products are bought together; combinations of product x and y)

• Sequence analysis (what is the next product a customer will buy after purchasing one)

Pattern discovery tools

Pattern discovery key ideas

• Patterns are not known

• But data which are believed to possess patterns are given

• Examples:

• Clustering: grouping similar samples into clusters

• Associative rule mining: discover certain features that often appear together in data

Pattern Finding

• Patterns are known beforehand and are observed/described by:

• Explicit samples

• Similar samples (usually)

• Modeling approaches

• Build a model for each pattern

• Find the best fit model for new data

• Usually require training using observed samples

Sequential Pattern

Sequential patterns: shopping sequences, medical treatments, natural disasters, weblog click streams, programming execution sequences, DNA, protein, etc.

Pattern discovery techniques

1. Clustering - Clustering aims at dividing the data set into groups (clusters) where the inter-cluster similarities are minimized while the similarities within each cluster are maximized.

2. Association Rule mining - Association rule discovery on usage data results in finding groups of pages that are commonly accessed.

3. Sequential pattern mining

4. Classification

5. Decision Trees

6. Naïve Bayes Classifier

What is support?

The strength of the association is measured by the support and confidence of the rule.

Support is estimated using the following:

The support for the rule A -> B is the probability that the two item sets occur together.

Calculating for the Support

Support = the overall probability that the products are bought together; how frequently the “if” and “then” relationship appears.

What is confidence?

Confidence defines the likeliness of occurrence of the consequent on the cart given that the cart already has the antecedents. Confidence is estimated using the following:

Technically, confidence is the conditional probability of occurrence of the consequent given the antecedent.

Calculating for the confidence

Confidence = the probability that the consequent is taken given the antecedent.

Rule Interpretation

Lift ratio shows how effective the rule is in finding consequents (useful if finding particular consequents is important).

Confidence shows the rate at which consequents will be found (useful in learning costs of promotion).

Support measures overall impact
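As an illustration of these formulas, here is a minimal Python sketch with a made-up five-transaction basket (not the table from the module); support, confidence, and lift are computed directly by counting.

```python
transactions = [
    {"milk", "bread", "apples"},
    {"milk", "apples", "banana"},
    {"bread", "banana"},
    {"milk", "apples"},
    {"apples", "banana"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent in cart | antecedent in cart)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence controlled for how popular the consequent is on its own."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"apples", "banana"}))        # 0.4
print(confidence({"apples"}, {"banana"}))   # 0.5
print(lift({"apples"}, {"banana"}))         # about 0.83
```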

SUBTOPIC 2 ASSOCIATION RULE

Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method. In the simplest situation, the data consists of two variables: a transaction and an item.

Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars.

Application of Association Rule

• Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together.

• Telecommunication (each customer is a transaction containing the set of phone calls)

• Credit Cards / Banking Services (each card/account is a transaction containing the set of customer’s payments)

• Medical Treatments (each patient is represented as a transaction containing the ordered set of diseases)

• Basketball-Game Analysis (each game is represented as a transaction containing the ordered set of ball passes)

Advantages/Disadvantages

• Advantages:

– Uses large itemset property.

– Easily parallelized.

– Easy to implement.

• Disadvantages:

– Assumes transaction database is memory resident.

– Requires many database scans.

Key Ideas

• I: a set of all the items

• Transaction T: a set of items such that T ⊆ I

• Transaction Database D: a set of transactions

• A transaction T ⊆ I contains a set X ⊆ I of some items, if X ⊆ T

• An Association Rule is an implication of the form X ⇒ Y, where X, Y ⊆ I

• A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.

• The support s of an itemset X is the percentage of transactions in the transaction database D that contain X.

• The support of the rule X ⇒ Y in the transaction database D is the support of the itemset X ∪ Y in D.

• The confidence of the rule X ⇒ Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D.

Sample Scenario

Sample dataset

Finding the association rules

Variables in Association Rule

• Given:
― a set I of all the items;
― a database D of transactions;
― minimum support s;
― minimum confidence c;

• Find:
― all association rules X ⇒ Y with a minimum support s and confidence c.

Problem Decomposition

• Find all sets of items that have minimum support (frequent itemsets)

• Use the frequent itemsets to generate the desired rules

Example rule

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset, that have confidence greater than 50%, are:

Shoes ⇒ Jacket  Support=50%, Confidence=66%
Jacket ⇒ Shoes  Support=50%, Confidence=100%

Terms

“IF” part = antecedent

“THEN” part = consequent

“Item set” = the items (e.g., products) comprising the antecedent or consequent

• Antecedent and consequent are disjoint (i.e., have no items in common)

Example: Phone Faceplates

Many rules are possible

For example: Transaction 1 supports several rules, such as

• “If red, then white” (“If a red faceplate is purchased, then so is a white one”)

• “If white, then red”

• “If red and white, then green”

• + several more

Confidence

{red, white} ⇒ {green} with confidence = 2/4 = 50%

• [(support {red, white, green})/(support {red, white})]

{red, green} ⇒ {white} with confidence = 2/2 = 100%

• [(support {red, white, green})/(support {red, green})]

Plus 4 more with confidence of 100%, 33%, 29% & 100%

If the confidence criterion is 70%, report only rules 2, 3 and 6
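The two-step decomposition above (find the frequent itemsets, then generate high-confidence rules from them) can be sketched with a short brute-force Python example. This is an illustration of the idea rather than a full Apriori implementation, and the faceplate-style transactions below are invented.

```python
from itertools import combinations

transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"white", "blue"},
    {"red", "white", "orange"},
    {"red", "blue"},
    {"white", "blue"},
]
min_support, min_confidence = 0.30, 0.70
n = len(transactions)
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / n

# Step 1: find all frequent itemsets (brute force over candidate sizes)
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(c) >= min_support]

# Step 2: from each frequent itemset, generate rules with sufficient confidence
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for antecedent in combinations(itemset, k):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                print(set(antecedent), "=>", set(consequent),
                      f"support={support(itemset):.2f} confidence={conf:.2f}")
```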
Task 1: Compute for the Support

What is the support for milk -> apples?

Task 2: Compute for the Confidence

What is the confidence of {apples → banana}?

Task 3: Compute for the lift ratio

What is the lift ratio of {apples → banana}?

Association Rule Mining

1. It is intended to select the “best” subset of predictors. Variables Selection

2. Write the Support formula for the following expression: A => B
(transactions that contain every item A and B)/(all transactions)
3. Affinity analysis is a data mining


method that usually consists of two
variables a transaction and an item.
[TRUE]
4. Which of the following is not an
application of a sequential pattern?
Identifying fake news]
5. Which of the following is not an
application of pattern discovery?
None of pattern discovery
6. It shows how effective the rule is in
finding consequents. Lift ratio
7. Trend series a sequence of
observation on a variable measured
at successive points in time or over
successive periods of time. False
8. Observe the table below and
compute for the confidence of
Beer->diaper.

0.43
10. Observe the table below and
compute for the support of
Beer=>peanut 4/7
11. Observe the table below and
compute for the support of
SD Card => Phone case

None of the choices


9. Observe the table below and
compute for the confidence of
Diaper => peanut

0.3
12. Observe the table below and
compute for the lift ratio of
0.3
1.5 1. Segmentation is a data mining
13. When it comes to association method that usually consists of
analysis, the more rules you two variables a transaction and
produce, the greater the risk is. an item. False
True 2. It is a useful tool for data
14. Supposed you want to solve a time reduction, such as choosing the
series problem where a rapid best variables or cluster
response to a real change in the components for
pattern of observations is desired, analysis.Variable Clustering
which among the following is the 3. It controls for the support
ideal value for your alpha? 0.8 ( frequency) of consequent
15. Which of the following is not an while calculating the
advantage of using association rule? conditional probability of
Assumes transaction database is occurrence of {Y} given {X}. Lift
memory resident. ratio
16. A trend is usually the result of long- 4. Association rule mining is about
term factors such as population grouping similar samples into
increases or decreases, shifting clusters. False
demographic characteristics of 5. It is the process of discovering
population, improving technology, useful patterns and trends in
changes in the competitive large data sets.[data mining]
landscape, and/or changes in 6. observe the table below and
consumer preferences. True compute for the lift ratio of
17. It measures the overall impact. egg -> Peanut
Support
18. Clustering aims to discover certain
features that often appear together
in data.False
19. It is another type of association
analysis that involves using
sequence data. Association Rule
20. Observe the table below and
compute for the support of
SD Card => Phone case
given the antecedent.
Confidence
10. Input validation helps to lesson
what type of anomaly?
Insertion anomaly
11. observe the table below and
compute for the confidence of
Phone -> SD card

0.93
7. observe the table below and
compute for the confidence of
Airpods->powerbank

0.5
12. Lift ratio shows how effective
the rule is in finding
consequents. True
13. observe the table below and
compute for the lift ratio of

0.4

8. observe the table below and


support for Fries -> Burger

0.6
9. It is the conditional probability
of occurrence of consequent 0.1
The objective of clustering is to
uncover a pattern in the time
series and then extrapolate the
pattern into the future. False

Observe the table below and


compute for the lift ratio of egg→
peanut.

- 4/7

Which of the following is not an


application of pattern discovery?
- None of the choices

Pattern discovery can also be applied


to data profiling.

- False

Observe the table below and compute


for the confidence of egg→ beer

- 0.93

Observe the table below and


compute for the support for beer⇒
peanut

- None of the choices

Observe the table below and compute


for the confidence of beer→diaper.
Observe the table below and compute for the
support for diaper⇒ peanut

- 1/5 or 0.2

Observe the table below and compute for the


confidence of egg→peanut
- 0.43

It is the conditional probability of


occurrence of consequent given the
antecedent

- Confidence

Observe table below and compute for the


support for powerbank⇒charger

- 4/5

Lift ration shows how effective the rule is in


finding consequents.

- True

It pertains to how likely item Y is purchased


when item X is purchased, expressed as {X
-> Y}. (Use lowercase for your answer)

- confidence
19/20

- 0.1

It pertains to how likely item Y is purchased


when item X is purchased, expressed as {X
-> Y}. (Use lowercase for your answer)

- confidence

This measure defines the likeliness of


occurrence of consequent on the cart given
that the cart already has the antecedents.

- Confidence

It pertains to how popular an itemset is, as


measured by the proportion of transactions
in which an itemset appears. (Use lowercase
for your answer)

- lift ratio - 1
It measures the overall impact. Its objective is to uncover a pattern in the
time series and then extrapolate the pattern
- Support
into the future.
Which of the following is not an application
- Time Series
of pattern discovery?

- None of the choices


Observe the table below and compute for
Observe the table below and compute for the
the support for milk⇒ egg.
lift ratio of phone case→ SD Card
It pertains to how likely item Y is
purchased when item X is purchased
while controlling for how popular item Y
is. (Use lowercase for you answer)
- lift ratio

These are set of items, subsequences, or


substructures that occur frequently together
(or strongly correlated) in the data set.

- Patterns

Market Basket Analysis creates If-Then


scenario rules.

- True

Which of the following is not an application


- 0.14 of pattern discovery?
Observe the table below and compute for - None of the choices
the confidence of beer→diaper.

Observe the table below and compute for


the lift ratio of egg→ peanut.

- 0.40

Lift ration shows how effective the rule


- 0.93
is in finding consequents.

It is a useful tool for data reduction, such


- True
as choosing the best variables or cluster
components for analysis. (Use lowercase
for your answer) variable clustering
pattern discoveryare also part of the Observe the table below and compute for
pattern discovery. the support for airpods⇒ charger

- False

Clustering aims to discover certain features


that often appear together in data.

- False

It measures the overall impact.

- Support

Observe the table below and compute for


the support for SD Card⇒ phone case

- 0.4

Its objective is to uncover a pattern in the


time series and then extrapolate the pattern
into the future. - Time Series

Input validation helps to lessen what


type of anomaly? Insertion anomaly

It shows how effective the rule is in


finding consequents. Lift ratio

Lift ratio shows how effective the rule is


in finding consequents. True
- 0.3
This measure defines the likeliness of
occurrence of consequent on the cart
Which of the following is not an application given that the cart already has the
of pattern discovery? antecedents.

- None of the choices Confidence


transactions in which an itemset
appears. Support
F5 16/20 19/20 7. It shows how effective the rule is in
finding consequents. Lift ratio
1. The objective of clustering is to 8. Observe the table below and
uncover a pattern in the time series compute for the lift ratio
and then extrapolate the pattern into
of diaper→ milk.
the future. False true
2. These are set of items, Transaction ID Items
subsequences, or substructures that
occur frequently together (or
strongly correlated) in the data set. Transaction A Beer, Peanut, Egg
Pattern
3. When it comes to association Beer, Milk, Peanut,
analysis, the more rules you produce, Transaction B
Diaper
the greater the risk is. True
4. Observe the table below and
compute for the support Transaction C Milk, Diaper, Egg
for diaper⇒ peanut
Transaction D Peanut, Egg, Diaper
Transaction ID Items

Transaction E Beer, Peanut, Egg


Transaction A Beer, Peanut, Egg

Transaction F Egg, Beer


Beer, Milk, Peanut,
Transaction B
Diaper
Transaction G Beer, Diaper, Peanut

Transaction C Milk, Diaper, Egg 1.5 0.29/0.29*0.57=1.75

Transaction D Peanut, Egg, Diaper


9. Supposed you want to solve a
time series problem where a rapid
Transaction E Beer, Peanut, Egg response to a real change in the
pattern of observations is desired,
Transaction F Egg, Beer, Peanut which among the following is the
ideal value for your alpha ? 0.2
(Choices : 0.2 1 0 0.8)
Transaction G Beer, Diaper, Peanut 10. Association rule mining is about
grouping similar samples into
1.75 (3/7 = .43 Umulit sa number19) clusters. False
11. It is the conditional probability of
occurrence of consequent given the
5. It controls for the support antecedent. Confidence
(frequency) of consequent while 12. It is the conditional probability of
calculating the conditional occurrence of consequent given the
probability of occurrence of {Y} given antecedent. Confidence
{X}. Lift ratio 13. Observe the table below and
6. It pertains to how popular an itemset compute for confidence of phone
is, as measured by the proportion of case→ SD Card
Transaction ID Items Transaction D Peanut, Egg, Diaper

airpods, charger, Transaction E Beer, Peanut, Egg


1
powerbank
Transaction F Egg, Beer
2 powerbank, phone case
Transaction G Beer, Diaper, Peanut
airpods, phone case,
3 4/7.
charger
16. Market Basket Analysis creates If-
4 phone case, SD Card Then scenario rules. True
17. It shows how effective the rule is in
finding consequents. Lift ratio
5 SD Card, charger, airpods 18. It is another type of association
analysis that involves using sequence
data. Sequence Analysis
SD Card, phonecase,
6
powerbank

Powerbank, phonecase, SD
7
Card

8 airpods, powerbank

19. Observe the table below and


9 phone case, airpods compute for the support
for diaper⇒peanut
10 charger, SD Card, airpods
Transaction ID Items
0.5 (0.3/0.6 = 0.5)
14. It pertains to how popular an itemset Transaction A Beer, Peanut, Egg
is, as measured by the proportion of
transactions in which an itemset
appears. (Use lowercase for your Beer, Milk, Peanut,
Transaction B
answer) support Diaper
15. Observe the table below and
compute for the support Transaction C Milk, Diaper, Egg
for beer⇒ peanut

Transaction ID Items Transaction D Peanut, Egg, Diaper

Transaction A Beer, Peanut, Egg Transaction E Beer, Peanut, Egg

Beer, Milk, Peanut, Transaction F Egg, Beer, Peanut


Transaction B
Diaper
Transaction G Beer, Diaper, Peanut
Transaction C Milk, Diaper, Egg
0.43 ( 3/7 = .43 Inulit nya lang nasa number 4)
20. It shows how effective the rule is in
finding consequents. Lift ratio

MODULE 6

What is data preparation?

Data preparation is the method of transforming raw data into a form suitable for modeling. It is also known as data preprocessing.

Why do we need to preprocess the data?

• Inconsistent data

• Improperly formatted data

• Limited features

• The need for techniques such as feature engineering

Missing values

Having null values in your data set could affect the accuracy of the model.

Outliers

When your data has outliers, it could affect the distribution of your data.

Ways to Preprocess Data

Data cleaning

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.

Data Integration and Transformation

There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. This is referred to as the entity identification problem.

Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

Data Preprocessing Techniques
Variables Selection

Observe minimizing garbage in, garbage out (GIGO)!

Procedures

1. Backward-selection

2. Forward-selection

SUBTOPIC 1 DATA ANOMALIES

What are anomalies?

Anomalies are problems that can occur in poorly planned, unnormalized databases where all the data is stored in one table (a flat-file database). Anomalies are caused when there is too much redundancy in the database's information.

Database Anomalies

• Insertion anomaly - happens when inserting vital data into the database is not possible because other data is not already there.

• Update Anomalies - happen when the person charged with the task of keeping all the records current and accurate is asked, for example, to change an employee’s title due to a promotion.

• If the data is stored redundantly in the same table, and the person misses any of them, then there will be multiple titles associated with the employee. The end user has no way of knowing which is the correct title.

• Deletion Anomalies - happen when the deletion of unwanted information causes desired information to be deleted as well.

• For example, if a single database record contains information about a particular product along with information about a salesperson for the company and the salesperson quits, then information about the product is deleted along with salesperson information.

How to handle missing data?

Missing data is a problem that continues to plague data analysis methods. Let’s examine the cars data set.

A common method of “handling” missing values is simply to omit the records or fields with missing values from the analysis. However, this action would make our data biased.

Common criteria in handling null values

1. Replace the missing value with some constant, specified by the analyst.

2. Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables).

3. Replace the missing values with a value generated at random from the observed distribution of the variable.

4. Replace the missing values with imputed values based on the other characteristics of the record.

Applications

1) Replacing missing field values with user-defined constants.

2) Replacing missing field values with means or modes.
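A minimal pandas sketch of the first three criteria above; the column names and values are hypothetical, and model-based imputation (criterion 4) is omitted because it requires fitting a separate predictive model.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":  [16, 27, np.nan, 19, np.nan, 18],
                   "city": ["Makati", "Caloocan", None, "Makati", "Caloocan", None]})

# 1) Replace with an analyst-specified constant
df["city_const"] = df["city"].fillna("Unknown")

# 2) Replace with the field mean (numeric) or mode (categorical)
df["age_mean"]  = df["age"].fillna(df["age"].mean())
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# 3) Replace with a value drawn at random from the observed distribution
observed = df["age"].dropna()
df["age_rand"] = df["age"].apply(
    lambda v: np.random.choice(observed) if pd.isna(v) else v)

print(df)
```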
• Including categorical inputs in the model can cause quasi-complete separation.

How to handle outliers?
Use graphical methods to identify outliers
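Besides graphical methods such as histograms and boxplots, a quick numeric check is the interquartile-range rule. Below is a minimal sketch with made-up vehicle weights; the helper quantile function is a simple assumption, not part of any named library.

```python
weights = [2500, 2700, 2850, 3000, 3100, 3200, 3350, 3400, 3550, 9800]
weights_sorted = sorted(weights)

def quantile(data, q):
    """Linear-interpolation quantile for an already sorted list."""
    pos = (len(data) - 1) * q
    lo, hi = int(pos), min(int(pos) + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

q1, q3 = quantile(weights_sorted, 0.25), quantile(weights_sorted, 0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [w for w in weights if w < lower or w > upper]
print(outliers)   # the 9800 record stands far outside the rest
```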
How to handle noisy data?

Handling Noisy Data

1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.

2. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.

3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.

Histogram of vehicle weights: can you find the outlier?
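A small sketch of the binning method from item 1 (smoothing by bin means); the attribute values are illustrative only.

```python
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4

smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)          # smooth each value to its bin mean
    smoothed.extend([round(mean, 1)] * len(bin_vals))

print(smoothed)   # every value replaced by the mean of its bin
```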

Dealing with categorical inputs

Solutions to the problems of categorical inputs


include the following choices:

• Use the categorical input as a link to other data sets.

• Collapse the categories based on the number of observations in a category.

• Collapse the categories based on the reduction in the chi-square test of association between the categorical input and the target.

• Use smoothed weight of evidence coding to convert the categorical input to a continuous input.

Variable Clustering

Example inputs from the lecture figure: Mortgage Balance, Credit Card Balance, Number of Checks, Checking Deposits, Teller Visits, Age
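A short pandas sketch of two of the choices above, collapsing sparsely populated categories and then expanding the result into dummy variables; the category labels and the threshold of 2 observations are hypothetical.

```python
import pandas as pd

s = pd.Series(["Makati", "Caloocan", "Makati", "Pasig", "Taguig",
               "Makati", "Caloocan", "Quezon"])

# Collapse categories with few observations into an "Other" level
counts = s.value_counts()
collapsed = s.where(s.map(counts) >= 2, other="Other")

# Expand the (now smaller) categorical input into dummy variables
dummies = pd.get_dummies(collapsed, prefix="city")
print(dummies.head())
```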

Anomaly detection

Anomaly detection (aka outlier analysis) is a


step in data mining that identifies data points,
events, and/or observations that deviate from a
dataset’s normal behavior.
Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities, for instance a change in consumer behavior.

Observe the graph

This graph shows an anomalous drop detected in time series data. The anomaly is the yellow part of the line extending far below the blue shaded area, which is the normal range for this metric.

Why does your company need anomaly detection?

Reasons to detect anomalies

1. Anomaly detection for application performance

2. Anomaly detection for product quality

3. Anomaly detection for user experience

Three main categories of business data anomalies

1. Global Outliers

Also known as point anomalies, these outliers exist far outside the entirety of a data set.

2. Contextual Outliers

Also called conditional outliers, these anomalies have values that significantly deviate from the other data points that exist in the same context.

3. Collective outliers

When a subset of data points within a set is anomalous to the entire dataset, those values are called collective outliers. In this category, individual values aren’t anomalous globally or contextually.

How to handle categorical data?

Dealing with categorical inputs

• When a categorical input has many levels, expanding the input into dummy variables can greatly increase the dimension of the input space.

SUPPLEMENTARY MATERIAL 1

Why Preprocess the Data?

Supposing that you are a manager of one company and have been charged with analyzing the company’s data with respect to the sales at your branch. You immediately set out to perform this task. You carefully inspect the company’s database and data warehouse, identifying and selecting the attributes or dimensions to be included in your analysis, such as item, price, and units sold.

You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions.

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data.

There are many possible reasons for noisy data


(having incorrect attribute values). The data
collection instruments used may be faulty.
There may have been human or computer
errors occurring at data entry. Errors in data
transmission can also occur.

There may be technology limitations, such as


limited buffer size for coordinating synchronized
data transfer and consumption. Incorrect data
may also result from inconsistencies in naming include data aggregation (e.g., building a data
conventions or data codes used, or inconsistent cube), attribute subset selection (e.g., removing
formats for input fields, such as date. Duplicate irrelevant attributes through correlation
tuples also require data cleaning. analysis), dimensionality reduction (e.g., using
encoding schemes such as minimum length
encoding or wavelets), and numerosity
reduction (e.g., “replacing” the data by
alternative, smaller representations such as
clusters or parametric models).

Descriptive Data Summarization

For data preprocessing to be successful, it is


essential to have an overall picture of your data.
Descriptive data summarization techniques can
be used to identify the typical properties of your
data and highlight which data values should be
treated as noise or outliers.

The descriptive statistics are of great help in


Data cleaning routines work to “clean” the data understanding the distribution of the data. Such
by filling in missing values, smoothing noisy measures have been studied extensively in the
data, identifying or removing outliers, and statistical literature. From the data mining point
resolving inconsistencies. If users believe the of view, we need to examine how they can be
data are dirty, they are unlikely to trust the computed efficiently in large databases.
results of any data mining that has been applied
to it. Measuring the Central Tendency

Suppose that you would like to include data The most common and most effective numerical
from multiple sources in your analysis. This measure of the “center” of a set of data is the
would involve integrating multiple databases, (arithmetic) mean.
data cubes, or files, that is data integration. Yet
Different Measures of Central Tendency
some attributes representing a given concept
may have different names in different 1. Distributive Measure
databases, causing inconsistencies and
2. Algebraic Measure
redundancies.
3. Holistic Measure
Furthermore, it would be useful for your
analysis to obtain aggregate information as to Measuring the Dispersion of Data
the sales per customer region—something that
is not part of any precomputed data cube in The degree to which numerical data tend to
your data warehouse. You soon realize that spread is called the dispersion, or variance of
data transformation operations, such as the data. The most common measures of data
normalization and aggregation, are additional dispersion are range, the five number summary
data preprocessing procedures that would (based on quartiles), the interquartile range,
contribute toward the success of the mining and the standard deviation. Boxplots can be
process. plotted based on the five-number summary and
are a useful tool for identifying outliers.
Data reduction obtains a reduced
representation of the data set that is much
smaller in volume, yet produces the same (or
almost the same) analytical results. There are a
number of strategies for data reduction. These
Graphic Displays of Basic Descriptive Data Data Cleaning as a Process
Summaries
The first step in data cleaning as a process is
Aside from the bar charts, pie charts, and line discrepancy detection. Discrepancies can be
graphs used in most statistical or graphical data caused by several factors, including poorly
presentation software packages, there are other designed data entry forms that have many
popular types of graphs for the display of data optional fields, human error in data entry,
summaries and distributions. These include deliberate errors (e.g., respondents not wanting
histograms, quantile plots, q-q plots, scatter to divulge information about themselves), and
plots, and loess curves. Such graphs are very data decay (e.g., outdated addresses).
helpful for the visual inspection of your data.
As a starting point, use any knowledge you may
Data Cleaning already have regarding properties of the data.
Such knowledge or “data about data” is
Data cleaning (or data cleansing) routines
referred to as metadata. Field overloading is
attempt to fill in missing values, smooth out
another source of errors that typically results
noise while identifying outliers, and correct
when developers squeeze new attribute
inconsistencies in the data. In this section, you
definitions into unused (bit) portions of already
will study basic methods for data cleaning.
defined attributes (e.g., using an unused bit of
Missing Values an attribute whose value range uses only, say,
31 out of 32 bits).
These are the methods in filling in the missing
value for the attributes: The data should also be examined regarding
unique rules, consecutive rules, and null rules.
1. Ignore the tuple This is usually done when
the class label is missing (assuming the mining A unique rule says that each value of the given
task involves classification). attribute must be different from all other values
for that attribute.
2. Fill in the missing value manually In general,
this approach is time-consuming and may not A consecutive rule says that there can be no
be feasible given a large data set with many missing values between the lowest and highest
missing values. values for the attribute, and that all values must
also be unique (e.g., as in check numbers).
3. Use a global constant to fill in the missing
value Replace all missing attribute values by the A null rule specifies the use of blanks, question
same constant, such as a label like “Unknown” marks, special characters, or other strings that
or −∞. may indicate the null condition (e.g., where a
value for a given attribute is not available), and
4. Use the attribute mean to fill in the missing how such values should be handled.
value Use this value to replace the missing value
for income. There are a number of different commercial
tools that can aid in the step of discrepancy
5. Use the attribute mean for all samples detection.
belonging to the same class as the given tuple
For example, if classifying customers according Data scrubbing tools use simple domain
to credit risk, replace the missing value with the knowledge (e.g., knowledge of postal addresses,
average income value for customers in the and spellchecking) to detect errors and make
same credit risk category as that of the given corrections in the data. These tools rely on
tuple. parsing and fuzzy matching techniques when
cleaning data from multiple sources.
6. Use the most probable value to fill in the
missing value This may be determined with Data auditing tools find discrepancies by
regression, inference-based tools using a analyzing the data to discover rules and
Bayesian formalism, or decision tree induction. relationships, and detecting data that violate
such conditions. They are variants of data • Normalization, where the attribute data are
mining tools. Commercial tools can assist in the scaled so as to fall within a small specified
data transformation step. range, such as −1.0 to 1.0, or 0.0 to 1.0.

Data migration tools allow simple • Attribute construction (or feature


transformations to be specified, such as to construction), where new attributes are
replace the string “gender” by “sex”. constructed and added from the given set of
attributes to help the mining process.
ETL (extraction/transformation/loading) tools
allow users to specify transforms through a Data Reduction
graphical user interface (GUI). These tools
Data reduction techniques can be applied to
typically support only a restricted set of
obtain a reduced representation of the data set
transforms so that, often, we may also choose
that is much smaller in volume, yet closely
to write custom scripts for this step of the data
maintains the integrity of the original data. That
cleaning process.
is, mining on the reduced data set should be
Data Integration more efficient yet produce the same (or almost
the same) analytical results. Strategies for data
There are a number of issues to consider during
reduction include the following:
data integration. Schema integration and object
matching can be tricky. This is referred to as the 1. Data cube aggregation, where aggregation
entity identification problem. Redundancy is operations are applied to the data in the
another important issue. construction of a data cube.

An attribute (such as annual revenue, for 2. Attribute subset selection, where irrelevant,
instance) may be redundant if it can be weakly relevant, or redundant attributes or
“derived” from another attribute or set of dimensions may be detected and removed.
attributes. Some redundancies can be detected
3. Dimensionality reduction, where encoding
by correlation analysis. Given two attributes,
mechanisms are used to reduce the dataset
such analysis can measure how strongly one
size.
attribute implies the other, based on the
available data. 4. Numerosity reduction, where the data are
replaced or estimated by alternative, smaller
Data Transformation
data representations such as parametric models
In data transformation, the data are (which need store only the model parameters
transformed or consolidated into forms instead of the actual data) or nonparametric
appropriate for mining. Data transformation can methods such as clustering, sampling, and the
involve the following: use of histograms.

• Smoothing, which works to remove noise 5. Discretization and concept hierarchy


from the data. Such techniques include binning, generation, where raw data values for
regression, and clustering. attributes are replaced by ranges or higher
conceptual levels. Data discretization is a form
• Aggregation, where summary or aggregation
of numerosity reduction that is very useful for
operations are applied to the data. For example,
the automatic generation of concept
the daily sales data may be aggregated so as to
hierarchies.
compute monthly and annual total amounts.

• Generalization of the data, where low-level or


“primitive” (raw) data are replaced by higher-
level concepts through the use of concept
hierarchies.
SUBTOPIC 2 VARIABLE SELECTION What if you have 2k models to choose from?

Variable selection is intended to select the


“best” subset of predictors.

Why do we need to perform variable selection?

Why variable selection?

• We want to explain the data in the simplest


way — redundant predictors should be
removed.

• Unnecessary predictors will add noise to the


estimation of other quantities that we are Forward Selection
interested in. Degrees of freedom will be
wasted

• Collinearity is caused by having too many


variables trying to do the same job.

• Cost: if the model is to be used for prediction,


we can save time and/or money by not
measuring redundant predictors.

Prior to variable selection

1. Identify outliers and influential points -


maybe exclude them at least temporarily

2. Add in any transformations of the variables


that seem appropriate

Forward Selection

The forward selection procedure starts with no


variables in the model.

Steps Forward Selection

1. For the first variable to enter the model,


select the predictor most highly correlated with
the target.

2. For each remaining variable, compute the


sequential Fstatistic for that variable, given the
variables already in the model.

3. For the variable selected in step 2, test for


the significance of the sequential F-statistic. If Backward Selection
the resulting model is not significant, then stop,
This is the simplest of all variable selection
and report the current model without adding
procedures and can be easily implemented
the variable from step 2. Otherwise, add the
without special software. In situations where
variable from step 2 into the model and return
there is a complex hierarchy, backward
to step 2.
elimination can be run manually while taking
account of what variables are eligible for
removal.

Steps of Backward-selection

1. Perform the regression on the full model;


that is, using all available variables. For
example, perhaps the full model has four
variables, x1, x2, x3, x4.

2. For each variable in the current model,


compute the partial F-statistic. In the first pass
through the algorithm, these would be F(x1|x2, Scatter plots of MPG with each predictor. Some
x3, x4), F(x2|x1, x3, x4), F(x3|x1, x2, x4), and non-linearity
F(x4|x1, x2, x3). Select the variable with the
smallest partial F-statistic. Denote this value
Fmin.

3. Test for the significance of Fmin. If Fmin is


not significant, then remove the variable
associated with Fmin from the model, and
return to step 2. If Fmin is significant, then stop
the algorithm and report the current model

Scatter plots of ln MPG with each predictor


(including ln HP). Improved linearity.

Stepwise Procedure

The stepwise procedure represents a


modification of the forward selection
procedure. A variable that has been entered
into the model early in the forward selection
process may turn out to be nonsignificant, once
other variables have been entered into the
model.
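The forward and backward procedures described above can be sketched as a simple greedy search. The example below scores candidate models by the improvement in R² instead of the exact sequential and partial F-tests, and the data are random placeholders, so it is only a sketch of the selection logic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # candidate predictors x1..x4
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=100)

def r2(cols):
    if not cols:
        return 0.0
    model = LinearRegression().fit(X[:, cols], y)
    return model.score(X[:, cols], y)

# Forward selection: start empty, add the predictor that improves the fit most
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = max(remaining, key=lambda j: r2(selected + [j]))
    if r2(selected + [best]) - r2(selected) < 0.01:   # stopping threshold
        break
    selected.append(best)
    remaining.remove(best)
print("forward keeps:", selected)

# Backward elimination: start full, drop the predictor whose removal hurts least
current = list(range(X.shape[1]))
while len(current) > 1:
    worst = min(current, key=lambda j: r2(current) - r2([c for c in current if c != j]))
    if r2(current) - r2([c for c in current if c != worst]) > 0.01:
        break                                         # every remaining predictor matters
    current.remove(worst)
print("backward keeps:", current)
```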
The following can be done to treat
unsatisfactory response except: Returning to
the field

A good rule of thumb in having a right AGD


amount of data is to have 10 records for
every predictor value. True AGD
Given are the following records for the These outliers exist far outside the entirety
attribute rating. What is the problem of a data set. Global outliers
with the data? Data Inconsistency
These are also known as point anomalies.
Global outliers
application_rating
Anomaly detection can cause a bad user
experience. True
1
Anomaly detection is also known as outlier
analysis. True
2
It is a best practice to divide your dataset
into train and test dataset. True
a
It fits and performs variable selection on an
ordinary least square regression predictive
b
model. Linear regression selection

It is a useful tool for data reduction, such as


c
choosing the best variables or cluster
components for analysis. (Use lowercase for
3 your answer) variable clustering

The simplest of all variable selection


procedures is stepwise procedure. FALSE
Supply the missing value. AGD
Backward selection starts with all the
variables. True
Degree
It is about estimating the value for the target
variable except that the target variable is
SMBA
categorical rather than numeric. Prediction

WMA Prior to variable selection, one must identify


outliers and influential points - maybe
exclude them at least temporarily. True
AGD
Variable clustering is about grouping the
attributes with similarities. True
AGD
Resampling refers to the process of sampling
at random and with replacement from a data
WMA set. True

Estimation is about estimating the value for


WMA the target variable except that the target
variable is categorical rather than numeric.
- True
The figure below illustrates the first step Enumerate at least one of the two(2) types of
in doing backward selection. variables transformation commonly used in
machine learning

Data preparation affects:


False
 The objectives of the research
Forward selection is the opposite of  The quality of the research
stepwise selection. False
 The research approach
Formative 6  The sample size

Forward selection is the opposite of stepwise It is a manipulation of scale values to ensure


selection. comparability with other scales:

 True  Scale transformation

Estimation is about estimating the value for the It is a best practice to divide your dataset into
target variable except that the target variable is train and test dataset.
categorical rather than numeric.
 True
 True
This happens when the deletion of unwanted
Linearity is caused by having too many variables information causes desired information to be
trying to do the same job. deleted as well.

 False  Insertion anomaly


 Deletion anomaly
It is about estimating the value for the target
variable except that the target variable is The following are techniques to treat missing
categorical rather than numeric. values except:

 Estimation  Substitute an imputed value


 Returning to the field
It is the process of transforming the existing
 Substitute an imputed value
features into a lower-dimensional space,
 Case wise deletion
typically generating new features that are
composites of the existing features. Clustering can also detect outliers.

 Feature extraction  True


Postcoding process is necessary for: Supply the missing values given for the attribute
city.
 Structure questions
City
A good rule of thumb in having a right amount
of data is to have 10 records for every predictor Makati
value.
Caloocan
 True
Caloocan
Forward selection is the simplest variable
Makati
selection model.
Caloocan
 False
 Caloocan
 Makati Age
 Quezon
16
 None of the choices
27
These anomalies have values that significantly
deviate from the other data points that exist in -8990
the same context.
19
 Contextual outliers
15
When there’s a missing value for a categorical
18
variable, it is ideal to supply it by computing for
the average of the data values available. Answer:Data Inconsistency
 True Anomaly detection can cause a bad user
experience.
Outlier analysis can provide good product
quality.  False
 True You can use histogram to detect outliers.
A review of the questionnaires is essential in  True
order to:
When a subset points within a set is anomalous
 Select the data analysis strategy to the entire data set, those values are:
 Find new insights
Collective outliers
 Increase the quality of the data
 Increase accuracy and precise of the These anomalies have values that significantly
collected data deviate from the other data points that exist in
the same context.
Feature selection maps the original feature
space to a new feature space with lower  Contextual outliers
dimensions by combining the original feature
space. These are problems that can occur in poorly
planed, un-normalized databases where all the
True data is stored in one table (a flat-file database).
False  Anomalies
This happens when inserting vital data into the
database is not possible because other data is
not already there.

 Insertion anomaly

Unnecessary predictors will add noise to the


estimation of other quantities that we are
interested in.
16/20
 True
A homogenous data set is a data set whose
You can also use regression when handling
data records have the same target value.
noisy data.
True
True
Supply for the missing values.
Given the following values for age, what is the
problem with the data?
This happens when the deletion of
unwanted information causes desired
information to be deleted as well.
Deletion anomaly

Instead of using the real number for age


attribute, you categorized the age as the
following:

Young = 12 – 17
Adult   = 18 -34
Old     = 35 – 60
What kind of data preparation was
practiced? Data Cleaning
It is the process of integrating multiple
- 19.6 databases, data cubes, or files.  data
It is a manipulation of scale values to ensure integration
comparability with variables with other
scales:
Scale transformation

Supply the missing value in the given data


below. These are problems that can occur in
poorly planned, un-normalized databases
where all the data is stored in one table
(a flat-file database). Anomalies
You can also use regression when
handling noisy data. True
The procedure starts with an empty set
of features [reduced set]. Forward
Selection

It is the simplest of all variable selection


procedures and can be easily
implemented without special software
(Use lowercase for your answer)
Backward Selection
The forward selection procedure starts with
no variables in the model. True

- 88.2 Estimation is about estimating the value


for the target variable except that the
target variable is categorical rather than
It is a best practice to divide your numeric.
dataset into train and test dataset. True
True
The figure below illustrates the first step
in doing backward selection. False
It is intended to select the “best” subset
of predictors.  (Use lowercase for your
answer)
Variables Selection

Enumerate at least one of the two (2)


types of variables transformation
commonly used in machine learning:
(Use lowercase for your answer)
categorical variables?
numerical variables?
Forward selection is the opposite of
stepwise selection. False
The figure below illustrates the basic steps
for what type of variable selection method?

Backward

Prior to variable selection, one must


identify outliers and influential points -
maybe exclude them at least
temporarily. True
