Professional Documents
Culture Documents
Summarizing your data will enable you to classify the normal values of your data and uncover the odd values present in your dataset.

What is raw data?

Raw data pertains to the collected data before it is processed or ranked.

Qualitative raw data

Basic Terms

A data set is a collection of observations on one or more variables (a collection of data from Excel, a database, or a row in data mining).

A variable is a characteristic under study that assumes different values for different elements (age, name, etc.).

An observation or measurement pertains to a value of a variable (age 10 or 20, etc.).
Suppose you are tasked to gather the ages (in
years) of 50 students in a university.
Discrete Variables
Continuous Variables
Binary Variables
Qualitative vs Quantitative
Mean
Range

The difference between the largest and the smallest value in a data set.

range = largest value - smallest value

Histogram

A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.
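As a quick illustration, the range and the bin counts behind a histogram can be computed with the standard library (the scores below are a made-up sample):

```python
# Range of a data set and histogram bin counts (made-up sample scores).
data = [77, 82, 74, 81, 79, 84, 74, 78]

data_range = max(data) - min(data)  # largest value - smallest value
print(data_range)

# Bin counts behind a histogram: bins of width 5 starting at 70.
bins = {(70, 75): 0, (75, 80): 0, (80, 85): 0}
for x in data:
    for (lo, hi) in bins:
        if lo <= x < hi:
            bins[(lo, hi)] += 1
print(bins)  # count per bin, i.e. what the histogram bars show
```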
Standard Deviation
Understanding Distribution
Skewness
Negatively Skewed
Boxplot
STATISTICAL ANALYSIS
Problem: In reality, rash guard sales typically increase in summer months. Thus, there is a "history effect".

What is statistics?

Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting empirical data.

H0: each bottle is filled with 8 oz of soda. Let μ be the average soda content; H0: μ = 8 oz.

H1: each bottle is filled with less than 8 oz of soda; H1: μ < 8 oz.

Conducting
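A minimal sketch of how such a one-tailed test could be computed by hand; the fill volumes below are made up, not from the source:

```python
# One-sample t statistic for H0: mu = 8 vs H1: mu < 8 (made-up data).
import math
from statistics import mean, stdev

sample = [7.9, 8.1, 7.8, 7.7, 8.0, 7.9, 7.8, 7.6]  # hypothetical fill volumes (oz)
mu0 = 8.0

n = len(sample)
t = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
print(f"t = {t:.3f}, df = {n - 1}")
# Compare t against the one-tailed critical value from a t table;
# a sufficiently negative t rejects H0 in favor of H1: mu < 8.
```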
Student’s T-test
• Rely upon the calculation of numbers
Types of Statistics
Confidence Intervals
Identify p value
degrees of freedom = n1 + n2 - 2

• df = 16 + 16 - 2 = 30

• a continuous dependent variable, or response variable
One-way ANOVA
Correlation
What is ANOVA?
• Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors, and between predictors and a target variable.

• Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.

Every data analysis requires data. Data can be in different forms such as images, text, videos, etc., that are usually gathered from different data sources. Organizations store data in different ways, some through data warehouses, traditional RDBMS, or even through the cloud. With the voluminous amount of data that an organization processes each day, the dilemma of how to start data analysis emerges.
How do we start performing an analysis?

First and foremost, know your data.

To understand your organization's data, there are numerous techniques that can be used. In module 1, the most common techniques will be identified.

What is raw data?

Another example is gathering the student status of the same 50 students. Now we will have an example of categorical raw data, which is presented in the table below.

Correlation Coefficient

• A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).

• A scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.

• The correlation coefficient is a standardized metric so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
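A minimal sketch of computing the coefficient by hand (the x/y values are made up):

```python
# Pearson correlation coefficient computed from scratch (made-up data).
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))   # co-movement of x and y
sx = math.sqrt(sum((a - mx) ** 2 for a in x))          # spread of x
sy = math.sqrt(sum((b - my) ** 2 for b in y))          # spread of y
r = cov / (sx * sy)
print(round(r, 3))  # always falls between -1 and +1
```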
Types of Variables
Quantitative variables are divided into two types: discrete variables and continuous variables.

• A discrete variable is a variable whose values are countable. In other words, a discrete variable can assume only certain values.

• Continuous variables are variables that can assume any numerical value or interval.

2. Qualitative or Categorical Variables

These are variables whose values cannot be measured numerically.

It is easier to understand through a frequency table, isn't it? In order to study data, in this module we will be using statistics.

Statistics is defined as the science of collecting, analyzing, presenting, and interpreting data, as well as of making decisions based on such analysis.

Since statistics is a broad body of knowledge, it is divided into two areas: descriptive statistics and inferential statistics.

What is descriptive statistics?
Descriptive statistics consists of different techniques for organizing, displaying, and describing data by using tables, graphs, and summary measures.
Another way to present data is through a pie chart. A pie chart is a circle divided into portions that represent frequencies or percentages of a population.
Population Vs Sample
1. Symmetric
2. Skewed
3. Uniform rectangular
If a histogram doesn't have equal sides, it is said to be skewed.

• A skewed-to-the-right histogram has a longer tail on the right side.

1. Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to the airline industry, and to help guitarists make beautiful music.

2. Researchers keep children healthy by using statistics to analyze data from the production of viral vaccines, which ensures consistency and safety.
Measures of Central Tendency

1. Mean
2. Median
3. Mode

What is mean?

Mean is also called the arithmetic mean, which pertains to the sum of all the values over the number of items added.
Measures of Dispersion
1. Range
2. Variance
3. Standard Deviation
There are instances where the number of data values is even; in this case, the two middle numbers are added and divided by two.
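A minimal sketch of these measures using Python's `statistics` module, on the eight sample scores that appear in this section:

```python
# Mean, median, and mode of the sample scores from this section.
from statistics import mean, median, mode

scores = [77, 82, 74, 81, 79, 84, 74, 78]

print(mean(scores))    # sum of values / number of values
print(median(scores))  # even count: average of the two middle values
print(mode(scores))    # the most frequent value
```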
For this example, the middle value is 21.0; therefore it is the median of the data values.

What is mode?

77 82 74 81 79 84 74 78

Relationships among mean, median, and mode:

• If a histogram is symmetric, then it can be said that the values for the mean, median, and mode are equal.

What is standard deviation?

Measures of Dispersion

What is range?
- True
6. The data below is bimodal.
Data: 77 82 74 81 79 84 74 82
- True

16. The alternative hypothesis or research hypothesis Ha represents an alternative claim about the value of the parameter.
a) positive correlation
b) negative correlation
c) no correlation
- False
- True
2. It refers to the critical process of performing initial investigations on the data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations.
- Exploratory Data Analysis

3. A variable is a characteristic under study that assumes different values for different elements.
- True

4. The data that is collected before being processed is called statistical data.
- False

5. The smallest possible value for the standard deviation is 1.
- False

6. Observation or measurement pertains to a value of a variable.
- True

12. The height of three volleyball players is 210 cm, 220 cm, and 191 cm. What is their average height?
- 207

13. 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.
- True

14. Inferential statistics is about describing and summarizing data sets using pictures and statistical quantities.
- False

15. Identify whether the statement below is an example of
a) positive correlation
b) negative correlation
c) no correlation

The more you exercise your muscles, the stronger they get.
- positive correlation
MODULE 2
- Weather forecast
- Singapore’s supply and demand
gap on a typical day
Data Mining
is defined as a process used to extract usable
insights from a larger set of any raw data.
- Gmail’s spam
Key Terms

• Text Processing
• Speech Recognition
• Face Recognition

SUBTOPIC 1 INTRODUCTION TO DATA MINING

What is Data Mining

Data mining aims at discovering useful data patterns from massive amounts of data.
Artificial Intelligence
Machine Learning
Aspects of Business Analytics
Arthur Samuel (1959). Machine Learning: Field
of study that gives computers the ability to
learn without being explicitly programmed.
• Probability of default in credit risk assessment

If the sequence is defined by the time over which data points are observed, we call the sequence of data points a time series.
Data Mining can be performed on the following data types:

• Relational databases
• Data warehouses
• Advanced DB and information repositories
• Object-oriented and object-relational databases
• Transactional and spatial databases
• Heterogeneous and legacy databases
• Multimedia and streaming databases
• Text databases
• Text mining and web mining

Data Mining Techniques

- Classification
- Clustering
- Regression
- Outlier Detection
- Sequential Patterns
- Prediction
- Association Rules

Challenges

• Skilled experts are needed to formulate the data mining queries.
• Overfitting: due to a small training database, a model may not fit future states.
• Data mining needs large databases which sometimes are difficult to manage.
• The data mining techniques are not always accurate, and so can cause serious consequences in certain conditions.
• If the data set is not diverse, data mining results may not be accurate.
• Business practices may need to be modified to determine how to use the information uncovered.
❑ Insurance
K-Nearest neighbor
• used against data that has no historical labels
• the goal is to explore the data and find some structure within

A definition of clustering could be "the process of organising objects into groups whose members are similar in some way".
Nearest-neighbor mapping
Trees in Weka
In this approach in data mining, input variables and output variables will be given.

Logistic regression is classified as supervised learning.
- True

Association rule algorithm is an example of what approach in data mining?
- Unsupervised Learning

Weka is a tried and tested open source machine learning software that can be accessed through a graphical user interface, standard terminal applications, or a Java API.
- True

Unsupervised machine learning finds all kinds of unknown patterns in data.
- True

All data is unlabeled and the algorithms learn the inherent structure from the input data. This statement pertains to ___________.
- Unsupervised Learning

This role provides the funding when doing an analytical project.

Based on the analytical life cycle defined by SAS, this phase has two types of decisions: operational and strategic.
- Act

Interpreting mined patterns concludes the KDD process.
- True

It is the aspect of business analytics that finds patterns in unstructured data like social media or survey tools which could uncover insights about consumer sentiment.
- Text analytics

This is the first step in the KDD process.
- Data Selection

In this phase, you'll search for relationships, trends and patterns to gain a deeper understanding of your data.
- Explore
Which of the following is not included in the analytical life cycle defined by SAS?
- Integration

These are values that lie away from the bulk of the data.
- Outliers

CRISP-DM stands for ____________________________.
- Cross-industry standard process for data mining

All data is labeled and the algorithms learn to predict the output from the input data. This statement pertains to ___________. (Use lowercase for your answer)
- supervised

Which of the following is not considered a data mining technique?
- Kurtosis

Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the performance measure P in this setting?
- The number of emails correctly classified as spam/not spam

Self-organizing maps are an example of supervised learning.
- False
Parameters
y = mx + b
y = ax + b

• Professors
Missing data has the potential to adversely
affect a regression analysis by reducing the total
usable sample size.
Extrapolation
y = a + bX + ε

• y – dependent variable
• a – intercept
• b – slope
• ε – residual (error)
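A minimal sketch of estimating a and b by least squares (the X/y values are made up):

```python
# Ordinary least squares for y = a + bX on a tiny made-up data set.
X = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
mx, my = sum(X) / n, sum(y) / n
# Slope: co-movement of X and y over the spread of X.
b = sum((xi - mx) * (yi - my) for xi, yi in zip(X, y)) / sum((xi - mx) ** 2 for xi in X)
a = my - b * mx  # intercept: estimated y when X = 0
print(f"y = {a:.3f} + {b:.3f}X")

# Residuals: observed y minus the value predicted by the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(X, y)]
```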
Example problem:
Step 4

Substitute the values of the coefficients into the formula mentioned previously.

house price = 98.25 + 0.1098 (sq. ft.)

Step 5

Multiple Linear Regression Model

• Estimated multiple regression equation:
Nonlinear Regression
Regression Analysis
3) Testing hypothesis.
SUBTOPIC 2 LINEAR REGRESSION
y = β0 + β1 x + ε
- The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.

The Estimation Process in Simple Linear Regression

E(y|x) = β0 + β1x
• β1 = slope
Estimated simple linear regression equation:

• Here, we will determine the values of b0 and b1.

• Interpretation of b0 and b1:

• The slope b1 is the estimated change in the mean of the dependent variable y that is associated with a one-unit increase in the independent variable x.

• The y-intercept b0 is the estimated value of the dependent variable y when the independent variable x is equal to 0.

• ŷi = predicted value of the dependent variable for the ith observation

• n = total number of observations

ith residual: the error made using the regression model to estimate the mean value of the dependent variable for the ith observation.
Miles Traveled and Travel Time (in Hours) for Ten Butler Trucking Company Driving Assignments

• Denoted as
• Hence,

Slope equation

Predicted Travel Time in Hours and the Residuals for Ten Butler Trucking Company Driving Assignments

• It is a statistical technique that uses several independent variables to predict the dependent variable.
Research Question 1
Identifying variables
Scatter Chart of Miles Traveled and Travel
Time in Hours for Butler Trucking Company
Driving Assignments with Regression Line
Superimposed
Research Question 2
Identifying Variables
Difference Correlation Results
Value of R square
Case Study
Parameters
• Outcome
Interpretations
Concept of Probability
Example Scenario
Logistic Regression

It is a statistical technique used to develop predictive models with categorical dependent variables having dichotomous or binary outcomes.

Let's say that the probability of success of some event is 0.8. Then the probability of failure is 1 - 0.8 = 0.2. The odds of success are defined as the ratio of the probability of success over the probability of failure.
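The 0.8 example in code, including the log-odds (logit) that logistic regression models:

```python
# Probability -> odds -> log-odds, using the 0.8 example above.
import math

p_success = 0.8
p_failure = 1 - p_success          # 0.2
odds = p_success / p_failure       # 0.8 / 0.2, i.e. 4 to 1
log_odds = math.log(odds)          # the "logit": what the model is linear in
print(odds, round(log_odds, 3))
```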
Maximum-likelihood estimation
Coefficients results
• We can see the standard error, z value, and p-value along with an asterisk indication to easily identify significance.

• We then determine whether the estimate is truly far away from 0. If the standard error of the estimate is small, then relatively small values of the estimate can reject the null hypothesis.

Validate model using confusion matrix

Let's assume that we acquire the following results after running logistic regression on our sample data set:
• If the standard error is large, then the estimate should also be large enough to reject the null hypothesis.

Columns in confusion matrix

• True Positive (TP): when it is predicted as TRUE and is actually TRUE
• True Negative (TN): when it is predicted as FALSE and is actually FALSE

- False
- False
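A minimal sketch of summary measures from a confusion matrix; the four counts are made up:

```python
# Accuracy, precision, and recall from a 2x2 confusion matrix (made-up counts).
tp, fp = 50, 10   # predicted TRUE:  actually TRUE / actually FALSE
fn, tn = 5, 35    # predicted FALSE: actually TRUE / actually FALSE

total = tp + fp + fn + tn
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # how often a TRUE prediction is right
recall = tp / (tp + fn)        # share of actual TRUEs that were found
print(accuracy, precision, recall)
```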
2. Overfitted data can significantly lose
the predictive ability due to an erratic
response to noise whereas
underfitted will lack the accuracy to
account for the variability in
response in its entirety.
True
24. y = a + bX1 + cX2 + dX3 + ε

The formula above is used to model nonlinear regression.

False
The lower the AUC value, the worse the model's predictive accuracy. True

The value of the residual (error) is not correlated across all observations. True
It is considered as the log of odds of the event. (Use lowercase for your answer)
- intercept

Suppose your model was able to predict that a certain student fails the certification exam and she actually does not. False Negative

Two variables are correlated if there is a linear association between them. If not, the variables are uncorrelated. True

MODULE 4 TIME SERIES AND FORECASTING

Time series analysis was introduced by Box and Jenkins (1976) to model and analyze time series data with autocorrelation.

Time series data consist of data observations over time.

Application of Time Series
• Predicting stock prices
• Airline fares
• Unemployment data
Sample Scenario: Gasoline Sales Time Series Plot

Bicycle Sales Time Series
Forecasting Methodologies
Moving Average
Exponential Smoothing
• Census analysis
• Budgetary analysis
• Inventory studies
• Sales Forecasting

What are other examples of time series data?

1. Daily data on sales
2. Monthly inventory
3. Daily customers
4. Monthly interest rates, cost
5. Monthly unemployment rates
6. Weekly measures of money supply
7. Daily closing prices of stock indexes, and so on

Plot time series data

Components of Time Series

Secular Trend

The increase or decrease in the movements of a time series is called secular trend. A time series may show an upward or downward trend over a period of years, and this may be due to factors like:

• Increase of population
• Change in technological progress
• Large-scale shift in consumer demands
Seasonal variation

• More woolen clothes are sold in winter than in summer
• Each year more ice cream is sold in summer and very little in winter

These are fluctuations in the time series that are short in duration, erratic in nature, and follow no regularity in their occurrence pattern.

• Measures to determine how well a particular forecasting method is able to reproduce the time series data that are already available.
Tableau
Excel
SAS 9.4
SAP Analytics

Forecast Accuracy

Mean Squared Error (MSE): a measure that avoids the problem of positive and negative errors offsetting each other; it is obtained by computing the average of the squared forecast errors.
Computing Forecasts and Measures of Forecast
Accuracy using the most recent Value as the
Forecast for the next Period.
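A minimal sketch of this naive forecast and two accuracy measures, MAD and MSE, on a made-up series:

```python
# Naive forecasting: the most recent value is the forecast for the next
# period. MAD and MSE summarize the resulting forecast errors (made-up data).
series = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20]

# Error in period t is the actual value minus the previous period's value.
errors = [actual - prev for prev, actual in zip(series, series[1:])]

mad = sum(abs(e) for e in errors) / len(errors)   # mean absolute deviation
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
print(mad, mse)
```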
Computing Forecasts and Measures of Forecast Accuracy using the Average of all the Historical Data as the Forecast for the next Period

Business Forecasting

Airline Passengers 1990–2004
The airline data illustrate seasonality, because there are seasonal patterns with respect to when people fly. For example, more people fly in August than in February. The data illustrate trend, because the number of passengers is increasing from year to year. The data illustrate the effect of events, because the events of September 11, 2001, had a significant negative impact on airline travel.

Consider the above data restricted to years 1994 through 1997. This restricted period allows you to focus on trend and seasonality without having to worry about the effects of external events. To validate data preparation and to formulate forecast models for a time series, a time plot of the data is produced.

Univariate Time Series Models

Univariate time series models are models used when the dependent variable is a single time series.

Multivariate Time Series Models

Multivariate time series models are used when there are multiple dependent variables. In addition to depending on their own past values, each series may depend on past and present values of the other series.

Modeling U.S. gross domestic product, inflation, and unemployment together as endogenous variables is an example of a multivariate time series model.
Why use time series analysis?
Model Diagnostic Statistics
Advantages
• Growth – through depicting patterns, TSA helps in measuring financial growth.

Supplementary

When the forecast value is derived using a linear regression model, the square of Pearson's correlation coefficient is equivalent to the popular R² (R-Square) statistic. In practice, Pearson's correlation and average error are seldom used as primary accuracy diagnostics. These measures will be used in this work to motivate the accuracy measures that are commonly used.

R-Square is never appropriate for evaluating time series models. Some sources redefine R-Square so that it is more appropriate for time series modeling. The above definition can actually lead to negative values if the model is nonlinear, which is a contradiction for a squared quantity based on non-complex numbers. The Time Series Forecasting System provides some alternative R-Square measures.

The use of a simulated retrospective study to pick a model is known as honest assessment. Unfortunately, honest assessment is not presented in many modern forecasting textbooks. Because many textbook data sets are small, honest assessment is often not possible. In your work, you may discover that you are rarely able to carry out a simulated retrospective study because of data limitations.

Rules of Thumb

At least four time points are required for every parameter to be estimated in a model.

Anything above the minimum series length can be used to create a holdout sample.

Holdout samples should rarely contain over 25% of the series.

A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable and the goal is usually to make a forecast for the future.

Autocorrelation

Informally, autocorrelation is the similarity between observations as a function of the time lag between them.

Seasonality

Stationarity

Seasonal autoregressive integrated moving average model (SARIMA)

Of course, this is useful if you notice seasonality in your time series.
There are many ways to model a time series in order to make predictions. Here, I will present:

- moving average
- exponential smoothing
- ARIMA

Moving average

The moving average model is probably the most naive approach to time series modelling. This model simply states that the next observation is the mean of all past observations.

SUBTOPIC 2 MOVING AVERAGE

Concept of averaging methods

• If a time series is generated by a constant process subject to random error, then the mean is a useful statistic and can be used as a forecast for the next period.

• Averaging methods are suitable for stationary time series data where the series is in equilibrium around a constant value (the underlying mean) with a constant variance over time.
Exponential smoothing

Exponential smoothing uses a similar logic to moving average, but this time, a different decreasing weight is assigned to each observation. In other words, less importance is given to observations as we move further from the present.

What is moving average?

• It is a technique that calculates the overall trend in sales volume from historical data of the company.

• This technique is well known when forecasting short-term trends.
Double exponential smoothing

• Continue calculating each five-year average until you reach 2009–2013. This gives you a series of points from which you can plot a chart of moving averages.

Averaging methods

• The Mean
• Uses the average of all the historical data as the forecast

A moving average of order k, MA(k), is the average of k consecutive observations. k is the number of terms in the moving average. The moving average model does not handle trend or seasonality very well, although it can do better than the total mean.
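A minimal sketch of MA(k) on a made-up series:

```python
# Moving average of order k: each value is the mean of k consecutive
# observations (the sales figures are made up).
def moving_average(series, k):
    """Return the MA(k) values; entry i is the mean of series[i:i+k]."""
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]

sales = [17, 21, 19, 23, 18, 16, 20]
print(moving_average(sales, 3))
```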
Forecasted value

F(t+1) = α y(t) + (1 − α) F(t)

α = smoothing constant
y(t) = observed value of the series in period t
F(t) = old forecast for period t

• The forecast for week 26 is
• α cannot be equal to 0 or 1.

• The forecast F(t+1) is based on weighting the most recent observation y(t) with a weight α and weighting the most recent forecast F(t) with a weight of 1 − α.

• The implication of exponential smoothing can be better seen if the previous equation is expanded by replacing F(t) with its components as follows:

• If stable predictions with smoothed random variation are desired, then a small value of α is desired.

• If a rapid response to a real change in the pattern of observations is desired, a large value of α is appropriate.

• To estimate α, forecasts are computed for α equal to .1, .2, .3, …, .9 and the sum of squared forecast errors is computed for each.

• The value of α with the smallest RMSE is chosen for use in producing the future forecasts.
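A minimal sketch of the smoothing recursion F(t+1) = αy(t) + (1 − α)F(t) on a made-up series; the initial forecast is set to the first observation, a common convention:

```python
# Simple exponential smoothing: F(t+1) = alpha * y(t) + (1 - alpha) * F(t).
def exp_smooth(series, alpha, f0=None):
    """Return one-step-ahead forecasts, one per period; F(1) defaults to y(1)."""
    forecasts = [f0 if f0 is not None else series[0]]
    for y in series[:-1]:
        # New forecast blends the latest observation with the old forecast.
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

sales = [17, 21, 19, 23, 18, 16, 20]  # made-up weekly sales
print(exp_smooth(sales, alpha=0.2))
```

A larger alpha reacts faster to real shifts in the series; a smaller alpha smooths out random variation, matching the guidance above.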
FORMATIVES
Compute for the MSE: -9 85.5 81 9
Actual: 81
abs(81 − 90) ^ 2

A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.
- True

MSE stands for the mean standard error.
- False
What does the graph below illustrate? an
increasing trend only
The reproduction of crops is highly
dependent on this component in time
series. Seasonal variations
Which of the following describes an
unpredictable, rare event that appears in the
time series? Irregular variations
- smoothed value in the current time period

- Measuring the time it takes for something to happen based on a given number of variables

Which of the following is a valid weight for exponential smoothing?
- 0.5
Which of the following indicates the purpose for using the least squares method on time series data?

Compute for the Mean Absolute Deviation:

- (Pi/Pbase)100
- True
How many degrees of freedom are used for the t test to determine the significance of the highest order autoregressive term?
- n − 2p − 1

It pertains to the gradual shifts or movements to relatively higher or lower values over a longer period of time.
- Trend Pattern
- 100
- False
- Forecast Error
- False
-
The formula below describes
- False
- True (False)
- 6
It is the variable being manipulated by researchers.
- Explanatory variable

Given the following values for age, what is the problem with the data?
- False
- True
- True
- True
Data Inconsistency?

A heterogeneous data set is a data set whose data records have the same target value.
- False

A time series data set usually has two variables, namely transaction and item.
- False
The sales of the mini electric fans that Louise sells vary every season. This is an example of a seasonal effect on a time series.
- True

24/20

Regression is a famous supervised learning technique.
- True

Forecasting methods can be classified as qualitative and quantitative.
- True

One measure of forecasting accuracy is the mean absolute deviation (MAD). As the name suggests, it is a measure of the average size of the prediction errors.

To estimate the change in Y for a one-unit change in X.
- -7.69
- Irregular variations
- False
- False
- False
- Irregular variations
- 18,650
Trend series: a sequence of observations on a variable measured at successive points in time or over successive periods of time.
- True

With moving average the idea is that the most recent observations will usually provide the best guide as to the future, so we want a weighting scheme that has decreasing weights as the observations get older (exponential smoothing).
- False

Exponential Smoothing is an obvious extension of the moving average method.
- False
- True
Association Rule is the foundation of several recommender systems.
- Amazon

8. Offer candies in the shape of a Barbie doll

Process of Rule Selection

Generate all rules that meet specified support & confidence
Rule Interpretation
Key Ideas
• Credit Cards / Banking Services (each card/account is a transaction containing the set of the customer's payments)

• Medical Treatments (each patient is represented as a transaction containing the ordered set of diseases)

• The confidence of the rule X → Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D.
Advantages/Disadvantages

• Advantages:

• Given:
― a database D of transactions;
― minimum support s;
― minimum confidence c;

• Find:
― all association rules X → Y with a minimum support s and confidence c.

Many rules are possible. For example, Transaction 1 supports several rules, such as
Problem Decomposition

• Find all sets of items that have minimum support (frequent itemsets)
• Use the frequent itemsets to generate the desired rules

• "If red, then white" ("If a red faceplate is purchased, then so is a white one")
• "If white, then red"
• "If red and white, then green"
• + several more
Confidence

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Example rule

{red, white} → {green} with confidence = 2/4 = 50%

• [(support {red, white, green}) / (support {red, white})]
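A minimal sketch of support, confidence, and lift for a rule X → Y; the five transactions are made up but chosen so that {red, white} → {green} again has confidence 2/4 = 50%:

```python
# Support, confidence, and lift over a tiny made-up transaction database.
transactions = [
    {"red", "white", "green"},
    {"red", "white"},
    {"red", "white", "blue"},
    {"red", "white", "green"},
    {"blue", "green"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"red", "white"}, {"green"}
confidence = support(X | Y) / support(X)   # P(Y | X): the ratio defined above
lift = confidence / support(Y)             # > 1 means X makes Y more likely
print(support(X), confidence, lift)
```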
0.43
10. Observe the table below and compute for the support of Beer => peanut.
- 4/7

11. Observe the table below and compute for the support of SD Card => Phone case.
- 0.3

12. Observe the table below and compute for the lift ratio of
- 0.3
- 1.5

13. When it comes to association analysis, the more rules you produce, the greater the risk is.
- True

14. Supposed you want to solve a time series problem where a rapid response to a real change in the pattern of observations is desired, which among the following is the ideal value for your alpha?
- 0.8

15. Which of the following is not an advantage of using association rule?
- Assumes transaction database is memory resident.

16. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.
- True

17. It measures the overall impact.
- Support

18. Clustering aims to discover certain features that often appear together in data.
- False

19. It is another type of association analysis that involves using sequence data.
- Association Rule

20. Observe the table below and compute for the support of SD Card => Phone case given the antecedent.
- Confidence

1. Segmentation is a data mining method that usually consists of two variables, a transaction and an item.
- False

2. It is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis.
- Variable Clustering

3. It controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}.
- Lift ratio

4. Association rule mining is about grouping similar samples into clusters.
- False

5. It is the process of discovering useful patterns and trends in large data sets.
- Data mining

6. Observe the table below and compute for the lift ratio of egg -> Peanut
10. Input validation helps to lessen what type of anomaly?
- Insertion anomaly
11. Observe the table below and compute for the confidence of Phone -> SD card.
- 0.93

7. Observe the table below and compute for the confidence of Airpods -> powerbank.
- 0.5

12. Lift ratio shows how effective the rule is in finding consequents.
- True

13. Observe the table below and compute for the lift ratio of
- 0.4
- 0.6

9. It is the conditional probability of occurrence of the consequent.
- 0.1

The objective of clustering is to uncover a pattern in the time series and then extrapolate the pattern into the future.
- False
- 4/7
- False
- 0.93
- 1/5 or 0.2
- Confidence
- 4/5
- True
- confidence
19/20
- 0.1
- confidence
- Confidence
- lift ratio - 1
It measures the overall impact.
- Support

Its objective is to uncover a pattern in the time series and then extrapolate the pattern into the future.
- Time Series

Which of the following is not an application of pattern discovery?
- Patterns

- True
- 0.40
- False
- False
- Support
- 0.4

7: Powerbank, phone case, SD Card
8: airpods, powerbank
MODULE 6
• Inconsistent data
• Limited features
Data Reduction
Observe minimizing garbage in, garbage out (GIGO)!

Missing data is a problem that continues to plague data analysis methods. Let's examine the cars data set.
Procedures
1. Backward-selection
2. Forward-selection
Anomaly detection
Suppose that you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files; that is, data integration. Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies.

The most common and most effective numerical measure of the "center" of a set of data is the (arithmetic) mean.

Different Measures of Central Tendency

1. Distributive Measure
2. Algebraic Measure
3. Holistic Measure
Furthermore, it would be useful for your analysis to obtain aggregate information as to the sales per customer region, something that is not part of any precomputed data cube in your data warehouse. You soon realize that data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that would contribute toward the success of the mining process.

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction.

Measuring the Dispersion of Data

The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation. Boxplots can be plotted based on the five-number summary and are a useful tool for identifying outliers.
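The dispersion measures just listed can all be computed with a few lines of standard-library Python. The data values are made up, and the quartiles here use the simple "median of the halves" convention, which can differ slightly from other quartile definitions.

```python
# Dispersion measures: range, five-number summary, IQR, standard deviation.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

def median(xs):
    """Middle value of a sorted list (mean of the two middle values if even)."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

xs = sorted(data)
n = len(xs)
data_range = xs[-1] - xs[0]            # largest value - smallest value
q1 = median(xs[: n // 2])              # lower half (excludes median for odd n)
q2 = median(xs)
q3 = median(xs[(n + 1) // 2 :])        # upper half
five_number = (xs[0], q1, q2, q3, xs[-1])
iqr = q3 - q1                          # interquartile range

mean = sum(xs) / n
std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5  # population std. dev.
```

The five-number summary is exactly what a boxplot draws: the box spans Q1 to Q3, the line inside it is the median, and points far outside the whiskers are flagged as outliers.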
Graphic Displays of Basic Descriptive Data Summaries

Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical data presentation software packages, there are other popular types of graphs for the display of data summaries and distributions. These include histograms, quantile plots, q-q plots, scatter plots, and loess curves. Such graphs are very helpful for the visual inspection of your data.

Data Cleaning as a Process

The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses).
As a starting point, use any knowledge you may already have regarding properties of the data. Such knowledge or "data about data" is referred to as metadata. Field overloading is another source of errors that typically results when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes (e.g., using an unused bit of an attribute whose value range uses only, say, 31 out of 32 bits).

Data Cleaning

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.

Missing Values
These are the methods for filling in the missing value for an attribute:

1. Ignore the tuple. This is usually done when the class label is missing (assuming the mining task involves classification).

2. Fill in the missing value manually. In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value. Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞.

4. Use the attribute mean to fill in the missing value. Use this value to replace the missing value for income.

5. Use the attribute mean for all samples belonging to the same class as the given tuple. For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

6. Use the most probable value to fill in the missing value. This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

The data should also be examined regarding unique rules, consecutive rules, and null rules.

A unique rule says that each value of the given attribute must be different from all other values for that attribute.

A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers).

A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition (e.g., where a value for a given attribute is not available), and how such values should be handled.

There are a number of different commercial tools that can aid in the step of discrepancy detection.

Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spellchecking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.

Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools. Commercial tools can assist in the data transformation step.

An attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes. Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.

Strategies for data reduction include the following:

2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.

4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.

Data Transformation

In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:

• Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
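Three of the preprocessing steps in this module can be sketched together on a tiny, hypothetical customer table (the rows and column names below are made up for illustration): missing-value strategies 3 to 5, min-max normalization to a small range, and a correlation check for redundant attributes.

```python
# Hypothetical table; None marks a missing income value.
rows = [
    {"risk": "low",  "income": 60.0},
    {"risk": "low",  "income": None},
    {"risk": "low",  "income": 80.0},
    {"risk": "high", "income": 20.0},
    {"risk": "high", "income": None},
]

def fill_constant(rows, col, constant="Unknown"):
    """Strategy 3: replace every missing value with one global constant."""
    return [{**r, col: constant if r[col] is None else r[col]} for r in rows]

def fill_mean(rows, col):
    """Strategy 4: replace missing values with the overall attribute mean."""
    known = [r[col] for r in rows if r[col] is not None]
    mean = sum(known) / len(known)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]

def fill_class_mean(rows, col, cls):
    """Strategy 5: replace missing values with the mean of the tuple's own
    class (e.g. the same credit-risk category)."""
    filled = []
    for r in rows:
        if r[col] is None:
            known = [s[col] for s in rows
                     if s[cls] == r[cls] and s[col] is not None]
            r = {**r, col: sum(known) / len(known)}
        filled.append(r)
    return filled

def min_max(xs, lo=0.0, hi=1.0):
    """Normalization: scale values linearly so they fall within [lo, hi]."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

def pearson(xs, ys):
    """Correlation coefficient; values near +1 or -1 flag an attribute that
    may be "derived" from the other (a redundancy candidate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Note how strategy 5 gives different fills per class (70.0 for the low-risk customer, 20.0 for the high-risk one), while strategy 4 gives every missing row the same overall mean.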
Forward Selection
Steps of Backward-selection
Stepwise Procedure
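The forward- and backward-selection procedures named above can be sketched as greedy loops. The `score` function here is a hypothetical stand-in for whatever model-quality metric would really be used (e.g. validation accuracy); it simply rewards a known-good feature set so the control flow is runnable.

```python
BEST = {"age", "income"}  # pretend these are the truly useful features

def score(features):
    """Toy objective: +1 per useful feature, -0.1 per irrelevant one."""
    features = set(features)
    return len(features & BEST) - 0.1 * len(features - BEST)

def forward_selection(candidates):
    """Start empty; repeatedly add the feature that improves the score most,
    stopping when no addition helps."""
    selected = set()
    while True:
        gains = {f: score(selected | {f}) - score(selected)
                 for f in candidates - selected}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            return selected
        selected.add(best)

def backward_selection(candidates):
    """Start with all features; repeatedly drop the feature whose removal
    improves the score most, stopping when no removal helps."""
    selected = set(candidates)
    while True:
        gains = {f: score(selected - {f}) - score(selected) for f in selected}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            return selected
        selected.discard(best)
```

A stepwise procedure combines the two: after each forward addition it reconsiders the already-selected features and drops any that no longer earn their place.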
Estimation is about estimating the value for the target variable except that the target variable is categorical rather than numeric.
- True

It is a best practice to divide your dataset into train and test dataset.
- True

Linearity is caused by having too many variables trying to do the same job.

This happens when the deletion of unwanted information causes desired information to be deleted as well.
- Insertion anomaly

Young = 12–17
Adult = 18–34
Old = 35–60
What kind of data preparation was practiced?
- Data Cleaning

- 19.6

It is a manipulation of scale values to ensure comparability with variables with other scales:
- Scale transformation

It is the process of integrating multiple databases, data cubes, or files.
- data integration

- Backward
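The Young/Adult/Old cut-offs in the question above map a numeric age onto labelled bins, a data preparation step usually called discretization or binning. A minimal sketch using those exact cut-offs:

```python
def age_group(age):
    """Map an age in years to the label used in the example above."""
    if 12 <= age <= 17:
        return "Young"
    if 18 <= age <= 34:
        return "Adult"
    if 35 <= age <= 60:
        return "Old"
    return None  # outside the defined bins
```

Replacing a raw value with its bin label loses precision but makes the attribute categorical, which many pattern-discovery methods require.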