All Topic ISP565

TOPIC 1a
INTRODUCTION
TO DATA MINING
OBJECTIVES
To introduce Data Mining and its relationship with data and knowledge
To discuss the history, evolution and motivation of Data Mining
To discuss Data Mining techniques, tasks, applications and some major issues
https://dribbble.com/shots/10494603-Isometric-Animation-Data-Mining
PATTERN RECOGNITION AND DATA MINING
PATTERN RECOGNITION
A process of recognizing a pattern using a machine (computer). It can be viewed through several aspects:
Pattern Recognition by Human
• perceptual (emotions, feelings)
• specialized: decision making
Pattern Recognition by Computer
• benefit of automated pattern recognition
• advantage in complex calculations
Pattern Recognition from Data
• learn or observe from large amounts of data
• study the dependencies and extract knowledge from data
WHAT IS DATA?
Data – the basic facts such as names, numbers or characters that come in different
forms (like text or image).
Table 1 - a sample of data with five (5) variables, where the last column indicates the outcome of that sample.
# | Names | Studies | Education | Work_performance | Income (D)
1 | Amni Ali | Poor | High School | Poor | None
2 | Chuah Ah Lan | Moderate | High School | Poor | Low
3 | Daria Danial | Poor | High School | Poor | None
4 | Marisa Malik | Moderate | Diploma | Poor | Low
5 | Nur Aini Mat | Poor | High School | Good | Low
6 | Suria Mohd | Moderate | Diploma | Poor | Low
7 | Ozaila Othman | Good | Master | Good | Medium
…
99 | Muhd Haris Aziz | Poor | High School | Good | Low
100 | Zulhairi Yatim | Moderate | Diploma | Poor | Low
WHAT IS KNOWLEDGE?
Knowledge – the processed or organized data (information) that is given some values to
uncover the relationship for deeper understanding.
Sample of knowledge in the form of IF-THEN rules:
studies(Poor) AND work(Poor) --> income(None)
studies(Poor) AND work(Good) --> income(Low)
education(Diploma) --> income(Low)
education(Master) --> income(Medium) OR income(High)
studies(Moderate) --> income(Low)
studies(Good) --> income(Medium) OR income(High)
education(SPM) AND work(Good) --> income(Low)
https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
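The sample rules can be read as a small rule-based predictor. The sketch below is one possible encoding, not the slide authors' method: the function name `predict_income` and the order in which rules are checked are assumptions, since the slide does not say how overlapping rules are resolved. It returns the set of income levels the rules allow for one record.

```python
def predict_income(studies, education, work):
    """Return the set of income levels allowed by the sample IF-THEN rules."""
    # Assumption: specific (two-condition) rules are checked before
    # general (one-condition) rules.
    if studies == "Poor" and work == "Poor":
        return {"None"}
    if studies == "Poor" and work == "Good":
        return {"Low"}
    if education == "SPM" and work == "Good":
        return {"Low"}
    if education == "Master" or studies == "Good":
        return {"Medium", "High"}
    if education == "Diploma" or studies == "Moderate":
        return {"Low"}
    return set()  # no rule fires for this record


# Records copied from Table 1: (studies, education, work, actual income)
table_1 = [
    ("Poor", "High School", "Poor", "None"),
    ("Moderate", "High School", "Poor", "Low"),
    ("Poor", "High School", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "High School", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "Master", "Good", "Medium"),
    ("Poor", "High School", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]

# True when every listed record's income is among the levels the rules allow
rules_cover_table = all(
    income in predict_income(studies, education, work)
    for studies, education, work, income in table_1
)
```

Under this encoding every listed row of Table 1 is covered by at least one rule, which is what "knowledge extracted from data" means here.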
WHAT IS DATA MINING?
Data mining - definition
• extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting
Is everything “data mining”?
• No. Simple search and query processing, like a query for information about “Shopee products”, is not data mining
WHY DATA MINING?
Today, data availability is growing massively, from terabytes to yottabytes, and data is everywhere.
Sources of data:
• Social Media: Facebook, Instagram, Telegram
• Society: Blogs, News
• E-commerce: Amazon, Shopee, Lazada
“There were 5 exabytes of information created between the dawn of civilization through
2003, but that much information is now created every 2 days” – Eric Schmidt, Executive Chairman of Google
“Information is the oil of 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Gartner Research
FROM DATA MINING TO BIG DATA MINING
What is Big data?
A term which refers to a large amount of data, where the concept is related to the characteristics of the data itself.
Figure 1. The 5 V's of Big Data
https://www.techentice.com/the-data-veracity-big-data/
FROM DATA MINING TO BIG DATA MINING
Big data mining refers to the collective data mining or extraction techniques that are performed on a large volume of data (big data).
Goal: to discover insights from social media platforms (Instagram, Twitter, Facebook) with thousands of postings.
Examples:
• Classifying youth emotions based on Twitter data
• Sentiment analysis on reviews of Proton cars in Malaysia using Facebook postings
CONCLUSION
DATA MINING simply…
Finds relationships (that exist within the dataset)
and
makes predictions
Photo credit: https://www.bigstockphoto.com/image-12788702/stock-vector-a-fortune-teller-holding-her-crystal-ball-vector
REFERENCES
1. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
3. Che D., Safran M., Peng Z. (2013) From Big Data to Big Data Mining: Challenges, Issues, and Opportunities. In: Hong B., Meng X., Chen L., Winiwarter W., Song W. (eds) Database Systems for Advanced Applications. DASFAA 2013. Lecture Notes in Computer Science, vol 7827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40270-8_1
4. Razak, Z. I., & Mutalib, S. (2018). Web Mining in Classifying Youth Emotions. Malaysian Journal of Computing, 3(1), 1-11.
5. Wah, Y. B., Abdullah, N., Abdul-Rahman, S., & Tan, M. L. P. (2018). Text Mining and Sentiment Analysis on Reviews of Proton Cars in Malaysia. Malaysian Journal of Science, 37(2), 137-153.
THANK YOU
Shuzlina Abdul Rahman | Sofianita Mutalib | Siti Nur Kamaliah Kamarudin | Farah Syazwani Mohd Rashid
TOPIC 1b
HISTORY, EVOLUTION AND
CLASSIFICATION OF DATA
MINING
OBJECTIVES
To introduce Data Mining (DM) and its relationship with data and knowledge
To discuss the history, evolution and motivation of DM
To discuss DM techniques, tasks, applications and some major issues
HISTORY OF DATA MINING
• The term “data mining” appeared around 1990 in the database community.
• Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” for the first workshop on the topic (KDD-1989), and this term became more popular in the AI and Machine Learning community.
• Currently, Data Mining and KDD are used interchangeably.
• Since about 2007 the term “Predictive Analytics”, and since 2011 the term “Data Science”, have also been used to describe this field.
(Source: Coenen, 2011)
ORIGIN OF DATA MINING
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to data that is
o Large-scale
o High dimensional
o Heterogeneous
o Complex
o Distributed
• A key component of the emerging field of data science and data-driven discovery
[Figure: Data Mining at the intersection of AI, Machine Learning and Pattern Recognition; Statistics; and Database systems]
THE EVOLUTION OF DATA MINING
Evolutionary Step | Enabling Technologies | Business Question | Characteristics
Data Collection (1960s) | Computers, tapes, disks | "What was my total revenue in the last five years?" | Retrospective, static data delivery
Data Access (1980s) | RDBMS, SQL, ODBC | "What were unit sales in New England last March?" | Retrospective, dynamic data delivery at record level
Data Warehousing (1990s) | OLAP, multidimensional databases, data warehouses | "What were unit sales in New England last March? Drill down to Boston." | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | Advanced algorithms, multiprocessor computers, massive databases | "What's likely to happen to Boston unit sales next month? Why?" | Prospective, proactive information delivery
Source: www.thearling.com
MOTIVATION OF DATA MINING
Growth of data in both commercial and scientific databases due to advances in data generation and collection technologies
• Commercial Viewpoint
o Lots of data is being collected and warehoused (e.g. e-commerce: Amazon, Shopee, Lazada)
o Computers have become cheaper and more powerful
• Scientific Viewpoint
o Data collected and stored at enormous speeds
o Helps scientists in automated analysis of massive datasets
https://www.ncdc.noaa.gov/sotc/global/202003
KNOWLEDGE DISCOVERY (KDD) PROCESS
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
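The KDD stages named on this slide can be sketched as a chain of small functions. This is only an illustrative toy, not a real warehouse pipeline: the function names, the two hypothetical source lists `shop_a` and `shop_b`, and the choice of "count identical attribute combinations" as the mining step are all assumptions.

```python
from collections import Counter


def integrate(*sources):
    # Data integration: combine several databases into one collection
    return [record for source in sources for record in source]


def clean(records):
    # Data cleaning: drop records that contain missing values
    return [r for r in records if all(v is not None for v in r.values())]


def select(records, attributes):
    # Selection: keep only the task-relevant attributes
    return [{a: r[a] for a in attributes} for r in records]


def mine(records):
    # Data mining (toy version): count each attribute-value combination
    return Counter(tuple(sorted(r.items())) for r in records)


def evaluate(patterns, min_count):
    # Pattern evaluation: keep only patterns seen at least min_count times
    return {p: c for p, c in patterns.items() if c >= min_count}


# Hypothetical source databases
shop_a = [{"customer": "A1", "education": "Diploma", "income": "Low"},
          {"customer": "A2", "education": None, "income": "Low"}]
shop_b = [{"customer": "B1", "education": "Diploma", "income": "Low"},
          {"customer": "B2", "education": "Master", "income": "Medium"}]

warehouse = clean(integrate(shop_a, shop_b))
task_data = select(warehouse, ["education", "income"])
patterns = evaluate(mine(task_data), min_count=2)
```

Here only the combination (education = Diploma, income = Low) survives evaluation, because it is the only one that occurs twice after cleaning.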
DATA MINING: ONE STEP OF KDD
[Figure: nested diagram: Knowledge Discovery in Databases > Data mining > Tasks and Techniques]
CLASSIFICATION OF DATA MINING SYSTEMS
Kinds of Database
• Relational
• Data warehouse
• Transactional DB
• Advanced DB system
• Flat files
• WWW
Kinds of Knowledge
• Categorizing data (Classification)
• Find relationship (Association)
• Subdivide similar data (Clustering)
• Make prediction
• …
Techniques used
• Machine learning
• Pattern recognition
• Neural Network
• Naïve-Bayes
• K-nearest neighbour
• Rough Set
• Statistic
Application adapted
• Finance
• Marketing
• Medical
• Stock
• Telecommunication
WHY DM? POTENTIAL APPLICATIONS
Data analysis and decision support
1. Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
2. Risk analysis and management
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
3. Fraud detection and detection of unusual patterns (outliers)
Other Applications
1. Text mining (news group, email, documents) and Web mining
2. Stream data mining
3. DNA and bio-data analysis
MARKET ANALYSIS & MANAGEMENT
CUSTOMER PROFILING
• Clustering or classifying the customers based on the products they purchase
CUSTOMER REQUIREMENT ANALYSIS
1. identifying the best products for different customers
2. predict what factors will attract new customers
PROVISION OF SUMMARY INFORMATION
1. multidimensional summary reports
2. statistical summary information (data central tendency and variation)
REFERENCES
1. Tan, Steinbach, Karpatne, Kumar, Lecture Notes, Chapter 1, Introduction to Data Mining, 2nd Edition, 2018.
2. Coenen, F. Data mining: past, present and future. Knowledge Engineering Review, 26(1), 25-29, 2011.
3. Piatetsky-Shapiro, G. Data Science: Past, Present, and Future. KDnuggets, 2016.
THANK YOU
TOPIC 1c
TASKS AND TECHNIQUES

OF DATA MINING
OBJECTIVES
To discuss the history, evolution and motivation of DM
To discuss DM tasks, techniques, applications and some major issues
DATA MINING: TASKS and TECHNIQUES
TASKS include:
• Classification
• Clustering
• Association Rules
• Prediction
• Sequential Analysis
• Deviation analysis
• Similarity analysis
• Trend analysis
TECHNIQUES include:
• Decision Trees
• Association Rule
• k-means
• Neural Networks
• Naïve Bayes
• k-nearest neighbor
• Statistical Method
[Figure: nested diagram: Knowledge Discovery in Databases > Data mining > Tasks and Techniques]
CLASSIFICATION: DEFINITION
Given a collection of records (training set)
• Each record contains a set of attributes, one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
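The split into training and test sets, and the accuracy measurement on the test set, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the 30% test fraction, and the deliberately trivial "always predict the majority class" learner are all hypothetical stand-ins for a real classifier.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def learn_majority_class(training_set):
    """Toy learner: the model always predicts the most common training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)


# Hypothetical records: (attributes, class)
records = [((i,), "No" if i % 3 else "Yes") for i in range(10)]
training_set, test_set = train_test_split(records)
model = learn_majority_class(training_set)
test_accuracy = accuracy(model, test_set)
```

The point of the split is that `test_accuracy` is measured on records the learner never saw, so it estimates performance on previously unseen data.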
CLASSIFICATION EXAMPLE
Training Set
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
Test Set
Refund | Marital Status | Taxable Income | Cheat
No | Single | 75K | ?
Yes | Married | 50K | ?
No | Married | 150K | ?
Yes | Divorced | 90K | ?
No | Single | 40K | ?
No | Married | 80K | ?
[Figure: Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set]
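To make the example concrete, the sketch below writes down one hand-made decision tree that happens to fit all ten training records and applies it to the test set. The tree structure and the 80K income threshold are assumptions chosen for illustration; the slide does not state which model the classifier learns.

```python
# (Refund, Marital Status, Taxable Income in K, Cheat), copied from the slide
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# (Refund, Marital Status, Taxable Income in K); the class is unknown
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def classify(refund, marital_status, income_k):
    """Hand-written decision tree (an assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Remaining case: no refund, single or divorced
    return "No" if income_k < 80 else "Yes"


training_accuracy = sum(
    classify(r, m, i) == cheat for r, m, i, cheat in training_set
) / len(training_set)
predictions = [classify(*record) for record in test_set]
```

This tree classifies every training record correctly and predicts "No" for all six test records; a different model that also fits the training set could predict differently, which is why a labelled test set is needed to measure accuracy.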
CLASSIFICATION: APPLICATION 1
DIRECT MARKETING
1. Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
2. Approach:
• Collect various demographic, lifestyle, and company-interaction related information about customers: type of business, where they stay, how much they earn, etc.
• Identify which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Use this information as input attributes to learn a classifier model.
CLASSIFICATION: APPLICATION 2
CUSTOMER ATTRITION/CHURN
1. Goal: To predict whether a customer is likely to be lost to a competitor.
2. Approach:
• Use detailed records of transactions (past and present customers).
• How often the customer calls, where they call, what time of the day they call most, their financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
CLUSTERING DEFINITION
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
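For continuous attributes, Euclidean distance between two data points is the square root of the sum of squared attribute differences. A minimal sketch follows; the function name and the sample points are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two hypothetical data points with three continuous attributes each
d = euclidean_distance((0, 0, 0), (3, 4, 0))  # sqrt(9 + 16 + 0) = 5.0
```

A smaller distance means the two points are more similar, so they are more likely to end up in the same cluster.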
ILLUSTRATING CLUSTERING
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized.
Intercluster distances are maximized.
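One common way to obtain clusters with small intracluster and large intercluster distances is k-means, which is listed among the techniques in this topic. The sketch below is a bare-bones version on hypothetical 3-D points; the fixed starting centroids and the fixed number of iterations are simplifying assumptions.

```python
import math


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    # Coordinate-wise mean of a non-empty list of points
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical points forming two well-separated groups in 3-D space
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 9), (9, 8, 8), (8, 9, 9)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])

# Largest distance inside a cluster vs. smallest distance between clusters
max_intra = max(distance(p, q) for c in clusters for p in c for q in c)
min_inter = min(distance(p, q) for p in clusters[0] for q in clusters[1])
```

For these points the largest intracluster distance is far smaller than the smallest intercluster distance, which is the situation the slide illustrates.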
CLUSTERING: APPLICATION 1
MARKET SEGMENTATION
1. Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
2. Approach:
• Collect different attributes of customers based on their geographical
and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
CLUSTERING: APPLICATION 1 – MARKET
SEGMENTATION
Segment 1: high duration but low number of generated calls and moderate number
of sent and received SMS.
Segment 2: moderate duration of generated calls and moderate to high data usage.
Segment 3: high duration of off-net calls, high number of generated calls, and
moderate to low of both duration of generated calls and data usage.
Segment 4: very low call duration, high sent and received SMS, and high data usage.
Segment 5: very low data usage, low duration of generated calls, and high number of
received calls with respect to the number of generated calls.
Segment 6: relatively high duration of international calls.

Market Segmentation: https://online-journals.org/index.php/i-jim/article/download/4392/3606
CLUSTERING: APPLICATION 2
DOCUMENT CLUSTERING
1. Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
2. Approach:
• To identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms.
Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
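A similarity measure based on term frequencies can be sketched with plain word counts and cosine similarity. Cosine similarity is one possible choice and is an assumption here, as are the whitespace tokenizer and the three hypothetical documents.

```python
import math
from collections import Counter


def term_frequencies(text):
    # Naive tokenization: lowercase and split on whitespace
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a if t in tf_b)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "football match results and scores",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_d1_d2 = cosine_similarity(tf["d1"], tf["d2"])  # shared terms, high similarity
sim_d1_d3 = cosine_similarity(tf["d1"], tf["d3"])  # no shared terms
```

A clustering algorithm would then group d1 with d2 and leave d3 in a separate cluster, because d1 and d3 share no terms at all.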
ASSOCIATION RULE DISCOVERY:
DEFINITION
Given a set of records, each of which contains some number of items from a given collection:
• Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
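The strength of such rules is commonly summarized by support and confidence. These two measures are not named on the slide and are introduced here as the standard ones; the helper names are hypothetical. The sketch computes both for the two discovered rules on the five transactions above.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    # Number of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent):
    # Fraction of all transactions containing antecedent and consequent together
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    # Among transactions containing the antecedent, fraction that also
    # contain the consequent
    return count(antecedent | consequent) / count(antecedent)


milk_coke = (support({"Milk"}, {"Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_beer = (support({"Diaper", "Milk"}, {"Beer"}),
                    confidence({"Diaper", "Milk"}, {"Beer"}))
```

{Milk} --> {Coke} holds in 3 of 5 transactions (support 0.6) and in 3 of the 4 transactions that contain Milk (confidence 0.75); {Diaper, Milk} --> {Beer} has support 0.4 and confidence 2/3.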
APPLICATION 1
MARKETING AND SALES PROMOTION
• Let the rule discovered be {Bagels, … } --> {Potato Chips}
• Potato Chips as consequent: can be used to determine what should be done to boost its sales.
• Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
• Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with Bagels to promote the sale of Potato Chips.
APPLICATION 2
SUPERMARKET SHELF MANAGEMENT
1. Goal: To identify items that are bought together by
sufficiently many customers.
2. Approach:
• Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
3. A classic rule
• If a customer buys diaper and milk, then he is very likely to buy rootbeer.
• So, don’t be surprised if you find six-packs of rootbeer stacked next to diapers!
RETAIL ANALYTICS
https://www.digitalnewsasia.com/download/tapwaycasestudy.pdf
REGRESSION
1. Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
2. Extensively studied in statistics and machine learning.
3. Examples:
• Predicting sales amounts of a new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
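For the linear case with a single input variable, the model can be fitted by ordinary least squares. The sketch below uses made-up advertising and sales figures that lie exactly on a line so the result is easy to check by hand; real data would be noisy, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6 + intercept  # prediction for a new expenditure of 6
```

With these figures the fitted line is sales = 2 * advertising + 1, so the prediction for an expenditure of 6 is 13.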
DEVIATION ANALYSIS
1. Discovering the most significant changes in data from previously measured or normative values
2. Usually categorized separately from other data mining tasks
3. Deviations are often infrequent
4. Modifications of classification, clustering and time series analysis can be used as a means to achieve the goal
5. Known as outlier detection in statistics
DEVIATION ANALYSIS (ANOMALY
DETECTION)
1. Detect significant deviations from normal behavior.
2. Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection: typical network traffic at university level may reach over 100 million connections per day
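A very simple way to flag a significant deviation is a z-score test: measure how many standard deviations a value lies from the mean. This is one statistical approach among many, and the transaction amounts and the threshold of 2.0 below are assumptions for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical credit card transaction amounts for one customer
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
suspicious = find_outliers(amounts)
```

Only the 500 transaction is flagged. Note that a single extreme value also inflates the mean and standard deviation, so on very small samples a strict threshold such as 3.0 may flag nothing.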
DEVIATION ANALYSIS (FRAUD DETECTION)
1. Identify employee accounts at financial institutions that have excess numbers of credit memos. Excess credit memos can indicate diversion of funds into employee accounts.
2. Compare employee home addresses, social security numbers, telephone numbers and bank routing and account numbers to those of vendors from the vendor master file. This test can reveal bogus or improperly selected vendor accounts.
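The second test, comparing employee details against the vendor master file, amounts to a match on shared fields. The sketch below uses entirely hypothetical records and field names, and a simple normalization step (lowercasing, collapsing spaces) so that trivially different spellings still match.

```python
# Hypothetical employee records and vendor master file
employees = [
    {"name": "Employee A", "address": "12 Jalan Mawar", "account": "111-222"},
    {"name": "Employee B", "address": "7 Jalan Melur", "account": "333-444"},
]
vendors = [
    {"vendor": "Vendor X", "address": "12 jalan mawar ", "account": "999-000"},
    {"vendor": "Vendor Y", "address": "3 Jalan Kenanga", "account": "333-444"},
    {"vendor": "Vendor Z", "address": "88 Jalan Teratai", "account": "555-666"},
]


def normalise(value):
    # Lowercase and collapse whitespace so near-identical strings compare equal
    return " ".join(value.lower().split())


def find_matches(employees, vendors, fields=("address", "account")):
    """List (employee, vendor, shared fields) for every pair sharing a field."""
    matches = []
    for e in employees:
        for v in vendors:
            shared = [f for f in fields if normalise(e[f]) == normalise(v[f])]
            if shared:
                matches.append((e["name"], v["vendor"], shared))
    return matches


matches = find_matches(employees, vendors)
```

Each match is only a lead for an auditor to review, since an employee and a vendor can share an address or account for legitimate reasons.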
DEVIATION ANALYSIS (FRAUD
DETECTION)
https://www.insurancebusinessmag.com/asia/news/breaking-news/malaysias-antifraud-system-operational-by-october-74933.aspx
PROFITEERING CASES
Yes, keep receipts to fight profiteering, say retailers
Robin Augustin - August 25, 2018 8:00 AM
https://www.freemalaysiatoday.com/category/nation/2018/08/25/yes-keep-receipts-to-fight-profiteering-say-retailers/
http://english.astroawani.com/malaysia-news/gst-1-256-profiteering-cases-detected-1-115-notices-issued-till-june-5-61853
REFERENCES
1. Tan, Steinbach, Karpatne, Kumar, Lecture Notes, Chapter 1, Introduction to Data Mining, 2nd Edition, 2018.
2. Coenen, F. Data mining: past, present and future. Knowledge Engineering Review, 26(1), 25-29, 2011.
3. Piatetsky-Shapiro, G. Data Science: Past, Present, and Future. KDnuggets, 2016.
THANK YOU
TOPIC 1d
PROBLEMS AND
CHALLENGES
OBJECTIVES
To discuss the history, evolution and motivation of DM
To discuss DM techniques, tasks, applications and some major issues
PROBLEMS AND CHALLENGES
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Handling high-dimensionality
• Handling noise, uncertainty, and incompleteness of data
• Incorporation of constraints, expert knowledge, and
background knowledge in data mining
• Pattern evaluation and knowledge integration
• Mining diverse and heterogeneous kinds of data
COMPARING DATA MINING MODELS
PERFORMANCE ISSUES
1) Cost of the Learning Set
• Number of examples necessary for training
• Cost of assuring the good accuracy
2) Time and Memory Constraint
• Time complexity of the learning phase
• Time taken for evaluation
• Time it takes to reach a certain level of accuracy
3) Predictive Availability
• To be able to predict the correct decision towards the test or unseen data
• Involve the generation of rules
• Measuring the quality or accuracy of rules
COMPARING DATA MINING MODELS
PERFORMANCE ISSUES
4) Robustness
• Ability to cope with errors
• The influence of noise
• The presence of outliers
5) Scalability
• Ability to cope with big data
• Algorithms that are scalable
6) Interpretability
• Transparency
• Explanation
HOW DO YOU VALUE DATA AND
KNOWLEDGE?
Death of Scholars
“Verily, ALLAH will not take knowledge away from the people all at once. Rather, ALLAH will take knowledge away by taking the lives of the scholars, until, when no learned person remains (among them), the people appoint ignorant leaders. They are asked, and they give rulings without knowledge, until in the end they go astray and lead others astray.”
(Narrated by Bukhari).
MH370
• 2014 – 2016
• Life, resources
• Searching process
CONCLUSION
• Data mining: Discovering interesting patterns from large amounts of data

• DM - A natural evolution of database technology, in great demand, with wide
applications (business, medical, manufacturing etc.)
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining tasks: Classification, Clustering, Association, Outlier
• Data mining techniques: decision tree, k-means, apriori
• Major issues in data mining (scalability, high dimensionality, heterogeneous and complex data)
REFERENCES
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
THANK YOU
