All Topic ISP565
INTRODUCTION
TO DATA MINING
OBJECTIVES
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting
Source of data?
“There were 5 exabytes of information created between the dawn of civilization through
2003, but that much information is now created every 2 days” – Eric Schmidt, Executive Chairman of Google
“Information is the oil of 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Gartner Research
FROM DATA MINING TO BIG DATA MINING
Big data mining refers to the collective data mining or extraction techniques performed on large volumes of data, or big data.
Goal: to discover insights from social media platforms (Instagram, Twitter, Facebook) with thousands of postings.
CONCLUSION
Finds relationships (that exist within the dataset) and makes predictions
Photo credit: https://www.bigstockphoto.com/image-12788702/stock-vector-a-fortune-teller-holding-her-crystal-ball-vector
REFERENCES
1. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
3. Che D., Safran M., Peng Z. (2013) From Big Data to Big Data Mining: Challenges, Issues, and Opportunities. In: Hong B., Meng X., Chen L., Winiwarter W., Song W. (eds) Database Systems for Advanced Applications. DASFAA 2013. Lecture Notes in Computer Science, vol 7827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40270-8_1
4. Razak, Z. I., & Mutalib, S. (2018). Web Mining in Classifying Youth Emotions. Malaysian Journal of Computing, 3(1), 1-11.
5. Wah, Y. B., Abdullah, N., Abdul-Rahman, S., & Tan, M. L. P. (2018). Text Mining and Sentiment Analysis on Reviews of Proton Cars in Malaysia. Malaysian Journal of Science, 37(2), 137-153.
THANK YOU
Shuzlina Abdul Rahman | Sofianita Mutalib | Siti Nur Kamaliah Kamarudin | Farah Syazwani Mohd Rashid
TOPIC 1b
HISTORY, EVOLUTION AND
CLASSIFICATION OF DATA
MINING
OBJECTIVES
To introduce Data Mining (DM) and its
relationship with data and knowledge
• The term “data mining” appeared around 1990 in the database community.
• Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” for the first workshop on the topic (KDD-1989), and the term became more popular in the AI and machine learning community.
• Currently, data mining and KDD are used interchangeably.
• Since about 2007 the term “Predictive Analytics”, and since 2011 the term “Data Science”, have also been used to describe this field.
(Source: Coenen, 2011)
ORIGIN OF DATA MINING
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to data that is
  • Large-scale
  • High dimensional
  • Heterogeneous
  • Complex
  • Distributed
• A key component of the emerging field of data science and data-driven discovery
[Figure: Data Mining at the overlap of Statistics; AI, Machine Learning, Pattern Recognition; and Database systems]
THE EVOLUTION OF DATA MINING
Evolutionary Step | Enabling Technologies | Business Question | Characteristics
Data Collection (1960s) | Computers, tapes, disks | "What was my total revenue in the last five years?" | Retrospective, static data delivery
Data Access (1980s) | RDBMS, SQL, ODBC | "What were unit sales in New England last March?" | Retrospective, dynamic data delivery at record level
Data Warehousing (1990s) | OLAP, multidimensional databases, data warehouses | "What were unit sales in New England last March? Drill down to Boston." | Retrospective, dynamic data delivery at multiple levels
• Commercial Viewpoint
o Lots of data is being collected and warehoused (e-commerce: Amazon, Shopee, Lazada)
o Computers have become cheaper and more powerful
• Scientific Viewpoint
o Data collected and stored at enormous speeds
o Helps scientists in automated analysis of massive datasets
https://www.ncdc.noaa.gov/sotc/global/202003
KNOWLEDGE DISCOVERY (KDD) PROCESS
• This is a view from typical database systems and data warehousing communities
• Data mining plays an
[Figure: KDD process, stage labels: Databases, Data Integration, Data Cleaning, Data Mining, Pattern Evaluation]
DATA MINING: ONE STEP OF KDD
Data mining
Task
Techniques
CLASSIFICATION OF DATA MINING SYSTEMS
Other Applications
1. Text mining (news group, email, documents) and Web mining
2. Stream data mining
3. DNA and bio-data analysis
MARKET ANALYSIS & MANAGEMENT
REFERENCES
1. Tan, Steinbach, Karpatne, Kumar, Lecture Notes, Chapter 1, Introduction to Data Mining, 2nd Edition, 2018.
2. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
3. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
4. Coenen, Frans. Data mining: past, present and future. Knowledge Engineering Review, 26(1), 25-29, 2011.
5. Gregory Piatetsky-Shapiro, Data Science: Past, Present, and Future, KDnuggets, 2016.
TOPIC 1c
DIRECT MARKETING
2. Approach:
• Collect various demographic, lifestyle, and company-interaction
related information: type of business, where they stay, how much they
earn, etc.
• Identify which customers decided to buy and which decided otherwise.
This {buy, don’t buy} decision forms the class attribute.
• Use this information as input attributes to learn a classifier model.
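The approach above can be sketched in a few lines. This is a minimal illustration only: the two input attributes (income, past purchases), the training records, and the choice of a k-nearest-neighbour classifier are all assumptions made for the example, not part of the original slides.

```python
from collections import Counter
from math import dist

# Hypothetical training records:
# (monthly income in thousands, number of past purchases) -> class attribute
training = [
    ((2.0, 0), "don't buy"),
    ((2.5, 1), "don't buy"),
    ((3.0, 0), "don't buy"),
    ((6.0, 4), "buy"),
    ((7.5, 5), "buy"),
    ((8.0, 3), "buy"),
]

def predict(customer, records, k=3):
    """Return the majority class among the k nearest training records."""
    nearest = sorted(records, key=lambda r: dist(r[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Classify two new (hypothetical) customers before mailing them an offer.
print(predict((7.0, 4), training))
print(predict((2.2, 1), training))
```

Any other classifier (decision tree, naive Bayes, etc.) could take the place of the nearest-neighbour rule; the point is only that the {buy, don't buy} labels and the collected attributes are what the model learns from.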
CLASSIFICATION: APPLICATION 2
CUSTOMER ATTRITION/CHURN
2. Approach:
• Use detailed records of transactions (past and present customers).
• How often the customer calls, where they call, what time of day
they call most, their financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
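As a rough sketch of "finding a model for loyalty", the example below learns a single threshold on one attribute (a decision stump). The attribute (calls per week), the customer records, and the labels are hypothetical; a real churn model would use many more attributes.

```python
# Hypothetical per-customer summaries: (calls per week, label)
customers = [
    (25, "loyal"), (30, "loyal"), (18, "loyal"), (22, "loyal"),
    (4, "disloyal"), (7, "disloyal"), (10, "disloyal"), (12, "disloyal"),
]

def fit_stump(records):
    """Pick the call-frequency threshold that misclassifies the fewest customers.

    Rule learned: calls per week >= threshold -> "loyal", otherwise "disloyal".
    """
    best_threshold, best_errors = None, len(records) + 1
    for threshold in sorted({calls for calls, _ in records}):
        errors = sum(
            (calls >= threshold) != (label == "loyal")
            for calls, label in records
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

threshold, errors = fit_stump(customers)
print(f"loyal if calls per week >= {threshold} ({errors} training errors)")
```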
CLUSTERING DEFINITION
Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
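A small sketch of the two kinds of measure. The Euclidean distance is the standard definition; the mismatch count for categorical attributes is just one illustrative example of a problem-specific measure and is an assumption of this sketch.

```python
from math import sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two points with continuous attributes."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mismatch_distance(a, b):
    """Number of categorical attributes on which two records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance((0, 0, 0), (1, 2, 2)))                       # 3.0
print(mismatch_distance(("urban", "prepaid"), ("rural", "prepaid")))  # 1
```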
ILLUSTRATING CLUSTERING
Euclidean Distance Based Clustering in 3-D space.
MARKET SEGMENTATION
2. Approach:
• Collect different attributes of customers based on their geographical
and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
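The three steps above can be sketched with a plain k-means routine. The customer attributes (age, store visits per month), the purchase counts, and the fixed starting centroids are hypothetical values chosen so the example is small and repeatable.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical customers: (age, store visits per month)
customers = [(22, 9), (25, 8), (24, 10), (58, 2), (61, 1), (60, 3)]
# Hypothetical buying pattern, used only to judge the clusters afterwards
purchases = [12, 11, 13, 3, 2, 4]

def k_means(points, initial_centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return assignment

def mean_purchase_gap(assignment, purchases, same_cluster):
    """Average purchase difference over customer pairs in the same (or different) clusters."""
    gaps = [
        abs(purchases[i] - purchases[j])
        for i, j in combinations(range(len(purchases)), 2)
        if (assignment[i] == assignment[j]) == same_cluster
    ]
    return mean(gaps)

assignment = k_means(customers, [customers[0], customers[3]])
within = mean_purchase_gap(assignment, purchases, same_cluster=True)
between = mean_purchase_gap(assignment, purchases, same_cluster=False)
print(assignment, within, between)
```

A good segmentation shows a small purchase gap within clusters and a large gap between clusters.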
CLUSTERING: APPLICATION 1 – MARKET
SEGMENTATION
Segment 1: high duration but low number of generated calls and moderate number
of sent and received SMS.
Segment 2: moderate duration of generated calls and moderate to high data usage.
Segment 3: high duration of off-net calls, high number of generated calls, and
moderate to low duration of generated calls and data usage.
Segment 4: very low call duration, high sent and received SMS, and high data usage.
Segment 5: very low data usage, low duration of generated calls, and high number of
received calls with respect to the number of generated calls.
DOCUMENT CLUSTERING
2. Approach:
• Identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms.
Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
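A minimal sketch of the similarity measure, assuming whitespace tokenisation and cosine similarity over raw term frequencies (both are choices made for this example). The three documents are hypothetical; the clustering step itself is not shown, only the measure and how a new search term is related to existing documents.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Count how often each lower-cased term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(tf_a[t] * tf_b[t] for t in set(tf_a) & set(tf_b))
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
documents = {
    "sports_1": "football match goal football team",
    "sports_2": "team wins football match",
    "finance_1": "stock market shares fall",
}
vectors = {name: term_frequencies(text) for name, text in documents.items()}

# Relate a new search term to the existing documents.
query = term_frequencies("football goal")
closest = max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))
print(closest)
```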
ASSOCIATION RULE DISCOVERY:
DEFINITION
• Given a set of records, each of which contains some number of items from a given
collection:
• Produce dependency rules which will predict the occurrence of an item based on
occurrences of other items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
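The two discovered rules can be checked against the five transactions by computing their support and confidence. These two measures are the usual way of scoring association rules; the function names below are chosen for this sketch.

```python
# The five transactions from the table above
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, records):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for r in records if itemset <= r)

def support(antecedent, consequent, records):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(antecedent | consequent, records) / len(records)

def confidence(antecedent, consequent, records):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, records) / count(antecedent, records)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs, transactions):.2f}",
          f"confidence={confidence(lhs, rhs, transactions):.2f}")
# {Milk} --> {Coke}: support=0.60, confidence=0.75
# {Diaper, Milk} --> {Beer}: support=0.40, confidence=0.67
```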
ASSOCIATION RULE DISCOVERY:
APPLICATION 1
MARKETING AND SALES PROMOTION
REGRESSION
1. Predict the value of a given continuous-valued variable based on the values of other
variables, assuming a linear or nonlinear model of dependency.
2. Extensively studied in the statistics and machine learning fields.
3. Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure,
etc.
• Time series prediction of stock market indices.
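For the first example (sales as a function of advertising expenditure), a linear model of dependency can be fitted by ordinary least squares. The figures below are hypothetical, and a straight line is assumed purely for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical history: advertising expenditure vs. sales amount (same currency unit)
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.9, 5.1, 7.0, 9.2, 10.8]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for a new product with expenditure 6.0
print(f"sales = {slope:.2f} * advertising + {intercept:.2f}; predicted: {predicted_sales:.2f}")
```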
DEVIATION ANALYSIS
Typical network traffic at University level may reach over 100 million connections per day
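One simple way to detect deviations in such traffic is to flag days whose connection count lies far from the mean. The sketch below uses a z-score with a threshold of 2 standard deviations; the daily counts and the threshold are hypothetical, and other detection methods are equally possible.

```python
from statistics import mean, pstdev

def flag_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily connection counts, in millions
daily_connections_millions = [98, 101, 99, 102, 100, 97, 103, 180]

anomalies = flag_deviations(daily_connections_millions)
print(anomalies)
```

Because a large outlier also inflates the mean and standard deviation, measures based on the median are often preferred in practice.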
DEVIATION ANALYSIS (FRAUD DETECTION)
https://www.insurancebusinessmag.com/asia/news/breaking-news/malaysias-antifraud-system-operational-by-october-74933.aspx
PROFITEERING CASES
https://www.freemalaysiatoday.com/category/nation/2018/08/25/yes-keep-receipts-to-fight-profiteering-say-retailers/
REFERENCES
1. Tan, Steinbach, Karpatne, Kumar, Lecture Notes, Chapter 1, Introduction to Data Mining, 2nd Edition, 2018.
2. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
3. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
4. Coenen, F. Data mining: past, present and future. Knowledge Engineering Review, 26(1), 25-29, 2011.
5. Gregory Piatetsky-Shapiro, Data Science: Past, Present, and Future, KDnuggets, 2016.
TOPIC 1d
PROBLEMS AND
CHALLENGES
OBJECTIVES
To introduce Data Mining (DM) and its
relationship with data and knowledge
4) Robustness
• Ability to cope with errors
• The influence of noise
• The presence of outliers
5) Scalability
• Ability to cope with big data
• Algorithms that are scalable
6) Interpretability
• Transparency
• Explanation
HOW DO YOU VALUE DATA AND
KNOWLEDGE?
Death of Scholar
“Indeed, ALLAH will not take knowledge away from the people all at once. Rather, ALLAH
will take knowledge away by taking the lives of the scholars, until, when no learned
person remains (among them), the people appoint ignorant leaders. They are asked, and
they give rulings without knowledge, until in the end they go astray and lead others
astray.”
(Narrated by Bukhari).
MH370
2014 – 2016
Life, resources
Searching process
CONCLUSION
REFERENCES
1. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.