
TOPIC 1a

INTRODUCTION
TO DATA MINING
OBJECTIVES

To introduce Data Mining and its relationship with data and knowledge

To discuss the history, evolution and motivation of Data Mining

To discuss Data Mining techniques, tasks, applications and some major issues
https://dribbble.com/shots/10494603-Isometric-Animation-Data-Mining
PATTERN RECOGNITION AND DATA MINING
PATTERN RECOGNITION
A process of recognizing patterns using a machine (computer). It can be viewed from several aspects:

Pattern Recognition by Human
• perceptual (emotions, feelings)
• specialized: decision making

Pattern Recognition by Computer
• benefit of automated pattern recognition
• advantage in complex calculations

Pattern Recognition from Data
• learn or observe from large amounts of data
• study the dependencies and extract knowledge from data
WHAT IS DATA?
Data: the basic facts, such as names, numbers or characters, that come in different forms (like text or image).

Table 1 - a sample of data with five (5) variables, where the last column indicates the outcome of that sample.

#    Names            Studies   Education    Work_performance  Income (D)
1    Amni Ali         Poor      High School  Poor              None
2    Chuah Ah Lan     Moderate  High School  Poor              Low
3    Daria Danial     Poor      High School  Poor              None
4    Marisa Malik     Moderate  Diploma      Poor              Low
5    Nur Aini Mat     Poor      High School  Good              Low
6    Suria Mohd       Moderate  Diploma      Poor              Low
7    Ozaila Othman    Good      Master       Good              Medium
…
99   Muhd Haris Aziz  Poor      High School  Good              Low
100  Zulhairi Yatim   Moderate  Diploma      Poor              Low
WHAT IS KNOWLEDGE?
Knowledge: processed or organized data (information) that has been given value by uncovering its relationships for deeper understanding.

Sample of knowledge in the form of IF-THEN rules:

studies(Poor) AND work(Poor) --> income(None)
studies(Poor) AND work(Good) --> income(Low)
education(Diploma) --> income(Low)
education(Master) --> income(Medium) OR income(High)
studies(Moderate) --> income(Low)
studies(Good) --> income(Medium) OR income(High)
education(SPM) AND work(Good) --> income(Low)

https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
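
To make the link between data and knowledge concrete, the sketch below checks each sample rule against the rows of Table 1 that are visible on the slide. It is only an illustration: rows 8 to 98 are elided in the source and are not guessed, the names `RULES` and `check_rule` are made up for this example, and `education(SPM)` is assumed to correspond to the "High School" value in the table.

```python
# Visible rows of Table 1: (studies, education, work_performance, income).
ROWS = [
    ("Poor", "High School", "Poor", "None"),      # row 1
    ("Moderate", "High School", "Poor", "Low"),   # row 2
    ("Poor", "High School", "Poor", "None"),      # row 3
    ("Moderate", "Diploma", "Poor", "Low"),       # row 4
    ("Poor", "High School", "Good", "Low"),       # row 5
    ("Moderate", "Diploma", "Poor", "Low"),       # row 6
    ("Good", "Master", "Good", "Medium"),         # row 7
    ("Poor", "High School", "Good", "Low"),       # row 99
    ("Moderate", "Diploma", "Poor", "Low"),       # row 100
]
FIELDS = ("studies", "education", "work", "income")
records = [dict(zip(FIELDS, row)) for row in ROWS]

# Each rule: (conditions on the attributes, set of incomes the rule allows).
RULES = [
    ({"studies": "Poor", "work": "Poor"}, {"None"}),
    ({"studies": "Poor", "work": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "Master"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    # Assumption: SPM is treated as the "High School" level used in Table 1.
    ({"education": "High School", "work": "Good"}, {"Low"}),
]


def check_rule(conditions, allowed, rows):
    """Return (number of rows matching the IF part, number that also satisfy the THEN part)."""
    matched = [r for r in rows if all(r[k] == v for k, v in conditions.items())]
    agreeing = [r for r in matched if r["income"] in allowed]
    return len(matched), len(agreeing)


results = [check_rule(cond, allowed, records) for cond, allowed in RULES]
for (cond, allowed), (n_match, n_agree) in zip(RULES, results):
    print(cond, "-->", sorted(allowed), f": {n_agree}/{n_match} matching rows agree")
```

On the visible rows, every rule agrees with all the rows it matches, which is what one would expect if the rules were derived from this data.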
WHAT IS DATA MINING?
Data mining - definition
• extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting

Is everything “data mining”?
• No. Simple search and query processing, such as querying information about “Shopee products”, is not data mining.
WHY DATA MINING?
Today, data availability is growing massively, from terabytes to yottabytes, and data is everywhere.

Sources of data:
• Social media: Facebook, Instagram, Telegram
• Society: Blogs, News
• E-commerce: Amazon, Shopee, Lazada

“There were 5 exabytes of information created between the dawn of civilization through
2003, but that much information is now created every 2 days” – Eric Schmidt, Executive Chairman of Google

“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Gartner Research
FROM DATA MINING TO BIG DATA MINING

What is Big Data?

A term that refers to a large amount of data, where the concept is tied to the characteristics of the data itself.

Figure 1. The 5 V’s of Big Data
https://www.techentice.com/the-data-veracity-big-data/
FROM DATA MINING TO BIG DATA MINING

Big data mining refers to the collective data mining or extraction techniques performed on large volumes of data, or big data.

Examples:
• Classifying youth emotions based on Twitter data
• Sentiment analysis on reviews of Proton cars in Malaysia using Facebook postings

Goal: to discover insights from social media platforms (Instagram, Twitter, Facebook) with thousands of postings.
CONCLUSION

DATA MINING is simply…

Finds relationships (that exist within the dataset) and makes predictions

Photo credit: https://www.bigstockphoto.com/image-12788702/stock-vector-a-fortune-teller-holding-her-crystal-ball-vector
REFERENCES

1. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
3. Che D., Safran M., Peng Z. (2013) From Big Data to Big Data Mining: Challenges, Issues, and Opportunities. In: Hong B., Meng X., Chen L., Winiwarter W., Song W. (eds) Database Systems for Advanced Applications. DASFAA 2013. Lecture Notes in Computer Science, vol 7827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40270-8_1
4. Razak, Z. I., & Mutalib, S. (2018). Web Mining In Classifying Youth Emotions. Malaysian Journal of Computing, 3(1), 1-11.
5. Wah, Y. B., Abdullah, N., Abdul-Rahman, S., & Tan, M. L. P. (2018). Text mining and sentiment analysis on reviews of Proton cars in Malaysia. Malaysian Journal of Science, 37(2), 137-153.
THANK YOU
Shuzlina Abdul Rahman | Sofianita Mutalib | Siti Nur Kamaliah Kamarudin | Farah Syazwani Mohd Rashid
TOPIC 1b
HISTORY, EVOLUTION AND
CLASSIFICATION OF DATA
MINING
OBJECTIVES
To introduce Data Mining (DM) and its relationship with data and knowledge

To discuss the history, evolution and motivation of DM

To discuss DM techniques, tasks, applications and some major issues
HISTORY OF DATA MINING

• The term “data mining” appeared around 1990 in the database community.
• Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” for the first workshop on the topic (KDD-1989), and this term became more popular in the AI and machine learning community.
• Currently, Data Mining and KDD are used interchangeably.
• The terms “Predictive Analytics” (since about 2007) and “Data Science” (since 2011) have also been used to describe this field.
(Source: Coenen, 2011)
ORIGIN OF DATA MINING
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to data that is
  o Large-scale
  o High dimensional
  o Heterogeneous
  o Complex
  o Distributed
• A key component of the emerging field of data science and data-driven discovery

[Figure: Data Mining at the intersection of Statistics; AI, Machine Learning and Pattern Recognition; and Database systems]
THE EVOLUTION OF DATA MINING
Evolutionary Step | Enabling Technologies | Business Question | Characteristics
Data Collection (1960s) | Computers, tapes, disks | "What was my total revenue in the last five years?" | Retrospective, static data delivery
Data Access (1980s) | RDBMS, SQL, ODBC | "What were unit sales in New England last March?" | Retrospective, dynamic data delivery at record level
Data Warehousing (1990s) | OLAP, multidimensional databases, data warehouses | "What were unit sales in New England last March? Drill down to Boston." | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | Advanced algorithms, multiprocessor computers, massive databases | "What's likely to happen to Boston unit sales next month? Why?" | Prospective, proactive information delivery

Source: www.thearling.com
MOTIVATION OF DATA MINING
Data is growing in both commercial and scientific databases due to advances in data generation and collection technologies.

• Commercial Viewpoint
o Lots of data is being collected and warehoused (e-commerce: Amazon, Shopee, Lazada)
o Computers have become cheaper and more powerful

• Scientific Viewpoint
o Data is collected and stored at enormous speeds
o Data mining helps scientists in automated analysis of massive datasets

https://www.ncdc.noaa.gov/sotc/global/202003
KNOWLEDGE DISCOVERY (KDD) PROCESS
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD process, read from bottom to top]
Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation
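
The stages of the KDD process can be sketched as a chain of small functions. This is a toy illustration only: the two source "databases", their field names and the frequency-counting "mining" step are all hypothetical and stand in for whatever sources and algorithm a real project would use.

```python
from collections import Counter

# Hypothetical records from two source databases (illustrative values only).
store_a = [
    {"customer": "c1", "segment": "student", "item": "notebook"},
    {"customer": "c2", "segment": "student", "item": "notebook"},
    {"customer": "c3", "segment": None, "item": "pen"},  # incomplete record
]
store_b = [
    {"customer": "c4", "segment": "staff", "item": "pen"},
    {"customer": "c5", "segment": "student", "item": "notebook"},
    {"customer": "c6", "segment": "staff", "item": "notebook"},
]


def integrate(*sources):
    """Data integration: combine records from several databases."""
    return [record for source in sources for record in source]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def select(records, attributes):
    """Selection: keep only the task-relevant attributes."""
    return [tuple(r[a] for a in attributes) for r in records]


def mine(rows):
    """Data mining (toy version): count how often each attribute combination occurs."""
    return Counter(rows)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}


warehouse = clean(integrate(store_a, store_b))
task_data = select(warehouse, ("segment", "item"))
patterns = evaluate(mine(task_data), min_count=2)
print(patterns)
```

Keeping each stage as its own function mirrors the figure: the output of one stage is the input of the next, and the mining step can be swapped without touching the rest.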
DATA MINING: ONE STEP OF KDD

[Figure: nested diagram. Knowledge Discovery in Databases contains Data mining, which in turn contains Tasks and Techniques]
CLASSIFICATION OF DATA MINING SYSTEMS

Kinds of Database
• Relational
• Data warehouse
• Transactional DB
• Advanced DB system
• Flat files
• WWW

Kinds of Knowledge
• Categorizing data (Classification)
• Find relationship (Association)
• Subdivide similar data (Clustering)
• Make prediction
• …

Techniques used
• Machine learning
• Pattern recognition
• Neural Network
• Naïve-Bayes
• K-nearest neighbour
• Rough Set
• Statistics

Application adapted
• Finance
• Marketing
• Medical
• Stock
• Telecommunication
WHY DM? POTENTIAL APPLICATIONS
Data analysis and decision support
1. Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
2. Risk analysis and management
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
3. Fraud detection and detection of unusual patterns (outliers)

Other Applications
1. Text mining (news group, email, documents) and Web mining
2. Stream data mining
3. DNA and bio-data analysis
MARKET ANALYSIS & MANAGEMENT

CUSTOMER PROFILING
• Clustering or classifying the customers based on the products they purchase

CUSTOMER REQUIREMENT ANALYSIS
1. identifying the best products for different customers
2. predicting what factors will attract new customers

PROVISION OF SUMMARY INFORMATION
1. multidimensional summary reports
2. statistical summary information (data central tendency and variation)
REFERENCES

1. Tan, Steinbach, Karpatne, Kumar, Lecture Notes, Chapter 1, Introduction to Data Mining, 2nd Edition, 2018
2. Pang-Ning Tan, Michael Steinbach & Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2019.
3. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.
4. Coenen, Frans. Data mining: past, present and future. Knowledge Engineering Review, 26(1), 25-29, 2011
5. Gregory Piatetsky-Shapiro, Data Science: Past, Present, and Future, KDnuggets, 2016
TOPIC 1c

TASKS AND TECHNIQUES


OF DATA MINING
OBJECTIVES
To introduce Data Mining (DM) and its relationship with data and knowledge

To discuss the history, evolution and motivation of DM

To discuss DM tasks, techniques, applications and some major issues
DATA MINING: TASKS and TECHNIQUES
TASKS include:
• Classification
• Clustering
• Association Rules
• Prediction
• Sequential Analysis
• Deviation analysis
• Similarity analysis
• Trend analysis

TECHNIQUES include:
• Decision Trees
• Association Rule
• k-means
• Neural Networks
• Naïve Bayes
• k-nearest neighbor
• Statistical Method

[Figure: nested diagram. Knowledge Discovery in Databases contains Data mining, which contains Tasks and Techniques]
CLASSIFICATION: DEFINITION
Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
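
The train/test workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the labelled records are synthetic, the "model" is just a single learned threshold, and the function names are invented for this example.

```python
import random

# Hypothetical labelled records: (attribute value, class label). Illustrative only.
records = [(x, "high" if x >= 50 else "low") for x in range(0, 100, 5)]


def train_test_split(data, test_size, seed):
    """Shuffle a copy of the data and split it into (training set, test set)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def predict(threshold, x):
    """The model: the class is a function of the attribute value."""
    return "high" if x >= threshold else "low"


def learn_threshold(training):
    """Build the model: pick the candidate threshold with the fewest training errors."""
    candidates = sorted(x for x, _ in training)

    def errors(t):
        return sum(predict(t, x) != label for x, label in training)

    return min(candidates, key=errors)


def accuracy(threshold, data):
    """Fraction of records whose predicted class equals the true class."""
    return sum(predict(threshold, x) == label for x, label in data) / len(data)


training_set, test_set = train_test_split(records, test_size=6, seed=0)
model = learn_threshold(training_set)
print("threshold:", model)
print("training accuracy:", accuracy(model, training_set))
print("test accuracy:", accuracy(model, test_set))
```

The test accuracy is the figure that matters: it is measured on records the model never saw while it was being built.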
CLASSIFICATION EXAMPLE

Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set.
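
As an illustration of what a "Model" for this example could look like, the sketch below hard-codes one small decision tree and applies it to both tables. The tree is an assumption written by hand for this example (it is not produced by a learning algorithm here, and the slide does not specify the model); it was chosen because it happens to agree with all ten training records.

```python
# Training set from the slide: (tid, refund, marital_status, taxable_income_k, cheat).
TRAINING = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Test set from the slide: the Cheat column is unknown ("?").
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def classify(refund, marital_status, income_k):
    """Hand-written decision tree (illustrative assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or divorced, no refund: the 80K cut-off is an assumed split point.
    return "Yes" if income_k >= 80 else "No"


training_accuracy = sum(
    classify(refund, status, income) == cheat
    for _, refund, status, income, cheat in TRAINING
) / len(TRAINING)
predictions = [classify(refund, status, income) for refund, status, income in TEST]

print("training accuracy:", training_accuracy)
for row, label in zip(TEST, predictions):
    print(row, "-->", label)
```

Because the test labels are unknown on the slide, only the predictions can be shown; measuring test accuracy would need the true Cheat values.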
CLASSIFICATION: APPLICATION 1

DIRECT MARKETING

1. Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

2. Approach:
• Collect various demographic, lifestyle, and company-interaction related information about customers: type of business, where they stay, how much they earn, etc.
• Identify which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Use this information as input attributes to learn a classifier model.
CLASSIFICATION: APPLICATION 2
CUSTOMER ATTRITION/CHURN

1. Goal: To predict whether a customer is likely to be lost to a competitor.

2. Approach:
• Use detailed records of transactions with past and present customers to find attributes: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
CLUSTERING DEFINITION

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.

Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
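
For continuous attributes, the Euclidean distance between two data points can be computed as in the short sketch below. The function name and the sample points are illustrative.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of attributes."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d = euclidean_distance((0.0, 0.0, 0.0), (1.0, 2.0, 2.0))
print(d)  # sqrt(1 + 4 + 4) = 3.0
```

A smaller distance means the two points are more similar. In Python 3.8 and later, the standard library function `math.dist` computes the same quantity.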
ILLUSTRATING CLUSTERING
Euclidean-distance-based clustering in 3-D space:
• Intracluster distances are minimized
• Intercluster distances are maximized
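
The two quantities named above can be measured directly. The sketch below compares the mean intracluster and mean intercluster distance for two hypothetical clusters of 3-D points; the points and function names are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations, product

# Two hypothetical clusters of points in 3-D space.
cluster_a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
cluster_b = [(10, 10, 10), (11, 10, 10), (10, 11, 10)]


def mean_intracluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    distances = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(distances) / len(distances)


def mean_intercluster_distance(first, second):
    """Average distance between pairs of points taken from different clusters."""
    distances = [math.dist(p, q) for p, q in product(first, second)]
    return sum(distances) / len(distances)


intra = mean_intracluster_distance([cluster_a, cluster_b])
inter = mean_intercluster_distance(cluster_a, cluster_b)
print("mean intracluster distance:", intra)
print("mean intercluster distance:", inter)
```

A good clustering keeps the first number small relative to the second.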
CLUSTERING: APPLICATION 1

MARKET SEGMENTATION

1. Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

2. Approach:
• Collect different attributes of customers based on their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
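
One way to "find clusters of similar customers" is k-means, which is listed among the techniques in this topic. The sketch below is a bare-bones version under stated assumptions: the customer attributes (monthly spend, visits) are hypothetical, the initial centroids are simply the first k points, and a fixed number of iterations is used instead of a convergence test.

```python
import math


def kmeans(points, k, iterations=10):
    """Minimal k-means: returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive initialisation: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (monthly spend, number of visits).
customers = [(10, 1), (80, 9), (12, 2), (85, 10), (11, 1), (90, 8)]
centroids, clusters = kmeans(customers, k=2)
print("centroids:", centroids)
print("clusters:", clusters)
```

With this data the low-spend and high-spend customers end up in separate clusters. Real k-means implementations usually choose the initial centroids more carefully, since a poor start can lead to a poor clustering.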
CLUSTERING: APPLICATION 1 – MARKET
SEGMENTATION
Segment 1: high duration but low number of generated calls and moderate number
of sent and received SMS.

Segment 2: moderate duration of generated calls and moderate to high data usage.

Segment 3: high duration of off-net calls, high number of generated calls, and
moderate to low of both duration of generated calls and data usage.

Segment 4: very low call duration, high sent and received SMS, and high data usage.

Segment 5: very low data usage, low duration of generated calls, and high number of
received calls with respect to the number of generated calls.

Segment 6: relatively high duration of international calls.


Market Segmentation: https://online-journals.org/index.php/i-jim/article/download/4392/3606
CLUSTERING: APPLICATION 2

DOCUMENT CLUSTERING

1. Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

2. Approach:
• Identify frequently occurring terms in each document.
• Form a similarity measure based on the frequencies of different terms.
• Use it to cluster.

3. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
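
A similarity measure based on term frequencies can be sketched as below. Cosine similarity is used here as one common choice; the slide does not name a specific measure, and the three sample documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each (lower-cased, whitespace-separated) term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "the cat sat on the mat",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print("d1 vs d2:", sim_12)
print("d1 vs d3:", sim_13)
```

Documents d1 and d2 share several terms and score well above d1 and d3, which share none; a clustering algorithm would use such scores to group d1 with d2.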
ASSOCIATION RULE DISCOVERY:
DEFINITION
Given a set of records, each of which contains some number of items from a given collection:
• Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
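
How strongly the data backs each discovered rule is commonly measured with support and confidence. These two measures are not named on the slide and are introduced here as standard background; the sketch computes them for the two rules using the five transactions in the table.

```python
# The five transactions from the table, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= items for items in transactions.values())


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    return count_containing(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count_containing(antecedent | consequent, transactions) / count_containing(
        antecedent, transactions
    )


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    s = support(lhs, rhs, TRANSACTIONS)
    c = confidence(lhs, rhs, TRANSACTIONS)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items (support 0.6) and three of the four Milk transactions contain Coke (confidence 0.75). For {Diaper, Milk} --> {Beer}, the support is 0.4 and the confidence is 2/3.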
ASSOCIATION RULE DISCOVERY:
APPLICATION 1
MARKETING AND SALES PROMOTION

• Let the rule discovered be
  {Bagels, … } --> {Potato Chips}
• Potato Chips as consequent: can be used to determine what should be done to boost its sales.
• Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
• Bagels in antecedent and Potato Chips in consequent: can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
ASSOCIATION RULE DISCOVERY:
APPLICATION 2
SUPERMARKET SHELF MANAGEMENT
1. Goal: To identify items that are bought together by sufficiently many customers.
2. Approach:
• Process the point-of-sale data collected with barcode scanners to find dependencies among items.
3. A classic rule
• If a customer buys diapers and milk, then he is very likely to buy root beer.
• So, don’t be surprised if you find six-packs of root beer stacked next to diapers!
RETAIL ANALYTICS
https://www.digitalnewsasia.com/download/tapwaycasestudy.pdf
REGRESSION

1. Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
2. Extensively studied in statistics and machine learning.
3. Examples:
• Predicting sales amounts of a new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
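
For the linear case, the first example (sales as a function of advertising expenditure) can be sketched with ordinary least squares on a single input variable. The figures below are hypothetical and deliberately lie on a straight line so that the fitted values are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales (same arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # predicted sales for an expenditure of 6.0
print("slope:", slope, "intercept:", intercept, "prediction:", predicted)
```

Here the fitted line is sales = 2 x advertising + 1, so an expenditure of 6.0 predicts sales of 13.0. Real data would not fit exactly, and a nonlinear dependency would call for a different model.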
DEVIATION ANALYSIS

1. Discovering the most significant changes in data from previously measured or normative values
2. Usually categorized separately from other data mining tasks
3. Deviations are often infrequent
4. Modifications of classification, clustering, and time series analysis can be used as a means to achieve the goal
5. Corresponds to outlier detection in statistics
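
A very simple statistical form of deviation analysis flags values that lie far from the mean, measured in standard deviations (a z-score). The sketch below is one possible approach, not the only one; the daily amounts and the threshold of 2.0 are assumptions chosen for illustration.

```python
import statistics


def find_deviations(values, threshold):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical daily transaction amounts with one unusual value.
daily_amounts = [52, 48, 50, 51, 49, 50, 47, 53, 50, 250]
outliers = find_deviations(daily_amounts, threshold=2.0)
print(outliers)
```

Only the amount 250 is flagged. Note that a single extreme value also inflates the mean and the standard deviation, which is one reason more robust measures are often preferred in practice.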
DEVIATION ANALYSIS (ANOMALY
DETECTION)
1. Detect significant deviations from normal behavior.
2. Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection

Typical network traffic at the university level may reach over 100 million connections per day.
DEVIATION ANALYSIS (FRAUD DETECTION)

1. Identify employee accounts at financial institutions that have excess numbers of credit memos. Excess credit memos can indicate diversion of funds into employee accounts.

2. Compare employee home addresses, social security numbers, telephone numbers, and bank routing and account numbers to those of vendors in the vendor master file. This test can reveal bogus or improperly selected vendor accounts.
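
The second test, comparing employee details with the vendor master file, amounts to looking for shared values across two sets of records. The sketch below shows the idea with entirely hypothetical records and field names; a real check would also need careful data cleaning (address formats, phone number formats and so on).

```python
# Hypothetical employee records and vendor master file (all values invented).
EMPLOYEES = [
    {"name": "Employee A", "address": "12 Jalan Satu", "phone": "555-0101", "bank_account": "111-222"},
    {"name": "Employee B", "address": "8 Jalan Dua", "phone": "555-0102", "bank_account": "333-444"},
]
VENDORS = [
    {"name": "Vendor X", "address": "99 Jalan Tiga", "phone": "555-0199", "bank_account": "777-888"},
    {"name": "Vendor Y", "address": "8  jalan dua", "phone": "555-0150", "bank_account": "333-444"},
]
MATCH_FIELDS = ("address", "phone", "bank_account")


def normalise(value):
    """Lower-case and collapse whitespace so trivially different spellings still match."""
    return " ".join(value.lower().split())


def find_matches(employees, vendors, fields):
    """Return (employee name, vendor name, shared fields) for every suspicious pair."""
    matches = []
    for employee in employees:
        for vendor in vendors:
            shared = [f for f in fields if normalise(employee[f]) == normalise(vendor[f])]
            if shared:
                matches.append((employee["name"], vendor["name"], shared))
    return matches


suspicious = find_matches(EMPLOYEES, VENDORS, MATCH_FIELDS)
print(suspicious)
```

Here Employee B and Vendor Y share an address and a bank account, so that pair would be passed on for manual review; a match is a reason to investigate, not proof of fraud.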
DEVIATION ANALYSIS (FRAUD
DETECTION)

https://www.insurancebusinessmag.com/asia/news/breaking-news/malaysias-antifraud-system-operational-by-october-74933.aspx
PROFITEERING CASES

“Yes, keep receipts to fight profiteering, say retailers”, Robin Augustin, August 25, 2018 8:00 AM
https://www.freemalaysiatoday.com/category/nation/2018/08/25/yes-keep-receipts-to-fight-profiteering-say-retailers/

http://english.astroawani.com/malaysia-news/gst-1-256-profiteering-cases-detected-1-115-notices-issued-till-june-5-61853
TOPIC 1d
PROBLEMS AND
CHALLENGES
OBJECTIVES
To introduce Data Mining (DM) and its relationship with data and knowledge

To discuss the history, evolution and motivation of DM

To discuss DM techniques, tasks, applications and some major issues
PROBLEMS AND CHALLENGES
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Handling high-dimensionality
• Handling noise, uncertainty, and incompleteness of data
• Incorporation of constraints, expert knowledge, and
background knowledge in data mining
• Pattern evaluation and knowledge integration
• Mining diverse and heterogeneous kinds of data
COMPARING DATA MINING MODELS
PERFORMANCE ISSUES

1) Cost of the Learning Set
• Number of examples necessary for training
• Cost of ensuring good accuracy

2) Time and Memory Constraint
• Time complexity of the learning phase
• Time taken for evaluation
• Time it takes to reach a certain level of accuracy

3) Predictive Availability
• To be able to predict the correct decision for the test or unseen data
• Involves the generation of rules
• Measuring the quality or accuracy of rules
COMPARING DATA MINING MODELS
PERFORMANCE ISSUES

4) Robustness
• Ability to cope with errors
• The influence of noise
• The presence of outliers

5) Scalability
• Ability to cope with big data
• Algorithms that are scalable

6) Interpretability
• Transparency
• Explanation
HOW DO YOU VALUE DATA AND
KNOWLEDGE?
Death of Scholars
“Indeed, Allah does not take away knowledge by removing it from the people all at once. Rather, Allah takes away knowledge through the death of the scholars, until, when no learned person remains (among them), the people appoint ignorant leaders. They are asked, and they give rulings without knowledge, so that in the end they go astray and lead others astray.”
(Narrated by Bukhari)

MH370
• 2014 – 2016
• Life, resources
• Searching process
CONCLUSION

• Data mining: Discovering interesting patterns from large amounts of data
• DM: A natural evolution of database technology, in great demand, with wide applications (business, medical, manufacturing, etc.)
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining tasks: Classification, Clustering, Association, Outlier detection
• Data mining techniques: decision tree, k-means, Apriori
• Major issues in data mining (scalability, high dimensionality, heterogeneous and complex data)
