
Lecture 1

Introduction
Agenda

•Knowledge Pyramid
•Data mining Background
•What is data mining?
•Main data mining objectives
•Current state of data mining
•Challenges
Knowledge Pyramid
Data, Information and Knowledge
• Data (D)
• Isolated factual recordings of separate objects and events
• Enables the recording of seen events
• Information (I)
• Facts in a meaningful context, represented by relationships between isolated data items
• Enables responding to seen events
• Knowledge (K)
• Verified, known information that is accommodated into the business process
• Enables the anticipation of unseen events
DIKW with Examples
Data mining Background
Current IT Trend

https://www.kdnuggets.com/2018/12/predictions-data-science-analytics-2019.html
Data Mining Background

• Facts:
• Storing the data is an operational necessity
• Storing the data has become easy and affordable
• Data acquisition is fully or partially automatic and fast
• Consequences:
• The speed of data comprehension does not match the speed of data
acquisition
• Many commercial database management systems (DBMSs) are not
equipped with data comprehension and analysis tools.
• We may be data rich, but information poor.
Data Mining Background

• Computerization of operations in commercial, governmental and scientific organizations has resulted in large volumes of operational data, e.g.
• Itemized telephone bills
• Bank statements
• Supermarket transactions
• Share prices
• Scientific experimental data sets
• Published web pages
• CCTV video footage
• ……
What action to take?
Describe what already happened.
Data Mining & Other Disciplines

DATA MINING draws on three disciplines:
• Machine Learning (Artificial Intelligence): inductive & deductive learning methods
• Statistics: data analysis theories, methods and measures
• Database Management: fast storage structures & retrieval operations
What is data mining?
Data Mining

• Discover Useful information; leading to a course of action or an understanding of data
• Produce Non-trivial implicit information; not the raw data, nor the result
of a simple data summary
• Work on Real life databases; not laboratory generated data sets
• Use Efficient novel discovery methods; expected to be scaled up and
applied to large databases
Non-trivial Information
• Putting the “search for information” into a spectrum, from the low end to the high end of sophistication:
• Data retrieval (low end of sophistication)
• Retrieval of stored data
• Trivial data aggregation
• Written in standard SQL
• Online analytic processing
• Interactive reporting on stored data
• Summarisation and drilling along different attributes
• Written in extended SQL
• Data mining (high end of sophistication)
• Discovery of hidden and embedded patterns
• Discovery algorithms
• Written in a programming language, probably with the assistance of SQL
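The three levels of the spectrum can be sketched on a toy example. This is only an illustration under stated assumptions: the `sales` table and its rows are invented here, a plain `GROUP BY` stands in for the richer summarisation an OLAP tool would offer, and counting item pairs is a deliberately simple stand-in for a real discovery algorithm.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (basket_id, item) rows invented for this sketch.
ROWS = [
    (1, "diapers"), (1, "beer"), (1, "milk"),
    (2, "diapers"), (2, "beer"),
    (3, "milk"), (3, "bread"),
    (4, "diapers"), (4, "beer"), (4, "bread"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (basket_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", ROWS)

# 1. Data retrieval: return stored rows as they are.
basket_1 = [item for (item,) in conn.execute(
    "SELECT item FROM sales WHERE basket_id = 1 ORDER BY item")]

# 2. Summarisation: aggregate along one attribute (here, item).
item_counts = dict(conn.execute(
    "SELECT item, COUNT(*) FROM sales GROUP BY item"))

# 3. Pattern discovery: SQL fetches the data, the program searches it.
baskets = {}
for basket_id, item in conn.execute("SELECT basket_id, item FROM sales"):
    baskets.setdefault(basket_id, set()).add(item)

pair_counts = Counter()
for items in baskets.values():
    # Count every unordered pair of items bought together in one basket.
    pair_counts.update(combinations(sorted(items), 2))

top_pair, top_count = pair_counts.most_common(1)[0]
print(basket_1, item_counts, top_pair, top_count)
```

The first two steps only report what was stored or a summary of it. The third step surfaces a relationship (which items tend to be bought together) that no single stored row states.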
Data Mining

• A process to discover patterns in large data sets, involving methods from databases
• An essential core process in data analytics
• Includes data visualization
• Includes data exploration

Also Known as …
Business intelligence (core component), big data analytics, predictive analytics,
knowledge discovery in databases …
The story of diapers and beers
Automatic Credit Card Approval
The data mining objectives
Data Mining Objectives

• Classification
• Using existing data to form a classification model and then using the
model to assign an appropriate class label for a data record (e.g. safe
vs. risky customers)
• Estimation
• Similar to classification but to assign a value to an output variable of a
data record (e.g. estimated house value, stock price)
• Prediction
• Similar to classification and estimation, but more concerned with
future outcome of the output (e.g. tomorrow’s weather, coming
election outcomes)
• Description
• General description of data characteristics (e.g. customer profile)
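The difference between classification and estimation can be sketched with one shared model. The example below is a minimal sketch, not a recommended method: it uses a k-nearest-neighbour rule chosen for brevity, and the customer records, the `repayment score` column and all function names are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical historical records:
# ((income_k, debt_k), class label, repayment score)
HISTORY = [
    ((80, 5), "safe", 0.90),
    ((75, 10), "safe", 0.85),
    ((70, 8), "safe", 0.80),
    ((30, 25), "risky", 0.30),
    ((25, 30), "risky", 0.20),
    ((35, 20), "risky", 0.35),
]

def nearest(record, k=3):
    """Return the k historical rows closest to `record` (Euclidean distance)."""
    return sorted(HISTORY, key=lambda row: dist(row[0], record))[:k]

def classify(record, k=3):
    """Classification: majority class label among the k nearest rows."""
    labels = [label for _, label, _ in nearest(record, k)]
    return Counter(labels).most_common(1)[0][0]

def estimate(record, k=3):
    """Estimation: mean numeric output among the k nearest rows."""
    values = [value for _, _, value in nearest(record, k)]
    return sum(values) / len(values)

new_customer = (72, 9)
print(classify(new_customer), round(estimate(new_customer), 2))
```

Both functions use the same existing data and the same notion of similarity. Only the output differs: a class label in one case, a numeric value in the other.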
Data Analytics Objectives

• Classification
• Using existing data to form a model and then to assign an appropriate class
label for a data record (e.g. safe vs. risky customers)

Data Analytics Objectives

• Estimation
• Similar to classification but to assign a value to an output variable of a data
record (e.g. estimated house value)
Data Analytics Objectives

• Prediction
• Similar to classification and estimation, but more concerned with future
outcome of the output (e.g. tomorrow’s weather)
Data Analytics Objectives

• Description
• Describes real-world events and the relationships between factors
Current state of data mining
Current States

• Many data mining algorithms have been developed or adapted


• Many data mining software tools have been built and are in use
• A cross-industry methodology has been formed
• Besides general solutions, more application-oriented data mining solutions
are being developed
• More and more organisations are either doing their own data mining or
hiring consultants to do the job
• Data mining has been extended to web mining and text mining
Data Mining: Current State

• Some nuisances
• Mining cookies
• Spyware and miningware
• Invasion of privacy
• Some serious problems
• “Big Brother is watching”
• Unfair advantages in trading practice, e.g. high-frequency trading (HFT)
• Abuse of personal data
• Ethical concerns
Business Insider
https://www.businessinsider.com/foreign-intelligence-agents-china-spying-on-americans-zoom-2020-4
Data Mining: Promises

• Areas of data mining application:


• Finance and insurance
• Marketing and sales
• Medicine
• Agriculture
• Society, politics and economics
• Science
• Engineering
• Law enforcement
• Military and intelligence (classified)
Customer Churning

Figure excerpted from https://medium.com/techlabsms/customer-churn-behavior-in-the-telecommunications-sector-a7c68b9e05db


Tap into Customer Behavior
New Drugs Development
• Figure excerpted from https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2123-4
Precision Farming
Influencing Voters
Challenges
Real-life Databases

• Characteristics of a real-life database


• The size may be extremely large
• The dimensionality can be very high
• Attributes can be of different data types
• Data quality can be very poor
• Data may exist in pieces and isolated in different systems
• Value distribution can be extremely skewed
• Database content can be dynamic and evolving
• Data may lack traditional record-based structure
• Data are available on secondary storage media
Efficient Algorithms

• Discovering interesting patterns is computationally hard
• Do the patterns make any sense?
• Large data → requires huge memory and fast computers
• Does the algorithm really work, or are its results due to chance?
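One common way to ask whether a discovered difference could be due to chance is a permutation test: shuffle the group labels many times and see how often a gap at least as large appears. The sketch below is an illustration only; the spend figures, segment names and the function `permutation_p_value` are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Share of random relabellings whose mean gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between value and group
        gap = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical spend figures for two customer segments.
segment_a = [52, 55, 58, 60, 61, 63, 65, 67]
segment_b = [40, 42, 45, 47, 48, 50, 51, 53]
p_value = permutation_p_value(segment_a, segment_b)
print(p_value)
```

A small value suggests that the gap between the segments is unlikely to arise from random labelling alone. A value near 1 suggests the "pattern" is indistinguishable from chance.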
Challenges

• Some difficult problems to solve


• Big Data
• How to process data?
• Extremely large data sets
• Extremely high dimensionalities (curse of dimensionality)
• Mining non-standard complex data such as multimedia materials

• Algorithms
• Combinatorial problems and fast algorithms
• Comprehensibility of patterns
• Meaningful evaluation of the patterns
• Discovery of changing and evolving patterns
• Integration of data mining techniques
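One well-known symptom of very high dimensionality is that distances between points become nearly indistinguishable, which weakens any method that relies on "nearest" records. The sketch below illustrates this on uniformly random points. The function name `relative_contrast` and all parameter values are assumptions made for this example.

```python
import random
from math import dist

def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# The contrast between nearest and farthest neighbour shrinks as d grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

In low dimensions the farthest point is many times farther away than the nearest one. In high dimensions the two distances are almost the same, so the notion of a "close" record carries much less information.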
Summary

• Importance of data in operation and importance of information and knowledge in decision-making
• Data rich does not mean information rich
• Data mining: automatic or semi-automatic data understanding and decision support
• To classify, to estimate, to predict and to describe
• Data mining is closely related to databases, statistics and machine learning
• Data mining: from technology towards application
• A lot of potential uses and a lot of challenges to face
References

• Read Chapter 1 of Data Mining Techniques and Applications

• Useful further references


• Han & Kamber, Chapter 1
• Berry & Linoff, Chapter 1 (business-like)
• KDnuggets: http://www.kdnuggets.com/
