Ch1 Data Mining New

Data Mining:
Concepts and Techniques
MBA Sem 2
Data Mining
Dr. Ashvini Shende
Ch 1 Content
 Concept, Definitions and Need of Big Data,
 Data Mining,
 Business Intelligence.
 Data Mining Process,
 relation to Business Intelligence techniques.
 Introduction to Data Mining Tasks (Classification,
Clustering, Association Analysis, Anomaly
Detection).
 Concept, Definitions of model, descriptive models,
predictive modeling, basic terminology.
 Real-world data mining applications - Big Data
Analytics in Mobile Environments, Fraud Detection
and Prevention with Data Mining Techniques, Big
Data Analytics in Business Environments
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems,
Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data archeology,
data dredging, information harvesting, business intelligence,
etc.
 Watch out: Is everything “data mining”? No. The following are not data mining:
 Simple search and query processing
 (Deductive) expert systems
Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Example: A Web Mining Framework
 Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored in a knowledge base
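The steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the page-visit records, the field names, and the helper functions `clean`, `integrate`, `select`, and `mine` are hypothetical and do not come from the slides, and the "mining" step is reduced to simple counting.

```python
from collections import Counter

# Hypothetical raw page-visit logs from two sources (assumed data).
source_a = [{"user": "u1", "page": "/home"}, {"user": "u2", "page": None}]
source_b = [{"user": "u1", "page": "/cart"}, {"user": "u3", "page": "/home"}]

def clean(records):
    """Data cleaning: drop records with a missing page."""
    return [r for r in records if r["page"] is not None]

def integrate(*sources):
    """Data integration: merge records from multiple sources."""
    return [r for src in sources for r in src]

def select(records, pages):
    """Data selection: keep only the task-relevant pages."""
    return [r for r in records if r["page"] in pages]

def mine(records):
    """Data mining (simplified here to counting visits per page)."""
    return Counter(r["page"] for r in records)

warehouse = integrate(clean(source_a), clean(source_b))  # warehousing
relevant = select(warehouse, {"/home", "/cart"})         # selection
patterns = mine(relevant)                                # mining
print(patterns.most_common())                            # presentation
```

Each function corresponds to one step of the framework, so a step can be replaced (for example, a real mining algorithm instead of counting) without touching the others.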
Data Mining in Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom; the potential to support business decisions increases towards the top, and the role associated with a layer is given in parentheses]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Example: Mining vs. Data Exploration
 Business intelligence view
 Warehouse, data cube, reporting but not much mining
 Business objects vs. data mining tools
 Supply chain example: tools
 Data presentation
 Exploration
KDD Process: A Typical View from ML and Statistics
 This is a view from typical machine learning and statistics communities
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
 Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
 Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
 Post-Processing: pattern evaluation, pattern selection, pattern interpretation
Example: Medical Data Mining
 Health care and medical data mining often adopt this view from statistics and machine learning
 Preprocessing of the data (including feature extraction and dimension reduction)
 Classification and/or clustering processes
 Post-processing for presentation
Need of Big Data
• Big Data is a term for a collection of data that is huge in size and still growing exponentially with time.
• Examples of Big Data sources include stock exchanges, social media sites, jet engines, etc.
• Big Data can be 1) structured, 2) unstructured, or 3) semi-structured.
• Volume, Variety, Velocity, and Variability are a few Big Data ...
• Its size and complexity are so great that traditional data management tools cannot store or process it efficiently. Big Data is still data, only at a huge scale.
August 18, Data Mining: Concepts and 1

Need of Big Data
 What is Data?
 The quantities, characters, or symbols on which
operations are performed by a computer, which may be
stored and transmitted in the form of electrical signals
and recorded on magnetic, optical, or mechanical
recording media.
 Big Data analytics helps retailers, from traditional stores to e-commerce, understand customer behaviour and recommend products that match customer interests. It also helps them develop new and improved products. In short, Big Data helps companies make informed decisions and understand what their customers want.

Business Intelligence
 Business intelligence (BI) is an umbrella term for the
technology that enables data preparation, data
mining, data management, and data visualization.
 BI is most effective when it combines data derived
from the market in which a company operates (external
data) with data from company sources internal to the
business such as financial and operations data (internal
data).

Business Intelligence
 Business intelligence (BI) comprises the strategies and technologies used by enterprises for data analysis and the management of business information.
Common functions of business intelligence
technologies include
1. reporting,
2. online analytical processing,
3. analytics,
4. dashboard development,
5. data mining, process mining,
6. complex event processing,
7. business performance management, benchmarking, text
mining, predictive analytics, and prescriptive analytics.
Business Intelligence: Process
 BI processes include:
 Data mining: Using databases, statistics, and machine learning to uncover trends in large datasets.
 Reporting: Sharing data analysis with stakeholders so they can draw conclusions and make decisions.
 Performance metrics and benchmarking: Comparing current performance data to historical data to track performance against goals, typically using customized dashboards.
 Descriptive analytics: Using preliminary data analysis to find out what happened.
 Querying: Asking the data specific questions, with BI pulling the answers from the datasets.
 Statistical analysis: Taking the results from descriptive analytics and exploring the data further with statistics, for example to find out how and why a trend happened.
 Data visualization: Turning data analysis into visual representations such as charts, graphs, and histograms so that the data is easier to consume.
 Visual analysis: Exploring data through visual storytelling to communicate insights on the fly and stay in the flow of analysis.
 Data preparation: Compiling multiple data sources, identifying the dimensions and measurements, and preparing the data for analysis.
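Three of the processes above (querying, descriptive analytics, and benchmarking against goals) can be sketched in a few lines. The sales records, the regional goals, and the function names are hypothetical, chosen only to illustrate the ideas; a real BI tool would run such queries against a data warehouse.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records (assumed data, not from the slides).
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "North", "amount": 80.0},
    {"region": "South", "amount": 200.0},
    {"region": "South", "amount": 100.0},
    {"region": "South", "amount": 150.0},
]

def total_by_region(records):
    """Querying: total sales per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

def average_sale(records):
    """Descriptive analytics: average sale amount."""
    return mean(r["amount"] for r in records)

def gap_to_goal(totals, goals):
    """Benchmarking: actual total minus the (assumed) goal per region."""
    return {region: totals[region] - goals[region] for region in goals}

totals = total_by_region(sales)
gaps = gap_to_goal(totals, {"North": 250.0, "South": 400.0})
print(totals, average_sale(sales), gaps)
```

A negative gap means the region is below its goal; a dashboard would typically visualize exactly these numbers.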
Business Intelligence Techniques
 Business intelligence techniques help to understand trends and identify patterns in big data. In the digital world, modern businesses generate big data on a daily basis. Recent advances in technology allow companies to store and process big data effectively and to derive data-driven decisions and insights.

Example: Business Intelligence

Data Mining Tasks
 Classification
 Clustering
 Association Analysis
 Anomaly Detection
Data mining uses sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms such as neural networks or decision trees. Thus, data mining covers both analysis and prediction.

Classification
 Classification is used to obtain important and relevant information about data and metadata by assigning data to different classes.
 It is a common data mining technique that separates data points into classes. It can organize data sets of all sorts, from small and simple ones to large and complex ones.
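As a concrete illustration of separating data points into classes, here is a minimal k-nearest-neighbours classifier. k-NN is only one of many classification algorithms and is chosen here for brevity; the weather readings and the function name `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, point, k=3):
    """Classify `point` by majority vote among its k nearest labelled neighbours."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (humidity %, temperature in C) -> weather class.
train = [
    ((85, 22), "rain"), ((90, 20), "rain"), ((80, 24), "rain"),
    ((30, 30), "dry"), ((25, 33), "dry"), ((35, 28), "dry"),
]

print(knn_classify(train, (82, 23)))  # a humid, mild day
print(knn_classify(train, (28, 31)))  # a dry, hot day
```

The classes ("rain", "dry") are fixed in advance and the training examples carry labels; this is what distinguishes classification from clustering.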

Applications of Classification in Data Mining
 Classification algorithms are used in many areas of day-to-day life. The following are the most common examples:
 Marketers use classification algorithms for audience segmentation. They classify their target audiences into categories in order to devise more accurate and effective marketing strategies.
 Meteorologists use these algorithms to predict weather conditions from parameters such as humidity and temperature.
 Public health experts use classifiers to predict the risk of various diseases and to create strategies to mitigate their spread.
 Financial institutions use classification algorithms to identify likely defaulters and so decide whose cards and loans to approve. The same algorithms also help them detect fraud.

Data mining systems can be classified by different criteria
i. By the type of data source mined:
The system is classified by the type of data it handles, for example multimedia, spatial data, text data, time-series data, or the World Wide Web.
ii. By the database involved:
The system is classified by the data model involved, for example object-oriented, transactional, or relational databases.
iii. By the kind of knowledge discovered:
The system is classified by the type of knowledge discovered, that is, by its data mining functionality, for example discrimination, classification, clustering, or characterization. Some systems are comprehensive and offer several data mining functionalities together.
iv. By the data mining technique used:
The system is classified by the data analysis approach it uses, such as neural networks, machine learning, genetic algorithms, visualization, statistics, or data warehouse-oriented and database-oriented methods.
The classification can also take into account the level of user interaction involved in the data mining procedure, as in query-driven systems, autonomous systems, and interactive exploratory systems.

Classification problems are present in every industry
 Email spam filtering is a good example of the need for classification in data mining.
 The goal of the classification algorithm in this application is to decide whether an email is spam, and therefore whether it should be redirected to the junk folder.
 Another application of classification is recognizing handwritten digits. The goal in this use case is to identify each digit from 0 to 9 correctly.
 A further use of classification is image segmentation, which is considerably more complex than the previous examples.
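The spam example can be sketched with a tiny naive Bayes classifier. This is a minimal sketch, not a production filter: naive Bayes is one common choice for this task, and the training messages, the labels "spam" and "ham", and the function names are all assumptions made for the illustration.

```python
import math
from collections import Counter

def train_nb(messages):
    """Count words per class for a multinomial naive Bayes classifier."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def predict_nb(model, text):
    """Return the class with the higher log-probability (Laplace smoothing)."""
    word_counts, class_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_msgs = sum(class_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_msgs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labelled emails (assumed data).
messages = [
    ("win a free prize now", "spam"),
    ("free money win now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_nb(messages)
print(predict_nb(model, "free prize"))
print(predict_nb(model, "monday meeting"))
```

An email classified as "spam" would be redirected to the junk folder; everything else stays in the inbox.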
Clustering
 Clustering is the division of data into groups of similar objects. Representing the data by a few clusters inevitably loses some fine detail, but it achieves simplification. The data is modeled by its clusters. From a historical point of view, this form of data modeling is rooted in statistics, mathematics, and numerical analysis.
 From a practical point of view, clustering plays an important role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and many more.

Clustering
 Clustering is the process of grouping a set of abstract objects into classes of similar objects.
 Applications of cluster analysis:
• It is widely used in applications such as image processing, data analysis, and pattern recognition.
• It helps marketers find distinct groups in their customer base and characterize those groups by their purchasing patterns.
• It is used in biology to derive animal and plant taxonomies and to identify genes with similar functionality.
• It helps in information discovery by grouping documents on the web.
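The customer-grouping application can be illustrated with a very small k-means sketch. k-means is one widely used clustering algorithm, swapped in here as an example; the customer data is hypothetical, and using the first k points as initial centroids is a simplifying assumption (real implementations usually pick them at random or with a seeding heuristic).

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means: the first k points serve as initial centroids (an assumption)."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (9, 95), (2, 12), (10, 100), (1, 14), (8, 90)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
print(clusters)
```

Unlike classification, no labels are supplied: the two customer groups emerge from the data alone.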

Association Rules
 This data mining technique discovers links between two or more items. It finds hidden patterns in the data set.
 Association rules are if-then statements that show the probability of relationships between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to find sales correlations in transactional data or in medical data sets.
 The algorithm starts from a collection of records, for example a list of the grocery items you have bought over the last six months, and calculates the percentage of times that items are purchased together.
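The "percentage of items purchased together" is usually called support, and the strength of an if-then rule is usually measured by confidence. The sketch below computes both for a handful of hypothetical grocery baskets; the basket contents and function names are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (assumed data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) for the rule 'if antecedent then consequent'."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# How often each pair of items appears in the same basket.
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

print(support({"bread", "milk"}, baskets))       # share of baskets with both items
print(confidence({"butter"}, {"bread"}, baskets))  # rule: if butter then bread
print(pair_counts.most_common(2))
```

Rules with both high support and high confidence are the ones typically reported to the analyst.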

Anomaly Detection/Outlier Analysis
 Anomaly detection is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior. Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities, such as a change in consumer behavior.
 What are Outliers?
 Outliers are an integral part of data analysis. An outlier is an observation that lies at an abnormal distance from the other observations.
 An outlier can be important because it may indicate an error in the experiment. Outliers are also used extensively in areas such as fraud detection and spotting potential new market trends.
 What is Outlier Analysis?
 Outlier analysis is the process of identifying abnormal or atypical observations in a data set.
 Uses of outlier analysis in Data Mining
 Outlier analysis is used in Data Mining for purposes such as the following:
a. Identifying fraud in the banking sector, such as credit card hacking and similar fraud.
b. Observing changes in a customer's buying patterns.
c. Identifying typing errors and reporting errors made by humans.
d. Discovering errors or faults in machines or systems.

 Applications of Outlier Detection in Data Mining
 Outlier detection is used extensively in Data Mining to find observations that break the usual patterns or trends in the data. Its applications include:
a. Fraud Detection
b. Telecom Fraud Detection
c. Intrusion Detection in Cyber Security
d. Medical Analysis
e. Environment Monitoring, for events such as cyclones, tsunamis, floods, and droughts
f. Noticing unforeseen entries in Databases
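A simple way to flag outliers is the z-score rule: mark any value that lies more than a chosen number of standard deviations from the mean. This is a minimal sketch of one common technique, not the only one; the transaction amounts, the threshold of 2.0, and the function name are assumptions for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts (assumed data).
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 50, 250]
print(zscore_outliers(amounts))
```

In a fraud detection setting, the flagged transactions would be passed on for manual review rather than rejected automatically, since an outlier may also be a legitimate but unusual purchase.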

Data Mining Models
 Data mining algorithms can be described as consisting of three parts.
 Model – The purpose of the algorithm is to fit a model to the data.
 Preference – Some criterion must be used to prefer one model over another.
 Search – Every algorithm needs some technique to search the data.
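The three parts can be made concrete with a deliberately simple sketch. The model is a line through the origin, the preference criterion is total squared error, and the search is a scan over candidate slopes. The data points and all names here are hypothetical; real algorithms use far more efficient search strategies than this exhaustive scan.

```python
# Hypothetical observations (x, y), roughly following y = 2x (assumed data).
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

def model(slope, x):
    """Model: a line through the origin, y = slope * x."""
    return slope * x

def squared_error(slope):
    """Preference: a lower total squared error means a better fit."""
    return sum((y - model(slope, x)) ** 2 for x, y in data)

def search(candidates):
    """Search: try each candidate slope and keep the preferred one."""
    return min(candidates, key=squared_error)

candidates = [i / 10 for i in range(0, 41)]  # slopes 0.0, 0.1, ..., 4.0
best = search(candidates)
print(best)
```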
 Types of Data Mining Models –
1. Predictive Models
2. Descriptive Models

Types of Models in Data Mining

Predictive Model
 A predictive model makes predictions about data values using known results found from other data. Predictive modelling is typically based on historical data. Predictive data mining tasks include regression, time series analysis, classification, and prediction.
 Predictive modelling is also known as statistical regression. It is a supervised learning technique that explains how some attribute values of an item depend on the values of its other attributes, and builds a model that can predict these attribute values for new cases.
• Classification –
It is the act of assigning objects to one of several predefined categories. Equivalently, classification is the task of learning a target function that maps each attribute set to a predefined class label.
• Regression –
It is used to predict continuous (numeric) values. It is a technique that fits data values to a function.
There are two types of regression –
1. Linear Regression searches for the optimal line to fit two attributes, so that one attribute can be used to predict the other.
2. Multiple Linear Regression involves two or more predictor attributes, and the data are fit to a multidimensional space.
• Time Series Analysis –
A time series is a set of data ordered in time. In time series analysis, time serves as the independent variable used to estimate the dependent variable.
• Prediction –
It predicts some missing or unknown values.
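Linear regression with two attributes can be sketched with the standard least-squares formulas. The advertising and sales figures below are hypothetical and chosen to lie exactly on a line so the result is easy to check; `fit_line` is an illustrative name, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend vs. sales (exactly sales = 2 * spend + 1).
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(spend, sales)
predicted = intercept + slope * 6  # predict sales for an unseen spend of 6
print(slope, intercept, predicted)
```

The fitted line is the model; applying it to a new spend value is the prediction step described above.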

Descriptive Model
 A descriptive model identifies relationships or patterns in data. Unlike a predictive model, a descriptive model serves as a way to explore the properties of the data being examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are descriptive data mining tasks.
 Descriptive analytics concentrates on summarizing the data and converting it into meaningful information for monitoring and reporting.
• Clustering –
It is the technique of grouping a set of abstract objects into classes of similar objects.
• Summarization –
It presents a set of data in a more compact, easy-to-understand form.
• Association Rules –
They find interesting associations or correlations among a large set of data objects.
• Sequence Discovery –
It is the discovery of interesting sequential patterns in the data, where interestingness is judged by some objective or subjective measure.


Big Data Analytics in Business Environment
 The benefits of Big Data Analytics and tools are –

• Data accumulation from multiple sources, including the Internet, social
media platforms, online shopping sites, company databases, external third-
party sources, etc.
• Real-time forecasting and monitoring of business as well as the market.
• Identify crucial points hidden within large datasets to influence business
decisions.
• Promptly mitigate risks by optimizing complex decisions for unforeseen
events and potential threats.
• Identify issues in systems and business processes in real-time.
• Unlock the true potential of data-driven marketing.
• Dig into customer data to create tailor-made products, services, offers,
discounts, etc.
• Facilitate speedy delivery of products/services that meet and exceed client
expectations.
• Diversify revenue streams to boost company profits and ROI.
• Respond to customer requests, grievances, and queries in real-time.
• Foster innovation of new business strategies, products, and services.
