Intelligence(MDS204)
Arti yadav
Einfach Bussiness Analytics pvt ltd.
Introduction
The distinction between data mining and knowledge discovery is largely one
of timing.
Data mining is the process by which substantial amounts of data are
organized, normalized, tabulated, and categorized; in short, it is analyzing
large databases in order to generate additional information.
Knowledge discovery, however, can be associated with specific context (e.g.,
can be guided by the vernacular of a particular specialty, organization, or
practice), making it both quantitative and qualitative. Knowledge can—and
should—be viewed as having a personality.
Knowledge discovery
Some people don’t differentiate data mining from knowledge discovery while
others view data mining as an essential step in the process of knowledge discovery.
Here is the list of steps involved in the knowledge discovery process −
Data Cleaning − In this step, noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved
from the database.
Data Transformation − In this step, data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations.
Data Mining − In this step, intelligent methods are applied in order to extract
data patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, the discovered knowledge is presented to the user.
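The seven steps listed above can be sketched as a chain of small functions. This is only a minimal illustration in Python, not a reference implementation: the two sample sources, the field names, and the frequency-count "mining" step are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records from two sources (assumed for illustration).
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c1", "item": "bread"}, {"customer": "c3", "item": "milk"}]

def clean(records):
    # Data cleaning: drop records whose item is missing.
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    # Data integration: combine several sources into one collection.
    return [r for source in sources for r in source]

def select(records):
    # Data selection: keep only the field relevant to the analysis.
    return [r["item"] for r in records]

def transform(items):
    # Data transformation: aggregate into item counts.
    return Counter(items)

def mine(counts, min_count):
    # Data mining (toy version): items that occur at least min_count times.
    return {item: n for item, n in counts.items() if n >= min_count}

def evaluate(patterns):
    # Pattern evaluation: rank patterns by count, highest first.
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)

def present(ranked):
    # Knowledge presentation: render the result as readable lines.
    return [f"{item}: bought {n} times" for item, n in ranked]

records = integrate(clean(store_a), clean(store_b))
report = present(evaluate(mine(transform(select(records)), min_count=2)))
```

Each function stands in for a whole family of techniques; a real system would replace the toy mining step with an actual mining algorithm.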
The following diagram shows the process
of knowledge discovery −
An Outline of the Steps of the KDD Process
Knowledge Discovery Process (KDP)
4 Data Transformation -
In Data Transformation, data are transformed into forms appropriate for mining
by performing summary or aggregation operations.
5 Data Mining -
In Data Mining, data mining methods (algorithms) are applied in order to
extract data patterns.
6 Pattern Evaluation -
In Pattern Evaluation, interesting data patterns are identified based on
interestingness measures.
7 Knowledge Presentation -
In Knowledge Presentation, the mined knowledge is presented to the user using
knowledge representation techniques.
KDD process
Steps Involved in KDD Process:
Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant
data from the collection.
Cleaning in case of missing values.
Cleaning noisy data, where noise is random error or variance in a measured variable.
Cleaning with data discrepancy detection and data transformation tools.
Data Integration: Data integration is defined as combining heterogeneous data
from multiple sources into a common store (data warehouse).
Data integration using data migration tools.
Data integration using data synchronization tools.
Data integration using the ETL (Extract, Transform, Load) process.
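For the data cleaning step described above, two common approaches can be sketched in Python: filling missing values with the mean of the observed values, and smoothing noisy values by bin means. The slides do not name these particular methods, so treat them as one possible choice; the sample values are made up.

```python
from statistics import mean

def fill_missing(values):
    # Replace each None with the mean of the observed (non-missing) values.
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    # Sort the values, split them into bins of bin_size,
    # and replace every value by the mean of its bin.
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

ages = [20, None, 40, 30]                        # hypothetical data with a gap
filled = fill_missing(ages)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]      # hypothetical noisy data
smoothed = smooth_by_bin_means(prices, bin_size=3)
```

Mean imputation is simple but can distort the distribution when many values are missing; it is used here only because it is easy to follow.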
Data Selection: Data selection is defined as the process where data relevant
to the analysis is decided and retrieved from the data collection.
Data selection using neural networks.
Data selection using decision trees.
Data selection using Naive Bayes.
Data selection using clustering, regression, etc.
Data Transformation: Data transformation is defined as the process of
transforming data into the form required by the mining procedure.
Data transformation is a two-step process:
Data Mapping: Assigning elements from the source to the destination to capture
transformations.
Code Generation: Creation of the actual transformation program.
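The two-step view of data transformation (mapping, then code generation) can be illustrated as follows. In this sketch the mapping is a plain dictionary and "code generation" simply means building a function from it; the field names and converters are hypothetical.

```python
# Data mapping: source field -> (destination field, conversion function).
# All field names here are assumed for illustration.
mapping = {
    "cust_name": ("customer", str.strip),
    "amt": ("amount", float),
}

def generate_transform(mapping):
    # Code generation (simplified): build the transformation program
    # as a function derived from the mapping.
    def transform(row):
        return {dest: convert(row[src]) for src, (dest, convert) in mapping.items()}
    return transform

transform = generate_transform(mapping)
row = transform({"cust_name": " Asha ", "amt": "12.50"})
```

Keeping the mapping as data means it can be reviewed or changed without touching the transformation logic.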
Data Mining: Data mining is defined as the application of intelligent techniques
to extract potentially useful patterns.
Transforms task-relevant data into patterns.
Decides the purpose of the model using classification or characterization.
Pattern Evaluation: Pattern evaluation is defined as identifying the truly
interesting patterns that represent knowledge, based on given measures.
Finds the interestingness score of each pattern.
Uses summarization and visualization to make the data understandable to the user.
Knowledge Representation: Knowledge representation is defined as a
technique that uses visualization tools to present data mining results.
Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.
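Pattern evaluation and knowledge representation can be sketched together: filter patterns by an interestingness threshold, then render the survivors as a table. The rules and their measure values below are invented for the example, and using support and confidence as the measures is an assumption, since the slides do not say which measures are meant.

```python
# Hypothetical mined rules with pre-computed measures (values are made up).
rules = [
    {"rule": "bread -> butter", "support": 0.30, "confidence": 0.80},
    {"rule": "milk -> eggs", "support": 0.05, "confidence": 0.90},
    {"rule": "tea -> sugar", "support": 0.25, "confidence": 0.40},
]

def interesting(rules, min_support, min_confidence):
    # Pattern evaluation: keep rules that pass both thresholds,
    # ordered by confidence, highest first.
    kept = [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)

def as_table(rules):
    # Knowledge representation: a fixed-width text table with a header row.
    lines = [f"{'rule':<20}{'support':>8}{'confidence':>12}"]
    for r in rules:
        lines.append(f"{r['rule']:<20}{r['support']:>8.2f}{r['confidence']:>12.2f}")
    return "\n".join(lines)

selected = interesting(rules, min_support=0.10, min_confidence=0.50)
table = as_table(selected)
```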
Why we need Data Mining?
Data mining
Data mining architecture has several elements: Data Warehouse, Data
Mining Engine, Pattern Evaluation, User Interface, and Knowledge Base.
Data Warehouse:
A data warehouse is a repository that stores information collected from multiple
sources under a unified schema. Information stored in a data warehouse is
critical to an organization's decision-making.
Data Mining Engine:
Data Mining Engine is the core component of data mining process which
consists of various modules that are used to perform various tasks like
clustering, classification, prediction and correlation analysis.
Pattern Evaluation:
Pattern Evaluation is responsible for finding interesting patterns with the help of
the Data Mining Engine.
User Interface:
The User Interface provides communication between the user and the data mining system.
It allows the user to use the system easily even without detailed
knowledge of the system.
Knowledge Base:
The Knowledge Base holds domain knowledge that is important to the data
mining process. It provides input to the data mining engine and guides
the engine in its search for patterns.
Data Mining Architecture
Data Mining Techniques
1. Association Technique
The association technique finds patterns in large data sets based on
a relationship between two or more items in the same transaction. It is
used in market basket analysis, that is, to analyze
people's buying habits.
For example, you might find that a customer always buys ice cream
whenever he comes to watch a movie, so when that customer
next comes to watch a movie, he is likely to want ice cream
again.
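The movie and ice cream example can be made concrete with the two standard measures for association rules, support and confidence. The transactions below are hypothetical, and this sketch only scores one rule rather than searching for all of them.

```python
# Hypothetical transactions: each set is what one customer bought in one visit.
transactions = [
    {"movie ticket", "ice cream"},
    {"movie ticket", "ice cream", "popcorn"},
    {"movie ticket", "popcorn"},
    {"ice cream"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent,
    # the fraction that also contain the consequent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

s = support({"movie ticket", "ice cream"}, transactions)
c = confidence({"movie ticket"}, {"ice cream"}, transactions)
```

Here two of the four visits contain both items, and two of the three movie visits include ice cream, so the rule "movie ticket implies ice cream" has support 0.5 and confidence of about 0.67.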
2. Classification Technique
3. Clustering Technique
Clustering is one of the oldest techniques used in data mining.
The main aim of the clustering technique is to form clusters (groups) from pieces
of data that share common characteristics. Clustering helps to
identify the differences and similarities between the data.
Take the example of a shop in which many items are for sale. The
challenge is how to arrange those items so that a customer can easily
find the required item. By using the clustering technique, you can keep
items that are similar to one another in one corner and items that share
other similarities in another corner.
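As one way to picture the grouping of similar items, here is a minimal k-means sketch on one-dimensional data. The slides do not name a specific clustering algorithm, so k-means is an assumption, and the item prices and starting centers are invented.

```python
def k_means_1d(values, centers, iterations=10):
    # Plain k-means on one-dimensional data, starting from the given centers.
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

prices = [1, 2, 3, 50, 52, 54]            # hypothetical item prices
centers, clusters = k_means_1d(prices, centers=[1, 54])
```

With these values the cheap items end up in one group and the expensive items in the other, which mirrors the "one corner per group" idea in the shop example.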
4. Sequential patterns
Sequential pattern analysis is a useful method for identifying trends and
recurring patterns.
For example, if customer data shows that a customer buys a particular
product at a particular time of year, you can use this information to suggest
that product to the customer at that time of year.
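The seasonal purchase example can be approximated with a simple recurrence count: find products that a customer bought in the same month in several different years. This is a simplified stand-in rather than a full sequential pattern mining algorithm, and the purchase history is hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase history: (customer, product, year, month).
purchases = [
    ("c1", "sweater", 2021, 11),
    ("c1", "sweater", 2022, 11),
    ("c1", "sweater", 2023, 11),
    ("c1", "sunscreen", 2022, 5),
    ("c2", "sunscreen", 2021, 5),
    ("c2", "sunscreen", 2023, 5),
]

def seasonal_habits(purchases, min_years):
    # Collect the distinct years in which each (customer, product, month) occurred,
    # then keep the combinations seen in at least min_years different years.
    years = defaultdict(set)
    for customer, product, year, month in purchases:
        years[(customer, product, month)].add(year)
    return sorted(key for key, seen in years.items() if len(seen) >= min_years)

habits = seasonal_habits(purchases, min_years=2)
```

Each entry in `habits` is a candidate for a timed suggestion, for example offering sweaters to customer c1 in November.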
5. Decision tree
The decision tree is one of the most commonly used data mining techniques
because its model is easy for users to understand. In a decision tree you start
with a simple question that has two or more answers. Each answer leads to
further questions, which help us reach a final decision. The root
node of a decision tree is a simple question.
Take the example of a flood warning system.
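A flood warning tree of this kind might look like the sketch below. The questions, thresholds, and alert names are illustrative assumptions, not real guidance, and the tree is written by hand rather than learned from data. For simplicity every question has only yes/no answers.

```python
# A hand-built tree for a hypothetical flood warning system.
# Inner nodes hold a question; leaves are the final decision (a string).
tree = {
    "question": lambda obs: obs["rain_mm_24h"] > 100,      # root question
    "yes": {
        "question": lambda obs: obs["river_level_m"] > 5,
        "yes": "issue flood warning",
        "no": "issue flood watch",
    },
    "no": "no alert",
}

def decide(node, obs):
    # Walk from the root, following the answer to each question, until a leaf.
    while isinstance(node, dict):
        node = node["yes"] if node["question"](obs) else node["no"]
    return node

decision = decide(tree, {"rain_mm_24h": 120, "river_level_m": 6})
```

In practice the tree structure and thresholds would be learned from historical data by a tree-induction algorithm instead of being written out manually.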
Few more techniques
Data Mining Applications
1. Data mining applications in Marketing:
The data mining process extracts information from various data sources that is
very useful in planning, organizing, managing, and launching
a new product in a cost-effective way. Data mining techniques help us
understand the purchase behavior of a buyer, such as how frequently the customer
purchases an item, the total value of all purchases, and when the last purchase was made.
With data mining you can understand the needs of buyers and design products
and services according to their requirements.
Database marketing is one of the most popular applications of data mining.
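The three purchase measures mentioned above (how often a customer buys, the total value of purchases, and when the last purchase happened) are often summarized as frequency, monetary value, and recency. A minimal sketch, assuming a hypothetical purchase log and reference date:

```python
from datetime import date

# Hypothetical purchase log: (customer, purchase date, amount).
orders = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 1), 60.0),
    ("c2", date(2023, 12, 20), 25.0),
]

def rfm(orders, today):
    # Accumulate last purchase date, purchase count, and total spend per customer.
    summary = {}
    for customer, day, amount in orders:
        s = summary.setdefault(customer, {"last": day, "frequency": 0, "monetary": 0.0})
        s["last"] = max(s["last"], day)
        s["frequency"] += 1
        s["monetary"] += amount
    # Convert the last purchase date into days since that purchase.
    return {
        c: {
            "recency_days": (today - s["last"]).days,
            "frequency": s["frequency"],
            "monetary": s["monetary"],
        }
        for c, s in summary.items()
    }

scores = rfm(orders, today=date(2024, 3, 31))
```

A marketing team could then rank customers on these three numbers, for example to decide who receives a new product offer first.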
2. Data mining applications in HealthCare:
Data mining can be very useful for improving the healthcare system. With data
mining you can predict the number of patients, which helps ensure that
every patient receives proper care at the right time and in the right place.
Data mining can help all parties involved in the healthcare industry. For
example, healthcare insurers can detect fraud and abuse,
healthcare organizations can improve their decision making by using
knowledge provided by data mining, and patients can receive better and more
affordable healthcare services.
3. Data mining applications in Education:
4. Data mining applications in Retail Industry:
The retail industry collects large amounts of data on sales and customer shopping
history. Retail data mining helps in analyzing client behavior and customer buying
patterns and trends, which leads to better customer service, higher customer
satisfaction, and lower business costs.
5. Data mining applications in Banking:
The banking industry has benefited greatly from advances in digital
technology. Data mining is becoming a strategically important area for many
business organizations, including the banking sector.
Data mining is used in the financial and banking sector for credit analysis,
detecting fraudulent transactions, cash management, and predicting payments.
Advantage of Data Mining:
Disadvantages of Data Mining
Data warehousing
A data warehouse is a subject-oriented, integrated, non-volatile, time-variant
collection of data in support of management’s decisions.
Data warehousing (DW) is the process of collecting and managing data from
varied sources to provide meaningful business insights. A data warehouse is
typically used to connect and analyze business data from heterogeneous
sources. The data warehouse is the core of the BI system, which is built for
data analysis and reporting.
It is a blend of technologies and components that aids the strategic use of
data. It is the electronic storage of a large amount of information by a business,
designed for query and analysis instead of transaction processing. It is
a process of transforming data into information and making it available to
users in a timely manner.
Data Warehouse Features
Information processing, analytical processing, and data mining are the three
types of data warehouse applications that are discussed below −
Information Processing − A data warehouse allows the data stored
in it to be processed. The data can be processed by means of querying, basic statistical
analysis, and reporting using crosstabs, tables, charts, or graphs.
Analytical Processing − A data warehouse supports analytical processing of
the information stored in it. The data can be analyzed by means of basic OLAP
operations, including slice-and-dice, drill down, drill up, and pivoting.
Data Mining − Data mining supports knowledge discovery by finding hidden
patterns and associations, constructing analytical models, performing
classification and prediction. These mining results can be presented using the
visualization tools.
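The basic OLAP operations mentioned under analytical processing can be sketched on a tiny in-memory fact table. The rows and dimension names are hypothetical, and a real warehouse would run these operations in an OLAP engine or in SQL rather than in plain Python.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
facts = [
    (2023, "north", "tea", 10),
    (2023, "south", "tea", 20),
    (2024, "north", "tea", 30),
    (2024, "north", "coffee", 40),
]

def slice_(facts, year):
    # Slice: fix one dimension (here the year) to a single value.
    return [f for f in facts if f[0] == year]

def roll_up(facts, dims):
    # Aggregate sales over the chosen dimension positions.
    # Fewer dimensions gives a coarser view (drill up);
    # more dimensions gives a finer view (drill down).
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[3]
    return dict(totals)

by_year = roll_up(facts, dims=[0])                      # drill up to year only
by_year_region = roll_up(facts, dims=[0, 1])            # drill down to year and region
products_2024 = roll_up(slice_(facts, 2024), dims=[2])  # slice, then group by product
```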
Types of Data Warehouse Models
From the perspective of data warehouse architecture, three main types of Data
Warehouses are:
1. Enterprise Data Warehouse:
Enterprise Data Warehouse is a centralized warehouse. It provides decision support
service across the enterprise. It offers a unified approach for organizing and
representing data. It also provides the ability to classify data according to subject
and to give access according to those divisions.
2. Operational Data Store:
An Operational Data Store (ODS) is a data store required
when neither the data warehouse nor the OLTP systems support an organization's reporting
needs. In an ODS, the data is refreshed in real time. Hence, it is widely
preferred for routine activities like storing employee records.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular
line of business, such as sales or finance. In an independent data
mart, data can be collected directly from sources.
General stages of Data Warehouse
Earlier, organizations started with relatively simple uses of data warehousing.
Over time, more sophisticated uses of data warehousing began.
The following are general stages of use of the data warehouse:
Offline Operational Database:
In this stage, data is just copied from an operational system to another
server. In this way, loading, processing, and reporting of the copied data do
not impact the operational system's performance.
Offline Data Warehouse:
Data in the data warehouse is regularly updated from the operational
database. The data in the data warehouse is mapped and transformed to meet the
data warehouse objectives.
Components of Data warehouse
Data warehouse Architecture
Steps to Implement Data Warehouse
The best way to address the business risk associated with a data warehouse
implementation is to employ a three-prong strategy, as below:
Enterprise strategy: Here we identify technical aspects, including the current
architecture and tools. We also identify facts, dimensions, and attributes.
Data mapping and transformation are also addressed.
Phased delivery: Data warehouse implementation should be phased based on
subject areas. Related business entities like booking and billing should first be
implemented and then integrated with each other.
Iterative prototyping: Rather than a big bang approach to implementation,
the data warehouse should be developed and tested iteratively.
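The facts, dimensions, and attributes identified under the enterprise strategy are commonly laid out as a fact table joined to dimension tables. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names, columns, and rows are hypothetical and chosen to echo the billing subject area.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes of a customer.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT
    );
    -- Fact table: one row per billing event, keyed to the dimension.
    CREATE TABLE fact_billing (
        billing_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Asha', 'Pune'), (2, 'Ravi', 'Delhi');
    INSERT INTO fact_billing VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0);
""")

# Analytical query: total billed amount per city.
rows = conn.execute("""
    SELECT d.city, SUM(f.amount)
    FROM fact_billing AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
```

Starting with one subject area like this and adding further fact tables later fits the phased, iterative approach described above.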
Best practices to implement a Data Warehouse
Decide on a plan to test the consistency, accuracy, and integrity of the data.
The data warehouse must be well integrated, well defined, and time stamped.
While designing a data warehouse, make sure you use the right tool, stick to the life
cycle, take care of data conflicts, and be ready to learn from your mistakes.
Never replace operational systems and reports.
Don't spend too much time on extracting, cleaning, and loading data.
Be sure to involve all stakeholders, including business personnel, in the data
warehouse implementation process. Establish that data warehousing is a joint
team project. You don't want to create a data warehouse that is not useful to the
end users.
Prepare a training plan for the end users.
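A plan for testing consistency, accuracy, and integrity could start with a few automated checks like the ones sketched here. The tables, rules, and sample rows are assumptions for illustration; a real plan would cover many more rules.

```python
# Hypothetical warehouse rows (assumed for illustration).
customers = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
orders = [
    {"id": 10, "customer_id": 1, "amount": 25.0},
    {"id": 11, "customer_id": 3, "amount": 40.0},   # refers to an unknown customer
    {"id": 11, "customer_id": 2, "amount": -5.0},   # duplicate id, negative amount
]

def check_orders(orders, customers):
    # Return a list of (order id, problem) pairs.
    problems = []
    known = {c["id"] for c in customers}
    seen = set()
    for o in orders:
        if o["id"] in seen:
            problems.append((o["id"], "duplicate id"))         # consistency
        seen.add(o["id"])
        if o["customer_id"] not in known:
            problems.append((o["id"], "unknown customer"))     # integrity
        if o["amount"] < 0:
            problems.append((o["id"], "negative amount"))      # accuracy
    return problems

problems = check_orders(orders, customers)
```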
Data warehouse users
A data warehouse is needed by many types of users, such as:
Decision makers who rely on large amounts of data.
Users who use customized, complex processes to obtain information from
multiple data sources.
People who want simple technology to access the data.
People who want a systematic approach to making
decisions.
Users who need fast performance on a huge amount of data for
reports, grids, or charts.
Users who want to discover 'hidden patterns' in
data flows and groupings, for whom a data warehouse is a first step.
Data Warehouse Application
Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps
government agencies maintain and analyze tax records and health policy
records for every individual.
Investment and Insurance sector:
In this sector, warehouses are primarily used to analyze data patterns and
customer trends, and to track market movements.
Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It
also helps to track items, customer buying patterns, and promotions, and is used for
determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales
decisions and to make distribution decisions.
Hospitality Industry:
This Industry utilizes warehouse services to design as well as estimate their
advertising and promotion campaigns where they want to target clients
based on their feedback and travel patterns.
Advantages of Data Warehouse:
A data warehouse allows business users to quickly access critical data from
multiple sources, all in one place.
A data warehouse provides consistent information on various cross-functional
activities. It also supports ad hoc reporting and queries.
A data warehouse helps to integrate many sources of data to reduce stress on
the production system.
A data warehouse helps to reduce total turnaround time for analysis and
reporting.
Restructuring and integration make the data easier for the user to use for reporting
and analysis.
Because critical data from a number of sources is available in a single place,
users save the time otherwise spent retrieving data
from multiple sources.
A data warehouse stores a large amount of historical data. This helps users to
analyze different time periods and trends to make future predictions.
Disadvantages of Data Warehouse:
Not an ideal option for unstructured data.
Creation and implementation of a data warehouse is a time-consuming affair.
A data warehouse can become outdated relatively quickly.
It is difficult to make changes in data types and ranges, data source schema,
indexes, and queries.
A data warehouse may seem easy, but it is often too complex for
average users.
Despite best efforts at project management, data warehousing project scope
tends to increase.
Sometimes warehouse users will develop different business rules.
Organisations need to spend a lot of their resources on training and
implementation.
The Future of Data Warehousing
Data Warehouse Tools
Many data warehousing tools are available in the market. Here are
some of the most prominent ones:
1. MarkLogic:
MarkLogic is a data warehousing solution that makes data integration
easier and faster using an array of enterprise features. This tool helps to
perform very complex search operations. It can query different types of data
like documents, relationships, and metadata.
2. Oracle:
Oracle is a widely used database. It offers a wide range of data
warehouse solutions, both on-premises and in the cloud. It helps to optimize
customer experiences by increasing operational efficiency.
3. Amazon Redshift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool to
analyze all types of data using standard SQL and existing BI tools. It also allows
running complex queries against petabytes of structured data, using
query optimization.