DATA WAREHOUSING
AND
DATA MINING
Unit No. TITLE
7 Introduction to Data Mining
7.1 Introduction
7.2 Data Mining
7.2.1 Data Mining and Knowledge Discovery
7.2.2 Architecture of a Typical Data Mining System
7.3 Motivating Challenges
7.4 Data Mining Functionalities
7.4.1 Concept/Class Description
7.4.2 Mining Frequent Patterns, Associations and Correlations
7.4.3 Classification and Prediction
7.4.4 Cluster Analysis
7.4.5 Outlier Analysis
7.5 Classification of Data Mining Systems
7.6 Data Mining Task
7.7 Major Issues in Data Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
8 Mining Association Rules
8.1 Introduction
8.2 Association Rule Mining
8.2.1 Association Rules
8.3 Mining Single-Dimensional Boolean Association Rules from Transactional Databases
8.3.1 Different Data Formats for Mining
8.3.2 Apriori Algorithm
8.3.3 Frequent Pattern Growth (FP-growth) Algorithm
8.4 Mining Multilevel Association Rules from Transaction Databases, Relational Databases
8.4.1 Approaches to Mining Multilevel Association Rules
8.5 Application of Association Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
9 Classification and Prediction
9.1 Introduction
9.2 Classification and Prediction
9.3 Issues Regarding Classification and Prediction
9.4 Classification by Decision Tree Induction
9.5 Classification by Bayesian Classification
9.6 Classification by Back Propagation
9.7 Classification Based on Concepts from Association Rule Mining
9.8 Prediction
9.9 Accuracy and Error Measures
9.10 Evaluating Accuracy of Classifier or Predictor
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
10 Mining Complex Types of Data
10.1 Introduction
10.2 Clustering and Outliers
10.2.1 Good Clustering
10.2.2 Measuring Dissimilarity or Similarity in Clustering
10.3 Clustering Techniques
10.4 Multidimensional Analysis-Descriptive Mining of Complex Data Objects
10.5 Mining Spatial Databases
10.6 Mining Multimedia Databases
10.7 Mining Time-Series
10.8 Mining Sequence Data
10.9 Mining Text Databases
10.9.1 Text mining process
10.10 Mining the WWW
10.10.1 Web Structure Mining
10.10.2 Web Usage Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
11 Data Mining Applications and Trends
11.1 Introduction
11.2 Applications of Data Mining
11.3 Data Mining System Products and Research Prototypes
11.3.1 Examples of Commercial Data Mining Systems
11.4 Additional Themes on Data Mining
11.5 Social Impacts of Data Mining
11.6 Trends in Data Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
UNIT 1
Big Data
Structure:
1.1 Data and Big data
1.2 Characteristics of Big data – Vs of Big data
1.3 Types of Big data
1.4 Storage of Big data
1.5 Big data technology
1.6 Big data processing and analyses
1.7 Benefits of Big data
1.8 Applications of Big data in industry
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
● Distinguish between simple data and big data;
● Understand the characteristics of big data; and
● Identify practical scenarios in which big data can be used.
1.1 DATA AND BIG DATA

There are certain "dimensions", commonly summarized as the "3 Vs", that differentiate big data from ordinary data. Big data is not just "more" data. There is a lot of data that is so mixed and unstructured, and that accumulates so quickly, that traditional techniques and methods based on "normal" software (Excel, Crystal Reports or the like) do not really work. Consider Instagram, a quite popular social media website. Statistics show that every day 500+ terabytes of new data are ingested into Instagram's database. This data is mainly generated from photo and video uploads, message exchanges, comments, etc. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight. With tens of thousands of flights per day, the data generated reaches many petabytes.
Big data may contain terabytes (1,024 gigabytes), petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data generated by people and machines through sales and transactions, customer care and call centers, the web, social media, mobile data, satellite data and so on, amounting to billions or trillions of records.
1.2 CHARACTERISTICS OF BIG DATA – Vs OF BIG DATA

According to Gartner, big data is data that contains greater variety and arrives in increasing volumes and with ever-increasing velocity. It is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Therefore, big data is said to consist of three major characteristics – volume, velocity and variety (the 3 Vs). Volume represents the amount of data being accumulated from time to time. Examples are social media websites like Facebook and Twitter, where every minute you can expect incredible accumulation and growth of new data. Velocity is the speed at which data is generated, produced, created, received or refreshed. According to an IBM Marketing Cloud study, 90% of the Internet's data has been created since 2016. Each day, millions of social media users produce new data: about 70 crore tweets, over 400 crore Facebook messages, over 500 crore Facebook likes and over 6 crore Instagram messages are posted, and over 40 lakh hours of content are uploaded to YouTube (https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/). Over 40,000 search queries are processed by Google alone per second. All these examples illustrate the tremendous growth of data. Depending on the requirements and perspectives of different users, it is a big challenge to analyse some or all of such continuously growing data for making useful decisions and taking proper actions. Some theorists and practitioners have gone further by extending the characteristics of big data from 3 Vs to 4 Vs, 5 Vs and even 10 Vs, so as to elaborate the meaning of big data more fully. The 4 Vs add Veracity, and the 5 Vs add Veracity and Value, in addition to the Volume, Velocity and Variety of the 3 Vs. Veracity represents the quality of the data, that is, the cleanliness and accuracy of data without missing data items. Value refers to the ability to transform the huge flow of data into proper usage for making good decisions and taking appropriate actions; it can be measured by the extent of benefit the user is getting.
In some cases, many more Vs are added to extend the meaning of big data. They include Variability (consistency of data in terms of availability or interval of reporting), Viscosity (latency or lag time in the data with respect to the event in context), Virality (spread of data and the frequency of its pick-up by other users or events), Validity (similar to Veracity, ensuring consistent data quality, common definitions and metadata), Vulnerability (tendency for data breaches and other security concerns), Volatility (history or longevity of data for use) and Visualization (the extent to which the data can be visualized at scale). However, the 3 Vs or 5 Vs model provides the basic characteristics of big data.
[Figure: the five Vs of big data – Volume, Velocity, Variety, Veracity and Value]
1.3 TYPES OF BIG DATA

Big data is basically of three types: Structured data, Unstructured data and Semi-structured data. The sources of all such data are people and machines.
Structured data: Any data that can be stored, accessed and processed in a fixed format is called "structured" data. It is highly organized information. It can be stored in and accessed from a row-column database with the help of simple algorithms. This type of data can be generated by people and by machines like scanners, sensors, computer systems and other automatic devices. All such human- and machine-generated data can be captured by servers in an ordered format. Over time, computer science has had great success in developing techniques for working with such data (where the format is known in advance) and in deriving value from it. However, today we see problems when the size of such data increases dramatically; typical sizes are in the range of several zettabytes. For example, the customer table in a firm's database is structured data, with details of the customers like name, address and other important information.
Unstructured data: Any data with an unknown form or structure is classified as unstructured data. Not only is unstructured data huge, it also presents a variety of challenges in terms of processing it to derive value. Most of the data generated by humans through the internet and social media is unstructured. The data produced by machines like satellites, scientific instruments, closed-circuit TVs and radars is also unstructured in nature. A typical example of unstructured data is a heterogeneous data source that contains a combination of simple text files, images, videos and so on. All this data has been accumulating continuously at an enormous rate, and it is nothing but unstructured data. Because it does not follow a fixed format or structure for its storage, the processing and analysis of unstructured data is a difficult and time-consuming activity.
Unstructured data can be further classified into two types – captured data and user-generated data. When you book a cab (Uber or Ola) through your mobile phone, you can trace the movement of the cab to your place and from your pick-up point to your destination. In the same way, the cab driver can trace your location and follow the navigation to reach the pick-up point and the destination. All such navigational data is said to be captured data. The unstructured data that is posted continuously by users in the form of tweets and retweets, likes, shares and comments on social media is said to be user-generated data. Nowadays, companies have a wealth of data, but unfortunately they do not know how to derive value from it, because this data is in its raw or unstructured form.
Semi-structured data: This represents data which is structured in one way and unstructured in another. Most semi-structured data is of unstructured format, but contains some organized elements which are useful for processing. Examples include tags and keywords that contain vital information and are useful in segregating individual elements in the data. NoSQL documents are semi-structured data, because they contain keywords useful for processing the documents easily.
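To make the idea concrete, here is a minimal sketch in Python (the order record and its fields are made up for illustration) showing how a semi-structured JSON document combines an organized part, which can be processed like structured data, with free-form text, which cannot:

import json

# A hypothetical order document: the keys and tags give it partial structure,
# while the "note" field holds free-form, unstructured text.
order = {
    "order_id": 1001,
    "customer": "Asha",
    "tags": ["priority", "gift-wrap"],
    "note": "Please deliver after 6 pm; call on arrival."
}

doc = json.dumps(order)       # serialize to a JSON string
parsed = json.loads(doc)      # parse it back

# The organized part can be processed directly, like structured data...
print(parsed["order_id"], parsed["tags"])
# ...while the free-form part needs text processing, like unstructured data.
print(parsed["note"])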
The three types can be compared as follows:

● Flexibility: Structured data is dependent on its schema and has the least flexibility. Unstructured data has flexibility, but its internal structure does not follow any data model or schema. Semi-structured data has more flexibility than structured data, but less than unstructured data.

● Access: Structured data offers self-service access. Access to unstructured data requires expertise in data science. Access to semi-structured data also requires expertise in data science to some extent.

● Queries: On structured data, structured queries along with complex joins are possible. On unstructured data, only textual queries are possible. On semi-structured data, queries over anonymous nodes are possible.
1.4 STORAGE OF BIG DATA
The traditional database is designed to handle predictable and structured data. In a relational database, vertical and sometimes horizontal expansion of data is possible to a limited extent, depending on the growth of the data or the processing requirements. Big data, however, involves a continuous flow of large amounts of widely varying data. As a result, conventional database systems cannot store and process it. Relational databases cannot accommodate rapidly changing data requirements: incoming data must fit the existing data model and the schema that defines it. Any change or modification that needs to be made either to the data model or to the schema is a manual and time-consuming process, and it even affects the associated applications and services. The current scenario requires two important characteristics in database processing: (i) flexibility in development, by meeting changing data requirements, and (ii) scalability in operation, by processing a fast and continuous flow of widely varying data. These two characteristics are absent in traditional relational database systems.
Big data can be stored either in data lakes or in data warehouses, depending on the needs of the user organization. Data lakes can store large amounts of all three types of big data in raw format, which means they can store any type of big data in its native format without placing restrictions on size or format. Data scientists can leverage such data because there is a large volume of data available and enough leeway to improve analytics performance and integration and to generate real-time insights. The data can be updated quickly and is easily accessible. Big data relies more on data lakes to store it in its various forms – raw, granular, structured and unstructured. All data from different source systems can be loaded into data lakes without anything missing. The data can later be transformed and an appropriate schema applied to meet the needs of data analysis.

Data warehouses, by contrast, store only processed and filtered data for a specific purpose and use by business people. They are repositories for structured data only, and data accessibility and updating are more complex. Data warehouses are useful in financial and other business environments, because the big data generated in those environments can be stored in a structured format that the entire organization can access for specific analyses.
● The Hadoop framework is designed to store and process data in a distributed computing environment using standard hardware with a simple programming model. It can store and analyze the data present on various machines at high speed and low cost. It was developed by the Apache Software Foundation in 2011 and is written in Java.
● NoSQL document databases such as MongoDB offer a direct alternative to the rigid schemas used in relational databases. This allows MongoDB to offer flexibility when processing a wide variety of data types across large and distributed architectures. MongoDB was first released in 2009 and is written in C++, Go, JavaScript and Python.
● RainStor is a software company that developed a database management system of the same name, which can be used to manage and analyze big data for large companies. It uses deduplication techniques to organize the process of storing large amounts of data for reference, and it supports SQL-like queries.
● With Hunk, you can access data in remote Hadoop clusters through virtual indexes and analyze your data using the Splunk Search Processing Language. With Hunk you can report on and visualize large amounts of data from your Hadoop and NoSQL data sources. It was developed by Splunk Inc. in 2013 and is written in Java.
1.5 BIG DATA TECHNOLOGY

Big data technology is software that can analyze, process and extract the right information from big data – extremely complex and large amounts of data. Currently, the whole world seeks more and more information from the fast, continuous flow of data in its various forms in order to carry out regular and future activities. To meet these challenging needs, new technologies and sophisticated systems have evolved rapidly, replacing traditional RDBMSs, SQL and many front-end applications. NoSQL (Not Only SQL) database systems, Hadoop, MapReduce and massively parallel computing are the important ones.
Big data technology is mainly divided into two types: operational and analytical.

Operational Big Data: This is all about the normal daily data we generate. It could be online transactions, social media or data from a specific organization, etc. You can even think of it as a kind of raw data that is used to feed the analytical big data technologies.
A few examples are as follows:
● Online ticket bookings including your train tickets, plane tickets, movie tickets, etc.
● Online shopping on Amazon, Flipkart, Walmart, Snapdeal and many more.
● Data from social media websites like Facebook, Instagram and many more.
● The employee data of a multinational company.
Analytical Big Data: This is more complex than operational big data. With analytical big data, the actual performance part comes into play, and critical business decisions are made in real time by analyzing the operational big data. A few examples are as follows:
● Share market trading.
● Space missions, where every piece of information is of vital importance.
● Weather forecasting.
● Medical areas, where the health status of a particular patient can be monitored.
NoSQL is a completely different database framework for powerful and agile processing of information on a large scale. It is well designed to meet the essential needs of big data. The NoSQL database infrastructure handles unstructured, cluttered and unpredictable data well, relaxing strict consistency in order to maintain the speed and agility of the data. It is not a relational database based on tables and does not use SQL to manipulate the data. NoSQL follows the concept of distributed databases, storing semi-structured and unstructured data across multiple processing nodes, and even across multiple servers, to handle a continuous data explosion with good performance. It also maintains fault tolerance. Big data warehouses can be managed by these distributed NoSQL database architectures as well. NoSQL ensures high performance and high availability while offering rich query languages and easy scalability.
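As an illustration of this schema-free style, here is a minimal sketch using the PyMongo client (it assumes a MongoDB server running locally on the default port; the database and collection names are hypothetical):

from pymongo import MongoClient

# Connect to a local MongoDB server (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017/")
db = client["bigdata_demo"]          # hypothetical database name

# Documents in the same collection need not share a schema.
db.posts.insert_one({"user": "asha", "text": "hello", "likes": 10})
db.posts.insert_one({"user": "ravi", "tags": ["travel"], "location": "Pune"})

# Query by any field; no table definition or ALTER TABLE was ever needed.
for doc in db.posts.find({"user": "asha"}):
    print(doc)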
Hadoop is an open-source software ecosystem for the distributed storage and processing of big data on large hardware clusters. It supports massively parallel and functional computing. It is built to cope with high chances of system failure, limited bandwidth and high programming complexity. It can host certain types of distributed NoSQL databases, spreading the data across a large number of servers without affecting performance. The Hadoop framework thus provides distributed storage for data sets that are too large for a single system.
The main principle of the Hadoop framework is MapReduce, a computational model. MapReduce takes data-intensive processes and distributes the computation across a Hadoop cluster, which can contain a very large number of servers, all of which work in parallel and significantly reduce processing time. Because of these capabilities, Hadoop technology supports the gigantic processing requirements of big data.
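The model can be imitated in plain Python. The sketch below is the classic word-count example: a map step emits (word, 1) pairs, a shuffle step groups them by key, and a reduce step sums each group. Real Hadoop MapReduce distributes these steps across many servers; this single-process version only illustrates the idea:

from collections import defaultdict

def map_phase(line):
    # Emit a (key, value) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Combine all values emitted for the same key.
    return word, sum(counts)

lines = ["big data is big", "data flows fast"]

# Map: apply map_phase to every input record.
mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group (done sequentially here, in parallel on a cluster).
for word, counts in sorted(groups.items()):
    print(reduce_phase(word, counts))   # e.g. ('big', 2)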
Check your Progress 1

Fill in the Blanks.

2. _____________ represents quality of the data, that is, cleanliness and accuracy of data without missing any data items.

3. Volume, velocity, variety, _____ and ________ are the five V's of Big Data.
1.6 BIG DATA PROCESSING AND ANALYSIS
Big data engineering begins with identifying the sources that make up the big data, after which the relevant data is captured for integration and processing. Efficient data processing usually breaks the data into small pieces and processes them in parallel, and this demands large computing infrastructure. As the amount of data increases, the number of parallel processes increases, requiring more servers with more processors. Big data processing and distribution systems make it easy to organize and distribute data across parallel computer clusters. Hadoop, an open-source big data clustering tool, is ideal for large-scale data processing and distribution.
There are two important ways to process big data – batch processing and stream processing. In batch processing, large batches or blocks of data are processed, while in stream processing, individual records or micro-batches of a few records are processed. Batch processing is useful in situations where the analysis does not demand results in real time; for batch processing, Hadoop MapReduce is the most useful framework. Stream processing is the big data technology used to process data in real time, to investigate continuous data flows and to detect conditions within a short time (from a few milliseconds to a few minutes) of the data being received. It is useful in situations where real-time analytical results are demanded, feeding data rapidly into analysis tools from the point of data generation to get instant results. Apache Kafka, Apache Flink, Apache Storm and Apache Samza are important open-source stream processing platforms. Apache Spark is another popular system that is compatible with Hadoop and can act as a standalone processing engine. It can keep data in memory across multiple data transformation steps and can therefore iterate multiple times over the same piece of data. This advantage is much needed in analysis and machine learning; Hadoop MapReduce does not keep data in memory this way. For big data solutions, processing data in memory (as in Spark) is just as useful as the distributed storage of large data in Hadoop. Cloud solutions also provide dynamically distributed processing services, scaling the number of parallel processes based on data volume. They offer infrastructure flexibility and financially attractive solutions. After the data has been captured and processed, the big data is ready for analysis; appropriate analytical models and data visualization techniques are useful for this purpose.
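As a small taste of Spark's in-memory, iterative style, here is a hedged PySpark sketch (it assumes a local installation of pyspark; the sensor readings are made up). The cached dataset is reused by three computations without being re-read:

from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed).
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

# A made-up dataset of sensor readings.
readings = sc.parallelize([3.1, 9.7, 4.2, 8.8, 1.5])
readings.cache()                      # keep the data in memory

# The cached data is reused by several computations without re-reading it.
print("count:", readings.count())
print("high readings:", readings.filter(lambda x: x > 5).collect())
print("mean:", readings.mean())

spark.stop()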
● Apache Kafka is a distributed streaming platform. A streaming platform has three main capabilities: publishing streams of records, subscribing to them and processing them; in this respect it is similar to a message queue or an enterprise messaging system. It was released through the Apache Software Foundation in 2011 and is written in Scala and Java. (A minimal producer/consumer sketch appears after this list.)
● Splunk collects, indexes and correlates real-time data in a searchable repository, from which charts, reports, alerts, dashboards and data visualizations can be generated. It was developed by Splunk Inc. and written in AJAX, C++, Python and XML.
● With KNIME, users can visually create data flows, selectively perform some or all of the analysis steps, and review the results, models and interactive views. KNIME is based on Eclipse and uses its extension mechanism to add plugins that offer additional functionality. It was developed by KNIME in 2008 and written in Java.
● Spark offers in-memory computing capabilities to provide speed, a generic execution model to support a wide variety of applications, and Java, Scala and Python APIs to simplify development. It was developed by the Apache Software Foundation and written in Java, Scala, Python and R.
● R is a programming language and a free software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and, mainly, for data analysis. It reached version 1.0 under the R Foundation in 2000 and is written largely in C, Fortran and R.
● Blockchain is used in key functions like payment, escrow and title. It can also reduce fraud, increase financial privacy, speed up transactions and internationalize markets. It first appeared with Bitcoin, and common implementations are written in JavaScript, C++ and Python. Blockchain can be used to achieve the following in a business network environment:
○ Shared ledger: a distributed system of record is appended to across the business network.
○ Smart contracts: terms and conditions are embedded in the transaction database and are executed with the transaction.
○ Data protection: transactions are secured, authenticated and verifiable, with adequate visibility.
○ Consensus: all parties in the business network agree on a network-verified transaction.
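To give a flavour of Kafka's publish/subscribe model, here is a minimal sketch using the third-party kafka-python client (it assumes a broker running on localhost:9092; the topic name 'events' and the messages are made up):

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a record to the 'events' topic (broker address assumed).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"page_view user=asha")
producer.flush()

# Consumer: subscribe to the same topic and read the stream of records.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,        # stop iterating if no message arrives
)
for message in consumer:
    print(message.topic, message.value)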
Big data analytics examines extremely large, fast-flowing and diverse data to reveal information and knowledge from hidden patterns, unknown correlations, trends and other insights for better decision making. Specialized software tools and applications are used in big data analytics for predictive analysis, data mining, text mining, forecasting, optimization and visualization, producing many insights useful for business and society.
Hadoop and cloud-based big data analytics help organizations reduce the cost of storing large amounts of data and make better and faster decisions so that appropriate action can be taken. In-depth analysis of customer needs, issues and preferences in different locations can be carried out to develop and deliver relevant new products and services, increase customer satisfaction and improve business, while maintaining a competitive advantage for organizations in the market.
1.7 BENEFITS OF BIG DATA

The importance of big data doesn't depend on how much data you have, but on what you do with it. You can extract and analyze data from any source to find answers that enable cost savings, time savings, new product development, optimized offerings and intelligent decisions. When you combine big data with powerful analytics, you can perform business-related tasks such as:
● Identifying the main causes of errors, problems and defects in near real time.
● Generating vouchers at the point of sale based on the customer's buying habits.
● Recalculating entire risk portfolios in minutes.
● Detecting fraudulent behavior before it affects your business.
The ability to process big data offers several advantages:
● Companies can use external information when making decisions: they can optimize their business strategies by accessing social data from search engines and from websites such as Facebook and Twitter.
● Improved customer service: big data and natural language processing technologies can be used to read and evaluate consumer responses.
● Early identification of risks to products and services, where necessary.
● Better operational efficiency.
● Big data holds a large volume of information and helps companies get broader answers to recurring problems.
● With big data, companies can optimize their processes and operational efficiency and reduce risks.
● Big data supports predictive analysis, accurately predicting outcomes and enabling companies to make better decisions.
● It helps business organizations streamline their digital marketing strategies to improve customer experience, solve problems, and improve their products and services.
● The accuracy of big data tools in filtering and integrating relevant data from multiple sources saves time and money and generates highly actionable insights.
1.8 APPLICATIONS OF BIG DATA IN INDUSTRY

Big data has many uses in various areas of application in companies and society. Some key uses of big data are listed below.

Manufacturing sector: Big data analysis allows companies to get a good idea of which products can do good business and to start production accordingly. Delivery strategies and the product itself can be significantly improved. Manufacturing companies can benefit from creating a transparent infrastructure to predict uncertainties and the sources of inefficiency that adversely affect the business. Based on the knowledge gained, companies can optimize their processes and procedures in order to improve their productivity and their overall business. Predictive analysis enables organizations to analyze past and current products or services and evaluate the market feasibility of new ones. Accordingly, they develop selected products and services in order to maintain competitive advantages and do good business. Problems such as labour constraints, equipment failures and material-flow bottlenecks can be analyzed regularly and quickly to streamline production.
Keywords
● Big data, Volume, Variety, Velocity, Value, Veracity, Hadoop, NoSQL, MapReduce.
Self-Assessment Questions

2. Describe how traditional RDBMS is not suitable to store and process big data.
Suggested Reading
1. Big Data: A Revolution That Will Transform How We Live, Work, and Think. - Book by
Kenneth Cukier and Viktor Mayer-Schönberger.
2. Big Data For Dummies - Book by Alan Nugent, Fern Halper, Judith Hurwitz, and Marcia
Kaufman.
3. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities - Book by Thomas H.
Davenport.
UNIT 2
Structure:
2.1 Introduction
2.2 Understanding Data Warehouse
2.3 Difference between OLTP and Data Warehousing Environments
2.4 Basics of Data Warehouse Architecture
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Let us examine some of the key defining features of the data warehouse based on these definitions. What about the nature of the data in the data warehouse? How is this data different from the data in any operational system? Why does it have to be different? How is the data content in the data warehouse used?
● Subject-Oriented Data

In operational systems, we store data by individual applications. In the data sets for an order processing application, we keep the data for that particular application. These data sets provide the data for all the functions for entering orders, checking stock, verifying customer's credit, and assigning the order for shipment, but these data sets contain only the data that is needed for those functions relating to this particular application. We will have some data sets containing data about individual orders, customers, stock status and detailed transactions, but all of these are structured around the processing of orders.

Similarly, for a banking institution, data sets for a consumer loans application contain data for that particular application. Data sets for other distinct applications of checking accounts and savings accounts relate to those specific applications. Again, in an insurance company, different data sets support individual applications such as automobile insurance, life insurance, and workers' compensation insurance.

In every industry, data sets are organized around individual applications to support those particular operational systems. These individual data sets have to provide data for the specific applications to perform the specific functions efficiently. Therefore, the data sets for each application need to be organized around that specific application.

In striking contrast, in the data warehouse, data is stored by subjects, not by applications. If data is stored by business subjects, what are business subjects? Business subjects differ from enterprise to enterprise. These are the subjects critical for the enterprise. For a manufacturing company, sales, shipments, and inventory are critical business subjects. For a retail store, a sale at the check-out counter is a critical subject.
Figure 2.1 distinguishes between how data is stored in operational systems and in the data warehouse. In the operational systems shown, data for each application is organized separately by application: order processing, consumer loans, customer billing, accounts receivable, claims processing, and savings accounts.

Fig. 2.1: The data warehouse is subject oriented
In a data warehouse, there is no application flavor. The data in a data warehouse cuts across applications.

Fig. 2.2: The data warehouse is integrated
● Time-Variant Data

For an operational system, the stored data contains the current values. In an accounts receivable system, the balance is the current outstanding balance in the customer's account. In an order entry system, the status of an order is the current status of the order. In a consumer loans application, the balance amount owed by the customer is the current amount. Of course, we store some past transactions in operational systems, but, essentially, operational systems reflect current information because these systems support day-to-day current operations.

On the other hand, the data in the data warehouse is meant for analysis and decision-making. If a user is looking at the buying pattern of a specific customer, the user needs data not only about the current purchase, but on the past purchases as well. When a user wants to find out the reason for the drop in sales in the North East division, the user needs all the sales data for that division over a period extending back in time. When an analyst in a grocery chain wants to promote two or more products together, that analyst wants sales of the selected products over a number of past quarters. A data warehouse, because of the very nature of its purpose, has to contain historical data, not just current values.
● Nonvolatile Data

Data extracted from the various operational systems and pertinent data obtained from outside sources are transformed, integrated and stored in the data warehouse. The data in the data warehouse is not intended to run the day-to-day business. When you want to process the next order received from a customer, you do not look into the data warehouse to find the current stock status; the operational order entry application is meant for that purpose. In the data warehouse, you keep the extracted stock status data as snapshots over time. You do not update the data warehouse every time you process a single order.

Data from the operational systems are moved into the data warehouse at specific intervals. Depending on the requirements of the business, these data movements take place twice a day, once a day, once a week, or once in two weeks. In fact, in a typical data warehouse, data movements to different data sets may take place at different frequencies. The changes to the attributes of the products may be moved once a week. Any revisions to geographical setup may be moved once a month. The units of sales may be moved once a day. You plan and schedule the data movements or data loads based on the requirements of your users.

As illustrated in Figure 2.3, not every business transaction updates the data in the data warehouse. The business transactions update the operational system databases in real time. We add, change, or delete data from an operational system as each transaction happens, but we do not usually update the data in the data warehouse. You do not delete the data in the data warehouse in real time. Once the data is captured in the data warehouse, you do not run individual transactions to change the data there. Data updates are commonplace in an operational database; not so in a data warehouse. The data in a data warehouse is not as volatile as the data in an operational database; it is, for all practical purposes, nonvolatile.

Fig. 2.3: The data warehouse is nonvolatile
● Data Granularity

In an operational system, data is usually kept at the lowest level of detail. In a point-of-sale system for a grocery store, the units of sale are captured and stored at the level of units of a product per transaction at the check-out counter. In an order entry system, the quantity ordered is captured and stored at the level of units of a product per order received from the customer. Whenever you need summary data, you add up the individual transactions. If you are looking for units of a product ordered this month, you read all the orders entered for the entire month for that product and add them up. You do not usually keep summary data in an operational system.

When a user queries the data warehouse for analysis, he or she usually starts by looking at summary data. The user may start with total sale units of a product in an entire region. Then the user may want to look at the breakdown by states in the region. The next step may be the examination of sale units at the next level, that of individual stores. Frequently, the analysis begins at a high level and moves down to lower levels of detail.

In a data warehouse, therefore, you find it efficient to keep data summarized at different levels. Depending on the query, you can then go to the particular level of detail and satisfy the query. Data granularity in a data warehouse refers to the level of detail: the lower the level of detail, the finer the data granularity. Of course, if you want to keep data at the lowest level of detail, you have to store a lot of data in the data warehouse. You will have to decide on the granularity levels based on the data types and the expected system performance for queries. Figure 2.4 shows examples of data granularity in a typical data warehouse. Depending on the requirements, multiple levels of detail may be present; many data warehouses have at least dual levels of granularity.

Fig. 2.4: Data Granularity
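As a toy illustration of granularity levels, the following Python sketch (with made-up sales records) derives a coarser monthly summary from detail-level rows; a warehouse would typically store both levels so that summary queries need not re-read the detail:

from collections import defaultdict

# Detail level: one row per product per store per day (finest granularity).
detail = [
    ("2024-01-01", "Pune",   "soap", 12),
    ("2024-01-01", "Mumbai", "soap",  7),
    ("2024-01-02", "Pune",   "soap",  5),
    ("2024-01-02", "Pune",   "oil",   9),
]

# Summary level: total units per product per month (coarser granularity).
summary = defaultdict(int)
for day, store, product, units in detail:
    month = day[:7]                  # e.g. '2024-01'
    summary[(month, product)] += units

for (month, product), units in sorted(summary.items()):
    print(month, product, units)     # pre-computed totals answer queries fast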
Types of Systems

Data mart

A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.
Online analytical processing (OLAP)

OLAP is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. OLAP databases store aggregated historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day.
Online Transaction Processing (OLTP)

OLTP is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF).
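The contrast between the two workloads can be sketched with Python's built-in sqlite3 module (the orders table and its rows are invented for illustration): the OLTP-style statements touch single rows, while the OLAP-style query scans and aggregates the whole table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP style: many short transactions, each inserting or updating one row.
conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("East", 120.0))
conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("West", 80.0))
conn.execute("UPDATE orders SET amount = 130.0 WHERE id = 1")
conn.commit()

# OLAP style: one complex query that scans and aggregates the whole table.
for row in conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(row)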
Predictive analysis

Predictive analysis is about finding and quantifying hidden patterns in the data using complex mathematical models that can be used to predict future outcomes. Predictive analysis is different from OLAP in that OLAP focuses on historical data, while predictive analysis focuses on forecasting future behavior.

Among other things, a data warehouse is expected to:
● Provide a single common data model for all data of interest regardless of the data's source.
● Restructure the data so that it makes sense to the business users.

Query Operation: A typical data warehouse query scans millions of rows, whereas an OLTP query scans only a handful of rows.

Data History: A data warehouse's main focus is to store historical data, whereas OLTP deals with current data.
Activity 1

Find out the difference between Data Mart and Data Warehouse.
2.4 BASICS OF DATA WAREHOUSE ARCHITECTURE

Fig. 2.5: Data Warehouse Basics
----------------------
In this figure, OLTP source data is present in form of summary data and raw
data in data warehouse. Summary data is very important to data warehouse, as ----------------------
it is pre-computed queries data. For example, a typical data warehouse query is
to retrieve the records based on some condition.
Fig. 2.6: Data Warehouse with Staging Area
The operational data needs to be cleansed and processed before being loaded into the data warehouse. This is carried out in a staging area. A staging area simplifies building summaries and general warehouse management.
Data Warehouse with Staging Areas and Data Marts

Fig. 2.7: Data Warehouse with staging areas and Data marts
Activity 2

Identify the requirement of data warehouse architecture for your company.
Summary

● A data warehouse is designed mainly for query processing; hence it differs from the working methodology of traditional online transaction processing databases.
● The characterization of the data warehouse makes it easier to understand the nature of the data it holds.
● The data in the data warehouse is: separate, available, integrated, time-stamped, subject-oriented, nonvolatile and accessible.
Self-Assessment Questions

1. Define data warehouse.
2. What is the difference between data warehouse and OLTP?
3. Write a short note on subject-oriented data.
4. Explain the basics of Data Warehouse Architecture.
5. What is a Data Mart?
Answers to Check your Progress

Check your Progress 1
State True or False.
1. False
2. True
3. True
4. False

Check your Progress 2
Multiple Choice Single Response.
1. A data warehouse is said to contain a 'subject-oriented' collection of data because
   i. Its contents have a common theme.
2. A data warehouse is an 'integrated' collection of data because
UNIT 3
Data Warehouse Architecture
Structure:
3.1 Introduction
3.2 The Data Warehouse Architecture
3.3 Three-Tier Data Warehouse Architecture for Business analysis Framework
3.4 Data Warehouse Models
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
3.1 INTRODUCTION

The technical architecture of data warehouses is somewhat similar to that of other systems, but it does have some special characteristics. There are two broad approaches in data warehouse architecture – the single-layer architecture and the N-layer architecture.

In the previous unit, we discussed the basics of data warehouse architecture; in this unit, we will study it in detail. Data warehouses can be architected in many different ways, depending on the specific needs of a business.
3.2 THE DATA WAREHOUSE ARCHITECTURE

In short, data is moved from the databases used in operational systems into a data warehouse staging area, then into the data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform and Load).
Fig. 3.1: ETL Process in Data Warehousing
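A toy sketch of the three ETL steps in Python (the source rows, cleaning rules and target are all made up) makes the idea explicit:

# Extract: read rows from an operational source (a list stands in for a database).
source_rows = [
    {"id": 1, "name": " Asha ", "amount": "120.50"},
    {"id": 2, "name": "Ravi",   "amount": "80.00"},
]

# Transform: clean and convert the data into the warehouse's analytical format.
transformed = [
    {"id": r["id"], "name": r["name"].strip().title(), "amount": float(r["amount"])}
    for r in source_rows
]

# Load: write the conformed rows into the target (the warehouse or a data mart).
warehouse = []
warehouse.extend(transformed)
print(warehouse)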
In general, all data warehouse systems have the following layers:
● Data Source Layer
● Data Extraction Layer
● Staging Area
● ETL Layer
● Data Storage Layer
● Data Logic Layer
● Data Presentation Layer
● Metadata Layer
● System Operations Layer

Metadata Layer

This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that sits in the metadata layer. A metadata tool is often used to manage metadata.
System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
Check your Progress 1

State True or False.
1. Data fed into the Data Source Layer can be of any format.

Fill in the Blanks.
1. Logic is applied to transform the data from a transactional nature to an analytical nature in the ______ Layer.
2. Usually an OLAP tool and/or a reporting tool are used in the __________ layer.
3.3 THREE-TIER DATA WAREHOUSE ARCHITECTURE FOR BUSINESS ANALYSIS FRAMEWORK

Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture:
● Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load and Refresh functions.
● Middle Tier - In the middle tier, we have the OLAP server. The OLAP server can be implemented either as a Relational OLAP (ROLAP) server, an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or as a Multidimensional OLAP (MOLAP) server, a special-purpose server that directly implements multidimensional data and operations.
● Top Tier - This is the front-end client layer, which holds the query tools, reporting tools, analysis tools and data mining tools.
Fig. 3.3: Three-tier Architecture of Data Warehouse
3.4 DATA WAREHOUSE MODELS

Points to remember about data marts:
● Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
● The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
● The life cycle of a data mart may be complex in the long run if its planning and design are not organisation-wide.
● Data marts are small in size.
● Data marts are customized by department.
● The source of a data mart is a departmentally structured data warehouse.
● Data marts are flexible.
ENTERPRISE WAREHOUSE

An enterprise warehouse collects all of the information about all the subjects spanning the entire organization.
● It provides enterprise-wide data integration.
● The data is integrated from operational systems and external information providers.
● This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Check your Progress 2

State True or False.
1. Virtual Warehouse, Data mart, Enterprise Warehouse are data warehouse models.

Activity 1
Summary ----------------------
UNIT 4
Dimensional Modeling
Structure:
4.1 Introduction
4.2 ER Model versus Dimensional Model
4.2.1 ER Model
4.2.2 Dimensional Model
4.2.3 Differences between Dimensional Model and Relational Model
4.3 Dimensional Modeling Technique
4.4 Dimensional Modeling Process
4.5 Benefits of Dimensional Modeling
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:
● Define the dimensional model
● Differentiate between the ER model and the dimensional model
● Describe the dimensional modeling process
4.1 INTRODUCTION

In relational modeling, the focus is on identifying the strong entities involved in the execution of business transactions. Therefore, in transaction-oriented systems, data structures are designed to enable fast writing, through the process of ER modeling and normalization. However, such designs hamper query performance badly, due to the multiple joins resulting from normalization. For the data warehouse, the focus is on identifying the associative entities that carry the business measures. The design process that supports these measures is known as Dimensional Modeling. Such modeling helps to perform aggregation and integration of data from different sources.
4.2 ER MODEL VERSUS DIMENSIONAL MODEL

The basic differences between the ER model and the dimensional model are discussed below.

4.2.1 ER Model

The entity-relationship model (ER model) is a data model for describing the data or information aspects of a business domain or its process requirements, in an abstract way that lends itself to ultimately being implemented in a database such as a relational database.

The ER model maps to the relational model of a relational database, which is composed of a set of relations. A relation schema is denoted by R(A1, A2, ..., An), made up of a relation name R and its associated attributes Ai. Each attribute is a characteristic of the relation over a particular domain. Each relation R in the relational schema is composed of a set of tuples. A tuple is the group of attribute values that characterizes an entity; in other words, the values of all the columns of a relation taken together form a tuple.
4.2.2 Dimensional Model

Dimensions are the characteristics of subjects, in which each row is an occurrence and each attribute can be used as a 'by' attribute in a query's where clause. For example, a user wants to see sales by customer or by product. Time is a fundamental dimension across all industries and is thereby called a conformed dimension. Combining all the attributes of a single business object into a single dimension gives a model composed of dimensions and facts. Two modeling techniques exist for dimensions; they are elaborated below.
The Star schema and Snowflake schema modeling techniques represent the structure of a dimensional model. The center of the schema is the fact table, the only table in the schema with multiple joins connecting it to the dimension tables. The fact table stores the measures of the business, while the dimension tables define the characteristics of the business. The primary key of a fact table is a composite primary key, composed of the foreign keys from the participating dimensions; in other words, each component of the composite primary key is a foreign key referencing the primary key of a dimension table. The dimensions are usually grouped into hierarchies, which specify the granularity level. Such schemas have the following benefits:
● Easier to understand
● Improved query performance, as fewer joins are required
● Scalable

The diagram of the star schema resembles a star.
[Figure: a central Sales fact table joined to the Customer, Product, Location and Time dimension tables.]
Fig. 4.1: Star Schema Modeling
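As a concrete illustration, the following Python sketch builds a miniature star schema in SQLite (the table names, columns and data are invented for the example) and runs a typical fact-to-dimension aggregate query:

# A miniature star schema in SQLite -- schema and data are invented examples.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- Fact table: composite key of foreign keys, plus the sales measure.
CREATE TABLE sales (
    product_id INTEGER REFERENCES product(product_id),
    time_id    INTEGER REFERENCES time_dim(time_id),
    amount     REAL,
    PRIMARY KEY (product_id, time_id)
);
INSERT INTO product VALUES (1, 'laptop'), (2, 'desktop');
INSERT INTO time_dim VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO sales VALUES (1, 1, 500.0), (1, 2, 700.0), (2, 1, 300.0);
""")

# A typical star query: aggregate the measure by dimension attributes.
for row in con.execute("""
    SELECT p.name, t.year, SUM(s.amount)
    FROM sales s
    JOIN product p ON p.product_id = s.product_id
    JOIN time_dim t ON t.time_id = s.time_id
    GROUP BY p.name, t.year
    ORDER BY p.name
"""):
    print(row)  # ('desktop', 2023, 300.0) then ('laptop', 2023, 1200.0)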
The snowflake schema is slightly more complex than the star schema. Its diagram resembles a snowflake; hence the name. Such a schema normalizes the dimensions to reduce redundancy. In other words, a dimension is partitioned into several small tables; for example, the product dimension is partitioned into separate product and product-category tables. This results in more complex queries with more joins, thereby reducing query performance.
[Figure: a central Sales fact table joined to dimension tables that are themselves normalized into smaller sub-tables, e.g., the Customer and Time dimensions split out into further tables.]
Fig. 4.2: Snowflake Schema
Check your Progress 2

Activity 1

Design a dimensional model for your company's data warehouse.
Keywords

Fact table: It consists of the measurements, metrics or facts of a business process.

Dimensional model: The dimensional model is a specialized adaptation of the relational model used to represent data in data warehouses in a way that the data can be easily summarized using online analytical processing (OLAP) queries.

Star schema: It is the simplest style of data mart schema. The star schema consists of one or more fact tables referencing any number of dimension tables.

Snowflake schema: It is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape.
Self-Assessment Questions
1. Define dimensional modeling.
2. Differentiate between ER modeling and Dimensional Modeling.
Answers to Check your Progress

Check your Progress 3
State True or False.
1. True
2. False

Suggested Reading
1. Ballard, Chuck; Farrell, Daniel M.; Gupta, Amit; Mazuela, Carlos; Vohnik, Stanislav. Dimensional Modeling: In a Business Intelligence Environment. IBM Redbooks.
2. Teorey, Toby J.; Lightstone, Sam S.; Nadeau, Tom; Jagadish, H.V. Database Modeling and Design: Logical Design.
3. Varga, Mladen. On the Differences of Relational and Dimensional Data Model.
5
Structure:
5.1 Introduction
5.2 Physical Database Design
5.3 Hardware and I/O Considerations
5.4 Integrity Constraints
5.5 Dimensions
5.6 Aggregation
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
5.1 INTRODUCTION
During the physical design process, we translate the schemas gathered from logical design into a physical design specification. This unit provides you with an in-depth understanding of data warehousing and its application to business intelligence. You will learn the concepts necessary to build a successful data warehouse that enables your business intelligence program on the first implementation.
5.2 PHYSICAL DATABASE DESIGN
The physical database design is generated according to the requirements of query performance and maintenance.

In the logical design, a model is created for the data warehouse, composed of entities, attributes and relationships. Entities are linked together using relationships, and attributes characterize the entities. A unique identifier is used to distinguish one instance of an entity from another.

The following translations are required to turn such a design into an actual database during the physical design process:
● Entities to tables
● Relationships to primary and foreign key constraints
● Attributes to columns

After this translation, we are required to create the following structures in the database:
● Tablespaces
● Tables
● Indexes
● Constraints
● Dimensions
5.3 HARDWARE AND I/O CONSIDERATIONS
1. Choose storage configurations based on their bandwidth, not their capacity.
2. Create clusters of disks as storage for striping and redundancy, in order to minimize the risks involved in failures.
3. Plan for I/O growth without neglecting the I/O bandwidth.

Partitioning
As we have discussed, a data warehouse stores very large tables; partitioning such tables and their indexes into smaller pieces makes them easier to maintain and improves query performance.
Indexes
In this section, we cover B-tree and bitmap indexes for the requirements of data warehousing queries.

Bitmap indexes are widely preferred for ad-hoc queries on columns of low cardinality in environments with few concurrent transactions. Cardinality is the number of unique values available for a given attribute. For such workloads, bitmap indexing provides:
1. Improved response time for ad-hoc queries.
2. Lower storage requirements.
3. Efficient maintenance during bulk loads.
Indexing a large table with a traditional B-tree index is more expensive in terms of disk space, as index sizes can be several times larger than the corresponding data in the table. Searching a B-tree index can also be more time consuming than searching a bitmap. A B-tree index provides a pointer to the rows in a table for a specific key, whereas in a bitmap index, a bitmap represents a list of rowids. Each bit in the bitmap corresponds to a possible rowid; if the bit is set, a row is present for the given key value. A mapping function converts the bit position to the actual rowid. For queries with multiple conditions, bitmap indexes perform better than B-trees. Bitmap indexes are traditionally focused on data warehousing applications. They are not suitable for OLTP applications, because the large number of concurrent transactions modifying the data results in expensive locks on the bitmap indexes.

Bitmap indexes are used to query the fact table alone or when the fact table is joined with two or more dimension tables. A table attribute is a candidate for a bitmap index under the following conditions (a small sketch follows the list):
1. The column cardinality is low.
2. The indexed column is frequently used in the conditional clause.
3. The indexed column is a foreign key to a dimension table.
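To make the rowid-bitmap mechanics concrete, here is a minimal Python sketch (our own illustration, not how any particular database stores bitmaps): each distinct key value maps to an integer bitmask, and a multi-condition query becomes a single bitwise AND.

# Minimal bitmap-index sketch: one Python integer serves as a bitmap,
# where bit i is set when row i carries the indexed value.
rows = [
    {"region": "EAST", "promo": "Y"},   # rowid 0
    {"region": "WEST", "promo": "N"},   # rowid 1
    {"region": "EAST", "promo": "N"},   # rowid 2
    {"region": "EAST", "promo": "Y"},   # rowid 3
]

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bitmap of matching rowids."""
    index = {}
    for rowid, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << rowid)
    return index

region_idx = build_bitmap_index(rows, "region")
promo_idx = build_bitmap_index(rows, "promo")

# Multi-condition query: region = 'EAST' AND promo = 'Y'
# is a single bitwise AND of two bitmaps.
hits = region_idx["EAST"] & promo_idx["Y"]
matching_rowids = [i for i in range(len(rows)) if hits & (1 << i)]
print(matching_rowids)  # [0, 3]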
B-tree indexes
The bottom level of a B-tree index contains the index keys and pointers to the corresponding rows. Through such an index, a typical query retrieves the rows matching a value of the indexed column. Hence it is fast for searches and well suited to columns of higher cardinality. However, when nearly every row of a table must be retrieved, such an index scan may cost more than a full table scan. B-tree indexes are most commonly used to enforce unique keys.
Check your Progress 1
State True or False.

5.4 INTEGRITY CONSTRAINTS
In this section, we will discuss the usefulness of constraints, constraint states and data warehouse constraints.
● Usefulness
Integrity constraints provide a mechanism to enforce business rules. Such constraints are used to achieve both data cleanliness and query optimization: they prevent the introduction of dirty data, and the conditions they declare can also help the optimizer produce better query plans.

● Constraint States
In order to achieve enforcement, a constraint must be in the enabled state. An enabled constraint ensures that data transactions satisfy the conditions of the constraint. Validation, in addition, ensures that the data already existing in the table conforms to the constraint. All constraints are by default in the enabled and validated state; however, for validation to hold, constraints need to remain enabled and enforced.
● Data warehouse constraints
Query performance may be affected by the available constraints and the indexes associated with them. The major constraints that carry an index are the primary key and unique key constraints, which are typically enforced through a unique index. However, for large data warehouse tables, maintaining such a large unique index can be quite a tedious job in terms of processing time and disk space. Moreover, most data warehouse queries do not use unique index attributes as their predicates, so this index will probably not improve query performance. For data warehouse databases, one alternative is to disable the unique constraint; once the constraint is disabled, the unique index is not required. This approach is frequently used in data warehouses. The trade-off is that while the constraint is in the disabled state, updates in the respective base table are no longer checked against it. A better way is to drop and then recreate the respective constraints after loading data into the data warehouse.
5.5 DIMENSIONS
In order to answer business queries, dimensions categorize the data. For example, for a customer-and-product relation, commonly used dimensions are customer, product and time. As we have discussed earlier, the time dimension participates in every data warehouse. A retail store, for instance, might want to create a data warehouse to understand its business or its sales for a particular product, and may want answers to the following questions:
1. What are the total sales of a particular product for a given quarter?
2. Does any product require promotion?
3. What is the effect of a promotion on the sales of a particular product?

Two major components of the retailer's data warehouse are dimensions and facts. The dimensions are customer, product, time and location, whereas the fact is sales. We need to identify the dimensions and facts from a given problem statement for dimensional modelling.

The entries for the above-mentioned dimensions and fact are populated into dimension tables and a fact table. The fact table will contain the sales according to product, customer and time. In addition, the database object 'dimension' may describe the hierarchy over dimension tables. Moving to an upper level in the hierarchy is known as roll-up, and moving down a level is known as drill-down. For example, in a time dimension, days may roll up to weeks, months, quarters and years. Data analysis typically starts at a higher level and proceeds to deeper levels if required.
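As an illustration, the following Python sketch (the sample figures and date keys are invented) rolls daily sales up a time hierarchy from day to month to year:

from collections import defaultdict

# Daily sales keyed by ISO date: the lowest level of the time hierarchy.
daily_sales = {
    "2023-01-15": 120.0,
    "2023-01-20": 80.0,
    "2023-02-03": 200.0,
    "2024-01-10": 50.0,
}

def roll_up(sales, level):
    """Aggregate to a coarser level: 'month' keeps YYYY-MM, 'year' keeps YYYY."""
    prefix = {"month": 7, "year": 4}[level]
    totals = defaultdict(float)
    for day, amount in sales.items():
        totals[day[:prefix]] += amount
    return dict(totals)

print(roll_up(daily_sales, "month"))  # {'2023-01': 200.0, '2023-02': 200.0, '2024-01': 50.0}
print(roll_up(daily_sales, "year"))   # {'2023': 400.0, '2024': 50.0}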
5.6 AGGREGATION
Aggregation is considered a fundamental function of the data warehouse. Aggregation through multi-dimensional queries has a significant effect on performance, and the queries that build these aggregates consume a major part of the processing power. To minimize this load, data warehouse design plays a vital role. The following points are key to the design:
1. Generate a star schema in which a large central fact table is surrounded by a single level of independent dimension tables.
2. Use an aggregate navigator: a database API that transforms base-level SQL into aggregate-aware SQL.

In order to improve query aggregation, every database vendor provides the ROLLUP and CUBE aggregate operations. These operations are extensions to SQL that make aggregate queries easier to write and faster to run. They produce a single result set that is equivalent to a UNION ALL of differently grouped rows. ROLLUP, as the name suggests, produces increasing levels of aggregation, from the most detailed level up to the grand total. The CUBE operation, which aggregates over every combination of the grouping columns, requires a heavier processing workload. To enhance query performance, these operations can be parallelized, thereby increasing overall database performance and scalability.
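To illustrate what ROLLUP produces, the following Python sketch (invented sample rows) emits the equivalent of GROUP BY ROLLUP(product, quarter): detail rows, per-product subtotals and a grand total, with None playing the role of the SQL NULL in rolled-up columns.

from collections import defaultdict

# (product, quarter, sales) detail rows -- invented sample data.
facts = [
    ("laptop", "Q1", 100), ("laptop", "Q2", 150),
    ("desktop", "Q1", 80), ("desktop", "Q2", 60),
]

def rollup(rows):
    """Mimic GROUP BY ROLLUP(product, quarter): detail, subtotal, grand total."""
    detail = defaultdict(int)
    subtotal = defaultdict(int)
    grand = 0
    for product, quarter, sales in rows:
        detail[(product, quarter)] += sales
        subtotal[product] += sales
        grand += sales
    result = [(p, q, s) for (p, q), s in detail.items()]
    result += [(p, None, s) for p, s in subtotal.items()]  # NULL quarter
    result.append((None, None, grand))                     # grand total
    return result

for row in rollup(facts):
    print(row)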
3. Data analysis typically starts from ________ and goes to the ________.
4. ROLL-UP is a ________ operation.
5. The ________ dimension participates in every data warehouse.

Activity 1

Implement the data warehouse for your company by understanding the physical design process.
6
Structure:
6.1 Introduction
6.2 OLAP Technology
6.3 ROLAP and MOLAP Processing
6.4 Database Design Methodology
6.4.1 Star Schema
6.4.2 Snowflake Schema
6.5 Server Architectures for Query Processing
6.5.1 SQL Extensions
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
6.1 INTRODUCTION
Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. In this unit, we will discuss OLAP technology in detail.
6.2 OLAP TECHNOLOGY
Typical OLAP operations include:
● rollup - increasing the level of aggregation
● drill-down - decreasing the level of aggregation, or increasing detail, along one or more dimension hierarchies
● slice and dice - selection and projection
● pivot - re-orienting the multidimensional view of data
Other operations:
o drill-across - involving (across) more than one fact table
o drill-through - through the bottom level of the cube to its back-end relational tables (using SQL)
Because of this, the size of the data warehouse database is an order of magnitude larger than that of the transaction database. The workload for a data warehouse is query intensive, accessing millions of records to perform joins and aggregates. Query performance is the main parameter for data warehouse design.

In order to support complex query analysis, a data warehouse is designed with a multidimensional model using a star or snowflake schema. As discussed in the earlier chapter, typical data warehousing operations include roll-up and drill-down along one or more dimension hierarchies. Even if operational databases are tuned to support transactions and a small number of queries, running such operations on the transaction database may leave OLTP transaction performance in bad shape. Furthermore, a decision support system or data warehouse requires historical data. This requirement cannot be fulfilled by OLTP, as it contains only current data. A data warehouse usually requires integrating data from several heterogeneous sources, and such source data comes in several different and inconsistent formats. Accessing such data requires special implementation methods, which are not provided by OLTP. It is for these reasons that the data warehouse database is implemented separately.
6.3 ROLAP AND MOLAP PROCESSING
A data warehouse might be implemented using a standard relational database, an approach called Relational Online Analytical Processing (ROLAP). Here the data is stored in a relational database and accessed efficiently to serve multidimensional query requirements. In Multidimensional Online Analytical Processing (MOLAP) servers, by contrast, data is stored in a special data structure built to serve aggregate queries. To the end user, the accessibility and working of ROLAP and MOLAP systems are the same, but the systems differ in their operational details. Multiple OLAP systems exist; they are generally distinguished by the first letter of their abbreviation.

ROLAP works on data stored in relational databases, where the base data and dimension tables are stored as relational tables. This model has a set of APIs that facilitate multidimensional queries. ROLAP has several advantages over other structures.
6.4 DATABASE DESIGN METHODOLOGY
The logical database design phase maps the conceptual model onto a logical model, which is influenced by the data model for the target database (for example, the relational model). The logical data model is a source of information for the physical design phase.

The output of this process is a global logical data model consisting of an entity-relationship diagram, a relational schema, and supporting documentation that describes this model, such as a data dictionary. Together, these represent the sources of information for the physical design process, and they provide the physical database designer with a vehicle for making the tradeoffs that are so important to an efficient database design.
Physical Database Design
It is a description of the implementation of the database on secondary storage; it describes the base relations, file organizations and indexes used to achieve efficient access to the data, along with any associated integrity constraints and security measures.

Whereas logical database design is concerned with the what, physical database design is concerned with the how. The physical database design phase allows the designer to make decisions on how the database is to be implemented. Therefore, physical design is tailored to a specific DBMS. There is feedback between physical and logical design, because decisions taken during physical design to improve performance may affect the logical data model. For example, decisions taken during physical design, such as merging relations together, might affect the structure of the logical data model, which in turn has an associated effect on the application design.
Steps of the Physical Database Design Methodology
After designing the logical database model, the steps of the physical database design methodology are as follows:
Step 1: Translate the global logical data model for the target DBMS. This includes operations such as the design of base relations, derived data and enterprise constraints.
Step 2: Design the physical representation.
Fig. 6.3: A Star Schema

Star schemas do not explicitly provide support for attribute hierarchies.
In addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Consider the example of a summary table that has total sales by product by year, in the context of the star schema of Figure 6.3. We can represent such a summary table by a separate fact table that shares the Product dimension, together with a separate shrunken dimension table for time, which contains only those attributes of the dimension that make sense for the summary table (i.e., year).

Alternatively, we can represent the summary table by encoding the aggregated tuples in the same fact table and the same dimension tables, without adding new tables. This may be accomplished by adding a new level field to each dimension and using nulls. For instance, we can encode a day, a month or a year in the Date dimension table as follows: (id0, 0, 22, 01, 1960) represents a record for Jan 22, 1960; (id1, 1, NULL, 01, 1960) represents the month Jan 1960; and (id2, 2, NULL, NULL, 1960) represents the year 1960. The second attribute is the new level attribute: 0 for days, 1 for months, 2 for years. In the fact table, a record containing the foreign key id2 represents the aggregated sales for a product in the year 1960. The latter method, while reducing the number of tables, is often a source of operational errors, since the level field needs to be carefully interpreted.
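A small Python sketch of this second approach (the tuple layout follows the example above; the variable names and sales figures are our own) shows why the level field must be consulted on every query to avoid double counting:

# Level-encoded Date dimension: (id, level, day, month, year).
# level 0 = day, 1 = month, 2 = year; NULL is modeled as None.
date_dim = [
    ("id0", 0, 22, 1, 1960),       # Jan 22, 1960
    ("id1", 1, None, 1, 1960),     # the month Jan 1960
    ("id2", 2, None, None, 1960),  # the year 1960
]

# Fact rows: (date_id, sales). id2 carries a pre-aggregated yearly total.
facts = [("id0", 10), ("id1", 310), ("id2", 3650)]

def total_sales(facts, date_dim, level):
    """Sum sales only at one hierarchy level, else rows are double counted."""
    ids_at_level = {d[0] for d in date_dim if d[1] == level}
    return sum(s for date_id, s in facts if date_id in ids_at_level)

print(total_sales(facts, date_dim, 2))  # 3650: the yearly pre-aggregate only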
6.5 SERVER ARCHITECTURES FOR QUERY PROCESSING
Traditional relational servers were not geared towards the intelligent use of indices and other requirements for supporting multidimensional views of data. However, all relational DBMS vendors have now moved rapidly to support these additional requirements.
● Comparisons
An article by Ralph Kimball and Kevin Strehlo provides an excellent overview of the deficiencies of SQL in being able to do comparisons that are common in the business world, e.g., comparing the difference between the total projected sale and the total actual sale by each quarter, where projected sale and actual sale are columns of a table. A straightforward execution of such queries may require multiple sequential scans. The article provides some alternatives to better support comparisons. A recent research paper also addresses the question of how to do comparisons among aggregated values by extending SQL.
Check your Progress 2
State True or False.
1. Redbrick is an example of a specialised class of servers.
2. MOLAP servers directly support the multidimensional view of data through a multidimensional storage engine.

Activity 1

Find out how OLTP applications automate clerical data processing tasks.
Self-Assessment Questions
1. What do you understand by Data Warehouse and OLAP Technologies?
2. Write a note on applications of ROLAP and MOLAP Processing in business.
3. How is Database Design Methodology important to business organisations?
4. Write a short note on Server Architectures for Query Processing.

Suggested Reading
1. Dzeroski, Saso and Nada Lavrac. 2001. Relational Data Mining. Berlin: Springer.
2. Goswami, Gunjan. Data Mining and Data Warehousing. S.K. Kataria and Sons.
7
Structure:
7.1 Introduction
7.2 Data Mining
7.2.1 Data Mining and Knowledge Discovery
7.2.2 Architecture of a Typical Data Mining System
7.3 Motivating Challenges
7.4 Data Mining Functionalities
7.4.1 Concept/Class Description
7.4.2 Mining Frequent Patterns, Associations and Correlations
7.4.3 Classification and Prediction
7.4.4 Cluster Analysis
7.4.5 Outlier Analysis
7.5 Classification of Data Mining Systems
7.6 Data Mining Task
7.7 Major Issues in Data Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
7.1 INTRODUCTION
Data is everywhere. Every day, a huge amount of data is generated by the Web, business, the IT industry, sales, science, engineering, etc. This industry-generated data is heterogeneous and stored in different forms in databases. These large and numerous data repositories are beyond human ability to understand and analyse for decision making. This is a situation that might best be described as 'data rich but information poor'. Extracting meaningful information from this data is a challenging job.

Most of the time, important decisions are based on the decision maker's perception of the data rather than on information derived from the data repository, because no powerful tool is available to extract and analyse the data. Traditional data analysis tools and techniques fail for such data because of its massive size and non-traditional nature. To solve this problem a new method has been developed: Data Mining. Data mining technology blends traditional methods of data analysis with sophisticated algorithms suitable for processing large amounts of data.
● Selection: Retrieve data from various sources for data mining.
● Preprocessing: This involves data cleansing, that is, removal of noisy and inconsistent data.
● Transformation: Convert the data to a common format or to a new format.
● Data Mining: Techniques are applied to extract patterns and obtain the desired results.
● Interpretation/Evaluation: Visualisation or representation is used to present results to the user in a meaningful manner.

● Database, data warehouse or other information repository:
This is the data repository from which data is retrieved and on which preprocessing is performed. It may be one or multiple databases, data warehouses or some other repository.
● Database or data warehouse server:
It is responsible for providing relevant data based on the user's data mining request.
Fig. 7.1: Architecture of a Typical Data Mining System

Knowledge Base
This is the domain knowledge used to guide the search and to evaluate the interestingness of resulting patterns. It includes concept hierarchies, used to organise attributes at different levels of abstraction, which can be used to assess a pattern found in the data.

Data Mining Engine
This consists of a set of functional modules for techniques such as classification, association and correlation analysis, prediction, outlier analysis, etc.

Pattern Evaluation Module
This module interacts with the data mining modules to focus the search towards interesting patterns; it filters the discovered patterns to retain the interesting ones. Depending on the implementation of the data mining techniques used, the pattern evaluation module may be integrated with the mining module.

User Interface
This module is the communicator between the user and the data mining system. The user can interact with the data mining system to search for a pattern or any data of interest by specifying a data mining query or task. This component also helps the user to look through database and data warehouse schemas, evaluate mined patterns and visualise patterns.
● Scalability
Nowadays, datasets of gigabytes, terabytes and even petabytes are common. To handle such large volumes of data, scalable data mining algorithms are required. Scalability can be improved by using sampling or by developing parallel and distributed algorithms.
Frequent sequential patterns: This is a set of items that a customer tends to buy in a sequence or in some order. For example, a customer will first buy a computer and then prefer to purchase software for that computer.

Mining frequent patterns helps find associations and correlations within data.
7.4.3 Classification and Prediction
Classification is the technique of finding the class of an object whose label is unknown, based on a historical model. A model is constructed from data sets whose labels are known.

For example, we can build a classification model to categorise bank loan applications as either 'safe' or 'risky', or a prediction model to predict the expenditure in dollars of potential customers on computer equipment, given their income and occupation. A variation of classification is numerical prediction, which predicts a numerical outcome rather than a class.
Classification and Prediction Issues
The major issues in preparing the data for classification and prediction involve the following activities:
● Data Cleaning - Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
● Relevance Analysis - The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
● Data Transformation and Reduction - The data can be transformed by any of the following methods.
Fill in the Blanks.
1. ________ represents summarisation of the characteristics or features of a target class of data.
2. ________ is the technique of grouping similar data objects together.
3. ________ refers to those objects that do not satisfy the general behaviour or model of the data objects.

Activity 1

Collect data on the age, education and salary of 100 people and draw at least five inferences.
7.5 CLASSIFICATION OF DATA MINING SYSTEMS
Data mining is considered an interdisciplinary field. It includes a set of various disciplines, such as statistics, database systems, machine learning, visualisation and information science. Owing to such diversity, classification of data mining systems helps users to understand the systems and match their requirements with them.
Fig. 7.2: Classification of Data Mining Systems

a. Classification according to the types of databases mined: A data mining system can be classified according to the type of data, the data model, or the application of the data it handles.
b. Classification according to the types of knowledge mined: This is based on functionalities such as characterisation, discrimination, association and correlation, prediction, outlier analysis, etc.
c. Classification according to the types of techniques utilised: This considers the degree of user interaction or the technique of data analysis involved, for example, database-oriented or data-warehouse-oriented techniques, machine learning, statistics, visualisation, pattern recognition, neural networks, etc.
d. Classification according to the applications adapted: This involves domain-specific applications. For example, data mining systems can be tailored for telecommunications, finance, DNA, stock markets, e-mail and so on.
Check your Progress 4
● Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks, such as characterisation, classification, prediction, association, etc.

Self-Assessment Questions
1. Define data mining.
2. Describe data mining architecture.
3. Define data mining functionalities.
4. Describe the steps of knowledge discovery.
5. Discuss the major issues in data mining.
2. Prediction is
i. to determine future outcome rather than current behaviour
8
Structure:
8.1 Introduction
8.2 Association Rule Mining
8.2.1 Association Rules
8.3 Mining Single-Dimensional Boolean Association Rules from
Transactional Databases
8.3.1 Different Data Formats for Mining
8.3.2 Apriori Algorithm
8.3.3 Frequent Pattern Growth (FP-growth) Algorithm
8.4 Mining Multilevel Association Rules from Transaction Databases,
Relational Databases
8.4.1 Approaches to Mining Multilevel Association Rules
8.5 Application of Association Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
It is an implication of the form A => B, where A and B are subsets of the attribute set and A ∩ B = ϕ.

An association rule is of the form X => Y: if X is present, then there is a high chance that Y is also present.

Confidence
Confidence is based on conditional probability: if itemset X is present in a transaction, confidence measures how likely Y is also present.

Confidence is defined as:
confidence(X => Y) = support(X, Y) / support(X)

Consider rules with high support and high confidence; a rule with low confidence is not meaningful.
Example: Database with transactions (customer_#: item1, item2, ...)
1: 2, 45, 8.
2: 3, 4, 8.
3: 6, 4, 8, 10.
4: 1, 8, 7.
5: 1, 5, 8.
6: 2, 5, 6.
Given supp({4}) = 6, supp({8}) = 7 and supp({4, 8}) = 5, then
conf({4} => {8}) = support({4, 8}) / support({4}) = 5/6 = 0.83, or 83%.
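These formulas are easy to verify programmatically. The short Python sketch below computes support and confidence over a small transaction list of our own (not the database above):

# Transactions as sets of item ids -- invented sample data.
transactions = [
    {2, 4, 8}, {3, 4, 8}, {6, 4, 8, 10},
    {1, 8, 7}, {1, 5, 8}, {2, 5, 6},
]

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(lhs | rhs) / support(lhs)

print(support({4}))          # 3
print(support({4, 8}))       # 3
print(confidence({4}, {8}))  # 1.0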
Check your Progress 1
Multiple Choice Single Response.
1. The left hand side of an association rule is called __________.
i. consequent
ii. onset
iii. antecedent
iv. precedent
2. All sets of items whose support is greater than the user-specified minimum support are called _____________.
i. border set
ii. frequent set
iii. maximal frequent set
iv. lattice
TX    Items
TX1   Shoes, Socks, Tie
TX2   Shoes, Socks, Tie, Belt, Shirt
TX3   Shoes, Tie
It is the process of eliminating extensions of (k-1)-itemsets that are not found to be frequent (a sketch of the full loop follows below).

Consider the following transactional database:
TID   Items
100   1, 3, 4
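The generate-and-prune loop of Apriori can be sketched as follows in Python (a minimal in-memory illustration; the toy transactions at the end are our own):

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c for c in items
               if sum(1 for t in transactions if c <= t) >= min_support}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = sum(1 for t in transactions if c <= t)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates having an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Keep only candidates that meet the support threshold.
        current = {c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support}
    return frequent

# Toy run with min_support = 2 (invented transactions).
ts = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(ts, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)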
Example (FP-growth): consider the transactions of self-assessment question 5 ({M,O,N,K,E,Y}, {D,O,N,K,E,Y}, {M,A,K,E}, {M,U,C,K,Y}, {C,O,K,I,E}), with a minimum support of 3.

Step 1: Scan the database once and count each item's support:
K: 5, E: 4, M: 3, O: 3, Y: 3, N: 2, C: 2, D: 1, A: 1, U: 1, I: 1

Step 2: Remove the items whose support < 3; the remaining frequent items, in descending support order, are K, E, M, O, Y.
[Figures: the FP-tree grown from a null root after inserting each of transactions T1 through T5; the shared prefix K:5 - E:4 accumulates branches such as M:2, Y:1 and O:1 as transactions are added.]

Header table of frequent items:
Item   No. of transactions
K      5
E      4
M      3
O      3
Y      3

The FP-tree is a compact structure for storing the transactional database. Each node represents an item, with a count giving the number of occurrences of the path from the root to that node. Once the tree is ready, no more scans of the transaction database are required.
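The two database scans and the tree construction can be sketched in a few lines of Python (a bare-bones illustration using the transactions of the example above; real implementations also maintain header-table links for the mining phase):

class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    """Two passes: count items, then insert each transaction's frequent
    items in descending support order along a shared prefix path."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {i: c for i, c in counts.items() if c >= min_support}
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            child = node.children.setdefault(item, FPNode(item))
            child.count += 1
            node = child
    return root

tree = build_fp_tree([set("MONKEY"), set("DONKEY"), set("MAKE"),
                      set("MUCKY"), set("COKIE")], min_support=3)
print(tree.children["K"].count)  # 5: all five transactions share prefix K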
The algorithm for finding frequent itemsets starts from the last item in the header table and uses its prefix paths. The conditional pattern base of an item consists of all the prefix paths leading to that item. The conditional pattern base is used to construct a conditional FP-tree, with a header table to which only frequent items are added. When a tree contains a single path, all possible combinations of its items are output.

Item   Conditional Pattern Base              Frequent Pattern Set
O      {K,E,M,Y: 1}, {K,E,Y: 1}, {K,E: 1}    {O}, {O,K}, {O,E}, {O,K,E}
Y      {K,E,M: 1}, {K,E: 1}, {K,M: 1}        {Y}, {Y,K}
M      {K,E: 2}, {K: 1}                      {M}, {M,K}
E      {K: 4}                                {E}, {E,K}
K      -                                     {K}

Table: Conditional pattern bases and their corresponding frequent itemsets.

Analysis of the FP-growth algorithm:
1. The FP-growth algorithm avoids scanning the database more than twice: it scans once to find the frequent items and a second time to construct the FP-tree.
2. It allows the support count to be selected dynamically while mining frequent itemsets. The complete FP-tree for all items can be generated once; depending on the support count, the upper part of the FP-tree can then be used for frequent mining.
8.4 MINING MULTILEVEL ASSOCIATION RULES FROM TRANSACTION DATABASES, RELATIONAL DATABASES

In many applications, it is difficult to discover associations among data items at a low level of abstraction, due to the sparsity of data in multidimensional space. Data mining systems therefore provide capabilities to mine association rules at multiple levels of abstraction and to traverse easily among the different abstraction spaces.

Association rules produced by mining data at more than one level of abstraction are called multiple-level or multilevel association rules. The support-confidence framework is used for mining such rules. Data can be generalised by replacing low-level concepts within the data with their higher-level concepts.
8.4.1 Approaches to Mining Multilevel Association Rules
How is a concept hierarchy used for mining multilevel association rules? A top-down strategy is used: frequent itemsets are calculated at each concept level, beginning at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found. The main approaches are as follows.

a. Using uniform minimum support for all levels
A single minimum support threshold is used when mining at every level of abstraction. This simplifies the search procedure, and the user has to specify only one minimum support threshold. Since an ancestor is a superset of its descendants, an optimization can be applied: the search avoids examining any itemset containing an item whose ancestors do not have minimum support.
[Figure: a two-level concept hierarchy mined with a uniform min_sup = 5% at both levels; at level 2, laptop computer has support = 6% and desktop computer has support = 4%.]
Fig. 8.1: Using uniform minimum support for all levels
In the above example, a minimum support threshold of 5% is used throughout, from 'computer' down to 'laptop computer'. Therefore, 'computer' and 'laptop computer' (support = 6%) are frequent items, while 'desktop computer' (support = 4%) is not.

The minimum support threshold value is decided based upon the nature of occurrence of items in the given itemset. If the minimum support threshold is set too high, associations at low levels of abstraction may be missed; if it is set too low, uninteresting patterns may be produced at high levels of abstraction.
b. Using reduced minimum support at lower levels (referred to as reduced support)
This approach uses a reduced minimum support at lower levels: the deeper the level of abstraction, the smaller the corresponding threshold. For example, in Figure 8.1, if the minimum supports for levels 1 and 2 are 5% and 3%, respectively, then 'computer', 'laptop computer' and 'desktop computer' are all frequent.

For mining multilevel associations with reduced support, there are a number of alternative search strategies:
Level-by-level independent:
In this approach, pruning does not require background knowledge of frequent itemsets. Each node is examined independently, irrespective of whether its parent node is frequent or not.
Level-cross-filtering by single item:
In this technique, a node is examined only if its parent node is frequent. That is, the parent node is checked first; if it is frequent, its children will be examined, otherwise its children are pruned from the search.
Level-cross-filtering by k-itemset:
In this method, instead of checking a single item, the frequency of an itemset is checked. A k-itemset at level l is examined only if its corresponding parent k-itemset at level (l-1) is frequent.
8.5 APPLICATION OF ASSOCIATION MINING
● Market-basket analysis
Association mining helps companies find the demand for products, i.e., the most frequent itemsets. This helps companies decide which items to stock in which stores, as well as how to display them within a store.
● Retail / Marketing
Finding associations among customer demographic characteristics.
Summary
● Association mining is the discovery of relationships between various itemsets in transactional and relational databases.
● An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value, the support threshold.
● Association rules that contain a single predicate are referred to as single-dimensional association rules.
● An association between more than one attribute is called multidimensional association mining.
● The Apriori algorithm is used to find frequent itemsets. It is called a level-wise algorithm.
● The main limitation of the Apriori algorithm is that it requires candidate generation and testing.
● Apriori mining requires multiple scan passes and generates many candidates for long datasets.
● Frequent-pattern growth does not generate candidate sets. The frequent itemsets are generated and stored in a compact tree structure, so that database scans are reduced.
Keywords

5. Find frequent itemsets for the following transactional dataset using Apriori. An item is said to be frequent if it is bought at least 2 times.

Transaction ID   Items Bought
T1               {M, O, N, K, E, Y}
T2               {D, O, N, K, E, Y}
T3               {M, A, K, E}
T4               {M, U, C, K, Y}
T5               {C, O, K, I, E}
9
Structure:
9.1 Introduction
9.2 Classification and Prediction
9.3 Issues Regarding Classification and Prediction
9.4 Classification by Decision Tree Induction
9.5 Classification by Bayesian Classification
9.6 Classification by Back Propagation
9.7 Classification Based on Concepts from Association Rule Mining
9.8 Prediction
9.9 Accuracy and Error Measures
9.10 Evaluating Accuracy of Classifier or Predictor
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
In this chapter, you will learn about classification as a data mining task. The chapter also explains the difference between classification and prediction. Different classifiers, such as decision trees, Bayesian classifiers and backpropagation, are discussed, and classification based on association rule mining is explored.

Supervised Learning and Unsupervised Learning
In supervised learning, the class label of each training record is predefined, which is why this step is called supervised learning. Classification is an example of a supervised learning technique.

Unsupervised learning applies to datasets where the class label of the training data is unknown. Sometimes, even the total number of classes to be formed is unknown in advance. Clustering is an unsupervised learning technique.
Data Types
1. Discrete Data
Discrete data can take only particular values, either from a fixed, predictable set or from a countably infinite set of values. Examples: zip codes, the set of words in a collection of documents, male or female, good or bad.
2. Continuous Data
Continuous data is not limited to particular values but can take any value over a continuous range. Examples: temperature, age, height, weight, experience in years.
9.2 CLASSIFICATION AND PREDICTION
Classification is a data mining technique used to predict categorical class labels. For example, an insurance company needs data analysis to predict whether a customer will buy new policies or not, a company wants to identify good customers based on data about old customers, or an automobile company wants to predict whether a customer will buy a car based on customer data. In all of these examples, the classification task is applied.
For example, a marketing manager wants to estimate how much a given customer will spend during a sale. This requirement is numeric prediction: the model is designed to predict a value rather than a class label. Regression analysis is a statistical method often used for numeric prediction.
----------------------
Fill in the Blanks.
1. ______ is a data mining technique used to predict categorical class labels.
2. In ______ type of learning, the class label of each training record is predefined.
3. ______ can only take particular values.
Fill in the Blanks.
1. ______ of a classifier is the correct prediction of the model for previously unknown data.
2. ______ is the process of converting information from one format to another.
3. ______ of data is done to reduce noise and handle missing values in data.
9.4 CLASSIFICATION BY DECISION TREE INDUCTION

A decision tree is a tree-like structure. The tree has three types of nodes:
1) Root node: It has no incoming edges and zero or more outgoing edges.
2) Internal node: It has exactly one incoming edge and two or more outgoing edges.
3) Leaf (terminal) node: It has exactly one incoming edge and no outgoing edges. A leaf node represents a class label.

Decision tree classifiers are popular because constructing a decision tree does not require prior domain knowledge. Decision trees can handle high-dimensional data, have good accuracy, and are easily converted into simple, understandable classification rules.

Decision tree algorithms are used in many applications, such as medicine, production, manufacturing and financial analysis.
Decision Tree Induction
During the late 1970s and 1980s, J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser); Quinlan later presented C4.5, a successor of ID3. In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen and C. Stone) published Classification and Regression Trees (CART). ID3, C4.5 and CART construct decision trees in a top-down, recursive, divide-and-conquer manner.
Steps to build a decision tree
1) Select an attribute as the root node.
2) Find the possible values of the attribute and derive one branch for each possible value.
3) Repeat step 2 recursively for each branch until a unique class label is determined.
A learning algorithm for decision trees must address the following issues:
● How to split the training records
A decision tree is created by recursively selecting an attribute test condition and splitting the records into smaller subsets. The learning algorithm should provide a method to specify the test condition for each attribute type, as well as a measure of the goodness of each test condition.
● Stopping criteria for splitting
Attributes are split recursively along each branch until a unique class label is determined, all records belong to the same class, or all records have the same attribute values.
Measures for selecting the best split
The best-split measures are based on the degree of impurity of the child nodes. One impurity measure is entropy:

entropy(p1, p2, ..., pn) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn

Entropy is a measure of how "mixed up" an attribute is.

● Information gain:
Information gain determines the most relevant attribute. When splitting a decision tree node, the information gain is the reduction in entropy obtained by partitioning on a specific attribute:

Information Gain = Entropy(X) - Entropy(X | Y)

How to select the root node?
To select the root node, the information gain of each attribute is calculated, and the attribute that gives the largest information gain is selected. The ID3 algorithm uses information gain as its attribute selection measure.
2) Nominal attributes
Nominal attributes can take multiple values. Some decision tree algorithms, such as CART, generate only binary splits; in such cases, multiple attribute values can be grouped together.
Information Gain = Entropy(X) - Entropy(X | Y)
= Entropy(Play Tennis) - Entropy(Play Tennis | Outlook)
= 0.940 - 0.694
= 0.246

Similarly, calculate the information gain for the remaining attributes and select the attribute with the maximum information gain. The attribute "Outlook" has the maximum information gain.
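To make the calculation concrete, here is a short Python sketch that reproduces these numbers on the standard Play Tennis data (the fourteen rows are the usual Quinlan textbook example; helper names are illustrative):

from math import log2
from collections import Counter

# (Outlook, PlayTennis) pairs from the classic Play Tennis dataset.
data = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]

def entropy(labels):
    # entropy(p1..pn) = -sum(pi * log2(pi)) over the class proportions.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = [y for _, y in data]
h_x = entropy(labels)  # Entropy(Play Tennis) ~ 0.940

# Conditional entropy: weighted entropy of each Outlook partition.
h_x_given_outlook = sum(
    (len(part) / len(data)) * entropy([y for _, y in part])
    for v in {"Sunny", "Overcast", "Rain"}
    for part in [[row for row in data if row[0] == v]]
)  # ~ 0.694

print(round(h_x, 3), round(h_x - h_x_given_outlook, 3))  # 0.94 0.246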
The decision tree of Fig. 9.1 can be converted into classification rules by starting from the root node and tracing the path to each leaf node. A model represented in this way, using IF-THEN rules, is called a rule-based classifier.
9.5 CLASSIFICATION BY BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers which use class probabilities to predict the class of an unknown tuple. The simple Bayesian classifier is also known as the naïve Bayesian classifier. Naïve Bayesian classifiers assume class-conditional independence. When applied to large databases, Bayesian classifiers show high accuracy and speed.

Bayesian classification is based on Bayes' theorem:
Let X be a data tuple and H the hypothesis that X belongs to a specific class C. The posterior probability of the hypothesis H given X, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
Towards the Naïve Bayesian Classifier
● Consider a training set of tuples D with their associated class labels. Each tuple is represented as an n-dimensional attribute vector X = (x1, x2, ..., xn).
● Let C1, C2, ..., Cm be the classes.
● Classification derives the maximum posteriori, i.e. the class Ci that maximizes P(Ci|X).
● This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X). Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
Consider the weather data used as training data in the decision tree example. To find out whether the game will be played or not (play = yes or play = no), consider the following test phase:

X' = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
P(Outlook = Sunny | Play = Yes) = 2/9
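A minimal sketch of the complete naïve Bayes computation for this test tuple, assuming the standard fourteen-row Play Tennis training data and no smoothing:

from collections import Counter

# Columns: Outlook, Temperature, Humidity, Wind, Play (classic Play Tennis data).
rows = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

x_new = ("Sunny", "Cool", "High", "Strong")
class_counts = Counter(r[-1] for r in rows)

scores = {}
for c, n_c in class_counts.items():
    # P(Ci) times the product of P(xk | Ci), by class-conditional independence.
    p = n_c / len(rows)
    for k, value in enumerate(x_new):
        p *= sum(1 for r in rows if r[-1] == c and r[k] == value) / n_c
    scores[c] = p

print(scores)  # P(X'|Yes)P(Yes) ~ 0.0053, P(X'|No)P(No) ~ 0.0206
print(max(scores, key=scores.get))  # -> "No"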
Check your Progress 4

Fill in the Blanks.
1. Bayesian classifiers are statistical classifiers that use ______ to predict the class of an unknown tuple.
2. ______ are graphical models that allow the representation of dependencies among subsets of attributes.
9.6 CLASSIFICATION BY BACK PROPAGATION

Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output units, where each connection has a weight associated with it. It is also called connectionist learning because of the connections between units. In the learning phase, the network learns by adjusting its weights to predict the correct class. A neural network has a high tolerance to noisy data and performs satisfactorily in domains where little is known about the data; it is well suited to real-world tasks such as handwritten character recognition, pathology and laboratory medicine, and training a computer to pronounce English text.

The backpropagation algorithm learns on a multilayer feed-forward neural network, which consists of an input layer, one or more hidden layers and an output layer.
Back Propagation
Backpropagation learns by iteratively processing the training dataset and comparing the network's prediction with the target value. The target value may be a known numeric value for prediction or a class label for classification.

For each training tuple, the weights are modified to minimize the error between the network's prediction and the actual target value. These modifications are made in the backward direction, from the output layer down to the first hidden layer; hence the name backpropagation. The computational efficiency depends on the time spent training the network.
Fig. 9.2: Back propagation

Consider Fig. 9.2 above.
Unit j is a unit in a hidden or output layer. The inputs to unit j, labelled y1, y2, ..., yn, are the outputs of the units in the previous layer. The net input Ij to unit j is the weighted sum of these inputs, each multiplied by its weight wij, plus the bias θj associated with unit j:

Ij = Σi wij yi + θj

The unit then applies an activation function (typically the sigmoid) to its net input, so the output Oj of unit j is computed as

Oj = 1 / (1 + e^(-Ij))

For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 - Oj) (Tj - Oj)

where Tj is the known target value. The error of a hidden-layer unit j is

Errj = Oj (1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next layer and Errk is the error of unit k. Weights are updated by the following equations, where Δwij is the change in weight wij and l is the learning rate:

Δwij = (l) Errj Oi
wij = wij + Δwij

Biases are updated similarly: Δθj = (l) Errj and θj = θj + Δθj.
Fig 9.3: An example of a multilayer feed-forward neural network
Table 9.1: Initial input, weight and bias values

x1  x2  x3  w14  w15  w24  w25  w34  w35  w46  w56  θ4   θ5  θ6
1   0   1   0.2  -0.3 0.4  0.1  -0.5 0.2  -0.3 -0.2 -0.4 0.2 0.1
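A minimal Python sketch of one forward and backward pass over the network of Fig. 9.3 with the Table 9.1 values (the target value 1 and learning rate 0.9 are assumptions, taken from the way such textbook examples are usually run):

import math

# Table 9.1 values: inputs, weights and biases for the network of Fig. 9.3.
x = {1: 1.0, 2: 0.0, 3: 1.0}
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
target, lr = 1.0, 0.9  # assumed target class and learning rate

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Forward pass: Ij = sum(wij * Oi) + theta_j, then Oj = sigmoid(Ij).
O = dict(x)
for j in (4, 5, 6):
    I_j = sum(w[(i, j)] * O[i] for i in O if (i, j) in w) + theta[j]
    O[j] = sigmoid(I_j)

# Backward pass: Err6 = O6(1-O6)(T-O6); hidden Errj = Oj(1-Oj) * Err6 * wj6.
err = {6: O[6] * (1 - O[6]) * (target - O[6])}
for j in (4, 5):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[(j, 6)]

# Updates: wij += lr * Errj * Oi ; theta_j += lr * Errj.
for (i, j) in w:
    w[(i, j)] += lr * err[j] * O[i]
for j in theta:
    theta[j] += lr * err[j]

print(round(O[6], 3), {k: round(v, 4) for k, v in err.items()})
# O6 ~ 0.474, Err6 ~ 0.1311, Err5 ~ -0.0065, Err4 ~ -0.0087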
Linear Regression
In linear regression, data are modelled to fit a straight line:

y = b + wx

where b and w are regression coefficients specifying the Y-intercept and the slope of the line, respectively. We can consider w and b, the regression coefficients, as weights and equivalently write

y = w0 + w1x

Multiple linear regression is an extension of straight-line regression involving more than one predictor variable. Logistic regression and Poisson regression are generalized linear models (GLMs).
Non-Linear Regression
In the above equation, y is modelled as a linear function of a single independent predictor variable x. A nonlinear model can often be transformed into a linear model by applying a transformation to the variables. Polynomial regression is used when there is just one predictor variable and the relationship is polynomial.

Transformation of a polynomial regression model to a linear regression model: consider a cubic polynomial relationship given by

y = w0 + w1x + w2x^2 + w3x^3

Defining the new variables x1 = x, x2 = x^2 and x3 = x^3 converts this into the linear model y = w0 + w1x1 + w2x2 + w3x3, which can be solved by the method of least squares.
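A small numpy sketch of this transformation, fitting the cubic by ordinary least squares on the derived variables (the sample data points are invented for illustration):

import numpy as np

# Hypothetical sample data, invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.6, 3.0, 3.2, 3.4, 5.0])

# Transform the cubic model into a linear one: columns 1, x, x^2, x^3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Solve for w = (w0, w1, w2, w3) by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3))  # fitted regression coefficients w0..w3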
Fill in the Blanks.
1. The value which we want to predict is called ______.
The confusion matrix is useful for seeing how a classifier has classified records of different classes; it also displays the count of records misclassified by the classifier. A confusion matrix is a table of at least size m by m. An entry CMi,j in the first m rows and m columns indicates the number of tuples of class i that were labelled by the classifier as class j.
● True positive (TP) rate: the actual class and the predicted class are the same.
● False positive (FP) rate: the record is predicted to be in the class but does not belong to that class.
● Precision: the fraction of retrieved records that are relevant; it is calculated as the ratio of the number of relevant records retrieved to the total number of records retrieved (irrelevant and relevant).
Precision = TP / (TP + FP)
● Recall: the total number of true positive records divided by the total number of records that actually belong to the positive class (i.e., the sum of true positives and false negatives, the latter being records classified as negative that actually belong to the positive class).
Recall = TP / (TP + FN)
● F-Measure: a measure of a test's accuracy, computed as the harmonic mean of precision and recall. The score reaches its best value at 1 and its worst at 0.
F = 2 x Precision x Recall / (Precision + Recall)
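A minimal sketch computing these measures from a hypothetical binary confusion matrix (the counts are invented for illustration):

# Hypothetical confusion matrix for a binary classifier:
#               predicted +   predicted -
# actual +          TP=90         FN=10
# actual -          FP=20         TN=80
TP, FN, FP, TN = 90, 10, 20, 80

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, round(precision, 3), round(recall, 3), round(f_measure, 3))
# 0.85 0.818 0.9 0.857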
9.10 EVALUATING ACCURACY OF CLASSIFIER OR PREDICTOR

To estimate the accuracy of a classifier or predictor, there are some common techniques, such as the holdout method, random subsampling, cross-validation and the bootstrap method.

Holdout Method
In the holdout method, the data are randomly divided into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The training set is used to derive the model, whose accuracy is then estimated on the test data.
Fig. 9.4: Evaluating Accuracy Using the Holdout Method
Random Subsampling
In random subsampling, the holdout method is repeated k times. The average of the accuracies obtained in each iteration is taken as the overall accuracy.

Cross-validation
In cross-validation, the initial data are randomly reordered and then divided into n folds of approximately equal size. In each iteration, one fold is used for testing and the remaining folds are used for training, so every fold serves as the test set exactly once.
Bootstrap
The bootstrap method samples the given training records uniformly with replacement: records are randomly selected for the training set, and the same record may be selected more than once. The records that are not selected for the training set form the test set.

On average, 63.2% of the original data records end up in the bootstrap training set, and the remaining 36.8% form the test set. On each draw, every record is selected with probability 1/d, so the probability of a particular record never being selected in d draws is (1 - 1/d)^d. If d is large, this probability approaches e^(-1) ≈ 0.368 (e ≈ 2.718). Therefore about 36.8% of the records are never selected and are used for the test set, and the remaining 63.2% are used for the training set.
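A quick simulation sketch checking the 0.632/0.368 figures (the dataset size and trial count are arbitrary):

import random

d, trials = 1000, 200
in_sample_fracs = []
for _ in range(trials):
    # Sample d records uniformly with replacement, as the bootstrap does.
    chosen = {random.randrange(d) for _ in range(d)}
    in_sample_fracs.append(len(chosen) / d)

print(round(sum(in_sample_fracs) / trials, 3))  # ~ 0.632 (training fraction)
print(round((1 - 1 / d) ** d, 3))               # ~ 0.368 (excluded fraction)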
Fill in the Blanks.
1. In the holdout method, data are randomly divided into two independent sets: ______ and ______.
2. The ______ method samples the given training tuples uniformly with replacement.
Summary
● A decision tree is a tree-like structure in which the topmost node is the root node, each non-leaf node represents a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. ID3, C4.5 and CART are decision tree techniques that construct decision trees in a top-down, recursive, divide-and-conquer manner.
● Bayesian classifiers are statistical classifiers that use class probabilities to predict the class of an unknown tuple.
● Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it. It is also called connectionist learning because of the connections between units.
● Associative classification is a concept in which association-based rules are generated and used for classification purposes.
● Prediction is the estimation of a numeric value. Regression analysis is used to find relationships between one or more independent (predictor) variables and a dependent (response) variable.
1. There are two approaches to tree pruning: prepruning and postpruning.
2. Information gain is used to decide which of the attributes are the most relevant.
10
Structure:
10.1 Introduction
10.2 Clustering and Outliers
10.2.1 Good Clustering
10.2.2 Measuring Dissimilarity or Similarity in Clustering
10.3 Clustering Techniques
10.4 Multidimensional Analysis-Descriptive Mining of Complex Data Objects
10.5 Mining Spatial Databases
10.6 Mining Multimedia Databases
10.7 Mining Time-Series
10.8 Mining Sequence Data
10.9 Mining Text Databases
10.9.1 Text mining process
10.10 Mining the WWW
10.10.1 Web Structure Mining
10.10.2 Web Usage Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
10.1 INTRODUCTION

In the previous units, we focused on mining relational databases, transactional databases and data warehouses formed by the transformation and integration of structured data. Vast amounts of data in various complex forms (e.g., structured and unstructured, hypertext and multimedia) have been growing explosively owing to the rapid progress of data collection tools, advanced database system technologies and World Wide Web (WWW) technologies. Therefore, an increasingly important task in data mining is to mine complex types of data, including complex objects, spatial data, multimedia data, time-series data, text data and the World Wide Web.

In this chapter, we examine how to further develop the essential data mining techniques (such as characterization, association, classification and clustering), and how to develop new ones, to cope with complex types of data and perform fruitful knowledge mining in complex information repositories. Since research into mining such complex databases has been evolving at a hasty pace, our discussion covers only some preliminary issues.
10.2 CLUSTERING AND OUTLIERS

Clustering is a process of dividing a set of data into a set of meaningful sub-classes, called clusters. Clustering is unsupervised learning, where classes are not predefined; it is a method of learning through observation rather than learning by example, finding a natural grouping of instances in unlabelled data.

Clustering can thus also be described as the process of organizing objects into groups in which objects are "similar" within a group and "dissimilar" to the objects belonging to other clusters.

Outliers
Johnson (Johnson, 1992) defines an outlier as an observation in a dataset which appears to be inconsistent with the remainder of that set of data. Outliers are often considered errors or noise, but they may carry important information about abnormal characteristics of the systems and entities that affect the data-generation process.
Fig. 10.2: Well Separated Clusters

Fig. 10.3: Each point is closer to the centre of its cluster
3. Graph-Based or Contiguity-Based Clusters
In this technique, the data are represented as a graph: nodes represent objects, and links between nodes represent relationships between objects. Objects in a group are connected to one another, and there are no connections to objects outside the group. Clusters of this type are also called connected components. Two objects within a specified distance of each other can be connected.
Fig. 10.4: Each point in a cluster is closer to at least one point in its cluster
4. Density-Based Clusters
A cluster is a dense region of objects. Density-based clustering separates regions of high density from regions of low density and is useful when noise and outliers are present.
Fig. 10.5: High- and low-density regions are separated

Fig. 10.6: Some points in a cluster share common properties
Data Structures

Data matrix: This represents n objects with p variables each, as an n x p matrix in which each row corresponds to an object and each column to a variable.
Dissimilarity matrix
This is represented by an n-by-n table. It is the set of proximities that are available for all pairs of the n objects; d(i, j) is the measured difference or dissimilarity between objects i and j. A common dissimilarity measure for numeric data is the Euclidean distance:

d(X, Y) = sqrt( Σi (Xi - Yi)^2 )
Manhattan distance:

d(X, Y) = Σi |Xi - Yi|,  i = 1, ..., n

where n is the number of variables, and Xi and Yi are the values of the ith variable at points X and Y, respectively.

Max of dimension (Chebyshev distance):

d(X, Y) = max over i of |Xi - Yi|,  i = 1, ..., k

where k is the number of variables.
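A tiny Python sketch of these two distance functions (function names are illustrative):

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def max_of_dimension(x, y):
    # Largest absolute difference over any single variable (Chebyshev distance).
    return max(abs(a - b) for a, b in zip(x, y))

print(manhattan((2, 10), (5, 8)))         # |5-2| + |8-10| = 5
print(max_of_dimension((2, 10), (5, 8)))  # max(3, 2) = 3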
Check your Progress 1

Fill in the Blanks.
1. Clustering is an ______ type of learning, where classes are not predefined.
2. ______ is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
1. Partitioning algorithms:
Initially, this method creates k partitions; at least one object must belong to each partition. It then iteratively refines the clusters by moving objects from one group to another to improve the quality of the clusters. The k-means and k-medoids algorithms are used for forming such clusters.
2. Hierarchical algorithms:
These create a hierarchical decomposition of the set of data based upon various criteria. The result is a set of nested clusters organized as a tree: the cluster at the root of the tree contains all objects, and each node in the tree is the union of its subclusters.
Hierarchical clustering methods are further classified as either agglomerative or divisive. In agglomerative clustering, the tree is built bottom-up, starting from individual objects and merging clusters; in divisive clustering, the decomposition is formed top-down, starting from one all-inclusive cluster and splitting it.
Example: Use k-means to cluster the following eight points into three clusters: A1 (2, 10), A2 (2, 5), A3 (8, 4), A4 (5, 8), A5 (7, 5), A6 (6, 4), A7 (1, 2), A8 (4, 9).
1. Select the initial cluster centres: A1 (2, 10), A4 (5, 8) and A7 (1, 2).
2. The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 - x1| + |y2 - y1|.
3. Use the k-means algorithm to find the three cluster centres after the second iteration.

● The initial cluster centres (means) are (2, 10), (5, 8) and (1, 2).
Calculate the distance from each point to each of the three means using the distance function. For the first point A1 (2, 10) and mean1 (2, 10):
ρ(A1, mean1) = |2 - 2| + |10 - 10| = 0
For the second point A2 (2, 5) and mean1 (2, 10):
ρ(A2, mean1) = |2 - 2| + |10 - 5| = 5
For A2 (2, 5) and mean2 (5, 8):
ρ(A2, mean2) = |5 - 2| + |8 - 5| = 3 + 3 = 6
Each point is assigned to the cluster whose mean is nearest. Repeating this for every point, the table after the first iteration is:

Point       Dist Mean 1 (2, 10)  Dist Mean 2 (5, 8)  Dist Mean 3 (1, 2)  Cluster
A1 (2, 10)          0                   5                   9              1
A2 (2, 5)           5                   6                   4              3
A3 (8, 4)          12                   7                   9              2
A4 (5, 8)           5                   0                  10              2
A5 (7, 5)          10                   5                   9              2
A6 (6, 4)          10                   5                   7              2
A7 (1, 2)           9                  10                   0              3
A8 (4, 9)           3                   2                  10              2

The new means are the centroids of the clusters. Cluster 1 contains only A1, so its mean remains (2, 10). For Cluster 2, we have ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6). For Cluster 3, we have ((2+1)/2, (5+2)/2) = (1.5, 3.5).
New means: (2, 10), (6, 6), (1.5, 3.5)

Next, process Iteration 2, Iteration 3, and so on, until the means do not change any more.
After Iteration 2: C1 = (3, 9.5), C2 = (6.5, 5.25), C3 = (1.5, 3.5)
Clusters: 1 {A1, A8}, 2 {A3, A4, A5, A6}, 3 {A2, A7}
After Iteration 3: C1 = (3.6, 9), C2 = (7, 4.3), C3 = (1.5, 3.5)
Clusters: 1 {A1, A4, A8}, 2 {A3, A5, A6}, 3 {A2, A7}
After the third iteration, the mean values remain the same, so the algorithm halts at this step.
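A compact Python sketch of this k-means run, using the Manhattan distance and the seed points of the example:

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
means = [(2, 10), (5, 8), (1, 2)]  # initial centres A1, A4, A7

def dist(a, b):
    # Manhattan distance, as in the example.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

while True:
    # Assignment step: each point joins its nearest mean.
    clusters = [[] for _ in means]
    for p in points.values():
        clusters[min(range(len(means)), key=lambda i: dist(p, means[i]))].append(p)
    # Update step: each mean becomes the centroid of its cluster.
    new_means = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                 for c in clusters]
    if new_means == means:  # converged: the means no longer change
        break
    means = new_means

print(means)  # approximately [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]

As the disadvantages listed below note, a different initial guess of the centroids can lead to a different final result.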
● Advantages
1. It is easy to implement and works with any standard norm.
2. It is not sensitive to data ordering, and it allows straightforward parallelization.
● Disadvantages
1. The result depends on the initial guess of the centroids.
2. It is sensitive to outliers.
3. It is not obvious what a good number of clusters k is in each case.
4. The resulting clusters can be unbalanced or even empty.
Hierarchical clustering:
In hierarchical (agglomerative) clustering, clusters are grouped and merged with each other until one cluster is left.
Algorithm:
Input: training dataset. Output: a hierarchical clustering.
● Start by treating each object as a separate cluster.
● Calculate the proximity matrix.
● Merge the two nearest clusters, based on some criterion (a distance measure).
● Repeat until only one cluster is left.
     A    B    C    D    E    F
A   0.0  1.0  4.0  8.0  9.0  2.0
B   1.0  0.0  2.5  7.0  6.0  4.5
C   4.0  2.5  0.0  2.0  3.0  5.5
(The rows for D, E and F are not reproduced here; in the full example matrix, the smallest entry is d(D, E) = 0.4.)
STEP 1
Find the lowest value in the proximity matrix. In the example above it is 0.4, the d(D, E) entry, so clusters D and E will merge. When merging, take the lowest value between the corresponding row and column entries, and update the proximity matrix.
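A minimal single-linkage sketch of this procedure (the D, E and F entries other than d(D, E) = 0.4 are invented, since the full example matrix is not reproduced in the text):

from itertools import combinations

# Hypothetical symmetric proximity matrix over labels A..F.
labels = ["A", "B", "C", "D", "E", "F"]
d = {("A","B"): 1.0, ("A","C"): 4.0, ("A","D"): 8.0, ("A","E"): 9.0, ("A","F"): 2.0,
     ("B","C"): 2.5, ("B","D"): 7.0, ("B","E"): 6.0, ("B","F"): 4.5,
     ("C","D"): 2.0, ("C","E"): 3.0, ("C","F"): 5.5,
     ("D","E"): 0.4, ("D","F"): 6.0, ("E","F"): 7.0}

def dist(a, b):
    # Look up the proximity in either orientation (the matrix is symmetric).
    return d[(a, b)] if (a, b) in d else d[(b, a)]

# Single-linkage agglomerative clustering: repeatedly merge the two nearest
# clusters until one cluster remains.
clusters = [frozenset([l]) for l in labels]
while len(clusters) > 1:
    ci, cj = min(combinations(clusters, 2),
                 key=lambda pair: min(dist(a, b) for a in pair[0] for b in pair[1]))
    clusters.remove(ci); clusters.remove(cj)
    merged = ci | cj
    print(sorted(merged))  # the first merge is ['D', 'E'], the 0.4 entry
    clusters.append(merged)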
Each object in a class is associated with an object identifier, a set of attributes and a set of methods describing computational rules.

If complex data need to be analysed and mined, this requires setting up a multidimensional data warehouse for the complex data and then developing effective and scalable methods for mining it.

Object-relational and object-oriented databases have features to handle complex data by storing, accessing and modelling complex-valued data, for example set-valued and list-valued data and data with nested structures.
A set-valued attribute can be homogeneous or heterogeneous in nature. Such data can be generalized by generalizing each value in the set to its corresponding higher-level concept, or by deriving a general description of the set. Generalization is carried out by applying various generalization operators to explore alternative generalization paths.
A set-valued attribute
Suppose that the hobby of a person is a set-valued attribute containing the set of values {cricket, basketball, violin, solitaire}. This set can be generalized to higher-level concepts, such as {sports, music, computer games}. To show how many elements have been generalized, a count is placed with the generalized values: {sports (2), music (1), computer games (1)}.

Let us consider a person's education data record: ((B.A. Arts, Pune University, June 2000), (Ph.D. Computer Science, Mumbai University, Dec. 2005)). These records can be represented by removing
10.5 MINING SPATIAL DATABASES

A spatial database is a database that stores space-related data, such as maps, pre-processed remote sensing or medical imaging data, and VLSI chip layout data.

Spatial data mining is the process of discovering interesting, useful spatial relationships and non-trivial patterns from large spatial datasets. Examples of spatial patterns include cancer clusters used to investigate environmental health hazards, crime hotspots for planning police patrol routes, and bald eagles nesting in tall trees near open water.

One of the challenges in spatial data mining is that information is usually not uniformly distributed in spatial datasets. Spatial patterns are detected using classification, association, clustering and outlier detection.
10.7 MINING TIME-SERIES

A time-series database is a special type of database consisting of sequences of values obtained over time. For example, financial data contain objects that are time series of daily prices of various stocks. If two measurements are close in time, the values of those measurements are often similar.

Time-series forecasting finds a mathematical formula that will approximately generate the historical patterns in a time series.

Analysis tasks for time series include feature extraction, similarity measurement, segmentation of the dataset, matching two time series, and clustering and classifying time-series data.
Similarity function
A similarity function is required to find, in a series database, the series similar to a given query series. A simple approach is to define the similarity of x and y in terms of the Lp distance between them as points of R^n, but this is not suitable for determining similarity between series at different scales or with different shifts.
Scale-free similarity
Consider an example: two companies have identical stock-price fluctuations, but one company's stock is worth twice as much as the other's. The patterns are similar even though the numeric values are different, and it is important to find such similar time-series objects in data mining.

Shift-free similarity
Temperatures on two different days may start at different values but fluctuate in exactly the same way; this is the same series with two different baselines.

We say that two time series X and Y are similar if there exist a > 0 and b such that yi = a·xi + b for all i.
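A small numpy sketch testing this scale-and-shift similarity by fitting a and b with least squares (the function name and sample series are illustrative):

import numpy as np

def similar(x, y, tol=1e-9):
    # Return (a, b) if y is approximately a*x + b with a > 0, else None.
    x, y = np.asarray(x, float), np.asarray(y, float)
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    if a > 0 and np.allclose(a * x + b, y, atol=tol):
        return a, b
    return None

x = [1.0, 2.0, 3.0, 2.5]
print(similar(x, [2 * v + 5 for v in x]))  # (2.0, 5.0): similar series
print(similar(x, [9.0, 1.0, 4.0, 7.0]))    # None: not similar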
Check your Progress 7

Fill in the Blanks.
1. A ______ is any database that consists of sequences of ordered events, with or without concrete notions of time.
2. A ______ search technique is employed for efficient support counting.
Fig. 10.8: A Text Mining Framework

10.9.1 Text Mining Process
1. "Pre-processing" the text into a structured format.
2. Text transformation and attribute selection.
3. Mining the reduced data with traditional data mining techniques.
1. Data pre-processing
Text data are unstructured and may contain misspellings, abbreviations, punctuation and other non-alphanumeric characters, noisy words, etc. Data pre-processing deals with detecting and removing errors and inconsistencies from the data in order to improve its quality. To make text data useful, unstructured text data are converted into structured data.
Pre-processing includes tokenization, stop-word removal and stemming techniques.
Tokenization: Tokenization is the process of splitting a document into tokens or words (nouns, verbs, pronouns, articles, conjunctions and prepositions) without understanding their meaning.
The data are then cleaned to remove stop words. Stop words are common, frequently used words such as pronouns, prepositions and conjunctions, along with white space and punctuation marks. Words that are too general, such as "the", "an", "a", "and", "unless" and "versus", are removed because they do not contribute any meaning or add any knowledge to the analysis.
Next, stemming is applied to the data. Stemming (or lemmatization) is a technique used to convert words into their root forms. Stemming identifies the word stems of the remaining words by removing suffixes and endings such as -al, -ing, -tion, -ies and -'s. Example: computable, computation, computing and computational all reduce to the stem comput.
The following is a selection of suffixes and prefixes for removal during stemming (David, 1996):
Suffixes: ly, ness, ion, ize, ant, ent, ic, al, ical, able, ance, ary, ate, ce, y, dom, ed, ee, eer, ence, ency, ery, ess, ful, hood, ible, icity, ify, ing, ish, ism, ist, istic, ity, ive, less, let, like, ment, ory, ty, ship, some, ure
Prefixes: anti, bi, co, contra, counter, de, di, dis, en, extra, in, inter, intra, micro, mid, mini, multi, non, over, para, poly, post, pre, pro, re, semi, sub, super, supra, sur, trans, tri, ultra, un
This process gives the words a uniform format. The words are then used to create a bag of words by applying different techniques: the most frequent terms can be used to represent the document (the term frequency technique), or the inverse document frequency technique can be used to build the term matrix.
2. Text transformation and attribute selection
Text transformation covers text representation and feature selection. A text document is represented by the words it contains and their occurrences. There are two main approaches to document representation: the "bag of words" model and the vector space model.
A bag of words is a collection of words in which each word is represented as a separate variable with a numeric weight that depends on the word's occurrences in the document.
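A small Python sketch of this preprocessing pipeline (the stop-word list, suffix list and sample sentence are illustrative, and the suffix-stripping stemmer is deliberately crude):

import re
from collections import Counter

STOP_WORDS = {"the", "an", "a", "and", "unless", "versus", "of", "to", "is"}
SUFFIXES = ("ing", "tion", "ness", "ly", "ed", "ies", "al", "able", "ment")

def tokenize(text):
    # Split into lowercase word tokens, dropping punctuation.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Crude suffix stripping, in the spirit of the suffix list above.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

doc = "Computing and computation: the computable words of a document."
tokens = [w for w in tokenize(doc) if w not in STOP_WORDS]
stems = [stem(w) for w in tokens]
bag_of_words = Counter(stems)  # each stem becomes a weighted variable
print(bag_of_words)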
11
Structure:
11.1 Introduction
11.2 Applications of Data Mining
11.3 Data Mining System Products and Research Prototypes
11.3.1 Examples of Commercial Data Mining Systems
11.4 Additional Themes on Data Mining
11.5 Social Impacts of Data Mining
11.6 Trends in Data Mining
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
11.1 INTRODUCTION

Data mining is a relatively young discipline with diverse applications. In the previous units, we studied the concepts of data mining and the various techniques used for analysing data. Data mining is useful in almost all fields, such as banking, marketing, medicine, fraud detection, manufacturing and production, and scientific data analysis. In this unit, we will discuss various applications of data mining. We shall also analyse its trends.
11.2 APPLICATIONS OF DATA MINING

A few application domains in which data mining tools are used are discussed below.
● Applications of data mining in banking: Banks and financial institutions offer a wide variety of banking services, and data mining can be helpful for the following applications:
Mining data collected by banks
Mining customer data of banks
Loan/credit card approval
Classification and clustering of customers for targeted marketing
Mining for prediction and forecasting
Mining for fraud detection
Mining for cross-selling banking services
Mining for identifying customer preferences
● Data Mining for the Retail Industry
Retail industries have large amounts of data on sales, customer shopping history, goods transportation, consumption and service. Data mining can therefore be used in areas such as the design and construction of data warehouses based on the benefits of data mining.
11.3 DATA MINING SYSTEM PRODUCTS AND RESEARCH PROTOTYPES

Data mining is a young field, and many data mining products and tools are available in the market. To select a data mining system that fits your requirements, it is important to have a multidimensional view of data mining systems. The following are some features by which to assess a data mining system:
Activity 2

List the features of C5.0 and CART.
11.5 SOCIAL IMPACTS OF DATA MINING

Data mining is present in many aspects of our daily lives, affecting how we retrieve information, search, shop and spend time.

Data mining is used by marketing companies to find customer behaviour patterns. Your information may be collected when you use your credit card, debit card, supermarket loyalty card or frequent-flyer card, when you surf the Web, reply to an Internet newsgroup, subscribe to a magazine, and so on. Advertisements and promotional material are then sent to customers' email IDs to target them.

Web-wide tracking is a technology that tracks a user across each site the user visits; this information can be used by marketers.
11.6 TRENDS IN DATA MINING

Traditional data analysis methods fail to handle huge amounts of data efficiently, whereas data mining does. However, there is a need for data mining algorithms that can handle incremental data efficiently.

Integration of data mining with database systems, data warehouse systems and Web database systems
Data mining systems should be smoothly integrated with databases and data warehouses. Such integration ensures data mining portability, data availability, scalability, high performance and an integrated information-processing environment for multidimensional data analysis and exploration.
Summary
● Many customised data mining tools have been developed for domain-specific applications, including finance, the retail industry, telecommunications, bioinformatics, intrusion detection and other science, engineering and government data analysis.
● Researchers have been striving to build theoretical foundations for data mining. Several interesting proposals have appeared, based on data reduction, data compression, pattern discovery, probability theory, microeconomic theory and inductive databases.
● Several well-established statistical methods have been proposed for data analysis.
● Visual data mining integrates data mining and data visualisation in order to discover implicit and useful knowledge from large datasets. Audio data mining uses audio signals to indicate data patterns or features of data mining results.