Big Datawith Data Warehousingand Data Mining NEW020

BIG DATA WITH
DATA WAREHOUSING
AND
DATA MINING
(FOR PRIVATE CIRCULATION ONLY)

2021
CONTENTS
Unit No. TITLE Page No.
1 Big Data 1 – 17
1.1 Data and Big data
1.2 Characteristics of Big data – Vs of Big data
1.3 Types of Big data
1.4 Storage of Big data
1.5 Big data technology
1.6 Big data processing and analyses
1.7 Benefits of Big data
1.8 Applications of Big data in industry
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
2 Introduction to Data Warehouse 18 – 33
2.1 Introduction
2.2 Understanding Data Warehouse
2.3 Difference between OLTP and Data Warehousing Environments
2.4 Basics of Data Warehouse Architecture
Summary
Key Words
Suggested Reading
3 Data Warehouse Architecture 34 – 43
3.1 Introduction
3.2 The Data Warehouse Architecture
3.3 Three-Tier Data Warehouse Architecture for Business analysis
Framework
3.4 Data Warehouse Models
Summary
Key Words
Suggested Reading
4 Dimensional Modeling 44 – 53
4.1 Introduction
4.2 ER Model versus Dimensional Model
4.2.1 ER Model
4.2.2 Dimensional Model
4.2.3 Differences between Dimensional Model and Relational
Model
4.3 Dimensional Modeling Technique
4.4 Dimensional Modeling Process
4 .5 Benefits of Dimensional Modeling
Summary
Key Words
Suggested Reading
v
5 Data Warehouse Implementation 54 – 63
5.1 Introduction
5.2 Physical Database Design
5.3 Hardware and I/O Considerations
5.4 Integrity Constraints
5.5 Dimensions
5.6 Aggregation
Summary
Key Words
Suggested Reading
6 Data Warehouse and OLAP Technologies 64 – 77
6.1 Introduction
6.2 OLAP Technology
6.3 ROLAP and MOLAP Processing
6.4 Database Design Methodology
6.4.1 Star Schema
6.4.2 Snowflake Schema
6.5 Server Architectures for Query Processing
6.5.1 SQL Extensions
Summary
Key Words
Suggested Reading
vi
7 Introduction to Data Mining 78 – 93
7.1 Introduction
7.2 Data Mining
7.2.1 Data Mining and Knowledge Discovery
7.2.2 Architecture of a Typical Data Mining System
7.3 Motivating Challenges
7.4 Data Mining Functionalities
7.4.1 Concept/Class Description
7.4.2 Mining Frequent Patterns, Associations and Correlations
7.4.3 Classification and Prediction
7.4.4 Cluster Analysis
7.4.5 Outlier Analysis
7.5 Classification of Data Mining Systems
7.6 Data Mining Task
7.7 Major Issues in Data Mining
Summary
Key Words
Suggested Reading
8 Mining Association Rules 94 – 111
8.1 Introduction
8.2 Association Rule Mining
8.2.1 Association Rules
8.3 Mining Single-Dimensional Boolean Association Rules from
Transactional Databases
8.3.1 Different Data Formats for Mining
8.3.2 Apriori Algorithm
8.3.3 Frequent Pattern Growth (FP-growth) Algorithm
8.4 Mining Multilevel Association Rules from Transaction Databases,
Relational Databases
8.4.1 Approaches to Mining Multilevel Association Rules
8.5 Application of Association Mining
Summary
Key Words
Suggested Reading
vii
9 Classification and Prediction 112 – 133
9.1 Introduction
9.2 Classification and Prediction
9.3 Issues Regarding Classification and Prediction
9.4 Classification by Decision Tree Induction
9.5 Classification by Bayesian Classification
9.6 Classification by Back Propagation
9.7 Classification Based on Concepts from Association Rule Mining
9.8 Prediction
9.9 Accuracy and Error Measures
9.10 Evaluating Accuracy of Classifier or Predictor
Summary
Key Words
Suggested Reading
10 Mining Complex Types of Data 134 – 159
10.1 Introduction
10.2 Clustering and Outliers
10.2.1 Good Clustering
10.2.2 Measuring Dissimilarity or Similarity in Clustering
10.3 Clustering Techniques
10.4 Multidimensional Analysis-Descriptive Mining of Complex Data
Objects
10.5 Mining Spatial Databases
10.6 Mining Multimedia Databases
10.7 Mining Time-Series
10.8 Mining Sequence Data
10.9 Mining Text
Databases
10.9.1 Text mining process
10.10 Mining the WWW
10.10.1 Web Structure Mining
10.10.2 Web Usage Mining
Summary
Key Words
viii Suggested Reading
11 Data Mining Applications and Trend 160 – 171
11.1 Introduction
11.2 Applications of Data Mining
11.3 Data Mining System Products and Research Prototypes
11.3.1 Examples of Commercial Data Mining Systems
11.4 Additional Themes on Data Mining
11.5 Social Impacts of Data Mining
11.6 Trends in Data Mining
Summary
Key Words
Suggested Reading
ix
Big Data
UNIT
Structure: 1
1.1 Data and Big data
1.2 Characteristics of Big data – Vs of Big data
1.3 Types of Big data
1.4 Storage of Big data
1.5 Big data technology
1.6 Big data processing and analyses
1.7 Benefits of Big data
1.8 Applications of Big data in industry
Summary
Keywords
Answers to Check Questions
Suggested Reading
Big Data 1
Objectives
From this unit, the student will be able to:
 Have clarity in mind the difference between simple data and big data;
 Understand the characteristics of big data; and
 Establish relevance in identifying the practical scenarios wherein big data can be used.
1.1 DATA AND BIG DATA

The extent to which operations are performed by computers, characters or symbols, which can be stored
and transmitted as electrical signals and recorded on magnetic, optical or mechanical recording media is
known as data. Basically, data means facts in raw format. For example, a new position opens up in a
small company and you are asked to interview 10 candidates that have applied. You interview each
candidate and note down their qualifications, their specialization. This is Data. When you do it for all
candidates, it is a collection of data. When you combine this collected data and arrange them in an order
where the most suitable candidate for the job would be at the 1st position and least suitable would be at
the last position, you are processing your data. On the basis of this information (processed data) you hire
someone that you deem fit for your requirement.In this case, each candidate is considered an entity and
the details related to them are considered records. So the data of entity becomes a record. In the case of a
database management system, the table represents an entity set and contains a set of records. Depending
on the end user's data needs, the data to be processed and analyzed can be small or large.
There are some "dimensions" that differentiate data summarized as "3 Vs" from BIG data. Big data is not
just "more" data. There is a lot of data that is so mixed and unstructured and accumulates so quickly that
traditional techniques and methods, in which “normal” software (Excel, Crystal Reports or the like)
doesn’t really work. Let us look at Instagram, a quite popular social media website. Statistics show that
every day 500+ terabytes of new data engage the social media site Instagram’s database. This data is
mainly generated for photo and video uploads, message exchanges, comments, etc. A single jet engine
can generate 10+ terabytes of data in 30 minutes during flight. Creating data with tens of thousands of
flights per day reaches many petabytes.
2 Big Data with Data Warehousing & Data Mining

A stock exchange like BSE (Bombay Stock Exchange) Ltd. presents on a continuous basis, the details of
all the stocks registered/listed under it for the benefit of investors to buy or sell or watch the trend going
on. In this case, the data of a stock changes from time to time based on transactions and market effects
and this can happen in a fraction of time. Websites and mobile phones continuously create, generate,
update, and create data on a regular basis. The data becomes vast with continuous variations to hold and
analyse it. People, devices and businesses are the sources of such enormous growth of data and such data
can be treated as ‘big data’. Big data represents a collection of large volumes of data which grows
exponentially with time involving diverse sets of information and a lot of variations. IBM defined big
data as data sets whose size or type is beyond the ability of traditional relational databases to capture,
manage and process the data with low latency.
Big data may contain terabytes (1,024 gigabytes), petabytes (1,024 terabytes) or exabytes (1,024
petabytes) of data being generated from people and machines through sales and transactions, customer
care and call centers, web, social media, mobile data, satellite data and so on. The data is grouped under
billions or trillions of records.
1.2 CHARACTERISTICS OF BIG DATA – Vs OF BIG DATA
According to Gartner, big data is the data that contains greater variety and arrives in increasing volumes
and with ever-increasing velocity. It is high-volume, velocity and variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.
Therefore, big data is said to be consisting of three major characteristics – volume, velocity and variety (3
Vs). Volume represents the amount of data being accumulated from time to time. Examples are social
media websites like Facebook and Twitter, wherein every minute you can expect incredible accumulation
and growth of unknown data. Velocity is the speed of data being generated, produced, created, received
refreshed. According to IBM Marketing Cloud study, 90% of the Internet data has been created since
2016. Each day about 12 million social media users are producing new day, about 70 crore tweets, above
400 crores of Facebook messages and above 500 crores of Facebook likes and above 6 crore Instagram
Big Data 3
messages are being posted and above 40 lacs hours of content being uploaded to Youtube
(https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/). Above 40,000 search
queries are being processed by Google alone per second. All these examples are mentioned to understand
the tremendous growth of data. Depending on the requirements in different perspectives of different
users, it will be a big challenge to analyse some or all of such continuously growing data for making
useful decisions and taking proper actions. Some theorists and practitioners advanced further by
extending the characteristics of big data from ‘3 Vs’ to 4 Vs, 5 Vs and even 10 Vs, so as to elaborate the
meaning of big data more and more. The 4 Vs include Veracity and 5 Vs include Veracity and Value, in
addition to the Volume, Velocity and Variety of 3 Vs. Veracity represents quality of the data, that is,
cleanliness and accuracy of data without missing any data items. Value refers to the ability to transform
the huge flow of data into proper usage for making good decisions and taking appropriate actions. It can
be measured with the extent of benefit the user is getting.
In some cases, many more V’s are added and used to extend the meaning of big data. They include
Variability (consistency of data in terms of availability or interval of reporting), Viscosity (latency or lag
time in data with respect to the event in context), Virality (spread of data and the frequency of its pick up
by other users or events), Validity (similar to Veracity, ensuring consistent data quality, common
definitions and metadata), Vulnerability (tendency for data breach and other security concerns), Volatility
(history or longevity of data for use) and Visualization (scalable to visualize). However, the 3 Vs or 5 Vs
model provides the basic characteristics of big data.
Volume
Velocity Big data Variety
Veracity Value
Fig. 1. 5 Vs of Big Data

1.3 TYPES OF BIG DATA
Big data is basically of three types. They are: Structured data, Unstructured data and Semi-structured
data. The sources of all such data are people and machines.
Structured data :Any data that can be stored, accessed, and processed in a fixed format is called
"structured" data follows. It is highly organized information. It can be stored and accessed from a row-
column database with the help of simple algorithms. This type of data can be generated by people and
machines like scanners, sensors, computer systems and other automatic devices. All such human and
machine generated data can be captured by the servers in an ordered format. Over time, computer science
talents have had greater success in developing techniques for working with such data (where the format is
known in advance) and in deriving value from it. However, today we see problems when the size of such
data increases dramatically. Typical sizes are in the range of several zettabytes. For example, customer
table in a firm’s database is structured data with the details of the customers like name, address and other
important information.
Unstructured data : Any data with an unknown form or structure is classified as unstructured data. Not
only is unstructured data huge, it also presents a variety of challenges in terms of processing it to derive
value. Most of the data being generated by humans through the internet and social media is unstructured.
The data produced by machines like satellites, scientific instruments, closed circuit TVs and radars are
also unstructured in nature. A typical example of unstructured data is a heterogeneous data source that
contains a combination of simple text files, images, videos, and so on. All this data has been
accumulating continuously at an abnormal rate and it is nothing, but, unstructured data. Because of not
following a format or structure for its storage, the processing and analysis of unstructured data will be a
difficult and time-consuming activity.
Unstructured data can be further classified as two types – Captured data and User-generated data. When
you book a cab (Uber or Ola) through your mobile phone, you can trace the movement of the cab to your
place and from your pick-up point to your destiny. In the same way, the cab driver can trace your
location and follow the navigation to reach the pick-up point and the destination. All such navigational
data is said to be captured data. The unstructured data that is being posted continuously by users in the
form of tweets and retweets, likes, shares and comments on social media is said to be user-generated data.
Nowadays, companies have a wealth of data, but unfortunately they have not known how to derive value
from it. This data is in its raw form or in unstructured format.
Big Data 5
Semi-structured data represents the data which is structured in one way and unstructured in another way.
Examples include tags and keywords that contain vital information and are useful in segregating
individual elements in the data. Most of the semi-structured data is of unstructured format, but contains
some organized data which is useful for processing. Examples include tags and keywords that contain
vital information and are useful in segregating individual elements in data. No SQL documents are semi-
structured data, because they contain keywords useful for processing the documents easily.

Following table gives the differences among these three types of Big data:
Table 1: Differences among the three types of Big data
Structured data Unstructured data Semi-structured data
Structured data is Even though it has flexibility, This data has more flexibility than
dependent and has less its internal structure does not the structured data, but less
flexibility. follow any data models or flexibility when compared to
schema. unstructured data.
It is based on a relational It is based on character and It is based on XML syntax and

database table. binary data. RDF (Resource Description
Framework) data model.
Only select data types It uses many varied types.
are used.
It is self-service access. Its access requires expertise in Its access also requires expertise in
data science. data science to some extent.
It is commonly stored in It is commonly stored in data It is commonly stored in data

data warehouses. lakes. lakes.
It follows a predefined It follows a native format to It follows a native format to store.

format to store. store.
It follows a matured It does not follow any It follows the transaction

transaction management transaction management and management adapted from DBMS,
and various concurrency concurrency technique. which is not matured.
techniques.
Structured queries along Only textual queries are Queries over anonymous nodes are
with complex joining are possible. possible.
possible.
It is very difficult to It is very scalable. Scaling is simpler than structured

scale database schema. data.
Big Data 7
1.4 STORAGE OF BIG DATA
The traditional database is designed to handle predictable and structured data. In a relational database,
vertical and sometimes horizontal expansion of data is possible to a limited extent, depending on the
growth of the data or the processing requirements. Big data involves a continuous flow of large amounts
of data with a large amount of data. As a result, conventional database systems cannot store and process
them. Relational databases could not accommodate the changing changes in database system
requirements and therefore fit the existing data model without changing the schema that defines it. Any
change or modification that needs to be made either to the data model or to the schema is a manual and
time-consuming process and even affects the associated applications and services. The current scenario
requires two important characteristics in database processing: (i) flexibility in development by meeting
changing data requirements and (ii) scalability in operation by processing a fast and continuous flow of
data and a variation of data. These two characteristics are absent in traditional relational database
systems.
Big data can be stored either in data lakes or in data warehouses, depending on the needs of the user
organization. Data lakes can store a large amount of the three types of big data in raw format, which
means they can store any type of big data in native format without placing any restrictions on account
size or file. Data scientists can leverage such data because there is a large volume of data available and
enough leeway to improve analytics performance and integration and generate real-time insights. The
data can be updated quickly and easily accessible. Big data relies more on data lakes to store it in various
forms - raw, granular, structured, and unstructured data. All data from different source systems can be
loaded into data lakes without anything missing. The data can be transformed and an appropriate schema
applied to meet the data analysis needs.
Data warehouses only store the processed and filtered data for a specific purpose and use by business
people. They are repositories for structured data only, and data accessibility and updating will be
complex. Data warehouses are useful in financial and other business environments because the big data
they generate is those environments that can be stored in a structured format that the entire organization
can access for specific analysis.

Some popular data storages:
● Hadoop Framework is designed to store and process data in a distributed computing environment
using standard hardware with a simple programming model. It can store and analyze the data
present on various machines at high speeds and low costs. It is developed by Apache Software
Foundation in 2011 and is written in JAVA
● NoSQL document databases offer a direct alternative to the rigid schema used in database
databases such as MongDB Relational Databases. This allows Mongodib to offer flexibility when
processing a wide variety of data types across large and distributed architectures. Developed by
MongoDB in 2009 and is written in C++, Go, JavaScript and Python.
● RainStor is a software company that developed a database management system of the same name
that can be used to manage and analyze big data for large companies. It uses simulation
techniques to organize the process of collecting large amounts of data for reference. It works like
SQL
● With Hunk, you can access data in remote Hadoop clusters through virtual indexes and analyze
your data using the Splunk Search Processing Language. With Hunk you can record and visualize
large amounts of data from your Hadoop and NoSQL data sources. Developed by Splunk INC in
the year 2013 and was written in JAVA.
1.5 BIG DATA TECHNOLOGY
Big data technology is a software service that can analyze, process and extract the right information from
big data data generated by extremely complex and large amounts of data. Currently, the whole world is
seeking more and more information about the speed and continuous flow of data in various forms to carry
out regular and future activities. To meet these challenging needs, new technology and sophisticated
systems have evolved rapidly, replacing traditional RDBMS, SQL and many front-end applications. The
database systems NoSQL (No Only SQL), Hadoop, MapReduce and Massive Parallel Computing are
important.
9 Big Data
Big data technology is mainly divided into two types:
1. Operational Big Data Technologies

2. Analytical Big Data Technologies
Operational big data is all about the normal daily data we generate. This could be online transactions,
social media or data from a specific organization, etc. You can even think of this as some kind of raw
data that is used to feed in the analytical big data technologies.
A few examples are as follows:
● Online ticket bookings including your train tickets, plane tickets, movie tickets, etc.
● Online shopping for Amazon, Flipkart, Walmart, Snap and many more.
● Data from social media websites like Facebook, Instagram and many more.
● The employee data of a multinational company.
Analytical Big Data: It is more complex than operational big data. In short, when it comes to analytical
big data, the actual performance part comes into play, and the critical business decisions are made in real
time by analyzing the operational big data.
Few examples are as follows:
● Share marketing
● Carrying out the space missions where every piece of information is of vital importance.
● Weather forecast information.
● Medical areas where the health status of a particular patient can be monitored.
NoSQL is a completely different database framework for powerful and agile processing of information
on a large scale. It's well designed to essentially meet the needs of big data. The NoSQL database
infrastructure handles the unstructured, cluttered, and unpredictable data well by ensuring strict
consistency in maintaining the speed and agility of the data. It is not a relational database based on tables
and does not use SQL to manipulate the data. NoSQL follows the concept of distributed databases by
storing the semi-structured and unstructured data across multiple processing nodes and even across
multiple servers to handle a continuous data explosion with good performance. You keep the fault
tolerance. Big data warehouses can also be managed by these distributed NoSQL database architectures.
NoSQL ensures high performance and high availability by offering a rich query language and easy
scalability.

NoSQL database architectures support key value stores, document stores, and chart databases. They are
supported by other technologies such as MongoDB and Hadoop. MongoDB uses a document model like
a row in a relational database table. It contains a series of documents that can be assigned to the data
types of the programming language using a series of fields (key-value pairs). The embedded documents
and the arrays lead to a reduction in the number of links and thus to high performance and speed.
Hadoop is an open source software ecosystem for the distributed storage and processing of big data on
large hardware clusters. It supports massive parallel and functional computing. It can overcome high
chances of system failure, limited bandwidth, and high programming complexity. It can handle certain
types of distributed NoSQL databases by spreading the data across a large number of servers without
affecting performance. The Hadoop framework enables the provision of distributed storage of data sets
that are too large for a single system.
The main principle of the Hadoop framework is MapReduce, a kind of calculation model. MapReduce
absorbs intensive data processes and distributes the computation across a potential Hadoop cluster, which
can contain an infinite number of servers, all of which work in parallel and significantly reduce
processing time. Because of these capabilities, Hadoop technology supports the gigantic processing
requirements of big data.
Check your progress

Fill in the blanks
1. The main principle of Hadoop framework is ________.
2. _____________ represents quality of the data, that is, cleanliness and accuracy of data without missing
any data items
3. Volume, velocity, variety, _____ and ________ are the five V’s of Big Data.
Big Data 11
1.6 BIG DATA PROCESSING AND ANALYSIS
Big data engineering begins with the identification of the sources that make up the big data after capturing
the relevant data for integration and processing. Efficient data processing usually processes small pieces of
data and processes them in parallel and this demands the use of large computer infrastructure. As the
amount of data increases, the number of parallel processes increases and more servers with more processors.
Big Data Processing and Distribution System makes it easy to organize and distribute data in parallel
computer clusters. Hadoop, an open-source Big Data clustering tool, is ideal for large data processing and
distribution.
There are two important ways to process big data - batch processing and stream processing. In batch
processing, large batches or data blocks are processed, while in stream processing, individual records or
micro batches of some records are processed. The batch process is useful in situations where the analysis
does not demand results in real time. For batch processing, Hadoop Mapreduce is the most useful
framework. Stream processing is one of the big data technologies used to process data in real time, to
investigate continuous data flows and to detect conditions in the short time (from a few milliseconds to a
few minutes) since the data was received. This is useful in situations where real-time tics are demanding
results. It handles a fast feeding of data into analysis tools from the point of data generation to get instant
analysis results. Apache Kafka, Apache Flink, Apache Storm and Apache Samja are important open source
flow processing platforms. The Apache Spark is another popular system that is compatible with Hadup and
can act as a standalone processing engine. It can keep data in memory for multiple steps for data conversion
and therefore repeats multiple times on the same piece of data. This advantage is much needed in analysis
and machine learning. Hadoop cannot store and process such data. But for large data solutions, processing
data in memory (in the case of spark computing) is just as useful as distributed storage of large data in
Hadoop. Cloud Solutions also provide dynamic distributed processing services in terms of the number of
parallel processes based on data volume. They offer infrastructure flexibility and guarantees to achieve the
best solution financially. After the data has been recorded and processed, the big data is ready for analysis.
Appropriate analytical models and data visualization techniques are useful for this purpose.

Data Analytics:
● Apache Kafka is a distributed streaming platform. A streaming platform has three main functions:
publisher, subscriber and consumer. This is similar to a message queue or an enterprise messaging
system. It was developed in 2011 by the Apache Software Foundation and written in Scala, JAVA.
● Splunk collects, indexes and correlates real-time data in a searchable repository from which charts,
reports, alerts, dashboards and data visualizations can be generated. It was developed by Splunk INC
in 2014 and written in AJAX, C ++, Python, XML.
● With KNIME, users can visually create data flows, selectively perform some or all of the analysis
steps, and review the results, models and interactive views. KNIME is written in Java and is based
on Eclipse and uses its extension mechanism to add plugins that offer additional functionality. It was
developed by KNIME in 2008 and written in JAVA.
● Spark offers in-memory computing capabilities to provide speed, a generic execution model to
support a wide variety of applications, and Java, Scala, and Python APIs to simplify development. It
was developed by the Apache Software Foundation and written in Java, Scala, Python, R.
● R is a programming language and a free software environment for statistical calculations and
graphics. The R language is widely used among statisticians and data miners for statistical software
development and mainly for data analysis. It was developed by the R-Foundation in 2000 and
written in Fortran.
● BlockChain is used in key functions like payment, escrow, and title. It can also reduce fraud,
increase financial privacy, speed transactions, and internationalize markets. It was developed by
Bitcoin and written in JavaScript, C ++, Python. BlockChain can be used to achieve the following in
a business network environment:
○ Shared Ledger: Here we can attach a distributed system of data sets to the company network.
○ Smart Contracts: Terms and conditions are embedded in the transaction database and are
executed with the transaction.
○ Data protection: Transactions are secure, authenticated, and verifiable for adequate visibility
○ Consensus: All parties in a corporate network agree to a network verified transaction.
Big data analytics sets up extremely large, fast-flowing, and diverse data to reveal information and
knowledge from hidden patterns, unknown correlations, trends, and other insights for better decision
making. Specialized software tools and applications are used in large data analysis to make predictive
analysis, data mining, text mining, forecasting, data optimization and visualization, and many insights useful
for business and society.
Hadoop and cloud-based Big Data tics Analytics help organizations reduce the cost of storing large amounts
of data and make better and faster decisions to take appropriate action. In-depth analysis of customer needs,
issues and preferences in different locations can be done to develop and deliver relevant and new products
and services, increase customer satisfaction and improve business while maintaining a competitive
advantage for organizations in the market.
1.7 BENEFITS OF BIG DATA
The importance of big data doesn't depend on how much data you have, but on what you do with it.
You can extract and analyze data from any source to find answers that enable cost savings, time
savings, new product development and optimized offerings, and intelligent decisions. When you
combine big data with powerful analytics, you can perform business-related tasks such as:
● Identify the main causes of errors, problems and defects in near real time
● Generation of vouchers at the point of sale based on the customer's buying habits.
● Recalculation of entire risk portfolios in minutes.
● Detect fraudulent behavior before it affects your business.
The ability to process big data offers several advantages, such as:
● Companies can use external information when making decisions: Companies can optimize their
business strategies by accessing social data from search engines and websites such as Facebook and
Twitter.
● Improved customer service: In these new systems, big data and natural language processing
technologies will be used to read and evaluate consumer responses.
● If necessary, early identification of a risk for the product / service
● Better operational efficiency

Big data technologies allow you to create a staging area or landing zone for new data before deciding which
data to move into the data warehouse. In addition, such an integration of big data technologies and data
warehouse helps a company to outsource data that is rarely accessed. Big data adds value in any area of
economy or society by developing insights and instructions to make right decisions and take the necessary
action. The benefits are given below:
● Big data has a large volume of information and it helps companies get broader answers to regularly
address problems.
● With big data, companies can optimize their processes and operational efficiency and reduce risks.
● Big data supports predictive analysis by accurately predicting the results and enabling companies to
make better decisions.
● It helps business organizations streamline their digital marketing strategies to improve customer
experience, solve problems, and improve their products and services.
● The accuracy of big data tools in filtering and integrating relevant data from multiple sources helps
save time, money and generate highly actionable insights.
1.8 APPLICATIONS OF BIG DATA IN INDUSTRY
Big data has many uses in various areas of application in companies and society. Some key uses of big data
are listed below:
 Manufacturing sector: Big data analysis allows companies to have a good idea of the products that
can do good business and start production accordingly. The delivery strategies and the product
can be significantly improved. Manufacturing companies can benefit from creating a transparent
infrastructure to predict uncertainties and reasons for incompetence that adversely affect the
business. Based on the knowledge gained, companies can optimize their processes and procedures
in order to improve their productivity and their overall business. Predictive analysis enables
organizations to analyze past and current products or services and evaluate the market feasibility
for new products or services. Accordingly, they develop selected products and services in order to
maintain competitive advantages and do good business. Problems such as work restrictions,
equipment failures, and material flow can be quickly analyzed regularly to streamline production

and supply chain activities. Machine and process variations can be easily identified to solve
quality problems. Production capacities and customer requirements can be assessed to regulate
production efficiency and predict future requirements.
 Marketing and Distribution: Customers' big data from various sources such as social media,
websites, and call logs can be analyzed to improve customer interaction and experience, and to
maximize the value of the products and services delivered. All problems in marketing and sales
can be solved proactively by analyzing such data in depth. The market and demand can be
analyzed in detail to ensure proper distribution in appropriate markets. From all of the data that
has been gathered on demand, sales, and shipments, big data can be helpful for businesses to
predict demand and sales, and to develop a better marketing strategy to increase sales.
● Fraud and Compliance: With big data, companies can identify specific patterns in the vast
amounts of data gathered from different locations, organizations and people, detect fraud and
generate regulatory reports faster. This will lead to an increase in security and an increase in the
compliance requirements to be implemented.
● Research and development: With the help of big data, companies can understand how innovative
and special their products are in the marketplace and their resources, including people and
processes. This will help companies look for new features and new products for competitive
advantage and financial strength.
● Banking and Finance sector: Big data helps banking and finance companies balance the systems
to add value to their customers by preserving the privacy and protection of customer and other
organizational information. Investment decisions and guidelines can be passed on to customers.
● Healthcare: Big data analytics in healthcare can help predict, prevent, and cure disease. The
beneficiaries are both doctors and humans. The diagnosis can be made more easily and precisely.
The unstructured from various departments and doctors in the form of notes and reports can be
properly processed to generate important insights for patient care.
Keywords
● Big data, Volume, Variety, Velocity, Value, Veracity, Hadoop, NoSQL, MapReduce.

1. Distinguish between SQL and NoSQL.
2. Describe how traditional RDBMS is not suitable to store and process big data.
3. List the benefits of big data.
4. Describe the role of Hadoop in big data storing and processing.
5. Describe the differences among the different types of big data.
6. Give a detailed note on processing of big data.

1. MapReduce
2. Veracity
3. Value and Veracity
Suggested Reading
1. Big Data: A Revolution That Will Transform How We Live, Work, and Think. - Book by
Kenneth Cukier and Viktor Mayer-Schönberger.
2. Big Data For Dummies - Book by Alan Nugent, Fern Halper, Judith Hurwitz, and Marcia
Kaufman.
3. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities - Book by Thomas H.
Davenport.

UNIT
Introduction to Data Warehouse
Structure: 2
2.1 Introduction
2.2 Understanding Data Warehouse
2.3 Difference between OLTP and Data Warehousing Environments
2.4 Basics of Data Warehouse Architecture
Summary
Key Words
Suggested Reading
Introduction to Data Warehouse 18

Notes
Objectives
----------------------
After going through this unit, you will be able to:
----------------------
● Identify the requirement of Data Warehouse
---------------------- ● Differentiate between OLTP (Online Transaction Processing) and
Data Warehouse
----------------------
● Analyse the characteristics of Data Warehouse
----------------------
● Understanding the architecture of Data Warehouse
----------------------
----------------------
2.1 INTRODUCTION
----------------------
Data Warehousing is a relational database concept, designed for query
---------------------- and analysis. It contains integrated historical data derived from transaction
---------------------- data. It isolates the data analysis from transaction processing. This helps in
maintaining historical records, analyzing the data to have better understanding
---------------------- of future prediction gathering of data and transforming it into useful actionable
information for delivering to business users is processed through extraction,
---------------------- transportation, transformation and loading solutions
----------------------
2.2 UNDERSTANDING DATA WAREHOUSE
----------------------
In computing, a data warehouse (DW, DWH) or an enterprise data
---------------------- warehouse (EDW), is a system used for reporting and data analysis. Integrating
---------------------- data from one or more disparate sources creates a central repository of data, a
data warehouse (DW). Data warehouses store current and historical data and
---------------------- are used for creating trending reports for senior management reporting such as
annual and quarterly comparisons.
----------------------
The data stored in the warehouse is uploaded from the operational systems
---------------------- (such as marketing, sales, etc., shown in the figure to the right). The data may
pass through an operational data store for additional operations before it is used
---------------------- in the DW for reporting.
---------------------- The data warehouse is an information delivery system. In this system, you
integrate and transform enterprise data into information suitable for strategic
----------------------
decision-making. You take all the historic data from the various operational
---------------------- systems, combine this internal data with any relevant data from outside sources
and pull them together. You resolve any conflicts in the way data resides in
---------------------- different systems and transform the integrated data content into a format
suitable for providing information to the various classes of users. Finally, you
----------------------
implement the information delivery methods.
---------------------- Bill Inmon, considered to be the father of Data Warehousing provides the
following definition: “A Data Warehouse is a subject oriented, integrated,
----------------------

nonvolatile and time variant collection of data in support of management’s Notes
decisions.” ----------------------
Sean Kelly, another leading data warehousing practitioner defines the data
warehouse in the following way. The data in the data warehouse is- Separate, ----------------------
Available, Integrated, Time stamped, Subject oriented, Nonvolatile and ----------------------
Accessible.
Defining Features
----------------------
Let us examine some of the key defining features of the data warehouse based ----------------------
on these definitions. What about the nature of the data in the data warehouse?
How is this data different from the data in any operational system? Why does it
----------------------
have to be different? How is the data content in the data warehouse used? ----------------------
● Subject-Oriented Data
----------------------
In operational systems, we store data by individual applications. In the data
sets for an order processing application, we keep the data for that particular ----------------------
application. These data sets provide the data for all the functions for entering
orders, checking stock, verifying customer’s credit, and assigning the order
----------------------
for shipment but these data sets contain only the data that is needed for those ----------------------
functions relating to this particular application. We will have some data sets
containing data about individual orders, customers, stock status and detailed ----------------------
transactions, but all of these are structured around the processing of orders.
----------------------
Similarly, for a banking institution, data sets for a consumer loans
application contain data for that particular application. Data sets for other ----------------------
distinct applications of checking accounts and savings accounts relate to those ----------------------
specific applications. Again, in an insurance company, different data sets
support individual applications such as automobile insurance, life insurance, ----------------------
and workers’ compensation insurance.
----------------------
In every industry, data sets are organized around individual applications to
support those particular operational systems. These individual data sets have ----------------------
to provide data for the specific applications to perform the specific functions
efficiently. Therefore, the data sets for each application need to be organized ----------------------
around that specific application. ----------------------
In striking contrast, in the data warehouse, data is stored by subjects, not by
applications. If data is stored by business subjects, what are business subjects? ----------------------
Business subjects differ from enterprise to enterprise. These are the subjects ----------------------
critical for the enterprise. For a manufacturing company, sales, shipments, and
inventory are critical business subjects. For a retail store, a sale at the check-out ----------------------
counter is a critical subject.
----------------------
Figure 1.1 distinguishes between how data is stored in operational systems
and in the data warehouse. In the operational systems shown, data for each ----------------------
application is organized separately by application: order processing, consumer
loans, customer billing, accounts receivable, claims processing, and savings
----------------------
accounts. ----------------------

Notes For example, Claims is a critical business subject for an insurance company.
Claims under automobile insurance policies are processed in the Auto Insurance
---------------------- application. Claims data for automobile insurance is organized in that application.
Similarly, claims data for workers’ compensation insurance is organized in
----------------------
the Workers’ Comp Insurance application. But in the data warehouse for an
---------------------- insurance company, claims data are organized around the subject of claims and
not by individual applications of Auto Insurance and Workers’ Comp. In the
---------------------- data warehouse, data is not stored by operational applications, but by business
subjects.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 2.1: The data warehouse is subject oriented
----------------------
In a data warehouse, there is no application flavor. The data in a data
---------------------- warehouse cut across applications.
---------------------- ● Integrated Data

For proper decision-making, you need to pull together all the relevant data
---------------------- from the various applications. The data in the data warehouse comes from
---------------------- several operational systems.
Source data are in different databases, files, and data segments. These are
---------------------- disparate applications, so the operational platforms and operating systems could
---------------------- be different. The file layouts, character code representations, and field naming
conventions all could be different.
----------------------
In addition to data from internal operational systems, for many enterprises,
---------------------- data from outside sources is likely to be very important. Companies such as
Metro Mail, A. C. Nielsen, and IRI specialize in providing vital data on a regular
---------------------- basis. Your data warehouse may need data from such sources. This is one more
variation in the mix of source data for a data warehouse.
----------------------
Figure 2 .2 illustrates a simple process of data integration for a
---------------------- banking institution. Here the data fed into the subject area of account in
the data warehouse comes from three different operational applications.
----------------------
Even within just three applications, there could be several variations.
---------------------- Naming conventions could be different; attributes for data items could be
different. The account
Big Data with Data Warehousing & Data Mining
21
number in the Savings Account application could be eight bytes long, but only Notes
six bytes in the Checking Account application. ----------------------
Before the data from various disparate sources can be usefully stored in a data
warehouse, you have to remove the inconsistencies. You have to standardize the ----------------------
various data elements and make sure of the meanings of data names in each ----------------------
source application. Before moving the data into the data warehouse, you have
to go through a process of transformation, consolidation, and integration of the ----------------------
source data. Data inconsistencies are removed; data from diverse operational
applications is integrated. ----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 2.2: The data warehouse is integrated
----------------------
● Time-Variant Data
For an operational system, the stored data contains the current values. In an ----------------------
accounts receivable system, the balance is the current outstanding balance in ----------------------
the customer’s account.
In an order entry system, the status of an order is the current status of ----------------------
the order. In a consumer loans application, the balance amount owed by the ----------------------
customer is the current amount.
Of course, we store some past transactions in operational systems, but, ----------------------
essentially, operational systems reflect current information because these ----------------------
systems support day-to-day current operations.
On the other hand, the data in the data warehouse is meant for analysis
----------------------
and decision-making. If a user is looking at the buying pattern of a specific ----------------------
customer, the user needs data not only about the current purchase, but on the
past purchases as well. When a user wants to find out the reason for the drop ----------------------
in sales in the North East division, the user needs all the sales data for that
division over a period extending back in time. When an analyst in a grocery
----------------------
chain wants to promote two or more products together, that analyst wants sales ----------------------
of the selected products over a number of past quarters.
A data warehouse, because of the very nature of its purpose, has to contain
----------------------

Notes historical data, not just current values. Data is stored as snapshots over past and
current periods.
----------------------
Every data structure in the data warehouse contains the time element. You
---------------------- will find historical snapshots of the operational data in the data warehouse. This
aspect of the data warehouse is quite significant for both the design and the
---------------------- implementation phases.
---------------------- For example, in a data warehouse containing units of sale, the quantity stored
in each file record or table row relates to a specific time element. Depending on
---------------------- the level of the details in the data warehouse, the sales quantity in a record may
---------------------- relate to a specific date, week, month, or quarter.
The time-variant nature of the data in a data warehouse
----------------------
● Allows for analysis of the past
---------------------- ● Relates information to the present
---------------------- ● Enables forecasts for the future
----------------------
● Nonvolatile Data
----------------------
Data extracted from the various operational systems and pertinent data
---------------------- obtained from outside sources are transformed, integrated and stored in the
data warehouse. The data in the data warehouse is not intended to run the day-
---------------------- to-day business. When you want to process the next order received from a
customer, you do not look into the data warehouse to find the current stock
----------------------
status. The operational order entry application is meant for that purpose. In the
---------------------- data warehouse, you keep the extracted stock status data as snapshots over time.
You do not update the data warehouse every time you process a single order.
----------------------
Data from the operational systems are moved into the data warehouse at
---------------------- specific intervals. Depending on the requirements of the business, these data
movements take place twice a day, once a day, once a week, or once in two
---------------------- weeks. In fact, in a typical data warehouse, data movements to different data
sets may take place at different frequencies. The changes to the attributes of the
----------------------
products may be moved once a week. Any revisions to geographical setup may
---------------------- be moved once a month. The units of sales may be moved once a day. You plan
and schedule the data movements or data loads based on the requirements of
---------------------- your users.
---------------------- As illustrated in Figure 2.3, not every business transaction updates the data
in the data warehouse. The business transactions update the operational system
---------------------- databases in real time. We add, change, or delete data from an operational
---------------------- system as each transaction happens but do not usually update the data in the
data warehouse. You do not delete the data in the data warehouse in real time.
---------------------- Once the data is captured in the data warehouse, you do not run individual
transactions to change the data there. Data updates are commonplace in an
---------------------- operational database; not so in a data warehouse. The data in a data warehouse
---------------------- is not as volatile as the data in an operational database is. The data in a data

warehouse is primarily for query and analysis. Usually, the data in the warehouse Notes
is not updated or deleted. ----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 2.3: The data warehouse is nonvolatile
● Data Granularity ----------------------
In an operational system, data is usually kept at the lowest level of detail. ----------------------
In a point-of-sale system for a grocery store, the units of sale are captured and
stored at the level of units of a product per transaction at the check-out counter. ----------------------
In an order entry system, the quantity ordered is captured and stored at the level ----------------------
of units of a product per order received from the customer. Whenever you need
summary data, you add up the individual transactions. If you are looking for ----------------------
units of a product ordered this month, you read all the orders entered for the
entire month for that product and add up. You do not usually keep summary ----------------------
data in an operational system. ----------------------
When a user queries the data warehouse for analysis, he or she usually starts
by looking at summary data. The user may start with total sale units of a product ----------------------
in an entire region. Then the user may want to look at the breakdown by states ----------------------
in the region. The next step may be the examination of sale units by the next
level of individual stores. ----------------------
Frequently, the analysis begins at a high level and moves down to lower ----------------------
levels of detail.
In a data warehouse, therefore, you find it efficient to keep data summarized
----------------------
at different levels. Depending on the query, you can then go to the particular ----------------------
level of detail and satisfy the query. Data granularity in a data warehouse refers
to the level of detail. The lower the level of detail, the finer the data granularity. ----------------------
Of course, if you want to keep data in the lowest level of detail, you have to store
a lot of data in the data warehouse. You will have to decide on the granularity
----------------------
levels based on the data types and the expected system performance for queries. ----------------------
Figure 2.4 shows examples of data granularity in a typical data warehouse.
Data granularity refers to the level of detail. Depending on the requirements, ----------------------
multiple levels of detail may be present. Many data warehouses have at least
dual levels of granularity.
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 2.4: Data Granularity
----------------------
Types of Systems
----------------------
Data mart
---------------------- A data mart is a simple form of a data warehouse that is focused on a single
---------------------- subject (or functional area), such as sales, finance or marketing. Data marts are
often built and controlled by a single department within an organization. Given
---------------------- their single-subject focus, data marts usually draw data from only a few sources.
The sources could be internal operational systems, a central data warehouse, or
---------------------- external data.
---------------------- Online analytical processing (OLAP)
---------------------- It ss characterized by a relatively low volume of transactions. Queries are
often very complex and involve aggregations. For OLAP systems, response
----------------------
time is an effectiveness measure. OLAP applications are widely used by Data
---------------------- Mining techniques. OLAP databases store aggregated historical data in multi-
---------------------- dimensional schemas (usually star schemas). OLAP systems typically have
data latency of a few hours, as opposed to data marts, where latency is expected
---------------------- to be closer to one day.
---------------------- Online Transaction Processing (OLTP)
---------------------- It is characterized by a large number of short on-line transactions (INSERT,
UPDATE, DELETE). OLTP systems emphasize very fast query processing
---------------------- and maintaining data integrity in multi-access environments. For OLTP
systems, effectiveness is measured by the number of transactions per second.
----------------------
OLTP databases contain detailed and current data. The schema used to store
---------------------- transactional databases is the entity model (usually 3NF).
Predictive analysis
----------------------
Predictive analysis is about finding and quantifying hidden patterns in the
---------------------- data using complex mathematical models that can be used to predict future
outcomes. Predictive analysis is different from OLAP in that OLAP focuses
----------------------

on historical data analysis and is reactive in nature, while predictive analysis Notes
focuses on the future. These systems are also used for CRM (Customer ----------------------
Relationship Management)
Software tools ----------------------
The typical extract-transform-load (ETL)-based data warehouse uses staging, ----------------------

data integration, and access layers to house its key functions. The staging
layer or staging database stores raw data extracted from each of the disparate
----------------------
source data systems. The integration layer integrates the disparate data sets by ----------------------
transforming the data from the staging layer often storing this transformed data
in an operational data store (ODS) database. The integrated data are then moved ----------------------
to yet another database, often called the data warehouse database, where the
data is arranged into hierarchical groups often called dimensions and into facts
----------------------
and aggregate facts. The combination of facts and dimensions is sometimes ----------------------
called a star schema. The access layer helps users retrieve data.
This definition of the data warehouse focuses on data storage. The main
----------------------
source of the data is cleaned, transformed, cataloged and made available for ----------------------
use by managers and other business professionals for data mining, online
analytical processing, market research and decision support (Marakas & ----------------------
O’Brien 2009). However, the means to retrieve and analyze data, to extract,
transform and load data, and to manage the data dictionary are also considered
----------------------
essential components of a data warehousing system. Many references to data ----------------------
warehousing use this broader context. Thus, an expanded definition for data
warehousing includes business intelligence tools, tools to extract, transform ----------------------
and load data into the repository, and tools to manage and retrieve metadata.
----------------------
Benefits
----------------------
A data warehouse maintains a copy of information from the source transaction
systems. This architectural complexity provides the opportunity to: ----------------------
● Congregate data from multiple sources into a single database so a single ----------------------
query engine can be used to present data.
● Mitigate the problem of database isolation level lock contention in ----------------------
transaction processing systems caused by attempts to run large, long ----------------------
running, analysis queries in transaction processing databases.
● Maintain data history, even if the source transaction systems do not. ----------------------
● Integrate data from multiple source systems, enabling a central view ----------------------
across the enterprise. This benefit is always valuable, but particularly so
when the organization has grown by merger. ----------------------
● Improve data quality, by providing consistent codes and descriptions, ----------------------
flagging or even fixing bad data.
● Present the organization’s information consistently.
----------------------
● Provide a single common data model for all data of interest regardless of ----------------------
the data’s source.
----------------------
● Restructure the data so that it makes sense to the business users.

Notes ● Restructure the data so that it delivers excellent query performance, even
for complex analytic queries, without impacting the operational systems.
---------------------- ● Add value to operational business applications, notably customer
---------------------- relationship management (CRM) systems.
● Making decision–support queries easier to write.
----------------------
---------------------- 2.3 DIFFERENCE BETWEEN OLTP AND DATA

WAREHOUSING ENVIRONMENTS
----------------------
The requirement analysis of OLTP and DW differs with following points:
----------------------
Difference in points OLTP Data Warehouse
---------------------- Indexes Too few Too many
---------------------- Joins Many Some
Duplicate Data Normalized up to 3NF De-normalized DBMS
---------------------- Derived Data Rare Common
---------------------- Workload: A data warehouse is designed to answer the ad-hoc queries. The
workload of data warehouse is not known in advance, so data warehouse should
---------------------- be designed well in order to answer all possible queries in advance. In contrast,
OLTP has a set of predefined queries; our applications are specifically designed
----------------------
to support such queries.
---------------------- Data Modification: Updating to data warehouse is carried out through an ETL
(Extract, Transform, Load) process using bulk data modification techniques.
----------------------
End users do not directly update the warehouse, except analysis. In contrast in
---------------------- OLTP, end users often issues transaction queries.
---------------------- Schema Design: To optimize query and analytical performance, data

warehouse uses de-normalize or partially de-normalize schemas. To optimize the
---------------------- transactions and guaranteed consistency OLTP often use normalizes schemas.
---------------------- Query Operation: A typical data warehouse scans million of rows, whereas
OLTP scans only handful rows.
---------------------- Data History: Data Warehouse main focus is to store historical data, whereas
---------------------- OLTP deals with current data.
---------------------- Check your Progress 1

----------------------
State True or False.
----------------------
1. The workload of data warehouse is known in advance.
---------------------- 2. Data warehouse is subject-oriented because it is defined by the subject
matter.
----------------------
3. The data in the data warehouse is meant for analysis and decision
---------------------- making.
4. Data Warehouse main focus is to store current data, whereas OLTP
---------------------- deals with historical data.

Notes
Activity 1
----------------------
Find out the difference between Data Mart and Data Warehouse. ----------------------
----------------------
2.4 BASICS OF DATA WAREHOUSE ARCHITECTURE ----------------------
Architecture of data warehouse depends on requirement analysis of an ----------------------

industry. Three widely used architectures are:
----------------------
● Data Warehouse Basics
● Data Warehouse with staging area ----------------------
● Data Warehouse with staging areas and Data marts ----------------------
Data Warehouse Basics ----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 2.5: Data Warehouse Basics
----------------------
In this figure, OLTP source data is present in form of summary data and raw
data in data warehouse. Summary data is very important to data warehouse, as ----------------------
it is pre-computed queries data. For example, a typical data warehouse query is
to retrieve the records based on some condition.
----------------------
----------------------

Notes Data Warehouse with Staging Area
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 2.6: Data Warehouse with Staging Area
----------------------
The operational data needs to be cleans and processed before loading into
---------------------- data warehouse. This process is carried through staging area. A staging area
simplifies building summaries and general warehouse management.
----------------------
Data Warehouse with Staging Areas and Data marts
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Fig. 2.7: Data Warehouse with staging areas and Data marts

Data Marts are used to customize the data warehouse architecture according Notes
to different groups in the organization. Figure 2.7 illustrates an example where ----------------------
purchasing, sales, and inventories are separated. In this example, a financial
analyst might want to analyze historical data for purchases and sales or mine ----------------------
historical data to make predictions about customer behaviour.
----------------------
Check your Progress 2 ----------------------
Multiple Choice Single Response.

----------------------
1. A data warehouse is said to contain a ‘subject-oriented’ collection of ----------------------

data because:
----------------------
i. Its contents have a common theme.
----------------------
ii. It is built for a specific application.
iii. It cannot support multiple objects. ----------------------
iv. It is a generalization of ‘object-oriented’ approach. ----------------------

2. A data warehouse is an ‘integrated’ collection of data because ----------------------
i. It is a collection of data of different types ----------------------
ii. It is a collection of data derived from multiple sources
----------------------
iii. It is a relational database
iv. It contains summarized data
----------------------
Fill in the Blanks. ----------------------

1. A data warehouse is designed for ______________. ----------------------
2. A data warehouse contains _________________ data. ----------------------
----------------------
Activity 2 ----------------------
Identify the requirement of data warehouse architecture for your company. ----------------------
----------------------
----------------------
Summary
----------------------
● A Data warehouse is designed mainly for query processing; hence
it is different from traditional online processing databases working ----------------------
methodology.
----------------------
● The characterization of data warehouse makes it easier to understand the
data characteristics in data warehouse. ----------------------
● The data in the data warehouse is: Separate, Available, Integrated, Time ----------------------
stamped, Subject oriented, Nonvolatile and Accessible.

Notes ● The architecture of data warehouse varies through requirement. Though
every data warehouse must follow basic architecture.
----------------------
---------------------- Keywords
---------------------- ● Subject-Oriented: Focussed on a particular subject
---------------------- ● Integrated: Combining data from diverse sources

● Non-volatile: Of or pertaining to storage that retains data even when
---------------------- electrical power is turned off or fails
---------------------- ● Time Variant: Changing over time
----------------------
----------------------
1. Define data warehouse.
---------------------- 2. What is the difference between data warehouse and OLTP?
---------------------- 3. Write a short note on Subject-oriented data.
---------------------- 4. Explain the basics of Data Warehouse Architecture.
5. What is Data Mart?
----------------------
---------------------- Answers to Check your Progress
---------------------- State True or False.
1. False
----------------------
2. True
----------------------
3. True
---------------------- 4. False
---------------------- Multiple Choice Single Response.
1. A data warehouse is said to contain a ‘subject-oriented’ collection of data
---------------------- because
---------------------- i. Its contents have a common theme.
---------------------- 2. A data warehouse is an ‘integrated’ collection of data because
---------------------- i. It is a collection of data derived from multiple sources.

Fill in the Blanks.
----------------------
1. A data warehouse is designed for query processing.
----------------------
2. A data warehouse contains derived data.
----------------------

32 Introduction to Data Warehouse
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Published by Addison-Wesley, 1997.
Devlin, Barry. Data Warehouse: From Architecture to Implementation. 3.
----------------------
Warehousing: Architecture and Implementation.
---------------------- Humphries, Mark, Hawkins, Michael.W. and Michelle C. Dy. Data 2.
----------------------
B10501_01/server.920/a96520/concept.htm
Oracle Data Warehouse Documentation-http://docs.oracle.com/cd/ 1.
----------------------
Notes
Suggested Reading
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Data Warehouse Architecture
UNIT
3
Structure:
3.1 Introduction
3.2 The Data Warehouse Architecture
3.3 Three-Tier Data Warehouse Architecture for Business analysis Framework
3.4 Data Warehouse Models
Summary
Key Words
Suggested Reading
Data Warehouse Architecture 34

Notes
Objectives
----------------------
----------------------
● Explain Data Warehouse Architecture
---------------------- ● Describe the Three-tier Data Warehouse Architecture
---------------------- ● Elaborate on the Data Warehouse Models
----------------------
----------------------
3.1 INTRODUCTION
----------------------
The technical architecture of data warehouses is somewhat similar to other
----------------------
systems, but does have some special characteristics. There are two border areas
---------------------- in data warehouse architecture - the single-layer architecture and the N-layer
architecture.
----------------------
In previous unit, we have discussed about the basics of data warehouse
---------------------- architecture, in this unit, we will study the same in detail. Data Warehouses
can be architected in many different ways, depending on the specific needs of
---------------------- a business.
----------------------
3.2 THE DATA WAREHOUSE ARCHITECTURE
----------------------
In short, data is moved from databases used in operational systems into a
---------------------- data warehouse staging area, then into a data warehouse and finally into a set
of conformed data marts. Data is copied from one database to another using a
---------------------- technology called ETL (Extract, Transform, and Load)
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 3.1: ETL Process in Data Warehousing
----------------------

Different data warehousing systems have different structures. Some may Notes
have an ODS (operational data store), while some may have multiple data ----------------------
marts. Some may have a small number of data sources, while some may have
dozens of data sources. In view of this, it is far more reasonable to present ----------------------
the different layers of a data warehouse architecture rather than discussing the
specifics of any one system. ----------------------
In general, all data warehouse systems have the following layers: ----------------------
● Data Source Layer ----------------------
● Data Extraction Layer
----------------------
● Staging Area
● ETL Layer ----------------------
● Data Storage Layer ----------------------
● Data Logic Layer ----------------------
● Data Presentation Layer
● Metadata Layer
----------------------
● System Operations Layer ----------------------

The figure below shows the relationships among the different components of ----------------------
the data warehouse architecture:
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 3.2: Different components of the data warehouse architecture

----------------------
Each component is discussed individually below: ----------------------

Data Source Layer ----------------------
This represents the different data sources that feed data into the data ----------------------
warehouse. The data source can be of any format -- plain text file, relational
database, other types of database, Excel file, etc., can all act as a data source. ----------------------
----------------------

Notes Many different types of data can be a data source:
● Operations -- such as sales data, HR data, product data, inventory data,
----------------------
marketing data, systems data.
---------------------- ● Web server logs with user browsing data.
---------------------- ● Internal market research data.
---------------------- ● Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
----------------------
Data Extraction Layer
---------------------- Data is pulled from the data source into the data warehouse system. There
---------------------- is likely some minimal data cleansing, but there is unlikely any major data
transformation.
----------------------
Staging Area
---------------------- This is where data sits prior to being scrubbed and transformed into a data
warehouse / data mart. Having one common area makes it easier for subsequent
----------------------
data processing / integration.
---------------------- ETL Layer
---------------------- This is where data gains its "intelligence", as logic is applied to transform
the data from a transactional nature to an analytical nature. This layer is also,
---------------------- where data cleansing happens. The ETL design phase is often the most time-
---------------------- consuming phase in a data-warehousing project, and an ETL tool is often used
in this layer.
---------------------- Data Storage Layer
---------------------- This is where the transformed and cleansed data sit. Based on scope and
functionality, 3 types of entities can be found here: data warehouse, data mart,
---------------------- and operational data store (ODS). In any given system, you may have just one
---------------------- of the three, two of the three, or all three types.
Data Logic Layer
----------------------
This is where business rules are stored. Business rules stored here do not
---------------------- affect the underlying data transformation rules, but do affect what the report
looks like.
----------------------
Data Presentation Layer
----------------------
This refers to the information that reaches the users. This can be in a form of
---------------------- a tabular / graphical report in a browser, an emailed report that is automatically
generated and sent everyday or an alert that warns users of exceptions, among
---------------------- others. Usually, an OLAP tool and/or a reporting tool are used in this layer.
---------------------- Metadata Layer
---------------------- This is where information about the data stored in the data warehouse system
is stored. A logical data model would be an example of something that is in the
---------------------- metadata layer. A metadata is often used to manage metadata.
37
System Operations Layer Notes
This layer includes information on how the data warehouse system operates, ----------------------
such as ETL job status, system performance, and user access history.
----------------------

----------------------
1. Data source fed into the Data Source Layer can be of any format. ----------------------
1. Logic is applied to transform the data from a transactional nature to ----------------------
an analytical nature in the ______ Layer.
2. Usually an OLAP tool and/or a reporting tool are used in the
----------------------
__________ layer. ----------------------
----------------------
3.3 THREE-TIER DATA WAREHOUSE ARCHITECTURE ----------------------
FOR BUSINESS ANALYSIS FRAMEWORK
----------------------
Generally, the data warehouses adopt the three-tier architecture. Following
are the three tiers of data warehouse architecture. ----------------------
● Bottom Tier - The bottom tier of the architecture is the data warehouse ----------------------
database server. It is the relational database system. We use the back end
tools and utilities to feed data into bottom tier. These back end tools and
----------------------
utilities perform the Extract, Clean, Load, and refresh functions. ----------------------
● Middle Tier - In the middle tier, we have OLAP Server. The OLAP Server
can be implemented in either of the following ways. ----------------------
o By relational OLAP (ROLAP), which is an extended relational ----------------------

database management system. The ROLAP maps the operations on
multidimensional data to standard relational operations. ----------------------
o By Multidimensional OLAP (MOLAP) model, which directly ----------------------

implements multidimensional data and operations.
----------------------
● Top-Tier - This tier is the front-end client layer. This layer holds the query
tools and reporting tool, analysis tools and data mining tools. ----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes Following diagram explains the Three-tier Architecture of Data warehouse:
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Fig. 3.3: Three-tier Architecture of Data Warehouse
---------------------- 3.4 DATA WAREHOUSE MODELS

---------------------- From the perspective of data warehouse architecture, we have the following
---------------------- data warehouse models:
● Virtual Warehouse
----------------------
● Data mart
---------------------- ● Enterprise Warehouse
---------------------- Virtual Warehouse
---------------------- ● The view over an operational data warehouse is known as virtual

warehouse. It is easy to build the virtual warehouse.
---------------------- ● Building the virtual warehouse requires excess capacity on operational
---------------------- database servers.
---------------------- Data Mart
● Data mart contains the subset of organisation-wide data.
---------------------- ● This subset of data is valuable to specific group of an organisation
---------------------- Note: In other words, we can say that data mart contains only that data which
is specific to a particular group. For example, the marketing data mart may
---------------------- contain only data related to item, customers and sales. The data mart is confined
---------------------- to subjects.
39
Points to remember about data marts Notes
● Window based or Unix/Linux based servers are used to implement data ----------------------
marts. They are implemented on low cost server.
----------------------
● The implementation cycle of data mart is measured in short period i.e. in
weeks rather than months or years. ----------------------
● The life cycle of a data mart may be complex in long run if it is planning ----------------------
and design is not organisation-wide.
● Data marts are small in size. ----------------------
● Data marts are customized by department. ----------------------
● The source of data mart is departmentally structured data warehouse.
----------------------
● Data marts are flexible.
ENTERPRISE WAREHOUSE
----------------------
The enterprise warehouse collects all the information all the subjects ----------------------
spanning the entire organization.
----------------------
● This provides us the enterprise-wide data integration.
● This provides us the enterprise-wide data integration.
----------------------
● The data is integrated from operational systems and external information ----------------------
providers.
----------------------
● This information can vary from a few gigabytes to hundreds of gigabytes,
terabytes or beyond. ----------------------
Check your Progress 2

----------------------
----------------------
----------------------
1. Virtual Warehouse, Data mart, Enterprise Warehouse are data
warehouse models. ----------------------
----------------------
Activity 1 ----------------------
Explore the most popular ETL tools in the market. ----------------------
----------------------
Summary ----------------------
● Data Warehouses can be architected in many different ways, depending ----------------------

on the specific needs of a business. The model shown below is the "hub- ----------------------
and-spokes" Data Warehousing architecture that is popular in many
organizations. ----------------------
● In general, all data warehouse systems have the following layers:
----------------------
Data Source Layer, Data Extraction Layer, Staging Area, ETL Layer,

Notes Data Storage Layer, Data Logic Layer, Data Presentation Layer, Metadata
Layer, System Operations Layer
---------------------- ● Generally, the data warehouses adopt the three-tier architecture. The
---------------------- bottom tier of the architecture is the data warehouse database server. In
the middle tier, we have OLAP Server. Top-Tier is the front-end client
---------------------- layer.
---------------------- ● From the perspective of data warehouse architecture we have the following
data warehouse models: Virtual Warehouse, Data mart, Enterprise
---------------------- Warehouse
----------------------
Keywords
----------------------
● ETL: Extract, Transform, and Load (ETL) refers to a process in database
----------------------
usage and especially in data warehousing that: Extracts data from outside
---------------------- sources, Transforms it to fit operational needs, which can include quality
levels and Loads it into the end target.
---------------------- ● Architecture: It is both the process and the product of planning, designing,
---------------------- and constructing buildings and other physical structures.
● Software Framework: It is an abstraction in which software providing
---------------------- generic functionality can be selectively changed by additional user-
---------------------- written code, thus providing application-specific software.
---------------------- Self-Assessment Questions

---------------------- 1. What is data warehouse architecture?
---------------------- 2. Describe the three-tier data warehouse architecture in detail.

----------------------
---------------------- 1. True
---------------------- Fill in the Blanks.

1. Logic is applied to transform the data from a transactional nature to an
---------------------- analytical nature in the ETL Layer.
---------------------- 2. Usually an OLAP tool and/or a reporting tool are used in the Data
Presentation layer.
----------------------
----------------------
----------------------
----------------------

42 Data Warehouse Architecture
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Implementation. Addison-Wesley.
---------------------- to From Architecture Barry Devlin. 1997. Data Warehouse: 2.
---------------------- architect-and-bi-solution-architect/
http://dwbi1.wordpress.com/2010/06/16/data-architect-data-warehouse- 1.
----------------------
----------------------
Suggested Reading
---------------------- True 1.
Notes Check your Progress 2
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Dimensional Modeling
UNIT
4
Structure:
4.1 Introduction
4.2 ER Model versus Dimensional Model
4.2.1 ER Model
4.2.2 Dimensional Model
4.2.3 Differences between Dimensional Model and Relational Model
4.3 Dimensional Modeling Technique
4.4 Dimensional Modeling Process
4.5 Benefits of Dimensional Modeling
Summary
Key Words
Suggested Reading
Dimensional Modeling 44
Notes
Objectives
----------------------
----------------------
● Define the dimensional model
---------------------- ● Differentiate between the ER model and the dimensional model
---------------------- ● Describe the dimensional model process
----------------------
---------------------- 4.1 INTRODUCTION
----------------------
In relational modeling, the focus is on identification of strong entities
---------------------- involved in the execution of business transactions. Therefore, in transaction-
oriented systems, data structures are designed to enable fast writing through the
---------------------- process of ER Modeling and Normalization. However, such designs hamper
query performance badly due to multiple joins resulted from Normalization.
----------------------
For the data warehouse, the focus is on identifying associative entities that
---------------------- carry business measures. The designing process supports the measures, and
these measures are known as Dimensional Modeling. Such modeling helps to
---------------------- perform aggregation and integration of data from different sources.
----------------------
4.2 ER MODEL VERSUS DIMENSIONAL MODEL
----------------------
The basic differences between the ER model and the dimension model are
---------------------- discussed below:
---------------------- 4.2.1 ER Model
---------------------- Entity-relationship model (ER model) is a data model for describing the data
or information aspects of a business domain or its process requirements, in an
---------------------- abstract way that lends itself to ultimately being implemented in a database such
as a relational database.
----------------------
ER Model is for relational model in relational database, which is composed
---------------------- of set of relations. A relation schema is denoted by RSchema= R{ A1,A2,...An},
which is made up of a relation R and associated attributes Ai. Each attribute is a
---------------------- characteristic of a relation in a particular domain. Each relation R in relational
---------------------- schema RSchema is composed of set of tuples. Tuple is a group characteristic for
an entity. In other words, putting all the columns together of a relation is known
---------------------- as tuple.
---------------------- 4.2.2 Dimensional Model
Dimensions are the characteristics of subjects in which each row is an
----------------------
occurrence and each attribute can be used as ‘by’ attribute in where clause.
---------------------- For example, a user wants to see sales by customer or by product. Time is a
fundamental dimension across all the industries and thereby called confirmed
---------------------- dimension. Combining all the attributes of single business object into single

dimension table is called de-normalization, thereby reducing the joining cost. Notes
Facts are the metrics resulting from business process for an event. It identifies ----------------------
the level of details. A key to fact table is a multi-part key, which includes the
primary key from each participating dimension in a model. Most of the facts ----------------------
are numeric and continuously calculated. The fact table has the lowest level ----------------------
of details with little or no redundant data. Hence it is said to be in 3NF. The
dimensional model can be achieved majorly from star schema and snowflake ----------------------
schema.
----------------------
Consider a product database, which contains information for sales. The sales
data is driven by customer, location, time and product. They are determined as ----------------------
dimensions, while the sales data is referred as fact. The sales measures are mostly
numeric and additive. They vary over the time. The attributes of dimensions
----------------------
and measures are distinguished on the basis of application requirement. ----------------------
4.2.3 Differences Between Dimensional Model and Relational Model ----------------------
Purpose of Model:
----------------------
The relational model is used in transaction oriented system, whereas
dimensional model is used in decision support system. ----------------------
Analysis Focus: ----------------------
The relational model discovers the strong entities in terms of business ----------------------
process execution, whereas dimension model discovers the associative entities
that represent the effect of business process. ----------------------
Analysis details: ----------------------
Relational model defines the relationship among the attributes of strong ----------------------
entities, whereas dimensional modeling defines business measures in terms of
dimension and attributes of associative entities. ----------------------

----------------------
Fill in the blanks.
----------------------
1. ___________ are the characteristics of subjects in which each row is
an occurrence and each attribute can be used as ‘by’ attribute in where ----------------------
clause.
----------------------
2. The attributes of dimensions and measures are distinguished based on
____________ requirement. ----------------------
----------------------
4.3 DIMENSIONAL MODELING TECHNIQUE ----------------------
Dimensional Modeling is used by most contemporary data access tools. Such ----------------------
modeling is a mixture of normalization and de-normalization. Either data marts
or data warehouses may use this modeling. Dimensional modeling is composed ----------------------
Notes of dimensions and facts. Two modeling techniques exist for dimensions. They
are elaborated below.
----------------------
Star schema and Snowflake schema modeling techniques for dimensions
---------------------- represents the structure of dimension model. The center of the schema is fact
table, which is the only table in schema having multiple joins from dimension
---------------------- tables. Fact table stores the measures of the business. Dimension tables define
---------------------- the characteristics of business. The primary key of a fact table is a composite
primary key, composed of all the foreign keys from existing dimensions. In
---------------------- other words, each component of a composite primary key is a foreign key
referencing primary key of a dimension table. The dimensions are usually
---------------------- grouped into a hierarchy, which specifies the granularity level. Such schemas
---------------------- have the following benefits:
● Easier to understand
----------------------
● Improved query performance, as lesser number of joins are required
---------------------- ● Scalable
---------------------- The diagram of the star schema resembles a star.
---------------------- Customer Product
----------------------
---------------------- Sales
----------------------
Location Time
----------------------
---------------------- Fig 4.1 Star Schema Modeling
---------------------- The Snowflake schema is a slightly more complex than the star schema. Its
diagram resembles the snowflake; hence the name. Such schema normalizes the
---------------------- dimension to reduce redundancy. In other words, the dimension is partitioned
into several small tables. For example, the product dimension is partitioned
---------------------- into product category. Hence, this results in more complex queries and joins,
---------------------- thereby reducing the query performance.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Product Notes
Category ----------------------
----------------------
Product Location ----------------------
----------------------
Sales ----------------------
----------------------
Customer Time ----------------------
----------------------
Time
----------------------
Fig 4.2 Snowflake Schema
----------------------
Fill in the Blanks.

----------------------
1. Data warehouse modeling is known as _______. ----------------------
2. In transaction-oriented systems, data structures are designed to enable ----------------------

fast writing through the process of _______________.
----------------------
3. ______ is the easiest schema of data warehouse.
4. _______ are the metrics resulting from business process for an event.
----------------------
5. Dimensional modeling is a mixture of normalization and _______. ----------------------

----------------------
Activity 1 ----------------------
----------------------
Design a dimensional model for your company’s data warehouse.
----------------------
4.4 DIMENSIONAL MODELING PROCESS

----------------------
----------------------
The dimensional model is built on a star-like schema, with dimensions
surrounding the fact table. To build the schema, the following design model is ----------------------
used:
1. Choose the business process
----------------------
2. Declare the grain. ----------------------

3. Identify the dimensions. ----------------------
4. Identify the fact.
Notes Let us discuss each step in detail.
1. Choose the business process: The process of dimensional modeling
----------------------
builds on a 4-step design method that helps to ensure the usability of
---------------------- the dimensional model and the use of the data warehouse. The basics
in the design build on the actual business process, which the data
---------------------- warehouse should cover.
---------------------- Therefore, the first step in the model is to describe the business process,
which the model builds on. This could be, for instance, a sales situation in
---------------------- a retail store. To describe the business process, one can choose to do this
in plain text or use basic Business Process Modeling Notation (BPMN) or
----------------------
other design guides like the Unified Modeling Language (UML).
---------------------- 2. Declare the grain: After describing the Business Process, the next step in
the design is to declare the grain of the model. The grain of the model is
----------------------
the exact description of what the dimensional model should be focusing
---------------------- on. This could be, for instance, “An individual line item on a customer
slip from a retail store”. To clarify what the grain means, you should pick
---------------------- the central process and describe it in one sentence.
---------------------- Furthermore, the grain (sentence) is what you are going to build your
dimensions and fact table from. You might find it necessary to go back
---------------------- to this step to alter the grain due to new information gained on what your
---------------------- model is supposed to be able to deliver.
3. Identify the dimensions: The third step in the design process is to define
---------------------- the dimensions of the model. The dimensions must be defined within
---------------------- the grain from the second step of the 4-step process. Dimensions are the
foundation of the fact table and are where the data for the fact table is
---------------------- collected. Typically, dimensions are nouns, such as date, store, inventory
etc. These dimensions are where all the data is stored. For example, the
---------------------- date dimension could contain data such as year, month and weekday.
---------------------- 4. Identify the facts: After defining the dimensions, the next step in the
process is to make keys for the fact table. This step is to identify the
---------------------- numeric facts that will populate each fact table row. This step is closely
---------------------- related to the business users of the system, since this is where they get
access to data stored in the data warehouse. Therefore, most of the fact
---------------------- table rows are numerical, additive figures such as quantity or cost per
unit, etc.
----------------------
----------------------
---------------------- 1. The third step in the design process is to define the dimensions of
---------------------- the model.
2. Declare the grain step is meant to identify the numeric facts that will
---------------------- populate each fact table row.

4.5 BENEFITS OF DIMENSIONAL MODELING Notes
Benefits of dimensional modeling are as follows:
----------------------
 Understandability: Compared to the normalized model, the dimensional ----------------------

model is easier to understand and more intuitive. In dimensional models, ----------------------
information is grouped into coherent business categories or dimensions,
making it easier to read and interpret. Simplicity also allows software to ----------------------
navigate databases efficiently. In normalized models, data is divided into
many discrete entities and even a simple business process might result in ----------------------
dozens of tables joined in a complex way. ----------------------
 Query performance: Dimensional models are more denormalized and
optimized for data querying, while normalized models seek to eliminate ----------------------
data redundancies and are optimized for transaction loading and updating. ----------------------
The predictable framework of a dimensional model allows the database
to make strong assumptions about the data that aid in performance. ----------------------
Each dimension is an equivalent entry point into the fact table and this
symmetrical structure allows effective handling of complex queries. ----------------------
Query optimization for star join databases is simple, predictable and ----------------------
controllable.
 Extensibility: Dimensional models are extensible and easily ----------------------
accommodate unexpected new data. Existing tables can be changed in ----------------------
place by simply adding new data rows into the table or executing SQL
alter table commands. No queries or other applications that sit on top of ----------------------
the Warehouse need to be reprogrammed to accommodate changes. Old
queries and applications continue to run without yielding different results. ----------------------
But in normalized models, each modification should be considered ----------------------
carefully because of the complex dependencies between database tables.
----------------------
Summary ----------------------
 As data warehouse is meant for ad-hoc queries originated from decision ----------------------
support systems, traditional normalization and ER modeling impact the
query performance very badly. ----------------------
 Entity-relationship model (ER model) is a data model for describing ----------------------
the data or information aspects of a business domain or its process
requirements, in an abstract way that lends itself to ultimately being ----------------------
implemented in a database such as a relational database.
----------------------
 Dimensions are the characteristics of subjects in which each row is an
occurrence and each attribute can be used as ‘by’ attribute in where clause. ----------------------
 Dimensional Modeling is used by most contemporary data access tools. ----------------------
Such modeling is a mixture of normalization and de-normalization.
 In order to improve query performance, data warehouse is modeled with
----------------------
dimensional modeling by designing star schema and snowflake schema ----------------------
according to requirement.
Notes Keywords
----------------------
 Fact table: It consists of the measurements, metrics or facts of a
---------------------- business process.
----------------------
 Dimensional model: The dimensional model is a specialized
---------------------- adaptation of the relational model used to represent data in data
warehouses in a way that data can be easily summarized using
---------------------- online analytical processing or OLAP queries.
-
--------------------  Star schema: It is the simplest style of data mart schema. The star
schema consists of one or more fact tables referencing any number
---------------------- of dimension tables.
-
-------------------  Snowflake schema: It is a logical arrangement of tables
in a multidimensional database such that the entity
----------------------
relationship diagram resembles a snowflake shape.
----------------------
---------------------- 1. Define dimensional modeling.
---------------------- 2. Differentiate between ER modeling and Dimensional Modeling.

----------------------
---------------------- 1. Dimensions are the characteristics of subjects in which each row is an
---------------------- occurrence and each attribute can be used as ‘by’ attribute in where clause.
2. The attributes of dimensions and measures are distinguished based on
---------------------- application requirement.
---------------------- 1. Data warehouse modeling is known as dimensional modeling.

2. In transaction-oriented systems, data structures are designed to enable
---------------------- fast writing through the process of ER Modeling and Normalization.
---------------------- 3. Star schema is the easiest schema for data warehouse.
---------------------- 4. Facts are the metrics resulting from business process for an event.
5. Dimensional modeling is a mixture of normalization and de-normalization.
----------------------
----------------------
----------------------
52 Dimensional Modeling
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Model.
---------------------- Varga, Mladen. On the Differences of Relational and Dimensional Data 3.
---------------------- Modeling and Design: Logical Design.
----------------------
Teorey, Toby J., Sam S. Lightstone, Tom Nadeau, H.V. Jagadish. Database 2.
Environment. IBM Redbooks.
---------------------- Stanislav Vohnik. Dimensional Modeling: In a Business Intelligence
----------------------
Ballard, Chuck, Farrell, Daniel M; Gupta, Amit; Mazuela, Carlos; 1.
---------------------- Suggested Reading
----------------------
2. False
---------------------- 1. True
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Data Warehouse Implementation
UNIT
5
Structure:
5.1 Introduction
5.2 Physical Database Design
5.3 Hardware and I/O Considerations
5.4 Integrity Constraints
5.5 Dimensions
5.6 Aggregation
Summary
Key Words
Suggested Reading
Data Warehouse Implementation 54

Notes
Objectives
----------------------
----------------------
 Describe the physical database design
----------------------
 Explain how b-tree and bitmap indexes help in data warehousing
---------------------- queries requirement
----------------------  Implement Data Warehouse
----------------------
---------------------- 5.1 INTRODUCTION
---------------------- During physical design process, we reconstruct the data gathered from
---------------------- logical design into specification of physical design. This unit provides you with
an in-depth understanding of data warehousing and its application to business
---------------------- intelligence. You will learn the concepts and necessary to build a successful
data warehouse to enable your business intelligence program on the first
---------------------- implementation.
----------------------
5.2 PHYSICAL DATABASE DESIGN
----------------------
The physical database design is generated according to the requirement of
---------------------- query performance and maintenance.
---------------------- In the logical design, a model is created for data warehouse, which is
composed of entities, attributes and relationship. Such entities are linked
---------------------- together using relationships. The attribute characterizes the entities. The unique
---------------------- identifier is used to uniquely identify one instance from another instance of an
entity.
---------------------- Following translation is required for translating such design into actual
---------------------- database during physical design process.
 From Entities to tables
----------------------
 The Relationships between primary and foreign key constraints
----------------------
 Mapping of Attributes to columns
---------------------- After such translation, we are required to create the following structures into
---------------------- database.
 Tablespaces
----------------------
 Tables
----------------------  Indexes
 Constraints
 Dimensions

Tablespace Notes
Tablespace is composed of one or more operating system files, which are ----------------------
known as datafiles. The datafile is a repository for actual data. Every database
structure resides in a datafile and hence in the tablespace. Therefore, to improve ----------------------
performance, tablespaces are separated for each structure. ----------------------
Tables
----------------------
The basic structures for physical data storage are tables. They act as a capsule
of raw data in data warehouse. As data warehouse contains large data volumes, ----------------------
query performance and availability of the tables are the major key parameters. ----------------------
Use of partitioned tables addresses large data volumes efficiently. The main
design concern for partitioning is the maintainability. Typically, table partitions ----------------------
are created based on time.
----------------------
Constraints
Constraints are used to enforce business standards, thus preventing invalid
----------------------
information entry in the table. In OLTP environment, constraints are majorly ----------------------
focussed on transactions, whereas in data warehouse, constraints are majorly
used in query-rewrite, as accuracy is already guaranteed. ----------------------
Indexes ----------------------
Indexes are optional data structures associated with tables. In addition to ----------------------
B-Tree, bitmap indexes are used because of low cardinality of attributes in data
warehousing tables. Moreover, they are necessary for some optimized data ----------------------
access in dimensional modeling.
----------------------
Dimensions
----------------------
In order to show the hierarchical relationship between columns, we use
dimensions. It is a container of logical relationship with no space requirements. ----------------------
A typical dimension is city, state (or province), region and country.
----------------------
5.3 HARDWARE AND I/O CONSIDERATIONS ----------------------
Performance of a query is a primary consideration of data warehouse ----------------------
designers. The typical operations workload is the bulk load and ad-hoc queries
for large amount of data. One of the leading causes of poor query performance ----------------------
is poor I/O design. Database administrators have to pay more attention to the
following points while designing I/O configurations.
----------------------
1. Choose storage configuration based on their bandwidth not the capacity. ----------------------
2. Create the clusters of disks as a storage for striping and redundancy in ----------------------
order to minimize the risks involved in failures.
----------------------
3. Plan for I/O growth without neglecting the I/O bandwidth.
Partitioning
----------------------
As we have discussed, data warehouse stores very large tables and maintain ----------------------

Notes the standards for availability and good query performance. In order to achieve
this, one of the ways is partitioning. Partitioning supports the performance
---------------------- improvement and availability by dividing the table into smaller pieces and
---------------------- distribute them geographically. Hence, SQL query has less amount of data
to operate on. It also improves the scalability by adding more partitions
---------------------- dynamically to the database.
---------------------- Indexes
In this section, we are covering B-Tree and bitmap indexes for data
----------------------
warehousing queries requirement.
---------------------- Bitmap indexes are widely preferable for ad-hoc queries because of low
cardinality and less transactions requirement. Cardinality is the number of
----------------------
unique options available for a given attribute. For such combinations, bitmap
---------------------- indexing provides:
1. Improved ad-hoc queries response time.
----------------------
2. Less storage requirement.
---------------------- Efficient maintenance during bulk loads.
3.
---------------------- Indexing a large table with a traditional B-tree index is more expensive
in terms of disk space, as index sizes are several times larger than respective
----------------------
data in the table. Also searching in B-Tree index is more time consuming than
---------------------- bitmap. A B-Tree index provides a pointer to the rows in a table for the specific
key, whereas in bitmap index, a bitmap represents a list of rowids. Each bit,
---------------------- which is a part of a bitmap, corresponds to possible rowid. If the bit is set, that
means row is present for a given key value. A mapping function converts the bit
----------------------
position to the actual rowids. For multiple conditions in a query, bitmap indexes
---------------------- perform better than B-Tree. Bitmap indexes are traditionally focused on data
warehousing applications. These are not suitable for OLTP applications due to
---------------------- large number of concurrent transactions modifying the data. Such modifications
results in expensive locks for bitmap indexes.
----------------------
Bitmap indexes are used to query only fact table or when fact table is joined
---------------------- with two or more dimension tables. A table attribute is a candidate for a bitmap
indexes for the following conditions.
----------------------
1. Column cardinality is low.
----------------------
2. Indexed column is frequently used in the conditional clause.
---------------------- Indexed column is a foreign key column for a dimension table.
3.
----------------------
B-tree indexes
----------------------
The bottom level of B-tree indexes contains the index key and pointers to
---------------------- the corresponding row. Through such indexes, our typical query retrieves with
the corresponding rows the indexed column. Hence it is faster for search and
---------------------- good for higher cardinality of the column. However, for every row retrieval
from a table, such index scan may exhibit more cost. B-tree indexes are most
----------------------
commonly used to enforce unique keys.
Notes
----------------------
State True or False. ----------------------
1. Indexes are optional data structure associated with tables. ----------------------

2. Partitioning does not affect the scalability by adding more partitions ----------------------
dynamically to the database.
3. B-tree indexes are most commonly used to enforce unique keys.
----------------------
----------------------
5.4 INTEGRITY CONSTRAINTS ----------------------
In this section, we will discuss the usefulness of constraints, constraint states ----------------------
and data warehouse constraints.
----------------------
● Usefulness
Integrity constraints provide mechanism to enforce business standards.
----------------------
Such constraints are used to achieve data cleanliness and query optimization. ----------------------
Constraints in cleanliness verify the prevention of introduction of dirty data and
hence query optimization is achieved. ----------------------
● Constraint States ----------------------
In order to achieve enforcement through constraint, it must be in enable
state. An enable constraint ensures data transaction according to the conditions
----------------------
of the constraints. Also, the validation ensures that data that exists in the table ----------------------
is according to constraints. All the constraints are by default in enabled and
validated state. However, for the validation, constraints need to be enabled on ----------------------
enforced.
----------------------
● Data warehouse constraints
Query performance may get affected by the available constraints and index
----------------------
associated with it. The major constraints, which contain the index are the primary ----------------------
and unique key. These constraints are typically enforced through unique index.
However, for large data warehouse tables, maintaining such large unique index ----------------------
can be quite a tedious job in terms of processing time and disk space. Also most
data warehouse queries do not use unique index attributes as their predicates,
----------------------
so creating this index will probably not improve the query performance. For ----------------------
data warehouse databases, one alternative solution is to disable the unique
constraint. Once the constraint is disabled, the unique index is not required. ----------------------
This approach is frequently used in data warehouse. Consequently, the updates
in respective base table cannot be performed because constraint is in the disable
----------------------
state. The better way is to drop and recreate respective constraints after loading ----------------------
of data in data warehouse.
----------------------
----------------------
Notes 5.5 DIMENSIONS
---------------------- In order to answer the business queries dimensions, categorize the data.
For example for customer and product relation, commonly used dimensions
---------------------- are customer, product and time. As we have discussed earlier, time dimension
---------------------- participates in every data warehouse. For example, a retail store might want
to create a data warehouse to understand its business or sales for a particular
---------------------- product and may want answer to the following questions:
---------------------- 1. Total sales for a particular product for given quarter.
2. Does any product require promotion?
----------------------
3. Effect of promotion on sales of a particular product.
----------------------
Two major components of retailer data warehouse are dimensions and facts.
---------------------- Dimensions are customer, product, time and location, whereas the fact is sales.
We need to identify the dimensions and facts from a given problem statement
---------------------- for dimensional modelling.
---------------------- The entries for above-mentioned dimension and fact are populated into
dimension table and fact table. The fact table will contain the sales according to
---------------------- product, customer and time. In addition, database object dimension may contain
the hierarchy of dimension tables. Moving upper level in hierarchy is known as
----------------------
roll-up and going down in a level is known as drill-down. For example, in a time
---------------------- dimension, days may roll-up to week, months, quarter and year. Data analysis
typically starts from higher level and goes to the dipper level if required.
----------------------
---------------------- 5.6 AGGREGATION
---------------------- The Aggregation is a considered to be a fundamental function of the data
warehouse. Aggregation through multi-dimensional queries has significant
---------------------- effect on performance. These aggregates building queries exhaust the major
---------------------- part of the processing power. To minimize such exhaustion, data warehouse
design plays a vital role. Following are the points to understand for design.
---------------------- 1. Generate the star schema in which large central fact table is surrounded
by single level of independent dimension tables.
----------------------
2. An aggregate navigator is a database API, which transforms the base-
---------------------- level SQL into aggregate-aware SQL.
---------------------- In order to improve the query aggregation, every database vendors provides
ROLL-UP and CUBE aggregate operations. These operations are extensions
---------------------- to SQL, to make SQL query easier and faster. The said operations produce
---------------------- single result set that is equivalent to UNION ALL of differently grouped
rows. ROLL-UP, as the name suggests, does increasing level of aggregation,
---------------------- i.e., from the most detailed up to grand total. CUBE operation requires heavy
processing workload. To enhance the query performance, these operations
---------------------- can be parallelized, thereby increasing the overall database performance and
---------------------- scalability.

A good aggregate function should perform well for ever increasing Notes
dimensions size and disk storage requirement. Also, processing of aggregate ----------------------
functions should be transparent to end users. Specification of such aggregates
at the time of data load should be as automated as possible. There exist three ----------------------
different design methodologies for aggregates.
----------------------
 Each distinct aggregate level must have its own fact table separate from
base fact table. ----------------------
 The dimension tables attached to aggregate fact table must be smaller ----------------------
than actual dimension attached to base fact table.
 The base fact table and related aggregated fact tables are associated
----------------------
together to navigate the aggregate function so that this function can access ----------------------
only the base fact table and related dimension to it.
----------------------
The aggregate navigator function works in the following way:
1. Find the smallest table. Such table acts as a lookup table for aggregate ----------------------
navigator function. ----------------------
2. Compare the table attributes to the SQL statement select list. If the
attributes are different, go back to step 1; else, this table provides the ----------------------
correct answer. ----------------------

----------------------
Fill in the Blanks.
----------------------
1. A tablespace is composed of ________.
2. Query performance is driven by available Constraints and ________. ----------------------
3. Data analysis typically starts from ________ and goes to the ----------------------
_________.
----------------------
4. ROLL-UP is ________ operations.
5. ________ dimension participates in every data warehouse.
----------------------
----------------------
Activity 1 ----------------------
----------------------
Implement the data warehouse for your company by understanding the
physical design process. ----------------------
----------------------
----------------------
----------------------
----------------------

Notes Summary
----------------------  The physical database design is generated according to the requirement
----------------------
of query performance and maintenance.
----------------------
 In the logical design, a model is created for data warehouse,
---------------------- which is composed of entities, attributes and relationship. Such
----------------------
entities are linked together using relationships. The attribute
---------------------- characterizes the entities.
 Indexing a large table with a traditional B-tree index is more
---------------------- - expensive in terms of disk space, as index sizes are several times
--------------------- larger than respective data in the table. Also searching in B-Tree
index is more time consuming than bitmap.
---------------------- -  Integrity constraints provide mechanism to enforce business
--------------------- standards. Such constraints are used to achieve data
cleanliness and query optimization
----------------------
 As the goal of data warehouse design is to perform ad-hoc nature
---------------------- queries efficiently, we are required to create some of the objects
---------------------- such as tables, indexes, constraints etc. optimally. In addition to this,
h/w consideration is also an important part.
----------------------
 The Aggregation is considered to be a fundamental function of
---------------------- the data warehouse. Aggregation through multi-dimensional
queries has significant effect on performance. These aggregates
---------------------- building queries exhaust the major part of the processing power.
----------------------
---------------------- Keywords
----------------------
----------------------  Tables: A table is a collection of related data held in a structured
format within a database.
 Indexes: A database index is a data structure that improves
the speed of data retrieval operations on a database table at the
cost of additional writes and storage space to maintain the index
data structure.
 Aggregation: Data aggregation is any process in which
information is gathered and expressed in a summary form, for
purposes such as statistical analysis.
 Constraints: Data Constraints are constraints on a database that
require relation to satisfy certain properties. Relations that
satisfy all such constraints are legal relations.
Self Assessment Questions

---------------------- 1. Explain the roles of tablespaces, tables, indexes, constraints
and dimensions in the database structure.
2. What is the use of partitioning?
Answer to Check your Progress Notes
----------------------

1. True ----------------------
2. False ----------------------
3. True
----------------------
Fill in the Blanks.
----------------------
1. A tablespace is composed of one or more operating system files. ----------------------
2. Query performance is driven by available Constraints and Indexes. ----------------------

3. Data analysis typically starts from higher level and goes to the dipper ----------------------
level.
4. ROLL-UP is aggregation operation.
----------------------
5. Time dimension participates in every data warehouse. ----------------------

----------------------
Suggested Reading
----------------------
1. Humphries, Mark, Michael W. Hawkins, Michelle C. Dy. Data
Warehousing: Architecture and Implementation. ----------------------
2. Ullrey, Bruce Russell. Implementing a Data Warehouse: A Methodology ----------------------

that Worked.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Data Warehouse and OLAP Technologies
UNIT
6
Structure:
6.1 Introduction
6.2 OLAP Technology
6.3 ROLAP and MOLAP Processing
6.4 Database Design Methodology
6.4.1 Star Schema
6.4.2 Snowflake Schema
6.5 Server Architectures for Query Processing
Summary
Key Words
Suggested Reading
Data Warehouse and OLAP Technologies 64

Notes
Objectives
----------------------
----------------------
● State the basics of OLAP Technology
---------------------- ● Describe ROLAP and MOLAP Processing
---------------------- ● Analyse the importance of Database Design Methodology
---------------------- ● Explain SQL Extensions
----------------------
----------------------
6.1 INTRODUCTION
----------------------
Data warehousing and on-line analytical processing (OLAP) are essential
---------------------- elements of decision support, which has increasingly become a focus of the
database industry. Many commercial products and services are now available
---------------------- and all of the principal database management system vendors now have offerings
---------------------- in these areas. Decision support places some rather different requirements on
database technology compared to traditional on-line transaction processing
---------------------- applications. In this unit, we will be discussing about the OLAP technology in
detail.
----------------------
---------------------- 6.2 OLAP TECHNOLOGY

---------------------- In computing, online analytical processing, or OLAP is an approach to
answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of
---------------------- the broader category of business intelligence, which also encompasses relational
---------------------- database, report writing and data mining. Typical applications of OLAP
include business reporting for sales, marketing, management reporting, business
---------------------- process management (BPM), budgeting and forecasting, financial reporting and
similar areas, with new applications coming up, such as agriculture. The
---------------------- term OLAP was created as a slight modification of the traditional database
---------------------- term online transaction processing (OLTP).
OLAP tools enable users to analyze multidimensional data interactively
---------------------- from multiple perspectives. OLAP consists of three basic analytical operations:
---------------------- consolidation (roll-up), drill-down and slicing and dicing. Consolidation involves
the aggregation of data that can be accumulated and computed in one or more
---------------------- dimensions. For example, all sales offices are rolled up to the sales department
or sales division to anticipate sales trends. By contrast, the drill-down is a
---------------------- technique that allows users to navigate through the details. For instance, users
---------------------- can view the sales by individual products that make up a region’s sales. Slicing
and dicing is a feature whereby users can take out (slicing) a specific set of data
---------------------- of the OLAP cube and view (dicing) the slices from different viewpoints. These
viewpoints are sometimes called dimensions (such as looking at the same sales
---------------------- by salesman or by date or by customer or by product or by region, etc.)

Databases configured for OLAP use a multidimensional data model, allowing Notes
for complex analytical and ad hoc queries with a rapid execution time. They ----------------------
borrow aspects ofnavigational databases, hierarchical databases and relational
databases. ----------------------
At the core of any OLAP system is an OLAP cube (also called a ----------------------
'multidimensional cube' or a hypercube). It consists of numeric facts
called measures, which are categorized by dimensions. The measures are placed ----------------------
at the intersections of the hypercube, which is spanned by the dimensions as
a vector space. The usual interface to manipulate an OLAP cube is a matrix ----------------------
interface, like Pivot tables in a spreadsheet program, which performs projection ----------------------
operations along the dimensions, such as aggregation or averaging.
The cube metadata is typically created from a star schema or snowflake
----------------------
schema or fact constellation of tables in a relational database. Measures are ----------------------
derived from the records in the fact table and dimensions are derived from
the dimension tables. ----------------------
Each measure can be thought of as having a set of labels, or meta-data ----------------------
associated with it. A dimension is what describes these labels; it provides
information about the measure. ----------------------
A simple example would be a cube that contains a store's sales as a measure, ----------------------
and Date/Time as a dimension. Each Sale has a Date/Time label that describes
more about that sale. ----------------------
For example: ----------------------

----------------------
----------------------
----------------------
----------------------
----------------------
Typical OLAP operations include,

----------------------
rollup - increasing the level of aggregation
 drill-down - decreasing the level of aggregation or increasing detail along ----------------------
one or more dimension hierarchies ----------------------
 slice_and_dice - selection and projection
----------------------
 pivot - re-orienting the multidimensional view of data

----------------------
Other operations
o drill across - involving (across) more than one fact table ----------------------
o drill through - through the bottom level of the cube to its back-end ----------------------
relational tables (using SQL)
----------------------

Notes Multidimensional Databases
---------------------- Multidimensional structure is defined as "a variation of the relational

model that uses multidimensional structures to organize data and express the
---------------------- relationships between data".The structure is broken into cubes and the cubes are
able to store and access data within the confines of each cube. "Each cell within
---------------------- a multidimensional structure contains aggregated data related to elements
---------------------- along each of its dimensions". Even when data is manipulated it remains easy
to access and continues to constitute a compact database format. The data still
---------------------- remains interrelated. Multidimensional structure is quite popular for analytical
databases that use online analytical processing (OLAP) applications. Analytical
---------------------- databases use these databases because of their ability to deliver answers to
---------------------- complex business queries swiftly. Data can be viewed from different angles,
which gives a broader perspective of a problem unlike other models.
----------------------
Aggregations
---------------------- It has been claimed that for complex queries OLAP cubes can produce an
answer in around 0.1% of the time required for the same query on OLTPrelational
----------------------
data. The most important mechanism in OLAP which allows it to achieve such
---------------------- performance is the use of aggregations. Aggregations are built from the fact
table by changing the granularity on specific dimensions and aggregating up
---------------------- data along these dimensions. The number of possible aggregations is determined
by every possible combination of dimension granularities.
----------------------
The combination of all possible aggregations and the base data contains the
---------------------- answers to every query, which can be answered from the data.
---------------------- Because usually there are many aggregations that can be calculated, often
only a predetermined number are fully calculated; the remainder are solved
---------------------- on demand. The problem of deciding which aggregations (views) to calculate
---------------------- is known as the view selection problem. View selection can be constrained by
the total size of the selected set of aggregations, the time to update them from
---------------------- changes in the base data, or both. The objective of view selection is typically to
minimize the average time to answer OLAP queries, although some studies also
---------------------- minimize the update time. View selection is NP-Complete. Many approaches
---------------------- to the problem have been explored, including greedy algorithms, randomized
search, genetic algorithms and A* search algorithm.
----------------------
6.3 ROLAP AND MOLAP PROCESSING
----------------------
Transactions application typically automates day-to-day transactions. Such
----------------------
operations are structured and repetitive, consists of short, atomic and isolated
---------------------- transactions. These transaction operations modify or update the records based
on relations primary key. For such operations, consistency and recoverability
---------------------- are the critical tasks. Querying such transaction database may slow down the
performance on transaction operation.
----------------------
A data warehouse is designed for such ad-hoc queries. It includes the
---------------------- historical, consolidated and summarized data over the period of time. Because
67
of this the size of the data warehouse database is order of magnitude larger Notes
than transaction database. The workload for data warehouse is query intensive, ----------------------
which accesses millions of records to perform joins and aggregates. Query
performance is main parameter for data warehouse design. ----------------------
In order to support complex queries analysis data warehouse is designed ----------------------
with multi dimensional model using star or snowflake schema. As discussed
in earlier chapter that typical data warehousing operations include rollup and ----------------------
drill-down along one or more databases. Even if the operational databases
are tuned to support transaction and little of queries, running on transaction ----------------------
database, such operations may leave the OLTP transactions performance in ----------------------
bad shape. Furthermore decision support system or data warehouse requires
historical data. Such requirement cannot be full filled by OLTP as it contains the ----------------------
current data. Data warehouse usually requires integrating the data from several
heterogeneous resources. Such sources data several different inconsistencies ----------------------
and formats. Accessing such data requires the special implementation methods, ----------------------
which are not provided by OLTP. It is for this reason data warehouse database
is implemented separately. ----------------------
Data warehouse might be implemented using standard relational database ----------------------
called Relational Online Analytical Processing (ROLAP). For this the data is
stored in relational database and accessed efficiently to serve multi dimensional ----------------------
query requirement, whereas in Multidimensional Online Analytical Processing
(MOLAP) servers data is stored in a special data structure to implement special
----------------------
aggregate queries. To the end user the accessibility and working of ROLAP ----------------------
and MOLAP system are same, but this system differs in operational details.
There exist multiple OLAP systems. They are generally distinguished by the ----------------------
first letter of their abbreviation.
----------------------
ROLAP works for the data that is stored in relational databases, for which the
base data and dimension tables are stored as a relational table. This model has a ----------------------
set of APIs which facilitates multi dimensional queries. ROLAP has following
advantages over other structures.
----------------------
 It is more scalable ----------------------

 It is good for both textual and numeric data ----------------------
 It efficiently does the rollup and drilldown operations ----------------------
ROLAP is highly depending on the SQL queries and does most of the
operations at server side, which most of the time slows down the performance. ----------------------
MOLAP is a non relational engine and by virtue of their pre-summarization and ----------------------
loading it has incredibly fast response time to the user. In this chapter we are
focusing on improving the performance of ROLAP. The next section describes ----------------------
the same.
----------------------
----------------------
----------------------

Notes
----------------------
1. Data warehouse might be implemented using standard relational
----------------------
database called Relational Online Analytical Processing (ROLAP).
---------------------- 2. ROLAP works for the data that is stored in relational databases.
---------------------- 3. ROLAP is highly depending on the SQL queries and does most of the
operations at server side.
----------------------
4. MOLAP is a relational engine and by virtue of their pre-summarization
---------------------- and loading it has incredibly fast response time to the user.
----------------------
---------------------- 6. 4 DATABASE DESIGN METHODOLOGY
---------------------- Designing of database is most important responsibility of the software

professionals who are dealing with the database related projects. For this, they
---------------------- follow the Design Methodology. It helps the designer to plan, manage, control,
and evaluate database development projects.
----------------------
Design methodology is a structured approach that uses procedures,
---------------------- techniques, tools, and documentation aids to support and facilitate the process
of design.
----------------------
A design methodology consists of phases each containing a number of steps,
---------------------- which guide the designer in the techniques appropriate at each stage of the
project.
----------------------
Phases of Design Methodology
----------------------
The database design methodology is divided into three main phases. These
---------------------- are:
---------------------- ● Conceptual database design
● Logical database design
----------------------
● Physical database design
----------------------
Conceptual Database Design
---------------------- The process of constructing a model of the information used in an enterprise,
independent of all physical considerations.
----------------------
The conceptual database design phase begins with the creation of a conceptual
---------------------- data model of the enterprise, which is entirely independent of implementation
details such as the target DBMS, application programs, programming languages,
----------------------
hardware platform, performance issues, or any other physical considerations.
----------------------
----------------------

Logical Database Design Notes
It is a process of constructing a model of the information used in an enterprise ----------------------
based on specific data model, but independent of a particular DBMS and other
physical considerations. ----------------------
The logical database design phase maps the conceptual model on to a logical ----------------------
model, which is influenced by the data model for the target database (for example,
the relational model). The logical data model is a source of information for the
----------------------
physical design phase. ----------------------
The output of this process is a global logical data model consisting of an
Entity- Relationship diagram, relational schema, and supporting documentation
----------------------
that describes this model, such as a data dictionary. Together, these represent ----------------------
the sources of information for the physical design process, and they provide
the physical database designer with a vehicle for making tradeoffs that are so ----------------------
important to an efficient database design.
----------------------
Physical Database Design
----------------------
It is a description of the implementation of the database on secondary storage;
it describes the base relations, file organizations, and indexes used to achieve ----------------------
efficient access to the data, and any associated integrity constraints and security
measures. ----------------------
Whereas logical database design is concerned with the what, physical ----------------------
database design is concerned with the how. The physical database design phase
allows the designer to make decisions on how the database is to be implemented.
----------------------
Therefore, physical design is tailored to a specific DBMS. There is feedback ----------------------
between physical and logical design, because decisions taken during physical
design for improving performance may affect the logical data model. ----------------------
For example, decisions taken during physical for improving performance, ----------------------
such as merging relations together, might affect the structure of the logical data
model, which will have an associated effect on the application design. ----------------------
Steps of Physical Database Design Methodology ----------------------
After designing logical database model, the steps of physical database design ----------------------
methodology are as follows:
Step 1: Translate global logical data model for target DBMS It includes
----------------------
operations like the Design of base relation, derived data and design of enterprise ----------------------
constraints.
Step 2: Design physical representation.
----------------------
It includes operations like analyzing of transactions, selection offile ----------------------

organizations, selection of indexes and estimates the disk space requirements.
----------------------
Step 2 is most important part in designing of physical design of database.
It is used to determine the optimal file organizations to store the base relations ----------------------
and the indexes· that are required to achieve acceptable performance, that is, the ----------------------
way in which relations and tuples will be held on secondary storage.

Notes One of the main objectives of physical database design is to store data in an
efficient ay. There are a number of factors that we may use to measure efficiency:
----------------------
● Transaction throughput
---------------------- This is the number of transactions that can be processed in a given time
---------------------- interval.
In some systems, such as airline reservations, high transaction throughput is
---------------------- critical to the overall success of the system.
---------------------- ● Disk storage
This is the amount of disk space required to store the database files. The
----------------------
designer may wish to minimize the amount of disk storage used.
---------------------- ● Response time
---------------------- It is the time required for the completion of a single transaction. From a
user's point of view, we want to minimize response time as much as possible.
----------------------
6.4.1 Star Schema
----------------------
Most data warehouses use a star schema to represent the multidimensional
---------------------- data model. The database consists of a single fact table and a single table for
each dimension. Each tuple in the fact table consists of a pointer (foreign key
---------------------- – often uses a generated key for efficiency) to each of the dimensions that
provide its multidimensional coordinates, and stores the numeric measures for
----------------------
those coordinates. Each dimension table consists of columns that correspond
---------------------- to attributes of the dimension. Figure 6.3 shows an example of a star schema.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Fig. 6.3: A Star Schema
---------------------- Star schemas do not explicitly provide support for attribute hierarchies.
---------------------- 6.4.2 Snowflake Schema

Snowflake schemas provide a refinement of star schemas where the
----------------------
dimensional hierarchy is explicitly represented by normalizing the dimension
---------------------- tables, as shown in Figure 6 .4. This leads to advantages in maintaining
the dimension tables. However, the denormalized structure of the
---------------------- dimensional tables in star schemas may be more appropriate for browsing the
dimensions.
71
Fact constellations are examples of more complex structures in which Notes
multiple fact tables share dimensional tables. For example, projected expense ----------------------
and the actual expense may form a fact constellation since they share many
dimensions. ----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 6.4: A Snowflake Schema ----------------------
In addition to the fact and dimension tables, data warehouses store selected ----------------------
summary tables containing pre-aggregated data. In the simplest cases, the
pre-aggregated data corresponds to aggregating the fact table on one or more ----------------------
selected dimensions. Such pre-aggregated summary data can be represented in ----------------------
the database in at least two ways. Let us consider the example of a summary
table that has total sales by product by year in the context of the star schema ----------------------
of Figure 6.3. We can represent such a summary table by a separate fact table
which shares the dimension Product and also a separate shrunken dimension ----------------------
table for time, which consists of only the attributes of the dimension those make ----------------------
sense for the summary table (i.e., year).
Alternatively, we can represent the summary table by encoding the aggregated
----------------------
tuples in the same fact table and the same dimension tables without adding new ----------------------
tables. This may be accomplished by adding a new level field to each dimension
and using nulls: We can encode a day, a month or a year in the Date dimension ----------------------
table as follows: (id0, 0, 22, 01, 1960) represents a record for Jan 22, 1960,
(id1, 1, NULL, 01, 1960) represents the month Jan 1960 and (id2, 2, NULL,
----------------------
NULL, 1960) represents the year 1960. The second attribute represents the new ----------------------
attribute level: 0 for days, 1 for months, 2 for years. In the fact table, a record
containing the foreign key id2 represents the aggregated sales for a Product in ----------------------
the year 1960. The latter method, while reducing the number of tables, is often
a source of operational errors since the level field needs be carefully interpreted.
----------------------
----------------------
6.5 SERVER ARCHITECTURES FOR QUERY
PROCESSING ----------------------
Traditional relational servers were not geared towards the intelligent use of
----------------------
indices and other requirements for supporting multidimensional views of data. ----------------------
However, all relational DBMS vendors have now moved rapidly to support

Notes these additional requirements. In addition to the traditional relational servers,
there are three other categories of servers that were developed specifically for
---------------------- decision support.
---------------------- ● Specialized SQL Servers
Redbrick is an example of this class of servers. The objective here is to
---------------------- provide advanced query language and query processing support for SQL queries
---------------------- over star and snowflake schemas in read-only environments.
● ROLAP Servers
----------------------
These are intermediate servers that sit between a relational back end
---------------------- server (where the data in the warehouse is stored) and client front end tools.
Microstrategy is an example of such servers. They extend traditional relational
---------------------- servers with specialized middleware to efficiently support multidimensional
---------------------- OLAP queries, and they typically optimize for specific back end relational
servers. They identify the views that are to be materialised, rephrase given user
---------------------- queries in terms of the appropriate materialized views, and generate multi-
statement SQL for the back end server. They also provide additional services
---------------------- such as scheduling of queries and resource assignment (e.g., to prevent runaway
---------------------- queries). There has also been a trend to tune the ROLAP servers for domain
specific ROLAP tools. The main strength of ROLAP servers is that they exploit
---------------------- the scalability and the transactional features of relational systems. However,
intrinsic mismatches between OLAP-style querying and SQL (e.g., lack of
---------------------- sequential processing, column aggregation) can cause performance bottlenecks
---------------------- for OLAP servers.
● MOLAP Servers
----------------------
These servers directly support the multidimensional view of data through
---------------------- a multidimensional storage engine. This makes it possible to implement front-
end multidimensional queries on the storage layer through direct mapping. An
---------------------- example of such a server is Essbase (Arbor). Such an approach has the advantage
---------------------- of excellent indexing properties, but provides poor storage utilization, especially
when the data set is sparse. Many MOLAP servers adopt a 2-level storage
---------------------- representation to adapt to sparse data sets and use compression extensively.
In the two-level storage representation, a set of one or two dimensional sub-
---------------------- arrays that are likely to be dense are identified, through the use of design tools
---------------------- or by user input, and are represented in the array format. Then, the traditional
indexing structure is used to index onto these “smaller” arrays. Many of the
---------------------- techniques that were devised for statistical databases appear to be relevant for
MOLAP servers.
----------------------
----------------------
Several extensions to SQL that facilitate the expression and processing
---------------------- of OLAP queries have been proposed or implemented in extended relational
servers. Some of these extensions are described below.
----------------------
----------------------

● Extended family of aggregate functions Notes
These include support for rank and percentile (e.g., all products in the top 10 ----------------------
percentile or the top 10 products by total Sale) as well as support for a variety
of functions used in financial analysis (mean, mode, median). ----------------------
● Reporting Features ----------------------

The reports produced for business analysis often requires aggregate features ----------------------
evaluated on a time window, e.g., moving average. In addition, it is important
to be able to provide breakpoints and running totals. Redbrick’s SQL extensions ----------------------
provide such primitives.
----------------------
● Multiple Group-By
Front end tools such as multidimensional spreadsheets require grouping by ----------------------
different sets of attributes. This can be simulated by a set of SQL statements that ----------------------
require scanning the same data set multiple times, but this can be inefficient.
Recently, two new operators, Rollup and Cube, have been proposed to augment ----------------------
SQL to address this problem. Thus, Rollup of the list of attributes (Product, Year,
and City) over a data set results in answer sets with the following applications ----------------------
of group by: (a) group by (Product, Year, City) (b) group by (Product, Year), ----------------------
and (c) group by Product. On the other hand, given a list of k columns, the
Cube operator provides roup-by for each of the 2k combinations of columns. ----------------------
Such multiple group-by operations can be executed efficiently by recognizing
commonalties among them30. Microsoft SQL Server supports Cube and Rollup. ----------------------
● Comparisons ----------------------
An article by Ralph Kimball and Kevin Strehlo provides an excellent ----------------------
overview of the deficiencies of SQL in being able to do comparisons that are
common in the business world, e.g., compare the difference between the total ----------------------
projected sale and total actual sale by each quarter, where projected sale and
actual sale are columns of a table31. A straightforward execution of such queries ----------------------
may require multiple sequential scans. The article provides some alternatives to ----------------------
better support comparisons. A recent research paper also addresses the question
of how to do comparisons among aggregated values by extending SQL32. ----------------------
----------------------
----------------------
1. Redbrick is an example of specialised class servers. ----------------------
2. MOLAP servers directly support the multidimensional view of data
through a multidimensional storage engine. ----------------------
----------------------
Activity 1 ----------------------
----------------------
Find out how OLTP applications automate clerical data processing tasks.

Notes Summary
----------------------  OLTP applications typically automate clerical data processing
---------------------- tasks such as order entry and banking transactions that are the
bread-and-butter day-to-day operations of an organization.
---------------------- -  Databases configured for OLAP use a multidimensional data
--------------------- model, allowing for complex analytical and ad hoc queries with a
rapid execution time. They borrow aspects of navigational
---------------------- databases, hierarchical databases and relational databases.
----------------------
 The multidimensional data model grew out of the view of
---------------------- business data popularized by PC spreadsheet programs that
were extensively used by business analysts. The spreadsheet is
---------------------- still the most compelling front-end application for OLAP.
-
---------------------  ROLAP works for the data that is stored in relational
databases, for which the base data and dimension tables are
---------------------- stored as a relational table. This model has a set of APIs which
---------------------- facilitates multidimensional queries. ROLAP has following
advantages over other structures.
----------------------
---------------------- Keywords
----------------------
 Data Warehouse: The data warehouse supports on-line
---------------------- analytical processing (OLAP), the functional and performance
- requirements of which are quite different from those of the on-
-------------------- line transaction processing (OLTP) applications traditionally
---------------------- supported by the operational databases.
-
 Pivot: Re-orienting the multidimensional view of data
---------------------
 Fact constellations: Examples of more complex structures in
---------------------- which multiple fact tables share dimensional tables.
-
---------------------  ROLAP Servers: These are intermediate servers that sit
between a relational back end server (where the data in the
---------------------- warehouse is stored) and client front end tools.
----------------------  Extended family of aggregate functions: These include

support for rank and percentile (e.g., all products in the top 10
percentile or the top 10 products by total Sale) as well as
support for a variety of functions used in financial analysis
----------------------
----------------------
----------------------
Self-Assessment Questions Notes
----------------------
1. What do you understand by Data Warehouse and OLAP Technologies?
----------------------
2. Write a note on applications of ROLAP and MOLAP Processing in
business. ----------------------
3. How is Database Design Methodology important to business organisations? ----------------------
4. Write a short note on Server Architecture for Query Processing. ----------------------
Answers to Check your Progress ----------------------

1. True ----------------------
2. True
----------------------
3. True
4. False
----------------------

1. True ----------------------
2. True
----------------------
Suggested Reading ----------------------
1. Dzeroski, Saso and Nada Lavrac. 2001. Relational Data Mining. Berlin: ----------------------
Springer. ----------------------
2. Goswami, Gunjan. Data mining and data warehousing. S.K. Kataria and
sons. ----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Introduction to Data Mining
UNIT
7
Structure:
7.1 Introduction
7.2 Data Mining
7.2.1 Data Mining and Knowledge Discovery
7.2.2 Architecture of a Typical Data Mining System
7.3 Motivating Challenges
7.4 Data Mining Functionalities
7.4.2 Mining Frequent Patterns, Associations and Correlations
7.4.4 Cluster Analysis
7.4.5 Outlier Analysis
7.5 Classification of Data Mining Systems
7.6 Data Mining Task
7.7 Major Issues in Data Mining
Summary
Key Words
Suggested Reading
Introduction to Data Mining 78

Notes
Objectives
----------------------
----------------------
● State the meaning of data mining and its need
---------------------- ● Explain data mining functionalities
---------------------- ● Analyse different areas where data mining techniques are used
---------------------- ● Differentiate between interdisciplinary fields

● Describe data mining tasks and issues in data mining
----------------------
----------------------
---------------------- 7.1 INTRODUCTION
---------------------- Data is everywhere. Every day, a huge amount of data is generated by the
---------------------- Web, business, the IT industry, sales, science, engineering, etc. This industry-
generated data is heterogeneous and stored in different forms in databases. These
---------------------- large and numerous data repositories are beyond human ability to understand
and analyse for decision making. This is a situation that might best be described
---------------------- as ‘data rich but information poor’. Extracting meaningful information from
---------------------- this data is a challenging job.
Most of the time, important decisions that are taken are based on the decision
----------------------
maker’s perception of data rather than information that is based on the data in
---------------------- the data repository as they do not have any powerful tool to extract and analyse
data. Traditional data analysis tools and techniques fail for such data because of
---------------------- its massive size and non-traditional nature. To solve this problem a new method
has been developed, which is Data Mining. Data mining technology blends the
----------------------
traditional method of data analysis, which is suitable for processing of a large
---------------------- amount of data.
---------------------- 7.2 DATA MINING

---------------------- Data mining is a process of automatically discovering information, data
---------------------- patterns, relationships, tendencies. In other words it is “mining” knowledge
from large data. Data mining also provides the capabilities to predict the
---------------------- outcome of future observations based on previous data.
---------------------- For example, to predict whether customers will buy new policies or not,
based on previous data of a similar type of customers.
----------------------
In data mining the raw data is converted into useful information, that is,
---------------------- knowledge is extracted from available data. There are other similar terms
used, related to data mining, such as knowledge mining, knowledge extraction,
---------------------- pattern analysis and data dredging.
----------------------

7.2.1 Data Mining and Knowledge Discovery Notes
Knowledge Discovery on databases is a process of converting raw data ----------------------
into meaningful information called ‘knowledge’. This consists of a series of
processes from data collection through post-processing to deriving results. ----------------------
----------------------
----------------------
----------------------
● Selection: Retrieve data from various sources for data mining. ----------------------
● Preprocessing: It involves data cleansing, that is, removal of noisy and ----------------------
inconsistent data.
●
----------------------
Transformation: Convert to the common format or to a new format.
● Data Mining: Techniques are applied to extract a pattern to get the desired ----------------------
results. ----------------------
● Interpretation/Evaluation: Visualisation or representation is used to
present results to the user in a meaningful manner. ----------------------
7.2.2 Architecture of a Typical Data Mining System ----------------------

The architecture of a typical data mining system has the following major ----------------------
components:
● Database, data warehouse, Worldwide Web or any other repository
----------------------
This is the data repository from where data is retrieved and preprocessing ----------------------
is performed on it. This is one or multiple data bases, data warehouses or ----------------------
some other repository.
● Database or data warehouse server: ----------------------
It is responsible for providing relevant data based on users’ data mining ----------------------
request.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 7.1: Architecture of a Typical Data Mining System
---------------------- Knowledge Base
---------------------- It is domain knowledge used to search or evaluate any interesting pattern.
This includes the concept of hierarchies, which are used to place attributes at
---------------------- different levels that can be used to assess a pattern from the data.
---------------------- Data mining Engine
---------------------- This consists of a set of techniques such as classification, association,
correlation analysis, prediction, outlier analysis, etc.
----------------------
Pattern Evaluation Module
----------------------
This module is used to analyse and interact with data mining modules to
---------------------- search for an interesting pattern. It filters data to discover an interesting pattern.
Based on the implementation techniques of the data mining used for data
---------------------- analysis, the pattern evaluation module may be integrated with the mining
module.
----------------------
User Interface
----------------------
This module is a communicator between the user and the data mining system.
---------------------- The user can interact with the data mining system to search for a pattern or
any interested data by specifying a data mining query or task. This component
---------------------- also helps the user to look through databases and data warehouse templates/
---------------------- schemas, evaluate minimised patterns and visualise a pattern.
----------------------

Notes
1. A ________ is process of automatically discovering information, data ----------------------

patterns, relationships, tendencies or mining knowledge from large
data.
----------------------
2. A _______ on databases is the process of converting raw data into ----------------------

meaningful information called ‘knowledge’.
----------------------
3. The next stage to data selection in the KDD process is ____.
----------------------
7.3 MOTIVATING CHALLENGES ----------------------

----------------------
Traditional data analysis techniques have some practical challenges to face
new data sets. This facilitates the development of data mining techniques. ----------------------
● High Dimensionality ----------------------
Data sets with more than hundreds or thousands of attributes are available.
Traditional data analysis techniques fail to handle such a high dimension ----------------------
dataset. Also, complexity of data increases with increase in dimension. ----------------------
Consider the temperature dataset containing temperature records of all
cities. If the temperature is recorded repeatedly at specific intervals then ----------------------
the number of dimensions also increases. So, there is the need of data
mining techniques, which can handle high-dimension data. ----------------------
● Scalability ----------------------
Nowadays, datasets with gigabytes, terabytes and petabytes are becoming ----------------------
common. To handle such large volumes of data, scalable data mining
algorithms are required. Using sampling or developing parallel distributing ----------------------
algorithm, scalability can be improved. Sampling and parallelisation are
used to overcome scalability problems.
----------------------
● Heterogeneous Data ----------------------

Traditional data analysis methods require data attributes of the same type. ----------------------
In keeping with the demand, business data sets are available at multiple
locations and can be in different forms. They can also be complex in nature. ----------------------
For example, whether the data base or data collected from Web pages ----------------------
contains text data or a three dimensional data structure, including graphs,
etc., can make it complex in nature. So there is a need for techniques, ----------------------
which can consolidate and handle heterogeneous and complex data.
●
----------------------
Nontraditional Analysis
Most of statistical approaches are based on hypothesis testing. This is ----------------------
time consuming. Current data analysis may require the building of ----------------------
thousands of hypotheses and evaluation of the same. Also, data available

Notes today is unstructured and requires a lot of processing by using traditional
methods. Consequently, some data mining techniques are developed to
---------------------- automate the process of hypothesis generation and evaluation.
----------------------
----------------------
---------------------- Multiple Choice Multiple Response.
1. For what can Heterogeneous data be used?
----------------------
i. To draw management decisions
---------------------- ii. To confirm that data exists
---------------------- iii. To analyse data for expected relationships
---------------------- iv. To create a new data warehouse
2. Which of the following are effective tools to attack the scalability
----------------------
problem?
---------------------- i. Sampling
---------------------- ii. Parallelisation
---------------------- iii. Collection
iv. None of the above
----------------------
----------------------
7.4 DATA MINING FUNCTIONALITIES
----------------------
Data mining functionalities are used to specify various activities of data
----------------------
mining tasks such as the kind of pattern finding, categorisation, prediction,
---------------------- association, etc.
----------------------
Data can be collected and stored with concepts and classes specific to that
---------------------- data.
---------------------- For example, in shopping malls each and every product is placed in some
category. TV, computers, printers, etc., are categorised as electronic products.
---------------------- The concepts for the customer include high-budget, medium-budget and low-
---------------------- budget customers.
These two approaches are described using data characterisation and data
---------------------- discrimination.
---------------------- Data characterisation represents summarisation of the characteristics or
features of a target class of data. For example, using the data-mining approach
----------------------
system can be able to produce a summary of the characteristics of customers
---------------------- who are categorised under ‘high budget’ for electronic products.
Data discrimination is the comparison of the general features of the target
----------------------

class with those of one or multiple classes or sets of contrasting classes. Notes
For example, a comparison can be made of the general features of one ----------------------
product whose sale has increased by 10% with the features of a product whose
sale has decreased by 30% during the same period. ----------------------
7.4.2 Mining Frequent Patterns, Associations and Correlations ----------------------

Frequent patterns are those patterns that occur frequently in data. There ----------------------
are different kinds of frequent patterns: frequent item sets patterns, frequent
sequential patterns and frequent structured patterns. ----------------------
Frequent item sets patterns: This is a set of items that are most frequently ----------------------
paired together in a transaction database of the customer. For example, if the
customer buys bread, then it is highly probable that customer also buys milk. ----------------------
Frequent sequential patterns: This is a set of items that the customer prefers ----------------------
to buy in a sequence or in some order. For example, the customer will first buy
a computer and then prefer to purchase software for that computer. ----------------------
Frequent structured patterns: It refers to structural forms such as graphs, ----------------------

trees or any other representation. This may be a subset of items, which occur
repeatedly and are called ‘structure patterns’.
----------------------
Mining frequency patterns helps find association and correlation within ----------------------
data.
----------------------
----------------------
Classification is the technique to find the class of an object whose label is
unknown, based on the historical model. A model is constructed based on data ----------------------
sets whose labels are known.
----------------------
For example, we can build a classification model to categorise bank loan
applications as either ‘safe’ or ‘risky’ or a prediction model to predict the ----------------------
expenditure in dollars of potential customers on computer equipment, given
their income and occupation. Another variation to classification is numerical
----------------------
prediction, to predict the numerical outcome of classification rather than class. ----------------------
Classification and Prediction Issues
----------------------
The major issue in preparing the data for Classification and Prediction
involves the following activities: ----------------------
● Data Cleaning - Data cleaning involves removing the noise and treatment ----------------------
of missing values. The noise is removed by applying smoothing techniques
and the problem of missing values is solved by replacing a missing value ----------------------
with most commonly occurring value for that attribute. ----------------------
● Relevance Analysis - Database may also have the irrelevant attributes.
Correlation analysis is used to know whether any two given attributes are ----------------------
related. ----------------------
● Data Transformation and reduction - The data can be transformed by
any of the following methods. ----------------------

Notes ● Normalization - The data is transformed using normalization.
Normalization involves scaling all values for given attribute in order
---------------------- to make them fall within a small specified range. Normalization is used
---------------------- when in the learning step, the neural networks or the methods involving
measurements are used.
---------------------- ● ● Generalization -The data can also be transformed by generalizing it to the
higher concept. For this purpose we can use the concept hierarchies.
----------------------
Comparison of Classification and Prediction Methods
---------------------- The criteria for comparing methods of Classification and Prediction are
---------------------- given below
● Accuracy - Accuracy of classifier refers to ability of classifier predict the
----------------------
class label correctly and the accuracy of predictor refers to how well a
---------------------- given predictor can guess the value of predicted attribute for a new data.
● Speed - This refers to the computational cost in generating and using the
----------------------
classifier or predictor.
---------------------- ● Robustness - It refers to the ability of classifier or predictor to make
correct predictions from given noisy data.
----------------------
●● Scalability
● - Scalability refers to ability to construct the classifier or
---------------------- predictor efficiently given large amount of data.
Interpretability - This refers to the to what extent the classifier or predictor
---------------------- ● understand.
---------------------- 7.4.4 Cluster Analysis
---------------------- Clustering is the technique of grouping similar data objects together.
In clustering, class labels are unknown. Objects within a cluster have high
---------------------- similarity but they are not similar to objects in other clusters.
---------------------- Applications of Cluster Analysis
----------------------  Clustering Analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
----------------------
processing.

----------------------  Clustering can also help marketers discover distinct groups
in their customer basis. And they can characterize their
---------------------- customer groups based on purchasing patterns.

----------------------  In field of biology it can be used to derive plant and animal
taxonomies, categorize genes with similar functionality and gain
---------------------- insight into structures inherent in populations.

 Clustering also helps in identification of areas of similar land
----------------------
use in an earth observation database. It also helps in the
identification of groups of houses in a city according house type,

value, geographic location.

  Clustering also helps in classifying documents on the web for
information discovery.
7.4.5 Outlier Analysis Notes
Outliers are those objects that do not satisfy general behavior or models of ----------------------
data objects. These are sometimes referred to as noise in data sets. But, this
is useful to detect abnormal behavior of data. For example, it can be used in ----------------------
fraud detection for credit cards, if a higher amount of purchase is made than ----------------------
the regular amount, charged by that account. Outlier can be detected with the
location of purchase or frequency of purchase. ----------------------

----------------------
----------------------
Fill in the Blanks.
----------------------
1. ________ represents summarisation of the characteristics or features
of a target class of data. ----------------------
2. _______ is the technique of grouping similar data objects together. ----------------------
3. ______ refers to those objects that do not satisfy general behaviour ----------------------
or models of data objects.
----------------------
Activity 1 ----------------------
----------------------
Collect the data of the age, education and salary for 100 people and draw
at least five inferences. ----------------------
----------------------
7.5 CLASSIFICATION OF DATA MINING SYSTEMS ----------------------
Data mining is considered an interdisciplinary field. It includes a set of ----------------------
various disciplines, such as statistics, database systems, machine learning,
visualisation and information science. Owing to such diversity, classification ----------------------
of the data mining system helps users to understand the system and match their
requirements with such systems. ----------------------
Data mining systems can be categorised according to various criteria, as ----------------------

follows:
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 7.2: Classification of Data Mining Systems
----------------------
a. Classification according to types of databases mined: A database system
---------------------- can be classified as a ‘type of data’ or ‘use of data’ model or ‘application
of data’, for that matter.
----------------------
b. Classification according to the types of knowledge mined: This is based
---------------------- on functionalities such as characterisation, association, discrimination
and correlation, prediction, outlier analysis, etc.
----------------------
c. Classification according to the type of techniques utilised: This technique
---------------------- involves the degree of user interaction or the technique of data analysis
involved. For example, database-oriented or data-warehouse-oriented
---------------------- techniques, machine learning, statistics, visualisation, pattern recognition,
---------------------- neural networks, etc.
d. Classification according to the applications adapted: This involves
----------------------
domain-specific application. For example, the data mining systems can
---------------------- be tailored accordingly for telecommunications, finance, DNA, stock
markets, e-mail and so on.
----------------------

---------------------- 1. Which of the following are data mining activities?
---------------------- i. Finding records of students whose grade equals to A
---------------------- ii. Arranging records of employees according to their designation
iii. Predicting future sales of products using historical records
----------------------
iv. Arranging data only
----------------------
----------------------
----------------------

7.6 DATA MINING TASK Notes
The data mining task is generally divided into two categories:
----------------------
1. Predictive Task ----------------------

This type of task is to predict the value of one attribute based on the values ----------------------
of other attributes. Attribute use to predict is known as the ‘explanatory’
or ‘independent’ variable and the attribute to be predicted is known as the ----------------------
‘target’ variable. ----------------------
2. Descriptive Task
----------------------
This task is descriptive in nature and is used to generate a pattern and find
out the underlying relationship among different kinds of data. This task is ----------------------
self-explanatory in nature.
----------------------
DM Task Primitives
----------------------
These DM task primitives depend on the data set to be mined, the kind of
knowledge that is to be mined, background knowledge, measures and techniques ----------------------
(presentation and visualisation) to be used to display results, etc.
----------------------
7.7 MAJOR ISSUES IN DATA MINING ----------------------
Major issues in data mining regarding mining methodology, user interaction, ----------------------
performance and diverse data types are as follows:
Mining different types of knowledge
----------------------
•
According to the need of users, each user may be interested in

----------------------
different types of knowledge mining. This covers a wide area of ----------------------
data analysis such as classification, association, correlation, data
characterisation, discrimination and evolution analysis. ----------------------
• Query languages for data mining
Relational database language accepts an ad hoc query from the ----------------------
----------------------
user for retrial of data. Similarly, ad hoc data mining language
needs to be developed to allow the user to work on datasets to be ----------------------
mined, the kinds of knowledge that are to be mined and the conditions
that are to be applied.
• Noisy data handling ----------------------
The data in the database contains incomplete data or is called missing ----------------------
data for some records or noisy data, which misleads the data mining ----------------------
process. As a result, accuracy of the result may be poor. Identifying
and cleansing of such data before applying mining is, therefore, a ----------------------
major challenge.
• Presentation of data mining results
Results, that is, knowledge must be presented in such a way that it ----------------------
should be easily adoptable by users. To represent knowledge ----------------------
visualisation techniques are used or it is represented in the form of ----------------------
trees, tables, rules, graphs, charts, matrices or curves.
Notes • Performance issue:
---------------------- To extract data from the large database, there is the need to use
efficient data mining algorithms. The running time of an algorithm
---------------------- must be easily predictable and acceptable.
Parallel, distributed data mining algorithms are designed to handle large
----------------------
and complex databases. This method divides data into partition and processes
---------------------- simultaneously. Then, results from these processes are merged. Another
technique called incremental techniques is designed, which allows one to
---------------------- incorporate database updates directly without mining the entire database from
scratch. This involves a high cost of data mining techniques.
----------------------
Databases may contain complex data types, hypertext, data objects and
---------------------- transactional data. Knowledge extraction can be from structured, unstructured
---------------------- or semi-structured data. With heterogeneous data and data mining requirement
it is unrealistic to expect a single system to mine all types of data. Specific data
---------------------- mining systems can be designed for specific data. It is a challenge to design
efficient and effective data mining systems for such data.
----------------------
----------------------
---------------------- 1. Clustering belongs to which of the following types of Data Mining?
---------------------- i. Selective
ii. Predictive
----------------------
iii. Descriptive
---------------------- iv. Additive
2. Prediction is
----------------------
i. To determine the future outcome rather than current behaviour
---------------------- ii. Discipline in statistics
---------------------- iii. One of several possible entries within a database table
iv. Selection from data
----------------------
3. Attribute to be predicted is known as
---------------------- i. Target variable
ii. Explanatory variable
----------------------
iii. Knowledge
---------------------- iv. Statistics
----------------------
---------------------- Activity 2
----------------------
Give an example where data mining is crucial to the success of business.
---------------------- Also, discuss different data mining functionalities needed by business.

Summary Notes
----------------------
● Data mining is a process of automatically discovering information, data
patterns, relationships, tendencies or “mining” knowledge from large ----------------------
data. Data mining also provides capabilities to predict the outcome of
future observation based on previous data.
----------------------
● Knowledge discovery involves selection, pre-processing, transformation, ----------------------

data mining and interpretation/evaluation.
----------------------
● The architecture of a typical data mining system includes a database and/
or data warehouse and servers, a data mining engine and pattern evaluation ----------------------
module and a graphical user interface. ----------------------
● Traditional data analysis techniques have some practical challenges
to face new data sets. This facilitates the development of data mining ----------------------
techniques. ----------------------
● High dimensional data, scalability, heterogeneous data and the requirement
of non-traditional data analysis results in data mining techniques. ----------------------
● Data mining functionalities are used to specify various activities of data ----------------------
mining tasks such as the kind of pattern finding, categorisation, prediction,
association, etc.
----------------------
● Data mining is an interdisciplinary field. It includes a set of various ----------------------

disciplines such as database systems, statistics, machine learning,
visualisation and information science. Owing to such diversity
----------------------
classification of data mining systems helps users to understand the system ----------------------
and match their requirements with it. The data mining task is generally
divided into two categories: predictive and descriptive. ----------------------
● Major issues in data mining are regarding mining methodology, user ----------------------
interaction, performance and diverse data types.
----------------------
Keywords ----------------------
● Knowledge: Meaningful information ----------------------

● Heterogeneous: Diverse in content ----------------------
----------------------
----------------------
1. Define data mining.
2. Describe data mining architecture.
----------------------
3. Define data mining functionalities. ----------------------
4. Describe steps for knowledge discovery. ----------------------
5. Discuss the major issues in data mining.

Notes Answers to Check your Progress
---------------------- 1. Data Mining is the process of automatically discovering information, data

patterns, relationships, tendencies or mining knowledge from large data.
---------------------- 2. A Knowledge Discovery on databases is the process of converting raw
---------------------- data into meaningful information called ‘knowledge’.
3. The next stage to data selection in the KDD process is preprocessing.
----------------------
----------------------
Multiple Choice Multiple Response.
---------------------- 1. For what can heterogeneous data be used?
---------------------- i. To draw management decisions.
---------------------- iii. To analyse data for expected relationships
2. Which of the following are effective tools to attack the scalability
---------------------- problem?
---------------------- i Sampling
---------------------- ii. Parallelisation
----------------------
Fill in the Blanks.
----------------------
1. Data characterisation represents summarisation of the characteristics or
---------------------- features of a target class of data.
---------------------- 2. Clustering is the technique of grouping similar data objects together.

3. Outliers are those objects that are not satisfying general behaviour or the
----------------------
model of data objects.
---------------------- 1. Which of the following are data mining activities?
iii. Predicting the future sales of products using historical records
----------------------
----------------------
---------------------- 1. Clustering belongs to which of the following types of Data Mining?
---------------------- iii. Descriptive
---------------------- 2. Prediction is
i. To determine future outcome rather than current behaviour
----------------------

92 Introduction to Data Mining
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Mining.
Tan, Pang-Ning, Vipin Kumar, Michael Steinnach. Introduction to Data 2.
----------------------
Han, Jiawei, Micheline Kamber. Data Mining: Concepts and Techniques. 1.
----------------------
Suggested Reading
----------------------
---------------------- Target variable i.
Notes Attribute to be predicted is known as 3.
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Mining Association Rules
UNIT
Structure:
8.1
8.2
Introduction
Association Rule Mining
8
8.2.1 Association Rules
8.3 Mining Single-Dimensional Boolean Association Rules from
Transactional Databases
8.3.3 Frequent Pattern Growth (FP-growth) Algorithm
8.4 Mining Multilevel Association Rules from Transaction Databases,
Relational Databases
8.5 Application of Association Mining
Summary
Key Words
Suggested Reading
Mining Association Rules 94

Notes
Objectives
----------------------
----------------------
● Understand what is association mining
---------------------- ● Explain frequent item set, support and confidence
---------------------- ● Describe single-dimensional and multidimensional association mining
● Describe various approaches for Multilevel Association Rules
----------------------
----------------------
8.1 INTRODUCTION
----------------------
---------------------- This chapter introduces the concepts of association mining. This chapter
address problem how to find frequent itemset from large dataset where the data
---------------------- are either transactional or relational. It explains Apriori algorithm to discover
---------------------- interesting pattern. This chapter also describes mining single level and at
multilevel from Transaction Databases.
----------------------
8.2 ASSOCIATION RULE MINING
----------------------
---------------------- Association mining is discovery of relationship between various item sets in
transactional and relational database.
---------------------- A deriving association and correlations from data has received a great
---------------------- attention. In 1993, a problem formulated by Agrawal et.al called market basket
problem.
---------------------- It is technique of finding out relationship within an item set that how items
---------------------- are associated with each other.
Association is created by measuring frequency of the terms in same regards.
----------------------
Terms used in association mining:
---------------------- ● Item set: It is group of one or many items. .
---------------------- ● K-item set: An item set containing k items.
● Frequent- patterns: These are pair of items that frequently appear in data.
----------------------
There are different kinds of frequent patterns:
----------------------
● A frequent item set pattern
---------------------- This is collection of items that repeatedly appear together in a transactional
data set. For example, bread and butter.
----------------------
● A frequent sequential pattern
----------------------
This is the pattern that describes customer’s trends to purchase in
---------------------- (frequent) sequential pattern. For example, first a customer purchase
mobile, followed by a memory card.
----------------------

● A frequent structured pattern Notes
Based on frequency of itemset, pattern can be derive to represent data in ----------------------
form of graph, trees or lattices... This structure may be combined with
item sets or subsequence. It is called a (frequent) structured pattern only if ----------------------
a substructure occurs frequently. In given structure if substructure occurs ----------------------
frequently, it is called frequent structured pattern.
8.2.1 Association Rules ----------------------
It is an implication of the form A=>B, where A and B are subsets of attribute ----------------------
set and A∩B= ϕ
----------------------
An association rule is of the form: X => Y
X => Y: if X is present than there is highly chances of Y to present.
----------------------
X is often referred as antecedent and Y as the respondent. ----------------------

For example: Bread=> Butter ----------------------
If customer is buying bread then it is likely to buy butter. These two items ----------------------
often purchase together.
If customer is buying items a, b, c then it is likely to buy item d. Retailers can
----------------------
place item ‘d’ accordingly. ----------------------
Association mining helps to identify items that are connected to each other,
but it does not help to find nature of the connection.
----------------------
Every association rule has a support and a confidence. ----------------------

● Support ----------------------
– Fraction of transactions that contain an itemset
----------------------
● Support count (s)
----------------------
– Frequency of occurrence of an itemset
An item can be frequent item if its support is equal or greater than predefined ----------------------
minimum value. ----------------------
An item set is called frequent if its support is equal or greater than an
agreed upon minimal value – the support threshold value. ----------------------
Confidence ----------------------
Confidence is based on conditional probability. If item X is present in ----------------------
transaction there is high confidence of presence of Y.
----------------------
Confidence is defined as:-
Confidence(X=>Y) equals support(X, Y) / support(X) ----------------------
Consider records with high support and high confidence. “A rule with low ----------------------
confidence is not meaningful.” ----------------------
----------------------
Notes Example: Database with transactions (customer_#: item1, item2,)
---------------------- 1: 2, 45, 8.
2: 3, 4, 8.
----------------------
3: 6, 4, 8, 10.
----------------------
4: 1, 8, 7.
---------------------- 5: 1, 5, 8.
---------------------- 6: 2, 5, 6.
---------------------- 7: 3, 4minimum support threshold, 6, and 8.

8: 3, 2, 4.
----------------------
9: 4, 5, 7, 8.
----------------------
10: 3, 8, 7, 10.
---------------------- If (X -> Y)
---------------------- Find Conf ({4} => {8} ) ?
---------------------- supp({4}) = 6,
supp({8}) = 7,
----------------------
supp({4,8}) = 5,
---------------------- then conf( {4} => {8} ) { Confidence(X=>Y) equals support(X,Y) /
---------------------- support(X)} = 5/6
=0.83 or 83%
----------------------
----------------------
----------------------
1. The left hand side of an association rule is called __________.
---------------------- i. consequent.
---------------------- ii. onset.
iii. antecedent.
----------------------
iv. precedent.
----------------------
2. All set of items whose support is greater than the user-specified
---------------------- minimum support are called as _____________.
---------------------- i. border set.
---------------------- ii. frequent set.
iii. maximal frequent set.
----------------------
iv. lattice.
----------------------

8.3 MINING SINGLE-DIMENSIONAL BOOLEAN Notes
ASSOCIATION RULES FROM TRANSACTIONAL ----------------------
DATABASES
----------------------
Single-dimensional association rule: ----------------------
Single-dimensional association rules has a single predicate.
----------------------
For example, consider following transaction from AllElectronics database.
Buys (X,”Laptop”)=> buys(X,”Software”)
----------------------
[support=1% , confidence=50%] ----------------------

where X is a variable representing a customer. ----------------------
Above rule interprets that, If customer buys computer, there is 50% confidence ----------------------
that customer will also buy software.
Support is 1% that means 1% of total transaction shows that computer and ----------------------
software are purchased together. ----------------------
Multidimensional association rule:
----------------------
This is an association between more than one attribute is called
multidimensional association mining. ----------------------
Consider following example, ----------------------
Age (X,”21..30) ^ income (X,”20K…29K”) => buys (X,”Computer”) ----------------------
[Support=2%, Confidence=60%]
----------------------
X is a variable representing a customer.
----------------------
The rule says that with 2% support, age group of 21 to 30 years with an
income of 20,000 to 29,000 and have purchased a Computer. There is a 60% ----------------------
confidence that customer in the age group of 21 to 30 and income group of 20k
to 29k will purchase a computer. ----------------------
This is an association between age, income, and buys. This is a ----------------------
multidimensional association rule.
----------------------
----------------------
The data can be available in form of transaction or table form.
● Transaction form: ----------------------
Items TX ----------------------
Shoes, Socks, Tie TX1
Shoes, Socks, Tie, Belt, Shirt TX2
----------------------
Shoes, Tie TX3 ----------------------
----------------------
----------------------

Notes ● Table form:
Attr3 Attr2 Attr1
----------------------
d b a
---------------------- e c b
---------------------- b a c
Note:Table data need to be converted to transaction form for association
---------------------- mining.
---------------------- ● Transaction form:
---------------------- (Attr1, a), (Attr2, b), (Attr3, d)
(Attr1, b), (Attr2, c), (Attr3, e)
----------------------
(Attr1, c), (Attr2, a), (Attr3, b)
----------------------
Naïve Brute Force algorithm
---------------------- Step 1
---------------------- ● Generate possible itemset
---------------------- ● Compute support count of each itemset
● Output frequent itemset whose support is greater than equal to minimum
----------------------
support.
---------------------- Step 2
---------------------- ●● ● or each frequent item set I,
● Generate all nonempty subset of I
----------------------
● For every nonempty subset S of I,
---------------------- ● Output association rule S=>I-S
---------------------- ● Is support count(I)/support count(S)>=min-conf
---------------------- If itemset I of size k is consider than, it will have 2k subsets and calculating
support will involve each transaction, thus time taken will be , O(2k *n).
----------------------
----------------------
This algorithm is called as level wise algorithm. This was proposed by
---------------------- Agrawal and Shrikant in 1994. This is bottom up search moving upward level
wise to find frequent sets. Apriori algorithm is a powerful algorithm for mining
---------------------- frequent item sets using prior knowledge.
---------------------- First count number of times item is appeared in itemset. Then it generates
frequent k+1 item sets from already generated frequent k-item sets.
----------------------
It uses property of frequent item sets called Apriori property. It states that a
---------------------- subset of frequent item set is always frequent. It means that superset of non-
frequent item can’t be frequent. This is Upward closure property. Any subset of
---------------------- a frequent set is a frequent set. This is Downward closure property.
Steps Notes
– Let i=1 ----------------------
– Produce frequent item sets of length 1 ----------------------
– Repeat steps until no new frequent item sets are identified
----------------------
● Generate length (i+1) candidate item sets from length i frequent item sets
● Prune candidate item sets containing subsets of length i that are infrequent
----------------------
● Count the support of each candidate by scanning the data. ----------------------

● Eliminate candidates that are infrequent, leaving only those that are ----------------------
frequent
Pruning
----------------------
It is process of eliminating the extension of (k-1) itemset that are not found ----------------------
to be frequent.
----------------------
Consider following transactional database.
----------------------
Items TID
1,3,4 100
----------------------
2,3,5 200 ----------------------

1,2,3,5 300 ----------------------
2,5 400
----------------------
Counts item occurrences to determine frequent item sets.
----------------------
Support Items
----------------------
2 1
3 2 ----------------------
3 3 ----------------------
1 4
----------------------
3 5
Eliminate candidates that are infrequent (support=1), leaving only those that
----------------------
are frequent and generate length (k+1) candidate item sets from existing set. ----------------------
Support Items ----------------------
1 1,2
----------------------
2 1,3
1 1,5 ----------------------
2 2,3 ----------------------
3 2,5
2 3,5
----------------------
----------------------

Notes Eliminate candidates that are infrequent (support=1), leaving only those that
are frequent and generate candidate item sets from existing set.
---------------------- Support Items
---------------------- 2 1,3
---------------------- 2 2,3
3 2,5
---------------------- 2 3,5
---------------------- Generate next candidate item sets.
---------------------- Support Items
---------------------- 1 1,2,3
1 1,2,5
----------------------
1 1,3,5
---------------------- 2 2,3,5
---------------------- Frequent item set {2,3,5}
---------------------- Advantages:
---------------------- 1. Algorithm works well for small transaction data.

2. When number of items in each transaction is small as compared to total
----------------------
number of items, algorithms perform better with sparsity of data.
---------------------- Bottleneck: Apriori
---------------------- Apriori Mining required multiple passes of scan and generates lots of
candidates in case of long dataset.
----------------------
Bottleneck: candidate-generation-and-test.
----------------------
Improving Apriori
---------------------- ● Required to reduce repeated scanning of transaction database.
---------------------- ● Reducing candidate sets
---------------------- ● Ease in support counting of candidates
---------------------- 8.3.3 Frequent Pattern Growth (FP-growth) Algorithm
Frequent pattern growth does not generate candidate sets. Frequent l-itemsets
----------------------
are generated; they are stored in a compact tree structure so that database scan
---------------------- is reduced.
---------------------- Construction of FP tree

1. First entire database is scanned to search frequent data item and support
---------------------- count.
---------------------- 2. Arrange all frequent item set in descending order according to support
count. This is list of item set L
---------------------- 3. Create a root of tree as null.

4. Once again, database is scanned. For each transaction in database remove Notes
non-frequent item and sort item in order of list L. ----------------------
5. Let the sorted list [q|Q] where q is first element in the list and Q is rest
of list. If item q is already exist in the list than increment its node count ----------------------
otherwise create new node and link it to node links with same value. If Q ----------------------
is nonempty, insert Q recursively.
Let minimum support=3, apply FP-growth algorithm on following itemset.
----------------------
Items Bought Transaction ID ----------------------

{M, O, N, K, E, Y } T1 ----------------------
{D, O, N, K, E, Y } T2
{M, A, K, E} T3
----------------------
{M, U, C, K, Y } T4 ----------------------
{C, O, O, K, I, E} T5
----------------------
Arrange item set in descending order according to support.
----------------------
Item No of transactions
K 5
----------------------
E 4 ----------------------
M 3
O 3
----------------------
Y 3 ----------------------
N 2
----------------------
C 2
D 1 ----------------------
A 1 ----------------------
U 1
I 1 ----------------------
Step 2: Remove items in list whose support < 3. ----------------------
Item No of transactions ----------------------

K 5 ----------------------
E 4
M 3 ----------------------
O 3 ----------------------
Y 3
----------------------
Items Bought Transaction ID
{K, E,M, Y,O } T1 ----------------------
{K, E, Y,O } T2
{K, E,M} T3
----------------------
{K,M,Y } T4 ----------------------
{K,E,O} T5
Notes Transaction : T1 Nul
---------------------- K:1
---------------------- E:1
---------------------- M:1
----------------------
---------------------- Y:1
----------------------
O:1
----------------------
----------------------
Transaction : T2 Nul
----------------------
K:2
----------------------
E:2
----------------------
M:1
----------------------
Y:1
---------------------- Y:1
---------------------- O:1
O:1
----------------------
----------------------
----------------------
---------------------- Transaction : T3 Nul
---------------------- K:3
E:3
----------------------
---------------------- M:2
Y:1
----------------------
Y:1
----------------------
O:1
---------------------- O:1
----------------------
----------------------
----------------------
104 Mining Association Rules
---------------------- O:1
----------------------
O:1
----------------------
Y:1
---------------------- O:1
Y:1
---------------------- 3 Y
Y:1 M:2
---------------------- 3 O
3 M
---------------------- M:1
E:4 4 E
---------------------- 5 K
---------------------- K:5 No of transactions Item
---------------------- Null
---------------------- Once tree is ready, no more scans of transaction database are required.
---------------------- node.
node represent item with count that represent occurrences of path from root to
---------------------- FP tree represents compact structure to store transactional database. Each
---------------------- O:1
O:1
----------------------
Y:1 Y:1
----------------------
O:1
Y:1
---------------------- M:2
M:1
----------------------
E:4
----------------------
K:5
---------------------- Nul Transaction : T5
----------------------
O:1
---------------------- O:11
---------------------- Y:1 Y:1
---------------------- Y:1
M:2
---------------------- M:1
---------------------- E:3
---------------------- K:4
Notes Nul Transaction : T4
Notes The algorithm for frequent item set starts from last item in the header table.
It has prefix paths.
----------------------
The conditional pattern base of item consists of all patterns leading to that
---------------------- item. Conditional pattern base is used to construct conditional FP-tree with
header table in which only frequent items are added. When tree contains a
---------------------- single path, all possible combinations of the item are output.
---------------------- Conditional Pattern base Frequent Pattern Set
---------------------- O-{K,E,M,Y:1},{K,E,Y:1} {K,E:1} {O},{O,K},{O,E},{O,K,E}
Y-{K,E,M:1},{K,E:1} {K,M:1} {Y},{Y,K}
---------------------- M-{K,E:2},{K:1} {M},{M,K}
---------------------- E-{K:4} {E},{E,K}
K {K}
----------------------
Conditional pattern base and its corresponding frequent itemsets.
---------------------- Analysis of FP growth algorithm:
---------------------- 1. FP growth algorithm avoids scanning database more than twice. It first
search database to find frequent item set and scan second time to construct
---------------------- FP tree.
---------------------- 2. It allows selecting support count dynamically while mining frequent
itemset. The complete FP tree for all items scan be generated and depend
---------------------- on support count of upper element FP tree can be use for frequent mining.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes
1. Association rules that contain a single predicate are referred to as ----------------------

___dimensional association rules.
----------------------
2. An association between more than one attribute is called ___
association mining. ----------------------
Multiple Choice Single Response. ----------------------
1. Any subset of a frequent set is a frequent set. This is
----------------------
i. Upward closure property.
ii. Downward closure property. ----------------------
iii. Maximal frequent set. ----------------------
iv. Border set.
2. Any superset of an infrequent set is an infrequent set. This is ----------------------
i. Maximal frequent set. ----------------------
ii. Border set.
----------------------
iii. Upward closure property.
iv. Downward closure property. ----------------------
3. Apriori algorithm is otherwise called as ----------------------
i. width-wise algorithm.
ii. level-wise algorithm.
----------------------
iii. pincer-search algorithm. ----------------------
iv. FP growth algorithm.
----------------------
4. The apriori algorithm is a
i. top-down search. ----------------------
ii. breadth first search. ----------------------
iii. depth first search.
iv. bottom-up search.
----------------------
5. The first phase of apriori algorithm is ----------------------
i. Candidate generation.
----------------------
ii. Itemset generation.
iii. Pruning. ----------------------
iv. Partitioning. ----------------------
----------------------
----------------------
----------------------

Notes
6. The following step eliminates the extensions of (k-1)-item sets, which
---------------------- are not found to be frequent, from being considered for counting
support.
---------------------- i. Candidate generation.
---------------------- ii. Pruning.
iii. Partitioning.
----------------------
iv. Itemset eliminations.
----------------------
----------------------
8.4 MINING MULTILEVEL ASSOCIATION RULES FROM
---------------------- TRANSACTION DATABASES, RELATIONAL
DATABASES
----------------------
---------------------- In many applications, it is difficult to discover association among data items
at low level due to sparsity of data in multidimensional space. Data mining
---------------------- systems provide capabilities to mine association rules at multiple levels of
---------------------- abstraction and traverse easily among different abstraction spaces.
Association rules produced from mining data at more than one levels of
---------------------- abstraction are called multiple-level or multilevel association rules.
---------------------- Support-confidence framework is use for mining rules.
---------------------- Data can be generalised by replacing low-level concepts within the data by
their higher-level concepts.
----------------------
---------------------- How concept hierarchy is used for mining multilevel Association Rules? The
---------------------- top-down strategy is used. At each concept level calculation of frequent item
sets is made. Beginning from the concept level 1 and calculating downward
---------------------- in the hierarchy toward the more specific concept levels. This is done until no
more frequent item sets can be found. There are various approaches as follows,
---------------------- a. Use of uniform minimum support for all levels
---------------------- To mine at every level of abstraction minimum support threshold is used.
This simplifies search procedure. Using this method user has to specify only
---------------------- one minimum support threshold.
---------------------- With prior knowledge that ancestor is superset of descendents optimization
techniques can be apply. This avoids searching item in itemset, which do not
---------------------- have minimum support.
----------------------
----------------------
----------------------
----------------------

Level 1 Notes
min_sup = 5% computer [support = 10%] ----------------------
Level 2 ----------------------
min_sup = 5%
----------------------
laptop computer [support = 6%] desktop computer [support = 4%]
----------------------
Fig. 8.1: Using uniform minimum support for all levels ----------------------
In above example, to mine computer to laptop computer a minimum support
threshold of 5% is used. Therefore “computer” and “laptop computer” with ----------------------
support=6% are called frequent items and “desktop computer” with support=4% ----------------------
is not.
Minimum support threshold value is decided based upon nature of occurrences
----------------------
of items in given item set. ----------------------
If value of minimum support threshold is set too higher than it may not able
to find association at low level of abstraction. If minimum support threshold is
----------------------
set to lower value than that may lead to uninteresting pattern at high abstraction ----------------------
level.
b. Using reduced minimum support at lower levels (referred to as
----------------------
reduced support): ----------------------
This approach uses reduced minimum support at lower levels. For each ----------------------
deeper level of abstraction set, the smaller the corresponding threshold.
For example, in figure 7.1, Let us consider minimum support for level 1 and ----------------------
level 2 are 5% and 3%, respectively then, “computer,” “laptop computer,” and ----------------------
“desktop computer” are called as frequent items.
For mining multiple-level associations with reduced support, there are a ----------------------
number of alternative search Strategies ----------------------
Level-by-Level independent:
----------------------
In this approach pruning does not required background knowledge of
frequent itemset. Each node is scanned independently irrespective its parent ----------------------
node to be frequent or not.
----------------------
Level -cross-filtering by single item:
----------------------
In this technique node is examined only if its parent node is frequent. It
means first parent node is checked if it is frequent. If it is frequent its children ----------------------
will be examined otherwise children are pruned from search.
----------------------
Level-cross filtering by -K-itemset:
In this method instead of checking for single item, frequency for itemset is ----------------------
checked. It means i-itemset at the level ‘l’ is checked only if its related parent ----------------------
i-itemset at the (l-1)th level is frequent.
----------------------

Notes c. Using item or group-based minimum support (referred to as group-
based support):
----------------------
This approach is suitable for mining multilevel rules. Minimum threshold is
---------------------- set based upon user specific, item specific or group based multiple rules. For
example, in electronic showroom minimal support threshold can set on product
---------------------- or different item. Low support threshold can set for laptop computer and then
---------------------- find association with these categories.

----------------------
Fill in the Blanks.
---------------------- 1. Association rules generated from mining data at multiple levels of
---------------------- abstraction are called __________ rules.
----------------------
8.5 APPLICATION OF ASSOCIATION MINING
----------------------
● Market-basket analysis
----------------------
Association Mining helps companies to find demand of products i.e. most
---------------------- frequent itemset. This help companies to decide which stock in which
stores as well how to display them within a store.
----------------------
● Retail / Marketing
---------------------- Finding associations among customer demographic characteristics.
----------------------
Summary
----------------------
●
Association mining is discovery of relationship between various item sets
----------------------
in transactional and relational database.
---------------------- ● An itemset is called frequent if its support is equal or greater than an
agreed upon minimal value – the support threshold value.
---------------------- ●
Association rules that contain a single predicate are referred to as single-
---------------------- dimensional association rules.
●
---------------------- An association between more than one attribute is called multidimensional
association mining.
---------------------- ●
Apriori Algorithm is use to find frequent item set. This algorithm is called
---------------------- as level wise algorithm.
●
Main limitation of Apriori algorithm is it required candidate-generation-
---------------------- and-test.
●
---------------------- Apriori Mining required multiple passes of scan and generates lots of
candidates in case of long dataset.
---------------------- ●
Frequent pattern growth does not generate candidate sets. Frequent
---------------------- l-itemsets are generated; they are stored in a compact tree structure so
that database scan is reduced.

● Association rules generated from mining data at multiple levels of Notes
abstraction are called multiple-level or multilevel association rules. ----------------------
Keywords ----------------------
● Association mining: It is discovery of relationship between various item ----------------------

sets in transactional and relational database. ----------------------
● Frequent item: An item can be frequent item if its support is equal or
greater than predefined minimum value.
----------------------
● Confidence: Confidence is based on conditional probability. If item X is ----------------------

present in transaction there is high confidence of presence of Y.
----------------------
Self-Assessment Questions ----------------------
1. Define frequent set and association rule. ----------------------
2. Discuss concepts of frequent sets, confidence and support. ----------------------

3. Describe Apriori algorithm. ----------------------
4. If (a,b),(b,c),(b,d),(a,c),(e,a) and (e,c) are frequent 2-item sets, how many
frequent 3 item sets are possible?
----------------------
5. Find frequent item sets for following transactional dataset using apriori. ----------------------
Item is said frequently bought if it is bought at least 2 times. ----------------------
Items Bought Transaction ID
----------------------
{M, O, N, K, E, Y } T1
{D, O, N, K, E, Y } T2 ----------------------
{M, A, K, E} T3 ----------------------
{M, U, C, K, Y } T4
{C, O, K, I, E} T5 ----------------------

----------------------
----------------------
----------------------
1. The left hand side of an association rule is called __________. ----------------------

iii. antecedent. ----------------------
2. All set of items whose support is greater than the user-specified minimum ----------------------
support are called as ____________.
ii. frequent set. ----------------------
----------------------
----------------------


1. Association rules that contain a single predicate are referred to as single
---------------------- dimensional association rules.
---------------------- 2. An association between more than one attribute is called multidimensional
association mining.
----------------------
---------------------- 1. Any subset of a frequent set is a frequent set. This is
---------------------- ii. Downward closure property.
---------------------- 2. Any superset of an infrequent set is an infrequent set. This is
iii. Upward closure property.
----------------------
3. Apriori algorithm is otherwise called as
----------------------
ii. level-wise algorithm.
---------------------- 4. The apriori algorithm is a
---------------------- iv. bottom-up search.
---------------------- 5. The first phase of apriori algorithm is
i. Candidate generation.
----------------------
6. The following step eliminates the extensions of (k-1)-item sets, which are
---------------------- not found to be frequent, from being considered for counting support.
---------------------- ii. Pruning.
1. Association rules generated from mining data at multiple levels of
----------------------
abstraction are called multilevel association rules.
----------------------
----------------------
Suggested Reading
● Ning Tan, Kumar, Vipin; Steinnach, Michael. Introduction to Data
---------------------- Mining.
---------------------- ● R. Agrawal and R. Srikant. Fast algorithm for Mining Association Rules
in Large Databases. In Research Report RJ 9839, IBM Almaden Research
---------------------- Center, San Jose, CA, June 1994.
---------------------- ● Han, S.Jiawei and Kamber, Micheline 2006. Data Mining Concepts and
Techniques.
----------------------
----------------------
----------------------

Classification and Prediction
UNIT
9
Structure:
9.1 Introduction
9.2 Classification and Prediction
9.3 Issues Regarding Classification and Prediction
9.4 Classification by Decision Tree Induction
9.5 Classification by Bayesian Classification
9.6 Classification by Back Propagation
9.7 Classification Based on Concepts from Association Rule Mining
9.8 Prediction
9.9 Accuracy and Error Measures
9.10 Evaluating Accuracy of Classifier or Predictor
Summary
Key Words
Suggested Reading
Classification and Prediction 112

Notes
Objectives
----------------------
----------------------
● Understand classification and prediction
---------------------- ● Use different techniques for classification
---------------------- ● Evaluate accuracy of classifier or predictor
----------------------
---------------------- 9.1 INTRODUCTION
---------------------- In this chapter, you will learn classification as data mining task. The chapter
also explain difference between Classification and prediction. Different
---------------------- classifiers such as decision tree, Bayesian classifiers, Backpropagation are also
discussed. Classification based on association rule mining is explored.
----------------------
Supervised Learning and Unsupervised Learning
----------------------
In supervised learning the class label of each training record is predefined
---------------------- therefore this step is called as supervised learning. Classification is example of
supervised learning techniques.
----------------------
Unsupervised learning applies on dynamic dataset where class label of
---------------------- training data is unknown. Sometimes, total number of classes to be formed are
also unknown in advance. Clustering is unsupervised learning technique.
----------------------
Data Types:
----------------------
1. Discrete Data
---------------------- Discrete data can only take particular values. It has predictable fixed values
---------------------- or computational boundless set of values. Examples: zip codes or the set of
words in a collection of documents, male or female, good or bad.
----------------------
2. Continuous Data
---------------------- Continuous data are not limited to defined particular values, but it can take
any value over a continuous range.
----------------------
Examples: temperature, age, height, or weight, experience in year
----------------------
---------------------- 9.2 CLASSIFICATION AND PREDICTION
---------------------- Classification is a data mining technique used to predict categorical class
labels. For example, an Insurance company needs data analysis for predicting
---------------------- customer, will buy new policies or not or a company wants to predict good
customers based on old customers. An automobile company wants to predict
----------------------
whether customer will buy a car or not based on customer data. In all of these
---------------------- examples, classification task is applied.
----------------------

Based on available data, a model is built to calculate class l labels, such as Notes
for the insurance data; “yes” or “no”. An automobile company, customer will ----------------------
buy car: answer can be either “yes” or “no”. These classes can be represented
by discrete values. ----------------------
Data classification is a two-step process. First step is to learn a classification ----------------------
model from the data and second step is to use the model to classify cases.
In the first step, a classification algorithm learns predefined set of data classes
----------------------
and constructs the classifier model. This phase is called learning step or training ----------------------
phase.
Training data set consist of sample datasets along with predefined class
----------------------
labels. Model consists of set of classification rules, decision tree rules or any ----------------------
mathematical formula.
----------------------
Second step is usage of model. The constructed model is used to classify
unknown cases ----------------------
Data set use for testing result of model is called as Test set. Test set is ----------------------
independent of training set. Result of the model is compared against known
class label to find accuracy of the model. ----------------------
“Accuracy rate is the percentage of test set samples that are correctly ----------------------
classified by the model.”
If the accuracy of the model is acceptable, than model is use to classify ----------------------
unknown cases. ----------------------
Prediction
----------------------
To determine future outcome rather than current behaviour. i.e., predicts
unknown or missing values. ----------------------
For example marketing manager wants to find out expenditure of given ----------------------
customer during sale. This requirement is numeric prediction. Model is designed
to predict a value than the class label. Regression analysis is a statistical method ----------------------
used for numeric prediction. ----------------------
----------------------
Fill in the Blanks.
----------------------
1. ______is data mining techniques use to predict categorical class
labels. ----------------------
2. In ___ type of learning the class label of each training record is ----------------------
predefined.
3. ______ can only take particular values. ----------------------
----------------------
----------------------

Notes 9.3 ISSUES REGARDING CLASSIFICATION AND
----------------------
PREDICTION
There are issues while pre-processing the data as well as for comparing and
---------------------- evaluating classification and prediction methods.
---------------------- Preparing the Data for Classification and Prediction
---------------------- There are issues like data cleansing and relevance analysis regarding pre-
processing the data for classification and prediction
----------------------
Data Cleansing
----------------------
Data cleansing is process of finding and correcting inaccurate data, removing
---------------------- noisy data and handling missing values in data. Noise can be reduced by applying
smoothing techniques and missing value problem can be solve by replacing
---------------------- missing value with common value. Most of the algorithms, handle this problem
but can confuse classifier.
----------------------
Relevance Analysis
----------------------
Attributes in the data may be redundant. There may be some attributes in
---------------------- data that are irrelevant attributes and not required for further analysis. To find
relevant and reduced set of attribute of the data set attribute subset selection is
---------------------- used. Therefore correlation analysis and subset selection are required to use to
---------------------- attributes that do not contribute to the classification or prediction task. Presence
of such noisy attributes may slow down, and possibly mislead, the learning step.
----------------------
Data Transformation and Reduction
---------------------- Data transformation is the process of converting information from one
format to another format. Convert common data elements into a consistent
----------------------
form. The data may be transformed by normalization process. It involves
---------------------- scaling all values. Data can also reduced by principle components analysis to
discretization techniques.
----------------------
Comparing Classification and Prediction Methods
----------------------
Following criteria is used to compare and evaluate classification and
---------------------- prediction methods:
---------------------- a) Accuracy: Accuracy of classifier is the correct prediction of model for

previously unknown data. Similarly, accuracy of prediction model is
---------------------- how much accurate predictor guess value. Accuracy describes how good
classifier model is.
----------------------
b) Speed: This include the computational costs involved in generating and
---------------------- using the given classifier or predictor.
---------------------- c) Robustness: It is the ability of classifier or predictor to handle presence of
noisy data or missing data and perform well to predict accurate output. .
---------------------- d) Scalability: This describes the ability to construct the classifier or predictor
---------------------- efficiently given large amounts of data.

e) Interpretability: It is the level of understanding that is provided by the Notes
classifier or predictor. ----------------------
----------------------
Fill in the Blanks.
1. _____of classifier is the correct prediction of model for previously
----------------------
unknown data. ----------------------
2. ______is the process of converting information from one format to ----------------------
another format.
3. _____of data to reduce noise and handle missing values in data. ----------------------
----------------------
9.4 CLASSIFICATION BY DECISION TREE INDUCTION ----------------------
A decision tree is tree like structure. The tree has three types of node:
----------------------
1) Root node: It has no incoming edges and zero or more outgoing edges. ----------------------
2) Internal nodes: It has exactly one incoming edge and can have two or ----------------------
more outgoing edges.
----------------------
3) Leaf node or terminal node: It has exactly one incoming edge and no
outgoing edge. Leaf node represent class node. ----------------------
Decision tree classifier is popular because constructing decision tree does ----------------------
not require prior domain knowledge. Decision tree can handle high dimension
data. Decision tree classifiers have good accuracy. Decision tree classifiers ----------------------
convertible to simple and easy to understand classification rules
----------------------
Decision tree algorithm is use in many application such as medical,
production, manufacturing, financial analysis etc. ----------------------
Decision Tree Induction ----------------------
In late 1970 and 1980s, J. Ross Quinlan developed a decision tree algorithm ----------------------
known as ID3 (Iterative Dichotomiser), Quinlan later presented C4.5 (a
successor of ID3). In 1984, a group of statisticians (L. Breiman, J. Friedman, R. ----------------------
Olshen, and C. Stone) published Classification and Regression Trees (CART).
ID3, C4.5, and CART constructs decision trees in a top-down recursive divide- ----------------------
and-conquer manner. ----------------------
Steps to build a decision tree
----------------------
1) Select a node as root node
----------------------
2) Find possible value of the attribute and derive one branch for each possible
value. ----------------------
3) Repeat step 2 recursively for each branch till unique class label is ----------------------
determined.

Notes Design issues:
---------------------- Learning algorithm for decision tree must address following issues,
● How to split training record
----------------------
Decision tree is created by recursively selecting an attribute test condition
---------------------- and splitting them into smaller subsets. Learning algorithm should support a
method to specify test condition for attribute type and also measure for goodness
---------------------- of each test condition.
---------------------- ● Stopping criteria for splitting attributes
Attributes are split recursively for each branch till unique class label is
----------------------
determined or all records belong to same class or all records have same value.
---------------------- Measures for selecting best split
---------------------- The best split measures are based on degree of impurity of child node. The
impurity measures are as follows,
----------------------
entropy(p1,p2, . . . ,pn)= -p1logp1-p2logp2. . .-pnlogpn
----------------------
# Entropy is a measure of how "mixed up" an attribute is.
---------------------- • Information gain:
---------------------- Information gain determine most relevant attribute. When splitting up the
decision tree node information gain is the removal of entropy related to specific
---------------------- attribute value.
---------------------- Information Gain = Entropy(X) – Entropy (X|Y)
---------------------- How to select root node?
---------------------- To select root node Information gain of each attribute is calculated. Select
the attribute that gives us the largest information gain. ID3 algorithm uses
---------------------- information gain as its attribute selection measure.
---------------------- Different Ways to Represent Attribute Test

1) Binary Attributes
----------------------
Test condition for binary attribute contains two outcomes
----------------------
----------------------
----------------------
----------------------
----------------------
2) Nominal attributes
----------------------
Nominal attributes can take multiple values. Some decision tree algorithm
---------------------- such as CART generate only binary values, in such case multiple attributes
can be group together.
----------------------

Notes
----------------------
----------------------
----------------------
3) Ordinal Attributes ----------------------
Ordinal attributes can generate binary or multiway values. Attributes ----------------------

values can be grouped together as long as they preserve the order. For
example, attribute values for node income can be combined as high and
----------------------
Medium, Low which preserves the order. But grouping High and Low ----------------------
together violets the rule.
----------------------
4) Continuous Attributes
Continuous attribute values are represented as comparison test with ----------------------
binary values or multiple values. Multiple continuous values can be group ----------------------
together into wider ranges retaining the order of the properties.
----------------------
----------------------
----------------------
----------------------
A Decision Tree Example:-

----------------------
Weather data example- ----------------------

Play Outlook Temperature Humidity Windy ID code ----------------------
No Sunny Hot High FALSE a
No Sunny Hot High TRUE b
----------------------
Yes Overcast Hot High FALSE c ----------------------
Yes Rain Mild High FALSE d
Yes Rain Cool Normal FALSE e
----------------------
No Rain Cool Normal TRUE f ----------------------
Yes Overcast Cool Normal TRUE g
No Sunny Mild High FALSE h
----------------------
Yes Sunny Cool Normal FALSE i ----------------------

Yes Rain Mild Normal FALSE j
----------------------
Yes Sunny Mild Normal TRUE k
Yes Overcast Mild High TRUE l ----------------------
Yes Overcast Hot Normal FALSE m
----------------------
No Rain Mild High TRUE n
----------------------

Notes Attribute values
Outlook Sunny,overcast,rain
----------------------
Temperature Hot,mild,cool
---------------------- Humidity High,normal
---------------------- Windy True,false
Play Yes, no
----------------------
In the weather data example, there are 9 records with play=”yes” and there
---------------------- are 5 records with play is “no’.
---------------------- The information gain calculated for decision is:
---------------------- P(Play Tennis = Yes) = 9/14

P(Play Tennis = No) = 5/14
----------------------
= - 9/14 * LOG2 (9/14) - 5/14 * LOG2 (5/14)
----------------------
= 0.94028
---------------------- Srainy =
---------------------- P(Outlook = Rainy and Play Tennis = yes) = 3/5
---------------------- P(Outlook = Rainy and Play Tennis = no) = 2/5
---------------------- a) Entropy (Srain)= -3/5* log2(3/5)-2/5*log2(2/5)

Entropy (Srain) =0 .971
----------------------
b) Entropy (Sovercast)= -4/4* log2(4/4)-0*log2(0)
----------------------
Entropy (Sovercast)=0
---------------------- c) Entropy (Ssunny)= -2/5*log2(2/5)-3/5*log2(3/5)
---------------------- Entropy (Swindy)=0.971
---------------------- P(rain) = 5/14 P(overcast) = 4/14 P(sunny) = 5/14
---------------------- Entropy (Play Tennis | Outlook )=(5/14)*0.971+(4/14)*0+(5/14)*0.971
---------------------- =0.694
----------------------
Information Gain= Entropy(X) - Entropy(X|Y)
----------------------
=Entropy (Play Tennis) - Entropy (Play Tennis | Outlook)
---------------------- =0.940 - 0.694
---------------------- =0.246
---------------------- Similarly, calculate information gain for remaining attributes. Select the
attribute with maximum information gain. Attribute “Outlook” has maximum
---------------------- information gain
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 9.1: Decision Tree ----------------------

Overfitting ----------------------
In testing phase, errors generated by classification model is of two types: ----------------------
1) Training errors: Number of misclassified records by classification model
When tested on training records. ----------------------
2) Generalization errors: These are the expected error produce by ----------------------

classification on unseen data.
----------------------
A model with low training errors and low testing errors is called good model.
----------------------
It may possible that model fits into training data will have low training
errors but increase in generalization error. Such a situation is called overfitting. ----------------------
It means classification model fits well for training data but misclassifies the
testing data. Overfitting may result due to use of noisy data or redundant data. ----------------------
Pruning is used to avoid overfitting in constructing decision tree. ----------------------
What is decision tree pruning? ----------------------
Due to noise or outliers, many branches of tree reflects anomalies in decision
trees. Once tree is built, it may require modification to remove redundant
----------------------
comparisons or to reduce subtree. ----------------------
The pruning is process of removing redundant comparisons or to remove
----------------------
subtrees. This reduces unnecessary comparisons and achieves better performance
during classification phase. Pruned trees are smaller, less complex and easy to ----------------------
comprehend.
----------------------
There are two approaches to tree pruning: prepruning and postpruning.
In prepruning approach, a splitting or partition the subset of tree is halted
----------------------
at particular node. In postpruning approach, removes subtree from full tree. A ----------------------
subtree is pruned at a node by removing branches at a node and replacing it with
leaf node. ----------------------

Notes Extracting classification rules from a decision tree
---------------------- The decision tree of Fig. 8.1 is converted to classification using IF-THEN
rules. Start from root node and trace path till its leaf node. It is called rule-based
---------------------- classifiers where model is represented using IF-THEN rules.
---------------------- R1: IF outlook= sunny AND humidity= high THEN play = no

R2: IF outlook= sunny AND humidity= normal THEN play = yes
----------------------
R3: outlook= overcast THEN play = yes
----------------------
R4: IF outlook=rainy AND windy= false THEN play =yes
---------------------- R5: IF outlook=rainy AND windy= true THEN play =no
----------------------
----------------------
1. There are two approaches to tree pruning: ___ and ___.
----------------------
2. ___ is used to decide which of the attributes are the most relevant.
----------------------
----------------------
9.5 CLASSIFICATION BY BAYESIAN CLASSIFICATION
----------------------
Bayesian classifiers are statistical classifiers, which use class probability to
---------------------- predict the class of unknown tuple. Simple Bayesian classifier is also known
as the naïve Bayesian classifier. Naïve Bayesian classifiers works on class
---------------------- conditional independence. Bayesian classifier, when used on large databases it
---------------------- shows high accuracy and speed.
Bayesian Classification is based on Bayes’ theorem:
----------------------
Let X be a data tuple and H be hypothesis, such that X belongs to a specific
---------------------- class C. Posterior probability of a hypothesis h on X, P(h|X) follows the Baye’s
theorem
----------------------
----------------------
---------------------- Towards Naïve Bayesian Classifier
---------------------- ● Consider training set of tuples D with its related class labels. Each tuple
is represented as n-D attribute vector X=(x1,x2,…xn)
----------------------
● Let C1,C2,…Cm be the classes.
---------------------- ● Classification is to derive maximum posterioiri, max P(Ci|X)
---------------------- ● This can be derived from Bayes’ theorem
----------------------
----------------------

● Since P(X) is constant for all classes, only needs to be maximized Notes
----------------------
Bayesian belief networks represents graphical models to show relation ----------------------

among subsets of attributes. Bayesian belief networks can also be used for ----------------------
classification.
Example:
----------------------
Consider weather data as the training data in the given decision tree example. ----------------------
To find out whether game play or not play i.e. play=yes or play=no,
----------------------
Consider following Test Phase,
----------------------
X’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
P(Outlook=Sunny|Play=Yes) = 2/9
----------------------
P(Temperature=Cool|Play=Yes) = 3/9 ----------------------

P(Huminity=High|Play=Yes) = 3/9 ----------------------
P(Wind=Strong|Play=Yes) = 3/9 ----------------------
P(Play=Yes) = 9/14
----------------------
P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play==No) = 1/5
----------------------
P(Huminity=High|Play=No) = 4/5 ----------------------

P(Wind=Strong|Play=No) = 3/5 ----------------------
P(Play=No) = 5/14 ----------------------
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes)
= 0.0053 ----------------------
P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = ----------------------

0.0206
----------------------
Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
----------------------
Advantages and Disadvantages of Naïve Bayes classifier:
• Advantages of Naïve Bayes classifier: ----------------------
– Naïve Bayes classifier us easy to implement ----------------------

– Proves good results in most of the cases ----------------------
• Disadvantages of Naïve Bayes classifier:
----------------------
Classifier assumes class conditional independence, which may result into
loss of accuracy. ----------------------
Dependencies may exist among variable. For example, Symptoms: fever, ----------------------
cough etc., Disease: lung cancer, diabetes, etc.
----------------------

Notes These dependencies cannot be modelled by Naïve Bayes classifier. Practically,
dependencies exist among variables. For example, hospitals: patients: Profile:
---------------------- age, family history, etc.
---------------------- Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
----------------------
----------------------
Fill in the Blanks.
----------------------
1. Bayesian classifiers are statistical classifiers use ______to predict the
---------------------- class of unknown tuple
---------------------- 2. The _____are graphical models that allow the representation of
dependencies among subsets of attributes.
----------------------
----------------------
9.6 CLASSIFICATION BY BACK PROPAGATION
----------------------
Backpropagation is a neural network learning algorithm. A neural network
---------------------- is set of connected input, output unit. Each unit has weight associated with it.
It is also called as connectionist learning due to its connection between units.
----------------------
In learning phase, network learns by adjusting its weight to predict the correct
---------------------- class. A neural network has high tolerance to noisy data. It performs satisfactory
in domain with little bit knowledge about data and suitable for real-world data
---------------------- such as handwritten character recognition, pathology and laboratory medicine,
---------------------- and training a computer to pronounce English text.
Backpropagation algorithm learns on a multilayer feed-forward neural
----------------------
network. A multilayer feed-forward neural network consists of an input layer,
---------------------- one or more hidden layers, and an output layer.
Back Propagation
----------------------
Back propagation learns by iteratively processing training dataset and
---------------------- comparing its prediction with target value. The target value may be known
value for prediction and class label for classification.
----------------------
For each set of training, the weights ate modifies to minimize error between
---------------------- networks prediction and actual target value. These modifications are made in
backwards direction from output layer to first hidden layer. Hence, the name
----------------------
back propagation. The computational efficiency depends on the time spent
---------------------- training the network.
----------------------
----------------------
----------------------
----------------------

Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 9.2: Back propagation ----------------------
Consider the above given figure. ----------------------
Unit j : A hidden or output layer. The inputs to unit j are labelled y1, y2, : : : , yn
----------------------
The inputs to unit j are outputs from the previous layer.
Weighted sum is calculated by multiplying them with their weights (Wij).
----------------------
This is added to the bias associated with unit j. Output layers takes its net input ----------------------
and then applies an activation function. ----------------------
The output of unit j, is computed as
----------------------
----------------------
----------------------
For a unit j in the output layer, the error Errj is computed by ,
----------------------
----------------------
The error of a hidden layer unit j is,
----------------------
----------------------
Weights are updated by the following equations, where wij is the change in ----------------------
weight wij:
----------------------
wij = (l) Errj Oi
wij = wi j +Dwi j
----------------------
----------------------
----------------------
----------------------
----------------------

Notes Sample calculations for learning by the backpropagation algorithm
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig 9.3: An example of a multilayer feed-forward neural network
----------------------
Table 9.1: Initial input, weight, and bias values.
----------------------
ⱷ1 ⱷ2 ⱷ3 W14 W15 W24 w25 W34 W35 W46 W56 ⱷ4 ⱷ5 ⱷ6
---------------------- 1 0 1 0.2 -0.3 0.4 0.1 -0.5 0.2 -0.3 -0.2 -0.4 0.2 0.1
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Notes
1. A __network is set of connected input, output unit. ----------------------

2. A multilayer feed-forward neural network consists of a ___,____ and ----------------------
___ layers.
----------------------
9.7 CLASSIFICATION BASED ON CONCEPTS FROM ----------------------

ASSOCIATION RULE MINING ----------------------
Associative classification is a concept where association based rules ----------------------
are generated and used for classification purpose. The idea is to find strong
relationship between frequent pattern and class labels. Decision tree uses only ----------------------
one attribute at a time. This constraint is overcome by associative classification.
Associative classification approach differs from Association mining on how the
----------------------
derived rules are analysed and used for classification. ----------------------
Classification-Based Association (CBA) is one of the simple algorithms used ----------------------
for associative classification. CBA uses an iterative approach to frequent item
set, a heuristic method to construct the classifier. In CBA, rules are organised ----------------------
based on support and confidence in decreasing order. If a set of rules has the same
antecedent, the set is represented by selecting the rule with highest confidence. ----------------------
When classifying test tuple, first rule satisfy the condition is used to classify ----------------------
it. Classifier has default rule. If any new tuple that is not satisfied by any other
rule in the classifier that classify with default class. In this way set of rules ----------------------
making up classifiers form decision tree. This algorithm is most effective than ----------------------
decision tree classifiers.
Another classifier called as CMAR (Classification based on Multiple ----------------------
Association Rules) is also used for associative classification. It differs from ----------------------
CBA in its approach to frequent item set and classifier construction. CMAR
adopts a variant of the FP-growth algorithm. CMAR finds set of rules satisfying ----------------------
the minimum confidence and minimum support thresholds. FP-growth uses a
tree structure, forming frequent item set in dataset. CMAR further maintains ----------------------
distribution of class label for frequent items among the training data set. ----------------------
Therefore, it is easy to form rule together with frequent items set.
----------------------
----------------------

1. ______is concept where association based rules are generated and ----------------------
used for classification purpose.
----------------------
2. Full form of CBA is ______.

Notes 9.8 PREDICTION
---------------------- As discussed earlier, prediction is the estimation of numeric value. There are
some classifications methods such as backpropagation technique and k-nearest
---------------------- neighbour. Support vector machine is used for prediction. For example, you
---------------------- want to predict the sale of a product in company after 10 years. An approach to
find the prediction using a statistical methodology is called regression, which
---------------------- was developed by Sir Frances Galton (1822–1911).
---------------------- Regression analysis is used to find relationship between one or more

independent (predictor) and dependent (response) variables. Values of predictor
---------------------- variables are known that describes tuples. Response variable are the value,
which want to predict.
----------------------
Linear Regression
----------------------
Straight-line regression analysis is a simple method of regression. In this
---------------------- method, prediction is made on one variable from the scores of another variable.
It has two variables - x & y.
----------------------
Variable, which we are predicting is called response variable (y) and variable
---------------------- which is used as base for prediction is called as predictor variable (x)
---------------------- y = b+wx
where b and w are regression coefficients specifying the Y-intercept and
----------------------
slope of the line, respectively.
---------------------- We can consider w and b, the regression coefficients, as weights. We can
equivalently write,
----------------------
y = w0+w1x
----------------------
Multiple linear regressions is an extension of straight-line regression which
---------------------- has more than one predictor variable.
---------------------- Logistic regression and Poisson regression are Generalized Linear Models
(GLMs).
----------------------
Non-Linear Regression
---------------------- In the above equation, y is modelled as a linear function of a single
independent predictor variable, x. Nonlinear model can be transformed into
----------------------
linear model by applying transformation to the model. Polynomial regression is
---------------------- used when there is just one predictor variable.
Transformation of a polynomial regression model to a linear regression
----------------------
model:
---------------------- Consider a cubic polynomial relationship given by
---------------------- y = w0+w1x+w2x2+w3x3
---------------------- To convert this equation to linear form, we define new variables:

x1 = x, x2 = x2, x3 = x3
----------------------

Resulting equation y = w0 +w1x1 +w2x2 +w3x3, which can be solved by the Notes
method of least squares for regression analysis. ----------------------
----------------------
Fill in the Blanks.
1. The value, which we want to predict is called _______.
----------------------
2. A _______ is use to find relationship between one or more independent ----------------------

(predictor) and dependent (response) variables. ----------------------
3. Straight-line regression analysis is simple method of __________.
----------------------
9.9 ACCURACY AND ERROR MEASURES

----------------------
----------------------
Accuracy is measured as percentage of total number records correctly
classified by the model. This is also referred to as the overall recognition rate of ----------------------
the classifier. Evaluation of classification and prediction methods are based on
predictive accuracy of classifier, computational speed, robustness, scalability ----------------------
and interpretability of model. ----------------------
Error rate of classifier is the percentage of total records misclassified by the
classifier. Error rate is simply 1-Acc(c), where Acc(c) is the accuracy of c. ----------------------
The confusion matrix is a useful to find how classifier has classified records ----------------------
of different classes. It also displays count of number of records misclassified by
classifier. Confusion matrix is a table of at least size m by m. An entry, CMi, j
----------------------
in the first m rows and m columns indicates the number of tuples of class i that ----------------------
were labelled by the classifier as class j.
● True positive (TP) Rate: actual class and predicted class are same.
----------------------
● False positive (FP) Rate: Predicted to be in class but does not belong to ----------------------
that class.
----------------------
● Precision: The closeness of repeated measurements to one another. It
calculated as the ratio of the number of relevant records retrieved to ----------------------
the total number of irrelevant and relevant records retrieved.
Precision=TP/(TP+FP)
----------------------
● Recall: It is defined as the total number of true positives records divided ----------------------
by the total number of records actually belong to the positive class
(i.e. the sum of true positives and false negatives, these are the records ----------------------
classified as negative but belong to positive class.) . ----------------------
● F-Measure: It is measure of a test's accuracy. Score reaches its best value
at 1 and worst score at 0. ----------------------
----------------------
----------------------

Notes ● ROC (Receiver Operating Characteristics) curves: ROC represents visual
comparison of classification models.
----------------------
It shows the trade-off between the true positive rate and the false positive
---------------------- rate. The area under the ROC curve is a measure of the accuracy of the model.

----------------------
Fill in the Blanks.
----------------------
1. Accuracy rate is also referred to as the ______rate of the classifier.
---------------------- 2. _______ is defined as the number of true positives divided by the
---------------------- total number of elements that actually belong to the positive class.
3. When the rate of actual class and predicted class are same, it is called
---------------------- ______.
----------------------
---------------------- 9.10 EVALUATING ACCURACY OF CLASSIFIER OR
---------------------- PREDICTOR
---------------------- To obtain the accuracy of classifier or predictor, there are some common
techniques such as Holdout method, random subsampling, cross validation and
---------------------- Bootstrap method.
---------------------- Holdout Method
---------------------- In Holdout Method, the data is randomly divided into two independent sets,
training set and testing set. Training data set is generated with two-thirds of the
---------------------- data and the remaining is allocated to the test set. Training set is use to derive
model and accuracy of the derive model is tested with test data.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 8. 4: Evaluating Accuracy Using Holdout Method
---------------------- Random Subsampling
---------------------- In Random subsampling, holdout method is repeated k times. Average of
accuracies obtained at each iteration is considered as overall accuracy.
----------------------
Cross-validation
----------------------
In cross validation, initial data randomly reordered and then are divided into
---------------------- ‘n’ number of folds. In each iteration, one fold is used for testing and the other

n-1 folds are used for training the classifier. The test result at each iteration Notes
is collected and average over all folds is considered as overall accuracy. This ----------------------
gives estimate of the accuracy.
Bootstrap
----------------------
The bootstrap method, samples the given training records uniformly with ----------------------
replacement. It means - a machine randomly selects records for training set,
allows selecting same sample more than once and add to the training set. The
----------------------
data records that are not selected for training set are used for the test set. ----------------------
On an average, 63.2% of the original data records are used in the bootstrap ----------------------
and the remaining 36.8% are used for test set. Each record can be selected with
probability of 1/d. So, the probability of not selecting record is (1-1/d). If d is ----------------------
large, the probability approaches e-1=0.368 (e = 2:718). Therefore 36.8% of
records are not selected and are used for testing set and remaining 63.2% are ----------------------
used for training set. ----------------------
----------------------
Fill in the Blanks.
1. In Holdout Method, data are randomly divided into two independent
----------------------
sets, ___and _______. ----------------------
2. The _____method samples the given training tuples uniformly with
----------------------
replacement.
----------------------
Summary ----------------------
● A decision tree is tree like structure. In this topmost node is the root node, ----------------------
each nonleaf node represent a test on an attribute, each branch represents ----------------------
an outcome of the test and each leaf node (or terminal node) holds a class
label. ID3, C4.5, and CART are decision tree techniques use to constructs ----------------------
decision trees in a top-down recursive divide-and-conquer manner.
----------------------
● Bayesian classifiers are statistical classifiers that use class probability to
predict the class of unknown tuple. ----------------------
● Backpropagation is a neural network learning algorithm. A neural network
----------------------
is set of connected input, output unit. Each unit has weight associated
with it. It is also called as connectionist learning due to its connection ----------------------
between units.
● Associative classification is concept where association based rules are
----------------------
generated and used for classification purpose. ----------------------
● Prediction is the estimation of numeric value. Regression analysis is use
to find relationship between one or more independent (predictor) and
----------------------
dependent (response) variables. ----------------------

Notes ● Accuracy rate is the percentage of test set samples that are correctly
classified by the model. This is also referred to as the overall recognition
---------------------- rate of the classifier. Predictive accuracy, computational speed, robustness,
scalability, and interpretability are five criteria for the evaluation of
---------------------- classification and prediction methods.
---------------------- ● Holdout method, random subsampling, cross validation, Bootstrap are the
techniques use for evaluating accuracy of the given data.
----------------------
---------------------- Keywords
---------------------- ● Classification: It predicts categorical class labels (discrete or nominal)
and classifies data (constructs a model) based on the training set and the
---------------------- values (class labels) in a classifying attribute and uses it in classifying
---------------------- new data
● Prediction: To determine future outcome rather than current behaviour.
----------------------
i.e., predicts unknown or missing values.
---------------------- ● Information gain: Information gain determine most relevant attribute.
---------------------- When splitting up the decision tree node, information gain is the removal
of entropy related to specific attribute value.
----------------------
---------------------- Self-Assessment Questions
---------------------- 1. State the difference between Classification and Prediction.

2. Describe issues regarding Classification and Prediction.
----------------------
3. Explain classification by Decision Tree Induction.
----------------------
4. Describe different types of node in decision tree.
---------------------- 5. What is need of pruning in Decision Tree Induction?
---------------------- 6. Explain Bayesian Classification and Backpropagation techniques.
---------------------- 7. What do you mean by Associative Classification?
8. State the types of linear model and state its use?
----------------------
9. Discuss various methods for evaluating the Accuracy of a Classifier or
----------------------
Predictor.
----------------------
----------------------
Check your progress 1
----------------------
Fill in the Blanks.
---------------------- 1. Classification is data mining techniques use to predict categorical class
---------------------- labels.
2. In Supervised learning type of learning the class label of each training
---------------------- record is predefined.

3. Discrete data can only take particular values. Notes
Check your progress 2 ----------------------
1. Accuracy of classifier is the correct prediction of model for previously
unknown data.
----------------------
2. Data transformation is the process of converting information from one ----------------------

format to another format.
----------------------
3. Data cleansing of data to reduce noise and handle missing values in data.
----------------------
1. There are two approaches to tree pruning: prepruning and postpruning. ----------------------
2. Information gain is used to decide which of the attributes are the most ----------------------
relevant.
----------------------
1. Bayesian classifiers are statistical classifiers use class probability to ----------------------

predict the class of unknown tuple
----------------------
2. The Bayesian belief networks are graphical models that allow the
representation of dependencies among subsets of attributes. ----------------------
Fill in the Blanks.
----------------------
1. A neural network is set of connected input, output unit.
----------------------
2. A multilayer feed-forward neural network consists of a Input layer, hidden
layers, and an output layer. ----------------------
Fill in the Blanks.
----------------------
1. Associative classification is a concept where association based rules are
generated and used for classification purpose. ----------------------
2. Full form of CBA is Classification-Based Association. ----------------------
Fill in the Blanks.
----------------------
1. The value, which we want to predict is called Response variable.
2. A Regression analysis is use to find relationship between one or more
----------------------
independent (predictor) and dependent (response) variables. ----------------------
3. Straight-line regression analysis is simple method of Linear regression.
----------------------

Notes Check your progress 8

1. Accuracy rate is also referred to as the Overall recognition rate of the
---------------------- classifier.
---------------------- 2. Recall is defined as the number of true positives divided by the total
number of elements that actually belong to the positive class.
----------------------
3. When the rate of actual class and predicted class are same, it is called
---------------------- True positive (TP).
---------------------- Check your progress 9
1. In Holdout Method, data are randomly divided into two independent sets,
----------------------
Training set and testing set.
---------------------- 2. The Bootstrap method samples the given training tuples uniformly with
replacement.
----------------------
---------------------- 1. Tan, Pang-Ning, Kumar, Vipin and Steinnach, Michael. Introduction to
Data Mining,
----------------------
2. Jiawei Han, Micheline Kamber Data Mining :Concepts and Techniques.
----------------------
3. WEKA Manual by Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard
---------------------- Kirkby, Peter Reutemann, Alex Seewald, David Scuse, January 11, 2010
---------------------- 4. David M. Lane,” Introduction to Linear Regression” http: // onlinestatbook.
com /2/regression/intro.html
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Mining Complex Types of Data
UNIT
10
Structure:
10.1 Introduction
10.2 Clustering and Outliers
10.2.2 Measuring Dissimilarity or Similarity in Clustering
10.3 Clustering Techniques
10.4 Multidimensional Analysis-Descriptive Mining of Complex Data Objects
10.5 Mining Spatial Databases
10.6 Mining Multimedia Databases
10.7 Mining Time-Series
10.8 Mining Sequence Data
10.9 Mining Text Databases
10.9.1 Text mining process
10.10 Mining the WWW
10.10.2 Web Usage Mining
Summary
Key Words
Suggested Reading
Mining Complex Types of Data 134

Notes
Objectives
----------------------
----------------------
● Understand the concept of Clustering and Outliers
---------------------- ● Study different types of Clustering Techniques
---------------------- ● Describe single-dimensional and multidimensional association mining
---------------------- ● Describe various approaches for Multilevel Association Rules
----------------------
10.1 INTRODUCTION
----------------------
In our previous units we have focused on mining relational data-bases,
---------------------- transactional databases, and data warehouses formed by the transformation and
---------------------- integration of structured data. Vast amount of data in various complex forms
(e.g., structured and unstructured, hypertext and multimedia) have been growing
---------------------- explosively owing to the rapid progress of data collection tolls, advanced
database system technologies and World –Wide Web (WWW) technologies.
---------------------- Therefore, an increasingly important task in data mining is to mine complex
---------------------- types of data, including complex objects, spatial data, multimedia data, time-
series data, text data, and the World Wide Web.
---------------------- In this chapter, we examine how to further develop the essential data mining
---------------------- techniques (such as characterization, association, classification and clustering),
and how to develop new ones to cope with complex types of data and perform
---------------------- fruitful knowledge mining in complex information repositories. Since search
into mining such complex databases has been evolving at a hasty pace, our
---------------------- discussion covers only some preliminary issues.
----------------------
10.2 CLUSTERING AND OUTLIERS
----------------------
Clustering is a process of dividing a set of data into a set of meaningful sub-
---------------------- classes, called clusters. Clustering is an unsupervised learning where classes
---------------------- are not predefined. Clustering is a method of learning through observation than
learning by example. It finds natural grouping of instances given unlabeled data.
---------------------- Clustering is thus, also called as process of organizing objects into groups
---------------------- where objects are “similar” within group and “dissimilar” to the objects
belonging to other clusters.
----------------------
Outliers
---------------------- Johnson (Johnson, 1992) defines an outlier as an observation in a dataset,
which appears to be inconsistent with the remainder of that set of data. Outliers
----------------------
are often, considered as an error or noise, but they may carry important
---------------------- information about abnormal characteristics of the systems and entities that
effects the data generation process.
----------------------

Outlier detection methods are used for various applications, such as credit Notes
card fraud detection, clinical trials, voting irregularity analysis, data cleansing, ----------------------
network intrusion, severe weather prediction, geographic information systems
and other data-mining tasks. ----------------------
Applications of Clustering: ----------------------
Clustering use in area of Information retrieval, text mining, web analysis,
marketing, medical diagnostic. Clustering can also be used for outlier detection.
----------------------
● Marketing: It helps marketers to find similar and distinct group of ----------------------

customers. This knowledge helps him to decide various marketing ----------------------
strategy and programs.
● Land use: finding areas of similar land used in an earth observation ----------------------
database. ----------------------
● Insurance: Identifying groups of insurance policy holders with a high
average claim cost. ----------------------
● City-planning: to identify group of houses according to their house type, ----------------------

value, and geographical location.
----------------------
----------------------
A good clustering is method help to create high quality clusters in which-
● The intra-class (that is, intra-cluster) similarity is high. ----------------------
● The inter-class similarity is low. ----------------------

The clustering technique should be strong to find hidden pattern. Depending ----------------------
on clusters, generated quality of clustering technique is decided.
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 10.1: Different ways of Clustering ----------------------
Different Types of Clusters ----------------------

1. Well Separated Cluster ----------------------
Cluster defines group of objects. Each object in a group is similar to the
object in the group than the object in another group. To define similarity and
----------------------
dissimilarity within objects of cluster a threshold value is use. Well-separated ----------------------
cluster defines two points in two-dimensional spaces where any two points in
different group is having larger distance than the two points within the group. ----------------------
----------------------

Notes
----------------------
----------------------
---------------------- Fig. 10.2: Well Separated Clusters
---------------------- 2. Prototype Based Cluster

Cluster defines group of objects in which each object is similar to the
---------------------- prototype-defined cluster compared to the prototype of any other cluster.
---------------------- For data having continuous attributes, centroid is use as prototype of
clustering. A centroid is calculated as mean of all values. If data is categorical
---------------------- data than most representative points of cluster is use, called as medoid. Most of
---------------------- the times prototype based model is also referred as centre based cluster.
----------------------
----------------------
----------------------
---------------------- Fig. 10.3: Each point is closer to its centre of cluster
3. Graph Based or Contiguity Based Cluster
----------------------
In this technique, data is represented graphically. Nodes of the graphs are
---------------------- represented as object and links between nodes represent relationship between
object. Objects connected to each other are within the group only and contains
---------------------- no connection between the objects outside the group. This type of clusters are
---------------------- also called connected component. Two objects within a specific distance can be
connected to each other.
----------------------
----------------------
----------------------
----------------------
Fig. 10.4: Each point in cluster is closer to at least one point in its cluster.
----------------------
4. Density Based Cluster
---------------------- A cluster is dense region of object. Dense cluster contains noise and outliers.
---------------------- Density based cluster represents the separation of high density clusters with low
density.
----------------------
----------------------
----------------------
---------------------- Fig. 10.5: High and low region clusters are separated

5. Shared Property Cluster Notes
Cluster object that shared some common properties with each other. A ----------------------
clustering technique required to find such specific cluster. This cluster can be
totally new cluster. ----------------------
The method of finding such cluster is called as conceptual cluster. ----------------------
----------------------
----------------------
----------------------
Fig. 10.6: Some points in cluster share common properties. ----------------------
Data Structures
----------------------
Data matrix: This represents n objects with p variables.
----------------------
----------------------
----------------------
----------------------
Dissimilarity matrix
----------------------
It is represented by an by n table. It is set of proximities that are available for
all pairs of n objects. d(i, j) is the measured difference or dissimilarity between ----------------------
objects i and j.
----------------------
----------------------
----------------------
----------------------
----------------------
Standardization / Normalization ----------------------

If the values of attributes are in different units and if values are very high ----------------------
then the distance between two cases can be very high.
----------------------
If attribute values are very small then the difference between the two cases
will be small. The attributes with high variability will dominate the metric. ----------------------
Therefore, to overcome these problems, standardization or normalization is use.
----------------------
----------------------
----------------------
----------------------

Notes 10.2.2 Measuring Dissimilarity or Similarity in Clustering
---------------------- To form clusters of similar objects, similarity is measured between points.

Similarity is represented in terms of a distance function.
---------------------- To decide set of points close enough to consider or not, distance measure
---------------------- D(x, y) is use.
Distance measure D are:
----------------------
● D(x, x) = 0
---------------------- ● D(x, y) = D(y, x)
---------------------- ● D(x, y) ≤ D(x, z) + D(z, y) the triangle inequality.
The distance between two points x=[x1, x2,…xk] and y=[y1,y2,….yk ] defined
----------------------
using one of the following measures, Euclidean distance, Manhattan distance
---------------------- or max of dimension is used to calculate distance.
---------------------- Euclidean distance:
----------------------
----------------------
---------------------- Manhattan distance:
----------------------
---------------------- Where n is the number of variables, and Xi and Yi are the values of the ith
---------------------- variable, at points X and Y respectively.
Max of dimension:
----------------------
----------------------
----------------------
---------------------- Where k is the number of variables.
----------------------
----------------------
1. Clustering is an _____ type of learning where classes are not
---------------------- predefined.
---------------------- 2. _____is a process of partitioning a set of data (or objects) into a set of
meaningful sub-classes, called clusters.
----------------------
----------------------
----------------------

10.3 CLUSTERING TECHNIQUES Notes
----------------------
1. Partitioning algorithms:
Create various partitions and then evaluate them based upon various ----------------------
criterions. ----------------------
As name suggest, if n objects are given, this method creates k partition of
data. Such that k <=n. This k partitions are called as clusters. ----------------------
Initially, this method creates k partitions. At least one object must belong to ----------------------
this partition. Then iteratively it creates clusters by moving objects from one
group to another to improve quality of clusters. The k-means and k-medoids
----------------------
algorithm is use for forming clusters. ----------------------
2. Hierarchy algorithms: ----------------------
Create a hierarchical decomposition of the set of data based upon various
criterions. ----------------------
It is set of nested clusters that are organized as tree. Cluster at root of tree ----------------------
contains all objects and a node in tree is union of its sub clusters.
----------------------
Hierarchical clustering method is further classified as either agglomerative or
divisive clustering. In agglomerative clustering, the node representation starts ----------------------
in a bottom-up manner. In divisive clustering node, decomposition is formed in
top-down manner.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 10.7: Agglomerative and divisive hierarchical clustering

----------------------
● Density-based: It is based on connectivity and density functions. Each ----------------------

cluster has a high density of points than outside of the cluster. DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) is a density
----------------------
based clustering algorithm. ----------------------
● Grid-based: It is based on a multiple-level granularity structure. It
----------------------
divides the object into infinite number of cell to form grid structure. All
the clustering operations are performed on this cluster ----------------------
● Model-based: A model is use to find best model that suits cluster.
----------------------

Notes An Attempt is made to optimize the fit between the data and some
mathematical model
----------------------
3. K-means Algorithm:
---------------------- K-means Algorithm is Partitioning algorithms and most popular for
---------------------- clustering.
---------------------- 1. Input: k ; the number of clusters; D : the training data set;
2. Output: A set of k clusters;
----------------------
3. Select k objects as initial cluster centroids;
---------------------- 4. Repeat
---------------------- a. Assign all objects in D to nearest centroids
---------------------- b. Update centroid for each cluster i.e. computer the mean value of
objects for each cluster;
---------------------- 5. Until no change to all centroids or maximum iteration has been reached;
---------------------- Select the k number of cluster centers, called centroids. Assign each element
to its nearest clusters. Nearest cluster is decided by calculating distance between
---------------------- two points and selecting minimum distance points. Move each cluster center to
---------------------- the mean of its assigned items.
Repeat the steps until cluster assignments are less than a threshold.
----------------------
Cluster the following eight points (with (x, y) representing locations) into
---------------------- three clusters
---------------------- A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9).
---------------------- 1. Select the Initial cluster centres are A1 (2, 10), A4 (5, 8) and A7 (1, 2).
2. The distance function between two points a=(x1, y1) and b=(x2, y2) is
----------------------
defined as:
---------------------- ρ(a, b) = |x2 – x1| + |y2 – y1| .
---------------------- 3. Use k-means algorithm to find the three cluster centres after the second
iteration.
----------------------
• The initial cluster centers means, are (2, 10), (5, 8) and (1, 2) - chosen
---------------------- randomly.
---------------------- Calculate the distance from the first point (2, 10) to each of the three means,
by using the distance function:-
---------------------- x1, y1 x2, y2
---------------------- (2, 10) (2, 10)
---------------------- ρ(a, b) = |x2 – x1| + |y2 – y1|
ρ(point, mean1) = |x2 – x1| + |y2 – y1|
----------------------
= |2 – 2| + |10 – 10|
----------------------

=0+0 Notes
=0 ----------------------
The point has the shortest distance to the mean – that is mean 1 (cluster 1), ----------------------
since the distance is 0.
point mean1
----------------------
x1, y1 x2, y2 ----------------------

(2, 5) (2, 10) ----------------------
ρ(a, b) = |x2 – x1| + |y2 – y1| ----------------------
ρ(point, mean1) = |x2 – x1| + |y2 – y1|
----------------------
= |2 – 2| + |10 – 5|
=0+5
----------------------
=5 ----------------------
point mean2 ----------------------
x1, y1 x2, y2 ----------------------
(2, 5) (5, 8)
----------------------
ρ(a, b) = |x2 – x1| + |y2 – y1|
----------------------
ρ(point, mean2) = |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 5| ----------------------
=3+3 ----------------------
=6 ----------------------
X2-> (2, 10) (5, 8) (1, 2)
----------------------
PointX1 Dist Mean 1 Dist Mean 2 Dist Mean 3 Cluster
A1 (2, 10) 0 5 9 1 ----------------------
A2 (2, 5) 5 6 4 3 ----------------------
A3 (8, 4)
----------------------
A4 (5, 8)
A5 (7, 5) ----------------------
A6 (6, 4)
----------------------
A7 (1, 2)
A8 (4, 9) ----------------------
Repeat process for each point. After first iteration table will be, ----------------------
(2, 10) (5, 8) (1, 2) ----------------------
Point Dist Mean 1 Dist Mean 2 Dist Mean 3 Cluster
----------------------
A1 (2, 10) 0 5 9 1
A2 (2, 5) 5 6 4 3 ----------------------

Notes A3 (8, 4) 12 7 9 2
A4 (5, 8) 5 0 10 2
----------------------
A5 (7, 5) 10 5 9 2
---------------------- A6 (6, 4) 10 5 7 2
A7 (1, 2) 9 10 0 3
----------------------
A8 (4, 9) 3 2 10 2
---------------------- Cluster 1 : (2, 10)
---------------------- Cluster 2: (8, 4), (5, 8),(7, 5),(6, 4),(4, 9)
---------------------- Cluster 3: (2, 5) , (1, 2)
Again, find new cluster centers by calculating mean of all points.
----------------------
For Cluster 1, A1(2, 10), is the single point, so the cluster center remains
---------------------- unchanged.
---------------------- For Cluster 2, we have ((8+5+7+6+4)/5, (4+8+5+4+9)/5)
---------------------- = (6, 6)
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
----------------------
New mean: (2,10), (6, 6), (1.5, 3.5)
----------------------
Next, process Iteration2, Iteration3, and so on until the means do not change
---------------------- anymore.
After Iteration2: C1= (3, 9.5), C2= (6.5, 5.25), C3= (1.5, 3.5)
----------------------
Clusters: 1 {A1, A8}, 2{A3, A4, A5, A6}, 3: {A2, A7}
----------------------
After Iteration 3: C1= (3.6, 9), C2= (7, 4.3) and C3=(1.5,3.5)
---------------------- Clusters: 1 {A1, A4, A8}, 2{A3, A5, A6}, 3: {A2, A7}
---------------------- After third iteration, mean value remains same. So the algorithm halts at this
step.
----------------------
• Advantages
----------------------
1. It is easy to implement and works with any of standard norms.
---------------------- 2. It is not sensitive to data ordering. It also allows straightforward
---------------------- parallelization.
Disadvantages
----------------------
1. Result is based upon initial guess of centroids.
---------------------- It is sensitive with respect to outliers
2.
---------------------- 3. It is not obvious what good number of k in each case.
---------------------- 4. Resulting clusters can be unbalanced or even empty.
---------------------- Hierarchical clustering:
---------------------- In hierarchical clustering, cluster are group and merge with each other until
one cluster left.
143
Algorithm: Notes
Input: training dataset Output: A hierarchical cluster ----------------------
● Let consider cluster ‘C’ ----------------------
● Calculate proximity matrix.
----------------------
● Merge two nearest cluster based on some criteria (distance measurement)
● Repeat until only one cluster left. ----------------------
Distance measurement ----------------------

a. Single-linkage clustering ----------------------
The smallest distance is considered between one cluster to another cluster
from any object member of one cluster to any object member of another
----------------------
cluster. ----------------------
b. Complete-linkage clustering
----------------------
The greatest distance is considered between one cluster and another
cluster to from any object member of one cluster to any member of ----------------------
another cluster. ----------------------
c. Average-linkage clustering
----------------------
The average distance is considered between one cluster and another cluster
to be equal to the average distance from any member of one cluster to any ----------------------
member of the another cluster
----------------------
Construct tree using agglomerative clustering:
----------------------
Consider, A,B,C,D,E,F group of clusters. Following is dissimilarity matrix
between the clusters, ----------------------
A B C D E F ----------------------
A 0.0 1.0 4.0 8.0 9.0 2.0 ----------------------
B 1.0 0.0 2.5 7.0 6.0 4.5 ----------------------
C 4.0 2.5 0.0 2.0 3.0 5.5 ----------------------
D 8.0 7.0 2.0 0.0 0.4 3.5 ----------------------
E 9.0 6.0 3.0 0.4 0.0 1.5 ----------------------
F 2.0 4.5 5.0 3.5 1.5 0.0 ----------------------
STEP1
----------------------
Find lowest value in the proximity matrix. In above table it is 0.4, so cluster ----------------------
D & E will merge. Consider lowest value between row and column for merging.
Update proximity matrix.
----------------------
----------------------

Notes A B C D, E F
---------------------- A 0.0 1.0 4.0 8.0 2.0
----------------------
B 1.0 0.0 2.5 7.0 4.5
----------------------
C 4.0 2.5 0.0 2.0 5.5
----------------------
D, E 8.0 6.0 2.0 0.0 1.5
----------------------
F 2.0 4.5 5.0 1.5 0.0
----------------------
Step 2: Now merge A&B and update proximity matrix.
----------------------
A, B C D, E F
----------------------
A, B 0.0 2.5 7.0 2.0
----------------------
C 2.5 0.0 2.3 5.5
----------------------
---------------------- D, E 6.0 2.3 0.0 1.5
---------------------- F 2.0 5.0 1.5 0.0
---------------------- Step 3: Now merge DE and /F,
---------------------- A,B C D,E/F
---------------------- A,B 0.0 2.5 2.0
---------------------- C 2.5 0.0 2.3
---------------------- D,E/F 2.0 2.3 0.0
---------------------- Step 4: Merge A,B and D,E/F
---------------------- A,B, D,E/F C

---------------------- A,B/D,E,F 0.0 2.3
---------------------- C 2.3 0.0

---------------------- Step 5 : Merge A,B, D,E/F with C
---------------------- A,B, D,E/F/C
---------------------- A,B/D,E,F,C 2.0
---------------------- Only one cluster: - (A,B,C,D,E,F)

---------------------- { A,B,C,D,E,F}
----------------------

Notes
1. _______Construct various partitions and then evaluate them by some ----------------------

criterion.
----------------------
2. Grid-based is based on a ________level granularity structure.
3. DBSCAN is ________ clustering algorithm.
----------------------
----------------------
10.4 MULTIDIMENSIONAL ANALYSIS - DESCRIPTIVE ----------------------

MINING OF COMPLEX DATA OBJECTS ----------------------
Many applications like scientific research and engineering design have ----------------------
complex data. These data need to store, retrieve and analyse in structured way.
However, these objects cannot store in uniformly structured way. For such ----------------------
applications, data is stored by object-relational and object-oriented database
systems, which deal with complex data objects. ----------------------
Each object in class associated with object identifier, set of attributes, and set ----------------------
of methods describes computational rules.
----------------------
If complex data need to be analysed and mine than it require setting up
multidimensional data warehouse for complex data and then develop effective ----------------------
and scalable method for data mining.
----------------------
Object-relational and object-oriented databases has features to handle
complex data by storing, accessing, and modelling-valued data. For example, ----------------------
set-valued and list-valued data and data with nested structures.
----------------------
A set-valued attribute can be of homogeneous or heterogeneous in nature. This
data can be generalized by generalization of each value in the set by mapping ----------------------
to its related higher-level concept or derivation of the set. Generalization is
carried out by applying various generalization operators to find alternative
----------------------
generalization paths. ----------------------
A set-valued attribute ----------------------
Suppose that the hobby of a person is a set-valued attribute containing the set
of values {cricket, basketball, violin, solicitor} This set is generalized to high-
----------------------
level concepts, such as { sports, music, computer games}. To know how many ----------------------
elements are generalized, a count is placed with th generalized values {Sports
(2),music(1), computer games(1) } ----------------------
Let us consider a person’s education data record, ----------------------
“((B.A ARTS, Pune University June 2000), “(Ph.D. Computer Science, ----------------------
Mumbai university, Dec, 2005)”. These records can be represented by removing
----------------------

Notes less important attributes in records. For example, ignoring the month attribute
in list and keeping important and meaningful attributes.
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. A _____attribute may be of homogeneous or heterogeneous in type.
---------------------- 2. Multidimensional complex data objects applications data is stored by
____ and _____.
----------------------
----------------------
10.5 MINING SPATIAL DATABASES
----------------------
Spatial database is database which store space-related data, such as maps,
---------------------- pre-processed remote sensing or medical imaging data and VLSI chip layout
data.
----------------------
Spatial data mining is the process of discovering interesting, useful, spatial
---------------------- relationships, non-trivial patterns from large spatial datasets. Some of examples
---------------------- of spatial patterns such as cancer clusters to investigate environment health
hazards, crime hotspots for planning police patrol routes, bald eagles nest on
---------------------- tall trees near open water etc.
---------------------- In spatial data mining, one of the challenges is that the information is usually
not uniformly distributed in spatial datasets. Spatial patterns are detected using
---------------------- classification, associations, clustering and outlier detection.
---------------------- Spatial Association Rule

It is an association between the objects based on spatial neighbourhood
----------------------
relations. Association with spatial or non-spatial attributes can be done. For
---------------------- example, rental of houses around cyber tower are mostly Rs. 500sq.ft. This
example shows the association of spatial and non-spatial attributes. Another
---------------------- example - Bald eagles’ nest on tall trees, near open water is association of
spatial data.
----------------------
Attribute Oriented Induction
----------------------
Concept hierarchies can be used to derive relationship between spatial and
---------------------- non-spatial attributes. Using concept hierarchy, one may decide the level to
which details of spatial data has to discover. For example, one may be interested
---------------------- in finding the land use patterns. This classification can be hierarchical. It can be
---------------------- recreational facility or residential facility. Further recreational facility can be
cinema or restaurant.
----------------------
Aggregate Proximity Relationship
---------------------- One of the problems is to determine aggregate proximity relationship between
---------------------- spatial cluster based on spatial and non-spatial attributes. With n input clusters,

associate cluster with classes of features. For example, educational institute can Notes
be secondary school, junior college or higher institute. The problem is to find ----------------------
cluster close to proximity. School can be private girls’ school or private boys’
school; if these 2 and 3 cluster close to girls school and other is to close boys ----------------------
school then generalized fact can be all 3 clusters are close to private school.
----------------------
Fill in the Blanks.

----------------------
1. Spatial database is a database, which store ___related data. ----------------------

2. A rule between the objects based on spatial neighbourhood relations ----------------------
is called ____._
----------------------
3. In attribute oriented induction ____can be used to derive relationship
between spatial and non-spatial attributes. ----------------------
----------------------
10.6 MINING MULTIMEDIA DATABASES
----------------------
Multimedia Databases contain complex texts, graphics, images, video
fragments, maps, voice, music, and other forms of audio/video information.
----------------------
Multimedia information is stored in form of bits/bytes. The segments of data ----------------------
are linked together or indexed in a multi-dimensional way for easy reference.
----------------------
Similarity Search in Multimedia Data
To search multimedia data two main approaches are used ----------------------
● Description-based retrieval systems ----------------------

● Content-based retrieval systems ----------------------
Description-based retrieval systems works on image description for retrieval
purposes. Content-based retrieval systems, which support retrieval based on
----------------------
the feature of image content such as colour, pattern, shape, texture of image. ----------------------
The result of description-based retrieval system is very poor as compared to
content-based system. ----------------------
There are two types of queries in content-based retrieval that are image ----------------------
sample-based queries and image feature specification queries.
----------------------
Image-sample-based queries find all of the images that are similar to the
given image sample. ----------------------
Image feature specification queries specify features related to image such as ----------------------
size of image, colour of image, texture or shape of image. These features are
translated into a feature vector tso that they should match feature vectors of the ----------------------
images in the database.
----------------------
Content-based retrieval used in applications like medical diagnosis, weather
prediction, TV production, Web search engines and e-commerce. QBIC (Query ----------------------

Notes By Image Content), support both sample-based and image feature specification
queries.
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. To search multimedia data ___ and ___ retrieval systems are two
main approaches are used.
----------------------
2. A ______ tool supports both sample-based and image feature
---------------------- specification queries.
----------------------
---------------------- 10.7 MINING TIME-SERIES
---------------------- A time-series database is special type of data consists of sequences of values
obtained time.
----------------------
For example, financial data that contain objects that are time series of daily
---------------------- prices of various stocks. If two measurements are close in time then values of
those measurements are often similar.
----------------------
Time-series forecasting finds a mathematical formula that will approximately
---------------------- generate the historical patterns in a time series.
---------------------- Analysis task of time series include feature extraction, similarity measure,
segmentation of data set, matching two time series, clustering and classifying
---------------------- time series data.
---------------------- Similarity function
---------------------- This function is required to find a series database that is similar to given
query series. A simple approach is to define similarity function x and yin terms
---------------------- of Lp distances as point of Rn. but it is not suitable to determine similarity for
series in different scale and different shifts.
----------------------
Scale free similarity
----------------------
Consider an example: Two companies have identical stock price fluctuation
---------------------- but one company’s stock is worth twice as much as other company. Thus, pattern
is similar but numeric values are different. It is important to find similar time
---------------------- series objects in data mining.
---------------------- Shift free similarity
---------------------- Temperature at two different days may start at different values but their
fluctuation may be exactly same. That means this is same series with two
---------------------- different baselines.
---------------------- We say that two time series X and Y are similar if there exist a>0 and b, such
that yi=axi+b, for all i.
----------------------

Notes
1. A ___database is special type of data consists of sequences of values ----------------------

obtained time.
----------------------
2. If two measurements are close in time then values of those
measurements are often _______. ----------------------
----------------------
10.8 MINING SEQUENCE DATA
----------------------
Sequence database t consists of sequences of ordered events, with or without ----------------------
concrete notions of time. Sequential pattern mining is mining of ordered events
that occur frequently ----------------------
Let, set of all items I={I1, I2, : : : , In} , nonempty item set. ----------------------
A sequence s is represented as {e1 e2 e3….ej} where event e1 occurs before
e2, which occurs before e3, and so on. It is an ordered list of events. ----------------------
An item set is a nonempty set of items. A sequence is an ordered list of ----------------------

events. A sequence s is denoted {e1e2e3….ej
----------------------
A sequence ὰ={a1 a2…….an} is called a subsequence of another sequence.
b= {b1 b2 ….bm }, and b is a super sequence of ὰ, denoted as ὰГb, if there
----------------------
exist integers 1 <= j1 <j2 <….<jn<=m such that a1⊆bj1, a2⊆bj2 , . . . , an ----------------------
⊆bjn
The algorithm for solving sequence-mining problems is mostly GSP-Based ----------------------
on Candidate Generate-and-Test algorithm used. There are two main steps of
algorithm, ----------------------
Candidate generation ----------------------
Given set of frequent (k-1) sequences, (k-1) the candidate for the next phase is
created by joining F(k-1) with itself. A pruning phase eliminates any sequence at
----------------------
least one of those subsequences is not frequent. ----------------------
Support counting ----------------------
Hash tree based search is use for efficient support counting. Non-maximal
frequent sequences are not consider. ----------------------
SPADE (sequential Pattern discovery using Equivalence classes) for ----------------------

discovering set of all frequent sequences. It is level wise an Apriori-Based
Vertical Data Format Sequential Pattern Mining Algorithm.
----------------------
----------------------
----------------------
----------------------
Notes
----------------------
1. A ___ is any database that consists of sequences of ordered events,
----------------------
with or without concrete notions of time.
---------------------- 2. A _____ search technique is employed for efficient support counting.
----------------------
---------------------- 10.9 MINING TEXT DATABASES

---------------------- Due to growth of text data, it is required to extract data automatically, the
---------------------- previously unknown information from data. Text Mining is the discovery of
new previously unknown information, extracted automatically by computer
---------------------- from different written resources. Text mining is a young interdisciplinary
field, which draws on information retrieval, data mining, machine learning,
---------------------- statistics and computational linguistics. There are many possible applications
of text mining such as Customer profile analysis, Patent analysis, Information
----------------------
dissemination, Company resource planning.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Fig. 10.8: A Text Mining Framework
10.9.1 Text Mining Process
----------------------
1. “Preprocessing" the text into a structured format.
----------------------
2. Text Transformation and attribute selection
---------------------- 3. Mining the reduced data with traditional data mining techniques.
---------------------- 1. Data pre-processing
---------------------- Text data is unstructured data and it may contain Misspellings, abbreviation,
Punctuation and other non-alphanumeric characters, some noisy words etc. Data
---------------------- pre-processing deals with detecting and removing errors and inconsistencies
---------------------- from data in order to improve the quality of data. To make text data useful,
unstructured text data is converted into structured data.
---------------------- It include tokenization, stop word removal and stemming techniques.
151
Tokenization: Tokenization is process of converting the document into Notes
tokens or words (noun, verb, pronoun, article, conjunction and preposition) ----------------------
without understanding their meaning.
Further data is cleaned to remove stop words. Stop words are the common ----------------------
frequently used words like pronouns, prepositions, white spaces, punctuation ----------------------
marks, conjunction etc.. Remove those words that are too general like ‘the’, ‘an’,
‘a’, ‘and’, ‘unless’, ‘versus’ etc. these are the words which does not contribute ----------------------
any meaning or not add any knowledge to data analysis
----------------------
Next stemming is applied to the data. Stemming or lemmatization is a
technique use to convert words into their root words. Stemming is used to ----------------------
identify the word stems for the remaining words to make the words in simple
format –Removing al, ing, lie, tion, ies, ’s etc. Stemming is the process of
----------------------
removing suffixes and endings of words. Ex: computable, computation, ----------------------
computing, computational – comput.
Following a selection of suffixes and prefixes for removal during stemming
----------------------
(David, 1996) ----------------------
_ suffixes: ly, ness, ion, ize, ant, ent , ic, al , ical, able, ance, ary, ate, ce, y,
----------------------
dom, ed, ee, eer, ence, ency, ery, ess, ful, hood, ible, icity, ify, ing, ish, ism, ist,
istic, ity, ive, less, let, like, ment, ory, ty, ship, some, ure ----------------------
_ prefixes: anti, bi, co, contra, counter, de, di, dis, en, extra, in, inter, intra, ----------------------
micro, mid, mini, multi,non, over, para, poly, post, pre, pro, re, semi, sub, super,
supra, sur, trans, tri, ultra, un ----------------------
This process represents uniform format to the words. After this words are use ----------------------
for creating bag of words by applying different techniques. Most frequent terms
can use to repreent the document using term frequency technique or inverse ----------------------
term document technique can be use to represent term matrix.
----------------------
2. Text Transformation and attribute selection
----------------------
Text Transformation contains text Representation and Feature Selection. Text
document is represented by the words it contains and their occurrences. There ----------------------
are two main approaches of document representation using “Bag of words” and
Vector Space. ----------------------
Bag of words are collection of words, each word is represented as a separate ----------------------
variable having numeric weight depending upon the word occurrences in a
document. ----------------------
Feature Selection is to select a subset of the features to represent a document. ----------------------

Further if document have high dimension it is very difficult to handle and
process it, therefore dimension reduction techniques are used. These techniques ----------------------
are required in classification, clustering, information retrieval, data analysis. ----------------------
3. Data/text mining and result Evaluation
----------------------
Then data mining techniques such as Classification, Clustering, and
Associations are used to get the result. If results are well suited for application ----------------------

Notes then process terminates. Otherwise, results generated are used as part of the
input for one or more earlier stages
----------------------
Other areas
----------------------
Information retrieval
---------------------- It is concern with finding information from documents based on users’
---------------------- requirements. The documents can be found similar to the user specification
or exact index term is retrieved in collection so appropriate document can be
---------------------- returned. Examples of l IR systems are online library catalogs, online document
management systems
----------------------
Information Extraction
----------------------
Information Extraction is term represent to the automatic extraction of
---------------------- structured information such as entities, relationships between entities, and
attributes describing entities from unstructured sources.
----------------------
Many techniques are used for information extraction problem, such as core
---------------------- statistical and rule-based models. Frameworks and Architectures are used for
managing the extraction pipelines, for performance optimization and uncertainty
---------------------- management.
----------------------
----------------------
1. A ______refers to the automatic extraction of structured information
----------------------
such as entities, relationships between entities.
---------------------- 2. A ______ is concern with finding information from documents based
on users’ requirements.
----------------------
---------------------- 10.10 MINING THE WWW

---------------------- The web is a highly dynamic information source. Web mining is use to extract
---------------------- knowledge from web data, including web documents, hyperlinks between
documents, usage logs of web sites, etc.
---------------------- The main challenge is to make efficient and effective discovery and use of
---------------------- resources on the internet. Content data is the collection of facts a web page is
designed to contain. It may consist of text, images, audio, video, or structured
---------------------- records such as lists and tables. This web page data is more complex than any
other data.
----------------------
----------------------
Web page has structure. The DOM (Document object model) structure of a
---------------------- web page is a tree structure, consists of every HTML tag in the page. Web page
is collection of information represented in form of hyperlink.
----------------------

Hyperlinks Notes
A hyperlink is a structural unit, use to link to a location on another web page, ----------------------
within the same web page or on a different web Site.
----------------------
Document Structure
----------------------
Contains on web page is organised in a tree structure format. HTML and
XML represents basic document structure to follow. ----------------------
10.10.2 Web Usage Mining ----------------------
Web usage mining is a technique to find interesting patterns from web data.
The pattern helps to understand the current requirement and design web based
----------------------
applications according to needs.. ----------------------
Web Server Data ----------------------
Webserver collects user logs and includes IP address, page reference and
access time. ----------------------
Application Server Data ----------------------

It is used to track various business events and log them into application ----------------------
server log. Web logic, Story Server are commercial application server.
----------------------
Application Level Data
It is special type of data. New events can be defined in application and logging
----------------------
can be turn on for that event. Many end applications require a combination of ----------------------
one or more of the above techniques to be used.
There are two main approaches in web usage mining:
----------------------
1. General access pattern tracking ----------------------

Use web logs and analyses the web logs to understand access pattern and trends. ----------------------
2. Customized usage tracking ----------------------
This tracks individual trends. This is to learn user profile in adaptive
interfaces. The depth of site structure, format of resources can be dynamically
----------------------
customized for each user over time. The success is depending upon what and ----------------------
how knowledge is discover from log data.
----------------------

1. A _______ analyses the web logs to understand access pattern and ----------------------
trends.
----------------------
2. A ______ tracks individual trends.
----------------------
----------------------

Notes
Activity 1
----------------------
---------------------- Weblog records provide web usage information for data mining. Mining
of weblog access records help cluster to facilitate customized marketing.
---------------------- Discuss how to develop an efficient implementation method that may help
user clustering.
----------------------
----------------------
Summary
----------------------
● Clustering is a process of partitioning a set of data (or objects) into a set
---------------------- of meaningful sub-classes, called clusters.
---------------------- ● Clustering is thus, also called as process of organizing objects into groups
where objects are “similar” within group and “dissimilar” to the objects
---------------------- belonging to other clusters.
---------------------- ● Outlier as an observation in a dataset, which appears to be inconsistent
with the remainder of that set of data.
---------------------- ● Outlier detection methods are used for various applications, such as
---------------------- credit card fraud detection, clinical trials, voting irregularity analysis,
data cleansing, network intrusion, severe weather prediction, geographic
---------------------- information systems, and other data-mining tasks.
---------------------- ● Partitioning algorithms, Hierarchical algorithms, Density-based, Grid-
based, Model-based are clustering categories.
---------------------- ● Multidimensional Complex Data Objects applications data is stored by
---------------------- object-relational and object-oriented database systems, which deal with
complex data objects
---------------------- ● Spatial database is database which store space-related data, such as maps,
---------------------- pre-processed remote sensing or medical imaging data, and VLSI chip
layout data.
---------------------- ● Multimedia Databases contain complex texts, graphics, images, video
---------------------- fragments, maps, voice, music, and other forms of audio/video information
●
A time-series database is special type of data consists of sequences of
----------------------
values obtained time. For example, financial data that contain objects that
---------------------- are time series of daily prices of various stocks. If two measurements are
close in time then values of those measurements are often similar.
---------------------- ● Sequence database is any database that consists of sequences of ordered
---------------------- events, with or without concrete notions of time.
● Text Mining is the discovery of new previously unknown information,
---------------------- extracted automatically by computer from different written resources.
---------------------- ● Web mining is the application of data mining techniques to extract
knowledge from web data, including web documents, hyperlinks between
---------------------- documents, usage logs of web sites, etc.

Keywords Notes
----------------------
● Clustering: Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called clusters ----------------------
● Outlier: Outlier as an observation in a dataset, which appears to be ----------------------

inconsistent with the remainder of that set of data
----------------------
1. Define Clustering. What do you mean by Cluster Analysis? ----------------------

2. List all the fields in which clustering techniques are used. ----------------------
3. Differentiate Agglomerative and Divisive Hierarchical Clustering.
----------------------
4. Define Density based method.
----------------------
5. What is Time Series Analysis?
6. Define text mining.
----------------------
7. What does web mining mean? ----------------------

8. Define spatial data mining ----------------------
9. Explain multimedia data mining.
----------------------

1. Clustering is an unsupervised type of learning where classes are not ----------------------
predefined.
2. Clustering is a process of partitioning a set of data (or objects) into a set ----------------------
of meaningful sub-classes, called clusters. ----------------------
----------------------
Fill in the Blanks.
----------------------
1. Partitioning algorithms construct various partitions and then evaluate
them by some criterion. ----------------------
2. Grid-based is based on a multiple level granularity structure. ----------------------
DBSCAN is a density based clustering algorithm
----------------------
Fill in the Blanks.
----------------------
1. A set-valued attribute may be of homogeneous or heterogeneous in type. ----------------------

2. Multidimensional complex data objects applications data is stored by ----------------------
object-relational and object-oriented database system.


1. Spatial database is database, which store space related data.
----------------------
2. A rule between the objects based on spatial neighbourhood relations is
---------------------- called spatial association.
---------------------- 3. In attribute-oriented induction, Concept hierarchies can be used to derive
relationship between spatial and non-spatial attributes.
---------------------- 1. To search multimedia data description-based and Content-based retrieval
systems are two main approaches are used.
----------------------
2. A QBIC (Query By Image Content), tool supports both sample-based and
---------------------- image feature specification queries.
----------------------
Fill in the Blanks.
----------------------
1. A Time-series database is special type of data consists of sequences of
---------------------- values obtained time.
---------------------- 2. If two measurements are close in time then values of those measurements
are often similar.
----------------------
---------------------- 1. A Sequence database is any database that consists of sequences of ordered
events, with or without concrete notions of time.
----------------------
2. A Hash tree based search technique is employed for efficient support
---------------------- counting.
1. An Information Extraction refers to the automatic extraction of structured
----------------------
information such as entities, relationships between entities.
---------------------- 2. An Information retrieval is concern with finding information from
documents based on users’ requirements.
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. A General access pattern tracking analyses the web logs to understand
access pattern and trends.
----------------------
2. A Customized usage tracking tracks individual trends.
----------------------

158 Mining Complex Types of Data
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Concepts, Applications and Research Directions
Srivastava, Jaideep, Desikan, Prasanna, Kumar, Vipin. Web Mining- 2.
---------------------- Computing, 2009.
----------------------
Challenges And Future Directions”, International Journal of Reviews in
Mahesh R, Suresh B and M Vinayababu. “Text Mining: Advancements, 1.
----------------------
Notes
Suggested Reading
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Data Mining Applications and Trend
UNIT
11
Structure:
11.1 Introduction
11.2 Applications of Data Mining
11.3 Data Mining System Products and Research Prototypes
11.3.1 Examples of Commercial Data Mining Systems
11.4 Additional Themes on Data Mining
11.5 Social Impacts of Data Mining
11.6 Trends in Data Mining
Summary
Key Words
Suggested Reading
Data Mining Applications and Trend 160

Notes
Objectives
----------------------
----------------------
● Discuss data mining applications.
---------------------- ● Describe different data mining themes.
---------------------- ● Analyse the trends in Data Mining.
----------------------
----------------------
---------------------- 11.1 INTRODUCTION
---------------------- Data mining is a relatively young discipline having diverse applications. In
the previous units, we have studied the concepts of data mining and the various
---------------------- techniques used for analysing the data. It is useful in almost all fields, such as
---------------------- banking, marketing, medicine, fraud detection, manufacturing and production,
scientific data analysis, etc. Now, in this unit, we will be discussing various
---------------------- applications of data mining. We shall also analyse its trends..
----------------------
11.2 APPLICATIONS OF DATA MINING
----------------------
A few application domains are discussed below on how data mining tools
---------------------- are used:
---------------------- • Applications of data mining in banking: Banks and financial
institutions offer a wide variety of banking services, where data mining
----------------------
can be helpful for following applications:
----------------------  Data collected by data mining in banking
----------------------  Mining customer data of banks
 Loan/credit card approval
----------------------
 Classification and clustering of customers for targeted marketing
----------------------
 Mining for prediction and forecasting
----------------------  Mining for fraud detection
----------------------  Mining for cross-selling banking services
----------------------  Mining for identifying customer preferences
● Data Mining for the Retail Industry
----------------------
Retail industries have large amounts of data on sales, customer shopping
---------------------- history, goods transportation, consumption and service. Therefore, data
---------------------- mining can be used in the following areas:
 Design and construction of data warehouses based on the benefits
----------------------
of data mining

 Identifying the buying patterns of customers Notes
 Finding associations among customers’ demographic characteristics ----------------------
 Predicting the response to mailing campaigns. ----------------------
 Determining the best course of action for customers, behaviour of
customers
----------------------
 Market-basket analysis: With data mining, companies can determine ----------------------

which products to stock in which stores and even how to display
----------------------
products within a store.
● Data Mining for the Telecommunications Industry ----------------------
The telecommunications industry offers telephone services for local and ----------------------
long distances. It also provides cellular phone services, Internet, email
messaging, etc. The following are some of the applications that help the ----------------------
telecommunications industry grow: ----------------------
 Multidimensional analysis of telecommunications data (call,
distance, time and location) used to identify and compare data ----------------------
traffic, system workload, resource usage, user-group behaviour and ----------------------
profit.
 It may be used to identify potentially fraudulent users, usage ----------------------
patterns, attempts to enter customers’ accounts, multidimensional ----------------------
associations and sequential pattern analysis such as the pattern of
usage by month, time and location can promote sales of plans. ----------------------
 Spatiotemporal data mining may become essential to find certain ----------------------
patterns. For example, unusually busy mobile phone traffic at
certain locations may indicate something abnormal happening at ----------------------
those locations.
----------------------
● Data Mining for Biological Data Analysis
Biological data mining has become an important part of a new research
----------------------
field called bioinformatics. Data mining may contribute to biological data ----------------------
analysis in the following aspects:
Alignment, indexing, similarity search and comparative analysis of
----------------------

multiple nucleotide/protein sequences. ----------------------
 Data cleaning, data integration, classification and clustering
----------------------
methods, which will help the integration of biological data and the
construction of data warehouses for biological data analysis. ----------------------
 Finding structural patterns and analyse genetic networks and protein ----------------------
pathways.
 Data mining visualisation tools can help in genetic data analysis. ----------------------
● Data Mining for Medicine ----------------------

 Using it for disease outcome and effectiveness of treatments ----------------------

Notes  Characterising patient behaviour to predict surgery visits
 Finding relationships between diseases
----------------------
 Identifying successful medical therapies for different illnesses.
----------------------
● Data Mining for Intrusion Detection
---------------------- An intrusion can be defined as any set of actions that threaten the integrity,
confidentiality or availability of a network resource. With the growth of
----------------------
the Internet, it becomes critical to find intrusion into the network. Most
---------------------- commercial intrusion detection systems do not provide complete solutions
and are unable to detect new or previously unknown intrusion techniques.
----------------------
Data mining techniques can be further developed and used for intrusion
---------------------- detection in the following ways:
----------------------  Data mining algorithms can be used for misuse detection and
anomaly detection. Anomaly detection builds models of normal
---------------------- network behaviour. It detects new patterns that significantly
---------------------- different from the normal pattern. This new pattern may represent
an intrusion or may add to the usual pattern.
----------------------  Association and correlation analysis can help select and build
---------------------- discriminating attributes.
 Analysis of stream data to find sequential patterns and identify
----------------------
outliers.
----------------------  Data mining visualisation tools are available to view any anomalous
pattern that is detected.
----------------------
 Distributed data mining methods can be used to analyse network data
---------------------- from several network locations in order to detect these distributed
attacks.
----------------------
● Insurance
----------------------  Undertaking claims analysis
----------------------  Predicting which customers will buy new policies
---------------------- ● Web site/store design and promotion

 Finding the affinity of visitors to pages and modifying layout
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. A _______ can be defined as any set of actions that threaten the
integrity, confidentiality or availability of a network resource.
----------------------
----------------------
----------------------

Notes
Activity 1
----------------------
List the applications of Data Mining in an IT organisation. ----------------------
----------------------
----------------------
11.3 DATA MINING SYSTEM PRODUCTS AND
RESEARCH PROTOTYPES ----------------------
Data mining is a young field and many data mining products and tools are
----------------------
available in the market. To select a data mining system that fits your requirement, ----------------------
it is important to have a multidimensional view of data mining systems. The
following are some features to assess a data mining system: ----------------------
Data types ----------------------

Most data mining systems available in the market handle formatted, relational ----------------------
data with numeric, categorical and symbolic features. It is important to check
what type of data your system can handle. Sometimes, some application ----------------------
requires a specialised algorithm to generate a pattern for search. Such a special
requirement may not be handled by a generalised data mining system. ----------------------
System issues ----------------------

A data mining system may not operate on all operating systems. UNIX/ ----------------------
Linux and Microsoft Windows are popular operating systems to run data mining
software. There is data mining software, which also runs on Macintosh, OS/2. ----------------------
A large industry-oriented data mining system adopts client server architecture
to run.
----------------------
Data sources ----------------------

This refers to the specific data formats on which the data mining system ----------------------
works. Some data mining systems require ASCII files. Others work on relational
data or data warehouse data, accessing multiple relational data sources. ----------------------
Data mining functions and methodologies ----------------------

Data mining systems must provide all data mining functionalities such ----------------------
classification, association, clustering, linkage analysis, statistical analysis,
prediction, outlier analysis, similarity search, sequential pattern analysis ----------------------
and visual data mining. Besides, it should also support multiple data mining
functions and multiple methods per function. A good data mining system ----------------------
provides flexibility to users and analysis power. ----------------------
Scalability ----------------------
Data mining systems must be capable of handling data scalable database size
and dimension scalability. ----------------------
----------------------

Notes Visualisation and graphical user interface
Visualisation in data mining supports different categories, such as data
----------------------
visualisation, mining result visualisation, mining process visualisation and
---------------------- visual data mining. Quality and flexibility of visualisation tools and graphical
user interface had great impact on usability, interpretability and attractiveness
---------------------- of a data mining system.
---------------------- 11.3.1 Examples of Commercial Data Mining Systems
Many data mining systems provide only classification or specialise in one
----------------------
data mining function. The following are some examples that provide multiple
---------------------- data mining functions:
---------------------- From database system and graphics system vendors

Microsoft SQL Server 2005 is also a database management system,
---------------------- which provides multiple data mining functions and data warehouse system
---------------------- environments. It includes association mining, classification, regression trees,
sequence clustering and time-series analysis.
---------------------- Mine Set, available from Purple Insight, was introduced by SGI in 1999.
---------------------- It provides multiple data mining functions, including association mining and
classification, as well as advanced statistics and visualisation tools.
----------------------
Originating from the machine learning community
---------------------- Weak, developed at the University of Waikato in New Zealand, is open-
source data mining software in Java. It contains a collection of algorithms
----------------------
including data pre-processing, association mining, classification, regression,
---------------------- clustering and visualisation.

----------------------
Fill in the Blanks.
----------------------
1. Weka is an open-source data mining software developed in ________
---------------------- programming language.
---------------------- 2. ______ software available from Salford Systems creates decision
trees for classification and regression trees for prediction.
----------------------
---------------------- Activity 2
----------------------
List the features of C5.0 and CART.
----------------------
---------------------- 11.4 ADDITIONAL THEMES ON DATA MINING

---------------------- There are various data mining tools and techniques available. There
---------------------- are several theories available as the basis of data mining, which include the
following:
165
Data Reduction Notes
This theory is related to reducing data representation. It helps get a quick ----------------------
response to queries. There are reduction techniques such as singular value
decomposition, wavelets, regression, log-linear models, histograms, clustering, ----------------------
sampling and the construction of index trees.
----------------------
Data Compression
----------------------
The basis of this theory is to compress data by encoding in bits, decision
trees or clusters, etc. The encoding, which is based on the minimum description ----------------------
length principle, states that the “best” theory to infer from a set of data is the one
that minimises the length of the theory and the length of the data when encoded, ----------------------
using the theory as a predictor for the data. ----------------------
Pattern Discovery
----------------------
The basis of this theory is to discover patterns occurring in the database.
Areas such as machine learning, neural network, association mining, sequential ----------------------
pattern mining, clustering and several other subfields contribute to this theory.
----------------------
Probability Theory
In this theory, the basis of data mining is to discover joint probability
----------------------
distributions. Bayesian belief networks or hierarchical Bayesian models are ----------------------
used in finding probability.
----------------------
Microeconomic View
This theory is to find an interesting pattern that can be used for decision ----------------------
making. This view is one of utility, in which patterns are considered interesting ----------------------
if they can be acted on.
----------------------
Statistical Data Mining
We have discussed classification, association and clustering techniques ----------------------
to handling multidimensional and complex types of huge data. Also, there
are statistical techniques available to handle huge amounts of data such as
----------------------
regression, generalised linear models, analysis of variance, mixed-effect ----------------------
models, factor analysis, discriminant analysis, data mining and, collaborative
filtering and time series analysis. ----------------------
Data Mining and Collaborative Filtering ----------------------

Today, most people buy products via the Internet. In e-commerce, there
----------------------
are systems available in the market, which help customers by recommending
products live. Collaborative filtering is an approach wherein products are ----------------------
recommended to customers based on the opinion of other customers. Applying
data mining or statistical techniques to collaborative filtering may help search ----------------------
similarities among customer approach.
----------------------
Visual and Audio Data Mining
----------------------
Visual representation is always better and easy to interpret. Visual data
mining discovers useful knowledge from the data set and represents it. It is an ----------------------

Notes effective tool for the comprehension of data distributions, patterns, clusters and
outliers in data. Audio data mining uses audio signals to indicate patterns of
---------------------- data or the features of data mining results.
----------------------
----------------------
1. ______ theory is used to discover patterns occurring in the database.
----------------------
2. Audio data mining uses _______ signals to indicate the patterns of
---------------------- data or the features of data mining results.
----------------------
---------------------- 11.5 SOCIAL IMPACTS OF DATA MINING
---------------------- Data mining is present in many aspects of our daily lives. It affects us with
regard to information retrieval, search, shop, time, etc.
----------------------
Data mining is used by marketing companies to find customer behaviour
---------------------- patterns. Your information may get collected when you use your credit
card, debit card, supermarket loyalty card or frequent flyer card, when you
---------------------- surf the Web, reply to an Internet newsgroup, subscribe to a magazine, etc.
---------------------- Advertisements and promotional material is being sent to customer email IDs
to target customers.
----------------------
Web-wide tracking is a technology that tracks a user across each site a user
---------------------- visits. This information can be used by marketers.
---------------------- Purpose Specification and Use Limitation

While collecting data, the purpose must be specified for which data is being
---------------------- collected and the data should be specified at the time of collection. Besides, the
data collected should not exceed the stated purpose.
----------------------
Openness
----------------------
Individuals have the right to know what information is collected about them,
---------------------- who has access to the data and how the data is being used.
---------------------- Security Safeguards

Personal data must be protected by security safeguards against such risks as
----------------------
loss or unauthorised access, destruction, use, modification or disclosure of data.
---------------------- Individual Participation
---------------------- An individual has the right to know whether data is collected and if it is, then
what that data is. The individual may also challenge such data. If the challenge
---------------------- is successful, the individual has the right to have the data erased, corrected or
completed.
----------------------
Data security-enhancing techniques such as blind signatures, biometric
---------------------- encryption, anonymous databases, etc., are developed to protect data.

Notes
1. Individuals have the right to know what information is collected ----------------------

about them. This property is called _________.
----------------------
2. A ___________ established a set of international guidelines, referred
to as fair information practices. ----------------------
----------------------
11.6 TRENDS IN DATA MINING ----------------------
Data mining is a diverse field and to deal with diverse data, puts challenging ----------------------
research issues in data mining.
----------------------
Application Exploration
In the early days, data mining was used for only business-specific ----------------------
requirements. It was used in e-commerce and marketing. Nowadays, data ----------------------
mining is increasingly being used in other areas, such as financial analysis,
telecommunications, biomedicine and science. A generic data mining ----------------------
system has its limitations when it comes to dealing with application-specific
problems. There is a need to design systems, which can handle all types of ----------------------
business requirements. Nowadays, we see a trend towards the development of ----------------------
application-specific data mining systems.
Scalable and interactive data mining methods
----------------------
The traditional data analysis method fails to handle huge data efficiently. ----------------------
Data mining handles huge amounts of data efficiently. But there is a need of
data mining algorithms, which can handle incremental data efficiently. ----------------------
Integration of data mining with database systems, data warehouse systems ----------------------
and Web database systems
----------------------
It is required to smoothly integrate data mining systems with databases and
data warehouse databases. ----------------------
It ensures data mining portability, data availability, scalability, high ----------------------
performance and an integrated information-processing environment for
multidimensional data analysis and exploration. ----------------------
Standardisation of Data Mining Language ----------------------

Standardisation of data mining language will help improve interoperability ----------------------
among multiple data mining systems and functions. Microsoft’s OLE DB and
CRISP-DM are examples of standardisation of data mining processes. ----------------------
Visual Data Mining ----------------------
Visual data mining is an effective way to represent output. There is a
----------------------

Notes requirement of development of visual data mining techniques to promote the
use of visual data mining as a tool for data analysis.
----------------------
Distributed Data Mining
---------------------- Traditional data mining methods do not work on distributed work
environment. It is expected to develop a system, which fits into a distributed
----------------------
work environment.
----------------------
Multi-relational and Multi-database Data Mining
---------------------- Most data mining techniques available today work well to search patterns in a
single relational table. Data available in the real world is spread across multiple
---------------------- tables and is multi-relational data. It is expected in effective and efficient data
---------------------- mining across multiple relations and multiple databases.
Privacy Protection and Information Security in Data Mining
----------------------
Growing interest in data mining and Web mining had added the threat to
---------------------- privacy and information security. Therefore, further development of privacy-
preserving formalism is required to prove privacy-preservation in data mining.
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. ________ is a diverse field.
---------------------- 2. ________ of data mining language will help improve interoperability
among multiple data mining systems.
----------------------
----------------------
Summary
----------------------
 Many customised data mining tools have been developed for
---------------------- domain-specific applications, including finance,the retail industry,
---------------------- telecommunications, bioinformatics, intrusion detection and other
science, engineering and government data analysis.
----------------------
Researchers have been striving to build theoretical foundations for

data mining. Several interesting proposals have appeared, based on
---------------------- data reduction, data compression, pattern discovery, probability theory,
microeconomic theory and inductive databases.
----------------------
Several well-established statistical methods have been proposed for data
---------------------- 
analysis.
---------------------- Visual data mining integrates data mining and data visualisation in order
to discover implicit and useful knowledge from large data sets. Audio
 data mining uses audio signals to indicate data patterns or features of data
---------------------- mining results.

 Collaborative recommender systems offer personalised product Notes
recommendations based on the opinions of other customers.
----------------------
 Trends in data mining include further efforts toward the exploration of
new application areas and improved scalable and interactive methods. ----------------------
----------------------
Keywords
----------------------
 Visual data mining: Visual data mining is an effective way to represent
output. There is a requirement of the development of visual data mining
----------------------
techniques to promote the use of visual data mining as a tool for data ----------------------
analysis.
 Data sources: This refers to specific data formats on which the data
----------------------
mining system works. Some data mining systems require ASCII files. ----------------------
 Audio data mining: Audio data mining uses audio signals to indicate the ----------------------
patterns of data or the features of data mining results.
 Openness: Individuals have the right to know what information is ----------------------
collected about them, who have access to the data and how the data is ----------------------
being used.
----------------------
1. Describe the different forms of data mining, which can be used in various ----------------------
applications.
----------------------
2. Describe the main theoretical foundations that have been proposed for
data mining. ----------------------
3. Explain trends in data mining. ----------------------

1. An intrusion can be defined as any set of actions that threatens the integrity, ----------------------
confidentiality or availability of a network resource.
----------------------

1. Weka is open-source data mining software developed in Java programming ----------------------
language.
----------------------
2. CART, (Classification and Regression Trees) software available from
Salford Systems creates decision trees for classification and regression ----------------------
trees for prediction.
----------------------


1. Pattern discovery theory is used to discover patterns occurring in the
---------------------- database.
---------------------- 2. Audio data mining uses audio signals to indicate the patterns of data or the
features of data mining results.
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. Individuals have the right to know what information is collected about
them. This property is called openness.
----------------------
2. The Organisation for Economic Co-operation and Development (OECD)
---------------------- established a set of international guidelines, referred to as fair information
practices.
----------------------
----------------------
Fill in the Blanks.
---------------------- 1. Data mining is a diverse field.
---------------------- 2. Standardisation of data mining language will help improve interoperability
among multiple data mining systems.
----------------------
---------------------- 1. Han, Jiawei, Micheline Kamber. Data Mining: Concepts and Techniques.
---------------------- 2. Tan, Pang-Ning, Vipin Kumar, Michael Steinnach. Introduction to Data
Mining.
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------

Big Datawith Data Warehousingand Data Mining NEW020

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Big Datawith Data Warehousingand Data Mining NEW020

Uploaded by

Copyright:

Available Formats

BIG DATA WITH

(FOR PRIVATE CIRCULATION ONLY)

1.1 DATA AND BIG DATA

2 Big Data with Data Warehousing & Data Mining

1.2 CHARACTERISTICS OF BIG DATA – Vs OF BIG DATA

Velocity Big data Variety

Fig. 1. 5 Vs of Big Data

4 Big Data with Data Warehousing & Data Mining

6 Big Data with Data Warehousing & Data Mining

Table 1: Differences among the three types of Big data

Structured data Unstructured data Semi-structured data

It is based on a relational It is based on character and It is based on XML syntax and

It is commonly stored in It is commonly stored in data It is commonly stored in data

It follows a predefined It follows a native format to It follows a native format to store.

It follows a matured It does not follow any It follows the transaction

It is very difficult to It is very scalable. Scaling is simpler than structured

8 Big Data with Data Warehousing & Data Mining

1.5 BIG DATA TECHNOLOGY

1. Operational Big Data Technologies

Few examples are as follows:

10 Big Data with Data Warehousing & Data Mining

Check your progress

1. The main principle of Hadoop framework is ________.

12 Big Data with Data Warehousing & Data Mining

1.7 BENEFITS OF BIG DATA

14 Big Data with Data Warehousing & Data Mining

1.8 APPLICATIONS OF BIG DATA IN INDUSTRY

15 Big Data with Data Warehousing & Data Mining

16 Big Data with Data Warehousing & Data Mining

3. List the benefits of big data.

4. Describe the role of Hadoop in big data storing and processing.

5. Describe the differences among the different types of big data.

6. Give a detailed note on processing of big data.

Answers to Check your Progress

17 Big Data with Data Warehousing & Data Mining

Introduction to Data Warehouse 18

19 Big Data with Data Warehousing & Data Mining

Introduction to Data Warehouse 20

---------------------- ● Integrated Data

Introduction to Data Warehouse 22

23 Big Data with Data Warehousing & Data Mining

Introduction to Data Warehouse 24

25 Big Data with Data Warehousing & Data Mining

The typical extract-transform-load (ETL)-based data warehouse uses staging, ----------------------

Introduction to Data Warehouse 26

---------------------- 2.3 DIFFERENCE BETWEEN OLTP AND DATA

---------------------- Schema Design: To optimize query and analytical performance, data

---------------------- Check your Progress 1

27 Big Data with Data Warehousing & Data Mining

Architecture of data warehouse depends on requirement analysis of an ----------------------

Introduction to Data Warehouse 28

29 Big Data with Data Warehousing & Data Mining

Multiple Choice Single Response.

1. A data warehouse is said to contain a ‘subject-oriented’ collection of ----------------------

iv. It is a generalization of ‘object-oriented’ approach. ----------------------

Fill in the Blanks. ----------------------

Introduction to Data Warehouse 30

---------------------- ● Integrated: Combining data from diverse sources

---------------------- i. It is a collection of data derived from multiple sources.

31 Big Data with Data Warehousing & Data Mining

33 Big Data with Data Warehousing & Data Mining

Data Warehouse Architecture 34