
Data Mining - Decision Tree Induction

What is decision tree in data mining and How it works?


A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The general motive of using a decision tree is to create a training model which can be used to predict the class or value of target variables by learning decision rules inferred from prior data (training data). Decision trees are easier to understand than most other classification algorithms.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm


A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. He later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
    Data partition D, which is a set of training tuples
    and their associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the
    splitting criterion that best partitions the data
    tuples into individual classes. This criterion includes a
    splitting_attribute and either a split point or a splitting subset.

Output:
    A decision tree

Method:
    create a node N;

    if tuples in D are all of the same class C then
        return N as a leaf node labeled with class C;

    if attribute_list is empty then
        return N as a leaf node labeled with
        the majority class in D;                 // majority voting

    apply Attribute_selection_method(D, attribute_list)
        to find the best splitting_criterion;
    label node N with splitting_criterion;

    if splitting_attribute is discrete-valued and
        multiway splits allowed then             // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;   // remove splitting_attribute

    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;   // a partition
        if Dj is empty then
            attach a leaf labeled with the majority
            class in D to node N;
        else
            attach the node returned by
            Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
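
To make the pseudocode above concrete, here is a minimal Python sketch of top-down, recursive decision tree induction. It uses information gain (entropy) as a stand-in for Attribute_selection_method; the function names and the dictionary-based tree structure are illustrative assumptions, not part of any particular library.

from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # Pick the attribute whose split yields the largest information gain
    base = entropy(labels)
    def info_gain(attr):
        splits = {}
        for row, label in zip(rows, labels):
            splits.setdefault(row[attr], []).append(label)
        remainder = sum(len(part) / len(labels) * entropy(part)
                        for part in splits.values())
        return base - remainder
    return max(attributes, key=info_gain)

def generate_decision_tree(rows, labels, attributes):
    # All tuples belong to the same class: return a leaf labeled with that class
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return a leaf labeled with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value in {row[attr] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        # Grow a subtree for each outcome of the splitting criterion
        tree[attr][value] = generate_decision_tree(sub_rows, sub_labels, remaining)
    return tree

Called with rows as a list of attribute-value dictionaries and labels as the matching class labels, it returns either a class label (a leaf) or a nested dictionary keyed by the chosen splitting attribute.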

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches


There are two approaches to prune a tree −
 Pre-pruning − The tree is pruned by halting its construction early.
 Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters (a simple way of combining them is sketched below) −

 Number of leaves in the tree, and
 Error rate of the tree.
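
As a hedged sketch of how these two parameters are often combined, the function below scores a subtree as its error rate plus a penalty per leaf; the weighting parameter alpha is an assumption introduced for this example, not something specified in the text.

def cost_complexity(error_rate, num_leaves, alpha=0.01):
    # Lower scores are better: a pruned subtree wins if the drop in leaf count
    # outweighs the rise in error rate. alpha is an assumed trade-off weight.
    return error_rate + alpha * num_leaves

# Compare a full tree with a pruned candidate (illustrative numbers only)
full = cost_complexity(error_rate=0.10, num_leaves=40)
pruned = cost_complexity(error_rate=0.12, num_leaves=12)
print(full, pruned)  # keep whichever subtree has the smaller score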
Strengths and Weaknesses of the Decision Tree Approach
The strengths of decision tree methods are:
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods are:
 Decision trees are less appropriate for estimation tasks where the goal is to
predict the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and
a relatively small number of training examples.
 Decision trees can be computationally expensive to train. At each node, each candidate
splitting field must be sorted before its best split can be found. In some
algorithms, combinations of fields are used and a search must be made for optimal
combining weights. Pruning algorithms can also be expensive since many
candidate sub-trees must be formed and compared.
 Construction of Decision Tree :
A tree can be "learned" by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning. The recursion is completed when
all the tuples in the subset at a node have the same value of the target variable, or when
splitting no longer adds value to the predictions. The construction of a decision
tree classifier does not require any domain knowledge or parameter setting, and
is therefore appropriate for exploratory knowledge discovery. Decision trees can
handle high-dimensional data. In general, decision tree classifiers have good
accuracy. Decision tree induction is a typical inductive approach to learning
knowledge for classification.
 Decision Tree Representation :
Decision trees classify instances by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance. An instance is
classified by starting at the root node of the tree, testing the attribute specified
by this node, then moving down the tree branch corresponding to the value of
the attribute, as shown in the figure above. This process is then repeated for the
subtree rooted at the new node.
 The decision tree in the figure above classifies a particular morning according to
whether it is suitable for playing tennis, returning the classification associated
with the particular leaf (in this case Yes or No).
For example, the instance
 (Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong)
 would be sorted down the corresponding branch of this decision tree and would
therefore be classified as a negative instance.
 In other words, we can say that a decision tree represents a disjunction of
conjunctions of constraints on the attribute values of instances:
 (Outlook = Sunny ^ Humidity = Normal) v (Outlook = Overcast) v (Outlook =
Rain ^ Wind = Weak)
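
The traversal described above can be sketched in a few lines of Python. The tree below encodes the standard PlayTennis example as nested dictionaries; the exact branch layout is an assumption reconstructed from the disjunction shown above, since the original figure is not reproduced here.

# The PlayTennis tree as nested dicts: internal nodes test an attribute,
# leaves hold the class label ("Yes" / "No").
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    # Walk from the root, following the branch that matches the attribute value
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][instance[attribute]]
    return tree

instance = {"Outlook": "Rain", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))  # -> "No" (a negative instance)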
Data Warehouse Architecture, Concepts and Components

What is Data warehouse?

A data warehouse is an information system that contains historical and cumulative data
from single or multiple sources. It simplifies the reporting and analysis processes of the
organization.

It is also a single version of truth for any company for decision making and forecasting.

Characteristics of Data warehouse

A data warehouse has the following characteristics:

 Subject-Oriented
 Integrated
 Time-variant
 Non-volatile

Subject-Oriented

A data warehouse is subject-oriented as it offers information regarding a theme instead
of a company's ongoing operations. These subjects can be sales, marketing,
distribution, etc.

A data warehouse never focuses on ongoing operations. Instead, it puts emphasis on
modeling and analysis of data for decision making. It also provides a simple and
concise view around the specific subject by excluding data which is not helpful to support
the decision process.

Integrated

In a data warehouse, integration means the establishment of a common unit of measure
for all similar data from dissimilar databases. The data also needs to be stored in the
data warehouse in a common and universally acceptable manner.

A data warehouse is developed by integrating data from varied sources like a
mainframe, relational databases, flat files, etc. Moreover, it must keep consistent
naming conventions, formats, and coding.

This integration helps in effective analysis of data. Consistency in naming conventions,
attribute measures, encoding structures, etc. has to be ensured. Consider the following
example:

In the above example, there are three different applications labeled A, B, and C. The
information stored in these applications is Gender, Date, and Balance. However, each
application's data is stored in a different way.

 In Application A, the gender field stores logical values like M or F.
 In Application B, the gender field is a numerical value.
 In Application C, the gender field is stored in the form of a character value.
 The same is the case with Date and Balance.

However, after the transformation and cleaning process, all this data is stored in a common
format in the data warehouse.
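
As a small, hypothetical sketch of that cleaning step, the mapping below converts the three applications' differing gender encodings into one warehouse format; the numeric and character codes assumed for Applications B and C are invented for illustration.

# Each source application encodes gender differently; the warehouse keeps one format.
GENDER_MAPPINGS = {
    "app_a": {"M": "Male", "F": "Female"},   # logical values M / F
    "app_b": {0: "Male", 1: "Female"},       # numeric codes (assumed)
    "app_c": {"m": "Male", "f": "Female"},   # character values (assumed)
}

def standardize_gender(source_system, raw_value):
    # Translate a source-specific code into the common warehouse value
    return GENDER_MAPPINGS[source_system][raw_value]

print(standardize_gender("app_b", 1))  # -> "Female"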

Time-Variant

The time horizon for a data warehouse is quite extensive compared with operational
systems. The data collected in a data warehouse is recognized with a particular period
and offers information from a historical point of view. It contains an element of time,
explicitly or implicitly.

One place where data warehouse data displays time variance is in the structure
of the record key. Every primary key contained within the DW should have, either
implicitly or explicitly, an element of time, such as the day, week, or month.

Another aspect of time variance is that once data is inserted in the warehouse, it can't
be updated or changed.
Non-volatile

A data warehouse is also non-volatile, meaning the previous data is not erased when new
data is entered into it.

Data is read-only and periodically refreshed. This also helps to analyze historical data
and understand what happened and when. It does not require transaction processing,
recovery, or concurrency control mechanisms.

Activities like delete, update, and insert, which are performed in an operational
application environment, are omitted in the data warehouse environment. Only two types of
data operations are performed in data warehousing:

1. Data loading
2. Data access

Here are some major differences between an operational application and a data warehouse:

Operational Application: Complex programs must be coded to make sure that data upgrade
processes maintain high integrity of the final product.
Data Warehouse: This kind of issue does not arise because data update is not performed.

Operational Application: Data is placed in a normalized form to ensure minimal redundancy.
Data Warehouse: Data is not stored in normalized form.

Operational Application: The technology needed to support issues of transactions, data
recovery, rollback, and deadlock resolution is quite complex.
Data Warehouse: It offers relative simplicity in technology.

Data Warehouse Architectures

There are mainly three types of data warehouse architectures:

Single-tier architecture

The objective of a single layer is to minimize the amount of data stored. This goal is
achieved by removing data redundancy. This architecture is not frequently used in practice.

Two-tier architecture

Two-tier architecture physically separates the available sources from the data warehouse. This
architecture is not expandable and does not support a large number of end-users. It
also has connectivity problems because of network limitations.

Three-tier architecture

This is the most widely used architecture.

It consists of the Top, Middle and Bottom Tier.

1. Bottom Tier: The database of the data warehouse serves as the bottom tier. It
is usually a relational database system. Data is cleansed, transformed, and
loaded into this layer using back-end tools.
2. Middle Tier: The middle tier in a data warehouse is an OLAP server which is
implemented using either the ROLAP or MOLAP model. For a user, this application
tier presents an abstracted view of the database. This layer also acts as a
mediator between the end-user and the database.
3. Top Tier: The top tier is a front-end client layer. It holds the tools and APIs
that you use to connect to and get data out of the data warehouse. These could be query
tools, reporting tools, managed query tools, analysis tools and data mining
tools.

Datawarehouse Components
The data warehouse is based on an RDBMS server, which is a central information
repository surrounded by some key components that make the entire environment
functional, manageable and accessible.

There are mainly five components of Data Warehouse:

Data Warehouse Database

The central database is the foundation of the data warehousing environment. This
database is implemented on RDBMS technology. However, this kind of
implementation is constrained by the fact that a traditional RDBMS is optimized
for transactional database processing and not for data warehousing. For instance, ad-hoc
queries, multi-table joins, and aggregates are resource intensive and slow down
performance.

Hence, the alternative approaches listed below are used:

 In a data warehouse, relational databases are deployed in parallel to allow for
scalability. Parallel relational databases also allow shared-memory or shared-nothing
models on various multiprocessor configurations or massively parallel processors.
 New index structures are used to bypass relational table scans and improve
speed.
 Multidimensional databases (MDDBs) are used to overcome limitations imposed
by the relational data model. Example: Essbase from Oracle.

Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)

The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and changes needed to transform data into a
unified format in the data warehouse. They are also called Extract, Transform and Load
(ETL) tools.

Their functionality includes:

 Anonymizing data as per regulatory stipulations.
 Eliminating unwanted data in operational databases from loading into the data
warehouse.
 Searching for and replacing common names and definitions for data arriving from
different sources.
 Calculating summaries and derived data.
 Populating missing data with defaults.
 De-duplicating repeated data arriving from multiple data sources.

These Extract, Transform, and Load tools may generate cron jobs, background jobs,
Cobol programs, shell scripts, etc. that regularly update data in the data warehouse. These
tools are also helpful for maintaining the metadata.

These ETL tools have to deal with the challenges of database and data heterogeneity.
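
The following is a minimal, hypothetical sketch of such an ETL job in Python using the standard sqlite3 module; the table names, columns, and cleaning rules are placeholders rather than references to any specific tool, and the source and warehouse tables are assumed to exist already.

import sqlite3

def extract(source_conn):
    # Pull raw rows from an operational source table (placeholder query)
    return source_conn.execute("SELECT cust_id, name, balance FROM customers").fetchall()

def transform(rows):
    # Apply simple cleaning rules: default missing names, coerce balances to float
    cleaned = []
    for cust_id, name, balance in rows:
        cleaned.append((cust_id, (name or "UNKNOWN").strip().upper(), float(balance or 0.0)))
    return cleaned

def load(dw_conn, rows):
    # Append the unified rows into a warehouse staging table
    dw_conn.executemany(
        "INSERT INTO dw_customers (cust_id, name, balance) VALUES (?, ?, ?)", rows)
    dw_conn.commit()

def run_etl(source_conn, dw_conn):
    load(dw_conn, transform(extract(source_conn)))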
Metadata

The name metadata suggests some high-level technological concept. However, it is
quite simple. Metadata is data about data which defines the data warehouse. It is used
for building, maintaining and managing the data warehouse.

In the data warehouse architecture, metadata plays an important role as it specifies
the source, usage, values, and features of data warehouse data. It also defines how
data can be changed and processed. It is closely connected to the data warehouse.

For example, a line in a sales database may contain:

4030 KJ732 299.90

This is meaningless data until we consult the metadata, which tells us it is:

 Model number: 4030
 Sales Agent ID: KJ732
 Total sales amount: $299.90

Therefore, metadata is an essential ingredient in the transformation of data into
knowledge.

Metadata helps to answer the following questions

 What tables, attributes, and keys does the Data Warehouse contain?
 Where did the data come from?
 How many times do data get reloaded?
 What transformations were applied with cleansing?

Metadata can be classified into the following categories:

1. Technical Metadata: This kind of metadata contains information about the
warehouse which is used by data warehouse designers and administrators.
2. Business Metadata: This kind of metadata contains detail that gives end-users
an easy way to understand the information stored in the data warehouse.
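
As a simple illustration, technical metadata for one warehouse table might be recorded as a plain mapping like the one below; every field and value here is invented for the example.

# Technical metadata describing one warehouse table: where it came from,
# how it was transformed, and when it was last loaded.
sales_fact_metadata = {
    "table": "sales_fact",
    "source_systems": ["billing_db", "crm_export"],
    "transformations": ["currency converted to USD", "weekly totals aggregated to monthly"],
    "last_loaded": "2019-06-30T02:00:00",
    "refresh_schedule": "nightly",
    "owner": "dw_admin",
}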

Query Tools

One of the primary objectives of data warehousing is to provide information to businesses
to make strategic decisions. Query tools allow users to interact with the data warehouse
system.

These tools fall into four different categories:

1. Query and reporting tools


2. Application Development tools
3. Data mining tools
4. OLAP tools
1. Query and reporting tools:

Query and reporting tools can be further divided into

 Reporting tools
 Managed query tools

Reporting tools: Reporting tools can be further divided into production reporting tools
and desktop report writers.

1. Report writers: This kind of reporting tool is designed for end-users for
their own analysis.
2. Production reporting: This kind of tool allows organizations to generate regular
operational reports. It also supports high-volume batch jobs like printing and
calculating. Some popular reporting tools are Brio, Business Objects, Oracle,
PowerSoft, and SAS Institute.

Managed query tools:

This kind of access tool helps end users get around snags in SQL and the
database structure by inserting a meta-layer between users and the database.

2. Application development tools:

Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of
an organization. In such cases, custom reports are developed using Application
development tools.

3. Data mining tools:

Data mining is a process of discovering meaningful new correlations, patterns, and trends
by mining large amounts of data. Data mining tools are used to automate this process.

4. OLAP tools:

These tools are based on the concepts of a multidimensional database. They allow users to
analyse the data using elaborate and complex multidimensional views.

Data warehouse Bus Architecture

The data warehouse bus determines the flow of data in your warehouse. The data flow in a
data warehouse can be categorized as inflow, upflow, downflow, outflow and meta
flow.

While designing a data bus, one needs to consider the shared dimensions and facts across
data marts.
Data Marts

A data mart is an access layer which is used to get data out to the users. It is
presented as an option to a large data warehouse as it takes less time and money to
build. However, there is no standard definition of a data mart; it differs from person to
person.

In simple words, a data mart is a subsidiary of a data warehouse. The data mart is used
to partition data for a specific group of users.

Data marts can be created in the same database as the data warehouse or in a
physically separate database.

Data warehouse Architecture Best Practices

To design a data warehouse architecture, you need to follow the best practices given below:

 Use a data model which is optimized for information retrieval; this can be a
dimensional model, a denormalized model, or a hybrid approach.
 Ensure that data is processed quickly and accurately. At the same time,
take an approach which consolidates data into a single version of the
truth.
 Carefully design the data acquisition and cleansing processes for the data warehouse.
 Design a metadata architecture which allows sharing of metadata between
components of the data warehouse.
 Consider implementing an ODS model when information retrieval needs are near
the bottom of the data abstraction pyramid or when there are multiple
operational sources required to be accessed.
 Make sure that the data model is integrated and not just
consolidated. In that case, consider a 3NF data model. It is also ideal
for acquiring ETL and data cleansing tools.

Summary:

 A data warehouse is an information system that contains historical and
cumulative data from single or multiple sources.
 A data warehouse is subject-oriented as it offers information regarding a subject
instead of an organization's ongoing operations.
 In a data warehouse, integration means the establishment of a common unit of
measure for all similar data from different databases.
 A data warehouse is also non-volatile, meaning the previous data is not erased when
new data is entered into it.
 A data warehouse is time-variant as the data in a DW has a high shelf life.
 There are 5 main components of a data warehouse: 1) Database 2) ETL Tools 3)
Metadata 4) Query Tools 5) Data Marts.
 There are four main categories of query tools: 1. Query and reporting tools 2.
Application development tools 3. Data mining tools 4. OLAP tools.
 The data sourcing, transformation, and migration tools are used for performing
all the conversions and summarizations.
 In the data warehouse architecture, metadata plays an important role as it
specifies the source, usage, values, and features of data warehouse data.

Data Mining Tutorial: Process, Techniques, Tools, EXAMPLES

What is Data Mining?

Data mining is looking for hidden, valid, and potentially useful patterns in huge data
sets. Data mining is all about discovering unsuspected/previously unknown
relationships amongst the data.

It is a multi-disciplinary skill that uses machine learning, statistics, AI and database
technology.

The insights derived via data mining can be used for marketing, fraud detection,
scientific discovery, etc.

Data mining is also called knowledge discovery, knowledge extraction, data/pattern
analysis, information harvesting, etc.

Types of Data

Data mining can be performed on the following types of data:

 Relational databases
 Data warehouses
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Transactional and Spatial databases
 Heterogeneous and legacy databases
 Multimedia and streaming database
 Text databases
 Text mining and Web mining

Data Mining Implementation Process

The data mining implementation process is described in detail below.

Business understanding:

In this phase, business and data-mining goals are established.

 First, you need to understand the business and client objectives. You need to define
what your client wants (which many times even they do not know themselves).
 Take stock of the current data mining scenario. Factor resources, assumptions,
constraints, and other significant factors into your assessment.
 Using the business objectives and the current scenario, define your data mining goals.
 A good data mining plan is very detailed and should be developed to accomplish
both business and data mining goals.

Data understanding:

In this phase, a sanity check on the data is performed to check whether it is appropriate for
the data mining goals.

 First, data is collected from the multiple data sources available in the organization.
 These data sources may include multiple databases, flat files or data cubes.
Issues like object matching and schema integration can arise
during the data integration process. It is a quite complex and tricky process as data
from various sources is unlikely to match easily. For example, table A contains an
entity named cust_no whereas another table B contains an entity named cust-id.
 Therefore, it is quite difficult to ensure whether both of these objects refer to
the same entity. Metadata should be used here to reduce errors in the
data integration process.
 Next, the step is to search for properties of the acquired data. A good way to explore
the data is to answer the data mining questions (decided in the business phase)
using query, reporting, and visualization tools.
 Based on the results of the queries, the data quality should be ascertained. Missing
data, if any, should be acquired.

Data preparation:

In this phase, data is made production ready.

The data preparation process consumes about 90% of the time of the project.

The data from different sources should be selected, cleaned, transformed, formatted,
anonymized, and constructed (if required).

Data cleaning is a process to "clean" the data by smoothing noisy data and filling in
missing values.

For example, in a customer demographics profile, age data may be missing. The data is
incomplete and should be filled in. In some cases, there could be data outliers. For
instance, age has a value of 300. Data could be inconsistent. For instance, the name of the
customer is different in different tables.

Data transformation operations change the data to make it useful for data mining.
The following transformations can be applied.
Data transformation:

Data transformation operations contribute toward the success of the mining
process.

Smoothing: Helps to remove noise from the data.

Aggregation: Summary or aggregation operations are applied to the data, i.e., the
weekly sales data is aggregated to calculate the monthly and yearly totals.

Generalization: In this step, low-level data is replaced by higher-level concepts with
the help of concept hierarchies. For example, the city is replaced by the county.

Normalization: Normalization is performed when the attribute data are scaled up or
scaled down. Example: data should fall in the range -2.0 to 2.0 post-normalization.

Attribute construction: New attributes are constructed and added to the given set of
attributes to help the data mining process.

The result of this process is a final data set that can be used in modeling.
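
A brief sketch of these transformation steps, assuming the pandas library is available; the sales figures and the scaling range are illustrative only.

import pandas as pd

sales = pd.DataFrame({
    "week_ending": pd.date_range("2019-01-06", periods=8, freq="W"),
    "units": [120, 135, 90, 310, 140, 150, 125, 160],   # 310 looks like a noisy spike
})

# Smoothing: a rolling mean dampens noisy values
sales["units_smoothed"] = sales["units"].rolling(window=3, min_periods=1).mean()

# Aggregation: weekly figures rolled up to monthly totals
monthly = sales.set_index("week_ending")["units"].resample("M").sum()

# Normalization: min-max scaling into the range -2.0 to 2.0
u = sales["units"]
sales["units_scaled"] = (u - u.min()) / (u.max() - u.min()) * 4.0 - 2.0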

Modelling

In this phase, mathematical models are used to determine data patterns.

 Based on the business objectives, suitable modeling techniques should be
selected for the prepared dataset.
 Create a scenario to test and check the quality and validity of the model.
 Run the model on the prepared dataset.
 The results should be assessed by all stakeholders to make sure that the model can
meet the data mining objectives.

Evaluation:

In this phase, patterns identified are evaluated against the business objectives.

 Results generated by the data mining model should be evaluated against the
business objectives.
 Gaining business understanding is an iterative process. In fact, new business
requirements may be raised because of what data mining uncovers.
 A go or no-go decision is taken to move the model in the deployment phase.

Deployment:

In the deployment phase, you ship your data mining discoveries to everyday business
operations.
 The knowledge or information discovered during the data mining process should be
made easy to understand for non-technical stakeholders.
 A detailed deployment plan for shipping, maintenance, and monitoring of the data
mining discoveries is created.
 A final project report is created with lessons learned and key experiences from
the project. This helps to improve the organization's business policy.

Data Mining Techniques

1. Classification:

This analysis is used to retrieve important and relevant information about data and
metadata. This data mining method helps to classify data into different classes.

2. Clustering:

Clustering analysis is a data mining technique to identify data items that are similar to each other.
This process helps to understand the differences and similarities between the data.
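
As a hedged illustration (assuming the scikit-learn library is installed), the snippet below groups a toy set of customers by two numeric attributes; the data and the choice of two clusters are assumptions made for the example.

from sklearn.cluster import KMeans
import numpy as np

# Two numeric attributes per customer (e.g., age and monthly spend) -- toy data
customers = np.array([
    [22, 150], [25, 180], [27, 160],    # younger, low spend
    [48, 900], [52, 950], [50, 880],    # older, high spend
])

# Group similar customers together; k=2 is an assumption for this toy set
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(labels)  # e.g. [0 0 0 1 1 1] -- customers in the same cluster resemble each other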

3. Regression:

Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific variable,
given the presence of other variables.

4. Association Rules:

This data mining technique helps to find associations between two or more items. It
discovers hidden patterns in the data set.
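
The core idea can be shown with plain Python: for a toy set of market-basket transactions, the support of an itemset and the confidence of a candidate rule are computed below (the items and the rule are invented for illustration).

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, how many also contain the consequent
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 0.4
print(confidence({"diapers"}, {"beer"}))  # ~0.67
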
5. Outlier detection:

This type of data mining technique refers to the observation of data items in the dataset
which do not match an expected pattern or behavior. This technique can be
used in a variety of domains, such as intrusion detection, fraud detection, or fault detection.
Outlier detection is also called outlier analysis or outlier mining.

6. Sequential Patterns:

This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.

7. Prediction:

Prediction uses a combination of the other data mining techniques, such as trends,
sequential patterns, clustering, and classification. It analyzes past events or instances
in the right sequence to predict a future event.

Challenges of Implementing Data Mining:

 Skilled experts are needed to formulate the data mining queries.
 Overfitting: Due to a small training database, a model may not fit future
states.
 Data mining needs large databases which can sometimes be difficult to manage.
 Business practices may need to be modified to make use of the information
uncovered.
 If the data set is not diverse, data mining results may not be accurate.
 Integrating information from heterogeneous databases and global
information systems can be complex.

Data mining Examples:

Example 1:

Consider the marketing head of a telecom service provider who wants to increase the revenues
of long-distance services. For a high ROI on his sales and marketing efforts, customer
profiling is important. He has a vast data pool of customer information like age, gender,
income, credit history, etc. But it is impossible to determine the characteristics of people who
prefer long-distance calls with manual analysis. Using data mining techniques, he may
uncover patterns between high long-distance call users and their characteristics.

For example, he might learn that his best customers are married females between the
ages of 45 and 54 who make more than $80,000 per year. Marketing efforts can be
targeted at such a demographic.

Example 2:

A bank wants to search for new ways to increase revenues from its credit card operations.
They want to check whether usage would double if fees were halved.
The bank has multiple years of records on average credit card balances, payment amounts,
credit limit usage, and other key parameters. They create a model to check the impact
of the proposed new business policy. The data results show that cutting fees in half for
a targeted customer base could increase revenues by $10 million.

Data Mining Tools

Following are 2 popular data mining tools widely used in industry:

R language:

The R language is an open-source tool for statistical computing and graphics. R has a wide
variety of statistical techniques (classical statistical tests, time-series analysis, classification)
and graphical techniques. It offers effective data handling and storage facilities.

Oracle Data Mining:

Oracle Data Mining, popularly known as ODM, is a module of the Oracle Advanced
Analytics Database. This data mining tool allows data analysts to generate detailed
insights and make predictions. It helps predict customer behavior, develop customer
profiles, and identify cross-selling opportunities.

Benefits of Data Mining:

 Data mining techniques help companies to get knowledge-based information.
 Data mining helps organizations to make profitable adjustments in operations
and production.
 Data mining is a cost-effective and efficient solution compared to other
statistical data applications.
 Data mining helps with the decision-making process.
 It facilitates automated prediction of trends and behaviors as well as automated
discovery of hidden patterns.
 It can be implemented on new systems as well as existing platforms.
 It is a speedy process which makes it easy for users to analyze huge
amounts of data in less time.

Disadvantages of Data Mining

 There is a chance that companies may sell useful information about their customers to
other companies for money. For example, American Express has sold credit card
purchases of their customers to other companies.
 Much data mining analytics software is difficult to operate and requires advanced
training to work on.
 Different data mining tools work in different manners due to the different algorithms
employed in their design. Therefore, the selection of the correct data mining tool is a
very difficult task.
 The data mining techniques are not always accurate, and so they can cause serious
consequences in certain conditions.

Data Mining Applications

Communications: Data mining techniques are used in the communication sector to predict
customer behavior and offer highly targeted and relevant campaigns.

Insurance: Data mining helps insurance companies to price their products profitably
and promote new offers to their new or existing customers.

Education: Data mining benefits educators by helping them access student data, predict
achievement levels and find students or groups of students who need extra attention,
for example, students who are weak in maths.

Manufacturing: With the help of data mining, manufacturers can predict wear and tear of
production assets. They can anticipate maintenance, which helps them minimize downtime.

Banking: Data mining helps the finance sector to get a view of market risks and manage
regulatory compliance. It helps banks to identify probable defaulters to
decide whether to issue credit cards, loans, etc.

Retail: Data mining techniques help retail malls and grocery stores identify and
arrange the most sellable items in the most attention-grabbing positions. It helps store
owners to come up with offers which encourage customers to increase
their spending.

Service Providers: Service providers like mobile phone and utility industries use data mining to
predict the reasons why a customer leaves their company. They analyze
billing details, customer service interactions, and complaints made to the
company to assign each customer a probability score and offer incentives.

E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells
through their websites. One of the most famous names is Amazon, which uses
data mining techniques to get more customers into its eCommerce store.

Super Markets: Data mining allows supermarkets to develop rules to predict if their shoppers
are likely to be expecting. By evaluating their buying patterns, they can
find women customers who are most likely pregnant and start
targeting products like baby powder, diapers and so on.

Crime Investigation: Data mining helps crime investigation agencies to deploy the police workforce
(where is a crime most likely to happen and when?), decide whom to search at a
border crossing, etc.

Bioinformatics: Data mining helps to mine biological data from the massive datasets gathered
in biology and medicine.

Summary:

 Data mining is all about explaining the past and predicting the future through
analysis.
 Data mining helps to extract information from huge sets of data. It is the
procedure of mining knowledge from data.
 The data mining process includes business understanding, data understanding, data
preparation, modelling, evaluation, and deployment.
 Important data mining techniques are classification, clustering, regression,
association rules, outlier detection, sequential patterns, and prediction.
 R and Oracle Data Mining are prominent data mining tools.
 Data mining techniques help companies to get knowledge-based information.
 The main drawback of data mining is that much analytics software is difficult to
operate and requires advanced training to work on.
 Data mining is used in diverse industries such as communications, insurance,
education, manufacturing, banking, retail, service providers, eCommerce,
supermarkets, and bioinformatics.

Database vs Data Warehouse: Key Differences

What is Database?

A database is a collection of related data which represents some elements of the real
world. It is designed to be built and populated with data for a specific task. It is also a
building block of your data solution.

What is a Data Warehouse?

A data warehouse is an information system which stores historical and cumulative
data from single or multiple sources. It is designed to analyze, report on, and integrate
transaction data from different sources.
A data warehouse eases the analysis and reporting process of an organization. It is also
a single version of truth for the organization for the decision making and forecasting
process.

Why use a Database?

Here are the prime reasons for using a database system:

 It offers security of data and of access to it.
 A database offers a variety of techniques to store and retrieve data.
 A database acts as an efficient handler to balance the requirements of multiple
applications using the same data.
 A DBMS offers integrity constraints to provide a high level of protection against
access to prohibited data.
 A database supports concurrent access to data while ensuring that conflicting
updates to the same data do not occur at the same time.

Why Use Data Warehouse?

Here are the important reasons for using a data warehouse:

 A data warehouse helps business users to access critical data from multiple sources
all in one place.
 It provides consistent information on various cross-functional activities.
 It helps you to integrate many sources of data to reduce stress on the production
system.
 A data warehouse helps you to reduce the TAT (total turnaround time) for analysis and
reporting.
 A data warehouse helps users to access critical data from different sources in a
single place, so it saves users' time in retrieving data from multiple
sources. You can also access data from the cloud easily.
 A data warehouse allows you to store a large amount of historical data to analyze
different periods and trends to make future predictions.
 It enhances the value of operational business applications and customer
relationship management systems.
 It separates analytics processing from transactional databases, improving the
performance of both systems.
 Stakeholders and users may be overestimating the quality of data in the source
systems. A data warehouse provides more accurate reports.

Characteristics of Database

 Offers security and removes redundancy
 Allows multiple views of the data
 Database systems follow the ACID properties (Atomicity, Consistency,
Isolation, and Durability)
 Allows insulation between programs and data
 Supports sharing of data and multiuser transaction processing
 Relational databases support multi-user environments
Characteristics of Data Warehouse

 A data warehouse is subject-oriented as it offers information related to a theme
instead of companies' ongoing operations.
 The data also needs to be stored in the data warehouse in a common and
unanimously acceptable manner.
 The time horizon for the data warehouse is relatively extensive compared with
other operational systems.
 A data warehouse is non-volatile, which means the previous data is not erased
when new information is entered in it.

Difference between Database and Data Warehouse

Purpose
  Database: Is designed to record data.
  Data Warehouse: Is designed to analyze data.

Processing Method
  Database: Uses Online Transaction Processing (OLTP).
  Data Warehouse: Uses Online Analytical Processing (OLAP).

Usage
  Database: Helps to perform fundamental operations for your business.
  Data Warehouse: Allows you to analyze your business.

Tables and Joins
  Database: Tables and joins are complex as they are normalized.
  Data Warehouse: Tables and joins are simple because they are denormalized.

Orientation
  Database: An application-oriented collection of data.
  Data Warehouse: A subject-oriented collection of data.

Storage Limit
  Database: Generally limited to a single application.
  Data Warehouse: Stores data from any number of applications.

Availability
  Database: Data is available in real time.
  Data Warehouse: Data is refreshed from source systems as and when needed.

Modeling
  Database: ER modeling techniques are used for designing.
  Data Warehouse: Data modeling techniques are used for designing.

Technique
  Database: Captures data.
  Data Warehouse: Analyzes data.

Data Type
  Database: Data stored in the database is up to date.
  Data Warehouse: Current and historical data is stored; it may not be up to date.

Storage of Data
  Database: A flat relational approach is used for data storage.
  Data Warehouse: A dimensional approach is used for the data structure.
  Example: star and snowflake schemas.

Query Type
  Database: Simple transaction queries are used.
  Data Warehouse: Complex queries are used for analysis purposes.

Data Summary
  Database: Detailed data is stored.
  Data Warehouse: Highly summarized data is stored.

Applications of Database
Banking: Used in the banking sector for customer information, account-related activities,
payments, deposits, loans, credit cards, etc.

Airlines: Used for reservations and schedule information.

Universities: Used to store student information, course registrations, colleges, and results.

Telecommunication: Helps to store call records, monthly bills, balance maintenance, etc.

Finance: Helps to store information related to stocks, and to sales and purchases of
stocks and bonds.

Sales & Production: Used for storing customer, product and sales details.

Manufacturing: Used for data management of the supply chain and for tracking production
of items and inventory status.

HR Management: Stores details about employees' salaries, deductions, generation of
paychecks, etc.

Applications of Data Warehousing


Airline: Used for airline system management operations like crew assignment, analysis of
routes, frequent flyer program discount schemes for passengers, etc.

Banking: Used in the banking sector to manage the resources available on the desk
effectively.

Healthcare sector: Used to strategize and predict outcomes, create patients' treatment
reports, etc. Advanced machine learning and big data enabled data warehouse systems
can predict ailments.

Insurance sector: Widely used to analyze data patterns and customer trends, and to track
market movements quickly.

Retail chain: Helps to track items, identify the buying pattern of the customer, and plan
promotions, and is also used for determining pricing policy.

Telecommunication: In this sector, the data warehouse is used for product promotions,
sales decisions and distribution decisions.

Disadvantages of Database

 The hardware and software cost of implementing a database system is high,
which can increase the budget of your organization.
 Many DBMS systems are complex, so training users to use
the DBMS is required.
 A DBMS can't perform sophisticated calculations.
 There can be issues regarding compatibility with systems which are already in place.
 Data owners may lose control over their data, raising security, ownership, and
privacy issues.

Disadvantages of Data Warehouse

 Adding new data sources takes time and is associated with high cost.
 Sometimes problems associated with the data warehouse may go undetected for
many years.
 Data warehouses are high-maintenance systems. Extracting, loading, and
cleaning data can be time-consuming.
 The data warehouse may look simple, but it is actually too complicated for
average users. You need to provide training to end-users, who may otherwise end up
not using the data warehouse.
 Despite best efforts at project management, the scope of data warehousing will
always increase.

What Works Best for You?

To sum up, we can say that the database helps to perform the fundamental operation
of business while the data warehouse helps you to analyze your business. You choose
either one of them based on your business goals.

Building a data warehouse 

7 Steps to Data Warehousing

It is a business analyst's dream—all the information about the organization's activities
gathered in one place, open to a single set of analytical tools. But how do you make the
dream a reality? First, you have to plan your data warehouse system. You must
understand what questions users will ask it (e.g., how many registrations did the
company receive in each quarter, or what industries are purchasing custom software
development in the Northeast) because the purpose of a data warehouse system is to
provide decision-makers the accurate, timely information they need to make the right
choices.

Step 1: Determine Business Objectives

The company is in a phase of rapid growth and will need the proper mix of
administrative, sales, production, and support personnel. Key decision-makers want to
know whether increasing overhead staffing is returning value to the organization. As
the company enhances the sales force and employs different sales modes, the leaders
need to know whether these modes are effective. External market forces are changing
the balance between a national and regional focus, and the leaders need to understand
this change's effects on the business.

To answer the decision-makers' questions, we needed to understand what defines
success for this business. The owner, the president, and four key managers oversee the
company. These managers oversee profit centers and are responsible for making their
areas successful. They also share resources, contacts, sales opportunities, and
personnel. The managers examine different factors to measure the health and growth
of their segments. Gross profit interests everyone in the group, but to make decisions
about what generates that profit, the system must correlate more details. For instance,
a small contract requires almost the same amount of administrative overhead as a
large contract. Thus, many smaller contracts generate revenue at less profit than a few
large contracts. Tracking contract size becomes important for identifying the factors
that lead to larger contracts.

As we worked with the management team, we learned the quantitative measurements
of business activity that decision-makers use to guide the organization. These
measurements are the key performance indicators, a numeric measure of the
company's activities, such as units sold, gross profit, net profit, hours spent, students
taught, and repeat student registrations. We collected the key performance indicators
into a table called a fact table.

Step 2: Collect and Analyze Information

The only way to gather this performance information is to ask questions. The leaders
have sources of information they use to make decisions. Start with these data sources.
Many are simple. You can get reports from the accounting package, the customer
relationship management (CRM) application, the time reporting system, etc. You'll need
copies of all these reports and you'll need to know where they come from.

Often, analysts, supervisors, administrative assistants, and others create analytical and
summary reports. These reports can be simple correlations of existing reports, or they
can include information that people overlook with the existing software or information
stored in spreadsheets and memos. Such overlooked information can include logs of
telephone calls someone keeps by hand, a small desktop database that tracks shipping
dates, or a daily report a supervisor emails to a manager. A big challenge for data
warehouse designers is finding ways to collect this information. People often write off
this type of serendipitous information as unimportant or inaccurate. But remember that
nothing develops without a reason. Before you disregard any source of information, you
need to understand why it exists.

Another part of this collection and analysis phase is understanding how people gather
and process the information. A data warehouse can automate many reporting tasks,
but you can't automate what you haven't identified and don't understand. The process
requires extensive interaction with the individuals involved. Listen carefully and repeat
back what you think you heard. You need to clearly understand the process and its
reason for existence. Then you're ready to begin designing the warehouse.

Step 3: Identify Core Business Processes

By this point, you must have a clear idea of what business processes you need to
correlate. You've identified the key performance indicators, such as unit sales, units
produced, and gross revenue. Now you need to identify the entities that interrelate to
create the key performance indicators. For instance, at our example company, creating
a training sale involves many people and business factors. The customer might not
have a relationship with the company. The client might have to travel to attend classes
or might need a trainer for an on-site class. New product releases such as Windows
2000 (Win2K) might be released often, prompting the need for training. The company
might run a promotion or might hire a new salesperson.

The data warehouse is a collection of interrelated data structures. Each structure stores
key performance indicators for a specific business process and correlates those
indicators to the factors that generated them. To design a structure to track a business
process, you need to identify the entities that work together to create the key
performance indicator. Each key performance indicator is related to the entities that
generated it. This relationship forms a dimensional model. If a salesperson sells 60
units, the dimensional structure relates that fact to the salesperson, the customer, the
product, the sale date, etc.

Then you need to gather the key performance indicators into fact tables. You gather the
entities that generate the facts into dimension tables. To include a set of facts, you
must relate them to the dimensions (customers, salespeople, products, promotions,
time, etc.) that created them. For the fact table to work, the attributes in a row in the
fact table must be different expressions of the same event or condition. You can
express training sales by number of seats, gross revenue, and hours of instruction
because these are different expressions of the same sale. An instructor taught one class
in a certain room on a certain date. If you need to break the fact down into individual
students and individual salespeople, however, you'd need to create another table
because the detail level of the fact table in this example doesn't support individual
students or salespeople. A data warehouse consists of groups of fact tables, with each
fact table concentrating on a specific subject. Fact tables can share dimension tables
(e.g., the same customer can buy products, generate shipping costs, and return times).
This sharing lets you relate the facts of one fact table to another fact table. After the
data structures are processed as OLAP cubes, you can combine facts with related
dimensions into virtual cubes.

Step 4: Construct a Conceptual Data Model

After identifying the business processes, you can create a conceptual model of the data.
You determine the subjects that will be expressed as fact tables and the dimensions
that will relate to the facts. Clearly identify the key performance indicators for each
business process, and decide the format to store the facts in. Because the facts will
ultimately be aggregated together to form OLAP cubes, the data needs to be in a
consistent unit of measure. The process might seem simple, but it isn't. For example, if
the organization is international and stores monetary sums, you need to choose a
currency. Then you need to determine when you'll convert other currencies to the
chosen currency and what rate of exchange you'll use. You might even need to track
currency-exchange rates as a separate factor.

Now you need to relate the dimensions to the key performance indicators. Each row in
the fact table is generated by the interaction of specific entities. To add a fact, you need
to populate all the dimensions and correlate their activities. Many data systems,
particularly older legacy data systems, have incomplete data. You need to correct this
deficiency before you can use the facts in the warehouse. After making the corrections,
you can construct the dimension and fact tables. The fact table's primary key is a
composite key made from a foreign key of each of the dimension tables.

Data warehouse structures are difficult to populate and maintain, and they take a long
time to construct. Careful planning in the beginning can save you hours or days of
restructuring.
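
To make the fact/dimension idea concrete, here is a small, hypothetical sketch using pandas; the training-sales columns and keys are invented for illustration, with the fact table's composite key formed from the foreign keys of each dimension.

import pandas as pd

# Dimension tables: the entities that generate the facts
customer_dim = pd.DataFrame({"customer_key": [1, 2], "customer_name": ["Acme Co", "Globex"]})
course_dim = pd.DataFrame({"course_key": [10, 11], "course_name": ["SQL Basics", "Win2K Admin"]})
date_dim = pd.DataFrame({"date_key": [20190401, 20190402]})

# Fact table: each row is one training sale, expressed several ways
# (seats, revenue, hours), keyed by foreign keys into each dimension.
training_sales_fact = pd.DataFrame({
    "customer_key": [1, 2],
    "course_key": [10, 11],
    "date_key": [20190401, 20190402],
    "seats_sold": [12, 5],
    "gross_revenue": [6000.00, 3750.00],
    "hours_of_instruction": [16, 24],
})

# Joining facts back to a dimension answers questions like "revenue per customer"
report = training_sales_fact.merge(customer_dim, on="customer_key")
print(report.groupby("customer_name")["gross_revenue"].sum())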

Step 5: Locate Data Sources and Plan Data Transformations

Now that you know what you need, you have to get it. You need to identify where the
critical information is and how to move it into the data warehouse structure. For
example, most of our example company's data comes from three sources. The
company has a custom in-house application for tracking training sales. A CRM package
tracks the sales-force activities, and a custom time-reporting system keeps track of
time.

You need to move the data into a consolidated, consistent data structure. A difficult
task is correlating information between the in-house CRM and time-reporting
databases. The systems don't share information such as employee numbers, customer
numbers, or project numbers. In this phase of the design, you need to plan how to
reconcile data in the separate databases so that information can be correlated as it is
copied into the data warehouse tables.

You'll also need to scrub the data. In online transaction processing (OLTP) systems,
data-entry personnel often leave fields blank. The information missing from these
fields, however, is often crucial for providing an accurate data analysis. Make sure the
source data is complete before you use it. You can sometimes complete the information
programmatically at the source. You can extract ZIP codes from city and state data, or
get special pricing considerations from another data source. Sometimes, though,
completion requires pulling files and entering missing data by hand. The cost of fixing
bad data can make the system cost-prohibitive, so you need to determine the most
cost-effective means of correcting the data and then forecast those costs as part of the
system cost. Make corrections to the data at the source so that reports generated from
the data warehouse agree with any corresponding reports generated at the source.

You'll need to transform the data as you move it from one data structure to another.
Some transformations are simple mappings to database columns with different names.
Some might involve converting the data storage type. Some transformations are unit-
of-measure conversions (pounds to kilograms, centimeters to inches), and some are
summarizations of data (e.g., how many total seats sold in a class per company, rather
than each student's name). And some transformations require complex programs that
apply sophisticated algorithms to determine the values. So you need to select the right
tools (e.g., Data Transformation Services—DTS—running ActiveX scripts, or third-party
tools) to perform these transformations. Base your decision mainly on cost, including
the cost of training or hiring people to use the tools, and the cost of maintaining the
tools.

You also need to plan when data movement will occur. While the system is accessing
the data sources, the performance of those databases will decline precipitously.
Schedule the data extraction to minimize its impact on system users (e.g., over a
weekend).

Step 6: Set Tracking Duration

Data warehouse structures consume a large amount of storage space, so you need to
determine how to archive the data as time goes on. But because data warehouses track
performance over time, the data should be available virtually forever. So, how do you
reconcile these goals?

The data warehouse is set to retain data at various levels of detail, or granularity. This
granularity must be consistent throughout one data structure, but different data
structures with different grains can be related through shared dimensions. As data
ages, you can summarize and store it with less detail in another structure. You could
store the data at the day grain for the first 2 years, then move it to another structure.
The second structure might use a week grain to save space. Data might stay there for
another 3 to 5 years, then move to a third structure where the grain is monthly. By
planning these stages in advance, you can design analysis tools to work with the
changing grains based on the age of the data. Then if older historical data is imported,
it can be transformed directly into the proper format.
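
A short sketch of this re-graining step, assuming pandas is available; the retention periods and the toy daily figures are assumptions for illustration.

import pandas as pd

# Day-grain facts for the most recent period (toy data)
daily = pd.DataFrame(
    {"units_sold": range(1, 61)},
    index=pd.date_range("2019-01-01", periods=60, freq="D"),
)

# As data ages, re-store it at a coarser grain to save space:
weekly = daily["units_sold"].resample("W").sum()   # week grain for the middle years (assumed policy)
monthly = daily["units_sold"].resample("M").sum()  # month grain for older history
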
Step 7: Implement the Plan

After you've developed the plan, it provides a viable basis for estimating work and
scheduling the project. The scope of data warehouse projects is large, so phased
delivery schedules are important for keeping the project on track. We've found that an
effective strategy is to plan the entire warehouse, then implement a part as a data mart
to demonstrate what the system is capable of doing. As you complete the parts, they fit
together like pieces of a jigsaw puzzle. Each new set of data structures adds to the
capabilities of the previous structures, bringing value to the system.

Data warehouse systems provide decision-makers consolidated, consistent historical
data about their organization's activities. With careful planning, the system can provide
vital information on how factors interrelate to help or harm the organization. A solid
plan can contain costs and make this powerful tool a reality.

Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design can
begin. The following are the typical steps involved in the data warehousing project
cycle.

 Requirement Gathering
 Physical Environment Setup
 Data Modeling
 ETL
 OLAP Cube Design
 Front End Development
 Report Development
 Performance Tuning
 Query Optimization
 Quality Assurance
 Rolling out to Production
 Production Maintenance
 Incremental Enhancements

Each phase listed above represents a typical data warehouse design step, and has
several sections:

 Task Description: This section describes what typically needs to be
accomplished during this particular data warehouse design phase.
 Time Requirement: A rough estimate of the amount of time this particular data
warehouse task takes.
 Deliverables: Typically at the end of each data warehouse task, one or more
documents are produced that fully describe the steps and results of that
particular task. This is especially important for consultants to communicate their
results to the clients.
 Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them
are not so obvious, but all of them are real.
What is Data Warehousing?
Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources
that support analytical reporting, structured and/or ad hoc queries, and decision
making. Data warehousing involves data cleaning, data integration, and data
consolidation.

Using Data Warehouse Information


There are decision support technologies that help utilize the data available in a data
warehouse. These technologies help executives to use the warehouse quickly and
effectively. They can gather data, analyze it, and make decisions based on the
information present in the warehouse. The information gathered in a warehouse can
be used in any of the following domains −
 Tuning Production Strategies − The product strategies can be well tuned by
repositioning the products and managing the product portfolios by comparing
the sales quarterly or yearly.
 Customer Analysis − Customer analysis is done by analyzing the customer's
buying preferences, buying time, budget cycles, etc.
 Operations Analysis − Data warehousing also helps in customer relationship
management, and making environmental corrections. The information also
allows us to analyze business operations.

Integrating Heterogeneous Databases


To integrate heterogeneous databases, we have two approaches −

 Query-driven Approach
 Update-driven Approach

Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach
was used to build wrappers and integrators on top of multiple heterogeneous
databases. These integrators are also known as mediators.

Process of Query-Driven Approach


 When a query is issued on the client side, a metadata dictionary translates the
query into an appropriate form for the individual heterogeneous sites involved.
 Now these queries are mapped and sent to the local query processor.
 The results from heterogeneous sites are integrated into a global answer set.
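
The following toy Python sketch illustrates the mediator/wrapper idea behind this process; the sites, their schemas, and the translation step are entirely hypothetical:

# A toy illustration of the query-driven (mediator/wrapper) approach.
def translate_query(query, site_schema):
    # Map global attribute names to the site's local column names.
    return {site_schema.get(attr, attr): value for attr, value in query.items()}

def run_local_query(site_data, local_query):
    # Each wrapper filters its own rows; a real wrapper would talk to a DBMS.
    return [row for row in site_data
            if all(row.get(col) == val for col, val in local_query.items())]

sites = {
    "branch_a": {"schema": {"cust_name": "name"},
                 "data": [{"name": "Ann", "city": "Oslo"}]},
    "branch_b": {"schema": {"cust_name": "customer"},
                 "data": [{"customer": "Ann", "city": "Bergen"}]},
}

global_query = {"cust_name": "Ann"}
answer = []
for site in sites.values():
    local = translate_query(global_query, site["schema"])
    answer.extend(run_local_query(site["data"], local))  # integrate into the global answer set
print(answer)
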

Disadvantages
 Query-driven approach needs complex integration and filtering processes.
 This approach is very inefficient.
 It is very expensive for frequent queries.
 This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems
follow the update-driven approach rather than the traditional approach discussed earlier.
In the update-driven approach, the information from multiple heterogeneous sources is
integrated in advance and stored in a warehouse. This information is available for
direct querying and analysis.

Advantages
This approach has the following advantages −
 This approach provides high performance.
 The data is copied, processed, integrated, annotated, summarized and
restructured in the semantic data store in advance.
 Query processing does not require an interface to process data at local sources.

Functions of Data Warehouse Tools and Utilities


The following are the functions of data warehouse tools and utilities −
 Data Extraction − Involves gathering data from multiple heterogeneous
sources.
 Data Cleaning − Involves finding and correcting the errors in data.
 Data Transformation − Involves converting the data from legacy format to
warehouse format.
 Data Loading − Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
 Refreshing − Involves updating from data sources to warehouse.
Note − Data cleaning and data transformation are important steps in improving the
quality of data and data mining results.
How to Build a Data Warehouse

1. Improving integration
An organization registers data in various systems which support the various business
processes. In order to create an overall picture of business operations, customers, and
suppliers – thus creating a single version of the truth – the data must come together in
one place and be made compatible. Both external (from the environment) and internal
data (from ERP, CRM and financial systems) should merge into the data warehouse and
then be grouped.
2. Speeding up response times
The source systems are fully optimized in order to process many small transactions,
such as orders, in a short time. Generating information about the performance of the
organization only requires a few large ‘transactions’ in which large volumes of data are
gathered and aggregated. The structure of a data warehouse is specifically designed to
quickly analyze such large volumes of (big) data.
3. Faster and more flexible reporting
The structure of both data warehouses and data marts enables end users to report in a
flexible manner and to quickly perform interactive analysis based on various predefined
angles (dimensions). They may, for example, with a single mouse click jump from year
level, to quarter, to month level, and quickly switch between the customer dimension
and the product dimension, all while the indicator remains fixed. In this way, end users
can actually juggle the data and thus quickly gain knowledge about business operations
and performance indicators.
4. Recording changes to build history
Source systems don’t usually keep a history of certain data. For example, if a customer
relocates or a product moves to a different product group, the (old) values will most
likely be overwritten. This means they disappear from the system – or at least they’re
very difficult to trace back. That’s a pity, because in order to generate reliable
information, we actually need these old values, as users sometimes want to be able to
look back in time. In other words: we want to be able to look at the organization’s
performance from a historical perspective – in accordance with the organizational
structure and product classifications of that time – instead of in the current context. A
data warehouse ensures that data changes in the source system are recorded, which
enables historical analysis.
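
One common way to record such changes is a "type 2" history table that closes the old row and appends a new one instead of overwriting it; the sketch below (with a purely hypothetical customer table) shows the idea:

from datetime import date

customer_dim = [
    {"customer_id": 42, "city": "Utrecht",
     "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True},
]

def record_relocation(dim, customer_id, new_city, change_date):
    """Close the current row and append a new one instead of overwriting."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None, "is_current": True})

record_relocation(customer_dim, 42, "Rotterdam", date(2023, 6, 1))
for row in customer_dim:
    print(row)

With this layout, the old city is never lost, so reports can be run against either the organization of that time or the current one.
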
5. Increasing data quality
Stakeholders and users frequently overestimate the quality of data in the source
systems. Unfortunately, source systems quite often contain data of poor quality. When
we use a data warehouse, we can greatly improve the data quality, either through –
where possible – correcting the data while loading or by tackling the problem at its
source.
6. Unburdening operational systems
By transferring data to a separate computer in order to analyze it, the operational
system is unburdened.
7. Unburdening the IT department
A data warehouse and Business Intelligence tools allow employees within the
organization to create reports and perform analyses independently. However, an
organization will first have to invest in order to set up the required infrastructure for
that data warehouse and those BI tools. The following principle applies: the better the
architecture is set up and developed, the more complex reports users can
independently create. Obviously, users first need sufficient training and support, where
necessary. Yet, what we see in practice is that many of the more complex reports end
up being created by the IT department. This is mostly due to users lacking either the
time or the knowledge. That’s why data literacy is an important factor. Another reason
may be that the organization hasn’t put enough effort into developing the right
architecture.
8. Increasing recognizability
Indicators are ‘prepared’ in the data warehouse. This allows users to create complex
reports on, for example, returns on customers, or on service levels divided by month,
customer group and country, in a few simple steps. In the source system this
information only emerges when we manually perform a large number of actions and
calculations. Using a data warehouse thus increases the recognizability of the
information we require, provided that the data warehouse is set up based on the
business.
9. Increasing findability
When we create a data warehouse, we make sure that users can easily access the
meaning of data. (In the source system, these meanings are either non-existent or
poorly accessible.) With a data warehouse, users can find data more quickly, and thus
establish information and knowledge faster.
All the goals of the data warehouse serve the aims of Business Intelligence: making
better decisions faster at all levels within the organization and even across
organizational boundaries.

Discuss data mining and kdd in database

Data Mining - Knowledge Discovery

What is Knowledge Discovery?


Some people don’t differentiate data mining from knowledge discovery while others
view data mining as an essential step in the process of knowledge discovery. Here is
the list of steps involved in the knowledge discovery process −
 Data Cleaning − In this step, noise and inconsistent data are removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are retrieved
from the database.
 Data Transformation − In this step, data is transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations.
 Data Mining − In this step, intelligent methods are applied in order to extract
data patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.
The following diagram shows the process of knowledge discovery −
KDD Process in Data Mining

Data Mining – Knowledge Discovery in Databases(KDD).


Why do we need Data Mining?
The volume of information we have to handle is increasing every day, coming from business
transactions, scientific data, sensor data, pictures, videos, etc. So we need a system
that is capable of extracting the essence of the information available and that can
automatically generate reports, views or summaries of the data for better decision-making.
Why Data Mining is used in Business?
Data mining is used in business to make better managerial decisions by:

 Automatic summarization of data.
 Extracting essence of information stored.
 Discovering patterns in raw data.
Data Mining, also known as Knowledge Discovery in Databases, refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data
stored in databases.
Steps Involved in KDD Process:
1. Data Cleaning: Data cleaning is defined as removal of noisy and irrelevant data
from collection.
 Cleaning in case of Missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with Data discrepancy detection and Data transformation
tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from
multiple sources into a common source (data warehouse).
 Data integration using Data Migration tools.
 Data integration using Data Synchronization tools.
 Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to
the analysis is decided and retrieved from the data collection.
 Data selection using Neural network.
 Data selection using Decision Trees.
 Data selection using Naive bayes.
 Data selection using Clustering, Regression, etc.
4. Data Transformation: Data Transformation is defined as the process of
transforming data into appropriate form required by mining procedure.
Data Transformation is a two step process:
 Data Mapping: Assigning elements from source base to destination to
capture transformations.
 Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to
extract potentially useful patterns.
 Transforms task-relevant data into patterns.
 Decides the purpose of the model, using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying interesting
patterns representing knowledge, based on given interestingness measures.
 Find interestingness score of each pattern.
 Uses summarization and Visualization to make data understandable by
user.
7. Knowledge representation: Knowledge representation is defined as a technique
that utilizes visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization
rules, etc.
Note:
 KDD is an iterative process where evaluation measures can be enhanced,
mining can be refined, new data can be integrated and transformed in order to get
different and more appropriate results.
 Preprocessing of databases consists of Data cleaning and Data
Integration.
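
A compressed, illustrative Python sketch of these KDD steps on a toy dataset (the column names and the choice of clustering as the mining step are assumptions, not requirements):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two heterogeneous "sources"
crm = pd.DataFrame({"cust": [1, 2, 3], "age": [25, None, 47]})
sales = pd.DataFrame({"cust": [1, 2, 3], "spend": [120.0, 300.0, 80.0]})

# Data cleaning: handle a missing value
crm["age"] = crm["age"].fillna(crm["age"].median())

# Data integration: combine the sources
data = crm.merge(sales, on="cust")

# Data selection + transformation: keep task-relevant attributes and scale them
features = StandardScaler().fit_transform(data[["age", "spend"]])

# Data mining: extract a simple pattern (two customer segments)
data["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Pattern evaluation / presentation: summarize what was found
print(data.groupby("segment")[["age", "spend"]].mean())
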

Neural Networks and Data Mining


An Artificial Neural Network, often just called a neural network, is a mathematical
model inspired by biological neural networks. A neural network consists of an
interconnected group of artificial neurons, and it processes information using a
connectionist approach to computation. In most cases a neural network is an adaptive
system that changes its structure during a learning phase. Neural networks are used
to model complex relationships between inputs and outputs or to find patterns in data.
The inspiration for neural networks came from examination of central nervous systems.
In an artificial neural network, simple artificial nodes, called “neurons”, “neurodes”,
“processing elements” or “units”, are connected together to form a network which
mimics a biological neural network.
There is no single formal definition of what an artificial neural network is. Generally, it
involves a network of simple processing elements that exhibit complex global behavior
determined by the connections between the processing elements and element
parameters. Artificial neural networks are used with algorithms designed to alter the
strength of the connections in the network to produce a desired signal flow.
Neural networks are also similar to biological neural networks in that functions are
performed collectively and in parallel by the units, rather than there being a clear
delineation of subtasks to which various units are assigned. The term “neural network”
usually refers to models employed in statistics, cognitive psychology and artificial
intelligence. Neural network models which emulate the central nervous system are part
of theoretical neuroscience and computational neuroscience.
In modern software implementations of artificial neural networks, the approach inspired
by biology has been largely abandoned for a more practical approach based on
statistics and signal processing. In some of these systems, neural networks or parts of
neural networks (such as artificial neurons) are used as components in larger systems
that combine both adaptive and non-adaptive elements. While the more general
approach of such adaptive systems is more suitable for real-world problem solving, it
has far less to do with the traditional artificial intelligence connectionist models. What
they do have in common, however, is the principle of non-linear, distributed, parallel
and local processing and adaptation. Historically, the use of neural network models
marked a paradigm shift in the late eighties from high-level (symbolic) artificial
intelligence, characterized by expert systems with knowledge embodied in if-then
rules, to low-level (sub-symbolic) machine learning, characterized by knowledge
embodied in the parameters of a dynamical system.
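
As a minimal example of such an adaptive system, the scikit-learn sketch below trains a small feed-forward network whose connection weights are adjusted during the learning phase until it approximates the XOR mapping (the architecture and settings are arbitrary illustrative choices):

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]   # XOR: not linearly separable

net = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X, y)          # learning phase: connection strengths are updated
print(net.predict(X))  # typically recovers [0, 1, 1, 0]
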

Applications
The utility of artificial neural network models lies in the fact that they can be used to
infer a function from observations. This is particularly useful in applications where the
complexity of the data or task makes the design of such a function by hand impractical.

Real-life applications
The tasks artificial neural networks are applied to tend to fall within the following broad
categories:

 Function approximation, or regression analysis, including time series prediction,
fitness approximation and modeling.
 Classification, including pattern and sequence recognition, novelty detection and
sequential decision making.
 Data processing, including filtering, clustering, blind source separation and
compression.
 Robotics, including directing manipulators, Computer numerical control.

Application areas include system identification and control (vehicle control, process
control, natural resources management), quantum chemistry, game-playing and
decision making (backgammon, chess, poker), pattern recognition (radar systems, face
identification, object recognition and more), sequence recognition (gesture, speech,
handwritten text recognition), medical diagnosis, financial applications (automated
trading systems), data mining (or knowledge discovery in databases, “KDD”),
visualization and e-mail spam filtering.
Artificial neural networks have also been used to diagnose several cancers. An ANN
based hybrid lung cancer detection system named HLND improves the accuracy of
diagnosis and the speed of lung cancer radiology. These networks have also been used
to diagnose prostate cancer. The diagnoses can be used to make specific models taken
from a large group of patients compared to information of one given patient. The
models do not depend on assumptions about correlations of different variables.
Colorectal cancer has also been predicted using the neural networks. Neural networks
could predict the outcome for a patient with colorectal cancer with a lot more accuracy
than the current clinical methods. After training, the networks could predict multiple
patient outcomes from unrelated institutions.
Neural networks and neuroscience
Theoretical and computational neuroscience is the field concerned with the theoretical
analysis and computational modeling of biological neural systems. Since neural systems
are intimately related to cognitive processes and behavior, the field is closely related to
cognitive and behavioral modeling.
The aim of the field is to create models of biological neural systems in order to
understand how biological systems work. To gain this understanding, neuroscientists
strive to make a link between observed biological processes (data), biologically
plausible mechanisms for neural processing and learning (biological neural network
models) and theory (statistical learning theory and information theory).
Types of models
Many models are used in the field defined at different levels of abstraction and
modeling different aspects of neural systems. They range from models of the short-
term behavior of individual neurons, models of how the dynamics of neural circuitry
arise from interactions between individual neurons and finally to models of how
behavior can arise from abstract neural modules that represent complete subsystems.
These include models of the long-term, and short-term plasticity, of neural systems and
their relations to learning and memory from the individual neuron to the system level.
While initial research had been concerned mostly with the electrical characteristics of
neurons, a particularly important part of the investigation in recent years has been the
exploration of the role of neuromodulators such as dopamine, acetylcholine, and
serotonin on behavior and learning.
Biophysical models, such as BCM theory, have been important in understanding
mechanisms for synaptic plasticity, and have had applications in both computer science
and neuroscience. Research is ongoing in understanding the computational algorithms
used in the brain, with some recent biological evidence for radial basis networks and
neural backpropagation as mechanisms for processing data.
Computational devices have been created in CMOS for both biophysical simulation and
neuromorphic computing. More recent efforts show promise for creating nanodevices
for very large scale principal components analyses and convolution. If successful, these
efforts could usher in a new era of neural computing that is a step beyond digital
computing, because it depends on learning rather than programming and because it is
fundamentally analog rather than digital even though the first instantiations may in fact
be with CMOS digital devices.
Hierarchical clustering algorithm
Hierarchical clustering algorithms are of two types:

i) Agglomerative Hierarchical clustering algorithm or AGNES (agglomerative nesting) and

ii) Divisive Hierarchical clustering algorithm or DIANA (divisive analysis).

Both these algorithms are exactly the reverse of each other. So we will be covering the Agglomerative
Hierarchical clustering algorithm in detail.
Agglomerative Hierarchical clustering - This algorithm works by grouping the data one by one on the
basis of the nearest distance measure of all the pairwise distances between the data points. The
distance between the data points is then recalculated, but which distance to consider when the groups have
been formed? For this there are many available methods. Some of them are:

1) single-nearest distance or single linkage.

2) complete-farthest distance or complete linkage.

3) average-average distance or average linkage.

4) centroid distance.

5) Ward's method - the sum of squared Euclidean distances is minimized.

In this way, we keep grouping the data until one cluster is formed. Then, on the basis of
the dendrogram, we can decide how many clusters should actually be
present.

Algorithmic steps for Agglomerative Hierarchical clustering

Let  X = {x1, x2, x3, ..., xn} be the set of data points.

1) Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.

2) Find the least distance pair of clusters in the current clustering, say pair (r), (s),
according to d[(r),(s)] = min d[(i),(j)]   where the minimum is over all pairs of clusters
in the current clustering.

3) Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single
cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),
(s)].

4) Update the distance matrix, D, by deleting the rows and columns corresponding to
clusters (r) and (s) and adding a row and column corresponding to the newly formed
cluster. The distance between the new cluster, denoted (r,s), and an old cluster (k) is
defined in this way: d[(k), (r,s)] = min(d[(k),(r)], d[(k),(s)]).

5) If all the data points are in one cluster then stop, else repeat from step 2).
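
A short SciPy-based sketch of these steps (the sample points below are made up); the method argument selects one of the linkage measures listed above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# 'single' = single linkage (nearest distance); 'complete', 'average',
# 'centroid' and 'ward' correspond to the other merge criteria listed above.
Z = linkage(X, method="single")

# Cut the hierarchy into a chosen number of clusters (here 3)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram.
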

Divisive Hierarchical clustering - It is just the reverse of the Agglomerative Hierarchical
approach.

Advantages

1) No apriori information about the number of clusters required.


2) Easy to implement and gives the best results in some cases.

Disadvantages

1) Algorithm can never undo what was done previously.

2) Time complexity of at least O(n² log n) is required, where ‘n’ is the number of data
points.

3) Depending on the type of distance measure chosen for merging, different algorithms can
suffer from one or more of the following:

    i) Sensitivity to noise and outliers

    ii) Breaking large clusters

    iii) Difficulty handling different sized clusters and convex shapes

4) No objective function is directly minimized

5) Sometimes it is difficult to identify the correct number of clusters from the dendrogram.

Why is data mining important?


Have a look at the below-mentioned points which explain why data mining is required.

1. Data mining is the procedure of examining large sets of data in order to
identify the insights hidden in that data. Nowadays, the demand of the data
industry is growing rapidly, which has also increased the demand for data
analysts and data scientists.
2. With this technique, we analyze the data and then convert it into
meaningful information. This helps businesses make better and more accurate
decisions.
3. Data mining helps organizations develop smart market decisions, run accurate
campaigns, make predictions, and much more.
4. With the help of data mining, we can analyze customer behavior and gain
insights from it. This leads to great success and a data-driven business.
Data mining and its process

Data mining is an interactive process. Take a look at the following steps.

1- Requirement gathering

A data mining project starts with requirement gathering and understanding. Data
mining analysts or users define the requirement scope from the vendor's business
perspective. Once the scope is defined, we move to the next phase.

2- Data exploration

In this step, data mining experts gather, evaluate, and explore the requirements of the
project. They understand the problems and challenges and record them as metadata.
Statistics are used in this step to identify and describe the data patterns.

3- Data preparations

Data mining experts convert the data into meaningful information for the modelling
step. They use the ETL process – extract, transform, and load. They are also responsible for
creating new data attributes. Various tools are used here to present the data in a structured
format without changing the meaning of the data sets.

4- Modelling

Data experts put their best tools in place for this step, as it plays a vital role in the
complete processing of the data. Modelling methods are applied to filter the data in an
appropriate manner. Modelling and evaluation are correlated steps and are often carried out
at the same time to check the parameters. Once the final modelling is done, the outcome
is checked for quality.
5- Evaluation

This is the filtering step that follows modelling. If the outcome is not
satisfactory, it is fed back into the modelling step. Once a final outcome is reached, the
requirements are checked again with the vendor so that no point is missed. Data mining
experts judge the complete result at the end.

6- Deployment

This is the final stage of the complete process. Experts present the data to vendors in
the form of spreadsheets or graphs.

Data mining services can be used for the following functions

 Research and surveys. Data mining can be used for product research,
surveys, market research, and analysis. Information can be gathered that is
quite useful in driving new marketing campaigns and promotions.
 Information collection. Through the web scraping process, it is possible to
collect information regarding investors, investments, and funds by scraping
through related websites and databases.
 Customer opinions. Customer views and suggestions play an important role
in the way a company operates. The information can readily be found on
forums, blogs and other resources where customers freely provide their
views.
 Data scanning. Data collected and stored will not be important unless
scanned. Scanning is important to identify patterns and similarities contained
in the data.
 Extraction of information. This is the process of identifying the useful
patterns in data that can be used in the decision-making process. This is
because decision making must be based on sound information and facts.
 Pre-processing of data. Usually, the data collected is stored in the data
warehouse. This data needs to be pre-processed. Pre-processing means that
data deemed unimportant may be removed manually by data mining experts.
 Web data. Web data usually poses many challenges in mining because of its
nature. For instance, web data is dynamic, meaning it keeps changing from
time to time. Therefore, the process of data mining should be repeated at
regular intervals.
 Competitor analysis. There is a need to understand how your competitors
are faring in the business market. You need to know both their
weaknesses and strengths. Their methods of marketing and distribution can
be mined. How they reduce their overall costs is also quite important.
 Online research. The internet is highly regarded for its huge information. It
is evident that it is the largest source of information. It is possible to gather a
lot of information regarding different companies, customers, and your
business clients. It is possible to detect frauds through online means.
 News. Nowadays with almost all major newspapers and news sources posting
their news online, it is possible to gather information regarding trends and
other critical areas. In this way, it is possible to be in the better position of
competing in the market.
 Updating data. This is quite important. Data collected will be useless unless
it is updated. This is to ensure that the information is relevant so as to make
decisions from it.

After applying the data mining process, it is possible to extract information that has
been filtered and refined. Usually, the process of data mining is divided into three major
sections: pre-processing of the data, mining the data, and then validating the data.
Generally, this process involves the conversion of data into valid information.
ADVANTAGES OF DATA MINING

Check out the below-mentioned Data mining benefits

1. With the help of data mining, marketing companies build data models and
predictions based on historical data. They use these to run campaigns, shape
marketing strategy, etc. This leads to success and rapid growth.
2. The retail industry is on the same page as marketing companies. With data
mining, retailers rely on predictive models for their goods and services.
Retail stores can gain better production and customer insights, and discounts
and redemptions are based on historical data.
3. Data mining gives banks insight into their financial position and keeps them
updated. Banks build models based on customer data, for example to evaluate
the loan process. Data mining serves the banking industry in many other
ways as well.
4. Manufacturing benefits from data mining by analyzing engineering data and
detecting faulty devices and products. This helps manufacturers remove
defective items from the line and keep the best products and services in place.
5. It helps government bodies analyze financial data and transactions and
model them into useful information.
6. With data mining, organizations can improve planning and decision making.
7. New revenue streams are generated with the help of data mining, which
results in organizational growth.
8. Data mining not only helps in predictions but also helps in the development of
new services and products.
9. Customers gain better insight into the organization, which increases customer
engagement and interactions.
10. Once a competitive advantage is established, data mining also helps reduce
costs.

There are many more benefits of Data mining and its useful features. When data mining
combines with Analytics and Big data, it is completely changed into a new trend which
is the demand of data-driven market.

Conclusion

It is important to note that it takes time to get valid information from the data.
Therefore, if you want your business to grow rapidly, you need to make accurate and
quick decisions that take advantage of the available opportunities in a timely manner.

Data mining is a rapidly growing industry in today's technology-driven world. Nowadays,
everyone needs their data to be used in an appropriate manner and with the right
approach in order to obtain useful and accurate information.

Data Mining Issues

Data mining systems face a lot of challenges and issues in today's world. Some of them
are:

1 Mining methodology and user interaction issues

2 Performance issues

3 Issues relating to the diversity of database types

1 Mining methodology and user interaction issues:


Mining different kinds of knowledge in databases:
Different users need different knowledge, presented in different ways. That is, different clients want
different kinds of information, so it becomes difficult to cover the vast range of knowledge discovery
tasks that can meet every client's requirements.

Interactive mining of knowledge at multiple levels of abstraction:


Interactive mining allows users to focus the search for patterns from different
angles. The data mining process should be interactive because it is difficult to know
what can be discovered within a database.

Incorporation of background knowledge:


Background knowledge is used to guide discovery process and to express the
discovered patterns.

Query languages and ad hoc mining:


Relational query languages (such as SQL) allow users to pose ad-hoc queries for data
retrieval. Similarly, a data mining query language should be well integrated with the
query language of the data warehouse.

Handling noisy or incomplete data:


In a large database, many of the attribute values may be incorrect. This may be due to
human error or to instrument failure. Data cleaning methods and data analysis methods
are used to handle noisy data.

2 Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms:


The huge size of many databases, the wide distribution of data, and complexity of some
data mining methods are factors motivating the development of parallel and distributed
data mining algorithms. Such algorithms divide the data into partitions, which are
processed in parallel.
3 Issues relating to the diversity of database types:
Handling of relational and complex types of data:
There are many kinds of data stored in databases and data warehouses. It is not
possible for one system to mine all these kinds of data, so different data mining systems
should be constructed for different kinds of data.

Mining information from heterogeneous databases and global information
systems:
Data is fetched from different data sources on Local Area Networks (LAN) and Wide
Area Networks (WAN). The discovery of knowledge from these differently structured
sources is a great challenge to data mining.

Data mining is not an easy task, as the algorithms used can get very complex and
data is not always available at one place. It needs to be integrated from various
heterogeneous data sources. These factors also create some issues. Here in this
tutorial, we will discuss the major issues regarding −

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data
mining to cover a broad range of knowledge discovery task.
 Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because it allows users to focus the
search for patterns, providing and refining data mining requests based on the
returned results.
 Incorporation of background knowledge − To guide discovery process and
to express the discovered patterns, the background knowledge can be used.
Background knowledge may be used to express the discovered patterns not
only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient
and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns
are discovered it needs to be expressed in high level languages, and visual
representations. These representations should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy of the
discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may be uninteresting because
they either represent common knowledge or lack novelty; evaluating pattern
interestingness is therefore an important issue.

Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to
effectively extract the information from huge amount of data in databases, data
mining algorithm must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity of
data mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are
processed in parallel. The results from the partitions are then merged.
Incremental algorithms incorporate database updates without mining the data
again from scratch.

Diverse Data Types Issues


 Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data, temporal
data etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on LAN
or WAN. These data sources may be structured, semi-structured or unstructured.
Therefore mining knowledge from them adds challenges to data mining.
Data Warehousing - Metadata Concepts

What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. For example, the index of a book serves as
metadata for the contents of the book. In other words, we can say that metadata is
the summarized data that leads us to detailed data. In terms of data warehouse, we
can define metadata as follows.
 Metadata is the road-map to a data warehouse.
 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions
of a given data warehouse. Along with this metadata, additional metadata is also
created for time-stamping any extracted data and recording the source of the extracted data.

Categories of Metadata
Metadata can be broadly categorized into three categories −
 Business Metadata − It has the data ownership information, business
definition, and changing policies.
 Technical Metadata − It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also
includes structural information such as primary and foreign key attributes and
indices.
 Operational Metadata − It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of data migrated and transformation applied
on it.
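
A purely illustrative (hypothetical) metadata record for a single warehouse column, showing the three categories side by side:

column_metadata = {
    "business": {
        "name": "Net Revenue",
        "definition": "Invoice amount minus discounts and returns",
        "owner": "Finance department",
    },
    "technical": {
        "table": "fact_sales",
        "column": "net_revenue",
        "data_type": "DECIMAL(12,2)",
        "foreign_keys": ["date_key", "product_key"],
    },
    "operational": {
        "status": "active",          # currency of data: active / archived / purged
        "lineage": "extracted from ERP invoices, currency-converted to EUR",
        "last_refresh": "2024-01-15T02:00:00Z",
    },
}
print(column_metadata["technical"]["column"])
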
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data, yet it plays an important role. The
various roles of metadata are explained below.
 Metadata acts as a directory.
 This directory helps the decision support system to locate the contents of the
data warehouse.
 Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
 Metadata helps in summarization between current detailed data and highly
summarized data.
 Metadata also helps in summarization between lightly detailed data and highly
summarized data.
 Metadata is used for query tools.
 Metadata is used in extraction and cleansing tools.
 Metadata is used in reporting tools.
 Metadata is used in transformation tools.
 Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.

Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the
following metadata −
 Definition of data warehouse − It includes the description of structure of
data warehouse. The description is defined by schema, view, hierarchies,
derived data definitions, and data mart locations and contents.
 Business metadata − It contains the data ownership information, business
definition, and changing policies.
 Operational Metadata − It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of data migrated and transformation applied
on it.
 Data for mapping from operational environment to data warehouse − It
includes the source databases and their contents, data extraction, data partition
cleaning, transformation rules, data refresh and purging rules.
 Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.

Challenges for Metadata Management


The importance of metadata cannot be overstated. Metadata helps in driving the
accuracy of reports, validates data transformation, and ensures the accuracy of
calculations. Metadata also enforces the definition of business terms to business end-
users. With all these uses of metadata, it also has its challenges. Some of the
challenges are discussed below.
 Metadata in a big organization is scattered across the organization. This
metadata is spread in spreadsheets, databases, and applications.
 Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
 There are no industry-wide accepted standards. Data management solution
vendors have narrow focus.
 There are no easy and accepted methods of passing metadata.
Data Mining - Cluster Analysis
Cluster is a group of objects that belongs to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.

What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
 The main advantage of clustering over classification is that, it is adaptable to
changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


 Clustering analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer
base. And they can characterize their customer groups based on the purchasing
patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
 Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information
discovery.
 Clustering is also used in outlier detection applications such as detection of
credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining


The following points throw light on why clustering is required in data mining −
 Scalability − We need highly scalable clustering algorithms to deal with large
databases.
 Ability to deal with different kinds of attributes − Algorithms should be
capable to be applied on any kind of data such as interval-based (numerical)
data, categorical, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm
should be capable of detecting clusters of arbitrary shape. They should not be
bound to distance measures that tend to find only small, spherical
clusters.
 High dimensionality − The clustering algorithm should not only be able to
handle low-dimensional data but also the high dimensional space.
 Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead to
poor quality clusters.
 Interpretability − The clustering results should be interpretable,
comprehensible, and usable.

Clustering Methods
Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it
will classify the data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember −
 For a given number of partitions (say k), the partitioning method will create an
initial partitioning.
 Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to other.
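
k-means is the best-known partitioning method and follows exactly this pattern; a minimal scikit-learn sketch on toy data:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)          # each object belongs to exactly one of the k groups
print(km.cluster_centers_) # the representative of each partition
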

Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −

 Agglomerative Approach
 Divisive Approach

Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keeps on doing so until all of the groups are merged into one or
until the termination condition holds.

Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each iteration, a cluster is split up into
smaller clusters. This is done until each object is in its own cluster or the termination condition
holds. This method is rigid, i.e., once a merging or splitting is done, it can never be
undone.

Approaches to Improve Quality of Hierarchical Clustering


Here are the two approaches that are used to improve the quality of hierarchical
clustering −
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-
clustering on the micro-clusters.

Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some threshold,
i.e., for each data point within a given cluster, the radius of a given cluster has to
contain at least a minimum number of points.
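
DBSCAN is a widely used density-based method built on this idea; a minimal scikit-learn sketch (toy data, hand-picked eps and min_samples):

from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 1], [1.1, 1.0], [0.9, 1.1],   # a dense region
              [8, 8], [8.1, 8.0], [7.9, 8.1],   # another dense region
              [4, 20]])                         # an isolated noise point
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # noise points are labelled -1
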

Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite
number of cells that form a grid structure.
Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the quantized
space.

Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for
a given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.

Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or
the application requirement.
Data Mining - Rule Based Classification

IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following form −
IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes
    THEN buy_computer = yes

Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part, the condition, consists of one or more attribute tests and
these tests are logically ANDed.
 The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) => (buy_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
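
Expressed directly as code, rule R1 might look like the following sketch (the attribute names follow the example above):

def rule_r1(tup):
    # Antecedent: (age = youth) AND (student = yes)
    if tup["age"] == "youth" and tup["student"] == "yes":
        return "buy_computer = yes"   # consequent (class prediction)
    return None  # R1 does not fire; another rule or a default class applies

print(rule_r1({"age": "youth", "student": "yes"}))  # antecedent satisfied
print(rule_r1({"age": "senior", "student": "no"}))  # rule does not cover this tuple
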

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.
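
For example, with scikit-learn one can train a decision tree and print one IF-THEN rule per root-to-leaf path; the tiny encoded dataset below is made up for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded features: [age_is_youth, is_student]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["yes", "no", "yes", "no"]   # buy_computer

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Each printed root-to-leaf path is one rule: the splits along the path are
# ANDed to form the antecedent, and the leaf class is the consequent.
print(export_text(tree, feature_names=["age_is_youth", "is_student"]))
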

Rule Induction Using Sequential Covering Algorithm


The sequential covering algorithm can be used to extract IF-THEN rules from the training
data. We do not need to generate a decision tree first. In this algorithm, each rule
for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the
general strategy, the rules are learned one at a time. Each time a rule is learned, the
tuples covered by the rule are removed and the process continues for the rest of the
tuples.
Note − Decision tree induction, by contrast, can be considered as learning a set of rules
simultaneously, since the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one
class at a time. When learning a rule for a class Ci, we want the rule to cover all the
tuples from class Ci only and no tuples from any other class.
Algorithm: Sequential Covering

Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.

Output: A Set of IF-THEN rules.


Method:
Rule_set = { };  // initial set of rules learned is empty

for each class c do

   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
      Rule_set = Rule_set + Rule;  // add the new rule to the rule set
   until termination condition;

end for
return Rule_Set;
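
The sketch below is a deliberately simplified Python rendering of this strategy. Learn_One_Rule here is a crude greedy search over single attribute tests, not AQ, CN2 or RIPPER; it only illustrates the "learn a rule, remove the covered tuples, repeat" loop:

def covers(rule, tup):
    return all(tup[a] == v for a, v in rule)

def learn_one_rule(data, attributes, c):
    """Greedy: pick the single attribute test whose coverage is purest for class c."""
    best, best_acc = None, -1.0
    for attr in attributes:
        for val in {t[attr] for t, _ in data}:
            rule = [(attr, val)]
            covered = [(t, lbl) for t, lbl in data if covers(rule, t)]
            if covered:
                acc = sum(lbl == c for _, lbl in covered) / len(covered)
                if acc > best_acc:
                    best, best_acc = rule, acc
    return best

def sequential_covering(D, attributes, classes):
    rule_set = []
    for c in classes:
        data = list(D)
        while any(lbl == c for _, lbl in data):      # termination condition
            rule = learn_one_rule(data, attributes, c)
            if rule is None:
                break
            rule_set.append((rule, c))
            # remove the tuples covered by the new rule
            data = [(t, lbl) for t, lbl in data if not covers(rule, t)]
    return rule_set

D = [({"age": "youth",  "student": "yes"}, "yes"),
     ({"age": "youth",  "student": "no"},  "no"),
     ({"age": "senior", "student": "yes"}, "yes"),
     ({"age": "senior", "student": "no"},  "no")]

print(sequential_covering(D, ["age", "student"], ["yes", "no"]))
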
Rule Pruning
A rule is pruned due to the following reasons −
 The assessment of quality is made on the original set of training data. The rule
may perform well on training data but less well on subsequent data. That's why
rule pruning is required.
 The rule is pruned by removing a conjunct. The rule R is pruned if the pruned
version of R has greater quality, as assessed on an independent set of
tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R,
respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.

OLAP Tools And The Internet


The most comprehensive developments in computing have been the Internet and data
warehousing; thus the integration of these two giant technologies is a necessity. The
advantages of using the Web for access are evident (Reference 3). These advantages are:
 The Internet provides connectivity between countries, acting as a free resource.
 The Web eases the administrative tasks of managing scattered locations.
 The Web allows users to store and manage data and applications on servers that
can be managed, maintained and updated centrally.
These reasons indicate the importance of the Web in data storage and manipulation.
Web-enabled data access has many significant features, such as:
 The first
 The second
 The emerging third
 HTML publishing
 Helper applications
 Plug-ins
 Server-centric components
 Java and ActiveX applications
The primary key to the decision-making process is the amount of data collected and
how well this data is interpreted. Nowadays, managers aren't satisfied by getting direct
answers to their direct questions. Instead, due to market growth and the increase in
clients, their questions have become more complicated. A question such as "How much profit
did we make from selling our products at our different centers per month?" isn't as simple
to answer directly; it needs analysis of three fields in order to obtain an answer.

The decision-making process:

1. Identify information about the problem.
2. Analyze the problem.
3. Collect and evaluate the information, and define alternative solutions.

The decision-making process exists at the different levels of an organization. The speed
and simplicity of gathering data and the ability to convert this data into information is the
main element in the decision process. That's why the term Business Intelligence has
evolved.
 
Business Intelligence
As mentioned earlier, Business Intelligence is concerned with gathering data and
converting this data into information so as to make better decisions. How well the
data is gathered and how well it is interpreted as information is one of the most
important elements in a successful business.
 
Elements of Business Intelligence
There are three main components in Business Intelligence:
 Data Warehouse: a collection of data to support management decision making. It
revolves around the major subjects of the business to support the management.
 OLAP: used to generate complex queries over the multidimensional collection of data in
the data warehouse.
 Data Mining: consists of various techniques that explore and bring out complex
relationships in very large data sets.
The relation between the three components is represented in the next figure.
 

Data Warehousing - OLAP


Online Analytical Processing (OLAP) servers are based on the multidimensional data
model. They allow managers and analysts to get an insight into the information through
fast, consistent, and interactive access to information. This chapter covers the types of
OLAP servers, OLAP operations, and the differences between OLAP, statistical databases,
and OLTP.

Types of OLAP Servers


We have four types of OLAP servers −

 Relational OLAP (ROLAP)


 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses relational or extended-
relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.


 Optimization for each DBMS back end.
 Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views
of data. With multidimensional data stores, the storage utilization may be low if the
data set is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.

Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed
data to be stored, while the aggregations are stored separately in the MOLAP store.

Specialized SQL Servers


Specialized SQL servers provide advanced query language and query processing
support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data.
Here is the list of OLAP operations −

 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −

 By climbing up a concept hierarchy for a dimension


 By dimension reduction
The following diagram illustrates how roll-up works.
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed, one or more dimensions from the data cube are removed.
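As a rough illustration of the two roll-up methods above, the following Python sketch
(using the pandas library, assumed to be available; the table and values are made up)
aggregates a small city-level sales table up the location hierarchy, and then rolls up
again by dropping the time dimension entirely.

import pandas as pd

# A tiny fact table: one row per (city, quarter) with a sales measure.
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "New York", "Chicago"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "sales":   [605, 825, 1087, 854],
})

# Roll-up by climbing the location hierarchy: aggregate city-level rows
# up to the country level.
rollup_by_hierarchy = sales.groupby(["country", "quarter"], as_index=False)["sales"].sum()

# Roll-up by dimension reduction: drop the time dimension entirely.
rollup_by_reduction = sales.groupby("country", as_index=False)["sales"].sum()

print(rollup_by_hierarchy)
print(rollup_by_reduction)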

Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following
ways −

 By stepping down a concept hierarchy for a dimension


 By introducing a new dimension.
The following diagram illustrates how drill-down works −
 Drill-down is performed by stepping down a concept hierarchy for the dimension
time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to
the level of month.
 When drill-down is performed, one or more dimensions from the data cube are
added.
 It navigates the data from less detailed data to highly detailed data.

Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
 Here slice is performed for the dimension "time" using the criterion time = "Q1".
 It forms a new sub-cube by fixing that single dimension.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.

The dice operation on the cube, based on the following selection criteria, involves three
dimensions −

 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")
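The slice and dice criteria above can be sketched in Python as simple boolean filters
over a flattened cube table; pandas is assumed to be available, and the data values are
illustrative only.

import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Vancouver", "Toronto", "New York"],
    "time":     ["Q1", "Q2", "Q3", "Q1"],
    "item":     ["Mobile", "Modem", "Phone", "Mobile"],
    "sales":    [605, 825, 14, 1087],
})

# Slice: fix a single dimension (time = "Q1") to get a sub-cube.
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: restrict two or more dimensions at once, mirroring the
# selection criteria listed above.
dice = cube[
    cube["location"].isin(["Toronto", "Vancouver"])
    & cube["time"].isin(["Q1", "Q2"])
    & cube["item"].isin(["Mobile", "Modem"])
]

print(slice_q1)
print(dice)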

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order
to provide an alternative presentation of data. Consider the following diagram that
shows the pivot operation.

OLAP vs OLTP

1. OLAP involves historical processing of information; OLTP involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
3. OLAP is useful in analyzing the business; OLTP is useful in running the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star, Snowflake, and Fact Constellation schemas; OLTP is based on the Entity Relationship Model.
6. OLAP contains historical data; OLTP contains current data.
7. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
8. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
9. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
10. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
11. OLAP database size ranges from 100 GB to 1 TB; OLTP database size ranges from 100 MB to 1 GB.
12. OLAP is highly flexible; OLTP provides high performance.

What is a Multidimensional Schema?

A multidimensional schema is especially designed to model data warehouse systems. These
schemas are designed to address the unique needs of very large databases designed for
analytical purposes (OLAP).

Types of Data Warehouse Schema:

Following are the three chief types of multidimensional schemas, each having its unique
advantages.

 Star Schema
 Snowflake Schema
 Galaxy Schema (Fact Constellation)
What is a Star Schema?

The star schema is the simplest type of data warehouse schema. It is known as a star
schema because its structure resembles a star. In the star schema, the center of the star
has one fact table and a number of associated dimension tables. It is also known as the
Star Join Schema and is optimized for querying large data sets.

For example, as shown in the image above, the fact table is at the center and contains
keys to every dimension table, such as Deal_ID, Model ID, Date_ID, Product_ID, and
Branch_ID, along with other attributes like units sold and revenue.

Characteristics of Star Schema:

 Every dimension in a star schema is represented by only one dimension table.
 The dimension table contains a set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized. For instance, in the figure above,
Country_ID does not have a Country lookup table as an OLTP design would have.
 The schema is widely supported by BI tools.
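To make the single-join structure above concrete, here is a minimal Python sketch using
the standard-library sqlite3 module; the table and column names (fact_sales, dim_product,
dim_date) are hypothetical and not taken from the figure.

import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Modem'), (2, 'Mobile')")
conn.execute("INSERT INTO dim_date VALUES (10, 2020, 1), (11, 2020, 2)")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 5, 250.0), (2, 11, 3, 900.0)")

# A single join per dimension answers a typical reporting query.
rows = conn.execute("""
    SELECT p.product_name, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.product_name, d.year
""").fetchall()
print(rows)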

What is a Snowflake Schema?

A snowflake schema is an extension of a star schema in which the dimension tables are
normalized, splitting their data into additional tables. It is called a snowflake because
its diagram resembles a snowflake. In the following example, Country is further
normalized into an individual table.

Characteristics of Snowflake Schema:

 The main benefit of the snowflake schema is that it uses smaller disk space.
 It is easier to implement when a dimension is added to the schema.
 Query performance is reduced because of the additional joins across multiple tables.
 The primary challenge of the snowflake schema is that the additional lookup tables
require more maintenance effort.

Star Vs Snowflake Schema: Key Differences

 Hierarchies: in a star schema, hierarchies for the dimensions are stored in the
dimension table; in a snowflake schema, hierarchies are divided into separate tables.
 Structure: a star schema contains a fact table surrounded by dimension tables; a
snowflake schema has one fact table surrounded by dimension tables, which are in turn
surrounded by further dimension tables.
 Joins: in a star schema, a single join creates the relationship between the fact table
and any dimension table; a snowflake schema requires many joins to fetch the data.
 Design: a star schema has a simple DB design; a snowflake schema has a very complex DB
design.
 Normalization: a star schema has a denormalized data structure, so queries run faster;
a snowflake schema has a normalized data structure.
 Redundancy: a star schema has a high level of data redundancy; a snowflake schema has
very low data redundancy.
 Dimension data: in a star schema, a single dimension table contains the aggregated
data; in a snowflake schema, the data is split into different dimension tables.
 Cube processing: faster with a star schema; possibly slower with a snowflake schema
because of the complex joins.
 Query optimization: a star schema offers higher-performing queries using star-join
query optimization, and tables may be connected with multiple dimensions; a snowflake
schema is represented by a centralized fact table which is unlikely to be connected with
multiple dimensions.

What is a Star Cluster Schema?

A snowflake schema contains fully expanded hierarchies. However, this can add complexity
to the schema and requires extra joins. On the other hand, a star schema contains fully
collapsed hierarchies, which may lead to redundancy. So, the best solution may be a
balance between these two schemas, which is the star cluster schema design.

Overlapping dimensions can be found as forks in hierarchies. A fork happens when an
entity acts as a parent in two different dimensional hierarchies. Fork entities are then
identified as classification entities with one-to-many relationships.

Star Schemas for the Multidimensional Model

The simplest data warehouse schema is the star schema, because its structure resembles a
star. A star schema consists of data in the form of facts and dimensions. The fact table
sits at the center of the star, and the points of the star are the dimension tables.

In a star schema, the fact table contains a large amount of data with no redundancy. Each
dimension table is joined to the fact table using a primary or foreign key.

Fact Tables
A fact table has two types of columns: columns of foreign keys (pointing to the dimension
tables) and columns of numeric measure values.

Dimension Tables
A dimension table is generally small in size compared to a fact table. The primary key of
a dimension table is a foreign key in the fact table.

Examples of dimension tables are:

Time dimension table

Product dimension table

Employee dimension table

Geography dimension table

The main characteristics of the star schema are that it is easy to understand and that
only a small number of tables need to be joined.

Snowflake Schemas for the Multidimensional Model

The snowflake schema is more complex than the star schema because the dimension tables of
the snowflake are normalized.

The snowflake schema is represented by a centralized fact table which is connected to
multiple dimension tables, and these dimension tables can be normalized into additional
dimension tables.

The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model are normalized to reduce redundancy.
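The normalization step that distinguishes a snowflake dimension can be sketched in Python
with pandas (assumed to be available); the location and country tables below are
illustrative only.

import pandas as pd

# Denormalized (star-style) location dimension.
dim_location_star = pd.DataFrame({
    "location_id": [1, 2, 3],
    "city":        ["Toronto", "Vancouver", "New York"],
    "country":     ["Canada", "Canada", "USA"],
})

# Snowflaked version: country is split into its own lookup table and
# referenced by a key, removing the repeated country names.
dim_country = (dim_location_star[["country"]].drop_duplicates()
               .reset_index(drop=True))
dim_country["country_id"] = dim_country.index + 1

dim_location_snow = dim_location_star.merge(dim_country, on="country")[
    ["location_id", "city", "country_id"]
]

# Reassembling the original view now needs an extra join.
rebuilt = dim_location_snow.merge(dim_country, on="country_id")
print(rebuilt)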

Difference Between FP-Growth and the Apriori Algorithm

The FP-growth algorithm and the Apriori algorithm are both used for mining frequent
itemsets for Boolean association rules.

The differences between the FP-growth algorithm and the Apriori algorithm are given below.

FP-Growth Algorithm

FP-growth (frequent pattern growth) is an improvement over the Apriori algorithm. The
FP-growth algorithm is used for finding frequent itemsets in a transaction database
without candidate generation.

FP-growth represents frequent items in a frequent pattern tree, or FP-tree.

Advantages of the FP-growth algorithm:

1. Faster than the Apriori algorithm

2. No candidate generation

3. Only two passes over the dataset

Disadvantages of the FP-growth algorithm:

1. The FP-tree may not fit in memory

2. The FP-tree is expensive to build

Apriori Algorithm
Apriori is a classic algorithm for mining frequent itemsets for Boolean association rules.

It uses a bottom-up, level-wise approach and is designed for finding association rules in
a database that contains transactions (a minimal sketch of this approach follows below).

Advantages of the Apriori algorithm:

1. Easy to implement

2. Uses the large itemset property

Disadvantages of the Apriori algorithm:

1. Requires many database scans

2. Very slow
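To make the contrast concrete, here is a minimal pure-Python sketch of the level-wise
Apriori idea; it is not an optimized implementation, and the usual candidate-pruning step
is omitted for brevity. Note how each level makes another pass over the transaction list,
which corresponds to the many database scans listed as a disadvantage above.

from itertools import combinations

def apriori(transactions, min_support):
    # Minimal Apriori sketch: returns frequent itemsets and their support.
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Start with candidate 1-itemsets.
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # One pass over the database per level to count support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates by joining frequent k-itemsets.
        prev = list(level)
        candidates = list({a | b for a in prev for b in prev if len(a | b) == k + 1})
        k += 1
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
for itemset, support in apriori(transactions, min_support=0.5).items():
    print(set(itemset), support)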

Data Mining Techniques

Extracting important knowledge from a very large amount of data can be crucial to
organizations for the process of decision-making.

Some data mining techniques are :-

1 Association

2 Classification

3 Clustering

4 Sequential patterns

5 Decision tree.

1 Association Technique
The association technique helps to find patterns in huge data sets based on relationships
between two or more items of the same transaction. The association technique is used in
market basket analysis; it helps us analyze people's buying habits.

For example, you might identify that a customer always buys ice cream whenever he comes
to watch a movie, so it is likely that when the customer comes to watch a movie again he
will also want to buy ice cream.

2 Classification Technique
Classification is the most common data mining technique. In the classification method we
use mathematical techniques such as decision trees, neural networks, and statistics in
order to predict the class of unknown records. This technique helps in deriving important
information from data.

Assume you have a set of records, where each record contains a set of attributes;
depending on these attributes you will be able to predict the class of unseen or unknown
records. For example, given all records of employees who left the company, with the
classification technique you can predict who will probably leave the company in a future
period.
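A minimal sketch of the employee-attrition example above, assuming scikit-learn is
available; the feature encoding (years at company, salary band, overtime hours) and the
label values are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical employee records: [years_at_company, salary_band, overtime_hours]
X = [[1, 2, 30], [8, 4, 5], [2, 1, 25], [10, 5, 2], [3, 2, 20], [7, 4, 8]]
y = [1, 0, 1, 0, 1, 0]  # 1 = left the company, 0 = stayed

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Predict whether an unseen employee is likely to leave.
print(model.predict([[2, 2, 28]]))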

3 Clustering Technique
Clustering is one of the oldest techniques used in data mining. The main aim of the
clustering technique is to make clusters (groups) from pieces of data that share common
characteristics. The clustering technique helps to identify the differences and
similarities between the data.

Take the example of a shop in which many items are for sale; the challenge is how to
arrange those items so that customers can easily find what they need. Using the
clustering technique, you can keep items that share some similarities in one corner and
items with different characteristics in another.
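The shop example above can be sketched with k-means clustering, one common clustering
algorithm; scikit-learn is assumed to be available, and the item features (price, weekly
sales) are made up.

from sklearn.cluster import KMeans

# Hypothetical items described by [price, weekly_sales]; clustering groups
# items with similar characteristics so they can be shelved together.
items = [[2, 150], [3, 140], [2.5, 160], [40, 5], [45, 8], [42, 6]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(items)
print(kmeans.labels_)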

4 Sequential patterns
Sequential patterns are a useful method for identifying trends and similar patterns.

For example, if you identify from customer data that a customer buys a particular product
at a particular time of year, you can use this information to suggest that product to the
customer at that time of year.

5 Decision tree
The decision tree is one of the most commonly used data mining techniques because its
model is easy for users to understand. In a decision tree you start with a simple question
which has two or more answers. Each answer leads to further questions which help us to
make a final decision. The root node of the decision tree is a simple question.

Take the example of a flood warning system.

First check the water level: if the water level is > 50 ft then an alert is sent; if the
water level is < 50 ft, check again: if the water level is > 30 ft then a warning is
sent, and if the water level is < 30 ft then the water is in the normal range.
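Because the flood-warning tree above is so small, it can be written directly as nested
conditions. A minimal Python sketch:

def flood_alert(water_level_ft):
    # Sketch of the flood-warning decision tree described above.
    if water_level_ft > 50:
        return "alert"
    elif water_level_ft > 30:
        return "warning"
    else:
        return "normal"

for level in (55, 40, 20):
    print(level, flood_alert(level))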

What is a Data Warehouse?

Data mining refers to the extraction of information from large amounts of data; this
information is stored in data sources such as databases and data warehouses.

A data warehouse is a place which stores information collected from multiple sources
under a unified schema. Information stored in a data warehouse is critical to
organizations for the process of decision-making.

Relational database management systems, ERP systems, flat files, and CRM systems are some
of the sources from which a data warehouse gathers information.

Data warehouses reside on servers and run a database management system (DBMS) such as SQL
Server, with load software such as SQL Server Integration Services (SSIS) to pull data
from the sources into the data warehouse.

A data warehouse usually stores many months or years of data to support historical
analysis.

Data warehouses are built in order to help users understand and enhance their
organization's performance.

Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, and data reduction.

Data Warehouse Feature

The idea behind data warehouse is to create a permanent storage space for the data
needed to support reporting, analysis, dashboards, and other BI functions.

The key features of a data warehouse are:-


1 Subject oriented:-
A data warehouse is subject oriented; it provides a simple view around a particular
subject by excluding data that is not useful in the decision support process. These
subjects can be products, customers, suppliers, sales, revenue, etc.

2 Integrated:-
A data warehouse stores integrated data collected from heterogeneous sources such as
relational databases, flat files, etc. Data integration techniques are applied to ensure
consistency in naming conventions, encoding structures, attribute measures, etc.

3 Time-variant:-
A data warehouse usually stores many months or years of data to support historical
analysis. Data stored in a data warehouse provides information from a historical
perspective.

4 Non-volatile:-
A data warehouse is non-volatile, which means that when new data is added, old data is
not erased.

Data Warehouse Design Considerations


 
Data Warehouse Design Considerations. This presentation covers the following topics :
 Slowly Changing Dimensions
 Understanding Indexing
 Indexing the Data Warehouse
 Understanding Index Views
 Understanding Data Compression
 Data Lineage
 Using Partitions
 Identifying Fact / Dimension Tables

7 Challenges to Consider when Building a Data Warehouse

1. Data Quality – In a data warehouse, data is coming from many disparate


sources from all facets of an organization.  When a data warehouse tries to combine
inconsistent data from disparate sources, it encounters errors.  Inconsistent data,
duplicates, logic conflicts, and missing data all result in data quality challenges. 
Poor data quality results in faulty reporting and analytics necessary for optimal
decision making.
2. Understanding Analytics – The powerful analytics tools and reports available
through integrated data will provide credit union leaders with the ability to make
precise decisions that impact the future success of their organizations.  When
building a data warehouse, analytics and reporting will have to be taken into design
considerations.  In order to do this, the business user will need to know exactly
what analysis will be performed.  Envisioning these reports will be difficult for
someone that hasn’t yet utilized a BI strategy and is unaware of its capabilities and
limitations.
3. Quality Assurance – The end user of a data warehouse is using Big Data
reporting and analytics to make the best decisions possible.  Consequently, the data
must be 100 percent accurate or a credit union leader could make ill-advised
decisions that are detrimental to the future success of their business.  This high
reliance on data quality makes testing a high priority issue that will require a lot of
resources to ensure the information provided is accurate.  The credit union will have
to develop all of the steps required to complete a successful Software Testing Life
Cycle (STLC), which will be a costly and time intensive process.
4. Performance – Building a data warehouse is similar to building a car.  A car
must be carefully designed from the beginning to meet the purposes for which it is
intended. Yet, there are options each buyer must consider to make the vehicle truly
meet individual performance needs. A data warehouse must also be carefully
designed to meet overall performance requirements. While the final product can be
customized to fit the performance needs of the organization, the initial overall
design must be carefully thought out to provide a stable foundation from which to
start.
5. Designing the Data Warehouse – People generally don’t want to “waste” their
time defining the requirements necessary for proper data warehouse design. 
Usually, there is a high level perception of what they want out of a data warehouse.
However, they don’t fully understand all the implications of these perceptions and,
therefore, have a difficult time adequately defining them.  This results in
miscommunication between the business users and the technicians building the
data warehouse.  The typical end result is a data warehouse which does not deliver
the results expected by the user. Since the data warehouse is inadequate for the
end user, there is a need for fixes and improvements immediately after initial
delivery.  The unfortunate outcome is greatly increased development fees.
6. User Acceptance – People are not keen to changing their daily routine
especially if the new process is not intuitive.  There are many challenges to
overcome to make a data warehouse that is quickly adopted by an organization. 
Having a comprehensive user training program can ease this hesitation but will
require planning and additional resources.
7. Cost – A frequent misconception among credit unions is that they can build data
warehouse in-house to save money. As the foregoing points emphasize, there are a
multitude of hidden problems in building data warehouses. Even if a credit union
adds a data warehouse “expert” to their staff, the depth and breadth of skills
needed to deliver an effective result is simply not feasible with one or a few
experienced professionals leading a team of non-BI trained technicians. The harsh
reality is an effective do-it-yourself effort is very costly.

Advantages of Star Schema –


1. Simpler Queries:
The join logic of a star schema is quite simple compared with the join logic required
to fetch data from a highly normalized transactional schema.
2. Simplified Business Reporting Logic:
Compared with a highly normalized transactional schema, the star schema simplifies
common business reporting logic, such as as-of reporting and period-over-period
reporting.
3. Feeding Cubes:
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In
fact, major OLAP systems provide a ROLAP mode of operation which can use a star schema
as a source without designing a cube structure.
Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly denormalized state.
2. Not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas do not support many-to-many relationships within business
entities – at least not directly.
Characteristics of the snowflake schema: The snowflake dimension model is used under the
following conditions:
 The snowflake schema uses small disk space.
 It is easy to implement when a dimension is added to the schema.
 There are multiple tables, so query performance is reduced.
 The dimension table consists of two or more sets of attributes which define
information at different grains.
 The sets of attributes of the same dimension table are populated by different
source systems.
Advantages: There are two main advantages of the snowflake schema, given below:
 It provides structured data which reduces the problem of data integrity.
 It uses small disk space because the data is highly structured.
Disadvantages:
 Snowflaking reduces the space consumed by dimension tables, but compared with
the entire data warehouse the saving is usually insignificant.
 Avoid snowflaking or normalization of a dimension table unless it is required and
appropriate.
 Do not snowflake hierarchies of one dimension table into separate tables.
Hierarchies should belong to the dimension table only and should never be
snowflaked.
 Multiple hierarchies can belong to the same dimension if the dimension has been
designed at the lowest possible level of detail.

Data Mining In Healthcare

Data mining holds great potential for the healthcare industry, enabling health
systems to systematically use data and analytics to identify inefficiencies and
best practices that improve care and reduce costs. Like analytics and
business intelligence, the term data mining can mean different things to
different people.

The purpose of data mining, whether it’s being used in healthcare or


business, is to identify useful and understandable patterns by  analyzing large
sets of data . These data patterns help predict industry or information trends,
and then determine what to do about them.

In the healthcare industry specifically, data mining can be used to decrease


costs by increasing efficiencies, improve patient quality of life, and perhaps
most importantly, save the lives of more patients.

Proven Applications of Data Mining


Data mining has been used in many industries to improve customer
experience and satisfaction, and increase product safety and usability. In
healthcare, data mining has proven effective in areas such as predictive
medicine, customer relationship management, detection of fraud and abuse,
management of healthcare and measuring the effectiveness of certain
treatments.

Here is a short breakdown of two of these applications:

 Measuring Treatment Effectiveness – This application of data mining


involves comparing and contrasting symptoms, causes and courses of
treatment to find the most effective course of action for a certain illness
or condition. For example, patient groups who are treated with different
drug regimens can be compared to determine which treatment plans work
best and save the most money. Furthermore, the continued use of this
application could help standardize a method of treatment for specific
diseases, thus making the diagnosis and treatment process quicker and
easier.

 Detecting Fraud and Abuse – This involves establishing normal


patterns, then identifying unusual patterns of medical claims by clinics,
physicians, labs, or others. This application can also be used to identify
inappropriate referrals or prescriptions and insurance fraud and
fraudulent medical claims. The Texas Medicaid Fraud and Abuse Detection
System is a good example of a business using data mining to detect
fraud. In 1998, the organization recovered $2.2 million in stolen funds
and identified 1,400 suspects for investigation. To recognize its success,
the Texas system received a national award for its  innovative use of
technology .

Application of data mining in government sector

By collecting and analyzing public and private sector data, government data


mining is able to identify potential terrorists or other dangerous activities by unknown
individuals.

What is government data mining?

Gaining insight for decision making in a timely manner requires that companies be


able to take into account the enormous amount of information that is available
both online and stored in enterprise databases. However, simply accessing the
information is not enough to do the job. It requires software that can identify
hidden patterns and relationships among disparate pieces of information. In this
case, data mining can be a worthwhile technique.

Data mining programs differ in the technologies used to achieve operational goals, in


the sources of data used (government data, enterprise information and private data)
and in the formats (structured and unstructured). The areas that can benefit from the
use of data mining are also diverse: law enforcement, terrorism prevention, customs
control, financial transactions, and international trade.
Data mining for Government

Data mining applications are widely used in a variety of industries because


they enable quick and relatively inexpensive analysis for massive volumes of
data. This makes data mining an effective tool for a range of uses within the federal
government, which applies it for analyzing intelligence to reduce the risk of terrorist
attacks. Since the September 11, 2001 terrorist attacks, data mining has been
increasingly employed to help detect threats to national security.

Government Data Mining: Privacy issues

By collecting and analyzing public and private sector data, government data mining
is able to identify potential terrorists or other dangerous activities by
unknown individuals. However, this capability continues to raise concerns for private
citizens when it comes to the government’s access to personal data. In fact, in the
absence of specific laws for privacy, the government could have unlimited access to a
massive volume of personal data from behaviors, conversations and habits of people
who have done nothing to legitimize suspicions, and it could create a dramatic increase
in government monitoring of individuals.

The privacy issues raised concern the quality and accuracy of the mined data, how it is
used (especially uses outside the original purview), protection of the data , and the
right of individuals to know that their personal information is being collected and how it
is being used.

On the other side, the lack of legal rules for government data mining restricts the
ability of companies to use and share the information with governments in legal
ways, making the development of new data mining programs slower.

However, a better understanding of what government data mining is, why it is


used and who uses it could be useful in helping us find a solution to the ethical
implications of data mining that strikes a balance between privacy and its use for
national security.

 
What Is Hypothesis Testing?
Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst depends
on the nature of the data used and the reason for the analysis. Hypothesis testing is
used to infer the result of a hypothesis performed on sample data from a larger
population.

KEY TAKEAWAYS

 Hypothesis testing is used to infer the result of a hypothesis performed on


sample data from a larger population.
 The test tells the analyst whether or not his primary hypothesis is true.
 Statistical analysts test a hypothesis by measuring and examining a random
sample of the population being analyzed.
How Hypothesis Testing Works
In hypothesis testing, an analyst tests a statistical sample, with the goal of accepting or
rejecting a null hypothesis. The test tells the analyst whether or not his primary
hypothesis is true. If it isn't true, the analyst formulates a new hypothesis to be tested,
repeating the process until data reveals a true hypothesis.

Statistical analysts test a hypothesis by measuring and examining a random sample of


the population being analyzed. All analysts use a random population sample to test two
different hypotheses: the null hypothesis and the alternative hypothesis.

The null hypothesis is the default assumption, for example that a population parameter
equals some claimed value. The alternative hypothesis is effectively the opposite of the
null hypothesis. Thus, they are mutually exclusive, and only one can be true. However, one
of the two hypotheses will always be true.

Four Steps of Hypothesis Testing


All hypotheses are tested using a four-step process:

1. The first step is for the analyst to state the two hypotheses so that only one can
be right.
2. The next step is to formulate an analysis plan, which outlines how the data will
be evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either accept or reject the
null hypothesis.
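As a small illustration of these four steps, the sketch below states a null hypothesis
about a population mean, analyzes a sample with a one-sample t-test, and then decides
whether to reject the null. SciPy is assumed to be available; the sample values and the
0.05 significance level are illustrative choices.

from scipy import stats

# Hypothetical sample: observed daily sales.
sample = [102, 98, 110, 95, 107, 101, 99, 104]

# Null hypothesis: the population mean is 100. The test returns the
# t statistic and a p-value; a small p-value leads to rejecting the null.
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(t_stat, p_value)
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")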

Data Mining - Bayesian Classification


Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the
statistical classifiers. Bayesian classifiers can predict class membership probabilities
such as the probability that a given tuple belongs to a particular class.

Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −

 Posterior Probability [P(H/X)]


 Prior Probability [P(H)]
where X is data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H/X)= P(X/H)P(H) / P(X)
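A one-line Python sketch of the formula above; the probability values passed in are
illustrative only.

def posterior(p_x_given_h, p_h, p_x):
    # P(H/X) = P(X/H) * P(H) / P(X), as stated above.
    return p_x_given_h * p_h / p_x

# Illustrative numbers: likelihood P(X/H)=0.6, prior P(H)=0.3, evidence P(X)=0.45.
print(posterior(0.6, 0.3, 0.45))  # 0.4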

Bayesian Belief Network


Bayesian Belief Networks specify joint conditional probability distributions. They are
also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
 A Belief Network allows class conditional independencies to be defined between
subsets of variables.
 It provides a graphical model of causal relationship on which learning can be
performed.
 We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −

 Directed acyclic graph


 A set of conditional probability tables

Directed Acyclic Graph

 Each node in a directed acyclic graph represents a random variable.


 These variables may be discrete or continuous valued.
 These variables may correspond to the actual attribute given in the data.

Directed Acyclic Graph Representation


The following diagram shows a directed acyclic graph for six Boolean variables.

The arc in the diagram allows representation of causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as whether or
not the person is a smoker. It is worth noting that the variable PositiveXray is
independent of whether the patient has a family history of lung cancer or that the
patient is a smoker, given that we know the patient has lung cancer.

Conditional Probability Table


The conditional probability table for the values of the variable LungCancer (LC)
showing each possible combination of the values of its parent nodes, FamilyHistory
(FH), and Smoker (S) is as follows −
Perceptron
A perceptron is a simple model of a biological neuron in an artificial neural network.
Perceptron is also the name of an early algorithm for supervised learning of binary
classifiers.

The perceptron algorithm was designed to classify visual inputs, categorizing subjects
into one of two types and separating groups with a line. Classification is an important
part of machine learning and image processing. Machine learning algorithms find and
classify patterns by many different means. The perceptron algorithm classifies patterns
and groups by finding the linear separation between different objects and patterns that
are received through numeric or visual input.

The perceptron algorithm was developed at Cornell Aeronautical Laboratory in 1957,


funded by the United States Office of Naval Research. The algorithm was the first step
planned for a machine implementation for image recognition. The machine, called Mark
1 Perceptron, was physically made up of an array of 400 photocells connected to
perceptrons whose weights were recorded in potentiometers, as adjusted by electric
motors. The machine was one of the first artificial neural networks ever created.

At the time, the perceptron was expected to be very significant for the development of
artificial intelligence (AI). While high hopes surrounded the initial perceptron, technical
limitations were soon demonstrated. Single-layer perceptrons can only separate classes
if they are linearly separable. Later on, it was discovered that by using multiple layers,
perceptrons can classify groups that are not linearly separable, allowing them to solve
problems single layer algorithms can’t solve.
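A minimal NumPy sketch of the single-layer perceptron learning rule described above,
trained on a small linearly separable data set; NumPy is assumed to be available, and the
learning rate and epoch count are arbitrary choices.

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    # Minimal single-layer perceptron for linearly separable data.
    # y must contain labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update weights only on misclassified points.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# AND-like, linearly separable toy data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # predicted labels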

What is Data Mart?

A data mart is focused on a single functional area of an organization and contains a


subset of data stored in a Data Warehouse.

A data mart is a condensed version of Data Warehouse and is designed for use by a
specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR
or finance. It is often controlled by a single department in an organization.

Data Mart usually draws data from only a few sources compared to a Data warehouse.
Data marts are small in size and are more flexible compared to a Datawarehouse.

Why do we need Data Mart?

 Data Mart helps to enhance user's response time due to reduction in volume of
data
 It provides easy access to frequently requested data.
 Data mart are simpler to implement when compared to corporate
Datawarehouse. At the same time, the cost of implementing Data Mart is
certainly lower compared with implementing a full data warehouse.
 Compared to Data Warehouse, a datamart is agile. In case of change in model,
datamart can be built quicker due to a smaller size.
 A Datamart is defined by a single Subject Matter Expert. On the contrary data
warehouse is defined by interdisciplinary SME from a variety of domains. Hence,
Data mart is more open to change compared to Datawarehouse.
 Data is partitioned and allows very granular access control privileges.
 Data can be segmented and stored on different hardware/software platforms.

Type of Data Mart

There are three main types of data marts are:

1. Dependent: Dependent data marts are created by drawing data directly from
operational, external or both sources.
2. Independent: Independent data mart is created without the use of a central
data warehouse.
3. Hybrid: This type of data marts can take data from data warehouses or
operational systems.

Dependent Data Mart

A dependent data mart allows sourcing organization's data from a single Data
Warehouse. It offers the benefit of centralization. If you need to develop one or more
physical data marts, then you need to configure them as dependent data marts.

Dependent data marts can be built in two different ways: either where a user can access
both the data mart and the data warehouse, depending on need, or where access is limited
to the data mart only. The second approach is not optimal, as it produces what is
sometimes referred to as a data junkyard, where all data begins with a common source but
is then scrapped and mostly junked.
Independent Data Mart

An independent data mart is created without the use of central Data warehouse. This
kind of Data Mart is an ideal option for smaller groups within an organization.

An independent data mart has neither a relationship with the enterprise data
warehouse nor with any other data mart. In Independent data mart, the data is input
separately, and its analyses are also performed autonomously.

Implementation of independent data marts is antithetical to the motivation for building


a data warehouse. First of all, you need a consistent, centralized store of enterprise
data which can be analyzed by multiple users with different interests who want widely
varying information.

Hybrid data Mart:

A hybrid data mart combines input from sources apart from Data warehouse. This could
be helpful when you want ad-hoc integration, like after a new group or product is added
to the organization.

It is best suited for multiple database environments and fast implementation turnaround
for any organization. It also requires the least data cleansing effort. A hybrid data
mart also supports large storage structures and is a flexible fit for smaller
data-centric applications.
Steps in Implementing a Datamart

Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed
steps to implement a Data Mart:

Designing

Designing is the first phase of Data Mart implementation. It covers all the tasks
between initiating the request for a data mart to gathering information about the
requirements. Finally, we create the logical and physical design of the data mart.

The design step involves the following tasks:

 Gathering the business & technical requirements and Identifying data sources.
 Selecting the appropriate subset of data.
 Designing the logical and physical structure of the data mart.

Data could be partitioned based on following criteria:

 Date
 Business or Functional Unit
 Geography
 Any combination of above

Data could be partitioned at the application or DBMS level. Though it is recommended


to partition at the Application level as it allows different data models each year with the
change in business environment.

What Products and Technologies Do You Need?

A simple pen and paper would suffice. Though tools that help you create UML or ER
diagrams would also append meta data into your logical and physical designs.

Constructing

This is the second phase of implementation. It involves creating the physical database
and the logical structures.

This step involves the following tasks:

 Implementing the physical database designed in the earlier phase. For instance,
database schema objects like table, indexes, views, etc. are created.

What Products and Technologies Do You Need?

You need a relational database management system to construct a data mart. RDBMS
have several features that are required for the success of a Data Mart.

 Storage management: An RDBMS stores and manages the data to create, add,
and delete data.
 Fast data access: With a SQL query you can easily access data based on
certain conditions/filters.
 Data protection: The RDBMS system also offers a way to recover from system
failures such as power failures. It also allows restoring data from backups in
case the disk fails.
 Multiuser support: The data management system offers concurrent access, the
ability for multiple users to access and modify data without interfering or
overwriting changes made by another user.
 Security: The RDMS system also provides a way to regulate access by users to
objects and certain types of operations.

Populating:

In the third phase, data is populated in the data mart.

The populating step involves the following tasks:

 Source data to target data Mapping


 Extraction of source data
 Cleaning and transformation operations on the data
 Loading data into the data mart
 Creating and storing metadata

What Products and Technologies Do You Need?

You accomplish these population tasks using an ETL(Extract Transform Load)Tool. This
tool allows you to look at the data sources, perform source-to-target mapping, extract
the data, transform, cleanse it, and load it back into the data mart.

In the process, the tool also creates some metadata relating to things like where the
data came from, how recent it is, what type of changes were made to the data, and
what level of summarization was done.
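As a toy illustration of the populating tasks listed above (mapping, extraction,
cleaning/transformation, loading, and metadata), here is a small Python sketch using
pandas and the standard-library sqlite3 module, both assumed to be available; the source
data, column names, and the fact_sales table are hypothetical.

import sqlite3
import pandas as pd

# Extract: source data arriving as a CSV-like frame.
source = pd.DataFrame({
    "customer": [" alice ", "BOB", "Carol"],
    "amount":   ["100", "250", "75"],
})

# Transform: clean names, cast types, and add simple lineage metadata.
clean = source.assign(
    customer=source["customer"].str.strip().str.title(),
    amount=source["amount"].astype(float),
    load_source="sales_export.csv",   # illustrative metadata only
)

# Load: write the cleaned rows into the data mart table.
conn = sqlite3.connect(":memory:")
clean.to_sql("fact_sales", conn, index=False, if_exists="replace")
print(pd.read_sql("SELECT * FROM fact_sales", conn))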

Accessing

Accessing is the fourth step, which involves putting the data to use: querying the data,
creating reports and charts, and publishing them. End users submit queries to the
database and display the results of the queries.

The accessing step needs to perform the following tasks:

 Set up a meta layer that translates database structures and objects names into
business terms. This helps non-technical users to access the Data mart easily.
 Set up and maintain database structures.
 Set up API and interfaces if required

What Products and Technologies Do You Need?

You can access the data mart using the command line or GUI. GUI is preferred as it can
easily generate graphs and is user-friendly compared to the command line.

Managing

This is the last step of Data Mart Implementation process. This step covers
management tasks such as-

 Ongoing user access management.


 System optimizations and fine-tuning to achieve the enhanced performance.
 Adding and managing fresh data into the data mart.
 Planning recovery scenarios and ensure system availability in the case when the
system fails.

What Products and Technologies Do You Need?

You could use the GUI or command line for data mart management.

Best practices for Implementing Data Marts

Following are the best practices that you need to follow while in the Data Mart
Implementation process:
 The source of a Data Mart should be departmentally structured
 The implementation cycle of a Data Mart should be measured in short periods of
time, i.e., in weeks instead of months or years.
 It is important to involve all stakeholders in planning and designing phase as the
data mart implementation could be complex.
 Data Mart Hardware/Software, Networking and Implementation costs should be
accurately budgeted in your plan
 Even if the Data mart is created on the same hardware, it may need some
different software to handle user queries. Additional processing power and
disk storage requirements should be evaluated for fast user response.
 A data mart may be on a different location from the data warehouse. That's why
it is important to ensure that they have enough networking capacity to handle
the Data volumes needed to transfer data to the data mart.
 Implementation cost should budget the time taken for Datamart loading process.
Load time increases with increase in complexity of the transformations.

Advantages and Disadvantages of a Data Mart

Advantages

 Data marts contain a subset of organization-wide data. This Data is valuable to a


specific group of people in an organization.
 It is a cost-effective alternative to a data warehouse, which can take high costs to
build.
 Data Mart allows faster access of Data.
 Data Mart is easy to use as it is specifically designed for the needs of its users.
Thus a data mart can accelerate business processes.
 Data Marts need less implementation time compared to Data Warehouse
systems. It is faster to implement a Data Mart as you only need to concentrate on a
subset of the data.
 It contains historical data which enables the analyst to determine data trends.

Disadvantages

 Many times enterprises create too many disparate and unrelated data marts
without much benefit. They can become a big hurdle to maintain.
 Data Mart cannot provide company-wide data analysis as their data set is
limited.

Summary:

 A data mart is focused on a single functional area of an organization and


contains a subset of data stored in a Data Warehouse.
 Data Mart helps to enhance user's response time due to a reduction in the
volume of data.
 Three types of data mart are 1) Dependent 2) Independent 3) Hybrid
 Important implementation steps of Data Mart are 1) Designing 2) Constructing 3
Populating 4) Accessing and 5)Managing
 The implementation cycle of a Data Mart should be measured in short periods of
time, i.e., in weeks instead of months or years.
 A data mart is a cost-effective alternative to a data warehouse, which can take high
costs to build.
 Data Mart cannot provide company-wide data analysis as data set is limited.

OLTP vs OLAP: What's the Difference?

What is OLAP?

Online Analytical Processing, a category of software tools which provide analysis of data
for business decisions. OLAP systems allow users to analyze database information from
multiple database systems at one time.

The primary objective is data analysis and not data processing.

What is OLTP?

Online transaction processing shortly known as OLTP supports transaction-oriented


applications in a 3-tier architecture. OLTP administers day to day transaction of an
organization.

The primary objective is data processing and not data analysis

Example of OLAP

Any Datawarehouse system is an OLAP system. Uses of OLAP are as follows

 A company might compare their mobile phone sales in September with sales in
October, then compare those results with sales at another location, which may be
stored in a separate database.
 Amazon analyzes purchases by its customers to come up with a personalized
homepage with products likely to interest its customers.

Example of OLTP system

An example of OLTP system is ATM center. Assume that a couple has a joint account
with a bank. One day both simultaneously reach different ATM centers at precisely the
same time and want to withdraw total amount present in their bank account.

However, the person that completes authentication process first will be able to get
money. In this case, OLTP system makes sure that withdrawn amount will be never
more than the amount present in the bank account. The key point here is that OLTP
systems are optimized for transactional integrity rather than data analysis.

Other examples of OLTP system are:

 Online banking
 Online airline ticket booking
 Sending a text message
 Order entry
 Add a book to shopping cart

Benefits of using OLAP services

 OLAP creates a single platform for all type of business analytical needs which
includes planning, budgeting, forecasting, and analysis.
 The main benefit of OLAP is the consistency of information and calculations.
 Easily apply security restrictions on users and objects to comply with regulations
and protect sensitive data.

Benefits of OLTP method

 It administers daily transactions of an organization.


 OLTP widens the customer base of an organization by simplifying individual
processes.

Drawbacks of OLAP service

 Implementation and maintenance are dependent on IT professional because the


traditional OLAP tools require a complicated modeling procedure.
 OLAP tools need cooperation between people of various departments to be
effective, which might not always be possible.

Drawbacks of OLTP method

 If OLTP system faces hardware failures, then online transactions get severely
affected.
 OLTP systems allow multiple users to access and change the same data at the
same time, which can sometimes create conflicting situations.
Difference between OLTP and OLAP

 Process: OLTP is an online transactional system that manages database modification;
OLAP is an online analysis and data retrieval process.
 Characteristic: OLTP is characterized by large numbers of short online transactions;
OLAP is characterized by a large volume of data.
 Functionality: OLTP is an online database modifying system; OLAP is an online database
query management system.
 Method: OLTP uses a traditional DBMS; OLAP uses the data warehouse.
 Query: OLTP queries insert, update, and delete information in the database; OLAP
queries are mostly select operations.
 Table: tables in an OLTP database are normalized; tables in an OLAP database are not
normalized.
 Source: OLTP and its transactions are the sources of data; different OLTP databases
become the source of data for OLAP.
 Data integrity: an OLTP database must maintain data integrity constraints; an OLAP
database is not frequently modified, so data integrity is not an issue.
 Response time: OLTP response time is in milliseconds; OLAP response time ranges from
seconds to minutes.
 Data quality: the data in an OLTP database is always detailed and organized; the data
in an OLAP process might not be organized.
 Usefulness: OLTP helps to control and run fundamental business tasks; OLAP helps with
planning, problem-solving, and decision support.
 Operation: OLTP allows read/write operations; OLAP is mostly read-only and rarely
written to.
 Audience: OLTP is a customer-oriented process; OLAP is a market-oriented process.
 Query type: OLTP queries are standardized and simple; OLAP queries are complex and
involve aggregations.
 Back-up: OLTP needs complete backups of the data combined with incremental backups;
OLAP only needs a backup from time to time, and backup is less critical than for OLTP.
 Design: OLTP database design is application oriented (for example, the design changes
with the industry, such as retail, airline, or banking); OLAP database design is subject
oriented (for example, the design changes with subjects such as sales, marketing, or
purchasing).
 User type: OLTP is used by data-critical users such as clerks, DBAs, and database
professionals; OLAP is used by data-knowledge users such as workers, managers, and CEOs.
 Purpose: OLTP is designed for real-time business operations; OLAP is designed for
analysis of business measures by category and attribute.
 Performance metric: transaction throughput is the OLTP performance metric; query
throughput is the OLAP performance metric.
 Number of users: an OLTP database allows thousands of users; an OLAP database allows
only hundreds of users.
 Productivity: OLTP helps to increase users' self-service and productivity; OLAP helps
to increase the productivity of business analysts.
 Challenge: data warehouses historically have been development projects which may prove
costly to build; an OLAP cube is not an open SQL Server data warehouse, so technical
knowledge and experience are essential to manage the OLAP server.
 Process speed: OLTP provides fast results for daily-used data; OLAP ensures that
responses to queries are consistently quick.
 Ease of use: OLTP systems are easy to create and maintain; OLAP lets the user create a
view with the help of a spreadsheet.
 Style: OLTP is designed to have fast response time and low data redundancy, and is
normalized; a data warehouse is created uniquely so that it can integrate different data
sources for building a consolidated database.

Summary:

 Online Analytical Processing is a category of software tools that analyze data
stored in a database.
 Online transaction processing shortly known as OLTP supports transaction-
oriented applications in a 3-tier architecture
 OLAP creates a single platform for all type of business analysis needs which
includes planning, budgeting, forecasting, and analysis.
 OLTP is useful to administer day to day transactions of an organization.
 OLAP is characterized by a large volume of data.
 OLTP is characterized by large numbers of short online transactions.
 A data warehouse is created uniquely so that it can integrate different data
sources for building a consolidated database.
 An OLAP cube extends the two-dimensional spreadsheet view of data to three or more
dimensions for analysis.

What is OLAP (Online Analytical Processing): Cube, Operations & Types

What is Online Analytical Processing?

OLAP is a category of software that allows users to analyze information from multiple
database systems at the same time. It is a technology that enables analysts to extract
and view business data from different points of view. OLAP stands for Online Analytical
Processing.

Analysts frequently need to group, aggregate and join data. These operations in
relational databases are resource intensive. With OLAP data can be pre-calculated and
pre-aggregated, making analysis faster.

OLAP databases are divided into one or more cubes. The cubes are designed in such a
way that creating and viewing reports become easy.
OLAP cube:

At the core of the OLAP concept is the OLAP cube. The OLAP cube is a data structure
optimized for very quick data analysis.

The OLAP Cube consists of numeric facts called measures which are categorized by
dimensions. OLAP Cube is also called the hypercube.

Usually, data operations and analysis are performed using the simple spreadsheet,
where data values are arranged in row and column format. This is ideal for two-
dimensional data. However, OLAP contains multidimensional data, with data usually
obtained from a different and unrelated source. Using a spreadsheet is not an optimal
option. The cube can store and analyze multidimensional data in a logical and orderly
manner.

How does it work?

A Data warehouse would extract information from multiple data sources and formats
like text files, excel sheet, multimedia files, etc.

The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or
OLAP cube) where information is pre-calculated in advance for further analysis.

Basic analytical operations of OLAP

Four types of analytical operations in OLAP are:

1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)

1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The Roll-up operation can be
performed in 2 ways

1. Reducing dimensions
2. Climbing up concept hierarchy. Concept hierarchy is a system of grouping things
based on their order or level.

Consider the following diagram

 In this example, the cities New Jersey and Los Angeles are rolled up into the
country USA.
 The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively;
they become 2000 after roll-up.
 In this aggregation process, the location hierarchy moves up from city to
country.
 In the roll-up process at least one or more dimensions need to be removed. In
this example, the Quarter dimension is removed.

2) Drill-down

In drill-down data is fragmented into smaller parts. It is the opposite of the rollup
process. It can be done via

 Moving down the concept hierarchy


 Increasing a dimension
Consider the diagram above

 Quarter Q1 is drilled down to the months January, February, and March. The
corresponding sales are also registered.
 In this example, the month level is added.

3) Slice:

Here, one dimension is selected, and a new sub-cube is created.

The following diagram explains how the slice operation is performed:


 Dimension Time is Sliced with Q1 as the filter.
 A new cube is created altogether.

Dice:

This operation is similar to a slice. The difference in dice is you select 2 or more
dimensions that result in the creation of a sub-cube.
4) Pivot

In Pivot, you rotate the data axes to provide a substitute presentation of data.

In the following example, the pivot is based on item types.

Types of OLAP systems

OLAP Hierarchical Structure

 Relational OLAP (ROLAP): ROLAP is an extended RDBMS along with multidimensional data
mapping to perform the standard relational operations.
 Multidimensional OLAP (MOLAP): MOLAP implements operations on multidimensional data.
 Hybrid Online Analytical Processing (HOLAP): in the HOLAP approach the aggregated
totals are stored in a multidimensional database while the detailed data is stored in
the relational database. This offers both the data efficiency of the ROLAP model and the
performance of the MOLAP model.
 Desktop OLAP (DOLAP): in Desktop OLAP, a user downloads a part of the data from the
database locally, or onto their desktop, and analyzes it. DOLAP is relatively cheap to
deploy as it offers very few functionalities compared to other OLAP systems.
 Web OLAP (WOLAP): Web OLAP is an OLAP system accessible via the web browser. WOLAP has
a three-tiered architecture consisting of three components: a client, middleware, and a
database server.
 Mobile OLAP: Mobile OLAP helps users to access and analyze OLAP data using their mobile
devices.
 Spatial OLAP (SOLAP): SOLAP is created to facilitate the management of both spatial and
non-spatial data in a Geographic Information System (GIS).

ROLAP

ROLAP works with data that exist in a relational database. Facts and dimension tables
are stored as relational tables. It also allows multidimensional analysis of data and is
the fastest growing OLAP.

Advantages of ROLAP model:

 High data efficiency. It offers high data efficiency because query performance and
the access language are optimized particularly for multidimensional data analysis.
 Scalability. This type of OLAP system offers scalability for managing large volumes
of data, even when the data is steadily increasing.

Drawbacks of ROLAP model:

 Demand for higher resources: ROLAP needs high utilization of manpower, software,
and hardware resources.
 Aggregate data limitations. ROLAP tools use SQL for all calculations on aggregate
data, which works poorly when the model is heavy on computations that are awkward
to express in SQL.
 Slow query performance. Query performance in this model is slow compared with
MOLAP.

MOLAP

MOLAP uses array-based multidimensional storage engines to display multidimensional
views of data. Basically, they use an OLAP cube.


Hybrid OLAP

Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers the fast computation of
MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:

1. Aggregated or computed data is stored in a multidimensional OLAP cube.
2. Detailed information is stored in a relational database (sketched below).
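A minimal sketch of this split, assuming an in-memory SQLite table stands in for the
relational detail store and a plain dictionary stands in for the multidimensional
aggregate store (neither reflects any particular HOLAP product):

import sqlite3

# Detail data lives in the relational store (the ROLAP side).
detail = sqlite3.connect(":memory:")
detail.execute("CREATE TABLE sales (country TEXT, city TEXT, amount REAL)")
detail.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   [("USA", "New Jersey", 440), ("USA", "Los Angeles", 1560)])

# Aggregated totals live in the multidimensional store (the MOLAP side),
# modelled here as a precomputed lookup.
aggregates = {("USA",): 2000}

def total_for_country(country):
    # Summary query: answered from the cube, without touching the detail store.
    return aggregates[(country,)]

def cities_for_country(country):
    # Drill-through: detailed rows come from the relational database.
    return detail.execute(
        "SELECT city, amount FROM sales WHERE country = ?", (country,)).fetchall()

print(total_for_country("USA"))     # 2000
print(cities_for_country("USA"))    # [('New Jersey', 440.0), ('Los Angeles', 1560.0)]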
Benefits of Hybrid OLAP:

 This kind of OLAP helps to economize on disk space, and it also remains compact,
which helps avoid issues related to access speed and convenience.
 HOLAP uses cube technology, which allows faster performance for all types of data.
 The ROLAP side is updated instantly, so HOLAP users have access to this real-time
data, while MOLAP brings cleaning and conversion of data, thereby improving data
relevance. This brings the best of both worlds.

Drawbacks of Hybrid OLAP:

 Greater complexity level: the major drawback of HOLAP systems is that they must
support both ROLAP and MOLAP tools and applications. Thus, they are more
complicated.
 Potential overlaps: there are higher chances of overlap, especially in their
functionalities.

Advantages of OLAP

 OLAP is a platform for all types of business work, including planning, budgeting,
reporting, and analysis.
 Information and calculations are consistent in an OLAP cube. This is a crucial
benefit.
 Quickly create and analyze "what if" scenarios.
 Easily search the OLAP database for broad or specific terms.
 OLAP provides the building blocks for business modeling tools, data mining tools,
and performance reporting tools.
 Allows users to slice and dice cube data by various dimensions, measures, and
filters.
 It is good for analyzing time series.
 Finding clusters and outliers is easy with OLAP.
 It is a powerful online analytical processing and visualization system that provides
fast response times.

Disadvantages of OLAP

 OLAP requires organizing data into a star or snowflake schema. These schemas are
complicated to implement and administer.
 You cannot have a large number of dimensions in a single OLAP cube.
 Transactional data cannot be accessed with an OLAP system.
 Any modification in an OLAP cube needs a full update of the cube. This is a time-
consuming process.

Summary:

 OLAP is a technology that enables analysts to extract and view business data from
different points of view.
 At the core of the OLAP concept is the OLAP cube.
 Various business applications and other data operations require the use of OLAP
cubes.
 The primary analytical operations in OLAP are 1) roll-up, 2) drill-down, 3) slice,
4) dice, and 5) pivot.
 Three widely used types of OLAP systems are MOLAP, ROLAP, and Hybrid OLAP.
 Desktop OLAP, Web OLAP, and Mobile OLAP are some other types of OLAP
systems.

What is MOLAP? Architecture, Advantages, Example, Tools

What is MOLAP?

Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis by
using a multidimensional data cube. Data is pre-computed, pre-summarized, and stored
in a MOLAP cube (a major difference from ROLAP).

Using MOLAP, a user can view multidimensional data from different facets.
Multidimensional data analysis is also possible with a relational database, but that
would require querying data from multiple tables. In contrast, MOLAP has all possible
combinations of data already stored in a multidimensional array, which it can access
directly. Hence, MOLAP is faster than Relational Online Analytical Processing (ROLAP).
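To make the idea of storing all possible combinations concrete, here is a hedged
sketch (dimension names and figures assumed) that precomputes every aggregate
combination up front, so a later query is just a dictionary lookup rather than an
aggregation:

from itertools import combinations

# Base facts at the finest grain: (country, quarter, item) -> sales.
facts = {
    ("USA", "Q1", "Mobile"): 440,
    ("USA", "Q1", "Modem"): 1560,
    ("India", "Q1", "Mobile"): 300,
}
n_dims = 3  # country, quarter, item

# Pre-compute every aggregate combination; "*" means the dimension is rolled up.
cube = {}
for r in range(n_dims + 1):
    for keep in combinations(range(n_dims), r):
        for key, value in facts.items():
            reduced = tuple(key[i] if i in keep else "*" for i in range(n_dims))
            cube[reduced] = cube.get(reduced, 0) + value

# Queries are direct lookups; no aggregation happens at query time.
print(cube[("USA", "*", "*")])   # total USA sales: 2000
print(cube[("*", "*", "*")])     # grand total: 2300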

Key Points

 In MOLAP, operations are called processing.
 MOLAP tools process information with the same response time irrespective of the
level of summarization.
 MOLAP tools remove the complexity of designing a relational database to store data
for analysis.
 The MOLAP server implements two levels of storage representation to manage dense
and sparse data sets (see the sketch after this list).
 Storage utilization can be low if the data set is sparse.
 Facts are stored in a multidimensional array, and dimensions are used to query them.
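A minimal illustration of the dense-versus-sparse point above (this mirrors the general
idea, not any particular MOLAP server's storage format): dense regions fit naturally
into a NumPy array, while sparse combinations are cheaper to keep as a
coordinate-to-value mapping.

import numpy as np

# Dense region: most (quarter, item) cells are populated, so a plain array is compact.
quarters, items = ["Q1", "Q2", "Q3", "Q4"], ["Mobile", "Modem"]
dense_block = np.array([[440, 1560],
                        [500, 1300],
                        [450, 1200],
                        [600, 1400]])      # rows = quarters, columns = items (USA only)

# Sparse region: only a handful of (country, quarter, item) cells exist,
# so a coordinate map avoids allocating a mostly empty array.
sparse_block = {("Brazil", "Q3", "Modem"): 75,
                ("Norway", "Q1", "Mobile"): 12}

def lookup(quarter, item, country="USA"):
    # Two-level lookup: dense block first, sparse store as the fallback.
    if country == "USA":
        return int(dense_block[quarters.index(quarter), items.index(item)])
    return sparse_block.get((country, quarter, item), 0)

print(lookup("Q1", "Mobile"))            # 440 from the dense block
print(lookup("Q3", "Modem", "Brazil"))   # 75 from the sparse store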

MOLAP Architecture

MOLAP Architecture includes the following components −

 Database server.
 MOLAP server.
 Front-end tool.
The MOLAP architecture shown in the figure above works as follows:

1. The user requests reports through the front-end interface.
2. The application logic layer of the MDDB (multidimensional database) retrieves the
stored data from the database.
3. The application logic layer forwards the result to the client/user.

The MOLAP architecture mainly reads precompiled data. It has limited capabilities to
dynamically create aggregations or to calculate results that have not been
pre-calculated and stored.

For example, an accounting head can run a report showing the corporate P/L account or
the P/L account for a specific subsidiary. The MDDB would retrieve the precompiled
Profit & Loss figures and display the result to the user.

Implementation considerations in MOLAP

 In MOLAP, it is essential to consider both the maintenance and storage implications
when creating a strategy for building cubes.
 Proprietary languages are used to query MOLAP; they typically involve extensive
click-and-drag support, for example MDX by Microsoft.
 MOLAP is difficult to scale because of the number and size of cubes required when
dimensions increase.
 APIs should be provided for probing the cubes.
 The data structure must support multiple subject areas of data analysis across which
data can be navigated and analyzed. When the navigation changes, the data structure
needs to be physically reorganized.
 Database administrators need a different skill set and different tools to build and
maintain the database.

MOLAP Advantages

 MOLAP can manage, analyze, and store considerable amounts of multidimensional
data.
 Fast query performance due to optimized storage, indexing, and caching.
 Smaller data sizes compared with a relational database.
 Automated computation of higher-level aggregate data.
 Helps users analyze larger, less-defined data.
 MOLAP is easier for users, which makes it a suitable model for inexperienced users.
 MOLAP cubes are built for fast data retrieval and are optimal for slicing and
dicing operations.
 All calculations are pre-generated when the cube is created.

MOLAP Disadvantages

 One major weakness of MOLAP is that it is less scalable than ROLAP, as it can
handle only a limited amount of data.
 MOLAP also introduces data redundancy and is resource-intensive.
 Building MOLAP cubes may be lengthy, particularly with large data volumes.
 MOLAP products may face issues while updating and querying models with more
than ten dimensions.
 MOLAP is not capable of containing detailed data.
 Storage utilization can be low if the data set is highly sparse.
 Because it can handle only a limited amount of data, it is impossible to include a
large amount of data in the cube itself.

MOLAP Tools

 Essbase - A tool from Oracle that has a multidimensional database.

 Express Server - A web-based environment that runs on an Oracle database.

 Yellowfin - A business analytics tool for creating reports and dashboards.

 Clear Analytics - An Excel-based business analytics solution.

 SAP Business Intelligence - Business analytics solutions from SAP.

Summary:

 Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis
by using a multidimensional data cube.
 MOLAP tools process information with the same response time irrespective of the
level of summarization.
 The MOLAP server implements two levels of storage to manage dense and sparse
data sets.
 MOLAP can manage, analyze, and store considerable amounts of multidimensional
data.
 It helps automate the computation of higher-level aggregate data.
 It is less scalable than ROLAP, as it handles only a limited amount of data.
