
DMBI IMP Q&A

Business Intelligence:
 Business intelligence may be defined as a set of mathematical models and
analysis methodologies that exploit the available data to generate information
and knowledge useful for complex decision-making processes.
 Business intelligence (BI) is a set of theories, methodologies, architectures, and
technologies that transform raw data into meaningful and useful information for
business purposes.
 The main purpose of business intelligence systems is to provide knowledge
workers with tools and methodologies that allow them to make effective and
timely decisions.

BI Architecture:

A typical BI Architecture
The architecture of a business intelligence system includes three major components:

Data sources:
In the first stage, it is necessary to gather and integrate the data stored in the various
primary and secondary sources, which are heterogeneous in origin and type.
The sources consist for the most part of data belonging to operational systems, but
may also include unstructured documents, such as emails and data received from
external providers.
A major effort is required to unify and integrate the different data sources.


Data warehouses and data marts:


Using extraction and transformation tools known as extract, transform, load (ETL), the
data originating from the different sources are stored in databases intended to support
business intelligence analyses.
These databases are usually referred to as data warehouses and data marts.

Business intelligence methodologies:


Data are finally extracted and used to feed mathematical models and analysis
methodologies intended to support decision makers.
In a business intelligence system, several decision support applications may be
implemented, most of which will be described in the following chapters:
Multidimensional cube analysis
Exploratory data analysis
Time series analysis
Inductive learning models for data mining
Optimization models.


What are Multidimensional Schemas?


A multidimensional schema is specially designed to model data warehouse
systems. The schemas are designed to address the unique needs of very large
databases designed for analytical purposes (OLAP).

Types of Data Warehouse Schema:

Following are the three chief types of multidimensional schemas, each having its unique
advantages.

 Star Schema
 Snowflake Schema
 Galaxy Schema

What is a Star Schema?


The star schema is the simplest type of data warehouse schema. It is known as a
star schema because its structure resembles a star. In the star schema, the center of the
star can have one fact table and a number of associated dimension tables. It is
also known as the Star Join Schema and is optimized for querying large data sets.


For example, as shown in the figure above, the fact table is at the center and
contains keys to every dimension table, such as Deal_ID, Model_ID, Date_ID,
Product_ID and Branch_ID, along with other attributes like Units Sold and Revenue.

Characteristics of Star Schema:

 Every dimension in a star schema is represented by only one dimension table.
 The dimension table contains the set of attributes describing that dimension.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have a Country lookup table as an OLTP design would
have.
 The schema is widely supported by BI tools.
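
As a small illustration (a sketch only, with hypothetical table and column names that are not taken from the figure above), the following Python snippet builds a tiny star schema in an in-memory SQLite database and answers a question with a single star join from the fact table to its dimension tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One central fact table plus denormalized dimension tables (star schema).
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    branch_id  INTEGER REFERENCES dim_branch(branch_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Laptop", "Computers"), (2, "Printer", "Peripherals")])
cur.executemany("INSERT INTO dim_branch VALUES (?,?,?)",
                [(10, "Mumbai", "India"), (20, "Pune", "India")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 10, 5, 250000.0), (2, 10, 3, 30000.0), (1, 20, 2, 100000.0)])

# A star join: the fact table joins directly to each dimension table.
cur.execute("""
SELECT p.category, b.city, SUM(f.units_sold), SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_branch  b ON f.branch_id  = b.branch_id
GROUP BY p.category, b.city
""")
for row in cur.fetchall():
    print(row)
```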

What is a Snowflake Schema?


A Snowflake Schema is an extension of a Star Schema in which the dimension tables
are normalized into additional tables. It is called a snowflake schema because its
diagram resembles a snowflake.

The dimension tables are normalized, which splits the data into additional tables. In the
following example, Country is further normalized into an individual table.


Characteristics of Snowflake Schema:

 The main benefit of the snowflake schema is that it uses smaller disk space.
 It is easier to implement when a dimension is added to the schema.
 Query performance is reduced because of the multiple tables involved in a join.
 The primary challenge you will face while using the snowflake schema is
that it requires more maintenance effort because of the larger number of
lookup tables.

Star Vs Snowflake Schema: Key Differences


Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
Contains a fact table surrounded by dimension tables. | One fact table surrounded by dimension tables, which are in turn surrounded by further dimension tables.
A single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. | Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.


Data Mining | KDD process


What is Data Mining?
Data mining is looking for hidden, valid, and potentially useful patterns in huge data sets. Data
Mining is all about discovering unsuspected/ previously unknown relationships amongst the data.

Data Mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data stored
in databases.

Data Mining Techniques/Model

1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata.
This data mining method helps to classify data into different classes.

2. Clustering:
Clustering analysis is a data mining technique used to identify data that are similar to each other.
This process helps to understand both the differences and similarities between the data.

3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to identify the likelihood of a specific variable, given the presence of
other variables.

4. Association Rules:
This data mining technique helps to find the association between two or more Items. It discovers a
hidden pattern in the data set.


5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset that do not
match an expected pattern or expected behavior. This technique can be used in a variety of
domains, such as intrusion detection, fraud detection or fault detection. Outlier detection is also
called Outlier Analysis or Outlier Mining.

6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction
data over a certain period.

7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends, sequential
patterns, clustering and classification. It analyzes past events or instances in the right sequence
to predict a future event.
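
As a brief illustration of the clustering technique described above, here is a minimal sketch using scikit-learn's KMeans (assuming scikit-learn is installed; the customer data points are made up):

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical 2-D customer data: (annual spend, number of visits).
X = np.array([[200, 3], [220, 4], [250, 5],
              [900, 20], [950, 22], [1000, 25]])

# Group the customers into two clusters of similar behaviour.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assigned to each customer
print(model.cluster_centers_)  # centre (mean) of each cluster
```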

Steps Involved in KDD Process:

KDD PROCESS

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple
sources into a common source (data warehouse).


 Data integration using Data Migration tools.
 Data integration using Data Synchronization tools.
 Data integration using the ETL (Extract, Transform, Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the data collection.
 Data selection using Neural networks.
 Data selection using Decision Trees.
 Data selection using Naive Bayes.
 Data selection using Clustering, Regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming
data into the appropriate form required by the mining procedure.
Data Transformation is a two step process:
 Data Mapping: Assigning elements from source base to destination to capture
transformations.
 Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to extract
potentially useful patterns.
 Transforms task-relevant data into patterns.
 Decides the purpose of the model, using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting
patterns representing knowledge, based on given interestingness measures.
 Find the interestingness score of each pattern.
 Uses summarization and visualization to make the data understandable by the user.
7. Knowledge Representation: Knowledge representation is defined as a technique which
utilizes visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.
Note:
 KDD is an iterative process where evaluation measures can be enhanced, mining can
be refined, new data can be integrated and transformed in order to get different and
more appropriate results.
 Preprocessing of databases consists of Data cleaning and Data Integration.
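
A minimal sketch of the first two KDD steps, cleaning and integration, using pandas (the column names, default values and cleaning rules are illustrative assumptions, not part of any standard):

```python
import pandas as pd

# Data cleaning: handle missing values and obvious noise.
sales = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                      "age": [25, None, 130, 41],       # missing and noisy values
                      "city": ["Pune", "Mumbai", None, "Delhi"]})
sales["age"] = sales["age"].fillna(sales["age"].median())     # fill missing values
sales.loc[sales["age"] > 100, "age"] = sales["age"].median()  # treat impossible ages as noise
sales["city"] = sales["city"].fillna("Unknown")

# Data integration: combine heterogeneous sources on a common key.
crm = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                    "segment": ["retail", "retail", "corporate", "retail"]})
integrated = sales.merge(crm, on="cust_id", how="left")
print(integrated)
```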


What is Data warehouse?


A data warehouse is an information system that contains historical and cumulative data from
single or multiple sources. It simplifies the reporting and analysis process of the organization.

It also serves as a single version of truth for any company for decision making and forecasting.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management's decision-making process.

Components of data warehouse


1. Database
2. ETL tools
3. Metadata
4. Query Tools
5. Data Marts

Database
The central database is the foundation of the data warehousing environment. This database is
implemented on RDBMS technology. However, this kind of implementation is constrained by
the fact that a traditional RDBMS is optimized for transactional database processing and not
for data warehousing. For instance, ad-hoc queries, multi-table joins and aggregates are resource
intensive and slow down performance.

Extract, Transform and Load (ETL) Tools.

The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and all the changes needed to transform data into a unified format
in the data warehouse. They are also called Extract, Transform and Load (ETL) tools.

These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL
programs, shell scripts, etc. that regularly update data in the data warehouse. These tools are also
helpful for maintaining the metadata.

Metadata

Metadata is data about data which defines the data warehouse. It is used for building, maintaining
and managing the data warehouse.

In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source,
usage, values, and features of data warehouse data. It also defines how data can be changed
and processed. It is closely connected to the data warehouse.


For example, a line in sales database may contain:

4030 KJ732 299.90

This is meaningless data until we consult the metadata that tells us it was

 Model number: 4030


 Sales Agent ID: KJ732
 Total sales amount of $299.90

Query Tools
One of the primary objectives of data warehousing is to provide information to businesses to make
strategic decisions. Query tools allow users to interact with the data warehouse system.

These tools fall into four different categories:

1. Query and reporting tools


2. Application Development tools
3. Data mining tools
4. OLAP tools

Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented as an
option to a large-size data warehouse, as it takes less time and money to build. However, there is
no standard definition of a data mart; it differs from person to person.

In simple words, a data mart is a subsidiary of a data warehouse. A data mart is used to
partition the data for a specific group of users.

Data marts can be created in the same database as the data warehouse or in a physically
separate database.


Three-Tier Data Warehouse Architecture


Generally, a data warehouse adopts a three-tier architecture. Following are the
three tiers of the data warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the
relational database system. We use the back end tools and utilities to feed data into the bottom tier.
These back end tools and utilities perform the Extract, Clean, Load, and refresh functions.

 Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of the
following ways.

o By Relational OLAP (ROLAP), which is an extended relational database management system.


The ROLAP maps the operations on multidimensional data to standard relational operations.

o By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional


data and operations.

 Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.

The following diagram depicts the three-tier architecture of data warehouse −


OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss
OLAP operations in multidimensional data.
Here is the list of OLAP operations −

 Roll-up

 Drill-down

 Slice and dice

 Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −

 By climbing up a concept hierarchy for a dimension

 By dimension reduction
The following diagram illustrates how roll-up works.


 Roll-up is performed by climbing up a concept hierarchy for the dimension location.

 Initially the concept hierarchy was "street < city < province < country".

 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the
level of country.

 The data is now grouped into countries rather than cities.

 When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the
following ways −

 By stepping down a concept hierarchy for a dimension

 By introducing a new dimension.


The following diagram illustrates how drill-down works −

 Drill-down is performed by stepping down a concept hierarchy for the dimension time.

 Initially the concept hierarchy was "day < month < quarter < year."

 On drilling down, the time dimension is descended from the level of quarter to the level of month.


 When drill-down is performed, one or more dimensions from the data cube are added.

 It navigates the data from less detailed data to highly detailed data.

Slice
The slice operation selects one particular dimension from a given cube and
provides a new sub-cube. Consider the following diagram that shows how slice
works.

 Here Slice is performed for the dimension "time" using the criterion time = "Q1".

 It will form a new sub-cube by selecting one or more dimensions.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-
cube. Consider the following diagram that shows the dice operation.


The dice operation on the cube based on the following selection criteria involves
three dimensions.

 (location = "Toronto" or "Vancouver")

 (time = "Q1" or "Q2")

 (item =" Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in
order to provide an alternative presentation of data. Consider the following
diagram that shows the pivot operation.
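
The four OLAP operations can be imitated on a small cube held in a pandas DataFrame; the sketch below uses illustrative location/time/item values loosely based on the diagrams referred to above:

```python
import pandas as pd

# A tiny fact table: one row per (location, quarter, item) cell of the cube.
cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver", "New York", "New York"],
    "time":     ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [605, 825, 1087, 968, 1287, 1560],
})

# Roll-up: aggregate away (reduce) the item dimension.
rollup = cube.groupby(["location", "time"], as_index=False)["sales"].sum()

# Slice: fix one dimension with a single value, time = "Q1".
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions.
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate the axes to get an alternative presentation.
pivoted = cube.pivot_table(index="location", columns="time", values="sales", aggfunc="sum")
print(rollup, slice_q1, dice, pivoted, sep="\n\n")
```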


Difference between OLTP and OLAP


Parameters | OLTP | OLAP
Process | It is an online transactional system. It manages database modification. | OLAP is an online analysis and data retrieving process.
Functionality | OLTP is an online database modifying system. | OLAP is an online database query management system.
Method | OLTP uses a traditional DBMS. | OLAP uses the data warehouse.
Query | Insert, update, and delete information from the database. | Mostly select operations.
Table | Tables in an OLTP database are normalized. | Tables in an OLAP database are not normalized.
Source | Transactions are the source of data in OLTP. | Different OLTP databases become the source of data for OLAP.
Data Integrity | The OLTP database must maintain data integrity constraints. | The OLAP database is not frequently modified. Hence, data integrity is not an issue.
Response time | Its response time is in milliseconds. | Response time is in seconds to minutes.
Data quality | The data in the OLTP database is always detailed and organized. | The data in the OLAP process might not be organized.
Usefulness | It helps to control and run fundamental business tasks. | It helps with planning, problem-solving, and decision support.
Operation | Allows read/write operations. | Only read and rarely write.
Audience | It is a customer-oriented process. | It is a market-oriented process.
Query Type | Queries in this process are standardized and simple. | Complex queries involving aggregations.
Back-up | Complete backup of the data combined with incremental backups. | OLAP only needs a backup from time to time; backup is not as important as for OLTP.
Design | DB design is application oriented. Example: database design changes with the industry, like retail, airline, banking, etc. | DB design is subject oriented. Example: database design changes with subjects like sales, marketing, purchasing, etc.
User type | Used by data-critical users like clerks, DBAs and database professionals. | Used by data-knowledge users like workers, managers, and CEOs.
Purpose | Designed for real-time business operations. | Designed for analysis of business measures by category and attributes.
Number of users | This kind of database allows thousands of users. | This kind of database allows only hundreds of users.
Productivity | It helps to increase the user's self-service and productivity. | Helps to increase the productivity of business analysts.
Process | It provides fast results for daily used data. | It ensures that responses to queries are consistently quick.
Characteristic | It is easy to create and maintain. | It lets the user create a view with the help of a spreadsheet.
Style | OLTP is designed to have fast response time, low data redundancy and is normalized. | A data warehouse is created uniquely so that it can integrate different data sources for building a consolidated database.


Page Rank Algorithm and Implementation


PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring
the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough estimate of
how important the website is. The underlying assumption is that more important websites are likely to
receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that
was used by the company, and it is the best known.
Algorithm
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of
documents of any size. It is assumed in several research papers that the distribution is evenly divided
among all documents in the collection at the beginning of the computational process. The PageRank
computations require several passes, called “iterations”, through the collection to adjust approximate
PageRank values to more closely reflect the theoretical true value.

Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple
outbound links from one single page to another single page, are ignored. PageRank is initialized to the
same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the
total number of pages on the web at that time, so each page in this example would have an initial value of
1. However, later versions of PageRank, and the remainder of this section, assume a probability
distribution between 0 and 1. Hence the initial value for each page in this example is 0.25.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is
divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to
A upon the next iteration, for a total of 0.75.

Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had
links to all three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or
0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all of its existing value,


0.25, to the only page it links to, A. Since D had three outbound links, it would transfer one third of its
existing value, or approximately 0.083, to A. At the completion of this iteration, page A will have a
PageRank of approximately 0.458.

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank
score divided by the number of its outbound links L(v).

In the general case, the PageRank value for any page u can be expressed as

PR(u) = sum over v in B_u of PR(v) / L(v)

where B_u is the set of pages that link to u and L(v) is the number of outbound links of page v.
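
Below is a minimal iterative implementation of the algorithm described above, in plain Python. The four-page link structure of the example is reused; the damping factor of 0.85 is a commonly used value but is an assumption here (setting it to 1.0 reproduces the simplified algorithm):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}              # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread its rank evenly
                for p in pages:
                    new_pr[p] += damping * pr[page] / n
            else:
                share = pr[page] / len(outlinks)   # rank divided by the number of outbound links
                for target in outlinks:
                    new_pr[target] += damping * share
        pr = new_pr
    return pr

# The four-page example above: B -> {A, C}, C -> {A}, D -> {A, B, C}; A has no outbound links.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
print(pagerank(links))
```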


Data Mining | ETL process


ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It
is a process in which an ETL tool extracts the data from various data source systems,
transforms it in the staging area and then finally, loads it into the Data Warehouse
system.

Let us understand each step of the ETL process in depth:


1. Extraction:
The first step of the ETL process is extraction. In this step, data from various source
systems, which can be in various formats like relational databases, NoSQL, XML and
flat files, is extracted into the staging area. It is important to extract the data from the
various source systems and store it in the staging area first, and not directly in the
data warehouse, because the extracted data is in various formats and may also be
corrupted. Loading it directly into the data warehouse may therefore damage it, and
rollback would be much more difficult. This is one of the most important steps of the
ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions are applied on the extracted data to convert it into a single standard
format. It may involve following processes/tasks:

 Filtering – loading only certain attributes into the data warehouse.


 Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States and America into USA, etc.
 Joining – joining multiple attributes into one.


 Splitting – splitting a single attribute into multiple attributes.


 Sorting – sorting tuples on the basis of some attribute (generally a key attribute).

3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse. Sometimes the data is updated by
loading into the data warehouse very frequently and sometimes it is done after
longer but regular intervals. The rate and period of loading solely depends on the
requirements and varies from system to system.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it
can be transformed, and during that period some new data can be extracted. And while
the transformed data is being loaded into the data warehouse, the already extracted
data can be transformed. The block diagram of the pipelining of the ETL process is shown
below:

ETL Tools: Most commonly used ETL tools are Sybase, Oracle Warehouse builder,
CloverETL and MarkLogic.
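
The following toy sketch shows the three ETL steps end to end in Python. The source file name, field names and cleaning rules are invented for illustration only:

```python
import csv
import sqlite3

def extract(path):
    # Extraction: read raw rows from a flat-file source into the staging area (a list in memory).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: filter, clean and standardize the extracted rows into one format.
    cleaned = []
    for r in rows:
        if not r.get("amount"):                       # filtering: drop incomplete rows
            continue
        country = r.get("country", "").strip().upper()
        if country in {"U.S.A", "UNITED STATES", "AMERICA"}:
            country = "USA"                           # cleaning: map synonyms to one value
        cleaned.append({"customer": r["customer"].strip().title(),
                        "country": country,
                        "amount": float(r["amount"])})
    return cleaned

def load(rows, db="warehouse.db"):
    # Loading: write the transformed rows into a warehouse table.
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :country, :amount)", rows)
    conn.commit()
    conn.close()

# load(transform(extract("daily_sales.csv")))        # run the pipeline on a source file
```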


3 Top Challenges of Data Integration


In the world of data integration, a few challenges will crop up along the way. This article describes
what those barriers to success are and how you can prevail to achieve the results you want.

Challenge 1: Defining Data Integration


One of the biggest challenges of data integration is defining it. Data integration is often used
interchangeably with business integration and system integration, but they are different.
REMEDI defines data integration as “the collection and integration of electronic transactions,
messages, and data from internal and external systems and devices to a separate data structure
for purposes of cleansing, organizing, and analyzing the joined data.” Data integration takes place
in a data warehouse and requires specialized software to host large data repositories from
internal and external sources. The software extracts, amalgamates, and then presents the
information in a unified form during this process. When you use the right term to define the
process, you will be one step closer to getting the results you want.

Challenge 2: Data in Heterogeneous Forms


Another major problem that crops up during the data integration process is information in
heterogeneous forms. Legacy systems store data in different forms; however, a single data
integration platform cannot handle heterogeneity. It must all be in the same form for analysis.

Overcoming this challenge involves an awareness of heterogeneous data formats from the outset.
Evaluate your information formats early in the project. Next, a developer must convert the
information into a format that the data integration platform can handle. That way, you can analyze
your data.

Challenge 3: Extracting Value from Data


A common complaint about data integration is that it's difficult to extract value from your data once
it has been integrated with a variety of other sources. It is not just that there is a great deal of
information out there (there is, and it keeps growing every day thanks to sensors, mobile devices,
and social media). Your analytics tool must be able to connect to the data integration platform for
that data to be of any use to you.

This is a problem that can be easily solved at the beginning of the data process if you remember
which analytics tools “talk” to your data integration platform (and vice versa). By making the right
technology choices, you avoid a situation where your integrated data is rendered useless.

Data integration can pose multiple challenges during the implementation process if you do not
approach it the right way. Successful data integration requires knowledge and thorough planning.
To learn more about the right way to handle data integration, contact us today.


Linear and Nonlinear Regression Models


Linear regression
A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear
equation is constructed by adding the results for each term. This constrains the equation to just one basic form:

y = b0 + b1x1 + b2x2 + ... + bkxk

In linear regression, a response variable y and a single predictor variable x are given. It
models y as a linear function of x, i.e.

y = b + wx

where b and w are regression coefficients specifying the y-intercept and the slope of the
line, respectively. In data mining the regression coefficients can be thought of as weights
of attributes, so we can rewrite the equation as

y = w0 + w1x

Let D be the training set consisting of the values of the predictor variable x for some
population and their associated values of the response variable y. The regression
coefficients can be estimated using the method of least squares with the following
equations:

w1 = sum_i (xi - x_bar)(yi - y_bar) / sum_i (xi - x_bar)^2

w0 = y_bar - w1 * x_bar

where x_bar and y_bar are the mean values of x and y in D. The fitted regression line is
then y = w0 + w1x.
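
The least-squares formulas above can be computed directly in a few lines of Python; the (x, y) values below are made up for illustration:

```python
# Least-squares estimates for simple linear regression y = w0 + w1*x.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
w0 = y_bar - w1 * x_bar

print(f"y = {w0:.3f} + {w1:.3f} x")       # fitted regression line
print("prediction for x = 6:", w0 + w1 * 6)
```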


Non-Linear regression
While a linear equation has one basic form, nonlinear equations can take many different forms. The easiest way to determine
whether an equation is nonlinear is to focus on the term “nonlinear” itself. Literally, it’s not linear. If the equation doesn’t meet
the criteria above for a linear equation, it’s nonlinear.

When the points do not show a linear dependency between them, the relationship can be
modeled by polynomial regression. Polynomial regression is a form of non-linear regression.

By applying a transformation to the variables, we can convert a non-linear model into a
linear model, which can then be solved.

Transformation of Non-linear to Linear


Consider a cubic polynomial regression:

y = w0 + w1x + w2x^2 + w3x^3

To convert this equation into linear form we introduce new variables:

x1 = x,  x2 = x^2,  x3 = x^3

y = w0 + w1x1 + w2x2 + w3x3

Now the above equation is in linear form and can be solved by the method of least
squares.
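
A sketch of the same idea in Python, assuming NumPy is available: the squared and cubed terms are added as new columns, which turns the cubic model into a model that is linear in the coefficients and solvable by ordinary least squares.

```python
import numpy as np

# Sample data showing a clearly non-linear (cubic-looking) trend.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.4, 5.2, 6.0, 5.1, 2.4])

# Introduce x1 = x, x2 = x^2, x3 = x^3 plus a constant column,
# turning the cubic model into one that is linear in (w0, w1, w2, w3).
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# Solve for the coefficients by the method of least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("w0..w3 =", np.round(w, 3))
```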


What is Data Mart?


A data mart is a subset of a data warehouse that is dedicated to a specific user group such as
finance or human resources departments. They allow for structure within a data warehouse and
they help meet the data demands of any specific user group. This focus on specific user groups is
necessary because data held in data warehouses are not intuitively organized and data marts
resolve this issue by focusing on a single subject matter.

Characteristics of a data mart


 Dedicated single subject matter
 Focuses on the subject matter by consolidating and integrating information from various
sources.
 Usually dedicated for a specific business function or purpose.
 Built using a dimensional model called a star schema. This allows data marts to have
multidimensional analytical capabilities.

Benefits of a data mart


 Low cost to implement with a flexible approach. You have the flexibility of having a
standalone data mart or integrating one into a full data warehouse.
 Contains only data focused on a single subject matter allowing you quick access to the most
pertinent information.
 Quick and easy to build compared to a full data warehouse.

Data Mart and Data Warehouse Comparison


Data Mart
 Focus: A single subject or functional organization area

 Data Sources: Relatively few sources linked to one line of business

 Size: Less than 100 GB

 Normalization: No preference between a normalized and denormalized structure

 Decision Types: Tactical decisions pertaining to particular business lines and ways of doing things

 Cost: Typically from $10,000 upwards

 Setup Time: 3-6 months

 Data Held: Typically summarized data


Data Warehouse
 Focus: Enterprise-wide repository of disparate data sources
 Data Sources: Many external and internal sources from different areas of an organization
 Size: 100 GB minimum but often in the range of terabytes for large organizations
 Normalization: Modern warehouses are mostly denormalized for quicker data querying and read
performance
 Decision Types: Strategic decisions that affect the entire enterprise
 Cost: Varies but often greater than $100,000; for cloud solutions costs can be dramatically lower as
organizations pay per use
 Setup Time: At least a year for on-premise warehouses; cloud data warehouses are much quicker
to set up
 Data Held: Raw data, metadata, and summary data

What is market basket analysis

Introduction
Market Basket Analysis is one of the key techniques used by large retailers to
uncover associations between items. It works by looking for combinations of items
that occur together frequently in transactions. To put it another way, it allows
retailers to identify relationships between the items that people buy.

Association Rules are widely used to analyze retail basket or transaction data, and
are intended to identify strong rules discovered in transaction data using measures
of interestingness, based on the concept of strong rules.

1. Itemset: A collection of one or more items. A k-item-set means a set of k items.

2. Support Count: The frequency of occurrence of an item-set.
3. Support (s): The fraction of transactions that contain the item-set X:

Support(X) = frequency(X) / N

For a rule A => B, support is given by:

Support(A => B) = frequency(A, B) / N

Note: P(A U B) is the probability of A and B occurring together. P denotes probability.
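
A short sketch that computes support counts and support over a handful of made-up transactions (plain Python):

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "bread", "jam"},
    {"butter", "jam"},
]
N = len(transactions)

def support_count(itemset):
    # Support count: number of transactions that contain the item-set.
    return sum(itemset <= t for t in transactions)

def support(itemset):
    # Support: fraction of transactions that contain the item-set.
    return support_count(itemset) / N

print("support({milk, bread}) =", support({"milk", "bread"}))
# For a rule A => B, support(A => B) = frequency(A, B) / N, i.e. the support of A union B.
print("support(milk => bread) =", support({"milk"} | {"bread"}))

# Frequent 2-item-sets with support >= 0.4
pairs = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))
print({tuple(sorted(k)): v / N for k, v in pairs.items() if v / N >= 0.4})
```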


Data Mining & its Functionalities:

There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyze this huge amount of
data and extract useful information from it.

Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation.

Once all these processes are over, we would be able to use this information in many
applications such as Fraud Detection, Market Analysis, Production Control, Science
Exploration, etc.

What is Data Mining?

Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. The information or
knowledge extracted so can be used for any of the following applications:

 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Data Mining Applications

Data mining is highly useful in the following domains −

 Market Analysis and Management


 Corporate Analysis & Risk Management
 Fraud Detection

Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astrology, and Internet Web Surf-Aid


1) Market Analysis and Management


Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy what kind of
products.
Identifying Customer Requirements − Data mining helps in identifying the best products for
different customers. It uses prediction to find the factors that may attract new customers.
Cross Market Analysis − Data mining performs Association/correlations between product
sales.
Target Marketing − Data mining helps to find clusters of model customers who share the
same characteristics such as interests, spending habits, income, etc.
Determining Customer purchasing pattern − Data mining helps in determining customer
purchasing pattern.
Providing Summary Information − Data mining provides us various multidimensional
summary reports.

2) Corporate Analysis and Risk Management


Data mining is used in the following fields of the Corporate Sector −
Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction,
contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and spending.
Competition − It involves monitoring competitors and market directions.

3) Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect
frauds. In fraud telephone calls, it helps to find the destination of the call, duration of the call,
time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.

Data Mining - Tasks


Data mining deals with the kind of patterns that can be mined. On the basis of the kind of
data to be mined, there are two categories of functions involved in Data Mining −
 Descriptive
 Classification and Prediction

 Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters


1) Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in
a company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a
concept are called class/concept descriptions. These descriptions can be derived by the
following two ways −
Data Characterization − This refers to summarizing data of class under study. This class under
study is called as Target Class.
Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.

2) Mining of Frequent Patterns


Frequent patterns are those patterns that occur frequently in transactional data. Here is the
list of kind of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear together, for example,
milk and bread.
Frequent Subsequence − A sequence of patterns that occur frequently such as purchasing a
camera is followed by memory card.
Frequent Sub Structure − Substructure refers to different structural forms, such as graphs,
trees, or lattices, which may be combined with item-sets or subsequences.

3) Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data and
determining association rules.
For example, a retailer generates an association rule that shows that 70% of the time milk is sold
with bread and only 30% of the time biscuits are sold with bread.

4) Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated-attribute-value pairs or between two item sets to analyze that if they
have positive, negative or no effect on each other.

5) Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group of
objects that are very similar to each other but are highly different from the objects in other
clusters.


 Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −

 Classification (IF-THEN) Rules


 Decision Trees
 Mathematical Formulae
 Neural Networks

The lists of functions involved in these processes are as follows −

Classification − It predicts the class of objects whose class label is unknown. Its objective is to
find a derived model that describes and distinguishes data classes or concepts. The Derived
Model is based on the analysis set of training data i.e. the data object whose class label is well
known.

Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be used
for identification of distribution trends based on available data.

Outlier Analysis − Outliers may be defined as the data objects that do not comply with the
general behavior or model of the data available.

Evolution Analysis − Evolution analysis refers to the description and model regularities or
trends for objects whose behavior changes over time.


Data Transformation and Data Discretization


In the preprocessing step, the data is transformed so that the resulting mining process may
be more efficient and the patterns found are easier to understand. In data
transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:
1) Smoothing: works to remove noise from the data. Techniques include
binning, regression and clustering.
2) Attribute construction: new attributes are constructed and added from
the given set of attributes to help the mining process.
3) Aggregation: summary or aggregation operations are applied to the data.
4) Normalization: the attribute data are scaled so as to fall within a smaller
range, such as -1.0 to 1.0 or 0.0 to 1.0.
5) Discretization: the raw values of a numeric attribute (e.g. age) are replaced by
interval labels (0-10, 11-20, etc.) or conceptual labels (youth, adult, senior). The labels
in turn can be recursively organized into higher-level concepts, resulting in a
concept hierarchy for the numeric attribute.
6) Concept hierarchy generation for nominal data: attributes such as street
can be generalized to higher-level concepts like city or country.


(Concept hierarchy diagram: Country at the top level, generalizing cities c1-c6, which in turn generalize streets s1-s4.)

Discretization is a method used to transform the continuous values of an attribute into
discrete values.

Methods of discretization:

1) Discretization by binning
2) Discretization by histogram
3) Discretization by clustering, decision tree and correlation analysis

Discretization by binning:
1) Equal-frequency binning of the sorted data
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 into four bins:
Bin 1: 4, 8, 9
Bin 2: 15, 21, 21
Bin 3: 24, 25, 26
Bin 4: 28, 29, 34

2) Smoothing by bin means (each value is replaced by its bin's mean):
Bin 1: 7, 7, 7
Bin 2: 19, 19, 19
Bin 3: 25, 25, 25
Bin 4: 30, 30, 30

3) Smoothing by bin boundaries: each value is replaced by the closest bin boundary
(the minimum or maximum value of its bin), e.g. Bin 1 becomes 4, 9, 9.
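
A small sketch of equal-frequency binning and the two smoothing variants applied to the data above (plain Python, no libraries):

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 4
size = len(data) // n_bins

# Equal-frequency (equal-depth) partitioning: 3 values per bin.
bins = [data[i:i + size] for i in range(0, len(data), size)]
print("bins:", bins)

# Smoothing by bin means: replace every value by its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
print("means:", by_means)

# Smoothing by bin boundaries: replace every value by the closest boundary of its bin.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]
print("boundaries:", by_bounds)
```
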
Discretization by histogram: the following data are a list of prices of commonly sold
electronics items; the numbers have been sorted.

1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,1
8,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,25,28,28,30,30,30.


Histogram using singleton bucket

1(2) 5(5) 8(2) 10(4) 12(1) 14(3) 15(6) 18(8) 20(7) 21(4) 25(6) 28(2) 30(3)


Histogram using continuous (equal-width) range buckets:

1-10 (13)   11-20 (25)   21-30 (15)



Min-max normalization:

v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For example, suppose that the minimum and maximum values for the attribute income
are 12,000 and 98,000 and we want to map income to the range [0.0, 1.0]. Then a value
of 73,600 is transformed to

v' = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716

Z-score normalization:

v' = (v - mean(A)) / sigma(A)

where mean(A) and sigma(A) are the mean and standard deviation of attribute A.
Suppose that the mean and standard deviation of the values for the attribute income
are 54,000 and 16,000, respectively.

With z-score normalization, a value of 73,600 for income is transformed to

v' = (73,600 - 54,000) / 16,000 = 1.225

Decimal scaling normalization:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
Suppose that the recorded values of A range from -986 to 917, so the maximum absolute
value of A is 986. To normalize by decimal scaling we therefore divide each value by
1,000 (i.e. j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
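
The three normalization methods can be written as small Python functions; the example calls below reuse the income figures from the examples above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization: map v from [min_a, max_a] onto [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Z-score normalization: how many standard deviations v lies from the mean.
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # Decimal scaling: divide by 10^j so all normalized values fall in (-1, 1).
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))          # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))          # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917
```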


Classification and Prediction


There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −

 Classification
 Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income
and occupation.

What is Classification?
Following are the examples of cases where the data analysis task is Classification −

 A bank loan officer wants to analyze the data in order to know which customers (loan
applicant) are risky or which are safe.

 A marketing manager at a company needs to analyze a customer with a given profile,


who will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.

What is Prediction?
Following are the examples of cases where the data analysis task is Prediction −

Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we need to predict a numeric value, so the
data analysis task is an example of numeric prediction. In this case, a model or
a predictor will be constructed that predicts a continuous-valued function, or ordered value.

Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.


How Does Classification Works?


With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The Data Classification process includes two steps −

 Building the Classifier or Model


 Using Classifier for Classification

Building the Classifier or Model


 This step is the learning step or the learning phase.
 In this step the classification algorithms build the classifier.

 The classifier is built from the training set made up of database tuples and their
associated class labels.

 Each tuple that constitutes the training set belongs to a predefined class. These
tuples can also be referred to as samples, objects or data points.

Using Classifier for Classification


In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of classification rules. The classification rules can be applied to the new data tuples
if the accuracy is considered acceptable.

Classification and Prediction issue:


The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −

 Data Cleaning − Data cleaning involves removing the noise and treatment of missing
values. The noise is removed by applying smoothing techniques and the problem of
missing values is solved by replacing a missing value with most commonly occurring
value for that attribute.

 Relevance Analysis − Database may also have the irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.

 Data Transformation and reduction −The data can be transformed by any of the
following methods.

1. Normalization: The data is transformed using normalization. Normalization


involves scaling all values for given attribute in order to make them fall within a


small specified range. Normalization is used when in the learning step, the
neural networks or the methods involving measurements are used.

2. Generalization: The data can also be transformed by generalizing it to the


higher concept. For this purpose we can use the concept hierarchies.

Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.

Comparison of Classification and Prediction Methods:

Here are the criteria for comparing the methods of Classification and Prediction −

 Accuracy − The accuracy of a classifier refers to its ability to predict the class
label correctly; the accuracy of a predictor refers to how well a given predictor
can guess the value of the predicted attribute for new data.

 Speed −This refers to the computational cost in generating and using the classifier or
predictor.

 Robustness − It refers to the ability of classifier or predictor to make correct


predictions from given noisy data.

 Scalability − Scalability refers to the ability to construct the classifier or predictor


efficiently; given large amount of data.

 Interpretability − It refers to the extent to which the classifier or predictor can be understood.


Decision Tree:

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
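
As a minimal sketch, the snippet below trains a decision tree for the buy_computer idea using scikit-learn; the attribute encoding and the tiny training set are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples: [age_group, income_level, student], encoded as integers
# (age: 0=youth, 1=middle_aged, 2=senior; income: 0=low, 1=medium, 2=high; student: 0/1).
X = [[0, 2, 0], [0, 2, 1], [1, 2, 0], [2, 1, 0],
     [2, 0, 1], [1, 0, 1], [0, 1, 0], [2, 1, 1]]
y = ["no", "yes", "yes", "yes", "yes", "yes", "no", "yes"]  # class label: buys_computer

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree: internal nodes test an attribute, leaves hold a class label.
print(export_text(clf, feature_names=["age", "income", "student"]))
print(clf.predict([[0, 1, 1]]))   # predict the class of a new, unlabeled tuple
```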
