Business Intelligence:
Business intelligence may be defined as a set of mathematical models and
analysis methodologies that exploit the available data to generate information
and knowledge useful for complex decision-making processes.
Business intelligence (BI) is a set of theories, methodologies, architectures, and
technologies that transform raw data into meaningful and useful information for
business purposes.
The main purpose of business intelligence systems is to provide knowledge
workers with tools and methodologies that allow them to make effective and
timely decisions.
BI Architecture:
[Figure: a typical BI architecture]
The architecture of a business intelligence system includes three major components:
Data sources:
In the first stage, it is necessary to gather and integrate the data stored in the various
primary and secondary sources, which are heterogeneous in origin and type.
The sources consist for the most part of data belonging to operational systems, but
may also include unstructured documents, such as emails, and data received from
external providers. A major effort is required to unify and integrate these different data sources.
Following are the three chief types of multidimensional schemas, each with its unique
advantages:
Star Schema
Snowflake Schema
Galaxy Schema
For example, in a typical star schema the fact table sits at the center and contains
foreign keys to every dimension table (such as Deal_ID, Model_ID, Date_ID, Product_ID,
and Branch_ID) along with measure attributes such as units sold and revenue.
In a snowflake schema, the dimension tables are normalized, which splits their data into
additional tables. In the following example, Country is further normalized into an individual table.
The main benefit of the snowflake schema is that it uses less disk space, and it is easier
to implement when a new dimension is added to the schema. However, because of the
multiple tables, query performance is reduced. The primary challenge you will face while
using the snowflake schema is that it requires more maintenance effort because of the
larger number of lookup tables.
Star schema vs. snowflake schema:
Hierarchies: in a star schema, hierarchies for the dimensions are stored in the dimension tables; in a snowflake schema, hierarchies are divided into separate tables.
Joins: in a star schema, a single join creates the relationship between the fact table and any dimension table; a snowflake schema requires many joins to fetch the data.
Queries: the star schema offers higher-performing queries using star-join query optimization, and tables may be connected with multiple dimensions; the snowflake schema is represented by a centralized fact table which is unlikely to be connected with multiple dimensions directly.
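To make the trade-off concrete, here is a minimal Python (pandas) sketch of querying both layouts; every table and column name (sales_fact, dim_product, dim_branch, dim_country, and so on) is hypothetical and chosen only to mirror the examples above. In the star layout one join per dimension suffices, while the snowflaked Country hierarchy costs an extra join.

    import pandas as pd

    # Hypothetical star-schema fact table with foreign keys and measures.
    sales_fact = pd.DataFrame({
        "Product_ID": [1, 2, 1],
        "Branch_ID": [10, 10, 20],
        "Units_Sold": [5, 3, 7],
        "Revenue": [500.0, 240.0, 700.0],
    })
    dim_product = pd.DataFrame({
        "Product_ID": [1, 2],
        "Product_Name": ["Laptop", "Printer"],
    })
    # Snowflaked dimension: Country is normalized out of the branch table.
    dim_branch = pd.DataFrame({
        "Branch_ID": [10, 20],
        "Country_ID": [100, 101],
    })
    dim_country = pd.DataFrame({
        "Country_ID": [100, 101],
        "Country": ["India", "Germany"],
    })

    # Star schema: a single join relates the fact table to a dimension.
    star = sales_fact.merge(dim_product, on="Product_ID")

    # Snowflake schema: one extra join per normalized hierarchy level.
    snowflake = (sales_fact
                 .merge(dim_branch, on="Branch_ID")
                 .merge(dim_country, on="Country_ID"))

    print(snowflake.groupby("Country")["Revenue"].sum())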
Data Mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data stored
in databases.
1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata.
This data mining method helps to classify data into different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data items that are similar to one
another. This process helps to understand both the differences and the similarities within the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to identify the likelihood of a specific variable, given the presence of
other variables.
4. Association Rules:
This data mining technique helps to find the association between two or more Items. It discovers a
hidden pattern in the data set.
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset that do
not match an expected pattern or behavior. It can be used in a variety of domains, such as
intrusion detection, fraud detection, or fault detection. Outlier detection is also called outlier
analysis or outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction
data over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends, sequential
patterns, clustering, and classification. It analyzes past events or instances in the right sequence
to predict a future event.
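To make two of these techniques concrete, here is a small sketch using scikit-learn (assuming numpy and scikit-learn are installed); the two-feature data set below is invented purely for illustration.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import LocalOutlierFactor

    # Invented 2-D data: two loose groups plus one far-away point.
    X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],
                  [25.0, 1.0]])  # the last point is an obvious outlier

    # Clustering (technique 2): group records that are similar to each other.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print("cluster labels:", labels)

    # Outlier detection (technique 5): flag items that do not match the rest.
    flags = LocalOutlierFactor(n_neighbors=3).fit_predict(X)  # -1 = outlier
    print("outlier flags:", flags)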
KDD PROCESS
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
Cleaning in case of missing values.
Cleaning noisy data, where noise is a random or variance error.
Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple
sources into a common source (the data warehouse).
This also provides a single version of truth for the company for decision making and forecasting.
Database
The central database is the foundation of the data warehousing environment. This database is
implemented on RDBMS technology. However, this kind of implementation is constrained by
the fact that a traditional RDBMS is optimized for transactional database processing, not
for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are
resource-intensive and slow down performance.
The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and other changes needed to transform data into a unified format
in the data warehouse. They are also called Extract, Transform and Load (ETL) tools.
These ETL tools may generate cron jobs, background jobs, Cobol programs, shell scripts, etc.
that regularly update data in the data warehouse. These tools also help maintain the metadata.
Metadata
Metadata is data about data which defines the data warehouse. It is used for building, maintaining
and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source,
usage, values, and features of data warehouse data. It also defines how data can be changed
and processed. It is closely connected to the data warehouse.
Data on its own is meaningless until we consult the metadata, which tells us what each value represents.
Query Tools
One of the primary objects of data warehousing is to provide information to businesses to make
strategic decisions. Query tools allow users to interact with the data warehouse system.
Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented as an
option to a full-size data warehouse, as it takes less time and money to build. However, there is
no standard definition of a data mart; it differs from person to person.
In simple words, a data mart is a subsidiary of a data warehouse: a partition of data created for
a specific group of users.
Data marts could be created in the same database as the Datawarehouse or a physically
separate Database.
Middle Tier − In the middle tier, we have the OLAP server, which can be implemented either as a
Relational OLAP (ROLAP) model or as a Multidimensional OLAP (MOLAP) model.
Top-Tier − This tier is the front-end client layer. It holds the query tools, reporting tools,
analysis tools, and data mining tools.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss
OLAP operations in multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the
level of country.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
In the following example, drill-down is performed by stepping down the concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and
provides a new sub-cube. Consider the following diagram that shows how slice
works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-
cube. Consider the following diagram that shows the dice operation.
The dice operation shown here builds the sub-cube using selection criteria that involve
three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in
order to provide an alternative presentation of data. Consider the following
diagram that shows the pivot operation.
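These operations can be imitated on a small pandas DataFrame; the sketch below is illustrative only (real OLAP servers operate on pre-aggregated cubes), and the sales records, city names, and column names are invented.

    import pandas as pd

    # Invented sales records with location, time, and item dimensions.
    df = pd.DataFrame({
        "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai"],
        "country": ["India", "India", "India", "India"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "item":    ["Mobile", "Modem", "Mobile", "Modem"],
        "sales":   [100, 150, 200, 120],
    })

    # Roll-up: ascend the location hierarchy from city to country.
    rollup = df.groupby(["country", "quarter"])["sales"].sum()

    # Drill-down: descend again to the more detailed city level.
    drilldown = df.groupby(["city", "quarter"])["sales"].sum()

    # Slice: fix a single dimension, time = "Q1", to get a sub-cube.
    slice_q1 = df[df["quarter"] == "Q1"]

    # Dice: select on two or more dimensions at once.
    dice = df[df["quarter"].isin(["Q1", "Q2"]) & (df["city"] == "Delhi")]

    # Pivot: rotate the axes for an alternative presentation.
    print(df.pivot_table(values="sales", index="city",
                         columns="quarter", aggfunc="sum"))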
OLTP vs. OLAP:
Method: OLTP uses a traditional DBMS; OLAP uses the data warehouse.
Source: In OLTP, transactions are the source of data; for OLAP, different OLTP databases become the source of data.
Data integrity: An OLTP database must maintain data integrity constraints; an OLAP database is not frequently modified, so data integrity is not an issue.
Data quality: The data in an OLTP database is always detailed and organized; the data in an OLAP process might not be organized.
Back-up: OLTP requires complete backups of the data combined with incremental backups; OLAP needs a backup only from time to time, since backup is less critical than in OLTP.
User type: OLTP is used by data-critical users such as clerks, DBAs, and database professionals; OLAP is used by data-knowledge users such as knowledge workers, managers, and CEOs.
Purpose: OLTP is designed for real-time business operations; OLAP is designed for analysis of business measures by category and attribute.
Number of users: An OLTP database allows thousands of users; an OLAP database allows only hundreds of users.
Process: OLTP provides fast results for routinely used data; OLAP ensures that responses to queries are consistently quick.
Characteristic: OLTP is easy to create and maintain; OLAP lets the user create a view with the help of a spreadsheet.
Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple
outbound links from one single page to another single page, are ignored. PageRank is initialized to the
same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the
total number of pages on the web at that time, so each page in this example would have an initial value of
1. However, later versions of PageRank, and the remainder of this section, assume a probability
distribution between 0 and 1. Hence the initial value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is
divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to
A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had
links to all three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or
0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all of its existing value,
0.25, to the only page it links to, A. Since D had three outbound links, it would transfer one third of its
existing value, or approximately 0.083, to A. At the completion of this iteration, page A will have a
PageRank of approximately 0.458.
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank
score divided by the number of its outbound links L(v).
In the general case, the PageRank value for any page u can be expressed as:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where B_u is the set of pages that link to u and L(v) is the number of outbound links of page v.
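A direct translation of this iteration into Python might look as follows. This is a minimal sketch of the simplified algorithm only (no damping factor), using the same four-page link structure described above.

    # Simplified PageRank on the example: B -> {A, C}, C -> {A},
    # D -> {A, B, C}; page A has no outbound links in this toy web.
    links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
    rank = {page: 0.25 for page in links}  # initial value for each page

    def iterate(rank, links):
        """One iteration: each page divides its PageRank equally
        among its outbound links (dangling pages simply leak rank)."""
        new_rank = {page: 0.0 for page in links}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += rank[page] / len(targets)
        return new_rank

    rank = iterate(rank, links)
    print(round(rank["A"], 3))  # 0.458, matching the worked example above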
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse. Sometimes the data is loaded into the
data warehouse very frequently, and sometimes after longer but regular intervals. The
rate and period of loading depend solely on the requirements and vary from system to
system.
The ETL process can also use pipelining: as soon as some data is extracted, it can be
transformed, and during that period new data can be extracted. Likewise, while the
transformed data is being loaded into the data warehouse, the already extracted data
can be transformed.
ETL Tools: Most commonly used ETL tools are Sybase, Oracle Warehouse builder,
CloverETL and MarkLogic.
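As a toy illustration of the three steps, here is a hedged Python sketch of a miniature ETL job; the source file name sales_raw.csv, its columns, and the SQLite warehouse target are all assumptions made up for the example, not a reference to any of the tools above.

    import csv
    import sqlite3

    # Extract: read raw rows from a hypothetical source file.
    with open("sales_raw.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: unify formats (trim and uppercase region codes, cast
    # revenue to float) and drop rows with missing key fields.
    clean = [
        {"region": r["region"].strip().upper(), "revenue": float(r["revenue"])}
        for r in rows
        if r.get("region") and r.get("revenue")
    ]

    # Load: append the transformed rows into the warehouse table.
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, revenue REAL)")
    con.executemany("INSERT INTO sales VALUES (:region, :revenue)", clean)
    con.commit()
    con.close()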
Overcoming this challenge requires an awareness of heterogeneous data formats from the outset.
Evaluate your information formats early in the project. Next, a developer must convert the
information into a format that the data integration platform can handle. That way, you can analyze
your data.
This is a problem that can be easily solved at the beginning of the data process if you consider
which analytics tools “talk” to your data integration platform (and vice versa). By making the right
technology choices, you avoid a situation where your integrated data is rendered useless.
Data integration can pose multiple challenges during the implementation process if you do not
approach it the right way. Successful data integration requires knowledge and thorough planning.
In linear regression, a response variable y and a single predictor variable x are given. It
models y as a linear function of x, i.e.

y = b + wx

where b and w are regression coefficients specifying the y-intercept and the slope,
respectively. In data mining the regression coefficients can be thought of as weights of
attributes, so we can rewrite the equation as

y = w0 + w1x

Let D be the training set, consisting of values of the predictor variable x for some
population and their associated values of the response variable y. The regression
coefficients can be estimated by the method of least squares with the following equations:

w1 = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)²

w0 = ȳ − w1·x̄

where x̄ and ȳ are the means of the x and y values in D; the fitted line is then y = w0 + w1x.
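These closed-form estimates translate directly into code. The following sketch fits w0 and w1 on an invented training set (years of experience x against salary y, in thousands):

    import numpy as np

    # Invented training set D.
    x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
    y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

    x_bar, y_bar = x.mean(), y.mean()

    # Least-squares estimates of the regression coefficients.
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar

    print(f"y = {w0:.2f} + {w1:.2f} x")
    print("predicted y at x = 10:", round(w0 + w1 * 10, 1))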
Non-Linear regression
While a linear equation has one basic form, nonlinear equations can take many different forms. The easiest way to determine
whether an equation is nonlinear is to focus on the term “nonlinear” itself. Literally, it’s not linear. If the equation doesn’t meet
the criteria above for a linear equation, it’s nonlinear.
When the points do not show a linear dependency between them, the relationship can be
modeled by polynomial regression. Polynomial regression is a common form of non-linear regression.
Decision Types: Tactical decisions pertaining to particular business lines and ways of doing things
Data Warehouse
Focus: Enterprise-wide repository of disparate data sources
Data Sources: Many external and internal sources from different areas of an organization
Size: 100 GB minimum but often in the range of terabytes for large organizations
Normalization: Modern warehouses are mostly denormalized for quicker data querying and read
performance
Decision Types: Strategic decisions that affect the entire enterprise
Cost: Varies but often greater than $100,000; for cloud solutions costs can be dramatically lower as
organizations pay per use
Setup Time: At least a year for on-premise warehouses; cloud data warehouses are much quicker
to set up
Data Held: Raw data, metadata, and summary data
Introduction
Market Basket Analysis is one of the key techniques used by large retailers to
uncover associations between items. It works by looking for combinations of items
that occur together frequently in transactions. To put it another way, it allows
retailers to identify relationships between the items that people buy.
Association rules are widely used to analyze retail basket or transaction data. They are
intended to identify strong rules discovered in transaction data using measures of
interestingness.
Support(A => B) = frequency(A, B) / N
Note: P(AUB) is the probability of A and B occurring together. P denotes probability.
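In code, support (and the companion measure, confidence) reduce to simple counting over the N transactions. The basket data below is invented for illustration:

    # Invented transaction data: each basket is a set of items.
    baskets = [
        {"milk", "bread"}, {"milk", "bread", "butter"},
        {"bread", "biscuits"}, {"milk", "bread"}, {"biscuits"},
    ]
    N = len(baskets)

    def support(itemset):
        """Fraction of the N transactions containing every item in itemset."""
        return sum(itemset <= basket for basket in baskets) / N

    # Rule milk => bread: support = frequency(milk, bread) / N;
    # confidence = support(milk and bread) / support(milk).
    s = support({"milk", "bread"})
    c = s / support({"milk"})
    print(f"support = {s:.2f}, confidence = {c:.2f}")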
There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyze this huge amount of
data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation.
Once all these processes are over, we would be able to use this information in many
applications such as Fraud Detection, Market Analysis, Production Control, Science
Exploration, etc.
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. The information or
knowledge extracted in this way can be used in any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Apart from these, data mining can also be used in areas such as sports, astrology, and
Internet Web Surf-Aid.
3) Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect
fraud. For fraudulent telephone calls, it helps to find the destination of the call, its duration,
and the time of day or week. It also analyzes patterns that deviate from expected norms.
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
1) Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in
a company, the classes of items for sale include computers and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a
concept are called class/concept descriptions. These descriptions can be derived in the
following two ways −
Data Characterization − This refers to summarizing the data of the class under study. This class
under study is called the Target Class.
Data Discrimination − This refers to the mapping or classification of a class with some
predefined group or class.
2) Mining of Frequent Patterns
Frequent patterns are patterns that occur frequently in transactional data, such as frequent
itemsets, frequent subsequences, and frequent substructures.
3) Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together.
Mining associations is the process of uncovering the relationships among data and
determining association rules.
For example, a retailer generates an association rule that shows that 70% of time milk is sold
with bread and only 30% of times biscuits are sold with bread.
4) Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated-attribute-value pairs or between two item sets to analyze that if they
have positive, negative or no effect on each other.
5) Mining of Clusters
Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but highly different from the objects in other
clusters.
Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −
Classification − It predicts the class of objects whose class label is unknown. Its objective is to
find a derived model that describes and distinguishes data classes or concepts. The derived
model is based on the analysis of a set of training data, i.e., data objects whose class labels
are well known.
Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be used
for identification of distribution trends based on available data.
Outlier Analysis − Outliers may be defined as the data objects that do not comply with the
general behavior or model of the data available.
Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or
trends for objects whose behavior changes over time.
[Diagram: a concept hierarchy for the dimension location, with Country at the top level and lower levels c1–c6 and s1–s4 beneath it.]
Methods of discretisation:
1) Discretisation by binning
2) Discretisation by histogram
3) Discretisation by cluster, decision tree, and correlation analysis
Discretisation by binning:
1) Equal-frequency binning
4,8,9,15,21,21,24,25,26,28,29,34
Bin 1 : 4,8,9
Bin 2 : 15,21,21
Bin 3 : 24,25,26
Bin 4 : 28,29,34
2) Mean binning
Bin 1 : 7,7,7
Bin 2 : 19,19,19
Bin 3 : 25,25,25
Bin 4 : 30,30,30
3) Boundary binning (smoothing by bin boundaries): each value in a bin is replaced by the
closest bin boundary, i.e., the minimum or maximum value of that bin (a code sketch of
these variants follows).
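A brief Python sketch of these binning variants, applied to the twelve sorted values above (numpy assumed available):

    import numpy as np

    data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-frequency binning: four bins of three values each.
    for b in np.split(data, 4):
        means = np.full_like(b, round(b.mean()))     # smoothing by bin means
        lo, hi = b.min(), b.max()
        bounds = np.where(b - lo <= hi - b, lo, hi)  # closest bin boundary
        print(b, "-> means:", means, "| boundaries:", bounds)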
Discretisation by histogram: the following data are a list of prices of commonly sold items at
an electronics store; the numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Price (frequency): 1 (2), 5 (5), 8 (2), 10 (4), 12 (1), 14 (3), 15 (6), 18 (8), 20 (7), 21 (4), 25 (6), 28 (2), 30 (3)
[Charts: a histogram of the individual price frequencies listed above, and an equal-width histogram with buckets 1–10, 10–20, and 20–30.]
Min-max normalization: suppose the minimum and maximum values for the attribute income
are $12,000 and $98,000, and we want to map income to the range [0.0, 1.0]. A value of
$73,600 is then transformed to:

v' = (73600 − 12000) / (98000 − 12000) × (1.0 − 0) + 0 = 0.716
Z-score normalization:

v' = (v − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation of attribute A. Suppose that the mean
and standard deviation of the values for the attribute income are $54,000 and $16,000,
respectively. A value of $73,600 is transformed to:

v' = (73600 − 54000) / 16000 = 1.225
Decimal scaling normalization:
Suppose that the recorded values of A range from −986 to 917. The maximum absolute value
of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000
(i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
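All three normalization methods fit in a few lines of Python; the sketch below simply reproduces the three worked examples above.

    def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
        """Min-max normalization onto [new_min, new_max]."""
        return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

    def z_score(v, mean, std):
        """Z-score normalization: distance from the mean in std deviations."""
        return (v - mean) / std

    def decimal_scaling(v, j):
        """Decimal scaling: divide by 10**j so that |v'| < 1."""
        return v / 10 ** j

    print(round(min_max(73600, 12000, 98000), 3))  # 0.716
    print(round(z_score(73600, 54000, 16000), 3))  # 1.225
    print(decimal_scaling(-986, 3))                # -0.986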
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income
and occupation.
What is Classification?
Following are examples of cases where the data analysis task is classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager wants to predict whether a customer with a given profile will buy a new
computer (yes or no).
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for the loan application data and yes or no for the
marketing data.
What is Prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we need to predict a numeric value, so the data
analysis task is an example of numeric prediction. In this case, a model or predictor is
constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
The classifier is built from the training set made up of database tuples and their
associated class labels.
Each tuple that constitutes the training set is referred to as a category or class. These
tuples can also be referred to as sample, object or data points.
Data Cleaning − Data cleaning involves removing the noise and treating missing values. The
noise is removed by applying smoothing techniques, and the problem of missing values is
solved by replacing a missing value with the most commonly occurring value for that
attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation analysis is
used to know whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by any of the following
methods.
Normalization − The data is transformed by scaling all values of a given attribute so that they
fall within a small specified range. Normalization is used when, in the learning step, neural
networks or methods involving distance measurements are used.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Here are the criteria for comparing methods of classification and prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly;
the accuracy of a predictor refers to how well a given predictor can guess the value of the
predicted attribute for new data.
Speed − This refers to the computational cost in generating and using the classifier or
predictor.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
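As a hedged illustration, the following scikit-learn sketch learns a small decision tree for a buy_computer-style concept; the training tuples and attribute values are invented, and the induced tree need not match the one in the original figure.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented training tuples for the buy_computer concept.
    df = pd.DataFrame({
        "age":     ["youth", "youth", "middle", "senior", "senior", "middle"],
        "student": ["no", "yes", "no", "yes", "no", "yes"],
        "buys":    ["no", "yes", "yes", "yes", "no", "yes"],
    })

    # One-hot encode the categorical attributes for the learner.
    X = pd.get_dummies(df[["age", "student"]])
    y = df["buys"]

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))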