A data warehouse is built to support management functions whereas data mining is used to extract useful information and
patterns from data. Data warehousing is the process of compiling information into a data warehouse.
Data Warehousing:
It is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed,
rather than used for transaction processing. A data warehouse is designed to support the management decision-making
process by providing a platform for data cleaning, data integration, and data consolidation. A data warehouse contains
subject-oriented, integrated, time-variant, and non-volatile data. The data warehouse consolidates data from many sources
while ensuring data quality, consistency, and accuracy. It improves system performance by separating analytics
processing from transactional databases. Data flows into a data warehouse from the various databases. A data warehouse
works by organizing data into a schema that describes the layout and type of data. Query tools analyze the data tables
using the schema.
For example, a data warehouse might combine customer information from an organization's point-of-sale
systems, its mailing lists, website, and comment cards.
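As a rough illustration of the example above, the following Python sketch merges customer records from two sources into one consolidated view. The source names, field names, and the choice of email as the matching key are all assumptions made for illustration; a real warehouse would use its own keys and cleansing rules.

```python
# Hypothetical extracts from two source systems; all fields are illustrative.
pos_sales = [
    {"email": "ana@example.com", "name": "Ana", "total_spent": 120.0},
    {"email": "bo@example.com", "name": "Bo", "total_spent": 45.5},
]
mailing_list = [
    {"email": "ANA@example.com ", "subscribed": True},
    {"email": "cy@example.com", "subscribed": False},
]


def normalize_email(raw):
    """Clean the matching key so the same customer lines up across sources."""
    return raw.strip().lower()


def integrate(*sources):
    """Merge records from every source into one row per customer email."""
    customers = {}
    for source in sources:
        for record in source:
            key = normalize_email(record["email"])
            row = customers.setdefault(key, {"email": key})
            for field, value in record.items():
                if field != "email":
                    row[field] = value
    return customers


warehouse_customers = integrate(pos_sales, mailing_list)
for email, row in sorted(warehouse_customers.items()):
    print(email, row)
```

The cleansing step here is only the email normalization; it is what lets the point-of-sale record and the mailing-list record for the same customer end up in a single row.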
The data warehouse’s job is to make any form of corporate data easier to understand. The majority of the user’s job
will consist of inputting raw data.
The capacity to update continuously and frequently is the key benefit of this technology. As a result, data warehouses
are perfect for organizations and entrepreneurs who want to stay current with their target audience and customers.
It makes data more accessible to businesses and organizations.
A data warehouse holds a large volume of historical data that users can use to evaluate different periods and trends in
order to create predictions for the future.
There is a great risk of accumulating irrelevant and useless data. Data loss and erasure are other potential issues.
Data is gathered from various sources in a data warehouse. Cleansing and transformation of the data are required. This
could be a difficult task.
Data Mining:
It is the process of finding patterns and correlations within large data sets to identify relationships between data. Data
mining tools allow a business organization to predict customer behavior. Data mining tools are used to build risk models
and detect fraud. Data mining is used in market analysis and management, fraud detection, corporate analysis, and risk
management.
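One simple way to picture fraud detection is as outlier detection: flag values that sit far from the typical pattern. The sketch below uses a z-score rule on made-up payment amounts. The data, the threshold of 2.0, and the function name are assumptions for illustration, and real fraud models are considerably more elaborate.

```python
from statistics import mean, pstdev


def flag_outliers(amounts, threshold=2.0):
    """Return (index, amount) pairs whose z-score exceeds the threshold."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, amount)
        for i, amount in enumerate(amounts)
        if abs(amount - mu) / sigma > threshold
    ]


# Hypothetical card payments; the last one is far larger than the rest.
card_payments = [20, 25, 22, 19, 24, 21, 23, 500]
suspicious = flag_outliers(card_payments)
print(suspicious)
```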
For example, in the medical field:
Data mining enables more accurate diagnostics. Having all of the patient's information, such as medical records,
physical examinations, and treatment patterns, allows more effective treatments to be prescribed. It also enables more
effective, efficient and cost-effective management of health resources by identifying risks, predicting illnesses in
certain segments of the population or forecasting the length of hospital admission. Detecting fraud and irregularities,
and strengthening ties with patients with an enhanced knowledge of their needs are also advantages of using data
mining in medicine.
Data mining aids in a variety of data analysis and sorting procedures. Identifying and detecting undesired faults in a
system is one of its best applications, because it allows risks to be eliminated sooner.
In comparison to other statistical data applications, data mining methods are both cost-effective and efficient.
Companies can take advantage of this analytical tool by providing appropriate and easily accessible knowledge-based
data.
Detecting and identifying undesirable faults that occur in a system is one of the most notable uses of data mining.
Data mining isn’t always 100 percent accurate, and if done incorrectly, it can lead to data breaches.
Organizations must devote a significant amount of resources to training and implementation. Furthermore, the
algorithms used in the creation of data mining tools cause them to work in different ways.
Comparison between Data Mining and Data Warehousing:
S. No. | Basis of Comparison | Data Warehousing | Data Mining
1. | Definition | A data warehouse is a database system that is designed for analytical work instead of transactional work. | Data mining is the process of analyzing data patterns.
6. | Functionality | Data warehouses are subject-oriented, integrated, time-varying, and non-volatile. | AI, statistics, databases, and machine learning systems are all used in data mining technologies.
Data mining typically offers the following key capabilities:
o Sift through all the chaotic and repetitive noise in your data.
o Understand what is relevant and then make good use of that information to assess likely outcomes.
o Accelerate the pace of making informed decisions.
Each of the following data mining techniques addresses different business problems and provides a different insight
into each of them. Understanding the type of business problem you need to solve helps you choose the technique that
will yield the best results. Data mining types can be divided into two basic groups: predictive and descriptive.
As the name signifies, predictive data mining analyzes data to help anticipate what may happen later (in the future)
in a business. Predictive data mining can be further divided into four types, listed below:
o Classification Analysis
o Regression Analysis
o Time Series Analysis
o Prediction Analysis
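To make one of these predictive types concrete, here is a minimal sketch of regression analysis: fitting a straight line by ordinary least squares and using it to forecast the next value. The monthly sales figures and the names `fit_line`, `months`, and `sales` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical history: sales observed in months 1 to 4.
months = [1, 2, 3, 4]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(months, sales)
forecast = intercept + slope * 5  # predicted sales for month 5
print(slope, intercept, forecast)
```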
The main goal of the Descriptive Data Mining tasks is to summarize or turn given data into relevant information. The
Descriptive Data-Mining Tasks can also be further divided into four types that are as follows:
o Clustering Analysis
o Summarization Analysis
o Association Rules Analysis
o Sequence Discovery Analysis
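Association rules analysis, one of the descriptive tasks listed above, is commonly evaluated with two measures: support (how often an itemset appears) and confidence (how often the rule holds when its left-hand side appears). The sketch below computes both for made-up shopping baskets; the transactions and function names are assumptions for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def count_containing(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count_containing(antecedent | consequent) / count_containing(antecedent)


rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(f"bread -> butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```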
Database System: A database system is used in the traditional way of storing and retrieving data. The major task of a
database system is to perform query processing. These systems are generally referred to as online transaction processing
(OLTP) systems and are used for the day-to-day operations of an organization.
Data Warehouse: A data warehouse is the place where a huge amount of data is stored. It is meant for users or knowledge
workers in the role of data analysis and decision making. These systems are supposed to organize and present data in
different formats and forms in order to serve the needs of a specific user for a specific purpose. These systems are
referred to as online analytical processing (OLAP) systems.
Difference between Database System and Data Warehouse:
Database System | Data Warehouse
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
ER based. | Star/Snowflake.
Both databases and data warehouses let multiple users access the same data at the same time. To get at the data, you
need to run queries in both the data warehouse and the database. Complex queries are typically used to access the data
warehouse, whereas simple queries are typically used to access the OLTP database. Finally, both a company's data
warehouse and its database can be hosted on-premises or in the cloud.
Online transaction processing (OLTP): Online Transaction Processing or OLTP refers to a class of systems capable of supporting
transaction-oriented applications such as online banking, shopping, order entry or sending text messages.
Online analytical processing (OLAP): Once transaction data has been collected by OLTP systems, enterprises use Online
Analytical Processing to extract insights and make more informed decisions. Hence, Online Analytical Processing can be
defined as a computing method that allows users to easily generate and query data for analytical purposes.
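The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical, and a single in-memory SQLite database stands in for what would normally be separate OLTP and OLAP systems.

```python
import sqlite3

orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)

# OLTP-style work: many short transactions, each writing a single row.
new_orders = [
    ("ana", "north", 30.0),
    ("bo", "south", 12.5),
    ("ana", "north", 7.5),
    ("cy", "south", 20.0),
]
for order in new_orders:
    with orders_db:  # commits on success, rolls back if an error is raised
        orders_db.execute(
            "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
            order,
        )

# OLAP-style work: one read-only query that aggregates over many rows.
revenue_by_region = orders_db.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(revenue_by_region)
```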
How is a data warehouse different from a database? Is a data warehouse a database? Let's take a look at this table:
Parameter | Database | Data Warehouse
Data designing technique | ER modeling techniques | Data modeling
Data schema approach | Data is structured in a flat relational method. Physical and logical schemas. | Data is structured in a dimensional and denormalized method. Star and snowflake schemas.
Concurrent users | Only one user can modify data at a time | Several users can modify data at the same time
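A star schema places one fact table at the center and joins it to surrounding dimension tables. The following sketch builds a tiny one with Python's built-in `sqlite3` module and runs a typical analytical query over it. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are all invented for illustration.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript(
    """
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds the measurements plus keys into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2022), (11, 2023);
    INSERT INTO fact_sales VALUES
        (1, 10, 15.0), (1, 11, 25.0), (2, 11, 40.0), (2, 11, 10.0);
    """
)

# Analytical query: join the fact table to its dimensions, then aggregate.
sales_by_category_year = dw.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(sales_by_category_year)
```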
Sr. No. | Category | OLAP (Online analytical processing) | OLTP (Online transaction processing)
1. | Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
7. | Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
12. | Backup and Recovery | It only needs backup from time to time as compared to OLTP. | Backup and recovery process is maintained rigorously.
17. | Nature of audience | Process that is focused on the market. | Process that is focused on the customer.
18. | Database Design | Design with a focus on the subject. | Design that is focused on the application.
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
   o Cleaning in case of missing values.
   o Cleaning noisy data, where noise is a random error or variance.
   o Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common
source (data warehouse).
   o Data integration using data migration tools.
   o Data integration using data synchronization tools.
   o Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved
from the data collection.
   o Data selection using neural networks.
   o Data selection using decision trees.
   o Data selection using Naive Bayes.
   o Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form
required by the mining procedure.
   Data transformation is a two-step process:
   o Data mapping: Assigning elements from the source base to the destination to capture transformations.
   o Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to extract potentially useful patterns.
   o Transforms task-relevant data into patterns.
   o Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge
based on given measures.
   o Finds the interestingness score of each pattern.
   o Uses summarization and visualization to make the data understandable by the user.
7. Knowledge Representation: Knowledge representation is defined as a technique that utilizes visualization tools to
represent data mining results.
   o Generate reports.
   o Generate tables.
   o Generate discriminant rules, classification rules, characterization rules, etc.
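The seven steps above can be sketched end to end on a toy data set. This is only an illustration under stated assumptions: the two store extracts, their fields, the choice to drop records with missing values, and the minimum support of 0.5 are all hypothetical, and counting item pairs stands in for a real mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources.
store_a = [
    {"customer": "ana", "age": 34, "basket": "bread,butter"},
    {"customer": "bo", "age": None, "basket": "bread,milk"},
]
store_b = [
    {"customer": "cy", "age": 51, "basket": "Bread,Butter"},
    {"customer": "di", "age": 29, "basket": "milk"},
]


def clean(records):
    """Step 1, cleaning: here, simply drop records with any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


# Step 2, integration: combine the cleaned sources into one collection.
integrated = clean(store_a) + clean(store_b)

# Step 3, selection: keep only the attribute relevant to the analysis.
selected = [record["basket"] for record in integrated]

# Step 4, transformation: turn each basket string into a set of lowercase items.
transformed = [
    frozenset(item.strip().lower() for item in basket.split(","))
    for basket in selected
]

# Step 5, mining: count how often each pair of items is bought together.
pair_counts = Counter(
    pair for basket in transformed for pair in combinations(sorted(basket), 2)
)

# Step 6, pattern evaluation: keep pairs whose support meets a minimum.
min_support = 0.5
patterns = {
    pair: count / len(transformed)
    for pair, count in pair_counts.items()
    if count / len(transformed) >= min_support
}

# Step 7, knowledge representation: present the result as a small report.
for pair, pair_support in sorted(patterns.items()):
    print(f"{pair[0]} + {pair[1]}: support {pair_support:.2f}")
```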
Note:
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new data can be
integrated and transformed in order to get different and more appropriate results.
Preprocessing of databases consists of Data cleaning and Data Integration.
ADVANTAGES AND DISADVANTAGES:
Advantages of KDD:
1. Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better
decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis, which
saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their customers’ needs and
preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data that
may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future trends and patterns.
Disadvantages of KDD:
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large amounts of data, which
can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to implement and interpret
the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination, if the data or
models are not properly understood or used.
4. Data quality: The KDD process depends heavily on the quality of the data; if the data is not accurate or consistent,
the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common problem in machine learning where a model learns
the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new
unseen data.
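Overfitting can be seen in a small synthetic experiment. In the sketch below, one model memorizes the noisy training labels and another fits a straight line. The data, the alternating noise pattern, and both models are constructed purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: the true relationship is y = 2x.
# Training labels carry alternating +1 / -1 noise; test labels are noise-free.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [x + 0.4 for x in range(9)]
test_y = [2 * x for x in test_x]


def memorizer(x):
    """Overfit model: return the label of the nearest training point, noise included."""
    nearest = min(train_x, key=lambda t: abs(t - x))
    return train_y[train_x.index(nearest)]


slope, intercept = fit_line(train_x, train_y)


def line(x):
    """Simpler model: the fitted straight line."""
    return intercept + slope * x


for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name, "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The memorizer reproduces the training data perfectly, yet it does worse than the straight line on the unseen test points, because it has learned the noise along with the signal.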