
What is data mining?

Data mining, also known as knowledge discovery in data (KDD), is the process
of uncovering patterns and other valuable information from large data sets.
Given the evolution of data warehousing technology and the growth of big
data, adoption of data mining techniques has rapidly accelerated over the last
couple of decades, assisting companies by transforming their raw data into
useful knowledge.

Data mining has improved organizational decision-making through insightful
data analyses. The data mining techniques that underpin these analyses serve
two main purposes: they can either describe the target dataset or predict
outcomes through the use of machine learning algorithms.
These methods are used to organize and filter data, surfacing the most
interesting information, from fraud detection to user behaviors, bottlenecks,
and even security breaches.

Data mining process


The data mining process involves a number of steps from data collection to
visualization to extract valuable information from large data sets. As
mentioned above, data mining techniques are used to generate descriptions
and predictions about a target data set. Data scientists describe data through
their observations of patterns, associations, and correlations. They also classify
and cluster data using classification, regression, and clustering methods, and
identify outliers for use cases like spam detection.

Data mining usually consists of four main steps: setting objectives, data
gathering and preparation, applying data mining algorithms, and evaluating
results.

1. Set the business objectives: This can be the hardest part of the data mining
process, and many organizations spend too little time on this important step.
Data scientists and business stakeholders need to work together to define the
business problem, which helps inform the data questions and parameters for a
given project. Analysts may also need to do additional research to understand
the business context appropriately.
2. Data preparation: Once the scope of the problem is defined, it is easier for
data scientists to identify which data set will help answer the business's
pertinent questions. Once they collect the relevant data, it is cleaned,
removing noise such as duplicates, missing values, and outliers. Depending on
the dataset, an additional step may be taken to reduce the number of
dimensions, as too many features can slow down any subsequent computation.
Data scientists will look to retain the most important predictors to ensure
optimal accuracy within any models (a minimal preparation sketch follows this
list).
3. Model building and pattern mining: Depending on the type of analysis, data
scientists may investigate any interesting data relationships, such as sequential
patterns, association rules, or correlations. While high-frequency patterns have
broader applications, sometimes the deviations in the data can be more
interesting, highlighting areas of potential fraud.
Deep learning algorithms may also be applied to classify or cluster a data set
depending on the available data. If the input data is labelled (i.e., supervised
learning), a classification model may be used to categorize data, or
alternatively, a regression model may be applied to predict the likelihood of a
particular assignment. If the dataset isn't labelled (i.e., unsupervised learning),
the individual data points in the training set are compared with one another to
discover underlying similarities, clustering them based on those characteristics
(see the companion sketch after this list).
4. Evaluation of results and implementation of knowledge: Once the data is
aggregated, the results need to be evaluated and interpreted. Finalized results
should be valid, novel, useful, and understandable. When these criteria are
met, organizations can use this knowledge to implement new strategies,
achieving their intended objectives; the companion sketch below ends with
this evaluation step.

Data Warehouse
A core component of business intelligence, a data warehouse pulls together
data from many different sources into a single data repository for sophisticated
analytics and decision support.

What is a data warehouse?


A data warehouse, or enterprise data warehouse (EDW), is a system that
aggregates data from different sources into a single, central, consistent data
store to support data analysis, data mining, artificial intelligence (AI), and
machine learning. A data warehouse system enables an organization to run
powerful analytics on huge volumes (petabytes) of historical
data in ways that a standard database cannot.

Data warehousing systems have been a part of business intelligence (BI)
solutions for over three decades, but they have evolved recently with the
emergence of new data types and data hosting methods. Traditionally, a data
warehouse was hosted on-premises—often on a mainframe computer—and its
functionality was focused on extracting data from other sources, cleansing and
preparing the data, and loading and maintaining the data in a relational
database. More recently, a data warehouse might be hosted on a dedicated
appliance or in the cloud, and most data warehouses have added analytics
capabilities and data visualization and presentation tools.
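
That traditional extract-cleanse-load flow can be sketched in a few lines of
Python; the source files, join key, and sqlite target below are hypothetical
stand-ins, not a production design:

```python
import sqlite3

import pandas as pd

# Extract: pull raw data from hypothetical source systems.
orders = pd.read_csv("crm_orders.csv")
billing = pd.read_csv("billing_export.csv")

# Transform: cleanse and conform the data into a consistent shape.
merged = orders.merge(billing, on="order_id", how="inner")
merged = merged.drop_duplicates().dropna(subset=["order_id", "amount"])

# Load: maintain the result in a relational store (sqlite as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    merged.to_sql("fact_orders", conn, if_exists="append", index=False)
```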

Data Warehouse Delivery Process


Now we discuss the delivery process of the data warehouse. The main steps in
the data warehouse delivery process are as follows:

IT Strategy: A DWH project must have an IT strategy for procuring and
retaining funding.

Business Case Analysis: After the IT strategy has been designed, the next step is
the business case. It is essential to understand the level of investment that can
be justified and to recognize the projected business benefits that should be
derived from using the data warehouse.

Education & Prototyping: The company will experiment with the ideas of data
analysis and educate itself on the value of the data warehouse. This is
valuable and should be required if this is the company's first exposure to the
benefits of decision support. Prototyping can advance this education and is
better than abstract working models. Prototyping requires business
requirements, a technical blueprint, and structures.

Business Requirements: These include:

The logical model for data within the data warehouse.

The source systems that provide this data (mapping rules; a sketch of such
rules follows this list).

The business rules to be applied to the information.

The query profiles for the immediate requirement.
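
Mapping rules are often captured as simple, reviewable metadata. A
hypothetical sketch, with made-up table and column names:

```python
# Hypothetical mapping rules: each warehouse column traced back to its
# source system field, plus the business rule applied on the way in.
mapping_rules = {
    "fact_sales.amount": {
        "source": "erp.invoice_lines.net_amount",
        "rule": "convert to a single currency at the daily rate",
    },
    "dim_customer.region": {
        "source": "crm.accounts.sales_region",
        "rule": "map legacy region codes to the current taxonomy",
    },
}
```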

Technical Blueprint: This stage arranges the architecture of the warehouse. It
produces an architecture plan that satisfies long-term requirements, laying out
the server and data mart architecture and the essential components of the
database design.

Building the Vision: This is the phase where the first production deliverable is
produced. This stage will probably create significant infrastructure elements
for extracting and loading information, but limits them to the extraction and
loading of the initial information sources.

History Load: The next step is the one where the remainder of the required
history is loaded into the data warehouse. No new entities are added to the
data warehouse, but additional physical tables will probably be created to
store the increased record volumes (see the sketch below).
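
A sketch of what a history load might look like, using an in-memory sqlite
database and invented table names: the schema stays the same, but a separate
physical table per year absorbs the extra volume.

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    # Hypothetical staging table holding the historical records to load.
    conn.execute(
        "CREATE TABLE staging_orders (order_id INTEGER, amount REAL, order_date TEXT)"
    )
    conn.execute(
        "INSERT INTO staging_orders VALUES "
        "(1, 9.5, '2021-03-01'), (2, 4.0, '2022-07-14')"
    )

    for year in (2021, 2022):
        # No new entities: the schema is unchanged, but an extra physical
        # table per year stores the increased record volumes.
        conn.execute(
            f"CREATE TABLE fact_orders_{year} AS "
            f"SELECT * FROM staging_orders WHERE order_date LIKE '{year}-%'"
        )
```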

Ad-Hoc Query: In this step, we configure an ad-hoc query tool to operate
against the data warehouse. These end-customer access tools are capable of
automatically generating the database query that answers any question posed
by the user (a toy illustration follows).
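
As a toy illustration of such a tool's core idea (the star schema and column
whitelist below are invented), a user's question can be reduced to parameters
that drive SQL generation:

```python
def build_query(measure: str, dimension: str, year: int) -> str:
    """Turn a user's question into the SQL that answers it."""
    # Whitelist what the tool may expose (hypothetical star schema).
    measures = {"amount", "quantity"}
    dimensions = {"region", "product"}
    if measure not in measures or dimension not in dimensions:
        raise ValueError("unknown measure or dimension")
    return (
        f"SELECT {dimension}, SUM({measure}) AS total "
        f"FROM fact_orders WHERE order_year = {year:d} "
        f"GROUP BY {dimension}"
    )

# 'Which regions sold the most in 2023?' becomes:
print(build_query("amount", "region", 2023))
```
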
Automation: The automation phase is where many of the operational
management processes are fully automated within the DWH. These would
include:

Extracting and loading the data from a variety of source systems.

Transforming the information into a form suitable for analysis.

Backing up, restoring, and archiving data.

Generating aggregations from predefined definitions within the data
warehouse (a sketch follows this list).

Monitoring query profiles and determining the appropriate aggregates to
maintain system performance.
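
As one sketch of the aggregation step, a scheduler could replay a small table
of predefined aggregate definitions; everything below (table names,
definitions, the sqlite target) is hypothetical:

```python
import sqlite3

# Predefined aggregate definitions that the automation replays on a schedule.
AGGREGATES = {
    "agg_sales_by_region": (
        "SELECT region, SUM(amount) AS total FROM fact_orders GROUP BY region"
    ),
    "agg_sales_by_month": (
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS total "
        "FROM fact_orders GROUP BY month"
    ),
}

def refresh_aggregates(conn: sqlite3.Connection) -> None:
    # Rebuild each aggregate table from its stored definition.
    for table, definition in AGGREGATES.items():
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"CREATE TABLE {table} AS {definition}")

with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE fact_orders (region TEXT, amount REAL, order_date TEXT)")
    conn.execute(
        "INSERT INTO fact_orders VALUES "
        "('EMEA', 10.0, '2023-01-15'), ('APAC', 7.5, '2023-02-03')"
    )
    refresh_aggregates(conn)
```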

Extending Scope: In this phase, the scope of the DWH is extended to address a
new set of business requirements. This involves loading additional data
sources into the DWH, i.e., introducing new data marts.

Requirement Evolution: This is the last step of the delivery process of a data
warehouse. Requirements are not static; they evolve continuously, and the
data warehouse must evolve alongside them.
