
Unit- II

Data Warehousing and Data Mining

What is Data Warehouse?

A Data Warehouse is a centralized repository, together with supporting software tools, that facilitates the analysis of large sets of business data to help an organization make decisions. The large amount of data in data warehouses comes from numerous sources, such as internal applications like marketing, sales, and finance; customer-facing apps; and external partner systems, among others. It is a centralized data repository for analysts that can be queried whenever required for business benefit. A data warehouse is mainly a data management system designed to enable and support business intelligence (BI) activities, particularly analytics. Data warehouses are designed to support querying, cleaning, manipulating, transforming, and analyzing data, and they also contain large amounts of historical data.
What is Data Warehousing?

The process of creating data warehouses to store large amounts of data is called Data Warehousing. Data Warehousing improves the speed and efficiency of accessing different data sets and makes it easier for company decision-makers to obtain insights that help the business and promote marketing tactics that set it apart from its competitors. We can say that it is a blend of technologies and components that aids the strategic use of data and information. The main goal of data warehousing is to build a store of historical data that can be retrieved and analyzed to supply helpful insight into the organization's operations.

Need for Data Warehousing

Data Warehousing is an increasingly essential tool for business intelligence. It allows organizations to make quality business decisions. A data warehouse improves data analytics, helps the business gain considerable revenue, and strengthens its ability to compete strategically in the market. By efficiently providing systematic, contextual data to an organization's business intelligence tools, data warehouses support the discovery of more practical business strategies.

Business User: Business users or customers need a data warehouse to look at summarized data from the past. Since these users may come from a non-technical background, the data should be presented to them in an uncomplicated way.

1. Maintains consistency: Data warehouses are programmed to apply a consistent format to all data collected from different sources, which makes it effortless for company decision-makers to analyze and share data insights with their colleagues around the globe. Standardizing the data also reduces the risk of errors in interpretation and improves overall accuracy.
2. Stores historical data: Data warehouses are also used to store historical data, that is, time-variant data from the past, and this input can be used for various purposes.
3. Supports strategic decisions: Data warehouses contribute to making better strategic decisions. Some business strategies may depend on the data stored within the data warehouses.
4. Fast response time: A data warehouse has to be prepared for somewhat sudden loads and types of queries, which demands a high degree of flexibility and low latency.

Characteristics of Data warehouse:

1. Subject Oriented: A data warehouse is subject-oriented because it delivers information organized around a particular theme; the data warehousing process is designed to handle a specific, well-defined subject. These themes are often sales, distribution, marketing, etc.
2. Time-Variant: Data is maintained over different intervals of time, such as weekly, monthly, or annually. The time horizon of a data warehouse is much longer than that of operational systems that handle online transaction processing (OLTP). Data residing in the data warehouse is tied to a particular interval of time and delivers information from a historical perspective; every record contains an element of time, directly or indirectly.
3. Non-volatile: The data residing in the data warehouse is permanent: it cannot be erased or deleted when new data is inserted. In the data warehouse, data is read-only and is only refreshed at particular intervals of time. Operations such as delete, update, and insert, which are performed in operational software applications, are absent in the data warehouse environment. There are only two types of data operations that can be done in the data warehouse:

o Data Loading
o Data Access

4. Integrated: A data warehouse is created by integrating data from numerous different sources, such as mainframe computers and relational databases. Additionally, it should use reliable naming conventions, formats, and codes. Integration of the data warehouse benefits the successful analysis of data. Consistency in naming conventions, column scaling, encoding structures, etc. needs to be confirmed. Integration of the data warehouse handles numerous subject-oriented warehouses.

Architecture & Components of Data Warehouse:

Data warehouse architecture defines the comprehensive architecture of data processing and presentation that will be useful for data analysis and decision making within the enterprise. Each organization has different data warehouses depending on its needs, but all of them are characterized by some standard components.
Data Warehouse applications are designed to support the user's data requirements; an example of this is online analytical processing (OLAP). These include functions such as forecasting, profiling, summary reporting, and trend analysis.

The architecture of the data warehouse mainly consists of the proper arrangement of its elements, to build an efficient data warehouse with software and hardware components. The elements and components may vary based on the requirements and circumstances of the organization.

1. Source Data Component:


In the Data Warehouse, the source data comes from different places. They are grouped into four categories:

o External Data: For data gathering, most executives and data analysts rely on external sources for much of the information they use. They use statistics relating to their organization that are produced by external sources and departments.
o Internal Data: In every organization, users keep their "private" spreadsheets, reports, client profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in every data warehouse.
o Operational System Data: Operational systems are principally meant to run the business. In each operational system, we periodically take the old data and store it in archived files.
o Flat files: A flat file is nothing but a text database that stores data in a plain text format. Flat files generally are text files that have all data processing and structure markup removed. A flat file contains a table with a single record per line; a minimal example of reading one follows.
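
A minimal sketch of reading such a flat file with Python's standard csv module is shown below; the file contents and field names are made up for the example.

```python
import csv
import io

# A flat file: plain text, one record per line, no structure markup.
# An in-memory example stands in for a real open("customers.csv").
flat_file = io.StringIO(
    "customer_id,name,city\n"
    "101,Asha,Pune\n"
    "102,Ravi,Delhi\n"
)

reader = csv.DictReader(flat_file)  # the first line is treated as the header
for record in reader:
    print(record["customer_id"], record["name"], record["city"])
```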

2. Data Staging:
After the data is extracted from various sources, now it’s time to prepare the
data files for storing in the data warehouse. The extracted data collected from
various sources must be transformed and made ready in a format that is
suitable to be saved in the data warehouse for querying and analysis. The
data staging contains three primary functions that take place in this part:
o Data Extraction: This stage handles the various data sources. Data analysts should employ suitable techniques for every data source.

o Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses a huge challenge, data transformation presents even bigger ones. We perform many individual tasks as part of data transformation. First, we clean the data extracted from each source. Standardization of data elements forms a large part of data transformation. Data transformation also involves combining pieces of data from different sources, purging source data that is not useful, and separating source records into new combinations. Once data transformation ends, we have a collection of integrated data that is clean, standardized, and summarized.

o Data Loading: When we complete the design and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and consumes a considerable amount of time.
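
Putting the three staging functions together, the following minimal Python sketch extracts rows from a stand-in source, transforms them into a uniform format, and loads them into a simple warehouse structure. The source rows, field names, and cleaning rules here are invented for illustration.

```python
# Hypothetical source rows as they might arrive from different systems.
source_rows = [
    {"cust": " ASHA ", "amount": "1,200", "date": "2023-01-05"},
    {"cust": "ravi",   "amount": "850",   "date": "2023-01-06"},
]

def extract():
    # In a real pipeline this would query source databases or read files.
    return source_rows

def transform(rows):
    # Standardize names and unify number formats before loading.
    return [{
        "customer": r["cust"].strip().title(),
        "amount": float(r["amount"].replace(",", "")),
        "date": r["date"],
    } for r in rows]

def load(rows, warehouse):
    # Initial load: move all prepared rows into warehouse storage.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```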

3. Data Storage in Warehouse:


Data storage for data warehousing is split into multiple repositories. These data repositories contain structured data in a highly normalized form for fast and efficient processing.
o Metadata: Metadata means data about data, i.e., it summarizes basic details regarding the data, making it easier to find and work with particular instances of data. Metadata may be created manually or generated automatically and can contain basic information about the data.
o Raw Data: Raw data is a set of data and information that has not yet been processed by machine or human since it was delivered from a particular data entity to the data supplier. Such data can be gathered from online sources to deliver deep insight into users' online behavior.
o Summary Data or Data Summary: A data summary is a brief aggregation or conclusion drawn from a large body of detail data. It is often what analysts produce at the end of their work: they write queries or code and declare the final result in the form of summarized data. Data summarization is an essential operation in data mining and processing.
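
For example, summary data can be produced by aggregating detail-level records. This small sketch, using made-up sales rows and only the Python standard library, totals amounts per region.

```python
from collections import defaultdict

# Detail-level sales records (hypothetical raw data).
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 75.5},
    {"region": "North", "amount": 40.0},
]

# Summarize: total sales per region.
summary = defaultdict(float)
for row in sales:
    summary[row["region"]] += row["amount"]

print(dict(summary))  # {'North': 160.0, 'South': 75.5}
```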

4. Data Marts:
Data marts are also part of the storage component of a data warehouse. A data mart stores the information of a specific function of an organization that is handled by a single authority. There may be any number of data marts in a particular organization, depending on its functions. In short, data marts contain subsets of the data stored in data warehouses.

Now, the users and analysts can use data for various applications like
reporting, analyzing, mining, etc. The data is made available to them
whenever required.
Data Warehousing life Cycle:

As we know, the data warehouse is made by combining data from multiple diverse sources with tools that support analytical reporting, structured and unstructured queries, and decision making for the organization. We need to follow a step-by-step approach for building and successfully implementing the data warehouse.

How does Data Warehouse work?

A Data Warehouse is like a central depository where data comes from different data sources. In a data warehouse, the data flows in from transactional systems and relational databases. A data warehouse periodically pulls data from various apps and systems; the data then goes through processing and formatting so that it matches the data already in the warehouse. This processed data is stored in the data warehouse, ready for further analysis to support decision making. The data formatting and processing depend upon the needs of the organization.

The Data could be in one of the following formats:

1. Structured
2. Semi-structured
3. Unstructured data

The data is processed and transformed so that users and analysts can access
the processed data in the Data Warehouse through Business Intelligence
tools, SQL clients, and spreadsheets. A data warehouse merges all
information coming from various sources into one global and complete
database. By merging all this information in one place, it becomes easier for
an organization to analyze its customers more comprehensively.

Latest Tools and Technologies for Data Warehousing:

Data warehousing has improved access to information, reduced query-response times, and allowed businesses to derive deep insights from huge volumes of data. Earlier, companies had to build a lot of infrastructure for data warehousing, but today cloud technology has remarkably reduced the cost and effort of data warehousing for businesses.
The field of data warehousing is rapidly evolving, and various cloud data warehousing tools and technologies have been developed for better decision making. Cloud-based data warehousing tools are fast, highly scalable, and available on a pay-per-use basis. Following are some data warehousing tools:

1. Amazon Redshift
2. Microsoft Azure
3. Google BigQuery
4. Snowflake
5. Micro Focus Vertica
6. Teradata
7. Amazon DynamoDB
8. PostgreSQL

All these are the top Data Warehousing Tools.


ETL (Extract, Transform, and Load)
Process
The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction, Transformation
and Loading.

The ETL process requires active inputs from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.

To maintain its value as a tool for decision-makers, a data warehouse needs to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system and needs to be agile, automated, and well documented.

How ETL Works?


ETL consists of four separate phases: extraction, cleansing, transformation, and loading.
Extraction
o Extraction is the operation of extracting information from a source system for further use
in a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed
data to the warehouse and keep it up-to-date, as sketched below.
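
One common way to implement such periodic, incremental extraction is to track a last-modified timestamp and pull only rows changed since the previous run. The following minimal sketch uses an in-memory list as a stand-in for a source table; the column names are assumptions.

```python
from datetime import datetime

# Hypothetical source table with a last_modified column.
source = [
    {"id": 1, "value": "A", "last_modified": datetime(2023, 1, 1)},
    {"id": 2, "value": "B", "last_modified": datetime(2023, 2, 1)},
    {"id": 3, "value": "C", "last_modified": datetime(2023, 3, 1)},
]

def extract_changes(since):
    # Pull only rows changed after the previous extraction run.
    return [row for row in source if row["last_modified"] > since]

last_run = datetime(2023, 1, 15)
changed = extract_changes(last_run)
print(changed)  # rows with id 2 and 3 only
```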

Cleansing
The cleansing stage is crucial in a data warehouse because it is supposed to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate associations between values.

The following examples show the importance of data cleansing:

If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date list of contact addresses, email addresses and telephone numbers must be available.

If a client or supplier calls, the staff responding should be able to quickly find the person in the enterprise database, but this requires that the caller's name or his/her company name is listed in the database.

If a user appears in the databases with two or more slightly different names or different account numbers, it becomes difficult to update the customer's information.
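
To make rectification and duplicate handling concrete, here is a minimal Python sketch; the correction dictionary, records, and the rule that an identical phone number marks a duplicate are all invented for the example.

```python
# Tiny synonym/typo dictionary, as an ETL cleansing tool might use.
corrections = {"nwe york": "New York", "new yrok": "New York"}

customers = [
    {"name": "John Smith",  "city": "nwe york", "phone": "555-0101"},
    {"name": "J. Smith",    "city": "new yrok", "phone": "555-0101"},
    {"name": "Priya Mehta", "city": "Mumbai",   "phone": "555-0202"},
]

cleaned, seen = [], set()
for c in customers:
    # Rectification: fix known typos and synonyms via the dictionary.
    c["city"] = corrections.get(c["city"].lower(), c["city"])
    # Treat records sharing a phone number as duplicates of one person.
    if c["phone"] in seen:
        continue
    seen.add(c["phone"])
    cleaned.append(c)

print(cleaned)  # two records remain; both city values are standardized
```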

Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.

The following points must be rectified in this phase:

o Loose text may hide valuable information. For example, the name "XYZ PVT Ltd" does not
explicitly show that this is a private limited company.
o Different formats can be used for individual data items. For example, a date can be saved as a
string or as three integers.

Following are the main transformation processes aimed at populating the reconciled
data layer:

o Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.

Cleansing and Transformation processes are often closely linked in ETL tools.
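
To make conversion, normalization, and matching concrete, the following small Python sketch reconciles two invented source layouts that store the same product weight under different field names and units.

```python
# Conversion and normalization: unify units of measure to kilograms.
def to_kg(value, unit):
    factors = {"kg": 1.0, "lb": 0.4536, "g": 0.001}
    return round(value * factors[unit], 3)

# Matching: the same logical field has different names in each source.
source_a = {"prod_id": 7, "weight_lb": 3.0}
source_b = {"item_no": 8, "weight_g": 1500}

reconciled = [
    {"product_id": source_a["prod_id"],
     "weight_kg": to_kg(source_a["weight_lb"], "lb")},
    {"product_id": source_b["item_no"],
     "weight_kg": to_kg(source_b["weight_g"], "g")},
]
print(reconciled)  # uniform field names and units in the reconciled layer
```
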
Loading
The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.

Loading can be carried out in two ways (a sketch of both follows this list):

1. Refresh: Data Warehouse data is completely rewritten. This means that the older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only the changes applied to the source information are added to the Data Warehouse. An update is typically carried out without deleting or modifying preexisting data. This method is used in combination with incremental extraction to update data warehouses regularly.
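
Both loading modes can be sketched with Python's built-in sqlite3 module; the table layout and rows here are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def refresh(rows):
    # Refresh: completely rewrite the warehouse table.
    con.execute("DELETE FROM sales")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def update(rows):
    # Update: append only new facts; preexisting rows are untouched.
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)

refresh([(1, 100.0), (2, 200.0)])   # initial population (static extraction)
update([(3, 50.0)])                 # incremental extraction feeds updates
print(con.execute("SELECT * FROM sales").fetchall())
```
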
Selecting an ETL Tool
Selecting an appropriate ETL tool is an important decision that has to be made for any ODS or data warehousing application. ETL tools are required to provide coordinated access to multiple data sources so that relevant data may be extracted from them. An ETL tool would generally contain facilities for data cleansing, reorganization, transformation, aggregation, calculation, and automatic loading of data into the target database.

What is Star Schema?


A star schema is the elementary form of a dimensional model, in which data are organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension contains reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. It is the simplest and most widely used data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table in a star schema contains the facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table generally contains facts at the same level of aggregation.

Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values. They are generally descriptive, textual values. Dimension tables are usually smaller in size than fact tables.

Fact tables store data about sales, for example, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is highly suitable for data warehouse database design because of the following features:

o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema


Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema, users can instantly analyze large, multidimensional data sets.

The main advantages of star schemas in a decision-support environment are:


Query Performance
Because a star schema database has a small number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently against a dimension table, are almost instantaneous. Large join queries that contain multiple tables take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central fact table. When two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records into a star schema database. By defining facts and dimensions and separating them into different tables, the impact of a load operation is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to the fact table.

Built-in referential integrity


A star schema has referential integrity built in when data is loaded. Referential integrity is enforced because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a dimension cannot be given the correct key value to be retrieved.

Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins are more meaningful to the end-user because they represent the fundamental relationships between parts of the underlying business. Users can also browse dimension table attributes before constructing a query.

Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected to it for time, branch, item, and geographic location. The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.

In this scenario, the SALES table contains only four key columns with IDs from the dimension tables TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for item data, three columns for branch data, and four columns for location data. Thus, the size of the fact table is significantly reduced. When we need to change an item, we need only make a single change in the dimension table, instead of making many changes in the fact table.
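
As a hedged illustration, the schema above can be written out in SQL (here executed through Python's built-in sqlite3 module). The column types, the measure columns, and the sample aggregate query are assumptions; the table and column names follow the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                       quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                       branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state TEXT, country TEXT);

-- Central fact table: foreign keys to every dimension plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time, item_key INTEGER REFERENCES item,
    branch_key INTEGER REFERENCES branch, location_key INTEGER REFERENCES location,
    units_sold INTEGER, dollars_sold REAL
);
""")

-- = None  # (placeholder removed)
# A typical star join: total dollars sold per city, joining only via the fact table.
query = """
SELECT l.city, SUM(s.dollars_sold)
FROM sales s JOIN location l ON s.location_key = l.location_key
GROUP BY l.city;
"""
print(con.execute(query).fetchall())  # empty until facts are loaded
```
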
Data Mining

Data mining is one of the most useful techniques that help entrepreneurs, researchers,
and individuals to extract valuable information from huge sets of data. Data mining is
also called Knowledge Discovery in Database (KDD). The knowledge discovery
process includes Data cleaning, Data integration, Data selection, Data transformation,
Data mining, Pattern evaluation, and Knowledge presentation.


What is Data Mining?


The process of extracting information from huge sets of data in order to identify patterns and trends that allow a business to make data-driven decisions is called Data Mining.

In other words, Data Mining is the process of investigating hidden patterns in data from various perspectives and categorizing them into useful information, which is collected and assembled in particular areas such as data warehouses for efficient analysis. Data mining algorithms help decision making and other business requirements, ultimately cutting costs and generating revenue.

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information.

Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster and with low operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There is a huge amount of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or to develop the company. There are many powerful instruments and techniques available to mine data and find better insights from it.

Types of Data Mining


Data mining can be performed on the following types of data:

Relational Database:
A relational database is a collection of multiple data sets formally organized by tables,
records, and columns from which data can be accessed in various ways without having
to recognize the database tables. Tables convey and share information, which facilitates
data searchability, reporting, and organization.

Data warehouses:

A Data Warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. A huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than for transaction processing.

Data Repositories:

A Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases in which an organization has kept various kinds of information.

Object-Relational Database:

A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc.

One of the primary objectives of the object-relational data model is to close the gap between the relational database and the object-oriented modeling practices frequently used in many programming languages, for example, C++, Java, and C#.

Transactional Database:

A transactional database refers to a database management system (DBMS) that can undo (roll back) a database transaction if it is not performed appropriately. Although this was a unique capability long ago, today most relational database systems support such transactional activity, as sketched below.
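
A minimal sketch of this rollback behavior, using Python's built-in sqlite3 module and an invented accounts table, is shown below.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO accounts VALUES ('A', 100.0), ('B', 0.0)")
con.commit()  # make the starting balances permanent

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
        raise RuntimeError("transfer failed mid-way")  # simulate a failure
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")
except RuntimeError:
    pass  # the partial debit was automatically undone

print(con.execute("SELECT * FROM accounts").fetchall())
# [('A', 100.0), ('B', 0.0)] -- both rows unchanged
```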

Advantages of Data Mining


o The Data Mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.

Disadvantages of Data Mining


o There is a probability that organizations may sell useful customer data to other organizations for money. It has been reported, for example, that American Express sold its customers' credit card purchase data to other organizations.
o Much data mining analytics software is difficult to operate and needs advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, the selection of the right data mining tool is a very challenging task.
o Data mining techniques are not precise, and so they may lead to severe consequences in certain conditions.

Data Mining Applications


Data Mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine prices, consumer preferences, product positioning, and impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers. The following are areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics to gain better insights and to identify best practices that will enhance health care services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:

Market basket analysis is a modeling technique based on the hypothesis that if you buy a specific group of products, you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. The resulting data may assist the retailer in understanding the buyer's requirements and in altering the store's layout accordingly. Analytical comparison of results between different stores and between customers in different demographic groups can also be performed; a small worked example follows.
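
As a toy illustration of market basket analysis, the sketch below computes the support and confidence of item pairs over a handful of invented transactions; real analyses apply algorithms such as Apriori to much larger data.

```python
from itertools import combinations
from collections import Counter

# Hypothetical point-of-sale transactions (each basket is a set of items).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n                  # how often a and b co-occur
    confidence = count / item_counts[a]  # P(b is bought | a is bought)
    if support >= 0.5:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```
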
Data mining in Education:

Educational data mining is a newly emerging field concerned with developing techniques that extract knowledge from data generated in educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science. An institution can use data mining to make precise decisions and also to predict students' results. With these results, the institution can focus on what to teach and how to teach it.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset a manufacturing company possesses. Data mining tools can be beneficial for finding patterns in complex manufacturing processes. Data mining can be used in system-level design to discover the relationships between product architecture, product portfolio, and the data needs of customers. It can also be used to forecast the product development period, cost, and expectations, among other tasks.

Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with its customers, a business organization needs to collect and analyze data. With data mining technologies, the collected data can be used for analytics.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud. Traditional methods of fraud detection are somewhat time-consuming and complicated. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods start from a collection of sample records classified as fraudulent or non-fraudulent; a model is constructed from this data and then used to identify whether a new record is fraudulent or not, as the sketch below illustrates.
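
A minimal sketch of such a supervised approach is shown below, assuming scikit-learn is installed; the features, sample records, and labels are entirely made up for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Sample records: [transaction amount, transactions in the last hour]
X = [[20, 1], [35, 2], [900, 9], [15, 1], [850, 8], [40, 2]]
y = [0, 0, 1, 0, 1, 0]  # 1 = labeled fraudulent, 0 = non-fraudulent

# Build a model from the labeled sample records.
model = LogisticRegression().fit(X, y)

# Use the model to classify a new, unseen transaction.
print(model.predict([[700, 7]]))        # likely [1] (flagged as fraudulent)
print(model.predict_proba([[700, 7]]))  # class probabilities
```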

Data Mining in Lie Detection:

Apprehending a criminal is not the hard part; bringing out the truth is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, etc. These techniques also include text mining, which seeks meaningful patterns in data that is usually unstructured text. Information collected from previous investigations is compared, and a model for lie detection is constructed.

Data Mining in Financial Banking:

The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not immediately evident to managers or executives, because the volume of data is too large or it is generated too rapidly for experts to analyze. Managers may use these findings for better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution. The challenges could be related to performance, data, methods, techniques, etc. The data mining process becomes effective when the challenges or problems are correctly recognized and adequately resolved.
Incomplete and noisy data:

Data mining is the process of extracting useful data from large volumes of data. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to faulty measuring instruments or because of human error. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the information into their system. A person may mistype a digit when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. Data can also be changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging; a simple validation pass like the one sketched below can catch some of them.
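
A simple validation pass can flag some of this noisy and incomplete data before it enters the mining process. The sketch below, with an assumed US-style phone format and invented records, illustrates the idea.

```python
import re

# Simple validation to catch digit-entry mistakes and missing values.
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # assumed US-style format

records = [
    {"name": "Alice", "phone": "415-555-0134"},
    {"name": "Bob",   "phone": "415-55-0134"},  # digit missing: noisy data
    {"name": "Carol", "phone": None},           # not disclosed: incomplete data
]

valid, rejected = [], []
for r in records:
    if r["phone"] and PHONE_RE.match(r["phone"]):
        valid.append(r)
    else:
        rejected.append(r)  # route to manual review or re-collection

print(len(valid), "valid;", len(rejected), "flagged")
```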

Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be in databases, individual systems, or even on the internet. Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it may not be feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information from them is a tough task. Most of the time, new technologies, tools, and methodologies have to be developed to obtain specific information.

Performance:

The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used. If the designed algorithm and techniques are not up to the mark,
then the efficiency of the data mining process will be affected adversely.

Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of the purchased items, then it
reveals data about buying habits and preferences of the customers without their
permission.

Data Visualization:

In data mining, data visualization is a very important process because it is the primary method of showing the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express. But many times, presenting the information to the end-user in a precise and simple way is difficult. Since the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented to make data mining successful.

There are many more challenges in data mining in addition to the above-mentioned problems. More problems are revealed as the actual data mining process begins, and the success of data mining relies on overcoming all these difficulties.