
Prepared By: P.K. Chaubey

Managing Data Resources: The Need for Data Management, Challenges of Data Management

Data management is the process of ingesting, storing, organizing and maintaining the data created and collected by an organization. Effective data
management is a crucial piece of deploying the IT systems that run business
applications and provide analytical information to help drive operational
decision-making and strategic planning by corporate executives, business
managers and other end users.
The data management process includes a combination of different functions that
collectively aim to make sure that the data in corporate systems is accurate,
available and accessible. Most of the required work is done by IT and data
management teams, but business users typically also participate in some parts of
the process to ensure that the data meets their needs and to get them on board
with policies governing its use.
This comprehensive guide to data management further explains what it is and
provides insight on the individual disciplines it includes, best practices for
managing data, challenges that organizations face and the business benefits of a
successful data management strategy. You’ll also find an overview of data
management tools and techniques.
Need of data management
Data increasingly is seen as a corporate asset that can be used to make more-
informed business decisions, improve marketing campaigns, optimize business
operations and reduce costs, all with the goal of increasing revenue and profits.
But a lack of proper data management can saddle organizations with
incompatible data silos, inconsistent data sets and data quality problems that
limit their ability to run business intelligence (BI) and analytics applications —
or, worse, lead to faulty findings.
Data management has also grown in importance as businesses are subjected to
an increasing number of regulatory compliance requirements, including data
privacy and protection laws such as GDPR and the California Consumer
Privacy Act. In addition, companies are capturing ever-larger volumes of data
and a wider variety of data types, both hallmarks of the big data systems many
have deployed. Without good data management, such environments can become
unwieldy and hard to navigate.


Challenges of Data management


While some companies are good at collecting data, they do not manage it well enough to make sense of it. Simply collecting data is not enough; enterprises and organizations need to understand from the start that data management and data analytics will only be successful when they first put some thought into how they will gain value from their raw data. They can then move beyond raw data collection with efficient systems for processing, storing, and validating data, as well as effective analysis strategies.
Another challenge of data management occurs when companies categorize data
and organize it without first considering the answers they hope to glean from
the data. Each step of data collection and management must lead toward
acquiring the right data and analyzing it in order to get the actionable
intelligence necessary for making truly data-driven business decisions.

Data Management Best Practices


The best way to manage data, and eventually get the insights needed to make data-driven
decisions, is to begin with a business question and acquire the data that is needed to answer
that question. Companies must collect vast amounts of information from various sources and
then utilize best practices while going through the process of storing and managing the data,
cleaning and mining the data, and then analyzing and visualizing the data in order to inform
their business decisions.
It’s important to keep in mind that data management best practices result in better analytics.
By correctly managing and preparing the data for analytics, companies optimize their Big
Data. A few data management best practices organizations and enterprises should strive to
achieve include:
 Simplify access to traditional and emerging data
 Scrub data to infuse quality into existing business processes
 Shape data using flexible manipulation techniques
Data management platforms enable organizations and enterprises to use data analytics in
beneficial ways, such as:
 Personalizing the customer experience
 Adding value to customer interactions
 Identifying the root causes of marketing failures and business issues in real time
 Reaping the revenues associated with data-driven marketing
 Improving customer engagement
 Increasing customer loyalty


Data Independence
If a database system is not multi-layered, then it becomes difficult to make any changes in the database system. This is why database systems are designed in multiple layers.
A database system normally contains a lot of data in addition to users’ data. For example, it stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to modify or update a set of metadata once it is stored in the database. But as a DBMS expands, it needs to change over time to satisfy the requirements of the users. If the entire data were dependent on a single representation, this would become a tedious and highly complex job.
Metadata itself follows a layered architecture, so that when we change data at one layer, it does not affect the data at another level. This data is independent of, but mapped to, the data in the other layers.
Logical Data Independence
Logical data is data about the database; that is, it describes how the data is managed inside the database. For example, a table (relation) stored in the database and all the constraints applied on that relation are part of the logical data.
Logical data independence is the mechanism that keeps this logical layer separate from the actual data stored on the disk. If we make changes to the table format, it should not change the data residing on the disk.
Physical Data Independence
All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data independence is the power to change the physical data without impacting the schema or logical data.
For example, if we want to change or upgrade the storage system itself − suppose we want to replace hard disks with SSDs − it should not have any impact on the logical data or schemas.
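As a small, concrete illustration of data independence (a hedged sketch only, using SQLite from Python's standard library; the table, view and column names are invented for this example): the application below reads through a view, so the underlying base table can later be reorganized or moved to different storage without the application query changing.

import sqlite3

# In-memory database used purely for illustration; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_raw (id INTEGER, name TEXT, college TEXT)")
conn.execute("INSERT INTO student_raw VALUES (200, 'ABCD', 'AKTU')")

# The application is written against this view (the logical layer), not the base table.
conn.execute("CREATE VIEW student AS SELECT id, name, college FROM student_raw")

# If the base table is later split, renamed, or moved to a new storage device,
# only the view definition (or the physical layer) changes; this query stays the same.
print(conn.execute("SELECT name, college FROM student").fetchall())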
Data Redundancy
In computer main memory, auxiliary storage and computer buses, data redundancy is the
existence of data that is additional to the actual data and permits correction of errors in stored
or transmitted data. The additional data can simply be a complete copy of the actual data, or
only select pieces of data that allow detection of errors and reconstruction of lost or damaged
data up to a certain level.
Data redundancy is a condition created within a database or data storage technology in which
the same piece of data is held in two separate places.


This can mean two different fields within a single database, or two different spots in multiple
software environments or platforms. Whenever data is repeated, it basically constitutes data
redundancy.
Data redundancy can occur by accident but is also done deliberately for backup and recovery
purposes
For example, by including additional data checksums, ECC memory is capable of detecting
and correcting single-bit errors within each memory word, while RAID 1 combines two hard
disk drives (HDDs) into a logical storage unit that allows stored data to survive a complete
failure of one drive. Data redundancy can also be used as a measure against silent data
corruption; for example, file systems such as Btrfs and ZFS use data and metadata
checksumming in combination with copies of stored data to detect silent data corruption and
repair its effects.
While different in nature, data redundancy also occurs in database systems that have values
repeated unnecessarily in one or more records or fields, within a table, or where the field is
replicated/repeated in two or more tables. Often this is found in unnormalized database designs and results in complicated database management, introduces the risk of corrupting the data, and increases the required amount of storage. When done on purpose from a previously normalized database schema, it may be considered a form of database denormalization, used to improve the performance of database queries.
For instance, when customer data are duplicated and attached with each product bought, then
redundancy of data is a known source of inconsistency since a given customer might appear
with different values for one or more of their attributes. Data redundancy leads to data
anomalies and corruption and generally should be avoided by design; applying database
normalization prevents redundancy and makes the best possible usage of storage.
Redundancy means having multiple copies of the same data in the database. This problem arises when a database is not normalized. Suppose a table of student details has the attributes: student ID, student name, contact, college name, course opted, and college rank.
ID    Name    Contact    College    Courses    Rank
200   ABCD    123        AKTU       MBA        1
201   PQRS    321        AKTU       MBA        1
202   WXYZ    456        AKTU       MBA        1
203   MNOP    654        AKTU       MBA        1
204   GHIJ    789        AKTU       MBA        1

Problems
Update Anomaly: If the rank of the college changes, then the change will have to be made everywhere the rank is stored in the database, which is time-consuming and computationally costly.
Deletion Anomaly: If the details of the students in this table are deleted, then the details of the college will also get deleted, which should not happen.
This anomaly occurs when the deletion of a data record results in losing some unrelated information that was stored as part of the record that was deleted from the table. It is not possible to delete some information without losing some other information in the table as well.


Insertion Anomaly: If the details of a student whose course has not yet been decided have to be inserted, then the insertion will not be possible until the course is decided for the student.
ID    Name    Contact    College    Courses    Rank
200   ABCD    123        AKTU       -          1

This problem happens when the insertion of a data record is not possible without adding
some additional unrelated data to the record.
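One way to remove the redundancy and anomalies illustrated above is to normalize the design: store the college details once in their own table and reference them from the student table. The following is a hedged sketch (SQLite via Python, with made-up names and values), not a prescribed schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE college (college_id TEXT PRIMARY KEY, rank INTEGER);
CREATE TABLE student (
    id INTEGER PRIMARY KEY,
    name TEXT,
    contact TEXT,
    college_id TEXT REFERENCES college(college_id),
    course TEXT              -- may stay NULL until the course is decided
);
""")
conn.execute("INSERT INTO college VALUES ('AKTU', 1)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?)",
    [(200, 'ABCD', '123', 'AKTU', 'MBA'),
     (201, 'PQRS', '321', 'AKTU', 'MBA'),
     (205, 'NEWS', '999', 'AKTU', None)])   # insertion anomaly gone: no course yet

# Update anomaly gone: the rank is stored once, so a single UPDATE is enough.
conn.execute("UPDATE college SET rank = 2 WHERE college_id = 'AKTU'")
print(conn.execute("""SELECT s.name, c.rank FROM student s
                      JOIN college c ON s.college_id = c.college_id""").fetchall())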

Data Consistency
Data consistency means that copies of the same data kept in different places agree with each other; inconsistency arises when those copies do not match.
Point-in-time consistency is an important property of backup files and a critical objective of
software that creates backups. It is also relevant to the design of disk memory systems,
specifically relating to what happens when they are unexpectedly shut down.
As a relevant backup example, consider a website with a database such as the online
encyclopedia, Wikipedia, which needs to be operational around the clock, but also must be
backed up with regularity to protect against disaster. Portions of Wikipedia are constantly
being updated every minute of every day, meanwhile, Wikipedia’s database is stored on
servers in the form of one or several very large files which require minutes or hours to back
up.
These large files as with any database contain numerous data structures which reference each
other by location. For example, some structures are indexes which permit the database
subsystem to quickly find search results. If the data structures cease to reference each other
properly, then the database can be said to be corrupted.
Application Consistency
Application Consistency is similar to Transaction consistency, but on a grander scale. Instead
of data consistency within the scope of a single transaction, data must be consistent within the
confines of many different transaction streams from one or more applications.
An application may be made up of many different types of data, such as multiple database
components, various types of files, and data feeds from other applications. Application
consistency is the state in which all related files and databases are in-synch and represent the
true status of the application.
Transaction Consistency
A transaction is a logical unit of work that may include any number of file or database
updates. During normal processing, transaction consistency is present only
 Before any transactions have run,
 Following the completion of a successful transaction and before the next
transaction begins, and
 When the application ends normally or the database is closed.
Following a failure of some kind, the data will not be transaction consistent if transactions
were in-flight at the time of the failure. In most cases, once the application or database is restarted, the incomplete transactions are identified and the updates relating to these transactions are either backed out or processing resumes with the next dependent write.
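The idea of a transaction as a logical unit of work can be shown in miniature. The hedged sketch below (SQLite via Python; the account table and values are invented) transfers money between two accounts: the two updates either commit together or are backed out together, so the data is never left in an in-flight state.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])

def transfer(conn, src, dst, amount):
    """A logical unit of work: both updates succeed, or neither does."""
    try:
        with conn:   # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            balance = conn.execute("SELECT balance FROM account WHERE name = ?",
                                   (src,)).fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")   # forces a rollback
    except ValueError:
        pass   # the incomplete transaction was backed out

transfer(conn, "A", "B", 30)    # commits: A=70, B=80
transfer(conn, "A", "B", 500)   # rolled back: balances remain consistent
print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())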


Basic Database Administration


Database administration refers to the whole set of activities performed by a database
administrator to ensure that a database is always available as needed. Other closely related
tasks and roles are database security, database monitoring and troubleshooting, and planning
for future growth.
Database administration is an important function in any organization that is dependent on one
or more databases.
The database administrator (DBA) is usually a dedicated role in the IT department for large
organizations. However, many smaller companies that cannot afford a full-time DBA usually
outsource or contract the role to a specialized vendor, or merge the role with another in the
ICT department so that both are performed by one person.
The primary role of database administration is to ensure maximum up time for the database
so that it is always available when needed. This will typically involve proactive periodic
monitoring and troubleshooting. This in turn entails some technical skills on the part of the
DBA. In addition to in-depth knowledge of the database in question, the DBA will also need
knowledge and perhaps training in the platform (database engine and operating system) on
which the database runs.
A DBA is typically also responsible for other secondary, but still critically important, tasks
and roles. Some of these include:
 Database Security: Ensuring that only authorized users have access to the
database and fortifying it against any external, unauthorized access.
 Database Tuning: Tweaking any of several parameters to optimize
performance, such as server memory allocation, file fragmentation and disk
usage.
 Backup and Recovery: It is a DBA’s role to ensure that the database has
adequate backup and recovery procedures in place to recover from any
accidental or deliberate loss of data.
 Producing Reports from Queries: DBAs are frequently called upon to
generate reports by writing queries, which are then run against the database.
It is clear from all the above that the database administration function requires technical
training and years of experience. Some companies that offer commercial database products,
such as Oracle DB and Microsoft’s SQL Server, also offer certifications for their specific
products. These industry certifications, such as Oracle Certified Professional (OCP) and
Microsoft Certified Database Administrator (MCDBA), go a long way toward assuring
organizations that a DBA is indeed thoroughly trained on the product in question. Because
most relational database products today use the SQL language, knowledge of SQL commands
and syntax is also a valuable asset for today’s DBAs.


Database Management Systems Concepts


A Database Management System (DBMS) is software for storing and retrieving users’ data while considering appropriate security measures. It consists of a group of programs which manipulate the database. The DBMS accepts requests for data from an application and instructs the operating system to provide the specific data. In large systems, a DBMS helps users and other third-party software to store and retrieve data.
A DBMS allows users to create their own databases as per their requirements. The term "DBMS" also covers the use of the database by other application programs. It provides an interface between the data and the software applications.
History of DBMS
Here are the important landmarks from its history:
 1960 – Charles Bachman designed the first DBMS system
 1970 – E. F. Codd of IBM proposed the relational model of data
 1976 – Peter Chen coined and defined the Entity-Relationship model, also known as the ER model
 1980 – The relational model becomes a widely accepted database component
 1985 – Object-oriented DBMSs develop.
 1990s – Incorporation of object-orientation in relational DBMSs.
 1991 – Microsoft ships MS Access, a personal DBMS that displaces many other personal DBMS products.
 1995 – First Internet database applications
 1997 – XML applied to database processing; many vendors begin to integrate XML into DBMS products.
Characteristics of Database Management System
 Provides security and removes redundancy
 Self-describing nature of a database system (see the sketch after this list)
 Insulation between programs and data abstraction
 Support of multiple views of the data
 Sharing of data and multiuser transaction processing
 DBMS allows entities and relations among them to form tables.
 It follows the ACID concept (Atomicity, Consistency, Isolation, and Durability).
 A DBMS supports a multi-user environment that allows users to access and manipulate data in parallel.
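To make the "self-describing nature" in the list above concrete, the hedged sketch below (SQLite via Python; the table is invented) queries the DBMS's own catalog, where metadata about the stored tables lives alongside the data itself.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("INSERT INTO product VALUES (1, 'Pen', 10.0)")

# Self-describing nature: the DBMS stores data about data (metadata) in its own
# catalog, which can be queried just like any other table.
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)
    print(sql)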
Popular DBMS Software
Here is a list of some popular DBMS systems:
 MySQL
 Microsoft Access
 Oracle
 PostgreSQL
 dBASE
 Microsoft SQL Server
 FoxPro
 SQLite
 IBM DB2
 LibreOffice Base
 MariaDB


Types of DBMS
Four Types of DBMS systems are:
1. Hierarchical DBMS
In a hierarchical database model, data is organized in a tree-like structure. Data is stored in a hierarchical (top-down or bottom-up) format and is represented using parent-child relationships. In a hierarchical DBMS, a parent may have many children, but each child has only one parent.
2. Network Model
The network database model allows each child to have multiple parents. It helps you to address the need to model more complex relationships, such as the orders/parts many-to-many relationship. In this model, entities are organized in a graph which can be accessed through several paths.
3. Relational model
Relational DBMS is the most widely used DBMS model because it is one of the easiest to work with. This model is based on normalizing data into the rows and columns of tables. Data in the relational model is stored in fixed structures and manipulated using SQL.
4. Object-Oriented Model
In the object-oriented model, data is stored in the form of objects. The structures, called classes, define the data held within them. This model defines a database as a collection of objects which store both data member values and operations.
Advantages of DBMS
 DBMS offers a variety of techniques to store & retrieve data
 DBMS serves as an efficient handler to balance the needs of multiple applications using the same data
 Uniform administration procedures for data
 Application programmers are never exposed to details of data representation and storage.
 A DBMS uses various powerful functions to store and retrieve data efficiently.
 Offers data integrity and security
 The DBMS applies integrity constraints to get a high level of protection against prohibited access to data.
 A DBMS schedules concurrent access to the data in such a manner that only one user can modify the same data at a time
 Reduced application development time
Disadvantages of DBMS
A DBMS may offer plenty of advantages, but it has certain flaws:
 The cost of hardware and software for a DBMS is quite high, which increases the budget of your organization.
 Most database management systems are complex systems, so training is required for users to use the DBMS.
 In some organizations, all data is integrated into a single database, which can be damaged by an electrical failure or by corruption of the storage media.
 Use of the same program at the same time by many users sometimes leads to the loss of some data.
 A DBMS cannot perform sophisticated calculations.


When not to use a DBMS system?

Although a DBMS is useful, it is still not suited for the specific cases mentioned below:
It is not recommended when you do not have the budget or the expertise to operate a DBMS. In such cases, Excel/CSV/flat files could do just fine.

DBMS Architecture
The design of a DBMS depends on its architecture. It can be centralized or decentralized or
hierarchical. The architecture of a DBMS can be seen as either single tier or multi-tier. An n-
tier architecture divides the whole system into related but independent n modules, which can
be independently modified, altered, changed, or replaced.
In 1-tier architecture, the DBMS is the only entity where the user directly sits on the DBMS
and uses it. Any changes done here will directly be done on the DBMS itself. It does not
provide handy tools for end-users. Database designers and programmers normally prefer to
use single-tier architecture.
If the architecture of DBMS is 2-tier, then it must have an application through which the
DBMS can be accessed. Programmers use 2-tier architecture where they access the DBMS by
means of an application. Here the application tier is entirely independent of the database in
terms of operation, design, and programming.
3-tier Architecture
A 3-tier architecture separates its tiers from each other based on the complexity of the users
and how they use the data present in the database. It is the most widely used architecture to
design a DBMS.

 Database (Data) Tier− At this tier, the database resides along with its query
processing languages. We also have the relations that define the data and
their constraints at this level.


 Application (Middle) Tier− At this tier reside the application server and
the programs that access the database. For a user, this application tier
presents an abstracted view of the database. End-users are unaware of any
existence of the database beyond the application. At the other end, the
database tier is not aware of any other user beyond the application tier.
Hence, the application layer sits in the middle and acts as a mediator
between the end-user and the database.
 User (Presentation) Tier− End-users operate on this tier and they know
nothing about any existence of the database beyond this layer. At this layer,
multiple views of the database can be provided by the application. All views
are generated by applications that reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are
independent and can be changed independently.
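A toy, hedged sketch of the three tiers in one Python file (all names are invented; in a real system the tiers would run as separate programs, often on separate machines): the data tier is an SQLite database, the application tier exposes a function that hides the database, and the presentation tier only calls that function.

import sqlite3

# --- Database (data) tier: relations, constraints, and query processing ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
db.executemany("INSERT INTO employee VALUES (?, ?, ?)",
               [(1, "Asha", "Sales"), (2, "Ravi", "IT")])

# --- Application (middle) tier: presents an abstracted view of the database ---
def employees_in_department(dept):
    rows = db.execute("SELECT name FROM employee WHERE dept = ?", (dept,)).fetchall()
    return [name for (name,) in rows]

# --- User (presentation) tier: knows nothing about the database beyond this call ---
for name in employees_in_department("Sales"):
    print("Employee:", name)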
Database Schema
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are
associated. It formulates all the constraints that are to be applied on the data.
A database schema defines its entities and the relationship among them. It contains a
descriptive detail of the database, which can be depicted by means of schema diagrams. It’s
the database designers who design the schema to help programmers understand the database
and make it useful.

A database schema can be divided broadly into two categories −


 Physical Database Schema− This schema pertains to the actual storage of
data and its form of storage like files, indices, etc. It defines how the data
will be stored in a secondary storage.
 Logical Database Schema− This schema defines all the logical constraints
that need to be applied on the data stored. It defines tables, views, and
integrity constraints.


Database Instance
It is important that we distinguish these two terms individually. Database schema is the
skeleton of database. It is designed when the database doesn’t exist at all. Once the database
is operational, it is very difficult to make any changes to it. A database schema does not
contain any data or information.
A database instance is a state of operational database with data at any given time. It contains
a snapshot of the database. Database instances tend to change with time. A DBMS ensures
that its every instance (state) is in a valid state, by diligently following all the validations,
constraints, and conditions that the database designers have imposed.
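The distinction can be seen in a few lines. In the hedged sketch below (SQLite via Python, invented names), the CREATE TABLE statement is the schema and mentions no data at all, while the set of rows present at each moment is an instance, which changes as the database is used.

import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the skeleton -- names, types, constraints -- containing no data.
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT NOT NULL)")

print(conn.execute("SELECT * FROM course").fetchall())   # instance at time t0: []

conn.execute("INSERT INTO course VALUES ('MBA101', 'Managing Data Resources')")
print(conn.execute("SELECT * FROM course").fetchall())   # instance at time t1: one row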


Data Warehousing
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of
data. This data helps analysts to take informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place. Suppose a business executive wants to analyze previous feedback
on any data such as a product, a supplier, or any consumer data, then the executive will have
no data available to analyze because the previous data has been updated due to transactions.
A data warehouse provides us with generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides us with Online Analytical Processing (OLAP) tools. These tools help us in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
Understanding a Data Warehouse
 A data warehouse is a database, which is kept separate from the
organization’s operational database.
 There is no frequent updating done in a data warehouse.
 It possesses consolidated historical data, which helps the organization to
analyze its business.
 A data warehouse helps executives to organize, understand, and use their
data to take strategic decisions.
 Data warehouse systems help in the integration of a diversity of application systems.
 A data warehouse system helps in consolidated historical data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases due to the following reasons −
 An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
 Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
 An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
 An operational database maintains current data. On the other hand, a data warehouse maintains historical data.


Data Warehouse Features


The key features of a data warehouse are discussed below −
 Subject Oriented: A data warehouse is subject oriented because it provides
information around a subject rather than the organization’s ongoing
operations. These subjects can be product, customers, suppliers, sales,
revenue, etc. A data warehouse does not focus on the ongoing operations,
rather it focuses on modelling and analysis of data for decision making.
 Integrated: A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This
integration enhances the effective analysis of data.
 Time Variant: The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information
from the historical point of view.
 Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note − A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is physically stored and separate from the operational database.

Data Warehouse Applications


As discussed before, a data warehouse helps business executives to organize, analyze, and
use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are
widely used in the following fields −
 Financial services
 Banking services
 Consumer goods
 Retail sectors
 Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below −
 Information Processing: A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
 Analytical Processing: A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting (a small sketch of these operations follows this list).
 Data Mining: Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
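The hedged sketch below imitates a few of these OLAP operations on a tiny, invented sales table using pandas (assumed to be installed): a roll-up that aggregates away the product dimension, a slice that fixes one year, and a drill-down that brings the product dimension back.

import pandas as pd

# Tiny, invented fact table: one row per (region, product, year) with a sales measure.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["Pen",   "Book",  "Pen",   "Book",  "Pen",   "Pen"],
    "year":    [2022,    2022,    2022,    2022,    2023,    2023],
    "amount":  [100,     250,     80,      300,     120,     90],
})

# Drill up (roll-up): aggregate away the product dimension, keep region x year.
print(sales.pivot_table(index="region", columns="year", values="amount", aggfunc="sum"))

# Slice: fix one dimension (year = 2022) and look at the remaining data.
print(sales[sales["year"] == 2022])

# Drill down: return to a finer level by adding the product dimension back.
print(sales.pivot_table(index=["region", "product"], columns="year",
                        values="amount", aggfunc="sum"))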


Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on information out. | It focuses on data in.
5 | It is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.


Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data consolidations.

Using Data Warehouse Information


There are decision support technologies that help utilize the data available in a data
warehouse. These technologies help executives to use the warehouse quickly and effectively.
They can gather data, analyze it, and take decisions based on the information present in the
warehouse. The information gathered in a warehouse can be used in any of the following
domains −
 Tuning Production Strategies: The product strategies can be well tuned by
repositioning the products and managing the product portfolios by
comparing the sales quarterly or yearly.
 Customer Analysis: Customer analysis is done by analyzing the customer’s
buying preferences, buying time, budget cycles, etc.


 Operations Analysis: Data warehousing also helps in customer relationship management, and making environmental corrections. The information also allows us to analyze business operations.

Integrating Heterogeneous Databases


To integrate heterogeneous databases, we have two approaches −
 Query-driven Approach
 Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used
to build wrappers and integrators on top of multiple heterogeneous databases. These
integrators are also known as mediators.
Process of Query-Driven Approach
 When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.
 Now these queries are mapped and sent to the local query processor.
 The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
 Query-driven approach needs complex integration and filtering processes.
 This approach is very inefficient.
 It is very expensive for frequent queries.
 This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today’s data warehouse systems follow
an update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
 This approach provides high performance.
 The data is copied, processed, integrated, annotated, summarized and restructured in a semantic data store in advance.
 Query processing does not require an interface to process data at local
sources.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities; a small end-to-end sketch in code follows the list −
 Data Extraction− Involves gathering data from multiple heterogeneous
sources.
 Data Cleaning− Involves finding and correcting the errors in data.
 Data Transformation− Involves converting the data from legacy format to
warehouse format.
 Data Loading− Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
 Refreshing− Involves updating from data sources to warehouse.
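A hedged, end-to-end toy of these functions in Python: extract records from two made-up sources, clean and transform them into a common warehouse format, and load them into a small SQLite table standing in for the warehouse. All sources, field names and values are invented for illustration.

import sqlite3

# Extraction: gather data from heterogeneous (here: hard-coded) sources.
source_a = [{"cust": " alice ", "amt": "120.5"}, {"cust": "BOB", "amt": "bad"}]
source_b = [("carol", 99.0)]

def clean_and_transform(raw):
    """Cleaning: drop or fix bad records; transformation: convert to warehouse format."""
    rows = []
    for rec in raw:
        name, amount = (rec["cust"], rec["amt"]) if isinstance(rec, dict) else rec
        try:
            amount = float(amount)
        except ValueError:
            continue                       # cleaning: skip records with bad amounts
        rows.append((name.strip().title(), amount))
    return rows

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (customer TEXT, amount REAL)")

# Loading (and, on later runs, refreshing): write the consolidated rows.
for source in (source_a, source_b):
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                          clean_and_transform(source))

print(warehouse.execute("SELECT * FROM sales_fact").fetchall())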


Note − Data cleaning and data transformation are important steps in improving the quality of
data and data mining results.

Data Mining
There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyze this huge amount of
data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we
would be able to use this information in many applications such as Fraud Detection, Market
Analysis, Production Control, Science Exploration, etc.
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications:

 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration
Data Mining Applications
Data mining is highly useful in the following domains:
 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astrology, and Internet Web Surf-Aid
Market Analysis and Management
Listed below are the various fields of market where data mining is used:
 Customer Profiling: Data mining helps determine what kind of people buy
what kind of products.
 Identifying Customer Requirements: Data mining helps in identifying the
best products for different customers. It uses prediction to find the factors
that may attract new customers.
 Cross Market Analysis: Data mining performs Association/correlations
between product sales.
 Target Marketing: Data mining helps to find clusters of model customers
who share the same characteristics such as interests, spending habits,
income, etc.
 Determining Customer Purchasing Patterns: Data mining helps in determining customer purchasing patterns.
 Providing Summary Information: Data mining provides us various
multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector:


 Finance Planning and Asset Evaluation: It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.
 Resource Planning: It involves summarizing and comparing the resources
and spending.
 Competition: It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect frauds. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of patterns to be mined, there are two categories of functions involved in data mining:
 Descriptive
 Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions:
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways −
 Data Characterization: This refers to summarizing the data of the class under study. This class under study is called the Target Class.
 Data Discrimination: It refers to the mapping or classification of a class with some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list
of kind of frequent patterns −
 Frequent Item Set− It refers to a set of items that frequently appear
together, for example, milk and bread.
 Frequent Subsequence− A sequence of patterns that occur frequently such
as purchasing a camera is followed by memory card.
 Frequent Sub Structure− Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be combined with item-sets or
subsequences.
Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased together. This process refers to uncovering the relationships among data and determining association rules.
For example, a retailer generates an association rule that shows that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
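Percentages like those above correspond to the support and confidence of an association rule. The hedged sketch below counts them for the rule bread -> milk over a few invented transactions; it is a toy, not a full association-rule miner.

# Invented market-basket transactions, one set of items per basket.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: bread -> milk
sup_bread      = support({"bread"})
sup_bread_milk = support({"bread", "milk"})
confidence     = sup_bread_milk / sup_bread

print(f"support(bread, milk) = {sup_bread_milk:.2f}")    # 3/5 = 0.60
print(f"confidence(bread -> milk) = {confidence:.2f}")   # 3/4 = 0.75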


Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated-attribute-value pairs or between two item sets, in order to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −
 Classification (IF-THEN) Rules
 Decision Trees
 Mathematical Formulae
 Neural Networks
The list of functions involved in these processes are as follows −
 Classification− It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are well known.
 Prediction− It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.
 Outlier Analysis− Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.
 Evolution Analysis− Evolution analysis refers to the description and modelling of regularities or trends for objects whose behavior changes over time.
Data Mining Task Primitives
 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion includes the
following −
 Database Attributes


 Data Warehouse dimensions of interest


Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple levels of abstraction. For
example, the Concept hierarchies are one of the background knowledge that allows data to be
mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns that are discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations may include the following −
 Rules
 Tables
 Charts
 Graphs
 Decision Trees
 Cubes
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we will discuss the major issues regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
These major issues are discussed in turn below.


Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases− Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction− The
data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on
the returned results.
 Incorporation of background knowledge− To guide discovery process
and to express the discovered patterns, the background knowledge can be
used. Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining− Data Mining
Query language that allows the user to describe ad hoc mining tasks, should
be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
 Presentation and visualization of data mining results− Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data− The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy of
the discovered patterns will be poor.


 Pattern evaluation− The patterns discovered may turn out to be uninteresting because they represent common knowledge or lack novelty, so good interestingness measures are needed to evaluate them.
Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms− In order to
effectively extract the information from huge amount of data in databases,
data mining algorithm must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms− Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and then the results from the partitions are merged. Incremental algorithms update databases without mining the data again from scratch.
Diverse Data Types Issues
 Handling of relational and complex types of data− The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems− The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining knowledge from them adds challenges to data mining.
Data-mining: Classification
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows:
 Classification
 Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income and
occupation.
Following are the examples of cases where the data analysis task is Classification:
 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.
Following are the examples of cases where the data analysis task is Prediction:
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value.


Therefore the data analysis task is an example of numeric prediction. In this case, a model or
a predictor will be constructed that predicts a continuous-valued-function or ordered value.
With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The Data Classification process includes two steps:

 Building the Classifier or Model


 Using Classifier for Classification
Building the Classifier or Model
 This step is the learning step or the learning phase.
 In this step the classification algorithms build the classifier.
 The classifier is built from the training set made up of database tuples and
their associated class labels.
 Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of classification rules. The classification rules can be applied to the new data tuples
if the accuracy is considered acceptable.
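A hedged sketch of the two steps using scikit-learn (assumed to be installed) on a tiny, invented loan-application dataset: the classifier is built from labelled training tuples, and the held-out test tuples are then used to estimate its accuracy before it classifies a new applicant.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Invented training data: [income (thousands), existing_loans] -> safe/risky label.
X = [[60, 0], [25, 3], [80, 1], [20, 2], [95, 0], [30, 4], [70, 1], [15, 3]]
y = ["safe", "risky", "safe", "risky", "safe", "risky", "safe", "risky"]

# Step 1: building the classifier (the learning phase) from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: using the classifier -- estimate accuracy on the test data, then
# classify a new, unseen loan applicant.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new applicant [50, 1] ->", model.predict([[50, 1]])[0])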
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities:
 Data Cleaning: Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
 Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
 Data Transformation and Reduction: The data can be transformed by any of the following methods.
 Normalization: The data is transformed using normalization. Normalization involves scaling all values of a given attribute in order to make them fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step (a small sketch follows this list).
 Generalization: The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
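For the normalization step above, one common choice is min-max scaling into [0, 1]. A hedged sketch on an invented income attribute (the function and values are made up for illustration):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into a small specified range."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]   # invented attribute values
print(min_max_normalize(incomes))            # [0.0, 0.25, 0.5, 1.0]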
Comparison of Classification and Prediction Methods
 Accuracy: The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
 Speed: This refers to the computational cost in generating and using the
classifier or predictor.
 Robustness: It refers to the ability of classifier or predictor to make correct
predictions from given noisy data.
 Scalability: Scalability refers to the ability to construct the classifier or
predictor efficiently; given large amount of data.


 Interpretability: It refers to the extent to which the classifier or predictor can be understood.
Data Mining Techniques
Data Mining is the process of extracting useful information and patterns from enormous amounts of data. Data mining includes the collection, extraction, analysis and statistical treatment of data. It is also known as the knowledge discovery process, knowledge mining from data, or data/pattern analysis. Data mining is a logical process of finding useful information and patterns in data. Once the information and patterns are found, they can be used to make decisions for developing the business. Data mining tools can give answers to various questions related to your business that were too difficult to resolve. They also forecast future trends, which lets business people make proactive decisions.
Data mining involves three steps. They are
 Exploration– In this step the data is cleaned and converted into another form. The nature of the data is also determined.
 Pattern Identification– The next step is to choose the pattern which will
make the best prediction
 Deployment– The identified patterns are used to get the desired outcome.
Benefits of Data Mining
 Automated prediction of trends and behaviours
 It can be implemented on new systems as well as existing platforms
 It can analyze huge databases in minutes
 Automated discovery of hidden patterns
 There are a lot of models available to understand complex data easily
 Its high speed makes it easy for users to analyze huge amounts of data in less time
 It yields improved predictions
Data Mining Techniques
One of the most important tasks in data mining is to select the correct data mining technique. The data mining technique has to be chosen based on the type of business and the type of problem your business faces. A generalized approach has to be used to improve the accuracy and cost-effectiveness of using data mining techniques. There are basically seven main data mining techniques, which are discussed in this article. There are also many other data mining techniques, but these seven are considered the ones most frequently used by business people.
 Statistics
 Clustering
 Visualization
 Decision Tree
 Association Rules
 Neural Networks
 Classification
1. Statistical Techniques
Statistics is a branch of mathematics which relates to the collection and description of data. Statistical technique is not considered a data mining technique by many analysts. But it still helps to discover patterns and build predictive models. For this reason, data analysts should possess some knowledge about the different statistical techniques. In today's world, people have to deal with large amounts of data and derive important patterns from it. Statistics can help to a great extent in getting answers to questions about their data, such as:
 What are the patterns in their database?
 What is the probability of an event occurring?
 Which patterns are more useful to the business?
 What is the high-level summary that can give you a detailed view of what is there in the database?
Statistics not only answers these questions but also helps in summarizing and counting the data. It also helps in providing information about the data with ease. Through statistical reports people can take smart decisions. There are different forms of statistics, but the most important and useful ones are the basic measures and techniques used to describe and summarize data, such as (a few of these are sketched in code after the list):
 Histogram
 Mean
 Median
 Mode
 Variance
 Max
 Min
 Linear Regression
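Most of the measures in this list are available directly in Python's standard statistics module; a hedged sketch on a small invented sample follows.

import statistics

data = [12, 15, 15, 18, 20, 22, 22, 22, 30]   # invented sample

print("mean    :", statistics.mean(data))
print("median  :", statistics.median(data))
print("mode    :", statistics.mode(data))
print("variance:", statistics.variance(data))   # sample variance
print("min/max :", min(data), max(data))

# A very small histogram: count how many values fall in each bucket of width 10.
buckets = {}
for v in data:
    buckets[v // 10 * 10] = buckets.get(v // 10 * 10, 0) + 1
print("histogram:", buckets)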
2. Clustering Technique
Clustering is one of the oldest techniques used in data mining. Clustering analysis is the process of identifying data that are similar to each other. This helps to understand the differences and similarities between the data. Clustering is sometimes called segmentation and helps users to understand what is going on within the database. For example, an insurance company can group its customers based on their income, age, nature of policy and type of claims.
There are different types of clustering methods. They are as follows
• Partitioning Methods
• Hierarchical Agglomerative Methods
• Density Based Methods
• Grid Based Methods
• Model Based Methods
A closely related and very popular algorithm is Nearest Neighbour. Although it is very similar to clustering, it is a prediction technique: to estimate a value for one record, you look for records with similar values in the historical database and use the value from the record that lies nearest to the unclassified record. The technique simply assumes that objects which are close to each other will have similar prediction values, so the values of nearby objects can be predicted easily. Nearest Neighbour is one of the easiest techniques to use because it works much the way people think, it automates well, and it supports complex ROI calculations with ease. Its level of accuracy is as good as that of the other data mining techniques.
In business, the Nearest Neighbour technique is most often used in text retrieval, to find documents that share important characteristics with a main document that has been marked as interesting. A small sketch of both clustering and Nearest Neighbour prediction follows.
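The snippet below, intended only as a rough illustration, uses scikit-learn (an assumed library) to show a partitioning method (k-means) and a Nearest Neighbour prediction; the customer attributes and claim values are invented.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical customers described by (income in thousands, age)
    customers = np.array([
        [30, 25], [32, 30], [35, 28],    # lower-income, younger customers
        [80, 45], [85, 50], [90, 48],    # higher-income, older customers
    ])

    # Partitioning method: segment the customers into 2 clusters
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print("Cluster labels:", kmeans.labels_)

    # Nearest Neighbour prediction: estimate a value (here, yearly claims)
    # for a new customer from the most similar historical customers
    claims = np.array([1, 1, 2, 4, 5, 4])        # hypothetical historical values
    knn = KNeighborsRegressor(n_neighbors=2).fit(customers, claims)
    print("Predicted claims:", knn.predict(np.array([[33, 27]])))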
3. Visualization
Visualization is one of the most useful techniques for discovering data patterns, and it is used at the beginning of the data mining process. Much research is currently going on into producing interesting projections of databases, an area known as Projection Pursuit. Many data mining techniques produce useful patterns from good data, but visualization is a technique that helps turn poor data into good data, allowing different kinds of data mining methods to be used to discover hidden patterns.
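For illustration only, here is a minimal visual-exploration sketch with matplotlib and NumPy (assumed libraries); the data is randomly generated and carries a deliberately hidden linear pattern.

    import numpy as np
    import matplotlib.pyplot as plt

    # Random data with a hidden linear pattern plus noise
    rng = np.random.default_rng(seed=0)
    x = rng.uniform(0, 10, size=200)
    y = 3 * x + rng.normal(0, 4, size=200)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(x, bins=20)              # distribution of a single variable
    ax1.set_title("Histogram of x")
    ax2.scatter(x, y, s=10)           # relationship between two variables
    ax2.set_title("y versus x")
    plt.tight_layout()
    plt.show()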
4. Induction Decision Tree Technique
A decision tree is a predictive model which, as the name implies, is structured like a tree. In this technique, each branch of the tree is viewed as a classification question and the leaves of the tree are viewed as partitions of the dataset related to that particular classification. The technique can be used for exploratory analysis, data pre-processing and prediction work.
A decision tree can be thought of as a segmentation of the original dataset, where the segmentation is done for a particular reason and the records that fall under a segment share similarities in the information being predicted. Decision trees provide results that can be easily understood by the user.
The decision tree technique is often used by statisticians to find out which parts of a database are most relevant to the business problem, and it can be used for both prediction and data pre-processing.
The first and foremost step in this technique is growing the tree. Growing the tree rests on finding the best possible question to ask at each branch of the tree. The decision tree stops growing under any one of the following circumstances:
• The segment contains only one record
• All the records contain identical features
• The improvement from growing the tree further is not enough to justify another split
CART, which stands for Classification and Regression Trees, is a data exploration and prediction algorithm that picks its questions in a more sophisticated way: it tries them all and then selects the single best question with which to split the data into two or more segments. After deciding on the segments it again asks questions of each new segment individually.
Another popular decision tree technology is CHAID (Chi-Square Automatic Interaction Detector). It is similar to CART but differs in one way: CART concentrates on choosing the best questions, whereas CHAID concentrates on choosing the splits. A minimal sketch of growing a decision tree follows.
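In the sketch below, scikit-learn, its built-in iris dataset and the chosen depth limits are all assumptions made for this illustration, not part of the notes.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # A built-in dataset stands in for business data
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Grow the tree: at each branch the algorithm picks the question (feature
    # and threshold) that best splits the segment; growth stops when segments
    # are pure, contain too few records, or the depth limit is reached
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
    tree.fit(X_train, y_train)

    print("Accuracy on held-out data:", tree.score(X_test, y_test))
    print(export_text(tree))   # a readable list of the questions asked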
5. Neural Network
The neural network is another important technique in use today. It is most often applied in the early stages of data mining technology, and the artificial neural network grew out of the artificial intelligence community.
Neural networks are fairly easy to use because they are automated to a large extent, so the user is not expected to have deep knowledge of the workings of the network or of the database. But to make a neural network work efficiently you need to decide:
• How should the nodes be connected?
• How many processing units should be used?
• When should the training process be stopped?
There are two main parts to this technique – the node and the link:
• The node – which loosely corresponds to a neuron in the human brain
• The link – which loosely corresponds to the connections between neurons in the human brain
A neural network is a collection of interconnected neurons, which may form a single layer or multiple layers. The arrangement of the neurons and their interconnections is called the architecture of the network. There is a wide variety of neural network models, and each model has its
own advantages and disadvantages. Each model also has a different architecture, and these architectures use different learning procedures.
Neural networks are a very strong predictive modelling technique, but they are not easy to understand, even for experts, because they create very complex models that are almost impossible to interpret fully. To address this, companies are looking for new solutions, and two have already been suggested:
• The first is to package the neural network into a complete solution so that it can be used for a single application
• The second is to bundle it with expert consulting services
Neural networks have been used in many kinds of applications; in business they are often used to detect fraud. A minimal sketch of a small neural network classifier follows.
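As a hedged sketch of the fraud-detection idea, the snippet below trains a small multi-layer network with scikit-learn's MLPClassifier; the transaction features, labels and network size are invented for illustration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical transactions: (amount, hour of day), label 1 = fraudulent
    X = np.array([[20, 14], [35, 10], [15, 16], [900, 3], [750, 2],
                  [40, 12], [820, 4], [25, 18], [660, 1], [30, 11]])
    y = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 0])

    # hidden_layer_sizes answers "how many processing units?";
    # max_iter bounds when the training process stops
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
    )
    model.fit(X, y)

    print("Fraud prediction for a 700-unit purchase at 3 am:",
          model.predict([[700, 3]]))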
6. Association Rule Technique
This technique helps to find associations between two or more items. It reveals the relationships between different variables in databases and discovers hidden patterns in the data, identifying the variables that most frequently occur together.
An association rule offers two major pieces of information:
• Support – How often does the rule apply?
• Confidence – How often is the rule correct?
This technique follows a two-step process:
• Find all the frequently occurring item sets
• Create strong association rules from the frequent item sets
There are three types of association rules:
• Multilevel Association Rules
• Multidimensional Association Rules
• Quantitative Association Rules
This technique is most often used in the retail industry to find patterns in sales, which helps increase the conversion rate and thus increases profit. A small worked example of support and confidence follows.
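The plain-Python sketch below computes support and confidence for one candidate rule; the baskets are made up.

    # Hypothetical retail transactions (each set is one customer's basket)
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]
    antecedent, consequent = {"bread"}, {"butter"}
    n = len(transactions)

    # Support: how often the rule applies, i.e. the fraction of baskets
    # containing both the antecedent and the consequent
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    support = both / n

    # Confidence: how often the rule is correct, i.e. of the baskets that
    # contain the antecedent, the fraction that also contain the consequent
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    confidence = both / with_antecedent

    print("bread -> butter  support=%.2f  confidence=%.2f" % (support, confidence))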
7. Classification
Classification is the most commonly used data mining technique. It employs a set of pre-classified samples to build a model that can classify a large set of data, and it helps in deriving important information about data and metadata (data about data). The technique is closely related to cluster analysis and typically uses decision tree or neural network systems. There are two main processes involved:
• Learning – In this process the data is analysed by the classification algorithm
• Classification – In this process the data is used to measure the precision of the classification rules
There are different types of classification models:
• Classification by decision tree induction
• Bayesian classification
• Neural networks
• Support Vector Machines (SVM)
• Classification based on associations
A good example of the classification technique is an email provider classifying incoming messages as spam or not spam; a minimal sketch of that idea follows.
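The sketch below shows the learning and classification steps using scikit-learn's CountVectorizer and a Bayesian classifier (assumed tools); the messages and labels are invented.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Pre-classified samples (1 = spam, 0 = not spam), invented for illustration
    messages = [
        "win a free prize now", "limited offer claim your reward",
        "meeting moved to 3 pm", "please review the attached report",
        "free cash click here", "lunch tomorrow with the team",
    ]
    labels = [1, 1, 0, 0, 1, 0]

    # Learning: the classification algorithm analyses the labelled data
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)

    # Classification: the model labels new, unseen messages
    new_messages = ["claim your free reward now", "report for the team meeting"]
    print(model.predict(new_messages))   # likely output: [1 0]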
Record – Records are composed of fields, each of which contains one item of information.
Field – A field is an area in a fixed or known location in a unit of data, such as a record, message header, or computer instruction, that has a purpose and usually a fixed size.
Form – A form is a database object that you can use to enter, edit, or display data from a table or a query.
Query – A query is a request for data or information from a database table or combination of tables.
Table – A table is a data structure that organizes information into rows and columns.
Schema – A database schema is the structure of the database, described in a formal language supported by the database management system (DBMS). The term "schema" refers to the organization of data, a blueprint of how the database is constructed (divided into database tables in the case of relational databases).
Views – A view in SQL is considered a virtual table. Like a table, a view contains rows and columns. To create a view, we select fields from one or more tables in the database. A view can contain either specific rows, based on a certain condition, or all the rows of a table. A short sketch tying together tables, queries and views appears below.
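To tie the table, query and view definitions together, here is a minimal sketch using Python's built-in sqlite3 module; the table name, columns and data are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")        # a throwaway in-memory database
    cur = conn.cursor()

    # Table: a fixed schema of columns; each inserted row is a record,
    # and each column value within a row is a field
    cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, "
                "name TEXT, city TEXT, income REAL)")
    cur.executemany(
        "INSERT INTO customers (name, city, income) VALUES (?, ?, ?)",
        [("Asha", "Delhi", 52000), ("Ravi", "Mumbai", 61000), ("Meena", "Delhi", 47000)],
    )

    # View: a virtual table defined by a stored query
    cur.execute("CREATE VIEW delhi_customers AS "
                "SELECT name, income FROM customers WHERE city = 'Delhi'")

    # Query: a request for data from a table or a view
    for row in cur.execute("SELECT * FROM delhi_customers"):
        print(row)

    conn.close()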