IBM Software

The fundamentals of data lifecycle management in the era of big data
How data lifecycle management complements a big data strategy

1 Introduction
2 Big data, big impact: Dealing with the three Vs
3 Best practices: Putting data lifecycle management into action
4 The power of enterprise-scale data lifecycle management
5 Enhance data warehouse agility with IBM InfoSphere
6 Why InfoSphere?

Introduction
Organizations are eager to harness the
power of big data. But as new big data
opportunities emerge, ensuring that
information is trusted and protected
becomes exponentially more difficult.
If these challenges are not addressed
directly, end users may lose confidence
in the insights generated from their data,
which can leave them unable to act on
new opportunities or address threats.

The tremendous volume, variety and
velocity of big data mean that the old
manual methods of discovering, governing
and correcting data are no longer feasible.
Organizations need to automate information
integration and governance from the start.
By automating information integration
and governance and employing it at the
point of data creation and throughout
its lifecycle, organizations can help
protect information and improve the
accuracy of big data insights.


Information integration and governance
solutions must become a natural part
of big data projects. They must support
automated discovery and profiling and
they must facilitate an understanding of
diverse data sets to provide the complete
context required to make informed decisions.
They must be agile enough to accommodate
a wide variety of data and seamlessly
integrate with diverse technologies, from
data marts to Apache Hadoop systems.
Plus, they must discover, protect and
monitor sensitive information across its
lifecycle as part of big data applications.

Understanding the context of data
and being able to extract the precise
information necessary to meet a business
objective is key to utilizing big data to the
fullest. Managing the data lifecycle so that
data is accurate, is appropriately used and
is correctly stored to meet the required
service levels and retention needs has
wide-ranging benefits. These benefits
include risk reduction, performance
improvements and preventing an overload
of useless information.

This e-book explores the challenges
of managing big data, best practices
for enterprise-scale data lifecycle
management and how IBM InfoSphere
Optim data lifecycle management
solutions incorporate a comprehensive
range of information integration and
governance capabilities that enable
companies to properly manage data
over its lifetime.


Big data, big impact: Dealing with the three Vs


Without effective data lifecycle management,
the increasing volume, variety and velocity
of big data can reduce performance,
erode margins and amplify risks.

Performance and time-to-market
As more users execute more queries on
larger data volumes, slow response times
and degraded application performance
become major issues. If left unchecked,
continued data growth will stretch resources
beyond capacity and negatively impact
response time for critical queries and
reporting processes. These problems can
affect production environments and hamper
upgrades, migrations and disaster recovery
efforts. Implementing intelligent data
management of historical, dormant data
is essential for avoiding these potentially
business-halting issues.

Rapid data growth also makes testing
more difficult. As data warehouses and big
data environments grow to petabytes or
more, testing processes are taxed by
having to cull data for their specific needs.
The results include longer test cycles,
slower time-to-market and fewer defects
identified in advance of release. Speeding
up testing workflows and delivery of data
warehouses requires organizations to
automate the creation of realistic, rightsized
test data while keeping appropriate
security measures in place.

Margins
Exponential data growth also can drive up
infrastructure and operational costs, often
consuming most of an organization's data
warehousing or big data budget. Rising
data volumes require more capacity,
and organizations often must buy more
hardware and spend more money to
maintain, monitor and administer their
expanding infrastructure. Large data
warehouses and big data environments
generally require bigger servers, appliances
and testing environments, which can also
increase software licensing costs for the
database and database tooling, not to
mention labor, power and legal costs.


Risks
Following the "let's keep it in case someone
needs it later" mandate, many organizations
already keep too much historical data.
According to the CGOC 2012 Summit
Survey, 69 percent of data has no value.
Opening the doors to excessive storage
and retention only exacerbates the situation.

At the same time, organizations must ensure
the privacy and security of the growing
volumes of confidential information.
Government and industry regulations from
around the world, such as the Health
Insurance Portability and Accountability
Act (HIPAA), the Personal Information
Protection and Electronic Documents Act
(PIPEDA) and the Payment Card Industry
Data Security Standard (PCI DSS), require
organizations to protect personal information
no matter where it lives, even in test and
development environments.

Data breaches and attacks risk negative consumer sentiment

75% of IT risks impact customer satisfaction and brand reputation
43% are increasing focus on reputational risk because of growth in emerging technologies such as social media
Maintaining compliance with data retention regulations, protecting privacy and archiving
data are not just legal matters; they are essential for sustaining customer satisfaction
and brand reputation. In recent IBM surveys, respondents indicate that data theft/
cybercrime is the number-one threat to a company's reputation, a greater threat than
system failures. Sixty-four percent of respondents say their company will be focusing
more on managing and protecting their reputation than they did five years ago.¹
Source: Insights from the 2012 Global Reputational Risk and IT Study.


The danger of treating a backup as an archive
Many organizations are confused about the difference
between archiving and backing up data. Archiving
preserves data, providing a long-term repository of
information that can be used by litigation and audit
teams. By contrast, backing up data involves copying
production data and moving it to another environment
to enable disaster recovery and the restoration of
deleted files. Backups are often retained for a short
time, until a fresh backup replaces the existing backup.

Archiving complements backups by removing old,
redundant and infrequently accessed data from a
system and by reducing the size of databases and
their backups. Approximately 75 percent of the data
stored is typically inactive, rarely accessed by any
user, process or application. An estimated 90 percent
of all data access requests are serviced by new data,
usually data that is less than a year old.² With an
effective archiving strategy, organizations can protect
old data and comply with data retention rules while
reducing costs and enhancing system performance.

In an attempt to meet archiving needs, some
organizations simply back up data to a Hadoop
environment. But this kind of backup will not ensure
that data will be fully protected or remain query-able,
the way a true archive would. With an effective data
lifecycle management solution, companies can create
an archive that protects data, meets compliance
standards, and supports queries and reporting. An
emerging trend is for organizations to use Hadoop
as a lower-cost storage alternative for archives.


Best practices: Putting data lifecycle management into action


The data lifecycle stretches through
multiple phases as data is created, used,
shared, updated, stored and eventually
archived or defensibly disposed of. Data
lifecycle management plays an especially
key role in three of these phases of
data's existence: archiving, test data
management and data masking.

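Where management tasks fall in the data lifecycle

[Figure: the data lifecycle drawn as a circle of stages (Create, Use,
Share, Update, Store/retain, Archive, Dispose), with three management
tasks placed on it: test data management, data masking and archiving.]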

The entire data lifecycle (shown as the grey circle) benefits from
good governance, but management capabilities that focus on the
use, share and archive steps have wide-ranging benefits for cost
reduction and efficiency gains.

Archiving
Retention policies are designed to keep
important data elements for reference and
for future use while deleting data that is no
longer necessary to support the legal needs
of an organization. Effective data lifecycle
management includes the intelligence not
only to archive data in its full context, which
may include information across dozens of
databases, but also to archive it based on
specific parameters or business rules, such
as the age of the data. It can also help
storage administrators develop a tiered and
automated storage strategy to archive
dormant data in a data warehouse, thereby
improving overall warehouse performance.
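
The age-based rule described above can be illustrated with a short sketch. It assumes orders and their line items are held as simple Python records, and the `archive_by_age` helper, field names and 365-day threshold are hypothetical choices for illustration, not part of any IBM product or API. The point it shows is that an old order moves to the archive together with its related rows, so the archived data keeps its context.

```python
from datetime import date, timedelta


def archive_by_age(orders, items, today, max_age_days=365):
    """Split orders and their line items into active and archive sets.

    Hypothetical helper: an order older than the cutoff is archived together
    with its related line items, so the archived record keeps its context.
    """
    cutoff = today - timedelta(days=max_age_days)
    old_ids = {o["order_id"] for o in orders if o["order_date"] < cutoff}

    def split(rows):
        # Rows that reference an old order move; everything else stays.
        keep = [r for r in rows if r["order_id"] not in old_ids]
        move = [r for r in rows if r["order_id"] in old_ids]
        return keep, move

    active_orders, archived_orders = split(orders)
    active_items, archived_items = split(items)
    active = {"orders": active_orders, "items": active_items}
    archive = {"orders": archived_orders, "items": archived_items}
    return active, archive


# Illustrative data only.
orders = [
    {"order_id": 1, "order_date": date(2013, 7, 1)},
    {"order_id": 2, "order_date": date(2011, 3, 15)},
]
items = [
    {"order_id": 1, "sku": "80-2382"},
    {"order_id": 2, "sku": "86-4538"},
    {"order_id": 2, "sku": "80-2382"},
]

active, archive = archive_by_age(orders, items, today=date(2013, 8, 1))
print(len(archive["orders"]), len(archive["items"]))
```

In practice the business rule could be any predicate (status, region, legal hold) rather than age alone; the sketch only fixes the idea that parent and child rows are selected by the same rule.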


Enterprise information

 1%  Subject to legal hold
 5%  Regulatory record keeping
25%  Has business utility
31%  Subtotal of the three categories above
69%  Everything else

Many organizations hope that big data will provide a large,
centralized lake of data, but in many cases, it becomes a data
swamp full of unreliable information.

Many organizations envision big data as a
large, pristine, centralized data lake. But
a data lake can quickly turn into a data
swamp when data is poorly managed and
controlled. By setting up an intelligent data
lifecycle management strategy and archiving
to inexpensive storage, you can avoid
turning your big data environment into a
dumping ground.

Test data management
In development, testers must automate the
creation of realistic, rightsized data sources
that mirror the behaviors of existing production
databases. To ensure that queries can be
run easily and accurately, they must create
a subset of actual production data and
reproduce actual conditions to help identify
defects or problems as early as possible in
the testing cycle.

The tremendous size of big data systems
creates challenges for testers. There is a
greater need to speed delivery of big data
applications, requiring organizations to create
realistic, rightsized, masked test data for
testing those applications for performance
and functionality. Testers also need ways to
generate test data sets that facilitate realistic
functional and performance testing. Because
production data contains information that
may identify customers, organizations must
mask that information in test environments
to maintain compliance and privacy.
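
As a rough sketch of what "rightsized" can mean, the example below draws a small sample of customers and then pulls every order that belongs to them, so the subset stays relationally complete. The table layout, the `subset_test_data` helper and the fixed random seed are assumptions made for illustration; they do not describe how any specific product builds subsets.

```python
import random


def subset_test_data(customers, orders, sample_size, seed=0):
    """Return a sample of customers plus all orders that reference them.

    Hypothetical helper: sampling parents first and then following the
    foreign key keeps the subset free of orphaned child rows.
    """
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    picked = rng.sample(customers, k=min(sample_size, len(customers)))
    picked_ids = {c["cust_id"] for c in picked}
    related = [o for o in orders if o["cust_id"] in picked_ids]
    return picked, related


# Illustrative data only.
customers = [{"cust_id": n} for n in (101, 102, 103, 104)]
orders = [
    {"order_id": 1, "cust_id": 101},
    {"order_id": 2, "cust_id": 102},
    {"order_id": 3, "cust_id": 102},
    {"order_id": 4, "cust_id": 103},
    {"order_id": 5, "cust_id": 104},
]

test_customers, test_orders = subset_test_data(customers, orders, sample_size=2)
print(len(test_customers), "customers,", len(test_orders), "orders")
```

A subset produced this way still contains real values, so it would need to be masked before it is loaded into a test environment.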


Applying data masking techniques to
the test data means testers use realistic-looking
but fictional data; no actual
sensitive data is revealed. Application
developers can also use test data
management technologies to easily access
and refresh test data, which speeds the
testing and delivery of the new data source.

Organizations also need ways to mask
certain sensitive data, such as credit card
and phone numbers. While testing their
big data environments, they must mask
sensitive data from unauthorized users,
even though those users might be
authorized to see the data in aggregate.
For example, a pharmaceutical company
that is testing its data warehouse
environment might mask Social Security
numbers and dates of birth but not patients'
ages and other demographic information.
Masking certain data this way satisfies
corporate and industry regulations by
removing identifiable information, while
still maintaining business context and
referential integrity for testing in nonproduction environments.
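
The pharmaceutical example can be sketched in a few lines. This is one possible approach under stated assumptions, not the method used by any particular product: the record fields, the `mask_patient` and `pseudo_digits` helpers and the demo key are all hypothetical. The Social Security number is replaced with repeatable, fictional digits in the same format, the date of birth is reduced to the birth year, and age and region pass through unchanged.

```python
import hashlib
import hmac
from datetime import date

# Assumption: a demo key. A real deployment would load a managed secret.
SECRET_KEY = b"demo-only-key"


def pseudo_digits(value, length):
    """Derive a repeatable string of `length` digits from `value`."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return str(int(digest, 16) % (10 ** length)).zfill(length)


def mask_patient(record):
    """Mask direct identifiers; pass demographic fields through unchanged."""
    digits = pseudo_digits(record["ssn"], 9)
    masked = dict(record)  # copy so the source record is not modified
    masked["ssn"] = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
    # Keep only the birth year so the exact date is no longer exposed.
    masked["date_of_birth"] = date(record["date_of_birth"].year, 1, 1)
    return masked


# Illustrative data only.
patient = {
    "ssn": "123-45-6789",
    "date_of_birth": date(1980, 8, 15),
    "age": 32,
    "region": "Northeast",
}
masked = mask_patient(patient)
print(masked["ssn"], masked["date_of_birth"], masked["age"], masked["region"])
```

Because the replacement digits are derived from the original value with a keyed hash, the same input always yields the same masked output, which keeps joins on that column consistent across tables.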

Original data

Customers table
Cust ID   Name             Street
08054     Alice Bennett    2 Park Blvd
19101     Carl Davis       258 Main
27645     Elliot Flynn     96 Avenue

Orders table
Cust ID   Item #     Order date
27645     80-2382    20 June 2004
27645     86-4538    10 October 2005

De-identified data

Customers table
Cust ID   Name             Street
10000     Auguste Renoir   23 Mars
10001     Claude Monet     24 Venus
10002     Pablo Picasso    25 Saturn

Orders table
Cust ID   Item #     Order date
10002     80-2382    20 June 2004
10002     86-4538    10 October 2005

Data masking techniques protect the confidentiality of
private information.
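
The customer and order tables in this figure can be reproduced with a small sketch that shows why referential integrity survives masking: each original customer ID is mapped to exactly one replacement ID, and the same mapping is applied to the orders table. The `deidentify` helper and the dictionary layout are assumptions for illustration only; the replacement names and streets are taken from the figure.

```python
from itertools import count


def deidentify(customers, orders, fake_names, fake_streets, first_id=10000):
    """Replace customer identifiers and apply the same ID map to orders."""
    if min(len(fake_names), len(fake_streets)) < len(customers):
        raise ValueError("not enough replacement values for every customer")

    next_id = count(first_id)
    id_map = {}  # original Cust ID -> replacement Cust ID
    masked_customers = []
    for cust, name, street in zip(customers, fake_names, fake_streets):
        new_id = str(next(next_id))
        id_map[cust["cust_id"]] = new_id
        masked_customers.append({"cust_id": new_id, "name": name, "street": street})

    # Orders keep their item and date; only the foreign key is remapped.
    masked_orders = [{**o, "cust_id": id_map[o["cust_id"]]} for o in orders]
    return masked_customers, masked_orders


customers = [
    {"cust_id": "08054", "name": "Alice Bennett", "street": "2 Park Blvd"},
    {"cust_id": "19101", "name": "Carl Davis", "street": "258 Main"},
    {"cust_id": "27645", "name": "Elliot Flynn", "street": "96 Avenue"},
]
orders = [
    {"cust_id": "27645", "item": "80-2382", "order_date": "20 June 2004"},
    {"cust_id": "27645", "item": "86-4538", "order_date": "10 October 2005"},
]

masked_customers, masked_orders = deidentify(
    customers,
    orders,
    fake_names=["Auguste Renoir", "Claude Monet", "Pablo Picasso"],
    fake_streets=["23 Mars", "24 Venus", "25 Saturn"],
)
print(masked_customers[2]["cust_id"], [o["cust_id"] for o in masked_orders])
```

Both orders still point at the same (now fictional) customer, so queries that join the two tables behave as they would against production data.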


Complex IT landscapes make setting up test labs extremely costly
As volume, variety and velocity impact the
complexity of data infrastructures, scaling test
environments becomes a significant problem. It
isn't unusual for Fortune 500 companies to
spend up to USD30 million building a single test
lab, and many of these organizations have
dozens of labs. Add in rising wages, and testing
costs begin to spiral out of control.

[Figure: Heterogeneous environments. Components shown: private cloud,
public cloud, EJB, third-party services, business partners, messaging
services, collaboration, web/Internet, content providers, routing
services, shared services, archives, portals, data warehouse,
directory identity, mainframe, enterprise service bus, file systems.]


The power of enterprise-scale data lifecycle management


Effective data lifecycle management benefits
both IT and business stakeholders.

• Increasing margin: Lower infrastructure
  and capital costs, improved productivity
  and reduced application defects during
  the development lifecycle.
• Reducing risks: Reduced application
  downtime, minimized service and
  performance disruptions, and adherence
  to data retention requirements.
• Promoting business agility: Improved
  time-to-market, increased application
  performance and improved quality of
  applications through realistic test data.

With InfoSphere Optim, organizations gain
a single data lifecycle management solution
that can scale to meet enterprise needs.
Whether they implement InfoSphere Optim
for a single application, data warehouse or
big data environment, organizations can
streamline data lifecycle management with a
consistent strategy. The unique relationship
engine in InfoSphere Optim provides a
single point of control to guide data
processing activities such as archiving,
subsetting and retrieving data.


Enhance data warehouse agility with IBM InfoSphere


InfoSphere Optim solutions help organizations
meet requirements for information integration
and governance and address challenges
exacerbated by the increasing volume,
variety and velocity of data. By archiving
old data from huge data warehouse
environments, businesses can improve
response times and reduce costs by
reclaiming valuable storage capacity.
By creating realistic, rightsized data sources
for testing, they can enhance the accuracy
of testing and identify problems early in the
testing cycle. And by implementing data
masking capabilities, they can protect
sensitive data and help ensure compliance
with privacy regulations.

As a result, organizations gain more control
of their IT budget while simultaneously
helping their big data and data warehouse
environments run more efficiently and reducing
the risk of exposure of sensitive data.

InfoSphere Optim supports
major big data and data
warehouse environments,
including IBM PureData for
Analytics, IBM PureData for
Transactions, IBM InfoSphere
BigInsights, Teradata,
Oracle and popular Hadoop
distributions. It also supports
enterprise databases and
operating systems, including
IBM DB2, Oracle Database,
Sybase, Microsoft SQL Server,
IBM Informix, IBM IMS,
IBM Virtual Storage Access Method (VSAM),
Microsoft Windows, UNIX, Linux and IBM z/OS.
In addition, InfoSphere Optim supports key enterprise
resource planning (ERP) and customer relationship
management (CRM) applications such as Oracle
E-Business Suite, PeopleSoft Enterprise, JD Edwards
EnterpriseOne, Siebel, Amdocs CRM and the
SAP ERP and CRM applications, as well as many
custom applications.

The value of test data management at a US insurance company


With 42 high-volume back-end systems needed to
generate a full end-to-end system test, a US insurance
company could not confidently launch new features.
Testing in production was becoming the norm. In fact,
claims could not be processed in certain states because
of application defects that the teams skipped over during
the testing process. IT was consuming an increasing
number of resources, yet application quality was
declining rapidly.

After implementing a process to govern test data
management, the insurance company reduced the costs
of testing by USD400,000 per year. Today, the company
can easily refresh 42 test systems from across the
organization in record time while finding defects
in advance.

The business value from implementing test data
management included:

• Cost savings of approximately USD500,000 per year
• 44 percent fewer untested scenarios
• 41 percent less labor required over 12 months


Why InfoSphere?
As the foundation of the IBM big data platform,
InfoSphere provides market-leading
functionality across all the capabilities of
information integration and governance.
It is designed to handle the challenges of
big data by providing optimal scale and
performance for massive data volumes,
agile and rightsized integration and
governance for the increasing velocity of
data, and support for a wide variety of data
types and big data systems. InfoSphere
helps make big data and analytics projects
successful by delivering the confidence to
act on insight.

InfoSphere capabilities include:

• Metadata, business glossary and
  policy management: Define metadata,
  business terminology and governance
  policies with IBM InfoSphere Business
  Information Exchange.
• Data integration: Handle all integration
  requirements, including batch data
  transformation and movement (InfoSphere
  Information Server), real-time replication
  (InfoSphere Data Replication) and data
  federation (InfoSphere Federation Server).
• Data quality: Parse, standardize, validate
  and match enterprise data with InfoSphere
  Information Server for Data Quality.
• Master data management: Act on a
  trusted view of your customers, products,
  suppliers, locations and accounts with
  InfoSphere MDM.
• Data lifecycle management: Manage
  data throughout its lifecycle, from
  requirements through retirement, with
  InfoSphere Optim test data automation
  and database archiving capabilities.
• Data security and privacy: Continuously
  monitor data access and protect
  repositories from data breaches,
  and support compliance with IBM
  InfoSphere Guardium. Ensure sensitive
  data is masked and protected with
  InfoSphere Optim.


Additional resources
Ready to get started? Take a self-service
InfoSphere Optim Business Value
Assessment and show the ROI results
to your big data project owner.

To learn more about InfoSphere Optim, check out these resources:


Manage the Data Lifecycle of Big Data Environments

IBM InfoSphere Optim solutions for data warehouses

Demo: IBM InfoSphere Optim Data Growth Solution

Demo: IBM InfoSphere Optim Test Data Management Solution

To learn more about the IBM approach to information integration and governance
for big data, please contact your IBM representative or IBM Business Partner,
or visit: ibm.com/software/data/information-integration-governance


Copyright IBM Corporation 2013


IBM Corporation
Software Group
Route 100
Somers, NY 10589
Produced in the United States of America
August 2013
IBM, the IBM logo, ibm.com, BigInsights, DB2, Guardium, IMS,
Informix, InfoSphere, Optim, PureData, and z/OS are trademarks
of International Business Machines Corp., registered in many
jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at "Copyright and trademark
information" at ibm.com/legal/copytrade.shtml
Linux is a registered trademark of Linus Torvalds in the United
States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are
trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United
States and other countries.
This document is current as of the initial date of publication and
may be changed by IBM at any time. Not all offerings are available
in every country in which IBM operates.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED
"AS IS" WITHOUT ANY WARRANTY, EXPRESS OR
IMPLIED, INCLUDING WITHOUT ANY WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND ANY WARRANTY OR CONDITION OF
NON-INFRINGEMENT. IBM products are warranted according
to the terms and conditions of the agreements under which they
are provided.

1 IBM 2012 Global Reputational Risk and IT Study. ibm.com/services/us/gbs/bus/html/risk_study-2012-infographic.html

2 Yuhanna, Noel. Your Enterprise Data Archiving Strategy. Forrester. February 2011. ftp://ftp.boulder.ibm.com/software/data/sw-library/data-management/optim/papers/your-enterprise-data-archiving-strategy.pdf

The client is responsible for ensuring compliance with laws and
regulations applicable to it. IBM does not provide legal advice or
represent or warrant that its services or products will ensure that the
client is in compliance with any law or regulation.
Please Recycle

IMM14126-USEN-00