DATA MANAGEMENT CONCEPTS
WACHIRA DAVIS
SELECTED DATA MANAGEMENT CONCEPTS
Data governance
Data governance is an emerging discipline with an evolving definition.
The discipline embodies a convergence of data quality, data management, data policies, business
process management, and risk management surrounding the handling of data in an organization.
Data governance (DG) refers to the overall management of the availability, usability, integrity, and
security of the data employed in an enterprise. It has also been defined as "a system of decision
rights and accountabilities for information-related processes, executed according to agreed-upon
models which describe who can take what actions with what information, and when, under what
circumstances, using what methods."
Promoted by IBM, data governance deals with the policies and processes for managing the
availability, usability, integrity, and security of the data employed in an enterprise, with special
emphasis on promoting privacy, security, data quality, and compliance with government regulations.
Data governance is a set of processes that ensures that important data assets are formally managed
throughout the enterprise.
Data governance ensures that data can be trusted and that people can be made accountable for any
adverse event that happens because of low data quality.
It is about putting people in charge of fixing and preventing issues with data so that the enterprise
can become more efficient.
Data governance is a quality control discipline for assessing, managing, using, improving, monitoring,
maintaining, and protecting organizational information.
Data governance initiatives may be driven by a desire to improve data quality, but they are more
often driven by the need to respond to external regulations.
Examples of these regulations include data privacy regulations.
To achieve compliance with these regulations, business processes and controls require formal
management processes to govern the data subject to these regulations.
Common themes among the external regulations center on the need to manage risk.
The risks can be financial misstatement, inadvertent release of sensitive data, or poor data quality
for key decisions.
Data steward
A data steward is a person who is responsible for maintaining a data element in a metadata registry.
A data steward may share some responsibilities with a data custodian.
Data stewardship roles are common when organizations are attempting to exchange data precisely
and consistently between computer systems and reuse data-related resources.
Data Custodian
Data custodians primarily oversee the safe transport and storage of data.
While content is important to them, their focus is on the underlying infrastructure and activities
required to keep the data intact and available to users.
They collaborate with the Data Stewards to implement data transformations, resolve data issues,
and collaborate on system changes.
Data cleansing
Data cleansing or data scrubbing is the act of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table or database.
It consists of activities for detecting and correcting data in a database that are incorrect, incomplete,
improperly formatted, or redundant.
Data cleansing not only corrects errors but also enforces consistency among different sets of data
that originated in separate information systems.
Specialized data-cleansing software is available to automatically survey data files, correct errors in
the data, and integrate the data in a consistent company-wide format.
Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or
irrelevant parts of the data and then replacing, modifying, or deleting them.
After cleansing, a data set will be consistent with other similar data sets in the system.
The inconsistencies detected or removed may have been originally caused by different data
dictionary definitions of similar entities in different stores, may have been caused by user entry
errors, or may have been corrupted in transmission or storage.
Data cleansing differs from data validation: validation is performed at entry time and almost
invariably means rejecting nonconforming data as it enters the system, rather than correcting it
later in batches.
The actual process of data cleansing may involve removing typographical errors or validating and
correcting values against a known list of entities.
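As an illustration of that last point, a minimal cleansing routine might normalize stray whitespace and correct values against a known list of entities. The field names and reference list below are hypothetical, invented purely for this sketch:

```python
# Minimal data-cleansing sketch (hypothetical fields and reference
# list): normalize whitespace in a name and correct the city against
# a known list of entities.

KNOWN_CITIES = {"nairobi": "Nairobi", "mombasa": "Mombasa", "kisumu": "Kisumu"}

def cleanse_record(record):
    """Return a cleansed copy of one customer record (a dict)."""
    cleaned = dict(record)
    # Collapse stray whitespace introduced at data entry.
    cleaned["name"] = " ".join(record["name"].split())
    # Validate and correct the city against the known list.
    key = record["city"].strip().lower()
    cleaned["city"] = KNOWN_CITIES.get(key, "UNKNOWN")
    return cleaned

print(cleanse_record({"name": "  John   Doe ", "city": "NAIROBI "}))
# {'name': 'John Doe', 'city': 'Nairobi'}
```

A real cleansing tool would go further, for example flagging unmatched values for a data steward to review rather than marking them "UNKNOWN".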
Data Validation
Data validation is the process of ensuring that a program operates on clean, correct and useful
data.
It uses routines, often called "validation rules" or "check routines", that check for correctness,
meaningfulness, and security of data that are input to the system. The rules may be
implemented through the automated facilities of a data dictionary, or by the inclusion of explicit
application program validation logic.
For business applications, data validation can be defined through declarative data integrity rules,
or procedure-based business rules. Data that does not conform to these rules will negatively
affect business process execution. Therefore, data validation should start with the definition of the
business process and of the business rules within that process. Rules can be collected through the
requirements-capture exercise.
The simplest data validation verifies that the characters provided come from a valid set. For
example, telephone numbers should include only the digits and possibly the characters +, -, (, and )
(plus, hyphen, and parentheses). A more sophisticated data validation routine would check that the
user had entered a valid country code, i.e., that the number of digits entered matched the
convention for the country or area specified.
Incorrect data validation can lead to data corruption or a security vulnerability.
Data validation checks that data are valid, sensible, reasonable, and secure before they are
processed.
Data validation helps assure your application that every data value it processes is correct and accurate.
You can design data validation into your application with several differing approaches: user
interface code, application code, or database constraints.
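As a sketch of the application-code approach, validation rules can be written declaratively as a table of predicates that is applied to each record. The rule set and field names here are invented for illustration:

```python
# Sketch of declarative validation rules enforced in application code.
# The rules and field names are invented for illustration.

RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 150,
    "email": lambda v: isinstance(v, str) and v.count("@") == 1,
}

def validate(record):
    """Return the names of fields that fail their validation rule."""
    return [field for field, rule in RULES.items()
            if field in record and not rule(record[field])]

print(validate({"age": 34, "email": "jane@example.com"}))  # []
print(validate({"age": -5, "email": "not-an-address"}))    # ['age', 'email']
```

Keeping the rules in one table, rather than scattered through the code, makes them easier to collect from business requirements and to audit later.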
Data Validation Methods
Allowed character checks
Checks that ascertain that only expected characters are present in a field.
For example, a numeric field may allow only the digits 0-9, the decimal point, and perhaps a
minus sign or commas.
A text field such as a personal name might disallow characters such as < and >, as they could be
evidence of a markup-based security attack.
An e-mail address might require exactly one @ sign and various other structural details. Regular
expressions are effective ways of implementing such checks.
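The checks above can be sketched with regular expressions. Both patterns below are deliberate simplifications for illustration (real e-mail address syntax is considerably more permissive than a single-@ shape check):

```python
import re

# Allowed-character checks implemented as regular expressions.
# Both patterns are deliberate simplifications for illustration.

PHONE_CHARS = re.compile(r"^[0-9+\-() ]+$")   # digits plus + - ( ) and space
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # exactly one @

print(bool(PHONE_CHARS.match("+254 (20) 123-4567")))  # True
print(bool(PHONE_CHARS.match("20#123")))              # False
print(bool(EMAIL_SHAPE.match("user@example.com")))    # True
print(bool(EMAIL_SHAPE.match("user@@example.com")))   # False
```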
Batch totals
Checks for missing records. Numerical fields may be added together for all records in a batch.
The batch total is entered and the computer checks that the total is correct, e.g., add the 'Total
Cost' field of a number of transactions together.
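A batch-total check might be sketched as follows, with a hypothetical 'Total Cost' field and an operator-keyed expected total (all values invented):

```python
# Batch-total sketch: the operator keys in the expected total of the
# 'Total Cost' field; the computer re-adds it from the batch records.
# All values here are invented.

batch = [{"total_cost": 120.50}, {"total_cost": 75.00}, {"total_cost": 4.50}]
entered_batch_total = 200.00   # keyed in by the data-entry operator

computed_total = sum(rec["total_cost"] for rec in batch)
print(computed_total == entered_batch_total)  # True: no record is missing
```

If a record were dropped during entry, the recomputed total would no longer match and the batch would be flagged.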
Cardinality check
Checks that a record has a valid number of related records. For example, if a contact record in a
payroll database is marked as "former employee", then it must not have any associated salary
payments after the date on which the employee left the organisation (cardinality = 0).
Check digits
Used for numerical data. An extra digit is added to a number which is calculated from the digits.
The computer checks this calculation when data are entered. For example, the last digit of a book's
ISBN-13 is a check digit calculated modulo 10 (ISBN-10 instead uses modulo 11).
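As a sketch, the ISBN-13 check digit (digits weighted alternately 1 and 3, modulo 10) can be computed like this; 978-3-16-148410-0 is a commonly used illustrative ISBN:

```python
# ISBN-13 check digit: digits are weighted alternately 1 and 3, and
# the check digit brings the weighted sum to a multiple of 10.

def isbn13_check_digit(first12):
    """Compute the check digit from the first 12 digits (as a string)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first12))
    return (10 - total % 10) % 10

# For 978-3-16-148410-0, the first 12 digits yield check digit 0.
print(isbn13_check_digit("978316148410"))  # 0
```

On entry, the computer recomputes this digit and rejects the number if the stored check digit disagrees, catching most single-digit typos and transpositions.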
Consistency checks
Checks fields to ensure that data in these fields correspond, e.g., if Title = "Mr.", then Gender = "M".
Control totals
This is a total done on one or more numeric fields which appear in every record. This is a
meaningful total, e.g., add the total payment for a number of Customers.
Cross-system consistency checks
Compares data in different systems to ensure it is consistent, e.g., The address for the customer
with the same id is the same in both systems.
The data may be represented differently in different systems and may need to be transformed
to a common format before being compared. For example, one system may store a customer name
in a single Name field as 'Doe, John Q', while another uses three fields: First_Name (John),
Last_Name (Doe), and Middle_Name (Quality). To compare the two, the validation engine would
have to transform the data from the second system to match the first, for example using SQL:
Last_Name || ', ' || First_Name || ' ' || substr(Middle_Name, 1, 1) would produce 'Doe, John Q'.
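The same transformation can be sketched outside SQL. The function below mirrors that concatenation, using the field values from the example above:

```python
# The SQL transformation sketched in Python: build the
# "Last, First M" form from the three separate name fields.

def combined_name(first, last, middle=""):
    """Mirror Last_Name || ', ' || First_Name || ' ' || substr(Middle_Name, 1, 1)."""
    name = f"{last}, {first}"
    if middle:
        name += f" {middle[0]}"
    return name

system_a_value = "Doe, John Q"
system_b_record = {"First_Name": "John", "Last_Name": "Doe",
                   "Middle_Name": "Quality"}

transformed = combined_name(system_b_record["First_Name"],
                            system_b_record["Last_Name"],
                            system_b_record["Middle_Name"])
print(transformed == system_a_value)  # True: the two systems agree
```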
Data type checks
Checks the data type of the input and gives an error message if the input data does not match
with the chosen data type, e.g., In an input box accepting numeric data, if the letter 'O' was
typed instead of the number zero, an error message would appear.
File existence check
Checks that a file with a specified name exists. This check is essential for programs that use file
handling.
Format or picture check
Checks that the data is in a specified format (template), e.g., dates have to be in the format
DD/MM/YYYY.
Regular expressions should be considered for this type of validation.
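A DD/MM/YYYY picture check might look like the following. Note that it validates the shape only: an impossible date such as 31/02/2024 still passes and needs a separate calendar check.

```python
import re

# Format (picture) check for DD/MM/YYYY dates. This validates the
# shape only; an impossible date such as 31/02/2024 still passes and
# needs a separate calendar check.

DATE_FORMAT = re.compile(r"^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/\d{4}$")

print(bool(DATE_FORMAT.match("07/03/2024")))  # True
print(bool(DATE_FORMAT.match("2024-03-07")))  # False: wrong picture
print(bool(DATE_FORMAT.match("32/01/2024")))  # False: day out of range
```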
Hash totals
This is just a batch total done on one or more numeric fields which appear in every record. This
is a meaningless total, e.g., add the Telephone Numbers together for a number of Customers.
Limit check
Unlike range checks, data is checked for one limit only, upper OR lower, e.g., data should not be
greater than 2 (<=2).
Logic check
Checks that an input does not yield a logical error, e.g., an input value should not be 0 if it will
later be used as a divisor somewhere in the program.
Presence check
Checks that important data are actually present and have not been missed out, e.g., customers
may be required to have their telephone numbers listed.
Range check
Checks that the data lie within a specified range of values, e.g., the month of a person's date of
birth should lie between 1 and 12.
Spelling and grammar check
Looks for spelling and grammatical errors.
Uniqueness check
Checks that each value is unique. This can also be applied to a combination of several fields
(e.g., Address, First Name, Last Name).
Data Integration
Data integration involves combining data residing in different sources and providing users with a
unified view of these data.
This process becomes significant in a variety of situations both commercial (when two similar
companies need to merge their databases) and scientific (combining research results from different
bioinformatics repositories, for example).
Data integration is needed with increasing frequency as both the volume of data and the need to
share existing data grow.
Document and Record
A document management system (DMS) is a computer system (or set of computer programs) used
to track and store electronic documents and/or images of paper documents.
Records management
Records management, or RM, is the practice of maintaining the records of an organization from the
time they are created up to their eventual disposal.
This may include classifying, storing, securing, and destroying (or, in some cases, archivally
preserving) records.
A record can be either a tangible object or digital information: for example, birth certificates,
e-mails, and office documents.
Records management is primarily concerned with the evidence of an organization's activities, and is
usually applied according to the value of the records rather than their physical format.
Managing physical records
Managing physical records involves different disciplines and may draw on a variety of forms of
expertise. Records must be identified and authenticated.
This is usually a matter of filing and retrieval; in some circumstances, more careful handling is
required.
Identifying records
If an item is presented as a legal record, it needs to be authenticated.
Forensic experts may need to examine a document or artifact to determine that it is not a forgery,
and that any damage, alteration, or missing content is documented.
In extreme cases, items may be subjected to microscopic or chemical analysis.
This requires that special care be taken in the creation and retention of the records of an
organization.
Storing records
Records must be stored in such a way that they are accessible and safeguarded against
environmental damage.
Circulating records
Tracking the record while it is away from the normal storage area is referred to as circulation.
Often this is handled by simple written recording procedures.
However, many modern records environments use a computerized system involving bar code
scanners, or radio-frequency identification technology (RFID) to track movement of the records.
These can also be used for periodic auditing to identify unauthorized movement of the record.
Disposal of records
Disposal of records does not always mean destruction.
It can also include transfer to a historical archive, museum, or private individual.
Destruction of records ought to be authorized by law, statute, regulation, or operating procedure,
and the records should be disposed of with care to avoid inadvertent disclosure of information.
The process needs to be well-documented, starting with a records retention schedule and policies
and procedures that have been approved at the highest level.
An inventory of the records disposed of should be maintained, including certification that they have
been destroyed.
Records should never simply be discarded as refuse.
Most organizations use processes including paper shredding or incineration.
Commercially available products can manage records through all stages: active, inactive, archival,
retention scheduling, and disposal.
Managing electronic records
The general principles of records management apply to records in any format.
Digital records (almost always referred to as electronic records) raise specific issues.
It is more difficult to ensure that the content, context and structure of records is preserved and
protected when the records do not have a physical existence.
Particular concerns exist about the ability to access and read electronic records over time, since the
rapid pace of change in technology can make the software used to create the records obsolete,
leaving the records unreadable.
Meta Data Management
Metadata is loosely defined as data about data.
Metadata is traditionally found in the card catalogues of libraries and is today commonly used to
describe three aspects of digital documents and data: 1) definition, 2) structure and 3)
administration.
Metadata is defined as data providing information about one or more other pieces of data, such as:
Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of data
Placement on a computer network where the data was created
Standards used
For example, a digital image may include metadata that describes how large the picture is, the color
depth, the image resolution, when the image was created, and other data.
A text document's metadata may contain information about how long the document is, who the
author is, when the document was written, and a short summary of the document.
Metadata can be stored and managed in a database, often called a registry or repository.
Meta-data Management involves storing information about other information.
Because data may reside on many different types of media, storing references to the location of the
data allows diverse repositories to be managed.
Data theft
Data theft is a growing problem, primarily perpetrated by office workers with access to technology.
Such workers may be inclined to copy or delete part of an organization's data when they leave the
company, or to misuse it while they are still employed.
Data migration
Data migration is the process of transferring data between storage types, formats, or computer systems.
Data migration is usually performed programmatically to achieve an automated migration, freeing
up human resources from tedious tasks.
It is required when organizations or individuals change computer systems or upgrade to new
systems, or when systems merge (such as when the organizations that use them undergo a
merger/takeover).
To achieve an effective data migration procedure, data on the old system is mapped to the new
system providing a design for data extraction and data loading.
The design relates old data formats to the new system's formats and requirements.
Programmatic data migration may involve many phases but it minimally includes data extraction
where data is read from the old system and data loading where data is written to the new system.
After loading into the new system, results are subjected to data verification to determine whether
data was accurately translated, is complete, and supports processes in the new system.
During verification, there may be a need for a parallel run of both systems to identify areas of
disparity and forestall erroneous data loss.
Automated and manual data cleaning is commonly performed in migration to improve data quality,
eliminate redundant or obsolete information, and match the requirements of the new system.
Data migration phases (design, extraction, cleansing, load, verification) for applications of moderate
to high complexity are commonly repeated several times before the new system is deployed.
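The extraction, transformation/cleansing, loading, and verification phases described above can be sketched end to end. The systems here are modelled as in-memory lists of dicts with invented field names; a real migration would read from and write to actual data stores:

```python
# End-to-end sketch of the migration phases: extract from the old
# system, transform to the new design, load, then verify. Systems are
# modelled as in-memory lists of dicts with invented field names.

old_system = [
    {"cust_name": "DOE, JOHN", "cust_tel": "020 1234"},
    {"cust_name": "ROE, JANE", "cust_tel": ""},
]

def extract(source):
    """Read all records from the old system."""
    return list(source)

def transform(record):
    """Map old field names and formats to the new system's design."""
    return {"name": record["cust_name"].title(),
            "phone": record["cust_tel"] or None}   # blank -> missing

new_system = [transform(rec) for rec in extract(old_system)]

# Verification: every record translated, and no name is empty.
assert len(new_system) == len(old_system)
assert all(rec["name"] for rec in new_system)
print(new_system[0])  # {'name': 'Doe, John', 'phone': '020 1234'}
```

In practice each phase would be repeated and refined as verification uncovers disparities between the two systems.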
Information policy
Every business, large and small, needs an information policy.
Your firm's data are an important resource, and you don't want people doing whatever they want
with them.
You need to have rules on how the data are to be organized and maintained, and who is allowed to
view the data or change them.
An information policy specifies the organization's rules for sharing, disseminating, acquiring,
standardizing, classifying, and inventorying information.
An information policy lays out specific procedures and accountabilities, identifying which users and
organizational units can share information, where information can be distributed, and who is
responsible for updating and maintaining the information.
For example, a typical information policy would specify that only selected members of the payroll
and human resources department would have the right to change and view sensitive employee
data, such as an employee's salary or social security number, and that these departments are
responsible for making sure that such employee data are accurate.
If you are in a small business, the information policy would be established and implemented by the
owners or managers.
In a large organization, managing and planning for information as a corporate resource often
requires a formal data administration function.
Data administration is responsible for the specific policies and procedures through which data can
be managed as an organizational resource.
These responsibilities include developing information policy, planning for data, overseeing logical
database design and data dictionary development, and monitoring how information systems
specialists and end-user groups use data.
Ensuring data quality
A well-designed database and information policy will go a long way toward ensuring that the
business has the information it needs.
However, additional steps must be taken to ensure that the data in organizational databases are
accurate and remain reliable.
What would happen if a customer's telephone number or account balance were incorrect?
What would be the impact if the database had the wrong price for the product you sold?
Data that are inaccurate, untimely, or inconsistent with other sources of information lead to
incorrect decisions, product recalls, and even financial losses.
According to Forrester Research, 20 percent of U.S. mail and commercial package deliveries were
returned because of incorrect names or addresses.
The Gartner Group consultants reported that more than 25 percent of the critical data in large
Fortune 1000 companies' databases is inaccurate or incomplete, including bad product codes and
product descriptions, faulty inventory descriptions, erroneous financial data, incorrect supplier
information, and incorrect employee data.
Gartner believes that customer data degrades at a rate of 2 percent per month, making poor data
quality a major obstacle to successful customer relationship management.
Think of all the times you've received several pieces of the same direct mail advertising on the same
day.
This is very likely the result of having your name maintained multiple times in a database.
Your name may have been misspelled, you may have used your middle initial on one occasion and
not on another, or the information may have been entered initially onto a paper form and not
scanned properly into the system.
Because of these inconsistencies, the database would treat you as different people!
If a database were properly designed and enterprise-wide data standards established, duplicate or
inconsistent data elements should be minimal.
Most data quality problems, however, such as misspelled names, transposed numbers, or incorrect
or missing codes, stem from errors during data input.
The incidence of such errors is rising as companies move their businesses to the Web and allow
customers and suppliers to enter data into their Web sites that directly update internal systems.
Before a new database is in place, organizations need to identify and correct their faulty data and
establish better routines for editing data once their database is in operation.
Analysis of data quality often begins with a data quality audit, which is a structured survey of the
accuracy and level of completeness of the data in an information system.
Data quality audits can be performed by surveying entire data files, surveying samples from data
files, or surveying end users for their perceptions of data quality.
Data visualization
Data from information systems can be made easier for users to digest and act on by using graphics,
charts, tables, maps, digital images, three-dimensional presentations, animations, and other data
visualization technologies.
By presenting data in graphical form, data visualization tools help users see patterns and
relationships in large amounts of data that would be difficult to discern if the data were presented
as traditional lists of text.
Some data visualization tools are interactive, enabling users to manipulate data and see the
graphical displays change in response to the changes they make.
DISTRIBUTED DATABASE
A distributed database is a logically interrelated collection of shared data (and a description of this
data), physically distributed over a computer network. The DDBMS is the software that
transparently manages the distributed database.
A DDBMS is distinct from distributed processing, where a centralized DBMS is accessed over a
network.
It is also distinct from a parallel DBMS, which is a DBMS running across multiple processors and disks.
The advantages of a DDBMS are that it reflects the organizational structure; makes remote data
more shareable; improves reliability, availability, and performance; may be more economical;
provides for modular growth; facilitates integration; and helps organizations remain competitive.
The major disadvantages are cost, complexity, a lack of standards, and a lack of experience.
A DDBMS may be classified as homogeneous or heterogeneous. In a homogeneous system, all sites
use the same DBMS product.
In a heterogeneous system, sites may run different DBMS products, which need not be based on the
same underlying data model, and so the system may be composed of relational, network,
hierarchical, and object-oriented DBMSs.
As well as having the standard functionality expected of a centralized DBMS, a DDBMS will need
extended communication services, extended system catalog, distributed query processing, and
extended security, concurrency, and recovery services.
DATABASE REPLICATION
Database replication is an important mechanism because it enables organizations to provide users
with access to current data where and when they need it.
Database replication is the process of copying and maintaining database objects, such as relations,
in multiple databases that make up a distributed database system.
The benefits of database replication are improved availability, reliability, and performance; load
reduction; and support for disconnected computing, many users, and advanced applications.
A replication object is a database object such as a relation, index, view, procedure, or function
existing on multiple servers in a distributed database system.
In a replication environment, any updates made to a replication object at one site are applied to the
copies at all other sites.
Mobile database
A mobile database is a database that is portable and physically separate from the corporate
database server but is capable of communicating with that server from remote sites allowing the
sharing of corporate data.
With mobile databases, users have access to corporate data on a laptop, PDA, or other Internet
access device, as required by applications at remote sites.