
Data Management Concepts

EEE 207 Database Management Systems
28 March 2013
Peter Kimemiah
Database Management Concepts
• Database Architecture (Client Server)
• Database Integrity and Transactions (ACID)
• Data Warehouse and Mining (ETL process)
• Object Oriented Databases
• XML databases and SOAP
The physical design of different computers demands different database
configurations in order to take advantage of memory, disks and networks.
In addition, high performance may be an important consideration, as is
high availability and reliability of data. This section briefly covers
different database architectures that address these requirements.

DATABASE ARCHITECTURES
Database Architectures
• The architecture of a database system is greatly influenced by the
underlying computer system on which the database system runs.
• Database systems can have
– Centralized architectures, where one server machine executes
operations on the database. This includes standalone and simple
networked client-server systems (most common)
– Parallel computer architectures, where a database is hosted on
multiple processors (CPUs) to improve querying and response times
(high transaction volume systems such as banking, telecoms and
real-time systems)
– Distributed architectures, where the database spans multiple
geographically separated machines to provide backup, data redundancy
and improved performance for local users (e-mail systems such as
Gmail; ERP systems covering accounts, customer service, procurement, HR)
Centralized Architectures
• Centralized architectures, where one server machine executes
operations on the database. This includes standalone and simple
networked client server systems (most common)
– A laptop or PC will normally have a centralized architecture. MS
Access is an example of a centralized database.
– A client-server system, either networked or standalone, is now a
common option. The user interface is separated from the
database. MySQL and MS SQL Server are examples of client-server
databases. Using MS Access as a front end to a MySQL database gives us
a simple client-server architecture.
– Client systems use ODBC, JDBC or other connectors to allow
multiple applications and users access to the database. The
client can be a spreadsheet, a C++ program, a Java
program, Visual Basic, a PHP web page, JavaScript or even
another database, to name a few.
Centralized Databases
• Stand Alone

• Network Architecture
Typical Server Design
• In a client-server design, the processing of data (sorting, searching,
deleting, inserting, etc.) is done by the SQL engine on the server.
• Clients connect through connectors, e.g. ODBC, JDBC.
• The network interface may be virtual (127.0.0.1) or real.
• Virtual networks are used on standalone systems such as a
developer's laptop or desktop.
Parallel computer architectures.
• In a parallel computer architecture, a database is
hosted on a computer with multiple processors (CPUs).
• This improves querying and response times, as different
DML statements can be processed by different
processors. Such databases require special servers, some of which
have 4 to 100 processors.
• Typically these are high transaction volume systems such as
banking and telecoms systems, and real-time systems used
for equipment management and control (SCADA).
• Examples of such databases are Oracle MPS and IBM's DB2, to name a
few. They are generally expensive.
Parallel computer architectures.

M - denotes memory
P - denotes a processor,
Disks (data) are shown as cylinders
Lines are internal bus connections or high speed
fiber or cable connections.
Parallel computer architectures.
• There are several architectural models for parallel
machines.
– Shared memory. All the processors share a common
memory on the server
– Shared disk. All the processors share a common set of
disks. Shared-disk systems are sometimes called
clusters.
– Shared nothing. The processors share neither a
common memory nor a common disk.
– Hierarchical. This model is a hybrid of the preceding
three architectures.
Distributed Systems
• Distributed databases span multiple geographically
separated machines (near or far).
• This allows sharing of data and provides backup systems
and data redundancy.
• There is improved performance for local users and
sharing of data for remote users.
• Examples include e-mail systems such as Gmail, and ERP
systems with accounts, customer service,
procurement and HR hosted on different machines.
Distributed Systems
• The failure of one site is detected
by the working systems, and
appropriate action is taken to
recover from the failure.
• When the failed site recovers or is
repaired, it is integrated smoothly
back into the system.
• The ability of most of the system
to continue to operate despite
the failure of one site results in
increased availability.
• Availability is crucial for database
systems.
• Loss of access to data by, for
example, an airline may result in
the loss of potential ticket buyers
to competitors.
The network may be a local
area network (LAN) or a
wide area network (WAN) via
cable, wireless or satellite.
A database must ensure the availability and integrity of the underlying
data with minimum intervention by the users of the data. This feature is
characteristic of Database Management Systems (DBMS) and
distinguishes them from databases built on plain file systems.

DATABASE INTEGRITY AND TRANSACTIONS (ACID)
An SQL Transaction
• A transaction is a logical unit of one or more SQL
commands which are performed as a single
step.
• If any of the commands fails then the
transaction is not completed and all the commands
are rolled back automatically.
• For a database to qualify as a proper database
it must satisfy the ACID principle (Atomicity,
Consistency, Isolation and Durability).
ACID Principle
• What is commit?
– This is when data is actually written to the database.
– In MySQL the data is saved immediately after the DML
statement is executed (but this auto commit can be disabled).
– In many databases saving is not automatic. In that case a
commit statement must be executed to save all
processed DML statements.
– If a commit has not been performed, then it is
possible to roll back the transaction (equivalent to undo).
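The difference between commit and rollback can be tried out with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the student table and its columns are made up for the illustration.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()

def count_students(connection):
    return connection.execute("SELECT COUNT(*) FROM student").fetchone()[0]

# An insert that is rolled back leaves no trace (the "undo" case).
conn.execute("INSERT INTO student VALUES (1, 'Moses')")
seen_before_rollback = count_students(conn)  # visible inside the open transaction
conn.rollback()
after_rollback = count_students(conn)

# The same insert followed by a commit is saved.
conn.execute("INSERT INTO student VALUES (1, 'Moses')")
conn.commit()
conn.rollback()  # nothing is left to undo after a commit
after_commit = count_students(conn)
```

The row is visible to the session that inserted it, disappears after the rollback, and survives a later rollback once it has been committed.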
ACID Principle
Atomicity
• You cannot have half an atom. An SQL transaction may contain several update,
insert and delete statements. If any of the statements fails then all the
statements before it must be reversed, and it will be as if nothing happened.
Consistency
• Any change of state (SQL transaction) must leave the database in a stable and
valid state, i.e. it must respect referential integrity rules, validation rules and data types.
Isolation
• While one user's SQL transaction is in progress, other users should not be
able to see its changes until they are committed. The changes are
visible only to the user making them until they are committed.
Durability
• When a transaction is committed, it must be persisted so that it is not lost in the
event of a power, disk or communication failure. Only committed transactions are
recovered during power-up and crash recovery; uncommitted work is rolled back.
Challenge of Transaction
• You are performing a transaction at an ATM. The
following SQL DML statements are performed.
insert into accTransaction values ('123456', '2013-01-12', 10000,
'Withdrawal');
update accMaster
set balance = balance - 10000
where accNumber = '123456';
commit;
• What would happen if either of the two statements failed?
Should the database save any of them?
• Reasons for failure
– Insufficient funds, invalid account number, invalid date format,
power, disk or communication failure
Challenge of Transactions – Atomicity
and Consistency
• In the given example the atomicity principle
states that the two statements should be
treated as one; if either fails, a rollback should
occur.
• Also, if any validation rule that protects the
integrity of the data fails, a rollback should occur. This is the
consistency rule.
• Integrity rules cover foreign keys,
data types, duplicates, domain ranges, etc.
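The ATM example can be sketched as a runnable script. It assumes Python's sqlite3 module instead of MySQL, a starting balance of 15000, and a CHECK constraint standing in for the "insufficient funds" validation rule; the column names other than accNumber and balance are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accMaster (
        accNumber TEXT PRIMARY KEY,
        balance   INTEGER NOT NULL CHECK (balance >= 0)  -- validation rule
    );
    CREATE TABLE accTransaction (
        accNumber TEXT NOT NULL,
        txDate    TEXT NOT NULL,
        amount    INTEGER NOT NULL,
        txType    TEXT NOT NULL
    );
    INSERT INTO accMaster VALUES ('123456', 15000);
""")
conn.commit()

def withdraw(connection, acc_number, amount, tx_date):
    """Run both statements as one unit; return True if they were committed."""
    try:
        with connection:  # commits on success, rolls back on any exception
            connection.execute(
                "INSERT INTO accTransaction VALUES (?, ?, ?, 'Withdrawal')",
                (acc_number, tx_date, amount),
            )
            connection.execute(
                "UPDATE accMaster SET balance = balance - ? WHERE accNumber = ?",
                (amount, acc_number),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = withdraw(conn, "123456", 10000, "2013-01-12")   # 15000 -> 5000
second = withdraw(conn, "123456", 10000, "2013-01-12")  # would go negative

balance = conn.execute("SELECT balance FROM accMaster").fetchone()[0]
tx_rows = conn.execute("SELECT COUNT(*) FROM accTransaction").fetchone()[0]
```

In the second withdrawal the insert succeeds but the update violates the validation rule, so the insert is rolled back too. Only one transaction row remains and the balance stays at 5000.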
Challenge of Transactions - Isolation
• When only the insert statement has been executed and
the update statement has not yet been
executed, another person who checks the
balance should see the original data,
with no changes yet. This is the isolation principle.
• What happens if two people are trying to do the
same transaction on the same account?
• Hint: who commits first? What happens to the
second person?
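Isolation can be observed with two connections to the same database file. The sketch below assumes Python's sqlite3 module and a temporary file; the names teller and observer are hypothetical and simply label the two users.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank.db")

    teller = sqlite3.connect(path)    # the user making changes
    observer = sqlite3.connect(path)  # another user checking the balance

    teller.execute("CREATE TABLE accMaster (accNumber TEXT, balance INTEGER)")
    teller.execute("INSERT INTO accMaster VALUES ('123456', 15000)")
    teller.commit()

    def balance_seen_by(connection):
        rows = connection.execute(
            "SELECT balance FROM accMaster WHERE accNumber = '123456'"
        ).fetchall()
        return rows[0][0]

    # The teller updates the balance but has not committed yet.
    teller.execute(
        "UPDATE accMaster SET balance = balance - 10000 "
        "WHERE accNumber = '123456'"
    )
    seen_by_teller = balance_seen_by(teller)
    seen_by_observer_before = balance_seen_by(observer)

    teller.commit()
    seen_by_observer_after = balance_seen_by(observer)

    teller.close()
    observer.close()
```

Until the commit, only the teller sees the reduced balance. How a second writer is handled differs between systems: SQLite allows one writer at a time, so a second writer has to wait, while other databases may use row locks or versioning.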
Challenge of Transactions - Durability
• What should happen if an infrastructure failure
(disk, power, communication) occurs when the SQL
transaction is only half done?
• When the database is powered up, it will find the
incomplete transactions, and these will be
discarded and the state before the failure will be
restored (rollback).
• If the transactions had been committed but not yet
written to the database files, then they will be written to
the database using a journal.
Challenge of Transactions - Journals
• A database maintains a notebook of
all transactions that it is performing
and has performed. This is called
journaling.
• It is therefore possible to ask a
database to roll back to a specific
point in time, say yesterday at twelve
noon.
• The journal may also be used to
audit what has been
happening and by whom.
• Journals cannot be kept
forever and are valid for a specified
time.
• Journals allow online backup and
recovery to take place.
[Figure: SQL DML statement, then journal, then database commit]
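The idea of rolling back to a point in time can be illustrated with a toy journal. This is not how any particular DBMS stores its journal; the JournalEntry fields, the integer timestamps and the replay function are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JournalEntry:
    timestamp: int    # hypothetical time unit, e.g. seconds since start
    user: str
    account: str
    new_balance: int

def replay(journal, up_to):
    """Rebuild account balances from entries with timestamp <= up_to."""
    state = {}
    for entry in sorted(journal, key=lambda e: e.timestamp):
        if entry.timestamp > up_to:
            break
        state[entry.account] = entry.new_balance
    return state

journal = [
    JournalEntry(100, "teller1", "123456", 15000),
    JournalEntry(200, "atm7", "123456", 5000),
    JournalEntry(300, "teller2", "654321", 800),
]

state_now = replay(journal, up_to=300)
state_at_150 = replay(journal, up_to=150)  # "roll back" to time 150
who_touched = [e.user for e in journal if e.account == "123456"]  # audit trail
```

Replaying the journal only up to an earlier time reproduces the earlier state, and the same entries answer the audit question of who changed an account.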
Challenge of Transactions – commit and rollback

• By default, MySQL runs with auto commit mode enabled.
This means that as soon as you execute a statement that
updates (modifies) a table, MySQL stores the update on
disk to make it permanent.
• To disable auto commit mode implicitly for a single series of
statements, use the START TRANSACTION statement:
• Example
use university;
start transaction;
insert into student values (7834, 'Moses', 'Elec. Eng.', 0);
select * from student; -- you will see the new row but it is not saved yet
commit; -- instead of commit, type rollback; and check the table again
Challenge of Transactions – Two Phase Commit

• Suppose your data is in two
different databases on two
different computers?
• In this case two commit
statements will need to be
performed.
• This type of transaction uses a
two-phase commit to ensure
atomicity and consistency across both databases,
though only one commit
statement appears to be
executed.
• If a failure occurs after committing
on the first database, these
changes must be undone.
[Figure: SQL DML statements applied to a Sales database and an Inventory database, each with its own journal, under one commit]
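The two phases can be sketched with plain Python objects. This is a simplified model, not a real distributed protocol: the Participant class and the two_phase_commit function are hypothetical, and network failures, timeouts and journals are left out.

```python
class Participant:
    """A stand-in for one database taking part in the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.committed = []  # changes made permanent
        self.pending = None  # change held between the two phases

    def prepare(self, change):
        # Phase 1: hold the change and vote on whether it can be committed.
        if not self.can_commit:
            return False
        self.pending = change
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the pending change permanent.
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        # Phase 2 (someone voted no): discard the pending change.
        self.pending = None


def two_phase_commit(participants, changes):
    """Commit every change or none; returns True if all were committed."""
    votes = [p.prepare(c) for p, c in zip(participants, changes)]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


sales, inventory = Participant("sales"), Participant("inventory")
ok = two_phase_commit([sales, inventory], ["record sale", "reduce stock"])

sales2 = Participant("sales")
inventory_down = Participant("inventory", can_commit=False)
failed = two_phase_commit([sales2, inventory_down], ["record sale", "reduce stock"])
```

In the second run the inventory side cannot commit, so the sales side discards its pending change and neither database is updated.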
After SQL transactions are generated, their usefulness declines as the
information becomes historical. However, a customer or a company
may still want to analyze this historical information to provide customer
services, review trends and generate forecasts. The storage and processing of
historical data is called data warehousing, and its subsequent analysis is called data
mining.

DATA WAREHOUSING AND MINING


Data Warehousing and Mining
• Google Search
– Whenever you perform a search on Google, the search
statement is stored in a database.
– The selections you make are also stored in a database
– The results of the search are stored in the database
• How Google uses this information
– Next time you log on, Google uses this information to profile you
and help you improve your search.
– On the revenue side Google provides you with advertisements
based on the type of searches that you make.
– The process of storing historical data is referred to as data
warehousing.
– The process of analyzing the data stored in the database is
called Data mining.
Data Warehousing and Mining
• A data warehouse archives information
gathered from multiple sources, and stores it
under a unified schema, at a single site.
– Important for large businesses that generate data
from multiple divisions, possibly at multiple sites
– Data may also be purchased externally
• Data mining seeks to discover knowledge
automatically in the form of statistical rules
and patterns from large databases.
Data Warehousing
• Data sources often store only current data, not historical data
• Corporate decision making requires a unified view of all organizational data, including
historical data
• A data warehouse is a repository (archive) of information gathered from multiple
sources, stored under a unified schema, at a single site
– Greatly simplifies querying, permits study of historical trends
– Shifts decision support query load away from transaction processing systems
• To put data in a data warehouse the following steps must be taken
– Extract – Data is collected from one or more sources, such as an SQL
database, CSV files, web pages, SMS messages, binary files, etc.
– Transform (translate) – The extracted data is reformatted so that it can be processed.
– Load – Data is loaded into the final database, where it is accessible to the end
user either directly or using data analysis tools
• The process of extracting, transforming and loading is called the ETL process
• Not all data will come from an SQL database, but it must nevertheless be loaded and
processed
• After data warehousing, the process of data mining can begin.
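The three ETL steps can be sketched in a few lines. The CSV content, the sales_history table and the cleaning rules below are invented for the illustration, and an in-memory SQLite database plays the role of the warehouse.

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (a string stands in for a file).
raw_csv = """customer,amount,date
 alice ,1000,12/01/2013
BOB,250,13/01/2013
alice,,14/01/2013
"""

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

# Transform: tidy names, drop rows with no amount, convert dates to ISO format.
def transform(rows):
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue
        day, month, year = row["date"].split("/")
        cleaned.append((
            row["customer"].strip().title(),
            int(row["amount"]),
            f"{year}-{month}-{day}",
        ))
    return cleaned

# Load: write the cleaned rows into the warehouse table.
def load(connection, rows):
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales_history "
        "(customer TEXT, amount INTEGER, sale_date TEXT)"
    )
    connection.executemany("INSERT INTO sales_history VALUES (?, ?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(raw_csv)))

totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales_history "
    "GROUP BY customer ORDER BY customer"
).fetchall()
```

Once loaded, the data can be queried for analysis without touching the source system, which is the point of keeping the warehouse separate.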
Data Warehousing
[Figure: Generic two-level architecture. Data passes through transform (T) and load (L) steps into one company-wide warehouse.]
Periodic extraction, so data is not completely current in the warehouse
Data Mining
• Data mining is the process of semi-automatically analyzing
large databases to find useful patterns
• Prediction based on past history
– Predict if a credit card applicant poses a good credit risk, based on
some attributes (income, job type, age, ...) and past history
– Predict if a pattern of phone calling card usage is likely to be
fraudulent
• Some examples of prediction mechanisms:
– Classification
• Given a new item whose class is unknown, predict to which class it
belongs
– Regression formulae
• Given a set of mappings for an unknown function, predict the function
result for a new parameter value
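As an illustration of a regression formula, the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The figures are made up and chosen to lie exactly on a line; real mining tools use far richer models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up history: monthly income vs. amount repaid (both in thousands).
incomes = [10, 20, 30, 40]
repaid = [25, 45, 65, 85]  # exactly 2 * income + 5 in this toy data

slope, intercept = fit_line(incomes, repaid)
predicted = slope * 50 + intercept  # prediction for a new applicant
```

The known (income, repaid) pairs are the "set of mappings" and the prediction for an income of 50 is the "function result for a new parameter value".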
Data Mining (Cont.)
• Descriptive Patterns
– Associations
• Find books that are often bought by “similar” customers. If
a new such customer buys one such book, suggest the
others too.
– Associations may be used as a first step in detecting
causation
• E.g., association between exposure to chemical X and cancer
– Clusters
• E.g., typhoid cases were clustered in an area surrounding a
contaminated well
• Detection of clusters remains important in detecting
epidemics
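A very simple way to find associations is to count which items appear together. The baskets, book titles and the min_support threshold below are invented for the sketch, and it only looks at pairs.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase baskets, one set of book titles per customer.
baskets = [
    {"SQL Basics", "Data Mining", "XML Primer"},
    {"SQL Basics", "Data Mining"},
    {"SQL Basics", "Data Mining", "Networks"},
    {"XML Primer", "Networks"},
]

# Count how often each pair of books appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def suggest(book, min_support=2):
    """Books bought together with `book` in at least `min_support` baskets."""
    suggestions = []
    for (a, b), count in pair_counts.items():
        if count >= min_support and book in (a, b):
            suggestions.append(b if a == book else a)
    return sorted(suggestions)

suggested = suggest("SQL Basics")
```

A customer who buys "SQL Basics" would be shown "Data Mining", because the two were bought together in three of the four baskets.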
Other Types of Mining
• Text mining: application of data mining to textual
documents
– cluster Web pages to find related pages
– cluster pages a user has visited to organize their visit history
– classify Web pages automatically into a Web directory
• Data visualization systems help users examine large
volumes of data and detect patterns visually
– Can visually encode large amounts of information on a single
screen
– Humans are very good at detecting visual patterns
– Different types of colored maps and graphs are used to display
information (like the systems recently used to display election
results, which would have been nice if they had worked)
Advantages and Challenges of Data
Warehousing and Mining
• Data warehousing allows a company to remove data from
the main transaction systems once it becomes dated (old).
– This improves/maintains performance of the front end systems
– Allows data to be moved to cheaper servers and data storage
systems, where the time taken to retrieve data is less critical
• Data mining allows data to be analyzed without affecting the
core systems
– The data mining systems are separate from the live transaction
systems used to serve customers.
– Since data is stored on cheaper (slower) servers and disk devices,
the historical data can be stored for a very long time (years) at
low cost
• Data warehousing and mining use RDBMS and make extensive use of
parallel and distributed architectures.
Examples of using Data Warehousing
and Mining
• Applications of data warehousing and mining allow
the following to happen
– A customer service system can give immediate service to
high value customers who contribute significant revenues
to the business (when you call 100).
– Google can provide relevant and customized
advertisements (books, medicines, entertainment, etc.)
– A telecom company can decide where to locate and
upgrade its base stations based on the number of customers
and revenue from a specific area (KU, Zimmerman).
– A telecom company can provide seasonal upgrades of base
stations during conferences and shows, hence improving
revenues from voice and data services
Other than RDBMS (Relational Database Management Systems) there are Object
Oriented Databases and XML databases. These are not very popular and have their own
advantages and disadvantages. XML, however, has become important, particularly
because of data interchange and the support of unstructured documents such as
letters, books and PowerPoint presentations, which may need to be stored in a standard
format irrespective of the supplier of such tools.

OBJECT ORIENTED DATABASES AND XML DATABASES
Object-Relational Data Models
• Many RDBMS have object oriented features. These are
referred to here as OODBMS (Object Oriented Database
Management Systems); a relational system extended in this way
is also known as an object-relational DBMS.
• Such systems extend the relational data model by including
object orientation and constructs to deal with added
data types.
• Objects allow for re-use of database table structures.
For example, you may design a staff database table.
• You may want to re-use the structure for a student
table, since the requirements are very similar (address,
Facebook handle, next of kin, etc.)
Object Oriented Models
• Often we need to keep re-creating the same type/structure of
database table
– Person – student, instructor, staff, etc.
– Address – PO Box, telephone, fax, etc.
• Object orientation allows us to create a common type of
object which can then be used as a template for new
tables.
• If we wanted student and instructor to also have a field
for a Facebook handle then we would change the
template and all tables which use that template would
be updated with the change (inheritance).
Structured Types and Inheritance in SQL
• Structured types can be declared and used in SQL
create type Name as -- this creates a template and not a table
(firstname varchar(20),
lastname varchar(20))
final
create type Address as -- this creates a template and not a table
(street varchar(20),
city varchar(20),
zipcode varchar(20))
not final
– Note: final and not final indicate whether subtypes can be created
• Structured types can be used to create tables with composite attributes
create table customer -- this creates a table using the Name and Address templates
(
name Name, -- looks like a C++ user-defined type
address Address,
dateOfBirth date) -- the only new field created
• Dot notation is used to reference components: name.firstname
Structured Types (cont.)
• User-defined row types
create type CustomerType as (
name Name,
address Address,
dateOfBirth date)
not final
• Can then create a table whose rows are a user-defined
type
create table customer of CustomerType
• The SQL statement will look as follows
– select name.firstname, address.city from customer;
Inheritance
• Suppose that we have the following type definition for people:
create type Person
(name varchar(20),
address varchar(20))
• Using inheritance to define the student and teacher types
create type Student
under Person
(degree varchar(20),
department varchar(20))
create type Teacher
under Person
(salary integer,
department varchar(20))
• Subtypes can redefine methods by using "overriding method" in place
of "method" in the method declaration
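For readers more familiar with object oriented programming languages, the same inheritance idea can be mirrored with Python dataclasses. This is only an analogy for the SQL type definitions above, not database code; the field lists are copied from the slide.

```python
from dataclasses import dataclass, fields

@dataclass
class Person:            # plays the role of "create type Person"
    name: str
    address: str

@dataclass
class Student(Person):   # "create type Student under Person"
    degree: str
    department: str

@dataclass
class Teacher(Person):   # "create type Teacher under Person"
    salary: int
    department: str

# Each subtype carries the inherited attributes followed by its own.
student_columns = [f.name for f in fields(Student)]
teacher_columns = [f.name for f in fields(Teacher)]
```

Adding a field to Person would make it appear in both Student and Teacher, which is the template behaviour described for the SQL types.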
Introduction
• XML: Extensible Markup Language
• Defined by the WWW Consortium (W3C)
• Derived from SGML (Standard Generalized
Markup Language), but simpler to use than SGML
• Documents have tags giving extra information
about sections of the document
– E.g. <title> XML </title> <slide> Introduction
…</slide>
• Extensible, unlike HTML
– Users can add new tags, and separately specify how
the tag should be handled for display
Advantages & Disadvantages of
OODBMS
• Advantages
– Support for abstract data types (customized data structures)
– Faster database development through re-use of existing structures
– Easier maintenance, through inherited updates of table structure
– Faster programming using OODBMS than RDBMS, through integration with
object oriented languages such as Java, C++, Ruby, etc.
• Disadvantages
– DML performance is lower
– Complexity of SQL syntax due to OODBMS features
– Standards for OODBMS are just emerging; there is no official standard, so
portability is difficult
– Tend to be vendor oriented and proprietary.
XML Introduction (Cont.)
• The ability to specify new tags, and to create
nested tag structures, makes XML a great way
to exchange data, not just documents.
– Much of the use of XML has been in data exchange applications, not as a replacement
for HTML
• Tags make data (relatively) self-documenting
– E.g.
<bank>
<account>
<account_number> A-101 </account_number>
<branch_name> Downtown </branch_name>
<balance> 500 </balance>
</account>
<depositor>
<account_number> A-101 </account_number>
<customer_name> Johnson </customer_name>
</depositor>
</bank>
XML is one of the most important developments in data management. It allows
exchange of data between systems and databases. Along with XML there is SOAP, a
protocol used to exchange XML data over the internet,
typically using HTTP. XML allows storage and exchange of documents with
unstructured data such as word-processor documents, Excel sheets, contact lists, LPOs, invoices, etc. It
separates how data is stored from how it is displayed.

XML DATA EXCHANGE AND DATABASES
XML: Motivation and Importance
• Data interchange is critical in today’s networked
world
– Examples:
• Banking: funds transfer
• Order processing (especially inter-company orders)
• Scientific data
– Chemistry: ChemML, …
– Genetics: BSML (Bio-Sequence Markup Language), …
– Paper flow of information (LPOs, invoices, receipts)
between organizations is being replaced by
electronic flow of information
• Each application area has its own set of standards for
representing or displaying information
• XML has become the basis for all new generation
data interchange formats
XML Motivation (Cont.)
• Earlier generation formats were based on plain text with line
headers indicating the meaning of fields (remember CSV format)
– Similar in concept to email headers
– Does not allow for nested structures, no standard “type” language
– Tied too closely to low level document structure (lines, spaces, etc)
• Each XML based standard defines what are valid elements, using
– XML type specification languages to specify the syntax
• DTD (Document Type Definition)
• XML Schema
– Plus textual descriptions of the semantics
• XML allows new tags to be defined as required
– However, this may be constrained by DTDs
• A wide variety of tools is available for parsing, browsing and
querying XML documents/data
Comparison with Relational Data
• Some Disadvantages of XML
– Inefficient: tags, which in effect represent schema information, are
repeated
– Lower database performance due to lack of predictable structure
– Bad for large data storage (complex indexes and referential integrity)
– Complex Query Language
• Some Advantages of XML
– Better than relational tuples as a data-exchange format
– Unlike relational tuples, XML data is self-documenting due to presence
of tags
– Non-rigid format: tags can be added
– Allows nested structures
– Wide acceptance, not only in database systems, but also in browsers,
tools, and applications
Structure of XML Data
• Tag: label for a section of data
• Element: section of data beginning with <tagname>
and ending with matching </tagname>
• Elements must be properly nested
– Proper nesting
• <account> … <balance> …. </balance> </account>
– Improper nesting
• <account> … <balance> …. </account> </balance>
– Formally: every start tag must have a unique matching
end tag, that is in the context of the same parent element.
• Every document must have a single top-level element
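These nesting rules are what an XML parser checks. The sketch below uses Python's standard xml.etree.ElementTree module; the is_well_formed helper is a hypothetical name introduced for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<account><balance>500</balance></account>"
improper = "<account><balance>500</account></balance>"  # tags cross over
two_roots = "<account/><account/>"                      # no single top-level element

results = [is_well_formed(doc) for doc in (proper, improper, two_roots)]
```

Only the properly nested document with a single top-level element is accepted; the other two are rejected by the parser.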
Example of Nested Elements
<bank-1>
<customer>
<customer_name> Peter </customer_name>
<customer_street> Madawa </customer_street>
<customer_city> Garissa </customer_city>
<account>
<account_number> A-102 </account_number>
<branch_name> Benda Street </branch_name>
<balance> 4000 </balance>
</account>
<account>

</account>
</customer>
.
.
</bank-1>
This structure stores data for each customer and, under each customer, the
account details. Multiple accounts are stored under one customer (nesting).
Motivation for Nesting
• Nesting of data is useful in data transfer
– Example: elements representing customer_id, customer_name,
and address nested within an order element
• Nesting is not supported, or discouraged, in relational
databases
– With multiple orders, customer name and address are stored
redundantly
– normalization replaces nested structures in each order by
foreign key into table storing customer name and address
information
– Nesting is supported in object-relational databases
• But nesting is appropriate when transferring data
– External application does not have direct access to data
referenced by a foreign key
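The relationship between nesting and normalization can be shown by flattening a nested document into two relational-style row lists. The sketch uses Python's xml.etree.ElementTree; the second account (A-305) and the generated customer_id key are made up for the illustration.

```python
import xml.etree.ElementTree as ET

nested = """
<bank-1>
  <customer>
    <customer_name>Peter</customer_name>
    <customer_city>Garissa</customer_city>
    <account>
      <account_number>A-102</account_number>
      <balance>4000</balance>
    </account>
    <account>
      <account_number>A-305</account_number>
      <balance>750</balance>
    </account>
  </customer>
</bank-1>
"""

root = ET.fromstring(nested.strip())

customers = []  # one row per customer
accounts = []   # one row per account, with a foreign key to the customer
for customer_id, customer in enumerate(root.findall("customer"), start=1):
    customers.append((
        customer_id,
        customer.findtext("customer_name"),
        customer.findtext("customer_city"),
    ))
    for account in customer.findall("account"):
        accounts.append((
            account.findtext("account_number"),
            int(account.findtext("balance")),
            customer_id,  # foreign key replacing the nesting
        ))
```

In the XML the accounts sit inside their customer, so a receiver needs no lookup. In the row lists the same link is carried by the customer_id column.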
Structure of XML Data (Cont.)
• Mixture of text with sub-elements is legal in XML.
– Example:
<account>
This account is seldom used any more.
<account_number> A-02 </account_number>
<branch_name> Benda Street </branch_name>
<balance> 4000 </balance>
</account>
– Useful for document markup, but discouraged for data
representation
XML Application: Web Services
• The Simple Object Access Protocol (SOAP)
standard:
– Invocation of procedures across applications with
distinct (different/incompatible) databases
– XML used to represent procedure input and output
• A Web service is a site providing a collection of
SOAP procedures
– Described using the Web Services Description
Language (WSDL)
– Directories of Web services are described using the
Universal Description, Discovery, and Integration
(UDDI) standard
XML Terminologies
• XPath - XPath addresses parts of an XML document by means of path
expressions. The language can be viewed as an extension of the simple
path expressions in object oriented and object-relational databases. The
current version of the XPath standard is XPath 2.0.
• XQuery - The World Wide Web Consortium (W3C) has developed XQuery
as the standard query language for XML. Our discussion is based on
XQuery 1.0, which was released as a W3C recommendation on 23 January
2007.
• XQJ - The XQJ standard provides an API to submit XQuery queries to an XML
database system and to retrieve the XML results. Its functionality is similar
to the JDBC API.
• XSLT - The XSLT language (Extensible Stylesheet Language Transformations) is another
language designed for transforming XML. XML does not describe how to
display the underlying data. Combined with XQuery, XSLT is used to display
data. It is used primarily in document-formatting applications, rather than in
data management applications.
XML and Relational Databases
• Many relational databases allow for
importation of XML data into tables.
• The XML format is superior to CSV (comma
separated values) since data can be imported
into OODBMS or into different tables to form tuples.
• Many RDBMS also allow for SQL queries that
can return data in XML format. This allows for
data exchange from one database system to
another.
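Returning query results as XML can be approximated in application code when the database has no built-in support. The sketch below assumes SQLite via Python's sqlite3 module; query_to_xml is a hypothetical helper, and the account row reuses the values from the earlier bank example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_number TEXT, branch_name TEXT, balance INTEGER)"
)
conn.execute("INSERT INTO account VALUES ('A-101', 'Downtown', 500)")
conn.commit()

def query_to_xml(connection, sql, root_tag, row_tag):
    """Run a query and wrap each result row in an XML element."""
    cursor = connection.execute(sql)
    columns = [col[0] for col in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        row_element = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(row_element, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = query_to_xml(conn, "SELECT * FROM account", "bank", "account")
parsed = ET.fromstring(xml_text)
```

Column names become tag names, so the output is self-documenting and can be read by another system without knowing the original table definition.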
References
• Silberschatz, Korth and Sudarshan, Database
System Concepts, 6th Ed., Chapters
14, 15, 20, 21, 22 and 23.
• W3Schools. Retrieved from
http://www.w3schools.com/xml/
