You are on page 1of 26

Introduction to Databases

Introduction
In a small operation, the network admins or developers double up as database admins
(DBAs). In larger businesses, there might be dozens of DBAs specializing in the many
different facets from design and architecture, to maintenance, development, etc. No
matter what part of IT you work in, you have to store data at one point or another, and it
doesn't hurt for just about everybody to have some knowledge of databases and how they
work.
The goal of this tutorial is to provide this basic introduction. We're going to explain the
basics of what a database actually is, look at the history, understand relational databases,
get into some fundamental concepts from columns and rows on up, touch on other types
of databases, get versed on some additional concepts to understand, and wrap it all up
with a quick review of the major commercial systems on the market today.
For the most part there are no pre-requisites to this tutorial other than basic computing
knowledge.

What Is a Database?
A database, in the most general sense, is an organized collection of data. More
specifically, a database is an electronic system that allows data to be easily accessed,
manipulated and updated.
In other words, a database is used by an organization as an electronic way to store,
manage and retrieve information. The database is one of the cornerstones of enterprise
IT, and its ability to organize, process and manage information in a structured and
controlled manner is the key to many aspects of modern business efficiency.
However, databases go way beyond simply storing data. As we’ll see later, the inherent
logic and efficiency in how the data is stored and retrieved can provide an incredibly
powerful business tool to an organization. This is especially true when databases are
properly exploited for their reporting and business intelligencecapabilities.

The Use and Importance of Databases in Today’s World


So what kinds of organization requires a database and can benefit from its use? Well, the
short answer is any business or organization that needs to keep track of large numbers
of customers or products. By "large" we mean more than can be stored by a human brain
- a lot more.
At this point, a skeptic might still argue that there are countless mom-and-pop stores
whose owners keep track of inventory and profit/ loss using the trusty ledger and
calculator, and are doing OK. That’s true, but the use of an electronic database can still
pay off, even for very small businesses. For instance, a ledger cannot run a simulation to
extrapolate profits if say, the shop were to increase the price of ballpoint pens by 2 cents.
A database can do that. The ledger cannot run a report tracking down re-order levels for
all items to show the store owner which items should be restocked at what times during
the year. A database can do that too. A database can even automatically alert the
business owner via email or text message.
The most significant benefit of databases, however, is still limited to large organizations
with customers and products numbering in the hundreds of thousands, millions, or tens
of millions, and the need to store large numbers of individual data items for customers.
For instance, a commercial bank needs the personal details of all of its millions of
customers, such as name, date of birth, address, Social Security number, etc. Each
customer in turn spawns another collection of data depending on the products he or she
has signed up for, such as account type, account number, account balance, mortgage
amount, credit card loan, repayment period and so on. A third collection of data relates to
the customer’s specific transactions, such as the time of transaction, amount, balance
left, bank charges, loan amount left to repay, etc.
Clearly, a single customer can generate a huge amount of data in a very short time.
Multiply this by millions of customers, and it’s easy to see why having an efficient data
storage and retrieval medium is not only a good idea, but is also absolutely essential in
preventing the bank’s operations from grinding to a halt.
But wait. Organizations managed just fine before the advent of computers and these new-
fangled databases, didn’t they? Sure. But if you’re not unconvinced of the power of
databases based on the ability to increase efficiency and transaction speed, consider this:
Your competitors will be more efficient and offer better service than you do as a result of
their use of database systems.
Commercial banks are a prime example of the use of databases in today’s organizations.
Other industries whose operations are heavily reliant on databases are insurance
companies, hospitals and health care, schools and colleges, manufacturing,
telecommunications companies, and hotels and the hospitality industry.
Databases are indeed what the term says - the base or pool of (related) data upon which
an organization’s data systems should be built. Indeed, it is not a stretch to say that
without databases, the computer in the workplace would be little more than a more
efficient typewriter!

History of Databases
It is important to first realize that the organized, systematic methodology of storing records
we know and heavily depend on in databases is not a recent invention. What is recent is
the computerization of this methodology beginning in the 1960s. Note that even paper-
based records, including ledger-based bookkeeping, are (technically) all forms of a
database. That is, a database does not necessarily have to be computerized.
Computerization only produced a database management system(DBMS), which is
obviously several orders of magnitude more powerful, accurate and capable than what a
humble ledger or a puny human brain can achieve. And although we are mostly using the
term "database" to refer to the DBMS, the two are not the same thing; all thumbs (DBMSs)
are fingers (databases), but not all fingers are thumbs.
The ancient Egyptians used elaborate record-keeping systems to keep stock of grain
harvests. The Library of Alexandria employed a sophisticated method to keep track of
huge numbers of books and scrolls. These were all early examples of databases,
although of course their capabilities would be laughable compared to the hugely capable
computerized DBMSs of the 21st century.
But even way back in time, back when the entire field of computing was still in its single-
celled-organism stage (the 1960s), many people could already visualize that computers
would be truly useful if they could provide a way of reliably storing and retrieving data.
The development of databases therefore occurred almost in perfect step with the general
development and growth of the computing capabilities of the day. As disk capacity and
processor speed grew, so did the storage capacity and feature sets of the contemporary
database offerings. One important leap that occurred in the mid-1960s was the switch
from tape-based storage to direct access storage, or disks. This change allowed
multitasking interactive data access, as opposed to the single-operator, batch-type
processing necessitated by tapes.
The earliest database systems were navigational in nature. This means that applications
processed and read data by using pointers embedded in the data itself. The pointer led
to the next data item and could be doubly linked, allowing linkage to both the previous
and next data items. This is similar to how hyperlinks work on a Web page by leading the
reader to a related Web page from the current one. The two main data models at this time
were the hierarchical model epitomized by IBM's IMS system, and the Codasyl, or
network, model. But all these were bested and reduced to mere interesting footnotes in
history by the emergence of the relational model by a brilliant computer scientist by the
name of E.F. Codd.

E.F. Codd and the Relational Model


The relational model was a radical departure from the reigning hierarchical model in that it focused
on the ability to search a database by content rather than by following a linked navigation system.
This offered the significant advantage of allowing databases to grow and store more and more
data, all without having to change or rewrite the applications that accessed that data. Essentially,
Codd single-handedly designed a way to divorce the skeleton or structure of the database from
the data records held in the database. So elegant was this model that it is the de facto standard
for database design to this day, with such databases termed relational databases. There are a
few very important non-relational databases (especially with the advent of big data and Web 2.0),
but the relational model is still used for the overwhelming majority of commercial database
offerings.

Today, E.F. Codd’s name would mostly evoke a nonchalant "E.F. who?" among most people,
even many in the IT industry. However, his work has directly led to the huge benefits and efficiency
that relational databases provide. His contribution to the world of computing is comparable in
scale to that of Sir Isaac Newton’s to the world of physics.
Codd attended Oxford college, studying mathematics and chemistry, then worked as a pilot in the
Royal Air Force during WWII before moving to the U.S. in 1948 to work as a mathematical
programmer for IBM. After spending a decade in Canada, he returned to the U.S. in 1963 and
received his Ph.D. in 1965.
In 1970, Codd published a paper on data management titled "A Relational Model of Data for Large
Shared Data Banks" for IBM. The giant company, however, was heavily invested in the
hierarchical model via its Information Management System (IMS), and Big Blue executives were
not interested in developing a competitor for one of their own lucrative product lines. Showing
guile rarely seen in academic or scientific types, Codd slyly showed his model to select IBM
customers, who upon viewing it needed little convincing of its superiority. The influential
customers in turn put pressure on the very same IBM executives to develop the model and they
reluctantly (and, one imagines, seething quietly with fury at Codd) placed the model under
development in IBM’s Future Systems project, with the system itself known as System R.

However, the head honchos were still unwilling to threaten IMS, and sabotaged Codd’s work by
placing the System R project in the hands of developers who were unfamiliar with it. The
developers thus failed to use Codd’s own Alpha language for development, instead electing to
use a much simpler language known as SEQUEL. This turned out to be an accidental
masterstroke, however, since SEQUEL is much easier to understand and use. For copyright
reasons, the name was changed to SQL, and is very familiar to database developers and
administrators today as the language of choice for writing database queries.

A shrewd young businessman who was developing his own database system read about SQL at
a conference in 1979. He recognized its superiority and copied the language into a database
product by his own small company. The businessman had also previously seen Codd’s work on
the relational model, and became convinced that it was the way to go for database systems. He
based his own product on it, even though IBM refused to share System R’s code with him.
Remember, IBM was not interested in the relational model. That small company has grown quite
a bit; today it’s known as Oracle Corp. As for the businessman, his name is Larry Ellison, and his
conviction helped him become one of the richest people in the world. It just goes to show how
badly IBM miscalculated the potential of Codd’s relational model. In fact, Oracle DB is the most
widely used relational database for corporations today.

The Relational Database


As we have already seen, the work of E.F. Codd established the relational model of
databases as the clearly superior method of data storage. The elegance of the relational
model in storing and manipulating data has been so effective that since the late 1970s,
none of the many pretenders to its throne have managed to overthrow it.
So let’s go into some detail on exactly how the relational model works. A relational
database is essentially a group of tables or, to use the technical name, entities (refer to
rules 0 and 1 in Codd’s 12 Rules of Relational Databases). Each table is made up
of rows (tuples) and columns (attributes). The tables have relationships between them
that are defined as using a certain column in one table that references a column in another
table.
That is the basic definition of a relational database. But as you will soon see, it can get
much more elaborate than this. For instance, one of the fundamental concepts of
relational databases is that of referential integrity. This rule states that relationships
between tables must always remain consistent. In other words, any field located in
a foreign key must be in agreement with the primary key that the foreign key references.
Therefore, any updates or deletions to a primary key field must either also be applied to
all its foreign keys, or must not be allowed to happen. The same restriction also applies
to foreign keys; any updates (but not necessarily deletions) must either also be
propagated back to the corresponding parent primary key, or not allowed.
The idea of referential integrity is best explained by illustration. Assume that there are two
tables in a bank’s database: the CUSTOMER_MASTER table for holding basic
customer/account holder data, and the ACCOUNTS_MASTER table for storing basic data
about bank accounts. To uniquely identify each customer/account holder in the
CUSTOMER_MASTER table, we create a primary key column, which we call
CUSTOMER_ID.
You can see that in order to identify the customer to which a certain bank accounts
belongs in the ACCOUNTS_MASTER table, we must reference an existing customer in
the CUSTOMER_MASTER table. This means we need to create a CUSTOMER_ID
column in the ACCOUNTS_MASTER table as well. This is called a foreign key. This
column is special because the values it contains are not new values that you can just
conjure up. They must reference identical, existing values in the primary-key column of
another table, in this case, the CUSTOMER_ID column of the
CUSTOMER_MASTER table

Clear, yes? Good. Now, referential integrity simply means that you cannot edit any
CUSTOMER_ID value in CUSTOMER_MASTER without also editing the corresponding
value in the ACCOUNTS_MASTER table. If you change Andrew Smith’s customer ID in
CUSTOMER_MASTER, you must also change it in ACCOUNTS_MASTER. This makes
perfect sense if you think about it. Otherwise, how would we link Andrew Smith’s accounts
back to him? Remember, the ACCOUNTS_MASTER table does not hold any information
about account holders apart from storing their CUSTOMER_IDs as a foreign key. By the
same logic, if a customer ID in the bank accounts table were allowed to exist without a
corresponding customer ID in the CUSTOMER_MASTER table, this usually means that
a bank account can exist without an account holder, which clearly does not make sense.

Now, following this train of thought a bit further, if you delete a customer ID in
CUSTOMER_MASTER, you must also delete all corresponding entries in
ACCOUNTS_MASTER. This is simply the manifestation of a neat follow-through action
in real life, where if someone stops being a customer, you must also do away with their
bank accounts. As Star Trek’s Spock would put it, it’s only logical!

The 12 Rules of Relational Databases


As the relational model started to become fashionable for database design in the early
1980s, Codd was at first bemused then angered by the trend by every other database
vendor to slap the relational moniker on their product, even when it didn’t apply. He came
up with a list of 12 rules that determined whether a database could be called "relational".
His difficulties with IBM reached a peak at this time, and he left to form his own consulting
company with another lecturer/ consultant called Chris Date.
Below are Codd’s 12 rules or, as some call them, 12 commandments. There are actually
13, but they are numbered from 0 to 12, hence the name. We will look at them in more
detail once we delve into the features of relational databases and reference them then so
that you can really understand Codd’s rules in context. In raw form, these rules do not
make much sense for those who don’t work in the database field, but anyway, here goes:

0. Foundation Rule
A relational database management system must manage its stored data using only its
relational capabilities.

1. Information Rule
All information in the database should be represented in one and only one way - as values
in a table.

2. Guaranteed Access Rule


Each and every datum (atomic value) is guaranteed to be logically accessible by resorting
to a combination of table name, primary key value and column name.

3. Systematic Treatment of Null Values


Null values (which are distinct from empty character strings, strings of blank characters,
zeros or any other number) are supported in the fully relational DBMS for representing
missing information in a systematic way that is independent of data type.
4. Dynamic Online Catalog Based on the Relational Model
The database description is represented at the logical level in the same way as ordinary
data, so authorized users can apply the same relational language to its interrogation as
they apply to regular data.

5. Comprehensive Data Sublanguage Rule


A relational system may support several languages and various modes of terminal use.
However, there must be at least one language whose statements are expressible, per
some well-defined syntax, as character strings and whose ability to support all of the
following is comprehensible:

1. Data definition
2. View definition
3. Data manipulation (interactive and by program)
4. Integrity constraints
5. Authorization
6. Transaction boundaries (begin, commit and rollback)

6. View Updating Rule


All views that are theoretically updateable are also updateable by the system.

7. High-Level Insert, Update and Delete


The ability to handle a base relation or a derived relation as a single operand applies not
only to the retrieval of data, but also to the insertion, update and deletion of data.

8. Physical Data Independence


Application programs and terminal activities remain logically unimpaired whenever any
changes are made in either storage representation or access methods.

9. Logical Data Independence


Application programs and terminal activities remain logically unimpaired when
information-preserving changes of any kind - and those that theoretically permit
unimpairment - are made to the base tables.

10. Integrity Independence


Integrity constraints specific to a particular relational database must be definable in the
relational data sublanguage and storable in the catalog, not in the application programs.
11. Distribution Independence
The data manipulation sublanguage of a relational DBMS must enable application
programs and terminal activities to remain logically unimpaired whether and whenever
data are physically centralized or distributed.
12. Nonsubversion Rule
If a relational system has or supports a low-level (single-record-at-a-time) language, that
low-level language cannot be used to subvert or bypass the integrity rules or constraints
expressed in the higher-level (multiple-records-at-a-time) relational language.

Fundamental Database Concepts


We will now look at some key database concepts and data objects. Any database
administrator worth his salt must be absolutely familiar with these. And they aren’t just
theoretical; virtually all DBAs will be intimately involved with these concepts and objects
on a near-daily basis. They are to database administration what knowledge of the human
body is to the field of medicine.

We will only define each term briefly here. To better grasp the concepts, together with
real-world examples, follow the term links and visit the "Related Terms" section to
understand how each concept relates to and works with others in the realm of database
administration.

Tables
The table is the basic data-storage unit in a relational database. Tables consist of
columns and rows. The columns are the attributes or qualities that we want to express,
while the rows hold the actual data, with one (or no) items per row. Think of the layout of
a spreadsheet; this is very similar to the logical organization of a relational table.
A simple example of the tables in the database of a commercial bank can be found below.

Relationship
Relationships are THE reason why relational databases work so well. If you only learn
one concept about databases, this is the one to learn. As the name implies, relationships
are the very core of relational databases.
In relational databases, a relationship exists between two tables when one of them has a
foreign key that references the primary key of the other table. (More on foreign and
primary keys in a bit).
In the diagram below, you can see examples of relationships. For instance, the
AccountTypeID field (column) in the AccountTypes table references the AccountTypeID
column of the Customer table.

Row
A row, also called a record, represents a set of data about a specific item. Every record
in a table has exactly the same structure, but of course different data. Think of the rows
in an Excel spreadsheet - the concept of rows in a database is very similar. Each row in
a table consists of distinct data items, with one (or zero) items for each column of the
table. Rows are also called tuples, although this term is not very common.

See an example of a record or row below:

Column
A column is a specific set of values in a table of the same type. It defines a specific
attribute of the table or data. For example, we can create a column called
CUSTOMER_SURNAME in a table. This is a self-explanatory column whose purpose is
to store customer surnames, one value for each row. Again, think of the rows in an Excel
spreadsheet and you’ll have a pretty good idea of how columns in a relational table work.

Primary Key
A primary key is a special column or combination of columns that uniquely identifies each
record (row) in the table. The primary key column must be unique for each row, and must
not contain any nulls (non-values). So, for example, to identify yourself in various
databases belonging to various U.S. government departments, a unique identifier, the
Social Security number, is assigned and used.
The primary key, together with the closely related foreign key concept, are the main way
in which relationships are defined. Primary keys may also be a combination of columns.
For instance, for many companies, a calendar month is a short-term financial period.
Therefore, to uniquely identify any period, you can combine the month column and the
year column, such as May 2011, to form a primary key to uniquely identify each and every
financial period.

Foreign Key
We cannot talk about the yin of primary keys without the yang of foreign keys. The two
go hand-in-hand. A primary key uniquely defines a record, while a foreign key is used to
reference the same record from another table.
In a commercial bank’s database, assume that you have a CUSTOMER_ID column as
the primary key in the CUSTOMER_MASTER table. This uniquely identifies each
customer; let us also assume the same table also contains other relevant customer
information in other columns, such as CUSTOMER_SURNAME,
CUSTOMER_FIRSTNAME, CUSTOMER_SOCIAL_SEC_NUMBER, CUSTOMER,
CUSTOMER_GENDER, and so on.
We also have another table called LOANS_MASTER to keep track of loans given to the
same bank’s customers. We now only need a single column in this table to identify which
customer has received a particular loan. We can call this column CUSTOMERID, and it
will reference the CUSTOMER_ID column of the CUSTOMER_MASTER table. We don’t
need to store all the other customer information (name, gender, Social Security number,
etc.) in the loans table. This is the elegance of the relational model.

SQL
Structured Query Language (SQL) is the de facto language used for the management and
manipulation of data in relational databases. SQL can be used to query, insert, update and modify
data. All major relational databases support SQL, which makes life much easier for database
administrators (DBAs), who have to support databases on several different platforms. Proficiency
in SQL is usually one of the very first things any DBA must learn early in his or her career. Note
that some people pronounce SQL as one word, "sequel".

Note that the SQL language is different from SQL Server, a relational database platform from
Microsoft. It can be confusing for beginners because of the use of the generic term SQL.
Most commercial RDBMS platforms have their own customized SQL implementations, but these
tend to be fully compatible with the standard SQL.
Other Types of Databases

Databases vs. Spreadsheets


"Excel also stores my data and I can retrieve and manipulate the data using filters, join it
to other files and worksheets, perform advanced functions like VLOOKUP, PivotTable,
and so on. So why do I need this fancy database thing you are talking about? You IT guys
just want to milk money from me. No, my Excel sheets are perfectly adequate!"

Sound familiar? It might be if you’ve ever worked as a database consultant for small
businesses. It would seem at first glance that a lot of the functionality offered by databases
can be achieved much more easily (and cheaply!) by just using spreadsheets. However,
the spreadsheet has a number of limitations that make it unsuitable for managing some
data situations:

 Spreadsheets generally cannot handle multi-user access. If you work in an


organization that uses shared files, you have more likely than not come across the
annoying "file is locked by another user" error message. Even if a specific type of
spreadsheet has some multi-user functionality, it is relatively limited. Databases on the
other hand, can perfectly handle multi-user access, even for a mix of read-writes and read-
only access for the same data item.
 Spreadsheets provide poor data validation and integrity. There’s little stopping one of
your users from totally deleting the data in an Excel file. Sure you can employ worksheet
passwords, but these provide a very limited level of security, and cannot stop someone
from deleting the entire file. Databases can provide a fine-grained level of security and
can even protect users from their own human mistakes. For instance, a database
application can easily be set up to ensure that when creating a customer you must also
enter a Social Security number.
 Querying and reporting is one feature that spreadsheets do not do well. The ability
to run queries against a data set is extremely useful. Granted, spreadsheets can offer
some rudimentary reporting via filters and graphing, but comparing this to SQL queries in
databases that can join multiple tables and perform complex operations is akin to
comparing Fred Flinststone’s car to a brand-new Mercedes S-Class, the epitome of
motoring technology.
 Computer resources are severely tasked by handling large spreadsheets.
Databases can easily hold and provide access to data in the order of millions of rows,
organized into hundreds of tables, all running on a basic server, without breaking a sweat.
Try increasing the number of rows in your Excel file to hundreds of thousands of rows,
and watch what happens to your computer’s response time.

Spreadsheets are much easier to create and maintain. Databases require more
investment in terms of both of financial outlay and human training. However, the reward
for this is a much more robust and secure storage and retrieval data system. The point at
which you need to move away from spreadsheets and into a database-based system may
be when you answer "yes" to one or more of the following questions:
 Is the data held in spreadsheets needed over a long-term or recurring basis, as opposed
to a one-time work in progress?
 Do multiple people need access to this data?
 Do you need to safeguard against erroneous entries?
 Does the data need to be protected against inadvertent corruption?

Need more convincing? Well an authority no less than the U.S. government has decreed,
via Section 404 of the Sarbanes-Oxley Act, that all public companies must move the
reporting of key financial data away from spreadsheets.

Relational Databases
As we have already established, the reasons for the dominance of relational databases
are simplicity, robustness, flexibility, performance, scalability and compatibility in
managing generic data.
However, to offer all of this, relational databases have to be incredibly complex internally.
For example, a relatively simple SELECT statement could have hundreds of potential
query execution paths, which the optimizer would evaluate at run time. All of this is hidden
to us as users, but under the hood, RDBMS determines the "execution plan" that best
answers requests by using things like cost-based algorithms.
America’s best-selling sedan of the 2000s was the Toyota Camry, and it’s not hard to see
why. The Camry is not the best in many of the categories used to judge cars, such as
safety, gas mileage, interior volume, reliability and several others. However, it is always
near the top in each category, giving it an overall aggregate score that puts it on top. So
what, you may ask, does a midsized family sedan have to do with our discussion on
databases? Like the Camry, the relational database does not excel or really shine in any
one of the qualities of a good database, but it fulfills all sections nicely, enough to make
it the default go-to option.
With the advent of the Web 2.0 and especially cloud-based computing, there has been
an increased need for Web-capable databases to serve up, store and manage mind-
bogglingly large amounts of content. Content such as Facebook’s user profiles and posts
for millions around the world; Google’s billions of searches and Web crawls of other
websites; Dropbox’s millions of stored user documents and files; eBay’s millions of
auction listings, and so on. Broadly speaking, the buzzword for this area is "big data."
For all these Web-centric databases, the main concern is scalability. As more and more
applications are launched in environments that have massive workloads (think of the
diverse range of Web services available on the Web today), their scalability requirements
can change very quickly and grow very large. Relational databases scale well, but usually
only when that scaling happens on a single server. When the capacity of that single server
is reached, you need to scale out and distribute that load across multiple servers, moving
into so-called distributed computing. This is when the complexity of relational databases
starts to rub against their potential to scale. Try scaling to hundreds or thousands of
servers, and the complexities become overwhelming. The characteristics that make
RDBMS so appealing are the very same that also drastically reduce their viability as
platforms for large distributed systems.
For cloud services to be viable, vendors have had to address this limitation, because a
cloud platform without a rapidly scalable data store is next to useless. So, to provide
customers with a scalable place to store application data, vendors have had to implement
a new type of database system that focuses obsessively on scalability, at the expense of
the other benefits that come with relational databases. Next, we will look at these new,
non-relational databases in more detail.

Non-Relational Databases
One of the most severe limitations of relational databases is that each item can only
contain one attribute. In the bank example we looked at before, each aspect of a
customer’s relationship with a bank must (or rather is best) stored as separate row items
in separate tables. The customer’s master details are in one table, the account details are
in another table, the loan details in yet another, investments in a different table, and so
on. All these tables are linked to each other through the use of relations, or primary keys
and foreign keys.
Non-relational databases, specifically a database’s key-value stores or key-value pairs,
are radically different from this model. Key-value pairs allow you to store several related
items in one "row" of data in the same table. We place the word "row" in quotes because
a row here is not really the same thing as the row of a relational table or spreadsheet,
although it is still useful to call it that for comparison’s sake. For instance, in a non-
relational table for the same bank, each row would contain the customer’s details as well
as their account, loan and investment details. All data relating to one customer would be
conveniently stored together as one record.
This seems an obviously superior method of storing data. So why on earth would we still
use relational databases over this model? Well, the problem with key-value storesis that,
unlike relational databases, they cannot enforce relationships between data items. For
instance, in our key-value database, the customer details (name, social security, address,
account number, loan processing number, etc.) would all be stored as one data record
(instead of being stored in several tables, as in the relational model). The customer’s
transactions (account withdrawals, account deposits, loan repayments, bank charges,
etc.) would also be stored as another single data record.
In the relational model, there is an inbuilt and foolproof method of ensuring and enforcing
business logic at the database layer, for instance that a withdrawal is charged to the
correct bank account. In key-value stores, this responsibility falls squarely on application
logic. Many people are very jittery about leaving this crucial responsibility just to the
application, which is why relational databases are not going away any time soon.
However, when it comes to Web-based databases, the aspect of rigorously enforcing
business rules is often far down the list of priorities. Topmost on the list is the ability to
service large numbers of user requests, typically read-only queries. For instance, on a
site like the giant online auction house eBay.com, the majority of users simply browse
and look through posted items (read-only operations). Only a fraction of these users
actually place bids or reserve the items (read-write operations). And remember, we are
talking about millions, sometimes billions, of page views per day. Such a site’s owners
are more interested in quick response time to ensure faster page loading for the site’s
users, rather than the traditional priorities of enforcing business rules or ensuring a
balance between reads and writes. You can extrapolate this for other large well-known
sites on the Web - Google, Facebook, Amazon, Twitter and so on.
Relational-model databases can be tweaked and set up to run large-scale read-only
operations (through data warehousing and data marts), and thus potentially still serve
such customers. However, the real challenge is the relational model’s lack of scalability,
as we noted earlier. This is where non-relational models can really shine. They can easily
distribute their data loads across dozens, hundreds and in extreme cases (think Google
search) even thousands of servers. With each server handling only a small percentage
of the total requests from users, response time is very good for each individual user.
Although this distributed computing model also exists for relational databases, it is a real
pain to implement. This is because the relational model insists on data integrity at all
levels, and this must be maintained, even as the data is accessed and modified by several
different servers. This is the reason for the non-relational model as the architecture of
choice for Web 2.0 applications such as cloud-computing and social networking.
Some examples of non-relational databases that power well-known sites are Amazon’s
SimpleDB and Google Search’s BigTable. Other non-relational engines, either open-
source or commercially available, are CouchDB, Mongo, Drizzle and the unusually named
Project Voldemort.

Databases vs. Data Warehouses


A data warehouse is a special type of database optimized for querying, reporting and
analysis. The data in the warehouse is almost always read-only and typically originates
from the operational database and other systems. It is then set up to upload at periodic
intervals to the warehouse using extraction, transformation and loading (ETL) processes
to turn it into a form more suitable for reporting and deeper analysis. The exact details of
this transformation are complex, so we will not delve into them here. The main benefit of
reporting using data warehouses, as opposed to the organization’s transactional
databases, is that warehouses allow much better and more fine-grained data analysis for
business consumption. Use of a data warehouse also removes the reporting load from
the main transactional system, which can get very taxing for large organizations,
especially at period-closing times such as end-of-quarter or end-of-year reporting. In
addition, data warehouses also keep a clear and complete history of all data history, even
though the transactional system may not offer this ability.
A good example of the usefulness of data warehousing is in the financial-services
industry. Banks typically exploit their data warehousing capability to carry out in-depth
forecasting and planning. They use patterns and trend analysis to answer such questions
as:

 What are the client types most likely to default on loans? Is there any relationship between
a client’s income, credit repayment pattern and ATM withdrawals, and his or her loan
repayment ability?
 What is the absolute best time of year to offer financing to, say, commercial farmers? Do
they typically require financial assistance just before or just after harvest time, or halfway
through the planting cycle?
 Why do so many college and university students not maintain their accounts with us after
graduation and getting jobs? What features or benefits are they getting from other banks
that we do not offer?

Data warehouses are also useful in other industries, such as insurance fraud analysis
and pattern matching, telecom call record analysis, agriculture weather and climate
forecasting, and many others.
A good example of the power of the analysis offered by data warehouses is the well-
known story of a large retail store operator. After analyzing data from their data
warehouse, the store managers noticed a baffling trend: late-night diaper and baby-food
sales also corresponded to a spike in the increase of beverages, especially canned beer.
After some head-scratching and hours spent scrutinizing these late-night customers, the
answer finally revealed itself. What was happening was that recent parents, especially
young fathers, were sent out by their wives or girlfriends to purchase diapers for the baby,
and the stressed-out young dads also decided that while they were at it, they may as well
grab a drink to help them relax. The retail store then shrewdly placed both classes of
items closer together in the aisles to capture even more of this market, allowing them to
triple their sales of alcoholic beverages after 9 p.m.
As you can see, data warehouses are distinct from typical databases in that they are used
for more complex analysis of data. This differs from the transactional database, whose
main use is to support operational systems and offer day-to-day, small scale reporting.
Data warehousing sometimes also needs to work closely with market research and other
departments.

Other Important Database Concepts


We have discussed the fundamental database concepts, but these are not the only ones.
There are also other important secondary concepts and data structures worth learning
about. In this section, we’ll take a brief look at these.

Indexes
An index in an RDBMS is a data structure that works closely with tables and columns to
speed up data retrieval operations. It works a lot like the index at the beginning of a book.
In other words, it provides a reference point that allows you to quickly find and access the
data you want without having to traverse the entire book (database).

Schema
A schema is the structure behind data organization. It is a visual overview of how different
tables are related to each other. This serves to map out and implement the underlying
business rules for which the database is created.
The Oracle DB has a somewhat different definition of schema. Here, schema refers to a
user’s collection of database objects. The schema name and username are the same but
function quite distinctly; i.e., a user may be deleted while his collection of objects (schema)
within the database remains intact, and can even be reassigned to another user.
See a visual example of a simple database schema below:
Normalization
Normalization is the process of (re)organizing data in a database so that it meets two
basic requirements: there is no data redundancy (all data is stored in only one place), and
data dependencies are logical (all related data items are stored together). For instance,
for a bank’s database all customer static data, such as name, address and age, should
be stored together. All account information, such as account holder, account type,
account branch and so on, should also be stored together; it should also be stored
separately from the customer static data.
Normalization is important for many reasons, but chiefly because it enables databases to
take up as little disk space as possible, resulting in increased performance. There are
several incremental types of normalization, and they can get somewhat complex.
Constraints
In the RDBMS world, constraint refers to the exact same thing as in the real world. A
constraint is a restriction on the type of data you can input into a certain column.
Constraints are always defined on columns. A common constraint is the not-null
constraint. It simply specifies that all rows in a table must have a value in the column
defined as not null.

Transactions (Commits and Rollbacks)


Think of a bank building out its database systems. Imagine if it crashed right as a wire
transfer was in progress. Big problems, right? This is the basic idea behind a transaction:
All items in a series of changes need to be made together. In the case of a simple transfer,
if you debit one account, you need to credit anther account.
In relational databases, saving a transaction is known as a commit, and undoing any
unsaved changes is known as a rollback. That is the basic definition, but it gets more
complicated when you consider that databases typically have to serve several users
simultaneously. Before the transactions is saved, what happens when other users query
the same data? At what point do the other users see the saved data? All RDBMSs must
be capable of satisfactorily answering these questions, and they do this through the
commit/rollback features.
Databases must also provide fault tolerance, even in case of disk failure. When data is
committed, there must be a good-as-gold guarantee that the data is actually saved.
Relational databases have ingenious ways of achieving this, such as two-phase commits
and use of log files.

ACID
The term ACID here does not refer to psychedelic trip-inducing substances that help
DBAs work better. Rather, it is an acronym describing four highly desirable properties of
any RDBMS:

 Atomicity: This refers to a database’s ability to either fully process or fully roll back a
transaction.
 Consistency: The database should ensure that all data written therein follows all rules
and constraints specified in the database.
 Isolation: Transactions must be processed securely and independently, without
interfering with each other.
 Durability: The database must ensure that all committed transactions are saved
permanently and cannot be accidentally erased, even in a database crash.

Database Locking
As we have seen before, databases allow multiple users to simultaneous access a set of
data. So an inevitable issue that arises is how to handle situations in which two or more
users want to access or update the same piece of data. RDBMSs solve this issue by the
use of locks.
There are different types of locks, such as transactional or data-level locks that simply
lock down a single data field; row-level locks that lock down an entire record of data, while
table-level locks restrict access to a whole table. You can see that the last lock type is
especially restrictive. As such, such it is employed only when absolutely necessary, such
as when performing table-wide data loads or table-wide backups.
By the way, spreadsheets can only employ table-level locks. This is a prime example of
their inferiority when compared to databases. If you work in an office environment you
have most likely encountered it when working on a shared Excel spreadsheet and you
get the annoying "file is locked by another user" message. Locks are a critical way of
implementing the ACID (Atomicity, Consistency, Isolation, Durability) properties of an
RDBMS.
The most common type of lock is the data-level or transactional lock. It usually happens
completely invisibly to the user. When a user updates (but not necessarily commits) a
particular data item, it is then locked down to prevent other users from making any other
changes to it. However, the other user can still read the same data item. This is a neat
feature of RDBMSs. At this time, the other users will only see the "old" value of the data
item that existed before the changes. Once the main user commits (saves) the change,
all users who now query the data will see the new, updated value. Keep in mind that the
term "new value" could even include a deletion of the data that was there, in which case
the field is now a null. If the user instead abandons the change and performs a rollback,
the other users will still see the very same data, but they can now also update the data
item.
If a user tries to change an item that is already locked down by a first user, the second
user will either receive an error message or the system will just hang for a short time while
waiting for the first user to either commit (save) the change or roll it back (undo it).
A curious situation may arise when two users have each locked a data item, then both
attempt to access and make a change to the exact data item held by the other user. In
this case, there is the potential to lock the item indefinitely while each user waits for the
other. This is called a deadlock, and most RDBMSs have a way of detecting and resolving
them. The system will usually choose one user at random and roll back his or her
changes, thus freeing up the data locked by that user and ending the deadlock.
You can surmise that locking is an expensive activity in terms of database and computer
resources. Systems that host write-intensive databases need to be robust enough to
support the resources expended by locking, especially the CPU. Applications also need
to be well designed so as to lock data as infrequently as possible. One way of doing this
is by building data retention capability into the application itself. Once a data item is read
from the database, the application should then perform all changes internally without first
interacting with the database. Only when the user clicks "Save" does the application
quickly write the item back to the database. This minimizes the amount of time that, from
the database point of view, a data item will be in a state of limbo of "changed" but not yet
"committed" or "rolled back".
Table-level locks, as we have mentioned, are very restrictive because they lock down an
entire table. Although they are rarely used, they are invaluable in some operations such
as bulk uploads of data or when reorganizing fragmented data for an entire table. In such
cases, it is convenient to ensure that no other users can access the table for the duration
of the operation, and the lock should be released immediately once the operation is over.
As such, these locks are typically only employed by DBAs during these special
operations. Table-level locks usually have to be explicitly specified; by default the RDBMS
employs row-level or transactional-level locks.

Commercial RDBMS Systems


We have now covered most of the basics of relational databases. They are the most
common database type out there, in addition to being one of the most important types of
software, right up there with operating systems, office productivity and games. So it will
come as no surprise to hear that the large, successful RDBMS vendors are some of the
titans of the software industry, and are also household names not only in the software
world, but even in mainstream news and culture. Names like Oracle and Microsoft are
ubiquitous and very familiar, even to non-IT types.

Many RDBMS vendors also churn out both related and completely unrelated products.
But a common thread for all of them is that the RDBMS is one of their most crucial product
lines. They listen closely and work to gather feedback from the marketplace, which does
not always happen in the software universe.
However, for strategic business reasons, many of their products do not work well with
competitors’ offerings, or with other software. For instance SQL Server from Microsoft is
only available for the Windows operating system, which is also from Microsoft. There have
been complaints that Oracle DB does not mesh as well with the Windows operating
system as it does with Linux, and so on.
An increasing trend in the industry is consolidation and bundling of the RDBMS with other
software from the same manufacturer, such as a preferred operating system or other
complementary software such as:

 Additional data security modules, such as Oracle DataGuard,


 Integration with a development platform, such as Microsoft’s .NET,
 All-in-one platforms combining hardware and software, such as Oracle’s Exadata or
Microsoft’s Trefis.
We will now take a quick look at some commercial RDB offerings and the companies behind
them:

Oracle
Oracle is one of the behemoths of the RDBMS world. Founded by the charismatic,
adventure-loving CEO Larry Ellison in 1977, today the company is a multibillion dollar
giant in the world of commercial databases thanks to its flagship product, Oracle DB.
Oracle also produces a bewildering array of other products,
from middleware to enterprise resource planning systems and customer relationship
management (CRM) offerings. Many of these were not developed in-house, but instead
came through acquisitions of other software companies such as PeopleSoft, Siebel
Systems, Sun Microsystems and BEA Systems. Several of these products were
integrated, with mixed success, into the Fusion middleware product.
The Oracle DB is widely used in enterprise-level databases. It comes in different editions
to meet different needs. Oracle DB is fully compliant with the SQL language, although it
also maintains its proprietary version called SQL*Plus. Oracle DB is the leading RDBMS,
with a market share of 48.8 percent of the RDBMS market as at end of 2011.
An interesting point to note about Oracle is that in 2009, it acquired Sun Microsystems,
the license holder of MySQL, one of Oracle DB’s key competitors. As a result, Oracle has
two RDBMS offerings. However, Oracle DB and MySQL may not interfere destructively
with each other, as they play in slightly different market spaces and cater to slightly
different needs.

Microsoft
Microsoft is another big boy in the world of RDBMS software with its Server product,
although it is better known for its universal Windows operating system and Office suite of
office-productivity programs.
SQL Server was developed in conjunction with Sybase systems around 1989, but the two
companies parted ways and developed separate products. Microsoft kept the SQL Server
name and Sybase opted to rename its offering Adaptive Server Enterprise to avoid
confusion with Microsoft’s SQL Server. The RDBMS only runs on the Windows range of
operating systems.
SQL Server uses a proprietary query language called T-SQL, which is very similar to and
compatible with the standard SQL. The RDBMS commands about 20 percent of market
share as of the end of 2011, but has also been increasing its share in recent years.
SQL Server and Oracle DB have a lot in common, ranging from the data structures to
transaction processing methods and database objects. Like Oracle DB, SQL Server also
supports advanced ETL (Extraction, Transformation, Loading) operations, which help in
moving data to data warehouses. Both also offer advanced reporting functionality.
Postgres
Postgres, also known as PostgreSQL, is an open-source relational database that can also support
database objects. It is not owned by any one person, but is maintained by the PostgreSQL Global
Development Group, a dedicated group of volunteers managed and employed by companies in
the open-source software-development field, such as RedHat and EnterpriseDB.
Postgres is available on Linux, Windows and MacOS.

MySQL
MySQL is another open-source RDBMS. It is a full-featured database system sponsored by
Swedish company MySQL AB, which is now owned by Oracle after its parent company Sun
Microsystems was bought out by Oracle in 2010.
MySQL is very popular for Web-based, back-end databases, either individually or as part of the
Linux, Apache, MySQL, PHP (LAMP) stack used to deliver Web-centric applications.

DB2
Like SQL Server and Oracle DB, DB2 from IBM is a full-featured object RDBMS from a major
player in the software industry. Originally developed in the early '80s exclusively for IBM’s
mainframes, it was later ported to other platforms, such as Linux, Unix, Windows (LUW) and IBM’s
own OS/2.

It is a widely used commercial RDBMS, and also has a small free version for developers called
Express-C. In 2009, IBM released version 9.7 of DB2, which closely mimics the features of Oracle
DB, the market leader. This has helped it capture some sales by making it easy for Oracle-savvy
database professionals to easily understand and start working on DB2.

Conclusion
Congrats - you've made it this far!
We've literally just scratched the surface. After all, many people devote an entire professional
career to databases; there is a lot to learn! Let's recap what we've covered here:

 A database, in the most general sense, is an organized collection of data. More


specifically, a database is an electronic system that allows data to be easily accessed,
manipulated and updated.

 Any business or organization that needs to keep track of large numbers of customers or
products can benefit from a database, but large organizations stand to gain the most.

 The earliest database systems were navigational in nature. This means that applications
processed and read data by using pointers embedded in the data itself.
 The early hierarchical and network data models were bested by E.F. Codd's relational
model.

 The relational model was a radical departure from the reigning hierarchical model in that
it focused on the ability to search a database by content rather than by following a linked
navigation system.

 There are a few very important non-relational databases (especially with the advent of big
data and Web 2.0), but the relational model is still used for the overwhelming majority of
commercial database offerings.

 A relational database is essentially a group of tables or, to use the technical name, entities
(refer to rules 0 and 1 in Codd’s 12 Rules of Relational Databases). Each table is made
up of rows (tuples) and columns (attributes). The tables have relationships between them
that are defined as using a certain column in one table that references a column in another
table.

 There are 13 "rules" that determine whether a database can be called "relational".

 The table is the basic data-storage unit in a relational database. Tables consist of columns
and rows.

 Relationships are THE reason why relational databases work so well.

 In relational databases, a relationship exists between two tables when one of them has a
foreign key that references the primary key of the other table.

 A row, also called a record, represents a set of data about a specific item. Every record in
a table has exactly the same structure, but of course different data.

 A column is a specific set of values in a table of the same type. It defines a specific attribute
of the table or data.

 A primary key is a special column or combination of columns that uniquely identifies each
record (row) in the table. The primary key column must be unique for each row, and must
not contain any nulls (non-values).

 The primary key, together with the closely related foreign key concept, are the main way
in which relationships are defined. A primary key uniquely defines a record, while a foreign
key is used to reference the same record from another table.

 Structured Query Language (SQL) is the de facto language used for the management and
manipulation of data in relational databases. SQL can be used to query, insert, update
and modify data.

 Spreadsheets and databases have some similar capabilities, but the spreadsheet has a
number of limitations that make it unsuitable for managing some data situations.

 One of the most severe limitations of relational databases is that each item can only
contain one attribute. Non-relational databases, specifically a database’s key-value stores
or key-value pairs, are radically different from this model. Key-value pairs allow you to
store several related items in one "row" of data in the same table.

 A data warehouse is a special type of database optimized for querying, reporting and
analysis. The main benefit of reporting using data warehouses, as opposed to the
organization’s transactional databases, is that warehouses allow much better and more
fine-grained data analysis for business consumption.

 An index in an RDBMS is a data structure that works closely with tables and columns to
speed up data retrieval operations.

 A schema is the structure behind data organization. It is a visual overview of how different
tables are related to each other.

 Normalization is the process of (re)organizing data in a database so that it meets two basic
requirements: there is no data redundancy (all data is stored in only one place), and data
dependencies are logical (all related data items are stored together).

 In the RDBMS world, constraint refers to the exact same thing as in the real world. A
constraint is a restriction on the type of data you can input into a certain column.

 In relational databases, saving a transaction is known as a commit, and undoing any


unsaved changes is known as a rollback.

 ACID is an acronym for Atomicity Consistency Isolation Durability, the four highly desirable
properties of an RDBMS.

 Oracle is one of the behemoths of the RDBMS world. Its flagship product is Oracle DB.

 Microsoft is another big boy in the world of RDBMS software with its SQL Server product,
although it is better known for its universal Windows operating system and Office suite of
office-productivity programs.

 Other commercial RDBMS systems include Postgres, MySQL and DB2.


Next: Conclusion
Share this:

Table of Contents

Introduction

What Is a Database?

History of Databases

The Relational Database

Fundamental Database Concepts

Other Types of Databases

Other Important Database Concepts

Commercial RDBMS Systems
 

You might also like