
UD 1.

AN INTRODUCTION TO DATABASES

1.1 DATA STORAGE
1.2 DATA FILES
1.3 DATABASES
1.4 DATABASE MANAGEMENT SYSTEMS (DBMS)
1.5 RELATIONAL DATABASE
1.6 THE CAP THEOREM
1.7 NON-RELATIONAL DATABASE
1.8 NON-RELATIONAL vs RELATIONAL DATABASE
1.9 BIG DATA (2023)

#1 DATA STORAGE #2 DATA FILES #3 DATABASES #4 DBMS #5 RELATIONAL #6 CAP #7 NON-RELATIONAL #8 RELATIONAL VS NON-RELATIONAL #9 BIG DATA

1.1 DATA STORAGE

Data storage is the recording (storing) of information (data) in a storage
medium.
Handwriting, phonographic recording, magnetic tape, and optical discs are all
examples of storage media.

Biological media include RNA and DNA.


Electronic data storage requires electrical power to store and retrieve data.

Data storage in a digital, machine-readable medium is sometimes called digital
data. Computer data storage is one of the core functions of a general-purpose
computer.

Electronic documents can be stored in much less space than paper documents.

Barcodes and magnetic ink character recognition (MICR) are two ways of
recording machine-readable data on paper.
Data, information, knowledge, and wisdom are closely related concepts, but each has its own role in relation
to the others, and each term has its own meaning.

According to a common view,

❏ Data is collected and analyzed.
❏ Data only becomes information suitable for making decisions once it has been analyzed in some
fashion.
❏ Knowledge is the understanding based on extensive experience dealing with information on a
subject.
❏ Wisdom refers to the practical application of a person's knowledge in those circumstances where
good may result.

Data is often assumed to be the least abstract concept, information the next
least, and knowledge the most abstract.

In this view, data becomes information by interpretation;

e.g.,
❏ the height of Mount Everest is generally considered "data",
❏ a book on Mount Everest geological characteristics may be
considered "information",
❏ and a climber's guidebook containing practical information on the
best way to reach Mount Everest's peak may be considered
"knowledge".
❏ The practical climbing of Mount Everest's peak based on this
knowledge may be seen as "wisdom".
Biological data           Data maintenance    Data integrity
Computer data processing  Data management     Data warehouse
Computer memory           Data mining         Database
Dark data                 Data modeling       Datasheet
Data acquisition          Data point          Environmental data rescue
Data analysis             Data preservation   Fieldwork
Data bank                 Data protection     Information engineering
Data cable                Data publication    Machine learning
Data curation             Data remanence      Open data
Data domain               Data science        Scientific data archiving
Data element              Data set            Secondary Data
Data farming              Data structure      Statistics
Data governance           Data visualization


1.2 DATA FILE


A data file is a computer file which stores data to be used by a computer application or system, including input and
output data. A data file usually does not contain instructions or code to be executed (that is, a computer program).

Data files can be stored in two ways:

❏ Text files
A text file (also called an ASCII file) stores information as ASCII characters.
A text file contains human-readable characters. A user can read the contents of a text file or edit it using a text editor.
In text files, each line of text is terminated (delimited) by a special character, or character sequence, known as the EOL (End of Line) character.
Example of a text file: a text document (often .txt)

❏ Binary files
A binary file is a file that contains information in the same format in which the information is held in memory. In a binary
file, there is no delimiter for a line, and no translations occur. As a result, binary files are faster and easier
for a program to read and write than text files.
As long as the file doesn't need to be read by people or ported to a different type of system, binary files are the best
way to store program information. Example of a binary file: a JPEG image (.jpg or .jpeg)
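
The difference between the two storage styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, the sample values in `readings`, and the choice of little-endian 32-bit integers as the "in-memory" layout are all hypothetical.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical sample data: three integer readings.
readings = [1, 256, 1000000]

with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "readings.txt"
    binary_path = Path(tmp) / "readings.bin"

    # Text file: one human-readable number per line, each line ended by an EOL.
    with open(text_path, "w", encoding="ascii", newline="\n") as f:
        for value in readings:
            f.write(f"{value}\n")

    # Binary file: three little-endian 32-bit integers, no line delimiters.
    binary_path.write_bytes(struct.pack("<3i", *readings))

    text_bytes = text_path.read_bytes()
    binary_bytes = binary_path.read_bytes()

    # Reading back: the text must be parsed, the binary data is unpacked directly.
    from_text = [int(line) for line in text_path.read_text(encoding="ascii").splitlines()]
    from_binary = list(struct.unpack("<3i", binary_bytes))

print(text_bytes)          # readable characters plus EOL markers
print(binary_bytes.hex())  # raw bytes, not meaningful in a text editor
```

Both files round-trip to the same list, but only the text file can be inspected in an editor, and only the binary file depends on an agreed byte layout.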
Data file categories

❏ Closed data file formats, frequently referred to as proprietary format files,
have their metadata data elements hidden, obscured or unavailable to users of the file.
Application developers do this to discourage users from tampering with or corrupting the data files or
importing the data into a competitor's application.

❏ Open data format files
have their internal structures available to users of the file through a process of metadata publishing.
Metadata publishing implies that the structure and semantics of all the possible data elements within a file are
available to users.

Examples of open data files include CSV, XLS and XML formats such as HTML for storing web pages or SVG for storing scalable graphics.

Serialization

In computing, serialization is the process of translating a data structure or
object state into a format that can be stored (for example, in a file or
memory data buffer) or transmitted (for example, over a computer network)
and reconstructed later (possibly in a different computer environment).
When the resulting series of bits is reread according to the serialization
format, it can be used to create a semantically identical clone of the
original object.

This process of serializing an object is also called marshalling an object in some situations.
The opposite operation, extracting a data structure from a series of bytes, is deserialization (also called unserialization or
unmarshalling).
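
As a minimal sketch of this round trip, the example below uses Python's standard `json` module as the serialization format. The `movie` object and its fields are hypothetical; JSON is just one of many possible formats.

```python
import json

# Hypothetical in-memory object to be serialized.
movie = {"title": "Example Movie", "year": 1999, "tags": ["drama", "classic"]}

# Serialization (marshalling): object -> text that can be stored or transmitted.
payload = json.dumps(movie)

# Deserialization (unmarshalling): text -> a new, equivalent object.
clone = json.loads(payload)

print(clone == movie)  # True: same content
print(clone is movie)  # False: a separate object was reconstructed
```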

1.3 DATABASE
In computing, a database is an organized collection of data
stored and accessed electronically. Small databases can be
stored on a file system, while large databases are hosted on
computer clusters or cloud storage.

The design of databases spans formal techniques and practical
considerations, including data modeling, efficient data
representation and storage, query languages, security and
privacy of sensitive data, and distributed computing issues,
including supporting concurrent access and fault tolerance.

A database management system (DBMS) is the software that
interacts with end users, applications, and the database itself to
capture and analyze the data. The DBMS software additionally
encompasses the core facilities provided to administer the
database.

History (1/2)

The sizes, capabilities, and performance of databases and their
respective DBMSs have grown by orders of magnitude. These
performance increases were enabled by technological progress in the
areas of processors, computer memory, computer storage, and computer
networks.

The concept of a database was made possible by the emergence of direct
access storage media such as magnetic disks, which became widely
available in the mid-1960s; earlier systems relied on sequential storage
of data on magnetic tape.

The relational model, first proposed in 1970 by Edgar F. Codd, departed
from the earlier navigational tradition, in which applications followed
links between records, by insisting that applications should search for
data by content. The relational model employs sets of ledger-style tables,
each used for a different type of entity.

History (2/2)

Only in the mid-1980s did computing hardware become powerful
enough to allow the wide deployment of relational systems (DBMSs plus
applications).

By the early 1990s, however, relational systems dominated in all
large-scale data processing applications, and as of 2018 they remain
dominant: IBM Db2, Oracle, MySQL, and Microsoft SQL Server are the
most searched DBMSs.

The next generation of post-relational databases in the late 2000s
became known as NoSQL databases, introducing fast key–value stores
and document-oriented databases. A competing "next generation"
known as NewSQL databases attempted new implementations that
retained the relational/SQL model while aiming to match the high
performance of NoSQL compared to commercially available relational
DBMSs.
Database languages

Database languages are special-purpose languages that support one
or more of the following tasks, sometimes distinguished as
sublanguages:

❏ Data control language (DCL) → controls access to data;
❏ Data definition language (DDL) → defines data structures, for
example by creating, altering, or dropping tables and the
relationships among them;
❏ Data manipulation language (DML) → performs tasks such as
inserting, updating, or deleting data occurrences;
❏ Data query language (DQL) → allows searching for information and
computing derived information.
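
The sublanguages above can be sketched with SQL statements run through Python's built-in `sqlite3` module. The `movie` table and its rows are hypothetical, and SQLite is used only because it needs no server; it has no user accounts, so the DCL statement appears only as a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: define the structure of the data.
conn.execute(
    "CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER)"
)

# DML: insert, update, and delete data occurrences.
conn.executemany(
    "INSERT INTO movie (title, year) VALUES (?, ?)",
    [("Alpha", 1999), ("Beta", 2005), ("Gamma", 2010)],
)
conn.execute("UPDATE movie SET year = 2006 WHERE title = 'Beta'")
conn.execute("DELETE FROM movie WHERE title = 'Gamma'")

# DQL: search for information and compute derived information.
titles = [row[0] for row in conn.execute("SELECT title FROM movie ORDER BY title")]
newest_year = conn.execute("SELECT MAX(year) FROM movie").fetchone()[0]

# DCL: SQLite has no user accounts, so a statement such as
#   GRANT SELECT ON movie TO some_user;
# is shown here only as a comment; server DBMSs generally support it.

print(titles, newest_year)
conn.close()
```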

Storage

Database storage is the container of the physical materialization of a
database. It comprises the internal (physical) level in the database
architecture. It also contains all the information needed (e.g.,
metadata, "data about the data", and internal data structures) to
reconstruct the conceptual level and external level from the internal
level when needed.

Databases as digital objects contain three layers of information which must be stored: the data, the structure, and the
semantics.
Proper storage of all three layers is needed for future preservation and longevity of the database. Putting data into
permanent storage is generally the responsibility of the database engine, also known as the "storage engine".

Security (1/2)

Database security deals with all the various aspects of protecting the
database content, its owners, and its users. It ranges from protection
from intentional unauthorized database uses to unintentional
database accesses by unauthorized entities (e.g., a person or a
computer program).

Database security concerns the use of a broad range of information
security controls to protect databases (potentially including the data,
the database applications or stored functions, the database systems,
the database servers and the associated network links) against
compromises of their confidentiality, integrity and availability. It
involves various types or categories of controls, such as technical,
procedural/administrative and physical.
Security (2/2)

Many layers and types of information security
control are appropriate to databases, including:

❏ Access control
❏ Auditing
❏ Authentication
❏ Encryption
❏ Integrity controls
❏ Backups
❏ Application security

Security risks to database systems include, for example:

❏ Unauthorized or unintended activity or misuse by authorized database users, database administrators, or network/systems managers, or
by unauthorized users or hackers;
❏ Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data, deletion of or
damage to the data or programs, interruption or denial of authorized access to the database, and attacks on other systems;
❏ Overloads, performance constraints and capacity issues resulting in the inability of authorized users to use databases as intended;
❏ Physical damage to database servers caused by computer room fires or floods, overheating, lightning, accidental liquid spills, static
discharge, electronic breakdowns/equipment failures and obsolescence;
❏ Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g.
unauthorized privilege escalation), data loss/corruption, performance degradation etc.;
❏ Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in database or system administration processes,
sabotage/criminal damage etc.


1.4 DATABASE MANAGEMENT SYSTEMS


Connolly and Begg define a database management system (DBMS) as a "software system that enables users to define, create,
maintain and control access to the database".

Examples of DBMSs include MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, and Microsoft Access.

The DBMS acronym is sometimes extended to
indicate the underlying database model, with RDBMS
for the relational, OODBMS for the object (oriented)
and ORDBMS for the object–relational model.

Other extensions can indicate some other
characteristics, such as DDBMS for distributed
database management systems.

The functionality provided by a DBMS can vary enormously. The core
functionality is the storage, retrieval and update of data.

Codd proposed the following functions and services a fully-fledged
general purpose DBMS should provide:

❏ Data storage, retrieval and update
❏ User accessible catalog or data dictionary describing the
metadata
❏ Support for transactions and concurrency
❏ Facilities for recovering the database should it become damaged
❏ Support for authorization of access and update of data
❏ Access support from remote locations
❏ Enforcing constraints to ensure data in the database abides by
certain rules
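
Two of these services, transaction support and constraint enforcement, can be sketched together with Python's built-in `sqlite3` module. The `account` table, the names, and the `transfer` helper are hypothetical; the sketch relies on `sqlite3` connections acting as context managers that commit on success and roll back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move `amount` between two accounts as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; both updates are undone.
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would leave alice with a negative balance
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.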
A DBMS is also generally expected to provide a set of utilities needed to administer the database effectively,
including import, export, monitoring, defragmentation and analysis utilities.

The core part of the DBMS, which mediates between the database and the application interface, is sometimes referred to as the
database engine.

Often DBMSs will have configuration parameters that can be statically and dynamically tuned, for example the maximum
amount of main memory on a server the database can use. The trend is to minimize the amount of manual configuration, and for
cases such as embedded databases the need to target zero-administration is paramount.

❏ A general-purpose DBMS will provide public application programming interfaces (API) and optionally a processor for
database languages such as SQL to allow applications to be written to interact with and manipulate the database.
❏ A special-purpose DBMS may use a private API and be specifically customized and linked to a single application.
For example, an email system performs many of the functions of a general-purpose DBMS, such as message insertion, message deletion, attachment handling,
blocklist lookup, associating messages with an email address, and so forth; however, these functions are limited to what is required to handle email.


RELATIONAL vs NON-RELATIONAL


1.5 RELATIONAL DATABASE


A relational database is a database based on the relational model of
data, as proposed by E. F. Codd in 1970.

A system used to maintain relational databases is a relational database
management system (RDBMS).

Many relational database systems are equipped with the option of
using the SQL (Structured Query Language) for querying and
maintaining the database.

The most common definition of an RDBMS is a product that presents a
view of data as a collection of rows and columns, even if it is not based
strictly upon relational theory.

In a relational database, a relation is a set of tuples that have the same attributes.

A tuple usually represents an object and information about that object. Objects are typically physical objects or concepts.

A relation is usually described as a table, which is organized into rows and columns.

All the data referenced by an attribute are in the same domain and conform to the same constraints.

The relational model specifies that the tuples of a relation have no specific order and that the tuples, in turn, impose no order on the
attributes.

Relations can be modified using the insert, delete, and update operators.
New tuples can supply explicit values or be derived from a query. Similarly, queries identify tuples for updating or deleting.

Tuples by definition are unique. If the tuple contains a candidate or primary key then obviously it is unique; however, a primary key need not
be defined for a row or record to be a tuple. The definition of a tuple requires that it be unique, but does not require a primary key to be
defined. Because a tuple is unique, its attributes by definition constitute a superkey.
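
The set-of-tuples view can be sketched directly in Python. The `Student` type and its rows are hypothetical; note that a `NamedTuple` still fixes the order of attributes, which the relational model does not, so the sketch covers only tuple uniqueness and the absence of tuple order.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """Hypothetical tuple type: every tuple in the relation has these attributes."""
    student_id: int
    name: str


# A relation modeled as a set of tuples that share the same attributes.
relation_a = {Student(1, "Ana"), Student(2, "Luis")}

# Same tuples listed in another order, with one repeated.
relation_b = {Student(2, "Luis"), Student(1, "Ana"), Student(1, "Ana")}

same_relation = relation_a == relation_b  # tuple order does not matter
cardinality = len(relation_b)             # duplicates collapse: tuples are unique

print(same_relation, cardinality)  # True 2
```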
Domain

A domain describes the set of possible values for a given attribute,
and can be considered a constraint on the value of the attribute.

Constraints

Constraints are often used to make it possible to further restrict the
domain of an attribute. For instance, a constraint can restrict a given
integer attribute to values between 1 and 10.
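
The "between 1 and 10" restriction can be sketched as a SQL `CHECK` constraint, run here through Python's built-in `sqlite3` module. The `review` table and `rating` column are hypothetical names chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE review (
        id     INTEGER PRIMARY KEY,
        rating INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 10)
    )
""")

conn.execute("INSERT INTO review (rating) VALUES (7)")  # inside the allowed range

try:
    conn.execute("INSERT INTO review (rating) VALUES (11)")  # outside the range
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = [row[0] for row in conn.execute("SELECT rating FROM review")]
print(rejected, stored)  # True [7]
```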

Domain Constraints

Constraints can apply to single attributes, to a tuple (restricting combinations of attributes) or to an entire relation. Since every
attribute has an associated domain, every attribute is subject to constraints (domain constraints). The two principal rules for the
relational model are known as entity integrity and referential integrity.

Domain constraints are user-defined rules on a column that restrict the values entered to the column's data type. If a
value of the wrong type is entered, the DBMS rejects it and reports the error to the user. In other words, a domain constraint
specifies all the possible values that an attribute can hold (integer, character, date, time, string, etc.). It defines the
domain, or set of values, for an attribute and ensures that the value taken by the attribute is an atomic value (one that cannot be
divided) from its domain.

Primary key

Every relation/table has a primary key; this is a consequence of a relation being a set.
A primary key uniquely specifies a tuple within a table. While natural attributes (attributes
used to describe the data being entered) are sometimes good primary keys, surrogate keys
are often used instead.

A surrogate key is an artificial attribute assigned to an object which uniquely identifies it (for
instance, in a table of information about students at a school they might all be assigned a
student ID in order to differentiate them).
The surrogate key has no intrinsic (inherent) meaning, but rather is useful through its ability
to uniquely identify a tuple.

A composite key is a key made up of two or more attributes within a table that (together)
uniquely identify a record.
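
A surrogate key and a composite key can be sketched with Python's built-in `sqlite3` module. The `student` and `enrollment` tables and their rows are hypothetical; the sketch assumes SQLite's behaviour of assigning increasing integers to an `INTEGER PRIMARY KEY` column when no value is supplied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key: student_id has no meaning beyond identifying the row.
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
)

# Composite key: neither column is unique on its own, but the pair is.
conn.execute("""
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL,
        course     TEXT NOT NULL,
        PRIMARY KEY (student_id, course)
    )
""")

# Two students with the same name still get distinct surrogate keys.
conn.executemany(
    "INSERT INTO student (full_name) VALUES (?)", [("Robert Phil",), ("Robert Phil",)]
)
ids = [row[0] for row in conn.execute("SELECT student_id FROM student ORDER BY student_id")]

conn.execute("INSERT INTO enrollment VALUES (1, 'Databases')")
conn.execute("INSERT INTO enrollment VALUES (1, 'Networks')")  # same student, other course
try:
    conn.execute("INSERT INTO enrollment VALUES (1, 'Databases')")  # duplicate pair
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(ids, duplicate_rejected)  # [1, 2] True
```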
Foreign key

A foreign key is a field in a relational table that matches the
primary key column of another table. It relates the two keys.

Foreign keys need not have unique values in the referencing
relation. A foreign key can be used to cross-reference tables, and
it effectively uses the values of attributes in the referenced
relation to restrict the domain of one or more attributes in the
referencing relation.

The concept is described formally as: "For all tuples in the referencing relation
projected over the referencing attributes, there must exist a tuple in the
referenced relation projected over those same attributes such that the values
in each of the referencing attributes match the corresponding values in the
referenced attributes."
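
The rule can be sketched with Python's built-in `sqlite3` module. The `department` and `employee` tables are hypothetical; SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, so the sketch enables it first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department (dept_id)
    )
""")

conn.execute("INSERT INTO department VALUES (10, 'Research')")

# Foreign key values need not be unique: two employees share department 10.
conn.execute("INSERT INTO employee VALUES (1, 'Ana', 10)")
conn.execute("INSERT INTO employee VALUES (2, 'Luis', 10)")

try:
    # There is no department 99, so no matching tuple exists in the referenced relation.
    conn.execute("INSERT INTO employee VALUES (3, 'Eva', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(orphan_rejected, employee_count)  # True 2
```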
Normalization

Normalization was first proposed by Codd as an integral part of the
relational model.

It encompasses a set of procedures designed to eliminate
non-simple domains (non-atomic values) and the redundancy
(duplication) of data, which in turn prevents data manipulation
anomalies and loss of data integrity. The most common forms of
normalization applied to databases are called the normal forms.

Data redundancy → having the same data stored in more than one place

Database Normalization With Examples

Database normalization is easiest to understand through a case study. Assume a video library maintains a
database of movies rented out. Without any normalization, all information is stored in one table, as shown below. Let's
work through the normalization of this example step by step:

Here you can see that the Movies Rented column has multiple values. Now let's move to first normal form (1NF).

1NF (First Normal Form) Rules

❏ Each table cell should contain a single value.
❏ Each record needs to be unique.

In our database, we have two people with the same
name, Robert Phil, but they live in different places.

Hence, we require both Full Name and Address to
identify a record uniquely. That is a composite key.

Let's move to second normal form (2NF).

2NF (Second Normal Form) Rules

❏ Be in 1NF
❏ Have no non-key column that is functionally dependent on only a proper subset of a candidate key
(when every candidate key is a single column, this holds automatically)

It is clear that we can't move our simple database into second normal form unless we
partition the table above.

We have divided our 1NF table into two tables, Table 1 and Table 2. Table 1 contains member
information. Table 2 contains information on movies rented.

We have introduced a new column called Membership_id,
which is the primary key for Table 1. Records in Table 1 can be uniquely
identified using Membership_id.
3NF (Third Normal Form) Rules

❏ Be in 2NF
❏ Has no transitive functional dependencies

What are transitive functional dependencies?

A transitive functional dependency exists when changing a non-key column might cause another non-key column to change.

Consider Table 1: changing the non-key column
Full Name may change Salutation.

Let's move to 3NF.



We have again divided our tables and created a new table which stores Salutations.

There are no transitive functional dependencies, and hence our table is in 3NF
In Table 3, Salutation ID is the primary key; in Table 1, Salutation ID is a foreign key that references the primary key of Table 3.
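
The resulting three-table design can be sketched with Python's built-in `sqlite3` module. The original tables are not reproduced in this text, so the table names, column names other than those mentioned above, and every row value (addresses, movie titles, salutations) are hypothetical placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE salutation (            -- Table 3
        salutation_id INTEGER PRIMARY KEY,
        salutation    TEXT NOT NULL
    );
    CREATE TABLE member (                -- Table 1
        membership_id INTEGER PRIMARY KEY,
        full_name     TEXT NOT NULL,
        address       TEXT NOT NULL,
        salutation_id INTEGER NOT NULL REFERENCES salutation (salutation_id)
    );
    CREATE TABLE rental (                -- Table 2: one movie per row (1NF)
        membership_id INTEGER NOT NULL REFERENCES member (membership_id),
        movie         TEXT NOT NULL,
        PRIMARY KEY (membership_id, movie)
    );
    INSERT INTO salutation VALUES (1, 'Mr.'), (2, 'Ms.');
    INSERT INTO member VALUES (1, 'Robert Phil', 'Hypothetical Street 1', 1),
                              (2, 'Robert Phil', 'Hypothetical Avenue 5', 1);
    INSERT INTO rental VALUES (1, 'Movie A'), (1, 'Movie B'), (2, 'Movie A');
""")

# A join rebuilds the original wide, unnormalized view on demand.
rows = conn.execute("""
    SELECT s.salutation, m.full_name, m.address, r.movie
    FROM rental AS r
    JOIN member AS m ON m.membership_id = r.membership_id
    JOIN salutation AS s ON s.salutation_id = m.salutation_id
    ORDER BY m.membership_id, r.movie
""").fetchall()

for row in rows:
    print(row)
```

Each fact is stored once (a member's address in `member`, a salutation's text in `salutation`), so an update touches a single row while the join still returns the combined view.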

1.6 THE CAP THEOREM


The CAP theorem is a result from theoretical computer science about distributed
data stores. It states that, in the event of a network failure on a distributed
database, it is possible to provide either consistency or availability, but not
both.

The CAP theorem takes its name from three properties of distributed data stores:

❏ Consistency. All reads receive the most recent write or an error.
❏ Availability. All reads contain data, but it might not be the most recent.
❏ Partition tolerance. The system continues to operate despite network
failures (i.e., dropped partitions, slow network connections, or unavailable
network connections between nodes).

In normal operation, your data store provides all three properties. But the CAP theorem maintains that
when a distributed database experiences a network failure, you can provide either consistency or
availability, but not both.
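
The trade-off can be sketched with a toy two-replica store in Python. This is a deliberately simplified model, not how any real database works: the `Replica` and `ToyStore` classes, the "CP"/"AP" mode labels, and the `demo` helper are all hypothetical.

```python
class Replica:
    """One node holding a single value (toy model, not a real data store)."""

    def __init__(self):
        self.value = "v1"


class ToyStore:
    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP": which property is kept during a partition
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        """Writes arrive at replica A and reach B only while the link works."""
        self.a.value = value
        if not self.partitioned:
            self.b.value = value

    def read_from_b(self):
        """Read from replica B, which may have missed writes during a partition."""
        if self.partitioned and self.mode == "CP":
            # Consistency preferred: return an error rather than possibly stale data.
            raise RuntimeError("unavailable: cannot confirm the latest write")
        # Availability preferred: always answer, even if the data is stale.
        return self.b.value


def demo(mode):
    store = ToyStore(mode)
    store.partitioned = True  # simulate the network failure
    store.write("v2")         # B never sees this write
    try:
        return store.read_from_b()
    except RuntimeError as exc:
        return f"error ({exc})"


print(demo("AP"))  # v1: an answer, but not the most recent write
print(demo("CP"))  # error (unavailable: cannot confirm the latest write)
```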

1.7 NON-RELATIONAL DATABASE


When to use a non-relational database

There are many types of non-relational databases, each with
its own advantages and disadvantages.

However, non-relational databases still share some consistent advantages. If
the data you are storing needs to be flexible in terms of shape or size, or if it
needs to be open to change in the future, then a non-relational database is
often a good fit.

Modern NoSQL databases have been designed for the cloud, making them
naturally good for horizontal scaling, where many smaller servers can be spun
up to handle increased load.


1.8 NON-RELATIONAL vs RELATIONAL DATABASE


RDBMS vs NoSQL: Data Modeling Example

Let's consider an example of storing information about a user and their hobbies. We need to store a user's first name, last name,
cell phone number, city, and hobbies.

In a relational database, we'd likely create two tables: one for Users and one for Hobbies.
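
The two modeling styles can be sketched side by side in Python. The relational half uses the built-in `sqlite3` module with hypothetical `users` and `hobbies` tables; the non-relational half is shown as a plain dictionary serialized to JSON, which is roughly how a document store might hold the same record. All names and sample values are assumptions made for illustration.

```python
import json
import sqlite3

# Relational: two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        cell       TEXT,
        city       TEXT
    );
    CREATE TABLE hobbies (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users (id),
        hobby   TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'Ana', 'Example', '555-0100', 'Madrid');
    INSERT INTO hobbies (user_id, hobby) VALUES (1, 'chess'), (1, 'hiking');
""")
hobbies = [
    row[0]
    for row in conn.execute("SELECT hobby FROM hobbies WHERE user_id = 1 ORDER BY hobby")
]

# Document style: the user and their hobbies live in one self-contained record.
user_document = {
    "first_name": "Ana",
    "last_name": "Example",
    "cell": "555-0100",
    "city": "Madrid",
    "hobbies": ["chess", "hiking"],
}
as_json = json.dumps(user_document)

print(hobbies)
print(as_json)
```

The relational version needs a query across two tables to list a user's hobbies, while the document version returns them with the user in a single read, at the cost of the structure being enforced by the application rather than by a schema.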
