
CS7079

Data Warehousing and Big Data
Data vs. Information
• Data:
– Raw facts; building blocks of information
– Unprocessed information
• Information:
– Data processed to reveal meaning
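As a minimal sketch of the distinction above, the snippet below turns a few raw sales records (data) into per-region totals (information). The records, field layout, and the `total_by_region` helper are made up for illustration.

```python
# Raw facts: individual sales records with no interpretation attached.
# The records and their layout (date, region, amount) are hypothetical.
raw_sales = [
    ("2024-01-05", "north", 120.0),
    ("2024-01-06", "south", 80.0),
    ("2024-01-07", "north", 200.0),
]


def total_by_region(rows):
    """Process raw rows into information: total sales per region."""
    totals = {}
    for _date, region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


info = total_by_region(raw_sales)
print(info)
```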
Database and the DBMS

• Database: shared, integrated computer structure that stores:
– End user data (raw facts)
– Metadata (data about data)

• DBMS (database management system):
– Collection of programs that manages database structure and
controls access to data
– Possible to share data among multiple applications or users
– Makes data management more efficient and effective
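The difference between end-user data and metadata can be seen in a small sketch using Python's built-in `sqlite3` module. SQLite is only a convenient stand-in for a DBMS here, and the `customer` table is a hypothetical example.

```python
import sqlite3

# An in-memory SQLite database stands in for "the database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (name) VALUES ('Ada')")

# End-user data: the raw facts stored in the table.
user_rows = conn.execute("SELECT id, name FROM customer").fetchall()

# Metadata: the DBMS's own description of what is stored.
# sqlite_master is SQLite's catalogue of tables, indexes and views.
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
column_info = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]

print(user_rows, table_names, column_info)
```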
Advantages of the DBMS
• End users have better access to more and
better-managed data
– Promotes integrated view of organization’s
operations
– Probability of data inconsistency is greatly reduced
– Possible to produce quick answers to queries
Types of Databases
• Single-user:
– Supports only one user at a time
• Desktop:
– Single-user database running on a personal computer
• Multi-user:
– Supports multiple users at the same time
• Workgroup:
– Multi-user database that supports a small group of users or a single
department
• Enterprise:
– Multi-user database that supports a large group of users or an
entire organization
Can be classified by location:

• Centralized:
– Supports data located at a single site
• Distributed:
– Supports data distributed across several sites
Can be classified by use:

• Transactional (or production):
– Supports a company’s day-to-day operations
• Data warehouse:
– Stores data used to generate information
– Often used to store historical data
– Structure is quite different from that of a transactional database
Database Systems
• Problems inherent in file systems make using a
database system desirable
• File system
– Many separate and unrelated files
• Database
– Logically related data stored in a single logical data
repository
The Database System Environment
• Database system is composed of five
main parts:
– Hardware
– Software
• Operating system software
• DBMS software
• Application programs and utility software
– People
– Procedures
– Data
What Is a Database?
The word database is commonly used to refer to any
of the following:
• your personal address book in a Word document
• a collection of Word documents
• a collection of Excel spreadsheets
• a very large flat file on which you run some statistical analysis
functions
• data collected, maintained, and used in an airline reservation system
• data used to support the launch of a space shuttle
Database capabilities
• Data Storage
• Queries
• Optimization
• Indexing
• Concurrency Control
• Recovery
• Security
Data Storage
• Disk management
• File management
• Buffer management
• Garbage collection
• Compression
People who work with databases
• System Analysts
• Database Designers
• Application Developers
• Database Administrators
• End Users
Database Users
• Users are differentiated by the way they expect to interact with the
system
• Application programmers – interact with system through DML calls
• Sophisticated users – form requests in a database query language
• Specialized users – write specialized database applications that do
not fit into the traditional data processing framework
• Naïve users – invoke one of the permanent application programs
that have been written previously
– E.g. people accessing database over the web, bank tellers, clerical staff
Revision of RDBMS and Languages
• A collection of interrelated data and a set of programs
to manage these data
• Why: Efficient, Convenient and reliable management
of data
• Tasks:
– Data modelling
– Data storage
– Data retrieval
– Data manipulation
Database Environment
• A major aim of the database is to provide the user
with an abstract view of the data
• Different users may have different views of the
data held in the database
• To achieve abstraction and the variety of views, a
standard architecture is provided in most available
commercial DBMS
Three Level ANSI-SPARC Architecture
▪ External Level
▪ Conceptual Level
▪ Internal Level

Physical Level
ANSI-SPARC three-level Architecture
Data Independence
• Logical Data Independence
– Refers to immunity of external schemas to changes
in conceptual schema.
– Conceptual schema changes (e.g. addition/removal
of entities).
– Should not require changes to external schema or
rewrites of application programs.
• Physical Data Independence
– Refers to immunity of conceptual schema to
changes in the internal schema.
– Internal schema changes (e.g. using different file
organizations, storage structures/devices).
– Should not require change to conceptual or
external schemas.
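The two kinds of data independence can be sketched with SQLite, treating a view as an external schema, the base table as the conceptual schema, and an index as an internal-level storage structure. This mapping is a simplification, and the `staff` table, `staff_directory` view, and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: the base table.
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
    INSERT INTO staff (name, dept, salary)
        VALUES ('Ann', 'CS', 50000), ('Bob', 'Maths', 45000);
    -- External level: what one group of users sees.
    CREATE VIEW staff_directory AS SELECT name, dept FROM staff;
""")


def directory():
    """The application's query; it is never rewritten below."""
    return conn.execute(
        "SELECT name, dept FROM staff_directory ORDER BY name"
    ).fetchall()


before = directory()

# Internal-level change (physical data independence): add an index.
conn.execute("CREATE INDEX idx_staff_dept ON staff (dept)")
after_index = directory()

# Conceptual-level change (logical data independence): add an attribute.
conn.execute("ALTER TABLE staff ADD COLUMN office TEXT")
after_column = directory()

print(before == after_index == after_column)
```

The application query against the view is unchanged after both modifications, which is the property the two definitions describe.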
Data Independence and the ANSI-SPARC Three-Level Architecture
Database Languages
Data Definition Language (DDL)
▪ Database schema definition
Data Manipulation Language (DML)
▪ Retrieval
▪ Insertion
▪ Deletion
▪ Modification
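A minimal sketch of one DDL statement followed by the four DML operations listed above, run through Python's `sqlite3` module. The `book` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.execute(
    "CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT NOT NULL, copies INTEGER)"
)

# DML, insertion.
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [("111", "Databases", 3), ("222", "Warehousing", 1)],
)

# DML, modification.
conn.execute("UPDATE book SET copies = copies + 1 WHERE isbn = ?", ("222",))

# DML, deletion.
conn.execute("DELETE FROM book WHERE isbn = ?", ("111",))

# DML, retrieval.
remaining = conn.execute("SELECT isbn, title, copies FROM book").fetchall()
print(remaining)
```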
The relational model
Relational Data Structure
▪ Relation, Attribute, Domain, Tuple, Relational Database,
Relational Keys
Integrity Constraints
▪ Null, Entity Integrity and Referential Integrity
Integrity constraint
• Integrity constraints are conditions which specify the
circumstances in which the data in a database are correct
• Examples
• no bank account can have a negative balance
• the department of each member of the academic staff must always be
known
• each bank account has to belong to at least one customer
• a book cannot be borrowed from the university library for more than
30 days
• all the members of the academic staff are entitled to the same annual
leave
Types of integrity constraints
• Domain rules
– The age of each member of the academic staff has to be in the range
[0…100]
• Referential integrity constraints (known as base table rules)
– The department of each member of the academic staff must always
be known
• Functional dependencies
– All the members of the academic staff are entitled to the same annual
holiday leave
• Other general rules
– No bank account can have a negative balance
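Several of the rules above can be declared so that the DBMS enforces them. The sketch below uses SQLite `CHECK`, `NOT NULL`, and `REFERENCES` clauses. The table layouts and the `violates` helper are assumptions made for illustration, and SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.executescript("""
    CREATE TABLE department (name TEXT PRIMARY KEY);
    CREATE TABLE staff (
        id   INTEGER PRIMARY KEY,
        age  INTEGER CHECK (age BETWEEN 0 AND 100),        -- domain rule
        dept TEXT NOT NULL REFERENCES department(name)     -- must be a known department
    );
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        balance REAL NOT NULL CHECK (balance >= 0)         -- general rule
    );
    INSERT INTO department VALUES ('CS');
""")


def violates(sql, params=()):
    """Return True if the DBMS rejects the statement as an integrity violation."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


insert_staff = "INSERT INTO staff (age, dept) VALUES (?, ?)"
results = {
    "age out of range": violates(insert_staff, (130, "CS")),
    "unknown department": violates(insert_staff, (40, "Physics")),
    "missing department": violates(insert_staff, (40, None)),
    "negative balance": violates("INSERT INTO account (balance) VALUES (?)", (-5,)),
    "valid row": violates(insert_staff, (40, "CS")),
}
print(results)
```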
SQL
• SQL is a non-procedural (declarative) data definition and manipulation
language for relational databases, based on the relational algebra
• SQL statements are used to specify database queries and retrieve
information from a database
• SQL emerged as a descendant of SEQUEL
(System R, IBM, 1970s)
• Many commercial variations (SQL is a de facto
standard for commercial relational DBMSs)

7/18/2018
Main Parts of the language
• Data manipulation sublanguage
– Select, delete, insert and update data in relations
• Data definition sublanguage
– Create, delete and modify relations, create indexes,
authorisation
• Integrity constraint specification sublanguage
• Transaction control sublanguage
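As a sketch of the transaction control sublanguage, the example below relies on Python's `sqlite3` connection context manager, which commits when the block succeeds and rolls back when it raises. The `account` table and `transfer` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # COMMIT on success, ROLLBACK if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(1, 2, 30.0)    # allowed
second_ok = transfer(1, 2, 500.0)  # would overdraw account 1, so it is rolled back
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(first_ok, second_ok, balances)
```

In the failed transfer the credit to account 2 is executed first and then undone by the rollback, so neither half of the transfer persists.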
SQL: data manipulation
Select, Insert, Update and Delete SQL DML
statements.
Examples:
1)      Select all albums of Elvis Presley and sort them descending
by the date of release.
• SELECT * FROM album WHERE interpreter = 'Elvis Presley'
ORDER BY released DESC
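To try the query, the sketch below builds a small `album` table in SQLite. The column layout and the placeholder rows are assumptions, since the slides do not give the schema. Standard SQL sorts with `ORDER BY`, which is what the example uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; only the interpreter and released columns appear in the query.
conn.execute("CREATE TABLE album (title TEXT, interpreter TEXT, released TEXT)")
conn.executemany(
    "INSERT INTO album VALUES (?, ?, ?)",
    [
        # Placeholder titles and dates, not real discography data.
        ("Album A", "Elvis Presley", "1956-01-01"),
        ("Album B", "Elvis Presley", "1961-01-01"),
        ("Album C", "Another Artist", "1959-01-01"),
    ],
)

rows = conn.execute(
    "SELECT * FROM album WHERE interpreter = ? ORDER BY released DESC",
    ("Elvis Presley",),
).fetchall()

for row in rows:
    print(row)
```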
SQL: data definition
• SQL Identifiers
• Data Types
• Integrity Constraints
– Required Data
– Domain Constraints
– Entity Integrity
– Referential Integrity
• Data Definition (create a database, create tables,
change a table definition, remove a table, etc.)
View
• The dynamic result of one or more relational
operations operating on the base relations to
produce another relation.
• A view is a virtual relation that does not necessarily
exist in the database but can be produced upon
request by a particular user, at the time of request.
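The "dynamic result" property can be sketched in SQLite: a view stores only its defining query, so its contents are recomputed from the base relation each time it is read. The `staff` table and `dept_payroll` view are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
    INSERT INTO staff (name, dept, salary) VALUES
        ('Ann', 'CS', 50000), ('Bob', 'CS', 40000), ('Cy', 'Maths', 45000);
    -- The view stores the query, not the rows.
    CREATE VIEW dept_payroll AS
        SELECT dept, COUNT(*) AS headcount, SUM(salary) AS total_salary
        FROM staff
        GROUP BY dept;
""")

query = "SELECT dept, headcount, total_salary FROM dept_payroll ORDER BY dept"
first = conn.execute(query).fetchall()

# Change the base relation; the view reflects it on the next request.
conn.execute("INSERT INTO staff (name, dept, salary) VALUES ('Di', 'Maths', 55000)")
second = conn.execute(query).fetchall()

print(first)
print(second)
```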
Entity-relationship modelling
▪ Entity Type
▪ Relationship Types
▪ Attributes
▪ Simple and Composite Attributes
▪ Single-Valued and Multi-Valued Attributes
▪ Keys
▪ Strong and Weak Entity types
Enhanced entity-relationship modelling

Specialization/Generalization
Superclass
▪ An entity type that includes one or more distinct subclasses that
need to be represented in a data model.
Subclass
▪ An entity type that has a distinct role and is also a member of
the superclass.
Diagram

Specialization

The process of maximizing the differences between
members of an entity by identifying their distinguishing
characteristics.
Generalization

The process of minimizing the differences between entities
by identifying their common features.
Natural Key: a key that is naturally present in the data of the relational table.
Surrogate Key: an artificial primary key introduced for the data warehouse, with no business meaning of its own.
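A minimal sketch of the two kinds of key, assuming a source table keyed by e-mail address (the natural key) and a warehouse dimension table that generates its own integer key (the surrogate key). All table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational source: the e-mail address is a natural key.
    CREATE TABLE src_customer (email TEXT PRIMARY KEY, name TEXT);
    INSERT INTO src_customer VALUES
        ('ann@example.com', 'Ann'), ('bob@example.com', 'Bob');

    -- Warehouse dimension: customer_sk is a surrogate key generated on load;
    -- the natural key is kept as an ordinary attribute.
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        email       TEXT NOT NULL,
        name        TEXT
    );
    INSERT INTO dim_customer (email, name)
        SELECT email, name FROM src_customer ORDER BY email;
""")

dim_rows = conn.execute(
    "SELECT customer_sk, email FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(dim_rows)
```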
Normalization
First Normal Form (1NF)

For a table to be in the First Normal Form, it should follow these rules:

1. It should only have single (atomic) valued attributes/columns.

2. Values stored in a column should be of the same domain.


Second Normal Form (2NF)

For a table to be in the Second Normal Form,

1. It should be in the First Normal form.

2. And, it should not have Partial Dependency.


Third Normal Form (3NF)

A table is said to be in the Third Normal Form when,

1. It is in the Second Normal form.

2. And, it doesn't have Transitive Dependency.
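The 2NF step can be sketched in SQLite with a hypothetical enrolment table. Its key is (student_id, course_id), but student_name depends on student_id alone, which is a partial dependency. Splitting the table removes it, and a join recovers the original rows. Removing a transitive dependency for 3NF follows the same pattern of moving the dependent attribute into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- In 1NF but not 2NF: student_name depends on only part of the key.
    CREATE TABLE enrolment_flat (
        student_id INTEGER, student_name TEXT, course_id TEXT, grade TEXT,
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO enrolment_flat VALUES
        (1, 'Ann', 'DB', 'A'), (1, 'Ann', 'AI', 'B'), (2, 'Bob', 'DB', 'C');

    -- 2NF decomposition: each non-key attribute depends on its table's whole key.
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, student_name TEXT);
    CREATE TABLE enrolment (
        student_id INTEGER REFERENCES student(student_id),
        course_id TEXT, grade TEXT,
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student SELECT DISTINCT student_id, student_name FROM enrolment_flat;
    INSERT INTO enrolment SELECT student_id, course_id, grade FROM enrolment_flat;
""")

flat = conn.execute(
    "SELECT student_id, student_name, course_id, grade "
    "FROM enrolment_flat ORDER BY student_id, course_id"
).fetchall()

rejoined = conn.execute(
    "SELECT e.student_id, s.student_name, e.course_id, e.grade "
    "FROM enrolment AS e JOIN student AS s ON s.student_id = e.student_id "
    "ORDER BY e.student_id, e.course_id"
).fetchall()

student_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(flat == rejoined, student_count)
```

Each student name is now stored once instead of once per enrolment, so it cannot become inconsistent across rows.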
