Professional Documents
Culture Documents
Q1 - Rajat
Q2 - Naman & Pranav (Completed)
Q3 - Pratik
Q4 - Shivank
Put the complete answers here, and then at the end each person should prepare their own formatted Word document.
Questions
Q2. You need to write the graphical database schema for a complete university database (Chapter 2 of the Navathe book).
Q4. Introduce commercial databases
a. IBM Db2
b. Oracle
c. Ms SQL Server
1. Answer 1: Apply cost-based functions to the SELECT operation and the JOIN operation.
Access cost to secondary storage- This is the cost of transferring (reading and writing) data
blocks between secondary disk storage and main memory buffers. This is also known as disk I/O
(input/output) cost. The cost of searching for records in a disk file depends on the type of access
structures on that file, such as ordering, hashing, and primary or secondary indexes. In addition,
factors such as whether the file blocks are allocated contiguously on the same disk cylinder or
scattered on the disk affect the access cost.
Disk storage cost- This is the cost of storing on disk any intermediate files that are generated by
an execution strategy for the query.
Computation cost- This is the cost of performing in-memory operations on the records within the
data buffers during query execution. Such operations include searching for and sorting records,
merging records for a join or a sort operation, and performing computations on field values. This
is also known
as CPU (central processing unit) cost.
Memory usage cost- This is the cost pertaining to the number of main memory buffers needed
during query execution.
Communication cost- This is the cost of shipping the query and its results from the database
site to the site or terminal where the query originated. In distributed databases (see Chapter 25),
it would also include the cost of transferring tables and results among various computers during
query evaluation. For large databases, the main emphasis is often on minimizing the access cost
to secondary storage. Simple cost functions ignore other factors and compare different query
execution strategies in terms of the number of block transfers between disk and main memory
buffers. For smaller databases, where most of the data in the files involved in the query can be
completely stored in memory, the emphasis is on minimizing computation cost.
Catalog Information Used in Cost Functions
We must also keep track of the primary file organization for each file. Under the primary file organization, the records may be unordered, ordered by an attribute with or without a primary or clustering index, or hashed (static hashing or one of the dynamic hashing methods) on a key attribute. Information is also kept on all primary, secondary, or clustering indexes and their indexing attributes.
● The number of levels (x) of each multilevel index (primary, secondary, or clustering) is needed for cost functions that estimate the number of block accesses that occur during query execution.
● In some cost functions, the number of first-level index blocks (bI1) is needed.
● The selectivity (sl) of an attribute, which is the fraction of records satisfying an equality condition on the attribute.
● The selection cardinality (s = sl * r) of an attribute, which is the average number of records that will satisfy an equality selection condition on that attribute.
For a nonkey attribute, by making the assumption that the d distinct values are uniformly distributed among the records, we estimate sl = (1/d) and so s = (r/d).
Information such as the number of index levels is easy to maintain because it does not change very often. However, other information may change frequently.
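For instance, these two catalog estimates can be computed directly. The sketch below (Python, purely illustrative) uses hypothetical values chosen to match the EMPLOYEE example later in this answer:

```python
# Illustrative sketch: uniform-distribution estimates for selectivity (sl)
# and selection cardinality (s) of an equality condition on a nonkey
# attribute with d distinct values in a file of r records.

def selectivity(d):
    """sl = 1/d, assuming the d distinct values are uniformly distributed."""
    return 1 / d

def selection_cardinality(r, d):
    """s = sl * r = r/d: average number of records matching the condition."""
    return r / d

# Hypothetical file: r = 10,000 records, d = 125 distinct values.
sl = selectivity(125)                      # 0.008
s = selection_cardinality(10_000, 125)     # 80.0
print(sl, s)
```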
Cost Functions for SELECT
These cost functions are estimates that ignore computation time, storage cost, and other factors.
The cost for method Si is referred to as CSi block accesses.
S1—Linear search (brute force) approach. We search all the file blocks to retrieve all records
satisfying the selection condition; hence, CS1a = b.
For an equality condition on a key attribute, only half the file blocks are searched on the average
before finding the record, so a rough estimate for CS1b = (b/2) if the record is found; if no record
is found that satisfies the condition, CS1b = b.
S2—Binary search. This search accesses approximately CS2 = log2 b + ⌈s/bfr⌉ − 1 file blocks. This reduces to log2 b if the equality condition is on a unique (key) attribute, because s = 1 in this case.
S3a—Using a primary index to retrieve a single record. For a primary index, retrieve one disk
block at each index level, plus one disk block from the data file. Hence, the cost is one more disk
block than the number of index levels: CS3a = x + 1.
S3b—Using a hash key to retrieve a single record. For hashing, only one disk block needs to
be accessed in most cases. The cost function is approximately CS3b = 1 for static hashing or
linear hashing, and it is 2 disk block accesses for extendible hashing.
S4—Using an ordering index to retrieve multiple records. If the comparison condition is >,
>=, <, or <= on a key field with an ordering index, roughly half the file records will satisfy the
condition. This gives a cost function of CS4 = x + (b/2). This is a very rough estimate, and although
it may be correct on the average, it may be quite inaccurate in individual cases. A more accurate
estimate is possible if the distribution of records is stored in a histogram.
S5—Using a clustering index to retrieve multiple records. One disk block is accessed at each
index level, which gives the address of the first file disk block in the cluster. Given an equality
condition on the indexing attribute, s records will satisfy the condition, where s is the selection
cardinality of the
indexing attribute.
This means that ⌈s/bfr⌉ file blocks will be in the cluster of file blocks that hold all the selected records, giving CS5 = x + ⌈s/bfr⌉.
S6—Using a secondary (B+-tree) index. For a secondary index on a key (unique) attribute, the
cost is x + 1 disk block accesses. For a secondary index on a nonkey (nonunique) attribute, s
records will satisfy an equality condition, where s is the selection cardinality of the indexing
attribute. However, because the index is nonclustering, each of the records may reside on a
different disk block, so the (worst case) cost estimate is CS6a = x + 1 + s. The additional 1 is to
account for the disk block that contains the record pointers after the index is searched. If the
comparison condition is >, >=, <, or <= and half the file records are assumed to satisfy the
condition, then (very roughly) half the first-level index blocks are accessed, plus half the file
records via the index. The cost estimate for this case, approximately, is CS6b = x + (bI1/2) + (r/2).
The r/2 factor can be refined if better selectivity estimates
are available through a histogram. The latter method CS6b can be very costly.
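Taken together, the S1–S6 estimates can be collected into a set of small functions. This is an illustrative Python sketch, not part of the text; CS6a follows the x + 1 + s worst-case definition given above, and CS2 rounds log2 b up to a whole number of block accesses:

```python
import math

# Illustrative sketch of the S1-S6 SELECT cost estimates (block accesses).
# b = file blocks, r = records, bfr = blocking factor, x = index levels,
# s = selection cardinality, bI1 = first-level index blocks.

def cs1a(b):
    """S1: linear search over the whole file (nonkey condition)."""
    return b

def cs1b(b):
    """S1: equality on a key attribute; half the blocks on average."""
    return b // 2

def cs2(b, s, bfr):
    """S2: binary search; log2 b blocks plus the blocks holding matches."""
    return math.ceil(math.log2(b)) + math.ceil(s / bfr) - 1

def cs3a(x):
    """S3a: primary index, single record: one block per level plus data."""
    return x + 1

def cs3b(extendible=False):
    """S3b: hash key, single record: 1 block (2 for extendible hashing)."""
    return 2 if extendible else 1

def cs4(x, b):
    """S4: ordering index, range condition: roughly half the file."""
    return x + b // 2

def cs5(x, s, bfr):
    """S5: clustering index, equality: the cluster of matching blocks."""
    return x + math.ceil(s / bfr)

def cs6a(x, s):
    """S6: secondary index, equality on a nonkey (worst case x + 1 + s)."""
    return x + 1 + s

def cs6b(x, bI1, r):
    """S6: secondary index, range: half the index leaves plus half the records."""
    return x + bI1 // 2 + r // 2
```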
In a query optimizer, it is common to enumerate the various possible strategies for executing a
query and to estimate the costs for different strategies. An optimization technique, such as
dynamic programming, may be used to find the optimal (least) cost estimate efficiently, without
having to consider all possible execution strategies. We do not discuss optimization algorithms
here; rather, we use a simple example to illustrate how cost estimates may be used. Suppose that the EMPLOYEE file has rE = 10,000 records stored in bE = 2000 disk blocks with blocking factor bfrE = 5 records/block and the following access paths:
1. A clustering index on Salary, with levels xSalary = 3 and average selection cardinality sSalary
= 20. (This corresponds to a selectivity of slSalary = 0.002).
2. A secondary index on the key attribute Ssn, with xSsn = 4 (sSsn = 1, slSsn = 0.0001).
3. A secondary index on the nonkey attribute Dno, with xDno = 2 and first-level index blocks
bI1Dno = 4. There are dDno = 125 distinct values for Dno, so the selectivity of Dno is slDno =
(1/dDno) = 0.008, and the selection cardinality is sDno = (rE * slDno) = (rE/dDno) = 80.
4. A secondary index on Sex, with xSex = 1. There are dSex = 2 values for the Sex attribute, so
the average selection cardinality is sSex = (rE/dSex) = 5000. (Note that in this case, a histogram
giving the percentage of male and female employees may be useful, unless they are
approximately equal.)
The cost of the brute force (linear search or file scan) option S1 will be estimated as CS1a = bE
= 2000 (for a selection on a nonkey attribute) or CS1b = (bE/2) = 1000 (average cost for a
selection on a key attribute).
(Here OP1 is an equality selection on the key attribute Ssn; OP2 is a range selection on Dno; OP3 is the equality selection Dno = 5; and OP4 is the conjunctive selection Dno = 5 AND Salary > 30,000 AND Sex = ‘F’.)
For OP1 we can use either method S1 or method S6a; the cost estimate for S6a is CS6a = xSsn
+ 1 = 4 + 1 = 5, and it is chosen over method S1, whose average cost is CS1b = 1000. For OP2
we can use either method S1 (with estimated cost CS1a = 2000) or method S6b (with estimated
cost CS6b = xDno + (bI1Dno/2) + (rE /2) = 2 + (4/2) + (10,000/2) = 5004), so we choose the linear
search approach for OP2. For OP3 we can use either method S1 (with estimated cost CS1a =
2000) or method S6a (with estimated cost CS6a = xDno + sDno = 2 + 80 = 82), so we choose
method S6a.
Finally, consider OP4, which has a conjunctive selection condition. We need to estimate the cost of using any one of the three components of the selection condition to retrieve the records, plus the linear search approach. The latter gives the cost estimate CS1a = 2000. Using the condition (Dno = 5) first gives the cost estimate CS6a = 82. Using the condition (Salary > 30,000) first gives a cost estimate CS4 = xSalary + (bE/2) = 3 + (2000/2) = 1003. Using the condition (Sex = ‘F’) first gives a cost estimate CS6a = xSex + sSex = 1 + 5000 = 5001.
The optimizer would then choose method S6a on the secondary index on Dno because it has the
lowest cost estimate. The condition (Dno = 5) is used to retrieve the records, and the remaining
part of the conjunctive condition (Salary > 30,000 AND Sex = ‘F’) is checked for each selected
record after it is retrieved into memory. Only the records that satisfy these additional conditions
are included in the result of the operation.
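The arithmetic behind the OP1–OP4 choices can be replayed in a short sketch (illustrative Python; note that CS6a is computed here as x + s, matching the arithmetic used in the example above rather than the worst-case x + 1 + s definition):

```python
# Illustrative sketch replaying the OP1-OP4 cost comparison.
# Parameters of the EMPLOYEE file from the example above.
bE, rE = 2000, 10_000
x_ssn, x_dno, x_sex, x_salary = 4, 2, 1, 3
bI1_dno = 4
s_dno, s_sex = 80, 5000

# OP1 (equality on key Ssn): S6a vs. average-case linear search.
op1 = min(x_ssn + 1, bE // 2)                  # 5 vs. 1000 -> index wins

# OP2 (range on Dno): S6b vs. linear search.
cs6b_dno = x_dno + bI1_dno // 2 + rE // 2      # 2 + 2 + 5000 = 5004
op2 = min(cs6b_dno, bE)                        # linear search wins: 2000

# OP3 (equality on nonkey Dno): S6a (x + s, as in the example) vs. linear.
cs6a_dno = x_dno + s_dno                       # 2 + 80 = 82
op3 = min(cs6a_dno, bE)                        # 82

# OP4 (conjunctive condition): cheapest single-condition access path.
cs4_salary = x_salary + bE // 2                # 3 + 1000 = 1003
cs6a_sex = x_sex + s_sex                       # 1 + 5000 = 5001
op4 = min(bE, cs6a_dno, cs4_salary, cs6a_sex)  # 82 -> use (Dno = 5) first
print(op1, op2, op3, op4)
```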
To develop reasonably accurate cost functions for JOIN operations, we need to have an estimate
for the size (number of tuples) of the file that results after the JOIN operation. This is usually kept
as a ratio of the size (number of tuples) of the resulting join file to the size of the CARTESIAN
PRODUCT file, if both are applied to the same input files, and it is called the join selectivity (js). If we denote the number of tuples of a relation R by |R|, we have:
js = |(R ⋈c S)| / |(R × S)| = |(R ⋈c S)| / (|R| * |S|)
If there is no join condition c, then js = 1 and the join is the same as the CARTESIAN PRODUCT. If no tuples from the relations satisfy the join condition, then js = 0. In general, 0 ≤ js ≤ 1. For a join where the condition c is an equality comparison R.A = S.B, we get the following two special cases:
1. If A is a key of R, then |(R ⋈c S)| ≤ |S|, so js ≤ (1/|R|). This is because each record in file S will be joined with at most one record in file R, since A is a key of R. A special case of this condition is when attribute B is a foreign key of S that references the primary key A of R. In addition, if the foreign key B has the NOT NULL constraint, then js = (1/|R|), and the result file of the join will contain |S| records.
2. Symmetrically, if B is a key of S, then |(R ⋈c S)| ≤ |R|, so js ≤ (1/|S|).
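A quick numeric sketch of this special case (illustrative Python; the sizes are hypothetical, chosen to match the later EMPLOYEE/DEPARTMENT example):

```python
# Illustrative sketch: join selectivity when A is a key of R and
# B is a NOT NULL foreign key of S referencing A (hypothetical sizes).
R_size, S_size = 125, 10_000   # |R| and |S|

# Every S tuple joins with exactly one R tuple, so the result has |S| tuples.
result_size = S_size
js = result_size / (R_size * S_size)   # equals 1/|R|
print(js, result_size)
```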
J1—Nested-loop join. Suppose that we use R for the outer loop; then we get the following cost function to estimate the number of block accesses for this method, assuming three memory buffers. We assume that the blocking factor for the resulting file is bfrRS and that the join selectivity js is known:
CJ1 = bR + (bR * bS) + ((js * |R| * |S|)/bfrRS)
The last part of the formula is the cost of writing the resulting file to disk. This cost formula can be modified to take into account different numbers of memory buffers. If nB main memory buffers are available to perform the join, the cost formula becomes:
CJ1 = bR + (⌈bR/(nB − 2)⌉ * bS) + ((js * |R| * |S|)/bfrRS)
J2—Index-based nested-loop (single-loop) join, using an access structure on one of the join attributes to retrieve the matching records. If a hash key exists for one of the two join attributes—say, B of S—we get
CJ2d = bR + (|R| * h) + ((js * |R| * |S|)/bfrRS)
where h ≥ 1 is the average number of block accesses to retrieve a record, given its hash key value. Usually, h is estimated to be 1 for static and linear hashing and 2 for extendible hashing.
J3—Sort-merge join. If the files are already sorted on the join attributes, the cost function for this method is
CJ3a = bR + bS + ((js * |R| * |S|)/bfrRS)
If we must sort the files, the cost of sorting must be added; the external-sorting cost formulas can be used to estimate it.
Suppose that we have the EMPLOYEE file described in the example above, and assume that the DEPARTMENT file consists of rD = 125 records stored in bD = 13 disk blocks. Consider the following two join operations:
OP6: EMPLOYEE ⋈Dno=Dnumber DEPARTMENT
OP7: DEPARTMENT ⋈Mgr_ssn=Ssn EMPLOYEE
Suppose that we have a primary index on Dnumber of DEPARTMENT with xDnumber=1 level
and a secondary index on Mgr_ssn of DEPARTMENT with selection cardinality sMgr_ssn= 1 and
levels xMgr_ssn=2.
Assume that the join selectivity for OP6 is jsOP6 = (1/|DEPARTMENT|) = 1/125 because Dnumber is a key of DEPARTMENT. Also assume that the blocking factor for the resulting join file is bfrED = 4 records per block. We can estimate the worst-case costs for the JOIN operation OP6 using the applicable methods J1 and J2 as follows:
Case 4 has the lowest cost estimate and will be chosen. Notice that in case 2 above, if 15 memory buffers (or more) were available for executing the join instead of just 3, 13 of them could be used to hold the entire DEPARTMENT relation (the outer-loop relation) in memory, one could be used as a buffer for the result, and one would be used to hold one block at a time of the EMPLOYEE file (the inner-loop file), and the cost for case 2 could be drastically reduced to just bE + bD + ((jsOP6 * rE * rD)/bfrED) = 2000 + 13 + 2500 = 4,513 block accesses. If some other number of main memory buffers were available, say nB = 10, then the cost for case 2 would be bD + (⌈bD/(nB − 2)⌉ * bE) + ((jsOP6 * rE * rD)/bfrED) = 13 + (2 * 2000) + 2500 = 6,513, which would also give better performance than case 4.
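The buffer arithmetic above can be checked with a short sketch (illustrative Python; the nB-buffer nested-loop formula is assumed to be bD + ⌈bD/(nB − 2)⌉ · bE plus the cost of writing the result):

```python
import math

# Illustrative sketch of the nested-loop join cost for OP6 with
# DEPARTMENT as the outer relation, using the nB-buffer formula.
bE, rE = 2000, 10_000          # EMPLOYEE blocks and records
bD, rD = 13, 125               # DEPARTMENT blocks and records
js = 1 / 125                   # jsOP6: Dnumber is a key of DEPARTMENT
bfrED = 4                      # blocking factor of the join result

result_blocks = math.ceil(js * rE * rD / bfrED)   # 2500 output blocks

def nested_loop_cost(b_outer, b_inner, nB):
    """nB - 2 buffers hold outer-loop blocks; one holds an inner-loop
    block; one holds the current result block being filled."""
    return b_outer + math.ceil(b_outer / (nB - 2)) * b_inner + result_blocks

print(nested_loop_cost(bD, bE, 15))   # 13 + 1*2000 + 2500 = 4513
print(nested_loop_cost(bD, bE, 10))   # 13 + 2*2000 + 2500 = 6513
```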
2. Answer 2
STUDENT: Student_Name, Student_ID, Major
COURSE: Course_ID, Title, Tot_Credits
SECTION: Section_ID, Course_ID, Semester, Year
PREREQUISITES: Pre_Number, Course_ID
GRADE_REPORT: Student_ID, Section_ID, Grade
The graphical representation of the above schema depicts each relation as a box listing its attributes, for example:
COURSE: Course_ID, Title, Tot_Credits
SECTION: Section_ID, Course_ID, Semester, Year
PREREQUISITES: Pre_Number, Course_ID
Data types
The SQL:1999 standard calls for a Boolean type, but many commercial SQL servers (Oracle Database, IBM DB2) do not support it as a column type or variable type, or allow it in result sets. Microsoft SQL Server approximates Boolean values with its BIT data type, in which every group of 1–8 BIT fields occupies one full byte of space on disk. MySQL interprets BOOLEAN as a synonym for TINYINT (an 8-bit signed integer). PostgreSQL provides a standard-conforming Boolean type.
Distinct user-defined types
Sometimes called just distinct types, these were introduced as an optional feature (S011) to allow existing atomic types to be extended with a distinct meaning, creating a new type and thereby enabling the type-checking mechanism to detect some logical errors, e.g., accidentally adding an age to a salary. For example:
CREATE TYPE age AS INTEGER FINAL;
CREATE TYPE salary AS INTEGER FINAL;
creates two different and incompatible types. SQL distinct types use name equivalence, not structural equivalence as typedefs do in C. It is still possible to perform compatible operations on columns or data of distinct types by using an explicit type CAST.
Few SQL systems support these; IBM DB2 is one that does. Oracle Database does not currently support them, recommending instead that they be emulated with a one-attribute structured type.
Structured user-defined types
These are the backbone of the object-relational database extension in SQL:1999. They are analogous to classes in object-oriented programming languages. SQL:1999 allows only single inheritance.
SQL:2003
SQL:2003 is the fourth revision of the SQL database query language. The standard consists of nine parts. It was updated by SQL:2006, which was in turn updated by SQL:2008.
New features
The SQL:2003 standard makes minor modifications to all parts of SQL:1999 (also known
as SQL3), and officially introduces a few new features such as:
● XML-related features (SQL/XML)
● Window functions
● the sequence generator, which allows standardized sequences
● two new column types: auto-generated values and identity-columns
● the new MERGE statement
● extensions to the CREATE TABLE statement, to allow "CREATE TABLE AS" and
"CREATE TABLE LIKE"
● removal of the poorly implemented "BIT" and "BIT VARYING" data types
● OLAP capabilities (initially added in SQL:1999) were extended with window functions.
Overview
The basic need for object-relational databases arises from the fact that both relational and object databases have their individual advantages and drawbacks. The isomorphism of the relational database system with a mathematical relation allows it to exploit many useful techniques and theorems from set theory. But these databases are less useful when it comes to data complexity and the mismatch between the application and the DBMS. An object-oriented database model allows containers like sets and lists, arbitrary user-defined datatypes, and nested objects. This brings commonality between the application type systems and database type systems, which removes any issue of impedance mismatch. But object databases, unlike relational ones, do not provide a mathematical base for their deep analysis.
The basic goal of the object-relational database is to bridge the gap between relational databases and the object-oriented modeling techniques used in programming languages such as Java, C++, Visual Basic .NET, or C#. However, a more popular alternative for achieving such a bridge is to use standard relational database systems with some form of object-relational mapping (ORM) software. Whereas traditional RDBMS or SQL-DBMS products focused on the efficient management of data drawn from a limited set of data types (defined by the relevant language standards), an object-relational DBMS allows software developers to integrate their own types and the methods that apply to them into the DBMS.
The ORDBMS (like ODBMS or OODBMS) is integrated with an object-oriented programming
language. The characteristic properties of ORDBMS are 1) complex data, 2) type inheritance,
and 3) object behavior. Complex data creation in most SQL ORDBMSs is based on
preliminary schema definition via the user-defined type (UDT). Hierarchy within structured
complex data offers an additional property, type inheritance. That is, a structured type can
have subtypes that reuse all of its attributes and contain additional attributes specific to the
subtype. Another advantage, object behavior, is related to access to the program objects. Such program objects must be storable and transportable for database processing; therefore, they are usually called persistent objects. Inside a database, all the relations with a persistent program object are relations with its object identifier (OID). All of these points can be addressed in a proper relational system, although the SQL standard and its implementations impose arbitrary restrictions and additional complexity.
In object-oriented programming (OOP), object behavior is described through methods (object functions). Methods denoted by one name are distinguished by the types of their parameters and the type of object to which they are attached (the method signature). OOP languages call this the polymorphism principle, briefly defined as "one interface, many implementations". Other OOP principles, inheritance and encapsulation, relate both to methods and to attributes. Method inheritance is included in type inheritance. Encapsulation in OOP is a declared degree of visibility, expressed for example through the public, private, and protected access modifiers.
History
Object-relational database management systems grew out of research that occurred in the
early 1990s. That research extended existing relational database concepts by adding object
concepts. The researchers aimed to retain a declarative query-language based on predicate
calculus as a central component of the architecture. Probably the most notable research
project, Postgres (UC Berkeley), spawned two products tracing their lineage to that research:
Illustra and PostgreSQL.
In the mid-1990s, early commercial products appeared. These included Illustra (Illustra
Information Systems, acquired by Informix Software, which was in turn acquired by IBM),
Omniscience (Omniscience Corporation, acquired by Oracle Corporation and became the
original Oracle Lite), and UniSQL (UniSQL, Inc., acquired by KCOMS). Ukrainian developer
Ruslan Zasukhin, founder of Paradigma Software, Inc., developed and shipped the first version
of Valentina database in the mid-1990s as a C++ SDK. By the next decade, PostgreSQL had
become a commercially viable database, and is the basis for several current products that
maintain its ORDBMS features.
Computer scientists came to refer to these products as "object-relational database
management systems" or ORDBMSs.
Many of the ideas of early object-relational database efforts have largely become incorporated
into SQL:1999 via structured types. In fact, any product that adheres to the object-oriented
aspects of SQL:1999 could be described as an object-relational database management
product. For example, IBM's DB2, Oracle database, and Microsoft SQL Server, make claims
to support this technology and do so with varying degrees of success.
Comparison to RDBMS
An RDBMS might commonly involve SQL statements such as these:
CREATE TABLE Customers (
Id CHAR(12) NOT NULL PRIMARY KEY,
Surname VARCHAR(32) NOT NULL,
FirstName VARCHAR(32) NOT NULL,
DOB DATE NOT NULL
);
SELECT InitCap(Surname) || ', ' || InitCap(FirstName)
FROM Customers
WHERE Month(DOB) = Month(getdate())
AND Day(DOB) = Day(getdate());
Most current SQL databases allow the crafting of custom functions, which would allow the
query to appear as:
SELECT Formal(Id)
FROM Customers
WHERE Birthday(DOB) = Today()
In an object-relational database, one might see something like this, with user-defined data-
types and expressions such as BirthDay():
CREATE TABLE Customers (
Id Cust_Id NOT NULL PRIMARY KEY,
Name PersonName NOT NULL,
DOB DATE NOT NULL
);
SELECT Formal( C.Id )
FROM Customers C
WHERE BirthDay ( C.DOB ) = TODAY;
The object-relational model can offer another advantage in that the database can make use of
the relationships between data to easily collect related records. In an address book application,
an additional table would be added to the ones above to hold zero or more addresses for each
customer. Using a traditional RDBMS, collecting information for both the user and their
address requires a "join":
SELECT InitCap(C.Surname) || ', ' || InitCap(C.FirstName), A.city
FROM Customers C JOIN Addresses A ON A.Cust_Id = C.Id -- the join
WHERE A.city = 'New York';
a. IBM DB2
DB2 is a database product from IBM. It is a Relational Database Management System (RDBMS). DB2 is designed to store, analyze, and retrieve data efficiently. The DB2 product has been extended with support for object-oriented features and non-relational structures such as XML.
History
Initially, IBM developed the DB2 product for its own platform. Around 1990, it decided to develop a Universal Database (UDB) DB2 server that could run on widely used operating systems such as Linux, UNIX, and Windows.
Versions
For IBM DB2, the current UDB version is 10.5, featuring BLU Acceleration and code-named 'Kepler'. The DB2 versions released to date are listed below:
● 3.4 — Cobweb
● 9.1 — Viper
● 9.5 — Viper 2
● 9.7 — Cobra
● 10.1 — Galileo
● 10.5 — Kepler
Editions and features:
● Express Edition — designed for entry-level and mid-size business organizations. It is a full-featured DB2 data server but offers only limited services. This edition comes with web service federations, DB2 homogeneous federations, homogeneous SQL replication, and backup compression.
● Enterprise Developer Edition — licensed to a single application developer; it is useful to design, build, and prototype applications for deployment on any IBM server. The software cannot be used for production deployment.
b. Oracle Database
Oracle Database (Oracle DB) is a relational database management system (RDBMS) from Oracle Corporation. Originally developed in 1977 by Lawrence Ellison and other developers, Oracle DB is one of the most trusted and widely used relational database engines.
The system is built around a relational database framework in which data objects
may be directly accessed by users (or an application front end) through
structured query language (SQL). Oracle is a fully scalable relational database
architecture and is often used by global enterprises, which manage and process
data across wide and local area networks. The Oracle database has its own
network component to allow communications across networks.
There are other database offerings, but most of these command a tiny market
share compared to Oracle DB and SQL Server. Fortunately, the structures of
Oracle DB and SQL Server are quite similar, which is a benefit when learning
database administration.
Oracle DB runs on most major platforms, including Windows, UNIX, Linux and
Mac OS. Different software versions are available, based on requirements and
budget. Oracle DB editions are hierarchically broken down as follows:
● Enterprise Edition: offers all features, including superior performance and security; the most robust edition
● Standard Edition: contains base functionality for users that do not require the Enterprise Edition's robust package
● Express Edition (XE): the lightweight, free, and limited Windows and Linux edition
● Oracle Lite: for mobile devices
A key feature of Oracle is that its architecture is split between the logical and the
physical. This structure means that for large-scale distributed computing, also
known as grid computing, the data location is irrelevant and transparent to the
user, allowing for a more modular physical structure that can be added to and
altered without affecting the activity of the database, its data or users. The
sharing of resources in this way allows for very flexible data networks whose
capacity can be adjusted up or down to suit demand, without degradation of
service. It also allows for a robust system to be devised as there is no single
point at which a failure can bring down the database, as the networked schema
of the storage resources means that any failure would be local only.
c. Microsoft SQL Server
Microsoft SQL Server is a relational database management system (RDBMS) developed by Microsoft.
The protocol layer implements the external interface to SQL Server. All operations that
can be invoked on SQL Server are communicated to it via a Microsoft-defined format,
called Tabular Data Stream (TDS). TDS is an application layer protocol, used to transfer
data between a database server and a client. Initially designed and developed by Sybase Inc. for their Sybase SQL Server relational database engine in 1984, and later adopted by Microsoft for Microsoft SQL Server, TDS packets can be encased in other physical transport-dependent protocols, including TCP/IP, named pipes, and shared memory.
Consequently, access to SQL Server is available over these protocols. In addition, the
SQL Server API is also exposed over web services.