DDL - Data Definition Language, which deals with database schemas and
descriptions of how the data should reside in the database.
• CREATE – creates a database and its objects (tables, indexes, views, stored
procedures, functions, and triggers)
• ALTER – alters the structure of an existing database object
• DROP – deletes objects from the database
• TRUNCATE – removes all records from a table, including the space allocated
for those records
• COMMENT – adds comments to the data dictionary
• RENAME – renames an object
DML - Data Manipulation Language, which deals with data manipulation and includes
the most common SQL statements such as SELECT, INSERT, UPDATE and DELETE. It is
used to store, modify, retrieve and delete data in a database.
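The DDL and DML statements above can be tried with Python's built-in sqlite3 module. A minimal sketch; the table and column names are made up for illustration:

```python
import sqlite3

# In-memory database; sqlite3 ships with the Python standard library.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define and alter the schema.
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE student ADD COLUMN age INTEGER")

# DML: store, modify and retrieve data.
cur.execute("INSERT INTO student (roll_no, name, age) VALUES (1, 'Ram', 18)")
cur.execute("UPDATE student SET age = 19 WHERE roll_no = 1")
cur.execute("SELECT name, age FROM student")
print(cur.fetchall())   # [('Ram', 19)]

# DDL again: remove the object from the database.
cur.execute("DROP TABLE student")
conn.close()
```

Note how CREATE/ALTER/DROP change the schema itself, while INSERT/UPDATE/SELECT only touch the data inside it.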
1. Physical Level: The physical level of a database describes how the data is
stored on secondary storage devices such as disks and tapes, along with
additional storage details. Most users of a DBMS are unaware of where these
objects are located.
2. Conceptual Level: At the conceptual level, data is represented in the form of
various database tables. Also referred to as the logical schema, it describes what
kind of data is to be stored in the database.
3. External Level: An external level specifies a view of the data in terms of
conceptual level tables. Each external level view is used to cater to the needs
of a particular category of users. So, different views can be generated for
different users. The main focus of external level is data abstraction.
Data Independence
Data independence means that a change of data at one level should not affect
another level. There are two types of data independence:
1. Physical Data Independence: Any change in the physical location of tables
and indexes should not affect the conceptual level or external view of data.
This kind of data independence is easy to achieve, and most DBMSs
implement it.
2. Conceptual Data Independence: This means a change in conceptual
schema should not affect external schema. But this type of independence is
difficult to achieve as compared to physical data independence because the
changes in conceptual schema are reflected in the user’s view.
Advantages of DBMS
• Minimized redundancy and data inconsistency
• Simplified Data Access
• Multiple data views
• Data Security
• Concurrent access to data
• Backup and Recovery mechanism
Disadvantages of DBMS
• Increased Cost: Cost of Hardware and Software, Cost of Staff Training & Cost
of Data Conversion
• Complexity
• Currency Maintenance
• Slow performance in case of small databases
• Frequent Upgrade/Replacement Cycles
DBMS Architecture
Two-tier architecture:
The two-tier architecture is similar to a basic client-server model. The application
at the client end directly communicates with the database at the server-side. APIs
like ODBC, JDBC are used for this interaction. The server side is responsible for
providing query processing and transaction management functionalities. On the
client-side, the user interfaces and application programs are run. The application
on the client-side establishes a connection with the server-side in order to
communicate with the DBMS.
Advantages: maintenance and understanding are easier, and it is compatible with
existing systems. However, this model performs poorly when there are a large
number of users.
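The client-side call pattern described above can be sketched with Python's DB-API, the analogue of ODBC/JDBC here. Note that sqlite3 is an embedded engine, not a true client-server DBMS; it stands in only to keep the sketch runnable, and the account table is made up:

```python
import sqlite3

# The client opens a connection, sends queries, and relies on the DBMS for
# query processing and transaction management.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

try:
    # Transfer 30 from account 1 to account 2 as one transaction.
    cur.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    cur.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()
except sqlite3.Error:
    conn.rollback()   # transaction management is done by the DBMS, not the client

cur.execute("SELECT balance FROM account ORDER BY id")
print(cur.fetchall())   # [(70,), (80,)]
```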
Multimedia Database
A multimedia database is a collection of interrelated multimedia data that includes
text, graphics (sketches, drawings), images, animations, video, audio, etc., often in
vast amounts from multiple sources. The framework that manages how different types
of multimedia data are stored, delivered and utilized is known as a multimedia
database management system. There are three classes of multimedia data: static
media, dynamic media and dimensional media.
Content of Multimedia Database management system:
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding
scheme etc. about the format of the media data after it goes through the
acquisition, processing and encoding phase.
3. Media keyword data – Keywords description relating to the generation of
data. It is also known as content descriptive data. Example: date, time and
place of recording.
4. Media feature data – Content dependent data such as the distribution of
colors, kinds of texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are:
1. Repository applications – A large amount of multimedia data as well as
meta-data (media format data, media keyword data, media feature data) is
stored for retrieval purposes, e.g., repositories of satellite images, engineering
drawings, radiology scanned pictures.
2. Presentation applications – They involve delivery of multimedia data
subject to temporal constraints. Optimal viewing or listening requires the DBMS
to deliver data at a certain rate, offering a quality of service above a certain
threshold. Here data is processed as it is delivered. Example: annotation of
video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves executing a
complex task by merging drawings, changing notifications. Example:
Intelligent healthcare network.
There are still many challenges to multimedia databases, some of which are:
1. Modelling – Work in this area draws on both database design and information
retrieval techniques; documents in particular constitute a specialized area and
deserve special consideration.
2. Design – The conceptual, logical and physical design of multimedia databases
has not yet been addressed fully, as performance and tuning issues at each
level are far more complex; the data comes in a variety of formats like JPEG,
GIF, PNG and MPEG, which are not easy to convert from one form to another.
3. Storage – Storage of multimedia database on any standard disk presents the
problem of representation, compression, mapping to device hierarchies,
archiving and buffering during input-output operation. In DBMS, a
”BLOB”(Binary Large Object) facility allows untyped bitmaps to be stored and
retrieved.
4. Performance – For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of parallel processing
may alleviate some problems, but such techniques are not yet fully developed.
Apart from this, multimedia databases consume a lot of processing time as well
as bandwidth.
5. Queries and retrieval – For multimedia data like images, video and audio,
accessing data through queries opens up many issues, such as efficient query
formulation, query execution and optimization, which need to be worked upon.
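The "BLOB" facility mentioned under the storage challenge can be sketched as follows. The byte string is a stand-in for real image data, not an actual encoded file:

```python
import sqlite3

# Untyped binary media data stored and retrieved as-is via a BLOB column.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, fmt TEXT, data BLOB)")

fake_image = bytes([0x89, 0x50, 0x4E, 0x47])   # stand-in bytes, not a real PNG
cur.execute("INSERT INTO media (fmt, data) VALUES (?, ?)", ("PNG", fake_image))

cur.execute("SELECT fmt, data FROM media WHERE id = 1")
fmt, data = cur.fetchone()
print(fmt, data == fake_image)   # PNG True
```

The DBMS stores and returns the bytes untouched; interpreting the format remains the application's job, which is exactly the representation problem noted above.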
Areas where multimedia database is applied are:
• Documents and record management: industries and businesses that keep
detailed records and a variety of documents. Example: insurance claim records.
• Knowledge dissemination: Multimedia database is a very effective tool for
knowledge dissemination in terms of providing several resources. Example:
Electronic books.
• Education and training: Computer-aided learning materials can be designed
using multimedia sources which are nowadays very popular sources of
learning. Example: Digital libraries.
• Marketing, advertising, retailing, entertainment and travel. Example: a virtual
tour of cities.
• Real-time control and monitoring: coupled with active database
technology, multimedia presentation of information can be a very effective
means for monitoring and controlling complex tasks. Example: manufacturing
operation control.
Interfaces in DBMS
A database management system (DBMS) interface is a user interface that allows
users to input queries to a database without using the query language itself.
User-friendly interfaces provided by a DBMS may include the following:
1. Menu-Based Interfaces for Web Clients or Browsing –
These interfaces present the user with lists of options (called menus) that
lead the user through the formulation of a request. The basic advantage of
menus is that they remove the burden of remembering the specific
commands and syntax of a query language; instead, a query is
composed step by step by picking options from the menus
shown by the system. Pull-down menus are a very
popular technique in Web-based interfaces. They are also often used
in browsing interfaces, which allow a user to look through the contents of a
database in an exploratory and unstructured manner.
2. Forms-Based Interfaces –
A forms-based interface displays a form to the user. Users can fill out all
of the form entries to insert new data, or they can fill out only certain
entries, in which case the DBMS retrieves matching data for the
remaining entries. Such forms are usually designed and
programmed for users who have no database expertise.
Many DBMSs have forms specification languages, which are special
languages that help specify such forms.
Example: SQL*Forms is a forms-based language that specifies queries
using a form designed in conjunction with the relational database
schema.
3. Graphical User Interfaces –
A GUI typically displays a schema to the user in diagrammatic form. The
user can then specify a query by manipulating the diagram. In many
cases, GUIs utilize both menus and forms. Most GUIs use a pointing
device, such as a mouse, to pick a certain part of the displayed schema
diagram.
4. Natural Language Interfaces –
These interfaces accept requests written in English or some other language
and attempt to understand them. A natural language interface has its own
schema, which is similar to the database conceptual schema, as well as a
dictionary of important words.
The natural language interface refers to the words in its schema as well as
to the set of standard words in its dictionary to interpret the request. If the
interpretation is successful, the interface generates a high-level query
corresponding to the natural language request and submits it to the DBMS
for processing; otherwise, a dialogue is started with the user to clarify the
request. The main disadvantage of this type of interface is that its
capabilities are still limited.
5. Speech Input and Output –
Limited use of speech, whether as a spoken query or as a spoken answer
to a request, is becoming commonplace.
Applications with limited vocabularies, such as inquiries for telephone
directories, flight arrival/departure, and bank account information,
allow speech for input and output to enable ordinary users to access this
information.
Speech input is matched against a set of predefined words and used to set
up the parameters that are supplied to the queries. For output, a similar
conversion from text or numbers into speech takes place.
6. Interfaces for the DBA –
Most database systems contain privileged commands that can be used
only by the DBA's staff. These include commands for creating accounts,
setting system parameters, granting account authorization, changing a
schema, and reorganizing the storage structures of a database.
The choice of a DBMS is governed by several factors:
1. Technical
2. Economic
3. Organizational politics
We will concentrate on the economic and organizational factors that affect the
choice of DBMS. The following costs are considered while choosing a DBMS:
o Software acquisition cost –
This is the up-front cost of buying the software, including language options and
different types of interfaces. The correct DBMS version for a specific OS must be
selected. Development tools, design tools, and additional language support are
usually not included in the basic pricing.
o Maintenance cost –
This is the recurring cost of receiving standard maintenance service from the
vendor and of keeping the DBMS version up to date.
o Personnel cost –
Acquisition of DBMS software for the first time by an organization is often
accompanied by a reorganization of the data processing department. The positions
of DBA and staff exist in most companies that have adopted a DBMS.
o Training cost –
DBMSs are often complex systems, so personnel must be trained to use and
program them. Training is required at all levels, including programming,
application development and database administration.
o Operating cost –
The cost of operating the database also needs to be considered while choosing
a DBMS.
Introduction of ER Model
The ER Model is used to model the logical view of the system from the
data perspective, and it consists of these components:
Entity, Entity Type, Entity Set –
An Entity may be an object with a physical existence – a particular
person, car, house, or employee – or it may be an object with a
conceptual existence – a company, a job, or a university course.
An Entity is an object of an Entity Type, and the set of all entities is
called an entity set. e.g.; E1 is an entity of Entity Type Student, and
the set of all students is the Entity Set. In an ER diagram, an Entity
Type is represented by a rectangle.
Attribute(s):
Attributes are the properties which define the entity type.
For example, Roll_No, Name, DOB, Age, Address and Mobile_No are
the attributes which define the entity type Student. In an ER diagram,
an attribute is represented by an oval.
1. Key Attribute –
The attribute which uniquely identifies each entity in the entity
set is called the key attribute. For example, Roll_No will be unique for
each student. In an ER diagram, a key attribute is represented by an oval
with the attribute name underlined.
2. Composite Attribute –
An attribute composed of several other attributes is called a composite attribute.
For example, the Address attribute of the Student entity type consists of Street,
City, State, and Country. In an ER diagram, a composite attribute is represented
by an oval comprising other ovals.
3. Multivalued Attribute –
An attribute that can hold more than one value for a given entity.
For example, Phone_No (there can be more than one for a given
student). In an ER diagram, a multivalued attribute is represented by a
double oval.
4. Derived Attribute –
An attribute which can be derived from other attributes of the
entity type is known as derived attribute. e.g.; Age (can be derived
from DOB). In ER diagram, derived attribute is represented by
dashed oval.
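One common way to map these attribute kinds to tables is sketched below (names are illustrative): the multivalued Phone_No gets its own table, the composite Address is flattened into component columns, and the derived Age is simply not stored because it can be computed from DOB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Key attribute Roll_No becomes the primary key; composite attribute Address
# is flattened into component columns; derived attribute Age is not stored.
cur.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name    TEXT,
    dob     TEXT,
    street  TEXT, city TEXT, state TEXT, country TEXT)""")

# Multivalued attribute Phone_No goes into its own table, one row per value.
cur.execute("""CREATE TABLE student_phone (
    roll_no  INTEGER REFERENCES student(roll_no),
    phone_no TEXT,
    PRIMARY KEY (roll_no, phone_no))""")

cur.execute("INSERT INTO student (roll_no, name, dob) VALUES (1, 'Ram', '2005-01-01')")
cur.executemany("INSERT INTO student_phone VALUES (?, ?)",
                [(1, '9876500001'), (1, '9876500002')])
cur.execute("SELECT COUNT(*) FROM student_phone WHERE roll_no = 1")
print(cur.fetchone()[0])   # 2
```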
The complete entity type Student, with all of these attributes, can be drawn in a
single ER diagram.
Relationship Type and Relationship Set:
A relationship type represents the association between entity types. For example,
‘Enrolled in’ is a relationship type that exists between entity type Student and Course. In
ER diagram, relationship type is represented by a diamond and connecting the entities
with lines.
1. Unary Relationship – when only one entity set participates in the relationship,
e.g., an employee 'manages' another employee.
2. Binary Relationship – when two entity sets participate in the relationship,
e.g., Student 'Enrolled in' Course.
3. n-ary Relationship – when n entity sets participate in a relationship, the
relationship is called an n-ary relationship.
Cardinality:
The number of times an entity of an entity set participates in a relationship set is
known as cardinality. Cardinality can be of different types:
1. One to one – When each entity in each entity set can take part only once in the
relationship, the cardinality is one to one. Assume that a male can marry only one
female and a female can marry only one male; then the relationship is one to one.
2. Many to one – When entities in one entity set can take part only once in the
relationship set and entities in the other entity set can take part more than once in
the relationship set, the cardinality is many to one. Assume that a student can take
only one course but one course can be taken by many students. The cardinality is
then n to 1: for one course there can be n students, but for one student there will
be only one course.
Represented using sets, each student maps to only one course, while one course may
be taken by many students.
3. Many to many – When entities in all entity sets can take part more than once in
the relationship cardinality is many to many. Let us assume that a student can take
more than one course and one course can be taken by many students. So the
relationship will be many to many.
For example, every student in the Student entity set may participate in the
relationship, while some course C4 takes no part in the relationship at all.
Weak Entity Type and Identifying Relationship:
As discussed before, an entity type has a key attribute which uniquely identifies each
entity in the entity set. But there exist some entity types for which a key attribute
cannot be defined. These are called Weak Entity types.
For example, A company may store the information of dependents (Parents, Children,
Spouse) of an Employee. But the dependents don’t have existence without the
employee. So Dependent will be weak entity type and Employee will be Identifying
Entity type for Dependent.
A weak entity type is represented by a double rectangle. The participation of weak
entity type is always total. The relationship between weak entity type and its
identifying strong entity type is called identifying relationship and it is represented by
double diamond.
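A sketch of how a weak entity type might be realized in SQL, assuming the common mapping where the dependent's partial key is combined with the owner's key (all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")   # sqlite3 needs this to enforce FKs

cur.execute("CREATE TABLE employee (e_id INTEGER PRIMARY KEY, name TEXT)")
# DEPENDENT has no key of its own: its partial key 'name' identifies a row
# only together with the identifying employee's key.
cur.execute("""CREATE TABLE dependent (
    e_id INTEGER REFERENCES employee(e_id) ON DELETE CASCADE,
    name TEXT,
    relation TEXT,
    PRIMARY KEY (e_id, name))""")

cur.execute("INSERT INTO employee VALUES (1, 'Ram')")
cur.execute("INSERT INTO dependent VALUES (1, 'Sita', 'Spouse')")

# Dependents have no existence without the employee: deleting the
# identifying entity removes them too.
cur.execute("DELETE FROM employee WHERE e_id = 1")
cur.execute("SELECT COUNT(*) FROM dependent")
print(cur.fetchone()[0])   # 0
```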
Specialization –
In specialization, an entity is divided into sub-entities based on their characteristics.
It is a top-down approach where a higher-level entity is specialized into two or more
lower-level entities. For example, the EMPLOYEE entity in an employee management
system can be specialized into DEVELOPER, TESTER etc. as shown in Figure 2. In this
case, common attributes like E_NAME, E_SAL etc. remain part of the higher-level
entity (EMPLOYEE), and specialized attributes like TES_TYPE become part of the
specialized entity (TESTER).
Aggregation –
An ER diagram is not capable of representing a relationship between an entity and a
relationship, which may be required in some scenarios. In those cases, a relationship
together with its corresponding entities is aggregated into a higher-level entity.
Aggregation is an abstraction through which relationships can be represented as
higher-level entity sets.
For example, an employee working on a project may require some machinery, so a
REQUIRE relationship is needed between the relationship WORKS_FOR and the entity
MACHINERY. Using aggregation, the WORKS_FOR relationship with its entities
EMPLOYEE and PROJECT is aggregated into a single entity, and the relationship
REQUIRE is created between the aggregated entity and MACHINERY.
Representing aggregation via schema –
To represent aggregation, create a schema containing:
1. primary key of the aggregated relationship
2. primary key of the associated entity set
3. descriptive attribute, if exists.
Relational Model
Relational Model: Relational model represents data in the form of relations or
tables.
Relational Schema: Schema represents structure of a relation. e.g.; Relational
Schema of STUDENT relation can be represented as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE)
Relational Instance: The set of values present in a relation at a particular instance
of time is known as relational instance as shown in Table 1 and Table 2.
Attribute: Each relation is defined in terms of some properties, each of which is
known as attribute. For Example, STUD_NO, STUD_NAME etc. are attributes of
relation STUDENT.
Domain of an attribute: The set of possible values an attribute can take in a relation
is called its domain. For example, the domain of STUD_AGE can be from 18 to 40.
Tuple: Each row of a relation is known as a tuple. e.g.; the STUDENT relation given
below has 4 tuples.
NULL values: Values of some attributes for some tuples may be unknown, missing
or undefined; these are represented by NULL. Two NULL values in a relation are
considered different from each other.
Table 1 and Table 2 represent relational model having two relations STUDENT and
STUDENT_COURSE.
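The rule that two NULL values are treated as different from each other can be observed directly in SQL, where NULL = NULL does not evaluate to true (the table below is a throwaway example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# NULL = NULL evaluates to NULL (unknown), not TRUE, so two NULLs never
# match each other in a comparison.
cur.execute("SELECT NULL = NULL")
print(cur.fetchone())   # (None,)

cur.execute("CREATE TABLE t (phone TEXT)")
cur.execute("INSERT INTO t VALUES (NULL), (NULL), ('123')")
# Rows with NULL drop out of the comparison entirely.
cur.execute("SELECT COUNT(*) FROM t WHERE phone = phone")
print(cur.fetchone()[0])   # 1
```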
Relational Model in DBMS
Relational Model was proposed by E.F. Codd to model data in the form of relations or
tables. After designing the conceptual model of the database using an ER diagram, we
need to convert the conceptual model into the relational model, which can be
implemented using any RDBMS language such as Oracle SQL or MySQL. So we will
see what the Relational Model is.
What is Relational Model?
Relational Model represents how data is stored in Relational Databases. A relational
database stores data in the form of relations (tables). Consider a relation STUDENT
with attributes ROLL_NO, NAME, ADDRESS, PHONE and AGE shown in Table 1.
IMPORTANT TERMINOLOGIES
• Attribute: Attributes are the properties that define a relation. e.g.; ROLL_NO, NAME
• Relation Schema: A relation schema represents name of the relation with its
attributes. e.g.; STUDENT (ROLL_NO, NAME, ADDRESS, PHONE and AGE) is relation
schema for STUDENT. If a schema has more than 1 relation, it is called Relational
Schema.
• Tuple: Each row in the relation is known as a tuple. The above relation contains 4
tuples.
BRANCH_CODE of STUDENT can only take values which are present in
BRANCH_CODE of BRANCH; this is called a referential integrity constraint. The
relation which references another relation is called the REFERENCING RELATION
(STUDENT in this case), and the relation to which other relations refer is called the
REFERENCED RELATION (BRANCH in this case).
SUPER KEYS:
Any set of attributes that allows us to identify unique rows (tuples) in a given relation
is known as a super key. Out of these super keys, we can always choose a minimal
subset to be used as the primary key. Such minimal keys are known as Candidate
keys. If a combination of two or more attributes is used as the primary key, we call
it a Composite key.
Types of Keys in Relational Model
Candidate Key: The minimal set of attributes that can uniquely identify a tuple is
known as a candidate key. For Example, STUD_NO in STUDENT relation.
• The value of the Candidate Key is unique and non-null for every tuple.
• There can be more than one candidate key in a relation. For Example, STUD_NO
is the candidate key for relation STUDENT.
• The candidate key can be simple (having only one attribute) or composite as well.
For Example, {STUD_NO, COURSE_NO} is a composite candidate key for
relation STUDENT_COURSE.
• The maximum number of candidate keys possible in a relation with n attributes
is C(n, floor(n/2)); for example, if a relation has 5 attributes, i.e. R(A,B,C,D,E),
then the maximum number of candidate keys is C(5, 2) = 10.
Note – In SQL Server, a unique constraint on a nullable column allows the value
NULL in that column only once. That is why STUD_PHONE can be a candidate key,
but it could not serve as a primary key attribute, which can never be NULL.
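The C(n, floor(n/2)) bound above is easy to check numerically:

```python
from math import comb

# Maximum possible number of candidate keys in a relation with n attributes.
# Candidate keys are minimal, so no candidate key can contain another; the
# largest family of attribute subsets with that property has C(n, floor(n/2))
# members.
def max_candidate_keys(n: int) -> int:
    return comb(n, n // 2)

print(max_candidate_keys(5))   # 10, as for R(A, B, C, D, E) above
```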
Super Key: The set of attributes that can uniquely identify a tuple is known as
Super Key. For Example, STUD_NO, (STUD_NO, STUD_NAME), etc.
• Adding zero or more attributes to the candidate key generates the super key.
• A candidate key is a super key but vice versa is not true.
Primary Key: There can be more than one candidate key in relation out of which
one can be chosen as the primary key. For Example, STUD_NO, as well as
STUD_PHONE both, are candidate keys for relation STUDENT but STUD_NO can
be chosen as the primary key (only one out of many candidate keys).
Alternate Key: The candidate key other than the primary key is called an alternate
key. For Example, STUD_NO, as well as STUD_PHONE both, are candidate keys
for relation STUDENT but STUD_PHONE will be an alternate key (only one out of
many candidate keys).
Foreign Key: If an attribute can only take the values which are present as values
of some other attribute, it will be a foreign key to the attribute to which it refers. The
relation which is being referenced is called referenced relation and the
corresponding attribute is called referenced attribute and the relation which refers
to the referenced relation is called referencing relation and the corresponding
attribute is called referencing attribute. The referenced attribute of the referenced
relation should be the primary key to it. For Example, STUD_NO in
STUDENT_COURSE is a foreign key to STUD_NO in STUDENT relation.
It may be worth noting that, unlike the primary key of a relation, a foreign key
can be NULL and may contain duplicate values, i.e., it need not follow the
uniqueness constraint.
For Example, STUD_NO in STUDENT_COURSE relation is not unique. It has
been repeated for the first and third tuples. However, the STUD_NO in STUDENT
relation is a primary key and it needs to be always unique, and it cannot be null.
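The foreign key behavior described above (repeating values allowed, but only values that exist in the referenced attribute) can be sketched with sqlite3, where enforcement must be switched on explicitly; the tables mirror STUDENT and STUDENT_COURSE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")

cur.execute("CREATE TABLE student (stud_no INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE student_course (
    stud_no   INTEGER REFERENCES student(stud_no),
    course_no INTEGER,
    PRIMARY KEY (stud_no, course_no))""")

cur.execute("INSERT INTO student VALUES (1, 'Ram')")
# The foreign key may repeat across tuples -- no uniqueness constraint.
cur.executemany("INSERT INTO student_course VALUES (?, ?)", [(1, 101), (1, 102)])

# A value absent from the referenced attribute is rejected.
try:
    cur.execute("INSERT INTO student_course VALUES (99, 101)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)   # True
```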
1. Top-down strategy –
In this type of strategy, we start with a schema containing high-level
abstractions and then apply successive refinements, e.g., specializing an
entity type into several subclasses.
2. Bottom-up strategy –
In this type of strategy, we start with basic abstractions and then go on
building on them. For example, we may start with attributes and group
these into entity types and relationships. We can also add new relationships
among entity types as the design proceeds. The basic example is the process
of generalizing entity types into a higher-level generalized superclass.
3. Inside-Out Strategy –
This is a special case of a bottom-up strategy when attention is basically
focused on a central set of concepts that are most evident. Modeling then
basically spreads outward by considering new concepts in the vicinity of existing
ones. We could specify a few clearly evident entity types in the schema and
continue by adding other entity types and relationships that are related to each
other.
4. Mixed Strategy –
Instead of using any particular strategy throughout the design, the requirements
are partitioned according to a top-down strategy, and part of the schema is
designed for each partition according to a bottom-up strategy; after that, the
various schemas are combined.
Schema Integration in DBMS
Schema Integration is divided into the following subtasks:
1. Identifying correspondences and conflicts among the schema:
As the schemas are designed individually it is necessary to specify constructs in
the schemas that represent the same real-world concept. We must identify these
correspondences before proceeding with the integration. During this process,
several types of conflicts may occur such as:
1. Naming conflict –
Naming conflicts are of two types: synonyms and homonyms. A synonym
occurs when two schemas use different names to describe the same
concept, for example, an entity type CUSTOMER in one schema may
describe an entity type CLIENT in another schema. A homonym occurs
when two schemas use the same name to describe different concepts. For
example, an entity type Classes may represent TRAIN classes in one
schema and AEROPLANE classes in another schema.
2. Type conflicts –
A similar concept may be represented in two schemas by different
modeling constructs. For example, DEPARTMENT may be an entity type
in one schema and an attribute in another.
3. Domain conflicts –
A single attribute may have different domains in different schemas. For
example, we may declare Ssn as an integer in one schema and a
character string in another. A conflict of the unit of measure could occur if
one schema represented weight in pounds and the other used kgs.
Relational Algebra
Relational Algebra is a procedural query language which takes relations as input
and generates relations as output. Relational algebra mainly provides the theoretical
foundation for relational databases and SQL.
Operators in Relational Algebra
Projection (π)
Projection is used to project required column data from a relation. By default,
projection removes duplicate data.
Example: π(NAME, AGE)(STUDENT) returns only the NAME and AGE columns of
the STUDENT relation, with duplicates removed.
Selection (σ)
Selection is used to select required tuples of a relation.
For a relation R, σ(c>3)(R) will select the tuples which have
c greater than 3.
Note: the selection operator picks entire tuples; to display only
specific columns, the projection operator is used.
Union (U)
The union operation in relational algebra is the same as the union operation in set
theory; the only constraint is that for the union of two relations, both relations must
have the same set of attributes.
Rename (ρ)
Rename is a unary operation used for renaming attributes of a relation.
ρ(a/b)(R) will rename the attribute 'b' of relation R to 'a'.
To extract students whose age is greater than 18 from the STUDENT relation given in
Table 3, we can write σ(AGE>18)(STUDENT).
Note: If the resultant relation after a projection has duplicate rows, they are removed.
For example, ∏(ADDRESS)(STUDENT) will remove one duplicate row with the value
DELHI and return three rows.
Cross Product (X): Cross product is used to combine two relations. For every row of
Relation1, each row of Relation2 is concatenated. If Relation1 has m tuples and
Relation2 has n tuples, the cross product of Relation1 and Relation2 will have m x n
tuples. Syntax:
Relation1 X Relation2
For example, the cross product can be applied to the STUDENT relation given in
Table 1 and the STUDENT_SPORTS relation given in Table 2.
Union (U): The union of two relations R1 and R2 can only be computed if R1 and R2
are union compatible (the two relations must have the same number of attributes,
and corresponding attributes in the two relations must have the same domain). The
union operator applied to R1 and R2 gives a relation with the tuples which are either
in R1 or in R2; tuples which are in both R1 and R2 appear only once in the result
relation. Syntax:
Relation1 U Relation2
To find persons who are either students or employees, we can use the Union
operator:
STUDENT U EMPLOYEE
Minus (-): The minus of two relations R1 and R2 can only be computed if R1 and R2
are union compatible. The minus operator applied as R1 - R2 gives a relation with
the tuples which are in R1 but not in R2. Syntax:
Relation1 - Relation2
To find persons who are students but not employees, we can use the minus operator:
STUDENT - EMPLOYEE
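Because relations are sets of tuples, the operators above can be sketched directly with Python sets. The toy STUDENT and EMPLOYEE relations below are made up for illustration:

```python
# Relations as sets of (name, age) tuples.
STUDENT  = {("Ram", 19), ("Mohan", 20), ("Sita", 18)}
EMPLOYEE = {("Mohan", 20), ("Rita", 25)}

def select(rel, pred):            # sigma: pick whole tuples satisfying a predicate
    return {t for t in rel if pred(t)}

def project(rel, *cols):          # pi: pick columns; the set removes duplicates
    return {tuple(t[c] for c in cols) for t in rel}

union = STUDENT | EMPLOYEE        # requires union-compatible relations
minus = STUDENT - EMPLOYEE        # in STUDENT but not in EMPLOYEE
cross = {s + e for s in STUDENT for e in EMPLOYEE}   # m x n tuples

print(sorted(project(select(STUDENT, lambda t: t[1] > 18), 0)))
# [('Mohan',), ('Ram',)]
print(len(cross))   # 6
```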
Note: INNER is optional above; a simple JOIN is also considered an INNER JOIN.
Note: OUTER is optional above; a simple LEFT JOIN is also considered a LEFT
OUTER JOIN.
2) Right Outer Join is similar to Left Outer Join (with Right replacing Left
everywhere).
3) Full Outer Join contains the results of both the left and right outer joins.
How to find Candidate Keys and Super Keys using Attribute Closure?
• If attribute closure of an attribute set contains all attributes of relation, the attribute
set will be super key of the relation.
• If no subset of this attribute set can functionally determine all attributes of the
relation, the set will be candidate key as well. For Example, using FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY,
STUD_AGE}
(STUD_NO, STUD_NAME) will be a super key but not a candidate key, because its
subset (STUD_NO)+ already equals all attributes of the relation. So, STUD_NO will
be the candidate key.
Table 1
The FD set for EMPLOYEE relation given in Table 1 are:
{E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE}
Trivial versus Non-Trivial Functional Dependency: A trivial functional dependency
is the one which will always hold in a relation.
X->Y will always hold if X ⊇ Y
In the example given above, E-ID, E-NAME->E-ID is a trivial functional dependency and
will always hold because {E-ID,E-NAME} ⊃ {E-ID}. You can also see from the table that
for each value of {E-ID, E-NAME}, value of E-ID is unique, so {E-ID, E-NAME} functionally
determines E-ID.
If a functional dependency is not trivial, it is called Non-Trivial Functional
Dependency. Non-Trivial functional dependency may or may not hold in a relation.
e.g; E-ID->E-NAME is a non-trivial functional dependency which holds in the above
relation.
Properties of Functional Dependencies
Let X, Y, and Z be sets of attributes in a relation R. Several properties of functional
dependencies always hold in R; they are also known as Armstrong's Axioms.
1. Reflexivity: If Y is a subset of X, then X → Y. e.g., let X represent {E-ID, E-NAME} and
Y represent {E-ID}. {E-ID, E-NAME}->E-ID is true for the relation.
2. Augmentation: If X → Y, then XZ → YZ. e.g., let X represent {E-ID}, Y represent {E-
NAME} and Z represent {E-CITY}. As {E-ID}->E-NAME is true for the relation, {E-
ID,E-CITY}->{E-NAME,E-CITY} is also true.
3. Transitivity: If X → Y and Y → Z, then X → Z. e.g., let X represent {E-ID}, Y
represent {E-CITY} and Z represent {E-STATE}. As {E-ID}->{E-CITY} and {E-CITY}-
>{E-STATE} are true for the relation, {E-ID}->{E-STATE} is also true.
4. Attribute Closure: The set of attributes that can be functionally determined from an
attribute set A is called the attribute closure of A, and it is represented as A+.
Q. Find the attribute closures for R(ABCDE) with FD set {AB->C, B->D, C->E, D->A}.
To find (B)+, we add attributes to the set using the FDs, as shown in the table below.
Attributes Added in Closure FD used
{B} Triviality
{B,D} B->D
{B,D,A} D->A
{B,D,A,C} AB->C
{B,D,A,C,E} C->E
▪ We can find (C,D)+ by adding C and D to the set (triviality), then E using C->E,
then A using D->A, giving
(C,D)+ = {C,D,E,A}
▪ Similarly, we can find (B,C)+ by adding B and C to the set (triviality), then D
using B->D, then E using C->E, then A using D->A, giving
(B,C)+ = {B,C,D,E,A}
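The closure computation shown above is a simple fixed-point loop, and it can be sketched in code. A minimal sketch (not part of the original notes; attribute sets are modeled as Python sets, and the FD set is the R(ABCDE) one from the question):

```python
def closure(attrs, fds):
    """Compute the attribute closure of `attrs` under the FD set `fds`.

    `fds` is a list of (lhs, rhs) pairs of attribute sets. Starting from
    the trivial closure (the attributes themselves), keep applying any FD
    whose left-hand side is already contained in the result until nothing
    new can be added (a fixed point).
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# FD set of R(ABCDE): AB->C, B->D, C->E, D->A
fds = [({'A', 'B'}, {'C'}), ({'B'}, {'D'}), ({'C'}, {'E'}), ({'D'}, {'A'})]

print(sorted(closure({'B'}, fds)))       # ['A', 'B', 'C', 'D', 'E']
print(sorted(closure({'C', 'D'}, fds)))  # ['A', 'C', 'D', 'E']
```

The loop terminates because the result set only grows and is bounded by the relation's attribute set.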
Candidate Key
A candidate key is a minimal set of attributes of a relation that can be used to identify a
tuple uniquely. For example, each tuple of the EMPLOYEE relation given in Table 1 can be
uniquely identified by E-ID, and E-ID is minimal as well. So it is a candidate key of the
relation.
A candidate key may or may not be a primary key.
Super Key
A super key is a set of attributes of a relation that can be used to identify a tuple
uniquely. For example, each tuple of the EMPLOYEE relation given in Table 1 can be
uniquely identified by E-ID, (E-ID, E-NAME), (E-ID, E-CITY), (E-ID, E-STATE),
(E-ID, E-NAME, E-STATE), etc. So all of these are super keys of the EMPLOYEE relation.
Note: A candidate key is always a super key, but the converse is not true.
Q. Finding Candidate Keys and Super Keys of a Relation using the FD set
A set of attributes whose attribute closure is the set of all attributes of the relation is
called a super key of the relation. For example, the EMPLOYEE relation shown in Table 1
has the following FD set:
{E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE}
Let us calculate the attribute closures of different sets of attributes:
(E-ID)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-NAME)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-CITY)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-STATE)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-CITY,E-STATE)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-NAME)+ = {E-NAME}
(E-CITY)+ = {E-CITY,E-STATE}
(E-ID)+, (E-ID, E-NAME)+, (E-ID, E-CITY)+, (E-ID, E-STATE)+ and (E-ID, E-CITY, E-
STATE)+ each give the set of all attributes of the relation, so all of these are super keys
of the relation.
A minimal set of attributes whose attribute closure is the set of all attributes of the
relation is called a candidate key of the relation. As shown above, (E-ID)+ is the set of
all attributes of the relation and E-ID is minimal, so E-ID is a candidate key. On the
other hand, (E-ID, E-NAME)+ is also the set of all attributes, but (E-ID, E-NAME) is not
minimal because the closure of its subset (E-ID) already equals the set of all attributes.
So (E-ID, E-NAME) is not a candidate key.
Q. Let us take an example to show the relationship between two FD sets. A relation
R(A,B,C,D) has two FD sets FD1 = {A->B, B->C, AB->D} and FD2 = {A->B, B->C, A->C,
A->D}.
Step 1. Checking whether all FDs of FD1 are present in FD2
• A->B in set FD1 is present in set FD2.
• B->C in set FD1 is also present in set FD2.
• AB->D is present in set FD1 but not directly in FD2, so we check whether
we can derive it. For set FD2, (AB)+ = {A,B,C,D}, which means AB can
functionally determine A, B, C and D. So AB->D also holds in set FD2.
As all FDs in set FD1 also hold in set FD2, FD2 ⊃ FD1 is true.
Step 2. Checking whether all FDs of FD2 are present in FD1
• A->B in set FD2 is present in set FD1.
• B->C in set FD2 is also present in set FD1.
• A->C is present in FD2 but not directly in FD1, so we check whether we can
derive it. For set FD1, (A)+ = {A,B,C,D}, which means A can functionally
determine A, B, C and D. So A->C also holds in set FD1.
• A->D is present in FD2 but not directly in FD1, so we check whether we can
derive it. For set FD1, (A)+ = {A,B,C,D}, which means A can functionally
determine A, B, C and D. So A->D also holds in set FD1.
As all FDs in set FD2 also hold in set FD1, FD1 ⊃ FD2 is true.
Step 3. As FD2 ⊃ FD1 and FD1 ⊃ FD2 are both true, FD1 = FD2. These two FD sets
are semantically equivalent.
Q. Let us take another example to show the relationship between two FD sets. A
relation R2(A,B,C,D) has two FD sets FD1 = {A->B, B->C, A->C} and FD2 = {A->B,
B->C, A->D}.
Step 1. Checking whether all FDs of FD1 are present in FD2
• A->B in set FD1 is present in set FD2.
• B->C in set FD1 is also present in set FD2.
• A->C is present in FD1 but not directly in FD2, so we check whether we
can derive it. For set FD2, (A)+ = {A,B,C,D}, which means A can
functionally determine A, B, C and D. So A->C also holds in set FD2.
As all FDs in set FD1 also hold in set FD2, FD2 ⊃ FD1 is true.
Step 2. Checking whether all FDs of FD2 are present in FD1
• A->B in set FD2 is present in set FD1.
• B->C in set FD2 is also present in set FD1.
• A->D is present in FD2 but not directly in FD1, so we check whether we
can derive it. For set FD1, (A)+ = {A,B,C}, which means A cannot
functionally determine D. So A->D does not hold in FD1.
As not all FDs in set FD2 hold in set FD1, FD2 ⊄ FD1.
Step 3. In this case, FD2 ⊃ FD1 but FD2 ⊄ FD1, so these two FD sets are not
semantically equivalent.
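The two-step check above — does every FD of one set follow from the other, in both directions — is just a closure test per FD, so it is easy to mechanize. A minimal sketch, using the two examples from the text:

```python
def closure(attrs, fds):
    """Attribute closure under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def covers(fds_a, fds_b):
    """True if every FD of fds_b can be derived from fds_a."""
    return all(rhs <= closure(lhs, fds_a) for lhs, rhs in fds_b)

def equivalent(fds_a, fds_b):
    # Semantic equivalence: each set covers the other.
    return covers(fds_a, fds_b) and covers(fds_b, fds_a)

# R(A,B,C,D): FD1 = {A->B, B->C, AB->D}, FD2 = {A->B, B->C, A->C, A->D}
fd1 = [({'A'}, {'B'}), ({'B'}, {'C'}), ({'A', 'B'}, {'D'})]
fd2 = [({'A'}, {'B'}), ({'B'}, {'C'}), ({'A'}, {'C'}), ({'A'}, {'D'})]
print(equivalent(fd1, fd2))  # True

# R2(A,B,C,D): FD1 = {A->B, B->C, A->C}, FD2 = {A->B, B->C, A->D}
fd1b = [({'A'}, {'B'}), ({'B'}, {'C'}), ({'A'}, {'C'})]
fd2b = [({'A'}, {'B'}), ({'B'}, {'C'}), ({'A'}, {'D'})]
print(equivalent(fd1b, fd2b))  # False: A->D does not follow from FD1
```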
Database Normalization
Database normalization is the process of organizing the attributes of the database
to reduce or eliminate data redundancy (having the same data but at different
places).
Problems because of data redundancy
Data redundancy unnecessarily increases the size of the database as the same
data is repeated in many places. Inconsistency problems also arise during insert,
delete and update operations.
Functional Dependency
A functional dependency is a constraint between two sets of attributes of a relation in
a database. A functional dependency is denoted by an arrow (→). If an attribute A
functionally determines B, then it is written as A → B.
For example, employee_id → name means employee_id functionally determines
the name of the employee. As another example in a timetable database,
{student_id, time} → {lecture_room}, student ID and time determine the lecture
room where the student should be.
What does functionally dependent mean?
A functional dependency A → B means that for every value of A there is exactly one
associated value of B: any two rows that agree on A must also agree on B.
For example, in the below table A → B is true, but B → A is not true as there are
different values of A for B = 3.
A B
------
1 3
2 3
4 0
1 3
4 0
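Whether a dependency holds in a given table instance can be checked directly from this definition: rows agreeing on X must agree on Y. A minimal sketch over the rows above:

```python
def fd_holds(rows, x, y):
    """Check whether X -> Y holds in a table instance.

    `rows` is a list of dicts; for every combination of values in the X
    columns there must be exactly one combination of values in the Y
    columns.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in x)
        val = tuple(row[c] for c in y)
        if seen.setdefault(key, val) != val:
            return False  # same X value, different Y values
    return True

rows = [{'A': 1, 'B': 3}, {'A': 2, 'B': 3}, {'A': 4, 'B': 0},
        {'A': 1, 'B': 3}, {'A': 4, 'B': 0}]

print(fd_holds(rows, ['A'], ['B']))  # True:  A -> B holds
print(fd_holds(rows, ['B'], ['A']))  # False: B = 3 maps to both A = 1 and A = 2
```

Note that such a check can only refute an FD for a particular instance; whether an FD holds in general is a design decision about the schema.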
Trivial Functional Dependency
X → Y is trivial only when Y is subset of X.
Examples:
ABC → AB
ABC → A
ABC → ABC
Non-Trivial Functional Dependencies
X → Y is a non-trivial functional dependency when Y is not a subset of X.
X → Y is called completely non-trivial when X ∩ Y is empty.
Example:
Id → Name,
Name → DOB
Semi Non-Trivial Functional Dependencies
X → Y is called semi non-trivial when X ∩ Y is not empty.
Examples:
AB → BC,
AD → DC
Example 2 –
ID Name Courses
------------------
1 A c1, c2
2 E c3
3 M c2, c3
In the above table, Course is a multi-valued attribute, so the relation is not in 1NF.
The table below is in 1NF, as there is no multi-valued attribute.
ID Name Course
------------------
1 A c1
1 A c2
2 E c3
3 M c2
3 M c3
To be in second normal form, a relation must be in first normal form and must not
contain any partial dependency. That is, a relation is in 2NF if no non-prime attribute
(an attribute that is not part of any candidate key) depends on any proper subset of
any candidate key of the table.
Partial Dependency – If a proper subset of a candidate key determines a non-prime
attribute, it is called a partial dependency.
Example 1 – Consider Table 3 below.
STUD_NO COURSE_NO COURSE_FEE
1 C1 1000
2 C2 1500
1 C4 2000
4 C3 1000
4 C1 1000
2 C5 2000
(Note that several courses have the same course fee.)
Here,
COURSE_FEE cannot alone decide the value of COURSE_NO or STUD_NO;
COURSE_FEE together with STUD_NO cannot decide the value of COURSE_NO;
COURSE_FEE together with COURSE_NO cannot decide the value of STUD_NO;
Hence,
COURSE_FEE is a non-prime attribute, as it does not belong to the only
candidate key {STUD_NO, COURSE_NO};
but COURSE_NO -> COURSE_FEE, i.e., COURSE_FEE depends on COURSE_NO,
which is a proper subset of the candidate key. A non-prime attribute depending
on a proper subset of the candidate key is a partial dependency, so this relation is
not in 2NF.
To convert the above relation to 2NF, we split it into two tables:
Table 1: STUD_NO, COURSE_NO
Table 2: COURSE_NO, COURSE_FEE
Table 1:
STUD_NO COURSE_NO
1 C1
2 C2
1 C4
4 C3
4 C1
2 C5

Table 2:
COURSE_NO COURSE_FEE
C1 1000
C2 1500
C3 1000
C4 2000
C5 2000
NOTE: 2NF reduces the redundant data stored in the database. For instance, if 100
students take course C1, we do not need to store its fee of 1000 in all 100 records;
instead, we store it once in the second table as the course fee for C1.
Example 2 – Consider following functional dependencies in relation R (A, B, C, D)
AB -> C [A and B together determine C]
BC -> D [B and C together determine D]
In the above relation, AB is the only candidate key and there is no partial
dependency, i.e., any proper subset of AB doesn’t determine any non-
prime attribute.
Transitive dependency – If A->B and B->C are two FDs, then the derived FD A->C is
called a transitive dependency.
Example 1 – In relation STUDENT given in Table 4,
FD set: {STUD_NO -> STUD_NAME, STUD_NO -> STUD_STATE, STUD_STATE
-> STUD_COUNTRY, STUD_NO -> STUD_AGE}
Candidate Key: {STUD_NO}
For this relation in Table 4, STUD_NO -> STUD_STATE and STUD_STATE ->
STUD_COUNTRY hold, so STUD_COUNTRY is transitively dependent on
STUD_NO. This violates the third normal form. To convert it to third normal form, we
decompose the relation STUDENT (STUD_NO, STUD_NAME, STUD_PHONE,
STUD_STATE, STUD_COUNTRY, STUD_AGE) as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_AGE)
STATE_COUNTRY (STATE, COUNTRY)
A relation R is in BCNF if R is in third normal form and, for every FD, the LHS is a
super key. Equivalently, a relation is in BCNF iff in every non-trivial functional
dependency X -> Y, X is a super key.
Example 1 – Find the highest normal form of a relation R(A,B,C,D,E) with FD set
as {BC->D, AC->BE, B->E}
Step 1. As we can see, (AC)+ = {A,C,B,E,D}, but no proper subset of AC determines
all attributes of the relation, so AC is a candidate key. Neither A nor C can be derived
from any other attribute of the relation, so {AC} is the only candidate key.
Step 2. Prime attributes are those that are part of a candidate key: {A, C} in
this example. The others are non-prime: {B, D, E} in this example.
Step 3. The relation R is in 1st normal form, as a relational DBMS does not allow
multi-valued or composite attributes.
The relation is in 2nd normal form: BC->D does not violate 2NF (BC is not a proper
subset of the candidate key AC), AC->BE does not (AC is the candidate key), and
B->E does not (B is not a proper subset of the candidate key AC).
The relation is not in 3rd normal form: in BC->D, neither is BC a super key nor is D a
prime attribute, and in B->E, neither is B a super key nor is E a prime attribute. But to
satisfy 3rd normal form, either the LHS of an FD must be a super key or its RHS must
consist of prime attributes.
So the highest normal form of the relation is 2nd normal form.
Multivalued Dependency:
Person ->-> mobile,
Person ->-> food_likes
This is read as “person multidetermines mobile” and “person multidetermines
food_likes.”
Note that a functional dependency is a special case of multivalued dependency: in
a functional dependency X -> Y, every x determines exactly one y, never more
than one.
Fourth normal form (4NF):
Fourth normal form (4NF) is a level of database normalization where there are no
non-trivial multivalued dependencies other than on a candidate key. It builds on the
first three normal forms (1NF, 2NF and 3NF) and the Boyce-Codd Normal Form
(BCNF). It states that, in addition to meeting the requirements of BCNF, a table
must not contain any non-trivial multivalued dependency.
Properties – A relation R is in 4NF if and only if the following conditions are
satisfied:
1. It is in Boyce-Codd Normal Form (BCNF).
2. The table has no non-trivial multivalued dependency.
A table with a multivalued dependency violates the normalization standard of
Fourth Normal Form (4NF) because it creates unnecessary redundancies and can
contribute to inconsistent data. To bring such a table up to 4NF, it is necessary to
break the information into two tables.
Example – Consider the database of a class, which has two relations: R1
contains student ID (SID) and student name (SNAME), and R2 contains course
ID (CID) and course name (CNAME).
Table – R1(SID, SNAME) Table – R2(CID, CNAME)
When their cross product is taken, it results in multivalued
dependencies: Table – R1 X R2
Fifth Normal Form / Projected Normal Form (5NF):
A relation R is in 5NF if and only if every join dependency in R is implied by the
candidate keys of R. A relation decomposed into two relations must have loss-less
join Property, which ensures that no spurious or extra tuples are generated, when
relations are reunited through a natural join.
Properties – A relation R is in 5NF if and only if it satisfies the following conditions:
1. R is already in 4NF.
2. It cannot be further losslessly decomposed (it has no join dependency not
implied by its candidate keys).
Example – Consider the above schema, with a case as “if a company makes a
product and an agent is an agent for that company, then he always sells that
product for the company”. Under these circumstances, the ACP table is shown as:
Table – ACP
The relation ACP is again decomposed into three relations, R1, R2 and R3.
Now, the natural join of all three relations is computed as follows: the result of
the natural join of R1 and R3 over ‘Company’, followed by the natural join of R13
and R2 over ‘Agent’ and ‘Product’, is the table ACP.
Hence, in this example, all the redundancies are eliminated, and the
decomposition of ACP is a lossless join decomposition. Therefore, the relation is in
5NF as it does not violate the property of lossless join.
Concurrency Control in DBMS
Concurrency Control deals with the interleaved execution of more than one
transaction: what serializability means, and how to determine whether a schedule
is serializable.
What is a Transaction?
A transaction is a set of logically related operations executed as a single unit of work.
Its two basic operations on a data item A are:
Read(A): The read operation Read(A) or R(A) reads the value of A from the database and
stores it in a buffer in main memory.
Write(A): The write operation Write(A) or W(A) writes the value back to the database
from the buffer.
(Note: a write does not always reach the database immediately; it may only update the
buffer, which is why dirty reads come into the picture.)
Let us take a debit transaction from an account which consists of following operations:
1. R(A);
2. A=A-1000;
3. W(A);
But a transaction may also fail after executing some of its operations. The failure can
be due to hardware, software or power problems. For example, if the debit transaction
discussed above fails after executing operation 2, the value of A will remain 5000 in
the database, which is not acceptable to the bank. To avoid this, the database has two
important operations:
Commit: After all instructions of a transaction are successfully executed, the changes
made by transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all operations successfully, all the
changes made by transaction are undone.
Properties of a transaction (ACID)
Atomicity:
By this, we mean that either the entire transaction takes place at once or it does not
happen at all. There is no midway, i.e., transactions do not occur partially. Each
transaction is treated as one unit and either runs to completion or is not executed
at all. It involves the following two operations:
— Abort: If a transaction aborts, changes made to the database are not visible.
— Commit: If a transaction commits, the changes made are visible.
Atomicity is also known as the ‘All or nothing’ rule.
Consider the following transaction T consisting of T1 and T2: a transfer of 100 from
account X to account Y.
If the transaction fails after completion of T1 but before completion of T2 (say,
after write(X) but before write(Y)), then the amount has been deducted from X but not
added to Y. This results in an inconsistent database state. Therefore, the
transaction must be executed in its entirety to ensure the correctness of the
database state.
Consistency:
This means that integrity constraints must be maintained so that the database is
consistent before and after the transaction. It refers to the correctness of a
database. Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency occurs if T1 completes
but T2 fails, leaving T incomplete.
If the operations of debit and credit transactions on the same account are executed
concurrently, the database may be left in an inconsistent state.
• For example, with T1 (debit of Rs. 1000 from A) and T2 (credit of Rs. 500 to A)
executing concurrently, the database can reach an inconsistent state.
• Let us assume Account balance of A is Rs. 5000. T1 reads A(5000) and stores the
value in its local buffer space. Then T2 reads A(5000) and also stores the value in its
local buffer space.
• T1 performs A=A-1000 (5000-1000=4000) and 4000 is stored in T1 buffer space.
Then T2 performs A=A+500 (5000+500=5500) and 5500 is stored in T2 buffer space.
T1 writes the value from its buffer back to database.
• A’s value is updated to 4000 in database and then T2 writes the value from its buffer
back to database. A’s value is updated to 5500 which shows that the effect of debit
transaction is lost and database has become inconsistent.
• To maintain consistency of the database, we need concurrency control
protocols. The operations of T1 and T2 with their buffers and the database are
shown in Table 1.
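The lost-update scenario above can be reproduced literally with two local buffers; a minimal sketch:

```python
# Shared "database" value and two transactions with private buffers.
db = {'A': 5000}

buf_t1 = db['A']        # T1 reads A into its buffer (5000)
buf_t2 = db['A']        # T2 reads A into its buffer (5000)

buf_t1 = buf_t1 - 1000  # T1 debits 1000 locally -> 4000
buf_t2 = buf_t2 + 500   # T2 credits 500 locally -> 5500

db['A'] = buf_t1        # T1 writes back: A = 4000
db['A'] = buf_t2        # T2 writes back: A = 5500, T1's debit is lost

print(db['A'])  # 5500 instead of the correct 4500
```

A concurrency control protocol would force T2 either to wait for T1's write or to re-read A, so the final value would be 4500.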
Isolation:
This property ensures that multiple transactions can occur concurrently without
leading to inconsistency of the database state. Transactions occur independently,
without interference. Changes made in a particular transaction are not visible to any
other transaction until that change has been written to memory or committed. This
property ensures that concurrent execution of transactions results in a state
equivalent to one that would be obtained if they were executed serially in some
order.
Let X = 50,000 and Y = 500, and consider two transactions T and T'', where T
transfers 50 from X to Y and T'' computes the sum X + Y.
Suppose T has executed till Write(X) (so X is now 49,950) when T'' starts. Due to
this interleaving of operations, T'' reads the updated value of X but the old value of
Y, and the sum computed by
T'': (X+Y = 49,950 + 500 = 50,450)
is thus not consistent with the sum at the end of transaction
T: (X+Y = 49,950 + 550 = 50,500).
This results in database inconsistency, as 50 units appear to be lost. Hence,
transactions must take place in isolation, and changes should be visible only after
they have been written to the main memory.
Durability:
This property ensures that once the transaction has completed execution, the
updates and modifications to the database are stored in and written to disk and they
persist even if a system failure occurs. These updates now become permanent and
are stored in non-volatile memory. The effects of the transaction, thus, are never
lost.
Once the database has committed a transaction, the changes made by the transaction
must be permanent. For example, if a person has credited $500,000 to his account,
the bank cannot say that the update has been lost. To avoid this problem, multiple
copies of the database are stored at different locations.
What is a Schedule?
A schedule is a series of operations from one or more transactions. A schedule
can be of two types:
• Serial Schedule: When one transaction completely executes before another
transaction starts, the schedule is called a serial schedule. A serial schedule is
always consistent. e.g., if a schedule S has debit transaction T1 and credit
transaction T2, the possible serial schedules are T1 followed by T2 (T1->T2) or T2
followed by T1 (T2->T1). A serial schedule has low throughput and low resource
utilization.
• Concurrent Schedule: When the operations of a transaction are interleaved with
operations of other transactions in a schedule, the schedule is called a concurrent
schedule. e.g., the schedule of the debit and credit transactions shown in Table 1 is
concurrent. But concurrency can lead to inconsistency in the database, as the above
example of a concurrent schedule demonstrates.
So, by now we have been introduced to the types of locks and how to apply
them. But wait: if our problems could be avoided just by applying locks,
life would be simple! If you have studied Process Synchronization under
OS, you must be familiar with two persistent problems: starvation and
deadlock! We will discuss them shortly, but the point is that locks must
follow a set of protocols to avoid such undesirable situations. Shortly
we will use Two-Phase Locking (2-PL), which builds the concept of locks
into a protocol. Simple locking alone may not produce serializable
results and may lead to deadlock or inconsistency.
Problem With Simple Locking…
Consider the Partial Schedule:
Deadlock – consider the above execution phase. Now, T1 holds an
exclusive lock on B, and T2 holds a shared lock on A. In statement 7,
T2 requests a lock on B, while in statement 8 T1 requests a lock on A.
As you may notice, this creates a deadlock, as neither transaction can
proceed with its execution.
Starvation – is also possible if the concurrency control manager is badly
designed. For example, a transaction may be waiting for an X-lock on an
item while a sequence of other transactions request and are granted an
S-lock on the same item. This can be avoided if the concurrency control
manager is properly designed.
Let’s look at an example based on the above Database Graph. We have three
transactions in this schedule. This is a skeleton example, i.e., we will only see
how locking and unlocking work; let’s keep it simple and not add operations on data.
From the above example, first see that the schedule is conflict serializable.
Serializability for the locks can be written as T2 -> T1 -> T3.
The data items locked and unlocked follow the rule given above and respect the
Database Graph.
Advantages –
• Ensures a conflict-serializable schedule.
• Ensures a deadlock-free schedule.
• Unlocking can be done at any time.
Along with these advantages come some disadvantages.
Disadvantages –
• Unnecessary locking overhead may occur; for example, if we want both D
and E, we still have to lock B to follow the protocol.
• Cascading rollbacks are still a problem. The protocol imposes no rule on
when the unlock operation may occur, so this problem persists.
Overall, this protocol is mostly known and used for its unique way of achieving
deadlock freedom.
DBMS IMPORTANT QUESTIONS
1. What is Database?
A database is an organized collection of data, stored and retrieved digitally from a remote or local
computer system. Databases can be vast and complex, and such databases are developed using
fixed design and modeling approaches.
2. What is DBMS?
DBMS stands for Database Management System. A DBMS is system software responsible for the
creation, retrieval, update, and management of the database. It ensures that our data is
consistent, organized, and easily accessible by serving as an interface between the database and
its end-users or application software.
3. What is RDBMS?
RDBMS stands for Relational Database Management System. The key difference here, compared to
DBMS, is that RDBMS stores data in the form of a collection of tables, and relations can be defined
between the common fields of these tables. Most modern database management systems like
MySQL, Microsoft SQL Server, Oracle, IBM DB2, and Amazon Redshift are based on RDBMS.
4. What is SQL?
SQL stands for Structured Query Language. It is the standard language for relational database
management systems. It is especially useful in handling organized data comprised of entities
(variables) and relations between different entities of the data.
5. What is the difference between SQL and MySQL?
SQL is a standard language for retrieving and manipulating structured databases. By contrast,
MySQL is a relational database management system, like SQL Server, Oracle or IBM DB2, that is
used to manage SQL databases.
6. What are Tables and Fields?
A table is an organized collection of data stored in the form of rows and columns. Columns can be
seen as vertical and rows as horizontal. The columns in a table are called fields, while the
rows are referred to as records.
7. What are Constraints in SQL?
Constraints are used to specify rules concerning the data in the table. They can be applied to single
or multiple fields in an SQL table, either during the creation of the table or afterwards using the
ALTER TABLE command. The constraints are:
• NOT NULL - Restricts NULL value from being inserted into a column.
• CHECK - Verifies that all values in a field satisfy a condition.
• DEFAULT - Automatically assigns a default value if no value has been specified for the field.
• UNIQUE - Ensures unique values to be inserted into the field.
• INDEX - Indexes a field providing faster retrieval of records.
• PRIMARY KEY - Uniquely identifies each record in a table.
• FOREIGN KEY - Ensures referential integrity for a record in another table.
8. What is a Join? List its different types.
The SQL Join clause is used to combine records (rows) from two or more tables in a SQL
database based on a related column between them.
• (INNER) JOIN: Retrieves records that have matching values in both tables involved in the join.
This is the most widely used type of join in queries.
SELECT *
FROM Table_A A
JOIN Table_B B
ON A.col = B.col;

SELECT *
FROM Table_A A
INNER JOIN Table_B B
ON A.col = B.col;
• LEFT (OUTER) JOIN: Retrieves all the records/rows from the left and the matched records/rows
from the right table.
SELECT *
FROM Table_A A
LEFT JOIN Table_B B
ON A.col = B.col;
• RIGHT (OUTER) JOIN: Retrieves all the records/rows from the right and the matched
records/rows from the left table.
SELECT *
FROM Table_A A
RIGHT JOIN Table_B B
ON A.col = B.col;
• FULL (OUTER) JOIN: Retrieves all the records where there is a match in either the left or right
table.
SELECT *
FROM Table_A A
FULL JOIN Table_B B
ON A.col = B.col;
9. What is a Self-Join?
A self JOIN is a case of regular join where a table is joined to itself based on some relation
between its own column(s). Self-join uses the INNER JOIN or LEFT JOIN clause and a table alias is
used to assign different names to the table within the query.
10. What is a Cross-Join?
A cross join can be defined as a cartesian product of the two tables included in the join. The table
after the join contains the same number of rows as the cross-product of the number of rows in the
two tables. If a WHERE clause is used with a cross join, the query will behave like an INNER JOIN.
11. What is an Index?
A database index is a data structure that provides a quick lookup of data in a column or columns
of a table. It enhances the speed of operations accessing data from a database table at the cost
of additional writes and memory to maintain the index data structure.
There are different types of indexes that can be created for different purposes:
Unique indexes are indexes that help maintain data integrity by ensuring that no two rows of
data in a table have identical key values. Once a unique index has been defined for a table,
uniqueness is enforced whenever keys are added or changed within the index.
A clustered index is an index for which the physical order of rows in the database corresponds to
the order of rows in the index. This is why only one clustered index can exist in a given table,
whereas multiple non-clustered indexes can exist in the table.
The only difference between clustered and non-clustered indexes is that the database manager
attempts to keep the data in the database in the same order as the corresponding keys appear in
the clustered index.
Clustering indexes can improve the performance of most query operations because they provide
a linear-access path to data stored in the database.
As explained above, the differences can be broken down into three small factors -
• Clustered index modifies the way records are stored in a database based on the indexed column.
A non-clustered index creates a separate entity within the table which references the original
table.
• Clustered index is used for easy and speedy retrieval of data from the database, whereas, fetching
records from the non-clustered index is relatively slower.
• In SQL, a table can have a single clustered index whereas it can have multiple non-clustered
indexes.
Data Integrity is the assurance of accuracy and consistency of data over its entire life-cycle and is
a critical aspect of the design, implementation, and usage of any system which stores, processes,
or retrieves data. It also defines integrity constraints to enforce business rules on the data when
it is entered into an application or a database.
A query is a request for data or information from a database table or combination of tables. A
database query can be either a select query or an action query.
A subquery is a query within another query, also known as a nested query or inner query. It is
used to restrict or enhance the data to be queried by the main query, thus restricting or
enhancing the output of the main query respectively. For example, a subquery can fetch the
contact information for students who have enrolled for the maths subject.
• A correlated subquery cannot be considered as an independent query, but it can refer to the
column in a table listed in the FROM of the main query.
• A non-correlated subquery can be considered as an independent query and the output of the
subquery is substituted in the main query.
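A non-correlated subquery can be tried directly with SQLite from Python; the students/enrollments schema and all names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE enrollments (student_id INTEGER, subject TEXT);
    INSERT INTO students VALUES (1, 'Asha', 'asha@example.com'),
                                (2, 'Ben', 'ben@example.com');
    INSERT INTO enrollments VALUES (1, 'Maths'), (2, 'History');
""")

# The inner query runs independently; its result feeds the outer query.
rows = conn.execute("""
    SELECT name, email FROM students
    WHERE id IN (SELECT student_id FROM enrollments WHERE subject = 'Maths')
""").fetchall()
print(rows)  # [('Asha', 'asha@example.com')]
```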
SELECT operator in SQL is used to select data from a database. The data returned is stored in a
result table, called the result-set.
Some common SQL clauses used in conjunction with a SELECT query are as follows:
• WHERE clause in SQL is used to filter records that are necessary, based on specific conditions.
• ORDER BY clause in SQL is used to sort the records based on some field(s) in ascending (ASC) or
descending order (DESC).
SELECT *
FROM myDB.students
WHERE graduation_year = 2019
ORDER BY studentID DESC;
• GROUP BY clause in SQL is used to group records with identical data and can be used in
conjunction with some aggregation functions to produce summarized results from the database.
• HAVING clause in SQL is used to filter records in combination with the GROUP BY clause. It is
different from WHERE, since the WHERE clause cannot filter aggregated records.
The UNION operator combines and returns the result-set retrieved by two or more SELECT
statements.
The MINUS operator in SQL returns those rows of the first SELECT query's result-set that do
not appear in the result-set of the second SELECT query.
The INTERSECT clause in SQL combines the result-sets fetched by two SELECT statements,
returning only the records common to both.
Certain conditions need to be met before executing either of the above statements in SQL -
• Each SELECT statement within the clause must have the same number of columns
• The columns must also have compatible data types
• The columns in each SELECT statement should necessarily have the same order
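Assuming two compatible tables, online_students and campus_students, the three operators can be sketched as follows (MINUS is Oracle syntax; most other RDBMSs, including PostgreSQL, use EXCEPT instead):

```sql
SELECT name FROM online_students
UNION
SELECT name FROM campus_students;     -- all distinct names from both sets

SELECT name FROM online_students
MINUS
SELECT name FROM campus_students;     -- names only in the first result-set

SELECT name FROM online_students
INTERSECT
SELECT name FROM campus_students;     -- names present in both result-sets
```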
A database cursor is a control structure that allows for the traversal of records in a database.
Cursors also facilitate processing after traversal, such as retrieval, addition, and deletion
of database records. They can be viewed as a pointer to one row in a set of rows.
1. DECLARE a cursor after any variable declaration. The cursor declaration must always be associated
with a SELECT Statement.
2. OPEN the cursor to initialize the result set. The OPEN statement must be called before fetching rows
from the result set.
3. FETCH statement to retrieve and move to the next row in the result set.
4. Call the CLOSE statement to deactivate the cursor.
5. Finally use the DEALLOCATE statement to delete the cursor definition and release the associated
resources.
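The five steps above can be sketched in T-SQL-style syntax (cursor, table, and column names are illustrative):

```sql
DECLARE @name VARCHAR(50);

-- 1. Declare the cursor, associated with a SELECT statement
DECLARE student_cursor CURSOR FOR
    SELECT name FROM students;

-- 2. Open the cursor to initialize the result set
OPEN student_cursor;

-- 3. Fetch rows one at a time until the result set is exhausted
FETCH NEXT FROM student_cursor INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @name;
    FETCH NEXT FROM student_cursor INTO @name;
END;

-- 4. Close the cursor, then 5. release its resources
CLOSE student_cursor;
DEALLOCATE student_cursor;
```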
Entity: An entity can be a real-world object, either tangible or intangible, that can be easily
identifiable. For example, in a college database, students, professors, workers, departments, and
projects can be referred to as entities. Each entity has some associated properties that provide it
an identity.
Relationships: Relations or links between entities that have something to do with each other.
For example - The employee's table in a company's database can be associated with the salary
table in the same database.
• One-to-One - This can be defined as the relationship between two tables where each record in
one table is associated with the maximum of one record in the other table.
• One-to-Many & Many-to-One - This is the most commonly used relationship where a record in a
table is associated with multiple records in the other table.
• Many-to-Many - This is used in cases when multiple instances on both sides are needed for
defining a relationship.
• Self-Referencing Relationships - This is used when a table needs to define a relationship with
itself.
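A self-referencing relationship can be sketched with a hypothetical employees table in which each row points at a manager row in the same table:

```sql
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name        VARCHAR(100),
    manager_id  INT REFERENCES employees(employee_id)  -- self-reference
);
```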
An alias is a feature of SQL that is supported by most, if not all, RDBMSs. It is a temporary name
assigned to the table or table column for the purpose of a particular SQL query. In addition,
aliasing can be employed as an obfuscation technique to secure the real names of database
fields. A table alias is also called a correlation name.
An alias is represented explicitly by the AS keyword but in some cases, the same can be
performed without it as well. Nevertheless, using the AS keyword is always a good practice.
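A short sketch of column and table aliases (the second query omits AS for one table alias, which most RDBMSs allow; table names are assumptions):

```sql
-- Column alias with AS
SELECT first_name AS name
FROM students;

-- Table aliases, with and without AS
SELECT s.first_name, e.subject
FROM students AS s
JOIN enrollments e ON e.student_id = s.student_id;
```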
A view in SQL is a virtual table based on the result-set of an SQL statement. A view contains rows
and columns, just like a real table. The fields in a view are fields from one or more real tables in
the database.
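A minimal sketch, reusing the students table from the earlier examples:

```sql
-- Create a view over a subset of the table's rows
CREATE VIEW graduating_students AS
SELECT student_id, first_name
FROM students
WHERE graduation_year = 2019;

-- The view is then queried like a real table
SELECT * FROM graduating_students;
```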
21. What is Normalization?
Normalization represents the way of organizing structured data in the database efficiently. It
includes the creation of tables, establishing relationships between them, and defining rules for
those relationships. Inconsistency and redundancy can be kept in check based on these rules,
hence, adding flexibility to the database.
Normal Forms are used to eliminate or reduce redundancy in database tables. The different
forms are as follows:
Students Table

| Student | Address | Books Issued | Salutation |
|---|---|---|---|
| Sara | Amanora Park Town 94 | Until the Day I Die (Emily Carpenter), Inception (Christopher Nolan) | Ms. |
| Ansh | 62nd Sector A-10 | The Alchemist (Paulo Coelho), Inferno (Dan Brown) | Mr. |
As we can observe, the Books Issued field has more than one value per record, and to convert it
into 1NF, this has to be resolved into separate individual records for each book issued. Check the
following table in 1NF form -
| Student | Address | Books Issued | Salutation |
|---|---|---|---|
| Sara | Amanora Park Town 94 | Until the Day I Die (Emily Carpenter) | Ms. |
| Sara | 24th Street Park Avenue | Beautiful Bad (Annie Ward) | Mrs. |
A relation is in second normal form if it satisfies the conditions for the first normal form and does
not contain any partial dependency. A relation in 2NF has no partial dependency, i.e., it has no
non-prime attribute that depends on any proper subset of any candidate key of the table. Often,
specifying a single column Primary Key is the solution to the problem. Examples -
Example 1 - Consider the above example. As we can observe, the Students Table in the 1NF
form has a candidate key in the form of [Student, Address] that can uniquely identify all records
in the table. The field Books Issued (non-prime attribute) depends partially on the Student field.
Hence, the table is not in 2NF. To convert it into the 2nd Normal Form, we will partition the
tables into two while specifying a new Primary Key attribute to identify the individual records in
the Students table. The Foreign Key constraint will be set on the other table to ensure referential
integrity.
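The 2NF split described above can be sketched as follows (column types are illustrative assumptions):

```sql
-- Students table with a new single-column primary key
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100),
    address    VARCHAR(200)
);

-- Books Issued moves to its own table; the foreign key preserves
-- referential integrity back to the students table
CREATE TABLE books_issued (
    issue_id   INT PRIMARY KEY,
    student_id INT REFERENCES students(student_id),
    book       VARCHAR(200)
);
```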
Here, WX is the only candidate key and there is no partial dependency, i.e., any proper subset of
WX doesn’t determine any non-prime attribute in the relation.
A relation is said to be in the third normal form, if it satisfies the conditions for the second
normal form and there is no transitive dependency between the non-prime attributes, i.e., all
non-prime attributes are determined only by the candidate keys of the relation and not by any
other non-prime attribute.
Example 1 - Consider the Students Table in the above example. As we can observe, the Students
Table in the 2NF form has a single candidate key Student_ID (primary key) that can uniquely
identify all records in the table. The field Salutation (non-prime attribute), however, depends on
the Student Field rather than the candidate key. Hence, the table is not in 3NF. To convert it into
the 3rd Normal Form, we will once again partition the tables into two while specifying a
new Foreign Key constraint to identify the salutations for individual records in the Students
table. The Primary Key constraint for the same will be set on the Salutations table to identify
each record uniquely.
Salutations Table (3rd Normal Form)

| Salutation_ID | Salutation |
|---|---|
| 1 | Ms. |
| 2 | Mr. |
| 3 | Mrs. |
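The 3NF split can be sketched in the same way: the salutation moves to a lookup table, and the students table references it through a foreign key (names and types are illustrative):

```sql
-- Lookup table: each salutation stored exactly once
CREATE TABLE salutations (
    salutation_id INT PRIMARY KEY,
    salutation    VARCHAR(10)
);

-- Students table references the lookup table instead of storing
-- the salutation text directly, removing the transitive dependency
CREATE TABLE students_3nf (
    student_id    INT PRIMARY KEY,
    name          VARCHAR(100),
    salutation_id INT REFERENCES salutations(salutation_id)
);
```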
For the above relation to exist in 3NF, all possible candidate keys in the above relation should be
{P, RS, QR, T}.
TRUNCATE command is used to delete all the rows from the table and free the space containing
the table.
DROP command is used to remove an object from the database. If you drop a table, all the rows
in the table are deleted and the table structure is removed from the database.
If a table is dropped, all things associated with the tables are dropped as well. This includes - the
relationships defined on the table with other tables, the integrity checks and constraints, access
privileges and other grants that the table has. To create and use the table again in its original
form, all these relations, checks, constraints, privileges and relationships need to be redefined.
However, if a table is truncated, none of the above problems exist and the table retains its
original structure.
The TRUNCATE command is used to delete all the rows from the table and free the space
containing the table.
The DELETE command deletes only the rows from the table based on the condition given in the
where clause or deletes all the rows from the table if no condition is specified. But it does not
free the space containing the table.
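The three commands can be contrasted in a quick sketch (using the students table as an example):

```sql
DELETE FROM students WHERE graduation_year = 2019;  -- removes matching rows only; space is retained
DELETE FROM students;                               -- removes all rows; space is retained
TRUNCATE TABLE students;                            -- removes all rows and frees the space
DROP TABLE students;                                -- removes the rows AND the table structure
```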
Note: All aggregate functions described above ignore NULL values except for the COUNT
function.
A scalar function returns a single value based on the input value. Following are the widely used
SQL scalar functions:
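A few commonly used scalar functions, sketched on literal inputs (exact function names vary slightly between RDBMSs, e.g. LENGTH vs LEN, UCASE vs UPPER):

```sql
SELECT UPPER('interviewbit');   -- converts a string to upper case
SELECT LOWER('SQL');            -- converts a string to lower case
SELECT LENGTH('database');      -- length of the string
SELECT ROUND(3.14159, 2);       -- rounds to two decimal places
SELECT NOW();                   -- current date and time
```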
The user-defined functions in SQL are like functions in any other programming language that
accept parameters, perform complex calculations, and return a value. They are written to use the
logic repetitively whenever required. There are two types of SQL user-defined functions:
• Scalar Function: As explained earlier, user-defined scalar functions return a single scalar value.
• Table-Valued Functions: User-defined table-valued functions return a table as output.
o Inline: returns a table data type based on a single SELECT statement.
o Multi-statement: returns a tabular result-set but, unlike inline, multiple SELECT statements
can be used inside the function body.
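Both kinds can be sketched in T-SQL-style syntax (function, schema, and table names are assumptions):

```sql
-- Scalar user-defined function: returns a single value
CREATE FUNCTION dbo.Square (@x INT)
RETURNS INT
AS
BEGIN
    RETURN @x * @x;
END;

-- Inline table-valued function: returns a table from a single SELECT
CREATE FUNCTION dbo.StudentsByYear (@year INT)
RETURNS TABLE
AS
RETURN (
    SELECT student_id, first_name
    FROM students
    WHERE graduation_year = @year
);
```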
OLTP (Online Transaction Processing) is a class of software applications capable of
supporting transaction-oriented programs. An essential attribute of an OLTP system is its ability
to maintain concurrency. To avoid single points of failure, OLTP systems are often decentralized.
These systems are usually designed for a large number of users who conduct short transactions.
Database queries are usually simple, require sub-second response times, and return relatively
few records.
36. What are the differences between OLTP and OLAP?
OLTP (Online Transaction Processing) is a class of software applications capable of
supporting transaction-oriented programs. An important attribute of an OLTP system is its ability
to maintain concurrency. OLTP systems often follow a decentralized architecture to avoid single
points of failure. These systems are generally designed for a large audience of end-users who
conduct short transactions. Queries involved in such databases are generally simple, need fast
response times, and return relatively few records. The number of transactions processed per
second is an effective measure for such systems.
OLAP stands for Online Analytical Processing, a class of software programs that are
characterized by the relatively low frequency of online transactions. Queries are often too
complex and involve a bunch of aggregations. For OLAP systems, the effectiveness measure
relies highly on response time. Such systems are widely used for data mining or maintaining
aggregated, historical data, usually in multi-dimensional schemas.
37. What is Collation? What are the different types of Collation Sensitivity?
Collation refers to a set of rules that determine how data is sorted and compared. Rules defining
the correct character sequence are used to sort the character data. It incorporates options for
specifying case sensitivity, accent marks, kana character types, and character width. Below are the
different types of collation sensitivity:
• Case sensitivity: A and a (and B and b) are treated differently.
• Accent sensitivity: a and á are treated differently.
• Kana sensitivity: The Japanese kana character types, Hiragana and Katakana, are treated differently.
• Width sensitivity: A single-byte (half-width) character and the same character represented as a double-byte (full-width) character are treated differently.
DELIMITER $$
CREATE PROCEDURE FetchAllStudents()
BEGIN
SELECT * FROM myDB.students;
END $$
DELIMITER ;
A stored procedure that calls itself until a boundary condition is reached, is called a recursive
stored procedure. This recursive function helps the programmers to deploy the same set of code
several times as and when required. Some SQL programming languages limit the recursion depth
to prevent an infinite loop of procedure calls from causing a stack overflow, which slows down
the system and may lead to system crashes.
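A minimal sketch in the MySQL syntax used earlier (note that MySQL additionally requires max_sp_recursion_depth to be raised above its default of 0 for the procedure to recurse; the procedure name is an assumption):

```sql
DELIMITER $$
CREATE PROCEDURE Countdown(IN n INT)
BEGIN
    IF n > 0 THEN            -- boundary condition: stop when n reaches 0
        SELECT n;
        CALL Countdown(n - 1);  -- the procedure calls itself
    END IF;
END $$
DELIMITER ;
```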
Creating empty tables with the same structure can be done smartly by fetching the records of
one table into a new table using the INTO operator while fixing a WHERE clause to be false for all
records. Hence, SQL prepares the new table with a duplicate structure to accept the fetched
records but since no records get fetched due to the WHERE clause in action, nothing is inserted
into the new table.
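The trick can be sketched with a WHERE clause that is false for every record (table names are assumptions):

```sql
-- new_students receives the structure of students but no rows,
-- because the condition 1 = 2 is never true
SELECT *
INTO new_students
FROM students
WHERE 1 = 2;
```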
SQL pattern matching lets you search data for a pattern even when you do not know the exact
word you are looking for. Such queries use wildcards to match a string pattern rather than an
exact word. The LIKE operator is used in conjunction with SQL wildcards to fetch the
required information.
The % wildcard matches zero or more characters of any type and can be used to define wildcards
both before and after the pattern. Search a student in your database with first name beginning
with the letter K:
SELECT *
FROM students
WHERE first_name LIKE 'K%'
Use the NOT keyword to select records that don't match the pattern. This query returns all
students whose first name does not begin with K.
SELECT *
FROM students
WHERE first_name NOT LIKE 'K%'
Search for a student in the database where he/she has a K in his/her first name.
SELECT *
FROM students
WHERE first_name LIKE '%K%'
The _ wildcard matches exactly one character of any type. It can be used in conjunction with %
wildcard. This query fetches all students with letter K at the third position in their first name.
SELECT *
FROM students
WHERE first_name LIKE '__K%'
Because the _ wildcard matches exactly one character, it also acts as a constraint: it limits
both the length and the position of the matched results.
PostgreSQL was first called Postgres and was developed by a team led by Computer Science
Professor Michael Stonebraker in 1986. It was developed to help developers build enterprise-level
applications by upholding data integrity and making systems fault-tolerant. PostgreSQL is
therefore an enterprise-level, flexible, robust, open-source, and object-relational DBMS that
supports flexible workloads along with handling concurrent users. It has been consistently
supported by the global developer community. Due to its fault-tolerant nature, PostgreSQL has
gained widespread popularity among developers.
Indexes are built-in structures in PostgreSQL that queries use to search a table in the database
more efficiently. Consider a table with thousands of records and a query whose condition only a
few records satisfy: without an index, it takes a long time to find and return the matching rows,
because the engine has to check the condition against every single row. This is undoubtedly
inefficient for a system dealing with huge amounts of data. If the system has an index on the
column being searched, it can identify the matching rows efficiently by walking through only a
few levels of the index structure. This is called indexing.
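Creating such an index can be sketched as follows (table and column names are assumptions):

```sql
-- Creates a B-tree index (the PostgreSQL default) on the searched column
CREATE INDEX idx_students_last_name
ON students (last_name);
```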
This can be done by using the ALTER TABLE statement as shown below:
Syntax:
The first step of using PostgreSQL is to create a database. This is done by using the createdb
command as shown below: createdb db_name
After running the above command, if the database creation was successful, then the below
message is shown:
CREATE DATABASE
46. How can we start, restart and stop the PostgreSQL server?
The server is typically controlled with the pg_ctl utility (or the operating system's service
manager): pg_ctl -D <data_directory> start starts the server, pg_ctl -D <data_directory> restart
restarts it, and pg_ctl -D <data_directory> stop stops it.
Partitioned tables are logical structures that are used for dividing large tables into smaller
structures that are called partitions. This approach is used for effectively increasing the query
performance while dealing with large database tables. To create a partition, a key called partition
key which is usually a table column or an expression, and a partitioning method needs to be
defined. There are three types of inbuilt partitioning methods provided by Postgres:
• Range Partitioning: This method is done by partitioning based on a range of values. This method
is most commonly used upon date fields to get monthly, weekly or yearly data. In corner cases
where a value falls on the boundary of a range, for example, if the range of partition 1 is
10-20 and the range of partition 2 is 20-30, the value 20 belongs to the second partition and
not the first, because the lower bound of a range is inclusive while the upper bound is exclusive.
• List Partitioning: This method is used to partition based on a list of known values. Most
commonly used when we have a key with a categorical value. For example, getting sales data
based on regions divided as countries, cities, or states.
• Hash Partitioning: This method utilizes a hash function upon the partition key. This is done when
there are no specific requirements for data division and is used to access data individually. For
example, you want to access data based on a specific product, then using hash partition would
result in the dataset that we require.
The type of partition key and the partitioning method used determine how much the partitioned
table gains in query performance and manageability.
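A range-partitioning sketch on a date column (table and column names are illustrative):

```sql
CREATE TABLE sales (
    sale_id   INT,
    sale_date DATE,
    amount    NUMERIC
) PARTITION BY RANGE (sale_date);

-- For each partition the lower bound is inclusive, the upper exclusive
CREATE TABLE sales_2023 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```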
A token in PostgreSQL is a keyword, identifier, quoted identifier, literal (constant), or special
character symbol with a distinct meaning. Tokens may or may not be separated by a space,
newline or tab. If the tokens are keywords, they are usually commands with useful meanings.
Tokens are the building blocks of any PostgreSQL code.
TRUNCATE TABLE name_of_table statement removes the data efficiently and quickly from the
table.
The truncate statement can also be used to reset values of the identity columns along with data
cleanup as shown below:
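The identity-reset form can be sketched as:

```sql
-- Remove all rows and reset identity/serial counters to their start values
TRUNCATE TABLE table_1 RESTART IDENTITY;
```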
We can also use the statement for removing data from multiple tables all at once by mentioning
the table names separated by comma as shown below:
TRUNCATE TABLE
table_1,
table_2,
table_3;
50. What is the capacity of a table in PostgreSQL?
To get the next number 101 from the sequence, we use the nextval() method as shown below:
SELECT nextval('serial_num');
We can also use this sequence while inserting new records using the INSERT command:
INSERT INTO ib_table_name VALUES (nextval('serial_num'), 'interviewbit');
52. What are string constants in PostgreSQL?
String constants are character sequences bound within single quotes. They are used when
inserting or updating character data in the database.
There are special string constants that are quoted in dollars.
Syntax: $tag$<string_constant>$tag$ The tag in the constant is optional and when we are not
specifying the tag, the constant is called a double-dollar string literal.
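Both forms can be sketched as:

```sql
SELECT 'InterviewBit';        -- ordinary single-quoted string constant
SELECT 'It''s SQL';           -- an embedded single quote is doubled
SELECT $$It's SQL$$;          -- double-dollar string literal (no tag)
SELECT $msg$It's SQL$msg$;    -- dollar-quoted constant with a tag
```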
This can be done by using the command \l -> backslash followed by the lower-case letter L.
This can be done by using the DROP DATABASE command as shown in the syntax below:
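The syntax is the standard DROP DATABASE form:

```sql
DROP DATABASE db_name;
```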
If the database has been deleted successfully, then the following message would be shown:
DROP DATABASE
55. What are ACID properties? Is PostgreSQL compliant with ACID?
ACID stands for Atomicity, Consistency, Isolation, Durability. They are database transaction
properties which are used for guaranteeing data validity in case of errors and failures.
• Atomicity: This property ensures that a transaction is completed in an all-or-nothing way.
• Consistency: This ensures that updates made to the database are valid and follow the defined
rules and restrictions.
• Isolation: This property ensures that concurrent transactions do not interfere with one
another; the intermediate state of a transaction is not visible to other transactions.
• Durability: This property ensures that committed transactions are stored permanently in the
database, even in the event of a failure.
PostgreSQL is fully compliant with the ACID properties.
MVCC, or Multi-Version Concurrency Control, is used to avoid unnecessary database locks when
two or more requests try to access or modify the same data at the same time. Each transaction
sees a snapshot of the data as of the time it started, so readers do not block writers and
writers do not block readers, avoiding needless waiting for users accessing the database.
The --enable-debug configure flag enables the compilation of all libraries and applications
with debugging symbols. When this is enabled, system processes are slowed down and the size of
the binary files generally increases. Hence, it is not recommended to switch this on in a
production environment. It is most commonly used by developers to debug their scripts and help
them spot issues.
59. How do you check the rows affected as part of previous transactions?
SQL standards state that the following three phenomena should be prevented when transactions
execute concurrently, and define four levels of transaction isolation to deal with these
phenomena.
• Dirty reads: If a transaction reads data that is written due to concurrent uncommitted transaction,
these reads are called dirty reads.
• Phantom reads: This occurs when the same query, executed twice, returns different sets of
rows. For example, transaction A retrieves a set of rows matching some search criteria;
meanwhile, transaction B inserts new rows matching the same criteria. When transaction A repeats
the query, it sees additional "phantom" rows, so the results differ.
• Non-repeatable reads: This occurs when a transaction tries to read the same row multiple times
and gets different values each time due to concurrency. This happens when another transaction
updates that data and our current transaction fetches that updated data, resulting in different
values.
To tackle these, there are 4 standard isolation levels defined by SQL standards. They are as
follows:
• Read Uncommitted – The lowest level of the isolations. Here, the transactions are not isolated
and can read data that are not committed by other transactions resulting in dirty reads.
• Read Committed – This level ensures that the data read is committed at any instant of read time.
Hence, dirty reads are avoided here. This level makes use of read/write lock on the current rows
which prevents read/write/update/delete of that row when the current transaction is being
operated on.
• Repeatable Read – A more restrictive level of isolation. It holds read and write locks on all
rows it operates on. Because of this, non-repeatable reads are avoided, as other transactions
cannot read, write, update or delete those rows.
• Serializable – The highest of all isolation levels. It guarantees that the execution of any
concurrent transactions appears the same as if they had executed serially.
The following table summarizes which types of unwanted reads each level prevents:

| Isolation Level | Dirty Read | Non-repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Not possible | Possible | Possible |
| Repeatable Read | Not possible | Not possible | Possible |
| Serializable | Not possible | Not possible | Not possible |
60. What can you tell about WAL (Write Ahead Logging)?
Write-Ahead Logging is a feature that increases database reliability by logging changes before
they are applied to the database. If a database crash occurs, the log provides enough
information to pinpoint how far the work had progressed and gives a starting point from which to
resume.
61. What is the main disadvantage of deleting data from an existing table using the DROP TABLE
command?
The DROP TABLE command deletes all the data in the table and removes the complete table
structure as well. If our requirement is only to remove the data, we would then need to recreate
the table before storing data in it again. In such cases, it is advised to use the TRUNCATE
command instead.
62. How do you perform case-insensitive searches using regular expressions in PostgreSQL?
Case-insensitive regular-expression matching uses the ~* operator. For example, the following
expression evaluates to true:
'interviewbit' ~* '.*INTervIewBit.*'
63. How will you take backup of the database in PostgreSQL?
We can achieve this by using the pg_dump tool for dumping all object contents in the database
into a single file. The steps are as follows:
Step 2: Execute pg_dump program to take the dump of data to a .tar folder as shown below:
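The pg_dump step can be sketched as follows (the database name, user, and output path are assumptions):

```shell
# Dump the database sample_data into a tar-format archive
pg_dump -U postgres -F t sample_data > sample_data.tar
```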
The database dump will be stored in the sample_data.tar file on the location specified.
64. Does PostgreSQL support full text search?
Yes. PostgreSQL provides full-text search through its tsvector and tsquery data types and
matching operators, along with functions such as to_tsvector and to_tsquery for converting
documents and queries into a searchable form.
Parallel Queries support is a feature provided in PostgreSQL for devising query plans capable of
exploiting multiple CPU processors to execute the queries faster.
The COMMIT action ends the current transaction and ensures that the data consistency of the
transaction is maintained; a new record describing the COMMIT is added to the transaction log. A
checkpoint, on the other hand, writes all changes committed so far from memory to the data files
on disk and records the checkpoint position in the write-ahead log, so that crash recovery can
start from that point.
Conclusion:
SQL is the language of the database. It has a vast scope and a robust capability for creating
and manipulating a variety of database objects using commands like CREATE, ALTER, and DROP, for
loading database objects using commands like INSERT, for data manipulation using commands like
DELETE and TRUNCATE, and for effective retrieval of data using commands like SELECT and FETCH.
There are many such commands which
provide a large amount of control to the programmer to interact with the database in an efficient
way without wasting many resources. The popularity of SQL has grown so much that almost
every programmer relies on this to implement their application's storage functionalities thereby
making it an exciting language to learn. Learning this provides the developer a benefit of
understanding the data structures used for storing the organization's data and giving an
additional level of control and in-depth understanding of the application.
PostgreSQL, an open-source database system with extremely robust and sophisticated ACID,
indexing, and transaction support, has found widespread popularity among the developer
community.