
Chapter 1.

Query Processing and Optimization

1. What is Query Optimization? Briefly discuss the techniques of Query Optimization with suitable examples.
Ans. Query optimization is of great importance for the performance of a relational database, especially for the execution of complex SQL statements. A query optimizer decides the best method for implementing each query. For instance, it selects whether or not to use indexes for a given query, and which join method to use when joining multiple tables. These decisions have a tremendous effect on SQL performance, and query optimization is a key technology for every application, from operational systems to data warehouses and analytical systems to content management systems.

The main principles of query optimization are as follows −

● Understand how your database is executing your query

● Retrieve as little data as possible

● Store intermediate results

SQL Query Optimization Techniques

So far, we have seen how a query is executed and different measures to analyze query performance. Now we will learn the techniques to optimize query performance in SQL. There are some useful practices to reduce the cost, but the process of optimization is iterative: one writes the query, checks its performance using IO statistics or the execution plan, and then optimizes it. This cycle is followed iteratively for query optimization. SQL Server itself also searches for the optimal, minimal-cost plan to execute the query.

Indexing
An index is a data structure used to provide quick access to a table based on a search key. It helps minimize the disk accesses needed to fetch rows from the database. An index operation can be a scan or a seek: an index scan traverses the entire index looking for matching criteria, whereas an index seek navigates directly to the rows that match the filter.

For example,

SELECT p.Name, Color, ListPrice
FROM SalesLT.Product p
INNER JOIN SalesLT.ProductCategory pc
    ON p.ProductCategoryID = pc.ProductCategoryID
INNER JOIN SalesLT.SalesOrderDetail sod
    ON p.ProductID = sod.ProductID
WHERE p.ProductID > 1
In the above query, we can see that 99% of the query execution time goes into the index seek operation. Therefore, indexing is an important part of the optimization process.

Guidelines for choosing indexes:

1. Indexes should be made on keys that frequently occur in WHERE clauses and join conditions (see the sketch after this list).
2. Indexes should not be made on columns that are frequently modified, i.e., columns to which the UPDATE command is applied frequently.
3. Indexes should be made on foreign keys where INSERT, UPDATE, and DELETE are performed concurrently. This allows updates on the master table without shared locking on the weak entity.
4. Indexes should be made on attributes that commonly occur together in WHERE clauses combined with the AND operator.
5. Indexes should be made on ordering key values.
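As a sketch of guideline 1 applied to the earlier example query, the index below targets the join key of SalesLT.SalesOrderDetail. The index name is illustrative, and the tables are assumed to be the AdventureWorksLT samples:

-- Supports the join on ProductID (guideline 1), letting the optimizer
-- use an index seek instead of scanning the table.
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_ProductID
    ON SalesLT.SalesOrderDetail (ProductID);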

Selection
Select only the rows and columns that are required instead of selecting all of them. SELECT * is inefficient because it fetches every column of every row, inflating the I/O and network cost of the query.

SET STATISTICS TIME ON
SELECT * FROM SalesLT.Product

SET STATISTICS TIME ON
SELECT ProductNumber, Name, Color, Weight FROM SalesLT.Product


As we can see from the above two outputs, the time is reduced to one-fourth when we use the SELECT statement to select only those columns that are required.

Avoid using SELECT DISTINCT

The SELECT DISTINCT command in SQL is used to fetch unique results and remove duplicate rows from the result. To achieve this, it groups related rows together and then discards the duplicates, an operation as costly as GROUP BY. So, to fetch distinct rows without DISTINCT, one can instead add more attributes to the SELECT list so that rows become unique.

Let us take an example,

SET STATISTICS TIME ON
SELECT DISTINCT Name, Color, StandardCost, Weight FROM SalesLT.Product

SET STATISTICS TIME ON
SELECT Name, Color, StandardCost, Weight, SellEndDate FROM SalesLT.Product

As we can see from the execution of the above two queries, the DISTINCT operation takes more time to fetch the unique rows. So, it is better to add more attributes to the SELECT list to improve performance while still getting unique rows.
2. What do you understand by query optimization? What are query trees? Explain with an example.
ANS: Refer to Q1 for query optimization; the construction and transformation of equivalent query expressions is illustrated in Q4.
3. Consider the following two tables...... Consider the query i) Draw the query tree for the given
query.
4. List and explain the steps followed to process a high level query.
Query processing means the entire process or activity of translating a query into low-level instructions, optimizing the query to save resources, estimating or evaluating the cost of the query, and extracting data from the database.
Goal: To find an efficient query execution plan for a given SQL query which minimizes the cost considerably, especially time.
Cost factors: disk accesses (which typically consume time) and read/write operations (which typically need resources such as memory/RAM).
The major steps involved in query processing are depicted in the figure below;

Figure 1 - Steps in Database Query Processing


Let us discuss the whole process with an example. Let us consider the following two relations as the example tables for our
discussion;

Employee(Eno, Ename, Phone)


Proj_Assigned(Eno, Proj_No, Role, DOP)
where,
Eno is Employee number,
Ename is Employee name,
Proj_No is Project Number in which an employee is assigned,
Role is the role of an employee in a project,
DOP is duration of the project in months.
With this information, let us write a query to find the list of all employees who are working in a project which is more than 10 months
old.
SELECT Ename
FROM Employee, Proj_Assigned
WHERE Employee.Eno = Proj_Assigned.Eno AND DOP > 10;
Input:
A query written in SQL is given as input to the query processor. For our case, let us consider the SQL query written above.
Step 1: Parsing
In this step, the parser of the query processor module checks the syntax of the query, the user’s privileges to execute the query, the
table names and attribute names, etc. The correct table names, attribute names and the privilege of the users can be taken from the
system catalog (data dictionary).
Step 2: Translation
If we have written a valid query, then it is converted from the high-level language SQL into low-level instructions in relational algebra.
For example, our SQL query can be converted into an equivalent relational algebra expression as follows;
πEname(σDOP>10 ∧ Employee.Eno=Proj_Assigned.Eno(Employee × Proj_Assigned))
Step 3: Optimizer
The optimizer uses the statistical data stored as part of the data dictionary: information about the size of the table, the length of records, the indexes created on the table, etc. The optimizer also checks the conditions and conditional attributes which are part of the query.
Step 4: Execution Plan
A query can be expressed in many ways. At this stage, the query processor module uses the information collected in Step 3 to find different relational algebra expressions that are equivalent to the one we have written and that return the same result.
For our example, the query written in relational algebra above can also be written as given below;
πEname(Employee ⋈Eno (σDOP>10(Proj_Assigned)))
So far, we have two execution plans. The only condition is that both plans must give the same result.
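The second plan can be derived from the first using two standard equivalence rules, written here in the same notation:
σθ(R × S) ≡ R ⋈θ S, when θ is a join condition relating R and S (a selection over a Cartesian product becomes a join);
σφ(R ⋈θ S) ≡ R ⋈θ (σφ(S)), when φ uses only attributes of S (a selection can be pushed below the join).
Applying the first rule to Employee.Eno = Proj_Assigned.Eno and the second rule to DOP > 10 transforms the first expression into the second.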
Step 5: Evaluation
Though we may construct many execution plans using the statistical data, and though they all return the same result, they differ in the time consumed to execute the query and in the space required to execute it. Hence, it is mandatory to choose the one plan which consumes the least cost.
At this stage, we choose one execution plan out of the several we have developed. This execution plan accesses data from the database to give the final result.
In our example, the second plan is likely the better one. In the first plan, we join the two relations (a costly operation) and then apply the condition (conditions act as filters) on the joined relation. This consumes more time as well as space. In the second plan, we first filter one of the tables (Proj_Assigned) and then join the result with the Employee table. This join needs to compare fewer records. Hence, the second plan is the best (given the information known, though not always).
Output:
The final result is shown to the user.

5. Describe the phases of Query processing by using a block diagram.


The phases are the same as the steps described in Q4 above: parsing, translation, optimization, execution plan generation, evaluation, and output, as depicted in Figure 1 (Steps in Database Query Processing).

6. Discuss the process of query optimization, using a suitable example.


Same as Q4
7. Consider the following relations.... Consider the query.... Perform the following tasks for the above:
i. Write the above query using relational algebra and draw the query tree for the same.
ii. Transform the query tree into an equivalent query tree such that the evaluation cost may be reduced.
8. Why is a query expressed in relational algebra preferred over a query expressed in SQL?
Ans. When a query is placed, it is first scanned, parsed, and validated. An internal representation of the query is then created, such as a query tree or a query graph. Then alternative execution strategies are devised for retrieving results from the database tables. The process of choosing the most appropriate execution strategy for query processing is called query optimization.
Relational algebra defines the basic set of operations of the relational database model. A sequence of relational algebra operations forms a relational algebra expression, and the result of this expression represents the result of a database query. A query expressed in relational algebra is preferred because these well-defined operations and their equivalence rules give the optimizer a formal basis for enumerating and comparing alternative execution strategies.
In a DDBMS, query optimization is a crucial task. The complexity is high since the number of alternative strategies may increase exponentially due to the following factors −

● The presence of a number of fragments.

● Distribution of the fragments or tables across various sites.

● The speed of communication links.

● Disparity in local processing capabilities.

Hence, in a distributed system, the target is often to find a good execution strategy for query processing rather than the best one. The

time to execute a query is the sum of the following −

● Time to communicate queries to databases.

● Time to execute local query fragments.

● Time to assemble data from different sites.

● Time to display results to the application.

SQL queries are translated into equivalent relational algebra expressions before optimization. A query is at first decomposed into
smaller query blocks. These blocks are translated to equivalent relational algebra expressions. Optimization includes optimization of
each block and then optimization of the query as a whole.
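For instance, a nested query forms two blocks that are translated and optimized separately; the EMPLOYEE table and its attributes here are hypothetical:

-- The outer block and the inner block are each mapped to a relational
-- algebra expression and optimized independently.
SELECT LNAME
FROM EMPLOYEE
WHERE SALARY > (SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5);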


Cost Estimation: https://www.javatpoint.com/estimating-query-cost

9. Consider the following relations.... Consider the query....


(i) Represent the query using SQL.
(ii) Convert the SQL query into an equivalent relational algebra query.
(iii) Draw the query tree for the above relational algebra query.
(iv) Using the query tree, transform the relational algebra query into an equivalent relational algebra query which may reduce the query evaluation cost.
Chapter 2. Advanced Data Management Techniques
10. How do distributed databases differ from the centralized databases?

11. Describe the architecture of distributed databases with the help of a diagram.
Distributed Database Architecture
A distributed database system allows applications to access data from local and remote
databases. In a homogenous distributed database system, each database is an Oracle Database.
In a heterogeneous distributed database system, at least one of the databases is not an Oracle
Database. Distributed databases use a client/server architecture to process information requests.
The section contains the following topics:
• Homogenous Distributed Database Systems
• Heterogeneous Distributed Database Systems
• Client/Server Database Architecture
Homogenous Distributed Database Systems
A homogenous distributed database system is a network of two or more Oracle Databases that
reside on one or more machines. Figure 31-1 illustrates a distributed system that connects three
databases: hq, mfg, and sales. An application can simultaneously access or modify the data in
several databases in a single distributed environment. For example, a single query from a
Manufacturing client on local database mfg can retrieve joined data from the products table on
the local database and the dept table on the remote hq database.
For a client application, the location and platform of the databases are transparent. You can also
create synonyms for remote objects in the distributed system so that users can access them with
the same syntax as local objects. For example, if you are connected to database mfg but want to
access data on database hq, creating a synonym on mfg for the remote dept table enables you to
issue this query:
SELECT * FROM dept;
In this way, a distributed system gives the appearance of native data access. Users on mfg do not
have to know that the data they access resides on remote databases.
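A sketch of how such a synonym might be created on mfg, assuming a database link named hq already exists and the remote dept table lives in a schema named scott (both names illustrative):

-- Oracle: hide the remote location behind a local name.
CREATE SYNONYM dept FOR scott.dept@hq;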
Figure 31-1 Homogeneous Distributed Database



An Oracle Database distributed database system can incorporate Oracle Databases of different
versions. All supported releases of Oracle Database can participate in a distributed database
system. Nevertheless, the applications that work with the distributed database must understand
the functionality that is available at each node in the system. A distributed database application
cannot expect an Oracle7 database to understand the SQL extensions that are only available with
Oracle Database.
12. Explain what is meant by a distributed database and discuss the reasons behind providing such
a system.

13. List the functions, advantages, and disadvantages of DDBMS (Distributed Database Management Systems).
• Improved ease and flexibility of application development: developing and maintaining applications at geographically distributed sites of an organization is facilitated owing to transparency of data distribution and control.

• Increased reliability and availability: this is achieved by the isolation of faults to their site of origin without affecting the other databases connected to the network. When the data and DDBMS software are distributed over several sites, one site may fail while other sites continue to operate; only the data and software that exist at the failed site become inaccessible. This improves both reliability and availability.

• Improved performance: a distributed DBMS fragments the database, keeping the data closer to where it is needed most. Data localization reduces contention for CPU and I/O services and simultaneously reduces the access delays involved in wide area networks. When a large database is distributed over multiple sites, smaller databases exist at each site; as a result, local queries and transactions accessing data at a single site perform better because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Moreover, interquery and intraquery parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of subqueries that execute in parallel, which contributes to improved performance.

• Easier expansion: in a distributed environment, expansion of the system in terms of adding more data, increasing database sizes, or adding more processors (sites) is much easier than in a centralized system.
Functions of DDBMS

• Keeping track of data distribution: the ability to keep track of the data distribution, fragmentation, and replication by expanding the DDBMS catalog.

• Distributed query processing: the ability to access remote sites and transmit queries and data among the various sites via a communication network.

• Distributed transaction management: the ability to devise execution strategies for queries and transactions that access data from more than one site, to synchronize the access to distributed data, and to maintain the integrity of the overall database.

• Replicated data management: the ability to decide which copy of a replicated data item to access and to maintain the consistency of copies of a replicated data item.

• Distributed database recovery: the ability to recover from individual site crashes and from new types of failures, such as the failure of communication links.

• Security: distributed transactions must be executed with proper management of the security of the data and the authorization/access privileges of users.
14. What are Mobile Databases? Explain the characteristics of mobile databases. Give an
application of mobile databases.
What is mobile computing?
• Users with portable computers still have network connections while they move.
• Mobile Computing is an umbrella term used to describe technologies that enable people to access network services anyplace,
anytime, and anywhere.
• Mobile data-driven applications enable us to access any data from anywhere, anytime.
Examples:
• Salespersons can update sales records on the move.
• Reporters can update news database anytime.
• Doctors can retrieve patient’s medical history from anywhere.
Mobile DBMSs are needed to support the data processing capabilities of these applications.

Characteristics of Mobile Environments

The characteristics of mobile computing include high communication latency, intermittent wireless connectivity, limited battery life, and changing client location.
• Latency is caused by processes unique to the wireless medium, such as coding data for wireless transfer, and tracking and filtering wireless signals at the receiver.
• Intermittent connectivity can be intentional or unintentional. Unintentional disconnections happen in areas where wireless signals cannot reach, e.g., elevator shafts or subway tunnels; intentional disconnections occur by user intent, e.g., during an airplane takeoff, or when the mobile device is powered down.
• Battery life is directly related to battery size, and indirectly related to the mobile device's capabilities.
• Client locations are expected to change, which alters the network topology and may cause the clients' data requirements to change.
All these characteristics impact data management, and robust mobile applications must take them into account.
To compensate for high latency and unreliable connectivity, clients cache replicas of important, frequently accessed data and work offline if necessary. Besides increasing data availability and response time, caching can also reduce client power consumption by eliminating the need for an energy-consuming wireless transmission on each data access.
The server may not always be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station. In either case, neither client nor server can reach the other, and the architecture must be modified to compensate: proxies for unreachable components are added. For a client (and symmetrically for a server), the proxy can cache updates intended for the server; when a connection becomes available, the proxy automatically forwards these cached updates to their ultimate destination.

15. Differentiate between the following:


i. Centralized and Distributed Databases
Refer to Q10.
16. List and explain the commonly accepted threats to database security.
https://www.imperva.com/learn/data-security/database-security/
17. Mention and briefly explain the control measures that are used to provide security of data in
databases.
Database security means keeping sensitive information safe and preventing the loss of data. The security of a database is controlled by the Database Administrator (DBA).
The following are the main control measures used to provide security of data in databases:
1. Authentication
2. Access control
3. Inference control
4. Flow control
5. Database Security applying Statistical Method
6. Encryption
These are explained as following below.
Authentication:
Authentication is the process of confirming that a user logs in only according to the rights granted to them to perform database activities. A particular user can log in only up to their privilege level and cannot access other sensitive data; the privilege of accessing sensitive data is restricted by using authentication. Authentication tools based on biometrics, such as retina scans and fingerprints, can protect the database from unauthorized or malicious users.
Access Control:
The security mechanism of a DBMS must include provisions for restricting access to the database by unauthorized users. Access control is enforced by creating user accounts and controlling the login process in the DBMS, so that sensitive data can be accessed only by those people (database users) who are allowed to access it, while unauthorized persons are kept out. The database system must also keep track of all operations performed by each user throughout the login session.
Inference Control:
This method is known as the countermeasure to the statistical database security problem. It is used to prevent the user from completing any inference channel, protecting sensitive information from indirect disclosure. Inferences are of two types: identity disclosure and attribute disclosure.
Flow Control:
This prevents information from flowing in a way that lets it reach unauthorized users. Pathways through which information flows implicitly, in ways that violate the security policy of a company, are called covert channels.
Database Security applying Statistical Methods:
Statistical database security focuses on protecting the confidential individual values stored in databases used for statistical purposes, while permitting retrieval of summaries of values grouped by categories; it does not permit retrieval of individual information. For example, a user may access the database to get statistical information such as the number of employees in the company, but not the detailed confidential or personal information about a specific individual employee.
Encryption:
This method is mainly used to protect sensitive data (such as credit card numbers and OTPs). The data is encoded using an encryption algorithm. An unauthorized user who tries to access the encoded data will have difficulty decoding it, whereas authorized users are given decoding keys to decrypt the data.

18. What is the difference between discretionary and mandatory access control?

DAC vs MAC:
• DAC stands for Discretionary Access Control; MAC stands for Mandatory Access Control.
• DAC is easier to implement; MAC is difficult to implement.
• DAC is less secure to use; MAC is more secure to use.
• In DAC, the owner can determine the access and privileges and can restrict the resources based on the identity of the users; in MAC, only the system determines access, and resources are restricted based on the clearance of the subjects.
• DAC has extra labour-intensive properties; MAC has no labour-intensive property.
• In DAC, users are provided access based on their identity, not on clearance levels; in MAC, users are restricted based on their power and level in the hierarchy.
• DAC has high flexibility, with few rules and regulations; MAC is not flexible, as it contains lots of strict rules and regulations.
• DAC has complete trust in users; MAC trusts only administrators.
• In DAC, decisions are based only on user ID and ownership; in MAC, decisions are based on objects and tasks, which can have their own IDs.
• In DAC, information flow is impossible to control; in MAC, information flow can be easily controlled.
• DAC is supported by commercial DBMSs; MAC is not supported by commercial DBMSs.
• DAC can be applied in all domains; MAC is applied in the military, government, and intelligence domains.
• DAC is vulnerable to Trojan horses; MAC prevents virus flow from a higher level to a lower level.

19. Discuss the types of privileges at the account level and those at the relation level.
There are two levels of privileges for using a database system: the account level and the relation (or table) level.
• At the account level, each account holds particular privileges, independently specified by the database administrator, that apply to the account's capabilities in the database.
• At the relation level, access privileges on each individual relation or view in the database are controlled by the database administrator.

Account level privileges include:
1. CREATE SCHEMA or CREATE TABLE privilege, to create a schema or table.
2. CREATE VIEW privilege.
3. ALTER privilege, to perform changes such as adding or removing attributes.
4. DROP privilege, to delete relations or views.
5. MODIFY privilege, to insert, delete, or update tuples.
6. SELECT privilege, to retrieve information from the database.

Relation level:
• It refers to either a base relation or a view (virtual relation).
• Each type of command can be granted to each user on an individual relation. The access matrix model, an authorization model, is used for granting and revoking privileges.
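As a sketch of how such relation-level privileges are granted and revoked in SQL (the relation name Employee and the account name clerk1 are hypothetical):

-- Grant relation-level read and insert privileges to one account.
GRANT SELECT, INSERT ON Employee TO clerk1;
-- Later, withdraw only the insert privilege.
REVOKE INSERT ON Employee FROM clerk1;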

20. What is the goal of encryption? What process is involved in encrypting data and then recovering it at the other end?

With more and more organizations moving to hybrid and multicloud environments, concerns are growing about public cloud security
and protecting data across complex environments. Enterprise-wide data encryption and encryption key management can help protect
data on-premises and in the cloud.

Cloud service providers (CSPs) may be responsible for the security of the cloud, but customers are responsible for security in the
cloud, especially the security of any data. An organization’s sensitive data must be protected, while allowing authorized users to
perform their job functions. This protection should not only encrypt data, but also provide robust encryption key management, access
control and audit logging capabilities.
Robust data encryption and key management solutions should offer:

• A centralized management console for data encryption and encryption key policies and configurations
• Encryption at the file, database, and application levels for on-premises and cloud data
• Role- and group-based access controls and audit logging to help address compliance
• Automated key lifecycle processes for on-premises and cloud encryption keys

Cryptography is the science of encoding information before sending via unreliable communication paths so that only an authorized
receiver can decode and use it.
The coded message is called cipher text and the original message is called plain text. The process of converting plain text to cipher text at the sender is called encoding or encryption; the process of converting cipher text back to plain text at the receiver is called decoding or decryption. The overall procedure, then, is: the sender encrypts the plain text with a key, transmits the cipher text over the communication channel, and the receiver decrypts it back into plain text.

Conventional Encryption Methods


In conventional cryptography, the encryption and decryption are done using the same secret key. Here, the sender encrypts the
message with an encryption algorithm using a copy of the secret key. The encrypted message is then sent over public communication
channels. On receiving the encrypted message, the receiver decrypts it with a corresponding decryption algorithm using the same
secret key.
Security in conventional cryptography depends on two factors −
• A sound algorithm which is known to all.
• A randomly generated, preferably long secret key known only to the sender and the receiver.
The most famous conventional cryptography algorithm is the Data Encryption Standard (DES). The advantage of this method is its easy applicability. However, the greatest problem of conventional cryptography is sharing the secret key between the communicating parties: the ways to send the key are cumbersome and highly susceptible to eavesdropping.
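To make the shared-key idea concrete, SQL Server (the DBMS used in the Chapter 1 examples) exposes symmetric encryption through ENCRYPTBYPASSPHRASE and DECRYPTBYPASSPHRASE. A minimal sketch; the passphrase and the value being protected are placeholders:

-- Conventional (symmetric) encryption: the same passphrase-derived key
-- is used to encrypt and to decrypt.
DECLARE @cipher VARBINARY(8000) =
    ENCRYPTBYPASSPHRASE('shared-secret-key', N'4111-1111-1111-1111');
SELECT CONVERT(NVARCHAR(100),
       DECRYPTBYPASSPHRASE('shared-secret-key', @cipher)) AS plain_text;

Anyone who learns the passphrase can decrypt, which is exactly the key-distribution problem described above.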
Public Key Cryptography
In contrast to conventional cryptography, public key cryptography uses two different keys, referred to as the public key and the private key. Each user generates a pair consisting of a public key and a private key, and puts the public key in an accessible place. When a sender wants to send a message, he encrypts it using the public key of the receiver. On receiving the encrypted message, the receiver decrypts it using his private key. Since the private key is known only to the receiver, no other person who receives the message can decrypt it.
The most popular public key cryptography algorithms are the RSA algorithm and the Diffie–Hellman algorithm. This method is very secure for sending private messages. However, it involves a lot of computation and so proves inefficient for long messages.
The solution is to use a combination of conventional and public key cryptography. The secret key is encrypted using public key cryptography before being shared between the communicating parties. Then, the message is sent using conventional cryptography with the aid of the shared secret key.
Digital Signatures
A Digital Signature (DS) is an authentication technique based on public key cryptography used in e-commerce applications. It associates a unique mark with an individual within the body of his message, which helps others authenticate the valid senders of messages. Typically, a user's digital signature varies from message to message in order to provide security against counterfeiting. The method is as follows −
• The sender takes a message, calculates the message digest of the message, and signs the digest with his private key.
• The sender then appends the signed digest to the plaintext message.
• The message is sent over the communication channel.
• The receiver removes the appended signed digest and verifies the digest using the corresponding public key.
• The receiver then takes the plaintext message and runs it through the same message digest algorithm.
• If the digests from the previous two steps match, the receiver knows that the message is authentic and has integrity.

21. What is flow control as a security measure? What types of flow control exist?
Distributed systems encompass a lot of data flow from one site to another and also within a site. Flow control prevents data from
being transferred in such a way that it can be accessed by unauthorized agents. A flow policy lists out the channels through which
information can flow. It also defines security classes for data as well as transactions.
22. Discuss what is meant by each of the following terms:
a) Database authorization
Authorization is the process where the database manager gets information about the authenticated user. Part of that information
is determining which database operations the user can perform and which data objects a user can access.
b) Access Control
Database access control is a method of allowing access to company's sensitive data only to those people (database users) who
are allowed to access such data and to restrict access to unauthorized persons. It includes two main components: authentication
and authorization.
c) Data Encryption
Data encryption is a way of translating data from plaintext (unencrypted) to ciphertext (encrypted). Users can encrypt data with an encryption key and decrypt it with a decryption key, protecting the data from unauthorized access.
d) Privileged (system) account
A privileged account is a login credential to a server, firewall, or another administrative account. Often, privileged accounts are
referred to as admin accounts. Your Local Windows Admin accounts and Domain Admin accounts are examples of admin accounts.
Other examples are Unix root accounts, Cisco enable, etc.
e) Database audit
Database auditing involves observing a database so as to be aware of the actions of database users. Database administrators and
consultants often set up auditing for security purposes, for example, to ensure that those without the permission to access information
do not access it.
f) Audit trail
Whenever an action is performed on database resources, an audit trail of information is generated, including which database object was impacted, who performed the operation, and when. If the DBMS supports a very high level of auditing, a record of what actually changed might also be maintained.
g) Granting a privilege
The GRANT (privilege) statement grants privileges on the database as a whole or on individual tables, views, sequences or
procedures. It controls access to database objects, roles, and DBMS resources. Details about using the GRANT statement with role
objects is described in GRANT (role).
h) Revoking a privilege
The REVOKE SQL Definition Privileges authorization statement removes from one or more users or groups the privilege of
performing selected actions on a specified access module, schema, table, view, function, procedure or table procedure.
i) Covert channels
A covert channel is any communication channel that can be exploited by a process to transfer information in a manner that violates the system's security policy. In short, covert channels transfer information using non-standard methods that circumvent the system design.
23. What is mixed fragmentation? Give an example.
Hybrid Data Fragmentation:
This is the combination of horizontal and vertical fragmentation: horizontal fragmentation distributes subsets of rows over the database, and vertical fragmentation distributes subsets of the table's columns. This type of fragmentation can be done in any order; it has no particular order and is based solely on user requirements, but it must satisfy the fragmentation conditions. Consider the EMPLOYEE table with the fragmentations below.
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible
fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original
table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
• At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal
fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical
fragments.
SELECT EMP_ID, EMP_FIRST_NAME, EMP_LAST_NAME, AGE
FROM EMPLOYEE WHERE EMP_LOCATION = 'INDIA';
SELECT EMP_ID, DEPTID FROM EMPLOYEE WHERE EMP_LOCATION = 'INDIA';
SELECT EMP_ID, EMP_FIRST_NAME, EMP_LAST_NAME, AGE
FROM EMPLOYEE WHERE EMP_LOCATION = 'US';
SELECT EMP_ID, PROJID FROM EMPLOYEE WHERE EMP_LOCATION = 'US';
This is a hybrid or mixed fragmentation of EMPLOYEE table.
24. How is horizontal partitioning of a relation specified? How can a relation be put back together from a complete horizontal
partitioning?
Horizontal partitioning divides a table into multiple tables. Each table then contains the same number of columns, but fewer rows. For
example, a table that contains 1 billion rows could be partitioned horizontally into 12 tables, with each smaller table representing one
month of data for a specific year.
To horizontally partition a table, select a single table in a model, and click the Horizontal Partition icon on the Transformations
toolbar. Use the Horizontal Partitioning Wizard to:
▪ Specify how many partitioned tables to create.
▪ Enter a name for the partitioned tables.
▪ Enter, for notational purposes only, criteria for how you place rows from the table you choose to partition into the new
partitions. You can enter a script (SQL SELECT statement) and store the text for annotation purposes.
Result of Horizontally Partitioning a Table
When you click Horizontal Partition to horizontally partition a table, you:
▪ Create a new table for each partition that you specify. Each partitioned table contains all primary key and non-key
columns from the source table. The primary key of each partitioned table is the primary key of the source table. The
partitioned tables appear in the Model Explorer under the Tables folder.
▪ Create all relationships associated with the source table that you horizontally partition, and preserves all migrating keys.
▪ Preserve the properties from the source columns. The properties from the source table are not preserved.

The primary key is duplicated in each partition to allow the original table to be reconstructed, using the UNION operation to combine the partitions.
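A minimal sketch of that reconstruction, assuming the table was horizontally partitioned into hypothetical monthly tables:

-- Reassemble a complete horizontal partitioning. UNION ALL suffices
-- because the partitions are disjoint, so no duplicate elimination is needed.
SELECT * FROM Sales_2023_Jan
UNION ALL
SELECT * FROM Sales_2023_Feb;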

25. How is vertical partitioning of a relation specified? How can a relation be put back together from a complete vertical partitioning?
Ans. Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized. The primary key is duplicated in each partition to allow the original table to be reconstructed, using the JOIN operation.
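A minimal sketch of that reconstruction, assuming EMPLOYEE was split into two hypothetical vertical fragments that both carry the primary key:

-- Rebuild the original rows by joining the fragments on the duplicated key.
SELECT f1.EMP_ID, f1.EMP_FIRST_NAME, f1.EMP_LAST_NAME, f2.DEPTID
FROM Employee_Part1 f1
JOIN Employee_Part2 f2 ON f1.EMP_ID = f2.EMP_ID;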
26. Consider the following relation:
i) Give an example of two simple predicates that would be meaningful for the relation for horizontal partitioning.
ii) Give an example of two simple predicates that would be meaningful for the relation for vertical partitioning.
Ans. Horizontal: Let R be a relation, and A1, ..., An be its attributes with the corresponding domains Dom(A1), ..., Dom(An). A predicate is a pure boolean expression over the attributes of a relation R and constants of the attributes' domains. An atomic predicate p is a relationship among attributes and constants of a relation; for example, (A1 < A2) and (A3 >= 5) are atomic predicates. The set of all predicates over a relation R is then:
φ ::= p | ¬φ | φ1 ∧ φ2 | φ1 ∨ φ2
We define horizontal partitioning as a pair (R, φ), where R is a relation and φ is a predicate, which partitions R into at most 2 fragments (sub-relations) with identical structure (i.e., the same set of attributes), one per truth value of φ. The first fragment includes all tuples t of R which satisfy φ, i.e., t ⊨ φ; the second fragment includes all tuples t of R which do not satisfy φ, i.e., t ⊭ φ. One of the fragments may be empty, if all tuples of R either satisfy or do not satisfy φ. Note that the partitioning (R, φ) is identical to (R, ¬φ). If we apply the predicate true (or false) to a relation, it remains undivided.
Example 1. Let R = (A1 int, A2 int, A3 date) be a relation. It can be divided into 2 partitions by using one of the following predicates:
– φ = (A1 = A2), which results in a fragment where the values of A1 and A2 are equal for all tuples, and a fragment where those values differ.
– φ = (A3 >= '01-01-07') ∧ (A3 ...
29. How do spatial databases differ from regular databases? What are the different types of spatial
data?
SPATIAL DATABASE vs REGULAR (NON-SPATIAL) DATABASE:
• A spatial database answers where things are; a regular database answers what and how much things are.
• Spatial data describes the absolute and relative location of geographical objects; non-spatial data describes characteristics of geographical features that are qualitative or quantitative in nature.
• Spatial data is stored in a shapefile or geodatabase; non-spatial data is stored in a database table.
• Spatial data is generally multi-dimensional and auto-correlated; non-spatial data is generally one-dimensional and independent.
• Satellite maps and scanned images help to obtain spatial data; forest managers, fire departments, environmental groups, and online media help to obtain non-spatial data.
• Relationships among spatial attributes are implicit: for example, boundaries 1 and 2 could be neighbours, but this cannot be explicitly represented. Relationships among non-spatial attributes are explicit: two different attributes may be a part of, a subclass of, or a member of another, or represented in the form of arithmetic values or orders.
• Types of spatial data: raster data (composed of grids or pixels and identified by rows and columns) and vector data (composed of points, lines, and polygons). Types of non-spatial data: nominal data, ordinal data, interval data, and ratio data.
• Examples of spatial data are maps, photographs, satellite images, scanned images, roads, rivers, contours, etc.; examples of non-spatial data are names, phone numbers, area, postal code, rainfall, population, etc.

30. What is an Object identifier? Explain with an example. What are its advantages and
disadvantages?
An identifier is a string of characters (up to 255 characters in length) used to identify first-class Snowflake “named” objects,
including table columns:

• Identifiers are specified at object creation time and then are referenced in queries and DDL/DML statements.
• Identifiers can also be defined in queries as aliases (e.g. SELECT a+b AS "the sum";).

Object identifiers, often simply referred to as object names, must be unique within the context of the object type and the “parent”
object:

Account

Identifiers for account objects (users, roles, warehouses, databases, etc.) must be unique across the entire account.

Databases

Identifiers for schemas must be unique within the database. To enable resolving schemas that have the same identifiers across
databases, Snowflake supports fully-qualifying the schema identifiers in the form of:

<database_name>.<schema_name>

Schemas

Identifiers for schema objects (tables, views, file formats, stages, etc.) must be unique within the schema. To enable resolving
objects that have the same identifiers in different databases/schemas, Snowflake supports fully-qualifying the object identifiers
in the form of:

<database_name>.<schema_name>.<object_name>

Tables

Identifiers for columns must be unique within the table.

What are the advantages of Object IDs?

The main advantages of Object IDs are:
i) IDs do not change.
ii) IDs are completely independent of changes in data values and physical location.
iii) IDs provide a uniform mechanism for referencing objects.
Chapter 3. Data Warehouse, Dimensional Modelling and OLAP

1] Explain the need and features of a data warehouse.


Ans. 1 NEED
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this, a data warehouse also provides Online Analytical Processing (OLAP) tools, which help in interactive and effective analysis of data in a multidimensional space. This analysis supports data generalization and data mining.
A data warehouse is a database which is kept separate from the organization's operational database.
● There is no frequent updating done in a data warehouse.

● It possesses consolidated historical data, which helps the organization to analyze its business.

● A data warehouse helps executives to organize, understand, and use their data to make strategic decisions.

● Data warehouse systems help in the integration of a diversity of application systems.

● A data warehouse system helps in consolidated historical data analysis.


Need of Data Warehousing

There are five reasons behind the need for data warehousing:

1. Non-technical people: business users are non-technical people who need to gather information in a summarized, elementary fashion; this function is fulfilled by data warehousing.
2. Storing historical data: time-variant data from the past needs to be stored for future use.
3. Strategic decision making: data warehousing helps in making strategic decisions based on the data held in the warehouse.
4. Data consistency and quality: data warehousing helps in maintaining the consistency and uniformity of the data, even though it has been derived from heterogeneous sources.
5. Fast response time: data warehousing provides a significant degree of flexibility and fast response times, which helps it deal with heavy loads and many queries.

2 FEATURES
The four characteristics of a data warehouse, also called features of a data warehouse, are: subject oriented, time variant, integrated, and non-volatile. Three prominent ones among these are integrated, time variant, and non-volatile; subject oriented, on the other hand, is a unique feature of the data warehouse. These features differentiate a data warehouse from any other set of databases or data.
1. Subject Oriented

Analysis of the data for the decision makers of a business can be done easily by restricting attention to a particular subject area of the data warehouse. This makes understanding and analysis of the data concise and straightforward by excluding information that is not needed for decision-making on that subject. This means that the ongoing operations of an organization are not taken into consideration.
2. Integrated

Data warehouses consist of data from different sources integrated under one platform. The data obtained is extracted and transformed to maintain uniformity regardless of the source it was obtained from; this feature is known as integration. Standards which are universally acceptable are established for the data present in the warehouse.

3. Time Variant

One of the important properties of the data warehouse is the historical perspective it holds. It keeps a huge volume of data from all databases stored in accordance with elements of time, consisting of a temporal element and an extensive time horizon. The inability to change the element of time is an essential aspect of time variance; a record key is used to capture the time variance.
4. Non-Volatile

Data enters the data warehouse through periodic loads, which protects it from momentary changes: once data is loaded, no alteration or change can be made to it. This inability to be erased is called the non-volatile character of the data warehouse environment. The data is read-only and allows only two functions to be performed: access and loading.

2] What is a data warehouse? How does it differ from a database?

https://www.geeksforgeeks.org/difference-between-database-system-and-data-warehouse/

3] What is the multidimensional data model? How is it used in a data warehouse?

https://www.geeksforgeeks.org/multidimensional-data-model/

4] Differentiate between OLAP and OLTP systems.

https://www.geeksforgeeks.org/difference-between-olap-and-oltp-in-dbms/

5] Define Star Schema, Snowflake Schema, Fact Constellation Schema, and Data Mart.

Schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.

A Star Schema in a data warehouse has at the center of the star one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema data model is the simplest type of data warehouse schema. It is also known as a Star Join Schema and is optimized for querying large data sets.

In the following star schema example, the fact table is at the center and contains keys to every dimension table (Dealer_ID, Model_ID, Date_ID, Product_ID, Branch_ID) along with other attributes like units sold and revenue.

Example of Star Schema Diagram
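A minimal DDL sketch of this shape; all table and column names are illustrative, echoing the example above:

-- Dimension tables hold descriptive attributes; the central fact table
-- references each dimension by key and carries the measures.
CREATE TABLE DimProduct (Product_ID INT PRIMARY KEY, ProductName NVARCHAR(50));
CREATE TABLE DimDealer (Dealer_ID INT PRIMARY KEY, DealerName NVARCHAR(50));
CREATE TABLE FactSales (
    Product_ID INT NOT NULL REFERENCES DimProduct (Product_ID),
    Dealer_ID INT NOT NULL REFERENCES DimDealer (Dealer_ID),
    Units_Sold INT,
    Revenue DECIMAL(12, 2)
);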

What is a Snowflake Schema?

A Snowflake Schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema that adds further dimension tables: the dimension tables are normalized, which splits the data into additional tables.
In the following snowflake schema example, Country is further normalized into an individual table.

Example of Snowflake Schema

Fact Constellation Schema


● A fact constellation has multiple fact tables. It is also known as galaxy schema.

● The following diagram shows two fact tables, namely sales and shipping.
● The sales fact table is the same as that in the star schema.

● The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.

● The shipping fact table also contains two measures, namely dollars sold and units sold.

● It is also possible to share dimension tables between fact tables. For example, time, item, and location dimension tables

are shared between the sales and shipping fact table.

Data Marts

https://www.javatpoint.com/data-warehouse-what-is-data-mart

● A data mart is a subset of a data warehouse which helps provide results for large-sized data by partitioning the data.

● It is less time consuming to build and is cheaper.

● Data marts can be created in the same database as the warehouse or in a separate database.

6] Describe the characteristics of a data warehouse.

https://www.geeksforgeeks.org/characteristics-and-functions-of-data-warehouse/
