Advanced Database System

Set 1

1. List and explain various Normal Forms. How does BCNF differ from the Third Normal Form and 4th Normal Form?

Normalization is the process of designing a data model to efficiently store data in a database. The end result is that redundant data is eliminated, and each table stores only data related to the entity it describes.

First Normal Form (1NF): A relation is in 1NF if it has only single-valued attributes; neither repeating groups nor arrays are permitted.

Second Normal Form (2NF): A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the primary key.

Third Normal Form (3NF): A relation is in 3NF if it is in 2NF and has no transitive dependencies.

Boyce-Codd Normal Form (BCNF): A relation is in BCNF if and only if every determinant in the relation is a candidate key.

Fourth Normal Form (4NF): A relation is in 4NF if it is in BCNF and contains no non-trivial multi-valued dependencies.

Fifth Normal Form (5NF): A relation is in 5NF if and only if every join dependency in the relation is implied by the candidate keys of the relation.

Domain-Key Normal Form (DKNF): A relation is in DKNF if it is free of all modification anomalies. Insertion, deletion, and update anomalies come under modification anomalies.
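A short sketch can make the 1NF rule concrete. The employee table, names, and phone numbers below are invented for illustration; the point is only that a multi-valued attribute is flattened into one row per atomic value.

```python
# Illustrative sketch (not from the original text): flattening a relation
# with a multi-valued attribute into First Normal Form (1NF).

# A row whose "phones" attribute holds a list of values violates 1NF.
unnormalized = [
    {"emp_id": 1, "name": "Asha", "phones": ["555-0100", "555-0101"]},
    {"emp_id": 2, "name": "Ravi", "phones": ["555-0200"]},
]

def to_1nf(rows):
    """Emit one row per atomic phone value, so every attribute is single-valued."""
    flat = []
    for row in rows:
        for phone in row["phones"]:
            flat.append({"emp_id": row["emp_id"], "name": row["name"], "phone": phone})
    return flat

normalized = to_1nf(unnormalized)
for r in normalized:
    print(r)  # three rows, each with a single atomic phone value
```

In a real schema the flattened rows would carry a composite key such as (emp_id, phone).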

Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF)


Third normal form states that a table must have no transitive dependencies. A transitive dependency is when two columnar relationships imply another relationship. For example, if name -> extension and extension -> store_location, then name -> store_location, which is not a dependency we want to model and could lead to faulty data. What happens if an employee changes extension and the old sales records aren't updated? Or if a customer moves? These entry points for error are problematic but acceptable in 3NF because the dependencies are trivial.

What we need to look to is BCNF, or Boyce-Codd Normal Form. BCNF extends 3NF by stating that no non-trivial functional dependency can exist on anything other than a superkey, a superset of the candidate keys. Each fact is then represented in a single row in a single table: an employee's extension and manager are listed once in the Employees table, rather than repeatedly (and probably erroneously) in the Sales table. Information about employees, customers, and products is isolated from the Sales table. In such a design, a row can be uniquely identified by each candidate key individually, and no column depends on any other non-key column to identify the row. There are obvious practical problems with this example, such as product price changes breaking sales records, but that is a matter for another discussion.

Differences between BCNF and 4NF (Fourth Normal Form):
· A database must already be in 3NF to take it to BCNF, but it must be in both 3NF and BCNF to reach 4NF.
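The BCNF condition above can be checked mechanically with attribute closures. The following is a hedged sketch, using the employee example from the text (name -> extension, extension -> store_location); the schema and FD representation are our own, not part of the original material.

```python
# Sketch: test whether a relation schema is in BCNF by checking that the
# determinant (left-hand side) of every functional dependency is a superkey.

def closure(attrs, fds):
    """Compute the closure of a set of attributes under a list of FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    """BCNF: for every FD X -> Y, X must be a superkey (X+ covers the schema)."""
    return all(closure(lhs, fds) == schema for lhs, rhs in fds)

schema = {"name", "extension", "store_location"}
fds = [({"name"}, {"extension"}),
       ({"extension"}, {"store_location"})]  # extension is not a candidate key

print(is_bcnf(schema, fds))  # False: extension -> store_location violates BCNF
```

Decomposing into (name, extension) and (extension, store_location) would remove the violation.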

· In BCNF there can still be multi-valued dependency data in the tables; in fourth normal form there are no non-trivial multi-valued dependencies.

3. Describe the heuristics of Query optimization.

Query Optimization

The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness. The traditional relational DB approach to query optimization is to transform the query into an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:
1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.
2. Execute join operations to further reduce the result set.
3. Execute operations on media data last, since these can be very time consuming.
4. Prepare the result set for presentation.

Equivalence of Expressions

The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We'll use the following relations as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)

Selection Operation

Consider the query to find the assets and branch-names of all banks that have depositors living in Port Chester. In relational algebra, this is

Π bname, assets (σ ccity="Port Chester" (customer ⋈ deposit ⋈ branch))

This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are interested in only a few tuples. We can see that we only want tuples for which ccity = "Port Chester". Thus we can rewrite our query as:

Π bname, assets (σ ccity="Port Chester" (customer) ⋈ deposit ⋈ branch)

This should considerably reduce the size of the intermediate relation.

Projection Operation

Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

Π bname, assets ((σ ccity="Port Chester" (customer) ⋈ deposit) ⋈ branch)

When we compute the subexpression σ ccity="Port Chester" (customer) ⋈ deposit, we obtain a relation whose scheme is (cname, street, ccity, bname, account#, balance). We can eliminate several attributes from this scheme; the only ones we need to retain are those that appear in the result of the query or are needed to process subsequent operations. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

Π bname, assets ((Π bname (σ ccity="Port Chester" (customer) ⋈ deposit)) ⋈ branch)

Note that there is no advantage in doing an early projection on a relation before it is needed for some other operation: we would access every block of the relation to remove attributes, and then access every block of the reduced-size relation when it is actually needed. We would do more work in total, rather than less!

Natural Join Operation

Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

(customer ⋈ deposit) ⋈ branch = customer ⋈ (deposit ⋈ branch)

Although these expressions are equivalent, the costs of computing them may differ. deposit ⋈ branch is likely to be a large relation, as it contains one tuple for every account. However, if we compute σ ccity="Port Chester" (customer) ⋈ deposit first, we get a reasonably small relation: it has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than deposit ⋈ branch.

Natural join is also commutative, so we could rewrite our relational algebra expression to join customer with branch first. But there are no common attributes between customer and branch, so this join is a Cartesian product: lots of tuples! If a user entered such an expression, we would want to use the associativity and commutativity of natural join to transform it into the more efficient expression derived earlier (join with deposit first, then with branch).

One of the main heuristic rules is therefore to apply SELECT and PROJECT operations before applying JOIN or other binary operations. SELECT and PROJECT reduce the size of a file and hence should be applied before a join or other binary operation, because the size of the file resulting from a binary operation — such as JOIN — is usually a multiplicative function of the sizes of the input files.

· Cost-based optimization is expensive, even with dynamic programming.
· Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.

· Some systems use only heuristics; others combine heuristics with partial cost-based optimization.

Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
· Perform selection early (reduces the number of tuples).
· Perform projection early (reduces the number of attributes).
· Perform the most restrictive selection and join operations (i.e. with the smallest result size) before other similar operations.

Eg: Π customer_name ((σ branch_city="Brooklyn" (branch) ⋈ account) ⋈ depositor)
1) When we compute σ branch_city="Brooklyn" (branch) ⋈ account, we obtain a relation whose schema is (branch_name, branch_city, assets, account_number, balance).
2) Push projections using equivalence rules; eliminating unneeded attributes from intermediate results gives:
Π customer_name ((Π account_number (σ branch_city="Brooklyn" (branch) ⋈ account)) ⋈ depositor)
3) Performing the projection as early as possible reduces the size of the relation to be joined.
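The "selection early" heuristic can be demonstrated on toy data for the bank schema used above. The tuples below are invented for illustration; the point is that pushing the selection below the join shrinks the intermediate result without changing the answer.

```python
# Sketch: pushing a selection below a join reduces the number of tuples joined.
# Toy instances of customer(cname, street, ccity) and deposit(bname, account#, cname, balance).

customer = [("Ann", "Main St", "Port Chester"), ("Bob", "Oak St", "Albany"),
            ("Cui", "Elm St", "Port Chester")]
deposit = [("Downtown", 101, "Ann", 500), ("Uptown", 102, "Bob", 900),
           ("Downtown", 103, "Cui", 700)]

def join_customer_deposit(cust, dep):
    """Natural join on cname (customer[0] == deposit[2])."""
    return [(c, d) for c in cust for d in dep if c[0] == d[2]]

# Naive plan: join first, select afterwards.
joined = join_customer_deposit(customer, deposit)
naive = [(c, d) for c, d in joined if c[2] == "Port Chester"]

# Heuristic plan: select on customer first, then join.
selected = [c for c in customer if c[2] == "Port Chester"]
pushed = join_customer_deposit(selected, deposit)

print(len(joined), len(pushed))  # 3 2 — the pushed-down plan builds a smaller intermediate
assert naive == pushed           # same answer either way
```

On real relations the gap is far larger: the join output grows with the product of the input sizes, while the selection discards tuples before they are ever joined.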

2. Describe the Structural Semantic Data Model (SSM) with relevant examples.

Ans: SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects. Its central concepts are:

· Entity (object): Something of interest to the information system about which data is collected. Examples: a person, department, employee, student, customer, product, exam, order.

· Entity type: A set of entities sharing common attributes. Examples: Person (Name, Address, ...); citizens of Norway.

· Subclass / superclass entity type: A subclass entity type is a specialization of, or alternatively a role played by, a superclass entity type. Examples (Subclass IS_A Superclass): a Student IS_A Person; a Teacher IS_A Person.

· Shared subclass entity type: A subclass entity type that has characteristics of two or more parent entity types. Example: a student-assistant IS_BOTH a student and an employee.

· Category entity type: A subclass entity type of two or more distinct / independent superclass entity types. Example: an owner IS_EITHER a Person or an organization.

4. Describe the following with respect to Object Oriented Databases:
a. Query Processing in Object-Oriented Database Systems
b. Query Processing Architecture

Query Processing in Object-Oriented Database Systems

One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to brand first-generation (network and hierarchical) DBMSs as object-oriented. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is now accepted as one of the fundamental features of OODBMSs. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization, and commercial products have started to include such languages as well, e.g. O2 and ObjectStore. In this section we discuss the issues related to the optimization and execution of OODBMS query languages (which we collectively call query processing).

Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization, which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model, since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model.

Despite this close relationship, there are a number of issues that make query processing more difficult in OODBMSs. In this unit we do not consider issues related to the design of object models, query models, or query languages in any detail. The following are some of the more important issues:

Type System

Relational query languages operate on a simple type system consisting of a single aggregate type: the relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inference schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g. set, bag, list), which imposes additional requirements on the type inference schemes to determine the type of the results of operations on collections of different types.

Encapsulation

Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, encapsulation raises questions about the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly. Others propose a mechanism whereby objects "reveal" their costs as part of their interface. Second, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written in a general-purpose programming language; estimating the cost of executing a method is considerably more difficult than estimating the cost of accessing an attribute along an access path.

Complex Objects and Inheritance

Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages, and we discuss it in some detail in this unit. Furthermore, objects belong to types related through inheritance hierarchies. Efficient access to objects through their inheritance hierarchies is another problem that distinguishes object-oriented from relational query processing.

Object Models

OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g. object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to amortize on the experiences of others. This diversity of approaches is likely to prevail for some time; therefore, it is important to develop extensible approaches to query processing that allow experimentation with new ideas as they evolve. We provide an overview of various extensible object query processing approaches.

Query Processing Methodology

A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. Figure 6.1 depicts such a methodology. Queries are expressed in a declarative language, which requires no user knowledge of object implementations, access paths or processing strategies. The resulting calculus expression is then processed in the following steps:

1. Calculus Optimization
2. Calculus-to-Algebra Transformation
3. Type check
4. Algebra Optimization
5. Execution Plan Generation
6. Execution
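The ordering of these steps can be sketched as a simple pipeline. The stage functions below are placeholders of our own invention, not a real OODBMS optimizer; each stage would in practice rewrite the query representation it receives.

```python
# Illustrative sketch only: the methodology's steps modeled as a pipeline.

def calculus_optimization(q):      return q  # simplify the calculus expression
def calculus_to_algebra(q):        return q  # translate calculus into object algebra
def type_check(q):                 return q  # verify operator/operand type compatibility
def algebra_optimization(q):       return q  # reorder and rewrite algebra operators
def execution_plan_generation(q):  return q  # map operators to physical operations
def execution(q):                  return q  # run the plan against the object store

PIPELINE = [calculus_optimization, calculus_to_algebra, type_check,
            algebra_optimization, execution_plan_generation, execution]

def process_query(declarative_query):
    """Thread the query through every stage in order, recording the sequence."""
    result, trace = declarative_query, []
    for stage in PIPELINE:
        result = stage(result)
        trace.append(stage.__name__)
    return result, trace

_, trace = process_query("select s from Students s where s.age < 25")
print(trace[0], "->", trace[-1])  # calculus_optimization -> execution
```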

5. Describe the theory of Fuzzy Querying to Relational Databases.

The concept of Fuzzy Logic (FL) was conceived by Lotfi Zadeh, a professor at the University of California at Berkeley, and presented not as a control methodology, but as a way of processing data by allowing partial set membership rather than crisp set membership or non-membership. This approach to set theory was not applied to control systems until the 70's, due to insufficient small-computer capability prior to that time. Professor Zadeh reasoned that people do not require precise, numerical information input, and yet they are capable of highly adaptive control. If feedback controllers could be programmed to accept noisy, imprecise input, they would be much more effective and perhaps easier to implement. Fuzzy Logic requires some numerical parameters in order to operate, such as what is considered significant error and significant rate-of-change-of-error, but exact values of these numbers are usually not critical unless very responsive performance is required, in which case empirical tuning would determine them.

The proposed model

The easiest way of introducing fuzziness in the database model is to use a classical relational database and formulate a front end to it that allows fuzzy querying. A limitation imposed on the system is that, because we are neither extending the database model nor defining a new model, the underlying database model is crisp and hence the fuzziness can only be incorporated in the query.
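Partial set membership can be sketched with a trapezoidal membership function of the kind commonly used to define linguistic terms on an attribute domain such as AGE. The breakpoints below are invented for illustration; a real system would read them from its meta-knowledge table.

```python
# Hedged sketch: trapezoidal membership function for a fuzzy set.
# Membership rises from alpha to beta, is 1 between beta and gamma,
# and falls back to 0 at delta.

def trapezoid(x, alpha, beta, gamma, delta):
    """Degree of membership of x in the fuzzy set (0.0 to 1.0)."""
    if x <= alpha or x >= delta:
        return 0.0
    if beta <= x <= gamma:
        return 1.0
    if x < beta:
        return (x - alpha) / (beta - alpha)
    return (delta - x) / (delta - gamma)

YOUNG = (-1, 0, 25, 35)       # assumed breakpoints: fully "young" up to 25, fades out by 35
print(trapezoid(20, *YOUNG))  # 1.0 — a 20-year-old is fully YOUNG
print(trapezoid(30, *YOUNG))  # 0.5 — a 30-year-old is partially YOUNG
```

A fuzzy front end would use such a function to score crisp rows against a linguistic term like YOUNG and filter or rank them by membership degree.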

To incorporate fuzziness we introduce fuzzy sets (linguistic terms) on the attribute domains (linguistic variables). E.g., on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD, defined as shown in Fig. 8.4: Age.

For this we take the example of a student database with a table STUDENTS (its attributes and a snapshot of its data are shown in the accompanying figures).

Meta Knowledge

At the level of meta knowledge we need to add only a single table, LABELS, with the following structure:

e.e. the attributes on which the query is to be applied) and the linguistic term. Source Tables: The tables on which the query is to be applied.e.Beta. no fuzzy data is stored in the database. Implementation The main issue in the implementation of this system is the parsing of the input fuzzy query. it does not contain a linguistic term then it need not be subdivided. 8. Conditions: The conditions that have to be specified before the operation is performed. Describe the Differences between Distributed & Centralized Databases Differences in Distributed & Centralized Databases 1 Centralized Control vs. 6. the INSERT query will not change and need not be parsed therefore it can be presented to the database as it is. If the condition is not fuzzy i. Result Attributes: The attributes that are to be displayed used only in the case of the SELECT query. Gamma.6: Meta Knowledge Set 1 This table is used to store the information of all the fuzzy sets defined on all the attribute domains. It is further sub-divided into Query Attributes (i. As the underlying database is crisp. Delta: Stores the range of the fuzzy set. · Alpha. A description of each column in this table is as follows: · Label: This is the primary key of this table and stores the linguistic term associated with the fuzzy set. it is possible to use hierarchical control structure based on a "global database Page No. 2. 3. During parsing the query is parsed and divided into the following 1. i. 13 . Decentralized Control In centralized control one "database administrator" ensures safety of data whereas in distributed control. · Column_Name: Stores the linguistic variable associated with the given linguistic term.Advanced Database System Fig. DELETE or UPDATE. Query Type: Whether the query is a SELECT. 4.

2. Data Independence

In central databases, data independence means the actual organization of data is transparent to the application programmer: programs are written with a "conceptual" view of the data (the "conceptual schema") and are unaffected by the physical organization of the data. In distributed databases, another aspect, distribution transparency, is added to this notion: programs are written as if the data were not distributed, so the correctness of programs is unaffected by the movement of data from one site to another, although their speed of execution is affected.

3. Reduction of Redundancy

In centralized databases, redundancy was reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, since a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

4. Complex Physical Structures and Efficient Access

In centralized databases, complex access structures like secondary indexes and interfile chains are used, and these features provide efficient access to data. In distributed databases, efficient access also requires accessing data from different sites, for which an efficient distributed data access plan is needed; this can be written by the programmer or produced automatically by an optimizer. Problems faced in the design of such an optimizer fall into two categories: a) global optimization, which consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites, and b) local optimization, which consists of deciding how to perform the local database accesses at each site.

5. Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution therefore requires synchronization amongst the transactions, which is much harder in distributed systems.
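Transaction atomicity can be illustrated with the standard-library sqlite3 module: either both sides of a transfer happen or neither does. The accounts table, names, and amounts below are invented for the example.

```python
# Sketch: a transaction as an atomic unit — a failed step rolls back all steps.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically; roll back on any failure."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        cur = conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                           (amount, dst))
        if cur.rowcount == 0:                       # simulate a mid-transaction failure
            raise ValueError("unknown destination account")
        conn.commit()
    except Exception:
        conn.rollback()                             # atomicity: the debit is undone too

transfer(conn, "alice", "nobody", 30)               # fails and is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} — unchanged by the failed transfer
```

A distributed DBMS must guarantee the same all-or-nothing behavior across sites, which is what makes commit protocols (e.g. two-phase commit) necessary there.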

6. Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problem, along with two new aspects: (a) security (protection) problems intrinsic to the communication networks used, and (b) in databases with a high degree of "site autonomy", sites may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

7. Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are the varying processing capabilities and loadings of different nodes. If some nodes are updated less frequently than others, there may be a choice between querying a local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context, the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database, and (where data fragments are replicated) trade-offs between cost and currency.

8. Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information like fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), which are more detailed than in centralized databases.

Relative Advantages of Distributed Databases over Centralized Databases

Organizational and Economic Reasons

Many organizations are decentralized, and a distributed database approach fits the structure of the organization more naturally. The organizational and economic motivations are amongst the main reasons for the development of distributed databases. In organizations that already have several databases and feel the need for global applications, distributed databases are the natural choice.

Incremental Growth

In a distributed environment, expansion of the system in terms of adding more data, increasing database size, or adding more processors is much easier.

Reduced Communication Overhead

Many applications are local, and these applications have no communication overhead. Therefore, the maximization of the locality of applications is one of the primary objectives in distributed database design.

Performance Considerations

Data localization reduces contention for CPU and I/O services and simultaneously reduces the access delays involved in wide area networks. Local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Moreover, inter-query and intra-query parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of sub-queries that execute in parallel. This contributes to improved performance.

Reliability and Availability

Reliability is defined as the probability that a system is running (not down) at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate; only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site.

Management of Distributed Data with Different Levels of Transparency

In a distributed database, the following types of transparency are possible:

Distribution or Network Transparency

This refers to freedom for the user from the operational details of the network. It may be divided into location and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of the data and of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.

Replication Transparency

Copies of the data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of these copies.

Fragmentation Transparency

The two main types of fragmentation are horizontal fragmentation, which distributes a relation into sets of tuples (rows), and vertical fragmentation, which distributes a relation into sub-relations, each defined by a subset of the columns of the original relation. A global

query by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments.
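The two fragmentation styles can be sketched on a toy relation. The employee rows and the per-department site rule below are invented; the checks at the end show the defining property of each style — horizontal fragments union back to the relation, and vertical fragments join back on the key.

```python
# Illustrative sketch: horizontal vs. vertical fragmentation of a relation.

employees = [
    {"id": 1, "name": "Asha", "dept": "sales", "salary": 50000},
    {"id": 2, "name": "Ravi", "dept": "hr",    "salary": 45000},
    {"id": 3, "name": "Mei",  "dept": "sales", "salary": 52000},
]

# Horizontal fragmentation: each fragment is a subset of the tuples (rows),
# e.g. one fragment per department site.
sales_site = [r for r in employees if r["dept"] == "sales"]
hr_site = [r for r in employees if r["dept"] == "hr"]

# Vertical fragmentation: each fragment keeps a subset of the columns,
# plus the key so the relation can be reconstructed by a join on "id".
public_frag = [{"id": r["id"], "name": r["name"], "dept": r["dept"]} for r in employees]
payroll_frag = [{"id": r["id"], "salary": r["salary"]} for r in employees]

# Reconstruction: union the horizontal fragments; join the vertical ones on the key.
assert sorted(sales_site + hr_site, key=lambda r: r["id"]) == employees
rejoined = [{**p, **s} for p in public_frag for s in payroll_frag if p["id"] == s["id"]]
assert rejoined == employees
```

Fragmentation transparency means the DDBMS, not the user, performs this decomposition and reconstruction.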
