MC0077 – Advanced Database Systems – 4 Credits

(Book ID: B0882)

1. List and explain various Normal Forms. How does BCNF differ from the Third and Fourth Normal Forms?

Normal forms are hierarchical in nature. That is, the lowest level is the first normal form, and a database cannot meet the requirements of a higher normal form without first meeting all the requirements of the lower normal forms.

First Normal Form: Any table that qualifies as a relation is said to be in first normal form. To be considered relational, the cells of the table must contain only single values; repeating groups or arrays are not allowed as values. All entries in a column (attribute) must be of the same kind, each column must have a unique name, and each row in the table must be unique. Databases in first normal form are the weakest and are subject to all modification anomalies.

Second Normal Form: A relation is in second normal form if it is in first normal form and every non-key attribute depends on the whole key. This normal form removes partial dependencies, so it only matters for relations with composite keys.

Third Normal Form: A relation is in third normal form if it meets the criteria for second normal form and has no transitive dependencies, that is, no non-key attribute depends on the key through another non-key attribute.

BCNF and Third Normal Form: A relation that meets the third normal form criteria and in which every determinant is a candidate key is said to be in Boyce-Codd Normal Form (BCNF). BCNF is stricter than 3NF: 3NF still tolerates a functional dependency whose determinant is not a candidate key, provided the dependent attribute is part of some candidate key, whereas BCNF does not. BCNF therefore removes the anomalies caused by functional dependencies that 3NF leaves in place.

Fourth Normal Form: Fourth Normal Form (4NF) extends BCNF from functional dependencies to multi-valued dependencies. A schema is in 4NF if the left-hand side of every nontrivial functional or multi-valued dependency is a super-key. Every 4NF schema is thus also in BCNF, but a BCNF schema can still violate 4NF because of a multi-valued dependency.
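The BCNF condition above ("every determinant is a candidate key") can be checked mechanically with attribute closures. The following Python sketch is only an illustration under stated assumptions: the relation R(Student, Course, Instructor) and its two functional dependencies are a hypothetical textbook-style example, and the function names closure and bcnf_violations are made up for this sketch.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs.

    fds is a list of (lhs, rhs) pairs, each side a frozenset of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know all of lhs, we also know all of rhs.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """List the nontrivial dependencies whose left-hand side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not closure(lhs, fds) >= set(relation)
    ]


# Hypothetical relation: each instructor teaches exactly one course.
R = {"Student", "Course", "Instructor"}
FDS = [
    (frozenset({"Student", "Course"}), frozenset({"Instructor"})),
    (frozenset({"Instructor"}), frozenset({"Course"})),
]

# Only Instructor -> Course is reported: Instructor is a determinant
# but not a superkey, so R is not in BCNF.
violations = bcnf_violations(R, FDS)
```

In this example R still satisfies 3NF, because Course is part of the candidate key {Student, Course}; this is exactly the kind of case where 3NF and BCNF differ.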
2. Describe the concepts of Structural Semantic Data Model (SSM).

A data model in software engineering is an abstract model that describes how data are represented and accessed. Data models formally define data elements and relationships among data elements for a domain of interest. According to Hoberman (2009), "A data model is a way finding tool for both business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment." A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually data models are specified in a data modeling language.

A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.

Communication and precision are the two key benefits that make a data model important to applications that use and exchange data. A data model is the medium through which project team members from different backgrounds and with different levels of experience can communicate with one another. Precision means that the terms and rules on a data model can be interpreted only one way and are not ambiguous.

Data modeling in software engineering is the process of creating a data model by applying formal data model descriptions using data modeling techniques. Data modeling is a technique for defining business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database.

The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects. This is then used as the start point for interface or database design.

A semantic data model in software engineering is a technique to define the meaning of data within the context of its interrelationships with other data. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. A semantic data model is sometimes called a conceptual data model.

The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data because it is limited in scope and biased toward the implementation strategy employed by the DBMS. Therefore, the need to define data from a conceptual view has led to the development of semantic data modeling techniques, that is, techniques to define the meaning of data within the context of its interrelationships with other data. As illustrated in the figure, the real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores. Thus, the model must be a true representation of the real world.

Data architecture is the design of data for use in defining the target state and the subsequent planning needed to hit the target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture. A data architecture describes the data structures used by a business and/or its applications. There are descriptions of data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those data artifacts to data qualities, applications, locations etc. Essential to realizing the target state, data architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system.

3. Describe the following with respect to Object Oriented Databases:

a. Query Processing in Object-Oriented Database Systems

One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to brand first generation (network and hierarchical) DBMSs as object-oriented. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OODBMSs. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization. Commercial products have started to include such languages as well, e.g. O2 and ObjectStore.

Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization, which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

Type System: Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inference schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g., set, bag, list), which imposes additional requirements on the type inference schemes to determine the type of the results of operations on collections of different types.

Encapsulation: Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly. Others propose a mechanism whereby objects "reveal" their costs as part of their interface.

Complex Objects and Inheritance: Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages. We discuss this issue in some detail in this unit. Furthermore, objects belong to types related through inheritance hierarchies. Efficient access to objects through their inheritance hierarchies is another problem that distinguishes object-oriented from relational query processing.

Object Models: OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to amortize on the experiences of others. This diversity of approaches is likely to prevail for some time; therefore, it is important to develop extensible approaches to query processing that allow experimentation with new ideas as they evolve. We provide an overview of various extensible object query processing approaches.

b. Query Processing Architecture

Query Processing Methodology

A query processing methodology similar to relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. The figure depicts such a methodology. The steps of the methodology are as follows.

1. Queries are expressed in a declarative language
2. It requires no user knowledge of object implementations, access paths or processing strategies
3. The calculus expression is first
4. Calculus Optimization
5. Calculus Algebra Transformation
6. Type check
7. Algebra Optimization
8. Execution Plan Generation
9. Execution
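As a rough illustration of steps 4 to 9 above, the following Python sketch pushes a toy query through the same sequence of stages. Everything in it is a hypothetical stand-in chosen for this example: the Predicate class, the EMPLOYEE_TYPE description, the sample objects, and the "equality selections first" ordering rule are assumptions, not the behaviour of any actual OODBMS.

```python
from dataclasses import dataclass

# Hypothetical type description: attribute name -> expected Python type.
EMPLOYEE_TYPE = {"name": str, "age": int, "salary": int}


@dataclass(frozen=True)
class Predicate:
    attr: str
    op: str        # one of ">", "<", "=="
    value: object


OPS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}


def calculus_optimize(predicates):
    """Calculus optimization: drop duplicate predicates, keeping their order."""
    return list(dict.fromkeys(predicates))


def to_algebra(predicates):
    """Calculus-to-algebra transformation: one select operator per predicate."""
    return [("select", p) for p in predicates]


def type_check(algebra, type_desc):
    """Type check: every predicate must name a known attribute of the right type."""
    for _, p in algebra:
        if p.attr not in type_desc:
            raise TypeError(f"unknown attribute {p.attr!r}")
        if not isinstance(p.value, type_desc[p.attr]):
            raise TypeError(f"bad comparison value for {p.attr!r}")
    return algebra


def algebra_optimize(algebra):
    """Toy rewrite rule (an assumption): run equality selections first."""
    return sorted(algebra, key=lambda node: node[1].op != "==")


def generate_plan(algebra):
    """Execution plan generation: one filter function per select operator."""
    return [lambda obj, p=p: OPS[p.op](obj[p.attr], p.value) for _, p in algebra]


def execute(plan, objects):
    """Execution: keep the objects that pass every filter in the plan."""
    return [obj for obj in objects if all(step(obj) for step in plan)]


def process_query(predicates, type_desc, objects):
    algebra = type_check(to_algebra(calculus_optimize(predicates)), type_desc)
    return execute(generate_plan(algebra_optimize(algebra)), objects)


EMPLOYEES = [
    {"name": "Ann", "age": 41, "salary": 5200},
    {"name": "Bob", "age": 29, "salary": 4100},
    {"name": "Cy", "age": 41, "salary": 3900},
]
QUERY = [
    Predicate("salary", ">", 4000),
    Predicate("age", "==", 41),
    Predicate("salary", ">", 4000),   # duplicate, removed by calculus_optimize
]
RESULT = process_query(QUERY, EMPLOYEE_TYPE, EMPLOYEES)
```

A real optimizer would use cost estimates rather than a fixed ordering rule; the point of the sketch is only that each stage consumes the output of the previous one, so individual stages can be replaced without touching the others.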

4. Describe the Differences between Distributed & Centralized Databases.

Differences in Distributed & Centralized Databases

Centralized Control vs. Decentralized Control: In centralized control one "database administrator" ensures safety of data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", having the central responsibility for the whole data, along with "local database administrators", who have the responsibility for local databases.

Data Independence: In centralized databases it means the actual organization of data is transparent to the application programmer. The programs are written with a "conceptual" view of the data (called the "conceptual schema"), and the programs are unaffected by the physical organization of data. In distributed databases, another aspect, "distribution transparency", is added to the notion of data independence as used in centralized databases. Distribution transparency means programs are written as if the data were not distributed. Thus correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.

Reduction of Redundancy: In centralized databases redundancy was reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases data redundancy is desirable as (a) locality of applications can be increased if data is replicated at all sites where applications need it, (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

Complex Physical Structures and Efficient Access: In centralized databases complex accessing structures like secondary indexes and interfile chains are used. All these features provide efficient access to data. In distributed databases efficient access requires accessing data from different sites. For this an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer. Problems faced in the design of an optimizer can be classified in two categories: a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites. b) Local optimization consists of deciding how to perform the local database accesses at each site.

Integrity, Recovery and Concurrency Control: A transaction is an atomic unit of execution and atomic transactions are the means to obtain database integrity.

Failures and concurrency are two dangers of atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in all distributed systems.

Privacy and Security: In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problem as well as two new aspects of it: (a) security (protection) problems caused by communication networks are intrinsic to distributed database systems; (b) in certain databases with a high degree of "site autonomy", the owners of local data may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

Distributed Query Processing: The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as also are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context, the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.

Distributed Directory (Catalog) Management: Catalogs for distributed databases contain information like fragmentation description, allocation description, mappings to local names, access method description, statistics on the database, and protection and integrity constraints (consistency information), which are more detailed as compared to centralized databases.

5. Explain the following:

a. Query Optimization

The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness.

The traditional, relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:

1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.
2. Execute join operations to further reduce the result set.
3. Execute operations on media data, since these can be very time consuming.
4. Prepare the result set for presentation.

   1. SELECT S.Name, S.Age, C.Name, C.Level, C.Description
   2. FROM Student S, Course C, TakenBy T
   3. WHERE S.Id = T.Sid and T.Cid = C.Id
   4. AND C.Level > 1
   5. AND C.Description LIKE '%data%' AND C.Description LIKE '%management%'
   6. AND S.Age > 25
   7. AND (CURRENT DATE - T.Date) < 3 YEARS
   8. ORDER BY C.Level, C.Name

Using the example query in the above Figure, a near optimal execution plan would be to execute the statements in the following order:

1. Clauses 4, 6 and 7 in any order. Each of these statements reduces the number of rows in their respective tables.
2. Clause 3. The join will further reduce the number of course tuples that satisfy the age and time constraints. This will be a reasonably quick operation if:
   - There are indexes on TakenBy.Sid and TakenBy.Cid so that an index join can be performed, and
   - The Course.Description clob has been stored outside of the Course table and is represented by a link to its location.

3. Clause 5 will now search only course descriptions that meet all other selection criteria. This will still be a time consuming serial search.
4. Finally, clause 8 will order the result set for presentation through the layout specified in clause 1.

b. Text Retrieval Using SQL3/Text Retrieval

SQL3 supports storage of multimedia data, such as text documents, in an O-R database using the blob/clob data types. However, the standard SQL3 specification does not include support for processing the media content, such as indexing or querying. Thus it is not possible to use standard SQL3 to locate documents based on an analysis of their content.

Most of the larger or-dbms vendors (IBM, Oracle, Ingres, Postgres, ...) have used the SQL3 UDT/UDF functionality to extend their or-dbms with management systems for media data. The approach used has been to add their own or purchased specialized media management systems to the basic or-dbms. The resulting or-dbms/mm (multimedia) conforms (to some degree) to the Multimedia Information Retrieval Systems, MIRS, envisioned by Lu (1999), as discussed in Ch. 6. Unfortunately, the result of this 'independent' activity is non-standard or-dbms/mm (multimedia) systems that differ in the functionality included and limit data retrieval from multiple or-dbms system types. For example, unified access to data stored in Oracle and DB2 systems is difficult, both in query formulation and result presentation. Basically, the new functionality includes:

- Indexing routines for the various types of media data, for example using:
  o Content terms for text data and
  o Color, shape, and texture features for image data.
- Selection operators for the SQL3 WHERE clause for specification of selection criteria for media retrieval.
- Text processing sub-systems for similarity evaluation and result ranking.

Information Retrieval systems, IRS, have been under development since the mid 1950s (Baeza-Yates & Ribeiro-Neto, 1999; Kowalski & Maybury, 2000; Lu, 1999). They provide search and retrieval functions for text document collections based on document structure, concepts of words, and grammar. It is functionality from these systems that has been added by or-dbms vendors to support management of multimedia data.

Since actual SQL3/TextRetrieval syntax varies between or-dbms/mm implementations, the examples used in the following are given in generic SQL3/TextRetrieval statements. Basic or-dbms/mm text retrieval functionality includes generation of multiple types of term indexes, as well as a contains operator with sub-operators for the WHERE clause. The contains operator is similar to an exact match query that gives a true/false result. This operator can be used with multiple search terms and operators that specify relationships between the search terms, for example: the Boolean operators AND, OR, NOT and location operators such as adjacent, or within the same sentence or paragraph for text documents, as illustrated in the following table. (Note that there are syntax and name variations between vendor implementations.)

   Operator type             Operators
   Term combination          AND, OR, NOT
   Term location             ADJACENT, NEAR, WITHIN
   Concept                   ABOUT, SIMILAR
   Various other operators   FUZZY, LIKE

Assuming that whole Web pages are stored in an OR-DB attribute Document.text, the following examples will retrieve the document in Figure 8.1, in addition to other documents containing the search terms.

   1) Select * from Document where Text CONTAINS ('Edvard' AND 'Grieg');
   2) Select * from Document where Text CONTAINS ('Edvard' ADJACENT 'Grieg');
   3) Select * from Document where Text ABOUT ('composers');

In processing the above queries, the SQL3/Text processing system utilizes the term indexes generated for the document set. Note that a term location index is required for query 2, as well as a thesaurus for query 3, while query 1 needs a frequency index if the retrieved documents are to be ranked by frequency of the search terms in the documents.

6. Describe the following:

a Data Mining Functions

Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described in this section.

Classification: Data Mining tools have to infer a model from the database, and in the case of Supervised Learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple and these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class. Once classes are defined the system should infer rules that govern the classification; therefore the system should be able to find the description of each class. The descriptions should only refer to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.

A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where LHS is true, RHS is also true or at least very probable. The categories of rules are:

a. Exact Rule – permits no exceptions, so each object of LHS must be an element of RHS
b. Strong Rule – allows some exceptions, but the exceptions have a given limit
c. Probabilistic Rule – relates the conditional probability P(RHS|LHS) to the probability P(RHS)

Other types of rules are classification rules, where LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.

Associations: Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

Sequential/Temporal patterns: Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of a customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who did the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.

Clustering/Segmentation: Clustering and Segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A Cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.

IBM – Market Basket Analysis example

IBM have used segmentation techniques in their Market Basket Analysis on POS transactions, where they separate a set of untagged input records into reasonable groups according to product revenue by market basket, i.e. the market baskets were segmented based on the number and type of products in the individual baskets. Each segment reports total revenue and number of baskets, and using a neural network 275,000 transaction records were divided into 16 segments. The following types of analysis were also available:

1. Revenue by segment
2. Baskets by segment
3. Average revenue by segment etc.

b Data Mining Techniques

Cluster Analysis: In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database as shown in the following diagram. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3 etc., which describe each of these subsets.

Induction: A database is a store of information but more important is the information which can be inferred from it. There are two main inference techniques available, i.e. deduction and induction.

- Deduction is a technique to infer information that is a logical consequence of the information in the database, e.g. the join operator applied to two relational tables, where the first concerns employees and departments and the second departments and managers, infers a relation between employees and managers.
- Induction has been described earlier as the technique to infer information that is generalized from the database, as in the example mentioned above to infer that each employee has a manager. This is higher level information or knowledge in that it is a general statement about objects in the database. The database is searched for patterns or regularities.

Induction has been used in the following ways within data mining.

Decision Trees: Decision Trees are a simple knowledge representation and they classify examples to a finite number of classes. The nodes are labeled with attribute names, the edges are labeled with possible values for this attribute and the leaves are labeled with different classes. Objects are classified by following a path down the tree, by taking the edges corresponding to the values of the attributes in an object. The following is an example of objects that describe the weather at a given time. The objects contain information on the outlook, humidity etc. Some objects are positive examples, denoted by P, and others are negative, i.e. N. Classification is in this case the construction of a tree structure, illustrated in the following diagram, which can be used to classify all the objects correctly.

Neural Networks: Neural Networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural Networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained Neural Network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions. Neural Networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:

- Sales Forecasting
- Industrial Process Control
- Customer Research
- Data Validation
- Risk Management
- Target Marketing etc.

Neural Networks use a set of processing elements (or nodes) analogous to Neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs that simply follow instructions in a fixed sequential order.

The structure of a neural network looks something like the following: The bottom layer represents the input layer, in this case with 5 inputs labeled X1 through X5. In the middle is something called the hidden layer, with a variable number of nodes. It is the hidden layer that performs much of the work of a network. The output layer in this case has two nodes, Z1 and Z2, representing output values we are trying to determine from the inputs. For example, predict sales (output) based on past sales, price and season (input).

Each node in the hidden layer is fully connected to the inputs, which means that what is learned in a hidden node is based on all the inputs taken together. Statisticians maintain that the network can pick up the interdependencies in the model. The following diagram provides some detail into what goes on inside a hidden node.

Figure: Inside a Node

Simply speaking a weighted sum is performed: X1 times W1 plus X2 times W2 on through X5 and W5. This weighted sum is performed for each hidden node and each output node and is how interactions are represented in the network.
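The weighted sum described above can be written out in a few lines of Python. This is a minimal sketch under stated assumptions: the input values, the three hidden nodes, and all weights are made-up numbers, and the activation function that practical networks usually apply to each sum is left out because the text describes only the weighted sum itself.

```python
def weighted_sum(inputs, weights):
    """Compute X1*W1 + X2*W2 + ... for one node."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    return sum(x * w for x, w in zip(inputs, weights))


def layer(inputs, weight_rows):
    """Apply the weighted sum once per node; each row holds one node's weights."""
    return [weighted_sum(inputs, row) for row in weight_rows]


# Five inputs X1..X5 (illustrative values).
X = [1, 2, 3, 4, 5]

# Three hidden nodes, each fully connected to all five inputs (made-up weights).
HIDDEN_WEIGHTS = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
]

# Two output nodes Z1 and Z2, each connected to all three hidden nodes.
OUTPUT_WEIGHTS = [
    [1, 1, 1],
    [2, 0, -1],
]

hidden = layer(X, HIDDEN_WEIGHTS)       # [1, 6, 15]
Z1, Z2 = layer(hidden, OUTPUT_WEIGHTS)  # 22 and -13
```

Because every hidden node sees all five inputs, a change to any single input can move both outputs, which is how the network represents interactions between inputs.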

The issue of where the network gets the weights from is important, but suffice to say that the network learns to reduce error in its prediction of events already known (i.e. past history).

The problems of using neural networks have been summed up by Arun Swami of Silicon Graphics Computer Systems. Neural networks have been used successfully for classification but suffer somewhat in that the resulting network is viewed as a black box and no explanation of the results is given. This lack of explanation inhibits confidence, acceptance and application of results. He also notes as a problem the fact that neural networks suffer from long learning times, which become worse as the volume of data grows.

The Clementine User Guide has the following simple diagram 7.6 to summarize a Neural Net trained to identify the risk of cancer from a number of factors.

On-line Analytical Processing: A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers which are optimized for handling specific data management problems. Until recently, organizations have tried to target Relational Database Management Systems (RDBMSs) for the complete spectrum of database applications. It is however apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an Object-Oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of On-Line Analytical Processing (OLAP). OLAP was a term coined by E F Codd (1993) and was defined by him as "the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data". Codd has developed rules or requirements for an OLAP system.

Consistent Reporting Performance -User Support ns and Aggregation Levels .