
1.

Describe the following with respect to databases:
a. Data Management Functions:- The primary objective of a data management system, DMS, is to provide efficient and effective management of the database. This includes providing functions for data storage, retrieval, secure modification, and DB integrity and maintenance. There are two principal quality measures for a DMS: efficiency and effectiveness. DMS efficiency is typically measured by the time and machine capacity used for data retrieval and storage, respectively; low time or storage requirements indicate high efficiency. Since these are somewhat conflicting objectives, trade-offs are necessary. DMS effectiveness is typically measured by the quality of service, for example: the correctness or relevance of retrieval or modification results, smooth or seamless presentation of multiple media data (particularly for video), or the security levels supported. Techniques used to reach these goals include:
- Index generation, used to increase the efficiency of data retrieval by speeding data location and thereby reducing retrieval time.
- Data compression, used to reduce storage and transmission requirements.
- User interfaces, which can enhance system effectiveness by supporting formulation of 'complete' information needs.
- Similarity algorithms, which seek to locate only data/documents that are relevant to the user query.
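The role of index generation in retrieval efficiency can be illustrated with a minimal inverted-index sketch. The data and function names below are purely illustrative, not part of any particular DMS:

```python
# Minimal inverted-index sketch: the index maps each term to the set of
# document ids containing it, so a lookup avoids scanning every document.

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index, term):
    """Return ids of documents containing term, without a serial scan."""
    return index.get(term.lower(), set())

docs = {1: "database index generation",
        2: "data compression reduces storage",
        3: "index speeds data location"}
index = build_index(docs)
print(sorted(lookup(index, "index")))   # [1, 3]
```

The trade-off mentioned above is visible even here: the index costs extra storage in exchange for faster retrieval.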

b. Database Design & Creation:- Prior to establishing a database, the database administrator, DBA, should create a data model to describe the intended content and structure of the DB. It is this model that is the basis for specification of the Data Definition Language, DDL, statements necessary for construction of the DB schema and the structure of the DB storage areas. Figure 3.2a illustrates part of a graphic data model, formed using the graphic syntax of the Entity-Relationship, ER, model defined by (Chen, 1976), for an example administrative database containing various media attributes: images, text.
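The step from data model to DDL to DB schema can be sketched with Python's built-in SQLite bindings. The PERSON table below is a hypothetical example, not the schema of Figure 3.2a:

```python
import sqlite3

# Hypothetical DDL derived from a simple ER model: an entity type PERSON
# with atomic attributes. The schema is illustrative only.
ddl = """
CREATE TABLE person (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    address TEXT
);
"""

conn = sqlite3.connect(":memory:")   # in-memory DB for the example
conn.executescript(ddl)              # compile the DDL into a DB schema

# The schema is now recorded in the database catalog:
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)   # ['person']
conn.close()
```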

c. Information & Data Retrieval:- The terms information and data are often used interchangeably in
the data management literature, causing some confusion in interpretation of the goals of different data management system types. It is important to remember that, despite the name of a data management system type, it can only manage data; these data are representations of information. Historically (since the late 1950s), however, a distinction has been made between: Data Retrieval, the retrieval of 'facts', commonly represented as atomic data about some entity of interest, for example a person's name; and Information Retrieval, the retrieval of documents, commonly text but also visual and audio, that describe objects and/or events of interest.
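The distinction can be sketched in a few lines: data retrieval matches facts exactly, while information retrieval ranks documents by estimated relevance. The term-overlap score below is a deliberately crude stand-in for real IR models, and all data is invented:

```python
# Data retrieval: exact match against atomic facts.
people = {"p1": "Grieg", "p2": "Munch"}
facts = [pid for pid, name in people.items() if name == "Grieg"]

# Information retrieval: rank documents by a crude relevance score
# (the number of terms a document shares with the query).
docs = {"d1": "edvard grieg norwegian composer",
        "d2": "norwegian fjords travel guide",
        "d3": "piano concerto by grieg"}
query = {"grieg", "norwegian"}
ranked = sorted(docs,
                key=lambda d: len(query & set(docs[d].split())),
                reverse=True)

print(facts)    # exactly the matching fact(s)
print(ranked)   # documents ordered by overlap with the query
```

Note that the fact lookup is either right or wrong, whereas the ranked list can be more or less relevant, which is why IR effectiveness is judged by measures such as relevance rather than simple correctness.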

2.

Describe the following with suitable examples:

a. Graphic vs. Declarative Data Models:- A data model is a tool used to specify the structure and (some) semantics of the information to be represented in a database. Even the implemented and populated DB is only a model of the real world as represented by the data. Depending on the model type used, a data model can be expressed in diverse formats, including:
- Graphic, as used in most semantic data model types, such as ER and extended/enhanced ER (EER) models.
- Tabular, as used to present the content of a DB schema.
- Lists of declarative statements, for example: data definition languages (DDL), as used in the relational model for relation definitions; metadata standards such as Dublin Core for specification of descriptive attribute-value pairs; and AI/deductive systems for specification of facts and rules.
Graphic models are easier for a human reader to interpret and check for completeness and correctness, while list-formed models are more readily converted to a set of data definition statements for compilation and construction of a DB schema.

b. Structural Semantic Data Model – SSM:- The Structural Semantic Model, SSM, first described in Nordbotten (1993a & b), is an extension and graphic simplification of the EER modeling tool first presented in the 1989 edition of (Elmasri & Navathe, 2003). SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects. SSM includes:
1. Three types of entity specifications: base (root), subclass, and weak.
2. Four types of inter-entity relationships: n-ary associative relationships and 3 types of classification hierarchies.
3. Four attribute types: atomic, multi-valued, composite, and derived.
4. Domain type specifications in the graphic model, including standard data types, binary large objects (blob, text, image, ...), and user-defined types (UDT) and functions (UDF).

Data modeling concepts (concept / definition / example(s)):
- Entity (object): something of interest to the Information System about which data is collected. Examples: a person, student, employee, customer, product, department, order, exam, ...
- Entity type: a set of entities sharing common attributes. Examples: Citizens of Norway; PERSON {Name, Address, Age, ...}
- Subclass entity type: a specialization of, alternatively a role played by, a super-class entity type (Subclass : Superclass). Examples: Student IS_A Person; Teacher IS_A Person.
- Shared subclass entity type: a subclass entity type having characteristics of 2 or more parent entity types. Example: a student-assistant IS_BOTH_A student and an employee.
- Category entity type: a subclass entity type of 2 or more distinct/independent super-class entity types. Example: an owner IS_EITHER_A Person or an Organization.
- Weak entity type: an entity type dependent on another for its identification and existence. Example: Education is (can be) a weak entity type, dependent on Person.

Attribute types (an attribute is the name given to a property of an entity or relationship type):
- Property: a characteristic of an entity. Example: Person.name = Joan.
- Atomic: an attribute having a single value. Example: Person.Id.
- Multivalued: an attribute with multiple values. Example: Person.Telephone# {home, office, mobile, fax}.
- Composite (compound): an attribute composed of several sub-attributes. Examples: Person.Address {Street, Nr, City, State, Post#}; Name {First, Middle, Last}.
- Derived: an attribute whose value depends on other values in the DB and/or environment. Examples: Person.age, as current_date - birth_date; Person.salary, calculated in relationship to current salary levels.

3.

Explain the following:

a. Query Optimization:- The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness. The traditional relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:
1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.
2. Execute join operations to further reduce the result set.
3. Execute operations on media data as late as possible, since these can be very time consuming.
4. Prepare the result set for presentation.

b. Text Retrieval Using SQL3/TextRetrieval:- SQL3 supports storage of multimedia data, such as text documents, in an or-database using the blob/clob data types. However, the standard SQL3 specification does not include support for such media content processing functions as indexing or searching using elements of the media content. For example, SQL3's support for a query to retrieve documents about famous Norwegian artists is limited to using a serial search of all documents using the pattern match operator 'LIKE'. Queries using this operator are likely to miss the Web sites dedicated to the composer.
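The select-and-project-before-join heuristic can be sketched with in-memory tables, where row counts stand in for cost. The tables and values are illustrative only, not a real optimizer:

```python
# Toy illustration of the execution heuristic: filtering (select) and
# trimming columns (project) before the join shrinks the search space.
employees = [  # (emp_id, name, dept_id)
    (1, "Ann", 10), (2, "Bo", 20), (3, "Cy", 10), (4, "Di", 30)]
departments = [  # (dept_id, dept_name)
    (10, "Sales"), (20, "R&D"), (30, "HR")]

# Step 1: select and project on the single table first.
wanted = [(e[0], e[2]) for e in employees if e[2] == 10]  # 2 rows, 2 cols

# Step 2: join the already-reduced result set.
result = [(emp_id, dname)
          for emp_id, dept_id in wanted
          for d_id, dname in departments if d_id == dept_id]

# Joining first would consider 4 x 3 = 12 row pairs; the reduced
# join considers only 2 x 3 = 6.
print(result)   # [(1, 'Sales'), (3, 'Sales')]
```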

The new functionality added by text retrieval extensions to SQL3 includes:
- Indexing routines for the various types of media data, for example using: content terms for text data, and color, shape, and texture features for image data.
- Selection operators for the SQL3 WHERE clause, for specification of selection criteria for media retrieval.
- Text processing sub-systems for similarity evaluation and result ranking, as discussed in CH.6.

4.

Describe the following:

a. Data Mining Functions:- Data mining methods may be classified by the function they perform or according to the class of application they can be used in.
Classification: Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as the predicted attributes, whereas the remaining attributes are called the predicting attributes. A combination of values for the predicted attributes defines a class.
A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where LHS is true, RHS is also true or is very probable. The categories of rules are:
- Exact Rule – permits no exceptions, so each object of the LHS must be an element of the RHS.
- Strong Rule – allows some exceptions, but the exceptions have a given limit.
- Probabilistic Rule – relates the conditional probability P(RHS|LHS) to the probability P(RHS).
Other types of rules are classification rules, where the LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.

b. Data Mining Techniques:- Some of the main techniques used in data mining are described in this section.
Cluster Analysis: In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database, as shown in the following diagram. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3 etc., which describe each of these subsets.
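The rule categories can be made concrete by computing the conditional probability P(RHS|LHS) over a toy set of observations (the data below is invented for illustration):

```python
# Each record states whether LHS and RHS hold for one object.
# P(RHS|LHS) = 1 would be an exact rule; close to 1, a strong rule;
# a probabilistic rule compares P(RHS|LHS) with the base rate P(RHS).
observations = [
    {"lhs": True,  "rhs": True},
    {"lhs": True,  "rhs": True},
    {"lhs": True,  "rhs": False},   # one exception: not an exact rule
    {"lhs": False, "rhs": False},
]

lhs_rows = [o for o in observations if o["lhs"]]
p_rhs_given_lhs = sum(o["rhs"] for o in lhs_rows) / len(lhs_rows)
p_rhs = sum(o["rhs"] for o in observations) / len(observations)

print(p_rhs_given_lhs)  # 2/3: the rule holds with exceptions
print(p_rhs)            # 0.5: the unconditional base rate
```

Since P(RHS|LHS) exceeds P(RHS) here, knowing the LHS raises the probability of the RHS, which is the sense in which a probabilistic rule "relates" the two quantities.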

Induction: A database is a store of information, but more important is the information which can be inferred from it. There are two main inference techniques available, i.e. deduction and induction.
Deduction is a technique to infer information that is a logical consequence of the information in the database, e.g. the join operator applied to two relational tables, where the first concerns employees and departments and the second departments and managers, infers a relation between employees and managers.
Induction has been described earlier as the technique to infer information that is generalised from the database, as in the example mentioned above to infer that each employee has a manager. This is higher level information or knowledge in that it is a general statement about objects in the database. The database is searched for patterns or regularities.
Decision Trees: Decision trees are a simple knowledge representation, and they classify examples to a finite number of classes: the nodes are labeled with attribute names, the edges are labeled with possible values for the attribute, and the leaves are labeled with the different classes. Objects are classified by following a path down the tree, by taking the edges corresponding to the values of the attributes in an object. A class can then be defined by a condition on the attributes. (Figure: Decision Tree Structure)
Rule Induction: A data mining system has to infer a model from the database; that is, it may define classes such that the database contains one or more attributes that denote the class of a tuple, i.e. the predicted attributes, while the remaining attributes are the predicting attributes. When the classes are defined, the system should be able to infer the rules that govern the classification; in other words, the system should find the description of each class.
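Classification by following a path down a decision tree can be sketched directly. The tree, attribute names, and classes below are invented for illustration:

```python
# A decision tree as nested dicts: internal nodes carry an attribute
# name, edges carry attribute values, and leaves carry a class label.
tree = {"attribute": "outlook",
        "edges": {
            "sunny": {"attribute": "windy",
                      "edges": {True: "stay_in", False: "go_out"}},
            "rainy": "stay_in"}}

def classify(node, obj):
    """Follow the edges matching obj's attribute values down to a leaf."""
    while isinstance(node, dict):
        node = node["edges"][obj[node["attribute"]]]
    return node

print(classify(tree, {"outlook": "sunny", "windy": False}))  # go_out
print(classify(tree, {"outlook": "rainy"}))                  # stay_in
```

Each root-to-leaf path corresponds to one classification rule, which is why rule induction and decision trees are closely related.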

5.

Describe the Data Mining Functions.

Associations: Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

A typical application, identified by IBM, that can be built using an association function is Market Basket Analysis. This is where a retailer runs an association operator over the point of sales transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand toaster is sold, customers also buy a set of kitchen gloves and matching cover sets."

Another example of the use of associations is the analysis of the claim forms submitted by patients to a medical insurance company. Every claim form contains a set of medical procedures that were performed on a given patient during one visit. By defining the set of items to be the collection of all medical procedures that can be performed on a patient, and the records to correspond to each claim form, the application can find, using the association function, relationships among medical procedures that are often performed together.

Sequential/Temporal Patterns: Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of a customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who did the repeated purchases. Such a situation is typical of a direct mail application, where for example a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.

Clustering/Segmentation: Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.

IBM – Market Basket Analysis example: IBM have used segmentation techniques in their Market Basket Analysis on POS transactions, where they separate a set of untagged input records into reasonable groups according to product revenue by market basket, i.e. the market baskets were segmented based on the number and type of products in the individual baskets.

6.

Discuss the following with respect to Distributed Database Systems:

a. Problem Areas of Distributed Databases:- Following are the crucial areas in a distributed database environment that need to be looked into carefully in order to make it a success. We shall be discussing these in much detail in the following sections:
1. Distributed Database Design
2. Distributed Query Processing
3. Distributed Directory Management
4. Distributed Concurrency Control
5. Distributed Deadlock Management
6. Reliability in Distributed DBMS
7. Operating System Support
8. Heterogeneous Databases

b. Transaction Processing Framework:- A transaction is always part of an application. At some time after its invocation by the user, the application issues a begin_transaction primitive; from this moment, all actions which are performed by the application, until a commit or abort primitive is issued, are to be considered part of the same transaction. Alternatively, the beginning of a transaction is implicitly associated with the beginning of the application, and a commit/abort primitive ends a transaction and automatically begins a new one, so that an explicit begin_transaction primitive is not necessary. In order to perform functions at different sites, a distributed application has to execute several processes at these sites. Let us call these processes the agents of the application. An agent is therefore a local process which performs some actions on behalf of an application.

Any transaction must satisfy the four properties:
Atomicity: Either all or none of the transaction's operations are performed. In other words, if a transaction is interrupted by a failure, its partial results are undone.
Consistency Preservation: A transaction is consistency preserving if its complete execution takes the database from one consistent state to another.
Isolation: Execution of a transaction should not be interfered with by any other transactions executing concurrently. It should appear that a transaction is being executed in isolation from other transactions. An incomplete transaction cannot reveal its results to other transactions before its commitment. This property is needed in order to avoid the problem of cascading aborts.
Durability (Permanency): Once a transaction has committed, the system must guarantee that the results of its operations will never be lost, independent of subsequent failures.

c. Models of Failure:- Failures can be classified as:
1) Transaction Failures:
a) Error in transaction due to incorrect data input.
b) Present or potential deadlock.
c) 'Abort' of transactions due to non-availability of resources or deadlock.
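The transaction properties, in particular atomicity, can be sketched with SQLite's transaction support. The two-account transfer below is an illustrative single-site example, not a distributed implementation:

```python
import sqlite3

# Sketch of atomicity: a failed transaction leaves no partial results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
conn.commit()

try:
    with conn:  # begin_transaction ... commit/abort as one unit
        conn.execute("UPDATE account SET balance = balance - 50 "
                     "WHERE id = 1")
        raise RuntimeError("simulated failure before the matching credit")
except RuntimeError:
    pass  # the 'with' block rolled the partial debit back (abort)

balances = [row[0] for row in
            conn.execute("SELECT balance FROM account ORDER BY id")]
print(balances)   # [100, 100]: the partial result was undone
conn.close()
```

Because the debit was never committed, no concurrent transaction could have observed it, which is the isolation guarantee that prevents cascading aborts.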

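Returning to the association rules of question 5: the confidence factor of a rule such as "customers who buy a toaster also buy gloves" can be computed directly from transaction records. The basket data and product names below are invented:

```python
# Each record is the set of product identifiers in one transaction.
baskets = [
    {"toaster", "gloves", "cover"},
    {"toaster", "gloves"},
    {"toaster", "bread"},
    {"gloves"},
    {"toaster", "gloves", "bread"},
]

lhs, rhs = {"toaster"}, {"gloves"}
with_lhs = [b for b in baskets if lhs <= b]          # records containing LHS
confidence = sum(rhs <= b for b in with_lhs) / len(with_lhs)

# 3 of the 4 toaster transactions also contain gloves.
print(confidence)   # 0.75
```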
2) Site Failures: From the recovery point of view, failure has to be judged from the viewpoint of loss of memory. So failures can be classified as:
a) Failure with Loss of Volatile Storage: In these failures the content of main memory is lost; however, all the information which is recorded on disks is not affected by the failure. Typical failures of this kind are system crashes.
b) Media Failures (Failures with Loss of Nonvolatile Storage): In these failures the content of disk storage is lost. Failures of this type can be reduced by replicating the information on several disks having 'independent failure modes'. Stable storage is the most resilient storage medium available in the system. It is implemented by replicating the same information on several disks with (i) independent failure modes, and (ii) using the so-called careful replacement strategy: at every update operation, first one copy of the information is updated, then the correctness of the update is verified, and finally the second copy is updated.
3) Communication Failures: There are two basic types of possible communication errors: lost messages and partitions.
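The careful replacement strategy can be sketched with two in-memory "disks" standing in for replicated storage. This is illustrative only; a real implementation would verify by re-reading the physical device, where the check can actually fail:

```python
# Careful replacement for stable storage: two copies with independent
# failure modes are updated one at a time, verifying the first write
# before touching the second copy, so at least one copy is always intact.
disks = [{"page": b"old"}, {"page": b"old"}]  # two replicas of one page

def careful_replace(disks, key, value):
    disks[0][key] = value          # 1. update the first copy
    if disks[0][key] != value:     # 2. verify the update (read back)
        raise IOError("first copy corrupted; second copy still intact")
    disks[1][key] = value          # 3. only then update the second copy

careful_replace(disks, "page", b"new")
print(disks[0]["page"], disks[1]["page"])
```

If a crash occurs between steps 1 and 3, recovery can detect the mismatch between the copies and complete or undo the update using the intact replica.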